Why Software Feels Fragile (Even Though We Know Better)
When I look at how we build software today, something keeps nagging at me. We have continuous integration pipelines that run thousands of tests before a single line touches production. We have microservices supposedly designed to isolate failure. We have feature flags, canary deployments, and rollback mechanisms that would have been science fiction thirty years ago. And yet every project still carries this low-grade anxiety, the feeling that one missed dependency update, one misconfigured environment variable, or one person leaving the company could send everything cascading.
It makes no sense from a purely technical standpoint. The patterns exist. We know how to build systems that survive individual component failure. So why does the overall experience feel so fragile?
I think the answer lives in what software engineers tend to avoid talking about: organizational structure. Not org charts and reporting lines, but something deeper: the invisible architecture of who knows what, who decides what, and what happens when those connections break.
Every system has a failure mode. The question is whether that failure mode is contained within the technical boundaries we’ve designed or whether it bleeds through the social ones. A microservice can crash gracefully if someone understands its dependencies and knows how to restart it. But if only one person on the team understands the payment service’s connection to the legacy billing system, and that person goes on vacation, suddenly a routine deployment feels like defusing a bomb. The technical infrastructure is fine. The social infrastructure just collapsed.
I’ve seen this play out in different ways across every project I’ve worked on. Sometimes it shows up as what’s often called the “bus factor,” which is how many people need to get hit by a bus before the project dies. More often, though, it’s subtler. It’s the person who knows which configuration flag controls the database connection pool because they set it up during a 2 AM outage eighteen months ago. It’s the undocumented assumption baked into a cron job that nobody questions because “it’s always worked.” It’s the way certain code paths become tribal knowledge, passed down through informal conversations rather than formal documentation.
The funny thing is that most of these fragility points are completely predictable. They’re not mysterious or surprising if you look for them. Any team member who’s been on a project long enough to experience someone leaving will tell you about the undocumented system they inherited. Anyone who’s joined a team mid-project knows what it feels like to navigate code without understanding why certain decisions were made. These are not edge cases. They’re the default state of software organizations.
So what do we do about it? The conventional answer is “better documentation” or “more testing.” Both help, but neither gets at the root cause. Better documentation assumes people will read it and that it stays current. More testing assumes someone remembers to write the right tests. Neither addresses the underlying organizational pattern that creates fragility in the first place.
I think the answer is simpler than most people make it. You build redundancy into your knowledge distribution the same way you build redundancy into your infrastructure. Not by writing more docs, but by rotating who understands what. The person who knows the payment service should spend time with someone else walking through its architecture. The person who set up that 2 AM configuration should pair-program with someone on a related feature so the context transfers naturally. You treat knowledge distribution like an infrastructure problem, because it is one.
This doesn’t require any special process or ceremony. It’s just a shift in how you think about team structure. Instead of assigning people to domains and leaving them there, you rotate them through. Not constantly, not disruptively, but deliberately enough that no single person becomes the sole bridge between two parts of the system. When someone leaves, they still take knowledge with them, but that knowledge is now spread across multiple people instead of concentrated in one.
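If you want to make the infrastructure analogy concrete, here is a rough sketch of what auditing knowledge redundancy could look like. Everything in it is an assumption for illustration: the component names, the people, and the helper functions are hypothetical, and the "who understands what" map is something a team would have to fill in honestly by hand. It treats the bus factor as the size of the smallest group whose departure would leave some component with nobody who understands it, then suggests a pairing for each component that only one person knows.

```python
# Hypothetical map of which people can operate each component unaided.
# The names and components are made up for illustration.
KNOWLEDGE = {
    "payment-service": {"ana"},
    "legacy-billing": {"ana", "raj"},
    "db-pool-config": {"raj"},
    "nightly-cron": {"ana", "raj", "li"},
}


def single_points_of_knowledge(knowledge):
    """Return, in sorted order, the components exactly one person understands."""
    return sorted(c for c, people in knowledge.items() if len(people) == 1)


def bus_factor(knowledge):
    """Smallest number of departures that leaves some component orphaned.

    This equals the size of the smallest set of knowers for any component.
    An empty map yields 0.
    """
    return min((len(people) for people in knowledge.values()), default=0)


def suggest_pairings(knowledge):
    """For each single-owner component, suggest who should pair with the owner.

    Picks the teammate who currently understands the fewest components,
    breaking ties alphabetically, and counts each suggestion toward that
    teammate's load so later picks take it into account.
    Returns a list of (component, owner, learner) tuples.
    """
    everyone = set().union(*knowledge.values()) if knowledge else set()
    load = {p: sum(p in people for people in knowledge.values()) for p in everyone}

    pairings = []
    for component in single_points_of_knowledge(knowledge):
        (owner,) = knowledge[component]
        candidates = sorted(everyone - {owner}, key=lambda p: (load[p], p))
        if not candidates:
            continue  # a one-person team has nobody to rotate in
        learner = candidates[0]
        load[learner] += 1
        pairings.append((component, owner, learner))
    return pairings


if __name__ == "__main__":
    print("bus factor:", bus_factor(KNOWLEDGE))
    for component, owner, learner in suggest_pairings(KNOWLEDGE):
        print(f"{component}: pair {owner} with {learner}")
```

The script is not the point. The point is that once the map is written down, the single points of failure are as visible as an unreplicated database would be on an architecture diagram.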
The result is a team that feels more like a mesh network than a star topology. Individual nodes can fail without bringing down the whole thing. Communication paths are redundant. And the anxiety that comes from knowing your project depends on one person understanding something nobody else does slowly fades away.
It’s not glamorous. It doesn’t show up in sprint planning or performance reviews. But it’s the difference between a team that feels like it’s holding things together by willpower and a team that actually has structural resilience. And honestly, if we’ve spent thirty years learning how to build fault-tolerant software systems, maybe it’s time we started applying those same principles to the people building them.