Psychopathic Systems: Why Catastrophic Outages Are Never Accidents
Catastrophic system outages are never purely accidental because the conditions enabling them accumulate quietly over time, embedded within organizational structures, incentive misalignments, and suppressed warning signals. Like destructive human behaviour, systemic failure reveals latent dysfunction that existed long before the visible collapse, suggesting every major outage is ultimately a predictable consequence of chronic, overlooked vulnerabilities.
1. The Anatomy of Capability
There is something deeply uncomfortable about studying catastrophic human behaviour, not because of the acts themselves, but because of what they reveal about latent capability. I was recently watching a documentary on the anatomy of serial killers, not out of fascination with violence, but because I became interested in understanding how destructive capability forms inside apparently normal people. What creates it? What suppresses it? What environments allow it to emerge? How does it remain hidden for years while those around the individual continue believing they are fundamentally safe?
The comparison I am about to draw is not moral. Systems are not evil. The parallel is structural: hidden capability, suppressed warning signals, environmental reinforcement, and delayed manifestation. That is the only territory this article inhabits.
Serial killing is still, technically, a human capability. It is an abhorrent one, but it remains a capability nonetheless. The documentary repeatedly returned to the same themes of suppression, escalation, environmental reinforcement, emotional masking, and systemic failure around the individual. The more I watched, the more uncomfortable the parallel became, because distributed systems behave in remarkably similar ways structurally, even if the comparison obviously breaks down morally and emotionally.
Most organisations fundamentally misunderstand the difference between failures, outages, and catastrophic systemic events because they collapse all three into the same operational category called “incidents.” In doing so, they destroy their ability to reason correctly about systemic danger. They begin treating all disruption as merely different points along the same reliability curve, when in reality these events reveal completely different truths about the architecture underneath.
2. Failures, Outages, and Catastrophic Capability
Failures happen constantly inside healthy systems. Packets are dropped, nodes restart unexpectedly, threads deadlock, containers crash, DNS fails transiently, retries time out, and humans deploy broken code. Modern distributed systems are designed around the assumption that components are unreliable because reliability at scale does not emerge from preventing failure. Instead, it emerges from containing failure before it can propagate beyond a controlled boundary.
These three categories are not severity levels. They are different kinds of truth about what a system actually is:
- Failures are expected. They are evidence that components are unreliable, which is correct.
- Outages are containment failures. They are evidence that isolation boundaries did not hold.
- Catastrophic outages are survivability failures. They are evidence that the architectural assumption of proportionate harm was false.
Most organisations compress all three into “incident management,” which destroys signal quality in governance and board reporting. A P1 ticket and a multi-day national payment rail collapse are not the same category of event. Treating them as equivalent prevents the organisation from ever reasoning honestly about structural danger.
Healthy systems therefore expect faults continuously, which means failure itself is not the problem. An outage represents something far more serious because an outage means containment failed. A fault escaped the boundaries that were supposed to isolate it and became visible to users. Even then, most outages remain relatively bounded events because recovery mechanisms still function, operational visibility still exists, rollback paths remain available, and the organisation retains enough systemic understanding to stabilise the platform before trust collapses entirely.
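To make “containment” concrete, the boundary is usually something as mundane as a circuit breaker or a bulkhead wrapped around a single dependency. A minimal sketch in Python, assuming a hypothetical downstream call and purely illustrative thresholds, might look like this:

```python
import time

class CircuitBreaker:
    """Minimal containment boundary: after repeated failures, stop calling the
    dependency for a cooldown period instead of letting every request pile up
    behind a broken downstream service."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, dependency, fallback):
        # While the circuit is open, fail fast and serve the degraded fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                return fallback()
            self.opened_at = None       # cooldown elapsed: probe the dependency again
            self.failure_count = 0
        try:
            result = dependency()
            self.failure_count = 0      # success resets the failure budget
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
```

The specific pattern matters less than the property it buys. A call such as breaker.call(lambda: fetch_balance(account_id), lambda: cached_balance), where both names are invented here, converts a fault in one dependency into a bounded, local degradation rather than a queue of requests stacking up behind a broken service.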
Catastrophic outages belong to a completely different category of event because they are not merely “larger outages” or “more severe incidents.” A multi-hour or multi-day systemic collapse means the platform demonstrated a latent capability for disproportionate harm that already existed long before the outage itself occurred. Fault isolation failed, recovery failed, dependency understanding failed, rollback failed, organisational coordination failed, visibility failed, and architectural boundaries failed simultaneously. The platform did not merely experience bad luck. It revealed something fundamental about its structure.
When a single sensor driver update caused global endpoint collapse across airlines, hospitals, banks, and broadcasters simultaneously in 2024, the event was not a defect in the conventional sense. It was evidence that survivability assumptions across an enormous portion of critical infrastructure were false at the same moment. The architecture had been silently capable of that outcome long before the morning it appeared.
This is the part organisations struggle to confront honestly because catastrophic outages force a reinterpretation of the system itself. If a 15-year-old commits murder, nobody responds by saying, “well, statistically he was not a murderer for 99.99% of his life.” The moment the capability manifests, the entire understanding of the individual changes permanently because the event forces society to acknowledge that the capacity for catastrophic harm already existed. The act itself simply exposed it.
The same applies to systems. Once a platform demonstrates the capability to catastrophically fail, every enabling condition that made the outcome possible must immediately be treated as real, present, and dangerous. The outage was not a random anomaly floating through space and time. It was proof that the system had already evolved into something structurally unsafe.
Users understand this instinctively, even when engineering organisations do not. Users forgive failures because failures are expected. Users tolerate occasional outages because reality is messy and technology is imperfect. Catastrophic outages, however, permanently damage trust because users unconsciously recognise that the platform has crossed a psychological boundary. They no longer think about reliability percentages because they begin thinking about survivability.
A bank disappearing for days, a cloud provider collapsing globally, or a payment rail becoming unavailable at national scale changes the relationship users have with the platform forever because the event demonstrates structural collapse capability rather than statistical probability. Human beings do not evaluate existential trust through uptime metrics. A bridge that collapses once is no longer considered “mostly safe,” while a parachute that fails once is no longer considered “highly reliable.” The demonstrated capability for catastrophic harm permanently alters perception because it changes how the system itself is understood.
3. Psychopathic Systems
Distributed systems possess another deeply uncomfortable characteristic in common with psychopathy because they execute destructive capability without conscience, empathy, hesitation, or self-reflection. A dangerous platform can process millions of harmful operations with perfect consistency while simultaneously reporting healthy dashboards and passing operational checks. It can corrupt state while maintaining latency objectives, amplify cascading failures while every subsystem reports green status, and deadlock recovery paths while operational tooling insists the environment remains available.
The system has no understanding of harm because deterministic execution does not contain morality. Organisational intent means nothing to logic. A platform does not “mean well” in the same way that a collapsing bridge does not care about the people crossing it. It simply executes the capability embedded inside its architecture with flawless emotional detachment.
Organisations often speak about platforms as though intent matters. “The system tried to recover.” “The platform became unstable.” “The environment was behaving unexpectedly.” But distributed systems possess no conscience, no self-preservation instinct, and no empathy for the people depending on them. They execute architecture exactly as written, including every destructive path. When an organisation says a system “should have recovered automatically,” it is revealing that a human expectation was placed on a deterministic structure incapable of holding it.
This is precisely where organisational denial becomes more dangerous than the outage itself. Every catastrophic outage is followed by remarkably similar language. Leaders describe the event as a “perfect storm,” an “edge case,” an “unprecedented incident,” or “bad luck.” Organisations instinctively reach for probabilistic explanations because probabilistic language protects them psychologically from confronting systemic pathology. If the outage was random, then nobody has to confront the deeply uncomfortable possibility that the platform was structurally capable of collapse all along.
Yet catastrophic outages are almost never random. Every catastrophic system rehearsed its catastrophe many times in miniature before finally succeeding at full scale. Small recovery failures were tolerated. Near misses were rationalised. Alert fatigue became normalised. Operational debt accumulated silently. Dependency coupling increased gradually. Rollback procedures decayed without testing. Resilience became PowerPoint theatre instead of engineering reality. Teams lost visibility into how failures propagated between domains, while organisations continued celebrating uptime statistics that measured stability without measuring survivability.
Over time, the organisation slowly trained the platform to become dangerous while simultaneously convincing itself that resilience was improving. This is one of the defining pathologies of modern engineering culture because many organisations become extraordinarily sophisticated at measuring uptime while remaining surprisingly weak at measuring survivability.
4. Reliability Is Not Survivability
Reliability and survivability are not remotely equivalent concepts, despite the fact they are often treated interchangeably inside executive reporting and operational governance. A platform can achieve exceptional reliability metrics while still containing latent collapse capability. In fact, long periods of apparent stability frequently make the problem worse because organisations begin mistaking the absence of catastrophe for evidence of resilience. Many systems appear stable only because the precise chain of events required to expose their fragility has not yet occurred.
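One way to make that distinction measurable is to compute both numbers from the same incident history. The sketch below uses an invented incident log and an invented set of business domains: availability answers “how often were we up,” while the blast radius figure answers “how much of the platform could a single event take down at once,” which is the survivability question the uptime figure quietly hides.

```python
# Hypothetical incident records: (duration_minutes, affected_domains)
incidents = [
    (12,  {"payments"}),
    (45,  {"onboarding", "identity"}),
    (3,   {"fraud"}),
    (180, {"payments", "onboarding", "identity", "fraud", "support"}),
]

all_domains = {"payments", "onboarding", "identity", "fraud", "support"}
total_minutes = 365 * 24 * 60                      # one year observation window

downtime = sum(duration for duration, _ in incidents)
availability = 1 - downtime / total_minutes

# Survivability proxy: the largest fraction of the platform any single
# incident managed to take down simultaneously.
worst_blast_radius = max(len(domains) / len(all_domains) for _, domains in incidents)

print(f"availability:       {availability:.4%}")        # roughly 99.95%: looks excellent
print(f"worst blast radius: {worst_blast_radius:.0%}")  # 100%: one event reached everything
```

Both numbers come from the same year of data. Only one of them tends to make it into a board pack.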
This is why large scale outages often appear sudden externally while feeling strangely inevitable internally. Deep down, engineers usually know where the danger lives. They know which database cannot really fail, which retry mechanism can weaponise recovery traffic, which queue has become globally coupled, which operational procedure only works if specific people are awake simultaneously, and which service boundaries exist only on architecture diagrams rather than in reality.
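“Weaponised recovery traffic” usually means unbounded, synchronised retries multiplying load on a dependency that is already struggling. A hedged sketch of the usual counter-measures in Python: capped attempts, exponential backoff with jitter, and a crude shared retry budget. The specific numbers are illustrative, not a recommendation.

```python
import random
import time

RETRY_BUDGET = 100          # tokens shared across the process; refilled elsewhere
MAX_ATTEMPTS = 4
BASE_DELAY_SECONDS = 0.2

def call_with_backoff(operation):
    """Retry a failing operation without amplifying the outage.

    Bounded attempts keep one request from turning into many, jitter keeps a
    fleet of clients from retrying in lockstep, and the shared budget stops
    the process from converting a partial failure into a retry storm."""
    global RETRY_BUDGET
    for attempt in range(MAX_ATTEMPTS):
        try:
            return operation()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1 or RETRY_BUDGET <= 0:
                raise   # give up and surface the failure instead of hammering the dependency
            RETRY_BUDGET -= 1
            # Exponential backoff with full jitter: sleep 0..(base * 2^attempt) seconds.
            time.sleep(random.uniform(0, BASE_DELAY_SECONDS * (2 ** attempt)))
```

A retry loop with none of these properties is the mechanism engineers are quietly worried about when they say the recovery traffic itself could take the platform down.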
The catastrophe therefore already exists long before the outage occurs. The outage simply reveals it publicly.
Mature engineering organisations understand this distinction intuitively and therefore optimise for survivability rather than perfection. Survivability engineering assumes failures are inevitable and instead focuses on preventing those failures from escalating into civilisation-scale events. The objective is not eliminating faults entirely. The objective is preventing disproportionate harm when faults inevitably occur.
That philosophical shift changes architectural priorities completely because blast radius reduction becomes more important than average uptime, domain isolation becomes more important than centralisation efficiency, graceful degradation becomes more important than feature completeness, and asynchronous decoupling becomes more important than orchestration elegance. Rebuildability starts mattering more than failover theatre, while operational simplicity becomes significantly more valuable than architectural cleverness.
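Graceful degradation over feature completeness is easiest to see at a single endpoint. In the sketch below, a hypothetical account overview drops its spending insights panel when that dependency misbehaves, rather than failing the whole request; every name in it is invented for illustration.

```python
def render_account_overview(account_id, core_ledger, insights_service):
    """Serve a reduced page rather than no page when a non-critical
    dependency fails: the failure stays scoped to the optional feature."""
    balance = core_ledger.current_balance(account_id)      # critical path: let this fail loudly
    insights, degraded = [], False
    try:
        insights = insights_service.spending_summary(account_id)
    except Exception:
        degraded = True                                     # optional path: degrade, do not propagate
    return {"balance": balance, "insights": insights, "degraded": degraded}
```

The design choice is deliberate and explicit: the critical path is allowed to fail loudly, while the optional path is never allowed to take the page down with it.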
The most important systems are therefore not the systems that never fail. They are the systems incapable of amplifying failure beyond controlled boundaries.
5. Redeemable Systems and the Role of CoEs
Fortunately, systems possess one characteristic humans do not because systems remain fundamentally redeemable through deterministic redesign. Human beings can conceal intent indefinitely, but systems cannot. Even when complexity obscures behaviour, distributed systems remain logical structures that can be decomposed, isolated, constrained, redesigned, and structurally prevented from causing disproportionate harm.
This is where real engineering maturity begins because mature organisations stop asking whether systems can fail and start asking whether systems can survive failure safely. They treat existential failure capability itself as unacceptable regardless of how statistically rare the event appears because capability matters more than frequency.
This is also why genuine Centres of Excellence matter so profoundly, although most organisations misunderstand what a CoE should actually be. A real CoE is not a standards committee, an architecture review board, or a governance theatre layer producing PowerPoint decks no one reads. A real Centre of Excellence exists to systematically remove organisational capability for uncontrolled amplification by functioning as an engineering immune system embedded inside the organisation itself.
Its purpose is to continuously ask dangerous questions. Can this service take down the organisation? Can this dependency amplify failure globally? Can this retry mechanism weaponise recovery traffic? Can this operational process deadlock rollback? Can this identity boundary collapse between domains? Can this queue saturate the entire platform? Can this deployment model bypass containment controls during an emergency? Can operators still recover the environment when multiple assumptions fail simultaneously?
In modern banking, a single globally coupled dependency can freeze payments, onboarding, fraud controls, support channels, and identity verification simultaneously. That is not a theoretical risk. That is the demonstrated capability profile of architectures that were built for reliability without being engineered for survivability. The CoE exists precisely to find those couplings before an outage makes them visible to twelve million clients at once.
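Many of those dangerous questions can be reduced to automated checks. A minimal sketch, assuming the organisation keeps its service dependency graph in machine-readable form, with the graph, domain names, and threshold all invented here: flag any dependency whose failure would transitively reach more than a chosen fraction of business domains, before production demonstrates the coupling for you.

```python
# Hypothetical dependency graph: business domain -> services it depends on.
DEPENDS_ON = {
    "payments":   {"core-ledger", "identity", "message-bus"},
    "onboarding": {"identity", "message-bus", "document-store"},
    "fraud":      {"message-bus", "core-ledger"},
    "support":    {"identity", "message-bus"},
    "identity":   {"directory"},
}

BLAST_RADIUS_THRESHOLD = 0.5    # flag anything that can reach half the domains

def transitive_impact(failed, depends_on):
    """Return every node that directly or transitively depends on `failed`."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for node, deps in depends_on.items():
            if node not in impacted and (failed in deps or deps & impacted):
                impacted.add(node)
                changed = True
    return impacted

for dependency in sorted(set().union(*DEPENDS_ON.values())):
    impacted = transitive_impact(dependency, DEPENDS_ON)
    radius = len(impacted) / len(DEPENDS_ON)
    if radius >= BLAST_RADIUS_THRESHOLD:
        print(f"{dependency}: failure reaches {radius:.0%} of domains -> {sorted(impacted)}")
```

Run against this invented graph, the check flags the message bus, the identity service, and the directory behind it, which is exactly the kind of quiet global coupling that never looks dangerous on any single architecture diagram.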
Most importantly, a real CoE refuses to normalise demonstrated structural danger simply because the event appears statistically rare. A nuclear reactor that melts down once every hundred years is still considered fundamentally dangerous because the scale of harm exceeds acceptable containment assumptions. The same logic increasingly applies to digital systems embedded into banking, healthcare, communications, logistics, identity, and social coordination itself.
Healthy systems are therefore not systems that never fail. Healthy systems are systems incapable of causing disproportionate harm when they do fail. That distinction may become one of the defining engineering philosophies of the next decade as society becomes progressively more dependent on infrastructure operated at enormous scale by increasingly small groups of people.
Catastrophic outages are never born in the moment they appear. They are rehearsed quietly for years inside architectures, processes, incentives, and organisational denial until one day the platform finally succeeds at destroying itself.