Figure 1: Traditional DR Exercise vs Real World Outage
Disaster recovery is one of the most comforting practices in enterprise technology and one of the least honest. Organisations spend significant time and money designing DR strategies, running carefully choreographed exercises, producing polished post exercise reports, and reassuring themselves that they are prepared for major outages. The problem is not intent. The problem is that most DR exercises are optimised to demonstrate control and preparedness in artificial conditions, while real failures are chaotic, asymmetric and hostile to planning. When outages occur under real load, the assumptions underpinning these exercises fail almost immediately.
What most organisations call disaster recovery is closer to rehearsal than resilience. It tests whether people can follow a script, whether environments can be brought online when nothing else is going wrong, and whether senior stakeholders can be reassured. It does not test whether systems can survive reality.
1. DR Exercises Validate Planning Discipline, Not Failure Behaviour
Traditional DR exercises are run like projects. They are planned well in advance, aligned to change freezes, coordinated across teams, and executed when everyone knows exactly what is supposed to happen. This alone invalidates most of the conclusions drawn from them. Real outages are not announced, they do not arrive at convenient times, and they rarely fail cleanly. They emerge as partial failures, ambiguous symptoms and cascading side effects. Alerts contradict each other, dashboards lag reality, and engineers are forced to reason under pressure with incomplete information.
A recovery strategy that depends on precise sequencing, complete information and the availability of specific individuals is fragile by definition. The more a DR exercise depends on human coordination to succeed, the less likely it is to work when humans are stressed, unavailable or wrong. Resilience is not something that can be planned into existence through documentation. It is an emergent property of systems that behave safely when things go wrong without requiring perfect execution.
2. Recovery Is Almost Always Tested in the Absence of Load
Figure 2: Recovery Under Load With and Without Chaos Testing
The single most damaging flaw in DR testing is that it is almost always performed when systems are idle. Queues are empty, clients are disconnected, traffic is suppressed, and downstream systems are healthy. This creates a deeply misleading picture of recoverability. In real outages, load does not disappear. It concentrates. Clients retry, SDKs back off and then retry again, load balancers redistribute traffic aggressively, queues accumulate messages faster than they can be drained, and databases slow down at precisely the moment demand spikes.
Back pressure is the defining characteristic of real recovery scenarios, and it is almost entirely absent from DR exercises. A system that starts cleanly with no load may never become healthy when forced to recover while saturated. Recovery logic that looks correct in isolation frequently collapses when subjected to retry storms and backlog replays. Testing recovery without load is equivalent to testing a fire escape in an empty building and declaring it safe.
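To make the retry storm point concrete, here is a minimal sketch of the kind of client behaviour that has to exist before recovery under load is even survivable: retries are bounded, backoff is capped, and jitter de-synchronises the herd. The class and limits are purely illustrative, not taken from any specific platform.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Illustrative sketch: bounded retries with capped exponential backoff and full jitter,
// so thousands of clients do not hammer a recovering service in lockstep.
public final class BoundedRetry {

    // Assumes maxAttempts >= 1 and positive delays; a real client would validate these.
    public static <T> T call(Supplier<T> operation, int maxAttempts, Duration baseDelay, Duration maxDelay)
            throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) break;
                // Exponential backoff capped at maxDelay.
                long capped = Math.min(maxDelay.toMillis(), baseDelay.toMillis() * (1L << (attempt - 1)));
                // Full jitter: sleep a random duration in [0, capped] to spread retries out.
                Thread.sleep(ThreadLocalRandom.current().nextLong(capped + 1));
            }
        }
        throw last; // Give up and surface the failure instead of retrying forever.
    }
}
```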
3. Recovery Commonly Triggers the Second Outage
DR plans tend to assume orderly reconnection. Services are expected to come back online, accept traffic gradually, and stabilise. Reality delivers the opposite. When systems reappear, clients reconnect simultaneously, message brokers attempt to drain entire backlogs at once, caches stampede databases, authentication systems spike, and internal rate limits are exceeded by internal callers rather than external users.
This thundering herd effect means that recovery itself often becomes the second outage, frequently worse than the first. Systems may technically be up while remaining unusable because they are overwhelmed the moment they re enter service. DR exercises rarely expose this behaviour because load is deliberately suppressed, leading organisations to confuse clean startup with safe recovery.
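One way to stop recovery from becoming the second outage is to make the recovering service defend itself. The sketch below is entirely illustrative and not tied to any particular framework: a warm-up gate that sheds a shrinking fraction of traffic after a restart, forcing reconnection to ramp rather than stampede.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative warm-up gate: after a restart, only a ramping fraction of requests
// is admitted, so the service is not flattened by every client reconnecting at once.
public final class WarmUpGate {

    private final long startMillis = System.currentTimeMillis();
    private final long rampMillis;                 // time taken to ramp from partial to full admission
    private final AtomicLong counter = new AtomicLong();

    public WarmUpGate(long rampMillis) {
        this.rampMillis = rampMillis;
    }

    /** Returns true if this request should be admitted, false if it should be shed or deferred. */
    public boolean admit() {
        long elapsed = System.currentTimeMillis() - startMillis;
        if (elapsed >= rampMillis) {
            return true; // fully warmed up, admit everything
        }
        // Admit roughly (elapsed / rampMillis) of traffic, e.g. about 25% a quarter of the way in.
        long admittedOutOf100 = Math.max(1, (elapsed * 100) / rampMillis);
        return counter.incrementAndGet() % 100 < admittedOutOf100;
    }
}
```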
4. Why Real World DR Testing Is So Hard
The uncomfortable truth is that most organisations avoid real world DR testing not because they are lazy or incompetent, but because the technology they run makes realistic testing commercially irrational.
In traditional enterprise estates a genuine failover is not a minor operational event. A large SQL Server estate or a mainframe environment routinely takes well over an hour to fail over cleanly, and that is assuming everything behaves exactly as designed. During that window queues back up, batch windows are missed, downstream systems time out, and customers feel the impact immediately. Pulling the pin on a system like this during peak volumes is not a test, it is a deliberate business outage. No executive will approve that, and nor should they.
This creates an inevitable compromise. DR tests are scheduled during low load periods, often weekends or nights, precisely when the system behaves best. The back pressure that exists during real trading hours is absent. Cache warm up effects are invisible. Connection storms never happen. Latent data consistency problems remain hidden. The test passes, confidence is reported upward, and nothing meaningful has actually been proven.
The core issue is not testing discipline; it is recovery time characteristics. If your recovery time objective is measured in hours, then every real test carries a material business risk. As a result, organisations rationally choose theatre over truth.
Change the technology and the equation changes completely. Platforms like Aurora Serverless fundamentally alter the cost of failure. A failover becomes an operational blip measured in seconds rather than an existential event measured in hours. Endpoints are reattached, capacity is rehydrated automatically, and traffic resumes quickly enough that controlled testing becomes possible even with real workloads. Once confidence is built at lower volumes, the same mechanism can be exercised progressively closer to peak without taking the business hostage.
This is the key distinction most DR conversations miss. You cannot meaningfully test DR if the act of testing is itself catastrophic. Modern architectures that fail fast and recover fast are not just operationally elegant, they are the only ones that make honest DR validation feasible. Everything else optimizes for paperwork, not resilience.
5. Availability Is Tested While Correctness Is Ignored
Most DR exercises optimise for availability signals rather than correctness. They focus on whether systems start, endpoints respond and dashboards turn green, while ignoring whether the system is still right. Modern architectures are asynchronous, distributed and event driven. Outages cut through workflows mid execution. Transactions may be partially applied, events may be published but never consumed, compensating actions may not run, and side effects may occur without corresponding state changes.
DR testing almost never validates whether business invariants still hold after recovery. It rarely checks for duplicated actions, missing compensations or widened consistency windows. Availability without correctness is not resilience. It is simply data corruption delivered faster.
6. Idempotency Is Assumed Rather Than Proven
Many systems claim idempotency at an architectural level, but real implementations are usually only partially idempotent. Idempotency keys are often scoped incorrectly, deduplication windows expire too quickly, global uniqueness is not enforced, and side effects are not adequately guarded. External integrations frequently replay blindly, amplifying the problem.
Outages expose these weaknesses because retries occur across multiple layers simultaneously. Messages are delivered more than once, requests are replayed long after original context has been lost, and systems are forced to process duplicates at scale. DR exercises rarely test this behaviour under load. They validate that systems start, not that they behave safely when flooded with replays. Idempotency that only works in steady state is not idempotency. It is an assumption waiting to fail.
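As a concrete illustration of what proven rather than assumed idempotency looks like, here is a minimal in-memory sketch. The names are hypothetical and a production version would use a durable shared store, but the two failure points named above, key scope and deduplication window, appear as explicit parameters rather than hidden assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative idempotency guard: keys are scoped to the business operation (not the
// transport message id) and the deduplication window must outlive the longest plausible replay.
public final class IdempotencyGuard {

    private final Map<String, Instant> seen = new ConcurrentHashMap<>();
    private final Duration window;

    public IdempotencyGuard(Duration window) {
        this.window = window;
    }

    /** Returns true only the first time a business key is seen inside the window. */
    public boolean firstTime(String businessKey) {
        Instant now = Instant.now();
        // Lazily evict expired entries so the map does not grow without bound.
        seen.entrySet().removeIf(e -> e.getValue().plus(window).isBefore(now));
        return seen.putIfAbsent(businessKey, now) == null;
    }
}
```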
7. DNS and Replication Lag Are Treated as Minor Details
DNS based failover is a common component of DR strategies because it looks clean and simple on diagrams. In practice it is unreliable and unpredictable. TTLs are not respected uniformly, client side caches persist far longer than expected, mobile networks are extremely sticky, corporate resolvers behave inconsistently, and CDN propagation is neither instantaneous nor symmetrical.
During real incidents, traffic often arrives from both old and new locations for extended periods. Systems must tolerate split traffic and asymmetric routing rather than assuming clean cutover. DR exercises that expect DNS to behave deterministically are rehearsing a scenario that almost never occurs in production.
8. Hidden Coupling Between Domains Undermines Recovery
Most large scale recovery failures are not caused by the system being recovered, but by something it depends on. Shared authentication services, centralised configuration systems, common message brokers, logging pipelines and global rate limits quietly undermine isolation. During DR exercises these couplings remain invisible because everything is brought up together in a controlled order. In real outages, dependencies fail independently, partially and out of sequence.
True resilience requires domain isolation with explicitly bounded blast radius. If recovery of one system depends on the health of multiple others, none of which are isolated, then recovery is fragile regardless of how well rehearsed it is.
9. Human Factors Are Removed From the Equation
DR exercises assume ideal human conditions. The right people are available, everyone knows it is a test, stress levels are low, and communication is structured and calm. Real incidents are defined by the opposite conditions. People are tired, unavailable or already overloaded, context is missing, and decisions are made under extreme cognitive load.
Systems that require heroics to recover are not resilient. They are brittle. Good systems assume humans will be late, distracted and wrong, and still recover safely.
10. DR Is Designed for Audit Cycles, Not Continuous Failure
Most DR programs exist to satisfy auditors, regulators and risk committees rather than to survive reality. This leads to annual exercises, static runbooks, binary success metrics and a complete absence of continuous feedback. Meanwhile production systems change daily.
A DR plan that is not continuously exercised against live systems is obsolete by default. The confidence it provides is inversely proportional to its accuracy.
11. Chaos Testing Is the Only Honest Substitute
Real resilience is built by failing systems while they are doing real work. That means killing instances under load, partitioning networks unpredictably, breaking dependencies intentionally, injecting latency and observing the blast radius honestly. Chaos testing exposes retry amplification, back pressure collapse, hidden coupling and unsafe assumptions that scripted DR exercises systematically hide.
It is uncomfortable and politically difficult, but it is the only approach that resembles reality.
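For teams starting small, fault injection does not have to begin with a platform. The sketch below is deliberately naive and not a recommendation of any specific tool: it wraps a dependency call so that latency and failures appear while real work is flowing, which is exactly the condition scripted DR exercises avoid.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Illustrative fault-injection wrapper in the spirit of chaos testing. With a small
// probability it adds latency or throws, exercising retry amplification and back pressure.
// A real programme would add blast radius controls, kill switches and proper tooling.
public final class ChaoticCall {

    public static <T> T invoke(Supplier<T> target, double faultProbability, long maxExtraLatencyMillis)
            throws InterruptedException {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        if (rnd.nextDouble() < faultProbability) {
            if (rnd.nextBoolean()) {
                // Inject latency somewhere between 0 and the configured maximum.
                Thread.sleep(rnd.nextLong(maxExtraLatencyMillis + 1));
            } else {
                // Inject a hard failure so callers must prove their fallback paths.
                throw new RuntimeException("chaos: injected dependency failure");
            }
        }
        return target.get();
    }
}
```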
12. What Systems Should Actually Be Proven To Do
A meaningful resilience strategy does not ask whether systems can be recovered quietly. It proves, continuously, that systems can recover under sustained load, tolerate duplication safely, remain isolated from unrelated domains, degrade gracefully, preserve business invariants and recover with minimal human coordination even when failure timing and scope are unpredictable.
Anything less is optimism masquerading as engineering.
13. DR Exercises Provide Reassurance, Not Resilience
Traditional DR exercises make organisations feel prepared without exposing uncomfortable truths. They work only when the system is quiet, the people are calm and the plan is followed perfectly. Reality offers none of these conditions.
If your recovery strategy only works in ideal circumstances, it is not a strategy. It is theatre.
Organisations like to believe they reward outcomes. In reality, they reward visibility. This is the essence of the Last Mile Fallacy: the mistaken belief that the final visible step in a chain of work is where most of the value was created. We tip the waiter rather than the chef, praise the presenter rather than the people who built the system, and congratulate the messenger rather than the makers. The last mile feels real because it has a face, a voice, and a narrative that fits neatly into a meeting. The earlier miles are quieter, messier, and far harder to explain, so they are systematically undervalued.
2. Why the Last Mile Feels Like the Work
In a restaurant, tipping the waiter makes emotional sense. The waiter is the person we interact with, the person who absorbs our frustration and responds to our gratitude. But this social convention quietly becomes a value model inside organisations. The kitchen created the asset through skill, preparation, and repeated invisible decisions, yet the reward follows the interaction rather than the creation. Over time, companies internalise this logic and begin to believe that delivery, reporting, and presence are the work, rather than the final expression of it.
3. How This Manifests in Technology Organisations
In technology teams, the “waiter” is often the person reporting progress. They attend the meetings, translate uncertainty into status updates, and turn complex engineering realities into something consumable. When detailed questions arise, the response is often that they are not close to the detail and will check with the team, yet the recognition and perceived ownership remain firmly with them. This is rarely malicious. It is structural. The system rewards those who can package work, not those who quietly do it.
4. Where Engineering Value Is Actually Created
Most engineering value is created long before anything becomes visible. It appears in design tradeoffs that avoid future failure, in performance headroom that only matters on the worst day, and in risks retired early enough that no one ever knows they existed. Great engineering is often deliberately invisible. When systems work, there is nothing to point at. This creates a fundamental mismatch: great engineering does not naturally market itself, and engineers are rarely motivated to try.
5. Great Engineering Is Not Great Marketing
Engineers are intrinsically poor at being seen and heard, not because they lack confidence or capability, but because their incentives are different. Precision matters more than persuasion. Acknowledging uncertainty is a strength, not a weakness. Over claiming feels dishonest, and time spent talking is time not spent fixing the problem. What looks like poor communication from the outside is often professional integrity on the inside. Expecting engineers to compensate for this by becoming self marketers misunderstands both the role and the culture.
6. Psychological Safety and the Broker Layer
Because it is often unsafe to speak openly, information rarely flows directly from engineers to senior leadership. Saying that a deadline is fictional, that a design is fragile, or that a system is one person away from failure carries real career risk. As a result, organisations evolve a broker layer that filters, softens, and sanitises reality before it travels upward. Leadership frequently mistakes this calm, polished narrative for control, when in fact it is distance. The truth still exists, but it has been made comfortable.
7. A Legitimate Counter Argument: When Work Really Is a Commodity
There are domains where the Last Mile Fallacy is less of a fallacy and more of an economic reality. Construction is a useful example. Laying bricks is physically demanding, highly skilled work, yet bricklayers are often interchangeable. The intellectual property of the building exists before the first brick is laid. The architect, engineer, and planner have already made the critical decisions. The bricklayer is executing a predefined plan, acting on instructions rather than solving open ended problems. In this context, replacing the individual doing the work does not meaningfully alter the outcome, because the value was created upstream and encoded into the design.
8. Why Software Engineering Is Fundamentally Different
Software engineering does not work this way. The value does not sit fully formed before execution begins. Engineers are not simply following instructions; they are continuously solving problems, making trade offs, and adapting to constraints in real time. You are not paying for keystrokes or lines of code. You are paying for judgment, efficiency, and the ability to reach a desired outcome under uncertainty. Two engineers given the same requirements will produce radically different systems in terms of performance, resilience, cost, and long term maintainability. The intellectual property is created during the work, not before it.
9. Presence Without Proximity
This distinction makes the organisational behaviour even more ironic. Many organisations demand a return to the office in the name of proximity, collaboration, and culture, yet senior leaders rarely leave their own rooms, floors, or meeting cycles to engage directly with the teams doing this problem solving work. Call centres remain unvisited, engineering floors remain symbolic, and frontline reality is discussed rather than observed. The distance remains intact, even as the optics change.
10. Key Man Risk Is Consistently Misplaced
Ask where key man risk sits and many organisations will point to programme leads or senior managers. In reality, the risk sits with the engineer who understands the legacy edge case, the operator who knows which alert lies, or the developer who fixed the outage at two in the morning. These people are not interchangeable because the intellectual property lives in their decisions and experience. When they leave, knowledge evaporates. When brokers leave, slides are rewritten.
11. Leadership’s Actual Responsibility
In an engineering led organisation, leadership is not there to extract updates or admire dashboards. Leadership exists to go to the work, to create enough psychological safety that people can speak honestly, and to translate engineering reality into organisational decisions. Engineers should not be forced to become marketers in order to be heard. Leaders must do the harder work of listening, observing, and engaging deeply enough that truth can surface without fear.
12. Escaping the Last Mile Fallacy
Organisations that escape the Last Mile Fallacy do not do so with better reporting or more metrics. They do it by collapsing the distance between value creation and decision making. Leaders spend time with teams without an agenda, reward early bad news rather than late surprises, and recognise impact rather than airtime. They understand that visibility is not the same as value, and that presence is not the same as leadership.
13. Conclusion
The waiter matters, bricklayers matter, and in some industries the kitchen can be commoditised. But software engineering is not bricklaying. When organisations treat it as such, they misunderstand where value is created and where risk truly lives. The Last Mile Fallacy is not a communication problem; it is a leadership choice. And like most choices, it can be changed.
If you genuinely believe that software engineers are merely following orders from your middle management, then you should outsource your technology to vendors and grind the costs downwards. Good luck with this strategy 🙂
There’s a peculiar asymmetry in how humans handle their own incompetence. It reveals itself most starkly when you compare two scenarios: a cancer patient undergoing chemotherapy, and a project manager pushing delivery dates on a complex technology initiative.
Both involve life altering stakes. Both require deep expertise the decision maker doesn’t possess. Yet in one case, we defer completely. In the other, we somehow feel qualified to drive.
The Chemotherapy Paradox
Firstly, let's be clear: incompetence is contextual. Very few people could honestly declare themselves “universally incompetent”. What does this mean? Well, just because you have low or no technology competence, it doesn't mean you are without merit or purpose. The trick is to tie your competencies to the work you are involved in.
When someone receives a cancer diagnosis, something remarkable happens to their ego. They sit across from an oncologist who explains a treatment protocol involving cytotoxic agents that will poison their body in carefully calibrated doses. Their hair will fall out. They’ll experience chronic nausea. Their immune system will crater. The treatment itself might kill them. And they say: “Okay. When do we start?”
This isn’t weakness. It’s wisdom born of acknowledged ignorance. The patient knows they don’t understand the pharmacokinetics of cisplatin or the mechanisms of programmed cell death. They can’t evaluate whether the proposed regimen optimises for tumour response versus quality of life. They lack the fourteen years of training required to even parse the relevant literature.
So they yield. Completely. They ask questions to understand, not to challenge. They follow instructions precisely. They don’t suggest alternative dosing schedules based on something they read online.
This is how humans behave when they genuinely know they don’t know.
The Technology Incompetence Paradox
Now consider the enterprise technology project. A complex migration, perhaps, or a new trading platform. The stakes are high: reputational damage, regulatory exposure, hundreds of millions in potential losses.
The project manager or business sponsor sits across from a principal engineer who explains the technical approach. The engineer describes the challenges: distributed consensus problems, data consistency guarantees, failure mode analysis, performance characteristics under load.
The manager’s eyes glaze slightly. If pressed, they’ll readily admit: “I’m not technical.”
And then, in the very next breath, they’ll ask: “But surely it can’t be that hard? Can’t we just…?”
This is the incompetence paradox in its purest form. The same person who just acknowledged they don’t understand the domain immediately proceeds to:
Push for aggressive delivery dates
Propose “simple” solutions
Question engineering estimates
Mandate shortcuts they can’t evaluate
Drive decisions they’re fundamentally unqualified to make
Ship dates to senior business heads without any engineering validation
In the chemotherapy scenario, acknowledged incompetence produces deference. In the technology scenario, it somehow produces confidence.
Why the Difference?
Several factors drive this asymmetry, and none of them are flattering.
Visibility of consequences. The cancer patient sees the stakes viscerally. The tumour is in their body. The chemotherapy will make them physically ill. The consequences of getting it wrong are personal and immediate. Technology failures, by contrast, are abstract until they’re not. The distributed system that can’t maintain consistency under partition? That’s someone else’s problem until it becomes a P1 incident at 3am.
Illegibility of expertise. Medicine has successfully constructed barriers to amateur interference. White coats. Incomprehensible terminology. Decades of credentialing. Technology, despite being equally complex, has failed to establish similar deference boundaries. Everyone has an iPhone. Everyone has opinions about software.
The Dunning Kruger acceleration. A little knowledge is dangerous, and technology provides just enough surface familiarity to be catastrophically misleading. The manager has used Jira. They’ve seen a Gantt chart. They once wrote an Excel macro. This creates an illusion of adjacent competence that simply doesn’t exist when facing a PET scan.
Accountability diffusion. When chemotherapy fails, the consequences land on a single body. When a technology project fails, it becomes a distributed systems problem of its own: blame fragments across teams, timelines, “changing requirements,” and “unforeseen complexity.” The manager who pushed impossible dates never personally experiences the 4am production incident.
The Absence of Technical Empathy
What’s really missing in failing technology organisations is technical empathy, the capacity to understand, at a meaningful level, what trade offs are being made and why they matter.
When a doctor says “this treatment has a 30% chance of significant side effects,” the patient grasps that this is a trade off. They may not understand the mechanism, but they understand the structure of the decision: accepting known harm for potential benefit.
When an engineer says “if we skip the integration testing phase, we increase the probability of data corruption in production,” the non technical manager hears noise. They don’t have the context to evaluate severity. They don’t understand what “data corruption” actually means for the business. They certainly can’t weigh it against the abstract pressure of “the date.”
So they default to the only metric they can measure: the schedule.
The Project Management Dysfunction
Consider the role of the typical project manager in a failing technology initiative. Their tools are timelines, status reports and burn down charts. Their currency is dates.
When has a project manager ever walked into a steering committee and said: “We need to slow down. There’s too much risk accumulating in this product. The pace is creating technical debt that will compound into failure.”?
They don’t. They can’t. They lack the technical depth to identify the risk, and their incentive structure punishes such honesty even if they could.
Instead, when the date slips, they “rebaseline.” They “replan.” They produce a new Gantt chart that looks exactly like the old one, shifted right by six weeks.
This is treated as project management. It’s actually just administrative recording of failure in progress.
The phrase “we missed the date and are rebaselining” is presented as neutral status reporting. But it obscures a critical question: why did we miss the date? Was it:
Scope creep from stakeholders who don’t understand impact?
Technical debt from previous shortcuts coming due?
Unrealistic estimates imposed by people unqualified to evaluate them?
Architectural decisions that traded speed for fragility?
The rebaseline answers none of these questions. It simply moves the failure point further into the future, where it will be larger and more expensive.
The Trade Off Vacuum
Here’s a question that exposes the dysfunction: when did a generic manager last table a meaningful technical trade off?
Not “can we do X faster?” That’s not a trade off. That’s just pressure wearing a question mark.
A real trade off sounds like: “If we reduce the scope of automated testing from 80% coverage to 60% coverage, we save three weeks but increase production defect probability by roughly 40%. Given our risk tolerance and the cost of production incidents, is that a trade we want to make?”
This requires understanding what automated testing actually does. What coverage means. How defect probability correlates with test coverage. What production incidents cost.
Generic managers don’t table these trade offs because they can’t. They lack the technical vocabulary, the domain knowledge, and often the intellectual honesty to engage at this level. Instead, they ask: “Why does testing take so long? Can’t we just test the important bits?”
And engineers, exhausted by years of this, learn to either capitulate, obfuscate or buffer their estimates so grossly that the organisation ends up outsourcing the work to another company that is more than happy to create favourable lies about timelines. None of this serves the organisation.
Solving Organisational Problems with Technology (And Making Everything Worse)
There’s a particularly insidious failure mode that emerges from this partial knowledge problem: the instinct to solve organisational dysfunction with technology.
The logic seems sound on the surface. The current system is slow. Teams are frustrated. Data is inconsistent. Processes are manual. The obvious answer? A rewrite. A new platform. A transformation programme.
What follows is depressingly predictable.
The rewrite begins with enthusiasm. A new technology stack is selected, often chosen for its novelty rather than its fit. Kubernetes, because containers are modern. A graph database, because someone read an article. Event sourcing, because it sounds sophisticated. Microservices, because monoliths are unfashionable.
Each decision is wrapped in enough management noise to sound credible. Slide decks proliferate. Vendor presentations are scheduled. Architecture review boards convene. The language is confident: “cloud native,” “future proof,” “scalable by design.”
But anyone with genuine technical depth would immediately challenge the rationality of these decisions. Why do we need a graph database for what is fundamentally a relational problem? What operational capability do we have to run a Kubernetes cluster? Who will maintain this event sourcing infrastructure in three years when the contractors have left?
These questions don’t get asked, because the people making the decisions lack the technical vocabulary to even understand them. And the engineers who could ask them have learned that such questions are career limiting.
So the rewrite proceeds. And the organisation gets worse.
I’ve seen this pattern repeatedly. A legacy system, ugly, creaking, but fundamentally functional, is replaced by a modern platform that is architecturally elegant and operationally catastrophic. The new system requires specialists that don’t exist in the organisation. It has failure modes that nobody anticipated. It solves problems that weren’t actually problems while failing to address the issues that drove the rewrite in the first place.
The teams initially call out a “bedding in period.” The new platform just needs time. People need to adjust. There are teething problems. This is normal.
Months pass. The bedding in period extends. Workarounds accumulate. Shadow spreadsheets emerge. Users quietly route around the new system wherever possible.
Eventually, the inevitable emperor’s robes moment arrives. External specialists are called in, expensive consultants with genuine technical depth, and they deliver the verdict everyone already knew: the new platform is not fit for purpose. The technology choices were inappropriate. The architecture doesn’t match the organisation’s capabilities. The complexity is unjustified.
But by now, tens of millions have been spent. Careers have been built on the transformation. Admitting failure is organisationally impossible. So the platform staggers on, a monument to what happens when partial knowledge drives technology decisions.
The tragedy is that the original problems were often organisational, not technological. The legacy system was slow because processes were broken. Data was inconsistent because ownership was unclear. Teams were frustrated because communication was poor.
No amount of Kubernetes will fix a lack of clear data ownership. No event sourcing architecture will resolve dysfunctional team dynamics. No graph database will compensate for the absence of defined business processes.
But technology feels like action. It appears on roadmaps. It has milestones and deliverables. It can be purchased, installed, and demonstrated. Organisational change is messy, slow, and hard to measure. So we default to technology, and we make everything worse.
The vendors, of course, are delighted to help. They arrive with glossy presentations and reference architectures. They speak with confidence about “digital transformation” and “platform modernisation.” They don't mention that their incentives are misaligned with yours: they profit from complexity, from licensing, from the ongoing support contracts that complex systems require.
Each unnecessary vendor, each cool but inappropriate technology, each unjustified architectural decision adds another layer of complexity. And complexity is not neutral. It requires expertise to manage. It creates failure modes. It slows everything down. It is, in essence, a tax on every future change.
The partially knowledgeable manager sees a vendor presentation and thinks “this could solve our problems.” The technically competent engineer sees the same presentation and thinks “this would create twelve new problems while solving none of the ones we actually have.”
But the engineer’s voice doesn’t carry. They’re “just technical.” They don’t understand “the business context.” They’re “resistant to change.”
And so the organisation lurches from one technology driven transformation to the next, never addressing the underlying dysfunction, always adding complexity, always wondering why things keep getting worse.
The “Tried and Tested” Fallacy
Here’s where it gets even more frustrating. The non technical leader doesn’t always swing toward shiny new technology. Sometimes they swing to the opposite extreme: “Let’s just use something tried and tested.”
This sounds like wisdom. It sounds like hard won experience tempering youthful enthusiasm. It sounds like the voice of reason.
It’s not. It’s the same dysfunction wearing different clothes.
“Tried and tested” is a lobotomised decision bootstrapped with a meaningless phrase. What does it actually mean? Tried by whom? Tested in what context? Tested against what requirements? Proven suitable for what scale, what failure modes, what operational constraints?
The phrase “tried and tested” is a conversation stopper disguised as a decision. It signals: “We have no appetite for a discussion about technology choices in this technology project.”
Let that sink in. A technology project where the leadership has explicitly opted out of meaningful dialogue about technology choices. This is not conservatism. This is abdication.
The “cool new technology” failure and the “tried and tested” failure are mirror images of the same underlying problem: decisions made without genuine engagement with the technical trade offs.
When someone says “let’s use Kubernetes because it’s modern,” they’re not engaging with whether container orchestration solves any problem you actually have.
When someone says “let’s stick with Oracle because it’s tried and tested,” they’re not engaging with whether a proprietary database at £50,000 per CPU core is justified by your actual consistency and scaling requirements.
Both statements translate to the same thing: “I cannot evaluate this decision on its merits, so I’m using a heuristic that sounds defensible.”
The difference is that “cool technology” gets blamed when projects fail. “Tried and tested” rarely does. If you fail with a boring technology stack, it’s attributed to execution. If you fail with a modern stack, the technology choice itself becomes the scapegoat.
This asymmetry in blame creates a perverse incentive. Non technical leaders learn that “tried and tested” is the career safe choice, regardless of whether it’s the right choice. They’re not optimising for project success. They’re optimising for blame avoidance.
A genuine technology decision process looks nothing like either extreme. It starts with a clear articulation of requirements. It evaluates options against those requirements. It surfaces trade offs explicitly. It makes a choice that the team understands and owns.
“We chose PostgreSQL because our consistency requirements are strict, our scale is moderate, our team has deep expertise, and the operational model fits our on call capacity.”
That’s a decision. “Tried and tested” is not a decision. It’s a refusal to make one while pretending you have.
The Path to Success
The organisations that consistently deliver complex technology successfully share a common characteristic: deep, meaningful dialogue between business stakeholders and engineering teams.
This doesn’t mean business people becoming engineers. It means:
Genuine deference on technical matters. When the engineering team says something is hard, the first response is “help me understand why” rather than “surely it can’t be that hard.”
Trade offs surfaced and owned. When shortcuts are taken, everyone understands what’s being traded. The business explicitly accepts the risk rather than pretending it doesn’t exist.
Subject matter experts in the room. Decisions about architecture, timelines, and scope are made with engineers who understand the implications, not by managers shuffling dates on a chart.
Outcome accountability that includes quality. Project managers measured solely on date adherence will optimise for date adherence, quality be damned. Organisations that include defect rates, production stability, and technical debt in their success metrics get different behaviour.
Permission to slow down. Someone with standing and authority needs the ability to say “stop: we’re accumulating too much risk” and have that statement carry weight.
The Humility Test
There’s a simple test for whether an organisation has healthy technical business relationships. Ask a senior business stakeholder to explain, in their own words, the three most significant technical risks in the current programme.
Not “the timeline is aggressive.” That’s not a technical risk; that’s a schedule statement.
Actual technical risks: “We’re using eventual consistency which means during a failure scenario, customers might see stale data for up to thirty seconds. We’ve accepted this trade off because strong consistency would add four months to the timeline.”
If they can’t articulate anything at this level of specificity, they’re driving a car they don’t understand. And unlike a rental car, when this one crashes, it takes the whole organisation with it.
Conclusion
The cancer patient accepts chemotherapy because they know they don’t know. They yield to expertise. They follow guidance. They ask questions to understand rather than to challenge.
The technology manager pushes dates because they don’t know they don’t know. Their partial knowledge, enough to be dangerous, not enough to be useful, creates false confidence. They challenge without understanding. They drive without seeing the road.
The solution isn’t to make every business stakeholder into an engineer. It’s to cultivate the same humility that the cancer patient naturally possesses: a genuine acceptance that some domains require deference, that expertise matters, and that acknowledging your own incompetence is the first step toward not letting it kill the patient.
In this case, the patient is your programme. And the chemotherapy, the painful, slow, disciplined process of building quality software, is the only treatment that works.
Rebaselining isn’t treatment. It’s just rescheduling the funeral. There is no substitute for meaningful discussions. Planning is just regenerating a stuck thought, over a different timeline.
AI is a powerful accelerator when problems are well defined and bounded, but in complex greenfield systems vague intent hardens into architecture and creates long term risk that no amount of automation can undo.
1. What Vibe Coding Really Is
Vibe coding is the practice of describing intent in natural language and allowing AI to infer structure, logic, and implementation directly from that description. It is appealing because it feels frictionless. You skip formal specifications, you skip design reviews, and you skip the uncomfortable work of forcing vague ideas into precise constraints. You describe what you want and something runnable appears.
The danger is that human language is not executable. It is contextual, approximate, and filled with assumptions that are never stated. When engineers treat language as if it were a programming language they are pretending ambiguity does not exist. AI does not remove that ambiguity. It simply makes choices on your behalf and hides those choices behind confident output.
This creates a false sense of progress. Code exists, tests may even pass, and demos look convincing. But the hardest decisions have not been made, they have merely been deferred and embedded invisibly into the system.
2. Language Is Not Logic And Never Was
Dave Varley has consistently highlighted that language evolved for human conversation, not for deterministic execution. Humans resolve ambiguity through shared context, interruption, and correction. Machines do not have those feedback loops. When you say “make this scalable” or “make this secure”, you are not issuing instructions, you are expressing intent without constraints.
Scalable might mean high throughput, burst tolerance, geographic distribution, or cost efficiency. Secure might mean basic authentication or resilience against a motivated attacker. AI must choose one interpretation. It will do so based on statistical patterns in its training data, not on your business reality. That choice is invisible until the system is under stress.
At that point the system will behave correctly according to the wrong assumptions. This is why translating vague language into production systems is inherently hazardous. The failure mode is not obvious bugs, it is systemic misalignment between what the business needs and what the system was implicitly built to optimise.
3. Where Greenfield AI Coding Breaks Down And Where It Is Perfectly Fine
It is important to be precise. The risk is not greenfield work itself. The risk is complex greenfield systems, where ambiguity, coupling, and long lived architectural decisions matter. Simple greenfield services that are isolated, well bounded, and easily unit testable are often excellent candidates for AI assisted generation.
Problems arise when teams treat all greenfield work as equal.
Complex greenfield systems are those where early decisions define the operational, regulatory, and scaling envelope for years. These systems require intentional design because small assumptions compound over time and become expensive or impossible to reverse. In these environments relying on vibe coding is dangerous because there is no existing behaviour to validate against and no production history to expose incorrect assumptions.
Complex greenfield systems require explicit decisions on concerns that natural language routinely hides, including:
Failure modes and recovery strategies across services
Scalability limits and saturation behaviour under load
Regulatory, audit, and compliance obligations
Data ownership, retention, and deletion semantics
Observability requirements and operational accountability
Security threat models and trust boundaries
When these concerns are not explicitly designed they are implicitly inferred by the AI. Those inferences become embedded in code paths, schemas, and runtime behaviour. Because they were never articulated they were never reviewed. This creates architectural debt at inception. The system may pass functional tests yet fail under real world pressure where those hidden assumptions no longer hold.
By contrast, simple greenfield services behave very differently. Small services with a single responsibility, minimal state, clear inputs and outputs, and a limited blast radius are often ideal for AI assisted generation. If a service can be fully described by its interface, exhaustively unit tested, and replaced without systemic impact, then misinterpretation risk is low and correction cost is small.
AI works well when reversibility is cheap. It becomes hazardous when ambiguity hardens into architecture.
4. Where AI Clearly Wins Because the Problem Is Defined
AI excels when the source state exists and the target state is known. In these cases the task is not invention but translation, validation, and repetition. This is where AI consistently outperforms humans.
4.1 Migrating Java Versions
Java version migrations are governed by explicit rules. APIs are deprecated, removed, or replaced in documented ways. Behavioural changes are known and testable. AI can scan entire codebases across hundreds of repositories, identify incompatible constructs, refactor them consistently, and generate validation tests.
Humans are slow and inconsistent at this work because it is repetitive and detail heavy. AI does not get bored and does not miss edge cases. What used to take months of coordinated effort is increasingly a one click, multi repository transformation.
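As a small illustration of how rule-governed this work is, consider one of the documented deprecations: the boxed-primitive constructors are deprecated in favour of the static factory methods. The class below is a hypothetical before-and-after, showing the kind of mechanical rewrite that is tedious for humans and trivial to automate across hundreds of repositories.

```java
// Illustrative sketch only: one documented deprecation and its replacement.
public final class MigrationExample {

    @SuppressWarnings({"deprecation", "removal"}) // retained purely to show the legacy form
    static Integer legacy(int value) {
        return new Integer(value);     // before: deprecated boxing constructor
    }

    static Integer migrated(int value) {
        return Integer.valueOf(value); // after: the documented replacement
    }
}
```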
4.2 Swapping Database Engines
Database engine migrations are another area where constraints are well understood. SQL dialect differences, transactional semantics, and indexing behaviour are documented. AI can rewrite queries, translate stored procedures, flag unsupported features, and generate migration tests that prove equivalence.
Humans historically learned databases by doing this work manually. That learning value still exists, but the labour component no longer makes economic sense. AI performs the translation faster, more consistently, and with fewer missed edge cases.
4.3 Generating Unit Tests
Unit testing is fundamentally about enumerating behaviour. Given existing code, AI can infer expected inputs, outputs, and edge cases. It can generate tests that cover boundary conditions, null handling, and error paths that humans often skip due to time pressure.
This raises baseline quality dramatically and frees engineers to focus on defining correctness rather than writing boilerplate.
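The sketch below shows the kind of edge case coverage described here, assuming JUnit 5 on the classpath. The `PercentageParser` helper and its contract are invented purely for illustration; the point is the shape of the tests, covering boundaries, null handling and error paths that are often skipped under time pressure.

```java
import static org.junit.jupiter.api.Assertions.*;

import org.junit.jupiter.api.Test;

// Hypothetical helper under test, included so the sketch is self-contained.
final class PercentageParser {
    static int parsePercentage(String raw) {
        if (raw == null) throw new IllegalArgumentException("null input");
        int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("not a number: " + raw, e);
        }
        if (value < 0 || value > 100) throw new IllegalArgumentException("out of range: " + value);
        return value;
    }
}

class PercentageParserTest {

    @Test
    void acceptsBoundaryValues() {
        assertEquals(0, PercentageParser.parsePercentage("0"));
        assertEquals(100, PercentageParser.parsePercentage("100"));
    }

    @Test
    void rejectsNullInput() {
        assertThrows(IllegalArgumentException.class, () -> PercentageParser.parsePercentage(null));
    }

    @Test
    void rejectsOutOfRangeAndGarbage() {
        assertThrows(IllegalArgumentException.class, () -> PercentageParser.parsePercentage("101"));
        assertThrows(IllegalArgumentException.class, () -> PercentageParser.parsePercentage("abc"));
    }
}
```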
4.4 Building Operational Dashboards
Operational dashboards translate metrics into insight. The important signals are well known: latency, error rates, saturation, and throughput. AI can identify which metrics matter, correlate signals across services, and generate dashboards that focus on tail behaviour rather than averages.
The result is dashboards that are useful during incidents rather than decorative artifacts.
5. The End of Engineering Training Wheels
Many tasks that once served as junior engineering work are now automated. Refactors, migrations, test generation, and dashboard creation were how engineers built intuition. That work still needs to be understood, but it no longer needs to be done manually.
This changes team dynamics. Senior engineers are coding again because AI removes the time cost of boilerplate. When the yield of time spent writing code improves, experienced engineers re engage with implementation and apply judgment where it actually matters.
The industry now faces a structural challenge. The old apprenticeship path is gone, but the need for deep understanding remains. Organisations that fail to adapt their talent models will feel this gap acutely.
6. AI As an Organisational X Ray
AI is also transforming how organisations understand themselves. By scanning all repositories across a company, AI can rank contributions by real impact rather than activity volume. It can identify where knowledge is concentrated in individuals, exposing key person risk. It can quantify technical debt and price remediation effort so leadership can see risk in economic terms.
It can also surface scaling choke points and cyber weaknesses that manual reviews often miss. This removes plausible deniability. Technical debt and systemic risk become visible and measurable whether the organisation is comfortable with that or not.
7. The Cardinal Sin of AI Operations And Why It Breaks Production
AI driven operations can be powerful, but only under strict architectural conditions. The most dangerous mistake teams make is allowing AI tools to interact directly with live transactional systems that use pessimistic locking and have no read replicas.
Pessimistic locks exist to protect transactional integrity. When a transaction holds a lock it blocks other reads or writes until the lock is released. An AI system that continuously probes production tables for insight can unintentionally extend lock duration or introduce poorly sequenced queries. This leads to deadlocks, where transactions block each other indefinitely, and to increased contention that slows down write throughput for real customer traffic.
The impact is severe. Production write latency increases, customer facing operations slow down, and in worst cases the system enters cascading failure as retries amplify contention. This is not theoretical. It is a predictable outcome of mixing analytical exploration with locked OLTP workloads.
AI operational tooling should only ever interact with systems that have:
Real time read replicas separated from write traffic
No impact on transactional locking paths
The ability to support heterogeneous indexing
Heterogeneous indexing allows different replicas to optimise for different query patterns without affecting write performance. This is where AI driven analytics becomes safe and effective. Without these properties, AI ops is not just ineffective, it is actively dangerous.
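The routing rule itself is simple to express in code. The sketch below assumes hypothetical primary and replica endpoints (the hostnames, credentials and driver are placeholders): transactional writes go to the primary, while analytical or AI driven reads are confined to a replica and marked read only as a guard rail.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative routing sketch: analytical and AI driven queries never touch the primary
// that holds transactional locks. Endpoints and credentials are placeholders only.
public final class QueryRouting {

    private static final String PRIMARY_URL = "jdbc:postgresql://primary.internal:5432/app";
    private static final String REPLICA_URL = "jdbc:postgresql://replica.internal:5432/app";

    /** Transactional writes go to the primary. */
    public static Connection forWrites() throws SQLException {
        return DriverManager.getConnection(PRIMARY_URL, "app", "secret");
    }

    /** Analytical and AI driven reads go to a replica, flagged read only as a safety net. */
    public static Connection forAnalytics() throws SQLException {
        Connection c = DriverManager.getConnection(REPLICA_URL, "analytics", "secret");
        c.setReadOnly(true); // a hint that this session must never modify data
        return c;
    }
}
```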
8. Conclusion: Clarity Over Vibes
AI is an extraordinary force multiplier, but it does not absolve engineers of responsibility. Vibe coding feels productive because it hides complexity. In complex greenfield systems that hidden complexity becomes long term risk.
Where AI shines is in transforming known systems, automating mechanical work, and exposing organisational reality. It enables senior engineers to code again and forces businesses to confront technical debt honestly.
AI is not a replacement for engineering judgment. It is an architectural accelerant. When intent is clear, constraints are explicit, and blast radius is contained, AI dramatically increases leverage. When intent is vague and architecture is implicit, AI fossilises early mistakes at machine speed.
The organisations that win will not be those that let AI think for them, but those that use it to execute clearly articulated decisions faster and more honestly than their competitors ever could.
How Domain Isolation Creates Evolutionary Pressure for Better Software
After two decades building trading platforms and banking systems, I’ve watched the same pattern repeat itself countless times. A production incident occurs. The war room fills. And then the finger pointing begins.
“It’s the database team’s problem.” “No, it’s that batch job from payments.” “Actually, I think it’s the new release from the cards team.” Three weeks later, you might have an answer. Or you might just have a temporary workaround and a room full of people who’ve learned to blame each other more effectively.
This is the tragedy of the commons playing out in enterprise technology, and it’s killing your ability to evolve.
1. The Shared Infrastructure Trap
Traditional enterprise architecture loves shared infrastructure. It makes intuitive sense: why would you run fifteen database clusters when one big one will do? Why have each team manage their own message broker when a central platform team can run one for everybody? Economies of scale. Centralised expertise. Lower costs.
Except that’s not what actually happens.
What happens is that your shared Oracle RAC cluster becomes a battleground. The trading desk needs low latency queries. The batch processing team needs to run massive overnight jobs. The reporting team needs to scan entire tables. Everyone has legitimate needs, and everyone’s needs conflict with everyone else’s. The DBA team becomes a bottleneck, fielding requests from twelve different product owners, all of whom believe their work is the priority.
When the CPU spikes to 100% at 2pm on a Tuesday, the incident call has fifteen people on it, and nobody knows whose query caused it. The monitoring shows increased load, but the load comes from everywhere. Everyone claims their release was tested. Everyone points at someone else.
This isn’t a technical problem. It’s an accountability problem. And you cannot solve accountability problems with better monitoring dashboards.
2. Darwinian Pressure in Software Systems
Nature solved this problem billions of years ago. Organisms that make poor decisions suffer the consequences directly. There's no committee meeting to discuss why the antelope got eaten. The feedback loop is immediate and unambiguous. Whilst nobody wants to watch it happen, teams secretly take comfort in not being the limping buffalo at the back of the herd. Teams get fit: they resist decisions that would put them in an unsafe place because they know they would receive an uncomfortable amount of focus from senior management.
Modern software architecture can learn from this. When you isolate domains, truly isolate them, with their own data stores, their own compute, their own failure boundaries, you create Darwinian pressure. Teams that write inefficient code see their own costs rise. Teams that deploy buggy releases see their own services degrade. Teams that don’t invest in resilience suffer their own outages.
There’s no hiding. There’s no ambiguity. There’s no three week investigation to determine fault. There is no watered down document that hints at the issue, but doesn’t really call it out, as all the teams couldn’t agree on something more pointed. The feedback loop tightens from weeks to hours, sometimes minutes.
This isn’t about blame. It’s about learning. When the consequences of your decisions land squarely on your own service, you learn faster. You care more. You invest in the right things because you directly experience the cost of not investing.
3. The Architecture of Isolation
Achieving genuine domain isolation requires more than just drawing boxes on a whiteboard and calling them “microservices.” It requires rethinking how domains interact with each other and with their data.
Data Localisation Through Replication
The hardest shift for most organisations is accepting that data duplication isn’t a sin. In a shared database world, we’re taught that the single source of truth is sacred. Duplicate data creates consistency problems. Normalisation is good.
But in a distributed world, the shared database is the coupling that prevents isolation. If three domains query the same customer table, they’re coupled. An index change that helps one domain might destroy another’s performance. A schema migration requires coordinating across teams. The tragedy of the commons returns.
Instead, each domain should own its data. If another domain needs that data, replicate it. Event driven patterns work well here: when a customer’s address changes, publish an event. Subscribing domains update their local copies. Yes, there’s eventual consistency. Yes, the data might be milliseconds or seconds stale. But in exchange, each domain can optimise its own data structures for its own access patterns, make schema changes without coordinating with half the organisation, and scale its data tier independently.
Queues as Circuit Breakers
Synchronous service to service calls are the other hidden coupling that defeats isolation. When the channel service calls the fraud service, and the fraud service calls the customer service, you’ve created a distributed monolith. A failure anywhere propagates everywhere. An outage in customer data brings down payments.
Asynchronous messaging changes this dynamic entirely. When a payment needs fraud checking, it drops a message on a queue. If the fraud service is slow or down, the queue absorbs the backlog. The payment service doesn’t fail, it just sees increased latency on fraud decisions. Customers might wait a few extra seconds for approval rather than seeing an error page.
This doesn’t make the fraud service’s problems disappear. The fraud team still needs to fix their outage, but you can make business choices about how to deal with the outage. For example, you can choose to bypass the checks for payments to “known” beneficiaries or below certain threshold values, so the blast radius is contained and can be managed. The payments team’s SLAs aren’t destroyed by someone else’s incident. The Darwinian pressure lands where it belongs: on the team whose service is struggling.
Proxy Layers for Graceful Degradation
Not everything can be asynchronous. Sometimes you need a real time answer. But even synchronous dependencies can be isolated through intelligent proxy layers.
A well designed proxy can cache responses, serve stale data during outages, fall back to default behaviours, and implement circuit breakers that fail fast rather than hanging. When the downstream service returns, the proxy heals automatically.
The key insight is that the proxy belongs to the calling domain, not the called domain. The payments team decides how to handle fraud service failures. Maybe they approve transactions under a certain threshold automatically. Maybe they queue high value transactions for manual review. The fraud team doesn’t need to know or care, they just need to get their service healthy again.
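Here is a stripped-down sketch of such a proxy, owned by the calling domain. Thread safety, metrics and half-open probing are omitted for brevity, and the names are illustrative; the point is the shape of the behaviour: fail fast while the dependency is unhealthy, serve the last good response in the meantime, and heal on the next success.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Illustrative calling-domain proxy with a simple circuit breaker and stale-response cache.
public final class DependencyProxy<T> {

    private final Supplier<T> downstream;
    private final T defaultResponse;
    private final Duration openFor;

    private T lastGood;
    private Instant openUntil = Instant.MIN;

    public DependencyProxy(Supplier<T> downstream, T defaultResponse, Duration openFor) {
        this.downstream = downstream;
        this.defaultResponse = defaultResponse;
        this.openFor = openFor;
    }

    public T call() {
        if (Instant.now().isBefore(openUntil)) {
            return fallback();                       // circuit open: fail fast, do not hang on a sick dependency
        }
        try {
            T result = downstream.get();
            lastGood = result;                       // cache the latest healthy response
            return result;
        } catch (RuntimeException e) {
            openUntil = Instant.now().plus(openFor); // trip the breaker for a cooling-off period
            return fallback();
        }
    }

    private T fallback() {
        return lastGood != null ? lastGood : defaultResponse; // stale data beats an error page
    }
}
```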
4. Escaping the Monolith: Strategies for Service Eviction
Understanding the destination is one thing. Knowing how to get there from where you are is another entirely. Most enterprises aren’t starting with a blank slate. They’re staring at a decade old shared Oracle database with three hundred stored procedures, an enterprise service bus that routes traffic for forty applications, and a monolithic core banking system that everyone is terrified to touch.
The good news is that you don’t need to rebuild everything from scratch. The better news is that you can create structural incentives that make migration inevitable rather than optional.
Service Eviction: Making the Old World Uncomfortable
Service eviction is the deliberate practice of making shared infrastructure progressively less attractive to use while making domain-isolated alternatives progressively more attractive. This isn’t about being obstructive. It’s about aligning incentives with architecture.
Start with change management. On shared infrastructure, every change requires coordination. You need a CAB ticket. You need sign-off from every consuming team. You need a four week lead time and a rollback plan approved by someone three levels up. The change window is 2am Sunday, and if anything goes wrong, you’re in a war room with fifteen other teams.
On domain isolated services, changes are the team’s own business. They deploy when they’re ready. They roll back if they need to. Nobody else is affected because nobody else shares their infrastructure. The contrast becomes visceral: painful, bureaucratic change processes on shared services versus autonomous, rapid iteration on isolated ones.
This isn’t artificial friction. It’s honest friction. Shared infrastructure genuinely does require more coordination because changes genuinely do affect more people. You’re just making the hidden costs visible and letting teams experience them directly.
Data Localisation Through Kafka: Breaking the Database Coupling
The shared database is usually the hardest dependency to break. Everyone queries it. Everyone depends on its schema. Moving data feels impossibly risky.
Kafka changes the game by enabling data localisation without requiring big bang migrations. The pattern works like this: identify a domain that wants autonomy. Have the source system publish events to Kafka whenever relevant data changes. Have the target domain consume those events and maintain its own local copy of the data it needs.
Initially, this looks like unnecessary duplication. The data exists in Oracle and in the domain’s local store. But that duplication is exactly what enables isolation. The domain can now evolve its schema independently. It can optimise its indexes for its access patterns. It can scale its data tier without affecting anyone else. And critically, it can be tested and deployed without coordinating database changes with twelve other teams.
Kafka’s log based architecture makes this particularly powerful. New consumers can replay history to bootstrap their local state. The event stream becomes the source of truth for what changed and when. Individual domains derive their local views from that stream, each optimised for their specific needs.
The key insight is that you’re not migrating data. You’re replicating it through events until the domain no longer needs to query the shared database directly. Once every query can be served from local data, the coupling is broken. The shared database becomes a publisher of events rather than a shared resource everyone depends on.
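A minimal consumer sketch, assuming the kafka-python client and a hypothetical customer-events topic, shows the bootstrap-by-replay idea: a new consumer group starts from the earliest offset, rebuilds its local view, and then keeps it current from the same stream.

```python
# Sketch: a domain derives its local view of customers from a Kafka topic.
# Topic name, group id, and event shape are illustrative assumptions.
import json
from kafka import KafkaConsumer  # kafka-python client (assumed)

local_customers = {}  # the domain's local, query-optimised copy

consumer = KafkaConsumer(
    "customer-events",
    bootstrap_servers="localhost:9092",
    group_id="payments-domain",
    auto_offset_reset="earliest",   # a brand-new group replays history to bootstrap
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    event = record.value
    # Each event says what changed and when; the local view is derived from it.
    local_customers[event["customer_id"]] = event["payload"]
```

Once every query the domain needs can be answered from this local view (or its real equivalent), the direct dependency on the shared database is gone.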
The Strangler Fig: Gradual Replacement Without Risk
The strangler fig pattern, named after the tropical tree that gradually envelops and replaces its host, is the safest approach to extracting functionality from monoliths. Rather than replacing large systems wholesale, you intercept specific functions at the boundary and gradually route traffic to new implementations.
Put a proxy in front of the monolith. Initially, it routes everything through unchanged. Then, one function at a time, build the replacement in the target domain. Route traffic for that function to the new service while everything else continues to hit the monolith. When the new service is proven, remove the old code from the monolith.
The beauty of this approach is that failure is localised and reversible. If the new service has issues, flip the routing back. The monolith is still there, still working. You haven’t burned any bridges. You can take the time to get it right because you’re not under pressure from a hard cutover deadline.
Combined with Kafka-based data localisation, the strangler pattern becomes even more powerful. The new domain service consumes events to build its local state, the proxy routes relevant traffic to it, and the old monolith gradually loses responsibilities until what remains is small enough to either rewrite completely or simply turn off.
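In practice the strangler routing usually lives in an API gateway or reverse proxy configuration rather than application code, but the core decision is simple enough to sketch. The paths and backend URLs below are placeholders.

```python
# Minimal routing sketch for a strangler proxy. Paths and backends are
# hypothetical; flipping an entry moves a function to the new service
# (or back again) without touching the monolith.
MONOLITH = "https://monolith.internal"
NEW_PAYMENTS_SERVICE = "https://payments.domain.internal"

ROUTES = {
    "/payments/submit": NEW_PAYMENTS_SERVICE,   # migrated
    "/payments/history": MONOLITH,              # not yet migrated
}

def route(path: str) -> str:
    """Return the backend for a request path; default to the monolith."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend
    return MONOLITH

assert route("/payments/submit") == NEW_PAYMENTS_SERVICE
assert route("/statements/latest") == MONOLITH
```

Reversibility is the whole point: rolling back is a one-line routing change, not a migration.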
Asymmetric Change Management: The Hidden Accelerator
This is the strategy that sounds controversial but works remarkably well: make change management deliberately asymmetric between shared services and domain isolated services.
On the shared database or monolith, changes require extensive governance. Four week CAB cycles. Impact assessments signed off by every consuming team. Mandatory production support during changes. Post-implementation reviews. Change freezes around month-end, quarter-end, and peak trading periods.
On domain-isolated services, teams own their deployment pipeline end to end. They can deploy multiple times per day if their automation supports it. No CAB tickets. No external sign offs. If they break their own service, they fix their own service.
This asymmetry isn’t punitive. It reflects genuine risk. Changes to shared infrastructure genuinely do have broader blast radius. They genuinely do require more coordination. You’re simply making the cost of that coordination visible rather than hiding it in endless meetings and implicit dependencies.
The effect is predictable. Teams that want to move fast migrate to domain isolation. Teams that are comfortable with quarterly releases can stay on shared infrastructure. Over time, the ambitious teams have extracted their most critical functionality into isolated domains. What remains on shared infrastructure is genuinely stable, rarely changing functionality that doesn’t need rapid iteration.
The natural equilibrium is that shared infrastructure becomes genuinely shared: common utilities, reference data, things that change slowly and benefit from centralisation. Everything else migrates to where it can evolve independently.
The Migration Playbook
Put it together and the playbook looks like this:
First, establish Kafka as your enterprise event backbone. Every system of record publishes events when data changes. This is table stakes for everything else.
Second, identify a domain with high change velocity that’s suffering under shared infrastructure governance. They’re your early adopter. Help them establish their own data store, consuming events from Kafka to maintain local state.
Third, put a strangler proxy in front of relevant monolith functions. Route traffic to the new domain service. Prove it works. Remove the old implementation.
Fourth, give the domain team autonomous deployment capability. Let them experience the difference between deploying through a four-week CAB cycle versus deploying whenever they’re ready.
Fifth, publicise the success. Other teams will notice. They’ll start asking for the same thing. Now you have demand driven migration rather than architecture-mandated migration.
The key is that you’re not forcing anyone to migrate. You’re creating conditions where migration is obviously attractive. The teams that care about velocity self select. The shared infrastructure naturally shrinks to genuinely shared concerns.
5. The Cultural Shift
Architecture is easy compared to culture. You can draw domain boundaries in a week. Convincing people to live within them takes years.
The shared infrastructure model creates a particular kind of learned helplessness. When everything is everyone’s problem, nothing is anyone’s problem. Teams optimise for deflecting blame rather than improving reliability. Political skills matter more than engineering skills. The best career move is often to avoid owning anything that might fail.
Domain isolation flips this dynamic. Teams own their outcomes completely. There’s nowhere to hide, but there’s also genuine autonomy. You can choose your own technology stack. You can release when you’re ready without coordinating with twelve other teams. You can invest in reliability knowing that you’ll reap the benefits directly.
This autonomy attracts a different kind of engineer. People who want to own things. People who take pride in uptime and performance. People who’d rather fix problems than explain why problems aren’t their fault.
The teams that thrive under this model are the ones that learn fastest. They build observability into everything because they need to understand their own systems. They invest in automated testing because they can’t blame someone else when their deploys go wrong. They design for failure because they know they’ll be the ones getting paged.
The teams that don’t adapt… well, that’s the Darwinian part. Their services become known as unreliable. Other teams design around them. Eventually, the organisation notices that some teams consistently deliver and others consistently struggle. The feedback becomes impossible to ignore.
6. Conway’s Law: Accepting the Inevitable, Rejecting the Unnecessary
Melvin Conway observed in 1967 that organisations design systems that mirror their communication structures. Fifty years of software engineering has done nothing to disprove him. Your architecture will reflect your org chart whether you plan for it or not.
This isn’t a problem to be solved. It’s a reality to be acknowledged. Your domain boundaries will follow team boundaries. Your service interfaces will reflect the negotiations between teams. The political realities of your organisation will manifest in your technical architecture. Fighting this is futile.
But here’s what Conway’s Law doesn’t require: shared suffering.
Traditional enterprise architecture interprets Conway’s Law as an argument for centralisation. If teams need to communicate, give them shared infrastructure to communicate through. If domains overlap, put the overlapping data in a shared database. The result is that Conway’s Law manifests not just in system boundaries but in shared pain. When one team struggles, everyone struggles. When one domain has an incident, twelve teams join the war room.
Domain isolation accepts Conway’s Law while rejecting this unnecessary coupling. Yes, your domains will align with your teams. Yes, your service boundaries will reflect organisational reality. But each team’s infrastructure can be genuinely isolated. Public cloud makes this trivially achievable through account-level separation.
Give each domain its own AWS account or Azure subscription. Their blast radius is contained by cloud provider boundaries, not just by architectural diagrams. Their cost allocation is automatic. Their security boundaries are enforced by IAM, not by policy documents. Their quotas and limits are independent. When the fraud team accidentally spins up a thousand Lambda functions, the payments team doesn’t notice because they’re in a completely separate account with separate limits.
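For illustration, member accounts can be created programmatically through AWS Organizations. The boto3 sketch below uses placeholder names and addresses; real environments typically drive this through AWS Control Tower or infrastructure as code rather than ad hoc scripts.

```python
# Hedged sketch: create a dedicated member account for a domain using
# AWS Organizations via boto3. Email and account name are placeholders.
import boto3

org = boto3.client("organizations")

response = org.create_account(
    Email="payments-domain-aws@example.com",   # placeholder address
    AccountName="payments-domain-prod",
)

# Account creation is asynchronous; poll the returned status if needed.
print(response["CreateAccountStatus"]["State"])  # e.g. IN_PROGRESS
```

The account boundary then does the isolation work for you: separate IAM, separate quotas, separate bill.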
Conway’s Law still shapes your domain design. The payments team builds payment services. The fraud team builds fraud services. The boundaries reflect the org chart. But the implementation of those boundaries can be absolute rather than aspirational. Account level isolation means that even if your domain design isn’t perfect, the consequences of imperfection are contained.
This is the insight that transforms Conway’s Law from a constraint into an enabler. You’re not fighting organisational reality. You’re aligning infrastructure isolation with organisational boundaries so that each team genuinely owns their outcomes. The communication overhead that Conway identified still exists, but it happens through well-defined APIs and event contracts rather than through shared database contention and incident calls.
7. The Transition Path
You can’t flip a switch and move from shared infrastructure to domain isolation overnight. The dependencies are too deep. The skills don’t exist. The organisational structures don’t support it.
But you can start. Pick a domain that’s struggling with the current model, probably one that’s constantly blamed for incidents they didn’t cause. Give them their own database, their own compute, their own deployment pipeline. Build the event publishing infrastructure so they can share data with other domains through replication rather than direct queries.
Watch what happens. The team will stumble initially. They’ve never had to think about database sizing or query optimisation because that was always someone else’s job. But within a few months, they’ll own it. They’ll understand their system in a way they never did before. Their incident response will get faster because there’s no ambiguity about whose system is broken.
More importantly, other teams will notice. They’ll see a team that deploys whenever they want, that doesn’t get dragged into incident calls for problems they didn’t cause, that actually controls their own destiny. They’ll start asking for the same thing.
This is how architectural change actually happens, not through mandates from enterprise architecture, but through demonstrated success that creates demand.
8. The Economics Question
I can already hear the objections. “This is more expensive. We’ll have fifteen databases instead of one. Fifteen engineering teams managing infrastructure instead of one platform team.”
To which I’d say: you’re already paying these costs, you’re just hiding them.
Every hour spent in an incident call where twelve teams try to figure out whose code caused the database to spike is a cost. Every delayed release because you’re waiting for a shared schema migration is a cost. Every workaround another team implements because your shared service doesn’t quite meet their needs is a cost. Every engineer who leaves because they’re tired of fighting political battles instead of building software is a cost.
Domain isolation makes these costs visible and allocates them to the teams that incur them. That visibility is uncomfortable, but it’s also the prerequisite for improvement.
And yes, you’ll run more database clusters. But they’ll be right sized for their workloads. You won’t be paying for headroom that exists only because you can’t predict which team will spike load next. You won’t be over provisioning because the shared platform has to handle everyone’s worst case simultaneously.
9. But surely AWS is shared infrastructure?
A common pushback when discussing domain isolation and ownership is: “But surely AWS is shared infrastructure?” The answer is yes, but that observation misses the point of what ownership actually means in a Darwinian architectural model.
Ownership here is not about blame or liability when something goes wrong. It is about control and autonomy. The critical question is not who gets blamed, but who has the ability to act, change, and learn.
AWS operates under a clearly defined Shared Responsibility Model. AWS is responsible for the security of the cloud, the physical data centres, hardware, networking, and the underlying virtualization layers. Customers are responsible for security in the cloud, everything they configure, deploy, and operate on top of that platform.
Crucially, AWS gives you complete control over the things you are responsible for. You are not handed vague obligations without tools. You are given APIs, policy engines, telemetry, and automation primitives to fully own your outcomes. Identity and access management, network boundaries, encryption, scaling policies, deployment strategies, data durability, and recovery are all explicitly within your control.
This is why AWS being “shared infrastructure” does not undermine architectural ownership. Ownership is not defined by exclusive physical hardware; it is defined by decision-making authority and freedom to evolve. A team that owns its AWS account, VPC, services, and data can change direction without negotiating with a central platform team, can experiment safely within its own blast radius, and can immediately feel the consequences of poor design decisions.
That feedback loop is the point.
From a Darwinian perspective, AWS actually amplifies evolutionary pressure. Teams that design resilient, observable, well isolated systems thrive. Teams that cut corners experience outages, cost overruns, and operational pain, quickly and unambiguously. There is no shared infrastructure committee to absorb the consequences or hide failure behind abstraction layers.
So yes, AWS is shared infrastructure — but it is shared in a way that preserves local control, clear responsibility boundaries, and fast feedback. And those are the exact conditions required for domain isolation to work, and for better software to evolve over time.
10. Evolution, Not Design
The deepest insight from evolutionary biology is that complex, well adapted systems don’t emerge from top down design. They emerge from the accumulation of countless small improvements, each one tested against reality, with failures eliminated and successes preserved.
Enterprise architecture traditionally works the opposite way. Architects design systems from above. Teams implement those designs. Feedback loops are slow and filtered through layers of abstraction. By the time the architecture proves unsuitable, it’s too deeply embedded to change.
Domain isolation enables architectural evolution. Each team can experiment within their boundary. Good patterns spread as other teams observe and adopt them. Bad patterns get contained and eventually eliminated. The overall system improves through distributed learning rather than centralised planning.
This doesn’t mean architects become irrelevant. Someone needs to define the contracts between domains, design the event schemas, establish the standards for how services discover and communicate with each other. But the architect’s role shifts from designing systems to designing the conditions under which good systems can emerge.
11. The End State
I’ve seen organisations make this transition. It takes years, not months. It requires sustained leadership commitment. It forces difficult conversations about team structure and accountability.
But the end state is remarkable. Incident calls have three people on them instead of thirty. Root cause is established in minutes instead of weeks. Teams ship daily instead of quarterly. Engineers actually enjoy their work because they’re building things instead of attending meetings about who broke what.
Pain at the Source
The core idea is deceptively simple: put the pain of an issue right next to its source. When your database is slow, you feel it. When your deployment breaks, you fix it. The feedback loop is immediate and unambiguous.
But here’s what surprises people: this doesn’t make teams selfish. Far from it.
In the shared infrastructure world, teams spend enormous energy on defence. Every incident requires proving innocence. Every performance problem demands demonstrating that your code isn’t the cause. Every outage triggers a political battle over whose budget absorbs the remediation. Teams are exhausted not from building software but from fighting for survival in an environment of ambiguous, omnipresent enterprise guilt.
Domain isolation eliminates this overhead entirely. When your service has a problem, it’s your problem. There’s no ambiguity. There’s no blame game. There’s no three week investigation. You fix it and move on.
Cooperation, Not Competition
And suddenly, teams have energy to spare.
When the fraud team struggles with a complex caching problem, the payments team can offer to help. Not because they’re implicated, not because they’re defending themselves, but because they have genuine expertise and genuine capacity. They arrive as subject matter experts, and the fraud team receives them gratefully as such. There’s no suspicion that help comes with strings attached or that collaboration is really just blame shifting in disguise.
Teams become more cooperative in this world, not less. They show off where they’ve been successful. They write internal blog posts about their observability stack. They present at tech talks about how they achieved sub second deployments. Other teams gladly copy them because there’s no competitive zero sum dynamic. Your success doesn’t threaten my budget. Your innovation doesn’t make my team look bad. We’re all trying to build great software, and we can finally focus on that instead of on survival.
Breaking Hostage Dynamics
And you’re no longer hostage to hostage hiring.
In the shared infrastructure world, a single team can hold the entire organisation ransom. They build a group wide service. It becomes critical. It becomes a disaster. Suddenly they need twenty emergency engineers or the company is at risk. The service shouldn’t exist in the first place, but now it’s too important to fail and too broken to survive without massive investment. The team that created the problem gets rewarded with headcount. The teams that built sustainable, well-designed services get nothing because they’re not on fire.
Domain isolation breaks this perverse incentive. If a team builds a disaster, it’s their disaster. They can’t hold the organisation hostage because their blast radius is contained. Other domains have already designed around them with circuit breakers and fallbacks. The failing service can be deprecated, strangled out, or left to die without taking the company with it. Emergency hiring goes to teams that are succeeding and need to scale, not to teams that are failing and need to be rescued.
The Over Partitioning Trap
I should add a warning: I’ve also seen teams inflict shared pain on themselves, even without shared infrastructure.
They do this by hiring swathes of middle managers and over partitioning into tiny subdomains. Each team becomes responsible for a minuscule pool of resources. Nobody owns anything meaningful. To compensate, they hire armies of planners to try and align these micro teams. The teams fire emails and Jira tickets at each other to inch their ten year roadmap forward. Meetings multiply. Coordination overhead explodes. The organisation has recreated shared infrastructure pain through organisational structure rather than technology.
When something fails in this model, it quickly becomes clear that only a very few people actually understand anything. These elite few become the shared gatekeepers. Without them, no team can do anything. They’re the only ones who know how the pieces fit together, the only ones who can debug cross team issues, the only ones who can approve changes that touch multiple micro domains. You’ve replaced shared database contention with shared human contention. The bottleneck has moved from Oracle to a handful of exhausted architects.
It’s critical not to over partition into tiny subdomains. A domain should be large enough that a team can own something meaningful end to end. They should be able to deliver customer value without coordinating with five other teams. They should understand their entire service, not just their fragment of a service.
These nonsensical subdomains generally only occur when non technical staff have a disproportionately loud voice in team structure. When project managers dominate the discussions and own the narrative for the services. When the org chart is designed around reporting lines and budget centres rather than around software that needs to work together. When the people deciding team boundaries have never debugged a production incident or traced a request across service boundaries.
Domain isolation only works when domains are sized correctly. Too large and you’re back to the tragedy of the commons within the domain. Too small and you’ve created a distributed tragedy of the commons where the shared resource is human coordination rather than technical infrastructure. The sweet spot is teams large enough to own meaningful outcomes and small enough to maintain genuine accountability.
The Commons Solved
The shared infrastructure isn’t completely gone. Some things genuinely benefit from centralisation. But it’s the exception rather than the rule. And crucially, the teams that use shared infrastructure do so by choice, understanding the trade offs, rather than by mandate.
The tragedy of the commons is solved not by better governance of the commons, but by eliminating the commons. Give teams genuine ownership. Let them succeed or fail on their own merits. Trust that the Darwinian pressure will drive improvement faster than any amount of central planning ever could.
Nature figured this out a long time ago. It’s time enterprise architecture caught up.
Most organisations don’t fail because they lack intelligence, capital, or ambition. They fail because leadership becomes arrogant, distant, and insulated from reality.
What Is Humility?
Humility is the quality of having a modest view of one’s own importance. It is an accurate assessment of one’s strengths and limitations, combined with an openness to learning and an awareness that others may know more. In organisational terms, humility manifests as the capacity to hear uncomfortable truths, acknowledge mistakes, and value input from every level of the business.
Humility is one of the hardest things to teach a person. It is not a skill that can be acquired through training programmes or leadership workshops. It is an awareness instilled during childhood, shaped by parents, teachers, and early experiences that teach a person they are not the centre of the universe. By the time someone reaches adulthood, this awareness is either present or it isn’t. You cannot send a forty-year-old executive on a course and expect them to emerge humble. The neural pathways, the reflexive responses, the fundamental orientation towards self and others, these are set early and run deep.
For companies, the challenge is even greater. Organisations are not people, but they develop personalities, and those personalities crystallise quickly. If humility was not baked into the culture from day one, if the founders did not model it, if the early hires did not embody it, then the organisation will struggle to acquire it later. Only new leadership and a significant passage of time can shift an entrenched culture of arrogance. Even then, the change is slow, painful, and far from guaranteed.
Two Models of Leadership
The Authoritarian, Indulgent Leader
These leaders rule from altitude. They sit on different floors, park in different car parks, eat in different canteens, and live inside executive bubbles carefully engineered to shield them from friction. Authority flows downwards through decrees. A skewed form of reality flows upwards through sanitised PowerPoint.
They almost never use their own products or services. They don’t visit the call centre. They don’t stand on the shop floor. They don’t watch a customer struggle with a process they personally approved. They never ask their staff how they can help. Instead, they consume dashboards and reports to try to understand the business that is waiting for their leadership to arrive.
Every SLA is green. Every KPI reassures. Every steering committee confirms alignment. And yet customer satisfaction collapses, staff disengage, and competitors with fewer people and less capital start eating their lunch. This is the great corporate lie: nothing is wrong, but everything is broken.
No one challenges decisions. Governance multiplies. Risk frameworks expand until taking the initiative becomes a career-limiting move. Over time, the organisation stops thinking and starts obeying. Innovation is outsourced. All new thinking comes from consultants who interview staff, extract their ideas, repackage them as proprietary insight, and sell them back at eye-watering rates. Leadership applauds the output, comforted by the illusion that wisdom can be bought rather than lived.
This is indulgent leadership: protected, performative, and terminally disconnected.
The Humble Leader
Humble leaders operate at ground level. The CEO gets their own coffee. Leaders walk to teams instead of summoning them. They sit in on support calls. They use the product as a normal customer would. They experience friction directly, not through a quarterly summary.
In these organisations, leaders teach instead of posture. Knowledge is shared, not hoarded. Being corrected is not career suicide. Authority comes from competence, not title.
Humble leaders are not insecure. They are curious. They ask “why” more than they declare “because I said so”. They understand that distance from the work always degrades judgement. This is not informality. It is operational proximity.
PowerPoint Is Not Reality
Authoritarian organisations confuse reporting with truth. They believe if something is on a slide, it must be accurate. They trust traffic lights more than conversations. They cannot understand why customer satisfaction keeps falling when every operational metric is green.
The answer is obvious to everyone except leadership: people are optimising for the dashboard, not the customer.
Humble organisations distrust abstraction. They validate metrics against lived experience. They know dashboards are lagging indicators and conversations are leading ones.
Policy: A Guide or a Weapon?
Humble organisations treat policy as a tool. Arrogant organisations treat it as a weapon.
In humble cultures, policies exist to help people do the right thing. When a policy produces a bad outcome, the policy is questioned. In arrogant cultures, metrics and policy are weaponised. Performance management becomes spreadsheet theatre. Context disappears. Judgement is replaced by compliance.
People stop solving problems and start protecting themselves. The organisation feels controlled, but it is actually fragile.
Arrogance Cannot Pivot
Arrogant organisations cannot pivot because pivoting requires one unforgivable act: admitting you were wrong.
Instead of adapting, they become spectators. They watch markets move, clients leave, and value drain away while insisting the strategy is sound. They blame macro conditions, customer behaviour, or temporary headwinds. Then they double down on the same decisions that caused the decline.
Humble organisations pivot early. They say this isn’t working. They adjust before the damage shows up in the financials. They value relevance over ego.
Relevance vs Extraction
Arrogant organisations optimise for extraction. They become feature factories, launching endless products and layers of complexity to squeeze more fees out of a shrinking, disengaging client base. Every useful feature is locked behind a premium tier. Every improvement requires a new contract or upgrade.
Meanwhile, the basics decay. Reliability, clarity, and ease of use are sacrificed for gimmicks and monetisation hacks. Clients don’t leave immediately. They disengage first. Disengagement is always fatal, just slower.
Humble organisations optimise for relevance. They are simple to understand. Predictable. Honest. They deliver value to the entire client base, not just the most profitable sliver. Improvements are included, not resold. They understand that trust compounds faster than margin extraction ever will.
Performance Management as a Mirror
In arrogant organisations, performance management exists to defend leadership decisions. Targets are set far from reality. Success is defined narrowly. Failure is punished, not examined. People learn quickly that survival matters more than truth.
In humble organisations, performance management is a learning system. Outcomes matter, but so does context. Leaders care about why something failed, not just that it failed. The goal is improvement, not theatre.
The Dunning-Kruger Organisation
Arrogant organisations inevitably fall into the Dunning-Kruger trap. They overestimate their understanding of customers, markets, and their own competence precisely because they have insulated themselves from feedback. Confidence rises as signal quality drops.
Humble organisations assume they know less than they think. They stay close to the work. They listen. They test assumptions. And as a result, they actually learn faster.
Humility Scales. Arrogance Collapses.
Humility is not a personality trait. It is a structural choice. It determines whether truth can travel upward, whether correction is possible, and whether leadership remains connected to the outcomes it creates.
Because humility cannot be taught, organisations must select for it. Hire humble people. Promote humble leaders. Remove those who cannot hear feedback. The alternative is to wait for reality to deliver the lesson, and by then, it is usually too late.
In the long run, the most dangerous sentence in any organisation is not “we failed”. It is “everything is green”. Because by the time arrogance has acknowledged reality, reality has moved on.
I wanted to write about the trends we can see playing out, both in South Africa and globally, with respect to: Large Retailers, Mobile Networks, Banking, Insurance and Technology. These thoughts are my own and I am often wrong, so don’t get too excited if you don’t agree with me 🙂
South Africa is experiencing a banking paradox. On one hand, consumers have never had more choice: digital challenger banks, retailer backed banks, insurer led banks, and mobile first offerings are launching at a remarkable pace. On the other hand, the fundamental economics of running a bank have never been more challenging. Margins are shrinking, fees are collapsing toward zero, fraud and cybercrime costs are exploding, and clients are fragmenting their financial lives across multiple institutions.
This is not merely a story about digital disruption or technological transformation. It is a story about scale, cost gravity, fraud economics, and the inevitable consolidations.
1. The Market Landscape: Understanding South Africa’s Banking Ecosystem
Before examining the pressures reshaping South African banking, it is essential to understand the current market structure. As of 2024, South Africa’s banking sector remains concentrated among a handful of large institutions. Together with Capitec and Investec, the major traditional banks held around 90 percent of the banking assets in the country.
Despite this dominance, the landscape is shifting. New bank entrants have gained large numbers of clients in South Africa. However, client acquisition has not translated into meaningful market share. This disconnect between client numbers and actual banking value reveals a critical truth: in an abundant market, acquiring accounts is easy. Becoming someone’s primary financial relationship is extraordinarily difficult.
2. The Incumbents: How Traditional Banks Face Structural Pressure
South Africa’s traditional banking system remains dominated by large institutions that have built their positions over decades. They continue to benefit from massive balance sheets, regulatory maturity, diversified revenue streams including corporate and investment banking, and deep institutional trust built over generations.
However, these very advantages now carry hidden liabilities. The infrastructure that enabled dominance in a scarce market has become expensive to maintain in an abundant one.
2.1 The True Cost Structure of Modern Banking
Running a traditional bank today means bearing the full weight of regulatory compliance spanning Basel frameworks, South African Reserve Bank supervision, anti money laundering controls, and know your client requirements. It means investing continuously in cybersecurity and fraud prevention systems that have evolved from control functions into permanent warfare operations. It means maintaining legacy core banking systems that are expensive to operate, difficult to modify, and politically challenging to replace. It means supporting hybrid client service models that span physical branches, call centres, and digital platforms, each requiring different skillsets and infrastructure.
Add to this the ongoing costs of card payment rails, interchange fees, and cash logistics infrastructure, and the fixed cost burden becomes clear. These are not discretionary investments that can be paused during difficult periods. They are the fundamental operating requirements of being a bank.
2.2 The Fee Collapse and Revenue Compression
At the same time that structural costs continue rising, transactional banking revenue is collapsing. Consumers are no longer willing to pay for monthly account fees, per transaction charges, ATM withdrawals, or digital interactions. What once subsidized the cost of branch networks and back office operations now generates minimal revenue.
This creates a fundamental squeeze where costs rise faster than revenue can be replaced. The incumbents still maintain advantages in complexity based products such as home loans, vehicle finance, large credit books, and business banking relationships. These products require sophisticated risk management, large balance sheets, and regulatory expertise that new entrants struggle to replicate.
However, they are increasingly losing the day to day transactional relationship. This is where client engagement happens, where financial behaviors are observed, and where long term loyalty is either built or destroyed. Without this foundation, even complex product relationships become vulnerable to attrition.
3. The Crossover Entrants: Why Retailers, Telcos, and Insurers Want Banks
Over the past decade, a powerful second segment has emerged: non banks launching banking operations. Retailers, insurers, and telecommunications companies have all moved into financial services. These players are not entering banking for prestige or diversification. They are making calculated economic decisions driven by specific strategic objectives.
3.1 The Economic Logic Behind Retailers Entering Banking
Retailers see five compelling reasons to operate banks:
First, they want to offload cash at tills. When customers can deposit and withdraw cash while shopping or visiting stores, retailers dramatically reduce cash in transit costs, eliminate expensive standalone ATM infrastructure, and reduce the security risks associated with holding large cash balances.
Second, they want to eliminate interchange fees by keeping payments within their own ecosystems. Every transaction that stays on their own payment rails avoids card scheme costs entirely, directly improving gross margins on retail sales.
Third, and most strategically, they want to control payment infrastructure. The long term vision extends beyond cards to account to account payment systems integrated directly into retail and mobile ecosystems. This would fundamentally shift power away from traditional card networks and banks.
Fourth, zero fee banking becomes a powerful loss leader. Banking services drive foot traffic, increase share of wallet across the ecosystem, and reduce payment friction for customers who increasingly expect seamless digital experiences.
Fifth, and increasingly the most sophisticated motivation, they want to capture higher quality client data and establish direct digital relationships with customers. This creates a powerful lever for upstream supplier negotiations that traditional retailers simply cannot replicate. Loyalty programs, whilst beneficial as a source of accurate client data, typically fail to provide the real time digital engagement needed to shift product. Most loyalty programs are either bar coded plastic cards or apps with low client engagement and high drop off rates, principally due to their narrow value proposition.
Consider the dynamics this enables: a retailer with deep transactional banking relationships knows precisely which customers purchase specific product categories, their purchase frequency, their price sensitivity, their payment patterns, and their responsiveness to promotions. This is not aggregate market research. This is individualised, verified, behavioural data tied to actual spending.
Armed with this intelligence, the retailer can approach Supplier A with a proposition that would have been impossible without the banking relationship: “If you reduce your price by 10 basis points, we will actively engage the 340,000 customers in our ecosystem who purchase your product category. Based on our predictive models, we can demonstrate that targeted digital engagement through our banking app and payment notifications will double sales volume within 90 days.”
This is not speculation or marketing bravado. It is a data backed commitment that can be measured, verified, and contractually enforced.
The supplier faces a stark choice: accept the price reduction in exchange for guaranteed volume growth, or watch the retailer redirect those same 340,000 customers toward a competing supplier who will accept the terms.
Traditional retailers without banking operations cannot make this proposition credible. They might claim to have customer data, but it is fragmented, often anonymised, and lacks the real time engagement capability that banking infrastructure provides. A banking relationship means the retailer can send a push notification at the moment of payment, offer instant cashback on targeted products, and measure conversion within hours rather than weeks.
This upstream leverage fundamentally changes the power dynamics in retail supply chains. Suppliers who once dictated terms based on brand strength now find themselves negotiating with retailers who possess superior customer intelligence and the direct communication channels to act on it.
The implications extend beyond simple price negotiations. Retailers can use this data advantage to optimise product ranging, predict demand with greater accuracy, negotiate exclusivity periods, and even co develop products with suppliers based on demonstrated customer preferences. The banking relationship transforms the retailer from a passive distribution channel into an active market maker with privileged access to consumer behaviour.
This is why the smartest retailers view banking not as a side business or diversification play, but as strategic infrastructure that enhances their core retail operations. The banking losses during the growth phase are an investment in capabilities that competitors without banking licences simply cannot match.
3.2 The Hidden Complexity They Underestimate
What these players consistently underestimate is that banking is not retail with a license. The operational complexity, regulatory burden, and risk profile of banking operations differ fundamentally from their core businesses.
Fraud, cybercrime, dispute resolution, chargebacks, scams, and client remediation are brutally complex challenges. Unlike retail where a product return is a process inconvenience, banking disputes involve money that may be permanently lost, identities that can be stolen, and regulatory obligations that carry severe penalties for failure.
The client service standard in banking is fundamentally different. When a retail transaction fails, it is frustrating. When a banking transaction fails and money disappears, it becomes a crisis that can devastate client trust and trigger regulatory scrutiny.
The experience of insurer led banks illustrates these challenges with brutal precision. Building a banking operation requires billions of rand in upfront investment, primarily in technology infrastructure and regulatory compliance systems. Banks launched by insurers have operated at significant losses for several years while building scale. In a market already saturated with low cost options and fierce competition for the primary account relationship, the margin for strategic error is extraordinarily thin.
4. Case Study: Old Mutual and the Nedbank Paradox
The crossover entrant dynamics described above find their most striking illustration in Old Mutual’s decision to build a new bank just six years after unbundling a R43 billion stake in one of South Africa’s largest banks. This is not merely an interesting corporate finance story. It is a case study in whether insurers can learn from their own history, or whether they are destined to repeat expensive mistakes.
4.1 The History They Already Lived
Old Mutual acquired a controlling 52% stake in Nedcor (later Nedbank) in 1986 and held it for 32 years. During that time, they learned exactly how difficult banking is. Nedbank grew into a full service institution with corporate banking, investment banking, wealth management, and pan African operations. By 2018, Old Mutual’s board concluded that managing this complexity from London was destroying value rather than creating it.
The managed separation distributed R43.2 billion worth of Nedbank shares to shareholders. Old Mutual reduced its stake from 52% to 19.9%, then to 7%, and today holds just 3.9%. The market’s verdict: Nedbank’s market capitalisation is now R115 billion, more than double Old Mutual’s R57 billion.
Then, in 2022, Old Mutual announced it would build a new bank from scratch.
4.2 The Bet They Are Making Now
Old Mutual has invested R2.8 billion to build OM Bank, with cumulative losses projected at R4 billion to R5 billion before reaching break even in 2028. To succeed, they need 2.5 to 3 million clients, of whom 1.6 million must be “active” with seven or more transactions monthly.
They are launching into a market where Capitec has 24 million clients, TymeBank has achieved profitability with 10 million accounts, Discovery Bank has over 2 million clients, and Shoprite and Pepkor are both entering banking. The mass market segment Old Mutual is targeting is precisely where Capitec’s dominance is most entrenched.
The charitable interpretation: Old Mutual genuinely believes integrated financial services requires owning transactional banking capability. The less charitable interpretation: they are spending R4 billion to R5 billion to relearn lessons they should have retained from 32 years owning Nedbank.
4.3 The Questions That Should Trouble Shareholders
Why build rather than partner? Old Mutual could have negotiated a strategic partnership with Nedbank focused on mass market integration. Instead, they distributed R43 billion to shareholders and are now spending R5 billion to recreate a fraction of what they gave away.
What institutional knowledge survived? The resignation of OM Bank’s CEO and COO in September 2024, months before launch, suggests the 32 years of Nedbank experience did not transfer to the new venture. They are learning banking again, expensively.
Is integration actually differentiated? Discovery has pursued the integrated rewards and banking model for years with Vitality. Old Mutual Rewards exists but lacks the behavioural depth and brand recognition. Competing against Discovery for integration while competing against Capitec on price is a difficult strategic position.
What does success even look like? If OM Bank acquires 3 million accounts but most clients keep their salary at Capitec, the bank becomes another dormant account generator. The primary account relationship is what matters. Everything else is expensive distraction.
4.4 What This Tells Us About Insurer Led Banking
The Old Mutual case crystallises the risks facing every crossover entrant discussed in Section 3. Banking capability cannot be easily exited and re entered. Managed separations can destroy strategic options while unlocking short term value. The mass market is not a gap waiting to be filled; it is a battlefield where Capitec has spent 20 years building structural dominance.
Most importantly, ecosystem integration is necessary but not sufficient. The theory that insurance plus banking plus rewards creates unassailable client relationships remains unproven. Old Mutual’s version of this integrated play will need to be meaningfully better than Discovery’s, not merely present.
Whether Old Mutual’s second banking chapter ends differently from its first depends on whether the organisation has genuinely learned from Nedbank, or whether it is replaying the same strategies in a market that has moved on without it. The billions already committed suggest they believe the former. The competitive dynamics suggest the latter.
5. Fraud Economics: The Invisible War Reshaping Banking
Fraud has emerged as one of the most significant economic forces in South African banking, yet it remains largely invisible to most clients until they become victims themselves. The scale, velocity, and sophistication of fraud losses are fundamentally altering banking economics and will drive significant market consolidation over the coming years.
5.1 The Staggering Growth in Fraud Losses
The fraud landscape in South Africa has deteriorated at an alarming rate. Looking at the three year trend from 2022 to 2024, the acceleration is unmistakable. More than half of the total digital banking fraud cases in the last three years occurred in 2024 alone, according to SABRIC.
Digital banking crime increased by 86% in 2024, rising from 52,000 incidents in 2023 to almost 98,000 reported cases. Measured by individual fraud cases, digital banking fraud more than doubled, jumping from 31,612 in 2023 to 64,000 in 2024. The financial impact climbed from R1 billion in 2023 to over R1.4 billion in 2024, a sharp year-over-year increase in losses.
Card fraud continues its relentless climb despite banks’ investments in security. Losses from card related crime increased by 26.2% in 2024, reaching R1.466 billion. Card not present transactions, which occur primarily in online and mobile environments, accounted for 85.6% of gross credit card fraud losses, highlighting where criminals have concentrated their efforts.
Critically, 65.3% of all reported fraud incidents in 2024 involved digital banking channels. This is not a temporary spike. Banking apps alone bore the brunt of fraud, suffering losses exceeding R1.2 billion and accounting for 65% of digital fraud cases.
The overall picture is sobering: total financial crime losses, while dropping from R3.3 billion in 2023 to R2.7 billion in 2024, mask the explosion in digital and application fraud. SABRIC warns that fraud syndicates are becoming increasingly sophisticated, technologically advanced, and harder to detect, setting the stage for what experts describe as a potential “fraud storm” in 2025.
5.2 Beyond Digital: The Application Fraud Crisis
Digital banking fraud represents only one dimension of the crisis. Application fraud has become another major growth area that threatens bank profitability and balance sheet quality.
Vehicle Asset Finance (VAF) fraud surged by almost 50% in 2024, with potential losses estimated at R23 billion. This is not primarily digital fraud; it involves sophisticated document forgery, cloned vehicles, synthetic identities, and increasingly, AI generated employment records and payslips to deceive financing systems.
Unsecured credit fraud rose sharply by 57.6%, with more than 62,000 fraudulent applications reported. Actual losses more than doubled from the previous year to R221.7 million, demonstrating that approval rates for fraudulent applications are improving from the criminals’ perspective.
Home loan fraud, though slightly down in reported case numbers, remains highly lucrative for organized crime. Fraudsters are deploying AI modified payslips, deepfake video calls for identity verification, and sophisticated impersonation techniques to secure financing that will never be repaid.
5.3 The AI Powered Evolution of Fraud Techniques
The rapid advancement of artificial intelligence has fundamentally changed the fraud landscape. According to SABRIC CEO Andre Wentzel, criminals are leveraging AI to create scams that appear more legitimate and convincing than ever before.
From error free phishing emails to AI generated WhatsApp messages that perfectly mimic a bank’s communication style, and even voice cloned deepfakes impersonating bank officials or family members, these tactics highlight an unsettling reality: the traditional signals that helped clients identify fraud are disappearing.
SABRIC has cautioned that in 2025, real time deepfake audio and video may become common tools in fraud schemes. Early cases have already emerged of fraudsters using AI voice cloning to impersonate individuals and banking officials with chilling accuracy.
Importantly, SABRIC emphasizes that these incidents result from social engineering techniques that exploit human error rather than technical compromises of banking platforms. No amount of technical security investment alone can solve a problem that fundamentally targets human psychology and decision making under pressure.
5.3.1 The Android Malware Explosion: Repackaging and Overlay Attacks
Beyond AI powered social engineering, South African banking clients face a sophisticated Android malware ecosystem that operates largely undetected until accounts are drained.
Repackaged Banking Apps: Criminals are downloading legitimate banking apps from official stores, decompiling them, injecting malicious code, and repackaging them for distribution through third party app stores, phishing links, and even legitimate looking websites. These repackaged apps function identically to the real banking app, making detection nearly impossible for most users. Once installed, they silently harvest credentials, intercept one time passwords, and grant attackers remote control over the device.
GoldDigger and Advanced Banking Trojans: The GoldDigger banking trojan, first identified targeting South African and Vietnamese banks, represents the evolution of mobile banking malware. Unlike simple credential stealers, GoldDigger uses multiple sophisticated techniques: it abuses Android accessibility services to read screen content and interact with legitimate banking apps, captures biometric authentication attempts, intercepts SMS messages containing OTPs, and records screen activity to capture PINs and passwords as they are entered. What makes GoldDigger particularly dangerous is its ability to remain dormant for extended periods, activating only when specific banking apps are launched to avoid detection by antivirus software.
Overlay Attacks: Overlay attacks represent perhaps the most insidious form of Android banking malware. When a user opens their legitimate banking app, the malware detects this and instantly displays a pixel perfect fake login screen overlaid on top of the real app. The user, believing they are interacting with their actual banking app, enters credentials directly into the attacker’s interface. Modern overlay attacks are nearly impossible for average users to detect. The fake screens match the bank’s branding exactly, include the same security messages, and even replicate loading animations. By the time the user realizes something is wrong, usually when money disappears, the malware has already transmitted credentials and initiated fraudulent transactions.
The Scale of the Android Threat: Unlike iOS devices, which benefit from Apple’s strict app ecosystem controls, Android’s open architecture and South Africa’s high Android market share create a perfect storm for mobile banking fraud. Users sideload apps from untrusted sources, delay security updates due to data costs, and often run older Android versions with known vulnerabilities. Android variants hold roughly 70–73% of the global mobile operating system market share as of late 2025; in South Africa, Android’s share is higher still, at 81–82% of mobile devices.
For banks, this creates an impossible support burden. When a client’s account is compromised through malware they installed themselves, who bears responsibility? Under emerging fraud liability frameworks like the UK’s 50:50 model, banks may find themselves reimbursing losses even when the client unknowingly installed malware, creating enormous financial exposure with no clear technical solution.
The only effective defence is a combination of server side behavioural analysis to detect anomalous login patterns, device fingerprinting to identify compromised devices, and aggressive client education. But even this assumes clients will recognize and act on warnings, which social engineering attacks have proven they often will not.
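To make “behavioural analysis” slightly less abstract, here is a toy scoring sketch. The signals, weights, and threshold are invented for illustration and bear no relation to any bank’s actual controls.

```python
# Toy login-risk scoring sketch. Signals, weights, and the threshold are
# invented; real systems use far richer behavioural and device telemetry.
def login_risk_score(login: dict, profile: dict) -> int:
    score = 0
    if login["device_id"] not in profile["known_devices"]:
        score += 40
    if login["country"] != profile["home_country"]:
        score += 30
    if login["hour"] not in profile["usual_hours"]:
        score += 15
    if login.get("accessibility_service_active"):   # common overlay-malware signal
        score += 30
    return score

profile = {
    "known_devices": {"dev-a1"},
    "home_country": "ZA",
    "usual_hours": set(range(7, 22)),
}
login = {
    "device_id": "dev-zz",
    "country": "ZA",
    "hour": 3,
    "accessibility_service_active": True,
}

if login_risk_score(login, profile) >= 60:
    print("step up authentication or block the session")
```

Even a crude server-side score like this catches the device and behaviour changes that client-side checks, running on a compromised phone, cannot be trusted to report.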
5.4 The Operational and Reputational Burden of Fraud
Every fraud incident triggers a cascade of costs that extend far beyond the direct financial loss. Banks must investigate each case, which requires specialized fraud investigation teams working around the clock. They must manage call centre volume spikes as concerned clients seek reassurance that their accounts remain secure. They must fulfill regulatory reporting obligations that have become increasingly stringent. They must absorb reputational damage that can persist for years and influence client acquisition costs.
Client trust, once broken by a poor fraud response, is nearly impossible to rebuild. In a market where clients maintain multiple banking relationships and can switch their primary account with minimal friction, a single high profile fraud failure can trigger mass attrition.
Complexity magnifies this operational burden in ways that are not immediately obvious. Clients who do not fully understand their bank’s products, account structures, or transaction limits are slower to recognize abnormal activity. They are more susceptible to social engineering attacks that exploit confusion about how banking processes work. They are more likely to contact support for clarification, driving up operational costs even when no fraud has occurred.
In this way, confusing product structures do not merely frustrate clients. They actively increase both fraud exposure and the operational costs of managing fraud incidents. A bank with ten account types, each with subtly different fee structures and transaction limits, creates far more opportunities for confusion than one with a single, clearly defined offering.
5.5 The UK Model: Fraud Liability Sharing Between Banks
The United Kingdom has introduced a revolutionary approach to fraud liability that fundamentally changes the economics of payment fraud. Since October 2024, UK payment service providers have been required to split fraud reimbursement liability 50:50 between the sending bank (victim’s bank) and the receiving bank (where the fraudster’s account is held).
Under the Payment Systems Regulator’s mandatory reimbursement rules, UK PSPs must reimburse in scope clients up to £85,000 for Authorised Push Payment (APP) fraud, with costs shared equally between sending and receiving firms. The sending bank must reimburse the victim within five business days of a claim being reported, or within 35 days if additional investigation time is required.
This represents a fundamental shift from the previous voluntary system, which placed reimbursement burden almost entirely on the sending bank and resulted in highly inconsistent outcomes. In 2022, only 59% of APP fraud losses were returned to victims under the voluntary framework. The new mandatory system ensures victims are reimbursed in most cases, unless the bank can prove the client acted fraudulently or with gross negligence.
The 50:50 split creates powerful incentives that did not exist under the old model. Receiving banks, which previously had little financial incentive to prevent fraudulent accounts from being opened or to act quickly when suspicious funds arrived, now bear direct financial liability. This has driven unprecedented collaboration between sending and receiving banks to detect fraudulent behavior, interrupt mule account activities, and share intelligence about emerging fraud patterns.
Sending banks are incentivized to implement robust fraud warnings, enhance real time transaction monitoring, and educate clients about common scam techniques. Receiving banks must tighten account opening procedures, monitor for suspicious deposit patterns, and act swiftly to freeze accounts when fraud is reported.
5.6 When South Africa Adopts Similar Regulations: The Coming Shock
When similar mandatory reimbursement and liability sharing regulations are eventually applied in South Africa, and they almost certainly will be, the operational impact will be devastating for banks operating at the margins.
The economics are straightforward and unforgiving. Banks with weak fraud detection capabilities, limited balance sheets to absorb reimbursement costs, or fragmented operations spanning multiple systems will face an impossible choice: invest heavily and immediately in fraud prevention infrastructure, or accept unsustainable losses from mandatory reimbursement obligations.
For smaller challenger banks, retailer or telco backed banks without deep fraud expertise, and any bank that has prioritized client acquisition over operational excellence, this regulatory shift could prove existential. The UK experience provides a clear warning: smaller payment service providers and start up financial services companies have found it prohibitively costly to comply with the new rules. Some have exited the market entirely. Others have been forced into mergers or partnerships with larger institutions that can absorb the compliance and reimbursement costs.
Consider the mathematics for a sub scale bank in South Africa. If digital fraud continues growing at 86% annually and mandatory 50:50 reimbursement is introduced, a bank with 500,000 active accounts could face tens of millions of rand in annual reimbursement costs before any investment in prevention systems. For a bank operating on thin margins with limited capital reserves, this is simply not sustainable.
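As a back of the envelope sketch, the arithmetic might look something like the following. Every figure below is a hypothetical assumption chosen only to illustrate the order of magnitude, not reported data from any bank:

```java
// Back of envelope sketch of annual reimbursement exposure under a 50:50 model.
// All inputs are hypothetical assumptions for illustration only.
public class ReimbursementSketch {
    public static void main(String[] args) {
        double activeAccounts = 500_000;        // accounts at the hypothetical sub scale bank
        double fraudIncidenceRate = 0.004;      // assume 0.4% of accounts hit by APP style fraud per year
        double averageLossRand = 25_000;        // assumed average loss per incident, in rand
        double reimbursementShare = 0.5;        // sending bank's share under a 50:50 model
        double annualGrowth = 0.86;             // digital fraud growth rate cited in this article

        double yearOneExposure = activeAccounts * fraudIncidenceRate
                * averageLossRand * reimbursementShare;
        double yearTwoExposure = yearOneExposure * (1 + annualGrowth);

        System.out.printf("Year 1 exposure: R%,.0f%n", yearOneExposure);   // ~R25m
        System.out.printf("Year 2 exposure: R%,.0f%n", yearTwoExposure);   // ~R46m if growth persists
    }
}
```

Even with conservative assumptions, the exposure lands in the tens of millions of rand and compounds quickly, before a single rand is spent on prevention systems.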
The banks that will survive this transition are those that can achieve the scale necessary to amortize fraud prevention costs across millions of active relationships. Fraud detection systems, AI powered transaction monitoring, specialized investigation teams, and rapid response infrastructure all require significant fixed investment. These costs do not scale linearly with client count; they are largely fixed regardless of whether a bank serves 100,000 or 10 million clients.
Banks that cannot achieve this scale will find themselves in a death spiral where fraud losses and reimbursement obligations consume an ever larger percentage of revenue, forcing them to cut costs in ways that further weaken fraud prevention, creating even more losses. This dynamic will accelerate the consolidation that is already inevitable for other reasons.
The pressure will be particularly acute for banks that positioned themselves as low friction, high speed account opening experiences. Easy onboarding is a client experience win, but it is also a fraud liability nightmare. Under mandatory reimbursement with shared liability, banks will be forced to choose between maintaining fast onboarding and accepting massive fraud costs, or implementing stricter controls that destroy the very speed that differentiated them.
The only viable path forward for most banks will be radical simplification of products to reduce client confusion, massive investment in AI powered fraud detection, and either achieving scale through growth or accepting acquisition by a larger institution. The banks hustling at the margins, offering mediocre fraud prevention while burning cash on client acquisition, will not survive the transition to mandatory reimbursement.
If a bank gets fraud wrong, no amount of free banking, innovative features, or marketing spend will save it. Trust and safety will become the primary differentiators in South African banking, and the banks that invested early and deeply in fraud prevention will capture a disproportionate share of the primary account relationships that actually matter.
6.0 Technology as a Tailwind and a Trap for New Banks
Technology has dramatically lowered the barrier to starting a bank. Cloud infrastructure, software based cores, and banking platforms delivered as services mean a regulated banking operation can now be launched in months rather than years. This is a genuine tailwind and it will embolden more companies to attempt banking.
Retailers, insurers, fintechs, and digital platforms increasingly believe that with the right vendor stack they can become banks.
That belief is only partially correct.
6.1 Bank in a Box and SaaS Banking
Modern platforms promise fast launches and reduced engineering effort by packaging accounts, payments, cards, and basic lending into ready made systems.
Common examples include Mambu, Thought Machine, Temenos cloud deployments, and Finacle, alongside banking as a service providers such as Solaris, Marqeta, Stripe Treasury, Unit, Vodeno, and Adyen Issuing.
These platforms dramatically reduce the effort required to build a core banking system. What once required years of bespoke engineering can now be achieved in a fraction of the time.
But this is where many new entrants misunderstand the problem.
6.2 The Core Is a Small Part of Running a Bank
The core banking system is no longer the hard part. It is only a small fraction of the total effort and overhead of running a bank.
The real complexity sits elsewhere:
• Fraud prevention and reimbursement
• Credit risk and underwriting
• Financial crime operations
• Regulatory reporting and audit
• Customer support and dispute handling
• Capital and liquidity management
• Governance and accountability
A bank in a box provides undifferentiated infrastructure. It does not provide a sustainable banking business.
Modern banking platforms are intentionally generic. New banks often start with the same capabilities, the same vendors, and similar architectures.
As a result:
• Technology is rarely a lasting differentiator
• Customer experience advantages are quickly copied
• Operational weaknesses scale rapidly through digital channels
What appears to be leverage can quickly become fragility if it is not matched with deep operational competence and the ability to scale quickly and meaningfully to millions of clients. Banking is not a “hello world” moment: a first banking app has to launch with significant, meaningful differentiation and then scale quickly.
6.3 Why This Accelerates Consolidation
Technology makes it easier to start a bank but harder to sustain one.
It encourages more entrants, but ensures that many operate similar utilities with little durable differentiation. Those without discipline in cost control, risk management, and execution become natural consolidation candidates.
In a world where the core is commoditised, banking success is determined by operational excellence and the scale of the ecosystem clients interact with, not by software selection.
Technology has made starting a bank easier, but it has not made running one simpler.
7. The Reality of Multi Banking and Dormant Accounts
South Africans are no longer loyal to a single bank. The abundance of options and the proliferation of zero fee accounts has fundamentally changed consumer behavior. Most consumers now maintain a salary account, a zero fee transactional account, a savings pocket somewhere else, and possibly a retailer or telco wallet.
This shift has created an ecosystem characterized by millions of dormant accounts, high acquisition but low engagement economics, and marketing vanity metrics that mask unprofitable user bases. Banks celebrate account openings while ignoring that most of these accounts will never become active, revenue generating relationships.
7.1 The Primary Account Remains King
Critically, salaries still get paid into one primary account. That account, the financial home, is where long term value accrues. It receives the monthly inflow, handles the bulk of payments, and becomes the anchor of the client’s financial life. Secondary accounts are used opportunistically for specific benefits, but they rarely capture the full relationship.
The battle for primary account status is therefore the only battle that truly matters. Everything else is peripheral.
8. The Coming Consolidation: Not Everyone Survives Abundance
There is a persistent fantasy in financial services that the current landscape can be preserved with enough innovation, enough branding, or enough regulatory patience. It cannot.
Abundance collapses margins, exposes fixed costs, and strips away the illusion of differentiation. The system does not converge slowly. It snaps. The only open question is whether institutions choose their end state, or have it chosen for them.
8.1 The Inevitable End States
Despite endless strategic options being debated in boardrooms, abundance only allows for a small number of viable outcomes.
End State 1: Primary Relationship Banks (Very Few Winners). A small number of institutions become default financial gravity wells. They hold the client’s salary and primary balance. They process the majority of transactions. They anchor identity, trust, and data consent. Everyone else integrates around them. These banks win not by having the most features, but by being operationally boring, radically simple, and cheap at scale. In South Africa, this number is likely two, maybe three. Not five. Not eight. Everyone else who imagines they will be a primary bank without already behaving like one is delusional.
End State 2: Platform Banks That Own the Balance Sheet but Not the Brand. These institutions quietly accept reality. They own compliance, capital, and risk. They power multiple consumer facing brands. They monetize through volume and embedded finance. Retailers, telcos, and fintechs ride on top. The bank becomes infrastructure. This is not a consolation prize. It is seeing the board clearly. But it requires executives to accept that brand ego is optional. Most will fail this test.
End State 3: Feature Banks and Specialist Utilities. Some institutions survive by narrowing aggressively. They become lending specialists, transaction processors, or foreign exchange and payments utilities. They stop pretending to be universal banks. They kill breadth to preserve depth. This path is viable, but brutal. It requires shrinking the organisation, killing products, and letting clients go. Few management teams have the courage to execute this cleanly.
End State 4: Zombie Institutions (The Most Common Outcome). This is where most end up. Zombie banks are legally alive. They have millions of accounts. They are nobody’s primary relationship. They bleed slowly through dormant clients, rising unit costs, and talent attrition. Eventually they are sold for parts, merged under duress, or quietly wound down. This is not stability. It is deferred death.
8.2 The Lie of Multi Banking Forever
Executives often comfort themselves with the idea that clients will happily juggle eight banks, twelve apps, and constant money movement. This is nonsense.
Clients consolidate attention long before they consolidate accounts. The moment an institution is no longer default, it is already irrelevant. Multi banking is a transition phase, not an end state.
8.3 Why Consolidation Will Hurt More Than Expected
Consolidation is painful because it destroys illusions: that brand loyalty was real, that size implied relevance, that optionality was strategy.
It exposes overstaffed middle layers, redundant technology estates, and products that never should have existed. The pain is not just financial. It is reputational and existential.
8.4 The Real Divide: Those Who Accept Gravity and Those Who Deny It
Abundance creates gravity. Clients, data, and liquidity concentrate.
Institutions that accept this move early, choose roles intentionally, and design for integration. Those that resist it protect legacy, multiply complexity, and delay simplification. And then they are consolidated without consent.
9. The Traits That Will Cause Institutions to Struggle
Abundance does not reward everyone equally. In fact, it is often brutal to incumbents and late movers because it exposes structural weakness faster than scarcity ever did. As transaction costs collapse, margins compress, and clients gain unprecedented choice, certain organisational traits become existential liabilities.
9.1 Confusing Complexity with Control
Many struggling institutions believe that complexity equals safety. Over time they accumulate multiple overlapping products solving the same problem, redundant approval layers, duplicated technology platforms, and slightly different pricing rules for similar clients.
This complexity feels like control internally, but externally it creates friction, confusion, and cost. In an abundant world, clients simply route around complexity. They do not complain, they do not escalate, they just leave.
Corporate culture symptom: Committees spend three months debating whether a new savings account should have 2.5% or 2.75% interest while competitors launch entire banks.
Abundance rewards clarity, not optionality.
9.2 Optimising for Internal Governance Instead of Client Outcomes
Organisations that struggle tend to design systems around committee structures, reporting lines, risk ownership diagrams, and policy enforcement rather than client experience.
The result is products that are technically compliant but emotionally hollow. When zero cost competitors exist, clients gravitate toward institutions that feel intentional, not ones that feel procedurally correct.
Corporate culture symptom: Product launches require sign off from seventeen people across eight departments, none of whom actually talk to clients.
Strong governance matters, but when governance becomes the product, clients disengage.
9.3 Treating Technology as a Project Instead of a Capability
Struggling companies still think in terms of “the cloud programme”, “the core replacement project”, or “the digital transformation initiative”.
These organisations fund technology in bursts, pause between efforts, and declare victory far too early. In contrast, winners treat technology as a permanent operating capability, continuously refined and quietly improved.
Corporate culture symptom: CIOs present three year roadmaps in PowerPoint while engineering teams at winning banks ship code daily.
Abundance punishes stop start execution. The market does not wait for your next funding cycle.
9.4 Assuming Clients Will Act Rationally
Many institutions believe clients will naturally rationalise their financial lives: “They’ll close unused accounts eventually”, “They’ll move everything once they see the benefits”, “They’ll optimise for fees and interest rates”.
In reality, clients are lazy optimisers. They consolidate only when there is a clear emotional or experiential pull, not when spreadsheets say they should.
Corporate culture symptom: Marketing teams celebrate 2 million account openings while finance quietly notes that 1.8 million are dormant and generating losses.
Companies that rely on rational client behaviour end up with large numbers of dormant, loss making relationships and very few primary ones.
9.5 Designing Products That Require Perfect Behaviour
Another common failure mode is designing offerings that only work if clients behave flawlessly: repayments that must happen on rigid schedules, penalties that escalate quickly, and products that assume steady income and stable employment.
In an abundant system, flexibility beats precision. Institutions that cannot tolerate variance, missed steps, or irregular usage push clients away, often toward simpler, more forgiving alternatives.
Corporate culture symptom: Credit teams reject 80% of applicants to hit target default rates, then express surprise when growth stalls.
The winners design for how people actually live, not how risk models wish they did.
9.6 Mistaking Distribution for Differentiation
Some companies believe scale alone will save them: large branch networks, massive client bases, and deep historical brand recognition.
But abundance erodes the advantage of distribution. If everyone can reach everyone digitally, then distribution without differentiation becomes a cost centre.
Corporate culture symptom: Executives tout “our 900 branches” as a competitive advantage while clients increasingly view them as an inconvenience.
Struggling firms often have reach, but no compelling reason for clients to engage more deeply or more often.
9.7 Fragmented Ownership and Too Many Decision Makers
When accountability is diffuse, every domain has its own technology head, no one owns end to end client journeys, and decisions are endlessly deferred across forums.
Execution slows to a crawl. Abundance favours organisations that can make clear, fast, and sometimes uncomfortable decisions.
Corporate culture symptom: Six different “digital transformation” initiatives run in parallel, each with its own budget, none talking to each other.
If everyone is in charge, no one is.
9.8 Protecting Legacy Revenue at the Expense of Future Relevance
Finally, struggling organisations are often trapped by their own success. They hesitate to simplify, reduce fees, or remove friction because it threatens existing revenue streams.
But abundance ensures that someone else will do it instead.
Corporate culture symptom: Finance vetoes removing a R5 monthly fee that generates R50 million annually, ignoring that it costs R200 million in client attrition and support calls.
Protecting yesterday’s margins at the cost of tomorrow’s relevance is not conservatism. It is delayed decline.
9.9 The Uncomfortable Truth
Abundance does not kill companies directly. It exposes indecision, over engineering, cultural inertia, teams working slavishly towards narrow anti-client KPIs and misaligned incentives.
The institutions that struggle are not usually the least intelligent or the least resourced. They are the ones most attached to how things used to work.
In an abundant world, simplicity is not naive. It is strategic.
10. The Traits That Enable Survival and Dominance
In stark contrast to the failing patterns above, the banks that will dominate South African banking over the next decade share a remarkably consistent set of traits.
10.1 Radically Simple Product Design
Winning banks offer one account, one card, one fee model, and one app. They resist the urge to create seventeen variants of the same product.
Corporate culture marker: Product managers can explain the entire product line in under two minutes without charts.
Complexity is a choice, and choosing simplicity requires discipline that most organisations lack.
10.2 Obsessive Cost Discipline Without Sacrificing Quality
Winners run aggressively low cost bases through modern cores, minimal branch infrastructure, and automation first operations. But they invest heavily where it matters: fraud prevention, client support when things go wrong, and system reliability.
Corporate culture marker: CFOs are revered, not resented. Every rand is questioned, but client impacting investments move fast.
Cheap does not mean shoddy. It means ruthlessly eliminating waste.
10.3 Treating Fraud as Warfare, Not Compliance
Dominant banks understand fraud is a permanent conflict requiring specialist teams, AI powered detection, real time monitoring, and rapid response infrastructure.
Corporate culture marker: Fraud teams have authority to freeze accounts, block transactions, and shut down attack vectors immediately. If you get fraud wrong, nothing else matters.
10.4 Speed Over Consensus
Winning organisations make fast decisions with incomplete information and course correct quickly. They ship features weekly, not quarterly.
Corporate culture marker: Teams use “disagree and commit” rather than “let’s form a working group to explore this further”.
Abundance punishes deliberation. The cost of being wrong is lower than the cost of being slow.
10.5 Designing for Actual Human Behaviour
Winners build products that work for how people actually live: irregular income, forgotten passwords, missed payments, confusion under pressure.
Corporate culture marker: Product teams spend time in call centres listening to why clients struggle, not in conference rooms hypothesising about ideal user journeys.
The best products feel obvious because they assume nothing about client behaviour except that it will be messy.
10.6 Becoming the Primary Account by Earning Trust in Crisis
The ultimate trait that separates winners from losers is this: winners are there when clients need them most. When fraud happens, when money disappears, when identity is stolen, they respond immediately with empathy and solutions.
Corporate culture marker: client support teams have real authority to solve problems on the spot, not scripts requiring three escalations to do anything meaningful.
Trust cannot be marketed. It must be earned in the moments that matter most.
11. The Consolidation Reality: How South African Banking Reorganises Itself
Consolidation in South African banking has moved beyond discussion to inevitability. The paradox in the market (abundant options but shrinking economics) is not a transitional phase; it is the structural condition driving consolidation. The forces shaping this are already visible: shrinking margins, collapsing transactional fees, exploding fraud costs, and clients fragmenting their banking relationships while never truly committing as primaries.
Consolidation is not a risk. It is the outcome.
11.1 The Economics That Drive Consolidation
The system that once rewarded scale and complexity now penalises them. Legacy governance, hybrid branch networks, dual technology stacks, and product breadth are all costs that cannot be supported when transactional revenue trends toward zero. Compliance, fraud prevention, cyber risk, KYC/AML, and ongoing supervision from SARB are fixed costs that do not scale with account openings.
Clients are not spreading their value evenly across institutions; they are fragmenting activity but consolidating value into a primary account: the salary account, the balance that matters, the financial home. Other accounts become secondary or dormant, with little commercial value.
This structural squeeze cannot be reversed by better branding, faster apps, or more channels. There is only one way out: simplify, streamline, or exit.
11.2 What Every Bank Must Do to Survive
Survival will not be granted by persistence or marketing. It will be earned by fundamentally changing the business model.
Radically reduce governance and decision overhead. Layers of committees and approvals must be replaced by automated controls and empowered teams. Slow decision cycles are death in a world where client behaviour shifts in days, not years.
Drastically cut cost to serve. Branch networks, legacy platforms, and duplicated services are liabilities. Banks must automate operations, reduce support functions, and shrink cost structures to match the new economics.
Simplify and consolidate products. Clients don’t value fifteen savings products, four transactional tiers, and seven rewards models. They want clarity, predictability, and alignment with their financial lives.
Modernise technology stacks. Old cores wrapped with new interfaces are stopgaps, not solutions. Banks must adopt modular, API first systems that cut marginal costs, reduce risk, and improve reliability.
Reframe fees to reflect value. Clients expect free basic services. Fees will survive only where the value is clear: credit, trust, convenience, and outcomes, not transactions.
Prioritise fraud and risk capability. Fraud is not a peripheral cost; it is a core determinant of economics. Banks must invest in real time detection, AI assisted risk models, and client education, or face disproportionate losses.
Focus on primary relationships. A bank that is never a client’s financial home will eventually become irrelevant.
11.3 Understanding Bank Tiers: What Separates Tier 1 from Tier 2
Not all traditional banks are equally positioned to survive consolidation. The distinction between Tier 1 and Tier 2 traditional banks is not primarily about size or brand heritage. It is about structural readiness for the economics of abundance.
Tier 1 Traditional Banks are characterised by demonstrated digital execution capability, with modern(ish) technology stacks either deployed or credibly in progress. They have diversified revenue streams that reduce dependence on transactional fees, including strong positions in corporate banking, investment banking, or wealth management. Their cost structures, while still high, show evidence of active rationalisation. Most critically, they have proven ability to ship digital products at competitive speed and have successfully defended or grown primary account relationships in the mass market.
Tier 2 Traditional Banks remain more dependent on legacy infrastructure and have struggled to modernise core systems at pace. Their revenue mix is more exposed to transactional fee compression, and cost reduction efforts have often stalled in governance complexity. Technology execution tends to be slower, more project based, and more prone to delays. They rely heavily on consultants to tell them what to do and have a sprawling array of vendor products that are poorly integrated. Primary account share in the mass market has eroded more significantly, leaving them more reliant on existing relationship inertia than active client acquisition.
The distinction matters because Tier 1 banks have a viable path to competing directly for primary relationships in the new economics. Tier 2 banks face harder choices: accelerate transformation dramatically, accept a platform or specialist role, or risk becoming acquisition targets or zombie institutions.
11.4 Consolidation Readiness by Category
Below is a high level summary of institutional categories and what they must do to survive:
• Tier 1 Traditional Banks: consolidate product stacks, automate risk and operations, maintain digital execution pace, defend simplicity, scale risk and fraud capability, and deepen primary engagement. Effort required: Medium
• Digital Challengers: deepen primary engagement, invest heavily in fraud and lending capability, and improve unit economics. Effort required: Very High
• Insurer Led Banks: focus on profitable niches, leverage ecosystem integration, and accept an extended timeline to profitability. Effort required: High
• Specialist Lenders: narrow focus aggressively, partner for distribution and technology, and automate operations. Effort required: Medium-High
• Niche and SME Banks: stay niche, automate aggressively, and consider merger or specialisation. Effort required: High
• Sub Scale Banks: partner or merge to gain scale, and exit non-core activities. Effort required: Very High
• Mutual Banks: simplify or consolidate early, and consider cooperative mergers. Effort required: Very High
• Foreign Bank Branches: shrink the retail footprint, and focus on corporate and institutional services. Effort required: Medium
This readiness spectrum illustrates the real truth: institutions with scale, execution discipline, and structural simplicity have the best odds; those without these characteristics will be absorbed or eliminated.
11.5 The Pattern of Consolidation
Consolidation will not be uniform. The most likely sequence is:
First, sub scale and mutual banks exit or merge. They are unable to amortise fixed costs across enough primary relationships.
Second, digital challengers face the choice: invest heavily or be acquired. Rapid client acquisition without deep engagement or lending depth is not sustainable in an environment where fraud liability looms large and fee income is near zero.
Third, traditional banks consolidate capabilities, not brands. Large banks will more often absorb technology, licences, and teams than merge brand to brand. Duplication will be eliminated inside existing platforms.
Fourth, foreign banks retreat to niches. Global players will prioritise corporate and institutional services, not mass retail banking, in markets where local economics are unfavourable.
11.6 Winners and Losers
Likely Winners: Digital first banks with proven simplicity and low cost models. Tier 1 traditional banks with strong digital execution. Any institution that genuinely removes complexity rather than just managing it.
Likely Losers: Sub scale challengers without lending depth. Institutions that equate governance with safety. Banks that fail to dramatically cut cost and complexity. Any organisation that protects legacy revenue at the expense of future relevance.
12. Back to the Future
Banking has become the new corporate fidget spinner, grabbing the attention of relevance-starved corporates. Most don’t know why they want it or exactly what it is, but they know others have it, so it should be on the plan somewhere.
South African banking is no longer about who can build the most features or launch the most products. It is about cost discipline, trust under pressure, relentless simplicity, and scale that compounds rather than collapses.
The winners will not be the loudest innovators. They will be the quiet operators who make banking feel invisible, safe, and boring.
And in banking, boring done well is very hard to beat.
The consolidation outcome is not exotic. It is a return to a familiar pattern. We will likely end up back to the future, with a small number of dominant banks, which is exactly where we started.
The difference will be profound. Those dominant banks will be more client centric, with lower fees, lower fraud, better lending, and better, simpler client experiences.
The journey through abundance, with its explosion of choice, its vanity metrics of account openings, and its billions burned on client acquisition, will have served its purpose. It will have forced the industry to strip away complexity, invest in what actually matters, and compete on the only dimensions that clients genuinely value: trust, simplicity, and being there when things go wrong.
The market will consolidate not because regulators force it, but because economics demands it. South African banking is not being preserved. It is being reformed, by clients, by economics, and by the unavoidable logic of abundance.
Those who embrace the logic early will shape the future. Those who do not will watch it happen to them.
And when the dust settles, South African consumers will be better served by fewer, stronger institutions than they ever were by the fragmented abundance that preceded them.
12.1 Final Thought: The Danger of Fighting on Two Fronts
There is a deeper lesson embedded in the struggles of crossover players. Those that pour energy and resources into secondary, loss-making businesses typically do so by redirecting investment and operational focus from their primary business. This redirection is rarely neutral. It weakens the core.
Every rand allocated to the second front, every executive hour spent in strategy sessions, every technology resource committed to banking infrastructure is a rand, an hour, and a resource that cannot be deployed to defend and strengthen the primary businesses that actually generate profit today.
Growth into secondary businesses must be evaluated not just on their own merits, but in terms of how dominant and successful the company has been in its primary business. If you are not unquestionably dominant in your core market, if your primary business still faces existential competitive threats, if you have not achieved such overwhelming scale and efficiency that your position is effectively unassailable, then opening a second front is strategic suicide.
It is like opening another front in a war when the first front is not secured. You redirect troops, you split command attention, you divide logistics, and you leave your current positions weakened and vulnerable to counterattack. Your competitors in the primary business do not pause while you build the secondary one. They exploit the distraction.
Banks that will thrive are those that have already won their primary battle so decisively that expansion becomes an overflow of strength rather than a diversion of it. Capitec can expand into mobile networks because they have already dominated transactional banking. They are not splitting focus; they are leveraging surplus capacity.
Institutions that have not yet won their core market, that are still fighting for primary account relationships, that have not yet achieved the operational excellence and cost discipline required to survive in abundance, cannot afford the luxury of secondary ambitions.
The market will punish divided attention ruthlessly. And in South African banking, where fraud costs are exploding, margins are collapsing, and consolidation is inevitable, there is no forgiveness for strategic distraction.
The winners will be those who understood that dominance in one thing beats mediocrity in many. And they will inherit the market share of those who learned that lesson too late.
13. Author’s Note
This article synthesises public data, regulatory reports, industry analysis, and observed market behaviour. Conclusions are forward-looking and represent the author’s interpretation of structural trends rather than predictions of specific outcomes. The author is sharing opinion and is in no way claiming special insight or expertise in predicting the future.
For most of modern banking history, stability was assumed to increase with size. The thinking went: the bigger you are, the more you should care, and the more resources you can apply to problems. Larger banks had more capital, more infrastructure, and more people. In a pre-cloud world, this assumption appeared reasonable.
In practice, the opposite was often true.
Before cloud computing and elastic infrastructure, the larger a bank became, the more unstable it was under stress and the harder it was to maintain any kind of delivery cadence. Scale amplified fragility. In 2025, architecture (not size) has become the primary determinant of banking stability.
2. Scale, Fragility, and Quantum Entanglement
Traditional banking platforms were built on vertically scaled systems: mainframes, monolithic databases, and tightly coupled integration layers. These systems were engineered for control and predictability, not for elasticity or independent change.
As banks grew, they didn’t just add clients. They added products. Each new product introduced new dependencies, shared data models, synchronous calls, and operational assumptions. Over time, this created a state best described as quantum entanglement.
In this context, quantum entanglement refers to systems where:
• Products cannot change independently
• A change in one area unpredictably affects others
• The full impact of change only appears under real load
• Cause and effect are separated by time, traffic, and failure conditions
The larger the number of interdependent products, the more entangled the system becomes.
2.1 Why Entanglement Reduces Stability
As quantum entanglement increases, change becomes progressively riskier. Even small modifications require coordination across multiple teams and systems. Release cycles slow and defensive complexity increases.
Recovery also becomes harder. When something breaks, rolling back a single change is rarely sufficient because multiple products may already be in partially failed or inconsistent states.
Fault finding degrades as well. Logs, metrics, and alerts point in multiple directions. Symptoms appear far from root causes, forcing engineers to chase secondary effects rather than underlying faults.
Most importantly, blast radius expands. A fault in one product propagates through shared state and synchronous dependencies, impacting clients who weren’t using the originating product at all.
The paradox is that the very success of large banks (broad product portfolios) becomes a direct contributor to instability.
3. Why Scale Reduced Stability in the Pre-Cloud Era
Before cloud computing, capacity was finite, expensive, and slow to change. Systems scaled vertically, and failure domains were large by design.
As transaction volumes and product entanglement increased, capacity cliffs became unavoidable. Peak load failures became systemic rather than local. Recovery times lengthened and client impact widened.
Large institutions often appeared stable during normal operation but failed dramatically under stress. Smaller institutions appeared more stable largely because they had fewer entangled products and simpler operational surfaces (not because they were inherently better engineered).
Capitec itself experienced this first hand when its core banking SQL database hit a capacity cliff in August 2022. Recovering the service required close to 100 changes and roughly 40 hours of downtime. The wider service recovery took weeks, with missed payments and duplicate payments fixed on a case by case basis. It was at this point that Capitec’s leadership drew a line in the sand and decided to re-engineer the entire stack from the ground up in AWS. This blog post shares a few nuggets from that engineering journey, in the hope of helping others still struggling with the burden of scale and hardened synchronous pathways.
4. Cloud Changed the Equation (But Only When Architecture Changed)
Cloud computing made it possible to break entanglement, but only for organisations willing to redesign systems to exploit it.
Horizontal scaling, availability zone isolation, managed databases, and elastic compute allow products to exist as independent domains rather than tightly bound extensions of a central core.
Institutions that merely moved infrastructure to the cloud without breaking product entanglement continue to experience the same instability patterns (only on newer hardware).
5. An Architecture Designed to Avoid Entanglement
Capitec represents a deliberate rejection of quantum entanglement.
Its entire App production stack is cloud native on AWS: Kubernetes, Kafka and Postgres. The platform is well advanced in rolling out new Java 25 runtimes, alongside ahead of time (AOT) optimisation to further reduce scale latency, improve startup characteristics, and increase predictability under load. All Aurora Serverless clusters are set up with read replicas, offloading read pressure from write paths. All workloads are deployed across three availability zones, ensuring resilience. Database access is via the AWS JDBC wrapper, which enables extremely rapid failovers without waiting on DNS TTLs.
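For readers curious what that database wiring can look like, here is a minimal sketch of connecting through the AWS Advanced JDBC Wrapper to an Aurora PostgreSQL cluster. The endpoint, credentials, and plugin list are placeholders, this is not Capitec’s actual configuration, and property names should be checked against the wrapper’s documentation for your version:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Requires the aws-advanced-jdbc-wrapper and the PostgreSQL driver on the classpath.
public class AuroraFailoverExample {
    public static void main(String[] args) throws Exception {
        // The aws-wrapper URL prefix routes connections through the AWS Advanced JDBC Wrapper,
        // which tracks cluster topology itself instead of waiting on DNS TTLs during failover.
        // The hostname below is a placeholder.
        String url = "jdbc:aws-wrapper:postgresql://my-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com:5432/app";

        Properties props = new Properties();
        props.setProperty("user", "app_user");
        props.setProperty("password", System.getenv("DB_PASSWORD"));
        // Enable the failover plugin; plugin names vary by wrapper version, so verify against the docs.
        props.setProperty("wrapperPlugins", "failover");

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            rs.next();
            System.out.println("Connected, health check returned: " + rs.getInt(1));
        }
    }
}
```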
Crucially, products are isolated by design. There is no central product graph where everything depends on everything else. But, a word of caution: we are not there yet. We will always have edges that can hurt, and when you hit an edge at speed, it is sometimes hard to get back on your feet. Often the downtime you experienced simply results in pent up demand. Put another way, the volume that took your systems offline is now significantly LESS than the volume that is waiting for you once you recover. This means you somehow have to add capacity, or optimise code, during an outage in order to recover the service. You will often see the “Rate Limiting” fan club put a foot forward when I discuss burst recoverability. I personally don’t buy this for single entity services, for a complex set of reasons. For someone like AWS, it absolutely makes sense to carry the enormous complexity of guarding services with rate limits. But I don’t believe the same is true for a single entity ecosystem; in these instances, offloading is normally a purer pathway.
6. Write Guarding as a Stability Primitive
Capitec’s mobile and digital platforms employ a deliberate write guarding strategy.
Read only operations (such as logging into the app) are explicitly prevented from performing inline write operations. Activities like audit logging, telemetry capture, behavioural flags, and notification triggers are never executed synchronously on high volume read paths.
Instead, these concerns are offloaded asynchronously using Amazon MSK (Managed Streaming for Apache Kafka) or written to in memory data stores such as Valkey, where they can be processed later without impacting the user journey.
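A minimal sketch of the pattern follows (illustrative only, not Capitec’s code): the login path does its read only work, hands the audit event to a Kafka producer, and returns without waiting on any persistence. The topic name, payload shape, and helper methods are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Minimal sketch of write guarding: the login (read) path never writes inline;
// audit events are handed to Kafka asynchronously and processed later.
public class LoginService {

    private final KafkaProducer<String, String> producer;

    public LoginService(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "1"); // keeps the read path fast; tune durability to your own risk appetite
        this.producer = new KafkaProducer<>(props);
    }

    public boolean login(String clientId, String deviceId) {
        boolean authenticated = verifyCredentials(clientId, deviceId); // read only work

        // Fire and forget: the audit write happens off the critical path.
        String event = "{\"client\":\"" + clientId + "\",\"device\":\"" + deviceId
                + "\",\"ts\":" + System.currentTimeMillis() + "}";
        producer.send(new ProducerRecord<>("login-audit-events", clientId, event));

        return authenticated; // the user journey never waits on the audit store
    }

    private boolean verifyCredentials(String clientId, String deviceId) {
        return true; // placeholder for the real, read only authentication check
    }
}
```

The real engineering lives in the producer’s delivery guarantees and the downstream consumers; the point here is simply that the read path never blocks on them.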
This design completely removes read-write contention from critical paths. Authentication storms, balance checks, and session validation no longer compete with persistence workloads. Under load, read performance remains stable because it is not coupled to downstream write capacity.
Critically, write guarding prevents database maintenance pressure (such as vacuum activity) from leaking into high volume events like logins. Expensive background work remains isolated from customer facing read paths.
Write guarding turns one of the most common failure modes in large banking systems (read traffic triggering hidden writes) into a non event. Stability improves not by adding capacity, but by removing unnecessary coupling.
7. Virtual Threads as a Scalability Primitive
Java 25 introduces mature virtual threading as a first class concurrency model. This fundamentally changes how high concurrency systems behave under load.
Virtual threads decouple application concurrency from operating system threads. Instead of being constrained by a limited pool of heavyweight threads, services can handle hundreds of thousands of concurrent blocking operations without exhausting resources.
Request handling becomes simpler. Engineers can write straightforward blocking code without introducing thread pool starvation or complex asynchronous control flow.
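As a minimal illustration using the standard JDK API (not production code), the sketch below runs one hundred thousand blocking tasks, each on its own virtual thread:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

// Illustrative only: one virtual thread per task, plain blocking code, no reactive plumbing.
public class VirtualThreadDemo {
    public static void main(String[] args) throws Exception {
        List<Callable<String>> tasks = IntStream.range(0, 100_000)
                .<Callable<String>>mapToObj(i -> () -> {
                    Thread.sleep(100);           // stands in for a blocking downstream call
                    return "request-" + i;
                })
                .toList();

        // Each submitted task gets its own cheap virtual thread; 100k concurrent
        // blocking calls would exhaust a traditional platform thread pool.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            long started = System.currentTimeMillis();
            executor.invokeAll(tasks);
            System.out.println("Completed in " + (System.currentTimeMillis() - started) + " ms");
        }
    }
}
```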
Tail latency improves under load. When traffic spikes, virtual threads queue cheaply rather than collapsing the system through thread exhaustion.
Operationally, virtual threads align naturally with containerised, autoscaling environments. Concurrency scales with demand, not with preconfigured thread limits.
When combined with modern garbage collectors and ahead of time optimisation, virtual threading removes an entire class of concurrency related instability that plagued earlier JVM based banking platforms.
8. Nimbleness Emerges When Entanglement Disappears
When blast zones and integration choke points disappear, teams regain the ability to move quickly without increasing systemic risk.
Domains communicate through well defined RESTful interfaces, often across separate AWS accounts, enforcing isolation as a first class property. A failure in one domain does not cascade across the organisation.
To keep this operable at scale, Capitec uses Backstage (via an internal overlay called ODIN) as its internal orchestration and developer platform. All AWS accounts, services, pipelines, and operational assets are created to a common standard. Teams consume platform capability rather than inventing infrastructure.
This eliminates configuration drift, reduces cognitive load, and ensures that every new product inherits the same security, observability, and resilience characteristics.
The result is nimbleness without fragility.
9. Operational Stability Is Observability Plus Action
In entangled systems, failures are discovered by clients and stability is measured retrospectively.
Capitec operates differently. End to end observability through Instana and its in house AI platform, Neo, correlates client side errors, network faults, infrastructure signals, and transaction failures in real time. Issues are detected as they emerge, not after they cascade.
This operational awareness allows teams to intervene early, contain issues quickly, and reduce client impact before failures escalate.
Stability, in this model, is not the absence of failure. It is fast detection, rapid containment, and decisive response.
10. Fraud Prevention Without Creating New Entanglement
Fraud is treated as a first class stability concern rather than an external control.
Payments are evaluated inline as they move through the bank. Abnormal velocity, behavioural anomalies, and account provenance are assessed continuously. Even fraud reported in the call centre is immediately reflected in the risk signals seen by other clients paying from the Capitec App. Clients are presented with conscience pricking prompts for high risk payments; these frequently stop fraud, as clients abandon the payment when presented with the risks.
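A deliberately simplified sketch of what inline evaluation can look like is shown below. The signals, thresholds, and decision names are invented for illustration and are not the bank’s actual model:

```java
import java.time.Duration;
import java.time.Instant;

// Deliberately simplified sketch of inline payment risk evaluation.
// Signal names and thresholds are hypothetical, not the bank's actual model.
public class PaymentRiskCheck {

    enum Decision { ALLOW, PROMPT_CLIENT, BLOCK }

    record PaymentContext(double amount,
                          int paymentsLastHour,
                          boolean beneficiaryReportedAsFraud,
                          Instant beneficiaryAccountOpened,
                          boolean deviceFlaggedByMalwareEngine) {}

    static Decision evaluate(PaymentContext ctx) {
        // Hard stops: known bad signals block the payment outright.
        if (ctx.deviceFlaggedByMalwareEngine() || ctx.beneficiaryReportedAsFraud()) {
            return Decision.BLOCK;
        }

        int riskScore = 0;
        if (ctx.paymentsLastHour() > 5) riskScore += 2;                       // velocity anomaly
        if (ctx.amount() > 10_000) riskScore += 1;                            // unusually large payment
        if (Duration.between(ctx.beneficiaryAccountOpened(), Instant.now())
                .toDays() < 7) riskScore += 2;                                // newly opened beneficiary account

        // Medium risk triggers a "conscience pricking" prompt rather than a silent allow or block.
        return riskScore >= 3 ? Decision.PROMPT_CLIENT : Decision.ALLOW;
    }
}
```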
Capitec runs a real time malware detection engine directly on client devices. This engine detects hooks and overlays installed by malicious applications. When malware is identified, the client’s account is immediately stopped, preventing fraudulent transactions before they occur.
Because fraud controls are embedded directly into the transaction flow, they don’t introduce additional coupling or asynchronous failure modes.
The impact is measurable. Capitec’s fraud prevention systems have prevented R300 million in client losses from fraud. In November alone, these systems saved clients a further R60 million in fraud losses.
11. The Myth of Stability Through Multicloud
Multicloud is often presented as a stability strategy. In practice, it is largely a myth.
Running across multiple cloud providers does not remove failure risk. It compounds it. Cross cloud communication can typically only be secured using IP based controls, weakening security posture. Operational complexity increases sharply as teams must reason about heterogeneous platforms, tooling, failure modes, and networking behaviour.
Most critically, multicloud does not eliminate correlated failure. If either cloud provider becomes unavailable, systems are usually unusable anyway. The result is a doubled risk surface, increased operational risk, and new inter cloud network dependencies (without a corresponding reduction in outage impact).
Multicloud increases complexity, weakens controls, and expands risk surface area without delivering meaningful resilience.
12. What Actually Improves Stability
There are better options than multicloud.
Hybrid cloud with anti-affinity on critical channels is one. For example, card rails can be placed in two physically separate data centres so that if cloud based digital channels are unavailable, clients can still transact via cards and ATMs. This provides real functional resilience rather than architectural illusion.
Multi region deployment within a single cloud provider is another. This provides geographic fault isolation without introducing heterogeneous complexity. However, this only works if the provider avoids globally scoped services that introduce hidden single points of failure. At present, only AWS consistently supports this model. Some providers expose global services (such as global front doors) that introduce global blast radius and correlated failure risk.
True resilience requires isolation of failure domains, not duplication of platforms.
13. Why Traditional Banks Still Struggle
Traditional banks remain constrained by entangled product graphs, vertically scaled cores, synchronous integration models, and architectural decisions from a different era. As product portfolios grow, quantum entanglement increases. Change slows, recovery degrades, and outages become harder to diagnose and contain.
Modernisation programmes often increase entanglement temporarily through dual run architectures, making systems more fragile before they become more stable (if they ever do).
The challenge is not talent or ambition. It is the accumulated cost of entanglement.
14. Stability at Scale Without the Traditional Trade Off
Capitec’s significance is not that it is small. It is that it is large and remains stable.
Despite operating at massive scale with a broad product surface and high transaction volumes, stability improves rather than degrades. Scale does not increase blast radius, recovery time, or change risk. It increases parallelism, isolation, and resilience.
This directly contradicts historical banking patterns where growth inevitably led to fragility. Capitec demonstrates that with the right architecture, scale and stability are no longer opposing forces.
15. Final Thought
Before cloud and autoscaling, scale and stability were inversely related. The more products a bank had, the more entangled and fragile it became.
In 2025, that relationship can be reversed (but only by breaking entanglement, isolating failure domains, and avoiding complexity masquerading as resilience).
Doing a deal with a cloud provider means nothing if transformation stalls inside the organisation. If dozens of people carry the title of CIO while quietly pulling the handbrake on the change that is required, the outcome is inevitable regardless of vendor selection.
There is also a strategic question that many institutions avoid. If a bank is forced to choose between staying in a jurisdiction that is hostile to public cloud and accessing the full advantages of cloud, waiting is not a strategy. When that jurisdiction eventually allows public cloud, the market will already be populated by banks that moved earlier, built cloud native platforms, and are now entering at scale.
Capitec is an engineering led bank whose stability and speed increase with scale. Traditional banks remain constrained by quantum entanglement baked into architectures from a different era.
These outcomes are not accidental. They are the inevitable result of architectural and organisational choices made years ago, now playing out under real world load.
In networking, OSPF (Open Shortest Path First) is a routing protocol that ensures traffic flows along the shortest and lowest cost path through a network. It does not care about hierarchy, seniority, or intent. It routes based on capability, cost, and reliability.
Modern engineering organisations behave in exactly the same way, whether they realise it or not. Workloads move through teams, people, and processes, naturally finding the path that resolves uncertainty with the least friction. This article explores how OSPF maps directly onto human workload routing inside engineering driven organisations.
2. How Networks Route Work vs How Organisations Route Work
In a network, routers advertise their capabilities and current state. OSPF continuously recalculates the optimal path based on these signals.
In an efficient engineering organisation, teams do the same thing implicitly. Through their behaviour, outcomes, and delivery patterns, teams advertise whether they are high or low cost paths for work. Workloads should not stick to a single team and simply wait on resources because of ownership biases. Instead, workloads should route themselves toward the teams that require the fewest handoffs, the least clarification, and the lowest coordination overhead.
3. Understanding Cost Beyond Time and Effort
In OSPF, cost is not simply distance. It includes bandwidth, latency, reliability, and congestion.
In software delivery, cost includes change risk, testing overhead, coordination effort, context switching, review latency, and rework probability.
A team that looks fast on paper but requires excessive reviews, repeated testing cycles, and constant external validation is not a low cost path, even if their delivery metrics appear healthy.
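As a toy illustration, the same idea can be expressed as a cost function. The OSPF style link cost reflects the common reference-bandwidth convention, while the team weights are invented purely for this example:

```java
// Toy illustration only: an OSPF style cost for links, and an analogous
// "delivery cost" for teams. The team weights are invented for this example.
public class RoutingCost {

    // Classic OSPF style interface cost: reference bandwidth / link bandwidth
    // (many implementations default the reference to 100 Mbps).
    static int linkCost(double linkBandwidthMbps) {
        double referenceBandwidthMbps = 100.0;
        return Math.max(1, (int) Math.round(referenceBandwidthMbps / linkBandwidthMbps));
    }

    // Analogous team cost: not just "how fast", but how much friction a workload
    // accumulates on the way through (handoffs, review latency, rework risk).
    static double teamCost(int handoffs, double reviewLatencyDays, double reworkProbability) {
        return handoffs * 2.0 + reviewLatencyDays * 1.5 + reworkProbability * 10.0;
    }

    public static void main(String[] args) {
        System.out.println("10 Mbps link cost:  " + linkCost(10));      // 10
        System.out.println("100 Mbps link cost: " + linkCost(100));     // 1

        // A "fast" team with heavy review overhead can still be the expensive path.
        System.out.println("Team A cost: " + teamCost(1, 0.5, 0.05));   // 3.25
        System.out.println("Team B cost: " + teamCost(4, 3.0, 0.30));   // 15.5
    }
}
```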
4. The Hidden Cost of Micro Changes and the MTU Analogy
In networking, the Maximum Transmission Unit (MTU) defines the largest packet size that can traverse a network path without fragmentation. When packets exceed the MTU, they must be split into smaller fragments. Each fragment carries its own headers, sequencing, and failure risk. If a single fragment is lost, the entire packet must be retransmitted.
Micro changes in software delivery behave exactly like fragmented packets.
Every change, no matter how small, carries a fixed overhead. This includes build time, deployment effort, testing cycles, reviews, coordination, monitoring, and rollback planning. When work is artificially split into very small parcels, the organisation pays this overhead repeatedly, just as a network pays repeated headers and retransmissions for fragmented packets.
Fragmentation feels safer because each unit appears smaller and more controlled. In reality, fragmentation increases total risk. The probability of failure rises with the number of deployments. Testing effort multiplies. Cognitive load increases as engineers must reassemble intent across multiple changes rather than reason about a coherent whole.
Competent teams intuitively manage their effective MTU. They batch related changes when it reduces total risk. They expand the scope just enough to keep work below the fragmentation threshold while still remaining safe to deploy. This allows intent, testing, and validation to stay aligned.
Optimising delivery is therefore not about making changes as small as possible. It is about making them as large as safely possible without fragmentation. Teams that understand this reduce total system cost, improve reliability, and deliver outcomes that are easier to reason about and support over time.
The common misconception that smaller changes are always safer ignores this fixed overhead. Every deployment carries risk and every test cycle has a non trivial cost, so when work is broken into micro parcels those costs multiply. Competent teams understand this and naturally batch changes when appropriate, reducing the total number of times risk is introduced rather than optimising for superficial progress metrics.
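A toy model makes the overhead argument concrete. The hours and failure rates below are hypothetical assumptions, not measured data:

```java
// Toy model of change "fragmentation": total cost when one coherent piece of work
// is split into n deployments, each carrying its own fixed overhead and failure risk.
public class FragmentationCost {

    static double totalOverheadHours(int fragments, double fixedOverheadPerDeploy) {
        return fragments * fixedOverheadPerDeploy;   // overhead is paid once per deployment
    }

    static double probabilityOfAtLeastOneFailure(int fragments, double failureRatePerDeploy) {
        return 1.0 - Math.pow(1.0 - failureRatePerDeploy, fragments);
    }

    public static void main(String[] args) {
        double fixedOverhead = 3.0;   // hypothetical hours: build, test, review, deploy, monitor
        double failureRate = 0.02;    // hypothetical 2% chance any single deployment misbehaves

        for (int fragments : new int[] {1, 5, 20}) {
            System.out.printf("%2d deployments -> %5.1f overhead hours, %4.1f%% chance of an incident%n",
                    fragments,
                    totalOverheadHours(fragments, fixedOverhead),
                    100 * probabilityOfAtLeastOneFailure(fragments, failureRate));
        }
        // 1 -> 3.0 h, 2.0%;  5 -> 15.0 h, ~9.6%;  20 -> 60.0 h, ~33.2%
    }
}
```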
5. Competent Teams as the Shortest Path
In OSPF, traffic flows toward routers that forward packets efficiently and rarely drop traffic.
In organisations, work flows toward teams that terminate uncertainty quickly and safely. These teams do not just execute instructions. They resolve ambiguity, anticipate downstream impact, and reduce the need for future work.
They attract work not because they ask for it, but because the system optimises around them.
6. A Message to Junior Engineers: How to Become a Highly Desirable Execution Path
If you are early in your career, attracting meaningful and complex work is not about visibility, speed, or saying yes to everything. Work flows toward engineers who consistently reduce effort, risk, and uncertainty for the rest of the system. Becoming that engineer is a deliberate skill.
6.1 Build Trust Through Follow-Through
Nothing increases desirability faster than reliability. Finishing what you start, closing loops, and making outcomes explicit tells the organisation that routing work to you is safe. Partial delivery, vague status, or silent stalls increase perceived cost, even if the technical work is strong.
High-desirability engineers are predictable in the best possible way. People know that when work lands with you, it will progress to a clear outcome.
6.2 Make Ambiguity Smaller
Many workloads are not technically hard; they are poorly defined. Engineers who can take a vague request and turn it into a clear execution plan dramatically lower organisational cost.
This means asking clarifying questions early, documenting assumptions, and explicitly stating what will and will not be delivered. Turning uncertainty into clarity is one of the fastest ways to become a preferred execution path.
6.3 Learn the System, Not Just the Code
Junior engineers often focus narrowly on the component they are assigned. High-growth engineers invest in understanding how data, requests, failures, and deployments move through the entire system.
Knowing where data comes from, who consumes it, how failures propagate, and how changes are rolled back makes you safer to route work to. System understanding compounds faster than any single technical skill.
6.4 Reduce Hand-Offs Proactively
Every hand-off is a routing hop. Engineers who can take work further end-to-end lower total delivery cost.
This does not mean doing everything yourself. It means anticipating what the next team will need, providing clean interfaces, clear documentation, and well-tested changes that reduce downstream effort.
6.5 Surface Trade-Offs, Not Just Solutions
High-value engineers do not present a single answer. They explain options, risks, and costs in simple terms. Even junior engineers can do this.
When you articulate trade-offs clearly, decision makers trust you with more complex work because you reduce the cognitive burden of decision making itself.
6.6 Take Ownership of Quality, Not Just Completion
Finishing a task is not the same as completing a workload. Quality includes operability, observability, testability, and clarity.
Engineers who think about how their change will be supported at three in the morning quickly become trusted paths for critical work. Supportability is a strong desirability signal.
6.7 Invest in Communication as an Engineering Skill
Clear written updates, concise explanations, and honest status reporting are not soft skills. They are routing signals.
Engineers who communicate well reduce the need for meetings, escalation, and oversight. This makes them cheaper to route work to, regardless of seniority.
6.8 Be Curious Beyond Your Boundary
The fastest way to grow is to follow problems across team and system boundaries. Ask why a dependency behaves the way it does. Learn what happens after your code runs.
Curiosity expands your problem-solving surface area and accelerates your transition from task executor to system engineer.
6.9 Optimise for the Organisation, Not the Ticket
The most desirable engineers think beyond the immediate task. They consider whether a change creates future work, reduces it, or merely shifts it elsewhere.
When people see that routing work to you improves the organisation rather than just closing tickets, you naturally become a preferred execution path.
6.10 Desirability Is Earned Through Reduced Friction
Ultimately, workloads flow toward engineers who make life easier for everyone else. By reducing ambiguity, risk, hand-offs, and future cost, even junior engineers can become the shortest path to execution.
In human workload routing, competence, clarity, and intent matter far more than seniority or title.
If you are early in your career, attracting meaningful work is not about appearing busy. It is about reducing friction for everyone around you. It is about having skills deep and broad enough to tackle workloads without hand-offs. It is about pushing knowledge boundaries, challenging the status quo, and making yourself an obvious choice to include in conversations through your knowledge and ability to get things done.
7. Going Beyond Requirements
Requirements define minimum constraints, not the full problem space. Engineers who limit themselves to ticking boxes behave like routers that blindly forward packets without understanding the network.
High value engineers ask what else will break, what future change this enables, and what hidden complexity already exists in the area they are touching.
7.1 Expanding the Problem Boundary Responsibly
Expansive problem solving does not mean gold plating. It means recognising when adjacent issues can be safely resolved while the system is already open.
By solving nearby problems at the same time, engineers reduce future change cost and increase the return on every intervention.
7.2 Reducing Future Work
The most valuable engineers remove entire classes of future tickets. They simplify mental models, clarify ownership, and leave systems easier to change than they found them.
They optimise for the long term cost curve, not the next status update.
8. Why Robotic Delivery Models Fail
Some organisations treat teams as input output machines, managed through narrow SLAs and superficial progress indicators.
This approach produces local optimisation and global inefficiency. Work may appear to move, but value leaks through rework, fragility, and accumulated complexity.
This is the organisational equivalent of forcing traffic through a congested network path because it looks short on a diagram.
9. Execution Pathways Matter More Than Velocity
High performing teams debate execution strategy, not just timelines. They consciously choose when to refactor, when to batch changes, and when to absorb risk versus defer it.
This is human workload routing in action. The goal is not raw speed. The goal is the lowest total system cost over time.
10. Self Regulating Teams vs Managed Bottlenecks
The strongest teams self regulate delivery risk, testing effort, investment in non functional improvements, and release timing. They do not require constant external control to make safe decisions.
Like OSPF, they adapt dynamically to changing conditions and advertise their true state through outcomes rather than promises.
11. The Cost of Indulgence
Indulgence in engineering teams rarely looks like laziness. More often, it presents as helpfulness, flexibility, or responsiveness. Teams accept work that should be declined, reshaped, or challenged, and they pivot repeatedly in the name of being accommodating. While well intentioned, this behaviour carries a high and often hidden cost.
An indulgent team says yes to workloads that are expensive, poorly framed, or actively harmful to client outcomes. They accept complexity without questioning its origin. They implement features that satisfy a request but degrade system clarity, performance, or safety. In doing so, they optimise for short term harmony rather than long term value.
Another form of indulgence is constant pivoting. Teams abandon partially completed work to chase the next urgent request, leaving behind half built solutions, unreconciled design decisions, and accumulated technical debt. Nothing truly completes, and the organisation pays repeatedly for context switching, relearning, and revalidation.
Examples of indulgent behaviour include implementing bespoke logic for a single client when a systemic solution is required, layering exceptions instead of fixing a broken abstraction, accepting unrealistic deadlines that force unsafe shortcuts, or continuing work on initiatives that no longer have a clear outcome simply because they were already started.
These behaviours increase total system cost. They inflate testing effort, complicate support, and create fragile software that is harder to reason about. Most critically, they lead to anti client outcomes where apparent progress masks declining quality and trust.
Avoiding indulgence requires tension. Productive, respectful tension between execution teams and those requesting work is essential. Tense conversations clarify intent, surface hidden costs, and force prioritisation. They allow teams to reshape workloads into something that is achievable, valuable, and safe.
High performing teams do not confuse compliance with effectiveness. They are willing to say no, not as an act of resistance, but as an act of stewardship. By declining or reframing indulgent work, they protect the organisation from inefficiency and ensure that every execution pathway is intentional rather than reactive.
12. Why Complex Support Always Routes to the Same Engineers
In every engineering organisation, complex support issues have a way of finding the same people. These are the incidents that span multiple systems, lack clear ownership, and resist simple fixes. This is not accidental. It is workload routing in action.
When a problem is poorly understood, the organisation implicitly looks for the shortest path to understanding, not just resolution. Engineers who take the time to deeply understand systems, data flows, and failure modes advertise themselves as low cost paths for uncertainty. Over time, the routing becomes automatic.
These engineers share common traits. They are curious rather than defensive. They ignore artificial boundaries between teams, technologies, and responsibilities. They ask how the system actually behaves instead of how it is documented. They are willing to trace a problem across layers, even when it falls outside their formal scope.
By doing this repeatedly, they accumulate something far more valuable than narrow expertise. They develop system intuition. Each complex support issue expands their mental model of how the organisation’s technology really works. This compounding knowledge makes them faster, calmer, and more effective under pressure.
As a result, they become critical to the organisation. Not because they hoard knowledge, but because the system naturally routes its hardest problems to those who can resolve them with the least friction. With every incident, their skills sharpen further, reinforcing the routing decision.
This is why the best engineers are almost always the ones who leaned into complex support early in their careers. They treated ambiguity as a learning opportunity, not an inconvenience. In doing so, they became the shortest path not just to fixing problems, but to understanding the system itself.
13. Organisations Route Work Whether You Design for It or Not
Every organisation has an invisible routing protocol. Work will always find the path of least resistance and lowest cognitive load.
You can fight this reality with process and control, or you can design for it by building competent teams that reduce total organisational cost.
In both networks and organisations, traffic flows toward reliability, not authority.
This is (hopefully) a short blog that will give you back a small piece of your life…
In technology, we rightly spend hours poring over failure so that we can understand it, fix it, and avoid it in the future. This seems a reasonable approach: learn from your mistakes, understand the failure, plan your remediation, and so on. But is it possible that there are some instances where doing this is inappropriate? To answer this simple question, let me give you an analogy…
You decide that you want to travel from London to New York. Sounds reasonable so far…. But you decide you want to go by car! The reasoning for this is as follows:
Cars are “tried and tested”.
We have an existing deal with multiple car suppliers and we get great discounts.
The key decision maker is a car enthusiast.
The incumbent team understand cars and can support this choice.
Cars are what we have available right now and we want to start execution tomorrow, so let's just make it work.
You first try a small hatchback and only manage to get around 3m off the coast of Scotland. Next up, you figure you will get a more durable car, so you get a truck – but sadly this only makes 2m of headway from the beach. You report back to the team and they send you a brand new Porsche; this time you give yourself an even bigger run up at the sea and manage a whopping 4m before the car sinks. The team now analyse all the data to figure out why each car sank and what they can do to make this better. The team continue to experiment with various cars and progress is observed over time. After 6 months the team has managed to travel 12m towards their goal of driving to New York. The main reason for the progress is that the sunken cars are starting to form a land bridge. The leadership has now spent over 200 million USD on this venture and don't feel they can pivot, so they start to brainstorm how to make this work.
Maybe wind the windows up a little tighter, maybe the cars need more underseal, maybe over inflate the tyres or maybe we simply need way more cars? All of these may or may not make a difference. But here’s the challenge: you made a bad engineering choice and anything you do will just be a variant of bad. It will never be good and you cannot win with your choice.
The above obviously sounds a bit daft (and it is), but the point is that I am often called in after downtime to review an architecture, find a root cause and suggest remediation. What is not always understood is that bad technology choices can be about as likely to succeed as driving from London to New York. Sometimes you simply need to look at alternatives; you need a boat or a plane. The product architecture can be terminal: it won't ever be what you want it to be, and no amount of analysis or spend will change this. The trick is to accept the brutal reality of your situation and move your focus towards choosing the technology you need to transition to. Next, try to figure out how quickly you can make this pivot…