Darwinian Architecture Philosophy

How Domain Isolation Creates Evolutionary Pressure for Better Software

After two decades building trading platforms and banking systems, I’ve watched the same pattern repeat itself countless times. A production incident occurs. The war room fills. And then the finger pointing begins.

“It’s the database team’s problem.” “No, it’s that batch job from payments.” “Actually, I think it’s the new release from the cards team.” Three weeks later, you might have an answer. Or you might just have a temporary workaround and a room full of people who’ve learned to blame each other more effectively.

This is the tragedy of the commons playing out in enterprise technology, and it’s killing your ability to evolve.

1. The Shared Infrastructure Trap

Traditional enterprise architecture loves shared infrastructure. It makes intuitive sense: why would you run fifteen database clusters when one big one will do? Why have each team manage their own message broker when a central platform team can run one for everybody? Economies of scale. Centralised expertise. Lower costs.

Except that’s not what actually happens.

What happens is that your shared Oracle RAC cluster becomes a battleground. The trading desk needs low-latency queries. The batch processing team needs to run massive overnight jobs. The reporting team needs to scan entire tables. Everyone has legitimate needs, and everyone’s needs conflict with everyone else’s. The DBA team becomes a bottleneck, fielding requests from twelve different product owners, all of whom believe their work is the priority.

When the CPU spikes to 100% at 2pm on a Tuesday, the incident call has fifteen people on it, and nobody knows whose query caused it. The monitoring shows increased load, but the load comes from everywhere. Everyone claims their release was tested. Everyone points at someone else.

This isn’t a technical problem. It’s an accountability problem. And you cannot solve accountability problems with better monitoring dashboards.

2. Darwinian Pressure in Software Systems

Nature solved this problem billions of years ago. Organisms that make poor decisions suffer the consequences directly. There’s no committee meeting to discuss why the antelope got eaten. The feedback loop is immediate and unambiguous. Whilst nobody wants to watch it happen, teams secretly take comfort in not being the limping buffalo at the back of the herd. Teams get fit; they resist decisions that would put them in an unsafe position, because they know they will receive an uncomfortable amount of focus from senior management.

Modern software architecture can learn from this. When you isolate domains, truly isolate them, with their own data stores, their own compute, their own failure boundaries, you create Darwinian pressure. Teams that write inefficient code see their own costs rise. Teams that deploy buggy releases see their own services degrade. Teams that don’t invest in resilience suffer their own outages.

There’s no hiding. There’s no ambiguity. There’s no three-week investigation to determine fault. There is no watered-down document that hints at the issue but never quite calls it out, because the teams couldn’t agree on anything more pointed. The feedback loop tightens from weeks to hours, sometimes minutes.

This isn’t about blame. It’s about learning. When the consequences of your decisions land squarely on your own service, you learn faster. You care more. You invest in the right things because you directly experience the cost of not investing.

3. The Architecture of Isolation

Achieving genuine domain isolation requires more than just drawing boxes on a whiteboard and calling them “microservices.” It requires rethinking how domains interact with each other and with their data.

Data Localisation Through Replication

The hardest shift for most organisations is accepting that data duplication isn’t a sin. In a shared database world, we’re taught that the single source of truth is sacred. Duplicate data creates consistency problems. Normalisation is good.

But in a distributed world, the shared database is the coupling that prevents isolation. If three domains query the same customer table, they’re coupled. An index change that helps one domain might destroy another’s performance. A schema migration requires coordinating across teams. The tragedy of the commons returns.

Instead, each domain should own its data. If another domain needs that data, replicate it. Event-driven patterns work well here: when a customer’s address changes, publish an event. Subscribing domains update their local copies. Yes, there’s eventual consistency. Yes, the data might be milliseconds or seconds stale. But in exchange, each domain can optimise its own data structures for its own access patterns, make schema changes without coordinating with half the organisation, and scale its data tier independently.
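
To make the pattern concrete, here’s a minimal sketch of the publishing side in Python, using the confluent_kafka client. The broker address, topic name and event fields are illustrative assumptions rather than a prescribed contract; the consuming side, which maintains the local copy, is sketched in the Kafka section further down.

```python
# Minimal sketch: publish a customer-address-changed event so that
# subscribing domains can update their own local copies.
# Assumes a reachable Kafka broker and the confluent_kafka client;
# topic name and event fields are illustrative, not a prescribed contract.
import json
import time
import uuid

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_address_changed(customer_id, new_address):
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": "customer.address.changed",
        "occurred_at": time.time(),
        "customer_id": customer_id,
        "address": new_address,
    }
    # Key by customer_id so all events for one customer land on the same
    # partition and stay ordered.
    producer.produce(
        "customer-events",
        key=customer_id,
        value=json.dumps(event).encode("utf-8"),
    )
    producer.flush()

publish_address_changed("C-1042", {"line1": "1 Lombard Street", "city": "London"})
```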

Queues as Circuit Breakers

Synchronous service-to-service calls are the other hidden coupling that defeats isolation. When the channel service calls the fraud service, and the fraud service calls the customer service, you’ve created a distributed monolith. A failure anywhere propagates everywhere. An outage in customer data brings down payments.

Asynchronous messaging changes this dynamic entirely. When a payment needs fraud checking, it drops a message on a queue. If the fraud service is slow or down, the queue absorbs the backlog. The payment service doesn’t fail, it just sees increased latency on fraud decisions. Customers might wait a few extra seconds for approval rather than seeing an error page.

This doesn’t make the fraud service’s problems disappear. The fraud team still needs to fix their outage, but you can make business choices about how to handle it in the meantime. For example, you can choose to bypass the checks for payments to “known” beneficiaries or for payments below a certain threshold value, so the blast radius is contained and can be managed. The payments team’s SLAs aren’t destroyed by someone else’s incident. The Darwinian pressure lands where it belongs: on the team whose service is struggling.
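
Here’s a minimal sketch of what that looks like from the payments side, in Python. A queue.Queue stands in for the real message broker, and the threshold and “known beneficiary” rules are illustrative business choices, not recommendations.

```python
# Minimal sketch of the payments-side fallback policy when fraud decisions
# are delayed. queue.Queue stands in for the real message broker; the
# threshold and "known beneficiary" rule are illustrative business choices.
import queue

fraud_requests = queue.Queue()   # outbound fraud-check requests
fraud_decisions = queue.Queue()  # inbound decisions from the fraud domain

AUTO_APPROVE_LIMIT = 250.00      # illustrative threshold

def submit_payment(payment, known_beneficiaries, sla_seconds=2.0):
    # Always ask the fraud domain; the queue absorbs any backlog if it's slow.
    fraud_requests.put({"payment_id": payment["id"], "amount": payment["amount"]})
    try:
        decision = fraud_decisions.get(timeout=sla_seconds)
        return "approved" if decision["ok"] else "declined"
    except queue.Empty:
        # Fraud domain is slow or down: contain the blast radius with a
        # business rule instead of failing the payment outright.
        if payment["beneficiary"] in known_beneficiaries or payment["amount"] <= AUTO_APPROVE_LIMIT:
            return "approved-pending-review"
        return "queued-for-manual-review"

print(submit_payment({"id": "P-1", "amount": 120.00, "beneficiary": "ACME"}, {"ACME"}))
```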

Proxy Layers for Graceful Degradation

Not everything can be asynchronous. Sometimes you need a real time answer. But even synchronous dependencies can be isolated through intelligent proxy layers.

A well-designed proxy can cache responses, serve stale data during outages, fall back to default behaviours, and implement circuit breakers that fail fast rather than hanging. When the downstream service returns, the proxy heals automatically.

The key insight is that the proxy belongs to the calling domain, not the called domain. The payments team decides how to handle fraud service failures. Maybe they approve transactions under a certain threshold automatically. Maybe they queue high value transactions for manual review. The fraud team doesn’t need to know or care, they just need to get their service healthy again.
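
A minimal sketch of such a proxy, owned by the payments domain, might look like this in Python. The fetch_fraud_score callable, the thresholds and the default score are all hypothetical; the point is that the degradation policy lives entirely in the caller’s codebase.

```python
# Minimal sketch of a calling-domain proxy: cache last-known-good responses,
# fail fast once the downstream looks unhealthy, and fall back to a default
# the payments team has chosen. fetch_fraud_score is a hypothetical
# downstream call; thresholds are illustrative.
import time

class FraudServiceProxy:
    def __init__(self, fetch_fraud_score, failure_threshold=3, retry_after_seconds=30.0):
        self._fetch = fetch_fraud_score
        self._cache = {}                 # customer_id -> (score, fetched_at)
        self._failures = 0
        self._failure_threshold = failure_threshold
        self._retry_after = retry_after_seconds
        self._opened_at = None           # timestamp when the circuit opened

    def _circuit_open(self):
        if self._opened_at is None:
            return False
        if time.time() - self._opened_at >= self._retry_after:
            self._opened_at = None       # half-open: allow a trial call
            self._failures = 0
            return False
        return True

    def score(self, customer_id, default_score=0.5):
        if not self._circuit_open():
            try:
                score = self._fetch(customer_id)
                self._cache[customer_id] = (score, time.time())
                self._failures = 0
                return score
            except Exception:
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = time.time()
        # Degrade gracefully: serve stale data if we have it, otherwise the
        # calling domain's chosen default.
        if customer_id in self._cache:
            return self._cache[customer_id][0]
        return default_score
```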

4. Escaping the Monolith: Strategies for Service Eviction

Understanding the destination is one thing. Knowing how to get there from where you are is another entirely. Most enterprises aren’t starting with a blank slate. They’re staring at a decade-old shared Oracle database with three hundred stored procedures, an enterprise service bus that routes traffic for forty applications, and a monolithic core banking system that everyone is terrified to touch.

The good news is that you don’t need to rebuild everything from scratch. The better news is that you can create structural incentives that make migration inevitable rather than optional.

Service Eviction: Making the Old World Uncomfortable

Service eviction is the deliberate practice of making shared infrastructure progressively less attractive to use while making domain-isolated alternatives progressively more attractive. This isn’t about being obstructive. It’s about aligning incentives with architecture.

Start with change management. On shared infrastructure, every change requires coordination. You need a CAB ticket. You need sign-off from every consuming team. You need a four-week lead time and a rollback plan approved by someone three levels up. The change window is 2am Sunday, and if anything goes wrong, you’re in a war room with fifteen other teams.

On domain-isolated services, changes are the team’s own business. They deploy when they’re ready. They roll back if they need to. Nobody else is affected because nobody else shares their infrastructure. The contrast becomes visceral: painful, bureaucratic change processes on shared services versus autonomous, rapid iteration on isolated ones.

This isn’t artificial friction. It’s honest friction. Shared infrastructure genuinely does require more coordination because changes genuinely do affect more people. You’re just making the hidden costs visible and letting teams experience them directly.

Data Localisation Through Kafka: Breaking the Database Coupling

The shared database is usually the hardest dependency to break. Everyone queries it. Everyone depends on its schema. Moving data feels impossibly risky.

Kafka changes the game by enabling data localisation without requiring big-bang migrations. The pattern works like this: identify a domain that wants autonomy. Have the source system publish events to Kafka whenever relevant data changes. Have the target domain consume those events and maintain its own local copy of the data it needs.

Initially, this looks like unnecessary duplication. The data exists in Oracle and in the domain’s local store. But that duplication is exactly what enables isolation. The domain can now evolve its schema independently. It can optimise its indexes for its access patterns. It can scale its data tier without affecting anyone else. And critically, it can be tested and deployed without coordinating database changes with twelve other teams.

Kafka’s log-based architecture makes this particularly powerful. New consumers can replay history to bootstrap their local state. The event stream becomes the source of truth for what changed and when. Individual domains derive their local views from that stream, each optimised for their specific needs.

The key insight is that you’re not migrating data. You’re replicating it through events until the domain no longer needs to query the shared database directly. Once every query can be served from local data, the coupling is broken. The shared database becomes a publisher of events rather than a shared resource everyone depends on.
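
Here’s the consuming half of the earlier publishing sketch: a domain maintaining its own local view from the event stream, again using the confluent_kafka client. The topic, consumer group and field names are illustrative assumptions, and setting auto.offset.reset to “earliest” is what lets a brand-new consumer group replay the retained history to bootstrap its state, assuming the topic’s retention or compaction keeps that history available.

```python
# Minimal sketch of the consuming side: a domain builds and maintains its own
# local copy of customer data by applying events from Kafka.
# Assumes a reachable broker and the confluent_kafka client; topic, group and
# field names are illustrative.
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-customer-view",
    "auto.offset.reset": "earliest",   # replay retained history on first run
})
consumer.subscribe(["customer-events"])

local_customers = {}  # customer_id -> the domain's own, locally optimised record

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        if event.get("event_type") == "customer.address.changed":
            record = local_customers.setdefault(event["customer_id"], {})
            record["address"] = event["address"]
        # In a real service the record would be written to the domain's own
        # data store rather than held in memory.
finally:
    consumer.close()
```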

The Strangler Fig: Gradual Replacement Without Risk

The strangler fig pattern, named after the tropical tree that gradually envelops and replaces its host, is the safest approach to extracting functionality from monoliths. Rather than replacing large systems wholesale, you intercept specific functions at the boundary and gradually route traffic to new implementations.

Put a proxy in front of the monolith. Initially, it routes everything through unchanged. Then, one function at a time, build the replacement in the target domain. Route traffic for that function to the new service while everything else continues to hit the monolith. When the new service is proven, remove the old code from the monolith.

The beauty of this approach is that failure is localised and reversible. If the new service has issues, flip the routing back. The monolith is still there, still working. You haven’t burned any bridges. You can take the time to get it right because you’re not under pressure from a hard cutover deadline.

Combined with Kafka-based data localisation, the strangler pattern becomes even more powerful. The new domain service consumes events to build its local state, the proxy routes relevant traffic to it, and the old monolith gradually loses responsibilities until what remains is small enough to either rewrite completely or simply turn off.
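
For illustration, here is a minimal sketch of the routing decision at the heart of a strangler proxy, in Python. In practice this table usually lives in an API gateway or reverse proxy rather than application code; the paths and backend URLs are assumptions, and flipping a route back to the monolith is the rollback plan.

```python
# Minimal sketch of a strangler-proxy routing table: traffic for migrated
# functions goes to the new domain services, everything else still hits the
# monolith. Paths and URLs are illustrative assumptions.
MONOLITH = "https://corebank.internal"

ROUTES = {
    # path prefix              -> owning backend
    "/payments/fraud-check":    "https://fraud.payments.internal",   # migrated
    "/payments/standing-order": "https://orders.payments.internal",  # migrated
    # everything else falls through to the monolith below
}

def resolve_backend(path):
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend
    return MONOLITH

assert resolve_backend("/payments/fraud-check/123").startswith("https://fraud.")
assert resolve_backend("/accounts/balance") == MONOLITH
```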

Asymmetric Change Management: The Hidden Accelerator

This is the strategy that sounds controversial but works remarkably well: make change management deliberately asymmetric between shared services and domain-isolated services.

On the shared database or monolith, changes require extensive governance. Four-week CAB cycles. Impact assessments signed off by every consuming team. Mandatory production support during changes. Post-implementation reviews. Change freezes around month-end, quarter-end, and peak trading periods.

On domain-isolated services, teams own their deployment pipeline end-to-end. They can deploy multiple times per day if their automation supports it. No CAB tickets. No external sign-offs. If they break their own service, they fix their own service.

This asymmetry isn’t punitive. It reflects genuine risk. Changes to shared infrastructure genuinely do have broader blast radius. They genuinely do require more coordination. You’re simply making the cost of that coordination visible rather than hiding it in endless meetings and implicit dependencies.

The effect is predictable. Teams that want to move fast migrate to domain isolation. Teams that are comfortable with quarterly releases can stay on shared infrastructure. Over time, the ambitious teams have extracted their most critical functionality into isolated domains. What remains on shared infrastructure is genuinely stable, rarely-changing functionality that doesn’t need rapid iteration.

The natural equilibrium is that shared infrastructure becomes genuinely shared: common utilities, reference data, things that change slowly and benefit from centralisation. Everything else migrates to where it can evolve independently.

The Migration Playbook

Put it together and the playbook looks like this:

First, establish Kafka as your enterprise event backbone. Every system of record publishes events when data changes. This is table stakes for everything else.

Second, identify a domain with high change velocity that’s suffering under shared infrastructure governance. They’re your early adopter. Help them establish their own data store, consuming events from Kafka to maintain local state.

Third, put a strangler proxy in front of relevant monolith functions. Route traffic to the new domain service. Prove it works. Remove the old implementation.

Fourth, give the domain team autonomous deployment capability. Let them experience the difference between deploying through a four-week CAB cycle versus deploying whenever they’re ready.

Fifth, publicise the success. Other teams will notice. They’ll start asking for the same thing. Now you have demand-driven migration rather than architecture-mandated migration.

The key is that you’re not forcing anyone to migrate. You’re creating conditions where migration is obviously attractive. The teams that care about velocity self-select. The shared infrastructure naturally shrinks to genuinely shared concerns.

5. The Cultural Shift

Architecture is easy compared to culture. You can draw domain boundaries in a week. Convincing people to live within them takes years.

The shared infrastructure model creates a particular kind of learned helplessness. When everything is everyone’s problem, nothing is anyone’s problem. Teams optimise for deflecting blame rather than improving reliability. Political skills matter more than engineering skills. The best career move is often to avoid owning anything that might fail.

Domain isolation flips this dynamic. Teams own their outcomes completely. There’s nowhere to hide, but there’s also genuine autonomy. You can choose your own technology stack. You can release when you’re ready without coordinating with twelve other teams. You can invest in reliability knowing that you’ll reap the benefits directly.

This autonomy attracts a different kind of engineer. People who want to own things. People who take pride in uptime and performance. People who’d rather fix problems than explain why problems aren’t their fault.

The teams that thrive under this model are the ones that learn fastest. They build observability into everything because they need to understand their own systems. They invest in automated testing because they can’t blame someone else when their deploys go wrong. They design for failure because they know they’ll be the ones getting paged.

The teams that don’t adapt… well, that’s the Darwinian part. Their services become known as unreliable. Other teams design around them. Eventually, the organisation notices that some teams consistently deliver and others consistently struggle. The feedback becomes impossible to ignore.

6. The Transition Path

You can’t flip a switch and move from shared infrastructure to domain isolation overnight. The dependencies are too deep. The skills don’t exist. The organisational structures don’t support it.

But you can start. Pick a domain that’s struggling with the current model, probably one that’s constantly blamed for incidents they didn’t cause. Give them their own database, their own compute, their own deployment pipeline. Build the event publishing infrastructure so they can share data with other domains through replication rather than direct queries.

Watch what happens. The team will stumble initially. They’ve never had to think about database sizing or query optimisation because that was always someone else’s job. But within a few months, they’ll own it. They’ll understand their system in a way they never did before. Their incident response will get faster because there’s no ambiguity about whose system is broken.

More importantly, other teams will notice. They’ll see a team that deploys whenever they want, that doesn’t get dragged into incident calls for problems they didn’t cause, that actually controls their own destiny. They’ll start asking for the same thing.

This is how architectural change actually happens, not through mandates from enterprise architecture, but through demonstrated success that creates demand.

7. The Economics Question

I can already hear the objections. “This is more expensive. We’ll have fifteen databases instead of one. Fifteen engineering teams managing infrastructure instead of one platform team.”

To which I’d say: you’re already paying these costs, you’re just hiding them.

Every hour spent in an incident call where twelve teams try to figure out whose code caused the database to spike is a cost. Every delayed release because you’re waiting for a shared schema migration is a cost. Every workaround another team implements because your shared service doesn’t quite meet their needs is a cost. Every engineer who leaves because they’re tired of fighting political battles instead of building software is a cost.

Domain isolation makes these costs visible and allocates them to the teams that incur them. That visibility is uncomfortable, but it’s also the prerequisite for improvement.

And yes, you’ll run more database clusters. But they’ll be right-sized for their workloads. You won’t be paying for headroom that exists only because you can’t predict which team will spike load next. You won’t be over-provisioning because the shared platform has to handle everyone’s worst case simultaneously.

8. Evolution, Not Design

The deepest insight from evolutionary biology is that complex, well-adapted systems don’t emerge from top-down design. They emerge from the accumulation of countless small improvements, each one tested against reality, with failures eliminated and successes preserved.

Enterprise architecture traditionally works the opposite way. Architects design systems from above. Teams implement those designs. Feedback loops are slow and filtered through layers of abstraction. By the time the architecture proves unsuitable, it’s too deeply embedded to change.

Domain isolation enables architectural evolution. Each team can experiment within their boundary. Good patterns spread as other teams observe and adopt them. Bad patterns get contained and eventually eliminated. The overall system improves through distributed learning rather than centralised planning.

This doesn’t mean architects become irrelevant. Someone needs to define the contracts between domains, design the event schemas, establish the standards for how services discover and communicate with each other. But the architect’s role shifts from designing systems to designing the conditions under which good systems can emerge.
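
As one example of what those contracts can look like, here is a minimal sketch of a versioned event envelope an architect might standardise across domains. The field names and the frozen-dataclass representation are illustrative assumptions; the substance is a stable identity, an explicit schema version and a clearly identified owning producer.

```python
# Minimal sketch of a cross-domain event contract: a versioned envelope with
# a stable identity plus a domain-specific payload. Field names are
# illustrative assumptions, not a prescribed standard.
import time
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EventEnvelope:
    event_type: str        # e.g. "customer.address.changed"
    schema_version: int    # bumped on breaking payload changes
    producer: str          # owning domain, for traceability
    payload: dict          # domain-specific body
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: float = field(default_factory=time.time)

evt = EventEnvelope(
    event_type="customer.address.changed",
    schema_version=2,
    producer="customer-domain",
    payload={"customer_id": "C-1042", "address": {"city": "London"}},
)
```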

9. The End State

I’ve seen organisations make this transition. It takes years, not months. It requires sustained leadership commitment. It forces difficult conversations about team structure and accountability.

But the end state is remarkable. Incident calls have three people on them instead of thirty. Root cause is established in minutes instead of weeks. Teams ship daily instead of quarterly. Engineers actually enjoy their work because they’re building things instead of attending meetings about who broke what.

The shared infrastructure isn’t completely gone; some things genuinely benefit from centralisation. But it’s the exception rather than the rule. And crucially, the teams that use shared infrastructure do so by choice, understanding the trade-offs, rather than by mandate.

The tragedy of the commons is solved not by better governance of the commons, but by eliminating the commons. Give teams genuine ownership. Let them succeed or fail on their own merits. Trust that the Darwinian pressure will drive improvement faster than any amount of central planning ever could.

Nature figured this out a long time ago. It’s time enterprise architecture caught up.
