Figure 1: Traditional DR Exercise vs Real World Outage
Disaster recovery is one of the most comforting practices in enterprise technology and one of the least honest. Organisations spend significant time and money designing DR strategies, running carefully choreographed exercises, producing polished post exercise reports, and reassuring themselves that they are prepared for major outages. The problem is not intent. The problem is that most DR exercises are optimised to demonstrate control and preparedness in artificial conditions, while real failures are chaotic, asymmetric and hostile to planning. When outages occur under real load, the assumptions underpinning these exercises fail almost immediately.
What most organisations call disaster recovery is closer to rehearsal than resilience. It tests whether people can follow a script, whether environments can be brought online when nothing else is going wrong, and whether senior stakeholders can be reassured. It does not test whether systems can survive reality.
1. DR Exercises Validate Planning Discipline, Not Failure Behaviour
Traditional DR exercises are run like projects. They are planned well in advance, aligned to change freezes, coordinated across teams, and executed when everyone knows exactly what is supposed to happen. This alone invalidates most of the conclusions drawn from them. Real outages are not announced, they do not arrive at convenient times, and they rarely fail cleanly. They emerge as partial failures, ambiguous symptoms and cascading side effects. Alerts contradict each other, dashboards lag reality, and engineers are forced to reason under pressure with incomplete information.
A recovery strategy that depends on precise sequencing, complete information and the availability of specific individuals is fragile by definition. The more a DR exercise depends on human coordination to succeed, the less likely it is to work when humans are stressed, unavailable or wrong. Resilience is not something that can be planned into existence through documentation. It is an emergent property of systems that behave safely when things go wrong without requiring perfect execution.
2. Recovery Is Almost Always Tested in the Absence of Load
Figure 2: Recovery Under Load With and Without Chaos Testing
The single most damaging flaw in DR testing is that it is almost always performed when systems are idle. Queues are empty, clients are disconnected, traffic is suppressed, and downstream systems are healthy. This creates a deeply misleading picture of recoverability. In real outages, load does not disappear. It concentrates. Clients retry, SDKs back off and then retry again, load balancers redistribute traffic aggressively, queues accumulate messages faster than they can be drained, and databases slow down at precisely the moment demand spikes.
Back pressure is the defining characteristic of real recovery scenarios, and it is almost entirely absent from DR exercises. A system that starts cleanly with no load may never become healthy when forced to recover while saturated. Recovery logic that looks correct in isolation frequently collapses when subjected to retry storms and backlog replays. Testing recovery without load is equivalent to testing a fire escape in an empty building and declaring it safe.
3. Recovery Commonly Triggers the Second Outage
DR plans tend to assume orderly reconnection. Services are expected to come back online, accept traffic gradually, and stabilise. Reality delivers the opposite. When systems reappear, clients reconnect simultaneously, message brokers attempt to drain entire backlogs at once, caches stampede databases, authentication systems spike, and internal rate limits are exceeded by internal callers rather than external users.
This thundering herd effect means that recovery itself often becomes the second outage, frequently worse than the first. Systems may technically be up while remaining unusable because they are overwhelmed the moment they re-enter service. DR exercises rarely expose this behaviour because load is deliberately suppressed, leading organisations to confuse clean startup with safe recovery.
4. Why Real World DR Testing Is So Hard
The uncomfortable truth is that most organisations avoid real world DR testing not because they are lazy or incompetent, but because the technology they run makes realistic testing commercially irrational.
In traditional enterprise estates a genuine failover is not a minor operational event. A large SQL Server estate or a mainframe environment routinely takes well over an hour to fail over cleanly, and that is assuming everything behaves exactly as designed. During that window queues back up, batch windows are missed, downstream systems time out, and customers feel the impact immediately. Pulling the pin on a system like this during peak volumes is not a test, it is a deliberate business outage. No executive will approve that, and nor should they.
This creates an inevitable compromise. DR tests are scheduled during low load periods, often weekends or nights, precisely when the system behaves best. The back pressure that exists during real trading hours is absent. Cache warm up effects are invisible. Connection storms never happen. Latent data consistency problems remain hidden. The test passes, confidence is reported upward, and nothing meaningful has actually been proven.
The core issue is not testing discipline, it is recovery time characteristics. If your recovery time objective is measured in hours, then every real test carries a material business risk. As a result, organisations rationally choose theater over truth.
Change the technology and the equation changes completely. Platforms like Aurora Serverless fundamentally alter the cost of failure. A failover becomes an operational blip measured in seconds rather than an existential event measured in hours. Endpoints are reattached, capacity is rehydrated automatically, and traffic resumes quickly enough that controlled testing becomes possible even with real workloads. Once confidence is built at lower volumes, the same mechanism can be exercised progressively closer to peak without taking the business hostage.
This is the key distinction most DR conversations miss. You cannot meaningfully test DR if the act of testing is itself catastrophic. Modern architectures that fail fast and recover fast are not just operationally elegant, they are the only ones that make honest DR validation feasible. Everything else optimises for paperwork, not resilience.
5. Availability Is Tested While Correctness Is Ignored
Most DR exercises optimise for availability signals rather than correctness. They focus on whether systems start, endpoints respond and dashboards turn green, while ignoring whether the system is still right. Modern architectures are asynchronous, distributed and event driven. Outages cut through workflows mid execution. Transactions may be partially applied, events may be published but never consumed, compensating actions may not run, and side effects may occur without corresponding state changes.
DR testing almost never validates whether business invariants still hold after recovery. It rarely checks for duplicated actions, missing compensations or widened consistency windows. Availability without correctness is not resilience. It is simply data corruption delivered faster.
6. Idempotency Is Assumed Rather Than Proven
Many systems claim idempotency at an architectural level, but real implementations are usually only partially idempotent. Idempotency keys are often scoped incorrectly, deduplication windows expire too quickly, global uniqueness is not enforced, and side effects are not adequately guarded. External integrations frequently replay blindly, amplifying the problem.
Outages expose these weaknesses because retries occur across multiple layers simultaneously. Messages are delivered more than once, requests are replayed long after original context has been lost, and systems are forced to process duplicates at scale. DR exercises rarely test this behaviour under load. They validate that systems start, not that they behave safely when flooded with replays. Idempotency that only works in steady state is not idempotency. It is an assumption waiting to fail.
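To make the failure mode concrete, here is a minimal sketch of the kind of guard that has to survive replay storms. The IdempotencyStore interface and PaymentHandler names are hypothetical; the point is that the key is scoped to the business operation and the side effect is guarded, not that any particular storage technology is used.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical dedup store; in production this would be backed by a database
// or cache that enforces global uniqueness with an explicit expiry window.
interface IdempotencyStore {
    // Returns true only for the first caller to claim this key within the window.
    boolean putIfAbsent(String key, Duration window);
    Optional<String> storedResult(String key);
    void storeResult(String key, String result);
}

final class PaymentHandler {
    private final IdempotencyStore store;

    PaymentHandler(IdempotencyStore store) {
        this.store = store;
    }

    // The key is scoped to the business operation, not the transport message,
    // so replays across queues, retries and reconnects collapse to one effect.
    String handle(String paymentId, String accountId, long amountCents) {
        String key = "payment:" + accountId + ":" + paymentId;

        if (!store.putIfAbsent(key, Duration.ofDays(7))) {
            // Duplicate: return the original outcome instead of re-running side effects.
            return store.storedResult(key).orElse("IN_PROGRESS");
        }
        String result = executeTransfer(accountId, amountCents); // side effect runs once
        store.storeResult(key, result);
        return result;
    }

    private String executeTransfer(String accountId, long amountCents) {
        return "OK";
    }
}
```

Keys tied to transport-level message identifiers rather than the business operation are exactly the partial idempotency described above: they expire with the deduplication window and fail the moment a replay arrives over a different channel.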
7. DNS and Replication Lag Are Treated as Minor Details
DNS based failover is a common component of DR strategies because it looks clean and simple on diagrams. In practice it is unreliable and unpredictable. TTLs are not respected uniformly, client side caches persist far longer than expected, mobile networks are extremely sticky, corporate resolvers behave inconsistently, and CDN propagation is neither instantaneous nor symmetrical.
During real incidents, traffic often arrives from both old and new locations for extended periods. Systems must tolerate split traffic and asymmetric routing rather than assuming clean cutover. DR exercises that expect DNS to behave deterministically are rehearsing a scenario that almost never occurs in production.
8. Hidden Coupling Between Domains Undermines Recovery
Most large scale recovery failures are not caused by the system being recovered, but by something it depends on. Shared authentication services, centralised configuration systems, common message brokers, logging pipelines and global rate limits quietly undermine isolation. During DR exercises these couplings remain invisible because everything is brought up together in a controlled order. In real outages, dependencies fail independently, partially and out of sequence.
True resilience requires domain isolation with explicitly bounded blast radius. If recovery of one system depends on the health of multiple others, none of which are isolated, then recovery is fragile regardless of how well rehearsed it is.
9. Human Factors Are Removed From the Equation
DR exercises assume ideal human conditions. The right people are available, everyone knows it is a test, stress levels are low, and communication is structured and calm. Real incidents are defined by the opposite conditions. People are tired, unavailable or already overloaded, context is missing, and decisions are made under extreme cognitive load.
Systems that require heroics to recover are not resilient. They are brittle. Good systems assume humans will be late, distracted and wrong, and still recover safely.
10. DR Is Designed for Audit Cycles, Not Continuous Failure
Most DR programs exist to satisfy auditors, regulators and risk committees rather than to survive reality. This leads to annual exercises, static runbooks, binary success metrics and a complete absence of continuous feedback. Meanwhile production systems change daily.
A DR plan that is not continuously exercised against live systems is obsolete by default. The confidence it provides far exceeds its accuracy.
11. Chaos Testing Is the Only Honest Substitute
Real resilience is built by failing systems while they are doing real work. That means killing instances under load, partitioning networks unpredictably, breaking dependencies intentionally, injecting latency and observing the blast radius honestly. Chaos testing exposes retry amplification, back pressure collapse, hidden coupling and unsafe assumptions that scripted DR exercises systematically hide.
It is uncomfortable and politically difficult, but it is the only approach that resembles reality.
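As a sketch of what "failing systems while they are doing real work" can look like in code, the wrapper below injects latency and random failures around a dependency call. The class and parameters are illustrative and not tied to any specific chaos tooling; the value comes from running it against live traffic and observing how retries, timeouts and back pressure actually behave.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Minimal fault-injection wrapper: add latency to every call and fail a
// configurable fraction outright, so recovery behaviour is observed under real work.
final class ChaosProxy<T> {
    private final Callable<T> target;
    private final double failureRate;     // e.g. 0.05 = 5% of calls fail
    private final long maxExtraLatencyMs; // upper bound on injected delay

    ChaosProxy(Callable<T> target, double failureRate, long maxExtraLatencyMs) {
        this.target = target;
        this.failureRate = failureRate;
        this.maxExtraLatencyMs = maxExtraLatencyMs;
    }

    T call() throws Exception {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        // Injected latency surfaces timeout handling and back pressure collapse.
        Thread.sleep(rnd.nextLong(maxExtraLatencyMs + 1));
        if (rnd.nextDouble() < failureRate) {
            throw new RuntimeException("chaos: injected dependency failure");
        }
        return target.call();
    }
}
```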
12. What Systems Should Actually Be Proven To Do
A meaningful resilience strategy does not ask whether systems can be recovered quietly. It proves, continuously, that systems can recover under sustained load, tolerate duplication safely, remain isolated from unrelated domains, degrade gracefully, preserve business invariants and recover with minimal human coordination even when failure timing and scope are unpredictable.
Anything less is optimism masquerading as engineering.
13. DR Exercises Provide Reassurance, Not Resilience
Traditional DR exercises make organisations feel prepared without exposing uncomfortable truths. They work only when the system is quiet, the people are calm and the plan is followed perfectly. Reality offers none of these conditions.
If your recovery strategy only works in ideal circumstances, it is not a strategy. It is theater.
Real time mobile chat represents one of the most demanding challenges in distributed systems architecture. Unlike web applications where connections are relatively stable, mobile clients constantly transition between networks, experience variable latency, and must conserve battery while maintaining instant message delivery. This post examines the architectural decisions behind building mobile chat at massive scale, the problems each technology solves, and the tradeoffs involved in choosing between alternatives.
1. Understanding the Mobile Chat Problem
Before evaluating solutions, architects must understand precisely what makes mobile chat fundamentally different from other distributed systems challenges.
1.1 The Connection State Paradox
Traditional stateless architectures achieve scale through horizontal scaling of identical, interchangeable nodes. Load balancers distribute requests randomly because any node can handle any request. State lives in databases, and the application tier remains stateless.
Chat demolishes this model. When User A sends a message to User B, the system must know which server holds User B’s connection. This isn’t a database lookup; it’s a routing decision that must happen for every message, in milliseconds, with perfect consistency across your entire cluster.
At 100,000 concurrent connections, you might manage with a centralised routing table in Redis. Query Redis for User B’s server, forward the message, done. At 10 million connections, that centralised lookup becomes the bottleneck. Every message requires a Redis round trip. Redis clustering helps but doesn’t eliminate the fundamental serialisation point.
The deeper problem is consistency. User B might disconnect and reconnect to a different server. Your routing table is now stale. With mobile users reconnecting constantly due to network transitions, your routing information is perpetually outdated. Eventually consistent routing means occasionally lost messages, which users notice immediately.
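For illustration, the naive centralised approach looks something like the sketch below, using the Jedis client purely as an example. Every delivery pays a network round trip, and the mapping silently goes stale the moment the user reconnects to a different server.

```java
import redis.clients.jedis.Jedis;

// Naive centralised routing: every message delivery pays a Redis round trip,
// and the mapping goes stale the moment the user reconnects elsewhere.
final class RoutingTable {
    private final Jedis redis = new Jedis("redis.internal", 6379);

    void register(String userId, String serverId) {
        // Overwrites any previous location; a racing disconnect can clobber this.
        redis.set("route:" + userId, serverId);
    }

    String lookup(String userId) {
        return redis.get("route:" + userId); // one network hop per message
    }
}
```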
1.2 The Idle Connection Problem
Mobile usage patterns create a unique resource challenge. Users open chat apps, exchange a few messages, then switch to other apps. The connection often remains open in the background for push notifications and presence updates. At scale, you might have 10 million “connected” users where only 500,000 are actively messaging at any moment.
Your architecture must provision resources for 10 million connections but only needs throughput capacity for 500,000 active users. Traditional thread per connection models collapse here. Ten million OS threads is impossible; the context switching alone would consume all CPU. But you need instant response when any of those 10 million connections becomes active.
This asymmetry between connection count and activity level is fundamental to mobile chat and drives many architectural decisions.
1.3 Network Instability as the Norm
Mobile networks are hostile environments. Users walk through buildings, ride elevators, transition from WiFi to cellular, pass through coverage gaps. A user walking from their office to a coffee shop might experience dozens of network transitions in fifteen minutes.
Each transition is a potential message loss event. The TCP connection over WiFi terminates when the device switches to cellular. Messages queued for delivery on the old connection are lost unless your architecture explicitly handles reconnection and replay.
Desktop web chat can treat disconnection as exceptional. Mobile chat must treat disconnection as continuous background noise. Reconnection isn’t error recovery; it’s normal operation.
1.4 Battery, Backgrounding, and the Wakeup Problem
Every network operation consumes battery. Maintaining a persistent connection keeps the radio active, draining battery faster than almost any other operation. The mobile radio state machine makes this worse: transitioning from idle to active takes hundreds of milliseconds and significant power. Frequent small transmissions prevent deep sleep, causing battery drain disproportionate to data transferred.
But the real architectural complexity emerges when users background your app.
1.4.1 What Happens When Apps Are Backgrounded
iOS and Android aggressively manage background applications to preserve battery and system resources. When a user switches away from your chat app:
iOS Behaviour: Apps receive approximately 10 seconds of background execution time before suspension. After suspension, no code executes, no network connections are maintained, no timers fire. The app is frozen in memory. iOS will terminate suspended apps entirely under memory pressure without notification.
Android Behaviour: Android is slightly more permissive but increasingly restrictive with each version. Background execution limits (introduced in Android 8) prevent apps from running background services freely. Doze mode (Android 6+) defers network access and background work when the device is stationary and screen off. App Standby Buckets (Android 9+) restrict background activity based on how recently the user engaged with the app.
In both cases, your carefully maintained SSE connection dies when the app backgrounds. The server sees a disconnect. Messages arrive but have nowhere to go.
1.4.2 Architectural Choices for Background Message Delivery
You have three fundamental approaches when clients are backgrounded:
Option 1: Push Notification Relay
When the server detects the SSE connection has closed, buffer incoming messages and send push notifications (APNs for iOS, FCM for Android) to wake the device and alert the user.
Advantages: Works within platform constraints. Users receive notifications even with app completely terminated. No special permissions or background modes required.
Disadvantages: Push notifications are not guaranteed delivery. APNs and FCM are best effort services that may delay or drop notifications under load. You cannot stream message content through push; you notify and wait for the user to open the app. The user experience degrades from real time chat to notification driven interaction.
Architectural implications: Your server must detect connection loss quickly (aggressive keepalive timeouts), maintain per user message buffers, integrate with APNs and FCM, and handle the complexity of notification payload limits (4KB for APNs, varying for FCM).
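A rough sketch of this relay path is shown below. The PushSender interface and buffer shapes are hypothetical; a real implementation sits in front of APNs and FCM clients, enforces payload limits, and bounds the buffers.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical push gateway; a real implementation would wrap APNs and FCM clients.
interface PushSender {
    void notifyNewMessage(String userId, String preview);
}

// Sketch of the relay path: when no live stream exists for a user, buffer the
// message server side and fire a (best effort) push notification instead.
final class MessageRelay {
    private final Map<String, Queue<String>> offlineBuffers = new ConcurrentHashMap<>();
    private final Map<String, Boolean> connected = new ConcurrentHashMap<>();
    private final PushSender push;

    MessageRelay(PushSender push) {
        this.push = push;
    }

    void onConnectionClosed(String userId) {
        connected.put(userId, false);
    }

    void deliver(String userId, String message) {
        if (Boolean.TRUE.equals(connected.get(userId))) {
            streamOverSse(userId, message);
        } else {
            offlineBuffers.computeIfAbsent(userId, id -> new ConcurrentLinkedQueue<>()).add(message);
            push.notifyNewMessage(userId, message); // payload limits apply; send a preview only
        }
    }

    List<String> drainOnReconnect(String userId) {
        connected.put(userId, true);
        Queue<String> queued = offlineBuffers.remove(userId);
        return queued == null ? List.of() : List.copyOf(queued);
    }

    private void streamOverSse(String userId, String message) { /* elided */ }
}
```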
Option 2: Background Fetch and Silent Push
Use platform background fetch capabilities to periodically wake your app and check for new messages. Silent push notifications can trigger background fetches on demand.
iOS provides Background App Refresh, which wakes your app periodically (system determined intervals, typically 15 minutes to hours depending on user engagement patterns). Silent push notifications can wake the app for approximately 30 seconds of background execution.
Android provides WorkManager for deferrable background work and high priority FCM messages that can wake the app briefly.
Advantages: Better message freshness than pure notification relay. Can sync recent messages before user opens app, improving perceived responsiveness.
Disadvantages: Timing is not guaranteed; the system determines when background fetch runs. Silent push has strict limits (iOS limits rate and will throttle abusive apps). Background execution time is severely limited; you cannot maintain a persistent connection. Users who disable Background App Refresh get degraded experience.
Architectural implications: Your sync protocol must be efficient, fetching only delta updates within the brief execution window. Server must support efficient “messages since timestamp X” queries. Consider message batching to maximise value of each background wake.
Option 3: Persistent Connection via Platform APIs
Both platforms offer APIs for maintaining network connections in background, but with significant constraints.
iOS VoIP Push: Originally designed for VoIP apps, this mechanism maintains a persistent connection and wakes the app instantly for incoming calls. However, Apple now requires apps using VoIP push to actually provide VoIP calling functionality. Apps abusing VoIP push for chat have been rejected from the App Store.
iOS Background Modes: The “remote-notification” background mode combined with PushKit allows some connection maintenance, but Apple reviews usage carefully. Pure chat apps without calling features will likely be rejected.
Android Foreground Services: Apps can run foreground services that maintain connections, but must display a persistent notification to the user. This is appropriate for actively ongoing activities (music playback, navigation) but feels intrusive for chat apps. Users may disable or uninstall apps with unwanted persistent notifications.
Advantages: True real time message delivery even when backgrounded. Best possible user experience.
Disadvantages: Platform restrictions make this unavailable for most pure chat apps. Foreground service notifications annoy users. Increased battery consumption may lead users to uninstall.
Architectural implications: Only viable if your app genuinely provides VoIP or other qualifying functionality. Otherwise, design assuming connections terminate on background.
1.4.3 The Pragmatic Hybrid Architecture
Most successful chat apps use a hybrid approach:
Foreground: Maintain SSE connection for real time message streaming. Aggressive delivery with minimal latency.
Recently Backgrounded (first few minutes): The connection may persist briefly. Deliver messages normally until disconnect detected.
Backgrounded: Switch to push notification model. Buffer messages server side. Send push notification for new messages. Optionally use silent push to trigger background sync of recent messages.
App Terminated: Pure push notification relay. User sees notification, opens app, app reconnects and syncs all missed messages.
Return to Foreground: Immediately re-establish the SSE connection. Sync any messages missed during the background period using Last-Event-ID resume. Return to real time streaming.
This hybrid approach accepts platform constraints rather than fighting them. Real time delivery when possible, reliable notification when not.
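One way to express this policy is a small decision function over app lifecycle state, sketched below with illustrative names and an arbitrary grace period.

```java
// Delivery mode chosen per user as platform lifecycle signals arrive.
enum DeliveryMode { SSE_STREAM, SSE_GRACE_PERIOD, PUSH_NOTIFY, PUSH_ONLY_TERMINATED }

final class DeliveryPolicy {
    // How long the existing stream is trusted after a background signal before
    // switching to push-only delivery; purely an illustrative value.
    private static final long GRACE_MILLIS = 2 * 60 * 1000;

    DeliveryMode modeFor(boolean appInForeground, boolean appTerminated,
                         long millisSinceBackgrounded, boolean sseStillOpen) {
        if (appTerminated) return DeliveryMode.PUSH_ONLY_TERMINATED;
        if (appInForeground) return DeliveryMode.SSE_STREAM;
        if (sseStillOpen && millisSinceBackgrounded < GRACE_MILLIS) {
            return DeliveryMode.SSE_GRACE_PERIOD;
        }
        return DeliveryMode.PUSH_NOTIFY;
    }
}
```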
1.4.4 Server Side Implications
The hybrid model requires server architecture to support:
Connection State Tracking: Detect when SSE connections close. Distinguish between network hiccup (will reconnect shortly) and true backgrounding (switch to push mode).
Per User Message Buffers: Store messages for offline users. Size buffers appropriately; users backgrounded for days may have thousands of messages.
Push Integration: Maintain connections to APNs and FCM. Handle token refresh, feedback service (invalid tokens), and retry logic.
Efficient Sync Protocol: Support “give me everything since message ID X” queries efficiently. Index appropriately for this access pattern.
Delivery Tracking: Track which messages were delivered via SSE versus require push notification versus awaiting sync on app open. Avoid duplicate notifications.
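A minimal sketch of the per user buffer behind the sync protocol and delta queries might look like the following. A production system would back it with indexed storage rather than memory, but the access pattern, ordered by a monotonically increasing message ID, is the same.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Per-user buffer supporting the "everything since message X" sync query used
// by background fetch and by reconnect after the app is opened.
final class UserMessageLog {
    private final NavigableMap<Long, String> byId = new ConcurrentSkipListMap<>();

    void append(long messageId, String payload) {
        byId.put(messageId, payload);
    }

    // Delta sync: all payloads with an ID strictly greater than the last one seen.
    List<String> since(long lastSeenId) {
        return List.copyOf(byId.tailMap(lastSeenId, false).values());
    }
}
```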
1.5 Message Ordering and Delivery Guarantees
Users expect messages to arrive in send order. When Alice sends “Are you free?” followed by “for dinner tonight?”, they must arrive in that order or the conversation becomes nonsensical. Network variability means packets arrive out of order constantly. Your application layer must reorder correctly.
Additionally, mobile chat requires “at least once” delivery with deduplication. Users expect messages to arrive even if they were offline when sent. But retransmission on reconnection must not create duplicates. This requires message identifiers, delivery tracking, and idempotent processing throughout your pipeline.
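On the receiving side, deduplication and reordering can be as simple as the sketch below, assuming a per conversation monotonically increasing sequence number assigned by the server.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Client-side sketch: suppress duplicates created by retransmission after
// reconnects and release messages strictly in per-conversation sequence order.
final class InOrderInbox {
    private final Set<Long> seen = new HashSet<>();
    private final PriorityQueue<Long> pending = new PriorityQueue<>();
    private long nextExpectedSeq = 1;

    // Feed every delivered sequence number in; get back the ones now safe to show.
    List<Long> accept(long seq) {
        List<Long> ready = new ArrayList<>();
        if (!seen.add(seq)) {
            return ready; // duplicate: this message was already processed
        }
        pending.add(seq);
        while (!pending.isEmpty() && pending.peek() == nextExpectedSeq) {
            ready.add(pending.poll());
            nextExpectedSeq++;
        }
        return ready;
    }
}
```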
2. Why Apache Pekko Solves These Problems
Apache Pekko provides the distributed systems primitives that address mobile chat’s fundamental challenges. Understanding why requires examining what Pekko actually provides and how it maps to chat requirements.
2.1 The Licensing Context: Why Pekko Over Akka
Akka pioneered the actor model on the JVM and proved it at scale across thousands of production deployments. In 2022, Lightbend changed Akka’s licence from Apache 2.0 to the Business Source License, requiring commercial licences for production use above certain thresholds.
Apache Pekko emerged as a community fork maintaining API compatibility with Akka 2.6.x under Apache 2.0 licensing. For architects evaluating new projects, Pekko provides the same battle tested primitives without licensing concerns or vendor dependency.
The codebase is mature, inheriting over a decade of Akka’s production hardening. The community is active and includes many former Akka contributors. For new distributed systems projects on the JVM, Pekko is the clear choice.
2.2 The Actor Model: Right Abstraction for Connection State
The actor model treats computation as isolated entities exchanging messages. Each actor has private state, processes messages sequentially, and communicates only through asynchronous message passing. No shared memory, no locks, no synchronisation primitives.
This maps perfectly onto chat connections:
One Actor Per Connection: Each mobile connection becomes an actor. The actor holds connection state: user identity, device information, subscription preferences, message buffers. When messages arrive for that user, they route to the actor. When the connection terminates, the actor stops and releases resources.
Extremely Lightweight: Actors are not threads. A single JVM hosts millions of actors, each consuming only a few hundred bytes when idle. This matches mobile’s reality: millions of mostly idle connections, each requiring instant activation when a message arrives.
Natural Fault Isolation: A misbehaving connection cannot crash the server. Actors fail independently. Supervisor hierarchies determine recovery strategy. One client sending malformed data affects only its actor, not the millions of other connections on that node.
Sequential Processing Eliminates Concurrency Bugs: Each actor processes one message at a time. Connection state updates are inherently serialised. You don’t need locks, atomic operations, or careful reasoning about race conditions. The actor model eliminates entire categories of bugs that plague traditional concurrent connection handling.
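A connection actor in the Pekko typed Java DSL might look like the sketch below. The message protocol and fields are illustrative, but the shape is the point: private state, one message at a time, stop on disconnect.

```java
import org.apache.pekko.actor.typed.Behavior;
import org.apache.pekko.actor.typed.javadsl.AbstractBehavior;
import org.apache.pekko.actor.typed.javadsl.ActorContext;
import org.apache.pekko.actor.typed.javadsl.Behaviors;
import org.apache.pekko.actor.typed.javadsl.Receive;

// One actor per device connection: private state, messages processed one at a
// time, no locks. Message types and fields here are illustrative only.
public class ConnectionActor extends AbstractBehavior<ConnectionActor.Command> {

    public interface Command {}
    public record Deliver(String messageId, String body) implements Command {}
    public record ConnectionClosed() implements Command {}

    public static Behavior<Command> create(String userId) {
        return Behaviors.setup(ctx -> new ConnectionActor(ctx, userId));
    }

    private final String userId;
    private int undeliveredCount = 0; // private state, mutated only by this actor

    private ConnectionActor(ActorContext<Command> ctx, String userId) {
        super(ctx);
        this.userId = userId;
    }

    @Override
    public Receive<Command> createReceive() {
        return newReceiveBuilder()
                .onMessage(Deliver.class, this::onDeliver)
                .onMessage(ConnectionClosed.class, msg -> Behaviors.stopped())
                .build();
    }

    private Behavior<Command> onDeliver(Deliver msg) {
        undeliveredCount++;
        getContext().getLog().debug("queueing {} for {}", msg.messageId(), userId);
        // push msg.body() onto this connection's SSE stream here
        return this;
    }
}
```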
2.3 Cluster Sharding: Eliminating the Routing Bottleneck
Cluster sharding is Pekko’s solution to the connection routing problem. Rather than maintaining an explicit routing table, you define a sharding strategy based on entity identity. Pekko handles physical routing transparently.
When sending a message to User B, you address it to User B’s logical entity identifier. You don’t know or care which physical node hosts User B. Pekko’s sharding layer determines the correct node and routes the message. If User B isn’t currently active, the shard can activate an actor for them on demand.
The architectural significance is profound:
No Centralised Routing Table: There’s no Redis cluster to query for every message. Routing is computed from the entity identifier using consistent hashing. The computation is local; no network round trip required.
Automatic Rebalancing: When nodes join or leave the cluster, shards rebalance automatically. Application code is unchanged. A user might reconnect to a different physical node after a network transition, but message delivery continues because routing is by logical identity, not physical location.
Elastic Scaling: Add nodes to increase capacity. Remove nodes during low traffic. The sharding layer handles redistribution without application involvement. This is true elasticity, not the sticky session pseudo scaling that WebSocket architectures often require.
Location Transparency: Services sending messages don’t know cluster topology. They address logical entities. This decouples message producers from the physical deployment, enabling independent scaling of different cluster regions.
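Building on the connection actor sketch above, initialising sharded entities and routing by logical user ID looks roughly like this with the Pekko Cluster Sharding Java DSL; the entity and type key names are illustrative.

```java
import org.apache.pekko.actor.typed.ActorSystem;
import org.apache.pekko.cluster.sharding.typed.javadsl.ClusterSharding;
import org.apache.pekko.cluster.sharding.typed.javadsl.Entity;
import org.apache.pekko.cluster.sharding.typed.javadsl.EntityRef;
import org.apache.pekko.cluster.sharding.typed.javadsl.EntityTypeKey;

// Sketch of sharded routing by logical user ID; ConnectionActor is the
// hypothetical behaviour from the earlier sketch.
final class ChatSharding {
    static final EntityTypeKey<ConnectionActor.Command> USER_TYPE_KEY =
            EntityTypeKey.create(ConnectionActor.Command.class, "User");

    static void init(ActorSystem<?> system) {
        ClusterSharding.get(system).init(
                Entity.of(USER_TYPE_KEY, entityCtx -> ConnectionActor.create(entityCtx.getEntityId())));
    }

    // Sending addresses the logical entity; the sharding layer resolves which
    // node hosts it and activates the actor on demand if needed.
    static void deliver(ActorSystem<?> system, String userId, String messageId, String body) {
        EntityRef<ConnectionActor.Command> user =
                ClusterSharding.get(system).entityRefFor(USER_TYPE_KEY, userId);
        user.tell(new ConnectionActor.Deliver(messageId, body));
    }
}
```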
2.4 Backpressure: Graceful Degradation Under Load
Mobile networks have variable bandwidth. A user on fast WiFi can receive messages instantly. The same user in an elevator has effectively zero bandwidth. What happens to messages queued for delivery?
Without explicit backpressure, messages accumulate in memory. The buffer grows until the server exhausts heap and crashes. This cascading failure takes down not just one connection but thousands sharing that server.
Pekko Streams provides reactive backpressure propagating through entire pipelines. When a consumer can’t keep up, pressure signals flow backward to producers. You configure explicit overflow strategies:
Bounded Buffers: Limit how many messages queue per connection. Memory consumption is predictable regardless of consumer speed.
Overflow Strategies: When buffers fill, choose behaviour: drop oldest messages, drop newest messages, signal failure to producers. For chat, dropping oldest is usually correct; users prefer missing old messages to system crashes.
Graceful Degradation: Under extreme load, the system slows down rather than falling over. Message delivery delays but the system remains operational.
This explicit backpressure is essential for mobile where network quality varies wildly and client consumption rates are unpredictable.
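A bounded, drop-oldest buffer per connection might be wired up with Pekko Streams roughly as follows; the buffer size and the sink are placeholders for the real SSE write path.

```java
import org.apache.pekko.actor.typed.ActorSystem;
import org.apache.pekko.stream.OverflowStrategy;
import org.apache.pekko.stream.javadsl.Keep;
import org.apache.pekko.stream.javadsl.Sink;
import org.apache.pekko.stream.javadsl.Source;
import org.apache.pekko.stream.javadsl.SourceQueueWithComplete;

// Bounded per-connection buffer: at most 256 queued events, dropping the oldest
// when the client cannot keep up, instead of growing the heap without limit.
final class ConnectionStream {
    static SourceQueueWithComplete<String> attach(ActorSystem<?> system) {
        return Source.<String>queue(256, OverflowStrategy.dropHead())
                .toMat(Sink.foreach(event -> {
                    // write the event to this connection's SSE response here
                }), Keep.left())
                .run(system);
    }
}
```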
2.5 Multi Device and Presence
Modern users have multiple devices: phone, tablet, watch, desktop. Messages should deliver to all connected devices. Presence should reflect aggregate state across devices.
The actor hierarchy models this naturally. A UserActor represents the user across all devices. Child ConnectionActors represent individual device connections. Messages to the user fan out to all active connections. When all devices disconnect, the UserActor knows the user is offline and can trigger push notifications or buffer messages.
This isn’t just convenience; it’s architectural clarity. The UserActor is the single source of truth for that user’s state. There’s no distributed coordination problem across devices because one actor owns the aggregate state.
3. Server Sent Events: The Right Protocol Choice
WebSockets are the default assumption for real time applications. Server Sent Events deserve serious architectural consideration for mobile chat.
3.1 Understanding Traffic Asymmetry
Examine any chat system’s traffic patterns. Users receive far more messages than they send. In a group chat with 50 participants, each sent message generates 49 deliveries. Downstream traffic (server to client) exceeds upstream by roughly two orders of magnitude.
WebSocket provides symmetric bidirectional streaming. You’re provisioning and managing upstream capacity you don’t need. SSE acknowledges the asymmetry: persistent streaming downstream, standard HTTP requests upstream.
This isn’t a limitation; it’s architectural honesty about traffic patterns.
3.2 Upstream Path Simplicity
With SSE, sending a message is an HTTP POST. This request is stateless. Any server in your cluster can handle it. Load balancing is trivial. Retries on network failure use standard HTTP retry logic. Rate limiting uses standard HTTP rate limiting. Authentication uses standard HTTP authentication.
You’ve eliminated an entire category of complexity. The upstream path doesn’t need sticky sessions, doesn’t need cluster coordination, doesn’t need special handling for connection migration. It’s just HTTP requests, which your infrastructure already knows how to handle.
3.3 Automatic Reconnection with Resume
The EventSource specification includes automatic reconnection with resume capability. When a connection drops, the client reconnects and sends the Last-Event-ID header indicating the last successfully received event. The server resumes from that point.
For mobile where disconnections happen constantly, this built in resume eliminates significant application complexity. You’re not implementing reconnection logic, not tracking client state for resume, not building replay mechanisms. The protocol handles it.
This approximates exactly once delivery semantics without distributed transaction protocols. The client tells you what it received; you replay from there.
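Stripped of any particular HTTP framework, the resume handshake reduces to the sketch below: parse the Last-Event-ID header, select the missed events, and frame each one with an id: line so the next reconnect can resume again. Names and the event shape are illustrative.

```java
import java.util.List;

// Minimal sketch of SSE resume handling on the server side.
final class SseResume {

    record Event(long id, String data) {}

    static String frame(Event e) {
        // SSE wire format: id line, data line, blank-line terminator.
        return "id: " + e.id() + "\ndata: " + e.data() + "\n\n";
    }

    // orderedEvents is the retained, ID-ordered history for this subscription.
    static String replay(List<Event> orderedEvents, String lastEventIdHeader) {
        long lastSeen = (lastEventIdHeader == null || lastEventIdHeader.isBlank())
                ? 0L
                : Long.parseLong(lastEventIdHeader);
        StringBuilder body = new StringBuilder();
        for (Event e : orderedEvents) {
            if (e.id() > lastSeen) {
                body.append(frame(e));
            }
        }
        return body.toString();
    }
}
```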
3.4 HTTP Infrastructure Compatibility
SSE is pure HTTP. It works through every proxy, load balancer, CDN, and firewall that understands HTTP. Corporate networks, hotel WiFi, airplane WiFi: if HTTP works, SSE works.
WebSocket, despite widespread support, still encounters edge cases. Some corporate proxies don’t handle the upgrade handshake. Some firewalls block the WebSocket protocol. Some CDNs don’t support WebSocket passthrough. These edge cases occur precisely when users are on restrictive networks where reliability matters most.
From an operations perspective, SSE uses your existing HTTP monitoring, logging, and debugging infrastructure. WebSocket requires parallel tooling.
3.5 Debugging and Observability
SSE streams are plain text over HTTP. You can observe them with curl, log them with standard HTTP logging, replay them for debugging. Every HTTP tool in your operational arsenal works.
WebSocket debugging requires specialised tools understanding the frame protocol. At 3am during an incident, the simplicity of SSE becomes invaluable.
4. HTTP Protocol Version: A Critical Infrastructure Decision
The choice between HTTP/1.1, HTTP/2, and HTTP/3 significantly impacts mobile chat performance. Each version represents different tradeoffs.
4.1 HTTP/1.1: Universal Compatibility
HTTP/1.1 works everywhere. Every client, proxy, load balancer, and debugging tool supports it. For SSE specifically, HTTP/1.1 functions correctly because SSE connections are single stream.
The limitation is connection overhead. Browsers and mobile clients restrict HTTP/1.1 connections to six per domain. A chat app with multiple subscriptions (messages, presence, typing indicators, notifications) exhausts this quickly. Each subscription requires a separate TCP connection with separate TLS handshake overhead.
For mobile, the multiple connection problem compounds with battery impact. Each TCP connection requires radio activity for establishment and maintenance. Six connections consume significantly more power than one.
Choose HTTP/1.1 when: Maximum compatibility is essential, your infrastructure doesn’t support HTTP/2, or you have very few simultaneous streams.
4.2 HTTP/2: The Practical Choice for Most Deployments
HTTP/2 multiplexes unlimited streams over a single TCP connection. Each SSE subscription becomes a stream within the same connection. Browser connection limits become irrelevant.
For mobile architecture, the implications are substantial:
Single Connection Efficiency: One TCP connection, one TLS session, one set of kernel buffers. The radio wakes once rather than maintaining multiple connections. Battery consumption drops significantly.
Instant Stream Establishment: New subscriptions don’t require TCP handshakes. Opening a new chat room adds a stream to the existing connection in milliseconds rather than the hundreds of milliseconds for new TCP connection establishment.
Header Compression: HPACK compression eliminates redundant bytes in repetitive headers. SSE requests with identical Authorization, Accept, and User-Agent headers compress to single digit bytes after the first request.
Stream Isolation: Flow control operates per stream. A slow stream doesn’t block other streams. If a busy group chat falls behind, direct message delivery continues unaffected.
The limitation is TCP head of line blocking. HTTP/2 streams are independent at the application layer but share a single TCP connection underneath. A single lost packet blocks all streams until retransmission. On lossy mobile networks, this creates correlated latency spikes across all subscriptions.
Choose HTTP/2 when: You need multiplexing benefits, your infrastructure supports HTTP/2 termination, and TCP head of line blocking is acceptable.
4.3 HTTP/3 and QUIC: Purpose Built for Mobile
HTTP/3 replaces TCP with QUIC, a UDP based transport with integrated encryption. For mobile chat, QUIC provides capabilities that fundamentally change user experience.
Stream Independence: QUIC delivers streams independently at the transport layer, not just the application layer. Packet loss on one stream doesn’t affect others. On mobile networks where packet loss is routine, this isolation prevents correlated latency spikes across chat subscriptions.
Connection Migration: QUIC connections are identified by connection ID, not IP address and port. When a device switches from WiFi to cellular, the QUIC connection survives the IP address change. No reconnection, no TLS renegotiation, no message replay. The connection continues seamlessly.
This is transformative for mobile. A user walking from WiFi coverage to cellular maintains their chat connection without interruption. With TCP, this transition requires full reconnection with associated latency and potential message loss during the gap.
Zero Round Trip Resumption: For returning connections, QUIC supports 0-RTT establishment. A user who chatted yesterday can send and receive messages before completing the handshake. For apps where users connect and disconnect frequently, this eliminates perceptible connection latency.
Current Deployment Challenges: Some corporate firewalls block UDP. QUIC runs in userspace rather than leveraging kernel TCP optimisations, increasing CPU overhead. Operational tooling is less mature. Load balancer support varies across vendors.
Choose HTTP/3 when: Mobile experience is paramount, your infrastructure supports QUIC termination, and you can fall back gracefully when UDP is blocked.
4.4 The Hybrid Architecture Recommendation
Deploy HTTP/2 as your baseline with HTTP/3 alongside. Clients negotiate using Alt-Svc headers, selecting HTTP/3 when available and falling back to HTTP/2 when UDP is blocked.
Modern iOS (15+) and Android clients support HTTP/3 natively. Most mobile users will negotiate HTTP/3 automatically, getting connection migration benefits. Users on restrictive networks fall back to HTTP/2 without application awareness.
This hybrid approach provides optimal experience for capable clients while maintaining universal accessibility.
5. Java 25: Runtime Capabilities That Change Architecture
Java 25 delivers runtime capabilities that fundamentally change how you architect JVM based chat systems. These aren’t incremental improvements but architectural enablers.
5.1 Virtual Threads: Eliminating the Thread/Connection Tension
Traditional Java threads map one to one with operating system threads. Each thread allocates megabytes of stack space and involves kernel scheduling. At 10,000 threads, context switching overhead dominates CPU usage. At 100,000 threads, the system becomes unresponsive.
This created a fundamental architectural tension. Simple, readable code wants one thread per connection, processing messages sequentially with straightforward blocking I/O. But you can’t afford millions of OS threads for millions of connections. The solution was reactive programming: callback chains, continuation passing, complex async/await patterns that are difficult to write, debug, and maintain.
Virtual threads resolve this tension. They’re lightweight threads managed by the JVM, not the operating system. Millions of virtual threads multiplex onto a small pool of platform threads (typically matching CPU core count). When a virtual thread blocks on I/O, it yields its carrier platform thread to other virtual threads rather than blocking the OS thread.
Architecturally, you can now write straightforward sequential code for connection handling. Read from network. Process message. Write to database. Query cache. Each operation can block without concern. When I/O blocks, other connections proceed on the same platform threads.
Combined with Pekko’s actor model, virtual threads enable blocking operations inside actors without special handling. Actors calling databases or external services can use simple blocking calls rather than complex async patterns.
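The shape of that sequential style is easiest to see in a bare sketch: one virtual thread per accepted connection, plain blocking I/O inside the handler. Everything here is standard JDK API; the handler body is a placeholder.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One virtual thread per connection: the handler is plain sequential, blocking
// code, and the JVM multiplexes these threads over a small carrier pool.
public class VirtualThreadServer {

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080);
             ExecutorService perConnection = Executors.newVirtualThreadPerTaskExecutor()) {
            while (true) {
                Socket socket = server.accept();
                perConnection.submit(() -> handle(socket)); // cheap: not an OS thread
            }
        }
    }

    private static void handle(Socket socket) {
        try (socket) {
            // Read the request, call the database, write the response; each blocking
            // call simply parks this virtual thread and frees its carrier thread.
            socket.getOutputStream().write("hello\n".getBytes());
        } catch (IOException e) {
            // a connection-level failure affects only this virtual thread
        }
    }
}
```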
5.2 Generational ZGC: Eliminating GC as an Architectural Constraint
Garbage collection historically constrained chat architecture. Under sustained load, heap fills with connection state, message buffers, and temporary objects. Eventually, major collection triggers, pausing all application threads for hundreds of milliseconds.
During that pause, no messages deliver. Connections timeout. Clients reconnect. The reconnection surge creates more garbage, triggering more collection, potentially cascading into cluster wide instability.
Architects responded with complex mitigations: off heap storage, object pooling, careful allocation patterns, GC tuning rituals. Or they abandoned the JVM entirely for languages with different memory models.
Generational ZGC in Java 25 provides sub millisecond pause times regardless of heap size. At 100GB heap with millions of objects, GC pauses remain under 1ms. Collection happens concurrently while application threads continue executing.
Architecturally, this removes GC as a constraint. You can use straightforward object allocation patterns. You can provision large heaps for connection state. You don’t need off heap complexity for latency sensitive paths. GC induced latency spikes don’t trigger reconnection cascades.
5.3 AOT Compilation Cache: Solving the Warmup Problem
Java’s Just In Time compiler produces extraordinarily efficient code after warmup. The JVM interprets bytecode initially, identifies hot paths through profiling, compiles them to native code, then recompiles with more aggressive optimisation as profile data accumulates.
Full optimisation takes 3 to 5 minutes of sustained load. During warmup:
Elevated Latency: Interpreted code runs 10x to 100x slower than compiled code. Message delivery takes milliseconds instead of microseconds.
Increased CPU Usage: The JIT compiler consumes significant CPU while compiling. Less capacity remains for actual work.
Impaired Autoscaling: When load spikes trigger scaling, new instances need warmup before reaching efficiency. The spike might resolve before new capacity becomes useful.
Deployment Pain: Rolling deployments put cold instances into rotation. Users hitting new instances experience degraded performance until warmup completes.
AOT (Ahead of Time) compilation caching through Project Leyden addresses this. You perform a training run under representative load. The JVM records compilation decisions: which methods are hot, inlining choices, optimisation levels. This persists to a cache file.
On production startup, the JVM loads cached compilation decisions and applies them immediately. Methods identified as hot during training compile before handling any requests. The server starts at near optimal performance.
Architecturally, this transforms deployment and scaling characteristics. New instances become immediately productive. Autoscaling responds effectively to sudden load. Rolling deployments don’t cause latency regressions. You can be more aggressive with instance replacement for security patching or configuration changes.
5.4 Structured Concurrency: Lifecycle Clarity
Structured concurrency ensures concurrent operations have clear parent/child relationships. When a parent scope completes, child operations are guaranteed complete or cancelled. No orphaned tasks, no resource leaks from forgotten futures.
For chat connection lifecycle, this provides architectural clarity. When a connection closes, all associated operations terminate: pending message deliveries, presence updates, typing broadcasts. With unstructured concurrency, ensuring complete cleanup requires careful tracking. With structured concurrency, cleanup is automatic and guaranteed.
Combined with virtual threads, you might spawn thousands of lightweight threads for subtasks within a connection’s processing. Structured concurrency ensures they all terminate appropriately when the connection ends.
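A sketch of the idea using the preview StructuredTaskScope API (whose exact shape has shifted across JDK preview releases, so treat this as illustrative): both lookups are forked inside a scope, and leaving the scope guarantees neither outlives the connection setup.

```java
import java.util.concurrent.StructuredTaskScope;

// When the scope exits, every forked subtask has completed or been cancelled,
// so nothing associated with this connection is left running.
final class ConnectionSetupExample {

    record Profile(String userId) {}
    record Presence(String state) {}

    Profile loadProfile(String userId) { return new Profile(userId); }
    Presence loadPresence(String userId) { return new Presence("online"); }

    String onConnect(String userId) throws Exception {
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            var profile = scope.fork(() -> loadProfile(userId));
            var presence = scope.fork(() -> loadPresence(userId));

            scope.join();           // wait for both subtasks
            scope.throwIfFailed();  // propagate the first failure, cancelling the other

            return profile.get().userId() + ":" + presence.get().state();
        } // leaving the scope guarantees no orphaned work remains
    }
}
```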
6. Kubernetes and EKS Deployment Architecture
Deploying Pekko clusters on Kubernetes requires understanding how actor clustering interacts with container orchestration.
6.1 EKS Configuration Considerations
Amazon EKS provides managed Kubernetes suitable for Pekko chat deployments. Several configuration choices significantly impact cluster behaviour.
Node Instance Types: Chat servers are memory bound before CPU bound due to connection state overhead. Memory optimised instances (r6i, r6g series) provide better cost efficiency than general purpose instances. For maximum connection density, r6g.4xlarge (128GB memory, 16 vCPU) or r6i.4xlarge handles approximately 500,000 connections per node.
Graviton Instances: ARM based Graviton instances (r6g, r7g series) provide approximately 20% better price performance than equivalent x86 instances. Java 25 has mature ARM support. Unless you have x86 specific dependencies, Graviton instances reduce infrastructure cost at scale.
Node Groups: Separate node groups for Pekko cluster nodes versus supporting services (databases, monitoring, ingestion). This allows independent scaling and prevents noisy neighbour issues where supporting workloads affect chat latency.
Pod Anti-Affinity: Configure pod anti-affinity to spread Pekko cluster members across availability zones and physical hosts. Losing a single host shouldn’t remove multiple cluster members simultaneously.
6.2 Pekko Kubernetes Discovery
Pekko clusters require members to discover each other for gossip protocol coordination. On Kubernetes, the Pekko Kubernetes Discovery module uses the Kubernetes API to find peer pods.
Configuration involves:
Headless Service: A Kubernetes headless service (clusterIP: None) allows pods to discover peer pod IPs directly rather than load balancing.
RBAC Permissions: The Pekko discovery module needs permissions to query the Kubernetes API for pod information. A ServiceAccount with appropriate RBAC rules enables this.
Startup Coordination: During rolling deployments, new pods must join the existing cluster before old pods terminate. Proper readiness probes and deployment strategies ensure cluster continuity.
6.3 Network Configuration for Connection Density
High connection counts require careful network configuration:
VPC CNI Settings: The default AWS VPC CNI limits pods per node based on ENI capacity. For high connection density, configure secondary IP mode or consider Calico CNI for higher pod density.
Connection Tracking: Linux connection tracking tables have default limits around 65,536 entries. At hundreds of thousands of connections per node, increase nf_conntrack_max accordingly.
Port Exhaustion: With HTTP/2 multiplexing, port exhaustion is less common but still possible for outbound connections to databases and services. Ensure adequate ephemeral port ranges.
6.4 Horizontal Pod Autoscaling Considerations
Traditional HPA based on CPU or memory doesn’t map well to chat workloads where connection count is the primary scaling dimension.
Custom Metrics: Expose connection count as a Prometheus metric and configure HPA using custom metrics adapter. Scale based on connections per pod rather than resource utilisation.
Predictive Scaling: Chat traffic often has predictable daily patterns. AWS Predictive Scaling can pre provision capacity before expected peaks rather than reacting after load arrives.
Scaling Responsiveness: With AOT compilation cache, new pods are immediately productive. This enables more aggressive scaling policies since new capacity provides value immediately rather than after warmup.
6.5 Service Mesh Considerations
Service mesh technologies (Istio, Linkerd) add sidecar proxies that intercept traffic. For high connection count workloads, evaluate carefully:
Sidecar Overhead: Each connection passes through the sidecar proxy, adding latency and memory overhead. At 500,000 connections per pod, sidecar memory consumption becomes significant.
mTLS Termination: If using service mesh for internal mTLS, the sidecar terminates and re-establishes TLS, adding CPU overhead per connection.
Recommendation: For Pekko cluster internal traffic, consider excluding from mesh using annotations. Apply mesh policies to edge traffic where the connection count is lower.
7. Linux Distribution Selection
The choice of Linux distribution affects performance, security posture, and operational characteristics for high connection count workloads.
7.1 Amazon Linux 2023
Amazon Linux 2023 (AL2023) is purpose built for AWS workloads. It uses a Fedora based lineage with Amazon specific optimisations.
Advantages: Optimised for AWS infrastructure including Nitro hypervisor integration. Regular security updates through Amazon. No licensing costs. Excellent AWS tooling integration. Kernel tuned for network performance.
Considerations: Shorter support lifecycle than enterprise distributions. Community smaller than Ubuntu or RHEL ecosystems.
Best for: EKS deployments prioritising AWS integration and cost optimisation.
7.2 Bottlerocket
Bottlerocket is Amazon’s container optimised Linux distribution. It runs containers and nothing else.
Advantages: Minimal attack surface with only container runtime components. Immutable root filesystem prevents runtime modification. Atomic updates reduce configuration drift. API driven configuration rather than SSH access.
Considerations: Cannot run non-containerised workloads. Debugging requires different operational patterns (exec into containers rather than SSH to host). Less community familiarity.
Best for: High security environments where minimal attack surface is paramount. Organisations with mature container debugging practices.
7.3 Ubuntu Server
Ubuntu Server (22.04 LTS or 24.04 LTS) provides broad compatibility and extensive community support.
Advantages: Large community and extensive documentation. Wide hardware and software compatibility. Canonical provides commercial support options. Most operational teams are familiar with Ubuntu.
Considerations: Larger base image than container optimised distributions. More components installed than strictly necessary for container hosts.
Best for: Teams prioritising operational familiarity and broad ecosystem compatibility.
7.4 Flatcar Container Linux
Flatcar is a community maintained fork of CoreOS Container Linux, designed specifically for container workloads.
Advantages: Minimal OS footprint focused on container hosting. Automatic atomic updates. Immutable infrastructure patterns built in. Active community continuing CoreOS legacy.
Considerations: Smaller community than major distributions. Fewer enterprise support options.
Best for: Organisations comfortable with immutable infrastructure patterns seeking minimal container optimised OS.
7.5 Recommendation
For most EKS chat deployments, Amazon Linux 2023 provides the best balance of AWS integration, performance, and operational familiarity. The kernel network stack tuning is appropriate for high connection counts, AWS tooling integration is seamless, and operational teams can apply existing Linux knowledge.
For high security environments or organisations committed to immutable infrastructure, Bottlerocket provides stronger security posture at the cost of operational model changes.
8. Comparing Alternative Architectures
8.1 WebSockets with Socket.IO
Socket.IO provides WebSocket with automatic fallback and higher level abstractions like rooms and acknowledgements.
Architectural Advantages: Rich feature set reduces development time. Room abstraction maps naturally to group chats. Acknowledgement system provides delivery confirmation. Large community provides extensive documentation and examples.
Architectural Disadvantages: Sticky sessions required for scaling. The load balancer must route all requests from a client to the same server, fighting against elastic scaling. Scaling beyond a single server requires a pub/sub adapter (typically Redis), introducing a centralised bottleneck. The proprietary protocol layer over WebSocket adds complexity and overhead.
Scale Ceiling: Practical limits around hundreds of thousands of connections before the Redis adapter becomes a bottleneck.
Best For: Moderate scale applications where development speed outweighs architectural flexibility.
8.2 Firebase Realtime Database / Firestore
Firebase provides real time synchronisation as a fully managed service with excellent mobile SDKs.
Architectural Advantages: Zero infrastructure to operate. Offline support built into mobile SDKs. Real time listeners are trivial to implement. Automatic scaling handled by Google. Cross platform consistency through Google’s SDKs.
Architectural Disadvantages: Complete vendor lock in to Google Cloud Platform. Pricing scales with reads, writes, and bandwidth, becoming expensive at scale. Limited query capabilities compared to purpose built databases. Security rules become complex as data models grow. No control over performance characteristics or geographic distribution.
Scale Ceiling: Technically unlimited, but cost prohibitive beyond moderate scale.
Best For: Startups and applications where chat is a feature, not the product. When operational simplicity justifies premium pricing.
8.3 gRPC Streaming
gRPC provides efficient bidirectional streaming with Protocol Buffer serialisation.
Architectural Advantages: Highly efficient binary serialisation reduces bandwidth. Strong typing through Protocol Buffers catches errors at compile time. Excellent for polyglot service meshes. Deadline propagation and cancellation built into the protocol.
Architectural Disadvantages: Limited browser support requiring gRPC-Web proxy translation. Protocol Buffers add schema management overhead. Mobile client support requires additional dependencies. Debugging is more complex than HTTP based protocols.
Scale Ceiling: Very high; gRPC is designed for Google scale internal communication.
Best For: Backend service to service communication. Mobile clients through a translation gateway.
8.4 Solace PubSub+
Solace provides enterprise messaging infrastructure with support for multiple protocols including MQTT, AMQP, REST, and WebSocket. It’s positioned as enterprise grade messaging for mission critical applications.
Architectural Advantages:
Multi-protocol support allows different clients to use optimal protocols. Mobile clients might use MQTT for battery efficiency while backend services use AMQP for reliability guarantees. Protocol translation happens at the broker level without application involvement.
Hardware appliance options provide deterministic latency for organisations requiring guaranteed performance characteristics. Software brokers run on commodity infrastructure for cloud deployments.
Built in message replay and persistence provides durable messaging without separate storage infrastructure. Messages survive broker restarts and can be replayed for late joining subscribers.
Enterprise features like fine grained access control, message filtering, and topic hierarchies are mature and well documented. Compliance and audit capabilities suit regulated industries.
Hybrid deployment models support on premises, cloud, and edge deployments with consistent APIs. Useful for organisations with complex deployment requirements spanning multiple environments.
Architectural Disadvantages:
Proprietary technology creates vendor dependency. While Solace supports standard protocols, the management plane and advanced features are Solace specific. Migration to alternatives requires significant effort.
Cost structure includes licensing fees that become substantial at scale. Unlike open source alternatives, you pay for the messaging infrastructure beyond just compute and storage.
Operational model differs from cloud native patterns. Solace brokers are stateful infrastructure requiring specific operational expertise. Teams familiar with Kubernetes native patterns face a learning curve.
Connection model is broker centric rather than service mesh style. All messages flow through Solace brokers, which become critical infrastructure requiring high availability configuration.
Less ecosystem integration than cloud provider native services. While Solace runs on AWS, Azure, and GCP, it doesn’t integrate as deeply as native services like Amazon MQ or Google Pub/Sub.
Scale Ceiling: Very high with appropriate hardware or cluster configuration. Solace publishes benchmarks showing millions of messages per second.
Best For: Enterprises with existing Solace investments. Organisations requiring multi-protocol support. Regulated industries needing enterprise support contracts and compliance certifications. Hybrid deployments spanning on premises and cloud.
Comparison to Pekko + SSE:
Solace is a messaging infrastructure product; Pekko + SSE is an application architecture pattern. Solace provides the transport layer with sophisticated routing, persistence, and protocol support. Pekko + SSE builds the application logic with actors, clustering, and HTTP streaming.
For greenfield mobile chat, Pekko + SSE provides more control, lower cost, and better fit for modern cloud native deployment. For enterprises integrating chat into existing Solace infrastructure or requiring Solace’s specific capabilities (multi-protocol, hardware acceleration, compliance), Solace as the transport layer with application logic on top is viable.
The architectures can also combine: use Solace for backend service communication and durable message storage while using Pekko + SSE for client-facing connection handling. This hybrid leverages Solace’s enterprise messaging strengths while maintaining cloud native patterns at the edge.
8.5 Commercial Platforms: Pusher, Ably, PubNub
Managed real time platforms provide complete infrastructure as a service.
Architectural Advantages: Zero infrastructure to build or operate. Global edge presence included. Guaranteed SLAs with financial backing. Features like presence and message history built in.
Architectural Disadvantages: Significant cost at scale, often exceeding $10,000 monthly at millions of connections. Vendor lock in with proprietary APIs. Limited customisation for specific requirements. Latency to vendor infrastructure adds milliseconds to every message.
Scale Ceiling: High, but cost limited rather than technology limited.
Best For: When real time is a feature you need but not core competency. When engineering time is more constrained than infrastructure budget.
8.6 Erlang/Elixir with Phoenix Channels
The BEAM VM provides battle tested concurrency primitives, and Phoenix Channels offer WebSocket abstraction with presence and pub/sub.
Architectural Advantages: Exceptional concurrency model designed and proven at telecom scale. “Let it crash” supervision provides natural fault tolerance. WhatsApp scaled to billions of messages on BEAM. Per process garbage collection eliminates global GC pauses. Hot code reloading enables deployment without disconnecting users.
Architectural Disadvantages: Smaller talent pool than JVM ecosystem. Different operational model requires team investment. Library ecosystem is smaller than Java. Integration with existing JVM based systems requires interop complexity.
Scale Ceiling: Very high; BEAM is purpose built for this workload.
Best For: Teams with Erlang/Elixir expertise. Greenfield applications where the BEAM’s unique capabilities (hot reloading, per process GC) provide significant value.
8.7 Comparison Summary
| Architecture | Scale Ceiling | Operational Complexity | Development Speed | Cost at Scale | Talent Availability |
|---|---|---|---|---|---|
| Pekko + SSE | Very High | Medium | Medium | Low | High |
| Socket.IO | Medium | Medium | Fast | Medium | Very High |
| Firebase | High | Very Low | Very Fast | Very High | High |
| gRPC | Very High | Medium | Medium | Low | High |
| Solace | Very High | Medium-High | Medium | High | Medium |
| Commercial | High | Very Low | Fast | Very High | N/A |
| BEAM/Phoenix | Very High | Medium | Medium | Low | Low |
9. Capacity Planning Framework
9.1 Connection Density Expectations
With Java 25 on appropriately sized instances, expect approximately 500,000 to 750,000 concurrent SSE connections per node. Limiting factors in order of typical impact:
Memory: Each connection requires actor state, stream buffers, and HTTP/2 overhead. Budget 100 to 200 bytes per idle connection, 1KB to 2KB per active connection with buffers.
File Descriptors: Each TCP connection requires a kernel file descriptor. Default Linux limits (1024) are inadequate. Production systems need limits of 500,000 or higher.
Network Bandwidth: Aggregate message throughput eventually saturates network interfaces, typically 10Gbps on modern cloud instances.
9.2 Throughput Expectations
Message throughput depends on message size and processing complexity:
Simple Relay: 50,000 to 100,000 messages per second per node for small messages with minimal processing.
With Persistence: 20,000 to 50,000 messages per second when writing to database.
With Complex Processing: 10,000 to 30,000 messages per second with encryption, filtering, or transformation logic.
9.3 Latency Targets
Reasonable expectations for properly architected systems:
Same Region Delivery: p50 under 10ms, p99 under 50ms.
Cross Region Delivery: p50 under 100ms, p99 under 200ms (dominated by network latency).
Connection Establishment: Under 500ms including TLS handshake.
Reconnection with Resume: Under 200ms with HTTP/3, under 500ms with HTTP/2.
9.4 Cluster Sizing Example
For 10 million concurrent connections with 1 million active users generating 10,000 messages per second:
Connection Tier: 15 to 20 Pekko nodes (r6g.4xlarge) handling connection state and message routing.
Persistence Tier: 3 to 5 node ScyllaDB or Cassandra cluster for message storage.
Cache Tier: 3 node Redis cluster for presence and transient state if not using Pekko distributed data.
Load Balancing: Application Load Balancer with HTTP/2 support, or Network Load Balancer with Nginx fleet for HTTP/3.
10. Architectural Principles
Several principles guide successful mobile chat architecture regardless of specific technology choices.
10.1 Design for Reconnection
Mobile connections are ephemeral. Every component should assume disconnection happens constantly. Message delivery must survive connection loss. State reconstruction must be fast. Resume must be seamless.
This isn’t defensive programming; it’s accurate modelling of mobile reality.
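To make the resume requirement concrete, here is a minimal client-side sketch in Python using the requests library. The endpoint URL and the handle function are placeholders rather than part of the architecture described here, and a production mobile client would add jittered backoff instead of reconnecting in a tight loop.

```python
# Minimal sketch of SSE resume-on-reconnect. The URL and handle() are
# illustrative placeholders.
import requests

def handle(payload: str) -> None:
    print(payload)  # stand-in for real message processing

def stream_messages(url: str = "https://chat.example.com/v1/stream") -> None:
    last_event_id = None
    while True:  # mobile reality: the connection will drop, so always reconnect
        headers = {"Accept": "text/event-stream"}
        if last_event_id:
            headers["Last-Event-ID"] = last_event_id  # standard SSE resume header
        try:
            with requests.get(url, headers=headers, stream=True, timeout=(5, 90)) as resp:
                for line in resp.iter_lines(decode_unicode=True):
                    if line.startswith("id:"):
                        last_event_id = line[3:].strip()  # remember the resume position
                    elif line.startswith("data:"):
                        handle(line[5:].strip())
        except requests.RequestException:
            continue  # real clients add jittered backoff before reconnecting
```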
10.2 Separate Logical Identity from Physical Location
Messages should route to User B, not to “the server holding User B’s connection.” When User B reconnects to a different server, routing should work without explicit updates.
Cluster sharding provides this naturally. Explicit routing tables require careful consistency management that’s difficult to get right.
10.3 Embrace Traffic Asymmetry
Chat is read heavy. Optimise the downstream path aggressively. The upstream path handles lower volume and can be simpler.
SSE plus HTTP POST matches this asymmetry. Bidirectional WebSocket overprovisions upload capacity.
10.4 Make Backpressure Explicit
When consumers can’t keep up, something must give. Explicit backpressure with configurable overflow strategies is better than implicit unbounded buffering that eventually exhausts memory.
Decide what happens when a client falls behind. Drop oldest messages? Drop newest? Disconnect? Make it a conscious architectural choice.
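A framework-agnostic sketch of that decision is shown below: a bounded per-client buffer with an explicit, configurable policy. The names are illustrative rather than taken from any library; Pekko Streams expresses the same choice with a bounded buffer and an OverflowStrategy.

```python
# Per-client outbound buffer with an explicit overflow policy (illustrative names).
from collections import deque
from enum import Enum

class Overflow(Enum):
    DROP_OLDEST = "drop_oldest"
    DROP_NEWEST = "drop_newest"
    DISCONNECT = "disconnect"

class ClientBuffer:
    def __init__(self, capacity: int = 256, policy: Overflow = Overflow.DROP_OLDEST):
        self.buffer: deque = deque()
        self.capacity = capacity
        self.policy = policy
        self.connected = True

    def offer(self, message: str) -> None:
        if len(self.buffer) < self.capacity:
            self.buffer.append(message)          # normal case: buffer has headroom
        elif self.policy is Overflow.DROP_OLDEST:
            self.buffer.popleft()                # evict the oldest buffered message
            self.buffer.append(message)
        elif self.policy is Overflow.DROP_NEWEST:
            pass                                 # drop the incoming message
        else:
            self.connected = False               # deliberately disconnect the slow client
```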
10.5 Eliminate Warmup Dependencies
Mobile load is spiky. Autoscaling must respond quickly. New instances must be immediately productive.
AOT compilation cache, pre warmed connection pools, and eager initialisation eliminate the warmup period that makes autoscaling ineffective.
10.6 Plan for Multi Region
Mobile users are globally distributed. Latency matters for chat quality. Eventually you’ll need presence in multiple regions.
Architecture decisions made for single region deployment affect multi region feasibility. Avoid patterns that assume single cluster or centralised state.
10.7 Accept Platform Constraints for Background Operation
Fighting mobile platform restrictions on background execution is futile. Design for the hybrid model: real time when foregrounded, push notification relay when backgrounded, efficient sync on return.
Architectures that assume persistent connections regardless of app state will disappoint users with battery drain or fail entirely when platforms enforce restrictions.
11. Conclusion
Mobile chat at scale requires architectural decisions that embrace mobile reality: unstable networks, battery constraints, background execution limits, multi device users, and constant connection churn.
Apache Pekko provides the actor model and cluster sharding that naturally fit connection state and message routing. Actors handle millions of mostly idle connections efficiently. Cluster sharding solves routing without centralised bottlenecks.
Server Sent Events match chat’s asymmetric traffic pattern while providing automatic reconnection and resume. HTTP/2 multiplexing reduces connection overhead. HTTP/3 with QUIC enables connection migration for seamless network transitions.
Java 25 removes historical JVM limitations. Virtual threads eliminate the thread per connection tension. Generational ZGC removes GC as a latency concern. AOT compilation caching makes autoscaling effective by eliminating warmup.
The background execution model requires accepting platform constraints rather than fighting them. Real time streaming when foregrounded, push notification relay when backgrounded, efficient sync on return. This hybrid approach works within mobile platform rules while providing the best achievable user experience.
EKS deployment requires attention to instance sizing, network configuration, and Pekko cluster discovery integration. Amazon Linux 2023 provides the appropriate base for high connection count workloads.
Alternative approaches like Solace provide enterprise messaging capabilities but with different operational models and cost structures. The choice depends on existing infrastructure, compliance requirements, and team expertise.
The architecture handles tens of millions of concurrent connections. More importantly, it handles mobile gracefully: network transitions don’t lose messages, battery impact remains reasonable, and users experience the instant message delivery they expect whether the app is foregrounded or backgrounded.
The key architectural insight is that mobile chat is a distributed systems problem with mobile specific constraints layered on top. Solve the distributed systems challenges with proven primitives, address mobile constraints with appropriate protocol choices, and leverage modern runtime capabilities. The result is a system that scales horizontally, recovers automatically, and provides the experience mobile users demand.
Running WordPress on ARM-based Graviton instances delivers up to 40% better price-performance compared to x86 equivalents. This guide provides production-ready scripts to deploy an optimised WordPress stack in minutes, plus everything you need to migrate your existing site.
Why Graviton for WordPress?
Graviton3 processors deliver:
40% better price-performance vs comparable x86 instances
Up to 25% lower cost for equivalent workloads
60% less energy consumption per compute hour
Native ARM64 optimisations for PHP 8.x and MariaDB
The t4g.small instance (2 vCPU, 2GB RAM) at ~$12/month handles most WordPress sites comfortably. For higher traffic, t4g.medium or c7g instances scale beautifully.
The end to end flow looks like this:
# 1. Launch instance (local machine)
./launch-graviton-wp.sh
# 2. SSH in and setup WordPress
ssh -i ~/.ssh/key.pem ec2-user@IP
sudo ./setup-wordpress.sh
# 3. If migrating - on old server
./wp-export.sh
scp /tmp/wp-migration/wordpress-migration-*.tar.gz ec2-user@NEW_IP:/tmp/
# 4. If migrating - on new server
sudo ./wp-import.sh /tmp/wordpress-migration-*.tar.gz
This setup delivers a production-ready WordPress installation that’ll handle significant traffic while keeping your AWS bill minimal. The combination of Graviton’s price-performance, Caddy’s efficiency, and properly-tuned PHP creates a stack that punches well above its weight class.
How Domain Isolation Creates Evolutionary Pressure for Better Software
After two decades building trading platforms and banking systems, I’ve watched the same pattern repeat itself countless times. A production incident occurs. The war room fills. And then the finger pointing begins.
“It’s the database team’s problem.” “No, it’s that batch job from payments.” “Actually, I think it’s the new release from the cards team.” Three weeks later, you might have an answer. Or you might just have a temporary workaround and a room full of people who’ve learned to blame each other more effectively.
This is the tragedy of the commons playing out in enterprise technology, and it’s killing your ability to evolve.
1. The Shared Infrastructure Trap
Traditional enterprise architecture loves shared infrastructure. It makes intuitive sense: why would you run fifteen database clusters when one big one will do? Why have each team manage their own message broker when a central platform team can run one for everybody? Economies of scale. Centralised expertise. Lower costs.
Except that’s not what actually happens.
What happens is that your shared Oracle RAC cluster becomes a battleground. The trading desk needs low latency queries. The batch processing team needs to run massive overnight jobs. The reporting team needs to scan entire tables. Everyone has legitimate needs, and everyone’s needs conflict with everyone else’s. The DBA team becomes a bottleneck, fielding requests from twelve different product owners, all of whom believe their work is the priority.
When the CPU spikes to 100% at 2pm on a Tuesday, the incident call has fifteen people on it, and nobody knows whose query caused it. The monitoring shows increased load, but the load comes from everywhere. Everyone claims their release was tested. Everyone points at someone else.
This isn’t a technical problem. It’s an accountability problem. And you cannot solve accountability problems with better monitoring dashboards.
2. Darwinian Pressure in Software Systems
Nature solved this problem billions of years ago. Organisms that make poor decisions suffer the consequences directly. There’s no committee meeting to discuss why the antelope got eaten. The feedback loop is immediate and unambiguous. Whilst nobody wants to watch it happen, teams secretly take comfort in not being the limping buffalo at the back of the herd. Teams get fit: they resist decisions that would leave them exposed, because they know exposure attracts an uncomfortable amount of focus from senior management.
Modern software architecture can learn from this. When you isolate domains, truly isolate them, with their own data stores, their own compute, their own failure boundaries, you create Darwinian pressure. Teams that write inefficient code see their own costs rise. Teams that deploy buggy releases see their own services degrade. Teams that don’t invest in resilience suffer their own outages.
There’s no hiding. There’s no ambiguity. There’s no three week investigation to determine fault. There is no watered down document that hints at the issue but never quite calls it out because the teams involved could not agree on anything more pointed. The feedback loop tightens from weeks to hours, sometimes minutes.
This isn’t about blame. It’s about learning. When the consequences of your decisions land squarely on your own service, you learn faster. You care more. You invest in the right things because you directly experience the cost of not investing.
3. The Architecture of Isolation
Achieving genuine domain isolation requires more than just drawing boxes on a whiteboard and calling them “microservices.” It requires rethinking how domains interact with each other and with their data.
Data Localisation Through Replication
The hardest shift for most organisations is accepting that data duplication isn’t a sin. In a shared database world, we’re taught that the single source of truth is sacred. Duplicate data creates consistency problems. Normalisation is good.
But in a distributed world, the shared database is the coupling that prevents isolation. If three domains query the same customer table, they’re coupled. An index change that helps one domain might destroy another’s performance. A schema migration requires coordinating across teams. The tragedy of the commons returns.
Instead, each domain should own its data. If another domain needs that data, replicate it. Event driven patterns work well here: when a customer’s address changes, publish an event. Subscribing domains update their local copies. Yes, there’s eventual consistency. Yes, the data might be milliseconds or seconds stale. But in exchange, each domain can optimise its own data structures for its own access patterns, make schema changes without coordinating with half the organisation, and scale its data tier independently.
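A minimal sketch of that pattern with the kafka-python client is shown below. The topic name, event shape, and in-memory local store are illustrative assumptions, not a prescribed schema; in practice the subscribing domain would write into its own database.

```python
# Event-driven replication sketch: the owning domain publishes a change event,
# and a subscribing domain maintains its own local copy. Names are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

# Owning domain: publish an event whenever a customer's address changes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("customer.address.changed",
              {"customer_id": "C123", "address": "1 New Street"})
producer.flush()

# Subscribing domain (a separate service): keep a local copy optimised for its
# own access patterns, updated as events arrive.
consumer = KafkaConsumer(
    "customer.address.changed",
    bootstrap_servers="localhost:9092",
    group_id="payments-domain",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
local_addresses = {}  # stand-in for the domain's own data store
for record in consumer:
    local_addresses[record.value["customer_id"]] = record.value["address"]
```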
Queues as Circuit Breakers
Synchronous service to service calls are the other hidden coupling that defeats isolation. When the channel service calls the fraud service, and the fraud service calls the customer service, you’ve created a distributed monolith. A failure anywhere propagates everywhere. An outage in customer data brings down payments.
Asynchronous messaging changes this dynamic entirely. When a payment needs fraud checking, it drops a message on a queue. If the fraud service is slow or down, the queue absorbs the backlog. The payment service doesn’t fail, it just sees increased latency on fraud decisions. Customers might wait a few extra seconds for approval rather than seeing an error page.
This doesn’t make the fraud service’s problems disappear. The fraud team still needs to fix their outage, but you can make business choices about how to deal with the outage. For example, you can choose to bypass the checks for payments to “known” beneficiaries or below certain threshold values, so the blast radius is contained and can be managed. The payments team’s SLAs aren’t destroyed by someone else’s incident. The Darwinian pressure lands where it belongs: on the team whose service is struggling.
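The sketch below illustrates that kind of business fallback at the point where the backlog can no longer be absorbed. The queue size, threshold, and field names are illustrative choices that would belong to the payments team.

```python
# Payments-side fallback for when the fraud-check backlog is at capacity.
import queue

fraud_check_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)
AUTO_APPROVE_LIMIT = 100.00  # illustrative threshold owned by the payments team

def submit_payment(payment: dict) -> str:
    try:
        fraud_check_queue.put_nowait(payment)   # normal path: asynchronous fraud check
        return "PENDING_FRAUD_CHECK"
    except queue.Full:
        # The fraud service is struggling and the queue can absorb no more, so
        # apply the agreed business fallback instead of failing the payment.
        if payment.get("beneficiary_known") or payment["amount"] <= AUTO_APPROVE_LIMIT:
            return "APPROVED_WITH_DEFERRED_CHECK"
        return "QUEUED_FOR_MANUAL_REVIEW"
```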
Proxy Layers for Graceful Degradation
Not everything can be asynchronous. Sometimes you need a real time answer. But even synchronous dependencies can be isolated through intelligent proxy layers.
A well designed proxy can cache responses, serve stale data during outages, fall back to default behaviours, and implement circuit breakers that fail fast rather than hanging. When the downstream service returns, the proxy heals automatically.
The key insight is that the proxy belongs to the calling domain, not the called domain. The payments team decides how to handle fraud service failures. Maybe they approve transactions under a certain threshold automatically. Maybe they queue high value transactions for manual review. The fraud team doesn’t need to know or care, they just need to get their service healthy again.
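A sketch of such a calling-side proxy is shown below. The fraud-service URL, the 30 second open interval, and the approval threshold are illustrative; a production breaker would trip on an error rate rather than a single failure and would bound how stale cached responses are allowed to become.

```python
# Calling-side proxy owned by the payments domain: cache, circuit breaker,
# and a domain-chosen fallback. URLs and thresholds are illustrative.
import time
import requests

class FraudCheckProxy:
    def __init__(self, url: str = "https://fraud.internal/check", open_for: float = 30.0):
        self.url = url
        self.open_until = 0.0      # while in the future, the breaker fails fast
        self.open_for = open_for
        self.cache = {}            # last known-good responses, possibly stale

    def check(self, payment: dict) -> dict:
        key = payment["beneficiary_id"]
        if time.monotonic() < self.open_until:
            return self._fallback(payment, key)   # circuit open: do not call downstream
        try:
            resp = requests.post(self.url, json=payment, timeout=0.5)
            resp.raise_for_status()
            result = resp.json()
            self.cache[key] = result              # refresh the stale-serving cache
            return result
        except requests.RequestException:
            self.open_until = time.monotonic() + self.open_for  # trip the breaker
            return self._fallback(payment, key)

    def _fallback(self, payment: dict, key: str) -> dict:
        if key in self.cache:
            return self.cache[key]                # serve stale data during the outage
        # Payments-owned default behaviour, not the fraud team's concern.
        return {"decision": "approve" if payment["amount"] < 100 else "manual_review"}
```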
4. Escaping the Monolith: Strategies for Service Eviction
Understanding the destination is one thing. Knowing how to get there from where you are is another entirely. Most enterprises aren’t starting with a blank slate. They’re staring at a decade old shared Oracle database with three hundred stored procedures, an enterprise service bus that routes traffic for forty applications, and a monolithic core banking system that everyone is terrified to touch.
The good news is that you don’t need to rebuild everything from scratch. The better news is that you can create structural incentives that make migration inevitable rather than optional.
Service Eviction: Making the Old World Uncomfortable
Service eviction is the deliberate practice of making shared infrastructure progressively less attractive to use while making domain-isolated alternatives progressively more attractive. This isn’t about being obstructive. It’s about aligning incentives with architecture.
Start with change management. On shared infrastructure, every change requires coordination. You need a CAB ticket. You need sign-off from every consuming team. You need a four week lead time and a rollback plan approved by someone three levels up. The change window is 2am Sunday, and if anything goes wrong, you’re in a war room with fifteen other teams.
On domain isolated services, changes are the team’s own business. They deploy when they’re ready. They roll back if they need to. Nobody else is affected because nobody else shares their infrastructure. The contrast becomes visceral: painful, bureaucratic change processes on shared services versus autonomous, rapid iteration on isolated ones.
This isn’t artificial friction. It’s honest friction. Shared infrastructure genuinely does require more coordination because changes genuinely do affect more people. You’re just making the hidden costs visible and letting teams experience them directly.
Data Localisation Through Kafka: Breaking the Database Coupling
The shared database is usually the hardest dependency to break. Everyone queries it. Everyone depends on its schema. Moving data feels impossibly risky.
Kafka changes the game by enabling data localisation without requiring big bang migrations. The pattern works like this: identify a domain that wants autonomy. Have the source system publish events to Kafka whenever relevant data changes. Have the target domain consume those events and maintain its own local copy of the data it needs.
Initially, this looks like unnecessary duplication. The data exists in Oracle and in the domain’s local store. But that duplication is exactly what enables isolation. The domain can now evolve its schema independently. It can optimise its indexes for its access patterns. It can scale its data tier without affecting anyone else. And critically, it can be tested and deployed without coordinating database changes with twelve other teams.
Kafka’s log based architecture makes this particularly powerful. New consumers can replay history to bootstrap their local state. The event stream becomes the source of truth for what changed and when. Individual domains derive their local views from that stream, each optimised for their specific needs.
The key insight is that you’re not migrating data. You’re replicating it through events until the domain no longer needs to query the shared database directly. Once every query can be served from local data, the coupling is broken. The shared database becomes a publisher of events rather than a shared resource everyone depends on.
The Strangler Fig: Gradual Replacement Without Risk
The strangler fig pattern, named after the tropical tree that gradually envelops and replaces its host, is the safest approach to extracting functionality from monoliths. Rather than replacing large systems wholesale, you intercept specific functions at the boundary and gradually route traffic to new implementations.
Put a proxy in front of the monolith. Initially, it routes everything through unchanged. Then, one function at a time, build the replacement in the target domain. Route traffic for that function to the new service while everything else continues to hit the monolith. When the new service is proven, remove the old code from the monolith.
The beauty of this approach is that failure is localised and reversible. If the new service has issues, flip the routing back. The monolith is still there, still working. You haven’t burned any bridges. You can take the time to get it right because you’re not under pressure from a hard cutover deadline.
Combined with Kafka-based data localisation, the strangler pattern becomes even more powerful. The new domain service consumes events to build its local state, the proxy routes relevant traffic to it, and the old monolith gradually loses responsibilities until what remains is small enough to either rewrite completely or simply turn off.
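To make the routing decision concrete, here is a small sketch using Flask and requests. The paths and upstream URLs are placeholders, and in practice this logic usually lives in an API gateway or load balancer rather than hand-rolled code; the point is that routing is per function and trivially reversible.

```python
# Strangler-fig routing sketch: one extracted function goes to the new domain
# service, everything else still hits the monolith. URLs are illustrative.
from flask import Flask, Response, request
import requests

app = Flask(__name__)

MONOLITH = "http://monolith.internal"
ROUTES = {
    "/payments/fraud-check": "http://fraud-domain.internal",  # extracted so far
}

@app.route("/<path:path>", methods=["GET", "POST"])
def route(path: str) -> Response:
    # Pick the upstream: first matching extracted prefix, otherwise the monolith.
    target = next((base for prefix, base in ROUTES.items()
                   if f"/{path}".startswith(prefix)), MONOLITH)
    upstream = requests.request(
        method=request.method,
        url=f"{target}/{path}",
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        data=request.get_data(),
        timeout=5,
    )
    return Response(upstream.content, status=upstream.status_code)
```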
Asymmetric Change Management: The Hidden Accelerator
This is the strategy that sounds controversial but works remarkably well: make change management deliberately asymmetric between shared services and domain isolated services.
On the shared database or monolith, changes require extensive governance. Four week CAB cycles. Impact assessments signed off by every consuming team. Mandatory production support during changes. Post-implementation reviews. Change freezes around month-end, quarter-end, and peak trading periods.
On domain-isolated services, teams own their deployment pipeline end to end. They can deploy multiple times per day if their automation supports it. No CAB tickets. No external sign offs. If they break their own service, they fix their own service.
This asymmetry isn’t punitive. It reflects genuine risk. Changes to shared infrastructure genuinely do have broader blast radius. They genuinely do require more coordination. You’re simply making the cost of that coordination visible rather than hiding it in endless meetings and implicit dependencies.
The effect is predictable. Teams that want to move fast migrate to domain isolation. Teams that are comfortable with quarterly releases can stay on shared infrastructure. Over time, the ambitious teams have extracted their most critical functionality into isolated domains. What remains on shared infrastructure is genuinely stable, rarely changing functionality that doesn’t need rapid iteration.
The natural equilibrium is that shared infrastructure becomes genuinely shared: common utilities, reference data, things that change slowly and benefit from centralisation. Everything else migrates to where it can evolve independently.
The Migration Playbook
Put it together and the playbook looks like this:
First, establish Kafka as your enterprise event backbone. Every system of record publishes events when data changes. This is table stakes for everything else.
Second, identify a domain with high change velocity that’s suffering under shared infrastructure governance. They’re your early adopter. Help them establish their own data store, consuming events from Kafka to maintain local state.
Third, put a strangler proxy in front of relevant monolith functions. Route traffic to the new domain service. Prove it works. Remove the old implementation.
Fourth, give the domain team autonomous deployment capability. Let them experience the difference between deploying through a four-week CAB cycle versus deploying whenever they’re ready.
Fifth, publicise the success. Other teams will notice. They’ll start asking for the same thing. Now you have demand driven migration rather than architecture-mandated migration.
The key is that you’re not forcing anyone to migrate. You’re creating conditions where migration is obviously attractive. The teams that care about velocity self select. The shared infrastructure naturally shrinks to genuinely shared concerns.
5. The Cultural Shift
Architecture is easy compared to culture. You can draw domain boundaries in a week. Convincing people to live within them takes years.
The shared infrastructure model creates a particular kind of learned helplessness. When everything is everyone’s problem, nothing is anyone’s problem. Teams optimise for deflecting blame rather than improving reliability. Political skills matter more than engineering skills. The best career move is often to avoid owning anything that might fail.
Domain isolation flips this dynamic. Teams own their outcomes completely. There’s nowhere to hide, but there’s also genuine autonomy. You can choose your own technology stack. You can release when you’re ready without coordinating with twelve other teams. You can invest in reliability knowing that you’ll reap the benefits directly.
This autonomy attracts a different kind of engineer. People who want to own things. People who take pride in uptime and performance. People who’d rather fix problems than explain why problems aren’t their fault.
The teams that thrive under this model are the ones that learn fastest. They build observability into everything because they need to understand their own systems. They invest in automated testing because they can’t blame someone else when their deploys go wrong. They design for failure because they know they’ll be the ones getting paged.
The teams that don’t adapt… well, that’s the Darwinian part. Their services become known as unreliable. Other teams design around them. Eventually, the organisation notices that some teams consistently deliver and others consistently struggle. The feedback becomes impossible to ignore.
6. Conway’s Law: Accepting the Inevitable, Rejecting the Unnecessary
Melvin Conway observed in 1967 that organisations design systems that mirror their communication structures. Fifty years of software engineering has done nothing to disprove him. Your architecture will reflect your org chart whether you plan for it or not.
This isn’t a problem to be solved. It’s a reality to be acknowledged. Your domain boundaries will follow team boundaries. Your service interfaces will reflect the negotiations between teams. The political realities of your organisation will manifest in your technical architecture. Fighting this is futile.
But here’s what Conway’s Law doesn’t require: shared suffering.
Traditional enterprise architecture interprets Conway’s Law as an argument for centralisation. If teams need to communicate, give them shared infrastructure to communicate through. If domains overlap, put the overlapping data in a shared database. The result is that Conway’s Law manifests not just in system boundaries but in shared pain. When one team struggles, everyone struggles. When one domain has an incident, twelve teams join the war room.
Domain isolation accepts Conway’s Law while rejecting this unnecessary coupling. Yes, your domains will align with your teams. Yes, your service boundaries will reflect organisational reality. But each team’s infrastructure can be genuinely isolated. Public cloud makes this trivially achievable through account-level separation.
Give each domain its own AWS account or Azure subscription. Their blast radius is contained by cloud provider boundaries, not just by architectural diagrams. Their cost allocation is automatic. Their security boundaries are enforced by IAM, not by policy documents. Their quotas and limits are independent. When the fraud team accidentally spins up a thousand Lambda functions, the payments team doesn’t notice because they’re in a completely separate account with separate limits.
Conway’s Law still shapes your domain design. The payments team builds payment services. The fraud team builds fraud services. The boundaries reflect the org chart. But the implementation of those boundaries can be absolute rather than aspirational. Account level isolation means that even if your domain design isn’t perfect, the consequences of imperfection are contained.
This is the insight that transforms Conway’s Law from a constraint into an enabler. You’re not fighting organisational reality. You’re aligning infrastructure isolation with organisational boundaries so that each team genuinely owns their outcomes. The communication overhead that Conway identified still exists, but it happens through well-defined APIs and event contracts rather than through shared database contention and incident calls.
7. The Transition Path
You can’t flip a switch and move from shared infrastructure to domain isolation overnight. The dependencies are too deep. The skills don’t exist. The organisational structures don’t support it.
But you can start. Pick a domain that’s struggling with the current model, probably one that’s constantly blamed for incidents they didn’t cause. Give them their own database, their own compute, their own deployment pipeline. Build the event publishing infrastructure so they can share data with other domains through replication rather than direct queries.
Watch what happens. The team will stumble initially. They’ve never had to think about database sizing or query optimisation because that was always someone else’s job. But within a few months, they’ll own it. They’ll understand their system in a way they never did before. Their incident response will get faster because there’s no ambiguity about whose system is broken.
More importantly, other teams will notice. They’ll see a team that deploys whenever they want, that doesn’t get dragged into incident calls for problems they didn’t cause, that actually controls their own destiny. They’ll start asking for the same thing.
This is how architectural change actually happens, not through mandates from enterprise architecture, but through demonstrated success that creates demand.
8. The Economics Question
I can already hear the objections. “This is more expensive. We’ll have fifteen databases instead of one. Fifteen engineering teams managing infrastructure instead of one platform team.”
To which I’d say: you’re already paying these costs, you’re just hiding them.
Every hour spent in an incident call where twelve teams try to figure out whose code caused the database to spike is a cost. Every delayed release because you’re waiting for a shared schema migration is a cost. Every workaround another team implements because your shared service doesn’t quite meet their needs is a cost. Every engineer who leaves because they’re tired of fighting political battles instead of building software is a cost.
Domain isolation makes these costs visible and allocates them to the teams that incur them. That visibility is uncomfortable, but it’s also the prerequisite for improvement.
And yes, you’ll run more database clusters. But they’ll be right sized for their workloads. You won’t be paying for headroom that exists only because you can’t predict which team will spike load next. You won’t be over provisioning because the shared platform has to handle everyone’s worst case simultaneously.
9. But surely AWS is shared infrastructure?
A common pushback when discussing domain isolation and ownership is: “But surely AWS is shared infrastructure?” The answer is yes, but that observation misses the point of what ownership actually means in a Darwinian architectural model.
Ownership here is not about blame or liability when something goes wrong. It is about control and autonomy. The critical question is not who gets blamed, but who has the ability to act, change, and learn.
AWS operates under a clearly defined Shared Responsibility Model. AWS is responsible for the security of the cloud, the physical data centres, hardware, networking, and the underlying virtualization layers. Customers are responsible for security in the cloud, everything they configure, deploy, and operate on top of that platform.
Crucially, AWS gives you complete control over the things you are responsible for. You are not handed vague obligations without tools. You are given APIs, policy engines, telemetry, and automation primitives to fully own your outcomes. Identity and access management, network boundaries, encryption, scaling policies, deployment strategies, data durability, and recovery are all explicitly within your control.
This is why AWS being “shared infrastructure” does not undermine architectural ownership. Ownership is not defined by exclusive physical hardware; it is defined by decision-making authority and freedom to evolve. A team that owns its AWS account, VPC, services, and data can change direction without negotiating with a central platform team, can experiment safely within its own blast radius, and can immediately feel the consequences of poor design decisions.
That feedback loop is the point.
From a Darwinian perspective, AWS actually amplifies evolutionary pressure. Teams that design resilient, observable, well isolated systems thrive. Teams that cut corners experience outages, cost overruns, and operational pain, quickly and unambiguously. There is no shared infrastructure committee to absorb the consequences or hide failure behind abstraction layers.
So yes, AWS is shared infrastructure — but it is shared in a way that preserves local control, clear responsibility boundaries, and fast feedback. And those are the exact conditions required for domain isolation to work, and for better software to evolve over time.
10. Evolution, Not Design
The deepest insight from evolutionary biology is that complex, well adapted systems don’t emerge from top down design. They emerge from the accumulation of countless small improvements, each one tested against reality, with failures eliminated and successes preserved.
Enterprise architecture traditionally works the opposite way. Architects design systems from above. Teams implement those designs. Feedback loops are slow and filtered through layers of abstraction. By the time the architecture proves unsuitable, it’s too deeply embedded to change.
Domain isolation enables architectural evolution. Each team can experiment within their boundary. Good patterns spread as other teams observe and adopt them. Bad patterns get contained and eventually eliminated. The overall system improves through distributed learning rather than centralised planning.
This doesn’t mean architects become irrelevant. Someone needs to define the contracts between domains, design the event schemas, establish the standards for how services discover and communicate with each other. But the architect’s role shifts from designing systems to designing the conditions under which good systems can emerge.
11. The End State
I’ve seen organisations make this transition. It takes years, not months. It requires sustained leadership commitment. It forces difficult conversations about team structure and accountability.
But the end state is remarkable. Incident calls have three people on them instead of thirty. Root cause is established in minutes instead of weeks. Teams ship daily instead of quarterly. Engineers actually enjoy their work because they’re building things instead of attending meetings about who broke what.
Pain at the Source
The core idea is deceptively simple: put the pain of an issue right next to its source. When your database is slow, you feel it. When your deployment breaks, you fix it. The feedback loop is immediate and unambiguous.
But here’s what surprises people: this doesn’t make teams selfish. Far from it.
In the shared infrastructure world, teams spend enormous energy on defence. Every incident requires proving innocence. Every performance problem demands demonstrating that your code isn’t the cause. Every outage triggers a political battle over whose budget absorbs the remediation. Teams are exhausted not from building software but from fighting for survival in an environment of ambiguous, omnipresent enterprise guilt.
Domain isolation eliminates this overhead entirely. When your service has a problem, it’s your problem. There’s no ambiguity. There’s no blame game. There’s no three week investigation. You fix it and move on.
Cooperation, Not Competition
And suddenly, teams have energy to spare.
When the fraud team struggles with a complex caching problem, the payments team can offer to help. Not because they’re implicated, not because they’re defending themselves, but because they have genuine expertise and genuine capacity. They arrive as subject matter experts, and the fraud team receives them gratefully as such. There’s no suspicion that help comes with strings attached or that collaboration is really just blame shifting in disguise.
Teams become more cooperative in this world, not less. They show off where they’ve been successful. They write internal blog posts about their observability stack. They present at tech talks about how they achieved sub second deployments. Other teams gladly copy them because there’s no competitive zero sum dynamic. Your success doesn’t threaten my budget. Your innovation doesn’t make my team look bad. We’re all trying to build great software, and we can finally focus on that instead of on survival.
Breaking Hostage Dynamics
And you’re no longer hostage to hostage hiring.
In the shared infrastructure world, a single team can hold the entire organisation ransom. They build a group wide service. It becomes critical. It becomes a disaster. Suddenly they need twenty emergency engineers or the company is at risk. The service shouldn’t exist in the first place, but now it’s too important to fail and too broken to survive without massive investment. The team that created the problem gets rewarded with headcount. The teams that built sustainable, well-designed services get nothing because they’re not on fire.
Domain isolation breaks this perverse incentive. If a team builds a disaster, it’s their disaster. They can’t hold the organisation hostage because their blast radius is contained. Other domains have already designed around them with circuit breakers and fallbacks. The failing service can be deprecated, strangled out, or left to die without taking the company with it. Emergency hiring goes to teams that are succeeding and need to scale, not to teams that are failing and need to be rescued.
The Over Partitioning Trap
I should add a warning: I’ve also seen teams inflict shared pain on themselves, even without shared infrastructure.
They do this by hiring swathes of middle managers and over partitioning into tiny subdomains. Each team becomes responsible for a minuscule pool of resources. Nobody owns anything meaningful. To compensate, they hire armies of planners to try and align these micro teams. The teams fire emails and Jira tickets at each other to inch their ten year roadmap forward. Meetings multiply. Coordination overhead explodes. The organisation has recreated shared infrastructure pain through organisational structure rather than technology.
When something fails in this model, it quickly becomes clear that only a very few people actually understand anything. These elite few become the shared gatekeepers. Without them, no team can do anything. They’re the only ones who know how the pieces fit together, the only ones who can debug cross team issues, the only ones who can approve changes that touch multiple micro domains. You’ve replaced shared database contention with shared human contention. The bottleneck has moved from Oracle to a handful of exhausted architects.
It’s critical not to over partition into tiny subdomains. A domain should be large enough that a team can own something meaningful end to end. They should be able to deliver customer value without coordinating with five other teams. They should understand their entire service, not just their fragment of a service.
These nonsensical subdomains generally only occur when non technical staff have a disproportionately loud voice in team structure. When project managers dominate the discussions and own the narrative for the services. When the org chart is designed around reporting lines and budget centres rather than around software that needs to work together. When the people deciding team boundaries have never debugged a production incident or traced a request across service boundaries.
Domain isolation only works when domains are sized correctly. Too large and you’re back to the tragedy of the commons within the domain. Too small and you’ve created a distributed tragedy of the commons where the shared resource is human coordination rather than technical infrastructure. The sweet spot is teams large enough to own meaningful outcomes and small enough to maintain genuine accountability.
The Commons Solved
The shared infrastructure isn’t completely gone. Some things genuinely benefit from centralisation. But it’s the exception rather than the rule. And crucially, the teams that use shared infrastructure do so by choice, understanding the trade offs, rather than by mandate.
The tragedy of the commons is solved not by better governance of the commons, but by eliminating the commons. Give teams genuine ownership. Let them succeed or fail on their own merits. Trust that the Darwinian pressure will drive improvement faster than any amount of central planning ever could.
Nature figured this out a long time ago. It’s time enterprise architecture caught up.
Understanding and testing your server’s maximum concurrent stream configuration is critical for both performance tuning and security hardening against HTTP/2 attacks. This guide provides comprehensive tools and techniques to test the SETTINGS_MAX_CONCURRENT_STREAMS parameter on your web servers.
This article complements our previous guide on Testing Your Website for HTTP/2 Rapid Reset Vulnerabilities from a macOS. While that article focuses on the CVE-2023-44487 Rapid Reset attack, this guide helps you verify that your server properly enforces stream limits, which is a critical defense mechanism.
2. Why Test Stream Limits?
The SETTINGS_MAX_CONCURRENT_STREAMS setting determines how many concurrent requests a client can multiplex over a single HTTP/2 connection. Testing this limit is important because:
Security validation: Confirms your server enforces reasonable stream limits
Configuration verification: Ensures your settings match security recommendations (typically 100-128 streams)
Performance tuning: Helps optimize the balance between throughput and resource consumption
Attack surface assessment: Identifies if servers accept dangerously high stream counts
3. Understanding HTTP/2 Stream Limits
When an HTTP/2 connection is established, the server sends a SETTINGS frame that includes:
SETTINGS_MAX_CONCURRENT_STREAMS: 100
This tells the client the maximum number of concurrent streams allowed. A compliant client should respect this limit, but attackers will not.
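As a quick first check, the advertised value can be read straight from the server’s SETTINGS frame. Below is a minimal sketch using the Python h2 library, which the tester described later also builds on; the host, port, and short read loop are simplifying assumptions, and error handling is omitted.

```python
# Read SETTINGS_MAX_CONCURRENT_STREAMS as advertised by the server.
# Assumes the host speaks HTTP/2 over TLS (ALPN "h2").
import socket
import ssl
import h2.connection
import h2.events
import h2.settings

def advertised_max_streams(host: str, port: int = 443):
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2"])
    sock = ctx.wrap_socket(socket.create_connection((host, port), timeout=5),
                           server_hostname=host)
    conn = h2.connection.H2Connection()
    conn.initiate_connection()
    sock.sendall(conn.data_to_send())

    limit = None
    for _ in range(5):  # read a few frames until the server's SETTINGS arrives
        events = conn.receive_data(sock.recv(65535))
        sock.sendall(conn.data_to_send())  # send the SETTINGS acknowledgement
        for event in events:
            if isinstance(event, h2.events.RemoteSettingsChanged):
                changed = event.changed_settings.get(
                    h2.settings.SettingCodes.MAX_CONCURRENT_STREAMS)
                if changed is not None:
                    limit = changed.new_value
        if limit is not None:
            break
    sock.close()
    return limit  # None means the server did not advertise a limit

print(advertised_max_streams("example.com"))
```

A None result corresponds to the no-advertised-limit scenario discussed later in this guide.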
3.1. Common Default Values
Web Servers:
Nginx: 128 (configurable via http2_max_concurrent_streams)
Apache: 100 (configurable via H2MaxSessionStreams)
Caddy: 250 (configurable via max_concurrent_streams)
LiteSpeed: 100 (configurable in admin panel)
Reverse Proxies and Load Balancers:
HAProxy: No default limit (should be explicitly configured)
Envoy: 100 (configurable via max_concurrent_streams)
Traefik: 250 (configurable via maxConcurrentStreams)
CDN and Cloud Services:
CloudFlare: 128 (managed automatically)
AWS ALB: 128 (managed automatically)
Azure Front Door: 100 (managed automatically)
4. The Stream Limit Testing Script
The stream limit testing script (http2_stream_limit_tester.py, used in the commands later in this guide) is built on the Python h2 library. It will:
Connect to your HTTP/2 server
Read the advertised SETTINGS_MAX_CONCURRENT_STREAMS value
Attempt to open more streams than the advertised limit
Verify that the server actually enforces the limit
The test reports the following metrics:
Advertised max streams: What the server claims to support
Successful stream opens: How many streams were successfully created
Failed stream opens: Streams that failed to open
Streams reset by server: Streams terminated by the server (enforcement)
Actual max achieved: The real concurrent stream limit
6.1. Example Output
Testing HTTP/2 Stream Limits:
Target: example.com:443
Max streams to test: 200
Batch size: 10
============================================================
Server advertised limit: 128 concurrent streams
Opening batch of 10 streams (total: 10)...
Opening batch of 10 streams (total: 20)...
Opening batch of 10 streams (total: 130)...
WARNING: 5 stream(s) were reset by server
Stream limit enforcement detected
============================================================
STREAM LIMIT TEST RESULTS
============================================================
Server Configuration:
Advertised max streams: 128
Test Statistics:
Successful stream opens: 130
Failed stream opens: 0
Streams reset by server: 5
Actual max achieved: 125
Test duration: 3.45s
Enforcement:
Stream limit enforcement: DETECTED
============================================================
ASSESSMENT
============================================================
Advertised limit (128) is within recommended range
Server actively enforces stream limits
Stream limit protection is working correctly
============================================================
7. Interpreting Different Scenarios
7.1. Scenario 1: Proper Enforcement
Advertised max streams: 100
Successful stream opens: 105
Streams reset by server: 5
Actual max achieved: 100
Stream limit enforcement: DETECTED
Analysis: Server properly enforces the limit. Configuration is working exactly as expected.
7.2. Scenario 2: No Enforcement
Advertised max streams: 128
Successful stream opens: 200
Streams reset by server: 0
Actual max achieved: 200
Stream limit enforcement: NOT DETECTED
Analysis: Server accepts far more streams than advertised. This is a potential vulnerability that should be investigated.
7.3. Scenario 3: No Advertised Limit
Advertised max streams: Not specified
Successful stream opens: 200
Streams reset by server: 0
Actual max achieved: 200
Stream limit enforcement: NOT DETECTED
Analysis: Server does not advertise or enforce limits. High risk configuration that requires immediate remediation.
7.4. Scenario 4: Conservative Limit
Advertised max streams: 50
Successful stream opens: 55
Streams reset by server: 5
Actual max achieved: 50
Stream limit enforcement: DETECTED
Analysis: Very conservative limit. Good for security but may impact performance for legitimate high-throughput applications.
8. Monitoring During Testing
8.1. Server Side Monitoring
While running tests, monitor your server for resource utilization and connection metrics.
You can use both the stream limit tester and the Rapid Reset tester together for comprehensive HTTP/2 security assessment:
# Step 1: Test stream limits
python3 http2_stream_limit_tester.py --host example.com
# Step 2: Test rapid reset with IP spoofing
sudo python3 http2rapidresettester_macos.py \
--host example.com \
--cidr 192.168.1.0/24 \
--packets 1000
# Step 3: Re-test stream limits to verify no degradation
python3 http2_stream_limit_tester.py --host example.com
11. Security Best Practices
11.1. Configuration Guidelines
Set explicit limits: Never rely on default values
Use conservative values: 100-128 streams is the recommended range
Monitor enforcement: Regularly verify that limits are actually being enforced
Document settings: Maintain records of your stream limit configuration
Test after changes: Always test after configuration modifications
11.2. Defense in Depth
Stream limits should be one layer in a comprehensive security strategy:
Stream limits: Prevent excessive concurrent streams per connection
Connection limits: Limit total connections per IP address
Request rate limiting: Throttle requests per second
Resource quotas: Set memory and CPU limits
WAF/DDoS protection: Use cloud-based or on-premise DDoS mitigation
11.3. Regular Testing Schedule
Establish a regular testing schedule:
Weekly: Automated basic stream limit tests
Monthly: Comprehensive security testing including Rapid Reset
After changes: Always test after configuration or infrastructure changes
Quarterly: Full security audit including penetration testing
12. Troubleshooting
12.1. Common Errors
Error: “SSL: CERTIFICATE_VERIFY_FAILED”
This occurs when testing against servers with self-signed certificates. For testing purposes only, you can modify the script to skip certificate verification (not recommended for production testing).
12.2. Streams Not Being Reset
If streams are not being reset despite exceeding the advertised limit:
Server may not be enforcing limits properly
Configuration may not have been applied (restart required)
Server may be using a different enforcement mechanism
Limits may be set at a different layer (load balancer vs web server)
12.3. High Failure Rate
If many streams fail to open:
Network connectivity issues
Firewall blocking requests
Server resource exhaustion
Rate limiting triggering prematurely
13. Understanding the Attack Surface
When testing your infrastructure, consider all HTTP/2 endpoints:
Web servers: Nginx, Apache, IIS
Load balancers: HAProxy, Envoy, ALB
API gateways: Kong, Tyk, AWS API Gateway
CDN endpoints: CloudFlare, Fastly, Akamai
Reverse proxies: Traefik, Caddy
13.1. Testing Strategy
Test at multiple layers:
# Test CDN edge
python3 http2_stream_limit_tester.py --host cdn.example.com
# Test load balancer directly
python3 http2_stream_limit_tester.py --host lb.example.com
# Test origin server
python3 http2_stream_limit_tester.py --host origin.example.com
14. Conclusion
Testing your HTTP/2 maximum concurrent streams configuration is essential for maintaining a secure and performant web infrastructure. This tool allows you to:
Verify that your server advertises appropriate stream limits
Confirm that advertised limits are actually enforced
Identify misconfigurations before they can be exploited
Tune performance while maintaining security
Regular testing, combined with proper configuration and monitoring, will help protect your infrastructure against HTTP/2-based attacks while maintaining optimal performance for legitimate users.
This guide and testing script are provided for educational and defensive security purposes only. Always obtain proper authorization before testing systems you do not own.
In August 2023, a critical zero day vulnerability in the HTTP/2 protocol was disclosed that affected virtually every HTTP/2 capable web server and proxy. Known as HTTP/2 Rapid Reset (CVE-2023-44487), this vulnerability enabled attackers to launch devastating Distributed Denial of Service (DDoS) attacks with minimal resources. Google reported mitigating the largest DDoS attack ever recorded at the time (398 million requests per second) leveraging this technique.
Understanding this vulnerability and knowing how to test your infrastructure against it is crucial for maintaining a secure and resilient web presence. This guide provides a flexible testing tool specifically designed for macOS that uses hping3 for packet crafting with CIDR based source IP address spoofing capabilities.
What is HTTP/2 Rapid Reset?
The HTTP/2 Protocol Foundation
HTTP/2 introduced multiplexing, allowing multiple streams (requests/responses) to be sent concurrently over a single TCP connection. Each stream has a unique identifier and can be independently managed. To cancel a stream, HTTP/2 uses the RST_STREAM frame, which immediately terminates the stream and signals that no further processing is needed.
The Vulnerability Mechanism
The HTTP/2 Rapid Reset attack exploits the asymmetry between client cost and server cost:
Client cost: Sending a request followed immediately by a RST_STREAM frame is computationally trivial
Server cost: Processing the incoming request (parsing headers, routing, backend queries) consumes significant resources before the cancellation is received
An attacker can:
Open an HTTP/2 connection
Send thousands of requests with incrementing stream IDs
Immediately cancel each request with RST_STREAM frames
Repeat this cycle at extremely high rates
The server receives these requests and begins processing them. Even though the cancellation arrives milliseconds later, the server has already invested CPU, memory, and I/O resources. By sending millions of request cancel pairs per second, attackers can exhaust server resources with minimal bandwidth.
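At the protocol level, the request/cancel pattern looks like the sketch below, shown here with the Python h2 library at deliberately low volume. It illustrates the mechanism only, not the hping3-based tester presented later in this guide, and should only ever be pointed at infrastructure you own and are authorised to test.

```python
# Illustration of the request/cancel asymmetry: each pair costs the client
# almost nothing to produce. Low volume only, against servers you own.
import socket
import ssl
import h2.connection
import h2.errors

def request_then_cancel(host: str, pairs: int = 5) -> None:
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2"])
    sock = ctx.wrap_socket(socket.create_connection((host, 443), timeout=5),
                           server_hostname=host)
    conn = h2.connection.H2Connection()
    conn.initiate_connection()

    headers = [(":method", "GET"), (":path", "/"),
               (":scheme", "https"), (":authority", host)]
    stream_id = 1
    for _ in range(pairs):
        conn.send_headers(stream_id, headers, end_stream=True)                # open a request
        conn.reset_stream(stream_id, error_code=h2.errors.ErrorCodes.CANCEL)  # cancel it at once
        stream_id += 2                                                        # client streams are odd
    sock.sendall(conn.data_to_send())  # HEADERS and RST_STREAM frames leave back to back
    sock.close()
```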
Why It’s So Effective
Traditional rate limiting and DDoS mitigation techniques struggle against Rapid Reset attacks because:
Low bandwidth usage: The attack uses minimal data (mostly HTTP/2 frames with small headers)
Valid protocol behavior: RST_STREAM is a legitimate HTTP/2 mechanism
Connection reuse: Attackers multiplex thousands of streams over relatively few connections
Amplification: Each cheap client operation triggers expensive server side processing
How to Guard Against HTTP/2 Rapid Reset
1. Update Your Software Stack
Immediate Priority: Ensure all HTTP/2 capable components are patched:
Web Servers:
Nginx 1.25.2+ or 1.24.1+
Apache HTTP Server 2.4.58+
Caddy 2.7.4+
LiteSpeed 6.0.12+
Reverse Proxies and Load Balancers:
HAProxy 2.8.2+ or 2.6.15+
Envoy 1.27.0+
Traefik 2.10.5+
CDN and Cloud Services:
CloudFlare (auto patched August 2023)
AWS ALB/CloudFront (patched)
Azure Front Door (patched)
Google Cloud Load Balancer (patched)
Application Servers:
Tomcat 10.1.13+, 9.0.80+
Jetty 12.0.1+, 11.0.16+, 10.0.16+
Node.js 20.8.0+, 18.18.0+
2. Implement Stream Limits
Configure strict limits on HTTP/2 stream behavior, for example via Nginx’s http2_max_concurrent_streams or Apache’s H2MaxSessionStreams directives. If patching is not immediately possible, HTTP/2 can be disabled in favor of HTTP/1.1.
Note: This reduces performance benefits but eliminates the vulnerability.
Testing Script for HTTP/2 Rapid Reset Vulnerabilities on macOS
The testing script (http2rapidresettester_macos.py) is a parameterized Python tool that tests your web servers using hping3 for packet crafting. It is specifically optimized for macOS and can spoof source IP addresses from a CIDR block to simulate distributed attacks. Using hping3 ensures IP spoofing works consistently across different network environments.
Gradual escalation test (start small, increase if needed):
# Start with 50 packets
sudo python3 http2rapidresettester_macos.py --host example.com --cidr 192.168.1.0/24 --packets 50
# If server handles it well, increase
sudo python3 http2rapidresettester_macos.py --host example.com --cidr 192.168.1.0/24 --packets 200
# Final aggressive test
sudo python3 http2rapidresettester_macos.py --host example.com --cidr 192.168.1.0/24 --packets 1000
Interpreting Results
The script outputs packet statistics including:
Total packets sent (SYN and RST combined)
Number of SYN packets
Number of RST packets
Failed packet count
Number of unique source IPs used
Average packet rate
Test duration
What to Monitor
Monitor your target server for:
Connection state table exhaustion: Check netstat or ss output for connection counts
CPU and memory utilization spikes: Use Activity Monitor or top command
Application performance degradation: Monitor response times and error rates
Firewall or rate limiting triggers: Check firewall logs and rate limiting counters
Protected Server Indicators
High failure rate in the test results
Server actively blocking or rate limiting connections
Firewall rules triggering during test
Connection resets from the server
Vulnerable Server Indicators
All packets successfully sent with low failure rate
No rate limiting or blocking observed
Server continues processing all requests
Resource utilization climbs steadily
Why hping3 for macOS?
Using hping3 provides several advantages for macOS users:
Universal IP Spoofing Support
Consistent behavior: hping3 provides reliable IP spoofing across different network configurations
Proven tool: Industry standard for packet crafting and network testing
Better compatibility: Works with most network interfaces and routing configurations
macOS Specific Benefits
Native support: Works well with macOS network stack
Firewall compatibility: Better integration with macOS firewall
Performance: Efficient packet generation on macOS
Reliability Advantages
Mature codebase: hping3 has been battle tested for decades
Active community: Well documented with extensive community support
Cross platform: Same tool works on Linux, BSD, and macOS
macOS Installation and Setup
Installing hping3
# Using Homebrew (recommended)
brew install hping
# Verify installation
which hping3
hping3 --version
Firewall Configuration
macOS firewall may need configuration for raw packet injection:
Open System Settings > Network > Firewall (System Preferences > Security & Privacy > Firewall on older macOS versions)
Click “Firewall Options”
Add Python to allowed applications
Grant network access when prompted
Alternatively, for testing environments:
# Temporarily disable firewall (not recommended for production)
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
# Re-enable after testing
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate on
Network Interfaces
List available network interfaces:
ifconfig
Common macOS interfaces:
en0: Primary Ethernet/WiFi
en1: Secondary network interface
lo0: Loopback interface
bridge0: Bridged interface (if using virtualization)
Best Practices for Testing
Start with staging/test environments: Never run aggressive tests against production without authorization
Coordinate with your team: Inform security and operations teams before testing
Monitor server metrics: Watch CPU, memory, and connection counts during tests
Test during low traffic periods: Minimize impact on real users if testing production
Gradual escalation: Start with conservative parameters and increase gradually
Document results: Keep records of test results and any configuration changes
Have rollback plans: Be prepared to quickly disable testing if issues arise
Troubleshooting
Packets Not Being Sent
Firewall blocking: Temporarily disable firewall or add exception
Interface not active: Check ifconfig output
Permission issues: Ensure running with sudo
Wrong interface: Specify the interface explicitly with hping3's -I flag
Low Packet Rate
Performance optimization tips:
Use wired Ethernet instead of WiFi
Close other network intensive applications
Reduce packet rate target with --packetrate
Use smaller CIDR blocks
Monitoring Your Tests
Using tcpdump
Monitor packets in real time:
# Watch SYN packets
sudo tcpdump -i en0 'tcp[tcpflags] & tcp-syn != 0' -n
# Watch RST packets
sudo tcpdump -i en0 'tcp[tcpflags] & tcp-rst != 0' -n
# Watch specific host and port
sudo tcpdump -i en0 host example.com and port 443 -n
# Save to file for later analysis
sudo tcpdump -i en0 -w test_capture.pcap host example.com
Using Wireshark
For detailed packet analysis:
# Install Wireshark
brew install --cask wireshark
# Run Wireshark
sudo wireshark
# Or use tshark for command line
tshark -i en0 -f "host example.com"
Activity Monitor
Monitor system resources during testing:
Open Activity Monitor (Applications > Utilities > Activity Monitor)
Select “Network” tab
Watch “Packets in” and “Packets out”
Monitor “Data sent/received”
Check CPU usage of Python process
Server Side Monitoring
On your target server, monitor:
# Connection states
netstat -an | grep :443 | awk '{print $6}' | sort | uniq -c
# Active connections count
netstat -an | grep ESTABLISHED | wc -l
# SYN_RECV connections
netstat -an | grep SYN_RECV | wc -l
# System resources (Linux)
top -bn1 | head -10
# System resources (macOS)
top -l 1 | head -10
Understanding IP Spoofing with hping3
How It Works
hping3 creates raw packets at the network layer, allowing you to specify arbitrary source IP addresses. This bypasses normal TCP/IP stack restrictions.
Network Requirements
For IP spoofing to work effectively:
Local networks: Works best on LANs you control
Direct routing: Requires direct layer 2 access
No NAT interference: NAT devices may rewrite source addresses
Router configuration: Some routers filter spoofed packets (BCP 38)
Testing Without Spoofing
If IP spoofing is not working in your environment:
# Test without CIDR block
sudo python3 http2rapidresettester_macos.py --host example.com --packets 1000
# This still validates:
# - Rate limiting configuration
# - Stream management
# - Server resilience
# - Resource consumption patterns
The HTTP/2 Rapid Reset vulnerability represents a significant threat to web infrastructure, but with proper patching, configuration, and monitoring, you can effectively protect your systems. This macOS optimized testing script using hping3 allows you to validate your defenses in a controlled manner with reliable IP spoofing capabilities across different network environments.
Remember that security is an ongoing process. Regularly:
Update your web server and proxy software
Review and adjust HTTP/2 configuration limits
Monitor for unusual traffic patterns
Test your defenses against emerging threats
By staying vigilant and proactive, you can maintain a resilient web presence capable of withstanding sophisticated DDoS attacks.
This blog post and testing script are provided for educational and defensive security purposes only. Always obtain proper authorization before testing systems you do not own.
For most of modern banking history, stability was assumed to increase with size. The thinking was simple: the bigger you are, the more care you can afford to take and the more resources you can apply to problems. Larger banks had more capital, more infrastructure, and more people. In a pre-cloud world, this assumption appeared reasonable.
In practice, the opposite was often true.
Before cloud computing and elastic infrastructure, the larger a bank became, the more unstable it was under stress and the harder it was to maintain any kind of delivery cadence. Scale amplified fragility. In 2025, architecture (not size) has become the primary determinant of banking stability.
2. Scale, Fragility, and Quantum Entanglement
Traditional banking platforms were built on vertically scaled systems: mainframes, monolithic databases, and tightly coupled integration layers. These systems were engineered for control and predictability, not for elasticity or independent change.
As banks grew, they didn’t just add clients. They added products. Each new product introduced new dependencies, shared data models, synchronous calls, and operational assumptions. Over time, this created a state best described as quantum entanglement.
In this context, quantum entanglement refers to systems where:
Products cannot change independently
A change in one area unpredictably affects others
The full impact of change only appears under real load
Cause and effect are separated by time, traffic, and failure conditions
The larger the number of interdependent products, the more entangled the system becomes.
2.1 Why Entanglement Reduces Stability
As quantum entanglement increases, change becomes progressively riskier. Even small modifications require coordination across multiple teams and systems. Release cycles slow and defensive complexity increases.
Recovery also becomes harder. When something breaks, rolling back a single change is rarely sufficient because multiple products may already be in partially failed or inconsistent states.
Fault finding degrades as well. Logs, metrics, and alerts point in multiple directions. Symptoms appear far from root causes, forcing engineers to chase secondary effects rather than underlying faults.
Most importantly, blast radius expands. A fault in one product propagates through shared state and synchronous dependencies, impacting clients who weren’t using the originating product at all.
The paradox is that the very success of large banks (broad product portfolios) becomes a direct contributor to instability.
3. Why Scale Reduced Stability in the Pre-Cloud Era
Before cloud computing, capacity was finite, expensive, and slow to change. Systems scaled vertically, and failure domains were large by design.
As transaction volumes and product entanglement increased, capacity cliffs became unavoidable. Peak load failures became systemic rather than local. Recovery times lengthened and client impact widened.
Large institutions often appeared stable during normal operation but failed dramatically under stress. Smaller institutions appeared more stable largely because they had fewer entangled products and simpler operational surfaces (not because they were inherently better engineered).
Capitec experienced this first hand when its core banking SQL database hit a capacity cliff in August 2022. Recovering the service required close to 100 changes and resulted in around 40 hours of downtime. The wider service recovery took weeks, with missed payments and duplicate payments fixed on a case by case basis. It was at this point that Capitec's leadership drew a line in the sand and decided to re-engineer its entire stack from the ground up in AWS. This blog post shares a few nuggets from that engineering journey, in the hope of helping others still struggling with the burden of scale and hardened synchronous pathways.
4. Cloud Changed the Equation (But Only When Architecture Changed)
Cloud computing made it possible to break entanglement, but only for organisations willing to redesign systems to exploit it.
Horizontal scaling, availability zone isolation, managed databases, and elastic compute allow products to exist as independent domains rather than tightly bound extensions of a central core.
Institutions that merely moved infrastructure to the cloud without breaking product entanglement continue to experience the same instability patterns (only on newer hardware).
5. An Architecture Designed to Avoid Entanglement
Capitec represents a deliberate rejection of quantum entanglement.
Its entire App production stack is cloud native on AWS, Kubernetes, Kafka and Postgres. The platform is well advanced in rolling out new Java 25 runtimes, alongside ahead of time (AOT) optimisation, to further reduce scaling latency, improve startup characteristics, and increase predictability under load. All Aurora Serverless clusters are set up with read replicas, offloading read pressure from write paths. All workloads are deployed across three availability zones, ensuring resilience. Database access goes through the AWS JDBC wrapper, which enables extremely rapid failovers without waiting on DNS TTLs.
Crucially, products are isolated by design. There is no central product graph where everything depends on everything else. A word of caution, though: we are not there yet. There will always be edges that can hurt, and when you hit an edge at speed it is sometimes hard to get back on your feet. Often the downtime you experience simply creates pent up demand. Put another way, the volume that took your systems offline is now significantly less than the volume waiting for you once you recover. This means you somehow have to magically add capacity, or optimise code, during an outage in order to recover the service. The rate limiting fan club often puts a foot forward when I discuss burst recoverability. I personally don't buy this for single entity services (for a complex set of reasons). For someone like AWS, it absolutely makes sense to carry the enormous complexity of guarding services with rate limits, but I don't believe the same is true for a single entity ecosystem; in these instances, offloading is normally a purer pathway.
6. Write Guarding as a Stability Primitive
Capitec’s mobile and digital platforms employ a deliberate **write guarding** strategy.
Read only operations (such as logging into the app) are explicitly prevented from performing inline write operations. Activities like audit logging, telemetry capture, behavioural flags, and notification triggers are never executed synchronously on high volume read paths.
Instead, these concerns are offloaded asynchronously using Amazon MSK (Managed Streaming for Apache Kafka) or written to in memory data stores such as Valkey, where they can be processed later without impacting the user journey.
This design completely removes read-write contention from critical paths. Authentication storms, balance checks, and session validation no longer compete with persistence workloads. Under load, read performance remains stable because it is not coupled to downstream write capacity.
Critically, write guarding prevents database maintenance pressure (such as vacuum activity) from leaking into high volume events like logins. Expensive background work remains isolated from customer facing read paths.
Write guarding turns one of the most common failure modes in large banking systems (read traffic triggering hidden writes) into a non event. Stability improves not by adding capacity, but by removing unnecessary coupling.
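As a rough sketch of the offloading pattern described above (not Capitec's actual code: the broker address, topic name, and event shape are hypothetical, and the kafka-python client stands in for whatever producer the platform really uses):
# Hypothetical sketch: a read path that never writes inline.
# Audit/telemetry events are offloaded asynchronously to Kafka instead.
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["msk-broker-1:9092"],             # placeholder MSK endpoint
    value_serializer=lambda v: json.dumps(v).encode(),
)

def read_profile(user_id: str) -> dict:
    # Stand-in for a query against a read replica.
    return {"user": user_id}

def login(user_id: str) -> dict:
    profile = read_profile(user_id)                      # read-only path, no inline writes
    # Offload the audit event; send() is asynchronous and does not block the response.
    producer.send("login-audit-events", {"user": user_id, "event": "login"})
    return profile
The point of the sketch is the shape of the pattern: the user-facing read returns without waiting on any persistence, and the write lands on a separate, asynchronous path that can lag or be throttled without hurting logins.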
7. Virtual Threads as a Scalability Primitive
In Java 25, virtual threads are a mature, first class concurrency model. This fundamentally changes how high concurrency systems behave under load.
Virtual threads decouple application concurrency from operating system threads. Instead of being constrained by a limited pool of heavyweight threads, services can handle hundreds of thousands of concurrent blocking operations without exhausting resources.
Request handling becomes simpler. Engineers can write straightforward blocking code without introducing thread pool starvation or complex asynchronous control flow.
Tail latency improves under load. When traffic spikes, virtual threads queue cheaply rather than collapsing the system through thread exhaustion.
Operationally, virtual threads align naturally with containerised, autoscaling environments. Concurrency scales with demand, not with preconfigured thread limits.
When combined with modern garbage collectors and ahead of time optimisation, virtual threading removes an entire class of concurrency related instability that plagued earlier JVM based banking platforms.
8. Nimbleness Emerges When Entanglement Disappears
When blast zones and integration choke points disappear, teams regain the ability to move quickly without increasing systemic risk.
Domains communicate through well defined RESTful interfaces, often across separate AWS accounts, enforcing isolation as a first class property. A failure in one domain does not cascade across the organisation.
To keep this operable at scale, Capitec uses Backstage (via an internal overlay called ODIN) as its internal orchestration and developer platform. All AWS accounts, services, pipelines, and operational assets are created to a common standard. Teams consume platform capability rather than inventing infrastructure.
This eliminates configuration drift, reduces cognitive load, and ensures that every new product inherits the same security, observability, and resilience characteristics.
The result is nimbleness without fragility.
9. Operational Stability Is Observability Plus Action
In entangled systems, failures are discovered by clients and stability is measured retrospectively.
Capitec operates differently. End to end observability through Instana and its in house AI platform, Neo, correlates client side errors, network faults, infrastructure signals, and transaction failures in real time. Issues are detected as they emerge, not after they cascade.
This operational awareness allows teams to intervene early, contain issues quickly, and reduce client impact before failures escalate.
Stability, in this model, is not the absence of failure. It is fast detection, rapid containment, and decisive response.
10. Fraud Prevention Without Creating New Entanglement
Fraud is treated as a first class stability concern rather than an external control.
Payments are evaluated inline as they move through the bank. Abnormal velocity, behavioural anomalies, and account provenance are assessed continuously. Even fraud reported through the call centre is immediately reflected in what other clients see when paying from the Capitec App. Clients are presented with conscience pricking prompts for high risk payments; these frequently stop fraud, as clients abandon the payment when confronted with the risks.
Capitec runs a real time malware detection engine directly on client devices. This engine detects hooks and overlays installed by malicious applications. When malware is identified, the client’s account is immediately stopped, preventing fraudulent transactions before they occur.
Because fraud controls are embedded directly into the transaction flow, they don’t introduce additional coupling or asynchronous failure modes.
The impact is measurable. Capitec’s fraud prevention systems have prevented R300 million in client losses from fraud. In November alone, these systems saved clients a further R60 million in fraud losses.
11. The Myth of Stability Through Multicloud
Multicloud is often presented as a stability strategy. In practice, it is largely a myth.
Running across multiple cloud providers does not remove failure risk. It compounds it. Cross cloud communication can typically only be secured using IP based controls, weakening security posture. Operational complexity increases sharply as teams must reason about heterogeneous platforms, tooling, failure modes, and networking behaviour.
Most critically, multicloud does not eliminate correlated failure. If either cloud provider becomes unavailable, systems are usually unusable anyway. The result is a doubled risk surface, increased operational risk, and new inter cloud network dependencies (without a corresponding reduction in outage impact).
Multicloud increases complexity, weakens controls, and expands risk surface area without delivering meaningful resilience.
12. What Actually Improves Stability
There are better options than multicloud.
Hybrid cloud with anti-affinity on critical channels is one. For example, card rails can be placed in two physically separate data centres so that if cloud based digital channels are unavailable, clients can still transact via cards and ATMs. This provides real functional resilience rather than architectural illusion.
Multi region deployment within a single cloud provider is another. This provides geographic fault isolation without introducing heterogeneous complexity. However, this only works if the provider avoids globally scoped services that introduce hidden single points of failure. At present, only AWS consistently supports this model. Some providers expose global services (such as global front doors) that introduce global blast radius and correlated failure risk.
True resilience requires isolation of failure domains, not duplication of platforms.
13. Why Traditional Banks Still Struggle
Traditional banks remain constrained by entangled product graphs, vertically scaled cores, synchronous integration models, and architectural decisions from a different era. As product portfolios grow, quantum entanglement increases. Change slows, recovery degrades, and outages become harder to diagnose and contain.
Modernisation programmes often increase entanglement temporarily through dual run architectures, making systems more fragile before they become more stable (if they ever do).
The challenge is not talent or ambition. It is the accumulated cost of entanglement.
14. Stability at Scale Without the Traditional Trade Off
Capitec’s significance is not that it is small. It is that it is large and remains stable.
Despite operating at massive scale with a broad product surface and high transaction volumes, stability improves rather than degrades. Scale does not increase blast radius, recovery time, or change risk. It increases parallelism, isolation, and resilience.
This directly contradicts historical banking patterns where growth inevitably led to fragility. Capitec demonstrates that with the right architecture, scale and stability are no longer opposing forces.
15. Final Thought
Before cloud and autoscaling, scale and stability were inversely related. The more products a bank had, the more entangled and fragile it became.
In 2025, that relationship can be reversed (but only by breaking entanglement, isolating failure domains, and avoiding complexity masquerading as resilience).
Doing a deal with a cloud provider means nothing if transformation stalls inside the organisation. If dozens of people carry the title of CIO while quietly pulling the handbrake on the change that is required, the outcome is inevitable regardless of vendor selection.
There is also a strategic question that many institutions avoid. If forced to choose between operating in a jurisdiction that is hostile to public cloud or accessing the full advantages of cloud, waiting is not a strategy. When that jurisdiction eventually allows public cloud, the market will already be populated by banks that moved earlier, built cloud native platforms, and are now entering at scale.
Capitec is an engineering led bank whose stability and speed increase with scale. Traditional banks remain constrained by quantum entanglement baked into architectures from a different era.
These outcomes are not accidental. They are the inevitable result of architectural and organisational choices made years ago, now playing out under real world load.
1. What Are Stablecoins?
Stablecoins are a type of cryptocurrency designed to maintain a stable value by pegging themselves to a reserve asset, typically a fiat currency like the US dollar. Unlike volatile cryptocurrencies such as Bitcoin or Ethereum, which can experience dramatic price swings, stablecoins aim to provide the benefits of digital currency without the price volatility.
The most common types of stablecoins include:
Fiat collateralized stablecoins are backed by traditional currencies held in reserve at a 1:1 ratio. Examples include Tether (USDT) and USD Coin (USDC), which maintain reserves in US dollars or dollar equivalent assets.
Crypto collateralized stablecoins use other cryptocurrencies as collateral, often over collateralized to account for volatility. DAI is a prominent example, backed by Ethereum and other crypto assets.
Algorithmic stablecoins attempt to maintain their peg through automated supply adjustments based on market demand, without traditional collateral backing. These have proven to be the most controversial and risky category.
2. Why Do Stablecoins Exist?
Stablecoins emerged to solve several critical problems in both traditional finance and the cryptocurrency ecosystem.
In the crypto world, they provide a stable store of value and medium of exchange. Traders use stablecoins to move in and out of volatile positions without converting back to fiat currency, avoiding the delays and fees associated with traditional banking. They serve as a safe harbor during market turbulence and enable seamless transactions across different blockchain platforms.
For cross border payments and remittances, stablecoins offer significant advantages over traditional methods. International transfers that typically take days and cost substantial fees can be completed in minutes for a fraction of the cost. This makes them particularly valuable for workers sending money to families in other countries or businesses conducting international trade.
Stablecoins also address financial inclusion challenges. In countries with unstable currencies or limited banking infrastructure, they provide access to a stable digital currency that can be held and transferred using just a smartphone. This opens up financial services to the unbanked and underbanked populations worldwide.
2.1 How Do Stablecoins Move Money?
Stablecoins move between countries by riding on public or permissioned blockchains rather than correspondent banking rails. When a sender in one country initiates a payment, their bank or payment provider converts local currency into a regulated stablecoin (for example a USD or EUR backed token) and sends that token directly to the recipient bank’s blockchain address. The transaction settles globally in minutes with finality provided by the blockchain, not by intermediaries. To participate, a bank joins a stablecoin network by becoming an authorised issuer or distributor, integrating custody and wallet infrastructure, and connecting its core banking systems to blockchain rails via APIs. On the receiving side, the bank accepts the stablecoin, performs compliance checks (KYC, AML, sanctions screening), and redeems it back into local currency for the client’s account. Because value moves as tokens on chain rather than as messages between correspondent banks, there is no need for SWIFT messaging, nostro/vostro accounts, or multi-day settlement, resulting in faster, cheaper, and more transparent cross border payments.
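As a simplified sketch of the on chain settlement leg (assuming web3.py v6; the RPC endpoint, token contract address, account addresses, and amount below are placeholders you would replace with verified values, and key handling is deliberately omitted):
# Hypothetical sketch of an ERC-20 stablecoin transfer (the on-chain leg of the payment).
from web3 import Web3

RPC_URL = "https://example-rpc.invalid"                                  # placeholder node endpoint
TOKEN_ADDRESS = "0x0000000000000000000000000000000000000000"             # substitute the verified stablecoin contract
ERC20_ABI = [{
    "name": "transfer", "type": "function", "stateMutability": "nonpayable",
    "inputs": [{"name": "to", "type": "address"}, {"name": "value", "type": "uint256"}],
    "outputs": [{"name": "", "type": "bool"}],
}]

w3 = Web3(Web3.HTTPProvider(RPC_URL))
token = w3.eth.contract(address=Web3.to_checksum_address(TOKEN_ADDRESS), abi=ERC20_ABI)

sender = Web3.to_checksum_address("0x0000000000000000000000000000000000000001")
recipient = Web3.to_checksum_address("0x0000000000000000000000000000000000000002")
amount = 250 * 10**6    # USDC/USDT use 6 decimals, so this represents 250 tokens

tx = token.functions.transfer(recipient, amount).build_transaction({
    "from": sender,
    "nonce": w3.eth.get_transaction_count(sender),
})
# Signing and broadcasting (w3.eth.account.sign_transaction / send_raw_transaction) are
# omitted here; in practice a custody provider typically holds and uses the keys.
Everything either side of this call (fiat conversion, compliance checks, redemption into the recipient's account) happens off chain inside the participating banks or their technology partners.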
If a bank does not want the operational and regulatory burden of running its own digital asset custody, it can partner with specialist technology and infrastructure providers that offer custody, wallet management, compliance tooling, and blockchain connectivity as managed services. In this model, the bank retains the customer relationship and regulatory accountability, while the tech partner handles private key security, smart-contract interaction, transaction monitoring, and network operations under strict service-level and audit agreements. Commonly used players in this space include Fireblocks and Copper for institutional custody and secure transaction orchestration; Anchorage Digital and BitGo for regulated custody and settlement services; Circle for stablecoin issuance and on-/off-ramps (USDC); Coinbase Institutional for custody and liquidity; and Stripe or Visa for fiat to stablecoin on-ramps and payment integration. This partnership approach allows banks to move quickly into stablecoin based cross-border payments without rebuilding their core infrastructure or taking on unnecessary operational risk.
3. How Do Stablecoins Make Money?
Stablecoin issuers have developed several revenue models that can be remarkably profitable.
The primary revenue source for fiat backed stablecoins is interest on reserves. When issuers hold billions of dollars in US Treasury bills or other interest bearing assets backing their stablecoins, they earn substantial returns. For instance, with interest rates at 5%, a stablecoin issuer with $100 billion in reserves could generate $5 billion annually while still maintaining the 1:1 peg. Users typically receive no interest on their stablecoin holdings, allowing issuers to pocket the entire yield.
Transaction fees represent another revenue stream. While often minimal, the sheer volume of stablecoin transactions generates significant income. Some issuers charge fees for minting (creating) or redeeming stablecoins, particularly for large institutional transactions.
Premium services for institutional clients provide additional revenue. Banks, payment processors, and large enterprises often pay for faster settlement, higher transaction limits, dedicated support, and integration services.
Many stablecoin platforms also generate revenue through their broader ecosystem. This includes charging fees on decentralized exchanges, lending protocols, or other financial services built around the stablecoin.
3.1 The Pendle Revenue Model: Yield Trading Innovation
Pendle represents an innovative evolution in the DeFi stablecoin ecosystem through its yield trading protocol. Rather than issuing stablecoins directly, Pendle creates markets for trading future yield on stablecoin deposits and other interest bearing assets.
The Pendle revenue model operates through several mechanisms. The protocol charges trading fees on its automated market makers (AMMs), typically around 0.1% to 0.3% per swap. When users trade yield tokens on Pendle’s platform, a portion of these fees goes to the protocol treasury while another portion rewards liquidity providers who supply capital to the trading pools.
Pendle’s unique approach involves splitting interest bearing tokens into two components: the principal token (PT) representing the underlying asset, and the yield token (YT) representing the future interest. This separation allows sophisticated users to speculate on interest rates, hedge yield exposure, or lock in fixed returns on their stablecoin holdings.
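A toy example of the economics (illustrative numbers only, not Pendle's actual pricing or fee schedule):
# Illustrative: splitting one interest-bearing stablecoin unit into PT + YT.
pt_price = 0.95                 # hypothetical market price of the principal token (face value 1.00)
yt_price = 1.00 - pt_price      # the remainder is, roughly, the market price of the yield token

# Buying PT at 0.95 and redeeming 1.00 at maturity locks in a fixed return.
fixed_return = 1.00 / pt_price - 1.0
print(f"YT price: {yt_price:.2f}, fixed return to maturity: {fixed_return:.2%}")   # ~5.26%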
The protocol generates revenue through swap fees, redemption fees when tokens mature, and potential governance token value capture as the protocol grows. This model demonstrates how stablecoin adjacent services can create profitable businesses by adding layers of financial sophistication on top of basic stablecoin infrastructure. Pendle particularly benefits during periods of high interest rates, when demand for yield trading increases and the potential returns from separating yield rights become more valuable.
4. Security and Fraud Concerns
Stablecoins face several critical security and fraud challenges that potential users and regulators must consider.
Reserve transparency and verification remain the most significant concern. Issuers must prove they actually hold the assets backing their stablecoins. Several controversies have erupted when stablecoin companies failed to provide clear, audited proof of reserves. The risk is that an issuer might not have sufficient backing, leading to a bank run scenario where the peg collapses and users cannot redeem their coins.
Smart contract vulnerabilities pose technical risks. Stablecoins built on blockchain platforms rely on code that, if flawed, can be exploited by hackers. Major hacks have resulted in hundreds of millions of dollars in losses, and once stolen, blockchain transactions are typically irreversible.
Regulatory uncertainty creates ongoing challenges. Different jurisdictions treat stablecoins differently, and the lack of clear, consistent regulation creates risks for both issuers and users. There’s potential for sudden regulatory action that could freeze assets or shut down operations.
Counterparty risk is inherent in centralized stablecoins. Users must trust the issuing company to maintain reserves, operate honestly, and remain solvent. If the company fails or acts fraudulently, users may lose their funds with limited recourse.
The algorithmic stablecoin model has proven particularly vulnerable. The catastrophic collapse of TerraUSD in 2022, which lost over $40 billion in value, demonstrated that algorithmic mechanisms can fail spectacularly under market stress, creating devastating losses for holders.
Money laundering and sanctions evasion concerns have drawn regulatory scrutiny. The pseudonymous nature of cryptocurrency transactions makes stablecoins attractive for illicit finance, though blockchain’s transparent ledger also makes transactions traceable with proper tools and cooperation.
4.1 Monitoring Stablecoin Flows
Effective monitoring of stablecoin flows has become critical for financial institutions, regulators, and the issuers themselves to ensure compliance, detect fraud, and understand market dynamics.
On Chain Analytics Tools provide the foundation for stablecoin monitoring. Since most stablecoins operate on public blockchains, every transaction is recorded and traceable. Companies like Chainalysis, Elliptic, and TRM Labs specialize in blockchain analytics, offering platforms that track stablecoin movements across wallets and exchanges. These tools can identify patterns, flag suspicious activities, and trace funds through complex transaction chains.
Real Time Transaction Monitoring systems alert institutions to potentially problematic flows. These systems track large transfers, unusual transaction patterns, rapid movement between exchanges (potentially indicating wash trading or manipulation), and interactions with known illicit addresses. Financial institutions integrating stablecoins must implement monitoring comparable to traditional payment systems.
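A minimal sketch of the kind of rule such systems apply (the thresholds, addresses, and transfer fields here are hypothetical; real platforms rely on commercial analytics feeds and sanctions lists rather than hard coded values):
# Hypothetical transaction-screening rules for stablecoin transfers.
LARGE_TRANSFER_USD = 100_000                                        # illustrative threshold
SANCTIONED = {"0x0000000000000000000000000000000000000bad"}         # placeholder; real lists come from OFAC/analytics feeds

def flag_transfer(sender: str, recipient: str, amount_usd: float, hops_last_hour: int) -> list[str]:
    flags = []
    if amount_usd >= LARGE_TRANSFER_USD:
        flags.append("large-transfer")
    if sender in SANCTIONED or recipient in SANCTIONED:
        flags.append("sanctions-hit")
    if hops_last_hour > 20:                                          # rapid movement between wallets/exchanges
        flags.append("rapid-movement")
    return flags

print(flag_transfer("0xabc", "0xdef", 250_000.0, hops_last_hour=3))  # ['large-transfer']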
Wallet Clustering and Entity Attribution techniques help identify the real world entities behind blockchain addresses. By analyzing transaction patterns, timing, and common input addresses, analytics firms can cluster related wallets and often attribute them to specific exchanges, services, or even individuals. This capability is crucial for understanding who holds stablecoins and where they’re being used.
Reserve Monitoring and Attestation focuses on the issuer side. Independent auditors and blockchain analysis firms track the total supply of stablecoins and verify that corresponding reserves exist. Circle, for instance, publishes monthly attestations from accounting firms. Some advanced monitoring systems provide real time transparency by linking on chain supply data with bank account verification.
Cross Chain Tracking has become essential as stablecoins exist across multiple blockchains. USDC and USDT operate on Ethereum, Tron, Solana, and other chains, requiring monitoring solutions that aggregate data across these ecosystems to provide a complete picture of flows.
Market Intelligence and Risk Assessment platforms combine on chain data with off chain information to assess concentration risk, identify potential market manipulation, and provide early warning of potential instability. When a small number of addresses hold large stablecoin positions, it creates systemic risk that monitoring can help quantify.
Banks and financial institutions implementing stablecoins typically deploy a combination of commercial blockchain analytics platforms, custom monitoring systems, and compliance teams trained in cryptocurrency investigation. The goal is achieving the same level of financial crime prevention and risk management that exists in traditional banking while adapting to the unique characteristics of blockchain technology.
5. How Regulators View Stablecoins
Regulatory attitudes toward stablecoins vary significantly across jurisdictions, but common themes and concerns have emerged globally.
United States Regulatory Approach involves multiple agencies with overlapping jurisdictions. The Securities and Exchange Commission (SEC) has taken the position that some stablecoins may be securities, particularly those offering yield or governed by investment contracts. The Commodity Futures Trading Commission (CFTC) views certain stablecoins as commodities. The Treasury Department and the Financial Stability Oversight Council have identified stablecoins as potential systemic risks requiring bank like regulation.
Proposed legislation in the US Congress has sought to create a comprehensive framework requiring stablecoin issuers to maintain high quality liquid reserves, submit to regular audits, and potentially obtain banking charters or trust company licenses. The regulatory preference is clearly toward treating major stablecoin issuers as financial institutions subject to banking supervision.
European Union Regulation has taken a more structured approach through the Markets in Crypto Assets (MiCA) regulation, which came into effect in 2024. MiCA establishes clear requirements for stablecoin issuers including reserve asset quality standards, redemption rights for holders, capital requirements, and governance standards. The regulation distinguishes between smaller stablecoin operations and “significant” stablecoins that require more stringent oversight due to their systemic importance.
United Kingdom Regulators are developing a framework that treats stablecoins used for payments as similar to traditional payment systems. The Bank of England and Financial Conduct Authority have indicated that stablecoin issuers should meet standards comparable to commercial banks, including holding reserves in central bank accounts or high quality government securities.
Asian Regulatory Perspectives vary widely. Singapore’s Monetary Authority has created a licensing regime for stablecoin issuers focused on reserve management and redemption guarantees. Hong Kong is developing similar frameworks. China has banned private stablecoins entirely while developing its own central bank digital currency. Japan requires stablecoin issuers to be licensed banks or trust companies.
Key Regulatory Concerns consistently include systemic risk (the failure of a major stablecoin could trigger broader financial instability), consumer protection (ensuring holders can redeem stablecoins for fiat currency), anti money laundering compliance, reserve adequacy and quality, concentration risk in the Treasury market (if stablecoin reserves significantly increase holdings of government securities), and the potential for stablecoins to facilitate capital flight or undermine monetary policy.
Central Bank Digital Currencies (CBDCs) represent a regulatory response to private stablecoins. Many central banks are developing or piloting digital currencies partly to provide a public alternative to private stablecoins, allowing governments to maintain monetary sovereignty while capturing the benefits of digital currency.
The regulatory trend is clearly toward treating stablecoins as systemically important financial infrastructure requiring oversight comparable to banks or payment systems, with an emphasis on reserve quality, redemption rights, and anti money laundering compliance.
5.1 How Stablecoins Impact the Correspondent Banking Model
Stablecoins pose both opportunities and existential challenges to the traditional correspondent banking system that has dominated international payments for decades.
The Traditional Correspondent Banking Model relies on a network of banking relationships where banks hold accounts with each other to facilitate international transfers. When a business in Brazil wants to pay a supplier in Thailand, the payment typically flows through multiple intermediary banks, each taking fees and adding delays. This system involves currency conversion, compliance checks at multiple points, and settlement risk, making international payments slow and expensive.
Stablecoins as Direct Competition offer a fundamentally different model. A business can send USDC directly to a recipient anywhere in the world in minutes, bypassing the correspondent banking network entirely. The recipient can then convert to local currency through a local exchange or payment processor. This disintermediation threatens the fee generating correspondent banking relationships that have been profitable for banks, particularly in remittance corridors and business to business payments.
Cost and Speed Advantages are significant. Traditional correspondent banking involves fees at multiple layers, often totaling 3-7% for remittances and 1-3% for business payments, with settlement taking 1-5 days. Stablecoin transfers can cost less than 1% including conversion fees, with settlement in minutes. This efficiency gap puts pressure on banks to either adopt stablecoin technology or risk losing payment volume.
The Disintermediation Threat extends beyond just payments. Correspondent banking generates substantial revenue for major international banks through foreign exchange spreads, service fees, and liquidity management. If businesses and individuals can hold and transfer value in stablecoins, they become less dependent on banks for international transactions. This is particularly threatening in high volume, low margin corridors where efficiency matters most.
Banks Adapting Through Integration represents one response to this threat. Rather than being displaced, some banks are incorporating stablecoins into their service offerings. They can issue their own stablecoins, partner with stablecoin issuers to provide on ramps and off ramps, or offer custody and transaction services for corporate clients wanting to use stablecoins. JPMorgan’s JPM Coin exemplifies this approach, using blockchain technology and stablecoin principles for institutional payments within a bank controlled system.
The Hybrid Model Emerging in practice combines stablecoins with traditional banking. Banks provide the fiat on ramps and off ramps, regulatory compliance, customer relationships, and local currency conversion, while stablecoins handle the actual transfer of value. This partnership model allows banks to maintain their customer relationships and regulatory compliance role while capturing efficiency gains from blockchain technology.
Regulatory Arbitrage Concerns arise because stablecoins can sometimes operate with less regulatory burden than traditional correspondent banking. Banks face extensive anti money laundering requirements, capital requirements, and regulatory scrutiny. If stablecoins provide similar services with lighter regulation, they gain a competitive advantage that regulators are increasingly seeking to eliminate through tighter stablecoin oversight.
Settlement Risk and Liquidity Management change fundamentally with stablecoins. Traditional correspondent banking requires banks to maintain nostro accounts (accounts held in foreign banks) prefunded with liquidity. Stablecoins allow for near instant settlement without prefunding requirements, potentially freeing up billions in trapped liquidity that banks currently must maintain across the correspondent network.
The long term impact will likely involve correspondent banking evolving rather than disappearing. Banks will increasingly serve as regulated gateways between fiat currency and stablecoins, while stablecoins handle the actual transfer of value. The most vulnerable players are mid tier correspondent banks that primarily provide routing services without strong customer relationships or value added services.
5.2 How FATF Standards Apply to Stablecoins
The Financial Action Task Force (FATF) provides international standards for combating money laundering and terrorist financing, and these standards have been extended to cover stablecoins and other virtual assets.
The Travel Rule represents the most significant FATF requirement affecting stablecoins. Originally designed for traditional wire transfers, the Travel Rule requires that information about the originator and beneficiary of transfers above a certain threshold (typically $1,000) must travel with the transaction. For stablecoins, this means that Virtual Asset Service Providers (VASPs) such as exchanges, wallet providers, and payment processors must collect and transmit customer information when facilitating stablecoin transfers.
Implementing the Travel Rule on public blockchains creates technical challenges. While bank wire transfers pass through controlled systems where information can be attached, blockchain transactions are peer to peer and pseudonymous. The industry has developed solutions like the Travel Rule Information Sharing Architecture (TRISA) and other protocols that allow VASPs to exchange customer information securely off chain while the stablecoin transaction occurs on chain.
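A simplified sketch of the threshold logic (the $1,000 figure comes from the text above; the field names and payload shape are hypothetical and bear no relation to TRISA's actual message schema):
# Hypothetical Travel Rule check: attach originator/beneficiary data above the threshold.
from dataclasses import dataclass

TRAVEL_RULE_THRESHOLD_USD = 1_000

@dataclass
class Party:
    name: str
    account: str

def build_transfer_message(originator: Party, beneficiary: Party, amount_usd: float) -> dict:
    message = {"amount_usd": amount_usd}
    if amount_usd >= TRAVEL_RULE_THRESHOLD_USD:
        # Identifying information must travel with the transfer, exchanged VASP-to-VASP off chain.
        message["originator"] = {"name": originator.name, "account": originator.account}
        message["beneficiary"] = {"name": beneficiary.name, "account": beneficiary.account}
    return message

print(build_transfer_message(Party("A N Example", "acct-1"), Party("B Example", "acct-2"), 2_500.0))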
Know Your Customer (KYC) and Customer Due Diligence requirements apply to any entity that provides services for stablecoin transactions. Exchanges, wallet providers, and payment processors must verify customer identities, assess risk levels, and maintain records of transactions. This requirement creates a tension with the permissionless nature of blockchain technology, where anyone can hold a self hosted wallet and transact directly without intermediaries.
VASP Registration and Licensing is required in most jurisdictions following FATF guidance. Any business providing stablecoin custody, exchange, or transfer services must register with financial authorities, implement anti money laundering programs, and submit to regulatory oversight. This has created significant compliance burdens for smaller operators and driven consolidation toward larger, well capitalized platforms.
Stablecoin Issuers as VASPs are generally classified as Virtual Asset Service Providers under FATF standards, subjecting them to the full range of anti money laundering and counter terrorist financing obligations. This includes transaction monitoring, suspicious activity reporting, and sanctions screening. Major issuers like Circle and Paxos have built sophisticated compliance programs comparable to traditional financial institutions.
The Self Hosted Wallet Challenge represents a key friction point. FATF has expressed concern about transactions involving self hosted (non custodial) wallets where users control their own private keys without intermediary oversight. Some jurisdictions have proposed restricting or requiring enhanced due diligence for transactions between VASPs and self hosted wallets, though this remains controversial and difficult to enforce technically.
Cross Border Coordination is essential but challenging. Stablecoins operate globally and instantly, but regulatory enforcement is jurisdictional. FATF promotes information sharing between national financial intelligence units and encourages mutual legal assistance. However, gaps in enforcement across jurisdictions create opportunities for regulatory arbitrage, where bad actors operate from jurisdictions with weak oversight.
Sanctions Screening is mandatory for stablecoin service providers. They must screen transactions against lists of sanctioned individuals, entities, and countries maintained by organizations like the US Office of Foreign Assets Control (OFAC). Several stablecoin issuers have demonstrated the ability to freeze funds in wallets associated with sanctioned addresses, showing that even decentralized systems can implement centralized controls when required by law.
Risk Based Approach is fundamental to FATF methodology. Service providers must assess the money laundering and terrorist financing risks specific to their operations and implement controls proportionate to those risks. For stablecoins, this means considering factors like transaction volumes, customer types, geographic exposure, and the underlying blockchain’s anonymity features.
Challenges in Implementation are significant. The pseudonymous nature of blockchain transactions makes it difficult to identify ultimate beneficial owners. The speed and global reach of stablecoin transfers compress the time window for intervention. The prevalence of decentralized exchanges and peer to peer transactions creates enforcement gaps. Some argue that excessive regulation will drive activity to unregulated platforms or privacy focused cryptocurrencies, making financial crime harder rather than easier to detect.
The FATF framework essentially attempts to impose traditional financial system controls on a technology designed to operate without intermediaries. While large, regulated stablecoin platforms can implement these requirements, the tension between regulatory compliance and the permissionless nature of blockchain technology remains unresolved and continues to drive both technological innovation and regulatory evolution.
6. Good Use Cases for Stablecoins
Despite the risks, stablecoins excel in several legitimate applications that offer clear advantages over traditional alternatives.
Cross border payments and remittances benefit enormously from stablecoins. Workers sending money home can avoid high fees and long delays, with transactions settling in minutes rather than days. Businesses conducting international trade can reduce costs and streamline operations significantly.
Treasury management for crypto native companies provides a practical use case. Cryptocurrency exchanges, blockchain projects, and Web3 companies need stable assets for operations while staying within the crypto ecosystem. Stablecoins let them hold working capital without exposure to crypto volatility.
Decentralized finance (DeFi) applications rely heavily on stablecoins. They enable lending and borrowing, yield farming, liquidity provision, and trading without the complications of volatile assets. Users can earn interest on stablecoin deposits or use them as collateral for loans.
Hedging against local currency instability makes stablecoins valuable in countries experiencing hyperinflation or currency crises. Citizens can preserve purchasing power by holding dollar backed stablecoins instead of rapidly devaluing local currencies.
Programmable payments and smart contracts benefit from stablecoins. Businesses can automate payments based on conditions (such as releasing funds when goods are received) or create subscription services, escrow arrangements, and other complex payment structures that execute automatically.
Ecommerce and online payments increasingly accept stablecoins as they combine the low fees of cryptocurrency with price stability. This is particularly valuable for digital goods, online services, and merchant payments where volatility would be problematic.
6.1 Companies Specializing in Banking Stablecoin Integration
Several companies have emerged as leaders in helping traditional banks launch and integrate stablecoin solutions into their existing infrastructure.
Paxos is a regulated blockchain infrastructure company that provides white label stablecoin solutions for financial institutions. They’ve partnered with major companies to issue stablecoins and offer compliance focused infrastructure that meets banking regulatory requirements. Paxos handles the technical complexity while allowing banks to maintain their customer relationships.
Circle offers comprehensive business account services and APIs that enable banks to integrate USD Coin (USDC) into their platforms. Their developer friendly tools and banking partnerships have made them a go to provider for institutions wanting to offer stablecoin services. Circle emphasizes regulatory compliance and transparency with regular reserve attestations.
Fireblocks provides institutional grade infrastructure for banks looking to offer digital asset services, including stablecoins. Their platform handles custody, treasury operations, and connectivity to various blockchains, allowing banks to offer stablecoin functionality without building everything from scratch.
Taurus specializes in digital asset infrastructure for banks, wealth managers, and other financial institutions in Europe. They provide technology for custody, tokenization, and trading that enables traditional financial institutions to offer stablecoin services within existing regulatory frameworks.
Sygnum operates as a Swiss digital asset bank and offers banking as a service solutions. They help other banks integrate digital assets including stablecoins while ensuring compliance with Swiss banking regulations. Their approach combines traditional banking security with blockchain innovation.
Ripple has expanded beyond its cryptocurrency focus to offer enterprise blockchain solutions for banks, including infrastructure for stablecoin issuance and cross border payment solutions. Their partnerships with financial institutions worldwide position them as a bridge between traditional banking and blockchain technology.
BBVA and JPMorgan have also developed proprietary solutions (JPM Coin for JPMorgan) that other institutions might license or use as models, though these are typically more focused on their own operations and select partners.
7. The Bid Offer Spread Challenge: Liquidity vs. True 1:1 Conversions
One of the hidden costs in stablecoin adoption that significantly impacts user economics is the bid offer spread applied during conversions between fiat currency and stablecoins. While stablecoins are designed to maintain a 1:1 peg with their underlying asset (typically the US dollar), the reality of converting between fiat and crypto introduces market dynamics that can erode this theoretical parity.
7.1 Understanding the Spread Problem
When users convert fiat currency to stablecoins or vice versa through most platforms, they encounter a bid offer spread: the difference between the buying price and the selling price. Even though USDC or USDT theoretically equals $1.00, a platform might effectively charge $1.008 to buy a stablecoin and offer only $0.992 when selling it back. A spread of 0.8% to 1.5% represents a significant friction cost, particularly for businesses making frequent conversions or moving large amounts.
This spread exists because most platforms operate market making models where they must maintain liquidity on both sides of the transaction. Holding inventory of both fiat and stablecoins involves costs: capital tied up in reserves, exposure to brief depegging events, regulatory compliance overhead, and the operational expense of managing banking relationships for fiat on ramps and off ramps. Platforms traditionally recover these costs through the spread rather than explicit fees.
For cryptocurrency exchanges and most fintech platforms, the spread also serves as their primary revenue mechanism for stablecoin conversions. When a platform facilitates thousands or millions of conversions daily, even small spreads generate substantial income. The spread compensates for the risk that during periods of market stress, stablecoins might temporarily trade below their peg, leaving the platform holding depreciated assets.
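Using the illustrative buy and sell prices quoted above, the round trip cost is easy to quantify (the notional amount is arbitrary):
# Round-trip conversion cost at the illustrative prices quoted above.
buy_price = 1.008      # fiat paid per stablecoin when buying
sell_price = 0.992     # fiat received per stablecoin when selling back
notional_usd = 100_000

coins = notional_usd / buy_price
round_trip_value = coins * sell_price
cost = notional_usd - round_trip_value
print(f"Round-trip cost: ${cost:,.0f} ({cost / notional_usd:.2%})")   # ~$1,587 (~1.59%)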
7.2 The Impact on Users and Business Operations
The cumulative effect of bid offer spreads becomes particularly painful for certain use cases. Small and medium sized businesses operating across borders face multiple conversion points: exchanging local currency to USD, converting USD to stablecoins for cross border transfer, then converting stablecoins back to USD or local currency at the destination. Each conversion compounds the cost, potentially consuming 2% to 4% of the transaction value when combined with traditional banking fees.
For businesses using stablecoins as working capital (converting payroll, managing treasury operations, or settling international invoices), the spread can eliminate much of the cost advantage that stablecoins are supposed to provide over traditional correspondent banking. A company converting $100,000 might effectively pay $1,500 in spread costs on a round trip conversion, comparable to the traditional wire transfer fees that stablecoins aimed to disrupt.
Individual users in countries with unstable currencies face similar challenges. While holding USDT or USDC protects against local currency devaluation, the cost of frequently moving between local currency and stablecoins can be prohibitive. The spread becomes a “tax” on financial stability that disproportionately affects those who can least afford it.
7.3 Revolut’s 1:1 Model: Internalizing the Cost
Revolut’s recent introduction of true 1:1 conversions between USD and stablecoins (USDC and USDT) represents a fundamentally different approach to solving the spread problem. Rather than passing market making costs to users, Revolut absorbs the spread internally, guaranteeing that $1.00 in fiat equals exactly 1.00 stablecoin units in both directions, with no hidden markups.
This model is economically viable for Revolut because of several structural advantages. First, as a neobank with 65 million users and existing banking infrastructure, Revolut already maintains substantial fiat currency liquidity and doesn't need to rely on external banking partners for every stablecoin conversion. Second, the company generates revenue from other services within its ecosystem (subscription fees, interchange fees from card spending, interest on deposits), allowing it to treat stablecoin conversions as a loss leader or break even feature that enhances customer retention and platform stickiness.
Third, by setting a monthly limit of approximately $578,000 per customer, Revolut manages its risk exposure while still accommodating the vast majority of retail and small business use cases. This prevents arbitrage traders from exploiting the zero spread model to make risk free profits by moving large volumes between Revolut and other platforms where spreads exist.
Revolut essentially bets that the value of removing friction from fiat to crypto conversions (thereby making stablecoins genuinely useful as working capital rather than speculative assets) will drive sufficient user engagement and platform growth to justify the cost of eliminating spreads. For users, this transforms the economics of stablecoin usage, particularly for frequent converters or those operating in highly volatile currency environments.
7.4 Why Not Everyone Can Offer 1:1 Conversions
The challenge for smaller platforms and pure cryptocurrency exchanges is that they lack Revolut's structural advantages. A standalone crypto exchange without banking licenses and integrated fiat services must partner with banks for fiat on ramps, pay fees to those partners, maintain separate liquidity pools, and manage the regulatory complexity of operating in multiple jurisdictions. These costs don't disappear simply because users want better rates; they must be recovered somehow.
Additionally, maintaining tight spreads or true 1:1 conversions requires deep liquidity and sophisticated risk management. When thousands of users simultaneously want to exit stablecoins during market stress, a platform must have sufficient reserves to honor redemptions instantly without moving the price. Smaller platforms operating with thin liquidity buffers cannot safely eliminate spreads without risking insolvency during volatile periods.
The market structure for stablecoins also presents challenges. While stablecoins theoretically maintain 1:1 pegs, secondary market prices on decentralized exchanges and between different platforms can vary by small amounts. A platform offering guaranteed 1:1 conversions must either hold sufficient reserves to absorb these variations or accept that arbitrage traders will exploit any price discrepancies, potentially draining liquidity.
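To make that arbitrage risk concrete, the sketch below shows why even a fraction of a cent of deviation between a secondary market price and a guaranteed 1:1 redemption is exploitable at scale. The prices, fees and volumes are hypothetical.

```python
def arbitrage_profit(notional_usd, secondary_price,
                     redemption_price=1.0, fee_bps=5.0):
    """Profit from buying a stablecoin on a secondary market and
    redeeming it 1:1 on a platform that guarantees the peg.

    secondary_price and fee_bps are illustrative assumptions.
    """
    tokens = notional_usd / secondary_price      # buy below the peg
    gross = tokens * redemption_price            # redeem at exactly 1:1
    fees = notional_usd * fee_bps / 10_000 * 2   # assumed fees on both legs
    return gross - notional_usd - fees

# Buying $500,000 of a stablecoin at $0.998 and redeeming it at $1.00
# nets roughly $500 after fees. Repeated at scale, this is the flow
# that can drain liquidity from a platform offering guaranteed 1:1.
print(round(arbitrage_profit(500_000, 0.998), 2))
```

The per trade profit looks small, but because the trade is close to risk free it invites volume, which is exactly why limits such as Revolut’s monthly cap exist.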
7.5 The Competitive Implications
Revolut’s move to zero spread stablecoin conversions could trigger a competitive dynamic in the fintech space, similar to how its original zero fee foreign exchange offering disrupted traditional currency conversion. Established players like Coinbase, Kraken, and other major exchanges will face pressure to reduce their spreads or explain why their costs remain higher.
For traditional banks contemplating stablecoin integration, the spread question becomes strategic. Banks could follow the Revolut model, absorbing spread costs to drive adoption and maintain customer relationships in an increasingly crypto integrated financial system. Alternatively, they might maintain spreads but offer other value added services that justify the cost, such as enhanced compliance, insurance on holdings, or integration with business treasury management systems.
The long term outcome may be market segmentation. Large, integrated fintech platforms with diverse revenue streams can offer true 1:1 conversions as a competitive advantage. Smaller, specialized platforms will continue operating with spreads but may differentiate through speed, blockchain coverage, or serving specific niches like high volume traders who value depth of liquidity over tight spreads.
For stablecoin issuers like Circle and Tether, the spread dynamics affect their business indirectly. Wider spreads on third party platforms create friction that slows stablecoin adoption, reducing the total assets under management that generate interest income for issuers. Partnerships with platforms offering tighter spreads or true 1:1 conversions could accelerate growth, even if those partnerships involve revenue sharing or other commercial arrangements.
Ultimately, the bid offer spread challenge highlights a fundamental tension in stablecoin economics: the gap between the theoretical promise of 1:1 value stability and the practical costs of maintaining liquidity, managing risk, and operating the infrastructure that connects fiat currency to blockchain based assets. Platforms that can bridge this gap efficiently (whether through scale, integration, or innovative business models) will have significant competitive advantages as stablecoins move from crypto native use cases into mainstream financial infrastructure.
8. Conclusion
Stablecoins represent a significant innovation in digital finance, offering the benefits of cryptocurrency without extreme volatility. They’ve found genuine utility in payments, remittances, and decentralized finance while generating substantial revenue for issuers through interest on reserves. However, they also carry real risks around reserve transparency, regulatory uncertainty, and potential fraud that users and institutions must carefully consider.
The regulatory landscape is rapidly evolving, with authorities worldwide moving toward treating stablecoins as systemically important financial infrastructure requiring bank like oversight. FATF standards impose traditional anti money laundering requirements on stablecoin service providers, creating compliance obligations comparable to traditional finance. Meanwhile, sophisticated monitoring tools have emerged to track flows, detect illicit activity, and ensure reserve adequacy.
For traditional banks, stablecoins represent both a competitive threat to correspondent banking models and an opportunity to modernize payment infrastructure. Rather than being displaced entirely, banks are increasingly positioning themselves as regulated gateways between fiat currency and stablecoins, maintaining customer relationships and compliance functions while leveraging blockchain efficiency.
For banks considering stablecoin integration, working with established infrastructure providers can mitigate technical and compliance challenges. The key is choosing use cases where stablecoins offer clear advantages, particularly in cross border payments and treasury management, while implementing robust risk management and transaction monitoring and ensuring compliance with both traditional financial regulations and emerging crypto specific frameworks.
As the regulatory landscape evolves and technology matures, stablecoins are likely to become increasingly integrated into mainstream financial services. Their success will depend on maintaining trust through transparency, security, and regulatory cooperation while continuing to deliver value that traditional financial rails cannot match. The future likely involves a hybrid model where stablecoins and traditional banking coexist, each playing to their respective strengths in a more efficient, global financial system.
This is (hopefully) a short blog that will give you back a small piece of your life…
In technology, we rightly spend hours poring over failure so that we might understand it, fix it and avoid it in the future. This seems a reasonable approach: learn from your mistakes, understand failure, plan your remediation, and so on. But is it possible that there are some instances where doing this is inappropriate? To answer this simple question, let me give you an analogy…
You decide that you want to travel from London to New York. Sounds reasonable so far… But you decide you want to go by car! The reasoning for this is as follows:
Cars are “tried and tested”.
We have an existing deal with multiple car suppliers and we get great discounts.
The key decision maker is a car enthusiast.
The incumbent team understand cars and can support this choice.
Cars are what we have available right now and we want to start execution tomorrow, so let’s just make it work.
You first try a small hatchback and only manage to get around 3m off the coast of Scotland. Next you figure you will get a more durable car, so you get a truck – but sadly this only makes 2m of headway from the beach. You report back to the team and they send you a brand new Porsche; this time you give yourself an even bigger run up at the sea and manage a whopping 4m before the car sinks. The team now analyse all the data to figure out why each car sank and what they can do to make this better. The team continue to experiment with various cars and progress is observed over time. After 6 months the team has managed to travel 12m towards their goal of driving to New York. The main reason for the progress is that the sunken cars are starting to form a land bridge. The leadership have now spent over 200m USD on this venture and don’t feel they can pivot, so they start to brainstorm how to make this work.
Maybe wind the windows up a little tighter, maybe the cars need more underseal, maybe over inflate the tyres, or maybe we simply need way more cars? All of these may or may not make a difference. But here’s the challenge: you made a bad engineering choice, and anything you do will just be a variant of bad. It will never be good and you cannot win with your choice.
The above obviously sounds a bit daft (and it is), but the point is that I am often called in after downtime to review an architecture, find a root cause and suggest remediation. What is not always understood is that bad technology choices can be about as likely to succeed as driving from London to New York. Sometimes you simply need to look at alternatives: you need a boat or a plane. The product architecture can be terminal; it won’t ever be what you want it to be, and no amount of analysis or spend will change this. The trick is to accept the brutal reality of your situation and move your focus towards choosing the technology that you need to transition to. Then try to figure out how quickly you can make this pivot…