The Silent Killer in Your AWS Architecture: IOPS Mismatches

Andrew Baker | Capitec Bank | March 2026

The AWS Well-Architected Framework will not spot this. Neither will your cloud governance team, nor Trusted Advisor. Your SRE reviews will miss it, your APM will not flag it, Performance Insights will not surface it, and your FinOps team will have no idea it is there. It sits quietly in your estate for months, sometimes years, until the day load conditions expose it and you spend four hours in a war room trying to explain to your CTO why a bank cannot process payments.

What is it? IOPS mismatches: situations where your storage can deliver far more I/O than your compute instance can actually consume.

1. What Is an IOPS Mismatch?

IOPS (Input/Output Operations Per Second) is the fundamental currency of storage performance. When you provision an RDS instance or attach an EBS volume to an EC2 host, you are making two independent decisions: how many IOPS the storage can deliver, and how many IOPS the instance can actually push through its I/O subsystem. These two numbers are governed by entirely separate limits. EBS io2 volumes can be provisioned up to 64,000 IOPS (gp3 up to 16,000), but your EC2 instance has its own maximum EBS throughput ceiling, and for many instance types that ceiling is well below what your storage can theoretically deliver.

The result is a mismatch where you are paying for storage performance that your compute layer cannot reach, and worse, you are designing your system around an IOPS budget that does not actually exist at the instance level. Every rand you spend on IOPS above that ceiling is money you will never get value from, and none of your tooling will tell you.
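The arithmetic is worth making concrete. A minimal sketch, assuming an illustrative gp3-style price of $0.005 per provisioned IOPS-month above a 3,000 IOPS free baseline (check current AWS pricing before trusting the figure):

```python
def effective_iops(provisioned_iops: int, instance_ceiling: int) -> int:
    """The I/O the system can actually sustain: the lower of the two limits."""
    return min(provisioned_iops, instance_ceiling)


def wasted_iops_cost(provisioned_iops: int, instance_ceiling: int,
                     price_per_iops_month: float = 0.005,
                     free_baseline: int = 3000) -> float:
    """Monthly spend on provisioned IOPS the instance can never consume.

    Pricing is illustrative (approximate gp3 rate above the free baseline);
    verify against current AWS pricing before relying on the number.
    """
    billable = max(provisioned_iops - free_baseline, 0)
    usable_billable = max(min(provisioned_iops, instance_ceiling) - free_baseline, 0)
    return (billable - usable_billable) * price_per_iops_month
```

A 10,000 IOPS volume behind an instance that tops out at 3,500 has an effective ceiling of 3,500, and 6,500 of the paid-for IOPS are unreachable every month.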

2. The AWS Architecture Pattern That Makes This Worse

AWS makes it easy to provision storage and compute independently, and neither the console, the CLI, CloudFormation, nor Terraform will warn you when your storage IOPS ceiling exceeds your instance throughput ceiling. The typical path to a mismatch involves an engineer provisioning an r6g.large with a 3,500 IOPS maximum EBS throughput because the application is memory intensive, while a separate decision by a different team provisions the attached RDS storage with 10,000 IOPS because the database team uses a standard enterprise template. The two numbers are never compared against each other and the system goes to production.

At any institution operating at scale across hundreds of accounts and workloads, the opportunity for this kind of drift is significant. An OU spanning multiple teams, each with its own provisioning habits and templates, will accumulate mismatches over time through the natural entropy of infrastructure management.

Some of the most frequent mismatches appear in exactly the instance types engineers reach for most often. The t3 family is particularly dangerous because it is the default choice for cost conscious workloads, yet every size from nano to 2xlarge is capped at 2,085 IOPS while a gp3 volume defaults to 3,000 and io1 storage attached by a database team using an enterprise template can easily reach 6,000 or more. The r6g.large caps at 3,500 IOPS and is routinely paired with RDS storage provisioned at double that. The m5.xlarge at 6,000 IOPS is frequently attached to io2 volumes at 10,000. On the RDS side, a db.t3.medium is limited to 1,536 IOPS but is commonly found with io1 storage provisioned at 3,000, and db.r6g.large instances capped at 3,500 are routinely given storage provisioned at 8,000 or more.
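The comparison itself is mechanical once the ceilings are tabulated. A minimal sketch, using the figures quoted above as illustrative values only (real ceilings vary by instance generation and burst behaviour, so verify them against current AWS documentation):

```python
# Illustrative per-instance EBS IOPS ceilings, taken from the figures quoted
# above. Treat these as placeholders: always verify against the current AWS
# instance type documentation before acting on a finding.
INSTANCE_IOPS_CEILING = {
    "t3.medium": 2085,
    "r6g.large": 3500,
    "m5.xlarge": 6000,
    "db.t3.medium": 1536,
    "db.r6g.large": 3500,
}


def find_mismatches(fleet):
    """fleet: iterable of (resource_id, instance_type, provisioned_iops).

    Returns resources whose provisioned storage IOPS exceed the instance
    throughput ceiling, together with the size of the gap.
    """
    findings = []
    for resource_id, instance_type, provisioned in fleet:
        ceiling = INSTANCE_IOPS_CEILING.get(instance_type)
        if ceiling is not None and provisioned > ceiling:
            findings.append((resource_id, provisioned, ceiling,
                             provisioned / ceiling))
    return findings
```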

3. Why This Causes Outages and Not Just Waste

Paying for IOPS you cannot use is expensive but survivable. The dangerous scenario is what happens at the boundary when load actually increases.

Consider a team that knows their application spikes at just under 30,000 IOPS. They provision storage at 40,000 to give themselves headroom, which is exactly what good engineering practice tells them to do. Everything looks fine in testing, everything looks fine in monitoring, and the system runs without complaint. Then load spikes, the I/O subsystem begins to queue, upstream systems scale out in response, a flood of autoscaling connections exhausts the database connection pool, and services start timing out in a rapid cascading outage. The application is down, and now your team is trying to correlate thread exhaustion in your EKS clusters back to a silent configuration mismatch that no automated tool flagged at any point along the way.

That cascade is the real danger. Under normal conditions your application runs fine because actual I/O demand is modest and the mismatch is invisible. But when demand approaches the instance throughput ceiling, requests begin to queue at the virtualization layer and what was a 1ms storage read becomes 50ms and then 200ms as connection pools back up. In Aurora Serverless v2 or RDS PostgreSQL, connection pool exhaustion under I/O pressure is a documented failure mode where the database appears healthy and is accepting connections but query execution times blow through application timeouts, upstream services retry, retries compound the I/O pressure, and you are now in a feedback loop. On smaller instance types that operate on a credit model for I/O throughput, a sustained I/O spike drains the burst bucket and when it empties throughput drops to the baseline, which is often a fraction of what the application expects, even though the storage volume itself could still deliver perfectly well.
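The burst-credit behaviour is easier to reason about with a toy model. The parameters below are hypothetical, not any specific instance family's real credit mechanics, but they show the cliff: sustained demand above the baseline drains the bucket, and delivery then collapses to the baseline even though the volume itself is healthy.

```python
def simulate_burst_bucket(demand, burst_iops=3000, baseline_iops=500,
                          bucket_seconds=1800):
    """Toy burst-credit model (hypothetical numbers, not real AWS parameters).

    demand: per-second requested IOPS. Returns per-second delivered IOPS.
    The bucket holds `bucket_seconds` worth of full-rate burst; spending
    above the baseline drains it, idling below the baseline refills it.
    """
    credits = float(bucket_seconds)
    delivered = []
    for want in demand:
        cap = burst_iops if credits > 0 else baseline_iops
        got = min(want, cap)
        # Drain (or refill) credits in proportion to usage above baseline.
        drain = (got - baseline_iops) / (burst_iops - baseline_iops)
        credits = min(max(credits - drain, 0.0), float(bucket_seconds))
        delivered.append(got)
    return delivered
```

Run a sustained 3,000 IOPS spike through it and delivery holds at 3,000 for the first 1,800 seconds, then drops to 500: the application sized for 3,000 is now starving while every storage-side metric still looks fine.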

The most insidious part is that CloudWatch metrics can hide this entirely. VolumeReadOps and VolumeWriteOps report what the volume delivered while EBSReadOps on the instance side reports what the instance consumed, and under a mismatch these metrics appear healthy right up until they do not because the bottleneck lives in the virtualized I/O path between instance and volume rather than in either component independently. Performance Insights, which most teams trust to surface database I/O problems, operates above this layer entirely and will show you wait events and query latency but will not show you that the ceiling your instance is hitting is a hardware throughput limit rather than a query or index problem.
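One way to see the gap directly is to pull both sides of the metric pair for the same window and difference them. A hedged sketch, assuming a Nitro-based instance (the per-instance EBSReadOps metric is only published there) and CloudWatch read permissions; the fetch function is not executed here:

```python
def delivered_vs_consumed(volume_ops, instance_ops):
    """Per-interval gap between what the volume delivered and what the
    instance consumed. A persistent positive gap under load points at
    queueing in the virtualized I/O path rather than either component."""
    return [v - i for v, i in zip(volume_ops, instance_ops)]


def fetch_read_ops(volume_id, instance_id, start, end, period=300):
    """Sketch: pull VolumeReadOps (AWS/EBS) and EBSReadOps (AWS/EC2) for
    the same window. Requires boto3 plus CloudWatch read permissions."""
    import boto3  # lazy import so the pure helper above runs offline

    cw = boto3.client("cloudwatch")

    def series(namespace, metric, dim_name, dim_value):
        resp = cw.get_metric_statistics(
            Namespace=namespace, MetricName=metric,
            Dimensions=[{"Name": dim_name, "Value": dim_value}],
            StartTime=start, EndTime=end, Period=period, Statistics=["Sum"],
        )
        points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
        return [p["Sum"] for p in points]

    return (series("AWS/EBS", "VolumeReadOps", "VolumeId", volume_id),
            series("AWS/EC2", "EBSReadOps", "InstanceId", instance_id))
```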

4. The FinOps Trap: When Cost Optimisation Creates the Outage

There is a particularly cruel variant of this failure that deserves its own warning. Trusted Advisor and your FinOps team will periodically flag RDS instances with low CPU utilisation as oversized and recommend downsizing to save cost. The recommendation is correct by every metric they can see. CPU is idle, the instance looks like waste, and the resize gets approved and executed. What nobody checks is whether the workload is IO bound rather than CPU bound, and whether the smaller instance class sits below the provisioned storage IOPS ceiling. The downsize goes through, the system runs fine for weeks because load is modest, and then three months later under a transaction spike you are in an outage with no obvious cause because nobody remembers a routine right sizing exercise from last quarter. The CloudWatch CPU graphs look fine throughout. The storage metrics look fine. The failure lives entirely in the gap between what the instance can consume and what the storage was provisioned to deliver, and that gap was created by a cost optimisation that was correct by every measure anyone was watching.

5. Detection Requires Automation

Once you have identified a mismatch the fix is straightforward: upgrade the instance to a class whose throughput ceiling exceeds the provisioned storage IOPS, or reduce the provisioned IOPS on the storage to match what the instance can actually consume. The hard part is finding these mismatches across a large estate before they manifest as incidents, because you cannot do this manually and you cannot rely on individual teams to self report. It needs to be a scheduled job that generates a report and produces findings you treat with the same severity as any security or cost compliance alert.

The script below is what I need Zak and Ryan to run across all our accounts. It queries every account in the OU, enumerates all RDS instances and EC2 volumes, compares provisioned storage IOPS against the instance throughput ceiling for each, and flags every mismatch with a severity classification and an estimated monthly waste figure. Findings are classified as CRITICAL where provisioned IOPS exceed the ceiling by three times or more and require immediate remediation, HIGH at two to three times over for remediation within the current sprint, MEDIUM at 1.5 to two times for the next planning cycle, and LOW for anything above the ceiling but within the 1 to 1.5x range. Output is a colour coded Excel workbook with a findings sheet and summary tab alongside a flat CSV, and the script exits with code 1 if any CRITICAL findings are present so it can be wired into a CI pipeline for scheduled runs.
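The severity banding described above reduces to a small pure function, sketched here with the thresholds treated as inclusive at their lower bounds:

```python
def classify_finding(provisioned_iops, instance_ceiling):
    """Severity bands from the policy above: ratio of provisioned storage
    IOPS to the instance throughput ceiling. Returns None when there is
    no mismatch."""
    if provisioned_iops <= instance_ceiling:
        return None
    ratio = provisioned_iops / instance_ceiling
    if ratio >= 3.0:
        return "CRITICAL"   # remediate immediately
    if ratio >= 2.0:
        return "HIGH"       # remediate this sprint
    if ratio >= 1.5:
        return "MEDIUM"     # next planning cycle
    return "LOW"            # above ceiling, within 1-1.5x
```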

To run it you will need boto3, pandas, and openpyxl installed, and your caller identity needs organizations:ListChildren, organizations:DescribeAccount, and sts:AssumeRole into each target account, while the assumed role needs rds:DescribeDBInstances, ec2:DescribeVolumes, and ec2:DescribeInstances.
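A skeleton of how the cross-account sweep hangs together, assuming a placeholder audit role name (`IopsAuditRole` here is hypothetical) and an OU id supplied by the caller. The boto3 calls mirror the permissions listed above, but this is a sketch of the shape of the script, not the full report generator:

```python
def scan_org(ou_id, role_name="IopsAuditRole"):
    """List accounts under the OU, assume the audit role in each, and
    collect (account, db_id, provisioned_iops, instance_class) tuples
    for offline comparison. Role name and OU id are placeholders."""
    import boto3  # lazy import: AWS access is only needed when scanning

    org = boto3.client("organizations")
    sts = boto3.client("sts")
    findings = []
    pages = org.get_paginator("list_children").paginate(
        ParentId=ou_id, ChildType="ACCOUNT")
    for page in pages:
        for child in page["Children"]:
            creds = sts.assume_role(
                RoleArn=f"arn:aws:iam::{child['Id']}:role/{role_name}",
                RoleSessionName="iops-audit")["Credentials"]
            rds = boto3.client(
                "rds",
                aws_access_key_id=creds["AccessKeyId"],
                aws_secret_access_key=creds["SecretAccessKey"],
                aws_session_token=creds["SessionToken"])
            for db_page in rds.get_paginator("describe_db_instances").paginate():
                for db in db_page["DBInstances"]:
                    findings.append((child["Id"],
                                     db["DBInstanceIdentifier"],
                                     db.get("Iops"),  # absent on default storage
                                     db["DBInstanceClass"]))
    return findings


def exit_code(severities):
    """Exit 1 if any CRITICAL finding is present, so a CI scheduler can
    fail the run and alert; 0 otherwise."""
    return 1 if "CRITICAL" in severities else 0
```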

6. The Broader Lesson

Every major cloud outage I have seen, and this spans AWS and the broader industry, has a mundane configuration decision somewhere near its root cause. It is not a sophisticated attack and not a novel failure mode; it is a number that should have been checked against another number at provisioning time and was not. IOPS mismatches are not glamorous, they do not appear on architectural review templates, and as we have seen they slip past every layer of automated governance you have in place. But in a payment processing environment they translate directly into transaction failures, customer impact, and regulatory exposure, which means they deserve the same treatment as any other class of compliance finding. Audit your estate, fix the mismatches, and make it a scheduled job rather than a post incident action item.

Andrew Baker is Chief Information Officer at Capitec Bank. He writes about cloud architecture, banking technology, and the gap between what systems are designed to do and what they actually do under load at andrewbaker.ninja.
