Andrew Baker | March 2026
Companion article to: https://andrewbaker.ninja/2026/03/01/the-silent-killer-in-your-aws-architecture-iops-mismatches/
Last week I published a script that scans your AWS estate and finds every EBS volume and RDS instance where your provisioned storage IOPS exceed what the compute instance can actually consume. That problem, the structural mismatch between storage ceiling and instance ceiling, is important and expensive and almost completely invisible to your existing tooling. You should run that script.
But there is a second problem that the mismatch auditor does not solve, and in some ways it is the more dangerous one. The mismatch auditor tells you where the gap exists but it does not tell you whether you are actually falling into it.
Consider the difference. A provisioned storage IOPS ceiling of 10,000 on an instance that can only push 3,500 is a configuration problem, meaning you are paying for performance you cannot consume and your headroom assumptions are wrong. But if your actual workload is only ever generating 1,200 IOPS, the mismatch is a cost and an architecture risk rather than an active emergency. The mismatch auditor will find it and you should fix it, but the building is not on fire yet.
Now consider the other case. Your provisioned storage ceiling is correct, your instance class ceiling matches what you need, and your architecture review would pass. But your workload is generating 3,400 IOPS against a 3,500 ceiling for minutes at a time, every day, during the lunchtime transaction peak. CloudWatch shows nothing alarming because the volume is not saturated and the instance is not at CPU capacity. Performance Insights shows no problematic wait events and your APM shows acceptable latency. You are running at 97 percent of your I/O capacity for sustained periods without knowing it.
That is the building that is about to catch fire.
A 3 percent buffer against a hard ceiling is not a buffer, it is a queue waiting to form. When load spikes because a batch job overlaps with transaction traffic, or a partner integration runs slightly earlier than usual, or a retry storm arrives from an upstream timeout, you cross that ceiling and requests start stacking in the virtualised I/O path. What was a 2ms storage read becomes 40ms as the queue grows, connection pools back up, upstream services time out and retry, and those retries compound the I/O load further. You are now in a feedback loop where your database looks healthy by every metric your team is watching and you have no obvious cause to debug because the bottleneck lives in the gap between what your instance can consume and what your workload is demanding, a gap that none of your standard monitoring instruments will name for you.
The script in this post closes that gap.
1. What This Script Actually Does
The script scans your AWS estate across multiple accounts and regions and queries CloudWatch for every EBS volume, RDS instance, and Aurora instance. For each resource it asks whether actual observed IOPS reached or exceeded a percentage threshold of the resource’s ceiling, and if so, whether that condition persisted continuously for longer than a duration threshold you specify.
You provide both numbers at runtime. Running it with 90 percent and 120 seconds means any resource that sustained IOPS at or above 90 percent of its ceiling for more than two consecutive minutes in the lookback window gets reported, along with which resource breached, by how much, when it started and ended, and what the peak utilisation was.
Both parameters matter because a brief spike to 92 percent is not the same problem as 92 percent sustained for eight minutes. A spike is a normal part of operating any database under variable load, but a sustained breach is a sign that your headroom is structurally insufficient and the next slightly larger spike will tip you into saturation and queuing. The duration threshold is what separates the two.
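The spike-versus-sustained distinction reduces to a run-length scan over per-minute datapoints. Here is a simplified sketch of that logic (the full script in section 8 operates on timestamped CloudWatch datapoints rather than a plain list; all values below are hypothetical):

```python
def sustained_breaches(iops_per_minute, threshold_iops, min_duration_secs, period_secs=60):
    """Return (start_index, end_index, peak_iops) for each run of consecutive
    datapoints at or above the threshold lasting at least min_duration_secs."""
    runs, start, peak = [], None, 0.0
    for i, iops in enumerate(list(iops_per_minute) + [0.0]):  # sentinel flushes a trailing run
        if iops >= threshold_iops:
            if start is None:
                start, peak = i, 0.0
            peak = max(peak, iops)
        elif start is not None:
            if (i - start) * period_secs >= min_duration_secs:
                runs.append((start, i - 1, peak))
            start = None
    return runs

# A one-minute spike to 3,300 is ignored; a three-minute run around 3,400
# against a 3,150 threshold (90% of a 3,500 ceiling) is reported.
series = [1200, 3300, 1200, 3400, 3450, 3400, 1100]
print(sustained_breaches(series, threshold_iops=3150, min_duration_secs=120))
# [(3, 5, 3450)]
```

The single 3,300 spike never produces a finding because it does not persist for the duration threshold, which is exactly the separation the two parameters exist to enforce.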
2. Why the Metrics Differ By Service
This is the part that is easy to get wrong, and getting it wrong means your script either misses breaches entirely or fires false positives that erode trust in the output. The correct metric and the correct ceiling are different for EBS, standard RDS, Aurora provisioned instances, and Aurora Serverless v2, and here is why each one works the way it does.
2.1 The Dual Ceiling Problem for EBS and RDS
Before covering each service in detail, there is a principle that applies to EBS volumes and standard RDS instances that does not apply to Aurora, and it is the most common source of incorrect saturation calculations.
Every EBS volume and every RDS instance has two independent IOPS ceilings operating simultaneously. The first is the storage ceiling, which is the provisioned IOPS on the volume or instance. The second is the instance throughput ceiling, which is the maximum IOPS the underlying compute can push to attached storage. Your workload saturates whichever of these two ceilings it hits first, and that effective ceiling is always the lower of the two.
This is exactly the mismatch problem the companion script identifies: when your storage ceiling is higher than your instance ceiling, the instance ceiling becomes the binding constraint and the storage headroom above it is inaccessible. But even when there is no mismatch and both ceilings are sensibly set, you still need to compare observed IOPS against the lower of the two, because that is the number that actually governs when your workload runs out of room.
If you only compare against the storage ceiling you can build a false picture. A db.m6g.large with 4,000 provisioned storage IOPS has an instance class ceiling of 3,500 IOPS. If your workload hits 3,480 IOPS you are at 99.4 percent of your effective capacity, but a naive comparison against the storage ceiling gives you 87 percent and nothing fires. You are six minutes from saturation and your monitoring tells you everything is comfortable.
The script handles this by computing the effective ceiling as the minimum of the storage IOPS and the instance IOPS ceiling at runtime, using the instance type ceiling tables that also power the mismatch auditor. The note field in the output records both values so you can see which ceiling is binding.
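The db.m6g.large example above reduces to a two-line calculation, which is also what the script performs at runtime (the numbers here are the illustrative values from that example, not real resource data):

```python
# Hypothetical values: a db.m6g.large (3,500 IOPS instance ceiling) with
# 4,000 provisioned storage IOPS, observed at a workload peak of 3,480 IOPS.
storage_iops = 4000
instance_iops = 3500
observed = 3480

effective_ceiling = min(storage_iops, instance_iops)   # the binding constraint
naive_pct = observed / storage_iops * 100              # what a storage-only check reports
actual_pct = observed / effective_ceiling * 100        # true headroom position

print(f"naive: {naive_pct:.1f}%, actual: {actual_pct:.1f}%")
# naive: 87.0%, actual: 99.4%
```

The twelve-point gap between those two percentages is the entire false-comfort problem in miniature.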
2.2 EBS Volumes
EBS publishes VolumeReadOps and VolumeWriteOps as operation counts per CloudWatch collection period rather than as a rate. A 60-second period that returns a value of 180,000 for VolumeReadOps means 180,000 read operations happened in that minute, so to convert that to IOPS, which is the unit your provisioned ceiling is expressed in, you divide by 60. The script does this automatically.
For in-use volumes, the script looks up the EC2 instance the volume is attached to and retrieves the instance type’s IOPS ceiling from the lookup table. The effective ceiling used for breach detection is min(provisioned_storage_iops, instance_type_iops_ceiling). Only io1, io2, and gp3 volumes are scanned because gp2 volumes use a burst credit model where the effective ceiling is elastic and not meaningfully comparable to a fixed provisioned number. If a volume is not attached to a known instance type, the script falls back to the provisioned storage IOPS and records a note accordingly.
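The count-to-rate conversion is a single division, shown here with hypothetical per-period Sum values:

```python
PERIOD_SECONDS = 60

def ebs_ops_to_iops(read_ops_sum, write_ops_sum, period=PERIOD_SECONDS):
    """Convert CloudWatch Sum statistics for VolumeReadOps/VolumeWriteOps
    (operation counts per collection period) into an average IOPS figure."""
    return (read_ops_sum + write_ops_sum) / period

# 180,000 reads plus 30,000 writes in one 60-second period = 3,500 IOPS
print(ebs_ops_to_iops(180_000, 30_000))
# 3500.0
```

Forgetting this division is the classic EBS mistake: raw operation counts compared against an IOPS ceiling look sixty times worse than reality.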
2.3 Standard RDS
RDS publishes ReadIOPS and WriteIOPS in the AWS/RDS namespace as rate metrics, meaning they are already expressed in IOPS rather than as counts per period. You add them together. The ceiling requires the same dual-minimum treatment as EBS: the script takes min(provisioned_storage_iops, instance_class_iops_ceiling) as the effective ceiling, using the RDS_IOPS_CEILING table keyed on instance class. This covers PostgreSQL, MySQL, Oracle, SQL Server, and MariaDB. Only instances with io1 or io2 storage are examined since those are the storage types where you have a defined and fixed IOPS ceiling on both sides of the comparison.
The ceiling used for comparison is printed in the note field of each finding, along with both the storage and instance values so you can see immediately which constraint is binding and what the other ceiling is. In the common case where a mismatch exists and the instance ceiling is lower, the percentage reported reflects the instance ceiling, which is the number that actually determines when the workload saturates.
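The RDS arithmetic differs from EBS only in that no per-period division is needed. A sketch using hypothetical values (the read/write split below is invented; the totals match the reporting-db example that appears in the sample report in section 5):

```python
def rds_observed_iops(read_iops, write_iops):
    # ReadIOPS and WriteIOPS are already rates (ops/sec), so the observed
    # figure is simply their sum -- no division by the period.
    return read_iops + write_iops

def rds_effective_ceiling(provisioned_storage_iops, instance_class_ceiling):
    # Same dual-minimum rule as EBS: the lower ceiling is the binding one.
    return min(provisioned_storage_iops, instance_class_ceiling)

peak = rds_observed_iops(2100.0, 1312.0)        # hypothetical split, total 3412.0
ceiling = rds_effective_ceiling(6000, 3500)     # instance class is binding
print(f"{peak / ceiling * 100:.1f}%")
# 97.5%
```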
2.4 Aurora Provisioned Instances
This is where most operators reach for the wrong metric or the wrong ceiling, because Aurora looks like standard RDS but behaves fundamentally differently at the storage layer. Aurora storage is a distributed, shared cluster volume that auto-scales and can sustain up to 256,000 IOPS at the storage layer. There is no provisioned IOPS value on an Aurora instance because there is nothing to provision, and if you call DescribeDBInstances on an Aurora writer the Iops field returns zero. The storage layer is not the constraint and the dual-ceiling problem from section 2.1 does not apply here because there is only one ceiling: the instance class.
The constraint is always the instance class. An Aurora db.r6g.large has the same EBS-optimised I/O ceiling as its EC2 counterpart, 3,500 IOPS, and that is the number you need to compare against observed workload. The script uses the instance class ceiling table from the mismatch auditor and compares ReadIOPS plus WriteIOPS per instance identifier against that ceiling. Two metrics worth knowing about but deliberately not used here are VolumeReadIOPs and VolumeWriteIOPs in the cluster-level namespace: they are storage-layer metrics that aggregate across the entire Aurora cluster and can exceed any single instance ceiling. They are useful for understanding cluster-wide storage throughput, but they are not the right instrument for detecting whether a specific instance is hitting its processing limit.

2.5 Aurora Serverless v2
Aurora Serverless v2 removes the fixed instance class entirely and replaces it with an ACU range, a minimum and maximum capacity unit allocation between which the instance scales automatically, which means there is no static IOPS ceiling you can look up in a table. The effective IOPS capacity of a Serverless v2 instance scales proportionally with its current ACU allocation and reaches approximately 64,000 IOPS at maximum ACUs. The script reads the MaxCapacity value from DescribeDBClusters and treats 64,000 IOPS as the ceiling that corresponds to that maximum, with your threshold applied against that figure. The observed IOPS still come from ReadIOPS plus WriteIOPS per instance identifier in the AWS/RDS namespace, which is the same source used for provisioned Aurora instances. As with provisioned Aurora, the dual-ceiling problem does not apply because the storage layer is not a binding constraint.
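Assuming a linear relationship between the configured maximum ACUs and the IOPS ceiling, as the proportional-scaling description above suggests (the 64,000 and 128 constants appear in the script; the exact interpolation below the maximum is an implementation detail and this sketch is one plausible reading of it), the ceiling calculation looks like this:

```python
SERVERLESS_V2_MAX_IOPS = 64_000   # approximate ceiling at the maximum ACU allocation
SERVERLESS_V2_MAX_ACUS = 128

def serverless_v2_iops_ceiling(max_capacity_acus):
    """Approximate IOPS ceiling for an Aurora Serverless v2 instance, scaling
    linearly with the cluster's configured MaxCapacity (an assumption, not a
    documented AWS formula)."""
    return SERVERLESS_V2_MAX_IOPS * (max_capacity_acus / SERVERLESS_V2_MAX_ACUS)

print(serverless_v2_iops_ceiling(128))   # cluster configured at the full 128 ACUs
# 64000.0
print(serverless_v2_iops_ceiling(32))    # cluster capped at 32 ACUs
# 16000.0
```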
3. Installation and Permissions
Install the required dependencies with the following:
pip install boto3 pandas openpyxl

The IAM role you assume in each target account needs these permissions:
cloudwatch:GetMetricStatistics
rds:DescribeDBInstances
rds:DescribeDBClusters
ec2:DescribeVolumes
ec2:DescribeInstances

If you are using AWS Organizations, the caller identity in your management account also needs organizations:ListChildren, organizations:DescribeAccount, and sts:AssumeRole into each member account. The script will attempt to assume the role name you specify in every account it discovers and will log a warning for any account where the assumption fails rather than aborting the entire run, so you get full coverage for every account where the role is in place.
4. Running the Script
The two required parameters are --max-ops-pct and --max-ops-duration-secs and everything else has sensible defaults. To scan an entire OU looking for anything that hit 90 percent for more than two minutes in the last 24 hours, run it like this:
python iops_saturation.py \
--max-ops-pct 90 \
--max-ops-duration-secs 120 \
--ou-id ou-xxxx-xxxxxxxx \
--regions eu-west-1 us-east-1 \
--workers 10

To scan specific accounts with a tighter threshold and a 48-hour lookback, pass account IDs directly instead of an OU:
python iops_saturation.py \
--max-ops-pct 95 \
--max-ops-duration-secs 300 \
--lookback-hours 48 \
--accounts 123456789012 234567890123 \
--regions eu-west-1

The script exits with code 0 if no breaches are found and code 1 if any resource hit 100 percent of its ceiling during the lookback window, which means you can wire it directly into a CI pipeline or EventBridge scheduled job that posts to your incident channel when the condition fires. A resource reaching 100 percent of its I/O ceiling is not a configuration curiosity, it is a past incident or an active risk, and it deserves the same treatment as any other production alert.
5. Reading the Output
The script produces a colour-coded Excel workbook sorted by peak utilisation percentage, a flat CSV for programmatic consumption, and a summary printed to stdout. Here is what a realistic run looks like across a small estate with three breach findings:
======================================================================
IOPS SATURATION BREACH REPORT
Threshold : >= 90.0% of IOPS ceiling
Duration : >= 120s sustained
======================================================================
Aurora (aurora-postgresql) (1 breach)
Resource Ceiling Peak IOPS Peak % Duration
======================================== ======== ========== ======= ==========
payments-writer / payments-db-instance-1 3,500 3,498 99.9% 480s
Account: 123456789012 | Region: eu-west-1
Window: 2026-03-01 12:14:00 UTC to 2026-03-01 12:22:00 UTC
Note: Aurora storage is uncapped; ceiling is the instance class processing limit (db.r6g.large: 3,500 IOPS)
EBS (io2) (1 breach)
Resource Ceiling Peak IOPS Peak % Duration
======================================== ======== ========== ======= ==========
analytics-etl-vol / vol-0a1b2c3d4e5f 13,333 12,940 97.0% 180s
Account: 234567890123 | Region: eu-west-1
Window: 2026-03-01 02:31:00 UTC to 2026-03-01 02:34:00 UTC
Note: Effective ceiling = min(storage: 16,000, instance m6i.2xlarge: 13,333) = 13,333 IOPS
RDS (postgres) (1 breach)
Resource Ceiling Peak IOPS Peak % Duration
======================================== ======== ========== ======= ==========
reporting-db / db-reporting-prod 3,500 3,412 97.5% 240s
Account: 123456789012 | Region: us-east-1
Window: 2026-03-01 08:45:00 UTC to 2026-03-01 08:49:00 UTC
Note: Effective ceiling = min(storage: 6,000, instance db.r6g.large: 3,500) = 3,500 IOPS
Total breaches found: 3
======================================================================

Notice how the RDS finding in the example above shows a ceiling of 3,500 even though the instance has 6,000 provisioned storage IOPS. The instance class is the binding constraint. A naive comparison against the storage ceiling would have shown 56.9 percent utilisation and produced no finding at all. The workload is actually at 97.5 percent of its effective capacity.
There are three things to read in each finding. The peak percentage tells you how close to the ceiling the resource actually got, the duration tells you how long it held there, and the window timestamps tell you exactly when to look in your application metrics and logs to find the correlated latency spike or error rate increase. The payments writer above at 99.9 percent for eight minutes during the middle of the day is a near-miss with a plausible transaction peak attached to it, and that is not a monitoring curiosity but a capacity planning item you act on this sprint.
5.1 What the Excel Workbook Contains
The workbook has two sheets. The first, IOPS Saturation Breaches, contains one row per breach event sorted by peak utilisation percentage descending, with columns for account ID and name, region, service type, resource ID and name, instance type, the effective IOPS ceiling used for comparison (always the minimum of the two applicable ceilings for EBS and RDS), the storage IOPS ceiling, the instance IOPS ceiling, the threshold IOPS at your specified percentage, peak observed IOPS, peak percentage of ceiling, longest breach duration in seconds, breach start and end timestamps in UTC, the CloudWatch metric used, and any relevant notes about how the ceiling was calculated. The second sheet, Summary by Service, groups findings by service type and shows total breach count, maximum peak percentage observed, and average breach duration, and this is the sheet to share with your engineering leads because it gives the distribution at a glance without requiring anyone to scroll through individual rows.
Row colours in the first sheet map to utilisation severity. Red means the resource hit or exceeded 100 percent of its ceiling, which is confirmed saturation and should be treated as an incident retrospective item. Orange covers 95 to 100 percent and represents a resource operating without meaningful headroom that needs attention before the next load event. Amber covers 90 to 95 percent, which is a structural warning worth adding to your next architecture review with the breach timestamps and duration included in the discussion. Green means a breach was detected but the peak stayed below your amber threshold, which typically indicates a noisy but not immediately dangerous resource.
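The banding described above maps onto a small classification function. A sketch (the cutoffs mirror the thresholds in the text, with green covering breaches whose peak stayed below the amber floor):

```python
def severity_band(peak_pct, amber_floor=90.0, orange_floor=95.0):
    """Map a breach's peak utilisation percentage onto the workbook's
    colour bands, using the default thresholds described in the text."""
    if peak_pct >= 100.0:
        return "red"      # confirmed saturation: incident retrospective item
    if peak_pct >= orange_floor:
        return "orange"   # no meaningful headroom before the next load event
    if peak_pct >= amber_floor:
        return "amber"    # structural warning for the next architecture review
    return "green"        # breached the scan threshold but below the amber floor

print([severity_band(p) for p in (99.9, 97.5, 92.0, 103.0)])
# ['orange', 'orange', 'amber', 'red']
```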
6. What To Do With the Findings
Red rows require immediate action. Pull the breach timestamps and correlate them with your application latency and error rate metrics for the same period, because there is almost certainly an incident in your history that traces back to this resource, even if it was attributed at the time to something else such as a downstream timeout, connection pool exhaustion, or an upstream retry storm, with the I/O saturation as the initiating cause. Fix the instance class or provisioned IOPS and then open a retrospective item to understand why this was not caught at provisioning time.
Orange rows mean the resource does not have enough headroom to absorb any meaningful load increase. You need to either increase the instance ceiling by upgrading the instance class, reduce the workload through query optimisation, connection pooling improvements, or read replica offloading, or accept the risk consciously and document it. What you should not do is leave it and assume it will be fine because it has not triggered an outage yet. Luck is not a capacity model.
Amber rows are planning cycle items rather than emergencies. Add them to your next architecture review and include the breach timestamps and duration in the discussion. If a resource is repeatedly hitting 90 percent during predictable traffic peaks the fix is straightforward, and the conversation with your team is easier when you have the data to show them what is actually happening rather than asking them to take it on faith.
If the script returns no findings, either your estate is genuinely healthy from an I/O capacity perspective, or your threshold is set too conservatively, or the lookback window did not capture your peak traffic period. Try a 72-hour lookback across your typical weekly peak if you are not seeing results you expect, because the absence of findings with a 24-hour window that does not cover your busiest period is not the same as confirmation that nothing is wrong.
7. Using Both Scripts Together
The mismatch auditor and this saturation detector answer different questions and you need both to have complete coverage. The mismatch auditor runs against your configuration data from the AWS APIs, does not need CloudWatch, and tells you where your architecture has provisioned more storage IOPS than your instance class can consume. It is a preventive tool, and you should run it on a schedule as part of your infrastructure compliance pipeline, treat findings with the same severity as security misconfigurations, and block deployments that introduce new critical-severity mismatches.
The saturation detector runs against observed operational data from CloudWatch and tells you which resources are actually approaching or hitting their ceiling under real workload conditions, regardless of whether a configuration mismatch exists. It is a detective tool, and you should run it on a schedule against your recent history, pipe its exit code into your alerting infrastructure, and use it as an input to your capacity planning cycle.
The scenarios they catch are different. You can have a mismatch with no saturation because the storage is over-provisioned but the workload is light and the instance ceiling is never approached. You can have saturation with no mismatch because the configuration is correct but the workload has grown to the point where even the right instance class is running out of room. And you can have both, which is the worst case: a resource where the effective ceiling is lower than you think because of a mismatch and where observed IOPS are already approaching that lower ceiling. Running both scripts closes both gaps and gives you the structural audit and the operational signal together, and between them they surface the class of failure that your existing tooling, including Trusted Advisor, CloudWatch alarms, Performance Insights, and your APM, will not name for you until it has already caused an outage.
8. The Script
cat > iops_saturation.py << 'PYEOF'
#!/usr/bin/env python3
"""
IOPS Saturation Monitor
Scans EBS volumes, RDS instances, and Aurora clusters and identifies
any that have sustained IOPS utilisation at or above a threshold
percentage of their capacity for longer than a specified duration.
Metric selection and ceiling calculation are automatic per service type:
EBS VolumeReadOps + VolumeWriteOps (Count / 60s = IOPS)
Ceiling = min(provisioned_storage_iops, instance_type_iops_ceiling)
RDS standard ReadIOPS + WriteIOPS (rate metric, IOPS directly)
Ceiling = min(provisioned_storage_iops, instance_class_iops_ceiling)
Aurora provisioned ReadIOPS + WriteIOPS vs instance class ceiling only
(Aurora storage is uncapped; no dual-ceiling applies)
Aurora Serverless v2 ReadIOPS + WriteIOPS vs 64000 IOPS at max ACUs
Usage:
python iops_saturation.py --max-ops-pct 90 --max-ops-duration-secs 120 \
--ou-id ou-xxxx-xxxxxxxx --regions eu-west-1 us-east-1
python iops_saturation.py --max-ops-pct 95 --max-ops-duration-secs 300 \
--lookback-hours 48 --accounts 123456789012 --regions eu-west-1
Required permissions on the assumed role:
cloudwatch:GetMetricStatistics
rds:DescribeDBInstances
rds:DescribeDBClusters
ec2:DescribeVolumes
ec2:DescribeInstances
"""
import boto3
import csv
import sys
import argparse
import logging
from datetime import datetime, timezone, timedelta
from dataclasses import dataclass, asdict, field
from typing import Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
try:
import pandas as pd
PANDAS_AVAILABLE = True
except ImportError:
PANDAS_AVAILABLE = False
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S"
)
log = logging.getLogger(__name__)
EC2_IOPS_CEILING = {
"t3.nano": 2085, "t3.micro": 2085, "t3.small": 2085, "t3.medium": 2085,
"t3.large": 2085, "t3.xlarge": 2085, "t3.2xlarge": 2085,
"t3a.nano": 2085, "t3a.micro": 2085, "t3a.small": 2085, "t3a.medium": 2085,
"t3a.large": 2085, "t3a.xlarge": 2085, "t3a.2xlarge": 2085,
"t4g.nano": 2085, "t4g.micro": 2085, "t4g.small": 2085, "t4g.medium": 2085,
"t4g.large": 2085, "t4g.xlarge": 2085, "t4g.2xlarge": 2085,
"m5.large": 3600, "m5.xlarge": 6000, "m5.2xlarge": 8333,
"m5.4xlarge": 16667, "m5.8xlarge": 18750, "m5.12xlarge": 28750,
"m5.16xlarge": 37500, "m5.24xlarge": 40000, "m5.metal": 40000,
"m5a.large": 3600, "m5a.xlarge": 6000, "m5a.2xlarge": 8333,
"m5a.4xlarge": 16667, "m5a.8xlarge": 18750, "m5a.12xlarge": 28750,
"m5a.16xlarge": 37500, "m5a.24xlarge": 40000,
"m6g.medium": 3500, "m6g.large": 3500, "m6g.xlarge": 7000,
"m6g.2xlarge": 10000, "m6g.4xlarge": 20000, "m6g.8xlarge": 30000,
"m6g.12xlarge": 40000, "m6g.16xlarge": 40000, "m6g.metal": 40000,
"m6i.large": 6667, "m6i.xlarge": 10000, "m6i.2xlarge": 13333,
"m6i.4xlarge": 20000, "m6i.8xlarge": 26667, "m6i.12xlarge": 40000,
"m6i.16xlarge": 40000, "m6i.24xlarge": 40000, "m6i.32xlarge": 40000,
"r5.large": 3600, "r5.xlarge": 6000, "r5.2xlarge": 8333,
"r5.4xlarge": 16667, "r5.8xlarge": 18750, "r5.12xlarge": 28750,
"r5.16xlarge": 37500, "r5.24xlarge": 40000, "r5.metal": 40000,
"r6g.medium": 3500, "r6g.large": 3500, "r6g.xlarge": 7000,
"r6g.2xlarge": 10000, "r6g.4xlarge": 20000, "r6g.8xlarge": 30000,
"r6g.12xlarge": 40000, "r6g.16xlarge": 40000, "r6g.metal": 40000,
"r6i.large": 6667, "r6i.xlarge": 10000, "r6i.2xlarge": 13333,
"r6i.4xlarge": 20000, "r6i.8xlarge": 26667, "r6i.12xlarge": 40000,
"r6i.16xlarge": 40000, "r6i.24xlarge": 40000, "r6i.32xlarge": 40000,
"c5.large": 3600, "c5.xlarge": 6000, "c5.2xlarge": 8333,
"c5.4xlarge": 16667, "c5.9xlarge": 20000, "c5.12xlarge": 28750,
"c5.18xlarge": 37500, "c5.24xlarge": 40000, "c5.metal": 40000,
"c6g.medium": 3500, "c6g.large": 3500, "c6g.xlarge": 7000,
"c6g.2xlarge": 10000, "c6g.4xlarge": 20000, "c6g.8xlarge": 30000,
"c6g.12xlarge": 40000, "c6g.16xlarge": 40000, "c6g.metal": 40000,
"c6i.large": 6667, "c6i.xlarge": 10000, "c6i.2xlarge": 13333,
"c6i.4xlarge": 20000, "c6i.8xlarge": 26667, "c6i.12xlarge": 40000,
"c6i.16xlarge": 40000, "c6i.24xlarge": 40000, "c6i.32xlarge": 40000,
"i3.large": 3000, "i3.xlarge": 6000, "i3.2xlarge": 12000,
"i3.4xlarge": 16000, "i3.8xlarge": 32500, "i3.16xlarge": 65000,
"i3en.large": 4750, "i3en.xlarge": 9500, "i3en.2xlarge": 19000,
"i3en.3xlarge": 26125, "i3en.6xlarge": 52250, "i3en.12xlarge": 65000,
"i3en.24xlarge": 65000,
}
RDS_IOPS_CEILING = {
"db.t3.micro": 1536, "db.t3.small": 1536, "db.t3.medium": 1536,
"db.t3.large": 2048, "db.t3.xlarge": 2048, "db.t3.2xlarge": 2048,
"db.t4g.micro": 1700, "db.t4g.small": 1700, "db.t4g.medium": 1700,
"db.t4g.large": 2000, "db.t4g.xlarge": 2000, "db.t4g.2xlarge": 2000,
"db.m5.large": 3600, "db.m5.xlarge": 6000, "db.m5.2xlarge": 8333,
"db.m5.4xlarge": 16667, "db.m5.8xlarge": 18750, "db.m5.12xlarge": 28750,
"db.m5.16xlarge": 37500, "db.m5.24xlarge": 40000,
"db.m6g.large": 3500, "db.m6g.xlarge": 7000, "db.m6g.2xlarge": 10000,
"db.m6g.4xlarge": 20000, "db.m6g.8xlarge": 30000, "db.m6g.12xlarge": 40000,
"db.m6g.16xlarge": 40000,
"db.m6i.large": 6667, "db.m6i.xlarge": 10000, "db.m6i.2xlarge": 13333,
"db.m6i.4xlarge": 20000, "db.m6i.8xlarge": 26667, "db.m6i.12xlarge": 40000,
"db.m6i.16xlarge": 40000,
"db.r5.large": 3600, "db.r5.xlarge": 6000, "db.r5.2xlarge": 8333,
"db.r5.4xlarge": 16667, "db.r5.8xlarge": 18750, "db.r5.12xlarge": 28750,
"db.r5.16xlarge": 37500, "db.r5.24xlarge": 40000,
"db.r6g.large": 3500, "db.r6g.xlarge": 7000, "db.r6g.2xlarge": 10000,
"db.r6g.4xlarge": 20000, "db.r6g.8xlarge": 30000, "db.r6g.12xlarge": 40000,
"db.r6g.16xlarge": 40000,
"db.r6i.large": 6667, "db.r6i.xlarge": 10000, "db.r6i.2xlarge": 13333,
"db.r6i.4xlarge": 20000, "db.r6i.8xlarge": 26667, "db.r6i.12xlarge": 40000,
"db.r6i.16xlarge": 40000,
"db.x1e.xlarge": 3700, "db.x1e.2xlarge": 7400, "db.x1e.4xlarge": 14800,
"db.x1e.8xlarge": 29600, "db.x1e.16xlarge": 40000, "db.x1e.32xlarge": 40000,
"db.x2g.large": 3500, "db.x2g.xlarge": 7000, "db.x2g.2xlarge": 10000,
"db.x2g.4xlarge": 20000, "db.x2g.8xlarge": 30000, "db.x2g.12xlarge": 40000,
"db.x2g.16xlarge": 40000,
"db.serverless": 64000,
}
AURORA_SERVERLESS_MAX_ACUS_DEFAULT = 128
PERIOD_SECONDS = 60
@dataclass
class SaturationBreach:
account_id: str
account_name: str
region: str
service_type: str
resource_id: str
resource_name: str
instance_type: str
iops_ceiling: int # effective ceiling = min(storage, instance) for EBS/RDS
storage_iops_ceiling: int # provisioned storage IOPS
instance_iops_ceiling: int # instance class IOPS ceiling (0 = not applicable)
threshold_pct: float
threshold_iops: float
max_observed_iops: float
max_observed_pct: float
longest_breach_seconds: int
breach_start_utc: str
breach_end_utc: str
metric_used: str
note: str = ""
def get_metric_datapoints(cw_client, namespace, metric_name, dimensions, start_time, end_time, stat="Sum"):
    # GetMetricStatistics returns at most 1,440 datapoints per call, which at a
    # 60-second period is exactly 24 hours, so longer lookbacks (e.g. the
    # documented 48-hour run) are fetched in 24-hour chunks and stitched together.
    points = []
    max_window = timedelta(seconds=PERIOD_SECONDS * 1440)
    window_start = start_time
    while window_start < end_time:
        window_end = min(window_start + max_window, end_time)
        resp = cw_client.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric_name,
            Dimensions=dimensions,
            StartTime=window_start,
            EndTime=window_end,
            Period=PERIOD_SECONDS,
            Statistics=[stat],
        )
        points.extend((dp["Timestamp"], dp[stat]) for dp in resp.get("Datapoints", []))
        window_start = window_end
    points.sort(key=lambda x: x[0])
    return points
def find_sustained_breaches(combined_iops, threshold_iops, max_ops_duration_seconds, is_count_metric=False):
if not combined_iops:
return []
timestamps = sorted(combined_iops.keys())
breaches = []
run_start = None
run_end = None
run_max = 0.0
for ts in timestamps:
raw = combined_iops[ts]
iops = raw / PERIOD_SECONDS if is_count_metric else raw
if iops >= threshold_iops:
if run_start is None:
run_start = ts
run_end = ts
run_max = max(run_max, iops)
else:
if run_start is not None:
duration = (run_end - run_start).total_seconds() + PERIOD_SECONDS
if duration >= max_ops_duration_seconds:
breaches.append((run_start, run_end, run_max, duration))
run_start = None
run_end = None
run_max = 0.0
if run_start is not None:
duration = (run_end - run_start).total_seconds() + PERIOD_SECONDS
if duration >= max_ops_duration_seconds:
breaches.append((run_start, run_end, run_max, duration))
return breaches
def build_volume_instance_map(ec2_client):
"""
Returns a dict mapping volume_id -> (instance_id, instance_type)
for all in-use volumes in the region.
"""
vol_to_instance = {}
instance_types = {}
# Collect instance types first
inst_paginator = ec2_client.get_paginator("describe_instances")
for page in inst_paginator.paginate():
for reservation in page["Reservations"]:
for inst in reservation["Instances"]:
instance_types[inst["InstanceId"]] = inst.get("InstanceType", "unknown")
# Map volumes to instances
vol_paginator = ec2_client.get_paginator("describe_volumes")
for page in vol_paginator.paginate(
Filters=[{"Name": "status", "Values": ["in-use"]}]
):
for vol in page["Volumes"]:
for attachment in vol.get("Attachments", []):
iid = attachment.get("InstanceId")
if iid and iid in instance_types:
vol_to_instance[vol["VolumeId"]] = (iid, instance_types[iid])
break
return vol_to_instance
def audit_ebs(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    ec2 = session.client("ec2", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)
    # Build volume -> instance type mapping once for the region
    try:
        vol_instance_map = build_volume_instance_map(ec2)
    except Exception as e:
        log.warning(f"Could not build volume/instance map in {account_id}/{region}: {e}")
        vol_instance_map = {}
    vol_paginator = ec2.get_paginator("describe_volumes")
    for page in vol_paginator.paginate(
        Filters=[
            {"Name": "volume-type", "Values": ["io1", "io2", "gp3"]},
            {"Name": "status", "Values": ["in-use"]},
        ]
    ):
        for vol in page["Volumes"]:
            provisioned_iops = vol.get("Iops", 0) or 0
            if provisioned_iops == 0:
                continue
            vol_id = vol["VolumeId"]
            tags = vol.get("Tags", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), vol_id)
            vol_type = vol.get("VolumeType", "unknown")
            # Determine effective ceiling: min(storage IOPS, instance IOPS ceiling)
            instance_type = "unknown"
            instance_iops_ceiling = 0
            ceiling_note = ""
            if vol_id in vol_instance_map:
                _, instance_type = vol_instance_map[vol_id]
                instance_iops_ceiling = EC2_IOPS_CEILING.get(instance_type, 0)
            if instance_iops_ceiling > 0:
                effective_ceiling = min(provisioned_iops, instance_iops_ceiling)
                binding = "storage" if provisioned_iops <= instance_iops_ceiling else "instance"
                ceiling_note = (
                    f"Effective ceiling = min(storage: {provisioned_iops:,}, "
                    f"instance {instance_type}: {instance_iops_ceiling:,}) = {effective_ceiling:,} IOPS "
                    f"[{binding} ceiling is binding]"
                )
            else:
                effective_ceiling = provisioned_iops
                ceiling_note = (
                    f"Storage ceiling used ({provisioned_iops:,} IOPS); "
                    f"instance type {instance_type!r} not in lookup table"
                )
            threshold_iops = effective_ceiling * (max_ops_pct / 100.0)
            try:
                pts_read = get_metric_datapoints(cw, "AWS/EBS", "VolumeReadOps",
                                                 [{"Name": "VolumeId", "Value": vol_id}], start_time, end_time)
                pts_write = get_metric_datapoints(cw, "AWS/EBS", "VolumeWriteOps",
                                                  [{"Name": "VolumeId", "Value": vol_id}], start_time, end_time)
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=True)
            except Exception as e:
                log.warning(f"CloudWatch error for EBS {vol_id} in {account_id}/{region}: {e}")
                continue
            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=f"EBS ({vol_type})", resource_id=vol_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=effective_ceiling,
                    storage_iops_ceiling=provisioned_iops,
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / effective_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used="AWS/EBS: VolumeReadOps + VolumeWriteOps (Sum / 60s = IOPS)",
                    note=ceiling_note,
                ))
    return findings
def audit_rds_standard(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    rds = session.client("rds", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)
    paginator = rds.get_paginator("describe_db_instances")
    for page in paginator.paginate():
        for db in page["DBInstances"]:
            engine = db.get("Engine", "")
            if "aurora" in engine.lower():
                continue
            provisioned_iops = db.get("Iops", 0) or 0
            if provisioned_iops == 0:
                continue
            status = db.get("DBInstanceStatus", "")
            if status not in ("available", "backing-up", "modifying"):
                continue
            db_id = db.get("DBInstanceIdentifier", "")
            instance_type = db.get("DBInstanceClass", "unknown")
            tags = db.get("TagList", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), db_id)
            # Determine effective ceiling: min(storage IOPS, instance class IOPS ceiling)
            instance_iops_ceiling = RDS_IOPS_CEILING.get(instance_type, 0)
            if instance_iops_ceiling > 0:
                effective_ceiling = min(provisioned_iops, instance_iops_ceiling)
                binding = "storage" if provisioned_iops <= instance_iops_ceiling else "instance"
                ceiling_note = (
                    f"Effective ceiling = min(storage: {provisioned_iops:,}, "
                    f"instance {instance_type}: {instance_iops_ceiling:,}) = {effective_ceiling:,} IOPS "
                    f"[{binding} ceiling is binding]"
                )
            else:
                effective_ceiling = provisioned_iops
                ceiling_note = (
                    f"Storage ceiling used ({provisioned_iops:,} IOPS); "
                    f"instance type {instance_type!r} not in lookup table"
                )
            threshold_iops = effective_ceiling * (max_ops_pct / 100.0)
            dims = [{"Name": "DBInstanceIdentifier", "Value": db_id}]
            try:
                pts_read = get_metric_datapoints(cw, "AWS/RDS", "ReadIOPS", dims, start_time, end_time, stat="Average")
                pts_write = get_metric_datapoints(cw, "AWS/RDS", "WriteIOPS", dims, start_time, end_time, stat="Average")
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=False)
            except Exception as e:
                log.warning(f"CloudWatch error for RDS {db_id} in {account_id}/{region}: {e}")
                continue
            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=f"RDS ({engine})", resource_id=db_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=effective_ceiling,
                    storage_iops_ceiling=provisioned_iops,
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / effective_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used="AWS/RDS: ReadIOPS + WriteIOPS (Average)",
                    note=ceiling_note,
                ))
    return findings
def audit_aurora(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    rds = session.client("rds", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)
    cluster_max_acus = {}
    try:
        cluster_paginator = rds.get_paginator("describe_db_clusters")
        for page in cluster_paginator.paginate():
            for cluster in page["DBClusters"]:
                if "aurora" not in cluster.get("Engine", "").lower():
                    continue
                sv2 = cluster.get("ServerlessV2ScalingConfiguration", {})
                if sv2:
                    cluster_max_acus[cluster["DBClusterIdentifier"]] = sv2.get(
                        "MaxCapacity", AURORA_SERVERLESS_MAX_ACUS_DEFAULT
                    )
    except Exception as e:
        log.warning(f"Could not list Aurora clusters in {account_id}/{region}: {e}")
    paginator = rds.get_paginator("describe_db_instances")
    for page in paginator.paginate():
        for db in page["DBInstances"]:
            engine = db.get("Engine", "")
            if "aurora" not in engine.lower():
                continue
            status = db.get("DBInstanceStatus", "")
            if status not in ("available", "backing-up", "modifying"):
                continue
            db_id = db.get("DBInstanceIdentifier", "")
            instance_type = db.get("DBInstanceClass", "unknown")
            cluster_id = db.get("DBClusterIdentifier", "")
            tags = db.get("TagList", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), db_id)
            is_serverless = instance_type == "db.serverless"
            dims = [{"Name": "DBInstanceIdentifier", "Value": db_id}]
            # Aurora: no dual-ceiling applies -- storage is uncapped at cluster layer
            if is_serverless:
                max_acus = cluster_max_acus.get(cluster_id, AURORA_SERVERLESS_MAX_ACUS_DEFAULT)
                iops_ceiling = 64000
                threshold_iops = iops_ceiling * (max_ops_pct / 100.0)
                service_type = "Aurora Serverless v2"
                metric_note = f"AWS/RDS: ReadIOPS + WriteIOPS (Average); ceiling = 64000 IOPS at max {max_acus} ACUs"
                note = "Serverless v2: IOPS ceiling is proportional to ACU allocation; storage layer is uncapped"
                instance_iops_ceiling = 0
            else:
                ceiling = RDS_IOPS_CEILING.get(instance_type)
                if ceiling is None:
                    log.warning(f"Unknown Aurora instance type {instance_type} ({db_id}) -- skipping")
                    continue
                iops_ceiling = ceiling
                threshold_iops = iops_ceiling * (max_ops_pct / 100.0)
                service_type = f"Aurora ({engine})"
                metric_note = f"AWS/RDS: ReadIOPS + WriteIOPS (Average); ceiling = instance class max ({instance_type}: {iops_ceiling:,} IOPS)"
                note = f"Aurora storage is uncapped; ceiling is the instance class processing limit ({instance_type}: {iops_ceiling:,} IOPS)"
                instance_iops_ceiling = iops_ceiling
            try:
                pts_read = get_metric_datapoints(cw, "AWS/RDS", "ReadIOPS", dims, start_time, end_time, stat="Average")
                pts_write = get_metric_datapoints(cw, "AWS/RDS", "WriteIOPS", dims, start_time, end_time, stat="Average")
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=False)
            except Exception as e:
                log.warning(f"CloudWatch error for Aurora {db_id} in {account_id}/{region}: {e}")
                continue
            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=service_type, resource_id=db_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=iops_ceiling,
                    storage_iops_ceiling=0,  # not applicable for Aurora
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / iops_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used=metric_note,
                    note=note,
                ))
    return findings
def list_accounts_in_ou(ou_id):
    org = boto3.client("organizations")
    accounts = []
    def recurse(parent_id):
        for child_type in ("ACCOUNT", "ORGANIZATIONAL_UNIT"):
            paginator = org.get_paginator("list_children")
            for page in paginator.paginate(ParentId=parent_id, ChildType=child_type):
                for child in page["Children"]:
                    if child_type == "ACCOUNT":
                        try:
                            resp = org.describe_account(AccountId=child["Id"])
                            acc = resp["Account"]
                            if acc["Status"] == "ACTIVE":
                                accounts.append({"id": acc["Id"], "name": acc["Name"]})
                        except Exception as e:
                            log.warning(f"Could not describe account {child['Id']}: {e}")
                    else:
                        recurse(child["Id"])
    recurse(ou_id)
    return accounts
def get_session(account_id, role_name):
    if role_name:
        sts = boto3.client("sts")
        role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
        creds = sts.assume_role(RoleArn=role_arn, RoleSessionName="IOPSSaturationScan")["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    return boto3.Session()
def audit_account(account, role_name, regions, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    account_id = account["id"]
    account_name = account["name"]
    all_findings = []
    log.info(f"Auditing account {account_id} ({account_name})")
    try:
        session = get_session(account_id, role_name)
    except Exception as e:
        log.error(f"Cannot assume role in {account_id}: {e}")
        return []
    for region in regions:
        log.info(f"  {account_id} scanning {region}...")
        try:
            all_findings.extend(audit_ebs(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
            all_findings.extend(audit_rds_standard(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
            all_findings.extend(audit_aurora(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
        except Exception as e:
            log.error(f"  Error in {account_id}/{region}: {e}")
    return all_findings
def write_csv(findings, path):
    if not findings:
        return
    fieldnames = list(asdict(findings[0]).keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for finding in findings:
            writer.writerow(asdict(finding))
    log.info(f"CSV written: {path}")
def write_excel(findings, path):
    if not PANDAS_AVAILABLE:
        log.warning("pandas/openpyxl not installed -- skipping Excel output. pip install pandas openpyxl")
        return
    if not findings:
        return
    from openpyxl.styles import PatternFill
    rows = [asdict(f) for f in findings]
    df = pd.DataFrame(rows)
    df = df.sort_values("max_observed_pct", ascending=False)
    def row_colour(pct):
        if pct >= 100:
            return "FF2222"
        elif pct >= 95:
            return "FF6600"
        elif pct >= 90:
            return "FFB300"
        return "90EE90"
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df.to_excel(writer, index=False, sheet_name="IOPS Saturation Breaches")
        ws = writer.sheets["IOPS Saturation Breaches"]
        pct_col_idx = list(df.columns).index("max_observed_pct") + 1
        for row_idx in range(2, len(df) + 2):
            pct_val = ws.cell(row=row_idx, column=pct_col_idx).value or 0
            colour = row_colour(pct_val)
            for col_idx in range(1, len(df.columns) + 1):
                ws.cell(row=row_idx, column=col_idx).fill = PatternFill(
                    start_color=colour, end_color=colour, fill_type="solid"
                )
        summary = df.groupby("service_type").agg(
            breaches=("resource_id", "count"),
            max_pct_observed=("max_observed_pct", "max"),
            avg_breach_seconds=("longest_breach_seconds", "mean"),
        ).reset_index()
        summary.to_excel(writer, index=False, sheet_name="Summary by Service")
    log.info(f"Excel written: {path}")
def print_results(findings, max_ops_pct, max_ops_duration_seconds):
    print()
    print("=" * 70)
    print("IOPS SATURATION BREACH REPORT")
    print(f"Threshold : >= {max_ops_pct}% of effective IOPS ceiling")
    print(f"Duration  : >= {max_ops_duration_seconds}s sustained")
    print("=" * 70)
    if not findings:
        print("\nNo sustained IOPS saturation breaches found.")
        print("=" * 70)
        return
    findings_sorted = sorted(findings, key=lambda f: f.max_observed_pct, reverse=True)
    by_type = {}
    for f in findings_sorted:
        by_type.setdefault(f.service_type, []).append(f)
    for svc_type, items in sorted(by_type.items()):
        print(f"\n  {svc_type} ({len(items)} breach{'es' if len(items) != 1 else ''})")
        print(f"  {'Resource':<40} {'Ceiling':>8} {'Peak IOPS':>10} {'Peak %':>7} {'Duration':>10}")
        print(f"  {'=' * 40} {'=' * 8} {'=' * 10} {'=' * 7} {'=' * 10}")
        for f in items:
            print(f"  {f.resource_name:<40} {f.iops_ceiling:>8,} {f.max_observed_iops:>10,.0f} {f.max_observed_pct:>6.1f}% {f.longest_breach_seconds:>8}s")
            print(f"    Account: {f.account_id} | Region: {f.region}")
            print(f"    Window: {f.breach_start_utc} to {f.breach_end_utc}")
            if f.note:
                print(f"    Note: {f.note}")
    print(f"\n  Total breaches found: {len(findings)}")
    print("=" * 70)
def parse_args():
    parser = argparse.ArgumentParser(description="Scan EBS, RDS, and Aurora for sustained IOPS saturation.")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--ou-id", help="AWS Organizations OU ID")
    group.add_argument("--accounts", nargs="+", help="Specific AWS account IDs")
    parser.add_argument("--max-ops-pct", type=float, required=True,
                        help="Percentage of IOPS ceiling that constitutes a breach (e.g. 90)")
    parser.add_argument("--max-ops-duration-secs", type=int, required=True,
                        help="Minimum sustained breach duration in seconds to report (e.g. 120)")
    parser.add_argument("--lookback-hours", type=int, default=24,
                        help="Hours of CloudWatch history to examine (default: 24)")
    parser.add_argument("--role-name", default="OrganizationAccountAccessRole")
    parser.add_argument("--regions", nargs="+", default=["eu-west-1", "us-east-1"])
    parser.add_argument("--workers", type=int, default=5)
    parser.add_argument("--output-prefix", default="iops_saturation_report")
    return parser.parse_args()
def main():
    args = parse_args()
    # datetime.utcnow() is deprecated; use an aware UTC datetime instead
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    log.info("IOPS Saturation Scan starting")
    log.info(f"  Threshold : >= {args.max_ops_pct}% for >= {args.max_ops_duration_secs}s")
    log.info(f"  Lookback  : {args.lookback_hours}h | Regions: {', '.join(args.regions)}")
    accounts = list_accounts_in_ou(args.ou_id) if args.ou_id else [{"id": a, "name": a} for a in args.accounts]
    if args.ou_id:
        log.info(f"  Found {len(accounts)} active accounts in OU {args.ou_id}")
    all_findings = []
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = {
            executor.submit(audit_account, acc, args.role_name, args.regions,
                            args.max_ops_pct, args.max_ops_duration_secs, args.lookback_hours): acc
            for acc in accounts
        }
        for future in as_completed(futures):
            acc = futures[future]
            try:
                findings = future.result()
                all_findings.extend(findings)
                log.info(f"Account {acc['id']} complete: {len(findings)} breach(es)")
            except Exception as e:
                log.error(f"Account {acc['id']} failed: {e}")
    print_results(all_findings, args.max_ops_pct, args.max_ops_duration_secs)
    write_csv(all_findings, f"{args.output_prefix}_{timestamp}.csv")
    write_excel(all_findings, f"{args.output_prefix}_{timestamp}.xlsx")
    log.info("Scan complete.")
    return 1 if any(f.max_observed_pct >= 100.0 for f in all_findings) else 0
if __name__ == "__main__":
    sys.exit(main())
PYEOF
chmod +x iops_saturation.py
echo "iops_saturation.py created and marked executable"

Andrew Baker writes about cloud architecture, banking technology, and the gap between what systems are designed to do and what they actually do under load at andrewbaker.ninja.
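Once the file is created, a typical run looks something like the following. The account IDs below are placeholders, and the role name and regions simply echo the script's defaults; adjust all three to match your estate. The exit code is 1 if any resource was observed at or above 100 percent of its effective ceiling, which makes the script easy to wire into a scheduled job that pages on failure.

```shell
# Scan two accounts over the last 48 hours of CloudWatch history and
# report any resource that sustained >= 90% of its effective IOPS
# ceiling for 120 seconds or longer. Account IDs are placeholders.
python3 iops_saturation.py \
  --accounts 111111111111 222222222222 \
  --role-name OrganizationAccountAccessRole \
  --regions eu-west-1 us-east-1 \
  --max-ops-pct 90 \
  --max-ops-duration-secs 120 \
  --lookback-hours 48
```

To sweep a whole organizational unit instead, swap `--accounts` for `--ou-id` (the two are mutually exclusive) and run from an account that can call AWS Organizations and assume the named role in each member account.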





