Knowing Your IOPS Are Broken Is Not As Valuable As Knowing They Are About To Break

Andrew Baker | March 2026

Companion article to: https://andrewbaker.ninja/2026/03/01/the-silent-killer-in-your-aws-architecture-iops-mismatches/

Last week I published a script that scans your AWS estate and finds every EBS volume and RDS instance where your provisioned storage IOPS exceed what the compute instance can actually consume. That problem, the structural mismatch between storage ceiling and instance ceiling, is important and expensive and almost completely invisible to your existing tooling. You should run that script.

But there is a second problem that the mismatch auditor does not solve, and in some ways it is the more dangerous one. The mismatch auditor tells you where the gap exists but it does not tell you whether you are actually falling into it.

Consider the difference. A provisioned storage IOPS ceiling of 10,000 on an instance that can only push 3,500 is a configuration problem, meaning you are paying for performance you cannot consume and your headroom assumptions are wrong. But if your actual workload is only ever generating 1,200 IOPS, the mismatch is a cost and an architecture risk rather than an active emergency. The mismatch auditor will find it and you should fix it, but the building is not on fire yet.

Now consider the other case. Your provisioned storage ceiling is correct, your instance class ceiling matches what you need, and your architecture review would pass. But your workload is generating 3,400 IOPS against a 3,500 ceiling for minutes at a time, every day, during the lunchtime transaction peak. CloudWatch shows nothing alarming because the volume is not saturated and the instance is not at CPU capacity. Performance Insights shows no problematic wait events and your APM shows acceptable latency. You are running at 97 percent of your I/O capacity for sustained periods without knowing it.

That is the building that is about to catch fire.

A 3 percent buffer against a hard ceiling is not a buffer, it is a queue waiting to form. When load spikes because a batch job overlaps with transaction traffic, or a partner integration runs slightly earlier than usual, or a retry storm arrives from an upstream timeout, you cross that ceiling and requests start stacking in the virtualised I/O path. What was a 2ms storage read becomes 40ms as the queue grows, connection pools back up, upstream services time out and retry, and those retries compound the I/O load further. You are now in a feedback loop. Your database looks healthy by every metric your team is watching, and there is no obvious cause to debug, because the bottleneck lives in the gap between what your instance can consume and what your workload is demanding, a gap that none of your standard monitoring instruments will name for you.

The script in this post closes that gap.

1. What This Script Actually Does

The script scans your AWS estate across multiple accounts and regions and queries CloudWatch for every EBS volume, RDS instance, and Aurora instance. For each resource it asks whether actual observed IOPS reached or exceeded a percentage threshold of the resource’s ceiling, and if so, whether that condition persisted continuously for longer than a duration threshold you specify.

You provide both numbers at runtime. Running it with 90 percent and 120 seconds means any resource that sustained IOPS at or above 90 percent of its ceiling for more than two consecutive minutes in the lookback window gets reported, along with which resource breached, by how much, when it started and ended, and what the peak utilisation was.

Both parameters matter because a brief spike to 92 percent is not the same problem as 92 percent sustained for eight minutes. A spike is a normal part of operating any database under variable load, but a sustained breach is a sign that your headroom is structurally insufficient and the next slightly larger spike will tip you into saturation and queuing. The duration threshold is what separates the two.
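The interplay between the two parameters can be sketched in a few lines. This is an illustration of the logic, not the script itself (the real implementation appears in section 8 and accounts for datapoint gaps):

```python
def sustained_breach(iops_per_minute, ceiling, max_ops_pct, max_ops_duration_secs):
    """True if IOPS held at or above the threshold for the full duration.
    Assumes one datapoint per 60-second CloudWatch period."""
    threshold = ceiling * max_ops_pct / 100.0
    run = 0  # consecutive datapoints at or above the threshold
    for iops in iops_per_minute:
        run = run + 1 if iops >= threshold else 0
        if run * 60 >= max_ops_duration_secs:
            return True
    return False

spike     = [1200, 3250, 1100, 1150]  # one minute at ~93% of a 3,500 ceiling
sustained = [1200, 3250, 3240, 1150]  # two consecutive minutes above 90%
print(sustained_breach(spike,     3500, 90, 120))  # False: spike too brief
print(sustained_breach(sustained, 3500, 90, 120))  # True: sustained breach
```

The single spike to 93 percent does not fire at a 120-second duration threshold; the same peak held for two consecutive minutes does.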

2. Why the Metrics Differ By Service

This is the part that is easy to get wrong, and getting it wrong means your script either misses breaches entirely or fires false positives that erode trust in the output. The correct metric and the correct ceiling are different for EBS, standard RDS, Aurora provisioned instances, and Aurora Serverless v2, and here is why each one works the way it does.

2.1 The Dual Ceiling Problem for EBS and RDS

Before covering each service in detail, there is a principle that applies to EBS volumes and standard RDS instances that does not apply to Aurora, and it is the most common source of incorrect saturation calculations.

Every EBS volume and every RDS instance has two independent IOPS ceilings operating simultaneously. The first is the storage ceiling, which is the provisioned IOPS on the volume or instance. The second is the instance throughput ceiling, which is the maximum IOPS the underlying compute can push to attached storage. Your workload saturates whichever of these two ceilings it hits first, and that effective ceiling is always the lower of the two.

This is exactly the mismatch problem the companion script identifies: when your storage ceiling is higher than your instance ceiling, the instance ceiling becomes the binding constraint and the storage headroom above it is inaccessible. But even when there is no mismatch and both ceilings are sensibly set, you still need to compare observed IOPS against the lower of the two, because that is the number that actually governs when your workload runs out of room.

If you only compare against the storage ceiling you can build a false picture. A db.m6g.large with 4,000 provisioned storage IOPS has an instance class ceiling of 3,500 IOPS. If your workload hits 3,480 IOPS you are at 99.4 percent of your effective capacity, but a naive comparison against the storage ceiling gives you 87 percent and nothing fires. You are six minutes from saturation and your monitoring tells you everything is comfortable.
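In code, the arithmetic from that example is a one-line minimum (the figures below are the hypothetical db.m6g.large numbers from the paragraph above):

```python
# Hypothetical figures from the db.m6g.large example above.
storage_iops_ceiling  = 4000   # provisioned storage IOPS
instance_iops_ceiling = 3500   # db.m6g.large instance class ceiling
observed_iops         = 3480

# The effective ceiling is always the lower of the two.
effective_ceiling = min(storage_iops_ceiling, instance_iops_ceiling)

naive_pct     = 100 * observed_iops / storage_iops_ceiling  # 87.0% -- looks comfortable
effective_pct = 100 * observed_iops / effective_ceiling     # 99.4% -- about to saturate

print(f"naive: {naive_pct:.1f}%  effective: {effective_pct:.1f}%")
```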

The script handles this by computing the effective ceiling as the minimum of the storage IOPS and the instance IOPS ceiling at runtime, using the instance type ceiling tables that also power the mismatch auditor. The note field in the output records both values so you can see which ceiling is binding.

2.2 EBS Volumes

EBS publishes VolumeReadOps and VolumeWriteOps as operation counts per CloudWatch collection period rather than as a rate. A 60-second period that returns a value of 180,000 for VolumeReadOps means 180,000 read operations happened in that minute, so to convert that to IOPS, which is the unit your provisioned ceiling is expressed in, you divide by 60. The script does this automatically.
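A minimal sketch of that conversion, written as a helper you could call with a real CloudWatch client (the volume id in the usage comment is a placeholder):

```python
def ebs_read_iops(cw, volume_id, start, end):
    """Fetch VolumeReadOps (an operation count per period) and convert to IOPS.
    The Sum over a 60-second period divided by 60 gives operations per second."""
    resp = cw.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="VolumeReadOps",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return sorted(
        (dp["Timestamp"], dp["Sum"] / 60)  # e.g. 180,000 ops/min -> 3,000 IOPS
        for dp in resp["Datapoints"]
    )

# Usage: cw = boto3.client("cloudwatch", region_name="eu-west-1")
#        points = ebs_read_iops(cw, "vol-0123456789abcdef0", start, end)
```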

For in-use volumes, the script looks up the EC2 instance the volume is attached to and retrieves the instance type’s IOPS ceiling from the lookup table. The effective ceiling used for breach detection is min(provisioned_storage_iops, instance_type_iops_ceiling). Only io1, io2, and gp3 volumes are scanned because gp2 volumes use a burst credit model where the effective ceiling is elastic and not meaningfully comparable to a fixed provisioned number. If a volume is not attached to a known instance type, the script falls back to the provisioned storage IOPS and records a note accordingly.

2.3 Standard RDS

RDS publishes ReadIOPS and WriteIOPS in the AWS/RDS namespace as rate metrics, meaning they are already expressed in IOPS rather than as counts per period. You add them together. The ceiling requires the same dual-minimum treatment as EBS: the script takes min(provisioned_storage_iops, instance_class_iops_ceiling) as the effective ceiling, using the RDS_IOPS_CEILING table keyed on instance class. This covers PostgreSQL, MySQL, Oracle, SQL Server, and MariaDB. Only instances with io1 or io2 storage are examined since those are the storage types where you have a defined and fixed IOPS ceiling on both sides of the comparison.
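The rate-metric arithmetic and the dual-minimum ceiling reduce to a few lines. This sketch assumes a ceiling table shaped like the script's RDS_IOPS_CEILING, and the read/write split in the example is hypothetical:

```python
def rds_utilisation(read_iops, write_iops, storage_iops, instance_class, ceiling_table):
    """ReadIOPS and WriteIOPS are already rates, so they add directly.
    The effective ceiling is the lower of the storage and instance ceilings."""
    observed = read_iops + write_iops
    instance_ceiling = ceiling_table.get(instance_class, 0)
    effective = min(storage_iops, instance_ceiling) if instance_ceiling else storage_iops
    return observed, effective, 100.0 * observed / effective

# The reporting-db finding from section 5: db.r6g.large, 6,000 storage IOPS,
# with a hypothetical 2,100/1,312 read/write split.
obs, ceiling, pct = rds_utilisation(2100, 1312, 6000, "db.r6g.large",
                                    {"db.r6g.large": 3500})
print(obs, ceiling, round(pct, 1))  # 3412 3500 97.5 -- instance ceiling binds
```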

The ceiling used for comparison is printed in the note field of each finding, along with both the storage and instance values so you can see immediately which constraint is binding and what the other ceiling is. In the common case where a mismatch exists and the instance ceiling is lower, the percentage reported reflects the instance ceiling, which is the number that actually determines when the workload saturates.

2.4 Aurora Provisioned Instances

This is where most operators reach for the wrong metric or the wrong ceiling, because Aurora looks like standard RDS but behaves fundamentally differently at the storage layer. Aurora storage is a distributed, shared cluster volume that auto-scales and can sustain up to 256,000 IOPS. There is no provisioned IOPS value on an Aurora instance because there is nothing to provision, and if you call DescribeDBInstances on an Aurora writer, the Iops field returns zero. The storage layer is not the constraint, and the dual-ceiling problem from section 2.1 does not apply here because there is only one ceiling: the instance class.

The constraint is always the instance class. An Aurora db.r6g.large has the same EBS-optimised I/O ceiling as its equivalent EC2 counterpart, which is 3,500 IOPS, and that is the number you need to compare against observed workload. The script uses the instance class ceiling table from the mismatch auditor and compares ReadIOPS plus WriteIOPS per instance identifier against that ceiling. One metric worth knowing about but deliberately not used here is VolumeReadIOPs and VolumeWriteIOPs at the cluster namespace, which are storage-layer metrics that aggregate across the entire Aurora cluster and can exceed any single instance ceiling. They are useful for understanding cluster-wide storage throughput but they are not the right instrument for detecting whether a specific instance is hitting its processing limit.
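A sketch of the ceiling selection for provisioned Aurora, assuming a ceiling table keyed on instance class as in the script:

```python
def aurora_ceiling(db_instance, ceiling_table):
    """Aurora storage is effectively uncapped, so the instance class is the
    only ceiling. The Iops field from DescribeDBInstances is 0 for Aurora
    and must never be used as a ceiling."""
    if not db_instance.get("Engine", "").startswith("aurora"):
        raise ValueError("not an Aurora instance")
    return ceiling_table.get(db_instance["DBInstanceClass"])

# Shape of a DescribeDBInstances record for an Aurora writer:
writer = {"Engine": "aurora-postgresql", "DBInstanceClass": "db.r6g.large", "Iops": 0}
print(aurora_ceiling(writer, {"db.r6g.large": 3500}))  # 3500
```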

2.5 Aurora Serverless v2

Aurora Serverless v2 removes the fixed instance class entirely and replaces it with an ACU range, a minimum and maximum capacity unit allocation between which the instance scales automatically, which means there is no static IOPS ceiling you can look up in a table. The effective IOPS capacity of a Serverless v2 instance scales proportionally with its current ACU allocation and reaches approximately 64,000 IOPS at maximum ACUs. The script reads the MaxCapacity value from DescribeDBClusters and treats 64,000 IOPS as the ceiling that corresponds to that maximum, with your threshold applied against that figure. The observed IOPS still come from ReadIOPS plus WriteIOPS per instance identifier in the AWS/RDS namespace, which is the same source used for provisioned Aurora instances. As with provisioned Aurora, the dual-ceiling problem does not apply because the storage layer is not a binding constraint.
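A sketch of the Serverless v2 ceiling lookup, assuming (as the script does) that roughly 64,000 IOPS is treated as the ceiling at the configured maximum ACU allocation:

```python
def serverless_v2_ceiling(cluster, max_iops=64000):
    """Read MaxCapacity from a DescribeDBClusters record. The script pins the
    ceiling to ~64,000 IOPS at the configured maximum ACU allocation."""
    cfg = cluster.get("ServerlessV2ScalingConfiguration") or {}
    if cfg.get("MaxCapacity") is None:
        return None  # not an Aurora Serverless v2 cluster
    return max_iops

cluster = {"ServerlessV2ScalingConfiguration": {"MinCapacity": 0.5, "MaxCapacity": 64}}
print(serverless_v2_ceiling(cluster))  # 64000
print(serverless_v2_ceiling({}))       # None
```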

3. Installation and Permissions

Install the required dependencies with the following:

pip install boto3 pandas openpyxl

The IAM role you assume in each target account needs these permissions:

cloudwatch:GetMetricStatistics
rds:DescribeDBInstances
rds:DescribeDBClusters
ec2:DescribeVolumes
ec2:DescribeInstances

If you are using AWS Organizations, the caller identity in your management account also needs organizations:ListChildren, organizations:DescribeAccount, and sts:AssumeRole into each member account. The script will attempt to assume the role name you specify in every account it discovers and will log a warning for any account where the assumption fails rather than aborting the entire run, so you get full coverage for every account where the role is in place.

4. Running the Script

The two required parameters are --max-ops-pct and --max-ops-duration-secs and everything else has sensible defaults. To scan an entire OU looking for anything that hit 90 percent for more than two minutes in the last 24 hours, run it like this:

python iops_saturation.py \
  --max-ops-pct 90 \
  --max-ops-duration-secs 120 \
  --ou-id ou-xxxx-xxxxxxxx \
  --regions eu-west-1 us-east-1 \
  --workers 10

To scan specific accounts with a tighter threshold and a 48-hour lookback, pass account IDs directly instead of an OU:

python iops_saturation.py \
  --max-ops-pct 95 \
  --max-ops-duration-secs 300 \
  --lookback-hours 48 \
  --accounts 123456789012 234567890123 \
  --regions eu-west-1

The script exits with code 0 if no breaches are found and code 1 if any resource hit 100 percent of its ceiling during the lookback window, which means you can wire it directly into a CI pipeline or EventBridge scheduled job that posts to your incident channel when the condition fires. A resource reaching 100 percent of its I/O ceiling is not a configuration curiosity, it is a past incident or an active risk, and it deserves the same treatment as any other production alert.
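One way to wire the exit code into automation, sketched in Python. The alert hook is a placeholder for whatever your incident channel integration looks like:

```python
import subprocess
import sys

def run_and_alert(cmd, alert):
    """Run the saturation scan; call the alert hook only on exit code 1,
    which the script reserves for a resource hitting 100% of its ceiling."""
    rc = subprocess.run(cmd).returncode
    if rc == 1:
        alert("IOPS saturation: a resource hit 100% of its ceiling in the lookback window")
    return rc

# Usage (post_to_incident_channel is a hypothetical hook):
# run_and_alert([sys.executable, "iops_saturation.py",
#                "--max-ops-pct", "90", "--max-ops-duration-secs", "120",
#                "--accounts", "123456789012", "--regions", "eu-west-1"],
#               post_to_incident_channel)
```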

5. Reading the Output

The script produces a colour-coded Excel workbook sorted by peak utilisation percentage, a flat CSV for programmatic consumption, and a summary printed to stdout. Here is what a realistic run looks like across a small estate with three breach findings:

======================================================================
IOPS SATURATION BREACH REPORT
Threshold : >= 90.0% of IOPS ceiling
Duration  : >= 120s sustained
======================================================================

  Aurora (aurora-postgresql) (1 breach)
  Resource                                  Ceiling  Peak IOPS  Peak %   Duration
  ======================================== ======== ========== ======= ==========
  payments-writer / payments-db-instance-1    3,500      3,498   99.9%      480s
    Account: 123456789012 | Region: eu-west-1
    Window:  2026-03-01 12:14:00 UTC  to  2026-03-01 12:22:00 UTC
    Note:    Aurora storage is uncapped; ceiling is the instance class processing limit (db.r6g.large: 3,500 IOPS)

  EBS (io2) (1 breach)
  Resource                                  Ceiling  Peak IOPS  Peak %   Duration
  ======================================== ======== ========== ======= ==========
  analytics-etl-vol / vol-0a1b2c3d4e5f      13,333     12,940   97.0%      180s
    Account: 234567890123 | Region: eu-west-1
    Window:  2026-03-01 02:31:00 UTC  to  2026-03-01 02:34:00 UTC
    Note:    Effective ceiling = min(storage: 16,000, instance m6i.2xlarge: 13,333) = 13,333 IOPS

  RDS (postgres) (1 breach)
  Resource                                  Ceiling  Peak IOPS  Peak %   Duration
  ======================================== ======== ========== ======= ==========
  reporting-db / db-reporting-prod            3,500      3,412   97.5%      240s
    Account: 123456789012 | Region: us-east-1
    Window:  2026-03-01 08:45:00 UTC  to  2026-03-01 08:49:00 UTC
    Note:    Effective ceiling = min(storage: 6,000, instance db.r6g.large: 3,500) = 3,500 IOPS

  Total breaches found: 3
======================================================================

Notice how the RDS finding in the example above shows a ceiling of 3,500 even though the instance has 6,000 provisioned storage IOPS. The instance class is the binding constraint. A naive comparison against the storage ceiling would have shown 56.9 percent utilisation and produced no finding at all. The workload is actually at 97.5 percent of its effective capacity.

There are three things to read in each finding. The peak percentage tells you how close to the ceiling the resource actually got, the duration tells you how long it held there, and the window timestamps tell you exactly when to look in your application metrics and logs to find the correlated latency spike or error rate increase. The payments writer above at 99.9 percent for eight minutes during the middle of the day is a near-miss with a plausible transaction peak attached to it, and that is not a monitoring curiosity but a capacity planning item you act on this sprint.

5.1 What the Excel Workbook Contains

The workbook has two sheets. The first, IOPS Saturation Breaches, contains one row per breach event sorted by peak utilisation percentage descending, with columns for account ID and name, region, service type, resource ID and name, instance type, the effective IOPS ceiling used for comparison (always the minimum of the two applicable ceilings for EBS and RDS), the storage IOPS ceiling, the instance IOPS ceiling, the threshold IOPS at your specified percentage, peak observed IOPS, peak percentage of ceiling, longest breach duration in seconds, breach start and end timestamps in UTC, the CloudWatch metric used, and any relevant notes about how the ceiling was calculated. The second sheet, Summary by Service, groups findings by service type and shows total breach count, maximum peak percentage observed, and average breach duration, and this is the sheet to share with your engineering leads because it gives the distribution at a glance without requiring anyone to scroll through individual rows.

Row colours in the first sheet map to utilisation severity. Red means the resource hit or exceeded 100 percent of its ceiling, which is confirmed saturation and should be treated as an incident retrospective item. Orange covers 95 to 100 percent and represents a resource operating without meaningful headroom that needs attention before the next load event. Amber covers 90 to 95 percent, which is a structural warning worth adding to your next architecture review with the breach timestamps and duration included in the discussion. Green means a breach was detected but the peak stayed below your amber threshold, which typically indicates a noisy but not immediately dangerous resource.

6. What To Do With the Findings

Red rows require immediate action. Pull the breach timestamps and correlate them with your application latency and error rate metrics for the same period, because there is almost certainly an incident in your history that traces back to this resource even if it was attributed at the time to something else such as a downstream timeout, a connection pool exhaustion, or an upstream retry storm. The I/O saturation was almost certainly the initiating cause. Fix the instance class or provisioned IOPS and then open a retrospective item to understand why this was not caught at provisioning time.

Orange rows mean the resource does not have enough headroom to absorb any meaningful load increase. You have three options: increase the instance ceiling by upgrading the instance class; reduce the workload through query optimisation, connection pooling improvements, or read replica offloading; or accept the risk consciously and document it. What you should not do is leave it and assume it will be fine because it has not triggered an outage yet. Luck is not a capacity model.

Amber rows are planning cycle items rather than emergencies. Add them to your next architecture review and include the breach timestamps and duration in the discussion. If a resource is repeatedly hitting 90 percent during predictable traffic peaks the fix is straightforward, and the conversation with your team is easier when you have the data to show them what is actually happening rather than asking them to take it on faith.

If the script returns no findings, either your estate is genuinely healthy from an I/O capacity perspective, or your threshold is set too conservatively, or the lookback window did not capture your peak traffic period. Try a 72-hour lookback across your typical weekly peak if you are not seeing results you expect, because the absence of findings with a 24-hour window that does not cover your busiest period is not the same as confirmation that nothing is wrong.

7. Using Both Scripts Together

The mismatch auditor and this saturation detector answer different questions and you need both to have complete coverage. The mismatch auditor runs against your configuration data from the AWS APIs, does not need CloudWatch, and tells you where your architecture has provisioned more storage IOPS than your instance class can consume. It is a preventive tool, and you should run it on a schedule as part of your infrastructure compliance pipeline, treat findings with the same severity as security misconfigurations, and block deployments that introduce new critical-severity mismatches.

The saturation detector runs against observed operational data from CloudWatch and tells you which resources are actually approaching or hitting their ceiling under real workload conditions, regardless of whether a configuration mismatch exists. It is a detective tool, and you should run it on a schedule against your recent history, pipe its exit code into your alerting infrastructure, and use it as an input to your capacity planning cycle.

The scenarios they catch are different. You can have a mismatch with no saturation because the storage is over-provisioned but the workload is light and the instance ceiling is never approached. You can have saturation with no mismatch because the configuration is correct but the workload has grown to the point where even the right instance class is running out of room. And you can have both, which is the worst case: a resource where the effective ceiling is lower than you think because of a mismatch and where observed IOPS are already approaching that lower ceiling. Running both scripts closes both gaps and gives you the structural audit and the operational signal together, and between them they surface the class of failure that your existing tooling, including Trusted Advisor, CloudWatch alarms, Performance Insights, and your APM, will not name for you until it has already caused an outage.

8. The Script

cat > iops_saturation.py << 'PYEOF'
#!/usr/bin/env python3
"""
IOPS Saturation Monitor

Scans EBS volumes, RDS instances, and Aurora clusters and identifies
any that have sustained IOPS utilisation at or above a threshold
percentage of their capacity for longer than a specified duration.

Metric selection and ceiling calculation are automatic per service type:

  EBS                  VolumeReadOps + VolumeWriteOps (Count / 60s = IOPS)
                       Ceiling = min(provisioned_storage_iops, instance_type_iops_ceiling)

  RDS standard         ReadIOPS + WriteIOPS (rate metric, IOPS directly)
                       Ceiling = min(provisioned_storage_iops, instance_class_iops_ceiling)

  Aurora provisioned   ReadIOPS + WriteIOPS vs instance class ceiling only
                       (Aurora storage is uncapped; no dual-ceiling applies)

  Aurora Serverless v2 ReadIOPS + WriteIOPS vs 64000 IOPS at max ACUs

Usage:
  python iops_saturation.py --max-ops-pct 90 --max-ops-duration-secs 120 \
      --ou-id ou-xxxx-xxxxxxxx --regions eu-west-1 us-east-1

  python iops_saturation.py --max-ops-pct 95 --max-ops-duration-secs 300 \
      --lookback-hours 48 --accounts 123456789012 --regions eu-west-1

Required permissions on the assumed role:
  cloudwatch:GetMetricStatistics
  rds:DescribeDBInstances
  rds:DescribeDBClusters
  ec2:DescribeVolumes
  ec2:DescribeInstances
"""

import boto3
import csv
import sys
import argparse
import logging
from datetime import datetime, timezone, timedelta
from dataclasses import dataclass, asdict, field
from typing import Optional
from concurrent.futures import ThreadPoolExecutor, as_completed

try:
    import pandas as pd
    PANDAS_AVAILABLE = True
except ImportError:
    PANDAS_AVAILABLE = False

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
log = logging.getLogger(__name__)


EC2_IOPS_CEILING = {
    "t3.nano": 2085, "t3.micro": 2085, "t3.small": 2085, "t3.medium": 2085,
    "t3.large": 2085, "t3.xlarge": 2085, "t3.2xlarge": 2085,
    "t3a.nano": 2085, "t3a.micro": 2085, "t3a.small": 2085, "t3a.medium": 2085,
    "t3a.large": 2085, "t3a.xlarge": 2085, "t3a.2xlarge": 2085,
    "t4g.nano": 2085, "t4g.micro": 2085, "t4g.small": 2085, "t4g.medium": 2085,
    "t4g.large": 2085, "t4g.xlarge": 2085, "t4g.2xlarge": 2085,
    "m5.large": 3600, "m5.xlarge": 6000, "m5.2xlarge": 8333,
    "m5.4xlarge": 16667, "m5.8xlarge": 18750, "m5.12xlarge": 28750,
    "m5.16xlarge": 37500, "m5.24xlarge": 40000, "m5.metal": 40000,
    "m5a.large": 3600, "m5a.xlarge": 6000, "m5a.2xlarge": 8333,
    "m5a.4xlarge": 16667, "m5a.8xlarge": 18750, "m5a.12xlarge": 28750,
    "m5a.16xlarge": 37500, "m5a.24xlarge": 40000,
    "m6g.medium": 3500, "m6g.large": 3500, "m6g.xlarge": 7000,
    "m6g.2xlarge": 10000, "m6g.4xlarge": 20000, "m6g.8xlarge": 30000,
    "m6g.12xlarge": 40000, "m6g.16xlarge": 40000, "m6g.metal": 40000,
    "m6i.large": 6667, "m6i.xlarge": 10000, "m6i.2xlarge": 13333,
    "m6i.4xlarge": 20000, "m6i.8xlarge": 26667, "m6i.12xlarge": 40000,
    "m6i.16xlarge": 40000, "m6i.24xlarge": 40000, "m6i.32xlarge": 40000,
    "r5.large": 3600, "r5.xlarge": 6000, "r5.2xlarge": 8333,
    "r5.4xlarge": 16667, "r5.8xlarge": 18750, "r5.12xlarge": 28750,
    "r5.16xlarge": 37500, "r5.24xlarge": 40000, "r5.metal": 40000,
    "r6g.medium": 3500, "r6g.large": 3500, "r6g.xlarge": 7000,
    "r6g.2xlarge": 10000, "r6g.4xlarge": 20000, "r6g.8xlarge": 30000,
    "r6g.12xlarge": 40000, "r6g.16xlarge": 40000, "r6g.metal": 40000,
    "r6i.large": 6667, "r6i.xlarge": 10000, "r6i.2xlarge": 13333,
    "r6i.4xlarge": 20000, "r6i.8xlarge": 26667, "r6i.12xlarge": 40000,
    "r6i.16xlarge": 40000, "r6i.24xlarge": 40000, "r6i.32xlarge": 40000,
    "c5.large": 3600, "c5.xlarge": 6000, "c5.2xlarge": 8333,
    "c5.4xlarge": 16667, "c5.9xlarge": 20000, "c5.12xlarge": 28750,
    "c5.18xlarge": 37500, "c5.24xlarge": 40000, "c5.metal": 40000,
    "c6g.medium": 3500, "c6g.large": 3500, "c6g.xlarge": 7000,
    "c6g.2xlarge": 10000, "c6g.4xlarge": 20000, "c6g.8xlarge": 30000,
    "c6g.12xlarge": 40000, "c6g.16xlarge": 40000, "c6g.metal": 40000,
    "c6i.large": 6667, "c6i.xlarge": 10000, "c6i.2xlarge": 13333,
    "c6i.4xlarge": 20000, "c6i.8xlarge": 26667, "c6i.12xlarge": 40000,
    "c6i.16xlarge": 40000, "c6i.24xlarge": 40000, "c6i.32xlarge": 40000,
    "i3.large": 3000, "i3.xlarge": 6000, "i3.2xlarge": 12000,
    "i3.4xlarge": 16000, "i3.8xlarge": 32500, "i3.16xlarge": 65000,
    "i3en.large": 4750, "i3en.xlarge": 9500, "i3en.2xlarge": 19000,
    "i3en.3xlarge": 26125, "i3en.6xlarge": 52250, "i3en.12xlarge": 65000,
    "i3en.24xlarge": 65000,
}

RDS_IOPS_CEILING = {
    "db.t3.micro": 1536, "db.t3.small": 1536, "db.t3.medium": 1536,
    "db.t3.large": 2048, "db.t3.xlarge": 2048, "db.t3.2xlarge": 2048,
    "db.t4g.micro": 1700, "db.t4g.small": 1700, "db.t4g.medium": 1700,
    "db.t4g.large": 2000, "db.t4g.xlarge": 2000, "db.t4g.2xlarge": 2000,
    "db.m5.large": 3600, "db.m5.xlarge": 6000, "db.m5.2xlarge": 8333,
    "db.m5.4xlarge": 16667, "db.m5.8xlarge": 18750, "db.m5.12xlarge": 28750,
    "db.m5.16xlarge": 37500, "db.m5.24xlarge": 40000,
    "db.m6g.large": 3500, "db.m6g.xlarge": 7000, "db.m6g.2xlarge": 10000,
    "db.m6g.4xlarge": 20000, "db.m6g.8xlarge": 30000, "db.m6g.12xlarge": 40000,
    "db.m6g.16xlarge": 40000,
    "db.m6i.large": 6667, "db.m6i.xlarge": 10000, "db.m6i.2xlarge": 13333,
    "db.m6i.4xlarge": 20000, "db.m6i.8xlarge": 26667, "db.m6i.12xlarge": 40000,
    "db.m6i.16xlarge": 40000,
    "db.r5.large": 3600, "db.r5.xlarge": 6000, "db.r5.2xlarge": 8333,
    "db.r5.4xlarge": 16667, "db.r5.8xlarge": 18750, "db.r5.12xlarge": 28750,
    "db.r5.16xlarge": 37500, "db.r5.24xlarge": 40000,
    "db.r6g.large": 3500, "db.r6g.xlarge": 7000, "db.r6g.2xlarge": 10000,
    "db.r6g.4xlarge": 20000, "db.r6g.8xlarge": 30000, "db.r6g.12xlarge": 40000,
    "db.r6g.16xlarge": 40000,
    "db.r6i.large": 6667, "db.r6i.xlarge": 10000, "db.r6i.2xlarge": 13333,
    "db.r6i.4xlarge": 20000, "db.r6i.8xlarge": 26667, "db.r6i.12xlarge": 40000,
    "db.r6i.16xlarge": 40000,
    "db.x1e.xlarge": 3700, "db.x1e.2xlarge": 7400, "db.x1e.4xlarge": 14800,
    "db.x1e.8xlarge": 29600, "db.x1e.16xlarge": 40000, "db.x1e.32xlarge": 40000,
    "db.x2g.large": 3500, "db.x2g.xlarge": 7000, "db.x2g.2xlarge": 10000,
    "db.x2g.4xlarge": 20000, "db.x2g.8xlarge": 30000, "db.x2g.12xlarge": 40000,
    "db.x2g.16xlarge": 40000,
    "db.serverless": 64000,
}

AURORA_SERVERLESS_MAX_ACUS_DEFAULT = 128
PERIOD_SECONDS = 60


@dataclass
class SaturationBreach:
    account_id: str
    account_name: str
    region: str
    service_type: str
    resource_id: str
    resource_name: str
    instance_type: str
    iops_ceiling: int           # effective ceiling = min(storage, instance) for EBS/RDS
    storage_iops_ceiling: int   # provisioned storage IOPS
    instance_iops_ceiling: int  # instance class IOPS ceiling (0 = not applicable)
    threshold_pct: float
    threshold_iops: float
    max_observed_iops: float
    max_observed_pct: float
    longest_breach_seconds: int
    breach_start_utc: str
    breach_end_utc: str
    metric_used: str
    note: str = ""


def get_metric_datapoints(cw_client, namespace, metric_name, dimensions, start_time, end_time, stat="Sum"):
    resp = cw_client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=start_time,
        EndTime=end_time,
        Period=PERIOD_SECONDS,
        Statistics=[stat],
    )
    points = [(dp["Timestamp"], dp[stat]) for dp in resp.get("Datapoints", [])]
    points.sort(key=lambda x: x[0])
    return points


def find_sustained_breaches(combined_iops, threshold_iops, max_ops_duration_seconds, is_count_metric=False):
    if not combined_iops:
        return []

    timestamps = sorted(combined_iops.keys())
    breaches = []
    run_start = None
    run_end = None
    run_max = 0.0

    for ts in timestamps:
        raw = combined_iops[ts]
        iops = raw / PERIOD_SECONDS if is_count_metric else raw

        if iops >= threshold_iops:
            if run_start is None:
                run_start = ts
            run_end = ts
            run_max = max(run_max, iops)
        else:
            if run_start is not None:
                duration = (run_end - run_start).total_seconds() + PERIOD_SECONDS
                if duration >= max_ops_duration_seconds:
                    breaches.append((run_start, run_end, run_max, duration))
                run_start = None
                run_end = None
                run_max = 0.0

    if run_start is not None:
        duration = (run_end - run_start).total_seconds() + PERIOD_SECONDS
        if duration >= max_ops_duration_seconds:
            breaches.append((run_start, run_end, run_max, duration))

    return breaches


def build_volume_instance_map(ec2_client):
    """
    Returns a dict mapping volume_id -> (instance_id, instance_type)
    for all in-use volumes in the region.
    """
    vol_to_instance = {}
    instance_types = {}

    # Collect instance types first
    inst_paginator = ec2_client.get_paginator("describe_instances")
    for page in inst_paginator.paginate():
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                instance_types[inst["InstanceId"]] = inst.get("InstanceType", "unknown")

    # Map volumes to instances
    vol_paginator = ec2_client.get_paginator("describe_volumes")
    for page in vol_paginator.paginate(
        Filters=[{"Name": "status", "Values": ["in-use"]}]
    ):
        for vol in page["Volumes"]:
            for attachment in vol.get("Attachments", []):
                iid = attachment.get("InstanceId")
                if iid and iid in instance_types:
                    vol_to_instance[vol["VolumeId"]] = (iid, instance_types[iid])
                    break

    return vol_to_instance


def audit_ebs(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    ec2 = session.client("ec2", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)

    # Build volume -> instance type mapping once for the region
    try:
        vol_instance_map = build_volume_instance_map(ec2)
    except Exception as e:
        log.warning(f"Could not build volume/instance map in {account_id}/{region}: {e}")
        vol_instance_map = {}

    vol_paginator = ec2.get_paginator("describe_volumes")
    for page in vol_paginator.paginate(
        Filters=[
            {"Name": "volume-type", "Values": ["io1", "io2", "gp3"]},
            {"Name": "status", "Values": ["in-use"]},
        ]
    ):
        for vol in page["Volumes"]:
            provisioned_iops = vol.get("Iops", 0) or 0
            if provisioned_iops == 0:
                continue

            vol_id = vol["VolumeId"]
            tags = vol.get("Tags", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), vol_id)
            vol_type = vol.get("VolumeType", "unknown")

            # Determine effective ceiling: min(storage IOPS, instance IOPS ceiling)
            instance_type = "unknown"
            instance_iops_ceiling = 0
            ceiling_note = ""
            if vol_id in vol_instance_map:
                _, instance_type = vol_instance_map[vol_id]
                instance_iops_ceiling = EC2_IOPS_CEILING.get(instance_type, 0)

            if instance_iops_ceiling > 0:
                effective_ceiling = min(provisioned_iops, instance_iops_ceiling)
                binding = "storage" if provisioned_iops <= instance_iops_ceiling else "instance"
                ceiling_note = (
                    f"Effective ceiling = min(storage: {provisioned_iops:,}, "
                    f"instance {instance_type}: {instance_iops_ceiling:,}) = {effective_ceiling:,} IOPS "
                    f"[{binding} ceiling is binding]"
                )
            else:
                effective_ceiling = provisioned_iops
                ceiling_note = (
                    f"Storage ceiling used ({provisioned_iops:,} IOPS); "
                    f"instance type {instance_type!r} not in lookup table"
                )

            threshold_iops = effective_ceiling * (max_ops_pct / 100.0)

            try:
                pts_read = get_metric_datapoints(cw, "AWS/EBS", "VolumeReadOps",
                    [{"Name": "VolumeId", "Value": vol_id}], start_time, end_time)
                pts_write = get_metric_datapoints(cw, "AWS/EBS", "VolumeWriteOps",
                    [{"Name": "VolumeId", "Value": vol_id}], start_time, end_time)
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=True)
            except Exception as e:
                log.warning(f"CloudWatch error for EBS {vol_id} in {account_id}/{region}: {e}")
                continue

            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=f"EBS ({vol_type})", resource_id=vol_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=effective_ceiling,
                    storage_iops_ceiling=provisioned_iops,
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / effective_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used="AWS/EBS: VolumeReadOps + VolumeWriteOps (Sum / 60s = IOPS)",
                    note=ceiling_note,
                ))
    return findings


def audit_rds_standard(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    rds = session.client("rds", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)

    paginator = rds.get_paginator("describe_db_instances")
    for page in paginator.paginate():
        for db in page["DBInstances"]:
            engine = db.get("Engine", "")
            if "aurora" in engine.lower():
                continue
            provisioned_iops = db.get("Iops", 0) or 0
            if provisioned_iops == 0:
                continue
            status = db.get("DBInstanceStatus", "")
            if status not in ("available", "backing-up", "modifying"):
                continue

            db_id = db.get("DBInstanceIdentifier", "")
            instance_type = db.get("DBInstanceClass", "unknown")
            tags = db.get("TagList", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), db_id)

            # Determine effective ceiling: min(storage IOPS, instance class IOPS ceiling)
            instance_iops_ceiling = RDS_IOPS_CEILING.get(instance_type, 0)
            if instance_iops_ceiling > 0:
                effective_ceiling = min(provisioned_iops, instance_iops_ceiling)
                binding = "storage" if provisioned_iops <= instance_iops_ceiling else "instance"
                ceiling_note = (
                    f"Effective ceiling = min(storage: {provisioned_iops:,}, "
                    f"instance {instance_type}: {instance_iops_ceiling:,}) = {effective_ceiling:,} IOPS "
                    f"[{binding} ceiling is binding]"
                )
            else:
                effective_ceiling = provisioned_iops
                ceiling_note = (
                    f"Storage ceiling used ({provisioned_iops:,} IOPS); "
                    f"instance type {instance_type!r} not in lookup table"
                )

            threshold_iops = effective_ceiling * (max_ops_pct / 100.0)
            dims = [{"Name": "DBInstanceIdentifier", "Value": db_id}]

            try:
                pts_read = get_metric_datapoints(cw, "AWS/RDS", "ReadIOPS", dims, start_time, end_time, stat="Average")
                pts_write = get_metric_datapoints(cw, "AWS/RDS", "WriteIOPS", dims, start_time, end_time, stat="Average")
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=False)
            except Exception as e:
                log.warning(f"CloudWatch error for RDS {db_id} in {account_id}/{region}: {e}")
                continue

            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=f"RDS ({engine})", resource_id=db_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=effective_ceiling,
                    storage_iops_ceiling=provisioned_iops,
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / effective_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used="AWS/RDS: ReadIOPS + WriteIOPS (Average)",
                    note=ceiling_note,
                ))
    return findings


def audit_aurora(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    rds = session.client("rds", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)

    cluster_max_acus = {}
    try:
        cluster_paginator = rds.get_paginator("describe_db_clusters")
        for page in cluster_paginator.paginate():
            for cluster in page["DBClusters"]:
                if "aurora" not in cluster.get("Engine", "").lower():
                    continue
                sv2 = cluster.get("ServerlessV2ScalingConfiguration", {})
                if sv2:
                    cluster_max_acus[cluster["DBClusterIdentifier"]] = sv2.get(
                        "MaxCapacity", AURORA_SERVERLESS_MAX_ACUS_DEFAULT
                    )
    except Exception as e:
        log.warning(f"Could not list Aurora clusters in {account_id}/{region}: {e}")

    paginator = rds.get_paginator("describe_db_instances")
    for page in paginator.paginate():
        for db in page["DBInstances"]:
            engine = db.get("Engine", "")
            if "aurora" not in engine.lower():
                continue
            status = db.get("DBInstanceStatus", "")
            if status not in ("available", "backing-up", "modifying"):
                continue

            db_id = db.get("DBInstanceIdentifier", "")
            instance_type = db.get("DBInstanceClass", "unknown")
            cluster_id = db.get("DBClusterIdentifier", "")
            tags = db.get("TagList", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), db_id)
            is_serverless = instance_type == "db.serverless"
            dims = [{"Name": "DBInstanceIdentifier", "Value": db_id}]

            # Aurora: no dual-ceiling applies -- storage is uncapped at cluster layer
            if is_serverless:
                max_acus = cluster_max_acus.get(cluster_id, AURORA_SERVERLESS_MAX_ACUS_DEFAULT)
                # Approximation: ~64,000 IOPS at full ACU scale. The real
                # ceiling scales with the cluster's ACU allocation at runtime.
                iops_ceiling = 64000
                threshold_iops = iops_ceiling * (max_ops_pct / 100.0)
                service_type = "Aurora Serverless v2"
                metric_note = f"AWS/RDS: ReadIOPS + WriteIOPS (Average); ceiling = 64000 IOPS at max {max_acus} ACUs"
                note = "Serverless v2: IOPS ceiling is proportional to ACU allocation; storage layer is uncapped"
                instance_iops_ceiling = 0
            else:
                ceiling = RDS_IOPS_CEILING.get(instance_type)
                if ceiling is None:
                    log.warning(f"Unknown Aurora instance type {instance_type} ({db_id}) -- skipping")
                    continue
                iops_ceiling = ceiling
                threshold_iops = iops_ceiling * (max_ops_pct / 100.0)
                service_type = f"Aurora ({engine})"
                metric_note = f"AWS/RDS: ReadIOPS + WriteIOPS (Average); ceiling = instance class max ({instance_type}: {iops_ceiling:,} IOPS)"
                note = f"Aurora storage is uncapped; ceiling is the instance class processing limit ({instance_type}: {iops_ceiling:,} IOPS)"
                instance_iops_ceiling = iops_ceiling

            try:
                pts_read = get_metric_datapoints(cw, "AWS/RDS", "ReadIOPS", dims, start_time, end_time, stat="Average")
                pts_write = get_metric_datapoints(cw, "AWS/RDS", "WriteIOPS", dims, start_time, end_time, stat="Average")
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=False)
            except Exception as e:
                log.warning(f"CloudWatch error for Aurora {db_id} in {account_id}/{region}: {e}")
                continue

            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=service_type, resource_id=db_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=iops_ceiling,
                    storage_iops_ceiling=0,        # not applicable for Aurora
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / iops_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used=metric_note,
                    note=note,
                ))
    return findings


def list_accounts_in_ou(ou_id):
    org = boto3.client("organizations")
    accounts = []

    def recurse(parent_id):
        for child_type in ("ACCOUNT", "ORGANIZATIONAL_UNIT"):
            paginator = org.get_paginator("list_children")
            for page in paginator.paginate(ParentId=parent_id, ChildType=child_type):
                for child in page["Children"]:
                    if child_type == "ACCOUNT":
                        try:
                            resp = org.describe_account(AccountId=child["Id"])
                            acc = resp["Account"]
                            if acc["Status"] == "ACTIVE":
                                accounts.append({"id": acc["Id"], "name": acc["Name"]})
                        except Exception as e:
                            log.warning(f"Could not describe account {child['Id']}: {e}")
                    else:
                        recurse(child["Id"])

    recurse(ou_id)
    return accounts


def get_session(account_id, role_name):
    if role_name:
        sts = boto3.client("sts")
        role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
        creds = sts.assume_role(RoleArn=role_arn, RoleSessionName="IOPSSaturationScan")["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    return boto3.Session()


def audit_account(account, role_name, regions, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    account_id = account["id"]
    account_name = account["name"]
    all_findings = []
    log.info(f"Auditing account {account_id} ({account_name})")
    try:
        session = get_session(account_id, role_name)
    except Exception as e:
        log.error(f"Cannot assume role in {account_id}: {e}")
        return []

    for region in regions:
        log.info(f"  {account_id} scanning {region}...")
        try:
            all_findings.extend(audit_ebs(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
            all_findings.extend(audit_rds_standard(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
            all_findings.extend(audit_aurora(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
        except Exception as e:
            log.error(f"  Error in {account_id}/{region}: {e}")

    return all_findings


def write_csv(findings, path):
    if not findings:
        return
    fieldnames = list(asdict(findings[0]).keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for finding in findings:
            writer.writerow(asdict(finding))
    log.info(f"CSV written: {path}")


def write_excel(findings, path):
    if not PANDAS_AVAILABLE:
        log.warning("pandas/openpyxl not installed -- skipping Excel output. pip install pandas openpyxl")
        return
    if not findings:
        return

    from openpyxl.styles import PatternFill

    rows = [asdict(f) for f in findings]
    df = pd.DataFrame(rows)
    df = df.sort_values("max_observed_pct", ascending=False)

    def row_colour(pct):
        if pct >= 100:
            return "FF2222"
        elif pct >= 95:
            return "FF6600"
        elif pct >= 90:
            return "FFB300"
        return "90EE90"

    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df.to_excel(writer, index=False, sheet_name="IOPS Saturation Breaches")
        ws = writer.sheets["IOPS Saturation Breaches"]
        pct_col_idx = list(df.columns).index("max_observed_pct") + 1
        for row_idx in range(2, len(df) + 2):
            pct_val = ws.cell(row=row_idx, column=pct_col_idx).value or 0
            colour = row_colour(pct_val)
            for col_idx in range(1, len(df.columns) + 1):
                ws.cell(row=row_idx, column=col_idx).fill = PatternFill(
                    start_color=colour, end_color=colour, fill_type="solid"
                )
        summary = df.groupby("service_type").agg(
            breaches=("resource_id", "count"),
            max_pct_observed=("max_observed_pct", "max"),
            avg_breach_seconds=("longest_breach_seconds", "mean"),
        ).reset_index()
        summary.to_excel(writer, index=False, sheet_name="Summary by Service")

    log.info(f"Excel written: {path}")


def print_results(findings, max_ops_pct, max_ops_duration_seconds):
    print()
    print("=" * 70)
    print("IOPS SATURATION BREACH REPORT")
    print(f"Threshold : >= {max_ops_pct}% of effective IOPS ceiling")
    print(f"Duration  : >= {max_ops_duration_seconds}s sustained")
    print("=" * 70)

    if not findings:
        print("\nNo sustained IOPS saturation breaches found.")
        print("=" * 70)
        return

    findings_sorted = sorted(findings, key=lambda f: f.max_observed_pct, reverse=True)
    by_type = {}
    for f in findings_sorted:
        by_type.setdefault(f.service_type, []).append(f)

    for svc_type, items in sorted(by_type.items()):
        print(f"\n  {svc_type} ({len(items)} breach{'es' if len(items) != 1 else ''})")
        print(f"  {'Resource':<40} {'Ceiling':>8} {'Peak IOPS':>10} {'Peak %':>7} {'Duration':>10}")
        print(f"  {'=' * 40} {'=' * 8} {'=' * 10} {'=' * 7} {'=' * 10}")
        for f in items:
            print(f"  {f.resource_name:<40} {f.iops_ceiling:>8,} {f.max_observed_iops:>10,.0f} {f.max_observed_pct:>6.1f}% {f.longest_breach_seconds:>9}s")
            print(f"    Account: {f.account_id} | Region: {f.region}")
            print(f"    Window:  {f.breach_start_utc}  to  {f.breach_end_utc}")
            if f.note:
                print(f"    Note:    {f.note}")

    print(f"\n  Total breaches found: {len(findings)}")
    print("=" * 70)


def parse_args():
    parser = argparse.ArgumentParser(description="Scan EBS, RDS, and Aurora for sustained IOPS saturation.")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--ou-id", help="AWS Organizations OU ID")
    group.add_argument("--accounts", nargs="+", help="Specific AWS account IDs")
    parser.add_argument("--max-ops-pct", type=float, required=True,
        help="Percentage of IOPS ceiling that constitutes a breach (e.g. 90)")
    parser.add_argument("--max-ops-duration-secs", type=int, required=True,
        help="Minimum sustained breach duration in seconds to report (e.g. 120)")
    parser.add_argument("--lookback-hours", type=int, default=24,
        help="Hours of CloudWatch history to examine (default: 24)")
    parser.add_argument("--role-name", default="OrganizationAccountAccessRole")
    parser.add_argument("--regions", nargs="+", default=["eu-west-1", "us-east-1"])
    parser.add_argument("--workers", type=int, default=5)
    parser.add_argument("--output-prefix", default="iops_saturation_report")
    return parser.parse_args()


def main():
    args = parse_args()
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")

    log.info("IOPS Saturation Scan starting")
    log.info(f"  Threshold : >= {args.max_ops_pct}% for >= {args.max_ops_duration_secs}s")
    log.info(f"  Lookback  : {args.lookback_hours}h | Regions: {', '.join(args.regions)}")

    accounts = list_accounts_in_ou(args.ou_id) if args.ou_id else [{"id": a, "name": a} for a in args.accounts]
    if args.ou_id:
        log.info(f"  Found {len(accounts)} active accounts in OU {args.ou_id}")

    all_findings = []
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = {
            executor.submit(audit_account, acc, args.role_name, args.regions,
                args.max_ops_pct, args.max_ops_duration_secs, args.lookback_hours): acc
            for acc in accounts
        }
        for future in as_completed(futures):
            acc = futures[future]
            try:
                findings = future.result()
                all_findings.extend(findings)
                log.info(f"Account {acc['id']} complete: {len(findings)} breach(es)")
            except Exception as e:
                log.error(f"Account {acc['id']} failed: {e}")

    print_results(all_findings, args.max_ops_pct, args.max_ops_duration_secs)

    write_csv(all_findings, f"{args.output_prefix}_{timestamp}.csv")
    write_excel(all_findings, f"{args.output_prefix}_{timestamp}.xlsx")

    log.info("Scan complete.")
    return 1 if any(f.max_observed_pct >= 100.0 for f in all_findings) else 0


if __name__ == "__main__":
    sys.exit(main())
PYEOF

chmod +x iops_saturation.py
echo "iops_saturation.py created and marked executable"

Andrew Baker writes about cloud architecture, banking technology, and the gap between what systems are designed to do and what they actually do under load at andrewbaker.ninja.

The Silent Killer in Your AWS Architecture: IOPS Mismatches

Andrew Baker | March 2026

Companion article to: https://andrewbaker.ninja/2026/03/03/knowing-your-iops-are-broken-is-not-the-same-as-knowing-they-are-about-to-break/

The AWS Well-Architected Framework will not spot this. Your cloud governance team will not catch it, and neither will Trusted Advisor. Your SRE reviews will miss it, your APM will not flag it, Performance Insights will not surface it, and your FinOps team will have no idea it exists. It sits quietly in your estate for months, sometimes years, until the day load conditions expose it and you are four hours into a war room trying to explain to your CTO why a bank cannot process payments.

What is it? IOPS mismatches, where your storage can deliver far more throughput than your compute instance can actually consume.

1. What Is an IOPS Mismatch?

IOPS (Input/Output Operations Per Second) is the fundamental currency of storage performance. When you provision an RDS instance or attach an EBS volume to an EC2 host, you are making two independent decisions: how many IOPS the storage can deliver, and how many IOPS the instance can actually push through its I/O subsystem. These two numbers are governed by entirely separate limits. An io1 or io2 volume can be provisioned up to 64,000 IOPS and a gp3 volume up to 16,000, but your EC2 instance has its own maximum EBS IOPS ceiling, and for many instance types that ceiling is well below what your storage can theoretically deliver.

The result is a mismatch where you are paying for storage performance that your compute layer cannot reach, and worse, you are designing your system around an IOPS budget that does not actually exist at the instance level. Every dollar you spend on IOPS above that ceiling is money you will never get value from, and none of your tooling will tell you.
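
The arithmetic behind the mismatch is a single min(). A minimal sketch of the effective-ceiling calculation, where the ceiling figures are illustrative examples taken from this article rather than any official AWS table:

```python
# Effective IOPS is bounded by the lower of two independently provisioned
# limits. Ceiling figures below are illustrative examples, not an official
# AWS lookup table.
INSTANCE_IOPS_CEILING = {
    "r6g.large": 3_500,
    "m5.xlarge": 6_000,
}

def effective_iops_ceiling(provisioned_storage_iops: int, instance_type: str) -> int:
    """Return the IOPS the workload can actually consume through this instance."""
    instance_ceiling = INSTANCE_IOPS_CEILING.get(instance_type)
    if instance_ceiling is None:
        # Unknown instance type: fall back to the storage figure and flag it
        # for manual review rather than guessing.
        return provisioned_storage_iops
    return min(provisioned_storage_iops, instance_ceiling)

# 10,000 provisioned IOPS attached to an r6g.large: only 3,500 are reachable.
print(effective_iops_ceiling(10_000, "r6g.large"))  # 3500
```

Everything above the min() is spend with no performance return, and everything designed against the larger number is headroom that does not actually exist.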

2. The AWS Architecture Pattern That Makes This Worse

AWS makes it easy to provision storage and compute independently, and neither the console, the CLI, CloudFormation, nor Terraform will warn you when your storage IOPS ceiling exceeds your instance throughput ceiling. The typical path to a mismatch involves an engineer provisioning an r6g.large with a 3,500 IOPS maximum EBS throughput because the application is memory intensive, while a separate decision by a different team provisions the attached RDS storage with 10,000 IOPS because the database team uses a standard enterprise template. The two numbers are never compared against each other and the system goes to production.

At any institution operating at scale across hundreds of accounts and workloads, the opportunity for this kind of drift is significant. An OU spanning multiple teams, each with their own provisioning habits and templates, will accumulate mismatches over time through the natural entropy of infrastructure management.

Some of the most frequent mismatches appear in exactly the instance types engineers reach for most often. The t3 family is particularly dangerous because it is the default choice for cost-conscious workloads, yet every size from nano to 2xlarge is capped at 2,085 IOPS while a gp3 volume defaults to 3,000 and io1 storage attached by a database team using an enterprise template can easily reach 6,000 or more. The r6g.large caps at 3,500 IOPS and is routinely paired with RDS storage provisioned at double that. The m5.xlarge at 6,000 IOPS is frequently attached to io2 volumes at 10,000. On the RDS side, a db.t3.medium is limited to 1,536 IOPS but is commonly found with io1 storage provisioned at 3,000, and db.r6g.large instances capped at 3,500 are routinely given storage provisioned at 8,000 or more.

3. Why This Causes Outages and Not Just Waste

Paying for IOPS you cannot use is expensive but survivable. The dangerous scenario is what happens at the boundary when load actually increases.

Consider a team that knows their application spikes at just under 30,000 IOPS. They provision storage at 40,000 to give themselves headroom, which is exactly what good engineering practice tells them to do. Everything looks fine in testing, everything looks fine in monitoring, and the system runs without complaint. Then load spikes, the I/O subsystem begins to queue, upstream systems scale out in response, a flood of autoscaling connections exhausts the database connection pool, and services start timing out in a rapid cascading outage. The application is down, and now your team is trying to correlate thread exhaustion in your EKS clusters back to a blind configuration mismatch that no automated tool flagged at any point along the way.

That cascade is the real danger. Under normal conditions your application runs fine because actual I/O demand is modest and the mismatch is invisible. But when demand approaches the instance throughput ceiling, requests begin to queue at the virtualization layer and what was a 1ms storage read becomes 50ms and then 200ms as connection pools back up. In Aurora Serverless v2 or RDS PostgreSQL, connection pool exhaustion under I/O pressure is a documented failure mode where the database appears healthy and is accepting connections but query execution times blow through application timeouts, upstream services retry, retries compound the I/O pressure, and you are now in a feedback loop. On smaller instance types that operate on a credit model for I/O throughput, a sustained I/O spike drains the burst bucket and when it empties throughput drops to the baseline, which is often a fraction of what the application expects, even though the storage volume itself could still deliver perfectly well.
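
The burst-credit arithmetic behind that last failure mode is easy to sketch. Assuming a gp2-style model where the bucket holds 5.4 million credits, refills at the volume's baseline rate, and spends one credit per I/O (these are the documented gp2 defaults, used here purely for illustration):

```python
# Time for a sustained spike to drain an I/O burst bucket.
# gp2-style model: one credit spent per I/O, refill at the baseline rate.
def seconds_until_bucket_empty(bucket_credits: float,
                               baseline_iops: float,
                               actual_iops: float) -> float:
    drain_rate = actual_iops - baseline_iops  # net credits spent per second
    if drain_rate <= 0:
        return float("inf")  # at or below baseline, the bucket never empties
    return bucket_credits / drain_rate

# 100 GiB gp2 volume: 300 IOPS baseline (3 IOPS/GiB), full 5.4M-credit
# bucket, workload bursting at the 3,000 IOPS burst ceiling.
secs = seconds_until_bucket_empty(5_400_000, 300, 3_000)
print(f"bucket empties after {secs / 60:.0f} minutes of sustained burst")  # ~33 minutes
```

When the bucket hits zero, throughput collapses to the 300 IOPS baseline, a tenth of what the application has been enjoying, which is exactly the cliff described above.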

The most insidious part is that CloudWatch metrics can hide this entirely. VolumeReadOps and VolumeWriteOps report what the volume delivered while EBSReadOps on the instance side reports what the instance consumed, and under a mismatch these metrics appear healthy right up until they do not because the bottleneck lives in the virtualized I/O path between instance and volume rather than in either component independently. Performance Insights, which most teams trust to surface database I/O problems, operates above this layer entirely and will show you wait events and query latency but will not show you that the ceiling your instance is hitting is a hardware throughput limit rather than a query or index problem.
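
Whichever side you pull the counters from, the raw CloudWatch values are counts per period rather than rates, so the first step of any saturation check is converting Sums into IOPS. A sketch of that conversion, mirroring the Sum / 60s approach the saturation script uses, with made-up datapoints:

```python
PERIOD_SECONDS = 60  # CloudWatch period the counters were fetched at

def combined_iops(read_counts: dict, write_counts: dict) -> dict:
    """Merge read + write op counts (Sum per period) into total IOPS per timestamp."""
    out = {}
    for series in (read_counts, write_counts):
        for ts, count in series.items():
            out[ts] = out.get(ts, 0.0) + count / PERIOD_SECONDS
    return out

# Illustrative 60-second datapoints for one volume.
reads  = {"12:00": 120_000, "12:01": 180_000}
writes = {"12:00":  60_000, "12:01":  24_000}
print(combined_iops(reads, writes))  # {'12:00': 3000.0, '12:01': 3400.0}
```

Against a 3,500 IOPS effective ceiling, that 12:01 datapoint is a 97 percent reading that no default dashboard will colour red.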

4. The FinOps Trap: When Cost Optimisation Creates the Outage

There is a particularly cruel variant of this failure that deserves its own warning. Trusted Advisor and your FinOps team will periodically flag RDS instances with low CPU utilisation as oversized and recommend downsizing to save cost. The recommendation is correct by every metric they can see: CPU is idle, the instance looks like waste, and the resize gets approved and executed. What nobody checks is whether the workload is I/O bound rather than CPU bound, and whether the smaller instance class sits below the provisioned storage IOPS ceiling. The downsize goes through, the system runs fine for weeks because load is modest, and then three months later, under a transaction spike, you are in an outage with no obvious cause, because nobody remembers a routine right-sizing exercise from last quarter. The CloudWatch CPU graphs look fine throughout. The storage metrics look fine. The failure lives entirely in the gap between what the instance can consume and what the storage was provisioned to deliver, and that gap was created by a cost optimisation that was correct by every measure anyone was watching, but deadly by the metrics they couldn't see.
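
A right-sizing pipeline can defend against this trap with one extra comparison before the resize is approved. A sketch, where the ceiling table and the 20 percent headroom factor are illustrative assumptions rather than official figures:

```python
# Guard for right-sizing: refuse a downsize whose instance IOPS ceiling sits
# below either the provisioned storage IOPS or the observed I/O peak.
# Ceiling figures are illustrative examples, not an official AWS table.
RDS_IOPS_CEILING = {
    "db.r6g.2xlarge": 13_333,
    "db.r6g.large": 3_500,
}

def downsize_is_safe(target_class: str,
                     provisioned_storage_iops: int,
                     peak_observed_iops: float,
                     headroom: float = 1.2) -> bool:
    ceiling = RDS_IOPS_CEILING.get(target_class)
    if ceiling is None:
        return False  # unknown class: fail closed and investigate first
    return (ceiling >= provisioned_storage_iops
            and ceiling >= peak_observed_iops * headroom)

# CPU is idle, so the downsize looks safe -- but storage is provisioned at
# 8,000 IOPS and the lunchtime I/O peak is 3,400.
print(downsize_is_safe("db.r6g.large", 8_000, 3_400))  # False
```

The point of the second condition is that a downsize should be judged against observed I/O peaks with headroom, not against the CPU graphs the recommendation was built from.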

5. Detection Requires Automation

Once you have identified a mismatch the fix is straightforward: upgrade the instance to a class whose throughput ceiling exceeds the provisioned storage IOPS, or reduce the provisioned IOPS on the storage to match what the instance can actually consume. The hard part is finding these mismatches across a large estate before they manifest as incidents, because you cannot do this manually and you cannot rely on individual teams to self-report. It needs to be a scheduled job that generates a report and produces findings you treat with the same severity as any security or cost compliance alert.

The script below queries every account in the OU, enumerates all RDS instances and EBS volumes, compares provisioned storage IOPS against the instance throughput ceiling for each, and flags every mismatch with a severity classification and an estimated monthly waste figure. Findings are classified as CRITICAL where provisioned IOPS exceed the ceiling by three times or more and require immediate remediation, HIGH at two to three times over for remediation within the current sprint, MEDIUM at 1.5 to two times for the next planning cycle, and LOW for anything above the ceiling but within the 1 to 1.5x range. Output is a colour-coded Excel workbook with a findings sheet and summary tab alongside a flat CSV, and the script exits with code 1 if any CRITICAL findings are present so it can be wired into a CI pipeline for scheduled runs.
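
Those severity bands reduce to a single ratio of provisioned storage IOPS over the instance ceiling. A minimal sketch of the classification logic:

```python
def classify_mismatch(provisioned_iops: int, instance_ceiling: int) -> str:
    """Severity bands as described above: ratio of storage IOPS to instance ceiling."""
    if instance_ceiling <= 0:
        return "UNKNOWN"   # no ceiling data for this instance class
    if provisioned_iops <= instance_ceiling:
        return "OK"
    ratio = provisioned_iops / instance_ceiling
    if ratio >= 3.0:
        return "CRITICAL"  # remediate immediately
    if ratio >= 2.0:
        return "HIGH"      # fix within the current sprint
    if ratio >= 1.5:
        return "MEDIUM"    # next planning cycle
    return "LOW"           # above the ceiling but under 1.5x

print(classify_mismatch(10_000, 3_500))  # HIGH (about 2.9x over)
```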

To run it you will need boto3, pandas, and openpyxl installed, and your caller identity needs organizations:ListChildren, organizations:DescribeAccount, and sts:AssumeRole into each target account, while the assumed role needs rds:DescribeDBInstances, ec2:DescribeVolumes, and ec2:DescribeInstances.

6. The Broader Lesson

Every major cloud outage I have seen, and this spans AWS and the broader industry, has a mundane configuration decision somewhere near its root cause. It is not a sophisticated attack and not a novel failure mode, it is a number that should have been checked against another number at provisioning time and was not. IOPS mismatches are not glamorous, they do not appear on architectural review templates, and as we have seen they slip past every layer of automated governance you have in place. But in any environment where uptime matters they translate directly into transaction failures, customer impact, and regulatory exposure, which means they deserve the same treatment as any other class of compliance finding. Audit your estate, fix the mismatches, and make it a scheduled job rather than a post-incident action item.

Andrew Baker writes about cloud architecture, banking technology, and the gap between what systems are designed to do and what they actually do under load at andrewbaker.ninja.

7. The Diagnostic Illusion: Why IOPS Mismatches Hide During Outages

One of the reasons IOPS mismatches are so destructive is that they do not behave like normal resource exhaustion. Most infrastructure limits are obvious when they fail.

When CPU reaches 100 percent, everyone can see it. Dashboards turn red. Engineers immediately understand what the bottleneck is. CPU exhaustion is visible and honest. IOPS bottlenecks rarely behave this way. Instead, they fail indirectly.

A storage system can begin throttling I/O when it reaches its provisioned limits. Requests do not immediately fail. They simply get slower. Latency increases. Queues begin to build. This is where the diagnostic illusion appears.

Imagine a system where the database suddenly shows 5,000 active connections, application latency spikes, and the EKS node group begins scaling out. From the outside it looks like a traffic surge or an application scaling problem.

But the real cause might be that the storage layer has quietly hit its true IOPS ceiling, even though dashboards show it running at only 76 percent of the capacity you think it has.

This happens because storage performance is constrained by two limits at the same time: the volume's provisioned IOPS and the instance's throughput ceiling. The real limit is always the lowest constraint in the chain, not the number you provisioned against the storage volume.

Once that invisible boundary is crossed, storage latency increases and the entire stack begins compensating. Pods scale horizontally. Applications retry queries. Connection pools expand. Databases appear overloaded. But the failure did not start there. It started when storage quietly hit an IOPS boundary that nobody recognised.

An exhausted CPU is easy to diagnose. A database with 5,000 connections clearly looks like a problem. But explaining that an entire EKS fleet scaled out because storage hit an invisible IOPS ceiling at 76 percent utilisation is a far harder story to tell during a production outage. And that is exactly why IOPS mismatches are such dangerous architectural traps.

WordPress Totally Free Backup and Restore: CloudScale Backup Plugin – Does Exactly What It Says

I've been running this blog on WordPress for years, and the backup situation has always quietly bothered me. The popular backup plugins either charge a monthly fee, cap you on storage, phone home to an external service, or do all three. I wanted something simple: a plugin that makes a zip file of my site, stores it locally, runs on a schedule, and optionally syncs to S3. No accounts, no subscriptions, no mandatory cloud storage. The final straw was when my previous backup solution let me back up, but when I needed to restore, it refused unless I paid to upgrade to the premium version.

So I built one. It's called CloudScale Free Backup and Restore. It's completely free, and you can install it right now.

Github repo:

https://github.com/andrewbakercloudscale/cloudscale-backup

Download: cloudscale-backup.zip

CloudScale WordPress Backup Admin Screens:

CloudScale Backup plugin interface showing WordPress backup and restore options
CloudScale Backup plugin interface showing WordPress backup settings and options

1. What It Backs Up

The plugin backs up any combination of the following:

Core

  • The WordPress database: all posts, pages, settings, users, and comments
  • Media uploads (wp-content/uploads)
  • Plugins folder (wp-content/plugins)
  • Themes folder (wp-content/themes)

Other (shown only if present on your server)

  • Must-use plugins (wp-content/mu-plugins)
  • Languages and translation files (wp-content/languages)
  • wp-content drop-in files (object-cache.php, db.php, advanced-cache.php)
  • .htaccess: your Apache rewrite rules and custom security directives
  • wp-config.php: flagged with a credentials warning, unchecked by default

Each backup is a single zip file. Inside it you'll find a database.sql dump, the selected folders each in their own subdirectory, and a backup-meta.json file that records the plugin version, WordPress version, site URL, table prefix, and exactly what was included. That metadata matters when restoring to a different server.

2. How the Backup Works Internally

Database dump

The plugin detects whether mysqldump is available on your server. If it is, it uses that: fast, clean with large databases, and a proper SQL dump including all CREATE TABLE and INSERT statements. If mysqldump isn't available (common on shared hosting), it falls back to a pure PHP implementation that streams through every table and writes compatible SQL. Either way you get a database.sql that can be imported with any standard MySQL client.
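The detect-then-fall-back logic can be sketched as follows. The real plugin is PHP; this is a Python illustration of the decision, with the `which` lookup injectable so the branch can be tested without a real MySQL server:

```python
import shutil
import subprocess

def dump_database(host, user, password, dbname, out_path, which=shutil.which):
    """Prefer the mysqldump binary when present; otherwise signal that the
    caller should use the in-process fallback that streams each table.

    Returns the strategy chosen ("mysqldump" or "php-fallback").
    """
    binary = which("mysqldump")
    if binary is None:
        return "php-fallback"  # stream tables and write compatible SQL in-process
    subprocess.run(
        [binary, "-h", host, "-u", user, f"-p{password}", dbname,
         "--result-file", out_path],
        check=True,
    )
    return "mysqldump"
```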

File backup

Files are added to the zip using PHP's ZipArchive extension, available on virtually every PHP installation. The plugin walks each selected directory recursively and adds every file. There is no timeout risk because it does not use shell commands for file backup; it streams directly in PHP.

Backup naming and location

Backups are stored in wp-content/cloudscale-backups/. On first run the plugin creates this directory and drops an .htaccess file inside it containing Deny from all, which prevents direct web access. Backup filenames use a short format encoding the content type; for example bkup_f1.zip is a full backup (database, media, plugins, themes), bkup_d2.zip is database only, and bkup_dm3.zip is database plus media. The sequence number increments and never overwrites an existing file.
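The filename code can be unpacked mechanically. A sketch, decoding only the letters documented above (f, d, m); any other letters the plugin may use are passed through verbatim rather than guessed at:

```python
import re

# Component letters shown in the article's examples.
KNOWN = {"f": "full", "d": "database", "m": "media"}

def parse_backup_name(filename):
    """Split a bkup_<code><seq>.zip name into content parts and sequence number."""
    m = re.fullmatch(r"bkup_([a-z]+)(\d+)\.zip", filename)
    if not m:
        raise ValueError(f"unrecognised backup filename: {filename}")
    code, seq = m.group(1), int(m.group(2))
    parts = [KNOWN.get(ch, ch) for ch in code]
    return parts, seq
```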

Scheduled backups

Scheduled backups use WordPress Cron. You pick which days of the week and what hour (server time), and the plugin registers a recurring event. WordPress Cron fires when someone visits your site, so on very low traffic sites the backup may run a few minutes after the scheduled hour rather than exactly on it. If you need exact timing, add a real server cron job that hits wp-cron.php directly.
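For the exact-timing case, the usual pattern is to disable the traffic-triggered scheduler and drive wp-cron.php from the system crontab. The hostname and five-minute interval below are placeholders:

```
# Optional: stop WordPress firing cron on page visits (in wp-config.php):
#   define('DISABLE_WP_CRON', true);

# Crontab entry (crontab -e as the web server user): poll wp-cron.php
# every five minutes so scheduled backups fire close to their set hour.
*/5 * * * * curl -s "https://example.com/wp-cron.php?doing_wp_cron" > /dev/null 2>&1
```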

3. The Retention System

The retention setting controls how many backups to keep. Every time a backup completes, the plugin counts current backups and deletes the oldest ones beyond your limit. The default is 10.

The plugin shows a live storage estimate: it takes the size of your most recent backup, multiplies by your retention count, and compares that against current free disk space. A traffic-light indicator (green, amber, red) tells you at a glance whether you're comfortable, getting tight, or at risk of filling the disk. This updates live as you change the retention number.
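The retention and estimate logic above amounts to a few lines. A sketch; the amber and red thresholds here are my own illustration, as the plugin's actual cut-offs are not documented in this article:

```python
def storage_light(last_backup_bytes, retention_count, free_disk_bytes):
    """Project the retained-backup footprint and map it to a traffic light."""
    if free_disk_bytes <= 0:
        return "red"
    usage = (last_backup_bytes * retention_count) / free_disk_bytes
    if usage < 0.5:
        return "green"  # comfortable headroom
    if usage < 0.8:
        return "amber"  # getting tight
    return "red"        # at risk of filling the disk

def prune_oldest(backups, retention_count):
    """Return filenames to delete: everything beyond the newest N.

    `backups` is a list of (filename, created_timestamp) tuples.
    """
    ordered = sorted(backups, key=lambda b: b[1], reverse=True)
    return [name for name, _ in ordered[retention_count:]]
```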

4. S3 Remote Backup

Local backups are only half the story. If your server dies entirely, local backups die with it. The plugin addresses this with built-in S3 remote backup. After every backup, scheduled or manual, it automatically uploads the zip to an S3 bucket of your choosing using the AWS CLI. The local copy is always kept.

How it works

The plugin runs aws s3 cp after each backup completes. There are no PHP SDKs, no Composer dependencies, no AWS API keys stored in the WordPress database. It relies entirely on the AWS CLI binary installed on the server and whatever credential chain that CLI is configured to use. If the sync fails for any reason, the local backup is unaffected and the error is logged and displayed.

Requirements

The AWS CLI must be installed on the server. The plugin detects it automatically across the common install locations including /usr/local/bin/aws, /usr/bin/aws, and the AWS v2 installer default at /usr/local/aws-cli/v2/current/bin/aws. The S3 card shows a green tick with the detected version if found.

Install on Ubuntu/Debian (x86_64):

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
unzip awscliv2.zip && sudo ./aws/install

Install on Ubuntu/Debian (ARM64: Graviton, Raspberry Pi, Apple Silicon VMs):

curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o awscliv2.zip
unzip awscliv2.zip && sudo ./aws/install

Install on Amazon Linux / EC2 (x86_64 and ARM):

sudo yum install -y aws-cli

Credentials

The plugin inherits whatever credentials the AWS CLI is configured with under the web server user. Three options in order of preference:

IAM instance role: attach an IAM role to your EC2 or Lightsail instance with the policy below. Zero configuration, no keys stored anywhere, automatically rotated. This is the right approach if you're on AWS infrastructure.

Credentials file: run aws configure as the web server user (www-data on Ubuntu, apache on Amazon Linux). This writes ~/.aws/credentials which the CLI picks up automatically.

Environment variables: set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION in the server environment.

Minimum IAM policy

Replace YOUR-BUCKET-NAME with your actual bucket:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::YOUR-BUCKET-NAME",
      "arn:aws:s3:::YOUR-BUCKET-NAME/*"
    ]
  }]
}

PutObject is what actually uploads the backups. GetObject and ListBucket are useful if you ever retrieve a backup directly from S3.

Configuration and testing

In the S3 Remote Backup card, set your bucket name and key prefix (defaults to backups/), then hit Save S3 Settings. Use the Test Connection button to verify everything before your next scheduled backup runs: it writes a small test file to the bucket and reports either a success confirmation or the exact AWS CLI error, which tells you specifically whether the problem is a missing CLI, a credentials gap, a wrong bucket name, or an IAM permissions issue.

Once configured, every backup syncs automatically. The result appears in the backup progress panel: green on success with the destination path, red with the raw error message if something goes wrong.

Create your S3 bucket in the same AWS region as your server where possible. Cross-region uploads work fine but add latency and inter-region transfer costs.

5. Installing the Plugin

Step 1. Download the plugin zip from:

https://andrewninjawordpress.s3.af-south-1.amazonaws.com/cloudscale-backup.zip

Step 2. In WordPress admin, go to Plugins → Add New → Upload Plugin. Choose the downloaded zip and click Install Now.

WordPress plugin installation interface showing the CloudScale Backup upload step

Step 3. Activate the plugin.

Step 4. Go to Tools → CloudScale Backup and Restore.

No API keys, no account creation, no configuration wizard.

6. First Things to Configure

Set your schedule. Under the Backup Schedule card, enable automatic backups and tick the days you want it to run. The default is Monday, Wednesday, and Friday at 03:00 server time. Hit Save Schedule.

Set your retention. Under Retention and Storage, decide how many backups to keep. Watch the live storage estimate; if you're on a small instance size with limited disk, keep this number modest. Ten backups give you nearly two weeks of daily coverage or about a month of three-days-a-week coverage.

Configure S3 if you want remote copies. See section 4 above. The Test Connection button makes verification fast.

Run your first manual backup. Under Manual Backup, tick the components you want and click Run Backup Now. On most sites the first full backup takes between 30 seconds and a few minutes depending on media library size.

7. Restoring a Backup

The Backup History card lists all your backups with filename, size, creation date, age, and type. For each backup you have three actions along the left: Download, Restore DB, and Delete.

Download streams the zip directly to your browser; useful for keeping an offsite copy or moving to a new server.

Restore DB unpacks and restores just the database from the included SQL dump, using native mysql CLI if available, otherwise PHP. The plugin reads backup-meta.json to verify compatibility before proceeding.

The restore process puts the site into maintenance mode for the duration and brings it back up automatically when done.

8. A Note on wp-config.php Backups

The plugin can optionally back up wp-config.php, but it is unchecked by default and flagged with a warning. This file contains your database hostname, username, password, and secret keys. If you include it in a backup and that backup ends up somewhere it shouldn't, you have handed someone the keys to your database.

The case for including it: a wp-config.php backup is extremely valuable for full disaster recovery onto a blank server, because you don't have to reconstruct your configuration from memory. The case against: routine backups don't need it, since the file rarely changes.

My recommendation: include it in occasional deliberate full disaster recovery backups that you store securely, and leave it unchecked in your daily automated backups.

9. What It Deliberately Doesnโ€™t Do

No multisite support. The plugin is designed for standard single-site WordPress installations.

No incremental backups. Every backup is a full backup of the selected components. This keeps the code simple and the restore process reliable, at the cost of larger backup files.

No external service dependency. There is no cloud account, no licence check, no telemetry. The plugin does not phone home. S3 sync is entirely optional and uses your own bucket under your own AWS account.

10. Open Source

The plugin is open source. The code is all standard WordPress plugin PHP, a single main file plus a CSS and JS asset. No build steps, no Node dependencies, no compilation required. The entire plugin is designed to be auditable by anyone who can read PHP.

Go build something.

Download: cloudscale-backup.zip

Scaling Aurora Serverless v2 PostgreSQL: A Production Deep Dive

Aurora Serverless v2 promises the dream of a database that automatically scales to meet demand, freeing engineering teams from capacity planning. The reality is considerably more nuanced. After running Serverless v2 PostgreSQL clusters under production workloads, I have encountered enough sharp edges to fill a blog post. This is that post.

The topics covered here span the entire lifecycle of running Serverless v2 at scale: from choosing the right minimum ACU to avoid crippling your buffer cache, to managing connection pool saturation across hundreds of application pods, to understanding why Aurora’s aggressive scale down behaviour can cause more production incidents than it prevents. I also cover the connection storm failure pattern that is responsible for the majority of Aurora Serverless v2 production outages, the AWS Advanced JDBC Wrapper, prepared statement planning pitfalls with partitioned tables, reader versus writer traffic routing, vacuuming at scale, and how to build alerting that actually catches scaling issues before your customers do.

1 The Minimum ACU Trap: Buffer Cache, Memory, and Cold Start Pain

The most consequential decision you will make with Aurora Serverless v2 is choosing your minimum ACU. It is also the decision most teams get wrong.

Each ACU provides approximately 2 GiB of memory with corresponding CPU and networking. When you set a minimum of 0.5 ACUs, you are telling Aurora that 1 GiB of memory is an acceptable baseline for your database. For anything beyond a development environment, this is almost certainly too low.

1.1 Why Low Minimum ACUs Destroy Performance

PostgreSQL relies heavily on its buffer cache (shared_buffers) to keep frequently accessed data in memory. When Aurora scales down to a low ACU count, the buffer cache shrinks proportionally. Data that was previously served from memory is evicted. When load returns, every query that needs that data must now go to Aurora’s storage layer. The BufferCacheHitRatio metric drops and query latencies spike, sometimes by orders of magnitude.

This is not just a performance concern. It is a cascading failure waiting to happen. Here is the sequence that plays out in production:

Traffic drops overnight or during a quiet period. Aurora scales down to its minimum ACU. The buffer cache is flushed. Morning traffic arrives. Every query hits storage. Latencies increase. Application connection pools begin to saturate because queries that used to complete in 2 milliseconds now take 50 milliseconds. The database begins scaling up, but it takes time for the buffer cache to warm. During this warming period, you are effectively running a cold database under production load.

1.2 The Scaling Rate Problem

The rate at which Aurora Serverless v2 can scale up depends on its current capacity. An instance at 2 ACUs scales up more slowly than one at 16 ACUs. AWS documentation states this explicitly: the higher the current capacity, the faster it can scale up. If your minimum is set too low and you experience a sudden burst, the database cannot scale fast enough to meet demand. It scales in 0.5 ACU increments and the increment size grows with current capacity, meaning scaling from 0.5 to 16 ACUs takes meaningfully longer than scaling from 8 to 16 ACUs.

1.3 Practical Minimum ACU Guidance

For production workloads, your minimum ACU should be set high enough that your working dataset fits comfortably in the buffer cache at minimum capacity. Monitor the BufferCacheHitRatio metric over a representative period. If it drops below 99% at any point during normal operations, your minimum is too low.

A practical approach is to look at your provisioned instance metrics before migrating to serverless. If your provisioned r6g.xlarge (32 GiB RAM) cluster consistently uses 20 GiB of buffer cache, your minimum ACU should be at least 10 ACUs (20 GiB / 2 GiB per ACU) to maintain comparable buffer cache performance.
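The arithmetic is worth making explicit. A sketch, using the article's 2 GiB-per-ACU figure and rounding up to Serverless v2's half-ACU granularity; the optional headroom multiplier is my own addition:

```python
import math

GIB_PER_ACU = 2  # each ACU provides approximately 2 GiB of memory

def min_acu_for_buffer_cache(buffer_cache_gib, headroom=1.0):
    """Minimum ACU needed to keep an observed buffer cache resident.

    headroom > 1.0 reserves extra memory on top of the observed cache.
    Rounds up to the nearest 0.5 ACU.
    """
    acus = buffer_cache_gib * headroom / GIB_PER_ACU
    return math.ceil(acus * 2) / 2

# Worked example from the text: 20 GiB of buffer cache on a provisioned
# r6g.xlarge implies a minimum of 10 ACUs.
```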

AWS recommends specific minimums for certain features:

  • 2 ACUs minimum if using Database Insights (Performance Insights)
  • 8 ACUs minimum for Global Databases
  • 1 ACU minimum for high connection workloads (max_connections is capped at 2,000 when minimum is set to 0 or 0.5 ACUs)

My recommendation: start with a minimum ACU that represents your steady state workload minus 20%, not the absolute lowest possible. The cost savings from running at 0.5 ACUs during quiet periods are almost never worth the performance degradation and cold start penalty when traffic returns.

2 Port Saturation and Connection Pool Sizing

Connection management is the second most common production issue with Aurora Serverless v2, particularly in microservices architectures where dozens or hundreds of pods each maintain their own connection pools.

2.1 Understanding Port Saturation

Every TCP connection to your Aurora instance consumes an ephemeral port. On the client side, the Linux kernel assigns a source port from the ephemeral range (typically 32768 to 60999, giving you approximately 28,000 ports per source IP). On a single application host or Kubernetes node, if you run many pods that each maintain large connection pools to the same Aurora endpoint, you can exhaust the ephemeral port range on that host.

The symptoms of port saturation are subtle and often misdiagnosed. Connection attempts start timing out intermittently. The database itself shows no signs of stress. CloudWatch metrics for the Aurora instance look fine. The problem is entirely on the client side: the operating system cannot allocate new source ports.

In Kubernetes environments this is particularly treacherous because many pods share the same node’s network namespace (unless using host networking). A single node running 20 pods, each with a connection pool of 20, means 400 connections from a single IP address. Add connection churn from pool recycling and you can approach the ephemeral port limit.
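A back-of-envelope check makes the node-level risk concrete. A sketch; the churn factor padding for TIME_WAIT sockets is my own rough guess, not a kernel constant:

```python
def ports_at_risk(pods_per_node, pool_size_per_pod, churn_factor=1.5,
                  port_range=(32768, 60999)):
    """Rough ephemeral-port pressure check for one node.

    Returns (projected port usage, ports available, at-risk flag).
    The flag trips at 80% of the range to leave margin for other traffic.
    """
    available = port_range[1] - port_range[0] + 1
    projected = pods_per_node * pool_size_per_pod * churn_factor
    return projected, available, projected >= available * 0.8
```

The 20-pods-of-20 example above projects only around 600 ports against roughly 28,000 available, which is why this failure mode tends to appear only at much higher pod or pool counts, or under heavy connection churn.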

2.2 Detecting Port Saturation

On your application hosts or Kubernetes nodes, monitor the following:

# Count connections to your Aurora endpoint
ss -tn | grep ':5432' | wc -l

# Check ephemeral port usage
cat /proc/sys/net/ipv4/ip_local_port_range
ss -tn state established | wc -l

# On Kubernetes nodes, check connections per pod IP
ss -tn | grep ':5432' | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn

2.3 Avoiding Port Saturation

There are several approaches to preventing port saturation, and you should typically combine more than one.

First, widen the ephemeral port range. On Linux, the default range of 32768 to 60999 can be expanded. Set net.ipv4.ip_local_port_range = 1024 65535 in sysctl.conf to nearly double the available ports. Be aware that this overlaps with some well known ports, so ensure your applications do not listen on ports in the lower range.

Second, enable TCP port reuse. Set net.ipv4.tcp_tw_reuse = 1 to allow reuse of TIME_WAIT sockets for new connections to the same destination. This is safe for client side connections and dramatically reduces port pressure during connection churn.

Third, reduce connection pool sizes. This is where proper sizing becomes critical.

2.4 Calculating Connection Pool Size

The PostgreSQL wiki provides a formula that has held up well across many benchmarks:

optimal_connections = (core_count * 2) + effective_spindle_count

For Aurora Serverless v2, where storage is network attached and there are no physical spindles, effective_spindle_count is effectively zero if your working set is cached (which it should be if you have set your minimum ACUs correctly). So the formula simplifies to:

optimal_connections = core_count * 2

Each ACU provides roughly 0.25 vCPUs of compute capacity. At 16 ACUs maximum, you have approximately 4 vCPUs. This means the optimal number of active connections for a 16 ACU instance is around 8 to 10 connections doing active work simultaneously.

This number surprises people. It should not. PostgreSQL is not designed for massive connection parallelism. Every connection spawns a process, each consuming around 10 MiB of memory. Context switching between hundreds of active connections creates overhead that actively degrades throughput.

The formula for your total connection budget across all application instances is:

total_pool_connections = max_connections - superuser_reserved - monitoring_connections
per_pod_pool_size = total_pool_connections / number_of_pods

For Aurora Serverless v2, max_connections is derived from the maximum ACU setting. A rough approximation is:

max_connections ≈ GREATEST(max_ACU * 50, 100)

If your maximum ACU is 16, you get approximately 800 connections. Reserve 3 for superusers, another 10 for monitoring and administrative connections, leaving 787 for your applications. If you run 50 pods, each pod gets approximately 15 connections in its pool. If you run 200 pods, each gets approximately 4.
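The budget arithmetic above, as a sketch using the rough max_connections approximation (verify against your actual cluster parameter group before relying on it):

```python
def per_pod_pool_size(max_acu, pods, superuser_reserved=3, monitoring=10):
    """Per-pod connection pool budget for an Aurora Serverless v2 cluster.

    Uses the article's approximation max_connections ~= max(max_acu * 50, 100)
    and floors the per-pod share, so it errs on the conservative side.
    """
    max_connections = max(max_acu * 50, 100)
    budget = max_connections - superuser_reserved - monitoring
    return budget // pods
```

At 16 ACUs the budget is 787 connections, giving 15 per pod across 50 pods and only 3 to 4 per pod across 200, which is the point where an external pooler becomes necessary.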

When the per pod number drops below 5, you need either an external connection pooler (PgBouncer or RDS Proxy) or fewer pods with larger pool sizes. RDS Proxy is the path of least resistance in the AWS ecosystem, though be aware it prevents Aurora Serverless v2 from scaling to zero ACUs since the proxy maintains persistent connections.

2.5 The Microservices Connection Multiplication Problem

In a microservices architecture, the connection math gets worse quickly. If you have 10 microservices, each running 20 pods, each with a connection pool of 10, you need 2,000 connections. This exceeds the max_connections for many ACU configurations.

The solution is PgBouncer in transaction mode, either as a sidecar on each pod or as a shared pool. In transaction mode, a server side connection is only held for the duration of a transaction, not the lifetime of a client connection. This lets you support thousands of client connections with a much smaller number of actual PostgreSQL connections.
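For illustration, a minimal transaction-mode pgbouncer.ini might look like the fragment below. The endpoint, database name, and pool numbers are placeholders to adapt, not recommendations:

```
; pgbouncer.ini sketch: many clients fanned into a small server-side pool
[databases]
appdb = host=my-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; release the server connection at COMMIT/ROLLBACK
default_pool_size = 20         ; server connections per database/user pair
max_client_conn = 2000         ; client connections PgBouncer will accept
server_idle_timeout = 60       ; close idle server connections after 60s
```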

3 Aurora’s Aggressive Scale Down and the Lack of Custom Scaling Parameters

This is the section where I become most critical of Aurora Serverless v2, because this behaviour has caused real production incidents.

3.1 The Problem: Aurora Scales Down Too Aggressively

Aurora Serverless v2 uses an internal algorithm to determine when to scale down. You cannot configure this algorithm. You cannot set a cooldown period. You cannot define custom scaling metrics. AWS documentation confirms that Aurora Auto Scaling (the kind that adds readers) is explicitly not supported for Serverless v2 because “scaling based on CPU usage isn’t meaningful for Aurora Serverless v2.”

The scaling down behaviour works as follows: when CPU load decreases, CPU capacity is released relatively quickly. Memory is released more gradually, but it is still released. AWS describes this as a “deliberate architectural choice” where memory is “more gradually released” as demand lessens. In practice, this means that after a burst of traffic, Aurora will begin reducing ACUs within minutes. If another burst arrives before the buffer cache has been properly rebuilt, you hit the cold cache problem described in Section 1.

You cannot tell Aurora to “keep at least 16 ACUs for the next 30 minutes after a burst.” You cannot scale based on a custom CloudWatch metric like queue depth or request rate. You cannot set a scale down cooldown. The only control you have is the minimum and maximum ACU range.

3.2 The Burst Traffic Pattern Problem

Consider a banking application that processes batch payments every 15 minutes. Each batch takes 3 minutes and requires significant database resources. Between batches, the database is relatively idle. Aurora Serverless v2 will scale up for each batch, then aggressively scale back down during the 12 minute quiet period. Each time it scales down, it loses buffer cache. Each time the next batch arrives, it hits cold storage.

With a provisioned instance, you simply size for your peak workload and accept the cost during quiet periods. With Serverless v2, you are forced to choose between setting a high minimum ACU (defeating the purpose of serverless) or accepting degraded burst performance.

3.3 The Workaround

The only reliable workaround is to set your minimum ACU high enough to absorb the scale down behaviour. This means your minimum ACU should be set to the capacity needed to keep your buffer cache warm, not to the lowest possible cost saving value.

For burst workloads specifically, consider running a lightweight background query that keeps the database warm. This is an ugly hack, but it works:

-- Run every 5 minutes from a lightweight scheduler
SELECT count(*) FROM (
    SELECT 1 FROM your_hot_table
    ORDER BY your_commonly_queried_column
    LIMIT 10000
) t;

This keeps frequently accessed pages in the buffer cache and prevents Aurora from releasing memory it will need again shortly.

4 Long Running Queries, Timeouts, and Lock Management

Long running queries interact badly with Aurora Serverless v2’s scaling model in ways that catch teams off guard. A query that would be merely slow on a provisioned instance can become a cascade failure trigger on a serverless cluster.

4.1 What Happens to Queries During a Scaling Event

Aurora Serverless v2 scales compute resources with minimal disruption to running queries, but minimal is not zero. During a scale up event, CPU and memory are added relatively transparently. During a scale down event, if the instance needs to reclaim memory that is currently being used by an active query’s sort or hash buffers, there is contention. Queries that were running fine at 16 ACUs and relying on 4 GiB of work_mem allocations will behave differently at 4 ACUs where the same global work_mem setting is competing with a much smaller memory budget.

The practical consequence is that work_mem should not be set globally to a value that assumes your maximum ACU is always in effect. Set it conservatively at the server level and use SET LOCAL work_mem within specific transactions that need more:

BEGIN;
SET LOCAL work_mem = '256MB';
-- your sort heavy or hash join query here
COMMIT;

This scopes the elevated work_mem to a single transaction without affecting the global setting, which matters when the instance is at lower ACU counts.

4.2 Statement Timeout: Your First Line of Defence

Aurora Serverless v2 has no built in query kill mechanism based on resource pressure. If a long running query is consuming CPU and preventing scale down, it will continue to do so until it completes, is cancelled by the client, or you intervene manually.

Set statement_timeout at the database or role level, not just in application code. Application side timeouts are unreliable: connection drops, application restarts, and network issues can all leave a query running on the server even after the client has given up:

-- For your application role, set a sensible default
ALTER ROLE app_user SET statement_timeout = '30s';

-- For reporting or analytics roles that legitimately need longer
ALTER ROLE analytics_user SET statement_timeout = '300s';

-- Verify what is set
SELECT rolname, rolconfig FROM pg_roles WHERE rolname IN ('app_user', 'analytics_user');

For Aurora Serverless v2 specifically, long running analytical queries should either be directed to a dedicated reader instance (section 6) or run via Aurora’s parallel query feature if eligible, not on the writer at peak hours.

4.3 Idle in Transaction: The Silent ACU Killer

A connection that is idle inside a transaction holds locks, consumes memory, and prevents autovacuum from cleaning up dead rows. At low ACU counts where memory is limited, a handful of idle in transaction connections can cause measurable performance degradation.

Set idle_in_transaction_session_timeout to terminate connections that have opened a transaction and then gone silent:

-- Terminate connections idle in a transaction for more than 60 seconds
ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();

This is a server level parameter. Setting it to 60 seconds will terminate sessions that have started a transaction and not done anything for a minute. This is almost always a bug in application code: a connection pool returning a connection mid-transaction, or an application that opened a transaction and crashed. Surfacing these sessions with a timeout is better than letting them accumulate.

Also set lock_timeout to prevent lock queues from building silently:

ALTER ROLE app_user SET lock_timeout = '10s';

Without a lock timeout, a query waiting on a lock will wait indefinitely. On a serverless cluster, this means connections accumulate in the lock wait state, connection pool queues build up, and by the time you notice, your DatabaseConnections metric has saturated.

4.4 Detecting and Killing Problem Queries

Add these diagnostic queries to your runbook. Run them when monitoring shows ACUUtilization is high but application traffic does not explain it:

-- Queries running longer than 60 seconds
SELECT
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state,
    wait_event_type,
    wait_event
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > INTERVAL '60 seconds'
  AND state != 'idle'
ORDER BY duration DESC;

-- Connections idle in transaction (holding locks, consuming memory)
SELECT
    pid,
    now() - state_change AS idle_duration,
    usename,
    application_name,
    client_addr,
    LEFT(query, 100) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_duration DESC;

-- Lock waits that are blocking other queries
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
ORDER BY blocked_duration DESC;

To terminate a specific query without killing the connection:

SELECT pg_cancel_backend(pid);

To terminate the connection entirely:

SELECT pg_terminate_backend(pid);

4.5 When Not to Use Aurora Serverless v2

Long running query patterns are a signal that Serverless v2 may not be the right choice for a workload. If your cluster regularly runs queries longer than five minutes, hosts analytics or reporting workloads alongside OLTP, or has a predictable and stable baseline load that never drops meaningfully, a provisioned instance with Graviton will almost always outperform Serverless v2 on both cost and latency.

Serverless v2 earns its keep for workloads with genuine variability: overnight quiet periods followed by daytime peaks, bursty batch jobs with unpredictable timing, or development and staging environments that are idle most of the day. For consistent high throughput workloads, the overhead of the scaling mechanism and the buffer cache volatility described in section 1 are costs with no corresponding benefit.

5 The AWS Advanced JDBC Wrapper

If you are running Java applications against Aurora, the AWS Advanced JDBC Wrapper is not optional. It is a significant upgrade over using a standard PostgreSQL or MySQL JDBC driver directly, and it solves several problems that are specific to Aurora’s architecture.

5.1 What It Does

The AWS Advanced JDBC Wrapper sits on top of your existing JDBC driver (PostgreSQL JDBC, MySQL Connector/J, or MariaDB Connector/J) and adds Aurora specific capabilities without requiring you to rewrite your data access code. The key features for production workloads are:

Fast Failover Beyond DNS. When Aurora performs a failover, the cluster endpoint DNS record needs to update to point to the new primary. DNS propagation can take up to 30 seconds. The JDBC Wrapper maintains a real time cache of the Aurora cluster topology by querying Aurora directly. When a failover occurs, it immediately knows which instance is the new primary and reconnects without waiting for DNS. In testing, this reduces failover reconnection time from 30 seconds to approximately 6 seconds with the v1 failover plugin, and even less with Failover v2.

Failover v2 Plugin. The original failover plugin handled each connection’s failover independently. If you had 100 active connections during a failover, each one independently probed the cluster topology to find the new writer. The Failover v2 plugin centralises topology monitoring into a single thread (MonitorRdsHostListProvider), so hundreds of connections can fail over simultaneously without overwhelming the cluster with topology queries.

Enhanced Failure Monitoring. Traditional JDBC drivers detect failures through TCP timeouts, which can take 30 seconds or more. The JDBC Wrapper’s Enhanced Failure Monitoring (EFM) proactively monitors database node health using lightweight probe connections. It detects failures before your application’s connection times out, enabling faster response.

Seamless IAM Authentication. Aurora supports IAM database authentication, but implementing token generation, expiration handling, and renewal in your application is tedious. The wrapper handles the entire IAM authentication lifecycle transparently.

Secrets Manager Integration. Database credentials are retrieved automatically from AWS Secrets Manager. Your application configuration never contains the actual password.

5.2 Read Write Splitting

The read/write splitting plugin is particularly valuable for Aurora Serverless v2 because it allows you to route read traffic to reader instances (which scale independently from the writer in promotion tiers 2 through 15) while keeping write traffic on the writer.

The plugin works by intercepting connection.setReadOnly(true) calls and switching the underlying connection to a reader instance. When setReadOnly(false) is called, it switches back to the writer. In Spring Boot with JPA, this maps directly to @Transactional(readOnly = true) annotations.

# HikariCP configuration with AWS JDBC Wrapper
spring:
  datasource:
    url: jdbc:aws-wrapper:postgresql://your-cluster.cluster-xxx.region.rds.amazonaws.com:5432/mydb
    driver-class-name: software.amazon.jdbc.Driver
    hikari:
      connection-timeout: 30000
      maximum-pool-size: 15
  jpa:
    properties:
      hibernate:
        connection:
          provider_class: software.amazon.jdbc.ds.HikariPooledConnectionProvider

The reader selection strategy is configurable. Options include random (default), round robin, and fastest response time. For Aurora Serverless v2 where reader instances may be at different ACU levels, round robin provides the most even distribution.

One important caveat from the AWS documentation: if you are using Spring’s @Transactional(readOnly = true) annotation, Spring calls setReadOnly(true), runs the method, then calls setReadOnly(false), so every transaction pays the connection switching overhead twice. For high throughput read workloads, the AWS documentation recommends using separate data sources pointing to the writer and reader cluster URLs rather than the read/write splitting plugin with Spring annotations. Use the plugin where the same code path mixes reads and writes and you want automatic routing, and use separate data sources for dedicated read heavy services.

6 Prepared Statement Plans and Partitioned Tables

This is a subtle but consequential issue that can cause significant memory consumption and degraded query performance, particularly on partitioned tables.

6.1 How PostgreSQL Plan Caching Works

When you execute a prepared statement, PostgreSQL goes through parsing, planning, and execution. For the first five executions, it creates a custom plan specific to the parameter values provided. Starting with the sixth execution, it evaluates whether a generic plan (one that works for any parameter values) would be efficient enough. If the generic plan’s estimated cost is close to the average custom plan cost, PostgreSQL switches to reusing the generic plan permanently for that prepared statement.

This is normally a good thing. Planning is expensive, and reusing a generic plan saves planning overhead. The problem arises with partitioned tables.
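The five-custom-plans-then-generic decision described above can be sketched as a toy function. This is a simplified model of the behaviour, not the actual planner code (which also weights planning cost into the comparison):

```python
def choose_plan(execution_count, avg_custom_cost, generic_cost):
    """Simplified model of PostgreSQL's prepared statement plan choice.

    The first five executions always get a custom plan built for the
    actual parameter values. From the sixth execution onward, the
    generic plan is reused if its estimated cost is no worse than the
    average custom plan cost observed so far.
    """
    if execution_count <= 5:
        return "custom"
    return "generic" if generic_cost <= avg_custom_cost else "custom"
```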

6.2 The Partition Pruning Problem

When PostgreSQL plans a query against a partitioned table, it determines which partitions are relevant based on the query’s WHERE clause. This is called partition pruning. With a custom plan where the parameter values are known, the planner can prune partitions at planning time (initial pruning). Only the relevant partitions are included in the plan.

With a generic plan, the parameter values are not known at planning time. The planner must generate a plan that covers all partitions, because it does not know which ones will be relevant until execution time. PostgreSQL can still perform runtime pruning during execution, but the plan itself references all partitions.

Here is where the memory problem emerges. When PostgreSQL creates a plan that references a partition, it loads that partition’s metadata into the relation cache (relcache). For a table with hundreds of partitions, this means the relcache for a single prepared statement contains metadata for every partition. Each cached plan also holds locks on all referenced partitions. Multiply this by hundreds of prepared statements across many connections, and memory consumption becomes substantial.

6.3 The Memory Explosion

Consider a table partitioned by date with 365 daily partitions. A typical query filters by a single day. With a custom plan, only 1 partition is referenced. With a generic plan, all 365 partitions are referenced. Each partition’s relcache entry consumes memory (typically several kilobytes per partition including index metadata). Across 200 connections, each with 50 cached generic plans touching 365 partitions, you can consume gigabytes of memory just in relcache entries.
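Back-of-envelope arithmetic makes the scale concrete. The per-entry sizes below are rough assumptions for illustration, not measured values:

```python
def relcache_estimate_bytes(connections, plans_per_conn, partitions,
                            relcache_kb_per_partition=8,
                            plan_kb_per_partition=2):
    """Rough estimate of memory consumed by generic plans on a partitioned table.

    Each connection caches relcache metadata once per partition, and each
    cached generic plan additionally holds plan nodes referencing every
    partition. Sizes per partition are illustrative assumptions.
    """
    relcache = connections * partitions * relcache_kb_per_partition
    plans = connections * plans_per_conn * partitions * plan_kb_per_partition
    return (relcache + plans) * 1024

# 200 connections, 50 cached plans each, 365 daily partitions
total = relcache_estimate_bytes(200, 50, 365)
print(f"{total / 2**30:.1f} GiB")  # prints roughly 7.5 GiB
```

Even with conservative per-partition sizes, the multiplication across connections, plans, and partitions lands in the gigabytes.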

This problem is well documented. PostgreSQL committed a fix in version 17 (commit 525392d5) that prevents partitions pruned during initial pruning from being locked and loaded into the relcache. However, this fix was subsequently reverted (commit 1722d5eb) because it caused issues. PostgreSQL 18 includes partial improvements, but the core problem of generic plans loading all partitions into the relcache is not fully resolved.

6.4 Mitigations for Aurora

For Aurora Serverless v2, where memory directly translates to ACU consumption and cost, this issue is particularly impactful. Several mitigations are available:

Force custom plans where appropriate. For queries that always target a single partition, disable generic plans at the session level:

SET plan_cache_mode = 'force_custom_plan';

This forces PostgreSQL to generate a custom plan for every execution. You pay the planning overhead, but the plan only references the relevant partitions. For queries against large partitioned tables where pruning eliminates most partitions, the planning cost is almost always worth the memory savings.

Reduce partition count. If you have 365 daily partitions, consider monthly partitions (12) or weekly partitions (52). Fewer partitions means smaller relcache footprint per generic plan.

Mind your connection pooler. If you use PgBouncer in transaction mode, server connections are shared across clients, so prepared statements created by one client can leak to, or be missing from, the connection another client receives. Confirm that your pooler handles the prepared statement lifecycle correctly.

Monitor plan memory. Query pg_prepared_statements to see what plans are cached:

SELECT name, statement, generic_plans, custom_plans
FROM pg_prepared_statements;

If generic_plans is high for queries against partitioned tables, those are your candidates for force_custom_plan.

7 Directing Traffic to Readers vs Writers

Aurora provides separate cluster endpoints for writers and readers, but getting your application to use them correctly requires deliberate architecture.

7.1 Aurora Endpoint Types

Aurora provides four endpoint types:

The cluster endpoint (also called the writer endpoint) always points to the current primary instance. Use this for all write operations.

The reader endpoint load balances across all reader instances using round robin DNS. Use this for read only queries.

Instance endpoints connect to a specific instance. Avoid using these in application code as they break failover.

Custom endpoints allow you to define groups of instances for specific workloads. These can be useful for directing analytics queries to larger reader instances while keeping transactional reads on smaller ones.

7.2 Application Level Routing

The simplest approach is two data sources in your application: one pointing to the writer endpoint and one to the reader endpoint. In Spring Boot:

# Writer datasource
app:
  datasource:
    writer:
      url: jdbc:postgresql://cluster.cluster-xxx.region.rds.amazonaws.com:5432/mydb
    reader:
      url: jdbc:postgresql://cluster.cluster-ro-xxx.region.rds.amazonaws.com:5432/mydb

Route services or repository methods to the appropriate data source based on whether they perform reads or writes. This avoids the connection switching overhead of the JDBC Wrapper’s read/write splitting plugin while still distributing load.

7.3 Reader Scaling and Promotion Tiers

Readers in promotion tiers 0 and 1 scale with the writer. They are always sized to handle a failover and take over the writer role. Readers in promotion tiers 2 through 15 scale independently based on their own workload.

For cost optimisation, place your read replicas in tier 2 or higher. They will scale down independently when read traffic is low, rather than tracking the writer’s ACU level. This can save significant cost if your read traffic pattern differs from your write traffic pattern.

For availability, keep at least one reader in tier 0 or 1. This reader will always be sized appropriately to become the writer during a failover, eliminating the risk of a failover promoting an undersized reader.

7.4 DNS TTL and Stale Routing

The reader endpoint uses DNS round robin, which means the DNS TTL affects how quickly your application discovers new readers or stops sending traffic to removed ones. Set your JVM’s DNS cache TTL to a low value:

java.security.Security.setProperty("networkaddress.cache.ttl", "5");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "3");

The AWS JDBC Wrapper handles this internally through its topology cache, which is another reason to use it.

8 The Outage You Will Most Likely Have: Connection Storm During Scale Lag

All the scaling realities discussed above converge into one specific failure pattern that accounts for the majority of Aurora Serverless v2 production incidents. It does not announce itself as a capacity problem. It announces itself as a wall of connection timeouts and a silent database.

Understanding this pattern, and distinguishing it from CPU saturation, is the difference between an engineer who panics and an engineer who fixes it in under five minutes.

8.1 The Failure Sequence

The pattern plays out in four predictable stages.

Stage 1: Traffic spike arrives. Application concurrency climbs rapidly, whether from a batch job, a marketing event, or a morning surge. Each new application thread or request opens or borrows a database session.

Stage 2: Scale lag creates a window of vulnerability. Aurora Serverless v2 scales fast, but not instantaneously. From a low minimum ACU baseline, say 2 ACUs, reaching 16 ACUs can take 30 to 90 seconds. During this window, max_connections is still calibrated to the current ACU count, not the target. At 2 ACUs on PostgreSQL, you have roughly 135 connections. Your application, unaware of this ceiling, is already trying to open 300.

Stage 3: Connection attempts fail and retry amplification begins. Application connection pools hit the limit and throw errors. Clients retry. Each retry is another connection attempt against a database that is already saturated. The retry storm is now generating more load than the original traffic spike. New ACUs are still spinning up. The database cannot yet serve enough connections to drain the retry queue.

Stage 4: Full saturation. The writer node reaches 100% of its current connection ceiling. All new connections are rejected. RDS Proxy queues fill if you have it deployed, or requests fail immediately if you do not. Application pods begin cascading with database unavailability errors. On CloudWatch, you see DatabaseConnections flatlined at maximum and ServerlessDatabaseCapacity still climbing. The capacity was coming, but the application fell over before it arrived.

8.2 CPU Saturation vs Session Saturation: Two Different Problems

These two failure modes look similar on a dashboard and require completely different responses.

CPU bound scaling failure presents as high CPUUtilization, slow query times, and gradual degradation. Connections succeed but queries run slowly. Aurora is scaling but the workload is compute intensive. The fix is a higher minimum ACU and query optimisation.

Session bound saturation presents as DatabaseConnections at ceiling, CPUUtilization potentially low or moderate, and immediate hard connection errors rather than slow responses. The database may have ample CPU headroom but cannot admit new sessions because the connection limit is a function of current memory, not future memory. The fix is RDS Proxy, higher minimum ACU, and application level connection limiting.

Misdiagnosing session saturation as a CPU problem and throwing more application pods at it makes the situation worse. More pods mean more connection attempts against the same ceiling.

8.3 Why Minimum ACU Is Your Primary Defence

The scaling rate in Aurora Serverless v2 is proportional to current capacity. A database sitting at 0.5 ACUs doubles slowly. A database sitting at 8 ACUs reaches 16 ACUs in seconds. Setting your minimum ACU to match your p95 baseline load is not about cost; it is about ensuring the database can reach adequate capacity before your application runs out of connection budget.

The relationship that matters here is simple: at any given ACU level, the default max_connections ceiling on Aurora Serverless v2 PostgreSQL is derived from the memory available at that capacity, so it grows roughly linearly with ACU count (the exact formula varies by engine version; treat these numbers as illustrative). At 2 ACUs that gives you around 150 connections. At 8 ACUs you have around 400. Setting minimum ACU too low is effectively setting a connection ceiling at the point your database is most vulnerable, immediately after a surge begins.
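As a sketch of the budget arithmetic, treating the ceiling as roughly linear in ACU count (the 50-connections-per-ACU constant and the floor are illustrative assumptions, not the documented AWS formula):

```python
def max_connections_estimate(acu, per_acu=50, floor=75, cap=5000):
    """Illustrative model of the default connection ceiling at a given ACU level."""
    return max(floor, min(cap, int(per_acu * acu)))

def surge_is_survivable(current_acu, pods, pool_size_per_pod):
    """Check whether the fleet's worst-case connection demand fits under
    the ceiling at the *current* (pre-scale) ACU level, which is what
    matters during the scale lag window."""
    demand = pods * pool_size_per_pod
    return demand <= max_connections_estimate(current_acu)

# 10 pods x 50 connections against a database sitting at 2 ACUs:
# a 500-connection demand against a ceiling of ~100 will not fit
print(surge_is_survivable(2, 10, 50))
```

Run this for your own pod count and pool size before trusting a low minimum ACU.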

8.4 Diagnostic Checklist for This Failure Mode

When you suspect you have hit a connection storm during scale lag, work through these checks in order.

Confirm it is session saturation and not CPU saturation. Check DatabaseConnections against the expected maximum for the current ServerlessDatabaseCapacity. If connections are at ceiling but CPU is below 80%, you have a session problem, not a compute problem.

Check the timeline of ACU scaling vs connection spike. In CloudWatch, plot ServerlessDatabaseCapacity and DatabaseConnections on the same axis with a one minute period. If connections hit ceiling before ACUs reached adequate capacity, scale lag is confirmed as the root cause.

Check whether RDS Proxy was in the path. If RDS Proxy was not deployed, every application thread was holding a direct connection. If it was deployed, check ProxyClientConnections and ProxyDatabaseConnections to see whether the proxy queue was the bottleneck rather than the database connection limit directly.

Check your application connection pool configuration. If your pool has no maximum connection limit per pod, each pod will attempt to open as many connections as it has threads. With ten pods each attempting 50 connections you have 500 attempts against a database that allows 150. No amount of Aurora scaling fixes this without application side throttling.

Check your minimum ACU setting. If minimum is below 4 ACUs, you are starting each scaling event from a low connection budget. Increase minimum ACU to match your realistic idle baseline, not the theoretical minimum.

Check whether retries are exponential with jitter. Synchronised retries amplify the storm. If your application retries every 100ms on all pods simultaneously, you are generating burst connection load far exceeding the original spike.
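The last check, exponential backoff with full jitter, reduces to a few lines. A minimal sketch; the base and cap values are assumptions to tune for your workload:

```python
import random

def backoff_delays(attempts, base=0.2, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter.

    Each retry sleeps a random duration in [0, min(cap, base * 2**attempt)].
    The randomness de-synchronises retries across pods, so a fleet-wide
    failure does not turn into a synchronised connection storm.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays
```

In application code you would sleep for each delay between connection attempts rather than collecting them in a list.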

8.5 The Core Issue: Application Concurrency Outpacing Database Scaling Velocity

This is the framing that matters. Aurora Serverless v2 does not have a connection problem. It has a timing problem. The database will scale to handle your load. The question is whether your application can survive the seconds it takes to get there.

This is not unique to Aurora Serverless v2. Any autoscaling database faces the same gap. What makes Aurora Serverless v2 distinctive is that teams deploy it expecting magic: instant, invisible scaling. They do not build the application side defensive patterns that provisioned databases trained them to build. They remove connection pooling because “serverless handles it.” They remove retry backoff because “it scales instantly.” And then they hit the 45 second window during a Monday morning surge and the database goes dark.

The correct mental model is: Aurora Serverless v2 scales fast enough that you rarely need to manually intervene on capacity, but not fast enough to absorb an uncontrolled connection storm from a stateless application with no connection budget discipline.

The three interventions that prevent this outage are as follows. RDS Proxy absorbs connection spikes by multiplexing thousands of application connections onto a stable, smaller set of database sessions. It does not eliminate the scale lag window but it means the database is not fighting both a capacity ramp and a connection ceiling simultaneously. A minimum ACU above your realistic idle load ensures the database enters the scaling ramp from a higher starting point and reaches adequate capacity in seconds rather than minutes. Application side connection limiting, whether via connection pool max size, circuit breakers, or concurrency limiters at the service layer, ensures that even under surge conditions your pods do not collectively attempt more connections than the database can admit at its current ACU level.

With all three in place, connection storms during scale lag become a metric to monitor rather than an outage to survive.

9 Monitoring and Alerting on Scaling Issues

The challenge with alerting on Aurora Serverless v2 is that traditional database metrics do not tell the full story. A CPU utilisation of 80% on a provisioned instance means something very different from 80% on a Serverless v2 instance that may be in the process of scaling.

9.1 Essential CloudWatch Metrics

ACUUtilization is the single most important metric. It represents the percentage of the currently allocated capacity that is in use. When this approaches 100%, the instance is at its current capacity ceiling and will attempt to scale up. If it stays at 100%, the instance has hit its maximum ACU and cannot scale further.

ServerlessDatabaseCapacity shows the current ACU value. Plot this alongside your application’s request rate to understand the correlation between traffic and scaling behaviour. This metric is also essential for identifying the aggressive scale down pattern described in Section 3.

FreeableMemory shows available memory at maximum capacity. When this approaches zero, the instance has scaled as high as it can and is running out of memory.

BufferCacheHitRatio should stay above 99% at all times. Any drop below this threshold indicates that the buffer cache is too small for your working set, usually because the instance has scaled down too aggressively.

DatabaseConnections tracks active connections. Alert when this approaches max_connections for your ACU configuration. As described in Section 8, this metric plotted alongside ServerlessDatabaseCapacity is your primary diagnostic for connection storm detection.

9.2 CloudWatch Alarms

Set up the following alarms as a baseline:

# ACU Utilization approaching maximum
aws cloudwatch put-metric-alarm \
  --alarm-name "aurora-acu-utilization-high" \
  --metric-name ACUUtilization \
  --namespace AWS/RDS \
  --statistic Average \
  --period 300 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:region:account:alerts

# Buffer cache hit ratio dropping
aws cloudwatch put-metric-alarm \
  --alarm-name "aurora-buffer-cache-low" \
  --metric-name BufferCacheHitRatio \
  --namespace AWS/RDS \
  --statistic Average \
  --period 300 \
  --threshold 99 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:region:account:alerts

# Connection count approaching limit
# For a max ACU of 16, max_connections is approximately 800
aws cloudwatch put-metric-alarm \
  --alarm-name "aurora-connections-high" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --statistic Maximum \
  --period 60 \
  --threshold 640 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:region:account:alerts

# Freeable memory critically low
aws cloudwatch put-metric-alarm \
  --alarm-name "aurora-memory-low" \
  --metric-name FreeableMemory \
  --namespace AWS/RDS \
  --statistic Average \
  --period 300 \
  --threshold 524288000 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:region:account:alerts

9.3 Custom Metrics Worth Tracking

Beyond CloudWatch’s built in metrics, track these at the application level:

Connection pool wait time: how long your application waits for a connection from the pool. Spikes indicate pool saturation.

Transaction duration percentiles: p95 and p99 transaction durations. Sudden increases often correlate with buffer cache misses during scale down.

Connection churn rate: how frequently connections are opened and closed. High churn can indicate pool misconfiguration.

Query plan cache memory: monitor pg_prepared_statements to catch the partition related memory bloat described in Section 6.
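These application level metrics can be shipped to CloudWatch alongside the built in ones. A sketch of building the payload for the connection pool wait time metric; the namespace and metric name are assumptions:

```python
def pool_wait_metric(service, wait_ms):
    """Shape a CloudWatch custom metric datum for connection pool wait time."""
    return {
        "Namespace": "App/Database",
        "MetricData": [{
            "MetricName": "ConnectionPoolWaitMs",
            "Dimensions": [{"Name": "Service", "Value": service}],
            "Value": float(wait_ms),
            "Unit": "Milliseconds",
        }],
    }
```

Publishing is then a single call: boto3's cloudwatch.put_metric_data(**pool_wait_metric("checkout", 42)).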

9.4 A Dashboard That Tells the Story

Build a dashboard that correlates application request rate (from your APM tool), ServerlessDatabaseCapacity (current ACU), ACUUtilization (percentage of current capacity in use), BufferCacheHitRatio (cache health), DatabaseConnections (connection pressure), and query latency p95 (application perceived performance).

When you can see all six of these on one graph, the cause of any performance incident becomes immediately obvious. The request rate goes up, ACUs lag behind, buffer cache ratio drops, latency spikes. Or connections climb while ACUs are stable, indicating a connection pool problem rather than a compute problem: the session saturation signature described in Section 8.
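Assembling that dashboard is scriptable. A sketch that builds the dashboard body; the widget layout is minimal, the dimension is simplified to the cluster level, and the application metric namespace and names are assumptions (your APM integration will differ):

```python
import json

RDS_METRICS = [
    ("AWS/RDS", "ServerlessDatabaseCapacity"),
    ("AWS/RDS", "ACUUtilization"),
    ("AWS/RDS", "BufferCacheHitRatio"),
    ("AWS/RDS", "DatabaseConnections"),
]

def dashboard_body(cluster_id,
                   app_metrics=(("MyApp", "RequestRate"),
                                ("MyApp", "QueryLatencyP95"))):
    """Build a CloudWatch dashboard body with all six series on one widget."""
    series = [[ns, name, "DBClusterIdentifier", cluster_id]
              for ns, name in RDS_METRICS]
    series += [[ns, name] for ns, name in app_metrics]  # custom app metrics
    return json.dumps({
        "widgets": [{
            "type": "metric",
            "x": 0, "y": 0, "width": 24, "height": 12,
            "properties": {"metrics": series, "period": 60,
                           "stat": "Average",
                           "title": "Aurora scaling vs application load"},
        }]
    })
```

Pass the returned string to aws cloudwatch put-dashboard --dashboard-body.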

10 Vacuuming Aurora Serverless v2

Vacuum management on Aurora Serverless v2 deserves special attention because the autovacuum process directly affects ACU consumption and the scaling behaviour of your cluster.

I have written a comprehensive guide on Aurora PostgreSQL vacuum optimization for large tables, including TOAST table tuning and the pg_repack extension, so I will not repeat all of that here. Instead, I will focus on the serverless specific considerations.

10.1 Autovacuum and ACU Scaling

The autovacuum daemon runs as a background process. Its activity counts toward the instance’s CPU and memory utilisation, which means it can trigger scaling. This is by design, but it has consequences.

If your minimum ACU is set low and a vacuum operation starts, the instance will scale up to accommodate the vacuum. When the vacuum completes, the instance scales back down. If you have many tables with aggressive autovacuum settings, you can see constant ACU fluctuations driven entirely by vacuum activity rather than application workload.

For Serverless v2, tune your autovacuum parameters to be aggressive during off peak hours and gentler during peak hours. PostgreSQL’s autovacuum has no native schedule, and on Aurora you cannot run ALTER SYSTEM (RDS does not grant superuser), so apply the changes through the parameter group instead:

# During peak hours: slower, gentler vacuums
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-aurora-params \
  --parameters "ParameterName=autovacuum_vacuum_cost_delay,ParameterValue=20,ApplyMethod=immediate" \
               "ParameterName=autovacuum_vacuum_cost_limit,ParameterValue=200,ApplyMethod=immediate"

# During off peak hours: aggressive vacuums
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-aurora-params \
  --parameters "ParameterName=autovacuum_vacuum_cost_delay,ParameterValue=0,ApplyMethod=immediate" \
               "ParameterName=autovacuum_vacuum_cost_limit,ParameterValue=4000,ApplyMethod=immediate"

Schedule these calls with EventBridge and Lambda, or any external scheduler whose role allows rds:ModifyDBClusterParameterGroup. Check whether each parameter lives in the cluster or the instance level parameter group for your engine version, and adjust the command accordingly.
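If you drive the schedule from EventBridge and Lambda, the decision logic reduces to a pure function. A sketch; the peak hours, parameter values, and the boto3 parameter group call in the usage note are assumptions to adapt:

```python
PEAK_HOURS = range(8, 20)  # 08:00-19:59 local time, an assumption to adjust

def vacuum_params(hour):
    """Pick autovacuum cost settings for the given hour of day:
    gentle during peak traffic, aggressive off peak."""
    if hour in PEAK_HOURS:
        return {"autovacuum_vacuum_cost_delay": "20",
                "autovacuum_vacuum_cost_limit": "200"}
    return {"autovacuum_vacuum_cost_delay": "0",
            "autovacuum_vacuum_cost_limit": "4000"}

def as_parameter_list(params):
    """Shape the settings for the RDS modify parameter group API."""
    return [{"ParameterName": k, "ParameterValue": v, "ApplyMethod": "immediate"}
            for k, v in params.items()]
```

The Lambda handler would then call boto3's rds.modify_db_cluster_parameter_group(DBClusterParameterGroupName=..., Parameters=as_parameter_list(vacuum_params(hour))) on each trigger.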

10.2 Table Level Vacuum Tuning

For high churn tables on Serverless v2, set table level autovacuum parameters to prevent bloat while managing the impact on ACU consumption:

ALTER TABLE high_churn_table SET (
    autovacuum_vacuum_threshold = 5000,
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_cost_delay = 10,
    autovacuum_analyze_threshold = 2500,
    autovacuum_analyze_scale_factor = 0.05
);

These settings take effect immediately without a restart. They are table level storage parameters, not server level GUCs.

For tables with large text, JSON, or bytea columns, also tune the TOAST table parameters separately:

ALTER TABLE large_data_table SET (
    toast.autovacuum_vacuum_cost_delay = 0,
    toast.autovacuum_vacuum_threshold = 2000,
    toast.autovacuum_vacuum_scale_factor = 0.05,
    toast.autovacuum_vacuum_cost_limit = 3000
);

10.3 Vacuum Monitoring Queries

Regularly check which tables need vacuuming and how the autovacuum is performing:

-- Tables most in need of vacuum
SELECT
    schemaname || '.' || relname AS table_name,
    n_dead_tup,
    n_live_tup,
    ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_pct,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Long running vacuums that may be inflating ACUs
SELECT
    pid,
    now() - xact_start AS duration,
    query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum%'
ORDER BY duration DESC;

11 Putting It All Together

Running Aurora Serverless v2 PostgreSQL well in production requires understanding that “serverless” does not mean “hands off.” The database still needs careful tuning; the nature of that tuning is just different from provisioned instances.

Set your minimum ACU based on buffer cache requirements, not cost optimisation. Size your connection pools using the core count formula and account for the total connection budget across all pods. Accept that Aurora’s aggressive scale down behaviour is a limitation you must design around, not a feature you can configure. Understand that the most common production outage pattern is not a capacity problem; it is a connection storm that arrives during the scale lag window before capacity catches up with demand. Use the AWS Advanced JDBC Wrapper for failover handling and reader routing. Be deliberate about prepared statement caching on partitioned tables. Tune your vacuum settings at the table level, not just globally. And build dashboards that correlate application metrics with Aurora scaling behaviour so you can see problems before they become incidents.

Aurora Serverless v2 is a powerful platform, but it rewards engineers who understand its mechanics and punishes those who treat it as a magic auto scaling box. The scaling is automatic, but the architecture decisions that make it work well, particularly around minimum ACU sizing, RDS Proxy placement, and application side connection discipline, are very much manual.

Automatically Recovering a Failed WordPress Instance on AWS

When WordPress goes down on your AWS instance, waiting for manual intervention means downtime and lost revenue. Here are two robust approaches to automatically detect and recover from WordPress failures.

Approach 1: Lambda Based Intelligent Recovery

This approach tries the least disruptive fix first (restarting services) before escalating to a full instance reboot.

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# -k: the localhost certificate will not match the site hostname
response=$(curl -sk -o /dev/null -w "%{http_code}" https://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

You should see 1 if WordPress is running.

Step 2: Install CloudWatch Agent on Your EC2 Instance

Still on your EC2 instance, download and install the CloudWatch agent:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

Step 3: Create Metric Publishing Script on Your EC2 Instance

This script will send the health check result to CloudWatch every minute:

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/[a-z]$//')
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value "$HEALTH" \
  --dimensions Instance="$INSTANCE_ID" \
  --region "$REGION"
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it:

/usr/local/bin/send-wordpress-metric.sh

If you get permission errors, ensure your EC2 instance has an IAM role with CloudWatch permissions.

Step 4: Add Health Check to Cron on Your EC2 Instance

This runs the health check every minute:

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 5: Create IAM Role for Lambda on Your Laptop

Now switch to your laptop (or use AWS CloudShell in your browser). You’ll need the AWS CLI installed and configured with credentials.

Create the IAM role that Lambda will use:

aws iam create-role \
  --role-name WordPressRecoveryRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

Attach the necessary policies:

aws iam attach-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

aws iam put-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-name EC2SSMAccess \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ec2:RebootInstances",
          "ec2:DescribeInstances",
          "ssm:SendCommand",
          "ssm:GetCommandInvocation"
        ],
        "Resource": "*"
      }
    ]
  }'

Step 6: Create Lambda Function on Your Laptop

On your laptop, create a file called wordpress-recovery.py in a new directory:

import boto3
import os
import time

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

def lambda_handler(event, context):
    instance_id = os.environ.get('INSTANCE_ID')
    
    if not instance_id:
        return {'statusCode': 400, 'body': 'INSTANCE_ID not configured'}
    
    print(f"WordPress health check failed for {instance_id}")
    
    # Step 1: Try restarting services
    try:
        print("Attempting to restart services...")
        response = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName='AWS-RunShellScript',
            Parameters={
                'commands': [
                    'systemctl restart php-fpm || systemctl restart php8.2-fpm || systemctl restart php8.1-fpm',
                    'systemctl restart nginx || systemctl restart apache2',
                    'sleep 30',
                    'curl -fk https://localhost || exit 1'
                ]
            },
            TimeoutSeconds=120
        )
        
        command_id = response['Command']['CommandId']
        print(f"Command ID: {command_id}")
        
        # Wait for command to complete
        time.sleep(35)
        
        result = ssm.get_command_invocation(
            CommandId=command_id,
            InstanceId=instance_id
        )
        
        if result['Status'] == 'Success':
            print("Services restarted successfully")
            return {'statusCode': 200, 'body': 'Services restarted successfully'}
        else:
            print(f"Service restart failed with status: {result['Status']}")
    
    except Exception as e:
        print(f"Service restart failed with error: {str(e)}")
    
    # Step 2: Reboot the instance as last resort
    try:
        print(f"Rebooting instance {instance_id}")
        ec2.reboot_instances(InstanceIds=[instance_id])
        return {'statusCode': 200, 'body': 'Instance rebooted'}
    except Exception as e:
        print(f"Reboot failed: {str(e)}")
        return {'statusCode': 500, 'body': f'Recovery failed: {str(e)}'}
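
One refinement worth considering: the handler sleeps a fixed 35 seconds before checking the SSM command, which wastes time for fast commands and can still be too early for slow ones. A sketch of a polling loop you could substitute (fetch_status stands in for a callable wrapping ssm.get_command_invocation; the names are mine, not AWS's):

```python
import time

# Terminal states reported by SSM's GetCommandInvocation
TERMINAL = {"Success", "Failed", "Cancelled", "TimedOut"}

def wait_for_command(fetch_status, timeout=120, interval=5, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or timeout expires.

    fetch_status: zero-argument callable returning the command's Status string.
    Returns the final status, or "TimedOut" if the deadline passes first.
    """
    waited = 0
    while waited < timeout:
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(interval)
        waited += interval
    return "TimedOut"
```

Inside the handler this would replace the time.sleep(35) with something like wait_for_command(lambda: ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)['Status']).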

Create the deployment package:

zip wordpress-recovery.zip wordpress-recovery.py

Get your AWS account ID:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Deploy the Lambda function (replace i-1234567890abcdef0 with your actual instance ID and us-east-1 with your region):

aws lambda create-function \
  --function-name wordpress-recovery \
  --runtime python3.11 \
  --role arn:aws:iam::${AWS_ACCOUNT_ID}:role/WordPressRecoveryRole \
  --handler wordpress-recovery.lambda_handler \
  --zip-file fileb://wordpress-recovery.zip \
  --timeout 180 \
  --region us-east-1 \
  --environment "Variables={INSTANCE_ID=i-1234567890abcdef0}"

Step 7: Create CloudWatch Alarm on Your Laptop

Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region:

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching

This alarm fires once the metric's 5-minute Average stays below 1 for 2 consecutive periods, i.e. the health check has been reporting failures for roughly 10 minutes.
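
To make the trigger arithmetic concrete, here is a small simulation of how the alarm evaluates (an assumed simplification for illustration: one Average value per 5-minute period, LessThanThreshold comparison, firing on a run of consecutive breaching periods):

```python
def alarm_fires(period_averages, threshold=1.0, evaluation_periods=2):
    """Return True if `evaluation_periods` consecutive period averages
    fall below `threshold` (the LessThanThreshold comparison)."""
    breaching = 0
    for avg in period_averages:
        # Extend the breaching run, or reset it on a healthy period
        breaching = breaching + 1 if avg < threshold else 0
        if breaching >= evaluation_periods:
            return True
    return False
```

A healthy site publishes 1 every minute, so each period averages 1.0 and the alarm stays OK; a single failed check in one period is not enough on its own, but failures in two consecutive periods trip it.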

Step 8: Connect Alarm to Lambda on Your Laptop

Create an SNS topic (replace us-east-1 with your region):

aws sns create-topic --name wordpress-recovery-topic --region us-east-1

Get the topic ARN:

export TOPIC_ARN=$(aws sns list-topics --region us-east-1 --query 'Topics[?contains(TopicArn, `wordpress-recovery-topic`)].TopicArn' --output text)

Subscribe Lambda to the topic:

aws sns subscribe \
  --region us-east-1 \
  --topic-arn ${TOPIC_ARN} \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:${AWS_ACCOUNT_ID}:function:wordpress-recovery

Give SNS permission to invoke Lambda:

aws lambda add-permission \
  --region us-east-1 \
  --function-name wordpress-recovery \
  --statement-id AllowSNSInvoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn ${TOPIC_ARN}

Update the CloudWatch alarm to notify SNS (replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region):

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions ${TOPIC_ARN}
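
The handler ignores the incoming event, which is fine while one alarm maps to one function. If you later fan several alarms into this topic, the alarm details arrive JSON-encoded inside the standard SNS event envelope; a sketch of extracting them:

```python
import json

def extract_alarm(event):
    """Return (alarm_name, new_state) from an SNS-delivered CloudWatch alarm event.

    SNS wraps the CloudWatch alarm notification as a JSON string in
    Records[0].Sns.Message, so it needs a second json.loads.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["AlarmName"], message["NewStateValue"]
```

With this you could route different alarms to different recovery actions inside a single function.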

Approach 2: Custom Health Check with CloudWatch Reboot

This approach is simpler than the Lambda version. It uses a custom CloudWatch metric based on checking your WordPress homepage, then automatically reboots when the check fails.

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# -k skips certificate validation, since the site's TLS certificate will not match "localhost"
response=$(curl -sk -o /dev/null -w "%{http_code}" https://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

You should see 1 if WordPress is running.

Step 2: Create Metric Publishing Script on Your EC2 Instance

This script sends the health check result to CloudWatch:

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
# Look up instance metadata via IMDSv2 (works on Ubuntu, where Amazon
# Linux's ec2-metadata helper is not available)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/region)
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value "$HEALTH" \
  --dimensions Instance="$INSTANCE_ID" \
  --region "$REGION"
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it (ensure your EC2 instance has an IAM role with CloudWatch permissions):

/usr/local/bin/send-wordpress-metric.sh

Step 3: Add Health Check to Cron on Your EC2 Instance

Run the health check every minute:

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 4: Create CloudWatch Alarm with Reboot Action on Your Laptop

Now from your laptop (or AWS CloudShell), create the alarm. Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region:

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-health-reboot \
  --alarm-description "Reboot instance when WordPress health check fails" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:reboot

This reboots your instance once the metric's 5-minute Average stays below 1 for 2 consecutive periods, i.e. roughly 10 minutes of failed health checks.

That’s it. The entire setup is contained in 4 steps, and there’s no Lambda function to maintain. When WordPress goes down, CloudWatch will automatically reboot your instance.

Which Approach Should You Use?

Use Lambda Recovery (Approach 1) if:

  • You want intelligent recovery that tries service restart before rebooting
  • You need visibility into what recovery actions are taken
  • You want to extend the logic later (notifications, multiple recovery steps, etc)
  • You have SSM agent installed on your instance

Use Custom Health Check Reboot (Approach 2) if:

  • You want a simple solution with minimal moving parts
  • A full reboot is acceptable for all WordPress failures
  • You don’t need to try service restarts before rebooting
  • You prefer fewer AWS services to maintain

The Lambda approach is more sophisticated and tries to minimize downtime by restarting services first. The custom health check reboot approach is simpler, requires no Lambda function, but always reboots the entire instance.

Testing Your Setup

For Lambda Approach

SSH into your instance and stop nginx:

sudo systemctl stop nginx

Watch the Lambda logs from your laptop:

aws logs tail /aws/lambda/wordpress-recovery --follow --region us-east-1

After 10 minutes, you should see the Lambda function trigger and attempt to restart services.

For Custom Health Check Reboot

SSH into your instance and stop nginx:

sudo systemctl stop nginx

Check that the metric is being sent from your laptop:

aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average

(The date -d '15 minutes ago' syntax is GNU date; on macOS, use date -u -v-15M +%Y-%m-%dT%H:%M:%S instead.)

You should see values of 0 appearing. After roughly 10 minutes of failures, your instance will automatically reboot.
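
If you want to script this check rather than eyeball it, note that Datapoints in the response is unordered; a sketch of pulling out the series oldest-first (assuming the JSON response has already been loaded into a dict):

```python
def health_series(response):
    """Return the Average values from a get-metric-statistics response,
    ordered oldest-first (the API returns Datapoints in arbitrary order)."""
    points = sorted(response.get("Datapoints", []), key=lambda p: p["Timestamp"])
    return [p["Average"] for p in points]
```

A trailing run of values below 1.0 in the returned list is the signal that the reboot alarm is about to fire.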

Both approaches ensure your WordPress site recovers automatically without manual intervention.

Create / Migrate WordPress to AWS Graviton: Maximum Performance, Minimum Cost

Running WordPress on ARM-based Graviton instances delivers up to 40% better price-performance compared to x86 equivalents. This guide provides production-ready scripts to deploy an optimised WordPress stack in minutes, plus everything you need to migrate your existing site.

Why Graviton for WordPress?

Graviton processors (Graviton2 in the t4g family, Graviton3 in c7g) deliver:

  • Up to 40% better price-performance vs comparable x86 instances
  • Up to 25% lower cost for equivalent workloads
  • 60% less energy consumption per compute hour
  • Native ARM64 optimisations for PHP 8.x and MariaDB

The t4g.small instance (2 vCPU, 2GB RAM) at ~$12/month handles most WordPress sites comfortably. For higher traffic, t4g.medium or c7g instances scale beautifully.
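
The monthly figure is just the hourly on-demand rate multiplied by roughly 730 hours; a quick sketch of the arithmetic (the $0.0168/hr rate is the one quoted in the launch script below and varies by region):

```python
def monthly_cost(hourly_rate, hours_per_month=730):
    """Approximate on-demand monthly cost from an hourly rate (~730 h/month)."""
    return round(hourly_rate * hours_per_month, 2)
```

monthly_cost(0.0168) comes out around 12.26, the ~$12/month quoted above; doubling the rate for t4g.medium roughly doubles the bill.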

Architecture

┌──────────────────────────────────────────────────┐
│                   CloudFront                     │
│              (Optional CDN Layer)                │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│              Graviton EC2 Instance               │
│  ┌────────────────────────────────────────────┐  │
│  │           Caddy (Reverse Proxy)            │  │
│  │       Auto-TLS, HTTP/2, Compression        │  │
│  └─────────────────────┬──────────────────────┘  │
│                        │                         │
│  ┌─────────────────────▼──────────────────────┐  │
│  │               PHP-FPM 8.3                  │  │
│  │          OPcache, JIT Compilation          │  │
│  └─────────────────────┬──────────────────────┘  │
│                        │                         │
│  ┌─────────────────────▼──────────────────────┐  │
│  │              MariaDB 10.11                 │  │
│  │        InnoDB Optimised, Query Cache       │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │              EBS gp3 Volume                │  │
│  │         3000 IOPS, 125 MB/s baseline       │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Prerequisites

  • AWS CLI configured with appropriate permissions
  • A domain name with DNS you control
  • SSH key pair in your target region

If you’d prefer to download these scripts, check out https://github.com/Scr1ptW0lf/wordpress-graviton.

Part 1: Launch the Instance

Save this as launch-graviton-wp.sh and run from AWS CloudShell:

#!/bin/bash

# AWS EC2 ARM Instance Launch Script with Elastic IP
# Launches ARM-based instances with Ubuntu 24.04 LTS ARM64

set -e

echo "=== AWS EC2 ARM Ubuntu Instance Launcher ==="
echo ""

# Function to get Ubuntu 24.04 ARM64 AMI for a region
get_ubuntu_ami() {
    local region=$1
    # Get the latest Ubuntu 24.04 LTS ARM64 AMI
    aws ec2 describe-images \
        --region "$region" \
        --owners 099720109477 \
        --filters "Name=name,Values=ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-arm64-server-*" \
                  "Name=state,Values=available" \
        --query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
        --output text
}

# Check for default region
if [ -n "$AWS_DEFAULT_REGION" ]; then
    echo "AWS default region detected: $AWS_DEFAULT_REGION"
    read -p "Use this region? (y/n, default: y): " use_default
    use_default=${use_default:-y}
    
    if [ "$use_default" == "y" ]; then
        REGION="$AWS_DEFAULT_REGION"
        echo "Using region: $REGION"
    else
        use_default="n"
    fi
else
    use_default="n"
fi

# Prompt for region if not using default
if [ "$use_default" == "n" ]; then
    echo ""
    echo "Available regions for ARM instances:"
    echo "1. us-east-1 (N. Virginia)"
    echo "2. us-east-2 (Ohio)"
    echo "3. us-west-2 (Oregon)"
    echo "4. eu-west-1 (Ireland)"
    echo "5. eu-central-1 (Frankfurt)"
    echo "6. ap-southeast-1 (Singapore)"
    echo "7. ap-northeast-1 (Tokyo)"
    echo "8. Enter custom region"
    echo ""
    read -p "Select region (1-8): " region_choice

    case $region_choice in
        1) REGION="us-east-1" ;;
        2) REGION="us-east-2" ;;
        3) REGION="us-west-2" ;;
        4) REGION="eu-west-1" ;;
        5) REGION="eu-central-1" ;;
        6) REGION="ap-southeast-1" ;;
        7) REGION="ap-northeast-1" ;;
        8) read -p "Enter region code: " REGION ;;
        *) echo "Invalid choice"; exit 1 ;;
    esac
    
    echo "Selected region: $REGION"
fi

# Prompt for instance type
echo ""
echo "Select instance type (ARM/Graviton):"
echo "1. t4g.micro   (2 vCPU, 1 GB RAM)   - Free tier eligible"
echo "2. t4g.small   (2 vCPU, 2 GB RAM)   - ~\$0.0168/hr"
echo "3. t4g.medium  (2 vCPU, 4 GB RAM)   - ~\$0.0336/hr"
echo "4. t4g.large   (2 vCPU, 8 GB RAM)   - ~\$0.0672/hr"
echo "5. t4g.xlarge  (4 vCPU, 16 GB RAM)  - ~\$0.1344/hr"
echo "6. t4g.2xlarge (8 vCPU, 32 GB RAM)  - ~\$0.2688/hr"
echo "7. Enter custom ARM instance type"
echo ""
read -p "Select instance type (1-7): " instance_choice

case $instance_choice in
    1) INSTANCE_TYPE="t4g.micro" ;;
    2) INSTANCE_TYPE="t4g.small" ;;
    3) INSTANCE_TYPE="t4g.medium" ;;
    4) INSTANCE_TYPE="t4g.large" ;;
    5) INSTANCE_TYPE="t4g.xlarge" ;;
    6) INSTANCE_TYPE="t4g.2xlarge" ;;
    7) read -p "Enter instance type (e.g., c7g.medium): " INSTANCE_TYPE ;;
    *) echo "Invalid choice"; exit 1 ;;
esac

echo "Selected instance type: $INSTANCE_TYPE"
echo ""
echo "Fetching latest Ubuntu 24.04 ARM64 AMI..."

AMI_ID=$(get_ubuntu_ami "$REGION")

if [ -z "$AMI_ID" ]; then
    echo "Error: Could not find Ubuntu ARM64 AMI in region $REGION"
    exit 1
fi

echo "Found AMI: $AMI_ID"
echo ""

# List existing key pairs
echo "Fetching existing key pairs in $REGION..."
EXISTING_KEYS=$(aws ec2 describe-key-pairs \
    --region "$REGION" \
    --query 'KeyPairs[*].KeyName' \
    --output text 2>/dev/null || echo "")

if [ -n "$EXISTING_KEYS" ]; then
    echo "Existing key pairs in $REGION:"
    # Convert to array for number selection
    mapfile -t KEY_ARRAY < <(echo "$EXISTING_KEYS" | tr '\t' '\n')
    for i in "${!KEY_ARRAY[@]}"; do
        echo "$((i+1)). ${KEY_ARRAY[$i]}"
    done
    echo ""
else
    echo "No existing key pairs found in $REGION"
    echo ""
fi

# Prompt for key pair
read -p "Enter key pair name, number to select from list, or press Enter to create new: " KEY_INPUT
CREATE_NEW_KEY=false

if [ -z "$KEY_INPUT" ]; then
    KEY_NAME="arm-key-$(date +%s)"
    CREATE_NEW_KEY=true
    echo "Will create new key pair: $KEY_NAME"
elif [[ "$KEY_INPUT" =~ ^[0-9]+$ ]] && [ -n "$EXISTING_KEYS" ]; then
    # User entered a number
    if [ "$KEY_INPUT" -ge 1 ] && [ "$KEY_INPUT" -le "${#KEY_ARRAY[@]}" ]; then
        KEY_NAME="${KEY_ARRAY[$((KEY_INPUT-1))]}"
        echo "Will use existing key pair: $KEY_NAME"
    else
        echo "Invalid selection number"
        exit 1
    fi
else
    KEY_NAME="$KEY_INPUT"
    echo "Will use existing key pair: $KEY_NAME"
fi

echo ""

# List existing security groups
echo "Fetching existing security groups in $REGION..."
EXISTING_SGS=$(aws ec2 describe-security-groups \
    --region "$REGION" \
    --query 'SecurityGroups[*].[GroupId,GroupName,Description]' \
    --output text 2>/dev/null || echo "")

if [ -n "$EXISTING_SGS" ]; then
    echo "Existing security groups in $REGION:"
    # Convert to arrays for number selection
    mapfile -t SG_LINES < <(echo "$EXISTING_SGS")
    declare -a SG_ID_ARRAY
    declare -a SG_NAME_ARRAY
    declare -a SG_DESC_ARRAY

    for line in "${SG_LINES[@]}"; do
        read -r sg_id sg_name sg_desc <<< "$line"
        SG_ID_ARRAY+=("$sg_id")
        SG_NAME_ARRAY+=("$sg_name")
        SG_DESC_ARRAY+=("$sg_desc")
    done

    for i in "${!SG_ID_ARRAY[@]}"; do
        echo "$((i+1)). ${SG_ID_ARRAY[$i]} - ${SG_NAME_ARRAY[$i]} (${SG_DESC_ARRAY[$i]})"
    done
    echo ""
else
    echo "No existing security groups found in $REGION"
    echo ""
fi

# Prompt for security group
read -p "Enter security group ID, number to select from list, or press Enter to create new: " SG_INPUT
CREATE_NEW_SG=false

if [ -z "$SG_INPUT" ]; then
    SG_NAME="arm-sg-$(date +%s)"
    CREATE_NEW_SG=true
    echo "Will create new security group: $SG_NAME"
    echo "  - Port 22 (SSH) - open to 0.0.0.0/0"
    echo "  - Port 80 (HTTP) - open to 0.0.0.0/0"
    echo "  - Port 443 (HTTPS) - open to 0.0.0.0/0"
elif [[ "$SG_INPUT" =~ ^[0-9]+$ ]] && [ -n "$EXISTING_SGS" ]; then
    # User entered a number
    if [ "$SG_INPUT" -ge 1 ] && [ "$SG_INPUT" -le "${#SG_ID_ARRAY[@]}" ]; then
        SG_ID="${SG_ID_ARRAY[$((SG_INPUT-1))]}"
        echo "Will use existing security group: $SG_ID (${SG_NAME_ARRAY[$((SG_INPUT-1))]})"
        echo "Note: Ensure ports 22, 80, and 443 are open if needed"
    else
        echo "Invalid selection number"
        exit 1
    fi
else
    SG_ID="$SG_INPUT"
    echo "Will use existing security group: $SG_ID"
    echo "Note: Ensure ports 22, 80, and 443 are open if needed"
fi

echo ""

# Prompt for Elastic IP
read -p "Allocate and assign an Elastic IP? (y/n, default: n): " ALLOCATE_EIP
ALLOCATE_EIP=${ALLOCATE_EIP:-n}

echo ""
read -p "Enter instance name tag (default: ubuntu-arm-instance): " INSTANCE_NAME
INSTANCE_NAME=${INSTANCE_NAME:-ubuntu-arm-instance}

echo ""
echo "=== Launch Configuration ==="
echo "Region: $REGION"
echo "Instance Type: $INSTANCE_TYPE"
echo "AMI: $AMI_ID (Ubuntu 24.04 ARM64)"
echo "Key Pair: $KEY_NAME $([ "$CREATE_NEW_KEY" == true ] && echo '(will be created)')"
echo "Security Group: $([ "$CREATE_NEW_SG" == true ] && echo "$SG_NAME (will be created)" || echo "$SG_ID")"
echo "Name: $INSTANCE_NAME"
echo "Elastic IP: $([ "$ALLOCATE_EIP" == "y" ] && echo 'Yes' || echo 'No')"
echo ""
read -p "Launch instance? (y/n, default: y): " CONFIRM
CONFIRM=${CONFIRM:-y}

if [ "$CONFIRM" != "y" ]; then
    echo "Launch cancelled"
    exit 0
fi

echo ""
echo "Starting launch process..."

# Create key pair if needed
if [ "$CREATE_NEW_KEY" == true ]; then
    echo ""
    echo "Creating key pair: $KEY_NAME"
    aws ec2 create-key-pair \
        --region "$REGION" \
        --key-name "$KEY_NAME" \
        --query 'KeyMaterial' \
        --output text > "${KEY_NAME}.pem"
    chmod 400 "${KEY_NAME}.pem"
    echo "  ✓ Key saved to: ${KEY_NAME}.pem"
    echo "  ⚠️  IMPORTANT: Download this key file from CloudShell if you need it elsewhere!"
fi

# Create security group if needed
if [ "$CREATE_NEW_SG" == true ]; then
    echo ""
    echo "Creating security group: $SG_NAME"
    
    # Get default VPC
    VPC_ID=$(aws ec2 describe-vpcs \
        --region "$REGION" \
        --filters "Name=isDefault,Values=true" \
        --query 'Vpcs[0].VpcId' \
        --output text)
    
    if [ -z "$VPC_ID" ] || [ "$VPC_ID" == "None" ]; then
        echo "Error: No default VPC found. Please specify a security group ID."
        exit 1
    fi
    
    SG_ID=$(aws ec2 create-security-group \
        --region "$REGION" \
        --group-name "$SG_NAME" \
        --description "Security group for ARM instance with web access" \
        --vpc-id "$VPC_ID" \
        --query 'GroupId' \
        --output text)
    
    echo "  ✓ Created security group: $SG_ID"
    echo "  Adding security rules..."
    
    # Add SSH rule
    aws ec2 authorize-security-group-ingress \
        --region "$REGION" \
        --group-id "$SG_ID" \
        --ip-permissions \
        IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges="[{CidrIp=0.0.0.0/0,Description='SSH'}]" \
        > /dev/null
    
    # Add HTTP rule
    aws ec2 authorize-security-group-ingress \
        --region "$REGION" \
        --group-id "$SG_ID" \
        --ip-permissions \
        IpProtocol=tcp,FromPort=80,ToPort=80,IpRanges="[{CidrIp=0.0.0.0/0,Description='HTTP'}]" \
        > /dev/null
    
    # Add HTTPS rule
    aws ec2 authorize-security-group-ingress \
        --region "$REGION" \
        --group-id "$SG_ID" \
        --ip-permissions \
        IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges="[{CidrIp=0.0.0.0/0,Description='HTTPS'}]" \
        > /dev/null
    
    echo "  ✓ Port 22 (SSH) configured"
    echo "  ✓ Port 80 (HTTP) configured"
    echo "  ✓ Port 443 (HTTPS) configured"
fi

echo ""
echo "Launching instance..."

INSTANCE_ID=$(aws ec2 run-instances \
    --region "$REGION" \
    --image-id "$AMI_ID" \
    --instance-type "$INSTANCE_TYPE" \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SG_ID" \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$INSTANCE_NAME}]" \
    --query 'Instances[0].InstanceId' \
    --output text)

echo "  ✓ Instance launched: $INSTANCE_ID"
echo "  Waiting for instance to be running..."

aws ec2 wait instance-running \
    --region "$REGION" \
    --instance-ids "$INSTANCE_ID"

echo "  ✓ Instance is running!"

# Handle Elastic IP
if [ "$ALLOCATE_EIP" == "y" ]; then
    echo ""
    echo "Allocating Elastic IP..."
    
    ALLOCATION_OUTPUT=$(aws ec2 allocate-address \
        --region "$REGION" \
        --domain vpc \
        --tag-specifications "ResourceType=elastic-ip,Tags=[{Key=Name,Value=$INSTANCE_NAME-eip}]")
    
    ALLOCATION_ID=$(echo "$ALLOCATION_OUTPUT" | grep -o '"AllocationId": "[^"]*' | cut -d'"' -f4)
    ELASTIC_IP=$(echo "$ALLOCATION_OUTPUT" | grep -o '"PublicIp": "[^"]*' | cut -d'"' -f4)
    
    echo "  ✓ Elastic IP allocated: $ELASTIC_IP"
    echo "  Associating Elastic IP with instance..."
    
    ASSOCIATION_ID=$(aws ec2 associate-address \
        --region "$REGION" \
        --instance-id "$INSTANCE_ID" \
        --allocation-id "$ALLOCATION_ID" \
        --query 'AssociationId' \
        --output text)
    
    echo "  ✓ Elastic IP associated"
    PUBLIC_IP=$ELASTIC_IP
else
    PUBLIC_IP=$(aws ec2 describe-instances \
        --region "$REGION" \
        --instance-ids "$INSTANCE_ID" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text)
fi

echo ""
echo "=========================================="
echo "=== Instance Ready ==="
echo "=========================================="
echo "Instance ID: $INSTANCE_ID"
echo "Instance Type: $INSTANCE_TYPE"
echo "Public IP: $PUBLIC_IP"
if [ "$ALLOCATE_EIP" == "y" ]; then
    echo "Elastic IP: Yes (IP will persist after stop/start)"
    echo "Allocation ID: $ALLOCATION_ID"
else
    echo "Elastic IP: No (IP will change if instance is stopped)"
fi
echo "Region: $REGION"
echo "Security: SSH (22), HTTP (80), HTTPS (443) open"
echo ""
echo "Connect with:"
echo "  ssh -i ${KEY_NAME}.pem ubuntu@${PUBLIC_IP}"
echo ""
echo "Test web access:"
echo "  curl http://${PUBLIC_IP}"
echo ""
echo "โฑ๏ธ  Wait 30-60 seconds for SSH to become available"

if [ "$ALLOCATE_EIP" == "y" ]; then
    echo ""
    echo "=========================================="
    echo "โš ๏ธ  ELASTIC IP WARNING"
    echo "=========================================="
    echo "Elastic IPs cost \$0.005/hour when NOT"
    echo "associated with a running instance!"
    echo ""
    echo "To avoid charges, release the EIP if you"
    echo "delete the instance:"
    echo ""
    echo "aws ec2 release-address \\"
    echo "  --region $REGION \\"
    echo "  --allocation-id $ALLOCATION_ID"
fi

echo ""
echo "=========================================="

Run it:

chmod +x launch-graviton-wp.sh
./launch-graviton-wp.sh

Part 2: Install WordPress Stack

SSH into your new instance and save this as setup-wordpress.sh:

#!/bin/bash

# WordPress Installation Script for Ubuntu 24.04 ARM64
# Installs Apache, MySQL, PHP, and WordPress with automatic configuration

set -e

echo "=== WordPress Installation Script (Apache) ==="
echo "This script will install and configure:"
echo "  - Apache web server"
echo "  - MySQL database"
echo "  - PHP 8.3"
echo "  - WordPress (latest version)"
echo ""

# Check if running as root
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root (use: sudo bash $0)"
    exit 1
fi

# Get configuration from user
echo "=== WordPress Configuration ==="
read -p "Enter your domain name (or press Enter to use server IP): " DOMAIN_NAME
read -p "Enter WordPress site title (default: My WordPress Site): " SITE_TITLE
SITE_TITLE=${SITE_TITLE:-My WordPress Site}
read -p "Enter WordPress admin username (default: admin): " WP_ADMIN_USER
WP_ADMIN_USER=${WP_ADMIN_USER:-admin}
read -sp "Enter WordPress admin password (or press Enter to generate): " WP_ADMIN_PASS
echo ""
if [ -z "$WP_ADMIN_PASS" ]; then
    WP_ADMIN_PASS=$(openssl rand -base64 16)
    echo "Generated password: $WP_ADMIN_PASS"
fi
read -p "Enter WordPress admin email (default: [email protected]): " WP_ADMIN_EMAIL
WP_ADMIN_EMAIL=${WP_ADMIN_EMAIL:-[email protected]}

# Generate database credentials
DB_NAME="wordpress"
DB_USER="wpuser"
DB_PASS=$(openssl rand -base64 16)
DB_ROOT_PASS=$(openssl rand -base64 16)

echo ""
echo "=== Installation Summary ==="
echo "Domain: ${DOMAIN_NAME:-Server IP}"
echo "Site Title: $SITE_TITLE"
echo "Admin User: $WP_ADMIN_USER"
echo "Admin Email: $WP_ADMIN_EMAIL"
echo "Database: $DB_NAME"
echo ""
read -p "Proceed with installation? (y/n, default: y): " CONFIRM
CONFIRM=${CONFIRM:-y}

if [ "$CONFIRM" != "y" ]; then
    echo "Installation cancelled"
    exit 0
fi

echo ""
echo "Starting installation..."

# Update system
echo ""
echo "[1/8] Updating system packages..."
apt-get update -qq
apt-get upgrade -y -qq

# Install Apache
echo ""
echo "[2/8] Installing Apache..."
apt-get install -y -qq apache2

# Enable Apache modules
echo "Enabling Apache modules..."
a2enmod rewrite
a2enmod ssl
a2enmod headers

# Check if MySQL is already installed
MYSQL_INSTALLED=false
if systemctl is-active --quiet mysql || systemctl is-active --quiet mysqld; then
    MYSQL_INSTALLED=true
    echo ""
    echo "MySQL is already installed and running."
elif command -v mysql &> /dev/null; then
    MYSQL_INSTALLED=true
    echo ""
    echo "MySQL is already installed."
fi

if [ "$MYSQL_INSTALLED" = true ]; then
    echo ""
    echo "[3/8] Using existing MySQL installation..."
    read -sp "Enter MySQL root password (or press Enter to try without password): " EXISTING_ROOT_PASS
    echo ""

    MYSQL_CONNECTION_OK=false

    # Test the password
    if [ -n "$EXISTING_ROOT_PASS" ]; then
        if mysql -u root -p"${EXISTING_ROOT_PASS}" -e "SELECT 1;" &> /dev/null; then
            echo "Successfully connected to MySQL."
            DB_ROOT_PASS="$EXISTING_ROOT_PASS"
            MYSQL_CONNECTION_OK=true
        else
            echo "Error: Could not connect to MySQL with provided password."
        fi
    fi

    # Try without password if previous attempt failed or no password was provided
    if [ "$MYSQL_CONNECTION_OK" = false ]; then
        echo "Trying to connect without password..."
        if mysql -u root -e "SELECT 1;" &> /dev/null; then
            echo "Connected without password. Will set a password now."
            DB_ROOT_PASS=$(openssl rand -base64 16)
            mysql -u root -e "ALTER USER 'root'@'localhost' IDENTIFIED BY '${DB_ROOT_PASS}';"  # default auth plugin (mysql_native_password is disabled in MySQL 8.4)
            echo "New root password set: $DB_ROOT_PASS"
            MYSQL_CONNECTION_OK=true
        fi
    fi

    # If still cannot connect, offer to reinstall
    if [ "$MYSQL_CONNECTION_OK" = false ]; then
        echo ""
        echo "ERROR: Cannot connect to MySQL with any method."
        echo "This usually means MySQL is in an inconsistent state."
        echo ""
        read -p "Remove and reinstall MySQL? (y/n, default: y): " REINSTALL_MYSQL
        REINSTALL_MYSQL=${REINSTALL_MYSQL:-y}

        if [ "$REINSTALL_MYSQL" = "y" ]; then
            echo ""
            echo "Removing MySQL..."
            systemctl stop mysql 2>/dev/null || systemctl stop mysqld 2>/dev/null || true
            apt-get remove --purge -y mysql-server mysql-client mysql-common mysql-server-core-* mysql-client-core-* -qq
            apt-get autoremove -y -qq
            apt-get autoclean -qq
            rm -rf /etc/mysql /var/lib/mysql /var/log/mysql

            echo "Reinstalling MySQL..."
            export DEBIAN_FRONTEND=noninteractive
            apt-get update -qq
            apt-get install -y -qq mysql-server

            # Generate new root password
            DB_ROOT_PASS=$(openssl rand -base64 16)

            # Set root password and secure installation
            mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED BY '${DB_ROOT_PASS}';"  # default auth plugin (mysql_native_password is disabled in MySQL 8.4)
            mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='';"
            mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');"
            mysql -u root -p"${DB_ROOT_PASS}" -e "DROP DATABASE IF EXISTS test;"
            mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';"
            mysql -u root -p"${DB_ROOT_PASS}" -e "FLUSH PRIVILEGES;"

            echo "MySQL reinstalled successfully."
            echo "New root password: $DB_ROOT_PASS"
        else
            echo "Installation cancelled."
            exit 1
        fi
    fi
else
    # Install MySQL
    echo ""
    echo "[3/8] Installing MySQL..."
    export DEBIAN_FRONTEND=noninteractive
    apt-get install -y -qq mysql-server

    # Secure MySQL installation
    echo ""
    echo "[4/8] Configuring MySQL..."
    mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY '${DB_ROOT_PASS}';"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='';"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DROP DATABASE IF EXISTS test;"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';"
    mysql -u root -p"${DB_ROOT_PASS}" -e "FLUSH PRIVILEGES;"
fi

# Check if WordPress database already exists
echo ""
echo "[4/8] Setting up WordPress database..."

# Create MySQL defaults file for safer password handling
MYSQL_CNF=$(mktemp)
cat > "$MYSQL_CNF" <<EOF
[client]
user=root
password=${DB_ROOT_PASS}
EOF
chmod 600 "$MYSQL_CNF"

# Test MySQL connection first
echo "Testing MySQL connection..."
if ! mysql --defaults-extra-file="$MYSQL_CNF" -e "SELECT 1;" &> /dev/null; then
    echo "ERROR: Cannot connect to MySQL to create database."
    rm -f "$MYSQL_CNF"
    exit 1
fi

echo "MySQL connection successful."

# Check if database exists
echo "Checking for existing database '${DB_NAME}'..."
DB_EXISTS=$(mysql --defaults-extra-file="$MYSQL_CNF" -e "SHOW DATABASES LIKE '${DB_NAME}';" 2>/dev/null | grep -c "${DB_NAME}" || true)

if [ "$DB_EXISTS" -gt 0 ]; then
    echo ""
    echo "WARNING: Database '${DB_NAME}' already exists!"
    read -p "Delete existing database and create fresh? (y/n, default: n): " DELETE_DB
    DELETE_DB=${DELETE_DB:-n}

    if [ "$DELETE_DB" = "y" ]; then
        echo "Dropping existing database..."
        mysql --defaults-extra-file="$MYSQL_CNF" -e "DROP DATABASE ${DB_NAME};"
        echo "Creating fresh WordPress database..."
        mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
CREATE DATABASE ${DB_NAME} DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
EOF
    else
        echo "Using existing database '${DB_NAME}'."
    fi
else
    echo "Creating WordPress database..."
    mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
CREATE DATABASE ${DB_NAME} DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
EOF
    echo "Database created successfully."
fi

# Check if WordPress user already exists
echo "Checking for existing database user '${DB_USER}'..."
USER_EXISTS=$(mysql --defaults-extra-file="$MYSQL_CNF" -e "SELECT User FROM mysql.user WHERE User='${DB_USER}';" 2>/dev/null | grep -c "${DB_USER}" || true)

if [ "$USER_EXISTS" -gt 0 ]; then
    echo "Database user '${DB_USER}' already exists. Updating password and permissions..."
    mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
ALTER USER '${DB_USER}'@'localhost' IDENTIFIED BY '${DB_PASS}';
GRANT ALL PRIVILEGES ON ${DB_NAME}.* TO '${DB_USER}'@'localhost';
FLUSH PRIVILEGES;
EOF
    echo "User updated successfully."
else
    echo "Creating WordPress database user..."
    mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
CREATE USER '${DB_USER}'@'localhost' IDENTIFIED BY '${DB_PASS}';
GRANT ALL PRIVILEGES ON ${DB_NAME}.* TO '${DB_USER}'@'localhost';
FLUSH PRIVILEGES;
EOF
    echo "User created successfully."
fi

echo "Database setup complete."
rm -f "$MYSQL_CNF"

# Install PHP
echo ""
echo "[5/8] Installing PHP and extensions..."
apt-get install -y -qq php8.3 php8.3-mysql php8.3-curl php8.3-gd php8.3-mbstring \
    php8.3-xml php8.3-xmlrpc php8.3-soap php8.3-intl php8.3-zip libapache2-mod-php8.3 php8.3-imagick

# Configure PHP
echo "Configuring PHP..."
sed -i 's/upload_max_filesize = .*/upload_max_filesize = 64M/' /etc/php/8.3/apache2/php.ini
sed -i 's/post_max_size = .*/post_max_size = 64M/' /etc/php/8.3/apache2/php.ini
sed -i 's/max_execution_time = .*/max_execution_time = 300/' /etc/php/8.3/apache2/php.ini

# Check if WordPress directory already exists
if [ -d "/var/www/html/wordpress" ]; then
    echo ""
    echo "WARNING: WordPress directory /var/www/html/wordpress already exists!"
    read -p "Delete existing WordPress installation? (y/n, default: n): " DELETE_WP
    DELETE_WP=${DELETE_WP:-n}

    if [ "$DELETE_WP" = "y" ]; then
        echo "Removing existing WordPress directory..."
        rm -rf /var/www/html/wordpress
    fi
fi

# Download WordPress (skipped if an existing installation was kept above;
# otherwise mv would nest a second copy inside the old directory)
echo ""
echo "[6/8] Downloading WordPress..."
if [ ! -d /var/www/html/wordpress ]; then
    cd /tmp
    wget -q https://wordpress.org/latest.tar.gz
    tar -xzf latest.tar.gz
    mv wordpress /var/www/html/
    rm -f latest.tar.gz
else
    echo "Keeping existing WordPress installation."
fi
chown -R www-data:www-data /var/www/html/wordpress

# Configure WordPress
echo ""
echo "[7/8] Configuring WordPress..."
cd /var/www/html/wordpress

# Generate WordPress salts
SALTS=$(curl -s https://api.wordpress.org/secret-key/1.1/salt/)

# Create wp-config.php
cat > wp-config.php <<EOF
<?php
define( 'DB_NAME', '${DB_NAME}' );
define( 'DB_USER', '${DB_USER}' );
define( 'DB_PASSWORD', '${DB_PASS}' );
define( 'DB_HOST', 'localhost' );
define( 'DB_CHARSET', 'utf8mb4' );
define( 'DB_COLLATE', '' );

${SALTS}

\$table_prefix = 'wp_';

define( 'WP_DEBUG', false );

if ( ! defined( 'ABSPATH' ) ) {
    define( 'ABSPATH', __DIR__ . '/' );
}

require_once ABSPATH . 'wp-settings.php';
EOF

chown www-data:www-data wp-config.php
chmod 640 wp-config.php

# Configure Apache
echo ""
echo "[8/8] Configuring Apache..."

# Determine server name
if [ -z "$DOMAIN_NAME" ]; then
    # Try to get EC2 public IP first (request an IMDSv2 token so this also
    # works on instances where IMDSv1 is disabled, the default on new launches)
    IMDS_TOKEN=$(curl -s -X PUT --connect-timeout 5 "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300" 2>/dev/null)
    SERVER_NAME=$(curl -s --connect-timeout 5 -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}" http://169.254.169.254/latest/meta-data/public-ipv4 2>/dev/null)

    # If we got a valid public IP, use it
    if [ -n "$SERVER_NAME" ] && [[ ! "$SERVER_NAME" =~ ^172\. ]] && [[ ! "$SERVER_NAME" =~ ^10\. ]] && [[ ! "$SERVER_NAME" =~ ^192\.168\. ]]; then
        echo "Detected EC2 public IP: $SERVER_NAME"
    else
        # Fallback: try to get public IP from external service
        echo "EC2 metadata not available, trying external service..."
        SERVER_NAME=$(curl -s --connect-timeout 5 https://api.ipify.org 2>/dev/null || curl -s --connect-timeout 5 https://icanhazip.com 2>/dev/null)

        if [ -n "$SERVER_NAME" ]; then
            echo "Detected public IP from external service: $SERVER_NAME"
        else
            # Last resort: use local IP (but warn user)
            SERVER_NAME=$(hostname -I | awk '{print $1}')
            echo "WARNING: Using local IP address: $SERVER_NAME"
            echo "This is a private IP and won't be accessible from the internet."
            echo "Consider specifying a domain name or public IP."
        fi
    fi
else
    SERVER_NAME="$DOMAIN_NAME"
    echo "Using provided domain: $SERVER_NAME"
fi

# Create Apache virtual host
cat > /etc/apache2/sites-available/wordpress.conf <<EOF
<VirtualHost *:80>
    ServerName ${SERVER_NAME}
    ServerAdmin ${WP_ADMIN_EMAIL}
    DocumentRoot /var/www/html/wordpress

    <Directory /var/www/html/wordpress>
        Options FollowSymLinks
        AllowOverride All
        Require all granted
    </Directory>

    ErrorLog \${APACHE_LOG_DIR}/wordpress-error.log
    CustomLog \${APACHE_LOG_DIR}/wordpress-access.log combined
</VirtualHost>
EOF

# Enable WordPress site
echo "Enabling WordPress site..."
a2ensite wordpress.conf

# Disable default site if it exists
if [ -f /etc/apache2/sites-enabled/000-default.conf ]; then
    echo "Disabling default site..."
    a2dissite 000-default.conf
fi

# Test Apache configuration
echo ""
echo "Testing Apache configuration..."
if ! apache2ctl configtest 2>&1 | grep -q "Syntax OK"; then
    echo "ERROR: Apache configuration test failed!"
    apache2ctl configtest
    exit 1
fi

echo "Apache configuration is valid."

# Restart Apache
echo "Restarting Apache..."
systemctl restart apache2

# Enable services to start on boot
systemctl enable apache2
systemctl enable mysql

# Install WP-CLI for command line WordPress management
echo ""
echo "Installing WP-CLI..."
wget -q https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar -O /usr/local/bin/wp
chmod +x /usr/local/bin/wp

# Complete WordPress installation via WP-CLI
echo ""
echo "Completing WordPress installation..."
cd /var/www/html/wordpress

# Determine WordPress URL
# The Apache vhost only listens on port 80, so use http for now; switch the
# site URL to https after configuring SSL with Let's Encrypt
# If SERVER_NAME looks like a private IP, try to get public IP
if [[ "$SERVER_NAME" =~ ^172\. ]] || [[ "$SERVER_NAME" =~ ^10\. ]] || [[ "$SERVER_NAME" =~ ^192\.168\. ]]; then
    PUBLIC_IP=$(curl -s --connect-timeout 5 http://169.254.169.254/latest/meta-data/public-ipv4 2>/dev/null || curl -s --connect-timeout 5 https://api.ipify.org 2>/dev/null)
    if [ -n "$PUBLIC_IP" ]; then
        WP_URL="http://${PUBLIC_IP}"
        echo "Using public IP for WordPress URL: $PUBLIC_IP"
    else
        WP_URL="http://${SERVER_NAME}"
        echo "WARNING: Could not determine public IP, using private IP: $SERVER_NAME"
    fi
else
    WP_URL="http://${SERVER_NAME}"
fi

echo "WordPress URL will be: $WP_URL"

# Check if WordPress is already installed
if sudo -u www-data wp core is-installed 2>/dev/null; then
    echo ""
    echo "WARNING: WordPress is already installed!"
    read -p "Continue with fresh installation? (y/n, default: n): " REINSTALL_WP
    REINSTALL_WP=${REINSTALL_WP:-n}

    if [ "$REINSTALL_WP" = "y" ]; then
        echo "Reinstalling WordPress..."
        sudo -u www-data wp db reset --yes
        sudo -u www-data wp core install \
            --url="$WP_URL" \
            --title="${SITE_TITLE}" \
            --admin_user="${WP_ADMIN_USER}" \
            --admin_password="${WP_ADMIN_PASS}" \
            --admin_email="${WP_ADMIN_EMAIL}" \
            --skip-email
    fi
else
    sudo -u www-data wp core install \
        --url="$WP_URL" \
        --title="${SITE_TITLE}" \
        --admin_user="${WP_ADMIN_USER}" \
        --admin_password="${WP_ADMIN_PASS}" \
        --admin_email="${WP_ADMIN_EMAIL}" \
        --skip-email
fi

echo ""
echo "=========================================="
echo "=== WordPress Installation Complete! ==="
echo "=========================================="
echo ""
echo "Website URL: $WP_URL"
echo "Admin URL: $WP_URL/wp-admin"
echo ""
echo "WordPress Admin Credentials:"
echo "  Username: $WP_ADMIN_USER"
echo "  Password: $WP_ADMIN_PASS"
echo "  Email: $WP_ADMIN_EMAIL"
echo ""
echo "Database Credentials:"
echo "  Database: $DB_NAME"
echo "  User: $DB_USER"
echo "  Password: $DB_PASS"
echo ""
echo "MySQL Root Password: $DB_ROOT_PASS"
echo ""
echo "IMPORTANT: Save these credentials securely!"
echo ""

# Save credentials to file
CREDS_FILE="/root/wordpress-credentials.txt"
cat > "$CREDS_FILE" <<EOF
WordPress Installation Credentials
===================================
Date: $(date)

Website URL: $WP_URL
Admin URL: $WP_URL/wp-admin

WordPress Admin:
  Username: $WP_ADMIN_USER
  Password: $WP_ADMIN_PASS
  Email: $WP_ADMIN_EMAIL

Database:
  Name: $DB_NAME
  User: $DB_USER
  Password: $DB_PASS

MySQL Root Password: $DB_ROOT_PASS

WP-CLI installed at: /usr/local/bin/wp
Usage: sudo -u www-data wp <command>

Apache Configuration: /etc/apache2/sites-available/wordpress.conf
EOF

chmod 600 "$CREDS_FILE"

echo "Credentials saved to: $CREDS_FILE"
echo ""
echo "Next steps:"
echo "1. Visit $WP_URL/wp-admin to access your site"
echo "2. Consider setting up SSL/HTTPS with Let's Encrypt"
echo "3. Install a caching plugin for better performance"
echo "4. Configure regular backups"
echo ""

if [ -n "$DOMAIN_NAME" ]; then
    echo "To set up SSL with Let's Encrypt:"
    echo "  apt-get install -y certbot python3-certbot-apache"
    echo "  certbot --apache -d ${DOMAIN_NAME}"
    echo ""
fi

echo "To manage WordPress from command line:"
echo "  cd /var/www/html/wordpress"
echo "  sudo -u www-data wp plugin list"
echo "  sudo -u www-data wp theme list"
echo ""
echo "Apache logs:"
echo "  Error log: /var/log/apache2/wordpress-error.log"
echo "  Access log: /var/log/apache2/wordpress-access.log"
echo ""
echo "=========================================="

Run it:

chmod +x setup-wordpress.sh
sudo ./setup-wordpress.sh

Part 3: Migrate Your Existing Site

If you're migrating from an existing WordPress installation, follow these steps.

What gets migrated:

  • All posts, pages, and media
  • All users and their roles
  • All plugins (files + database settings)
  • All themes (including customisations)
  • All plugin/theme configurations (stored in wp_options table)
  • Widgets, menus, and customizer settings
  • WooCommerce products, orders, customers (if applicable)
  • All custom database tables created by plugins

Step 3a: Export from Old Server

Run this on your existing WordPress server. Save as wp-export.sh:

#!/bin/bash
set -euo pipefail

# Configuration
WP_PATH="/var/www/html"           # Adjust to your WordPress path
EXPORT_DIR="/tmp/wp-migration"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Detect WordPress path if not set correctly (globs must be unquoted to expand)
if [ ! -f "${WP_PATH}/wp-config.php" ]; then
    for path in /var/www/wordpress /var/www/html/wordpress /home/*/public_html /var/www/*/public_html; do
        if [ -f "${path}/wp-config.php" ]; then
            WP_PATH="$path"
            break
        fi
    done
fi

if [ ! -f "${WP_PATH}/wp-config.php" ]; then
    echo "ERROR: wp-config.php not found. Please set WP_PATH correctly."
    exit 1
fi

echo "==> WordPress found at: ${WP_PATH}"

# Extract database credentials from wp-config.php
DB_NAME=$(grep "DB_NAME" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_USER=$(grep "DB_USER" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_PASS=$(grep "DB_PASSWORD" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_HOST=$(grep "DB_HOST" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)

echo "==> Database: ${DB_NAME}"

# Create export directory
mkdir -p "${EXPORT_DIR}"
cd "${EXPORT_DIR}"

echo "==> Exporting database..."
mysqldump -h "${DB_HOST}" -u "${DB_USER}" -p"${DB_PASS}" \
    --single-transaction \
    --quick \
    --lock-tables=false \
    --routines \
    --triggers \
    "${DB_NAME}" > database.sql

DB_SIZE=$(ls -lh database.sql | awk '{print $5}')
echo "    Database exported: ${DB_SIZE}"

echo "==> Exporting wp-content..."
tar czf wp-content.tar.gz -C "${WP_PATH}" wp-content

CONTENT_SIZE=$(ls -lh wp-content.tar.gz | awk '{print $5}')
echo "    wp-content exported: ${CONTENT_SIZE}"

echo "==> Exporting wp-config.php..."
cp "${WP_PATH}/wp-config.php" wp-config.php.bak

echo "==> Creating migration package..."
tar czf "wordpress-migration-${TIMESTAMP}.tar.gz" \
    database.sql \
    wp-content.tar.gz \
    wp-config.php.bak

rm -f database.sql wp-content.tar.gz wp-config.php.bak

PACKAGE_SIZE=$(ls -lh "wordpress-migration-${TIMESTAMP}.tar.gz" | awk '{print $5}')

echo ""
echo "============================================"
echo "Export complete!"
echo ""
echo "Package: ${EXPORT_DIR}/wordpress-migration-${TIMESTAMP}.tar.gz"
echo "Size:    ${PACKAGE_SIZE}"
echo ""
echo "Transfer to new server with:"
echo "  scp ${EXPORT_DIR}/wordpress-migration-${TIMESTAMP}.tar.gz ec2-user@NEW_IP:/tmp/"
echo "============================================"

Step 3b: Transfer the Export

scp /tmp/wp-migration/wordpress-migration-*.tar.gz ec2-user@YOUR_NEW_IP:/tmp/
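Large archives occasionally get truncated in transit, so it is worth comparing sha256 digests on both ends before importing. A sketch of the check; the demo builds a stand-in package in a temp directory so it runs anywhere, but on the real servers you would run sha256sum against the actual wordpress-migration-*.tar.gz on each side:

```shell
#!/bin/bash
# Sketch: verify the migration package survived the transfer intact.
# The archive here is a stand-in; point the paths at the real package.
set -euo pipefail

WORK=$(mktemp -d)
trap 'rm -rf "$WORK"' EXIT

# Stand-in for the package produced by wp-export.sh
echo "payload" > "$WORK/site.txt"
tar czf "$WORK/wordpress-migration-demo.tar.gz" -C "$WORK" site.txt

# On the old server: record the digest alongside the package
sha256sum "$WORK/wordpress-migration-demo.tar.gz" | awk '{print $1}' > "$WORK/package.sha256"

# On the new server (after scp of both files): recompute and compare
LOCAL_SUM=$(sha256sum "$WORK/wordpress-migration-demo.tar.gz" | awk '{print $1}')
if [ "$LOCAL_SUM" = "$(cat "$WORK/package.sha256")" ]; then
    echo "Checksum OK - safe to import"
else
    echo "Checksum MISMATCH - re-transfer the package" >&2
    exit 1
fi
```

Transfer the .sha256 file with the same scp command and the comparison takes seconds either way.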

Step 3c: Import on New Server

Run this on your new Graviton instance. Save as wp-import.sh:

#!/bin/bash
set -euo pipefail

# Configuration - EDIT THESE
MIGRATION_FILE="${1:-/tmp/wordpress-migration-*.tar.gz}"
OLD_DOMAIN="oldsite.com"          # Your old domain
NEW_DOMAIN="newsite.com"          # Your new domain (can be same)
WP_PATH="/var/www/wordpress"

# Resolve migration file path
MIGRATION_FILE=$(ls -1 ${MIGRATION_FILE} 2>/dev/null | head -1)

if [ ! -f "${MIGRATION_FILE}" ]; then
    echo "ERROR: Migration file not found: ${MIGRATION_FILE}"
    echo "Usage: $0 /path/to/wordpress-migration-XXXXXX.tar.gz"
    exit 1
fi

echo "==> Using migration file: ${MIGRATION_FILE}"

# Get database credentials from existing wp-config
if [ ! -f "${WP_PATH}/wp-config.php" ]; then
    echo "ERROR: wp-config.php not found at ${WP_PATH}"
    echo "Please run the WordPress setup script first"
    exit 1
fi

DB_NAME=$(grep "DB_NAME" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_USER=$(grep "DB_USER" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_PASS=$(grep "DB_PASSWORD" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
MYSQL_ROOT_PASS=$(cat /root/.wordpress/credentials | grep "MySQL Root" | awk '{print $4}')

echo "==> Extracting migration package..."
TEMP_DIR=$(mktemp -d)
cd "${TEMP_DIR}"
tar xzf "${MIGRATION_FILE}"

echo "==> Backing up current installation..."
BACKUP_DIR="/var/backups/wordpress/pre-migration-$(date +%Y%m%d_%H%M%S)"
mkdir -p "${BACKUP_DIR}"
cp -r "${WP_PATH}/wp-content" "${BACKUP_DIR}/" 2>/dev/null || true
mysqldump -u root -p"${MYSQL_ROOT_PASS}" "${DB_NAME}" > "${BACKUP_DIR}/database.sql" 2>/dev/null || true

echo "==> Importing database..."
mysql -u root -p"${MYSQL_ROOT_PASS}" << EOF
DROP DATABASE IF EXISTS ${DB_NAME};
CREATE DATABASE ${DB_NAME} CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
GRANT ALL PRIVILEGES ON ${DB_NAME}.* TO '${DB_USER}'@'localhost';
FLUSH PRIVILEGES;
EOF

mysql -u root -p"${MYSQL_ROOT_PASS}" "${DB_NAME}" < database.sql

echo "==> Importing wp-content..."
rm -rf "${WP_PATH}/wp-content"
tar xzf wp-content.tar.gz -C "${WP_PATH}"
chown -R caddy:caddy "${WP_PATH}/wp-content"
find "${WP_PATH}/wp-content" -type d -exec chmod 755 {} \;
find "${WP_PATH}/wp-content" -type f -exec chmod 644 {} \;

echo "==> Updating URLs in database..."
cd "${WP_PATH}"

OLD_URL_HTTP="http://${OLD_DOMAIN}"
OLD_URL_HTTPS="https://${OLD_DOMAIN}"
NEW_URL="https://${NEW_DOMAIN}"

# Install WP-CLI if not present
if ! command -v wp &> /dev/null; then
    curl -sO https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar
    chmod +x wp-cli.phar
    mv wp-cli.phar /usr/local/bin/wp
fi

echo "    Replacing ${OLD_URL_HTTPS} with ${NEW_URL}..."
sudo -u caddy wp search-replace "${OLD_URL_HTTPS}" "${NEW_URL}" --all-tables --precise --skip-columns=guid 2>/dev/null || true

echo "    Replacing ${OLD_URL_HTTP} with ${NEW_URL}..."
sudo -u caddy wp search-replace "${OLD_URL_HTTP}" "${NEW_URL}" --all-tables --precise --skip-columns=guid 2>/dev/null || true

echo "    Replacing //${OLD_DOMAIN} with //${NEW_DOMAIN}..."
sudo -u caddy wp search-replace "//${OLD_DOMAIN}" "//${NEW_DOMAIN}" --all-tables --precise --skip-columns=guid 2>/dev/null || true

echo "==> Flushing caches and rewrite rules..."
sudo -u caddy wp cache flush
sudo -u caddy wp rewrite flush

echo "==> Reactivating plugins..."
# Some plugins may deactivate during migration - reactivate all
sudo -u caddy wp plugin activate --all 2>/dev/null || true

echo "==> Verifying import..."
POST_COUNT=$(sudo -u caddy wp post list --post_type=post --format=count)
PAGE_COUNT=$(sudo -u caddy wp post list --post_type=page --format=count)
USER_COUNT=$(sudo -u caddy wp user list --format=count)
PLUGIN_COUNT=$(sudo -u caddy wp plugin list --format=count)

echo ""
echo "============================================"
echo "Migration complete!"
echo ""
echo "Imported content:"
echo "  - Posts:   ${POST_COUNT}"
echo "  - Pages:   ${PAGE_COUNT}"
echo "  - Users:   ${USER_COUNT}"
echo "  - Plugins: ${PLUGIN_COUNT}"
echo ""
echo "Site URL: https://${NEW_DOMAIN}"
echo ""
echo "Pre-migration backup: ${BACKUP_DIR}"
echo "============================================"

rm -rf "${TEMP_DIR}"

Run it:

chmod +x wp-import.sh
sudo ./wp-import.sh /tmp/wordpress-migration-*.tar.gz
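Before running the import, it can save a failed half-migration to confirm the archive actually holds the three members wp-export.sh creates (database.sql, wp-content.tar.gz, wp-config.php.bak). A hedged pre-flight sketch; the demo constructs a stand-in package so it is runnable anywhere, and you would point PKG at the real archive:

```shell
#!/bin/bash
# Sketch: pre-flight check before wp-import.sh. Verifies the expected
# members are present in the migration archive.
set -euo pipefail

WORK=$(mktemp -d)
trap 'rm -rf "$WORK"' EXIT

# Build a stand-in package with the same three members as wp-export.sh
echo "demo" > "$WORK/database.sql"
tar czf "$WORK/wp-content.tar.gz" -C "$WORK" database.sql
cp "$WORK/database.sql" "$WORK/wp-config.php.bak"
tar czf "$WORK/pkg.tar.gz" -C "$WORK" database.sql wp-content.tar.gz wp-config.php.bak

PKG="$WORK/pkg.tar.gz"    # in practice: the real wordpress-migration-*.tar.gz
MEMBERS=$(tar tzf "$PKG")
for m in database.sql wp-content.tar.gz wp-config.php.bak; do
    echo "$MEMBERS" | grep -qx "$m" || { echo "Missing $m in package" >&2; exit 1; }
done
echo "Package members OK"
```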

Step 3d: Verify Migration

#!/bin/bash
set -euo pipefail

WP_PATH="/var/www/wordpress"
cd "${WP_PATH}"

echo "==> WordPress Verification Report"
echo "=================================="
echo ""

echo "WordPress Version:"
sudo -u caddy wp core version
echo ""

echo "Site URL Configuration:"
sudo -u caddy wp option get siteurl
sudo -u caddy wp option get home
echo ""

echo "Database Status:"
sudo -u caddy wp db check
echo ""

echo "Content Summary:"
echo "  Posts:      $(sudo -u caddy wp post list --post_type=post --format=count)"
echo "  Pages:      $(sudo -u caddy wp post list --post_type=page --format=count)"
echo "  Media:      $(sudo -u caddy wp post list --post_type=attachment --format=count)"
echo "  Users:      $(sudo -u caddy wp user list --format=count)"
echo ""

echo "Plugin Status:"
sudo -u caddy wp plugin list --format=table
echo ""

echo "Uploads Directory:"
UPLOAD_COUNT=$(find "${WP_PATH}/wp-content/uploads" -type f 2>/dev/null | wc -l)
UPLOAD_SIZE=$(du -sh "${WP_PATH}/wp-content/uploads" 2>/dev/null | cut -f1)
echo "  Files: ${UPLOAD_COUNT}"
echo "  Size:  ${UPLOAD_SIZE}"
echo ""

echo "Service Status:"
echo "  PHP-FPM: $(systemctl is-active php-fpm)"
echo "  MariaDB: $(systemctl is-active mariadb)"
echo "  Caddy:   $(systemctl is-active caddy)"
echo ""

echo "Page Load Test:"
SITE_URL=$(sudo -u caddy wp option get siteurl)
curl -w "  Total time: %{time_total}s\n  HTTP code: %{http_code}\n" -o /dev/null -s "${SITE_URL}/"

Rollback if Needed

If something goes wrong:

#!/bin/bash
set -euo pipefail

BACKUP_DIR=$(ls -1d /var/backups/wordpress/pre-migration-* 2>/dev/null | tail -1)

if [ -z "${BACKUP_DIR}" ]; then
    echo "ERROR: No backup found"
    exit 1
fi

echo "==> Rolling back to: ${BACKUP_DIR}"

WP_PATH="/var/www/wordpress"
MYSQL_ROOT_PASS=$(cat /root/.wordpress/credentials | grep "MySQL Root" | awk '{print $4}')
DB_NAME=$(grep "DB_NAME" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)

mysql -u root -p"${MYSQL_ROOT_PASS}" "${DB_NAME}" < "${BACKUP_DIR}/database.sql"

rm -rf "${WP_PATH}/wp-content"
cp -r "${BACKUP_DIR}/wp-content" "${WP_PATH}/"
chown -R caddy:caddy "${WP_PATH}/wp-content"

cd "${WP_PATH}"
sudo -u caddy wp cache flush
sudo -u caddy wp rewrite flush

echo "Rollback complete!"

Part 4: Post-Installation Optimisations

After setup (or migration), run these additional optimisations:

#!/bin/bash

cd /var/www/wordpress

# Remove default content
sudo -u caddy wp post delete 1 2 --force 2>/dev/null || true
sudo -u caddy wp theme delete twentytwentytwo twentytwentythree 2>/dev/null || true

# Update everything
sudo -u caddy wp core update
sudo -u caddy wp plugin update --all
sudo -u caddy wp theme update --all

# Configure WP Super Cache
sudo -u caddy wp super-cache enable 2>/dev/null || true

# Set optimal permalink structure
sudo -u caddy wp rewrite structure '/%postname%/'
sudo -u caddy wp rewrite flush

echo "Optimisations complete!"

Performance Verification

Check your stack is running optimally:

# Verify PHP OPcache status
php -i | grep -i opcache

# Check PHP-FPM status
systemctl status php-fpm

# Test page load time
curl -w "@-" -o /dev/null -s "https://yourdomain.com" << 'EOF'
     time_namelookup:  %{time_namelookup}s
        time_connect:  %{time_connect}s
     time_appconnect:  %{time_appconnect}s
    time_pretransfer:  %{time_pretransfer}s
       time_redirect:  %{time_redirect}s
  time_starttransfer:  %{time_starttransfer}s
                     ----------
          time_total:  %{time_total}s
EOF

Cost Comparison

Instance   | vCPU | RAM | Monthly Cost | Use Case
t4g.micro  | 2    | 1GB | ~$6          | Dev/testing
t4g.small  | 2    | 2GB | ~$12         | Small blogs
t4g.medium | 2    | 4GB | ~$24         | Medium traffic
t4g.large  | 2    | 8GB | ~$48         | High traffic
c7g.medium | 1    | 2GB | ~$25         | CPU-intensive

All prices are approximate for eu-west-1 with on-demand pricing. Reserved instances or Savings Plans reduce costs by 30-60%.
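The monthly figures are simply the hourly on-demand rate multiplied by roughly 730 hours. A quick sanity check, using an assumed t4g.micro rate; confirm the current eu-west-1 price on the AWS pricing page before budgeting:

```shell
# Rough sanity check: monthly cost = hourly on-demand rate x ~730 hours.
# The 0.0084 figure is an assumed t4g.micro rate, not an official price.
monthly=$(awk -v rate=0.0084 'BEGIN { printf "%.2f", rate * 730 }')
echo "t4g.micro: ~\$${monthly}/month"
```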


Troubleshooting

502 Bad Gateway: PHP-FPM socket permissions issue

systemctl restart php-fpm
ls -la /run/php-fpm/www.sock

Database connection error: Check MariaDB is running

systemctl status mariadb
mysql -u wp_user -p wordpress

SSL certificate not working: Ensure DNS is pointing to instance IP

dig +short yourdomain.com
curl -I https://yourdomain.com

OPcache not working: Verify with phpinfo

php -r "phpinfo();" | grep -i opcache.enable

Quick Reference

# 1. Launch instance (local machine)
./launch-graviton-wp.sh

# 2. SSH in and setup WordPress
ssh -i ~/.ssh/key.pem ec2-user@IP
sudo ./setup-wordpress.sh

# 3. If migrating - on old server
./wp-export.sh
scp /tmp/wp-migration/wordpress-migration-*.tar.gz ec2-user@NEW_IP:/tmp/

# 4. If migrating - on new server
sudo ./wp-import.sh /tmp/wordpress-migration-*.tar.gz

This setup delivers a production-ready WordPress installation that'll handle significant traffic while keeping your AWS bill minimal. The combination of Graviton's price-performance, Caddy's efficiency, and properly tuned PHP creates a stack that punches well above its weight class.

Why Rubrik's Architecture Matters: When Restore, Not Backup, Is the Product

1. Backups Should Be Boring (and That Is the Point)

Backups are boring. They should be boring. A backup system that generates excitement is usually signalling failure.

The only time backups become interesting is when they are missing, and that interest level is lethal. Emergency bridges. Frozen change windows. Executive escalation. Media briefings. Regulatory apology letters. Engineers being asked questions that have no safe answers.

Most backup platforms are built for the boring days. Rubrik is designed for the day boredom ends.

2. Backup Is Not the Product. Restore Is.

Many organisations still evaluate backup platforms on the wrong metric: how fast they can copy data somewhere else.

That metric is irrelevant during an incident.

When things go wrong, the only questions that matter are:

  • What can I restore?
  • How fast can it be used?
  • How many restores can run in parallel?
  • How little additional infrastructure is required?

Rubrik treats restore as the primary product, not a secondary feature.

3. Architectural Starting Point: Designed for Failure, Not Demos

Rubrik was built without the assumptions of the tape era. There is no central backup server, no serial job controller, and no media server bottleneck. Instead, it uses a distributed, scale-out architecture with a global metadata index and a stateless policy engine.

Restore becomes a metadata lookup problem, not a job replay problem. This distinction is invisible in demos and decisive during outages.

4. Performance Metrics That Actually Matter

Backup throughput is easy to optimise and easy to market. Restore performance is constrained by network fan out, restore concurrency, control plane orchestration, and application host contention.

Rubrik addresses this by default through parallel restore streams, linear scaling with node count, and minimal control plane chatter. Restore performance becomes predictable rather than optimistic.

5. Restore Semantics That Match Reality

The real test of any backup platform is not how elegantly it captures data, but how usefully it returns that data when needed. This is where architectural decisions made years earlier either pay dividends or extract penalties.

5.1 Instant Access Instead of Full Rehydration

Rubrik does not require full data copy back before access. It supports live mount of virtual machines, database mounts directly from backup storage, and file system mounts for selective recovery.

The recovery model becomes access first, copy later if needed. This is the difference between minutes and hours when production is down.

5.2 Dropping a Table Should Not Be a Crisis

Rubrik understands databases as structured systems, not opaque blobs.

It supports table level restores for SQL Server, mounting a database backup as a live database, extracting tables or schemas without restoring the full database, and point in time recovery without rollback.

Accidental table drops should be operational annoyances, not existential threats.

5.3 Supported Database Engines

Rubrik provides native protection for the major enterprise database platforms:

Database Engine      | Live Mount              | Point in Time Recovery       | Key Constraints
Microsoft SQL Server | Yes                     | Yes (transaction log replay) | SQL 2012+ supported; Always On AG, FCI, standalone
Oracle Database      | Yes                     | Yes (archive log replay)     | RAC, Data Guard, Exadata supported; SPFILE required for automated recovery
SAP HANA             | No                      | Yes                          | Backint API integration; uses native HANA backup scheduling
PostgreSQL           | No                      | Yes (up to 5 minute RPO)     | File level incremental; on premises and cloud (AWS, Azure, GCP)
IBM Db2              | Via Elastic App Service | Yes                          | Uses native Db2 backup utilities
MongoDB              | Via Elastic App Service | Yes                          | Sharded and unsharded clusters; no quiescing required
MySQL                | Via Elastic App Service | Yes                          | Uses native MySQL backup tools
Cassandra            | Via Elastic App Service | Yes                          | Via Rubrik Datos IO integration

The distinction between native integration and Elastic App Service matters operationally. Native integration means Rubrik handles discovery, scheduling, and orchestration directly. Elastic App Service means Rubrik provides managed volumes as backup targets while the database's native tools handle the actual backup process. Both approaches deliver immutability and policy driven retention, but the operational experience differs.

5.4 Live Mount: Constraints and Caveats

Live Mount is Rubrik's signature capability: mounting backups as live, queryable databases without copying data back to production storage. The database runs with its data files served directly from the Rubrik cluster over NFS (for Oracle) or SMB 3.0 (for SQL Server).

This capability is transformative for specific use cases. It is not a replacement for production storage.

What Live Mount Delivers:

  • Near instant database availability (seconds to minutes, regardless of database size)
  • Zero storage provisioning on the target host
  • Multiple concurrent mounts from the same backup
  • Point in time access across the entire retention window
  • Ideal for granular recovery, DBCC health checks, test/dev cloning, audit queries, and upgrade validation

What Live Mount Does Not Deliver:

  • Production grade I/O performance
  • High availability during Rubrik cluster maintenance
  • Persistence across host or cluster reboots

IOPS Constraints:

Live Mount performance is bounded by the Rubrik appliance's ability to serve I/O, not by the target host's storage subsystem. Published figures suggest approximately 30,000 IOPS per Rubrik appliance for Live Mount workloads. This is adequate for reporting queries, data extraction, and validation testing. It is not adequate for transaction heavy production workloads.

The performance characteristics are inherently different from production storage:

Metric               | Production SAN/Flash    | Rubrik Live Mount
Random read IOPS     | 100,000+                | ~30,000 per appliance
Latency profile      | Sub millisecond         | Network + NFS overhead
Write optimisation   | Production tuned        | Backup optimised
Concurrent workloads | Designed for contention | Shared with backup operations
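Given those bounds, it is worth measuring what a mount can actually serve before pointing any workload at it. A proper benchmark would use fio with random 8k reads against the Live Mount path; the portable sketch below only illustrates the shape of the check, with a temp file standing in for the mounted backup:

```shell
#!/bin/bash
# Sketch: quick read check against a mount point. For real numbers, run
# fio (randread, 8k blocks, direct I/O) against the actual Live Mount path;
# this dd version just shows where the measurement happens.
set -euo pipefail
TESTFILE=$(mktemp /tmp/lm-read-test.XXXXXX)
trap 'rm -f "$TESTFILE"' EXIT

# Write a small file, then time reading it back in 8k blocks
dd if=/dev/zero of="$TESTFILE" bs=1M count=32 conv=fsync 2>/dev/null
READ_REPORT=$(dd if="$TESTFILE" of=/dev/null bs=8k 2>&1 | tail -1)
echo "Read test: ${READ_REPORT}"
```

On a real Live Mount the interesting comparison is this figure versus the same test on production storage; expect the network and NFS/SMB overhead to show up immediately.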

SQL Server Live Mount Specifics:

  • Databases mount via SMB 3.0 shares with UNC paths
  • Transaction log replay occurs during mount for point in time positioning
  • The mounted database is read write, but writes go to the Rubrik cluster
  • Supported for standalone instances, Failover Cluster Instances, and Always On Availability Groups
  • Table level recovery requires mounting the database, then using T SQL to extract and import specific objects

Oracle Live Mount Specifics:

  • Data files mount via NFS; redo logs and control files remain on the target host
  • Automated recovery requires source and target configurations to match (RAC to RAC, single instance to single instance, ASM to ASM)
  • Files only recovery allows dissimilar configurations but requires DBA managed RMAN recovery
  • SPFILE is required for automated recovery; PFILE databases require manual intervention
  • Block change tracking (BCT) is disabled on Live Mount targets
  • Live Mount fails if the target host, RAC cluster, or Rubrik cluster reboots during the mount, requiring a forced unmount to clean up metadata
  • Direct NFS (DNFS) is recommended on Oracle RAC nodes for improved recovery performance

What Live Mount Is Not:

Live Mount is explicitly designed for temporary access, not sustained production workloads. The use cases Rubrik markets (test/dev, DBCC validation, granular recovery, audit queries) all share a common characteristic: they are time bounded operations that tolerate moderate I/O performance in exchange for instant availability.

Running production transaction processing against a Live Mount database would be technically possible and operationally inadvisable. The I/O profile, the network dependency, and the lack of high availability guarantees make it unsuitable for workloads where performance and uptime matter.

5.5 The Recovery Hierarchy

Understanding when to use each recovery method matters:

Recovery Need | Recommended Method | Time to Access | Storage Required
Extract specific rows/tables | Live Mount + query | Minutes | None
Validate backup integrity | Live Mount + DBCC | Minutes | None
Clone for test/dev | Live Mount | Minutes | None
Full database replacement | Export/Restore | Hours (size dependent) | Full database size
Disaster recovery cutover | Instant Recovery | Minutes (then migrate) | Temporary, then full

The strategic value of Live Mount is avoiding full restores when full restores are unnecessary. For a 5TB database where someone dropped a single table, Live Mount means extracting that table in minutes rather than waiting hours for a complete restore.

For actual disaster recovery, where the production database is gone and must be replaced, Live Mount provides bridge access while the full restore completes in parallel. The database is queryable immediately; production grade performance follows once data migration finishes.

5.6 The Hidden Failure Mode After a Successful Restore

Rubrik is not deployed in a single explosive moment. In the real world, it is rolled out carefully over weeks. Systems are onboarded one by one, validated, and then left to settle. Each system performs a single full backup, after which life becomes calm and predictable. From that point forward, everything is incremental. Deltas are small, backup windows shrink, networks breathe easily, and the platform looks deceptively relaxed.

This operating state creates a dangerous illusion.

After a large scale recovery event, you will spend hours restoring systems. That work feels like the crisis. It is not. The real stress event happens later, quietly, on the first night after the restores complete. Every restored system now believes it is brand new. Every one of them schedules a full backup. At that moment, your entire estate attempts to perform a first full backup simultaneously while still serving live traffic.

This is the point where Rubrik appliances, networks, and upstream storage experience their true failure conditions. Not during the restore, but after it. Massive ingest rates, saturated links, constrained disk, and queueing effects all arrive at once. If this scenario is not explicitly planned for, the recovery that looked successful during the day can cascade into instability overnight.

Recovery planning therefore cannot stop at restore completion. Backup re entry must be treated as a first class recovery phase. In most environments, the only viable strategy is to deliberately phase backup schedules over multiple days following a large scale restore. Systems must be staggered back into protection in controlled waves, rather than allowed to collide into a single catastrophic full backup storm.
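The staggering logic above can be sketched as a simple packing problem: given each restored system's full backup size and the cluster's nightly ingest budget, assign systems to waves so no single night exceeds the budget. The system names, sizes, and the 20 TB budget are illustrative assumptions, not measurements.

```python
# Sketch: phase restored systems back into protection in controlled waves,
# so the estate does not attempt one simultaneous full backup storm.
def plan_backup_reentry(systems_tb: dict[str, float],
                        nightly_ingest_budget_tb: float) -> list[list[str]]:
    """Greedily pack systems (name -> full backup size in TB) into nightly
    waves so no wave exceeds the cluster's ingest budget."""
    waves, current, used = [], [], 0.0
    # Largest systems first, so they anchor their own waves.
    for name, size in sorted(systems_tb.items(), key=lambda kv: -kv[1]):
        if current and used + size > nightly_ingest_budget_tb:
            waves.append(current)
            current, used = [], 0.0
        current.append(name)
        used += size
    if current:
        waves.append(current)
    return waves

estate = {"erp": 12.0, "crm": 6.0, "dwh": 20.0, "files": 4.0, "web": 1.0}
for night, wave in enumerate(plan_backup_reentry(estate, 20.0), 1):
    print(f"night {night}: {wave}")
```

A real plan would also weight by business criticality and RPO exposure (every night a system waits is a night it is unprotected), but the core discipline is the same: backup re entry is scheduled, not left to collide.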

Restore is the product. But what comes after restore is where architectures either hold, or quietly collapse.

6. Why Logical Streaming Is a Design Failure

Traditional restore models stream backup data through the database host. This guarantees CPU contention, IO pressure, and restore times proportional to database size rather than change size.

Figure 1 illustrates this contrast clearly. Traditional restore requires data to be copied back through the database server, creating high I/O, CPU and network load with correspondingly long restore times. Rubrik's Live Mount approach mounts the backup copy directly, achieving near zero RTO with minimal data movement. The difference between these approaches becomes decisive when production is down and every minute of restore time translates to business impact.

[Image: Rubrik Live Mount dashboard showing instant data recovery interface]

Rubrik avoids this by mounting database images and extracting only required objects. The database host stops being collateral damage during recovery.

6.1 The VSS Tax: Why SQL Server Backups Cannot Escape Application Coordination

For VMware workloads without databases, Rubrik can leverage storage level snapshots that are instantaneous, application agnostic, and impose zero load on the guest operating system. The hypervisor freezes the VM state, the storage array captures the point in time image, and the backup completes before the application notices.

SQL Server cannot offer this simplicity. The reason is not a Microsoft limitation or a Rubrik constraint. The reason is transactional consistency.

6.1.1 The Crash Consistent Option Exists

Nothing technically prevents Rubrik, or any backup tool, from taking a pure storage snapshot of a SQL Server volume without application coordination. The snapshot would complete in milliseconds with zero database load.

The problem is what you would recover: a crash consistent image, not an application consistent one.

A crash consistent snapshot captures storage state mid flight. This includes partially written pages, uncommitted transactions, dirty buffers not yet flushed to disk, and potentially torn writes caught mid I/O. SQL Server is designed to recover from exactly this state. Every time the database engine starts after an unexpected shutdown, it runs crash recovery, rolling forward committed transactions from the log and rolling back uncommitted ones.

The database will become consistent. Eventually. Probably.

6.1.2 Why Probably Is Not Good Enough

Crash recovery works. It works reliably. It is tested millions of times daily across every SQL Server instance that experiences an unclean shutdown.

But restore confidence matters. When production is down and executives are asking questions, the difference between "this backup is guaranteed consistent" and "this backup should recover correctly after crash recovery completes" is operationally significant.

[Image: Traditional backup architecture with multiple steps and potential failure points during data restore process]

VSS exists to eliminate that uncertainty.

6.1.3 What VSS Actually Does

When a backup application requests an application consistent SQL Server snapshot, the sequence shown in Figure 2 executes. The backup server sends a signal through VSS Orchestration, which triggers the SQL Server VSS Writer to prepare the database. This preparation involves flushing dirty pages to storage, hardening transaction logs, and momentarily freezing I/O. Only then does the storage-level snapshot execute, capturing a point-in-time consistent image that requires no crash recovery on restore.

The result is a snapshot that is immediately consistent, immediately usable, and carries no uncertainty about transactional integrity.

6.1.4 The Coordination Cost

The VSS freeze window is typically brief, milliseconds to low seconds. But the preparation is not free.

Buffer pool flushes on large databases generate I/O pressure. Checkpoint operations compete with production workloads. The freeze, however short, introduces latency for in flight transactions. The database instance is actively participating in its own backup.

For databases measured in terabytes, with buffer pools consuming hundreds of gigabytes, this coordination overhead becomes operationally visible. Backup windows that appear instantaneous from the storage console are hiding real work inside the SQL Server instance.

[Image: Rubrik data management platform dashboard showing backup and restore operations]

6.1.5 The Architectural Asymmetry

This creates a fundamental difference in backup elegance across workload types:

Workload Type | Backup Method | Application Load | Restore State
VMware VM (no database) | Storage snapshot | Zero | Crash consistent (acceptable)
VMware VM (with SQL Server) | VSS coordinated snapshot | Moderate | Application consistent
Physical SQL Server | VSS coordinated snapshot | Moderate to high | Application consistent
Physical SQL Server | Pure storage snapshot | Zero | Crash consistent (risky)

For a web server or file share, crash consistent is fine. The application has no transactional state worth protecting. For a database, crash consistent means trusting recovery logic rather than guaranteeing consistency.

6.1.6 The Uncomfortable Reality

The largest, most critical SQL Server databases, the ones that would benefit most from zero overhead instantaneous backup, are precisely the workloads where crash consistent snapshots carry the most risk. More transactions in flight. Larger buffer pools. More recovery time if something needs replay.

Rubrik supports VSS coordination because the alternative is shipping backups that might need crash recovery. That uncertainty is acceptable for test environments. It is rarely acceptable for production databases backing financial systems, customer records, or regulatory reporting.

The VSS tax is not a limitation imposed by Microsoft or avoided by competitors. It is the cost of consistency. Every backup platform that claims application consistent SQL Server protection is paying it. The only question is whether they admit the overhead exists.

7. Snapshot Based Protection Is Objectively Better (When You Can Get It)

The previous section explained why SQL Server backups cannot escape application coordination. VSS exists because transactional consistency requires it, and the coordination overhead is the price of certainty.

This makes the contrast with pure snapshot based protection even starker. Where snapshots work cleanly, they are not incrementally better. They are categorically superior.

7.1 What Pure Snapshots Deliver

Snapshot based backups in environments that support them provide:

  • Near instant capture: microseconds to milliseconds, regardless of dataset size
  • Zero application load: the workload never knows a backup occurred
  • Consistent recovery points: the storage layer guarantees point in time consistency
  • Predictable backup windows: duration is independent of data volume
  • No bandwidth consumption during capture: data movement happens later, asynchronously

A 50TB VMware datastore snapshots in the same time as a 50GB datastore. Backup windows become scheduling decisions rather than capacity constraints.

Rubrik exploits this deeply in VMware environments. Snapshot orchestration, instant VM recovery, and live mounts all depend on the hypervisor providing clean, consistent, zero overhead capture points.

7.2 Why This Is Harder Than It Looks

The elegance of snapshot based protection depends entirely on the underlying platform providing the right primitives. This is where the gap between VMware and everything else becomes painful.

VMware offers:

  • Native snapshot APIs with transactional semantics
  • Changed Block Tracking (CBT) for efficient incrementals
  • Hypervisor level consistency without guest coordination
  • Storage integration through VADP (vSphere APIs for Data Protection)

These are not accidental features. VMware invested years building a backup ecosystem because they understood that enterprise adoption required operational maturity, not just compute virtualisation.

Physical hosts offer none of this.

There is no universal snapshot API for bare metal servers. Storage arrays provide snapshot capabilities, but each vendor implements them differently, with different consistency guarantees, different integration points, and different failure modes. The operating system has no standard mechanism to coordinate application state with storage level capture.

7.3 The Physical Host Penalty

This is why physical SQL Server hosts face a compounding disadvantage:

  1. No hypervisor abstraction: there is no layer between the OS and storage that can freeze state cleanly
  2. VSS remains mandatory: application consistency still requires database coordination
  3. No standardised incremental tracking: without CBT or equivalent, every backup must rediscover what changed
  4. Storage integration is bespoke: each array, each SAN, each configuration requires specific handling

The result is that physical hosts with the largest databases (the workloads generating the most backup data, with the longest restore times, under the most operational pressure) receive the least architectural benefit from modern backup platforms.

They are stuck paying the VSS tax without receiving the snapshot dividend.

7.4 The Integration Hierarchy

Backup elegance follows a clear hierarchy based on platform integration depth:

Environment | Snapshot Quality | Incremental Efficiency | Application Consistency | Overall Experience
VMware (no database) | Excellent | CBT driven | Not required | Seamless
VMware (with SQL Server) | Excellent | CBT driven | VSS coordinated | Good with overhead
Cloud native (EBS, managed disks) | Good | Provider dependent | Varies by workload | Generally clean
Physical with enterprise SAN | Possible | Array dependent | VSS coordinated | Complex but workable
Physical with commodity storage | Limited | Often full scan | VSS coordinated | Painful

The further down this hierarchy, the more the backup platform must compensate for missing primitives. Rubrik handles this better than most, but even excellent software cannot conjure APIs that do not exist.

7.5 Why the Industry Irony Persists

The uncomfortable truth is that snapshot based protection delivers its greatest value precisely where it is least available.

A 500GB VMware VM snapshots effortlessly. The hypervisor provides everything needed. Backup is boring, as it should be.

A 50TB physical SQL Server, the database actually keeping the business running, containing years of transactional history, backing regulatory reporting and financial reconciliation, must coordinate through VSS, flush terabytes of buffer pool, sustain I/O pressure during capture, and hope the storage layer cooperates.

The workloads that need snapshot elegance the most are architecturally prevented from receiving it.

This is not a Rubrik limitation. It is not a Microsoft conspiracy. It is the accumulated consequence of decades of infrastructure evolution where virtualisation received backup investment and physical infrastructure did not.

7.6 What This Means for Architecture Decisions

Understanding this hierarchy should influence infrastructure strategy:

Virtualise where possible. The backup benefits alone often justify the overhead. A SQL Server VM with VSS coordination still benefits from CBT, instant recovery, and hypervisor level orchestration.

Choose storage with snapshot maturity. If physical hosts are unavoidable, enterprise arrays with proven snapshot integration reduce the backup penalty. This is not the place for commodity storage experimentation.

Accept the VSS overhead. For SQL Server workloads, crash consistent snapshots are technically possible but operationally risky. The coordination cost is worth paying. Budget for it in backup windows and I/O capacity.

Plan restore, not backup. Snapshot speed is irrelevant if restore requires hours of data rehydration. The architectural advantage of snapshots extends to recovery only if the platform supports instant mount and selective restore.

Rubrik's value in this landscape is not eliminating the integration gaps (nobody can) but navigating them intelligently. Where snapshots work, Rubrik exploits them fully. Where they do not, Rubrik minimises the penalty through parallel restore, live mounts, and metadata driven recovery.

The goal remains the same: make restore the product, regardless of how constrained the backup capture had to be.

8. Rubrik Restore Policies: Strategy, Trade offs, and Gotchas

SLA Domains are Rubrik's policy abstraction layer, and understanding how to configure them properly separates smooth recoveries from painful ones. The flexibility is substantial, but so are the consequences of misconfiguration.

8.1 Understanding SLA Domain Architecture

Rubrik's policy model centres on SLA Domains, named policies that define retention, frequency, replication, and archival behaviour. Objects are assigned to SLA Domains rather than configured individually, which creates operational leverage but requires upfront design discipline.

The core parameters that matter for restore planning:

Snapshot Frequency determines your Recovery Point Objective (RPO). A 4-hour frequency means you could lose up to 4 hours of data. For SQL Server with log backup enabled, transaction logs between snapshots reduce effective RPO to minutes, but the full snapshot frequency still determines how quickly you can access a baseline restore point.
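That relationship between snapshot frequency, log backups, and worst case data loss reduces to simple arithmetic, sketched below. This mirrors the description above: without log backups, RPO equals the snapshot interval; with them, it collapses to the log backup interval.

```python
# Sketch: worst case RPO implied by an SLA Domain's snapshot frequency
# and optional transaction log backup frequency.
from typing import Optional

def worst_case_rpo_minutes(snapshot_every_h: float,
                           log_backup_every_min: Optional[float] = None) -> float:
    """Without log backups, RPO is the snapshot interval; with log
    backups, effective RPO collapses to the log backup interval."""
    if log_backup_every_min is None:
        return snapshot_every_h * 60
    return log_backup_every_min

print(worst_case_rpo_minutes(4))       # snapshot only: up to 240 minutes of loss
print(worst_case_rpo_minutes(4, 15))   # with 15 minute log backups: 15 minutes
```

Note the asymmetry the text points out: log backups shrink the RPO, but the snapshot frequency still governs how quickly a baseline restore point can be mounted.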

Local Retention controls how many snapshots remain on the Rubrik cluster for instant access. This is your Live Mount window. Data within local retention restores in minutes. Data beyond it requires rehydration from archive, which takes hours.

Replication copies snapshots to a secondary Rubrik cluster, typically in another location. This is your disaster recovery tier. Replication targets can serve Live Mount operations, meaning DR isn't just "eventually consistent backup copies" but actual instant recovery capability at the secondary site.

Archival moves aged snapshots to object storage (S3, Azure Blob, Google Cloud Storage). Archive tier data cannot be Live Mounted; it must be retrieved first, which introduces retrieval latency and potentially egress costs.

8.2 The Retention vs. Recovery Speed Trade off

This is where most organisations get the policy design wrong.

The temptation is to keep minimal local retention and archive aggressively to reduce storage costs. The consequence is that any restore request older than a few days becomes a multi hour operation.

Consider the mathematics for a 5TB SQL Server database:

Recovery Scenario | Local Retention | Time to Access | Operational Impact
Yesterday's backup | Within local retention | 2-5 minutes (Live Mount) | Minimal
Last week's backup | Within local retention | 2-5 minutes (Live Mount) | Minimal
Last month's backup | Archived | 4-8 hours (retrieval + restore) | Significant
Last quarter's backup | Archived (cold tier) | 12-24 hours | Major incident

The storage cost of keeping 30 days local versus 7 days local might seem significant when multiplied across the estate. But the operational cost of a 6 hour restore delay during an audit request or compliance investigation often exceeds years of incremental storage spend.

Recommendation: Size local retention to cover your realistic recovery scenarios, not your theoretical minimum. For most organisations, 14-30 days of local retention provides the right balance between cost and operational flexibility.
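The trade off can be made concrete with back-of-envelope arithmetic. Everything in this sketch is an assumption chosen for illustration: the per-TB-month storage price, the daily change rate, the number of archive retrievals per year, and the hourly cost of waiting on one.

```python
# Sketch: incremental cost of longer local retention vs the operational
# cost of archive retrievals. All prices and rates are illustrative.
def extra_local_retention_cost(db_size_tb: float, daily_change_rate: float,
                               extra_days: int, cost_per_tb_month: float) -> float:
    """Monthly cost of holding `extra_days` more incrementals locally."""
    extra_tb = db_size_tb * daily_change_rate * extra_days
    return extra_tb * cost_per_tb_month

def retrieval_delay_cost(restores_per_year: int, hours_per_restore: float,
                         cost_per_hour: float) -> float:
    """Annualised cost of waiting on archive retrievals instead."""
    return restores_per_year * hours_per_restore * cost_per_hour

# 5 TB database, 3% daily change, extending 7 -> 30 days local retention:
monthly_storage = extra_local_retention_cost(5.0, 0.03, 23, 50.0)
annual_delay = retrieval_delay_cost(4, 6.0, 500.0)
print(f"extra storage/yr: ${monthly_storage * 12:,.0f}")
print(f"retrieval delay/yr: ${annual_delay:,.0f}")
```

Under these assumed numbers the extra local retention costs roughly two thousand dollars a year, while four six-hour archive retrievals cost twelve thousand, which is the asymmetry the paragraph above describes.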

8.3 SLA Domain Design Patterns

8.3.1 Pattern 1: Tiered by Criticality

Create separate SLA Domains for different criticality levels:

  • Platinum: 4 hour snapshots, 30 day local retention, synchronous replication, 7 year archive
  • Gold: 8 hour snapshots, 14 day local retention, asynchronous replication, 3 year archive
  • Silver: Daily snapshots, 7 day local retention, no replication, 1 year archive
  • Bronze: Daily snapshots, 7 day local retention, no replication, 90 day archive

This pattern works well when criticality maps cleanly to workload types, but creates governance overhead when applications span tiers.
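Expressing the tiers above as data makes assignments auditable and makes "unknown tier" a loud failure rather than a silent default. The field names and structure here are illustrative, not Rubrik's actual API; the values mirror the example tiers.

```python
# Sketch: the criticality tiers expressed as data. Structure and names
# are illustrative, not Rubrik's object model.
from dataclasses import dataclass

@dataclass(frozen=True)
class SlaDomain:
    name: str
    snapshot_every_h: int
    local_retention_days: int
    replicated: bool
    archive_years: float

TIERS = {
    "platinum": SlaDomain("Platinum", 4, 30, True, 7),
    "gold":     SlaDomain("Gold", 8, 14, True, 3),
    "silver":   SlaDomain("Silver", 24, 7, False, 1),
    "bronze":   SlaDomain("Bronze", 24, 7, False, 0.25),
}

def assign(db_name: str, tier: str) -> SlaDomain:
    """Assign a database to a tier; unknown tiers fail loudly rather
    than silently falling back to a default policy."""
    if tier not in TIERS:
        raise ValueError(f"{db_name}: unknown tier {tier!r}")
    return TIERS[tier]

print(assign("erp-prod", "platinum"))
```

Keeping policy definitions in version control alongside a mapping of objects to tiers is one way to answer the audit question "what policy applies here?" without logging into the cluster.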

8.3.2 Pattern 2: Tiered by Recovery Requirements

Align SLA Domains to recovery time objectives rather than business criticality:

  • Instant Recovery: Maximum local retention, synchronous replication, Live Mount always available
  • Same Day Recovery: 14 day local retention, asynchronous replication
  • Next Day Recovery: 7 day local retention, archive first strategy

This pattern acknowledges that "critical" and "needs instant recovery" aren't always the same thing. A compliance archive might be business critical but tolerate 24 hour recovery times.

8.3.3 Pattern 3: Application Aligned

Create SLA Domains per major application or database platform:

  • SQL Server Production
  • SQL Server Non Production
  • Oracle Production
  • VMware Infrastructure
  • File Shares

This pattern simplifies troubleshooting and reporting but can lead to policy sprawl as the estate grows.

8.4 Log Backup Policies: The Hidden Complexity

For SQL Server and Oracle, snapshot frequency alone doesn't tell the full story. Transaction log backups between snapshots determine actual RPO.

Rubrik supports log backup frequencies down to 1 minute for SQL Server. The trade offs:

Aggressive Log Backup (1-5 minute frequency):

  • Sub 5 minute RPO
  • Higher metadata overhead on Rubrik cluster
  • More objects to manage during restore
  • Longer Live Mount preparation time (more logs to replay)

Conservative Log Backup (15-60 minute frequency):

  • Acceptable RPO for most workloads
  • Lower operational overhead
  • Faster Live Mount operations
  • Simpler troubleshooting

Gotcha: Log backup frequency creates a hidden I/O load on the source database. A 1 minute log backup interval on a high transaction database generates constant log backup traffic. For already I/O constrained databases, this can become the straw that breaks performance.

Recommendation: Match log backup frequency to actual RPO requirements, not aspirational ones. If the business can tolerate 15 minutes of data loss, don't configure 1 minute log backups just because you can.
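The overhead difference between aggressive and conservative intervals is easy to quantify in object count, which drives both the metadata overhead and the number of logs to replay during a Live Mount:

```python
# Sketch: log backup objects generated per database per day at each
# interval, making the "aspirational" 1 minute schedule concrete.
def log_backups_per_day(interval_min: int) -> int:
    """Number of log backup objects produced per day at this interval."""
    return (24 * 60) // interval_min

for interval in (1, 5, 15, 60):
    n = log_backups_per_day(interval)
    print(f"{interval:>2} min interval -> {n:>4} log objects/day per database")
```

A 1 minute interval produces 1,440 objects per database per day; multiplied across an estate of hundreds of databases, that is the metadata and replay burden the trade off list above warns about.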

8.5 Replication Topology Gotchas

Replication seems straightforward (copy snapshots to another cluster), but the implementation details matter.

8.5.1 Gotcha 1: Replication Lag Under Load

Asynchronous replication means the target cluster is always behind the source. During high backup activity (month end processing, batch loads), this lag can extend to hours. If a disaster occurs during this window, you lose more data than your SLA suggests.

Monitor replication lag as an operational metric, not just a capacity planning number.
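Treating lag as an operational metric means giving it a threshold tied to the SLA. The sketch below uses an assumed rule (alert when lag exceeds twice the RPO); the timestamps and threshold are illustrative, not a Rubrik feature.

```python
# Sketch: replication lag as a monitored metric with an SLA-derived
# alert threshold. The 2x-RPO rule is an assumption for illustration.
from datetime import datetime, timedelta

def replication_lag(last_source_snapshot: datetime,
                    last_replicated_snapshot: datetime) -> timedelta:
    """How far the replication target trails the source cluster."""
    return last_source_snapshot - last_replicated_snapshot

def lag_breaches_sla(lag: timedelta, rpo: timedelta) -> bool:
    """Alert when lag exceeds twice the RPO: a disaster inside that
    window would lose more data than the SLA suggests."""
    return lag > 2 * rpo

# Month end load: target trails by 10 hours against a 4 hour RPO.
lag = replication_lag(datetime(2026, 3, 1, 12, 0), datetime(2026, 3, 1, 2, 0))
print(lag, lag_breaches_sla(lag, rpo=timedelta(hours=4)))
```

The important property is that the threshold derives from the SLA rather than from capacity headroom, so the alert fires on data loss exposure, not on cluster busyness.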

8.5.2 Gotcha 2: Bandwidth Contention with Production Traffic

Replication competes for the same network paths as production traffic. If your backup replication saturates a WAN link, production application performance degrades.

Either implement QoS policies to protect production traffic, or schedule replication during low utilisation windows. Rubrik supports replication scheduling, but the default is "as fast as possible," which isn't always appropriate.

8.5.3 Gotcha 3: Cascaded Replication Complexity

For multi site architectures, you might configure Site A → Site B → Site C replication. Each hop adds latency and failure modes. A Site B outage breaks the chain to Site C.

Consider whether hub and spoke (Site A replicates independently to both B and C) better matches your DR requirements, despite the additional bandwidth consumption.

8.6 Archive Tier Selection: Retrieval Time Matters

Object storage isn't monolithic. The choice between storage classes has direct recovery implications.

Storage Class | Typical Retrieval Time | Use Case
S3 Standard / Azure Hot | Immediate | Frequently accessed archives
S3 Standard-IA / Azure Cool | Immediate (higher retrieval cost) | Infrequent but urgent access
S3 Glacier Instant Retrieval | Milliseconds | Compliance archives with occasional audit access
S3 Glacier Flexible Retrieval | 1-12 hours | Long-term retention with rare access
S3 Glacier Deep Archive | 12-48 hours | Legal hold, never access unless subpoenaed

Gotcha: Rubrik's archive policy assigns snapshots to a single storage class. If your retention spans 7 years, all 7 years of archives pay the same storage rate, even though year 1 archives are accessed far more frequently than year 7 archives.

Recommendation: Consider tiered archive policies, sending recent archives to Standard-IA and aged archives to Glacier. This requires multiple SLA Domains and careful lifecycle management, but the cost savings compound significantly at scale.
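The compounding effect is visible even with rough numbers. The per-TB-month prices below are assumptions chosen for illustration, not current AWS list prices, and the 10/60 TB split between recent and aged archives is hypothetical.

```python
# Sketch: monthly cost of a single-class archive vs a tiered one.
# Prices are assumptions, not quotes; check current S3 pricing.
PRICES = {"standard_ia": 12.8, "glacier_flexible": 3.6}  # $/TB-month, assumed

def single_class_cost(total_tb: float, price: float) -> float:
    """Everything in one storage class, as a single SLA Domain implies."""
    return total_tb * price

def tiered_cost(recent_tb: float, aged_tb: float) -> float:
    """Recent archives on Standard-IA, aged archives on Glacier."""
    return (recent_tb * PRICES["standard_ia"]
            + aged_tb * PRICES["glacier_flexible"])

# 7 year retention: assume 10 TB from the last year, 60 TB older.
print(f"flat: ${single_class_cost(70, PRICES['standard_ia']):,.0f}/mo")
print(f"tiered: ${tiered_cost(10, 60):,.0f}/mo")
```

Because most archive volume is old and rarely accessed, the tiered figure lands at roughly a third of the flat one under these assumptions, and the gap widens as retention lengthens.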

8.7 Policy Assignment Gotchas

8.7.1 Gotcha 1: Inheritance and Override Conflicts

Rubrik supports hierarchical policy assignment (cluster → host → database). When policies conflict, the resolution logic isn't always intuitive. A database with an explicit SLA assignment won't inherit changes made to its parent host's policy.

Document your policy hierarchy explicitly. During audits, the question "what policy actually applies to this database?" should have an immediate, verifiable answer.
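One way to make that answer verifiable is to encode the resolution rule itself. The sketch below models the "most specific assignment wins" behaviour described above with a hypothetical data model; it is not Rubrik's actual object model or API.

```python
# Sketch: resolving the effective SLA through a cluster -> host -> database
# hierarchy. Data model is illustrative, not Rubrik's.
def effective_sla(db: str, explicit: dict, host_of: dict,
                  host_sla: dict, cluster_default: str) -> str:
    """Most specific assignment wins; a database with an explicit SLA
    ignores later changes to its host's policy."""
    if db in explicit:
        return explicit[db]
    host = host_of.get(db)
    if host in host_sla:
        return host_sla[host]
    return cluster_default

explicit = {"finance_db": "Platinum"}   # explicit database assignment
host_of = {"finance_db": "sql01", "hr_db": "sql01", "logs_db": "sql02"}
host_sla = {"sql01": "Gold"}

print(effective_sla("finance_db", explicit, host_of, host_sla, "Silver"))  # Platinum
print(effective_sla("hr_db", explicit, host_of, host_sla, "Silver"))       # Gold
print(effective_sla("logs_db", explicit, host_of, host_sla, "Silver"))     # Silver
```

Running a script like this against an export of assignments turns the audit question into a reproducible lookup rather than a console investigation.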

8.7.2 Gotcha 2: Pre script and Post script Failures

Custom scripts for application quiescing or notification can fail, and failure handling varies. A pre script failure might skip the backup entirely (safe but creates a gap) or proceed without proper quiescing (dangerous).

Test script failure modes explicitly. Know what happens when your notification webhook is unreachable or your custom quiesce script times out.

8.7.3 Gotcha 3: Time Zone Confusion

Rubrik displays times in the cluster's configured time zone, but SLA schedules operate in UTC unless explicitly configured otherwise. An "8 PM backup" might run at midnight local time if the time zone mapping is wrong.

Verify backup execution times after policy configuration; don't trust the schedule display alone.

8.8 Testing Your Restore Policies

Policy design is theoretical until tested. The following tests should be regular operational practice:

Live Mount Validation: Mount a backup from local retention and verify application functionality. This proves both backup integrity and Live Mount operational capability.

Archive Retrieval Test: Retrieve a backup from archive tier and time the operation. Compare actual retrieval time against SLA commitments.

Replication Failover Test: Perform a Live Mount from the replication target, not the source cluster. This validates that DR actually works, not just that replication is running.

Point in Time Recovery Test: For databases with log backup enabled, recover to a specific timestamp between snapshots. This validates that log chain integrity is maintained.

Concurrent Restore Test: Simulate a ransomware scenario by triggering multiple simultaneous restores. Measure whether your infrastructure can sustain the required parallelism.

8.9 Policy Review Triggers

SLA Domains shouldn't be "set and forget." Trigger policy reviews when:

  • Application criticality changes (promotion to production, decommissioning)
  • Recovery requirements change (new compliance mandates, audit findings)
  • Infrastructure changes (new replication targets, storage tier availability)
  • Performance issues emerge (backup windows exceeded, replication lag growing)
  • Cost optimisation cycles (storage spend review, cloud egress analysis)

The goal is proactive policy maintenance, not reactive incident response when a restore takes longer than expected.

9. Ransomware: Where Architecture Is Exposed

9.1 The Restore Storm Problem

After ransomware, the challenge is not backup availability. The challenge is restoring everything at once.

Constraints appear immediately. East-west traffic saturates. DWDM links run hot. Core switch buffers overflow. Cloud egress throttling kicks in.

Rubrik mitigates this through parallel restores, SLA based prioritisation, and live mounts for critical systems. What it cannot do is defeat physics. A good recovery plan avoids turning a data breach into a network outage.
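"Cannot defeat physics" reduces to a first order estimate: total data divided by the effective throughput of the slowest shared path. The numbers below (200 TB estate, 10 Gbps link, 70 percent effective efficiency) are illustrative assumptions for the arithmetic.

```python
# Sketch: first order restore storm duration when everything must come
# back at once. Link size and efficiency factor are assumptions.
def restore_storm_hours(total_tb: float, link_gbps: float,
                        efficiency: float = 0.7) -> float:
    """Hours to move total_tb across a shared link; efficiency models
    protocol overhead and contention. 1 TB ~ 8,000 Gb (decimal)."""
    gb_total = total_tb * 8_000            # terabytes -> gigabits
    effective_gbps = link_gbps * efficiency
    return gb_total / effective_gbps / 3600

# 200 TB estate over a 10 Gbps link at 70% efficiency:
print(f"{restore_storm_hours(200, 10):.1f} hours")
```

Roughly two and a half days over a single 10 Gbps path, regardless of how parallel the restores are. This is why SLA based prioritisation matters: the question is not whether everything restores quickly, but what restores first.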

10. SaaS vs Appliance: This Is a Network Decision

Functionally, Rubrik SaaS and on prem appliances share the same policy engine, metadata index, and restore semantics.

The difference is bandwidth reality.

On prem appliances provide fast local restores, predictable latency, and minimal WAN dependency. SaaS based protection provides excellent cloud workload coverage and operational simplicity, but restore speed is bounded by network capacity and egress costs.

Hybrid estates usually require both.

11. Why Rubrik in the Cloud?

Cloud providers offer native backup primitives. These are necessary but insufficient. They do not provide unified policy across environments, cross account recovery at scale, ransomware intelligence, or consistent restore semantics. Rubrik turns cloud backups into recoverable systems rather than isolated snapshots.

11.1 Should You Protect Your AWS Root and Crypto Accounts?

Yes, because losing the control plane is worse than losing data.

Rubrik protects IAM configuration, account state, and infrastructure metadata. After a compromise, restoring how the account was configured is as important as restoring the data itself.

12. Backup Meets Security (Finally)

Rubrik integrates threat awareness into recovery using entropy analysis, change rate anomaly detection, and snapshot divergence tracking. This answers the most dangerous question in recovery: which backup is actually safe to restore? Most platforms cannot answer this with confidence.

13. VMware First Class Citizen, Physical Hosts Still Lag

Rubrik's deepest integrations exist in VMware environments, including snapshot orchestration, instant VM recovery, and live mounts.

The uncomfortable reality remains that physical hosts with the largest datasets would benefit most from snapshot based protection, yet receive the least integration. This is an industry gap, not just a tooling one.

14. When Rubrik Is Not the Right Tool

Rubrik is not universal.

It is less optimal when bandwidth is severely constrained, estates are very small, or tape workflows are legally mandated.

Rubrik's value emerges at scale, under pressure, and during failure.

15. Conclusion: Boredom Is Success

Backups should be boring. Restores should be quiet. Executives should never know the platform exists.

The only time backups become exciting is when they fail, and that excitement is almost always lethal.

Rubrik is not interesting because it stores data. It is interesting because, when everything is already on fire, restore remains a controlled engineering exercise rather than a panic response.

References

  1. Gartner Magic Quadrant for Enterprise Backup and Recovery Solutions – https://www.gartner.com/en/documents/5138291
  2. Rubrik Technical Architecture Whitepapers – https://www.rubrik.com/resources
  3. Microsoft SQL Server Backup and Restore Internals – https://learn.microsoft.com/en-us/sql/relational-databases/backup-restore/backup-overview-sql-server
  4. VMware Snapshot and Backup Best Practices – https://knowledge.broadcom.com/external/article?legacyId=1025279
  5. AWS Backup and Recovery Documentation – https://docs.aws.amazon.com/aws-backup/
  6. NIST SP 800-209 Security Guidelines for Storage Infrastructure – https://csrc.nist.gov/publications/detail/sp/800-209/final
  7. Rubrik SQL Live Mount Documentation – https://www.rubrik.com/solutions/sql-live-mount
  8. Rubrik Oracle Live Mount Documentation – https://docs.rubrik.com/en-us/saas/oracle/oracle_live_mount.html
  9. Rubrik for Oracle and Microsoft SQL Server Data Sheet – https://www.rubrik.com/content/dam/rubrik/en/resources/data-sheet/Rubrik-for-Oracle-and-Microsoft-SQL-Sever-DS.pdf
  10. Rubrik Enhanced Performance for Microsoft SQL and Oracle Database – https://www.rubrik.com/blog/technology/2021/12/rubrik-enhanced-performance-for-microsoft-sql-and-oracle-database
  11. Rubrik PostgreSQL Support Announcement – https://www.rubrik.com/blog/technology/24/10/rubrik-expands-database-protection-with-postgre-sql-support-and-on-premises-sensitive-data-monitoring-for-microsoft-sql-server
  12. Rubrik Elastic App Service – https://www.rubrik.com/solutions/elastic-app-service
  13. Rubrik and VMware vSphere Reference Architecture – https://www.rubrik.com/content/dam/rubrik/en/resources/white-paper/ra-rubrik-vmware-vsphere.pdf
  14. Protecting Microsoft SQL Server with Rubrik Technical White Paper – https://www.rubrik.com/content/dam/rubrik/en/resources/white-paper/rwp-protecting-microsoft-sql-server-with-rubrik.pdf
  15. The Definitive Guide to Rubrik Cloud Data Management – https://www.rubrik.com/content/dam/rubrik/en/resources/white-paper/rwp-definitive-guide-to-rubrik-cdm.pdf
  16. Rubrik Oracle Tools GitHub Repository – https://github.com/rubrikinc/rubrik_oracle_tools
  17. Automating SQL Server Live Mounts with Rubrik – https://virtuallysober.com/2017/08/08/automating-sql-server-live-mounts-with-rubrik-alta-4-0/

Aurora PostgreSQL: Archiving and Restoring Partitions from Large Tables to Iceberg and Parquet on S3

A Complete Guide to Archiving, Restoring, and Querying Large Table Partitions

When dealing with multi-terabyte tables in Aurora PostgreSQL, keeping historical partitions online becomes increasingly expensive and operationally burdensome. This guide presents a complete solution for archiving partitions to S3 in Iceberg/Parquet format, restoring them when needed, and querying archived data directly via a Spring Boot API without database restoration.

1. Architecture Overview

The solution comprises three components:

  1. Archive Script: Exports a partition from Aurora PostgreSQL to Parquet files organised in Iceberg table format on S3
  2. Restore Script: Imports archived data from S3 back into a staging table for validation and migration to the main table
  3. Query API: A Spring Boot application that reads Parquet files directly from S3, applying predicate pushdown for efficient filtering

This approach reduces storage costs by approximately 70 to 80 percent compared to keeping data in Aurora, while maintaining full queryability through the API layer.

2. Prerequisites and Dependencies

2.1 Python Environment

pip install boto3 pyarrow psycopg2-binary pandas pyiceberg sqlalchemy

2.2 AWS Configuration

Ensure your environment has appropriate IAM permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::your-archive-bucket",
                "arn:aws:s3:::your-archive-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable"
            ],
            "Resource": "*"
        }
    ]
}

2.3 Java Dependencies

For the Spring Boot API, add these to your pom.xml:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <version>1.14.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>s3</artifactId>
        <version>2.25.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-core</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-parquet</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-aws</artifactId>
        <version>1.5.0</version>
    </dependency>
</dependencies>

3. The Archive Script

This script extracts a partition from Aurora PostgreSQL and writes it to S3 in Iceberg table format with Parquet data files.

3.1 Configuration Module

# config.py

from dataclasses import dataclass
from typing import Optional
import os


@dataclass
class DatabaseConfig:
    host: str
    port: int
    database: str
    user: str
    password: str

    @classmethod
    def from_environment(cls) -> "DatabaseConfig":
        return cls(
            host=os.environ["AURORA_HOST"],
            port=int(os.environ.get("AURORA_PORT", "5432")),
            database=os.environ["AURORA_DATABASE"],
            user=os.environ["AURORA_USER"],
            password=os.environ["AURORA_PASSWORD"]
        )

    @property
    def connection_string(self) -> str:
        return f"postgresql://{self.user}:{self.password}@{self.host}:{self.port}/{self.database}"


@dataclass
class S3Config:
    bucket: str
    prefix: str
    region: str

    @classmethod
    def from_environment(cls) -> "S3Config":
        return cls(
            bucket=os.environ["ARCHIVE_S3_BUCKET"],
            prefix=os.environ.get("ARCHIVE_S3_PREFIX", "aurora-archives"),
            region=os.environ.get("AWS_REGION", "eu-west-1")
        )


@dataclass  
class ArchiveConfig:
    table_name: str
    partition_column: str
    partition_value: str
    schema_name: str = "public"
    batch_size: int = 100000
    compression: str = "snappy"

3.2 Schema Introspection

# schema_inspector.py

from typing import Dict, List, Tuple, Any
import psycopg2
from psycopg2.extras import RealDictCursor
import pyarrow as pa


class SchemaInspector:
    """Inspects PostgreSQL table schema and converts to PyArrow schema."""

    POSTGRES_TO_ARROW = {
        "integer": pa.int32(),
        "bigint": pa.int64(),
        "smallint": pa.int16(),
        "real": pa.float32(),
        "double precision": pa.float64(),
        "numeric": pa.decimal128(38, 10),
        "text": pa.string(),
        "character varying": pa.string(),
        "varchar": pa.string(),
        "char": pa.string(),
        "character": pa.string(),
        "boolean": pa.bool_(),
        "date": pa.date32(),
        "timestamp without time zone": pa.timestamp("us"),
        "timestamp with time zone": pa.timestamp("us", tz="UTC"),
        "time without time zone": pa.time64("us"),
        "time with time zone": pa.time64("us"),
        "uuid": pa.string(),
        "json": pa.string(),
        "jsonb": pa.string(),
        "bytea": pa.binary(),
    }

    def __init__(self, connection_string: str):
        self.connection_string = connection_string

    def get_table_columns(
        self, 
        schema_name: str, 
        table_name: str
    ) -> List[Dict[str, Any]]:
        """Retrieve column definitions from information_schema."""
        query = """
            SELECT 
                column_name,
                data_type,
                is_nullable,
                column_default,
                ordinal_position,
                character_maximum_length,
                numeric_precision,
                numeric_scale
            FROM information_schema.columns
            WHERE table_schema = %s 
              AND table_name = %s
            ORDER BY ordinal_position
        """

        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor(cursor_factory=RealDictCursor) as cur:
                cur.execute(query, (schema_name, table_name))
                return list(cur.fetchall())

    def get_primary_key_columns(
        self, 
        schema_name: str, 
        table_name: str
    ) -> List[str]:
        """Retrieve primary key column names."""
        query = """
            SELECT a.attname
            FROM pg_index i
            JOIN pg_attribute a ON a.attrelid = i.indrelid 
                AND a.attnum = ANY(i.indkey)
            JOIN pg_class c ON c.oid = i.indrelid
            JOIN pg_namespace n ON n.oid = c.relnamespace
            WHERE i.indisprimary
              AND n.nspname = %s
              AND c.relname = %s
            ORDER BY array_position(i.indkey, a.attnum)
        """

        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor() as cur:
                cur.execute(query, (schema_name, table_name))
                return [row[0] for row in cur.fetchall()]

    def postgres_type_to_arrow(
        self, 
        pg_type: str, 
        precision: int = None, 
        scale: int = None
    ) -> pa.DataType:
        """Convert PostgreSQL type to PyArrow type."""
        pg_type_lower = pg_type.lower()

        if pg_type_lower == "numeric" and precision and scale:
            return pa.decimal128(precision, scale)

        if pg_type_lower in self.POSTGRES_TO_ARROW:
            return self.POSTGRES_TO_ARROW[pg_type_lower]

        if pg_type_lower.startswith("character varying"):
            return pa.string()

        if pg_type_lower.endswith("[]"):
            base_type = pg_type_lower[:-2]
            if base_type in self.POSTGRES_TO_ARROW:
                return pa.list_(self.POSTGRES_TO_ARROW[base_type])

        return pa.string()

    def build_arrow_schema(
        self, 
        schema_name: str, 
        table_name: str
    ) -> pa.Schema:
        """Build PyArrow schema from PostgreSQL table definition."""
        columns = self.get_table_columns(schema_name, table_name)

        fields = []
        for col in columns:
            arrow_type = self.postgres_type_to_arrow(
                col["data_type"],
                col.get("numeric_precision"),
                col.get("numeric_scale")
            )
            nullable = col["is_nullable"] == "YES"
            fields.append(pa.field(col["column_name"], arrow_type, nullable=nullable))

        return pa.schema(fields)

    def get_partition_info(
        self, 
        schema_name: str, 
        table_name: str
    ) -> Dict[str, Any]:
        """Get partition strategy information if table is partitioned."""
        query = """
            SELECT 
                pt.partstrat,
                array_agg(a.attname ORDER BY pos.idx) as partition_columns
            FROM pg_partitioned_table pt
            JOIN pg_class c ON c.oid = pt.partrelid
            JOIN pg_namespace n ON n.oid = c.relnamespace
            CROSS JOIN LATERAL unnest(pt.partattrs) WITH ORDINALITY AS pos(attnum, idx)
            JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = pos.attnum
            WHERE n.nspname = %s AND c.relname = %s
            GROUP BY pt.partstrat
        """

        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor(cursor_factory=RealDictCursor) as cur:
                cur.execute(query, (schema_name, table_name))
                result = cur.fetchone()

                if result:
                    strategy_map = {"r": "range", "l": "list", "h": "hash"}
                    return {
                        "is_partitioned": True,
                        "strategy": strategy_map.get(result["partstrat"], "unknown"),
                        "partition_columns": result["partition_columns"]
                    }

                return {"is_partitioned": False}

3.3 Archive Script

#!/usr/bin/env python3
# archive_partition.py

"""
Archive a partition from Aurora PostgreSQL to S3 in Iceberg/Parquet format.

Usage:
    python archive_partition.py \
        --table transactions \
        --partition-column transaction_date \
        --partition-value 2024-01 \
        --schema public
"""

import argparse
import json
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Generator, Dict, Any, Optional
import hashlib

import boto3
import pandas as pd
import psycopg2
from psycopg2.extras import RealDictCursor
import pyarrow as pa
import pyarrow.parquet as pq

from config import DatabaseConfig, S3Config, ArchiveConfig
from schema_inspector import SchemaInspector


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger(__name__)


class PartitionArchiver:
    """Archives PostgreSQL partitions to S3 in Iceberg format."""

    def __init__(
        self,
        db_config: DatabaseConfig,
        s3_config: S3Config,
        archive_config: ArchiveConfig
    ):
        self.db_config = db_config
        self.s3_config = s3_config
        self.archive_config = archive_config
        self.schema_inspector = SchemaInspector(db_config.connection_string)
        self.s3_client = boto3.client("s3", region_name=s3_config.region)

    def _generate_archive_path(self) -> str:
        """Generate S3 path for archive files."""
        return (
            f"{self.s3_config.prefix}/"
            f"{self.archive_config.schema_name}/"
            f"{self.archive_config.table_name}/"
            f"{self.archive_config.partition_column}={self.archive_config.partition_value}"
        )

    def _build_extraction_query(self) -> str:
        """Build query to extract partition data."""
        full_table = (
            f"{self.archive_config.schema_name}.{self.archive_config.table_name}"
        )

        partition_info = self.schema_inspector.get_partition_info(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        if partition_info["is_partitioned"]:
            partition_table = (
                f"{self.archive_config.table_name}_"
                f"{self.archive_config.partition_value.replace('-', '_')}"
            )
            return f"SELECT * FROM {self.archive_config.schema_name}.{partition_table}"

        return (
            f"SELECT * FROM {full_table} "
            f"WHERE {self.archive_config.partition_column} = %s"
        )

    def _stream_partition_data(self) -> Generator[pd.DataFrame, None, None]:
        """Stream partition data in batches."""
        query = self._build_extraction_query()
        partition_info = self.schema_inspector.get_partition_info(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        with psycopg2.connect(self.db_config.connection_string) as conn:
            with conn.cursor(
                name="archive_cursor",
                cursor_factory=RealDictCursor
            ) as cur:
                if partition_info["is_partitioned"]:
                    cur.execute(query)
                else:
                    cur.execute(query, (self.archive_config.partition_value,))

                while True:
                    rows = cur.fetchmany(self.archive_config.batch_size)
                    if not rows:
                        break
                    yield pd.DataFrame(rows)

    def _write_parquet_to_s3(
        self,
        table: pa.Table,
        file_number: int,
        arrow_schema: pa.Schema
    ) -> Dict[str, Any]:
        """Write a Parquet file to S3 and return metadata."""
        archive_path = self._generate_archive_path()
        file_name = f"data_{file_number:05d}.parquet"
        s3_key = f"{archive_path}/data/{file_name}"

        buffer = pa.BufferOutputStream()
        pq.write_table(
            table,
            buffer,
            compression=self.archive_config.compression,
            write_statistics=True
        )

        data = buffer.getvalue().to_pybytes()

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=s3_key,
            Body=data,
            ContentType="application/octet-stream"
        )

        return {
            "path": f"s3://{self.s3_config.bucket}/{s3_key}",
            "file_size_bytes": len(data),
            "record_count": table.num_rows
        }

    def _write_iceberg_metadata(
        self,
        arrow_schema: pa.Schema,
        data_files: list,
        total_records: int
    ) -> None:
        """Write Iceberg table metadata files."""
        archive_path = self._generate_archive_path()

        schema_dict = {
            "type": "struct",
            "schema_id": 0,
            "fields": []
        }

        for i, field in enumerate(arrow_schema):
            field_type = self._arrow_to_iceberg_type(field.type)
            schema_dict["fields"].append({
                "id": i + 1,
                "name": field.name,
                "required": not field.nullable,
                "type": field_type
            })

        table_uuid = hashlib.md5(
            f"{self.archive_config.table_name}_{datetime.utcnow().isoformat()}".encode()
        ).hexdigest()

        manifest_entries = []
        for df in data_files:
            manifest_entries.append({
                "status": 1,
                "snapshot_id": 1,
                "data_file": {
                    "file_path": df["path"],
                    "file_format": "PARQUET",
                    "record_count": df["record_count"],
                    "file_size_in_bytes": df["file_size_bytes"]
                }
            })

        metadata = {
            "format_version": 2,
            "table_uuid": table_uuid,
            "location": f"s3://{self.s3_config.bucket}/{archive_path}",
            "last_sequence_number": 1,
            "last_updated_ms": int(datetime.utcnow().timestamp() * 1000),
            "last_column_id": len(arrow_schema),
            "current_schema_id": 0,
            "schemas": [schema_dict],
            "partition_spec": [],
            "default_spec_id": 0,
            "last_partition_id": 0,
            "properties": {
                "source.table": self.archive_config.table_name,
                "source.schema": self.archive_config.schema_name,
                "source.partition_column": self.archive_config.partition_column,
                "source.partition_value": self.archive_config.partition_value,
                "archive.timestamp": datetime.utcnow().isoformat(),
                "archive.total_records": str(total_records)
            },
            "current_snapshot_id": 1,
            "snapshots": [
                {
                    "snapshot_id": 1,
                    "timestamp_ms": int(datetime.utcnow().timestamp() * 1000),
                    "summary": {
                        "operation": "append",
                        "added_data_files": str(len(data_files)),
                        "total_records": str(total_records)
                    },
                    "manifest_list": f"s3://{self.s3_config.bucket}/{archive_path}/metadata/manifest_list.json"
                }
            ]
        }

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/metadata/v1.metadata.json",
            Body=json.dumps(metadata, indent=2),
            ContentType="application/json"
        )

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/metadata/manifest_list.json",
            Body=json.dumps(manifest_entries, indent=2),
            ContentType="application/json"
        )

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/metadata/version_hint.text",
            Body="1",
            ContentType="text/plain"
        )

    def _arrow_to_iceberg_type(self, arrow_type: pa.DataType) -> str:
        """Convert PyArrow type to Iceberg type string."""
        type_mapping = {
            pa.int16(): "int",
            pa.int32(): "int",
            pa.int64(): "long",
            pa.float32(): "float",
            pa.float64(): "double",
            pa.bool_(): "boolean",
            pa.string(): "string",
            pa.binary(): "binary",
            pa.date32(): "date",
        }

        if arrow_type in type_mapping:
            return type_mapping[arrow_type]

        if pa.types.is_timestamp(arrow_type):
            if arrow_type.tz:
                return "timestamptz"
            return "timestamp"

        if pa.types.is_decimal(arrow_type):
            return f"decimal({arrow_type.precision},{arrow_type.scale})"

        if pa.types.is_list(arrow_type):
            inner = self._arrow_to_iceberg_type(arrow_type.value_type)
            return f"list<{inner}>"

        return "string"

    def _save_schema_snapshot(self, arrow_schema: pa.Schema) -> None:
        """Save schema information for restore validation."""
        archive_path = self._generate_archive_path()

        columns = self.schema_inspector.get_table_columns(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        pk_columns = self.schema_inspector.get_primary_key_columns(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        schema_snapshot = {
            "source_table": {
                "schema": self.archive_config.schema_name,
                "table": self.archive_config.table_name
            },
            "columns": columns,
            "primary_key_columns": pk_columns,
            "arrow_schema": arrow_schema.to_string(),
            "archived_at": datetime.utcnow().isoformat()
        }

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/schema_snapshot.json",
            Body=json.dumps(schema_snapshot, indent=2, default=str),
            ContentType="application/json"
        )

    def archive(self) -> Dict[str, Any]:
        """Execute the archive operation."""
        logger.info(
            f"Starting archive of {self.archive_config.schema_name}."
            f"{self.archive_config.table_name} "
            f"partition {self.archive_config.partition_column}="
            f"{self.archive_config.partition_value}"
        )

        arrow_schema = self.schema_inspector.build_arrow_schema(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        self._save_schema_snapshot(arrow_schema)

        data_files = []
        total_records = 0
        file_number = 0

        for batch_df in self._stream_partition_data():
            table = pa.Table.from_pandas(batch_df, schema=arrow_schema)
            file_meta = self._write_parquet_to_s3(table, file_number, arrow_schema)
            data_files.append(file_meta)
            total_records += file_meta["record_count"]
            file_number += 1

            logger.info(
                f"Written file {file_number}: {file_meta['record_count']} records"
            )

        self._write_iceberg_metadata(arrow_schema, data_files, total_records)

        result = {
            "status": "success",
            "table": f"{self.archive_config.schema_name}.{self.archive_config.table_name}",
            "partition": f"{self.archive_config.partition_column}={self.archive_config.partition_value}",
            "location": f"s3://{self.s3_config.bucket}/{self._generate_archive_path()}",
            "total_records": total_records,
            "total_files": len(data_files),
            "total_bytes": sum(f["file_size_bytes"] for f in data_files),
            "archived_at": datetime.utcnow().isoformat()
        }

        logger.info(f"Archive complete: {total_records} records in {len(data_files)} files")
        return result


def main():
    parser = argparse.ArgumentParser(
        description="Archive Aurora PostgreSQL partition to S3"
    )
    parser.add_argument("--table", required=True, help="Table name")
    parser.add_argument("--partition-column", required=True, help="Partition column name")
    parser.add_argument("--partition-value", required=True, help="Partition value to archive")
    parser.add_argument("--schema", default="public", help="Schema name")
    parser.add_argument("--batch-size", type=int, default=100000, help="Batch size for streaming")

    args = parser.parse_args()

    db_config = DatabaseConfig.from_environment()
    s3_config = S3Config.from_environment()
    archive_config = ArchiveConfig(
        table_name=args.table,
        partition_column=args.partition_column,
        partition_value=args.partition_value,
        schema_name=args.schema,
        batch_size=args.batch_size
    )

    archiver = PartitionArchiver(db_config, s3_config, archive_config)
    result = archiver.archive()

    print(json.dumps(result, indent=2))
    return 0


if __name__ == "__main__":
    sys.exit(main())
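For a hypothetical public.transactions table archived with --partition-value 2024-01, the writers above lay out S3 keys as follows (prefix, table name, and file count are illustrative):

```python
# Mirrors _generate_archive_path plus the data/ and metadata/ suffixes
# used by the writers above; all names here are illustrative
prefix, schema, table = "aurora-archives", "public", "transactions"
partition = "transaction_date=2024-01"
root = f"{prefix}/{schema}/{table}/{partition}"

keys = [
    f"{root}/schema_snapshot.json",
    f"{root}/data/data_00000.parquet",
    f"{root}/metadata/v1.metadata.json",
    f"{root}/metadata/manifest_list.json",
    f"{root}/metadata/version_hint.text",
]
for key in keys:
    print(key)
```

The root of this layout is the `--source-path` the restore script expects.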

4. The Restore Script

This script reverses the archive operation by reading Parquet files from S3 and loading them into a staging table.

4.1 Restore Script

#!/usr/bin/env python3
# restore_partition.py

"""
Restore an archived partition from S3 back to Aurora PostgreSQL.

Usage:
    python restore_partition.py \
        --source-path s3://bucket/prefix/schema/table/partition_col=value \
        --target-table transactions_staging \
        --target-schema public
"""

import argparse
import json
import logging
import sys
from datetime import datetime
from typing import Dict, Any, List, Optional
from urllib.parse import urlparse

import boto3
import pandas as pd
import psycopg2
from psycopg2 import sql
from psycopg2.extras import execute_values
import pyarrow.parquet as pq

from config import DatabaseConfig


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger(__name__)


class PartitionRestorer:
    """Restores archived partitions from S3 to PostgreSQL."""

    def __init__(
        self,
        db_config: DatabaseConfig,
        source_path: str,
        target_schema: str,
        target_table: str,
        create_table: bool = True,
        batch_size: int = 10000
    ):
        self.db_config = db_config
        self.source_path = source_path
        self.target_schema = target_schema
        self.target_table = target_table
        self.create_table = create_table
        self.batch_size = batch_size

        parsed = urlparse(source_path)
        self.bucket = parsed.netloc
        self.prefix = parsed.path.lstrip("/")

        self.s3_client = boto3.client("s3")

    def _load_schema_snapshot(self) -> Dict[str, Any]:
        """Load the schema snapshot from the archive."""
        response = self.s3_client.get_object(
            Bucket=self.bucket,
            Key=f"{self.prefix}/schema_snapshot.json"
        )
        return json.loads(response["Body"].read())

    def _load_iceberg_metadata(self) -> Dict[str, Any]:
        """Load Iceberg metadata."""
        response = self.s3_client.get_object(
            Bucket=self.bucket,
            Key=f"{self.prefix}/metadata/v1.metadata.json"
        )
        return json.loads(response["Body"].read())

    def _list_data_files(self) -> List[str]:
        """List all Parquet data files in the archive."""
        data_prefix = f"{self.prefix}/data/"

        files = []
        paginator = self.s3_client.get_paginator("list_objects_v2")

        for page in paginator.paginate(Bucket=self.bucket, Prefix=data_prefix):
            for obj in page.get("Contents", []):
                if obj["Key"].endswith(".parquet"):
                    files.append(obj["Key"])

        return sorted(files)

    def _postgres_type_from_column_def(self, col: Dict[str, Any]) -> str:
        """Convert column definition to PostgreSQL type."""
        data_type = col["data_type"]

        if data_type == "character varying":
            max_len = col.get("character_maximum_length")
            if max_len:
                return f"varchar({max_len})"
            return "text"

        if data_type == "numeric":
            precision = col.get("numeric_precision")
            scale = col.get("numeric_scale")
            if precision and scale:
                return f"numeric({precision},{scale})"
            return "numeric"

        return data_type

    def _create_staging_table(
        self,
        schema_snapshot: Dict[str, Any],
        conn: psycopg2.extensions.connection
    ) -> None:
        """Create the staging table based on archived schema."""
        columns = schema_snapshot["columns"]

        column_defs = []
        for col in columns:
            pg_type = self._postgres_type_from_column_def(col)
            nullable = "" if col["is_nullable"] == "YES" else " NOT NULL"
            name = col["column_name"]
            column_defs.append(f'    "{name}" {pg_type}{nullable}')

        # column definitions must be comma-separated inside CREATE TABLE
        columns_sql = ",\n".join(column_defs)
        create_sql = f"""
            DROP TABLE IF EXISTS {self.target_schema}.{self.target_table};
            CREATE TABLE {self.target_schema}.{self.target_table} (
{columns_sql}
            )
        """

        with conn.cursor() as cur:
            cur.execute(create_sql)
        conn.commit()

        logger.info(f"Created staging table {self.target_schema}.{self.target_table}")

    def _insert_batch(
        self,
        df: pd.DataFrame,
        columns: List[str],
        conn: psycopg2.extensions.connection
    ) -> int:
        """Insert a batch of records into the staging table."""
        if df.empty:
            return 0

        # Normalise datetimes to ISO strings, then NaN/NaT to NULL, before insert
        for col in df.columns:
            if pd.api.types.is_datetime64_any_dtype(df[col]):
                df[col] = df[col].apply(
                    lambda x: x.isoformat() if pd.notna(x) else None
                )

        df = df.astype(object).where(pd.notna(df), None)
        values = [tuple(row) for row in df[columns].values]

        column_names = ", ".join(f'"{c}"' for c in columns)
        insert_sql = f"""
            INSERT INTO {self.target_schema}.{self.target_table} ({column_names})
            VALUES %s
        """

        with conn.cursor() as cur:
            execute_values(cur, insert_sql, values, page_size=self.batch_size)

        return len(values)

    def restore(self) -> Dict[str, Any]:
        """Execute the restore operation."""
        logger.info(f"Starting restore from {self.source_path}")

        schema_snapshot = self._load_schema_snapshot()
        metadata = self._load_iceberg_metadata()
        data_files = self._list_data_files()

        logger.info(f"Found {len(data_files)} data files to restore")

        columns = [col["column_name"] for col in schema_snapshot["columns"]]

        with psycopg2.connect(self.db_config.connection_string) as conn:
            if self.create_table:
                self._create_staging_table(schema_snapshot, conn)

            total_records = 0

            for file_key in data_files:
                logger.info(f"Restoring from {file_key}")

                response = self.s3_client.get_object(Bucket=self.bucket, Key=file_key)
                # pyarrow needs a seekable source and boto3's StreamingBody
                # is not, so buffer the object in memory before reading it
                import io
                table = pq.read_table(io.BytesIO(response["Body"].read()))
                df = table.to_pandas()

                file_records = 0
                for start in range(0, len(df), self.batch_size):
                    batch_df = df.iloc[start:start + self.batch_size]
                    inserted = self._insert_batch(batch_df, columns, conn)
                    file_records += inserted

                conn.commit()
                total_records += file_records
                logger.info(f"Restored {file_records} records from {file_key}")

        result = {
            "status": "success",
            "source": self.source_path,
            "target": f"{self.target_schema}.{self.target_table}",
            "total_records": total_records,
            "files_processed": len(data_files),
            "restored_at": datetime.utcnow().isoformat()
        }

        logger.info(f"Restore complete: {total_records} records")
        return result


def main():
    parser = argparse.ArgumentParser(
        description="Restore archived partition from S3 to PostgreSQL"
    )
    parser.add_argument("--source-path", required=True, help="S3 path to archived partition")
    parser.add_argument("--target-table", required=True, help="Target table name")
    parser.add_argument("--target-schema", default="public", help="Target schema name")
    parser.add_argument("--batch-size", type=int, default=10000, help="Insert batch size")
    parser.add_argument("--no-create", action="store_true", help="Don't create table, assume it exists")

    args = parser.parse_args()

    db_config = DatabaseConfig.from_environment()

    restorer = PartitionRestorer(
        db_config=db_config,
        source_path=args.source_path,
        target_schema=args.target_schema,
        target_table=args.target_table,
        create_table=not args.no_create,
        batch_size=args.batch_size
    )

    result = restorer.restore()
    print(json.dumps(result, indent=2))
    return 0


if __name__ == "__main__":
    sys.exit(main())
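The restore loop above walks the DataFrame in fixed-size slices. The slicing pattern in isolation, using plain lists so no pandas is needed:

```python
def batches(rows, batch_size):
    """Yield successive slices of at most batch_size rows, mirroring the restore loop."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

chunks = list(batches(list(range(10)), 4))
```

A 10-row input with a batch size of 4 yields batches of 4, 4, and 2 rows; the final partial batch is inserted like any other.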

5. SQL Operations for Partition Migration

Once data is restored to a staging table, you need SQL operations to validate and migrate it to the main table.

5.1 Schema Validation

-- Validate that staging table schema matches the main table

CREATE OR REPLACE FUNCTION validate_table_schemas(
    p_source_schema TEXT,
    p_source_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT
) RETURNS TABLE (
    validation_type TEXT,
    column_name TEXT,
    source_value TEXT,
    target_value TEXT,
    is_valid BOOLEAN
) AS $$
BEGIN
    -- Check column count
    RETURN QUERY
    SELECT 
        'column_count'::TEXT,
        NULL::TEXT,
        src.cnt::TEXT,
        tgt.cnt::TEXT,
        src.cnt = tgt.cnt
    FROM 
        (SELECT COUNT(*)::INT AS cnt 
         FROM information_schema.columns 
         WHERE table_schema = p_source_schema 
           AND table_name = p_source_table) src,
        (SELECT COUNT(*)::INT AS cnt 
         FROM information_schema.columns 
         WHERE table_schema = p_target_schema 
           AND table_name = p_target_table) tgt;

    -- Check each column exists with matching type
    RETURN QUERY
    SELECT 
        'column_definition'::TEXT,
        src.column_name,
        src.data_type || COALESCE('(' || src.character_maximum_length::TEXT || ')', ''),
        COALESCE(tgt.data_type || COALESCE('(' || tgt.character_maximum_length::TEXT || ')', ''), 'MISSING'),
        src.data_type = COALESCE(tgt.data_type, '') 
            AND COALESCE(src.character_maximum_length, 0) = COALESCE(tgt.character_maximum_length, 0)
    FROM 
        information_schema.columns src
    LEFT JOIN 
        information_schema.columns tgt 
        ON tgt.table_schema = p_target_schema 
        AND tgt.table_name = p_target_table
        AND tgt.column_name = src.column_name
    WHERE 
        src.table_schema = p_source_schema 
        AND src.table_name = p_source_table
    ORDER BY src.ordinal_position;

    -- Check nullability
    RETURN QUERY
    SELECT 
        'nullability'::TEXT,
        src.column_name,
        src.is_nullable,
        COALESCE(tgt.is_nullable, 'MISSING'),
        src.is_nullable = COALESCE(tgt.is_nullable, '')
    FROM 
        information_schema.columns src
    LEFT JOIN 
        information_schema.columns tgt 
        ON tgt.table_schema = p_target_schema 
        AND tgt.table_name = p_target_table
        AND tgt.column_name = src.column_name
    WHERE 
        src.table_schema = p_source_schema 
        AND src.table_name = p_source_table
    ORDER BY src.ordinal_position;
END;
$$ LANGUAGE plpgsql;

-- Usage
SELECT * FROM validate_table_schemas('public', 'transactions_staging', 'public', 'transactions');
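The comparison the function performs can be sketched in Python, taking each table's schema as an ordered mapping of column name to (data_type, max_length). The helper name `compare_schemas` and the tuple row shape are illustrative, not part of the SQL API:

```python
def compare_schemas(source, target):
    """Compare column definitions the way validate_table_schemas does:
    every source column must exist in the target with the same type and length."""
    rows = [("column_count", None, len(source), len(target), len(source) == len(target))]
    for name, (dtype, maxlen) in source.items():
        tgt = target.get(name)
        rows.append((
            "column_definition",
            name,
            (dtype, maxlen),
            tgt if tgt is not None else "MISSING",
            tgt == (dtype, maxlen),
        ))
    return rows

report = compare_schemas(
    {"id": ("bigint", None), "name": ("character varying", 100)},
    {"id": ("bigint", None), "name": ("character varying", 50)},
)
```

Here the length mismatch on `name` produces an invalid row, just as the SQL version would flag a `character varying(100)` vs `character varying(50)` difference.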

5.2 Comprehensive Validation Report

-- Generate a full validation report before migration

CREATE OR REPLACE FUNCTION generate_migration_report(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT,
    p_partition_column TEXT,
    p_partition_value TEXT
) RETURNS TABLE (
    check_name TEXT,
    result TEXT,
    details JSONB
) AS $$
DECLARE
    v_staging_count BIGINT;
    v_existing_count BIGINT;
    v_schema_valid BOOLEAN;
BEGIN
    -- Get staging table count
    EXECUTE format(
        'SELECT COUNT(*) FROM %I.%I',
        p_staging_schema, p_staging_table
    ) INTO v_staging_count;

    RETURN QUERY SELECT 
        'staging_record_count'::TEXT,
        'INFO'::TEXT,
        jsonb_build_object('count', v_staging_count);

    -- Check for existing data in target partition
    BEGIN
        EXECUTE format(
            'SELECT COUNT(*) FROM %I.%I WHERE %I = $1',
            p_target_schema, p_target_table, p_partition_column
        ) INTO v_existing_count USING p_partition_value;

        IF v_existing_count > 0 THEN
            RETURN QUERY SELECT 
                'existing_partition_data'::TEXT,
                'WARNING'::TEXT,
                jsonb_build_object(
                    'count', v_existing_count,
                    'message', 'Target partition already contains data'
                );
        ELSE
            RETURN QUERY SELECT 
                'existing_partition_data'::TEXT,
                'OK'::TEXT,
                jsonb_build_object('count', 0);
        END IF;
    EXCEPTION WHEN undefined_column THEN
        RETURN QUERY SELECT 
            'partition_column_check'::TEXT,
            'ERROR'::TEXT,
            jsonb_build_object(
                'message', format('Partition column %s not found', p_partition_column)
            );
    END;

    -- Validate schemas match
    SELECT bool_and(is_valid) INTO v_schema_valid
    FROM validate_table_schemas(
        p_staging_schema, p_staging_table,
        p_target_schema, p_target_table
    );

    RETURN QUERY SELECT 
        'schema_validation'::TEXT,
        CASE WHEN v_schema_valid THEN 'OK' ELSE 'ERROR' END::TEXT,
        jsonb_build_object('schemas_match', v_schema_valid);

    -- Check for null values in columns that are NOT NULL in the target.
    -- This has to be dynamic SQL: information_schema only describes the
    -- columns, so the null counts must run against the staging data itself.
    DECLARE
        v_col RECORD;
        v_null_count BIGINT;
    BEGIN
        FOR v_col IN
            SELECT column_name
            FROM information_schema.columns
            WHERE table_schema = p_target_schema
              AND table_name = p_target_table
              AND is_nullable = 'NO'
        LOOP
            BEGIN
                EXECUTE format(
                    'SELECT COUNT(*) FROM %I.%I WHERE %I IS NULL',
                    p_staging_schema, p_staging_table, v_col.column_name
                ) INTO v_null_count;
            EXCEPTION WHEN undefined_column THEN
                -- Column missing from staging; schema validation above reports it
                v_null_count := NULL;
            END;

            RETURN QUERY SELECT
                ('null_check_' || v_col.column_name)::TEXT,
                CASE WHEN v_null_count IS NULL OR v_null_count > 0
                     THEN 'ERROR' ELSE 'OK' END::TEXT,
                jsonb_build_object('null_count', v_null_count);
        END LOOP;
    END;
END;
$$ LANGUAGE plpgsql;

-- Usage
SELECT * FROM generate_migration_report(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_date', '2024-01'
);
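A caller would typically gate the migration on this report: abort on any ERROR, require explicit confirmation on WARNING, proceed otherwise. A sketch of that gating logic, with rows shaped like the function's check_name/result/details columns:

```python
def migration_decision(report_rows):
    """Return 'abort', 'confirm', or 'proceed' from (check_name, result, details) rows."""
    results = {result for _, result, _ in report_rows}
    if "ERROR" in results:
        return "abort"
    if "WARNING" in results:
        return "confirm"
    return "proceed"

decision = migration_decision([
    ("staging_record_count", "INFO", {"count": 120000}),
    ("existing_partition_data", "WARNING", {"count": 42}),
    ("schema_validation", "OK", {"schemas_match": True}),
])
```

The WARNING from pre-existing partition data downgrades the run to "confirm", which is exactly when you would decide whether to pass p_delete_existing to the migration procedure.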

5.3 Partition Migration

-- Migrate data from staging table to main table

CREATE OR REPLACE PROCEDURE migrate_partition_data(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT,
    p_partition_column TEXT,
    p_partition_value TEXT,
    p_delete_existing BOOLEAN DEFAULT FALSE,
    p_batch_size INTEGER DEFAULT 50000
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_columns TEXT;
    v_total_migrated BIGINT := 0;
    v_batch_migrated BIGINT;
    v_validation_passed BOOLEAN;
BEGIN
    -- Validate schemas match
    SELECT bool_and(is_valid) INTO v_validation_passed
    FROM validate_table_schemas(
        p_staging_schema, p_staging_table,
        p_target_schema, p_target_table
    );

    IF NOT v_validation_passed THEN
        RAISE EXCEPTION 'Schema validation failed. Run validate_table_schemas() for details.';
    END IF;

    -- Build column list
    SELECT string_agg(quote_ident(column_name), ', ' ORDER BY ordinal_position)
    INTO v_columns
    FROM information_schema.columns
    WHERE table_schema = p_staging_schema
      AND table_name = p_staging_table;

    -- Delete existing data if requested
    IF p_delete_existing THEN
        EXECUTE format(
            'DELETE FROM %I.%I WHERE %I = $1',
            p_target_schema, p_target_table, p_partition_column
        ) USING p_partition_value;

        RAISE NOTICE 'Deleted existing data for partition % = %', 
            p_partition_column, p_partition_value;
    END IF;

    -- Migrate in batches by moving rows out of the staging table.
    -- DELETE ... RETURNING feeds the INSERT, so each iteration shrinks the
    -- staging table and the loop terminates when a batch comes back empty.
    -- The staging table is emptied as a side effect; it is normally dropped
    -- afterwards anyway (see cleanup_after_migration).
    LOOP
        EXECUTE format($sql$
            WITH moved AS (
                DELETE FROM %I.%I
                WHERE ctid IN (SELECT ctid FROM %I.%I LIMIT $1)
                RETURNING %s
            ),
            inserted AS (
                INSERT INTO %I.%I (%s)
                SELECT %s FROM moved
                RETURNING 1
            )
            SELECT COUNT(*) FROM inserted
        $sql$,
            p_staging_schema, p_staging_table,
            p_staging_schema, p_staging_table,
            v_columns,
            p_target_schema, p_target_table, v_columns,
            v_columns
        ) INTO v_batch_migrated USING p_batch_size;

        v_total_migrated := v_total_migrated + v_batch_migrated;

        IF v_batch_migrated = 0 THEN
            EXIT;
        END IF;

        RAISE NOTICE 'Migrated batch: % records (total: %)', v_batch_migrated, v_total_migrated;
        COMMIT;
    END LOOP;

    RAISE NOTICE 'Migration complete. Total records migrated: %', v_total_migrated;
END;
$$;

-- Usage
CALL migrate_partition_data(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_date', '2024-01',
    TRUE,   -- delete existing
    50000   -- batch size
);
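One way to model the intended batched migration is a loop that drains the staging set a batch at a time and stops when a batch comes back empty. This is a sketch of the control flow only, not the SQL itself:

```python
def migrate_in_batches(staging, target, batch_size):
    """Move rows from staging to target batch_size at a time; return total moved."""
    total = 0
    while True:
        # Take the next batch and shrink staging in place
        batch, staging[:] = staging[:batch_size], staging[batch_size:]
        if not batch:
            break  # an empty batch means staging is drained
        target.extend(batch)
        total += len(batch)
    return total

staging = list(range(7))
target = []
moved = migrate_in_batches(staging, target, 3)
```

With 7 rows and a batch size of 3 the loop runs three productive iterations (3 + 3 + 1) and then exits on the empty fourth batch, which is the same termination condition the procedure checks with v_batch_migrated = 0.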

5.4 Attach Partition (for Partitioned Tables)

-- For natively partitioned tables, attach the staging table as a partition

CREATE OR REPLACE PROCEDURE attach_restored_partition(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT,
    p_partition_column TEXT,
    p_partition_start TEXT,
    p_partition_end TEXT
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_partition_name TEXT;
    v_constraint_name TEXT;
BEGIN
    -- Validate schemas match
    IF NOT (
        SELECT bool_and(is_valid)
        FROM validate_table_schemas(
            p_staging_schema, p_staging_table,
            p_target_schema, p_target_table
        )
    ) THEN
        RAISE EXCEPTION 'Schema validation failed';
    END IF;

    -- Add a NOT VALID check constraint matching the partition bounds
    v_constraint_name := p_staging_table || '_partition_check';

    EXECUTE format($sql$
        ALTER TABLE %I.%I 
        ADD CONSTRAINT %I 
        CHECK (%I >= %L AND %I < %L) NOT VALID
    $sql$,
        p_staging_schema, p_staging_table,
        v_constraint_name,
        p_partition_column, p_partition_start,
        p_partition_column, p_partition_end
    );

    -- Validate separately: NOT VALID keeps the ADD itself instantaneous, and
    -- a validated check that implies the bounds lets ATTACH PARTITION skip
    -- its full-table scan
    EXECUTE format($sql$
        ALTER TABLE %I.%I 
        VALIDATE CONSTRAINT %I
    $sql$,
        p_staging_schema, p_staging_table,
        v_constraint_name
    );

    -- Detach old partition if exists
    v_partition_name := p_target_table || '_' || replace(p_partition_start, '-', '_');

    BEGIN
        EXECUTE format($sql$
            ALTER TABLE %I.%I 
            DETACH PARTITION %I.%I
        $sql$,
            p_target_schema, p_target_table,
            p_target_schema, v_partition_name
        );
        -- Rename the detached table aside so the staging rename below cannot
        -- collide with it; dropping the _old copy is left to the operator
        EXECUTE format($sql$
            ALTER TABLE %I.%I RENAME TO %I
        $sql$,
            p_target_schema, v_partition_name,
            v_partition_name || '_old'
        );
        RAISE NOTICE 'Detached existing partition % (kept as %)', 
            v_partition_name, v_partition_name || '_old';
    EXCEPTION WHEN undefined_table THEN
        RAISE NOTICE 'No existing partition to detach';
    END;

    -- Rename staging table to partition name
    EXECUTE format($sql$
        ALTER TABLE %I.%I RENAME TO %I
    $sql$,
        p_staging_schema, p_staging_table,
        v_partition_name
    );

    -- Attach as partition
    EXECUTE format($sql$
        ALTER TABLE %I.%I 
        ATTACH PARTITION %I.%I 
        FOR VALUES FROM (%L) TO (%L)
    $sql$,
        p_target_schema, p_target_table,
        p_staging_schema, v_partition_name,
        p_partition_start, p_partition_end
    );

    RAISE NOTICE 'Successfully attached partition % to %', 
        v_partition_name, p_target_table;
END;
$$;

-- Usage for range partitioned table
CALL attach_restored_partition(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_date',
    '2024-01-01', '2024-02-01'
);
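The procedure derives the physical partition name from the target table and the range start, with hyphens mapped to underscores. The same derivation in Python:

```python
def partition_name(target_table, partition_start):
    """Mirror v_partition_name: target table plus the range start, '-' -> '_'."""
    return f"{target_table}_{partition_start.replace('-', '_')}"

name = partition_name("transactions", "2024-01-01")
```

So the January 2024 partition of `transactions` becomes `transactions_2024_01_01`, which is the name the detach step looks for and the staging table is renamed to.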

5.5 Cleanup Script

-- Clean up after successful migration

CREATE OR REPLACE PROCEDURE cleanup_after_migration(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_verify_target_schema TEXT DEFAULT NULL,
    p_verify_target_table TEXT DEFAULT NULL,
    p_verify_count BOOLEAN DEFAULT TRUE
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_staging_count BIGINT;
    v_target_count BIGINT;
BEGIN
    IF p_verify_count AND p_verify_target_schema IS NOT NULL THEN
        EXECUTE format(
            'SELECT COUNT(*) FROM %I.%I',
            p_staging_schema, p_staging_table
        ) INTO v_staging_count;

        EXECUTE format(
            'SELECT COUNT(*) FROM %I.%I',
            p_verify_target_schema, p_verify_target_table
        ) INTO v_target_count;

        IF v_target_count < v_staging_count THEN
            RAISE WARNING 'Target count (%) is less than staging count (%). Migration may be incomplete.',
                v_target_count, v_staging_count;
            RETURN;
        END IF;
    END IF;

    EXECUTE format(
        'DROP TABLE IF EXISTS %I.%I',
        p_staging_schema, p_staging_table
    );

    RAISE NOTICE 'Dropped staging table %.%', p_staging_schema, p_staging_table;
END;
$$;

-- Usage
CALL cleanup_after_migration(
    'public', 'transactions_staging',
    'public', 'transactions',
    TRUE
);

6. Spring Boot Query API

This API allows querying archived data directly from S3 without restoring to the database.

6.1 Project Structure

src/main/java/com/example/archivequery/
    ArchiveQueryApplication.java
    config/
        AwsConfig.java
        ParquetConfig.java
    controller/
        ArchiveQueryController.java
    service/
        ParquetQueryService.java
        PredicateParser.java
    model/
        QueryRequest.java
        QueryResponse.java
        ArchiveMetadata.java
    predicate/
        Predicate.java
        ComparisonPredicate.java
        LogicalPredicate.java
        PredicateEvaluator.java

6.2 Application Configuration

// ArchiveQueryApplication.java
package com.example.archivequery;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class ArchiveQueryApplication {
    public static void main(String[] args) {
        SpringApplication.run(ArchiveQueryApplication.class, args);
    }
}

6.3 AWS Configuration

// config/AwsConfig.java
package com.example.archivequery.config;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

@Configuration
public class AwsConfig {

    @Value("${aws.region:eu-west-1}")
    private String region;

    @Bean
    public S3Client s3Client() {
        return S3Client.builder()
            .region(Region.of(region))
            .credentialsProvider(DefaultCredentialsProvider.create())
            .build();
    }
}

6.4 Parquet Configuration

// config/ParquetConfig.java
package com.example.archivequery.config;

import org.apache.hadoop.conf.Configuration;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

@org.springframework.context.annotation.Configuration
public class ParquetConfig {

    @Value("${aws.region:eu-west-1}")
    private String region;

    @Bean
    public Configuration hadoopConfiguration() {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        conf.set("fs.s3a.aws.credentials.provider", 
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain");
        conf.set("fs.s3a.endpoint.region", region);
        return conf;
    }
}

6.5 Model Classes

// model/QueryRequest.java
package com.example.archivequery.model;

import java.util.List;
import java.util.Map;

public record QueryRequest(
    String archivePath,
    List<String> columns,
    Map<String, Object> filters,
    String filterExpression,
    Integer limit,
    Integer offset
) {
    public QueryRequest {
        if (limit == null) limit = 1000;
        if (offset == null) offset = 0;
    }
}
// model/QueryResponse.java
package com.example.archivequery.model;

import java.util.List;
import java.util.Map;

public record QueryResponse(
    List<Map<String, Object>> data,
    long totalMatched,
    long totalScanned,
    long executionTimeMs,
    Map<String, String> schema,
    String archivePath
) {}
// model/ArchiveMetadata.java
package com.example.archivequery.model;

import java.time.Instant;
import java.util.List;
import java.util.Map;

public record ArchiveMetadata(
    String tableUuid,
    String location,
    List<ColumnDefinition> columns,
    Map<String, String> properties,
    Instant archivedAt
) {
    public record ColumnDefinition(
        int id,
        String name,
        String type,
        boolean required
    ) {}
}

6.6 Predicate Classes

// predicate/Predicate.java
package com.example.archivequery.predicate;

import java.util.Map;

public sealed interface Predicate 
    permits ComparisonPredicate, LogicalPredicate {

    boolean evaluate(Map<String, Object> record);
}
// predicate/ComparisonPredicate.java
package com.example.archivequery.predicate;

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Map;
import java.util.Objects;

public record ComparisonPredicate(
    String column,
    Operator operator,
    Object value
) implements Predicate {

    public enum Operator {
        EQ, NE, GT, GE, LT, LE, LIKE, IN, IS_NULL, IS_NOT_NULL
    }

    @Override
    public boolean evaluate(Map<String, Object> record) {
        Object recordValue = record.get(column);
        // convertValue(recordValue) coerces the filter's value (not the
        // record's) to the record value's runtime type before comparing

        return switch (operator) {
            case IS_NULL -> recordValue == null;
            case IS_NOT_NULL -> recordValue != null;
            case EQ -> Objects.equals(recordValue, convertValue(recordValue));
            case NE -> !Objects.equals(recordValue, convertValue(recordValue));
            case GT -> compare(recordValue, convertValue(recordValue)) > 0;
            case GE -> compare(recordValue, convertValue(recordValue)) >= 0;
            case LT -> compare(recordValue, convertValue(recordValue)) < 0;
            case LE -> compare(recordValue, convertValue(recordValue)) <= 0;
            case LIKE -> matchesLike(recordValue);
            case IN -> matchesIn(recordValue);
        };
    }

    private Object convertValue(Object recordValue) {
        if (recordValue == null || value == null) return value;

        if (recordValue instanceof LocalDate && value instanceof String s) {
            return LocalDate.parse(s);
        }
        if (recordValue instanceof LocalDateTime && value instanceof String s) {
            return LocalDateTime.parse(s);
        }
        if (recordValue instanceof BigDecimal && value instanceof Number n) {
            return new BigDecimal(n.toString());
        }
        if (recordValue instanceof Long && value instanceof Number n) {
            return n.longValue();
        }
        if (recordValue instanceof Integer && value instanceof Number n) {
            return n.intValue();
        }
        if (recordValue instanceof Double && value instanceof Number n) {
            return n.doubleValue();
        }

        return value;
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    private int compare(Object a, Object b) {
        if (a == null || b == null) return 0;
        if (a instanceof Comparable ca && b instanceof Comparable cb) {
            return ca.compareTo(cb);
        }
        return 0;
    }

    private boolean matchesLike(Object recordValue) {
        if (recordValue == null || value == null) return false;
        // Escape regex metacharacters first, then translate the SQL wildcards
        String pattern = value.toString()
            .replaceAll("([\\\\.\\[\\]{}()*+?^$|])", "\\\\$1")
            .replace("%", ".*")
            .replace("_", ".");
        return recordValue.toString().matches(pattern);
    }

    private boolean matchesIn(Object recordValue) {
        if (recordValue == null || value == null) return false;
        if (value instanceof Iterable<?> iterable) {
            for (Object item : iterable) {
                if (Objects.equals(recordValue, item)) return true;
            }
        }
        return false;
    }
}
// predicate/LogicalPredicate.java
package com.example.archivequery.predicate;

import java.util.List;
import java.util.Map;

public record LogicalPredicate(
    Operator operator,
    List<Predicate> predicates
) implements Predicate {

    public enum Operator {
        AND, OR, NOT
    }

    @Override
    public boolean evaluate(Map<String, Object> record) {
        return switch (operator) {
            case AND -> predicates.stream().allMatch(p -> p.evaluate(record));
            case OR -> predicates.stream().anyMatch(p -> p.evaluate(record));
            case NOT -> !predicates.getFirst().evaluate(record);
        };
    }
}
// predicate/PredicateEvaluator.java
package com.example.archivequery.predicate;

import java.util.Map;

public class PredicateEvaluator {

    private final Predicate predicate;

    public PredicateEvaluator(Predicate predicate) {
        this.predicate = predicate;
    }

    public boolean matches(Map<String, Object> record) {
        if (predicate == null) return true;
        return predicate.evaluate(record);
    }
}
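matchesLike translates SQL wildcards into a regular expression. The same translation in Python, where re.escape handles the metacharacter escaping (% and _ are not regex-special, so they survive escaping and can then be rewritten):

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern to a regex: escape first, then map wildcards."""
    return re.escape(pattern).replace("%", ".*").replace("_", ".")

def sql_like(value, pattern):
    """Full-string match, as SQL LIKE semantics require."""
    return re.fullmatch(like_to_regex(pattern), value) is not None
```

Escaping first is what keeps a literal dot in the pattern from acting as a regex wildcard: `report%.csv` matches `report_2024.csv` but not `reportXcsv`.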

6.7 Predicate Parser

// service/PredicateParser.java
package com.example.archivequery.service;

import com.example.archivequery.predicate.*;
import org.springframework.stereotype.Service;

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class PredicateParser {

    private static final Pattern COMPARISON_PATTERN = Pattern.compile(
        "(\\w+)\\s*(=|!=|<>|>=|<=|>|<|LIKE|IN|IS NULL|IS NOT NULL)\\s*(.*)",
        Pattern.CASE_INSENSITIVE  // accept lowercase "like", "is null", etc.
    );

    public Predicate parse(String expression) {
        if (expression == null || expression.isBlank()) {
            return null;
        }
        return parseExpression(expression.trim());
    }

    public Predicate fromFilters(Map<String, Object> filters) {
        if (filters == null || filters.isEmpty()) {
            return null;
        }

        List<Predicate> predicates = new ArrayList<>();

        for (Map.Entry<String, Object> entry : filters.entrySet()) {
            String key = entry.getKey();
            Object value = entry.getValue();

            if (key.endsWith("_gt")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.GT,
                    value
                ));
            } else if (key.endsWith("_gte")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 4),
                    ComparisonPredicate.Operator.GE,
                    value
                ));
            } else if (key.endsWith("_lt")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.LT,
                    value
                ));
            } else if (key.endsWith("_lte")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 4),
                    ComparisonPredicate.Operator.LE,
                    value
                ));
            } else if (key.endsWith("_ne")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.NE,
                    value
                ));
            } else if (key.endsWith("_like")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 5),
                    ComparisonPredicate.Operator.LIKE,
                    value
                ));
            } else if (key.endsWith("_in") && value instanceof List<?> list) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.IN,
                    list
                ));
            } else {
                predicates.add(new ComparisonPredicate(
                    key,
                    ComparisonPredicate.Operator.EQ,
                    value
                ));
            }
        }

        if (predicates.size() == 1) {
            return predicates.getFirst();
        }

        return new LogicalPredicate(LogicalPredicate.Operator.AND, predicates);
    }

    private Predicate parseExpression(String expression) {
        String upper = expression.toUpperCase();

        int andIndex = findLogicalOperator(upper, " AND ");
        if (andIndex > 0) {
            String left = expression.substring(0, andIndex);
            String right = expression.substring(andIndex + 5);
            return new LogicalPredicate(
                LogicalPredicate.Operator.AND,
                List.of(parseExpression(left), parseExpression(right))
            );
        }

        int orIndex = findLogicalOperator(upper, " OR ");
        if (orIndex > 0) {
            String left = expression.substring(0, orIndex);
            String right = expression.substring(orIndex + 4);
            return new LogicalPredicate(
                LogicalPredicate.Operator.OR,
                List.of(parseExpression(left), parseExpression(right))
            );
        }

        if (upper.startsWith("NOT ")) {
            return new LogicalPredicate(
                LogicalPredicate.Operator.NOT,
                List.of(parseExpression(expression.substring(4)))
            );
        }

        if (expression.startsWith("(") && expression.endsWith(")")) {
            return parseExpression(expression.substring(1, expression.length() - 1));
        }

        return parseComparison(expression);
    }

    private int findLogicalOperator(String expression, String operator) {
        int depth = 0;
        int index = 0;

        while (index < expression.length()) {
            char c = expression.charAt(index);
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (depth == 0 && expression.substring(index).startsWith(operator)) {
                return index;
            }
            index++;
        }
        return -1;
    }

    private Predicate parseComparison(String expression) {
        Matcher matcher = COMPARISON_PATTERN.matcher(expression.trim());

        if (!matcher.matches()) {
            throw new IllegalArgumentException("Invalid expression: " + expression);
        }

        String column = matcher.group(1);
        String operatorStr = matcher.group(2).toUpperCase();
        String valueStr = matcher.group(3);

        ComparisonPredicate.Operator operator = switch (operatorStr) {
            case "=" -> ComparisonPredicate.Operator.EQ;
            case "!=", "<>" -> ComparisonPredicate.Operator.NE;
            case ">" -> ComparisonPredicate.Operator.GT;
            case ">=" -> ComparisonPredicate.Operator.GE;
            case "<" -> ComparisonPredicate.Operator.LT;
            case "<=" -> ComparisonPredicate.Operator.LE;
            case "LIKE" -> ComparisonPredicate.Operator.LIKE;
            case "IN" -> ComparisonPredicate.Operator.IN;
            case "IS NULL" -> ComparisonPredicate.Operator.IS_NULL;
            case "IS NOT NULL" -> ComparisonPredicate.Operator.IS_NOT_NULL;
            default -> throw new IllegalArgumentException("Unknown operator: " + operatorStr);
        };

        Object value = parseValue(valueStr, operator);

        return new ComparisonPredicate(column, operator, value);
    }

    private Object parseValue(String valueStr, ComparisonPredicate.Operator operator) {
        if (operator == ComparisonPredicate.Operator.IS_NULL || 
            operator == ComparisonPredicate.Operator.IS_NOT_NULL) {
            return null;
        }

        if (valueStr == null || valueStr.isBlank()) {
            return null;
        }

        valueStr = valueStr.trim();

        if (valueStr.startsWith("'") && valueStr.endsWith("'")) {
            return valueStr.substring(1, valueStr.length() - 1);
        }

        if (operator == ComparisonPredicate.Operator.IN) {
            if (valueStr.startsWith("(") && valueStr.endsWith(")")) {
                valueStr = valueStr.substring(1, valueStr.length() - 1);
            }
            return Arrays.stream(valueStr.split(","))
                .map(String::trim)
                .map(s -> s.startsWith("'") ? s.substring(1, s.length() - 1) : parseNumber(s))
                .toList();
        }

        return parseNumber(valueStr);
    }

    private Object parseNumber(String s) {
        try {
            if (s.contains(".")) {
                return Double.parseDouble(s);
            }
            return Long.parseLong(s);
        } catch (NumberFormatException e) {
            return s;
        }
    }
}
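findLogicalOperator only honors AND/OR at parenthesis depth zero, which is what makes an expression like (a = 1 OR b = 2) AND c = 3 split on the AND rather than the parenthesized OR. The same scan in Python:

```python
def find_top_level(expr, op):
    """Index of op outside any parentheses, or -1 (mirrors findLogicalOperator)."""
    depth = 0
    for i in range(len(expr)):
        c = expr[i]
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif depth == 0 and expr[i:].startswith(op):
            return i
    return -1

idx = find_top_level("(A = 1 OR B = 2) AND C = 3", " AND ")
```

The AND after the closing parenthesis is found at depth zero, while an AND wholly inside parentheses is invisible to the scan and gets handled by the recursive call on the stripped sub-expression.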

6.8 Parquet Query Service

// service/ParquetQueryService.java
package com.example.archivequery.service;

import com.example.archivequery.model.*;
import com.example.archivequery.predicate.*;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.avro.generic.GenericRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

import java.io.IOException;
import java.net.URI;
import java.time.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Collectors;

@Service
public class ParquetQueryService {

    private static final Logger log = LoggerFactory.getLogger(ParquetQueryService.class);

    private final S3Client s3Client;
    private final Configuration hadoopConfig;
    private final PredicateParser predicateParser;
    private final ObjectMapper objectMapper;
    private final ExecutorService executorService;

    public ParquetQueryService(
        S3Client s3Client,
        Configuration hadoopConfig,
        PredicateParser predicateParser
    ) {
        this.s3Client = s3Client;
        this.hadoopConfig = hadoopConfig;
        this.predicateParser = predicateParser;
        this.objectMapper = new ObjectMapper();
        this.executorService = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()
        );
    }

    public QueryResponse query(QueryRequest request) {
        long startTime = System.currentTimeMillis();

        Predicate predicate = buildPredicate(request);
        PredicateEvaluator evaluator = new PredicateEvaluator(predicate);

        List<String> dataFiles = listDataFiles(request.archivePath());

        List<Map<String, Object>> allResults = Collections.synchronizedList(new ArrayList<>());
        long[] counters = new long[2];

        List<Future<?>> futures = new ArrayList<>();

        for (String dataFile : dataFiles) {
            futures.add(executorService.submit(() -> {
                try {
                    QueryFileResult result = queryFile(
                        dataFile, 
                        request.columns(), 
                        evaluator
                    );
                    synchronized (counters) {
                        counters[0] += result.matched();
                        counters[1] += result.scanned();
                    }
                    allResults.addAll(result.records());
                } catch (IOException e) {
                    log.error("Error reading file: " + dataFile, e);
                }
            }));
        }

        for (Future<?> future : futures) {
            try {
                future.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                log.error("Interrupted while waiting for query completion", e);
            } catch (ExecutionException e) {
                log.error("Error waiting for query completion", e);
            }
        }

        // Note: files are read in parallel, so result order is nondeterministic.
        // Sort allResults here if stable pagination across requests is required.
        List<Map<String, Object>> paginatedResults = allResults.stream()
            .skip(request.offset())
            .limit(request.limit())
            .collect(Collectors.toList());

        Map<String, String> schema = extractSchema(request.archivePath());

        long executionTime = System.currentTimeMillis() - startTime;

        return new QueryResponse(
            paginatedResults,
            counters[0],
            counters[1],
            executionTime,
            schema,
            request.archivePath()
        );
    }

    public ArchiveMetadata getMetadata(String archivePath) {
        try {
            URI uri = URI.create(archivePath);
            String bucket = uri.getHost();
            String prefix = uri.getPath().substring(1);

            GetObjectRequest getRequest = GetObjectRequest.builder()
                .bucket(bucket)
                .key(prefix + "/metadata/v1.metadata.json")
                .build();

            byte[] content = s3Client.getObjectAsBytes(getRequest).asByteArray();
            Map<String, Object> metadata = objectMapper.readValue(content, Map.class);

            List<Map<String, Object>> schemas = (List<Map<String, Object>>) metadata.get("schemas");
            // Simplification: assumes the first schema is current; a stricter reader
            // would select the schema matching the metadata's current-schema-id.
            Map<String, Object> currentSchema = schemas.getFirst();
            List<Map<String, Object>> fields = (List<Map<String, Object>>) currentSchema.get("fields");

            List<ArchiveMetadata.ColumnDefinition> columns = fields.stream()
                .map(f -> new ArchiveMetadata.ColumnDefinition(
                    ((Number) f.get("id")).intValue(),
                    (String) f.get("name"),
                    (String) f.get("type"),
                    (Boolean) f.get("required")
                ))
                .toList();

            Map<String, String> properties = (Map<String, String>) metadata.get("properties");

            String archivedAtStr = properties.get("archive.timestamp");
            Instant archivedAt = archivedAtStr != null ? 
                Instant.parse(archivedAtStr) : Instant.now();

            return new ArchiveMetadata(
                (String) metadata.get("table_uuid"),
                (String) metadata.get("location"),
                columns,
                properties,
                archivedAt
            );
        } catch (Exception e) {
            throw new RuntimeException("Failed to read archive metadata", e);
        }
    }

    private Predicate buildPredicate(QueryRequest request) {
        Predicate filterPredicate = predicateParser.fromFilters(request.filters());
        Predicate expressionPredicate = predicateParser.parse(request.filterExpression());

        if (filterPredicate == null) return expressionPredicate;
        if (expressionPredicate == null) return filterPredicate;

        return new LogicalPredicate(
            LogicalPredicate.Operator.AND,
            List.of(filterPredicate, expressionPredicate)
        );
    }

    private List<String> listDataFiles(String archivePath) {
        URI uri = URI.create(archivePath);
        String bucket = uri.getHost();
        String prefix = uri.getPath().substring(1) + "/data/";

        ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
            .bucket(bucket)
            .prefix(prefix)
            .build();

        List<String> files = new ArrayList<>();
        ListObjectsV2Response response;

        do {
            response = s3Client.listObjectsV2(listRequest);

            for (S3Object obj : response.contents()) {
                if (obj.key().endsWith(".parquet")) {
                    files.add("s3a://" + bucket + "/" + obj.key());
                }
            }

            listRequest = listRequest.toBuilder()
                .continuationToken(response.nextContinuationToken())
                .build();

        } while (response.isTruncated());

        return files;
    }

    private record QueryFileResult(
        List<Map<String, Object>> records,
        long matched,
        long scanned
    ) {}

    private QueryFileResult queryFile(
        String filePath,
        List<String> columns,
        PredicateEvaluator evaluator
    ) throws IOException {
        List<Map<String, Object>> results = new ArrayList<>();
        long matched = 0;
        long scanned = 0;

        Path path = new Path(filePath);
        HadoopInputFile inputFile = HadoopInputFile.fromPath(path, hadoopConfig);

        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(inputFile)
                .withConf(hadoopConfig)
                .build()) {

            GenericRecord record;
            while ((record = reader.read()) != null) {
                scanned++;
                Map<String, Object> recordMap = recordToMap(record);

                if (evaluator.matches(recordMap)) {
                    matched++;

                    if (columns != null && !columns.isEmpty()) {
                        Map<String, Object> projected = new LinkedHashMap<>();
                        for (String col : columns) {
                            if (recordMap.containsKey(col)) {
                                projected.put(col, recordMap.get(col));
                            }
                        }
                        results.add(projected);
                    } else {
                        results.add(recordMap);
                    }
                }
            }
        }

        return new QueryFileResult(results, matched, scanned);
    }

    private Map<String, Object> recordToMap(GenericRecord record) {
        Map<String, Object> map = new LinkedHashMap<>();

        for (org.apache.avro.Schema.Field field : record.getSchema().getFields()) {
            String name = field.name();
            Object value = record.get(name);
            map.put(name, convertValue(value));
        }

        return map;
    }

    private Object convertValue(Object value) {
        if (value == null) return null;

        if (value instanceof org.apache.avro.util.Utf8) {
            return value.toString();
        }

        if (value instanceof org.apache.avro.generic.GenericRecord nested) {
            return recordToMap(nested);
        }

        if (value instanceof java.nio.ByteBuffer buffer) {
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return Base64.getEncoder().encodeToString(bytes);
        }

        // Heuristic: Parquet DATE values surface through Avro as epoch-day integers.
        // Beware: this also converts genuine int columns; checking the Avro logical
        // type on the field schema would be more robust.
        if (value instanceof Integer i) {
            return LocalDate.ofEpochDay(i);
        }

        // Heuristic: large longs are treated as timestamp-micros (micros / 1000 = millis)
        if (value instanceof Long l && l > 1_000_000_000_000L) {
            return LocalDateTime.ofInstant(
                Instant.ofEpochMilli(l / 1000),
                ZoneOffset.UTC
            );
        }

        return value;
    }

    private Map<String, String> extractSchema(String archivePath) {
        try {
            ArchiveMetadata metadata = getMetadata(archivePath);
            return metadata.columns().stream()
                .collect(Collectors.toMap(
                    ArchiveMetadata.ColumnDefinition::name,
                    ArchiveMetadata.ColumnDefinition::type,
                    (a, b) -> a,
                    LinkedHashMap::new
                ));
        } catch (Exception e) {
            log.warn("Failed to extract schema", e);
            return Collections.emptyMap();
        }
    }
}

6.9 REST Controller

// controller/ArchiveQueryController.java
package com.example.archivequery.controller;

import com.example.archivequery.model.*;
import com.example.archivequery.service.ParquetQueryService;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.util.List;
import java.util.Map;

@RestController
@RequestMapping("/api/v1/archives")
public class ArchiveQueryController {

    private final ParquetQueryService queryService;

    public ArchiveQueryController(ParquetQueryService queryService) {
        this.queryService = queryService;
    }

    @PostMapping("/query")
    public ResponseEntity<QueryResponse> query(@RequestBody QueryRequest request) {
        QueryResponse response = queryService.query(request);
        return ResponseEntity.ok(response);
    }

    @GetMapping("/query")
    public ResponseEntity<QueryResponse> queryGet(
        @RequestParam String archivePath,
        @RequestParam(required = false) List<String> columns,
        @RequestParam(required = false) String filter,
        @RequestParam(defaultValue = "1000") Integer limit,
        @RequestParam(defaultValue = "0") Integer offset
    ) {
        QueryRequest request = new QueryRequest(
            archivePath,
            columns,
            null,
            filter,
            limit,
            offset
        );

        QueryResponse response = queryService.query(request);
        return ResponseEntity.ok(response);
    }

    @GetMapping("/metadata")
    public ResponseEntity<ArchiveMetadata> getMetadata(
        @RequestParam String archivePath
    ) {
        ArchiveMetadata metadata = queryService.getMetadata(archivePath);
        return ResponseEntity.ok(metadata);
    }

    @GetMapping("/schema/{schema}/{table}/{partitionValue}")
    public ResponseEntity<ArchiveMetadata> getMetadataByTable(
        @PathVariable String schema,
        @PathVariable String table,
        @PathVariable String partitionValue,
        @RequestParam String bucket,
        @RequestParam(defaultValue = "aurora-archives") String prefix
    ) {
        String archivePath = String.format(
            "s3://%s/%s/%s/%s/%s",
            bucket, prefix, schema, table, partitionValue
        );

        ArchiveMetadata metadata = queryService.getMetadata(archivePath);
        return ResponseEntity.ok(metadata);
    }

    @PostMapping("/query/{schema}/{table}/{partitionValue}")
    public ResponseEntity<QueryResponse> queryByTable(
        @PathVariable String schema,
        @PathVariable String table,
        @PathVariable String partitionValue,
        @RequestParam String bucket,
        @RequestParam(defaultValue = "aurora-archives") String prefix,
        @RequestBody Map<String, Object> requestBody
    ) {
        String archivePath = String.format(
            "s3://%s/%s/%s/%s/%s",
            bucket, prefix, schema, table, partitionValue
        );

        List<String> columns = requestBody.containsKey("columns") ?
            (List<String>) requestBody.get("columns") : null;

        Map<String, Object> filters = requestBody.containsKey("filters") ?
            (Map<String, Object>) requestBody.get("filters") : null;

        String filterExpression = requestBody.containsKey("filterExpression") ?
            (String) requestBody.get("filterExpression") : null;

        Integer limit = requestBody.containsKey("limit") ?
            ((Number) requestBody.get("limit")).intValue() : 1000;

        Integer offset = requestBody.containsKey("offset") ?
            ((Number) requestBody.get("offset")).intValue() : 0;

        QueryRequest request = new QueryRequest(
            archivePath,
            columns,
            filters,
            filterExpression,
            limit,
            offset
        );

        QueryResponse response = queryService.query(request);
        return ResponseEntity.ok(response);
    }
}

6.10 Application Properties

# application.yml
server:
  port: 8080

aws:
  region: ${AWS_REGION:eu-west-1}

spring:
  application:
    name: archive-query-service

logging:
  level:
    com.example.archivequery: DEBUG
    org.apache.parquet: WARN
    org.apache.hadoop: WARN

7. Usage Examples

7.1 Archive a Partition

# Set environment variables
export AURORA_HOST=your-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com
export AURORA_DATABASE=production
export AURORA_USER=admin
export AURORA_PASSWORD=your-password
export ARCHIVE_S3_BUCKET=my-archive-bucket
export ARCHIVE_S3_PREFIX=aurora-archives
export AWS_REGION=eu-west-1

# Archive January 2024 transactions
python archive_partition.py \
    --table transactions \
    --partition-column transaction_month \
    --partition-value 2024-01 \
    --schema public \
    --batch-size 100000

7.2 Query Archived Data via API

# Get archive metadata
curl -X GET "http://localhost:8080/api/v1/archives/metadata?archivePath=s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01"

# Query with filters (POST)
curl -X POST "http://localhost:8080/api/v1/archives/query" \
    -H "Content-Type: application/json" \
    -d '{
        "archivePath": "s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01",
        "columns": ["transaction_id", "customer_id", "amount", "transaction_date"],
        "filters": {
            "amount_gte": 1000,
            "status": "completed"
        },
        "limit": 100,
        "offset": 0
    }'

# Query with expression filter (GET)
curl -G "http://localhost:8080/api/v1/archives/query" \
    --data-urlencode "archivePath=s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01" \
    --data-urlencode "filter=amount >= 1000 AND status = 'completed'" \
    --data-urlencode "columns=transaction_id,customer_id,amount" \
    --data-urlencode "limit=50"

7.3 Restore Archived Data

# Restore to staging table
python restore_partition.py \
    --source-path s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01 \
    --target-table transactions_staging \
    --target-schema public \
    --batch-size 10000

7.4 Migrate to Main Table

-- Validate before migration
SELECT * FROM generate_migration_report(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_month', '2024-01'
);

-- Migrate data
CALL migrate_partition_data(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_month', '2024-01',
    TRUE,    -- delete existing data in partition
    50000    -- batch size
);

-- Clean up staging table
CALL cleanup_after_migration(
    'public', 'transactions_staging',
    'public', 'transactions',
    TRUE     -- verify counts
);

8. Operational Considerations

8.1 Cost Analysis

The cost savings from this approach are significant:

Storage Type                    Monthly Cost per TB
Aurora PostgreSQL               $230
S3 Standard                     $23
S3 Intelligent Tiering          $23 (hot) to $4 (archive)
S3 Glacier Instant Retrieval    $4

For a 10TB historical dataset:

  • Aurora: $2,300/month
  • S3 with Parquet (7:1 compression): ~$33/month
  • Savings: ~98.5%
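
The arithmetic behind these figures can be checked directly. A quick sketch using the illustrative per-TB prices from the table above (the class and method names here are my own, not part of the tooling in this article):

```java
public class ArchiveCostEstimate {

    // Illustrative per-TB monthly prices from the table above
    static final double AURORA_PER_TB = 230.0;
    static final double S3_PER_TB = 23.0;

    // Percentage saved by storing compressed Parquet on S3 instead of Aurora
    static double savingsPercent(double rawTb, double compressionRatio) {
        double auroraMonthly = rawTb * AURORA_PER_TB;
        double s3Monthly = (rawTb / compressionRatio) * S3_PER_TB;
        return 100.0 * (1.0 - s3Monthly / auroraMonthly);
    }

    public static void main(String[] args) {
        // 10TB raw data, 7:1 Parquet compression
        System.out.printf("Aurora: $%.0f/month%n", 10 * AURORA_PER_TB);
        System.out.printf("S3:     $%.0f/month%n", (10 / 7.0) * S3_PER_TB);
        System.out.printf("Saving: %.1f%%%n", savingsPercent(10, 7));
    }
}
```

Note that the savings percentage depends only on the compression ratio and the price gap, not on the dataset size.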

8.2 Query Performance

The Spring Boot API performance depends on:

  1. Data file size: Smaller files (100MB to 250MB) enable better parallelism
  2. Predicate selectivity: Higher selectivity means fewer records to deserialise
  3. Column projection: Requesting fewer columns reduces I/O
  4. S3 latency: Use same region as the bucket

Expected performance for a 1TB partition (compressed to ~150GB Parquet):

Query Type                        Typical Latency
Point lookup (indexed column)     500ms to 2s
Range scan (10% selectivity)      5s to 15s
Full scan with aggregation        30s to 60s

8.3 Monitoring and Alerting

Implement these CloudWatch metrics for production use:

@Component
public class QueryMetrics {

    private final MeterRegistry meterRegistry;

    public QueryMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void recordQuery(QueryResponse response) {
        meterRegistry.counter("archive.query.count").increment();

        meterRegistry.timer("archive.query.duration")
            .record(response.executionTimeMs(), TimeUnit.MILLISECONDS);

        // Gauges only expose the latest sample; distribution summaries record every query
        meterRegistry.summary("archive.query.records_scanned").record(response.totalScanned());
        meterRegistry.summary("archive.query.records_matched").record(response.totalMatched());
    }
}

9. Conclusion

This solution provides a complete data lifecycle management approach for large Aurora PostgreSQL tables. The archive script efficiently exports partitions to the cost effective Iceberg/Parquet format on S3, while the restore script enables seamless data recovery when needed. The Spring Boot API bridges the gap by allowing direct queries against archived data, eliminating the need for restoration in many analytical scenarios.

Key benefits:

  1. Cost reduction: 90 to 98 percent storage cost savings compared to keeping data in Aurora
  2. Operational flexibility: Query archived data without restoration
  3. Schema preservation: Full schema metadata maintained for reliable restores
  4. Partition management: Clean attach/detach operations for partitioned tables
  5. Predicate pushdown: Efficient filtering reduces data transfer and processing

The Iceberg format ensures compatibility with the broader data ecosystem, allowing tools like Athena, Spark, and Trino to query the same archived data when needed for more complex analytical workloads.

Why Bigger Banks Were Historically More Fragile and Why Architecture Determines Resilience

1. Size Was Once Mistaken for Stability

For most of modern banking history, stability was assumed to increase with size. The thinking was simple: the bigger you are, the more resources you can apply to problems. Larger banks had more capital, more infrastructure, and more people. In a pre-cloud world, this assumption appeared reasonable.

In practice, the opposite was often true.

Before cloud computing and elastic infrastructure, the larger a bank became, the more unstable it was under stress and the harder it was to maintain any kind of delivery cadence. Scale amplified fragility. In 2025, architecture (not size) has become the primary determinant of banking stability.

2. Scale, Fragility, and Quantum Entanglement

Traditional banking platforms were built on vertically scaled systems: mainframes, monolithic databases, and tightly coupled integration layers. These systems were engineered for control and predictability, not for elasticity or independent change.

As banks grew, they didn’t just add clients. They added products. Each new product introduced new dependencies, shared data models, synchronous calls, and operational assumptions. Over time, this created a state best described as quantum entanglement.

In this context, quantum entanglement refers to systems where:

  • Products cannot change independently
  • A change in one area unpredictably affects others
  • The full impact of change only appears under real load
  • Cause and effect are separated by time, traffic, and failure conditions

The larger the number of interdependent products, the more entangled the system becomes.

2.1 Why Entanglement Reduces Stability

As quantum entanglement increases, change becomes progressively riskier. Even small modifications require coordination across multiple teams and systems. Release cycles slow and defensive complexity increases.

Recovery also becomes harder. When something breaks, rolling back a single change is rarely sufficient because multiple products may already be in partially failed or inconsistent states.

Fault finding degrades as well. Logs, metrics, and alerts point in multiple directions. Symptoms appear far from root causes, forcing engineers to chase secondary effects rather than underlying faults.

Most importantly, blast radius expands. A fault in one product propagates through shared state and synchronous dependencies, impacting clients who weren’t using the originating product at all.

The paradox is that the very success of large banks (broad product portfolios) becomes a direct contributor to instability.

3. Why Scale Reduced Stability in the Pre-Cloud Era

Before cloud computing, capacity was finite, expensive, and slow to change. Systems scaled vertically, and failure domains were large by design.

As transaction volumes and product entanglement increased, capacity cliffs became unavoidable. Peak load failures became systemic rather than local. Recovery times lengthened and client impact widened.

Large institutions often appeared stable during normal operation but failed dramatically under stress. Smaller institutions appeared more stable largely because they had fewer entangled products and simpler operational surfaces (not because they were inherently better engineered).

Capitec itself experienced this first hand, when its core banking SQL database hit a capacity cliff in August 2022. To recover the service, close to 100 changes were made, resulting in downtime of around 40 hours. The wider service recovery took weeks, with missed payments and duplicate payments fixed on a case by case basis. It was at this point that Capitec’s leadership drew a line in the sand and decided to re-engineer its entire stack from the ground up in AWS. This blog post shares a few nuggets from that engineering journey, in the hope of helping others struggling with the burden of scale and hardened synchronous pathways.

4. Cloud Changed the Equation (But Only When Architecture Changed)

Cloud computing made it possible to break entanglement, but only for organisations willing to redesign systems to exploit it.

Horizontal scaling, availability zone isolation, managed databases, and elastic compute allow products to exist as independent domains rather than tightly bound extensions of a central core.

Institutions that merely moved infrastructure to the cloud without breaking product entanglement continue to experience the same instability patterns (only on newer hardware).

5. An Architecture Designed to Avoid Entanglement

Capitec represents a deliberate rejection of quantum entanglement.

Its entire App production stack is cloud native on AWS: Kubernetes, Kafka and Postgres. The platform is well advanced in rolling out new Java 25 runtimes, alongside ahead of time (AOT) optimisation to further reduce scale latency, improve startup characteristics, and increase predictability under load. All Aurora Serverless clusters are set up with read replicas, offloading read pressure from write paths. All workloads are deployed across three availability zones, ensuring resilience. Database access is via the AWS Advanced JDBC Wrapper, which enables extremely rapid failovers without waiting on DNS TTLs.

Crucially, products are isolated by design. There is no central product graph where everything depends on everything else. But, a word of caution: we are “not there yet”. We will always have edges that can hurt, and when you hit an edge at speed, it is sometimes hard to get back up on your feet. Often the downtime you experience simply results in pent up demand. Put another way, the volume that took your systems offline is now significantly LESS than the volume that is waiting for you once you recover. This means you somehow have to magically add capacity, or optimise code, during an outage in order to recover the service. You will often see the “Rate Limiting” fan club put a foot forward when I discuss burst recoverability. I personally don’t buy this for single entity services (for a complex set of reasons). For someone like AWS, it absolutely makes sense to carry the enormous complexity of guarding services with rate limits. But I don’t believe the same is true for a single entity ecosystem; in these instances, offloading is normally a purer pathway.

6. Write Guarding as a Stability Primitive

Capitec’s mobile and digital platforms employ a deliberate write guarding strategy.

Read only operations (such as logging into the app) are explicitly prevented from performing inline write operations. Activities like audit logging, telemetry capture, behavioural flags, and notification triggers are never executed synchronously on high volume read paths.

Instead, these concerns are offloaded asynchronously using Amazon MSK (Managed Streaming for Apache Kafka) or written to in memory data stores such as Valkey, where they can be processed later without impacting the user journey.

This design completely removes read-write contention from critical paths. Authentication storms, balance checks, and session validation no longer compete with persistence workloads. Under load, read performance remains stable because it is not coupled to downstream write capacity.

Critically, write guarding prevents database maintenance pressure (such as vacuum activity) from leaking into high volume events like logins. Expensive background work remains isolated from customer facing read paths.

Write guarding turns one of the most common failure modes in large banking systems (read traffic triggering hidden writes) into a non event. Stability improves not by adding capacity, but by removing unnecessary coupling.
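
To make the pattern concrete, here is a minimal, deliberately simplified sketch of write guarding in plain Java. An in-memory queue stands in for the real MSK/Valkey offload, and the class and method names are illustrative, not Capitec’s actual code:

```java
import java.util.*;
import java.util.concurrent.*;

public class WriteGuardedReads {

    // Stand-in for the balance store queried on the hot read path
    private final Map<String, Long> balances = new ConcurrentHashMap<>(Map.of("acc-1", 100L));

    // Stand-in for MSK/Valkey: audit events are buffered, never written inline
    private final Queue<String> auditQueue = new ConcurrentLinkedQueue<>();

    // Stand-in for the durable audit store, populated by a background consumer
    private final List<String> auditStore = new ArrayList<>();

    // Hot read path: enqueue the audit event and return; no inline write
    public long checkBalance(String accountId) {
        auditQueue.offer("balance-check:" + accountId);
        return balances.getOrDefault(accountId, 0L);
    }

    // Background consumer (in production, a Kafka consumer with its own capacity)
    public int drainAudits() {
        int drained = 0;
        String event;
        while ((event = auditQueue.poll()) != null) {
            auditStore.add(event); // the slow, durable write happens off the read path
            drained++;
        }
        return drained;
    }

    public List<String> auditStore() { return auditStore; }
}
```

The key property is that the read path only ever enqueues: if the audit store slows down, the queue absorbs the pressure and the client journey is unaffected.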

7. Virtual Threads as a Scalability Primitive

Java 25 introduces mature virtual threading as a first class concurrency model. This fundamentally changes how high concurrency systems behave under load.

Virtual threads decouple application concurrency from operating system threads. Instead of being constrained by a limited pool of heavyweight threads, services can handle hundreds of thousands of concurrent blocking operations without exhausting resources.

Request handling becomes simpler. Engineers can write straightforward blocking code without introducing thread pool starvation or complex asynchronous control flow.

Tail latency improves under load. When traffic spikes, virtual threads queue cheaply rather than collapsing the system through thread exhaustion.

Failure isolation improves. Slow downstream calls no longer monopolise scarce threads, reducing cascading failure modes.

Operationally, virtual threads align naturally with containerised, autoscaling environments. Concurrency scales with demand, not with preconfigured thread limits.

When combined with modern garbage collectors and ahead of time optimisation, virtual threading removes an entire class of concurrency related instability that plagued earlier JVM based banking platforms.
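
A minimal illustration of the model (requires Java 21 or later; the sleep is a stand-in for a blocking downstream call, and the class name is my own):

```java
import java.util.*;
import java.util.concurrent.*;

public class VirtualThreadSketch {

    // Fan out many blocking tasks on virtual threads; each sleep parks the
    // virtual thread cheaply instead of pinning a scarce OS thread
    static long sumWithVirtualThreads(int tasks) throws Exception {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int n = i;
                futures.add(exec.submit(() -> {
                    Thread.sleep(5); // simulated blocking I/O
                    return n;
                }));
            }
            long sum = 0;
            for (Future<Integer> f : futures) sum += f.get();
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        // Thousands of concurrent blocking tasks complete in roughly the time of
        // one sleep, something a fixed platform-thread pool cannot do
        System.out.println(sumWithVirtualThreads(10_000));
    }
}
```

The code reads as ordinary blocking logic; concurrency scales with the number of submitted tasks rather than with a preconfigured pool size.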

8. Nimbleness Emerges When Entanglement Disappears

When blast zones and integration choke points disappear, teams regain the ability to move quickly without increasing systemic risk.

Domains communicate through well defined RESTful interfaces, often across separate AWS accounts, enforcing isolation as a first class property. A failure in one domain does not cascade across the organisation.

To keep this operable at scale, Capitec uses Backstage (via an internal overlay called ODIN) as its internal orchestration and developer platform. All AWS accounts, services, pipelines, and operational assets are created to a common standard. Teams consume platform capability rather than inventing infrastructure.

This eliminates configuration drift, reduces cognitive load, and ensures that every new product inherits the same security, observability, and resilience characteristics.

The result is nimbleness without fragility.

9. Operational Stability Is Observability Plus Action

In entangled systems, failures are discovered by clients and stability is measured retrospectively.

Capitec operates differently. End to end observability through Instana and its in house AI platform, Neo, correlates client side errors, network faults, infrastructure signals, and transaction failures in real time. Issues are detected as they emerge, not after they cascade.

This operational awareness allows teams to intervene early, contain issues quickly, and reduce client impact before failures escalate.

Stability, in this model, is not the absence of failure. It is fast detection, rapid containment, and decisive response.

10. Fraud Prevention Without Creating New Entanglement

Fraud is treated as a first class stability concern rather than an external control.

Payments are evaluated inline as they move through the bank. Abnormal velocity, behavioural anomalies, and account provenance are assessed continuously. Even fraud reported in the call centre is immediately visible to other clients paying from the Capitec App. Clients are presented with conscience pricking prompts for high risk payments; these frequently stop fraud, as clients abandon the payment when presented with the risks.

Capitec runs a real time malware detection engine directly on client devices. This engine detects hooks and overlays installed by malicious applications. When malware is identified, the client’s account is immediately stopped, preventing fraudulent transactions before they occur.

Because fraud controls are embedded directly into the transaction flow, they don’t introduce additional coupling or asynchronous failure modes.

The impact is measurable. Capitec’s fraud prevention systems have prevented R300 million in client losses from fraud. In November alone, these systems saved clients a further R60 million in fraud losses.

11. The Myth of Stability Through Multicloud

Multicloud is often presented as a stability strategy. In practice, it is largely a myth.

Running across multiple cloud providers does not remove failure risk. It compounds it. Cross cloud communication can typically only be secured using IP based controls, weakening security posture. Operational complexity increases sharply as teams must reason about heterogeneous platforms, tooling, failure modes, and networking behaviour.

Most critically, multicloud does not eliminate correlated failure. If either cloud provider becomes unavailable, systems are usually unusable anyway. The result is a doubled risk surface, increased operational risk, and new inter cloud network dependencies (without a corresponding reduction in outage impact).

Multicloud increases complexity, weakens controls, and expands risk surface area without delivering meaningful resilience.

12. What Actually Improves Stability

There are better options than multicloud.

Hybrid cloud with anti-affinity on critical channels is one. For example, card rails can be placed in two physically separate data centres so that if cloud based digital channels are unavailable, clients can still transact via cards and ATMs. This provides real functional resilience rather than architectural illusion.

Multi region deployment within a single cloud provider is another. This provides geographic fault isolation without introducing heterogeneous complexity. However, this only works if the provider avoids globally scoped services that introduce hidden single points of failure. At present, only AWS consistently supports this model. Some providers expose global services (such as global front doors) that introduce global blast radius and correlated failure risk.

True resilience requires isolation of failure domains, not duplication of platforms.

13. Why Traditional Banks Still Struggle

Traditional banks remain constrained by entangled product graphs, vertically scaled cores, synchronous integration models, and architectural decisions from a different era. As product portfolios grow, quantum entanglement increases. Change slows, recovery degrades, and outages become harder to diagnose and contain.

Modernisation programmes often increase entanglement temporarily through dual run architectures, making systems more fragile before they become more stable (if they ever do).

The challenge is not talent or ambition. It is the accumulated cost of entanglement.

14. Stability at Scale Without the Traditional Trade Off

Capitec’s significance is not that it is small. It is that it is large and remains stable.

Despite operating at massive scale with a broad product surface and high transaction volumes, stability improves rather than degrades. Scale does not increase blast radius, recovery time, or change risk. It increases parallelism, isolation, and resilience.

This directly contradicts historical banking patterns where growth inevitably led to fragility. Capitec demonstrates that with the right architecture, scale and stability are no longer opposing forces.

15. Final Thought

Before cloud and autoscaling, scale and stability were inversely related. The more products a bank had, the more entangled and fragile it became.

In 2025, that relationship can be reversed (but only by breaking entanglement, isolating failure domains, and avoiding complexity masquerading as resilience).

Doing a deal with a cloud provider means nothing if transformation stalls inside the organisation. If dozens of people carry the title of CIO while quietly pulling the handbrake on the change that is required, the outcome is inevitable regardless of vendor selection.

There is also a strategic question that many institutions avoid. If forced to choose between operating in a jurisdiction that is hostile to public cloud or accessing the full advantages of cloud, waiting is not a strategy. When that jurisdiction eventually allows public cloud, the market will already be populated by banks that moved earlier, built cloud native platforms, and are now entering at scale.

Capitec is an engineering led bank whose stability and speed increase with scale. Traditional banks remain constrained by quantum entanglement baked into architectures from a different era.

These outcomes are not accidental. They are the inevitable result of architectural and organisational choices made years ago, now playing out under real world load.