πŸ‘6views
Aurora PostgreSQL Write Traffic, Saturation, and the Path to Throughput

Article Summary
What it is: Explains how Amazon Aurora PostgreSQL handles write operations through its memory management system (buffer pools and WAL buffers) and why write performance eventually hits limits due to Aurora's distributed storage architecture.
Why it matters: Engineering teams running high-throughput Aurora workloads need to understand these write path constraints to optimize performance and avoid hitting throughput walls that can't be solved by simply adding more compute power.
Key takeaway: Aurora trades raw write speed for durability by using distributed storage with cross-AZ quorum writes, making it impossible to achieve sub-millisecond commit latency regardless of instance performance.

1. Introduction

Every engineering team that runs a high throughput transactional workload on Amazon Aurora PostgreSQL will eventually arrive at the same uncomfortable question: why does the database start refusing to go faster, and what can actually be done about it? Aurora’s architecture is genuinely brilliant, but it introduces a set of write path constraints that are fundamentally different from both vanilla PostgreSQL and traditional block storage databases. Those constraints are not bugs. They are deliberate engineering decisions that trade some raw write speed for extraordinary durability and availability. Understanding them thoroughly is the prerequisite for any meaningful performance work.

This article walks through the entire write path from the moment a client issues a COMMIT statement to the moment that data is durable in Aurora’s distributed storage. It explains what the buffer pool actually does, what WAL is and why it exists, what a checkpoint is and why it matters even though Aurora changes its role dramatically, and why the cross availability zone quorum write means you can never achieve sub-millisecond commit latency regardless of how fast your instance is. It then covers every practical technique available to squeeze more write throughput out of an existing cluster without rewriting the application from scratch, compares Aurora against RDS PostgreSQL with io2 Block Express for write heavy workloads, and provides CloudWatch and SQL scripts for observing write queuing and saturation as it happens.

2. How PostgreSQL Manages Memory: The Buffer Pool and WAL Buffers

Before examining Aurora specifically, the foundation must be the stock PostgreSQL memory architecture, because Aurora inherits it at the engine layer while radically transforming what happens beneath it.

2.1 The Shared Buffer Pool

When any PostgreSQL process needs to read or modify a row, it does not go directly to disk. It goes first to a region of shared memory called the shared buffer pool, controlled by the shared_buffers parameter. This pool is a page cache holding 8 KB pages of table and index data that all backend processes can share simultaneously. If the requested page is already in the pool, the read is served from memory with no I/O cost. If not, the engine fetches it from storage, places it in the pool, and then serves it. This is the standard database buffer cache pattern.

When a backend process modifies a page, it does so in memory only. The modified page is called a dirty buffer. It is not immediately written to disk. This is intentional: random I/O to data files is extremely expensive, and the engine deliberately defers those writes. The pool uses a clock sweep algorithm to evict less recently used pages when space is needed for new ones. The goal is to keep the entire working set in memory so that the common case involves no data file I/O at all.

PostgreSQL documentation on shared_buffers
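The clock sweep eviction described above can be illustrated with a deliberately simplified sketch. This is not PostgreSQL's actual implementation: real buffers also carry pin counts and a free list, and only the usage-counter behaviour is modelled here.

```python
# Minimal sketch of clock sweep eviction, assuming only usage counters
# matter (real PostgreSQL buffers also have pin counts and a free list).
class ClockSweepPool:
    MAX_USAGE = 5  # PostgreSQL caps the usage count at 5

    def __init__(self, size):
        self.pages = [None] * size   # page id held in each buffer slot
        self.usage = [0] * size      # usage count per slot
        self.hand = 0                # position of the clock hand
        self.index = {}              # page id -> slot number

    def access(self, page_id):
        """Touch a page; return 'hit' or 'miss' (a miss may evict a victim)."""
        if page_id in self.index:
            slot = self.index[page_id]
            self.usage[slot] = min(self.usage[slot] + 1, self.MAX_USAGE)
            return "hit"
        # Miss: advance the hand, decrementing usage counts, until a slot
        # with usage 0 (or an empty slot) is found -- that is the victim.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.pages[slot] is None or self.usage[slot] == 0:
                break
            self.usage[slot] -= 1
        if self.pages[slot] is not None:
            del self.index[self.pages[slot]]     # evict the old page
        self.pages[slot] = page_id
        self.usage[slot] = 1
        self.index[page_id] = slot
        return "miss"

pool = ClockSweepPool(size=3)
results = [pool.access(p) for p in [1, 2, 3, 1, 4, 1]]
print(results)  # the frequently touched page 1 survives the eviction of page 2
```

The point of the usage counter is visible in the example: page 1, touched repeatedly, accumulates a count the sweeping hand must decrement to zero before eviction, so colder pages are evicted first.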

In Aurora PostgreSQL, the default shared_buffers formula is considerably more aggressive than community PostgreSQL because Aurora has no local filesystem cache to fall back on. Aurora exclusively depends on PostgreSQL shared buffers in the database instance, without relying on a filesystem cache, because Aurora does not write to local files. Although the shared buffer cache is updated, it is never checkpointed to disk. Whenever there is a cache miss, the remote storage is accessed. This means a cache miss in Aurora crosses a network boundary to the distributed storage tier. Sizing shared_buffers generously is therefore more critical in Aurora than in any other PostgreSQL deployment.

The Aurora PostgreSQL default value is derived from the formula SUM(DBInstanceClassMemory/12038, -50003), which is considerably higher than the RDS PostgreSQL default.
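To get a feel for what that formula yields, here is an illustrative calculation. It assumes DBInstanceClassMemory equals the nominal instance memory in bytes; in practice AWS substitutes a slightly smaller value that accounts for OS and management overhead, so the real default lands a little lower.

```python
# Illustrative evaluation of Aurora PostgreSQL's default shared_buffers
# formula, SUM(DBInstanceClassMemory/12038, -50003), which yields a count
# of 8 KB pages. Assumption: DBInstanceClassMemory == nominal RAM in bytes.
def aurora_default_shared_buffers_pages(instance_memory_bytes: int) -> int:
    return instance_memory_bytes // 12038 - 50003

# Example: a hypothetical instance with 128 GiB of memory.
mem = 128 * 1024**3
pages = aurora_default_shared_buffers_pages(mem)
gib = pages * 8192 / 1024**3
print(f"shared_buffers = {pages} pages, about {gib:.1f} GiB "
      f"({gib / 128:.0%} of instance memory)")
```

The result lands around two thirds of instance memory, versus the typical 25% guidance for community PostgreSQL, which reflects the absence of a filesystem cache behind Aurora's engine.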

2.2 The WAL Buffer

In parallel with the shared buffer pool, PostgreSQL maintains a much smaller region of shared memory called the WAL buffer, controlled by wal_buffers. A WAL buffer holds transaction data that Aurora PostgreSQL later writes to persistent storage. When a client changes data, Aurora PostgreSQL writes the changes to the WAL buffer. When the client issues a COMMIT, the WAL writer process writes transaction data to the WAL file.

The WAL buffer is a staging area for log records that have not yet been flushed to the WAL stream. Because WAL writes are sequential, they are far cheaper per byte than the random I/O required to write dirty pages to scattered data file locations. The WAL buffer keeps those sequential writes batched until a flush is required, which normally occurs at transaction commit.

2.3 Why the Buffer Pool Enables Crash Recovery

The reason the dirty page deferral pattern is safe is the WAL itself. PostgreSQL follows the write ahead logging principle without exception: a change record describing every modification must be written to durable storage before the data page containing that modification is written. This means that even if the server crashes with a pool full of dirty pages that have never been written to data files, every modification is recorded in the WAL. On restart, the engine replays those WAL records to reconstruct the correct state of every page. The buffer pool can therefore absorb enormous amounts of in flight writes without any durability risk, because the WAL provides the ground truth.
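The write ahead invariant can be made concrete with a toy simulation: changes land in an in-memory buffer pool and a durable WAL, the "crash" wipes the buffer pool, and replay reconstructs committed state from the log alone. This is purely illustrative; page and record structures here are stand-ins, not PostgreSQL's formats.

```python
# Toy sketch of the write-ahead logging invariant: REDO records are durable
# before data pages are, so dirty pages lost in a crash are reconstructible.
durable_wal = []        # stands in for the durable WAL stream
durable_pages = {}      # stands in for data files on disk
buffer_pool = {}        # in-memory pages; dirty pages live only here

def modify(page, key, value):
    # DO: change the page in memory only (it becomes dirty)...
    buffer_pool.setdefault(page, dict(durable_pages.get(page, {})))
    buffer_pool[page][key] = value
    # ...but the REDO record is appended to the durable WAL.
    durable_wal.append((page, key, value))

modify("p1", "balance", 100)
modify("p1", "balance", 250)
modify("p2", "status", "shipped")

buffer_pool.clear()                      # crash: every dirty page is lost

for page, key, value in durable_wal:     # recovery: replay REDO forward
    durable_pages.setdefault(page, {})[key] = value

print(durable_pages)  # committed state rebuilt purely from the WAL
```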

3. The Anatomy of a Transaction: DO, REDO, UNDO, and the WAL Record

Understanding what actually happens inside a transaction requires understanding three distinct log concepts: the DO record, the REDO record, and the UNDO record.

3.1 DO: The Act of Modification

When a backend process modifies a row, it performs the DO operation: it locates the page in the buffer pool, acquires a lock on that buffer, applies the change in memory, and records the change in the WAL buffer as a REDO log record before releasing the lock. The data page in the pool is now dirty. The WAL record is not yet on durable storage, but it will be before the transaction can be acknowledged as committed.

3.2 REDO: Replaying Forward After a Crash

The REDO record in the WAL describes the after image of the modification: what the page should look like after the change was applied. If the server crashes after the WAL record was written but before the data page was flushed to disk, crash recovery replays the REDO record to bring the data page to its correct post modification state. Postgres reads the latest checkpoint record in the WAL. The checkpoint’s redo point tells it where to start. It replays all WAL records from that point forward, applying each change in order. If any pages were partially written (torn), Postgres restores them using full page images stored in the WAL.

The WAL is therefore sufficient to reconstruct the database from any point after the last checkpoint. No data is lost for committed transactions even if the instance crashes immediately after the COMMIT acknowledgement is sent to the client, provided the WAL was durably written.

3.3 UNDO: Rolling Back Uncommitted Changes

PostgreSQL does not use a traditional separate UNDO log. Instead, it uses Multi Version Concurrency Control. When a transaction modifies a row, the old version of that row remains visible in the heap as a dead tuple until vacuum reclaims it. If the transaction aborts, the changes are visible only to that transaction, and when it terminates, the old versions naturally become the visible ones for all other transactions. No redo of undo is required on crash recovery because uncommitted changes were never acknowledged as durable. This design means that PostgreSQL crash recovery is forward only: it replays REDO records from the last checkpoint forward, and the MVCC heap naturally excludes any uncommitted transaction’s changes.
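A stripped-down sketch of that visibility rule shows why an abort costs nothing: the aborting transaction simply never joins the committed set, so its row versions never become visible. Real PostgreSQL visibility involves snapshots, in-progress transaction lists, and hint bits; only the xmin/xmax skeleton is modelled here.

```python
# Toy MVCC visibility check: each tuple version records the creating
# transaction id (xmin) and, once superseded, the deleting one (xmax).
committed = set()

def visible(tuple_version):
    xmin, xmax = tuple_version["xmin"], tuple_version["xmax"]
    return xmin in committed and (xmax is None or xmax not in committed)

old_row = {"value": "v1", "xmin": 100, "xmax": 101}  # updated by txn 101
new_row = {"value": "v2", "xmin": 101, "xmax": None}

committed.add(100)
# If txn 101 aborts, it never enters `committed`: its new version stays
# invisible and the old version stays current -- no undo work required.
assert visible(old_row) and not visible(new_row)

committed.add(101)  # had txn 101 committed instead, visibility flips
assert not visible(old_row) and visible(new_row)
print("aborted changes are invisible without any undo replay")
```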

3.4 WAL Sequence Numbers and the LSN

Every WAL record has a Log Sequence Number, an ever increasing 64 bit value that identifies its position in the WAL stream. The LSN is the fundamental unit of progress tracking in both replication and storage consistency. Aurora relies heavily on LSNs to coordinate which WAL records have been acknowledged by the storage quorum and which are still in flight.

4. The Checkpoint: What It Is, What It Does, and Aurora’s Radical Change to Its Role

4.1 What a Checkpoint Does in Standard PostgreSQL

Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the WAL file. In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the WAL (known as the redo record) from which it should start the REDO operation. Any changes made to data files before that point are guaranteed to be already on disk. Hence, after a checkpoint, WAL segments preceding the one containing the redo record are no longer needed and can be recycled or removed.

Checkpoints serve two purposes: they bound the amount of WAL that must be replayed on crash recovery, and they write dirty pages to data files so that the WAL can be recycled. In community PostgreSQL, checkpoints are expensive because they require writing all dirty pages to random locations on disk. The checkpoint_timeout and max_wal_size parameters control how frequently they occur, and checkpoint_completion_target spreads the dirty page writing over a fraction of the checkpoint interval to avoid a burst of I/O.

4.2 How Aurora Transforms the Checkpoint Contract

In Aurora, the data files do not live on a locally attached disk. They live in the distributed storage tier. Aurora’s engine still generates WAL records, but all checkpoint and recovery work is done by the storage nodes, which contain some PostgreSQL code for exactly that purpose. The database instance ships only the WAL (redo log) to storage. It also applies those records to the pages in its own shared buffers, but those pages stay in memory and are never written out as data files.

This is a profound architectural shift. In standard PostgreSQL, a crash means the engine must replay WAL from the last checkpoint. In Aurora, the storage tier handles redo application continuously and autonomously. In order to reduce the time to recover, an instance can be immediately opened even when there’s still some redo to apply to recover to the point of failure. As the Aurora storage tier is autonomous for this, the redo is applied on the fly when a block is read.

The practical consequence is that Aurora crash recovery is nearly instantaneous from the instance perspective. The engine restarts, the storage tier already has all durably committed WAL applied or in flight, and the instance is available within seconds. This is why an Aurora writer can fail with zero data loss: the WAL was already durable in the quorum before the commit was acknowledged.

4.3 Why a Host Can Fail With No Data Loss

This is one of Aurora’s most important guarantees and it flows directly from the write path design. A commit is only acknowledged to the client after four of the six storage nodes have confirmed they received and durably stored the WAL record. The writer instance holds only in flight WAL records in its WAL buffer. All committed data is already in the storage quorum. If the writer instance fails, no committed data is lost because committed data was never solely in the writer’s memory. A new writer can be promoted, the storage tier provides the latest consistent state, and transactions that were in progress but not yet committed are simply lost, which is the correct behaviour for uncommitted data.

5. Aurora’s Storage Architecture: Six Copies, Three AZs, and the 4 of 6 Quorum

5.1 The Storage Topology

Aurora’s distributed storage divides the cluster volume into 10 GB segments. The Aurora storage engine writes data to six copies of data in parallel spread across three Availability Zones. The storage layer in Aurora is not just a block device but a cluster of machines functioning as storage nodes capable of understanding and applying database redo log records.

Each 10 GB segment therefore has six copies: two in each of three AZs. The writer instance sends WAL records directly to all six storage nodes in parallel using Aurora’s proprietary compute to storage protocol. This is categorically different from standard PostgreSQL replication, where the engine ships WAL to replica instances that then apply it.

5.2 The Quorum Write and Why 4 of 6 Is Sufficient

When the database instance sends WAL logs to storage instances, it doesn’t wait for all of them to respond to consider the write as successful. Instead, it only needs the confirmation from some of them. This is called a write quorum. In Aurora, each data segment has 6 nodes distributed among 3 Availability Zones. The write quorum for Aurora is 4, which means the writes are persistent on at least 2 Availability Zones.

The read quorum is 3. Because read quorum plus write quorum (3 + 4 = 7) exceeds the total number of copies (6), there is always at least one node that was part of both the last write and any subsequent read. This guarantees that a reader always sees the latest committed state. Aurora tracks LSNs per storage node, so it knows exactly which node has the most current data and can route reads accordingly without always requiring a full quorum read.
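Because six nodes is a small set, the overlap guarantee can be verified exhaustively rather than taken on faith. The sketch below checks every possible 4-node write quorum against every possible 3-node read quorum:

```python
# Exhaustive check that a 4-of-6 write quorum always intersects a 3-of-6
# read quorum (4 + 3 > 6), so some node in any read saw the last write.
from itertools import combinations

nodes = range(6)
overlaps = [
    set(write_set) & set(read_set)
    for write_set in combinations(nodes, 4)
    for read_set in combinations(nodes, 3)
]
assert all(overlaps)  # no pair of quorums is disjoint
print(f"checked {len(overlaps)} quorum pairs, all overlap")
```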

From a durability point of view, the Aurora storage engine can handle an Availability Zone plus one failure, sustaining continued write despite loss of an Availability Zone. The Aurora storage engine can continue to serve reads despite an Availability Zone plus a failure of an additional copy.

5.3 The Boxcar Technique: Batching WAL Records

To improve the I/O flow, a boxcar technique is used for log records. This is a method of optimizing I/O by shipping a set of log records in what can be termed a boxcar. The boxcar log records are fully ordered by their log sequence number. The records are shuffled to appropriate segments in a partially ordered state, and then boxcared to storage nodes where writes are issued. An asynchronous 4 of 6 quorum is used to reduce I/O and network jitter. Writes are sorted into buckets per storage node to use the network more efficiently.

This batching is critical to Aurora’s write efficiency under concurrency. When many transactions are committing simultaneously, their WAL records can be grouped into a single network transmission to each storage node, dramatically reducing the per transaction network overhead compared to sending records individually.
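The routing-and-batching idea can be sketched in a few lines. The record tuples, segment addressing, and payloads below are invented for illustration; Aurora's actual wire protocol and segment metadata are proprietary.

```python
# Rough sketch of boxcarring: LSN-ordered WAL records are routed to the
# 10 GB segment owning each page, producing one batched transmission per
# segment instead of one network round trip per record.
from collections import defaultdict

SEGMENT_SIZE = 10 * 1024**3  # 10 GiB: byte offset // size -> segment id

def boxcar(records):
    """records: iterable of (lsn, byte_offset, payload), assumed LSN-ordered."""
    boxcars = defaultdict(list)
    for lsn, offset, payload in records:
        boxcars[offset // SEGMENT_SIZE].append((lsn, payload))
    return dict(boxcars)  # one batch per segment, each still LSN-ordered

records = [
    (101, 5 * 1024**3, "update t1"),   # lands in segment 0
    (102, 12 * 1024**3, "insert t2"),  # lands in segment 1
    (103, 6 * 1024**3, "update t1"),   # segment 0 again -> same boxcar
]
batches = boxcar(records)
print({seg: len(batch) for seg, batch in batches.items()})
```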

6. Why Aurora Commits Cannot Be Sub-Millisecond

This is the most important architectural constraint for high frequency write workloads, and it is non negotiable.

6.1 Cross AZ Network Latency Is the Floor

Every commit requires acknowledgement from four storage nodes spread across at least two Availability Zones. A write to a single AZ typically takes well under a millisecond on AWS network fabric. A write that must cross an AZ boundary and wait for acknowledgement cannot complete in under a millisecond under any realistic conditions. The higher latency and lower throughput are due to the process of writing to storage with six replicas across three Availability Zones. In this process, four out of the six replicas must acknowledge the Write-Ahead Log.

In practice, the IO:XactSync wait event in Aurora Performance Insights represents the time each commit spends waiting for the storage quorum to acknowledge. The IO:XactSync event occurs when the database is waiting for the Aurora storage subsystem to acknowledge the commit of a regular transaction. This wait is not a sign of a problem when it appears at normal levels. It is the fundamental cost of Aurora’s durability model.

6.2 The Commit Rate Ceiling

Because each commit must wait for a cross AZ acknowledgement, the maximum sustainable commit rate in Aurora is bounded by the commit latency. If a single session commits every 2 ms, it can issue at most 500 commits per second. Scaling this up requires parallelism across many concurrent sessions, and Aurora handles concurrent commits well by batching the WAL records from multiple sessions into shared network transmissions. But there is a ceiling, and systems that issue a commit for every single row insert will hit it quickly.
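The arithmetic behind that ceiling is worth writing down explicitly; it is just concurrency divided by latency, the numbers below being examples rather than measurements of any particular cluster:

```python
# Back-of-envelope commit ceiling: with a fixed quorum round trip per
# commit, sustainable throughput is bounded by sessions / commit latency.
def max_commits_per_second(commit_latency_ms: float, sessions: int) -> float:
    return sessions * 1000.0 / commit_latency_ms

# One session committing every 2 ms can never exceed 500 commits/s.
assert max_commits_per_second(2.0, sessions=1) == 500.0

# 200 concurrent sessions raise the theoretical ceiling to 100,000/s,
# but instance CPU and storage network bandwidth bite well before that.
print(max_commits_per_second(2.0, sessions=200))
```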

A well designed application should not issue more than one commit per user interaction, and at that rate a few milliseconds of commit latency is imperceptible. Aurora makes that discipline essential rather than optional. For high availability reasons, the storage is not local but distributed across multiple data centres, and in order to remain available even through a full AZ failure, the blocks are mirrored in each AZ. Each committed change is therefore written to six copies on remote storage.

7. Read Nodes and Immediate Read Consistency After Writes

7.1 How Aurora Reader Nodes See Writes

One of Aurora’s most counterintuitive capabilities is that reader instances can immediately see writes committed on the writer without any significant replication lag. This is possible because the replica shares the storage with the master database. The master database ships the redo (WAL) to the replica database. The redo records are applied directly to the cached pages and do not need to be written out.

The reader instances receive the WAL stream asynchronously from the writer and use it to invalidate or update their buffer cache entries. Because the underlying storage is shared and the quorum has already acknowledged the write, any read to storage from a reader will return the committed data. The reader’s buffer cache may briefly hold a stale page, but the WAL invalidation signal brings it in line almost immediately. Aurora measures reader lag in single digit milliseconds under normal conditions.

7.2 Why This Matters for Application Architecture

The near zero reader lag in Aurora means that applications can offload a substantial fraction of reads to reader endpoints without worrying about serving stale data for meaningful periods. A correctly designed application can write to the writer endpoint and immediately issue subsequent reads to the reader endpoint with high confidence that the data will be visible. This is fundamentally different from standard PostgreSQL physical replication, where reader lag can grow to seconds or minutes under write pressure.

8. The IO:XactSync Wait Event in Detail

The IO:XactSync wait event is Aurora’s window into write commit latency and is the single most important wait event for diagnosing write saturation. When the IO:XactSync event appears more than normal, possibly indicating a performance problem, typical causes include network saturation, where traffic between clients and the DB instance or traffic to the storage subsystem might be too heavy for the network bandwidth, and CPU pressure, where a heavy workload might be preventing the Aurora storage daemon from getting sufficient CPU time.

A system that sees IO:XactSync consuming a large fraction of active session time is a system where commit frequency is too high relative to the throughput the network and storage subsystem can sustain. The remedy is always one of three things: batch more work into each commit, scale up the instance to get more CPU and network bandwidth, or reduce the total write rate.

9. Provisioned Aurora vs Aurora Serverless v2 for Write Saturation

9.1 Provisioned Aurora: Predictable Ceiling, Full Power

A provisioned Aurora instance gives you a fixed, predictable compute envelope. You know exactly how many vCPUs and how much network bandwidth are available, which makes capacity planning for write heavy workloads straightforward. The largest available instance classes on Graviton 4 (db.r8g.48xlarge) provide up to 192 vCPUs and 50 Gbps of network bandwidth, and these high end Graviton 4 instances also benefit from improved allocation of Write-Ahead Log stream numbers, increasing throughput for write heavy workloads.

For sustained, high volume write workloads, provisioned Aurora is almost always the right choice. The instance is always at full capacity, there is no warmup delay, and the network connection to the storage tier is fully established. The write ceiling is bounded only by the instance’s network bandwidth and the cross AZ quorum latency.

9.2 Aurora Serverless v2: Elastic but Write Constrained at Low ACU

Aurora Serverless v2 scales compute capacity in Aurora Capacity Unit increments, where each ACU provides approximately 2 GiB of memory, proportional CPU, and proportional network bandwidth. Aurora Serverless v2 continually monitors the workload on your database. If the workload increases, Aurora Serverless v2 automatically scales up the capacity of your database by adding more ACUs up to the maximum capacity you defined.

The critical issue for write heavy workloads is the scaling response time. The rate of scale up is faster when the current DB capacity is larger. For example, in one published measurement, the time to complete the same scale up step (from 8.0 to 10.5 ACUs) varied with the capacity the instance held when the scale up event began: roughly 138 seconds when it began at 1.0 ACU, and far less when it began from a higher capacity. A Serverless v2 instance sitting at a low minimum ACU that suddenly receives a write spike will lag behind for one to two minutes while scaling. During that window, IO:XactSync latency rises, commit throughput drops, and the application experiences degraded write performance. This is a fundamental limitation of the elastic model.

For write heavy production workloads with predictable load, provisioned Aurora is strongly preferred. For workloads that are genuinely spiky and write heavy only during short bursts, Serverless v2 with a sufficiently high minimum ACU (to allow rapid scaling) can work. The minimum ACU must be set high enough that the scale up response covers the burst quickly enough for the application’s SLA.

9.3 The Network Bandwidth Constraint

Network bandwidth to the storage tier scales with ACU in Serverless v2. A cluster running at 2 ACUs has a fraction of the network bandwidth of a db.r6g.4xlarge. Since Aurora commits are network bound at the cross AZ quorum hop, a Serverless v2 instance at low ACU will saturate its storage network bandwidth earlier and at lower commit rates than a large provisioned instance.

10. Amazon S3 and Aurora’s Backup Consistency Model

Aurora continuously backs up the cluster volume to Amazon S3. This is the foundation of its point in time recovery capability and its ability to create new clusters from snapshots with no I/O penalty on the running cluster. Understanding how S3 handles consistency is important for understanding how Aurora restores work.

10.1 S3 Strong Consistency: The 2020 re:Invent Announcement

For most of S3’s history, it used an eventual consistency model: a write followed immediately by a read might not see the latest version of the object. Amazon S3 delivers strong read after write consistency automatically for all applications, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost. After a successful write of a new object, or an overwrite or delete of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with any changes reflected.

This change, announced at AWS re:Invent 2020, was architecturally significant for Aurora. It means that when Aurora writes backup segments to S3, the recovery process can immediately and reliably read those segments in the order they were written, with no risk of seeing a stale version. This simplifies the Aurora recovery logic and eliminates the need for the consistency workarounds that pre 2020 S3 required. The strong consistency announcement blog post confirmed that this applies to all GET, PUT, and LIST operations across all regions with no additional cost.

10.2 S3 Throughput and Aurora Backup Performance

S3 performance supports at least 3,500 requests per second to add data and 5,500 requests per second to retrieve data. Each S3 prefix can support these request rates, making it simple to increase performance significantly. Aurora uses multiple prefixes for backup segments, so the effective aggregate throughput to S3 is well above these per prefix limits. For database teams concerned about whether backup I/O competes with production write I/O, the answer in Aurora is that they operate on separate paths: production writes go to the storage tier network, and backups are exported from the storage tier to S3 without passing through the writer instance at all.

11. Observing Write Queuing and Saturation: Scripts and Metrics

11.1 Key CloudWatch Metrics for Write Saturation

The following CloudWatch metrics are the primary signals for write throughput saturation in Aurora PostgreSQL.

Cluster level metrics:

VolumeWriteIOPs measures the number of write I/O operations to the Aurora cluster volume, reported at 5 minute intervals. This is the billing metric and reflects actual storage tier writes. Note that this metric counts logical writes to the Aurora storage log, not traditional block device writes, so interpretation requires care.

CommitLatency measures the average duration for the engine and storage to complete commit operations. A rising CommitLatency under steady write load is a strong signal of storage network saturation or CPU pressure on the storage daemon.

CommitThroughput measures the average number of commit operations per second. Plotting this alongside CommitLatency reveals whether the system is approaching its commit rate ceiling.

WriteLatency measures the average time for write I/O operations.

DiskQueueDepth measures the length of the I/O queue for the underlying storage. In Aurora, this metric reflects activity on the local instance storage used for temporary files, not the distributed storage tier. The distributed storage queue is better observed through CommitLatency and IO:XactSync wait events.

Instance level metrics for Serverless v2:

ServerlessDatabaseCapacity shows the current ACU level. Track this alongside CommitLatency to see whether scaling events correlate with commit latency spikes.

ACUUtilization shows capacity utilisation as a percentage of the maximum ACU. A value consistently at or near 100% means the instance is at its maximum ACU and cannot scale further.

11.2 SQL Diagnostic Scripts

The following scripts connect to the Aurora writer endpoint and surface internal write performance state.

Script 1: Current wait events and write pressure

SELECT
  wait_event_type,
  wait_event,
  count(*) AS session_count,
  array_agg(state) AS states
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY wait_event_type, wait_event
ORDER BY session_count DESC;

A high count on IO / XactSync confirms the system is commit latency bound. A high count on Lock events alongside write waits suggests row level contention is compounding the problem.

Script 2: Commit rate and latency from Aurora internal functions

SELECT
  sd.datname,
  aurora_stat_get_db_commit_latency(d.oid) AS cumulative_commit_latency_us,
  sd.xact_commit,
  sd.xact_rollback,
  ROUND(
    aurora_stat_get_db_commit_latency(d.oid)::numeric / NULLIF(sd.xact_commit, 0), 2
  ) AS avg_commit_latency_us_per_commit
FROM pg_stat_database AS sd
JOIN pg_database AS d ON d.datname = sd.datname
WHERE sd.datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY avg_commit_latency_us_per_commit DESC NULLS LAST;

This uses Aurora’s aurora_stat_get_db_commit_latency function to compute average commit latency per database. Values consistently above 5,000 microseconds (5 ms) under normal load suggest storage saturation.

Script 3: WAL generation rate

SELECT
  pg_current_wal_lsn() AS current_lsn,
  pg_wal_lsn_diff(
    pg_current_wal_lsn(),
    '0/0'
  ) AS total_wal_bytes_generated;

Run this at 60 second intervals and compute the delta to get the WAL generation rate in bytes per second. Correlate this with VolumeWriteIOPs to understand the write amplification factor for your workload.
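The delta computation is trivial but easy to get wrong by a unit; a minimal sketch, with made-up sample values standing in for two runs of Script 3 taken 60 seconds apart:

```python
# Turn two samples of total_wal_bytes_generated (60 s apart) into a WAL
# generation rate. The sample values below are invented for illustration.
sample_1 = 9_815_332_864   # total_wal_bytes_generated at time t
sample_2 = 9_921_554_432   # same query 60 seconds later

wal_rate_mib_s = (sample_2 - sample_1) / 60 / 1024**2
print(f"WAL generation rate: {wal_rate_mib_s:.2f} MiB/s")
```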

Script 4: Identifying high frequency commit sources

SELECT
  application_name,
  state,
  count(*) AS sessions,
  count(*) FILTER (WHERE wait_event = 'XactSync') AS xactsync_waiters,
  avg(extract(epoch FROM now() - state_change)) AS avg_age_s
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY application_name, state
ORDER BY xactsync_waiters DESC;

This identifies which application connections are most frequently waiting on XactSync, which directly points to the code paths issuing too many commits.

Script 5: CloudWatch CLI script for write saturation dashboard

#!/usr/bin/env bash

CLUSTER_ID="your-cluster-identifier"
REGION="af-south-1"
START_TIME=$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')
END_TIME=$(date -u '+%Y-%m-%dT%H:%M:%SZ')

echo "=== Aurora Write Saturation Metrics (last 1 hour) ==="

for METRIC in CommitLatency CommitThroughput WriteLatency VolumeWriteIOPs; do
  echo ""
  echo "--- $METRIC ---"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name "$METRIC" \
    --dimensions Name=DBClusterIdentifier,Value="$CLUSTER_ID" \
    --start-time "$START_TIME" \
    --end-time "$END_TIME" \
    --period 300 \
    --statistics Average Maximum \
    --region "$REGION" \
    --output table \
    --query 'sort_by(Datapoints, &Timestamp)[*].{Time:Timestamp,Avg:Average,Max:Maximum}'
done

Save the script as aurora-write-saturation.sh and make it executable with chmod +x before running it.

12. Squeezing More Write Throughput Through an Existing Cluster

The following techniques are ordered from lowest risk and effort to highest. Most teams should exhaust the early items before considering the later ones.

12.1 Batch Commits: The Single Most Impactful Change

Because every commit in Aurora waits for a cross AZ acknowledgement, the cost of a commit is fixed and non trivial. The single most effective technique for improving write throughput is to ensure each commit contains as many rows as the application’s consistency requirements allow.

An application that issues one INSERT followed by one COMMIT per row is wasting the Aurora storage network on single record round trips. The same application rewritten to batch inserts into groups of 100 to 1000 rows per commit can achieve 100 to 1000 times the throughput for the same storage network cost.

BEGIN;
INSERT INTO events (user_id, event_type, payload, created_at)
VALUES
  (1001, 'click', '{"button": "buy"}', NOW()),
  (1002, 'view', '{"page": "home"}', NOW()),
  (1003, 'submit', '{"form": "checkout"}', NOW());
COMMIT;

For applications processing a stream of incoming events, a simple accumulator pattern in the application layer that collects records for 50 to 100 ms before issuing a batch insert is often sufficient to reduce commit frequency by two orders of magnitude.
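A minimal sketch of that accumulator, assuming a hypothetical flush callback that performs the multi-row INSERT and COMMIT (the class, thresholds, and callback here are illustrative, not a specific client library's API):

```python
# Size- and time-bounded commit batcher: buffer rows and flush them as one
# multi-row INSERT (one commit) when the batch fills or the oldest buffered
# row has waited max_wait_s. `flush` is a hypothetical callback.
import time

class CommitBatcher:
    def __init__(self, flush, max_rows=500, max_wait_s=0.1):
        self.flush, self.max_rows, self.max_wait_s = flush, max_rows, max_wait_s
        self.buffer, self.oldest = [], None

    def add(self, row):
        if not self.buffer:
            self.oldest = time.monotonic()   # first row starts the clock
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_rows
                or time.monotonic() - self.oldest >= self.max_wait_s):
            self.flush(self.buffer)          # one INSERT ... VALUES (...),(...); COMMIT
            self.buffer, self.oldest = [], None

batches = []  # stands in for the database: records each flushed batch
batcher = CommitBatcher(flush=batches.append, max_rows=3)
for event in range(7):
    batcher.add({"event_id": event})
print([len(b) for b in batches])  # 7 events -> two full batches, one row pending
```

In production the pending remainder also needs a periodic timer flush so a quiet stream does not strand its last rows in the buffer.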

12.2 COPY for Bulk Loads

For bulk data loading, COPY is dramatically more efficient than batched INSERT statements. COPY bypasses the per row overhead of the SQL parser and planner, and it commits the entire load as a single transaction.

COPY events (user_id, event_type, payload, created_at)
FROM STDIN
WITH (FORMAT csv, DELIMITER ',');

For very large loads, splitting the data into 50,000 to 100,000 row chunks and committing between chunks provides a balance between transaction size and recovery risk.
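The chunk and commit approach can be sketched with a generic chunking helper; `chunked` is a hypothetical utility, and each yielded chunk would be streamed to one COPY ... FROM STDIN call followed by a commit.

```python
from itertools import islice

def chunked(rows, chunk_size=50_000):
    """Yield lists of at most chunk_size rows from any iterable, so each
    chunk can be loaded with one COPY and committed independently."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

# 125,000 rows split into commits of 50k + 50k + 25k.
sizes = [len(c) for c in chunked(range(125_000), 50_000)]
print(sizes)
```

Committing per chunk bounds both the transaction size and the amount of work that must be repeated if a load fails partway through.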

12.3 Use INSERT ... ON CONFLICT Instead of Read Before Write

A common pattern that doubles the work per logical write is the read before write: select a row to check existence, then insert or update based on the result. INSERT ... ON CONFLICT DO UPDATE (upsert) collapses this into a single server side statement, halving round trips and statement overhead, and eliminates the race condition inherent in the read before write pattern.

INSERT INTO user_balances (user_id, balance, updated_at)
VALUES (1001, 500.00, NOW())
ON CONFLICT (user_id) DO UPDATE
  SET balance = EXCLUDED.balance,
      updated_at = EXCLUDED.updated_at;

12.4 Partial Indexes to Reduce Index Maintenance WAL

Every index on a table generates additional WAL records on every write to that table. A table with 10 indexes generates roughly 10 times the WAL volume per row insertion compared to a table with no indexes. Partial indexes that cover only the rows that queries actually filter on can dramatically reduce the number of index pages that must be updated on each write.

CREATE INDEX CONCURRENTLY idx_events_pending
ON events (created_at)
WHERE status = 'pending';

This index only contains rows where status = 'pending'. If most rows transition through pending quickly and the vast majority of rows have a different status, this index will be much smaller and generate far less write amplification than a full index on created_at.

12.5 Fillfactor Tuning to Reduce Write Amplification on Updates

PostgreSQL’s default fillfactor for heap pages is 100%, meaning pages are packed completely full. When an update occurs on a full page, there is no room for the new row version in the same page, so PostgreSQL must write the new version to a different page. The update then cannot be a HOT (Heap Only Tuple) update, so every index on the table must also be updated to point at the new row location, generating extra WAL and extra I/O.

Setting fillfactor to 70 to 80 on heavily updated tables leaves free space on each page for new row versions, enabling HOT updates that touch only the heap page and skip index maintenance entirely.

ALTER TABLE events SET (fillfactor = 75);
-- The new fillfactor applies only to pages written from now on; a full
-- rewrite applies it to existing pages too, but note that VACUUM FULL
-- takes an ACCESS EXCLUSIVE lock on the table for the duration.
VACUUM FULL events;

12.6 Unlogged Tables for Non Durable Staging Data

For staging tables used as temporary landing zones before data is transformed and inserted into durable tables, UNLOGGED tables bypass the WAL entirely. They are approximately two to four times faster for writes than logged tables.

CREATE UNLOGGED TABLE event_staging (
  id BIGSERIAL PRIMARY KEY,
  raw_payload JSONB,
  received_at TIMESTAMPTZ DEFAULT NOW()
);

Unlogged tables are truncated on crash recovery, so they are appropriate only for truly transient data where loss on failure is acceptable. In Aurora, where the writer node can restart quickly and the storage tier is durable, an unlogged staging table that is processed in seconds and then cleared is a safe optimisation for ingestion pipelines.

12.7 Connection Pooling with PgBouncer

Each direct PostgreSQL connection from a client occupies a backend process on the writer instance, consuming memory and CPU. Aurora’s per commit cross AZ write means that the writer CPU is under pressure during high write load. Connection pooling with PgBouncer in transaction mode reduces the number of backend processes, reduces context switching overhead, and allows more efficient batching of WAL writes from fewer concurrent sessions.
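A minimal pgbouncer.ini sketch for transaction mode pooling; the endpoint, database name, and pool sizes are illustrative and must be tuned per workload.

```ini
; Minimal pgbouncer.ini sketch for transaction mode pooling.
; Endpoint, database name, and pool sizes are illustrative.
[databases]
appdb = host=my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; Release the server connection back to the pool at COMMIT/ROLLBACK,
; so thousands of clients share a few dozen backend processes.
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
```

Transaction mode is incompatible with session level state such as session scoped SET statements and advisory locks held across transactions, so audit the application for those before switching.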

12.8 Increase wal_buffers for High Concurrency

On systems with high WAL output, XLogFlush requests might not occur often enough to prevent XLogInsertRecord from having to do writes. On such systems one should increase the number of WAL buffers by modifying the wal_buffers parameter. When full_page_writes is set and the system is very busy, setting wal_buffers higher will help smooth response times during the period immediately following each checkpoint.

By default wal_buffers is sized at 1/32 of shared_buffers, but capped at the size of one WAL segment (16 MB). For Aurora instances with large shared_buffers, increasing wal_buffers to 64 MB or 128 MB can improve the throughput of WAL writes under heavy concurrent load by reducing the frequency of WAL buffer flushes triggered by individual sessions.
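The community PostgreSQL default (wal_buffers = -1) resolves to 1/32 of shared_buffers, clamped between 64 kB and one 16 MB WAL segment, which is why large instances benefit from an explicit setting. The clamping arithmetic:

```python
# Default wal_buffers sizing rule from community PostgreSQL (wal_buffers = -1):
# 1/32 of shared_buffers, clamped to [64 kB, one WAL segment (16 MB)].

KB, MB = 1024, 1024 * 1024

def default_wal_buffers(shared_buffers_bytes, wal_segment=16 * MB):
    return min(max(shared_buffers_bytes // 32, 64 * KB), wal_segment)

# A 128 GB shared_buffers pool would want 4 GB by the 1/32 rule, but the
# default clamps at 16 MB -- hence the advice to raise it explicitly.
print(default_wal_buffers(128 * 1024 * MB) // MB)
```

Any shared_buffers of 512 MB or more already hits the 16 MB cap, so on large Aurora instances the default leaves a lot of headroom on the table.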

12.9 Tune max_wal_size and Checkpoint Parameters

In Aurora, the max_wal_size parameter still influences how aggressively the engine cycles WAL segments. Setting it too low forces more frequent checkpoints, which on standard PostgreSQL causes I/O bursts. In Aurora, the checkpoint pressure is absorbed by the storage tier, but aggressive checkpoint cycling still adds overhead to the WAL management path.

Setting max_wal_size to 8 GB to 16 GB on write heavy clusters and checkpoint_completion_target to 0.9 gives the checkpointer maximum time to spread dirty page writes, reducing the chance of a checkpoint I/O burst impacting foreground commits.

12.10 Partitioning to Reduce Write Amplification on Indexes

Range partitioned tables allow new writes to land on the most recent partition, which typically has much smaller indexes than a monolithic table. Smaller indexes mean fewer index page splits, less WAL generated per row, and faster write paths. Partition pruning also means that queries that filter by the partition key skip older partitions entirely.

CREATE TABLE events (
  id BIGSERIAL,
  user_id BIGINT,
  event_type TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_q1 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');

CREATE TABLE events_2025_q2 PARTITION OF events
  FOR VALUES FROM ('2025-04-01') TO ('2025-07-01');
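Creating future partitions ahead of the data is routine maintenance and is usually scripted. A small sketch that generates the quarterly DDL strings (quarterly_partitions is a hypothetical helper; in practice an extension such as pg_partman often does this job):

```python
from datetime import date

def quarterly_partitions(table, year):
    """Generate DDL for the four quarterly range partitions of one year."""
    starts = [date(year, m, 1) for m in (1, 4, 7, 10)] + [date(year + 1, 1, 1)]
    ddl = []
    for q, (lo, hi) in enumerate(zip(starts, starts[1:]), start=1):
        ddl.append(
            f"CREATE TABLE {table}_{year}_q{q} PARTITION OF {table}\n"
            f"  FOR VALUES FROM ('{lo}') TO ('{hi}');"
        )
    return ddl

print(quarterly_partitions("events", 2025)[1])
```

Running such a generator from a scheduled job keeps the next partition in place before writes arrive, avoiding insert failures at quarter boundaries.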

12.11 Asynchronous Commit for Non Critical Paths

For workloads that can tolerate losing the last few milliseconds of writes on a crash (for example, audit logs or analytics events where occasional gaps are acceptable), synchronous_commit = off instructs the engine not to wait for the WAL to be acknowledged by the storage quorum before returning to the client. With synchronous_commit disabled, a COMMIT returns without waiting for the four of six quorum acknowledgement. This weakens the durability guarantee: the most recent transactions can be lost if the writer fails before their WAL reaches the quorum.

This should be used only for specific sessions or tables where data loss of the most recent transactions is genuinely acceptable. It should never be applied at the cluster level for transactional workloads.

SET synchronous_commit = off;
INSERT INTO analytics_events ...;
COMMIT;
SET synchronous_commit = on;

12.12 Graviton 4 Instances for WAL Stream Improvements

Aurora’s release notes report that improved allocation of Write-Ahead Log stream numbers increased throughput for write heavy workloads on the new high end Graviton 4 instances. If the cluster is on an older generation instance type and write throughput is the bottleneck, migrating to db.r8g Graviton 4 instances provides both the Aurora specific WAL stream improvements and the general compute and network bandwidth gains of the Graviton 4 architecture.

13. Aurora I/O Optimized: Eliminating the I/O Cost Variable

Aurora’s default pricing charges separately for compute, storage, and I/O operations. For write heavy workloads that generate high VolumeWriteIOPs, the I/O cost can become significant and unpredictable. Aurora I/O Optimized, introduced in 2023, replaces the per I/O charge with a higher flat storage cost that includes all I/O. For clusters whose I/O cost represents more than roughly 25% of the total Aurora bill, I/O Optimized is typically cheaper and eliminates the pricing volatility that makes capacity planning difficult.

14. RDS PostgreSQL with io2 Block Express vs Aurora for Write Heavy Workloads

Aurora is not always the right choice for write heavy workloads. For workloads where the cross AZ quorum latency of Aurora is a binding constraint and where the application requires the absolute lowest possible commit latency rather than Aurora’s durability and availability guarantees, RDS PostgreSQL with io2 Block Express storage deserves serious consideration.

14.1 io2 Block Express Architecture

With io2 Block Express volumes, database workloads benefit from consistent sub-millisecond latency and significantly improved durability of 99.999% compared to io1’s 99.9%. You also get up to 1,000 IOPS per GiB from provisioned storage, which is 20 times more than io1, all while maintaining the same price point as io1 volumes.

io2 Block Express is an EBS volume built on the AWS Nitro System, presented to the instance as an NVMe device over a dedicated storage network. All RDS io2 volumes based on the AWS Nitro System are io2 Block Express volumes and provide sub-millisecond average latency. The maximum provisioned IOPS is 256,000 per volume, and the maximum throughput is 4,000 MB/s. For an RDS for PostgreSQL instance running synchronous_commit = on, the WAL is written to the attached io2 volume and the commit is acknowledged after a single fsync to that volume, not a cross AZ quorum round trip. This makes individual commit latency materially lower than Aurora.

14.2 Dedicated Log Volumes

A dedicated log volume (DLV) moves PostgreSQL database transaction logs and MySQL/MariaDB redo logs and binary logs to a storage volume that is separate from the volume containing the database tables. A DLV makes transaction write logging more efficient and consistent. DLVs are ideal for databases with large allocated storage, high I/O per second requirements, or latency sensitive workloads.

Separating the WAL onto its own io2 Block Express volume eliminates I/O contention between WAL writes and data page reads and writes. DLV is compatible with PIOPS storage types (io1 and io2 Block Express) and is provisioned with a fixed size of 1,024 GiB and 3,000 Provisioned IOPS. The DLV’s dedicated IOPS budget means that WAL writes always have reserved capacity regardless of what the main data volume is doing.

14.3 Benchmark Comparison: RDS io2 vs Aurora for PostgreSQL

In AWS’s published benchmark, an RDS for PostgreSQL db.r6gd.4xlarge instance in Single AZ configured with io2 Block Express showed a 32% improvement in transaction latency over gp3 and 29% over io1.

For write heavy OLTP workloads with simple transactions, a well tuned RDS PostgreSQL instance on io2 Block Express with a dedicated log volume can achieve lower per transaction latency than Aurora on the same instance class, because the commit path does not cross an AZ boundary. The tradeoff is availability: RDS Single AZ has no automatic failover, and RDS Multi AZ requires a failover of 60 to 120 seconds on writer failure, versus Aurora’s typical 15 to 30 second failover.

The practical guidance is:

Choose Aurora when: the workload benefits from Aurora’s fast failover, zero data loss durability model, read scaling to multiple replicas, or continuous S3 backup. Most production OLTP workloads should use Aurora.

Choose RDS with io2 Block Express when: the application is extremely latency sensitive, commits must complete in under 1 ms, and the engineering team is prepared to manage Multi AZ failover and replication lag on read replicas. Financial systems that process sequential streams of individual transactions with strict per transaction latency SLAs are the primary use case.

14.4 x2iedn Instances for Extreme Write Workloads on RDS

For the most demanding write workloads on RDS PostgreSQL, the x2iedn instance family provides extremely large local NVMe storage alongside substantial memory. The x2iedn.32xlarge provides 128 vCPUs, 4 TiB of memory, and 100 Gbps of network bandwidth, with local NVMe drives that can sustain millions of IOPS. Combined with io2 Block Express for the data volume and a dedicated log volume for WAL, this configuration can sustain write throughputs that are impractical on Aurora’s shared storage tier.

The appropriate comparison is not Aurora vs RDS per se, but Aurora’s durability and availability guarantees versus the raw performance ceiling of a dedicated high performance instance. Teams should benchmark their actual workload on both before committing.

15. The Write Path Diagram: End to End

Tracing a single transaction from application to durable storage in Aurora PostgreSQL:

  1. The application opens a connection through PgBouncer to the Aurora writer endpoint.
  2. The application issues BEGIN and one or more DML statements.
  3. Each DML statement causes the backend process to locate the relevant pages in the shared buffer pool, apply changes in memory, and write REDO log records to the WAL buffer. The UNDO information for each change is stored in the heap as the old row version.
  4. The application issues COMMIT.
  5. The WAL writer process flushes the WAL buffer to the Aurora storage daemon.
  6. The Aurora storage daemon transmits the WAL records in parallel to all six storage nodes across three AZs, grouped into boxcar batches.
  7. The engine waits for the IO:XactSync condition: four of six storage nodes have acknowledged receipt and durable storage of the WAL records.
  8. The commit acknowledgement is returned to the application.
  9. The dirty pages in the shared buffer pool remain in memory. The storage nodes apply the WAL records asynchronously to materialise the updated page versions. Reader instances receive the WAL stream asynchronously and invalidate their cached pages.
  10. Checkpoint processing on the storage tier periodically advances the redo point, allowing older WAL to be recycled.
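Step 7 above can be modelled directly: commit latency is the fourth fastest of the six storage node acknowledgements, which is why one slow or unreachable node does not delay commits. The latencies below are illustrative, not measured figures.

```python
# The commit waits for the 4th of 6 storage node acknowledgements (a
# four-of-six quorum), so up to two slow or failed nodes are tolerated
# without adding commit latency. Latencies are illustrative only.

def quorum_commit_latency(ack_latencies_ms, quorum=4):
    """Commit completes when the quorum-th fastest ack arrives."""
    return sorted(ack_latencies_ms)[quorum - 1]

healthy = [0.9, 1.0, 1.1, 1.2, 1.4, 1.6]       # all six nodes responsive
one_slow = [0.9, 1.0, 1.1, 1.2, 1.4, 250.0]    # one node badly degraded
print(quorum_commit_latency(healthy))
print(quorum_commit_latency(one_slow))
```

Both scenarios commit in the same time, because the slowest two acknowledgements are never on the commit path; this is the durability-for-latency trade described throughout the article.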

16. A Practical Pathway Out of Write Throughput Problems

For engineering teams that have identified write throughput as the current bottleneck, the following sequence provides a structured pathway.

16.1 Step 1: Diagnose Before Acting

Enable Performance Insights if not already enabled. Observe the average active sessions chart and identify the dominant wait event. If IO:XactSync dominates, the problem is commit frequency or storage network saturation. If LWLock:buffer_mapping or Lock:relation appear, the problem is lock contention, which is a different class of problem.

Run Script 4 from section 11.2 to identify which application connection strings are generating the most XactSync waits. This identifies the code path to target first.

16.2 Step 2: Batch Commits in Application Code

Without any infrastructure changes, implementing or increasing commit batching in the application layer is typically the highest return intervention. A single engineer can often double or triple write throughput in a few days by changing one hot write path to batch 100 to 500 rows per commit instead of one.

16.3 Step 3: Audit and Eliminate Unnecessary Indexes

Run the following query to identify indexes that are consuming write IOPS but are rarely used for reads:

SELECT
  schemaname,
  relname AS tablename,
  indexrelname AS indexname,
  idx_scan AS scans,
  idx_tup_read,
  idx_tup_fetch,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;

Large indexes with zero or near zero idx_scan values are pure write overhead. Confirm the statistics cover a representative period (the counters accumulate only since the last statistics reset), then drop them after verifying no query depends on them. Each dropped index reduces WAL generation per write.

16.4 Step 4: Upgrade Instance Class and Move to Graviton 4

If the instance is not on db.r8g Graviton 4, upgrading provides improved WAL stream allocation and greater network bandwidth to the storage tier. The upgrade is an in place modification with a brief restart.

16.5 Step 5: Enable Aurora I/O Optimized and Review Storage Costs

If I/O cost is a meaningful fraction of the database bill, enable Aurora I/O Optimized. This removes the per I/O pricing and allows the team to write more aggressively without cost surprises.

16.6 Step 6: Evaluate Partitioning for Append Heavy Tables

For tables that receive the majority of writes to recent time periods, range partitioning by day or month dramatically reduces the write amplification from index maintenance on large tables.

16.7 Step 7: Consider Aurora Limitless for Extreme Scale

If the write throughput ceiling of a single writer instance cannot be overcome through the above techniques, Aurora PostgreSQL Limitless Database provides horizontal write scaling through distributed sharding. This is a significant architectural commitment and should be the last resort after all single writer optimisations have been exhausted.

17. References

All links current as of March 2026.

Amazon Aurora PostgreSQL Parameters Part 1: Memory and Query Plan Management

Amazon Aurora PostgreSQL Parameters Part 2: Replication, Security, and Logging

Aurora Essential Concepts for Tuning

Aurora IO:XactSync Wait Event

Aurora CloudWatch Metrics

aurora_stat_get_db_commit_latency

Aurora Serverless v2 How It Works

Aurora Serverless v2 ACU Scaling

Aurora PostgreSQL Release Notes (WAL Improvements on Graviton 4)

Optimize RDS Performance with io2 Block Express

Benchmark RDS for PostgreSQL with Dedicated Log Volumes

RDS DB Instance Storage

PostgreSQL WAL Configuration

S3 Strong Consistency

Amazon S3 Strong Read After Write Consistency Announcement (re:Invent 2020)

Multi AZ PostgreSQL COMMIT Wait Events: WALSync, SyncRep and XactSync

AWS Aurora XactSync Is Not a PostgreSQL Wait Event

Aurora Write Through Cache for Logical Replication

Determining the Optimal Value for shared_buffers

Is RDS for PostgreSQL or Aurora PostgreSQL a Better Choice?

AWS Aurora Postgres Monitoring (GitHub)

PostgreSQL Checkpoint Internals
