Why Rubrik’s Architecture Matters: When Restore, Not Backup, Is the Product

1. Backups Should Be Boring (and That Is the Point)

Backups are boring. They should be boring.
A backup system that generates excitement is usually signalling failure.

The only time backups become interesting is when they are missing, and that interest level is lethal. Emergency bridges. Frozen change windows. Executive escalation. Media briefings. Regulatory apology letters. Engineers being asked questions that have no safe answers.

Most backup platforms are built for the boring days. Rubrik is designed for the day boredom ends.

2. Backup Is Not the Product. Restore Is.

Many organisations still evaluate backup platforms on the wrong metric: how fast they can copy data somewhere else.

That metric is irrelevant during an incident.

When things go wrong, the only questions that matter are:

  • What can I restore?
  • How fast can it be used?
  • How many restores can run in parallel?
  • How little additional infrastructure is required?

Rubrik treats restore as the primary product, not a secondary feature.

3. Architectural Starting Point: Designed for Failure, Not Demos

Rubrik was built without tape era assumptions.

There is no central backup server, no serial job controller, and no media server bottleneck. Instead, it uses a distributed, scale out architecture with a global metadata index and a stateless policy engine.

Restore becomes a metadata lookup problem, not a job replay problem. This distinction is invisible in demos and decisive during outages.

4. Performance Metrics That Actually Matter

Backup throughput is easy to optimise and easy to market. Restore performance is constrained by network fan out, restore concurrency, control plane orchestration, and application host contention.

Rubrik addresses this by default through parallel restore streams, linear scaling with node count, and minimal control plane chatter. Restore performance becomes predictable rather than optimistic.

5. Restore Semantics That Match Reality

The real test of any backup platform is not how elegantly it captures data, but how usefully it returns that data when needed. This is where architectural decisions made years earlier either pay dividends or extract penalties.

5.1 Instant Access Instead of Full Rehydration

Rubrik does not require full data copy back before access. It supports live mount of virtual machines, database mounts directly from backup storage, and file system mounts for selective recovery.

The recovery model becomes access first, copy later if needed. This is the difference between minutes and hours when production is down.

5.2 Dropping a Table Should Not Be a Crisis

Rubrik understands databases as structured systems, not opaque blobs.

It supports table level restores for SQL Server, mounting a database backup as a live database, extracting tables or schemas without restoring the full database, and point in time recovery without rollback.

Accidental table drops should be operational annoyances, not existential threats.

5.3 Supported Database Engines

Rubrik provides native protection for the major enterprise database platforms:

Database Engine | Live Mount | Point in Time Recovery | Key Constraints
Microsoft SQL Server | Yes | Yes (transaction log replay) | SQL 2012+ supported; Always On AG, FCI, standalone
Oracle Database | Yes | Yes (archive log replay) | RAC, Data Guard, Exadata supported; SPFILE required for automated recovery
SAP HANA | No | Yes | Backint API integration; uses native HANA backup scheduling
PostgreSQL | No | Yes (up to 5 minute RPO) | File level incremental; on premises and cloud (AWS, Azure, GCP)
IBM Db2 | Via Elastic App Service | Yes | Uses native Db2 backup utilities
MongoDB | Via Elastic App Service | Yes | Sharded and unsharded clusters; no quiescing required
MySQL | Via Elastic App Service | Yes | Uses native MySQL backup tools
Cassandra | Via Elastic App Service | Yes | Via Rubrik Datos IO integration

The distinction between native integration and Elastic App Service matters operationally. Native integration means Rubrik handles discovery, scheduling, and orchestration directly. Elastic App Service means Rubrik provides managed volumes as backup targets while the database’s native tools handle the actual backup process. Both approaches deliver immutability and policy driven retention, but the operational experience differs.

5.4 Live Mount: Constraints and Caveats

Live Mount is Rubrik’s signature capability—mounting backups as live, queryable databases without copying data back to production storage. The database runs with its data files served directly from the Rubrik cluster over NFS (for Oracle) or SMB 3.0 (for SQL Server).

This capability is transformative for specific use cases. It is not a replacement for production storage.

What Live Mount Delivers:

  • Near instant database availability (seconds to minutes, regardless of database size)
  • Zero storage provisioning on the target host
  • Multiple concurrent mounts from the same backup
  • Point in time access across the entire retention window
  • Ideal for granular recovery, DBCC health checks, test/dev cloning, audit queries, and upgrade validation

What Live Mount Does Not Deliver:

  • Production grade I/O performance
  • High availability during Rubrik cluster maintenance
  • Persistence across host or cluster reboots

IOPS Constraints:

Live Mount performance is bounded by the Rubrik appliance’s ability to serve I/O, not by the target host’s storage subsystem. Published figures suggest approximately 30,000 IOPS per Rubrik appliance for Live Mount workloads. This is adequate for reporting queries, data extraction, and validation testing. It is not adequate for transaction heavy production workloads.

The performance characteristics are inherently different from production storage:

Metric | Production SAN/Flash | Rubrik Live Mount
Random read IOPS | 100,000+ | ~30,000 per appliance
Latency profile | Sub millisecond | Network + NFS overhead
Write optimisation | Production tuned | Backup optimised
Concurrent workloads | Designed for contention | Shared with backup operations

SQL Server Live Mount Specifics:

  • Databases mount via SMB 3.0 shares with UNC paths
  • Transaction log replay occurs during mount for point in time positioning
  • The mounted database is read write, but writes go to the Rubrik cluster
  • Supported for standalone instances, Failover Cluster Instances, and Always On Availability Groups
  • Table level recovery requires mounting the database, then using T-SQL to extract and import specific objects (a minimal sketch follows this list)
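
To make the table level recovery step concrete, here is a minimal, hypothetical sketch. The instance, database, and table names are placeholders, and pyodbc is simply one convenient client; the point is that the mounted copy appears as an ordinary database on the target instance, so recovering a dropped table is a cross database SELECT INTO (or INSERT ... SELECT) away.

import pyodbc  # assumption: any SQL Server client would do; pyodbc is used for illustration

# Hypothetical names: SQLPROD01 is the production instance, Sales is the damaged
# production database, and Sales_LiveMount is the Rubrik mounted point in time copy.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=SQLPROD01;"
    "DATABASE=Sales;Trusted_Connection=yes;TrustServerCertificate=yes"
)
cur = conn.cursor()

# SELECT INTO recreates the dropped table's data (not its indexes or constraints)
# directly from the mounted copy.
cur.execute("""
    SELECT *
    INTO dbo.Orders
    FROM Sales_LiveMount.dbo.Orders
""")
conn.commit()
conn.close()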

Oracle Live Mount Specifics:

  • Data files mount via NFS; redo logs and control files remain on the target host
  • Automated recovery requires source and target configurations to match (RAC to RAC, single instance to single instance, ASM to ASM)
  • Files only recovery allows dissimilar configurations but requires DBA managed RMAN recovery
  • SPFILE is required for automated recovery; PFILE databases require manual intervention
  • Block change tracking (BCT) is disabled on Live Mount targets
  • Live Mount fails if the target host, RAC cluster, or Rubrik cluster reboots during the mount—requiring forced unmount to clean up metadata
  • Direct NFS (DNFS) is recommended on Oracle RAC nodes for improved recovery performance

What Live Mount Is Not:

Live Mount is explicitly designed for temporary access, not sustained production workloads. The use cases Rubrik markets (test/dev, DBCC validation, granular recovery, audit queries) all share a common characteristic: they are time bounded operations that tolerate moderate I/O performance in exchange for instant availability.

Running production transaction processing against a Live Mount database would be technically possible but operationally inadvisable. The I/O profile, the network dependency, and the lack of high availability guarantees make it unsuitable for workloads where performance and uptime matter.

5.5 The Recovery Hierarchy

Understanding when to use each recovery method matters:

Recovery Need | Recommended Method | Time to Access | Storage Required
Extract specific rows/tables | Live Mount + query | Minutes | None
Validate backup integrity | Live Mount + DBCC | Minutes | None
Clone for test/dev | Live Mount | Minutes | None
Full database replacement | Export/Restore | Hours (size dependent) | Full database size
Disaster recovery cutover | Instant Recovery | Minutes (then migrate) | Temporary, then full

The strategic value of Live Mount is avoiding full restores when full restores are unnecessary. For a 5TB database where someone dropped a single table, Live Mount means extracting that table in minutes rather than waiting hours for a complete restore.

For actual disaster recovery, where the production database is gone and must be replaced, Live Mount provides bridge access while the full restore completes in parallel. The database is queryable immediately; production grade performance follows once data migration finishes.

6. Why Logical Streaming Is a Design Failure

Traditional restore models stream backup data through the database host. This guarantees CPU contention, IO pressure, and restore times proportional to database size rather than change size.

Rubrik avoids this by mounting database images and extracting only required objects. The database host stops being collateral damage during recovery.

6.1 The VSS Tax: Why SQL Server Backups Cannot Escape Application Coordination

For VMware workloads without databases, Rubrik can leverage storage level snapshots that are instantaneous, application agnostic, and impose zero load on the guest operating system. The hypervisor freezes the VM state, the storage array captures the point in time image, and the backup completes before the application notices.

SQL Server cannot offer this simplicity. The reason is not a Microsoft limitation or a Rubrik constraint. The reason is transactional consistency.

The Crash Consistent Option Exists

Nothing technically prevents Rubrik, or any backup tool, from taking a pure storage snapshot of a SQL Server volume without application coordination. The snapshot would complete in milliseconds with zero database load.

The problem is what you would recover: a crash consistent image, not an application consistent one.

A crash consistent snapshot captures storage state mid flight. This includes partially written pages, uncommitted transactions, dirty buffers not yet flushed to disk, and potentially torn writes caught mid I/O. SQL Server is designed to recover from exactly this state. Every time the database engine starts after an unexpected shutdown, it runs crash recovery, rolling forward committed transactions from the log and rolling back uncommitted ones.

The database will become consistent. Eventually. Probably.

Why Probably Is Not Good Enough

Crash recovery works. It works reliably. It is tested millions of times daily across every SQL Server instance that experiences an unclean shutdown.

But restore confidence matters. When production is down and executives are asking questions, the difference between “this backup is guaranteed consistent” and “this backup should recover correctly after crash recovery completes” is operationally significant.

VSS exists to eliminate that uncertainty.

What VSS Actually Does

When a backup application requests an application consistent SQL Server snapshot, the following sequence executes:

  1. The backup application calls the VSS coordinator
  2. VSS notifies the SQL Server VSS Writer that a backup is imminent
  3. SQL Server flushes dirty pages from the buffer pool to disk
  4. SQL Server briefly freezes write I/O to guarantee a consistent capture point
  5. The storage snapshot executes
  6. SQL Server resumes normal operation
  7. VSS confirms completion to the backup application

The result is a snapshot that requires no crash recovery on restore. The database is immediately consistent, immediately usable, and carries no uncertainty about transactional integrity.
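
Before trusting application consistent snapshots, it is worth confirming the SQL Server VSS Writer is actually healthy. A small illustrative check, assuming a Windows host and an elevated shell; this uses the generic VSS tooling that ships with Windows, not any Rubrik interface:

import subprocess

# "vssadmin list writers" reports each registered writer and its state, e.g. "[1] Stable".
output = subprocess.run(
    ["vssadmin", "list", "writers"],
    capture_output=True, text=True, check=True
).stdout

for block in output.split("Writer name:")[1:]:
    if "SqlServerWriter" in block:
        state = next(line.strip() for line in block.splitlines() if line.strip().startswith("State:"))
        print("SqlServerWriter ->", state)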

The Coordination Cost

The VSS freeze window is typically brief, milliseconds to low seconds. But the preparation is not free.

Buffer pool flushes on large databases generate I/O pressure. Checkpoint operations compete with production workloads. The freeze, however short, introduces latency for in flight transactions. The database instance is actively participating in its own backup.

For databases measured in terabytes, with buffer pools consuming hundreds of gigabytes, this coordination overhead becomes operationally visible. Backup windows that appear instantaneous from the storage console are hiding real work inside the SQL Server instance.

The Architectural Asymmetry

This creates a fundamental difference in backup elegance across workload types:

Workload Type | Backup Method | Application Load | Restore State
VMware VM (no database) | Storage snapshot | Zero | Crash consistent (acceptable)
VMware VM (with SQL Server) | VSS coordinated snapshot | Moderate | Application consistent
Physical SQL Server | VSS coordinated snapshot | Moderate to high | Application consistent
Physical SQL Server | Pure storage snapshot | Zero | Crash consistent (risky)

For a web server or file share, crash consistent is fine. The application has no transactional state worth protecting. For a database, crash consistent means trusting recovery logic rather than guaranteeing consistency.

The Uncomfortable Reality

The largest, most critical SQL Server databases, the ones that would benefit most from zero overhead instantaneous backup, are precisely the workloads where crash consistent snapshots carry the most risk. More transactions in flight. Larger buffer pools. More recovery time if something needs replay.

Rubrik supports VSS coordination because the alternative is shipping backups that might need crash recovery. That uncertainty is acceptable for test environments. It is rarely acceptable for production databases backing financial systems, customer records, or regulatory reporting.

The VSS tax is not a limitation imposed by Microsoft or avoided by competitors. It is the cost of consistency. Every backup platform that claims application consistent SQL Server protection is paying it. The only question is whether they admit the overhead exists.

7. Snapshot Based Protection Is Objectively Better (When You Can Get It)

The previous section explained why SQL Server backups cannot escape application coordination. VSS exists because transactional consistency requires it, and the coordination overhead is the price of certainty.

This makes the contrast with pure snapshot based protection even starker. Where snapshots work cleanly, they are not incrementally better. They are categorically superior.

What Pure Snapshots Deliver

Snapshot based backups in environments that support them provide:

  • Near instant capture: microseconds to milliseconds, regardless of dataset size
  • Zero application load: the workload never knows a backup occurred
  • Consistent recovery points: the storage layer guarantees point in time consistency
  • Predictable backup windows: duration is independent of data volume
  • No bandwidth consumption during capture: data movement happens later, asynchronously

A 50TB VMware datastore snapshots in the same time as a 50GB datastore. Backup windows become scheduling decisions rather than capacity constraints.

Rubrik exploits this deeply in VMware environments. Snapshot orchestration, instant VM recovery, and live mounts all depend on the hypervisor providing clean, consistent, zero overhead capture points.

Why This Is Harder Than It Looks

The elegance of snapshot based protection depends entirely on the underlying platform providing the right primitives. This is where the gap between VMware and everything else becomes painful.

VMware offers:

  • Native snapshot APIs with transactional semantics
  • Changed Block Tracking (CBT) for efficient incrementals
  • Hypervisor level consistency without guest coordination
  • Storage integration through VADP (vSphere APIs for Data Protection)

These are not accidental features. VMware invested years building a backup ecosystem because they understood that enterprise adoption required operational maturity, not just compute virtualisation.

Physical hosts offer none of this.

There is no universal snapshot API for bare metal servers. Storage arrays provide snapshot capabilities, but each vendor implements them differently, with different consistency guarantees, different integration points, and different failure modes. The operating system has no standard mechanism to coordinate application state with storage level capture.

The Physical Host Penalty

This is why physical SQL Server hosts face a compounding disadvantage:

  1. No hypervisor abstraction: there is no layer between the OS and storage that can freeze state cleanly
  2. VSS remains mandatory: application consistency still requires database coordination
  3. No standardised incremental tracking: without CBT or equivalent, every backup must rediscover what changed
  4. Storage integration is bespoke: each array, each SAN, each configuration requires specific handling

The result is that physical hosts with the largest databases, the workloads generating the most backup data, with the longest restore times, under the most operational pressure, receive the least architectural benefit from modern backup platforms.

They are stuck paying the VSS tax without receiving the snapshot dividend.

The Integration Hierarchy

Backup elegance follows a clear hierarchy based on platform integration depth:

Environment | Snapshot Quality | Incremental Efficiency | Application Consistency | Overall Experience
VMware (no database) | Excellent | CBT driven | Not required | Seamless
VMware (with SQL Server) | Excellent | CBT driven | VSS coordinated | Good with overhead
Cloud native (EBS, managed disks) | Good | Provider dependent | Varies by workload | Generally clean
Physical with enterprise SAN | Possible | Array dependent | VSS coordinated | Complex but workable
Physical with commodity storage | Limited | Often full scan | VSS coordinated | Painful

The further down this hierarchy, the more the backup platform must compensate for missing primitives. Rubrik handles this better than most, but even excellent software cannot conjure APIs that do not exist.

Why the Industry Irony Persists

The uncomfortable truth is that snapshot based protection delivers its greatest value precisely where it is least available.

A 500GB VMware VM snapshots effortlessly. The hypervisor provides everything needed. Backup is boring, as it should be.

A 50TB physical SQL Server, the database actually keeping the business running, containing years of transactional history, backing regulatory reporting and financial reconciliation, must coordinate through VSS, flush terabytes of buffer pool, sustain I/O pressure during capture, and hope the storage layer cooperates.

The workloads that need snapshot elegance the most are architecturally prevented from receiving it.

This is not a Rubrik limitation. It is not a Microsoft conspiracy. It is the accumulated consequence of decades of infrastructure evolution where virtualisation received backup investment and physical infrastructure did not.

What This Means for Architecture Decisions

Understanding this hierarchy should influence infrastructure strategy:

Virtualise where possible. The backup benefits alone often justify the overhead. A SQL Server VM with VSS coordination still benefits from CBT, instant recovery, and hypervisor level orchestration.

Choose storage with snapshot maturity. If physical hosts are unavoidable, enterprise arrays with proven snapshot integration reduce the backup penalty. This is not the place for commodity storage experimentation.

Accept the VSS overhead. For SQL Server workloads, crash consistent snapshots are technically possible but operationally risky. The coordination cost is worth paying. Budget for it in backup windows and I/O capacity.

Plan restore, not backup. Snapshot speed is irrelevant if restore requires hours of data rehydration. The architectural advantage of snapshots extends to recovery only if the platform supports instant mount and selective restore.

Rubrik’s value in this landscape is not eliminating the integration gaps—nobody can—but navigating them intelligently. Where snapshots work, Rubrik exploits them fully. Where they do not, Rubrik minimises the penalty through parallel restore, live mounts, and metadata driven recovery.

The goal remains the same: make restore the product, regardless of how constrained the backup capture had to be.

8. Ransomware: Where Architecture Is Exposed

8.1 The Restore Storm Problem

After ransomware, the challenge is not backup availability. The challenge is restoring everything at once.

Constraints appear immediately. East-west traffic saturates. DWDM links run hot. Core switch buffers overflow. Cloud egress throttling kicks in.

Rubrik mitigates this through parallel restores, SLA based prioritisation, and live mounts for critical systems. What it cannot do is defeat physics. A good recovery plan avoids turning a data breach into a network outage.
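
The physics is easy to quantify before an incident rather than during one. A rough sizing sketch, treating the 70 percent protocol efficiency as an assumption to be replaced with measured throughput:

def restore_hours(terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Lower bound on hours needed to move `terabytes` across a link at a given usable efficiency."""
    bits = terabytes * 8 * 1000 ** 4          # decimal terabytes to bits
    return bits / (link_gbps * 1e9 * efficiency) / 3600

for tb in (50, 200, 500):
    print(f"{tb} TB: ~{restore_hours(tb, 10):.0f} h at 10 Gbps, ~{restore_hours(tb, 100):.1f} h at 100 Gbps")

Parallel streams change how quickly the pipe is filled, not how big it is.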

9. SaaS vs Appliance: This Is a Network Decision

Functionally, Rubrik SaaS and on prem appliances share the same policy engine, metadata index, and restore semantics.

The difference is bandwidth reality.

On prem appliances provide fast local restores, predictable latency, and minimal WAN dependency. SaaS based protection provides excellent cloud workload coverage and operational simplicity, but restore speed is bounded by network capacity and egress costs.

Hybrid estates usually require both.

10. Why Rubrik in the Cloud?

Cloud providers offer native backup primitives. These are necessary but insufficient.

They do not provide unified policy across environments, cross account recovery at scale, ransomware intelligence, or consistent restore semantics.

Rubrik turns cloud backups into recoverable systems rather than isolated snapshots.

10.1 Should You Protect Your AWS Root and Crypto Accounts?

Yes, because losing the control plane is worse than losing data.

Rubrik protects IAM configuration, account state, and infrastructure metadata. After a compromise, restoring how the account was configured is as important as restoring the data itself.
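
The idea is easy to illustrate outside any product. A minimal boto3 sketch of capturing IAM account state as a point in time document; the bucket and key are placeholders, and this shows the shape of the problem, not how Rubrik implements it:

import json
import boto3

iam = boto3.client("iam")
s3 = boto3.client("s3")

# Collect users, groups, roles, and managed policies into one point in time document.
snapshot = {"UserDetailList": [], "GroupDetailList": [], "RoleDetailList": [], "Policies": []}
for page in iam.get_paginator("get_account_authorization_details").paginate():
    for key in snapshot:
        snapshot[key].extend(page.get(key, []))

s3.put_object(
    Bucket="my-config-archive",                       # placeholder bucket
    Key="iam/account-authorization-details.json",
    Body=json.dumps(snapshot, default=str),           # default=str handles datetimes
)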

11. Backup Meets Security (Finally)

Rubrik integrates threat awareness into recovery using entropy analysis, change rate anomaly detection, and snapshot divergence tracking.
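
The entropy signal is simple enough to sketch, even though any production implementation is far more sophisticated: encrypted or ransomed files have byte distributions close to uniform, so their Shannon entropy approaches 8 bits per byte, while most working data sits well below that. A minimal, illustrative heuristic:

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte; random or encrypted content approaches 8.0."""
    if not data:
        return 0.0
    counts = Counter(data)
    return -sum(c / len(data) * math.log2(c / len(data)) for c in counts.values())

def looks_encrypted(path: str, threshold: float = 7.5, sample_bytes: int = 1 << 20) -> bool:
    """Heuristic only: flag files whose first 1 MiB is suspiciously close to random."""
    with open(path, "rb") as fh:
        return shannon_entropy(fh.read(sample_bytes)) > threshold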

This answers the most dangerous question in recovery: which backup is actually safe to restore?

Most platforms cannot answer this with confidence.

12. VMware First Class Citizen, Physical Hosts Still Lag

Rubrik’s deepest integrations exist in VMware environments, including snapshot orchestration, instant VM recovery, and live mounts.

The uncomfortable reality remains that physical hosts with the largest datasets would benefit most from snapshot based protection, yet receive the least integration. This is an industry gap, not just a tooling one.

13. When Rubrik Is Not the Right Tool

Rubrik is not universal.

It is less optimal when bandwidth is severely constrained, estates are very small, or tape workflows are legally mandated.

Rubrik’s value emerges at scale, under pressure, and during failure.

14. Conclusion: Boredom Is Success

Backups should be boring. Restores should be quiet. Executives should never know the platform exists.

The only time backups become exciting is when they fail, and that excitement is almost always lethal.

Rubrik is not interesting because it stores data. It is interesting because, when everything is already on fire, restore remains a controlled engineering exercise rather than a panic response.


Aurora PostgreSQL: Archiving and Restoring Partitions from Large Tables to Iceberg and Parquet on S3

A Complete Guide to Archiving, Restoring, and Querying Large Table Partitions

When dealing with multi-terabyte tables in Aurora PostgreSQL, keeping historical partitions online becomes increasingly expensive and operationally burdensome. This guide presents a complete solution for archiving partitions to S3 in Iceberg/Parquet format, restoring them when needed, and querying archived data directly via a Spring Boot API without database restoration.

1. Architecture Overview

The solution comprises three components:

  1. Archive Script: Exports a partition from Aurora PostgreSQL to Parquet files organised in Iceberg table format on S3
  2. Restore Script: Imports archived data from S3 back into a staging table for validation and migration to the main table
  3. Query API: A Spring Boot application that reads Parquet files directly from S3, applying predicate pushdown for efficient filtering

This approach reduces storage costs by approximately 70 to 80 percent compared to keeping data in Aurora, while maintaining full queryability through the API layer.
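
As a sanity check on that range, assume illustrative prices of roughly $0.10 per GB month for Aurora storage and $0.023 per GB month for S3 Standard; both are placeholders that vary by region and over time, and Parquet compression typically shrinks the footprint further:

AURORA_GB_MONTH = 0.10        # assumed illustrative price; check current regional pricing
S3_STANDARD_GB_MONTH = 0.023  # assumed illustrative price; check current regional pricing

def monthly_saving(archived_tb: float) -> tuple[float, float]:
    """Return (dollars saved per month, fractional saving) for a given archived volume."""
    gb = archived_tb * 1024
    aurora, s3 = gb * AURORA_GB_MONTH, gb * S3_STANDARD_GB_MONTH
    return aurora - s3, 1 - s3 / aurora

saved, pct = monthly_saving(5)   # for example, 5 TB of historical partitions
print(f"~${saved:,.0f} per month saved ({pct:.0%})")   # roughly $394 per month, a 77% reduction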

2. Prerequisites and Dependencies

2.1 Python Environment

pip install boto3 pyarrow psycopg2-binary pandas pyiceberg sqlalchemy

2.2 AWS Configuration

Ensure your environment has appropriate IAM permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::your-archive-bucket",
                "arn:aws:s3:::your-archive-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable"
            ],
            "Resource": "*"
        }
    ]
}

2.3 Java Dependencies

For the Spring Boot API, add these to your pom.xml:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <version>1.14.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>s3</artifactId>
        <version>2.25.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-core</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-parquet</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-aws</artifactId>
        <version>1.5.0</version>
    </dependency>
</dependencies>

3. The Archive Script

This script extracts a partition from Aurora PostgreSQL and writes it to S3 in Iceberg table format with Parquet data files.

3.1 Configuration Module

# config.py

from dataclasses import dataclass
from typing import Optional
import os


@dataclass
class DatabaseConfig:
    host: str
    port: int
    database: str
    user: str
    password: str

    @classmethod
    def from_environment(cls) -> "DatabaseConfig":
        return cls(
            host=os.environ["AURORA_HOST"],
            port=int(os.environ.get("AURORA_PORT", "5432")),
            database=os.environ["AURORA_DATABASE"],
            user=os.environ["AURORA_USER"],
            password=os.environ["AURORA_PASSWORD"]
        )

    @property
    def connection_string(self) -> str:
        return f"postgresql://{self.user}:{self.password}@{self.host}:{self.port}/{self.database}"


@dataclass
class S3Config:
    bucket: str
    prefix: str
    region: str

    @classmethod
    def from_environment(cls) -> "S3Config":
        return cls(
            bucket=os.environ["ARCHIVE_S3_BUCKET"],
            prefix=os.environ.get("ARCHIVE_S3_PREFIX", "aurora-archives"),
            region=os.environ.get("AWS_REGION", "eu-west-1")
        )


@dataclass  
class ArchiveConfig:
    table_name: str
    partition_column: str
    partition_value: str
    schema_name: str = "public"
    batch_size: int = 100000
    compression: str = "snappy"
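
The scripts that follow read their settings from environment variables. As a quick reference, these are the variables the configuration module expects (values are placeholders; AURORA_PORT, ARCHIVE_S3_PREFIX, and AWS_REGION fall back to defaults):

import os

# Placeholder values; point these at your own cluster and bucket.
os.environ.update({
    "AURORA_HOST": "my-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com",
    "AURORA_DATABASE": "appdb",
    "AURORA_USER": "archiver",
    "AURORA_PASSWORD": "change-me",
    "ARCHIVE_S3_BUCKET": "your-archive-bucket",
})

from config import DatabaseConfig, S3Config

print(DatabaseConfig.from_environment().connection_string)
print(S3Config.from_environment())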

3.2 Schema Introspection

# schema_inspector.py

from typing import Dict, List, Tuple, Any
import psycopg2
from psycopg2.extras import RealDictCursor
import pyarrow as pa


class SchemaInspector:
    """Inspects PostgreSQL table schema and converts to PyArrow schema."""

    POSTGRES_TO_ARROW = {
        "integer": pa.int32(),
        "bigint": pa.int64(),
        "smallint": pa.int16(),
        "real": pa.float32(),
        "double precision": pa.float64(),
        "numeric": pa.decimal128(38, 10),
        "text": pa.string(),
        "character varying": pa.string(),
        "varchar": pa.string(),
        "char": pa.string(),
        "character": pa.string(),
        "boolean": pa.bool_(),
        "date": pa.date32(),
        "timestamp without time zone": pa.timestamp("us"),
        "timestamp with time zone": pa.timestamp("us", tz="UTC"),
        "time without time zone": pa.time64("us"),
        "time with time zone": pa.time64("us"),
        "uuid": pa.string(),
        "json": pa.string(),
        "jsonb": pa.string(),
        "bytea": pa.binary(),
    }

    def __init__(self, connection_string: str):
        self.connection_string = connection_string

    def get_table_columns(
        self, 
        schema_name: str, 
        table_name: str
    ) -> List[Dict[str, Any]]:
        """Retrieve column definitions from information_schema."""
        query = """
            SELECT 
                column_name,
                data_type,
                is_nullable,
                column_default,
                ordinal_position,
                character_maximum_length,
                numeric_precision,
                numeric_scale
            FROM information_schema.columns
            WHERE table_schema = %s 
              AND table_name = %s
            ORDER BY ordinal_position
        """

        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor(cursor_factory=RealDictCursor) as cur:
                cur.execute(query, (schema_name, table_name))
                return list(cur.fetchall())

    def get_primary_key_columns(
        self, 
        schema_name: str, 
        table_name: str
    ) -> List[str]:
        """Retrieve primary key column names."""
        query = """
            SELECT a.attname
            FROM pg_index i
            JOIN pg_attribute a ON a.attrelid = i.indrelid 
                AND a.attnum = ANY(i.indkey)
            JOIN pg_class c ON c.oid = i.indrelid
            JOIN pg_namespace n ON n.oid = c.relnamespace
            WHERE i.indisprimary
              AND n.nspname = %s
              AND c.relname = %s
            ORDER BY array_position(i.indkey, a.attnum)
        """

        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor() as cur:
                cur.execute(query, (schema_name, table_name))
                return [row[0] for row in cur.fetchall()]

    def postgres_type_to_arrow(
        self, 
        pg_type: str, 
        precision: int = None, 
        scale: int = None
    ) -> pa.DataType:
        """Convert PostgreSQL type to PyArrow type."""
        pg_type_lower = pg_type.lower()

        if pg_type_lower == "numeric" and precision and scale:
            return pa.decimal128(precision, scale)

        if pg_type_lower in self.POSTGRES_TO_ARROW:
            return self.POSTGRES_TO_ARROW[pg_type_lower]

        if pg_type_lower.startswith("character varying"):
            return pa.string()

        if pg_type_lower.endswith("[]"):
            base_type = pg_type_lower[:-2]
            if base_type in self.POSTGRES_TO_ARROW:
                return pa.list_(self.POSTGRES_TO_ARROW[base_type])

        return pa.string()

    def build_arrow_schema(
        self, 
        schema_name: str, 
        table_name: str
    ) -> pa.Schema:
        """Build PyArrow schema from PostgreSQL table definition."""
        columns = self.get_table_columns(schema_name, table_name)

        fields = []
        for col in columns:
            arrow_type = self.postgres_type_to_arrow(
                col["data_type"],
                col.get("numeric_precision"),
                col.get("numeric_scale")
            )
            nullable = col["is_nullable"] == "YES"
            fields.append(pa.field(col["column_name"], arrow_type, nullable=nullable))

        return pa.schema(fields)  # use the pa.schema() factory; pa.Schema cannot be constructed directly

    def get_partition_info(
        self, 
        schema_name: str, 
        table_name: str
    ) -> Dict[str, Any]:
        """Get partition strategy information if table is partitioned."""
        query = """
            SELECT 
                pt.partstrat,
                array_agg(a.attname ORDER BY pos.idx) as partition_columns
            FROM pg_partitioned_table pt
            JOIN pg_class c ON c.oid = pt.partrelid
            JOIN pg_namespace n ON n.oid = c.relnamespace
            CROSS JOIN LATERAL unnest(pt.partattrs) WITH ORDINALITY AS pos(attnum, idx)
            JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = pos.attnum
            WHERE n.nspname = %s AND c.relname = %s
            GROUP BY pt.partstrat
        """

        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor(cursor_factory=RealDictCursor) as cur:
                cur.execute(query, (schema_name, table_name))
                result = cur.fetchone()

                if result:
                    strategy_map = {"r": "range", "l": "list", "h": "hash"}
                    return {
                        "is_partitioned": True,
                        "strategy": strategy_map.get(result["partstrat"], "unknown"),
                        "partition_columns": result["partition_columns"]
                    }

                return {"is_partitioned": False}
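
Before wiring the inspector into the archiver, it can be exercised on its own. A quick sketch, assuming the AURORA_* environment variables are set and a transactions table exists in the public schema:

from config import DatabaseConfig
from schema_inspector import SchemaInspector

# Confirm the partition strategy and preview the PyArrow schema the archiver will write.
inspector = SchemaInspector(DatabaseConfig.from_environment().connection_string)
print(inspector.get_partition_info("public", "transactions"))
print(inspector.build_arrow_schema("public", "transactions"))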

3.3 Archive Script

#!/usr/bin/env python3
# archive_partition.py

"""
Archive a partition from Aurora PostgreSQL to S3 in Iceberg/Parquet format.

Usage:
    python archive_partition.py \
        --table transactions \
        --partition-column transaction_date \
        --partition-value 2024-01 \
        --schema public
"""

import argparse
import json
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Generator, Dict, Any, Optional
import hashlib

import boto3
import pandas as pd
import psycopg2
from psycopg2.extras import RealDictCursor
import pyarrow as pa
import pyarrow.parquet as pq

from config import DatabaseConfig, S3Config, ArchiveConfig
from schema_inspector import SchemaInspector


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger(__name__)


class PartitionArchiver:
    """Archives PostgreSQL partitions to S3 in Iceberg format."""

    def __init__(
        self,
        db_config: DatabaseConfig,
        s3_config: S3Config,
        archive_config: ArchiveConfig
    ):
        self.db_config = db_config
        self.s3_config = s3_config
        self.archive_config = archive_config
        self.schema_inspector = SchemaInspector(db_config.connection_string)
        self.s3_client = boto3.client("s3", region_name=s3_config.region)

    def _generate_archive_path(self) -> str:
        """Generate S3 path for archive files."""
        return (
            f"{self.s3_config.prefix}/"
            f"{self.archive_config.schema_name}/"
            f"{self.archive_config.table_name}/"
            f"{self.archive_config.partition_column}={self.archive_config.partition_value}"
        )

    def _build_extraction_query(self) -> str:
        """Build query to extract partition data."""
        full_table = (
            f"{self.archive_config.schema_name}.{self.archive_config.table_name}"
        )

        partition_info = self.schema_inspector.get_partition_info(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        if partition_info["is_partitioned"]:
            partition_table = (
                f"{self.archive_config.table_name}_"
                f"{self.archive_config.partition_value.replace('-', '_')}"
            )
            return f"SELECT * FROM {self.archive_config.schema_name}.{partition_table}"

        return (
            f"SELECT * FROM {full_table} "
            f"WHERE {self.archive_config.partition_column} = %s"
        )

    def _stream_partition_data(self) -> Generator[pd.DataFrame, None, None]:
        """Stream partition data in batches."""
        query = self._build_extraction_query()
        partition_info = self.schema_inspector.get_partition_info(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        with psycopg2.connect(self.db_config.connection_string) as conn:
            with conn.cursor(
                name="archive_cursor",
                cursor_factory=RealDictCursor
            ) as cur:
                if partition_info["is_partitioned"]:
                    cur.execute(query)
                else:
                    cur.execute(query, (self.archive_config.partition_value,))

                while True:
                    rows = cur.fetchmany(self.archive_config.batch_size)
                    if not rows:
                        break
                    yield pd.DataFrame(rows)

    def _write_parquet_to_s3(
        self,
        table: pa.Table,
        file_number: int,
        arrow_schema: pa.Schema
    ) -> Dict[str, Any]:
        """Write a Parquet file to S3 and return metadata."""
        archive_path = self._generate_archive_path()
        file_name = f"data_{file_number:05d}.parquet"
        s3_key = f"{archive_path}/data/{file_name}"

        buffer = pa.BufferOutputStream()
        pq.write_table(
            table,
            buffer,
            compression=self.archive_config.compression,
            write_statistics=True
        )

        data = buffer.getvalue().to_pybytes()

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=s3_key,
            Body=data,
            ContentType="application/octet-stream"
        )

        return {
            "path": f"s3://{self.s3_config.bucket}/{s3_key}",
            "file_size_bytes": len(data),
            "record_count": table.num_rows
        }

    def _write_iceberg_metadata(
        self,
        arrow_schema: pa.Schema,
        data_files: list,
        total_records: int
    ) -> None:
        """Write Iceberg table metadata files."""
        archive_path = self._generate_archive_path()

        schema_dict = {
            "type": "struct",
            "schema_id": 0,
            "fields": []
        }

        for i, field in enumerate(arrow_schema):
            field_type = self._arrow_to_iceberg_type(field.type)
            schema_dict["fields"].append({
                "id": i + 1,
                "name": field.name,
                "required": not field.nullable,
                "type": field_type
            })

        table_uuid = hashlib.md5(
            f"{self.archive_config.table_name}_{datetime.utcnow().isoformat()}".encode()
        ).hexdigest()

        manifest_entries = []
        for df in data_files:
            manifest_entries.append({
                "status": 1,
                "snapshot_id": 1,
                "data_file": {
                    "file_path": df["path"],
                    "file_format": "PARQUET",
                    "record_count": df["record_count"],
                    "file_size_in_bytes": df["file_size_bytes"]
                }
            })

        metadata = {
            "format_version": 2,
            "table_uuid": table_uuid,
            "location": f"s3://{self.s3_config.bucket}/{archive_path}",
            "last_sequence_number": 1,
            "last_updated_ms": int(datetime.utcnow().timestamp() * 1000),
            "last_column_id": len(arrow_schema),
            "current_schema_id": 0,
            "schemas": [schema_dict],
            "partition_spec": [],
            "default_spec_id": 0,
            "last_partition_id": 0,
            "properties": {
                "source.table": self.archive_config.table_name,
                "source.schema": self.archive_config.schema_name,
                "source.partition_column": self.archive_config.partition_column,
                "source.partition_value": self.archive_config.partition_value,
                "archive.timestamp": datetime.utcnow().isoformat(),
                "archive.total_records": str(total_records)
            },
            "current_snapshot_id": 1,
            "snapshots": [
                {
                    "snapshot_id": 1,
                    "timestamp_ms": int(datetime.utcnow().timestamp() * 1000),
                    "summary": {
                        "operation": "append",
                        "added_data_files": str(len(data_files)),
                        "total_records": str(total_records)
                    },
                    "manifest_list": f"s3://{self.s3_config.bucket}/{archive_path}/metadata/manifest_list.json"
                }
            ]
        }

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/metadata/v1.metadata.json",
            Body=json.dumps(metadata, indent=2),
            ContentType="application/json"
        )

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/metadata/manifest_list.json",
            Body=json.dumps(manifest_entries, indent=2),
            ContentType="application/json"
        )

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/metadata/version_hint.text",
            Body="1",
            ContentType="text/plain"
        )

    def _arrow_to_iceberg_type(self, arrow_type: pa.DataType) -> str:
        """Convert PyArrow type to Iceberg type string."""
        type_mapping = {
            pa.int16(): "int",
            pa.int32(): "int",
            pa.int64(): "long",
            pa.float32(): "float",
            pa.float64(): "double",
            pa.bool_(): "boolean",
            pa.string(): "string",
            pa.binary(): "binary",
            pa.date32(): "date",
        }

        if arrow_type in type_mapping:
            return type_mapping[arrow_type]

        if pa.types.is_timestamp(arrow_type):
            if arrow_type.tz:
                return "timestamptz"
            return "timestamp"

        if pa.types.is_decimal(arrow_type):
            return f"decimal({arrow_type.precision},{arrow_type.scale})"

        if pa.types.is_list(arrow_type):
            inner = self._arrow_to_iceberg_type(arrow_type.value_type)
            return f"list<{inner}>"

        return "string"

    def _save_schema_snapshot(self, arrow_schema: pa.Schema) -> None:
        """Save schema information for restore validation."""
        archive_path = self._generate_archive_path()

        columns = self.schema_inspector.get_table_columns(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        pk_columns = self.schema_inspector.get_primary_key_columns(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        schema_snapshot = {
            "source_table": {
                "schema": self.archive_config.schema_name,
                "table": self.archive_config.table_name
            },
            "columns": columns,
            "primary_key_columns": pk_columns,
            "arrow_schema": arrow_schema.to_string(),
            "archived_at": datetime.utcnow().isoformat()
        }

        self.s3_client.put_object(
            Bucket=self.s3_config.bucket,
            Key=f"{archive_path}/schema_snapshot.json",
            Body=json.dumps(schema_snapshot, indent=2, default=str),
            ContentType="application/json"
        )

    def archive(self) -> Dict[str, Any]:
        """Execute the archive operation."""
        logger.info(
            f"Starting archive of {self.archive_config.schema_name}."
            f"{self.archive_config.table_name} "
            f"partition {self.archive_config.partition_column}="
            f"{self.archive_config.partition_value}"
        )

        arrow_schema = self.schema_inspector.build_arrow_schema(
            self.archive_config.schema_name,
            self.archive_config.table_name
        )

        self._save_schema_snapshot(arrow_schema)

        data_files = []
        total_records = 0
        file_number = 0

        for batch_df in self._stream_partition_data():
            table = pa.Table.from_pandas(batch_df, schema=arrow_schema)
            file_meta = self._write_parquet_to_s3(table, file_number, arrow_schema)
            data_files.append(file_meta)
            total_records += file_meta["record_count"]
            file_number += 1

            logger.info(
                f"Written file {file_number}: {file_meta['record_count']} records"
            )

        self._write_iceberg_metadata(arrow_schema, data_files, total_records)

        result = {
            "status": "success",
            "table": f"{self.archive_config.schema_name}.{self.archive_config.table_name}",
            "partition": f"{self.archive_config.partition_column}={self.archive_config.partition_value}",
            "location": f"s3://{self.s3_config.bucket}/{self._generate_archive_path()}",
            "total_records": total_records,
            "total_files": len(data_files),
            "total_bytes": sum(f["file_size_bytes"] for f in data_files),
            "archived_at": datetime.utcnow().isoformat()
        }

        logger.info(f"Archive complete: {total_records} records in {len(data_files)} files")
        return result


def main():
    parser = argparse.ArgumentParser(
        description="Archive Aurora PostgreSQL partition to S3"
    )
    parser.add_argument("--table", required=True, help="Table name")
    parser.add_argument("--partition-column", required=True, help="Partition column name")
    parser.add_argument("--partition-value", required=True, help="Partition value to archive")
    parser.add_argument("--schema", default="public", help="Schema name")
    parser.add_argument("--batch-size", type=int, default=100000, help="Batch size for streaming")

    args = parser.parse_args()

    db_config = DatabaseConfig.from_environment()
    s3_config = S3Config.from_environment()
    archive_config = ArchiveConfig(
        table_name=args.table,
        partition_column=args.partition_column,
        partition_value=args.partition_value,
        schema_name=args.schema,
        batch_size=args.batch_size
    )

    archiver = PartitionArchiver(db_config, s3_config, archive_config)
    result = archiver.archive()

    print(json.dumps(result, indent=2))
    return 0


if __name__ == "__main__":
    sys.exit(main())

4. The Restore Script

This script reverses the archive operation by reading Parquet files from S3 and loading them into a staging table.

4.1 Restore Script

#!/usr/bin/env python3
# restore_partition.py

"""
Restore an archived partition from S3 back to Aurora PostgreSQL.

Usage:
    python restore_partition.py \
        --source-path s3://bucket/prefix/schema/table/partition_col=value \
        --target-table transactions_staging \
        --target-schema public
"""

import argparse
import io
import json
import logging
import sys
from datetime import datetime
from typing import Dict, Any, List, Optional
from urllib.parse import urlparse

import boto3
import pandas as pd
import psycopg2
from psycopg2 import sql
from psycopg2.extras import execute_values
import pyarrow.parquet as pq

from config import DatabaseConfig


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger(__name__)


class PartitionRestorer:
    """Restores archived partitions from S3 to PostgreSQL."""

    def __init__(
        self,
        db_config: DatabaseConfig,
        source_path: str,
        target_schema: str,
        target_table: str,
        create_table: bool = True,
        batch_size: int = 10000
    ):
        self.db_config = db_config
        self.source_path = source_path
        self.target_schema = target_schema
        self.target_table = target_table
        self.create_table = create_table
        self.batch_size = batch_size

        parsed = urlparse(source_path)
        self.bucket = parsed.netloc
        self.prefix = parsed.path.lstrip("/")

        self.s3_client = boto3.client("s3")

    def _load_schema_snapshot(self) -> Dict[str, Any]:
        """Load the schema snapshot from the archive."""
        response = self.s3_client.get_object(
            Bucket=self.bucket,
            Key=f"{self.prefix}/schema_snapshot.json"
        )
        return json.loads(response["Body"].read())

    def _load_iceberg_metadata(self) -> Dict[str, Any]:
        """Load Iceberg metadata."""
        response = self.s3_client.get_object(
            Bucket=self.bucket,
            Key=f"{self.prefix}/metadata/v1.metadata.json"
        )
        return json.loads(response["Body"].read())

    def _list_data_files(self) -> List[str]:
        """List all Parquet data files in the archive."""
        data_prefix = f"{self.prefix}/data/"

        files = []
        paginator = self.s3_client.get_paginator("list_objects_v2")

        for page in paginator.paginate(Bucket=self.bucket, Prefix=data_prefix):
            for obj in page.get("Contents", []):
                if obj["Key"].endswith(".parquet"):
                    files.append(obj["Key"])

        return sorted(files)

    def _postgres_type_from_column_def(self, col: Dict[str, Any]) -> str:
        """Convert column definition to PostgreSQL type."""
        data_type = col["data_type"]

        if data_type == "character varying":
            max_len = col.get("character_maximum_length")
            if max_len:
                return f"varchar({max_len})"
            return "text"

        if data_type == "numeric":
            precision = col.get("numeric_precision")
            scale = col.get("numeric_scale")
            if precision and scale:
                return f"numeric({precision},{scale})"
            return "numeric"

        return data_type

    def _create_staging_table(
        self,
        schema_snapshot: Dict[str, Any],
        conn: psycopg2.extensions.connection
    ) -> None:
        """Create the staging table based on archived schema."""
        columns = schema_snapshot["columns"]

        column_defs = []
        for col in columns:
            pg_type = self._postgres_type_from_column_def(col)
            nullable = "" if col["is_nullable"] == "YES" else " NOT NULL"
            column_defs.append(f'    "{col["column_name"]}" {pg_type}{nullable}')

        create_sql = f"""
            DROP TABLE IF EXISTS {self.target_schema}.{self.target_table};
            CREATE TABLE {self.target_schema}.{self.target_table} (
{chr(10).join(column_defs)}
            )
        """

        with conn.cursor() as cur:
            cur.execute(create_sql)
        conn.commit()

        logger.info(f"Created staging table {self.target_schema}.{self.target_table}")

    def _insert_batch(
        self,
        df: pd.DataFrame,
        columns: List[str],
        conn: psycopg2.extensions.connection
    ) -> int:
        """Insert a batch of records into the staging table."""
        if df.empty:
            return 0

        for col in df.columns:
            if pd.api.types.is_datetime64_any_dtype(df[col]):
                df[col] = df[col].apply(
                    lambda x: x.isoformat() if pd.notna(x) else None
                )

        values = [tuple(row) for row in df[columns].values]

        column_names = ", ".join(f'"{c}"' for c in columns)
        insert_sql = f"""
            INSERT INTO {self.target_schema}.{self.target_table} ({column_names})
            VALUES %s
        """

        with conn.cursor() as cur:
            execute_values(cur, insert_sql, values, page_size=self.batch_size)

        return len(values)

    def restore(self) -> Dict[str, Any]:
        """Execute the restore operation."""
        logger.info(f"Starting restore from {self.source_path}")

        schema_snapshot = self._load_schema_snapshot()
        metadata = self._load_iceberg_metadata()
        data_files = self._list_data_files()

        logger.info(f"Found {len(data_files)} data files to restore")

        columns = [col["column_name"] for col in schema_snapshot["columns"]]

        with psycopg2.connect(self.db_config.connection_string) as conn:
            if self.create_table:
                self._create_staging_table(schema_snapshot, conn)

            total_records = 0

            for file_key in data_files:
                s3_uri = f"s3://{self.bucket}/{file_key}"
                logger.info(f"Restoring from {file_key}")

                response = self.s3_client.get_object(Bucket=self.bucket, Key=file_key)
                table = pq.read_table(io.BytesIO(response["Body"].read()))  # StreamingBody is not seekable, so buffer it first
                df = table.to_pandas()

                file_records = 0
                for start in range(0, len(df), self.batch_size):
                    batch_df = df.iloc[start:start + self.batch_size]
                    inserted = self._insert_batch(batch_df, columns, conn)
                    file_records += inserted

                conn.commit()
                total_records += file_records
                logger.info(f"Restored {file_records} records from {file_key}")

        result = {
            "status": "success",
            "source": self.source_path,
            "target": f"{self.target_schema}.{self.target_table}",
            "total_records": total_records,
            "files_processed": len(data_files),
            "restored_at": datetime.utcnow().isoformat()
        }

        logger.info(f"Restore complete: {total_records} records")
        return result


def main():
    parser = argparse.ArgumentParser(
        description="Restore archived partition from S3 to PostgreSQL"
    )
    parser.add_argument("--source-path", required=True, help="S3 path to archived partition")
    parser.add_argument("--target-table", required=True, help="Target table name")
    parser.add_argument("--target-schema", default="public", help="Target schema name")
    parser.add_argument("--batch-size", type=int, default=10000, help="Insert batch size")
    parser.add_argument("--no-create", action="store_true", help="Don't create table, assume it exists")

    args = parser.parse_args()

    db_config = DatabaseConfig.from_environment()

    restorer = PartitionRestorer(
        db_config=db_config,
        source_path=args.source_path,
        target_schema=args.target_schema,
        target_table=args.target_table,
        create_table=not args.no_create,
        batch_size=args.batch_size
    )

    result = restorer.restore()
    print(json.dumps(result, indent=2))
    return 0


if __name__ == "__main__":
    sys.exit(main())

5. SQL Operations for Partition Migration

Once data is restored to a staging table, you need SQL operations to validate and migrate it to the main table.

5.1 Schema Validation

-- Validate that staging table schema matches the main table

CREATE OR REPLACE FUNCTION validate_table_schemas(
    p_source_schema TEXT,
    p_source_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT
) RETURNS TABLE (
    validation_type TEXT,
    column_name TEXT,
    source_value TEXT,
    target_value TEXT,
    is_valid BOOLEAN
) AS $$
BEGIN
    -- Check column count
    RETURN QUERY
    SELECT 
        'column_count'::TEXT,
        NULL::TEXT,
        src.cnt::TEXT,
        tgt.cnt::TEXT,
        src.cnt = tgt.cnt
    FROM 
        (SELECT COUNT(*)::INT AS cnt 
         FROM information_schema.columns 
         WHERE table_schema = p_source_schema 
           AND table_name = p_source_table) src,
        (SELECT COUNT(*)::INT AS cnt 
         FROM information_schema.columns 
         WHERE table_schema = p_target_schema 
           AND table_name = p_target_table) tgt;

    -- Check each column exists with matching type
    RETURN QUERY
    SELECT 
        'column_definition'::TEXT,
        src.column_name,
        src.data_type || COALESCE('(' || src.character_maximum_length::TEXT || ')', ''),
        COALESCE(tgt.data_type || COALESCE('(' || tgt.character_maximum_length::TEXT || ')', ''), 'MISSING'),
        src.data_type = COALESCE(tgt.data_type, '') 
            AND COALESCE(src.character_maximum_length, 0) = COALESCE(tgt.character_maximum_length, 0)
    FROM 
        information_schema.columns src
    LEFT JOIN 
        information_schema.columns tgt 
        ON tgt.table_schema = p_target_schema 
        AND tgt.table_name = p_target_table
        AND tgt.column_name = src.column_name
    WHERE 
        src.table_schema = p_source_schema 
        AND src.table_name = p_source_table
    ORDER BY src.ordinal_position;

    -- Check nullability
    RETURN QUERY
    SELECT 
        'nullability'::TEXT,
        src.column_name,
        src.is_nullable,
        COALESCE(tgt.is_nullable, 'MISSING'),
        src.is_nullable = COALESCE(tgt.is_nullable, '')
    FROM 
        information_schema.columns src
    LEFT JOIN 
        information_schema.columns tgt 
        ON tgt.table_schema = p_target_schema 
        AND tgt.table_name = p_target_table
        AND tgt.column_name = src.column_name
    WHERE 
        src.table_schema = p_source_schema 
        AND src.table_name = p_source_table
    ORDER BY src.ordinal_position;
END;
$$ LANGUAGE plpgsql;

-- Usage
SELECT * FROM validate_table_schemas('public', 'transactions_staging', 'public', 'transactions');

5.2 Comprehensive Validation Report

-- Generate a full validation report before migration

CREATE OR REPLACE FUNCTION generate_migration_report(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT,
    p_partition_column TEXT,
    p_partition_value TEXT
) RETURNS TABLE (
    check_name TEXT,
    result TEXT,
    details JSONB
) AS $$
DECLARE
    v_staging_count BIGINT;
    v_existing_count BIGINT;
    v_schema_valid BOOLEAN;
BEGIN
    -- Get staging table count
    EXECUTE format(
        'SELECT COUNT(*) FROM %I.%I',
        p_staging_schema, p_staging_table
    ) INTO v_staging_count;

    RETURN QUERY SELECT 
        'staging_record_count'::TEXT,
        'INFO'::TEXT,
        jsonb_build_object('count', v_staging_count);

    -- Check for existing data in target partition
    BEGIN
        EXECUTE format(
            'SELECT COUNT(*) FROM %I.%I WHERE %I = $1',
            p_target_schema, p_target_table, p_partition_column
        ) INTO v_existing_count USING p_partition_value;

        IF v_existing_count > 0 THEN
            RETURN QUERY SELECT 
                'existing_partition_data'::TEXT,
                'WARNING'::TEXT,
                jsonb_build_object(
                    'count', v_existing_count,
                    'message', 'Target partition already contains data'
                );
        ELSE
            RETURN QUERY SELECT 
                'existing_partition_data'::TEXT,
                'OK'::TEXT,
                jsonb_build_object('count', 0);
        END IF;
    EXCEPTION WHEN undefined_column THEN
        RETURN QUERY SELECT 
            'partition_column_check'::TEXT,
            'ERROR'::TEXT,
            jsonb_build_object(
                'message', format('Partition column %s not found', p_partition_column)
            );
    END;

    -- Validate schemas match
    SELECT bool_and(is_valid) INTO v_schema_valid
    FROM validate_table_schemas(
        p_staging_schema, p_staging_table,
        p_target_schema, p_target_table
    );

    RETURN QUERY SELECT 
        'schema_validation'::TEXT,
        CASE WHEN v_schema_valid THEN 'OK' ELSE 'ERROR' END::TEXT,
        jsonb_build_object('schemas_match', v_schema_valid);

    -- Check for null values in staging columns that must be NOT NULL in the
    -- target. This needs dynamic SQL per column, so loop rather than relying
    -- on a static query against information_schema.
    DECLARE
        r RECORD;
        v_null_count BIGINT;
    BEGIN
        FOR r IN
            SELECT c.column_name
            FROM information_schema.columns c
            JOIN information_schema.columns s
              ON s.table_schema = p_staging_schema
             AND s.table_name = p_staging_table
             AND s.column_name = c.column_name
            WHERE c.table_schema = p_target_schema
              AND c.table_name = p_target_table
              AND c.is_nullable = 'NO'
        LOOP
            EXECUTE format(
                'SELECT COUNT(*) FROM %I.%I WHERE %I IS NULL',
                p_staging_schema, p_staging_table, r.column_name
            ) INTO v_null_count;

            RETURN QUERY SELECT
                ('null_check_' || r.column_name)::TEXT,
                CASE WHEN v_null_count > 0 THEN 'ERROR' ELSE 'OK' END::TEXT,
                jsonb_build_object('null_count', v_null_count);
        END LOOP;
    END;
END;
$$ LANGUAGE plpgsql;

-- Usage
SELECT * FROM generate_migration_report(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_date', '2024-01'
);

5.3 Partition Migration

-- Migrate data from staging table to main table

CREATE OR REPLACE PROCEDURE migrate_partition_data(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT,
    p_partition_column TEXT,
    p_partition_value TEXT,
    p_delete_existing BOOLEAN DEFAULT FALSE,
    p_batch_size INTEGER DEFAULT 50000
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_columns TEXT;
    v_total_migrated BIGINT := 0;
    v_batch_migrated BIGINT;
    v_validation_passed BOOLEAN;
BEGIN
    -- Validate schemas match
    SELECT bool_and(is_valid) INTO v_validation_passed
    FROM validate_table_schemas(
        p_staging_schema, p_staging_table,
        p_target_schema, p_target_table
    );

    IF NOT v_validation_passed THEN
        RAISE EXCEPTION 'Schema validation failed. Run validate_table_schemas() for details.';
    END IF;

    -- Build column list
    SELECT string_agg(quote_ident(column_name), ', ' ORDER BY ordinal_position)
    INTO v_columns
    FROM information_schema.columns
    WHERE table_schema = p_staging_schema
      AND table_name = p_staging_table;

    -- Delete existing data if requested
    IF p_delete_existing THEN
        EXECUTE format(
            'DELETE FROM %I.%I WHERE %I = $1',
            p_target_schema, p_target_table, p_partition_column
        ) USING p_partition_value;

        RAISE NOTICE 'Deleted existing data for partition % = %', 
            p_partition_column, p_partition_value;
    END IF;

    -- Migrate in batches. Each batch moves rows out of the staging table and
    -- into the target, so every committed batch makes durable forward progress
    -- and the procedure can simply be re-run after an interruption.
    LOOP
        EXECUTE format($sql$
            WITH to_migrate AS (
                SELECT ctid
                FROM %I.%I
                LIMIT $1
            ),
            moved AS (
                DELETE FROM %I.%I s
                WHERE s.ctid IN (SELECT ctid FROM to_migrate)
                RETURNING s.*
            ),
            inserted AS (
                INSERT INTO %I.%I (%s)
                SELECT %s
                FROM moved
                RETURNING 1
            )
            SELECT COUNT(*) FROM inserted
        $sql$,
            p_staging_schema, p_staging_table,
            p_staging_schema, p_staging_table,
            p_target_schema, p_target_table, v_columns,
            v_columns
        ) INTO v_batch_migrated USING p_batch_size;

        v_total_migrated := v_total_migrated + v_batch_migrated;

        IF v_batch_migrated = 0 THEN
            EXIT;
        END IF;

        RAISE NOTICE 'Migrated batch: % records (total: %)', v_batch_migrated, v_total_migrated;
        COMMIT;
    END LOOP;

    RAISE NOTICE 'Migration complete. Total records migrated: %', v_total_migrated;
END;
$$;

-- Usage
CALL migrate_partition_data(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_date', '2024-01',
    TRUE,   -- delete existing
    50000   -- batch size
);

5.4 Attach Partition (for Partitioned Tables)

-- For natively partitioned tables, attach the staging table as a partition

CREATE OR REPLACE PROCEDURE attach_restored_partition(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_target_schema TEXT,
    p_target_table TEXT,
    p_partition_column TEXT,
    p_partition_start TEXT,
    p_partition_end TEXT
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_partition_name TEXT;
    v_constraint_name TEXT;
BEGIN
    -- Validate schemas match
    IF NOT (
        SELECT bool_and(is_valid)
        FROM validate_table_schemas(
            p_staging_schema, p_staging_table,
            p_target_schema, p_target_table
        )
    ) THEN
        RAISE EXCEPTION 'Schema validation failed';
    END IF;

    -- Add a constraint matching the partition bounds. NOT VALID keeps the
    -- ALTER fast; the subsequent VALIDATE scans with a weaker lock, and a
    -- validated constraint lets ATTACH PARTITION skip its own full scan.
    v_constraint_name := p_staging_table || '_partition_check';

    EXECUTE format($sql$
        ALTER TABLE %I.%I 
        ADD CONSTRAINT %I 
        CHECK (%I >= %L AND %I < %L)
        NOT VALID
    $sql$,
        p_staging_schema, p_staging_table,
        v_constraint_name,
        p_partition_column, p_partition_start,
        p_partition_column, p_partition_end
    );

    -- Validate constraint without locking
    EXECUTE format($sql$
        ALTER TABLE %I.%I 
        VALIDATE CONSTRAINT %I
    $sql$,
        p_staging_schema, p_staging_table,
        v_constraint_name
    );

    -- Detach the old partition if it exists, and rename it out of the way so
    -- the restored staging table can take its name without a collision
    v_partition_name := p_target_table || '_' || replace(p_partition_start, '-', '_');

    BEGIN
        EXECUTE format($sql$
            ALTER TABLE %I.%I 
            DETACH PARTITION %I.%I
        $sql$,
            p_target_schema, p_target_table,
            p_target_schema, v_partition_name
        );
        EXECUTE format($sql$
            ALTER TABLE %I.%I RENAME TO %I
        $sql$,
            p_target_schema, v_partition_name,
            v_partition_name || '_detached'
        );
        RAISE NOTICE 'Detached existing partition % (renamed to %_detached)',
            v_partition_name, v_partition_name;
    EXCEPTION WHEN undefined_table THEN
        RAISE NOTICE 'No existing partition to detach';
    END;

    -- Rename staging table to partition name
    EXECUTE format($sql$
        ALTER TABLE %I.%I RENAME TO %I
    $sql$,
        p_staging_schema, p_staging_table,
        v_partition_name
    );

    -- Attach as partition
    EXECUTE format($sql$
        ALTER TABLE %I.%I 
        ATTACH PARTITION %I.%I 
        FOR VALUES FROM (%L) TO (%L)
    $sql$,
        p_target_schema, p_target_table,
        p_staging_schema, v_partition_name,
        p_partition_start, p_partition_end
    );

    RAISE NOTICE 'Successfully attached partition % to %', 
        v_partition_name, p_target_table;
END;
$$;

-- Usage for range partitioned table
CALL attach_restored_partition(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_date',
    '2024-01-01', '2024-02-01'
);

5.5 Cleanup Script

-- Clean up after successful migration

CREATE OR REPLACE PROCEDURE cleanup_after_migration(
    p_staging_schema TEXT,
    p_staging_table TEXT,
    p_verify_target_schema TEXT DEFAULT NULL,
    p_verify_target_table TEXT DEFAULT NULL,
    p_verify_count BOOLEAN DEFAULT TRUE
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_staging_count BIGINT;
    v_target_count BIGINT;
BEGIN
    IF p_verify_count AND p_verify_target_schema IS NOT NULL THEN
        EXECUTE format(
            'SELECT COUNT(*) FROM %I.%I',
            p_staging_schema, p_staging_table
        ) INTO v_staging_count;

        EXECUTE format(
            'SELECT COUNT(*) FROM %I.%I',
            p_verify_target_schema, p_verify_target_table
        ) INTO v_target_count;

        IF v_target_count < v_staging_count THEN
            RAISE WARNING 'Target count (%) is less than staging count (%). Migration may be incomplete.',
                v_target_count, v_staging_count;
            RETURN;
        END IF;
    END IF;

    EXECUTE format(
        'DROP TABLE IF EXISTS %I.%I',
        p_staging_schema, p_staging_table
    );

    RAISE NOTICE 'Dropped staging table %.%', p_staging_schema, p_staging_table;
END;
$$;

-- Usage
CALL cleanup_after_migration(
    'public', 'transactions_staging',
    'public', 'transactions',
    TRUE
);

6. Spring Boot Query API

This API allows querying archived data directly from S3 without restoring to the database.

6.1 Project Structure

src/main/java/com/example/archivequery/
    ArchiveQueryApplication.java
    config/
        AwsConfig.java
        ParquetConfig.java
    controller/
        ArchiveQueryController.java
    service/
        ParquetQueryService.java
        PredicateParser.java
    model/
        QueryRequest.java
        QueryResponse.java
        ArchiveMetadata.java
    predicate/
        Predicate.java
        ComparisonPredicate.java
        LogicalPredicate.java
        PredicateEvaluator.java

6.2 Application Configuration

// ArchiveQueryApplication.java
package com.example.archivequery;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class ArchiveQueryApplication {
    public static void main(String[] args) {
        SpringApplication.run(ArchiveQueryApplication.class, args);
    }
}

6.3 AWS Configuration

// config/AwsConfig.java
package com.example.archivequery.config;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

@Configuration
public class AwsConfig {

    @Value("${aws.region:eu-west-1}")
    private String region;

    @Bean
    public S3Client s3Client() {
        return S3Client.builder()
            .region(Region.of(region))
            .credentialsProvider(DefaultCredentialsProvider.create())
            .build();
    }
}

6.4 Parquet Configuration

// config/ParquetConfig.java
package com.example.archivequery.config;

import org.apache.hadoop.conf.Configuration;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

@org.springframework.context.annotation.Configuration
public class ParquetConfig {

    @Value("${aws.region:eu-west-1}")
    private String region;

    @Bean
    public Configuration hadoopConfiguration() {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        conf.set("fs.s3a.aws.credentials.provider", 
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain");
        conf.set("fs.s3a.endpoint.region", region);
        return conf;
    }
}

6.5 Model Classes

// model/QueryRequest.java
package com.example.archivequery.model;

import java.util.List;
import java.util.Map;

public record QueryRequest(
    String archivePath,
    List<String> columns,
    Map<String, Object> filters,
    String filterExpression,
    Integer limit,
    Integer offset
) {
    public QueryRequest {
        if (limit == null) limit = 1000;
        if (offset == null) offset = 0;
    }
}
// model/QueryResponse.java
package com.example.archivequery.model;

import java.util.List;
import java.util.Map;

public record QueryResponse(
    List<Map<String, Object>> data,
    long totalMatched,
    long totalScanned,
    long executionTimeMs,
    Map<String, String> schema,
    String archivePath
) {}
// model/ArchiveMetadata.java
package com.example.archivequery.model;

import java.time.Instant;
import java.util.List;
import java.util.Map;

public record ArchiveMetadata(
    String tableUuid,
    String location,
    List<ColumnDefinition> columns,
    Map<String, String> properties,
    Instant archivedAt
) {
    public record ColumnDefinition(
        int id,
        String name,
        String type,
        boolean required
    ) {}
}

6.6 Predicate Classes

// predicate/Predicate.java
package com.example.archivequery.predicate;

import java.util.Map;

public sealed interface Predicate 
    permits ComparisonPredicate, LogicalPredicate {

    boolean evaluate(Map<String, Object> record);
}
// predicate/ComparisonPredicate.java
package com.example.archivequery.predicate;

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Map;
import java.util.Objects;

public record ComparisonPredicate(
    String column,
    Operator operator,
    Object value
) implements Predicate {

    public enum Operator {
        EQ, NE, GT, GE, LT, LE, LIKE, IN, IS_NULL, IS_NOT_NULL
    }

    @Override
    public boolean evaluate(Map<String, Object> record) {
        Object recordValue = record.get(column);

        return switch (operator) {
            case IS_NULL -> recordValue == null;
            case IS_NOT_NULL -> recordValue != null;
            case EQ -> Objects.equals(recordValue, convertValue(recordValue));
            case NE -> !Objects.equals(recordValue, convertValue(recordValue));
            case GT -> compare(recordValue, convertValue(recordValue)) > 0;
            case GE -> compare(recordValue, convertValue(recordValue)) >= 0;
            case LT -> compare(recordValue, convertValue(recordValue)) < 0;
            case LE -> compare(recordValue, convertValue(recordValue)) <= 0;
            case LIKE -> matchesLike(recordValue);
            case IN -> matchesIn(recordValue);
        };
    }

    // Coerces the predicate's configured value to the runtime type of the
    // record value so that equals() and compareTo() compare like with like
    private Object convertValue(Object recordValue) {
        if (recordValue == null || value == null) return value;

        if (recordValue instanceof LocalDate && value instanceof String s) {
            return LocalDate.parse(s);
        }
        if (recordValue instanceof LocalDateTime && value instanceof String s) {
            return LocalDateTime.parse(s);
        }
        if (recordValue instanceof BigDecimal && value instanceof Number n) {
            return new BigDecimal(n.toString());
        }
        if (recordValue instanceof Long && value instanceof Number n) {
            return n.longValue();
        }
        if (recordValue instanceof Integer && value instanceof Number n) {
            return n.intValue();
        }
        if (recordValue instanceof Double && value instanceof Number n) {
            return n.doubleValue();
        }

        return value;
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    private int compare(Object a, Object b) {
        if (a == null || b == null) return 0;
        if (a instanceof Comparable ca && b instanceof Comparable cb) {
            return ca.compareTo(cb);
        }
        return 0;
    }

    private boolean matchesLike(Object recordValue) {
        if (recordValue == null || value == null) return false;
        String pattern = value.toString()
            .replace("%", ".*")
            .replace("_", ".");
        return recordValue.toString().matches(pattern);
    }

    private boolean matchesIn(Object recordValue) {
        if (recordValue == null || value == null) return false;
        if (value instanceof Iterable<?> iterable) {
            for (Object item : iterable) {
                if (Objects.equals(recordValue, item)) return true;
            }
        }
        return false;
    }
}
// predicate/LogicalPredicate.java
package com.example.archivequery.predicate;

import java.util.List;
import java.util.Map;

public record LogicalPredicate(
    Operator operator,
    List<Predicate> predicates
) implements Predicate {

    public enum Operator {
        AND, OR, NOT
    }

    @Override
    public boolean evaluate(Map<String, Object> record) {
        return switch (operator) {
            case AND -> predicates.stream().allMatch(p -> p.evaluate(record));
            case OR -> predicates.stream().anyMatch(p -> p.evaluate(record));
            case NOT -> !predicates.getFirst().evaluate(record);
        };
    }
}
// predicate/PredicateEvaluator.java
package com.example.archivequery.predicate;

import java.util.Map;

public class PredicateEvaluator {

    private final Predicate predicate;

    public PredicateEvaluator(Predicate predicate) {
        this.predicate = predicate;
    }

    public boolean matches(Map<String, Object> record) {
        if (predicate == null) return true;
        return predicate.evaluate(record);
    }
}

6.7 Predicate Parser

// service/PredicateParser.java
package com.example.archivequery.service;

import com.example.archivequery.predicate.*;
import org.springframework.stereotype.Service;

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class PredicateParser {

    private static final Pattern COMPARISON_PATTERN = Pattern.compile(
        "(\\w+)\\s*(=|!=|<>|>=|<=|>|<|LIKE|IN|IS NULL|IS NOT NULL)\\s*(.*)?"
    );

    public Predicate parse(String expression) {
        if (expression == null || expression.isBlank()) {
            return null;
        }
        return parseExpression(expression.trim());
    }

    public Predicate fromFilters(Map<String, Object> filters) {
        if (filters == null || filters.isEmpty()) {
            return null;
        }

        List<Predicate> predicates = new ArrayList<>();

        for (Map.Entry<String, Object> entry : filters.entrySet()) {
            String key = entry.getKey();
            Object value = entry.getValue();

            if (key.endsWith("_gt")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.GT,
                    value
                ));
            } else if (key.endsWith("_gte")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 4),
                    ComparisonPredicate.Operator.GE,
                    value
                ));
            } else if (key.endsWith("_lt")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.LT,
                    value
                ));
            } else if (key.endsWith("_lte")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 4),
                    ComparisonPredicate.Operator.LE,
                    value
                ));
            } else if (key.endsWith("_ne")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.NE,
                    value
                ));
            } else if (key.endsWith("_like")) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 5),
                    ComparisonPredicate.Operator.LIKE,
                    value
                ));
            } else if (key.endsWith("_in") && value instanceof List<?> list) {
                predicates.add(new ComparisonPredicate(
                    key.substring(0, key.length() - 3),
                    ComparisonPredicate.Operator.IN,
                    list
                ));
            } else {
                predicates.add(new ComparisonPredicate(
                    key,
                    ComparisonPredicate.Operator.EQ,
                    value
                ));
            }
        }

        if (predicates.size() == 1) {
            return predicates.getFirst();
        }

        return new LogicalPredicate(LogicalPredicate.Operator.AND, predicates);
    }

    private Predicate parseExpression(String expression) {
        String upper = expression.toUpperCase();

        int andIndex = findLogicalOperator(upper, " AND ");
        if (andIndex > 0) {
            String left = expression.substring(0, andIndex);
            String right = expression.substring(andIndex + 5);
            return new LogicalPredicate(
                LogicalPredicate.Operator.AND,
                List.of(parseExpression(left), parseExpression(right))
            );
        }

        int orIndex = findLogicalOperator(upper, " OR ");
        if (orIndex > 0) {
            String left = expression.substring(0, orIndex);
            String right = expression.substring(orIndex + 4);
            return new LogicalPredicate(
                LogicalPredicate.Operator.OR,
                List.of(parseExpression(left), parseExpression(right))
            );
        }

        if (upper.startsWith("NOT ")) {
            return new LogicalPredicate(
                LogicalPredicate.Operator.NOT,
                List.of(parseExpression(expression.substring(4)))
            );
        }

        if (expression.startsWith("(") && expression.endsWith(")")) {
            return parseExpression(expression.substring(1, expression.length() - 1));
        }

        return parseComparison(expression);
    }

    private int findLogicalOperator(String expression, String operator) {
        int depth = 0;
        int index = 0;

        while (index < expression.length()) {
            char c = expression.charAt(index);
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (depth == 0 && expression.substring(index).startsWith(operator)) {
                return index;
            }
            index++;
        }
        return -1;
    }

    private Predicate parseComparison(String expression) {
        Matcher matcher = COMPARISON_PATTERN.matcher(expression.trim());

        if (!matcher.matches()) {
            throw new IllegalArgumentException("Invalid expression: " + expression);
        }

        String column = matcher.group(1);
        String operatorStr = matcher.group(2).toUpperCase();
        String valueStr = matcher.group(3);

        ComparisonPredicate.Operator operator = switch (operatorStr) {
            case "=" -> ComparisonPredicate.Operator.EQ;
            case "!=", "<>" -> ComparisonPredicate.Operator.NE;
            case ">" -> ComparisonPredicate.Operator.GT;
            case ">=" -> ComparisonPredicate.Operator.GE;
            case "<" -> ComparisonPredicate.Operator.LT;
            case "<=" -> ComparisonPredicate.Operator.LE;
            case "LIKE" -> ComparisonPredicate.Operator.LIKE;
            case "IN" -> ComparisonPredicate.Operator.IN;
            case "IS NULL" -> ComparisonPredicate.Operator.IS_NULL;
            case "IS NOT NULL" -> ComparisonPredicate.Operator.IS_NOT_NULL;
            default -> throw new IllegalArgumentException("Unknown operator: " + operatorStr);
        };

        Object value = parseValue(valueStr, operator);

        return new ComparisonPredicate(column, operator, value);
    }

    private Object parseValue(String valueStr, ComparisonPredicate.Operator operator) {
        if (operator == ComparisonPredicate.Operator.IS_NULL || 
            operator == ComparisonPredicate.Operator.IS_NOT_NULL) {
            return null;
        }

        if (valueStr == null || valueStr.isBlank()) {
            return null;
        }

        valueStr = valueStr.trim();

        if (valueStr.startsWith("'") && valueStr.endsWith("'")) {
            return valueStr.substring(1, valueStr.length() - 1);
        }

        if (operator == ComparisonPredicate.Operator.IN) {
            if (valueStr.startsWith("(") && valueStr.endsWith(")")) {
                valueStr = valueStr.substring(1, valueStr.length() - 1);
            }
            return Arrays.stream(valueStr.split(","))
                .map(String::trim)
                .map(s -> s.startsWith("'") ? s.substring(1, s.length() - 1) : parseNumber(s))
                .toList();
        }

        return parseNumber(valueStr);
    }

    private Object parseNumber(String s) {
        try {
            if (s.contains(".")) {
                return Double.parseDouble(s);
            }
            return Long.parseLong(s);
        } catch (NumberFormatException e) {
            return s;
        }
    }
}

6.8 Parquet Query Service

// service/ParquetQueryService.java
package com.example.archivequery.service;

import com.example.archivequery.model.*;
import com.example.archivequery.predicate.*;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.avro.generic.GenericRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

import java.io.IOException;
import java.net.URI;
import java.time.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Collectors;

@Service
public class ParquetQueryService {

    private static final Logger log = LoggerFactory.getLogger(ParquetQueryService.class);

    private final S3Client s3Client;
    private final Configuration hadoopConfig;
    private final PredicateParser predicateParser;
    private final ObjectMapper objectMapper;
    private final ExecutorService executorService;

    public ParquetQueryService(
        S3Client s3Client,
        Configuration hadoopConfig,
        PredicateParser predicateParser
    ) {
        this.s3Client = s3Client;
        this.hadoopConfig = hadoopConfig;
        this.predicateParser = predicateParser;
        this.objectMapper = new ObjectMapper();
        this.executorService = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()
        );
    }

    public QueryResponse query(QueryRequest request) {
        long startTime = System.currentTimeMillis();

        Predicate predicate = buildPredicate(request);
        PredicateEvaluator evaluator = new PredicateEvaluator(predicate);

        List<String> dataFiles = listDataFiles(request.archivePath());

        List<Map<String, Object>> allResults = Collections.synchronizedList(new ArrayList<>());
        long[] counters = new long[2];

        List<Future<?>> futures = new ArrayList<>();

        for (String dataFile : dataFiles) {
            futures.add(executorService.submit(() -> {
                try {
                    QueryFileResult result = queryFile(
                        dataFile, 
                        request.columns(), 
                        evaluator
                    );
                    synchronized (counters) {
                        counters[0] += result.matched();
                        counters[1] += result.scanned();
                    }
                    allResults.addAll(result.records());
                } catch (IOException e) {
                    log.error("Error reading file: " + dataFile, e);
                }
            }));
        }

        for (Future<?> future : futures) {
            try {
                future.get();
            } catch (InterruptedException | ExecutionException e) {
                log.error("Error waiting for query completion", e);
            }
        }

        List<Map<String, Object>> paginatedResults = allResults.stream()
            .skip(request.offset())
            .limit(request.limit())
            .collect(Collectors.toList());

        Map<String, String> schema = extractSchema(request.archivePath());

        long executionTime = System.currentTimeMillis() - startTime;

        return new QueryResponse(
            paginatedResults,
            counters[0],
            counters[1],
            executionTime,
            schema,
            request.archivePath()
        );
    }

    public ArchiveMetadata getMetadata(String archivePath) {
        try {
            URI uri = URI.create(archivePath);
            String bucket = uri.getHost();
            String prefix = uri.getPath().substring(1);

            GetObjectRequest getRequest = GetObjectRequest.builder()
                .bucket(bucket)
                .key(prefix + "/metadata/v1.metadata.json")
                .build();

            byte[] content = s3Client.getObjectAsBytes(getRequest).asByteArray();
            Map<String, Object> metadata = objectMapper.readValue(content, Map.class);

            List<Map<String, Object>> schemas = (List<Map<String, Object>>) metadata.get("schemas");
            Map<String, Object> currentSchema = schemas.getFirst();
            List<Map<String, Object>> fields = (List<Map<String, Object>>) currentSchema.get("fields");

            List<ArchiveMetadata.ColumnDefinition> columns = fields.stream()
                .map(f -> new ArchiveMetadata.ColumnDefinition(
                    ((Number) f.get("id")).intValue(),
                    (String) f.get("name"),
                    (String) f.get("type"),
                    (Boolean) f.get("required")
                ))
                .toList();

            Map<String, String> properties = (Map<String, String>) metadata.get("properties");

            String archivedAtStr = properties.get("archive.timestamp");
            Instant archivedAt = archivedAtStr != null ? 
                Instant.parse(archivedAtStr) : Instant.now();

            return new ArchiveMetadata(
                (String) metadata.get("table_uuid"),
                (String) metadata.get("location"),
                columns,
                properties,
                archivedAt
            );
        } catch (Exception e) {
            throw new RuntimeException("Failed to read archive metadata", e);
        }
    }

    private Predicate buildPredicate(QueryRequest request) {
        Predicate filterPredicate = predicateParser.fromFilters(request.filters());
        Predicate expressionPredicate = predicateParser.parse(request.filterExpression());

        if (filterPredicate == null) return expressionPredicate;
        if (expressionPredicate == null) return filterPredicate;

        return new LogicalPredicate(
            LogicalPredicate.Operator.AND,
            List.of(filterPredicate, expressionPredicate)
        );
    }

    private List<String> listDataFiles(String archivePath) {
        URI uri = URI.create(archivePath);
        String bucket = uri.getHost();
        String prefix = uri.getPath().substring(1) + "/data/";

        ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
            .bucket(bucket)
            .prefix(prefix)
            .build();

        List<String> files = new ArrayList<>();
        ListObjectsV2Response response;

        do {
            response = s3Client.listObjectsV2(listRequest);

            for (S3Object obj : response.contents()) {
                if (obj.key().endsWith(".parquet")) {
                    files.add("s3a://" + bucket + "/" + obj.key());
                }
            }

            listRequest = listRequest.toBuilder()
                .continuationToken(response.nextContinuationToken())
                .build();

        } while (response.isTruncated());

        return files;
    }

    private record QueryFileResult(
        List<Map<String, Object>> records,
        long matched,
        long scanned
    ) {}

    private QueryFileResult queryFile(
        String filePath,
        List<String> columns,
        PredicateEvaluator evaluator
    ) throws IOException {
        List<Map<String, Object>> results = new ArrayList<>();
        long matched = 0;
        long scanned = 0;

        Path path = new Path(filePath);
        HadoopInputFile inputFile = HadoopInputFile.fromPath(path, hadoopConfig);

        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(inputFile)
                .withConf(hadoopConfig)
                .build()) {

            GenericRecord record;
            while ((record = reader.read()) != null) {
                scanned++;
                Map<String, Object> recordMap = recordToMap(record);

                if (evaluator.matches(recordMap)) {
                    matched++;

                    if (columns != null && !columns.isEmpty()) {
                        Map<String, Object> projected = new LinkedHashMap<>();
                        for (String col : columns) {
                            if (recordMap.containsKey(col)) {
                                projected.put(col, recordMap.get(col));
                            }
                        }
                        results.add(projected);
                    } else {
                        results.add(recordMap);
                    }
                }
            }
        }

        return new QueryFileResult(results, matched, scanned);
    }

    private Map<String, Object> recordToMap(GenericRecord record) {
        Map<String, Object> map = new LinkedHashMap<>();

        for (org.apache.avro.Schema.Field field : record.getSchema().getFields()) {
            String name = field.name();
            Object value = record.get(name);
            map.put(name, convertValue(value, field.schema()));
        }

        return map;
    }

    private Object convertValue(Object value, org.apache.avro.Schema schema) {
        if (value == null) return null;

        // Nullable columns are modelled as a union of ["null", <type>];
        // unwrap to the concrete branch so logical types can be inspected
        if (schema != null && schema.getType() == org.apache.avro.Schema.Type.UNION) {
            for (org.apache.avro.Schema branch : schema.getTypes()) {
                if (branch.getType() != org.apache.avro.Schema.Type.NULL) {
                    schema = branch;
                    break;
                }
            }
        }

        if (value instanceof org.apache.avro.util.Utf8) {
            return value.toString();
        }

        if (value instanceof GenericRecord nested) {
            return recordToMap(nested);
        }

        if (value instanceof java.nio.ByteBuffer buffer) {
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return Base64.getEncoder().encodeToString(bytes);
        }

        // Use the Avro logical type rather than guessing from the raw value,
        // so plain integer and long columns are returned unchanged
        org.apache.avro.LogicalType logicalType =
            schema != null ? schema.getLogicalType() : null;

        if (value instanceof Integer i && logicalType instanceof org.apache.avro.LogicalTypes.Date) {
            return LocalDate.ofEpochDay(i);
        }

        if (value instanceof Long l) {
            if (logicalType instanceof org.apache.avro.LogicalTypes.TimestampMicros) {
                return LocalDateTime.ofInstant(Instant.ofEpochMilli(l / 1000), ZoneOffset.UTC);
            }
            if (logicalType instanceof org.apache.avro.LogicalTypes.TimestampMillis) {
                return LocalDateTime.ofInstant(Instant.ofEpochMilli(l), ZoneOffset.UTC);
            }
        }

        return value;
    }

    private Map<String, String> extractSchema(String archivePath) {
        try {
            ArchiveMetadata metadata = getMetadata(archivePath);
            return metadata.columns().stream()
                .collect(Collectors.toMap(
                    ArchiveMetadata.ColumnDefinition::name,
                    ArchiveMetadata.ColumnDefinition::type,
                    (a, b) -> a,
                    LinkedHashMap::new
                ));
        } catch (Exception e) {
            log.warn("Failed to extract schema", e);
            return Collections.emptyMap();
        }
    }
}

6.9 REST Controller

// controller/ArchiveQueryController.java
package com.example.archivequery.controller;

import com.example.archivequery.model.*;
import com.example.archivequery.service.ParquetQueryService;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.util.List;
import java.util.Map;

@RestController
@RequestMapping("/api/v1/archives")
public class ArchiveQueryController {

    private final ParquetQueryService queryService;

    public ArchiveQueryController(ParquetQueryService queryService) {
        this.queryService = queryService;
    }

    @PostMapping("/query")
    public ResponseEntity<QueryResponse> query(@RequestBody QueryRequest request) {
        QueryResponse response = queryService.query(request);
        return ResponseEntity.ok(response);
    }

    @GetMapping("/query")
    public ResponseEntity<QueryResponse> queryGet(
        @RequestParam String archivePath,
        @RequestParam(required = false) List<String> columns,
        @RequestParam(required = false) String filter,
        @RequestParam(defaultValue = "1000") Integer limit,
        @RequestParam(defaultValue = "0") Integer offset
    ) {
        QueryRequest request = new QueryRequest(
            archivePath,
            columns,
            null,
            filter,
            limit,
            offset
        );

        QueryResponse response = queryService.query(request);
        return ResponseEntity.ok(response);
    }

    @GetMapping("/metadata")
    public ResponseEntity<ArchiveMetadata> getMetadata(
        @RequestParam String archivePath
    ) {
        ArchiveMetadata metadata = queryService.getMetadata(archivePath);
        return ResponseEntity.ok(metadata);
    }

    @GetMapping("/schema/{schema}/{table}/{partitionValue}")
    public ResponseEntity<ArchiveMetadata> getMetadataByTable(
        @PathVariable String schema,
        @PathVariable String table,
        @PathVariable String partitionValue,
        @RequestParam String bucket,
        @RequestParam(defaultValue = "aurora-archives") String prefix
    ) {
        String archivePath = String.format(
            "s3://%s/%s/%s/%s/%s",
            bucket, prefix, schema, table, partitionValue
        );

        ArchiveMetadata metadata = queryService.getMetadata(archivePath);
        return ResponseEntity.ok(metadata);
    }

    @PostMapping("/query/{schema}/{table}/{partitionValue}")
    public ResponseEntity<QueryResponse> queryByTable(
        @PathVariable String schema,
        @PathVariable String table,
        @PathVariable String partitionValue,
        @RequestParam String bucket,
        @RequestParam(defaultValue = "aurora-archives") String prefix,
        @RequestBody Map<String, Object> requestBody
    ) {
        String archivePath = String.format(
            "s3://%s/%s/%s/%s/%s",
            bucket, prefix, schema, table, partitionValue
        );

        List<String> columns = requestBody.containsKey("columns") ?
            (List<String>) requestBody.get("columns") : null;

        Map<String, Object> filters = requestBody.containsKey("filters") ?
            (Map<String, Object>) requestBody.get("filters") : null;

        String filterExpression = requestBody.containsKey("filterExpression") ?
            (String) requestBody.get("filterExpression") : null;

        Integer limit = requestBody.containsKey("limit") ?
            ((Number) requestBody.get("limit")).intValue() : 1000;

        Integer offset = requestBody.containsKey("offset") ?
            ((Number) requestBody.get("offset")).intValue() : 0;

        QueryRequest request = new QueryRequest(
            archivePath,
            columns,
            filters,
            filterExpression,
            limit,
            offset
        );

        QueryResponse response = queryService.query(request);
        return ResponseEntity.ok(response);
    }
}

6.10 Application Properties

# application.yml
server:
  port: 8080

aws:
  region: ${AWS_REGION:eu-west-1}

spring:
  application:
    name: archive-query-service

logging:
  level:
    com.example.archivequery: DEBUG
    org.apache.parquet: WARN
    org.apache.hadoop: WARN

7. Usage Examples

7.1 Archive a Partition

# Set environment variables
export AURORA_HOST=your-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com
export AURORA_DATABASE=production
export AURORA_USER=admin
export AURORA_PASSWORD=your-password
export ARCHIVE_S3_BUCKET=my-archive-bucket
export ARCHIVE_S3_PREFIX=aurora-archives
export AWS_REGION=eu-west-1

# Archive January 2024 transactions
python archive_partition.py \
    --table transactions \
    --partition-column transaction_month \
    --partition-value 2024-01 \
    --schema public \
    --batch-size 100000

7.2 Query Archived Data via API

# Get archive metadata
curl -X GET "http://localhost:8080/api/v1/archives/metadata?archivePath=s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01"

# Query with filters (POST)
curl -X POST "http://localhost:8080/api/v1/archives/query" \
    -H "Content-Type: application/json" \
    -d '{
        "archivePath": "s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01",
        "columns": ["transaction_id", "customer_id", "amount", "transaction_date"],
        "filters": {
            "amount_gte": 1000,
            "status": "completed"
        },
        "limit": 100,
        "offset": 0
    }'

# Query with expression filter (GET)
curl -X GET "http://localhost:8080/api/v1/archives/query" \
    --data-urlencode "archivePath=s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01" \
    --data-urlencode "filter=amount >= 1000 AND status = 'completed'" \
    --data-urlencode "columns=transaction_id,customer_id,amount" \
    --data-urlencode "limit=50"

7.3 Restore Archived Data

# Restore to staging table
python restore_partition.py \
    --source-path s3://my-archive-bucket/aurora-archives/public/transactions/transaction_month=2024-01 \
    --target-table transactions_staging \
    --target-schema public \
    --batch-size 10000

7.4 Migrate to Main Table

-- Validate before migration
SELECT * FROM generate_migration_report(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_month', '2024-01'
);

-- Migrate data
CALL migrate_partition_data(
    'public', 'transactions_staging',
    'public', 'transactions',
    'transaction_month', '2024-01',
    TRUE,    -- delete existing data in partition
    50000    -- batch size
);

-- Clean up staging table
CALL cleanup_after_migration(
    'public', 'transactions_staging',
    'public', 'transactions',
    TRUE     -- verify counts
);

8. Operational Considerations

8.1 Cost Analysis

The cost savings from this approach are significant:

| Storage Type | Monthly Cost per TB |
| --- | --- |
| Aurora PostgreSQL | $230 |
| S3 Standard | $23 |
| S3 Intelligent Tiering | $23 (hot) to $4 (archive) |
| S3 Glacier Instant Retrieval | $4 |

For a 10TB historical dataset:

  • Aurora: $2,300/month
  • S3 with Parquet (7:1 compression, so roughly 1.4TB stored at $23/TB): ~$33/month
  • Savings: ~98.5%

8.2 Query Performance

The Spring Boot API performance depends on:

  1. Data file size: Smaller files (100MB to 250MB) enable better parallelism
  2. Predicate selectivity: Higher selectivity means fewer records to deserialise
  3. Column projection: Requesting fewer columns reduces I/O
  4. S3 latency: Use same region as the bucket

Expected performance for a 1TB partition (compressed to ~150GB Parquet):

| Query Type | Typical Latency |
| --- | --- |
| Point lookup (selective filter) | 500ms to 2s |
| Range scan (10% selectivity) | 5s to 15s |
| Full scan with aggregation | 30s to 60s |

8.3 Monitoring and Alerting

Expose these query metrics for production use (Micrometer is shown below; it can publish to CloudWatch):

@Component
public class QueryMetrics {

    private final MeterRegistry meterRegistry;

    public QueryMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void recordQuery(QueryResponse response) {
        // Count queries and record end to end latency
        meterRegistry.counter("archive.query.count").increment();

        meterRegistry.timer("archive.query.duration")
            .record(response.executionTimeMs(), TimeUnit.MILLISECONDS);

        // Distribution summaries suit per-query values; a gauge would only
        // ever expose the most recent query's numbers
        meterRegistry.summary("archive.query.records_scanned").record(response.totalScanned());
        meterRegistry.summary("archive.query.records_matched").record(response.totalMatched());
    }
}

9. Conclusion

This solution provides a complete data lifecycle management approach for large Aurora PostgreSQL tables. The archive script efficiently exports partitions to the cost effective Iceberg/Parquet format on S3, while the restore script enables seamless data recovery when needed. The Spring Boot API bridges the gap by allowing direct queries against archived data, eliminating the need for restoration in many analytical scenarios.

Key benefits:

  1. Cost reduction: 90 to 98 percent storage cost savings compared to keeping data in Aurora
  2. Operational flexibility: Query archived data without restoration
  3. Schema preservation: Full schema metadata maintained for reliable restores
  4. Partition management: Clean attach/detach operations for partitioned tables
  5. Efficient filtering: Predicate evaluation and column projection reduce data transfer and processing

The Iceberg format ensures compatibility with the broader data ecosystem, allowing tools like Athena, Spark, and Trino to query the same archived data when needed for more complex analytical workloads.


Why Bigger Banks Were Historically More Fragile and Why Architecture Determines Resilience

1. Size Was Once Mistaken for Stability

For most of modern banking history, stability was assumed to increase with size. The thinking went: the bigger you are, the more resources you can apply to problems. Larger banks had more capital, more infrastructure, and more people. In a pre-cloud world, this assumption appeared reasonable.

In practice, the opposite was often true.

Before cloud computing and elastic infrastructure, the larger a bank became, the more unstable it was under stress and the harder it was to maintain any kind of delivery cadence. Scale amplified fragility. In 2025, architecture (not size) has become the primary determinant of banking stability.

2. Scale, Fragility, and Quantum Entanglement

Traditional banking platforms were built on vertically scaled systems: mainframes, monolithic databases, and tightly coupled integration layers. These systems were engineered for control and predictability, not for elasticity or independent change.

As banks grew, they didn’t just add clients. They added products. Each new product introduced new dependencies, shared data models, synchronous calls, and operational assumptions. Over time, this created a state best described as quantum entanglement.

In this context, quantum entanglement refers to systems where:

  • Products cannot change independently
  • A change in one area unpredictably affects others
  • The full impact of change only appears under real load
  • Cause and effect are separated by time, traffic, and failure conditions

The larger the number of interdependent products, the more entangled the system becomes.

2.1 Why Entanglement Reduces Stability

As quantum entanglement increases, change becomes progressively riskier. Even small modifications require coordination across multiple teams and systems. Release cycles slow and defensive complexity increases.

Recovery also becomes harder. When something breaks, rolling back a single change is rarely sufficient because multiple products may already be in partially failed or inconsistent states.

Fault finding degrades as well. Logs, metrics, and alerts point in multiple directions. Symptoms appear far from root causes, forcing engineers to chase secondary effects rather than underlying faults.

Most importantly, blast radius expands. A fault in one product propagates through shared state and synchronous dependencies, impacting clients who weren’t using the originating product at all.

The paradox is that the very success of large banks (broad product portfolios) becomes a direct contributor to instability.

3. Why Scale Reduced Stability in the Pre-Cloud Era

Before cloud computing, capacity was finite, expensive, and slow to change. Systems scaled vertically, and failure domains were large by design.

As transaction volumes and product entanglement increased, capacity cliffs became unavoidable. Peak load failures became systemic rather than local. Recovery times lengthened and client impact widened.

Large institutions often appeared stable during normal operation but failed dramatically under stress. Smaller institutions appeared more stable largely because they had fewer entangled products and simpler operational surfaces (not because they were inherently better engineered).

Capitec experienced this first hand when its core banking SQL database hit a capacity cliff in August 2022. Recovering the service required close to 100 changes and resulted in roughly 40 hours of downtime. The wider service recovery took weeks, with missed payments and duplicate payments fixed on a case by case basis. It was at this point that Capitec’s leadership drew a line in the sand and decided to re-engineer the entire stack from the ground up in AWS. This blog post shares a few nuggets from that engineering journey, in the hope of helping others struggling with the burden of scale and hardened synchronous pathways.

4. Cloud Changed the Equation (But Only When Architecture Changed)

Cloud computing made it possible to break entanglement, but only for organisations willing to redesign systems to exploit it.

Horizontal scaling, availability zone isolation, managed databases, and elastic compute allow products to exist as independent domains rather than tightly bound extensions of a central core.

Institutions that merely moved infrastructure to the cloud without breaking product entanglement continue to experience the same instability patterns (only on newer hardware).

5. An Architecture Designed to Avoid Entanglement

Capitec represents a deliberate rejection of quantum entanglement.

Its entire app production stack is cloud native on AWS, Kubernetes, Kafka, and Postgres. The platform is well advanced in rolling out new Java 25 runtimes, alongside ahead of time (AOT) optimisation, to further reduce scaling latency, improve startup characteristics, and increase predictability under load. All Aurora Serverless clusters are set up with read replicas, offloading read pressure from write paths. All workloads are deployed across three availability zones for resilience. Database access goes through the AWS JDBC wrapper, which enables extremely rapid failovers without waiting on DNS TTLs.
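
To make the failover point concrete, here is a minimal, illustrative sketch (not Capitec’s actual configuration) of a JDBC connection through the AWS Advanced JDBC Wrapper with its failover plugin enabled. It assumes the aws-advanced-jdbc-wrapper and PostgreSQL drivers are on the classpath; the cluster endpoint, database name, and credentials are placeholders.

// AuroraConnectionSketch.java (illustrative only)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class AuroraConnectionSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", System.getenv("DB_USER"));
        props.setProperty("password", System.getenv("DB_PASSWORD"));
        // The failover plugin tracks Aurora cluster topology and reconnects to
        // the newly promoted writer directly, rather than waiting for the
        // cluster DNS record (and its TTL) to catch up
        props.setProperty("wrapperPlugins", "failover");

        // The jdbc:aws-wrapper: prefix routes through the wrapper, which
        // delegates to the standard PostgreSQL driver underneath.
        // The endpoint and database name below are placeholders.
        String url = "jdbc:aws-wrapper:postgresql://my-cluster.cluster-xxxx"
            + ".eu-west-1.rds.amazonaws.com:5432/production";

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            rs.next();
            System.out.println("Connected, health check = " + rs.getInt(1));
        }
    }
}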

Crucially, products are isolated by design. There is no central product graph where everything depends on everything else. But, a word of caution: we are not there yet. There will always be edges that can hurt, and when you hit an edge at speed, it is sometimes hard to get back up on your feet. Often the downtime you experience simply creates pent up demand. Put another way, the volume that took your systems offline is now significantly less than the volume waiting for you once you recover. This means you somehow have to add capacity, or optimise code, during an outage in order to recover the service. The rate limiting fan club often puts a foot forward when I discuss burst recoverability. I personally don’t buy this for single entity services, for a complex set of reasons. For someone like AWS, it absolutely makes sense to carry the enormous complexity of guarding services with rate limits. But I don’t believe the same is true for a single entity ecosystem; in those cases, offloading is normally a purer pathway.

6. Write Guarding as a Stability Primitive

Capitec’s mobile and digital platforms employ a deliberate **write guarding** strategy.

Read only operations (such as logging into the app) are explicitly prevented from performing inline write operations. Activities like audit logging, telemetry capture, behavioural flags, and notification triggers are never executed synchronously on high volume read paths.

Instead, these concerns are offloaded asynchronously using Amazon MSK (Managed Streaming for Apache Kafka) or written to in memory data stores such as Valkey, where they can be processed later without impacting the user journey.

This design completely removes read-write contention from critical paths. Authentication storms, balance checks, and session validation no longer compete with persistence workloads. Under load, read performance remains stable because it is not coupled to downstream write capacity.

Critically, write guarding prevents database maintenance pressure (such as vacuum activity) from leaking into high volume events like logins. Expensive background work remains isolated from customer facing read paths.

Write guarding turns one of the most common failure modes in large banking systems (read traffic triggering hidden writes) into a non event. Stability improves not by adding capacity, but by removing unnecessary coupling.
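
To make the pattern concrete, here is a minimal Python sketch of write guarding. It is hypothetical: the topic name, broker address, and the read_profile_from_replica helper are assumptions, and the kafka-python client stands in for whichever MSK client a team actually uses. The point is simply that the read path publishes an event and returns; persistence happens later in a separate consumer.

# Hypothetical write guarding sketch: the login (read) path never performs an
# inline database write; audit/telemetry events are published to Kafka (MSK)
# and persisted asynchronously by a separate consumer.
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["msk-broker-1:9092"],            # assumed MSK bootstrap broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def read_profile_from_replica(user_id: str) -> dict:
    # Stand-in for a read-only query against a read replica (assumption).
    return {"user_id": user_id}

def handle_login(user_id: str) -> dict:
    # Read-only work: session validation and profile lookup against a replica.
    profile = read_profile_from_replica(user_id)

    # Offload the audit/telemetry event; the read path never waits on write capacity.
    producer.send("login-telemetry", {"user_id": user_id, "event": "login"})
    return profile

The design choice is that the producer call is fire and forget from the user journey’s perspective; downstream write pressure (including vacuum activity on the persistence store) cannot back-pressure the login path.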

7. Virtual Threads as a Scalability Primitive

Java 25 ships with mature virtual threading (finalised in Java 21) as a first class concurrency model. This fundamentally changes how high concurrency systems behave under load.

Virtual threads decouple application concurrency from operating system threads. Instead of being constrained by a limited pool of heavyweight threads, services can handle hundreds of thousands of concurrent blocking operations without exhausting resources.

Request handling becomes simpler. Engineers can write straightforward blocking code without introducing thread pool starvation or complex asynchronous control flow.

Tail latency improves under load. When traffic spikes, virtual threads queue cheaply rather than collapsing the system through thread exhaustion.

Failure isolation improves. Slow downstream calls no longer monopolise scarce threads, reducing cascading failure modes.

Operationally, virtual threads align naturally with containerised, autoscaling environments. Concurrency scales with demand, not with preconfigured thread limits.

When combined with modern garbage collectors and ahead of time optimisation, virtual threading removes an entire class of concurrency related instability that plagued earlier JVM based banking platforms.

8. Nimbleness Emerges When Entanglement Disappears

When blast zones and integration choke points disappear, teams regain the ability to move quickly without increasing systemic risk.

Domains communicate through well defined RESTful interfaces, often across separate AWS accounts, enforcing isolation as a first class property. A failure in one domain does not cascade across the organisation.

To keep this operable at scale, Capitec uses Backstage (via an internal overlay called ODIN) as its internal orchestration and developer platform. All AWS accounts, services, pipelines, and operational assets are created to a common standard. Teams consume platform capability rather than inventing infrastructure.

This eliminates configuration drift, reduces cognitive load, and ensures that every new product inherits the same security, observability, and resilience characteristics.

The result is nimbleness without fragility.

9. Operational Stability Is Observability Plus Action

In entangled systems, failures are discovered by clients and stability is measured retrospectively.

Capitec operates differently. End to end observability through Instana and its in house AI platform, Neo, correlates client side errors, network faults, infrastructure signals, and transaction failures in real time. Issues are detected as they emerge, not after they cascade.

This operational awareness allows teams to intervene early, contain issues quickly, and reduce client impact before failures escalate.

Stability, in this model, is not the absence of failure. It is fast detection, rapid containment, and decisive response.

10. Fraud Prevention Without Creating New Entanglement

Fraud is treated as a first class stability concern rather than an external control.

Payments are evaluated inline as they move through the bank. Abnormal velocity, behavioural anomalies, and account provenance are assessed continuously. Even fraud reported in the call centre is immediately visible to other clients paying from the Capitec App. Clients are presented with conscience pricking prompts for high risk payments; these frequently stop fraud as clients abandon the payment when presented with the risks.

Capitec runs a real time malware detection engine directly on client devices. This engine detects hooks and overlays installed by malicious applications. When malware is identified, the client’s account is immediately stopped, preventing fraudulent transactions before they occur.

Because fraud controls are embedded directly into the transaction flow, they don’t introduce additional coupling or asynchronous failure modes.

The impact is measurable. Capitec’s fraud prevention systems have prevented R300 million in client losses from fraud. In November alone, these systems saved clients a further R60 million in fraud losses.

11. The Myth of Stability Through Multicloud

Multicloud is often presented as a stability strategy. In practice, it is largely a myth.

Running across multiple cloud providers does not remove failure risk. It compounds it. Cross cloud communication can typically only be secured using IP based controls, weakening security posture. Operational complexity increases sharply as teams must reason about heterogeneous platforms, tooling, failure modes, and networking behaviour.

Most critically, multicloud does not eliminate correlated failure. If either cloud provider becomes unavailable, systems are usually unusable anyway. The result is a doubled risk surface, increased operational risk, and new inter cloud network dependencies (without a corresponding reduction in outage impact).

Multicloud increases complexity, weakens controls, and expands risk surface area without delivering meaningful resilience.

12. What Actually Improves Stability

There are better options than multicloud.

Hybrid cloud with anti-affinity on critical channels is one. For example, card rails can be placed in two physically separate data centres so that if cloud based digital channels are unavailable, clients can still transact via cards and ATMs. This provides real functional resilience rather than architectural illusion.

Multi region deployment within a single cloud provider is another. This provides geographic fault isolation without introducing heterogeneous complexity. However, this only works if the provider avoids globally scoped services that introduce hidden single points of failure. At present, only AWS consistently supports this model. Some providers expose global services (such as global front doors) that introduce global blast radius and correlated failure risk.

True resilience requires isolation of failure domains, not duplication of platforms.

13. Why Traditional Banks Still Struggle

Traditional banks remain constrained by entangled product graphs, vertically scaled cores, synchronous integration models, and architectural decisions from a different era. As product portfolios grow, quantum entanglement increases. Change slows, recovery degrades, and outages become harder to diagnose and contain.

Modernisation programmes often increase entanglement temporarily through dual run architectures, making systems more fragile before they become more stable (if they ever do).

The challenge is not talent or ambition. It is the accumulated cost of entanglement.

14. Stability at Scale Without the Traditional Trade Off

Capitec’s significance is not that it is small. It is that it is large and remains stable.

Despite operating at massive scale with a broad product surface and high transaction volumes, stability improves rather than degrades. Scale does not increase blast radius, recovery time, or change risk. It increases parallelism, isolation, and resilience.

This directly contradicts historical banking patterns where growth inevitably led to fragility. Capitec demonstrates that with the right architecture, scale and stability are no longer opposing forces.

15. Final Thought

Before cloud and autoscaling, scale and stability were inversely related. The more products a bank had, the more entangled and fragile it became.

In 2025, that relationship can be reversed (but only by breaking entanglement, isolating failure domains, and avoiding complexity masquerading as resilience).

Doing a deal with a cloud provider means nothing if transformation stalls inside the organisation. If dozens of people carry the title of CIO while quietly pulling the handbrake on the change that is required, the outcome is inevitable regardless of vendor selection.

There is also a strategic question that many institutions avoid. If forced to choose between operating in a jurisdiction that is hostile to public cloud or accessing the full advantages of cloud, waiting is not a strategy. When that jurisdiction eventually allows public cloud, the market will already be populated by banks that moved earlier, built cloud native platforms, and are now entering at scale.

Capitec is an engineering led bank whose stability and speed increase with scale. Traditional banks remain constrained by quantum entanglement baked into architectures from a different era.

These outcomes are not accidental. They are the inevitable result of architectural and organisational choices made years ago, now playing out under real world load.


Amazon Aurora DSQL: A Deep Dive into Performance and Limitations

1. Executive Summary

Amazon Aurora DSQL represents AWS’s ambitious entry into the distributed SQL database market, announced at re:Invent 2024. It’s a serverless, distributed SQL database featuring active active high availability and PostgreSQL compatibility. While the service offers impressive architectural innovations, including 99.99% single region and 99.999% multi region availability, it comes with significant limitations that developers must carefully consider. This analysis examines Aurora DSQL’s performance characteristics, architectural tradeoffs, and critical constraints that impact real world applications.

Key Takeaways:

  • Aurora DSQL excels at multiregion active active workloads with low latency reads and writes
  • Significant PostgreSQL compatibility gaps limit migration paths for existing applications
  • Optimistic concurrency control requires application level retry logic
  • Preview phase limitations include 10,000 row transaction limits and missing critical features
  • Pricing model is complex and difficult to predict without production testing

1.1 Architecture Overview

Aurora DSQL fundamentally reimagines distributed database architecture by decoupling transaction processing from storage. The service overcomes two historical challenges: achieving multi region strong consistency with low latency and syncing servers with microsecond accuracy around the globe.

1.2 Core Components

The system consists of three independently scalable components:

  1. Compute Layer: Executes SQL queries without managing locks
  2. Commit/Journal Layer: Handles transaction ordering and conflict detection
  3. Storage Layer: Provides durable, queryable storage built from transaction logs

The Journal logs every transaction, while the Adjudicator component manages transaction isolation and conflict resolution. This separation allows Aurora DSQL to scale each component independently based on workload demands.

1.3 Multi Region Architecture

In multi region configurations, clusters provide two regional endpoints that present a single logical database supporting concurrent read and write operations with strong data consistency. A third witness region stores transaction logs for recovery purposes, enabling the system to maintain availability even during regional failures.

2. Performance Characteristics

2.1 Read and Write Latency

In simple workload tests, read latency achieves single digit milliseconds, while write latency is approximately two roundtrip times to the nearest region at commit. This predictable latency model differs significantly from traditional databases where performance degrades under contention.

Transaction latency remains constant relative to statement count, even across regions. This consistency provides predictable performance characteristics regardless of transaction complexity—a significant advantage for globally distributed applications.

2.2 Scalability Claims

AWS claims Aurora DSQL delivers reads and writes that are four times faster than Google Cloud Spanner. However, these claims remain to be validated by real world benchmarks at extreme scale, as Aurora DSQL has yet to be truly tested at enterprise volumes.

The service provides virtually unlimited scalability from a single endpoint, eliminating manual provisioning and management of database instances. The architecture automatically partitions the key space to detect conflicting transactions, allowing it to scale without traditional sharding complexity.

2.3 Concurrency Control Tradeoffs

Aurora DSQL uses optimistic concurrency control (OCC), where transactions run without considering other concurrent transactions, with conflict detection happening at commit time. While this prevents slow transactions from blocking others, it requires applications to handle transaction retries.

OCC provides better scalability for query processing and a more robust cluster for realistic failures by avoiding locking mechanisms that can lead to deadlocks or performance bottlenecks. However, this comes at the cost of increased application complexity.

3. Critical Limitations

3.1 Transaction Constraints

The preview version imposes several hard limits that significantly impact application design:

  • Maximum 10,000 rows modified per transaction
  • Transaction size cannot exceed 10 MiB
  • Sessions are capped at 1 hour, requiring reconnection for long lived processes
  • Cannot mix DDL and DML statements within a single transaction

Common DDL/DML Limitation Example:

The most common pattern that fails in Aurora DSQL is SELECT INTO, which combines table creation and data population in a single statement:

BEGIN;
SELECT id, username, created_at 
INTO new_users_copy 
FROM users 
WHERE active = true;
COMMIT;  -- ERROR: SELECT INTO mixes DDL and DML

This pattern is extremely common in:

  • ETL processes that create temporary staging tables and load data with SELECT INTO
  • Reporting workflows that materialize query results into new tables
  • Database migrations that create tables and seed initial data
  • Test fixtures that set up schema and populate test data
  • Multi-tenant applications that dynamically create tenant-specific tables and initialize them

The workaround requires splitting into separate transactions using CREATE TABLE followed by INSERT INTO ... SELECT:

-- Transaction 1: Create table structure
BEGIN;
CREATE TABLE new_users_copy (
    id BIGINT,
    username TEXT,
    created_at TIMESTAMP
);
COMMIT;

-- Transaction 2: Populate with data
BEGIN;
INSERT INTO new_users_copy 
SELECT id, username, created_at 
FROM users 
WHERE active = true;
COMMIT;

This limitation stems from Aurora DSQL’s asynchronous DDL processing architecture, where schema changes are propagated separately from data changes across the distributed system.

These constraints make Aurora DSQL unsuitable for batch processing workloads or applications requiring large bulk operations.

3.2 Missing PostgreSQL Features

Aurora DSQL sacrifices several critical PostgreSQL features for performance and scalability:

Not Supported:

  • No foreign keys
  • No temporary tables
  • No views
  • No triggers, PL/pgSQL, sequences, or explicit locking
  • No serializable isolation level or LOCK TABLE support
  • No PostgreSQL extensions (pgcrypto, PostGIS, PGVector, hstore)

Unsupported Data Types:

  • No SERIAL or BIGSERIAL (auto incrementing integers)
  • No JSON or JSONB types
  • No range types (tsrange, int4range, etc.)
  • No geospatial types (geometry, geography)
  • No vector types
  • TEXT type limited to 1MB (vs. 1GB in PostgreSQL)

Aurora DSQL prioritizes scalability, sacrificing some features for performance. The distributed architecture and asynchronous DDL requirements prevent support for extensions that depend on PostgreSQL’s internal storage format. These omissions require significant application redesign for PostgreSQL migrations.

3.3 Isolation Level Limitations

Aurora DSQL supports strong snapshot isolation, equivalent to repeatable read isolation in PostgreSQL. This creates several important implications:

  1. Write Skew Anomalies: Applications susceptible to write skew cannot rely on serializable isolation to prevent conflicts.

What is Write Skew?

Write skew is a database anomaly that occurs when two concurrent transactions read overlapping data, make disjoint updates based on what they read, and both commit successfully—even though the result violates a business constraint. This happens because snapshot isolation only detects direct write write conflicts on the same rows, not constraint violations across different rows.

Classic example: A hospital requires at least two doctors on call at all times. Two transactions check the current count (finds 2 doctors), both see the requirement is satisfied, and both remove themselves from the on call roster. Both transactions modify different rows (their own records), so snapshot isolation sees no conflict. Both commit successfully, leaving zero doctors on call and violating the business rule.

In PostgreSQL with serializable isolation, this would be prevented because the database detects the anomaly and aborts one transaction. Aurora DSQL’s snapshot isolation cannot prevent this, requiring application level logic to handle such constraints. A sketch of one common mitigation follows after this list.

  2. Referential Integrity Challenges: Without foreign keys and serializable isolation, maintaining referential integrity requires application side logic using SELECT FOR UPDATE to lock parent keys during child table inserts.
  3. Conflict Detection: Aurora DSQL uses optimistic concurrency control where SELECT FOR UPDATE intents are not synchronized to be visible to other transactions until commit.
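
As referenced above, one common mitigation for write skew under snapshot isolation is to materialise the business constraint on a single guard row, so that concurrent changes conflict on the same row instead of slipping past each other. The sketch below uses psycopg2 against the hospital example; the table and column names (on_call_guard, doctors, ward_id) are illustrative, and under Aurora DSQL’s optimistic concurrency control one of two conflicting transactions would still fail at commit and need the retry handling shown later in this post.

import psycopg2

def go_off_call(conn, doctor_id: int) -> None:
    # Illustrative write skew mitigation: every roster change reads AND updates
    # the same guard row, so two concurrent changes produce a direct write-write
    # conflict rather than silently violating the "two doctors on call" rule.
    with conn:                      # commit on success, rollback on exception
        with conn.cursor() as cur:
            cur.execute(
                "SELECT on_call_count FROM on_call_guard WHERE ward_id = 1 FOR UPDATE"
            )
            (on_call,) = cur.fetchone()
            if on_call <= 2:
                raise ValueError("at least two doctors must remain on call")
            cur.execute(
                "UPDATE doctors SET on_call = false WHERE id = %s", (doctor_id,)
            )
            cur.execute(
                "UPDATE on_call_guard SET on_call_count = on_call_count - 1 "
                "WHERE ward_id = 1"
            )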

4. Inter Region Consistency and Data Durability

4.1 Strong Consistency Guarantees

Aurora DSQL provides strong consistency across all regional endpoints, ensuring that reads and writes to any endpoint always reflect the same logical state. This is achieved through synchronous replication and careful transaction ordering:

How It Works:

  1. Synchronous Commits: When a transaction commits, Aurora DSQL writes to the distributed transaction log and synchronously replicates all committed log data across regions before acknowledging the commit to the client.
  2. Single Logical Database: Both regional endpoints in a multi region cluster present the same logical database. Readers consistently see the same data regardless of which endpoint they query.
  3. Zero Replication Lag: Unlike traditional asynchronous replication, there is no replication lag on commit. The commit only succeeds after data is durably stored across regions.
  4. Witness Region: A third region stores transaction logs for durability but doesn’t have a query endpoint. This witness region ensures multi region durability without requiring three full active regions.

4.2 Can You Lose Data?

Aurora DSQL is designed with strong durability guarantees that make data loss extremely unlikely:

Single Region Configuration:

  • All write transactions are synchronously replicated to storage replicas across three Availability Zones
  • Replication uses quorum based commits, ensuring data survives even if one AZ fails completely
  • No risk of data loss due to replication lag because replication is always synchronous

Multi Region Configuration:

  • Committed transactions are synchronously written to the transaction log in both active regions plus the witness region
  • A transaction only acknowledges success after durable storage across multiple regions
  • Even if an entire AWS region becomes unavailable, committed data remains accessible from the other region

Failure Scenarios:

Single AZ Failure: Aurora DSQL automatically routes to healthy AZs. No data loss occurs because data is replicated across three AZs synchronously.

Single Region Failure (Multi Region Setup): Applications can continue operating from the remaining active region with zero data loss. All committed transactions were synchronously replicated before the commit acknowledgment was sent.

Component Failure: Individual component failures (compute, storage, journal) are handled through Aurora DSQL’s self healing architecture. The system automatically repairs failed replicas asynchronously while serving requests from healthy components.

4.3 Will Everything Always Be in Both Regions?

Yes, with important caveats:

Committed Data: Once a transaction receives a commit acknowledgment, that data is guaranteed to exist in both active regions. The synchronous replication model ensures this.

Uncommitted Transactions: Transactions that haven’t yet committed exist only in their originating region’s session state. If that region fails before commit, the transaction is lost (which is expected behavior).

Durability vs. Availability Tradeoff: The strong consistency model means that if cross region network connectivity is lost, write operations may be impacted. Aurora DSQL prioritizes consistency over availability in the CAP theorem sense: it won’t accept writes that can’t be properly replicated.

Geographic Restrictions: Multi region clusters are currently limited to geographic groupings (US regions together, European regions together, Asia Pacific regions together). You cannot pair US East with EU West, which limits truly global active active deployments.

4.4 Consistency Risks and Limitations

While Aurora DSQL provides strong consistency, developers should understand these considerations:

Network Partition Handling: In the event of a network partition between regions, Aurora DSQL’s behavior depends on which components can maintain quorum. The system is designed to maintain consistency, which may mean rejecting writes rather than accepting writes that can’t be properly replicated.

Write Skew at Application Level: While individual transactions are consistent, applications must still handle write skew anomalies that can occur with snapshot isolation (as discussed in the Isolation Level Limitations section).

Time Synchronization Dependency: Aurora DSQL relies on Amazon Time Sync Service for precise time coordination. While highly reliable, this creates a subtle dependency on time synchronization for maintaining transaction ordering across regions.

4.5 Basic Latency Testing

This test measures basic insert latency using the PostgreSQL wire protocol:

#!/bin/bash
# Basic Aurora DSQL Performance Test
# Prerequisites: AWS CLI, psql, jq
# Configuration
REGION="us-east-1"
CLUSTER_ENDPOINT="your-cluster-endpoint.dsql.us-east-1.on.aws"
DATABASE="testdb"
# Generate temporary authentication token
export PGPASSWORD=$(aws dsql generate-db-connect-admin-auth-token \
  --hostname $CLUSTER_ENDPOINT \
  --region $REGION \
  --expires-in 3600)
export PGHOST=$CLUSTER_ENDPOINT
export PGDATABASE=$DATABASE
export PGUSER=admin
# Create test table
psql << 'EOF'
DROP TABLE IF EXISTS perf_test;
-- Note: SERIAL/BIGSERIAL (sequences) are not supported in Aurora DSQL,
-- so a UUID primary key with a default is used instead.
CREATE TABLE perf_test (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  data TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
EOF
# Run timed insert test
# Note: PL/pgSQL DO blocks are not supported in Aurora DSQL, so the 100 rows
# are inserted with a single set-based INSERT ... SELECT instead.
echo "Running 100 row insert test..."
time psql -c "
INSERT INTO perf_test (data)
SELECT 'test_data_' || g FROM generate_series(1, 100) AS g;
"
# Test transaction commit latency
echo "Testing transaction commit latency..."
psql << 'EOF'
\timing on
BEGIN;
INSERT INTO perf_test (data) VALUES ('commit_test');
COMMIT;
EOF

4.6 Concurrency Control Testing

Test optimistic concurrency behavior and conflict detection:

#!/usr/bin/env python3
"""
Aurora DSQL Concurrency Test
Tests optimistic concurrency control and retry logic
"""
import psycopg2
import boto3
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
def get_auth_token(endpoint, region):
    """Generate IAM authentication token"""
    client = boto3.client('dsql', region_name=region)
    return client.generate_db_connect_admin_auth_token(
        hostname=endpoint,
        region=region,
        expires_in=3600
    )
def connect(endpoint, database, region):
    """Create database connection"""
    token = get_auth_token(endpoint, region)
    return psycopg2.connect(
        host=endpoint,
        database=database,
        user='admin',
        password=token,
        sslmode='require'
    )
def update_with_retry(endpoint, database, region, row_id, new_value, max_retries=5):
    """Update with exponential backoff retry logic"""
    retries = 0
    delay = 0.1
    
    while retries < max_retries:
        conn = None
        try:
            conn = connect(endpoint, database, region)
            cursor = conn.cursor()
            
            # Start transaction
            cursor.execute("BEGIN")
            
            # Read current value
            cursor.execute(
                "SELECT value FROM test_table WHERE id = %s FOR UPDATE",
                (row_id,)
            )
            current = cursor.fetchone()
            
            # Simulate some processing
            time.sleep(0.01)
            
            # Update value
            cursor.execute(
                "UPDATE test_table SET value = %s WHERE id = %s",
                (new_value, row_id)
            )
            
            # Commit
            cursor.execute("COMMIT")
            
            return True, retries
            
        except psycopg2.Error as e:
            if "change conflicts with another transaction" in str(e):
                retries += 1
                if retries < max_retries:
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff
                    continue
            raise
        finally:
            if conn:
                conn.close()
    
    return False, max_retries
def run_concurrency_test(endpoint, database, region, num_threads=10):
    """Run concurrent updates on same row"""
    
    # Setup test table
    conn = connect(endpoint, database, region)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS test_table (
            id BIGINT PRIMARY KEY,
            value INT
        )
    """)
    cursor.execute("INSERT INTO test_table (id, value) VALUES (1, 0)")
    conn.commit()
    conn.close()
    
    # Run concurrent updates
    start_time = time.time()
    results = {'success': 0, 'failed': 0, 'total_retries': 0}
    
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [
            executor.submit(update_with_retry, endpoint, database, region, 1, i)
            for i in range(num_threads)
        ]
        
        for future in as_completed(futures):
            success, retries = future.result()
            if success:
                results['success'] += 1
            else:
                results['failed'] += 1
            results['total_retries'] += retries
    
    elapsed = time.time() - start_time
    
    print(f"\nConcurrency Test Results:")
    print(f"Duration: {elapsed:.2f}s")
    print(f"Successful: {results['success']}")
    print(f"Failed: {results['failed']}")
    print(f"Total Retries: {results['total_retries']}")
    print(f"Avg Retries per Transaction: {results['total_retries']/num_threads:.2f}")
if __name__ == "__main__":
    ENDPOINT = "your-cluster-endpoint.dsql.us-east-1.on.aws"
    DATABASE = "testdb"
    REGION = "us-east-1"
    
    run_concurrency_test(ENDPOINT, DATABASE, REGION)

4.7 Multi Region Latency Test

Measure cross region write latency:

// Node.js Multi-Region Latency Test
const { Client } = require('pg');
const AWS = require('aws-sdk');
async function getAuthToken(endpoint, region) {
  const dsql = new AWS.DSQL({ region });
  const params = {
    hostname: endpoint,
    region: region,
    expiresIn: 3600
  };
  return dsql.generateDbConnectAdminAuthToken(params);
}
async function testRegionalLatency(endpoints, database) {
  const results = [];
  
  for (const [region, endpoint] of Object.entries(endpoints)) {
    const token = await getAuthToken(endpoint, region);
    
    const client = new Client({
      host: endpoint,
      database: database,
      user: 'admin',
      password: token,
      ssl: { rejectUnauthorized: true }
    });
    
    await client.connect();
    
    // Measure read latency
    const readStart = Date.now();
    await client.query('SELECT 1');
    const readLatency = Date.now() - readStart;
    
    // Measure write latency (includes commit sync)
    const writeStart = Date.now();
    await client.query('BEGIN');
    await client.query('INSERT INTO latency_test (ts) VALUES (NOW())');
    await client.query('COMMIT');
    const writeLatency = Date.now() - writeStart;
    
    results.push({
      region,
      readLatency,
      writeLatency
    });
    
    await client.end();
  }
  
  console.log('Multi-Region Latency Results:');
  console.table(results);
}
// Usage
const endpoints = {
  'us-east-1': 'cluster-1.dsql.us-east-1.on.aws',
  'us-west-2': 'cluster-1.dsql.us-west-2.on.aws'
};
testRegionalLatency(endpoints, 'testdb');

4.8 Transaction Size Limit Test

Verify how the 10,000 row transaction limit impacts operations:

#!/usr/bin/env python3
"""
Test Aurora DSQL transaction size limits
"""
import psycopg2
import boto3
def connect(endpoint, database, region):
    """Create database connection"""
    client = boto3.client('dsql', region_name=region)
    token = client.generate_db_connect_admin_auth_token(
        hostname=endpoint,
        region=region,
        expires_in=3600
    )
    return psycopg2.connect(
        host=endpoint,
        database=database,
        user='admin',
        password=token,
        sslmode='require'
    )
def test_limits(endpoint, database, region):
    """Test transaction row limits"""
    conn = connect(endpoint, database, region)
    cursor = conn.cursor()
    
    # Create test table
    # Note: BIGSERIAL is not supported in Aurora DSQL (no sequences),
    # so a UUID primary key with a default is used instead.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS limit_test (
            id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
            data TEXT
        )
    """)
    conn.commit()
    
    # Test under limit (should succeed)
    print("Testing 9,999 row insert (under limit)...")
    try:
        cursor.execute("BEGIN")
        for i in range(9999):
            cursor.execute(
                "INSERT INTO limit_test (data) VALUES (%s)",
                (f"row_{i}",)
            )
        cursor.execute("COMMIT")
        print("✓ Success: Under-limit transaction committed")
    except Exception as e:
        print(f"✗ Failed: {e}")
        cursor.execute("ROLLBACK")
    
    conn.close()
if __name__ == "__main__":
    ENDPOINT = "your-cluster-endpoint.dsql.us-east-1.on.aws"
    DATABASE = "testdb"
    REGION = "us-east-1"
    
    test_limits(ENDPOINT, DATABASE, REGION)

5. Aurora DSQL and af-south-1 (Cape Town): Regional Availability and Performance Impact

5.1 Current Regional Support

As of the preview launch in November 2024, Amazon Aurora DSQL is not available in the af-south-1 (Cape Town) region. AWS has initially launched DSQL in a limited set of regions, focusing on major US, European, and Asia Pacific markets. The Cape Town region, while part of AWS’s global infrastructure, has not been included in the initial preview rollout.

This absence is significant for African businesses and organizations looking to leverage DSQL’s distributed SQL capabilities, as there is currently no way to deploy a regional endpoint within the African continent.

5.2 Geographic Pairing Limitations

Even if af-south-1 support is added in the future, Aurora DSQL’s current multi region architecture imposes geographic restrictions on region pairing. According to AWS documentation, multi region clusters are limited to specific geographic groupings:

  • US regions can only pair with other US regions
  • European regions can only pair with other European regions
  • Asia Pacific regions can only pair with other Asia Pacific regions

This means that even with af-south-1 support, African deployments would likely be restricted to pairing with other African regions. Given that AWS currently operates only one region in Africa (Cape Town), true multi region DSQL deployment within the continent would require AWS to launch additional African regions first.

5.3 The 150ms RTT Challenge

The network round trip time (RTT) between Cape Town and major AWS regions presents a fundamental performance challenge for Aurora DSQL deployments. Typical RTT measurements from af-south-1 to other regions include:

  • af-south-1 to eu-west-1 (Ireland): approximately 150ms
  • af-south-1 to us-east-1 (Virginia): approximately 200ms
  • af-south-1 to ap-southeast-1 (Singapore): approximately 280ms

To understand the performance impact of 150ms RTT, we need to examine how Aurora DSQL’s architecture handles write operations.

5.4 Write Performance Impact Analysis

Aurora DSQL’s write latency is directly tied to network round trip times because of its synchronous replication model. When a transaction commits:

  1. The client sends a COMMIT request to the nearest regional endpoint
  2. The commit layer synchronously replicates the transaction log to all regions in the cluster
  3. The system waits for acknowledgment from all regions before confirming the commit to the client
  4. Only after all regions have durably stored the transaction does the client receive confirmation

According to AWS documentation, write latency is approximately two round trip times to the nearest region at commit. This means:

5.5 Single Region Scenario (Hypothetical af-south-1)

If Aurora DSQL were available in af-south-1 as a single region deployment, write performance would be competitive with other regions. The write latency would be roughly:

  • Local RTT within af-south-1: approximately 2-5ms (between availability zones)
  • Expected write latency: 4-10ms (two RTTs)

This would provide acceptable performance for most transactional workloads.

5.6 Multi Region Scenario (af-south-1 paired with eu-west-1)

In a multi region configuration pairing Cape Town with Ireland, the 150ms RTT creates severe performance constraints:

  • Cross region RTT: 150ms
  • Write latency calculation: 2 × 150ms = 300ms minimum per write transaction
  • Best case scenario with optimizations: 250-350ms per transaction

This 300ms write latency has cascading effects on application throughput:

Throughput Impact:

  • With 300ms per transaction, a single connection can complete approximately 3.3 transactions per second
  • To achieve 100 transactions per second, you would need at least 30 concurrent database connections
  • To achieve 1,000 transactions per second, you would need 300+ concurrent connections

Comparison to US/EU Deployments:

  • us-east-1 to us-west-2: approximately 60ms RTT = 120ms write latency
  • eu-west-1 to eu-central-1: approximately 25ms RTT = 50ms write latency

An af-south-1 to eu-west-1 pairing would experience 6x slower writes compared to typical European region pairings and 2.5x slower writes compared to cross-US deployments.
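These figures follow directly from the two round trip commit model. A small back of the envelope helper is shown below; the RTT values are the approximate numbers quoted above, and everything else is simple arithmetic rather than a measured result:

import math

def write_latency_ms(rtt_ms: float) -> float:
    # Aurora DSQL commit cost is roughly two cross-region round trips.
    return 2 * rtt_ms

def connections_for_tps(target_tps: float, rtt_ms: float) -> int:
    # A single connection commits serially, so its throughput is 1000 / latency.
    per_connection_tps = 1000.0 / write_latency_ms(rtt_ms)
    return math.ceil(target_tps / per_connection_tps)

for pair, rtt in [("af-south-1 to eu-west-1", 150),
                  ("us-east-1 to us-west-2", 60),
                  ("eu-west-1 to eu-central-1", 25)]:
    print(f"{pair}: {write_latency_ms(rtt):.0f} ms per commit, "
          f"{connections_for_tps(100, rtt)} connections for 100 TPS")
# af-south-1 to eu-west-1: 300 ms per commit, 30 connections for 100 TPS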

5.7 Optimistic Concurrency Control Compounds the Problem

Aurora DSQL’s optimistic concurrency control (OCC) makes the latency problem worse for African deployments. When transactions conflict:

  1. The application must detect the conflict (after waiting 300ms for the failed commit)
  2. The application implements retry logic with exponential backoff
  3. Each retry attempt incurs another 300ms commit latency

For a workload with 20% conflict rate requiring retries:

  • First attempt: 300ms (80% success)
  • Second attempt: 300ms + backoff delay (15% success)
  • Third attempt: 300ms + backoff delay (4% success)
  • Fourth attempt: 300ms + backoff delay (1% success)

Average transaction time balloons to 400-500ms when accounting for retries and backoff delays, reducing effective throughput to 2-2.5 transactions per second per connection.
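
The 400-500ms estimate can be reproduced with a simple expected value calculation. The success distribution mirrors the list above; the backoff schedule (100ms, doubling) is an assumption, so treat the output as indicative rather than exact:

COMMIT_MS = 300                          # ~2 x 150ms RTT per commit attempt
BACKOFF_MS = [0, 100, 200, 400]          # assumed backoff before attempts 1..4
P_SUCCESS = [0.80, 0.15, 0.04, 0.01]     # share of transactions succeeding on attempt n

expected = 0.0
for attempt, p in enumerate(P_SUCCESS, start=1):
    # A transaction that succeeds on attempt n has paid n commit latencies
    # plus all the backoff delays accumulated before that attempt.
    cost_ms = attempt * COMMIT_MS + sum(BACKOFF_MS[:attempt])
    expected += p * cost_ms

print(f"Expected transaction time: {expected:.0f} ms")   # ~410 ms with these assumptions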

5.8 Read Performance Considerations

While write performance suffers significantly, read performance from af-south-1 would be less impacted:

  • Local reads: If querying data from a local af-south-1 endpoint, read latency would remain in single digit milliseconds
  • Cross region reads: Reading from eu-west-1 while located in Cape Town would incur the 150ms RTT, but this is expected behavior for geo-distributed queries

Aurora DSQL’s architecture excels at providing low latency reads from the nearest regional endpoint, so applications that are read-heavy could still benefit from local read performance even with degraded write performance.

5.9 Cost Implications

The 150ms RTT also impacts cost through increased DPU consumption:

  • Longer transaction times mean connections remain open longer
  • More concurrent connections are needed to achieve target throughput
  • Higher retry rates consume additional DPUs for conflict resolution
  • Network data transfer costs between continents are higher than intra-region transfer

A workload that might consume 1 million DPUs in a us-east-1/us-west-2 configuration could easily consume 2-3 million DPUs in an af-south-1/eu-west-1 configuration due to longer transaction times and retry overhead.

At $8 per million DPUs, this represents a 2-3x cost increase purely from geographic latency, before considering higher network transfer costs.

5.10 Viability Assessment for African Deployments

Aurora DSQL is currently not viable for af-south-1 deployments due to:

5.10.1 Immediate Blockers

  1. No regional availability: DSQL is not offered in af-south-1
  2. No local pairing options: Even if available, there are no other African regions to pair with for multi region deployments
  3. Geographic restrictions: Current architecture prevents pairing af-south-1 with regions outside Africa

5.10.2 Performance Barriers (if it were available)

  1. Write performance: 300ms write latency makes DSQL unsuitable for:
    • High frequency transactional systems
    • Real time inventory management
    • Interactive user facing applications requiring sub 100ms response times
    • Any workload requiring more than 10-20 transactions per second without massive connection pooling
  2. Cost efficiency: 2-3x higher DPU consumption makes the service economically unviable compared to regional alternatives
  3. Retry amplification: Optimistic concurrency control multiplies the latency problem, making high contention workloads essentially unusable

5.10.3 Scenarios where DSQL might work (Hypothetically)

If Aurora DSQL becomes available in af-south-1, it could potentially serve:

  1. Asynchronous workloads: Background job processing, batch operations, and tasks where 300ms latency is acceptable
  2. Read-heavy applications: Systems that primarily read locally but occasionally sync writes to Europe
  3. Low-volume transactional systems: Applications processing fewer than 10 transactions per second
  4. Eventually consistent workflows: Systems that can tolerate write delays and handle retries gracefully

However, for these use cases, traditional Aurora PostgreSQL with cross region read replicas or DynamoDB Global Tables would likely provide better performance and cost efficiency.

5.11 Recommendations for African Organizations

Until AWS expands Aurora DSQL availability and addresses the latency constraints, organizations in Africa should consider:

  1. Aurora PostgreSQL: Deploy in af-south-1 with cross region read replicas to Europe or the Middle East for disaster recovery
  2. DynamoDB Global Tables: For globally distributed data with eventual consistency requirements
  3. RDS PostgreSQL: For traditional relational workloads that don’t require multi region active active
  4. Self-managed solutions: CockroachDB or YugabyteDB can be deployed in af-south-1 and paired with regions of your choice, avoiding AWS’s geographic restrictions

5.12 Future Outlook

For Aurora DSQL to become viable in Africa, AWS would need to:

  1. Launch DSQL in af-south-1: Basic prerequisite for any African deployment
  2. Add more African regions: To enable multi region deployments within acceptable latency bounds
  3. Remove geographic pairing restrictions: Allow af-south-1 to pair with me-south-1 (Bahrain) or other nearby regions with better latency profiles
  4. Optimize for high latency scenarios: Implement asynchronous commit options or relaxed consistency modes for geo-distributed deployments

None of these improvements have been announced, and given DSQL’s preview status, African availability is unlikely in the near term.

5.13 Quantified Performance Impact Summary

Single write transaction latency:

  • Local (within af-south-1): 4-10ms (estimated)
  • To eu-west-1: 300ms (6x slower than EU region pairs, 2.5x slower than US region pairs)
  • To us-east-1: 400ms
  • To ap-southeast-1: 560ms

Throughput per connection:

  • Local: 100-250 TPS (estimated)
  • To eu-west-1: 3.3 TPS (30x reduction)
  • With 20% retry rate: 2-2.5 TPS (50x reduction)

Cost multiplier:

  • Estimated 2-3x DPU consumption compared to low latency region pairs
  • Additional cross continental data transfer costs

Conclusion: The 150ms RTT to Europe creates a 6x write latency penalty and reduces per-connection throughput by 30-50x. Combined with the lack of regional availability, geographic pairing restrictions, and cost implications, Aurora DSQL is not viable for African deployments in its current form. Organizations in af-south-1 should continue using traditional database solutions until AWS addresses these fundamental constraints.

5.14 Performance Analysis

5.14.1 Strengths

  1. Predictable Latency: Transaction latency remains constant regardless of statement count, providing consistent performance characteristics that simplify capacity planning.
  2. Multi Region Active Active: Both regional endpoints support concurrent read and write operations with strong data consistency, enabling true active active configurations without complex replication lag management.
  3. No Single Point of Contention: A single slow client or long running query doesn’t impact other transactions because contention is handled at commit time on the server side.

5.14.2 Weaknesses

  1. High Contention Workload Performance: Applications with frequent updates to small key ranges experience high retry rates (see the optimistic concurrency control discussion earlier in this analysis).
  2. Application Complexity: Aurora DSQL’s optimistic concurrency control minimizes cross region latency but requires applications to handle retries. This shifts complexity from the database to application code.
  3. Feature Gaps: Missing PostgreSQL features like foreign keys, triggers, views, and critical data types like JSON/JSONB require application redesign. Some developers view it as barely a database, more like a key value store with basic PostgreSQL wire compatibility.
  4. Unpredictable Costs: The pricing model is monumentally confusing, with costs varying based on DPU consumption that’s difficult to predict without production testing.

5.15 Use Case Recommendations

5.15.1 Good Fit

Global Ecommerce Platforms

Applications requiring continuous availability across regions with strong consistency for inventory and order management. If your business depends on continuous availability—like global ecommerce or financial platforms—Aurora DSQL’s active active model is a game changer.

Multi Tenant SaaS Applications

Services with dynamic scaling requirements and geographic distribution of users. The automatic scaling eliminates capacity planning concerns.

Financial Services (with caveats)

Transaction processing systems that can implement application level retry logic and work within the snapshot isolation model.

5.15.2 Poor Fit

Batch Processing Systems

The 10,000 row transaction limit makes Aurora DSQL unsuitable for bulk data operations, ETL processes, or large scale data migrations.

Legacy PostgreSQL Applications

Applications depending on foreign keys, triggers, stored procedures, views, or serializable isolation will require extensive rewrites.

High Contention Workloads

Applications with frequent updates to small key ranges (like continuously updating stock tickers, inventory counters for popular items, or high frequency account balance updates) will experience high retry rates and degraded throughput due to optimistic concurrency control. See the detailed explanation in the Performance Analysis section for why this occurs.

6. Comparison with Alternatives

vs. Google Cloud Spanner

Aurora DSQL claims 4x faster performance, but Spanner has been used for multi region consistent deployments at proven enterprise scale. Spanner uses its own SQL dialect, while Aurora DSQL provides PostgreSQL wire protocol compatibility.

vs. CockroachDB and YugabyteDB

YugabyteDB is more compatible with PostgreSQL, supporting features like triggers, PL/pgSQL, foreign keys, sequences, all isolation levels, and explicit locking. CockroachDB offers similar advantages with battle tested multi cloud deployment options, while Aurora DSQL is AWS exclusive and still in preview.

vs. Aurora PostgreSQL

Traditional Aurora PostgreSQL provides full PostgreSQL compatibility with proven reliability but lacks the multi region active active capabilities and automatic horizontal scaling of DSQL. The choice depends on whether distributed architecture benefits outweigh compatibility trade offs.

6.1 Production Readiness Assessment

Preview Status Concerns

As of November 2024, Aurora DSQL remains in public preview with several implications:

  • No production SLA guarantees
  • Feature set still evolving
  • Limited regional availability
  • Pricing subject to change at general availability

Missing Observability

During preview, instrumentation is limited. EXPLAIN ANALYZE offers no visibility into the commit phase, making it difficult to understand what happens during the synchronization and wait phases.

Migration Complexity

Aurora DSQL prioritizes scalability, sacrificing some features for performance. This requires careful evaluation of application dependencies on unsupported PostgreSQL features before attempting migration.

6.2 Pricing Considerations

Billing for Aurora DSQL is based on two primary measures: Distributed Processing Units (DPU) and storage, with costs of $8 per million DPU and $0.33 per GB month in US East.

However, DPU consumption varies unpredictably based on query complexity, making cost forecasting extremely difficult. The result of cost modeling exercises amounts to “yes, this will cost you some amount of money,” which is unacceptable when costs rise beyond science experiment levels.

The AWS Free Tier provides 100,000 DPUs and 1 GB month of storage monthly, allowing for initial testing without costs.
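
As a rough illustration of the billing model using the numbers above (US East pricing and the free tier of 100,000 DPUs plus 1 GB month), here is a minimal sketch; the DPU and storage figures passed in are hypothetical, and the real unknown remains how many DPUs your workload actually consumes:

def monthly_cost_usd(dpus: float, storage_gb_month: float) -> float:
    # US East prices quoted in this post: $8 per million DPUs, $0.33 per GB month.
    # Free tier assumed: 100,000 DPUs and 1 GB month of storage.
    billable_dpus = max(dpus - 100_000, 0)
    billable_storage = max(storage_gb_month - 1, 0)
    return (billable_dpus / 1_000_000) * 8.0 + billable_storage * 0.33

# Hypothetical workload: 5 million DPUs and 50 GB of storage in a month.
print(f"${monthly_cost_usd(5_000_000, 50):.2f}")   # $55.37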

6.3 Recommendations

For New Applications

Aurora DSQL makes sense for greenfield projects where:

  • Multi region active active is a core requirement
  • Application can be designed around optimistic concurrency from the start
  • Features like foreign keys and triggers aren’t architectural requirements
  • Team accepts preview stage maturity risks

For Existing Applications

Migration from PostgreSQL requires:

  • Comprehensive audit of PostgreSQL feature dependencies (a starting point is sketched after this list)
  • Redesign of referential integrity enforcement
  • Implementation of retry logic for optimistic concurrency
  • Extensive testing to validate cost models
  • Acceptance that some features may require application level implementation
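
A starting point for that audit is to query the source database’s own catalogs for constructs Aurora DSQL does not support. The sketch below is minimal and illustrative only: the connection string is a placeholder, and the checks cover just the headline gaps discussed earlier (foreign keys, triggers, sequences, views, and JSON/JSONB columns).

import psycopg2

# Placeholder connection string for the *source* PostgreSQL database being audited.
conn = psycopg2.connect("postgresql://user:password@source-host:5432/appdb")

CHECKS = {
    "foreign keys": "SELECT count(*) FROM pg_constraint WHERE contype = 'f'",
    "triggers": "SELECT count(*) FROM pg_trigger WHERE NOT tgisinternal",
    "sequences": "SELECT count(*) FROM information_schema.sequences",
    "views": ("SELECT count(*) FROM information_schema.views "
              "WHERE table_schema NOT IN ('pg_catalog', 'information_schema')"),
    "json/jsonb columns": ("SELECT count(*) FROM information_schema.columns "
                           "WHERE udt_name IN ('json', 'jsonb')"),
}

with conn, conn.cursor() as cur:
    for feature, sql in CHECKS.items():
        cur.execute(sql)
        (count,) = cur.fetchone()
        flag = "  <-- needs redesign for DSQL" if count else ""
        print(f"{feature}: {count}{flag}")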

Testing Strategy

Before production commitment:

  1. Benchmark actual workloads against Aurora DSQL to measure real DPU consumption
  2. Test at production scale to validate the 10,000 row transaction limit doesn’t impact operations
  3. Implement comprehensive retry logic and verify behavior under contention
  4. Measure cross region latency for your specific geographic requirements
  5. Calculate total cost of ownership including application development effort for missing features

7. Conclusion

Amazon Aurora DSQL represents significant innovation in distributed SQL database architecture, solving genuine problems around multi region strong consistency and operational simplicity. The technical implementation, particularly the disaggregated architecture and optimistic concurrency control, demonstrates sophisticated engineering.

However, the service makes substantial tradeoffs that limit its applicability. The missing PostgreSQL features, transaction size constraints, and optimistic concurrency requirements create significant migration friction for existing applications. The unpredictable pricing model adds further uncertainty.

For organizations building new globally distributed applications with flexible architectural requirements, Aurora DSQL deserves serious evaluation. For teams with existing PostgreSQL applications or those requiring full PostgreSQL compatibility, traditional Aurora PostgreSQL or alternative distributed SQL databases may provide better paths forward.

As the service matures beyond preview status, AWS will likely address some limitations and provide better cost prediction tools. Until then, Aurora DSQL remains a promising but unfinished solution that requires careful evaluation against specific requirements and willingness to adapt applications to its architectural constraints.


Last Updated: November 2024 | Preview Status: Public Preview


Deep Dive into PostgreSQL Aurora Vacuum Optimizations for Large Tables

When managing large PostgreSQL tables with frequent updates, vacuum operations become critical for maintaining database health and performance. In this comprehensive guide, we’ll explore vacuum optimization techniques, dive deep into the pg_repack extension, and provide hands-on examples you can run in your own environment.

1. Understanding the Problem

PostgreSQL uses Multi-Version Concurrency Control (MVCC) to handle concurrent transactions. When rows are updated or deleted, PostgreSQL doesn’t immediately remove the old versions—it marks them as dead tuples. Over time, these dead tuples accumulate, leading to:

  • Table bloat: Wasted disk space
  • Index bloat: Degraded query performance
  • Slower sequential scans: More pages to read
  • Transaction ID wraparound risks: In extreme cases

The VACUUM process reclaims this space, but for large, heavily-updated tables, standard vacuum strategies often fall short.

1.1 Setting Up Our Test Environment

Let’s create a realistic scenario to understand vacuum optimization. We’ll build a large user activity tracking table that receives constant updates—similar to what you might find in production systems tracking user behaviors, session data, or transaction logs.

1.2 Creating the Test Table

This schema represents a typical high-volume table with multiple indexes for different query patterns:

-- Create our test table
CREATE TABLE user_activities (
    id BIGSERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    activity_data JSONB,
    status VARCHAR(20) DEFAULT 'pending',
    processed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    metadata TEXT
);

-- Create indexes
CREATE INDEX idx_user_activities_user_id ON user_activities(user_id);
CREATE INDEX idx_user_activities_status ON user_activities(status);
CREATE INDEX idx_user_activities_created_at ON user_activities(created_at);
CREATE INDEX idx_user_activities_processed_at ON user_activities(processed_at) 
    WHERE processed_at IS NOT NULL;

Example Output:

CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX

1.3 Population Script

This function generates realistic test data with varied activity types and statuses to simulate a production environment:

-- Function to generate random activity data
CREATE OR REPLACE FUNCTION generate_user_activities(num_rows INTEGER)
RETURNS void AS $$
DECLARE
    batch_size INTEGER := 10000;
    batches INTEGER;
    i INTEGER;
BEGIN
    batches := CEIL(num_rows::NUMERIC / batch_size);
    
    FOR i IN 1..batches LOOP
        INSERT INTO user_activities (
            user_id,
            activity_type,
            activity_data,
            status,
            created_at,
            metadata
        )
        SELECT
            (random() * 100000)::INTEGER + 1,
            (ARRAY['login', 'purchase', 'view', 'search', 'logout'])[FLOOR(random() * 5 + 1)],
            jsonb_build_object(
                'ip', '192.168.' || (random() * 255)::INTEGER || '.' || (random() * 255)::INTEGER,
                'user_agent', 'Mozilla/5.0',
                'session_id', md5(random()::TEXT)
            ),
            (ARRAY['pending', 'processing', 'completed'])[FLOOR(random() * 3 + 1)],
            NOW() - (random() * INTERVAL '90 days'),
            repeat('x', (random() * 500)::INTEGER + 100)
        FROM generate_series(1, LEAST(batch_size, num_rows - (i-1) * batch_size));
        
        RAISE NOTICE 'Inserted batch % of %', i, batches;
    END LOOP;
    
    RAISE NOTICE 'Completed inserting % rows', num_rows;
END;
$$ LANGUAGE plpgsql;

-- Populate with 5 million rows (adjust as needed)
SELECT generate_user_activities(5000000);

-- Analyze the table
ANALYZE user_activities;

Example Output:

CREATE FUNCTION
NOTICE:  Inserted batch 1 of 500
NOTICE:  Inserted batch 2 of 500
NOTICE:  Inserted batch 3 of 500
...
NOTICE:  Inserted batch 500 of 500
NOTICE:  Completed inserting 5000000 rows
 generate_user_activities 
---------------------------
 
(1 row)

ANALYZE

1.4 Simulating Heavy Update Load

Understanding bloat requires seeing it in action. This function simulates the update-heavy workload patterns that cause vacuum challenges in production systems:

-- Function to simulate continuous updates
CREATE OR REPLACE FUNCTION simulate_updates(duration_minutes INTEGER)
RETURNS void AS $$
DECLARE
    end_time TIMESTAMP;
    update_count INTEGER := 0;
BEGIN
    end_time := NOW() + (duration_minutes || ' minutes')::INTERVAL;
    
    WHILE NOW() < end_time LOOP
        -- Update random rows to 'processing' status
        UPDATE user_activities
        SET status = 'processing',
            updated_at = NOW()
        WHERE id IN (
            SELECT id FROM user_activities
            WHERE status = 'pending'
            ORDER BY random()
            LIMIT 1000
        );
        
        -- Update random rows to 'completed' status
        UPDATE user_activities
        SET status = 'completed',
            processed_at = NOW(),
            updated_at = NOW()
        WHERE id IN (
            SELECT id FROM user_activities
            WHERE status = 'processing'
            ORDER BY random()
            LIMIT 800
        );
        
        update_count := update_count + 1800;
        
        IF update_count % 10000 = 0 THEN
            RAISE NOTICE 'Processed % updates', update_count;
        END IF;
        
        PERFORM pg_sleep(0.1);
    END LOOP;
    
    RAISE NOTICE 'Completed % total updates', update_count;
END;
$$ LANGUAGE plpgsql;

-- Run for 5 minutes to generate bloat
-- SELECT simulate_updates(5);

Example Output (when running the simulate_updates function):

NOTICE:  Processed 10000 updates
NOTICE:  Processed 20000 updates
NOTICE:  Processed 30000 updates
...
NOTICE:  Completed 54000 total updates
 simulate_updates 
------------------
 
(1 row)

1.5 Monitoring Table Health

Before optimizing vacuum operations, you need visibility into your table’s health metrics. These queries provide essential diagnostics for understanding bloat levels and vacuum effectiveness.

1.6 Check Table and Index Bloat

This comprehensive query gives you a snapshot of your table’s overall health, including size metrics and tuple statistics:

-- Comprehensive bloat analysis
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - 
                   pg_relation_size(schemaname||'.'||tablename)) AS indexes_size,
    n_live_tup,
    n_dead_tup,
    ROUND(100 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_tuple_percent,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE tablename = 'user_activities';

Example Output:

 schemaname |    tablename     | total_size | table_size | indexes_size | n_live_tup | n_dead_tup | dead_tuple_percent |       last_vacuum       |     last_autovacuum     |       last_analyze       |     last_autoanalyze     
------------+------------------+------------+------------+--------------+------------+------------+--------------------+-------------------------+-------------------------+--------------------------+--------------------------
 public     | user_activities  | 4892 MB    | 3214 MB    | 1678 MB      |    5000000 |     847523 |              14.51 | 2024-11-16 02:15:33.421 | 2024-11-17 08:22:14.832 | 2024-11-16 02:15:45.123 | 2024-11-17 08:22:28.945
(1 row)

1.7 Detailed Bloat Estimation

For a more precise understanding of how much space is wasted, this query calculates bloat based on tuple density:

-- More accurate bloat estimation
WITH table_stats AS (
    SELECT
        schemaname,
        relname AS tablename,
        n_live_tup,
        n_dead_tup,
        pg_relation_size(schemaname||'.'||relname) AS table_bytes,
        (n_live_tup + n_dead_tup)::NUMERIC AS total_tuples
    FROM pg_stat_user_tables
    WHERE relname = 'user_activities'
),
bloat_calc AS (
    SELECT
        *,
        CASE 
            WHEN total_tuples > 0 THEN
                table_bytes / NULLIF(total_tuples, 0)
            ELSE 0
        END AS bytes_per_tuple,
        CASE
            WHEN n_live_tup > 0 THEN
                table_bytes * (n_dead_tup::NUMERIC / NULLIF(n_live_tup + n_dead_tup, 0))
            ELSE 0
        END AS bloat_bytes
    FROM table_stats
)
SELECT
    tablename,
    pg_size_pretty(table_bytes) AS current_size,
    pg_size_pretty(bloat_bytes::BIGINT) AS estimated_bloat,
    ROUND(100 * bloat_bytes / NULLIF(table_bytes, 0), 2) AS bloat_percent,
    n_live_tup,
    n_dead_tup
FROM bloat_calc;

Example Output:

    tablename     | current_size | estimated_bloat | bloat_percent | n_live_tup | n_dead_tup 
------------------+--------------+-----------------+---------------+------------+------------
 user_activities  | 3214 MB      | 466 MB          |         14.51 |    5000000 |     847523
(1 row)

1.8 Check Current Vacuum Activity

When troubleshooting vacuum issues, it’s crucial to see what’s actually running:

-- Monitor active vacuum operations
SELECT
    pid,
    datname,
    usename,
    state,
    query_start,
    NOW() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE query LIKE '%VACUUM%'
  AND query NOT LIKE '%pg_stat_activity%';

Example Output:

  pid  |  datname   | usename  |  state  |         query_start         |    duration     |                        query                        
-------+------------+----------+---------+-----------------------------+-----------------+-----------------------------------------------------
 12847 | production | postgres | active  | 2024-11-17 09:15:22.534829  | 00:03:17.482341 | VACUUM (VERBOSE, ANALYZE) user_activities;
(1 row)

2. Standard Vacuum Strategies

Understanding the different vacuum options is essential for choosing the right approach for your workload. Each vacuum variant serves different purposes and has different performance characteristics.

2.1 Manual VACUUM

These are the basic vacuum commands you’ll use for routine maintenance:

-- Basic vacuum (doesn't lock table)
VACUUM user_activities;

-- Vacuum with analyze
VACUUM ANALYZE user_activities;

-- Verbose output for monitoring
VACUUM (VERBOSE, ANALYZE) user_activities;

-- Full vacuum (rewrites the table; requires an ACCESS EXCLUSIVE lock)
VACUUM (FULL, VERBOSE, ANALYZE) user_activities;

Example Output (VACUUM VERBOSE):

INFO:  vacuuming "public.user_activities"
INFO:  scanned index "user_activities_pkey" to remove 847523 row versions
DETAIL:  CPU: user: 2.45 s, system: 0.89 s, elapsed: 12.34 s
INFO:  scanned index "idx_user_activities_user_id" to remove 847523 row versions
DETAIL:  CPU: user: 1.87 s, system: 0.67 s, elapsed: 9.12 s
INFO:  scanned index "idx_user_activities_status" to remove 847523 row versions
DETAIL:  CPU: user: 1.92 s, system: 0.71 s, elapsed: 9.45 s
INFO:  scanned index "idx_user_activities_created_at" to remove 847523 row versions
DETAIL:  CPU: user: 1.88 s, system: 0.68 s, elapsed: 9.23 s
INFO:  "user_activities": removed 847523 row versions in 112456 pages
DETAIL:  CPU: user: 3.21 s, system: 1.45 s, elapsed: 18.67 s
INFO:  "user_activities": found 847523 removable, 5000000 nonremovable row versions in 425678 out of 425678 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 123456789
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 11.33 s, system: 4.40 s, elapsed: 58.81 s.
VACUUM

Note: VACUUM FULL requires an ACCESS EXCLUSIVE lock and rewrites the entire table, making it unsuitable for production during business hours.

2.2 Configuring Autovacuum

Aurora PostgreSQL has autovacuum enabled by default, but tuning these parameters is critical for large, frequently-updated tables:

-- Check current autovacuum settings
SHOW autovacuum_vacuum_threshold;
SHOW autovacuum_vacuum_scale_factor;
SHOW autovacuum_vacuum_cost_delay;
SHOW autovacuum_vacuum_cost_limit;

-- Custom autovacuum settings for our table
ALTER TABLE user_activities SET (
    autovacuum_vacuum_threshold = 5000,
    autovacuum_vacuum_scale_factor = 0.05,  -- More aggressive (default 0.2)
    autovacuum_vacuum_cost_delay = 10,      -- Faster vacuum (default 20)
    autovacuum_analyze_threshold = 2500,
    autovacuum_analyze_scale_factor = 0.05
);

-- For extremely busy tables
ALTER TABLE user_activities SET (
    autovacuum_vacuum_threshold = 1000,
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_cost_delay = 2,
    autovacuum_vacuum_cost_limit = 2000,    -- Higher I/O limit
    autovacuum_naptime = 10                 -- Check more frequently
);

Example Output:

 autovacuum_vacuum_threshold 
-----------------------------
 50
(1 row)

 autovacuum_vacuum_scale_factor 
--------------------------------
 0.2
(1 row)

 autovacuum_vacuum_cost_delay 
------------------------------
 20
(1 row)

 autovacuum_vacuum_cost_limit 
------------------------------
 200
(1 row)

ALTER TABLE
ALTER TABLE

These are table-level storage parameters, not server-level GUC (Grand Unified Configuration) parameters, so they take effect immediately; no database or server restart is required.

Server-Level vs Table-Level

These specific parameters are being set at the table level using ALTER TABLE ... SET, which means they only affect the user_activities table.

The same parameters also exist as server-level GUCs (with identical names):

  • autovacuum_vacuum_threshold (server-level GUC exists)
  • autovacuum_vacuum_scale_factor (server-level GUC exists)
  • autovacuum_vacuum_cost_delay (server-level GUC exists)
  • autovacuum_analyze_threshold (server-level GUC exists)
  • autovacuum_analyze_scale_factor (server-level GUC exists)

When set at the server level in postgresql.conf, those would require a reload (pg_ctl reload or SELECT pg_reload_conf()), but not a full restart.

Table-Level Overrides in Practice

The ALTER TABLE statements above override the server-level defaults for the user_activities table only, making autovacuum more aggressive where the churn actually happens. This is a common approach for high-churn tables, and the change applies instantly.
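To confirm which values a table will actually use, you can compare its stored per-table overrides with the server-level defaults; a quick sketch (user_activities is simply the example table used throughout):

-- Per-table overrides are stored in pg_class.reloptions
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'user_activities';

-- Server-level defaults that apply wherever no override exists
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'autovacuum%';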

2.3 The pg_repack Extension

pg_repack is a game-changer for managing large tables with bloat. While VACUUM FULL requires a long-duration exclusive lock that blocks all operations, pg_repack uses an innovative approach that allows the table to remain online and accessible throughout most of the operation.

Understanding pg_repack’s Architecture

pg_repack works fundamentally differently from traditional vacuum operations. Here’s what makes it special:

The Problem with VACUUM FULL:

  • Acquires an ACCESS EXCLUSIVE lock for the entire operation
  • Blocks all reads and writes
  • For a 100GB table, this could mean hours of downtime
  • Single-threaded operation

How pg_repack Solves This:

pg_repack employs a clever multi-stage approach:

  1. Log Table Creation: Creates a temporary log table to capture changes made during the rebuild
  2. Online Rebuild: Builds a new, defragmented copy of your table while the original remains fully operational
  3. Change Capture: Records all INSERT, UPDATE, and DELETE operations in the log table
  4. Change Replay: Applies the logged changes to the new table
  5. Atomic Swap: Takes a brief exclusive lock (typically < 1 second) to swap the old and new tables
  6. Index Rebuild: Rebuilds indexes concurrently on the new table

Key Benefits:

  • Minimal Locking: Only a brief lock during the table swap
  • Online Operation: Applications continue running normally
  • Better Efficiency: Rewrites data in optimal order, improving subsequent query performance
  • Parallel Processing: Can use multiple workers for faster completion
  • Transaction Safety: All changes are captured and replayed, ensuring data consistency

2.4 Installing pg_repack on Aurora

Setting up pg_repack is straightforward on Aurora PostgreSQL:

-- Check available extensions
SELECT * FROM pg_available_extensions WHERE name = 'pg_repack';

-- Install pg_repack (requires rds_superuser role)
CREATE EXTENSION pg_repack;

-- Verify installation
\dx pg_repack

Example Output:

   name    | default_version | installed_version |                         comment                          
-----------+-----------------+-------------------+----------------------------------------------------------
 pg_repack | 1.4.8           |                   | Reorganize tables in PostgreSQL databases with minimal locks
(1 row)

CREATE EXTENSION

                                    List of installed extensions
   Name    | Version |   Schema   |                         Description                          
-----------+---------+------------+--------------------------------------------------------------
 pg_repack | 1.4.8   | public     | Reorganize tables in PostgreSQL databases with minimal locks
(1 row)

2.5 How pg_repack Works (Technical Deep Dive)

Let’s break down the pg_repack process with more detail:

Phase 1: Setup (seconds)

  • Creates schema repack for temporary objects
  • Creates a log table repack.log_XXXXX with triggers
  • Installs triggers on source table to capture changes
  • Takes a snapshot of current transaction ID

Phase 2: Initial Copy (majority of time)

  • Copies all data from original table to repack.table_XXXXX
  • Sorts data optimally (by primary key or specified order)
  • Meanwhile, all changes are captured in the log table
  • No locks on the original table during this phase

Phase 3: Delta Application (proportional to changes)

  • Reads the log table
  • Applies INSERT/UPDATE/DELETE operations to new table
  • May iterate if many changes occurred during Phase 2

Phase 4: Final Swap (< 1 second typically)

  • Acquires ACCESS EXCLUSIVE lock
  • Applies any final logged changes
  • Swaps the table definitions atomically
  • Releases lock
  • Drops old table and log table

Phase 5: Index Rebuild (concurrent)

  • Rebuilds all indexes on new table
  • Uses CREATE INDEX CONCURRENTLY to avoid blocking

2.6 Basic pg_repack Usage

From the command line (requires appropriate IAM/credentials for Aurora):

# Basic repack (use -k / --no-superuser-check on RDS and Aurora, where
# rds_superuser is not a true superuser)
pg_repack -h your-aurora-cluster.region.rds.amazonaws.com \
          -U your_username \
          -d your_database \
          -t user_activities \
          -k

# With specific options
pg_repack -h your-aurora-cluster.region.rds.amazonaws.com \
          -U your_username \
          -d your_database \
          -t user_activities \
          --no-order \
          --no-kill-backend \
          -j 4  # Use 4 parallel workers

Example Output:

INFO: repacking table "public.user_activities"
INFO: disabling triggers
INFO: creating temporary table
INFO: copying rows
INFO: 5000000 rows copied
INFO: creating indexes
INFO: creating index "user_activities_pkey"
INFO: creating index "idx_user_activities_user_id"
INFO: creating index "idx_user_activities_status"
INFO: creating index "idx_user_activities_created_at"
INFO: creating index "idx_user_activities_processed_at"
INFO: swapping tables
INFO: applying log
INFO: 12847 log rows applied
INFO: enabling triggers
INFO: dropping old table
INFO: Repacked user_activities (3.2GB -> 2.7GB), 15.6% space reclaimed
NOTICE: TABLE "public.user_activities" repacked successfully

2.7 Advanced pg_repack with SQL Interface

You can also trigger pg_repack from within PostgreSQL:

-- Repack a specific table
SELECT repack.repack_table('public.user_activities');

-- Repack with options
SELECT repack.repack_table(
    'public.user_activities',
    'REINDEX'  -- Rebuild indexes too
);

-- Check pg_repack progress (run in another session)
SELECT
    schemaname,
    relname AS tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) AS size,
    n_tup_ins,
    n_tup_upd,
    n_tup_del
FROM pg_stat_user_tables
WHERE schemaname = 'repack';

Example Output:

 repack_table 
--------------
 t
(1 row)

 schemaname |         tablename          |  size   | n_tup_ins | n_tup_upd | n_tup_del 
------------+----------------------------+---------+-----------+-----------+-----------
 repack     | table_12847                | 2689 MB |   5000000 |         0 |         0
 repack     | log_12847                  | 145 MB  |     12847 |         0 |         0
(2 rows)

2.8 Monitoring pg_repack Progress

Real-time monitoring helps you understand how long the operation will take:

-- Create a monitoring function
CREATE OR REPLACE FUNCTION monitor_repack()
RETURNS TABLE (
    table_name TEXT,
    phase TEXT,
    elapsed_time INTERVAL,
    table_size TEXT,
    estimated_remaining TEXT
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        (t.schemaname || '.' || t.relname)::TEXT,
        CASE
            WHEN t.relname LIKE 'table_%' THEN 'Building new table'
            WHEN t.relname LIKE 'log_%' THEN 'Logging changes'
            ELSE 'Processing'
        END AS phase,
        NOW() - ps.query_start AS elapsed,
        pg_size_pretty(pg_total_relation_size(t.schemaname||'.'||t.relname)),
        '~' || ROUND(EXTRACT(EPOCH FROM (NOW() - ps.query_start)) * 1.5 / 60) || ' min' AS est_remaining
    FROM pg_stat_user_tables t
    LEFT JOIN pg_stat_activity ps ON ps.query LIKE '%repack%'
    WHERE t.schemaname = 'repack';
END;
$$ LANGUAGE plpgsql;

-- Monitor during repack
SELECT * FROM monitor_repack();

Example Output:

CREATE FUNCTION

       table_name       |       phase        | elapsed_time  | table_size | estimated_remaining 
------------------------+--------------------+---------------+------------+---------------------
 repack.table_12847     | Building new table | 00:08:23.457  | 2689 MB    | ~13 min
 repack.log_12847       | Logging changes    | 00:08:23.457  | 145 MB     | ~13 min
(2 rows)

3.0 Off-Hours Maintenance Script

This script is designed to run during low-traffic periods and automatically selects the best vacuum strategy based on bloat levels. Note that VACUUM and REINDEX ... CONCURRENTLY cannot run inside a transaction block (and a DO block always runs in one), so treat the block below as a template for the decision logic; a psql-based variant that can execute those commands directly is sketched after the example output:

-- ============================================
-- OFF-HOURS TABLE MAINTENANCE SCRIPT
-- Run during maintenance windows
-- ============================================

DO $$
DECLARE
    v_start_time TIMESTAMP;
    v_table_size BIGINT;
    v_dead_tuples BIGINT;
    v_bloat_percent NUMERIC;
    v_action TEXT;
    v_repack_available BOOLEAN;
BEGIN
    v_start_time := NOW();
    
    RAISE NOTICE '========================================';
    RAISE NOTICE 'Starting maintenance at %', v_start_time;
    RAISE NOTICE '========================================';
    
    -- Check if pg_repack is available
    SELECT EXISTS (
        SELECT 1 FROM pg_extension WHERE extname = 'pg_repack'
    ) INTO v_repack_available;
    
    -- Gather current statistics
    SELECT
        pg_relation_size('user_activities'),
        n_dead_tup,
        ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2)
    INTO v_table_size, v_dead_tuples, v_bloat_percent
    FROM pg_stat_user_tables
    WHERE relname = 'user_activities';
    
    RAISE NOTICE 'Current table size: %', pg_size_pretty(v_table_size);
    RAISE NOTICE 'Dead tuples: %', v_dead_tuples;
    RAISE NOTICE 'Bloat percentage: %', v_bloat_percent;
    
    -- Decide on action based on bloat level
    IF v_bloat_percent > 50 THEN
        v_action := 'pg_repack (high bloat)';
        
        IF v_repack_available THEN
            RAISE NOTICE 'Bloat > 50%%: Executing pg_repack...';
            PERFORM repack.repack_table('public.user_activities');
            RAISE NOTICE 'pg_repack completed';
        ELSE
            RAISE NOTICE 'pg_repack not available, running VACUUM FULL...';
            RAISE NOTICE 'WARNING: This will lock the table!';
            EXECUTE 'VACUUM FULL ANALYZE user_activities';
        END IF;
        
    ELSIF v_bloat_percent > 20 THEN
        v_action := 'aggressive_vacuum';
        RAISE NOTICE 'Bloat 20-50%%: Running aggressive VACUUM...';
        EXECUTE 'VACUUM (VERBOSE, ANALYZE, FREEZE) user_activities';
        
    ELSIF v_bloat_percent > 10 THEN
        v_action := 'standard_vacuum';
        RAISE NOTICE 'Bloat 10-20%%: Running standard VACUUM...';
        EXECUTE 'VACUUM ANALYZE user_activities';
        
    ELSE
        v_action := 'analyze_only';
        RAISE NOTICE 'Bloat < 10%%: Running ANALYZE only...';
        EXECUTE 'ANALYZE user_activities';
    END IF;
    
    -- Rebuild indexes if needed
    RAISE NOTICE 'Checking index health...';
    
    -- Reindex if bloated
    IF v_bloat_percent > 30 THEN
        RAISE NOTICE 'Rebuilding indexes concurrently...';
        EXECUTE 'REINDEX INDEX CONCURRENTLY idx_user_activities_user_id';
        EXECUTE 'REINDEX INDEX CONCURRENTLY idx_user_activities_status';
        EXECUTE 'REINDEX INDEX CONCURRENTLY idx_user_activities_created_at';
        EXECUTE 'REINDEX INDEX CONCURRENTLY idx_user_activities_processed_at';
    END IF;
    
    -- Final statistics
    SELECT
        pg_relation_size('user_activities'),
        n_dead_tup,
        ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2)
    INTO v_table_size, v_dead_tuples, v_bloat_percent
    FROM pg_stat_user_tables
    WHERE relname = 'user_activities';
    
    RAISE NOTICE '========================================';
    RAISE NOTICE 'Maintenance completed in %', NOW() - v_start_time;
    RAISE NOTICE 'Action taken: %', v_action;
    RAISE NOTICE 'Final table size: %', pg_size_pretty(v_table_size);
    RAISE NOTICE 'Final dead tuples: %', v_dead_tuples;
    RAISE NOTICE 'Final bloat percentage: %', v_bloat_percent;
    RAISE NOTICE '========================================';
    
END $$;

Example Output:

NOTICE:  ========================================
NOTICE:  Starting maintenance at 2024-11-17 02:00:00.123456
NOTICE:  ========================================
NOTICE:  Current table size: 3214 MB
NOTICE:  Dead tuples: 847523
NOTICE:  Bloat percentage: 14.51
NOTICE:  Bloat 10-20%: Running standard VACUUM...
INFO:  vacuuming "public.user_activities"
INFO:  "user_activities": removed 847523 row versions in 112456 pages
INFO:  "user_activities": found 847523 removable, 5000000 nonremovable row versions
NOTICE:  Checking index health...
NOTICE:  ========================================
NOTICE:  Maintenance completed in 00:01:23.847293
NOTICE:  Action taken: standard_vacuum
NOTICE:  Final table size: 2987 MB
NOTICE:  Final dead tuples: 0
NOTICE:  Final bloat percentage: 0.00
NOTICE:  ========================================
DO
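
Because VACUUM, VACUUM FULL, and REINDEX ... CONCURRENTLY refuse to run inside a transaction block, a practical alternative is to let SQL only choose the command and have psql, which runs in autocommit mode, execute it with \gexec. A minimal sketch of the vacuum-level decision logic (the pg_repack branch would still be driven from the shell; the file name and thresholds mirror the example above and are only illustrative):

-- maintenance_driver.sql: emit the appropriate command, then let psql run it
SELECT CASE
           WHEN 100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0) > 20
               THEN 'VACUUM (VERBOSE, ANALYZE, FREEZE) user_activities'
           WHEN 100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0) > 10
               THEN 'VACUUM ANALYZE user_activities'
           ELSE 'ANALYZE user_activities'
       END AS maintenance_command
FROM pg_stat_user_tables
WHERE relname = 'user_activities'
\gexec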

3.1 Scheduling the Maintenance Script

For Aurora PostgreSQL, you can use AWS EventBridge with Lambda to schedule this:

# Lambda function to execute maintenance
import os
import psycopg2

def lambda_handler(event, context):
    conn = psycopg2.connect(
        host=os.environ['DB_HOST'],
        database=os.environ['DB_NAME'],
        user=os.environ['DB_USER'],
        password=os.environ['DB_PASSWORD']
    )
    # VACUUM cannot run inside a transaction block, so use autocommit mode
    conn.autocommit = True
    
    with conn.cursor() as cur:
        # Read and execute the maintenance script packaged with the function
        with open('maintenance_script.sql', 'r') as f:
            cur.execute(f.read())
    
    conn.close()
    return {'statusCode': 200, 'body': 'Maintenance completed'}

Or use a cron job on an EC2 instance:

# Add to crontab for 2 AM daily maintenance
0 2 * * * psql -h your-aurora-cluster.region.rds.amazonaws.com \
               -U your_user \
               -d your_db \
               -f /path/to/maintenance_script.sql \
               >> /var/log/postgres_maintenance.log 2>&1

4.0 Memory Configuration for Vacuum Operations

While tuning autovacuum thresholds and cost-based settings is crucial, proper memory allocation can dramatically improve vacuum performance, especially for large tables. Two key parameters control how much memory vacuum operations can use.

Understanding Vacuum Memory Parameters

maintenance_work_mem: This parameter controls the maximum amount of memory used by maintenance operations, including VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. The default is typically 64MB, which is often insufficient for large tables.

-- Check current setting
SHOW maintenance_work_mem;

-- Set globally (requires reload)
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();

-- Or set per session for manual vacuum
SET maintenance_work_mem = '4GB';
VACUUM VERBOSE user_activities;

autovacuum_work_mem: Introduced in PostgreSQL 9.4, this parameter specifically controls memory for autovacuum workers. If set to -1 (default), it falls back to maintenance_work_mem. Setting this separately allows you to allocate different memory limits for automatic vs. manual vacuum operations.

-- Check current setting
SHOW autovacuum_work_mem;

-- Set globally (requires reload)
ALTER SYSTEM SET autovacuum_work_mem = '1GB';
SELECT pg_reload_conf();

4.1 How Memory Affects Vacuum Performance

During vacuum, PostgreSQL maintains an array of dead tuple identifiers (TIDs) in memory. When this array fills up, vacuum must:

  1. Stop scanning the table
  2. Scan and clean all indexes
  3. Remove the dead tuples from the heap
  4. Continue scanning for more dead tuples

This process repeats until the entire table is processed. More memory means:

  • Fewer index scan passes (expensive operation)
  • Better vacuum throughput
  • Reduced overall vacuum time

4.2 Memory Sizing Guidelines

Calculate required memory: Each dead tuple requires 6 bytes of memory. For a table with many dead tuples:

required_memory = (dead_tuples × 6 bytes) + overhead
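
For example, with the default 64MB setting a vacuum can track roughly 64 MB / 6 bytes ≈ 11 million dead tuples at a time, so cleaning up 100 million dead tuples (about 572 MB of tuple identifiers) would force roughly nine separate index-cleanup passes.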

Best practices:

  • Small instances: Set maintenance_work_mem to 256MB-512MB
  • Medium instances: 1GB-2GB for maintenance_work_mem, 512MB-1GB for autovacuum_work_mem
  • Large instances: 4GB-8GB for maintenance_work_mem, 1GB-2GB per autovacuum worker
  • Critical consideration: Remember autovacuum_work_mem is allocated per worker, so with autovacuum_max_workers = 5 and autovacuum_work_mem = 2GB, you could use up to 10GB total

4.3 Aurora-Specific Considerations

For Amazon Aurora PostgreSQL:

  • Aurora uses shared storage, so vacuum doesn’t rewrite data to new storage
  • Memory settings still impact performance of index cleaning phases
  • Monitor using CloudWatch metric FreeableMemory to ensure you’re not over-allocating
  • Consider Aurora’s instance size when setting these parameters
-- Conservative Aurora settings for db.r5.2xlarge (64GB RAM)
ALTER SYSTEM SET maintenance_work_mem = '2GB';
ALTER SYSTEM SET autovacuum_work_mem = '1GB';
ALTER SYSTEM SET autovacuum_max_workers = 3;
SELECT pg_reload_conf();

4.4 Monitoring Memory Usage

Check if vacuum operations are hitting memory limits:

-- Check for multiple index scan passes (indicates insufficient memory)
SELECT 
    schemaname,
    relname,
    last_vacuum,
    n_dead_tup,
    round(pg_table_size(schemaname||'.'||relname)::numeric / (1024*1024*1024), 2) as table_size_gb,
    round((n_dead_tup * 6)::numeric / (1024*1024), 2) as min_required_mem_mb
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;

When running manual vacuum with VERBOSE, watch for messages like:

INFO:  index "user_activities_pkey" now contains 1000000 row versions in 2745 pages
INFO:  "user_activities": removed 500000 row versions in 12500 pages

If you see multiple cycles of “removed X row versions”, your maintenance_work_mem may be too small.
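
While a vacuum is actually running, pg_stat_progress_vacuum shows how many index-cleanup passes it has performed so far, which is a direct signal that the dead-tuple array did not fit in memory (column names shown are those of PostgreSQL 13-16):

-- index_vacuum_count > 1 for a single run means the TID array filled up
SELECT p.pid,
       c.relname,
       p.phase,
       p.index_vacuum_count,
       p.num_dead_tuples,
       p.max_dead_tuples
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;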

4.5 Tuning Workflow

  1. Assess current state: Run the estimation script below to calculate memory requirements
  2. Set conservative values: Start with moderate memory allocations
  3. Monitor performance: Watch vacuum duration and CloudWatch metrics
  4. Iterate: Gradually increase memory if vacuum is still slow and memory is available
  5. Balance resources: Ensure vacuum memory doesn’t starve your application connections

4.6 SQL Script: Estimate Required Vacuum Memory

This script analyzes your largest tables and estimates the optimal maintenance_work_mem and autovacuum_work_mem settings:

-- Vacuum Memory Requirement Estimator
-- This script analyzes table sizes and estimates memory needed for efficient vacuum operations

-- Materialize the per-table calculations in a temporary view so that the
-- report and the summary queries further down can all reference them
CREATE OR REPLACE TEMP VIEW memory_calculations AS
WITH table_stats AS (
    SELECT
        schemaname,
        relname AS tablename,
        pg_total_relation_size(schemaname||'.'||relname) as total_size_bytes,
        pg_table_size(schemaname||'.'||relname) as table_size_bytes,
        pg_indexes_size(schemaname||'.'||relname) as indexes_size_bytes,
        n_live_tup,
        n_dead_tup,
        last_vacuum,
        last_autovacuum,
        -- Estimate potential dead tuples based on table size and typical churn
        GREATEST(n_dead_tup, n_live_tup * 0.20) as estimated_max_dead_tup
    FROM pg_stat_user_tables
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
),
memory_calculations AS (
    SELECT
        schemaname,
        tablename,
        total_size_bytes,
        -- Size formatting
        pg_size_pretty(total_size_bytes) as total_size,
        pg_size_pretty(table_size_bytes) as table_size,
        pg_size_pretty(indexes_size_bytes) as indexes_size,
        -- Tuple counts
        n_live_tup,
        n_dead_tup,
        estimated_max_dead_tup::bigint,
        -- Memory calculations (6 bytes per dead tuple TID)
        round((estimated_max_dead_tup * 6) / (1024.0 * 1024.0), 2) as min_memory_mb,
        round((estimated_max_dead_tup * 6 * 1.2) / (1024.0 * 1024.0), 2) as recommended_memory_mb,
        -- Vacuum history
        last_vacuum,
        last_autovacuum,
        -- Number of index scan passes with current maintenance_work_mem
        CASE 
            WHEN estimated_max_dead_tup = 0 THEN 0
            ELSE CEIL(
                (estimated_max_dead_tup * 6.0) / 
                (SELECT setting::bigint * 1024 FROM pg_settings WHERE name = 'maintenance_work_mem')
            )::integer
        END as estimated_index_scans
    FROM table_stats
)
SELECT * FROM memory_calculations;

-- Report: current configuration plus the largest tables' memory requirements
WITH system_config AS (
    SELECT
        name,
        setting,
        unit,
        CASE 
            WHEN unit = 'kB' THEN (setting::bigint / 1024)::text || ' MB'
            WHEN unit = 'MB' THEN setting || ' MB'
            WHEN unit = 'GB' THEN setting || ' GB'
            ELSE setting || COALESCE(' ' || unit, '')
        END as formatted_value
    FROM pg_settings
    WHERE name IN ('maintenance_work_mem', 'autovacuum_work_mem', 'autovacuum_max_workers')
)
SELECT
    '=== CURRENT MEMORY CONFIGURATION ===' as info,
    NULL::text as schemaname,
    NULL::text as tablename,
    NULL::text as total_size,
    NULL::bigint as n_live_tup,
    NULL::bigint as n_dead_tup,
    NULL::bigint as estimated_max_dead_tup,
    NULL::numeric as min_memory_mb,
    NULL::numeric as recommended_memory_mb,
    NULL::integer as estimated_index_scans,
    NULL::timestamp as last_vacuum,
    NULL::timestamp as last_autovacuum
UNION ALL
SELECT
    name || ': ' || formatted_value,
    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL
FROM system_config
UNION ALL
SELECT
    '=== TOP TABLES BY SIZE (Memory Requirements) ===' as info,
    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL
UNION ALL
(
SELECT
    CASE 
        WHEN recommended_memory_mb > 1024 THEN 'Warn:  ' || tablename
        ELSE tablename
    END as info,
    schemaname,
    tablename,
    total_size,
    n_live_tup,
    n_dead_tup,
    estimated_max_dead_tup,
    min_memory_mb,
    recommended_memory_mb,
    estimated_index_scans,
    last_vacuum,
    last_autovacuum
FROM memory_calculations
WHERE total_size_bytes > 1048576  -- Only tables > 1MB
ORDER BY total_size_bytes DESC
LIMIT 25
);

-- Summary recommendations
SELECT
    '=== RECOMMENDATIONS ===' as section,
    '' as recommendation;

SELECT
    'Based on largest tables:' as section,
    'Suggested maintenance_work_mem: ' || 
    CASE
        WHEN MAX(recommended_memory_mb) < 256 THEN '256 MB (small DB)'
        WHEN MAX(recommended_memory_mb) < 1024 THEN CEIL(MAX(recommended_memory_mb) / 256) * 256 || ' MB'
        WHEN MAX(recommended_memory_mb) < 4096 THEN CEIL(MAX(recommended_memory_mb) / 512) * 512 || ' MB'
        ELSE LEAST(8192, CEIL(MAX(recommended_memory_mb) / 1024) * 1024) || ' MB (capped at 8GB)'
    END as recommendation
FROM memory_calculations;

SELECT
    'For autovacuum workers:' as section,
    'Suggested autovacuum_work_mem: ' || 
    CASE
        WHEN MAX(recommended_memory_mb) < 512 THEN '256 MB'
        WHEN MAX(recommended_memory_mb) < 2048 THEN '512 MB to 1 GB'
        ELSE '1 GB to 2 GB per worker'
    END || 
    ' (remember: allocated per worker!)' as recommendation
FROM memory_calculations;

SELECT
    'Tables requiring attention:' as section,
    COUNT(*)::text || ' tables need more than 1GB for optimal vacuum' as recommendation
FROM memory_calculations
WHERE recommended_memory_mb > 1024;

SELECT
    'Memory efficiency:' as section,
    CASE
        WHEN COUNT(*) = 0 THEN 'All tables can vacuum efficiently with current settings'
        ELSE COUNT(*)::text || ' tables will require multiple index scans with current settings'
    END as recommendation
FROM memory_calculations
WHERE estimated_index_scans > 1;

4.7 Using the Estimation Script

  1. Run the script against your database to see current configuration and requirements
  2. Review the output focusing on:
    • Tables with estimated_index_scans > 1 (need more memory)
    • recommended_memory_mb for your largest tables
    • Tables marked with “Warn: ” (require > 1GB memory)
  3. Apply recommendations using the summary output
  4. Monitor vacuum performance after changes

Example output interpretation:

tablename              | recommended_memory_mb | estimated_index_scans
-----------------------|----------------------|----------------------
user_activities        | 1843.20              | 2
orders                 | 512.45               | 1
products               | 128.30               | 1

This indicates user_activities needs ~1.8GB for single-pass vacuum. If maintenance_work_mem = 1GB, vacuum will scan indexes twice, which is inefficient.

Pro tip: For tables that require excessive memory (>4GB), consider using pg_repack instead of relying solely on vacuum, or vacuum during maintenance windows with temporarily increased maintenance_work_mem.

5.0 Parallel Vacuuming

Starting with PostgreSQL 13, vacuum operations can leverage multiple CPU cores through parallel processing, dramatically reducing vacuum time for large tables with multiple indexes. This feature is particularly valuable for Aurora PostgreSQL environments where large tables can take hours to vacuum serially.

5.1 How Parallel Vacuum Works

Parallel vacuum speeds up the index cleanup phase—often the most time-consuming part of the vacuum process. When enabled:

  1. The leader process scans the table heap and collects dead tuple identifiers
  2. Multiple parallel workers simultaneously clean indexes
  3. The leader process removes dead tuples from the heap
  4. The cycle repeats until the table is fully vacuumed

Key point: Only the index cleanup phase is parallelized. Table scanning and heap cleanup remain single-threaded, but since index cleanup often dominates vacuum time (especially for tables with many indexes), the speedup can be substantial.

5.2 Enabling Parallel Vacuum

PostgreSQL automatically uses parallel vacuum when:

  • The table has at least 2 indexes
  • min_parallel_index_scan_size threshold is met (default 512KB per index)
  • Sufficient parallel workers are available

Configuration Parameters

-- Check current settings
SHOW max_parallel_maintenance_workers;  -- Default: 2
SHOW min_parallel_index_scan_size;       -- Default: 512kB
SHOW max_parallel_workers;                -- Overall parallel worker limit
-- Adjust for better parallelism (requires reload)
ALTER SYSTEM SET max_parallel_maintenance_workers = 4;
ALTER SYSTEM SET min_parallel_index_scan_size = '256kB';
SELECT pg_reload_conf();

Parameter descriptions:

  • max_parallel_maintenance_workers: Maximum workers for maintenance operations (VACUUM, CREATE INDEX). Limited by max_parallel_workers
  • min_parallel_index_scan_size: Minimum index size to consider for parallel processing
  • max_parallel_workers: System-wide limit for all parallel operations

5.3 Per-Table Parallel Configuration

For specific large tables, you can control parallel vacuum behavior:

-- Enable parallel vacuum with specific worker count
ALTER TABLE user_activities SET (parallel_workers = 4);
-- Disable parallel vacuum for a specific table
ALTER TABLE sensitive_table SET (parallel_workers = 0);
-- Check table-level settings
SELECT 
    schemaname,
    tablename,
    reloptions
FROM pg_tables
WHERE tablename = 'user_activities';

5.6 Manual Vacuum with Parallel Workers

When running manual vacuum, you can specify the degree of parallelism:

-- Vacuum with explicit parallel workers
VACUUM (PARALLEL 4, VERBOSE) user_activities;
-- Vacuum with parallel disabled
VACUUM (PARALLEL 0, VERBOSE) user_activities;
-- Let PostgreSQL decide (based on table and system settings)
VACUUM (VERBOSE) user_activities;

5.7 Monitoring Parallel Vacuum

Check if vacuum is using parallel workers:

-- View active vacuum operations and their parallel workers
SELECT 
    pid,
    datname,
    usename,
    query_start,
    state,
    wait_event_type,
    wait_event,
    query
FROM pg_stat_activity
WHERE query LIKE '%VACUUM%'
   OR backend_type LIKE '%parallel worker%'
ORDER BY query_start;

Watch for “parallel worker” processes that accompany the main vacuum process.
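
You can also confirm parallelism in the VERBOSE output of a manual vacuum; on PostgreSQL 13 and later the log typically includes a line like the following (exact wording varies slightly between versions):

INFO:  vacuuming "public.user_activities"
INFO:  launched 3 parallel vacuum workers for index vacuuming (planned: 3)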

5.8 Performance Testing

Compare vacuum performance with and without parallelism:

-- Create test table with multiple indexes
CREATE TABLE vacuum_test AS 
SELECT * FROM user_activities LIMIT 1000000;
CREATE INDEX idx1 ON vacuum_test(user_id);
CREATE INDEX idx2 ON vacuum_test(activity_type);
CREATE INDEX idx3 ON vacuum_test(created_at);
CREATE INDEX idx4 ON vacuum_test(session_id);
-- Generate some dead tuples
UPDATE vacuum_test SET activity_type = 'modified' WHERE random() < 0.2;
-- Test serial vacuum
\timing on
VACUUM (PARALLEL 0, VERBOSE) vacuum_test;
-- Note the time
-- Test parallel vacuum
VACUUM (PARALLEL 4, VERBOSE) vacuum_test;
-- Note the time and compare

5.9 Aurora-Specific Considerations

When using parallel vacuum with Aurora PostgreSQL:

Instance sizing: Ensure your instance has sufficient vCPUs for parallel operations

  • db.r5.large (2 vCPUs): max_parallel_maintenance_workers = 2
  • db.r5.xlarge (4 vCPUs): max_parallel_maintenance_workers = 2-3
  • db.r5.2xlarge (8 vCPUs): max_parallel_maintenance_workers = 4
  • db.r5.4xlarge+ (16+ vCPUs): max_parallel_maintenance_workers = 4-6

Memory considerations: Unlike parallel query, parallel maintenance commands treat maintenance_work_mem as a limit for the entire command rather than per worker, so enabling parallel vacuum should not multiply memory usage. The multiplier to watch is autovacuum: each autovacuum worker process gets its own autovacuum_work_mem (or maintenance_work_mem) allocation, so autovacuum_max_workers = 5 with a 2GB setting can consume up to 10GB.

-- Conservative Aurora parallel vacuum configuration
-- For db.r5.2xlarge (8 vCPU, 64GB RAM)
ALTER SYSTEM SET max_parallel_maintenance_workers = 4;
ALTER SYSTEM SET maintenance_work_mem = '1GB';  -- Limit for the whole vacuum command
ALTER SYSTEM SET max_worker_processes = 16;     -- Ensure worker pool is sufficient
SELECT pg_reload_conf();

Reader endpoint impact: Parallel vacuum on the writer can increase replication lag to reader endpoints. Monitor the AuroraReplicaLag CloudWatch metric during parallel vacuum operations.

5.10 When Parallel Vacuum Helps Most

Parallel vacuum provides the biggest benefit when:

  • Tables have 4+ indexes – more indexes means more parallelizable work
  • Indexes are large (>1GB each) – comfortably above the min_parallel_index_scan_size threshold
  • Sufficient CPU cores are available – workers won’t compete with application queries
  • I/O isn’t the bottleneck – Aurora’s storage architecture handles concurrent I/O well

Parallel vacuum helps less when:

  • Tables have only 1-2 small indexes – limited parallelizable work
  • CPU is already saturated – parallel workers compete with the application
  • Traffic is at its peak – better to run with fewer workers to avoid contention

5.12 Autovacuum and Parallelism

An important caveat: parallel index vacuuming is used only by manual VACUUM commands. Autovacuum workers always process indexes serially, even on PostgreSQL 13+. For very large, index-heavy tables, the practical pattern is therefore to keep autovacuum aggressive for day-to-day cleanup and schedule manual parallel vacuums for the heavy lifting:

-- Keep autovacuum aggressive for routine cleanup
ALTER TABLE user_activities SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold = 5000
);

-- Run the heavy, parallel cleanup manually during maintenance windows
VACUUM (PARALLEL 4, ANALYZE) user_activities;

Also remember that autovacuum workers (autovacuum_max_workers) and any concurrent manual parallel vacuums all draw from the same max_worker_processes pool, so size that pool accordingly.

Recommendation: Start with PARALLEL 2 for manual vacuums of critical tables, monitor resource usage, then adjust.

5.13 Practical Example: Optimizing a Large Table

-- Scenario: 100GB table with 6 indexes taking 2 hours to vacuum
-- Step 1: Check current configuration
SHOW max_parallel_maintenance_workers;  -- Returns 2
-- Step 2: Analyze the table
SELECT 
    schemaname,
    relname AS tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) as total_size,
    pg_size_pretty(pg_indexes_size(schemaname||'.'||relname)) as indexes_size,
    (SELECT count(*) FROM pg_indexes WHERE tablename = 'user_activities') as num_indexes
FROM pg_stat_user_tables
WHERE relname = 'user_activities';
-- Result: 100GB total, 45GB indexes, 6 indexes
-- Step 3: Enable parallel vacuum
ALTER TABLE user_activities SET (parallel_workers = 4);
-- Step 4: Increase maintenance workers (if needed)
ALTER SYSTEM SET max_parallel_maintenance_workers = 4;
SELECT pg_reload_conf();
-- Step 5: Run vacuum with timing
\timing on
VACUUM (VERBOSE) user_activities;
-- Expected result: Vacuum time reduced from 2 hours to 45-60 minutes

5.14 Troubleshooting Parallel Vacuum

Problem: Vacuum not using parallel workers

-- Check if indexes meet size threshold
SELECT 
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE relname = 'user_activities'
ORDER BY pg_relation_size(indexrelid) DESC;
-- If indexes < 512KB, lower the threshold
ALTER SYSTEM SET min_parallel_index_scan_size = '128kB';
SELECT pg_reload_conf();

Problem: Running out of worker processes

-- Check worker process limits
SHOW max_worker_processes;  -- Total worker pool
SHOW max_parallel_workers;  -- Max parallel workers allowed
-- Increase if needed
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 8;
-- Requires restart for max_worker_processes

Problem: High memory usage during parallel vacuum

-- maintenance_work_mem caps the whole vacuum command (not each worker),
-- so simply lower it on memory-constrained instances
SET maintenance_work_mem = '512MB';
VACUUM (PARALLEL 4) user_activities;

5.15 Best Practices For Parallelisation

  1. Baseline first: Measure vacuum time before enabling parallel processing
  2. Match CPU availability: Set max_parallel_maintenance_workers based on vCPUs and workload
  3. Consider memory: maintenance_work_mem caps the whole vacuum command, while autovacuum_work_mem is allocated per autovacuum worker
  4. Start conservative: Begin with 2-3 workers, increase based on results
  5. Monitor during peak: Watch CPU and memory metrics when parallel vacuum runs
  6. Test index threshold: Lower min_parallel_index_scan_size if indexes are smaller
  7. Schedule strategically: Use parallel vacuum during maintenance windows for predictable performance
  8. Aurora readers: Monitor replication lag impact on read replicas

Parallel vacuum is a powerful tool for managing large tables in Aurora PostgreSQL, but it requires careful configuration to balance vacuum speed against resource consumption. When properly tuned, it can reduce vacuum time by 50-70% for index-heavy tables.

6.0 Optimizing Aurora PostgreSQL Vacuums with TOAST Table Parameters

What is TOAST?

TOAST (The Oversized-Attribute Storage Technique) is PostgreSQL’s mechanism for handling data that exceeds the standard 8KB page size limit. When you store large text fields, JSON documents, bytea columns, or other substantial data types, PostgreSQL automatically moves this data out of the main table into a separate TOAST table. This keeps the main table pages compact and efficient for scanning, while the oversized data is stored separately and retrieved only when needed.

Every table with potentially large columns has an associated TOAST table (named pg_toast.pg_toast_<oid>) that operates behind the scenes. While this separation improves query performance on the main table, TOAST tables can accumulate dead tuples from updates and deletes just like regular tables, requiring their own vacuum maintenance.
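
To see which TOAST table backs a given table, and whether it is accumulating dead tuples of its own, a query along these lines works (user_activities is just the example table used earlier; only tables that actually have TOAST storage return a row):

-- Map a table to its TOAST relation and check its size and dead tuples
SELECT c.relname                                AS main_table,
       t.relname                                AS toast_table,
       pg_size_pretty(pg_relation_size(t.oid))  AS toast_size,
       s.n_dead_tup
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
LEFT JOIN pg_stat_all_tables s ON s.relid = t.oid
WHERE c.relname = 'user_activities';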

6.1 Understanding TOAST Autovacuum Parameters

TOAST tables can be tuned independently from their parent tables using specific parameters. Here are the key options and their recommended values:

toast.autovacuum_vacuum_cost_delay

  • Default: Inherits from autovacuum_vacuum_cost_delay (typically 2ms in Aurora)
  • Recommended: 0 for high-throughput systems
  • Purpose: Controls the delay between vacuum operations to throttle I/O impact
  • Effect: Setting to 0 removes throttling, allowing vacuums to complete faster at the cost of higher instantaneous I/O
ALTER TABLE your_large_table SET (toast.autovacuum_vacuum_cost_delay = 0);

toast.autovacuum_vacuum_threshold

  • Default: 50 tuples
  • Recommended: 1000-5000 for large, frequently updated tables
  • Purpose: Minimum number of dead tuples before triggering an autovacuum
  • Effect: Higher values reduce vacuum frequency but may allow more bloat
ALTER TABLE your_large_table SET (toast.autovacuum_vacuum_threshold = 2000);

toast.autovacuum_vacuum_scale_factor

  • Default: 0.2 (20% of table size)
  • Recommended: 0.05-0.1 for very large tables, 0.2-0.3 for smaller tables
  • Purpose: Percentage of table size that, when combined with threshold, triggers autovacuum
  • Effect: Lower values mean more frequent vacuums, preventing excessive bloat
ALTER TABLE your_large_table SET (toast.autovacuum_vacuum_scale_factor = 0.1);

toast.autovacuum_vacuum_cost_limit

  • Default: Inherits from autovacuum_vacuum_cost_limit (typically 200 in Aurora)
  • Recommended: 2000-4000 for aggressive cleanup
  • Purpose: Maximum “cost” budget before vacuum process sleeps
  • Effect: Higher values allow more work per cycle before throttling kicks in
ALTER TABLE your_large_table SET (toast.autovacuum_vacuum_cost_limit = 3000);

6.2 Practical Example

For a large table with frequent updates to text or JSON columns in Aurora PostgreSQL:

-- Optimize TOAST table for aggressive, fast vacuuming
ALTER TABLE user_profiles SET (
    toast.autovacuum_vacuum_cost_delay = 0,
    toast.autovacuum_vacuum_threshold = 2000,
    toast.autovacuum_vacuum_scale_factor = 0.05,
    toast.autovacuum_vacuum_cost_limit = 3000
);

This configuration ensures TOAST tables are vacuumed frequently and quickly, preventing bloat from degrading performance while leveraging Aurora’s optimized storage layer. Monitor your vacuum activity using pg_stat_user_tables and adjust these parameters based on your workload’s specific characteristics.

7.0 “No Regrets” Optimizations for Mission-Critical Large Tables

When you have mission-critical large tables and sufficient infrastructure to scale memory and CPU, these optimizations will deliver immediate performance improvements without significant trade-offs:

7.1 Increase Maintenance Memory Allocation

Set generous memory limits to ensure vacuum operations complete in a single index scan pass:

-- For large instances with adequate RAM
ALTER SYSTEM SET maintenance_work_mem = '4GB';
ALTER SYSTEM SET autovacuum_work_mem = '2GB';
SELECT pg_reload_conf();

Why this works: Each dead tuple requires 6 bytes in memory. Insufficient memory forces multiple expensive index scan passes. With adequate memory, vacuum completes faster and more efficiently.

Impact: Reduces vacuum time by 40-60% for tables requiring multiple index scans.

7.2 Enable Aggressive Parallel Vacuum

Leverage multiple CPU cores for dramatically faster vacuum operations:

-- System-wide settings (adjust based on available vCPUs)
ALTER SYSTEM SET max_parallel_maintenance_workers = 4;
ALTER SYSTEM SET min_parallel_index_scan_size = '256kB';
SELECT pg_reload_conf();
-- Per-table optimization for mission-critical tables
ALTER TABLE your_critical_table SET (parallel_workers = 4);

Why this works: Parallel vacuum distributes index cleanup across multiple workers. For tables with 4+ indexes, this parallelization provides substantial speedups.

Impact: 50-70% reduction in vacuum time for index-heavy tables.

7.3 Remove Autovacuum Throttling

Eliminate I/O throttling delays to let vacuum run at full speed:

-- Apply to critical tables
ALTER TABLE your_critical_table SET (
    autovacuum_vacuum_cost_delay = 0,
    autovacuum_vacuum_cost_limit = 10000
);

Why this works: Default throttling was designed for resource-constrained systems. With sufficient infrastructure, removing these limits allows vacuum to complete faster without impacting performance.

Impact: 30-50% faster vacuum completion with no downside on properly provisioned systems.

7.4 Tune TOAST Table Parameters

Optimize vacuum for oversized attribute storage:

ALTER TABLE your_critical_table SET (
    toast.autovacuum_vacuum_cost_delay = 0,
    toast.autovacuum_vacuum_threshold = 2000,
    toast.autovacuum_vacuum_scale_factor = 0.05,
    toast.autovacuum_vacuum_cost_limit = 3000
);

Why this works: TOAST tables accumulate dead tuples independently and are often overlooked. Aggressive TOAST vacuuming prevents hidden bloat in large text/JSON columns.

Impact: Eliminates TOAST bloat, which can represent 20-40% of total table bloat.

7.5 Lower Autovacuum Thresholds

Trigger vacuum earlier to prevent bloat accumulation:

ALTER TABLE your_critical_table SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- Down from default 0.2
    autovacuum_vacuum_threshold = 5000
);

Why this works: More frequent, smaller vacuums are faster and less disruptive than infrequent, massive cleanup operations. This prevents bloat before it impacts query performance.

Impact: Maintains bloat under 10% consistently, preventing query degradation.

7.6 Install and Configure pg_repack

Have pg_repack ready for zero-downtime space reclamation:

CREATE EXTENSION pg_repack;

Why this works: When bloat exceeds 30-40%, pg_repack reclaims space without the long exclusive locks required by VACUUM FULL. Critical tables remain online throughout the operation.

Impact: Space reclamation during business hours without downtime.

7.7 Complete Configuration Template

For a mission-critical large table on a properly sized Aurora instance:

-- Main table optimization
ALTER TABLE your_critical_table SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold = 5000,
    autovacuum_vacuum_cost_delay = 0,
    autovacuum_vacuum_cost_limit = 10000,
    parallel_workers = 4,
    toast.autovacuum_vacuum_cost_delay = 0,
    toast.autovacuum_vacuum_threshold = 2000,
    toast.autovacuum_vacuum_scale_factor = 0.05,
    toast.autovacuum_vacuum_cost_limit = 3000
);
-- System-wide settings (for db.r5.2xlarge or larger)
ALTER SYSTEM SET maintenance_work_mem = '4GB';
ALTER SYSTEM SET autovacuum_work_mem = '2GB';
ALTER SYSTEM SET max_parallel_maintenance_workers = 4;
ALTER SYSTEM SET min_parallel_index_scan_size = '256kB';
SELECT pg_reload_conf();

These optimizations are “no regrets” because they:

  • Require no application changes
  • Leverage existing infrastructure capacity
  • Provide immediate, measurable improvements
  • Have minimal risk when resources are available
  • Prevent problems rather than reacting to them

8.0 Conclusion

Effective vacuum management is not a one-time configuration task—it’s an ongoing optimization process that scales with your database. As your PostgreSQL Aurora tables grow, the default vacuum settings that worked initially can become a significant performance bottleneck, leading to bloat, degraded query performance, and wasted storage.

The strategies covered in this guide provide a comprehensive toolkit for managing vacuum at scale:

  • Monitoring queries help you identify bloat before it impacts performance
  • Table-level autovacuum tuning allows you to customize behavior for high-churn tables
  • Memory configuration (maintenance_work_mem and autovacuum_work_mem) ensures vacuum operations complete efficiently without multiple index scans
  • Parallel vacuuming leverages multiple CPU cores to dramatically reduce vacuum time for large, index-heavy tables
  • pg_repack offers a near-zero-downtime solution for reclaiming space from heavily bloated tables
  • Automated maintenance workflows enable proactive vacuum management during off-peak hours

The key is to be proactive rather than reactive. Regularly run the monitoring queries and memory estimation scripts provided in this article. Watch for warning signs like increasing dead tuple counts, growing bloat percentages, and tables requiring multiple index scan passes. When you spot these indicators, apply targeted tuning before they escalate into production issues.

For tables with multiple indexes that take hours to vacuum, parallel vacuuming offers a game-changing performance boost—often reducing vacuum time by 50-70% by distributing index cleanup across multiple CPU cores. However, this power comes with resource trade-offs: each parallel worker consumes its own memory allocation and CPU cycles. The key is finding the sweet spot for your Aurora instance size, testing with 2-3 workers initially and scaling up based on available vCPUs and observed performance gains. This is especially valuable during maintenance windows when you need to vacuum large tables quickly without blocking operations for extended periods.

Remember that vacuum optimization is a balance: too aggressive and you risk impacting production workload; too conservative and bloat accumulates faster than it can be cleaned. Start with the conservative recommendations provided, monitor the results, and iterate based on your specific workload patterns and Aurora instance capabilities.

With the right monitoring, configuration, and tooling in place, you can maintain healthy tables even as they scale to hundreds of gigabytes—ensuring consistent query performance and optimal storage utilization for your PostgreSQL Aurora database.


Deep Dive: AWS NLB Sticky Sessions (stickiness) Setup, Behavior, and Hidden Pitfalls

When you deploy applications behind a Network Load Balancer (NLB) in AWS, you usually expect perfect traffic distribution: fast, fair, and stateless.
But what if your backend holds stateful sessions, like in-memory login sessions, caching, or WebSocket connections, and you need a given client to keep hitting the same target every time?

That’s where NLB sticky sessions (also called connection stickiness or source IP affinity) come in. They’re powerful but often misunderstood, and misconfiguring them can lead to uneven load, dropped connections, or mysterious client “resets.”

Let’s break down exactly how they work, how to set them up, what to watch for, and how to troubleshoot the tricky edge cases that appear in production.


1. What Are Sticky Sessions on an NLB?

At a high level, sticky sessions ensure that traffic from the same client consistently lands on the same target (EC2 instance, IP, or container) behind your NLB.

Unlike the Application Load Balancer (ALB), which uses HTTP cookies for stickiness, the NLB operates at Layer 4 (TCP/UDP).
That means it doesn’t look inside your packets. Instead, it bases stickiness on network-level parameters like:

  • Source IP address
  • Destination IP and port
  • Source port (sometimes included in the hash)
  • Protocol (TCP, UDP, or TLS passthrough)

AWS refers to this as “source IP affinity.”
When enabled, the NLB creates a flow-hash mapping that ties the client to a backend target.
As long as the hash remains the same, the same client gets routed to the same target — even across multiple connections.


2. Enabling Sticky Sessions on an AWS NLB

Stickiness is configured per target group, not at the NLB level.

Step-by-Step via AWS Console

  1. Go to EC2 → Load Balancers → Target Groups
    Find the target group your NLB listener uses.
  2. Select the Target Group → Attributes tab
  3. Under Attributes, set:
     • Stickiness.enabled = true
     • Stickiness.type = source_ip
  4. Save changes and confirm the attributes are updated.

Step-by-Step via AWS CLI

aws elbv2 modify-target-group-attributes \
--target-group-arn arn:aws:elasticloadbalancing:region:acct:targetgroup/mytg/abc123 \
--attributes Key=stickiness.enabled,Value=true Key=stickiness.type,Value=source_ip

How to Verify:

aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:acct:targetgroup/mytg/abc123

Sample Output:

{
    "Attributes": [
        { "Key": "stickiness.enabled", "Value": "true" },
        { "Key": "stickiness.type", "Value": "source_ip" }
    ]
}

3. How NLB Stickiness Actually Works (Under the Hood)

The NLB’s flow hashing algorithm calculates a hash from several parameters, often the “five-tuple”:

<protocol, source IP, source port, destination IP, destination port>

The hash is used to choose a target. When stickiness is enabled, NLB remembers this mapping for some time (typically a few minutes to hours, depending on flow expiration).

Key Behavior Points:

  • If the same client connects again using the same IP and port, the hash matches and traffic lands on the same backend target.
  • If any part of that tuple changes (e.g. the client source port changes), the hash may change and the client might hit a different target.
  • NLBs maintain this mapping in memory; if the NLB node restarts or fails over, the mapping is lost.
  • Sticky mappings can also be lost when cross-zone load balancing or target health status changes.

Not Cookie Based

Because NLBs don’t inspect HTTP traffic, there’s no cookie involved.
This means:

  • You can’t set session duration or expiry time like in ALB stickiness.
  • Stickiness only works as long as the same network path and source IP persist.

4. Known Limitations & Edge Cases

Sticky sessions on NLBs are helpful but brittle. Here’s what can go wrong:

| Issue | Cause | Effect |
|---|---|---|
| Client source IP changes | NAT, VPN, mobile switching networks | Hash changes → new target |
| Different source port | Client opens multiple sockets or reconnects | Each connection may map differently |
| TLS termination at NLB | NLB terminates TLS | Stickiness not supported (only for TCP listeners) |
| Unhealthy target | Health check fails | Mapping breaks; NLB reroutes |
| Cross-zone load balancing toggled | Distribution rules change | May break existing sticky mappings |
| DNS round-robin at client | NLB has multiple IPs per AZ | Client DNS resolver may change NLB node |
| UDP behavior | Stateless packets; different flow hash | Stickiness unreliable for UDP |
| Scaling up/down | New targets added | Hash table rebalanced; some clients remapped |

Tip: If you rely on stickiness, keep your clients stable (same IP) and avoid frequent target registration changes.

5. Troubleshooting Sticky Session Problems

When things go wrong, these are the most common patterns you’ll see:

1. “Stickiness not working”

  • Check target group attributes: aws elbv2 describe-target-group-attributes --target-group-arn <arn> Ensure stickiness.enabled is true.
  • Make sure your listener protocol is TCP, not TLS.
  • Confirm that client IPs aren’t being rewritten by NAT or proxy.
  • Check CloudWatch metrics. If one target gets all the traffic, stickiness might be too “sticky” due to limited source IP variety.

2. “Some clients lose session state randomly”

  • Verify client network stability. Mobile clients or corporate proxies can rotate IPs.
  • Confirm health checks aren’t flapping targets.
  • Review your application session design, if session data lives in memory, consider an external session store (Redis, DynamoDB, etc.).

3. “Load imbalance: one instance overloaded”

  • This can happen when many users share one public IP (common in offices or ISPs).
    All those clients hash to the same backend.
  • Mitigate by:
    • Disabling stickiness if not strictly required.
    • Using ALB with cookie based stickiness (more granular).
    • Scaling target capacity.

4. “Connections drop after some time”

  • NLB may remove stale flow mappings.
  • Check TCP keepalive settings on clients and targets. Ensure keepalive_time < NLB idle timeout (350 seconds) to prevent connection resets. Linux commands below:
# Check keepalive time (seconds before sending first keepalive probe)
sysctl net.ipv4.tcp_keepalive_time

# Check keepalive interval (seconds between probes)
sysctl net.ipv4.tcp_keepalive_intvl

# Check keepalive probes (number of probes before giving up)
sysctl net.ipv4.tcp_keepalive_probes

# View all at once
sysctl -a | grep tcp_keepalive
  • Verify idle timeout on backend apps (e.g., web servers closing connections too early).
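
If the keepalive timers reported above are longer than the NLB idle timeout, they can be tightened so that probes fire before the flow expires. A minimal sketch with example values; the 300/30/3 figures are illustrative assumptions, not AWS-mandated settings:

# Send the first keepalive probe after 300 seconds of idle time (below the NLB's 350 second timeout)
sudo sysctl -w net.ipv4.tcp_keepalive_time=300

# Probe every 30 seconds and give up after 3 failed probes
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
sudo sysctl -w net.ipv4.tcp_keepalive_probes=3

# Persist the settings across reboots
printf "net.ipv4.tcp_keepalive_time=300\nnet.ipv4.tcp_keepalive_intvl=30\nnet.ipv4.tcp_keepalive_probes=3\n" | sudo tee /etc/sysctl.d/90-nlb-keepalive.conf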

6. Observability & Testing

You can validate sticky behavior with:

  • CloudWatch metrics:
    ActiveFlowCount, NewFlowCount, and per target request metrics.
  • VPC Flow Logs: confirm that repeated requests from the same client IP go to the same backend ENI.
  • Packet captures and socket stats: Use tcpdump or ss on your backend instances to see whether the same source IP consistently connects.

Quick test with curl:

for i in {1..100}; do 
    echo "=== Request $i at $(date) ===" | tee -a curl_test.log
    curl http://<nlb-dns-name>/ -v 2>&1 | tee -a curl_test.log
    sleep 0.5
done

Run it from the same host and check which backend responds (log hostname on each instance).
Then try from another IP or VPN; you’ll likely see a different target.

7. Best Practices

  1. Only enable stickiness if necessary.
    Stateless applications scale better without it.
  2. If using TLS: terminate TLS at the backend or use ALB if you need session affinity.
  3. Use shared session stores.
    Tools like ElastiCache (Redis) or DynamoDB make scaling simpler and safer.
  4. Avoid toggling cross-zone load balancing during traffic; it resets the sticky map.
  5. Set up proper health checks. Unhealthy targets break affinity immediately.
  6. Monitor uneven load. Large NAT’d user groups can overload a single instance.
  7. For UDP, consider designing idempotent, stateless processing; sticky sessions may not behave reliably.

8. Example Architecture Pattern

Scenario: A multiplayer game server behind an NLB.
Each player connects via TCP to the game backend that stores their in-memory state.

✅ Recommended setup:

  • Enable stickiness.enabled = true and stickiness.type = source_ip
  • Disable TLS termination at NLB
  • Keep targets in the same AZ with cross-zone load balancing disabled to maintain stable mapping
  • Maintain external health and scaling logic to avoid frequent re-registrations

This setup ensures that the same player IP always lands on the same backend server, as long as their network path is stable.
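
A minimal CLI sketch of that setup; the names, port and IDs are placeholders rather than values from a real deployment:

# Create a TCP target group for the game servers (placeholder port and VPC ID)
aws elbv2 create-target-group \
    --name game-servers-tg \
    --protocol TCP \
    --port 7777 \
    --vpc-id <vpc-id> \
    --target-type instance

# Turn on source-IP stickiness for the new target group
aws elbv2 modify-target-group-attributes \
    --target-group-arn <target-group-arn> \
    --attributes Key=stickiness.enabled,Value=true Key=stickiness.type,Value=source_ip

# Keep cross-zone load balancing disabled on the NLB itself
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn <nlb-arn> \
    --attributes Key=load_balancing.cross_zone.enabled,Value=false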

9. Summary Table

Attribute | Supported Value | Notes
--------- | --------------- | -----
stickiness.enabled | true / false | Enables sticky sessions
stickiness.type | source_ip | Only option for NLB
Supported Protocols | TCP, UDP (limited) | Not supported for TLS listeners
Persistence Duration | Until flow reset | Not configurable
Cookie-based Stickiness | ❌ No | Use ALB for cookie-based stickiness
Best for | Stateful TCP apps | e.g. games, custom protocols

10. When to Use ALB Instead

If you’re dealing with HTTP/HTTPS applications that manage user sessions via cookies or tokens, you’ll be much happier using an Application Load Balancer.
It offers:

  • Configurable cookie duration
  • Application-based (custom cookie) stickiness
  • Layer 7 routing and metrics

The NLB should be reserved for high-performance, low-latency, or non-HTTP workloads that need raw TCP/UDP handling.

11. Closing Thoughts

AWS NLB sticky sessions are a great feature, but they’re not magic glue.
They work well when your network topology and client IPs are predictable, and your app genuinely needs flow affinity. However, if your environment involves NATs, mobile networks, or frequent scale-ups, expect surprises.

When in doubt:
1. Keep your app stateless,
2. Let the load balancer do its job, and
3. Use stickiness only as a last resort for legacy or session bound systems.


AWS: Use the AWS CLI to delete snapshots from your account

The Amazon EC2 console allows you to delete up to 50 Amazon Elastic Block Store (Amazon EBS) snapshots at once. To delete more than 50 snapshots, use the AWS Command Line Interface (AWS CLI) or the AWS SDK.

To see all the snapshots that you own in a specific region, run the following. Note, replace af-south-1 with your region:

aws ec2 describe-snapshots --owner-ids self  --query 'Snapshots[]' --region af-south-1

Note: To run the code below, first make sure you’re in the correct account (or life will become difficult for you). Next, replace BOTH instances of “af-south-1” with your particular region. Finally, you can use a specific account number in place of --owner-ids=self (e.g. --owner-ids=1234567890).

for SnapshotID in $(aws ec2 --region af-south-1 describe-snapshots --owner-ids=self --query 'Snapshots[*].SnapshotId' --output=text); do
    aws ec2 --region af-south-1 delete-snapshot --snapshot-id ${SnapshotID}
done
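
If you only want to prune older snapshots rather than delete everything, the same loop can be filtered on StartTime (a string comparison against the ISO-8601 timestamp). A sketch below, with an assumed cut-off date and an echo left in as a dry run; remove the echo once the output looks right:

for SnapshotID in $(aws ec2 --region af-south-1 describe-snapshots --owner-ids=self \
    --query 'Snapshots[?StartTime<=`2023-01-01`].SnapshotId' --output=text); do
    # Dry run: print the delete command instead of executing it
    echo aws ec2 --region af-south-1 delete-snapshot --snapshot-id ${SnapshotID}
done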

How to make an offline copy of a static website using wget and hosting on AWS S3 with CloudFront

I have an old website whose hosting costs I want to avoid, so I downloaded the site and now serve it from an AWS S3 bucket, using CloudFront to publish the content. Below are the steps I took to do this:

First download the website to your laptop

$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --no-check-certificate \
     --restrict-file-names=unix \
     --domains archive.andrewbaker.ninja \
     --no-parent \
         http://archive.andrewbaker.ninja/
$ cd archive.andrewbaker.ninja
$ ls

Below is a summary of the parameters (including common alternatives):

--recursive: Wget is capable of traversing parts of the Web (or a single HTTP or FTP server), following links and directory structure. We refer to this as recursive retrieval, or recursion.

--no-clobber: If a file is downloaded more than once in the same directory, Wget’s behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved. When running Wget without -N, -nc, or -r, downloading the same file in the same directory will result in the original copy of the file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of the file. Therefore, “no-clobber” is actually a misnomer in this mode: it is not clobbering that is prevented (the numeric suffixes were already preventing clobbering), but rather the saving of multiple versions. When running Wget with -r, but without -N or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored. When running Wget with -N, with or without -r, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file (see the Time-Stamping section of the wget manual). -nc may not be specified at the same time as -N. Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.

--page-requisites: This causes wget to download all the files that are necessary to properly display a given HTML page, which includes images, CSS, JS, etc.

--html-extension: This adds .html after the downloaded filename, to make sure it plays nicely on whatever system you’re going to view the archive on. (On newer versions of wget this option is spelled --adjust-extension, which also preserves proper file extensions for .css and other assets.)

--convert-links: After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

--no-check-certificate: Don’t check the server certificate against the available certificate authorities, and don’t require the URL host name to match the common name presented by the certificate.

--restrict-file-names: By default, Wget escapes the characters that are not valid or safe as part of file names on your operating system, as well as control characters that are typically unprintable. This option is useful for changing these defaults, for example because you are downloading to a non-native partition. Unless you are downloading to a non-native partition you do not need to restrict file names by OS; it happens automatically. Note that the values ‘unix’ and ‘windows’ are mutually exclusive (one will override the other).

--domains: Limit spanning to the specified domains.

--no-parent: If you don’t want wget to ascend to the parent directory, use the -np or --no-parent option. This instructs wget not to ascend to the parent directory when it hits references like ../ in href links.

Upload Files to S3 Bucket

Next upload the files to your S3 bucket. First move into the downloaded site’s directory, then perform the recursive upload.

$ cd archive.andrewbaker.ninja
$ ls .
$ aws s3 cp . s3://vbusers.com/ --recursive
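
If you refresh the archive later, aws s3 sync only uploads files that have changed and can optionally remove objects that no longer exist locally. A quick sketch against the same bucket:

$ aws s3 sync . s3://vbusers.com/ --delete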

Create a CloudFront Distribution from an S3 Bucket

Finally go to CloudFront and create a distribution from the S3 bucket you just populated. You can pretty much use the default settings. Note: you will need to wait a few minutes before you browse to the distribution’s domain name.
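
If you prefer the CLI for this step, a basic distribution can be created straight from the bucket; a minimal sketch using the shorthand flags (the bucket name matches the upload above, and the resulting distribution uses default settings you may want to tune afterwards):

$ aws cloudfront create-distribution \
      --origin-domain-name vbusers.com.s3.amazonaws.com \
      --default-root-object index.html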


AWS: Install and configure the AWS CLI on a Macbook

You can absolutely get the following from the AWS help pages, but this is the lazy way to get everything you need for a simple single-account setup.

Run the two commands below to drop the package on your Mac.

$ curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
$ sudo installer -pkg AWSCLIV2.pkg -target /

Then check the versions you have installed:

$ which aws
$ aws --version

Next you need to set up your environment. Note: This is NOT the recommended way (as it uses long-term credentials).

The following example configures a default profile using sample values. Replace them with your own values as described in the following sections.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: secretaccesskey
Default region name [None]: af-south-1
Default output format [None]: json

You can also use named profiles. The following example configures a profile named userprod using sample values. Replace them with your own values as described in the following sections.

$ aws configure --profile userprod
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: secretaccesskey
Default region name [None]: af-south-1
Default output format [None]: json
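
Once the named profile exists, you can select it per command or export it for the whole shell session. A quick sketch using the profile created above:

$ aws sts get-caller-identity --profile userprod
$ export AWS_PROFILE=userprod
$ aws s3 ls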

Get your access keys

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane of the IAM console, select Users and then select the User name of the user that you created previously.
  3. On the user’s page, select the Security credentials page. Then, under Access keys, select Create access key.
  4. For Create access key Step 1, choose Command Line Interface (CLI).
  5. For Create access key Step 2, enter an optional tag and select Next.
  6. For Create access key Step 3, select Download .csv file to save a .csv file with your IAM user’s access key and secret access key. You need this information for later.
  7. Select Done.

AWS: Automatically Stop and Start your EC2 Services

Below is a quick (I am busy) outline of how to automatically stop and start your EC2 instances.

Step 1: Tag your resources

In order to decide which instances to stop and start, you first need to add an auto-start-stop: Yes tag to all the instances you want to be affected by the start/stop functions. Note: You can use “Resource Groups and Tag Editor” to bulk apply these tags to the resources you want to be affected by the lambda functions you are going to create, or tag them from the CLI as sketched after this paragraph. See below (click the orange button called “Manage tags of Selected Resources”).
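
If you prefer the CLI to the Tag Editor, the same tag can be applied in bulk; a sketch with placeholder instance IDs:

aws ec2 create-tags \
    --resources i-0123456789abcdef0 i-0abcdef1234567890 \
    --tags Key=auto-start-stop,Value=Yes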

Step 2: Create a new role for our lambda functions

First we need to create the IAM role to run the Lambda functions. Go to IAM and click the “Create Role” button. Then select “AWS Service” from the “Trusted entity options”, and select Lambda from the “Use Cases” options. Then click “Next”, followed by “Create Policy”. To specify the permissions, simply click the JSON button on the right of the screen and enter the policy below (swapping the region and account ID placeholders for your own values):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DescribeAndLogging",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeTags",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceStatus",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        },
        {
            "Sid": "StartStopTaggedInstances",
            "Effect": "Allow",
            "Action": [
                "ec2:StartInstances",
                "ec2:StopInstances"
            ],
            "Resource": "arn:aws:ec2:<region>:<accountID>:instance/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/auto-start-stop": "Yes"
                }
            }
        }
    ]
}

Hit next and under “Review and create”, save the above policy as ec2-lambda-start-stop by clicking the “Create Policy” button. Next, search for this newly created policy and select it as per below and hit “Next”.

You will now see the “Name, review, and create” screen. Here you simply need to hit “Create Role” after you enter the role name as ec2-lambda-start-stop-role.

Note that the Start/Stop actions are restricted to EC2 instances that carry the auto-start-stop: Yes tag (least privilege). The Describe and CloudWatch Logs actions do not support resource-level restrictions, so they are granted against a wildcard resource.

If you want to review your role, this is how it should look. You can see I have filled in my region and account number in the policy:

Step 3: Create Lambda Functions To Start/Stop EC2 Instances

In this section we will create two lambda functions, one to start the instances and the other to stop the instances.

Step 3a: Add the Stop EC2 instance function

  • Goto Lambda console and click on create function
  • Create a lambda function with a function name of stop-ec2-instance-lambda, the python3.11 runtime, and the ec2-lambda-start-stop-role created earlier (see image below).

Next add the lambda stop function below and save it as stop-ec2-instance-lambda. Note, you will need to change the value of the region_name parameter accordingly.

import boto3

# Stop every running instance that carries the auto-start-stop: Yes tag
ec2 = boto3.resource('ec2', region_name='af-south-1')

def lambda_handler(event, context):
    instances = ec2.instances.filter(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'tag:auto-start-stop', 'Values': ['Yes']}
        ]
    )
    for instance in instances:
        instance.stop()
        print("Stopped instance: " + instance.id)
    return "success"

This is how your Lambda function should look:

Step 3b: Add the Start EC2 instance function

  • Goto Lambda console and click on create function
  • Create a lambda function with a function name of start-ec2-instance-lambda, the python3.11 runtime, and the ec2-lambda-start-stop-role.
  • Then add the below code and save the function.

Note, you will need to change the value of the region_name parameter accordingly.

import boto3

# Start every stopped instance that carries the auto-start-stop: Yes tag
ec2 = boto3.resource('ec2', region_name='af-south-1')

def lambda_handler(event, context):
    instances = ec2.instances.filter(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['stopped']},
            {'Name': 'tag:auto-start-stop', 'Values': ['Yes']}
        ]
    )
    for instance in instances:
        instance.start()
        print("Started instance: " + instance.id)
    return "success"

Step 4: Summary

If either of the above lambda functions is triggered, it will start or stop your EC2 instances based on the instance state and the value of the auto-start-stop tag. To automate this you can set up cron jobs, Step Functions, Amazon EventBridge schedules, Jenkins, etc.
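
As an example of the EventBridge route, a schedule rule can invoke the stop function every weekday evening. A minimal sketch; the cron expression, rule name and ARNs are placeholders to adapt:

# Fire at 18:00 UTC on weekdays (placeholder schedule)
aws events put-rule \
    --name stop-ec2-nightly \
    --schedule-expression 'cron(0 18 ? * MON-FRI *)'

# Point the rule at the stop Lambda function (placeholder ARN)
aws events put-targets \
    --rule stop-ec2-nightly \
    --targets 'Id'='stop-lambda','Arn'='<stop-lambda-arn>'

# Allow EventBridge to invoke the function (placeholder rule ARN)
aws lambda add-permission \
    --function-name stop-ec2-instance-lambda \
    --statement-id allow-eventbridge-stop \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn <rule-arn>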
