The AWS Well-Architected Framework will not spot this. Your cloud governance team will not catch it either, nor will Trusted Advisor. Your SRE reviews will miss it, your APM will not flag it, Performance Insights will not surface it, and your FinOps team will have no clue it is even there. It sits quietly in your estate for months, sometimes years, until the day load conditions expose it and you spend four hours in a war room trying to explain to your CTO why a bank cannot process payments.
What is it? IOPS mismatches, where your storage can deliver far more throughput than your compute instance can actually consume.
1. What Is an IOPS Mismatch?
IOPS (Input/Output Operations Per Second) is the fundamental currency of storage performance. When you provision an RDS instance or attach an EBS volume to an EC2 host, you are making two independent decisions: how many IOPS the storage can deliver, and how many IOPS the instance can actually push through its I/O subsystem. These two numbers are governed by entirely separate limits. EBS gp3 volumes can be provisioned up to 16,000 IOPS and io2 volumes up to 64,000 (io2 Block Express reaches 256,000), but your EC2 instance has its own maximum EBS IOPS ceiling, and for many instance types that ceiling is well below what your storage can theoretically deliver.
The result is a mismatch where you are paying for storage performance that your compute layer cannot reach, and worse, you are designing your system around an IOPS budget that does not actually exist at the instance level. Every rand you spend on IOPS above that ceiling is money you will never get value from, and none of your tooling will tell you.
2. The AWS Architecture Pattern That Makes This Worse
AWS makes it easy to provision storage and compute independently, and neither the console, the CLI, CloudFormation, nor Terraform will warn you when your storage IOPS ceiling exceeds your instance throughput ceiling. The typical path to a mismatch involves an engineer provisioning an r6g.large, whose EBS IOPS ceiling is roughly 3,500, because the application is memory intensive, while a separate decision by a different team provisions the attached RDS storage with 10,000 IOPS because the database team uses a standard enterprise template. The two numbers are never compared against each other and the system goes to production.
At any institution operating at scale across hundreds of accounts and workloads, the opportunity for this kind of drift is significant. An OU spanning multiple teams, each with their own provisioning habits and templates, will accumulate mismatches over time through the natural entropy of infrastructure management.
Some of the most frequent mismatches appear in exactly the instance types engineers reach for most often. The t3 family is particularly dangerous because it is the default choice for cost conscious workloads: every size from nano to 2xlarge is capped at 2,085 IOPS, while a gp3 volume defaults to 3,000 and io1 storage attached by a database team using an enterprise template can easily reach 6,000 or more. The r6g.large caps at 3,500 IOPS and is routinely paired with RDS storage provisioned at double that. The m5.xlarge at 6,000 IOPS is frequently attached to io2 volumes at 10,000. On the RDS side, a db.t3.medium is limited to 1,536 IOPS but is commonly found with io1 storage provisioned at 3,000, and db.r6g.large instances capped at 3,500 are routinely given storage provisioned at 8,000 or more.
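Those ceilings can be captured as a simple lookup and compared at provisioning time. The figures below are the ones cited in this article; verify them against the current AWS instance documentation before relying on them:

```python
# Instance EBS IOPS ceilings as cited above (illustrative; confirm against
# the AWS docs for your region and instance generation).
INSTANCE_IOPS_CEILING = {
    "t3.medium": 2_085,
    "r6g.large": 3_500,
    "m5.xlarge": 6_000,
    "db.t3.medium": 1_536,
    "db.r6g.large": 3_500,
}

def iops_mismatch(instance_type: str, provisioned_iops: int):
    """Return the ratio of provisioned IOPS to the instance ceiling,
    or None if the instance can consume everything the storage delivers."""
    ceiling = INSTANCE_IOPS_CEILING.get(instance_type)
    if ceiling is None or provisioned_iops <= ceiling:
        return None
    return provisioned_iops / ceiling
```

A check like this takes seconds at provisioning time and is exactly the comparison that never happens when storage and compute are decided by different teams.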
3. Why This Causes Outages and Not Just Waste
Paying for IOPS you cannot use is expensive but survivable. The dangerous scenario is what happens at the boundary when load actually increases.
Consider a team that knows their application spikes at just under 30,000 IOPS. They provision storage at 40,000 to give themselves headroom, which is exactly what good engineering practice tells them to do. What nobody checks is the instance throughput ceiling, which sits well below the 30,000 IOPS the workload actually needs. Everything looks fine in testing, everything looks fine in monitoring, and the system runs without complaint. Then load spikes, the I/O subsystem begins to queue, upstream systems scale out in response, a flood of autoscaling connections exhausts the database connection pool, and services start timing out in a rapid cascading outage. The application is down, and now your team is trying to correlate thread exhaustion in your EKS clusters back to a silent configuration mismatch that no automated tool flagged at any point along the way.
That cascade is the real danger. Under normal conditions your application runs fine because actual I/O demand is modest and the mismatch is invisible. But when demand approaches the instance throughput ceiling, requests begin to queue at the virtualization layer and what was a 1ms storage read becomes 50ms and then 200ms as connection pools back up. In Aurora Serverless v2 or RDS PostgreSQL, connection pool exhaustion under I/O pressure is a documented failure mode where the database appears healthy and is accepting connections but query execution times blow through application timeouts, upstream services retry, retries compound the I/O pressure, and you are now in a feedback loop. On smaller instance types that operate on a credit model for I/O throughput, a sustained I/O spike drains the burst bucket and when it empties throughput drops to the baseline, which is often a fraction of what the application expects, even though the storage volume itself could still deliver perfectly well.
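The burst credit behaviour in that last sentence can be sketched as a toy model: sustained demand above the baseline drains a credit bucket, and once the bucket empties, delivered throughput collapses to the baseline. The baseline and bucket sizes below are illustrative assumptions, not figures for any specific instance type.

```python
def effective_iops(demand, baseline=2_000, bucket=6_000):
    """Toy credit-bucket model: return delivered IOPS for each second of
    demand. Numbers are illustrative, not real instance parameters."""
    cap = bucket
    delivered = []
    for d in demand:
        drawn = min(max(d - baseline, 0), bucket)        # credits spent this second
        bucket = min(bucket - drawn + max(baseline - d, 0), cap)  # refill when idle
        delivered.append(min(d, baseline + drawn))
    return delivered
```

With these assumed numbers, three seconds of 5,000 IOPS demand are served in full until the bucket empties, at which point throughput drops to the 2,000 baseline even though the attached volume could still deliver far more.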
The most insidious part is that CloudWatch metrics can hide this entirely. VolumeReadOps and VolumeWriteOps report what the volume delivered while EBSReadOps on the instance side reports what the instance consumed, and under a mismatch these metrics appear healthy right up until they do not because the bottleneck lives in the virtualized I/O path between instance and volume rather than in either component independently. Performance Insights, which most teams trust to surface database I/O problems, operates above this layer entirely and will show you wait events and query latency but will not show you that the ceiling your instance is hitting is a hardware throughput limit rather than a query or index problem.
4. The FinOps Trap: When Cost Optimisation Creates the Outage
There is a particularly cruel variant of this failure that deserves its own warning. Trusted Advisor and your FinOps team will periodically flag RDS instances with low CPU utilisation as oversized and recommend downsizing to save cost. The recommendation is correct by every metric they can see. CPU is idle, the instance looks like waste, and the resize gets approved and executed. What nobody checks is whether the workload is IO bound rather than CPU bound, and whether the smaller instance class sits below the provisioned storage IOPS ceiling. The downsize goes through, the system runs fine for weeks because load is modest, and then three months later under a transaction spike you are in an outage with no obvious cause because nobody remembers a routine right sizing exercise from last quarter. The CloudWatch CPU graphs look fine throughout. The storage metrics look fine. The failure lives entirely in the gap between what the instance can consume and what the storage was provisioned to deliver, and that gap was created by a cost optimisation that was correct by every measure anyone was watching.
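The missing check in that right sizing workflow is easy to state in code. This is a hypothetical guard, not part of any AWS tooling: before a downsize is approved, the candidate instance's EBS IOPS ceiling (taken from the AWS instance documentation) is compared against both the provisioned storage IOPS and the workload's observed I/O peak.

```python
def downsize_is_safe(candidate_ceiling_iops: int,
                     provisioned_iops: int,
                     peak_observed_iops: int) -> bool:
    """Hypothetical pre-approval guard for a right-sizing exercise.
    Refuse the downsize if the smaller instance cannot absorb either the
    provisioned IOPS budget or the workload's observed peak."""
    return candidate_ceiling_iops >= max(provisioned_iops, peak_observed_iops)
```

A CPU-idle db.r6g.large with storage provisioned at 8,000 IOPS would fail this check for any candidate class whose ceiling sits below 8,000, which is precisely the case the FinOps recommendation cannot see.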
5. Detection Requires Automation
Once you have identified a mismatch the fix is straightforward: upgrade the instance to a class whose throughput ceiling exceeds the provisioned storage IOPS, or reduce the provisioned IOPS on the storage to match what the instance can actually consume. The hard part is finding these mismatches across a large estate before they manifest as incidents, because you cannot do this manually and you cannot rely on individual teams to self report. It needs to be a scheduled job that generates a report and produces findings you treat with the same severity as any security or cost compliance alert.
The script below is designed to run across every account in scope. It queries every account in the OU, enumerates all RDS instances and EBS volumes, compares provisioned storage IOPS against the instance throughput ceiling for each, and flags every mismatch with a severity classification and an estimated monthly waste figure. Findings are classified as CRITICAL where provisioned IOPS exceed the ceiling by three times or more and require immediate remediation, HIGH at two to three times over for remediation within the current sprint, MEDIUM at 1.5 to two times for the next planning cycle, and LOW for anything above the ceiling but within the 1 to 1.5x range. Output is a colour coded Excel workbook with a findings sheet and summary tab alongside a flat CSV, and the script exits with code 1 if any CRITICAL findings are present so it can be wired into a CI pipeline for scheduled runs.
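The severity thresholds above can be restated as a small function. This is an illustrative restatement of the classification rules, not an excerpt from the script itself:

```python
def classify(provisioned_iops: int, ceiling_iops: int):
    """Map a provisioned-to-ceiling IOPS ratio onto the severity bands
    described above. Returns None when there is no mismatch."""
    ratio = provisioned_iops / ceiling_iops
    if ratio >= 3:
        return "CRITICAL"   # immediate remediation
    if ratio >= 2:
        return "HIGH"       # fix within the current sprint
    if ratio >= 1.5:
        return "MEDIUM"     # next planning cycle
    if ratio > 1:
        return "LOW"        # track and tidy up
    return None             # instance can consume everything provisioned
```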
To run it you will need boto3, pandas, and openpyxl installed, and your caller identity needs organizations:ListChildren, organizations:DescribeAccount, and sts:AssumeRole into each target account, while the assumed role needs rds:DescribeDBInstances, ec2:DescribeVolumes, and ec2:DescribeInstances.
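As a hedged sketch of how such a sweep might hang together: the role name "AuditRole" below is a placeholder, and the real script adds EBS volume enumeration, severity scoring, waste estimation, and the Excel/CSV output on top of this skeleton.

```python
def target_accounts(ou_id):
    """Yield account ids under an OU via the Organizations API."""
    import boto3
    org = boto3.client("organizations")
    paginator = org.get_paginator("list_children")
    for page in paginator.paginate(ParentId=ou_id, ChildType="ACCOUNT"):
        for child in page["Children"]:
            yield child["Id"]

def session_for(account_id, role_name="AuditRole"):
    """Assume the audit role in a target account and return a session.
    The role name is a placeholder for whatever your estate uses."""
    import boto3
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="iops-audit")["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"])

def mismatched(provisioned_iops, ceiling_iops):
    """The core comparison: storage budget the instance can never consume."""
    return bool(provisioned_iops and ceiling_iops
                and provisioned_iops > ceiling_iops)

def rds_findings(session, ceilings):
    """Yield (identifier, iops, ceiling) for every mismatched RDS instance.
    `ceilings` maps DB instance class to its EBS IOPS ceiling."""
    rds = session.client("rds")
    for page in rds.get_paginator("describe_db_instances").paginate():
        for db in page["DBInstances"]:
            ceiling = ceilings.get(db["DBInstanceClass"])
            if mismatched(db.get("Iops"), ceiling):
                yield db["DBInstanceIdentifier"], db["Iops"], ceiling
```

The important design point is that the comparison itself is trivial; all the value is in running it everywhere, on a schedule, against an up to date ceiling table.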
6. The Broader Lesson
Every major cloud outage I have seen, and this spans AWS and the broader industry, has a mundane configuration decision somewhere near its root cause. It is not a sophisticated attack and not a novel failure mode, it is a number that should have been checked against another number at provisioning time and was not. IOPS mismatches are not glamorous, they do not appear on architectural review templates, and as we have seen they slip past every layer of automated governance you have in place. But in a payment processing environment they translate directly into transaction failures, customer impact, and regulatory exposure, which means they deserve the same treatment as any other class of compliance finding. Audit your estate, fix the mismatches, and make it a scheduled job rather than a post incident action item.
Andrew Baker is Chief Information Officer at Capitec Bank. He writes about cloud architecture, banking technology, and the gap between what systems are designed to do and what they actually do under load at andrewbaker.ninja.
If you run a technical blog on WordPress, you know the pain. You paste a markdown article with fenced code blocks, Gutenberg creates bland core/code blocks with no syntax highlighting, no copy button, no dark mode. You end up wrestling with third party plugins that haven’t been updated in years or manually formatting every code snippet.
I built CloudScale Code Block to solve this once and for all. It’s a lightweight WordPress plugin that gives you proper syntax highlighting with automatic language detection, a one click clipboard copy button, dark and light theme toggle, full width responsive display, and a built in migration tool for converting your existing code blocks. It works as both a Gutenberg block and a shortcode for classic editor users.
In this post I’ll walk through how to install it, how to handle the Gutenberg paste problem, and how to migrate your existing code blocks.
1 What You Get
CloudScale Code Block uses highlight.js 11.11.1 under the hood with support for 28 languages out of the box. When you add a code block in the editor, you get a clean textarea with a toolbar showing the block type, detected language, and optional title (useful for filenames). On the frontend your visitors see beautifully highlighted code with line numbers, a copy to clipboard button, and a toggle to switch between dark (Atom One Dark) and light (Atom One Light) themes.
2 Installation
The plugin has zero build step required. No webpack, no npm install, no node modules. Upload it and activate.
That’s it. You’ll see CloudScale Code Block available in the Gutenberg block inserter under the Formatting category. You can also access settings at Settings > CloudScale Code to configure the default theme.
3 The Gutenberg Paste Problem
Here’s something every WordPress developer needs to know. When you paste markdown containing fenced code blocks (the triple backtick syntax), Gutenberg’s built in markdown parser intercepts the paste event before any plugin can touch it. It converts the fenced blocks into core/code blocks, which are WordPress’s default code blocks with no syntax highlighting.
This isn’t a bug in any plugin. It’s how Gutenberg’s paste pipeline works internally. The markdown parser runs synchronously during the paste event, creates the core blocks, and only then gives plugins a chance to respond.
CloudScale Code Block handles this with a practical solution: a floating convert toast.
4 Converting Pasted Code Blocks
When you paste markdown that contains fenced code blocks, Gutenberg will create core/code blocks as described above. CloudScale detects this automatically and shows a floating notification in the bottom right corner of the editor:
⚠️ 2 core code blocks found ⚡ Convert All to CloudScale
Click the Convert All to CloudScale button and every core/code and core/preformatted block in the post is instantly replaced with a CloudScale Code Block. The code content is preserved exactly as it was, and highlight.js will auto detect the language on the frontend.
This is a one click operation. Paste your markdown, click Convert All, done.
5 Migrating Existing Posts
If you have an existing blog with dozens or hundreds of posts using WordPress’s default code blocks or the Code Syntax Block plugin, you don’t want to edit each post manually. CloudScale Code Block includes a built in migration tool that handles this in bulk. Once the plugin is activated, go to Tools > Code Block Migrator in your WordPress admin.
5.1 How the Migrator Works
The migrator handles three types of legacy blocks:
wp:code blocks are the default WordPress code blocks. The migrator extracts the code content, decodes HTML entities, and detects the language from any lang attribute or language-xxx CSS class.
wp:code-syntax-block/code blocks are from the popular Code Syntax Block plugin. The migrator reads the language from the block’s JSON attributes where Code Syntax Block stores it.
wp:preformatted blocks are WordPress preformatted text blocks that some themes and plugins use for code. The migrator converts br tags back to proper newlines and strips any residual HTML formatting.
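To illustrate the language detection step, here is a hedged Python sketch of how a migrator can recover the language from default wp:code markup. The attribute patterns are assumptions based on standard Gutenberg output, not the plugin's actual source:

```python
import re

def detect_language(block_html: str):
    """Illustrative sketch: pull an explicit language out of a code block's
    markup, via a `language-xxx` class or a `lang` attribute. Returns None
    when nothing is declared, leaving highlight.js to auto detect."""
    m = re.search(r'class="[^"]*language-([\w-]+)', block_html)
    if m:
        return m.group(1)
    m = re.search(r'lang="([\w-]+)"', block_html)
    if m:
        return m.group(1)
    return None
```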
5.2 Migration Workflow
The process is straightforward:
Click Scan Posts to find every post and page containing legacy code blocks
The scan results show each post with a count of how many code blocks it contains
Click Preview on any post to see a side by side comparison of the original block markup and what CloudScale will produce
Click Migrate This Post to convert a single post, or use Migrate All Remaining to batch convert everything
The migrator writes directly to the database and clears the post cache, so changes take effect immediately. I recommend taking a database backup before running a bulk migration, but in practice the conversion is deterministic and safe. The migrator only touches block comment delimiters and HTML structure. Your actual code content is never modified.
5.3 After Migration
Once migration is complete you can deactivate the Code Syntax Block plugin if you were using it. All your posts will now use CloudScale Code Block format and render with full syntax highlighting on the frontend.
6 Technical Details
For those interested in what’s under the hood:
The plugin registers a single Gutenberg block (cloudscale/code-block) with a PHP render callback. The block stores its data as three attributes: content (the raw code text), language (optional, for explicit language selection), and title (optional, shown above the code). The block uses save: function() { return null; } which means all rendering happens server side via PHP. This makes the block resilient to markup changes and avoids the dreaded “This block contains unexpected or invalid content” error that plagues so many WordPress code plugins.
Frontend assets are loaded on demand. The highlight.js library, theme stylesheets, and the clipboard/toggle JavaScript are only enqueued when a post actually contains a CloudScale Code Block. No unnecessary scripts on pages that don’t need them.
The auto convert watcher uses wp.data.subscribe to monitor the Gutenberg block store for core/code and core/preformatted blocks. When it finds them, it renders a floating toast with a convert button. The conversion calls wp.data.dispatch('core/block-editor').replaceBlock() to swap each core block for a CloudScale block, preserving the code content and extracting any language hints from the original block's attributes.
7 Configuration
The plugin includes a settings page at Settings > CloudScale Code where you can set the default theme (dark or light) for all code blocks on your site. Individual blocks can override this setting using the Theme dropdown in the block’s sidebar inspector.
You can also set the language explicitly per block if auto detection isn’t picking the right one. The language selector supports 28 languages including Bash, Python, JavaScript, TypeScript, Java, Go, Rust, SQL, YAML, Docker, and more.
8 Shortcode Support
For classic editor users or anywhere you need code highlighting outside of Gutenberg, the plugin provides a shortcode called cs_code. Wrap your code between the opening and closing tags and optionally set the language, title, and theme parameters. The shortcode renders identically to the Gutenberg block on the frontend, complete with syntax highlighting, copy button, and theme toggle.
If you want to display a literal shortcode example in a post, insert a CloudScale Code Block in the editor and type the example inside it. This prevents WordPress from trying to execute the shortcode during paste.
9 Fourteen Color Themes
The original release shipped with Atom One as the only color scheme. That’s a fine default but if you’re running a blog with a specific visual identity, you want options. Version 1.7 expands the selection to 14 of the most popular syntax highlighting themes, all loaded directly from the highlight.js CDN with zero local files.
The full theme list: Atom One, GitHub, Monokai, Nord, Dracula, Tokyo Night, VS 2015 / VS Code, Stack Overflow, Night Owl, Gruvbox, Solarized, Panda, Tomorrow Night, and Shades of Purple.
Each theme comes in both a dark and light variant. When you select a theme in the settings, the plugin loads the appropriate dark and light CSS files from the CDN. The frontend toggle button switches between the two variants of your chosen theme. So if you pick Dracula, your visitors see Dracula Dark by default and can toggle to Dracula Light. If you pick Solarized, they get Solarized Dark and Solarized Light.
To change the theme, go to Tools > CloudScale Code and SQL. The Code Block Settings panel at the top of the Code Migrator tab has a Color Theme dropdown. Pick your theme and click Save Settings. The change applies site wide immediately, no page reload required.
Under the hood the theme system uses a registry pattern. Each theme entry defines its CDN filenames, dark background color, dark toolbar color, light background color, and light toolbar color. The frontend CSS uses CSS custom properties for all theme dependent values (background, toolbar, scrollbar, line numbers, hover states). When the page loads, JavaScript reads the theme colors from the registry and sets the custom properties on each code block wrapper. This means any new theme can be added to the registry without touching the CSS or JavaScript.
10 The Merged Admin Interface
In earlier versions, the Code Block Migrator and the SQL Command tool were separate plugins with separate admin pages. Version 1.6 merged everything into a single plugin with a unified admin interface at Tools > CloudScale Code and SQL.
The admin page uses a tabbed layout with two tabs: Code Migrator and SQL Command. The Code Migrator tab includes the inline settings panel (color theme and default mode) at the top, followed by the scan and migrate controls. The SQL Command tab has the query editor, results table, and quick query buttons.
The styling matches the CloudScale Page Views plugin exactly. You get the same navy gradient banner across the top, the dark tab bar with an orange underline on the active tab, white card panels with colored gradient section headers, and the same button styles, spacing, and typography throughout. If you’re running multiple CloudScale plugins, your Tools menu now has a consistent visual language across all of them.
11 SQL Command Tool
This is the feature I built for myself and use almost daily. If you manage a WordPress site and need to check database health, find bloat, or debug migration issues, you normally have to SSH into the server and run MySQL from the command line, or install phpMyAdmin, or use a separate database client. The SQL Command tool gives you a read only query interface right inside the WordPress admin.
Go to Tools > CloudScale Code and SQL and click the SQL Command tab. You’ll see a dark themed query editor at the top with a Run Query button. Type any SELECT, SHOW, DESCRIBE, or EXPLAIN query and press Enter (or Ctrl+Enter, or click the button). Results appear in a scrollable table below the editor with sticky column headers, striped rows, and hover highlighting.
The tool is strictly read only. All write operations are blocked at the PHP level before the query reaches the database. INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE, CREATE, RENAME, REPLACE, LOAD, and GRANT are all rejected. The validation runs server side so it cannot be bypassed from the browser. You also need the manage_options capability, which means only WordPress administrators can access it.
11.1 Quick Queries
Below the query editor you’ll find 14 preset queries organized into four groups. Click any button to populate the editor and run the query immediately.
Health and Diagnostics gives you three queries. Database Health Check returns your MySQL version, max connections, wait timeout, max allowed packet size, and current database name. Site Identity Options pulls the six key values from wp_options: site URL, home URL, blog name, description, WordPress version, and database version. Table Sizes and Rows shows every table in your database with its storage engine, row count, data size in megabytes, index size, and total size, sorted largest first.
Content Summary has three queries. Posts by Type and Status gives you a grouped count of every post type and status combination in your database, which is useful for spotting unexpected post types from plugins. Site Stats Summary runs a single query that returns your total published posts, revision count, auto drafts, trashed items, total comments, spam comments, user count, and transient count. Latest 20 Published Posts shows your most recent content with title, publish date, and status.
Bloat and Cleanup Checks has four queries for finding waste. Orphaned Postmeta counts metadata rows where the parent post no longer exists. Expired Transients counts transient timeout entries that have passed their expiry. Revisions, Drafts and Trash shows how many revision posts, auto drafts, and trashed items are sitting in your database. Largest Autoloaded Options lists the 20 biggest entries in wp_options that have autoload set to yes, sorted by value size, which is usually the first place to look when your options table is bloated.
URL and Migration Helpers has four queries for sites that have changed domains or moved to HTTPS. HTTP References finds any wp_options rows still referencing HTTP versions of your domain. Posts with HTTP GUIDs finds posts where the GUID column still uses HTTP. Old IP References checks postmeta for values containing a specific IP address pattern (useful after migrating away from a legacy server). Posts Missing Meta Descriptions finds published posts that don’t have a CloudScale SEO meta description set, which is helpful for working through your SEO backlog.
11.2 Keyboard Shortcuts
Press Enter to run the query. Use Shift+Enter to insert a newline if you need to write a multiline query. Ctrl+Enter (or Cmd+Enter on Mac) also runs the query. The Clear button wipes both the editor and the results table.
12 Updated Configuration
With the merge, the old Settings > CloudScale Code Block page no longer exists. All settings have moved to the inline panel on the Code Migrator tab at Tools > CloudScale Code and SQL. You’ll find two dropdowns: Color Theme (the 14 theme options) and Default Mode (dark or light). Changes save via AJAX with no page reload.
In the Gutenberg editor sidebar, each individual block still has a Theme Override dropdown with Default, Dark, and Light options. Setting it to Default means the block follows the site wide setting. Setting it to Dark or Light forces that mode regardless of the site wide default. The help text in the sidebar now points to the Tools page for site wide theme selection.
The language selector in the editor sidebar has also been expanded. In addition to the original 28 languages, you can now select HCL/Terraform and TOML, bringing the total to 30 supported languages.
13 Plugin Architecture
For developers interested in the internals, the merged plugin is a single PHP class (CS_Code_Block) with all functionality in one file. The admin interface uses inline CSS embedded directly in the page output rather than external stylesheet files. This is the same approach used by the CloudScale Page Views plugin and it eliminates browser caching issues entirely. The styles render correctly on first load every time, regardless of WordPress configuration, caching plugins, or CDN setups.
The theme registry is a static method that returns an associative array keyed by theme slug. Each entry contains the human readable label, dark CSS filename, light CSS filename, and four hex color values for backgrounds and toolbars. Adding a new theme means adding one array entry. The frontend JavaScript reads the active theme’s colors via wp_localize_script and sets CSS custom properties on each code block wrapper at page load.
The SQL query validation uses a whitelist approach. The is_safe_query method strips comments and checks that the query starts with SELECT, SHOW, DESCRIBE, or EXPLAIN. Everything else is rejected before it reaches wpdb. The AJAX handler also verifies a nonce and the manage_options capability on every request.
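For illustration, here is a minimal Python restatement of that whitelist logic. The plugin implements this in PHP as is_safe_query; this sketch reconstructs the behaviour described above rather than quoting the actual source:

```python
import re

def is_safe_query(sql: str) -> bool:
    """Sketch of a read-only SQL whitelist: strip comments first so a write
    statement cannot hide behind them, then require the query to begin with
    one of the permitted read-only verbs."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.S)   # block comments
    sql = re.sub(r"(--|#)[^\n]*", " ", sql)            # line comments
    return re.match(r"\s*(SELECT|SHOW|DESCRIBE|EXPLAIN)\b", sql, re.I) is not None
```

Note the order of operations: comment stripping must happen before the prefix check, otherwise `/* x */ DROP TABLE ...` would be misread. A whitelist like this belongs server side, as the text says, so it cannot be bypassed from the browser.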
Quick query buttons are defined as HTML data attributes containing the full SQL string. Clicking a button copies the SQL into the textarea and triggers the run function. This keeps the query definitions in the PHP template where they can reference the WordPress table prefix dynamically, rather than hardcoding table names in JavaScript.
14 Wrapping Up
CloudScale Code Block is purpose built for technical bloggers who want clean, highlighted code on their WordPress sites without fighting the editor. The paste convert workflow means you can write in markdown, paste into Gutenberg, click one button, and publish. The built in migration tool means your existing content gets the same treatment without manual editing.
The plugin is free and open source. Grab it using the link above and let me know how it works for you.
Running WordPress on ARM-based Graviton instances delivers up to 40% better price-performance compared to x86 equivalents. This guide provides production-ready scripts to deploy an optimised WordPress stack in minutes, plus everything you need to migrate your existing site.
Why Graviton for WordPress?
Graviton3 processors deliver:
40% better price-performance vs comparable x86 instances
Up to 25% lower cost for equivalent workloads
60% less energy consumption per compute hour
Native ARM64 optimisations for PHP 8.x and MariaDB
The t4g.small instance (2 vCPU, 2GB RAM) at ~$12/month handles most WordPress sites comfortably. For higher traffic, t4g.medium or c7g instances scale beautifully.
```shell
# 1. Launch instance (local machine)
./launch-graviton-wp.sh

# 2. SSH in and setup WordPress
ssh -i ~/.ssh/key.pem ec2-user@IP
sudo ./setup-wordpress.sh

# 3. If migrating - on old server
./wp-export.sh
scp /tmp/wp-migration/wordpress-migration-*.tar.gz ec2-user@NEW_IP:/tmp/

# 4. If migrating - on new server
sudo ./wp-import.sh /tmp/wordpress-migration-*.tar.gz
```
This setup delivers a production-ready WordPress installation that’ll handle significant traffic while keeping your AWS bill minimal. The combination of Graviton’s price-performance, Caddy’s efficiency, and properly-tuned PHP creates a stack that punches well above its weight class.
1. Backups Should Be Boring (and That Is the Point)
Backups are boring. They should be boring. A backup system that generates excitement is usually signalling failure.
The only time backups become interesting is when they are missing, and that interest level is lethal. Emergency bridges. Frozen change windows. Executive escalation. Media briefings. Regulatory apology letters. Engineers being asked questions that have no safe answers.
Most backup platforms are built for the boring days. Rubrik is designed for the day boredom ends.
2. Backup Is Not the Product. Restore Is.
Many organisations still evaluate backup platforms on the wrong metric: how fast they can copy data somewhere else.
That metric is irrelevant during an incident.
When things go wrong, the only questions that matter are:
What can I restore?
How fast can it be used?
How many restores can run in parallel?
How little additional infrastructure is required?
Rubrik treats restore as the primary product, not a secondary feature.
3. Architectural Starting Point: Designed for Failure, Not Demos
Rubrik was built without tape era assumptions. There is no central backup server, no serial job controller, and no media server bottleneck. Instead, it uses a distributed, scale out architecture with a global metadata index and a stateless policy engine.
Restore becomes a metadata lookup problem, not a job replay problem. This distinction is invisible in demos and decisive during outages.
4. Performance Metrics That Actually Matter
Backup throughput is easy to optimise and easy to market. Restore performance is constrained by network fan out, restore concurrency, control plane orchestration, and application host contention.
Rubrik addresses this by default through parallel restore streams, linear scaling with node count, and minimal control plane chatter. Restore performance becomes predictable rather than optimistic.
5. Restore Semantics That Match Reality
The real test of any backup platform is not how elegantly it captures data, but how usefully it returns that data when needed. This is where architectural decisions made years earlier either pay dividends or extract penalties.
5.1 Instant Access Instead of Full Rehydration
Rubrik does not require full data copy back before access. It supports live mount of virtual machines, database mounts directly from backup storage, and file system mounts for selective recovery.
The recovery model becomes access first, copy later if needed. This is the difference between minutes and hours when production is down.
5.2 Dropping a Table Should Not Be a Crisis
Rubrik understands databases as structured systems, not opaque blobs.
It supports table level restores for SQL Server, mounting a database backup as a live database, extracting tables or schemas without restoring the full database, and point in time recovery without rollback.
Accidental table drops should be operational annoyances, not existential threats.
5.3 Supported Database Engines
Rubrik provides native protection for the major enterprise database platforms:
| Database Engine | Live Mount | Point in Time Recovery | Key Constraints |
| --- | --- | --- | --- |
| Microsoft SQL Server | Yes | Yes (transaction log replay) | SQL 2012+ supported; Always On AG, FCI, standalone |
| Oracle Database | Yes | Yes (archive log replay) | RAC, Data Guard, Exadata supported; SPFILE required for automated recovery |
| SAP HANA | No | Yes | Backint API integration; uses native HANA backup scheduling |
| PostgreSQL | No | Yes (up to 5 minute RPO) | File level incremental; on premises and cloud (AWS, Azure, GCP) |
| IBM Db2 | Via Elastic App Service | Yes | Uses native Db2 backup utilities |
| MongoDB | Via Elastic App Service | Yes | Sharded and unsharded clusters; no quiescing required |
| MySQL | Via Elastic App Service | Yes | Uses native MySQL backup tools |
| Cassandra | Via Elastic App Service | Yes | Via Rubrik Datos IO integration |
The distinction between native integration and Elastic App Service matters operationally. Native integration means Rubrik handles discovery, scheduling, and orchestration directly. Elastic App Service means Rubrik provides managed volumes as backup targets while the database’s native tools handle the actual backup process. Both approaches deliver immutability and policy driven retention, but the operational experience differs.
5.4 Live Mount: Constraints and Caveats
Live Mount is Rubrik’s signature capability—mounting backups as live, queryable databases without copying data back to production storage. The database runs with its data files served directly from the Rubrik cluster over NFS (for Oracle) or SMB 3.0 (for SQL Server).
This capability is transformative for specific use cases. It is not a replacement for production storage.
What Live Mount Delivers:
Near instant database availability (seconds to minutes, regardless of database size)
Zero storage provisioning on the target host
Multiple concurrent mounts from the same backup
Point in time access across the entire retention window
Ideal for granular recovery, DBCC health checks, test/dev cloning, audit queries, and upgrade validation
What Live Mount Does Not Deliver:
Production grade I/O performance
High availability during Rubrik cluster maintenance
Persistence across host or cluster reboots
IOPS Constraints:
Live Mount performance is bounded by the Rubrik appliance’s ability to serve I/O, not by the target host’s storage subsystem. Published figures suggest approximately 30,000 IOPS per Rubrik appliance for Live Mount workloads. This is adequate for reporting queries, data extraction, and validation testing. It is not adequate for transaction heavy production workloads.
The performance characteristics are inherently different from production storage:
| Metric | Production SAN/Flash | Rubrik Live Mount |
| --- | --- | --- |
| Random read IOPS | 100,000+ | ~30,000 per appliance |
| Latency profile | Sub millisecond | Network + NFS overhead |
| Write optimisation | Production tuned | Backup optimised |
| Concurrent workloads | Designed for contention | Shared with backup operations |
SQL Server Live Mount Specifics:
Databases mount via SMB 3.0 shares with UNC paths
Transaction log replay occurs during mount for point in time positioning
The mounted database is read write, but writes go to the Rubrik cluster
Supported for standalone instances, Failover Cluster Instances, and Always On Availability Groups
Table level recovery requires mounting the database, then using T-SQL to extract and import specific objects
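In practice, table level recovery is a mount-then-copy operation. A minimal sketch of the extraction step (database, schema, and table names are hypothetical; it assumes the Live Mount is attached to the same instance as the production database):

```python
def build_table_recovery_sql(mount_db, prod_db, schema, table):
    """Build the T-SQL needed to pull one table out of a Live Mounted
    database and back into production. Names are illustrative."""
    src = f"[{mount_db}].[{schema}].[{table}]"
    dst = f"[{prod_db}].[{schema}].[{table}]"
    return [
        # SELECT INTO recreates the table shape and copies the rows from
        # the mounted snapshot in one statement. Note that indexes and
        # constraints must be recreated separately.
        f"SELECT * INTO {dst} FROM {src};",
        # Verify the row count before unmounting the snapshot.
        f"SELECT COUNT(*) AS restored_rows FROM {dst};",
    ]

for stmt in build_table_recovery_sql("Sales_LiveMount", "Sales", "dbo", "Orders"):
    print(stmt)
```

The statements would then be executed against the instance with whatever driver your tooling already uses; the point is that the mounted database is addressable like any other database on the instance.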
Oracle Live Mount Specifics:
Data files mount via NFS; redo logs and control files remain on the target host
Automated recovery requires source and target configurations to match (RAC to RAC, single instance to single instance, ASM to ASM)
Files only recovery allows dissimilar configurations but requires DBA managed RMAN recovery
SPFILE is required for automated recovery; PFILE databases require manual intervention
Block change tracking (BCT) is disabled on Live Mount targets
Live Mount fails if the target host, RAC cluster, or Rubrik cluster reboots during the mount—requiring forced unmount to clean up metadata
Direct NFS (DNFS) is recommended on Oracle RAC nodes for improved recovery performance
What Live Mount Is Not:
Live Mount is explicitly designed for temporary access, not sustained production workloads. The use cases Rubrik markets—test/dev, DBCC validation, granular recovery, audit queries—all share a common characteristic: they are time bounded operations that tolerate moderate I/O performance in exchange for instant availability.
Running production transaction processing against a Live Mount database would be technically possible and operationally inadvisable. The I/O profile, the network dependency, and the lack of high availability guarantees make it unsuitable for workloads where performance and uptime matter.
5.5 The Recovery Hierarchy
Understanding when to use each recovery method matters:
| Recovery Need | Recommended Method | Time to Access | Storage Required |
| --- | --- | --- | --- |
| Extract specific rows/tables | Live Mount + query | Minutes | None |
| Validate backup integrity | Live Mount + DBCC | Minutes | None |
| Clone for test/dev | Live Mount | Minutes | None |
| Full database replacement | Export/Restore | Hours (size dependent) | Full database size |
| Disaster recovery cutover | Instant Recovery | Minutes (then migrate) | Temporary, then full |
The strategic value of Live Mount is avoiding full restores when full restores are unnecessary. For a 5TB database where someone dropped a single table, Live Mount means extracting that table in minutes rather than waiting hours for a complete restore.
For actual disaster recovery, where the production database is gone and must be replaced, Live Mount provides bridge access while the full restore completes in parallel. The database is queryable immediately; production grade performance follows once data migration finishes.
5.6 The Hidden Failure Mode After a Successful Restore
Rubrik is not deployed in a single explosive moment. In the real world, it is rolled out carefully over weeks. Systems are onboarded one by one, validated, and then left to settle. Each system performs a single full backup, after which life becomes calm and predictable. From that point forward, everything is incremental. Deltas are small, backup windows shrink, networks breathe easily, and the platform looks deceptively relaxed.
This operating state creates a dangerous illusion.
After a large scale recovery event, you will spend hours restoring systems. That work feels like the crisis. It is not. The real stress event happens later, quietly, on the first night after the restores complete. Every restored system now believes it is brand new. Every one of them schedules a full backup. At that moment, your entire estate attempts to perform a first full backup simultaneously while still serving live traffic.
This is the point where Rubrik appliances, networks, and upstream storage experience their true failure conditions. Not during the restore, but after it. Massive ingest rates, saturated links, constrained disk, and queueing effects all arrive at once. If this scenario is not explicitly planned for, the recovery that looked successful during the day can cascade into instability overnight.
Recovery planning therefore cannot stop at restore completion. Backup re entry must be treated as a first class recovery phase. In most environments, the only viable strategy is to deliberately phase backup schedules over multiple days following a large scale restore. Systems must be staggered back into protection in controlled waves, rather than allowed to collide into a single catastrophic full backup storm.
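The staggering logic itself is simple scheduling arithmetic. A sketch, assuming you know each system's approximate full backup size and the nightly ingest your cluster and network can absorb (all names and figures here are hypothetical):

```python
def plan_backup_waves(systems, max_ingest_tb_per_night):
    """Pack restored systems into nightly 'waves' so that first full
    backups never exceed the cluster's nightly ingest budget.
    systems: list of (name, full_backup_size_tb) tuples."""
    waves, current, load = [], [], 0.0
    # Largest systems first, so each wave is packed as tightly as possible.
    for name, size_tb in sorted(systems, key=lambda s: -s[1]):
        if current and load + size_tb > max_ingest_tb_per_night:
            waves.append(current)
            current, load = [], 0.0
        current.append(name)
        load += size_tb
    if current:
        waves.append(current)
    return waves

restored = [("erp-db", 5.0), ("crm-db", 2.0), ("warehouse", 8.0), ("fileshare", 1.0)]
print(plan_backup_waves(restored, max_ingest_tb_per_night=8.0))
# → [['warehouse'], ['erp-db', 'crm-db', 'fileshare']]
```

A greedy first-fit plan like this is not optimal packing, but optimality is not the goal; the goal is a hard ceiling on nightly ingest so the post-restore full backup storm never arrives all at once.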
Restore is the product. But what comes after restore is where architectures either hold, or quietly collapse.
6. Why Logical Streaming Is a Design Failure
Traditional restore models stream backup data through the database host. This guarantees CPU contention, I/O pressure, and restore times proportional to database size rather than change size.
Figure 1 illustrates this contrast clearly. Traditional restore requires data to be copied back through the database server, creating high I/O, CPU and network load with correspondingly long restore times. Rubrik’s Live Mount approach mounts the backup copy directly, achieving near zero RTO with minimal data movement. The difference between these approaches becomes decisive when production is down and every minute of restore time translates to business impact.
Rubrik avoids this by mounting database images and extracting only required objects. The database host stops being collateral damage during recovery.
6.1 The VSS Tax: Why SQL Server Backups Cannot Escape Application Coordination
For VMware workloads without databases, Rubrik can leverage storage level snapshots that are instantaneous, application agnostic, and impose zero load on the guest operating system. The hypervisor freezes the VM state, the storage array captures the point in time image, and the backup completes before the application notices.
SQL Server cannot offer this simplicity. The reason is not a Microsoft limitation or a Rubrik constraint. The reason is transactional consistency.
6.1.1 The Crash Consistent Option Exists
Nothing technically prevents Rubrik, or any backup tool, from taking a pure storage snapshot of a SQL Server volume without application coordination. The snapshot would complete in milliseconds with zero database load.
The problem is what you would recover: a crash consistent image, not an application consistent one.
A crash consistent snapshot captures storage state mid flight. This includes partially written pages, uncommitted transactions, dirty buffers not yet flushed to disk, and potentially torn writes caught mid I/O. SQL Server is designed to recover from exactly this state. Every time the database engine starts after an unexpected shutdown, it runs crash recovery, rolling forward committed transactions from the log and rolling back uncommitted ones.
The database will become consistent. Eventually. Probably.
6.1.2 Why Probably Is Not Good Enough
Crash recovery works. It works reliably. It is tested millions of times daily across every SQL Server instance that experiences an unclean shutdown.
But restore confidence matters. When production is down and executives are asking questions, the difference between “this backup is guaranteed consistent” and “this backup should recover correctly after crash recovery completes” is operationally significant.
VSS exists to eliminate that uncertainty.
6.1.3 What VSS Actually Does
When a backup application requests an application consistent SQL Server snapshot, the sequence shown in Figure 2 executes. The backup server sends a signal through VSS Orchestration, which triggers the SQL Server VSS Writer to prepare the database. This preparation involves flushing dirty pages to storage, hardening transaction logs, and momentarily freezing I/O. Only then does the storage-level snapshot execute, capturing a point-in-time consistent image that requires no crash recovery on restore.
The result is a snapshot that requires no crash recovery on restore. The database is immediately consistent, immediately usable, and carries no uncertainty about transactional integrity.
6.1.4 The Coordination Cost
The VSS freeze window is typically brief, milliseconds to low seconds. But the preparation is not free.
Buffer pool flushes on large databases generate I/O pressure. Checkpoint operations compete with production workloads. The freeze, however short, introduces latency for in flight transactions. The database instance is actively participating in its own backup.
For databases measured in terabytes, with buffer pools consuming hundreds of gigabytes, this coordination overhead becomes operationally visible. Backup windows that appear instantaneous from the storage console are hiding real work inside the SQL Server instance.
6.1.5 The Architectural Asymmetry
This creates a fundamental difference in backup elegance across workload types:
| Workload Type | Backup Method | Application Load | Restore State |
| --- | --- | --- | --- |
| VMware VM (no database) | Storage snapshot | Zero | Crash consistent (acceptable) |
| VMware VM (with SQL Server) | VSS coordinated snapshot | Moderate | Application consistent |
| Physical SQL Server | VSS coordinated snapshot | Moderate to high | Application consistent |
| Physical SQL Server | Pure storage snapshot | Zero | Crash consistent (risky) |
For a web server or file share, crash consistent is fine. The application has no transactional state worth protecting. For a database, crash consistent means trusting recovery logic rather than guaranteeing consistency.
6.1.6 The Uncomfortable Reality
The largest, most critical SQL Server databases, the ones that would benefit most from zero overhead instantaneous backup, are precisely the workloads where crash consistent snapshots carry the most risk. More transactions in flight. Larger buffer pools. More recovery time if something needs replay.
Rubrik supports VSS coordination because the alternative is shipping backups that might need crash recovery. That uncertainty is acceptable for test environments. It is rarely acceptable for production databases backing financial systems, customer records, or regulatory reporting.
The VSS tax is not a limitation imposed by Microsoft or avoided by competitors. It is the cost of consistency. Every backup platform that claims application consistent SQL Server protection is paying it. The only question is whether they admit the overhead exists.
7. Snapshot Based Protection Is Objectively Better (When You Can Get It)
The previous section explained why SQL Server backups cannot escape application coordination. VSS exists because transactional consistency requires it, and the coordination overhead is the price of certainty.
This makes the contrast with pure snapshot based protection even starker. Where snapshots work cleanly, they are not incrementally better. They are categorically superior.
7.1 What Pure Snapshots Deliver
Snapshot based backups in environments that support them provide:
Near instant capture: microseconds to milliseconds, regardless of dataset size
Zero application load: the workload never knows a backup occurred
Consistent recovery points: the storage layer guarantees point in time consistency
Predictable backup windows: duration is independent of data volume
No bandwidth consumption during capture: data movement happens later, asynchronously
A 50TB VMware datastore snapshots in the same time as a 50GB datastore. Backup windows become scheduling decisions rather than capacity constraints.
Rubrik exploits this deeply in VMware environments. Snapshot orchestration, instant VM recovery, and live mounts all depend on the hypervisor providing clean, consistent, zero overhead capture points.
7.2 Why This Is Harder Than It Looks
The elegance of snapshot based protection depends entirely on the underlying platform providing the right primitives. This is where the gap between VMware and everything else becomes painful.
VMware offers:
Native snapshot APIs with transactional semantics
Changed Block Tracking (CBT) for efficient incrementals
Hypervisor level consistency without guest coordination
Storage integration through VADP (vSphere APIs for Data Protection)
These are not accidental features. VMware invested years building a backup ecosystem because they understood that enterprise adoption required operational maturity, not just compute virtualisation.
Physical hosts offer none of this.
There is no universal snapshot API for bare metal servers. Storage arrays provide snapshot capabilities, but each vendor implements them differently, with different consistency guarantees, different integration points, and different failure modes. The operating system has no standard mechanism to coordinate application state with storage level capture.
7.3 The Physical Host Penalty
This is why physical SQL Server hosts face a compounding disadvantage:
No hypervisor abstraction: there is no layer between the OS and storage that can freeze state cleanly
VSS remains mandatory: application consistency still requires database coordination
No standardised incremental tracking: without CBT or equivalent, every backup must rediscover what changed
Storage integration is bespoke: each array, each SAN, each configuration requires specific handling
The result is that physical hosts with the largest databases (the workloads generating the most backup data, with the longest restore times, under the most operational pressure) receive the least architectural benefit from modern backup platforms.
They are stuck paying the VSS tax without receiving the snapshot dividend.
7.4 The Integration Hierarchy
Backup elegance follows a clear hierarchy based on platform integration depth:
| Environment | Snapshot Quality | Incremental Efficiency | Application Consistency | Overall Experience |
| --- | --- | --- | --- | --- |
| VMware (no database) | Excellent | CBT driven | Not required | Seamless |
| VMware (with SQL Server) | Excellent | CBT driven | VSS coordinated | Good with overhead |
| Cloud native (EBS, managed disks) | Good | Provider dependent | Varies by workload | Generally clean |
| Physical with enterprise SAN | Possible | Array dependent | VSS coordinated | Complex but workable |
| Physical with commodity storage | Limited | Often full scan | VSS coordinated | Painful |
The further down this hierarchy, the more the backup platform must compensate for missing primitives. Rubrik handles this better than most, but even excellent software cannot conjure APIs that do not exist.
7.5 Why the Industry Irony Persists
The uncomfortable truth is that snapshot based protection delivers its greatest value precisely where it is least available.
A 500GB VMware VM snapshots effortlessly. The hypervisor provides everything needed. Backup is boring, as it should be.
A 50TB physical SQL Server, the database actually keeping the business running, containing years of transactional history, backing regulatory reporting and financial reconciliation, must coordinate through VSS, flush terabytes of buffer pool, sustain I/O pressure during capture, and hope the storage layer cooperates.
The workloads that need snapshot elegance the most are architecturally prevented from receiving it.
This is not a Rubrik limitation. It is not a Microsoft conspiracy. It is the accumulated consequence of decades of infrastructure evolution where virtualisation received backup investment and physical infrastructure did not.
7.6 What This Means for Architecture Decisions
Understanding this hierarchy should influence infrastructure strategy:
Virtualise where possible. The backup benefits alone often justify the overhead. A SQL Server VM with VSS coordination still benefits from CBT, instant recovery, and hypervisor level orchestration.
Choose storage with snapshot maturity. If physical hosts are unavoidable, enterprise arrays with proven snapshot integration reduce the backup penalty. This is not the place for commodity storage experimentation.
Accept the VSS overhead. For SQL Server workloads, crash consistent snapshots are technically possible but operationally risky. The coordination cost is worth paying. Budget for it in backup windows and I/O capacity.
Plan restore, not backup. Snapshot speed is irrelevant if restore requires hours of data rehydration. The architectural advantage of snapshots extends to recovery only if the platform supports instant mount and selective restore.
Rubrik’s value in this landscape is not eliminating the integration gaps, nobody can, but navigating them intelligently. Where snapshots work, Rubrik exploits them fully. Where they do not, Rubrik minimises the penalty through parallel restore, live mounts, and metadata driven recovery.
The goal remains the same: make restore the product, regardless of how constrained the backup capture had to be.
8. Rubrik Restore Policies: Strategy, Trade offs, and Gotchas
SLA Domains are Rubrik’s policy abstraction layer, and understanding how to configure them properly separates smooth recoveries from painful ones. The flexibility is substantial, but so are the consequences of misconfiguration.
8.1 Understanding SLA Domain Architecture
Rubrik’s policy model centres on SLA Domains, named policies that define retention, frequency, replication, and archival behaviour. Objects are assigned to SLA Domains rather than configured individually, which creates operational leverage but requires upfront design discipline.
The core parameters that matter for restore planning:
Snapshot Frequency determines your Recovery Point Objective (RPO). A 4-hour frequency means you could lose up to 4 hours of data. For SQL Server with log backup enabled, transaction logs between snapshots reduce effective RPO to minutes, but the full snapshot frequency still determines how quickly you can access a baseline restore point.
Local Retention controls how many snapshots remain on the Rubrik cluster for instant access. This is your Live Mount window. Data within local retention restores in minutes. Data beyond it requires rehydration from archive, which takes hours.
Replication copies snapshots to a secondary Rubrik cluster, typically in another location. This is your disaster recovery tier. Replication targets can serve Live Mount operations, meaning DR isn’t just “eventually consistent backup copies” but actual instant recovery capability at the secondary site.
Archival moves aged snapshots to object storage (S3, Azure Blob, Google Cloud Storage). Archive tier data cannot be Live Mounted, it must be retrieved first, which introduces retrieval latency and potentially egress costs.
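The four parameters above can be sketched as a small model; the field names are ours for illustration, not Rubrik's API schema, but thinking in these terms makes the restore-planning consequences of each setting explicit:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SlaDomain:
    """Illustrative model of the policy parameters that drive restore planning."""
    name: str
    snapshot_every_hours: float        # sets the worst-case RPO
    local_retention_days: int          # the Live Mount window
    replication_target: Optional[str]  # secondary cluster, or None
    archive_after_days: Optional[int]  # beyond this, restores rehydrate

    def worst_case_rpo_hours(self) -> float:
        # Without log backups, you can lose up to one snapshot interval.
        return self.snapshot_every_hours

    def instant_restore_window_days(self) -> int:
        # Only snapshots still held locally can be Live Mounted.
        return self.local_retention_days

platinum = SlaDomain("Platinum", 4, 30, "dr-cluster", 30)
print(platinum.worst_case_rpo_hours(), platinum.instant_restore_window_days())
# → 4 30
```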
8.2 The Retention vs. Recovery Speed Trade off
This is where most organisations get the policy design wrong.
The temptation is to keep minimal local retention and archive aggressively to reduce storage costs. The consequence is that any restore request older than a few days becomes a multi hour operation.
Consider the mathematics for a 5TB SQL Server database:
| Recovery Scenario | Local Retention | Time to Access | Operational Impact |
| --- | --- | --- | --- |
| Yesterday’s backup | Within local retention | 2-5 minutes (Live Mount) | Minimal |
| Last week’s backup | Within local retention | 2-5 minutes (Live Mount) | Minimal |
| Last month’s backup | Archived | 4-8 hours (retrieval + restore) | Significant |
| Last quarter’s backup | Archived (cold tier) | 12-24 hours | Major incident |
The storage cost of keeping 30 days local versus 7 days local might seem significant when multiplied across the estate. But the operational cost of a 6 hour restore delay during an audit request or compliance investigation often exceeds years of incremental storage spend.
Recommendation: Size local retention to cover your realistic recovery scenarios, not your theoretical minimum. For most organisations, 14-30 days of local retention provides the right balance between cost and operational flexibility.
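The trade off is easy to put numbers on. A back-of-envelope sketch, where the change rate and storage price are assumptions you should replace with your own:

```python
def extra_local_retention_cost(db_tb, daily_change_rate, extra_days, usd_per_tb_month):
    """Annual cost of extending local retention: the extra incrementals
    held on the cluster, priced at an assumed local storage rate."""
    extra_tb = db_tb * daily_change_rate * extra_days
    return extra_tb * usd_per_tb_month * 12

# 5 TB database, 2% daily change, extending retention from 7 to 30 days
# (23 extra days) at an assumed $20/TB/month local storage cost:
annual = extra_local_retention_cost(5.0, 0.02, extra_days=23, usd_per_tb_month=20.0)
print(f"${annual:,.0f} per year")  # → $552 per year
```

Set that figure against the cost of a single multi-hour archive retrieval during an audit or incident, and the "archive aggressively" position usually collapses.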
8.3 SLA Domain Design Patterns
8.3.1 Pattern 1: Tiered by Criticality
Create separate SLA Domains for different criticality levels:
Platinum: 4 hour snapshots, 30 day local retention, synchronous replication, 7 year archive
Gold: 8 hour snapshots, 14 day local retention, asynchronous replication, 3 year archive
Silver: Daily snapshots, 7 day local retention, no replication, 1 year archive
Bronze: Daily snapshots, 7 day local retention, no replication, 90 day archive
This pattern works well when criticality maps cleanly to workload types, but creates governance overhead when applications span tiers.
8.3.2 Pattern 2: Tiered by Recovery Requirements
Align SLA Domains to recovery time objectives rather than business criticality:
Instant Recovery: Maximum local retention, synchronous replication, Live Mount always available
Same Day Recovery: 14 day local retention, asynchronous replication
Next Day Recovery: 7 day local retention, archive first strategy
This pattern acknowledges that “critical” and “needs instant recovery” aren’t always the same thing. A compliance archive might be business critical but tolerate 24 hour recovery times.
8.3.3 Pattern 3: Application Aligned
Create SLA Domains per major application or database platform:
SQL Server Production
SQL Server Non Production
Oracle Production
VMware Infrastructure
File Shares
This pattern simplifies troubleshooting and reporting but can lead to policy sprawl as the estate grows.
8.4 Log Backup Policies: The Hidden Complexity
For SQL Server and Oracle, snapshot frequency alone doesn’t tell the full story. Transaction log backups between snapshots determine actual RPO.
Rubrik supports log backup frequencies down to 1 minute for SQL Server. The trade offs:
Aggressive Log Backup (1-5 minute frequency):
Sub 5 minute RPO
Higher metadata overhead on Rubrik cluster
More objects to manage during restore
Longer Live Mount preparation time (more logs to replay)
Conservative Log Backup (15-60 minute frequency):
Acceptable RPO for most workloads
Lower operational overhead
Faster Live Mount operations
Simpler troubleshooting
Gotcha: Log backup frequency creates a hidden I/O load on the source database. A 1 minute log backup interval on a high transaction database generates constant log backup traffic. For already I/O constrained databases, this can become the straw that breaks performance.
Recommendation: Match log backup frequency to actual RPO requirements, not aspirational ones. If the business can tolerate 15 minutes of data loss, don’t configure 1 minute log backups just because you can.
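The trade off can be quantified directly. A sketch comparing the two ends of the range, assuming a 4 hour snapshot interval:

```python
def log_backup_profile(log_interval_min, snapshot_interval_hours):
    """What a given log backup frequency actually buys (RPO) and
    costs (backup operations, logs to replay during a mount)."""
    return {
        "worst_case_rpo_min": log_interval_min,
        "log_backups_per_day": (24 * 60) // log_interval_min,
        # Logs that must replay to reach a point just before the next snapshot:
        "max_logs_to_replay": (snapshot_interval_hours * 60) // log_interval_min,
    }

aggressive = log_backup_profile(1, snapshot_interval_hours=4)
conservative = log_backup_profile(15, snapshot_interval_hours=4)
print(aggressive)    # 1440 backups/day, up to 240 logs to replay
print(conservative)  # 96 backups/day, up to 16 logs to replay
```

The 15x difference in daily backup operations, and in logs replayed during Live Mount preparation, is the overhead you pay for shrinking RPO from 15 minutes to 1.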
8.5 Replication Topology Gotchas
Replication seems straightforward, copy snapshots to another cluster, but the implementation details matter.
8.5.1 Gotcha 1: Replication Lag Under Load
Asynchronous replication means the target cluster is always behind the source. During high backup activity (month end processing, batch loads), this lag can extend to hours. If a disaster occurs during this window, you lose more data than your SLA suggests.
Monitor replication lag as an operational metric, not just a capacity planning number.
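A lag monitor needs only two numbers, but the second is the one dashboards tend to omit. A sketch of the arithmetic (the lag budget is an assumption to tune per SLA):

```python
def effective_dr_rpo_hours(snapshot_interval_h, replication_lag_h):
    """Worst-case data loss if the primary site is lost right now:
    one snapshot interval plus however far replication has fallen behind."""
    return snapshot_interval_h + replication_lag_h

def lag_status(replication_lag_h, lag_budget_h):
    return "ok" if replication_lag_h <= lag_budget_h else "ALERT: DR RPO breached"

# 4 hour snapshots, replication 3 hours behind during month end processing:
print(effective_dr_rpo_hours(4, 3))   # → 7
print(lag_status(3, lag_budget_h=1))  # → ALERT: DR RPO breached
```

The point of the alert is exactly the scenario in the text: the SLA says 4 hours, but during month end the real exposure is 7.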
8.5.2 Gotcha 2: Bandwidth Contention with Production Traffic
Replication competes for the same network paths as production traffic. If your backup replication saturates a WAN link, production application performance degrades.
Either implement QoS policies to protect production traffic, or schedule replication during low utilisation windows. Rubrik supports replication scheduling, but the default is “as fast as possible,” which isn’t always appropriate.
8.5.3 Gotcha 3: Cascaded Replication Complexity
For multi site architectures, you might configure Site A → Site B → Site C replication. Each hop adds latency and failure modes. A Site B outage breaks the chain to Site C.
Consider whether hub and spoke (Site A replicates independently to both B and C) better matches your DR requirements, despite the additional bandwidth consumption.
8.6 Archive Tier Selection: Retrieval Time Matters
Object storage isn’t monolithic. The choice between storage classes has direct recovery implications.
| Storage Class | Typical Retrieval Time | Use Case |
| --- | --- | --- |
| S3 Standard / Azure Hot | Immediate | Frequently accessed archives |
| S3 Standard-IA / Azure Cool | Immediate (higher retrieval cost) | Infrequent but urgent access |
| S3 Glacier Instant Retrieval | Milliseconds | Compliance archives with occasional audit access |
| S3 Glacier Flexible Retrieval | 1-12 hours | Long-term retention with rare access |
| S3 Glacier Deep Archive | 12-48 hours | Legal hold, never access unless subpoenaed |
Gotcha: Rubrik’s archive policy assigns snapshots to a single storage class. If your retention spans 7 years, all 7 years of archives pay the same storage rate, even though year 1 archives are accessed far more frequently than year 7 archives.
Recommendation: Consider tiered archive policies—recent archives to Standard-IA, aged archives to Glacier. This requires multiple SLA Domains and careful lifecycle management, but the cost savings compound significantly at scale.
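One way to express tiering at the bucket level is a standard S3 lifecycle configuration. The structure below is the normal boto3/S3 shape with illustrative prefix and day thresholds; it is a sketch of the idea, not a recommendation to transition objects underneath Rubrik's archive location. Verify with Rubrik that your version can retrieve from each target class before enabling anything like this.

```python
# Illustrative S3 lifecycle configuration (boto3 shape). The prefix and
# day thresholds are hypothetical assumptions for the example.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-tiering",
            "Filter": {"Prefix": "rubrik-archive/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90,   "StorageClass": "STANDARD_IA"},
                {"Days": 365,  "StorageClass": "GLACIER_IR"},
                {"Days": 1095, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

# Applied with something like:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="backup-archive-bucket", LifecycleConfiguration=lifecycle_config)
print(len(lifecycle_config["Rules"][0]["Transitions"]))  # → 3
```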
8.7 Policy Assignment Gotchas
8.7.1 Gotcha 1: Inheritance and Override Conflicts
Rubrik supports hierarchical policy assignment (cluster → host → database). When policies conflict, the resolution logic isn’t always intuitive. A database with an explicit SLA assignment won’t inherit changes made to its parent host’s policy.
Document your policy hierarchy explicitly. During audits, the question “what policy actually applies to this database?” should have an immediate, verifiable answer.
8.7.2 Gotcha 2: Pre script and Post script Failures
Custom scripts for application quiescing or notification can fail, and failure handling varies. A pre script failure might skip the backup entirely (safe but creates a gap) or proceed without proper quiescing (dangerous).
Test script failure modes explicitly. Know what happens when your notification webhook is unreachable or your custom quiesce script times out.
8.7.3 Gotcha 3: Time Zone Confusion
Rubrik displays times in the cluster’s configured time zone, but SLA schedules operate in UTC unless explicitly configured otherwise. An “8 PM backup” might run at midnight local time if the time zone mapping is wrong.
Verify backup execution times after policy configuration, don’t trust the schedule display alone.
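The drift is easy to demonstrate. A minimal check, using Africa/Johannesburg (UTC+2, no DST) as the example zone:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A schedule entered as "8 PM" but stored in UTC does not run at 8 PM
# locally. For Johannesburg the drift is two hours:
scheduled = datetime(2024, 6, 1, 20, 0, tzinfo=timezone.utc)
local = scheduled.astimezone(ZoneInfo("Africa/Johannesburg"))
print(local.strftime("%H:%M"))  # → 22:00
```

Running the equivalent conversion against your cluster's configured zone, then comparing it with actual job start times in the activity log, settles the question definitively.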
8.8 Testing Your Restore Policies
Policy design is theoretical until tested. The following tests should be regular operational practice:
Live Mount Validation: Mount a backup from local retention and verify application functionality. This proves both backup integrity and Live Mount operational capability.
Archive Retrieval Test: Retrieve a backup from archive tier and time the operation. Compare actual retrieval time against SLA commitments.
Replication Failover Test: Perform a Live Mount from the replication target, not the source cluster. This validates that DR actually works, not just that replication is running.
Point in Time Recovery Test: For databases with log backup enabled, recover to a specific timestamp between snapshots. This validates that log chain integrity is maintained.
Concurrent Restore Test: Simulate a ransomware scenario by triggering multiple simultaneous restores. Measure whether your infrastructure can sustain the required parallelism.
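The harness for that last test does not need to be sophisticated; what matters is driving genuine parallelism rather than a polite serial loop. A sketch with a stubbed restore call (a real test would replace the stub with your platform's restore API; no endpoint is assumed here):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def trigger_restore(system):
    """Stand-in for a real restore request. The sleep simulates
    restore initiation latency."""
    time.sleep(0.1)
    return f"{system}: restore initiated"

systems = [f"db-{i:02d}" for i in range(8)]
start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(trigger_restore, systems))
elapsed = time.monotonic() - start

# 8 restores across 4 workers completes in roughly two batches,
# versus eight sequential waits for a serial loop.
print(len(results), "restores triggered in", round(elapsed, 2), "seconds")
```

The measurement you actually care about in production is not the trigger time but sustained restore throughput under this concurrency, which only a real exercise reveals.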
8.9 Policy Review Triggers
SLA Domains shouldn’t be “set and forget.” Trigger policy reviews when:
Application criticality changes (promotion to production, decommissioning)
Recovery requirements change (new compliance mandates, audit findings)
Infrastructure changes (new replication targets, storage tier availability)
Performance issues emerge (backup windows exceeded, replication lag growing)
The goal is proactive policy maintenance, not reactive incident response when a restore takes longer than expected.
9. Ransomware: Where Architecture Is Exposed
9.1 The Restore Storm Problem
After ransomware, the challenge is not backup availability. The challenge is restoring everything at once.
Constraints appear immediately. East-west traffic saturates. DWDM links run hot. Core switch buffers overflow. Cloud egress throttling kicks in.
Rubrik mitigates this through parallel restores, SLA based prioritisation, and live mounts for critical systems. What it cannot do is defeat physics. A good recovery plan avoids turning a data breach into a network outage.
10. SaaS vs Appliance: This Is a Network Decision
Functionally, Rubrik SaaS and on prem appliances share the same policy engine, metadata index, and restore semantics.
The difference is bandwidth reality.
On prem appliances provide fast local restores, predictable latency, and minimal WAN dependency. SaaS based protection provides excellent cloud workload coverage and operational simplicity, but restore speed is bounded by network capacity and egress costs.
Hybrid estates usually require both.
11. Why Rubrik in the Cloud?
Cloud providers offer native backup primitives. These are necessary but insufficient. They do not provide unified policy across environments, cross account recovery at scale, ransomware intelligence, or consistent restore semantics. Rubrik turns cloud backups into recoverable systems rather than isolated snapshots.
11.1 Should You Protect Your AWS Root and Crypto Accounts?
Yes, because losing the control plane is worse than losing data.
Rubrik protects IAM configuration, account state, and infrastructure metadata. After a compromise, restoring how the account was configured is as important as restoring the data itself.
12. Backup Meets Security (Finally)
Rubrik integrates threat awareness into recovery using entropy analysis, change rate anomaly detection, and snapshot divergence tracking. This answers the most dangerous question in recovery: which backup is actually safe to restore? Most platforms cannot answer it with confidence.
13. VMware First Class Citizen, Physical Hosts Still Lag
Rubrik’s deepest integrations exist in VMware environments, including snapshot orchestration, instant VM recovery, and live mounts.
The uncomfortable reality remains that physical hosts with the largest datasets would benefit most from snapshot based protection, yet receive the least integration. This is an industry gap, not just a tooling one.
14. When Rubrik Is Not the Right Tool
Rubrik is not universal.
It is a poor fit when bandwidth is severely constrained, estates are very small, or tape workflows are legally mandated.
Rubrik’s value emerges at scale, under pressure, and during failure.
15. Conclusion: Boredom Is Success
Backups should be boring. Restores should be quiet. Executives should never know the platform exists.
The only time backups become exciting is when they fail, and that excitement is almost always lethal.
Rubrik is not interesting because it stores data. It is interesting because, when everything is already on fire, restore remains a controlled engineering exercise rather than a panic response.
A Complete Guide to Archiving, Restoring, and Querying Large Table Partitions
When dealing with multi-terabyte tables in Aurora PostgreSQL, keeping historical partitions online becomes increasingly expensive and operationally burdensome. This guide presents a complete solution for archiving partitions to S3 in Iceberg/Parquet format, restoring them when needed, and querying archived data directly via a Spring Boot API without database restoration.
1. Architecture Overview
The solution comprises three components:
Archive Script: Exports a partition from Aurora PostgreSQL to Parquet files organised in Iceberg table format on S3
Restore Script: Imports archived data from S3 back into a staging table for validation and migration to the main table
Query API: A Spring Boot application that reads Parquet files directly from S3, applying predicate pushdown for efficient filtering
This approach reduces storage costs by approximately 70 to 80 percent compared to keeping data in Aurora, while maintaining full queryability through the API layer.
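The cost delta is simple arithmetic. A sketch using illustrative list prices (roughly $0.10/GB-month for Aurora storage and $0.023/GB-month for S3 Standard; check current regional pricing, and note that S3 request and query-layer costs claw back part of the raw storage saving):

```python
def monthly_storage_cost(gb: float, price_per_gb: float) -> float:
    """Monthly storage cost in dollars for a given volume."""
    return gb * price_per_gb

# A 1 TB partition that compresses roughly 3x into Parquet:
aurora = monthly_storage_cost(1024, 0.10)
s3 = monthly_storage_cost(1024 / 3, 0.023)
print(f"Aurora: ${aurora:.2f}/mo, S3: ${s3:.2f}/mo")
# Aurora: $102.40/mo, S3: $7.85/mo
```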
4.1 Restore Script
This script reverses the archive operation by reading Parquet files from S3 and loading them into a staging table.
#!/usr/bin/env python3
# restore_partition.py
"""
Restore an archived partition from S3 back to Aurora PostgreSQL.
Usage:
python restore_partition.py \
--source-path s3://bucket/prefix/schema/table/partition_col=value \
--target-table transactions_staging \
--target-schema public
"""
import argparse
import json
import logging
import sys
from datetime import datetime
from typing import Dict, Any, List, Optional
from urllib.parse import urlparse
import boto3
import pandas as pd
import psycopg2
from psycopg2 import sql
from psycopg2.extras import execute_values
import pyarrow.parquet as pq
from config import DatabaseConfig
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger(__name__)
class PartitionRestorer:
"""Restores archived partitions from S3 to PostgreSQL."""
def __init__(
self,
db_config: DatabaseConfig,
source_path: str,
target_schema: str,
target_table: str,
create_table: bool = True,
batch_size: int = 10000
):
self.db_config = db_config
self.source_path = source_path
self.target_schema = target_schema
self.target_table = target_table
self.create_table = create_table
self.batch_size = batch_size
parsed = urlparse(source_path)
self.bucket = parsed.netloc
self.prefix = parsed.path.lstrip("/")
self.s3_client = boto3.client("s3")
def _load_schema_snapshot(self) -> Dict[str, Any]:
"""Load the schema snapshot from the archive."""
response = self.s3_client.get_object(
Bucket=self.bucket,
Key=f"{self.prefix}/schema_snapshot.json"
)
return json.loads(response["Body"].read())
def _load_iceberg_metadata(self) -> Dict[str, Any]:
"""Load Iceberg metadata."""
response = self.s3_client.get_object(
Bucket=self.bucket,
Key=f"{self.prefix}/metadata/v1.metadata.json"
)
return json.loads(response["Body"].read())
def _list_data_files(self) -> List[str]:
"""List all Parquet data files in the archive."""
data_prefix = f"{self.prefix}/data/"
files = []
paginator = self.s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=self.bucket, Prefix=data_prefix):
for obj in page.get("Contents", []):
if obj["Key"].endswith(".parquet"):
files.append(obj["Key"])
return sorted(files)
    def _postgres_type_from_column_def(self, col: Dict[str, Any]) -> str:
        """Convert an archived column definition back to a PostgreSQL type."""
        data_type = col["data_type"]
        if data_type == "character varying":
            max_len = col.get("character_maximum_length")
            if max_len:
                return f"varchar({max_len})"
            return "text"
        if data_type == "numeric":
            precision = col.get("numeric_precision")
            scale = col.get("numeric_scale")
            # scale may legitimately be 0, so test for None explicitly
            if precision is not None and scale is not None:
                return f"numeric({precision},{scale})"
            return "numeric"
        return data_type
def _create_staging_table(
self,
schema_snapshot: Dict[str, Any],
conn: psycopg2.extensions.connection
) -> None:
"""Create the staging table based on archived schema."""
columns = schema_snapshot["columns"]
column_defs = []
for col in columns:
pg_type = self._postgres_type_from_column_def(col)
nullable = "" if col["is_nullable"] == "YES" else " NOT NULL"
column_defs.append(f' "{col["column_name"]}" {pg_type}{nullable}')
create_sql = f"""
DROP TABLE IF EXISTS {self.target_schema}.{self.target_table};
CREATE TABLE {self.target_schema}.{self.target_table} (
{chr(10).join(column_defs)}
)
"""
with conn.cursor() as cur:
cur.execute(create_sql)
conn.commit()
logger.info(f"Created staging table {self.target_schema}.{self.target_table}")
def _insert_batch(
self,
df: pd.DataFrame,
columns: List[str],
conn: psycopg2.extensions.connection
) -> int:
"""Insert a batch of records into the staging table."""
if df.empty:
return 0
for col in df.columns:
if pd.api.types.is_datetime64_any_dtype(df[col]):
df[col] = df[col].apply(
lambda x: x.isoformat() if pd.notna(x) else None
)
values = [tuple(row) for row in df[columns].values]
column_names = ", ".join(f'"{c}"' for c in columns)
insert_sql = f"""
INSERT INTO {self.target_schema}.{self.target_table} ({column_names})
VALUES %s
"""
with conn.cursor() as cur:
execute_values(cur, insert_sql, values, page_size=self.batch_size)
return len(values)
def restore(self) -> Dict[str, Any]:
"""Execute the restore operation."""
logger.info(f"Starting restore from {self.source_path}")
schema_snapshot = self._load_schema_snapshot()
metadata = self._load_iceberg_metadata()
data_files = self._list_data_files()
logger.info(f"Found {len(data_files)} data files to restore")
columns = [col["column_name"] for col in schema_snapshot["columns"]]
with psycopg2.connect(self.db_config.connection_string) as conn:
if self.create_table:
self._create_staging_table(schema_snapshot, conn)
total_records = 0
            import io  # local import: used to buffer S3 streams below
            for file_key in data_files:
                logger.info(f"Restoring from {file_key}")
                response = self.s3_client.get_object(Bucket=self.bucket, Key=file_key)
                # botocore's StreamingBody is not seekable, and Parquet readers
                # must read the file footer first, so buffer the object in memory
                table = pq.read_table(io.BytesIO(response["Body"].read()))
                df = table.to_pandas()
                file_records = 0
                for start in range(0, len(df), self.batch_size):
                    # copy the slice so _insert_batch can mutate datetime columns
                    batch_df = df.iloc[start:start + self.batch_size].copy()
inserted = self._insert_batch(batch_df, columns, conn)
file_records += inserted
conn.commit()
total_records += file_records
logger.info(f"Restored {file_records} records from {file_key}")
result = {
"status": "success",
"source": self.source_path,
"target": f"{self.target_schema}.{self.target_table}",
"total_records": total_records,
"files_processed": len(data_files),
"restored_at": datetime.utcnow().isoformat()
}
logger.info(f"Restore complete: {total_records} records")
return result
def main():
parser = argparse.ArgumentParser(
description="Restore archived partition from S3 to PostgreSQL"
)
parser.add_argument("--source-path", required=True, help="S3 path to archived partition")
parser.add_argument("--target-table", required=True, help="Target table name")
parser.add_argument("--target-schema", default="public", help="Target schema name")
parser.add_argument("--batch-size", type=int, default=10000, help="Insert batch size")
parser.add_argument("--no-create", action="store_true", help="Don't create table, assume it exists")
args = parser.parse_args()
db_config = DatabaseConfig.from_environment()
restorer = PartitionRestorer(
db_config=db_config,
source_path=args.source_path,
target_schema=args.target_schema,
target_table=args.target_table,
create_table=not args.no_create,
batch_size=args.batch_size
)
result = restorer.restore()
print(json.dumps(result, indent=2))
return 0
if __name__ == "__main__":
sys.exit(main())
5. SQL Operations for Partition Migration
Once data is restored to a staging table, you need SQL operations to validate and migrate it to the main table.
5.1 Schema Validation
-- Validate that staging table schema matches the main table
CREATE OR REPLACE FUNCTION validate_table_schemas(
p_source_schema TEXT,
p_source_table TEXT,
p_target_schema TEXT,
p_target_table TEXT
) RETURNS TABLE (
validation_type TEXT,
column_name TEXT,
source_value TEXT,
target_value TEXT,
is_valid BOOLEAN
) AS $$
BEGIN
-- Check column count
RETURN QUERY
SELECT
'column_count'::TEXT,
NULL::TEXT,
src.cnt::TEXT,
tgt.cnt::TEXT,
src.cnt = tgt.cnt
FROM
(SELECT COUNT(*)::INT AS cnt
FROM information_schema.columns
WHERE table_schema = p_source_schema
AND table_name = p_source_table) src,
(SELECT COUNT(*)::INT AS cnt
FROM information_schema.columns
WHERE table_schema = p_target_schema
AND table_name = p_target_table) tgt;
-- Check each column exists with matching type
RETURN QUERY
SELECT
'column_definition'::TEXT,
src.column_name,
src.data_type || COALESCE('(' || src.character_maximum_length::TEXT || ')', ''),
COALESCE(tgt.data_type || COALESCE('(' || tgt.character_maximum_length::TEXT || ')', ''), 'MISSING'),
src.data_type = COALESCE(tgt.data_type, '')
AND COALESCE(src.character_maximum_length, 0) = COALESCE(tgt.character_maximum_length, 0)
FROM
information_schema.columns src
LEFT JOIN
information_schema.columns tgt
ON tgt.table_schema = p_target_schema
AND tgt.table_name = p_target_table
AND tgt.column_name = src.column_name
WHERE
src.table_schema = p_source_schema
AND src.table_name = p_source_table
ORDER BY src.ordinal_position;
-- Check nullability
RETURN QUERY
SELECT
'nullability'::TEXT,
src.column_name,
src.is_nullable,
COALESCE(tgt.is_nullable, 'MISSING'),
src.is_nullable = COALESCE(tgt.is_nullable, '')
FROM
information_schema.columns src
LEFT JOIN
information_schema.columns tgt
ON tgt.table_schema = p_target_schema
AND tgt.table_name = p_target_table
AND tgt.column_name = src.column_name
WHERE
src.table_schema = p_source_schema
AND src.table_name = p_source_table
ORDER BY src.ordinal_position;
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT * FROM validate_table_schemas('public', 'transactions_staging', 'public', 'transactions');
5.2 Comprehensive Validation Report
-- Generate a full validation report before migration
CREATE OR REPLACE FUNCTION generate_migration_report(
p_staging_schema TEXT,
p_staging_table TEXT,
p_target_schema TEXT,
p_target_table TEXT,
p_partition_column TEXT,
p_partition_value TEXT
) RETURNS TABLE (
check_name TEXT,
result TEXT,
details JSONB
) AS $$
DECLARE
v_staging_count BIGINT;
v_existing_count BIGINT;
v_schema_valid BOOLEAN;
BEGIN
-- Get staging table count
EXECUTE format(
'SELECT COUNT(*) FROM %I.%I',
p_staging_schema, p_staging_table
) INTO v_staging_count;
RETURN QUERY SELECT
'staging_record_count'::TEXT,
'INFO'::TEXT,
jsonb_build_object('count', v_staging_count);
-- Check for existing data in target partition
BEGIN
EXECUTE format(
'SELECT COUNT(*) FROM %I.%I WHERE %I = $1',
p_target_schema, p_target_table, p_partition_column
) INTO v_existing_count USING p_partition_value;
IF v_existing_count > 0 THEN
RETURN QUERY SELECT
'existing_partition_data'::TEXT,
'WARNING'::TEXT,
jsonb_build_object(
'count', v_existing_count,
'message', 'Target partition already contains data'
);
ELSE
RETURN QUERY SELECT
'existing_partition_data'::TEXT,
'OK'::TEXT,
jsonb_build_object('count', 0);
END IF;
EXCEPTION WHEN undefined_column THEN
RETURN QUERY SELECT
'partition_column_check'::TEXT,
'ERROR'::TEXT,
jsonb_build_object(
'message', format('Partition column %s not found', p_partition_column)
);
END;
-- Validate schemas match
SELECT bool_and(is_valid) INTO v_schema_valid
FROM validate_table_schemas(
p_staging_schema, p_staging_table,
p_target_schema, p_target_table
);
RETURN QUERY SELECT
'schema_validation'::TEXT,
CASE WHEN v_schema_valid THEN 'OK' ELSE 'ERROR' END::TEXT,
jsonb_build_object('schemas_match', v_schema_valid);
    -- Check for null values in columns that are NOT NULL in the target.
    -- This must inspect the staging data itself, so the count is executed
    -- dynamically per column rather than against information_schema.
    DECLARE
        v_col RECORD;
        v_null_count BIGINT;
    BEGIN
        FOR v_col IN
            SELECT c.column_name
            FROM information_schema.columns c
            WHERE c.table_schema = p_target_schema
              AND c.table_name = p_target_table
              AND c.is_nullable = 'NO'
        LOOP
            EXECUTE format(
                'SELECT COUNT(*) FROM %I.%I WHERE %I IS NULL',
                p_staging_schema, p_staging_table, v_col.column_name
            ) INTO v_null_count;
            RETURN QUERY SELECT
                ('null_check_' || v_col.column_name)::TEXT,
                CASE WHEN v_null_count > 0 THEN 'ERROR' ELSE 'OK' END::TEXT,
                jsonb_build_object('null_count', v_null_count);
        END LOOP;
    END;
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT * FROM generate_migration_report(
'public', 'transactions_staging',
'public', 'transactions',
'transaction_date', '2024-01'
);
5.3 Partition Migration
-- Migrate data from staging table to main table
CREATE OR REPLACE PROCEDURE migrate_partition_data(
p_staging_schema TEXT,
p_staging_table TEXT,
p_target_schema TEXT,
p_target_table TEXT,
p_partition_column TEXT,
p_partition_value TEXT,
p_delete_existing BOOLEAN DEFAULT FALSE,
p_batch_size INTEGER DEFAULT 50000
)
LANGUAGE plpgsql
AS $$
DECLARE
v_columns TEXT;
v_total_migrated BIGINT := 0;
v_batch_migrated BIGINT;
v_validation_passed BOOLEAN;
BEGIN
-- Validate schemas match
SELECT bool_and(is_valid) INTO v_validation_passed
FROM validate_table_schemas(
p_staging_schema, p_staging_table,
p_target_schema, p_target_table
);
IF NOT v_validation_passed THEN
RAISE EXCEPTION 'Schema validation failed. Run validate_table_schemas() for details.';
END IF;
-- Build column list
SELECT string_agg(quote_ident(column_name), ', ' ORDER BY ordinal_position)
INTO v_columns
FROM information_schema.columns
WHERE table_schema = p_staging_schema
AND table_name = p_staging_table;
-- Delete existing data if requested
IF p_delete_existing THEN
EXECUTE format(
'DELETE FROM %I.%I WHERE %I = $1',
p_target_schema, p_target_table, p_partition_column
) USING p_partition_value;
RAISE NOTICE 'Deleted existing data for partition % = %',
p_partition_column, p_partition_value;
END IF;
    -- Migrate in batches, consuming staging rows as they are moved.
    -- (Re-checking the target for the partition value does not work here:
    -- once the first batch lands, every remaining staging row would be
    -- skipped. Deleting moved rows from staging keeps each batch disjoint.)
    LOOP
        EXECUTE format($sql$
            WITH to_migrate AS (
                DELETE FROM %I.%I
                WHERE ctid IN (
                    SELECT ctid FROM %I.%I LIMIT $1
                )
                RETURNING %s
            ),
            inserted AS (
                INSERT INTO %I.%I (%s)
                SELECT %s FROM to_migrate
                RETURNING 1
            )
            SELECT COUNT(*) FROM inserted
        $sql$,
            p_staging_schema, p_staging_table,
            p_staging_schema, p_staging_table,
            v_columns,
            p_target_schema, p_target_table, v_columns,
            v_columns
        ) INTO v_batch_migrated USING p_batch_size;
v_total_migrated := v_total_migrated + v_batch_migrated;
IF v_batch_migrated = 0 THEN
EXIT;
END IF;
RAISE NOTICE 'Migrated batch: % records (total: %)', v_batch_migrated, v_total_migrated;
COMMIT;
END LOOP;
RAISE NOTICE 'Migration complete. Total records migrated: %', v_total_migrated;
END;
$$;
-- Usage
CALL migrate_partition_data(
'public', 'transactions_staging',
'public', 'transactions',
'transaction_date', '2024-01',
TRUE, -- delete existing
50000 -- batch size
);
5.4 Attach Partition (for Partitioned Tables)
-- For natively partitioned tables, attach the staging table as a partition
CREATE OR REPLACE PROCEDURE attach_restored_partition(
p_staging_schema TEXT,
p_staging_table TEXT,
p_target_schema TEXT,
p_target_table TEXT,
p_partition_column TEXT,
p_partition_start TEXT,
p_partition_end TEXT
)
LANGUAGE plpgsql
AS $$
DECLARE
v_partition_name TEXT;
v_constraint_name TEXT;
BEGIN
-- Validate schemas match
IF NOT (
SELECT bool_and(is_valid)
FROM validate_table_schemas(
p_staging_schema, p_staging_table,
p_target_schema, p_target_table
)
) THEN
RAISE EXCEPTION 'Schema validation failed';
END IF;
    -- Add a constraint matching the partition bounds. NOT VALID makes the
    -- ADD itself instantaneous; validation happens as a separate step so
    -- the long scan does not hold a strong lock.
    v_constraint_name := p_staging_table || '_partition_check';
    EXECUTE format($sql$
        ALTER TABLE %I.%I
        ADD CONSTRAINT %I
        CHECK (%I >= %L AND %I < %L)
        NOT VALID
    $sql$,
        p_staging_schema, p_staging_table,
        v_constraint_name,
        p_partition_column, p_partition_start,
        p_partition_column, p_partition_end
    );
    -- Validate the constraint (takes only a weak lock, so writes continue)
EXECUTE format($sql$
ALTER TABLE %I.%I
VALIDATE CONSTRAINT %I
$sql$,
p_staging_schema, p_staging_table,
v_constraint_name
);
    -- Detach any existing partition and rename it out of the way, so the
    -- staging table can take its name. (Without the rename, the RENAME
    -- below collides when staging and target share a schema.)
    v_partition_name := p_target_table || '_' || replace(p_partition_start, '-', '_');
    BEGIN
        EXECUTE format($sql$
            ALTER TABLE %I.%I
            DETACH PARTITION %I.%I
        $sql$,
            p_target_schema, p_target_table,
            p_target_schema, v_partition_name
        );
        EXECUTE format($sql$
            ALTER TABLE %I.%I RENAME TO %I
        $sql$,
            p_target_schema, v_partition_name,
            v_partition_name || '_old'
        );
        RAISE NOTICE 'Detached existing partition % (renamed to %_old)',
            v_partition_name, v_partition_name;
    EXCEPTION WHEN undefined_table THEN
        RAISE NOTICE 'No existing partition to detach';
    END;
-- Rename staging table to partition name
EXECUTE format($sql$
ALTER TABLE %I.%I RENAME TO %I
$sql$,
p_staging_schema, p_staging_table,
v_partition_name
);
-- Attach as partition
EXECUTE format($sql$
ALTER TABLE %I.%I
ATTACH PARTITION %I.%I
FOR VALUES FROM (%L) TO (%L)
$sql$,
p_target_schema, p_target_table,
p_staging_schema, v_partition_name,
p_partition_start, p_partition_end
);
RAISE NOTICE 'Successfully attached partition % to %',
v_partition_name, p_target_table;
END;
$$;
-- Usage for range partitioned table
CALL attach_restored_partition(
'public', 'transactions_staging',
'public', 'transactions',
'transaction_date',
'2024-01-01', '2024-02-01'
);
5.5 Cleanup Script
-- Clean up after successful migration
CREATE OR REPLACE PROCEDURE cleanup_after_migration(
p_staging_schema TEXT,
p_staging_table TEXT,
p_verify_target_schema TEXT DEFAULT NULL,
p_verify_target_table TEXT DEFAULT NULL,
p_verify_count BOOLEAN DEFAULT TRUE
)
LANGUAGE plpgsql
AS $$
DECLARE
v_staging_count BIGINT;
v_target_count BIGINT;
BEGIN
IF p_verify_count AND p_verify_target_schema IS NOT NULL THEN
EXECUTE format(
'SELECT COUNT(*) FROM %I.%I',
p_staging_schema, p_staging_table
) INTO v_staging_count;
EXECUTE format(
'SELECT COUNT(*) FROM %I.%I',
p_verify_target_schema, p_verify_target_table
) INTO v_target_count;
IF v_target_count < v_staging_count THEN
RAISE WARNING 'Target count (%) is less than staging count (%). Migration may be incomplete.',
v_target_count, v_staging_count;
RETURN;
END IF;
END IF;
EXECUTE format(
'DROP TABLE IF EXISTS %I.%I',
p_staging_schema, p_staging_table
);
RAISE NOTICE 'Dropped staging table %.%', p_staging_schema, p_staging_table;
END;
$$;
-- Usage
CALL cleanup_after_migration(
'public', 'transactions_staging',
'public', 'transactions',
TRUE
);
6. Spring Boot Query API
This API allows querying archived data directly from S3 without restoring to the database.
Expected performance for a 1TB partition (compressed to ~150GB Parquet):
Query Type                        Typical Latency
Point lookup (indexed column)     500ms to 2s
Range scan (10% selectivity)      5s to 15s
Full scan with aggregation        30s to 60s
8.3 Monitoring and Alerting
Implement these CloudWatch metrics for production use:
import java.util.concurrent.TimeUnit;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class QueryMetrics {
    private final MeterRegistry meterRegistry;

    public QueryMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void recordQuery(QueryResponse response) {
        meterRegistry.counter("archive.query.count").increment();
        meterRegistry.timer("archive.query.duration")
            .record(response.executionTimeMs(), TimeUnit.MILLISECONDS);
        meterRegistry.gauge("archive.query.records_scanned", response.totalScanned());
        meterRegistry.gauge("archive.query.records_matched", response.totalMatched());
    }
}
9. Conclusion
This solution provides a complete data lifecycle management approach for large Aurora PostgreSQL tables. The archive script efficiently exports partitions to the cost effective Iceberg/Parquet format on S3, while the restore script enables seamless data recovery when needed. The Spring Boot API bridges the gap by allowing direct queries against archived data, eliminating the need for restoration in many analytical scenarios.
Key benefits:
Cost reduction: roughly 70 to 80 percent storage cost savings compared to keeping data in Aurora, and more when data compresses well
Operational flexibility: Query archived data without restoration
Schema preservation: Full schema metadata maintained for reliable restores
Partition management: Clean attach/detach operations for partitioned tables
Predicate pushdown: Efficient filtering reduces data transfer and processing
The Iceberg format ensures compatibility with the broader data ecosystem, allowing tools like Athena, Spark, and Trino to query the same archived data when needed for more complex analytical workloads.
Prepared statements are one of PostgreSQL’s most powerful features for query optimization. By parsing and planning queries once, then reusing those plans for subsequent executions, they can dramatically improve performance. But this optimization comes with a hidden danger: sometimes caching the same plan for every execution can lead to catastrophic memory exhaustion and performance degradation.
In this deep dive, we’ll explore how prepared statement plan caching works, when it fails spectacularly, and how PostgreSQL has evolved to address these challenges.
1. Understanding Prepared Statements and Plan Caching
When you execute a prepared statement in PostgreSQL, the database goes through several phases:
Parsing: Converting the SQL text into a parse tree
Planning: Creating an execution plan based on statistics and parameters
Execution: Running the plan against actual data
The promise of prepared statements is simple: do steps 1 and 2 once, then reuse the results for repeated executions with different parameter values.
-- Prepare the statement
PREPARE get_orders AS
SELECT * FROM orders WHERE customer_id = $1;
-- Execute multiple times with different parameters
EXECUTE get_orders(123);
EXECUTE get_orders(456);
EXECUTE get_orders(789);
PostgreSQL uses a clever heuristic to decide when to cache plans. For the first five executions, it creates a custom plan specific to the parameter values. Starting with the sixth execution, it evaluates whether a generic plan (one that works for any parameter value) would be more efficient. If the average cost of the custom plans is close enough to the generic plan’s cost, PostgreSQL switches to reusing the generic plan.
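The heuristic can be sketched in a few lines. This is a simplified model of PostgreSQL's plan-cache logic (the real code also weighs planning cost and honours the plan_cache_mode setting); the cost numbers are illustrative:

```python
def choose_plan(custom_costs: list[float], generic_cost: float) -> str:
    """Simplified model of PostgreSQL's custom-vs-generic plan choice.

    The first five executions always use a custom plan. From the sixth
    execution on, the generic plan is used if it is not more expensive
    than the average custom plan seen so far.
    """
    if len(custom_costs) < 5:
        return "custom"
    avg_custom = sum(custom_costs) / len(custom_costs)
    return "generic" if generic_cost <= avg_custom else "custom"

# A partitioned table: each custom plan touches one partition and is cheap,
# while the generic plan must cover all 365 partitions.
print(choose_plan([10.0] * 5, generic_cost=3650.0))   # stays custom
# A plain table: generic and custom plans cost about the same.
print(choose_plan([10.0] * 5, generic_cost=10.0))     # switches to generic
```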
2. The Dark Side: Memory Exhaustion from Plan Caching
Here’s where things can go catastrophically wrong. Consider a partitioned table:
CREATE TABLE events (
id BIGSERIAL,
event_date DATE,
user_id INTEGER,
event_type TEXT,
data JSONB
) PARTITION BY RANGE (event_date);
-- Create 365 partitions, one per day
CREATE TABLE events_2024_01_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
CREATE TABLE events_2024_01_02 PARTITION OF events
FOR VALUES FROM ('2024-01-02') TO ('2024-01-03');
-- ... 363 more partitions
Now consider this prepared statement:
PREPARE get_events AS
SELECT * FROM events WHERE event_date = $1;
The Problem: Generic Plans Can’t Prune Partitions
When PostgreSQL creates a generic plan for this query, it doesn’t know which specific date you’ll query at execution time. Without this knowledge, the planner cannot perform partition pruning, the critical optimization that eliminates irrelevant partitions from consideration.
Here’s what happens:
Custom plan (first 5 executions): PostgreSQL sees the actual date value, realizes only one partition is relevant, and creates a plan that touches only that partition. Fast and efficient.
Generic plan (6th execution onward): PostgreSQL creates a plan that must be valid for ANY date value. Since it can’t know which partition you’ll need, it includes ALL 365 partitions in the plan.
The result: Instead of reading from 1 partition, PostgreSQL’s generic plan prepares to read from all 365 partitions. This leads to:
Memory exhaustion: The query plan itself becomes enormous, containing nodes for every partition
Planning overhead: Even though the plan is cached, initializing it for execution requires allocating memory for all partition nodes
Execution inefficiency: The executor must check every partition, even though 364 of them will return zero rows
In extreme cases with thousands of partitions, this can consume gigabytes of memory per connection and bring your database to its knees.
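The scale of the problem is easy to estimate. A back-of-envelope sketch (the per-node plan size is an illustrative assumption; actual plan node sizes vary by query shape):

```python
def plan_memory_mb(partitions: int, connections: int,
                   bytes_per_partition_node: int = 50_000) -> float:
    """Rough memory held by cached generic plans that include every partition,
    across all connections that have the statement prepared."""
    return partitions * bytes_per_partition_node * connections / 1e6

# 1,000 partitions, 200 pooled connections, ~50 KB of plan per partition node:
print(f"{plan_memory_mb(1000, 200) / 1000:.0f} GB")
```

The multiplication by connection count is the part that surprises people: connection poolers keep prepared statements alive per backend, so the cached plans are duplicated across the whole pool.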
3. Partition Pruning: The Critical Optimization and How It Works
Partition pruning is the process of eliminating partitions that cannot possibly contain relevant data based on query constraints. Understanding partition pruning in depth is essential for working with partitioned tables effectively.
3.1 What Is Partition Pruning?
At its core, partition pruning is PostgreSQL’s mechanism for avoiding unnecessary work. When you query a partitioned table, the database analyzes your WHERE clause and determines which partitions could possibly contain matching rows. All other partitions are excluded from the query execution entirely.
Consider a table partitioned by date range:
CREATE TABLE sales (
sale_id BIGINT,
sale_date DATE,
amount NUMERIC,
product_id INTEGER
) PARTITION BY RANGE (sale_date);
CREATE TABLE sales_2023_q1 PARTITION OF sales
FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE sales_2023_q2 PARTITION OF sales
FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');
CREATE TABLE sales_2023_q3 PARTITION OF sales
FOR VALUES FROM ('2023-07-01') TO ('2023-10-01');
CREATE TABLE sales_2023_q4 PARTITION OF sales
FOR VALUES FROM ('2023-10-01') TO ('2024-01-01');
When you execute:
SELECT * FROM sales WHERE sale_date = '2023-05-15';
PostgreSQL performs partition pruning by examining the partition constraints. It determines that only sales_2023_q2 can contain rows with sale_date = ‘2023-05-15’, so it completely ignores the other three partitions. They never get opened, scanned, or loaded into memory.
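The constraint check the planner applies can be modelled directly. A minimal sketch of plan-time pruning for range partitions (the bounds mirror the sales table above; ISO date strings order correctly when compared as text):

```python
def prune_range_partitions(partitions: dict[str, tuple[str, str]],
                           value: str) -> list[str]:
    """Return the partitions whose [start, end) range can contain `value`."""
    return [name for name, (start, end) in partitions.items()
            if start <= value < end]

sales = {
    "sales_2023_q1": ("2023-01-01", "2023-04-01"),
    "sales_2023_q2": ("2023-04-01", "2023-07-01"),
    "sales_2023_q3": ("2023-07-01", "2023-10-01"),
    "sales_2023_q4": ("2023-10-01", "2024-01-01"),
}
# Equality on the partition key survives in exactly one partition:
print(prune_range_partitions(sales, "2023-05-15"))  # ['sales_2023_q2']
```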
3.2 The Two Stages of Partition Pruning
PostgreSQL performs partition pruning at two distinct stages in query execution, and understanding the difference is crucial for troubleshooting performance issues.
Stage 1: Plan Time Pruning (Static Pruning)
Plan time pruning happens during the query planning phase, before execution begins. This is the ideal scenario because pruned partitions never appear in the execution plan at all.
When it occurs:
The query contains literal values in the WHERE clause
The partition key columns are directly compared to constants
The planner can evaluate the partition constraints at planning time
Example:
EXPLAIN SELECT * FROM sales WHERE sale_date = '2023-05-15';
Output might show:
Seq Scan on sales_2023_q2 sales
Filter: (sale_date = '2023-05-15'::date)
Notice that only one partition appears in the plan. The other three partitions were pruned away during planning, and they consume zero resources.
What makes plan time pruning possible:
The planner evaluates the WHERE clause condition against each partition’s constraint. For sales_2023_q2, the constraint is:
sale_date >= '2023-04-01' AND sale_date < '2023-07-01'
The planner performs boolean logic: “Can sale_date = ‘2023-05-15’ be true if the constraint requires sale_date >= ‘2023-04-01’ AND sale_date < ‘2023-07-01’?” Yes, it can. For the other partitions, the answer is no, so they’re eliminated.
Performance characteristics:
No runtime overhead for pruned partitions
Minimal memory usage
Optimal query performance
The execution plan is lean and specific
Stage 2: Execution Time Pruning (Dynamic Pruning)
Execution time pruning, also called runtime pruning, happens during query execution rather than planning. This occurs when the planner cannot determine which partitions to prune until the query actually runs.
When it occurs:
Parameters or variables are used instead of literal values
Subqueries provide the filter values
Join conditions determine which partitions are needed
Prepared statements with parameters
Example:
PREPARE get_sales AS
SELECT * FROM sales WHERE sale_date = $1;
EXPLAIN (ANALYZE) EXECUTE get_sales('2023-05-15');
With execution time pruning, the plan initially includes all partitions, and the output looks something like this (abbreviated):
Append
  Subplans Removed: 3
  -> Seq Scan on sales_2023_q2 sales
       Filter: (sale_date = '2023-05-15'::date)
The key indicator is “Subplans Removed: 3”, which tells you that three partitions were pruned at execution time.
How execution time pruning works:
During the initialization phase of query execution, PostgreSQL evaluates the actual parameter values and applies the same constraint checking logic as plan time pruning. However, instead of eliminating partitions from the plan, it marks them as “pruned” and skips their initialization and execution.
The critical difference:
Even though execution time pruning skips scanning the pruned partitions, the plan still contains nodes for all partitions. This means:
Memory is allocated for all partition nodes (though less than full initialization)
The plan structure is larger
There is a small runtime cost to check each partition
More complex bookkeeping is required
This is why execution time pruning, while much better than no pruning, is not quite as efficient as plan time pruning.
3.3 Partition Pruning with Different Partition Strategies
PostgreSQL supports multiple partitioning strategies, and pruning works differently for each.
Range Partitioning
Range partitioning is the most common and supports the most effective pruning:
CREATE TABLE measurements (
measurement_time TIMESTAMPTZ,
sensor_id INTEGER,
value NUMERIC
) PARTITION BY RANGE (measurement_time);
Pruning logic: PostgreSQL uses range comparison. Given a filter like measurement_time >= '2024-01-01' AND measurement_time < '2024-02-01', it identifies all partitions whose ranges overlap with this query range.
Pruning effectiveness: Excellent. Range comparisons are computationally cheap and highly selective.
List Partitioning
List partitioning groups rows by discrete values:
CREATE TABLE orders (
order_id BIGINT,
country_code TEXT,
amount NUMERIC
) PARTITION BY LIST (country_code);
CREATE TABLE orders_us PARTITION OF orders
FOR VALUES IN ('US');
CREATE TABLE orders_uk PARTITION OF orders
FOR VALUES IN ('UK');
CREATE TABLE orders_eu PARTITION OF orders
FOR VALUES IN ('DE', 'FR', 'IT', 'ES');
Pruning logic: PostgreSQL checks if the query value matches any value in each partition’s list.
SELECT * FROM orders WHERE country_code = 'FR';
Only orders_eu is accessed because ‘FR’ appears in its value list.
Pruning effectiveness: Very good for equality comparisons. Less effective for OR conditions across many values or pattern matching.
Hash Partitioning
Hash partitioning distributes rows using a hash function:
CREATE TABLE users (
user_id BIGINT,
username TEXT,
email TEXT
) PARTITION BY HASH (user_id);
CREATE TABLE users_p0 PARTITION OF users
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_p1 PARTITION OF users
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... p2, p3
Pruning logic: PostgreSQL computes the hash of the query value and determines which partition it maps to.
SELECT * FROM users WHERE user_id = 12345;
PostgreSQL calculates hash(12345) % 4 and accesses only the matching partition.
Pruning effectiveness: Excellent for equality on the partition key. Completely ineffective for range queries, pattern matching, or anything except exact equality matches.
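The routing mechanics can be sketched as follows. Note this uses the key value directly as a stand-in hash; PostgreSQL applies its own internal hash functions, so the partition a given row actually lands in will differ:

```python
def hash_partition_for(user_id: int, modulus: int = 4) -> str:
    """Route a key to one of `modulus` hash partitions.

    Stand-in for PostgreSQL's hash(user_id) % modulus routing; the real
    hash function differs, but the MODULUS/REMAINDER mechanics are the same.
    """
    remainder = user_id % modulus
    return f"users_p{remainder}"

# Equality on the partition key maps to exactly one partition:
print(hash_partition_for(12345))
# A contiguous range of keys scatters across every partition, which is
# why hash partitioning cannot prune range queries:
print({hash_partition_for(uid) for uid in range(100, 104)})
```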
3.4 Complex Partition Pruning Scenarios
Real world queries are often more complex than simple equality comparisons. Here’s how pruning handles various scenarios:
Multi Column Partition Keys
CREATE TABLE events (
event_date DATE,
region TEXT,
data JSONB
) PARTITION BY RANGE (event_date, region);
Pruning works on the leading columns of the partition key. A query filtering only on event_date can still prune effectively. A query filtering only on region cannot prune at all because region is not the leading column.
OR Conditions
SELECT * FROM sales
WHERE sale_date = '2023-05-15' OR sale_date = '2023-08-20';
PostgreSQL must access partitions for both dates (Q2 and Q3), so it keeps both and prunes Q1 and Q4. OR conditions reduce pruning effectiveness.
Inequality Comparisons
SELECT * FROM sales WHERE sale_date >= '2023-05-01';
PostgreSQL prunes partitions entirely before the date (Q1) but must keep all partitions from Q2 onward. Range queries reduce pruning selectivity.
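The interval logic the planner applies can be sketched directly: a partition survives an `>=` predicate only if its exclusive upper bound lies after the cutoff. A small Python illustration using the quarterly partitions from this article:

```python
from datetime import date

# Quarterly partition bounds for 2023 (inclusive start, exclusive end),
# mirroring the sales_2023_q1..q4 partitions used in these examples.
quarters = {
    "sales_2023_q1": (date(2023, 1, 1), date(2023, 4, 1)),
    "sales_2023_q2": (date(2023, 4, 1), date(2023, 7, 1)),
    "sales_2023_q3": (date(2023, 7, 1), date(2023, 10, 1)),
    "sales_2023_q4": (date(2023, 10, 1), date(2024, 1, 1)),
}

def prune_for_gte(cutoff):
    """Partitions that may contain rows with sale_date >= cutoff:
    keep every partition whose exclusive upper bound is after the cutoff."""
    return [name for name, (lo, hi) in quarters.items() if hi > cutoff]

print(prune_for_gte(date(2023, 5, 1)))  # Q1 is pruned; Q2-Q4 must be kept
```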
Joins Between Partitioned Tables
SELECT * FROM sales s
JOIN products p ON s.product_id = p.product_id
WHERE s.sale_date = '2023-05-15';
If sales is partitioned by sale_date, partition pruning works normally. If products is also partitioned, PostgreSQL attempts partitionwise joins where possible, enabling pruning on both sides.
Subqueries Providing Values
SELECT * FROM sales
WHERE sale_date = (SELECT MAX(order_date) FROM orders);
This requires execution time pruning because the subquery must run before PostgreSQL knows which partition to access.
3.5 Monitoring Partition Pruning
To verify partition pruning is working, use EXPLAIN:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sales WHERE sale_date = '2023-05-15';
What to look for:
Plan time pruning succeeded:
Seq Scan on sales_2023_q2
Only one partition appears in the plan at all.
Execution time pruning succeeded:
Append
Subplans Removed: 3
-> Seq Scan on sales_2023_q2
All partitions appear in the plan structure, but “Subplans Removed” shows pruning happened.
No pruning occurred:
Append
-> Seq Scan on sales_2023_q1
-> Seq Scan on sales_2023_q2
-> Seq Scan on sales_2023_q3
-> Seq Scan on sales_2023_q4
All partitions were scanned. This indicates a problem.
3.6 Why Partition Pruning Fails
Understanding why pruning fails helps you fix it:
Query doesn’t filter on partition key: If your WHERE clause doesn’t reference the partition column(s), PostgreSQL cannot prune.
Function calls on partition key: WHERE EXTRACT(YEAR FROM sale_date) = 2023 prevents pruning because PostgreSQL can't map the function result back to partition ranges. Use WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01' instead.
Type mismatches: If your partition key is DATE but you compare to TEXT without explicit casting, pruning may fail.
Generic plans in prepared statements: As discussed in the main article, generic plans prevent plan time pruning and older PostgreSQL versions struggled with execution time pruning.
OR conditions with non-partition columns: WHERE sale_date = '2023-05-15' OR customer_id = 100 prevents pruning because customer_id isn't the partition key.
Stable functions: WHERE sale_date = CURRENT_DATE may prevent plan-time pruning because CURRENT_DATE is only evaluated at run time, but execution-time pruning should still apply.
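The EXTRACT rewrite above is safe because the two predicates are logically identical; one is simply opaque to the planner while the other maps directly onto partition bounds. A quick Python check of the equivalence:

```python
from datetime import date, timedelta

def year_extract_predicate(d):
    # Mirrors WHERE EXTRACT(YEAR FROM sale_date) = 2023 -- opaque to the planner
    return d.year == 2023

def range_predicate(d):
    # Mirrors WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01'
    # -- expressed on the bare column, so the planner can prune with it
    return date(2023, 1, 1) <= d < date(2024, 1, 1)

# The two predicates agree on every date, so the rewrite changes nothing
# semantically -- it only changes what the planner can reason about.
d = date(2022, 12, 1)
while d < date(2024, 2, 1):
    assert year_extract_predicate(d) == range_predicate(d)
    d += timedelta(days=1)
print("predicates are equivalent")
```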
3.7 Partition Pruning Performance Impact
The performance difference between pruned and unpruned queries can be staggering:
Example scenario: 1000 partitions, each with 1 million rows. Query targets one partition.
With pruning:
Partitions opened: 1
Rows scanned: 1 million
Memory for plan nodes: ~10KB
Query time: 50ms
Without pruning:
Partitions opened: 1000
Rows scanned: 1 billion (returning ~1 million)
Memory for plan nodes: ~10MB
Query time: 45 seconds
The difference is not incremental; the wasted work grows linearly with partition count, so at thousands of partitions the unpruned plan completely dominates query time.
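The arithmetic behind the scenario above is worth making explicit; a small sketch using the figures from this section:

```python
partitions = 1000
rows_per_partition = 1_000_000

# Without pruning, every partition is opened and scanned.
rows_unpruned = partitions * rows_per_partition
# With pruning, only the single matching partition is touched.
rows_pruned = 1 * rows_per_partition

print(f"rows scanned without pruning: {rows_unpruned:,}")      # 1,000,000,000
print(f"wasted work factor: {rows_unpruned // rows_pruned}x")  # 1000x, scaling with partition count
```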
4. Partition Pruning in Prepared Statements: The Core Problem
Let me illustrate the severity with a real-world scenario:
-- Table with 1000 partitions
CREATE TABLE metrics (
timestamp TIMESTAMPTZ,
metric_name TEXT,
value NUMERIC
) PARTITION BY RANGE (timestamp);
-- Create 1000 daily partitions...
-- Prepared statement
PREPARE get_metrics AS
SELECT * FROM metrics
WHERE timestamp >= $1 AND timestamp < $2;
After the 6th execution, PostgreSQL switches to a generic plan. Each subsequent execution:
Allocates memory for 1000 partition nodes
Initializes executor state for 1000 partitions
Checks 1000 partition constraints
Returns data from just 1-2 partitions
If you have 100 connections each executing this prepared statement, you’re multiplying this overhead by 100. With connection poolers reusing connections (and thus reusing prepared statements), the problem compounds.
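A back-of-envelope estimate of that fleet-wide overhead is easy to work out. The sketch below assumes roughly 10KB per partition plan node, taken from the earlier Section 3.7 example; the figure is indicative only:

```python
partitions = 1000
plan_node_memory_bytes = 10 * 1024  # ~10KB per partition node (indicative figure)
connections = 100

per_plan = partitions * plan_node_memory_bytes  # ~10MB per cached generic plan
fleet_total = per_plan * connections            # ~1GB across the connection pool

print(f"per generic plan: {per_plan / 2**20:.0f} MiB")
print(f"across {connections} pooled connections: {fleet_total / 2**30:.2f} GiB")
```

That is memory spent describing partitions the query will never touch, which is why the generic-plan switch can show up as mysterious per-connection memory growth.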
5. The Fix: Evolution Across PostgreSQL Versions
PostgreSQL has steadily improved partition pruning for prepared statements:
PostgreSQL 11: Execution-Time Pruning Introduced
PostgreSQL 11 introduced run-time partition pruning, but it had significant limitations with prepared statements. Generic plans still included all partitions in memory, even if they could be skipped during execution.
PostgreSQL 12: Better Prepared Statement Pruning
PostgreSQL 12 made substantial improvements:
Generic plans gained the ability to defer partition pruning to execution time more effectively
The planner became smarter about when to use generic vs. custom plans for partitioned tables
Memory consumption for generic plans improved significantly
However, issues remained in edge cases, particularly with:
Multi level partitioning
Complex join queries involving partitioned tables
Prepared statements in stored procedures
PostgreSQL 13-14: Refined Heuristics
These versions improved the cost model for deciding between custom and generic plans:
Better accounting for partition pruning benefits in the cost calculation
More accurate statistics gathering on partitioned tables
Improved handling of partitionwise joins
PostgreSQL 15-16: The Real Game Changers
PostgreSQL 15 and 16 brought transformative improvements:
PostgreSQL 15:
Dramatically reduced memory usage for generic plans on partitioned tables
Improved execution-time pruning performance
Better handling of prepared statements with partition pruning
PostgreSQL 16:
Introduced incremental sorting improvements that benefit partitioned queries
Enhanced partition-wise aggregation
More aggressive execution-time pruning
The key breakthrough: PostgreSQL now builds “stub” plans that allocate minimal memory for partitions that will be pruned, rather than fully initializing all partition nodes.
Workarounds for Older Versions
If you’re stuck on older PostgreSQL versions, here are strategies to avoid the prepared statement pitfall:
1. Disable Generic Plans
Force PostgreSQL to always use custom plans:
-- Set at session level
SET plan_cache_mode = force_custom_plan;
-- Or for specific prepared statement contexts
PREPARE get_events AS
SELECT * FROM events WHERE event_date = $1;
-- Before execution
SET LOCAL plan_cache_mode = force_custom_plan;
EXECUTE get_events('2024-06-15');
This sacrifices the planning time savings but ensures proper partition pruning.
2. Use Statement Level Caching Instead
Many ORMs and database drivers offer statement level caching that doesn’t persist across multiple executions:
# psycopg2 example - parameters are bound client-side, so each execute()
# sends a one-shot statement and no generic plan is cached
cursor = connection.cursor()
cursor.execute(
"SELECT * FROM events WHERE event_date = %s",
(date_value,)
)
3. Adjust plan_cache_mode Per Query
PostgreSQL 12+ provides plan_cache_mode:
-- auto (default): use PostgreSQL's heuristics
-- force_generic_plan: always use generic plan
-- force_custom_plan: always use custom plan
SET plan_cache_mode = force_custom_plan;
For partitioned tables, force_custom_plan is often the right choice.
4. Increase Custom Plan Count
The threshold of 5 custom plans before switching to generic is hardcoded, but you can work around it by using different prepared statement names or by periodically deallocating and recreating prepared statements:
DEALLOCATE get_events;
PREPARE get_events AS SELECT * FROM events WHERE event_date = $1;
5. Partition Pruning Hints
In PostgreSQL 12+, you can sometimes coerce the planner into better behavior:
-- Using an explicit constraint that helps the planner
SELECT * FROM events
WHERE event_date = $1
AND event_date >= CURRENT_DATE - INTERVAL '1 year';
This additional constraint provides a hint about the parameter range.
Best Practices
Monitor your query plans: Use EXPLAIN (ANALYZE, BUFFERS) to check whether partition pruning is happening.
Check prepared statement statistics: Query pg_prepared_statements to see generic vs. custom plan usage:
SELECT name,
generic_plans,
custom_plans
FROM pg_prepared_statements;
Upgrade PostgreSQL: If you’re dealing with large partitioned tables, the improvements in PostgreSQL 15+ are worth the upgrade effort.
Design partitions appropriately: Don’t over-partition. Having 10,000 tiny partitions creates problems even with perfect pruning.
Use connection pooling wisely: Prepared statements persist per connection. With connection pooling, long-lived connections accumulate many prepared statements. Configure your pooler to occasionally recycle connections.
Benchmark both modes: Test your specific workload with both custom and generic plans to measure the actual impact.
Conclusion
Prepared statements are a powerful optimization, but their interaction with partitioned tables exposes a fundamental tension: caching for reuse versus specificity for efficiency. PostgreSQL’s evolution from version 11 through 16 represents a masterclass in addressing this challenge.
The key takeaway: if you’re using prepared statements with partitioned tables on PostgreSQL versions older than 15, be vigilant about plan caching behavior. Monitor memory usage, check execution plans, and don’t hesitate to force custom plans when generic plans cause problems.
For modern PostgreSQL installations (15+), the improvements are substantial enough that the traditional guidance of “be careful with prepared statements on partitioned tables” is becoming outdated. The database now handles these scenarios with far more intelligence and efficiency.
But understanding the history and mechanics remains crucial, because the next time you see mysterious memory growth in your PostgreSQL connections, you’ll know exactly where to look.
Next you need to set up your environment. Note: This is NOT the recommended way (as it uses long-term credentials).
The following example configures a default profile using sample values. Replace them with your own values as described in the following sections.
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: secretaccesskey
Default region name [None]: af-south-1
Default output format [None]: json
You can also use named profiles. The following example configures a profile named userprod using sample values. Replace them with your own values as described in the following sections.
$ aws configure --profile userprod
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: secretaccesskey
Default region name [None]: af-south-1
Default output format [None]: json
In the navigation pane of the IAM console, select Users and then select the User name of the user that you created previously.
On the user’s page, select the Security credentials page. Then, under Access keys, select Create access key.
For Create access key Step 1, choose Command Line Interface (CLI).
For Create access key Step 2, enter an optional tag and select Next.
For Create access key Step 3, select Download .csv file to save a .csv file with your IAM user’s access key and secret access key. You need this information for later.
Below is a quick (am busy) outline on how to automatically stop and start your EC2 instances.
Step 1: Tag your resources
In order to decide which instances to stop and start, you first need to add an auto-start-stop: Yes tag to all the instances you want to be affected by the start/stop functions. Note: You can use "Resource Groups and Tag Editor" to bulk-apply these tags to the resources you want to be affected by the lambda functions you are going to create. See below (click the orange button called "Manage tags of Selected Resources").
Step 2: Create a new role for our lambda functions
First we need to create the IAM role to run the Lambda functions. Go to IAM and click the "Create Role" button. Then select "AWS Service" from the "Trusted entity options", and select Lambda from the "Use Cases" options. Then click "Next", followed by "Create Policy". To specify the permissions, simply click the JSON button on the right of the screen and enter the below policy (swapping the region and account id for your region and account id):
Hit next and under “Review and create”, save the above policy as ec2-lambda-start-stop by clicking the “Create Policy” button. Next, search for this newly created policy and select it as per below and hit “Next”.
You will now see the “Name, review, and create” screen. Here you simply need to hit “Create Role” after you enter the role name as ec2-lambda-start-stop-role.
Note the policy is restricted to only have access to EC2 instances that contains auto-start-stop: Yes tags (least privileges).
If you want to review your role, this is how it should look. You can see I have filled in my region and account number in the policy:
Step 3: Create Lambda Functions To Start/Stop EC2 Instances
In this section we will create two lambda functions, one to start the instances and the other to stop the instances.
Step 3a: Add the Stop EC2 instance function
Go to the Lambda console and click on "Create function".
Create a lambda function with a function name of stop-ec2-instance-lambda, python3.11 runtime, and the ec2-lambda-start-stop-role created earlier (see image below).
Next add the lambda stop function and save it as stop-ec2-instance. Note, you will need to change the value of the region_name parameter accordingly.
import json
import boto3
ec2 = boto3.resource('ec2', region_name='af-south-1')
def lambda_handler(event, context):
    # Find running instances carrying the auto-start-stop: Yes tag
    instances = ec2.instances.filter(Filters=[
        {'Name': 'instance-state-name', 'Values': ['running']},
        {'Name': 'tag:auto-start-stop', 'Values': ['Yes']}])
    for instance in instances:
        instance.stop()
        print("Stopped instance: " + instance.id)
    return "success"
This is how your Lambda function should look:
Step 3b: Add the Start EC2 instance function
Go to the Lambda console and click on "Create function".
Create a lambda function with a function name of start-ec2-instance-lambda, python3.11 runtime, and the ec2-lambda-start-stop-role created earlier.
Then add the below code and save the function as start-ec2-instance.
Note, you will need to change the value of the region_name parameter accordingly.
import json
import boto3
ec2 = boto3.resource('ec2', region_name='af-south-1')
def lambda_handler(event, context):
    # Find stopped instances carrying the auto-start-stop: Yes tag
    instances = ec2.instances.filter(Filters=[
        {'Name': 'instance-state-name', 'Values': ['stopped']},
        {'Name': 'tag:auto-start-stop', 'Values': ['Yes']}])
    for instance in instances:
        instance.start()
        print("Started instance: " + instance.id)
    return "success"
4. Summary
If either of the above lambda functions is triggered, it will start or stop your EC2 instances based on the instance state and the value of the auto-start-stop tag. To automate this you can simply set up cron jobs, Step Functions, AWS EventBridge, Jenkins, etc.
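As one concrete option, the schedules can be expressed as EventBridge rules. The sketch below builds the put_rule parameters as plain dicts; the rule names and cron times are illustrative assumptions, and in a real deployment you would pass each dict to boto3's events.put_rule(**rule) and then attach the matching Lambda as a target with put_targets.

```python
# Hypothetical EventBridge schedules for the two Lambda functions above.
# Rule names and times are illustrative; cron expressions run in UTC.
stop_rule = {
    "Name": "stop-ec2-instances-nightly",
    "ScheduleExpression": "cron(0 18 ? * MON-FRI *)",  # 18:00 UTC, weekdays
    "State": "ENABLED",
}
start_rule = {
    "Name": "start-ec2-instances-morning",
    "ScheduleExpression": "cron(0 6 ? * MON-FRI *)",   # 06:00 UTC, weekdays
    "State": "ENABLED",
}

for rule in (stop_rule, start_rule):
    # e.g. boto3.client("events").put_rule(**rule), then put_targets(...)
    print(rule["Name"], rule["ScheduleExpression"])
```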
If you are testing how your autoscaling policies respond to CPU load then a really simple way to test this is using the “stress” command. Note: this is a very crude mechanism to test and wherever possible you should try and generate synthetic application load.
#!/bin/bash
# DESCRIPTION: After updating from the repo, installs stress-ng, a tool used to create various system load for testing purposes.
yum update -y
# Install stress-ng (shown here for Amazon Linux; on Debian/Ubuntu use apt instead)
sudo yum install -y stress-ng
# CPU spike: Run a CPU spike for 5 seconds
uptime
stress-ng --cpu 4 --timeout 5s --metrics-brief
uptime
# Disk Test: Start N (2) workers continually writing, reading and removing temporary files:
stress-ng --disk 2 --timeout 5s --metrics-brief
# Memory stress test
# Populate memory. Use mmap N bytes per vm worker, the default is 256MB.
# You can also specify the size as % of total available memory or in units of
# Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g:
# Note: The --vm 2 will start N workers (2 workers) continuously calling
# mmap/munmap and writing to the allocated memory. Note that this can cause
# systems to trip the kernel OOM killer on Linux systems if not enough
# physical memory and swap is not available
stress-ng --vm 2 --vm-bytes 1G --timeout 5s
# Combination Stress
# To run for 5 seconds with 4 cpu stressors, 2 io stressors and 1 vm
# stressor using 1GB of virtual memory, enter:
stress-ng --cpu 4 --io 2 --vm 1 --vm-bytes 1G --timeout 5s --metrics-brief
A single tenancy datacenter is a fixed scale, fixed price service on a closed network. The costs of the resources in the datacenter are divided up and shared out to the enterprise constituents on a semi-random basis. If anyone uses fewer resources than forecast, this generates waste which is shared back to the enterprise. If there is more demand than forecast, it will either generate service degradation, panic or an outage! This model is clearly fragile and doesn't respond quickly to change; it is also wasteful as it requires a level of overprovisioning based on forecast consumption (otherwise you will experience delays in projects, service degradation or have reduced resilience).
Cloud, on the other hand, is a multi-tenanted on demand software service which you pay for as you use. But surely having multiple tenants running on the same fixed capacity actually increases the risks, and just because it's in the cloud doesn't mean that you can get away without over provisioning – so who sits with the over provisioned costs? The cloud providers have to build this into their rates. So cloud providers have to deal with a balance sheet of fixed capacity shared amongst customers running on demand infrastructure. They do this with very clever forecasting, very short provisioning cycles, and by asking their customers for forecasts and then offering discounts for pre-commits.
Anything that moves you back towards managing resource levels / forecasting will destroy a huge portion of the value of moving to the cloud in the first instance. For example, if you have ever been to a re:Invent you will be floored by the rate of innovation and also how easy it is to absorb these new innovative products. But wait – you just signed a 5yr cost commit and now you learn about Aurora's new serverless database model. You realise that you can save millions of dollars; but you have to wait for your 5yr commits to expire before you adopt, or maybe start mining bitcoin with all your excess commits! This is anti-innovation and anti-customer.
What's even worse is that pre-commits are typically signed up front on day 1 – this is total madness!!! At the point where you know nothing about your brave new world, you use the old costs as a proxy to predict the new costs so that you can squeeze a lousy 5% saving at the risk of 100% of the commit size! What you will start to learn is that your cloud success is NOT based on the commercial contract that you sign with your cloud provider; it's actually based on the quality of the engineering talent that your organisation is able to attract. Cloud is an IP war – it's not a legal/sourcing war. Allow yourself to learn, don't box yourself in on day 1. When you sign the pre-commit you will notice your first year utilisation projections are actually tiny and therefore the savings are small. So what's the point of signing so early on when the risk is at a maximum and the gains are at a minimum? When you sign this deal you are essentially turning the cloud into a "financial data center" – you have destroyed the cloud before you even started!
A Lesson from the field – Solving Hadoop Compute Demand Spike:
We moved 7000 cores of burst compute to AWS to solve a capacity issue on premise. That's expensive, so let's "fix the costs"! We could go and sign an RI (reserved instance), play with spot, buy savings plans or even beg / barter for some EDP relief. But instead we plugged the service usage into QuickSight and analysed the queries. We found one query was using 60 percent of the entire bank's compute! Nobody confessed to owning the query, so we just disabled it (if you need a reason for your change management, describe the change as "disabling a financial DDOS"). We quickly found the service owner and explained that running a table scan across billions of rows to return a report with just last month's data is not a good idea. We also explained that if they don't fix this we will start billing them in 6 weeks' time (a few million dollars). The team deployed a fix and now we run the bank's big data stack at half the cost – just by tuning one query!
So the point of the above is that there is no substitute for engineering excellence. You have to understand and engineer the cloud to win, you cannot contract yourself into the cloud. The more contracts you sign the more failures you will experience. This leads me to point 2…
Step 2: Training, Training, Training
Start the biggest training campaign you possibly can – make this your crusade. Train everyone; business, finance, security, infrastructure – you name it, you train it. Don't limit what anyone can train on; training is cheap – feast as much as you can. Look at Udemy, ACloudGuru, YouTube, WhizLabs, etc. If you get this wrong then you will find your organisation fills up with expensive consultants and bespoke migration products that you don't need and can easily do yourself, via opensource or with your cloud provider's toolsets. In fact I would go one step further – if you're not prepared to learn about the cloud, you're not ready to go there.
Step 3: The OS Build
When you do start your cloud migration and begin to review your base OS images – go right back to the very beginning and remove every single product from these base builds. Look at what you can get out of the box from your cloud provider and really push yourself hard on what you really need vs what is merely nice to have. The trick is that to get the real benefit from a cloud migration, you have to start by making your builds as "naked" as possible. Nothing should move into the base build without a good reason. Ownership and report lines are not a good enough reason for someone's special "tool" to make it into the build. This process, if done correctly, should deliver between 20-40% of your cloud migration savings. Do this badly and your costs, complexity and support will all head in the wrong direction.
Security HAS to be a first class citizen of your new world. In most organizations this will likely make for some awkward cultural collisions (control and ownership vs agility) and some difficult dialogs. The cloud, by definition, should be liberating – so how do you secure it without creating a “cloud bunker” that nobody can actually use? More on this later… 🙂
Step 4: Hybrid Networking
For any organisation with data centers – make no mistake, if you get this wrong, it's over before it starts.