1. Backups Should Be Boring (and That Is the Point)
Backups are boring. They should be boring.
A backup system that generates excitement is usually signalling failure.
The only time backups become interesting is when they are missing, and that interest level is lethal. Emergency bridges. Frozen change windows. Executive escalation. Media briefings. Regulatory apology letters. Engineers being asked questions that have no safe answers.
Most backup platforms are built for the boring days. Rubrik is designed for the day boredom ends.
2. Backup Is Not the Product. Restore Is.
Many organisations still evaluate backup platforms on the wrong metric: how fast they can copy data somewhere else.
That metric is irrelevant during an incident.
When things go wrong, the only questions that matter are:
- What can I restore?
- How fast can it be used?
- How many restores can run in parallel?
- How little additional infrastructure is required?
Rubrik treats restore as the primary product, not a secondary feature.
3. Architectural Starting Point: Designed for Failure, Not Demos
Rubrik was built without tape era assumptions.
There is no central backup server, no serial job controller, and no media server bottleneck. Instead, it uses a distributed, scale out architecture with a global metadata index and a stateless policy engine.
Restore becomes a metadata lookup problem, not a job replay problem. This distinction is invisible in demos and decisive during outages.
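To make the lookup framing concrete, here is a minimal sketch of a point in time query against a snapshot index. It is an illustration only, not Rubrik's implementation: `SnapshotIndex`, its methods, and the integer timestamps are hypothetical names invented for this example.

```python
from bisect import bisect_right


class SnapshotIndex:
    """Toy metadata index: object id -> time-ordered (timestamp, snapshot id) pairs."""

    def __init__(self):
        self._entries = {}

    def record(self, object_id, timestamp, snapshot_id):
        # Keep each object's entries sorted so lookups can use binary search.
        entries = self._entries.setdefault(object_id, [])
        entries.append((timestamp, snapshot_id))
        entries.sort()

    def latest_at_or_before(self, object_id, timestamp):
        """Return the newest snapshot taken at or before `timestamp`, or None."""
        entries = self._entries.get(object_id, [])
        timestamps = [t for t, _ in entries]
        pos = bisect_right(timestamps, timestamp)
        return entries[pos - 1][1] if pos else None


index = SnapshotIndex()
index.record("sales-db", 100, "snap-a")
index.record("sales-db", 200, "snap-b")
print(index.latest_at_or_before("sales-db", 150))  # snap-a
```

The point of the sketch is that answering "what can I restore as of time T?" needs no replay of earlier backup jobs, only a search over the index.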
4. Performance Metrics That Actually Matter
Backup throughput is easy to optimise and easy to market. Restore performance is constrained by network fan out, restore concurrency, control plane orchestration, and application host contention.
Rubrik addresses this by default through parallel restore streams, linear scaling with node count, and minimal control plane chatter. Restore performance becomes predictable rather than optimistic.
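A back of the envelope model shows why node count and network fan out both matter. This is a sketch under stated assumptions: the function name, the per node throughput, and the network cap are illustrative inputs, not published Rubrik figures.

```python
def estimated_restore_hours(data_tb, nodes, per_node_gbps, network_cap_gbps):
    """Estimate restore time when throughput scales with nodes up to a network cap.

    All inputs are assumptions supplied by the caller; sizes use decimal units.
    """
    aggregate_gbps = min(nodes * per_node_gbps, network_cap_gbps)
    bits = data_tb * 8e12                      # terabytes -> bits
    seconds = bits / (aggregate_gbps * 1e9)    # gigabits per second -> bits per second
    return seconds / 3600


# Doubling nodes halves the estimate only while the network is not the bottleneck.
for nodes in (4, 8, 16):
    print(nodes, round(estimated_restore_hours(9, nodes, 5, 40), 2))
```

In this model the estimate stops improving once `nodes * per_node_gbps` exceeds the network cap, which is the point at which adding nodes no longer helps.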
5. Restore Semantics That Match Reality
The real test of any backup platform is not how elegantly it captures data, but how usefully it returns that data when needed. This is where architectural decisions made years earlier either pay dividends or extract penalties.
5.1 Instant Access Instead of Full Rehydration
Rubrik does not require full data copy back before access. It supports live mount of virtual machines, database mounts directly from backup storage, and file system mounts for selective recovery.
The recovery model becomes access first, copy later if needed. This is the difference between minutes and hours when production is down.
5.2 Dropping a Table Should Not Be a Crisis
Rubrik understands databases as structured systems, not opaque blobs.
It supports table level restores for SQL Server, mounting a database backup as a live database, extracting tables or schemas without restoring the full database, and point in time recovery without rollback.
Accidental table drops should be operational annoyances, not existential threats.
5.3 Supported Database Engines
Rubrik provides native protection for the major enterprise database platforms:
| Database Engine | Live Mount | Point in Time Recovery | Key Constraints |
|---|---|---|---|
| Microsoft SQL Server | Yes | Yes (transaction log replay) | SQL 2012+ supported; Always On AG, FCI, standalone |
| Oracle Database | Yes | Yes (archive log replay) | RAC, Data Guard, Exadata supported; SPFILE required for automated recovery |
| SAP HANA | No | Yes | Backint API integration; uses native HANA backup scheduling |
| PostgreSQL | No | Yes (up to 5 minute RPO) | File level incremental; on premises and cloud (AWS, Azure, GCP) |
| IBM Db2 | Via Elastic App Service | Yes | Uses native Db2 backup utilities |
| MongoDB | Via Elastic App Service | Yes | Sharded and unsharded clusters; no quiescing required |
| MySQL | Via Elastic App Service | Yes | Uses native MySQL backup tools |
| Cassandra | Via Elastic App Service | Yes | Via Rubrik Datos IO integration |
The distinction between native integration and Elastic App Service matters operationally. Native integration means Rubrik handles discovery, scheduling, and orchestration directly. Elastic App Service means Rubrik provides managed volumes as backup targets while the database’s native tools handle the actual backup process. Both approaches deliver immutability and policy driven retention, but the operational experience differs.
5.4 Live Mount: Constraints and Caveats
Live Mount is Rubrik’s signature capability—mounting backups as live, queryable databases without copying data back to production storage. The database runs with its data files served directly from the Rubrik cluster over NFS (for Oracle) or SMB 3.0 (for SQL Server).
This capability is transformative for specific use cases. It is not a replacement for production storage.
What Live Mount Delivers:
- Near instant database availability (seconds to minutes, regardless of database size)
- Zero storage provisioning on the target host
- Multiple concurrent mounts from the same backup
- Point in time access across the entire retention window
- Ideal for granular recovery, DBCC health checks, test/dev cloning, audit queries, and upgrade validation
What Live Mount Does Not Deliver:
- Production grade I/O performance
- High availability during Rubrik cluster maintenance
- Persistence across host or cluster reboots
IOPS Constraints:
Live Mount performance is bounded by the Rubrik appliance’s ability to serve I/O, not by the target host’s storage subsystem. Published figures suggest approximately 30,000 IOPS per Rubrik appliance for Live Mount workloads. This is adequate for reporting queries, data extraction, and validation testing. It is not adequate for transaction heavy production workloads.
The performance characteristics are inherently different from production storage:
| Metric | Production SAN/Flash | Rubrik Live Mount |
|---|---|---|
| Random read IOPS | 100,000+ | ~30,000 per appliance |
| Latency profile | Sub millisecond | Network + NFS/SMB overhead |
| Write optimisation | Production tuned | Backup optimised |
| Concurrent workloads | Designed for contention | Shared with backup operations |
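The IOPS ceiling can be turned into a quick capacity check before mounting several databases at once. The sketch below assumes the approximate 30,000 IOPS figure quoted above; the 30 percent reserve for concurrent backup activity and the function name are hypothetical planning inputs, not vendor guidance.

```python
LIVE_MOUNT_IOPS_PER_APPLIANCE = 30_000  # approximate figure quoted in the text


def fits_live_mount_budget(workload_iops, appliances=1, backup_reserve_pct=30):
    """Check whether the combined Live Mount demand fits the appliance budget.

    backup_reserve_pct is an assumed share of IOPS held back for backup jobs.
    Returns (fits, headroom_iops); headroom is negative when over budget.
    """
    usable = appliances * LIVE_MOUNT_IOPS_PER_APPLIANCE * (100 - backup_reserve_pct) // 100
    demand = sum(workload_iops)
    return demand <= usable, usable - demand


# Three reporting mounts on one appliance.
print(fits_live_mount_budget([5000, 8000, 4000]))  # (True, 4000)
```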
SQL Server Live Mount Specifics:
- Databases mount via SMB 3.0 shares with UNC paths
- Transaction log replay occurs during mount for point in time positioning
- The mounted database is read write, but writes go to the Rubrik cluster
- Supported for standalone instances, Failover Cluster Instances, and Always On Availability Groups
- Table level recovery requires mounting the database, then using T-SQL to extract and import specific objects
Oracle Live Mount Specifics:
- Data files mount via NFS; redo logs and control files remain on the target host
- Automated recovery requires source and target configurations to match (RAC to RAC, single instance to single instance, ASM to ASM)
- Files only recovery allows dissimilar configurations but requires DBA managed RMAN recovery
- SPFILE is required for automated recovery; PFILE databases require manual intervention
- Block change tracking (BCT) is disabled on Live Mount targets
- Live Mount fails if the target host, RAC cluster, or Rubrik cluster reboots during the mount—requiring forced unmount to clean up metadata
- Direct NFS (DNFS) is recommended on Oracle RAC nodes for improved recovery performance
What Live Mount Is Not:
Live Mount is explicitly designed for temporary access, not sustained production workloads. The use cases Rubrik markets (test/dev, DBCC validation, granular recovery, audit queries) all share a common characteristic: they are time bounded operations that tolerate moderate I/O performance in exchange for instant availability.
Running production transaction processing against a Live Mount database would be technically possible and operationally inadvisable. The I/O profile, the network dependency, and the lack of high availability guarantees make it unsuitable for workloads where performance and uptime matter.
5.5 The Recovery Hierarchy
Understanding when to use each recovery method matters:
| Recovery Need | Recommended Method | Time to Access | Storage Required |
|---|---|---|---|
| Extract specific rows/tables | Live Mount + query | Minutes | None |
| Validate backup integrity | Live Mount + DBCC | Minutes | None |
| Clone for test/dev | Live Mount | Minutes | None |
| Full database replacement | Export/Restore | Hours (size dependent) | Full database size |
| Disaster recovery cutover | Instant Recovery | Minutes (then migrate) | Temporary, then full |
The strategic value of Live Mount is avoiding full restores when full restores are unnecessary. For a 5TB database where someone dropped a single table, Live Mount means extracting that table in minutes rather than waiting hours for a complete restore.
For actual disaster recovery, where the production database is gone and must be replaced, Live Mount provides bridge access while the full restore completes in parallel. The database is queryable immediately; production grade performance follows once data migration finishes.
6. Why Logical Streaming Is a Design Failure
Traditional restore models stream backup data through the database host. This guarantees CPU contention, I/O pressure, and restore times proportional to database size rather than change size.
Rubrik avoids this by mounting database images and extracting only required objects. The database host stops being collateral damage during recovery.
6.1 The VSS Tax: Why SQL Server Backups Cannot Escape Application Coordination
For VMware workloads without databases, Rubrik can leverage storage level snapshots that are instantaneous, application agnostic, and impose zero load on the guest operating system. The hypervisor freezes the VM state, the storage array captures the point in time image, and the backup completes before the application notices.
SQL Server cannot offer this simplicity. The reason is not a Microsoft limitation or a Rubrik constraint. The reason is transactional consistency.
The Crash Consistent Option Exists
Nothing technically prevents Rubrik, or any backup tool, from taking a pure storage snapshot of a SQL Server volume without application coordination. The snapshot would complete in milliseconds with zero database load.
The problem is what you would recover: a crash consistent image, not an application consistent one.
A crash consistent snapshot captures storage state mid flight. This includes partially written pages, uncommitted transactions, dirty buffers not yet flushed to disk, and potentially torn writes caught mid I/O. SQL Server is designed to recover from exactly this state. Every time the database engine starts after an unexpected shutdown, it runs crash recovery, rolling forward committed transactions from the log and rolling back uncommitted ones.
The database will become consistent. Eventually. Probably.
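The roll forward and roll back described above can be illustrated with a toy write ahead log. This is a deliberately simplified sketch, not SQL Server's recovery algorithm: the record layout and the `crash_recover` function are assumptions made for the example.

```python
def crash_recover(pages_on_disk, log):
    """Bring pages captured mid flight to a consistent state using a toy log.

    Log records are ("write", txn, page, old_value, new_value) or ("commit", txn).
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    pages = dict(pages_on_disk)

    # Redo: reapply every logged write in order, whether or not it reached disk.
    for rec in log:
        if rec[0] == "write":
            _, _txn, page, _old, new = rec
            pages[page] = new

    # Undo: walk the log backwards and revert writes from uncommitted transactions.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, _txn, page, old, _new = rec
            pages[page] = old

    return pages


# T1 committed but its page never reached disk; T2's dirty page did, without a commit.
disk = {"A": 1, "B": 99}
log = [("write", "T1", "A", 1, 2), ("write", "T2", "B", 10, 99), ("commit", "T1")]
print(crash_recover(disk, log))  # {'A': 2, 'B': 10}
```

The snapshot alone holds neither the committed value of A nor the correct value of B. Consistency depends on the log being intact and replay completing.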
Why Probably Is Not Good Enough
Crash recovery works. It works reliably. It is tested millions of times daily across every SQL Server instance that experiences an unclean shutdown.
But restore confidence matters. When production is down and executives are asking questions, the difference between “this backup is guaranteed consistent” and “this backup should recover correctly after crash recovery completes” is operationally significant.
VSS exists to eliminate that uncertainty.
What VSS Actually Does
When a backup application requests an application consistent SQL Server snapshot, the following sequence executes:
- The backup application calls the VSS coordinator
- VSS notifies the SQL Server VSS Writer that a backup is imminent
- SQL Server flushes dirty pages from the buffer pool to disk
- SQL Server briefly freezes write I/O to guarantee a consistent capture point
- The storage snapshot executes
- SQL Server resumes normal operation
- VSS confirms completion to the backup application
The result is a snapshot that requires no crash recovery on restore. The database is immediately consistent, immediately usable, and carries no uncertainty about transactional integrity.
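The ordering of the sequence above (flush, freeze, capture, thaw) can be modelled in a few lines. This is a conceptual simulation only: it does not call the Windows VSS API, and `ToyDatabase`, `frozen_for_snapshot`, and `application_consistent_snapshot` are hypothetical names.

```python
from contextlib import contextmanager


class ToyDatabase:
    """Minimal stand-in for a database with a buffer pool and on-disk pages."""

    def __init__(self):
        self.disk = {}
        self.buffer = {}     # dirty pages not yet written to disk
        self.frozen = False

    def write(self, page, value):
        if self.frozen:
            raise RuntimeError("write I/O is frozen for snapshot")
        self.buffer[page] = value

    def flush(self):
        self.disk.update(self.buffer)
        self.buffer.clear()


@contextmanager
def frozen_for_snapshot(db):
    db.flush()            # flush dirty pages before the capture point
    db.frozen = True      # briefly freeze write I/O
    try:
        yield
    finally:
        db.frozen = False  # resume normal operation even if capture fails


def application_consistent_snapshot(db):
    with frozen_for_snapshot(db):
        return dict(db.disk)  # stand-in for the storage snapshot


db = ToyDatabase()
db.write("A", 1)
crash_consistent = dict(db.disk)              # misses the dirty page
app_consistent = application_consistent_snapshot(db)
print(crash_consistent, app_consistent)       # {} {'A': 1}
```

The `finally` clause mirrors the requirement that the database thaws even when the snapshot step fails. A freeze that never releases is worse than a missed backup.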
The Coordination Cost
The VSS freeze window is typically brief, milliseconds to low seconds. But the preparation is not free.
Buffer pool flushes on large databases generate I/O pressure. Checkpoint operations compete with production workloads. The freeze, however short, introduces latency for in flight transactions. The database instance is actively participating in its own backup.
For databases measured in terabytes, with buffer pools consuming hundreds of gigabytes, this coordination overhead becomes operationally visible. Backup windows that appear instantaneous from the storage console are hiding real work inside the SQL Server instance.
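A rough estimate makes the preparation cost visible. The sketch below is arithmetic only, and every input (buffer pool size, dirty fraction, sustained write throughput) is an assumed figure to be replaced with measurements from the actual instance.

```python
def checkpoint_seconds(buffer_pool_gb, dirty_fraction, write_mb_per_s):
    """Estimate the time to flush dirty pages before the snapshot freeze.

    Assumes the flush is limited only by sustained write throughput.
    """
    dirty_mb = buffer_pool_gb * 1024 * dirty_fraction
    return dirty_mb / write_mb_per_s


# 400 GB buffer pool, 10% dirty, 2,000 MB/s of spare write throughput.
print(round(checkpoint_seconds(400, 0.10, 2000), 1))
```

Under these assumptions the flush takes about 20 seconds of sustained writes that compete with production I/O, even though the snapshot itself appears instantaneous.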
The Architectural Asymmetry
This creates a fundamental difference in backup elegance across workload types:
| Workload Type | Backup Method | Application Load | Restore State |
|---|---|---|---|
| VMware VM (no database) | Storage snapshot | Zero | Crash consistent (acceptable) |
| VMware VM (with SQL Server) | VSS coordinated snapshot | Moderate | Application consistent |
| Physical SQL Server | VSS coordinated snapshot | Moderate to high | Application consistent |
| Physical SQL Server | Pure storage snapshot | Zero | Crash consistent (risky) |
For a web server or file share, crash consistent is fine. The application has no transactional state worth protecting. For a database, crash consistent means trusting recovery logic rather than guaranteeing consistency.
The Uncomfortable Reality
The largest, most critical SQL Server databases, the ones that would benefit most from zero overhead instantaneous backup, are precisely the workloads where crash consistent snapshots carry the most risk. More transactions in flight. Larger buffer pools. More recovery time if something needs replay.
Rubrik supports VSS coordination because the alternative is shipping backups that might need crash recovery. That uncertainty is acceptable for test environments. It is rarely acceptable for production databases backing financial systems, customer records, or regulatory reporting.
The VSS tax is not a limitation imposed by Microsoft or avoided by competitors. It is the cost of consistency. Every backup platform that claims application consistent SQL Server protection is paying it. The only question is whether they admit the overhead exists.
7. Snapshot Based Protection Is Objectively Better (When You Can Get It)
The previous section explained why SQL Server backups cannot escape application coordination. VSS exists because transactional consistency requires it, and the coordination overhead is the price of certainty.
This makes the contrast with pure snapshot based protection even starker. Where snapshots work cleanly, they are not incrementally better. They are categorically superior.
What Pure Snapshots Deliver
Snapshot based backups in environments that support them provide:
- Near instant capture: microseconds to milliseconds, regardless of dataset size
- Zero application load: the workload never knows a backup occurred
- Consistent recovery points: the storage layer guarantees point in time consistency
- Predictable backup windows: duration is independent of data volume
- No bandwidth consumption during capture: data movement happens later, asynchronously
A 50TB VMware datastore snapshots in the same time as a 50GB datastore. Backup windows become scheduling decisions rather than capacity constraints.
Rubrik exploits this deeply in VMware environments. Snapshot orchestration, instant VM recovery, and live mounts all depend on the hypervisor providing clean, consistent, zero overhead capture points.
Why This Is Harder Than It Looks
The elegance of snapshot based protection depends entirely on the underlying platform providing the right primitives. This is where the gap between VMware and everything else becomes painful.
VMware offers:
- Native snapshot APIs with transactional semantics
- Changed Block Tracking (CBT) for efficient incrementals
- Hypervisor level consistency without guest coordination
- Storage integration through VADP (vSphere APIs for Data Protection)
These are not accidental features. VMware invested years building a backup ecosystem because they understood that enterprise adoption required operational maturity, not just compute virtualisation.
Physical hosts offer none of this.
There is no universal snapshot API for bare metal servers. Storage arrays provide snapshot capabilities, but each vendor implements them differently, with different consistency guarantees, different integration points, and different failure modes. The operating system has no standard mechanism to coordinate application state with storage level capture.
The Physical Host Penalty
This is why physical SQL Server hosts face a compounding disadvantage:
- No hypervisor abstraction: there is no layer between the OS and storage that can freeze state cleanly
- VSS remains mandatory: application consistency still requires database coordination
- No standardised incremental tracking: without CBT or equivalent, every backup must rediscover what changed
- Storage integration is bespoke: each array, each SAN, each configuration requires specific handling
The result is that physical hosts running the largest databases (the workloads that generate the most backup data, take the longest to restore, and sit under the most operational pressure) receive the least architectural benefit from modern backup platforms.
They are stuck paying the VSS tax without receiving the snapshot dividend.
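The cost of missing incremental tracking is easiest to see side by side. The sketch below contrasts a volume that records changed blocks as writes happen (the idea behind CBT) with a full scan that must hash every block to rediscover changes. It is a conceptual model, not VMware's API: `TrackedVolume` and `full_scan_delta` are hypothetical names.

```python
import hashlib


class TrackedVolume:
    """Toy volume that remembers which blocks changed since the last backup."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.changed = set()

    def write(self, index, data):
        self.blocks[index] = data
        self.changed.add(index)   # tracking cost is paid at write time

    def incremental(self):
        """Return only the changed blocks, then reset tracking."""
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta


def full_scan_delta(blocks, previous_hashes):
    """Without tracking: read and hash every block to find what changed."""
    delta = {}
    for i, data in enumerate(blocks):
        if previous_hashes.get(i) != hashlib.sha256(data).hexdigest():
            delta[i] = data
    return delta


volume = TrackedVolume([b"alpha", b"beta", b"gamma"])
baseline = {i: hashlib.sha256(b).hexdigest() for i, b in enumerate(volume.blocks)}
volume.write(1, b"BETA")

print(full_scan_delta(volume.blocks, baseline))  # reads all three blocks
print(volume.incremental())                      # reads one block
```

Both paths produce the same delta. The difference is that the full scan reads the entire volume to find it, which is the penalty described above for hosts without tracking.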
The Integration Hierarchy
Backup elegance follows a clear hierarchy based on platform integration depth:
| Environment | Snapshot Quality | Incremental Efficiency | Application Consistency | Overall Experience |
|---|---|---|---|---|
| VMware (no database) | Excellent | CBT driven | Not required | Seamless |
| VMware (with SQL Server) | Excellent | CBT driven | VSS coordinated | Good with overhead |
| Cloud native (EBS, managed disks) | Good | Provider dependent | Varies by workload | Generally clean |
| Physical with enterprise SAN | Possible | Array dependent | VSS coordinated | Complex but workable |
| Physical with commodity storage | Limited | Often full scan | VSS coordinated | Painful |
The further down this hierarchy, the more the backup platform must compensate for missing primitives. Rubrik handles this better than most, but even excellent software cannot conjure APIs that do not exist.
Why the Industry Irony Persists
The uncomfortable truth is that snapshot based protection delivers its greatest value precisely where it is least available.
A 500GB VMware VM snapshots effortlessly. The hypervisor provides everything needed. Backup is boring, as it should be.
A 50TB physical SQL Server, the database actually keeping the business running, containing years of transactional history, backing regulatory reporting and financial reconciliation, must coordinate through VSS, flush terabytes of buffer pool, sustain I/O pressure during capture, and hope the storage layer cooperates.
The workloads that need snapshot elegance the most are architecturally prevented from receiving it.
This is not a Rubrik limitation. It is not a Microsoft conspiracy. It is the accumulated consequence of decades of infrastructure evolution where virtualisation received backup investment and physical infrastructure did not.
What This Means for Architecture Decisions
Understanding this hierarchy should influence infrastructure strategy:
Virtualise where possible. The backup benefits alone often justify the overhead. A SQL Server VM with VSS coordination still benefits from CBT, instant recovery, and hypervisor level orchestration.
Choose storage with snapshot maturity. If physical hosts are unavoidable, enterprise arrays with proven snapshot integration reduce the backup penalty. This is not the place for commodity storage experimentation.
Accept the VSS overhead. For SQL Server workloads, crash consistent snapshots are technically possible but operationally risky. The coordination cost is worth paying. Budget for it in backup windows and I/O capacity.
Plan restore, not backup. Snapshot speed is irrelevant if restore requires hours of data rehydration. The architectural advantage of snapshots extends to recovery only if the platform supports instant mount and selective restore.
Rubrik’s value in this landscape is not eliminating the integration gaps—nobody can—but navigating them intelligently. Where snapshots work, Rubrik exploits them fully. Where they do not, Rubrik minimises the penalty through parallel restore, live mounts, and metadata driven recovery.
The goal remains the same: make restore the product, regardless of how constrained the backup capture had to be.
8. Ransomware: Where Architecture Is Exposed
8.1 The Restore Storm Problem
After ransomware, the challenge is not backup availability. The challenge is restoring everything at once.
Constraints appear immediately. East-west traffic saturates. DWDM links run hot. Core switch buffers overflow. Cloud egress throttling kicks in.
Rubrik mitigates this through parallel restores, SLA based prioritisation, and live mounts for critical systems. What it cannot do is defeat physics. A good recovery plan avoids turning a data breach into a network outage.
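One way to keep a restore storm from saturating the network is to admit restores in priority order, limited by how many streams the link can carry. The following is a minimal sketch of that idea, not Rubrik's scheduler: the function name, the priority numbers, and the bandwidth figures are assumptions.

```python
import heapq


def schedule_restores(requests, link_gbps, per_restore_gbps):
    """Group restores into waves that fit the link, most critical first.

    requests: list of (priority, name); a lower number means more critical.
    Returns a list of waves, each a list of names that run concurrently.
    """
    slots = max(1, int(link_gbps // per_restore_gbps))
    heap = list(requests)
    heapq.heapify(heap)

    waves = []
    while heap:
        wave = [heapq.heappop(heap)[1] for _ in range(min(slots, len(heap)))]
        waves.append(wave)
    return waves


requests = [(2, "file-share"), (1, "payments-db"), (3, "wiki"), (1, "auth")]
print(schedule_restores(requests, link_gbps=10, per_restore_gbps=5))
# [['auth', 'payments-db'], ['file-share', 'wiki']]
```

The design choice is to cap concurrency at what the link can sustain rather than launching every restore at once, so critical systems finish sooner instead of every restore finishing late.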
9. SaaS vs Appliance: This Is a Network Decision
Functionally, Rubrik SaaS and on prem appliances share the same policy engine, metadata index, and restore semantics.
The difference is bandwidth reality.
On prem appliances provide fast local restores, predictable latency, and minimal WAN dependency. SaaS based protection provides excellent cloud workload coverage and operational simplicity, but restore speed is bounded by network capacity and egress costs.
Hybrid estates usually require both.
10. Why Rubrik in the Cloud?
Cloud providers offer native backup primitives. These are necessary but insufficient.
They do not provide unified policy across environments, cross account recovery at scale, ransomware intelligence, or consistent restore semantics.
Rubrik turns cloud backups into recoverable systems rather than isolated snapshots.
10.1 Should You Protect Your AWS Root and Crypto Accounts?
Yes, because losing the control plane is worse than losing data.
Rubrik protects IAM configuration, account state, and infrastructure metadata. After a compromise, restoring how the account was configured is as important as restoring the data itself.
11. Backup Meets Security (Finally)
Rubrik integrates threat awareness into recovery using entropy analysis, change rate anomaly detection, and snapshot divergence tracking.
This answers the most dangerous question in recovery: which backup is actually safe to restore?
Most platforms cannot answer this with confidence.
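Entropy analysis rests on a simple observation: encrypted data looks close to random, so its byte level Shannon entropy approaches 8 bits per byte. The sketch below computes that measure. The 7.5 threshold and the `looks_encrypted` helper are illustrative assumptions, not Rubrik's detection logic, and compressed files will also score high.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Flag data whose entropy exceeds an assumed threshold."""
    return shannon_entropy(data) > threshold


plain = b"customer_id,name,balance\n" * 100
uniform = bytes(range(256)) * 16   # stand-in for high-entropy content

print(looks_encrypted(plain), looks_encrypted(uniform))  # False True
```

A sudden jump in entropy between consecutive snapshots of the same files is the signal of interest. A single high reading on its own proves little.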
12. VMware First Class Citizen, Physical Hosts Still Lag
Rubrik’s deepest integrations exist in VMware environments, including snapshot orchestration, instant VM recovery, and live mounts.
The uncomfortable reality remains that physical hosts with the largest datasets would benefit most from snapshot based protection, yet receive the least integration. This is an industry gap, not just a tooling one.
13. When Rubrik Is Not the Right Tool
Rubrik is not universal.
It is less optimal when bandwidth is severely constrained, estates are very small, or tape workflows are legally mandated.
Rubrik’s value emerges at scale, under pressure, and during failure.
14. Conclusion: Boredom Is Success
Backups should be boring. Restores should be quiet. Executives should never know the platform exists.
The only time backups become exciting is when they fail, and that excitement is almost always lethal.
Rubrik is not interesting because it stores data. It is interesting because, when everything is already on fire, restore remains a controlled engineering exercise rather than a panic response.
References
- Gartner Magic Quadrant for Enterprise Backup and Recovery Solutions – https://www.gartner.com/en/documents/5138291
- Rubrik Technical Architecture Whitepapers – https://www.rubrik.com/resources
- Microsoft SQL Server Backup and Restore Internals – https://learn.microsoft.com/en-us/sql/relational-databases/backup-restore/backup-overview-sql-server
- VMware Snapshot and Backup Best Practices – https://knowledge.broadcom.com/external/article?legacyId=1025279
- AWS Backup and Recovery Documentation – https://docs.aws.amazon.com/aws-backup/
- NIST SP 800-209 Security Guidelines for Storage Infrastructure – https://csrc.nist.gov/publications/detail/sp/800-209/final
- Rubrik SQL Live Mount Documentation – https://www.rubrik.com/solutions/sql-live-mount
- Rubrik Oracle Live Mount Documentation – https://docs.rubrik.com/en-us/saas/oracle/oracle_live_mount.html
- Rubrik for Oracle and Microsoft SQL Server Data Sheet – https://www.rubrik.com/content/dam/rubrik/en/resources/data-sheet/Rubrik-for-Oracle-and-Microsoft-SQL-Sever-DS.pdf
- Rubrik Enhanced Performance for Microsoft SQL and Oracle Database – https://www.rubrik.com/blog/technology/2021/12/rubrik-enhanced-performance-for-microsoft-sql-and-oracle-database
- Rubrik PostgreSQL Support Announcement – https://www.rubrik.com/blog/technology/24/10/rubrik-expands-database-protection-with-postgre-sql-support-and-on-premises-sensitive-data-monitoring-for-microsoft-sql-server
- Rubrik Elastic App Service – https://www.rubrik.com/solutions/elastic-app-service
- Rubrik and VMware vSphere Reference Architecture – https://www.rubrik.com/content/dam/rubrik/en/resources/white-paper/ra-rubrik-vmware-vsphere.pdf
- Protecting Microsoft SQL Server with Rubrik Technical White Paper – https://www.rubrik.com/content/dam/rubrik/en/resources/white-paper/rwp-protecting-microsoft-sql-server-with-rubrik.pdf
- The Definitive Guide to Rubrik Cloud Data Management – https://www.rubrik.com/content/dam/rubrik/en/resources/white-paper/rwp-definitive-guide-to-rubrik-cdm.pdf
- Rubrik Oracle Tools GitHub Repository – https://github.com/rubrikinc/rubrik_oracle_tools
- Automating SQL Server Live Mounts with Rubrik – https://virtuallysober.com/2017/08/08/automating-sql-server-live-mounts-with-rubrik-alta-4-0/