Backup and Recovery Strategy
A backup is only as good as the last successful restore. Snapshots and incremental backups protect your data on disk, but without a tested recovery procedure, you cannot know whether those files are usable when it matters most. This page helps you design a backup strategy that aligns with your operational requirements, covers the methods available in Apache Cassandra, and explains how to validate that recovery actually works.
RPO and RTO Framework
Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time. Recovery Time Objective (RTO) is the maximum acceptable downtime before the system is back in service. These two metrics drive every other decision in backup design.
| Use Case | Target RPO | Target RTO | Recommended approach |
|---|---|---|---|
| Critical financial or transactional data | Near zero (seconds to minutes) | < 1 hour | Incremental backups enabled, snapshots every 6 hours, off-site replication, tested restore runbook |
| General-purpose operational data | 1 to 4 hours | 2 to 8 hours | Daily snapshots, incremental backups enabled, remote storage, monthly restore test |
| Analytics or time-series (append-only) | 4 to 24 hours | 8 to 24 hours | Daily snapshots, local or NFS staging, quarterly restore test |
| Development or staging environment | Best effort | Best effort | Ad-hoc snapshots before schema changes, no off-site replication required |

Cassandra's multi-datacenter replication is not a substitute for backups. Replication propagates writes, including accidental deletes and truncations, to all replicas in near real time.
Backup Methods Comparison
Apache Cassandra provides two built-in mechanisms: snapshots and incremental backups. Third-party tooling such as Apache Cassandra Medusa extends these with coordinated cluster-wide backup and object storage upload.
| Method | How it works | Storage overhead | Restore complexity | Best suited for |
|---|---|---|---|---|
| Snapshots | Hard links to SSTable files at a point in time; taken with `nodetool snapshot` | Low at creation; grows if SSTables are not compacted away | Moderate — copy files back, then run `nodetool refresh` | Periodic full backups, pre-upgrade safety nets, schema-change guards |
| Incremental backups | Hard link created for each SSTable flushed to disk; stored in the table's `backups/` directory | Can grow quickly under heavy write workloads if not pruned | Higher — requires pairing with a snapshot for the base, then applying incremental files | Reducing RPO between full snapshots when combined with a recent snapshot baseline |
| Medusa (third-party) | Coordinates snapshots across all nodes and uploads to object storage (S3, GCS, Azure Blob) | Depends on storage backend; deduplication available in some backends | Lower — single-command restores, handles topology changes | Production clusters where off-site storage and coordinated multi-node restore are required |
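The single-command restore noted in the table maps roughly onto Medusa's CLI as sketched below. This is only an illustration: it assumes Medusa is already installed and configured with an object-storage bucket on every node, the backup name is arbitrary, and exact subcommands and flags vary between Medusa releases, so check the documentation for the version you deploy.

```bash
# Take a named backup from a node; Medusa performs the local snapshot and upload.
medusa backup --backup-name=weekly-2024-01-07

# List the backups available in the configured bucket.
medusa list-backups

# Restore the whole cluster from a named backup (run from an operations host).
medusa restore-cluster --backup-name=weekly-2024-01-07
```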
For the built-in snapshot and incremental backup mechanics, see Backups and Snapshots.
Storage Targets
Where you store backup files determines durability, cost, and restore speed.
Local Disk
Storing snapshot hard-links on the same volume as live data provides no protection against disk failure or node loss. Local storage is only acceptable as a transient staging area before transferring files to a remote target.
Pros: Zero transfer latency, no network cost.
Cons: No protection against hardware failure; disk I/O contention during restore.
NFS or Shared Network Storage
Mounting a network filesystem under the Cassandra data directory and copying snapshot directories to it is a common pattern for on-premises clusters.
Pros: Simple to configure, familiar tooling, reasonable cost for moderate data volumes.
Cons: NFS becomes a single point of failure; throughput is limited by network bandwidth; large clusters may saturate the NFS server during concurrent backups.
Object Storage (S3, GCS, Azure Blob)
Cloud object storage is the recommended durable target for production clusters. Files are uploaded after a local snapshot is taken; tools such as Medusa and `aws s3 sync` automate this transfer.
Pros: High durability (typically 99.999999999%), cross-region replication available, cost scales with usage, no capacity planning required.
Cons: Egress costs on restore; upload speed depends on available network bandwidth; requires credentials management and IAM policy configuration.
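As a concrete sketch of the snapshot-then-upload flow, the commands below take a tagged snapshot of one table and sync it to S3. The keyspace, table directory (which carries a UUID suffix assigned when the table was created), and bucket name are placeholders; adjust the paths to your `data_file_directories` setting.

```bash
#!/usr/bin/env bash
# Illustrative only: take a tagged snapshot of one table, then upload it with aws s3 sync.
set -euo pipefail

TAG="daily-$(date +%Y%m%d)"
DATA_DIR="/var/lib/cassandra/data"
# The table directory name includes a UUID suffix; this one is a placeholder.
TABLE_DIR="${DATA_DIR}/my_keyspace/my_table-1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d"
SNAPSHOT_DIR="${TABLE_DIR}/snapshots/${TAG}"

# Hard-link snapshot of the single table.
nodetool snapshot --table my_table -t "${TAG}" -- my_keyspace

# Upload only new or changed files to the backup bucket.
aws s3 sync "${SNAPSHOT_DIR}" "s3://example-cassandra-backups/$(hostname)/${TAG}/my_keyspace/my_table/"
```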
Bucket lifecycle policies can automate retention management. Set a lifecycle rule on your backup prefix to expire objects after the retention period rather than relying on manual cleanup.
Retention Policy Design
A retention policy answers two questions: how long to keep backups, and how many generations to keep at each granularity.
A tiered approach works well for most clusters:
| Tier | Retention window | Notes |
|---|---|---|
| Daily snapshots | 7 to 14 days | Sufficient for recovering from accidental data loss discovered within a week |
| Weekly snapshots | 4 to 8 weeks | Covers discovery lag for issues that surface after several days in production |
| Monthly snapshots | 3 to 12 months | Required by many compliance frameworks (GDPR, PCI-DSS, HIPAA) |
| Incremental backup files | Same window as the most recent daily snapshot baseline | Incremental files older than the oldest retained snapshot are not useful; prune them together |
When using Cassandra's built-in TTL snapshot feature, pass `--ttl` to `nodetool snapshot` to schedule automatic cleanup:

```bash
nodetool snapshot --ttl 7d -t daily-$(date +%Y%m%d) -- <keyspace>
```
For object storage backends, use bucket lifecycle rules rather than TTL snapshots, as TTL expiry requires the Cassandra node to be running.
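As an illustration, a lifecycle rule like the one below expires objects under a daily-backup prefix after 14 days. The bucket name and `daily/` prefix are placeholders and should match the layout your upload tooling writes.

```bash
# Illustrative S3 lifecycle rule: expire objects under the daily/ prefix after 14 days.
aws s3api put-bucket-lifecycle-configuration \
  --bucket example-cassandra-backups \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-daily-cassandra-snapshots",
        "Filter": { "Prefix": "daily/" },
        "Status": "Enabled",
        "Expiration": { "Days": 14 }
      }
    ]
  }'
```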
Environment-Specific Recommendations
Bare Metal
On bare metal, Cassandra manages its own data directories entirely. Snapshot files live on the same physical disks as live data until you copy them off.
- Schedule `nodetool snapshot` via cron or a cluster orchestration tool on each node independently.
- Transfer snapshot directories to NFS or object storage immediately after creation using `rsync` or a cloud CLI tool (a cron-friendly sketch follows this list).
- Stagger snapshot schedules across nodes (for example, 15-minute offsets) to avoid simultaneous disk and network I/O spikes.
- Enable incremental backups in `cassandra.yaml` (`incremental_backups: true`) only if your RPO requires sub-daily recovery points and you have a process to prune the `backups/` directories regularly.
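The script below is a minimal sketch of the snapshot-and-transfer steps for a single bare-metal node, suitable for running from cron. The NFS mount point, data directory, and tag format are assumptions; adapt them to your environment.

```bash
#!/usr/bin/env bash
# Nightly backup sketch for one node: snapshot, copy off-node, clear the local snapshot.
set -euo pipefail

TAG="daily-$(date +%Y%m%d)"
DATA_DIR="/var/lib/cassandra/data"
DEST="/mnt/backup-nfs/$(hostname)/${TAG}"      # placeholder NFS staging path

# 1. Take a local snapshot of all keyspaces (hard links, near-instant).
nodetool snapshot -t "${TAG}"

# 2. Copy every table's snapshot directory for this tag off the node.
mkdir -p "${DEST}"
find "${DATA_DIR}" -type d -path "*/snapshots/${TAG}" | while read -r snap_dir; do
  rel="${snap_dir#${DATA_DIR}/}"               # keyspace/table-uuid/snapshots/TAG
  mkdir -p "${DEST}/${rel}"
  rsync -a "${snap_dir}/" "${DEST}/${rel}/"
done

# 3. Remove the local snapshot once the copy has succeeded.
nodetool clearsnapshot -t "${TAG}"
```

A crontab entry such as `30 2 * * * /usr/local/bin/cassandra-backup.sh` on one node and `45 2 * * *` on the next implements the staggering suggested above.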
Cloud Virtual Machines
In addition to Cassandra-native snapshots, cloud environments offer VM disk or volume snapshots at the hypervisor layer (AWS EBS snapshots, GCP Persistent Disk snapshots, Azure Managed Disk snapshots).
- Volume snapshots capture the full disk state including OS and Cassandra binaries; they are coarser but simpler to restore for a full node failure (a flush-then-snapshot sketch follows this list).
- Cassandra-native snapshots give you table-level granularity and are independent of the OS state; prefer them for data-level recovery.
- Combining both approaches is reasonable: volume snapshots for disaster recovery of the node, Cassandra snapshots uploaded to object storage for table-level point-in-time recovery.
- For cloud deployments, Medusa with S3 or GCS provides the most operationally mature solution.
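On AWS, for example, a volume snapshot of the data disk can be taken after flushing memtables so that recent writes are on disk. The volume ID below is a placeholder; look it up from the instance's block device mapping.

```bash
# Flush memtables to SSTables, then snapshot the EBS data volume.
nodetool flush

aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "cassandra-data $(hostname) $(date +%Y%m%dT%H%M)"
```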
Kubernetes (PersistentVolume Snapshots)
Cassandra on Kubernetes typically runs via an operator (`k8ssandra-operator` or `cass-operator`) with each pod backed by a PersistentVolumeClaim.
- Kubernetes `VolumeSnapshot` resources (CSI snapshots) can capture a `PersistentVolume` at a point in time, similar to EBS snapshots. These are coordinated at the storage layer, not at the Cassandra layer.
- CSI snapshots do not flush memtables before snapping the volume. Issue `nodetool flush` on each pod before triggering a CSI snapshot to ensure all in-memory data is persisted to SSTables (see the sketch after this list).
- For table-level restore granularity, use k8ssandra's Medusa integration, which runs Medusa as a sidecar on each Cassandra pod and coordinates backups across the cluster.
- Backup scheduling in Kubernetes environments is typically handled by `MedusaBackupSchedule` custom resources rather than cron jobs on nodes.
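The flush-then-snapshot pattern looks roughly like the sketch below for a cass-operator-managed cluster. The namespace, cluster label, container name, PVC name, and `VolumeSnapshotClass` are assumptions for illustration and will differ in your deployment.

```bash
# Flush memtables on every Cassandra pod, then snapshot one pod's PVC via CSI.
NS=cassandra
SNAP_CLASS=csi-snapclass                       # placeholder VolumeSnapshotClass

for pod in $(kubectl -n "${NS}" get pods -l "cassandra.datastax.com/cluster=my-cluster" -o name); do
  kubectl -n "${NS}" exec "${pod}" -c cassandra -- nodetool flush
done

# Snapshot one replica's PVC; repeat for each replica's claim.
kubectl -n "${NS}" apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cassandra-data-snap-$(date +%Y%m%d)
  namespace: ${NS}
spec:
  volumeSnapshotClassName: ${SNAP_CLASS}
  source:
    persistentVolumeClaimName: server-data-my-cluster-dc1-default-sts-0
EOF
```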
Backup Validation
Running backups without testing restores is not a backup strategy — it is a hope strategy. Schedule periodic restore tests to verify that backup files are complete, uncorrupted, and restorable within your RTO.
What to Test
- File integrity: Verify SSTable files using `sstablescrub` or check that hard-link targets still exist before uploading.
- Schema restore: Confirm that the `schema.cql` file stored in each snapshot directory can recreate the table structure with `cqlsh -f`.
- Data restore: Load snapshot files into a test cluster using `sstableloader` or `nodetool refresh` and query a sample of rows to confirm correctness (a restore sketch follows this list).
- Timing: Measure actual restore time against your RTO target. Data volumes grow over time; a restore that met RTO six months ago may not meet it today.
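A data-restore test against a throwaway cluster can be as simple as the commands below. The test host, backup path, keyspace, and table are placeholders; note that `sstableloader` expects the directory path to end in `<keyspace>/<table>`.

```bash
# Minimal restore-test sketch against a dedicated test cluster.
TEST_HOST=test-cassandra-1
SNAPSHOT_DIR=/backups/daily-20240101/my_keyspace/my_table

# Recreate the schema captured alongside the snapshot.
cqlsh "${TEST_HOST}" -f "${SNAPSHOT_DIR}/schema.cql"

# Stream the SSTables into the test cluster.
sstableloader -d "${TEST_HOST}" "${SNAPSHOT_DIR}"

# Spot-check the restored data.
cqlsh "${TEST_HOST}" -e "SELECT * FROM my_keyspace.my_table LIMIT 10;"
```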
Restore Test Schedule
| Frequency | Scope |
|---|---|
| Monthly | Restore a single table from the most recent daily snapshot into a dedicated test environment; verify row counts and a sample of data |
| Quarterly | Full keyspace restore from an off-site backup; measure end-to-end time from download to query-ready |
| After each major schema change | Verify that the backup taken before the schema change restores correctly to the pre-change schema |
Test restores must run against a separate cluster or a test namespace that is isolated from production. Never run restore commands such as `sstableloader` or `nodetool refresh` against production nodes as part of a test.
Monitoring Backup Success
A backup job that silently fails is worse than no backup job at all. Instrument your backup pipeline to alert on:
- Non-zero exit codes from `nodetool snapshot` or Medusa backup commands (a minimal post-backup check follows this list)
- Missing expected snapshot tags when querying `nodetool listsnapshots` or the `system_views.snapshots` virtual table
- Upload size significantly below the previous baseline (may indicate a partial backup or a skipped keyspace)
- Backup jobs that exceed the expected duration (may indicate node I/O saturation or a stalled upload)
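A minimal post-backup check might look like the sketch below: fail loudly if the snapshot command exits non-zero or the expected tag is missing. The keyspace, tag format, and the `alert` hook are placeholders to wire into your paging or chat integration.

```bash
#!/usr/bin/env bash
# Post-backup verification sketch: exit-code check plus snapshot-tag check.
set -euo pipefail

TAG="daily-$(date +%Y%m%d)"

alert() {
  # Replace with your paging or chat integration.
  echo "BACKUP ALERT: $1" >&2
  exit 1
}

# 1. The snapshot command itself must succeed.
nodetool snapshot -t "${TAG}" -- my_keyspace || alert "nodetool snapshot exited non-zero for ${TAG}"

# 2. The tag must actually appear on this node.
nodetool listsnapshots | grep -q "${TAG}" || alert "snapshot tag ${TAG} missing from listsnapshots"
```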
Related Pages
- Backups and Snapshots — step-by-step instructions for `nodetool snapshot`, incremental backup configuration, listing and clearing snapshots, and restoring from snapshots
- Repair — repair is complementary to backup; run repair regularly to prevent data inconsistency that could propagate into backups
- Metrics — disk usage metrics help detect snapshot accumulation before storage is exhausted
- Configuration Overview — covers `cassandra.yaml` settings including `auto_snapshot`, `snapshot_before_compaction`, and `incremental_backups`
- Cassandra Sidecar — Sidecar provides API-driven snapshot and restore workflows