Node Replacement Runbook
This runbook covers the two replacement paths available in Cassandra 6 and guides you through pre-replacement preparation, execution, and post-replacement validation. For TCM background, see TCM Day-2 Operations. For repair concepts, see Repair.
When to Replace vs. Repair
Not every troubled node requires replacement. Use this decision guide before starting any replacement procedure.
| Signal | Recommended action |
|---|---|
| Node is up but slow (GC pressure, high read latency) | Tune JVM heap; run repair. Do not replace. |
| Node has fallen behind on hints or streaming | Allow hints to drain; run incremental repair; monitor for 24h |
| Node reports repeated commit-log failures | Replace the disk, restart the node, run a full repair |
| Node has persistent hardware failure (failed disk, NIC, failed host) | Replace the node — use Option A or Option B below |
| Node is DOWN and has been unreachable longer than the hint window | Replace the node using Option B (dead-node replacement) |
| Node is DOWN and hints are still within the window | Attempt a node restart first; escalate to replacement only if the restart fails |
| Node is a CMS voter and is unresponsive | See TCM Troubleshooting before replacing |
In Cassandra 6, TCM tracks topology state in the distributed metadata log. A node that appears DOWN in `nodetool status` remains part of the cluster topology until it is explicitly decommissioned or replaced, so a dead node must be removed through one of the paths below — it does not age out on its own.
Pre-Replacement Checklist
Complete every item before starting either replacement path.
- **Confirm cluster health.** Run `nodetool status`. All nodes except the target must be UN (Up/Normal). Do not start a replacement if another node is already UL, DN, or DL.
- **Check replication factor.** Every keyspace must have RF ≥ 2 so remaining nodes hold at least one copy of all data the replacement node will need to stream. Verify with `nodetool describering <keyspace>`.
- **Capture the address and host ID of the node being replaced (Option B only).** Run `nodetool status | grep <node-ip>` and record the host ID shown in the output. You will pass the dead node's address to `cassandra.replace_address_first_boot`; keep the host ID for your records.
- **Check pending TCM operations.** Run `nodetool cms describe`. There must be no in-progress topology operations (bootstrap, decommission, move, or replace) from a prior attempt. If a stale operation is listed, see TCM Troubleshooting to abort it before proceeding.
- **Snapshot keyspaces on a healthy node (optional but recommended).** `nodetool snapshot -t pre-replace-$(date +%Y%m%d)`
- **Verify available disk space on the replacement node.** The replacement node must have free space at least equal to the data directory size of the node being replaced.
- **Review maintenance windows.** Replacement operations generate significant streaming traffic. Schedule during low-traffic periods and notify on-call stakeholders.
Option A: Decommission + Add (Graceful Replacement)
Use this path when the node being replaced is alive and healthy enough to stream its own data out. It is the lower-risk path: the leaving node participates in data transfer and TCM can safely checkpoint each step.
Step 1 — Decommission the existing node
Run from the node being replaced:
nodetool decommission
Under TCM, this proceeds through three committed epochs:
START_LEAVE → MID_LEAVE → FINISH_LEAVE.
Monitor progress in the system log:
tail -f /var/log/cassandra/system.log | grep -i "leave\|decommission"
If decommission stalls, resume it:
nodetool decommission --force
Or abort and revert cleanly:
# Run on any healthy node
nodetool abortdecommission <host-id-of-leaving-node>
Step 2 — Confirm the node has left
nodetool status
The decommissioned node must no longer appear in the ring before you add the replacement.
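You can also confirm from the system log that all three LEAVE epochs committed before adding the replacement. A minimal sketch, with an illustrative log-line format standing in for the real one (the heredoc substitutes for `/var/log/cassandra/system.log`):

```shell
# Count how many of the three decommission phases appear in the log.
log=$(cat <<'EOF'
INFO  2025-06-01 10:02:11 Committed epoch 41: START_LEAVE /10.0.0.13
INFO  2025-06-01 10:09:47 Committed epoch 42: MID_LEAVE /10.0.0.13
INFO  2025-06-01 10:21:03 Committed epoch 43: FINISH_LEAVE /10.0.0.13
EOF
)

phases=0
for p in START_LEAVE MID_LEAVE FINISH_LEAVE; do
    echo "$log" | grep -q "$p" && phases=$((phases + 1))
done
echo "decommission phases committed: $phases/3"
```

Anything less than 3/3 means the decommission did not complete cleanly; check `nodetool cms describe` for a stalled operation before proceeding.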
Step 3 — Provision and start the replacement node
Install Cassandra on the new host with the same version as the cluster. Start it normally — it will bootstrap as a new node:
# On the new node
cassandra -f
TCM tracks the bootstrap through START_JOIN → MID_JOIN → FINISH_JOIN.
If bootstrap is interrupted, resume it:
nodetool bootstrap resume
Step 4 — Validate
Proceed to [post-replacement-validation].
Option B: Replace a Dead Node
Use this path when the node being replaced is unreachable or dead and cannot participate in streaming. The replacement node assumes the dead node’s tokens and rebuilds data from remaining replicas.
> **Warning:** Do not use this path while a rolling upgrade is in progress. Replace dead nodes before starting the upgrade, or after all three upgrade phases complete. See Upgrade Procedure.
Step 1 — Record the address and host ID of the dead node
nodetool status
Note the IP address and host ID of the DN (Down/Normal) node. These are required in the next step.
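The values recorded above can be extracted from captured status output with a short helper. A sketch, assuming the usual column layout (State, Address, Load, Tokens, Owns, Host ID, Rack — verify against your version) and a hypothetical DN line:

```shell
# Pull the IP and host ID of the DN node out of captured `nodetool status` output.
status_line=$(cat <<'EOF'
DN  10.0.0.13  505.1 GiB  16  100.0%  c3d4e5f6-7a8b-9c0d-1e2f-3a4b5c6d7e8f  rack2
EOF
)

dead_ip=$(echo "$status_line" | awk '$1 == "DN" {print $2}')
dead_host_id=$(echo "$status_line" | awk '$1 == "DN" {print $7}')

echo "replace flag: -Dcassandra.replace_address_first_boot=$dead_ip"
echo "host id (for your records): $dead_host_id"
```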
Step 2 — Start the replacement node with the replace flag
Set the JVM property on the replacement node (in `cassandra-env.sh`, or directly on the command line at startup):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead-node-ip>" \
cassandra -f
On first boot only, the node registers itself as the replacement in TCM
(START_REPLACE), streams data from remaining replicas (MID_REPLACE),
and assumes the dead node’s tokens (FINISH_REPLACE).
The property is automatically ignored on subsequent restarts.
> **Note:** Under TCM, the in-progress replacement is recorded as a committed sequence in the metadata log, so it survives restarts and remains visible in `nodetool cms describe` until FINISH_REPLACE commits.
Step 3 — Monitor the replacement
tail -f /var/log/cassandra/system.log | grep -i "replace\|stream\|bootstrap"
If streaming is interrupted, the operation is resumable:
nodetool bootstrap resume
If the replacement cannot proceed (for example, RF = 1 and there are no other replicas for some ranges), streaming will fail with an explicit error. In that case you must restore from backup rather than replace.
Step 4 — Validate
Proceed to [post-replacement-validation].
TCM-Aware Replacement: What Is Different in Cassandra 6
Cassandra 6 replaces gossip-based topology coordination with the Transactional Cluster Metadata (TCM) system. The differences that affect replacement procedures are:
- **Epoch tracking.** Every topology step produces a monotonically increasing epoch committed to the distributed metadata log. Each epoch must be acknowledged by a quorum in each datacenter before the next step proceeds. This eliminates the ring-settle delays that gossip required.
- **Resumable operations.** If a replacement is interrupted — host reboot, network partition, operator error — the TCM state machine records exactly where the operation stopped. `nodetool bootstrap resume` restarts from that checkpoint rather than from the beginning.
- **Stale operation detection.** A replacement that was aborted without a clean FINISH_REPLACE leaves an in-progress record in the metadata log; `nodetool cms describe` will show it. Attempting a second replacement without first clearing the stale record will fail with a conflict error. Use `nodetool cms abortOperation <sequence-id>` to clear it.
- **Movement map precision.** Under gossip, streaming sources were chosen from gossip state, which could be stale. Under TCM, the replacement node derives its `MovementMap` from the committed placement deltas. Streaming targets are authoritative and do not vary by observer.
- **Operator visibility.** Each committed epoch emits a structured log entry. You can correlate log timestamps with epoch numbers to reconstruct the exact sequence of events during and after a replacement.
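The epoch-to-timestamp correlation can be done with a one-line extraction. A sketch, using an illustrative log-line format (the real structured-log format may differ) and a hypothetical replacement at /10.0.0.14:

```shell
# Build a timeline of committed epochs: "<epoch> <time> <transition>" per line.
log=$(cat <<'EOF'
INFO  2025-06-01 11:00:02 Committed epoch 57: START_REPLACE /10.0.0.14
INFO  2025-06-01 11:42:19 Committed epoch 58: MID_REPLACE /10.0.0.14
INFO  2025-06-01 12:05:44 Committed epoch 59: FINISH_REPLACE /10.0.0.14
EOF
)

timeline=$(echo "$log" | sed -n \
    's/.* \([0-9:]\{8\}\) Committed epoch \([0-9]\{1,\}\): \([A-Z_]\{1,\}\).*/\2 \1 \3/p')
echo "$timeline"
```

The gap between consecutive epochs tells you how long each phase took — here, streaming (MID_REPLACE to FINISH_REPLACE) ran for roughly 23 minutes.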
Post-Replacement Validation
Complete all of the following steps after either Option A or Option B.
Immediate checks (within 5 minutes)
- **Confirm the new node is UN.** Run `nodetool status`. The replacement must appear as UN, and no other node should be in an abnormal state. A successful replacement also produces log lines that mention START_REPLACE, MID_REPLACE, and FINISH_REPLACE; if those markers are missing, the replacement did not complete cleanly.
- **Confirm no pending TCM operations.** Run `nodetool cms describe`. There must be no in-progress sequences.
- **Verify ring coverage.** Run `nodetool ring`. All token ranges must be assigned to live nodes.
Post-streaming repair (within 1 hour)
Run a full repair on the replacement node to reconcile any data written during the streaming window:
nodetool repair --full
For large datasets, scope the repair to one keyspace at a time to reduce pressure:
nodetool repair --full <keyspace>
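The per-keyspace repair can be wrapped in a loop that stops on the first failure. A sketch: the keyspace names are hypothetical, and on a live cluster you might derive the list from `cqlsh -e 'DESCRIBE KEYSPACES'`, excluding system keyspaces.

```shell
# Repair one keyspace at a time, aborting on the first failure.
keyspaces="ks_orders ks_users ks_events"   # hypothetical names

repaired=""
for ks in $keyspaces; do
    echo "repairing $ks ..."
    # Uncomment on a live node:
    # nodetool repair --full "$ks" || { echo "repair failed: $ks" >&2; exit 1; }
    repaired="$repaired $ks"
done
echo "repaired:$repaired"
```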
24-hour monitoring checklist
Monitor the replacement node and the cluster for at least 24 hours using these checks:
# Check for dropped messages (indicates coordinator overload)
nodetool tpstats | grep -i drop
# Check pending compactions
nodetool compactionstats
# Check hint delivery to the replacement node
nodetool tpstats | grep -i hint
# Check read/write latency
nodetool proxyhistograms
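The dropped-message check can be turned into a pass/fail alert for cron or a monitoring agent. A sketch only: the heredoc stands in for `nodetool tpstats` output, and the two-column "Message type / Dropped" layout is illustrative — verify the column positions against your version's format.

```shell
# Sum the Dropped column and alert if it is nonzero.
tpstats=$(cat <<'EOF'
Message type     Dropped
READ_REQ         0
WRITE_REQ        0
HINT_MSG         3
EOF
)

dropped_total=$(echo "$tpstats" | awk 'NR > 1 {sum += $2} END {print sum}')
if [ "$dropped_total" -gt 0 ]; then
    echo "ALERT: $dropped_total dropped messages, investigate immediately"
else
    echo "no dropped messages"
fi
```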
Alert thresholds to watch:
- Dropped messages on any node: investigate immediately
- Pending compaction tasks growing unbounded: review compaction strategy
- P99 read or write latency increase > 2x baseline: check repair completion and GC logs
- Any node transitioning to DN: do not start another replacement until resolved
Investigate immediately if the replacement node shows dropped messages or if pending compactions stay elevated for more than 15 minutes after streaming completes.
Related Pages
- TCM Day-2 Operations and Performance — bootstrap, decommission, replace, and move under TCM
- TCM Troubleshooting — how to diagnose and recover from stalled topology operations
- Upgrade Procedure — constraints on node replacement during rolling upgrades
- Repair — full and incremental repair concepts
- Automated Repair — scheduling repair automatically
- Production Readiness — hardware sizing and RF recommendations