TCM Troubleshooting and Testing
This page is a collection of runbooks for specific TCM failure scenarios, followed by a test plan for validating TCM behavior before production deployment. For concepts and architecture, see TCM Overview. For operational behavior, see TCM Operations.
Failure Playbooks
Playbooks are ordered by severity, from routine self-resolving situations to emergency recovery procedures.
Playbook: Single CMS Node Lost
Severity: Low. No operator action required.
What You See
A CMS member goes down.
The unreachableCMSMembers metric ticks up by one.
Remaining CMS members continue operating.
Why It Is Not a Problem
The CMS is a Paxos group. As long as a majority of CMS members are alive, the group can reach consensus.
The quorum calculation in PaxosBackedProcessor:
blockFor = (replicas.size() / 2) + 1
For a 3-member CMS, blockFor = 2.
For a 5-member CMS, blockFor = 3.
All metadata operations continue without interruption.
Source: src/java/org/apache/cassandra/tcm/
What to Do
- Diagnose the failed node (hardware, process logs, network). This is standard Cassandra troubleshooting; nothing here is specific to TCM.
- Restore the node. When the node comes back, PeerLogFetcher automatically fetches and applies missed log entries.
- If the node cannot be restored, replace it as you would any other Cassandra node (nodetool removenode followed by bootstrapping a replacement). The CMS will automatically reconfigure if the dead node was a CMS member and a replacement is eligible.
Playbook: CMS Quorum Lost
Severity: High. Metadata changes are blocked until quorum is restored.
What You See
More than half of the CMS members are down. The cluster continues serving reads and writes for existing data. But any operation that requires a metadata commit is blocked:
- New bootstraps cannot start
- Decommissions cannot proceed
- Schema changes (CREATE TABLE, ALTER TABLE) are rejected
- CMS reconfiguration cannot complete
Non-CMS nodes attempting commits receive failure responses and retry with exponential backoff.
Why It Happens
Paxos requires a majority to agree on each commit. Without a majority, no new log entries can be committed. This prevents split-brain scenarios where two partitions of the CMS could commit conflicting metadata.
What to Do
Step 1: Restore CMS nodes. Every CMS node you bring back moves you closer to quorum. Once a majority is up, the system recovers automatically.
Step 2: If nodes cannot be restored quickly, assess:
- Can you restart the CMS processes? (Process crash, OOM)
- Can you restore network connectivity? (Partition, firewall)
- Is the hardware permanently lost? (Disk failure, host termination)
Step 3: If quorum cannot be restored through normal means:
- Option A: Wait. If the outage is temporary, waiting is the safest choice. No metadata changes can occur, but no data is at risk.
- Option B: Emergency recovery. If the CMS nodes are permanently lost, proceed to the Total CMS Loss playbook.
Decision criteria:
- Choose wait when at least one CMS node is expected back soon and no topology or schema change is urgent.
- Choose restore or replace when the missing CMS node is individually recoverable but quorum is currently lost.
- Choose emergency recovery only when the CMS set is unrecoverable as deployed and no normal restart path exists.
Prevention
Size your CMS appropriately. A 3-member CMS is one failure away from quorum loss. For production clusters, a 5-member CMS provides a more comfortable buffer. See TCM Operations for sizing guidance.
Playbook: Total CMS Loss
Severity: Critical. Emergency recovery required.
What You See
All CMS members are down. No metadata commits are possible. The cluster can still serve reads and writes, but no topology or schema changes can occur.
This scenario is extraordinarily rare. If your CMS members are distributed across racks and datacenters, simultaneous loss of all of them implies a much larger infrastructure incident.
The Escape Hatch
TCM provides emergency recovery mechanisms that bypass Paxos. These are break-glass operations.
All escape hatch operations require cassandra.unsafe_tcm_mode=true. They bypass Paxos, so use them only when quorum cannot be restored by any other means.
Prerequisites for all escape hatch operations:
- Set cassandra.unsafe_tcm_mode=true in cassandra.yaml
- Restart at least one node with this flag enabled
- Execute the recovery operation via JMX (org.apache.cassandra.tcm:type=CMSOperations)
Source: src/java/org/apache/cassandra/tcm/CMSOperationsMBean.java
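Any standard JMX client can reach this MBean. The sketch below uses the third-party jmxterm utility as one example; the jar name, the default Cassandra JMX port 7199, the placeholder epoch, and the exact jmxterm flags are assumptions to verify against the version you download.
# Example only: connect with jmxterm and run a CMSOperations method
$ java -jar jmxterm-uber.jar -l localhost:7199 -n <<'EOF'
bean org.apache.cassandra.tcm:type=CMSOperations
run unsafeRevertClusterMetadata 1234
EOF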
Recovery Method 1: Revert to a Previous Epoch
# Via JMX
# MBean: org.apache.cassandra.tcm:type=CMSOperations
# Method: unsafeRevertClusterMetadata(long epoch)
What this does:
- Retrieves the metadata snapshot at the specified epoch from system_cluster_metadata
- Creates a new snapshot at epoch N+1 with that historical state
- Commits a ForceSnapshot transformation that bypasses Paxos
Use when: you know the last good epoch and want to roll back to it. Any metadata changes committed after that epoch are lost.
Recovery Method 2: Load Metadata from a Dump File
# First, dump current metadata (from any surviving node with local state):
# JMX Method: dumpClusterMetadata(long epoch, long transformToEpoch, String version)
# Then load the dump on the recovery node:
# JMX Method: unsafeLoadClusterMetadata(String filePath)
What this does:
- Deserializes the ClusterMetadata object from the dump file
- Forces the epoch to current+1
- Commits a ForceSnapshot transformation with the loaded state
Use when: you have a metadata backup or can extract one from any surviving node. This is the most common recovery path for total CMS loss.
Recovery Method 3: Boot with a Metadata File
# Add JVM property before starting the node:
-Dcassandra.unsafe_boot_with_clustermetadata=/path/to/metadata/dump
This causes the node to boot in RESET state — it ignores the CMS entirely
and uses the provided metadata file.
The property is consumed on startup and does not persist.
Use when: no CMS nodes can start at all and you need to bring at least one node up with a known metadata state.
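How the property is passed depends on how you launch Cassandra. A minimal sketch, assuming the tarball scripts and the standard JVM_EXTRA_OPTS hook honored by cassandra-env.sh; the dump path is a placeholder:
# Start the node with the boot-with-metadata property set (foreground mode shown)
$ JVM_EXTRA_OPTS="-Dcassandra.unsafe_boot_with_clustermetadata=/backups/tcm/metadata.dump" \
    bin/cassandra -f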
Recovery Procedure: Step by Step
- Stop all remaining CMS nodes (if any are in a bad state).
- Choose one node to be the recovery node. Prefer a node that was recently a CMS member.
- Enable unsafe mode on that node: set cassandra.unsafe_tcm_mode=true in cassandra.yaml.
- Start the recovery node.
- Execute the appropriate recovery method (revert, load, or boot-with-file).
- Verify the metadata state with nodetool cms describe. Confirm epoch, CMS members, and directory look correct.
- Disable unsafe mode: set cassandra.unsafe_tcm_mode=false.
- Restart the recovery node in normal mode.
- Start the remaining CMS nodes. They will fetch the recovery node's metadata and sync.
- Verify cluster-wide convergence: all nodes report the same epoch.
Dump Early, Dump Often
Consider making periodic metadata dumps part of your backup routine:
# Via JMX
# MBean: org.apache.cassandra.tcm:type=CMSOperations
# Method: dumpClusterMetadata(currentEpoch, currentEpoch, "V8")
Store dumps alongside your regular Cassandra backups. In a total CMS loss scenario, the dump is your fastest path back to a functioning cluster.
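One way to automate this is a nightly cron entry that calls a small wrapper script; the script name is hypothetical, and it would trigger dumpClusterMetadata over JMX (for example with the jmxterm pattern shown earlier) and copy the resulting file into your backup location.
# Hypothetical wrapper script invoked nightly at 02:00
0 2 * * * /opt/cassandra/scripts/tcm-metadata-dump.sh >> /var/log/cassandra/tcm-dump.log 2>&1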
Playbook: Node Restarted Mid-Bootstrap
Severity: Low. Resumable.
What You See
A bootstrapping node crashes or is restarted during the MID_JOIN streaming phase.
When it comes back up, its state in the directory is either BOOTSTRAPPING or REGISTERED.
The InProgressSequences map in cluster metadata still contains the node's bootstrap entry.
What to Do
Resume the bootstrap:
$ nodetool bootstrap resume
The BootstrapAndJoin sequence knows which step it was on and resumes from there.
If resume fails or you want to start fresh:
$ nodetool bootstrap abort <node-id> <endpoint>
This cancels the in-progress sequence, releases range locks, and cleans up the metadata.
The node returns to REGISTERED state, and you can start a fresh bootstrap.
Do not manually clear tokens, delete system tables, or restart the node hoping it will "forget" the partial bootstrap. The metadata log has a record of the in-progress sequence. The proper path is always resume or abort.
Source: src/java/org/apache/cassandra/tcm/sequences/BootstrapAndJoin.java
Playbook: Stuck Topology Operation
Severity: Medium. Requires operator decision.
What You See
A topology operation (bootstrap, decommission, move, replace) started but did not complete.
The operation is visible in nodetool cms describe as an in-progress sequence.
No further progress is being made.
Range locks held by this operation are preventing other topology changes.
Common causes:
- The node performing the operation crashed and has not been restarted
- Streaming failed due to a source node going down
- A network partition isolated the node performing the operation
- A disk filled up on the streaming target
Diagnosing the Stuck Operation
$ nodetool cms describe
Look for entries in the directory with states BOOTSTRAPPING, LEAVING, or MOVING.
Cross-reference with the multi_step_operation field to see which step the operation is on.
Check the node’s logs:
ERROR - Error while decommissioning node: ...
ERROR - Streaming error during bootstrap: ...
Resolution Options
Resume the operation (if the underlying issue is resolved):
$ nodetool bootstrap resume # For stuck bootstrap
$ nodetool decommission # For stuck decommission (re-invocation resumes)
$ nodetool move <token> # For stuck move (re-invocation resumes)
Abort the operation:
$ nodetool bootstrap abort <node-id> <endpoint>
$ nodetool cancel_decommission <node-id>
$ nodetool stop_moving
Generic cancellation (for any operation type):
$ nodetool cms cancel_in_progress_sequences <node-id> <operation-type>
# operation-type: JOIN, REPLACE, LEAVE, REMOVE, MOVE, RECONFIGURE_CMS
What Cancellation Does
When you cancel an in-progress sequence:
- Type-specific cleanup is executed
- The node is removed from the InProgressSequences map
- Associated range locks are released
- The cancellation is committed as a new epoch
For a cancelled bootstrap: node returns to REGISTERED.
For a cancelled decommission: node returns to JOINED.
For a cancelled move: node keeps its original tokens.
Source: src/java/org/apache/cassandra/tcm/transformations/cms/
Playbook: Epoch Divergence
Severity: Low. Usually self-resolving.
What You See
Different nodes report different epochs when you run nodetool cms describe.
Or you see:
WARN - Could not perform consistent fetch, downgrading to fetching from CMS peers.
Why This Is Usually Not a Problem
Epoch divergence is a transient state. It happens naturally when:
- A metadata change was just committed and some nodes have not yet received the update
- A node was down briefly and is catching up
- A network blip delayed log replication to some nodes
Non-CMS nodes learn about new epochs through the PeerLogFetcher background process.
CMS nodes are directly involved in committing each epoch and should always be at or near latest.
When to Investigate
- A node is persistently behind. If a node stays at the same epoch for minutes while others advance, it may be unable to reach CMS peers. Check network connectivity.
- The gap is large and growing. A node at epoch 100 while the cluster is at epoch 150 suggests prolonged network isolation or a processing failure. Check logs for fetchCMSLogConsistencyDowngrade metrics or repeated ReadTimeoutException.
- Metadata operations fail on the lagging node. The barrier will log degradation: INFO - Could not collect epoch acknowledgements within Xms for EACH_QUORUM. Falling back to QUORUM.
What to Do
Usually: nothing. Catch-up mechanisms are automatic. Give the node time to fetch and apply log entries.
If the node is network-isolated: restore connectivity. Once reachable, it will catch up.
If the node appears stuck: check that PeerLogFetcher is running and CMS peers are responding.
The fetchCMSLogLatency metric indicates how long log fetches are taking.
If the log has been compacted and the node needs entries that are gone: the node will recover through a metadata snapshot. It fetches the nearest snapshot and applies only the log entries after that point.
Playbook: CMS Reconfiguration Failure
Severity: Medium. Requires explicit resume or cancel.
What You See
A CMS reconfiguration started but did not complete.
nodetool cms reconfigure --status shows which step it stopped at.
What to Do
Check status:
$ nodetool cms reconfigure --status
Resume:
$ nodetool cms reconfigure --resume
Picks up from the last completed step. If the failure was transient, resuming will often succeed.
Cancel:
$ nodetool cms reconfigure --cancel
Aborts the reconfiguration, releases any locks, and reverts partially completed membership changes. You can then retry with different parameters.
Playbook: Network Partition During Metadata Change
Severity: Variable. Depends on which nodes are partitioned.
Scenario 1: CMS Majority on One Side
The side with the CMS majority continues operating normally. The side without CMS access cannot commit metadata changes. Non-CMS nodes on the isolated side will try known CMS members, then seed nodes, then the discovery protocol.
Resolve: fix the network partition. When connectivity is restored, isolated nodes catch up automatically.
Scenario 2: CMS Split Across the Partition
If no side has a quorum of CMS members, all metadata commits are blocked cluster-wide. This is equivalent to the CMS Quorum Lost playbook.
Resolve: fix the partition. If the partition is long-lived and you need to make metadata changes, you may need to pause commits on one side and use unsafe operations — but this is an extreme measure.
Automatic Recovery After Partition Heals
- The failure detector marks previously unreachable nodes as alive
- CMS nodes exchange Paxos ballots and reach consensus on the latest state
- Non-CMS nodes fetch and apply missed log entries
- Progress barriers that were waiting can now complete
No operator action is required for automatic recovery.
Playbook: Emergency Commit Pause
Severity: Operator-initiated. Use to freeze metadata changes for investigation.
When to Use
You suspect metadata corruption, an unexpected transformation was committed, or you need to investigate the log before allowing further changes.
How to Pause
$ nodetool cms set_commits_paused true
# Verify the pause is active
$ nodetool cms describe
# Look for: Commits Paused: true
While commits are paused:
- No topology or schema changes can proceed
- In-progress sequences are frozen at their current step
- The cluster continues serving reads and writes
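Once the investigation is complete, resuming commits should simply be the same flag set back to false. This is a sketch inferred from the pause command above, so confirm against nodetool cms help on your version:
# Assumed inverse of the pause command shown above
$ nodetool cms set_commits_paused false
# Confirm commits are flowing again
$ nodetool cms describe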
Break-Glass Reference
All unsafe and force operations for quick reference during incidents:
| Operation | Command | Precondition | Risk Level |
|---|---|---|---|
| Cancel stuck sequence | nodetool cms cancel_in_progress_sequences | Sequence exists | Low |
| Pause commits | nodetool cms set_commits_paused true | None | Low |
| Resume reconfiguration | nodetool cms reconfigure --resume | Reconfig was interrupted | Low |
| Revert to epoch | JMX: unsafeRevertClusterMetadata | unsafe_tcm_mode=true | High |
| Load metadata from file | JMX: unsafeLoadClusterMetadata | unsafe_tcm_mode=true | High |
| Boot with metadata file | JVM: -Dcassandra.unsafe_boot_with_clustermetadata | All CMS down | High |
Source: src/java/org/apache/cassandra/tcm/CMSOperationsMBean.java
Testing TCM in Lower Environments
Do not enable TCM in production without testing it first. CMS initialization has a point of no return. The time to discover that your monitoring misses a metric or your automation assumes gossip-based schema propagation is before the production rollout.
Test Environment Options
Option 1: CCM (Cassandra Cluster Manager)
The quickest way to spin up a multi-node cluster on a single machine.
# Create a 3-node cluster on Cassandra 6.0
ccm create tcm-test -v 6.0 -n 3
# Start the cluster
ccm start
# Verify all nodes are up
ccm status
# Run the full upgrade sequence
ccm node1 nodetool cms initialize
ccm node1 nodetool cms reconfigure 3
ccm node1 nodetool cms describe
Ideal for: validating the upgrade sequence, testing nodetool cms commands,
verifying schema propagation, practicing failure playbooks.
Not ideal for: network partition testing (all nodes share the same loopback interface), realistic latency testing, multi-datacenter scenarios.
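For example, the single-node-loss playbook can be rehearsed entirely in CCM. A sketch against the tcm-test cluster created above; the exact fields printed by nodetool cms describe may differ by version:
# Stop one node, confirm the survivors keep working, then watch catch-up
ccm node2 stop
ccm node1 nodetool cms describe   # remaining members still answer; note the epoch
ccm node2 start
ccm node2 nodetool cms describe   # after a short wait, node2 reports the same epoch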
Option 2: Docker Compose
Better isolation than CCM with network-level controls.
services:
  cassandra-seed:
    image: cassandra:6.0
    environment:
      CASSANDRA_CLUSTER_NAME: tcm-test
      CASSANDRA_DC: dc1
    networks:
      - cassandra-net
  cassandra-2:
    image: cassandra:6.0
    environment:
      CASSANDRA_CLUSTER_NAME: tcm-test
      CASSANDRA_SEEDS: cassandra-seed
      CASSANDRA_DC: dc1
    networks:
      - cassandra-net
    depends_on:
      - cassandra-seed

networks:
  cassandra-net:
    driver: bridge
Use docker network disconnect to simulate node isolation,
tc (traffic control) to inject latency,
and separate Docker networks to model datacenter boundaries.
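A sketch of an isolation test built on the compose file above. Compose prefixes container and network names with the project name, so the names below are assumptions; check docker ps and docker network ls and substitute what you see. The keyspace is only an example metadata commit.
# Isolate cassandra-2, commit a schema change, heal, then verify convergence
docker network disconnect tcm-test_cassandra-net tcm-test-cassandra-2-1
docker exec tcm-test-cassandra-seed-1 cqlsh -e "CREATE KEYSPACE t1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
docker network connect tcm-test_cassandra-net tcm-test-cassandra-2-1
docker exec tcm-test-cassandra-2-1 nodetool cms describe   # should converge to the latest epoch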
Option 3: Cassandra In-JVM Distributed Test Framework
The framework Cassandra’s developers use to test TCM itself. It provides fine-grained message filtering to simulate network partitions at the Cassandra protocol level, and ByteBuddy-based bytecode injection to trigger failures at exact code points.
Reference: ClusterMetadataUpgradeTest in the Cassandra test suite
Scenario 1: Smoke Test
Goal: Verify that the basic TCM lifecycle works end to end.
- Start a 3-node cluster on Cassandra 6.0
- Initialize CMS: nodetool cms initialize
- Verify CMS is active: nodetool cms describe
- Reconfigure CMS to 3 members: nodetool cms reconfigure 3
- Create a keyspace and table
- Verify the schema propagated to all nodes
- Verify all nodes report the same epoch
Expected time: 5–10 minutes.
This is the minimum viable test. If this does not pass, nothing else will.
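The checklist above maps onto a short command sequence. A sketch, assuming the CCM cluster from Option 1 (nodes bound to 127.0.0.1–127.0.0.3); keyspace and table names are placeholders:
# Initialize and size the CMS
ccm node1 nodetool cms initialize
ccm node1 nodetool cms describe
ccm node1 nodetool cms reconfigure 3
# Create a keyspace and table, then confirm propagation from another node
cqlsh 127.0.0.1 -e "CREATE KEYSPACE smoke WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
cqlsh 127.0.0.1 -e "CREATE TABLE smoke.t (pk int PRIMARY KEY);"
cqlsh 127.0.0.2 -e "DESCRIBE TABLE smoke.t;"
# Epoch agreement across all nodes
ccm node1 nodetool cms describe
ccm node2 nodetool cms describe
ccm node3 nodetool cms describe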
Scenario 2: Upgrade Path
Goal: Verify that the three-phase upgrade from a pre-TCM version works correctly.
- Start a 3-node cluster on Cassandra 5.0
- Create keyspaces, tables, and insert data
- Perform a rolling upgrade to 6.0 (Phase 1)
- Verify all nodes are in GOSSIP mode
- Initialize CMS (Phase 2)
- Verify all nodes have migrated (no node shows GOSSIP service state)
- Reconfigure CMS (Phase 3)
- Verify data is still readable
- Create a new table and verify schema propagation
- Bootstrap a new node and verify it joins correctly
In-JVM framework pattern:
new TestCase()
    .nodes(3)
    .nodesToUpgrade(1, 2, 3)
    .withConfig(cfg -> cfg.with(Feature.NETWORK, Feature.GOSSIP)
                          .set(Constants.KEY_DTEST_FULL_STARTUP, true))
    .upgradesToCurrentFrom(v50)
    .setup(cluster -> {
        // Create the keyspace first so the table schema change can apply
        cluster.schemaChange("CREATE KEYSPACE ks WITH replication = " +
                             "{'class': 'SimpleStrategy', 'replication_factor': 3}");
        cluster.schemaChange("CREATE TABLE ks.tbl (pk int PRIMARY KEY)");
    })
    .runAfterClusterUpgrade(cluster -> {
        cluster.get(1).nodetoolResult("cms", "initialize")
               .asserts().success();
        cluster.forEach(i ->
            assertFalse(ClusterUtils.isMigrating(i)));
        cluster.get(2).nodetoolResult("cms", "reconfigure", "3")
               .asserts().success();
    })
    .run();
Reference: ClusterMetadataUpgradeTest
Scenario 3: Topology Operations
Goal: Verify that bootstrap, decommission, and node replacement work under TCM.
Bootstrap test:
- Start a 3-node cluster with CMS initialized
- Bootstrap a 4th node
- Verify the 4th node appears in nodetool status as UN
- Verify all nodes report the same epoch
Decommission test:
- From the 4-node cluster, decommission node 4
- Verify node 4 disappears from the ring
- Verify data was streamed to remaining nodes
Node replacement test:
- Stop node 3 abruptly (simulate crash)
- Bootstrap a replacement node with the same tokens
- Verify the replacement joins the ring and the replaced node is removed from the directory
Assertions:
- The three-step model (START → MID → FINISH) completes for each operation type
- Range locking prevents conflicting operations
- Progress barriers ensure epoch propagation before streaming begins
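The bootstrap and decommission tests above can also be driven from CCM. A sketch; the ccm add flags (-i for the node's IP, -j for its JMX port, -b for auto-bootstrap) are from memory, so confirm with ccm add --help:
# Bootstrap a 4th node, then decommission it
ccm add node4 -i 127.0.0.4 -j 7400 -b
ccm node4 start
ccm node1 nodetool status          # node4 should appear as UN
ccm node4 nodetool decommission
ccm node1 nodetool status          # node4 gone; all nodes should agree on the epoch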
Scenario 4: Network Partition Simulation
Goal: Verify TCM handles partitions gracefully.
Partition tests to run:
- Isolate one non-CMS node: schema change should still commit; isolated node catches up on reconnect
- Isolate one CMS node: 2 of 3 still form quorum; schema change commits; isolated CMS member catches up
- Partition 2 of 3 CMS members: schema change should fail; cluster continues serving existing data; reconnect and verify the change can now be committed
In-JVM framework pattern:
IMessageFilters.Filter partition1 = cluster.filters()
.allVerbs().from(1, 2).to(3, 4, 5).drop();
IMessageFilters.Filter partition2 = cluster.filters()
.allVerbs().from(3, 4, 5).to(1, 2).drop();
// Test operations during partition...
// Heal partition
partition1.off();
partition2.off();
// Verify recovery
ClusterUtils.waitForCMSToQuiesce(cluster, cluster.get(1));
Reference: SplitBrainTest in the Cassandra test suite
Scenario 5: Concurrent Operation Safety
Goal: Verify that range locking prevents conflicting concurrent topology changes.
Non-overlapping operations (should succeed):
- Start a 6-node cluster with well-separated token ranges
- Simultaneously bootstrap two new nodes with non-overlapping token ranges
- Verify both bootstraps complete successfully
Overlapping operations (should be rejected):
- Start a 4-node cluster
- Begin bootstrapping a new node
- While the bootstrap is in MID_JOIN, attempt to decommission a node whose ranges overlap with the bootstrapping node
- Verify the decommission is rejected with a range locking error
- Wait for the bootstrap to complete; retry the decommission, which should now succeed
Scenario 6: Failure Recovery
Goal: Validate the failure playbooks in a controlled environment.
Stuck bootstrap recovery:
- Bootstrap a new node
- Kill the node mid-streaming (after START_JOIN, during MID_JOIN)
- Verify the bootstrap appears as in-progress in nodetool cms describe
- Restart the node and run nodetool bootstrap resume
- Verify the bootstrap completes
- Alternatively: run nodetool bootstrap abort and verify the node returns to REGISTERED
CMS member loss and recovery:
- Start a 5-node cluster with 5 CMS members
- Stop 2 CMS members; verify metadata commits still work (3 of 5 is quorum)
- Stop a 3rd CMS member; verify metadata commits fail (2 of 5 is not quorum)
- Restart one CMS member (back to 3 of 5); verify commits resume automatically
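A scripted sketch of the drill above, assuming a 5-node CCM cluster whose members are all in the CMS; the keyspace and tables are just convenient metadata commits:
cqlsh 127.0.0.1 -e "CREATE KEYSPACE drill WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
ccm node4 stop && ccm node5 stop
cqlsh 127.0.0.1 -e "CREATE TABLE drill.ok (pk int PRIMARY KEY);"        # expect success: 3 of 5 is quorum
ccm node3 stop
cqlsh 127.0.0.1 -e "CREATE TABLE drill.blocked (pk int PRIMARY KEY);"   # expect failure: 2 of 5 is not quorum
ccm node3 start
cqlsh 127.0.0.1 -e "CREATE TABLE drill.blocked (pk int PRIMARY KEY);"   # expect success once quorum is back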
Emergency recovery drill:
Do not skip this test. The emergency recovery procedure is the one you will need when everything else has failed, and you do not want to be reading the instructions for the first time during a production incident.
- Start a 3-node cluster with CMS initialized
- Create some metadata (keyspaces, tables)
- Take a metadata dump via JMX
- Stop all nodes
- Enable unsafe_tcm_mode on one node
- Start the recovery node
- Load the metadata dump via JMX
- Verify the metadata state with nodetool cms describe
- Disable unsafe mode, restart, and bring up remaining nodes
- Verify cluster-wide convergence (all nodes report the same epoch)
Scenario 7: Monitoring and Alerting Validation
Goal: Verify your monitoring captures TCM metrics and alerts fire correctly.
Metrics to verify are being collected:
- CommitSuccessLatency
- CommitRetries
- FetchPeerLogLatency / FetchCMSLogLatency
- ProgressBarrierLatency
- CoordinatorBehindSchema / CoordinatorBehindPlacements
- UnreachableCMSMembers
- currentEpoch gauge
Alert validation:
- Stop a CMS member: verify UnreachableCMSMembers goes to 1 and your alert fires
- Create a network partition: verify ProgressBarrierCLRelaxed increments
- Perform rapid schema changes: verify CoordinatorBehindSchema increments briefly
Test Plan Tiers
| Plan | Scenarios | Time Estimate |
|---|---|---|
| Minimum viable | 1 (smoke test) + 2 (upgrade path) | 1–2 hours |
| Standard | Add 3 (topology), 6 (failure recovery), 7 (monitoring) | 4–8 hours |
| Comprehensive | Add 4 (partitions) + 5 (concurrent operations) | 1–2 days |
Regardless of which plan you choose, run the emergency recovery drill from Scenario 6. It takes 30 minutes and could save hours during an actual incident.
Universal Test Assertions
Evaluate these assertions after every scenario:
Epoch consistency:
After every operation, all nodes should report the same epoch.
Compare system_views.cluster_metadata_log maximum epoch values across nodes.
Ring integrity:
After bootstrap or decommission, nodetool status shows the expected number of nodes,
all in UN state.
Schema agreement: After schema changes, all nodes should report the same schema version. Under TCM, if all nodes are at the same epoch, they have the same schema.
Data availability: After topology changes, reads and writes at your production consistency level succeed.
Log continuity:
The metadata log should have no gaps.
Query system_cluster_metadata.distributed_metadata_log and verify consecutive epoch numbers.
CMS health:
nodetool cms describe shows all CMS members reachable.
UnreachableCMSMembers equals 0.
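A quick sketch for checking the epoch-consistency and CMS-health assertions across every node. It assumes nodetool cms describe prints the epoch and member list, so adjust the grep pattern to the exact output of your version.
# Compare the reported epoch and CMS membership on every node
for host in node1 node2 node3; do
  echo "== $host =="
  nodetool -h "$host" cms describe | grep -iE "epoch|member"
done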
Useful Log Grep Commands
# Quick health check — find TCM errors and warnings
grep -E "ERROR|WARN" /var/log/cassandra/system.log | grep -i "epoch\|CMS\|metadata\|transform"
# Monitor log fetch activity
grep "fetch.*log\|caught up" /var/log/cassandra/system.log
# Detect progress barrier fallbacks
grep "Falling back to" /var/log/cassandra/system.log
# Find snapshot activity
grep -i "snapshot" /var/log/cassandra/system.log