TCM Day-2 Operations and Performance
Preview | Unofficial | For review only
This page covers what changes in day-to-day operations once TCM is active. For upgrade instructions, see Upgrade Procedure. For troubleshooting, see TCM Troubleshooting.
The New Shape of Topology Operations
Under gossip, topology changes were asynchronous and eventually consistent. Under TCM, every topology operation follows a coordinated, multi-step sequence committed through the distributed metadata log. Each step produces an epoch. Each epoch must be acknowledged by affected nodes before the next step proceeds. The operation is visible in the log, trackable through its steps, and resumable if interrupted.
The universal structure for all topology operations:
PREPARE ──► START ──► MID ──► FINISH
(validate)  (announce)  (stream)  (complete)
This structure applies to bootstrap, decommission, move, replace, and CMS reconfiguration. Once you understand it for one operation, you understand it for all of them.
Source: src/java/org/apache/cassandra/tcm/sequences/
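The four-step shape lends itself to a small state-machine sketch. This is an illustrative Python model, not Cassandra's API (the class and method names here are invented): each step commits one entry to a toy metadata log, producing a new epoch, and the sequence records where it is so it could resume after an interruption.

```python
from enum import Enum

class Step(Enum):
    PREPARE = 0   # validate preconditions, acquire range locks
    START = 1     # announce intent: commit the first transformation
    MID = 2       # stream data between affected nodes
    FINISH = 3    # commit the final transformation, release locks

class MetadataLog:
    """Toy stand-in for the distributed metadata log."""
    def __init__(self):
        self.epoch = 0
        self.entries = []

    def commit(self, entry):
        self.epoch += 1
        self.entries.append((self.epoch, entry))
        return self.epoch

class InProgressSequence:
    """One topology operation: each executed step commits one log entry,
    producing a new epoch, and records progress so it could resume."""
    def __init__(self, name):
        self.name = name
        self.next_step = Step.PREPARE
        self.epochs = []

    def execute_next(self, log):
        epoch = log.commit(f"{self.name}:{self.next_step.name}")
        self.epochs.append(epoch)
        if self.next_step is not Step.FINISH:
            self.next_step = Step(self.next_step.value + 1)
        return epoch

log = MetadataLog()
join = InProgressSequence("bootstrap-node-a")
for _ in Step:
    join.execute_next(log)
print(join.epochs)   # one epoch per step, committed in order
```

Because every step is a durable log entry, "where did this operation stop?" is always answerable by reading the log, which is what makes resume and abort possible.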
Bootstrap Under TCM
Three Steps
START_JOIN. The joining node commits a StartJoin transformation to the metadata log. This produces a new epoch. The node is added to the write replica set for its token ranges. A progress barrier fires and the system sends TCM_CURRENT_EPOCH_REQ messages to all nodes that own affected ranges. Only after a quorum in each datacenter acknowledges the new epoch does the operation proceed.
MID_JOIN. The joining node computes a MovementMap: a precise specification of which ranges to stream from which source nodes. The map is derived from the placement deltas between the old and new topology, not from gossip state. After streaming completes, the MidJoin transformation is committed. The node transitions from write-only to read-write in the placement information.
FINISH_JOIN. The final transformation removes the original owners' read replicas for the ranges the joining node now owns. The joining node is now a full cluster member.
Source: src/java/org/apache/cassandra/tcm/sequences/BootstrapAndJoin.java
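The MID_JOIN idea of "derive streaming sources from placement deltas, not gossip" can be sketched as a diff over two placement maps. All names here (`movement_map`, the placement dicts, the pick-first-source rule) are invented for illustration and are much simpler than the real MovementMap computation.

```python
def movement_map(old_placements, new_placements, joining):
    """Illustrative sketch: placements map a token range to its set of
    replica nodes. Any range the joining node gains must be streamed
    from a node that held it under the old placements."""
    moves = {}
    for rng, new_replicas in new_placements.items():
        old_replicas = old_placements.get(rng, set())
        if joining in new_replicas and joining not in old_replicas:
            sources = sorted(old_replicas)   # deterministic source choice
            if sources:
                moves[rng] = sources[0]
    return moves

old = {(0, 1000): {"n1", "n2"}, (1000, 2000): {"n2", "n3"}}
new = {(0, 1000): {"n1", "n4"}, (1000, 2000): {"n2", "n3"}}
print(movement_map(old, new, "n4"))   # {(0, 1000): 'n1'}
```

The important property is that both the old and new placements come from the same epoch-ordered metadata, so every node computing this map gets the same answer.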
What Operators Notice
Speed of convergence. Under gossip, a bootstrap could take 30 seconds or more before all nodes knew it was happening. Under TCM, the START_JOIN epoch propagates to all nodes within milliseconds. There is no ring-settle period.
Visibility. Each step produces log entries:
INFO - Starting to bootstrap...
INFO - fetching new ranges and streaming old ranges
INFO - Accord metadata is ready, continuing with bootstrap
INFO - Bootstrap completed for tokens [...]
Resumability. If the bootstrap fails mid-stream:
$ nodetool bootstrap resume
Or abort it:
$ nodetool bootstrap abort
Under gossip, a failed bootstrap often required manual cleanup. Under TCM, the state machine tracks where the operation stopped and recovery is a single command.
Decommission Under TCM
Three Steps
START_LEAVE. The leaving node commits a StartLeave transformation. The node's state is marked LEAVING in the cluster metadata. Optionally, a severity penalty is applied to the endpoint snitch, causing coordinators to prefer other replicas for reads.

MID_LEAVE. The leaving node streams its data to the nodes that will take over its ranges. The streaming targets are computed from placement deltas. After streaming, MidLeave is committed.

FINISH_LEAVE. The node is removed from all placements, transitions to LEFT state, and can safely shut down.
Decommission vs. Remove
Decommission (LEAVE). The node is alive and participates in streaming its own data out. Normal graceful path.
Remove (REMOVE). The node is dead and cannot stream. Other nodes must reconstruct the missing data from remaining replicas.
$ nodetool decommission # Graceful departure
$ nodetool removenode <host-id> # Dead node removal
Under gossip, a decommission that failed mid-stream left the cluster in an ambiguous, partially-departed state. Under TCM, abortDecommission reverts cleanly.
Source: src/java/org/apache/cassandra/tcm/sequences/LeaveStreams.java
Node Replacement Under TCM
Three Steps
START_REPLACE. The replacement node registers itself and marks the node it is replacing. Gossip state for the replaced node is set to "hibernating" to prevent writes to the dead node during replacement.
MID_REPLACE. The replacement node streams data. Its movement map specifically identifies the being-replaced node’s ranges and streams them from the remaining replicas. Strict movement validation is relaxed — the replacement can rebuild from partial data if the replaced node is dead.
FINISH_REPLACE. The replacement node assumes the replaced node’s tokens. The replaced node is removed from the directory. The replacement becomes the authoritative replica.
> **Warning:** Do not perform host replacement during the upgrade itself. Replace dead nodes before starting the rolling upgrade, or after completing all three upgrade phases. See Upgrade Procedure.
Token Moves Under TCM
Three Steps
START_MOVE. The system commits the intent to move, splitting ranges at the new token boundaries. The cluster metadata reflects the new token assignment immediately.
MID_MOVE. Streaming occurs using the same movement map logic as bootstrap. After streaming, Paxos state is repaired for the moved ranges.
FINISH_MOVE. Old tokens are released, new tokens are finalized, and the node’s local token metadata is updated. CMS placement is re-evaluated.
Under gossip, a token move required waiting for gossip to propagate the new assignment to all nodes, then verifying ring consistency. Under TCM, epoch ordering guarantees immediate consistency.
No More Ring-Settle Waits
The Old Ritual
Under gossip, every topology change required a waiting period:
1. Start bootstrap of node A
2. Wait for bootstrap to complete
3. Wait 30-60 seconds for "ring to settle"
4. Verify with nodetool status
5. Start bootstrap of node B
6. Repeat for each node
The ring-settle wait existed because gossip propagation was not instantaneous. Starting a second operation before the first had fully propagated could lead to incorrect range assignments.
Why TCM Eliminates It
When a topology operation completes (FINISH_JOIN, FINISH_LEAVE, FINISH_MOVE), the final epoch is committed and a progress barrier ensures that affected nodes have acknowledged it before the system considers the operation fully propagated. The barrier starts at EACH_QUORUM: quorum acknowledgment in every datacenter. That is the operational replacement for "wait until the ring settles." If the barrier cannot complete at EACH_QUORUM, the system logs the degraded path and the operator should treat the operation as slowed, not finished.

The next operation can start immediately. There is no propagation delay to wait for.
Conflicting operations are rejected, not silently broken. If node B’s bootstrap would affect ranges locked by node A’s in-progress operation, the system rejects it with a clear error.
Non-overlapping operations can run in parallel. If node A’s bootstrap affects token ranges (0, 1000) and node B’s affects (5000, 6000), both can proceed simultaneously.
Range Locking
How It Works
Every topology operation — bootstrap, decommission, move, replace — locks the token ranges it affects. When a new operation is prepared, the system checks whether its ranges intersect with any existing locks:
lockedRanges.intersects(newOperationRanges)
  → NOT_LOCKED:  proceed
  → Key(epoch):  conflict with existing operation at that epoch
The intersection check handles token ring wraparound and is aware of replication parameters.
Source: src/java/org/apache/cassandra/tcm/ClusterMetadata.java (LockedRanges)
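A minimal sketch of the locking idea, with invented names (`try_lock`, `ranges_intersect`) and a toy wraparound-aware intersection check. The real LockedRanges logic is richer (it is replication-aware), but the operator-visible behavior is the same: disjoint operations proceed, overlapping ones are rejected with the epoch of the conflicting operation.

```python
def ranges_intersect(a, b, ring_max=2**16):
    """Wraparound-aware intersection for (start, end) token ranges;
    a range whose start >= end wraps past the top of the ring."""
    def unwrap(r):
        s, e = r
        return [(s, e)] if s < e else [(s, ring_max), (0, e)]
    return any(s1 < e2 and s2 < e1
               for s1, e1 in unwrap(a)
               for s2, e2 in unwrap(b))

class LockedRanges:
    """Toy lock table keyed by the epoch of the locking operation."""
    def __init__(self):
        self.locks = {}   # epoch -> list of locked ranges

    def try_lock(self, epoch, ranges):
        for held_epoch, held in self.locks.items():
            if any(ranges_intersect(r, h) for r in ranges for h in held):
                return held_epoch   # conflict with operation at this epoch
        self.locks[epoch] = ranges
        return None                 # NOT_LOCKED: proceed

locks = LockedRanges()
assert locks.try_lock(41, [(0, 1000)]) is None      # bootstrap of node A
assert locks.try_lock(42, [(5000, 6000)]) is None   # disjoint: allowed
assert locks.try_lock(43, [(500, 1500)]) == 41      # overlap: rejected
```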
Operator Behavior
You cannot accidentally run overlapping operations. Two bootstraps on adjacent token ranges, a bootstrap and a decommission on the same range — all are caught and rejected before they can conflict.
You can safely run non-overlapping operations. Adding a node in rack 1 while decommissioning a node in rack 3, provided their token ranges do not overlap, is safe and allowed.
Locks are released automatically. When a FINISH transformation completes, the lock is released as part of the metadata commit. No manual unlock step is needed.
Stuck locks indicate stuck operations. If a lock persists, the associated operation is still in progress (or failed and needs resumption). See Stuck Topology Operation playbook.
Elimination of Split-Brain Data Loss
The Problem
In a gossip-based cluster, different coordinators observe ring changes at different speeds. This is not a theoretical concern — it can produce real data loss:
Time 0: Node X begins decommission. Gossip starts propagating.
Time 1: Coordinator A sees the ring change. Routes write to replicas {Y, Z}.
Time 1: Coordinator B has NOT yet seen the ring change.
        Routes the same write to replicas {X, Z}.

Result: The write "succeeded" on both coordinators at CL=QUORUM,
        but the replica sets differ. If X finishes decommission,
        Coordinator B's write to X is lost.
The window is seconds to tens of seconds. In a high-throughput cluster, thousands of writes can pass through that window. No errors are raised, no warnings are logged.
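The window can be made concrete with a toy simulation. The replica-selection function below is invented for illustration and is not Cassandra's placement algorithm; the point is only that two coordinators holding different ring views choose different replica sets for the same key, with no error on either side.

```python
def replicas_for(key, ring_view, rf=2):
    """Toy placement: walk the sorted ring starting at the key's slot
    and take rf nodes. Illustrative only."""
    nodes = sorted(ring_view)
    start = key % len(nodes)
    return {nodes[(start + i) % len(nodes)] for i in range(rf)}

ring_before = ["X", "Y", "Z"]   # coordinator B's stale view
ring_after = ["Y", "Z"]         # coordinator A already saw X leaving

write_key = 0
a_targets = replicas_for(write_key, ring_after)
b_targets = replicas_for(write_key, ring_before)
print(a_targets, b_targets)   # the two replica sets differ
```

Both coordinators report success at quorum, yet B's write landed on the departing node X. Under TCM, both coordinators would be forced onto the same epoch before streaming begins, so this divergence cannot arise.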
How TCM Fixes It
When a decommission begins, the START_LEAVE transformation is committed at a specific epoch. The progress barrier waits for affected nodes to acknowledge this epoch before the operation proceeds to streaming.

This means:

- All coordinators see the ring change before streaming begins.
- Epoch application is atomic: a node either has applied epoch N or it has not.
- The consistency level cascade provides a safety net if some nodes are slow.

The quorum inconsistency window is closed.
Source: src/java/org/apache/cassandra/tcm/ (progress barrier mechanism)
Operations Summary
| Operation | Before TCM | After TCM |
|---|---|---|
| Bootstrap convergence | 5–30 seconds via gossip | Sub-second via epoch commit |
| Ring-settle wait | 30–60 seconds (manual) | Eliminated (barrier-based) |
| Concurrent operations | Unsafe on overlapping ranges | Safe with range locking |
| Failed operation recovery | Manual cleanup | Resume or abort command |
| Split-brain risk during topology changes | Present (silent data loss) | Eliminated (epoch ordering) |
| Operation progress visibility | Unclear | 3-step model in logs and metadata |
Performance and Latency
The Read/Write Hot Path: No Overhead
TCM does not sit in the read/write request path. When a coordinator receives a read or write request, it looks up current replica placements from an in-memory ClusterMetadata object. This is the same object it has always used; TCM changes how the object is populated, not how it is consulted.
The only per-request check is an epoch comparison. If a replica’s epoch is higher than the coordinator’s, the coordinator triggers an asynchronous background fetch to update its local metadata. This fetch does not block the request.
The EpochAwareDebounce class deduplicates concurrent requests for the same epoch: if ten requests all discover that the coordinator is behind epoch 42, only one log fetch is triggered.
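The debouncing idea can be sketched as follows. The names and structure are illustrative, not Cassandra's implementation (the real EpochAwareDebounce is fully asynchronous and clears completed entries):

```python
from concurrent.futures import Future

class EpochAwareDebounce:
    """Sketch: concurrent callers asking to catch up to the same epoch
    share a single in-flight fetch future."""
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn
        self.in_flight = {}   # epoch -> Future

    def get_async(self, epoch):
        fut = self.in_flight.get(epoch)
        if fut is None:
            fut = Future()
            self.in_flight[epoch] = fut
            # Synchronous stand-in; a real implementation fetches in the
            # background and removes the entry when the future completes.
            fut.set_result(self.fetch_fn(epoch))
        return fut

fetch_count = 0
def fetch(epoch):
    global fetch_count
    fetch_count += 1
    return f"log entries up to epoch {epoch}"

debounce = EpochAwareDebounce(fetch)
futures = [debounce.get_async(42) for _ in range(10)]
print(fetch_count)   # ten callers, one actual fetch
```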
Two metrics track coordinator freshness:

- `coordinatorBehindSchema`: coordinator schema is older than a replica's
- `coordinatorBehindPlacements`: coordinator placement information is older than a replica's
Both should be near zero in a healthy cluster. Spikes during topology changes or schema modifications are expected and normal.
Metadata Commit Latency
Metadata operations go through Paxos consensus. A commit requires two phases (prepare + accept/commit), each requiring a network round-trip to a quorum of CMS members.
Estimated commit latency by deployment:
| Deployment | 3-Member CMS | 5-Member CMS | 7-Member CMS |
|---|---|---|---|
| Same DC | 50–100 ms | 100–200 ms | 150–300 ms |
| Multi-DC (US regions) | 100–200 ms | 150–300 ms | 200–400 ms |
| Multi-DC (global) | 200–400 ms | 300–500 ms | 400–700 ms |
> **Note:** These are estimates based on typical Paxos latency characteristics. Actual numbers vary based on network quality, disk speed, and CMS member placement.
Metadata commits are infrequent compared to data operations. You will not notice this latency in daily operations.
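A back-of-envelope model of why the estimates scale with deployment distance: two Paxos phases, each bounded by the round-trip time to the slowest member of the quorum. This is purely illustrative arithmetic, not a model of Cassandra's actual implementation.

```python
def estimated_commit_ms(quorum_rtt_ms, phases=2):
    """A commit needs `phases` round-trips (prepare + accept/commit),
    each gated on the slowest quorum member's RTT."""
    return phases * quorum_rtt_ms

# Same-DC quorum RTT of ~25-50 ms gives ~50-100 ms commits, matching
# the 3-member same-DC row; global RTTs of 100-200 ms give 200-400 ms.
print(estimated_commit_ms(25), estimated_commit_ms(200))
```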
Schema Changes
Under gossip, schema propagation was eventually consistent with no confirmation. Under TCM, a schema change is a single Paxos commit (100–300 ms): you know within milliseconds of the commit that the change is durable, and the new epoch then propagates to all nodes through the same log replication mechanism used for all other metadata changes.
Log Replication Speed
Push (epoch notifications). When a topology operation commits a new epoch, the progress barrier sends TCM_CURRENT_EPOCH_REQ messages to affected nodes. Same-DC: typically 50–150 ms. Cross-DC: 200–500 ms.

Pull (background fetching). The PeerLogFetcher periodically checks whether new log entries are available. Fetches are asynchronous with exponential backoff. In practice, fetches complete in milliseconds.
> **Note:** The TCM development team has noted that CMS nodes and followers have not been observed to lag in practice. The metadata payload is tiny; a schema change produces a few hundred bytes. If you see a node lagging behind on epoch, the problem is almost certainly a network partition or node health issue, not a CMS performance bottleneck.
Memory and Storage Footprint
The ClusterMetadata object is held entirely in memory on every node. For a typical cluster (10–50 nodes, 50–200 tables), it is measured in tens of kilobytes. For large clusters (hundreds of nodes, thousands of tables), it may reach low megabytes.
The metadata log grows linearly with the number of metadata changes. Each entry is small (a few hundred bytes to a few kilobytes). Periodic snapshots prevent unbounded log growth.
Source: src/java/org/apache/cassandra/tcm/ClusterMetadata.java
Metadata Snapshots
A snapshot is a serialized copy of the complete ClusterMetadata at a specific epoch. When a node needs to catch up from an older epoch, it can load the nearest snapshot and replay only the entries after that point. Snapshot frequency is configurable via metadata_snapshot_frequency (default: 100 epochs).
Source: src/java/org/apache/cassandra/config/Config.java
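The catch-up path can be sketched as "load nearest snapshot, replay the rest." Names are invented, and the real implementation streams entries rather than holding them all in lists; the sketch only shows why snapshots bound recovery work.

```python
def catch_up(local_epoch, snapshot_epochs, log_entries):
    """Load the newest snapshot at or below the target epoch, then
    replay only the log entries after it, in epoch order."""
    target = max(e for e, _ in log_entries) if log_entries else local_epoch
    usable = [e for e in snapshot_epochs if e <= target]
    base = max(usable) if usable else local_epoch
    replayed = [e for e, _ in sorted(log_entries) if base < e <= target]
    return base, replayed

# A node at epoch 3 catching up to epoch 205, snapshots every 100 epochs:
snapshots = [100, 200]
entries = [(e, f"entry-{e}") for e in range(4, 206)]
base, replayed = catch_up(3, snapshots, entries)
print(base, len(replayed))   # start from snapshot 200, replay only 5 entries
```

With frequent snapshots, a badly lagging node replays a handful of entries instead of the whole log, which is the recovery-speed side of the frequency tradeoff above.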
| Frequency | Tradeoff |
|---|---|
| More frequent (every 10–20 epochs) | Faster node recovery; higher serialization cost |
| Less frequent (every 100–500 epochs) | Lower storage overhead; slower recovery for lagging nodes |
For clusters with frequent topology changes, every 10–20 epochs is reasonable. For stable clusters, 50–100 epochs is sufficient.
Force a snapshot manually:
$ nodetool cms snapshot
CMS Sizing vs. Latency Tradeoff
More CMS members means higher fault tolerance but higher Paxos latency, because the quorum is larger. For most clusters, the latency difference between RF=3 and RF=5 is 50–100 ms per commit — negligible for infrequent metadata operations.
Recommendation: RF=5 for production. This gives a 3-node quorum, tolerance for 2 simultaneous member failures, and commit latency in the 100–300 ms range. Use RF=3 only for development or testing. Use RF=7 only if you specifically need to tolerate the loss of 3 CMS members.
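The recommendation follows from simple majority-quorum arithmetic, sketched below:

```python
def cms_properties(members):
    """Quorum is a majority of CMS members; fault tolerance is
    whatever remains above that majority."""
    quorum = members // 2 + 1
    return quorum, members - quorum   # (quorum size, tolerated failures)

for rf in (3, 5, 7):
    quorum, tolerated = cms_properties(rf)
    print(f"RF={rf}: quorum {quorum}, tolerates {tolerated} failures")
```

RF=5 is the sweet spot because going from 3 to 5 members doubles the failure tolerance (1 to 2) while adding only one node to the quorum, whereas RF=7 adds latency for a tolerance level few deployments need.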
Gossip vs. TCM: Latency Comparison
| Operation | Gossip | TCM | Notes |
|---|---|---|---|
| Schema change initiation | < 10 ms | 100–300 ms | Gossip: fire-and-forget. TCM: Paxos commit. |
| Schema change confirmation | Unknown (10–30 s propagation) | 100–300 ms | Gossip: no confirmation. TCM: commit is confirmation. |
| Topology change visibility | 5–30 seconds | Sub-second | TCM wins decisively. |
| Ring settle after topology change | 30–60 seconds | 0 seconds | TCM eliminates this entirely. |
| Per-request overhead | None | None | Epoch comparison is in-memory. |
TCM trades a small per-commit cost (100–300 ms of Paxos latency) for the elimination of much larger operational costs (30–60 seconds of ring-settle waiting per topology change, and the risk of silent data loss from split-brain quorum inconsistency).
Timeout and Retry Configuration
All properties are defined in src/java/org/apache/cassandra/config/Config.java.
| Parameter | Default | Purpose |
|---|---|---|
| | | Maximum time to wait for a CMS commit or log fetch |
| | | Backoff expression for commit retries |
| | | Maximum time for a progress barrier to complete |
| `progress_barrier_backoff` | 1 s | Time between barrier retry attempts |
| | EACH_QUORUM | Starting consistency level for barriers |
| `metadata_snapshot_frequency` | 100 | How often (in epochs) to snapshot cluster metadata |
Progress barrier fallback sequence:
EACH_QUORUM ──► QUORUM ──► LOCAL_QUORUM ──► ONE ──► NODE_LOCAL
(default)      (fallback)  (fallback)      (fallback)  (last resort)

At each level, the barrier retries every second (progress_barrier_backoff) for up to 30 seconds before falling back to the next lower level.
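The cascade can be sketched as a loop over consistency levels. The `ack_fn` callback and retry counts are illustrative stand-ins (the real barrier sleeps between attempts and collects epoch acknowledgments over the network):

```python
def run_barrier(ack_fn,
                levels=("EACH_QUORUM", "QUORUM", "LOCAL_QUORUM",
                        "ONE", "NODE_LOCAL"),
                retries_per_level=30):
    """Retry the barrier at each consistency level before degrading
    to the next; return the level at which it finally succeeded."""
    for level in levels:
        for _ in range(retries_per_level):
            if ack_fn(level):
                return level   # barrier satisfied at this CL
    return None                # could not complete at any level

# Example: one DC is partitioned, so EACH_QUORUM never succeeds but a
# global QUORUM does; the barrier degrades exactly one level.
result = run_barrier(lambda level: level != "EACH_QUORUM")
print(result)
```

Every degradation is a loss of safety margin, which is why the monitoring section below treats consistency-level downgrades as a countable event rather than silent behavior.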
Monitoring Recommendations
Key Metrics
The org.apache.cassandra.tcm:type=TCMMetrics MBean provides:
| Metric | Type | What It Tells You |
|---|---|---|
| | Timer | Histogram of successful commit latencies |
| | Counter | Rate of commit retries; sustained high rates indicate CMS issues |
| | Timer | Log fetch latencies; should be < 1 s |
| | Timer | How long progress barriers take; > 5 s warrants investigation |
| | Counter | Times the CL was degraded; non-zero during node restarts is normal |
| | Counter | Stale-metadata hits on coordinators |
| | Gauge | CMS members not responding; alert if > 0 |
| | Gauge | Current cluster epoch; primary health signal |
Source: src/java/org/apache/cassandra/tcm/CMSOperations.java
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| | > 0 | >= quorum threshold |
| | > 2 s | > 10 s |
| | > 5/minute | > 20/minute |
| | > 5 s | > 30 s |
| | > 1/hour | > 5/hour |
| | > 10/minute | > 50/minute |
| Epoch difference between nodes | > 10 for > 60 s | n/a |
These thresholds are starting points. Adjust based on your deployment topology, network characteristics, and SLAs.
How Non-CMS Nodes Stay in Sync
Non-CMS nodes do not participate in Paxos. They rely on the PeerLogFetcher, a background process that discovers and retrieves missing log entries.
- Discovery. The node compares its local epoch to the cluster's current epoch. If there is a gap, it knows entries are missing.
- Fetch. The node requests missing entries from a CMS member or any peer that has them, using the `TCM_FETCH_CMS_LOG` and `TCM_REPLICATION` messaging verbs.
- Apply. Entries are applied in strict epoch order.
For nodes very far behind, the PeerLogFetcher requests a metadata snapshot instead of replaying the entire log.
Source: src/java/org/apache/cassandra/tcm/
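The strict-ordering rule in the Apply step can be sketched like this (names are invented; the real fetcher operates on the replicated log and re-requests missing entries):

```python
def apply_in_order(local_epoch, fetched_entries):
    """Apply fetched entries only while they form a contiguous run
    starting at local_epoch + 1; stop at the first gap."""
    applied = []
    expected = local_epoch + 1
    for epoch, entry in sorted(fetched_entries):
        if epoch < expected:
            continue       # already applied locally
        if epoch > expected:
            break          # gap: wait until the missing entry arrives
        applied.append(entry)
        expected += 1
    return applied, expected - 1   # applied entries, new local epoch

# Entries 6 and 8 arrived, but 7 is still missing: only 6 is applied.
applied, epoch = apply_in_order(5, [(8, "h"), (6, "f")])
print(applied, epoch)
```

Gating application on contiguity is what makes epoch application atomic from the rest of the cluster's point of view: a node is at epoch N or it is not, never at "N with holes."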