TCM Upgrade Procedure
This page is the execution runbook for upgrading a Cassandra cluster to Transactional Cluster Metadata. Before starting, complete every item in Pre-Upgrade Prerequisites.
The upgrade has three phases. Phase 1 is fully reversible. Phase 2 is the commit point — it is not reversible in practice. Phase 3 scales the CMS to production resilience.
The Three Phases
| Phase | What Happens | Duration | Reversible? |
|---|---|---|---|
| 1: Rolling Binary Upgrade | All nodes upgraded one at a time; cluster stays in gossip mode | Hours | Yes — downgrade binaries |
| 2: CMS Initialization | First CMS member established; metadata log created; TCM activated | 1–2 minutes | No (in practice) |
| 3: CMS Reconfiguration | CMS scaled to production replication factor | Minutes | Yes |
Phase 1: Rolling Binary Upgrade
This is standard Cassandra procedure.
The cluster remains in gossip mode throughout.
TCM code is present but inactive — nodes operate in GOSSIP service state.
Per-Node Procedure
For each node, rack by rack, datacenter by datacenter:
Step 1. Drain the node.
$ nodetool drain
Flushes all memtables to SSTables and stops accepting new connections.
Step 2. Stop the Cassandra process.
$ sudo systemctl stop cassandra
Step 3. Install the new Cassandra version.
Replace the Cassandra binaries with the 6.0 release.
Preserve your cassandra.yaml, cassandra-env.sh, and any other customized configuration files.
Step 4. Start the node on the new version.
$ sudo systemctl start cassandra
Step 5. Wait for the node to rejoin the cluster.
$ nodetool status
Confirm the node shows UN (Up/Normal) and that gossip has settled.
Check the system log for:
Gossip settled after X ms
Step 6. Verify the node is healthy before moving to the next one.
$ nodetool describecluster
Confirm that the cluster reports no more than two schema versions. During a rolling upgrade you may see at most two — one for upgraded nodes, one for not-yet-upgraded nodes. This is expected, and the count should settle back to one once every node is on 6.0.
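The six per-node steps are easy to script so that every node is treated identically. The following is a minimal sketch, assuming systemd-managed Cassandra, nodetool on the PATH, and local-IP detection via hostname -i; the Step 3 install is left as a placeholder because it depends entirely on your packaging. Adapt it to your environment rather than running it verbatim.
#!/usr/bin/env bash
# Sketch: upgrade one node in place (Steps 1-6 above).
set -euo pipefail

# Step 1: flush memtables and stop accepting new connections.
nodetool drain

# Step 2: stop the Cassandra process.
sudo systemctl stop cassandra

# Step 3: install the 6.0 binaries here using your own mechanism
# (package manager, tarball, config management). Preserve cassandra.yaml,
# cassandra-env.sh, and any other customized configuration files.

# Step 4: start the node on the new version.
sudo systemctl start cassandra

# Step 5: wait until this node reports UN in nodetool status.
local_ip="$(hostname -i | awk '{print $1}')"
until nodetool status 2>/dev/null | grep -Eq "^UN +${local_ip}( |$)"; do
  echo "waiting for ${local_ip} to reach UN..."
  sleep 10
done

# Step 6: capture the cluster view for manual review before the next node.
nodetool describecluster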
What the Cluster Looks Like During Phase 1
While the rolling upgrade is in progress, the cluster is in a mixed-version state.
Upgraded nodes contain the TCM code but operate in GOSSIP mode.
An upgraded node will reject any attempt to commit a TCM transformation:
Can't commit transformations when running in gossip mode
Non-upgraded nodes continue operating exactly as before.
The messaging version boundary is respected (5.0 uses messaging version 13; 6.0 uses version 14). During the mixed-version window, inter-node communication uses the lower version. This is why schema propagation across the version boundary does not work.
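During the mixed-version window it is useful to see at a glance which nodes have been upgraded. One quick check from any node reads the release version out of gossip state (the grep pattern is illustrative; adjust it to your output):
$ nodetool gossipinfo | grep -E '^/|RELEASE_VERSION'
Each endpoint is followed by its RELEASE_VERSION; upgraded nodes report a 6.0 version.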
Upgrade Pace
- Upgrade one rack at a time.
- Do not upgrade all nodes in a datacenter simultaneously.
- Complete the upgrade within a maintenance window. The longer the mixed-version state, the higher the chance that someone triggers a blocked operation.
Prohibited During Phase 1
These are hard constraints, not guidelines:
- Do NOT perform host replacements during the upgrade
- Do NOT issue schema changes (CREATE, ALTER, DROP)
- Do NOT bootstrap new nodes
- Do NOT decommission nodes
- Do NOT run nodetool move
- Do NOT run nodetool assassinate
- Do NOT change storage_compatibility_mode
Read and write operations for existing data are unaffected throughout all three phases.
Phase 2: CMS Initialization
Phase 2 is the commit point. After this, reverting to gossip requires significant manual effort and is not a supported operation.
Before running nodetool cms initialize, re-run the full readiness checklist from
Pre-Upgrade Prerequisites.
Step-by-Step
Step 1. Choose the initiating node.
Pick a stable, JOINED node with good network connectivity to all peers. This node will become the first CMS member. Avoid recently restarted nodes and nodes you plan to decommission soon.
Step 2. Run the initialize command.
$ nodetool cms initialize
If you have nodes that are permanently down and will not return:
$ nodetool cms initialize --ignore 10.0.1.50,10.0.1.51
Step 3. Watch the output.
A successful initialization:
Initializing CMS...
Verifying cluster metadata agreement...
All peers agree on cluster metadata.
CMS initialized successfully. Current epoch: 1
If another node has already started initialization:
Migration already initiated by /10.0.1.10:7000
Abort and retry from your chosen node:
$ nodetool cms abortinitialization --initiator 10.0.1.10
Step 4. Verify initialization.
$ nodetool cms describe
If initialization fails, capture the exact error before retrying. For example:
Initializing CMS...
Verifying cluster metadata agreement...
ERROR: Node 10.0.1.15 reported a mismatched schema digest
ERROR: CMS initialization aborted
That output means one or more peers did not match the initiating node’s metadata.
Do not force the operation through.
Fix the mismatch, then rerun nodetool cms initialize.
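Before retrying (or before the first attempt), a quick scripted pre-flight check can catch the two most common blockers: nodes that are not Up/Normal and multiple schema versions. A minimal sketch follows; the grep patterns assume the standard nodetool status and nodetool describecluster output formats, so treat a passing result as a supplement to the full readiness checklist, not a replacement for it.
#!/usr/bin/env bash
# Sketch: basic pre-flight checks before `nodetool cms initialize`.
set -euo pipefail

# 1. No node should be down, leaving, joining, or moving.
if nodetool status | grep -Eq '^(D[NLJM]|U[LJM]) '; then
  echo "ERROR: not every node is UN -- resolve before initializing" >&2
  exit 1
fi

# 2. The cluster should report exactly one schema version.
schema_versions=$(nodetool describecluster | grep -cE '^[[:space:]]+[0-9a-f-]{36}:' || true)
if [ "${schema_versions}" -ne 1 ]; then
  echo "ERROR: expected one schema version, found ${schema_versions}" >&2
  exit 1
fi

echo "Basic pre-flight checks passed."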
What Happens During Initialization
- Validation. The five gates from Pre-Upgrade Prerequisites are checked.
- Election. The initiating node broadcasts a CMSInitializationRequest to every non-ignored peer. Each peer compares its directory, token map, and schema digest against the initiator's. All must match.
- PreInitialize transformation. The first entry in the distributed metadata log is committed.
- Initialize transformation. The second log entry captures the full cluster metadata snapshot — every node, every token, every schema definition — as the baseline state at Epoch 1.
- State transition. The initiating node transitions from GOSSIP to LOCAL state. All other nodes transition from GOSSIP to REMOTE state. The LegacyStateListener begins feeding TCM-managed state back into gossip for compatibility.
- Snapshot. A metadata snapshot is triggered for fast recovery.
Source: src/java/org/apache/cassandra/tcm/migration/
The Point of No Return
Once nodetool cms initialize succeeds:
- The system_cluster_metadata keyspace exists on disk.
- All nodes have transitioned out of GOSSIP mode.
- The distributed metadata log is the authoritative source for token ownership, schema, and membership.
- Gossip continues to run but now receives its metadata state from TCM via LegacyStateListener.
Reverting from this point is not a supported operation. Phase 1 is fully reversible — spend time validating before executing Phase 2.
Phase 3: CMS Reconfiguration
After initialization, the CMS has a replication factor of 1. This is a single point of failure for metadata operations. Scale it up immediately.
Choosing the Target Replication Factor
| Cluster Size | Recommended CMS RF | Quorum Size | Tolerated Failures |
|---|---|---|---|
| Small (~12 nodes) | 3 | 2 | 1 |
| Medium (12–50 nodes) | 5 | 3 | 2 |
| Large (50+ nodes) or multi-DC | 7 | 4 | 3 |
The CMS RF must be an odd number (Paxos requires strict majority). For production clusters, 5 CMS members provides the best balance of latency and fault tolerance.
CMS RF=1 (the post-initialization default) is an availability risk. If that one node goes down, no metadata operations can proceed — no schema changes, no topology changes. Data reads and writes continue, but the cluster is operationally frozen for metadata changes. Do not remain at RF=1.
Step-by-Step
Step 1. Run the reconfigure command.
Single-datacenter:
$ nodetool cms reconfigure 3
Multi-datacenter:
$ nodetool cms reconfigure dc1:3 dc2:3
Step 2. Monitor the reconfiguration.
$ nodetool cms reconfigure --status
Step 3. Verify the final state.
$ nodetool cms describe
Confirm that the CMS membership matches your target RF, and that nodes are distributed across racks and (if applicable) datacenters.
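If the reconfiguration takes more than a moment on a large cluster, Steps 2 and 3 can be combined into a simple watch loop. The sketch below only re-prints the command output every 30 seconds and leaves interpretation to the operator (it does not attempt to parse the status format); stop it with Ctrl-C once nodetool cms describe shows the target membership.
# Sketch: periodically re-print reconfiguration status until you stop it.
while true; do
  date
  nodetool cms reconfigure --status
  nodetool cms describe
  sleep 30
done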
How CMS Placement Works
You do not choose which nodes become CMS members. The CMSPlacementStrategy selects members automatically using rack-diversity principles similar to NetworkTopologyStrategy: no two CMS members share a rack if possible, and in multi-DC deployments, members are spread across datacenters.
Source: src/java/org/apache/cassandra/tcm/
Reconfiguration Anti-Patterns
Unbalanced rack distribution. If one rack contains 80% of the nodes, CMS placement is constrained; with RF=5 and fewer than five racks, at least two members must share a rack.
Single-rack clusters. CMS placement degenerates to arbitrary node selection. Configure racks before enabling TCM if you have not already.
CMS Commands Reference
| Command | Purpose |
|---|---|
| nodetool cms describe | Show CMS state, members, epoch, migration status |
| nodetool cms initialize | Initialize CMS from gossip state |
| nodetool cms abortinitialization | Abort a failed initialization |
| nodetool cms reconfigure <rf> | Change CMS replication factor |
| nodetool cms reconfigure <dc>:<rf> | Per-DC reconfiguration |
| nodetool cms reconfigure --status | Check reconfiguration progress |
| | Resume interrupted reconfiguration |
| | Cancel in-progress reconfiguration |
| | Force a metadata snapshot |
| | Unregister a node in LEFT state |
| | Dump the node directory |
| | Dump metadata log entries |
Source: src/java/org/apache/cassandra/tools/nodetool/CMSAdmin.java
Post-Upgrade Validation
Understanding the nodetool cms describe Output
$ nodetool cms describe
| Field | What to Verify |
|---|---|
| Epoch | Greater than 0 after initialization; higher after reconfiguration |
| | Lists CMS node IDs at your target RF |
| Service State | LOCAL on CMS members, REMOTE on all other nodes; no node should show GOSSIP |
| Is Migrating | false on every node |
| Local Pending Count | 0 on every node |
| Commits Paused | false on every node |
Run this on every node and compare Epoch values.
In a healthy, quiescent cluster, every node should report the same epoch.
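One way to do that comparison from a single shell is to fan the command out over SSH. A minimal sketch, assuming a hosts.txt file with one node address per line and passwordless SSH access (substitute your own orchestration tooling):
# Sketch: print the epoch reported by every node; all values should match.
while read -r host; do
  printf '%-18s ' "${host}"
  ssh -o BatchMode=yes "${host}" 'nodetool cms describe' | grep 'Epoch:' || echo 'UNREACHABLE'
done < hosts.txt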
Querying the Metadata Log
SELECT epoch, kind, entry_id, entry_time
FROM system_views.cluster_metadata_log
ORDER BY epoch DESC
LIMIT 20;
After initialization and reconfiguration to RF=3, you should see:
epoch | kind
-------+-----------------------------------
5 | FINISH_ADD_TO_CMS
4 | START_ADD_TO_CMS
3 | PREPARE_SIMPLE_CMS_RECONFIGURATION
2 | INITIALIZE_CMS
1 | PRE_INITIALIZE_CMS
Verify that PRE_INITIALIZE_CMS and INITIALIZE_CMS entries exist at epochs 1 and 2.
Querying the Node Directory
SELECT node_id, host_id, state, cassandra_version, dc, rack,
broadcast_address, multi_step_operation
FROM system_views.cluster_metadata_directory;
Verify:
- Every expected node appears in the directory
- All active nodes show state = 'JOINED'
- No nodes are stuck in BOOTSTRAPPING, MOVING, or LEAVING
- multi_step_operation is empty for all JOINED nodes
- cassandra_version is consistent across all nodes
The Schema Smoke Test
This end-to-end test provides the strongest confidence that metadata propagation works.
Step 1. Record the current epoch.
$ nodetool cms describe | grep "Epoch:"
Epoch: 5
Step 2. Create a test keyspace.
CREATE KEYSPACE test_tcm_validation
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Step 3. Verify the epoch advanced.
$ nodetool cms describe | grep "Epoch:"
Epoch: 6
Step 4. Verify propagation to other nodes.
On a different node:
DESCRIBE KEYSPACE test_tcm_validation;
The keyspace should exist.
Step 5. Verify in the metadata log.
SELECT epoch, kind
FROM system_views.cluster_metadata_log
WHERE epoch = 6;
You should see a SCHEMA_CHANGE entry.
Step 6. Clean up.
DROP KEYSPACE test_tcm_validation;
If all steps pass, TCM is working end-to-end.
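For repeatability, Steps 1 through 6 can be wrapped in one script run from a single node; only the cross-node propagation check (Step 4) still needs a second node. A minimal sketch, assuming local cqlsh connectivity and the "Epoch: <n>" output format shown above:
#!/usr/bin/env bash
# Sketch: TCM schema smoke test (Steps 1-6 above).
set -euo pipefail

before=$(nodetool cms describe | grep 'Epoch:' | awk '{print $2}')
echo "Epoch before: ${before}"

cqlsh -e "CREATE KEYSPACE test_tcm_validation WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"

after=$(nodetool cms describe | grep 'Epoch:' | awk '{print $2}')
echo "Epoch after:  ${after}"

# The commit for the CREATE should appear in the metadata log at the new epoch.
cqlsh -e "SELECT epoch, kind FROM system_views.cluster_metadata_log WHERE epoch = ${after};"

# Step 4 (DESCRIBE KEYSPACE test_tcm_validation from another node) is manual; then clean up.
cqlsh -e "DROP KEYSPACE test_tcm_validation;"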
Log Patterns to Watch
Healthy operation — expected:
INFO - Fetching log from <peer>, at least <epoch>
DEBUG - Fetched log from CMS - caught up from epoch X to epoch Y
INFO - First CMS node
Warning signs — investigate:
WARN - Learned about epoch X from <peer>, but could not fetch log
WARN - Could not fetch log entries from peer, remote = <peer>, await = <epoch>
WARN - Could not reconfigure CMS, operator should run...
INFO - Could not collect epoch acknowledgements within Xms for Y. Falling back to Z.
The progress barrier fallback message is common during node restarts and usually benign. If you see it on every metadata operation, investigate slow nodes.
Error conditions — action required:
ERROR - Caught an exception while processing entry X. This can mean that this node
is configured differently from CMS.
WARN - Stopping log processing on the node. All subsequent epochs will be ignored.
The "stopping log processing" message means the node has given up applying metadata changes and must be restarted.
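These patterns are easy to fold into whatever log scanning you already run. A minimal sketch that greps a single node's system log for the warning and error patterns above; the log path is an assumption, so adjust it to your install.
# Sketch: surface recent TCM warning/error lines from the system log.
LOG=/var/log/cassandra/system.log
grep -E 'could not fetch log|Could not fetch log entries|Could not reconfigure CMS|Stopping log processing|exception while processing entry' "${LOG}" | tail -n 50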
JMX Metrics to Configure Immediately
The org.apache.cassandra.tcm:type=TCMMetrics MBean provides:
| Metric | Type | Alert Condition |
|---|---|---|
| currentEpochGauge | Gauge | Primary health signal; compare across nodes |
| | Gauge | Should match your target RF |
| unreachableCMSMembers | Gauge | Alert if > 0 |
| | Gauge (0/1) | Verify on CMS and non-CMS nodes |
| | Gauge (0/1) | Alert if 1 for more than 5 minutes |
| | Timer | Alert if p99 > 2 seconds |
Source: src/java/org/apache/cassandra/tcm/CMSOperations.java
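If a JMX exporter is not yet wired up, the MBean can be inspected ad hoc. One option, assuming the sjk tooling bundled with nodetool in recent Cassandra releases is available in your build, is to dump the bean named above and read its attributes directly:
$ nodetool sjk mxdump -q 'org.apache.cassandra.tcm:type=TCMMetrics'
For continuous monitoring, export the same MBean through your normal JMX metrics pipeline and alert on the conditions in the table above.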
Post-Enablement Validation Checklist
- nodetool cms describe shows Epoch >= 1 on every node
- Service State is LOCAL on CMS members and REMOTE on others — no node shows GOSSIP
- Is Migrating is false on every node
- Local Pending Count is 0 on every node
- Commits Paused is false on every node
- All nodes report the same epoch
- system_views.cluster_metadata_log shows PRE_INITIALIZE_CMS and INITIALIZE_CMS entries
- system_views.cluster_metadata_directory shows all expected nodes in JOINED state
- Schema smoke test passes (epoch advances on CREATE; keyspace visible on all nodes)
- nodetool describecluster shows a single schema UUID
- JMX metrics accessible for currentEpochGauge and unreachableCMSMembers
- Alerting configured for quorum loss and epoch divergence
Complete Upgrade Sequence Reference
PRE-UPGRADE
├── Complete all repairs
├── Disable automation (auto-scaling, scheduled DDL, repair cron)
├── Run pre-upgrade readiness checklist
└── Confirm rollback plan

PHASE 1: ROLLING BINARY UPGRADE
├── For each node (rack by rack, DC by DC):
│   ├── nodetool drain
│   ├── Stop Cassandra
│   ├── Install 6.0 binaries
│   ├── Start Cassandra
│   ├── Wait for UN status
│   └── Verify with nodetool describecluster
├── Confirm all nodes are on 6.0
└── Re-run pre-upgrade readiness checklist

PHASE 2: CMS INITIALIZATION
├── Choose initiating node
├── nodetool cms initialize [--ignore <down-nodes>]
├── nodetool cms describe (verify success)
└── Confirm epoch is advancing

PHASE 3: CMS RECONFIGURATION
├── nodetool cms reconfigure <target-rf>
├── nodetool cms reconfigure --status (monitor)
├── nodetool cms describe (verify final state)
└── Re-enable automation

POST-UPGRADE VALIDATION
├── Run post-enablement validation checklist
├── Verify nodetool status shows all UN
├── Verify nodetool describecluster shows single schema version
├── Verify client connectivity
└── Resume normal operations
Rollback Considerations
Before Phase 2 (CMS not initialized): Full rollback is straightforward. Perform a reverse rolling upgrade — stop each node, install the previous version’s binaries, restart. No data is lost, no metadata is changed.
After Phase 2 (CMS initialized): Rollback is not a supported operation. The distributed metadata log now exists, the system_cluster_metadata keyspace is populated, and all nodes are operating in TCM mode.
Treat Phase 2 as a one-way door. Spend your time validating before you execute it. The in-tree design document states: "reverting to the previous method of metadata management is not supported."