Debugging Cassandra
Preview | Unofficial | For review only
Debugging Cassandra locally requires a few techniques that differ from typical application debugging. This page covers the most common debugging workflows for contributor work: attaching a debugger to unit tests, debugging dtest failures, tracing a query path, and reproducing CI failures locally. This preview page tracks the current contributor workflow; if your branch changes test tooling, confirm the commands against the branch you are working on.
Debugging Unit Tests
Attaching a Debugger (IntelliJ / Eclipse)
Run the target test in debug mode by passing the JDWP agent as a JVM argument:
ant test -Dtest.name=ClassName \
-Dtest.jvm.arg="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
The test JVM suspends on startup and waits for a debugger to connect.
In IntelliJ or Eclipse, create a Remote JVM Debug run configuration pointed at localhost:5005, then start it after the Ant command is running.
suspend=y is required — without it the test will proceed past your breakpoints before the debugger attaches.
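A small shell helper can keep the agent string consistent when you switch ports or want a non-suspending run. This is only a sketch: `jdwp_arg` is a hypothetical function name, not part of the Cassandra build, and it does nothing more than assemble the argument shown in the Ant command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the JDWP agent argument used in the Ant command above.
# Usage: jdwp_arg [port] [suspend]   (defaults: 5005 and y)
jdwp_arg() {
    local port="${1:-5005}"
    local suspend="${2:-y}"
    # "--" stops printf from treating the leading "-" of the format as an option
    printf -- '-agentlib:jdwp=transport=dt_socket,server=y,suspend=%s,address=%s\n' "$suspend" "$port"
}

# Example (not executed here):
#   ant test -Dtest.name=ClassName -Dtest.jvm.arg="$(jdwp_arg 5006)"
```

Using a different port per checkout avoids collisions when two test JVMs are waiting for a debugger at the same time.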
Useful JVM Flags for Debugging
Pass these via -Dtest.jvm.arg or by editing build.xml locally:
- -Dcassandra.test.logback=true: enables the test-specific logback configuration from test/conf/logback-test.xml
- -Dcassandra.skip_wait_for_gossip_to_settle=0: skips the gossip settle delay, which speeds up startup in gossip-dependent tests
- -ea: enables assertions (Ant enables this by default for test tasks, but it is worth confirming in custom runs)
- -XX:+PrintGCDetails: useful when investigating GC pauses inside test runs (deprecated since JDK 9, where the unified-logging equivalent is -Xlog:gc*)
Debugging Distributed Tests (dtests)
dtest Logging
dtests run nodes via CCM (Cassandra Cluster Manager). Each node writes logs to:
~/.ccm/<cluster-name>/<node-name>/logs/system.log
Start with system.log — errors and stack traces from the Cassandra process appear there.
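With several nodes in a cluster, it can be quicker to scan every node's log in one pass. The sketch below assumes the CCM directory layout shown above and the default log pattern, in which each entry starts with its level; `scan_node_logs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print ERROR entries from every node's system.log
# under a CCM cluster directory.
# Usage: scan_node_logs ~/.ccm/<cluster-name>
scan_node_logs() {
    local cluster_dir="$1"
    local log
    for log in "$cluster_dir"/*/logs/system.log; do
        # Skip the unexpanded glob when no node directories exist
        [ -f "$log" ] || continue
        # -H prefixes each match with the file name, which identifies the node.
        # Assumes the level is the first token on the line.
        grep -H '^ERROR' "$log"
    done
    return 0
}
```

Stack trace lines that follow an ERROR entry are not matched, so open the reported file at that entry to read the full trace.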
For verbose query tracing during a dtest, set the log level when invoking pytest:
pytest test_file.py::TestClass::test_method -v -s \
--cassandra-version=5.0 \
-Dcassandra.test.loglevel=DEBUG
Reproducing a dtest Failure Locally
Install the dtest prerequisites before reproducing a failure:
pip install ccm
cd cassandra-dtest
pip install -r requirements.txt
The --keep-test-dir flag prevents CCM from tearing down the cluster on failure, which lets you inspect logs and connect CQL shells to nodes afterward.
# Run a specific failing test
pytest test_file.py::TestClass::test_method -v -s
# Keep the CCM cluster alive after failure for inspection
pytest test_file.py::TestClass::test_method -v -s --keep-test-dir
Expect cluster startup to take a few minutes on the first run.
A successful repro ends with a normal pytest summary such as 1 passed; a failure ends with the traceback plus the preserved CCM directory if you used --keep-test-dir.
Inspecting a Live CCM Cluster
# List CCM clusters
ccm list
# Check node status in the last active cluster
ccm status
# Connect a CQL shell to node1
ccm node1 cqlsh
# Stream node1 logs in real time
tail -f ~/.ccm/<cluster-name>/node1/logs/system.log
CCM clusters from failed dtest runs persist until explicitly removed.
Run ccm remove <cluster-name> to clean up, or ccm remove to remove the current active cluster.
If ccm is not on your PATH, install it with pip install ccm and restart your shell.
Tracing a Query Path
CQL tracing captures each stage of the coordinator and replica path with microsecond timestamps.
Enable it per session in cqlsh:
-- Enable tracing for all subsequent queries
TRACING ON;
SELECT * FROM ks.tbl WHERE pk = 1;
Trace output shows the operation sequence from coordinator dispatch through replica response and read repair decisions. Each row includes the activity description, timestamp, source node, and elapsed time.
For code-level tracing, search for Tracing.trace(…) calls in the coordinator and storage paths. These calls produce the entries visible in the CQL trace output.
Following the Code Path
Start at these entry points depending on the operation:
- org.apache.cassandra.service.StorageProxy: coordinator logic for reads and writes
- ReadCallback: read path response handling and consistency level tracking
- WriteResponseHandler: write path acknowledgement and consistency tracking
- ColumnFamilyStore: storage engine entry point
- Memtable: in-memory write path
- SSTableReader: on-disk read path
See Query Execution Path for a full code-level walkthrough.
Capturing Thread Dumps
Thread dumps identify blocked or deadlocked threads.
Capture one by sending SIGQUIT to the JVM process:
kill -3 <pid>
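The JVM prints the dump to its own standard output, so look in the console that launched the process or in the file that stdout is redirected to. The dump does not go through Logback and will not normally appear in system.log.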
Alternatively, use jstack:
jstack <pid> > threaddump.txt
During a hanging Ant test, find the test JVM first:
# Find the test JVM PID
jps -l | grep cassandra
# Capture the thread dump
jstack <pid>
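When reading a thread dump, look first for threads in the BLOCKED state and for any "Found one Java-level deadlock" section that jstack prints. WAITING on its own is not a problem: idle pool threads park while they wait for work.
On a healthy node, most threads are parked in WAITING or TIMED_WAITING, and network event-loop threads typically appear as RUNNABLE inside a native epoll wait call.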
Take two or three thread dumps a few seconds apart.
A thread stuck in the same BLOCKED frame across all dumps points to real lock contention rather than a transient wait.
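The repeated-dump workflow can be scripted. The sketch below is illustrative: `capture_dumps` and `thread_state_counts` are hypothetical helper names, the threaddump-N.txt file names are an assumption, and the summary relies on the `java.lang.Thread.State:` lines that jstack emits for each Java thread.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for taking and comparing several thread dumps.

# Capture <count> dumps of <pid>, <interval> seconds apart, as threaddump-<n>.txt.
# Usage: capture_dumps <pid> [count] [interval]
capture_dumps() {
    local pid="$1" count="${2:-3}" interval="${3:-5}" i
    for i in $(seq 1 "$count"); do
        jstack "$pid" > "threaddump-$i.txt"
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Print "<STATE> <count>" for each thread state found in one dump file.
# Usage: thread_state_counts threaddump-1.txt
thread_state_counts() {
    sed -n 's/^ *java\.lang\.Thread\.State: \([A-Z_]*\).*/\1/p' "$1" \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

A BLOCKED count that stays above zero in every dump is a cue to open the files and compare the stack frames of those threads.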
Reproducing CI Failures Locally
Isolating a Flaky Test
If CI fails on a test that passes locally, the test may be timing-sensitive or order-dependent. Run it repeatedly to check for flakiness:
for i in {1..10}; do
ant test -Dtest.name=ClassName && echo "PASS $i" || echo "FAIL $i"
done
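For longer runs, a tally and per-run logs make intermittent failures easier to inspect afterwards. The following is a sketch under stated assumptions: `repeat_runs` is a hypothetical helper, and `RUN_LOG_DIR` is an illustrative variable that defaults to the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command N times, keep each run's output,
# and report how many runs passed and failed.
# Usage: repeat_runs 10 ant test -Dtest.name=ClassName
repeat_runs() {
    local runs="$1" pass=0 fail=0 i
    shift
    for i in $(seq 1 "$runs"); do
        # Each run's stdout and stderr go to run-<i>.log for later comparison
        if "$@" > "${RUN_LOG_DIR:-.}/run-$i.log" 2>&1; then
            pass=$((pass + 1))
        else
            fail=$((fail + 1))
        fi
    done
    echo "passed=$pass failed=$fail"
}
```

Comparing the log of a failing run against the log of a passing run is often the fastest way to spot a timing or ordering difference.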
Before investing time in root-cause analysis, search JIRA for the test name to see whether the flakiness has already been reported.
The CircleCI test results page for a failing build shows the full test output, including JVM stdout. Compare the CI log output line by line against a local failure to identify environment differences.
Running Tests With a CI-Equivalent Environment
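# Run the test inside a container that matches the CI JDK.
# Replace <ci-image> with the image named in .circleci/config.yml. The image must
# include Ant; the stock openjdk:17 image ships only the JDK.
docker run --rm \
  -v "$(pwd)":/cassandra \
  -w /cassandra \
  <ci-image> \
  ant test -Dtest.name=ClassName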
This eliminates local JDK version and OS differences as variables.
Use the same JDK major version as the failing CI job — check the CircleCI config at .circleci/config.yml for the exact image.
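To compare the local JDK with the one used by CI, it helps to reduce the version string to its major number. This is a minimal sketch: `jdk_major` is a hypothetical helper, and it assumes the usual version string forms (legacy `1.8.0_392` and modern `17.0.9`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce a Java version string to its major version.
# Usage: jdk_major 17.0.9
jdk_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;   # legacy scheme: 1.8.0_392 becomes 8.0_392
    esac
    # Drop everything from the first non-digit character onward
    echo "${v%%[!0-9]*}"
}

# Example (not executed here): java -version prints to stderr, with the
# version in double quotes.
#   jdk_major "$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p')"
```

If the number differs from the JDK major version of the failing CI job, switch JDKs before spending time on the failure itself.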
Logging During Development
Cassandra uses Logback for logging. To increase verbosity for a specific class during test runs, edit the test logback configuration:
<!-- In test/conf/logback-test.xml -->
<logger name="org.apache.cassandra.service.StorageProxy" level="DEBUG"/>
To print all log output to stdout during a test run:
ant test -Dtest.name=ClassName -Dtest.stdout=true
-Dtest.stdout=true produces a large volume of output for any test that touches the storage engine.
Use it with targeted logger configuration rather than at root DEBUG level.
For persistent changes across test runs during active development, set the root level in test/conf/logback-test.xml.
Revert the change before submitting a patch; a modified logback config in a patch commonly draws review comments.