Profiling and Performance Validation
Note: Preview page, unofficial, for review only.
Many Cassandra changes are correctness-neutral but performance-sensitive. A test that passes is not proof that a read path change did not regress latency by 5%. This page covers the profiling and performance validation workflow for contributors working on hot-path code changes. Before you open a patch that touches a latency-critical code path, understand what data reviewers will expect to see.
This preview page reflects the contributor workflow in the current branch. If the branch you are working on changes profiling tooling, confirm the commands against that branch’s README or build scripts.
Getting the Tools
async-profiler
Clone async-profiler and build it from the repository root:
git clone https://github.com/async-profiler/async-profiler.git
cd async-profiler
make
The build produces the build/bin/asprof launcher. In the Cassandra examples on this page, profiler.sh is the wrapper that invokes this launcher.
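If your checkout produced only the asprof launcher, a thin wrapper along the following lines keeps the profiler.sh examples on this page working. This is a sketch, and the ASPROF_HOME location is an assumption:
#!/bin/sh
# Hypothetical profiler.sh: forward all arguments to the asprof launcher.
# ASPROF_HOME is an assumed variable; point it at your async-profiler checkout.
ASPROF_HOME="${ASPROF_HOME:-$HOME/async-profiler}"
exec "$ASPROF_HOME/build/bin/asprof" "$@"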
cassandra-harry
Clone cassandra-harry and follow its README to build the runner jar used by the examples on this page. Once built, the usage looks like:
java -jar harry-runner.jar run --conf conf/default.yaml
The key requirement is that you have a runnable harry-runner.jar and a config file that matches the cluster you are benchmarking.
When Performance Validation Is Required
Performance validation is expected (not optional) when a change:
- Touches the read or write hot path (coordinator, replica, storage engine)
- Changes compaction logic or compaction scheduling
- Changes memtable behavior or flush mechanics
- Modifies the commitlog write path
- Adds or changes background thread behavior that competes with foreground I/O
Performance validation is recommended (not required) when a change:
- Refactors frequently-called code on the hot path
- Changes serialization or deserialization in the storage or protocol layer
- Touches GC-sensitive allocation patterns
Note: Reviewers may ask for profiling data even when a change does not fall into the required category. If your change is in a latency-critical path, provide data rather than assertions.
CPU Profiling With async-profiler
async-profiler is the recommended CPU profiler for Cassandra performance work.
On Linux it samples via perf_events (on macOS it falls back to timer-based sampling, since perf_events is unavailable there), and it is free of JVM safepoint bias, meaning it captures the true CPU profile rather than only what the JVM reports at safepoint-safe moments.
Basic Profiling of a Running Node
Attach to a running Cassandra process for 30 seconds and generate an interactive flamegraph:
# Attach to the Cassandra JVM for 30 seconds and generate a flamegraph
./profiler.sh -d 30 -f flamegraph.html $(jps -l | grep CassandraDaemon | awk '{print $1}')
This command usually runs for the requested duration plus a few seconds of profiler startup and file-write time.
Success means flamegraph.html is created and opens in a browser; if the file is missing, the profiler did not attach cleanly.
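If you script this step, it helps to resolve the PID once and fail fast when the daemon is not running. A minimal sketch under that assumption:
# Resolve the Cassandra PID once; abort if no CassandraDaemon is running
CASSANDRA_PID=$(jps -l | grep CassandraDaemon | awk '{print $1}')
[ -n "$CASSANDRA_PID" ] || { echo "CassandraDaemon not found" >&2; exit 1; }
./profiler.sh -d 30 -f flamegraph.html "$CASSANDRA_PID"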
Profiling During a Specific Benchmark
Use start/stop mode to bracket a controlled workload window:
# Start profiling before workload
./profiler.sh start $(jps -l | grep CassandraDaemon | awk '{print $1}')
# Run your workload here...
# Stop and dump flamegraph
./profiler.sh stop -f flamegraph.html $(jps -l | grep CassandraDaemon | awk '{print $1}')
On success, the stop command emits the output path and leaves a readable HTML flamegraph on disk.
Open flamegraph.html in a browser and compare frame widths against a baseline (unmodified) profile.
Unexpectedly wide frames in your modified build indicate new CPU cost that should be explained or eliminated.
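Put together, a bracketed run might look like the following sketch, assuming tlp-stress (introduced below) is on your PATH and drives the workload:
# Bracket a controlled workload window with start/stop profiling
PID=$(jps -l | grep CassandraDaemon | awk '{print $1}')
./profiler.sh start "$PID"
tlp-stress run KeyValue --duration 60s --readrate 0.9 -n 1000000
./profiler.sh stop -f modified-flamegraph.html "$PID"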
Note: Always generate both a baseline profile and a modified profile under identical workload conditions. A flamegraph in isolation has limited diagnostic value without a baseline for comparison.
Allocation Profiling
Allocation regressions matter in GC-sensitive paths including the read path, compaction, and query execution. Excessive allocation increases GC pause frequency and can produce p999 latency spikes that a mean-latency benchmark will not reveal.
Use async-profiler in allocation mode:
./profiler.sh -e alloc -d 30 -f alloc-flamegraph.html $(jps -l | grep CassandraDaemon | awk '{print $1}')
Compare baseline vs. modified allocation profiles. Look for new allocation sites in hot frames — even small per-operation allocations accumulate under high throughput.
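One way to make that comparison concrete is to dump collapsed stacks instead of HTML and diff them with Brendan Gregg's FlameGraph scripts. A sketch, assuming a FlameGraph checkout is available and its scripts are on PATH:
# Capture collapsed stacks for baseline and modified builds
./profiler.sh -e alloc -d 30 -o collapsed -f alloc-baseline.collapsed "$PID"
# ...swap in the modified build, rerun the identical workload, then:
./profiler.sh -e alloc -d 30 -o collapsed -f alloc-modified.collapsed "$PID"
# Render a differential flamegraph from the two collapsed-stack files
difffolded.pl alloc-baseline.collapsed alloc-modified.collapsed | flamegraph.pl > alloc-diff.svg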
Note: Allocation profiling requires additional JVM flags on the profiled process; check the async-profiler README for the current list and add them to your Cassandra JVM options before starting the node.
Running Workloads for Performance Validation
cassandra-harry (Correctness + Performance)
cassandra-harry is a fuzz testing tool that also works for performance baseline comparisons. Run it against a local cluster before and after your change to establish correctness and capture throughput data.
java -jar harry-runner.jar run --conf conf/default.yaml
Note: Use the same Harry configuration file for the baseline run and the modified run so the two results are directly comparable.
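A simple way to honor that is to tee each run's output to a log named for the build. A sketch; the log names are illustrative:
# Capture baseline and modified harry output side by side
java -jar harry-runner.jar run --conf conf/default.yaml | tee harry-baseline.log
# ...apply your change, rebuild, restart the cluster, then:
java -jar harry-runner.jar run --conf conf/default.yaml | tee harry-modified.log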
tlp-stress (Load Generation)
tlp-stress is a flexible workload generator for Cassandra with built-in histogram reporting. Use it to drive controlled read-heavy, write-heavy, or mixed workloads while profiling.
# Run a 60-second read-heavy workload
tlp-stress run KeyValue --duration 60s --readrate 0.9 -n 1000000
tlp-stress emits per-operation latency histograms at the end of a run. Capture this output for both baseline and modified runs and include it in your JIRA attachment.
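For a write-heavy counterpart, flip the read rate and capture the run output to a file. A sketch reusing the flags shown above; the output file name is illustrative:
# Write-heavy run (10% reads), output captured for later comparison
tlp-stress run KeyValue --duration 60s --readrate 0.1 -n 1000000 | tee tlp-write-baseline.txt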
nodetool tpstats (Quick Health Check)
Use nodetool commands to inspect node behavior during workload execution:
# Check thread pool stats during workload
nodetool tpstats
# Check compaction activity
nodetool compactionstats
# Check latency histograms
nodetool tablehistograms <keyspace> <table>
Note: Do not run these commands in a tight loop while the benchmark is executing; polling nodetool is not free, and aggressive polling can perturb the workload you are measuring.
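If you want periodic snapshots rather than one-off checks, a coarse polling loop like this sketch keeps the overhead low (the interval and log name are arbitrary):
# Snapshot thread pool and compaction state every 30s during the run
while true; do
  date
  nodetool tpstats
  nodetool compactionstats
  sleep 30
done >> node-health.log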
Latency Regression Detection
Use nodetool tablehistograms before and after your change under the same workload.
Compare the following percentiles:
- p50 — median latency (baseline health check)
- p95 — moderate tail latency
- p99 — significant tail latency (common SLA threshold)
- p999 — extreme tail latency (often the most important for database workloads)
Note: A regression in p999 is often more operationally significant than a regression in p50. A change that improves mean latency while worsening tail latency is generally not acceptable.
If using tlp-stress, compare the histogram output from baseline and modified runs directly.
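For the nodetool route, a minimal capture-and-compare flow might look like this sketch; the keyspace and table names are placeholders:
# Capture histograms under the same workload for baseline and modified builds
nodetool tablehistograms my_keyspace my_table > hist-baseline.txt
# ...swap in the modified build, rerun the identical workload, then:
nodetool tablehistograms my_keyspace my_table > hist-modified.txt
diff hist-baseline.txt hist-modified.txt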
When sharing performance data in a patch or JIRA comment, include:
- Workload type (read-heavy, write-heavy, mixed; specify ratios)
- Node count and hardware configuration
- Before/after p50, p99, and p999 latency
- Before/after throughput (ops/sec)
- Flamegraph link or inline image if significant CPU changes were observed
Reporting Performance Results in JIRA
Attach before/after profiling data directly to the JIRA ticket. Include the workload description and cluster configuration so results are reproducible.
Note: If the change has no measurable performance impact, state that explicitly with the supporting data. "No regression observed" is a valid and valuable result — reviewers need to see the evidence, not just the conclusion.
Reviewers may request profiling data even when you believe the change is neutral. Provide the data rather than assertions — this is a faster path to merge than a review discussion about theoretical impact.
See Testing for the full test selection matrix and required test gates before a patch is considered ready for review.
Quick Reference
| Goal | Tool | Command |
|---|---|---|
| CPU flamegraph | async-profiler | ./profiler.sh -d 30 -f flamegraph.html <pid> |
| Allocation flamegraph | async-profiler | ./profiler.sh -e alloc -d 30 -f alloc-flamegraph.html <pid> |
| Load generation | tlp-stress | tlp-stress run KeyValue --duration 60s --readrate 0.9 -n 1000000 |
| Fuzz + correctness baseline | cassandra-harry | java -jar harry-runner.jar run --conf conf/default.yaml |
| Latency histograms | nodetool | nodetool tablehistograms <keyspace> <table> |
| Thread pool health | nodetool | nodetool tpstats |
| Compaction activity | nodetool | nodetool compactionstats |