Operational Empathy
Cassandra is a production database that operators run in live clusters, often under strict availability requirements and narrow maintenance windows. A contributor who changes the storage engine, repair behavior, or startup logic is also changing what operators experience during upgrades, maintenance windows, and incidents. Operational empathy means thinking through those consequences before submitting a patch — not as an afterthought, but as part of the design.
Why Operational Impact Matters For Code Contributors
The operator lane — node management, repair, compaction, configuration — is closely coupled to the code. A change that adds a background thread, changes compaction strategy behavior, or alters startup sequencing has direct operational consequences, even if the change looks small in the diff.
Reviewers will ask about operational impact for any patch that touches:
- Startup and shutdown sequences
- Repair and streaming
- Compaction and space amplification
- Configuration parameters or defaults
- Nodetool output or subcommand behavior
- Log verbosity on the hot path
- Failure handling and JVM stability
If you have thought through these questions before review, you will save time and build reviewer trust. If you have not, expect the review to surface them.
Does This Change Startup or Shutdown Behavior?
Startup and shutdown are high-stakes sequences. Operators rarely watch them in normal operation, but when something goes wrong they become the entire focus of an incident. Changes that look safe in unit tests can cause long delays, silent partial failures, or unclean exits in production.
Before submitting, answer:
- Does this change add, remove, or reorder initialization steps?
- Does this change affect how long startup takes — new I/O, schema loading, index building?
- Does this change what happens if startup fails partway through — what state is the node left in?
- Does this change affect graceful drain or shutdown?
Checklist:
- Startup sequence changes are documented in the commit message
- Startup time impact is estimated with a before/after measurement on the same machine, and the delta is recorded in the JIRA ticket (a minimal timing sketch follows this list)
- If startup gets slower by more than 100 ms, the change calls that out explicitly
- Partial-startup failure behavior is tested or documented
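One lightweight way to capture that before/after delta is to time the specific initialization step the patch touches, on both the baseline and patched builds. A minimal sketch, assuming a hypothetical loadSchema() step; the method name and workload are invented for illustration, not actual Cassandra APIs:

```java
import java.util.concurrent.TimeUnit;

public final class StartupTiming
{
    // Hypothetical initialization step; substitute the phase your patch touches.
    static void loadSchema() throws InterruptedException
    {
        TimeUnit.MILLISECONDS.sleep(50); // stand-in for real work
    }

    public static void main(String[] args) throws InterruptedException
    {
        long start = System.nanoTime();
        loadSchema();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // Record this number for both the baseline and the patched build,
        // on the same machine, and attach the delta to the JIRA ticket.
        System.out.println("loadSchema took " + elapsedMs + " ms");
    }
}
```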
Does This Change Repair, Streaming, or Compaction?
These operations run in the background and compete with foreground read and write I/O. Operators schedule repair and compaction deliberately around traffic patterns. Changes here affect what happens during maintenance windows — which is exactly when operators have the least margin for surprises.
Before submitting, answer:
- Does this change when compaction is triggered or how much I/O it uses?
- Does this change repair streaming throughput or the ordering of streamed ranges?
- Does this change tombstone handling or space reclamation timing?
- Does this change the behavior of any compaction strategy in ways that would affect space amplification?
Checklist:
- Compaction throughput or space amplification impact is understood
- Streaming changes are tested under load, not just correctness tests
- Repair changes are tested with tombstones and expiring data
Compaction strategy changes are particularly sensitive because operators tune strategies explicitly for their workloads. A change to default trigger thresholds or I/O rate limiting can break assumptions that are baked into their runbooks.
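The general shape of background I/O throttling is worth keeping in mind when reasoning about these changes: background work acquires permits before doing I/O so it cannot starve foreground requests. A minimal sketch using Guava's RateLimiter; this illustrates the technique only, not how Cassandra actually implements its compaction or streaming throughput limits:

```java
import com.google.common.util.concurrent.RateLimiter;

public final class ThrottledBackgroundWork
{
    // Hypothetical cap: 16 MiB/s of background I/O, expressed in bytes per second.
    private static final RateLimiter limiter = RateLimiter.create(16.0 * 1024 * 1024);

    static void copyChunk(byte[] chunk)
    {
        // Block until the limiter grants enough permits for this chunk,
        // so background work yields headroom to foreground reads and writes.
        limiter.acquire(chunk.length);
        // ... perform the actual I/O here ...
    }
}
```

Changing the rate, or the granularity at which permits are acquired, shifts how much foreground latency operators see during maintenance windows.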
Does This Change Nodetool Output or Generated Reference Docs?
Operators rely on nodetool for cluster health monitoring and routine maintenance tasks. nodetool status, nodetool repair, nodetool compactionstats, and similar commands are called in scripts, monitoring systems, and runbooks. Changes to their output format or behavior are breaking changes from the operator’s perspective, even when the underlying code change is correct.
Before submitting, answer:
- Does this add, remove, or rename a nodetool command or subcommand?
- Does this change the output format of an existing command — column order, field names, units?
- Does this change what is logged by a nodetool operation?
Checklist:
- New nodetool commands have generated docs (see Generated Documentation)
- Removed or renamed commands are deprecated first, not deleted outright
- Output format changes are noted in NEWS.txt
Removing or renaming a nodetool command without a deprecation cycle will break operator scripts silently. If the change is necessary, deprecate in the current release and remove in the next.
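A common way to honor that deprecation cycle is to keep the old entry point as a thin alias that warns and delegates to the new implementation. A minimal sketch with invented command names; Cassandra's actual nodetool commands are declared through a different mechanism, so treat this as the shape of the idea rather than real code:

```java
public final class ExampleCommands
{
    // New name: the real implementation lives here.
    static void exampleInfo()
    {
        System.out.println("example output");
    }

    // Old name: kept for one release so operator scripts keep working.
    @Deprecated
    static void exampleStats()
    {
        System.err.println("WARN: 'examplestats' is deprecated; use 'exampleinfo'");
        exampleInfo();
    }
}
```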
Does This Affect Upgrade Safety?
Rolling upgrades are the normal Cassandra upgrade path. Operators bring nodes up and down one at a time while the cluster continues to serve traffic. This means mixed-version clusters are a real production state, not a theoretical edge case.
For the full compatibility rules, see Compatibility Checklist.
Summary checklist for operational upgrade impact:
- A rolling upgrade through this change leaves the cluster operational — no split-brain, no data loss, no hard failures on mixed versions
- Operators do not need to take manual action during the upgrade window
- If manual action is required, it is documented in upgrade docs and NEWS.txt
If your change requires all nodes to be on the new version before a feature activates, that is usually the right design. Gating on cluster version agreement avoids mixed-version hazards. Confirm the mechanism you are using for that gating — coordinator version checks, schema versioning, or gossip state — is consistent with how similar features are handled in the codebase.
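The gating check itself is simple in shape: the feature stays disabled until every live node reports at least the required version. A minimal sketch, assuming a hypothetical peerMessagingVersions() source; the real mechanism (gossip state, schema versioning) and the real APIs in the codebase differ:

```java
import java.util.List;

public final class FeatureGate
{
    // Hypothetical: versions reported by live peers, e.g. learned via gossip.
    static List<Integer> peerMessagingVersions()
    {
        return List.of(12, 12, 13);
    }

    /** The new behavior activates only once every node is at or above minVersion. */
    static boolean featureEnabled(int minVersion)
    {
        return peerMessagingVersions().stream().allMatch(v -> v >= minVersion);
    }

    public static void main(String[] args)
    {
        System.out.println(featureEnabled(13)); // false until the lagging nodes upgrade
    }
}
```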
Does This Require Operator Documentation?
Not every code change requires docs, but if operators need to know about it, they need to find out from the documentation — not from a surprise at 2am.
A change requires operator documentation if it:
- Adds or changes a configuration parameter
- Changes repair, compaction, or maintenance behavior in a way operators need to know about
- Changes default behavior that operators may have tuned
- Introduces a new operational workflow or monitoring surface
Checklist:
- New config parameters are documented in operator docs
- Behavior changes visible to operators are in NEWS.txt
- Operator runbooks or docs pages are updated, or a follow-up ticket is filed and linked from the patch
NEWS.txt is the primary channel for operator-facing change communication. If the change would cause operators of a deployed cluster to change their configuration, their scripts, or their monitoring, it belongs in NEWS.txt.
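For a new configuration parameter, the paper trail starts at the declaration: a safe default plus a comment recording what operators need to know. A minimal sketch with an invented parameter; the field name is not a real Cassandra setting and this only loosely imitates how the codebase declares configuration:

```java
public class Config
{
    /**
     * Invented example, not a real Cassandra setting.
     * Maximum number of cleanup tasks run in parallel.
     * Default of 1 preserves existing behavior, so upgrades are unaffected;
     * raising it trades foreground I/O headroom for faster cleanup.
     * Must be documented in the operator docs and noted in NEWS.txt.
     */
    public int example_max_parallel_cleanup_tasks = 1;
}
```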
Does This Create Client-Visible Behavior Changes?
Clients use CQL and the native protocol. Operator-facing and client-facing changes often overlap — a change to consistency level semantics or error handling may look like an internal correctness fix but show up as changed behavior in driver telemetry or application logs.
Before submitting, answer:
- Does this change query latency or throughput in ways a client would observe?
- Does this change the error types or error messages returned to clients?
- Does this change consistency level semantics or availability behavior under failure?
Checklist:
- Client-visible error or behavior changes are in NEWS.txt
- Driver compatibility implications are considered (see Compatibility Checklist)
The Quick Operational Review Checklist
Before submitting your patch, check:
Startup and shutdown
- Startup sequence changes are documented in the commit message
- Startup time impact is estimated with a before/after measurement on the same machine, and the delta is recorded in the JIRA ticket
- If startup gets slower by more than 100 ms, the change calls that out explicitly
- Partial-startup failure behavior is tested or documented
Repair, streaming, and compaction
- Compaction throughput or space amplification impact is understood
- Streaming changes are tested under load, not just correctness tests
- Repair changes are tested with tombstones and expiring data
Nodetool and generated docs
- New nodetool commands have generated docs (see Generated Documentation)
- Removed or renamed commands are deprecated first
- Output format changes are noted in NEWS.txt
Upgrade safety
- A rolling upgrade through this change leaves the cluster operational
- Operators do not need to take manual action during the upgrade window
- If manual action is required, it is documented in upgrade docs and NEWS.txt
Operator documentation
- New config parameters are documented in operator docs
- Behavior changes visible to operators are in NEWS.txt
- Operator runbooks or docs pages are updated or filed as follow-up tickets
Client-visible behavior
- Client-visible error or behavior changes are in NEWS.txt
- Driver compatibility implications are considered (see Compatibility Checklist)