Checksums and Integrity

SSTables carry integrity metadata at two levels: per-chunk checksums over Data.db blocks (inline when compressed, in CRC.db when not), and the single per-SSTable Digest.crc32 covering the whole Data.db. Readers validate the chunk CRC on the read path; the digest is a whole-file check used by tools, streaming and verification.

Correction notice: An earlier version of this chapter contained a section titled “Header CRC32 Prefixes” describing a CRC32 prefix that was believed to appear before certain NB-format SSTable headers. That section has been removed. Per HEADER_CRC32_DOCUMENTATION.md and verification against Cassandra 5.0.8 source, NB format Data.db has NO magic number or global header — the file starts directly with compressed chunk data. The bytes previously mistaken for a header CRC32 (e.g., 0x71160000, 0xf1185c00) are the first bytes of compressed chunk data. NB format is identified by filename pattern, not file content. See CompressedSequentialWriter constructor and flushData() for confirmation.

In this chapter you will learn

How per-chunk checksums are stored and validated for compressed Data.db
What Digest.crc32 covers (and what it does not)
How a writer accumulates both checksums incrementally, without re-reading Data.db
How readers/writers interact with integrity metadata
How to run a minimal digest verification

Checksum coverage at a glance (authoritative)

Component	Per-chunk CRCs	Where the chunk CRCs live	Byte order (stored)	CRC scope	Covered by `Digest.crc32`
Data.db, uncompressed	yes	separate `CRC.db`	big-endian u32	one 64 KiB block of raw data	yes (digest = CRC32 over raw data)
Data.db, compressed	yes	inline, after each chunk in `Data.db`	big-endian u32	compressed chunk bytes only	yes (digest spans data and the inline CRCs)
Index.db	no	n/a	n/a	n/a	no
Summary.db	no	n/a	n/a	n/a	no
Filter.db	no	n/a	n/a	n/a	no
Statistics.db	no	n/a	n/a	n/a	no; it is self-checksummed via its own TOC/component CRC32s (Ch. 8)
CompressionInfo.db	no (holds chunk offsets, not CRCs)	n/a	n/a	n/a	no

Notes:

Data.db has no magic number or global header in either form; a compressed file starts directly with the first compressed chunk.
Digest.crc32 covers Data.db and nothing else. There is exactly one Digest.crc32 per SSTable, and only the data writer produces it: DataComponent.buildWriter passes Components.DIGEST to whichever data writer it builds (DataComponent.java:43-60), the two writeFullChecksum call sites are that data writer’s close path (ChecksummedSequentialWriter.java:70, CompressedSequentialWriter.java:393), and SSTableReader.getDigestValidator validates Components.DATA against Components.DIGEST (SSTableReader.java:535-536). Index.db is written by a plain SequentialWriter with no checksum writer at all (BigTableWriter.java:253), so Index.db, Summary.db and Filter.db carry no checksum of any kind. Verified on the real fixture test_basic/uncompressed_table.../nb-1-big-*: Digest.crc32 reads 4192247423, which is zlib.crc32(Data.db) exactly, and matches none of the other components’ CRC32s.
Full matrix with details appears later in this chapter.

Correction notice: earlier revisions of this chapter said each component “generates its own independent Digest.crc32” and listed every component as “Verified by Digest.crc32”. That is wrong in both directions: there is a single Digest.crc32 file per SSTable, and it covers Data.db only. The corrected statement — one digest, data-only — is what the citations and the fixture above support. The original claim that no digest enumerates all TOC components remains true.

Compressed Data.db: Trailing Chunk CRCs

For a compressed Data.db, each chunk’s CRC is placed after (trailing) the chunk it covers — never before it, and never in CompressionInfo.db.

CRC Placement

[chunk_bytes: variable length] <- Compressed data
[crc32: 4 bytes, big-endian]   <- CRC32(chunk_bytes)
[next_chunk_bytes: variable]
[crc32: 4 bytes, big-endian]
...

Source: CompressedSequentialWriter.java:187-192 — channel.write(toWrite) then crcMetadata.appendDirect(toWrite, true). Each chunk offset advances by compressedLength + 4 (for the trailing CRC) (CompressedSequentialWriter.java:203).
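The placement above can be sketched in a few lines of Python. This is an illustration of the byte layout only, not Cassandra's writer: append_chunk is a hypothetical helper, and the payloads are placeholders standing in for already-compressed chunk bytes.

```python
import struct
import zlib


def append_chunk(out: bytearray, chunk: bytes) -> int:
    """Append one chunk plus its trailing CRC; return the chunk's start offset."""
    offset = len(out)
    out += chunk                                 # chunk bytes first
    out += struct.pack(">I", zlib.crc32(chunk))  # then CRC32(chunk) as big-endian u32
    return offset


data_db = bytearray()
chunks = [b"first-compressed-chunk", b"second"]  # placeholder payloads (assumption)
offsets = [append_chunk(data_db, c) for c in chunks]
```

Each recorded offset is the previous offset plus the previous chunk length plus 4, which matches the compressedLength + 4 advance cited above.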

Validation Process

Decide whether to check at all: shouldCheckCrc() is true when crc_check_chance >= 1.0, else probabilistically (CompressedChunkReader.java:63-67)
Read chunk bytes from Data.db (length from CompressionInfo.db; when checking, read chunk.length + 4 so the trailing CRC comes along)
Read those 4 trailing bytes as a big-endian u32 (expected CRC)
Compute CRC32 over the chunk bytes only, using java.util.zip.CRC32 (see CRC Algorithm Details)
On match: decompress and continue
On mismatch: CorruptBlockException (CompressedChunkReader.java:134-149)

Explicit note: CRC32 is computed over the compressed chunk only and excludes the trailing 4-byte CRC itself (ChecksumWriter.java:68-69).
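The validation steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Cassandra's reader: read_chunk, should_check_crc and CorruptBlockError are hypothetical names, the whole file is assumed to be in memory, and the chunk offset and length are assumed to have been derived from CompressionInfo.db already.

```python
import random
import struct
import zlib


class CorruptBlockError(Exception):
    """Stand-in for Cassandra's CorruptBlockException (hypothetical name)."""


def should_check_crc(check_chance: float) -> bool:
    # Same shape as the cited condition: always at >= 1.0, never at 0, else sampled.
    return check_chance >= 1.0 or (check_chance > 0.0 and check_chance > random.random())


def read_chunk(data_db: bytes, offset: int, length: int, check_chance: float = 1.0) -> bytes:
    """Return the compressed chunk at `offset`, validating its trailing CRC when sampled."""
    chunk = data_db[offset:offset + length]
    if should_check_crc(check_chance):
        # The 4 bytes right after the chunk hold the expected CRC, big-endian.
        (expected,) = struct.unpack_from(">I", data_db, offset + length)
        actual = zlib.crc32(chunk)  # covers the chunk only, not the 4 CRC bytes
        if actual != expected:
            raise CorruptBlockError(
                f"chunk at {offset}: expected {expected:#010x}, got {actual:#010x}"
            )
    return chunk  # the caller decompresses next
```

With check_chance set to 0.0 the function returns the bytes unchecked, which is the trade-off the crc_check_chance setting exposes.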

Minimal illustration (excerpt from a real Data.db, first 32 bytes):

00000000: fe1e 0000 f209 0010 6b88 bf20 a251 11f0
00000010: a3fe f1a5 5138 3fb9 7fff ffff 8000 0100

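These 32 bytes are the start of the first compressed chunk, not a header, so the chunk's CRC is not visible in the excerpt. It is the big-endian u32 in the 4 bytes immediately after the chunk's last byte, a position found from the chunk offsets in CompressionInfo.db.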

CRC Algorithm Details

Standard: Java java.util.zip.CRC32 (ChecksumType.java)
Polynomial: 0x04C11DB7 (IEEE standard)
Initial value: register preset to 0xFFFFFFFF, with a final XOR of 0xFFFFFFFF; the CRC of empty input is therefore 0
Reflected: Yes (reversed polynomial: 0xEDB88320)
Output: Big-endian u32
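For readers who want to see the parameters in action, here is a bit-at-a-time reference sketch in Python. It is deliberately slow and is not how java.util.zip.CRC32 or crc32fast are implemented; it only shows that the reflected polynomial, with the register preset to 0xFFFFFFFF and the result inverted at the end, reproduces the standard CRC-32 values (which is also why empty input yields 0).

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Reference CRC-32 using the reflected polynomial 0xEDB88320."""
    crc = 0xFFFFFFFF                      # register starts with all bits set
    for byte in data:
        crc ^= byte
        for _ in range(8):                # process one bit at a time, LSB first
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF               # final inversion


check = crc32_bitwise(b"123456789")       # the conventional CRC-32 check input
matches_zlib = check == zlib.crc32(b"123456789")
```

The conventional check value for the ASCII string 123456789 is 0xCBF43926; any implementation used to validate SSTable checksums should reproduce it.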

Cassandra Configuration

crc_check_chance: probability of validating a compressed chunk’s CRC on read (0.0 to 1.0)
Default: 1.0 — at >= 1.0 every chunk read is checked; below that the check is sampled per read (CompressedChunkReader.java:63-67)
Purpose: trade integrity checking for performance
Scope: this setting governs the compressed chunk-CRC read path only; it does not affect Digest.crc32 verification or Statistics.db’s component CRC32s

Implementation Note

The crc32fast Rust crate implements the same algorithm and returns a native u32. Decode the 4 stored bytes as big-endian (for example with u32::from_be_bytes) before comparing.

CRC.db Component (Separate Per-Chunk CRCs for Uncompressed Data)

For an uncompressed Data.db, a separate CRC.db file holds the per-chunk CRC32 values.

Source: ChecksummedSequentialWriter.java:33-40, ChecksumWriter.java:43-53.

CRC.db layout:

[Chunk Size: 4 bytes, signed int]   <- uncompressed buffer capacity (header)
[CRC32 chunk 0: 4 bytes]
[CRC32 chunk 1: 4 bytes]
...

Seek formula — to locate the CRC for a given byte offset in Data.db:

chunk_index   = byte_offset / chunkSize      (integer division, rounding down)
crc_file_pos  = (chunk_index * 4) + 4

where chunkSize is the 4-byte int at the start of CRC.db (DataIntegrityMetadata.java:52-53: reader.seek(((start / chunkSize) * 4L) + 4) where start = chunkStart(offset)).

The header value is the data SequentialWriter’s buffer.capacity() (ChecksummedSequentialWriter.java:38), 64 KiB by default, and each CRC32 covers one block of raw (uncompressed) data.

Note: per-chunk CRC values written to CRC.db are not included in the full checksum for ChecksummedSequentialWriter (checksumIncrementalResult=false, ChecksummedSequentialWriter.java:49). For CompressedSequentialWriter, per-chunk CRC bytes are included in the full checksum (checksumIncrementalResult=true, CompressedSequentialWriter.java:192). A direct consequence: for a single-chunk uncompressed Data.db the one CRC.db entry equals the Digest.crc32 value. Verified on test_basic/uncompressed_table.../nb-1-big-*, whose 8-byte CRC.db is 00010000 f9e09e7f — chunk size 65536 and one CRC 0xf9e09e7f = 4192247423, exactly the Digest.crc32 contents for a 19803-byte Data.db.
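The seek formula can be checked against the 8-byte CRC.db quoted above. The sketch below is illustrative Python, not the Cassandra reader; crc_for_offset is a hypothetical helper, and it assumes both the header and the CRC entries are big-endian, which is consistent with the fixture bytes (00010000 decoding to 65536).

```python
import struct

CRC_DB = bytes.fromhex("00010000f9e09e7f")   # the 8-byte CRC.db from the fixture


def crc_for_offset(crc_db: bytes, byte_offset: int) -> int:
    """Look up the stored CRC32 for the chunk containing `byte_offset` in Data.db."""
    (chunk_size,) = struct.unpack_from(">i", crc_db, 0)   # header: signed int
    chunk_index = byte_offset // chunk_size               # integer division
    crc_file_pos = chunk_index * 4 + 4                    # skip the 4-byte header
    (crc,) = struct.unpack_from(">I", crc_db, crc_file_pos)
    return crc


chunk_size = struct.unpack_from(">i", CRC_DB, 0)[0]
stored = crc_for_offset(CRC_DB, 19802)   # last byte of the 19803-byte Data.db
```

Every offset in a 19803-byte file falls in chunk 0, so the lookup returns the single entry, which is also the decimal value held in Digest.crc32 for this fixture.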

A writer does not need to re-read the finished file to produce either value — see Computing Checksums Incrementally While Writing, which also covers the flush-vs-compaction CRC.db tail difference (issue #1222).

Where per-chunk CRCs live (compressed vs uncompressed)

CompressionInfo.db does not contain per-chunk CRC values. Its body is a writeUTF compressor name, the otherOptions map, chunkLength, maxCompressedLength, the uncompressed dataLength, the chunk count, and then the chunk offsets (CompressionMetadata.java:375-392). The CRCs for a compressed table are stored inline in Data.db, immediately after each compressed chunk (see Compressed Data.db: Trailing Chunk CRCs above). For an uncompressed table they live in the separate CRC.db component. The two are mutually exclusive, which follows directly from DataComponent.buildWriter choosing CompressedSequentialWriter (inline) or ChecksummedSequentialWriter (CRC.db) (DataComponent.java:36-61). That choice is made on metadata.params.compression.isEnabled() alone, so the exclusivity holds for BIG and BTI alike — see the CRC.db precondition under Computing Checksums Incrementally While Writing.

Readers should validate the chunk CRC before decompressing. Whether a given read actually validates is governed by crc_check_chance: the check runs unconditionally only at >= 1.0 and is otherwise sampled per read (CompressedChunkReader.java:63-67: checkChance >= 1d || (checkChance > 0d && checkChance > ThreadLocalRandom...nextDouble())). When the check does run and fails, Cassandra throws CorruptBlockException (CompressedChunkReader.java:144-149). For validation walkthroughs, see Appendix C.

Correction notice: this section previously stated that “CompressionInfo.db may include a CRC for each compressed chunk”. It does not — see the layout citation above.

Digest Files

Digest.crc32 is written by ChecksumWriter.writeFullChecksum() on SSTable finalization (ChecksumWriter.java:91-103). It contains a single line: the CRC32 checksum value as a UTF-8 decimal string, flushed and synced to disk.

There is one Digest.crc32 per SSTable and it covers Data.db only — see the citations under Checksum coverage at a glance. FileDigestValidator reads Long.parseLong(digestReader.readLine()) and validates a single dataFile against a single digestFile (DataIntegrityMetadata), and the dataFile it is given is always Components.DATA (SSTableReader.java:535-536). No digest enumerates all components listed in TOC.txt, and no digest exists for Index.db, Summary.db, Filter.db, Statistics.db or CompressionInfo.db.

Digest.crc32 is complementary to per-chunk CRCs: the digest validates whole-file Data.db contents (a whole-file read), while per-chunk CRCs validate one block during a normal read.

Minimal verification example:

Locate the SSTable’s single Digest.crc32.
Compute CRC32 over the entire Data.db contents.
Compare against the decimal value in Digest.crc32; on mismatch, quarantine and rehydrate via repair/streaming. (Other components have no digest — validate Statistics.db through its own per-component CRC32s instead, Ch. 8.)
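The three steps above can be sketched as a small Python routine. It is a sketch under stated assumptions rather than a replacement for Cassandra's verification tooling: file_crc32 and verify_digest are hypothetical names, the caller supplies both paths, and the digest file is assumed to hold one decimal number, possibly followed by a newline.

```python
import zlib
from pathlib import Path


def file_crc32(path: Path, block_size: int = 64 * 1024) -> int:
    """CRC32 of a whole file, read in blocks so large files are not loaded at once."""
    crc = 0
    with path.open("rb") as f:
        while block := f.read(block_size):
            crc = zlib.crc32(block, crc)   # continue the running checksum
    return crc


def verify_digest(data_path: Path, digest_path: Path) -> bool:
    """True when the decimal value in Digest.crc32 matches CRC32(Data.db)."""
    expected = int(digest_path.read_text(encoding="utf-8").strip())
    return file_crc32(data_path) == expected
```

A False result is the mismatch case described above; what happens next (quarantine, repair, streaming) is an operational decision outside this sketch.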

Computing Checksums Incrementally While Writing (issue #1663)

A writer can produce both Digest.crc32 and CRC.db without ever re-reading the finished Data.db, because both are CRC32s over the same byte stream in write order. Cassandra does exactly this: ChecksumWriter keeps two CRC32 instances — incrementalChecksum for the current chunk and fullChecksum for the whole file — and appendDirect folds each buffer into both, emits the chunk value, then resets only the incremental one (ChecksumWriter.java:34-36, 62-89). writeFullChecksum then serializes the accumulated fullChecksum at close (ChecksumWriter.java:91-103). The checksumIncrementalResult boolean is the only difference between the two paths: it decides whether the 4-byte chunk CRC is itself folded into the full checksum (see Digest.crc32 Coverage).

CQLite mirrors this with a single accumulator, StreamingCrc (cqlite-core/src/storage/sstable/writer/crc_writer.rs:121-185):

Field	Role
`whole`	whole-file hasher — becomes the `Digest.crc32` value
`chunk`	hasher for the chunk currently being filled
`chunk_filled`	bytes accumulated toward the current `CRC_CHUNK_SIZE` block
`chunk_crcs`	finalized per-chunk CRC32s, in order

CRC_CHUNK_SIZE is 64 * 1024, matching Cassandra’s default SequentialWriter bufferSize (.../crc_writer.rs:62-67).

Where bytes are fed. Exactly once, in write order, immediately after the sink write and before the scratch buffer is cleared: self.crc.update(&self.buffer) in DataWriter::flush_partition (cqlite-core/src/storage/sstable/writer/data_writer/mod.rs:424-442). Because update carries chunk_filled across calls (crc_writer.rs:157-171), chunk boundaries are a fixed grid over the raw Data.db bytes and straddle partition boundaries — they are not per-partition. finish_streaming closes any trailing short chunk and returns both results in a StreamFinish { data_size, digest_crc32, chunk_crcs } (.../data_writer/mod.rs:324-338, 507-537).

Where they are consumed. SSTableWriter::finish writes Digest.crc32 from stream.digest_crc32 (cqlite-core/src/storage/sstable/writer/finish.rs:177-190) and assembles CRC.db from stream.chunk_crcs via assemble_crc_bytes (.../finish.rs:192-224). No pass over the finished file is involved.
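To make the accumulator concrete, here is a Python sketch of the same idea: one whole-file CRC and one per-chunk CRC fed from the same buffers, with the fill count carried across calls so that chunk boundaries form a fixed grid. It is not CQLite's Rust code; StreamingCrcSketch and its method names are hypothetical, and only the field roles are taken from the table above.

```python
import zlib

CHUNK_SIZE = 64 * 1024   # the block size the text gives for CRC_CHUNK_SIZE


class StreamingCrcSketch:
    """Whole-file CRC plus fixed-grid per-chunk CRCs, fed in write order."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.whole = 0          # running CRC over every byte seen
        self.chunk = 0          # running CRC over the chunk being filled
        self.chunk_filled = 0   # bytes accumulated toward the current chunk
        self.chunk_crcs = []    # finalized per-chunk CRCs, in order

    def update(self, buf):
        self.whole = zlib.crc32(buf, self.whole)
        view = memoryview(buf)
        while len(view):
            # Take only what fits in the current chunk; the rest starts the next one.
            take = min(len(view), self.chunk_size - self.chunk_filled)
            self.chunk = zlib.crc32(view[:take], self.chunk)
            self.chunk_filled += take
            view = view[take:]
            if self.chunk_filled == self.chunk_size:
                self._close_chunk()

    def _close_chunk(self):
        self.chunk_crcs.append(self.chunk)
        self.chunk, self.chunk_filled = 0, 0

    def finish(self):
        if self.chunk_filled:   # close a trailing short chunk, if any
            self._close_chunk()
        return self.whole, self.chunk_crcs
```

Feeding the same bytes in differently sized pieces yields the same results, which is the property that lets a writer call update once per flushed buffer without caring where partitions end.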

Preconditions and exceptions worth stating explicitly:

Format truth: CRC.db exists ⟺ the data writer is uncompressed — for any format, BTI included. It is a compression gate, not a BIG-vs-BTI gate. Two independent citations at cassandra-5.0.8:
- SSTableWriter.Builder.addDefaultComponents branches on params.compression.isEnabled() alone: the compressed arm adds Components.COMPRESSION_INFO, the else arm adds Components.CRC. There is no format test anywhere in the branch, and the method is inherited by the BTI writer builder (SSTableWriter.java:483-497).
- DataComponent.buildWriter is likewise format-agnostic — it takes a Descriptor plus TableMetadata and picks CompressedSequentialWriter (inline chunk CRCs, CompressionInfo.db) or ChecksummedSequentialWriter (CRC.db) purely on metadata.params.compression.isEnabled() (DataComponent.java:36-61).
Confirming this from the component registry: BtiFormat.Components.ALL_COMPONENTS explicitly lists CRC alongside COMPRESSION_INFO (BtiFormat.java:100-108). So a Cassandra-written uncompressed BTI (da) SSTable does carry a CRC.db, and a compressed BIG (nb) SSTable does not. The correct statement of the exclusivity is compressed-inline-CRCs XOR CRC.db, orthogonal to BIG/BTI.
CQLITE IMPLEMENTATION DETAIL (not format authority): CQLite’s writer gates CRC.db on is_bti. finish.rs:205-207 skips CRC.db for any BTI output (let crc_path = if is_bti { None } else { … }), and the module doc asserts the same (cqlite-core/src/storage/sstable/writer/crc_writer.rs:1-8). Because CQLite’s production write surface is uncompressed-only (issue #1406), that gate is narrower than Cassandra’s rule: an uncompressed da written by CQLite omits a CRC.db that Cassandra would have emitted. This is a documented CQLite behavior, not a format property.

The supporting observation is confounded evidence. The note that “BTI (da) fixtures written by Cassandra carry no CRC.db” is true of the fixtures but proves nothing about the gate: every test-data/datasets/sstables/test_da/* fixture ships a CompressionInfo.db, i.e. all four are compressed. A compressed SSTable has no CRC.db under either hypothesis, so those fixtures cannot distinguish a BTI gate from a compression gate. Distinguishing them requires an uncompressed Cassandra-written da fixture (WITH compression = {'enabled': false}), which the corpus does not currently contain. Per the citations above, Cassandra’s rule is the compression gate.
Empty Data.db is the one recomputation. When no partition was streamed, there are no accumulated bytes, so finish computes the digest over the just-written empty file (finish.rs:185-189); CRC32 of zero bytes is 0.
The re-read implementation still exists, as a test oracle only. build_crc_bytes streams the finished file in CRC_CHUNK_SIZE blocks and routes through the same assemble_crc_bytes, so the incremental and re-read paths are provably byte-identical. It is #[cfg(test)] (crc_writer.rs:54-60, 187-235) and each call bumps the data_db_checksum_full_reads work counter, so a regression that reintroduces a production full-file re-read is caught by a test rather than by a benchmark.
Flush and compaction tails differ by design (issue #1222). Cassandra’s compaction path flushes the data writer once more at close over a zero-length buffer, so a compacted CRC.db ends with one extra 00000000 group; the flush path does not. assemble_crc_bytes takes an explicit CrcTrailer (None for flush, EmptyFinalChunk for compaction, crc_writer.rs:69-119) selected from is_compaction_output (finish.rs:209-213), and the trailer is suppressed for an empty Data.db (no Cassandra golden exists for that case).
Issue #1197 (writer omitted CRC.db for uncompressed BIG) is fixed: CRC.db is emitted and listed in TOC.txt for the uncompressed BIG write path.
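The flush-versus-compaction tail difference can be illustrated with a short Python sketch. It is not CQLite's assemble_crc_bytes: assemble_crc_db and its compaction_output flag are hypothetical, big-endian encoding is assumed from the fixture bytes, and emitting a header-only file for an empty Data.db is an assumption of this sketch, since the text only states that the trailer is suppressed in that case.

```python
import struct


def assemble_crc_db(chunk_size, chunk_crcs, compaction_output):
    """Build CRC.db bytes: 4-byte chunk-size header, then one u32 per chunk."""
    out = struct.pack(">i", chunk_size)
    for crc in chunk_crcs:
        out += struct.pack(">I", crc)
    # Compaction output ends with one extra all-zero group (the CRC32 of a
    # zero-length buffer); it is suppressed when there are no chunks at all.
    if compaction_output and chunk_crcs:
        out += b"\x00\x00\x00\x00"
    return out


flush_bytes = assemble_crc_db(65536, [0xF9E09E7F], compaction_output=False)
compaction_bytes = assemble_crc_db(65536, [0xF9E09E7F], compaction_output=True)
```

For the single-chunk fixture values, the flush form reproduces the 8-byte CRC.db quoted earlier, and the compaction form is the same bytes followed by 00000000.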

Because CRC.db chunk CRCs are not folded into the full checksum on the uncompressed path, a single-chunk uncompressed Data.db has exactly one CRC.db entry that equals its Digest.crc32 value.

Recovery Strategies (Beyond Detection)

Scope note: focus on SSTable-level recovery patterns; node-level operations are out of scope.

Isolate and quarantine:
- Move suspected-corrupt components out of the live path; keep originals for forensics
- Prevent partial reads by ensuring TOC.txt no longer references quarantined files
Targeted file replacement:
- Replace only failed components from known-good copies (snapshot/backup)
- Validate digests and, if compressed, sample chunk CRCs before activation
Range-based rehydration:
- Trigger repair/streaming for affected token ranges to reconstruct data from replicas
- Prefer re-streaming over attempting to salvage partially corrupt Data.db
Post-recovery hygiene:
- Run verification tools; schedule compaction to remove overlap and rebuild summaries if required
- Monitor error counters; re-scan directories after compaction

Key Takeaways

NB format Data.db has no magic number or global header: a compressed file starts directly with the first compressed chunk. There are no “Header CRC32 Prefixes” in NB format.
Compressed NB-format Data.db uses trailing chunk CRCs: each is placed after its compressed chunk, stored as a big-endian u32, and covers the compressed chunk bytes only.
CRC.db (uncompressed path) holds a 4-byte int header (chunk size) + 4-byte CRC32 per chunk. Seek to chunk N’s CRC with: crc_file_pos = (chunk_index * 4) + 4.
Digest.crc32 is a single per-SSTable file containing one UTF-8 decimal CRC32 value, and it covers Data.db only. Index.db, Summary.db and Filter.db carry no checksum at all; Statistics.db is self-checksummed via its TOC/component CRC32s (Ch. 8).
Both Digest.crc32 and CRC.db can be produced incrementally during the write — one whole-file hasher plus one per-chunk hasher over the same byte stream — so no writer needs to re-read the finished Data.db (issue #1663).
Readers should validate checksums on-the-fly; tools may verify digests offline.
Fail-fast on a detected CRC mismatch; do not attempt heuristic recovery in modern formats. Note the precondition: on the compressed read path a chunk CRC is checked unconditionally only when crc_check_chance >= 1.0, so “fail-fast” describes what happens when a check runs, not a guarantee that every corrupt byte is checked on every read.

References

Cassandra 5.0.8:
- DataIntegrityMetadata: org.apache.cassandra.io.util.DataIntegrityMetadata
- PureJavaCrc32: org.apache.cassandra.utils.PureJavaCrc32
- ChecksumWriter: org.apache.cassandra.io.util.ChecksumWriter (chunk size header L43-48; appendDirect L62-89; writeFullChecksum L91-103)
- ChecksummedSequentialWriter: org.apache.cassandra.io.util.ChecksummedSequentialWriter (constructor L33-40; flushData L43-50 calling appendDirect with false)
- CompressedSequentialWriter: org.apache.cassandra.io.compress.CompressedSequentialWriter (flushData L140-206: inline CRC after compressed chunk; digest on finalize L392-393)
- ChecksumType: org.apache.cassandra.utils.ChecksumType
CQLite implementation:
- CRC32 computation: crc32fast crate (byte-compatible with java.util.zip.CRC32)
- Incremental accumulator (Digest.crc32 + CRC.db during the write, issue #1663): cqlite-core/src/storage/sstable/writer/crc_writer.rs
- Byte feed point: cqlite-core/src/storage/sstable/writer/data_writer/mod.rs:424-442
- Component emission: cqlite-core/src/storage/sstable/writer/finish.rs:177-224

For implementation details and walkthroughs, see Appendix C.

`Digest.crc32` Coverage

Digest.crc32 is written once per SSTable by ChecksumWriter.writeFullChecksum() from the data writer’s close path, and it holds a single UTF-8 decimal line: the CRC32 over the full byte range of Data.db. Both of the writeFullChecksum call sites in 5.0.8 are that data writer (ChecksummedSequentialWriter.java:70, CompressedSequentialWriter.java:393); no other SSTable component is given a digest file.

What “full byte range of Data.db” includes differs by path, and this is the only effect of checksumIncrementalResult:

Compressed (CompressedSequentialWriter, checksumIncrementalResult = true): all compressed data bytes plus each 4-byte inline per-chunk CRC value (ChecksumWriter.java:74-81). Since those CRCs are physically part of Data.db, this is simply “CRC32 over the whole file” — confirmed on test_basic/compression_test_table.../nb-1-big-*, where Digest.crc32 = 2910221354 = zlib.crc32(entire 213472-byte Data.db).
Uncompressed (ChecksummedSequentialWriter, checksumIncrementalResult = false): only the raw data bytes; the CRC.db chunk values are not fed in. Again “CRC32 over the whole file”, because CRC.db is a separate file.
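The effect of checksumIncrementalResult can be demonstrated with a small Python model. This is a sketch of the behavior described above, not a port of ChecksumWriter: write_with_digest and fold_chunk_crc are hypothetical names, and the chunk payloads are placeholders.

```python
import struct
import zlib


def write_with_digest(chunks, fold_chunk_crc):
    """Return (per-chunk CRCs, full checksum) for a sequence of chunk buffers."""
    full = 0
    chunk_crcs = []
    for chunk in chunks:
        crc = zlib.crc32(chunk)
        full = zlib.crc32(chunk, full)                       # data bytes always count
        if fold_chunk_crc:                                   # compressed path only
            full = zlib.crc32(struct.pack(">I", crc), full)  # inline CRC bytes count too
        chunk_crcs.append(crc)
    return chunk_crcs, full


chunks = [b"chunk-one", b"chunk-two"]   # placeholder payloads (assumption)

# Compressed path: Data.db physically contains chunk + CRC pairs.
crcs, digest_compressed = write_with_digest(chunks, fold_chunk_crc=True)
compressed_file = b"".join(c + struct.pack(">I", k) for c, k in zip(chunks, crcs))

# Uncompressed path: Data.db is the raw bytes; the CRCs would go to CRC.db.
_, digest_uncompressed = write_with_digest(chunks, fold_chunk_crc=False)
uncompressed_file = b"".join(chunks)
```

In both cases the accumulated value equals a plain CRC32 over the bytes that end up in Data.db, which is why a verifier never needs to know which path produced the file.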

Minimal verification example:

Compute CRC32 over the entire Data.db.
Compare against the single decimal value in Digest.crc32; on mismatch, quarantine and rehydrate via repair/streaming.
Do not expect a digest for any other component — validate Statistics.db via its own component CRC32s (Ch. 8) and the rest by rebuilding them from Data.db.