SSTable Reference
Preview | Unofficial | For review only
This page consolidates reference material for SSTable contributors: the versioning matrix, type mappings, encoding formats, tool usage, and a glossary. Use it as a lookup companion to the other SSTable Architecture pages — Fundamentals, Data Format, and Write Path.
Versioning Matrix
Each Cassandra release series uses a specific on-disk format version.
The version string is the prefix on every SSTable component filename (for example, nb-1-big-Data.db).
The first letter identifies the broad family, and the second letter increments as that family’s on-disk schema evolves.
| Cassandra Version | SSTable Format | Notable On-Disk Changes |
|---|---|---|
| 3.0–3.x | | Introduced |
| 4.0–4.x | | Compression improvements; expanded metadata in Statistics.db; incremental metadata versioning between |
| 5.0+ | | BTI format added ( |
The format identifier encodes both a version letter sequence and a format type (big for BIG, bti for BTI).
A Cassandra node will refuse to open an SSTable whose format version it does not recognize.
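As a rough illustration of the filename layout described above, the sketch below splits a component filename into its hyphen-separated fields. The function and field names are hypothetical, and reading the second field (`1` in the example) as a generation or SSTable id is an assumption that this page does not state.

```python
from typing import NamedTuple


class SSTableComponent(NamedTuple):
    version: str      # format version prefix, e.g. "nb"
    generation: str   # assumed meaning of the second field, e.g. "1"
    format_type: str  # "big" or "bti"
    component: str    # e.g. "Data.db"


def parse_component_filename(name: str) -> SSTableComponent:
    """Split '<version>-<id>-<format>-<component>' into its four fields."""
    parts = name.split("-", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected SSTable component filename: {name!r}")
    return SSTableComponent(*parts)


info = parse_component_filename("nb-1-big-Data.db")
print(info.version, info.format_type, info.component)
```

Splitting at most three times keeps the component name intact even if it contains further hyphens.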
CQL Type to SSTable Encoding
The table below maps each CQL primitive and collection type to its on-disk byte representation in Data.db.
All multi-byte integers are big-endian unless otherwise noted.
| CQL Type | Byte Encoding | Size (bytes) |
|---|---|---|
| | UTF-8 bytes, no null terminator | variable |
| | 4-byte big-endian signed integer | 4 |
| | 8-byte big-endian signed integer | 8 |
| | 4-byte IEEE 754 single-precision | 4 |
| | 8-byte IEEE 754 double-precision | 8 |
| | | 1 |
| | Raw bytes, no framing | variable |
| | 16 bytes, MSB first | 16 |
| | 8-byte milliseconds since Unix epoch | 8 |
| | 4 bytes for IPv4, 16 bytes for IPv6; no length prefix (caller knows from cell value length) | 4 or 16 |
| | 4-byte big-endian scale + unscaled | variable |
| | Minimal two’s complement | variable |
| | Single cell; entire collection serialized as a blob with type-specific internal encoding | variable |
| non-frozen collection | Multi-cell; one cell per element; each cell carries a path (see collection cell paths below) | variable |
| | Single cell; concatenated field values, each prefixed with a 2-byte length or | variable |
| | N × 4-byte IEEE 754 floats, contiguous, no framing | N × 4 |
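A few of the encodings in the table can be sketched with Python's standard library. This is only an illustration of the byte layouts listed above (big-endian fixed-width values, UTF-8 text, 16-byte UUIDs, minimal two's complement integers). The helper names are hypothetical and are not taken from the Cassandra code base.

```python
import struct
import uuid


def encode_int32(v: int) -> bytes:
    return struct.pack(">i", v)   # 4-byte big-endian signed integer


def encode_int64(v: int) -> bytes:
    return struct.pack(">q", v)   # 8-byte big-endian signed integer


def encode_float32(v: float) -> bytes:
    return struct.pack(">f", v)   # IEEE 754 single precision


def encode_float64(v: float) -> bytes:
    return struct.pack(">d", v)   # IEEE 754 double precision


def encode_text(s: str) -> bytes:
    return s.encode("utf-8")      # no null terminator, no length prefix


def encode_uuid(u: uuid.UUID) -> bytes:
    return u.bytes                # 16 bytes, most significant byte first


def encode_minimal_twos_complement(n: int) -> bytes:
    # Shortest big-endian two's complement form that still carries the sign bit.
    magnitude = n if n >= 0 else ~n
    length = magnitude.bit_length() // 8 + 1
    return n.to_bytes(length, "big", signed=True)


print(encode_int32(1).hex())
print(encode_minimal_twos_complement(128).hex())
```

Note that 128 needs two bytes in minimal two's complement because a single byte `0x80` would read back as -128.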
Collection Cell Paths
Non-frozen collections use a cell path to identify each element within the partition:
| Collection Kind | Cell Path Value |
|---|---|
| | TimeUUID (used as a unique, time-ordered token for the list element) |
| | Serialized element bytes (the element itself is the path) |
| | Serialized key bytes (the map key is the path) |
Encoding Quick Reference
VInt (Variable-Length Integer)
VInt is used throughout SSTable internals for lengths, offsets, and counters.
| Property | Detail |
|---|---|
| Continuation flag | High bit of each byte: |
| Range | 1 byte encodes values 0–127; up to 9 bytes for a full signed 64-bit value |
| Common uses | Cell value lengths, row offsets within partitions, counter values, index entry sizes |
Example byte sequences:
0x00 → 0
0x7F → 127
0x80 0x01 → 128
0xFF 0x7F → 16383
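The example byte sequences above are consistent with a continuation-bit layout in which the low-order 7-bit group is written first. The sketch below reproduces exactly those examples and nothing more. It is not a port of Cassandra's own VInt code, whose byte layout should be checked in the source before decoding real files, and the function names are hypothetical.

```python
def encode_varint_lsb_first(value: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, low-order group first."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow: set the high bit
        else:
            out.append(group)         # last group: high bit clear
            return bytes(out)


def decode_varint_lsb_first(data: bytes) -> int:
    """Decode one value from the start of data; raise if it is truncated."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


print(encode_varint_lsb_first(128).hex())
print(decode_varint_lsb_first(bytes([0xFF, 0x7F])))
```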
ZigZag Encoding (Signed Integers)
ZigZag maps signed integers to unsigned so that small negative values remain small after encoding. It is applied before VInt when the value is expected to be a small signed delta.
Mapping formula (for 64-bit): n → (n << 1) ^ (n >> 63), where >> is an arithmetic (sign-extending) shift
| Signed input | ZigZag unsigned output |
|---|---|
| 0 | 0 |
| -1 | 1 |
| 1 | 2 |
| -2 | 3 |
| 2 | 4 |
| -N | 2N - 1 |
| N | 2N |
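The mapping in the table can be checked with a short sketch of the 64-bit formula. Python integers are unbounded, so the result is masked to 64 bits to imitate fixed-width arithmetic. The function names are illustrative only.

```python
MASK64 = (1 << 64) - 1


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned 64-bit integer."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit integer")
    # n >> 63 is 0 for non-negative n and -1 (all ones) for negative n.
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(u: int) -> int:
    """Inverse mapping: recover the signed value from its ZigZag form."""
    if not 0 <= u <= MASK64:
        raise ValueError("value does not fit in an unsigned 64-bit integer")
    return (u >> 1) ^ -(u & 1)


for n in (0, -1, 1, -2, 2):
    print(n, zigzag_encode(n))
```

Because small magnitudes of either sign map to small unsigned values, the VInt that follows stays short for values near zero.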
Delta Encoding
Delta encoding reduces storage when most cells in a partition share similar timestamp or TTL values.
| Delta target | How it is stored |
|---|---|
| Cell timestamp relative to row timestamp | |
| Row timestamp relative to header baseline | |
| Cell TTL relative to row TTL | |
| Row TTL relative to header baseline | |
Baseline values (minTimestamp, minTTL) are stored in the SerializationHeader section of Statistics.db.
Reading code reconstructs the absolute value by adding the delta back to the appropriate baseline.
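The reconstruction step is plain addition. The sketch below shows it for values stored relative to the header baseline, assuming the deltas have already been decoded to integers. The class and function names are hypothetical, and the sample timestamps are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderBaseline:
    """Baseline values as described for the SerializationHeader."""
    min_timestamp: int
    min_ttl: int


def to_delta(baseline_value: int, absolute: int) -> int:
    """Writer side: store only the difference from the baseline."""
    return absolute - baseline_value


def from_delta(baseline_value: int, delta: int) -> int:
    """Reader side: add the delta back to recover the absolute value."""
    return baseline_value + delta


header = HeaderBaseline(min_timestamp=1_700_000_000_000_000, min_ttl=3600)
row_ts = 1_700_000_000_000_250
delta = to_delta(header.min_timestamp, row_ts)
print(delta, from_delta(header.min_timestamp, delta) == row_ts)
```

The delta here is 250, which needs far fewer bytes as a variable-length integer than the full microsecond timestamp.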
Cell Flags Byte
Each cell in Data.db is preceded by a single flags byte.
Bits are numbered 0 (LSB) to 7 (MSB).
| Bit | Flag Name | Meaning |
|---|---|---|
| 0 | | Cell is a tombstone (deletion marker) |
| 1 | | Cell has a TTL and will expire at |
| 2 | | Cell has no value bytes (zero-length value, not null) |
| 3 | | Cell carries its own timestamp (not inherited from row) |
| 4 | | Cell carries its own TTL (not inherited from row) |
| 5 | | Cell timestamp equals the enclosing row’s timestamp; no per-cell timestamp stored |
| 6 | | Cell TTL equals the enclosing row’s TTL; no per-cell TTL stored |
| 7 | | Cell is a counter cell; value is an 8-byte counter shard context |
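Testing individual bits of the flags byte might look like the sketch below. The bit positions follow the table above, but the flag names are placeholders invented for this example because the table's name column is not available here. They are not the constants used in the Cassandra source.

```python
from enum import IntFlag


class CellFlag(IntFlag):
    # Hypothetical names; bit positions taken from the table above.
    TOMBSTONE = 1 << 0
    EXPIRING = 1 << 1
    EMPTY_VALUE = 1 << 2
    OWN_TIMESTAMP = 1 << 3
    OWN_TTL = 1 << 4
    ROW_TIMESTAMP = 1 << 5
    ROW_TTL = 1 << 6
    COUNTER = 1 << 7


def decode_flags(flags_byte: int) -> CellFlag:
    """Interpret a single byte (0-255) as a set of cell flags."""
    if not 0 <= flags_byte <= 0xFF:
        raise ValueError("flags must fit in one byte")
    return CellFlag(flags_byte)


flags = decode_flags(0b0000_0101)
print(CellFlag.TOMBSTONE in flags, CellFlag.EXPIRING in flags)
```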
Reference Walkthroughs
Point Read Walkthrough (BIG Format)
The steps below trace a single-partition point read through the BIG format component chain.
1. Hash the partition key with Murmur3 → 64-bit token (Murmur3Partitioner keeps 64 bits of the 128-bit hash)
2. Check Filter.db (Bloom filter)
→ Negative result: partition definitely absent, stop here
→ Positive result: proceed (false positives are possible)
3. Binary search Summary.db for the sampled token nearest the target
→ Summary.db holds one index entry per index_interval partitions
→ Returns a byte offset into Index.db
4. Seek to the Summary.db offset in Index.db
Binary search the Index.db region for the exact partition key
→ Returns a byte offset into Data.db
5. If the SSTable is compressed:
→ Look up the Data.db offset in CompressionInfo.db to find the chunk offset
→ Read the compressed chunk
→ Verify the trailing 4-byte CRC32 (NB format only)
→ Decompress the chunk
6. Read the partition header at the Data.db offset
→ Confirm partition key matches (guard against index false positives)
→ Check partition deletion timestamp
7. Scan or seek within the unfiltered sequence
→ Apply clustering key bounds from the query
→ Merge with results from other SSTables and memtables on the replica
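Step 3 of the walkthrough is a sorted-array lookup. The sketch below models the summary as two parallel in-memory lists and returns the Index.db offset of the last sampled entry whose token does not exceed the target. This is an assumption about what "nearest" means in step 3, and the names and sample values are hypothetical.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def summary_lookup(
    sampled_tokens: Sequence[int],
    index_offsets: Sequence[int],
    target_token: int,
) -> Optional[int]:
    """Return where to start scanning Index.db, or None if the target
    sorts before the first sampled token."""
    i = bisect_right(sampled_tokens, target_token) - 1
    if i < 0:
        return None
    return index_offsets[i]


tokens = [-100, 0, 250]       # sorted sampled tokens (made-up values)
offsets = [0, 4096, 8192]     # matching byte offsets into Index.db
print(summary_lookup(tokens, offsets, 10))
```

Starting from the sample at or before the target means the follow-up search in Index.db only has to cover one sampling interval.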
Chunk CRC Verification (NB Format)
The nb format appends a 4-byte CRC32 after each compressed chunk in Data.db.
1. Locate the compressed chunk via CompressionInfo.db offset
2. Read compressed_length bytes (from CompressionInfo.db chunk entry)
3. Read the next 4 bytes as a big-endian u32 → expected_crc
4. Compute CRC32 over the compressed_length bytes
5. Compare computed CRC32 with expected_crc
→ Match: chunk is intact, proceed to decompression
→ Mismatch: chunk is corrupt; do not decompress; report corruption
6. After decompression, verify that uncompressed size equals the value
recorded in CompressionInfo.db (decompression bomb protection)
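Steps 1 to 5 above can be sketched over an in-memory buffer using `zlib.crc32` and `struct`. The function name, the exception type, and the idea of passing the whole file as `bytes` are assumptions made for the example. A real reader would work on file channels or mapped buffers.

```python
import struct
import zlib


class CorruptChunkError(Exception):
    """Raised when a chunk is truncated or its checksum does not match."""


def read_verified_chunk(data: bytes, chunk_offset: int, compressed_length: int) -> bytes:
    """Return the compressed bytes of one chunk after checking the trailing CRC32."""
    end = chunk_offset + compressed_length
    if chunk_offset < 0 or end + 4 > len(data):
        raise CorruptChunkError("chunk or checksum extends past end of data")
    compressed = data[chunk_offset:end]
    # The 4 bytes after the chunk hold the expected CRC as a big-endian u32.
    (expected_crc,) = struct.unpack_from(">I", data, end)
    actual_crc = zlib.crc32(compressed) & 0xFFFFFFFF
    if actual_crc != expected_crc:
        raise CorruptChunkError(
            f"CRC mismatch: expected {expected_crc:#010x}, got {actual_crc:#010x}"
        )
    return compressed  # safe to hand to the decompressor


payload = b"example compressed bytes"
blob = payload + struct.pack(">I", zlib.crc32(payload))
print(read_verified_chunk(blob, 0, len(payload)) == payload)
```

Checking the CRC before decompression means a corrupt chunk is rejected without ever being fed to the decompressor.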
Tools Quick Reference
All tools are located in the Cassandra source tree under tools/bin/.
Run them against a stopped node or a snapshot copy — never against live data files that Cassandra is actively using.
|
Always verify that |
| Tool | Command | When To Use | Notes |
|---|---|---|---|
| | | Debugging data content; verifying cell values and tombstones | Outputs JSON; reads Data.db directly; can be slow on large files |
| | | Inspecting SSTable properties without reading row data | Shows partition count, min/max timestamps, compression info, estimated row count |
| | | Rebuilding SSTables that fail to open due to corruption | Destructive: drops unreadable rows; always snapshot first |
| | | Validating SSTable integrity before an upgrade or after hardware events | Non-destructive; safe to run on live snapshots |
Compression Algorithms
| Algorithm | Frame Format | Typical Ratio | CPU Cost | Notes |
|---|---|---|---|---|
| LZ4 | Frame header + compressed blocks + end mark | Moderate (~2:1) | Very low | Default in many Cassandra deployments; favors speed over ratio |
| Snappy | Compressed-length prefix + compressed bytes | Moderate (~2:1) | Low | Google-originated; fast decompression; no framing overhead |
| Deflate | zlib-wrapped DEFLATE stream | High (~3:1) | Moderate | Better compression ratio at the cost of higher CPU |
| Zstd | Frame header + data blocks + optional checksum | High (~3:1) | Low–moderate | Best ratio/speed tradeoff; available in Cassandra 4.x+ |
Per-chunk CRC32 applies in the nb format regardless of which algorithm is selected.
All algorithms are subject to decompression bomb protection: the decompressor verifies that uncompressed output size does not exceed the chunk_length recorded in CompressionInfo.db.
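The size check can be illustrated for the Deflate row with Python's `zlib`, which lets the caller cap how much output a single call may produce. This is a sketch of the general idea only. The function name is hypothetical, and the other algorithms would need their own bounded-output calls.

```python
import zlib


def bounded_inflate(compressed: bytes, chunk_length: int) -> bytes:
    """Decompress a zlib stream, refusing output larger than chunk_length."""
    decompressor = zlib.decompressobj()
    # Ask for at most one byte more than allowed so an overflow is detectable
    # without ever materialising an arbitrarily large buffer.
    out = decompressor.decompress(compressed, chunk_length + 1)
    if len(out) > chunk_length or decompressor.unconsumed_tail:
        raise ValueError("uncompressed size exceeds recorded chunk_length")
    return out


chunk = zlib.compress(b"a" * 100)
print(len(bounded_inflate(chunk, 100)))
```

Capping the output at the call site keeps memory use bounded even when the compressed input claims a very high expansion ratio.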
Glossary
- BIG
-
The classic multi-file SSTable format used from Cassandra 1.x through 5.x. Components include Data.db, Index.db, Summary.db, Filter.db, Statistics.db, CompressionInfo.db, Digest.crc32 (or .sha1), and TOC.txt.
- Bloom Filter
-
A probabilistic data structure stored in Filter.db that answers "is this partition key possibly present?" with a tunable false-positive rate (FPR). A negative answer is always correct; a positive answer may be a false positive.
- BTI (B-Tree/Trie Indexed)
-
The newer SSTable format introduced in Cassandra 5.0. Replaces Index.db and Summary.db with trie-based Partitions.db and Rows.db components for faster lookups and reduced memory overhead.
- Delta Encoding
-
Storing a value as the difference from a known baseline rather than as an absolute value. Used for timestamps and TTLs to reduce the magnitude (and therefore the VInt byte count) of stored values.
- FPR (False Positive Rate)
-
The probability that a Bloom filter incorrectly reports a partition key as present. Configurable per table via bloom_filter_fp_chance; lower FPR requires more memory.
- gc_grace_seconds
-
A table-level setting that controls how long Cassandra retains tombstones before they are eligible for compaction-based deletion. Prevents tombstones from being removed before they propagate to all replicas.
- LCS (Leveled Compaction Strategy)
-
A compaction strategy that organizes SSTables into levels of exponentially increasing size, with non-overlapping SSTables within each level above L0. Provides better read performance and more predictable space amplification than STCS.
- Memtable
-
The in-memory, mutable data structure that accumulates writes before they are flushed to disk as a new SSTable. Multiple implementations exist; see TrieMemtable.
- Merkle Tree
-
A hash tree used during Cassandra’s repair operation to identify data divergence between replicas. Each leaf covers a token range; nodes are built from SHA-256 hashes of the data.
- Promoted Index
-
An embedded mini-index within Index.db (BIG format) for wide partitions. Stores clustering key offsets within a partition so that a point read does not have to scan from the partition start. Stored inline after the partition key entry in Index.db.
- SerializationHeader
-
A metadata section in Statistics.db that stores the baseline timestamp and TTL values used for delta encoding, along with the column schema at write time. Read during SSTable open to initialize decoders.
- SSTable (Sorted String Table)
-
An immutable, append-only file set that persists flushed or compacted Cassandra data. SSTables are the durable output of the LSM-tree write path; they are never modified in place after initial write.
- STCS (Size-Tiered Compaction Strategy)
-
The default compaction strategy. Groups SSTables of similar size into compaction jobs; well-suited for write-heavy workloads.
- TOC.txt (Table of Contents)
-
The manifest file that lists every valid component file belonging to an SSTable. Cassandra atomically writes TOC.txt last during a flush or compaction, making its presence the publication barrier: an SSTable without TOC.txt is incomplete and will not be opened.
- TrieMemtable
-
The byte-ordered prefix trie memtable introduced in Cassandra 5.0. An opt-in alternative to SkipListMemtable, which remains the default unless the memtable configuration selects the trie implementation; provides faster in-order iteration during flush and improved memory locality.
- TWCS (Time Window Compaction Strategy)
-
A compaction strategy designed for time-series data. Groups SSTables by write time into fixed-size windows, then compacts within each window; minimizes compaction work on data that is written once and read rarely.
- UCS (Unified Compaction Strategy)
-
A compaction strategy introduced in Cassandra 5.0 that unifies STCS and LCS behavior under a single configurable framework.
- VInt (Variable-Length Integer)
-
A space-efficient integer encoding where each byte’s high bit is a continuation flag. Values 0–127 fit in one byte; larger values use 2–9 bytes. Used pervasively in SSTable internals for lengths, offsets, and counts.
- ZigZag Encoding
-
A mapping from signed integers to unsigned integers that keeps small-magnitude negative numbers small after encoding. Applied before VInt when the value may be negative (for example, signed timestamp deltas).