SSTable Data Format
Preview | Unofficial | For review only
This page is the reference contributors need when debugging data corruption, implementing new cell types, or understanding how CQL values map to bytes on disk.
It covers the binary layout of Data.db in full, from partition headers through cell flags and collection encoding, and documents the metadata tables in Statistics.db.
Readers working on compaction, repair, or a new storage engine feature will find the on-disk contract described here.
Source links point to the canonical Java classes so you can navigate directly to the serialization code.
Partition Layout in Data.db
Data.db is a flat sequence of partitions, written in token order (ascending by partition key hash).
Each partition is self-contained: a reader can skip from one partition to the next using the index in Index.db without parsing the interior.
A partition has four logical sections:
- Partition header: partition key + deletion info
- Static row (optional): one row containing static column values, if any exist
- Unfiltered sequence: interleaved regular rows and range tombstone markers, in clustering key order
- End-of-partition marker: a sentinel that terminates the sequence
Partition in Data.db
+ partition header
+ static row (optional)
+ row / tombstone marker
+ row / tombstone marker
+ row / tombstone marker
+ end-of-partition marker
Partition Header
| Field | Type | Description |
|---|---|---|
| Key length | u16 (big-endian) | Byte length of the raw partition key that follows |
| Key bytes | | Serialized partition key; encoding depends on the partition key type(s) in the SerializationHeader |
| Deletion timestamp | i64 (big-endian) | |
| Local deletion time | i32 (big-endian) | |
The fixed-width fields of the partition header are always big-endian, regardless of the VInt convention used for interior values.
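The header table above can be read with a few fixed-width unpacks. The following is a minimal sketch, assuming the fields appear in exactly the order listed in the table; `PartitionHeader` and `read_partition_header` are hypothetical names for illustration, not Cassandra classes, and the field order should be checked against the serialization source before relying on it.

```python
import struct
from typing import NamedTuple, Tuple


class PartitionHeader(NamedTuple):
    key: bytes                 # raw serialized partition key
    deletion_timestamp: int    # i64, big-endian
    local_deletion_time: int   # i32, big-endian


def read_partition_header(buf: bytes, offset: int = 0) -> Tuple[PartitionHeader, int]:
    """Parse one header; return it with the offset of the first byte after it."""
    (key_len,) = struct.unpack_from(">H", buf, offset)   # u16 key length
    offset += 2
    key = bytes(buf[offset:offset + key_len])
    if len(key) != key_len:
        raise ValueError("truncated partition key")
    offset += key_len
    # Assumed order (from the table): i64 deletion timestamp, then i32 local deletion time.
    deletion_ts, local_del = struct.unpack_from(">qi", buf, offset)
    offset += 12
    return PartitionHeader(key, deletion_ts, local_del), offset
```

The returned offset is where the static row or the first unfiltered entry would begin.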
Row Layout
Each row in the unfiltered sequence begins with a flags byte that controls what fields follow.
Row Flags Byte
| Bit | Name | Meaning |
|---|---|---|
| 0x01 | | Marks the end-of-partition sentinel; no further fields follow |
| 0x02 | | This is the static row; no clustering key follows |
| 0x04 | HAS_TIMESTAMP | Row-level timestamp is present; cells may inherit it |
| 0x08 | HAS_TTL | Row-level TTL is present; cells may inherit it |
| 0x10 | HAS_DELETION | Row-level tombstone (timestamp + local deletion time) is present |
| 0x20 | HAS_ALL_COLUMNS | All non-PK columns are present; no column presence bitmap follows |
| 0x40 | | At least one non-frozen collection column has a complex deletion |
| 0x80 | | Reserved for future extensions |
Row Fields (in order)
After the flags byte, the following fields appear in order, each conditional on the corresponding flag:
- Clustering key: serialized according to the table’s clustering column types from the SerializationHeader; absent for static rows
- Row timestamp: VInt, delta-encoded from the baseline in SerializationHeader; present if HAS_TIMESTAMP
- Row TTL: VInt, delta-encoded; present if HAS_TTL
- Row local deletion time: VInt, delta-encoded; present if HAS_TTL
- Row deletion: timestamp (VInt) + local deletion time (VInt); present if HAS_DELETION
- Column presence bitmap: a bit per non-PK column indicating which columns have cells; omitted if HAS_ALL_COLUMNS
- Cell data: one cell entry per column whose presence bit is set (see Cell Encoding below)
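The conditional field order above can be sketched as a function from the flags byte to the list of fields a reader should expect. The bit values come from the Row Flags Byte table; the four `HAS_*` names appear in this section, while `END_OF_PARTITION` and `IS_STATIC` are hypothetical labels chosen here for bits 0x01 and 0x02.

```python
from typing import List

END_OF_PARTITION = 0x01   # hypothetical label for the sentinel bit
IS_STATIC = 0x02          # hypothetical label for the static-row bit
HAS_TIMESTAMP = 0x04
HAS_TTL = 0x08
HAS_DELETION = 0x10
HAS_ALL_COLUMNS = 0x20


def row_fields(flags: int) -> List[str]:
    """Return the fields that follow a row flags byte, in on-disk order."""
    if flags & END_OF_PARTITION:
        return []                                  # sentinel: nothing follows
    fields = []
    if not flags & IS_STATIC:
        fields.append("clustering_key")            # static rows have none
    if flags & HAS_TIMESTAMP:
        fields.append("row_timestamp")
    if flags & HAS_TTL:
        fields += ["row_ttl", "row_local_deletion_time"]
    if flags & HAS_DELETION:
        fields.append("row_deletion")
    if not flags & HAS_ALL_COLUMNS:
        fields.append("column_presence_bitmap")
    fields.append("cells")
    return fields
```

For example, `row_fields(HAS_TIMESTAMP | HAS_ALL_COLUMNS)` yields a clustering key, a row timestamp, and then the cells, with no presence bitmap.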
Cell Encoding
Each cell begins with a cell flags byte followed by conditional fields.
Cell Flags Byte
| Bit | Name | Meaning |
|---|---|---|
| 0x01 | IS_DELETED | Cell is a tombstone; no value bytes follow |
| 0x02 | IS_EXPIRING | Cell carries a TTL countdown; expiry fields follow |
| 0x04 | HAS_EMPTY_VALUE | Cell has a logical empty value (e.g., |
| 0x08 | USE_ROW_TIMESTAMP | Inherit timestamp from the enclosing row header; no per-cell timestamp |
| 0x10 | USE_ROW_TTL | Inherit TTL and local deletion time from the enclosing row header; no per-cell TTL fields |
| 0x20 | | Used on the first cell of a non-frozen collection; indicates a collection-level tombstone precedes the cells |
Cell Fields (in order)
- Timestamp: VInt, delta-encoded from row/partition baseline; omitted if USE_ROW_TIMESTAMP
- TTL: VInt, delta-encoded; present only if IS_EXPIRING and not USE_ROW_TTL
- Local deletion time: VInt, delta-encoded; present if IS_EXPIRING or IS_DELETED
- Value: value length as VInt, followed by that many raw bytes; absent if IS_DELETED or HAS_EMPTY_VALUE. The length prefix is omitted when the type has a fixed serialized width implied by the SerializationHeader
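The cell rules above can be sketched the same way as the row rules. This is an illustration of the conditions as this page states them, not a port of the Cassandra serializer. It assumes that a fixed-width type drops only the length prefix and keeps the value bytes; `cell_fields` and its `fixed_width` parameter are hypothetical.

```python
from typing import List

IS_DELETED = 0x01
IS_EXPIRING = 0x02
HAS_EMPTY_VALUE = 0x04
USE_ROW_TIMESTAMP = 0x08
USE_ROW_TTL = 0x10


def cell_fields(flags: int, fixed_width: bool = False) -> List[str]:
    """Return the fields that follow a cell flags byte, in the order listed above."""
    fields = []
    if not flags & USE_ROW_TIMESTAMP:
        fields.append("timestamp")
    if flags & IS_EXPIRING and not flags & USE_ROW_TTL:
        fields.append("ttl")
    if flags & (IS_EXPIRING | IS_DELETED):
        fields.append("local_deletion_time")
    if not flags & (IS_DELETED | HAS_EMPTY_VALUE):
        if not fixed_width:
            fields.append("value_length")   # VInt prefix, variable-width types only
        fields.append("value")
    return fields
```

A tombstone that inherits the row timestamp (`IS_DELETED | USE_ROW_TIMESTAMP`) therefore carries only a local deletion time.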
VInt and Delta Encoding
Variable-Length Integers (VInt)
Cassandra uses a custom variable-length integer encoding for most numeric fields inside rows and cells.
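The first byte carries the length: the number of leading 1 bits in it equals the number of additional bytes that follow, and a leading 0 bit means the value fits in that single byte.
The remaining bits of the first byte, followed by all bits of the additional bytes, hold the value in big-endian order.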
| Value range | Bytes used |
|---|---|
| 0 – 127 | 1 |
| 128 – 16,383 | 2 |
| 16,384 – 2,097,151 | 3 |
| 2,097,152 – 268,435,455 | 4 |
| 268,435,456 – 34,359,738,367 | 5 |
| 34,359,738,368 – 4,398,046,511,103 | 6 |
| 4,398,046,511,104 – 562,949,953,421,311 | 7 |
| 562,949,953,421,312 – 72,057,594,037,927,935 | 8 |
| 72,057,594,037,927,936 – 9,223,372,036,854,775,807 | 9 |
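The size table follows a simple rule: each additional byte buys seven more value bits. A small sketch that reproduces the table for values in its range (0 through 2^63 - 1); `unsigned_vint_size` is a hypothetical helper and computes only the encoded length, not the bytes themselves.

```python
def unsigned_vint_size(value: int) -> int:
    """Number of bytes the table above assigns to a non-negative value."""
    if not 0 <= value < (1 << 63):
        raise ValueError("value outside the range covered by the table")
    significant_bits = max(value.bit_length(), 1)   # treat 0 as one bit
    return (significant_bits + 6) // 7              # ceil(bits / 7)
```

Checking a boundary pair such as 16,383 and 16,384 returns 2 and 3, matching the second and third rows.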
ZigZag Encoding for Signed Values
Signed VInts (timestamps, deletion times) use ZigZag encoding before being stored as an unsigned VInt. ZigZag maps negative numbers to odd positive integers, interleaving positive and negative values:
encode(n) = (n << 1) ^ (n >> 63)   // Java long arithmetic
 0 → 0
-1 → 1
 1 → 2
-2 → 3
 2 → 4
...
This keeps small absolute values small on disk regardless of sign.
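The mapping can be checked with a short round-trip sketch. Python integers are unbounded, so the encoder masks the result to 64 bits to mimic Java `long` arithmetic; the function names are illustrative only.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one, small magnitudes first."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("n does not fit in a signed 64-bit integer")
    # n >> 63 is 0 for non-negative n and -1 (all ones) for negative n.
    return ((n << 1) ^ (n >> 63)) & MASK_64


def zigzag_decode(u: int) -> int:
    """Inverse of zigzag_encode."""
    return (u >> 1) ^ -(u & 1)
```

Encoding -1, 1, -2, 2 gives 1, 2, 3, 4, and decoding restores the original values.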
Delta Encoding
Timestamps, TTLs, and local deletion times are stored as deltas from a baseline, not absolute values.
The baseline is recorded in the SerializationHeader inside Statistics.db.
For a table where most writes share the same approximate timestamp (a common case), the delta fits in 1–2 VInt bytes even though the absolute microsecond timestamp requires 8 bytes.
Row-level timestamps and TTLs provide a second level of compression: cells that match the row’s values need not store any per-cell timestamp or TTL at all (signaled by USE_ROW_TIMESTAMP / USE_ROW_TTL).
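The saving is easy to see with numbers. The sketch below sizes a few timestamps as deltas from a baseline and compares that with sizing the absolute value. It is a simplification: it treats deltas as unsigned and ignores ZigZag, which would double each delta before sizing. The baseline value and helper names are made up for illustration.

```python
def unsigned_vint_size(value: int) -> int:
    """Encoded length for a non-negative value (valid up to 2**63 - 1)."""
    if value < 0:
        raise ValueError("delta is negative; the baseline must be the minimum")
    return (max(value.bit_length(), 1) + 6) // 7


def delta_sizes(timestamps, baseline):
    """Bytes needed for each timestamp when stored as (timestamp - baseline)."""
    return [unsigned_vint_size(ts - baseline) for ts in timestamps]


baseline = 1_700_000_000_000_000            # microseconds; example value only
writes = [baseline, baseline + 90, baseline + 15_000]
print(delta_sizes(writes, baseline))        # [1, 1, 2]
print(unsigned_vint_size(baseline))         # 8
```

Three writes within 15 ms of the baseline cost four bytes in total, against eight bytes for a single absolute timestamp.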
Deletion Types
Cassandra supports five distinct kinds of deletion, each with different scope and on-disk representation.
Partition Tombstone
Triggered by DELETE FROM t WHERE pk = ? with no clustering predicate.
Stored in the partition header as a deletion timestamp + local deletion time.
Supersedes every row and cell in the partition whose write timestamp is before the tombstone’s timestamp.
Row Tombstone
Triggered by DELETE FROM t WHERE pk = ? AND ck = ?.
Stored in the row’s deletion fields (HAS_DELETION flag set) as a timestamp + local deletion time.
Applies to all cells in that clustering row.
Cell Tombstone
Triggered by setting a column to null in an UPDATE or INSERT.
Stored at the individual cell level with IS_DELETED set; no value bytes follow.
Range Tombstone
Triggered by DELETE FROM t WHERE pk = ? AND ck > ? AND ck < ? (or any range predicate on clustering columns).
Stored as a pair of markers in the unfiltered sequence:
- Open marker: flags byte with IS_RANGE_TOMBSTONE_MARKER set + bound (start of range) + deletion info
- Close marker: same structure, marks the end of the deleted range
Every clustering row whose key falls inside an open-to-close marker pair is treated as deleted if its timestamp is before the range tombstone’s timestamp. Multiple range tombstones can be interleaved; open and close markers must be strictly balanced.
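The coverage rule can be sketched as a predicate over already-paired markers. This is a simplified model, assuming a single integer clustering column and exclusive bounds as in the `ck > ? AND ck < ?` example; `RangeTombstone` and `is_shadowed` are hypothetical names.

```python
from typing import Iterable, NamedTuple


class RangeTombstone(NamedTuple):
    start: int       # exclusive lower bound (from the open marker)
    end: int         # exclusive upper bound (from the close marker)
    timestamp: int   # deletion timestamp in microseconds


def is_shadowed(ck: int, row_timestamp: int, tombstones: Iterable[RangeTombstone]) -> bool:
    """True if any range covers the clustering key and is newer than the row."""
    return any(
        t.start < ck < t.end and row_timestamp < t.timestamp
        for t in tombstones
    )
```

A row written after the range deletion stays live even though its key falls inside the deleted range.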
TTL Expiry
Not a tombstone in the traditional sense.
Expiring cells carry a TTL (seconds) and a local deletion time (wall-clock time when expiry will occur).
Once now > local_deletion_time, the cell is treated as deleted on read.
The cell’s bytes remain on disk until a compaction runs after both the cell’s expiry time has passed and gc_grace_seconds has elapsed.
gc_grace_seconds
All tombstone types (partition, row, cell, range) are subject to gc_grace_seconds.
A tombstone is eligible for purging by compaction only after it is at least gc_grace_seconds old.
This grace period ensures that nodes which were down long enough to miss the tombstone have time to be repaired before the deletion marker disappears.
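The two time checks described in this section, read-time expiry and compaction-time purge eligibility, can be sketched as follows. All times are in seconds. The function names are illustrative, and real purge decisions involve further conditions that this sketch does not model.

```python
def is_expired(local_deletion_time: int, now: int) -> bool:
    """Read path: an expiring cell counts as deleted once now > local_deletion_time."""
    return now > local_deletion_time


def is_purgeable(local_deletion_time: int, gc_grace_seconds: int, now: int) -> bool:
    """Compaction: a tombstone may be dropped once it is at least gc_grace_seconds old."""
    return now - local_deletion_time >= gc_grace_seconds


deleted_at = 1_700_000_000
ten_days = 864_000          # example gc_grace_seconds value
print(is_purgeable(deleted_at, ten_days, deleted_at + 3_600))      # False
print(is_purgeable(deleted_at, ten_days, deleted_at + ten_days))   # True
```

Between the two thresholds the data is invisible to reads but still occupies disk space.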
Collection and UDT Encoding
Frozen Collections
A frozen collection is serialized as a single cell whose value is a blob containing the entire collection. Mutation replaces the entire blob; partial updates are not possible.
Blob layout by type:
| Type | Layout |
|---|---|
| list<T> | u32 element count + serialized elements in list order |
| set<T> | u32 element count + serialized elements in sorted order |
| map<K,V> | u32 entry count + (serialized key + serialized value) pairs in key-sorted order |
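As an illustration of the list layout, the sketch below packs and unpacks a count-prefixed blob. The table only says "serialized elements". The 4-byte big-endian length prefix in front of each element is an assumption made here so that the blob can be split again, and the function names are hypothetical.

```python
import struct
from typing import List


def encode_frozen_list(elements: List[bytes]) -> bytes:
    """u32 element count, then each element with an assumed i32 length prefix."""
    parts = [struct.pack(">I", len(elements))]
    for element in elements:
        parts.append(struct.pack(">i", len(element)))   # assumed per-element prefix
        parts.append(element)
    return b"".join(parts)


def decode_frozen_list(blob: bytes) -> List[bytes]:
    """Inverse of encode_frozen_list."""
    (count,) = struct.unpack_from(">I", blob, 0)
    offset, elements = 4, []
    for _ in range(count):
        (length,) = struct.unpack_from(">i", blob, offset)
        offset += 4
        elements.append(blob[offset:offset + length])
        offset += length
    return elements
```

Because the whole collection lives in one blob, changing a single element means re-encoding and rewriting all of it.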
Non-Frozen Collections
A non-frozen collection is stored as multiple cells, one per element. Each cell has a cell path that identifies the element within the collection:
- list<T>: cell path is a time UUID (write timestamp) that orders elements
- set<T>: cell path is the serialized set element itself
- map<K,V>: cell path is the serialized map key
Frozen collection
one cell
-> value bytes contain the whole collection blob
Non-frozen collection
cell(path=element_1, value=...)
cell(path=element_2, value=...)
cell(path=element_3, value=...)
This representation allows partial updates (append to list, delete one map entry) without rewriting the entire collection. It also means a large non-frozen collection can produce many cells and many Index.db entries.
Collection-level tombstones (e.g., UPDATE t SET col = {} WHERE pk = ?) are encoded as a complex deletion that precedes the replacement cells; readers apply the complex deletion to clear prior elements before applying the new ones.
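How a reader might combine a complex deletion with per-element cells can be sketched as below. This is a simplified model, not the Cassandra merge code: it assumes a cell survives only if its timestamp is strictly newer than the complex deletion, and that the newest cell wins for each path. `Cell` and `live_elements` are hypothetical names.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Cell(NamedTuple):
    path: bytes       # element identity within the collection (e.g. a map key)
    value: bytes
    timestamp: int    # write timestamp in microseconds


def live_elements(cells: Iterable[Cell],
                  complex_deletion_ts: Optional[int] = None) -> Dict[bytes, bytes]:
    """Drop cells covered by the complex deletion, then keep the newest per path."""
    newest: Dict[bytes, Cell] = {}
    for cell in cells:
        if complex_deletion_ts is not None and cell.timestamp <= complex_deletion_ts:
            continue                                  # cleared by the deletion
        current = newest.get(cell.path)
        if current is None or cell.timestamp > current.timestamp:
            newest[cell.path] = cell
    return {path: cell.value for path, cell in newest.items()}
```

With old entries at timestamp 10, a complex deletion at 19, and replacement entries at 20, only the replacements remain visible.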
User-Defined Types (UDTs)
UDTs follow the same frozen/non-frozen split:
- Frozen UDT: single cell; value is field values concatenated in field-declaration order, each preceded by a signed 4-byte length prefix (-1 for null)
- Non-frozen UDT: multiple cells; cell path is the u16 field index
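The frozen UDT layout can be sketched as a concatenation of length-prefixed fields. The sketch assumes a signed 4-byte big-endian length, since an unsigned prefix could not represent the -1 that marks null; `encode_frozen_udt` is a hypothetical helper.

```python
import struct
from typing import List, Optional


def encode_frozen_udt(fields: List[Optional[bytes]]) -> bytes:
    """Concatenate fields in declaration order; None is written as length -1."""
    parts = []
    for value in fields:
        if value is None:
            parts.append(struct.pack(">i", -1))              # null marker, no bytes follow
        else:
            parts.append(struct.pack(">i", len(value)) + value)
    return b"".join(parts)
```

A null field and an empty field are distinguishable: the first is written as length -1, the second as length 0.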
Frozen vs. Non-Frozen: Operational Trade-offs
| Property | Frozen | Non-frozen |
|---|---|---|
| Write amplification | Full collection rewritten on every mutation | Only changed elements written |
| Read cost | Single cell read, one deserialization pass | All element cells read; potentially high cell count |
| Partial updates | Not supported | Supported (append, delete single element) |
| Compaction impact | Fewer cells, simpler tombstone tracking | More cells, complex deletion tracking |
Statistics.db
Statistics.db contains SSTable-level metadata that readers, compaction, and repair use without opening Data.db.
It is written once at flush time and replaced atomically on compaction.
StatsMetadata
| Field | Description |
|---|---|
| Bloom filter FP chance | Target false positive rate used when the Bloom filter in |
| Compression ratio | Ratio of compressed size to uncompressed size; used by size-tiered compaction strategy for size estimation |
| Min/max timestamps | Minimum and maximum write timestamps of any live cell in the SSTable (microseconds) |
| Min/max local deletion time | Minimum and maximum local deletion times; used to skip SSTables whose tombstones can’t cover a read timestamp |
| Min/max clustering values | Per-column min and max of clustering key values; enables range-based SSTable skipping on reads |
| Partition size histogram | Approximate histogram of partition sizes in bytes; used for monitoring and compaction sizing |
| Column count distribution | Histogram of cells-per-row; feeds compaction heuristics |
| Total rows | Estimated count of rows (not partitions); used in repair stream estimation |
| Total range tombstones | Count of range tombstone markers; high values signal potential read-path overhead |
| Total cells | Count of all cells written; used in compaction throughput accounting |
| Repaired-at timestamp | Wall-clock time when this SSTable was last included in an incremental repair; |
| Compaction level | Current LCS level ( |
SerializationHeader
The SerializationHeader is the schema embedded inside each SSTable. It is written at flush time and reflects the table schema as it existed at that moment.
Contents:
- Partition key type: the type (or composite type) used to serialize partition keys
- Clustering column types: ordered list of types for the clustering columns
- Regular column types: map of column name to type for non-static, non-PK columns
- Static column types: map of column name to type for static columns
- Timestamp baseline: the minimum timestamp seen during the flush; used as the delta-encoding origin
- TTL baseline: the minimum TTL seen during the flush
- Local deletion time baseline: the minimum local deletion time seen during the flush
The SerializationHeader enables an SSTable to be read even after a live schema change (e.g., a column was dropped or a type was altered), because the reader uses the embedded type information rather than the live schema to deserialize values. It also provides the baselines that make delta encoding effective.
Key Source Files
| Area | Key Class |
|---|---|
| Partition serialization | |
| Row encoding | |
| Cell flags | |
| Range tombstones | |
| Statistics | |
| Serialization header | |
| Collections | |
| UDTs | |