Data.db Format
This chapter describes the on-disk layout of partitions, rows, and cells in Data.db: how headers reference schema, how unfiltered rows, range tombstones, and markers are encoded, and how encodings like vints and cell flags are interpreted.
In this chapter you will learn
- Partition headers and row/clustering layout basics
- Cell value encodings, varints/vints, collections/UDTs
- Deletions, range tombstones, TTLs and expiring cells
- How readers interpret flags and headers during parsing
Partition and Row Layout
Minimal annotated example from test_basic/simple_table (trimmed and formatted):

    partition key   = 4d4321e2-662b-4ba1-b75f-48e080727a52
    row liveness ts = 2025-09-16T22:14:23.739Z
    cells: account_balance=21088.5, active=false, age=75, name=(utf8) ...

The underlying file holds a partition stream: a serialization header followed by unfiltered rows and optional tombstone markers.

Figure: Annotated Data.db partition/row/cell structure (serialization header → unfiltered rows/markers → cells with flags and vints).
Encodings and Flags
VInts (Cassandra-compatible variable-length integers) are used throughout headers and length fields. For a concise implementation walkthrough, see Appendix C.
Readers interpret row/cell flags to distinguish live cells, TTLs, and tombstones; see Chapter 11 for tombstone semantics and Appendix B for a compact encoding summary.
Common cell flags (high level):
- live cell vs tombstone
- presence of timestamp, ttl, local deletion time
- empty/expiring cells
Bit-level flags (Cassandra 5.0, authoritative references):
| Bit | Name | When set |
|---|---|---|
| 0 | IS_DELETED_MASK | Cell is a tombstone |
| 1 | IS_EXPIRING_MASK | Cell is expiring (has TTL) |
| 2 | HAS_EMPTY_VALUE_MASK | Cell value is empty (zero-length but present) |
| 3 | USE_ROW_TIMESTAMP_MASK | Cell timestamp equals row timestamp — timestamp field is omitted |
| 4 | USE_ROW_TTL_MASK | Cell TTL/LDT equals row TTL/LDT — TTL and LDT fields are omitted |
| 5+ | (reserved) | Format-specific extensions |
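
As an illustration of how a reader might interpret the flag byte in the table above, here is a minimal Python sketch. The mask values follow the bit positions listed (bit n maps to `1 << n`); the `CellFlags` class and `decode_cell_flags` function are hypothetical names chosen for this example and are not part of Cassandra or CQLite.

```python
from dataclasses import dataclass

# Mask values derived from the bit table above (bit n -> 1 << n).
IS_DELETED_MASK = 0x01
IS_EXPIRING_MASK = 0x02
HAS_EMPTY_VALUE_MASK = 0x04
USE_ROW_TIMESTAMP_MASK = 0x08
USE_ROW_TTL_MASK = 0x10


@dataclass(frozen=True)
class CellFlags:
    """Hypothetical decoded view of a cell flag byte."""
    is_deleted: bool
    is_expiring: bool
    has_empty_value: bool
    use_row_timestamp: bool
    use_row_ttl: bool


def decode_cell_flags(flags: int) -> CellFlags:
    """Split a cell flag byte into the five documented booleans."""
    return CellFlags(
        is_deleted=bool(flags & IS_DELETED_MASK),
        is_expiring=bool(flags & IS_EXPIRING_MASK),
        has_empty_value=bool(flags & HAS_EMPTY_VALUE_MASK),
        use_row_timestamp=bool(flags & USE_ROW_TIMESTAMP_MASK),
        use_row_ttl=bool(flags & USE_ROW_TTL_MASK),
    )
```

For example, `decode_cell_flags(0x0C)` reports an empty value that reuses the row timestamp.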
Authoritative classes to consult in Cassandra 5.0:
- org.apache.cassandra.db.rows.* (e.g., Unfiltered, Cell, BufferCell)
- org.apache.cassandra.db.SerializationHeader
- org.apache.cassandra.db.rows.SerializationHelper
Endianness:
- Integers in SSTable payloads are big-endian unless otherwise specified; varints are MSB-first variable-length.
- Network/binary compatibility relies on consistent big-endian parsing for fixed-width fields.
Deletions and TTL Semantics
- Partition tombstone: marks entire partition deleted at a timestamp
- Row tombstone: targets a specific clustering row
- Range tombstone: spans clustering ranges
- TTL/expiring: cells carry ttl and local deletion time; expired cells are omitted at read
Collections and UDTs
Collections (list/set/map) have two storage modes:
- Frozen (`frozen<list<...>>`): single-cell storage; the entire collection is serialized as one blob
- Non-frozen (`list<...>`): multi-cell storage; each element is stored as a separate cell
Frozen Collection Serialization
Frozen collections are stored as a single cell with the entire collection serialized as a binary blob. The format uses 4-byte big-endian i32 length prefixes (matching Java’s serialization format).
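
To make the length-prefixed layout concrete, the following Python sketch decodes a frozen list (or set) blob and a frozen map blob into raw element bytes. It assumes the field order given in the format listings in this section (an i32 count, then an i32 length before each element, key, or value) and that all lengths are non-negative. The function names are hypothetical and are not CQLite APIs.

```python
import struct
from typing import List, Tuple


def _read_chunk(blob: bytes, pos: int) -> Tuple[bytes, int]:
    """Read one [i32 BE length][bytes] chunk; return (bytes, next position)."""
    (length,) = struct.unpack_from(">i", blob, pos)
    if length < 0:
        # Assumption: this sketch does not model null elements.
        raise ValueError("negative length is not handled in this sketch")
    start = pos + 4
    end = start + length
    if end > len(blob):
        raise ValueError("chunk runs past the end of the blob")
    return blob[start:end], end


def decode_frozen_list(blob: bytes) -> List[bytes]:
    """Decode a frozen list or set blob into its serialized elements."""
    (count,) = struct.unpack_from(">i", blob, 0)
    pos = 4
    elements = []
    for _ in range(count):
        element, pos = _read_chunk(blob, pos)
        elements.append(element)
    return elements


def decode_frozen_map(blob: bytes) -> List[Tuple[bytes, bytes]]:
    """Decode a frozen map blob into (key_bytes, value_bytes) pairs."""
    (count,) = struct.unpack_from(">i", blob, 0)
    pos = 4
    entries = []
    for _ in range(count):
        key, pos = _read_chunk(blob, pos)
        value, pos = _read_chunk(blob, pos)
        entries.append((key, value))
    return entries
```

The returned byte strings still need type-specific decoding; for an `int` element, `int.from_bytes(element, "big", signed=True)` recovers the value.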
Frozen List/Set Format (identical for both types):

    [i32 BE: element_count]
    [for each element:
      [i32 BE: element_length]
      [element_bytes]]

Example (frozen list with 2 integers):

    Hex: 00 00 00 02   00 00 00 04 00 00 00 2A   00 00 00 04 00 00 00 64
         |_________|   |_____________________|   |_____________________|
         count=2       elem1: len=4, val=42      elem2: len=4, val=100

Frozen Map Format:

    [i32 BE: entry_count]
    [for each entry:
      [i32 BE: key_length]
      [key_bytes]
      [i32 BE: value_length]
      [value_bytes]]

Example (frozen map with 1 entry: "a" -> 42):

    Hex: 00 00 00 01   00 00 00 01 61   00 00 00 04 00 00 00 2A
         |_________|   |____________|   |_____________________|
         count=1       key: len=1, "a"  val: len=4, int(42)

Non-Frozen Collection Serialization
Non-frozen collections are stored as multiple cells, one per element or entry. Each cell has a cell_path that identifies the element and a cell_value that contains the data.
Non-frozen collection cell format (complex columns):

    [flags: u8]
    [timestamp: VInt if not USE_ROW_TIMESTAMP_MASK]
    [local_deletion_time: VInt if deleted/expiring]
    [ttl: VInt if expiring]
    [cell_path: VInt length + bytes]   ← See table below
    [value: VInt length + bytes]       ← See table below

Cell Path and Value by Collection Type:

| Collection Type | cell_path | cell_value |
|---|---|---|
| `list<T>` | TimeUUID (16 bytes) | Serialized element |
| `set<T>` | Serialized element | Empty (0 bytes) |
| `map<K,V>` | Serialized key | Serialized value |
List Element Ordering:
Lists use TimeUUID (UUID version 1) for the cell_path to maintain insertion order. TimeUUIDs are time-sortable, ensuring elements remain in the order they were written. Each element gets a unique TimeUUID generated at write time.
Example (list with 2 integers):

    Cell 1:
      path:  f35cf98a-220c-11ef-8b04-f4ff7ffcf681  (16 bytes, TimeUUID)
      value: 00 00 00 2A                           (4 bytes, int 42)
    Cell 2:
      path:  f35cf98b-220c-11ef-8b04-f4ff7ffcf681  (16 bytes, TimeUUID)
      value: 00 00 00 64                           (4 bytes, int 100)

Set Element Storage:
Sets use the element value itself as the cell_path for efficient membership testing. The cell_value is always empty (0 bytes). This allows Cassandra to check set membership by looking for a cell with a matching path.
Example (set with 2 text values):

    Cell 1:
      path:  61 6C 70 68 61  (5 bytes, "alpha")
      value: (empty, 0 bytes)
    Cell 2:
      path:  62 65 74 61     (4 bytes, "beta")
      value: (empty, 0 bytes)

Map Entry Storage:
Maps use the serialized key as the cell_path and the serialized value as the cell_value. This allows efficient key lookups.
Example (map with 2 entries: 1->“one”, 2->“two”):

    Cell 1:
      path:  00 00 00 01  (4 bytes, int key 1)
      value: 6F 6E 65     (3 bytes, "one")
    Cell 2:
      path:  00 00 00 02  (4 bytes, int key 2)
      value: 74 77 6F     (3 bytes, "two")

Implementation References:
- Frozen collections: cqlite-core/src/storage/sstable/writer/data_writer.rs::serialize_frozen_list(), serialize_frozen_set(), serialize_frozen_map()
- Non-frozen collections: serialize_nonfrozen_list(), serialize_nonfrozen_set(), serialize_nonfrozen_map()
- Tests: cqlite-core/tests/collection_roundtrip_test.rs
UDTs (User-Defined Types) serialize fields in schema order with 4-byte BE length prefixes:

    [field_1_length: 4-byte BE i32][field_1_data]
    [field_2_length: 4-byte BE i32][field_2_data]
    ...

UDT field length semantics (confirmed via Issue #220):
- `-1` (0xFFFFFFFF): Field is NULL
- `0` (0x00000000): Field is empty (zero-length but present)
- `>0`: Number of bytes of field data following
- Trailing omitted fields are implicitly NULL
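
The length semantics above can be sketched as a small decoder. This is an illustrative Python example, not CQLite's implementation: `decode_udt_fields` is a hypothetical name, and the caller is assumed to know the schema's field count so that trailing omitted fields can be filled in as NULL.

```python
import struct
from typing import List, Optional


def decode_udt_fields(blob: bytes, field_count: int) -> List[Optional[bytes]]:
    """Decode a UDT blob into per-field bytes; None represents NULL."""
    fields: List[Optional[bytes]] = []
    pos = 0
    while pos < len(blob) and len(fields) < field_count:
        (length,) = struct.unpack_from(">i", blob, pos)
        pos += 4
        if length < 0:
            # -1 (0xFFFFFFFF) marks a NULL field; no data bytes follow.
            fields.append(None)
        else:
            # 0 yields b"" (empty but present); >0 yields that many bytes.
            fields.append(blob[pos:pos + length])
            pos += length
    # Trailing omitted fields are implicitly NULL.
    fields.extend([None] * (field_count - len(fields)))
    return fields
```

Keeping `None` and `b""` distinct in the result preserves the NULL versus empty distinction that the format encodes.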
Critical distinction: The outer type determines storage:
- `list<frozen<udt>>` = multi-cell (each UDT element is a separate cell)
- `frozen<list<udt>>` = single-cell (the entire list is one blob)
See tables/type-mapping-complex.md for detailed format specifications.
Counter Cells (Issue #241)
Counter columns store a CounterContext structure, not a raw i64 value. The CounterContext tracks counter updates across multiple replicas (shards).
Cell format:

    [VInt length]       ← Length of CounterContext bytes
    [CounterContext]    ← Variable-length structure below

CounterContext format (from CounterContext.java):
| Field | Size | Description |
|---|---|---|
| header_size | 2 bytes | BE signed short; number of shards |
| indices | 2 × abs(header_size) bytes | Shard type indicators (negative = global) |
| shards | 32 × abs(header_size) bytes | Each shard: counter_id (16) + clock (8) + count (8) |
Shard structure (32 bytes each):

    [counter_id: 16 bytes]  ← Replica's CounterId (UUID)
    [clock: 8 bytes]        ← Logical clock (BE unsigned long)
    [count: 8 bytes]        ← Counter value for this shard (BE signed long)

The counter value is the sum of all shard counts, matching Cassandra’s total() function.
Example (single-shard counter):

    24                                ← VInt length (36 bytes)
    0001                              ← header_size = 1
    8000                              ← header index (0x8000 = global shard at index 0)
    f35cf98a220c40fb8b04f4ff7ffcf681  ← counter_id (16 bytes)
    00064073 23d1d210                 ← clock (8 bytes)
    00000000 00000029                 ← count = 41 (8 bytes)

Reference: org.apache.cassandra.db.context.CounterContext in Cassandra 5.0 source.
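
As a rough illustration, the Python sketch below sums the shard counts of a CounterContext laid out as in the table above. It follows this section's description literally (abs(header_size) index entries followed by the same number of 32-byte shards) and takes the context bytes without the leading VInt length. `counter_total` is a hypothetical helper name, not Cassandra's `total()` itself.

```python
import struct

SHARD_SIZE = 32  # counter_id (16) + clock (8) + count (8)


def counter_total(context: bytes) -> int:
    """Sum the per-shard counts of a CounterContext (length prefix removed)."""
    (header_size,) = struct.unpack_from(">h", context, 0)
    shard_count = abs(header_size)
    shards_start = 2 + 2 * shard_count  # skip header_size and the index entries
    expected_len = shards_start + SHARD_SIZE * shard_count
    if len(context) < expected_len:
        raise ValueError("context is shorter than its header implies")
    total = 0
    for i in range(shard_count):
        offset = shards_start + SHARD_SIZE * i
        # Skip the 16-byte counter_id; read clock and count as BE 64-bit values.
        _clock, count = struct.unpack_from(">qq", context, offset + 16)
        total += count
    return total
```

Applied to the 36 context bytes of the single-shard example above, this returns 41.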
Key Takeaways
- Data.db is schema-driven and encodes partitions as unfiltered row streams.
- VInts and bit flags compactly encode sizes, timestamps, and cell metadata.
- Tombstones and TTLs are first-class and affect reconciliation.
Troubleshooting
- If parsed sizes seem inconsistent, verify VInt decoding and endian assumptions.
- For collections with unexpected nulls, check for element tombstones and TTL expiration handling.
References
- Cassandra 5.0.8:
  - Rows and tombstones: org.apache.cassandra.db.rows.* (Unfiltered, RangeTombstoneMarker)
  - Serialization header: org.apache.cassandra.db.SerializationHeader
For implementation details, see Appendix C.
V5CompressedLegacy Row Header Format (Cassandra 5.0)
The V5CompressedLegacy format (BigFormat with compression, “nb” file prefix) uses a structured row header with delta-encoded metadata fields. This format is used by Cassandra 5.0 SSTables with the legacy “big” format and compression enabled.
Row Structure (Corrected - Issue #213, Issue #237)
The complete row format, confirmed via Cassandra’s UnfilteredSerializer.java:

    [row_flags: u8]
    [extended_flags: u8 if 0x80 set]
    [clustering_prefix: variable]       ← For tables with clustering keys
    [row_size: VInt]                    ← Size measured from AFTER this VInt (Issue #237)
    [prev_size: VInt]
    [timestamp: VInt if 0x04 set]       ← Delta from min_timestamp (unsigned VInt)
    [ttl: VInt32 if 0x08 set]           ← TTL delta from min_ttl (unsigned VInt32)
    [liveness_ldt: VInt32 if 0x08 set]  ← Local expiration time delta from min_local_deletion_time (unsigned VInt32)
    [deletion: 2 VInts if 0x10 set]     ← markedForDeleteAt delta (unsigned VInt) + local_deletion_time delta (unsigned VInt32)
    [column_bitmap: VInt + bytes if NOT 0x20]
    [cell_data...]

Critical Notes:
- Clustering Prefix Ordering (Issue #213): For tables WITH clustering keys, the clustering prefix comes IMMEDIATELY after the flags and BEFORE row_size. This differs from the initial documentation, which placed row_size immediately after the flags.
- row_size Measurement (Issue #237): The row_size value is measured from the position AFTER the row_size VInt is consumed, NOT from where the VInt starts. This matches Cassandra’s getFilePointer() semantics: next_row_offset = position_after_row_size_vint + row_size_value
- No Trailing Field (Issue #237): There is NO trailing field after row data in the V5CompressedLegacy format. The next row or marker starts exactly row_size bytes after the position following the VInt.
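
The row_size arithmetic in these notes can be sketched as follows. This is a minimal Python illustration, not CQLite's parser: `read_unsigned_vint` and `next_row_offset` are hypothetical helper names, and the decoder assumes Cassandra's unsigned VInt layout, in which the number of leading 1-bits in the first byte gives the number of extra bytes that follow.

```python
from typing import Tuple


def read_unsigned_vint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned VInt at pos; return (value, position after it)."""
    first = buf[pos]
    extra = 0
    # Count leading 1-bits in the first byte (0 to 8 extra bytes).
    while extra < 8 and first & (0x80 >> extra):
        extra += 1
    if pos + 1 + extra > len(buf):
        raise ValueError("truncated VInt")
    value = first & (0xFF >> extra)  # remaining low bits of the first byte
    for i in range(extra):
        value = (value << 8) | buf[pos + 1 + i]
    return value, pos + 1 + extra


def next_row_offset(buf: bytes, row_size_pos: int) -> int:
    """Offset of whatever follows the row whose row_size VInt is at row_size_pos."""
    row_size, after_vint = read_unsigned_vint(buf, row_size_pos)
    # row_size counts bytes starting AFTER the VInt, not from its first byte.
    return after_vint + row_size
```

Measuring from `after_vint` rather than `row_size_pos` is the whole point of the Issue #237 note: using the VInt's start position would land the parser one or more bytes short of the next row.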
Clustering Prefix Format
For tables with clustering keys, values are encoded between flags and row_size:

    [header: VInt]              ← 2 bits per clustering column
    [value_1: type-specific]    ← Only if state indicates PRESENT
    [value_2: type-specific]
    ...

The header VInt uses 2 bits per column to indicate state:
- 00 (0): Value PRESENT - followed by type-specific bytes
- 01 (1): Value EMPTY - zero-length (no bytes follow)
- 10 (2): Value NULL - no bytes follow
- 11 (3): Reserved
Type-specific encoding:
- Fixed-width types (timestamp, int, bigint, UUID): Raw bytes, no length prefix
- Variable-width types (text, varchar, blob): VInt length prefix + bytes
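
A short Python sketch of the 2-bit state header may help here. It assumes that column 0 occupies the two least-significant bits (the packing order this chapter gives in its writing section) and that the table has few enough clustering columns to fit one header value. The function names are hypothetical.

```python
from typing import List

PRESENT, EMPTY, NULL = 0b00, 0b01, 0b10


def build_clustering_header(states: List[int]) -> int:
    """Pack per-column states into one header value, column 0 in bits 0-1."""
    header = 0
    for index, state in enumerate(states):
        if state not in (PRESENT, EMPTY, NULL):
            raise ValueError("state 0b11 is reserved")
        header |= state << (2 * index)
    return header


def clustering_states(header: int, column_count: int) -> List[int]:
    """Unpack the per-column states from a header value."""
    return [(header >> (2 * index)) & 0b11 for index in range(column_count)]
```

With two clustering columns, a present first column and a NULL second column pack to `0b1000` (0x08), and two present columns pack to 0x00.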
Unfiltered Markers (Issue #229 Fix)
Between rows in a partition, or at the end of a partition, the parser may encounter special markers:
| Marker | Hex | Meaning |
|---|---|---|
| END_OF_PARTITION | 0x01 | Signals end of partition - nothing follows this byte |
| IS_MARKER | 0x02 | Range tombstone marker (boundary or bound) |
END_OF_PARTITION (0x01):
- Written by UnfilteredSerializer.writeEndOfPartition() as exactly 0x01
- When detected, the partition is complete; move to next partition
- Critical for tables with clustering keys to avoid misinterpreting marker as row data
IS_MARKER (0x02):
- Indicates a range tombstone boundary
- Followed by clustering bound/boundary data and deletion time(s)
- Must be skipped when parsing row data
Implementation Note: CQLite uses a bitwise AND, (flags & IS_MARKER) != 0, to detect IS_MARKER because markers can have additional flag bits set (e.g., 0x52 = IS_MARKER | HAS_DELETION | HAS_COMPLEX_DELETION). END_OF_PARTITION still uses an exact match (0x01) because it is always written alone, without other flags.
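
The detection rule in this note can be sketched in a few lines of Python. `classify_unfiltered` is a hypothetical name used only for illustration; the constants come from the marker table above.

```python
END_OF_PARTITION = 0x01
IS_MARKER = 0x02


def classify_unfiltered(flags: int) -> str:
    """Classify the first byte of an unfiltered entry."""
    if flags == END_OF_PARTITION:
        # Exact match: this byte is always written alone.
        return "end_of_partition"
    if (flags & IS_MARKER) != 0:
        # Bitwise test: markers may carry additional flag bits.
        return "range_tombstone_marker"
    return "row"
```

Checking END_OF_PARTITION first keeps the exact-match case from being shadowed by any later bit tests.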
Row Flags
| Hex | Flag | Meaning | Details |
|---|---|---|---|
| 0x04 | HAS_TIMESTAMP | Timestamp delta present | Delta-encoded from Statistics.db min_timestamp |
| 0x08 | HAS_TTL | TTL delta present | Delta-encoded from Statistics.db min_ttl |
| 0x10 | HAS_DELETION | Deletion time present | Two VInts in Cassandra canonical order: (1) markedForDeleteAt delta (unsigned VInt, base min_timestamp, microseconds — the authoritative tombstone reconciliation timestamp), then (2) local_deletion_time delta (unsigned VInt32, base min_local_deletion_time, seconds). See DeletionTime.Serializer / SerializationHeader.writeDeletionTime. |
| 0x20 | HAS_ALL_COLUMNS | All columns present (no bitmap) | When set, all schema columns have values (no NULLs) |
| 0x80 | EXTENSION_FLAG (source) / HAS_EXTENDED_FLAGS (guide alias) | Extended flags byte follows | In Cassandra's UnfilteredSerializer the extended byte carries IS_STATIC (0x01) and HAS_SHADOWABLE_DELETION (0x02) |
Delta Decoding
All metadata fields use delta encoding against minimum values from Statistics.db:

    absolute_timestamp            = min_timestamp + timestamp_delta
    absolute_ttl                  = min_ttl + ttl_delta
    absolute_marked_for_delete_at = min_timestamp + marked_for_delete_at_delta           # microseconds (reconciliation ts)
    absolute_local_deletion_time  = min_local_deletion_time + local_deletion_time_delta  # seconds

Example: If Statistics.db shows min_timestamp = 1759713125983682 and the row header contains timestamp_delta = 1000, the absolute timestamp is 1759713125984682 (microseconds since epoch).
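
The delta arithmetic is simple enough to express directly. The Python sketch below bundles the three Statistics.db baselines into a hypothetical `StatsBaselines` class (the class and method names are assumptions for this example) and reproduces the worked timestamp example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatsBaselines:
    """Hypothetical holder for the Statistics.db minimums used as delta bases."""
    min_timestamp: int            # microseconds since epoch
    min_ttl: int                  # seconds
    min_local_deletion_time: int  # seconds since epoch

    def timestamp(self, delta: int) -> int:
        return self.min_timestamp + delta

    def ttl(self, delta: int) -> int:
        return self.min_ttl + delta

    def marked_for_delete_at(self, delta: int) -> int:
        # Shares the timestamp baseline (microseconds).
        return self.min_timestamp + delta

    def local_deletion_time(self, delta: int) -> int:
        return self.min_local_deletion_time + delta


stats = StatsBaselines(
    min_timestamp=1759713125983682, min_ttl=0, min_local_deletion_time=0
)
absolute = stats.timestamp(1000)
```

Here `absolute` equals 1759713125984682, matching the example above.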
Column Bitmap
When HAS_ALL_COLUMNS (0x20) is NOT set, a columns-subset field follows the metadata fields.
Cassandra’s on-disk format (Columns.Serializer.serializeSubset, Columns.java:503-531) encodes missing columns, not present columns:
- < 64 columns in superset: write a single unsigned VInt where bit = 1 means the column at that index is absent.
- ≥ 64 columns: write an unsigned VInt32 count of missing columns, then either the indices of present columns or the indices of missing columns (whichever set is smaller) as unsigned VInt32 deltas.

The CQLite write path uses a simplified bitmap format:

    [column_count: VInt]                      ← CQLite internal
    [bitmap_bytes: (column_count + 7) / 8]    ← Bit = 1 means column PRESENT (CQLite convention)

Note: CQLite’s write-side bitmap (bit=1 = present) is the inverse of Cassandra’s serializeSubset (bit=1 = missing). Parsers reading Cassandra-produced SSTables must use the authoritative subset encoding above; the CQLite bitmap is only produced by CQLite’s own writer and is parsed by CQLite’s own reader accordingly.
Example (Cassandra format): For a table with 10 columns, if columns 1 and 3 are absent:
- Bitmap VInt: bit 1 and bit 3 set = 0b00001010 = 0x0a
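
For the small-superset case (fewer than 64 columns), the Cassandra convention of setting a bit for each missing column can be sketched like this. The Python function names are hypothetical, and the sketch covers only the bitmap value itself, not its VInt encoding or the 64-column-and-above path.

```python
from typing import Iterable, List


def encode_missing_bitmap(present: Iterable[int], superset_count: int) -> int:
    """Bitmap value where bit i = 1 means column i is ABSENT (Cassandra convention)."""
    if not 0 <= superset_count < 64:
        raise ValueError("this sketch only covers supersets of fewer than 64 columns")
    present_set = set(present)
    bitmap = 0
    for index in range(superset_count):
        if index not in present_set:
            bitmap |= 1 << index
    return bitmap


def decode_present_columns(bitmap: int, superset_count: int) -> List[int]:
    """Indices of columns that are present, given a missing-columns bitmap."""
    return [i for i in range(superset_count) if not (bitmap >> i) & 1]
```

A row that has every column except 1 and 3 out of 10 encodes to 0x0a, as in the example above.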
Validation
This format specification is confirmed through:
- Implementation: cqlite-core/src/storage/sstable/reader/parsing/v5_compressed_legacy.rs
- Cassandra Source: org.apache.cassandra.db.rows.UnfilteredSerializer.java (lines 151-210)
- Integration tests: 26 of 33 test tables pass (tables with clustering keys now work)
- Test data: Real Cassandra 5.0 SSTables including sensor_data, wide_partition_table, app_metrics
References
- Cassandra 5.0.0 Source: org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer
- SerializationHeader: Delta encoding semantics for Statistics.db integration
- Implementation research: See docs/sstables-definitive-guide/ISSUE_162_LEARNINGS.md for detailed findings
Writing Data.db Files
This section documents how to construct valid V5CompressedLegacy Data.db files for write operations. All write operations must maintain the format described above while adhering to strict ordering and encoding rules.
Partition Ordering
Partitions MUST be written in ascending order of their Murmur3 token values. Partitions that share a token (rare but possible) are ordered lexicographically by partition key bytes.
Enforcement: The caller (write engine) is responsible for partition ordering. The DataWriter accepts partitions in the order provided.
Partition Header Format
Each partition begins with:

    [key_length: u16 BE]   ← Partition key length (2 bytes, big-endian)
    [key_bytes]            ← Raw partition key bytes
    [deletion_time]        ← 1 byte 0x80 if LIVE; 12 bytes (u64 mfda + u32 ldt) if deleted

Source: SortedTablePartitionWriter.java:104-105. ByteBufferUtil.writeWithShortLength(key.getKey(), writer) writes a 2-byte big-endian u16 length followed by the key bytes; then DeletionTime.getSerializer(version).serialize(...) writes the deletion time.
There is no leading partition_flags byte and no trailing unknown_field. The partition key length is a 2-byte unsigned short (max 65,535 bytes), not a 1-byte limit.
End-of-Partition Marker: Each partition ends with a single byte 0x01 (END_OF_PARTITION) after all rows.
Row Ordering
Within a partition, rows MUST be ordered by their clustering keys according to the table’s clustering order (ASC or DESC per column). For tables without clustering keys, there is at most one row per partition.
Enforcement: The caller must provide rows in the correct clustering order. The DataWriter writes rows in the order provided.
Writing Partitions
Complete partition structure:

    [partition_header]        ← As described above
    [row_1]                   ← Multiple rows in clustering order
    [row_2]
    ...
    [END_OF_PARTITION: 0x01]  ← Single byte marker

Row Flag Construction
Row flags are constructed by OR-ing flag bits based on the row’s properties:
| Condition | Flag | Hex | Result |
|---|---|---|---|
| Timestamp present (always for writes) | ROW_HAS_TIMESTAMP | 0x04 | Include timestamp delta |
| TTL specified | ROW_HAS_TTL | 0x08 | Include TTL delta |
| Row deletion | ROW_HAS_DELETION | 0x10 | Include deletion fields |
| All columns present (no NULLs) | ROW_HAS_ALL_COLUMNS | 0x20 | Skip column bitmap |
| Complex column deletion | ROW_HAS_COMPLEX_DELETION | 0x40 | Complex deletion present |
| Extended flags follow | EXTENSION_FLAG | 0x80 | Extended flags byte follows |
ROW_HAS_ALL_COLUMNS Truth Table:
This flag is set when ALL of these conditions are true:
- All operations are writes (no deletes)
- No NULL values present
- Number of columns matches schema column count
| All Writes? | No NULLs? | Column Count Matches? | HAS_ALL_COLUMNS |
|---|---|---|---|
| Yes | Yes | Yes | SET (0x20) |
| Yes | Yes | No | NOT SET |
| Yes | No | Yes | NOT SET |
| No | Yes | Yes | NOT SET |
Example:
- Row with timestamp, no TTL, all columns present: 0x04 | 0x20 = 0x24
- Row with timestamp and TTL, some columns NULL: 0x04 | 0x08 = 0x0c
- Row with timestamp, TTL, and deletion: 0x04 | 0x08 | 0x10 = 0x1c
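
The OR-ing described above can be sketched as a small Python helper. The constant values come from the flag table in this section; `build_row_flags` and its keyword arguments are hypothetical names chosen for this example.

```python
ROW_HAS_TIMESTAMP = 0x04
ROW_HAS_TTL = 0x08
ROW_HAS_DELETION = 0x10
ROW_HAS_ALL_COLUMNS = 0x20
ROW_HAS_COMPLEX_DELETION = 0x40
EXTENSION_FLAG = 0x80


def build_row_flags(
    *,
    has_ttl: bool = False,
    has_deletion: bool = False,
    has_all_columns: bool = False,
    has_complex_deletion: bool = False,
    has_extended_flags: bool = False,
) -> int:
    """OR together the row flag bits; the timestamp bit is always set on writes."""
    flags = ROW_HAS_TIMESTAMP
    if has_ttl:
        flags |= ROW_HAS_TTL
    if has_deletion:
        flags |= ROW_HAS_DELETION
    if has_all_columns:
        flags |= ROW_HAS_ALL_COLUMNS
    if has_complex_deletion:
        flags |= ROW_HAS_COMPLEX_DELETION
    if has_extended_flags:
        flags |= EXTENSION_FLAG
    return flags
```

The three examples above correspond to `build_row_flags(has_all_columns=True)`, `build_row_flags(has_ttl=True)`, and `build_row_flags(has_ttl=True, has_deletion=True)`.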
Cell Flag Construction
Cell flags are constructed based on cell properties:
| Condition | Flag | Hex | Result |
|---|---|---|---|
| Cell is tombstone | CELL_IS_DELETED | 0x01 | Include deletion fields |
| Cell has TTL | CELL_IS_EXPIRING | 0x02 | Include TTL fields |
| Value is empty string | CELL_HAS_EMPTY_VALUE | 0x04 | Zero-length value |
| Use row timestamp | CELL_USE_ROW_TIMESTAMP | 0x08 | Skip cell timestamp |
| Use row TTL | CELL_USE_ROW_TTL | 0x10 | Skip cell TTL |
CELL_USE_ROW_TIMESTAMP Truth Table:
Most cells use the row-level timestamp for efficiency. Cells need their own timestamp when the cell timestamp differs from the row timestamp (e.g., different write operations).
| Cell Type | Timestamp Differs? | USE_ROW_TIMESTAMP |
|---|---|---|
| Regular write | No | SET (0x08) |
| Regular write | Yes | NOT SET |
| Tombstone | N/A | NOT SET (always own timestamp) |
CELL_IS_DELETED Truth Table:
Tombstone cells have special flag requirements:
| Cell Operation | Flag Bits | Timestamp | Deletion Time | Value |
|---|---|---|---|---|
| Regular write | 0x08 (USE_ROW_TIMESTAMP) | Skip | Skip | Present |
| Regular write (own TS) | 0x00 | Include delta | Skip | Present |
| Empty string write | 0x08 OR 0x04 = 0x0c | Skip | Skip | Zero-length |
| Tombstone | 0x01 | Include delta | Include delta | None |
Example:
- Normal cell using row timestamp: 0x08
- Empty string using row timestamp: 0x08 | 0x04 = 0x0c
- Cell with own timestamp: 0x00 (no flags)
- Tombstone: 0x01 (no USE_ROW_TIMESTAMP)
- Expiring cell with row timestamp: 0x08 | 0x02 = 0x0a
Critical: Tombstones MUST NOT use CELL_USE_ROW_TIMESTAMP (0x08). Tombstones always include their own timestamp delta.
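
One way to enforce the tombstone rule at construction time is sketched below in Python. The constants follow the cell flag table in this section; `build_cell_flags` is a hypothetical helper, and rejecting the invalid combination with a `ValueError` is a design choice made for this example rather than documented CQLite behavior.

```python
CELL_IS_DELETED = 0x01
CELL_IS_EXPIRING = 0x02
CELL_HAS_EMPTY_VALUE = 0x04
CELL_USE_ROW_TIMESTAMP = 0x08
CELL_USE_ROW_TTL = 0x10


def build_cell_flags(
    *,
    is_tombstone: bool = False,
    has_ttl: bool = False,
    is_empty: bool = False,
    use_row_timestamp: bool = False,
    use_row_ttl: bool = False,
) -> int:
    """OR together the cell flag bits, rejecting tombstones that reuse the row timestamp."""
    if is_tombstone and use_row_timestamp:
        raise ValueError("tombstones must carry their own timestamp delta")
    flags = 0
    if is_tombstone:
        flags |= CELL_IS_DELETED
    if has_ttl:
        flags |= CELL_IS_EXPIRING
    if is_empty:
        flags |= CELL_HAS_EMPTY_VALUE
    if use_row_timestamp:
        flags |= CELL_USE_ROW_TIMESTAMP
    if use_row_ttl:
        flags |= CELL_USE_ROW_TTL
    return flags
```

Failing early in the builder keeps an invalid flag byte from ever reaching the serialized output.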
NULL vs Empty Values
The format distinguishes between NULL and empty values:
NULL Values:
- NOT written as cells in the cell data section
- Represented by absence in the column bitmap (bit = 0)
- Presence of NULL prevents ROW_HAS_ALL_COLUMNS flag
Empty Values (e.g., the empty string ''):
- Written as cells with CELL_HAS_EMPTY_VALUE flag (0x04)
- Zero-length value (value_length VInt = 0)
- Counted as “present” in column bitmap (bit = 1)
Example Column Bitmap:
For a table with columns [name, age, city]:
- Row with name='Alice', age=NULL, city='':
  - Bitmap: 0b101 (name present, age absent, city present)
  - Cells: Two cells (name, city); city has the HAS_EMPTY_VALUE flag
Delta Encoding
All temporal metadata uses delta encoding against Statistics.db baseline values:

Timestamp Delta (unsigned VInt; SerializationHeader.writeTimestamp calls writeUnsignedVInt):

    timestamp_delta = mutation_timestamp - min_timestamp

TTL Delta (unsigned VInt32; SerializationHeader.writeTTL calls writeUnsignedVInt32):

    ttl_delta = mutation_ttl - min_ttl

Local Deletion Time Delta (unsigned VInt):

    deletion_time_delta = local_deletion_time - min_local_deletion_time

Constraints:
- Timestamp deltas MUST be >= 0: min_timestamp is the minimum across all rows, so every delta is non-negative.
- TTL deltas MUST be >= 0: min_ttl is the minimum across all rows.
- Deletion time deltas MUST be >= 0 (error if local_deletion_time < min_local_deletion_time).
- All three fields use unsigned encoding because the baselines guarantee non-negative deltas.
Delta Encoding Examples
Example 1: Simple Row with Timestamp Delta
Given Statistics.db values:

    min_timestamp = 1000000 (microseconds)
    min_ttl = 0
    min_local_deletion_time = 0

Row with timestamp = 1005000:

    [0x04]         Row flags: HAS_TIMESTAMP
    [VInt(5000)]   Timestamp delta: 1005000 - 1000000 = 5000
                   Unsigned VInt(5000) = 0x93 0x88 (2 bytes)

Byte-level encoding of unsigned VInt(5000):
- No ZigZag: timestamps use writeUnsignedVInt
- 5000 = 0x1388; 2-byte encoding: (0x13 | 0x80) = 0x93, 0x88
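
The byte-level steps above generalize to a small encoder. The Python sketch below assumes Cassandra's unsigned VInt layout: the first byte carries one leading 1-bit per extra byte, and the value's remaining bits follow most-significant first. `encode_unsigned_vint` is a hypothetical name for this illustration.

```python
def encode_unsigned_vint(value: int) -> bytes:
    """Encode a 64-bit unsigned integer as a Cassandra-style unsigned VInt."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value must fit in an unsigned 64-bit integer")
    bits = max(value.bit_length(), 1)
    # Each encoded byte contributes 7 payload bits, capped at 9 bytes total.
    size = min(9, (bits + 6) // 7)
    if size == 1:
        return bytes([value])
    extra = size - 1
    body = value.to_bytes(size, "big")
    # Set `extra` leading 1-bits in the first byte to announce the length.
    length_marker = ~(0xFF >> extra) & 0xFF
    return bytes([body[0] | length_marker]) + body[1:]
```

For 5000 this yields the two bytes 0x93 0x88 from the example: 13 significant bits need two bytes, and the single extra byte is announced by the 0x80 marker.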
Example 2: Row with TTL
Row with timestamp = 1005000, ttl = 7200:

    [0x0c]         Row flags: HAS_TIMESTAMP | HAS_TTL (0x04 | 0x08)
    [VInt(5000)]   Timestamp delta
    [VInt(7200)]   TTL delta: 7200 - 0 = 7200

Example 3: Cell Timestamp Delta
Cell with own timestamp (not using row timestamp):

    [0x00]                 Cell flags: no USE_ROW_TIMESTAMP
    [VInt(2000)]           Timestamp delta from min_timestamp
    [VInt(value_length)]   Value length
    [value_bytes]          Value data

Example 4: Tombstone Cell
Tombstone with timestamp = 1003000, local_deletion_time = 1700000100:

    [0x01]                Cell flags: IS_DELETED (no USE_ROW_TIMESTAMP)
    [VInt(3000)]          Timestamp delta: 1003000 - 1000000
    [VUInt(1700000100)]   Deletion time delta: 1700000100 - 0 (unsigned)

Example 5: No Negative Deltas
Negative timestamp deltas cannot occur in a valid SSTable: min_timestamp is computed as
the minimum of all row timestamps, so every row’s delta is >= 0. If you encounter a
value smaller than min_timestamp, the SSTable is malformed.
Clustering Prefix Encoding
For tables with clustering keys, values are encoded immediately after row flags:
Header VInt Construction:
- 2 bits per clustering column, packed into a VInt
- Bits are packed starting from LSB (column 0 uses bits 0-1)

State Values:
- 00 (0): PRESENT - value bytes follow
- 01 (1): EMPTY - zero-length value (no bytes)
- 10 (2): NULL - no value (no bytes)
- 11 (3): Reserved

Type-Specific Encoding:
Fixed-width types (no length prefix):
- int: 4 bytes (BE)
- bigint: 8 bytes (BE)
- timestamp: 8 bytes (BE)
- uuid: 16 bytes (raw)

Variable-width types (VInt length + bytes):
- text/varchar: VInt(byte_length) + UTF-8 bytes
- blob: VInt(byte_length) + raw bytes
Example: Table with clustering keys (timestamp, text):
Row with clustering = (1234567890, "sensor1"):

    [0x00]                                              Header VInt: both PRESENT (0b0000)
    [0x00, 0x00, 0x00, 0x00, 0x49, 0x96, 0x02, 0xD2]    timestamp (8 bytes)
    [0x07]                                              VInt length (7 bytes)
    [0x73, 0x65, 0x6E, 0x73, 0x6F, 0x72, 0x31]          "sensor1" UTF-8

Row with clustering = (1234567890, NULL):

    [0x08]                                              Header VInt: timestamp PRESENT (00 in bits 0-1), text NULL (10 in bits 2-3) = 0b1000
    [0x00, 0x00, 0x00, 0x00, 0x49, 0x96, 0x02, 0xD2]    timestamp (8 bytes)
                                                        No text bytes (NULL)

Column Bitmap Encoding
When ROW_HAS_ALL_COLUMNS is NOT set, a column bitmap is required:

    [column_count: VInt]             ← Total columns in schema
    [bitmap_bytes: (count + 7) / 8]  ← Bit = 1 means column present (CQLite convention)

Bit Mapping:
- Column index determines bit position
- Bit position = column_index (0-based)
- Byte index = column_index / 8
- Bit index within byte = column_index % 8
Example: 10 columns, columns [0, 2, 5, 9] have values:

    [0x0a]                      VInt(10) - column count
    [0b00100101, 0b00000010]    2 bytes for 10 columns
                                Byte 0: bits for columns 0-7
                                Byte 1: bits for columns 8-9

Bit positions:
- Column 0: byte 0, bit 0 = SET
- Column 2: byte 0, bit 2 = SET
- Column 5: byte 0, bit 5 = SET
- Column 9: byte 1, bit 1 = SET
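
The bit mapping above can be sketched as follows. This Python example produces only the bitmap bytes of the CQLite-convention format (bit = 1 means present); the leading column_count VInt is left out, and `encode_presence_bitmap` is a hypothetical name.

```python
from typing import Iterable


def encode_presence_bitmap(present: Iterable[int], column_count: int) -> bytes:
    """Build (column_count + 7) // 8 bytes where bit = 1 marks a present column."""
    bitmap = bytearray((column_count + 7) // 8)
    for index in present:
        if not 0 <= index < column_count:
            raise ValueError(f"column index {index} out of range")
        # Byte index = index // 8, bit within the byte = index % 8.
        bitmap[index // 8] |= 1 << (index % 8)
    return bytes(bitmap)
```

For 10 columns with values in columns 0, 2, 5, and 9, this yields the two bytes 0b00100101 and 0b00000010 from the example.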
Cell Data Format
Regular Cell (live value):

    [flags: u8]                                       ← Cell flags
    [timestamp_delta: VInt if NOT USE_ROW_TIMESTAMP]  ← Delta from min_timestamp
    [value_length: VInt]                              ← Byte length of value
    [value_bytes]                                     ← Type-specific serialization

Tombstone Cell (deleted):

    [flags: u8]                    ← CELL_IS_DELETED (0x01)
    [timestamp_delta: VInt]        ← Delta from min_timestamp (required)
    [deletion_time_delta: VUInt]   ← Delta from min_local_deletion_time

Note: Tombstones do NOT have value_length or value_bytes fields. The parser returns immediately after reading the deletion time delta.
Cell Value Serialization
Type-specific serialization rules for cell values:
| Type | Format | Example |
|---|---|---|
| boolean | 1 byte | true = 0x01, false = 0x00 |
| tinyint | 1 byte (signed) | -5 = 0xFB |
| smallint | 2 bytes BE | 300 = 0x01 0x2C |
| int | 4 bytes BE | 42 = 0x00 0x00 0x00 0x2A |
| bigint | 8 bytes BE | 1000 = 0x00 0x00 0x00 0x00 0x00 0x00 0x03 0xE8 |
| float | 4 bytes BE (IEEE 754) | 3.14f = 0x40 0x48 0xF5 0xC3 |
| double | 8 bytes BE (IEEE 754) | 3.14 = 0x40 0x09 0x1E 0xB8 0x51 0xEB 0x85 0x1F |
| text/varchar | UTF-8 bytes (no prefix) | “test” = 0x74 0x65 0x73 0x74 |
| blob | Raw bytes (no prefix) | Binary data as-is |
| timestamp | 8 bytes BE (milliseconds) | Epoch milliseconds |
| date | 4 bytes BE unsigned (days since epoch + 2^31) | Epoch day 0 = 0x80 0x00 0x00 0x00 |
| time | 8 bytes BE (nanoseconds) | Nanoseconds since midnight |
| uuid/timeuuid | 16 bytes (raw) | UUID bytes |
| inet | 4 or 16 bytes | IPv4 (4) or IPv6 (16) |
| varint | Variable-length BE signed | Big integer, no length prefix |
| decimal | 4 bytes scale + varint | Scale (BE i32) + unscaled value |
| duration | 3 signed VInts | months, days, nanoseconds |
Special Cases:
- Empty string: Zero-length value with CELL_HAS_EMPTY_VALUE flag
- NULL: Not written as a cell (represented by bitmap absence)
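- Date encoding: shift the days-since-epoch value by 2^31 for storage (in 32-bit wraparound arithmetic, adding or subtracting Integer.MIN_VALUE gives the same result)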
- Decimal: Scale is 4-byte BE i32, followed by varint unscaled value
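
Several rows of the table above map directly onto Python's standard `struct` module. The sketch below covers only the fixed-width and text cases and uses hypothetical function names; it is meant to let readers check the example bytes, not to serve as a complete serializer.

```python
import struct


def encode_boolean(value: bool) -> bytes:
    return b"\x01" if value else b"\x00"


def encode_tinyint(value: int) -> bytes:
    return struct.pack(">b", value)  # 1 byte, signed


def encode_smallint(value: int) -> bytes:
    return struct.pack(">h", value)  # 2 bytes BE


def encode_int(value: int) -> bytes:
    return struct.pack(">i", value)  # 4 bytes BE


def encode_bigint(value: int) -> bytes:
    return struct.pack(">q", value)  # 8 bytes BE


def encode_float(value: float) -> bytes:
    return struct.pack(">f", value)  # IEEE 754 single precision


def encode_double(value: float) -> bytes:
    return struct.pack(">d", value)  # IEEE 754 double precision


def encode_text(value: str) -> bytes:
    return value.encode("utf-8")     # no length prefix


def encode_date(days_since_epoch: int) -> bytes:
    # Stored as an unsigned 32-bit value centered on 2^31.
    return struct.pack(">I", days_since_epoch + (1 << 31))
```

Each helper returns only the value bytes; the cell's flags, timestamp delta, and value_length are written separately as described in the cell data format.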
Write Operation Flow
Complete write sequence for a partition:
- Compute Statistics: Calculate min_timestamp, min_ttl, min_local_deletion_time from all mutations
- Initialize DataWriter: Create with computed statistics for delta encoding
- Order Partitions: Sort by Murmur3 token, then partition key bytes
- For Each Partition:
  - a. Write partition header
  - b. Order rows by clustering key
  - c. For each row:
    - Compute row flags
    - Write clustering prefix (if present)
    - Compute row_size (body bytes only)
    - Write row_size VInt
    - Write prev_size VInt (0 for now)
    - Write timestamp delta (if HAS_TIMESTAMP)
    - Write TTL delta (if HAS_TTL)
    - Write column bitmap (if NOT HAS_ALL_COLUMNS)
    - Write cells (skip NULLs)
  - d. Write END_OF_PARTITION marker (0x01)
- Finish: Return complete Data.db bytes
Critical: row_size is measured from AFTER the row_size VInt, not from where it starts (Issue #237).
Validation
This write specification is validated through:
- Implementation: cqlite-core/src/storage/sstable/writer/data_writer.rs
- Unit Tests: 20+ tests covering all encoding paths
- Round-trip Tests: Written SSTables are readable by both CQLite and Cassandra’s sstabledump
- Cassandra Source: Cross-referenced with org.apache.cassandra.db.rows.UnfilteredSerializer.java
References
- Implementation: cqlite-core/src/storage/sstable/writer/data_writer.rs
- Parser: cqlite-core/src/storage/sstable/reader/parsing/v5_compressed_legacy.rs
- Cassandra Source: org.apache.cassandra.db.rows.UnfilteredSerializer (lines 151-475)
- Issue Tracking: Issue #237 (row_size measurement), Issue #401 (tombstone encoding)