Data.db Format

This chapter describes the on-disk layout of partitions, rows, and cells in Data.db: how headers reference schema, how unfiltered rows, range tombstones, and markers are encoded, and how encodings like vints and cell flags are interpreted.

  • Partition headers and row/cluster layout basics
  • Cell value encodings, varints/vints, collections/UDTs
  • Deletions, range tombstones, TTLs and expiring cells
  • How readers interpret flags and headers during parsing

Minimal annotated example from test_basic/simple_table (trimmed and formatted):

partition key = 4d4321e2-662b-4ba1-b75f-48e080727a52
row liveness ts = 2025-09-16T22:14:23.739Z
cells: account_balance=21088.5, active=false, age=75, name=(utf8) ...

The underlying file is a partition stream: a serialization header followed by unfiltered rows and optional tombstone markers.

Figure: Data.db row layout (annotated Data.db partition/row/cell structure). Serialization header → unfiltered rows/markers → cells with flags and vints.

VInt parsing is Cassandra-compatible and is used throughout headers and length fields. For a concise implementation walkthrough, see Appendix C.
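As a rough illustration of unsigned VInt decoding, the sketch below assumes the Cassandra rule that the number of leading 1-bits in the first byte gives the number of extra bytes, with the remaining bits forming a big-endian value. The function name `read_unsigned_vint` is hypothetical and is not taken from CQLite.

```rust
/// Decodes one Cassandra-style unsigned VInt from the front of `buf`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// buffer is empty or truncated.
pub fn read_unsigned_vint(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    // The count of leading 1-bits in the first byte is the number of extra bytes.
    let extra = first.leading_ones() as usize;
    if buf.len() < 1 + extra {
        return None;
    }
    // The remaining low bits of the first byte are the most significant payload bits.
    // With 8 extra bytes the first byte carries no payload at all.
    let mut value = if extra >= 8 {
        0
    } else {
        u64::from(first & (0xFFu8 >> extra))
    };
    // Extra bytes follow most-significant first.
    for &b in &buf[1..1 + extra] {
        value = (value << 8) | u64::from(b);
    }
    Some((value, 1 + extra))
}
```

Returning the consumed length alongside the value lets a caller advance its cursor without re-inspecting the first byte.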

Readers interpret row/cell flags to distinguish live cells, TTLs, and tombstones; see Chapter 11 for tombstone semantics and Appendix B for a compact encoding summary.

Common cell flags (high level):

  • live cell vs tombstone
  • presence of timestamp, ttl, local deletion time
  • empty/expiring cells

Bit-level flags (Cassandra 5.0, authoritative references):

| Bit | Name | When set |
|-----|------|----------|
| 0 | IS_DELETED_MASK | Cell is a tombstone |
| 1 | IS_EXPIRING_MASK | Cell is expiring (has TTL) |
| 2 | HAS_EMPTY_VALUE_MASK | Cell value is empty (zero-length but present) |
| 3 | USE_ROW_TIMESTAMP_MASK | Cell timestamp equals row timestamp; the timestamp field is omitted |
| 4 | USE_ROW_TTL_MASK | Cell TTL/LDT equals row TTL/LDT; the TTL and LDT fields are omitted |
| 5+ | (reserved) | Format-specific extensions |
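The bit table above can be turned into a small decoder. This is a minimal sketch: the `CellFlags` struct and `decode_cell_flags` function are hypothetical names for illustration, while the mask values are the ones listed in the table.

```rust
// Cell flag masks, one bit each, as listed in the table above.
pub const IS_DELETED_MASK: u8 = 0x01;
pub const IS_EXPIRING_MASK: u8 = 0x02;
pub const HAS_EMPTY_VALUE_MASK: u8 = 0x04;
pub const USE_ROW_TIMESTAMP_MASK: u8 = 0x08;
pub const USE_ROW_TTL_MASK: u8 = 0x10;

/// Hypothetical decoded view of a cell flags byte (not a CQLite type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CellFlags {
    pub is_deleted: bool,
    pub is_expiring: bool,
    pub has_empty_value: bool,
    pub use_row_timestamp: bool,
    pub use_row_ttl: bool,
}

/// Splits a flags byte into its individual bits. Reserved bits are ignored.
pub fn decode_cell_flags(flags: u8) -> CellFlags {
    CellFlags {
        is_deleted: (flags & IS_DELETED_MASK) != 0,
        is_expiring: (flags & IS_EXPIRING_MASK) != 0,
        has_empty_value: (flags & HAS_EMPTY_VALUE_MASK) != 0,
        use_row_timestamp: (flags & USE_ROW_TIMESTAMP_MASK) != 0,
        use_row_ttl: (flags & USE_ROW_TTL_MASK) != 0,
    }
}
```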

Authoritative classes to consult in Cassandra 5.0:

  • org.apache.cassandra.db.rows.* (e.g., Unfiltered, Cell, BufferCell)
  • org.apache.cassandra.db.SerializationHeader
  • org.apache.cassandra.db.rows.SerializationHelper

Endianness:

  • Integers in SSTable payloads are big-endian unless otherwise specified; varints are MSB-first variable-length.
  • Network/binary compatibility relies on consistent big-endian parsing for fixed-width fields.

Deletions and expiry:

  • Partition tombstone: marks entire partition deleted at a timestamp
  • Row tombstone: targets a specific clustering row
  • Range tombstone: spans clustering ranges
  • TTL/expiring: cells carry ttl and local deletion time; expired cells are omitted at read

Collections (list/set/map) have two storage modes:

  • Frozen (frozen<list<...>>): Single-cell storage, entire collection serialized as one blob
  • Non-frozen (list<...>): Multi-cell storage, each element stored as separate cell

Frozen collections are stored as a single cell with the entire collection serialized as a binary blob. The format uses 4-byte big-endian i32 length prefixes (matching Java’s serialization format).

Frozen List/Set Format (identical for both types):

[i32 BE: element_count]
[for each element:
[i32 BE: element_length]
[element_bytes]
]

Example (frozen list with 2 integers):

Hex:
00 00 00 02                count = 2
00 00 00 04 00 00 00 2A    elem1: len=4, val=42
00 00 00 04 00 00 00 64    elem2: len=4, val=100
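A minimal parsing sketch for the frozen list/set layout, assuming the blob contains no null elements (a negative element length is rejected here rather than interpreted). The function names are hypothetical and do not refer to CQLite's reader.

```rust
/// Reads a big-endian i32 at `*pos` and advances the cursor.
fn read_i32_be(buf: &[u8], pos: &mut usize) -> Option<i32> {
    let end = pos.checked_add(4)?;
    let b = buf.get(*pos..end)?;
    *pos = end;
    Some(i32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parses `[i32 count][i32 len][bytes]...` into raw element byte vectors.
/// Returns `None` on truncation or on a negative count/length.
pub fn parse_frozen_list(buf: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut pos = 0usize;
    let count = read_i32_be(buf, &mut pos)?;
    if count < 0 {
        return None;
    }
    let mut elements = Vec::new();
    for _ in 0..count {
        let len = read_i32_be(buf, &mut pos)?;
        if len < 0 {
            return None; // null elements are out of scope for this sketch
        }
        let end = pos.checked_add(len as usize)?;
        elements.push(buf.get(pos..end)?.to_vec());
        pos = end;
    }
    Some(elements)
}
```

The elements come back as raw bytes; interpreting them (for example as big-endian `int` values) depends on the element type in the schema.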

Frozen Map Format:

[i32 BE: entry_count]
[for each entry:
[i32 BE: key_length]
[key_bytes]
[i32 BE: value_length]
[value_bytes]
]

Example (frozen map with 1 entry: “a” -> 42):

Hex:
00 00 00 01                count = 1
00 00 00 01 61             key: len=1, "a"
00 00 00 04 00 00 00 2A    val: len=4, int(42)
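The write direction of the frozen map layout can be sketched the same way. This assumes keys and values are already serialized to bytes and that the caller supplies entries in the order they should appear; `encode_frozen_map_sketch` is a hypothetical name, not the `serialize_frozen_map()` function in CQLite.

```rust
/// Encodes `[i32 count]` followed by `[i32 key_len][key][i32 value_len][value]`
/// for each entry, all lengths big-endian. Returns `None` if any length does
/// not fit in an i32.
pub fn encode_frozen_map_sketch(entries: &[(&[u8], &[u8])]) -> Option<Vec<u8>> {
    let max = i32::MAX as usize;
    if entries.len() > max {
        return None;
    }
    let mut out = Vec::new();
    out.extend_from_slice(&(entries.len() as i32).to_be_bytes());
    for (key, value) in entries {
        if key.len() > max || value.len() > max {
            return None;
        }
        out.extend_from_slice(&(key.len() as i32).to_be_bytes());
        out.extend_from_slice(key);
        out.extend_from_slice(&(value.len() as i32).to_be_bytes());
        out.extend_from_slice(value);
    }
    Some(out)
}
```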

Non-frozen collections are stored as multiple cells, one per element or entry. Each cell has a cell_path that identifies the element and a cell_value that contains the data.

Non-frozen collection cell format (complex columns):

[flags: u8]
[timestamp: VInt if not USE_ROW_TIMESTAMP_MASK]
[local_deletion_time: VInt if deleted/expiring]
[ttl: VInt if expiring]
[cell_path: VInt length + bytes] ← See table below
[value: VInt length + bytes] ← See table below

Cell Path and Value by Collection Type:

| Collection Type | cell_path | cell_value |
|-----------------|-----------|------------|
| list<T> | TimeUUID (16 bytes) | Serialized element |
| set<T> | Serialized element | Empty (0 bytes) |
| map<K,V> | Serialized key | Serialized value |

List Element Ordering:

Lists use TimeUUID (UUID version 1) for the cell_path to maintain insertion order. TimeUUIDs are time-sortable, ensuring elements remain in the order they were written. Each element gets a unique TimeUUID generated at write time.

Example (list with 2 integers):

Cell 1:
path: f35cf98a-220c-11ef-8b04-f4ff7ffcf681 (16 bytes, TimeUUID)
value: 00 00 00 2A (4 bytes, int 42)
Cell 2:
path: f35cf98b-220c-11ef-8b04-f4ff7ffcf681 (16 bytes, TimeUUID)
value: 00 00 00 64 (4 bytes, int 100)

Set Element Storage:

Sets use the element value itself as the cell_path for efficient membership testing. The cell_value is always empty (0 bytes). This allows Cassandra to check set membership by looking for a cell with a matching path.

Example (set with 2 text values):

Cell 1:
path: 61 6C 70 68 61 (5 bytes, "alpha")
value: (empty, 0 bytes)
Cell 2:
path: 62 65 74 61 (4 bytes, "beta")
value: (empty, 0 bytes)

Map Entry Storage:

Maps use the serialized key as the cell_path and the serialized value as the cell_value. This allows efficient key lookups.

Example (map with 2 entries: 1->“one”, 2->“two”):

Cell 1:
path: 00 00 00 01 (4 bytes, int key 1)
value: 6F 6E 65 (3 bytes, "one")
Cell 2:
path: 00 00 00 02 (4 bytes, int key 2)
value: 74 77 6F (3 bytes, "two")

Implementation References:

  • Frozen collections: cqlite-core/src/storage/sstable/writer/data_writer.rs::serialize_frozen_list(), serialize_frozen_set(), serialize_frozen_map()
  • Non-frozen collections: serialize_nonfrozen_list(), serialize_nonfrozen_set(), serialize_nonfrozen_map()
  • Tests: cqlite-core/tests/collection_roundtrip_test.rs

UDTs (User-Defined Types) serialize fields in schema order with 4-byte BE length prefixes:

[field_1_length: 4-byte BE i32][field_1_data]
[field_2_length: 4-byte BE i32][field_2_data]
...

UDT field length semantics (confirmed via Issue #220):

  • -1 (0xFFFFFFFF): Field is NULL
  • 0 (0x00000000): Field is empty (zero-length but present)
  • >0: Number of bytes of field data following
  • Trailing omitted fields are implicitly NULL
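The length semantics above can be sketched as a small parser. It assumes the caller knows the schema's field count, maps a length of -1 to NULL, treats any other negative length as malformed, and fills fields missing at the end of the buffer with NULL. `parse_udt_fields` is a hypothetical helper name.

```rust
/// Parses a serialized UDT into one `Option<Vec<u8>>` per schema field:
/// `None` for NULL, `Some(vec![])` for empty, `Some(bytes)` otherwise.
pub fn parse_udt_fields(buf: &[u8], field_count: usize) -> Option<Vec<Option<Vec<u8>>>> {
    let mut fields = Vec::with_capacity(field_count);
    let mut pos = 0usize;
    for _ in 0..field_count {
        if pos == buf.len() {
            // Trailing omitted fields are implicitly NULL.
            fields.push(None);
            continue;
        }
        let hdr = buf.get(pos..pos.checked_add(4)?)?;
        let len = i32::from_be_bytes([hdr[0], hdr[1], hdr[2], hdr[3]]);
        pos += 4;
        if len == -1 {
            fields.push(None); // explicit NULL
            continue;
        }
        if len < 0 {
            return None; // other negative lengths are treated as malformed here
        }
        let end = pos.checked_add(len as usize)?;
        fields.push(Some(buf.get(pos..end)?.to_vec()));
        pos = end;
    }
    Some(fields)
}
```

Keeping NULL (`None`) and empty (`Some(vec![])`) as distinct results preserves the difference between lengths -1 and 0.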

Critical distinction: The outer type determines storage:

  • list<frozen<udt>> = multi-cell (each UDT element is separate cell)
  • frozen<list<udt>> = single-cell (entire list is one blob)

See tables/type-mapping-complex.md for detailed format specifications.

Counter columns store a CounterContext structure, not a raw i64 value. The CounterContext tracks counter updates across multiple replicas (shards).

Cell format:

[VInt length] ← Length of CounterContext bytes
[CounterContext] ← Variable-length structure below

CounterContext format (from CounterContext.java):

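| Field | Size | Description |
|-------|------|-------------|
| header_size | 2 bytes | BE signed short; number of header entries, one per local or global shard |
| indices | 2 * abs(header_size) bytes | Shard type indicators (negative = global) |
| shards | 32 bytes per shard | Each shard: counter_id (16) + clock (8) + count (8). Totals 32 * abs(header_size) bytes when every shard has a header entry; remote shards have no header entry, so in general the shard count is the remaining length divided by 32 |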

Shard structure (32 bytes each):

[counter_id: 16 bytes] ← Replica's CounterId (UUID)
[clock: 8 bytes] ← Logical clock (BE unsigned long)
[count: 8 bytes] ← Counter value for this shard (BE signed long)

The counter value is the sum of all shard counts, matching Cassandra’s total() function.

Example (single-shard counter):

24 ← VInt length (36 bytes)
0001 ← header_size = 1
8000 ← header index (0x8000 = global shard at index 0)
f35cf98a220c40fb8b04f4ff7ffcf681 ← counter_id (16 bytes)
00064073 23d1d210 ← clock (8 bytes)
00000000 00000029 ← count = 41 (8 bytes)
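A minimal sketch of summing shard counts from a CounterContext, operating on the context bytes only (the leading VInt length is assumed to have been consumed already). As an assumption of this sketch, the shard count is derived from the bytes remaining after the header rather than from header_size. `counter_total` is a hypothetical name.

```rust
/// Sums the `count` field of every 32-byte shard in a CounterContext.
/// Returns `None` if the context is truncated or the shard area is not a
/// multiple of 32 bytes.
pub fn counter_total(ctx: &[u8]) -> Option<i64> {
    let hdr = ctx.get(0..2)?;
    // The header entry count is stored as a signed short; use its absolute value.
    let header_entries = i16::from_be_bytes([hdr[0], hdr[1]]).unsigned_abs() as usize;
    // Skip the 2-byte header_size plus one 2-byte index per header entry.
    let shards_start = 2 + 2 * header_entries;
    let shards = ctx.get(shards_start..)?;
    if shards.len() % 32 != 0 {
        return None;
    }
    let mut total = 0i64;
    for shard in shards.chunks_exact(32) {
        // Shard layout: counter_id (16) + clock (8) + count (8).
        let mut count = [0u8; 8];
        count.copy_from_slice(&shard[24..32]);
        total = total.wrapping_add(i64::from_be_bytes(count));
    }
    Some(total)
}
```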

Reference: org.apache.cassandra.db.context.CounterContext in Cassandra 5.0 source.

  • Data.db is schema-driven and encodes partitions as unfiltered row streams.
  • VInts and bit flags compactly encode sizes, timestamps, and cell metadata.
  • Tombstones and TTLs are first-class and affect reconciliation.
  • If parsed sizes seem inconsistent, verify VInt decoding and endian assumptions.
  • For collections with unexpected nulls, check for element tombstones and TTL expiration handling.

For implementation details, see Appendix C.

V5CompressedLegacy Row Header Format (Cassandra 5.0)

The V5CompressedLegacy format (BigFormat with compression, “nb” file prefix) uses a structured row header with delta-encoded metadata fields. This format is used by Cassandra 5.0 SSTables with the legacy “big” format and compression enabled.

Row Structure (Corrected - Issue #213, Issue #237)

The complete row format, confirmed via Cassandra’s UnfilteredSerializer.java:

[row_flags: u8]
[extended_flags: u8 if 0x80 set]
[clustering_prefix: variable] ← For tables with clustering keys
[row_size: VInt] ← Size measured from AFTER this VInt (Issue #237)
[prev_size: VInt]
[timestamp: VInt if 0x04 set] ← Delta from min_timestamp (unsigned VInt)
[ttl: VInt32 if 0x08 set] ← TTL delta from min_ttl (unsigned VInt32)
[liveness_ldt: VInt32 if 0x08 set] ← Local expiration time delta from min_local_deletion_time (unsigned VInt32)
[deletion: 2 VInts if 0x10 set] ← markedForDeleteAt delta (unsigned VInt) + local_deletion_time delta (unsigned VInt32)
[column_bitmap: VInt + bytes if NOT 0x20]
[cell_data...]

Critical Notes:

  1. Clustering Prefix Ordering (Issue #213): For tables WITH clustering keys, the clustering prefix comes IMMEDIATELY after flags and BEFORE row_size. This differs from initial documentation which placed row_size immediately after flags.

  2. row_size Measurement (Issue #237): The row_size value is measured from the position AFTER the row_size VInt is consumed, NOT from where the VInt starts. This matches Cassandra’s getFilePointer() semantics:

    next_row_offset = position_after_row_size_vint + row_size_value
  3. No Trailing Field (Issue #237): There is NO trailing field after row data in V5CompressedLegacy format. The next partition or row starts immediately after row_size bytes from the position after the VInt.

For tables with clustering keys, values are encoded between flags and row_size:

[header: VInt] ← 2 bits per clustering column
[value_1: type-specific] ← Only if state indicates PRESENT
[value_2: type-specific]
...

The header VInt uses 2 bits per column to indicate state:

  • 00 (0): Value PRESENT - followed by type-specific bytes
  • 01 (1): Value EMPTY - zero-length (no bytes follow)
  • 10 (2): Value NULL - no bytes follow
  • 11 (3): Reserved

Type-specific encoding:

  • Fixed-width types (timestamp, int, bigint, UUID): Raw bytes, no length prefix
  • Variable-width types (text, varchar, blob): VInt length prefix + bytes
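A short sketch of unpacking the clustering header, assuming column 0 occupies the two least-significant bits (the packing order the write-side description in this chapter uses) and that at most 32 columns are covered by one 64-bit header. The type and function names are hypothetical.

```rust
/// Per-column state stored in the clustering header (2 bits each).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClusteringState {
    Present,  // 00: type-specific bytes follow
    Empty,    // 01: zero-length value, no bytes follow
    Null,     // 10: no value, no bytes follow
    Reserved, // 11
}

/// Extracts the state of each clustering column from an already-decoded
/// header VInt. Only the first 32 columns fit in one 64-bit header.
pub fn decode_clustering_header(header: u64, columns: usize) -> Vec<ClusteringState> {
    (0..columns.min(32))
        .map(|i| match (header >> (2 * i)) & 0b11 {
            0 => ClusteringState::Present,
            1 => ClusteringState::Empty,
            2 => ClusteringState::Null,
            _ => ClusteringState::Reserved,
        })
        .collect()
}
```

A reader would then consume value bytes only for columns whose state is `Present`.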

Between rows in a partition, or at the end of a partition, the parser may encounter special markers:

| Marker | Hex | Meaning |
|--------|-----|---------|
| END_OF_PARTITION | 0x01 | Signals end of partition - nothing follows this byte |
| IS_MARKER | 0x02 | Range tombstone marker (boundary or bound) |

END_OF_PARTITION (0x01):

  • Written by UnfilteredSerializer.writeEndOfPartition() as exactly 0x01
  • When detected, the partition is complete; move to next partition
  • Critical for tables with clustering keys to avoid misinterpreting marker as row data

IS_MARKER (0x02):

  • Indicates a range tombstone boundary
  • Followed by clustering bound/boundary data and deletion time(s)
  • Must be skipped when parsing row data

Implementation Note: CQLite uses a bitwise AND, (flags & IS_MARKER) != 0, to detect IS_MARKER because markers can have additional flag bits set (e.g., 0x52 = IS_MARKER | HAS_DELETION | HAS_COMPLEX_DELETION). END_OF_PARTITION still uses an exact match (0x01) because it is always written alone, without other flags.
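The detection rule in the note can be sketched as a three-way classification of the first byte of an unfiltered. The enum and function names are hypothetical; the constants are the marker values from the table above.

```rust
pub const END_OF_PARTITION: u8 = 0x01;
pub const IS_MARKER: u8 = 0x02;

/// What the first byte of an unfiltered tells the parser to expect next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnfilteredKind {
    EndOfPartition,
    RangeTombstoneMarker,
    Row,
}

/// Exact match for END_OF_PARTITION (always written alone); bitwise test
/// for IS_MARKER (may be combined with other flag bits).
pub fn classify_unfiltered(flags: u8) -> UnfilteredKind {
    if flags == END_OF_PARTITION {
        UnfilteredKind::EndOfPartition
    } else if (flags & IS_MARKER) != 0 {
        UnfilteredKind::RangeTombstoneMarker
    } else {
        UnfilteredKind::Row
    }
}
```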

| Hex | Flag | Meaning | Details |
|-----|------|---------|---------|
| 0x04 | HAS_TIMESTAMP | Timestamp delta present | Delta-encoded from Statistics.db min_timestamp |
| 0x08 | HAS_TTL | TTL delta present | Delta-encoded from Statistics.db min_ttl |
| 0x10 | HAS_DELETION | Deletion time present | Two VInts in Cassandra canonical order: (1) markedForDeleteAt delta (unsigned VInt, base min_timestamp, microseconds; the authoritative tombstone reconciliation timestamp), then (2) local_deletion_time delta (unsigned VInt32, base min_local_deletion_time, seconds). See DeletionTime.Serializer / SerializationHeader.writeDeletionTime. |
| 0x20 | HAS_ALL_COLUMNS | All columns present (no bitmap) | When set, all schema columns have values (no NULLs) |
| 0x80 | EXTENSION_FLAG (source) / HAS_EXTENDED_FLAGS (guide alias) | Extended flags byte follows | Reserved for future format extensions |

All metadata fields use delta encoding against minimum values from Statistics.db:

absolute_timestamp = min_timestamp + timestamp_delta
absolute_ttl = min_ttl + ttl_delta
absolute_marked_for_delete_at = min_timestamp + marked_for_delete_at_delta # microseconds (reconciliation ts)
absolute_local_deletion_time = min_local_deletion_time + local_deletion_time_delta # seconds

Example: If Statistics.db shows min_timestamp = 1759713125983682 and row header contains timestamp_delta = 1000, the absolute timestamp is 1759713125984682 (microseconds since epoch).
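The delta arithmetic can be sketched as below. The `Baselines` struct and its field types are assumptions made for illustration (a signed 64-bit microsecond timestamp and unsigned 32-bit second counts); they are not CQLite's types. Checked addition is used so that a corrupt delta surfaces as `None` instead of wrapping.

```rust
/// Minimum values taken from Statistics.db (illustrative field types).
#[derive(Debug, Clone, Copy)]
pub struct Baselines {
    pub min_timestamp: i64,           // microseconds since epoch
    pub min_ttl: u32,                 // seconds
    pub min_local_deletion_time: u32, // seconds since epoch
}

impl Baselines {
    /// absolute_timestamp = min_timestamp + timestamp_delta
    pub fn absolute_timestamp(&self, delta: u64) -> Option<i64> {
        if delta > i64::MAX as u64 {
            return None;
        }
        self.min_timestamp.checked_add(delta as i64)
    }

    /// absolute_ttl = min_ttl + ttl_delta
    pub fn absolute_ttl(&self, delta: u32) -> Option<u32> {
        self.min_ttl.checked_add(delta)
    }

    /// absolute_local_deletion_time = min_local_deletion_time + delta
    pub fn absolute_local_deletion_time(&self, delta: u32) -> Option<u32> {
        self.min_local_deletion_time.checked_add(delta)
    }
}
```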

When HAS_ALL_COLUMNS (0x20) is NOT set, a columns-subset field follows the metadata fields. Cassandra’s on-disk format (Columns.Serializer.serializeSubset, Columns.java:503-531) encodes missing columns, not present columns:

  • < 64 columns in superset: write a single unsigned VInt where bit = 1 means the column at that index is absent.
  • ≥ 64 columns: write an unsigned VInt32 count of missing columns, then either indices of present columns or missing columns (whichever is smaller set) as unsigned VInt32 deltas.

The CQLite write path uses a simplified bitmap format:

[column_count: VInt] ← CQLite internal
[bitmap_bytes: (column_count + 7) / 8] ← Bit = 1 means column PRESENT (CQLite convention)

Note: CQLite’s write-side bitmap (bit=1 = present) is the inverse of Cassandra’s serializeSubset (bit=1 = missing). Parsers reading Cassandra-produced SSTables must use the authoritative subset encoding above; the CQLite bitmap is only produced by CQLite’s own writer and is parsed by CQLite’s own reader accordingly.

Example (Cassandra format): For a table with 10 columns, if columns 1 and 3 are absent:

  • Bitmap VInt: bit 1 and bit 3 set = 0b00001010 = 0x0a
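For the small-superset case (fewer than 64 columns), the Cassandra convention of "bit = 1 means absent" can be sketched as follows. The function name is hypothetical, and the large-superset encoding is deliberately left out.

```rust
/// Given the already-decoded subset VInt and the number of columns in the
/// superset, returns the indices of columns that are PRESENT in the row.
/// Only the < 64 column encoding is handled; larger supersets return `None`.
pub fn present_columns_from_subset(encoded: u64, superset_len: usize) -> Option<Vec<usize>> {
    if superset_len >= 64 {
        return None;
    }
    Some(
        (0..superset_len)
            // A set bit marks a MISSING column, so keep indices whose bit is clear.
            .filter(|&i| ((encoded >> i) & 1) == 0)
            .collect(),
    )
}
```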

This format specification is confirmed through:

  • Implementation: cqlite-core/src/storage/sstable/reader/parsing/v5_compressed_legacy.rs
  • Cassandra Source: org.apache.cassandra.db.rows.UnfilteredSerializer.java (lines 151-210)
  • Integration tests: 26 of 33 test tables pass (tables with clustering keys now work)
  • Test data: Real Cassandra 5.0 SSTables including sensor_data, wide_partition_table, app_metrics
  • Cassandra 5.0.0 Source: org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer
  • SerializationHeader: Delta encoding semantics for Statistics.db integration
  • Implementation research: See docs/sstables-definitive-guide/ISSUE_162_LEARNINGS.md for detailed findings

This section documents how to construct valid V5CompressedLegacy Data.db files for write operations. All write operations must maintain the format described above while adhering to strict ordering and encoding rules.

Partitions MUST be written in ascending order of their Murmur3 token values. Partitions that share a token (rare but possible) are ordered lexicographically by partition key bytes.

Enforcement: The caller (write engine) is responsible for partition ordering. The DataWriter accepts partitions in the order provided.
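Since ordering is the caller's job, a caller-side sort might look like the sketch below. It assumes the token has already been computed and is compared as a signed 64-bit integer; `PartitionRef` and `sort_partitions` are hypothetical names, and token computation itself is not shown.

```rust
/// A partition identified by its precomputed Murmur3 token and raw key bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PartitionRef {
    pub token: i64,
    pub key: Vec<u8>,
}

/// Sorts by token ascending, breaking ties by lexicographic key bytes.
pub fn sort_partitions(parts: &mut [PartitionRef]) {
    parts.sort_by(|a, b| a.token.cmp(&b.token).then_with(|| a.key.cmp(&b.key)));
}
```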

Each partition begins with:

[key_length: u16 BE] ← Partition key length (2 bytes, big-endian)
[key_bytes] ← Raw partition key bytes
[deletion_time] ← 1 byte 0x80 if LIVE; 12 bytes (u64 mfda + u32 ldt) if deleted

Source: SortedTablePartitionWriter.java:104-105. ByteBufferUtil.writeWithShortLength(key.getKey(), writer) writes a 2-byte big-endian u16 length followed by the key bytes; DeletionTime.getSerializer(version).serialize(...) then writes the deletion time.

There is no leading partition_flags byte and no trailing unknown_field. The partition key length is a 2-byte unsigned short (max 65,535 bytes), not a 1-byte limit.

End-of-Partition Marker: Each partition ends with a single byte 0x01 (END_OF_PARTITION) after all rows.

Within a partition, rows MUST be ordered by their clustering keys according to the table’s clustering order (ASC or DESC per column). For tables without clustering keys, there is at most one row per partition.

Enforcement: The caller must provide rows in the correct clustering order. The DataWriter writes rows in the order provided.

Complete partition structure:

[partition_header] ← As described above
[row_1] ← Multiple rows in clustering order
[row_2]
...
[END_OF_PARTITION: 0x01] ← Single byte marker

Row flags are constructed by OR-ing flag bits based on the row’s properties:

| Condition | Flag | Hex | Result |
|-----------|------|-----|--------|
| Timestamp present (always for writes) | ROW_HAS_TIMESTAMP | 0x04 | Include timestamp delta |
| TTL specified | ROW_HAS_TTL | 0x08 | Include TTL delta |
| Row deletion | ROW_HAS_DELETION | 0x10 | Include deletion fields |
| All columns present (no NULLs) | ROW_HAS_ALL_COLUMNS | 0x20 | Skip column bitmap |
| Complex column deletion | ROW_HAS_COMPLEX_DELETION | 0x40 | Complex deletion present |
| Extended flags follow | EXTENSION_FLAG | 0x80 | Extended flags byte follows |

ROW_HAS_ALL_COLUMNS Truth Table:

This flag is set when ALL of these conditions are true:

  1. All operations are writes (no deletes)
  2. No NULL values present
  3. Number of columns matches schema column count
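
| All Writes? | No NULLs? | Column Count Matches? | HAS_ALL_COLUMNS |
|-------------|-----------|-----------------------|-----------------|
| Yes | Yes | Yes | SET (0x20) |
| Yes | Yes | No | NOT SET |
| Yes | No | Yes | NOT SET |
| No | Yes | Yes | NOT SET |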

Example:

  • Row with timestamp, no TTL, all columns present: 0x04 | 0x20 = 0x24
  • Row with timestamp and TTL, some columns NULL: 0x04 | 0x08 = 0x0c
  • Row with timestamp, TTL, and deletion: 0x04 | 0x08 | 0x10 = 0x1c
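The OR-ing of row flag bits can be sketched as a small builder. `RowProps` and `row_flags` are hypothetical names for illustration; deciding whether `has_all_columns` is true (the truth table above) is left to the caller.

```rust
pub const ROW_HAS_TIMESTAMP: u8 = 0x04;
pub const ROW_HAS_TTL: u8 = 0x08;
pub const ROW_HAS_DELETION: u8 = 0x10;
pub const ROW_HAS_ALL_COLUMNS: u8 = 0x20;
pub const ROW_HAS_COMPLEX_DELETION: u8 = 0x40;

/// Properties of a row that influence its flags byte (illustrative struct).
#[derive(Debug, Clone, Copy, Default)]
pub struct RowProps {
    pub has_timestamp: bool,
    pub has_ttl: bool,
    pub has_deletion: bool,
    pub has_all_columns: bool,
    pub has_complex_deletion: bool,
}

/// Builds the row flags byte by OR-ing one bit per property that holds.
pub fn row_flags(p: &RowProps) -> u8 {
    let mut flags = 0u8;
    if p.has_timestamp {
        flags |= ROW_HAS_TIMESTAMP;
    }
    if p.has_ttl {
        flags |= ROW_HAS_TTL;
    }
    if p.has_deletion {
        flags |= ROW_HAS_DELETION;
    }
    if p.has_all_columns {
        flags |= ROW_HAS_ALL_COLUMNS;
    }
    if p.has_complex_deletion {
        flags |= ROW_HAS_COMPLEX_DELETION;
    }
    flags
}
```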

Cell flags are constructed based on cell properties:

| Condition | Flag | Hex | Result |
|-----------|------|-----|--------|
| Cell is tombstone | CELL_IS_DELETED | 0x01 | Include deletion fields |
| Cell has TTL | CELL_IS_EXPIRING | 0x02 | Include TTL fields |
| Value is empty string | CELL_HAS_EMPTY_VALUE | 0x04 | Zero-length value |
| Use row timestamp | CELL_USE_ROW_TIMESTAMP | 0x08 | Skip cell timestamp |
| Use row TTL | CELL_USE_ROW_TTL | 0x10 | Skip cell TTL |

CELL_USE_ROW_TIMESTAMP Truth Table:

Most cells use the row-level timestamp for efficiency. Cells need their own timestamp when the cell timestamp differs from the row timestamp (e.g., different write operations).

| Cell Type | Timestamp Differs? | USE_ROW_TIMESTAMP |
|-----------|--------------------|-------------------|
| Regular write | No | SET (0x08) |
| Regular write | Yes | NOT SET |
| Tombstone | N/A | NOT SET (always own timestamp) |

CELL_IS_DELETED Truth Table:

Tombstone cells have special flag requirements:

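| Cell Operation | Flag Bits | Timestamp | Deletion Time | Value |
|----------------|-----------|-----------|---------------|-------|
| Regular write | 0x08 (USE_ROW_TIMESTAMP) | Skip | Skip | Present |
| Regular write (own TS) | 0x00 | Include delta | Skip | Present |
| Empty string write | 0x08 OR 0x04 (0x0c) | Skip | Skip | Zero-length |
| Tombstone | 0x01 | Include delta | Include delta | None |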

Example:

  • Normal cell using row timestamp: 0x08
  • Empty string using row timestamp: 0x08 | 0x04 = 0x0c
  • Cell with own timestamp: 0x00 (no flags)
  • Tombstone: 0x01 (no USE_ROW_TIMESTAMP)
  • Expiring cell with row timestamp: 0x08 | 0x02 = 0x0a

Critical: Tombstones MUST NOT use CELL_USE_ROW_TIMESTAMP (0x08). Tombstones always include their own timestamp delta.
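The cell flag rules in this section, including the tombstone restriction, can be sketched as below. `CellKind` and `cell_flags` are hypothetical names; the sketch follows the write-side rules as this chapter states them and does not model CELL_USE_ROW_TTL.

```rust
pub const CELL_IS_DELETED: u8 = 0x01;
pub const CELL_IS_EXPIRING: u8 = 0x02;
pub const CELL_HAS_EMPTY_VALUE: u8 = 0x04;
pub const CELL_USE_ROW_TIMESTAMP: u8 = 0x08;

/// The kinds of cell covered by the tables above (illustrative enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CellKind {
    Live,      // regular write with a non-empty value
    LiveEmpty, // regular write of a zero-length value
    Expiring,  // write with a TTL
    Tombstone, // cell deletion
}

/// Builds the cell flags byte. Tombstones always carry their own timestamp,
/// so USE_ROW_TIMESTAMP is never set for them, whatever the caller passes.
pub fn cell_flags(kind: CellKind, same_timestamp_as_row: bool) -> u8 {
    let mut flags = match kind {
        CellKind::Live => 0,
        CellKind::LiveEmpty => CELL_HAS_EMPTY_VALUE,
        CellKind::Expiring => CELL_IS_EXPIRING,
        CellKind::Tombstone => return CELL_IS_DELETED,
    };
    if same_timestamp_as_row {
        flags |= CELL_USE_ROW_TIMESTAMP;
    }
    flags
}
```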

The format distinguishes between NULL and empty values:

NULL Values:

  • NOT written as cells in the cell data section
  • Represented by absence in the column bitmap (bit = 0)
  • Presence of NULL prevents ROW_HAS_ALL_COLUMNS flag

Empty Values (e.g., empty string ''):

  • Written as cells with CELL_HAS_EMPTY_VALUE flag (0x04)
  • Zero-length value (value_length VInt = 0)
  • Counted as “present” in column bitmap (bit = 1)

Example Column Bitmap:

For a table with columns [name, age, city]:

  • Row with name='Alice', age=NULL, city='':
    • Bitmap: 0b101 (name present, age absent, city present)
    • Cells: Two cells (name, city), city has HAS_EMPTY_VALUE flag

All temporal metadata uses delta encoding against Statistics.db baseline values:

Timestamp Delta (unsigned VInt — SerializationHeader.writeTimestamp calls writeUnsignedVInt):

timestamp_delta = mutation_timestamp - min_timestamp

TTL Delta (unsigned VInt32 — SerializationHeader.writeTTL calls writeUnsignedVInt32):

ttl_delta = mutation_ttl - min_ttl

Local Deletion Time Delta (unsigned VInt32, matching the row header layout):

deletion_time_delta = local_deletion_time - min_local_deletion_time

Constraints:

  • Timestamp deltas MUST be >= 0: min_timestamp is the minimum across all rows, so every delta is non-negative.
  • TTL deltas MUST be >= 0: min_ttl is the minimum across all rows.
  • Deletion time deltas MUST be >= 0 (error if local_deletion_time < min_local_deletion_time).
  • All three fields use unsigned encoding because the baselines guarantee non-negative deltas.

Example 1: Simple Row with Timestamp Delta

Given Statistics.db values:

min_timestamp = 1000000 (microseconds)
min_ttl = 0
min_local_deletion_time = 0

Row with timestamp = 1005000:

[0x04] Row flags: HAS_TIMESTAMP
[VInt(5000)] Timestamp delta: 1005000 - 1000000 = 5000
Unsigned VInt(5000) = 0x93 0x88 (2 bytes)

Byte-level encoding of unsigned VInt(5000):

  • No ZigZag — timestamps use writeUnsignedVInt
  • 5000 = 0x1388; 2-byte encoding: (0x13 | 0x80) = 0x93, 0x88
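The byte-level steps above generalize to any 64-bit value. The sketch below assumes the Cassandra unsigned VInt scheme in which each additional byte contributes 7 usable bits up to a maximum of 9 bytes, and the first byte carries one leading 1-bit per extra byte. `write_unsigned_vint` is a hypothetical name.

```rust
/// Appends the Cassandra-style unsigned VInt encoding of `value` to `out`.
pub fn write_unsigned_vint(value: u64, out: &mut Vec<u8>) {
    // Number of significant bits (at least 1, so that 0 encodes as one byte).
    let bits = 64 - (value | 1).leading_zeros() as usize;
    // Each encoded byte carries 7 payload bits, capped at 9 bytes total.
    let size = ((bits + 6) / 7).min(9);
    let extra = size - 1;
    let be = value.to_be_bytes();
    if extra == 8 {
        // First byte is all 1-bits; the full 8-byte value follows.
        out.push(0xFF);
        out.extend_from_slice(&be);
        return;
    }
    let start = out.len();
    // Emit the low `size` bytes, most significant first.
    out.extend_from_slice(&be[8 - size..]);
    // Set `extra` leading 1-bits in the first byte; they are free because
    // the value fits in 7 * size bits.
    out[start] |= !(0xFFu8 >> extra);
}
```

For 5000 this yields the two bytes 0x93 0x88 shown in the example above.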

Example 2: Row with TTL

Row with timestamp = 1005000, ttl = 7200:

[0x0c] Row flags: HAS_TIMESTAMP | HAS_TTL (0x04 | 0x08)
[VInt(5000)] Timestamp delta
[VInt(7200)] TTL delta: 7200 - 0 = 7200
[VInt(ldt_delta)] Liveness local deletion time delta (also written when HAS_TTL is set)

Example 3: Cell Timestamp Delta

Cell with own timestamp (not using row timestamp):

[0x00] Cell flags: no USE_ROW_TIMESTAMP
[VInt(2000)] Timestamp delta from min_timestamp
[VInt(value_length)] Value length
[value_bytes] Value data

Example 4: Tombstone Cell

Tombstone with timestamp = 1003000, local_deletion_time = 1700000100:

[0x01] Cell flags: IS_DELETED (no USE_ROW_TIMESTAMP)
[VInt(3000)] Timestamp delta: 1003000 - 1000000
[VUInt(1700000100)] Deletion time delta: 1700000100 - 0 (unsigned)

Example 5: No Negative Deltas

Negative timestamp deltas cannot occur in a valid SSTable: min_timestamp is computed as the minimum of all row timestamps, so every row’s delta is >= 0. If you encounter a value smaller than min_timestamp, the SSTable is malformed.

For tables with clustering keys, values are encoded immediately after row flags:

Header VInt Construction:

  • 2 bits per clustering column, packed into a VInt
  • Bits are packed starting from LSB (column 0 uses bits 0-1)

State Values:

  • 00 (0): PRESENT - value bytes follow
  • 01 (1): EMPTY - zero-length value (no bytes)
  • 10 (2): NULL - no value (no bytes)
  • 11 (3): Reserved

Type-Specific Encoding:

Fixed-width types (no length prefix):

  • int: 4 bytes (BE)
  • bigint: 8 bytes (BE)
  • timestamp: 8 bytes (BE)
  • uuid: 16 bytes (raw)

Variable-width types (VInt length + bytes):

  • text/varchar: VInt(byte_length) + UTF-8 bytes
  • blob: VInt(byte_length) + raw bytes

Example: Table with clustering keys (timestamp, text):

Row with clustering = (1234567890, “sensor1”):

[0x00] Header VInt: both PRESENT (0b0000)
[0x00, 0x00, 0x00, 0x00, 0x49, 0x96, 0x02, 0xD2] timestamp (8 bytes)
[0x07] VInt length (7 bytes)
[0x73, 0x65, 0x6E, 0x73, 0x6F, 0x72, 0x31] "sensor1" UTF-8

Row with clustering = (1234567890, NULL):

[0x08] Header VInt: timestamp PRESENT (00) in bits 0-1, text NULL (10) in bits 2-3
[0x00, 0x00, 0x00, 0x00, 0x49, 0x96, 0x02, 0xD2] timestamp (8 bytes)
No text bytes (NULL)
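The clustering prefix construction can be sketched as below, under two simplifying assumptions that keep every VInt to a single byte: at most 3 clustering columns (so the header stays below 128) and variable-width values shorter than 128 bytes. The enum and function names are hypothetical.

```rust
/// One clustering value, already serialized by the caller (illustrative enum).
pub enum ClusteringValue<'a> {
    Fixed(&'a [u8]),    // fixed-width type: raw bytes, no length prefix
    Variable(&'a [u8]), // variable-width type: VInt length + bytes
    Empty,
    Null,
}

/// Encodes `[header][values...]` with 2 header bits per column, column 0 in
/// the least-significant bits. Returns `None` outside the sketch's limits.
pub fn encode_clustering_prefix(values: &[ClusteringValue<'_>]) -> Option<Vec<u8>> {
    if values.len() > 3 {
        return None; // header would need a multi-byte VInt
    }
    let mut header = 0u8;
    let mut body = Vec::new();
    for (i, v) in values.iter().enumerate() {
        match v {
            // Zero-length values are recorded as EMPTY (01), with no bytes.
            ClusteringValue::Empty => header |= 1u8 << (2 * i),
            ClusteringValue::Fixed(b) | ClusteringValue::Variable(b) if b.is_empty() => {
                header |= 1u8 << (2 * i)
            }
            ClusteringValue::Null => header |= 2u8 << (2 * i),
            ClusteringValue::Fixed(b) => body.extend_from_slice(b),
            ClusteringValue::Variable(b) => {
                if b.len() > 127 {
                    return None; // length would need a multi-byte VInt
                }
                body.push(b.len() as u8);
                body.extend_from_slice(b);
            }
        }
    }
    let mut out = vec![header];
    out.extend_from_slice(&body);
    Some(out)
}
```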

When ROW_HAS_ALL_COLUMNS is NOT set, a column bitmap is required:

[column_count: VInt] ← Total columns in schema
[bitmap_bytes: (count + 7) / 8] ← Bit = 1 means column present

Bit Mapping:

  • Column index determines bit position
  • Bit position = column_index (0-based)
  • Byte index = column_index / 8
  • Bit index within byte = column_index % 8

Example: 10 columns, columns [0, 2, 5, 9] have values:

[0x0a] VInt(10) - column count
[0b00100101, 0b00000010] 2 bytes for 10 columns
Byte 0: bits for columns 0-7
Byte 1: bits for columns 8-9

Bit positions:

  • Column 0: byte 0, bit 0 = SET
  • Column 2: byte 0, bit 2 = SET
  • Column 5: byte 0, bit 5 = SET
  • Column 9: byte 1, bit 1 = SET
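The bit mapping above can be sketched as a small encoder for the CQLite-style bitmap (bit = 1 means present). To keep the column count a single-byte VInt, this sketch assumes at most 127 columns; `encode_presence_bitmap` is a hypothetical name.

```rust
/// Encodes `[column_count][bitmap bytes]` where bit `i % 8` of byte `i / 8`
/// is set when column `i` has a value. Returns `None` if an index is out of
/// range or the column count exceeds the sketch's single-byte limit.
pub fn encode_presence_bitmap(column_count: usize, present: &[usize]) -> Option<Vec<u8>> {
    if column_count > 127 {
        return None;
    }
    let mut bitmap = vec![0u8; (column_count + 7) / 8];
    for &col in present {
        if col >= column_count {
            return None;
        }
        bitmap[col / 8] |= 1u8 << (col % 8);
    }
    let mut out = Vec::with_capacity(1 + bitmap.len());
    out.push(column_count as u8); // fits in a one-byte VInt because it is < 128
    out.extend_from_slice(&bitmap);
    Some(out)
}
```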

Regular Cell (live value):

[flags: u8] ← Cell flags
[timestamp_delta: VInt if NOT USE_ROW_TIMESTAMP] ← Delta from min_timestamp
[value_length: VInt] ← Byte length of value
[value_bytes] ← Type-specific serialization

Tombstone Cell (deleted):

[flags: u8] ← CELL_IS_DELETED (0x01)
[timestamp_delta: VInt] ← Delta from min_timestamp (required)
[deletion_time_delta: VUInt] ← Delta from min_local_deletion_time

Note: Tombstones do NOT have value_length or value_bytes fields. The parser returns immediately after reading the deletion time delta.

Type-specific serialization rules for cell values:

| Type | Format | Example |
|------|--------|---------|
| boolean | 1 byte | true = 0x01, false = 0x00 |
| tinyint | 1 byte (signed) | -5 = 0xFB |
| smallint | 2 bytes BE | 300 = 0x01 0x2C |
| int | 4 bytes BE | 42 = 0x00 0x00 0x00 0x2A |
| bigint | 8 bytes BE | 1000 = 0x00 0x00 0x00 0x00 0x00 0x00 0x03 0xE8 |
| float | 4 bytes BE (IEEE 754) | 3.14f = 0x40 0x48 0xF5 0xC3 |
| double | 8 bytes BE (IEEE 754) | 3.14 = 0x40 0x09 0x1E 0xB8 0x51 0xEB 0x85 0x1F |
| text/varchar | UTF-8 bytes (no prefix) | "test" = 0x74 0x65 0x73 0x74 |
| blob | Raw bytes (no prefix) | Binary data as-is |
| timestamp | 8 bytes BE (milliseconds) | Epoch milliseconds |
| date | 4 bytes BE (days + offset) | days - Integer.MIN_VALUE |
| time | 8 bytes BE (nanoseconds) | Nanoseconds since midnight |
| uuid/timeuuid | 16 bytes (raw) | UUID bytes |
| inet | 4 or 16 bytes | IPv4 (4) or IPv6 (16) |
| varint | Variable-length BE signed | Big integer, no length prefix |
| decimal | 4 bytes scale + varint | Scale (BE i32) + unscaled value |
| duration | 3 signed VInts (Cassandra's DurationSerializer writes variable-length integers, not fixed-width i32 values) | months, days, nanoseconds |

Special Cases:

  • Empty string: Zero-length value with CELL_HAS_EMPTY_VALUE flag
  • NULL: Not written as a cell (represented by bitmap absence)
  • Date encoding: the stored value is days since epoch plus 2^31 (equivalently days - Integer.MIN_VALUE), written as an unsigned 32-bit integer
  • Decimal: Scale is 4-byte BE i32, followed by varint unscaled value
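The date offset is easy to get backwards, so here is a minimal sketch. It assumes the input is a signed day count relative to 1970-01-01 and shifts it by 2^31 so that the epoch maps to 0x80000000. The function names are hypothetical.

```rust
/// Encodes a signed day count (relative to the Unix epoch) as the 4-byte
/// big-endian unsigned value used for `date` cells: days + 2^31.
pub fn encode_date(days_since_epoch: i32) -> [u8; 4] {
    (days_since_epoch as u32).wrapping_add(1 << 31).to_be_bytes()
}

/// Inverse of `encode_date`: subtracts the 2^31 offset again.
pub fn decode_date(bytes: [u8; 4]) -> i32 {
    u32::from_be_bytes(bytes).wrapping_sub(1 << 31) as i32
}
```

Wrapping arithmetic is used deliberately: adding 2^31 to the two's-complement bit pattern gives the same result as subtracting Integer.MIN_VALUE in 32-bit Java arithmetic.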

Complete write sequence for a partition:

  1. Compute Statistics: Calculate min_timestamp, min_ttl, min_local_deletion_time from all mutations
  2. Initialize DataWriter: Create with computed statistics for delta encoding
  3. Order Partitions: Sort by Murmur3 token, then partition key bytes
  4. For Each Partition:
    a. Write partition header
    b. Order rows by clustering key
    c. For each row:
      • Compute and write row flags
      • Write clustering prefix (if present)
      • Compute row_size (body bytes only)
      • Write row_size VInt
      • Write prev_size VInt (0 for now)
      • Write timestamp delta (if HAS_TIMESTAMP)
      • Write TTL delta and liveness local deletion time delta (if HAS_TTL)
      • Write deletion fields (if HAS_DELETION)
      • Write column bitmap (if NOT HAS_ALL_COLUMNS)
      • Write cells (skip NULLs)
    d. Write END_OF_PARTITION marker (0x01)
  5. Finish: Return complete Data.db bytes

Critical: row_size is measured from AFTER the row_size VInt, not from where it starts (Issue #237).

This write specification is validated through:

  • Implementation: cqlite-core/src/storage/sstable/writer/data_writer.rs
  • Unit Tests: 20+ tests covering all encoding paths
  • Round-trip Tests: Written SSTables are readable by both CQLite and Cassandra’s sstabledump
  • Cassandra Source: Cross-referenced with org.apache.cassandra.db.rows.UnfilteredSerializer.java
  • Parser: cqlite-core/src/storage/sstable/reader/parsing/v5_compressed_legacy.rs
  • Cassandra Source: org.apache.cassandra.db.rows.UnfilteredSerializer (lines 151-475)
  • Issue Tracking: Issue #237 (row_size measurement), Issue #401 (tombstone encoding)