Index.db and Summary.db

This chapter explains the partition index (Index.db) and the sampled summary (Summary.db), and how they guide binary search and seeks into Data.db. It also outlines token-range iteration behavior.

In this chapter you will learn

The structure of index entries and promoted index behavior
How summary sampling accelerates lookups
How binary search is guided from summary to index to data
How token range iteration interacts with the index

Partition Index Structure

Index.db (the BIG / NB partition index) stores, for every partition, the raw partition key (length-prefixed) together with its byte offset into Data.db. There is no 0x0010 marker and no MD5 digest — that was a long-standing documentation error (Issue #552). Each entry is:

[key_len: u16 BE]                    ← length of the raw partition key
[raw partition key bytes: key_len]   ← the partition key exactly as serialized in Data.db
[data_offset: unsigned vint]         ← offset in the UNCOMPRESSED Data.db stream
[promoted_index_len: unsigned vint]  ← byte length of the promoted index (0 = none)
[promoted_index_data: promoted_index_len bytes]

The leading u16 is the partition key LENGTH, not a marker. It varies with the key: 0x0010 for a 16-byte UUID key, 0x0026 (= 38) for a 38-byte composite partition key, 0x0004 for a 4-byte int key, and so on. Earlier revisions of this guide misread the 0x0010 produced by single-UUID tables as a fixed marker; it is simply the length of a 16-byte key.

Annotated example — single UUID key (simple_table, key length 16):

00000000: 0010 1529 1a77 d739 4e73 8397 b787 442f  |...).w.9Ns....D/|
00000010: 3a1f 0000 ...                            |:.. .          |

0010 → key length = 16
1529…3a1f → 16-byte raw partition key (a UUID)
00 → unsigned VInt data offset (value 0)
00 → unsigned VInt promoted-index length (0, no promoted index)

Annotated example — composite key (multi_partition_table, PRIMARY KEY ((tenant_id, user_id), …), key length 38):

00000000: 0026 0010 98e0 5820 982d 411c 961f 26d1  |.&....X .-A...&.|
00000010: 0574 74e4 0000 109d 159a 2b08 da4a d1be  |.tt.......+..J..|
00000020: 78c9 0f87 83e5 c100 ...                  |x..... .       |

0026 → key length = 38
0010 98e0…74e4 00 0010 9d15…c1 00 → 38-byte composite key. Cassandra frames each component as [len: u16 BE][value][0x00 end-of-component], so two 16-byte UUIDs become 0010<uuid1>00 0010<uuid2>00 = 19 + 19 = 38 bytes.
The data-offset and promoted-index-length vints follow the key.
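The component framing described above can be sketched in Rust. This is a minimal illustration rather than CQLite's parser: split_composite_key is a hypothetical helper name, and it assumes every component is terminated by a 0x00 end-of-component byte, as in the example dump.

```rust
/// Splits a composite partition key framed as
/// [len: u16 BE][value][0x00 end-of-component] for each component.
/// Returns None if the framing is truncated or a terminator is not 0x00.
fn split_composite_key(mut key: &[u8]) -> Option<Vec<&[u8]>> {
    let mut components = Vec::new();
    while !key.is_empty() {
        if key.len() < 2 {
            return None;
        }
        let len = u16::from_be_bytes([key[0], key[1]]) as usize;
        // The value bytes plus the one-byte terminator must be present.
        if key.len() < 2 + len + 1 {
            return None;
        }
        if key[2 + len] != 0x00 {
            return None;
        }
        components.push(&key[2..2 + len]);
        key = &key[2 + len + 1..];
    }
    Some(components)
}
```

Applied to the 38-byte key in the dump, this yields two 16-byte components, one per UUID.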

Pseudo-struct (field order; big-endian for fixed-width):

u16   key_len
bytes raw_partition_key[key_len]
vint  data_offset
vint  promoted_index_len
bytes promoted_index[promoted_index_len]   // only when promoted_index_len > 0

Critical: VInt offset encoding (NB format)

The data_offset and promoted_index_len fields use Cassandra unsigned VInt encoding, not a fixed-width or length-prefixed byte array:

VInt encoding (from DataInputPlus.java):

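The number of leading 1-bits in the first byte is the number of additional bytes that follow; the remaining bits of the first byte are the most significant bits of the value.
First byte 0x00-0x7F: 1 byte total, value = the byte itself
First byte 0x80-0xBF: 2 bytes total
First byte 0xC0-0xDF: 3 bytes total
and so on, up to a first byte of 0xFF followed by 8 more bytes (9 bytes total).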

Example offsets:

0x00       -> 1 byte  -> value = 0
0xb0 0x5d  -> 2 bytes -> value = 12381
0xc0 0x5f 0x11 -> 3 bytes -> value = 24337
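A minimal Rust sketch of the decoding rule, checked against the three example offsets above. The function name decode_unsigned_vint is an assumption for illustration and is not taken from Cassandra or CQLite.

```rust
/// Decodes one unsigned VInt from the front of `buf`.
/// Returns (value, bytes_consumed), or None if `buf` is too short.
fn decode_unsigned_vint(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    // The count of leading 1-bits is the number of extra bytes that follow.
    let extra = first.leading_ones() as usize;
    if buf.len() < 1 + extra {
        return None;
    }
    // The remaining low bits of the first byte are the top bits of the value.
    // With 8 extra bytes (first byte 0xFF) the first byte carries no value bits.
    let mut value = if extra >= 8 {
        0
    } else {
        u64::from(first & (0xFFu8 >> extra))
    };
    for &b in &buf[1..1 + extra] {
        value = (value << 8) | u64::from(b);
    }
    Some((value, 1 + extra))
}
```

Returning the consumed byte count alongside the value lets a caller advance its cursor without re-inspecting the first byte.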

Important: data_offset is a position in the uncompressed data stream, i.e. the offset of the partition’s first byte as if Data.db were fully decompressed. Cassandra takes it directly from the partition writer’s initial position in that stream (BigTableWriter.createRowIndexEntry), so there is no header to add:

Uncompressed Data.db — the uncompressed stream is the file, and the file has no standalone header, so file_offset == data_offset. Verified on test_basic/uncompressed_table: Index.db entry 0 has data_offset = 0 and Data.db byte 0 is that partition’s 0010 045e 63fc… key; entry 1’s data_offset = 0xc5 (197) lands exactly on 0010 5f5e 6b02…, the next key.
Compressed Data.db — resolve data_offset through CompressionInfo.db: find the chunk containing it, decompress that chunk, and index into the result. The offset is not a raw file position. Verified on test_basic/compression_test_table, where Data.db byte 0 is Snappy chunk data, not a key.
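For the compressed case, the arithmetic of "find the chunk containing it" can be sketched as follows. ChunkMap, ChunkLocation and locate_partition are hypothetical names, and the struct is a simplified stand-in for what CompressionInfo.db provides (a fixed uncompressed chunk length plus the file offset of each compressed chunk); the chunk length and offsets used in any example are made up.

```rust
/// Hypothetical, simplified view of the chunk table in CompressionInfo.db.
struct ChunkMap {
    /// Uncompressed bytes per chunk (the last chunk may be shorter).
    chunk_length: u64,
    /// File offset in Data.db at which each compressed chunk starts.
    chunk_file_offsets: Vec<u64>,
}

#[derive(Debug, PartialEq)]
struct ChunkLocation {
    chunk_index: usize,
    chunk_file_offset: u64,
    offset_in_chunk: u64,
}

/// Maps an uncompressed-stream data_offset to the chunk to decompress
/// and the position inside the decompressed chunk.
fn locate_partition(map: &ChunkMap, data_offset: u64) -> Option<ChunkLocation> {
    if map.chunk_length == 0 {
        return None;
    }
    let chunk_index = (data_offset / map.chunk_length) as usize;
    let chunk_file_offset = *map.chunk_file_offsets.get(chunk_index)?;
    Some(ChunkLocation {
        chunk_index,
        chunk_file_offset,
        offset_in_chunk: data_offset % map.chunk_length,
    })
}
```

A partition can continue past the end of its first chunk, so a reader built on this would keep decompressing subsequent chunks as it consumes the partition.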

Caution: any advice of the form “add ~30 bytes of Data.db header” is wrong for na/nb and oa BIG SSTables — Data.db has no such header. (CQLite carries an actual_header_size for other, non-Cassandra-written inputs and it is 0 for real nb files: reader/header_helpers.rs:89-92; the summary-guided point path passes data_offset through unmodified, reader/data_access/big_point.rs:114-122.)

Keys and collisions:

The on-disk key is the raw partition key, so lookups compare keys byte-for-byte. There is no key hash on disk, hence no collision handling at the index layer at all. (CQLite’s in-memory lookup map is keyed on those raw bytes; its key_digest field name is historical — index_reader/mod.rs:52-60 documents the misnomer, Issue #552.)
A point lookup that misses therefore misses for a real reason (absent key, or an index whose covering range cannot be trusted), not because of a digest collision. When a reader cannot bound the search authoritatively it must widen it rather than report absence — see “Partition Lookup Flow” below and Ch.10.

Promoted index payload (BIG):

Emitted for wide partitions to accelerate within-partition seeks. Its length is given explicitly by promoted_index_len; when it is 0 the entry has no promoted index. See org.apache.cassandra.io.sstable.format.big reader/writer for the payload fields.

Promoted index trigger (data-size based, not key-count based): A new IndexInfo block is emitted whenever the uncompressed data written for the current partition since the last block boundary reaches column_index_size (default 64 KiB). Source: BigFormatPartitionWriter.java:49 (DEFAULT_GRANULARITY = 64 * 1024) and line 213 (if (currentPosition() - startPosition >= indexSize)). A promoted index is only included when two or more IndexInfo blocks result (RowIndexEntry.create() lines 227-239). Partitions smaller than 64 KiB produce a plain RowIndexEntry with serialized size = 0 and no promoted index data.
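The trigger rule can be modelled with a short Rust sketch. It is a simplification under stated assumptions: it only counts bytes per row or marker, ignores partition header bytes, and assumes a trailing partial block is closed at partition end. The names count_index_blocks and has_promoted_index are hypothetical.

```rust
/// Default column_index_size, per the text above.
const DEFAULT_COLUMN_INDEX_SIZE: u64 = 64 * 1024;

/// Counts IndexInfo blocks for one partition, given the serialized size of
/// each row or marker in write order. A block is closed as soon as the bytes
/// written since the last boundary reach `index_size`.
fn count_index_blocks(unfiltered_sizes: &[u64], index_size: u64) -> usize {
    let mut blocks = 0;
    let mut open_block_bytes = 0u64;
    let mut open_block_has_rows = false;
    for &size in unfiltered_sizes {
        open_block_bytes += size;
        open_block_has_rows = true;
        if open_block_bytes >= index_size {
            blocks += 1;
            open_block_bytes = 0;
            open_block_has_rows = false;
        }
    }
    // Assumption: a partially filled final block is still recorded.
    if open_block_has_rows {
        blocks += 1;
    }
    blocks
}

/// A promoted index is only written when two or more blocks result.
fn has_promoted_index(unfiltered_sizes: &[u64], index_size: u64) -> bool {
    count_index_blocks(unfiltered_sizes, index_size) >= 2
}
```

Under this model, three 40 KiB rows produce two blocks and therefore a promoted index, while a thousand 10-byte rows produce one block and none.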

DeletionTime field order (“oa” vs legacy, MC-4): For SSTable version “oa” (5.0+), DeletionTime inside a promoted index is serialized as:

LIVE: 1 byte 0x80 (high bit set)

Non-LIVE: 8 bytes markedForDeleteAt (sign bit must be 0, high bit used as flags), then 4 bytes localDeletionTime as unsigned int

For legacy versions (< “oa”) the LegacySerializer writes localDeletionTime (4 bytes, signed int) first, then markedForDeleteAt (8 bytes). Gate on version.hasUintDeletionTime (BigFormat.java:409). Source: DeletionTime.java#L210-L219
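A hedged sketch of the version gate, following only the field order described above. The enum, the function name read_deletion_time and the choice to widen localDeletionTime to i64 are assumptions for illustration; legacy sentinel values are returned as-is rather than interpreted.

```rust
#[derive(Debug, PartialEq)]
enum DeletionTime {
    Live,
    Deleted { marked_for_delete_at: i64, local_deletion_time: i64 },
}

fn be_i64(b: &[u8]) -> i64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b);
    i64::from_be_bytes(a)
}

fn be_4(b: &[u8]) -> [u8; 4] {
    [b[0], b[1], b[2], b[3]]
}

/// Reads one DeletionTime and returns it with the number of bytes consumed.
/// `has_uint_deletion_time` mirrors the version flag named in the text.
fn read_deletion_time(buf: &[u8], has_uint_deletion_time: bool) -> Option<(DeletionTime, usize)> {
    if has_uint_deletion_time {
        // "oa": a first byte with the high bit set means LIVE, nothing follows.
        if (*buf.first()? & 0x80) != 0 {
            return Some((DeletionTime::Live, 1));
        }
        let b = buf.get(..12)?;
        let dt = DeletionTime::Deleted {
            marked_for_delete_at: be_i64(&b[..8]),
            local_deletion_time: i64::from(u32::from_be_bytes(be_4(&b[8..]))),
        };
        Some((dt, 12))
    } else {
        // Legacy: localDeletionTime (signed, 4 bytes) first, then markedForDeleteAt.
        let b = buf.get(..12)?;
        let dt = DeletionTime::Deleted {
            local_deletion_time: i64::from(i32::from_be_bytes(be_4(&b[..4]))),
            marked_for_delete_at: be_i64(&b[4..]),
        };
        Some((dt, 12))
    }
}
```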

Mini-parser — conceptual:

pos = 0
key_len            = read_u16_be()
raw_key            = read_bytes(key_len)
data_offset        = read_vint_u64()
promoted_index_len = read_vint_u64()
promoted_index     = read_bytes(promoted_index_len)   // empty when length == 0
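The conceptual mini-parser translates to Rust roughly as follows. This is a sketch, not CQLite's reader: IndexEntry, read_unsigned_vint and parse_index_entry are hypothetical names, and the promoted index payload is returned as opaque bytes.

```rust
#[derive(Debug, PartialEq)]
struct IndexEntry<'a> {
    raw_key: &'a [u8],
    data_offset: u64,
    promoted_index: &'a [u8],
}

/// Reads one unsigned VInt at `*pos` and advances `*pos` past it.
fn read_unsigned_vint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let first = *buf.get(*pos)?;
    let extra = first.leading_ones() as usize;
    let rest = buf.get(*pos + 1..*pos + 1 + extra)?;
    let mut value = if extra >= 8 { 0 } else { u64::from(first & (0xFFu8 >> extra)) };
    for &b in rest {
        value = (value << 8) | u64::from(b);
    }
    *pos += 1 + extra;
    Some(value)
}

/// Parses one Index.db entry at `*pos`. On success `*pos` points at the next
/// entry; on failure (truncated input) `*pos` is left unchanged.
fn parse_index_entry<'a>(buf: &'a [u8], pos: &mut usize) -> Option<IndexEntry<'a>> {
    let mut p = *pos;
    let len_bytes = buf.get(p..p + 2)?;
    let key_len = u16::from_be_bytes([len_bytes[0], len_bytes[1]]) as usize;
    p += 2;
    let raw_key = buf.get(p..p + key_len)?;
    p += key_len;
    let data_offset = read_unsigned_vint(buf, &mut p)?;
    let promoted_len = read_unsigned_vint(buf, &mut p)? as usize;
    let promoted_index = buf.get(p..p.checked_add(promoted_len)?)?;
    p += promoted_len;
    *pos = p;
    Some(IndexEntry { raw_key, data_offset, promoted_index })
}
```

Calling parse_index_entry repeatedly with the same cursor walks consecutive entries, which is the shape a bounded forward walk needs.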

Note: a BTI-indexed SSTable does not use Index.db at all — it uses the Partitions.db / Rows.db trie structures (see Ch.17). So Index.db, when present, is always this BIG-format partition index; there is no separate “BTI Index.db” variant to detect.

Summary.db Format

Summary.db samples index entries for faster navigation. It contains a subset of partition keys at configurable intervals (default: every 128 partitions).

Sample order is partitioner order, not byte order. Summary.db samples — and the Index.db entries they point into — are ordered by decorated key: the table partitioner’s token first, then the raw key bytes compared as unsigned. With the 5.0 default Murmur3Partitioner the order is therefore decided by murmur3(raw_key), and the raw bytes only break token ties. A binary search that compares raw keys with memcmp lands in the wrong interval on any table whose token order differs from its byte order. The token itself is not stored in a summary entry (see “Entry Format” below) — the reader recomputes it by decorating the stored key.

Sources: IndexSummary.binarySearch(PartitionPosition) compares with DecoratedKey.compareTo(IPartitioner, ByteBuffer, PartitionPosition), which does getToken().compareTo(...) first (Murmur3Partitioner.LongToken.compareTo = Long.compare) and only then ByteBufferUtil.compareUnsigned on the raw keys.
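The comparator itself is small. The sketch below assumes the token has already been computed by the partitioner (the Murmur3 hash is out of scope here) and uses hypothetical names DecoratedKey and compare_decorated.

```rust
use std::cmp::Ordering;

/// A raw partition key paired with its partitioner token.
/// Assumption: the token was computed elsewhere by decorating `raw_key`.
struct DecoratedKey<'a> {
    token: i64,
    raw_key: &'a [u8],
}

/// Token first (signed 64-bit compare), then raw key bytes compared as
/// unsigned. Rust's slice ordering on u8 is already unsigned lexicographic.
fn compare_decorated(a: &DecoratedKey, b: &DecoratedKey) -> Ordering {
    a.token
        .cmp(&b.token)
        .then_with(|| a.raw_key.cmp(b.raw_key))
}
```

A key with token -5 and raw bytes 0xff sorts before a key with token 3 and raw bytes 0x00, which is the opposite of what a raw-byte comparison reports.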

File Structure

+------------------------+
| Header (24 bytes)      |
+------------------------+
| Offset Table (LE u32[])| <- Little-endian!
+------------------------+
| Entry Data             |
|   key + position (LE)  | <- position is little-endian too (#1054)
+------------------------+
| First Key (serialized) |
+------------------------+
| Last Key (serialized)  |
+------------------------+

Header Format (24 bytes, big-endian)

struct summary_header {
    be32 min_index_interval;      // Minimum partitions between entries (usually 128)
    be32 entries_count;           // Number of sampled entries
    be64 summary_entries_size;    // Size of offset table + entry data
    be32 sampling_level;          // Sampling level (1-128)
    be32 size_at_full_sampling;   // Entries at full sampling
};

Annotated example:

00000000: 00 00 00 80  // min_index_interval = 128
00000004: 00 00 00 08  // entries_count = 8
00000008: 00 00 00 00 00 00 00 e0  // summary_entries_size = 224
00000010: 00 00 00 80  // sampling_level = 128
00000014: 00 00 00 08  // size_at_full_sampling = 8
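A minimal Rust sketch that decodes the 24-byte header. SummaryHeader and parse_summary_header are hypothetical names chosen to mirror the struct above.

```rust
#[derive(Debug, PartialEq)]
struct SummaryHeader {
    min_index_interval: u32,
    entries_count: u32,
    summary_entries_size: u64,
    sampling_level: u32,
    size_at_full_sampling: u32,
}

/// Parses the fixed 24-byte, big-endian Summary.db header.
/// Returns None if fewer than 24 bytes are available.
fn parse_summary_header(buf: &[u8]) -> Option<SummaryHeader> {
    let h = buf.get(..24)?;
    let be32 = |i: usize| u32::from_be_bytes([h[i], h[i + 1], h[i + 2], h[i + 3]]);
    let mut size = [0u8; 8];
    size.copy_from_slice(&h[8..16]);
    Some(SummaryHeader {
        min_index_interval: be32(0),
        entries_count: be32(4),
        summary_entries_size: u64::from_be_bytes(size),
        sampling_level: be32(16),
        size_at_full_sampling: be32(20),
    })
}
```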

Offset Table (Little-Endian!)

Critical gotcha: Unlike the big-endian header and most other Cassandra on-disk fields, the offset table uses little-endian encoding.

le32 offsets[entries_count];  // Offset to each entry, measured from the START of the
                              // offset table (i.e. offsets[0] == entries_count * 4)

Second gotcha: the offsets are absolute within the combined (offset_table + entry_data) region, not relative to the entry-data section. Cassandra adds baseOffset = offsetCount * 4 when serializing and subtracts it when deserializing (IndexSummary.java#L405-L420), and its deserializer rejects an offset smaller than the table itself (#L459-L467), so a zero-based offsets[0] is not readable by Cassandra (CQLite issue #666).

Real bytes — test_basic/composite_key_table, nb-1-big-Summary.db, one sample with a 16-byte key:

00000018: 04 00 00 00  // Entry 0 at absolute offset 4 == entries_count(1) * 4 (LE!)
0000001c: 24 5d ff 69 02 6f 45 c6 b6 8f ba 0c 96 4d f3 c9  // 16-byte raw key, no prefix
0000002c: 00 00 00 00 00 00 00 00                          // Index.db position = 0 (LE u64)

Here summary_entries_size (header) is 0x1c = 28 = 4 (offset table) + 16 (key) + 8 (position), confirming it spans both regions.

Entry Format

Entries have no length prefix. Key boundaries are determined by offset differences.

struct summary_entry {
    byte key[];        // Variable length - no prefix!
    le64 position;     // Position in Index.db file (LITTLE-ENDIAN — issue #1054)
};

Endianness note (issue #1054): the trailing per-entry position is the only other little-endian field besides the offset table. Cassandra serializes it with out.writeLong against a little-endian buffer, so it must be decoded with u64::from_le_bytes. The 24-byte header and the length-prefixed first/last keys are big-endian.

Key length calculation:

key_length = next_offset - current_offset - 8  // Subtract 8 for position field

For the last entry there is no next_offset; its end is summary_entries_size (the header field spanning offset table + entry data). Cassandra’s in-memory equivalent is getEndInSummary, which returns entriesLength for the last index and getPositionInSummary(index + 1) otherwise.
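Putting the offset table and the key-length rule together, a reader can recover every sample from the combined region. The sketch below assumes `region` is exactly the summary_entries_size bytes that follow the header; parse_summary_entries is a hypothetical name.

```rust
/// Parses all samples from the combined (offset table + entry data) region.
/// Returns (raw_key, index_position) pairs, or None on malformed input.
fn parse_summary_entries(region: &[u8], entries_count: usize) -> Option<Vec<(&[u8], u64)>> {
    let table_len = entries_count.checked_mul(4)?;
    let table = region.get(..table_len)?;
    // Offsets are little-endian and measured from the start of the table.
    let offsets: Vec<usize> = table
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]) as usize)
        .collect();
    let mut entries = Vec::with_capacity(entries_count);
    for (i, &start) in offsets.iter().enumerate() {
        // The last entry ends where the region ends.
        let end = offsets.get(i + 1).copied().unwrap_or(region.len());
        // An offset inside the table, or an entry too short for its
        // 8-byte position, is malformed.
        if start < table_len || end < start + 8 {
            return None;
        }
        let entry = region.get(start..end)?;
        let (key, pos_bytes) = entry.split_at(entry.len() - 8);
        let mut p = [0u8; 8];
        p.copy_from_slice(pos_bytes);
        entries.push((key, u64::from_le_bytes(p)));
    }
    Some(entries)
}
```

Run against the composite_key_table bytes shown earlier, this returns one sample with the 16-byte key and Index.db position 0.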

Important: Tokens are NOT stored in Summary.db entries. The position field points to a byte offset in Index.db, not a token. The entries are nevertheless ordered by token (see “Sample order is partitioner order” above); a reader that needs a sample’s token recomputes it by decorating the stored raw key (IndexSummary.getKey(int) returns raw bytes; binarySearch decorates them via the partitioner).

Serialized Keys (File End)

struct serialized_key {
    be32 size;
    byte key[size];
};

First and last keys are serialized at the end of the file for quick boundary lookups.

Partition Lookup Flow

A BIG point lookup never parses the whole Index.db. Each step below is bounded:

Summary.db interval lookup — binary search the token-ordered samples (decorated-key comparator, above) for the floor sample: the greatest sample whose decorated key is <= target. Its position is where the Index.db walk starts; the next sample’s position is where the covering interval ends. A target below the first sample clamps to the index start. Cassandra: IndexSummary.getScanPosition → binarySearch → getScanPositionFromBinarySearchResult (a “no sample <= target” result yields position 0).
Bounded Index.db walk — read forward from that position, comparing keys, and stop once the interval is exhausted. Cassandra bounds the EQ walk by the interval width for this sample, getEffectiveIndexIntervalAfterIndex(sampledIndex): while i <= effectiveInterval it compares raw key bytes for speed; past that bound it compares decorated keys and returns “absent” as soon as an index key sorts above the target (BigTableReader.java:277-320). Non-matching entries are skipped with RowIndexEntry.Serializer.skip, never re-parsed.
Data.db seek — use the matched entry’s data_offset to read the partition.
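The first step, finding the floor sample and the byte range of Index.db to walk, can be sketched as below. Sample and covering_interval are hypothetical names, tokens are assumed to be precomputed for each sample, and the sketch returns only the byte bounds; the walk itself and the exact per-interval entry bound are left out.

```rust
use std::cmp::Ordering;

/// One summary sample with its token already recomputed from the raw key.
struct Sample {
    token: i64,
    raw_key: Vec<u8>,
    index_position: u64,
}

fn compare_decorated(token_a: i64, key_a: &[u8], token_b: i64, key_b: &[u8]) -> Ordering {
    token_a.cmp(&token_b).then_with(|| key_a.cmp(key_b))
}

/// Returns (start, end) byte positions in Index.db for the interval that can
/// contain the target. `end` is None when the floor sample is the last one,
/// meaning the walk runs to the end of Index.db. A target below the first
/// sample clamps `start` to 0.
fn covering_interval(samples: &[Sample], target_token: i64, target_key: &[u8]) -> (u64, Option<u64>) {
    // Number of samples whose decorated key is <= the target.
    let at_or_below = samples.partition_point(|s| {
        compare_decorated(s.token, &s.raw_key, target_token, target_key) != Ordering::Greater
    });
    let start = if at_or_below == 0 {
        0
    } else {
        samples[at_or_below - 1].index_position
    };
    let end = samples.get(at_or_below).map(|s| s.index_position);
    (start, end)
}
```

Because the samples are sorted by decorated key, partition_point performs the binary search; only the comparator needs to know about tokens.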

Open cost. A BIG reader loads Summary.db at open and does not scan Index.db; it only walks Index.db when the summary (or the Bloom filter) has to be rebuilt because Summary.db is absent, corrupt, or was written with a different min_index_interval (BigSSTableReaderLoadingBuilder.java:96-130, loadSummary()). So “open is proportional to the summary, not the partition count” holds only when a usable Summary.db is present — the rebuild path is proportional to Index.db. The deserializer itself decides “usable”: it throws (triggering a rebuild) when the stored min_index_interval differs from the table’s, when the derived effective interval exceeds max_index_interval, or when the offset table looks big-endian (IndexSummary.java#L423-L467).

Interval width and downsampling

Sample spacing equals min_index_interval partitions only at full sampling. Under index-summary memory pressure Cassandra downsamples an existing summary, dropping samples according to a fixed pattern (Downsampling.getSamplingPattern), which widens intervals. Two different quantities matter, and confusing them is a correctness trap for reader authors:

Average width — BASE_SAMPLING_LEVEL / sampling_level * min_index_interval, where BASE_SAMPLING_LEVEL = 128. This is IndexSummary.getEffectiveIndexInterval(), a double estimate.
Exact per-interval width — getEffectiveIndexIntervalAfterIndex(index), computed from the downsampling pattern (Downsampling.java:116-123). Individual intervals of a downsampled summary are uneven, and the widest one can exceed the average: at min_index_interval = 128, sampling_level = 96 the average is ~170.7 partitions while the widest actual interval is 256. Cassandra’s read path uses the exact per-interval value, never the average.

Warning: an entry-count cap derived from the average width can cut a walk short before a present key that legitimately sits further into a widened interval. Bound a walk by the authoritative next-sample byte position where one exists, and by the exact per-interval width otherwise.
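As a small worked example of the average-width formula only (the exact per-interval width depends on the downsampling pattern and is not reproduced here), assuming a hypothetical helper named average_effective_interval:

```rust
const BASE_SAMPLING_LEVEL: u32 = 128;

/// Average interval width in partitions. This is an estimate, not a bound:
/// individual intervals of a downsampled summary can be wider.
fn average_effective_interval(min_index_interval: u32, sampling_level: u32) -> f64 {
    f64::from(BASE_SAMPLING_LEVEL) / f64::from(sampling_level) * f64::from(min_index_interval)
}
```

At min_index_interval = 128 and sampling_level = 96 this evaluates to roughly 170.7, well below the 256-partition widest interval quoted above, which is why it is unsafe as a walk cap.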

Token-range iteration

Token-range scans use the same summary search, entered by token rather than by key. Cassandra turns a range bound into a “fake” key via Token.minKeyBound()/maxKeyBound() — a PartitionPosition that compares by token first — so BigTableScanner.seekToCurrentRangeStart() can call BigTableReader.getIndexScanPosition(range.left) → IndexSummary.getScanPosition and get the floor sample’s Index.db position. It then walks forward, skipping entries until one sorts above range.left or falls inside the range. IndexSummary also exposes range-aware sample enumeration directly — getSampleIndexesForRanges and getKeySamples(Range<Token>).

So token-range iteration is summary-guided: what Summary.db lacks is a stored token, not the ability to enter by token. There is no getPosition(Token) API — getPosition(int) takes a sample index; tokens enter through the key-bound comparison above, and a reader recomputes each sample’s token from its stored raw key.

CQLite implementation note. CQLite mirrors this shape for BIG: a reader with a usable Summary.db opens the partition index lazily and resolves a point read through exactly one summary-bounded Index.db interval, while a token-range scan starts its forward walk at the floor sample. Details, including how CQLite classifies an interval miss, are in Ch.10 and Appendix C.

BTI Notes

BTI SSTables do not use Index.db or Summary.db at all. The BIG-format Summary → Index → Data flow does not apply to BTI. BTI uses Partitions.db (a page-aware trie, PartitionIndex) for O(log n) key lookup and Rows.db for the row index. See Ch.17 for the complete BTI format. Source: BtiFormat.java#L83-L102

Key Takeaways

Index.db maps partition keys to Data.db positions; Summary.db is the sampled, binary-searchable entry point into Index.db.
Sampling reduces memory while preserving fast seeks.
Both components are in decorated-key order — partitioner token first, raw key bytes (unsigned) only as a tie-break. Never binary-search them with a raw-byte comparator.
A point lookup walks one summary interval of Index.db, not the whole file; open reads only Summary.db unless the summary is missing/corrupt/interval-mismatched, in which case it is rebuilt from Index.db.
Interval width is min_index_interval only at full sampling; a downsampled summary has uneven, wider intervals, and the widest can exceed the average-width formula.
Token-range iteration is summary-guided (enter by key bound, walk forward); what Summary.db omits is a stored token, not token entry.

Writing Index.db

This section documents the SSTable write workflow for generating Index.db and Summary.db components.

Index.db Entry Format (Write)

When writing Index.db entries in BIG format (NB variant), each entry follows this structure. The entry begins with the partition key length and the raw partition key bytes — there is no 0x0010 marker and no MD5 digest (Issue #552):

struct index_entry {
    be16 key_len;                  // Length of the raw partition key
    byte raw_key[key_len];         // Raw partition key bytes (same as in Data.db)
    vint data_offset;              // Byte offset in Data.db (VInt encoded)
    vint promoted_index_length;    // Length of promoted index data (0 = none)
    byte promoted_index_data[promoted_index_length];  // Only if length > 0
};

Key Requirements:

Key length: be16 length of the raw partition key (e.g. 0x0010 for a 16-byte UUID, 0x0026 for a 38-byte composite key, 0x0004 for a 4-byte int). This is a LENGTH, not a marker.
Raw key: The raw serialized partition key bytes, byte-for-byte identical to the key in Data.db (no hashing, no digest).
Data Offset: VInt-encoded byte offset in Data.db where the partition starts.
Promoted Index: Length of 0 for simple partitions (M5 Stage 0 implementation).

Index.db Offset Tracking

Critical: Capture offset BEFORE writing entry (Issue #407)

When adding entries to Index.db, the file offset where each entry starts must be captured BEFORE writing the entry bytes. This is essential for accurate Summary.db sampling:

// Capture the offset BEFORE writing
let index_offset = buffer.len() as u64;

// Write entry (key_len + raw key + data_offset + promoted_index_length)
write_entry(&mut buffer, key, data_offset)?;

// Entry size is whatever the write appended (varies with VInt widths)
let entry_size = buffer.len() as u64 - index_offset;

// Return IndexEntryInfo for Summary.db sampling
Ok(IndexEntryInfo {
    index_offset,      // Where this entry starts in Index.db
    entry_size,        // How many bytes were written
})

IndexEntryInfo Structure:

index_offset: Byte offset in Index.db where this entry starts
entry_size: Size of this entry in bytes (varies due to VInt encoding)

This information is used by Summary.db sampling to record accurate Index.db positions.

Raw Key Storage (no digest)

Index.db stores the raw partition key bytes length-prefixed; it does NOT store an MD5 digest. (A prior version of this guide incorrectly described an MD5 digest — see Issue #552.) Because the raw key is on disk, readers can:

Compare the raw partition key directly while walking Index.db forward from the summary sample (entries are variable-length, so Index.db itself cannot be binary-searched — the binary search happens in Summary.db)
Use the inline data_offset to seek straight into Data.db
Avoid any hash-collision handling at the index layer

VInt Encoding for Offsets

Data offsets use Cassandra’s unsigned VInt encoding:

1 byte for values 0-127 (encoded as 0x00-0x7F)
2 bytes for values 128-16383 (encoded as 0x8080-0xBFFF)
3 bytes for values 16384-2097151 (encoded as 0xC04000-0xDFFFFF)
And so on…

Example offset encodings:

0       → 0x00           (1 byte)
127     → 0x7F           (1 byte)
128     → 0x80 0x80      (2 bytes)
12381   → 0xB0 0x5D      (2 bytes)
16384   → 0xC0 0x40 0x00 (3 bytes)

Variable VInt sizes affect entry sizes and must be accounted for when computing Summary.db offsets.
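A minimal encoder sketch that reproduces the example encodings above. The name encode_unsigned_vint is an assumption for illustration; it is not claimed to match the signature of CQLite's encode_unsigned.

```rust
/// Appends `value` to `out` as an unsigned VInt.
fn encode_unsigned_vint(value: u64, out: &mut Vec<u8>) {
    // Significant bits in the value (at least 1 so that 0 takes one byte).
    let bits = (64 - value.leading_zeros() as usize).max(1);
    // An n-byte encoding carries 7 * n value bits for n in 1..=8;
    // anything larger uses the 9-byte form.
    let total = ((bits + 6) / 7).min(9);
    if total == 9 {
        // First byte is all 1-bits, followed by the full 8-byte value.
        out.push(0xFF);
        out.extend_from_slice(&value.to_be_bytes());
        return;
    }
    let extra = total - 1;
    let be = value.to_be_bytes();
    let start = out.len();
    out.extend_from_slice(&be[8 - total..]);
    // Set `extra` leading 1-bits in the first byte as the length prefix.
    out[start] |= !(0xFFu8 >> extra);
}
```

The number of bytes appended is the vint's contribution to the entry size, so a writer can measure it instead of estimating.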

Promoted Index (M5 Stage 0: Skipped)

For M5 Stage 0 (simple partitions), promoted index is not written:

// Write promoted index length (0 = no promoted index)
encode_unsigned(0, &mut buffer);

Promoted index is used for wide partitions — those where uncompressed data written for the current partition since the last IndexInfo block boundary reaches column_index_size (default 64 KiB, BigFormatPartitionWriter.DEFAULT_GRANULARITY). A partition with few but large rows can trigger this; one with many tiny rows below 64 KiB will not. A promoted index is only included when two or more IndexInfo blocks result (RowIndexEntry.create() lines 227-239). This can be added in future stages for wide partition support. Source: BigFormatPartitionWriter.java#L49,L213

Token Ordering Requirement

Index.db entries MUST be written in decorated-key order, matching Data.db partition ordering: token first, raw key bytes (unsigned) as the tie-break — the same order readers binary search (see “Sample order is partitioner order, not byte order” above).

CQLite’s writer enforces this with a strictly-increasing token check:

if key.token <= last_token {
    return Err("Partitions must be written in token order");
}

CQLite implementation note. That check is stricter than the format requires: it rejects two distinct partition keys that hash to the same 64-bit Murmur3 token, which the format permits and orders by comparing raw key bytes. A 64-bit token collision is vanishingly rare in practice, so this is a conservative refusal rather than a correctness problem — but a writer aiming for full format generality would tie-break on the raw key instead of erroring. Source: cqlite-core/src/storage/sstable/writer/incremental.rs:105-112 (and :213-220).
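For comparison, a check that tie-breaks on the raw key instead of erroring could look like the sketch below. This is an illustration of the alternative described above, not CQLite's code: OrderCheck and accept are hypothetical names and the error type is a plain String.

```rust
use std::cmp::Ordering;

/// Tracks the last accepted (token, raw key) pair.
struct OrderCheck {
    last: Option<(i64, Vec<u8>)>,
}

impl OrderCheck {
    /// Accepts a partition only if it sorts strictly after the previous one
    /// in decorated-key order: token first, raw key bytes as the tie-break.
    fn accept(&mut self, token: i64, raw_key: &[u8]) -> Result<(), String> {
        if let Some((last_token, last_key)) = &self.last {
            let ord = last_token
                .cmp(&token)
                .then_with(|| last_key.as_slice().cmp(raw_key));
            if ord != Ordering::Less {
                return Err("Partitions must be written in decorated-key order".to_string());
            }
        }
        self.last = Some((token, raw_key.to_vec()));
        Ok(())
    }
}
```

Two distinct keys with the same token are accepted when their raw bytes are increasing; an exact duplicate or an out-of-order token is still rejected.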

Writing Summary.db

Summary.db samples Index.db entries for efficient partition lookup without reading the full index.

Sampling Strategy

Default Sampling Interval: 128 partitions (min_index_interval)

Summary.db samples every Nth Index.db entry where N = min_index_interval. The first entry is always sampled (entry 0), then entries 128, 256, 384, … at full sampling.

Sampling Logic:

// Sample the first partition and every min_index_interval-th partition thereafter
if partition_index % min_index_interval == 0 {
    summary_writer.add_entry(&key, index_offset)?;
}

Cassandra expresses the same rule as a running nextSamplePosition advanced by minIndexInterval per sample, with a startPoints filter that skips the sample positions a target sampling_level drops (IndexSummaryBuilder.maybeAddEntry/setNextSamplePosition). At sampling_level == BASE_SAMPLING_LEVEL == 128, Downsampling.getStartPoints returns an empty array (numRounds == 0, Downsampling.java#L125-L147), so nothing is skipped and the two formulations coincide; a builder constructed at a lower level is what produces uneven, wider intervals.

size_at_full_sampling is ceil(keys_written / min_index_interval) (IndexSummaryBuilder.java#L273), which equals entries_count for a summary written at full sampling.
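The full-sampling arithmetic is easy to check in isolation. Both function names below are hypothetical, and the sketch assumes min_index_interval is greater than zero.

```rust
/// ceil(keys_written / min_index_interval); assumes min_index_interval > 0.
fn size_at_full_sampling(keys_written: u64, min_index_interval: u64) -> u64 {
    (keys_written + min_index_interval - 1) / min_index_interval
}

/// Partition indexes sampled at full sampling: 0, N, 2N, ...
fn sampled_partition_indexes(keys_written: u64, min_index_interval: u64) -> Vec<u64> {
    (0..keys_written)
        .step_by(min_index_interval as usize)
        .collect()
}
```

For 300 partitions at an interval of 128, the samples are partitions 0, 128 and 256, and size_at_full_sampling is 3, matching the sample count.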

Trade-offs:

Smaller interval (e.g., 64) = more memory, faster lookups
Larger interval (e.g., 256) = less memory, more I/O during lookups

Cassandra’s default of 128 is a reasonable balance for most workloads.

Uniform spacing is a write-time property. A freshly written summary has sampling_level == BASE_SAMPLING_LEVEL == 128, so every interval is exactly min_index_interval partitions wide. Cassandra may later downsample an existing summary under index-summary memory pressure, which drops samples in a fixed pattern and leaves the surviving intervals uneven and wider. Readers must therefore not assume uniform spacing — see “Interval width and downsampling” above.

When to Sample

The sampling decision is made during partition writes, using the IndexEntryInfo returned by IndexWriter::add_partition():

// Write partition to Data.db
let data_offset = data_writer.write_partition(&key, &mutations, &schema)?;

// Add entry to Index.db and get offset info
let entry_info = index_writer.add_partition(&key, data_offset)?;

// Sample for Summary.db if at interval boundary
if sample_counter % min_index_interval == 0 {
    summary_writer.add_entry(&key, entry_info.index_offset)?;
}
sample_counter += 1;

Critical: Use the actual index_offset from entry_info, not an estimated value. VInt encoding causes variable entry sizes, making offset estimation unreliable.

Summary.db Entry Format (Write)

Summary entries have no length prefix. Key boundaries are determined by the offset table:

struct summary_entry {
    byte key[];        // Variable length partition key bytes (no prefix!)
    le64 position;     // Position in Index.db file (LITTLE-ENDIAN — issue #1054)
};

Entry serialization:

// Write key bytes (no length prefix!)
buffer.extend_from_slice(&key_bytes);

// Write position (little-endian u64 — issue #1054)
buffer.extend_from_slice(&index_position.to_le_bytes());

Summary.db Offset Table

The offset table records the starting position of each entry, measured from the start of the offset table itself — so offsets[0] == entries_count * 4, not 0 (see “Offset Table” above).

Critical Gotcha: The offset table uses little-endian encoding (unlike the big-endian header and most other Cassandra components):

// Write offset table (LITTLE-ENDIAN!)
for offset in entry_offsets {
    buffer.extend_from_slice(&offset.to_le_bytes());
}

Offset Calculation:

Two details are easy to get wrong here, and Cassandra’s deserializer rejects both mistakes:

Offsets are absolute within the combined (offset_table + entry_data) region, so offset[0] == offset_table_size (= entries_count * 4), not 0. Cassandra adds this baseOffset on serialize and subtracts it on deserialize (IndexSummary.java#L405-L420). CQLite biases its offsets the same way (issue #666).
The per-entry position is little-endian, matching the offset table (issue #1054).

let offset_table_size = entries.len() * 4;      // u32 per entry
let mut entry_offsets = Vec::new();
let mut entry_data = Vec::new();

for entry in entries {
    // Record the ABSOLUTE offset BEFORE writing entry data
    entry_offsets.push((offset_table_size + entry_data.len()) as u32);

    // Write key (no length prefix) and Index.db position (little-endian)
    entry_data.extend_from_slice(&entry.key);
    entry_data.extend_from_slice(&entry.position.to_le_bytes());
}

Summary.db Header Format (Write)

The header is 24 bytes (big-endian):

fn write_header(&self, buffer: &mut Vec<u8>, entries_count: u32, summary_entries_size: u64) {
    // min_index_interval (u32, BE)
    buffer.extend_from_slice(&self.min_index_interval.to_be_bytes());

    // entries_count (u32, BE)
    buffer.extend_from_slice(&entries_count.to_be_bytes());

    // summary_entries_size (u64, BE) = offset_table_size + entry_data_size
    buffer.extend_from_slice(&summary_entries_size.to_be_bytes());

    // sampling_level (u32, BE) - downsampling level: 1..BASE_SAMPLING_LEVEL (128).
    // For a freshly written SSTable use BASE_SAMPLING_LEVEL (128), NOT min_index_interval.
    // They share the same default value (128) by coincidence but diverge after downsampling.
    // Source: Downsampling.java:34 (BASE_SAMPLING_LEVEL=128); IndexSummary.java:88-94,226-229
    buffer.extend_from_slice(&self.sampling_level.to_be_bytes()); // 128 for new SSTables

    // size_at_full_sampling (u32, BE) - entry count this summary WOULD have at full sampling.
    // Cassandra computes ceil(keys_written / min_index_interval), NOT entries_count.
    // After downsampling, entries_count < size_at_full_sampling; they are only equal for a
    // freshly written SSTable where sampling_level == BASE_SAMPLING_LEVEL == 128.
    // Source: IndexSummaryBuilder.java:273 (value written); IndexSummary.getMaxNumberOfEntries()
    buffer.extend_from_slice(&self.size_at_full_sampling.to_be_bytes());
}

summary_entries_size is the combined region length, not just the key/position bytes. Cassandra writes getOffHeapSize() = entries_count * 4 + entries_length — the offset table plus the entry data (IndexSummary.java#L401-L405, #L258-L261). A reader that treats it as the entry-data length alone will mis-locate the trailing first/last keys.

summary_entries_size Calculation:

let offset_table_size = entry_count * 4;  // u32 per entry
let entry_data_size = total_key_bytes + (entry_count * 8);  // keys + positions
let summary_entries_size = offset_table_size + entry_data_size;

First and Last Keys

Summary.db stores serialized first and last keys at the end of the file for quick boundary lookups:

// Write first key (length-prefixed, big-endian)
buffer.extend_from_slice(&(first_key.len() as u32).to_be_bytes());
buffer.extend_from_slice(&first_key);

// Write last key (length-prefixed, big-endian)
buffer.extend_from_slice(&(last_key.len() as u32).to_be_bytes());
buffer.extend_from_slice(&last_key);

These are tracked automatically during writes:

First key: Captured on first add_entry() call
Last key: Updated on every add_entry() call
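The pieces of the file (header, offset table, entry data, first and last keys) can be assembled in one pass. The sketch below is a simplified illustration for a summary written at full sampling; SummarySample and serialize_summary are hypothetical names, the first and last keys are passed in by the caller, and min_index_interval is assumed to be greater than zero.

```rust
struct SummarySample {
    key: Vec<u8>,
    index_position: u64,
}

const BASE_SAMPLING_LEVEL: u32 = 128;

/// Serializes a full-sampling Summary.db image.
fn serialize_summary(
    samples: &[SummarySample],
    min_index_interval: u32,
    keys_written: u64,
    first_key: &[u8],
    last_key: &[u8],
) -> Vec<u8> {
    let offset_table_size = samples.len() * 4;
    let mut offset_table = Vec::with_capacity(offset_table_size);
    let mut entry_data = Vec::new();
    for sample in samples {
        // Offsets are measured from the start of the offset table (LE u32).
        let offset = (offset_table_size + entry_data.len()) as u32;
        offset_table.extend_from_slice(&offset.to_le_bytes());
        entry_data.extend_from_slice(&sample.key);
        entry_data.extend_from_slice(&sample.index_position.to_le_bytes());
    }
    let interval = u64::from(min_index_interval);
    let size_at_full_sampling = ((keys_written + interval - 1) / interval) as u32;

    // 24-byte big-endian header.
    let mut out = Vec::new();
    out.extend_from_slice(&min_index_interval.to_be_bytes());
    out.extend_from_slice(&(samples.len() as u32).to_be_bytes());
    out.extend_from_slice(&((offset_table.len() + entry_data.len()) as u64).to_be_bytes());
    out.extend_from_slice(&BASE_SAMPLING_LEVEL.to_be_bytes());
    out.extend_from_slice(&size_at_full_sampling.to_be_bytes());

    out.extend_from_slice(&offset_table);
    out.extend_from_slice(&entry_data);

    // Trailing first and last keys, each with a big-endian u32 length prefix.
    for key in [first_key, last_key] {
        out.extend_from_slice(&(key.len() as u32).to_be_bytes());
        out.extend_from_slice(key);
    }
    out
}
```

With the single 16-byte sample from the composite_key_table example, bytes 24..52 of the output match the real bytes shown in the Offset Table section.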

Component Integration Workflow

The complete SSTable write workflow coordinates all components:

Write Order (Critical)

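CQLite writes components in this order. The offset dependencies (Data.db before Index.db before Summary.db) and TOC.txt last are the constraints that matter; the position of Statistics.db is an implementation choice, as noted in its entry: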

Statistics.db - Written first in the CQLite implementation to provide a delta encoding baseline. Note: this ordering is a CQLite implementation decision; Cassandra’s BigTableWriter streams Data.db as the primary output and computes statistics alongside—there is no format constraint requiring Statistics.db before Data.db.
Data.db - Main partition/row data
Index.db - Partition index (uses Data.db offsets)
Summary.db - Sampled index entries (uses Index.db offsets)
Filter.db - Bloom filter
Digest.crc32 - Data.db checksum
TOC.txt - Table of contents (LAST, publication barrier)

Data.db → Index.db → Summary.db Flow

Phase 1: Write Partition to Data.db

// Write partition and get Data.db offset
let data_offset = data_writer.write_partition(&key, &mutations, &schema)?;
// data_offset = byte position where partition starts in Data.db

Phase 2: Write Index.db Entry

// Add Index.db entry and get offset info
let entry_info = index_writer.add_partition(&key, data_offset)?;
// entry_info.index_offset = byte position where entry starts in Index.db
// entry_info.entry_size = size of entry in bytes (varies due to VInt)

Phase 3: Sample for Summary.db

// Sample the first entry and every 128th entry after it
if sample_counter % 128 == 0 {
    summary_writer.add_entry(&key, entry_info.index_offset)?;
}
sample_counter += 1;

Phase 4: Add to Bloom Filter

// Add partition key to Filter.db
filter_writer.add_key(&key);

Complete Example

// Initialize writers
let mut data_writer = DataWriter::new(stats);
let mut index_writer = IndexWriter::new();
let mut summary_writer = SummaryWriter::new(128);
let mut filter_writer = FilterWriter::new(filter_path, capacity, 0.01)?;

let mut sample_counter = 0;

// For each partition (in token order)
for (key, mutations) in partitions {
    // 1. Write to Data.db
    let data_offset = data_writer.write_partition(&key, &mutations, &schema)?;

    // 2. Write to Index.db
    let entry_info = index_writer.add_partition(&key, data_offset)?;

    // 3. Sample for Summary.db
    if sample_counter % 128 == 0 {
        summary_writer.add_entry(&key, entry_info.index_offset)?;
    }
    sample_counter += 1;

    // 4. Add to Bloom filter
    filter_writer.add_key(&key);
}

// Finalize components
let data_bytes = data_writer.finish()?;
let index_bytes = index_writer.finish()?;
let summary_bytes = summary_writer.finish()?;
filter_writer.finish().await?;

Offset Relationships

The offset relationships between components:

Data.db:
  [Partition 1 at offset 0]
  [Partition 2 at offset 250]
  [Partition 3 at offset 500]

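Index.db (16-byte UUID keys, no promoted index → each entry is 2 + 16 + vint(data_offset) + 1 bytes):
  [Entry 1 at offset 0:  key_len + raw_key + data_offset=0   + promoted_index_len=0 → 20 bytes]
  [Entry 2 at offset 20: key_len + raw_key + data_offset=250 + promoted_index_len=0 → 21 bytes; 250 needs a 2-byte vint]
  [Entry 3 at offset 41: key_len + raw_key + data_offset=500 + promoted_index_len=0]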

Summary.db:
  [Entry 0 sampled: key + index_offset=0]    ← Sample every 128th
  [Entry 128 sampled: key + index_offset=X]  ← (if 128+ partitions)
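
Index.db entry positions depend on how many bytes each vint takes, so they cannot be derived by multiplying a fixed entry size. The sketch below computes them under stated assumptions: entries have no promoted index (promoted_index_len = 0), and the vint width follows the Cassandra unsigned-vint rule of 7 payload bits in one byte, 14 in two, up to 9 bytes for a full 64-bit value. The function names are illustrative, not CQLite's.

```rust
/// Encoded size in bytes of an unsigned vint holding `value`.
fn unsigned_vint_size(value: u64) -> usize {
    // Significant bits in the value; `| 1` makes zero count as one bit.
    let bits = 64 - (value | 1).leading_zeros() as usize;
    // Each byte carries 7 payload bits, except that 9 bytes cover all 64 bits.
    ((bits + 6) / 7).min(9)
}

/// Size of one Index.db entry without a promoted index:
/// u16 key length + raw key + vint data_offset + vint promoted_index_len (0).
fn index_entry_size(key_len: usize, data_offset: u64) -> usize {
    2 + key_len + unsigned_vint_size(data_offset) + unsigned_vint_size(0)
}

/// Index.db start offset of each entry, given Data.db offsets in write order.
fn index_entry_offsets(key_len: usize, data_offsets: &[u64]) -> Vec<usize> {
    let mut next = 0;
    data_offsets
        .iter()
        .map(|&data_offset| {
            let start = next;
            next += index_entry_size(key_len, data_offset);
            start
        })
        .collect()
}
```

A data_offset below 128 encodes in one byte, while 250 and 500 each need two, so an entry pointing past the first 127 bytes of Data.db is one byte longer than the entry for offset 0. This is why the writer reports entry_size per entry instead of assuming a constant.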

For the read direction of these same offsets, see “Partition Lookup Flow” above — the summary sample bounds where the Index.db walk starts and ends, so the walk is never a whole-file scan when a usable Summary.db is present.

Read-Side: Full-Scan Enumeration (CQLite implementation note)

Not every read is a point lookup: compaction and unfiltered SELECT enumerate every partition of an SSTable. This subsection is CQLite behavior, not a format requirement — Cassandra’s own enumeration is BigTableScanner over Index.db. CQLite has three enumeration shapes for uncompressed BIG, distinguished by what they hold in memory:

Shape	Index.db entries	Result rows	Source
Materializing	whole map resident	whole Vec before first emit	reader/data_access/full_index_scan.rs
Streaming (#2361/#2366)	whole map resident	emitted per partition	reader/data_access/full_index_stream.rs
Summary-guided streaming (#2412/#2413)	forward-streamed, window-bounded	emitted per partition	reader/data_access/summary_scan/mod.rs

Two properties are worth knowing when reasoning about scan I/O:

Streaming emission is not the same as a streaming index. The #2361 streaming walk still materializes the Index.db entry map before walking it (full_index_stream.rs:161 calls ensure_materialized); what it avoids is buffering the whole result. Only the Summary-guided walk streams the index itself, starting from an authoritative summary sample position (summary_scan/mod.rs:201-220).
Sequential windowing beats per-partition positioned reads. Because entries are visited in ascending data_offset order (== physical Data.db order), the uncompressed walk fills one forward-only read window (target 4 MiB, grown to fit an oversized partition — full_index_stream.rs:68) and serves many consecutive partitions from it, instead of one seek + read_exact + CRC per partition. Work scales with data_section_len / window, not with partition count. Measured on the in-repo fixtures: 500 small partitions went from 500 index probes and 500 positioned reads to 0 probes and 1 window refill; a 51-partition / ≈9.9 MiB fixture needs 3 window refills and 0 probes (full_index_stream_tests.rs:408-410,621-623).

Neither property changes a single on-disk byte — both are consequences of the ordering guarantee that Index.db entries ascend in token order and therefore in Data.db offset order.
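
The claim that work scales with data_section_len / window rather than with partition count can be illustrated with a toy model. This is a simplified sketch and not the full_index_stream.rs implementation: count_window_refills is a hypothetical function, it assumes each refill starts at the first byte of the partition that missed, and it ignores end-of-file clipping and CRC handling.

```rust
/// Counts forward-only window refills needed to serve `partitions`,
/// given as (data_offset, length) pairs sorted by ascending data_offset.
fn count_window_refills(partitions: &[(u64, u64)], target_window: u64) -> usize {
    let mut refills = 0;
    let mut window_start = 0u64;
    let mut window_end = 0u64; // exclusive end of the buffered range
    for &(offset, len) in partitions {
        let end = offset + len;
        // Served from memory only if the whole partition is already buffered.
        let buffered = refills > 0 && offset >= window_start && end <= window_end;
        if !buffered {
            // Refill at this partition; grow the window if the partition is oversized.
            window_start = offset;
            window_end = offset + target_window.max(len);
            refills += 1;
        }
    }
    refills
}
```

In this model, 500 contiguous 100-byte partitions fit inside a single 4 MiB window and cost one refill, whereas three contiguous 3 MiB partitions cost three. The partition count alone does not determine the I/O count; the byte span does.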

Memory Efficiency

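Streaming Writes: Writers serialize each entry as soon as it is produced instead of accumulating parsed structures, which avoids unbounded memory growth. The one exception is SummaryWriter, which keeps its entries in memory but holds only the sampled subset: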

IndexWriter: Serializes entries immediately to buffer
SummaryWriter: Stores entries in-memory (small, sampled subset)
DataWriter: Serializes rows immediately to buffer
FilterWriter: Uses disk-based Bloom filter construction

Memory usage is bounded by:

Number of sampled entries (not total entries)
Bloom filter size (configurable)
Statistics metadata (fixed size)

References

Cassandra 5.0.0:
- IndexSummary: org.apache.cassandra.io.sstable.indexsummary.IndexSummary (package path changed in 5.0.x)
- BIG reader: org.apache.cassandra.io.sstable.format.big.BigTableReader, source file org/apache/cassandra/io/sstable/format/big/BigTableReader.java (SSTableReader is now abstract; BIG reading is in BigTableReader)
- BIG writer: org/apache/cassandra/io/sstable/format/big/BigTableWriter.java
Cassandra 5.0.8 — additional citations:
- IndexSummary offset LE encoding: IndexSummary.java#L411-L418
- Promoted index 64 KiB trigger: BigFormatPartitionWriter.java#L49,L213
- RowIndexEntry binary format: RowIndexEntry.java#L57-L68
- DeletionTime “oa” serialization (mfda first): DeletionTime.java#L210-L219
- BASE_SAMPLING_LEVEL=128: Downsampling.java#L34
- Decorated-key comparator (token first, then compareUnsigned): DecoratedKey.java#L79-L102
- Murmur3 token comparison: Murmur3Partitioner.java#L177-L179
- Summary binary search + scan position: IndexSummary.java#L127-L152, #L361-L373
- Average vs exact interval width: IndexSummary.java#L208-L211, #L273-L276, Downsampling.java#L116-L123
- Bounded EQ index walk: BigTableReader.java#L277-L320
- Summary loaded at open; rebuilt from Index.db only when absent/corrupt/interval-mismatched: BigSSTableReaderLoadingBuilder.java#L96-L130, #L249-L266
- Token-range scan start: BigTableScanner.java#L67-L97, BigTableReader.java#L493-L499, Token.java#L204-L252
- Summary serialize/deserialize (absolute offsets, LE detection, rebuild triggers): IndexSummary.java#L399-L420, #L423-L476
- summary_entries_size = offset table + entry data: IndexSummary.getOffHeapSize
- Sampling rule + size_at_full_sampling: IndexSummaryBuilder.java#L200-L243, #L273, Downsampling.getStartPoints
- data_offset is an uncompressed-stream position (no Data.db header): BigTableWriter.java#L95-L113
CQLite Implementation:
- cqlite-core/src/storage/sstable/writer/index_writer.rs - Index.db writer
- cqlite-core/src/storage/sstable/writer/summary_writer.rs - Summary.db writer
- cqlite-core/src/storage/sstable/writer/data_writer.rs - Data.db writer
- cqlite-core/src/storage/sstable/writer/mod.rs - SSTableWriter coordinator
- cqlite-core/src/storage/sstable/summary_reader/interval.rs - token-ordered summary search + bounded Index.db interval read
- cqlite-core/src/storage/sstable/summary_reader/mod.rs:229 - scan_start_position_for_token (floor sample for a token-range walk)
- cqlite-core/src/storage/sstable/reader/data_access/full_index_stream.rs - streaming walk + 4 MiB sequential read window
- cqlite-core/src/storage/sstable/reader/data_access/summary_scan/mod.rs - Summary-guided streaming walk with token pushdown
- cqlite-core/src/storage/sstable/index_reader/mod.rs:52-60 - key_digest field is a historical misnomer holding raw key bytes (Issue #552)
- cqlite-core/src/storage/sstable/writer/incremental.rs:105-112 - strict token-order gate

For implementation details, see Appendix C.