How to find a row by key (flow card)

This one-pager stitches Bloom → (Summary → Index) → Data for BIG vs BTI, with byte-level seeks and failure paths.

Overview

Bloom check (Filter.db) → early negative exit when absent
BIG: Summary jump (Summary.db) → locate index_offset; BTI has no Summary.db
Index scan (Index.db for BIG, Partitions.db trie for BTI) → get data_offset, bounded by the summary interval for BIG
Data read (Data.db) → parse partition + rows via SerializationHeader

Cassandra also short-circuits before any of this: the reader compares the target to the SSTable’s first/last keys and skips the file entirely when out of range, and consults the key cache before touching Summary.db (BigTableReader.java#L228-L275).
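The four overview stages chain together with an early exit at each negative result. The sketch below only illustrates that control flow: the function name lookup_big and all four stage callables are hypothetical stand-ins, not Cassandra or CQLite APIs, and the first/last-key range check and key cache are left out.

```python
from typing import Callable, Optional


def lookup_big(
    key: bytes,
    bloom_might_contain: Callable[[bytes], bool],
    summary_floor: Callable[[bytes], int],
    index_walk: Callable[[bytes, int], Optional[int]],
    read_partition: Callable[[int], bytes],
) -> Optional[bytes]:
    """Chain the BIG stages; every stage is a caller-supplied (hypothetical) callable."""
    if not bloom_might_contain(key):             # Filter.db: early negative exit
        return None
    index_offset = summary_floor(key)            # Summary.db: where the Index.db walk starts
    data_offset = index_walk(key, index_offset)  # Index.db: None when the key is not found
    if data_offset is None:                      # positive Bloom check but no entry
        return None
    return read_partition(data_offset)           # Data.db: parse partition + rows
```

For BTI the two middle stages collapse into a single trie lookup, described in the BTI section.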

BIG (legacy/newbig) flow

1) Bloom

Load Filter.db if present; might_contain(decorated_key).
Negative → stop. Positive → continue.

2) Summary — O(log s) over s samples

Binary search the samples in decorated-key order: partitioner token first (Murmur3Partitioner.LongToken.compareTo = Long.compare), raw key bytes compared unsigned only as a tie-break. A raw-byte (memcmp) comparator is wrong and lands in the wrong interval.
The result is the floor sample — the greatest sample whose decorated key is <= target — whose position is where the Index.db walk starts. The next sample’s position is where the covering interval ends. No sample <= target yields position 0 (walk from the index start).
Sources: IndexSummary.binarySearch(PartitionPosition) and getScanPositionFromBinarySearchResult; comparator in DecoratedKey.java#L93-L102.
Note: the binary search operates on decorated keys, not raw token values — and the token is not stored in the entry, so it is recomputed from the sample’s raw key.
A BIG reader loads Summary.db at open and does not scan Index.db — unless the summary is absent, corrupt, or was written with a different min_index_interval, in which case it is rebuilt by walking Index.db (BigSSTableReaderLoadingBuilder.java#L96-L130).
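A minimal sketch of the floor search in step 2, assuming each sample has already been paired with its token. The Sample tuple layout and the function names are illustrative only; computing the token requires the partitioner's hash, which is not reproduced here, so the token is a caller-supplied input.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (token, raw_key, index_position): a hypothetical in-memory form of one summary sample.
Sample = Tuple[int, bytes, int]


def decorated_sort_key(token: int, raw_key: bytes) -> Tuple[int, bytes]:
    # Signed token first; Python compares bytes lexicographically as unsigned
    # values, which gives the unsigned tie-break on the raw key.
    return (token, raw_key)


def floor_scan_position(
    samples: List[Sample], token: int, raw_key: bytes
) -> Tuple[int, Optional[int]]:
    """Return (start, end) of the covering interval in Index.db.

    start is the floor sample's position (0 when no sample <= target);
    end is the next sample's position, or None for the last interval.
    """
    keys = [decorated_sort_key(t, k) for t, k, _ in samples]
    i = bisect_right(keys, decorated_sort_key(token, raw_key)) - 1
    start = samples[i][2] if i >= 0 else 0
    end = samples[i + 1][2] if i + 1 < len(samples) else None
    return start, end


# Samples sorted by token; note the raw keys are NOT in byte order.
samples = [(-5, b"\xff", 0), (3, b"\x01", 100), (10, b"\x80", 200)]
print(floor_scan_position(samples, 5, b"\x00"))  # (100, 200)
```

In the example, a raw-byte comparison would sort the target b"\x00" below every sample and start at position 0, while token order correctly selects the interval that begins at 100.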

3) Index

From index_offset, parse entries sequentially until the target is found or passed.

Entry format (always length-prefixed — there is no marker byte, no MD5 digest):

u16     key_length          <- raw partition-key byte count (e.g. 0x0010 for a 16-byte UUID)
bytes   raw_key_bytes       <- partition key, key_length bytes
vint    data_offset         <- byte offset into Data.db
vint    promoted_index_len  <- length of promoted index block (0 = none)
bytes   promoted_index      <- present only when promoted_index_len > 0

Source: BigTableReader.java#L298-L325. Line 298 reads the key with ByteBufferUtil.readWithShortLength(in), so the u16 is always a key length.
Compare raw key bytes to target. On match → deserialize RowIndexEntry.
On mismatch → call RowIndexEntry.Serializer.skip(in, descriptor.version) and advance to the next entry. Source: BigTableReader.java#L349

Note on 0x0010: The value 0x0010 is the key length for a 16-byte UUID partition key. It is not a fixed marker. An earlier version of this guide misidentified it as a format marker followed by an MD5 digest — that was incorrect (Issue #552). Every entry uses a plain u16 key length prefix regardless of key type.
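The entry layout above can be decoded with a few lines. This is a sketch under two stated assumptions: the u16 is big-endian (consistent with 0x0010 = 16 in the hex dump), and the vints use the unsigned layout in which the number of leading 1-bits in the first byte is the count of bytes that follow. IndexEntry, read_unsigned_vint and read_index_entry are illustrative names, not Cassandra or CQLite functions.

```python
import struct
from typing import NamedTuple, Tuple


class IndexEntry(NamedTuple):
    raw_key: bytes
    data_offset: int
    promoted_index_len: int


def read_unsigned_vint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned vint at pos; return (value, next_pos)."""
    first = buf[pos]
    pos += 1
    if first < 0x80:                      # high bit clear: single-byte value
        return first, pos
    extra = 0                             # leading 1-bits = number of extra bytes
    while extra < 8 and first & (0x80 >> extra):
        extra += 1
    value = first & (0xFF >> extra)       # remaining low bits of the first byte
    for _ in range(extra):
        value = (value << 8) | buf[pos]
        pos += 1
    return value, pos


def read_index_entry(buf: bytes, pos: int) -> Tuple[IndexEntry, int]:
    """Parse one Index.db entry at pos; return (entry, position of the next entry)."""
    (key_len,) = struct.unpack_from(">H", buf, pos)  # u16 key length, never a marker
    pos += 2
    raw_key = buf[pos:pos + key_len]
    pos += key_len
    data_offset, pos = read_unsigned_vint(buf, pos)
    promoted_len, pos = read_unsigned_vint(buf, pos)
    pos += promoted_len                              # skip the promoted index block
    return IndexEntry(raw_key, data_offset, promoted_len), pos


# The 16-byte UUID entry from the annotated hex section.
sample = bytes.fromhex("0010" "15291a77d7394e738397b787442f3a1f" "00" "00")
entry, next_pos = read_index_entry(sample, 0)
print(len(entry.raw_key), entry.data_offset, next_pos)  # 16 0 20
```

Returning the next position lets the caller keep walking on a key mismatch, which is the role Serializer.skip() plays in the Java reader.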

4) Data

Seek to data_offset in Data.db.
NB format note: Data.db has no standalone header; chunk boundaries come from CompressionInfo.db.
Parse partition header and rows using SerializationHeader.
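For a compressed SSTable, data_offset is a position in the uncompressed stream, so the seek first has to find the enclosing chunk. The sketch below assumes fixed-size uncompressed chunks and a list of compressed-file chunk offsets already loaded from CompressionInfo.db; locate_chunk, chunk_length and chunk_offsets are hypothetical names for illustration.

```python
from typing import List, Tuple


def locate_chunk(
    data_offset: int, chunk_length: int, chunk_offsets: List[int]
) -> Tuple[int, int]:
    """Map an uncompressed data_offset to (compressed chunk start, offset inside the chunk).

    chunk_offsets[i] is the position in the compressed Data.db where chunk i begins.
    """
    if data_offset < 0:
        raise ValueError("negative data_offset")
    index = data_offset // chunk_length          # which uncompressed chunk holds the offset
    if index >= len(chunk_offsets):
        raise ValueError("data_offset is beyond the last chunk")
    return chunk_offsets[index], data_offset % chunk_length


print(locate_chunk(70000, 65536, [0, 31000, 60500]))  # (31000, 4464)
```

The bounds check is the point of the sketch: an offset that maps past the last chunk signals a corrupt index entry or a truncated file, not a partition to parse.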

Bounded interval scan — the walk is not open-ended

The Index.db walk in step 3 is bounded by the summary interval the floor sample opens, so an EQ lookup touches roughly one interval’s worth of entries, not the whole file:

Cassandra computes the bound as getEffectiveIndexIntervalAfterIndex(sampledIndex) — the exact width of this interval, from the downsampling pattern. While i <= effectiveInterval it compares raw key bytes (fast path); beyond that it compares decorated keys and returns “absent” the moment an index key sorts above the target (BigTableReader.java#L277-L320).
Width is min_index_interval only at full sampling. A downsampled summary (sampling_level < 128) has uneven, wider intervals, and the widest can exceed the average-width formula min_index_interval * 128 / sampling_level (at 128/96: average ~170.7, widest 256). Never bound a walk by the average.
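The arithmetic behind the 128/96 case, as a short sketch. The function name is illustrative, and the widest interval (256) is the figure quoted above rather than something this snippet derives from the downsampling pattern.

```python
def average_interval_width(
    min_index_interval: int, sampling_level: int, base_level: int = 128
) -> float:
    """Average number of index entries per summary interval after downsampling."""
    return min_index_interval * base_level / sampling_level


avg = average_interval_width(128, 96)
widest = 256                          # quoted above for 128/96, not computed here
unvisited = widest - int(avg)         # entries a walk capped at the average never reaches

print(round(avg, 1), unvisited)       # 170.7 86
```

A walk capped at the average would stop 86 entries short in the widest interval and report a present key as absent.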

CQLite implementation note. CQLite’s BIG reader, when a usable Summary.db is present, reads exactly the covering interval’s [start, end) bytes and classifies a miss by whether the interval is end-bounded (delimited above by a real next sample) — an end-bounded miss is authoritative absence, while a miss in the last (read-to-EOF) interval falls back to a whole-file key scan, because a tail-truncated Index.db can only lose entries there. Source: cqlite-core/src/storage/sstable/reader/summary_point.rs:53-76 and cqlite-core/src/storage/sstable/reader/data_access/big_point.rs:140-177.

Failure/negative path

If raw key != target → call Serializer.skip() and advance to the next Index.db entry.
If the walk leaves the covering interval without a match, the partition is absent from this SSTable (for an EQ search reached through a positive Bloom check, a false-positive Bloom hit). This conclusion is only as authoritative as the interval bound: it assumes the interval’s bytes are intact, which is why a reader that cannot rule out a truncated tail must widen the search rather than report absence.
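The miss rule can be written as a small classifier. This is a sketch of the rule as described in this section, not CQLite's actual code: walk_interval, the entry tuple shape and the result labels are all hypothetical.

```python
from typing import Iterable, Optional, Tuple

# (entry_position, raw_key, data_offset): a hypothetical decoded Index.db entry.
Entry = Tuple[int, bytes, int]


def walk_interval(
    entries: Iterable[Entry], target: bytes, interval_end: Optional[int]
) -> Tuple[str, Optional[int]]:
    """Walk entries from the floor sample's position and classify the outcome.

    interval_end is the next sample's position, or None for the last
    (read-to-EOF) interval.
    """
    for entry_pos, raw_key, data_offset in entries:
        if interval_end is not None and entry_pos >= interval_end:
            break                                   # left the covering interval
        if raw_key == target:
            return "found", data_offset
    if interval_end is not None:
        return "absent", None                       # end-bounded miss is authoritative
    return "fallback_full_scan", None               # tail may be truncated: widen the search


entries = [(0, b"a", 10), (20, b"b", 30), (40, b"c", 50)]
print(walk_interval(entries, b"b", 40))    # ('found', 30)
print(walk_interval(entries, b"c", 40))    # ('absent', None)
print(walk_interval(entries, b"z", None))  # ('fallback_full_scan', None)
```

Only the unbounded last interval triggers the fallback, so a Bloom false positive in any other interval costs a single interval read.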

BTI (5.0) exact-key lookup — full detail

BTI uses a trie partition index (Partitions.db) instead of the Summary → Index scan chain. There is no Summary.db and no Index.db in a BTI SSTable. Source: BtiFormat.java#L83-L102

1) Bloom — BtiTableReader.isPresentInFilter(dk). Negative → stop.

2) PartitionIndex trie (Partitions.db) — partitionIndex.openReader().exactCandidate(dk). Returns NOT_FOUND → stop. Otherwise returns indexPos:

indexPos >= 0 → position in Rows.db. Proceed to step 3a.
indexPos < 0 → ~indexPos is a direct position in Data.db (small partition, no row index). Proceed to step 3b.

Source: BtiTableReader.java#L223-L276
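The sign-based routing in step 2 amounts to one branch and one bitwise complement. In this sketch, route_index_pos is an illustrative name and NOT_FOUND is modeled as None, which is an assumption about representation rather than the Java sentinel.

```python
from typing import Optional, Tuple


def route_index_pos(index_pos: Optional[int]) -> Optional[Tuple[str, int]]:
    """Decide which file the next read goes to, and at what position."""
    if index_pos is None:                  # NOT_FOUND, modeled as None (assumption)
        return None
    if index_pos >= 0:
        return "Rows.db", index_pos        # step 3a: key verify + TrieIndexEntry
    return "Data.db", ~index_pos           # step 3b: complement recovers the position


print(route_index_pos(128))   # ('Rows.db', 128)
print(route_index_pos(-1))    # ('Data.db', 0)
print(route_index_pos(-43))   # ('Data.db', 42)
```

Complement rather than negation is what lets Data.db position 0 be encoded: it maps to -1, whereas negating 0 would be indistinguishable from Rows.db position 0.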

3a) Rows.db key verify + TrieIndexEntry — Read u16-prefixed partition key at indexPos; compare to target. Mismatch → stop. Deserialize TrieIndexEntry:

vint    dataFilePosition   <- offset of partition start in Data.db
vint    indexTrieRoot      <- delta-encoded relative to current file position
u32     rowIndexBlockCount
bytes   DeletionTime       <- 12 bytes (version "oa": mfda 8 B + ldt 4 B)

Source: TrieIndexEntry.java#L90-L118

3b) Data.db direct — Read u16-prefixed key from ~indexPos; compare to target. Mismatch → stop. Use ~indexPos as dataFilePosition.

4) Data seek — Seek dataFilePosition in Data.db; parse with SerializationHeader.

Key cache absent in BTI: TrieIndexEntry.serializeForCache() throws AssertionError("BTI SSTables should not use key cache"). All BTI lookups traverse the trie. Source: TrieIndexEntry.java#L68-L76

Byte-level seek checklist

BIG — Summary → Index: verify index_offset is within file bounds; summary entries are in decorated-key order (partitioner token first, raw bytes unsigned as tie-break), so search them with a token-order comparator, never memcmp. Bound the forward walk by the next sample’s position (or by the exact per-interval width), not by the average interval width.
BIG — Index → Data: the leading u16 is the key byte length; compare raw key bytes to target. No marker, no digest. Call Serializer.skip() on mismatch.
BTI — Partitions.db → Data: sign-bit of indexPos routes to Rows.db (>= 0) or directly to Data.db (< 0, decode as ~indexPos).
Data seek (both formats): ensure data_offset lies within Data.db's bounds (NB compressed: resolve the enclosing chunk from CompressionInfo.db before reading).

Annotated hex — BIG Index.db entries

// BIG Index.db -- 16-byte UUID partition key (key_length = 0x0010 = 16, NOT a marker)
00000000: 0010 1529 1a77 d739 4e73 8397 b787 442f  |...).w.9Ns....D/|
00000010: 3a1f 00 00 ...
//        ^^^^ ^^ ^^
//        |    |  promoted_index_len = 0
//        |    data_offset vint = 0
//        last 2 bytes of the UUID

// BIG Index.db -- 26-byte composite key (key_length = 0x001a = 26)
00000000: 001a 0010 37ac 9f53 bd8e 4da5 a41a 240f  |....7..S..M...$.|
//        ^^^^
//        key_length = 26  (not a marker!)

References

IndexSummary (5.0.8): org.apache.cassandra.io.sstable.indexsummary.IndexSummary
Decorated-key comparator (token first, then unsigned bytes): DecoratedKey.java#L79-L102, Murmur3Partitioner.java#L177-L179
Exact per-interval walk bound: IndexSummary.java#L273-L276, Downsampling.java#L116-L123
Summary loaded at open (rebuild only when absent/corrupt/interval-mismatched): BigSSTableReaderLoadingBuilder.java#L96-L130
BIG reader: BigTableReader.java
BIG raw-key comparison (no digest): BigTableReader.java#L298-L325
BTI has no Summary.db: BtiFormat.java#L100-L108
TrieIndexEntry on-disk format: TrieIndexEntry.java#L90-L118
Key cache absent in BTI: TrieIndexEntry.java#L68-L76
SerializationHeader: SerializationHeader.java
BTI format spec: org.apache.cassandra.io.sstable.format.bti (see BtiFormat.md in-tree)