BTI (Big Trie-Indexed) Formats
BTI (Big Trie-Indexed) Formats
Section titled “BTI (Big Trie-Indexed) Formats”BTI is the modern SSTable index format introduced to improve lookup efficiency, cache locality, and on-disk structure over the classic big family. Instead of a single Index.db plus sampled Summary.db, BTI splits indexing into trie-structured files that directly encode byte-comparable keys, reducing indirection and making prefix and range navigation more predictable. This chapter contrasts BTI with big/mc/mm and calls out practical impacts on read amplification.
In this chapter you will learn
Section titled “In this chapter you will learn”- What BTI changes relative to big/mc/mm
- How BTI’s index structures affect read paths
- Where BTI lives in the codebase
- Practical implications for latency
Motivation and Structure
Section titled “Motivation and Structure”BTI (Big Trie-Indexed) replaces the classic big index structure with trie-based indexes that traverse keys byte-by-byte instead of binary-searching a sampled summary—reading less data and requiring less processing per lookup, with equivalent seek counts on a match (BtiFormat.md lines 581–589). In Cassandra 5.0, BTI artifacts live alongside the data file and statistics:
Connection to In-Memory Tries: The Efficiency Foundation
Section titled “Connection to In-Memory Tries: The Efficiency Foundation”Cross-reference: This section connects to Chapter 4: From CQL to Disk, which covers the flush pipeline from memtable to SSTable.
BTI’s efficiency is not just an on-disk optimization—it is architecturally aligned with Cassandra 5.0’s in-memory TrieMemtable. Both use byte-comparable keys and trie organisation, but the flush path constructs a new on-disk trie incrementally from the memtable iterator; it is not a raw serialisation of the in-memory trie.
Key alignment points:
-
Identical byte-comparable representation: Both
TrieMemtableand BTI useByteComparable.Version.OSS50for partition key encoding. This means keys in memory are already in the exact format needed for the on-disk trie—no transformation required during flush. -
No sorting pass required:
SkipListMemtable(the compiled-in default factory in Cassandra 5.0 unless overridden incassandra.yaml; seeMemtableParams.java:99) stored data in a structure that required iteration to produce sorted output.TrieMemtableis an opt-in alternative: it stores partition keys in a trie that is inherently sorted by byte-comparable order. TheentryIterator()method walks the trie and emits partitions in exactly the order BTI expects. -
Prefix sharing preserved: The in-memory trie shares prefixes between partition keys (e.g., keys
user:aliceanduser:bobshare theuser:prefix). During flush, theIncrementalTrieWriterconstructs the on-disk trie incrementally and naturally preserves this prefix structure. -
Single-pass incremental construction: The BTI writer receives pre-sorted keys from the memtable trie iterator and builds the
Partitions.dbindex in a single pass usingPartitionIndexBuilder. No buffering, no intermediate structures, no second pass.
Pseudocode illustrating the alignment:
// Flush path (simplified)for entry in memtableTrie.entryIterator(): // Already sorted! key = entry.getKey() // DecoratedKey implements ByteComparable partition = entry.getValue()
position = dataWriter.write(partition) // Write to Data.db partitionIndexBuilder.addEntry(key, position) // Key treated as ByteComparable (OSS50)
// PartitionIndexBuilder internally:// - Computes diff point with previous key (ByteComparable.diffPoint)// - Writes only the shortest unique prefix to trie// - Uses IncrementalTrieWriter for page-aware outputPosition encoding trick: The partition index uses position sign to distinguish pointer types (BtiFormat.md lines 965–971):
- Positive position → points to row index file (
Rows.db) for wide partitions - Negative position (
~dataPosition) → bitwise NOT of directData.dboffset (e.g., position 0 → -1, position 1 → -2)
This encoding eliminates the need for a separate flag field and allows the reader to immediately know whether to consult the row index.
Hash byte (Cassandra 5.0 always present): Every leaf node in the partition trie carries a hash
byte before the position value. This byte holds the lowest 8 bits of the partition key’s filter
hash and allows fast mismatch rejection without reading the full key from disk. When
payloadBits >= 8 (FLAG_HAS_HASH_BYTE = 8), byte 0 at the payload position is the hash byte
and bytes 1–(payloadBits−8+1) encode the position. In Cassandra 5.0 the hash byte is
always written (PartitionIndex.java:131–135;
BtiFormat.md lines 946–963).
Why this matters for performance:
| Aspect | Big Format (classic) | BTI with TrieMemtable |
|---|---|---|
| Key format during flush | May need transformation | Already byte-comparable |
| Sort guarantee | Iterator provides order | Trie iteration is inherently ordered |
| Prefix sharing | None (full keys in Index.db) | Preserved memory→disk |
| Construction passes | Multiple (data, index, summary) | Single incremental pass |
This alignment was intentionally designed as described in the VLDB 2022 paper that introduced TrieMemtable to Cassandra [external citation — not verified against 5.0.8 source].
Where to look in source:
-
TrieMemtable:org.apache.cassandra.db.memtable.TrieMemtable— seegetFlushSet()method (lines 350–403) -
PartitionIndexBuilder:org.apache.cassandra.io.sstable.format.bti.PartitionIndexBuilder— buildsPartitions.dbfrom sorted byte-comparable keys -
IncrementalTrieWriter:org.apache.cassandra.io.tries.IncrementalTrieWriter— incremental trie construction from sorted input -
BTI-specific components:
Partitions.db(partition trie),Rows.db(per-partition clustering trie) -
Common components retained:
Data.db,Statistics.db,TOC.txt,Digest.crc32,CompressionInfo.db(when compressed)
Note: BTI format SSTables do not include
Index.dborSummary.db. These are components of the classicbigformat only. The complete required BTI component set is:Data.db,Partitions.db,Rows.db,Statistics.db,TOC.txt,Digest.crc32, and optionallyCompressionInfo.db(BtiFormat.java:83–102).
Where to look in source:
- Cassandra 5.0.8 (pinned):
org.apache.cassandra.io.sstable.format.bti— see package directoryhttps://github.com/apache/cassandra/tree/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/bti
- Classic big format for comparison:
- Reader:
BigTableReaderhttps://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/big/BigTableReader.java
- Reader:
Sidebar: Version Differences
BTI is a Cassandra 5.x format family. Older releases rely on big plus Index.db/Summary.db. Readers should expect co-existence during upgrades; mixed-format directories are normal during transitions.
Read Amplification and Index Layout
Section titled “Read Amplification and Index Layout”Conceptual contrast (trimmed):
- big:
Index.db(per-partition entries) +Summary.db(sampling) → seek intoData.db - BTI:
Partitions.dbtrie → partition payload; thenRows.dbtrie (within-partition) → row payload inData.db
Illustrative bullets:
- Trie traversal reads less data and requires less processing per lookup than binary-searching sampled summaries; seek counts on a match are equivalent to the
bigformat - Better prefix navigation for wide-partition clustering keys
- Similar Bloom filter role for negative lookups; statistics unchanged
- Mixed deployments are supported; compaction/upgrade can rewrite formats
For implementation walkthroughs of BTI headers and trie navigation, see Appendix C.
Prefix/range navigation (byte-wise example)
Section titled “Prefix/range navigation (byte-wise example)”Consider a composite clustering key (user_id uuid, path text) with UTF-8 collation. BTI’s Rows.db encodes a trie over the byte sequence of the clustering prefix, enabling prefix seeks:
Pseudo (simplified):
advance(trie, prefix_bytes): node = root for b in prefix_bytes: if node.has_child(b): node = node.child(b) else: return node.first_ge_branch(b) return nodeEffectively, prefix seek walks byte-by-byte until divergence, then takes the first branch ≥ the requested byte; this contrasts with BIG’s binary search over sampled entries in Summary.db followed by scans in Index.db.
Trie node type families
Section titled “Trie node type families”Every trie node starts with one byte: high nibble (bits 7–4) = 4-bit node-type ordinal (0–15);
low nibble (bits 3–0) = 4 payload flag bits (pb). Four families
(BtiFormat.md lines 806–877;
TrieNode.java:947–969):
| Family | Ordinals | Description |
|---|---|---|
PAYLOAD_ONLY | 0 | Leaf; no transitions |
SINGLE | 1–4 | One child; 4-/8-/12-/16-bit distance |
SPARSE | 5–9 | Binary-searched byte list; 8- to 40-bit distances |
DENSE | 10–15 | Consecutive byte range; 12- to 64-bit distances (LONG_DENSE catch-all) |
All distances are unsigned and subtracted from the current node position (children are earlier in the file). See Appendix C for complete per-type byte layouts.
Rows.db per-partition footer
Section titled “Rows.db per-partition footer”Each partition’s row index in Rows.db is padded to a 4096-byte page boundary. After the trie
pages the footer contains, in order
(BtiFormat.md lines 977–1010;
TrieIndexEntry.java:92–116):
- Partition key (short-length-prefixed bytes)
- Data file position of the partition start (unsigned vint)
- Root node position: signed vint delta relative to the data file position
- Row count in the partition (unsigned vint)
- Partition deletion time (12 bytes: local deletion time int + marked-for-delete-at long)
The partition index (Partitions.db) points to the position of the partition key bytes (step 1).
Row index granularity
Section titled “Row index granularity”The row index does not index every row—it indexes blocks. The default block size is at least
16 KB of serialised row data, controlled by the column_index_size parameter in
cassandra.yaml. Separator keys (the shortest prefix greater than the last key of the prior
block) are stored, not exact start keys, keeping the index compact
(BtiFormat.md lines 646–653).
Performance Considerations and Benchmark Methodology
Section titled “Performance Considerations and Benchmark Methodology”Note: Provide methodology and harness only; do not claim specific results here.
- Goals:
- Compare point lookup and slice traversal costs for BTI vs BIG across key distributions
- Measure IO and CPU separately where possible (warm vs cold cache runs)
- Dataset: Use
test-data/datasets/test_basicplus synthetic wide-partition variants - Metrics:
- Median/95p latency for partition key lookups and clustering slices
- Hops/steps: trie transitions vs binary-search steps; bytes read from
Index/Partitions/Rows
- Procedure:
- Run N repeated lookups against a fixed corpus; alternate hot/cold cache
- Record OS-level IO and per-query timings; pin CPU governor when possible
- Harness guidance:
- Use consistent datasets and key distributions across formats
- Ensure format detection is bypassed in the hot path to avoid skew
- Report confidence intervals; avoid extrapolating beyond tested sizes
Key Takeaways
Section titled “Key Takeaways”- BTI stands for Big Trie-Indexed; it uses trie indexes (
Partitions.db/Rows.db) and does not includeIndex.dborSummary.db. - Trie traversal reads less data and requires less CPU per lookup than binary-searching sampled summaries; seek counts on a match are equivalent.
- Every partition-index leaf node carries a hash byte (
FLAG_HAS_HASH_BYTE = 8) for fast mismatch rejection—always present in Cassandra 5.0. SkipListMemtableremains the compiled-in default in Cassandra 5.0;TrieMemtableis opt-in viacassandra.yaml.- The row index operates on 16 KB blocks (configurable via
column_index_size), not individual rows. - Bloom filters and statistics continue to guide/guard the read path.
- Mixed-format directories occur during upgrades; readers must detect format. For implementation details, see Appendix C.
References
Section titled “References”- Cassandra 5.0.8 (pinned):
- BTI package:
https://github.com/apache/cassandra/tree/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/bti - Authoritative in-tree spec:
https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md BtiFormat.java(component sets):https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.java#L83PartitionIndex.java(hash byte, position encoding):https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/bti/PartitionIndex.javaTrieNode.java(node types):https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/tries/TrieNode.java#L947PageAware.java(4096-byte page constant):https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/util/PageAware.java#L24MemtableParams.java(default factory):https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/schema/MemtableParams.java#L99- Big format reader (contrast):
https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/big/BigTableReader.java
- BTI package:
For implementation details, see Appendix C.