From CQL to Disk
From CQL to Disk
Section titled “From CQL to Disk”This chapter traces a CQL mutation from client to durable storage: how it becomes a partition with clustering rows and cells, how memtables and the commit log (WAL) capture it, and how a flush turns memory state into SSTable components. The classic (BIG) format produces Data.db, Index.db, Summary.db, Filter.db, Statistics.db, and CompressionInfo.db. The BTI format (Cassandra 5.0+) replaces Index.db/Summary.db with trie-based Partitions.db/Rows.db.
In this chapter you will learn
Section titled “In this chapter you will learn”- How mutations map to partitions, clustering keys, and cells
- How memtables and the WAL capture writes before flush
- How flush builds Data/Index/Summary/Stats/Filter/CompressionInfo (classic BIG format)
- How flush builds Data/Partitions/Rows/Stats/Filter/CompressionInfo (BTI format)
- The core steps of the flush pipeline via short pseudocode for both formats
Mutation to Partition Structures
Section titled “Mutation to Partition Structures”A single CQL mutation is normalized into a partition key, zero or more clustering keys, and one or more cells. Deletes are represented as tombstones (partition-, row-, cell-level) and range tombstones. The on-disk Data.db encodes this with a serialization header derived from schema and a sequence of unfiltered rows and markers.
Memtables and WAL
Section titled “Memtables and WAL”On write, Cassandra appends to a commit log segment and updates an in-memory memtable for the affected table. The memtable maintains partitions in sorted order, ready for efficient flush to disk. When the memtable reaches its size threshold or a flush trigger fires, Cassandra converts the in-memory structure into immutable SSTable components and discards the memtable.
TrieMemtable: Cassandra 5.0’s Opt-In Trie-Based Memtable
Section titled “TrieMemtable: Cassandra 5.0’s Opt-In Trie-Based Memtable”Cassandra 5.0 ships TrieMemtable as an available but non-default implementation.
SkipListMemtable remains the compiled default
(MemtableParams.DEFAULT_MEMTABLE_FACTORY = SkipListMemtableFactory.INSTANCE,
schema/MemtableParams.java:99).
To activate TrieMemtable cluster-wide, add to cassandra.yaml:
memtable: configurations: default: class_name: TrieMemtableWhen TrieMemtable is configured, the architectural alignment with the BTI on-disk trie format
becomes an efficiency advantage — but BTI SSTables can also be produced from SkipList-based
flushes, and TrieMemtable can flush to BIG-format SSTables. The flush format is determined by
sstable_format, not the memtable implementation.
Key characteristics of TrieMemtable:
- Trie-based storage: Partition keys are stored in an
InMemoryTrieusing byte-comparable representations (ByteComparable.Version.OSS50). - Sharded for concurrency: The trie is sharded across CPU cores, with each shard handling a token range.
- Inherently sorted: Trie iteration produces partitions in lexicographic (byte-comparable) order, eliminating any sorting pass during flush.
- Prefix sharing: Common prefixes between partition keys are stored only once, reducing memory overhead.
- Memory efficiency: Stores trie structure in buffers (optionally off-heap), reducing GC pressure compared to object-heavy skip lists.
Cross-reference: See Chapter 17: BTI Formats for how this in-memory trie structure directly persists to the on-disk BTI index, reducing flush complexity through structural alignment.
Flush Pipeline
Section titled “Flush Pipeline”Diagram: the flush pipeline from memtable to per-component outputs.
- Diagram (Mermaid source):
/cqlite/format-guide/diagrams/flush-pipeline - Alt text: Flush pipeline from Memtable/WAL to per-component files with TOC.
- Caption: Flush steps create
Data.dbthen deriveIndex/Summary/Filter/Statistics/CompressionInfoandTOC.txt.
Minimal pseudocode for the classic (BIG format) pipeline:
// Memtable → SSTable (BIG format, simplified)for each partition in memtable in key order: write_partition_to(Data.db) record_index_entry(Index.db)sample_index_for_summary(Summary.db)build_bloom_over_partition_keys(Filter.db)collect_stats_and_write(Statistics.db)write_compression_metadata(CompressionInfo.db)emit_component_listing(TOC.txt)BTI Format Flush Pipeline
Section titled “BTI Format Flush Pipeline”When the BTI (B-Tree/Trie Indexed) format is active, the flush pipeline differs significantly. Instead of Index.db and Summary.db, the BTI format writes trie-structured index files:
Partitions.db: Partition index trie mapping partition keys to data positions or row index positionsRows.db: Per-partition row index tries for wide partitions (>16KB of row data by default)
The flush iterates the in-memory trie (Cassandra 5.0’s default memtable structure) and constructs on-disk tries incrementally:
// Memtable → SSTable (BTI format, simplified)for each partition in memtable trie in byte-comparable order: position = write_partition_to(Data.db)
if partition has row index (wide partition): row_index_root = complete_row_trie(Rows.db) add_to_partition_trie(key, row_index_position) // positive offset else: add_to_partition_trie(key, ~data_position) // negative offset (direct)
complete_partition_trie(Partitions.db) // writes footer with first/last keys, count, rootbuild_bloom_over_partition_keys(Filter.db)collect_stats_and_write(Statistics.db)write_compression_metadata(CompressionInfo.db)emit_component_listing(TOC.txt)Key BTI flush characteristics:
-
Incremental trie construction: The
PartitionIndexBuilderwrites trie nodes to disk as they become complete, keeping only the active path in memory. This enables flushing millions of partitions with minimal memory overhead. -
Shortest unique prefix optimization: Partition keys are written with a one-key delay to compute the shortest prefix that distinguishes consecutive keys, minimizing index size.
-
Row index threshold: Partitions with only one index block (≤
column_index_size, default 16KB) skip the row index entirely. The partition trie points directly toData.dbusing a negative position encoding (~dataPosition). -
No summary file: The trie structure is self-indexing—binary search through a sampled summary is unnecessary because trie traversal provides O(log n) lookup directly.
Cassandra 5.0 writer touchpoints
Section titled “Cassandra 5.0 writer touchpoints”Classic (BIG) format:
BigTableWriterbuildsData/Index/Summary/Statisticsand emitsTOC.txt
BTI format:
BtiTableWriterbuildsData/Partitions/Rows/Statisticsand emitsTOC.txtPartitionIndexBuilderincrementally constructs the partition trie, computing shortest unique prefixesRowIndexWriterconstructs per-partition row tries with separator-based block indexingBtiFormatPartitionWritertracks row index block boundaries (every 16KB by default)
For a concrete implementation walkthrough of a writer, see Appendix C.
Tiny flush artifact (trimmed, from test_basic)
Section titled “Tiny flush artifact (trimmed, from test_basic)”Classic (BIG) format TOC.txt shows the emitted components for one table:
Data.dbStatistics.dbDigest.crc32TOC.txtCompressionInfo.dbFilter.dbIndex.dbSummary.dbBTI format TOC.txt shows trie-based index components instead:
Data.dbStatistics.dbDigest.crc32TOC.txtCompressionInfo.dbFilter.dbPartitions.dbRows.dbNote that BTI replaces Index.db + Summary.db with Partitions.db + Rows.db. The Rows.db file may be absent or minimal if the table contains only narrow partitions (single index block each).
Sidebar: Version Differences and Format Selection
Section titled “Sidebar: Version Differences and Format Selection”3.x/4.x vs 5.0:
- 3.x/4.x use the classic BIG format exclusively (
Index.db+Summary.db) - 5.0 supports both BIG and BTI formats; BTI is selected via
sstable_formatincassandra.yaml - Memtables in 5.0 default to
SkipListMemtable;TrieMemtableis opt-in viacassandra.yamland aligns naturally with BTI’s trie-based indexes
Mixed format deployments:
- During upgrades or format transitions, a directory may contain both BIG and BTI SSTables
- Readers detect format from component presence (
Index.dbvsPartitions.db) - Compaction can rewrite SSTables to the configured format
See detailed format differences in Chapters 5, 6, and 17 where component structures diverge.
Key Takeaways
Section titled “Key Takeaways”- Flush writes immutable components alongside a
TOC.txtenumerating what was produced. Data.dbis primary; index components accelerate lookups;StatisticsandCompressionInfoguide reads and maintenance.- Classic (BIG) format: Uses
Index.db+Summary.dbfor partition lookup via binary search over sampled entries. - BTI format: Uses
Partitions.db+Rows.db(trie-based indexes) for O(log n) lookup without a summary file. - BTI flush iterates the in-memory trie and incrementally constructs on-disk tries with shortest-prefix optimization.
- Writers operate in a stable order: write data, derive indexes, then metadata.
References
Section titled “References”- Cassandra 5.0.8:
SSTableWriter: org.apache.cassandra.io.sstable.format.SSTableWriter- Classic (BIG) format:
BigTableWriter: org.apache.cassandra.io.sstable.format.big.BigTableWriter
- BTI format:
BtiTableWriter: org.apache.cassandra.io.sstable.format.bti.BtiTableWriterPartitionIndexBuilder: org.apache.cassandra.io.sstable.format.bti.PartitionIndexBuilderRowIndexWriter: org.apache.cassandra.io.sstable.format.bti.RowIndexWriterBtiFormatPartitionWriter: org.apache.cassandra.io.sstable.format.bti.BtiFormatPartitionWriter- BTI package: org.apache.cassandra.io.sstable.format.bti
- Memtables:
TrieMemtable: org.apache.cassandra.db.memtable.TrieMemtable- Memtable package: org.apache.cassandra.db.memtable
For implementation details, see Appendix C.