Skip to content

From CQL to Disk

This chapter traces a CQL mutation from client to durable storage: how it becomes a partition with clustering rows and cells, how memtables and the commit log (WAL) capture it, and how a flush turns memory state into SSTable components. The classic (BIG) format produces Data.db, Index.db, Summary.db, Filter.db, Statistics.db, and CompressionInfo.db. The BTI format (Cassandra 5.0+) replaces Index.db/Summary.db with trie-based Partitions.db/Rows.db.

  • How mutations map to partitions, clustering keys, and cells
  • How memtables and the WAL capture writes before flush
  • How flush builds Data/Index/Summary/Stats/Filter/CompressionInfo (classic BIG format)
  • How flush builds Data/Partitions/Rows/Stats/Filter/CompressionInfo (BTI format)
  • The core steps of the flush pipeline via short pseudocode for both formats

A single CQL mutation is normalized into a partition key, zero or more clustering keys, and one or more cells. Deletes are represented as tombstones (partition-, row-, cell-level) and range tombstones. The on-disk Data.db encodes this with a serialization header derived from schema and a sequence of unfiltered rows and markers.

On write, Cassandra appends to a commit log segment and updates an in-memory memtable for the affected table. The memtable maintains partitions in sorted order, ready for efficient flush to disk. When the memtable reaches its size threshold or a flush trigger fires, Cassandra converts the in-memory structure into immutable SSTable components and discards the memtable.

TrieMemtable: Cassandra 5.0’s Opt-In Trie-Based Memtable

Section titled “TrieMemtable: Cassandra 5.0’s Opt-In Trie-Based Memtable”

Cassandra 5.0 ships TrieMemtable as an available but non-default implementation. SkipListMemtable remains the compiled default (MemtableParams.DEFAULT_MEMTABLE_FACTORY = SkipListMemtableFactory.INSTANCE, schema/MemtableParams.java:99). To activate TrieMemtable cluster-wide, add to cassandra.yaml:

memtable:
configurations:
default:
class_name: TrieMemtable

When TrieMemtable is configured, the architectural alignment with the BTI on-disk trie format becomes an efficiency advantage — but BTI SSTables can also be produced from SkipList-based flushes, and TrieMemtable can flush to BIG-format SSTables. The flush format is determined by sstable_format, not the memtable implementation.

Key characteristics of TrieMemtable:

  • Trie-based storage: Partition keys are stored in an InMemoryTrie using byte-comparable representations (ByteComparable.Version.OSS50).
  • Sharded for concurrency: The trie is sharded across CPU cores, with each shard handling a token range.
  • Inherently sorted: Trie iteration produces partitions in lexicographic (byte-comparable) order, eliminating any sorting pass during flush.
  • Prefix sharing: Common prefixes between partition keys are stored only once, reducing memory overhead.
  • Memory efficiency: Stores trie structure in buffers (optionally off-heap), reducing GC pressure compared to object-heavy skip lists.

Cross-reference: See Chapter 17: BTI Formats for how this in-memory trie structure directly persists to the on-disk BTI index, reducing flush complexity through structural alignment.

Diagram: the flush pipeline from memtable to per-component outputs.

  • Diagram (Mermaid source): /cqlite/format-guide/diagrams/flush-pipeline
  • Alt text: Flush pipeline from Memtable/WAL to per-component files with TOC.
  • Caption: Flush steps create Data.db then derive Index/Summary/Filter/Statistics/CompressionInfo and TOC.txt.

Minimal pseudocode for the classic (BIG format) pipeline:

// Memtable → SSTable (BIG format, simplified)
for each partition in memtable in key order:
write_partition_to(Data.db)
record_index_entry(Index.db)
sample_index_for_summary(Summary.db)
build_bloom_over_partition_keys(Filter.db)
collect_stats_and_write(Statistics.db)
write_compression_metadata(CompressionInfo.db)
emit_component_listing(TOC.txt)

When the BTI (B-Tree/Trie Indexed) format is active, the flush pipeline differs significantly. Instead of Index.db and Summary.db, the BTI format writes trie-structured index files:

  • Partitions.db: Partition index trie mapping partition keys to data positions or row index positions
  • Rows.db: Per-partition row index tries for wide partitions (>16KB of row data by default)

The flush iterates the in-memory trie (Cassandra 5.0’s default memtable structure) and constructs on-disk tries incrementally:

// Memtable → SSTable (BTI format, simplified)
for each partition in memtable trie in byte-comparable order:
position = write_partition_to(Data.db)
if partition has row index (wide partition):
row_index_root = complete_row_trie(Rows.db)
add_to_partition_trie(key, row_index_position) // positive offset
else:
add_to_partition_trie(key, ~data_position) // negative offset (direct)
complete_partition_trie(Partitions.db) // writes footer with first/last keys, count, root
build_bloom_over_partition_keys(Filter.db)
collect_stats_and_write(Statistics.db)
write_compression_metadata(CompressionInfo.db)
emit_component_listing(TOC.txt)

Key BTI flush characteristics:

  • Incremental trie construction: The PartitionIndexBuilder writes trie nodes to disk as they become complete, keeping only the active path in memory. This enables flushing millions of partitions with minimal memory overhead.

  • Shortest unique prefix optimization: Partition keys are written with a one-key delay to compute the shortest prefix that distinguishes consecutive keys, minimizing index size.

  • Row index threshold: Partitions with only one index block (≤column_index_size, default 16KB) skip the row index entirely. The partition trie points directly to Data.db using a negative position encoding (~dataPosition).

  • No summary file: The trie structure is self-indexing—binary search through a sampled summary is unnecessary because trie traversal provides O(log n) lookup directly.

Classic (BIG) format:

  • BigTableWriter builds Data/Index/Summary/Statistics and emits TOC.txt

BTI format:

  • BtiTableWriter builds Data/Partitions/Rows/Statistics and emits TOC.txt
  • PartitionIndexBuilder incrementally constructs the partition trie, computing shortest unique prefixes
  • RowIndexWriter constructs per-partition row tries with separator-based block indexing
  • BtiFormatPartitionWriter tracks row index block boundaries (every 16KB by default)

For a concrete implementation walkthrough of a writer, see Appendix C.

Tiny flush artifact (trimmed, from test_basic)

Section titled “Tiny flush artifact (trimmed, from test_basic)”

Classic (BIG) format TOC.txt shows the emitted components for one table:

Data.db
Statistics.db
Digest.crc32
TOC.txt
CompressionInfo.db
Filter.db
Index.db
Summary.db

BTI format TOC.txt shows trie-based index components instead:

Data.db
Statistics.db
Digest.crc32
TOC.txt
CompressionInfo.db
Filter.db
Partitions.db
Rows.db

Note that BTI replaces Index.db + Summary.db with Partitions.db + Rows.db. The Rows.db file may be absent or minimal if the table contains only narrow partitions (single index block each).

Section titled “Sidebar: Version Differences and Format Selection”

3.x/4.x vs 5.0:

  • 3.x/4.x use the classic BIG format exclusively (Index.db + Summary.db)
  • 5.0 supports both BIG and BTI formats; BTI is selected via sstable_format in cassandra.yaml
  • Memtables in 5.0 default to SkipListMemtable; TrieMemtable is opt-in via cassandra.yaml and aligns naturally with BTI’s trie-based indexes

Mixed format deployments:

  • During upgrades or format transitions, a directory may contain both BIG and BTI SSTables
  • Readers detect format from component presence (Index.db vs Partitions.db)
  • Compaction can rewrite SSTables to the configured format

See detailed format differences in Chapters 5, 6, and 17 where component structures diverge.

  • Flush writes immutable components alongside a TOC.txt enumerating what was produced.
  • Data.db is primary; index components accelerate lookups; Statistics and CompressionInfo guide reads and maintenance.
  • Classic (BIG) format: Uses Index.db + Summary.db for partition lookup via binary search over sampled entries.
  • BTI format: Uses Partitions.db + Rows.db (trie-based indexes) for O(log n) lookup without a summary file.
  • BTI flush iterates the in-memory trie and incrementally constructs on-disk tries with shortest-prefix optimization.
  • Writers operate in a stable order: write data, derive indexes, then metadata.

For implementation details, see Appendix C.