From CQL to Disk

This chapter traces a CQL mutation from client to durable storage: how it becomes a partition with clustering rows and cells, how memtables and the commit log (WAL) capture it, and how a flush turns memory state into SSTable components. The classic (BIG) format produces Data.db, Index.db, Summary.db, Filter.db, Statistics.db, and CompressionInfo.db. The BTI format (Cassandra 5.0+) replaces Index.db/Summary.db with trie-based Partitions.db/Rows.db.

In this chapter you will learn

How mutations map to partitions, clustering keys, and cells
How memtables and the WAL capture writes before flush
How flush builds Data/Index/Summary/Stats/Filter/CompressionInfo (classic BIG format)
How flush builds Data/Partitions/Rows/Stats/Filter/CompressionInfo (BTI format)
The core steps of the flush pipeline via short pseudocode for both formats

Mutation to Partition Structures

A single CQL mutation is normalized into a partition key, zero or more clustering keys, and one or more cells. Deletes are represented as tombstones (partition-, row-, cell-level) and range tombstones. The on-disk Data.db encodes this with a serialization header derived from schema and a sequence of unfiltered rows and markers.

Memtables and WAL

On write, Cassandra appends to a commit log segment and updates an in-memory memtable for the affected table. The memtable maintains partitions in sorted order, ready for efficient flush to disk. When the memtable reaches its size threshold or a flush trigger fires, Cassandra converts the in-memory structure into immutable SSTable components and discards the memtable.

TrieMemtable: Cassandra 5.0’s Opt-In Trie-Based Memtable

Cassandra 5.0 ships TrieMemtable as an available but non-default implementation. SkipListMemtable remains the compiled default (MemtableParams.DEFAULT_MEMTABLE_FACTORY = SkipListMemtableFactory.INSTANCE, schema/MemtableParams.java:99). To activate TrieMemtable cluster-wide, add to cassandra.yaml:

memtable:
  configurations:
    default:
      class_name: TrieMemtable

When TrieMemtable is configured, the architectural alignment with the BTI on-disk trie format becomes an efficiency advantage — but BTI SSTables can also be produced from SkipList-based flushes, and TrieMemtable can flush to BIG-format SSTables. The flush format is determined by sstable_format, not the memtable implementation.

Key characteristics of TrieMemtable:

Trie-based storage: Partition keys are stored in an InMemoryTrie using byte-comparable representations (ByteComparable.Version.OSS50).
Sharded for concurrency: The trie is sharded across CPU cores, with each shard handling a token range.
Inherently sorted: Trie iteration produces partitions in lexicographic (byte-comparable) order, eliminating any sorting pass during flush.
Prefix sharing: Common prefixes between partition keys are stored only once, reducing memory overhead.
Memory efficiency: Stores trie structure in buffers (optionally off-heap), reducing GC pressure compared to object-heavy skip lists.

Cross-reference: See Chapter 17: BTI Formats for how this in-memory trie structure directly persists to the on-disk BTI index, reducing flush complexity through structural alignment.

Flush Pipeline

Diagram: the flush pipeline from memtable to per-component outputs.

Diagram (Mermaid source): /cqlite/format-guide/diagrams/flush-pipeline
Alt text: Flush pipeline from Memtable/WAL to per-component files with TOC.
Caption: Flush steps create Data.db then derive Index/Summary/Filter/Statistics/CompressionInfo and TOC.txt.

Minimal pseudocode for the classic (BIG format) pipeline:

// Memtable → SSTable (BIG format, simplified)
for each partition in memtable in key order:
  write_partition_to(Data.db)
  record_index_entry(Index.db)
sample_index_for_summary(Summary.db)
build_bloom_over_partition_keys(Filter.db)
collect_stats_and_write(Statistics.db)
write_compression_metadata(CompressionInfo.db)
emit_component_listing(TOC.txt)

BTI Format Flush Pipeline

When the BTI (B-Tree/Trie Indexed) format is active, the flush pipeline differs significantly. Instead of Index.db and Summary.db, the BTI format writes trie-structured index files:

Partitions.db: Partition index trie mapping partition keys to data positions or row index positions
Rows.db: Per-partition row index tries for wide partitions (>16KB of row data by default)

The flush iterates the in-memory trie (Cassandra 5.0’s default memtable structure) and constructs on-disk tries incrementally:

// Memtable → SSTable (BTI format, simplified)
for each partition in memtable trie in byte-comparable order:
  position = write_partition_to(Data.db)

  if partition has row index (wide partition):
    row_index_root = complete_row_trie(Rows.db)
    add_to_partition_trie(key, row_index_position)  // positive offset
  else:
    add_to_partition_trie(key, ~data_position)      // negative offset (direct)

complete_partition_trie(Partitions.db)  // writes footer with first/last keys, count, root
build_bloom_over_partition_keys(Filter.db)
collect_stats_and_write(Statistics.db)
write_compression_metadata(CompressionInfo.db)
emit_component_listing(TOC.txt)

Key BTI flush characteristics:

Incremental trie construction: The PartitionIndexBuilder writes trie nodes to disk as they become complete, keeping only the active path in memory. This enables flushing millions of partitions with minimal memory overhead.
Shortest unique prefix optimization: Partition keys are written with a one-key delay to compute the shortest prefix that distinguishes consecutive keys, minimizing index size.
Row index threshold: Partitions with only one index block (≤column_index_size, default 16KB) skip the row index entirely. The partition trie points directly to Data.db using a negative position encoding (~dataPosition).
No summary file: The trie structure is self-indexing—binary search through a sampled summary is unnecessary because trie traversal provides O(log n) lookup directly.

Cassandra 5.0 writer touchpoints

Classic (BIG) format:

BigTableWriter builds Data/Index/Summary/Statistics and emits TOC.txt

BTI format:

BtiTableWriter builds Data/Partitions/Rows/Statistics and emits TOC.txt
PartitionIndexBuilder incrementally constructs the partition trie, computing shortest unique prefixes
RowIndexWriter constructs per-partition row tries with separator-based block indexing
BtiFormatPartitionWriter tracks row index block boundaries (every 16KB by default)

For a concrete implementation walkthrough of a writer, see Appendix C.

Tiny flush artifact (trimmed, from `test_basic`)

Classic (BIG) format TOC.txt shows the emitted components for one table:

Data.db
Statistics.db
Digest.crc32
TOC.txt
CompressionInfo.db
Filter.db
Index.db
Summary.db

BTI format TOC.txt shows trie-based index components instead:

Data.db
Statistics.db
Digest.crc32
TOC.txt
CompressionInfo.db
Filter.db
Partitions.db
Rows.db

Note that BTI replaces Index.db + Summary.db with Partitions.db + Rows.db. The Rows.db file may be absent or minimal if the table contains only narrow partitions (single index block each).

3.x/4.x vs 5.0:

3.x/4.x use the classic BIG format exclusively (Index.db + Summary.db)
5.0 supports both BIG and BTI formats; BTI is selected via sstable_format in cassandra.yaml
Memtables in 5.0 default to SkipListMemtable; TrieMemtable is opt-in via cassandra.yaml and aligns naturally with BTI’s trie-based indexes

Mixed format deployments:

During upgrades or format transitions, a directory may contain both BIG and BTI SSTables
Readers detect format from component presence (Index.db vs Partitions.db)
Compaction can rewrite SSTables to the configured format

See detailed format differences in Chapters 5, 6, and 17 where component structures diverge.

Key Takeaways

Flush writes immutable components alongside a TOC.txt enumerating what was produced.
Data.db is primary; index components accelerate lookups; Statistics and CompressionInfo guide reads and maintenance.
Classic (BIG) format: Uses Index.db + Summary.db for partition lookup via binary search over sampled entries.
BTI format: Uses Partitions.db + Rows.db (trie-based indexes) for O(log n) lookup without a summary file.
BTI flush iterates the in-memory trie and incrementally constructs on-disk tries with shortest-prefix optimization.
Writers operate in a stable order: write data, derive indexes, then metadata.

References

Cassandra 5.0.8:
- SSTableWriter: org.apache.cassandra.io.sstable.format.SSTableWriter
- Classic (BIG) format:
  - BigTableWriter: org.apache.cassandra.io.sstable.format.big.BigTableWriter
- BTI format:
  - BtiTableWriter: org.apache.cassandra.io.sstable.format.bti.BtiTableWriter
  - PartitionIndexBuilder: org.apache.cassandra.io.sstable.format.bti.PartitionIndexBuilder
  - RowIndexWriter: org.apache.cassandra.io.sstable.format.bti.RowIndexWriter
  - BtiFormatPartitionWriter: org.apache.cassandra.io.sstable.format.bti.BtiFormatPartitionWriter
  - BTI package: org.apache.cassandra.io.sstable.format.bti
- Memtables:
  - TrieMemtable: org.apache.cassandra.db.memtable.TrieMemtable
  - Memtable package: org.apache.cassandra.db.memtable

For implementation details, see Appendix C.

From CQL to Disk

From CQL to Disk

In this chapter you will learn

Mutation to Partition Structures

Memtables and WAL

TrieMemtable: Cassandra 5.0’s Opt-In Trie-Based Memtable

Flush Pipeline

BTI Format Flush Pipeline

Cassandra 5.0 writer touchpoints

Tiny flush artifact (trimmed, from test_basic)

Sidebar: Version Differences and Format Selection

Key Takeaways

References

Tiny flush artifact (trimmed, from `test_basic`)