Anatomy of an SSTable

This chapter explains each SSTable component and how they fit together. Cassandra writes a family of files per SSTable generation; TOC.txt enumerates the components present and serves as the invariant list for integrity checks and lifecycle tooling.

In this chapter you will learn

The role of each component file and how they interact
Versioning and feature flags across formats
How schema and CQL types map to on-disk encodings
How directory layout and TOC establish invariants
Where BTI (5.0) differs from big in component internals

Components Overview

Data.db: The primary row/partition data, optionally compressed in fixed-size chunks.
Index.db: Per-partition index entries mapping the raw partition key (length-prefixed — not a digest) to its Data.db position; may include promoted index data for wide partitions.
Summary.db: A sampled index over Index.db, in decorated-key order, giving a bounded entry point for the Index.db walk.
Filter.db (Bloom): Probabilistic membership filter to skip non-existent partitions.
Statistics.db: SSTable-level metadata (timestamps, histograms, repair/level info, min/max tokens, etc.).
CompressionInfo.db: Algorithm, chunk length, and the map of compressed chunk offsets (and optional per-chunk CRCs) for Data.db.
Digest.crc32: Digest file for end-to-end integrity.
TOC.txt: Text file listing the components present; tools use it to validate completeness.
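
To make the CompressionInfo.db entry above more concrete, here is a minimal sketch of how a chunk length plus a list of compressed chunk offsets can be used to find the chunk holding a given uncompressed position. The function name, the in-memory list of offsets, and the sample values are assumptions for illustration; it also assumes the chunk length counts uncompressed bytes. It is not the on-disk layout of CompressionInfo.db.

```python
from typing import List, Tuple


def locate_chunk(uncompressed_pos: int, chunk_length: int,
                 chunk_offsets: List[int]) -> Tuple[int, int, int]:
    """Return (chunk index, compressed offset of that chunk in Data.db,
    offset of the requested byte inside the decompressed chunk)."""
    if uncompressed_pos < 0:
        raise ValueError("position must be non-negative")
    # Fixed-size chunks: integer division picks the chunk directly.
    index = uncompressed_pos // chunk_length
    if index >= len(chunk_offsets):
        raise IndexError("position is past the last chunk")
    return index, chunk_offsets[index], uncompressed_pos % chunk_length


if __name__ == "__main__":
    # Hypothetical values: 16 KiB chunks and three compressed chunks.
    print(locate_chunk(40000, 16384, [0, 5120, 9876]))
```

Because chunks are fixed-size before compression, the lookup is a single division; only the compressed offsets need to be stored.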

Publication barrier = TOC.txt. See “TOC Invariants and Integrity Checks” below and Chapter 16 (“SSTable Lifecycle and Maintenance”) for a practical checklist and tooling pointers.

Diagram: sstable components and relationships

Alt text: Component diagram showing Data, Index, Summary, Bloom, Stats, CompressionInfo, TOC
Caption: How SSTable components reference each other during reads

Versioning and Feature Flags

SSTable formats evolved over time. In 5.0, BTI (B-Tree/Trie Indexed) coexists with the long-standing big family. Within big the component set is stable, but internal formats and metadata change with version and feature flags; BTI replaces Index.db and Summary.db with Partitions.db and Rows.db. The Descriptor defines the {format} segment and controls feature availability, while StatsMetadata evolves fields used by compaction heuristics and read optimizations. See Chapter 17 for BTI details.

Format Version Evolution (concise)

Scope note: this guide targets Cassandra 5.0 — BIG versions na/nb/oa and BTI da. The “earlier big” column below is context only, not a supported target.

Area	Earlier `big`	5.0 (`big` / BTI)
Partition index	`Index.db`: length-prefixed raw partition key + `Data.db` offset + optional promoted index	`big`: same entry shape (no wire-format change). BTI: no `Index.db` at all — a page-aware trie in `Partitions.db` + `Rows.db`
Summary	`Summary.db`: sampled keys → `Index.db` offsets, in decorated-key order	`big`: unchanged. BTI: no `Summary.db` (the trie needs no sampled entry point)
CompressionInfo	Offsets only (legacy)	Optional per-chunk CRCs; format detection supported
Statistics	Fewer fields	Expanded histograms/repair/level fields

The Index.db entry has never been a “digest list”. Each entry is a length-prefixed raw partition key followed by VInt offsets — no marker byte and no MD5 digest. See Ch.6 and Ch.21 for the byte layout and the history of that documentation error (Issue #552).

Read-path note (algorithm, not format). For big, the entry point at open is Summary.db, not Index.db: a reader loads the summary (and the first/last keys) and walks Index.db only when the summary or Bloom filter must be rebuilt — i.e. when Summary.db is absent, corrupt, or was written with a different min_index_interval (BigSSTableReaderLoadingBuilder.java#L96-L130). A point read then walks one summary interval of Index.db rather than the whole file (BigTableReader.java#L277-L320). This is a property of the reader, not a version feature flag — and it holds only while a usable Summary.db is present. See Ch.6 (“Partition Lookup Flow”) and Ch.10.
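
The bounded walk described above can be sketched as follows. This is a simplified model under stated assumptions, not Cassandra's reader: the summary and index are plain in-memory lists, summary positions are list indices rather than byte offsets, raw byte order stands in for decorated-key order, and the Bloom filter check is omitted. The function name `lookup` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def lookup(key: bytes,
           summary: List[Tuple[bytes, int]],
           index: List[Tuple[bytes, int]]) -> Optional[int]:
    """summary: sorted (sampled key, position of that key in `index`).
    index: sorted (partition key, Data.db offset).
    Returns the Data.db offset, or None if the key is absent."""
    sampled_keys = [k for k, _ in summary]
    # Find the last sample that is <= key; it marks the interval start.
    slot = bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first sampled key
    start = summary[slot][1]
    end = summary[slot + 1][1] if slot + 1 < len(summary) else len(index)
    # Walk only one summary interval of the index, not the whole list.
    for entry_key, data_offset in index[start:end]:
        if entry_key == key:
            return data_offset
        if entry_key > key:
            break
    return None


if __name__ == "__main__":
    index = [(b"a", 0), (b"c", 100), (b"e", 220), (b"g", 300), (b"i", 410)]
    summary = [(b"a", 0), (b"e", 2), (b"i", 4)]  # every second entry sampled
    print(lookup(b"g", summary, index))
```

The summary keeps the scan bounded by the sampling interval, which is why a missing or unusable Summary.db changes the cost of the lookup rather than its result.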

Schema and Type Mapping

On-disk encodings derive from the table schema (partition/clustering keys and column types). Cassandra’s SerializationHeader computes how values are encoded; in this guide we cross-reference concrete mappings in Appendix A (types) and Appendix B (encodings cheat sheet). For an implementation walkthrough, see Appendix C.

Directory Listing Example

Trimmed listing from test_basic/simple_table (one SSTable generation):

nb-1-big-CompressionInfo.db
nb-1-big-Data.db
nb-1-big-Digest.crc32
nb-1-big-Filter.db
nb-1-big-Index.db
nb-1-big-Statistics.db
nb-1-big-Summary.db
nb-1-big-TOC.txt

Note: Cassandra 5.0 nodes write oa-…-big-* for new SSTables (version oa is current for the BIG format in 5.0; BigFormat.java:343). The nb-… prefix above is valid for SSTables written by pre-5.0 nodes or during a rolling upgrade.
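
The filenames above follow a version-generation-format-component pattern, which a tool can split apart as in this sketch. The regular expression and the `SSTableName` type are assumptions made for illustration; in particular, the generation is treated as an opaque token without hyphens, which matches the numeric examples in this chapter but has not been checked against every identifier style Cassandra can emit.

```python
import re
from typing import NamedTuple


class SSTableName(NamedTuple):
    version: str      # e.g. "nb", "oa", "da"
    generation: str   # kept as text; numeric in this chapter's examples
    fmt: str          # "big" or "bti"
    component: str    # e.g. "Data.db", "TOC.txt"


_NAME = re.compile(
    r"^(?P<version>[a-z]{2})-(?P<generation>[^-]+)-(?P<fmt>[a-z]+)-(?P<component>.+)$"
)


def parse_component_name(filename: str) -> SSTableName:
    """Split '<version>-<generation>-<format>-<component>' into its parts."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not an SSTable component name: {filename!r}")
    return SSTableName(**match.groupdict())


if __name__ == "__main__":
    print(parse_component_name("nb-1-big-Data.db"))
```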

The TOC.txt inside the same directory confirms the components present:

Data.db
Statistics.db
Digest.crc32
TOC.txt
CompressionInfo.db
Filter.db
Index.db
Summary.db

BTI directory example (5.0)

When BTI is enabled, a generation includes BTI-specific components alongside common ones. During upgrades, directories may contain both BIG and BTI generations. Real filenames (trimmed):

da-3-bti-Data.db
da-3-bti-Partitions.db
da-3-bti-Rows.db
da-3-bti-Statistics.db
da-3-bti-TOC.txt
da-3-bti-Digest.crc32

See the BTI package for details: org.apache.cassandra.io.sstable.format.bti.

Note: BTI only ever uses version da (BtiFormat.java:289: current_version = "da"). Version na is a BIG-format version letter and never appears in BTI filenames.
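
Since a directory can hold both BIG and BTI generations during an upgrade, tooling may want to group files by generation and reject version/format pairings that cannot occur. The sketch below does that for the versions this guide targets (na/nb/oa for big, da for bti). The table `KNOWN_VERSIONS` and the function name are hypothetical, and the simple hyphen split assumes generation identifiers contain no hyphens.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Version letters in scope for this guide, keyed by format name.
KNOWN_VERSIONS = {"big": {"na", "nb", "oa"}, "bti": {"da"}}


def group_generations(filenames: Iterable[str]) -> Dict[Tuple[str, str, str], List[str]]:
    """Group component names by (version, generation, format)."""
    groups: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for name in filenames:
        version, generation, fmt, component = name.split("-", 3)
        if version not in KNOWN_VERSIONS.get(fmt, set()):
            raise ValueError(f"version {version!r} is not valid for format {fmt!r}: {name}")
        groups[(version, generation, fmt)].append(component)
    return dict(groups)


if __name__ == "__main__":
    listing = ["oa-2-big-Data.db", "da-3-bti-Data.db", "da-3-bti-Rows.db"]
    for key, components in group_generations(listing).items():
        print(key, components)
```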

TOC Invariants and Integrity Checks

TOC.txt is authoritative: tools validate that every listed component exists and that unexpected files do not appear. Integrity checks commonly include:

Presence: Each required component listed in TOC.txt exists on disk
Consistency: Statistics.db and CompressionInfo.db fields align with observed file sizes and counts
Cross-component alignment: Index.db positions must resolve into valid Data.db boundaries; Summary.db samples must be sorted and within token range
Digest validation: Digest.crc32 matches computed digests over the appropriate component payloads
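
The presence check in the list above is the simplest of these to automate. The following sketch reads a TOC.txt and reports listed components that have no matching file next to it. The function name is hypothetical, and the sketch assumes one component name per line and that every component shares the TOC file's filename prefix; the consistency, alignment, and digest checks are out of scope here.

```python
from pathlib import Path
from typing import List


def missing_components(toc_path: Path) -> List[str]:
    """Return components listed in TOC.txt that are absent on disk."""
    suffix = "TOC.txt"
    if not toc_path.name.endswith(suffix):
        raise ValueError(f"not a TOC file: {toc_path.name}")
    # "nb-1-big-TOC.txt" -> prefix "nb-1-big-"
    prefix = toc_path.name[: -len(suffix)]
    listed = [line.strip() for line in toc_path.read_text().splitlines() if line.strip()]
    return [name for name in listed
            if not (toc_path.parent / (prefix + name)).exists()]
```

An empty result means every listed component is present; it says nothing about whether the files are internally valid.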

TOC.txt Canonical Component Ordering

While component order in TOC.txt does not affect functionality, CQLite writes components in a fixed order for consistency. That order is:

Data.db - Primary partition and row data
Statistics.db - SSTable-level metadata
Digest.crc32 - Integrity checksums
TOC.txt - Table of contents (self-referential)
CompressionInfo.db - Compression chunk metadata
Filter.db - Bloom filter
Index.db - Partition index
Summary.db - Sampled index for acceleration

This ordering matches the implementation in CQLite’s TocWriter. Cassandra’s source does not enforce a canonical TOC line ordering; the order above is CQLite’s own convention.

TOC.txt Self-Inclusion

Critical requirement: TOC.txt must list itself as a component. This self-referential entry ensures:

Integrity tools can verify all components, including the TOC
Completeness validation accounts for the full component set
Deletion/archival operations have a complete manifest

Writers automatically add TOC.txt to the component list if not explicitly included.
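
A small sketch of both rules, ordering and self-inclusion, is given below. It is an illustration written for this chapter, not CQLite's TocWriter: the names `CANONICAL_ORDER` and `build_toc` are hypothetical, and placing components outside the listed order (such as BTI's Partitions.db and Rows.db) alphabetically at the end is an assumption of the sketch.

```python
from typing import Iterable, List

# The ordering convention listed earlier in this section.
CANONICAL_ORDER = [
    "Data.db", "Statistics.db", "Digest.crc32", "TOC.txt",
    "CompressionInfo.db", "Filter.db", "Index.db", "Summary.db",
]


def build_toc(components: Iterable[str]) -> List[str]:
    """Return TOC lines: de-duplicated, ordered, and always listing TOC.txt."""
    present = set(components) | {"TOC.txt"}   # self-inclusion
    known = [name for name in CANONICAL_ORDER if name in present]
    extras = sorted(present - set(CANONICAL_ORDER))
    return known + extras


if __name__ == "__main__":
    print(build_toc(["Index.db", "Data.db"]))
```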

Publication Barrier

TOC.txt serves as the publication barrier for SSTable atomicity. An SSTable is not considered complete or visible until TOC.txt exists on disk. This ensures:

Atomic Visibility: Readers can detect incomplete writes by checking for TOC.txt presence. If the file is missing, the SSTable generation is ignored.

Crash Safety: If a node crashes during SSTable write, partial components without a corresponding TOC.txt are discarded during restart/recovery. Only fully-written SSTables (with TOC) are loaded.

Write Order Guarantee: TOC.txt MUST be written LAST, after all other components have been flushed and synced to disk. This ordering ensures no reader observes a partial SSTable.

Implementation note: Writers use fsync() on each component before writing TOC.txt, then fsync() the TOC itself to guarantee durability.
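
The write-then-sync pattern from the implementation note can be sketched as below. This is a minimal illustration, not Cassandra's or CQLite's writer: `publish_sstable` and `_write_synced` are hypothetical names, component payloads are passed in as complete byte strings, and the sketch does not sync the containing directory, which a production writer on POSIX systems may also need to do.

```python
import os
from pathlib import Path
from typing import Dict


def _write_synced(path: Path, payload: bytes) -> None:
    """Write a file and force it to stable storage before returning."""
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())


def publish_sstable(directory: Path, prefix: str,
                    components: Dict[str, bytes]) -> Path:
    """Write every component, then write TOC.txt last as the barrier."""
    names = [name for name in components if name != "TOC.txt"]
    for name in names:
        _write_synced(directory / f"{prefix}{name}", components[name])
    # Only after all components are durable does the TOC appear.
    toc_path = directory / f"{prefix}TOC.txt"
    toc_body = "\n".join(names + ["TOC.txt"]) + "\n"
    _write_synced(toc_path, toc_body.encode("ascii"))
    return toc_path
```

If the process stops before the final call, the directory holds components but no TOC.txt, which is exactly the state a reader is expected to ignore.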

Write Order Dependencies

SSTable components have internal dependencies that dictate write ordering beyond the final TOC barrier:

Data.db first via flush: Partition rows are streamed to disk first. Statistics accumulate during the write.
Data.db + Index.db: Written together during the flush pass. Index.db entries reference Data.db byte offsets, so Data.db chunks must be flushed before the corresponding Index.db entries.
Summary.db: Samples Index.db entries, so Index.db must be complete before Summary generation.
Filter.db: Built from partition keys during the Data.db write; can be finalized once all partitions are known.
CompressionInfo.db: Tracks compressed chunk boundaries in Data.db; written as Data.db chunks are compressed.
Digest.crc32: Checksums over finalized components, written after all data components are complete.
Statistics.db: Written in doPrepare() after Data.db is complete (SSTableWriter.java:384–392), using finalized metadata. It is serialized just before TOC.txt, not before Data.db.
TOC.txt LAST: Publication barrier, written after all components are flushed and synced.

Violating these dependencies results in corrupted SSTables with invalid offsets, missing metadata, or incomplete indexes.
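
One way to guard against such violations in a test harness is to record the order in which components are finalized and check it against the dependencies above. The sketch below models only completion order (Data.db and Index.db are written together, so interleaving during the flush pass is not represented). The `MUST_FOLLOW` table and the function name are hypothetical and simply restate the list in this section.

```python
from typing import Dict, List, Set

# component -> components that must be finalized before it
MUST_FOLLOW: Dict[str, Set[str]] = {
    "Index.db": {"Data.db"},
    "Summary.db": {"Index.db"},
    "Filter.db": {"Data.db"},
    "CompressionInfo.db": {"Data.db"},
    "Digest.crc32": {"Data.db"},
    "Statistics.db": {"Data.db"},
}


def order_violations(finalized_order: List[str]) -> List[str]:
    """Return human-readable problems found in a finalization order."""
    position = {name: i for i, name in enumerate(finalized_order)}
    problems: List[str] = []
    for component, dependencies in MUST_FOLLOW.items():
        for dependency in sorted(dependencies):
            if (component in position and dependency in position
                    and position[component] < position[dependency]):
                problems.append(f"{component} finalized before {dependency}")
    if "TOC.txt" in position and position["TOC.txt"] != len(finalized_order) - 1:
        problems.append("TOC.txt must be finalized last")
    return problems


if __name__ == "__main__":
    print(order_violations(["Data.db", "Summary.db", "Index.db", "TOC.txt"]))
```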

File family remains multi-component; feature flags and index internals differ
Descriptor format names are big and bti only (BigFormat.java:75, BtiFormat.java:64). mc, mm, nb, and oa are version letters within big, not format names. 5.0 pairs big (version oa) with BTI (version da).
Statistics fields expanded over time; tooling output formatting changed subtly

Key Takeaways

TOC.txt is authoritative for the component set in a given SSTable
Summary.db samples Index.db to accelerate seeks; Bloom reduces unnecessary IO
CompressionInfo.db is required to read compressed Data.db
Version changes within big do not remove the core components, but they affect internal structure; BTI replaces Index.db and Summary.db with Partitions.db and Rows.db

References

Cassandra 5.0.8 (pinned):
- Descriptor — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/Descriptor.java
- StatsMetadata — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/metadata/StatsMetadata.java
- IndexSummary — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/IndexSummary.java
- SSTableWriter.doPrepare() (Statistics.db write timing) — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/SSTableWriter.java#L384-L394
- BigFormat (current_version = “oa”) — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java#L343
- BtiFormat (current_version = “da”) — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.java#L289-L290

Reference (diagram source):

%% SSTable component relationship diagram (stub)
flowchart TD
  A[Memtable] -->|Flush| B[Data.db]
  A -->|Flush| C[Index.db]
  A -->|Flush| D[Summary.db]
  A -->|Flush| E[Filter.db]
  A -->|Flush| F[Statistics.db]
  A -->|Flush| G[CompressionInfo.db]

For implementation details, see Appendix C.
