Anatomy of an SSTable

This chapter explains each SSTable component and how they fit together. Cassandra writes a family of files per SSTable generation; TOC.txt enumerates the components present and serves as the invariant list for integrity checks and lifecycle tooling.

  • The role of each component file and how they interact
  • Versioning and feature flags across formats
  • How schema and CQL types map to on-disk encodings
  • How directory layout and TOC establish invariants
  • Where BTI (5.0) differs from big in component internals

The components of a big-format SSTable generation are:

  • Data.db: The primary row/partition data, optionally compressed in fixed-size chunks.
  • Index.db: Per-partition index entries mapping partition keys to Data.db positions; may include promoted index data for wide partitions.
  • Summary.db: A sampled index over Index.db entries that accelerates binary search into Index.db.
  • Filter.db (Bloom): Probabilistic membership filter to skip non-existent partitions.
  • Statistics.db: SSTable-level metadata (timestamps, histograms, repair/level info, min/max tokens, etc.).
  • CompressionInfo.db: Algorithm, chunk length, and the map of compressed chunk offsets (and optional per-chunk CRCs) for Data.db.
  • Digest.crc32: Digest file for end-to-end integrity.
  • TOC.txt: Text file listing the components present; tools use it to validate completeness.
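
To make the role of CompressionInfo.db concrete, here is a minimal sketch of how a reader could turn an uncompressed Data.db position into a compressed chunk span. The function name, the in-memory offset list, and the sample values are hypothetical; the sketch assumes only what the list above states (a fixed chunk length plus a map of compressed chunk offsets) and does not parse the real file format.

```python
def chunk_for(position: int, chunk_length: int,
              chunk_offsets: list[int], compressed_length: int) -> tuple[int, int]:
    """Return (compressed_offset, on_disk_span) for the chunk that holds
    the given uncompressed position.

    chunk_offsets[i] is the compressed-file offset of chunk i (hypothetical
    in-memory form of the offset map). The span is the distance to the next
    chunk's offset, so it may include per-chunk checksum bytes if the format
    stores them next to the chunk.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    index = position // chunk_length  # chunks cover fixed-size uncompressed ranges
    if index >= len(chunk_offsets):
        raise ValueError("position lies beyond the last chunk")
    start = chunk_offsets[index]
    if index + 1 < len(chunk_offsets):
        end = chunk_offsets[index + 1]
    else:
        end = compressed_length  # last chunk runs to the end of the file
    return start, end - start


# Hypothetical values: three chunks of 64 KiB uncompressed data each.
offsets = [0, 100, 250]
print(chunk_for(70_000, 65_536, offsets, compressed_length=300))
```

Because the chunk length is fixed in uncompressed terms, the lookup is a single integer division followed by an array access, which is why a reader cannot locate data in a compressed Data.db without this component.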

Publication barrier = TOC.txt. See “TOC Invariants and Integrity Checks” below and Chapter 16 (“SSTable Lifecycle and Maintenance”) for a practical checklist and tooling pointers.

Diagram: SSTable components and relationships

  • Alt text: Component diagram showing Data, Index, Summary, Bloom, Stats, CompressionInfo, TOC SSTable components and relationships
  • Caption: How SSTable components reference each other during reads

SSTable formats evolved over time. In 5.0, BTI (B-Tree/Trie Indexed) coexists with the long-standing big family. The component set is stable, but internal formats and metadata change with version/feature flags. The Descriptor defines the {format} segment and controls feature availability, while StatsMetadata evolves fields used by compaction heuristics and read optimizations. See Chapter 17 for BTI details.

| Area | Pre-5.x (big) | 5.0 (BTI/big) |
|---|---|---|
| Index entry | Simple digest list; promoted index optional | BTI adjusts indexing structures; promoted index layout may differ |
| Summary | Sampling rate + offsets | Token-sorted with explicit index offsets |
| CompressionInfo | Offsets only (legacy) | Optional per-chunk CRCs; format detection supported |
| Statistics | Fewer fields | Expanded histograms/repair/level fields |

On-disk encodings derive from the table schema (partition/clustering keys and column types). Cassandra’s SerializationHeader computes how values are encoded; in this guide we cross-reference concrete mappings in Appendix A (types) and Appendix B (encodings cheat sheet). For an implementation walkthrough, see Appendix C.

Trimmed listing from test_basic/simple_table (one SSTable generation):

nb-1-big-CompressionInfo.db
nb-1-big-Data.db
nb-1-big-Digest.crc32
nb-1-big-Filter.db
nb-1-big-Index.db
nb-1-big-Statistics.db
nb-1-big-Summary.db
nb-1-big-TOC.txt

Note: Cassandra 5.0 nodes write oa-…-big-* for new SSTables (version oa is current for the BIG format in 5.0; BigFormat.java:343). The nb-… prefix above is valid for SSTables written by pre-5.0 nodes or during a rolling upgrade.
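
The filenames above follow a version-generation-format-component pattern. The following sketch splits a name into those four parts; the `SSTableFile` type and `parse_sstable_filename` function are hypothetical helpers, and the code assumes only the layout visible in the listing (it does not cover older or keyspace-prefixed naming schemes).

```python
from typing import NamedTuple


class SSTableFile(NamedTuple):
    version: str      # e.g. "nb" or "oa"
    generation: str   # kept as text; the listing shows a numeric id
    fmt: str          # "big" or "bti"
    component: str    # e.g. "Data.db"


def parse_sstable_filename(name: str) -> SSTableFile:
    """Split '<version>-<generation>-<format>-<component>' into its parts."""
    parts = name.split("-", 3)  # the component itself contains no further split
    if len(parts) != 4:
        raise ValueError(f"not an SSTable component filename: {name!r}")
    version, generation, fmt, component = parts
    if fmt not in ("big", "bti"):
        raise ValueError(f"unknown format segment {fmt!r} in {name!r}")
    return SSTableFile(version, generation, fmt, component)


print(parse_sstable_filename("nb-1-big-Data.db"))
```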

The TOC.txt inside the same directory confirms the components present:

Data.db
Statistics.db
Digest.crc32
TOC.txt
CompressionInfo.db
Filter.db
Index.db
Summary.db

When BTI is enabled, a generation includes BTI-specific components alongside common ones. During upgrades, directories may contain both BIG and BTI generations. Real filenames (trimmed):

da-3-bti-Data.db
da-3-bti-Partitions.db
da-3-bti-Rows.db
da-3-bti-Statistics.db
da-3-bti-TOC.txt
da-3-bti-Digest.crc32

See the BTI package for details: org.apache.cassandra.io.sstable.format.bti.

Note: BTI uses only version da (BtiFormat.java:289: current_version = "da"). Version na is a BIG-format version letter and never appears in BTI filenames.
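
A tool scanning a mixed directory can reject impossible version/format pairings early. The sketch below encodes only the version letters named in this chapter, so the sets are illustrative rather than exhaustive, and `is_valid_pairing` is a hypothetical helper.

```python
# Version letters mentioned in this chapter only; real releases define more.
VERSIONS_BY_FORMAT = {
    "big": {"mc", "mm", "na", "nb", "oa"},
    "bti": {"da"},
}


def is_valid_pairing(version: str, fmt: str) -> bool:
    """Return True if the version letter belongs to the given format name."""
    return version in VERSIONS_BY_FORMAT.get(fmt, set())


for name in ("oa-7-big-Data.db", "da-3-bti-Data.db", "na-3-bti-Data.db"):
    version, _generation, fmt, _component = name.split("-", 3)
    print(name, is_valid_pairing(version, fmt))
```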

TOC.txt is authoritative: tools validate that every listed component exists and that unexpected files do not appear. Integrity checks commonly include:

  • Presence: Each required component listed in TOC.txt exists on disk
  • Consistency: Statistics.db and CompressionInfo.db fields align with observed file sizes and counts
  • Cross-component alignment: Index.db positions must resolve into valid Data.db boundaries; Summary.db samples must be sorted and within token range
  • Digest validation: Digest.crc32 matches computed digests over the appropriate component payloads
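
The presence check can be sketched in a few lines. This is a minimal illustration, not the logic of any particular tool: `check_toc_presence` is a hypothetical function, and it assumes that every component of a generation shares the filename prefix of its TOC.txt and that TOC.txt holds one component name per line, as in the listings above.

```python
from pathlib import Path


def check_toc_presence(toc_path: Path) -> tuple[list[str], list[str]]:
    """Compare TOC.txt against the directory.

    Returns (missing, unlisted): components listed but absent on disk, and
    files sharing the generation prefix that the TOC does not list.
    """
    suffix = "TOC.txt"
    if not toc_path.name.endswith(suffix):
        raise ValueError(f"not a TOC file: {toc_path.name!r}")
    prefix = toc_path.name[: -len(suffix)]  # e.g. "nb-1-big-"
    directory = toc_path.parent

    listed = [line.strip() for line in toc_path.read_text().splitlines()
              if line.strip()]
    missing = [c for c in listed if not (directory / (prefix + c)).is_file()]

    on_disk = {p.name[len(prefix):] for p in directory.iterdir()
               if p.is_file() and p.name.startswith(prefix)}
    unlisted = sorted(on_disk - set(listed))
    return missing, unlisted
```

Calling `check_toc_presence(Path("nb-1-big-TOC.txt"))` in a healthy directory should return two empty lists; anything else points to an incomplete or polluted generation.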

Component order in TOC.txt does not affect functionality, but CQLite writes components in a fixed order for consistency. That order is:

  1. Data.db - Primary partition and row data
  2. Statistics.db - SSTable-level metadata
  3. Digest.crc32 - Integrity checksums
  4. TOC.txt - Table of contents (self-referential)
  5. CompressionInfo.db - Compression chunk metadata
  6. Filter.db - Bloom filter
  7. Index.db - Partition index
  8. Summary.db - Sampled index for acceleration

This ordering matches the implementation in CQLite’s TocWriter. Cassandra’s source does not enforce a canonical TOC line ordering; the order above is CQLite’s own convention.
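
As an illustration of the convention, the sketch below sorts a component set into the order listed above. It is not CQLite's TocWriter code: `order_components` is a hypothetical helper, and placing components outside the list (for example BTI's Partitions.db) last in alphabetical order is an assumption made here for determinism.

```python
CANONICAL_ORDER = [
    "Data.db",
    "Statistics.db",
    "Digest.crc32",
    "TOC.txt",
    "CompressionInfo.db",
    "Filter.db",
    "Index.db",
    "Summary.db",
]


def order_components(components: list[str]) -> list[str]:
    """Sort components by the listed order; unknown names go last, A-Z."""
    rank = {name: i for i, name in enumerate(CANONICAL_ORDER)}
    return sorted(set(components), key=lambda c: (rank.get(c, len(rank)), c))


print(order_components(["Summary.db", "TOC.txt", "Data.db"]))
```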

Critical requirement: TOC.txt must list itself as a component. This self-referential entry ensures:

  • Integrity tools can verify all components, including the TOC
  • Completeness validation accounts for the full component set
  • Deletion/archival operations have a complete manifest

Writers automatically add TOC.txt to the component list if not explicitly included.
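
A writer can apply that rule with a small normalization step. The helper below is a hypothetical sketch of the behavior described, not code taken from a specific writer.

```python
def with_toc_entry(components: list[str]) -> list[str]:
    """Return the component list with duplicates removed and TOC.txt present."""
    result = list(dict.fromkeys(components))  # de-duplicate, keep first-seen order
    if "TOC.txt" not in result:
        result.append("TOC.txt")
    return result


print(with_toc_entry(["Data.db", "Index.db"]))
```

The function is idempotent, so it is safe to call even when the caller already listed TOC.txt.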

TOC.txt serves as the publication barrier for SSTable atomicity. An SSTable is not considered complete or visible until TOC.txt exists on disk. This ensures:

Atomic Visibility: Readers can detect incomplete writes by checking for TOC.txt presence. If the file is missing, the SSTable generation is ignored.

Crash Safety: If a node crashes during SSTable write, partial components without a corresponding TOC.txt are discarded during restart/recovery. Only fully-written SSTables (with TOC) are loaded.

Write Order Guarantee: TOC.txt MUST be written LAST, after all other components have been flushed and synced to disk. This ordering ensures no reader observes a partial SSTable.

Implementation note: Writers use fsync() on each component before writing TOC.txt, then fsync() the TOC itself to guarantee durability.
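
The barrier can be sketched as follows. This is a simplified illustration under stated assumptions: `publish_sstable` and `is_visible` are hypothetical names, component payloads are already fully built in memory, and directory-entry syncing (which a production writer would likely also need) is omitted.

```python
import os
from pathlib import Path


def _write_and_sync(path: Path, payload: bytes) -> None:
    """Write a file and force its contents to stable storage."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def publish_sstable(directory: Path, prefix: str,
                    components: dict[str, bytes]) -> Path:
    """Write every component, sync each one, then write and sync TOC.txt last."""
    names = [n for n in components if n != "TOC.txt"]
    for name in names:
        _write_and_sync(directory / (prefix + name), components[name])
    # Only now create the TOC, listing itself as the final entry.
    toc_path = directory / (prefix + "TOC.txt")
    toc_body = "\n".join(names + ["TOC.txt"]) + "\n"
    _write_and_sync(toc_path, toc_body.encode("ascii"))
    return toc_path


def is_visible(directory: Path, prefix: str) -> bool:
    """A generation counts as published only once its TOC.txt exists."""
    return (directory / (prefix + "TOC.txt")).is_file()
```

A crash at any point before the final `_write_and_sync` call leaves component files without a TOC, which `is_visible` reports as unpublished.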

SSTable components have internal dependencies that dictate write ordering beyond the final TOC barrier:

  1. Data.db first via flush: Partition rows are streamed to disk first. Statistics accumulate during the write.

  2. Data.db + Index.db: Written together during the flush pass. Index.db entries reference Data.db byte offsets, so Data.db chunks must be flushed before corresponding Index entries.

    Note: Statistics.db is written in doPrepare() after Data.db is complete (SSTableWriter.java:384–392), using finalized metadata. It is therefore serialized late in the sequence, just before TOC.txt.

  3. Summary.db: Samples Index.db entries, so Index.db must be complete before Summary generation.

  4. Filter.db: Built from partition keys during Data.db write, can be finalized once all partitions are known.

  5. CompressionInfo.db: Tracks compressed chunk boundaries in Data.db, written as Data.db chunks are compressed.

  6. Digest.crc32: Checksums over finalized components, written after all data components are complete.

  7. TOC.txt LAST: Publication barrier, written after all components are flushed and synced.

Violating these dependencies results in corrupted SSTables with invalid offsets, missing metadata, or incomplete indexes.
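
These dependencies form a small directed graph, which makes them easy to check mechanically. The sketch below is one reading of the list above, expressed as "must be finalized after" edges; the `DEPENDS_ON` table and both function names are assumptions of this example, not structures from Cassandra or CQLite.

```python
from graphlib import TopologicalSorter

# component -> components that must be finalized before it
DEPENDS_ON = {
    "Data.db": set(),
    "Index.db": {"Data.db"},
    "CompressionInfo.db": {"Data.db"},
    "Filter.db": {"Data.db"},
    "Statistics.db": {"Data.db"},
    "Summary.db": {"Index.db"},
    "Digest.crc32": {"Data.db"},
}
# TOC.txt is the publication barrier: it depends on everything else.
DEPENDS_ON["TOC.txt"] = set(DEPENDS_ON)


def valid_order() -> list[str]:
    """Return one finalization order that satisfies every dependency."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def order_violations(finalize_order: list[str]) -> list[tuple[str, str]]:
    """Return (component, dependency) pairs where the dependency came later."""
    position = {name: i for i, name in enumerate(finalize_order)}
    bad = []
    for name in finalize_order:
        for dep in sorted(DEPENDS_ON.get(name, ())):
            if dep in position and position[dep] > position[name]:
                bad.append((name, dep))
    return bad


print(order_violations(["TOC.txt", "Data.db"]))
```

Several orders are valid (Filter.db and CompressionInfo.db are independent of each other, for instance), so the check reports violations instead of comparing against a single expected sequence.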

  • File family remains multi-component; feature flags and index internals differ
  • Descriptor format names are big and bti only (BigFormat.java:75, BtiFormat.java:64). mc, mm, nb, and oa are version letters within big, not format names. 5.0 pairs big (version oa) with BTI (version da).
  • Statistics fields expanded over time; tooling output formatting changed subtly
  • TOC.txt is authoritative for the component set in a given SSTable
  • Summary.db samples Index.db to accelerate seeks; Bloom reduces unnecessary IO
  • CompressionInfo.db is required to read compressed Data.db
  • Version/format changes do not remove the core components, but affect internal structure

%% SSTable component relationship diagram (stub)
flowchart TD
A[Memtable] -->|Flush| B[Data.db]
A -->|Flush| C[Index.db]
A -->|Flush| D[Summary.db]
A -->|Flush| E[Filter.db]
A -->|Flush| F[Statistics.db]
A -->|Flush| G[CompressionInfo.db]

For implementation details, see Appendix C.