Anatomy of an SSTable
This chapter explains each SSTable component and how the components fit together. Cassandra writes a family of files per SSTable generation; TOC.txt enumerates the components present and serves as the invariant list for integrity checks and lifecycle tooling.
In this chapter you will learn
- The role of each component file and how they interact
- Versioning and feature flags across formats
- How schema and CQL types map to on-disk encodings
- How directory layout and TOC establish invariants
- Where BTI (5.0) differs from `big` in component internals
Components Overview
- `Data.db`: The primary row/partition data, optionally compressed in fixed-size chunks.
- `Index.db`: Per-partition index entries mapping key digests to `Data.db` positions; may include promoted index data for wide partitions.
- `Summary.db`: A sampled (promoted) index to accelerate binary search into `Index.db`.
- `Filter.db` (Bloom): Probabilistic membership filter to skip non-existent partitions.
- `Statistics.db`: SSTable-level metadata (timestamps, histograms, repair/level info, min/max tokens, etc.).
- `CompressionInfo.db`: Algorithm, chunk length, and the map of compressed chunk offsets (and optional per-chunk CRCs) for `Data.db`.
- `Digest.crc32`: Digest file for end-to-end integrity.
- `TOC.txt`: Text file listing the components present; tools use it to validate completeness.
Publication barrier = `TOC.txt`. See “TOC Invariants and Integrity Checks” below and Chapter 16 (“SSTable Lifecycle and Maintenance”) for a practical checklist and tooling pointers.
Diagram: sstable components and relationships
- Alt text: Component diagram showing Data, Index, Summary, Bloom, Stats, CompressionInfo, TOC
- Caption: How SSTable components reference each other during reads
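A toy model of the read path may help fix the roles of the components listed above: a membership check first, then a sampled summary that narrows the range of index entries to scan, then the index entry that yields a `Data.db` offset. This is only a sketch under loud assumptions: the in-memory lists, the Python set standing in for the Bloom filter, ordering by raw key rather than token, the `SAMPLE_EVERY` constant, and the function names are all hypothetical and say nothing about the on-disk encodings.

```python
from bisect import bisect_right

SAMPLE_EVERY = 4  # hypothetical sampling interval, chosen only for the example


def build_summary(index_entries, sample_every=SAMPLE_EVERY):
    """Keep every Nth index key together with its position in the index."""
    return [(index_entries[i][0], i)
            for i in range(0, len(index_entries), sample_every)]


def lookup(key, membership, summary, index_entries, sample_every=SAMPLE_EVERY):
    """Return the Data.db offset for key, or None if the partition is absent."""
    # Stand-in for Filter.db: a definite "no" skips all further work.
    if key not in membership:
        return None
    # Stand-in for Summary.db: binary search over the sampled keys.
    sampled_keys = [k for k, _ in summary]
    slot = bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first index entry
    start = summary[slot][1]
    # Stand-in for Index.db: scan only the entries between two samples.
    for entry_key, data_offset in index_entries[start:start + sample_every]:
        if entry_key == key:
            return data_offset
    return None  # the membership check produced a false positive
```

With ten sorted entries and a sampling interval of four, the summary holds three samples, and any lookup scans at most four index entries after the binary search.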
Versioning and Feature Flags
SSTable formats evolved over time. In 5.0, BTI (B-Tree/Trie Indexed) coexists with the long-standing big family. The component set is stable, but internal formats and metadata change with version/feature flags. The Descriptor defines the {format} segment and controls feature availability, while StatsMetadata evolves fields used by compaction heuristics and read optimizations. See Chapter 17 for BTI details.
Format Version Evolution (concise)
| Area | Pre-5.x (big) | 5.0 (BTI/big) |
|---|---|---|
| Index entry | Simple digest list; promoted index optional | BTI adjusts indexing structures; promoted index layout may differ |
| Summary | Sampling rate + offsets | Token-sorted with explicit index offsets |
| CompressionInfo | Offsets only (legacy) | Optional per-chunk CRCs; format detection supported |
| Statistics | Fewer fields | Expanded histograms/repair/level fields |
Schema and Type Mapping
On-disk encodings derive from the table schema (partition/clustering keys and column types). Cassandra’s SerializationHeader computes how values are encoded; in this guide we cross-reference concrete mappings in Appendix A (types) and Appendix B (encodings cheat sheet). For an implementation walkthrough, see Appendix C.
Directory Listing Example
Trimmed listing from test_basic/simple_table (one SSTable generation):

    nb-1-big-CompressionInfo.db
    nb-1-big-Data.db
    nb-1-big-Digest.crc32
    nb-1-big-Filter.db
    nb-1-big-Index.db
    nb-1-big-Statistics.db
    nb-1-big-Summary.db
    nb-1-big-TOC.txt

Note: Cassandra 5.0 nodes write `oa-…-big-*` for new SSTables (version `oa` is current for the BIG format in 5.0; `BigFormat.java:343`). The `nb-…` prefix above is valid for SSTables written by pre-5.0 nodes or during a rolling upgrade.
The TOC.txt inside the same directory confirms the components present:

    Data.db
    Statistics.db
    Digest.crc32
    TOC.txt
    CompressionInfo.db
    Filter.db
    Index.db
    Summary.db

BTI directory example (5.0)
When BTI is enabled, a generation includes BTI-specific components alongside common ones. During upgrades, directories may contain both BIG and BTI generations. Real filenames (trimmed):

    da-3-bti-Data.db
    da-3-bti-Partitions.db
    da-3-bti-Rows.db
    da-3-bti-Statistics.db
    da-3-bti-TOC.txt
    da-3-bti-Digest.crc32

See the BTI package for details: `org.apache.cassandra.io.sstable.format.bti`.
Correction: BTI only ever uses version `da` (`BtiFormat.java:289`: `current_version = "da"`). Version `na` is a BIG-format version letter and is incompatible with BTI filenames.
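The naming scheme in the two listings can be sketched as a small parser. This is a minimal sketch, assuming the `{version}-{generation}-{format}-{component}` layout visible in the filenames above. `ComponentName`, `parse_component_name`, `is_consistent`, and the `KNOWN_VERSIONS` table are hypothetical names, and the table holds only the version letters mentioned in this chapter, not a complete list.

```python
import re
from typing import NamedTuple

# Version letters named in this chapter only; real formats define more.
KNOWN_VERSIONS = {
    "big": {"mc", "mm", "na", "nb", "oa"},
    "bti": {"da"},
}

_NAME = re.compile(
    r"^(?P<version>[a-z]{2})-(?P<generation>[^-]+)"
    r"-(?P<format>[a-z]+)-(?P<component>.+)$"
)


class ComponentName(NamedTuple):
    version: str
    generation: str  # kept as text; not assumed to be an integer
    format: str
    component: str


def parse_component_name(filename: str) -> ComponentName:
    """Split an SSTable component filename into its four segments."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not an SSTable component name: {filename!r}")
    return ComponentName(**match.groupdict())


def is_consistent(name: ComponentName) -> bool:
    """True when the version letter belongs to the stated format."""
    return name.version in KNOWN_VERSIONS.get(name.format, set())
```

Under these assumptions, `nb-1-big-Data.db` and `da-3-bti-Partitions.db` both parse and pass the consistency check, while a name such as `nb-3-bti-Data.db` parses but is flagged, because `nb` is a `big` version letter.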
TOC Invariants and Integrity Checks
TOC.txt is authoritative: tools validate that every listed component exists and that unexpected files do not appear. Integrity checks commonly include:
- Presence: Each required component listed in `TOC.txt` exists on disk
- Consistency: `Statistics.db` and `CompressionInfo.db` fields align with observed file sizes and counts
- Cross-component alignment: `Index.db` positions must resolve into valid `Data.db` boundaries; `Summary.db` samples must be sorted and within token range
- Digest validation: `Digest.crc32` matches computed digests over the appropriate component payloads
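The presence check, together with the "no unexpected files" rule, can be sketched as a set comparison between the TOC and the directory. This is a minimal sketch, assuming `TOC.txt` holds one component name per line as in the listing earlier in this chapter; `check_toc` and its `prefix` parameter (the shared filename stem, for example `nb-1-big-`) are hypothetical and not part of any Cassandra or CQLite API.

```python
from pathlib import Path


def check_toc(directory: Path, prefix: str) -> dict:
    """Compare components listed in TOC.txt with the files on disk.

    Returns the components that are listed but missing, and the files
    that share the prefix but are not listed.
    """
    toc_path = directory / f"{prefix}TOC.txt"
    listed = {
        line.strip()
        for line in toc_path.read_text().splitlines()
        if line.strip()
    }
    on_disk = {
        path.name[len(prefix):]
        for path in directory.iterdir()
        if path.name.startswith(prefix)
    }
    return {
        "missing": sorted(listed - on_disk),
        "unexpected": sorted(on_disk - listed),
    }
```

An SSTable generation passes this check when both lists come back empty. The other checks in the list (consistency, cross-component alignment, digest validation) need format-aware parsing and are out of scope for the sketch.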
TOC.txt Canonical Component Ordering
While component order in TOC.txt does not affect functionality, CQLite writes components in a canonical order for consistency. The standard order is:
- Data.db - Primary partition and row data
- Statistics.db - SSTable-level metadata
- Digest.crc32 - Integrity checksums
- TOC.txt - Table of contents (self-referential)
- CompressionInfo.db - Compression chunk metadata
- Filter.db - Bloom filter
- Index.db - Partition index
- Summary.db - Sampled index for acceleration
This ordering matches the implementation in CQLite’s TocWriter. Cassandra’s source does not enforce a canonical TOC line ordering; the order above is CQLite’s own convention.
TOC.txt Self-Inclusion
Critical requirement: TOC.txt must list itself as a component. This self-referential entry ensures:
- Integrity tools can verify all components, including the TOC
- Completeness validation accounts for the full component set
- Deletion/archival operations have a complete manifest
Writers automatically add TOC.txt to the component list if not explicitly included.
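One way a writer might enforce self-inclusion is sketched below. `toc_lines` is a hypothetical helper, not CQLite's or Cassandra's API, and appending `TOC.txt` at the end when it is absent is an arbitrary choice made for the sketch.

```python
def toc_lines(components):
    """Return TOC.txt entries, de-duplicated in first-seen order,
    guaranteeing that TOC.txt itself is present."""
    # dict.fromkeys preserves insertion order and drops repeats.
    entries = list(dict.fromkeys(components))
    if "TOC.txt" not in entries:
        entries.append("TOC.txt")
    return entries
```

Calling `toc_lines(["Data.db", "Index.db"])` yields a list that ends with `TOC.txt`, while a list that already names `TOC.txt` is returned with its order unchanged.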
Publication Barrier
TOC.txt serves as the publication barrier for SSTable atomicity. An SSTable is not considered complete or visible until TOC.txt exists on disk. This ensures:
Atomic Visibility: Readers can detect incomplete writes by checking for TOC.txt presence. If the file is missing, the SSTable generation is ignored.
Crash Safety: If a node crashes during SSTable write, partial components without a corresponding TOC.txt are discarded during restart/recovery. Only fully-written SSTables (with TOC) are loaded.
Write Order Guarantee: TOC.txt MUST be written LAST, after all other components have been flushed and synced to disk. This ordering ensures no reader observes a partial SSTable.
Implementation note: Writers use fsync() on each component before writing TOC.txt, then fsync() the TOC itself to guarantee durability.
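The barrier described above can be sketched as "sync every component, then write and sync the TOC". This is a minimal sketch, not the writer's actual code: `publish_sstable`, `is_visible`, and the in-memory `components` mapping are hypothetical, and a production writer would likely also need to sync the containing directory, which is omitted here.

```python
import os
from pathlib import Path


def _write_synced(path: Path, payload: bytes) -> None:
    """Write a file and force its contents to stable storage."""
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())


def publish_sstable(directory: Path, prefix: str, components: dict) -> Path:
    """Write all components, then TOC.txt last as the publication barrier.

    `components` maps component name to bytes and must not contain TOC.txt.
    """
    for name, payload in components.items():
        _write_synced(directory / f"{prefix}{name}", payload)
    # Only after every component is durable does the TOC appear.
    toc_path = directory / f"{prefix}TOC.txt"
    listing = "\n".join([*components, "TOC.txt"]) + "\n"
    _write_synced(toc_path, listing.encode("ascii"))
    return toc_path


def is_visible(directory: Path, prefix: str) -> bool:
    """A reader treats the generation as complete only if TOC.txt exists."""
    return (directory / f"{prefix}TOC.txt").exists()
```

If the process stops anywhere inside the loop, `is_visible` stays false for that generation, which is the property the crash-safety paragraph relies on.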
Write Order Dependencies
SSTable components have internal dependencies that dictate write ordering beyond the final TOC barrier:
- Data.db first via flush: Partition rows are streamed to disk first. Statistics accumulate during the write.
- Data.db + Index.db: Written together during the flush pass. Index.db entries reference Data.db byte offsets, so Data.db chunks must be flushed before corresponding Index entries.
  Note: `Statistics.db` is written in `doPrepare()` after Data.db is complete (`SSTableWriter.java:384–392`), using finalized metadata. It is serialized just before `TOC.txt`, not before Data.db.
- Summary.db: Samples Index.db entries, so Index.db must be complete before Summary generation.
- Filter.db: Built from partition keys during Data.db write, can be finalized once all partitions are known.
- CompressionInfo.db: Tracks compressed chunk boundaries in Data.db, written as Data.db chunks are compressed.
- Digest.crc32: Checksums over finalized components, written after all data components are complete.
- TOC.txt LAST: Publication barrier, written after all components are flushed and synced.
Violating these dependencies results in corrupted SSTables with invalid offsets, missing metadata, or incomplete indexes.
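The dependencies listed above can be sketched as a small graph and checked with a topological sort. The `DEPENDS_ON` table is one reading of the list, not something taken from Cassandra's source: in particular, modelling `Digest.crc32` as depending only on `Data.db` is an assumption. The sketch orders when each component can be finalized, not when its bytes start streaming (Data.db and Index.db are written in the same pass). `finalization_order` and `violations` are hypothetical helpers.

```python
from graphlib import TopologicalSorter

# component -> components that must be complete before it is finalized
DEPENDS_ON = {
    "Data.db": set(),
    "Index.db": {"Data.db"},
    "Summary.db": {"Index.db"},
    "Filter.db": {"Data.db"},
    "CompressionInfo.db": {"Data.db"},
    "Statistics.db": {"Data.db"},
    "Digest.crc32": {"Data.db"},  # assumption: digest covers Data.db
}
# The publication barrier depends on every other component.
DEPENDS_ON["TOC.txt"] = set(DEPENDS_ON)


def finalization_order():
    """Return one valid order in which components can be finalized."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def violations(order):
    """List (component, dependency) pairs where the dependency comes later."""
    position = {name: i for i, name in enumerate(order)}
    return [
        (name, dep)
        for name, deps in DEPENDS_ON.items()
        for dep in sorted(deps)
        if position[dep] > position[name]
    ]
```

Any order returned by `finalization_order` starts with `Data.db` and ends with `TOC.txt`; `violations` gives a quick way to test a proposed write sequence against the same table.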
Sidebar: Version Differences (3.x/4.x)
- File family remains multi-component; feature flags and index internals differ
- `Descriptor` format names are `big` and `bti` only (`BigFormat.java:75`, `BtiFormat.java:64`). `mc`, `mm`, `nb`, and `oa` are version letters within `big`, not format names. 5.0 pairs `big` (version `oa`) with BTI (version `da`).
- Statistics fields expanded over time; tooling output formatting changed subtly
Key Takeaways
- `TOC.txt` is authoritative for the component set in a given SSTable
- `Summary.db` samples `Index.db` to accelerate seeks; Bloom reduces unnecessary IO
- `CompressionInfo.db` is required to read compressed `Data.db`
- Version/format changes do not remove the core components, but affect internal structure
References
- Cassandra 5.0.8 (pinned):
  - `Descriptor`: https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/Descriptor.java
  - `StatsMetadata`: https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/metadata/StatsMetadata.java
  - `IndexSummary`: https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/IndexSummary.java
  - `SSTableWriter.doPrepare()` (Statistics.db write timing): https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/SSTableWriter.java#L384-L394
  - `BigFormat` (`current_version = "oa"`): https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java#L343
  - `BtiFormat` (`current_version = "da"`): https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.java#L289-L290

Reference (diagram source):

    %% SSTable component relationship diagram (stub)
    flowchart TD
      A[Memtable] -->|Flush| B[Data.db]
      A -->|Flush| C[Index.db]
      A -->|Flush| D[Summary.db]
      A -->|Flush| E[Filter.db]
      A -->|Flush| F[Statistics.db]
      A -->|Flush| G[CompressionInfo.db]

For implementation details, see Appendix C.