Secondary Indexes (2i)
One-paragraph summary: Introduces per-table local indexes, how they interact with SSTable storage, the high-level query flow, and their limitations.
In this chapter you will learn
- How 2i are organized and stored
- How queries flow through 2i to rows
- Interactions with SSTables and filtering
- Key limitations and trade-offs
Storage and Lifecycle
Secondary indexes (2i) are local, per-table indexes that map a non-primary-key column value to the base table’s primary key. In Cassandra 5.0, built-in 2i implementations attach to the table via the org.apache.cassandra.index.Index API and maintain their own index memtables and SSTables. They flush and compact independently of the base table, but reads ultimately retrieve rows from the base table’s Data.db.
The built-in implementation is CassandraIndex (internal name legacy_local_table), which is considered legacy in Cassandra 5.0; SAI is the recommended replacement for new indexes.
- Index data model: value -> set of row identifiers (partition key plus clustering key as needed).
- Storage: hidden, per-index SSTables managed by the index implementation. These files follow the table’s lifecycle (flush/compaction) but are separate artifacts.
- Local scope: indexes do not span multiple nodes; coordinator nodes query relevant replicas and merge results.
- Consistency: index updates are part of the base table mutation path; the base row remains the source of truth during reads.
Practical consequences:
- Base-table SSTables remain authoritative. Index SSTables supply candidate row keys that must be validated against the base table.
- Index compactions can lag base-table compactions; Cassandra reconciles during reads using timestamps/tombstones.
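The "index supplies candidates, base table validates" relationship can be sketched with a toy in-memory model. This is an illustration only, not how CassandraIndex is implemented: the class `ToyLocalIndex`, its methods, and the decision to leave stale postings in place are all assumptions made for the example.

```python
from collections import defaultdict

class ToyLocalIndex:
    """Toy model of one node's index: indexed value -> set of primary keys."""

    def __init__(self):
        self.base = {}                    # primary key -> row (source of truth)
        self.postings = defaultdict(set)  # indexed value -> candidate primary keys

    def write(self, pk, email):
        # The index entry is added on the same write path as the base row.
        # The previous posting is deliberately left behind to imitate a
        # stale index entry that has not been cleaned up yet (an assumption).
        self.base[pk] = {"id": pk, "email": email}
        self.postings[email].add(pk)

    def lookup(self, email):
        # The index only supplies candidates; each candidate is validated
        # against the authoritative base row before it is returned.
        candidates = self.postings.get(email, set())
        return sorted(
            pk for pk in candidates
            if pk in self.base and self.base[pk]["email"] == email
        )

idx = ToyLocalIndex()
idx.write("id1", "a@example.com")
idx.write("id2", "a@example.com")
idx.write("id1", "b@example.com")   # id1 changes email; the old posting is stale

print(idx.lookup("a@example.com"))  # ['id2']
print(idx.lookup("b@example.com"))  # ['id1']
```

The stale posting for `id1` under the old email is still present in `postings`, but validation against the base row keeps it out of the result.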
Query Flow
At read time, the coordinator sends the query to the relevant replicas. Each replica consults its local index to produce candidate primary keys, then fetches the corresponding rows from its copy of the base table. Non-indexed predicates are applied as filters on the fetched rows, and the coordinator merges the replica results.
Tiny example (conceptual):
- Table: `users(id uuid PRIMARY KEY, email text, age int)`; 2i on `email`.
- Query: `SELECT id FROM users WHERE email = 'a@example.com' AND age >= 30 ALLOW FILTERING;` (`ALLOW FILTERING` is required because `age` is not indexed).
- Execution:
  - 2i lookup yields candidate keys: `{id1, id2, id3}` for `email='a@example.com'`.
  - Fetch rows from `Data.db` using the normal read path (Bloom filter -> Summary -> Index -> Data).
  - Apply `age >= 30` as a post-filter to the fetched rows.
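The three execution steps above can be sketched as plain Python over in-memory dictionaries. The names `BASE_ROWS`, `EMAIL_INDEX`, and `query_by_email` are hypothetical stand-ins for one replica's base table and index, and the ages are made-up sample values; the real read path goes through SSTables rather than dicts.

```python
# Hypothetical in-memory stand-ins for one replica's data.
BASE_ROWS = {
    "id1": {"id": "id1", "email": "a@example.com", "age": 25},
    "id2": {"id": "id2", "email": "a@example.com", "age": 34},
    "id3": {"id": "id3", "email": "a@example.com", "age": 41},
    "id4": {"id": "id4", "email": "z@example.com", "age": 52},
}
EMAIL_INDEX = {
    "a@example.com": ["id1", "id2", "id3"],
    "z@example.com": ["id4"],
}

def query_by_email(email, min_age):
    # Step 1: the index lookup yields candidate primary keys.
    candidates = EMAIL_INDEX.get(email, [])
    # Step 2: fetch the corresponding rows from the base table.
    rows = [BASE_ROWS[pk] for pk in candidates if pk in BASE_ROWS]
    # Step 3: apply the non-indexed predicate as a post-filter.
    return [row["id"] for row in rows if row["age"] >= min_age]

print(query_by_email("a@example.com", 30))  # ['id2', 'id3']
```

All three candidates are fetched even though only two survive the filter, which is why the size of the candidate set drives the cost.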
Notes:
- 2i is best for moderate-to-high cardinality columns with selective predicates (equality or simple IN). Very low-cardinality columns can still cause large candidate sets and heavy filtering.
- Range queries on arbitrary non-primary-key columns are not supported by classic 2i. Use SAI (Ch. 14) for efficient range/text/vector queries in 5.0.
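To see why cardinality matters, a rough back-of-the-envelope estimate helps. The sketch below assumes values are spread uniformly across rows, which real data rarely is, and the row counts and column names are made-up illustration figures; `expected_candidates` is a hypothetical helper.

```python
def expected_candidates(total_rows: int, distinct_values: int) -> float:
    """Average number of rows per indexed value, assuming a uniform spread."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    return total_rows / distinct_values

rows = 1_000_000
for column, distinct in [("email", 500_000), ("status", 4)]:
    print(column, expected_candidates(rows, distinct))
# email 2.0
# status 250000.0
```

Under these assumptions an equality lookup on the selective column touches a couple of base rows, while the low-cardinality column produces a candidate set of a quarter of the table.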
Caching and Memory Behavior
- Index readers maintain small in-memory structures (e.g., open file handles, minimal metadata). Hot partitions benefit primarily from OS page cache rather than bespoke 2i caching.
- Expect memory usage to be dominated by OS caching of index/base SSTables; application-level caches for 2i are minimal compared to SAI.
Compaction of Index Segments
- 2i SSTables are compacted among themselves, separately from base-table SSTables. Merging removes deleted/tombstoned entries and consolidates postings.
- Lag between base and index compactions can cause temporary read amplification; correctness is maintained by timestamps and validation against base rows.
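A minimal sketch of the merge step, assuming each index entry is a tuple `(value, primary_key, timestamp, deleted)`. This tuple layout and the `compact` function are hypothetical; real compaction also honors tombstone grace periods and overlapping SSTables, which the sketch ignores.

```python
def compact(segments):
    """Merge index segments, keeping the newest entry per (value, key) pair."""
    newest = {}
    for segment in segments:
        for value, pk, ts, deleted in segment:
            key = (value, pk)
            # Last-write-wins: keep whichever entry has the higher timestamp.
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, deleted)
    # Drop pairs whose newest entry is a tombstone; emit the rest sorted.
    return sorted(
        (value, pk, ts)
        for (value, pk), (ts, deleted) in newest.items()
        if not deleted
    )

older = [("a@example.com", "id1", 10, False), ("a@example.com", "id2", 11, False)]
newer = [("a@example.com", "id1", 20, True), ("b@example.com", "id1", 20, False)]

print(compact([older, newer]))
# [('a@example.com', 'id2', 11), ('b@example.com', 'id1', 20)]
```

Until such a merge runs, a read sees both the old posting and its tombstone and must reconcile them by timestamp, which is the temporary read amplification described above.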
Corruption and Error Handling
- If an index SSTable is corrupted, reads fall back to base-table scanning for affected predicates or surface an error depending on failure policy. Operators should run validation tools and rebuild the index if necessary.
- The base table remains authoritative; index corruption does not corrupt base data.
Complexity Notes
- Equality lookup in 2i: O(log S + K) per segment (binary search/lookup plus iterating K candidates), where S is index segment size.
- Fetch/validation from base SSTables follows normal read-path complexity (see Ch. 10-12). Overall cost scales with candidate set size.
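The O(log S + K) shape can be illustrated with a sorted toy segment and Python's `bisect` module. The parallel-list layout (`VALUES`, `KEYS`) and the `equality_lookup` helper are assumptions for illustration; the on-disk index format is different.

```python
from bisect import bisect_left, bisect_right

# One toy "segment": parallel lists sorted by indexed value.
VALUES = ["a@x", "a@x", "b@x", "c@x", "c@x", "c@x"]
KEYS = ["id1", "id7", "id3", "id2", "id5", "id9"]

def equality_lookup(values, keys, target):
    lo = bisect_left(values, target)   # O(log S): first position of target
    hi = bisect_right(values, target)  # O(log S): one past the last position
    return keys[lo:hi]                 # O(K): copy out the K candidates

print(equality_lookup(VALUES, KEYS, "c@x"))  # ['id2', 'id5', 'id9']
print(equality_lookup(VALUES, KEYS, "zzz"))  # []
```

The two binary searches cost O(log S) regardless of the result, so for non-selective values the K term (and the base-table fetches that follow) dominates.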
Key Takeaways
- 2i are local indexes stored in per-index SSTables that map values to primary keys; the base table remains authoritative.
- Reads consult the index to produce candidate keys, then fetch and filter base rows.
- Works well for selective equality (and small IN) predicates; avoid low-cardinality columns.
- 2i lifecycle (flush/compact) is independent but coordinated with the table’s write path.
- Prefer SAI for range, LIKE, and vector searches in Cassandra 5.0.
References
- Cassandra 5.0.8
  - Index API: org.apache.cassandra.index.Index: https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/Index.java
  - Built-in 2i base (legacy): org.apache.cassandra.index.internal.CassandraIndex: https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/internal/CassandraIndex.java
  - Keys searcher: org.apache.cassandra.index.internal.keys.KeysSearcher: https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/internal/keys/KeysSearcher.java
  - Composites searcher: org.apache.cassandra.index.internal.composites.CompositesSearcher: https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/internal/composites/CompositesSearcher.java
For implementation details, see Appendix C.