Secondary Indexes (2i)

One-paragraph summary: This chapter introduces per-table local secondary indexes, how their storage interacts with SSTables, the high-level query flow, and the main limitations.

In this chapter you will learn

How 2i are organized and stored
How queries flow through 2i to rows
Interactions with SSTables and filtering
Key limitations and trade-offs

Storage and Lifecycle

Secondary indexes (2i) are local, per-table indexes that map a non-primary-key column value to the base table’s primary key. In Cassandra 5.0, built-in 2i implementations attach to the table via the org.apache.cassandra.index.Index API. Each index is backed by a hidden index table with its own memtable and SSTables. Index memtables are flushed together with the base table’s memtable, while index SSTables compact on their own schedule. Reads ultimately retrieve rows from the base table’s Data.db.

The built-in implementation is CassandraIndex (internal name legacy_local_table), which is considered legacy in Cassandra 5.0; SAI is the recommended replacement for new indexes.

Index data model: value -> set of row identifiers (partition key plus clustering key as needed).
Storage: hidden, per-index SSTables managed by the index implementation. These files follow the table’s lifecycle (flush/compaction) but are separate artifacts.
Local scope: indexes do not span multiple nodes; coordinator nodes query relevant replicas and merge results.
Consistency: index updates are part of the base table mutation path; the base row remains the source of truth during reads.
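
The data model above can be pictured with a small in-memory sketch. This is a toy model, not Cassandra code: the class name ToyLocalIndex and its methods are hypothetical, and the read-before-write used to retire old entries is a simplification (the real implementation cleans up stale entries through memtable updates, compaction, and read-time validation).

```python
from collections import defaultdict


class ToyLocalIndex:
    """Hypothetical model of one node's index on a single column."""

    def __init__(self):
        self.base = {}                    # primary key -> indexed column value
        self.postings = defaultdict(set)  # indexed value -> set of primary keys

    def write(self, pk, value):
        # The index update rides along with the base-table mutation.
        old = self.base.get(pk)
        if old is not None and old != value:
            # Simplification: retire the entry for the overwritten value now.
            self.postings[old].discard(pk)
            if not self.postings[old]:
                del self.postings[old]
        self.base[pk] = value
        self.postings[value].add(pk)

    def lookup(self, value):
        # Return a copy so callers cannot mutate the index.
        return set(self.postings.get(value, set()))


idx = ToyLocalIndex()
idx.write("id1", "a@example.com")
idx.write("id2", "a@example.com")
idx.write("id1", "b@example.com")  # id1 changes its email
```

After these writes, a lookup for 'a@example.com' returns only id2, and the base mapping still records the current value for every key.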

Practical consequences:

Base-table SSTables remain authoritative. Index SSTables supply candidate row keys that must be validated against the base table.
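Index compactions can lag base-table compactions, so an index may briefly hold entries for values that have since been overwritten or deleted. Cassandra reconciles during reads using timestamps/tombstones and discards candidates that no longer match the base row.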
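
A minimal sketch of the validation step, assuming base rows are held in a plain dict keyed by primary key. The function name validate_candidates and the row layout are hypothetical and only illustrate why the base table stays authoritative.

```python
def validate_candidates(candidates, base_rows, column, value):
    """Split candidate keys into live matches and stale index entries."""
    live, stale = [], []
    for pk in candidates:
        row = base_rows.get(pk)
        # A candidate is live only if the base row exists and still holds
        # the indexed value; anything else is a stale index entry.
        if row is not None and row.get(column) == value:
            live.append(pk)
        else:
            stale.append(pk)
    return live, stale


base_rows = {
    "id1": {"email": "a@example.com"},
    "id2": {"email": "b@example.com"},  # value changed after it was indexed
}
live, stale = validate_candidates(
    ["id1", "id2", "id3"], base_rows, "email", "a@example.com"
)
```

Here id2 (overwritten) and id3 (deleted) are reported as stale, and only id1 is returned to the query.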

Query Flow

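At read time, the coordinator sends the query to the relevant replicas. Each replica consults its local index to produce candidate primary keys, then fetches the corresponding rows from its base table. Non-indexed predicates are applied as filters on the fetched rows, and the coordinator merges the replica results.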
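
Because each index is local to its node, a query is a scatter-gather: every contacted replica answers from its own index and the coordinator merges the answers. The sketch below models only that merge. The name coordinator_read is hypothetical, and the model ignores token ranges, consistency levels, and the fact that real replicas return rows rather than bare keys.

```python
def coordinator_read(replica_indexes, value):
    """Union the candidate keys returned by each replica's local index."""
    merged = set()
    for local_index in replica_indexes:
        # Each replica only knows about the rows it stores.
        merged |= local_index.get(value, set())
    return sorted(merged)


replicas = [
    {"a@example.com": {"id1"}},
    {"a@example.com": {"id1", "id2"}},
    {},  # this replica holds no matching rows
]
keys = coordinator_read(replicas, "a@example.com")
```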

Tiny example (conceptual):

Table: users(id uuid PRIMARY KEY, email text, age int); 2i on email.
Query: SELECT id FROM users WHERE email = 'a@example.com' AND age >= 30;
Execution:
- 2i lookup yields candidate keys: {id1, id2, id3} for email='a@example.com'.
- Fetch rows from Data.db using the normal read path (Bloom filter -> Summary -> Index -> Data).
- Apply age >= 30 as a post-filter to the fetched rows.
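
The three execution steps can be sketched as follows. This is an illustrative model only: dicts stand in for the index and for Data.db, the ages are made-up sample data, and run_indexed_query is a hypothetical helper rather than a Cassandra API.

```python
def run_indexed_query(index, base_rows, column, value, post_filter):
    # Step 1: the index lookup yields candidate primary keys.
    candidates = sorted(index.get(value, set()))
    results = []
    for pk in candidates:
        # Step 2: fetch the row from the base table (a dict stands in for Data.db).
        row = base_rows.get(pk)
        if row is None or row.get(column) != value:
            continue  # stale candidate: the base row is the source of truth
        # Step 3: apply the non-indexed predicate as a post-filter.
        if post_filter(row):
            results.append(pk)
    return results


email_index = {"a@example.com": {"id1", "id2", "id3"}}
users = {
    "id1": {"email": "a@example.com", "age": 25},
    "id2": {"email": "a@example.com", "age": 31},
    "id3": {"email": "a@example.com", "age": 40},
}
matches = run_indexed_query(
    email_index, users, "email", "a@example.com", lambda row: row["age"] >= 30
)
print(matches)  # ['id2', 'id3']
```

All three candidates are fetched even though only two survive the filter, which is why the cost of the query tracks the candidate set size rather than the result size.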

Notes:

2i is best for moderate-to-high cardinality columns with selective predicates (equality or simple IN). Very low-cardinality columns produce large candidate sets and heavy filtering, because each indexed value maps to a large share of the table.
Range queries on arbitrary non-primary-key columns are not supported by classic 2i. Use SAI (Ch. 14) for efficient range/text/vector queries in 5.0.
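
A back-of-the-envelope way to see the cardinality effect, assuming values are uniformly distributed across rows (a simplifying assumption, since real data is often skewed). The function name and the row counts are illustrative, not measurements.

```python
def expected_candidates(total_rows, distinct_values):
    """Average number of rows per indexed value under a uniform distribution."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    return total_rows / distinct_values


# A two-valued flag over 10 million rows: each lookup returns half the table.
low_cardinality = expected_candidates(10_000_000, 2)

# One million distinct values over the same table: about ten rows per lookup.
high_cardinality = expected_candidates(10_000_000, 1_000_000)
```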

Caching and Memory Behavior

Index readers maintain small in-memory structures (e.g., open file handles, minimal metadata). Hot partitions benefit primarily from OS page cache rather than bespoke 2i caching.
Expect memory usage to be dominated by OS caching of index/base SSTables; application-level caches for 2i are minimal compared to SAI.

Compaction of Index Segments

2i SSTables compact among themselves, separately from the base table’s SSTables. Merging removes deleted/tombstoned entries and consolidates postings.
Lag between base and index compactions can cause temporary read amplification; correctness is maintained by timestamps and validation against base rows.
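
The merge can be sketched as a last-write-wins reconciliation over index entries. This is a simplified model under stated assumptions: each segment is a dict from (value, primary key) to (timestamp, is_tombstone), tombstones win timestamp ties, and purge_tombstones=True assumes every tombstone is already old enough to drop. The name compact_segments is hypothetical.

```python
def compact_segments(segments, purge_tombstones=True):
    """Merge index segments, keeping the newest version of each entry."""
    merged = {}
    for segment in segments:
        for key, (ts, deleted) in segment.items():
            current = merged.get(key)
            # Newer timestamp wins; on a tie the tombstone wins.
            if (
                current is None
                or ts > current[0]
                or (ts == current[0] and deleted and not current[1])
            ):
                merged[key] = (ts, deleted)
    if purge_tombstones:
        # Assumption: all tombstones are eligible for removal.
        merged = {k: v for k, v in merged.items() if not v[1]}
    return merged


older = {
    ("a@example.com", "id1"): (10, False),
    ("a@example.com", "id2"): (11, False),
}
newer = {
    ("a@example.com", "id1"): (20, True),   # id1 no longer has this email
    ("b@example.com", "id1"): (20, False),
}
compacted = compact_segments([older, newer])
```

Until this merge runs, a lookup for 'a@example.com' has to read both segments and reconcile id1 at read time, which is the temporary read amplification described above.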

Corruption and Error Handling

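If an index SSTable is corrupted, reads that touch it surface an error, which is handled according to the configured disk failure policy; Cassandra does not transparently fall back to scanning the base table for the affected predicates. Operators should run validation tools and rebuild the index if necessary (for example, with nodetool rebuild_index).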
The base table remains authoritative; index corruption does not corrupt base data.

Complexity Notes

Equality lookup in 2i: O(log S + K) per segment (binary search/lookup plus iterating K candidates), where S is the number of entries in the index segment and K is the number of candidates matching the value.
Fetch/validation from base SSTables follows normal read-path complexity (see Ch. 10-12). Overall cost scales with candidate set size.
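
The O(log S + K) shape can be illustrated with a binary search over a sorted list of (value, primary key) pairs. This is a sketch of the cost model only. The on-disk layout of a real index segment differs, and equality_lookup is a hypothetical name.

```python
from bisect import bisect_left


def equality_lookup(segment, value):
    """segment: list of (value, pk) tuples sorted ascending."""
    # O(log S): (value,) sorts before every (value, pk), so this lands on
    # the first entry for the value, if there is one.
    i = bisect_left(segment, (value,))
    candidates = []
    # O(K): walk the contiguous run of entries for this value.
    while i < len(segment) and segment[i][0] == value:
        candidates.append(segment[i][1])
        i += 1
    return candidates


segment = sorted([("b", "id3"), ("a", "id1"), ("a", "id2"), ("c", "id4")])
hits = equality_lookup(segment, "a")
```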

Key Takeaways

2i are local indexes, stored in per-index SSTables, that map values to primary keys; the base table remains authoritative.
Reads consult the index to produce candidate keys, then fetch and filter base rows.
Works well for selective equality (and small IN) predicates; avoid low-cardinality columns.
2i lifecycle (flush/compact) is independent but coordinated with the table’s write path.
Prefer SAI for range, LIKE, and vector searches in Cassandra 5.0.

References

Cassandra 5.0.8
- Index API: org.apache.cassandra.index.Index — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/Index.java
- Built-in 2i base (legacy): org.apache.cassandra.index.internal.CassandraIndex — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/internal/CassandraIndex.java
- Keys searcher: org.apache.cassandra.index.internal.keys.KeysSearcher — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/internal/keys/KeysSearcher.java
- Composites searcher: org.apache.cassandra.index.internal.composites.CompositesSearcher — https://github.com/apache/cassandra/blob/cassandra-5.0.8/src/java/org/apache/cassandra/index/internal/composites/CompositesSearcher.java

For implementation details, see Appendix C.