SAI indexing with collections

SAI supports collections of type map, list, and set. Collections allow you to group and store data together in a column.

In a relational database, a grouping such as a user’s multiple email addresses is achieved via many-to-one joined relationship between (for example) a user table and an email table. Apache Cassandra avoids joins between two tables by storing the user’s email addresses in a collection column in the user table. Each collection specifies the data type of the data held.

A collection is appropriate if the data for collection storage is limited. If the data has unbounded growth potential, like messages sent or sensor events registered every second, do not use collections. Instead, use a table with a compound primary key where data is stored in the clustering columns.

In CQL queries of database tables with SAI indexes, the CONTAINS clauses are supported with, and specific to:

  • SAI collection maps with keys, values, and entries

  • SAI collections with list and set types

SAI also supports element-level indexes (VALUES, KEYS, ENTRIES) on frozen collections. See Frozen collection indexing below.

Frozen collection indexing

SAI supports element-level indexing on frozen collections (added in Cassandra 6.0, CASSANDRA-18492). Before Cassandra 6.0, frozen collections could only be indexed with the FULL target, which enables exact equality matching on the entire serialized collection value. SAI now also accepts VALUES, KEYS, and ENTRIES targets on frozen collections, enabling element-level queries with CONTAINS, CONTAINS KEY, and map entry equality (column[key] = value).

This relaxation is SAI-specific. Non-SAI (legacy 2i) indexes still require FULL indexing for frozen collections. Source: src/java/org/apache/cassandra/cql3/statements/schema/CreateIndexStatement.java, lines 297-304, trunk.

How SAI indexes frozen collection elements

When an SAI index targets VALUES, KEYS, or ENTRIES on a frozen collection, SAI decomposes the serialized collection into individual terms at write time (both the memtable path and the SSTable path). Each element is indexed as a separate term. This is transparent to the query path but has write-amplification implications for large frozen collections: a frozen set with 100 elements produces 100 index terms per write instead of one.

The sai_frozen_term_size guardrail applies to each individually extracted term, not to the full serialized collection.

Supported index targets for frozen collections

Collection type Supported index targets Enabled queries

frozen<set<T>>

VALUES, FULL

CONTAINS, exact equality on full value

frozen<list<T>>

VALUES, FULL

CONTAINS, exact equality on full value

frozen<map<K,V>>

KEYS, VALUES, ENTRIES, FULL

CONTAINS KEY, CONTAINS, column[key] = value, exact equality on full value

ENTRIES indexing on a frozen map column that is a clustering column is not supported and will be rejected at index creation time. Map entry predicates (column[key] = value) cannot be evaluated on clustering columns. Use FULL, KEYS, or VALUES instead for frozen map clustering columns. Source: src/java/org/apache/cassandra/cql3/statements/schema/CreateIndexStatement.java, line 310-312, trunk.

Create indexes on frozen collections

The following examples show CREATE INDEX syntax for frozen collection columns. All examples use USING 'sai'.

-- Frozen set: VALUES index enables CONTAINS queries
CREATE INDEX ON ks.tbl (VALUES(frozen_tags)) USING 'sai';

-- Frozen list: VALUES index enables CONTAINS queries
CREATE INDEX ON ks.tbl (VALUES(frozen_events)) USING 'sai';

-- Frozen map: KEYS index enables CONTAINS KEY queries
CREATE INDEX ON ks.tbl (KEYS(frozen_metadata)) USING 'sai';

-- Frozen map: VALUES index enables CONTAINS queries on map values
CREATE INDEX ON ks.tbl (VALUES(frozen_metadata)) USING 'sai';

-- Frozen map: ENTRIES index enables map entry equality queries
CREATE INDEX ON ks.tbl (ENTRIES(frozen_metadata)) USING 'sai';

-- FULL indexing continues to work for exact equality on the entire frozen collection
CREATE INDEX ON ks.tbl (FULL(frozen_tags)) USING 'sai';

Query frozen collections with element-level indexes

With element-level SAI indexes defined, you can query frozen collection columns using the same CONTAINS, CONTAINS KEY, and map entry equality syntax used for non-frozen collections.

-- Query a frozen set or list (requires VALUES index)
SELECT * FROM ks.tbl WHERE frozen_tags CONTAINS 'cassandra';
SELECT * FROM ks.tbl WHERE frozen_events CONTAINS 'summit';

-- Query a frozen map by key (requires KEYS index)
SELECT * FROM ks.tbl WHERE frozen_metadata CONTAINS KEY 'env';

-- Query a frozen map by value (requires VALUES index)
SELECT * FROM ks.tbl WHERE frozen_metadata CONTAINS 'production';

-- Query a frozen map by entry (requires ENTRIES index)
SELECT * FROM ks.tbl WHERE frozen_metadata['env'] = 'production';

Using the set type

This example uses the following table and index:

CREATE TABLE IF NOT EXISTS cycling.cyclist_career_teams (
  id UUID PRIMARY KEY,
  lastname text,
  teams set<text>
);
CREATE INDEX teams_idx 
    ON cycling.cyclist_career_teams (teams) 
    USING 'sai';

An index is created on the set column teams in the cyclist_career_teams table.

Query with CONTAINS from the set teams column:

CQL
SELECT * FROM cycling.cyclist_career_teams 
   WHERE teams CONTAINS 'Rabobank-Liv Giant';
Result
 id                                   | lastname | teams
--------------------------------------+----------+------------------------------------------------------------------------------------------------------
 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |      VOS | {'Nederland bloeit', 'Rabobank Women Team', 'Rabobank-Liv Giant', 'Rabobank-Liv Woman Cycling Team'}

Frozen set indexing

Starting in Cassandra 6.0 (CASSANDRA-18492), SAI supports VALUES indexing on frozen sets. This enables CONTAINS queries on individual set elements without requiring an exact match on the entire serialized value. FULL indexing continues to work for exact equality on the complete frozen set.

-- Table with a frozen set column
CREATE TABLE ks.cyclist_career_teams (
    id uuid PRIMARY KEY,
    firstname text,
    lastname text,
    teams frozen<set<text>>
);

-- Create a VALUES index on the frozen set column (Cassandra 6.0+, SAI only)
CREATE INDEX ON ks.cyclist_career_teams (VALUES(teams)) USING 'sai';

-- Query: find cyclists who were on a specific team
SELECT * FROM ks.cyclist_career_teams WHERE teams CONTAINS 'Team Sky';

VALUES indexing on frozen sets is supported by SAI only. Non-SAI (legacy 2i) indexes require FULL indexing for frozen sets. Source: test/unit/org/apache/cassandra/index/sai/cql/types/collections/sets/SetFrozenCollectionTest.java, trunk.

Using the list type

This example uses the following table and index:

CREATE TABLE IF NOT EXISTS cycling.upcoming_calendar (
  year int,
  month int,
  events list<text>,
  PRIMARY KEY (year, month)
);
CREATE INDEX events_idx 
    ON cycling.upcoming_calendar (events) 
    USING 'sai';

An index is created on the list column events in the upcoming_calendar table.

Query with CONTAINS from the list events column:

CQL
SELECT * FROM cycling.upcoming_calendar 
   WHERE events CONTAINS 'Criterium du Dauphine';
Result
 year | month | events
------+-------+-----------------------------------------------
 2015 |     6 | ['Criterium du Dauphine', 'Tour de Sui\nsse']

A slightly more complex query selects rows that either contain a particular event or have a particular month date:

CQL
SELECT * FROM cycling.upcoming_calendar
    WHERE events CONTAINS 'Criterium du Dauphine' 
          OR month = 7;
Result
 year | month | events
------+-------+-----------------------------------------------
 2015 |     6 | ['Criterium du Dauphine', 'Tour de Sui\nsse']
 2015 |     7 |                            ['Tour de France']

Frozen list indexing

Starting in Cassandra 6.0 (CASSANDRA-18492), SAI supports VALUES indexing on frozen lists. This enables CONTAINS queries on individual list elements without requiring an exact match on the entire serialized value. FULL indexing continues to work for exact equality on the complete frozen list.

-- Table with a frozen list column
CREATE TABLE ks.upcoming_calendar (
    id uuid PRIMARY KEY,
    month_year text,
    events frozen<list<text>>
);

-- Create a VALUES index on the frozen list column (Cassandra 6.0+, SAI only)
CREATE INDEX ON ks.upcoming_calendar (VALUES(events)) USING 'sai';

-- Query: find calendar entries containing a specific event
SELECT * FROM ks.upcoming_calendar WHERE events CONTAINS 'Grand Tour';

VALUES indexing on frozen lists is supported by SAI only. Non-SAI (legacy 2i) indexes require FULL indexing for frozen lists. Source: test/unit/org/apache/cassandra/index/sai/cql/types/collections/lists/ListFrozenCollectionTest.java, trunk.

Using the map type

This example uses the following table and indexes:

CREATE TABLE IF NOT EXISTS cycling.cyclist_teams (
  id uuid PRIMARY KEY,
  firstname text,
  lastname text,
  teams map<int, text>
);
CREATE INDEX IF NOT EXISTS team_year_keys_idx
ON cycling.cyclist_teams ( KEYS (teams) );
CREATE INDEX IF NOT EXISTS team_year_entries_idx
ON cycling.cyclist_teams ( ENTRIES (teams) );
CREATE INDEX IF NOT EXISTS team_year_values_idx
ON cycling.cyclist_teams ( VALUES (teams) );

Indexes created on the map column teams in the cyclist_career_teams table target the keys, values, and full entries of the column data.

Query with KEYS from the map teams column:

CQL
SELECT * FROM cyclist_teams WHERE teams CONTAINS KEY 2014;
Result
 id                                   | firstname | lastname   | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |        VOS |                                                                                          {2014: 'Rabobank-Liv Woman Cycling Team', 2015: 'Rabobank-Liv Woman Cycling Team'}

Query a value from the map teams column, noting that only the keyword CONTAINS is included:

CQL
SELECT * FROM cyclist_teams WHERE teams CONTAINS 'Team Garmin - Cervelo';
Result
 id                                   | firstname | lastname   | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}

Query entries from the map teams column, noting the difference in the WHERE clause:

CQL
SELECT * FROM cyclist_teams 
WHERE 
    teams[2014] = 'Boels:Dolmans Cycling Team' 
    AND teams[2015] = 'Boels:Dolmans Cycling Team';
Result
 id                                   | firstname | lastname   | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}

This example looks for a row where two entries are present in the map teams column.

Frozen map indexing

Starting in Cassandra 6.0 (CASSANDRA-18492), SAI supports KEYS, VALUES, and ENTRIES indexing on frozen maps. This enables element-level queries on frozen map columns using the same query syntax available for non-frozen maps. FULL indexing continues to work for exact equality on the complete frozen map.

-- Table with a frozen map column
CREATE TABLE ks.cyclist_teams (
    id uuid PRIMARY KEY,
    firstname text,
    lastname text,
    teams frozen<map<int, text>>
);

-- Create element-level indexes on the frozen map column (Cassandra 6.0+, SAI only)
CREATE INDEX ON ks.cyclist_teams (KEYS(teams)) USING 'sai';
CREATE INDEX ON ks.cyclist_teams (VALUES(teams)) USING 'sai';
CREATE INDEX ON ks.cyclist_teams (ENTRIES(teams)) USING 'sai';

-- Query by key: find cyclists associated with a specific year
SELECT * FROM ks.cyclist_teams WHERE teams CONTAINS KEY 2014;

-- Query by value: find cyclists who rode for a specific team
SELECT * FROM ks.cyclist_teams WHERE teams CONTAINS 'Team Garmin';

-- Query by entry: find cyclists who rode for a specific team in a specific year
SELECT * FROM ks.cyclist_teams WHERE teams[2014] = 'Team Garmin';

Element-level indexing (KEYS, VALUES, ENTRIES) on frozen maps is supported by SAI only. Non-SAI (legacy 2i) indexes require FULL indexing for frozen maps. Source: test/unit/org/apache/cassandra/index/sai/cql/types/collections/maps/MapKeysFrozenCollectionTest.java, MapValuesFrozenCollectionTest.java, MapEntriesFrozenCollectionTest.java, trunk.

ENTRIES indexing is not supported on frozen map columns that are clustering columns. Map entry predicates (column[key] = value) cannot be evaluated on clustering columns. Attempting to create an ENTRIES index on a frozen map clustering column will produce an error. Use FULL, KEYS, or VALUES instead for frozen map clustering columns. Source: src/java/org/apache/cassandra/cql3/statements/schema/CreateIndexStatement.java, trunk.

For more information, see: