AI Application Patterns with Cassandra
Preview | Unofficial | For review only
Cassandra 6’s native VECTOR data type and SAI-powered ANN search make it a strong operational backend for AI applications.
This guide goes beyond basic vector search to cover real-world AI application architectures: RAG pipelines, agent memory, hybrid search, and framework integration.
For the foundational vector search reference — the VECTOR type, SAI index creation syntax, and ANN query mechanics — see Vector Search Overview.
Cassandra as a Vector Store
When Cassandra Is the Right Vector Store
Purpose-built vector databases are optimized for pure similarity search over large embedding collections. Cassandra is a better fit when vector search is one part of a broader application, not the only workload.
Choose Cassandra when:
- Single database for vectors and application data. Most AI applications need more than embeddings — they need the source documents, user records, session state, and audit logs that live alongside the vectors. Keeping everything in Cassandra eliminates the operational cost of synchronizing two stores.
- Operational maturity. Multi-DC replication, tunable consistency, zero-downtime schema changes, and streaming-compatible index files work the same for vector columns as for any other column.
- Query model integration. ANN queries compose with CQL scalar filters in a single statement. There is no application-layer join between a vector store and a relational database.
Choose a dedicated vector database when:
- Similarity search is the primary workload and the rest of the application data lives elsewhere.
- You need engine-specific ANN tuning as the main optimization surface.
- You do not need Cassandra’s replication, data-modeling, or mixed-workload strengths.
RAG Architecture with Cassandra
Retrieval-Augmented Generation (RAG) is the most common AI use case for Cassandra’s vector capabilities. The architecture stores document chunks alongside their embeddings, then retrieves relevant context at query time to ground an LLM’s answer.
Data Flow
Ingest path: Documents → Chunker → Embeddings API → Cassandra (chunks + vectors + metadata)
Query path: User Query → Embeddings API → ANN Search → Retrieved Chunks → LLM → Answer
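The query path can be sketched in a few lines of Python. The embed, search, and llm parameters here are hypothetical callables standing in for your embeddings client, a prepared ANN query against Cassandra, and the model call:

```python
def answer(question, embed, search, llm, k=5):
    """RAG query path: embed the question, retrieve top-k chunks, ground the LLM."""
    query_vector = embed(question)      # Embeddings API call
    chunks = search(query_vector, k)    # ANN search against Cassandra
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n---\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    return llm(prompt)                  # LLM completion call
```

The retrieval and generation steps stay decoupled, so any of the three callables can be swapped without touching the others.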
Schema for a RAG System
CREATE TABLE documents (
doc_id uuid,
chunk_id int,
content text,
embedding vector<float, 1536>,
metadata map<text, text>,
created_at timestamp,
PRIMARY KEY (doc_id, chunk_id)
);
-- ANN index for semantic retrieval
CREATE CUSTOM INDEX ON documents (embedding)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'similarity_function': 'cosine'};
-- SAI indexes on metadata for hybrid filtering:
-- ENTRIES supports metadata['key'] = 'value'; KEYS supports CONTAINS KEY
CREATE CUSTOM INDEX ON documents (ENTRIES(metadata))
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON documents (KEYS(metadata))
USING 'StorageAttachedIndex';
The metadata map stores arbitrary key/value pairs (source system, category, language, access tier) without requiring schema changes when new attributes are introduced.
Retrieval with Metadata Filtering
-- Retrieve the 5 most relevant chunks among rows that carry a 'category' tag
SELECT content, metadata
FROM documents
WHERE metadata CONTAINS KEY 'category'
ORDER BY embedding ANN OF ?
LIMIT 5;
-- Filter by a specific metadata value
SELECT content, doc_id, chunk_id
FROM documents
WHERE metadata['category'] = 'legal'
ORDER BY embedding ANN OF ?
LIMIT 10;
Pass the embedded form of the user’s question as the bind parameter to ANN OF ?. The query vector must come from the same embedding model used at ingest time.
Chunking Strategies
How you split source documents affects retrieval quality more than most other configuration choices:
- Fixed-size — split every N tokens with a small overlap. Simple to implement; may cut sentences mid-thought. Good baseline.
- Sentence-boundary — split on sentence endings using an NLP tokenizer. Preserves semantic units. Slightly more complex.
- Semantic — embed candidate splits and merge or split based on embedding similarity. Produces the most coherent chunks but requires an extra embedding pass during ingestion.
The chunk_id clustering column in the schema above preserves the original document order, which can be used to expand retrieved chunks with neighboring context before sending to the LLM.
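A minimal sketch of the fixed-size strategy, approximating tokens with whitespace-separated words (a production pipeline would use the embedding model’s own tokenizer):

```python
def chunk_fixed(text, size=200, overlap=40):
    """Split text into word windows of `size`, each sharing
    `overlap` words of context with the previous chunk."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # final window reached the end of the document
    return chunks
```

Each chunk would be stored as one row, with its position in the document as chunk_id.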
Embedding Model Selection
| Model | Dimensions | Provider | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | Good quality, widely used, low cost |
| embed-english-v3.0 | 1024 | Cohere | Strong multilingual variants available |
| all-MiniLM-L6-v2 | 384 | Local (HuggingFace) | No external API call; good for on-premises deployments |
| text-embedding-3-large | 3072 | OpenAI | Highest OpenAI quality; 2× storage and query cost vs. small |

Smaller dimensions mean faster ANN queries and less storage. Benchmark your specific corpus before committing to a high-dimension model — smaller models often reach comparable recall for domain-specific content.
Embedding Storage and Retrieval Patterns
Embeddings Alongside Structured Data
Resist the temptation to create a separate embeddings table. Storing the embedding in the same row as the source data avoids cross-table joins and simplifies consistency management.
-- Single table: structured data and embedding co-located
CREATE TABLE products (
product_id uuid PRIMARY KEY,
name text,
description text,
category text,
price decimal,
description_emb vector<float, 1536>
);
If the description changes, update both description and description_emb in a single UPDATE statement.
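For example, refreshing a product description and its embedding together (the bind values being the new text and its freshly generated vector):

```cql
UPDATE products
SET description = ?, description_emb = ?
WHERE product_id = ?;
```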
With Cassandra 6 Accord transactions, you can update multiple denormalized copies atomically.
Multi-Vector Patterns
A single entity may need embeddings for different semantic aspects. Store them as separate columns:
CREATE TABLE articles (
article_id uuid PRIMARY KEY,
title text,
body text,
summary text,
title_emb vector<float, 1536>, -- for title-based retrieval
body_emb vector<float, 1536>, -- for full-text semantic search
summary_emb vector<float, 768> -- smaller model for summary
);
CREATE CUSTOM INDEX ON articles (title_emb) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON articles (body_emb) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON articles (summary_emb) USING 'StorageAttachedIndex';
At query time, choose which embedding column to search against based on the user’s intent. Reciprocal rank fusion across multiple ANN results can combine signals from different embedding columns.
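Reciprocal rank fusion itself is small enough to sketch. Here result_lists are the ranked ID lists returned by the separate ANN queries, and k=60 is the conventional damping constant:

```python
def rrf_fuse(result_lists, k=60):
    """Score each id by the sum of 1 / (k + rank) over every
    ranked list it appears in; return ids by combined score."""
    scores = {}
    for results in result_lists:
        for rank, item_id in enumerate(results, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Items ranked well in several lists float to the top even if no single list ranks them first.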
Updating Embeddings
When source text changes, the embedding must be regenerated and stored. There is no trigger mechanism in Cassandra; the application is responsible for detecting changes and re-embedding.
Pattern: use a last_embedded_at timestamp column.
A background job queries WHERE last_embedded_at < last_modified_at (with appropriate SAI indexes) to find stale embeddings and re-embed them.
Hybrid Search: Vector + Scalar Filtering
Hybrid search combines ANN similarity ranking with scalar WHERE clause filters. SAI applies scalar filters before the ANN graph traversal, which narrows the candidate set and improves both result quality and latency.
Product Search with Filters
-- Prerequisite indexes
CREATE CUSTOM INDEX ON products (category) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON products (price) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON products (description_emb) USING 'StorageAttachedIndex';
-- Semantic search scoped to a category and price range
SELECT name, description, price
FROM products
WHERE category = 'electronics'
AND price < 500
ORDER BY description_emb ANN OF ?
LIMIT 10;
Performance Implications
- Scalar filters that are highly selective (few matching rows) dramatically speed up the ANN traversal. The query planner evaluates the scalar filter first and only runs ANN over the surviving rows.
- Scalar filters that match most of the table add little overhead but also provide little benefit.
- Always create SAI indexes on the columns used in WHERE clauses alongside vector columns — without them, Cassandra scans all rows that pass the partition filter before applying the scalar predicate.
Test the selectivity of your scalar filters in isolation before combining them with ANN queries.
Cassandra 6 OR Filtering
Cassandra 6 supports OR logic in SAI filter expressions, which allows multi-category or multi-tag hybrid queries:
SELECT name, price
FROM products
WHERE (category = 'electronics' OR category = 'computers')
AND price < 1000
ORDER BY description_emb ANN OF ?
LIMIT 20;
See SAI Usage Patterns for the full filter expression reference.
Agent Memory and Conversation History
AI agents require persistent memory to maintain context across turns and sessions. Cassandra’s time-ordered clustering columns and TTL support make it a natural fit.
Conversation History Schema
CREATE TABLE conversations (
session_id uuid,
message_id timeuuid,
role text, -- 'user', 'assistant', or 'system'
content text,
embedding vector<float, 1536>,
PRIMARY KEY (session_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
CREATE CUSTOM INDEX ON conversations (embedding)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'similarity_function': 'cosine'};
timeuuid as the clustering column provides natural time ordering and global uniqueness without a separate timestamp column.
CLUSTERING ORDER BY (message_id DESC) returns the most recent messages first when paginating by session.
Retrieving Recent Context (Sliding Window)
-- Fetch the last 20 messages for a session
SELECT role, content
FROM conversations
WHERE session_id = ?
LIMIT 20;
Semantic Memory Retrieval
For long sessions, a sliding window includes recent messages but loses older relevant context. Semantic memory retrieval finds past messages similar to the current query regardless of when they occurred:
-- Find the 5 most semantically relevant past messages in this session
-- (not necessarily the most recent)
SELECT role, content, message_id
FROM conversations
WHERE session_id = ?
ORDER BY embedding ANN OF ?
LIMIT 5;
Combine the sliding window (recent context) with semantic retrieval (relevant older context) to build an effective long-term memory layer.
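Combining the two retrievals might look like this, with messages as hypothetical (message_id, role, content) tuples and timeuuid ordering approximated by sortable ids:

```python
def build_context(recent, relevant):
    """Merge the sliding window with semantically relevant older
    messages, dedupe by message id, and return oldest-first."""
    seen = set()
    merged = []
    for msg in relevant + recent:
        if msg[0] not in seen:     # skip messages present in both retrievals
            seen.add(msg[0])
            merged.append(msg)
    return sorted(merged, key=lambda m: m[0])
```

The merged list can then be rendered into the LLM’s context window in chronological order.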
TTL for Automatic Expiration
-- Insert a message with a 30-day TTL
INSERT INTO conversations (session_id, message_id, role, content, embedding)
VALUES (?, ?, ?, ?, ?)
USING TTL 2592000; -- 30 days in seconds
Cassandra automatically removes expired rows at compaction time, keeping storage bounded without application-layer cleanup jobs.
TTL is set per-row at insert time. If you update a row, the new write’s TTL applies only to the columns being updated. Use a consistent TTL policy in your application layer to avoid partial row expiration.
Framework Integration
LangChain
LangChain provides a Cassandra vector store class (imported from langchain_community.vectorstores) that wraps Cassandra’s ANN capabilities behind the standard LangChain VectorStore interface.
Install the integration packages first:
pip install cassio langchain-community langchain-openai
from langchain_community.vectorstores import Cassandra
from langchain_openai import OpenAIEmbeddings
import cassio
# Initialize cassio with your cluster connection
cassio.init(contact_points=["127.0.0.1"], keyspace="my_keyspace")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create or connect to a Cassandra vector store
vectorstore = Cassandra(
embedding=embeddings,
table_name="documents",
session=None, # uses cassio session
keyspace=None, # uses cassio keyspace
)
# Add documents
vectorstore.add_texts(["chunk one", "chunk two"], metadatas=[{}, {}])
# Similarity search
results = vectorstore.similarity_search("my query", k=5)
LlamaIndex
LlamaIndex provides a CassandraVectorStore index that integrates with its retrieval and query engine pipeline.
from llama_index.vector_stores.cassandra import CassandraVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import cassio
# pip install cassio llama-index llama-index-vector-stores-cassandra
cassio.init(contact_points=["127.0.0.1"], keyspace="my_keyspace")
vector_store = CassandraVectorStore(
table="llamaindex_docs",
embedding_dimension=1536,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Build an index from documents
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
)
# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is Cassandra's consistency model?")
Refer to the LlamaIndex Cassandra documentation for the current API. The embedding dimension must match the model you configure in LlamaIndex.
Cassandra MCP Server
For AI agents that need to interact with Cassandra directly through tool calls, the Cassandra MCP (Model Context Protocol) server exposes CQL execution and schema inspection as callable tools. See Cassandra MCP Server for setup and usage.
Related Pages
- Vector Search Overview — VECTOR type, SAI index creation, ANN query syntax
- Data Modeling Overview — query-driven schema design principles
- SAI Usage Patterns — scalar filtering, hybrid queries, performance guidance
- Cassandra MCP Server — agent tool integration via the Model Context Protocol