Lakehouse Projections with the Cassandra Sidecar

This page explains how CQLite fits into a Parquet projection pipeline that is triggered by Cassandra SSTable flush events, co-located with the Apache Cassandra Sidecar.

The internal position document (docs/architecture/cassandra-sidecar-parquet-projections.md) is the authoritative source of truth for CQLite maintainers. This page is the user-facing distillation.

What is already built

The read-to-Parquet pipeline is largely in place:

CQLite reads externally-discovered SSTables via open_with_discovered_sstables() — the integration hook for “the Sidecar told me where a new SSTable is, read it.”
Memory-bounded streaming reads via execute_streaming() keep peak memory predictable on large tables.
Parquet output is available via --out parquet in the CLI, via db.export_parquet(...) / db.exportParquet(...) in the Python and Node bindings, and as an embeddable writer in cqlite-core behind the parquet cargo feature.

A minimal per-SSTable projection is already expressible as a one-shot CLI call:

cqlite --schema users.cql \
       --data-dir /var/lib/cassandra/data/my_ks/users-<uuid>/ \
       --query "SELECT * FROM my_ks.users" \
       --out parquet -o /projections/users/nb-<gen>.parquet

Architecture

┌─────────────────┐   flush     ┌──────────────────────┐
│  Cassandra node │────────────▶│  data dir (SSTables) │
└─────────────────┘  new SSTable└──────────┬───────────┘
                                            │ TOC.txt appears (file complete)
                                ┌───────────▼───────────┐
                                │  Projection service   │  (co-located w/ Sidecar)
                                │  - detect new SSTable  │
                                │  - debounce + dedupe   │
                                └───────────┬───────────┘
                                            │ path + keyspace/table + schema
                                ┌───────────▼───────────┐
                                │  CQLite                │
                                │  open_with_discovered_ │
                                │    sstables()          │
                                │  execute_streaming()   │
                                │  StreamingParquetWriter│
                                └───────────┬───────────┘
                                            │
                                ┌───────────▼───────────┐
                                │  /projections/<ks>/    │
                                │    <table>/nb-<gen>.parquet
                                │  (ideally Iceberg/Delta)
                                └────────────────────────┘

Trigger options

The Sidecar does not expose a push stream of “memtable flushed” events. Options:

Trigger	Mechanism	Trade-off
Filesystem watch	`inotify` on the table data dir; key on `TOC.txt` (written last ⇒ SSTable complete)	Simplest, reliable; per-host
Commitlog CDC	Cassandra CDC on the commitlog	Ordered + carries timestamps; closest to correct CDC; more setup
Sidecar API polling	Diff SSTable component listings per table	Easy; latency = poll interval
Diagnostic events / JMX	SSTable lifecycle notifications	JVM-side; reintroduces a cluster dependency

For a simple bulk projection, filesystem watch keyed on TOC.txt is the pragmatic default. For correctness-sensitive pipelines, commitlog CDC is the better source — see the delta semantics section below for why.

⚠ Delta semantics — this caveat is load-bearing

A flushed SSTable is a delta, not a table snapshot.

This is inherent to Cassandra’s storage model — not a CQLite defect. Four properties combine to make a naive Parquet-union of per-flush files silently wrong:

Rows are partial / superseded. The same primary key can appear across many SSTables; the live value is the per-cell last-write-wins (LWW) merge of all of them. One flush’s Parquet reflects only that flush’s writes.
Tombstones carry deletes. A flush can contain row, range, or cell tombstones and TTL expirations. The current projection has no representation of these, so a naive union of insert rows will resurrect deleted data.
Reconciliation requires write-timestamps. LWW merge is driven by each cell’s writetime. A plain SELECT * drops it, so two flushes cannot be correctly merged downstream without it.
Absent-vs-null is collapsed. CQLite’s SELECT * flattens Cassandra’s sparse cell-sets into a rectangular result; a cell not written in a particular flush is indistinguishable from a null.

Two coherent approaches — pick deliberately

CDC / append-log projection (recommended). Treat each flushed SSTable as an immutable event batch → one Parquet file per generation. Embrace that it is a delta. For correctness, carry writetime and represent tombstones (e.g. __writetime / __deleted columns, Debezium-style). Downstream (Spark/Trino/Iceberg) performs the merge. Idempotent; matches the immutable-SSTable grain.

Current-snapshot projection. To reflect current table state, point open_with_discovered_sstables() at all live SSTables for the table so CQLite’s read path performs the LWW merge, then export. Correct, but re-reads everything on each flush.

Type mapping fidelity

The type mapping is analytics-grade for schema-aware queries: query results carry the authoritative schema CqlType, and the Parquet writer maps it recursively to a faithful Arrow schema.

CQL type	Arrow/Parquet	Fidelity
boolean, tinyint … bigint, float, double	Boolean, Int8/16/32/64, Float32/64	Clean
text / varchar / ascii	Utf8	Clean
blob	Binary	Clean
timestamp	Timestamp(ms, UTC)	Clean
uuid / timeuuid	FixedSizeBinary(16) + `arrow.uuid` extension	Clean (UUID logical annotation)
date	`Date32`	Clean
time	`Time64(Nanosecond)`	Clean
decimal	`Decimal128(38, 9)`	Fixed column scale 9; checked rescale, overflow errors deterministically
varint	`Decimal128(38, 0)`	Values > 38 digits error deterministically
list<T> / set<T>	`List<T>`	Typed elements, recursive (sets export as List — Arrow has no set type)
map<K,V>	`Map<K, V>`	Typed keys (non-null) and values, recursive
tuple / UDT	`Struct`	Positional `field_N` / schema field names
frozen<T>	Same as T	Transparent at every nesting level
inet	Utf8 (canonical text)	Deliberate: no standard Arrow inet type
duration	Utf8 (CQL text form)	`parquet` crate v53 cannot write `Interval(MonthDayNano)`

Collections, UDTs, and high-precision scalars all support columnar predicate pushdown in downstream consumers (Trino, Spark, DuckDB). The batch and streaming writers produce identical schemas and values. See Output Formats for the full table and notes.

Embedding the Parquet writer

The Parquet writer lives in cqlite-core/src/export/parquet.rs behind an off-by-default parquet feature flag. A projection service can call the writer directly as a library without a subprocess, and the Python (db.export_parquet(...)) and Node (db.exportParquet(...)) bindings export Parquet natively. CQLite produces Parquet files only; committing them to Iceberg/Delta remains the external committer’s job.

Roadmap: delta-aware projections

Fully reconcilable CDC-mode projections need a delta-scan API that streams per-cell writetime, TTL/expiry, and an operation discriminator covering all five Cassandra delete shapes (upsert, static upsert, row delete, range delete, partition delete), paired with a delta-aware Parquet writer and an export subcommand. This is not yet available. Until it lands, preserve writetime and represent tombstones yourself (see Recommendations) so a union of per-flush Parquet is reconcilable rather than silently wrong.

Schema sourcing

CQLite requires the CQL schema to decode an SSTable — it does not guess types. The projection service must source and cache the schema — flush events do not carry it. The practical approach is to pull it from DESCRIBE TABLE output and keep it in sync with schema changes.

Comparison with TiDB’s approach

TiDB’s TiFlash is instructive: it is a Raft learner that receives every committed write in real-time, providing a strongly consistent, transactionally-consistent columnar replica. TiCDC (the structural analog to this pipeline) tails the ordered row-change log and emits Kafka or cloud-storage changefeeds with explicit insert/update/delete events and commit timestamps.

Our approach’s distinct value is open, lake-native columnar output from Cassandra with no cluster dependency in the read path. The trade-off is that Cassandra has no ordered committed log — its source of truth is independently-flushed, LWW-merged SSTables — so any columnar projection is inherently a delta-reconciliation problem. CQLite already provides the type fidelity and the embeddable writer that projection needs; full delta-aware scanning (above) is the remaining piece.

Recommendations

Model it as CDC/append-log, not snapshot. It matches the immutable-SSTable grain and is idempotent per generation.
Land in a table format (Iceberg or Delta), not bare Parquet files. Those provide snapshot isolation, compaction, and merge-on-read that otherwise you must build by hand.
Preserve writetime and represent tombstones (__writetime / __deleted columns) so projections are reconcilable. Without this, a union of per-flush Parquet is silently wrong.
Prefer commitlog CDC over raw flush events when correctness matters.
Embed the writer rather than shelling out — the Parquet writer lives in cqlite-core behind the parquet feature, and the Python/Node bindings expose it directly.