Skip to content

Lakehouse Projections with the Cassandra Sidecar

Lakehouse Projections with the Cassandra Sidecar

Section titled “Lakehouse Projections with the Cassandra Sidecar”

This page explains how CQLite fits into a Parquet projection pipeline that is triggered by Cassandra SSTable flush events, co-located with the Apache Cassandra Sidecar.

The internal position document (docs/architecture/cassandra-sidecar-parquet-projections.md) is the authoritative source of truth for CQLite maintainers. This page is the user-facing distillation.

The read-to-Parquet pipeline is largely in place:

  1. CQLite reads externally-discovered SSTables via open_with_discovered_sstables() — the integration hook for “the Sidecar told me where a new SSTable is, read it.”
  2. Memory-bounded streaming reads via execute_streaming() keep peak memory predictable on large tables.
  3. Parquet output is available via --out parquet in the CLI, via db.export_parquet(...) / db.exportParquet(...) in the Python and Node bindings, and as an embeddable writer in cqlite-core behind the parquet cargo feature.

A minimal per-SSTable projection is already expressible as a one-shot CLI call:

Terminal window
cqlite --schema users.cql \
--data-dir /var/lib/cassandra/data/my_ks/users-<uuid>/ \
--query "SELECT * FROM my_ks.users" \
--out parquet -o /projections/users/nb-<gen>.parquet
┌─────────────────┐ flush ┌──────────────────────┐
│ Cassandra node │────────────▶│ data dir (SSTables) │
└─────────────────┘ new SSTable└──────────┬───────────┘
│ TOC.txt appears (file complete)
┌───────────▼───────────┐
│ Projection service │ (co-located w/ Sidecar)
│ - detect new SSTable │
│ - debounce + dedupe │
└───────────┬───────────┘
│ path + keyspace/table + schema
┌───────────▼───────────┐
│ CQLite │
│ open_with_discovered_ │
│ sstables() │
│ execute_streaming() │
│ StreamingParquetWriter│
└───────────┬───────────┘
┌───────────▼───────────┐
│ /projections/<ks>/ │
│ <table>/nb-<gen>.parquet
│ (ideally Iceberg/Delta)
└────────────────────────┘

The Sidecar does not expose a push stream of “memtable flushed” events. Options:

TriggerMechanismTrade-off
Filesystem watchinotify on the table data dir; key on TOC.txt (written last ⇒ SSTable complete)Simplest, reliable; per-host
Commitlog CDCCassandra CDC on the commitlogOrdered + carries timestamps; closest to correct CDC; more setup
Sidecar API pollingDiff SSTable component listings per tableEasy; latency = poll interval
Diagnostic events / JMXSSTable lifecycle notificationsJVM-side; reintroduces a cluster dependency

For a simple bulk projection, filesystem watch keyed on TOC.txt is the pragmatic default. For correctness-sensitive pipelines, commitlog CDC is the better source — see the delta semantics section below for why.

⚠ Delta semantics — this caveat is load-bearing

Section titled “⚠ Delta semantics — this caveat is load-bearing”

A flushed SSTable is a delta, not a table snapshot.

This is inherent to Cassandra’s storage model — not a CQLite defect. Four properties combine to make a naive Parquet-union of per-flush files silently wrong:

  1. Rows are partial / superseded. The same primary key can appear across many SSTables; the live value is the per-cell last-write-wins (LWW) merge of all of them. One flush’s Parquet reflects only that flush’s writes.

  2. Tombstones carry deletes. A flush can contain row, range, or cell tombstones and TTL expirations. The current projection has no representation of these, so a naive union of insert rows will resurrect deleted data.

  3. Reconciliation requires write-timestamps. LWW merge is driven by each cell’s writetime. A plain SELECT * drops it, so two flushes cannot be correctly merged downstream without it.

  4. Absent-vs-null is collapsed. CQLite’s SELECT * flattens Cassandra’s sparse cell-sets into a rectangular result; a cell not written in a particular flush is indistinguishable from a null.

Two coherent approaches — pick deliberately

Section titled “Two coherent approaches — pick deliberately”

CDC / append-log projection (recommended). Treat each flushed SSTable as an immutable event batch → one Parquet file per generation. Embrace that it is a delta. For correctness, carry writetime and represent tombstones (e.g. __writetime / __deleted columns, Debezium-style). Downstream (Spark/Trino/Iceberg) performs the merge. Idempotent; matches the immutable-SSTable grain.

Current-snapshot projection. To reflect current table state, point open_with_discovered_sstables() at all live SSTables for the table so CQLite’s read path performs the LWW merge, then export. Correct, but re-reads everything on each flush.

The type mapping is analytics-grade for schema-aware queries: query results carry the authoritative schema CqlType, and the Parquet writer maps it recursively to a faithful Arrow schema.

CQL typeArrow/ParquetFidelity
boolean, tinyint … bigint, float, doubleBoolean, Int8/16/32/64, Float32/64Clean
text / varchar / asciiUtf8Clean
blobBinaryClean
timestampTimestamp(ms, UTC)Clean
uuid / timeuuidFixedSizeBinary(16) + arrow.uuid extensionClean (UUID logical annotation)
dateDate32Clean
timeTime64(Nanosecond)Clean
decimalDecimal128(38, 9)Fixed column scale 9; checked rescale, overflow errors deterministically
varintDecimal128(38, 0)Values > 38 digits error deterministically
list<T> / set<T>List<T>Typed elements, recursive (sets export as List — Arrow has no set type)
map<K,V>Map<K, V>Typed keys (non-null) and values, recursive
tuple / UDTStructPositional field_N / schema field names
frozen<T>Same as TTransparent at every nesting level
inetUtf8 (canonical text)Deliberate: no standard Arrow inet type
durationUtf8 (CQL text form)parquet crate v53 cannot write Interval(MonthDayNano)

Collections, UDTs, and high-precision scalars all support columnar predicate pushdown in downstream consumers (Trino, Spark, DuckDB). The batch and streaming writers produce identical schemas and values. See Output Formats for the full table and notes.

The Parquet writer lives in cqlite-core/src/export/parquet.rs behind an off-by-default parquet feature flag. A projection service can call the writer directly as a library without a subprocess, and the Python (db.export_parquet(...)) and Node (db.exportParquet(...)) bindings export Parquet natively. CQLite produces Parquet files only; committing them to Iceberg/Delta remains the external committer’s job.

Fully reconcilable CDC-mode projections need a delta-scan API that streams per-cell writetime, TTL/expiry, and an operation discriminator covering all five Cassandra delete shapes (upsert, static upsert, row delete, range delete, partition delete), paired with a delta-aware Parquet writer and an export subcommand. This is not yet available. Until it lands, preserve writetime and represent tombstones yourself (see Recommendations) so a union of per-flush Parquet is reconcilable rather than silently wrong.

CQLite requires the CQL schema to decode an SSTable — it does not guess types. The projection service must source and cache the schema — flush events do not carry it. The practical approach is to pull it from DESCRIBE TABLE output and keep it in sync with schema changes.

TiDB’s TiFlash is instructive: it is a Raft learner that receives every committed write in real-time, providing a strongly consistent, transactionally-consistent columnar replica. TiCDC (the structural analog to this pipeline) tails the ordered row-change log and emits Kafka or cloud-storage changefeeds with explicit insert/update/delete events and commit timestamps.

Our approach’s distinct value is open, lake-native columnar output from Cassandra with no cluster dependency in the read path. The trade-off is that Cassandra has no ordered committed log — its source of truth is independently-flushed, LWW-merged SSTables — so any columnar projection is inherently a delta-reconciliation problem. CQLite already provides the type fidelity and the embeddable writer that projection needs; full delta-aware scanning (above) is the remaining piece.

  1. Model it as CDC/append-log, not snapshot. It matches the immutable-SSTable grain and is idempotent per generation.
  2. Land in a table format (Iceberg or Delta), not bare Parquet files. Those provide snapshot isolation, compaction, and merge-on-read that otherwise you must build by hand.
  3. Preserve writetime and represent tombstones (__writetime / __deleted columns) so projections are reconcilable. Without this, a union of per-flush Parquet is silently wrong.
  4. Prefer commitlog CDC over raw flush events when correctness matters.
  5. Embed the writer rather than shelling out — the Parquet writer lives in cqlite-core behind the parquet feature, and the Python/Node bindings expose it directly.