Python: Data Science and ETL
Python: Data Science and ETL
Section titled “Python: Data Science and ETL”The CQLite Python bindings (cqlite package, built with PyO3) let you read
Cassandra 5.0 SSTables directly in Python — no cluster, no JVM, no Cassandra driver.
This page covers the data-science and ETL use case: offline analytics on snapshots and backups, pandas integration, and memory-bounded streaming for large tables.
Installation
Section titled “Installation”pip install cqliteOr build from source for the latest development version:
git clone https://github.com/pmcfadin/cqlitecd cqlite/bindings/pythonpip install maturinmaturin develop # development build, editable installRequires Python 3.9+ and Rust 1.85+ (for source builds only).
The core pattern
Section titled “The core pattern”import cqlite
with cqlite.open("/path/to/sstables", schema="/path/to/schema.cql") as db: result = db.execute("SELECT * FROM my_keyspace.my_table LIMIT 100") for row in result.rows: print(row.to_dict())The schema parameter is a CQL file containing CREATE TABLE statements.
CQLite requires it to decode binary SSTable data — there is no guessing from bytes.
cqlite.open() works as a context manager; db.close() is idempotent and safe to
call from any thread.
Complete example: pandas analytics on a Cassandra backup
Section titled “Complete example: pandas analytics on a Cassandra backup”This example was run against the real CQLite test datasets and its output is shown below.
import cqliteimport pandas as pdfrom decimal import Decimal
DATASETS = "test-data/datasets/sstables"SCHEMA = "test-data/schemas/basic-types.cql"
with cqlite.open(DATASETS, schema=SCHEMA) as db: result = db.execute( "SELECT name, age, salary, account_balance, active " "FROM test_basic.simple_table LIMIT 10" )
# Build a DataFrame from the result rows rows = [row.to_dict() for row in result.rows] df = pd.DataFrame(rows)
# CQLite returns Decimal for decimal columns — convert to float for pandas math df["account_balance"] = df["account_balance"].apply(float)
print("DataFrame shape:", df.shape) print() print("Column dtypes:") print(df.dtypes) print() print("Summary statistics:") print(df[["age", "salary", "account_balance"]].describe().round(2)) print() active = df[df["active"] == True] print(f"Active users: {len(active)} of {len(df)}") if len(active) > 0: print(f" Average salary: {active['salary'].mean():.2f}")Actual output (run against test_basic.simple_table, 10 rows):
DataFrame shape: (10, 5)
Column dtypes:salary int64age int64name stractive boolaccount_balance float64dtype: object
Summary statistics: age salary account_balancecount 10.00 10.00 10.00mean 49.20 123326.50 54020.85std 20.43 45608.41 24617.66min 18.00 44179.00 2656.6925% 31.00 89334.50 44476.7250% 52.50 132994.50 62270.9275% 64.00 161093.50 71958.12max 79.00 175802.00 80199.79
Active users: 4 of 10 Average salary: 127350.50CQL → Python type mapping
Section titled “CQL → Python type mapping”| CQL type | Python type |
|---|---|
| boolean | bool |
| tinyint, smallint, int, bigint, counter | int |
| float, double | float |
| decimal, varint | decimal.Decimal |
| text, varchar, ascii | str |
| blob | bytes |
| timestamp | datetime.datetime (timezone-aware, UTC) |
| date | datetime.date |
| time | datetime.time |
| uuid, timeuuid | uuid.UUID |
| duration | datetime.timedelta |
| inet | ipaddress.IPv4Address or ipaddress.IPv6Address |
| list, set | list |
| map | dict |
| tuple | tuple |
| UDT | dict (field name → value) |
Streaming for large tables
Section titled “Streaming for large tables”For large tables where loading all rows into memory is impractical, use
db.execute_streaming(). It returns a lazy iterator; rows are decoded one at a
time and never buffered in bulk:
import cqlite
DATASETS = "test-data/datasets/sstables"SCHEMA = "test-data/schemas/basic-types.cql"
with cqlite.open(DATASETS, schema=SCHEMA) as db: count = 0 total_balance = 0.0
for row in db.execute_streaming( "SELECT name, age, account_balance FROM test_basic.simple_table" ): d = row.to_dict() total_balance += float(d["account_balance"]) count += 1
print(f"Processed {count} rows, total balance: {total_balance:.2f}")Actual output (run against test_basic.simple_table):
Processed 100 rows, total balance: 5482530.44The streaming iterator releases the Python GIL during I/O, so it is safe to use from multiple threads (each with its own iterator).
ETL pattern: snapshot to CSV / Parquet
Section titled “ETL pattern: snapshot to CSV / Parquet”import cqliteimport csvimport io
DATASETS = "/mnt/cassandra-backup/data"SCHEMA = "/etc/cassandra-schemas/my_keyspace.cql"OUTPUT = "/tmp/my_table_export.csv"
with cqlite.open(DATASETS, schema=SCHEMA) as db: result = db.execute("SELECT * FROM my_keyspace.my_table") columns = [c.name for c in result.columns]
with open(OUTPUT, "w", newline="") as f: writer = csv.DictWriter(f, fieldnames=columns) writer.writeheader() for row in result.rows: writer.writerow(row.to_dict())
print(f"Exported {len(result.rows)} rows to {OUTPUT}")For Parquet, export directly from the bindings — the query streams, so large tables export within bounded memory:
import cqlite
with cqlite.open(DATASETS, schema=SCHEMA) as db: rows = db.export_parquet( "SELECT * FROM my_keyspace.my_table", "/tmp/my_table_export.parquet", row_group_size=10000, # rows per Parquet row group compression="snappy", # "snappy" (default), "zstd", or "none" )
print(f"Exported {rows} rows")The output preserves nested and high-precision types — typed lists, maps, and structs — see Output Formats.
Notebook workflow
Section titled “Notebook workflow”In a Jupyter or Colab notebook:
# Cell 1: imports and setupimport cqliteimport pandas as pd
DATASETS = "/path/to/cassandra-snapshot"SCHEMA = "/path/to/schema.cql"
# Cell 2: load datawith cqlite.open(DATASETS, schema=SCHEMA) as db: result = db.execute("SELECT * FROM my_ks.events WHERE date = '2025-01-01'") df = pd.DataFrame([row.to_dict() for row in result.rows])
# Cell 3: exploredf.head()df.describe()df["event_type"].value_counts().plot(kind="bar")What you cannot do (yet)
Section titled “What you cannot do (yet)”- Write back to SSTables from Python. The write API is in the core library but not yet exposed in the Python bindings. Tracked in the write support roadmap.