
Test Data

The git repository contains only JSONL reference files (sstabledump output, used for parity validation). Binary *-Data.db files are NOT in git — they are fetched from a GitHub release asset.

bash test-data/scripts/fetch-datasets.sh

This downloads the asset, verifies its SHA256, and extracts it to test-data/datasets/. It also removes macOS AppleDouble (._*) and .DS_Store files, which macOS tar embeds in archives and which break file-suffix scanners that match patterns such as *-Data.db.
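
The verify and cleanup steps can be sketched as below. This is a minimal illustration, not the contents of fetch-datasets.sh: the function names sha256_of, verify_sha256, and strip_macos_metadata are hypothetical, and the real script may structure these steps differently.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from fetch-datasets.sh.

# Print the SHA256 hex digest of a file, using whichever tool is installed.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Return 0 only if the file's digest matches the expected pin.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual=$(sha256_of "$file")
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file: expected $expected, got $actual" >&2
    return 1
  fi
}

# Delete AppleDouble (._*) and .DS_Store files under an extracted tree.
strip_macos_metadata() {
  find "$1" \( -name '._*' -o -name '.DS_Store' \) -type f -delete
}
```

Running the cleanup after extraction matters because a file such as ._nb-1-big-Data.db still ends in -Data.db and would otherwise be picked up by suffix-based scanners.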

The fetch script uses these defaults (override with environment variables):

DATASET_TAG=datasets-v3
DATASET_ASSET=cassandra5-small-full-v3.1.tar.gz
DATASET_SHA256=f5fa0b6599a27c1c493d7c6c063194d55d031cab417396947313e7245afc5ceb
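
One common way to make such defaults overridable is shell parameter expansion. The sketch below assumes that pattern; the source does not show how fetch-datasets.sh implements it, and load_dataset_pins is a hypothetical name.

```shell
# Assumed pattern: keep a caller-supplied value, otherwise use the pinned default.
load_dataset_pins() {
  DATASET_TAG="${DATASET_TAG:-datasets-v3}"
  DATASET_ASSET="${DATASET_ASSET:-cassandra5-small-full-v3.1.tar.gz}"
  DATASET_SHA256="${DATASET_SHA256:-f5fa0b6599a27c1c493d7c6c063194d55d031cab417396947313e7245afc5ceb}"
}

load_dataset_pins
echo "fetching $DATASET_ASSET from release $DATASET_TAG"
```

To point the script at a different asset, set the variables in the environment before invoking bash test-data/scripts/fetch-datasets.sh. Override DATASET_ASSET and DATASET_SHA256 together, since the checksum has to match the asset that is actually downloaded.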

CI reads the same pins from .github/workflows/sstabledump-parity-gate.yml:

DATASET_TAG: datasets-v3
DATASET_ASSET: cassandra5-small-full-v3.1.tar.gz
DATASET_SHA256: f5fa0b6599a27c1c493d7c6c063194d55d031cab417396947313e7245afc5ceb

Why SHA256 is the cache key: CI caches the extracted dataset by SHA256. Using the tag name as the cache key would allow a re-published asset (same tag, different content) to serve stale data. The SHA256 is content-addressed — cache hit means the exact bytes are correct.
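
The same idea can be shown outside CI with a local cache directory named after the digest. This is an illustrative sketch only: dataset_cache_dir and dataset_cache_hit are hypothetical helpers, and the actual cache is configured in the workflow file, not in a script like this.

```shell
# Hypothetical helpers for illustration; the real CI cache lives in the workflow file.

# Name the cache entry after the content hash, not the release tag.
dataset_cache_dir() {
  local cache_root="$1" sha256="$2"
  printf '%s/datasets-%s\n' "$cache_root" "$sha256"
}

# A hit means a directory for exactly this digest already exists.
dataset_cache_hit() {
  [ -d "$(dataset_cache_dir "$1" "$2")" ]
}
```

If an asset is re-published under the same tag, its digest changes, so the lookup resolves to a different directory and misses instead of returning the old extraction.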

Why v3.1 rather than v3: The v3 tarball was produced on macOS and contained AppleDouble entries. v3.1 was repackaged with --exclude='*/._*' to strip them. When inspecting or repackaging archives, use Python’s tarfile module (platform-neutral) rather than tar -tf (macOS bsdtar hides ._* on listing and re-embeds them on repack).

Set this environment variable to point at the extracted dataset directory:

export CQLITE_DATASETS_ROOT=$PWD/test-data/datasets

The gate script sets it automatically to $REPO_ROOT/test-data/datasets if not already set. Integration test runners require it; CI sets it in the workflow env.

Without it, tests that scan for SSTables will find no files and return results as if no data exists — not an error, just empty. This is the silent-pass failure mode the gate’s dataset preflight check was added to catch.

What each entry point does when the dataset is missing:

  • cargo test --package cqlite-core: unit tests pass; integration tests that scan CQLITE_DATASETS_ROOT return 0 rows and count as passing
  • scripts/agent-gate.sh: aborts with exit code 1 before running any component; you must fetch the data first
  • Smoke tests: fail immediately; the smoke script expects at least one Data.db per table

The gate’s dataset preflight check:

DATA_COUNT=$(find "$CQLITE_DATASETS_ROOT/sstables" -name "*-Data.db" 2>/dev/null | wc -l | tr -d ' ')
if [ "$DATA_COUNT" -eq 0 ]; then
  echo "agent-gate: no Data.db files under $CQLITE_DATASETS_ROOT/sstables" >&2
  echo "agent-gate: fetch them first: bash test-data/scripts/fetch-datasets.sh" >&2
  exit 1
fi
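
A suffix match like the one above also counts AppleDouble files such as ._nb-1-big-Data.db if they are present. The fetch script removes them, but a scanner that may see datasets extracted by other means could exclude them explicitly. The sketch below is one possible variant, not the gate's code; count_data_files is a hypothetical name.

```shell
# Count *-Data.db files under a directory, ignoring AppleDouble (._*) entries.
# Prints 0 if the directory does not exist.
count_data_files() {
  find "$1" -type f -name '*-Data.db' ! -name '._*' 2>/dev/null | wc -l | tr -d ' '
}

# Example: count_data_files "$CQLITE_DATASETS_ROOT/sstables"
```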

After extraction:

test-data/datasets/
├── metadata.yml
├── references.yml
└── sstables/
    ├── test_basic/
    │   └── simple_table-<hash>/
    │       ├── nb-1-big-Data.db        ← binary (not in git)
    │       ├── nb-1-big-Data.db.jsonl  ← sstabledump golden (in git)
    │       ├── nb-1-big-Index.db
    │       ├── nb-1-big-Statistics.db
    │       └── nb-1-big-TOC.txt
    ├── test_collections/
    ├── test_timeseries/
    └── test_wide_rows/
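
Given this layout, a quick consistency check is to confirm that each binary has its golden beside it. The sketch below assumes every Data.db is expected to have a sibling named <file>.jsonl, as in the tree above; whether that holds for every tier is not stated here, and missing_goldens is a hypothetical helper.

```shell
# Print each *-Data.db file that has no sibling "<name>.jsonl" golden.
# Prints nothing when every binary is paired.
missing_goldens() {
  local root="$1" data
  find "$root" -type f -name '*-Data.db' ! -name '._*' | sort |
    while IFS= read -r data; do
      [ -f "$data.jsonl" ] || printf '%s\n' "$data"
    done
}

# Example: missing_goldens "$CQLITE_DATASETS_ROOT/sstables"
```
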

| Tier | SSTable version | Format | Keyspaces | Tables |
|---|---|---|---|---|
| Primary | nb | big | test_basic, test_collections, test_timeseries, test_wide_rows | 33 |
| OA extended | oa | big | test_oa | 6 |
| BTI | da | bti | test_da | 3 (smoke SKIP-PENDING BTI parser) |

Current pass rate: 100% (33/33 nb tables as of Dec 2025). The da/BTI tables are excluded from the default smoke run pending full BTI parser support.

| Keyspace | Tables | Purpose |
|---|---|---|
| test_basic | 8 | Simple types |
| test_collections | 8 | Lists, sets, maps |
| test_timeseries | 9 | Time-series patterns |
| test_wide_rows | 8 | Wide partitions |

If you need to regenerate the test data (e.g., after schema changes):

# Full regeneration — three SSTable tiers, ~50 rows/table
bash test-data/scripts/regenerate-datasets.sh
# Custom row count
bash test-data/scripts/regenerate-datasets.sh --rows 200
# Dry-run
bash test-data/scripts/regenerate-datasets.sh --dry-run

Requires Docker. The script runs a cassandra:5.0.2 container through three phases to produce nb, oa, and da/BTI SSTables. After regeneration, package and publish:

bash test-data/scripts/package_datasets.sh
bash test-data/scripts/publish_datasets.sh

See .claude/skills/test-data-management/dataset-generation.md in the repo for the full workflow including the compose-stack interactive path.