Test Data Generation & Validation (pmcfadin/cqlite)
npx skills add https://github.com/pmcfadin/cqlite --skill 'Test Data Generation & Validation'
This skill provides guidance on generating real Cassandra 5.0 test data and validating parsing correctness.
CQLite uses real Cassandra 5.0 instances to generate test data, ensuring:
See dataset-generation.md for complete workflow details.
cd test-data
# 1. Start clean Cassandra 5 with schemas
./scripts/start-clean.sh
# 2. Generate data (N rows per table)
ROWS=1000 ./scripts/generate.sh
# 3. Export SSTables
./scripts/export.sh
# 4. Shutdown and clean volumes
./scripts/shutdown-clean.sh
Starts Cassandra 5.0 container and applies schemas.
What it does:
- Starts the cassandra-5-0 container via docker-compose
- Applies schemas from schemas/core.list

Environment variables:
- SCHEMA_SET=core - Use curated schema list (default)
- SCHEMA_SET=all - Use all *.cql files

Example:
# Use default core schemas
./scripts/start-clean.sh
# Use all schemas
SCHEMA_SET=all ./scripts/start-clean.sh
Generates test data using Python data generator.
What it does:
Environment variables:
- ROWS=N - Rows per table (default: varies by SCALE)
- TABLES=table1,table2 - Generate for specific tables only
- SCALE=SMALL|MEDIUM|LARGE - Preset sizes

Example:
# Generate 1000 rows per table
ROWS=1000 ./scripts/generate.sh
# Generate only for specific tables
TABLES=simple_table,collection_table ROWS=500 ./scripts/generate.sh
# Use LARGE scale preset
SCALE=LARGE ./scripts/generate.sh
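The interaction between ROWS and the SCALE presets can be sketched in Python (the preset values and helper name here are hypothetical; the real logic lives in the project's generator script and may differ):

```python
import os

# Hypothetical preset row counts; the generator's actual values may differ.
SCALE_PRESETS = {"SMALL": 100, "MEDIUM": 1_000, "LARGE": 10_000}

def resolve_row_count(env=os.environ):
    """An explicit ROWS value overrides the SCALE preset (default: SMALL)."""
    rows = env.get("ROWS")
    if rows is not None:
        return int(rows)
    return SCALE_PRESETS[env.get("SCALE", "SMALL").upper()]
```

Under this reading, `ROWS=500 SCALE=LARGE` would generate 500 rows per table, with SCALE still free to influence other preset dimensions such as value sizes.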
Exports SSTables from Cassandra data directory.
What it does:
- Copies SSTables from the Cassandra data directory to datasets/sstables/

Output structure:
test-data/datasets/
├── metadata.yml            # Generated by generate.sh
├── sstables/
│   ├── test_basic/
│   │   └── simple_table/
│   │       ├── *-Data.db
│   │       ├── *-Index.db
│   │       ├── *-Statistics.db
│   │       ├── *-Summary.db
│   │       └── *-TOC.txt
│   ├── test_collections/
│   └── test_timeseries/
Stops Cassandra and removes Docker volumes.
What it does:
Use when:
Schemas in test-data/schemas/:
- Simple table with all primitive types
- Collection types
- Time-series pattern
- Wide partition testing
Add your own:
# Create schema
echo "CREATE TABLE test_keyspace.my_table (...);" > schemas/my-schema.cql
# Add to core.list
echo "my-schema.cql" >> schemas/core.list
# Generate
./scripts/start-clean.sh
./scripts/generate.sh
See validation-workflow.md for complete validation process.
# 1. Generate sstabledump reference
sstabledump test-data/datasets/sstables/keyspace/table/*-Data.db \
> reference.json
# 2. Parse with cqlite
cargo run --bin cqlite -- \
--data-dir test-data/datasets/sstables/keyspace/table \
--schema test-data/schemas/schema.cql \
--out json > cqlite.json
# 3. Compare (ignoring formatting)
jq -S '.' reference.json > ref-sorted.json
jq -S '.' cqlite.json > cql-sorted.json
diff ref-sorted.json cql-sorted.json
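Since the two tools may emit rows in a different order, a byte-level diff can report spurious differences even after key-sorting with jq. An order-insensitive comparison can be sketched like this (assuming both dumps are top-level JSON arrays of row/partition objects; the file names match the ones produced above):

```python
import json

def same_rows(ref_path, out_path):
    """Compare two JSON dumps as multisets of canonicalised rows."""
    def canonical(path):
        with open(path) as f:
            data = json.load(f)
        # Serialise each row with sorted keys so neither dict key order
        # nor row order affects the comparison.
        return sorted(json.dumps(row, sort_keys=True) for row in data)
    return canonical(ref_path) == canonical(out_path)

# same_rows("reference.json", "cqlite.json") -> True on a clean parse
```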
Run validation script:
# Validate all test tables
cargo test --test sstable_validation
# Validate specific table
cargo test --test sstable_validation -- simple_table
Generate random data for property tests:
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_row_parsing_roundtrip(
        partition_key in any::<i32>(),
        text_value in "\\PC*", // Any valid unicode
        int_value in any::<i32>(),
    ) {
        // Generate test data in Cassandra (insert_test_row, flush_memtable,
        // and parse_sstable are project-specific test helpers)
        insert_test_row(partition_key, &text_value, int_value)?;
        flush_memtable()?;
        // Parse with cqlite
        let parsed = parse_sstable()?;
        // Validate roundtrip
        assert_eq!(parsed.get_int("partition_key"), partition_key);
        assert_eq!(parsed.get_text("text_col"), text_value);
        assert_eq!(parsed.get_int("int_col"), int_value);
    }
}
Package datasets for CI or distribution:
# Package current dataset
./scripts/package_datasets.sh
# Output: test-data/cqlite-test-data-v5.0-<date>.tar.gz
Contents:
Quick validation in CI:
# Use packaged dataset
tar xzf cqlite-test-data-v5.0.tar.gz
# Run core tests
./scripts/ci-one-shot-smoke.sh
# Validates:
# - Basic parsing
# - All CQL types
# - Compression
# - Collections
See test-data/scripts/CI_SMOKE_TEST_USAGE.md for details.
# 1. Add column to schema
echo "ALTER TABLE test_basic.simple_table ADD duration_col duration;" \
>> schemas/basic-types.cql
# 2. Regenerate data
./scripts/start-clean.sh
./scripts/generate.sh
./scripts/export.sh
# 3. Validate parsing
cargo test --test sstable_validation
# Generate with specific row size
ROWS=100 SCALE=LARGE ./scripts/generate.sh
# Validates:
# - Large text values (1MB+)
# - Large blob values
# - Large collections (1000+ elements)
# Modify generate_comprehensive_test_data.py
import uuid

def generate_edge_cases(session):
    # Null values: set only the partition key, leaving other columns null.
    # Note: the cassandra-driver uses %s placeholders for simple statements.
    session.execute("INSERT INTO table (pk) VALUES (%s)", [uuid.uuid4()])
    # Empty collections (Cassandra stores an empty collection as null)
    session.execute("INSERT INTO table (pk, tags) VALUES (%s, [])",
                    [uuid.uuid4()])
    # Empty strings
    session.execute("INSERT INTO table (pk, name) VALUES (%s, '')",
                    [uuid.uuid4()])
Supports Milestone M1 (Core Reading Library):
Supports All Milestones:
# Check logs
docker logs cassandra-5-0
# Common issue: Port 9042 in use
lsof -i :9042
# Kill process or change port in docker-compose-cassandra5.yml
# Check generator logs
cat test-data/logs/data_generation.log
# Verify schema applied
docker exec cassandra-5-0 cqlsh -e "DESCRIBE KEYSPACES;"
# Verify data exists in container
docker exec cassandra-5-0 ls -la /var/lib/cassandra/data/
# Check if flush happened
docker logs cassandra-5-0 | grep flush
Packaged datasets available at:
https://github.com/pmcfadin/cqlite/releases/tag/test-data-v5.0
Download for:
When creating new tests:
1. Design the schema in schemas/
2. Generate data with generate.sh
3. Export SSTables with export.sh

See documentation: