senior-data-engineer by alirezarezvani/claude-skills
npx skills add https://github.com/alirezarezvani/claude-skills --skill senior-data-engineer
Production-grade data engineering skill for building scalable, reliable data systems.
Activate this skill when a request involves pipeline design, architecture, data modeling, data quality, or performance.
# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
    --type airflow \
    --source postgres \
    --destination snowflake \
    --schedule "0 5 * * *"
# Validate data quality
python scripts/data_quality_validator.py validate \
    --input data/sales.parquet \
    --schema schemas/sales.json \
    --checks freshness,completeness,uniqueness
# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
    --query queries/daily_aggregation.sql \
    --engine spark \
    --recommend
→ See references/workflows.md for details
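To make the three check names passed to `--checks` concrete, here is a minimal pure-Python sketch of what freshness, completeness, and uniqueness checks could look like. It is not the implementation of `data_quality_validator.py`; the function names, the sample rows, and the 24-hour freshness threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_completeness(rows, required):
    """Return the fraction of rows in which every required field is non-null."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(f) is not None for f in required) for r in rows)
    return ok / len(rows)


def check_uniqueness(rows, key):
    """Return True when no two rows share the same value for `key`."""
    values = [r.get(key) for r in rows]
    return len(values) == len(set(values))


def check_freshness(rows, ts_field, max_age, now):
    """Return True when the newest timestamp is no older than `max_age`."""
    if not rows:
        return False
    newest = max(r[ts_field] for r in rows)
    return now - newest <= max_age


# Hypothetical sample rows standing in for a real sales dataset.
NOW = datetime(2026, 1, 20, 6, 0, tzinfo=timezone.utc)
SALES = [
    {"order_id": 1, "amount": 10.0, "loaded_at": NOW - timedelta(hours=2)},
    {"order_id": 2, "amount": None, "loaded_at": NOW - timedelta(hours=1)},
    {"order_id": 2, "amount": 7.5, "loaded_at": NOW - timedelta(hours=30)},
]

report = {
    "completeness": check_completeness(SALES, ["order_id", "amount"]),
    "uniqueness": check_uniqueness(SALES, "order_id"),
    "freshness": check_freshness(SALES, "loaded_at", timedelta(hours=24), NOW),
}
print(report)
```

In this sample, one row has a null `amount` and two rows share an `order_id`, so completeness is two thirds and uniqueness fails, while the newest row is recent enough to pass the freshness check.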
Use this framework to choose the right approach for your data pipeline.
| Criteria | Batch | Streaming |
|---|---|---|
| Latency requirement | Hours to days | Seconds to minutes |
| Data volume | Large historical datasets | Continuous event streams |
| Processing complexity | Complex transformations, ML | Simple aggregations, filtering |
| Cost sensitivity | More cost-effective | Higher infrastructure cost |
| Error handling | Easier to reprocess | Requires careful design |
Decision Tree:
Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
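The decision tree can also be written as a small function, which is handy for encoding the rule in a design checklist or a test. This is only a sketch: the function name `recommend_stack` and its parameters are hypothetical, and the returned strings simply repeat the leaves of the tree.

```python
def recommend_stack(realtime_required, exactly_once=False, daily_volume_tb=0.0):
    """Walk the batch-vs-streaming decision tree and return the suggested stack."""
    if realtime_required:
        # Streaming branch: the only remaining question is delivery semantics.
        if exactly_once:
            return "Kafka + Flink/Spark Structured Streaming"
        return "Kafka + consumer groups"
    # Batch branch: choose by daily data volume (threshold: more than 1 TB).
    if daily_volume_tb > 1.0:
        return "Spark/Databricks"
    return "dbt + warehouse compute"


print(recommend_stack(realtime_required=True, exactly_once=True))
print(recommend_stack(realtime_required=False, daily_volume_tb=0.2))
```

The first call follows the streaming branch with exactly-once semantics; the second follows the batch branch below the 1 TB threshold.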
| Aspect | Lambda | Kappa |
|---|---|---|
| Complexity | Two codebases (batch + stream) | Single codebase |
| Maintenance | Higher (sync batch/stream logic) | Lower |
| Reprocessing | Native batch layer | Replay from source |
| Use case | ML training + real-time serving | Pure event-driven |
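The "single codebase" and "replay from source" cells for Kappa can be illustrated with a toy example: one processing function serves both the live path and reprocessing, and reprocessing just means re-running that function over the retained log. This is a sketch under stated assumptions, with an in-memory list standing in for a durable log such as a Kafka topic; `apply_event`, `replay`, and the sample events are hypothetical.

```python
from collections import defaultdict


def apply_event(state, event):
    """Single processing path: add an order's amount to its customer's total."""
    state[event["customer"]] += event["amount"]
    return state


def replay(log, from_offset=0):
    """Rebuild state from scratch by re-running the same logic over the log."""
    state = defaultdict(float)
    for event in log[from_offset:]:
        apply_event(state, event)
    return dict(state)


# Hypothetical retained event log (an in-memory stand-in for a durable topic).
LOG = [
    {"customer": "a", "amount": 5.0},
    {"customer": "b", "amount": 3.0},
    {"customer": "a", "amount": 2.5},
]

# Live path: process events one at a time as they arrive.
live = defaultdict(float)
for e in LOG:
    apply_event(live, e)

# Reprocessing path: replay from offset 0 with the very same function.
rebuilt = replay(LOG)
print(rebuilt)
```

Because both paths call `apply_event`, there is no second batch implementation to keep in sync, which is the maintenance advantage the table attributes to Kappa. A Lambda design would instead carry a separate batch job computing the same totals.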
When to choose Lambda:
When to choose Kappa:
| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---|---|---|
| Best for | BI, SQL analytics | ML, unstructured data |
| Storage cost | Higher (proprietary format) | Lower (open formats) |
| Flexibility | Schema-on-write | Schema-on-read |
| Performance | Excellent for SQL | Good, improving |
| Ecosystem | Mature BI tools | Growing ML tooling |
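The "Flexibility" row contrasts schema-on-write with schema-on-read. The sketch below shows the difference in plain Python rather than in any warehouse or lakehouse API: the first function rejects bad records before they are stored, the second stores raw text and applies the schema only when reading. `SCHEMA`, both function names, and the sample records are assumptions for illustration.

```python
import json

# Hypothetical schema: field name -> Python type used to validate or coerce.
SCHEMA = {"order_id": int, "amount": float}


def write_validated(table, record):
    """Schema-on-write: reject records that do not match SCHEMA before storing."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"{field!r} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: accept any raw text, coerce to SCHEMA only when reading."""
    rows, rejected = [], []
    for line in raw_lines:
        try:
            data = json.loads(line)
            rows.append({f: cast(data[f]) for f, cast in SCHEMA.items()})
        except (ValueError, KeyError, TypeError):
            rejected.append(line)
    return rows, rejected


# Write-time enforcement: the second record never reaches storage.
warehouse = []
write_validated(warehouse, {"order_id": 1, "amount": 9.99})
try:
    write_validated(warehouse, {"order_id": "2", "amount": 4.0})
except ValueError as err:
    print("rejected at write time:", err)

# Read-time enforcement: everything is stored, problems surface on read.
lake = ['{"order_id": "2", "amount": "4.0"}', '{"order_id": 3}', "not json"]
rows, rejected = read_with_schema(lake)
print(rows, len(rejected))
```

The trade-off shown here is that write-time validation keeps stored data clean but refuses anything unexpected, while read-time interpretation accepts everything and pushes error handling onto each consumer.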
| Category | Technologies |
|---|---|
| Languages | Python, SQL, Scala |
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, Flink |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Storage | S3, GCS, Delta Lake, Iceberg |
| Warehouses | Snowflake, BigQuery, Redshift, Databricks |
| Quality | Great Expectations, dbt tests, Monte Carlo |
| Monitoring | Prometheus, Grafana, Datadog |
See references/data_pipeline_architecture.md for details.
See references/data_modeling_patterns.md for details.
See references/dataops_best_practices.md for details.
→ See references/troubleshooting.md for details
Weekly installs: 218
Repository: alirezarezvani/claude-skills
GitHub stars: 3.6K
First seen: Jan 20, 2026
Security audits: Gen Agent Trust Hub (Pass), Socket (Fail), Snyk (Pass)
Installed on: claude-code (178), opencode (170), gemini-cli (168), codex (154), github-copilot (141), cursor (139)