Senior Data Engineer Skill: Building Scalable Data Pipelines, ETL/ELT Processes, and Data Quality Frameworks

senior-data-engineer by alirezarezvani/claude-skills

222 weekly installs

6,500 GitHub Stars


Install command

npx skills add https://github.com/alirezarezvani/claude-skills --skill senior-data-engineer

Data Analysis · System Architecture · Data Processing

🇨🇳 Chinese Introduction

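Senior Data Engineer

Production-grade data engineering skill for building scalable, reliable data systems.

Trigger Phrases

Activate this skill when you see:

Pipeline Design:

"Design a data pipeline for..."
"Build an ETL/ELT process..."
"How should I ingest data from..."
"Set up data extraction from..."

Architecture:

"Should I use batch or streaming?"
"Lambda vs Kappa architecture"
"How to handle late-arriving data"
"Design a data lakehouse"

Data Modeling:

"Create a dimensional model..."
"Star schema vs snowflake schema"
"Implement slowly changing dimensions"
"Design a data vault"

Data Quality:

"Add data validation to..."
"Set up data quality checks"
"Monitor data freshness"
"Implement data contracts"

Performance:

"Optimize this Spark job"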


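Batch vs Streaming

Criteria	Batch	Streaming
Latency requirement	Hours to days	Seconds to minutes
Data volume	Large historical datasets	Continuous event streams
Processing complexity	Complex transformations, ML	Simple aggregations, filtering
Cost sensitivity	More cost-effective	Higher infrastructure cost
Error handling	Easier to reprocess	Requires careful design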

Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute

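Lambda vs Kappa Architecture

Aspect	Lambda	Kappa
Complexity	Two codebases (batch + stream)	Single codebase
Maintenance	Higher (batch and stream logic must be kept in sync)	Lower
Reprocessing	Native batch layer	Replay from source
Use case	ML training + real-time serving	Pure event-driven

When to choose Lambda:

Need to train ML models on historical data
Complex batch transformations not feasible in streaming
Existing batch infrastructure

When to choose Kappa:

Event-sourced architecture
All processing can be expressed as stream operations
Starting fresh without legacy systems

Data Warehouse vs Data Lakehouse

Feature	Warehouse (Snowflake/BigQuery)	Lakehouse (Delta/Iceberg)
Best for	BI, SQL analytics	ML, unstructured data
Storage cost	Higher (proprietary format)	Lower (open formats)
Flexibility	Schema-on-write	Schema-on-read
Performance	Excellent for SQL	Good, improving
Ecosystem	Mature BI tools	Growing ML tooling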

Category	Technologies
Languages	Python, SQL, Scala
Orchestration	Airflow, Prefect, Dagster
Transformation	dbt, Spark, Flink
Streaming	Kafka, Kinesis, Pub/Sub
Storage	S3, GCS, Delta Lake, Iceberg
Warehouses	Snowflake, BigQuery, Redshift, Databricks
Quality	Great Expectations, dbt tests, Monte Carlo
Monitoring	Prometheus, Grafana, Datadog

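1. Data Pipeline Architecture

See references/data_pipeline_architecture.md for:

Lambda vs Kappa architecture patterns
Batch processing with Spark and Airflow
Stream processing with Kafka and Flink
Exactly-once semantics implementation
Error handling and dead letter queues

2. Data Modeling Patterns

See references/data_modeling_patterns.md for:

Dimensional modeling (Star/Snowflake)
Slowly Changing Dimensions (SCD Types 1-6)
Data Vault modeling
dbt best practices
Partitioning and clustering

3. DataOps Best Practices

See references/dataops_best_practices.md for:

Data testing frameworks
Data contracts and schema validation
CI/CD for data pipelines
Observability and lineage
Incident response

→ See references/troubleshooting.md for details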

🇺🇸English

Senior Data Engineer

Production-grade data engineering skill for building scalable, reliable data systems.

Trigger Phrases
Quick Start
Workflows
- Building a Batch ETL Pipeline
- Implementing Real-Time Streaming
- Data Quality Framework Setup
Architecture Decision Framework
Tech Stack
Reference Documentation
Troubleshooting

Trigger Phrases

Activate this skill when you see:

Pipeline Design:

"Design a data pipeline for..."
"Build an ETL/ELT process..."
"How should I ingest data from..."
"Set up data extraction from..."

Architecture:

"Should I use batch or streaming?"
"Lambda vs Kappa architecture"
"How to handle late-arriving data"
"Design a data lakehouse"

Data Modeling:

"Create a dimensional model..."
"Star schema vs snowflake"
"Implement slowly changing dimensions"
"Design a data vault"

Data Quality:

"Add data validation to..."
"Set up data quality checks"
"Monitor data freshness"
"Implement data contracts"

Performance:

"Optimize this Spark job"
"Query is running slow"
"Reduce pipeline execution time"
"Tune Airflow DAG"

Quick Start

Core Tools

# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend
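
The `--checks freshness,completeness,uniqueness` flag above names three common data quality checks. The sketch below shows one plausible meaning for each, using only the Python standard library. It is not the implementation inside `scripts/data_quality_validator.py`; the function names, the sample rows, and the 6-hour freshness threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_completeness(rows, required_fields):
    """Return the fraction of rows where every required field is present and not None."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows
        if all(row.get(field) is not None for field in required_fields)
    )
    return complete / len(rows)


def check_uniqueness(rows, key):
    """Return True when no two rows share the same value for `key`."""
    values = [row.get(key) for row in rows]
    return len(values) == len(set(values))


def check_freshness(rows, timestamp_field, max_age, now=None):
    """Return True when the newest timestamp is no older than `max_age`."""
    if not rows:
        return False
    now = now or datetime.now(timezone.utc)
    newest = max(row[timestamp_field] for row in rows)
    return now - newest <= max_age


# Hypothetical sample data: one missing amount, one duplicated order_id.
now = datetime(2026, 1, 20, 12, 0, tzinfo=timezone.utc)
sales = [
    {"order_id": 1, "amount": 19.99, "loaded_at": now - timedelta(hours=2)},
    {"order_id": 2, "amount": None, "loaded_at": now - timedelta(hours=1)},
    {"order_id": 2, "amount": 5.00, "loaded_at": now - timedelta(minutes=30)},
]

report = {
    "completeness": check_completeness(sales, ["order_id", "amount"]),
    "uniqueness": check_uniqueness(sales, "order_id"),
    "freshness": check_freshness(sales, "loaded_at", timedelta(hours=6), now=now),
}
print(report)
```

In a real pipeline these checks would more likely be expressed in one of the tools listed in the Tech Stack table (Great Expectations or dbt tests) than hand-written.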

Workflows

→ See references/workflows.md for details

Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

Batch vs Streaming

Criteria	Batch	Streaming
Latency requirement	Hours to days	Seconds to minutes
Data volume	Large historical datasets	Continuous event streams
Processing complexity	Complex transformations, ML	Simple aggregations, filtering
Cost sensitivity	More cost-effective	Higher infrastructure cost
Error handling	Easier to reprocess	Requires careful design

Decision Tree:

Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
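
The decision tree can also be written as a small function, which makes the branch order explicit. This is a minimal sketch: the function name `recommend_pipeline` and its parameter names are hypothetical, and the returned strings simply repeat the leaves of the tree above.

```python
def recommend_pipeline(realtime_required, exactly_once=False, daily_volume_tb=0.0):
    """Walk the batch-vs-streaming decision tree and return the suggested stack."""
    if realtime_required:
        # Streaming branch: the only remaining question is delivery semantics.
        if exactly_once:
            return "streaming: Kafka + Flink/Spark Structured Streaming"
        return "streaming: Kafka + consumer groups"
    # Batch branch: the 1 TB/day threshold comes from the tree above.
    if daily_volume_tb > 1:
        return "batch: Spark/Databricks"
    return "batch: dbt + warehouse compute"


print(recommend_pipeline(realtime_required=True, exactly_once=True))
print(recommend_pipeline(realtime_required=False, daily_volume_tb=0.2))
```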

Lambda vs Kappa Architecture

Aspect	Lambda	Kappa
Complexity	Two codebases (batch + stream)	Single codebase
Maintenance	Higher (sync batch/stream logic)	Lower
Reprocessing	Native batch layer	Replay from source
Use case	ML training + real-time serving	Pure event-driven

When to choose Lambda:

Need to train ML models on historical data
Complex batch transformations not feasible in streaming
Existing batch infrastructure

When to choose Kappa:

Event-sourced architecture
All processing can be expressed as stream operations
Starting fresh without legacy systems
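
To illustrate the "single codebase" and "replay from source" rows of the comparison, here is a minimal in-memory sketch of the Kappa idea: one processing function handles live events, and reprocessing means running the same function over the retained event log. The names `apply_event`, `rebuild_state`, and the sample events are hypothetical; a real system would replay from a durable log such as a Kafka topic rather than a Python list.

```python
from collections import defaultdict


def apply_event(state, event):
    """The only processing logic: used for live and replayed events alike."""
    state[event["user_id"]] += event["amount"]
    return state


def rebuild_state(event_log):
    """Reprocess by replaying the full log through the same function."""
    state = defaultdict(float)
    for event in event_log:
        apply_event(state, event)
    return state


# Hypothetical retained event log.
event_log = [
    {"user_id": "a", "amount": 10.0},
    {"user_id": "b", "amount": 5.0},
    {"user_id": "a", "amount": 2.5},
]

# "Live" processing: events are applied one at a time as they arrive.
live_state = defaultdict(float)
for event in event_log:
    apply_event(live_state, event)

# Reprocessing: replay the log from the start with the same code path.
replayed_state = rebuild_state(event_log)
print(dict(replayed_state))
```

In a Lambda architecture the replay step would instead be a separate batch job, which is where the second codebase and the synchronization cost come from.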

Data Warehouse vs Data Lakehouse

Feature	Warehouse (Snowflake/BigQuery)	Lakehouse (Delta/Iceberg)
Best for	BI, SQL analytics	ML, unstructured data
Storage cost	Higher (proprietary format)	Lower (open formats)
Flexibility	Schema-on-write	Schema-on-read
Performance	Excellent for SQL	Good, improving
Ecosystem	Mature BI tools	Growing ML tooling

Tech Stack

Category	Technologies
Languages	Python, SQL, Scala
Orchestration	Airflow, Prefect, Dagster
Transformation	dbt, Spark, Flink
Streaming	Kafka, Kinesis, Pub/Sub
Storage	S3, GCS, Delta Lake, Iceberg
Warehouses	Snowflake, BigQuery, Redshift, Databricks
Quality	Great Expectations, dbt tests, Monte Carlo
Monitoring	Prometheus, Grafana, Datadog

Reference Documentation

1. Data Pipeline Architecture

See references/data_pipeline_architecture.md for:

Lambda vs Kappa architecture patterns
Batch processing with Spark and Airflow
Stream processing with Kafka and Flink
Exactly-once semantics implementation
Error handling and dead letter queues
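
As a rough illustration of two topics from this list, the sketch below combines a dead letter queue with ID-based deduplication. It is not taken from references/data_pipeline_architecture.md. The names `process_batch`, `parse_amount`, and the message shape are hypothetical, and deduplicating by message ID only approximates exactly-once results on top of at-least-once delivery; it is not the transactional mechanism that Kafka or Flink provide.

```python
def process_batch(messages, handler, seen_ids, dead_letters):
    """Process each message at most once; route failures to a dead letter list."""
    results = []
    for message in messages:
        if message["id"] in seen_ids:
            continue  # duplicate delivery: already handled, skip it
        try:
            results.append(handler(message))
        except Exception as exc:
            # Keep the failing message and the error so it can be inspected and retried later.
            dead_letters.append({"message": message, "error": repr(exc)})
        # Mark as handled either way so a redelivery is not processed or dead-lettered twice.
        seen_ids.add(message["id"])
    return results


def parse_amount(message):
    """Hypothetical handler: raises ValueError on a malformed payload."""
    return float(message["payload"])


seen_ids = set()
dead_letters = []
messages = [
    {"id": 1, "payload": "10.5"},
    {"id": 2, "payload": "oops"},   # malformed, goes to the dead letter list
    {"id": 1, "payload": "10.5"},   # redelivered duplicate, skipped
    {"id": 3, "payload": "4"},
]

results = process_batch(messages, parse_amount, seen_ids, dead_letters)
print(results)
print([entry["message"]["id"] for entry in dead_letters])
```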

2. Data Modeling Patterns

See references/data_modeling_patterns.md for:

Dimensional modeling (Star/Snowflake)
Slowly Changing Dimensions (SCD Types 1-6)
Data Vault modeling
dbt best practices
Partitioning and clustering
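
For readers unfamiliar with slowly changing dimensions, the following sketch shows the widely used Type 2 pattern: when an attribute changes, the current row is closed and a new current row is appended, so history is preserved. It is an in-memory illustration, not content from references/data_modeling_patterns.md. The column names (`valid_from`, `valid_to`, `is_current`) and the function `apply_scd2` are assumptions; in a warehouse this would typically be a MERGE statement or a dbt snapshot.

```python
from datetime import date


def apply_scd2(dimension_rows, key, incoming, effective_date):
    """Close the current row for `key` if its attributes changed, then append a new current row."""
    current = next(
        (row for row in dimension_rows if row["key"] == key and row["is_current"]),
        None,
    )
    if current is not None:
        if current["attributes"] == incoming:
            return dimension_rows  # nothing changed, keep the current row open
        current["valid_to"] = effective_date
        current["is_current"] = False
    dimension_rows.append({
        "key": key,
        "attributes": dict(incoming),
        "valid_from": effective_date,
        "valid_to": None,
        "is_current": True,
    })
    return dimension_rows


# Hypothetical customer dimension: the customer moves city once.
customers = []
apply_scd2(customers, "c-1", {"city": "Berlin"}, date(2025, 1, 1))
apply_scd2(customers, "c-1", {"city": "Berlin"}, date(2025, 3, 1))  # no change, no new row
apply_scd2(customers, "c-1", {"city": "Munich"}, date(2025, 6, 1))

for row in customers:
    print(row)
```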

3. DataOps Best Practices

See references/dataops_best_practices.md for:

Data testing frameworks
Data contracts and schema validation
CI/CD for data pipelines
Observability and lineage
Incident response
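
A data contract can be thought of as an agreed schema that producers and consumers both check against. The sketch below is a deliberately small, standard-library version of schema validation; it is not drawn from references/dataops_best_practices.md, and the `CONTRACT` fields and the `validate_record` function are hypothetical. Production setups usually rely on a schema registry, JSON Schema, or the quality tools listed in the Tech Stack table.

```python
# Hypothetical contract: field name -> expected Python type.
CONTRACT = {"order_id": int, "amount": float, "currency": str}


def validate_record(record, contract):
    """Return a list of contract violations for one record (empty list means valid)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    for field in record:
        if field not in contract:
            errors.append(f"unexpected field: {field}")
    return errors


good = {"order_id": 1, "amount": 19.99, "currency": "EUR"}
bad = {"order_id": "1", "amount": 19.99, "note": "gift"}

print(validate_record(good, CONTRACT))
print(validate_record(bad, CONTRACT))
```

Running a check like this in CI against sample payloads is one way to catch a breaking schema change before it reaches downstream consumers.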

Troubleshooting

→ See references/troubleshooting.md for details

Weekly Installs

218

Repository

alirezarezvani/…e-skills

GitHub Stars

3.6K

First Seen

Jan 20, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Fail
Snyk: Pass

Installed on

claude-code: 178

opencode: 170

gemini-cli: 168

codex: 154

github-copilot: 141

cursor: 139
