Data Pipeline Architect: A Design Guide to ETL/ELT, Batch and Stream Processing, Data Quality, and Scalability

data-pipeline-architect by 4444j99/a-i--skills

Install command

npx skills add https://github.com/4444j99/a-i--skills --skill data-pipeline-architect

Data Pipeline Architect

This skill provides guidance for designing robust, scalable data pipelines that move data reliably from sources to destinations.

Core Competencies

ETL vs ELT : Traditional Extract-Transform-Load vs modern Extract-Load-Transform patterns
Orchestration : Airflow, Dagster, Prefect for workflow management, with dbt for in-warehouse transformations
Data Quality : Validation, monitoring, lineage tracking
Scalability : Batch vs streaming, partitioning, parallelization

Pipeline Design Process

1. Requirements Analysis

Before designing a pipeline, gather:

Source systems and data formats (APIs, databases, files, streams)
Target destinations (data warehouse, lake, lakehouse)
Freshness requirements (real-time, hourly, daily)
Data volume and velocity estimates
Quality and compliance requirements

2. Architecture Selection

Batch Pipelines - For periodic bulk processing:

Schedule-driven (hourly, daily, weekly)
Higher latency tolerance
Simpler error recovery (re-run entire batch)
Tools: Airflow, dbt, Spark

Streaming Pipelines - For real-time requirements:

Event-driven processing
Sub-second to minute-level latency
Complex state management
Tools: Kafka, Flink, Spark Streaming

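Hybrid Approaches - Lambda or Kappa architecture:

Lambda: batch layer for completeness
Lambda: speed layer for low latency
Lambda: serving layer merges both views for queries
Kappa: a single streaming path, where reprocessing replays the event log instead of running a separate batch layer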

3. ETL vs ELT Decision

ETL (Transform before Load) :

When target has limited compute
When transformation reduces data volume significantly
When sensitive data must be masked before landing
Legacy data warehouse patterns

ELT (Transform after Load) :

Modern cloud warehouses with cheap compute
When raw data preservation is needed
When transformations change frequently
dbt-style transformations in warehouse
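
One of the ETL cases above is masking sensitive data before it lands. The following is a minimal sketch of that step, not the skill's own implementation. The field names, the `SENSITIVE_FIELDS` set, and the choice of a salted SHA-256 digest as the masking technique are all assumptions made for illustration.

```python
import hashlib

# Hypothetical set of sensitive field names; not taken from the original skill.
SENSITIVE_FIELDS = {"email"}


def mask_value(value: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest so the raw value never reaches the target."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_record(record: dict, salt: str) -> dict:
    """Copy a record, replacing non-null sensitive fields with their masked form."""
    return {
        key: mask_value(str(val), salt)
        if key in SENSITIVE_FIELDS and val is not None
        else val
        for key, val in record.items()
    }
```

Because the digest is deterministic for a given salt, masked values can still be used as join keys downstream. Whether hashing is strong enough depends on the compliance requirements gathered in step 1.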

4. Pipeline Components

Extraction Layer :

Full extraction vs incremental (CDC, timestamp-based)
API pagination and rate limiting
Connection pooling and retry logic
Schema detection and drift handling
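
Schema drift handling usually starts with detecting the drift. As a rough sketch, assuming schemas are available as plain `{column: type_name}` mappings (the `SchemaDrift` class and `detect_drift` function are hypothetical names, not part of any library):

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    added: set = field(default_factory=set)      # columns in the source but not expected
    removed: set = field(default_factory=set)    # expected columns missing from the source
    retyped: dict = field(default_factory=dict)  # column -> (expected_type, observed_type)

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def detect_drift(expected: dict, observed: dict) -> SchemaDrift:
    """Compare two {column: type_name} mappings and report the differences."""
    drift = SchemaDrift()
    drift.added = set(observed) - set(expected)
    drift.removed = set(expected) - set(observed)
    for column in set(expected) & set(observed):
        if expected[column] != observed[column]:
            drift.retyped[column] = (expected[column], observed[column])
    return drift
```

A pipeline might tolerate `added` columns, and fail fast on `removed` or `retyped` ones, but that policy is a design choice rather than something the source prescribes.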

Transformation Layer :

Data cleansing and standardization
Business logic application
Aggregation and denormalization
Type casting and null handling
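
The cleansing, type casting, and null handling items above can be sketched as small pure functions. This is an illustrative example only. The row fields (`customer_id`, `country`, `amount`) and the default of `0.0` for unparseable amounts are assumptions, and input values are assumed to arrive as strings.

```python
from typing import Optional


def clean_string(value: Optional[str]) -> Optional[str]:
    """Trim whitespace and treat empty strings as missing."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def to_float(value: Optional[str], default: Optional[float] = None) -> Optional[float]:
    """Cast to float, returning `default` for missing or unparseable input."""
    cleaned = clean_string(value)
    if cleaned is None:
        return default
    try:
        return float(cleaned)
    except ValueError:
        return default


def transform_row(raw: dict) -> dict:
    """Standardize one raw row; the field names are hypothetical."""
    country = clean_string(raw.get("country"))
    return {
        "customer_id": clean_string(raw.get("customer_id")),
        "country": country.upper() if country else None,
        "amount": to_float(raw.get("amount"), default=0.0),
    }
```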

Loading Layer :

Upsert strategies (merge, delete+insert)
Partitioning schemes (time, hash, range)
Index management
Transaction boundaries
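
To make the three partitioning schemes concrete, here is a small sketch that assigns a partition by time, hash, and range. The function names and the daily granularity are assumptions. A SHA-256 digest is used instead of Python's built-in `hash()` because the built-in is randomized per process for strings and would not give stable partitions across runs.

```python
import bisect
import hashlib
from datetime import datetime


def time_partition(ts: datetime) -> str:
    """Time scheme: one partition per calendar day (granularity is an assumption)."""
    return ts.strftime("%Y-%m-%d")


def hash_partition(key: str, num_partitions: int) -> int:
    """Hash scheme: stable bucket in [0, num_partitions) derived from a digest."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def range_partition(value: float, boundaries: list) -> int:
    """Range scheme: index of the range containing value; boundaries must be sorted.

    With boundaries [100, 200] the ranges are (<100), [100, 200), and (>=200).
    """
    return bisect.bisect_right(boundaries, value)
```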

5. Error Handling Patterns

┌──────────────────────────────────────────────────────┐
│                  Pipeline Execution                  │
├──────────────────────────────────────────────────────┤
│  ┌───────────┐    ┌─────────────┐    ┌────────────┐  │
│  │  Extract  │───▶│  Transform  │───▶│    Load    │  │
│  └─────┬─────┘    └──────┬──────┘    └─────┬──────┘  │
│        │                 │                 │         │
│        ▼                 ▼                 ▼         │
│  ┌───────────┐    ┌─────────────┐    ┌────────────┐  │
│  │   Retry   │    │ Dead Letter │    │  Rollback  │  │
│  │ w/Backoff │    │    Queue    │    │ Checkpoint │  │
│  └───────────┘    └─────────────┘    └────────────┘  │
└──────────────────────────────────────────────────────┘

Retry with backoff : Transient failures (network, rate limits)
Dead letter queues : Poison messages that can't be processed
Checkpointing : Resume from last successful point
Idempotency : Safe to re-run without duplicates
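
The first two patterns can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: `TransientError` is a hypothetical marker exception, the dead letter queue is a plain list, and the backoff uses exponential growth with full jitter (one common variant; the source does not specify which).

```python
import random
import time
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying TransientError with exponential backoff and full jitter."""
    attempt = 1
    while True:
        try:
            return func()
        except TransientError:
            if attempt >= max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delay cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
            attempt += 1


def process_with_dlq(
    messages: Iterable, handler: Callable
) -> Tuple[List, List[dict]]:
    """Apply handler to each message; route non-transient failures to a dead letter list."""
    results, dead_letters = [], []
    for message in messages:
        try:
            results.append(handler(message))
        except TransientError:
            raise  # transient failures belong to the retry path, not the DLQ
        except Exception as exc:
            # Poison message: record it with the error and keep processing.
            dead_letters.append({"message": message, "error": repr(exc)})
    return results, dead_letters
```

The `sleep` parameter is injectable so the retry logic can be tested without waiting. In a real pipeline the dead letter list would typically be a durable queue or table that someone reviews.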

6. Data Quality Framework

Implement checks at each stage:

Stage	Check Type	Example
Extract	Completeness	Row count matches source
Extract	Freshness	Data timestamp within SLA
Transform	Validity	Values in expected ranges
Transform	Uniqueness	Primary keys unique
Load	Reconciliation	Target matches source totals
Load	Integrity	Foreign keys valid
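
Each row of the table maps to a small check. The sketch below shows one possible shape for them; all function names are hypothetical, and a real project might use a validation framework instead of hand-written checks.

```python
from datetime import datetime, timedelta
from typing import Hashable, Iterable, List, Optional, Set


def check_completeness(source_rows: int, extracted_rows: int) -> bool:
    """Extract / Completeness: extracted row count matches the source."""
    return source_rows == extracted_rows


def check_freshness(latest: datetime, sla: timedelta, now: datetime) -> bool:
    """Extract / Freshness: the newest record is no older than the SLA window."""
    return now - latest <= sla


def invalid_values(
    values: Iterable[Optional[float]], low: float, high: float
) -> List[Optional[float]]:
    """Transform / Validity: values that are missing or outside [low, high]."""
    return [v for v in values if v is None or not (low <= v <= high)]


def duplicate_keys(keys: Iterable[Hashable]) -> Set[Hashable]:
    """Transform / Uniqueness: keys that appear more than once."""
    seen, dupes = set(), set()
    for key in keys:
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


def check_reconciliation(
    source_total: float, target_total: float, tolerance: float = 0.0
) -> bool:
    """Load / Reconciliation: target total matches source within a tolerance."""
    return abs(source_total - target_total) <= tolerance


def orphaned_keys(foreign_keys: Iterable[Hashable], parent_keys: Iterable[Hashable]) -> Set[Hashable]:
    """Load / Integrity: foreign keys with no matching parent key."""
    return set(foreign_keys) - set(parent_keys)
```

The checks that return collections (rather than booleans) pass when the result is empty, which keeps the offending values available for error messages.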

7. Monitoring and Observability

Essential metrics to track:

Pipeline duration and trends
Row counts at each stage
Error rates and types
Data freshness (time since last successful run)
Resource utilization

Alert on:

SLA breaches (data not fresh)
Anomalous row counts (deviating more than ±20% from baseline)
Schema changes in sources
Repeated failures
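
Three of these alert conditions reduce to simple predicates. A rough sketch follows; the ±20% tolerance comes from the list above, while the function names, the `"failed"` status string, and the threshold of three consecutive failures are assumptions.

```python
from datetime import datetime, timedelta
from typing import List


def row_count_is_anomalous(current: int, baseline: float, tolerance: float = 0.20) -> bool:
    """True when the row count deviates from baseline by more than the tolerance."""
    if baseline == 0:
        return current != 0  # any rows at all deviate from an empty baseline
    return abs(current - baseline) / baseline > tolerance


def sla_breached(last_success: datetime, max_age: timedelta, now: datetime) -> bool:
    """True when the last successful run is older than the freshness SLA."""
    return now - last_success > max_age


def repeated_failures(run_statuses: List[str], threshold: int = 3) -> bool:
    """True when the most recent `threshold` runs all failed (threshold is an assumption)."""
    recent = run_statuses[-threshold:]
    return len(recent) == threshold and all(status == "failed" for status in recent)
```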

Common Patterns

Slowly Changing Dimensions (SCD)

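Type 1 : Overwrite the old value (no history)
Type 2 : Add a new row with validity dates and close out the previous row
Type 3 : Add a column that holds the previous value
Type 4 : Keep current values in the main table and history in a separate history table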
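
Type 2 is the variant with the most moving parts, so a small sketch may help. It assumes an in-memory history list where each row is a dict with hypothetical `key`, `attrs`, `valid_from`, and `valid_to` fields, and where `valid_to` set to `None` marks the current row. In a warehouse the same logic would usually be a MERGE or an update plus insert.

```python
from datetime import date
from typing import Hashable, List


def apply_scd2(history: List[dict], key: Hashable, new_attrs: dict, effective: date) -> List[dict]:
    """Return a new history list with a Type 2 change applied for `key`."""
    result = [dict(row) for row in history]  # copy rows so the input is not mutated
    current = next(
        (row for row in result if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return result  # nothing changed: keep the current row open
        current["valid_to"] = effective  # close out the previous version
    # Open a new current row starting at the effective date.
    result.append(
        {"key": key, "attrs": dict(new_attrs), "valid_from": effective, "valid_to": None}
    )
    return result
```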

Incremental Processing

-- Timestamp-based incremental
SELECT * FROM source
WHERE updated_at > {{ last_run_timestamp }}

-- CDC-based (Change Data Capture)
-- Captures inserts, updates, deletes from transaction log
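
The timestamp-based query needs a watermark that is carried from one run to the next. Below is a minimal sketch using Python's built-in `sqlite3` module. The `source` table and `updated_at` column come from the query above; the `id` and `payload` columns are hypothetical, and timestamps are assumed to be ISO 8601 strings so that string comparison matches time order.

```python
import sqlite3
from typing import List, Tuple


def extract_incremental(
    conn: sqlite3.Connection, last_run_timestamp: str
) -> Tuple[List[tuple], str]:
    """Return rows changed since the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM source "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run_timestamp,),  # bound parameter instead of string templating
    ).fetchall()
    # Advance the watermark only when something was extracted.
    new_watermark = rows[-1][2] if rows else last_run_timestamp
    return rows, new_watermark
```

The new watermark should be persisted only after the load succeeds. Note that a strict `>` comparison can miss rows that are written later with the same timestamp as the watermark, which is one reason CDC is often preferred for high-churn sources.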

Idempotent Loads

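-- Delete + Insert pattern
-- Run both statements in one transaction so readers never see an empty partition
-- (transaction syntax varies by engine)
BEGIN;
DELETE FROM target WHERE date_partition = '2024-01-15';
INSERT INTO target SELECT * FROM staging WHERE date_partition = '2024-01-15';
COMMIT;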

-- Merge/Upsert pattern
MERGE INTO target t
USING staging s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
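
To show why the delete + insert pattern is idempotent, here is a small sketch using Python's built-in `sqlite3` module. The `target` and `staging` tables and the `date_partition` column come from the SQL above; the function name and the assumption that both tables share the same column layout are illustrative.

```python
import sqlite3


def load_partition(conn: sqlite3.Connection, date_partition: str) -> None:
    """Replace one partition of target from staging inside a single transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "DELETE FROM target WHERE date_partition = ?", (date_partition,)
        )
        conn.execute(
            "INSERT INTO target SELECT * FROM staging WHERE date_partition = ?",
            (date_partition,),
        )
```

Calling `load_partition` twice for the same partition leaves `target` in the same state as calling it once, because the second run deletes exactly what the first run inserted before inserting it again.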

References

references/orchestration-patterns.md - Airflow, Dagster, Prefect patterns
references/data-quality-checks.md - Validation frameworks and rules
references/pipeline-templates.md - Common pipeline architectures
