Data Pipeline Guide: ETL Workflow Automation with n8n Templates

data-pipeline by claude-office-skills/skills

296 weekly installs

26 GitHub Stars

Install command

npx skills add https://github.com/claude-office-skills/skills --skill data-pipeline

Automation, Data Analysis, Data Processing



Data Pipeline

Build data pipelines and ETL workflows for data integration, transformation, and analytics automation. Based on n8n's data workflow templates.

Overview

This skill covers:

Data extraction from multiple sources
Transformation and cleaning
Loading to destinations
Scheduling and monitoring
Error handling and alerts

ETL Patterns

Basic ETL Flow

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   EXTRACT   │───▶│  TRANSFORM  │───▶│    LOAD     │
│             │    │             │    │             │
│ • APIs      │    │ • Clean     │    │ • Database  │
│ • Databases │    │ • Map       │    │ • Warehouse │
│ • Files     │    │ • Aggregate │    │ • Files     │
│ • Webhooks  │    │ • Enrich    │    │ • APIs      │
└─────────────┘    └─────────────┘    └─────────────┘
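
The three boxes above can be read as three functions chained together. The following is a minimal sketch under that assumption: `runPipeline` and the in-memory source and destination are illustrative names, not part of n8n or of this skill.

```javascript
// Minimal ETL skeleton: each stage is a plain function, so stages can be tested alone.
// runPipeline and the stage names are hypothetical, chosen for illustration only.
function runPipeline({ extract, transform, load }) {
  const raw = extract();          // EXTRACT: pull records from a source
  const clean = transform(raw);   // TRANSFORM: clean, map, aggregate, enrich
  const written = load(clean);    // LOAD: write to a destination, return the row count
  return { extracted: raw.length, transformed: clean.length, loaded: written };
}

// Usage with in-memory stand-ins for a real source and destination.
const destination = [];
const summary = runPipeline({
  extract: () => [{ id: 1, amount: '10.50' }, { id: 2, amount: 'n/a' }],
  transform: rows =>
    rows
      .map(r => ({ id: r.id, amount: parseFloat(r.amount) }))
      .filter(r => !Number.isNaN(r.amount)), // drop rows whose amount is not numeric
  load: rows => {
    destination.push(...rows);
    return rows.length;
  },
});
console.log(summary);
```

Returning per-stage counts from the runner gives the monitoring step a `rows_processed` figure without extra plumbing.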

n8n ETL Workflow

workflow: "Daily Sales ETL"
schedule: "2am daily"

nodes:
  # EXTRACT
  - name: "Extract from Shopify"
    type: shopify
    action: get_orders
    filter: created_at >= yesterday
    
  - name: "Extract from Stripe"
    type: stripe
    action: get_payments
    filter: created >= yesterday
    
  # TRANSFORM
  - name: "Merge Data"
    type: merge
    mode: combine_by_key
    key: order_id
    
  - name: "Transform"
    type: code
    code: |
      return items.map(item => ({
        date: item.created_at.split('T')[0],
        order_id: item.id,
        customer_email: item.email,
        total: parseFloat(item.total_price),
        currency: item.currency,
        items: item.line_items.length,
        source: item.source_name,
        // payment is missing when no Stripe record matched the order
        payment_status: item.payment?.status ?? null
      }));
      
  # LOAD
  - name: "Load to BigQuery"
    type: google_bigquery
    action: insert_rows
    table: sales_daily
    
  - name: "Update Google Sheets"
    type: google_sheets
    action: append_rows
    spreadsheet: "Daily Sales Report"
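
The "Merge Data" node combines the two extracts on `order_id`. One way that join could look in plain JavaScript is sketched below. It assumes each Stripe payment carries the Shopify order id in `metadata.order_id` (the field listed in the output example later in this document); `mergeByOrderId` is a hypothetical helper, not an n8n function.

```javascript
// Join Shopify orders with Stripe payments on the order id.
// Assumption: the Shopify order id is stored in payment.metadata.order_id.
function mergeByOrderId(orders, payments) {
  const paymentsByOrder = new Map();
  for (const p of payments) {
    const key = String(p.metadata?.order_id ?? '');
    // If several payments reference one order, the last one wins in this sketch.
    if (key) paymentsByOrder.set(key, p);
  }
  return orders.map(order => ({
    ...order,
    // null when nothing matched, so later steps can read payment?.status safely
    payment: paymentsByOrder.get(String(order.id)) ?? null,
  }));
}

const mergedOrders = mergeByOrderId(
  [{ id: 1001, email: 'a@example.com' }, { id: 1002, email: 'b@example.com' }],
  [{ id: 'pi_1', status: 'succeeded', metadata: { order_id: '1001' } }]
);
console.log(mergedOrders.map(o => [o.id, o.payment?.status ?? 'unmatched']));
```

Keeping unmatched orders (a left join) rather than dropping them makes missing payments visible to the quality checks.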

Data Sources

Common Extractors

extractors:
  databases:
    - postgresql:
        connection: connection_string
        query: "SELECT * FROM orders WHERE date >= $1"
        
    - mysql:
        connection: connection_string
        query: custom_sql
        
    - mongodb:
        connection: connection_string
        collection: orders
        filter: {date: {$gte: yesterday}}
        
  apis:
    - rest_api:
        url: "https://api.example.com/data"
        method: GET
        headers: {Authorization: "Bearer {token}"}
        pagination: handle_automatically
        
    - graphql:
        url: "https://api.example.com/graphql"
        query: graphql_query
        
  files:
    - csv:
        source: sftp/s3/google_drive
        delimiter: ","
        encoding: utf-8
        
    - excel:
        source: file_path
        sheet: "Sheet1"
        
    - json:
        source: api/file
        path: "data.items"
        
  saas:
    - salesforce: get_objects
    - hubspot: get_contacts/deals
    - stripe: get_charges
    - shopify: get_orders
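
Several extractors above filter on "yesterday". A minimal sketch of how that window might be computed in UTC follows. `yesterdayWindowUtc` is a hypothetical helper, and the second query parameter (`$2`, an upper bound) is an addition for illustration; the extractor config above only uses `$1`.

```javascript
// Compute the [start, end) window for "yesterday" in UTC.
// `now` is injectable so the function is deterministic in tests.
function yesterdayWindowUtc(now = new Date()) {
  // Midnight UTC at the start of the current day
  const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
  // UTC days are always 24 hours long (no daylight saving shifts)
  const start = new Date(end.getTime() - 24 * 60 * 60 * 1000);
  return { start, end };
}

const win = yesterdayWindowUtc(new Date('2024-03-10T02:00:00Z'));
// Parameters for a query such as:
//   SELECT * FROM orders WHERE date >= $1 AND date < $2
const queryParams = [win.start.toISOString(), win.end.toISOString()];
console.log(queryParams);
```

A half-open window (inclusive start, exclusive end) avoids double-counting records created exactly at midnight when the job runs on consecutive days.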

Transformations

Common Transformations

transformations:
  cleaning:
    - remove_nulls: drop_or_fill
    - trim_whitespace: all_string_fields
    - deduplicate: by_key
    - validate: against_schema
    
  mapping:
    - rename_fields: {old_name: new_name}
    - convert_types: {date_string: date}
    - map_values: {status_code: status_name}
    
  aggregation:
    - group_by: [date, category]
    - sum: [revenue, quantity]
    - count: orders
    - average: order_value
    
  enrichment:
    - lookup: from_reference_table
    - geocode: from_address
    - calculate: derived_fields
    
  filtering:
    - where: condition
    - limit: n_rows
    - sample: percentage
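
The cleaning entries above (`remove_nulls`, `trim_whitespace`, `deduplicate`) could be combined roughly as follows. This is a sketch, not the skill's implementation: `trimStrings` and `cleanRows` are hypothetical names, and "drop" is the chosen null strategy here (the config also allows "fill").

```javascript
// Trim every string field of a row; non-string values pass through unchanged.
function trimStrings(row) {
  return Object.fromEntries(
    Object.entries(row).map(([k, v]) => [k, typeof v === 'string' ? v.trim() : v])
  );
}

// Trim, drop rows whose key is missing, and keep the first row seen per key.
function cleanRows(rows, key) {
  const seen = new Set();
  const out = [];
  for (const raw of rows) {
    const row = trimStrings(raw);
    const id = row[key];
    if (id === null || id === undefined || id === '') continue; // remove_nulls: drop
    if (seen.has(id)) continue;                                  // deduplicate: by_key
    seen.add(id);
    out.push(row);
  }
  return out;
}

const cleaned = cleanRows(
  [
    { order_id: 'A1', email: '  a@example.com ' },
    { order_id: 'A1', email: 'dup@example.com' },
    { order_id: null, email: 'x@example.com' },
    { order_id: 'B2', email: 'b@example.com' },
  ],
  'order_id'
);
console.log(cleaned);
```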

Code Transform Examples

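// Clean and normalize data
// Example lookup table for the "Map values" step; the codes are placeholders.
const statusMap = { 1: 'active', 2: 'pending', 3: 'closed' };

function transform(items) {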
  return items.map(item => ({
    // Clean strings
    name: item.name?.trim().toLowerCase(),
    
    // Parse dates
    date: new Date(item.created_at).toISOString().split('T')[0],
    
    // Convert types
    amount: parseFloat(item.amount) || 0,
    
    // Map values
    status: statusMap[item.status_code] || 'unknown',
    
    // Calculate fields
    total: item.quantity * item.unit_price,
    
    // Filter nested
    tags: item.tags?.filter(t => t.active).map(t => t.name),
    
    // Default values
    source: item.source || 'direct'
  }));
}

// Aggregate data
function aggregate(items) {
  const grouped = {};
  
  items.forEach(item => {
    const key = `${item.date}_${item.category}`;
    if (!grouped[key]) {
      grouped[key] = {
        date: item.date,
        category: item.category,
        total_revenue: 0,
        order_count: 0
      };
    }
    grouped[key].total_revenue += item.amount;
    grouped[key].order_count += 1;
  });
  
  return Object.values(grouped);
}

Data Destinations

Common Loaders

loaders:
  data_warehouses:
    - bigquery:
        project: project_id
        dataset: analytics
        table: sales
        write_mode: append/truncate
        
    - snowflake:
        account: account_id
        warehouse: compute_wh
        database: analytics
        schema: public
        
    - redshift:
        cluster: cluster_id
        database: analytics
        
  databases:
    - postgresql:
        upsert: on_conflict_update
        
    - mysql:
        batch_insert: 1000_rows
        
  files:
    - s3:
        bucket: data-lake
        path: /processed/{date}/
        format: parquet
        
    - google_cloud_storage:
        bucket: data-bucket
        
  spreadsheets:
    - google_sheets:
        mode: append/overwrite
        
    - airtable:
        base: base_id
        table: table_name
        
  apis:
    - webhook:
        url: destination_url
        batch_size: 100
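
The `batch_insert: 1000_rows` and `batch_size: 100` settings above imply splitting the row set before loading. A minimal sketch of that split, with `toBatches` as a hypothetical helper name:

```javascript
// Split rows into fixed-size batches, e.g. 1000 rows per INSERT or 100 per webhook call.
function toBatches(rows, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize)); // the last batch may be shorter
  }
  return batches;
}

const sampleRows = Array.from({ length: 250 }, (_, i) => ({ id: i + 1 }));
const batches = toBatches(sampleRows, 100);
console.log(batches.map(b => b.length));
```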

Scheduling & Monitoring

Pipeline Scheduling

scheduling:
  patterns:
    hourly:
      cron: "0 * * * *"
      use_for: real_time_dashboards
      
    daily:
      cron: "0 2 * * *"
      use_for: daily_reports
      
    weekly:
      cron: "0 3 * * 1"
      use_for: weekly_summaries
      
    on_demand:
      trigger: webhook/manual
      use_for: ad_hoc_analysis
      
  dependencies:
    - pipeline_a: must_complete_before pipeline_b
    - wait_for: all_extracts_complete
    
  retries:
    max_attempts: 3
    delay: exponential_backoff
    alert_on: final_failure
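
The `retries` block above (three attempts, exponential backoff, alert on final failure) might be implemented along these lines. This is a sketch under assumptions: the 1-second base delay, the 60-second cap, and the names `backoffDelays`, `withRetries`, `sleep`, and `onFinalFailure` are all illustrative and do not come from n8n.

```javascript
// Delay before each retry: baseMs * 2^(retry - 1), capped at maxMs.
// Attempt 1 runs immediately, so maxAttempts attempts need maxAttempts - 1 delays.
function backoffDelays(maxAttempts, baseMs = 1000, maxMs = 60000) {
  const delays = [];
  for (let retry = 1; retry < maxAttempts; retry++) {
    delays.push(Math.min(baseMs * 2 ** (retry - 1), maxMs));
  }
  return delays;
}

// Run `task` up to maxAttempts times. `sleep` and `onFinalFailure` are injected
// so the caller decides how to wait and how to alert.
async function withRetries(task, { maxAttempts = 3, baseMs = 1000, sleep, onFinalFailure }) {
  const delays = backoffDelays(maxAttempts, baseMs);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      if (attempt === maxAttempts) {
        onFinalFailure(err); // alert_on: final_failure
        throw err;
      }
      await sleep(delays[attempt - 1]);
    }
  }
}

// Usage: a task that fails twice, then succeeds on the third attempt.
const realSleep = ms => new Promise(resolve => setTimeout(resolve, ms));
withRetries(
  attempt => {
    if (attempt < 3) throw new Error('transient failure');
    return 'ok';
  },
  { maxAttempts: 3, baseMs: 10, sleep: realSleep, onFinalFailure: err => console.error(err.message) }
).then(result => console.log(result));
```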

Monitoring & Alerts

monitoring:
  metrics:
    - rows_processed
    - execution_time
    - error_count
    - data_freshness
    
  alerts:
    pipeline_failed:
      channels: [slack, pagerduty]
      template: |
        🚨 *Pipeline Failed*
        
        Pipeline: {pipeline_name}
        Stage: {failed_stage}
        Error: {error_message}
        
        [View Logs]({logs_url})
        
    data_quality:
      trigger: anomaly_detected
      conditions:
        - row_count: differs_by > 50%
        - null_rate: exceeds_threshold
        - schema: changed_unexpectedly
        
    stale_data:
      trigger: last_update > threshold
      threshold: 2_hours
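
The alert rules above reduce to a few small checks. The sketch below shows one possible shape for the template rendering, the `stale_data` trigger, and the `row_count: differs_by > 50%` condition. The helper names and the choice of a previous run as the baseline are assumptions made for illustration.

```javascript
// Fill {placeholders} in an alert template; unknown keys are left untouched.
function renderTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match
  );
}

// stale_data: true when the last update is older than thresholdMs.
function isStale(lastUpdate, now, thresholdMs) {
  return now.getTime() - lastUpdate.getTime() > thresholdMs;
}

// data_quality: true when the row count differs from the baseline by more than `ratio`.
function rowCountAnomaly(current, baseline, ratio = 0.5) {
  if (baseline === 0) return current !== 0;
  return Math.abs(current - baseline) / baseline > ratio;
}

const alertText = renderTemplate(
  'Pipeline: {pipeline_name}\nStage: {failed_stage}\nError: {error_message}',
  { pipeline_name: 'Daily Sales ETL', failed_stage: 'load', error_message: 'timeout' }
);
console.log(alertText);

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;
console.log(isStale(new Date('2024-03-10T00:00:00Z'), new Date('2024-03-10T03:00:00Z'), TWO_HOURS_MS));
console.log(rowCountAnomaly(40, 100));
```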

Data Quality

Quality Checks

data_quality:
  schema_validation:
    - required_fields: [id, date, amount]
    - field_types:
        id: integer
        date: date
        amount: number
    - allowed_values:
        status: [active, pending, closed]
        
  statistical_checks:
    - null_rate: < 5%
    - duplicate_rate: < 1%
    - value_range:
        amount: [0, 1000000]
        
  business_rules:
    - total_equals_sum_of_line_items
    - dates_are_not_in_future
    - email_format_valid
    
  trend_analysis:
    - row_count: within_2_std_of_mean
    - total_value: within_expected_range
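
To make the checks above concrete, here is a minimal sketch of schema validation, a null-rate calculation, and the `within_2_std_of_mean` trend check. The schema object mirrors the YAML above; `typeCheckers`, `validateRow`, `nullRate`, and `withinTwoStd` are hypothetical names, and the date check assumes ISO 8601 strings.

```javascript
const schema = {
  required: ['id', 'date', 'amount'],
  types: { id: 'integer', date: 'date', amount: 'number' },
  allowed: { status: ['active', 'pending', 'closed'] },
};

const typeCheckers = {
  integer: v => Number.isInteger(v),
  number: v => typeof v === 'number' && Number.isFinite(v),
  // Assumes ISO 8601 strings; Date.parse is implementation-dependent for other formats.
  date: v => typeof v === 'string' && !Number.isNaN(Date.parse(v)),
};

// Return a list of human-readable problems for one row (empty list means valid).
function validateRow(row, s) {
  const errors = [];
  for (const f of s.required) {
    if (row[f] === null || row[f] === undefined) errors.push(`missing ${f}`);
  }
  for (const [f, t] of Object.entries(s.types)) {
    const v = row[f];
    if (v === null || v === undefined) continue; // already reported by the required check
    if (!typeCheckers[t](v)) errors.push(`${f} is not ${t}`);
  }
  for (const [f, values] of Object.entries(s.allowed)) {
    if (row[f] !== undefined && !values.includes(row[f])) errors.push(`${f} not allowed`);
  }
  return errors;
}

// Fraction of rows where `field` is null or undefined.
function nullRate(rows, field) {
  if (rows.length === 0) return 0;
  return rows.filter(r => r[field] === null || r[field] === undefined).length / rows.length;
}

// True when `value` lies within two population standard deviations of the history mean.
function withinTwoStd(value, history) {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance = history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  return Math.abs(value - mean) <= 2 * Math.sqrt(variance);
}

console.log(validateRow({ id: 1, date: '2024-03-09', amount: 10, status: 'active' }, schema));
console.log(validateRow({ id: '1', amount: 'x', status: 'open' }, schema));
console.log(withinTwoStd(130, [100, 110, 90, 105, 95]));
```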

Output Example

Request: "Create a daily sales data pipeline"

Output:

# Daily Sales Data Pipeline

## Pipeline Overview

Shopify + Stripe → Transform → BigQuery + Sheets

## Schedule
- Runs: 2am daily
- Timezone: UTC
- Retry: 3 attempts

## Extract

### Shopify Orders
```yaml
source: shopify
filter: created_at >= yesterday
fields: [id, email, total_price, line_items, created_at]
```

### Stripe Payments
```yaml
source: stripe
filter: created >= yesterday
fields: [id, amount, status, metadata.order_id]
```

## Transform
```javascript
// Join and clean data
{
  date: order.created_at.split('T')[0],
  order_id: order.id,
  customer: order.email,
  revenue: parseFloat(order.total_price),
  items: order.line_items.length,
  payment_status: payment.status
}
```

## Load

### BigQuery
- Table: analytics.sales_daily
- Mode: Append

### Google Sheets
- Sheet: "Daily Sales Dashboard"
- Tab: "Raw Data"

## Quality Checks
- Row count > 0
- No null order_ids
- Revenue sum matches Stripe

## Alerts
- Slack: #data-alerts
- On failure: @data-team
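
The "Revenue sum matches Stripe" check in the output example could be sketched as below. It rests on several assumptions: Shopify's `total_price` is a non-negative decimal string, Stripe's `amount` is an integer in the smallest currency unit of a two-decimal currency (for example, cents), and only payments with status `succeeded` count. `toCents` and `reconcile` are hypothetical helpers.

```javascript
// Convert a non-negative decimal string such as "19.99" to integer cents.
// Working on the string avoids floating-point drift; extra decimals are truncated.
function toCents(decimalString) {
  const [whole, frac = ''] = String(decimalString).trim().split('.');
  const cents = (frac + '00').slice(0, 2);
  return Number(whole) * 100 + Number(cents);
}

// Compare total order revenue with total successful payments, both in cents.
function reconcile(orders, payments) {
  const shopifyCents = orders.reduce((sum, o) => sum + toCents(o.total_price), 0);
  const stripeCents = payments
    .filter(p => p.status === 'succeeded')
    .reduce((sum, p) => sum + p.amount, 0);
  return { shopifyCents, stripeCents, matches: shopifyCents === stripeCents };
}

const recon = reconcile(
  [{ total_price: '19.99' }, { total_price: '5.01' }],
  [
    { amount: 1999, status: 'succeeded' },
    { amount: 501, status: 'succeeded' },
    { amount: 700, status: 'failed' }, // ignored: not a successful payment
  ]
);
console.log(recon);
```

Comparing integers rather than floats keeps the check exact; summing `parseFloat` values can produce small rounding differences that would trigger false alerts.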

Data Pipeline Skill - Part of Claude Office Skills
