spark-engineer by jeffallan/claude-skills
npx skills add https://github.com/jeffallan/claude-skills --skill spark-engineer
Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications.
Verify partition counts with `df.rdd.getNumPartitions()`. If spill or skew is detected, return to step 4. Test with production-scale data, monitor resource usage, and verify performance targets.

Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Spark SQL & DataFrames | references/spark-sql-dataframes.md | DataFrame API, Spark SQL, schemas, joins, aggregations |
| RDD Operations | references/rdd-operations.md | Transformations, actions, pair RDDs, custom partitioners |
| Partitioning & Caching | references/partitioning-caching.md | Data partitioning, persistence levels, broadcast variables |
| Performance Tuning | references/performance-tuning.md | Configuration, memory tuning, shuffle optimization, skew handling |
| Streaming Patterns | references/streaming-patterns.md | Structured Streaming, watermarks, stateful operations, sinks |
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

spark = SparkSession.builder \
    .appName("example-pipeline") \
    .config("spark.sql.shuffle.partitions", "400") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Always define explicit schemas in production
schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("event_ts", LongType(), False),
    StructField("amount", DoubleType(), True),
])

df = spark.read.schema(schema).parquet("s3://bucket/events/")

result = df \
    .filter(F.col("amount").isNotNull()) \
    .groupBy("user_id") \
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("event_count"))

# Verify partition count before writing
print(f"Partition count: {result.rdd.getNumPartitions()}")
result.write.mode("overwrite").parquet("s3://bucket/output/")
```
```python
from pyspark.sql.functions import broadcast

# Spark may broadcast dim_df automatically if it is small enough;
# the hint makes the intent explicit
enriched = large_fact_df.join(broadcast(dim_df), on="product_id", how="left")
```
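Automatic broadcasting is governed by `spark.sql.autoBroadcastJoinThreshold`. A minimal configuration sketch (the value shown is illustrative; 10 MB is Spark's default, and an existing `SparkSession` named `spark` is assumed):

```python
# Tables smaller than this byte threshold are broadcast automatically
# in equi-joins; setting it to "-1" disables automatic broadcasting.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
```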
```python
import pyspark.sql.functions as F

SALT_BUCKETS = 50

# Add salt to the skewed key on both sides
skewed_df = skewed_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int")) \
    .withColumn("salted_key", F.concat(F.col("skewed_key"), F.lit("_"), F.col("salt")))

other_df = other_df.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))) \
    .withColumn("salted_key", F.concat(F.col("skewed_key"), F.lit("_"), F.col("salt")))

result = skewed_df.join(other_df, on="salted_key", how="inner") \
    .drop("salt", "salted_key")
```
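The matching guarantee behind salting can be sketched in plain Python (a simplified model, not PySpark): the skewed side assigns each row one random bucket, while the other side is replicated across every bucket, so each salted key always finds its partner.

```python
import random

SALT_BUCKETS = 50

def salt_skewed(key: str) -> str:
    # Skewed side: one random salt per row spreads the hot key
    # across SALT_BUCKETS shuffle partitions.
    return f"{key}_{random.randrange(SALT_BUCKETS)}"

def explode_other(key: str) -> list[str]:
    # Other side: replicate the row once per bucket so every
    # possible salted key has a match.
    return [f"{key}_{i}" for i in range(SALT_BUCKETS)]

right_keys = set(explode_other("hot_key"))
# Every salted left-side key lands in the replicated right side.
assert all(salt_skewed("hot_key") in right_keys for _ in range(1000))
```

On Spark 3.x, adaptive query execution can also split skewed partitions automatically via `spark.sql.adaptive.skewJoin.enabled`, which is often worth trying before manual salting.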
```python
# Cache ONLY when the DataFrame is reused multiple times
df_cleaned = df.filter(...).withColumn(...).cache()
df_cleaned.count()  # Materialize immediately; check Spark UI for spill

report_a = df_cleaned.groupBy("region").agg(...)
report_b = df_cleaned.groupBy("product").agg(...)

df_cleaned.unpersist()  # Release when done
```
When implementing Spark solutions, provide:
Spark DataFrame API, Spark SQL, RDD transformations/actions, Catalyst optimizer, Tungsten execution engine, partitioning strategies, broadcast variables, accumulators, Structured Streaming, watermarks, checkpointing, Spark UI analysis, memory management, shuffle optimization
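For the watermarking item above, the core rule can be sketched in plain Python (a simplified model of Structured Streaming's semantics, not the actual API): state for event times older than the maximum observed event time minus the declared delay becomes eligible to be dropped.

```python
from datetime import datetime, timedelta

def is_late(event_time: datetime, max_event_time_seen: datetime,
            delay: timedelta = timedelta(minutes=10)) -> bool:
    # Mirrors withWatermark("event_ts", "10 minutes"): events older than
    # (max observed event time - delay) may be dropped from state.
    return event_time < max_event_time_seen - delay

max_seen = datetime(2024, 1, 1, 12, 0)
assert not is_late(datetime(2024, 1, 1, 11, 55), max_seen)  # within watermark
assert is_late(datetime(2024, 1, 1, 11, 45), max_seen)      # past watermark
```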
Weekly Installs
702
Repository
GitHub Stars
7.2K
First Seen
Jan 21, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
opencode: 583
gemini-cli: 566
claude-code: 560
codex: 554
cursor: 525
github-copilot: 521