dask by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill dask
Dask is a Python library for parallel and distributed computing that enables three critical capabilities:
Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.
This skill should be used when:
Dask provides five main components, each suited to different use cases:
Purpose : Scale pandas operations to larger datasets through parallel processing.
When to Use :
Reference Documentation : For comprehensive guidance on Dask DataFrames, refer to references/dataframes.md which includes:
Custom operations with map_partitions
Quick Example :
import dask.dataframe as dd
# Read multiple files as single DataFrame
ddf = dd.read_csv('data/2024-*.csv')
# Operations are lazy until compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').mean().compute()
Key Points :
Operations are lazy until .compute() is called
Use map_partitions for efficient custom operations (see the sketch below)
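To make the map_partitions key point concrete, here is a minimal sketch of a custom per-partition transformation. It assumes the same 'value' column as the quick example above; the add_features name and the centering logic are illustrative, not part of the skill.
import dask.dataframe as dd
ddf = dd.read_csv('data/2024-*.csv')
def add_features(pdf):
    # Each call receives one partition as an ordinary pandas DataFrame
    pdf = pdf.copy()
    pdf['value_centered'] = pdf['value'] - pdf['value'].mean()  # per-partition mean, not global
    return pdf
# Dask infers the output schema by running add_features on a tiny dummy partition;
# pass meta= explicitly if that inference fails
ddf2 = ddf.map_partitions(add_features)
result = ddf2.head()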
Purpose : Extend NumPy capabilities to datasets larger than memory using blocked algorithms.
When to Use :
Reference Documentation : For comprehensive guidance on Dask Arrays, refer to references/arrays.md which includes:
Custom operations with map_blocks
Quick Example :
import dask.array as da
# Create large array with chunks
x = da.random.random((100000, 100000), chunks=(10000, 10000))
# Operations are lazy
y = x + 100
z = y.mean(axis=0)
# Compute result
result = z.compute()
Key Points :
Use map_blocks for operations not available in Dask (see the sketch below)
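A minimal sketch of the map_blocks pattern named above. The rescale function is illustrative; note that it runs on each block independently, so the min and max here are per-block rather than global.
import dask.array as da
x = da.random.random((20000, 20000), chunks=(5000, 5000))
def rescale(block):
    # block is a plain NumPy array, so any NumPy-only code works here
    return (block - block.min()) / (block.max() - block.min())
y = x.map_blocks(rescale, dtype=x.dtype)
result = y[:1000, :1000].compute()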
Purpose : Process unstructured or semi-structured data (text, JSON, logs) with functional operations.
When to Use :
Reference Documentation : For comprehensive guidance on Dask Bags, refer to references/bags.md which includes:
Quick Example :
import dask.bag as db
import json
# Read and parse JSON files
bag = db.read_text('logs/*.json').map(json.loads)
# Filter and transform
valid = bag.filter(lambda x: x['status'] == 'valid')
processed = valid.map(lambda x: {'id': x['id'], 'value': x['value']})
# Convert to DataFrame for analysis
ddf = processed.to_dataframe()
Key Points :
Use foldby instead of groupby for better performance (see the sketch below)
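A minimal sketch of foldby as a shuffle-free alternative to groupby, assuming JSON records with 'category' and 'value' fields like those used elsewhere in this guide.
import dask.bag as db
import json
bag = db.read_text('logs/*.json').map(json.loads)
# Sum 'value' per 'category' without the full shuffle a groupby would require
totals = bag.foldby(
    key=lambda rec: rec['category'],
    binop=lambda acc, rec: acc + rec['value'],  # fold records within each partition
    initial=0,
    combine=lambda a, b: a + b,                 # merge the per-partition totals
    combine_initial=0,
)
result = totals.compute()  # list of (category, total) pairs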
Purpose : Build custom parallel workflows with fine-grained control over task execution and dependencies.
When to Use :
Reference Documentation : For comprehensive guidance on Dask Futures, refer to references/futures.md which includes:
Quick Example :
from dask.distributed import Client
client = Client() # Create local cluster
# Submit tasks (executes immediately)
def process(x):
    return x ** 2
futures = client.map(process, range(100))
# Gather results
results = client.gather(futures)
client.close()
Key Points :
Purpose : Control how and where Dask tasks execute (threads, processes, distributed).
When to Choose Scheduler :
Reference Documentation : For comprehensive guidance on Dask Schedulers, refer to references/schedulers.md which includes:
Quick Example :
import dask
import dask.dataframe as dd
# Use threads for DataFrame (default, good for numeric)
ddf = dd.read_csv('data.csv')
result1 = ddf.mean().compute() # Uses threads
# Use processes for Python-heavy work
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(python_function).compute(scheduler='processes')
# Use synchronous for debugging
dask.config.set(scheduler='synchronous')
result3 = problematic_computation.compute() # Can use pdb
# Use distributed for monitoring and scaling
from dask.distributed import Client
client = Client()
result4 = computation.compute() # Uses distributed with dashboard
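As a supplementary sketch (standard dask.config usage, not part of the original key points), the scheduler can be chosen per call, temporarily, or globally:
import dask
import dask.array as da
x = da.random.random((1000, 1000), chunks=(100, 100))
# Per call
x.sum().compute(scheduler='processes')
# Temporarily, for everything inside the block
with dask.config.set(scheduler='threads'):
    x.sum().compute()
# Globally for the rest of the session
dask.config.set(scheduler='synchronous')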
Key Points :
For comprehensive performance optimization guidance, memory management strategies, and common pitfalls to avoid, refer to references/best-practices.md. Key principles include:
Before using Dask, explore:
1. Don't Load Data Locally Then Hand to Dask
# Wrong: Loads all data in memory first
import pandas as pd
df = pd.read_csv('large.csv')
ddf = dd.from_pandas(df, npartitions=10)
# Correct: Let Dask handle loading
import dask.dataframe as dd
ddf = dd.read_csv('large.csv')
2. Avoid Repeated compute() Calls
# Wrong: Each compute is separate
for item in items:
    result = dask_computation(item).compute()
# Correct: Single compute for all
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)
3. Don't Build Excessively Large Task Graphs
Use map_partitions/map_blocks to fuse operations
Check graph size with len(ddf.__dask_graph__())
4. Choose Appropriate Chunk Sizes
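A minimal sketch for points 3 and 4, assuming an array workload; the sizes are illustrative targets of roughly 100 MB per chunk, not hard rules.
import dask.array as da
# 4000 x 4000 float64 chunks ≈ 128 MB each, a reasonable starting point
x = da.random.random((40000, 40000), chunks=(4000, 4000))
# Inspect how many tasks the graph contains (point 3)
print(len(x.__dask_graph__()))
# Rechunk if chunks are too small (too many tasks) or too large (memory pressure)
x = x.rechunk((8000, 8000))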
5. Use the Dashboard
from dask.distributed import Client
client = Client()
print(client.dashboard_link) # Monitor performance, identify bottlenecks
import dask.dataframe as dd
# Extract: Read data
ddf = dd.read_csv('raw_data/*.csv')
# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset=['important_col'])
# Load: Aggregate and save
summary = ddf.groupby('category').agg({'amount': ['sum', 'mean']})
summary.to_parquet('output/summary.parquet')
import dask.bag as db
import json
# Start with Bag for unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')
# Convert to DataFrame for structured analysis
ddf = bag.to_dataframe()
result = ddf.groupby('category').mean().compute()
import dask.array as da
# Load or create large array
x = da.from_zarr('large_dataset.zarr')
# Process in chunks
normalized = (x - x.mean()) / x.std()
# Save result
da.to_zarr(normalized, 'normalized.zarr')
from dask.distributed import Client
client = Client()
# Scatter large dataset once
data = client.scatter(large_dataset)
# Process in parallel with dependencies
futures = []
for param in parameters:
    future = client.submit(process, data, param)
    futures.append(future)
# Gather results
results = client.gather(futures)
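As a hedged extension of the futures pattern above (reusing client, process, data, and parameters from that snippet), results can also be consumed as soon as each task finishes using as_completed instead of a single gather; handle_result is a placeholder for whatever post-processing is needed.
from dask.distributed import as_completed
futures = [client.submit(process, data, param) for param in parameters]
# Handle each result as soon as its task completes
for future in as_completed(futures):
    handle_result(future.result())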
Use this decision guide to choose the appropriate Dask component:
Data Type :
Tabular data → DataFrames; N-dimensional numeric data → Arrays; unstructured text/JSON/logs → Bags; arbitrary Python objects and tasks → Futures
Operation Type :
pandas-style operations → DataFrames; NumPy-style operations → Arrays; functional map/filter/fold → Bags; custom task submission → Futures
Control Level :
High-level collections (DataFrame, Array, Bag) build task graphs automatically; Futures give fine-grained, immediate control over individual tasks and dependencies
Workflow Type :
Batch ETL and analytics → DataFrames; numerical array processing → Arrays; log and JSON preprocessing → Bags; dynamic, dependency-driven workflows → Futures
# Bag → DataFrame
ddf = bag.to_dataframe()
# DataFrame → Array (for numeric data)
arr = ddf.to_dask_array(lengths=True)
# Array → DataFrame
ddf = dd.from_dask_array(arr, columns=['col1', 'col2'])
1. Test with synchronous scheduler on small data :
import dask
dask.config.set(scheduler='synchronous')
result = computation.compute() # Can use pdb, easy debugging
2. Validate with threads on sample :
sample = ddf.head(1000) # Small sample
# Test logic, then scale to full dataset
3. Scale with distributed for monitoring :
from dask.distributed import Client
client = Client()
print(client.dashboard_link) # Monitor performance
result = computation.compute()
Memory Errors :
Use persist() strategically and delete when done (see the sketch below)
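A minimal sketch of the persist-then-delete advice, assuming a distributed client and the illustrative 'value'/'category' columns used earlier.
from dask.distributed import Client
import dask.dataframe as dd
client = Client()
ddf = dd.read_csv('data/*.csv')
# Materialize the filtered data in cluster memory once, then reuse it
ddf = ddf[ddf['value'] > 100].persist()
summary1 = ddf.groupby('category').mean().compute()
summary2 = ddf['value'].sum().compute()
# Dropping the last reference lets the scheduler release that memory
del ddf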
Slow Start :
Use map_partitions or map_blocks to reduce the number of tasks
Poor Parallelization :
All reference documentation files can be read as needed for detailed information:
references/dataframes.md - Complete Dask DataFrame guide
references/arrays.md - Complete Dask Array guide
references/bags.md - Complete Dask Bag guide
references/futures.md - Complete Dask Futures and distributed computing guide
references/schedulers.md - Complete scheduler selection and configuration guide
references/best-practices.md - Comprehensive performance optimization and troubleshooting guide
Load these files when users need detailed information about specific Dask components, operations, or patterns beyond the quick guidance provided here.
Weekly Installs
116
Repository
GitHub Stars
22.6K
First Seen
Jan 21, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
claude-code: 102
opencode: 91
cursor: 84
gemini-cli: 83
antigravity: 80
codex: 75