pandas-pro by jeffallan/claude-skills
npx skills add https://github.com/jeffallan/claude-skills --skill pandas-pro
Expert pandas developer specializing in efficient data manipulation, analysis, and transformation workflows with production-grade performance patterns.
Assess data structure — Examine dtypes, memory usage, missing values, data quality:
print(df.dtypes)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
print(df.isna().sum())
print(df.describe(include="all"))
Design transformation — Plan vectorized operations, avoid loops, identify indexing strategy
Implement efficiently — Use vectorized methods, method chaining, proper indexing
Validate results — Check dtypes, shapes, null counts, and row counts:
assert result.shape[0] == expected_rows, f"Row count mismatch: {result.shape[0]}"
assert result.isna().sum().sum() == 0, "Unexpected nulls after transform"
assert set(result.columns) == expected_cols
Optimize — Profile memory, apply categorical types, use chunking if needed
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| DataFrame Operations | references/dataframe-operations.md | Indexing, selection, filtering, sorting |
| Data Cleaning | references/data-cleaning.md | Missing values, duplicates, type conversion |
| Aggregation & GroupBy | references/aggregation-groupby.md | GroupBy, pivot, crosstab, aggregation |
| Merging & Joining | references/merging-joining.md | Merge, join, concat, combine strategies |
| Performance Optimization | references/performance-optimization.md | Memory usage, vectorization, chunking |
# ❌ AVOID: row-by-row iteration
for i, row in df.iterrows():
df.at[i, 'tax'] = row['price'] * 0.2
# ✅ USE: vectorized assignment
df['tax'] = df['price'] * 0.2
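The same vectorized approach extends to conditional logic. A minimal sketch using np.select for branching without row iteration — the tiered rates and sample prices here are hypothetical, not from the skill itself:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a 'price' column as in the example above
df = pd.DataFrame({'price': [5.0, 25.0, 120.0, 60.0]})

# np.select evaluates all conditions at once; the first matching
# condition wins, and 'default' covers everything else
conditions = [df['price'] < 10, df['price'] < 100]
rates = [0.0, 0.1]
df['tax'] = df['price'] * np.select(conditions, rates, default=0.2)
```

This keeps the whole computation in NumPy, so it scales the same way as the flat-rate assignment above.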
Safe subsetting with .copy()
# ❌ AVOID: chained indexing triggers SettingWithCopyWarning
df['A']['B'] = 1
# ✅ USE: .loc[] with explicit copy when mutating a subset
subset = df.loc[df['status'] == 'active', :].copy()
subset['score'] = subset['score'].fillna(0)
summary = (
df.groupby(['region', 'category'], observed=True)
.agg(
total_sales=('revenue', 'sum'),
avg_price=('price', 'mean'),
order_count=('order_id', 'nunique'),
)
.reset_index()
)
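When group-level statistics need to live alongside the original rows rather than in a collapsed summary, groupby().transform broadcasts the aggregate back to the source shape. A sketch on hypothetical data (column names chosen to mirror the example above):

```python
import pandas as pd

# Hypothetical sample data mirroring the columns used above
df = pd.DataFrame({
    'region': ['N', 'N', 'S'],
    'revenue': [100.0, 300.0, 50.0],
})

# transform returns one value per original row, so per-row
# comparisons against the group total stay fully vectorized
df['region_total'] = df.groupby('region')['revenue'].transform('sum')
df['share'] = df['revenue'] / df['region_total']
```

This avoids the merge-back step that a plain .agg() summary would otherwise require.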
merged = pd.merge(
left_df, right_df,
on=['customer_id', 'date'],
how='left',
validate='m:1', # asserts right key is unique
indicator=True,
)
unmatched = merged[merged['_merge'] != 'both']
print(f"Unmatched rows: {len(unmatched)}")
merged.drop(columns=['_merge'], inplace=True)
# Forward-fill then interpolate numeric gaps
df['price'] = df['price'].ffill().interpolate(method='linear')
# Fill categoricals with mode, numerics with median
for col in df.select_dtypes(include='object'):
df[col] = df[col].fillna(df[col].mode()[0])
for col in df.select_dtypes(include='number'):
df[col] = df[col].fillna(df[col].median())
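The two loops above can also be collapsed into a single fillna call by precomputing one fill-value mapping. A sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame with one object column and one numeric column
df = pd.DataFrame({
    'status': ['a', None, 'a', 'b'],
    'score': [1.0, None, 3.0, None],
})

# Build a column -> fill value dict, then apply it in one pass
fill_values = {
    **{c: df[c].mode()[0] for c in df.select_dtypes(include='object')},
    **{c: df[c].median() for c in df.select_dtypes(include='number')},
}
df = df.fillna(fill_values)
```

Same result as the loops, but one vectorized pass over the frame instead of one per column.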
daily = (
df.set_index('timestamp')
.resample('D')
.agg({'revenue': 'sum', 'sessions': 'count'})
.fillna(0)
)
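A point worth noting about the resample above: days with no rows still appear in the output as empty buckets, which is often exactly what a downstream rolling window needs. A runnable sketch with hypothetical timestamps:

```python
import pandas as pd

# Hypothetical intraday events; note there is no row for 2024-01-03
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-01 09:00', '2024-01-01 17:00',
                                 '2024-01-02 10:00', '2024-01-04 12:00']),
    'revenue': [10.0, 20.0, 5.0, 40.0],
})

daily = (
    df.set_index('timestamp')
      .resample('D')['revenue']
      .sum()  # the empty 2024-01-03 bucket sums to 0
)
# A 2-day rolling mean smooths the gap-filled daily series
smoothed = daily.rolling(window=2, min_periods=1).mean()
```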
pivot = df.pivot_table(
values='revenue',
index='region',
columns='product_line',
aggfunc='sum',
fill_value=0,
margins=True,
)
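The inverse operation, going from a wide pivot layout back to long rows, is melt. A sketch assuming a hypothetical wide table shaped like the pivot output above (without the margins row):

```python
import pandas as pd

# Hypothetical wide table: one column per product line
wide = pd.DataFrame({
    'region': ['N', 'S'],
    'hardware': [100.0, 80.0],
    'software': [50.0, 70.0],
})

# melt unpivots the value columns back into (region, product_line, revenue) rows
long = wide.melt(
    id_vars='region',
    var_name='product_line',
    value_name='revenue',
)
```

Round-tripping through pivot_table and melt is a quick way to sanity-check that an aggregation preserved totals.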
# Downcast numerics and convert low-cardinality strings to categorical
df['category'] = df['category'].astype('category')
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df['score'] = pd.to_numeric(df['score'], downcast='float')
print(df.memory_usage(deep=True).sum() / 1e6, "MB after optimization")
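For data that still will not fit in memory after downcasting, the chunking mentioned in the workflow can be sketched with read_csv's chunksize parameter — here an in-memory StringIO stands in for a large file:

```python
import io
import pandas as pd

# Hypothetical CSV stream standing in for a file too large to load at once
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n")

# Process fixed-size chunks and combine the partial results
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['value'].sum()
```

Each chunk is an ordinary DataFrame, so all the vectorized patterns above apply within a chunk; only the cross-chunk combination step needs extra care.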
Check memory usage with .memory_usage(deep=True)
Use .copy() when modifying subsets to avoid SettingWithCopyWarning
Avoid .iterrows() unless absolutely necessary
Avoid chained indexing (df['A']['B']) — use .loc[] or .iloc[]
Avoid deprecated APIs (.ix, .append() — use pd.concat())
When implementing pandas solutions, provide:
Weekly Installs
1.1K
Repository
GitHub Stars
7.2K
First Seen
Jan 20, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
opencode: 937
gemini-cli: 913
codex: 894
github-copilot: 855
cursor: 825
amp: 782