pandas-pro by jeffallan/claude-skills
npx skills add https://github.com/jeffallan/claude-skills --skill pandas-pro
Expert pandas developer specializing in efficient data manipulation, analysis, and transformation workflows with production-grade performance patterns.
Assess data structure — Examine dtypes, memory usage, missing values, data quality:
print(df.dtypes)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
print(df.isna().sum())
print(df.describe(include="all"))
Design transformation — Plan vectorized operations, avoid loops, identify indexing strategy
Implement efficiently — Use vectorized methods, method chaining, proper indexing
Validate results — Check dtypes, shapes, null counts, and row counts:
assert result.shape[0] == expected_rows, f"Row count mismatch: {result.shape[0]}"
assert result.isna().sum().sum() == 0, "Unexpected nulls after transform"
assert set(result.columns) == expected_cols
Optimize — Profile memory, apply categorical types, use chunking if needed
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| DataFrame Operations | references/dataframe-operations.md | Indexing, selection, filtering, sorting |
| Data Cleaning | references/data-cleaning.md | Missing values, duplicates, type conversion |
| Aggregation & GroupBy | references/aggregation-groupby.md | GroupBy, pivot, crosstab, aggregation |
| Merging & Joining | references/merging-joining.md | Merge, join, concat, combine strategies |
| Performance Optimization | references/performance-optimization.md | Memory usage, vectorization, chunking |
# ❌ AVOID: row-by-row iteration
for i, row in df.iterrows():
df.at[i, 'tax'] = row['price'] * 0.2
# ✅ USE: vectorized assignment
df['tax'] = df['price'] * 0.2
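The same vectorized approach extends to conditional logic. A minimal sketch using np.select for branching without row iteration — the tiered rates and sample prices here are hypothetical, not from the skill itself:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a 'price' column as in the example above
df = pd.DataFrame({'price': [5.0, 25.0, 120.0, 60.0]})

# np.select evaluates all conditions at once; the first matching
# condition wins, and 'default' covers everything else
conditions = [df['price'] < 10, df['price'] < 100]
rates = [0.0, 0.1]
df['tax'] = df['price'] * np.select(conditions, rates, default=0.2)
```

This keeps the whole computation in NumPy, so it scales the same way as the flat-rate assignment above.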
Safe subsetting with .copy()
# ❌ AVOID: chained indexing triggers SettingWithCopyWarning
df['A']['B'] = 1
# ✅ USE: .loc[] with explicit copy when mutating a subset
subset = df.loc[df['status'] == 'active', :].copy()
subset['score'] = subset['score'].fillna(0)
summary = (
df.groupby(['region', 'category'], observed=True)
.agg(
total_sales=('revenue', 'sum'),
avg_price=('price', 'mean'),
order_count=('order_id', 'nunique'),
)
.reset_index()
)
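When group-level statistics need to live alongside the original rows rather than in a collapsed summary, groupby().transform broadcasts the aggregate back to the source shape. A sketch on hypothetical data (column names chosen to mirror the example above):

```python
import pandas as pd

# Hypothetical sample data mirroring the columns used above
df = pd.DataFrame({
    'region': ['N', 'N', 'S'],
    'revenue': [100.0, 300.0, 50.0],
})

# transform returns one value per original row, so per-row
# comparisons against the group total stay fully vectorized
df['region_total'] = df.groupby('region')['revenue'].transform('sum')
df['share'] = df['revenue'] / df['region_total']
```

This avoids the merge-back step that a plain .agg() summary would otherwise require.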
merged = pd.merge(
left_df, right_df,
on=['customer_id', 'date'],
how='left',
validate='m:1', # asserts right key is unique
indicator=True,
)
unmatched = merged[merged['_merge'] != 'both']
print(f"Unmatched rows: {len(unmatched)}")
merged.drop(columns=['_merge'], inplace=True)
# Forward-fill then interpolate numeric gaps
df['price'] = df['price'].ffill().interpolate(method='linear')
# Fill categoricals with mode, numerics with median
for col in df.select_dtypes(include='object'):
df[col] = df[col].fillna(df[col].mode()[0])
for col in df.select_dtypes(include='number'):
df[col] = df[col].fillna(df[col].median())
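The two loops above can also be collapsed into a single fillna call by precomputing one fill-value mapping. A sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame with one object column and one numeric column
df = pd.DataFrame({
    'status': ['a', None, 'a', 'b'],
    'score': [1.0, None, 3.0, None],
})

# Build a column -> fill value dict, then apply it in one pass
fill_values = {
    **{c: df[c].mode()[0] for c in df.select_dtypes(include='object')},
    **{c: df[c].median() for c in df.select_dtypes(include='number')},
}
df = df.fillna(fill_values)
```

Same result as the loops, but one vectorized pass over the frame instead of one per column.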
daily = (
df.set_index('timestamp')
.resample('D')
.agg({'revenue': 'sum', 'sessions': 'count'})
.fillna(0)
)
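A point worth noting about the resample above: days with no rows still appear in the output as empty buckets, which is often exactly what a downstream rolling window needs. A runnable sketch with hypothetical timestamps:

```python
import pandas as pd

# Hypothetical intraday events; note there is no row for 2024-01-03
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-01 09:00', '2024-01-01 17:00',
                                 '2024-01-02 10:00', '2024-01-04 12:00']),
    'revenue': [10.0, 20.0, 5.0, 40.0],
})

daily = (
    df.set_index('timestamp')
      .resample('D')['revenue']
      .sum()  # the empty 2024-01-03 bucket sums to 0
)
# A 2-day rolling mean smooths the gap-filled daily series
smoothed = daily.rolling(window=2, min_periods=1).mean()
```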
pivot = df.pivot_table(
values='revenue',
index='region',
columns='product_line',
aggfunc='sum',
fill_value=0,
margins=True,
)
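The inverse operation, going from a wide pivot layout back to long rows, is melt. A sketch assuming a hypothetical wide table shaped like the pivot output above (without the margins row):

```python
import pandas as pd

# Hypothetical wide table: one column per product line
wide = pd.DataFrame({
    'region': ['N', 'S'],
    'hardware': [100.0, 80.0],
    'software': [50.0, 70.0],
})

# melt unpivots the value columns back into (region, product_line, revenue) rows
long = wide.melt(
    id_vars='region',
    var_name='product_line',
    value_name='revenue',
)
```

Round-tripping through pivot_table and melt is a quick way to sanity-check that an aggregation preserved totals.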
# Downcast numerics and convert low-cardinality strings to categorical
df['category'] = df['category'].astype('category')
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df['score'] = pd.to_numeric(df['score'], downcast='float')
print(df.memory_usage(deep=True).sum() / 1e6, "MB after optimization")
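For data that still will not fit in memory after downcasting, the chunking mentioned in the workflow can be sketched with read_csv's chunksize parameter — here an in-memory StringIO stands in for a large file:

```python
import io
import pandas as pd

# Hypothetical CSV stream standing in for a file too large to load at once
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n")

# Process fixed-size chunks and combine the partial results
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['value'].sum()
```

Each chunk is an ordinary DataFrame, so all the vectorized patterns above apply within a chunk; only the cross-chunk combination step needs extra care.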
Check memory usage with .memory_usage(deep=True)
Use .copy() when modifying subsets to avoid SettingWithCopyWarning
Avoid .iterrows() unless absolutely necessary
Avoid chained indexing (df['A']['B']) — use .loc[] or .iloc[]
Avoid deprecated APIs (.ix, .append() — use pd.concat())
When implementing pandas solutions, provide:
Weekly Installs
1.1K
Repository
GitHub Stars
7.2K
First Seen
Jan 20, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
opencode: 937
gemini-cli: 913
codex: 894
github-copilot: 855
cursor: 825
amp: 782