data-scientist by 404kidwiz/claude-supercode-skills
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill data-scientist

Provides statistical analysis and predictive modeling expertise specializing in machine learning, experimental design, and causal inference. Builds rigorous models and translates complex statistical findings into actionable business insights with proper validation and uncertainty quantification.
Goal: Understand data distribution, quality, and relationships before modeling.
Steps:
Load and Profile Data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv("customer_data.csv")
# Basic profiling
print(df.info())
print(df.describe())
# Missing values analysis
missing = df.isnull().sum() / len(df)
print(missing[missing > 0].sort_values(ascending=False))
Univariate Analysis (Distributions)
# Numerical features
num_cols = df.select_dtypes(include=[np.number]).columns
for col in num_cols:
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    sns.histplot(df[col], kde=True)
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.show()
# Categorical features
cat_cols = df.select_dtypes(exclude=[np.number]).columns
for col in cat_cols:
    print(df[col].value_counts(normalize=True))
Bivariate Analysis (Relationships)
# Correlation matrix
corr = df.corr(numeric_only=True)  # restrict to numeric columns
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
# Target vs Features
target = 'churn'
sns.boxplot(x=target, y='tenure', data=df)
Data Cleaning
# Impute missing values (assignment avoids deprecated inplace fillna)
df['age'] = df['age'].fillna(df['age'].median())
df['category'] = df['category'].fillna('Unknown')
# Handle outliers (Example: Cap at 99th percentile)
cap = df['income'].quantile(0.99)
df['income'] = np.where(df['income'] > cap, cap, df['income'])
Verification:
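The verification bullets did not survive in this copy; as a minimal, self-contained sketch (toy data and checks are mine, not the skill's own list), the cleaning steps above can be re-run and asserted:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the columns cleaned above
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "category": ["A", None, "B"],
    "income": [50_000.0, 60_000.0, 1_000_000.0],
})

# Re-apply the cleaning steps
df["age"] = df["age"].fillna(df["age"].median())
df["category"] = df["category"].fillna("Unknown")
cap = df["income"].quantile(0.99)
df["income"] = np.where(df["income"] > cap, cap, df["income"])

# Verify: no missing values remain, extreme incomes are capped
assert df.isnull().sum().sum() == 0
assert df["income"].max() <= cap
```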
Goal: Analyze results of a website conversion experiment.
Steps:
Define Hypothesis
Load and Aggregate Data
# data: ['user_id', 'group', 'converted']
results = df.groupby('group')['converted'].agg(['count', 'sum', 'mean'])
results.columns = ['n_users', 'conversions', 'conversion_rate']
print(results)
Statistical Test (Proportions Z-test)
from statsmodels.stats.proportion import proportions_ztest
control = results.loc['A']
treatment = results.loc['B']
count = np.array([treatment['conversions'], control['conversions']])
nobs = np.array([treatment['n_users'], control['n_users']])
stat, p_value = proportions_ztest(count, nobs, alternative='larger')
print(f"Z-statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")
Confidence Intervals
from statsmodels.stats.proportion import proportion_confint
(lower_treat, lower_con), (upper_treat, upper_con) = proportion_confint(count, nobs, alpha=0.05)  # count order is [treatment, control]
print(f"Control CI: [{lower_con:.4f}, {upper_con:.4f}]")
print(f"Treatment CI: [{lower_treat:.4f}, {upper_treat:.4f}]")
Conclusion
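A sketch of how the conclusion step might be codified (the decision wording and example p-values are mine; in practice, pass in the p_value from the z-test above):

```python
ALPHA = 0.05  # significance level used in the test above

def conclude(p_value: float, alpha: float = ALPHA) -> str:
    # One-sided test: a small p-value means B's conversion rate is credibly higher
    if p_value < alpha:
        return "Reject H0: B lifts conversion; recommend launch."
    return "Fail to reject H0: keep control or collect more data."

print(conclude(0.03))  # illustrative
print(conclude(0.40))  # illustrative
```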
Goal: Estimate the impact of a "Premium Membership" on "Spend" when an A/B test isn't possible (observational data).
Steps:
Problem Setup
Calculate Propensity Scores
from sklearn.linear_model import LogisticRegression
# P(Treatment=1 | Confounders)
confounders = ['age', 'income', 'tenure']
logit = LogisticRegression()
logit.fit(df[confounders], df['is_premium'])
df['propensity_score'] = logit.predict_proba(df[confounders])[:, 1]
# Check overlap (Common Support)
sns.histplot(data=df, x='propensity_score', hue='is_premium', element='step')
plt.show()
Matching (Nearest Neighbor)
from sklearn.neighbors import NearestNeighbors
# Separate groups
treatment = df[df['is_premium'] == 1]
control = df[df['is_premium'] == 0]
# Find neighbors for treatment group in control group
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treatment[['propensity_score']])
# Create matched dataframe
matched_control = control.iloc[indices.flatten()]
# Compare outcomes (matching on the treated estimates the ATT, not the full ATE)
att = treatment['spend'].mean() - matched_control['spend'].mean()
print(f"Average Treatment Effect on the Treated (ATT): ${att:.2f}")
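Before trusting the matched estimate, covariate balance is normally checked; a minimal sketch of the standardized mean difference used in the next step (the function name and sample ages are mine):

```python
import numpy as np

def standardized_mean_diff(treated: np.ndarray, control: np.ndarray) -> float:
    # SMD = |mean_t - mean_c| / pooled std; values below 0.1 suggest good balance
    pooled_std = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return float(abs(treated.mean() - control.mean()) / pooled_std)

# Illustrative: 'age' in the treated group vs its matched controls
age_treated = np.array([34.0, 41.0, 29.0, 38.0])
age_matched = np.array([33.0, 42.0, 30.0, 37.0])
print(f"SMD for age: {standardized_mean_diff(age_treated, age_matched):.3f}")
```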
Validation (Balance Check)
abs(mean_diff) / pooled_std < 0.1 (Standardized Mean Difference).
What it looks like:
Why it fails:
Correct approach:
Fit scalers/encoders on X_train, then transform X_test; Pipeline objects ensure safety.
What it looks like:
Why it fails:
Correct approach:
What it looks like:
Why it fails:
Correct approach:
Use scale_pos_weight in XGBoost, or class_weight='balanced' in Sklearn.
Methodology & Rigor:
Code & Reproducibility:
Pin dependencies in requirements.txt or environment.yml. Fix random seeds (random_state=42).
Interpretation & Communication:
Performance:
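The leakage rule above (fit preprocessing on X_train only, then transform X_test) can be sketched with an sklearn Pipeline; the data and model here are illustrative, not from the skill:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: the label depends on the first two features
X = np.random.default_rng(0).normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is fit on X_train only; X_test is transformed with train statistics
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```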
Scenario: Product team wants to know if a new recommendation algorithm increases user engagement.
Analysis Approach:
Key Analysis:
# Bootstrap confidence interval for difference in means
# (treatment, control: 1-D arrays of per-user engagement values)
rng = np.random.default_rng(42)
boot_mean = lambda x: rng.choice(x, size=len(x), replace=True).mean()
bootstrap_diffs = np.array([boot_mean(treatment) - boot_mean(control) for _ in range(10_000)])
diff = treatment.mean() - control.mean()
ci = np.percentile(bootstrap_diffs, [2.5, 97.5])
Outcome: Feature launched with 95% probability of positive impact
Scenario: Retail chain needs to forecast next-quarter sales for inventory planning.
Modeling Approach:
Results:
| Model | MAPE | 90% CI Width |
|---|---|---|
| ARIMA | 12.3% | ±15% |
| Prophet | 9.8% | ±12% |
| XGBoost | 7.2% | ±9% |
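MAPE in the table above is the mean absolute percentage error; a quick reference implementation (the example values are mine, not the case-study data):

```python
import numpy as np

def mape(actual, forecast) -> float:
    # Mean Absolute Percentage Error, in percent (assumes no zero actuals)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

print(f"MAPE: {mape([100, 200, 400], [110, 180, 400]):.1f}%")
```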
Deliverable: Production model with automated retraining pipeline
Scenario: Marketing wants to understand which channels drive actual conversions vs. appear correlated.
Causal Methods:
Key Findings:
Weekly Installs: 100
GitHub Stars: 43
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (81), codex (75), gemini-cli (74), claude-code (72), cursor (68), github-copilot (59)