数据科学家技能：机器学习、统计建模与A/B测试分析实战指南

data-scientist by 404kidwiz/claude-supercode-skills

142 周安装量

75 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill data-scientist

AI/机器学习 Python Web框架数据分析

🇨🇳中文介绍

数据科学家

目的

提供统计分析和预测建模专业知识，专攻机器学习、实验设计和因果推断。构建严谨的模型，并将复杂的统计发现转化为可操作的业务洞察，同时进行适当的验证和不确定性量化。

使用场景

执行探索性数据分析以发现模式和异常
构建预测模型（分类、回归、预测）
设计和分析 A/B 测试或实验
进行严格的统计假设检验
创建高级可视化和数据叙事
为业务问题定义指标和关键绩效指标

核心能力

统计建模

使用回归、分类和聚类构建预测模型
实现时间序列预测和因果推断
设计和分析 A/B 测试和实验
执行特征工程和选择

机器学习

训练和评估监督式与非监督式学习模型
为复杂模式实现深度学习模型
执行超参数调优和模型优化
使用交叉验证和保留集验证模型

数据探索

进行探索性数据分析以发现模式
识别数据集中的异常值和离群点
创建用于洞察发现的高级可视化
从数据探索中生成假设

沟通与叙事

将统计发现转化为业务语言
为利益相关者创建引人入胜的数据叙事
构建交互式笔记本和报告
展示包含不确定性量化的发现

3. 核心工作流

工作流 1：探索性数据分析与数据清洗

目标： 在建模前理解数据分布、质量和关系。

步骤：

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

工作流 3：A/B 测试分析

目标： 分析网站转化实验的结果。

定义假设
- H0：转化率 B <= 转化率 A
- H1：转化率 B > 转化率 A
- Alpha：0.05

加载和聚合数据

# data: ['user_id', 'group', 'converted']
results = df.groupby('group')['converted'].agg(['count', 'sum', 'mean'])
results.columns = ['n_users', 'conversions', 'conversion_rate']
print(results)

统计检验（比例 Z 检验）

from statsmodels.stats.proportion import proportions_ztest

control = results.loc['A']
treatment = results.loc['B']

count = np.array([treatment['conversions'], control['conversions']])
nobs = np.array([treatment['n_users'], control['n_users']])

stat, p_value = proportions_ztest(count, nobs, alternative='larger')

print(f"Z-statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

置信区间

from statsmodels.stats.proportion import proportion_confint

(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(count, nobs, alpha=0.05)

print(f"Control CI: [{lower_con:.4f}, {upper_con:.4f}]")
print(f"Treatment CI: [{lower_treat:.4f}, {upper_treat:.4f}]")

结论
- 如果 p 值 < 0.05：拒绝 H0。变体 B 在统计上显著更好。
- 检查实际显著性（提升幅度）。

工作流 5：因果推断（倾向得分匹配）

目标： 在无法进行 A/B 测试时（观察性数据），估计“高级会员”对“消费”的影响。

问题设定
- 处理变量：高级会员 (1) vs 免费 (0)
- 结果变量：年消费额 ($)
- 混杂变量：年龄、收入、地区、使用时长（影响会员资格和消费的因素）

计算倾向得分

from sklearn.linear_model import LogisticRegression

# P(Treatment=1 | Confounders)
confounders = ['age', 'income', 'tenure']
logit = LogisticRegression()
logit.fit(df[confounders], df['is_premium'])

df['propensity_score'] = logit.predict_proba(df[confounders])[:, 1]

# Check overlap (Common Support)
sns.histplot(data=df, x='propensity_score', hue='is_premium', element='step')

匹配（最近邻）

from sklearn.neighbors import NearestNeighbors

# Separate groups
treatment = df[df['is_premium'] == 1]
control = df[df['is_premium'] == 0]

# Find neighbors for treatment group in control group
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treatment[['propensity_score']])

# Create matched dataframe
matched_control = control.iloc[indices.flatten()]

# Compare outcomes
ate = treatment['spend'].mean() - matched_control['spend'].mean()
print(f"Average Treatment Effect (ATE): ${ate:.2f}")

验证（平衡性检查）
- 检查匹配后混杂变量是否平衡（例如，处理组与匹配对照组的平均年龄应相似）。
- abs(mean_diff) / pooled_std < 0.1（标准化均值差）。

5. 反模式与陷阱

❌ 反模式 1：数据泄露

在划分训练/测试集之前对整个数据集进行缩放/标准化。
使用未来信息（例如，“next_month_churn”）作为特征。
包含基于整个数据集计算的目标衍生特征（例如，目标均值编码）。

模型在训练/验证期间的性能被人为夸大。
在生产环境中面对新的、未见过的数据时完全失败。

先划分，再转换。
仅在 X_train 上拟合缩放器/编码器，然后转换 X_test。
使用 Pipeline 对象确保安全性。

❌ 反模式 2：P 值操纵（数据挖掘）

测试 50 个不同的假设或子组。
仅报告 p < 0.05 的那一个结果。
在达到显著性时立即停止 A/B 测试（窥探）。

假阳性（第一类错误）的概率很高。
发现的是随机噪声，而非可复现的效应。

预先注册假设。
对于多重比较，应用邦费罗尼校正或错误发现率控制。
在实验之前确定样本量并严格遵守。

❌ 反模式 3：忽略类别不平衡

在欺诈率仅为 0.1% 的数据上训练欺诈检测模型。
将 99.9% 的准确率报告为“成功”。

模型只是简单地为所有人预测“无欺诈”。
无法检测到真正感兴趣的类别。

使用适当的指标：精确率-召回率 AUC、F1 分数。
重采样技术：SMOTE、随机欠采样。
类别权重：XGBoost 中的 scale_pos_weight，Sklearn 中的 class_weight='balanced'。

7. 质量检查清单

方法论与严谨性：

在分析之前明确定义假设。
检查统计检验的假设（正态性、独立性、同方差性）。
正确执行训练/测试/验证集划分（无泄露）。
适当处理类别不平衡（指标、重采样）。
使用交叉验证进行模型评估。

代码与可复现性：

代码存储在 git 中，并附带 requirements.txt 或 environment.yml。
为可复现性设置随机种子 (random_state=42)。
硬编码路径替换为相对路径或配置变量。
复杂逻辑封装在带有文档字符串的函数/类中。

解释与沟通：

用业务术语解释结果（例如，“收入提升” vs “对数损失减少”）。
为估计值提供置信区间。
如有需要，使用 SHAP 或 LIME 解释“黑盒”模型。
明确说明注意事项和局限性。

如果数据集 > 10GB，则在采样数据上执行 EDA。
使用向量化操作（pandas/numpy）而非循环。
优化查询（尽早过滤，仅选择所需列）。

示例 1：功能发布的 A/B 测试分析

场景： 产品团队想知道新的推荐算法是否能提高用户参与度。

实验设计：随机分配（50/50），最小样本量计算
数据收集：跟踪点击率、页面停留时间、转化率
统计检验：双样本 t 检验，附带自助法置信区间
结果：点击率显著提升（p < 0.01），提升 12%

# Bootstrap confidence interval for difference in means
from scipy import stats
diff = treatment_means - control_means
ci = np.percentile(bootstrap_diffs, [2.5, 97.5])

结果： 功能发布，具有 95% 的积极影响概率

示例 2：需求规划的时间序列预测

场景： 零售连锁店需要预测下一季度的销售额以进行库存规划。

探索性分析：识别趋势、季节性（每周、节假日）
特征工程：促销活动、天气、经济指标
模型选择：比较 ARIMA、Prophet 和梯度提升
验证：对过去 12 个月进行滚动验证

模型	平均绝对百分比误差	90% 置信区间宽度
ARIMA	12.3%	±15%
Prophet	9.8%	±12%
XGBoost	7.2%	±9%

交付成果： 带有自动重训练管线的生产模型

示例 3：因果归因分析

场景： 市场营销部门希望了解哪些渠道真正推动了转化，而不仅仅是表面相关。

倾向得分匹配：匹配具有相似特征的用户
双重差分法：比较活动前后的变化
工具变量法：解决观察性数据中的选择偏倚

电视广告：投资回报率 3.2 倍（归因最强）
社交媒体：投资回报率 1.1 倍（归因不明确）
电子邮件：投资回报率 5.8 倍（效率最高）

随机化：确保对处理组/对照组的真正随机分配
样本量计算：在开始实验前进行功效分析
多重检验：在检验多个假设时调整显著性水平
控制变量：包含相关协变量以减少方差
持续时间规划：实验运行时间足够长以获得稳定结果

特征工程：创建可解释、可预测的特征
交叉验证：对时间序列数据使用时序感知的划分
模型可解释性：使用 SHAP/LIME 解释预测
验证指标：选择与业务目标一致的指标
防止过拟合：正则化、早停法、保留数据

不确定性量化：始终报告置信区间
显著性解释：P 值不是效应大小
假设检验：验证统计检验的假设
敏感性分析：测试模型选择的稳健性
预注册：在查看结果前记录分析计划

业务转化：将统计术语转化为业务影响
可操作建议：将发现与具体决策联系起来
视觉叙事：从数据中创建引人入胜的叙事
利益相关者沟通：根据对象调整技术细节的深度
文档记录：维护可复现的分析记录

公平性考量：检查跨受保护群体的偏倚
隐私保护：适当匿名化敏感数据
透明度：记录数据来源和方法论
负责任的人工智能：考虑模型的社会影响
数据质量：承认局限性和潜在偏倚

🇺🇸English

Data Scientist

Purpose

Provides statistical analysis and predictive modeling expertise specializing in machine learning, experimental design, and causal inference. Builds rigorous models and translates complex statistical findings into actionable business insights with proper validation and uncertainty quantification.

When to Use

Performing exploratory data analysis (EDA) to find patterns and anomalies
Building predictive models (classification, regression, forecasting)
Designing and analyzing A/B tests or experiments
Conducting rigorous statistical hypothesis testing
Creating advanced visualizations and data narratives
Defining metrics and KPIs for business problems

Core Capabilities

Statistical Modeling

Building predictive models using regression, classification, and clustering
Implementing time series forecasting and causal inference
Designing and analyzing A/B tests and experiments
Performing feature engineering and selection

Machine Learning

Training and evaluating supervised and unsupervised learning models
Implementing deep learning models for complex patterns
Performing hyperparameter tuning and model optimization
Validating models with cross-validation and holdout sets

Data Exploration

Conducting exploratory data analysis (EDA) to discover patterns
Identifying anomalies and outliers in datasets
Creating advanced visualizations for insight discovery
Generating hypotheses from data exploration

Communication and Storytelling

Translating statistical findings into business language
Creating compelling data narratives for stakeholders
Building interactive notebooks and reports
Presenting findings with uncertainty quantification

3. Core Workflows

Workflow 1: Exploratory Data Analysis (EDA) & Cleaning

Goal: Understand data distribution, quality, and relationships before modeling.

Steps:

Load and Profile Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("customer_data.csv")

# Basic profiling
print(df.info())
print(df.describe())

# Missing values analysis
missing = df.isnull().sum() / len(df)
print(missing[missing > 0].sort_values(ascending=False))

Univariate Analysis (Distributions)

# Numerical features
num_cols = df.select_dtypes(include=[np.number]).columns
for col in num_cols:
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    sns.histplot(df[col], kde=True)
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.show()

# Categorical features
cat_cols = df.select_dtypes(exclude=[np.number]).columns
for col in cat_cols:
    print(df[col].value_counts(normalize=True))

Bivariate Analysis (Relationships)

# Correlation matrix
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Target vs Features
target = 'churn'
sns.boxplot(x=target, y='tenure', data=df)

Verification:

No missing values in critical columns.
Distributions understood (normal vs skewed).
Target variable balance checked.

Workflow 3: A/B Test Analysis

Goal: Analyze results of a website conversion experiment.

Steps:

Define Hypothesis
- H0: Conversion Rate B <= Conversion Rate A
- H1: Conversion Rate B > Conversion Rate A
- Alpha: 0.05

Load and Aggregate Data

# data: ['user_id', 'group', 'converted']
results = df.groupby('group')['converted'].agg(['count', 'sum', 'mean'])
results.columns = ['n_users', 'conversions', 'conversion_rate']
print(results)

Statistical Test (Proportions Z-test)

from statsmodels.stats.proportion import proportions_ztest

control = results.loc['A']
treatment = results.loc['B']

count = np.array([treatment['conversions'], control['conversions']])
nobs = np.array([treatment['n_users'], control['n_users']])

stat, p_value = proportions_ztest(count, nobs, alternative='larger')

print(f"Z-statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

Confidence Intervals

from statsmodels.stats.proportion import proportion_confint

(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(count, nobs, alpha=0.05)

print(f"Control CI: [{lower_con:.4f}, {upper_con:.4f}]")
print(f"Treatment CI: [{lower_treat:.4f}, {upper_treat:.4f}]")

Workflow 5: Causal Inference (Propensity Score Matching)

Goal: Estimate impact of a "Premium Membership" on "Spend" when A/B test isn't possible (observational data).

Steps:

Problem Setup
- Treatment: Premium Member (1) vs Free (0)
- Outcome: Annual Spend ($)
- Confounders: Age, Income, Location, Tenure (Factors affecting both membership and spend)

Calculate Propensity Scores

from sklearn.linear_model import LogisticRegression

# P(Treatment=1 | Confounders)
confounders = ['age', 'income', 'tenure']
logit = LogisticRegression()
logit.fit(df[confounders], df['is_premium'])

df['propensity_score'] = logit.predict_proba(df[confounders])[:, 1]

# Check overlap (Common Support)
sns.histplot(data=df, x='propensity_score', hue='is_premium', element='step')

Matching (Nearest Neighbor)

from sklearn.neighbors import NearestNeighbors

# Separate groups
treatment = df[df['is_premium'] == 1]
control = df[df['is_premium'] == 0]

# Find neighbors for treatment group in control group
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treatment[['propensity_score']])

# Create matched dataframe
matched_control = control.iloc[indices.flatten()]

# Compare outcomes
ate = treatment['spend'].mean() - matched_control['spend'].mean()
print(f"Average Treatment Effect (ATE): ${ate:.2f}")

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Data Leakage

What it looks like:

Scaling/Standardizing the entire dataset before train/test split.
Using future information (e.g., "next_month_churn") as a feature.
Including target-derived features (e.g., mean target encoding) calculated on the whole set.

Why it fails:

Model performance is artificially inflated during training/validation.
Fails completely in production on new, unseen data.

Correct approach:

Split FIRST , then transform.
Fit scalers/encoders ONLY on X_train, then transform X_test.
Use Pipeline objects to ensure safety.

❌ Anti-Pattern 2: P-Hacking (Data Dredging)

What it looks like:

Testing 50 different hypotheses or subgroups.
Reporting only the one result with p < 0.05.
Stopping an A/B test exactly when significance is reached (peeking).

Why it fails:

High probability of False Positives (Type I error).
Findings are random noise, not reproducible effects.

Correct approach:

Pre-register hypotheses.
Apply Bonferroni correction or False Discovery Rate (FDR) control for multiple comparisons.
Determine sample size before the experiment and stick to it.

❌ Anti-Pattern 3: Ignoring Imbalanced Classes

What it looks like:

Training a fraud detection model on data with 0.1% fraud.
Reporting 99.9% Accuracy as "Success".

Why it fails:

The model simply predicts "No Fraud" for everyone.
Fails to detect the actual class of interest.

Correct approach:

Use appropriate metrics: Precision-Recall AUC , F1-Score.
Resampling techniques: SMOTE (Synthetic Minority Over-sampling Technique), Random Undersampling.
Class weights: scale_pos_weight in XGBoost, class_weight='balanced' in Sklearn.

7. Quality Checklist

Methodology & Rigor:

Hypothesis defined clearly before analysis.
Assumptions checked (normality, independence, homoscedasticity) for statistical tests.
Train/Test/Validation split performed correctly (no leakage).
Imbalanced classes handled appropriate (metrics, resampling).
Cross-validation used for model assessment.

Code & Reproducibility:

Code stored in git with requirements.txt or environment.yml.
Random seeds set for reproducibility (random_state=42).
Hardcoded paths replaced with relative paths or config variables.
Complex logic wrapped in functions/classes with docstrings.

Interpretation & Communication:

Results interpreted in business terms (e.g., "Revenue lift" vs "Log-loss decrease").
Confidence intervals provided for estimates.
"Black box" models explained using SHAP or LIME if needed.
Caveats and limitations explicitly stated.

Performance:

EDA performed on sampled data if dataset > 10GB.
Vectorized operations used (pandas/numpy) instead of loops.
Query optimized (filtering early, selecting only needed columns).

Examples

Example 1: A/B Test Analysis for Feature Launch

Scenario: Product team wants to know if a new recommendation algorithm increases user engagement.

Analysis Approach:

Experimental Design : Random assignment (50/50), minimum sample size calculation
Data Collection : Tracked click-through rate, time on page, conversion
Statistical Testing : Two-sample t-test with bootstrapped confidence intervals
Results : Significant improvement in CTR (p < 0.01), 12% lift

Key Analysis:

# Bootstrap confidence interval for difference in means
from scipy import stats
diff = treatment_means - control_means
ci = np.percentile(bootstrap_diffs, [2.5, 97.5])

Outcome: Feature launched with 95% probability of positive impact

Example 2: Time Series Forecasting for Demand Planning

Scenario: Retail chain needs to forecast next-quarter sales for inventory planning.

Modeling Approach:

Exploratory Analysis : Identified trends, seasonality (weekly, holiday)
Feature Engineering : Promotions, weather, economic indicators
Model Selection : Compared ARIMA, Prophet, and gradient boosting
Validation : Walk-forward validation on last 12 months

Results:

Model	MAPE	90% CI Width
ARIMA	12.3%	±15%
Prophet	9.8%	±12%
XGBoost	7.2%	±9%

Deliverable: Production model with automated retraining pipeline

Example 3: Causal Attribution Analysis

Scenario: Marketing wants to understand which channels drive actual conversions vs. appear correlated.

Causal Methods:

Propensity Score Matching : Match users with similar characteristics
Difference-in-Differences : Compare changes before/after campaigns
Instrumental Variables : Address selection bias in observational data

Key Findings:

TV ads: 3.2x ROAS (strongest attribution)
Social media: 1.1x ROAS (attribution unclear)
Email: 5.8x ROAS (highest efficiency)

Best Practices

Experimental Design

Randomization : Ensure true random assignment to treatment/control
Sample Size Calculation : Power analysis before starting experiments
Multiple Testing : Adjust significance levels when testing multiple hypotheses
Control Variables : Include relevant covariates to reduce variance
Duration Planning : Run experiments long enough for stable results

Model Development

Feature Engineering : Create interpretable, predictive features
Cross-Validation : Use time-aware splits for time series data
Model Interpretability : Use SHAP/LIME to explain predictions
Validation Metrics : Choose metrics aligned with business objectives
Overfitting Prevention : Regularization, early stopping, held-out data

Statistical Rigor

Uncertainty Quantification : Always report confidence intervals
Significance Interpretation : P-value is not effect size
Assumption Checking : Validate statistical test assumptions
Sensitivity Analysis : Test robustness to modeling choices
Pre-registration : Document analysis plan before seeing results

Communication and Impact

Business Translation : Convert statistical terms to business impact
Actionable Recommendations : Tie findings to specific decisions
Visual Storytelling : Create compelling narratives from data
Stakeholder Communication : Tailor level of technical detail
Documentation : Maintain reproducible analysis records

Ethical Data Science

Fairness Considerations : Check for bias across protected groups
Privacy Protection : Anonymize sensitive data appropriately
Transparency : Document data sources and methodology
Responsible AI : Consider societal impact of models
Data Quality : Acknowledge limitations and potential biases

Weekly Installs

100

Repository

404kidwiz/claud…e-skills

GitHub Stars

First Seen

Jan 24, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

opencode81

codex75

gemini-cli74

claude-code72

cursor68

github-copilot59

AI Elements：基于shadcn/ui的AI原生应用组件库，快速构建对话界面

69,600 周安装

Data Cleaning

# Impute missing values
df['age'].fillna(df['age'].median(), inplace=True)
df['category'].fillna('Unknown', inplace=True)

# Handle outliers (Example: Cap at 99th percentile)
cap = df['income'].quantile(0.99)
df['income'] = np.where(df['income'] > cap, cap, df['income'])

Conclusion

If p-value < 0.05: Reject H0. Variation B is statistically significantly better.
Check practical significance (Lift magnitude).

Validation (Balance Check)

Check if confounders are balanced after matching (e.g., Mean Age of Treatment vs Matched Control should be similar).
abs(mean_diff) / pooled_std < 0.1 (Standardized Mean Difference).

数据科学家技能：机器学习、统计建模与A/B测试分析实战指南

🇨🇳中文介绍

数据科学家

目的

使用场景

核心能力

统计建模

机器学习

数据探索

沟通与叙事

3. 核心工作流

工作流 1：探索性数据分析与数据清洗

相关 Skills

工作流 3：A/B 测试分析

工作流 5：因果推断（倾向得分匹配）

5. 反模式与陷阱

❌ 反模式 1：数据泄露

❌ 反模式 2：P 值操纵（数据挖掘）

❌ 反模式 3：忽略类别不平衡

7. 质量检查清单

示例

示例 1：功能发布的 A/B 测试分析

示例 2：需求规划的时间序列预测

示例 3：因果归因分析

最佳实践

实验设计

模型开发

统计严谨性

沟通与影响

伦理数据科学

🇺🇸English

Data Scientist

Purpose

When to Use

Core Capabilities

Statistical Modeling

Machine Learning

Data Exploration

Communication and Storytelling

3. Core Workflows

Workflow 1: Exploratory Data Analysis (EDA) & Cleaning

Workflow 3: A/B Test Analysis

Workflow 5: Causal Inference (Propensity Score Matching)

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Data Leakage

❌ Anti-Pattern 2: P-Hacking (Data Dredging)

❌ Anti-Pattern 3: Ignoring Imbalanced Classes

7. Quality Checklist

Examples

Example 1: A/B Test Analysis for Feature Launch

Example 2: Time Series Forecasting for Demand Planning

Example 3: Causal Attribution Analysis

Best Practices

Experimental Design

Model Development

Statistical Rigor

Communication and Impact

Ethical Data Science

最新 Skills