tooluniverse-statistical-modeling by mims-harvard/tooluniverse
npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-statistical-modeling
Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.
Apply this skill when user asks:
START: What type of outcome variable?
|
+-- CONTINUOUS (height, blood pressure, score)
| +-- Independent observations -> Linear Regression (OLS)
| +-- Repeated measures -> Mixed-Effects Model (LMM)
| +-- Count data -> Poisson/Negative Binomial
|
+-- BINARY (yes/no, disease/healthy)
| +-- Independent observations -> Logistic Regression
| +-- Repeated measures -> Logistic Mixed-Effects (GLMM/GEE)
| +-- Rare events -> Firth logistic regression
|
+-- ORDINAL (mild/moderate/severe, stages I/II/III/IV)
| +-- Ordinal Logistic Regression (Proportional Odds)
|
+-- MULTINOMIAL (>2 unordered categories)
| +-- Multinomial Logistic Regression
|
+-- TIME-TO-EVENT (survival time + censoring)
  +-- Regression -> Cox Proportional Hazards
  +-- Survival curves -> Kaplan-Meier
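The decision tree above can be sketched as a small lookup helper. This is only an illustration: the function name `suggest_model` and its flags are hypothetical, and count outcomes are given their own key here even though the tree lists them under the continuous branch.

```python
def suggest_model(outcome: str, repeated: bool = False, rare_events: bool = False) -> str:
    """Return the model family the decision tree names for an outcome type.

    Hypothetical helper: the labels are copied from the tree, the API is not
    part of the skill itself.
    """
    outcome = outcome.lower()
    if outcome == "continuous":
        return "Mixed-Effects Model (LMM)" if repeated else "Linear Regression (OLS)"
    if outcome == "count":
        return "Poisson/Negative Binomial"
    if outcome == "binary":
        if repeated:
            return "Logistic Mixed-Effects (GLMM/GEE)"
        return "Firth logistic regression" if rare_events else "Logistic Regression"
    if outcome == "ordinal":
        return "Ordinal Logistic Regression (Proportional Odds)"
    if outcome == "multinomial":
        return "Multinomial Logistic Regression"
    if outcome == "time-to-event":
        return "Cox Proportional Hazards"
    raise ValueError(f"Unknown outcome type: {outcome!r}")


print(suggest_model("binary", repeated=True))
```

Kaplan-Meier is left out of the helper because it describes survival curves rather than a regression choice.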
Goal: Load data, identify variable types, check for missing values.
CRITICAL: Identify the Outcome Variable First
Before any analysis, verify what you're actually predicting:
Common mistake: A question mentions "obesity" and the outcome is assumed to be BMI >= 30, which is circular when BMI is also a predictor. Always check the data columns first: print(df.columns.tolist())
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
print(f"Observations: {len(df)}, Variables: {len(df.columns)}, Missing: {df.isnull().sum().sum()}")

# Rough variable typing by cardinality and dtype
for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and df[col].dtype == 'object':
        print(f"{col}: categorical ({n_unique} levels)")
    elif pd.api.types.is_numeric_dtype(df[col]):
        print(f"{col}: continuous (mean={df[col].mean():.2f})")
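The loading snippet above reports only the total number of missing cells. A per-column breakdown is often more useful when deciding which variables can enter a model. The sketch below assumes a small toy DataFrame in place of `data.csv`; the column names and the helper `missing_report` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
toy = pd.DataFrame({
    "age": [34, 51, np.nan, 47],
    "sex": ["F", "M", "M", None],
    "bmi": [22.1, 30.4, 27.8, 25.0],
})


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and fractions, most incomplete first."""
    n_missing = frame.isnull().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "frac_missing": n_missing / len(frame),
    })
    return report.sort_values("n_missing", ascending=False)


report = missing_report(toy)
print(report)
```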
Goal: Fit the appropriate model based on outcome type.
Use the decision tree above to select the model type, then refer to the appropriate reference file for detailed code:
- references/linear_models.md
- references/logistic_regression.md
- references/ordinal_logistic.md
- references/cox_regression.md
- anova_and_tests.md
Quick reference for key models:
import statsmodels.formula.api as smf
import numpy as np
# Linear regression
model = smf.ols('outcome ~ predictor1 + predictor2', data=df).fit()
# Logistic regression (odds ratios)
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)
ors = np.exp(model.params)
ci = np.exp(model.conf_int())
# Cox proportional hazards
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')
hr = cph.hazard_ratios_['treatment']
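As an illustration of the logistic quick reference, the following sketch collects odds ratios, confidence bounds, and p-values into one table. The data are simulated (an assumption made only so the example runs on its own), and the column names `exposure`, `age`, and `disease` mirror the formula above rather than any real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: exposure raises the odds of disease (true OR = exp(0.8))
rng = np.random.default_rng(42)
n = 500
sim = pd.DataFrame({
    "exposure": rng.integers(0, 2, size=n),
    "age": rng.normal(50, 10, size=n),
})
log_odds = -1.0 + 0.8 * sim["exposure"] + 0.02 * (sim["age"] - 50)
sim["disease"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

model = smf.logit("disease ~ exposure + age", data=sim).fit(disp=0)

# conf_int() returns log-odds bounds in columns 0 (lower) and 1 (upper)
ci = model.conf_int()
or_table = pd.DataFrame({
    "OR": np.exp(model.params),
    "CI_lower": np.exp(ci[0]),
    "CI_upper": np.exp(ci[1]),
    "p_value": model.pvalues,
})
print(or_table.round(3))
```

Exponentiating the bounds of the log-odds interval, rather than building an interval around the odds ratio directly, keeps the interval asymmetric on the ratio scale as expected.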
When data has multiple features (genes, miRNAs, metabolites), use per-feature ANOVA (not aggregate). This is the most common pattern in genomics.
See anova_and_tests.md for the full decision tree, both methods, and worked examples.
Default for gene expression data: Per-feature ANOVA (Method B in anova_and_tests.md).
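A minimal sketch of what per-feature ANOVA can look like, assuming a samples-by-features table and a one-way design. The exact "Method B" procedure lives in anova_and_tests.md and may differ. The gene names, group labels, and simulated values below are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
groups = np.repeat(["control", "low", "high"], 6)  # 18 samples, 6 per group
expr = pd.DataFrame(
    rng.normal(size=(18, 4)),
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],  # hypothetical features
)
expr.loc[groups == "high", "gene_a"] += 3.0  # one truly differential feature

# One F-test per feature, instead of one test on values pooled across features
rows = []
for feature in expr.columns:
    samples = [expr.loc[groups == g, feature].to_numpy() for g in np.unique(groups)]
    f_stat, p_val = f_oneway(*samples)
    rows.append({"feature": feature, "F": f_stat, "p_value": p_val})

anova_table = pd.DataFrame(rows).set_index("feature")
print(anova_table)
```

With many features, the resulting p-values would normally be adjusted for multiple testing before reporting.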
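Goal: Check model assumptions and fit quality.
Key diagnostics by model type:
Cox models: cph.check_assumptions(df[['time', 'event', 'treatment', 'age']]) tests the proportional hazards assumption; lifelines requires the training DataFrame as its first argument.
See references/troubleshooting.md for diagnostic code and common issues.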
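For linear models, a comparable set of checks might look like the sketch below. It is one reasonable selection (residual normality, heteroscedasticity, multicollinearity), not necessarily the list used in references/troubleshooting.md, and the data are simulated so the example runs on its own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated data with well-behaved errors
rng = np.random.default_rng(1)
n = 200
sim = pd.DataFrame({"predictor1": rng.normal(size=n), "predictor2": rng.normal(size=n)})
sim["outcome"] = 2.0 + 1.5 * sim["predictor1"] - 0.5 * sim["predictor2"] + rng.normal(size=n)

model = smf.ols("outcome ~ predictor1 + predictor2", data=sim).fit()
exog = model.model.exog              # design matrix, including the intercept column
exog_names = model.model.exog_names

_, shapiro_p = shapiro(model.resid)                  # normality of residuals
_, bp_p, _, _ = het_breuschpagan(model.resid, exog)  # constant error variance
vif = {
    name: variance_inflation_factor(exog, i)         # multicollinearity
    for i, name in enumerate(exog_names)
    if name != "Intercept"
}

diagnostics = {
    "shapiro_p": shapiro_p,
    "breusch_pagan_p": bp_p,
    "vif": vif,
    "r_squared": model.rsquared,
}
print(diagnostics)
```

Small p-values from the Shapiro-Wilk or Breusch-Pagan tests suggest a violated assumption, and large VIF values suggest collinear predictors; the cutoffs to apply are a judgment call and are not specified in this document.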
Goal: Generate a publication-quality summary.
For every result, report: effect size (OR/HR/coefficient), 95% CI, p-value, and model fit statistic. See bixbench_patterns_summary.md for common question-answer patterns.
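The reporting rule above can be sketched as a small formatter. The helper name `format_effect`, the two-decimal rounding, and the `p < 0.001` cutoff are assumptions chosen for illustration, not a format prescribed by the skill.

```python
def format_effect(label: str, estimate: float, ci_low: float, ci_high: float,
                  p_value: float, kind: str = "OR") -> str:
    """Render one effect as 'label: KIND = est (95% CI low to high), p = ...'."""
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    return f"{label}: {kind} = {estimate:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}), {p_text}"


line = format_effect("exposure", 2.314, 1.52, 3.519, 0.00004)
print(line)  # exposure: OR = 2.31 (95% CI 1.52 to 3.52), p < 0.001
```

Passing `kind="HR"` gives the same layout for a hazard ratio from a Cox model.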
| Pattern | Question Type | Key Steps |
|---|---|---|
| 1 | Odds ratio from ordinal regression | Fit OrderedModel, exp(coef) |
| 2 | Percentage reduction in OR | Compare crude vs adjusted model |
| 3 | Interaction effects | Fit A * B, extract A:B coef |
| 4 | Hazard ratio | Cox PH model, exp(coef) |
| 5 | Multi-feature ANOVA | Per-feature F-stats (not aggregate) |
See bixbench_patterns_summary.md for solution code for each pattern. See references/bixbench_patterns.md for 15+ detailed question patterns.
| Use Case | Library | Reason |
|---|---|---|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Only option |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |
General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).
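For the mixed-effects row of the table, a random-intercept model in statsmodels can be sketched like this. The longitudinal data are simulated, and the names `subject`, `visit`, and `score` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated repeated measures: 40 subjects, 5 visits each, subject-specific baselines
rng = np.random.default_rng(7)
n_subjects, n_visits = 40, 5
subject = np.repeat(np.arange(n_subjects), n_visits)
visit = np.tile(np.arange(n_visits), n_subjects)
subject_intercept = rng.normal(0, 2.0, size=n_subjects)[subject]
long = pd.DataFrame({
    "subject": subject,
    "visit": visit,
    "score": 10 + 0.5 * visit + subject_intercept + rng.normal(0, 1.0, size=subject.size),
})

# Random intercept per subject; 'visit' is the fixed effect of interest
lmm = smf.mixedlm("score ~ visit", data=long, groups=long["subject"]).fit()
slope = lmm.params["visit"]
slope_ci = lmm.conf_int().loc["visit"]
print(f"visit slope = {slope:.2f} (95% CI {slope_ci[0]:.2f} to {slope_ci[1]:.2f})")
```

Fitting the same data with plain OLS would ignore the within-subject correlation that the `groups` argument accounts for.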
statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0
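One way to confirm that an environment meets these minimums is sketched below using only the standard library. The helper names are hypothetical, and the version comparison is deliberately simple: it looks only at the leading numeric components.

```python
import re
from importlib import metadata

MINIMUMS = {
    "statsmodels": "0.14.0",
    "scikit-learn": "1.3.0",
    "lifelines": "0.27.0",
    "pandas": "2.0.0",
    "numpy": "1.24.0",
    "scipy": "1.10.0",
}


def version_tuple(version: str) -> tuple:
    """Leading numeric parts of up to three components, e.g. '2.1.0rc1' -> (2, 1, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(minimums: dict) -> dict:
    """Map each package to 'missing' or its installed version with an ok/below note."""
    status = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            status[package] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        status[package] = f"{installed} ({'ok' if ok else 'below ' + minimum})"
    return status


for package, state in check_requirements(MINIMUMS).items():
    print(f"{package}: {state}")
```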
Before finalizing any statistical analysis:
tooluniverse-statistical-modeling/
+-- SKILL.md # This file (workflow guide)
+-- QUICK_START.md # 8 quick examples
+-- EXAMPLES.md # Legacy examples
+-- TOOLS_REFERENCE.md # ToolUniverse tool catalog
+-- anova_and_tests.md # ANOVA decision tree and code
+-- bixbench_patterns_summary.md # Common BixBench solution patterns
+-- test_skill.py # Test suite
+-- references/
| +-- logistic_regression.md # Detailed logistic examples
| +-- ordinal_logistic.md # Ordinal logit guide
| +-- cox_regression.md # Survival analysis guide
| +-- linear_models.md # OLS and mixed-effects
| +-- bixbench_patterns.md # 15+ question patterns
| +-- troubleshooting.md # Diagnostic issues
+-- scripts/
  +-- format_statistical_output.py # Format results for reporting
  +-- model_diagnostics.py # Automated diagnostics
While this skill is primarily computational, ToolUniverse tools can provide data:
| Use Case | Tools |
|---|---|
| Clinical trial data | clinical_trials_search |
| Drug safety outcomes | FAERS_calculate_disproportionality |
| Gene-disease associations | OpenTargets_target_disease_evidence |
| Biomarker data | fda_pharmacogenomic_biomarkers |
See TOOLS_REFERENCE.md for complete tool catalog.
For detailed examples and troubleshooting:
- references/logistic_regression.md
- references/ordinal_logistic.md
- references/cox_regression.md
- references/linear_models.md
- references/bixbench_patterns.md
- anova_and_tests.md
- references/troubleshooting.md