Biomedical Statistical Modeling Tools - Linear Regression, Logistic Regression, Survival Analysis, Mixed-Effects Models, ANOVA

tooluniverse-statistical-modeling by mims-harvard/tooluniverse

144 weekly installs

1,200 GitHub Stars

Install command

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-statistical-modeling

Data Analysis, Research Tools, Bioinformatics


Statistical Modeling for Biomedical Data Analysis

Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.

Features

Linear Regression - OLS for continuous outcomes with diagnostic tests
Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data
ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
Model Diagnostics - Assumption checking, fit statistics, residual analysis
Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.

When to Use

Apply this skill when the user asks questions such as:

"What is the odds ratio of X associated with Y?"
"What is the hazard ratio for treatment?"
"Fit a linear regression of Y on X1, X2, X3"
"Perform ordinal logistic regression for severity outcome"
"What is the Kaplan-Meier survival estimate at time T?"
"What is the percentage reduction in odds ratio after adjusting for confounders?"
"Run a mixed-effects model with random intercepts"
"Compute the interaction term between A and B"
"What is the F-statistic from ANOVA comparing groups?"
"Test if gene/miRNA expression differs across cell types"

Model Selection Decision Tree

START: What type of outcome variable?
|
+-- CONTINUOUS (height, blood pressure, score)
|   +-- Independent observations -> Linear Regression (OLS)
|   +-- Repeated measures -> Mixed-Effects Model (LMM)
|   +-- Count data -> Poisson/Negative Binomial
|
+-- BINARY (yes/no, disease/healthy)
|   +-- Independent observations -> Logistic Regression
|   +-- Repeated measures -> Logistic Mixed-Effects (GLMM/GEE)
|   +-- Rare events -> Firth logistic regression
|
+-- ORDINAL (mild/moderate/severe, stages I/II/III/IV)
|   +-- Ordinal Logistic Regression (Proportional Odds)
|
+-- MULTINOMIAL (>2 unordered categories)
|   +-- Multinomial Logistic Regression
|
+-- TIME-TO-EVENT (survival time + censoring)
    +-- Regression -> Cox Proportional Hazards
    +-- Survival curves -> Kaplan-Meier
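
The decision tree can also be written as a small lookup function. This is a minimal sketch: the function name `choose_model`, its keyword flags, and the order in which the flags are checked are assumptions made for illustration, not part of the skill.

```python
def choose_model(outcome_type, repeated_measures=False, rare_events=False,
                 count_data=False, want_curve=False):
    """Return the model family suggested by the decision tree above.

    All argument names are hypothetical; the strings returned are the
    leaf labels of the tree.
    """
    kind = outcome_type.strip().lower()
    if kind == "continuous":
        if count_data:
            return "Poisson/Negative Binomial"
        if repeated_measures:
            return "Mixed-Effects Model (LMM)"
        return "Linear Regression (OLS)"
    if kind == "binary":
        if repeated_measures:
            return "Logistic Mixed-Effects (GLMM/GEE)"
        if rare_events:
            return "Firth logistic regression"
        return "Logistic Regression"
    if kind == "ordinal":
        return "Ordinal Logistic Regression (Proportional Odds)"
    if kind == "multinomial":
        return "Multinomial Logistic Regression"
    if kind == "time-to-event":
        return "Kaplan-Meier" if want_curve else "Cox Proportional Hazards"
    raise ValueError(f"Unknown outcome type: {outcome_type!r}")


print(choose_model("binary", repeated_measures=True))
print(choose_model("time-to-event"))
```

Raising on an unknown outcome type, rather than falling back to a default model, keeps a mislabeled outcome from silently reaching the fitting step.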

Workflow

Phase 0: Data Validation

Goal: Load data, identify variable types, and check for missing values.

CRITICAL: Identify the Outcome Variable First

Before any analysis, verify what you're actually predicting:

Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
Examine available columns - List all columns in the dataset
Match question to data - Find the column that matches the described outcome
Verify outcome exists - Don't create outcome variables from predictors

Common mistake: The question mentions "obesity", so the outcome is assumed to be BMI >= 30. That is circular when BMI is also a predictor. Always check the data columns first: print(df.columns.tolist())

import pandas as pd

df = pd.read_csv('data.csv')
print(f"Observations: {len(df)}, Variables: {len(df.columns)}, Missing: {df.isnull().sum().sum()}")

# Rough variable-type guess per column; confirm against the data dictionary
for col in df.columns:
    n_unique = df[col].nunique()  # NaN is not counted as a level
    is_numeric = pd.api.types.is_numeric_dtype(df[col])
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and not is_numeric:
        print(f"{col}: categorical ({n_unique} levels)")
    elif is_numeric:
        print(f"{col}: continuous (mean={df[col].mean():.2f})")
    else:
        print(f"{col}: other ({n_unique} unique values)")
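
The outcome check described above can be made explicit before any model is fitted. The sketch below is one possible guard, not part of the skill: the helper name `resolve_outcome` and the example column names are hypothetical.

```python
def resolve_outcome(columns, outcome_name):
    """Return the dataset column that matches outcome_name, ignoring case.

    Raises KeyError instead of letting the caller derive an outcome
    from a predictor (for example, obesity from BMI).
    """
    lookup = {str(c).lower(): c for c in columns}
    key = outcome_name.strip().lower()
    if key not in lookup:
        raise KeyError(
            f"No column named {outcome_name!r}; available: {list(columns)}"
        )
    return lookup[key]


# Hypothetical column list, e.g. from df.columns.tolist()
columns = ["BMI", "age", "Obesity"]
print(resolve_outcome(columns, "obesity"))
```

If the column is missing, the error lists what is available, which is the point at which to re-read the question rather than construct the outcome.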

Phase 1: Model Fitting

Goal: Fit the appropriate model for the outcome type.

Use the decision tree above to select the model type, then refer to the matching reference file for detailed code:

Linear Regression: references/linear_models.md
Logistic Regression (binary): references/logistic_regression.md
Ordinal Logistic: references/ordinal_logistic.md
Cox Proportional Hazards: references/cox_regression.md
ANOVA / Statistical Tests: anova_and_tests.md

Quick reference for key models:

import numpy as np
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter

# Column names below are placeholders; df is the DataFrame loaded in Phase 0

# Linear regression
ols_model = smf.ols('outcome ~ predictor1 + predictor2', data=df).fit()

# Logistic regression (odds ratios)
logit_model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)
ors = np.exp(logit_model.params)     # exponentiated coefficients = odds ratios
ci = np.exp(logit_model.conf_int())  # 95% CI on the odds-ratio scale

# Cox proportional hazards
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')
hr = cph.hazard_ratios_['treatment']  # exp(coef) for treatment
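
The quick reference does not show a Kaplan-Meier estimate, which the decision tree lists for survival curves. The sketch below hand-computes the product-limit estimate at a time T in plain Python so the arithmetic is visible. The function name and the toy data are made up for illustration; in practice a library estimator such as lifelines' KaplanMeierFitter would normally be used instead.

```python
def km_survival_at(durations, events, t):
    """Kaplan-Meier (product-limit) survival estimate S(t).

    durations: follow-up time per subject
    events:    1 if the event was observed, 0 if censored
    """
    survival = 1.0
    # Only distinct times at which an event occurred, up to and including t
    event_times = sorted({d for d, e in zip(durations, events) if e and d <= t})
    for time in event_times:
        at_risk = sum(1 for d in durations if d >= time)
        n_events = sum(1 for d, e in zip(durations, events) if d == time and e)
        survival *= 1.0 - n_events / at_risk
    return survival


# Toy data (invented): subjects 2 and 5 are censored
durations = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]
print(round(km_survival_at(durations, events, t=3), 4))  # 0.5333
```

Censored subjects never trigger a step down in the curve, but they still count in the at-risk set until their censoring time, which is why the second factor is 2/3 rather than 3/4.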

Phase 1b: ANOVA for Multi-Feature Data

When the data contain many features (genes, miRNAs, metabolites), fit a separate ANOVA for each feature rather than a single ANOVA on values aggregated across features. This is the most common pattern in genomics.

See anova_and_tests.md for the full decision tree, both methods, and worked examples.

Default for gene expression data: per-feature ANOVA (Method B).
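
A per-feature ANOVA can be sketched as a loop that calls `scipy.stats.f_oneway` once per feature. This is only an illustration under stated assumptions: the data are simulated, the feature and group names are invented, and the code is not taken from anova_and_tests.md, so it may differ from the method documented there.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated design: 3 cell types, 10 samples each (all names hypothetical)
groups = np.repeat(["A", "B", "C"], 10)
n = len(groups)
expr = {
    # Expression shifted upward in group C only
    "gene_shifted": rng.normal(0, 1, n) + np.where(groups == "C", 5.0, 0.0),
    # No group difference by construction
    "gene_flat": rng.normal(0, 1, n),
}


def per_feature_anova(features, labels):
    """Run a one-way ANOVA separately for each feature.

    Returns {feature_name: (F statistic, p-value)}.
    """
    results = {}
    for name, values in features.items():
        samples = [values[labels == g] for g in np.unique(labels)]
        f_stat, p_value = stats.f_oneway(*samples)
        results[name] = (float(f_stat), float(p_value))
    return results


results = per_feature_anova(expr, groups)
for name, (f_stat, p_value) in results.items():
    print(f"{name}: F={f_stat:.2f}, p={p_value:.3g}")
```

Each feature gets its own F statistic and p-value, so the resulting p-values usually need a multiple-testing correction before they are reported.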

Phase 2: Model Diagnostics

Goal: Check model assumptions and fit quality.

Key diagnostics by model type:

OLS: Shapiro-Wilk (normality), Breusch-Pagan (heteroscedasticity), VIF (multicollinearity)
Cox: Proportional hazards test via cph.check_assumptions()
Logistic: Hosmer-Lemeshow, ROC/AUC

See references/troubleshooting.md for diagnostic code and common issues.
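
As a rough illustration of the OLS diagnostics listed above, the sketch below fits a model to simulated data and runs the three checks. The data, column names, and coefficients are invented, and this is not the code from references/troubleshooting.md.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated data with two independent predictors (names are hypothetical)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=df).fit()

# Normality of residuals (Shapiro-Wilk)
shapiro_stat, shapiro_p = stats.shapiro(model.resid)

# Heteroscedasticity (Breusch-Pagan): LM statistic, LM p-value, F, F p-value
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(model.resid, model.model.exog)

# Multicollinearity (VIF) for each predictor, skipping the intercept column
exog = model.model.exog
vif = {
    name: variance_inflation_factor(exog, i)
    for i, name in enumerate(model.model.exog_names)
    if name != "Intercept"
}

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Breusch-Pagan p = {bp_lm_p:.3f}")
print({name: round(value, 2) for name, value in vif.items()})
```

Small p-values from Shapiro-Wilk or Breusch-Pagan suggest a violated assumption, and large VIF values suggest collinear predictors; the thresholds to act on are a judgment call that this sketch does not settle.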

Phase 3: Interpretation

Goal: Generate a publication-quality summary.

For every result, report: effect size (OR/HR/coefficient), 95% CI, p-value, and model fit statistic. See bixbench_patterns_summary.md for common question-answer patterns.
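
One way to keep reports consistent is a small formatting helper for the effect size, interval, and p-value. The sketch below is an assumption about layout, not the output of scripts/format_statistical_output.py; the function name and the example numbers are invented.

```python
def format_effect(label, estimate, ci_low, ci_high, p_value, decimals=2):
    """Format one effect estimate with its 95% CI and p-value."""
    # Very small p-values are reported as a bound rather than as 0.000
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    return (
        f"{label} = {estimate:.{decimals}f} "
        f"(95% CI {ci_low:.{decimals}f} to {ci_high:.{decimals}f}), {p_text}"
    )


# Invented numbers, for illustration only
print(format_effect("OR", 1.8512, 1.2034, 2.8476, 0.0049))
# OR = 1.85 (95% CI 1.20 to 2.85), p = 0.005
```

The `decimals` argument makes it easy to match the precision a question asks for.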

Common BixBench Patterns

| Pattern | Question Type | Key Steps |
|---|---|---|
| 1 | Odds ratio from ordinal regression | Fit OrderedModel, exp(coef) |
| 2 | Percentage reduction in OR | Compare crude vs adjusted model |
| 3 | Interaction effects | Fit `A * B`, extract `A:B` coef |
| 4 | Hazard ratio | Cox PH model, exp(coef) |
| 5 | Multi-feature ANOVA | Per-feature F-stats (not aggregate) |

See bixbench_patterns_summary.md for solution code for each pattern. See references/bixbench_patterns.md for 15+ detailed question patterns.
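
Pattern 2 comes down to simple arithmetic once the crude and adjusted models are fitted. The sketch below assumes the reduction is measured relative to the crude odds ratio, which is one common definition; the source does not state the formula, so check the wording of the question (some analyses use the excess odds ratio, OR - 1, instead). The function names and coefficients are hypothetical.

```python
import math


def odds_ratio(coef):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coef)


def pct_reduction_in_or(or_crude, or_adjusted):
    """Percentage reduction in the OR after adjustment, relative to the crude OR.

    Assumed definition: (OR_crude - OR_adjusted) / OR_crude * 100.
    """
    return (or_crude - or_adjusted) / or_crude * 100.0


# Invented coefficients: crude log-odds ln(2.0), adjusted log-odds ln(1.5)
or_crude = odds_ratio(math.log(2.0))
or_adjusted = odds_ratio(math.log(1.5))
print(round(pct_reduction_in_or(or_crude, or_adjusted), 1))  # 25.0
```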

Statsmodels vs Scikit-learn

| Use Case | Library | Reason |
|---|---|---|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Not available in scikit-learn |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |

General rule: Use statsmodels for BixBench questions, since they ask for p-values, ORs, and HRs.

Python Package Requirements

statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0

Key Principles

Data-first approach - Always inspect and validate data before modeling
Model selection by outcome type - Use decision tree above
Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
Confounder awareness - Adjust for confounders when specified or clinically relevant
Reproducible analysis - All code must be deterministic and reproducible
Robust error handling - Graceful handling of convergence failures, separation, collinearity
Round correctly - Match the precision requested (typically 2-4 decimal places)

Completeness Checklist

Before finalizing any statistical analysis:

Outcome variable identified: Verified which column is the actual outcome
Data validated: N, missing values, variable types confirmed
Multi-feature data identified: If multiple features, use per-feature approach
Model appropriate: Outcome type matches model family
Assumptions checked: Relevant diagnostics performed
Effect sizes reported: OR/HR/Cohen's d with CIs
P-values reported: With appropriate correction if needed
Model fit assessed: R-squared, AIC/BIC, concordance
Results interpreted: Plain-language interpretation
Precision correct: Numbers rounded appropriately
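
The checklist asks for corrected p-values where needed but does not name a method. As one example, the sketch below hand-implements the Benjamini-Hochberg false discovery rate adjustment in plain Python; choosing Benjamini-Hochberg here is an assumption, and the function name and input values are invented. The statsmodels function `multipletests` with `method='fdr_bh'` is the library route.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented p-values, e.g. one per feature from a per-feature ANOVA
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(raw)])  # [0.02, 0.04, 0.04, 0.02]
```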

File Structure

tooluniverse-statistical-modeling/
+-- SKILL.md                          # This file (workflow guide)
+-- QUICK_START.md                    # 8 quick examples
+-- EXAMPLES.md                       # Legacy examples
+-- TOOLS_REFERENCE.md                # ToolUniverse tool catalog
+-- anova_and_tests.md                # ANOVA decision tree and code
+-- bixbench_patterns_summary.md      # Common BixBench solution patterns
+-- test_skill.py                     # Test suite
+-- references/
|   +-- logistic_regression.md        # Detailed logistic examples
|   +-- ordinal_logistic.md           # Ordinal logit guide
|   +-- cox_regression.md             # Survival analysis guide
|   +-- linear_models.md              # OLS and mixed-effects
|   +-- bixbench_patterns.md          # 15+ question patterns
|   +-- troubleshooting.md            # Diagnostic issues
+-- scripts/
    +-- format_statistical_output.py  # Format results for reporting
    +-- model_diagnostics.py          # Automated diagnostics

ToolUniverse Integration

While this skill is primarily computational, ToolUniverse tools can provide data:

| Use Case | Tools |
|---|---|
| Clinical trial data | `clinical_trials_search` |
| Drug safety outcomes | `FAERS_calculate_disproportionality` |
| Gene-disease associations | `OpenTargets_target_disease_evidence` |
| Biomarker data | `fda_pharmacogenomic_biomarkers` |

See TOOLS_REFERENCE.md for the complete tool catalog.

References

statsmodels: https://www.statsmodels.org/
lifelines: https://lifelines.readthedocs.io/
scikit-learn: https://scikit-learn.org/
Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel

Support

For detailed examples and troubleshooting:

Logistic regression: references/logistic_regression.md
Ordinal models: references/ordinal_logistic.md
Survival analysis: references/cox_regression.md
Linear/mixed models: references/linear_models.md
BixBench patterns: references/bixbench_patterns.md
ANOVA and tests: anova_and_tests.md
Diagnostics: references/troubleshooting.md

Weekly Installs: 125
Repository: mims-harvard/to…universe
GitHub Stars: 1.2K
First Seen: Feb 19, 2026

