statsmodels by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill statsmodels

Statsmodels 是 Python 中用于统计建模的首要库,提供了涵盖广泛统计方法的估计、推断和诊断工具。应用此技能可进行严谨的统计分析,从简单的线性回归到复杂的时间序列模型和计量经济学分析。
此技能应在以下情况下使用:
import statsmodels.api as sm
import numpy as np
import pandas as pd

# Prepare data - always add a constant column so the model fits an intercept
X = sm.add_constant(X_data)

# Fit an ordinary least squares model
model = sm.OLS(y, X)
results = model.fit()

# Full summary table: coefficients, standard errors, R-squared, diagnostics
print(results.summary())

# Key results (f-strings use real newlines, not literal "\n" text)
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\n{results.params}")
print(f"P-values:\n{results.pvalues}")

# Predictions with confidence and prediction intervals
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary)  # includes mean, CI, prediction intervals

# Heteroscedasticity diagnostic: Breusch-Pagan test on the residuals
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")

# Residuals vs. fitted plot - should show no pattern if assumptions hold
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
from statsmodels.discrete.discrete_model import Logit

# Add a constant column for the intercept
X = sm.add_constant(X_data)

# Fit the logistic regression model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())

# Odds ratios: exponentiated coefficients
odds_ratios = np.exp(results.params)
print("Odds ratios:\n", odds_ratios)

# Predicted probabilities on the training data
probs = results.predict(X)

# Binary predictions using a 0.5 threshold
predictions = (probs > 0.5).astype(int)

# Model evaluation with sklearn metrics
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")

# Average marginal effects on the probability scale
marginal = results.get_margeff()
print(marginal.summary())
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Check stationarity with the Augmented Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
if adf_result[1] > 0.05:
    # Series is non-stationary: difference it, then inspect the ACF/PACF
    # of the differenced series to identify candidate p and q orders
    y_diff = y_series.diff().dropna()
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    plot_acf(y_diff, lags=40, ax=ax1)
    plot_pacf(y_diff, lags=40, ax=ax2)
    plt.show()

# Fit ARIMA(p,d,q); d=1 performs one round of differencing internally
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Point forecast for the next 10 steps
forecast = results.forecast(steps=10)
# get_forecast additionally provides interval estimates
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df)  # includes mean and confidence intervals

# Residual diagnostics: standardized residuals, histogram, Q-Q, correlogram
results.plot_diagnostics(figsize=(12, 8))
plt.show()
import statsmodels.api as sm

# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

# Rate ratios: exponentiated coefficients (valid for the log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\n", rate_ratios)

# Check overdispersion: Pearson chi2 / residual df should be close to 1
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
    # Variance exceeds the Poisson assumption - use Negative Binomial instead.
    # Note: NegativeBinomial lives in discrete_model (count_model only holds
    # the zero-inflated variants).
    from statsmodels.discrete.discrete_model import NegativeBinomial
    nb_model = NegativeBinomial(y_counts, X)
    nb_results = nb_model.fit()
    print(nb_results.summary())
针对具有各种误差结构的连续结果的综合线性模型套件。
可用模型:
关键特性:
何时使用: 连续结果变量,需要对系数进行推断,需要诊断
参考: 有关模型选择、诊断和最佳实践的详细指导,请参阅 references/linear_models.md。
将线性模型扩展到非正态分布的灵活框架。
分布族:
链接函数:
关键特性:
何时使用: 非正态结果,需要灵活的方差和链接函数规范
参考: 有关族选择、链接函数、解释和诊断的详细指导,请参阅 references/glm.md。
适用于分类和计数结果的模型。
二元模型:
多项模型:
计数模型:
关键特性:
何时使用: 二元、分类或计数结果
参考: 有关模型选择、解释和评估的详细指导,请参阅 references/discrete_choice.md。
全面的时间序列建模和预测能力。
单变量模型:
多变量模型:
高级模型:
关键特性:
何时使用: 时间顺序数据、预测、理解时间动态
参考: 有关模型选择、诊断和预测方法的详细指导,请参阅 references/time_series.md。
用于模型验证的广泛测试和诊断能力。
残差诊断:
影响与异常值:
假设检验:
多重比较:
效应大小与功效:
稳健推断:
何时使用: 验证假设、检测问题、确保稳健推断
参考: 有关全面测试和诊断程序的详细指导,请参阅 references/stats_diagnostics.md。
Statsmodels 支持 R 风格的公式,用于直观的模型规范:
import statsmodels.formula.api as smf
# OLS via an R-style formula; 'x1:x2' adds only the interaction term
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
# Categorical variables: C(...) triggers automatic dummy coding
results = smf.ols('y ~ x1 + C(category)', data=df).fit()
# Interaction shorthand: '*' expands to main effects plus interaction
results = smf.ols('y ~ x1 * x2', data=df).fit() # x1 + x2 + x1:x2
# Polynomial terms: I(...) protects arithmetic inside the formula
results = smf.ols('y ~ x + I(x**2)', data=df).fit()
# Logistic regression with a formula
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()
# Poisson regression with a formula
results = smf.poisson('count ~ x1 + x2', data=df).fit()
# ARIMA has no formula interface - use the regular API instead
# ARIMA(无法通过公式使用,请使用常规 API)
# Compare candidate models via information criteria
models = {
    'Model 1': model1_results,
    'Model 2': model2_results,
    'Model 3': model3_results,
}
# Map each display column to the results attribute it reads
metrics = {'AIC': 'aic', 'BIC': 'bic', 'Log-Likelihood': 'llf'}
comparison = pd.DataFrame({
    col: {name: getattr(res, attr) for name, res in models.items()}
    for col, attr in metrics.items()
})
print(comparison.sort_values('AIC'))
# Lower AIC/BIC indicates a better fit/complexity trade-off
# 较低的 AIC/BIC 表示更好的模型
# Likelihood-ratio test for nested models (one is a subset of the other)
from scipy import stats

# Twice the log-likelihood gap is asymptotically chi-squared distributed
lr_stat = 2 * (full_model.llf - reduced_model.llf)
df = full_model.df_model - reduced_model.df_model
# Use the survival function rather than 1 - cdf for numerical precision
p_value = stats.chi2.sf(lr_stat, df)
print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Full model significantly better")
else:
    print("Reduced model preferred (parsimony)")
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# 5-fold cross-validated RMSE for an OLS model
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    # Fit on the training fold only
    model = sm.OLS(y_train, X_train).fit()
    # Predict on the held-out fold
    y_pred = model.predict(X_val)
    # Score the fold with RMSE
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    cv_scores.append(rmse)
print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
使用 sm.add_constant() 添加截距(除非明确不需要);使用 .summary() 获取详细输出。此技能包含用于详细指导的综合参考文件:
线性回归模型的详细覆盖,包括:
广义线性模型的完整指南:
离散结果模型的综合指南:
深入的时间序列分析指导:
全面的统计检验和诊断:
何时参考:
搜索模式:
# 查找特定模型的信息
grep -r "Quantile Regression" references/
# 查找诊断检验
grep -r "Breusch-Pagan" references/stats_diagnostics.md
# 查找时间序列指导
grep -r "SARIMAX" references/time_series.md
使用 sm.add_constant() 添加常数项(除非不需要截距)。如需详细文档和示例:
每周安装数
171
仓库
GitHub 星标数
22.6K
首次出现时间
2026年1月21日
安全审计
安装于
opencode138
claude-code138
gemini-cli129
cursor122
codex120
github-copilot113
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
This skill should be used when:
import statsmodels.api as sm
import numpy as np
import pandas as pd

# Prepare data - ALWAYS add a constant column so the model fits an intercept
X = sm.add_constant(X_data)

# Fit an ordinary least squares model
model = sm.OLS(y, X)
results = model.fit()

# Full summary table: coefficients, standard errors, R-squared, diagnostics
print(results.summary())

# Key results (f-strings use real newlines, not literal "\n" text)
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\n{results.params}")
print(f"P-values:\n{results.pvalues}")

# Predictions with confidence and prediction intervals
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary)  # includes mean, CI, prediction intervals

# Heteroscedasticity diagnostic: Breusch-Pagan test on the residuals
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")

# Residuals vs. fitted plot - should show no pattern if assumptions hold
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
from statsmodels.discrete.discrete_model import Logit

# Add a constant column for the intercept
X = sm.add_constant(X_data)

# Fit the logistic regression model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())

# Odds ratios: exponentiated coefficients
odds_ratios = np.exp(results.params)
print("Odds ratios:\n", odds_ratios)

# Predicted probabilities on the training data
probs = results.predict(X)

# Binary predictions using a 0.5 threshold
predictions = (probs > 0.5).astype(int)

# Model evaluation with sklearn metrics
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")

# Average marginal effects on the probability scale
marginal = results.get_margeff()
print(marginal.summary())
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Check stationarity with the Augmented Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
if adf_result[1] > 0.05:
    # Series is non-stationary: difference it, then inspect the ACF/PACF
    # of the differenced series to identify candidate p and q orders
    y_diff = y_series.diff().dropna()
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    plot_acf(y_diff, lags=40, ax=ax1)
    plot_pacf(y_diff, lags=40, ax=ax2)
    plt.show()

# Fit ARIMA(p,d,q); d=1 performs one round of differencing internally
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Point forecast for the next 10 steps
forecast = results.forecast(steps=10)
# get_forecast additionally provides interval estimates
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df)  # includes mean and confidence intervals

# Residual diagnostics: standardized residuals, histogram, Q-Q, correlogram
results.plot_diagnostics(figsize=(12, 8))
plt.show()
import statsmodels.api as sm

# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

# Rate ratios: exponentiated coefficients (valid for the log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\n", rate_ratios)

# Check overdispersion: Pearson chi2 / residual df should be close to 1
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
    # Variance exceeds the Poisson assumption - use Negative Binomial instead.
    # Note: NegativeBinomial lives in discrete_model (count_model only holds
    # the zero-inflated variants).
    from statsmodels.discrete.discrete_model import NegativeBinomial
    nb_model = NegativeBinomial(y_counts, X)
    nb_results = nb_model.fit()
    print(nb_results.summary())
Comprehensive suite of linear models for continuous outcomes with various error structures.
Available models:
Key features:
When to use: Continuous outcome variable, want inference on coefficients, need diagnostics
Reference: See references/linear_models.md for detailed guidance on model selection, diagnostics, and best practices.
Flexible framework extending linear models to non-normal distributions.
Distribution families:
Link functions:
Key features:
When to use: Non-normal outcomes, need flexible variance and link specifications
Reference: See references/glm.md for family selection, link functions, interpretation, and diagnostics.
Models for categorical and count outcomes.
Binary models:
Multinomial models:
Count models:
Key features:
When to use: Binary, categorical, or count outcomes
Reference: See references/discrete_choice.md for model selection, interpretation, and evaluation.
Comprehensive time series modeling and forecasting capabilities.
Univariate models:
Multivariate models:
Advanced models:
Key features:
When to use: Time-ordered data, forecasting, understanding temporal dynamics
Reference: See references/time_series.md for model selection, diagnostics, and forecasting methods.
Extensive testing and diagnostic capabilities for model validation.
Residual diagnostics:
Influence and outliers:
Hypothesis testing:
Multiple comparisons:
Effect sizes and power:
Robust inference:
When to use: Validating assumptions, detecting problems, ensuring robust inference
Reference: See references/stats_diagnostics.md for comprehensive testing and diagnostic procedures.
Statsmodels supports R-style formulas for intuitive model specification:
import statsmodels.formula.api as smf
# OLS via an R-style formula; 'x1:x2' adds only the interaction term
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
# Categorical variables: C(...) triggers automatic dummy coding
results = smf.ols('y ~ x1 + C(category)', data=df).fit()
# Interaction shorthand: '*' expands to main effects plus interaction
results = smf.ols('y ~ x1 * x2', data=df).fit() # x1 + x2 + x1:x2
# Polynomial terms: I(...) protects arithmetic inside the formula
results = smf.ols('y ~ x + I(x**2)', data=df).fit()
# Logistic regression with a formula
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()
# Poisson regression with a formula
results = smf.poisson('count ~ x1 + x2', data=df).fit()
# ARIMA has no formula interface - use the regular API instead
# ARIMA (not available via formula, use regular API)
# Compare candidate models via information criteria
models = {
    'Model 1': model1_results,
    'Model 2': model2_results,
    'Model 3': model3_results,
}
# Map each display column to the results attribute it reads
metrics = {'AIC': 'aic', 'BIC': 'bic', 'Log-Likelihood': 'llf'}
comparison = pd.DataFrame({
    col: {name: getattr(res, attr) for name, res in models.items()}
    for col, attr in metrics.items()
})
print(comparison.sort_values('AIC'))
# Lower AIC/BIC indicates a better fit/complexity trade-off
# Lower AIC/BIC indicates better model
# Likelihood-ratio test for nested models (one is a subset of the other)
from scipy import stats

# Twice the log-likelihood gap is asymptotically chi-squared distributed
lr_stat = 2 * (full_model.llf - reduced_model.llf)
df = full_model.df_model - reduced_model.df_model
# Use the survival function rather than 1 - cdf for numerical precision
p_value = stats.chi2.sf(lr_stat, df)
print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Full model significantly better")
else:
    print("Reduced model preferred (parsimony)")
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# 5-fold cross-validated RMSE for an OLS model
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    # Fit on the training fold only
    model = sm.OLS(y_train, X_train).fit()
    # Predict on the held-out fold
    y_pred = model.predict(X_val)
    # Score the fold with RMSE
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    cv_scores.append(rmse)
print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
Use sm.add_constant() unless you are deliberately excluding the intercept; use .summary() for detailed output. This skill includes comprehensive reference files for detailed guidance:
Detailed coverage of linear regression models including:
Complete guide to generalized linear models:
Comprehensive guide to discrete outcome models:
In-depth time series analysis guidance:
Comprehensive statistical testing and diagnostics:
When to reference:
Search patterns:
# Find information about specific models
grep -r "Quantile Regression" references/
# Find diagnostic tests
grep -r "Breusch-Pagan" references/stats_diagnostics.md
# Find time series guidance
grep -r "SARIMAX" references/time_series.md
Use sm.add_constant() unless no intercept is desired. For detailed documentation and examples:
Weekly Installs
171
Repository
GitHub Stars
22.6K
First Seen
Jan 21, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
opencode138
claude-code138
gemini-cli129
cursor122
codex120
github-copilot113
AI Elements:基于shadcn/ui的AI原生应用组件库,快速构建对话界面
62,200 周安装