pydeseq2 by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill pydeseq2
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
This skill should be used when:
For users who want to perform a standard differential expression analysis:
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# 1. Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)
# 2. Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# 3. Initialize and fit DESeq2
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True
)
dds.deseq2()
# 4. Perform statistical testing
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
# 5. Access results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
Input requirements:
Common data loading patterns:
# From CSV (typical format: genes × samples, needs transpose)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
# From TSV
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
# From AnnData
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
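Note that adata.X is often stored as a scipy sparse matrix, in which case passing it straight to the pd.DataFrame constructor may fail or produce an unusable frame. A minimal densifying guard, as a sketch (the sample and gene names below are hypothetical stand-ins for adata.obs_names / adata.var_names):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Stand-in for adata.X: a sparse counts matrix (2 samples x 3 genes)
X = sp.csr_matrix(np.array([[0, 5, 2], [3, 0, 7]]))

# Densify only when sparse, then wrap in a DataFrame
dense = X.toarray() if sp.issparse(X) else np.asarray(X)
counts_df = pd.DataFrame(dense, index=["s1", "s2"], columns=["g1", "g2", "g3"])
print(counts_df.shape)  # → (2, 3)
```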
Data filtering:
# Remove low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
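PyDESeq2 expects raw, non-negative integer counts, not TPM/FPKM or otherwise normalized values. A quick sanity check before fitting, sketched on a toy frame:

```python
import pandas as pd

counts_df = pd.DataFrame({"g1": [10, 0, 3], "g2": [5, 2, 8]},
                         index=["s1", "s2", "s3"])

# Raw counts must be non-negative whole numbers
assert (counts_df.to_numpy() >= 0).all(), "counts must be non-negative"
assert (counts_df.to_numpy() % 1 == 0).all(), "counts look normalized, not raw"
print("counts look like raw integers")
```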
The design formula specifies how gene expression is modeled.
Single-factor designs:
design = "~condition" # Simple two-group comparison
Multi-factor designs:
design = "~batch + condition" # Control for batch effects
design = "~age + condition" # Include continuous covariate
design = "~group + condition + group:condition" # Interaction effects
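The reference level of a factor determines the direction of fold changes. One common way to make it explicit is to order the pandas Categorical so the reference comes first; depending on the PyDESeq2 version there may also be a dedicated DeseqDataSet argument for this, so treat the exact mechanism as an assumption and check the API reference:

```python
import pandas as pd

metadata = pd.DataFrame({"condition": ["treated", "control", "treated", "control"]},
                        index=["s1", "s2", "s3", "s4"])

# Put "control" first so it acts as the reference level
metadata["condition"] = pd.Categorical(metadata["condition"],
                                       categories=["control", "treated"])
print(metadata["condition"].cat.categories.tolist())  # → ['control', 'treated']
```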
Design formula guidelines:
Initialize the DeseqDataSet and run the complete pipeline:
from pydeseq2.dds import DeseqDataSet
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True,  # Refit after removing outliers
    n_cpus=1           # Parallel processing (adjust as needed)
)
# Run the complete DESeq2 pipeline
dds.deseq2()
What deseq2() does: fits size factors (median-of-ratios normalization), estimates genewise dispersions and the dispersion trend, shrinks dispersions toward the trend, fits log2 fold changes, and computes Cook's distances for outlier detection (refitting flagged genes when refit_cooks=True).
Perform Wald tests to identify differentially expressed genes:
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],  # Test treated vs control
    alpha=0.05,               # Significance threshold
    cooks_filter=True,        # Filter outliers
    independent_filter=True   # Filter low-power tests
)
ds.summary()
Contrast specification:
Format: [variable, test_level, reference_level]
Example: ["condition", "treated", "control"] tests treated vs control
If the contrast is None, the last coefficient in the design is used
Result DataFrame columns:
baseMean: mean normalized count across samples
log2FoldChange: log2 fold change between conditions
lfcSE: standard error of the LFC
stat: Wald test statistic
pvalue: raw p-value
padj: adjusted p-value (FDR-corrected via Benjamini-Hochberg)
Apply shrinkage to reduce noise in fold change estimates:
ds.lfc_shrink() # Applies apeGLM shrinkage
When to use LFC shrinkage:
Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
Save results and intermediate objects:
import pickle
# Export results as CSV
ds.results_df.to_csv("deseq2_results.csv")
# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
# Save DeseqDataSet for later use
with open("dds_result.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)
Standard case-control comparison:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]
Testing multiple treatment groups against control:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}
for treatment in treatments:
    ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
    ds.summary()
    all_results[treatment] = ds.results_df
    sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
    print(f"{treatment}: {sig_count} significant genes")
Control for technical variation:
# Include batch in design
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()
# Test condition while controlling for batch
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Include continuous variables like age or dosage:
# Ensure continuous variable is numeric
metadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
This skill includes a complete command-line script for standard analyses:
# Basic usage
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~condition" \
--contrast condition treated control \
--output results/
# With additional options
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~batch + condition" \
--contrast condition treated control \
--output results/ \
--min-counts 10 \
--alpha 0.05 \
--n-cpus 4 \
--plots
Script features:
Refer users to scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.
# Filter by adjusted p-value
significant = ds.results_df[ds.results_df.padj < 0.05]
# Filter by both significance and effect size
sig_and_large = ds.results_df[
    (ds.results_df.padj < 0.05) &
    (abs(ds.results_df.log2FoldChange) > 1)
]
# Separate up- and down-regulated
upregulated = significant[significant.log2FoldChange > 0]
downregulated = significant[significant.log2FoldChange < 0]
print(f"Upregulated: {len(upregulated)}")
print(f"Downregulated: {len(downregulated)}")
# Sort by adjusted p-value
top_by_padj = ds.results_df.sort_values("padj").head(20)
# Sort by absolute fold change (use shrunk values)
ds.lfc_shrink()
ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)
# Sort by a combined metric
import numpy as np
ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
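A caveat for the combined metric: padj can be exactly 0 for extremely significant genes, which makes -log10 return inf. Clipping to the smallest positive float before the transform keeps the score finite; a sketch on toy values:

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({"padj": [0.0, 1e-10, 0.2],
                        "log2FoldChange": [3.0, -2.0, 0.5]})

# Clip zero p-values before -log10 so the score stays finite
eps = np.nextafter(0, 1)  # smallest positive double
results["score"] = -np.log10(results["padj"].clip(lower=eps)) * results["log2FoldChange"].abs()
print(np.isfinite(results["score"]).all())  # → True
```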
# Check normalization (size factors should be close to 1)
print("Size factors:", dds.obsm["size_factors"])
# Examine dispersion estimates
import matplotlib.pyplot as plt
plt.hist(dds.varm["dispersions"], bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.title("Dispersion Distribution")
plt.show()
# Check p-value distribution (should be mostly flat with peak near 0)
plt.hist(ds.results_df.pvalue.dropna(), bins=50)
plt.xlabel("P-value")
plt.ylabel("Frequency")
plt.title("P-value Distribution")
plt.show()
Visualize significance vs effect size:
import matplotlib.pyplot as plt
import numpy as np
results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)
plt.figure(figsize=(10, 6))
significant = results.padj < 0.05
plt.scatter(
    results.loc[~significant, "log2FoldChange"],
    results.loc[~significant, "-log10(padj)"],
    alpha=0.3, s=10, c='gray', label='Not significant'
)
plt.scatter(
    results.loc[significant, "log2FoldChange"],
    results.loc[significant, "-log10(padj)"],
    alpha=0.6, s=10, c='red', label='padj < 0.05'
)
plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)
Show fold change vs mean expression:
plt.figure(figsize=(10, 6))
plt.scatter(
    np.log10(results.loc[~significant, "baseMean"] + 1),
    results.loc[~significant, "log2FoldChange"],
    alpha=0.3, s=10, c='gray'
)
plt.scatter(
    np.log10(results.loc[significant, "baseMean"] + 1),
    results.loc[significant, "log2FoldChange"],
    alpha=0.6, s=10, c='red'
)
plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)
Issue: "Index mismatch between counts and metadata"
Solution: Ensure sample names match exactly
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())
# Take intersection if needed
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]
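Subsetting both frames with the same index also puts the rows in the same order, which avoids downstream index-mismatch errors. A small self-contained check (sample names here are hypothetical):

```python
import pandas as pd

counts_df = pd.DataFrame({"g1": [1, 2, 3]}, index=["s1", "s2", "s3"])
metadata = pd.DataFrame({"condition": ["a", "b"]}, index=["s2", "s1"])

common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]

# Both frames now share the same samples in the same order
assert (counts_df.index == metadata.index).all()
print(len(common))  # → 2
```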
Issue: "All genes have zero counts"
Solution: Check if data needs transposition
print(f"Counts shape: {counts_df.shape}")
# If genes > samples, transpose is needed
if counts_df.shape[1] < counts_df.shape[0]:
    counts_df = counts_df.T
Issue: "Design matrix is not full rank"
Cause: Confounded variables (e.g., all treated samples in one batch)
Solution: Remove confounded variable or add interaction term
# Check confounding
print(pd.crosstab(metadata.condition, metadata.batch))
# Either simplify design or add interaction
design = "~condition" # Remove batch
# OR
design = "~condition + batch + condition:batch" # Model interaction
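To see why a fully confounded design breaks, the crosstab check can be made concrete: when every condition occurs in exactly one batch, batch and condition carry the same information and the model matrix loses rank. A toy illustration:

```python
import pandas as pd

metadata = pd.DataFrame({
    "condition": ["control", "control", "treated", "treated"],
    "batch":     ["b1",      "b1",      "b2",      "b2"],
})

ct = pd.crosstab(metadata.condition, metadata.batch)
# Each condition appears in exactly one batch -> fully confounded
confounded = (ct > 0).sum(axis=1).eq(1).all()
print(confounded)  # → True
```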
Diagnostics:
# Check dispersion distribution
plt.hist(dds.varm["dispersions"], bins=50)
plt.show()
# Check size factors
print(dds.obsm["size_factors"])
# Look at top genes by raw p-value
print(ds.results_df.nsmallest(20, "pvalue"))
Possible causes:
For comprehensive details beyond this workflow-oriented guide:
API Reference (references/api_reference.md): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.
Workflow Guide (references/workflow_guide.md): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.
Load these references into context when users need:
Read references/api_reference.md
Read references/workflow_guide.md
Read references/workflow_guide.md (see Troubleshooting section)
Data orientation matters: Count matrices typically load as genes × samples but need to be samples × genes. Always transpose with .T if needed.
Sample filtering: Remove samples with missing metadata before analysis to avoid errors.
Gene filtering: Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.
Design formula order: Put adjustment variables before the variable of interest (e.g., "~batch + condition" not "~condition + batch").
LFC shrinkage timing: Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.
Result interpretation: Use padj < 0.05 for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.
Contrast specification: The format is [variable, test_level, reference_level] where test_level is compared against reference_level.
Save intermediate objects: Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.
uv pip install pydeseq2
System requirements:
Optional for visualization:
Weekly Installs: 122
Repository: davila7/claude-code-templates
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Pass
Installed on: claude-code (104), opencode (98), gemini-cli (93), cursor (93), antigravity (86), codex (82)