Single-cell RNA-seq quality control tool: an automated QC workflow following scverse best practices

single-cell-rna-qc by anthropics/knowledge-work-plugins

227 weekly installs

11,000 GitHub Stars


Install command

npx skills add https://github.com/anthropics/knowledge-work-plugins --skill single-cell-rna-qc

Research tools · Bioinformatics · Data processing


Single-Cell RNA-seq Quality Control

Automated QC workflow for single-cell RNA-seq data following scverse best practices.

When to Use This Skill

Use when users:

Request quality control or QC on single-cell RNA-seq data
Want to filter low-quality cells or assess data quality
Need QC visualizations or metrics
Ask to follow scverse/scanpy best practices
Request MAD-based filtering or outlier detection

Supported input formats:

.h5ad files (AnnData format from scanpy/Python workflows)
.h5 files (10X Genomics Cell Ranger output)

Default recommendation: Use Approach 1 (complete pipeline) unless the user has specific custom requirements or explicitly requests non-standard filtering logic.

Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)

For standard QC following scverse best practices, use the convenience script scripts/qc_analysis.py:

python3 scripts/qc_analysis.py input.h5ad
# or for 10X Genomics .h5 files:
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5

The script automatically detects the file format and loads it appropriately.
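
A minimal sketch of how extension-based format detection could work. This is not the code in scripts/qc_analysis.py, which this page does not show. The names `detect_format` and `load_any` are hypothetical, and the mapping assumes only the two supported extensions listed above. `scanpy.read_h5ad` and `scanpy.read_10x_h5` are existing scanpy readers.

```python
from pathlib import Path


def detect_format(path):
    """Classify an input file by its extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".h5ad":
        return "h5ad"
    if suffix == ".h5":
        return "10x_h5"
    raise ValueError(f"Unsupported input format: {suffix!r}")


def load_any(path):
    """Load either supported format into an AnnData object."""
    import scanpy as sc  # imported lazily so detect_format works without scanpy

    if detect_format(path) == "h5ad":
        return sc.read_h5ad(path)
    adata = sc.read_10x_h5(path)
    adata.var_names_make_unique()  # 10X gene symbols can repeat
    return adata
```

Checking the extension is the simplest option. A real implementation might also inspect the HDF5 contents, since both formats are HDF5 files.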

When to use this approach:

Standard QC workflow with adjustable thresholds (all cells filtered the same way)
Batch processing multiple datasets
Quick exploratory analysis
User wants the "just works" solution

Requirements: anndata, scanpy, scipy, matplotlib, seaborn, numpy

Parameters:

Customize filtering thresholds and gene patterns using command-line parameters:

--output-dir - Output directory
--mad-counts, --mad-genes, --mad-mt - MAD thresholds for counts/genes/MT%
--mt-threshold - Hard mitochondrial % cutoff
--min-cells - Gene filtering threshold
--mt-pattern, --ribo-pattern, --hb-pattern - Gene name patterns for different species

Use --help to see current default values.
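
For batch processing, the flags above can be assembled programmatically. The sketch below is one possible wrapper, not part of the skill. `build_qc_command` and `run_batch` are hypothetical names, and passing `mt-` to `--mt-pattern` for mouse data assumes the flag accepts a plain gene-name prefix. Check `--help` for the exact pattern syntax and defaults.

```python
import subprocess
import sys


def build_qc_command(input_path, output_dir=None, mt_pattern=None, extra=()):
    """Assemble the argv list for scripts/qc_analysis.py (hypothetical helper)."""
    cmd = [sys.executable, "scripts/qc_analysis.py", str(input_path)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if mt_pattern is not None:
        cmd += ["--mt-pattern", mt_pattern]
    cmd += list(extra)  # any other documented flags, passed through unchanged
    return cmd


def run_batch(input_paths, **kwargs):
    """Run the QC script once per dataset, stopping on the first failure."""
    for path in input_paths:
        subprocess.run(build_qc_command(path, **kwargs), check=True)
```

Omitting a flag leaves the script's own default in place, so the wrapper never has to hard-code threshold values.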

Outputs:

All files are saved to the <input_basename>_qc_results/ directory by default (or to the directory specified by --output-dir):

qc_metrics_before_filtering.png - Pre-filtering visualizations
qc_filtering_thresholds.png - MAD-based threshold overlays
qc_metrics_after_filtering.png - Post-filtering quality metrics
<input_basename>_filtered.h5ad - Clean, filtered dataset ready for downstream analysis
<input_basename>_with_qc.h5ad - Original data with QC annotations preserved
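
The naming scheme above can be expressed as a small helper, for example to check that a run produced every file. This is a sketch under two assumptions: that `<input_basename>` means the input file name without its extension, and that the default directory is created relative to the current working directory. Neither is confirmed by this page. `expected_outputs` is a hypothetical name.

```python
from pathlib import Path


def expected_outputs(input_path, output_dir=None):
    """List the output paths a QC run should produce (hypothetical helper)."""
    base = Path(input_path).stem  # assumed meaning of <input_basename>
    out = Path(output_dir) if output_dir is not None else Path(f"{base}_qc_results")
    names = [
        "qc_metrics_before_filtering.png",
        "qc_filtering_thresholds.png",
        "qc_metrics_after_filtering.png",
        f"{base}_filtered.h5ad",
        f"{base}_with_qc.h5ad",
    ]
    return [out / name for name in names]
```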

If copying outputs for user access, copy individual files (not the entire directory) so users can preview them directly.

Workflow Steps

The script performs the following steps:

Calculate QC metrics - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
Apply MAD-based filtering - Permissive outlier detection using MAD thresholds for counts/genes/MT%
Filter genes - Remove genes detected in few cells
Generate visualizations - Comprehensive before/after plots with threshold overlays
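
To illustrate the MAD-based step, here is a dependency-free sketch of the general technique: a value is flagged when it lies more than `n_mads` median absolute deviations from the median. This is not the implementation of `detect_outliers_mad` in qc_core.py, which this page does not show. `mad_outliers` is a hypothetical name.

```python
from statistics import median


def mad_outliers(values, n_mads):
    """Return one boolean per value: True if it is a MAD-based outlier."""
    center = median(values)
    # MAD: the median of the absolute deviations from the median
    mad = median(abs(v - center) for v in values)
    return [abs(v - center) > n_mads * mad for v in values]


# One cell with a far higher count than the rest is flagged
mask = mad_outliers([1, 2, 3, 4, 100], n_mads=5)
```

Because the median and MAD are barely moved by a few extreme cells, the threshold adapts to each dataset. Note that when more than half the values are identical the MAD is zero, and every differing value is flagged.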

Approach 2: Modular Building Blocks (For Custom Workflows)

For custom analysis workflows or non-standard requirements, use the modular utility functions from scripts/qc_core.py and scripts/qc_plotting.py:

# Run from scripts/ directory, or add scripts/ to sys.path if needed
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions  # Only if visualization needed

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# ... custom analysis logic here

When to use this approach:

Different workflow needed (skip steps, change order, apply different thresholds to subsets)
Conditional logic (e.g., filter neurons differently than other cells)
Partial execution (only metrics/visualization, no filtering)
Integration with other analysis steps in a larger pipeline
Custom filtering criteria beyond what the command-line parameters support

Available utility functions:

From qc_core.py (core QC operations):

calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True) - Calculate QC metrics and annotate adata
detect_outliers_mad(adata, metric, n_mads, verbose=True) - MAD-based outlier detection, returns boolean mask
apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True) - Apply hard cutoffs, returns boolean mask
filter_cells(adata, mask, inplace=False) - Apply boolean mask to filter cells
filter_genes(adata, min_cells=20, min_counts=None, inplace=True) - Filter genes by detection
print_qc_summary(adata, label='') - Print summary statistics

From qc_plotting.py (visualization):

plot_qc_distributions(adata, output_path, title) - Generate comprehensive QC plots
plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path) - Visualize filtering thresholds
plot_qc_after_filtering(adata, output_path) - Generate post-filtering plots

Example custom workflows:

Example 1: Only calculate metrics and visualize, don't filter yet

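import anndata as ad
from qc_core import calculate_qc_metrics, print_qc_summary
from qc_plotting import plot_qc_distributions

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')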

Example 2: Apply only MT% filtering, keep other metrics permissive

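import anndata as ad
from qc_core import apply_hard_threshold, calculate_qc_metrics, filter_cells

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Flag cells above 10% mitochondrial counts, then keep everything else
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')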

Example 3: Different thresholds for different subsets

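import anndata as ad
import numpy as np
from qc_core import apply_hard_threshold, calculate_qc_metrics, filter_cells

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Apply type-specific QC (assumes cell_type metadata exists)
neurons = (adata.obs['cell_type'] == 'neuron').to_numpy()
other_cells = ~neurons

# Neurons tolerate higher MT%, other cells use a stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')

# Merge the subset masks into one full-length mask (assumes boolean masks in subset cell order)
high_mt = np.zeros(adata.n_obs, dtype=bool)
high_mt[neurons] = np.asarray(neuron_qc)
high_mt[other_cells] = np.asarray(other_qc)
adata_filtered = filter_cells(adata, ~high_mt)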

Best Practices

Be permissive with filtering - Default thresholds intentionally retain most cells to avoid losing rare populations
Inspect visualizations - Always review before/after plots to ensure filtering makes biological sense
Consider dataset-specific factors - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
Check gene annotations - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
Iterate if needed - QC parameters may need adjustment based on the specific experiment or tissue type
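
The gene-annotation point deserves a concrete illustration, because a wrong prefix fails silently: no genes match, and every cell reports 0% mitochondrial content. The sketch below computes the percentage for a single cell from plain lists. `pct_mito` is a hypothetical name and is not how qc_core.py computes `pct_counts_mt`.

```python
def pct_mito(gene_names, counts, prefix="MT-"):
    """Percentage of one cell's counts from genes whose name starts with prefix."""
    total = sum(counts)
    if total == 0:
        return 0.0
    mito = sum(c for g, c in zip(gene_names, counts) if g.startswith(prefix))
    return 100.0 * mito / total


genes = ["MT-CO1", "ACTB", "GAPDH"]  # human-style gene symbols
counts = [10, 60, 30]

human = pct_mito(genes, counts, prefix="MT-")  # matches MT-CO1
mouse = pct_mito(genes, counts, prefix="mt-")  # matches nothing in human data
```

If the mitochondrial percentage is zero for every cell, check the prefix against the species before concluding the data is clean.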

Reference Materials

For detailed QC methodology, parameter rationale, and troubleshooting guidance, see references/scverse_qc_guidelines.md. This reference provides:

Detailed explanations of each QC metric and why it matters
Rationale for MAD-based thresholds and why they're better than fixed cutoffs
Guidelines for interpreting QC visualizations (histograms, violin plots, scatter plots)
Species-specific considerations for gene annotations
When and how to adjust filtering parameters
Advanced QC considerations (ambient RNA correction, doublet detection)

Load this reference when users need deeper understanding of the methodology or when troubleshooting QC issues.

Next Steps After QC

Typical downstream analysis steps:

Ambient RNA correction (SoupX, CellBender)
Doublet detection (scDblFinder)
Normalization (log-normalize, scran)
Feature selection and dimensionality reduction
Clustering and cell type annotation

Weekly Installs: 149

Repository: anthropics/know…-plugins

GitHub Stars: 8.9K

First Seen: Jan 31, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)

Installed on: opencode (133), codex (130), gemini-cli (125), github-copilot (121), claude-code (117), cursor (115)
