Scanpy Single-Cell RNA-seq Data Analysis Tutorial | Python Bioinformatics Toolkit

scanpy by davila7/claude-code-templates

206 weekly installs

24,300 GitHub Stars

Install command

npx skills add https://github.com/davila7/claude-code-templates --skill scanpy




Scanpy: Single-Cell Analysis

Overview

Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data, built on AnnData. Apply this skill for complete single-cell workflows including quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis.

When to Use This Skill

This skill should be used when:

Analyzing single-cell RNA-seq data (.h5ad, 10X, CSV formats)
Performing quality control on scRNA-seq datasets
Creating UMAP, t-SNE, or PCA visualizations
Identifying cell clusters and finding marker genes
Annotating cell types based on gene expression
Conducting trajectory inference or pseudotime analysis
Generating publication-quality single-cell plots

Quick Start

Basic Import and Setup

import scanpy as sc
import pandas as pd
import numpy as np

# Configure settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')
sc.settings.figdir = './figures/'

Loading Data

# The readers below are alternatives; call the one that matches your input
# From 10X Genomics
adata = sc.read_10x_mtx('path/to/data/')
adata = sc.read_10x_h5('path/to/data.h5')

# From h5ad (AnnData format)
adata = sc.read_h5ad('path/to/data.h5ad')

# From CSV
adata = sc.read_csv('path/to/data.csv')

Understanding AnnData Structure

The AnnData object is the core data structure in scanpy:

adata.X          # Expression matrix (cells × genes)
adata.obs        # Cell metadata (DataFrame)
adata.var        # Gene metadata (DataFrame)
adata.uns        # Unstructured annotations (dict)
adata.obsm       # Multi-dimensional cell data (PCA, UMAP)
adata.raw        # Frozen snapshot of the data (set via adata.raw = adata)

# Access cell and gene names
adata.obs_names  # Cell barcodes
adata.var_names  # Gene names

Standard Analysis Workflow

1. Quality Control

Identify and filter low-quality cells and genes:

# Identify mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('MT-')

# Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)

# Visualize QC metrics
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

# Filter cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Remove high MT% cells; .copy() turns the sliced view into an independent AnnData
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
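To make the QC metrics above concrete, here is a minimal NumPy-only sketch of how the per-cell gene count and mitochondrial percentage can be derived from a count matrix. The toy matrix and the scaled-down thresholds are assumptions for illustration; this is not scanpy's implementation.

```python
import numpy as np

# Toy count matrix: 4 cells x 5 genes (values are made up for illustration).
gene_names = np.array(["MT-CO1", "MT-ND1", "CD3D", "MS4A1", "NKG7"])
counts = np.array([
    [1, 1, 10, 5, 3],   # healthy-looking cell
    [8, 6, 1, 0, 1],    # dominated by mitochondrial reads
    [0, 0, 4, 4, 2],
    [2, 0, 0, 0, 0],    # only one gene detected
])

# Same convention as the workflow: mitochondrial genes start with "MT-".
is_mt = np.char.startswith(gene_names, "MT-")

total_counts = counts.sum(axis=1)          # analogous to obs['total_counts']
n_genes = (counts > 0).sum(axis=1)         # analogous to obs['n_genes_by_counts']
pct_mt = 100.0 * counts[:, is_mt].sum(axis=1) / total_counts

# Hypothetical thresholds, scaled down to suit the toy data.
keep = (n_genes >= 3) & (pct_mt < 20.0)
print(keep)
```

Real datasets need thresholds chosen from the QC plots rather than the fixed numbers used here.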

Use the QC script for automated analysis:

python scripts/qc_analysis.py input_file.h5ad --output filtered.h5ad

2. Normalization and Preprocessing

# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)

# Log-transform
sc.pp.log1p(adata)

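# Freeze the log-normalized, full-gene matrix in .raw for plotting and marker detection
# (these are no longer raw counts; store counts before normalizing if a later step needs them)
adata.raw = adata

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pl.highly_variable_genes(adata)

# Subset to highly variable genes (.copy() avoids working on a view)
adata = adata[:, adata.var.highly_variable].copy()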

# Regress out unwanted variation
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])

# Scale data
sc.pp.scale(adata, max_value=10)
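The arithmetic behind the normalize, log-transform, and scale steps can be sketched with plain NumPy. This is an illustration under simplifying assumptions (dense toy matrix, sample standard deviation), not a reproduction of scanpy's code, and details such as the variance estimator may differ.

```python
import numpy as np

# Toy count matrix: 3 cells x 3 genes (made-up values).
counts = np.array([
    [2.0, 0.0, 6.0],
    [3.0, 1.0, 0.0],
    [1.0, 4.0, 5.0],
])
target_sum = 1e4

# Library-size normalization: rescale each cell so its counts sum to target_sum.
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# Log-transform with log(1 + x) so zeros stay at zero.
logged = np.log1p(normalized)

# Per-gene z-scoring, then clipping large values (analogous to max_value=10).
mean = logged.mean(axis=0)
std = logged.std(axis=0, ddof=1)
scaled = np.clip((logged - mean) / std, None, 10)

print(normalized.sum(axis=1))  # every cell now sums to target_sum
```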

3. Dimensionality Reduction

# PCA
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True)  # Check elbow plot

# Compute neighborhood graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# UMAP for visualization (color by 'leiden' only after clustering in step 4)
sc.tl.umap(adata)
sc.pl.umap(adata)

# Alternative: t-SNE
sc.tl.tsne(adata)
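The elbow plot shows the fraction of variance explained by each principal component. The sketch below computes the same quantity with an SVD on synthetic data and picks a number of PCs with a cumulative-variance rule. The 90% cutoff and the synthetic data are assumptions for illustration, not a scanpy default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 cells x 30 genes with two strong latent factors plus noise.
latent = rng.normal(size=(200, 2)) * np.array([10.0, 5.0])
loadings = rng.normal(size=(2, 30))
X = latent @ loadings + rng.normal(size=(200, 30))

# PCA via SVD of the centered matrix; squared singular values are
# proportional to the variance captured by each component.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(variance_ratio)

# Hypothetical rule: smallest number of PCs that explains at least 90% of variance.
n_pcs = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_pcs, variance_ratio[:5].round(3))
```

In practice the visual elbow and downstream cluster stability matter more than any single cutoff.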

4. Clustering

# Leiden clustering (recommended)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden', legend_loc='on data')

# Try multiple resolutions to find optimal granularity
for res in [0.3, 0.5, 0.8, 1.0]:
    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_{res}')
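After the resolution loop, a quick way to compare granularity is to count clusters and check the smallest cluster per resolution. The sketch below uses a hand-made pandas DataFrame as a stand-in for `adata.obs`; the labels are invented for illustration.

```python
import pandas as pd

# Stand-in for adata.obs after running leiden at several resolutions.
obs = pd.DataFrame({
    "leiden_0.3": ["0", "0", "0", "1", "1", "1"],
    "leiden_0.5": ["0", "0", "1", "1", "2", "2"],
    "leiden_1.0": ["0", "1", "2", "3", "4", "4"],
})

# One row per resolution: how many clusters, and how small the smallest one is.
summary = pd.DataFrame({
    "n_clusters": obs.nunique(),
    "smallest_cluster": obs.apply(lambda col: col.value_counts().min()),
})
print(summary)
```

Very small clusters at high resolution are often a hint that the data is being over-split.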

5. Marker Gene Identification

# Find marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Visualize results
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5)

# Get results as DataFrame
markers = sc.get.rank_genes_groups_df(adata, group='0')
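The DataFrame returned above can be filtered with ordinary pandas operations. The sketch below uses a hand-made table with the columns `names`, `scores`, `logfoldchanges`, `pvals`, and `pvals_adj`; the values and the cutoffs are assumptions chosen for illustration.

```python
import pandas as pd

# Hand-made stand-in for the marker table of one cluster.
markers = pd.DataFrame({
    "names": ["CD3D", "IL7R", "ACTB", "LYZ"],
    "scores": [25.1, 18.4, 2.0, -12.3],
    "logfoldchanges": [3.2, 2.1, 0.1, -4.0],
    "pvals": [1e-50, 1e-30, 0.04, 1e-20],
    "pvals_adj": [1e-47, 1e-28, 0.2, 1e-18],
})

# Hypothetical cutoffs: adjusted p-value below 0.05 and log fold change above 1.
significant = markers[
    (markers["pvals_adj"] < 0.05) & (markers["logfoldchanges"] > 1.0)
]
top = significant.sort_values("scores", ascending=False)["names"].tolist()
print(top)
```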

6. Cell Type Annotation

# Define marker genes for known cell types
marker_genes = ['CD3D', 'CD14', 'MS4A1', 'NKG7', 'FCGR3A']

# Visualize markers
sc.pl.umap(adata, color=marker_genes, use_raw=True)
sc.pl.dotplot(adata, var_names=marker_genes, groupby='leiden')

# Manual annotation
cluster_to_celltype = {
    '0': 'CD4 T cells',
    '1': 'CD14+ Monocytes',
    '2': 'B cells',
    '3': 'CD8 T cells',
}
adata.obs['cell_type'] = adata.obs['leiden'].map(cluster_to_celltype)

# Visualize annotated types
sc.pl.umap(adata, color='cell_type', legend_loc='on data')
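One thing to watch with the manual mapping is that clusters missing from the dictionary end up as missing values. The sketch below shows one way to detect and label them, using a pandas Series as a stand-in for `adata.obs['leiden']`. The extra cluster `'4'` and the `'Unknown'` label are assumptions for illustration.

```python
import pandas as pd

# Stand-in for adata.obs['leiden']; cluster '4' has no annotation yet.
leiden = pd.Series(["0", "1", "2", "3", "4"], dtype="category")

cluster_to_celltype = {
    "0": "CD4 T cells",
    "1": "CD14+ Monocytes",
    "2": "B cells",
    "3": "CD8 T cells",
}

# Report clusters that the dictionary does not cover.
unmapped = sorted(set(leiden.astype(str)) - set(cluster_to_celltype))

# Map, then give unmapped clusters an explicit placeholder label.
cell_type = leiden.astype(str).map(cluster_to_celltype).fillna("Unknown")
print(unmapped, cell_type.tolist())
```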

7. Save Results

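from pathlib import Path

# Create the output directory if it does not exist yet
Path('results').mkdir(parents=True, exist_ok=True)

# Save processed data
adata.write('results/processed_data.h5ad')

# Export metadata
adata.obs.to_csv('results/cell_metadata.csv')
adata.var.to_csv('results/gene_metadata.csv')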

Common Tasks

Creating Publication-Quality Plots

# Set high-quality defaults
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(5, 5))
sc.settings.file_format_figs = 'pdf'

# UMAP with custom styling
sc.pl.umap(adata, color='cell_type',
           palette='Set2',
           legend_loc='on data',
           legend_fontsize=12,
           legend_fontoutline=2,
           frameon=False,
           save='_publication.pdf')

# Heatmap of marker genes (marker_genes is the list defined in step 6)
sc.pl.heatmap(adata, var_names=marker_genes, groupby='cell_type',
              swap_axes=True, show_gene_labels=True,
              save='_markers.pdf')

# Dot plot
sc.pl.dotplot(adata, var_names=marker_genes, groupby='cell_type',
              save='_dotplot.pdf')

Refer to references/plotting_guide.md for comprehensive visualization examples.

Trajectory Inference

# PAGA (Partition-based graph abstraction)
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, color='leiden')

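# Diffusion pseudotime
# Use the first cell of cluster '0' as the root (pick a biologically motivated cluster)
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden'] == '0')[0]
sc.tl.diffmap(adata)  # DPT is computed on the diffusion map
sc.tl.dpt(adata)
sc.pl.umap(adata, color='dpt_pseudotime')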
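The root-cell line in the block above is compact, so here is a small NumPy sketch of what it evaluates to. The label array is a made-up stand-in for `adata.obs['leiden']`.

```python
import numpy as np

# Stand-in for adata.obs['leiden'] as an array of cluster labels.
leiden = np.array(["2", "0", "1", "0", "2"])

# Positions of all cells in cluster '0'; the first one becomes the root cell.
root_candidates = np.flatnonzero(leiden == "0")
iroot = int(root_candidates[0])
print(root_candidates, iroot)
```

The root is an integer position in the cell axis, not a barcode, so it has to be recomputed after any filtering or reordering of cells.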

Differential Expression Between Conditions

# Compare treated vs control within one cell type
# ('T cells' and the 'condition' column are placeholders; use labels present in adata.obs)
adata_subset = adata[adata.obs['cell_type'] == 'T cells'].copy()
sc.tl.rank_genes_groups(adata_subset, groupby='condition',
                         groups=['treated'], reference='control')
sc.pl.rank_genes_groups(adata_subset, groups=['treated'])
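Before comparing conditions within a cell type, it can help to confirm that both groups contain enough cells. The sketch below is a sanity check that is not part of the original workflow; it uses a hand-made pandas DataFrame as a stand-in for `adata.obs`, and the column values are invented.

```python
import pandas as pd

# Stand-in for adata.obs with a cell type and a condition per cell.
obs = pd.DataFrame({
    "cell_type": ["T cells", "T cells", "T cells", "B cells", "B cells", "B cells"],
    "condition": ["control", "control", "treated", "control", "treated", "treated"],
})

# Cells per (cell type, condition) pair.
group_sizes = pd.crosstab(obs["cell_type"], obs["condition"])
print(group_sizes)
```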

Gene Set Scoring

# Score cells for gene set expression
gene_set = ['CD3D', 'CD3E', 'CD3G']
sc.tl.score_genes(adata, gene_set, score_name='T_cell_score')
sc.pl.umap(adata, color='T_cell_score')
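Conceptually, a gene set score contrasts the average expression of the set with the average expression of a reference set. The sketch below is a simplified NumPy illustration in which the reference is simply all genes outside the set. That is an assumption made for brevity; scanpy's `score_genes` selects its reference genes differently.

```python
import numpy as np

gene_names = ["CD3D", "CD3E", "CD3G", "MS4A1", "LYZ"]

# Toy expression matrix: 2 cells x 5 genes (made-up values).
expr = np.array([
    [3.0, 2.0, 4.0, 0.0, 0.0],  # T-cell-like
    [0.0, 0.0, 0.0, 3.0, 1.0],  # B-cell-like
])

gene_set = ["CD3D", "CD3E", "CD3G"]
in_set = np.isin(gene_names, gene_set)

# Simplified score: mean of the gene set minus mean of all remaining genes.
score = expr[:, in_set].mean(axis=1) - expr[:, ~in_set].mean(axis=1)
print(score)
```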

Batch Correction

# ComBat batch correction
sc.pp.combat(adata, key='batch')

# Alternative: use Harmony or scVI (separate packages)

Key Parameters to Adjust

Quality Control

min_genes: Minimum genes per cell (typically 200-500)
min_cells: Minimum cells per gene (typically 3-10)
pct_counts_mt: Mitochondrial threshold (typically 5-20%)

Normalization

target_sum: Target counts per cell (default 1e4)

Feature Selection

n_top_genes: Number of HVGs (typically 2000-3000)
min_mean, max_mean, min_disp: HVG selection parameters

Dimensionality Reduction

n_pcs: Number of principal components (check variance ratio plot)
n_neighbors: Number of neighbors (typically 10-30)

Clustering

resolution: Clustering granularity (0.4-1.2, higher = more clusters)
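One way to keep these parameters together, and to record them alongside results, is a small dataclass. The class name `AnalysisParams`, the field `max_pct_mt`, and the validation rules below are hypothetical and not part of scanpy; the defaults mirror the values used in the workflow above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnalysisParams:
    """Hypothetical container for the tunable parameters listed above."""
    min_genes: int = 200
    min_cells: int = 3
    max_pct_mt: float = 5.0
    target_sum: float = 1e4
    n_top_genes: int = 2000
    n_pcs: int = 40
    n_neighbors: int = 10
    resolution: float = 0.5

    def __post_init__(self):
        # Basic range checks to catch typos early.
        if not 0 < self.max_pct_mt <= 100:
            raise ValueError("max_pct_mt must be in (0, 100]")
        if self.resolution <= 0:
            raise ValueError("resolution must be positive")
        if self.n_neighbors < 2:
            raise ValueError("n_neighbors must be at least 2")


params = AnalysisParams(resolution=0.8)

# Serialize the settings so they can be stored next to the results.
record = json.dumps(asdict(params), indent=2)
print(record)
```

The fields can then be passed to the corresponding calls, for example `resolution=params.resolution`.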

Common Pitfalls and Best Practices

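Preserve the unfiltered data: Set adata.raw = adata before subsetting to highly variable genes. In the workflow above this stores log-normalized values, so keep true raw counts separately if a later step needs them
Check QC plots carefully: Adjust thresholds based on dataset quality
Prefer Leiden over Louvain: Leiden is generally faster and avoids the poorly connected communities that Louvain can produce
Try multiple clustering resolutions: Find the granularity that matches the biology
Validate cell type annotations: Use multiple marker genes per cell type
Use use_raw=True for gene expression plots: Reads values from adata.raw instead of the scaled matrix
Check PCA variance ratio: Use the elbow to decide how many PCs to keep
Save intermediate results: Long workflows can fail partway through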

Bundled Resources

scripts/qc_analysis.py

Automated quality control script that calculates metrics, generates plots, and filters data:

python scripts/qc_analysis.py input.h5ad --output filtered.h5ad \
    --mt-threshold 5 --min-genes 200 --min-cells 3

references/standard_workflow.md

Complete step-by-step workflow with detailed explanations and code examples for:

Data loading and setup
Quality control with visualization
Normalization and scaling
Feature selection
Dimensionality reduction (PCA, UMAP, t-SNE)
Clustering (Leiden, Louvain)
Marker gene identification
Cell type annotation
Trajectory inference
Differential expression

Read this reference when performing a complete analysis from scratch.

references/api_reference.md

Quick reference guide for scanpy functions organized by module:

Reading/writing data (sc.read_*, adata.write_*)
Preprocessing (sc.pp.*)
Tools (sc.tl.*)
Plotting (sc.pl.*)
AnnData structure and manipulation
Settings and utilities

Use this for quick lookup of function signatures and common parameters.

references/plotting_guide.md

Comprehensive visualization guide including:

Quality control plots
Dimensionality reduction visualizations
Clustering visualizations
Marker gene plots (heatmaps, dot plots, violin plots)
Trajectory and pseudotime plots
Publication-quality customization
Multi-panel figures
Color palettes and styling

Consult this when creating publication-ready figures.

assets/analysis_template.py

Complete analysis template providing a full workflow from data loading through cell type annotation. Copy and customize this template for new analyses:

cp assets/analysis_template.py my_analysis.py
# Edit parameters and run
python my_analysis.py

The template includes all standard steps with configurable parameters and helpful comments.

Additional Resources

Official scanpy documentation: https://scanpy.readthedocs.io/
Scanpy tutorials: https://scanpy-tutorials.readthedocs.io/
scverse ecosystem: https://scverse.org/ (related tools: squidpy, scvi-tools, cellrank)
Best practices: Luecken & Theis (2019) "Current best practices in single-cell RNA-seq analysis: a tutorial"

Tips for Effective Analysis

Start with the template: Use assets/analysis_template.py as a starting point
Run QC script first: Use scripts/qc_analysis.py for initial filtering
Consult references as needed: Load workflow and API references into context
Iterate on clustering: Try multiple resolutions and visualization methods
Validate biologically: Check that marker genes match expected cell types
Document parameters: Record QC thresholds and analysis settings
Save checkpoints: Write intermediate results at key steps

Weekly Installs: 164
Repository: davila7/claude-…emplates
GitHub Stars: 23.5K
First Seen: Jan 21, 2026
