CRISPR Screen Analysis Tool: End-to-End Workflow for Gene Essentiality, Synthetic Lethality, and Drug Target Discovery

tooluniverse-crispr-screen-analysis by mims-harvard/tooluniverse

152 weekly installs

1,200 GitHub Stars

Install command

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-crispr-screen-analysis

Data analysis, research tools, bioinformatics

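Supporting files

ANALYSIS_DETAILS.md - Detailed code snippets for all 8 phases
USE_CASES.md - Complete use cases (essentiality screen, synthetic lethality, drug target discovery, expression data integration) and best practices
EXAMPLES.md - Example usage and quick reference
QUICK_START.md - Quick start guide
FALLBACK_PATCH.md - Fallback patterns for API issues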

ToolUniverse CRISPR Screen Analysis

Comprehensive skill for analyzing CRISPR-Cas9 genetic screens to identify essential genes, synthetic lethal interactions, and therapeutic targets through robust statistical analysis and pathway enrichment.

Overview

CRISPR screens enable genome-wide functional genomics by systematically perturbing genes and measuring fitness effects. This skill provides an 8-phase workflow for:

Processing sgRNA count matrices
Quality control and normalization
Gene-level essentiality scoring (MAGeCK-like and BAGEL-like approaches)
Synthetic lethality detection
Pathway enrichment analysis
Drug target prioritization with DepMap integration
Integration with expression and mutation data

Core Workflow

Phase 1: Data Import & sgRNA Count Processing

Load the sgRNA count matrix (MAGeCK format or generic TSV). Expected columns: sgRNA, Gene, plus one column per sample. Create an experimental design table that links each sample to a condition (baseline/treatment) and a replicate number.
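As a minimal sketch of this phase, the snippet below parses a tab-separated count table and builds a design table using only the Python standard library. The function names `read_count_table` and `build_design_table` and the toy input are hypothetical, chosen for illustration; they are not the skill's own helpers.

```python
import csv
import io


def read_count_table(handle):
    """Parse a count table with columns sgRNA, Gene, then one column per sample."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in ("sgRNA", "Gene")]
    counts, sgrna_to_gene = {}, {}
    for row in reader:
        sgrna_to_gene[row["sgRNA"]] = row["Gene"]
        counts[row["sgRNA"]] = {s: int(row[s]) for s in samples}
    return counts, sgrna_to_gene, samples


def build_design_table(samples, conditions):
    """Link each sample to its condition and a 1-based replicate index."""
    if len(samples) != len(conditions):
        raise ValueError("samples and conditions must have the same length")
    seen, design = {}, []
    for sample, condition in zip(samples, conditions):
        seen[condition] = seen.get(condition, 0) + 1
        design.append(
            {"sample": sample, "condition": condition, "replicate": seen[condition]}
        )
    return design


# Toy input (made-up numbers) standing in for a real sgRNA count file.
example = (
    "sgRNA\tGene\tT0_1\tT0_2\tT14_1\tT14_2\n"
    "sg1\tGENE_A\t120\t130\t300\t280\n"
    "sg2\tGENE_B\t200\t210\t15\t12\n"
)
counts, sgrna_to_gene, samples = read_count_table(io.StringIO(example))
design = build_design_table(
    samples, ["baseline", "baseline", "treatment", "treatment"]
)
```

For a file on disk, the same reader can be called as `read_count_table(open(path, newline=""))`.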

Phase 2: Quality Control & Filtering

Assess sgRNA distribution quality:

Library size: Total reads per sample
Zero-count sgRNAs: Count across samples
Low-count filtering: Remove sgRNAs below threshold (default: fewer than 30 reads in more than N-2 of the N samples)
Gini coefficient: Assess distribution skewness per sample
Report filtering recommendations
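The QC metrics above can be sketched as follows. This is an illustration in plain Python under stated assumptions: `counts` is a dict of per-sgRNA dicts keyed by sample name, and the function names are hypothetical. The 30-read default and the "more than N-2 samples" rule are taken from the list above.

```python
def gini(values):
    """Gini coefficient of non-negative counts (0 = even, close to 1 = skewed)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


def qc_summary(counts, samples):
    """Library size, number of zero-count sgRNAs, and Gini coefficient per sample."""
    summary = {}
    for s in samples:
        column = [row[s] for row in counts.values()]
        summary[s] = {
            "library_size": sum(column),
            "zero_count_sgrnas": sum(1 for c in column if c == 0),
            "gini": gini(column),
        }
    return summary


def filter_low_counts(counts, samples, min_reads=30):
    """Drop sgRNAs with fewer than min_reads in more than N-2 of the N samples."""
    max_low = len(samples) - 2
    return {
        sg: row
        for sg, row in counts.items()
        if sum(1 for s in samples if row[s] < min_reads) <= max_low
    }


# Toy counts (made-up numbers) for four samples.
samples = ["T0_1", "T0_2", "T14_1", "T14_2"]
counts = {
    "sgA": dict(zip(samples, [100, 120, 90, 80])),  # well covered: kept
    "sgB": dict(zip(samples, [5, 10, 2, 0])),       # low in all 4 samples: removed
    "sgC": dict(zip(samples, [10, 20, 50, 60])),    # low in exactly 2 samples: kept
}
qc = qc_summary(counts, samples)
filtered = filter_low_counts(counts, samples)
```

With four samples, N-2 is 2, so an sgRNA is dropped only when it is below the threshold in three or four samples.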

Phase 3: Normalization

Normalize sgRNA counts to account for library size differences:

Median ratio (DESeq2-like): Calculate geometric mean reference, compute size factors as median of ratios
Total count (CPM-like): Divide by library size in millions

Calculate log2 fold changes (LFC) between treatment and control conditions, adding a pseudocount so that zero counts do not produce an undefined logarithm.
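The two normalization options and the LFC step can be sketched as below. This is an illustrative implementation, not the skill's own code: the function names, the pseudocount of 1.0, and the choice to skip sgRNAs with any zero count when building the geometric-mean reference are assumptions.

```python
import math
from statistics import mean, median


def median_ratio_size_factors(counts, samples):
    """Size factor = median ratio of a sample's counts to the per-sgRNA geometric mean."""
    log_ref = {
        sg: mean(math.log(row[s]) for s in samples)
        for sg, row in counts.items()
        if all(row[s] > 0 for s in samples)  # zeros have no finite geometric mean
    }
    if not log_ref:
        raise ValueError("no sgRNA has non-zero counts in every sample")
    return {
        s: math.exp(median(math.log(counts[sg][s]) - ref for sg, ref in log_ref.items()))
        for s in samples
    }


def normalize(counts, samples, method="median_ratio"):
    """Divide each sample's counts by its size factor (median ratio or CPM-like)."""
    if method == "median_ratio":
        factors = median_ratio_size_factors(counts, samples)
    elif method == "cpm":
        factors = {s: sum(row[s] for row in counts.values()) / 1e6 for s in samples}
    else:
        raise ValueError(f"unknown method: {method}")
    return {sg: {s: row[s] / factors[s] for s in samples} for sg, row in counts.items()}


def log2_fold_change(norm, control, treatment, pseudocount=1.0):
    """log2((mean treatment + pseudocount) / (mean control + pseudocount)) per sgRNA."""
    return {
        sg: math.log2(
            (mean(row[s] for s in treatment) + pseudocount)
            / (mean(row[s] for s in control) + pseudocount)
        )
        for sg, row in norm.items()
    }


# Toy data: the "trt" library is sequenced exactly twice as deeply as "ctrl".
samples = ["ctrl", "trt"]
counts = {
    "sg1": {"ctrl": 10, "trt": 20},
    "sg2": {"ctrl": 100, "trt": 200},
    "sg3": {"ctrl": 50, "trt": 100},
}
factors = median_ratio_size_factors(counts, samples)
norm = normalize(counts, samples)
lfc = log2_fold_change(norm, ["ctrl"], ["trt"])
cpm = normalize(counts, samples, method="cpm")
```

In this toy case the only difference between the samples is sequencing depth, so the size factors differ by a factor of two and every LFC comes out at zero after normalization.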

Phase 4: Gene-Level Scoring

Two scoring approaches:

MAGeCK-like (RRA): Rank all sgRNAs by LFC and compute the mean rank per gene, a simplified stand-in for MAGeCK's robust rank aggregation. Lower mean rank = more essential. Reports the sgRNA count and mean LFC per gene.
BAGEL-like (Bayes Factor): Use reference essential and non-essential gene sets to estimate LFC distributions, then calculate a likelihood ratio (Bayes Factor) for each gene. Higher BF = more likely essential.
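Both scoring approaches can be sketched in a few lines. The mean-rank function follows the description above directly. The Bayes factor function is a deliberate simplification: it assumes each reference LFC distribution is Gaussian and sums per-sgRNA log2 likelihood ratios, which is not how BAGEL itself estimates the distributions. All names and the toy LFC values are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_rank_scores(lfc, sgrna_to_gene):
    """Rank sgRNAs by LFC (most depleted = rank 1) and average the ranks per gene."""
    per_gene = {}
    for rank, sg in enumerate(sorted(lfc, key=lfc.get), start=1):
        per_gene.setdefault(sgrna_to_gene[sg], []).append((rank, lfc[sg]))
    scores = {
        gene: {
            "mean_rank": mean(r for r, _ in hits),
            "mean_lfc": mean(v for _, v in hits),
            "n_sgrnas": len(hits),
        }
        for gene, hits in per_gene.items()
    }
    return dict(sorted(scores.items(), key=lambda kv: kv[1]["mean_rank"]))


def gaussian_logpdf(x, mu, sigma):
    """Natural-log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def bayes_factors(lfc, sgrna_to_gene, essential_refs, nonessential_refs):
    """Sum of per-sgRNA log2 likelihood ratios (essential vs non-essential model)."""
    ess = [v for sg, v in lfc.items() if sgrna_to_gene[sg] in essential_refs]
    non = [v for sg, v in lfc.items() if sgrna_to_gene[sg] in nonessential_refs]
    mu_e, sd_e = mean(ess), stdev(ess)  # assumption: Gaussian fit to each reference set
    mu_n, sd_n = mean(non), stdev(non)
    bf = {}
    for sg, v in lfc.items():
        ratio = (gaussian_logpdf(v, mu_e, sd_e) - gaussian_logpdf(v, mu_n, sd_n)) / math.log(2)
        gene = sgrna_to_gene[sg]
        bf[gene] = bf.get(gene, 0.0) + ratio
    return bf


# Toy sgRNA-level LFCs (made-up numbers); two guides per gene.
lfc = {
    "e1a": -3.0, "e1b": -2.5, "e2a": -2.8, "e2b": -3.2,  # reference essential genes
    "n1a": 0.1, "n1b": -0.2, "n2a": 0.3, "n2b": -0.1,    # reference non-essential genes
    "x1": -2.9, "x2": -2.6,                              # query gene, strongly depleted
    "y1": 0.0, "y2": 0.2,                                # query gene, unchanged
}
sgrna_to_gene = {
    "e1a": "ESS1", "e1b": "ESS1", "e2a": "ESS2", "e2b": "ESS2",
    "n1a": "NON1", "n1b": "NON1", "n2a": "NON2", "n2b": "NON2",
    "x1": "GENEX", "x2": "GENEX", "y1": "GENEY", "y2": "GENEY",
}
scores = mean_rank_scores(lfc, sgrna_to_gene)
bf = bayes_factors(lfc, sgrna_to_gene, {"ESS1", "ESS2"}, {"NON1", "NON2"})
```

Here the depleted query gene gets a low mean rank and a positive Bayes factor, while the unchanged gene gets a high mean rank and a negative Bayes factor.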

Phase 5: Synthetic Lethality Detection

Compare essentiality scores between wildtype and mutant cell lines:

Merge gene scores, calculate delta LFC and delta rank
Keep genes that are essential in the mutant (LFC < threshold) but not in the wildtype (LFC > -0.5) and that show a large rank change
Sort by differential essentiality

Check candidates against known dependencies in DepMap and the literature using PubMed search.
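The merge-and-filter logic can be sketched as below. The wildtype cutoff of -0.5 comes from the text; the mutant threshold of -1.0 and the minimum rank shift of 100 are placeholder assumptions, since the text only says "threshold" and "large rank change". The function name and the toy scores are hypothetical.

```python
def synthetic_lethal_candidates(
    wildtype, mutant, mut_lfc_threshold=-1.0, wt_lfc_floor=-0.5, min_rank_shift=100
):
    """Genes essential in the mutant line but not in the wildtype line."""
    hits = []
    for gene in wildtype.keys() & mutant.keys():  # merge on shared genes
        wt, mut = wildtype[gene], mutant[gene]
        delta_lfc = mut["mean_lfc"] - wt["mean_lfc"]
        delta_rank = wt["rank"] - mut["rank"]  # positive = more essential in mutant
        if (
            mut["mean_lfc"] < mut_lfc_threshold
            and wt["mean_lfc"] > wt_lfc_floor
            and delta_rank >= min_rank_shift
        ):
            hits.append({"gene": gene, "delta_lfc": delta_lfc, "delta_rank": delta_rank})
    # Most negative delta LFC first, i.e. strongest differential essentiality.
    return sorted(hits, key=lambda h: h["delta_lfc"])


# Toy gene-level scores (made-up numbers) for two cell lines.
wildtype = {
    "GENE_A": {"mean_lfc": -0.1, "rank": 9000},
    "GENE_B": {"mean_lfc": -3.0, "rank": 5},      # essential in both: not a hit
    "GENE_C": {"mean_lfc": 0.1, "rank": 12000},   # essential in neither: not a hit
}
mutant = {
    "GENE_A": {"mean_lfc": -2.2, "rank": 40},
    "GENE_B": {"mean_lfc": -3.1, "rank": 4},
    "GENE_C": {"mean_lfc": 0.0, "rank": 11800},
}
hits = synthetic_lethal_candidates(wildtype, mutant)
```

Requiring all three conditions at once keeps core-essential genes (depleted in both lines) out of the candidate list.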

Phase 6: Pathway Enrichment Analysis

Submit top essential genes to Enrichr for pathway enrichment:

KEGG pathways
GO Biological Process
Retrieve enriched terms with p-values and gene lists
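To make the "enriched terms with p-values" output concrete, here is an offline sketch of an over-representation test. It assumes a one-sided hypergeometric test against a fixed background; this document does not specify how Enrichr computes its p-values, so treat this as an illustration of the idea rather than a reproduction of the service. The gene sets and function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, term_size, background_size):
    """P(X >= overlap) when list_size genes are drawn from the background."""
    upper = min(list_size, term_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(background_size, list_size)


def enrich(gene_list, gene_sets, background_size):
    """Return overlapping terms sorted by p-value, with the shared genes."""
    genes = set(gene_list)
    rows = []
    for term, members in gene_sets.items():
        shared = sorted(genes & set(members))
        if shared:
            p = hypergeom_pvalue(len(shared), len(genes), len(members), background_size)
            rows.append({"term": term, "p_value": p, "genes": shared})
    return sorted(rows, key=lambda r: r["p_value"])


# Toy gene sets and a toy background of 10 genes (made-up for illustration).
gene_sets = {
    "TERM_1": ["G1", "G2", "G3"],
    "TERM_2": ["G4", "G5", "G6", "G7", "G8"],
}
top_genes = ["G1", "G2", "G3", "G4"]
results = enrich(top_genes, gene_sets, background_size=10)
```

In the toy example all three TERM_1 genes are in the list, so TERM_1 ranks first with a much smaller p-value than TERM_2, which shares a single gene.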

Phase 7: Drug Target Prioritization

Composite scoring combining:

Essentiality (50% weight): Normalized mean LFC from CRISPR screen
Expression (30% weight): Log2 fold change from RNA-seq (if available)
Druggability (20% weight): Number of drug interactions from DGIdb

Query DGIdb for each candidate gene to find existing drugs, interaction types, and sources.
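The composite score can be sketched as a weighted sum of rescaled components. The 50/30/20 weights come from the list above; the min-max rescaling to [0, 1], the sign flip on LFC (more negative means more essential), and the defaults of 0 for missing expression or drug data are assumptions made for this illustration. The function names are hypothetical.

```python
def min_max(values):
    """Rescale a dict of numbers to [0, 1]; all zeros if every value is equal."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def prioritize(mean_lfc, expression_lfc, n_drug_interactions, weights=(0.5, 0.3, 0.2)):
    """Rank genes by weighted essentiality, expression, and druggability."""
    genes = list(mean_lfc)
    essentiality = min_max({g: -mean_lfc[g] for g in genes})  # depleted = high score
    expression = min_max({g: expression_lfc.get(g, 0.0) for g in genes})
    druggability = min_max({g: n_drug_interactions.get(g, 0) for g in genes})
    w_ess, w_expr, w_drug = weights
    rows = [
        {
            "gene": g,
            "essentiality": essentiality[g],
            "expression": expression[g],
            "druggability": druggability[g],
            "priority": w_ess * essentiality[g]
            + w_expr * expression[g]
            + w_drug * druggability[g],
        }
        for g in genes
    ]
    return sorted(rows, key=lambda r: r["priority"], reverse=True)


# Toy inputs (made-up numbers): screen LFC, RNA-seq log2 FC, DGIdb interaction counts.
mean_lfc = {"GENE_A": -3.0, "GENE_B": -1.0, "GENE_C": -2.0}
expression_lfc = {"GENE_A": 2.0, "GENE_B": 0.0, "GENE_C": 1.0}
n_drug_interactions = {"GENE_A": 0, "GENE_B": 10, "GENE_C": 5}
ranked = prioritize(mean_lfc, expression_lfc, n_drug_interactions)
```

Because essentiality carries half the weight, GENE_A ranks first in the toy data even though it has no known drug interactions.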

Phase 8: Report Generation

Generate markdown report with:

Summary statistics (total genes, essential genes, non-essential genes)
Top 20 essential genes table (rank, gene, mean LFC, sgRNAs, score)
Pathway enrichment results (top 10 terms per database)
Drug target candidates (rank, gene, essentiality, expression FC, druggability, priority score)
Methods section
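As a small sketch of the report step, the snippet below renders the top essential genes as a Markdown table. It assumes gene scores shaped like `{"gene": {"mean_lfc", "n_sgrnas", "mean_rank"}}` and uses the mean rank as the score column; both the structure and the function names are hypothetical.

```python
def markdown_table(headers, rows):
    """Render a list of rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


def top_genes_section(gene_scores, top_n=20):
    """Build the 'top essential genes' section, best (lowest) mean rank first."""
    ranked = sorted(gene_scores.items(), key=lambda kv: kv[1]["mean_rank"])[:top_n]
    rows = [
        (i, gene, f"{s['mean_lfc']:.2f}", s["n_sgrnas"], f"{s['mean_rank']:.1f}")
        for i, (gene, s) in enumerate(ranked, start=1)
    ]
    table = markdown_table(["Rank", "Gene", "Mean LFC", "sgRNAs", "Score"], rows)
    return f"## Top {len(rows)} essential genes\n\n{table}"


# Toy gene scores (made-up numbers).
gene_scores = {
    "GENE_B": {"mean_lfc": -1.2, "n_sgrnas": 4, "mean_rank": 310.0},
    "GENE_A": {"mean_lfc": -3.0, "n_sgrnas": 4, "mean_rank": 12.5},
}
section = top_genes_section(gene_scores)
```

The other report sections (summary statistics, enrichment results, drug target candidates) can reuse `markdown_table` with their own headers.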

ToolUniverse Tool Integration

Key Tools Used:

PubMed_search - Literature search for gene essentiality
Enrichr_submit_genelist - Pathway enrichment submission
Enrichr_get_results - Retrieve enrichment results
DGIdb_query_gene - Drug-gene interactions and druggability
STRING_get_network - Protein interaction networks
KEGG_get_pathway - Pathway visualization

Expression Integration:

GEO_get_dataset - Download expression data
ArrayExpress_get_experiment - Alternative expression source

Variant Integration:

ClinVar_query_gene - Known pathogenic variants
gnomAD_get_gene - Population allele frequencies

Quick Start

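import pandas as pd
from tooluniverse import ToolUniverse

# NOTE: the helper functions called below (load_sgrna_counts, filter_low_count_sgrnas, ...)
# are not provided by the imports above. They are assumed to come from the per-phase
# snippets in ANALYSIS_DETAILS.md and must be defined before this script is run.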

# 1. Load data
counts, meta = load_sgrna_counts("sgrna_counts.txt")
design = create_design_matrix(['T0_1', 'T0_2', 'T14_1', 'T14_2'],
                               ['baseline', 'baseline', 'treatment', 'treatment'])

# 2. Process
filtered_counts, filtered_mapping = filter_low_count_sgrnas(counts, meta['sgrna_to_gene'])
norm_counts, _ = normalize_counts(filtered_counts)
lfc, _, _ = calculate_lfc(norm_counts, design)

# 3. Score genes
gene_scores = mageck_gene_scoring(lfc, filtered_mapping)

# 4. Enrich pathways
enrichment = enrich_essential_genes(gene_scores, top_n=100)

# 5. Find drug targets
drug_targets = prioritize_drug_targets(gene_scores)

# 6. Generate report
report = generate_crispr_report(gene_scores, enrichment, drug_targets)

References

Li W, et al. (2014) MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology
Hart T, et al. (2015) High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell
Meyers RM, et al. (2017) Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens. Nature Genetics
Tsherniak A, et al. (2017) Defining a Cancer Dependency Map. Cell (DepMap)