Epigenomics Data Analysis Skill: Methylation, ChIP-seq, ATAC-seq, and Multi-Omics Integration

tooluniverse-epigenomics by mims-harvard/tooluniverse

166 weekly installs

1,200 GitHub Stars

Install command

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-epigenomics

Tags: research tools, bioinformatics, data processing


Genomics and Epigenomics Data Processing

Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.

When to Use This Skill

Triggers:

User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
Questions about differential methylation analysis
Age-related CpG detection or epigenetic clock questions
Chromosome-level methylation density or statistics
ChIP-seq peak files (BED format) with analysis questions
ATAC-seq chromatin accessibility questions
Multi-omics integration (expression + methylation, expression + ChIP-seq)
Genome-wide epigenomic statistics
Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
Questions about missing data across clinical/genomic/epigenomic modalities
Regulatory element annotation for processed epigenomic data

Example Questions:

"How many patients have no missing data for vital status, gene expression, and methylation data?"
"What is the ratio of filtered age-related CpG density between chromosomes?"
"What is the genome-wide average chromosomal density of unique age-related CpGs per base pair?"
"How many CpG sites show significant differential methylation (padj < 0.05)?"
"What is the Pearson correlation between methylation and expression for gene X?"
"How many ChIP-seq peaks overlap with promoter regions?"
"What fraction of ATAC-seq peaks are in enhancer regions?"
"Which chromosome has the highest density of hypermethylated CpGs?"
"Filter CpG sites by variance > threshold and map to nearest genes"
"What is the average beta value difference between tumor and normal for chromosome 17?"

NOT for (use other skills instead):

Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
RNA-seq differential expression -> Use tooluniverse-rnaseq-deseq2
Variant calling/annotation from VCF -> Use tooluniverse-variant-analysis
Gene enrichment analysis -> Use tooluniverse-gene-enrichment
Protein structure analysis -> Use tooluniverse-protein-structure-retrieval

Required Python Packages

# Core (MUST be available)
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt

# Optional but useful
import pysam      # BAM/CRAM file access
import gseapy     # Enrichment of genes from methylation analysis

# ToolUniverse (for annotation)
from tooluniverse import ToolUniverse

Key Principles

Data-first approach - Load and inspect data files BEFORE any analysis
Question-driven - Parse what the user is actually asking and extract the specific numeric answer
File format detection - Automatically detect methylation arrays, BED files, BigWig, clinical data
Coordinate system awareness - Track genome build (hg19, hg38, mm10), handle chr prefix differences
Statistical rigor - Proper multiple testing correction, effect size filtering, sample size awareness
Missing data handling - Explicitly report and handle NaN/missing values
Chromosome normalization - Always normalize chromosome names (chr1 vs 1, chrX vs X)
CpG site identification - Parse Illumina probe IDs (cg/ch probes), genomic coordinates
Report-first - Create output file first, populate progressively
English-first queries - Use English in all tool calls
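
The chromosome normalization principle above can be illustrated with a small helper. This is a minimal sketch, not the skill's own implementation: the function name `normalize_chrom` and its `use_prefix` flag are hypothetical. It relies only on the widely used convention that UCSC-style names carry a `chr` prefix (with `chrM` for the mitochondrial genome) while Ensembl-style names do not (and use `MT`).

```python
import re

def normalize_chrom(name, use_prefix=True):
    """Return a chromosome label in 'chr1' style (default) or bare '1' style.

    Hypothetical helper for illustration only.
    """
    core = re.sub(r"^chr", "", str(name).strip(), flags=re.IGNORECASE)
    upper = core.upper()
    if upper in {"M", "MT"}:
        # UCSC uses chrM, Ensembl uses MT
        core = "M" if use_prefix else "MT"
    elif upper in {"X", "Y"}:
        core = upper
    return f"chr{core}" if use_prefix else core

print(normalize_chrom("1"))                        # chr1
print(normalize_chrom("CHR17", use_prefix=False))  # 17
```

Applying one such function to every table (probes, peaks, gene annotation) before any join or overlap avoids silently empty intersections caused by mixed naming.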

Workflow Overview

Phase 0: Question Parsing and Data Discovery

Before writing any code, parse the question to identify:

What data files are available (methylation, ChIP-seq, ATAC-seq, clinical, expression, manifest)
What specific statistic or answer is being asked for
What thresholds apply (significance, effect size, variance, chromosome filters)
What genome build to use

Categorize files by scanning for keywords: methyl/beta/cpg/illumina, chip/peak/narrowpeak, atac/accessibility, clinical/patient/sample, express/rnaseq/fpkm, manifest/annotation/probe.

See ANALYSIS_PROCEDURES.md for the full decision tree and parameter extraction table.
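
As a rough illustration of the keyword scan described above, the sketch below tags a file name with every category whose keywords it contains. The category labels and the function name `categorize_file` are assumptions made for this example; only the keyword groups come from the text. A file can match several categories (for instance an ATAC narrowPeak file), so the result is a list and ambiguous cases still need inspection of the file contents.

```python
from pathlib import Path

# Keyword groups taken from the text; the category labels are illustrative.
KEYWORDS = {
    "methylation": ("methyl", "beta", "cpg", "illumina"),
    "chipseq": ("chip", "peak", "narrowpeak"),
    "atac": ("atac", "accessibility"),
    "clinical": ("clinical", "patient", "sample"),
    "expression": ("express", "rnaseq", "fpkm"),
    "manifest": ("manifest", "annotation", "probe"),
}

def categorize_file(path):
    """Return all categories whose keywords occur in the file name."""
    name = Path(path).name.lower()
    return [cat for cat, kws in KEYWORDS.items() if any(k in name for k in kws)]

print(categorize_file("tumor_beta_values.csv"))       # ['methylation']
print(categorize_file("data/atac_peaks.narrowPeak"))  # ['chipseq', 'atac']
```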

Phase 1: Methylation Data Processing

Core functions for methylation analysis:

Load methylation data - Supports CSV, TSV, parquet, HDF5; auto-detects beta vs M-values
Load probe manifest - Illumina 450K/EPIC manifest with chromosome, position, gene annotation
CpG filtering - Filter by variance, missing rate, probe type (cg/ch), chromosome, CpG island relation, gene group
Differential methylation - T-test/Wilcoxon/KS between groups with FDR correction; identify DMPs (hyper/hypo)
Age-related CpG analysis - Pearson/Spearman correlation of probes with age, FDR correction
Chromosome-level density - CpG count per chromosome divided by chromosome length; density ratios; genome-wide average

See CODE_REFERENCE.md Phase 1 for full function implementations.
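
To make the differential methylation step concrete, here is a minimal sketch under stated assumptions: probes are rows, samples are columns, a Welch t-test is run per probe, and p-values are adjusted with a hand-written Benjamini-Hochberg routine (intended to match what `statsmodels.stats.multitest.multipletests(..., method="fdr_bh")` returns for the adjusted p-values). The names `bh_adjust` and `differential_methylation_sketch` are hypothetical and are not the functions in CODE_REFERENCE.md.

```python
import numpy as np
import pandas as pd
from scipy import stats

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values; NaNs are passed through."""
    p = np.asarray(pvals, dtype=float)
    adj = np.full(p.shape, np.nan)
    ok = ~np.isnan(p)
    m = int(ok.sum())
    if m == 0:
        return adj
    order = np.argsort(p[ok])
    ranked = p[ok][order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(ranked, 0.0, 1.0)
    adj[ok] = out
    return adj

def differential_methylation_sketch(beta, group_a, group_b,
                                    alpha=0.05, min_delta=0.0):
    """Per-probe Welch t-test between two sample groups with FDR correction."""
    a = beta[group_a].to_numpy(dtype=float)
    b = beta[group_b].to_numpy(dtype=float)
    _, p = stats.ttest_ind(a, b, axis=1, equal_var=False, nan_policy="omit")
    res = pd.DataFrame(
        {"delta_beta": np.nanmean(a, axis=1) - np.nanmean(b, axis=1),
         "pvalue": np.asarray(p, dtype=float)},
        index=beta.index,
    )
    res["padj"] = bh_adjust(res["pvalue"].to_numpy())
    sig = (res["padj"] < alpha) & (res["delta_beta"].abs() >= min_delta)
    res["direction"] = np.where(
        ~sig, "ns", np.where(res["delta_beta"] > 0, "hyper", "hypo"))
    return res
```

A question such as "how many CpG sites have padj < 0.05" then reduces to `(res["padj"] < 0.05).sum()`, and the hyper/hypo split is a `value_counts()` on `direction`.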

Phase 2: ChIP-seq Peak Analysis

Load BED/narrowPeak/broadPeak - Auto-detect format, normalize chromosomes
Peak statistics - Count, length distribution, signal values, q-values
Peak annotation - Map peaks to nearest gene, classify as promoter/gene_body/proximal/distal
Peak overlap - Pure Python interval intersection between two BED files; Jaccard similarity

See CODE_REFERENCE.md Phase 2 for full function implementations.
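
The pure-Python overlap and Jaccard computation mentioned above could look roughly like the following. This is a sketch, assuming peaks are `(chrom, start, end)` tuples in BED-style 0-based half-open coordinates and that Jaccard is defined on base pairs (intersection bp divided by union bp); all function names are hypothetical.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def by_chrom(peaks):
    """Group peaks by chromosome and merge them within each chromosome."""
    grouped = defaultdict(list)
    for chrom, start, end in peaks:
        grouped[chrom].append((start, end))
    return {c: merge_intervals(v) for c, v in grouped.items()}

def intersection_bp(a, b):
    """Shared base pairs between two sorted, merged interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def jaccard_bp(peaks_a, peaks_b):
    """Base-pair Jaccard index between two peak sets."""
    A, B = by_chrom(peaks_a), by_chrom(peaks_b)
    inter = sum(intersection_bp(A[c], B[c]) for c in A.keys() & B.keys())
    covered = sum(e - s for ivs in list(A.values()) + list(B.values())
                  for s, e in ivs)
    union = covered - inter
    return inter / union if union else 0.0

def count_overlapping(peaks_a, peaks_b):
    """Number of peaks in A that overlap at least one peak in B."""
    B = by_chrom(peaks_b)
    return sum(
        any(s < end and start < e for s, e in B.get(chrom, []))
        for chrom, start, end in peaks_a
    )
```

`count_overlapping` is quadratic per chromosome, which is acceptable for small peak sets; larger inputs would call for a sorted sweep or an interval tree.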

Phase 3: ATAC-seq Analysis

Load ATAC peaks - Wrapper around BED loader for narrowPeak format
ATAC-specific stats - Nucleosome-free region (NFR) detection (<150bp peaks), region classification
Chromatin accessibility by region - Distribution of open chromatin across promoter/enhancer/intergenic

See CODE_REFERENCE.md Phase 3 for full function implementations.
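
The peak-length rule for nucleosome-free regions stated above (peaks shorter than 150 bp) can be sketched as a short summary function. The column names `start` and `end` and the function name `nfr_summary` are assumptions for this example.

```python
import pandas as pd

def nfr_summary(peaks, nfr_max_bp=150):
    """Summarize peak lengths and the share of peaks below the NFR cutoff."""
    length = peaks["end"] - peaks["start"]  # half-open BED coordinates
    is_nfr = length < nfr_max_bp
    n = len(peaks)
    return {
        "n_peaks": int(n),
        "n_nfr": int(is_nfr.sum()),
        "nfr_fraction": float(is_nfr.mean()) if n else 0.0,
        "median_length": float(length.median()) if n else float("nan"),
    }

peaks = pd.DataFrame({
    "chrom": ["chr1"] * 4,
    "start": [0, 1000, 5000, 9000],
    "end": [120, 1149, 5400, 9150],
})
print(nfr_summary(peaks))
```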

Phase 4: Multi-Omics Integration

Methylation-expression correlation - Align samples, compute per-probe-gene Pearson/Spearman with FDR
ChIP-seq + expression - Find genes with promoter peaks and compare expression levels

See CODE_REFERENCE.md Phase 4 for full function implementations.
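
A minimal sketch of the sample alignment and per-pair correlation described above, assuming both matrices have features as rows and sample IDs as columns, and that the probe-to-gene map is a list of `(probe, gene)` pairs. The function name `correlate_pairs` is hypothetical; FDR correction of the resulting p-values would follow as a separate step.

```python
import pandas as pd
from scipy import stats

def correlate_pairs(beta, expr, probe_gene, method="pearson", min_samples=3):
    """Correlate methylation and expression for each (probe, gene) pair."""
    shared = beta.columns.intersection(expr.columns)  # align samples by ID
    corr = stats.pearsonr if method == "pearson" else stats.spearmanr
    rows = []
    for probe, gene in probe_gene:
        if probe not in beta.index or gene not in expr.index:
            continue
        pair = pd.DataFrame({
            "m": beta.loc[probe, shared].astype(float),
            "e": expr.loc[gene, shared].astype(float),
        }).dropna()  # drop samples missing in either modality
        if len(pair) < min_samples:
            continue
        r, p = corr(pair["m"], pair["e"])
        rows.append({"probe": probe, "gene": gene,
                     "r": float(r), "pvalue": float(p), "n": len(pair)})
    return pd.DataFrame(rows, columns=["probe", "gene", "r", "pvalue", "n"])
```

Reporting `n` alongside `r` matters here, because the number of samples shared between modalities is often much smaller than either matrix suggests.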

Phase 5: Clinical Data Integration

Missing data analysis - Count samples present across clinical, expression, and methylation modalities
Complete case identification - Find intersection of samples with non-missing values for specified variables

See CODE_REFERENCE.md Phase 5 for full function implementations.
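
The complete-case logic above can be sketched as a set intersection. The layout is an assumption: the clinical table has one row per patient, while the expression and methylation matrices have one column per patient. "Present" in a matrix is taken here to mean that the patient's column contains at least one non-missing value; a stricter definition may be needed depending on the question. The function name `complete_cases` is hypothetical.

```python
import pandas as pd

def complete_cases(clinical, expr, meth, clinical_vars):
    """Return sorted patient IDs with data in all three modalities."""
    # Clinical: every requested variable must be non-missing
    has_clin = clinical.index[clinical[clinical_vars].notna().all(axis=1)]
    # Matrices: the patient's column must hold at least one measured value
    has_expr = expr.columns[expr.notna().any(axis=0)]
    has_meth = meth.columns[meth.notna().any(axis=0)]
    return sorted(set(has_clin) & set(has_expr) & set(has_meth))

clinical = pd.DataFrame(
    {"vital_status": ["alive", None, "dead", "alive"]},
    index=["p1", "p2", "p3", "p4"])
expr = pd.DataFrame(
    {"p1": [1.0, 2.0], "p2": [3.0, 4.0], "p3": [float("nan")] * 2},
    index=["G1", "G2"])
meth = pd.DataFrame(
    {"p1": [0.2], "p3": [0.5], "p4": [0.9]}, index=["cg1"])
print(complete_cases(clinical, expr, meth, ["vital_status"]))  # ['p1']
```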

Phase 6: ToolUniverse Annotation

Use ToolUniverse tools to add biological context after computational analysis:

Gene annotation - Ensembl lookup for coordinates, biotype, cross-references
Regulatory elements - SCREEN cCREs (enhancers, promoters, insulators)
ChIPAtlas - Query available ChIP-seq experiments by antigen/cell type
Ensembl regulatory features - Annotate genomic regions with regulatory overlaps

See CODE_REFERENCE.md Phase 6 and TOOLS_REFERENCE.md for parameters.

Phase 7: Genome-Wide Statistics

Comprehensive methylation stats - Global mean/median beta, probe variance, chromosome density
Differential methylation summary - Count significant, hyper/hypo split, effect sizes

See CODE_REFERENCE.md Phase 7 for full function implementations.
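
The chromosome density statistic can be sketched as follows, assuming a manifest indexed by probe ID with a `chr` column and a caller-supplied dictionary of chromosome lengths for the relevant genome build. Function names are hypothetical. Note that "genome-wide average" is ambiguous: the sketch averages the per-chromosome densities, whereas total CpGs divided by total length gives a different number, so the question wording should decide which one to report.

```python
import pandas as pd

def chromosome_density(sig_probes, manifest, chrom_lengths):
    """Unique significant CpGs per base pair, per chromosome."""
    unique = pd.Index(sig_probes).unique()  # count each CpG once
    chrom = manifest.loc[manifest.index.intersection(unique), "chr"]
    out = pd.DataFrame({"n_cpg": chrom.value_counts()})
    out["length_bp"] = out.index.map(chrom_lengths)
    out = out.dropna(subset=["length_bp"])  # drop contigs without a length
    out["density"] = out["n_cpg"] / out["length_bp"]
    return out.sort_index()

def density_ratio(density, chrom_a, chrom_b):
    """Density of chrom_a divided by density of chrom_b."""
    return float(density.loc[chrom_a, "density"]
                 / density.loc[chrom_b, "density"])

def mean_chromosome_density(density):
    """Unweighted mean of the per-chromosome densities."""
    return float(density["density"].mean())

# Toy lengths for illustration only; real analyses need build-specific sizes.
manifest = pd.DataFrame(
    {"chr": ["chr1", "chr1", "chr1", "chr2", "chr2"]},
    index=["cg1", "cg2", "cg3", "cg4", "cg5"])
dens = chromosome_density(["cg1", "cg2", "cg2", "cg4", "cg9"],
                          manifest, {"chr1": 1000, "chr2": 250})
print(density_ratio(dens, "chr1", "chr2"))  # 0.5
```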

Common Analysis Patterns

Pattern	Input	Key Steps	Output
Differential methylation	Beta matrix + clinical	Filter probes -> define groups -> t-test -> FDR -> threshold	Count of significant DMPs
Age-related CpG density	Beta matrix + manifest + ages	Correlate with age -> FDR -> map to chr -> density per chr	Density ratio between chromosomes
Multi-omics missing data	Clinical + expression + methylation	Extract sample IDs -> intersect -> check NaN	Complete case count
ChIP-seq peak annotation	BED/narrowPeak + gene annotation	Load peaks -> annotate to genes -> classify regions	Fraction in promoters
Methylation-expression	Beta matrix + expression + probe-gene map	Align samples -> correlate -> FDR	Significant anti-correlations

See ANALYSIS_PROCEDURES.md for detailed step-by-step flows and edge case handling.

Key Functions Reference

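Function	Purpose	Input	Output
`load_methylation_data()`	Load beta/M-value matrix	file path	DataFrame
`detect_methylation_type()`	Detect beta vs M-values	DataFrame	'beta' or 'mvalue'
`filter_cpg_probes()`	Filter probes by criteria	DataFrame + filters	filtered DataFrame
`differential_methylation()`	DM analysis between groups	beta + samples	DataFrame with padj
`identify_age_related_cpgs()`	Age-related CpGs	beta + ages	DataFrame with padj
`chromosome_cpg_density()`	CpG density per chromosome	probes + manifest	density DataFrame
`genome_wide_average_density()`	Overall genome density	density DataFrame	float
`chromosome_density_ratio()`	Ratio between chromosomes	density + chromosome names	float
`load_bed_file()`	Load BED/narrowPeak	file path	DataFrame
`peak_statistics()`	Basic peak statistics	BED DataFrame	dict
`annotate_peaks_to_genes()`	Annotate peaks to genes	peaks + genes	annotated DataFrame
`find_overlaps()`	Peak overlap analysis	two BED DataFrames	overlap DataFrame
`missing_data_analysis()`	Cross-modality completeness	multiple DataFrames	dict
`correlate_methylation_expression()`	Methylation-expression correlation	beta + expression	correlation DataFrame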

ToolUniverse Tools Used

Regulatory Annotation Tools

ensembl_lookup_gene - Gene coordinates, biotype (REQUIRES species='homo_sapiens')
ensembl_get_regulatory_features - Regulatory features by region (NO "chr" prefix in region)
ensembl_get_overlap_features - Gene/transcript overlap data
SCREEN_get_regulatory_elements - cCREs: enhancers, promoters, insulators
ReMap_get_transcription_factor_binding - TF binding sites
RegulomeDB_query_variant - Variant regulatory score
jaspar_search_matrices - TF binding matrices
ENCODE_search_experiments - Experiment metadata (assay_title must be "TF ChIP-seq" not "ChIP-seq")

Gene Annotation Tools

MyGene_query_genes - Gene query
MyGene_batch_query - Batch gene query
HGNC_get_gene_info - Gene symbol, aliases, IDs
GO_get_annotations_for_gene - GO annotations

See TOOLS_REFERENCE.md for full parameter details and return schemas.

Data Format Notes

Methylation data: Probes (rows) x samples (columns), beta values 0-1
BED files: Tab-separated, 0-based half-open coordinates
narrowPeak: 10-column BED extension with signalValue, pValue, qValue, peak
Illumina manifests: Probe ID, chromosome, position, gene annotation
Clinical data: Patient/sample-centric with clinical variables as columns
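
Because BED intervals are 0-based half-open while Ensembl-style region strings are 1-based and inclusive (and carry no "chr" prefix), a conversion step is needed before passing peak coordinates to a region-based annotation tool. The helper below is a hypothetical sketch; the exact region format each ToolUniverse tool expects should be checked in TOOLS_REFERENCE.md.

```python
def bed_to_ensembl_region(chrom, start, end):
    """Convert a BED interval to an Ensembl-style 'chrom:start-end' string.

    BED [start, end) is 0-based half-open; the output is 1-based inclusive,
    so only the start coordinate shifts by one.
    """
    core = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if core == "M":
        core = "MT"  # Ensembl naming for the mitochondrial genome
    return f"{core}:{start + 1}-{end}"

print(bed_to_ensembl_region("chr17", 7661778, 7687550))  # 17:7661779-7687550
```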

Genome Builds Supported

Build	Species	Autosomes	Sex Chromosomes
hg38 (GRCh38)	Human	chr1-chr22	chrX, chrY
hg19 (GRCh37)	Human	chr1-chr22	chrX, chrY
mm10 (GRCm38)	Mouse	chr1-chr19	chrX, chrY

Limitations

No native pybedtools: uses pure Python interval operations
No native pyBigWig: BigWig files cannot be read directly unless the pyBigWig package is installed
No R bridge: does not use methylKit, ChIPseeker, or DiffBind
Illumina-centric: methylation functions designed for 450K/EPIC arrays
Uses t-test/Wilcoxon for differential methylation (not limma/bumphunter)
No peak calling: assumes peaks are pre-called
API rate limits: ToolUniverse annotation limited to ~20 genes per batch
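
One simple way to stay within the batch limit noted above is to split gene lists into chunks before annotation. This is a generic sketch: the helper name `batched` is hypothetical and the default of 20 reflects the approximate figure given in the list.

```python
def batched(items, size=20):
    """Yield consecutive chunks of at most `size` items."""
    items = list(items)
    for i in range(0, len(items), size):
        yield items[i:i + size]

genes = [f"GENE{i}" for i in range(45)]
print([len(chunk) for chunk in batched(genes)])  # [20, 20, 5]
```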

Reference Files

CODE_REFERENCE.md - Full Python function implementations for all phases
TOOLS_REFERENCE.md - ToolUniverse tool parameter details and return schemas
ANALYSIS_PROCEDURES.md - Decision trees, step-by-step analysis patterns, edge cases, fallback strategies
QUICK_START.md - Quick start examples for common analysis types

Weekly Installs

142

Repository

mims-harvard/to…universe

GitHub Stars

1.2K

First Seen

Feb 16, 2026



Regulatory Annotation Tools (continued)

ChIPAtlas_get_experiments - ChIP-seq experiments (REQUIRES operation param)
ChIPAtlas_search_datasets - Dataset search (REQUIRES operation param)
ChIPAtlas_enrichment_analysis - Enrichment from BED/motifs/genes
ChIPAtlas_get_peak_data - Peak data download (REQUIRES operation param)
FourDN_search_data - Chromatin conformation data (REQUIRES operation param)