geniml by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill geniml
Geniml is a Python package for building machine learning models on genomic interval data from BED files. It provides unsupervised methods for learning embeddings of genomic regions, single cells, and metadata labels, enabling similarity searches, clustering, and downstream ML tasks.
Install geniml using uv:
uv pip install geniml
For ML dependencies (PyTorch, etc.):
uv pip install 'geniml[ml]'
Development version from GitHub:
uv pip install git+https://github.com/databio/geniml.git
Geniml provides five primary capabilities, each detailed in dedicated reference files:
Train unsupervised embeddings of genomic regions using word2vec-style learning.
Use for: Dimensionality reduction of BED files, region similarity analysis, feature vectors for downstream ML.
Workflow:
Reference: See references/region2vec.md for detailed workflow, parameters, and examples.
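To make "word2vec-style learning" concrete, here is a minimal sketch of how tokenized region sets could be turned into training data. It assumes each BED file is treated as an unordered "sentence" of region tokens that is reshuffled several times, which is what the num_shufflings parameter used later in this document suggests. The helper names shuffled_sentences and context_pairs and the token strings are hypothetical and are not part of geniml.

```python
import random


def shuffled_sentences(region_sets, num_shufflings, seed=0):
    """Yield every region set num_shufflings times, each time in a new random order."""
    rng = random.Random(seed)
    for _ in range(num_shufflings):
        for regions in region_sets:
            sentence = list(regions)  # copy so the input is not modified
            rng.shuffle(sentence)
            yield sentence


def context_pairs(sentence, window):
    """Return skip-gram style (center, context) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical tokens: one list per tokenized BED file.
region_sets = [
    ["chr1_100_200", "chr1_500_650", "chr2_10_90"],
    ["chr1_100_200", "chr2_10_90"],
]
sentences = list(shuffled_sentences(region_sets, num_shufflings=3))
pairs = context_pairs(sentences[0], window=1)
print(len(sentences), pairs)
```

Shuffling matters because the order of regions in a BED file carries no meaning; repeated shuffles let every pair of co-occurring regions fall inside the context window at some point.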
Train shared embeddings for region sets and metadata labels using StarSpace.
Use for: Metadata-aware searches, cross-modal queries (region→label or label→region), joint analysis of genomic content and experimental conditions.
Workflow:
Reference: See references/bedspace.md for detailed workflow, search types, and examples.
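The cross-modal queries described above come down to nearest-neighbor lookups in one shared vector space. The sketch below illustrates that idea with cosine similarity over toy vectors. The vectors, file names, labels, and the nearest helper are all made up for illustration; this is not the BEDspace implementation or its API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, candidates, n):
    """Return the names of the n candidates most similar to query_vec."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:n]]


# Toy shared embedding space: labels and region sets live side by side.
label_vecs = {"liver": [0.9, 0.1], "brain": [0.1, 0.9], "blood": [0.7, 0.7]}
region_set_vecs = {"sample_a.bed": [0.8, 0.2], "sample_b.bed": [0.2, 0.8]}

# Region -> label query: which labels sit closest to this region set?
top_labels = nearest(region_set_vecs["sample_a.bed"], label_vecs, n=2)
# Label -> region query: which region set sits closest to this label?
top_sets = nearest(label_vecs["brain"], region_set_vecs, n=1)
print(top_labels, top_sets)
```

Because both kinds of object share one space, the same function serves both query directions; only the query and the candidate dictionary swap roles.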
Train Region2Vec models on single-cell ATAC-seq data for cell-level embeddings.
Use for: scATAC-seq clustering, cell-type annotation, dimensionality reduction of single cells, integration with scanpy workflows.
Workflow:
Reference: See references/scembed.md for detailed workflow, parameters, and examples.
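One simple way to get from region embeddings to a cell-level embedding is to average the vectors of the regions that are accessible in each cell. The sketch below shows that pooling step on toy data. Treating the cell embedding as a plain mean is an assumption made here for illustration, and mean_pool and the token names are hypothetical rather than part of geniml.

```python
def mean_pool(cell_tokens, region_vectors):
    """Average the vectors of a cell's known region tokens; None if none are known."""
    known = [region_vectors[t] for t in cell_tokens if t in region_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Toy region embeddings, as if produced by a trained Region2Vec-style model.
region_vectors = {
    "chr1_100_200": [1.0, 0.0],
    "chr1_500_650": [0.0, 1.0],
    "chr2_10_90": [1.0, 1.0],
}
# Accessible regions per cell; "chrX_1_2" is outside the vocabulary and is skipped.
cells = {
    "cell_1": ["chr1_100_200", "chr2_10_90"],
    "cell_2": ["chr1_500_650", "chrX_1_2"],
    "cell_3": [],
}
cell_embeddings = {cell: mean_pool(tokens, region_vectors) for cell, tokens in cells.items()}
print(cell_embeddings)
```

Cells with no in-vocabulary regions get None here, so a caller has to decide explicitly whether to drop them or substitute a zero vector before clustering.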
Build reference peak sets (universes) from BED file collections using multiple statistical methods.
Use for: Creating tokenization references, standardizing regions across datasets, defining consensus features with statistical rigor.
Workflow:
Methods:
Reference: See references/consensus_peaks.md for method comparison, parameters, and examples.
Additional tools for caching, randomization, evaluation, and search.
Available utilities:
Reference: See references/utilities.md for detailed usage of each utility.
from geniml.tokenization import hard_tokenization
from geniml.region2vec import region2vec
from geniml.evaluation import evaluate_embeddings

# Step 1: Tokenize BED files
hard_tokenization(
    src_folder='bed_files/',
    dst_folder='tokens/',
    universe_file='universe.bed',
    p_value_threshold=1e-9
)

# Step 2: Train Region2Vec
region2vec(
    token_folder='tokens/',
    save_dir='model/',
    num_shufflings=1000,
    embedding_dim=100
)

# Step 3: Evaluate
metrics = evaluate_embeddings(
    embeddings_file='model/embeddings.npy',
    labels_file='metadata.csv'
)
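The tokenization step in the pipeline above maps each input region onto the universe regions it overlaps. The following is a minimal sketch of that idea using a plain overlap test on half-open BED intervals; it ignores the p_value_threshold shown above. The tokenize and overlaps helpers and the token format are hypothetical and do not reflect geniml internals.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tokenize(regions, universe):
    """Map regions to universe tokens; also report the fraction of regions that matched."""
    tokens = []
    covered = 0
    for region in regions:
        hits = [u for u in universe if overlaps(region, u)]
        if hits:
            covered += 1
        for chrom, start, end in hits:
            token = f"{chrom}_{start}_{end}"
            if token not in tokens:  # keep each universe token once
                tokens.append(token)
    coverage = covered / len(regions) if regions else 0.0
    return tokens, coverage


universe = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 10, 90)]
regions = [("chr1", 150, 250), ("chr1", 640, 700), ("chr3", 5, 50), ("chr1", 200, 300)]

tokens, coverage = tokenize(regions, universe)
print(tokens, coverage)
```

The returned coverage fraction is one quick way to see how well a universe fits a dataset: a low value means many regions have no counterpart in the universe and are lost at this step.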
import scanpy as sc
from geniml.scembed import ScEmbed
from geniml.io import tokenize_cells

# Step 1: Load data
adata = sc.read_h5ad('scatac_data.h5ad')

# Step 2: Tokenize cells
tokenize_cells(
    adata='scatac_data.h5ad',
    universe_file='universe.bed',
    output='tokens.parquet'
)

# Step 3: Train scEmbed
model = ScEmbed(embedding_dim=100)
model.train(dataset='tokens.parquet', epochs=100)

# Step 4: Generate embeddings
embeddings = model.encode(adata)
adata.obsm['scembed_X'] = embeddings

# Step 5: Cluster with scanpy
sc.pp.neighbors(adata, use_rep='scembed_X')
sc.tl.leiden(adata)
sc.tl.umap(adata)
# Generate coverage
cat bed_files/*.bed > combined.bed
uniwig -m 25 combined.bed chrom.sizes coverage/

# Build universe with coverage cutoff
geniml universe build cc \
    --coverage-folder coverage/ \
    --output-file universe.bed \
    --cutoff 5 \
    --merge 100 \
    --filter-size 50

# Evaluate universe quality
geniml universe evaluate \
    --universe universe.bed \
    --coverage-folder coverage/ \
    --bed-folder bed_files/
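To show what a coverage-cutoff universe amounts to, here is a small pure-Python sketch that mirrors the three knobs in the command above: keep positions covered by at least `cutoff` files, merge blocks separated by at most `merge` bases, and drop blocks shorter than `filter_size`. The exact boundary semantics of the real CLI flags are an assumption, and coverage_cutoff_universe is a hypothetical helper, not a geniml function. It also assumes regions within one file do not overlap each other.

```python
from collections import defaultdict


def coverage_cutoff_universe(bed_sets, cutoff, merge, filter_size):
    """Build consensus regions from per-file lists of (chrom, start, end) intervals."""
    events = defaultdict(list)
    for regions in bed_sets:
        for chrom, start, end in regions:
            events[chrom].append((start, 1))   # coverage rises at start
            events[chrom].append((end, -1))    # and falls at the half-open end
    universe = []
    for chrom in sorted(events):
        depth = 0
        block_start = None
        blocks = []
        # Sorting puts -1 before +1 at equal positions, matching half-open intervals.
        for pos, delta in sorted(events[chrom]):
            previous = depth
            depth += delta
            if previous < cutoff <= depth:
                block_start = pos
            elif depth < cutoff <= previous:
                blocks.append([block_start, pos])
        merged = []
        for start, end in blocks:
            if merged and start - merged[-1][1] <= merge:
                merged[-1][1] = end  # close enough: extend the previous block
            else:
                merged.append([start, end])
        universe.extend((chrom, s, e) for s, e in merged if e - s >= filter_size)
    return universe


bed_sets = [
    [("chr1", 100, 300), ("chr1", 1000, 1020)],
    [("chr1", 150, 350), ("chr1", 1005, 1030)],
    [("chr1", 360, 500)],
    [("chr1", 380, 520)],
]
universe = coverage_cutoff_universe(bed_sets, cutoff=2, merge=100, filter_size=50)
print(universe)
```

In this toy input the two well-supported blocks (150-300 and 380-500) are 80 bases apart and are merged into one region, while the 15-base block near position 1005 is removed by the size filter.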
Geniml provides command-line interfaces for major operations:
# Region2Vec training
geniml region2vec --token-folder tokens/ --save-dir model/ --num-shuffle 1000
# BEDspace preprocessing
geniml bedspace preprocess --input regions/ --metadata labels.csv --universe universe.bed
# BEDspace training
geniml bedspace train --input preprocessed.txt --output model/ --dim 100
# BEDspace search
geniml bedspace search -t r2l -d distances.pkl -q query.bed -n 10
# Universe building
geniml universe build cc --coverage-folder coverage/ --output-file universe.bed --cutoff 5
# BEDshift randomization
geniml bedshift --input peaks.bed --genome hg38 --preserve-chrom --iterations 100
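For intuition about region randomization, the sketch below perturbs each region by a random offset while keeping it on the same chromosome and at the same length. This is only an illustration of the general idea of building a randomized background; it is not the BEDshift algorithm. The shift_regions helper, the max_shift parameter, and the toy chromosome size are hypothetical, and each region is assumed to be shorter than its chromosome.

```python
import random


def shift_regions(regions, chrom_sizes, max_shift, seed=0):
    """Shift each (chrom, start, end) by a random offset, clamped to chromosome bounds."""
    rng = random.Random(seed)
    shifted = []
    for chrom, start, end in regions:
        length = end - start
        offset = rng.randint(-max_shift, max_shift)
        # Clamp so the region stays inside [0, chromosome size] with unchanged length.
        new_start = min(max(0, start + offset), chrom_sizes[chrom] - length)
        shifted.append((chrom, new_start, new_start + length))
    return shifted


chrom_sizes = {"chr1": 1000}  # toy genome, not a real assembly
regions = [("chr1", 10, 60), ("chr1", 900, 990)]

# Several randomized replicates, one per seed, analogous to repeated iterations.
replicates = [shift_regions(regions, chrom_sizes, max_shift=100, seed=i) for i in range(3)]
print(replicates)
```

Using one seed per replicate keeps the randomized sets reproducible, which helps when they serve as a null background for evaluating embeddings or overlaps.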
Use Region2Vec when:
Use BEDspace when:
Use scEmbed when:
Use Universe Building when:
Use Utilities when:
adata.obsm entries
Geniml is part of the BEDbase ecosystem:
"Tokenization coverage too low":
"Training not converging":
"Out of memory errors":
"StarSpace not found" (BEDspace):
--path-to-starspace parameter correctly
For detailed troubleshooting and method-specific issues, consult the appropriate reference file.
Weekly Installs: 116
Repository: davila7/claude-code-templates
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: claude-code (98), opencode (91), gemini-cli (87), cursor (87), antigravity (82), codex (76)