arboreto by davila7/claude-code-templates
```bash
npx skills add https://github.com/davila7/claude-code-templates --skill arboreto
```
Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.
Core capability: identify which transcription factors (TFs) regulate which target genes, based on expression patterns across observations (cells, samples, conditions).
Install arboreto:
```bash
uv pip install arboreto
```
Basic GRN inference:
```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (rows are observations, columns are genes)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results as three unlabeled columns: TF, target, importance
    network.to_csv('network.tsv', sep='\t', index=False, header=False)
```
Critical: always wrap the script body in an `if __name__ == '__main__':` guard, because Dask spawns new processes.
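Before running inference it can help to confirm that the matrix really has the layout described above. The following is a minimal sketch, assuming rows are observations and every column is a numeric gene column; `check_expression_matrix` is a hypothetical helper written for illustration and is not part of arboreto.

```python
import pandas as pd


def check_expression_matrix(df: pd.DataFrame) -> list[str]:
    """Return a list of layout problems; an empty list means none were found."""
    problems = []

    # Every column should be a gene with numeric expression values.
    non_numeric = [
        str(name)
        for name, dtype in zip(df.columns, df.dtypes)
        if not pd.api.types.is_numeric_dtype(dtype)
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    # Gene names are used as identifiers, so they should be unique.
    duplicated = [str(name) for name in df.columns[df.columns.duplicated()]]
    if duplicated:
        problems.append(f"duplicated gene names: {duplicated}")

    # Missing values are flagged so they can be handled before inference.
    if df.isna().any().any():
        problems.append("missing values present")

    return problems


# Toy matrix with a stray identifier column left in by mistake.
toy = pd.DataFrame(
    {"GeneA": [1.0, 2.0, 3.0], "GeneB": [0.0, 0.5, 1.5], "cell_id": ["c1", "c2", "c3"]}
)
print(check_expression_matrix(toy))
```

A stray identifier column like `cell_id` is a typical reason a file that looks correct is not in the genes-as-columns layout.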
For standard GRN inference workflows, see references/basic_inference.md.
For standard inference tasks, use the ready-to-run script scripts/basic_grn_inference.py:
```bash
python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
```
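The contents of scripts/basic_grn_inference.py are not shown here. As a rough sketch of how a command line with this shape could be parsed, assuming two positional paths plus optional `--tf-file` and `--seed` arguments, one might write the following; `build_parser` and the argument names are illustrative and may differ from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the command line shown above."""
    parser = argparse.ArgumentParser(description="Infer a GRN from an expression matrix")
    parser.add_argument("expression_file", help="tab-separated matrix, genes as columns")
    parser.add_argument("output_file", help="where to write the TF/target/importance table")
    parser.add_argument("--tf-file", default=None, help="optional file listing TF names")
    parser.add_argument("--seed", type=int, default=None, help="random seed for reproducibility")
    return parser


# Parse the same arguments as the example invocation.
args = build_parser().parse_args(
    ["expression_data.tsv", "output_network.tsv", "--tf-file", "tfs.txt", "--seed", "777"]
)
print(args.expression_file, args.output_file, args.tf_file, args.seed)
```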
Arboreto provides two algorithms:
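- GRNBoost2 (recommended): gradient-boosting-based inference; the faster of the two.
- GENIE3: the classic random-forest-based algorithm; slower on large datasets.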
Quick comparison:
```python
from arboreto.algo import grnboost2, genie3

# `matrix` is an expression DataFrame loaded as in the basic example

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)
```
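When both algorithms are run on the same data, one way to compare them is by how much their strongest links overlap. The sketch below compares ranked links rather than raw scores, on the assumption that importance values from the two algorithms are not necessarily on the same scale; `top_link_overlap` is a hypothetical helper, and the toy frames stand in for real output.

```python
import pandas as pd


def top_link_overlap(net_a: pd.DataFrame, net_b: pd.DataFrame, n: int = 100) -> float:
    """Jaccard overlap of the n highest-importance (TF, target) pairs of two networks."""

    def top_pairs(net: pd.DataFrame) -> set:
        top = net.nlargest(n, "importance")
        return set(zip(top["TF"], top["target"]))

    a, b = top_pairs(net_a), top_pairs(net_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy networks with the TF / target / importance columns arboreto returns.
net_a = pd.DataFrame(
    {"TF": ["TF1", "TF1", "TF2"], "target": ["G1", "G2", "G1"], "importance": [0.9, 0.5, 0.1]}
)
net_b = pd.DataFrame(
    {"TF": ["TF1", "TF2", "TF1"], "target": ["G1", "G1", "G2"], "importance": [0.8, 0.6, 0.2]}
)
overlap = top_link_overlap(net_a, net_b, n=2)
print(round(overlap, 3))
```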
For a detailed algorithm comparison, parameters, and selection guidance, see references/algorithms.md.
Scale inference from local multi-core to cluster environments:
Local (default), which uses all available cores automatically:
```python
network = grnboost2(expression_data=matrix)
```
Custom local client, to control resources:
```python
from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

# 10 workers, each limited to 8 GB of memory
local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

# Release the workers when inference is done
client.close()
local_cluster.close()
```
Cluster computing, connecting to a remote Dask scheduler:
```python
from arboreto.algo import grnboost2
from distributed import Client

client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)
```
For cluster setup, performance optimization, and large-scale workflows, see references/distributed_computing.md.
Installation:
```bash
uv pip install arboreto
```
Dependencies: scipy, scikit-learn, numpy, pandas, dask, distributed
Single-cell RNA-seq analysis:
```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer cell-type-specific regulatory network
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter high-confidence links (the 0.5 cutoff is dataset-dependent)
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)
```
RNA-seq data with a restricted TF list:
```python
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )
    network.to_csv('tf_target_network.tsv', sep='\t', index=False)
```
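A TF list only helps if its names actually occur among the gene columns. The check below is a small sketch, assuming the TF names and the column names use the same identifier scheme; `split_tf_names` is a hypothetical helper, not an arboreto function.

```python
import pandas as pd


def split_tf_names(tf_names: list[str], expression_data: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Split TF names into those present among the gene columns and those missing."""
    genes = set(expression_data.columns)
    present = [tf for tf in tf_names if tf in genes]
    missing = [tf for tf in tf_names if tf not in genes]
    return present, missing


# Toy data: one TF name uses a different capitalization than the matrix.
toy_expression = pd.DataFrame({"SOX2": [1.0, 2.0], "MYC": [0.5, 0.1], "GAPDH": [9.0, 8.5]})
present, missing = split_tf_names(["SOX2", "Myc", "NANOG"], toy_expression)
print(present, missing)
```

A long `missing` list usually points to a naming mismatch (case, symbol versus Ensembl ID) rather than to genes that are truly absent.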
Comparing conditions:
```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
```
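Once each condition has its own network, the per-condition tables can be lined up link by link. The following is one possible sketch, assuming links absent from a network can be treated as importance 0 and that scores from separate runs are comparable enough for a rough ranking; `compare_networks` is a hypothetical helper.

```python
import pandas as pd


def compare_networks(net_a: pd.DataFrame, net_b: pd.DataFrame) -> pd.DataFrame:
    """Outer-join two networks on (TF, target) and rank links by change in importance."""
    merged = net_a.merge(net_b, on=["TF", "target"], how="outer", suffixes=("_a", "_b"))
    cols = ["importance_a", "importance_b"]
    # Assumption: a link missing from one network counts as importance 0 there.
    merged[cols] = merged[cols].fillna(0.0)
    merged["delta"] = merged["importance_b"] - merged["importance_a"]
    # Largest absolute change first.
    return merged.sort_values("delta", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)


control = pd.DataFrame({"TF": ["TF1", "TF2"], "target": ["G1", "G2"], "importance": [1.0, 0.5]})
treated = pd.DataFrame({"TF": ["TF1", "TF3"], "target": ["G1", "G3"], "importance": [0.2, 0.9]})
changes = compare_networks(control, treated)
print(changes[["TF", "target", "delta"]])
```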
Arboreto returns a DataFrame with regulatory links:
| Column | Description |
|---|---|
| TF | Transcription factor (regulator) |
| target | Target gene |
| importance | Regulatory importance score (higher = stronger) |
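The basic example writes network.tsv with `header=False`, so the column names have to be supplied again when the file is read back. A minimal sketch, using an in-memory string in place of the real file:

```python
import io

import pandas as pd

# Stand-in for the contents of a header-less network.tsv (values are made up).
saved = "TF1\tG1\t0.9\nTF2\tG1\t0.4\n"

network = pd.read_csv(
    io.StringIO(saved),
    sep="\t",
    header=None,
    names=["TF", "target", "importance"],
)
print(network.columns.tolist())
```

With a real file, replace `io.StringIO(saved)` with the path, for example `'network.tsv'`.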
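Filtering strategy: rank links by importance and keep the strongest, for example with a cutoff such as the `importance > 0.5` threshold used in the single-cell example.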
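Two simple importance-based filters can be sketched with pandas: a global threshold, and the top N regulators per target. Both helpers are hypothetical, and the cutoff values are illustrative rather than recommended defaults.

```python
import pandas as pd


def filter_by_threshold(network: pd.DataFrame, min_importance: float) -> pd.DataFrame:
    """Keep links whose importance is at least min_importance."""
    return network[network["importance"] >= min_importance]


def top_n_per_target(network: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the n highest-importance regulators of each target gene."""
    ranked = network.sort_values("importance", ascending=False)
    return ranked.groupby("target", sort=False).head(n).reset_index(drop=True)


toy_network = pd.DataFrame(
    {
        "TF": ["TF1", "TF2", "TF3", "TF1"],
        "target": ["G1", "G1", "G1", "G2"],
        "importance": [0.9, 0.4, 0.1, 0.3],
    }
)
strong = filter_by_threshold(toy_network, 0.35)
top2 = top_n_per_target(toy_network, 2)
print(len(strong), len(top2))
```

A per-target cut keeps weakly regulated genes in the network, whereas a global threshold can drop them entirely.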
Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:
```python
# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2
network = grnboost2(expression_data=sc_data, tf_names=tf_list)

# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)
```
Always set a seed for reproducible results:
```python
network = grnboost2(expression_data=matrix, seed=777)
```
Run multiple seeds for robustness analysis:
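```python
from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

if __name__ == '__main__':
    client = Client(LocalCluster())

    # `matrix` is the expression DataFrame loaded earlier
    seeds = [42, 123, 777]
    networks = []
    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    # Combine networks and filter consensus links.
    # analyze_consensus is not defined in this snippet; supply your own.
    consensus = analyze_consensus(networks)

    client.close()
```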
- Memory errors: reduce the dataset size by filtering out low-variance genes, or use distributed computing.
- Slow performance: use GRNBoost2 instead of GENIE3, enable a distributed client, and restrict the TF list.
- Dask errors: ensure the `if __name__ == '__main__':` guard is present in scripts.
- Empty results: check the data format (genes as columns) and verify that TF names match gene names.
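Filtering out low-variance genes, as suggested for memory errors, can be sketched with pandas. The helper below is hypothetical and keeps the `n_genes` most variable columns; the right number depends on the dataset, and any TF dropped this way is no longer available as a candidate regulator.

```python
import pandas as pd


def keep_most_variable_genes(expression_data: pd.DataFrame, n_genes: int) -> pd.DataFrame:
    """Return only the n_genes columns with the highest variance, in their original order."""
    variances = expression_data.var(axis=0)
    keep = variances.nlargest(n_genes).index
    return expression_data.loc[:, expression_data.columns.isin(keep)]


toy_matrix = pd.DataFrame(
    {
        "GeneA": [1.0, 2.0, 3.0],    # variance 1
        "GeneB": [5.0, 5.0, 5.0],    # variance 0, carries no signal
        "GeneC": [0.0, 10.0, 20.0],  # variance 100
    }
)
reduced = keep_most_variable_genes(toy_matrix, 2)
print(reduced.columns.tolist())
```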