Phylogenetics and Sequence Analysis Tools: PhyKIT, Biopython, DendroPy Bioinformatics Skill

tooluniverse-phylogenetics by mims-harvard/tooluniverse

120 weekly installs

1,200 GitHub Stars

Install command

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-phylogenetics

Phylogenetics and Sequence Analysis

Comprehensive phylogenetics and sequence analysis using PhyKIT, Biopython, and DendroPy. Designed for bioinformatics questions about multiple sequence alignments, phylogenetic trees, parsimony, molecular evolution, and comparative genomics.

IMPORTANT: This skill handles complex phylogenetic workflows. Most implementation details have been moved to references/ for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.

When to Use This Skill

Apply when users:

Have FASTA alignment files and ask about parsimony informative sites, gaps, or alignment quality
Have Newick tree files and ask about treeness, tree length, evolutionary rate, or DVMC
Ask about treeness/RCV, RCV, or relative composition variability
Need to compare phylogenetic metrics between groups (fungi vs animals, etc.)
Ask about PhyKIT functions (treeness, rcv, dvmc, evo_rate, parsimony_informative, tree_length)
Have gene family data with paired alignments and trees
Need Mann-Whitney U tests or other statistical comparisons of phylogenetic metrics
Ask about bootstrap support, branch lengths, or tree topology
Need to build trees (NJ, UPGMA, parsimony) from alignments
Ask about Robinson-Foulds distance or tree comparison

BixBench Coverage: 33 questions across 8 projects (bix-4, bix-11, bix-12, bix-25, bix-35, bix-38, bix-45, bix-60)

NOT for (use other skills instead):

Multiple sequence alignment generation → Use external tools (MUSCLE, MAFFT, ClustalW)
Maximum Likelihood tree construction → Use IQ-TREE, RAxML, or PhyML
Bayesian phylogenetics → Use MrBayes or BEAST
Ancestral state reconstruction → Use separate tools

Core Principles

Data-first approach - Discover and validate all input files (alignments, trees) before any analysis
PhyKIT-compatible - Use PhyKIT functions for treeness, RCV, DVMC, parsimony, evolutionary rate (matches BixBench expected outputs)
Format-flexible - Support FASTA, PHYLIP, Nexus, Newick, and auto-detect formats
Batch processing - Process hundreds of gene alignments/trees in a single analysis
Statistical rigor - Mann-Whitney U, medians, percentiles, standard deviations with scipy.stats
Precision awareness - Match rounding to 4 decimal places (PhyKIT default) or as requested
Group comparison - Compare metrics between taxa groups (e.g., fungi vs animals)
Question-driven - Parse exactly what is asked and return the specific number/statistic

Required Python Packages

# Core (MUST be installed)
import numpy as np
import pandas as pd
from scipy import stats
from Bio import AlignIO, Phylo, SeqIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# PhyKIT (primary computation engine)
from phykit.services.tree.treeness import Treeness
from phykit.services.tree.total_tree_length import TotalTreeLength
from phykit.services.tree.evolutionary_rate import EvolutionaryRate
from phykit.services.tree.dvmc import DVMC
from phykit.services.tree.treeness_over_rcv import TreenessOverRCV
from phykit.services.alignment.parsimony_informative_sites import ParsimonyInformative
from phykit.services.alignment.rcv import RelativeCompositionVariability

# DendroPy (for advanced tree operations)
import dendropy

# ToolUniverse (for sequence retrieval)
from tooluniverse import ToolUniverse

Installation:

pip install phykit dendropy biopython pandas numpy scipy

High-Level Workflow Decision Tree

START: User question about phylogenetic data
│
├─ Q1: What type of analysis is needed?
│  │
│  ├─ ALIGNMENT ANALYSIS (FASTA/PHYLIP files)
│  │  ├─ Parsimony informative sites → phykit_parsimony_informative()
│  │  ├─ RCV score → phykit_rcv()
│  │  ├─ Gap percentage → alignment_gap_percentage()
│  │  ├─ GC content → alignment_statistics()
│  │  └─ See: references/sequence_alignment.md
│  │
│  ├─ TREE ANALYSIS (Newick files)
│  │  ├─ Treeness → phykit_treeness()
│  │  ├─ Tree length → phykit_tree_length()
│  │  ├─ Evolutionary rate → phykit_evolutionary_rate()
│  │  ├─ DVMC → phykit_dvmc()
│  │  ├─ Bootstrap support → extract_bootstrap_support()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ COMBINED ANALYSIS (alignment + tree)
│  │  └─ Treeness/RCV → phykit_treeness_over_rcv()
│  │
│  ├─ TREE CONSTRUCTION (build from alignment)
│  │  ├─ Neighbor-Joining → build_nj_tree()
│  │  ├─ UPGMA → build_upgma_tree()
│  │  ├─ Parsimony → build_parsimony_tree()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ GROUP COMPARISON (fungi vs animals, etc.)
│  │  ├─ Batch compute metrics per group
│  │  ├─ Mann-Whitney U test
│  │  ├─ Summary statistics (median, mean, percentiles)
│  │  └─ See: references/parsimony_analysis.md
│  │
│  └─ TREE COMPARISON
│     ├─ Robinson-Foulds distance → robinson_foulds_distance()
│     └─ Bootstrap consensus → bootstrap_analysis()
│
├─ Q2: What data format is available?
│  ├─ FASTA (.fa, .fasta, .faa, .fna)
│  ├─ PHYLIP (.phy, .phylip) - Use phylip-relaxed for long names
│  ├─ Nexus (.nex, .nexus)
│  ├─ Newick (.nwk, .newick, .tre, .tree)
│  └─ Auto-detect with load_alignment() or load_tree()
│
└─ Q3: Is this a batch analysis?
   ├─ Single gene → Run metric function once
   ├─ Multiple genes → Use batch_compute_metric()
   └─ Group comparison → Use discover_gene_files() + compare_groups()
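
The Q2 branch above maps file extensions to formats. A minimal sketch of that lookup is shown below; `guess_format` and `EXTENSION_FORMATS` are hypothetical names used only for illustration, and how the skill's own `load_alignment()` and `load_tree()` detect formats is not shown in this document.

```python
from pathlib import Path

# Extension-to-format table copied from the Q2 branch of the decision tree.
# "phylip-relaxed" is used for PHYLIP files so that long taxon names are accepted.
EXTENSION_FORMATS = {
    ".fa": "fasta", ".fasta": "fasta", ".faa": "fasta", ".fna": "fasta",
    ".phy": "phylip-relaxed", ".phylip": "phylip-relaxed",
    ".nex": "nexus", ".nexus": "nexus",
    ".nwk": "newick", ".newick": "newick", ".tre": "newick", ".tree": "newick",
}


def guess_format(path):
    """Return a format name for `path` based on its extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"Unrecognized extension {suffix!r} for {path}")
    return EXTENSION_FORMATS[suffix]
```

Extension matching is only a first guess; a file with an unexpected extension would still need to be inspected or parsed to confirm its format.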

Quick Reference: Common Metrics

Metric	Function	Input	Description
Treeness	`phykit_treeness(tree_file)`	Newick	Internal branch length / Total branch length
RCV	`phykit_rcv(aln_file)`	FASTA/PHYLIP	Relative Composition Variability
Treeness/RCV	`phykit_treeness_over_rcv(tree, aln)`	Both	Treeness divided by RCV
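Tree Length	`phykit_tree_length(tree_file)`	Newick	Sum of all branch lengths
Evolutionary Rate	`phykit_evolutionary_rate(tree_file)`	Newick	Total branch length / number of terminals
DVMC	`phykit_dvmc(tree_file)`	Newick	Degree of violation of the molecular clock
Parsimony Sites	`phykit_parsimony_informative(aln_file)`	FASTA/PHYLIP	Sites with ≥2 characters that each appear ≥2 times
Gap Percentage	`alignment_gap_percentage(aln_file)`	FASTA/PHYLIP	Percentage of gap characters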

See scripts/tree_statistics.py for implementation.
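
To make the tree-based definitions concrete, here is a minimal standard-library sketch that computes treeness (internal branch length over total branch length), tree length (sum of all branch lengths), and evolutionary rate (total branch length divided by the number of terminals) from a Newick string. It is not PhyKIT's code and not the contents of scripts/tree_statistics.py; `tree_metrics` is a hypothetical name, and the regular expressions assume simple Newick with unquoted names and a branch length on every terminal.

```python
import re

# Simplified Newick handling (assumption: unquoted names, no comments).
# A branch length that follows ")" (optionally after a support value or label)
# belongs to an internal branch; one that follows a taxon name is terminal.
_NUMBER = r"([0-9.eE+-]+)"
_INTERNAL = re.compile(r"\)[^():,;]*:" + _NUMBER)
_TERMINAL = re.compile(r"[(,]\s*[^():,;\s][^():,;]*:" + _NUMBER)


def tree_metrics(newick):
    """Return tree length, treeness and evolutionary rate for a Newick string."""
    internal = [float(x) for x in _INTERNAL.findall(newick)]
    terminal = [float(x) for x in _TERMINAL.findall(newick)]
    total = sum(internal) + sum(terminal)
    if total == 0 or not terminal:
        raise ValueError("no usable branch lengths found")
    return {
        "tree_length": total,
        "treeness": sum(internal) / total,
        "evolutionary_rate": total / len(terminal),
    }


metrics = tree_metrics("((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);")
print({name: round(value, 4) for name, value in metrics.items()})
```

For this four-taxon example the internal branches sum to 0.9 and all branches sum to 2.1, so treeness is about 0.4286 and the evolutionary rate is 0.525. For BixBench answers, the PhyKIT-backed functions in the table remain the reference.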

Common Analysis Patterns (BixBench)

Pattern 1: Single Metric Across Groups

Question: "What is the median DVMC for fungi vs animals?"

Workflow:

# 1. Discover files
fungi_genes = discover_gene_files("data/fungi")
animal_genes = discover_gene_files("data/animals")

# 2. Compute metric
fungi_dvmc = batch_dvmc(fungi_genes)
animal_dvmc = batch_dvmc(animal_genes)

# 3. Compare
fungi_values = list(fungi_dvmc.values())
animal_values = list(animal_dvmc.values())

print(f"Fungi median DVMC: {np.median(fungi_values):.4f}")
print(f"Animal median DVMC: {np.median(animal_values):.4f}")

See: references/parsimony_analysis.md for full implementation

Pattern 2: Statistical Comparison

Question: "What is the Mann-Whitney U statistic comparing treeness between groups?"

Workflow:

from scipy import stats

# Compute treeness for both groups
group1_treeness = batch_treeness(group1_genes)
group2_treeness = batch_treeness(group2_genes)

# Mann-Whitney U test (two-sided)
u_stat, p_value = stats.mannwhitneyu(
    list(group1_treeness.values()),
    list(group2_treeness.values()),
    alternative='two-sided'
)

print(f"U statistic: {u_stat:.0f}")
print(f"P-value: {p_value:.4e}")
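
For readers who want to see what the U statistic measures, the following is a small pure-Python sketch of the rank-sum calculation, with tied values sharing their average rank. `mann_whitney_u` is a hypothetical helper written for illustration; it reports U for the first sample, which is the convention recent SciPy releases follow for `stats.mannwhitneyu`, and it does not compute a p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a, using average ranks for tied values."""
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted(
        [(value, 0) for value in sample_a] + [(value, 1) for value in sample_b],
        key=lambda pair: pair[0],
    )
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; ties share the average of those ranks.
        average_rank = (i + 1 + j) / 2
        rank_sum_a += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_a = len(sample_a)
    return rank_sum_a - n_a * (n_a + 1) / 2


print(mann_whitney_u([0.41, 0.52, 0.63], [0.20, 0.35, 0.47]))
```

Swapping the two arguments gives the complementary statistic, since the two U values always sum to the product of the sample sizes. This is why the argument order matters when a question asks for "the" U statistic.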

Pattern 3: Filtering + Metric

Question: "What is the treeness/RCV for alignments with <5% gaps?"

Workflow:

# 1. Filter by gap percentage
valid_genes = []
for entry in gene_files:
    if 'aln_file' in entry:
        gap_pct = alignment_gap_percentage(entry['aln_file'])
        if gap_pct < 5.0:
            valid_genes.append(entry)

# 2. Compute metric on filtered set
results = batch_treeness_over_rcv(valid_genes)

# 3. Report
values = [r[0] for r in results.values()]  # treeness/rcv ratio
print(f"Median treeness/RCV: {np.median(values):.4f}")
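
The filtering step depends on how the gap percentage is defined. A minimal sketch follows, assuming the percentage is taken over every cell of the alignment and that only `-` counts as a gap; both are assumptions, and `gap_percentage` is a hypothetical stand-in for the skill's `alignment_gap_percentage()`, which reads from a file instead of a list of strings.

```python
def gap_percentage(sequences, gap_chars="-"):
    """Percentage of alignment cells that are gap characters."""
    total_cells = sum(len(seq) for seq in sequences)
    if total_cells == 0:
        raise ValueError("alignment is empty")
    gap_cells = sum(seq.count(ch) for seq in sequences for ch in gap_chars)
    return 100.0 * gap_cells / total_cells


# Toy alignments keyed by gene ID (illustrative data only).
alignments = {
    "geneA": ["ACGTACGTAC", "ACGTACGTAC", "ACGT-CGTAC"],  # 1 gap in 30 cells
    "geneB": ["AC--TACG--", "ACGTTACGTA"],                # 4 gaps in 20 cells
}

# Keep only alignments with fewer than 5% gaps, mirroring step 1 above.
kept = [gene for gene, seqs in alignments.items() if gap_percentage(seqs) < 5.0]
print(kept)
```

Note the strict `<` comparison: an alignment with exactly 5% gaps is excluded, so the threshold wording in the question should be checked.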

Pattern 4: Specific Gene Lookup

Question: "What is the evolutionary rate for gene X?"

Workflow:

# Find gene file
gene_files = discover_gene_files("data/")
gene_entry = [g for g in gene_files if g['gene_id'] == 'X'][0]

# Compute metric
evo_rate = phykit_evolutionary_rate(gene_entry['tree_file'])

print(f"Evolutionary rate for gene X: {evo_rate:.4f}")

Choosing Methods: When to Use What

Alignment Methods

When building alignments (use external tools, not this skill):

Method	Speed	Accuracy	Use Case
ClustalW	Slow	Medium	Small datasets (<100 sequences), educational
MUSCLE	Fast	High	Medium datasets (100-1000 sequences)
MAFFT	Very Fast	Very High	Recommended - Large datasets (>1000 sequences)

For this skill: Work with pre-aligned sequences. Use load_alignment() to read any format.

Tree Building Methods

When to use which tree method:

Method	Speed	Accuracy	Use Case
Neighbor-Joining	Fast	Medium	Quick trees, large datasets, exploratory
UPGMA	Fast	Low	Assumes molecular clock, special cases only
Maximum Parsimony	Medium	Medium	Small datasets, discrete characters
Maximum Likelihood	Slow	High	Use external tools (IQ-TREE, RAxML) for production

Implementation in this skill:

# Fast distance-based trees
tree = build_nj_tree("alignment.fa")  # Neighbor-Joining
tree = build_upgma_tree("alignment.fa")  # UPGMA

# Parsimony (for small alignments)
tree = build_parsimony_tree("alignment.fa")

For production ML trees: Use IQ-TREE or RAxML externally, then analyze with this skill.

See references/tree_building.md for detailed implementations.

Batch Processing

Discovering Gene Files

# Auto-discover paired alignment + tree files
gene_files = discover_gene_files("data/")

# Result: list of dicts with 'gene_id', 'aln_file', 'tree_file'
# [
#   {'gene_id': 'gene1', 'aln_file': 'gene1.fa', 'tree_file': 'gene1.nwk'},
#   {'gene_id': 'gene2', 'aln_file': 'gene2.fa', 'tree_file': 'gene2.nwk'},
#   ...
# ]
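
One way such pairing could work is to group files by their stem (the file name without its extension). The sketch below assumes exactly that; `discover_gene_files_sketch` is a hypothetical name, the extension sets are taken from the format list in the decision tree, and the real `discover_gene_files()` may use different rules.

```python
from pathlib import Path

# Extensions treated as alignments or trees (assumption based on the format list).
ALIGNMENT_EXTS = {".fa", ".fasta", ".faa", ".fna", ".phy", ".phylip"}
TREE_EXTS = {".nwk", ".newick", ".tre", ".tree"}


def discover_gene_files_sketch(directory):
    """Pair alignment and tree files in `directory` that share a file stem."""
    entries = {}
    for path in sorted(Path(directory).iterdir()):
        suffix = path.suffix.lower()
        if suffix in ALIGNMENT_EXTS:
            key = "aln_file"
        elif suffix in TREE_EXTS:
            key = "tree_file"
        else:
            continue  # ignore unrelated files and subdirectories
        entry = entries.setdefault(path.stem, {"gene_id": path.stem})
        entry[key] = str(path)
    # One dict per gene, sorted by gene ID; a key is absent when no file matched.
    return [entries[gene_id] for gene_id in sorted(entries)]
```

Because an entry can lack `aln_file` or `tree_file`, callers should check for the key before using it, as Pattern 3 does with `if 'aln_file' in entry`.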

Computing Metrics in Batch

# Tree metrics
treeness_results = batch_treeness(gene_files)
tree_length_results = batch_tree_length(gene_files)
dvmc_results = batch_dvmc(gene_files)
evo_rate_results = batch_evolutionary_rate(gene_files)

# Alignment metrics
rcv_results = batch_rcv(gene_files)
pi_results = batch_parsimony_informative(gene_files)
gap_results = batch_gap_percentage(gene_files)

# Combined metrics
treeness_rcv_results = batch_treeness_over_rcv(gene_files)

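# Each call returns a dict keyed by gene ID: {gene_id: value}
# (Pattern 3 reads the ratio from batch_treeness_over_rcv values with r[0])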
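
The batch helpers above share one shape: apply a per-file metric to every gene and collect the results by gene ID. A generic sketch of that loop is shown here; `batch_compute_metric_sketch` and its parameters are hypothetical and only illustrate the pattern, not the skill's actual `batch_compute_metric()` signature.

```python
def batch_compute_metric_sketch(gene_files, metric_fn, file_key):
    """Apply `metric_fn` to entry[file_key] for each gene; return {gene_id: value}."""
    results = {}
    for entry in gene_files:
        path = entry.get(file_key)
        if path is None:
            continue  # this gene has no file of the requested kind
        results[entry["gene_id"]] = metric_fn(path)
    return results


# Usage with a stand-in metric (file-name length) so the example runs anywhere.
genes = [
    {"gene_id": "gene1", "aln_file": "gene1.fa", "tree_file": "gene1.nwk"},
    {"gene_id": "gene2", "aln_file": "gene2.fa"},
]
print(batch_compute_metric_sketch(genes, len, "tree_file"))
```

Genes without the requested file are skipped silently in this sketch; when a question requires all genes in a group, it is worth comparing the number of results with the number of discovered entries.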

Statistical Analysis

# Summary statistics (result name chosen so it does not shadow `from scipy import stats`)
treeness_summary = summary_stats(list(treeness_results.values()))
# Returns: {'mean': ..., 'median': ..., 'std': ..., 'min': ..., 'max': ...}

# Group comparison
comparison = compare_groups(
    list(fungi_treeness.values()),
    list(animal_treeness.values()),
    group1_name="Fungi",
    group2_name="Animals"
)
# Returns: {'u_statistic': ..., 'p_value': ..., 'Fungi': {...}, 'Animals': {...}}

See references/parsimony_analysis.md for full workflow.
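
As a rough illustration of the summary dictionary described above, the sketch below builds the same keys with the standard library. `summary_stats_sketch` is a hypothetical name, and it assumes the sample standard deviation (n - 1 denominator); the skill's `summary_stats()` may instead use NumPy's default population formula, which gives a slightly smaller value.

```python
import statistics


def summary_stats_sketch(values):
    """Return mean, median, sample standard deviation, min and max of `values`."""
    values = list(values)
    if not values:
        raise ValueError("no values to summarize")
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Sample standard deviation; defined as 0.0 here for a single value.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


print(summary_stats_sketch([0.42, 0.51, 0.38, 0.47]))
```

When a question asks for a standard deviation or variance, the choice between the sample and population formulas changes the answer, so it should be stated or matched to the question.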

Answer Extraction for BixBench

Question Pattern	Extraction Method
"What is the median X?"	`np.median(values)`
"What is the maximum X?"	`np.max(values)`
"What is the difference between median X for A vs B?"	`abs(np.median(a) - np.median(b))`
"What percentage of X have Y above Z?"	`sum(v > Z for v in values) / len(values) * 100`
"What is the Mann-Whitney U statistic?"	`stats.mannwhitneyu(a, b)[0]`
"What is the p-value?"	`stats.mannwhitneyu(a, b)[1]`
"What is the X value for gene Y?"	`results[gene_id]`
"What is the fold-change in median X?"	`np.median(a) / np.median(b)`
"multiplied by 1000"	`round(value * 1000)`

Rounding Rules

PhyKIT default: 4 decimal places
Percentages: Match question format (e.g., "35%" → integer, "3.5%" → 1 decimal)
P-values: Scientific notation for very small values
U statistics: Integer (no decimals)
Always check question wording: "rounded to 3 decimal places" overrides defaults
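
The rounding rules can be collected into one small formatting helper. The sketch below is one possible reading of them; `format_answer`, its `kind` labels, and the 1e-4 cutoff for switching p-values to scientific notation are all assumptions made for illustration.

```python
def format_answer(value, kind="metric", decimals=None):
    """Format a numeric answer following the rounding rules listed above."""
    if kind == "u_statistic":
        return f"{value:.0f}"  # U statistics are reported without decimals
    if kind == "p_value":
        # Assumed cutoff: scientific notation only for very small p-values.
        return f"{value:.4e}" if value < 1e-4 else f"{value:.4f}"
    if kind == "percentage":
        places = 0 if decimals is None else decimals
        return f"{value:.{places}f}%"
    # Default for PhyKIT metrics: 4 decimal places unless the question says otherwise.
    places = 4 if decimals is None else decimals
    return f"{value:.{places}f}"


print(format_answer(0.123456))                      # default 4 decimals
print(format_answer(0.123456, decimals=3))          # question asked for 3 decimals
print(format_answer(3.2e-8, kind="p_value"))        # very small p-value
print(format_answer(1234.0, kind="u_statistic"))    # integer U statistic
```

Passing `decimals` explicitly models the last rule: wording in the question takes precedence over the defaults.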

BixBench Question Coverage

Project	Questions	Metrics
bix-4	7	DVMC analysis (fungi vs animals)
bix-11	6	Treeness analysis (median, percentages, Mann-Whitney U)
bix-12	5	Parsimony informative sites (counts, percentages, ratios)
bix-25	2	Treeness/RCV with gap filtering
bix-35	4	Evolutionary rate (specific genes, comparisons)
bix-38	5	Tree length (fold-change, variance, paired ratios)
bix-45	4	RCV (Mann-Whitney U, medians, paired differences)
bix-60	1	Average treeness across multiple trees

ToolUniverse Integration

Sequence Retrieval

from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Get sequences from NCBI
result = tu.tools.NCBI_get_sequence(accession="NP_000546")

# Get gene tree from Ensembl
gene_tree_result = tu.tools.EnsemblCompara_get_gene_tree(gene="ENSG00000141510")

# Get species tree from OpenTree (separate name so the gene tree result is not overwritten)
species_tree_result = tu.tools.OpenTree_get_induced_subtree(ott_ids="770315,770319")

File Structure

tooluniverse-phylogenetics/
├── SKILL.md                           # This file (workflow orchestration)
├── QUICK_START.md                     # Quick reference
├── test_phylogenetics.py             # Comprehensive test suite
├── references/
│   ├── sequence_alignment.md         # Alignment analysis details
│   ├── tree_building.md              # Tree construction methods
│   ├── parsimony_analysis.md         # Statistical comparison workflows
│   └── troubleshooting.md            # Common issues and solutions
└── scripts/
    ├── format_alignment.py           # Alignment format conversion
    └── tree_statistics.py            # Core metric implementations

Completeness Checklist

Before returning your answer, verify:

Identified all input files (alignments and/or trees)
Detected group structure (fungi/animals/etc.) if applicable
Used correct PhyKIT function for the requested metric
Processed ALL genes in each group (not just a sample)
Applied correct statistical test if comparison requested
Used correct rounding (4 decimals default, or as specified)
Returned the specific statistic asked for (median, max, U stat, p-value, etc.)
For percentage questions, confirmed whether answer is integer or decimal
For "difference" questions, confirmed direction (A - B vs abs difference)
For Mann-Whitney U, used alternative='two-sided' (default in scipy)

Next Steps

For detailed alignment analysis workflows → See references/sequence_alignment.md
For tree construction methods → See references/tree_building.md
For statistical comparison examples → See references/parsimony_analysis.md
For common errors and solutions → See references/troubleshooting.md
For script implementations → See scripts/tree_statistics.py

Support

For issues with:

PhyKIT functions: Check PhyKIT documentation at https://jlsteenwyk.com/PhyKIT/
Biopython tree/alignment parsing: See https://biopython.org/wiki/Phylo
DendroPy operations: See https://dendropy.org/
ToolUniverse integration: Check ToolUniverse documentation

License

Same as ToolUniverse framework license.

First Seen

Feb 19, 2026
