bioinformatics-fundamentals by delphine-l/claude_global
npx skills add https://github.com/delphine-l/claude_global --skill bioinformatics-fundamentals
Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.
Flags are additive - a read can have multiple flags set simultaneously.
Common Flags:
- 0x0001 (1): Read is paired in sequencing
- 0x0002 (2): Each segment properly aligned (proper pair)
- 0x0004 (4): Read unmapped
- 0x0008 (8): Mate unmapped
- 0x0010 (16): Read mapped to reverse strand
- 0x0020 (32): Mate mapped to reverse strand
- 0x0040 (64): First in pair (R1/forward)
- 0x0080 (128): Second in pair (R2/reverse)
- 0x0100 (256): Secondary alignment
- 0x0400 (1024): PCR or optical duplicate
- 0x0800 (2048): Supplementary alignment

Flag Combinations:
- 99 (0x63 = 1 + 2 + 32 + 64)
- 147 (0x93 = 1 + 2 + 16 + 128)
- 48 (0x30 = 16 + 32)

What "proper pair" means:
Important: Different aligners have different criteria for proper pairs!
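Flag arithmetic can be checked with plain bitwise operations. A minimal sketch (the flag names are shorthand for the table above):

```python
# Decode a SAM FLAG integer into named bits from the table above.
FLAG_BITS = {
    0x0001: "paired",
    0x0002: "proper_pair",
    0x0004: "unmapped",
    0x0008: "mate_unmapped",
    0x0010: "reverse",
    0x0020: "mate_reverse",
    0x0040: "first_in_pair",
    0x0080: "second_in_pair",
    0x0100: "secondary",
    0x0400: "duplicate",
    0x0800: "supplementary",
}

def decode_flag(flag: int) -> set:
    """Return the set of flag names whose bit is set in `flag`."""
    return {name for bit, name in FLAG_BITS.items() if flag & bit}

# 99 = 1 + 2 + 32 + 64: R1 of a proper pair whose mate is on the reverse strand
print(decode_flag(99))
```

Because flags are additive, a combined value like 99 decomposes unambiguously into its component bits.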
Formula: MAPQ = -10 * log10(P(mapping is wrong))
Common Thresholds:
- MAPQ >= 60: High confidence (error probability < 0.0001%)
- MAPQ >= 30: Good quality (error probability < 0.1%)
- MAPQ >= 20: Acceptable (error probability < 1%)
- MAPQ >= 10: Low confidence (error probability < 10%)
- MAPQ = 0: Multi-mapper or unmapped

Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.
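The formula inverts directly, so thresholds can be converted to error probabilities and back (a minimal sketch):

```python
import math

def mapq_to_error_prob(mapq: int) -> float:
    """Invert MAPQ = -10 * log10(P(error)) to get the error probability."""
    return 10 ** (-mapq / 10)

def error_prob_to_mapq(p: float) -> int:
    """Phred-scale an error probability back to a MAPQ value."""
    return round(-10 * math.log10(p))

print(mapq_to_error_prob(30))  # 0.001, i.e. 0.1%
```

This is the same Phred scaling used for base qualities, applied to mapping confidence.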
Represents alignment between read and reference:
- M: Match or mismatch (alignment match)
- I: Insertion in read vs reference
- D: Deletion in read vs reference
- S: Soft clipping (bases in read not aligned)
- H: Hard clipping (bases not in read sequence)
- N: Skipped region (for RNA-seq splicing)

Example: 100M = perfect 100bp match
Example: 50M5I45M = 50bp match, 5bp insertion, 45bp match
Characteristics:
Best Mappers:
- map-pb, map-hifi

Typical Use Cases:
Characteristics:
Best Mappers:
Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!
Typical Use Cases:
Characteristics:
Best Mappers:
Purpose: Filter, convert, and view SAM/BAM files
Key Flags:
- -b: Output BAM format
- -h: Include header
- -f INT: Require flags (keep reads WITH these flags)
- -F INT: Filter flags (remove reads WITH these flags)
- -q INT: Minimum MAPQ threshold
- -L FILE: Keep reads overlapping regions in BED file

Important Behavior:

- -L (region filtering) checks each read individually, not pairs
- Flag filters (-f, -F) are applied before region filters (-L)

Example - Filter for proper pairs:
samtools view -b -f 2 input.bam > proper_pairs.bam
Example - Filter by region (may break pairs):
samtools view -b -L regions.bed input.bam > filtered.bam
Example - Proper pairs in regions (correct order):
samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam
Purpose: Advanced filtering with complex criteria
Key Features:
Common Filters:
- isPaired: true - Read is from paired-end sequencing
- isProperPair: true - Read is part of a proper pair
- isMapped: true - Read is mapped
- mapQuality: ">=30" - Mapping quality threshold

Important Difference from samtools:

- isProperPair is more strict than samtools -f 2

Purpose: Convert SAM/BAM to FASTQ/FASTA
Output Modes:
- outputs: ["r1", "r2"] - Separate forward and reverse for paired-end
- outputs: ["other"] - Single output for single-end data
- outputs: ["r0"] - All reads (mixed paired/unpaired)

Filtering Options:

- inclusive_filter: ["2"] - Require proper pair flag
- exclusive_filter: ["4", "8"] - Exclude unmapped or mate unmapped
- exclusive_filter_all: ["8"] - Exclude if mate unmapped

Critical: Use appropriate filters to ensure R1/R2 files match!
WRONG WAY (breaks pairs):
# Region filter first → breaks pairs when mates are in different regions
samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair
# Result: Empty output (all pairs broken)
RIGHT WAY (preserves pairs):
# Proper pair filter FIRST, then region filter
samtools view -b -f 2 -L regions.bed input.bam > output.bam
# Result: Pairs where both mates are in regions (or one mate in region, other anywhere)
BEST WAY (both mates in regions):
# Filter for proper pairs, then use paired-aware region filtering
samtools view -b -f 2 input.bam | \
# Custom script to keep pairs where both mates in regions
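The pair-aware step can be sketched in pure Python. Here hypothetical (qname, chrom, pos) tuples stand in for BAM records, and `regions` is a dict of BED-style intervals; all names are illustrative, not a real tool:

```python
def in_regions(chrom, pos, regions):
    """True if (chrom, pos) falls inside any [start, end) interval for chrom."""
    return any(start <= pos < end for start, end in regions.get(chrom, []))

def keep_pairs_both_in_regions(reads, regions):
    """Two-pass filter: keep only pairs whose BOTH mates fall in regions."""
    hits = {}  # qname -> number of mates inside a region
    for qname, chrom, pos in reads:
        if in_regions(chrom, pos, regions):
            hits[qname] = hits.get(qname, 0) + 1
    keep = {q for q, n in hits.items() if n == 2}
    return [r for r in reads if r[0] in keep]

regions = {'chr1': [(1000, 2000)]}
reads = [('p1', 'chr1', 1500), ('p1', 'chr1', 1800),  # both mates inside -> kept
         ('p2', 'chr1', 1500), ('p2', 'chr2', 50)]    # one mate outside -> dropped
print(keep_pairs_both_in_regions(reads, regions))
```

The key design point is the two passes: membership is decided per pair (by query name), never per read, so Hi-C-style long-range pairs survive intact.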
For Paired-End:
# Ensure proper pairs before extraction
samtools fastq -f 2 \
    -1 R1.fq.gz -2 R2.fq.gz \
    input.bam
# -f 2 requires the proper-pair flag; -1/-2 route R1 and R2 to separate files
For Single-End:
# Simple extraction
samtools fastq -0 output.fq.gz input.bam
Conservative (high quality):
samtools view -b -q 30 -f 2 -F 2304 input.bam
# MAPQ >= 30, proper pairs, no secondary (256) or supplementary (2048)
# alignments: 256 + 2048 = 2304
Permissive (for low-coverage data):
samtools view -b -q 10 -F 4 input.bam
# MAPQ >= 10, mapped reads
Symptom:
Cause:
- Region filter (samtools view -L) breaks read pairs
- isProperPair filter then removes all remaining reads

Solution:
# Apply proper pair filter BEFORE region filtering
samtools view -b -f 2 -L regions.bed input.bam > output.bam
See Also: common-issues.md for detailed troubleshooting
Symptom:
Cause:
Solution:
# Require proper pairs during extraction
samtools fastq -f 2 -1 R1.fq -2 R2.fq input.bam
Symptom:
Not Actually a Problem:
Solution:
Symptom:
Possible Causes:
Solution:
# Check insert size distribution
samtools stats input.bam | grep "insert size"
# Check pairing flags
samtools flagstat input.bam
N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer
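The definition above computes directly from a list of contig lengths (a minimal sketch):

```python
def n50(lengths):
    """N50: shortest contig length such that contigs of that length or longer
    contain at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached half of the assembly
            return length
    return 0

print(n50([80, 70, 50, 30, 20]))  # 70: 80 + 70 = 150 >= 125 (half of 250)
```

Sorting descending and accumulating until half the total is reached is the standard procedure; ties and the exact half-point are handled by the `>=` comparison.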
How to interpret:
Related Metrics:
Coverage: Percentage of reference bases covered by at least one read
Depth: Average number of reads covering each base
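Expected depth follows from simple arithmetic: total sequenced bases divided by genome size. A sketch with illustrative numbers:

```python
def expected_depth(n_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

# 20 million 150 bp reads over a 100 Mb genome:
print(expected_depth(20_000_000, 150, 100_000_000))  # 30.0x
```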
Recommended Depths:
>sequence_id description
ATCGATCGATCG
ATCGATCG
FASTA records start with >. FASTQ format (4 lines per record):

@read_id
ATCGATCGATCG
+
IIIIIIIIIIII
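The quality line encodes Phred scores as ASCII characters with a +33 offset, so `I` (ASCII 73) means Q40:

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33) into integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]

print(phred_scores('IIII'))  # [40, 40, 40, 40]  ('I' = ASCII 73, 73 - 33 = 40)
```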
chr1 1000 2000 feature_name score +
chr1 1 5000 1 W contig_1 1 5000 +
chr1 5001 5100 2 U 100 scaffold yes proximity_ligation
- map-hifi or map-pb

GenomeArk (s3://genomeark/) is a public AWS S3 bucket containing VGP genome assemblies and QC data. Access requires no credentials when using --no-sign-request.
Critical Discovery : GenomeArk S3 structure has evolved over time (2022 → 2024). Always try multiple path patterns for reliability.
Base structure :
s3://genomeark/species/{Species_name}/{ToLID}/assembly_vgp_{type}_2.0/evaluation/
Key variation:

- Older: assembly_vgp_hic_2.0
- Newer: assembly_vgp_HiC_2.0 (case-sensitive!)
- Normalize hic → HiC before fetching

Path: {assembly}/evaluation/genomescope/
Filename patterns (try in order):
- {ToLID}_genomescope__Summary.txt (Pattern A: double underscore - most common)
- {ToLID}_genomescope_Summary.txt (Pattern C: single underscore ⭐ EASILY MISSED)
- {ToLID}_Summary.txt (Pattern B: no prefix - older assemblies)

⚠️ CRITICAL: ALL THREE patterns must be checked! Pattern C (single underscore) was discovered in Feb 2026 during debugging - checking only patterns A and B causes ~30-40% of the data to be missed!
Example of Pattern C:

- rPlaMeg1_genomescope__Summary.txt ✗
- rPlaMeg1_genomescope_Summary.txt ✓

⚠️ CRITICAL: Validate Data Quality
Failed GenomeScope runs show unrealistic ranges:
Heterozygous (ab) 0% 100% ← FAILED RUN - DO NOT USE
Good runs show narrow ranges:
Heterozygous (ab) 0.49% 0.54% ← VALID - use max value
Validation logic:

# Extract min and max percentages
percentages = [0.49, 0.54]  # Example from parsing
min_val, max_val = percentages[0], percentages[-1]
range_width = max_val - min_val

# Validate before using
if range_width <= 50.0 and max_val <= 95.0:
    heterozygosity = max_val  # ACCEPT
else:
    heterozygosity = None  # REJECT - failed run
Skip values if:

Summary.txt format:
GenomeScope version 2.0
...
property min max
Genome Haploid Length 4,077,481,159 bp 4,095,803,536 bp
Heterozygous (ab) 1.43264% 1.47696%
Genome Repeat Length 2,528,408,288 bp 2,539,769,824 bp
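The Heterozygous line in the sample above can be pulled out with a small regex; the whitespace pattern is an assumption about the layout:

```python
import re

def parse_heterozygosity(summary_text):
    """Extract the min/max heterozygosity percentages from a Summary.txt body."""
    m = re.search(r'Heterozygous \(ab\)\s+([\d.]+)%\s+([\d.]+)%', summary_text)
    if not m:
        return None
    return float(m.group(1)), float(m.group(2))

text = "Heterozygous (ab)          1.43264%          1.47696%"
print(parse_heterozygosity(text))  # (1.43264, 1.47696)
```

Returning both values makes the range-width validation above a one-line subtraction.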
Parsing:

- Repeat content: (repeat_length / genome_size) * 100

Path: {assembly}/evaluation/busco/{subdir}/
Subdirectories vary:

- c/, c1/ - primary results
- p/, p1/ - alternate results

Files: *short_summary*.txt (case-insensitive search)

Filename patterns:

- {ToLID}_HiC__busco_hap1_busco_short_summary.txt
- {ToLID}_busco_short_summary.txt

Format:
# BUSCO version is: 5.2.2
# The lineage dataset is: vertebrata_odb10
...
C:94.0%[S:92.4%,D:1.6%],F:2.7%,M:3.3%,n:3354
Parse the line starting with C:: extract 94.0 from C:94.0%
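Pulling the completeness value out of that line takes one regex (a sketch):

```python
import re

def busco_complete(summary_text):
    """Extract the C: completeness percentage from a BUSCO short summary."""
    m = re.search(r'C:([\d.]+)%', summary_text)
    return float(m.group(1)) if m else None

line = "C:94.0%[S:92.4%,D:1.6%],F:2.7%,M:3.3%,n:3354"
print(busco_complete(line))  # 94.0
```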
⚠️ TWO PATH PATTERNS (structure changed 2022 → 2024):

Pattern A (Newer - Direct, 2024+):

{assembly}/evaluation/merqury/{ToLID}_qv/output_merqury.tabular

Pattern B (Older - Nested, 2022):

{assembly}/evaluation/merqury/{c,p}/{ToLID}_qv/output_merqury.tabular

Strategy: Try the direct path first, then search for nested subdirectories.

File format (tab-separated, may have a header):
assembly unique k-mers common k-mers QV error rate
assembly_01 20197 2133011206 63.4592 4.50896e-07
assembly_02 19654 2304717679 63.9138 4.06084e-07
Both 39851 4437728885 63.6894 4.27623e-07
Parsing:

- Skip the header line if it starts with assembly\t

def normalize_s3_path(s3_path):
    """Normalize path for GenomeArk (case sensitivity!)"""
    if not s3_path:
        return None
    # Critical: HiC capitalization
    s3_path = s3_path.replace('/assembly_vgp_hic_2.0/', '/assembly_vgp_HiC_2.0/')
    if not s3_path.endswith('/'):
        s3_path += '/'
    return s3_path
import subprocess

def fetch_genomescope_data(s3_path):
    """Fetch with validation"""
    s3_path = normalize_s3_path(s3_path)
    tolid = s3_path.rstrip('/').split('/')[-2]
    # Try ALL THREE filename patterns
    for filename in [
        f'{tolid}_genomescope__Summary.txt',  # Pattern A: double underscore
        f'{tolid}_genomescope_Summary.txt',   # Pattern C: single underscore
        f'{tolid}_Summary.txt'                # Pattern B: no prefix
    ]:
        file_path = f"{s3_path}evaluation/genomescope/{filename}"
        result = subprocess.run(['aws', 's3', 'cp', file_path, '-', '--no-sign-request'],
                                capture_output=True, text=True, timeout=30)
        if result.returncode == 0 and result.stdout:
            # Parse and validate
            data = parse_genomescope(result.stdout)
            # Validate heterozygosity range
            if 'heterozygosity' in data:
                # Check if the range is reasonable
                if heterozygosity_range > 50.0 or max_het > 95.0:
                    del data['heterozygosity']  # Skip invalid value
            if data:
                return data
    return None
def fetch_merqury_data(s3_path):
    """Fetch from direct or nested paths"""
    s3_path = normalize_s3_path(s3_path)
    tolid = s3_path.rstrip('/').split('/')[-2]
    # Try the direct path first (newer structure)
    direct_path = f"{s3_path}evaluation/merqury/{tolid}_qv/output_merqury.tabular"
    result = subprocess.run(['aws', 's3', 'cp', direct_path, '-', '--no-sign-request'],
                            capture_output=True, text=True, timeout=30)
    if result.returncode == 0 and result.stdout:
        # Parse QV from column 4
        for line in result.stdout.split('\n'):
            if line.strip() and not line.startswith('assembly\t'):
                parts = line.split('\t')
                if len(parts) >= 4:
                    return {'qv': float(parts[3]), 'path_type': 'direct'}
    # Fallback: search nested subdirectories (older structure)
    # List subdirectories, try c/, p/, etc.
    ...
def fetch_busco_data(s3_path):
    """Search dynamic subdirectories"""
    s3_path = normalize_s3_path(s3_path)
    # List busco/ subdirectories
    list_result = subprocess.run(['aws', 's3', 'ls', f"{s3_path}evaluation/busco/", '--no-sign-request'],
                                 capture_output=True, text=True, timeout=10)
    # Find subdirectories (lines with 'PRE')
    subdirs = [line.split('PRE')[1].strip().rstrip('/')
               for line in list_result.stdout.split('\n') if 'PRE' in line]
    # Try each subdirectory for short_summary files
    ...
- assembly_vgp_hic_2.0 in table → assembly_vgp_HiC_2.0 in S3

For public buckets like GenomeArk:
Prefer: subprocess + aws s3 CLI with --no-sign-request
Avoid: boto3 (requires credential config even for public access)

# Simple and works
cmd = ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request']
result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
Confirmed working paths:
# GenomeScope - Pattern A (double underscore)
aws s3 cp s3://genomeark/species/Gastrophryne_carolinensis/aGasCar1/assembly_vgp_HiC_2.0/evaluation/genomescope/aGasCar1_genomescope__Summary.txt - --no-sign-request
# GenomeScope - Pattern C (single underscore) ⭐ NEW PATTERN
aws s3 cp s3://genomeark/species/Platysternon_megacephalum/rPlaMeg1/assembly_vgp_HiC_2.0/evaluation/genomescope/rPlaMeg1_genomescope_Summary.txt - --no-sign-request
# GenomeScope - Pattern B (no prefix - older)
aws s3 cp s3://genomeark/species/Spea_bombifrons/aSpeBom1/assembly_vgp_standard_2.0/evaluation/genomescope/aSpeBom1_Summary.txt - --no-sign-request
# BUSCO
aws s3 cp s3://genomeark/species/Gastrophryne_carolinensis/aGasCar1/assembly_vgp_HiC_2.0/evaluation/busco/c/aGasCar1_HiC__busco_hap1_busco_short_summary.txt - --no-sign-request
# Merqury - Direct path (2024+)
aws s3 cp s3://genomeark/species/Ia_io/mIaxIox2/assembly_vgp_HiC_2.0/evaluation/merqury/mIaxIox2_qv/output_merqury.tabular - --no-sign-request
# Merqury - Nested path (2022)
aws s3 cp s3://genomeark/species/Gastrophryne_carolinensis/aGasCar1/assembly_vgp_HiC_2.0/evaluation/merqury/aGasCar1_qv/output_merqury.tabular - --no-sign-request
Karyotype data (diploid 2n and haploid n chromosome numbers) is critical for genome assembly validation but rarely available via APIs. Manual literature curation is required.
"{species_name} karyotype chromosome 2n"
"{species_name} diploid number karyotype"
"{genus} karyotype evolution"
"cytogenetic analysis {family_name}"
"{species_name} chromosome number diploid"
⚠️ Warning: Chromosome-level assemblies often report fewer chromosomes than the actual diploid number.

Why: Assemblies typically capture only:

Example: Waterfowl with 2n = 80 often have genome assemblies with 34-42 "chromosomes"
When specific karyotype data is unavailable but genus/family patterns are strong:
High confidence inference (acceptable for publication):
Document inference clearly :
accession,taxid,species,2n,n,notes,reference
GCA_XXX,123,Species name,80,40,Inferred from Anatidae conservation,https://family-level-study.url
Priority for direct confirmation :
When both sex chromosomes are in main haplotype (common in VGP assemblies):
CSV Structure :
accession,taxid,species_name,diploid_2n,haploid_n,notes,reference
GCA_XXXXXX,12345,Species name,80,40,Brief description,https://doi.org/...
Notes field examples :
TIER 1 (>90% success rate):
TIER 2 (70-90% success rate):
TIER 3 (50-70% success rate):
Low priority (<50% success rate):
Genome assembly metadata typically includes both haploid and diploid chromosome counts:
# Typical column names (exact names vary by dataset):
df['num_chromosomes'] # Often diploid (2n)
df['total_number_of_chromosomes'] # Often haploid (n)
df['karyotype'] # Usually haploid (n)
df['num_chromosomes_haploid_adjusted'] # Haploid with sex chr adjustment
⚠️ WARNING : Column names are NOT standardized across datasets - always verify which is which!
Use HAPLOID (n) for:
Use DIPLOID (2n) for:
Problem: Used num_chromosomes (diploid) for per-assembly comparison
Result: All assemblies appeared to have 2× the expected chromosome count
Fix: Changed to total_number_of_chromosomes (haploid)
Validation: Ratio now ~1.0 instead of ~2.0
# WRONG - uses diploid count
fig, ax = plt.subplots()
ax.scatter(df['num_chromosomes'], df['num_scaffolds_assigned'])
# Result: Everything appears at 2× diagonal
# CORRECT - uses haploid count
fig, ax = plt.subplots()
ax.scatter(df['total_number_of_chromosomes'], df['num_scaffolds_assigned'])
# Result: Expected 1:1 diagonal relationship
Some species have different haploid counts by sex:
Check for adjusted counts:
# Some datasets provide sex-adjusted haploid counts
# Example: Human male
# Karyotype n = 23 (22 autosomes + X or Y)
# But for telomere counting: 24 (22 autosomes + X + Y both have telomeres)
df['num_chromosomes_haploid_adjusted'] # May add +1 for male XY
# Check if counts are haploid or diploid by testing known species
human_samples = df[df['species'] == 'Homo sapiens']
median_count = human_samples['column_name'].median()
if median_count > 40:
print("Likely diploid (2n) - expect ~46 for humans")
elif median_count > 20:
print("Likely haploid (n) - expect ~23 for humans")
else:
print("Check data - values unexpectedly low")
# Verify ratios make biological sense
df['ratio'] = df['scaffolds_assigned'] / df['haploid_count']
assert 0.5 < df['ratio'].median() < 2.0, "Ratio should be near 1.0 for good assemblies"
# Check for systematic doubling
if df['ratio'].median() > 1.8:
print("WARNING: May be using diploid count - ratios systematically doubled")
- Assuming column names are accurate (num_chromosomes could be either n or 2n)
- Not accounting for sex chromosomes
- Mixing haploid and diploid across analyses
- Forgetting about polyploids
Time Tree databases sometimes use proxy/replacement species when they don't have phylogenetic data for the exact species needed. This creates a mismatch between tree species names and dataset species names.
Pattern:
Solution Workflow:
1. Create species_replacements.json mapping actual dataset names to tree proxy names:

{
  "actual_species_name": "tree_proxy_name",
  "Anniella_stebbinsi": "Anniella_pulchra",
  "Pelomedusa_somalica": "Pelomedusa_subrufa"
}
2. Update tree file to use actual dataset names:
* Read Newick tree file
* Replace proxy names with actual species names
* Ensures tree matches dataset exactly
3. Synchronize all config files using actual names:
* iTOL colorstrip configs
* Label configs
* Any taxonomic annotation files
4. Recover missing data if needed:
* Check deprecated datasets for actual species
* Proxy species indicates actual species likely exists in data
* Add to current dataset after recovery
Why This Matters:
Common Files Needing Synchronization:

- Tree_final.nwk - Main phylogenetic tree
- itol_taxonomic_colorstrip_final.txt - Taxonomic annotations
- species_*_methods.csv - Species classification configs

When reconciling phylogenetic trees with species datasets:
Coverage Metric:
Coverage = (Species in both tree AND dataset) / (Total species in tree) × 100%
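As set arithmetic, the metric is an intersection over the tree's species (Gallus_gallus and Mus_musculus below are illustrative names):

```python
def tree_coverage(tree_species, dataset_species):
    """Percentage of tree species that also appear in the dataset."""
    if not tree_species:
        return 0.0
    return len(tree_species & dataset_species) / len(tree_species) * 100

tree = {'Anniella_pulchra', 'Pelomedusa_subrufa', 'Homo_sapiens', 'Mus_musculus'}
data = {'Anniella_pulchra', 'Homo_sapiens', 'Gallus_gallus'}
print(tree_coverage(tree, data))  # 50.0: 2 of 4 tree species are in the dataset
```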
Identifying Missing Species:

1. Extract species from the tree:

import re

with open('Tree_final.nwk', 'r') as f:
    tree_content = f.read()
# Extract species names (underscored format)
tree_species = set(re.findall(r'([A-Z][a-z]+_[a-z]+)', tree_content))
2. Extract species from the dataset:

import pandas as pd

df = pd.read_csv('species_methods.csv')
dataset_species = set(df['Species'].str.replace(' ', '_'))
3. Find missing species:

missing = tree_species - dataset_species

4. Categorize missing species:
- Recoverable: Time Tree replacements or present in deprecated datasets
- Phylogenetic context: tree-only species kept for evolutionary context
- Unknown curation: in the dataset but cannot be classified
Recovery Workflow:
# Check if missing species are Time Tree replacements
import json

replacements = json.load(open('species_replacements.json'))
for species in missing:
    tree_name = species.replace('_', ' ')
    if tree_name in replacements.values():
        actual_name = [k for k, v in replacements.items() if v == tree_name][0]
        # Search deprecated datasets for actual_name
        # Recover and add to the current dataset
Acceptable Coverage Levels:
Example Results:
AGP (A Golden Path) format describes how assembled sequences (chromosomes, scaffolds) are constructed from component sequences (contigs, scaffolds) and gaps. Critical for genome assembly curation and submission to NCBI/EBI.
Tab-delimited format with 9 columns for sequence lines (type W) or 8+ columns for gap lines (type U/N)
Sequence Lines (Column 5 = 'W'):
object obj_beg obj_end part_num W component_id comp_beg comp_end orientation
Gap Lines (Column 5 = 'U' or 'N'):
object obj_beg obj_end part_num gap_type gap_length gap_type linkage linkage_evidence
Rule 1: Object and Component Lengths MUST Match
For sequence lines, the span in the object MUST equal the span in the component:
(obj_end - obj_beg + 1) == (comp_end - comp_beg + 1)
Example - CORRECT:
Scaffold_47_unloc_1 1 54360 1 W scaffold_23.hap1 19274039 19328398 -
# Object length: 54360 - 1 + 1 = 54,360 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✓
Example - INCORRECT:
Scaffold_47_unloc_1 1 19328398 1 W scaffold_23.hap1 19274039 19328398 -
# Object length: 19328398 - 1 + 1 = 19,328,398 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✗
# ERROR: Lengths don't match!
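Rule 1 can be enforced mechanically; a sketch using the two examples above:

```python
def valid_agp_span(obj_beg, obj_end, comp_beg, comp_end):
    """Rule 1: the object span must equal the component span (1-based, inclusive)."""
    return (obj_end - obj_beg + 1) == (comp_end - comp_beg + 1)

print(valid_agp_span(1, 54360, 19274039, 19328398))     # True: both spans are 54,360 bp
print(valid_agp_span(1, 19328398, 19274039, 19328398))  # False: lengths don't match
```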
Rule 2: Component Numbering Restarts for New Objects
Each object (column 1) has its own component numbering (column 4) starting at 1:
Scaffold_10 1 30578279 1 W scaffold_4.hap2 1 30578279 -
Scaffold_10_unloc_1 1 65764 1 W scaffold_74.hap2 1 65764 + # ← Starts at 1, not 3!
Rule 3: Sequential Component Numbering Within Objects
Component numbers increment sequentially (gaps and sequences both count):
Scaffold_2 1 1731008 1 W scaffold_25.hap1 1 1731008 -
Scaffold_2 1731009 1731108 2 U 100 scaffold yes proximity_ligation
Scaffold_2 1731109 1956041 3 W scaffold_70.hap1 1 224933 -
Symptom:
ERROR: object coordinates (1, 19328398) and component coordinates (19274039, 19328398)
do not have the same length
Cause: When converting a region of a scaffold into an unlocalized sequence (unloc), the object coordinates must represent the length of the extracted region, not the original component end coordinate.
Wrong Approach:
# Setting object end to component end coordinate
agp_df.loc[index, 'chr_end'] = agp_df.loc[index, 'scaff_end'] # ✗ WRONG
Correct Approach:
# Calculate actual length from component coordinates
agp_df.loc[index, 'chr_end'] = int(agp_df.loc[index, 'scaff_end']) - int(agp_df.loc[index, 'scaff_start']) + 1 # ✓ CORRECT
Symptom: Unloc scaffolds have component numbers > 1 when they should start at 1.
Cause: When creating a new object (unloc scaffold), component numbering wasn't reset.
Solution:
# When creating unlocs, reset component number
agp_df.loc[index, '#_scaffs'] = 1 # Column 4: component number
Symptom: Unloc sequences inherit cumulative coordinates from parent scaffolds.
Cause: AGPcorrect adjusts coordinates based on sequence length corrections. When scaffolds are later split into unlocs, the accumulated corrections need to be recalculated based on actual component spans.
Solution: Always recalculate object coordinates from component spans when creating new objects (unlocs).
- Length formula: end - start + 1

# When extracting a region to create an unloc:
# 1. Calculate the actual length of the region
length = int(comp_end) - int(comp_start) + 1
# 2. Set object coordinates for the new unloc
obj_start = 1 # Always starts at 1
obj_end = length # Equals the length
# 3. Reset component number
component_num = 1 # New object, new numbering
# 4. Rename the object
new_object_name = f"{parent_scaffold}_unloc_{unloc_number}"
Use NCBI's AGP validator:
agp_validate assembly.agp
Common validation checks:
When splitting diploid assemblies into haplotypes:
When encountering coordinate errors:
# For each AGP line, verify:
obj_length = int(obj_end) - int(obj_beg) + 1
comp_length = int(comp_end) - int(comp_beg) + 1
assert obj_length == comp_length, f"Length mismatch: obj={obj_length}, comp={comp_length}"
# For sequential component numbers:
assert comp_num == expected_num, f"Component number gap: got {comp_num}, expected {expected_num}"
1. Raw Assembly AGP:
2. Corrected AGP:
3. Haplotype-Split AGP:
4. Final Curated AGP:
When analyzing telomere data from BED files to classify scaffolds:
File Structure :
Best Practice - Use Python CSV Module :
import csv
from collections import defaultdict
# Use defaultdict for automatic initialization
telomere_counts = defaultdict(lambda: {'terminal': 0, 'interstitial': 0})
# Process with csv.reader (more portable than pandas)
with open('telomeres.bed', 'r') as f:
reader = csv.reader(f, delimiter='\t')
for row in reader:
scaffold = row[0]
accession = row[10] # GCA accession
key = (accession, scaffold)
telomere_counts[key]['terminal'] += 1
Why CSV over pandas :
Classification Categories :
Problem : Need chromosome counts for 400+ assemblies from NCBI.
Anti-pattern : Query NCBI datasets API for each accession
# DON'T: Query 400+ times
for accession in missing_data:
result = subprocess.run(['datasets', 'summary', 'genome', 'accession', accession])
# Takes 10+ minutes, hits API rate limits
Better Pattern : Check if data already exists in compiled tables
# DO: Look for existing compiled data first
# VGP table has multiple chromosome count columns:
# - num_chromosomes (column 54)
# - total_number_of_chromosomes (column 106)
# - num_chromosomes_haploid (column 122)
# Read from existing comprehensive table
with open('VGP-table.csv') as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
num_chr = row[53] if row[53] else row[105] # Fallback strategy
Results : Filled 392/417 missing values instantly vs 10+ minutes of API calls.
Fallback Strategy for Multiple Columns :
# Try multiple sources in order of preference
num_chromosomes = row[53] if (len(row) > 53 and row[53]) else ''
if not num_chromosomes and len(row) > 105:
num_chromosomes = row[105] # Alternative column
When to use NCBI API :
API Best Practices (when necessary):
time.sleep(0.5))GenomeArk uses public S3 buckets (no credentials needed):
s3://genomeark/species/{Species_name}/{ToLID}/assembly_vgp_{type}_2.0/
↑
HiC (most common), standard, hic (legacy)
Key Challenges :
hic vs HiC in pathsPath structure : {assembly}/evaluation/busco/{subdir}/short_summary*.txt
Strategy:

- Search busco/ and its subdirectories for short_summary*.txt

Expected coverage: ~20-30% of VGP assemblies have BUSCO data
Path patterns (try in order):

- {assembly}/evaluation/merqury/{ToLID}_qv/output_merqury.tabular
- {assembly}/evaluation/merqury/{subdir}/{ToLID}_qv/output_merqury.tabular

Nested subdirs: usually 1-2 character names (c, p, c1, p1)

Output format: tab-separated, QV is column 4 (index 3)
Three filename patterns to try :
filenames = [
f'{tolid}_genomescope__Summary.txt', # Double underscore
f'{tolid}_genomescope_Summary.txt', # Single underscore
f'{tolid}_Summary.txt', # No prefix
]
Validation : Skip heterozygosity if range > 50% or max > 95% (indicates failed run)
Always normalize paths:
def normalize_s3_path(s3_path):
    s3_path = s3_path.strip()
    s3_path = s3_path.replace('/assembly_vgp_hic_2.0/', '/assembly_vgp_HiC_2.0/')
    if not s3_path.endswith('/'):
        s3_path += '/'
    return s3_path
Public access (no credentials):
aws s3 ls s3://genomeark/... --no-sign-request
aws s3 cp s3://genomeark/.../file.txt - --no-sign-request
Timeouts: Use 10-30s timeouts for robustness
First Seen: Jan 24, 2026