tooluniverse-sequence-retrieval by mims-harvard/tooluniverse
npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-sequence-retrieval
Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.
IMPORTANT: Always use English terms in tool calls (gene names, organism names, sequence descriptions), even if the user writes in another language. Try original-language terms only as a fallback if the English query returns no results. Respond in the user's language.
Phase 0: Clarify (if needed)
↓
Phase 1: Disambiguate Gene/Organism
↓
Phase 2: Search & Retrieve (Internal)
↓
Phase 3: Report Sequence Profile
Ask the user ONLY if:
Skip clarification for:
```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Strategy depends on input type
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )
```
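The branch on input type above needs some way to tell an accession from a free-text request. The following is a minimal sketch of one way to do that, not part of the skill itself: `ACCESSION_RE` and `parse_user_input` are hypothetical names, and the pattern is a simplified assumption that covers RefSeq-style accessions (two letters, underscore, digits) and common GenBank/ENA-style accessions (one to three letters followed by five to eight digits), each with an optional version suffix.

```python
import re

# Simplified, assumed accession shapes (not an exhaustive specification):
#   RefSeq-style:      NC_045512.2, NM_007294.4, NZ_CP012345.1
#   GenBank/ENA-style: U00096.3, CP001509.3
ACCESSION_RE = re.compile(
    r"^(?:[A-Z]{2}_[A-Z]{0,6}\d+|[A-Z]{1,3}\d{5,8})(?:\.\d+)?$"
)


def parse_user_input(text):
    """Classify raw user input as a direct accession or a search request."""
    token = text.strip().upper()
    if ACCESSION_RE.match(token):
        return {"kind": "accession", "accession": token}
    # Anything else is treated as a gene/organism query that needs a search.
    return {"kind": "search", "query": text.strip()}
```

Gene symbols such as `BRCA1` do not match either shape, so they fall through to the search branch.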
CRITICAL: Accession prefix determines which tools to use.
| Prefix | Type | Use With |
|---|---|---|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| U*, M*, K*, X* | GenBank | NCBI or ENA |
| CP* | GenBank genome | NCBI or ENA |
| NZ_* | RefSeq genome (complete or WGS) | NCBI only |
| EMBL format | EMBL | ENA preferred |
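The prefix table can be reduced to a single structural check, sketched here under one assumption: RefSeq accessions are recognized by an underscore in the third position (`NC_`, `NM_`, `NP_`, `XM_`, and so on), and everything else is treated as a GenBank/EMBL accession. Note that `NZ_` is a RefSeq prefix, so this sketch treats it as NCBI-only. The function name `allowed_databases` is hypothetical.

```python
import re

# Two uppercase letters followed by an underscore marks a RefSeq accession.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_")


def allowed_databases(accession):
    """Return the databases that can serve this accession, preferred first."""
    acc = accession.strip().upper()
    if REFSEQ_RE.match(acc):
        # RefSeq records are not mirrored in ENA.
        return ("NCBI",)
    # GenBank and EMBL records are exchanged between NCBI and ENA.
    return ("NCBI", "ENA")
```

Checking the shape rather than a fixed list of prefixes means prefixes not listed in the table (for example `XP_` or `NG_`) are still routed to NCBI only.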
Retrieve silently. Do NOT narrate the search process.
```python
# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,        # Optional
    keywords=keywords,    # Optional
    seq_type=seq_type,    # complete_genome, mrna, refseq
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)

# Get sequence in desired format
# (`accession` is the entry selected from `accessions`)
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)
```
```python
# Only for non-RefSeq accessions!
REFSEQ_PREFIXES = (
    "NC_", "NG_", "NT_", "NW_", "NZ_", "NM_", "NR_",
    "XM_", "XR_", "NP_", "XP_", "YP_", "WP_",
)

if not accession.startswith(REFSEQ_PREFIXES):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)
    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)
```
| Primary | Fallback | Notes |
|---|---|---|
| NCBI_get_sequence | ENA tools (GenBank/EMBL accessions only) | NCBI unavailable |
| ena_get_entry | NCBI_get_sequence | ENA doesn't have RefSeq |
| NCBI_search_nucleotide | Try broader keywords | No results |
Critical Rule: Never use ENA tools with RefSeq accessions (NC_, NM_, etc.); they will return 404 errors.
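The fallback chain and the RefSeq rule can be combined in one routing function. This is a rough sketch, not the skill's actual implementation: `fetch_fasta` is a hypothetical helper, and `ncbi_fetch` and `ena_fetch` are assumed to be caller-supplied callables (for example thin wrappers around `NCBI_get_sequence` and `ena_get_sequence_fasta`) that raise an exception on failure.

```python
import re


def fetch_fasta(accession, ncbi_fetch, ena_fetch):
    """Try NCBI first; fall back to ENA only for non-RefSeq accessions.

    Returns a (source, payload) tuple so the report can name its database.
    """
    is_refseq = bool(re.match(r"^[A-Z]{2}_", accession.strip().upper()))
    try:
        return "NCBI", ncbi_fetch(accession)
    except Exception as err:  # broad on purpose: this is only a sketch
        if is_refseq:
            # ENA would answer 404 for RefSeq, so do not even try it.
            raise RuntimeError(
                f"NCBI failed for RefSeq accession {accession}; no ENA fallback"
            ) from err
        return "ENA", ena_fetch(accession)
```

Passing the fetchers in as arguments keeps the routing rule testable without network access.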
Present as a Sequence Profile Report. Hide search process.
# Sequence Profile: [Gene/Organism]
**Search Summary**
- Query: [gene] in [organism]
- Database: NCBI Nucleotide
- Results: [N] sequences found
---
## Primary Sequence
### [Accession]: [Definition/Title]
| Attribute | Value |
|-----------|-------|
| **Accession** | [accession] |
| **Type** | RefSeq / GenBank |
| **Organism** | [scientific name] |
| **Strain** | [strain if applicable] |
| **Length** | [X,XXX bp / aa] |
| **Molecule** | DNA / mRNA / Protein |
| **Topology** | Linear / Circular |
**Curation Level**: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party
### Sequence Statistics
| Statistic | Value |
|-----------|-------|
| **Length** | [X,XXX] bp |
| **GC Content** | [XX.X]% |
| **Genes** | [N] (if genome) |
| **CDS** | [N] (if annotated) |
### Sequence Preview
```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```
| Feature | Count | Examples |
|---|---|---|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |
Ranked by relevance and curation level:
| Accession | Type | Length | Description | ENA Compatible |
|---|---|---|---|---|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | ✗ |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | ✓ |
| CP001509.3 | GenBank | 4.6 Mb | E. coli DH10B | ✓ |
| Database | Accession | Link |
|---|---|---|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |
| Format | Description | Use Case |
|---|---|---|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |
```python
# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)
```
| Accession | Strain | Similarity | Notes |
|---|---|---|---|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |
| Protein Accession | Product Name | Length |
|---|---|---|
| [NP_*] | [protein name] | [X] aa |
Retrieved: [date] Database: NCBI Nucleotide
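The length, GC content, and truncated preview fields in the report template can be derived directly from FASTA text. The sketch below assumes a single-record nucleotide FASTA string; `fasta_profile` and its `preview_lines` and `width` parameters are hypothetical names chosen for illustration, and GC content is not meaningful for protein sequences.

```python
def fasta_profile(fasta_text, preview_lines=2, width=60):
    """Summarize a single-record nucleotide FASTA string for the report."""
    lines = [ln.strip() for ln in fasta_text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected FASTA text starting with a '>' header")

    # Header convention: first token is the accession, the rest the definition.
    accession, _, definition = lines[0][1:].partition(" ")
    seq = "".join(lines[1:]).upper()

    gc_count = seq.count("G") + seq.count("C")
    gc_percent = round(100.0 * gc_count / len(seq), 1) if seq else 0.0

    shown = preview_lines * width
    preview = [seq[i:i + width] for i in range(0, min(len(seq), shown), width)]

    return {
        "accession": accession,
        "definition": definition,
        "length": len(seq),
        "gc_percent": gc_percent,
        "preview": preview,
        "truncated": len(seq) > shown,
    }
```

For very large records, the preview keeps the report short while the `truncated` flag signals that the full sequence should be offered as a download instead.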
---
## Curation Level Tiers
| Tier | Symbol | Accession Prefix | Description |
|------|--------|------------------|-------------|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted |
| GenBank Validated | ●●○○ | Various | Submitted, some curation |
| GenBank Direct | ●○○○ | Various | Direct submission |
| Third Party | ○○○○ | TPA_ | Third-party annotation |
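The tier table can also drive the ordering of alternative sequences. The sketch below is an illustration under stated assumptions: only the two RefSeq tiers are identifiable from the prefix alone, so every other accession is lumped into a single "GenBank / other" bucket with no symbol, since the table lists those prefixes as "Various". `curation_tier` and `rank_by_curation` are hypothetical helper names.

```python
# (tier name, symbol, prefixes) taken from the tier table, best first.
PREFIX_TIERS = [
    ("RefSeq Reference", "●●●●", ("NC_", "NM_", "NP_")),
    ("RefSeq Predicted", "●●●○", ("XM_", "XP_", "XR_")),
]


def curation_tier(accession):
    """Return (rank, tier name, symbol); a lower rank means better curated."""
    acc = accession.strip().upper()
    for rank, (name, symbol, prefixes) in enumerate(PREFIX_TIERS):
        if acc.startswith(prefixes):
            return rank, name, symbol
    # The prefix alone cannot separate validated, direct, or third-party records.
    return len(PREFIX_TIERS), "GenBank / other", None


def rank_by_curation(accessions):
    """Stable sort: keeps the incoming relevance order within each tier."""
    return sorted(accessions, key=lambda acc: curation_tier(acc)[0])
```

Because Python's sort is stable, results that arrive in relevance order stay in that order inside each curation tier.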
Include in report:
```markdown
**Curation Level**: ●●●● RefSeq Reference
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use
```
Every sequence report MUST include:
User: "Get E. coli K-12 complete genome"
```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)
# Return NC_000913.3 (RefSeq reference)
```
User: "Find human BRCA1 mRNA"
```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)
```
User: "Get sequence for NC_045512.2" → Direct retrieval with full metadata
User: "Compare E. coli K-12 and O157:H7 genomes" → Search both strains, provide comparison table
| Error | Response |
|---|---|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |
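For the persistent rate-limit case, "wait briefly" can be made concrete with exponential backoff. This is a generic sketch and not how the tools retry internally: `RateLimitError` is a hypothetical exception standing in for whatever the caller uses to signal a rate limit, and `call_with_retry` is an illustrative helper.

```python
import time


class RateLimitError(Exception):
    """Hypothetical marker for a rate-limit response from an upstream API."""


def call_with_retry(func, *args, retries=3, base_delay=1.0,
                    sleep=time.sleep, **kwargs):
    """Call func, waiting base_delay * 2**attempt seconds between attempts."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except RateLimitError:
            if attempt == retries:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.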
NCBI Tools (All Accessions)
| Tool | Purpose |
|---|---|
| NCBI_search_nucleotide | Search by gene/organism |
| NCBI_fetch_accessions | Convert UIDs to accessions |
| NCBI_get_sequence | Retrieve sequence data |
ENA Tools (GenBank/EMBL Only)
| Tool | Purpose |
|---|---|
| ena_get_entry | Entry metadata |
| ena_get_sequence_fasta | FASTA sequence |
| ena_get_entry_summary | Summary info |
NCBI_search_nucleotide
| Parameter | Description | Example |
|---|---|---|
| operation | Always "search" | "search" |
| organism | Scientific name | "Homo sapiens" |
| gene | Gene symbol | "BRCA1" |
| strain | Specific strain | "K-12" |
| keywords | Free text | "complete genome" |
| seq_type | Sequence type | "complete_genome", "mrna", "refseq" |
| limit | Max results | 10 |
NCBI_get_sequence
| Parameter | Description | Example |
|---|---|---|
| operation | Always "fetch_sequence" | "fetch_sequence" |
| accession | Accession number | "NC_000913.3" |
| format | Output format | "fasta", "genbank" |
Weekly installs: 1.3K
GitHub stars: 1.2K
First seen: Feb 4, 2026
Security audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: codex 1.1K, opencode 1.1K, gemini-cli 1.1K, github-copilot 1.1K, amp 1.1K, kimi-cli 1.1K