biorxiv-database by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill biorxiv-database
This skill provides efficient Python-based tools for searching and retrieving preprints from the bioRxiv database. It enables comprehensive searches by keywords, authors, date ranges, and categories, returning structured JSON metadata that includes titles, abstracts, DOIs, and citation information. The skill also supports PDF downloads for full-text analysis.
Use this skill when you need to search bioRxiv preprints by keyword, author, date range, or DOI, or when you need to download preprint PDFs for full-text analysis.
Search for preprints containing specific keywords in titles, abstracts, or author lists.
Basic Usage:
python scripts/biorxiv_search.py \
--keywords "CRISPR" "gene editing" \
--start-date 2024-01-01 \
--end-date 2024-12-31 \
--output results.json
With Category Filter:
python scripts/biorxiv_search.py \
--keywords "neural networks" "deep learning" \
--days-back 180 \
--category neuroscience \
--output recent_neuroscience.json
Search Fields: By default, keywords are searched in both title and abstract. Customize with --search-fields:
python scripts/biorxiv_search.py \
--keywords "AlphaFold" \
--search-fields title \
--days-back 365
Find all papers by a specific author within a date range.
Basic Usage:
python scripts/biorxiv_search.py \
--author "Smith" \
--start-date 2023-01-01 \
--end-date 2024-12-31 \
--output smith_papers.json
Recent Publications:
# Last year by default if no dates specified
python scripts/biorxiv_search.py \
--author "Johnson" \
--output johnson_recent.json
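The one-year default window can be reproduced explicitly when you want deterministic queries; a minimal sketch of the computation (illustrative only, not the script's internals):

```python
from datetime import date, timedelta

# Compute an explicit one-year window matching the documented default.
end_date = date.today()
start_date = end_date - timedelta(days=365)

# Pass these to the CLI as --start-date / --end-date strings.
print(start_date.isoformat(), end_date.isoformat())
```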
Retrieve all preprints posted within a specific date range.
Basic Usage:
python scripts/biorxiv_search.py \
--start-date 2024-01-01 \
--end-date 2024-01-31 \
--output january_2024.json
With Category Filter:
python scripts/biorxiv_search.py \
--start-date 2024-06-01 \
--end-date 2024-06-30 \
--category genomics \
--output genomics_june.json
Days Back Shortcut:
# Last 30 days
python scripts/biorxiv_search.py \
--days-back 30 \
--output last_month.json
Retrieve detailed metadata for a specific preprint.
Basic Usage:
python scripts/biorxiv_search.py \
--doi "10.1101/2024.01.15.123456" \
--output paper_details.json
Full DOI URLs Accepted:
python scripts/biorxiv_search.py \
--doi "https://doi.org/10.1101/2024.01.15.123456"
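Both forms resolve to the same identifier; a small helper (hypothetical, not part of the script) illustrates the normalization presumably performed:

```python
def normalize_doi(doi: str) -> str:
    """Strip a doi.org URL prefix, leaving the bare DOI (hypothetical helper)."""
    for prefix in ("https://doi.org/", "http://doi.org/", "doi.org/"):
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi

print(normalize_doi("https://doi.org/10.1101/2024.01.15.123456"))
# → 10.1101/2024.01.15.123456
```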
Download the full-text PDF of any preprint.
Basic Usage:
python scripts/biorxiv_search.py \
--doi "10.1101/2024.01.15.123456" \
--download-pdf paper.pdf
Batch Processing: For multiple PDFs, extract DOIs from a search result JSON and download each paper:
import json
from biorxiv_search import BioRxivSearcher

# Load search results
with open('results.json') as f:
    data = json.load(f)

searcher = BioRxivSearcher(verbose=True)

# Download each paper
for i, paper in enumerate(data['results'][:10]):  # First 10 papers
    doi = paper['doi']
    searcher.download_pdf(doi, f"papers/paper_{i+1}.pdf")
Filter searches by bioRxiv subject categories:
animal-behavior-and-cognition, biochemistry, bioengineering, bioinformatics, biophysics, cancer-biology, cell-biology, clinical-trials, developmental-biology, ecology, epidemiology, evolutionary-biology, genetics, genomics, immunology, microbiology, molecular-biology, neuroscience, paleontology, pathology, pharmacology-and-toxicology, physiology, plant-biology, scientific-communication-and-education, synthetic-biology, systems-biology, zoology
All searches return structured JSON with the following format:
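Category slugs must match exactly; a quick client-side check (the set below is transcribed from the list above) can catch typos before a request is made:

```python
# Transcribed from the category list in this document.
BIORXIV_CATEGORIES = {
    "animal-behavior-and-cognition", "biochemistry", "bioengineering",
    "bioinformatics", "biophysics", "cancer-biology", "cell-biology",
    "clinical-trials", "developmental-biology", "ecology", "epidemiology",
    "evolutionary-biology", "genetics", "genomics", "immunology",
    "microbiology", "molecular-biology", "neuroscience", "paleontology",
    "pathology", "pharmacology-and-toxicology", "physiology",
    "plant-biology", "scientific-communication-and-education",
    "synthetic-biology", "systems-biology", "zoology",
}

def check_category(slug: str) -> str:
    """Raise early on a misspelled category slug (illustrative helper)."""
    if slug not in BIORXIV_CATEGORIES:
        raise ValueError(f"Unknown bioRxiv category: {slug!r}")
    return slug

check_category("genomics")  # passes silently
```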
{
"query": {
"keywords": ["CRISPR"],
"start_date": "2024-01-01",
"end_date": "2024-12-31",
"category": "genomics"
},
"result_count": 42,
"results": [
{
"doi": "10.1101/2024.01.15.123456",
"title": "Paper Title Here",
"authors": "Smith J, Doe J, Johnson A",
"author_corresponding": "Smith J",
"author_corresponding_institution": "University Example",
"date": "2024-01-15",
"version": "1",
"type": "new results",
"license": "cc_by",
"category": "genomics",
"abstract": "Full abstract text...",
"pdf_url": "https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1.full.pdf",
"html_url": "https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1",
"jatsxml": "https://www.biorxiv.org/content/...",
"published": ""
}
]
}
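Consuming this schema is straightforward; for example, building a compact citation line per result (field names as in the sample above, data inlined for illustration):

```python
import json

# A minimal stand-in for a saved results file, using the documented schema.
sample = """{
  "result_count": 1,
  "results": [{
    "doi": "10.1101/2024.01.15.123456",
    "title": "Paper Title Here",
    "authors": "Smith J, Doe J, Johnson A",
    "date": "2024-01-15"
  }]
}"""

data = json.loads(sample)
for paper in data["results"]:
    # "authors" is a single comma-separated string, so split for the first author.
    first_author = paper["authors"].split(",")[0].strip()
    print(f"{first_author} et al. ({paper['date']}). {paper['title']}. doi:{paper['doi']}")
```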
1. Search for relevant papers:
python scripts/biorxiv_search.py \
--keywords "organoids" "tissue engineering" \
--start-date 2023-01-01 \
--end-date 2024-12-31 \
--category bioengineering \
--output organoid_papers.json
2. Extract and review results:
import json

with open('organoid_papers.json') as f:
    data = json.load(f)

print(f"Found {data['result_count']} papers")
for paper in data['results'][:5]:
    print(f"\nTitle: {paper['title']}")
    print(f"Authors: {paper['authors']}")
    print(f"Date: {paper['date']}")
    print(f"DOI: {paper['doi']}")
3. Download selected papers:
from biorxiv_search import BioRxivSearcher

searcher = BioRxivSearcher()
selected_dois = ["10.1101/2024.01.15.123456", "10.1101/2024.02.20.789012"]
for doi in selected_dois:
    filename = doi.replace("/", "_").replace(".", "_") + ".pdf"
    searcher.download_pdf(doi, f"papers/{filename}")
Track research trends by analyzing publication frequencies over time:
python scripts/biorxiv_search.py \
--keywords "machine learning" \
--start-date 2020-01-01 \
--end-date 2024-12-31 \
--category bioinformatics \
--output ml_trends.json
Then analyze the temporal distribution in the results.
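A sketch of that temporal analysis: group results by year-month and count, assuming the date field is YYYY-MM-DD as in the schema above (data inlined here; in practice load ml_trends.json):

```python
import json
from collections import Counter

# Illustrative results; replace with json.load(open('ml_trends.json')).
data = {"results": [
    {"date": "2023-04-02"}, {"date": "2023-04-19"}, {"date": "2024-01-05"},
]}

# Slice YYYY-MM from each date and tally papers per month.
monthly = Counter(paper["date"][:7] for paper in data["results"])
for month, count in sorted(monthly.items()):
    print(month, count)
# → 2023-04 2
#   2024-01 1
```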
Monitor specific researchers' preprints:
# Track multiple authors
for author in Smith Johnson Williams; do
  python scripts/biorxiv_search.py \
    --author "$author" \
    --days-back 365 \
    --output "${author}_papers.json"
done
For more complex workflows, import and use the BioRxivSearcher class directly:
from scripts.biorxiv_search import BioRxivSearcher

# Initialize
searcher = BioRxivSearcher(verbose=True)

# Multiple search operations
keywords_papers = searcher.search_by_keywords(
    keywords=["CRISPR", "gene editing"],
    start_date="2024-01-01",
    end_date="2024-12-31",
    category="genomics"
)
author_papers = searcher.search_by_author(
    author_name="Smith",
    start_date="2023-01-01",
    end_date="2024-12-31"
)

# Get specific paper details
paper = searcher.get_paper_details("10.1101/2024.01.15.123456")

# Download PDF
success = searcher.download_pdf(
    doi="10.1101/2024.01.15.123456",
    output_path="paper.pdf"
)

# Format results consistently
formatted = searcher.format_result(paper, include_abstract=True)
Use appropriate date ranges: Smaller date ranges return faster. For keyword searches over long periods, consider splitting into multiple queries.
Filter by category: When possible, use --category to reduce data transfer and improve search precision.
Respect rate limits: The script includes automatic delays (0.5s between requests). For large-scale data collection, add additional delays.
Cache results: Save search results to JSON files to avoid repeated API calls.
Version tracking: Preprints can have multiple versions. The version field indicates which version is returned. PDF URLs include the version number.
Handle errors gracefully: Check the result_count in the output JSON. Empty results may indicate date-range issues or API connectivity problems.
Verbose mode for debugging: Use the --verbose flag to see detailed logging of API requests and responses.
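The version-tracking tip can be made concrete: the pdf_url in the schema above suggests the link can be rebuilt from a paper's doi and version fields (a sketch, assuming bioRxiv keeps that URL layout):

```python
def versioned_pdf_url(doi: str, version: str) -> str:
    """Rebuild the full-text PDF URL from the doi and version fields.

    Assumes the URL pattern shown in the sample schema above.
    """
    return f"https://www.biorxiv.org/content/{doi}v{version}.full.pdf"

print(versioned_pdf_url("10.1101/2024.01.15.123456", "1"))
# → https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1.full.pdf
```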
from datetime import datetime, timedelta
# Last quarter
end_date = datetime.now()
start_date = end_date - timedelta(days=90)
python scripts/biorxiv_search.py \
--start-date {start_date.strftime('%Y-%m-%d')} \
--end-date {end_date.strftime('%Y-%m-%d')}
Limit the number of results returned:
python scripts/biorxiv_search.py \
--keywords "COVID-19" \
--days-back 30 \
--limit 50 \
--output covid_top50.json
When only metadata is needed:
# Note: abstract inclusion is controlled via the Python API (include_abstract flag)
from scripts.biorxiv_search import BioRxivSearcher
searcher = BioRxivSearcher()
papers = searcher.search_by_keywords(keywords=["AI"], days_back=30)
formatted = [searcher.format_result(p, include_abstract=False) for p in papers]
Integrate search results into downstream analysis pipelines:
import json
import pandas as pd

# Load results
with open('results.json') as f:
    data = json.load(f)

# Convert to DataFrame for analysis
df = pd.DataFrame(data['results'])

# Analyze
print(f"Total papers: {len(df)}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print("\nTop authors by paper count:")
print(df['authors'].str.split(',').explode().str.strip().value_counts().head(10))

# Filter and export
recent = df[df['date'] >= '2024-06-01']
recent.to_csv('recent_papers.csv', index=False)
To verify that the bioRxiv database skill is working correctly, run the comprehensive test suite.
Prerequisites:
uv pip install requests
Run tests:
python tests/test_biorxiv_search.py
The test suite validates initialization, date-range search, category filtering, keyword search, DOI lookup, result formatting, and interval search.
Expected Output:
🧬 bioRxiv Database Search Skill Test Suite
======================================================================
🧪 Test 1: Initialization
✅ BioRxivSearcher initialized successfully
🧪 Test 2: Date Range Search
✅ Found 150 papers between 2024-01-01 and 2024-01-07
First paper: Novel CRISPR-based approach for genome editing...
[... additional tests ...]
======================================================================
📊 Test Summary
======================================================================
✅ PASS: Initialization
✅ PASS: Date Range Search
✅ PASS: Category Filtering
✅ PASS: Keyword Search
✅ PASS: DOI Lookup
✅ PASS: Result Formatting
✅ PASS: Interval Search
======================================================================
Results: 7/7 tests passed (100%)
======================================================================
🎉 All tests passed! The bioRxiv database skill is working correctly.
Note: Some tests may show warnings if no papers are found in specific date ranges or categories. This is normal and does not indicate a failure.
For detailed API specifications, endpoint documentation, and response schemas, refer to:
references/api_reference.md - Complete bioRxiv API documentation.
Weekly Installs: 163
Repository: davila7/claude-code-templates
GitHub Stars: 23.5K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: claude-code (136), opencode (133), cursor (130), gemini-cli (127), codex (119), antigravity (112)