pdf-text-extractor by willoscar/research-units-pipeline-skills
npx skills add https://github.com/willoscar/research-units-pipeline-skills --skill pdf-text-extractor
Optionally collect full-text snippets to deepen evidence beyond abstracts.
This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.
Inputs:
- papers/core_set.csv (expects paper_id, title, and ideally pdf_url/arxiv_id/url)
- outline/mapping.tsv (to prioritize mapped papers)

Outputs:
- papers/fulltext_index.jsonl (one record per attempted paper)
- papers/pdfs/<paper_id>.pdf (cached downloads)
- papers/fulltext/<paper_id>.txt (extracted text)

queries.md can set evidence_mode: "abstract" | "fulltext":
- abstract (default template): do not download PDFs; write an index that clearly records skipping.
- fulltext: download PDFs (when possible) and extract text to papers/fulltext/.

When you cannot or should not download PDFs (restricted network, rate limits, no permission), provide PDFs manually and run in "local PDFs only" mode.
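A minimal queries.md fragment for fulltext mode might look like this (the `- key: "value"` list form follows the `- evidence_mode: "fulltext"` style used in this skill's docs; the exact syntax the script parses is an assumption, and the numeric values simply mirror the CLI example below):

```markdown
- evidence_mode: "fulltext"
- fulltext_max_papers: 20
- fulltext_max_pages: 4
- fulltext_min_chars: 1200
```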
Local-PDFs-only mode:
1. Put PDFs at papers/pdfs/<paper_id>.pdf, where <paper_id> matches papers/core_set.csv.
2. Set - evidence_mode: "fulltext" in queries.md.
3. Run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only

If PDFs are missing, the script writes a to-do list:
- output/MISSING_PDFS.md (human-readable summary)
- papers/missing_pdfs.csv (machine-readable list)

What the script does:
1. Reads papers/core_set.csv; if outline/mapping.tsv exists, prioritizes mapped papers first.
2. Resolves a pdf_url (uses pdf_url, else derives one from arxiv_id/url when possible).
3. Downloads papers/pdfs/<paper_id>.pdf if missing.
4. Extracts text to papers/fulltext/<paper_id>.txt.
5. Appends/updates a record in papers/fulltext_index.jsonl with status + stats (delete a .txt file to re-extract).

Quality gates:
- papers/fulltext_index.jsonl exists and is non-empty.
- With evidence_mode: "fulltext": at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
- With evidence_mode: "abstract": the index records clearly reflect skip status (no downloads attempted).

CLI:
python .codex/skills/pdf-text-extractor/scripts/run.py --help
python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>

Flags:
- --max-papers <n>: cap the number of papers processed (can be overridden by queries.md)
- --max-pages <n>: extract at most N pages per PDF
- --min-chars <n>: minimum extracted characters to count as OK
- --sleep <sec>: delay between downloads
- --local-pdfs-only: do not download; only use papers/pdfs/<paper_id>.pdf if present

queries.md supports: evidence_mode, fulltext_max_papers, fulltext_max_pages, fulltext_min_chars.

Typical workflows:
- Abstract-only: set - evidence_mode: "abstract" in queries.md, then run the script (it will emit papers/fulltext_index.jsonl with skip statuses).
- Local PDFs: set - evidence_mode: "fulltext" in queries.md, put PDFs under papers/pdfs/, then run:
  python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only
- Bounded download + extraction:
  python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200

PDFs are cached under papers/pdfs/; extracted text is cached under papers/fulltext/.
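The URL-resolution step described above (prefer an explicit pdf_url, else derive one from arxiv_id/url when possible) might look like the following sketch. This is a hypothetical reconstruction, not the script's actual code; `resolve_pdf_url` and the exact fallback rules are assumptions, though the CSV column names come from papers/core_set.csv as documented.

```python
from typing import Optional

def resolve_pdf_url(row: dict) -> Optional[str]:
    """Prefer an explicit pdf_url; otherwise try to derive one from arXiv fields."""
    if row.get("pdf_url"):
        return row["pdf_url"]
    if row.get("arxiv_id"):
        # arXiv serves PDFs at /pdf/<id>
        return f"https://arxiv.org/pdf/{row['arxiv_id']}"
    url = row.get("url", "")
    if "arxiv.org/abs/" in url:
        # Rewrite an abstract-page URL into the corresponding PDF URL
        return url.replace("/abs/", "/pdf/")
    return None  # no usable source; the paper lands on the missing-PDFs list
```

Papers for which this returns None would end up in output/MISSING_PDFS.md and papers/missing_pdfs.csv.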
Troubleshooting:
- Downloads blocked or failing? Fix: use evidence_mode: abstract (the default), or provide local PDFs under papers/pdfs/ and rerun with --local-pdfs-only.
- Extraction coverage too low? Fix: fall back to the abstract evidence level and avoid strong fulltext claims.
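The near-zero-coverage situation can be checked by hand from papers/fulltext_index.jsonl. A minimal sketch, assuming each JSONL record carries a "status" field ("ok" on success) and a "chars" count; the index's real field names may differ:

```python
import json
from pathlib import Path

def extraction_coverage(index_path: str, min_chars: int = 1200) -> float:
    """Fraction of attempted papers whose extraction met the min-chars bar.

    Assumes one JSON object per line with "status" ("ok" on success) and
    "chars" (extracted character count); the actual schema may differ.
    """
    lines = [ln for ln in Path(index_path).read_text().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    records = [json.loads(ln) for ln in lines]
    ok = sum(1 for r in records
             if r.get("status") == "ok" and r.get("chars", 0) >= min_chars)
    return ok / len(records)
```

If this ratio is near zero in fulltext mode, downgrading to the abstract evidence level is the safer call.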
Weekly Installs: 91
Repository: https://github.com/willoscar/research-units-pipeline-skills
GitHub Stars: 375
First Seen: Jan 23, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: gemini-cli (80), codex (78), opencode (76), cursor (74), github-copilot (70), amp (66)