pdf-text-extractor by willoscar/research-units-pipeline-skills
npx skills add https://github.com/willoscar/research-units-pipeline-skills --skill pdf-text-extractor
Optionally collect full-text snippets to deepen evidence beyond abstracts.
This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.
Inputs:
- papers/core_set.csv (expects paper_id, title, and ideally pdf_url/arxiv_id/url)
- outline/mapping.tsv (to prioritize mapped papers)

Outputs:
- papers/fulltext_index.jsonl (one record per attempted paper)
- papers/pdfs/<paper_id>.pdf (cached downloads)
- papers/fulltext/<paper_id>.txt (extracted text)

queries.md can set evidence_mode: "abstract" | "fulltext":
- abstract (default template): do not download PDFs; write an index that clearly records skipping.
- fulltext: download PDFs (when possible) and extract text to papers/fulltext/.

When you cannot or should not download PDFs (restricted network, rate limits, no permission), provide PDFs manually and run in "local PDFs only" mode.
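A minimal queries.md fragment for fulltext mode might look like this (the `- key: "value"` list form follows the `- evidence_mode: "fulltext"` style used in this skill's docs; the exact syntax the script parses is an assumption, and the numeric values simply mirror the CLI example below):

```markdown
- evidence_mode: "fulltext"
- fulltext_max_papers: 20
- fulltext_max_pages: 4
- fulltext_min_chars: 1200
```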
Local-PDFs-only mode:
1. Put PDFs at papers/pdfs/<paper_id>.pdf, where <paper_id> matches papers/core_set.csv.
2. Set - evidence_mode: "fulltext" in queries.md.
3. Run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only

If PDFs are missing, the script writes a to-do list:
- output/MISSING_PDFS.md (human-readable summary)
- papers/missing_pdfs.csv (machine-readable list)

What the script does:
1. Reads papers/core_set.csv; if outline/mapping.tsv exists, prioritizes mapped papers first.
2. Resolves a pdf_url (uses pdf_url, else derives one from arxiv_id/url when possible).
3. Downloads papers/pdfs/<paper_id>.pdf if missing.
4. Extracts text to papers/fulltext/<paper_id>.txt.
5. Appends/updates a record in papers/fulltext_index.jsonl with status + stats (delete a .txt file to re-extract).

Quality gates:
- papers/fulltext_index.jsonl exists and is non-empty.
- With evidence_mode: "fulltext": at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
- With evidence_mode: "abstract": the index records clearly reflect skip status (no downloads attempted).

CLI:
python .codex/skills/pdf-text-extractor/scripts/run.py --help
python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>

Flags:
- --max-papers <n>: cap the number of papers processed (can be overridden by queries.md)
- --max-pages <n>: extract at most N pages per PDF
- --min-chars <n>: minimum extracted characters to count as OK
- --sleep <sec>: delay between downloads
- --local-pdfs-only: do not download; only use papers/pdfs/<paper_id>.pdf if present

queries.md supports: evidence_mode, fulltext_max_papers, fulltext_max_pages, fulltext_min_chars.

Typical workflows:
- Abstract-only: set - evidence_mode: "abstract" in queries.md, then run the script (it will emit papers/fulltext_index.jsonl with skip statuses).
- Local PDFs: set - evidence_mode: "fulltext" in queries.md, put PDFs under papers/pdfs/, then run:
  python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only
- Bounded download + extraction:
  python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200

PDFs are cached under papers/pdfs/; extracted text is cached under papers/fulltext/.
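The URL-resolution step described above (prefer an explicit pdf_url, else derive one from arxiv_id/url when possible) might look like the following sketch. This is a hypothetical reconstruction, not the script's actual code; `resolve_pdf_url` and the exact fallback rules are assumptions, though the CSV column names come from papers/core_set.csv as documented.

```python
from typing import Optional

def resolve_pdf_url(row: dict) -> Optional[str]:
    """Prefer an explicit pdf_url; otherwise try to derive one from arXiv fields."""
    if row.get("pdf_url"):
        return row["pdf_url"]
    if row.get("arxiv_id"):
        # arXiv serves PDFs at /pdf/<id>
        return f"https://arxiv.org/pdf/{row['arxiv_id']}"
    url = row.get("url", "")
    if "arxiv.org/abs/" in url:
        # Rewrite an abstract-page URL into the corresponding PDF URL
        return url.replace("/abs/", "/pdf/")
    return None  # no usable source; the paper lands on the missing-PDFs list
```

Papers for which this returns None would end up in output/MISSING_PDFS.md and papers/missing_pdfs.csv.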
Troubleshooting:
- Downloads blocked or failing? Fix: use evidence_mode: abstract (the default), or provide local PDFs under papers/pdfs/ and rerun with --local-pdfs-only.
- Extraction coverage too low? Fix: fall back to the abstract evidence level and avoid strong fulltext claims.
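The near-zero-coverage situation can be checked by hand from papers/fulltext_index.jsonl. A minimal sketch, assuming each JSONL record carries a "status" field ("ok" on success) and a "chars" count; the index's real field names may differ:

```python
import json
from pathlib import Path

def extraction_coverage(index_path: str, min_chars: int = 1200) -> float:
    """Fraction of attempted papers whose extraction met the min-chars bar.

    Assumes one JSON object per line with "status" ("ok" on success) and
    "chars" (extracted character count); the actual schema may differ.
    """
    lines = [ln for ln in Path(index_path).read_text().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    records = [json.loads(ln) for ln in lines]
    ok = sum(1 for r in records
             if r.get("status") == "ok" and r.get("chars", 0) >= min_chars)
    return ok / len(records)
```

If this ratio is near zero in fulltext mode, downgrading to the abstract evidence level is the safer call.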
Weekly Installs: 91
Repository: https://github.com/willoscar/research-units-pipeline-skills
GitHub Stars: 375
First Seen: Jan 23, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: gemini-cli (80), codex (78), opencode (76), cursor (74), github-copilot (70), amp (66)