PDF Document Processing Skill: 2026 Best Practices for Creation, Extraction, Merging, Splitting, OCR, and Form Automation

document-pdf by vasilyu1983/ai-agents-public

167 weekly installs

46 GitHub Stars

Install command

npx skills add https://github.com/vasilyu1983/ai-agents-public --skill document-pdf

Document PDF Skill — Quick Reference

This skill enables PDF creation, extraction, manipulation, and analysis. Claude should apply these patterns when users need to generate invoices, reports, extract data from PDFs, merge documents, or work with PDF forms.

Modern Best Practices (Jan 2026):

PDF is a release artifact, not the editable source of truth.
Validate export fidelity (fonts, images, links) and accessibility where required.
Accessibility: if compliance matters, target a tagged/structured PDF workflow (often PDF/UA-aligned) and validate with tooling.
EU distribution: the European Accessibility Act (EAA, applicable since June 28, 2025) typically implies EN 301 549 expectations for customer-facing PDFs.
Treat PDFs as sensitive: scrub metadata, ensure real redaction, and control distribution.

Core Decision Rules (2026)

First decide: born-digital PDF (selectable text) vs scanned PDF (images). Scanned PDFs usually require OCR; see references/pdf-extraction-patterns.md.
If the user needs accessibility/compliance, prefer generating from a source format that supports structure (DOCX/HTML + proper export) rather than “post-fixing” an untagged PDF.
For deterministic ops (merge/split/rotate/scrub), prefer scripts/ helpers over re-implementing ad hoc.
Never treat black rectangles or overlays as redaction; use real redaction and verify by copy/paste + search.
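
The first rule above (born-digital vs scanned) can be approximated by counting how much text each page yields. The following is a minimal sketch, not part of the skill: `classify_pdf`, `count_chars`, and the `min_chars` threshold are illustrative assumptions, and `count_chars` assumes pdfplumber is installed.

```python
from typing import List


def classify_pdf(chars_per_page: List[int], min_chars: int = 20) -> str:
    """Classify a PDF from the number of extractable characters per page.

    min_chars is an illustrative threshold (an assumption, not a value
    taken from the skill); tune it for your documents.
    """
    if not chars_per_page:
        return "empty"
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    if text_pages == len(chars_per_page):
        return "born-digital"  # every page has selectable text
    if text_pages == 0:
        return "scanned"       # no usable text layer: OCR is likely needed
    return "mixed"             # some pages need OCR, others do not


def count_chars(path: str) -> List[int]:
    """Count extractable characters on each page (requires pdfplumber)."""
    import pdfplumber  # imported lazily so classify_pdf works without it

    with pdfplumber.open(path) as pdf:
        return [len((page.extract_text() or "").strip()) for page in pdf.pages]
```

A call such as `classify_pdf(count_chars("in.pdf"))` would then pick the extraction path; a "mixed" result suggests running OCR only on the pages that came back empty.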

Quick Reference

Task	Tool/Library	Language	When to Use
Create PDF	pdfkit	Node.js	Reports, invoices, certificates
Create PDF	ReportLab	Python	Complex layouts, tables
Create PDF	FPDF2	Python	Simple PDFs with Unicode support
Create PDF	Borb	Python	Interactive elements, pure Python
Edit PDF	pdf-lib	Node.js	Modify existing PDFs, add pages
Extract text	pdfplumber	Python	OCR-free text extraction
OCR scanned PDF	PyMuPDF + Tesseract	Python	Scanned PDFs (no selectable text)
Extract tables	Camelot	Python	Tables with borders (Lattice mode)
Extract tables	Camelot/Tabula

When to Use This Skill

Claude should invoke this skill when a user requests:

Generate PDFs from data (invoices, reports, certificates)
Extract text or tables from existing PDFs
Merge multiple PDFs into one document
Split PDFs into separate files
Fill PDF forms programmatically
Add watermarks, headers, footers
Convert HTML/web pages to PDF

Default Workflow

Create: pick pdfkit (Node) or ReportLab (Python) and start from assets/invoice-template.md or assets/report-template.md; for advanced layouts use references/pdf-generation-patterns.md.
Extract: use references/pdf-extraction-patterns.md (text/tables/images/metadata + OCR fallback).
Ship: run assets/pdf-release-checklist.md (fidelity, links, accessibility baseline, privacy).

Scripts (Deterministic Operations)

Scripts are optional helpers; they assume Python 3 plus the listed dependencies in each file.

Merge: python3 scripts/merge_pdfs.py merged.pdf a.pdf b.pdf
Split: python3 scripts/split_pdf.py in.pdf out_dir --each-page
Rotate: python3 scripts/rotate_pdf.py in.pdf out.pdf --degrees 90
Scrub metadata: python3 scripts/scrub_metadata.py in.pdf out.pdf
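
The scripts themselves are not reproduced on this page, so the following is only a sketch of how a merge helper with the same command line could be written. It assumes pypdf is installed; `parse_args`, `merge`, and `main` are illustrative names and not the contents of scripts/merge_pdfs.py.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `OUTPUT INPUT [INPUT ...]`, mirroring the merge usage line."""
    parser = argparse.ArgumentParser(description="Merge PDFs in the order given.")
    parser.add_argument("output", help="path of the merged PDF to write")
    parser.add_argument("inputs", nargs="+", help="PDFs to append, in order")
    return parser.parse_args(argv)


def merge(output: str, inputs: List[str]) -> None:
    """Append every input PDF to one writer and save it (requires pypdf)."""
    from pypdf import PdfWriter  # lazy import: only needed when merging

    writer = PdfWriter()
    for path in inputs:
        writer.append(path)  # appends all pages of the file
    with open(output, "wb") as handle:
        writer.write(handle)
    writer.close()


def main(argv: Optional[List[str]] = None) -> None:
    """Entry point: parse the arguments, then merge."""
    args = parse_args(argv)
    merge(args.output, args.inputs)
```

Calling `main(["merged.pdf", "a.pdf", "b.pdf"])` corresponds to the merge command shown above. Keeping argument parsing separate from the merge step makes the deterministic part easy to test without touching any files.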

PDF Structure Patterns

Invoice Template

INVOICE STRUCTURE
├── Header (logo, company info, invoice #)
├── Bill To / Ship To blocks
├── Line items table
│   ├── Description | Qty | Unit Price | Total
│   └── Subtotal, Tax, Total
├── Payment terms
└── Footer (contact, thank you)
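
The line items block implies some arithmetic (Qty × Unit Price per line, then Subtotal, Tax, Total). A minimal sketch of that calculation follows; `invoice_totals` is a hypothetical helper, and both the half-up rounding and the flat tax applied to the subtotal are assumptions, since real tax rules vary by jurisdiction.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict, List, Tuple

CENT = Decimal("0.01")


def invoice_totals(items: List[Tuple[str, int, str]], tax_rate: str) -> Dict[str, Decimal]:
    """Compute subtotal, tax, and total for an invoice.

    items: (description, quantity, unit price as a decimal string).
    tax_rate: fraction as a decimal string, e.g. "0.20" for 20%.
    """
    # Round each line total to cents before summing, so the printed
    # line totals always add up to the printed subtotal.
    line_totals = [
        (Decimal(price) * qty).quantize(CENT, rounding=ROUND_HALF_UP)
        for _, qty, price in items
    ]
    subtotal = sum(line_totals, Decimal("0.00"))
    tax = (subtotal * Decimal(tax_rate)).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}
```

Using `Decimal` with string inputs avoids the binary floating-point drift that shows up when currency amounts are summed as floats.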

Report Template

REPORT PDF STRUCTURE
├── Cover page (title, author, date)
├── Table of contents
├── Body sections with page numbers
├── Charts/images with captions
├── Appendices
└── Running header/footer

Decision Tree

PDF Task: [What do you need?]
    ├─ Create new PDF?
    │   ├─ Simple text/tables → pdfkit (Node) or ReportLab (Python)
    │   ├─ Complex layouts → ReportLab with Platypus
    │   └─ From HTML → Puppeteer or wkhtmltopdf
    │
    ├─ Extract from PDF?
    │   ├─ Text only → pdfplumber (Python)
    │   ├─ Tables → pdfplumber or camelot (Python)
    │   └─ Images → PyMuPDF/fitz (Python)
    │
    ├─ Modify existing PDF?
    │   ├─ Add text/images → pdf-lib (Node)
    │   ├─ Merge/split → pypdf or pdf-lib
    │   └─ Fill forms → pdf-lib
    │
    └─ Batch processing?
        └─ pypdf + pdfplumber pipeline
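
The tree above can also be kept as data, which is convenient when an agent needs to look up a recommendation programmatically. This is a sketch only: the tool strings are transcribed from the tree, while the key names (`"create"`, `"merge-split"`, and so on) and the `suggest_tool` helper are illustrative assumptions.

```python
from typing import Dict, Tuple

# (task, need) -> suggested tools, transcribed from the decision tree.
# The short key names are made up for this example.
PDF_TOOLS: Dict[Tuple[str, str], str] = {
    ("create", "simple"): "pdfkit (Node) or ReportLab (Python)",
    ("create", "complex"): "ReportLab with Platypus",
    ("create", "html"): "Puppeteer or wkhtmltopdf",
    ("extract", "text"): "pdfplumber (Python)",
    ("extract", "tables"): "pdfplumber or camelot (Python)",
    ("extract", "images"): "PyMuPDF/fitz (Python)",
    ("modify", "add"): "pdf-lib (Node)",
    ("modify", "merge-split"): "pypdf or pdf-lib",
    ("modify", "forms"): "pdf-lib",
    ("batch", "any"): "pypdf + pdfplumber pipeline",
}


def suggest_tool(task: str, need: str) -> str:
    """Return the tree's suggestion, or raise KeyError for an unknown branch."""
    key = (task.strip().lower(), need.strip().lower())
    if key not in PDF_TOOLS:
        raise KeyError(f"no branch for {key!r}")
    return PDF_TOOLS[key]
```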

Do / Avoid (Jan 2026)

Do

Keep a versioned source document (doc/slide/design file) alongside the PDF.
Verify links and reading order for long documents.
Use real redaction and test by copy/paste.
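
The copy/paste test can be partly automated by extracting the text layer and searching it for strings that should be gone. The sketch below is an illustration under assumptions: `find_leaks` and `extract_page_texts` are hypothetical helpers, and the extraction step assumes pdfplumber is installed.

```python
from typing import Dict, Iterable, List


def find_leaks(page_texts: Iterable[str], secrets: Iterable[str]) -> Dict[str, List[int]]:
    """Return {secret: [1-based page numbers]} for secrets still extractable."""
    terms = [s for s in secrets if s.strip()]
    leaks: Dict[str, List[int]] = {}
    for number, text in enumerate(page_texts, start=1):
        # Normalise whitespace and case so line breaks do not hide a match.
        haystack = " ".join(text.split()).casefold()
        for term in terms:
            if " ".join(term.split()).casefold() in haystack:
                leaks.setdefault(term, []).append(number)
    return leaks


def extract_page_texts(path: str) -> List[str]:
    """Extract the text layer of each page (requires pdfplumber)."""
    import pdfplumber  # imported lazily so find_leaks works without it

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]
```

An empty result from `find_leaks(extract_page_texts("out.pdf"), ["..."])` only shows that the text layer is clean; metadata, attachments, and text inside images need separate checks, so a manual review is still needed.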

Avoid

Editing PDFs as the primary workflow when a source doc exists.
Shipping PDFs with broken links or illegible charts.
Including customer PII or secrets in PDFs without explicit approval.

What Good Looks Like

Fidelity: export is reproducible from a versioned source file (doc/slide/design) and looks identical across viewers.
Accessibility: tags/reading order are correct; links work; scanned docs are OCRed when appropriate.
Release hygiene: file naming includes version/date; metadata is clean; no “PDF as source of truth”.
Security: redaction is verified (copy/paste test) and sensitive data is minimized.
QA: release checklist completed using assets/pdf-release-checklist.md.

Optional: AI / Automation

Use only when explicitly requested and policy-compliant.

Automation may generate a draft run of the release checklist; a human still verifies the final PDF manually.

Navigation

Resources

references/pdf-generation-patterns.md — Complex layouts, multi-page docs
references/pdf-extraction-patterns.md — Text, table, image extraction
references/pdf-accessibility-compliance.md — Tagged PDFs, PDF/UA, EAA compliance
references/pdf-forms-interactive.md — AcroForms, form filling, digital signatures
references/pdf-security-redaction.md — Encryption, permissions, real redaction
data/sources.json — Library documentation links

Templates

assets/invoice-template.md — Invoice PDF generation
assets/report-template.md — Multi-page report structure
assets/pdf-release-checklist.md — Links, accessibility, export fidelity

Related Skills

../document-docx/SKILL.md — Word document generation
../document-xlsx/SKILL.md — Excel/spreadsheet workflows
../document-pptx/SKILL.md — PowerPoint presentations

Fact-Checking

Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
Prefer primary sources; report source links and dates for volatile information.
If web access is unavailable, state the limitation and mark guidance as unverified.

Weekly Installs: 167
Repository: vasilyu1983/ai-…s-public
GitHub Stars: 46
First Seen: Jan 23, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (141), cursor (140), gemini-cli (138), codex (137), github-copilot (134), amp (124)
