PDF Text Extraction Tool Guide: PyMuPDF, pdfplumber, and OCR APIs Compared, Optimized for LLMs and RAG

extracting-pdf-text by letta-ai/skills

205 weekly installs

74 GitHub Stars

GitHub

Install command

npx skills add https://github.com/letta-ai/skills --skill extracting-pdf-text

AI/Machine Learning, Automation, Data Processing


🇺🇸English

Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

PDF Type	Best Approach	Script
Simple text PDF	PyMuPDF	`scripts/extract_pymupdf.py`
PDF with tables	pdfplumber	`scripts/extract_pdfplumber.py`
Scanned/image PDF (local)	pytesseract	`scripts/extract_with_ocr.py`
Complex layout, highest accuracy	Mistral OCR API	`scripts/extract_mistral_ocr.py`
End-to-end RAG pipeline	marker-pdf	`pip install marker-pdf`

Recommended Workflow

1. Try PyMuPDF first: it is the fastest option and handles most text-based PDFs well.
2. If tables come out mangled, switch to pdfplumber.
3. If the PDF is scanned or image-based, use the Mistral OCR API (best accuracy) or local OCR (free but slower).
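The workflow above can be sketched as a small routing function. This is only an illustration, not part of the skill: `looks_scanned`, `choose_script`, and the 20-characters-per-page threshold are hypothetical names and values chosen for the example. The per-page text is assumed to come from a quick first pass (for instance PyMuPDF's `page.get_text()`).

```python
from typing import Sequence

# Script paths as listed in the Quick Decision Guide.
SCRIPTS = {
    "pymupdf": "scripts/extract_pymupdf.py",
    "pdfplumber": "scripts/extract_pdfplumber.py",
    "local_ocr": "scripts/extract_with_ocr.py",
    "mistral_ocr": "scripts/extract_mistral_ocr.py",
}


def looks_scanned(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (an assumption, not from the skill): pages that yield almost
    no extractable text probably come from a scanned or image-based PDF."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def choose_script(
    page_texts: Sequence[str],
    tables_mangled: bool = False,
    api_available: bool = False,
) -> str:
    """Follow the recommended order: PyMuPDF, then pdfplumber, then OCR."""
    if looks_scanned(page_texts):
        return SCRIPTS["mistral_ocr" if api_available else "local_ocr"]
    if tables_mangled:
        return SCRIPTS["pdfplumber"]
    return SCRIPTS["pymupdf"]
```

Whether tables are "mangled" is a judgment made after inspecting the PyMuPDF output, so the sketch takes it as an input rather than trying to detect it.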

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

uv run scripts/extract_pymupdf.py input.pdf output.md

The script outputs Markdown with headings and paragraphs preserved. For LLM-optimized output, it uses pymupdf4llm, which formats text for RAG systems.
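For readers who want to call the library directly, a minimal sketch of the pymupdf4llm route might look like the following. It is not the contents of `scripts/extract_pymupdf.py`, which may differ. `pdf_to_markdown` and its `convert` parameter are hypothetical names introduced here so the conversion step can be swapped out. The only library call used is `pymupdf4llm.to_markdown`.

```python
from typing import Callable, Optional


def pdf_to_markdown(
    pdf_path: str,
    convert: Optional[Callable[[str], str]] = None,
) -> str:
    """Convert a PDF to Markdown and normalize the trailing whitespace.

    `convert` is a hypothetical hook: by default it is pymupdf4llm.to_markdown,
    but any function mapping a path to a Markdown string can be passed in.
    """
    if convert is None:
        # Imported lazily so the module loads even without pymupdf4llm installed.
        import pymupdf4llm

        convert = pymupdf4llm.to_markdown
    return convert(pdf_path).strip() + "\n"
```

A typical call would be `Path("output.md").write_text(pdf_to_markdown("input.pdf"), encoding="utf-8")`, using `pathlib.Path`.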

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

uv run scripts/extract_pdfplumber.py input.pdf output.md

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
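As a rough illustration of the table conversion, the sketch below renders the row lists that pdfplumber's `page.extract_tables()` returns as Markdown tables. It assumes the first row of each table is the header, which will not hold for every document. `table_to_markdown` and `extract_tables_as_markdown` are hypothetical helper names, not functions from the skill's script.

```python
from typing import List, Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render one extracted table as Markdown, treating row 0 as the header."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells may be None; newlines and pipes would break a Markdown row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path: str) -> List[str]:
    """Collect every table pdfplumber finds, one Markdown string per table."""
    import pdfplumber  # lazy import: only needed when a real PDF is processed

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                tables.append(table_to_markdown(rows))
    return tables
```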

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

uv run scripts/extract_with_ocr.py input.pdf output.txt

Requires: the pytesseract and pdf2image Python packages, plus a Tesseract installation (brew install tesseract on macOS). pdf2image additionally depends on Poppler being installed.
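A minimal sketch of the local OCR path, assuming the usual pdf2image plus pytesseract combination: each page is rasterized, then passed to Tesseract. The `--- Page N ---` separators, the 300 DPI default, and the helper names `join_ocr_pages` and `ocr_pdf` are choices made for this example. The actual `scripts/extract_with_ocr.py` may behave differently.

```python
from typing import Iterable


def join_ocr_pages(page_texts: Iterable[str]) -> str:
    """Concatenate per-page OCR text with a simple page marker (an assumption)."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Rasterize each page and run Tesseract on it."""
    # Lazy imports: both packages (and the Tesseract/Poppler binaries)
    # are only required when this function is actually called.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_ocr_pages(
        pytesseract.image_to_string(image, lang=lang) for image in images
    )
```

Higher DPI values tend to improve recognition at the cost of speed and memory, so the default here is a starting point rather than a recommendation.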

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: roughly 1,000 pages per US dollar (very cost-effective)
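To put that rate in concrete terms, here is a back-of-the-envelope estimate. It takes the approximate figure quoted above at face value; actual pricing may differ, and `estimate_ocr_cost` is a hypothetical helper name.

```python
PAGES_PER_DOLLAR = 1000  # approximate rate quoted in this guide, not an official price


def estimate_ocr_cost(num_pages: int, pages_per_dollar: float = PAGES_PER_DOLLAR) -> float:
    """Return the estimated cost in US dollars for OCR-ing `num_pages` pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages / pages_per_dollar
```

At this rate a 250-page report would cost about $0.25, and a 12,000-page archive about $12.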

export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md

Features:

Outputs clean markdown
Preserves document structure (headings, lists, tables)
Handles images, math equations, multilingual text
95%+ accuracy on complex documents
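For orientation, a direct call to the OCR endpoint through the `mistralai` Python client might look roughly like this. The sketch assumes the client interface as documented at the time of writing (`client.ocr.process` with a `document_url` document and the `mistral-ocr-latest` model name) and a PDF reachable at a URL. Uploading a local file, as the script above does with `input.pdf`, takes extra steps not shown here. `join_page_markdown` and `ocr_pdf_url` are hypothetical helper names.

```python
import os
from typing import Iterable


def join_page_markdown(page_markdowns: Iterable[str]) -> str:
    """Join per-page Markdown into one document separated by blank lines."""
    return "\n\n".join(md.strip() for md in page_markdowns) + "\n"


def ocr_pdf_url(document_url: str, model: str = "mistral-ocr-latest") -> str:
    """Send a PDF URL to the OCR endpoint and return the combined Markdown."""
    # Lazy import so the helper above works without the client installed.
    from mistralai import Mistral

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    response = client.ocr.process(
        model=model,
        document={"type": "document_url", "document_url": document_url},
    )
    # Each page in the response is assumed to expose its text as `.markdown`.
    return join_page_markdown(page.markdown for page in response.pages)
```

Check the current Mistral documentation before relying on these names, since client libraries change between releases.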

For detailed API options and other services, see references/api-services.md.

Output Format Recommendations

For LLM consumption, markdown is preferred:

Preserves semantic structure (headings become context boundaries)
Tables remain readable
Compatible with most RAG chunking strategies
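To illustrate what "headings become context boundaries" can mean in practice, the sketch below splits Markdown into chunks at heading lines. It is one simple chunking strategy among many, not the method any particular RAG framework uses, and `split_on_headings` is a hypothetical helper name.

```python
import re
from typing import List

# ATX-style Markdown headings: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def split_on_headings(markdown: str) -> List[str]:
    """Split a Markdown document into chunks that each start at a heading."""
    chunks: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        # Start a new chunk at each heading, unless nothing has accumulated yet.
        if HEADING.match(line) and any(l.strip() for l in current):
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if any(l.strip() for l in current):
        chunks.append("\n".join(current).strip())
    return chunks
```

This naive version treats any line starting with `#` as a heading, so comment lines inside fenced code blocks would also trigger a split. A production chunker would need to track fences and probably enforce a maximum chunk size.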

For detailed comparisons of local tools, see references/local-tools.md.


First Seen

Jan 24, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Warn

Installed on

opencode: 180
codex: 178
gemini-cli: 174
github-copilot: 167
cursor: 165
kimi-cli: 157
