paddleocr-doc-parsing by aidenwu0209/paddleocr-skills
npx skills add https://github.com/aidenwu0209/paddleocr-skills --skill paddleocr-doc-parsing
✅ Use Document Parsing for:
❌ Use Text Recognition instead for:
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
Run `python scripts/vl_caller.py`. If the script execution fails (API not configured, network error, etc.), follow the configuration workflow below.

Execute document parsing:
python scripts/vl_caller.py --file-url "URL provided by user"
Or for local files:
python scripts/vl_caller.py --file-path "file path"
Optional: explicitly set the file type:
python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0
* `--file-type 0`: PDF
* `--file-type 1`: image
* If omitted, the service can infer the file type from the input.
Save result to file (recommended):
python scripts/vl_caller.py --file-url "URL" --output result.json --pretty
* The script will display: `Result saved to: /absolute/path/to/result.json`
* This message appears on stderr; the JSON itself is saved to the file
* **Tell the user the file path** shown in the message
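For illustration, assembling these command-line options programmatically might look like this (a sketch; only the flags documented above are assumed, and `build_command` is a hypothetical helper name):

```python
def build_command(file_url=None, file_path=None, file_type=None,
                  output=None, pretty=True):
    """Assemble the vl_caller.py argument list from the documented flags."""
    cmd = ["python", "scripts/vl_caller.py"]
    if file_url:
        cmd += ["--file-url", file_url]
    if file_path:
        cmd += ["--file-path", file_path]
    if file_type is not None:
        cmd += ["--file-type", str(file_type)]  # 0 = PDF, 1 = image
    if output:
        cmd += ["--output", output]  # JSON goes here; the path is echoed on stderr
    if pretty:
        cmd.append("--pretty")
    return cmd
```

The resulting list can be handed to `subprocess.run` directly, which avoids shell-quoting issues with URLs and file paths.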
2. The script returns COMPLETE JSON with all document content:
* Headers, footers, page numbers
* Main text content
* Tables with structure
* Formulas (with LaTeX)
* Figures and charts
* Footnotes and references
* Seals and stamps
* Layout and reading order
Note: The actual content types that can be parsed depend on the model configured at your API endpoint (PADDLEOCR_DOC_PARSING_API_URL). The list above represents the maximum set of supported types.
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
Display the complete `text` field. What this means:
Example - Correct:
User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:
[Display the entire text field]
Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
Example - Incorrect ❌:
User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
The script returns a JSON envelope wrapping the raw API result:
{
"ok": true,
"text": "Full markdown/HTML text extracted from all pages",
"result": [
{
"prunedResult": { ... }, // layout element positions, content, confidence
"markdown": {
"text": "Full page content in markdown/HTML format",
"images": { ... }
}
}
],
"error": null
}
Key fields:
* `text` — extracted markdown text from all pages (use this for quick text display)
* `result` — raw API result array (one object per page)
* `result[n].prunedResult` — layout element positions, content, and confidence scores
* `result[n].markdown` — full page content in markdown/HTML format

| User Says | What to Extract | How |
|---|---|---|
| "Extract all text" | Everything | Use text field directly |
| "Get all tables" | Tables only | Look for <table> in the markdown text |
| "Show main content" | Main body text | Use text field, filter as needed |
| "Complete document" | Everything | Use text field |
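A minimal sketch of reading the envelope, assuming only the field names shown above (`read_envelope` and `page_texts` are hypothetical helper names, not part of the skill's scripts):

```python
import json

def read_envelope(raw: str) -> dict:
    """Parse the JSON envelope and fail fast if the script reported an error."""
    env = json.loads(raw)
    if not env.get("ok"):
        raise RuntimeError(env.get("error") or "document parsing failed")
    return env

def page_texts(env: dict) -> list:
    """Per-page markdown text from the raw result array (one entry per page)."""
    return [page["markdown"]["text"] for page in env.get("result", [])]
```

For "extract all text" requests, `env["text"]` alone is enough; `page_texts` is only needed when per-page structure matters.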
Example 1: Extract Main Content (default behavior)
python scripts/vl_caller.py \
--file-url "https://example.com/paper.pdf" \
--pretty
Then use the text field for main content display.
Example 2: Extract Tables Only
python scripts/vl_caller.py \
--file-path "./financial_report.pdf" \
--pretty
Then look for <table> content in the result to extract tables.
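A hypothetical helper (not part of the skill's scripts) for pulling `<table>` blocks out of the markdown/HTML text; it assumes tables are not nested:

```python
import re

def extract_tables(markdown_text):
    """Return every <table>...</table> block in the markdown/HTML text.

    re.DOTALL lets '.' cross newlines; the non-greedy '.*?' stops at the
    first closing tag so adjacent tables are not merged. Nested tables
    would need a real HTML parser instead.
    """
    return re.findall(r"<table\b.*?</table>",
                      markdown_text, flags=re.DOTALL | re.IGNORECASE)
```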
Example 3: Complete Document with Everything
python scripts/vl_caller.py \
--file-url "URL" \
--pretty
Then use the text field or iterate the full result.
When the API is not configured:
The error will show:
Configuration error: API not configured. Get your API at: https://paddleocr.com
Configuration workflow:
Show the exact error message to the user (including the URL)
Tell the user to provide credentials:
Please visit the URL above to get your PADDLEOCR_DOC_PARSING_API_URL and PADDLEOCR_ACCESS_TOKEN.
Once you have them, send them to me and I'll configure it automatically.
When the user provides credentials (accept any format), for example:

* `PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...`
* "Here's my API: https://xxx and token: abc123"

Parse credentials from the user's message:
Configure automatically:
python scripts/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
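"Accept any format" could be implemented along these lines (a sketch; the token character set and the `parse_credentials` helper name are assumptions):

```python
import re

def parse_credentials(message):
    """Pull the API URL and access token out of a free-form user message.

    Assumes the API URL is the first http(s) URL in the message and the
    token follows either 'PADDLEOCR_ACCESS_TOKEN=' or the word 'token'.
    """
    url_m = re.search(r"https?://[^\s,'\"]+", message)
    token_m = re.search(
        r"(?:PADDLEOCR_ACCESS_TOKEN\s*=\s*|token[:\s]+)([A-Za-z0-9_\-.]+)",
        message, flags=re.IGNORECASE)
    return (url_m.group(0) if url_m else None,
            token_m.group(1) if token_m else None)
```

The two parsed values map directly onto `--api-url` and `--token` in the `configure.py` call above.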
If configuration succeeds: confirm to the user and retry the original parsing request.

If configuration fails: show the script's exact error message to the user.

IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
Tips for large files:
For very large local files, prefer --file-url over --file-path to avoid base64 encoding overhead:
python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
If you only need certain pages from a large PDF, extract them first:
# Using pypdfium2 (requires: pip install pypdfium2)
python -c "
import pypdfium2 as pdfium
doc = pdfium.PdfDocument('large.pdf')
# Extract pages 0-4 (the first 5 pages)
new_doc = pdfium.PdfDocument.new()
for i in range(min(5, len(doc))):
    new_doc.import_pages(doc, [i])
new_doc.save('pages_1_5.pdf')
"
# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
Authentication failed (403):
error: Authentication failed
→ The token is invalid; reconfigure with correct credentials
API quota exceeded (429):
error: API quota exceeded
→ The daily API quota is exhausted; tell the user to wait or upgrade
Unsupported format:
error: Unsupported file format
→ The file format is not supported; convert it to PDF, PNG, or JPG
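The three cases above can be dispatched with a small lookup (a sketch; the error strings are the ones shown above, and `remedy_for` is a hypothetical name):

```python
# Known error substrings mapped to the remedies described above.
REMEDIES = {
    "Authentication failed": "The token is invalid; reconfigure with correct credentials.",
    "API quota exceeded": "The daily quota is exhausted; wait or upgrade.",
    "Unsupported file format": "Convert the input to PDF, PNG, or JPG first.",
}

def remedy_for(error_message):
    """Return the suggested fix for a known error, else a generic hint."""
    for key, advice in REMEDIES.items():
        if key.lower() in error_message.lower():
            return advice
    return "Show the exact error to the user and re-check the configuration."
```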
For in-depth understanding of the PaddleOCR Document Parsing system, refer to:
* references/output_schema.md - Output format specification
* references/provider_api.md - Provider API contract

Note: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL).
Load these reference documents into context when:
To verify the skill is working properly:
python scripts/smoke_test.py
This tests configuration and optionally API connectivity.
Weekly Installs: 50
GitHub Stars: 5
First Seen: Feb 9, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Fail; Snyk: Fail
Installed on: gemini-cli (48), opencode (48), codex (47), github-copilot (46), kimi-cli (45), amp (45)