PaddleOCR Document Parsing Skill - Accurately extract structured data from documents with tables, formulas, charts, and complex layouts

paddleocr-doc-parsing by paddlepaddle/paddleocr

83 weekly installs

73,000 GitHub Stars

Install command

npx skills add https://github.com/paddlepaddle/paddleocr --skill paddleocr-doc-parsing

AI/Machine Learning, Data Processing, Computer Vision


PaddleOCR Document Parsing Skill

When to Use This Skill

Use Document Parsing for:

 * Documents with tables (invoices, financial reports, spreadsheets)
 * Documents with mathematical formulas (academic papers, scientific documents)
 * Documents with charts and diagrams
 * Multi-column layouts (newspapers, magazines, brochures)
 * Complex document structures requiring layout analysis
 * Any document requiring structured understanding

Use Text Recognition instead for:

 * Simple text-only extraction
 * Quick OCR tasks where speed is critical
 * Screenshots or simple images with clear text

How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔

 * ONLY use PaddleOCR Document Parsing API - Execute the script python scripts/vl_caller.py
 * NEVER parse documents directly - Do NOT parse documents yourself
 * NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
 * IF API fails - Display the error message and STOP immediately
 * NO fallback methods - Do NOT attempt document parsing any other way

If the script execution fails (API not configured, network error, etc.):

 * Show the error message to the user
 * Do NOT offer to help using your vision capabilities
 * Do NOT ask "Would you like me to try parsing it?"
 * Simply stop and wait for the user to fix the configuration

Basic Workflow

1. Execute document parsing:

     python scripts/vl_caller.py --file-url "URL provided by user" --pretty

Or for local files:

     python scripts/vl_caller.py --file-path "file path" --pretty

Optional: explicitly set file type:

     python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty

 * `--file-type 0`: PDF
 * `--file-type 1`: image
 * If omitted, the service can infer file type from input.

Default behavior: save raw JSON to a temp file:

 * If `--output` is omitted, the script saves automatically under the system temp directory
 * Default path pattern: `<system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json`
 * If `--output` is provided, it overrides the default temp-file destination
 * If `--stdout` is provided, JSON is printed to stdout and no file is saved
 * In save mode, the script prints the absolute saved path on stderr: `Result saved to: /absolute/path/...`
 * In default/custom save mode, read and parse the saved JSON file before responding
 * In save mode, always tell the user the saved file path and that full raw JSON is available there
 * Use `--stdout` only when you explicitly want to skip file persistence

2. The output JSON contains COMPLETE content with all document data:

 * Headers, footers, page numbers
 * Main text content
 * Tables with structure
 * Formulas (with LaTeX)
 * Figures and charts
 * Footnotes and references
 * Seals and stamps
 * Layout and reading order

Input type note:

 * Supported file types depend on the model and endpoint configuration.
 * Always follow the file type constraints documented by your endpoint API.

3. Extract what the user needs from the output JSON using these fields:

 * Top-level `text`
 * `result[n].markdown`
 * `result[n].prunedResult`
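The invocation in step 1 can be sketched as a small command builder. This is only an illustration: the helper name `build_command`, the `explicit_type` switch, and the image extension set are assumptions, not part of the skill. Only the flags shown above (`--file-url`, `--file-path`, `--file-type`, `--pretty`) are used.

```python
import sys
from pathlib import Path
from urllib.parse import urlparse

# Assumed image extensions; the real constraints come from your endpoint.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def build_command(source: str, explicit_type: bool = False) -> list[str]:
    """Build the argv for scripts/vl_caller.py from a URL or a local path."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    cmd = [
        sys.executable,
        "scripts/vl_caller.py",
        "--file-url" if is_url else "--file-path",
        source,
    ]
    if explicit_type:
        # Use only the path part of a URL so query strings do not hide the suffix.
        suffix = Path(parsed.path if is_url else source).suffix.lower()
        if suffix == ".pdf":
            cmd += ["--file-type", "0"]
        elif suffix in IMAGE_EXTS:
            cmd += ["--file-type", "1"]
        # Unknown suffix: omit the flag and let the service infer the type.
    cmd.append("--pretty")
    return cmd
```

The returned list can be passed to `subprocess.run(..., capture_output=True, text=True)` so that stderr, which carries the saved-file path in save mode, is available afterwards.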

IMPORTANT: Complete Content Display

CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.

 * The output JSON contains ALL document content in a structured format
 * In save mode, the raw provider result can be inspected in the saved JSON file
 * Display the full content requested by the user; do NOT truncate or summarize
 * If the user asks for "all text", show the entire text field
 * If the user asks for "tables", show ALL tables in the document
 * If the user asks for "main content", filter out headers/footers but show ALL body text

What this means:

 * DO: Display complete text, all tables, all formulas as requested
 * DO: Present content using these fields: top-level text, result[n].markdown, and result[n].prunedResult
 * DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
 * DON'T: Summarize or provide excerpts when the user asks for full content
 * DON'T: Say "Here's a preview" when the user expects complete output

Example - Correct:

User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here's all the extracted text:

[Display entire text field or concatenated regions in reading order]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)

Example - Incorrect:

User: "Extract all the text"
Agent: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"

Understanding the JSON Response

The output JSON uses an envelope wrapping the raw API result:

{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": { ... },  // raw provider response
  "error": null
}

Key fields:

 * text - extracted markdown text from all pages (use this for quick text display)
 * result - raw provider response object
 * result[n].prunedResult - structured parsing output for each page (layout/content/confidence and related metadata)
 * result[n].markdown - full rendered page output in markdown/HTML

Raw result location (default): the temp-file path printed by the script on stderr
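As a minimal sketch of consuming the envelope in save mode, the helpers below pull the saved path out of the script's stderr and read the top-level fields. The function names are hypothetical, and the code assumes only what is stated above: the `Result saved to: ...` line on stderr and the `ok`, `text`, and `error` keys. It deliberately does not touch `result`, whose exact shape is defined in references/output_schema.md.

```python
import json
import re
from pathlib import Path
from typing import Optional

# Matches the line the script prints on stderr in save mode.
SAVED_RE = re.compile(r"^Result saved to: (.+)$", re.MULTILINE)


def saved_path_from_stderr(stderr: str) -> Optional[str]:
    """Return the path from the 'Result saved to: ...' line, or None."""
    match = SAVED_RE.search(stderr)
    return match.group(1).strip() if match else None


def load_envelope(path: str) -> dict:
    """Read and parse the saved JSON envelope."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def full_text(envelope: dict) -> str:
    """Return the complete top-level text, or raise with the reported error."""
    if not envelope.get("ok"):
        # Per the restrictions above: surface the error and stop, no fallback.
        raise RuntimeError(f"Parsing failed: {envelope.get('error')}")
    return envelope["text"]
```

When `saved_path_from_stderr` returns a path, report it to the user along with the extracted content, since the full raw JSON stays available there.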

Usage Examples

Example 1: Extract Full Document Text

python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty

Then use:

 * Top-level text for quick full-text output
 * result[n].markdown when page-level output is needed

Example 2: Extract Structured Page Data

python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty

Then use:

 * result[n].prunedResult for structured parsing data (layout/content/confidence)
 * result[n].markdown for rendered page content

Example 3: Print JSON Without Saving

python scripts/vl_caller.py \
  --file-url "URL" \
  --stdout \
  --pretty

Then return:

 * Full text when the user asks for full document content
 * result[n].prunedResult and result[n].markdown when the user needs complete structured page data

First-Time Configuration

You can generally assume that the required environment variables have already been configured. Only when a parsing task fails should you analyze the error message to determine whether it is caused by a configuration issue. If it is indeed a configuration problem, you should notify the user to fix it.

When the API is not configured:

The error will show:

CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com

Configuration workflow:

1. Show the exact error message to the user (including the URL).

2. Guide the user to configure securely:
- Recommend configuring through the host application's standard method (e.g., settings file, environment variable UI) rather than pasting credentials in chat.
- List the required environment variables:
```
- PADDLEOCR_DOC_PARSING_API_URL
- PADDLEOCR_ACCESS_TOKEN
- Optional: PADDLEOCR_DOC_PARSING_TIMEOUT
```

3. If the user provides credentials in chat anyway (accept any reasonable format), for example:
- PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
- Here's my API: https://xxx and token: abc123
- Copy-pasted code format
- Any other reasonable format

Security note: Warn the user that credentials shared in chat may be stored in conversation history. Recommend setting them through the host application's configuration instead when possible.

Then parse and validate the values:

 * Extract `PADDLEOCR_DOC_PARSING_API_URL` (look for URLs with `paddleocr.com` or similar)
 * Confirm `PADDLEOCR_DOC_PARSING_API_URL` is a full endpoint ending with `/layout-parsing`
 * Extract `PADDLEOCR_ACCESS_TOKEN` (long alphanumeric string, usually 40+ chars)

4. Ask the user to confirm the environment is configured.

5. Retry only after confirmation:
- Once the user confirms the environment variables are available, retry the original parsing task
Handling Large Files

There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.

Tips for large files:

Use URL for Large Local Files (Recommended)

For very large local files, prefer --file-url over --file-path to avoid base64 encoding overhead:

python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"

Process Specific Pages (PDF Only)

If you only need certain pages from a large PDF, extract them first:

# Extract pages 1-5
python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5"

# Mixed ranges are supported
python scripts/split_pdf.py large.pdf selected_pages.pdf --pages "1-5,8,10-12"

# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
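For a PDF longer than the 100-page limit, the `--pages` values for successive split_pdf.py calls can be computed ahead of time. The sketch below is an illustration under assumptions: `page_chunks` is a hypothetical helper, and the total page count has to come from elsewhere (for example, a PDF library or the user).

```python
def page_chunks(total_pages: int, max_pages: int = 100) -> list[str]:
    """Split pages 1..total_pages into consecutive --pages specs of at most max_pages each."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    specs = []
    for start in range(1, total_pages + 1, max_pages):
        end = min(start + max_pages - 1, total_pages)
        # A one-page chunk is written as a single number, like the "8" in "1-5,8,10-12".
        specs.append(str(start) if start == end else f"{start}-{end}")
    return specs
```

Each spec then feeds one `split_pdf.py ... --pages "<spec>"` call, followed by one `vl_caller.py --file-path` call on the resulting file.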

Error Handling

Authentication failed (403):

error: Authentication failed

→ Token is invalid; reconfigure with correct credentials

API quota exceeded (429):

error: API quota exceeded

→ Daily API quota exhausted; inform the user to wait or upgrade

Unsupported format:

error: Unsupported file format

→ File format not supported; convert to PDF/PNG/JPG
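The three cases above can be expressed as a lookup from error text to the suggested next step. This is a sketch, not the script's own logic: `advise` and the `ADVICE` table are hypothetical, and matching on message substrings assumes the error text contains the phrases shown above.

```python
from typing import Optional

# (phrase expected in the error text, guidance to show alongside it)
ADVICE = [
    ("Authentication failed", "Token is invalid; reconfigure with correct credentials."),
    ("API quota exceeded", "Daily API quota exhausted; wait or upgrade."),
    ("Unsupported file format", "File format not supported; convert to PDF/PNG/JPG."),
]


def advise(error_text: str) -> Optional[str]:
    """Return guidance for a known error message, or None if it is unrecognized."""
    lowered = error_text.lower()
    for phrase, guidance in ADVICE:
        if phrase.lower() in lowered:
            return guidance
    return None
```

Whatever `advise` returns, the original error message is still shown to the user verbatim, and no alternative parsing is attempted.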

Important Notes

 * The script NEVER filters content - It always returns complete data
 * The AI agent decides what to present - Based on the user's specific request
 * All data is always available - Can be re-interpreted for different needs
 * No information is lost - Complete document structure is preserved

Reference Documentation

references/output_schema.md - Output format specification

Note: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL).

Load these reference documents into context when:

 * Debugging complex parsing issues
 * Understanding the output format
 * Working with provider API details

Testing the Skill

To verify the skill is working properly:

python scripts/smoke_test.py

This tests configuration and optionally API connectivity.

Repository: paddlepaddle/paddleocr

GitHub Stars: 73.0K

First Seen: Mar 6, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Fail)

Installed on: gemini-cli (82), github-copilot (82), amp (82), cline (82), codex (82), opencode (82)
