paddleocr-doc-parsing by paddlepaddle/paddleocr
npx skills add https://github.com/paddlepaddle/paddleocr --skill paddleocr-doc-parsing
Use Document Parsing for:
Use Text Recognition instead for:
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
python scripts/vl_caller.py
If the script execution fails (API not configured, network error, etc.):
1. Execute document parsing:
python scripts/vl_caller.py --file-url "URL provided by user" --pretty
Or for local files:
python scripts/vl_caller.py --file-path "file path" --pretty
Optional: explicitly set the file type:
python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty
* `--file-type 0`: PDF
* `--file-type 1`: image
* If omitted, the service can infer file type from input.
Default behavior: save raw JSON to a temp file:
* If `--output` is omitted, the script saves automatically under the system temp directory
* Default path pattern: `<system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json`
* If `--output` is provided, it overrides the default temp-file destination
* If `--stdout` is provided, JSON is printed to stdout and no file is saved
* In save mode, the script prints the absolute saved path on stderr: `Result saved to: /absolute/path/...`
* In default/custom save mode, read and parse the saved JSON file before responding
* In save mode, always tell the user the saved file path and that full raw JSON is available there
* Use `--stdout` only when you explicitly want to skip file persistence
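The save-mode behavior above can be sketched in Python as follows. This is a minimal illustration, not part of the skill: `parse_saved_path` and `run_and_load` are hypothetical helper names, and the sketch assumes the stderr line has exactly the form `Result saved to: <path>` shown above.

```python
import json
import re
import subprocess
import sys

# Assumption: the script reports the saved file on its own stderr line,
# in the form "Result saved to: /absolute/path/...".
SAVED_PATH_RE = re.compile(r"^Result saved to:\s*(.+)$", re.MULTILINE)


def parse_saved_path(stderr_text):
    """Return the path announced on stderr, or None if no such line exists."""
    match = SAVED_PATH_RE.search(stderr_text)
    return match.group(1).strip() if match else None


def run_and_load(args):
    """Run vl_caller.py in save mode and load the JSON file it reports.

    `args` holds the documented flags, e.g. ["--file-path", "doc.pdf", "--pretty"].
    """
    proc = subprocess.run(
        [sys.executable, "scripts/vl_caller.py", *args],
        capture_output=True,
        text=True,
    )
    saved_path = parse_saved_path(proc.stderr)
    if saved_path is None:
        raise RuntimeError(f"no saved path reported; stderr was:\n{proc.stderr}")
    with open(saved_path, encoding="utf-8") as fh:
        return saved_path, json.load(fh)
```

Returning the path together with the parsed JSON makes it easy to tell the user where the full raw result lives, as the list above requires.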
2. The output JSON contains COMPLETE content with all document data:
* Headers, footers, page numbers
* Main text content
* Tables with structure
* Formulas (with LaTeX)
* Figures and charts
* Footnotes and references
* Seals and stamps
* Layout and reading order
Input type note:
* Supported file types depend on the model and endpoint configuration.
* Always follow the file type constraints documented by your endpoint API.
3. Extract what the user needs from the output JSON using these fields:
* Top-level `text`
* `result[n].markdown`
* `result[n].prunedResult`
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
`text` field
What this means:
`text`, `result[n].markdown`, and `result[n].prunedResult`
Example - Correct:
User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here's all the extracted text:
[Display entire text field or concatenated regions in reading order]
Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
Example - Incorrect:
User: "Extract all the text"
Agent: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
The output JSON uses an envelope wrapping the raw API result:
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": { ... }, // raw provider response
  "error": null
}
Key fields:
* `text`: extracted markdown text from all pages (use this for quick text display)
* `result`: raw provider response object
* `result[n].prunedResult`: structured parsing output for each page (layout/content/confidence and related metadata)
* `result[n].markdown`: full rendered page output in markdown/HTML
Raw result location (default): the temp-file path printed by the script on stderr
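As a rough illustration of reading these fields, the sketch below pulls the full text and the per-page markdown out of a parsed envelope. The helper names are hypothetical. The envelope example shows `result` as an object while the field list indexes it as `result[n]`, so this sketch simply assumes `result` is a list of pages; check `references/output_schema.md` for the authoritative layout.

```python
def full_text(envelope):
    """Return the top-level `text` field, failing loudly if parsing did not succeed."""
    if not envelope.get("ok"):
        raise ValueError(f"parsing failed: {envelope.get('error')}")
    return envelope.get("text") or ""


def page_markdown(envelope):
    """Return the `markdown` value of every page, in page order.

    Assumption: `result` is a list with one entry per page (result[n]).
    """
    pages = envelope.get("result") or []
    if not isinstance(pages, list):
        raise TypeError("expected `result` to be a list of pages")
    return [page.get("markdown") for page in pages]


# Hand-made envelope for illustration only; real output comes from vl_caller.py.
sample = {
    "ok": True,
    "text": "Page one\n\nPage two",
    "result": [
        {"markdown": "Page one", "prunedResult": {}},
        {"markdown": "Page two", "prunedResult": {}},
    ],
    "error": None,
}
print(full_text(sample))
print(page_markdown(sample))
```

Neither helper truncates or summarizes, in line with the requirement to show the complete extracted content.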
Example 1: Extract Full Document Text
python scripts/vl_caller.py \
--file-url "https://example.com/paper.pdf" \
--pretty
Then use:
* `text` for quick full-text output
* `result[n].markdown` when page-level output is needed
Example 2: Extract Structured Page Data
python scripts/vl_caller.py \
--file-path "./financial_report.pdf" \
--pretty
Then use:
* `result[n].prunedResult` for structured parsing data (layout/content/confidence)
* `result[n].markdown` for rendered page content
Example 3: Print JSON Without Saving
python scripts/vl_caller.py \
--file-url "URL" \
--stdout \
--pretty
Then return:
* `text` when the user asks for full document content
* `result[n].prunedResult` and `result[n].markdown` when the user needs complete structured page data
You can generally assume that the required environment variables are already configured. Analyze the error message for a configuration problem only when a parsing task fails. If the failure is indeed caused by configuration, notify the user so they can fix it.
When the API is not configured:
The error will show:
CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com
Configuration workflow:
1. Show the exact error message to the user (including the URL).
2. Guide the user to configure securely:
* Recommend configuring through the host application's standard method (e.g., settings file, environment variable UI) rather than pasting credentials in chat.
* List the required environment variables:
- PADDLEOCR_DOC_PARSING_API_URL
- PADDLEOCR_ACCESS_TOKEN
- Optional: PADDLEOCR_DOC_PARSING_TIMEOUT
3. If the user provides credentials in chat anyway (accept any reasonable format), for example:
* `PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...`
* `Here's my API: https://xxx and token: abc123`
Then parse and validate the values:
* Extract `PADDLEOCR_DOC_PARSING_API_URL` (look for URLs with `paddleocr.com` or similar)
* Confirm `PADDLEOCR_DOC_PARSING_API_URL` is a full endpoint ending with `/layout-parsing`
* Extract `PADDLEOCR_ACCESS_TOKEN` (long alphanumeric string, usually 40+ chars)
4. Ask the user to confirm the environment is configured.
Retry only after confirmation:
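The validation checks in the configuration workflow above could be sketched like this. `validate_credentials` is a hypothetical helper, not something the skill ships. The token check is only the heuristic stated above (long alphanumeric string, usually 40+ characters), so a reported problem is a prompt to double-check with the user rather than proof the value is wrong.

```python
from urllib.parse import urlparse


def validate_credentials(api_url, token):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []

    parsed = urlparse(api_url.strip())
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("API URL is not a full http(s) URL")
    if not parsed.path.endswith("/layout-parsing"):
        problems.append("API URL does not end with /layout-parsing")

    # Heuristic only: the text says tokens are usually 40+ alphanumeric characters.
    token = token.strip()
    if len(token) < 40 or not token.isalnum():
        problems.append("token does not look like a long alphanumeric string")

    return problems


# Placeholder values, matching the style of the examples above.
print(validate_credentials("https://xxx.paddleocr.com/layout-parsing", "a" * 40))
print(validate_credentials("https://xxx", "abc123"))
```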
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
Tips for large files:
For very large local files, prefer --file-url over --file-path to avoid base64 encoding overhead:
python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
If you only need certain pages from a large PDF, extract them first:
# Extract pages 1-5
python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5"
# Mixed ranges are supported
python scripts/split_pdf.py large.pdf selected_pages.pdf --pages "1-5,8,10-12"
# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
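To make the `--pages` syntax and the 100-page limit concrete, here is a small sketch that expands a page specification and checks whether it fits in one request. It illustrates the notation only and is not the implementation of `scripts/split_pdf.py`; the function names are hypothetical.

```python
MAX_PAGES_PER_REQUEST = 100  # per-request PDF limit stated above


def parse_page_spec(spec):
    """Expand a spec such as "1-5,8,10-12" into a list of page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


def fits_in_one_request(spec):
    """Check that the distinct selected pages stay within the per-request limit."""
    return len(set(parse_page_spec(spec))) <= MAX_PAGES_PER_REQUEST


print(parse_page_spec("1-5,8,10-12"))  # [1, 2, 3, 4, 5, 8, 10, 11, 12]
```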
Authentication failed (403):
error: Authentication failed
→ Token is invalid, reconfigure with correct credentials
API quota exceeded (429):
error: API quota exceeded
→ Daily API quota exhausted, inform user to wait or upgrade
Unsupported format:
error: Unsupported file format
→ File format not supported, convert to PDF/PNG/JPG
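The error cases above amount to a lookup from message to next step, which might be sketched as follows. The table contains only the three messages listed here. Matching on message substrings is an assumption about how the errors surface, and `suggest_action` is a hypothetical helper.

```python
# Assumption: the error text contains one of the messages listed above.
ERROR_ACTIONS = {
    "authentication failed": "Token is invalid; reconfigure with correct credentials.",
    "api quota exceeded": "Daily API quota exhausted; inform the user to wait or upgrade.",
    "unsupported file format": "File format not supported; convert to PDF/PNG/JPG.",
}


def suggest_action(error_message):
    """Return the documented next step for a known error, or None if unrecognized."""
    lowered = error_message.lower()
    for needle, action in ERROR_ACTIONS.items():
        if needle in lowered:
            return action
    return None


print(suggest_action("error: API quota exceeded"))
```

An unrecognized message returns `None`, which leaves room to show the raw error to the user instead of guessing.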
references/output_schema.md - Output format specification
Note: Model version and capabilities are determined by your API endpoint (`PADDLEOCR_DOC_PARSING_API_URL`).
Load these reference documents into context when:
To verify the skill is working properly:
python scripts/smoke_test.py
This tests configuration and optionally API connectivity.
Weekly Installs: 83
Repository: paddlepaddle/paddleocr
GitHub Stars: 73.0K
First Seen: Mar 6, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Fail)
Installed on: gemini-cli (82), github-copilot (82), amp (82), cline (82), codex (82), opencode (82)