MinerU Extract: Smart Web Content Extraction That Converts a URL to a Markdown Document in One Step

mineru-extract by blessonism/openclaw-search-skills

130 weekly installs

405 GitHub Stars

Install command

npx skills add https://github.com/blessonism/openclaw-search-skills --skill mineru-extract

Content creation · Automation · Data processing

🇺🇸English

MinerU Extract (official API)

Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.
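The "poll for completion" step can be sketched as a generic loop. This is a minimal illustration, not the scripts' actual code: `check` is a hypothetical callable standing in for whatever status request the script makes, and the defaults mirror the `--timeout 600` / `--poll-interval 2` values documented for the CLI.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    check: Callable[[], Optional[str]],
    timeout: float = 600.0,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call check() until it returns a non-None result or the timeout elapses.

    check is a hypothetical stand-in: it should return None while the task
    is still running and a result (for example the zip URL) once it is done.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"task not finished after {timeout} seconds")
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.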

Quick start (MCP-aligned)

This skill follows the MinerU MCP mental model, but it does not run an MCP server.

Primary script (MCP-style): scripts/mineru_parse_documents.py
- Input: --file-sources (comma/newline-separated)
- Output: JSON contract on stdout: { ok, items, errors }
Low-level script (single URL): scripts/mineru_extract.py
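A caller might check the stdout contract before using it. The sketch below assumes only the three documented top-level keys (ok, items, errors); the shape of the entries inside items and errors is not described here, so the code does not touch them. The function name and the sample string are hypothetical.

```python
import json
from typing import Any


def read_contract(stdout_text: str) -> dict[str, Any]:
    """Parse wrapper stdout and verify the three documented top-level keys."""
    payload = json.loads(stdout_text)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object on stdout")
    missing = {"ok", "items", "errors"} - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return payload


# Hypothetical sample output, used only to exercise the function.
sample = '{"ok": false, "items": [], "errors": []}'
result = read_contract(sample)
print(result["ok"], len(result["items"]), len(result["errors"]))
```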

Auth:

Set MINERU_TOKEN (Bearer token from mineru.net)

Default model heuristic:

URLs ending with .pdf/.doc/.ppt/.png/.jpg → pipeline
Otherwise → MinerU-HTML (best for HTML pages like WeChat articles)
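The heuristic above could be expressed roughly as follows. This is a sketch rather than the script's own code: the function name is hypothetical, and matching case-insensitively on the URL path (ignoring any query string) is an assumption.

```python
from urllib.parse import urlparse

# Suffixes listed in the heuristic above.
PIPELINE_SUFFIXES = (".pdf", ".doc", ".ppt", ".png", ".jpg")


def guess_model(url: str) -> str:
    """Return "pipeline" for document/image URLs, otherwise "MinerU-HTML"."""
    # Assumption: compare against the path only, case-insensitively.
    path = urlparse(url).path.lower()
    return "pipeline" if path.endswith(PIPELINE_SUFFIXES) else "MinerU-HTML"


print(guess_model("https://example.com/report.pdf"))
print(guess_model("https://example.com/articles/some-post"))
```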

1) Configure token (skill-local)

Put secrets in a .env file in the skill root (do not paste them into chat output):

# In the mineru-extract skill directory: .env
MINERU_TOKEN=your_token_here
MINERU_API_BASE=https://mineru.net
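For illustration, a .env file of this shape can be read with a few lines of Python. This is a simplified sketch, not necessarily how the skill's scripts load it; both function names are hypothetical, and the parser handles only plain KEY=VALUE lines.

```python
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Read KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_token(values: dict[str, str]) -> str:
    """Return MINERU_TOKEN, refusing an empty or placeholder value."""
    token = values.get("MINERU_TOKEN", "")
    if not token or token == "your_token_here":
        raise RuntimeError("MINERU_TOKEN is not set in .env")
    return token
```

Neither function prints the token, in keeping with the advice to keep secrets out of chat output.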

2) Parse URL(s) → Markdown (recommended)

MCP-style wrapper (returns JSON, optionally includes markdown text):

python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL1>,<URL2>" \
  --language ch \
  --enable-ocr \
  --model-version MinerU-HTML

If you want the markdown content inline in the JSON (can be large):

python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000
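When calling the wrapper from Python, the command can be assembled as an argument list so that no shell quoting is needed. The sketch below uses only the flags shown in the commands above; the helper name is hypothetical, and joining URLs with commas relies on the documented comma-separated input, so it would not suit URLs that themselves contain commas.

```python
import shlex
from typing import Optional


def build_parse_command(
    urls: list[str],
    model_version: str = "MinerU-HTML",
    emit_markdown: bool = False,
    max_chars: Optional[int] = None,
) -> list[str]:
    """Build an argv list for mineru_parse_documents.py."""
    argv = [
        "python3",
        "mineru-extract/scripts/mineru_parse_documents.py",
        "--file-sources", ",".join(urls),
        "--model-version", model_version,
    ]
    if emit_markdown:
        argv.append("--emit-markdown")
    if max_chars is not None:
        argv += ["--max-chars", str(max_chars)]
    return argv


command = build_parse_command(
    ["https://example.com/post"], emit_markdown=True, max_chars=20000
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command, capture_output=True, text=True)`.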

Low-level (single URL, print markdown to stdout):

python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md

Output

The script always downloads and extracts the MinerU result zip to:

~/.openclaw/workspace/mineru/<task_id>/

It writes:

result.zip
extracted files (Markdown + JSON + assets)

It prints a JSON summary to stderr with paths:

task_id, full_zip_url, out_dir, markdown_path
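A caller that captures stderr might recover this summary as sketched below. The exact stderr layout is not specified here, so the code makes an assumption: the summary is a JSON object that may be surrounded by other log text, and it is identified by its markdown_path key. The function name is hypothetical.

```python
import json
from typing import Optional


def find_summary(stderr_text: str) -> Optional[dict]:
    """Return the last JSON object in stderr_text that has a markdown_path key."""
    decoder = json.JSONDecoder()
    found = None
    pos = stderr_text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(stderr_text, pos)
        except json.JSONDecodeError:
            pos = stderr_text.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and "markdown_path" in obj:
            found = obj
        pos = stderr_text.find("{", end)
    return found
```

The returned dict then gives access to task_id, full_zip_url, out_dir, and markdown_path.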

Parameters (common)

--model: pipeline | vlm | MinerU-HTML (HTML requires MinerU-HTML)
--ocr/--no-ocr: enable OCR (effective for pipeline/vlm)
--table/--no-table: table recognition
--formula/--no-formula: formula recognition
--language ch|en|...
--page-ranges "2,4-6" (non-HTML)
--timeout 600 / --poll-interval 2
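To make the --page-ranges syntax concrete, the sketch below expands a value such as "2,4-6" into individual page numbers. It assumes the natural reading (page 2, then pages 4 through 6 inclusive); how MinerU interprets the value internally is not shown here, and the function name is hypothetical.

```python
def expand_page_ranges(spec: str) -> list[int]:
    """Expand a spec like "2,4-6" into an explicit list of page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            pages.append(int(part))
    return pages


print(expand_page_ranges("2,4-6"))  # [2, 4, 5, 6]
```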

Failure modes & fallbacks

MinerU may fail to fetch some URLs (anti-bot protection, geo-restrictions, or login walls).
- Fallback: provide an HTML file or a PDF/long screenshot, then implement an “upload + parse” flow using the MinerU batch upload endpoints.
- Always report the failing URL together with MinerU's err_msg, and keep a link to the original source in the output.
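The reporting rule above might be applied as in the sketch below. The structure of a failure record is an assumption: each record is taken to be a dict with a hypothetical "url" key plus the err_msg value returned by MinerU.

```python
def format_failures(failures: list[dict[str, str]]) -> str:
    """Render one report entry per failed URL, keeping the source link."""
    lines = []
    for failure in failures:
        # "url" is a hypothetical key; err_msg is the field named above.
        url = failure.get("url", "<unknown URL>")
        err_msg = failure.get("err_msg", "<no err_msg returned>")
        lines.append(f"FAILED {url}\n  err_msg: {err_msg}\n  source: {url}")
    return "\n".join(lines)


print(format_failures([{"url": "https://example.com/post", "err_msg": "fetch failed"}]))
```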

References

MinerU API docs: https://mineru.net/apiManage/docs
MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/

Repository: blessonism/open…h-skills

GitHub Stars: 233

First Seen: Feb 11, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)

Installed on: codex (62), openclaw (61), gemini-cli (60), amp (60), github-copilot (60), kimi-cli (60)
