Web Content Fetcher - Smart URL-to-Markdown tool with support for JS-rendered and anti-scraping sites

web-content-fetcher by shirenchuang/web-content-fetcher

1,100 weekly installs

460 GitHub Stars

Install command

npx skills add https://github.com/shirenchuang/web-content-fetcher --skill web-content-fetcher

Development, Data Analysis, Productivity

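🇨🇳 Chinese Introduction

Web Content Fetcher

Given a URL, return its main content as clean Markdown, with headings, links, images, lists, and code blocks all preserved.

Extraction Strategy

Always try one method per URL; don't cascade blindly. Pick the right one upfront.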

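URL
 │
 ├─ 1. Scrapling script (preferred)
 │     Run fetch.py; check the domain routing table to decide fast vs --stealth.
 │     Works for most sites. Returns clean Markdown directly.
 │
 └─ 2. Jina Reader (fallback: only if Scrapling fails or its dependencies are not installed)
       web_fetch("https://r.jina.ai/<url>")
       Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
       Does NOT work for: WeChat (403), some Chinese platforms.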

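Scrapling script

python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth]

<SKILL_DIR> is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:

Default (fast): HTTP fetch, ~1-3s, works for most sites
--stealth: Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without --stealth, the script automatically falls back to stealth mode if the fast result has too little content. So you rarely need to specify it manually; the only reason to force it is when you already know the site needs it (see the routing table), which saves the initial fast attempt.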

Domain Routing

Use this table to pick the right mode on the first call:

Domain	Command	Why
`mp.weixin.qq.com`	`fetch.py <url> --stealth`	JS-rendered content
`zhuanlan.zhihu.com`	`fetch.py <url> --stealth`	Anti-scraping + JS
`juejin.cn`	`fetch.py <url> --stealth`	JS-rendered SPA
`sspai.com`	`fetch.py <url>`	Static HTML
`blog.csdn.net`	`fetch.py <url>`	Static HTML
`ruanyifeng.com`	`fetch.py <url>`	Static blog
`openai.com`	`fetch.py <url>`	Static HTML
`blog.google`	`fetch.py <url>`	Static HTML
All other domains	`fetch.py <url>`	Auto-fallback handles it

🇺🇸English

Web Content Fetcher

Given a URL, return its main content as clean Markdown — headings, links, images, lists, code blocks all preserved.

Extraction Strategy

Always try one method per URL — don't cascade blindly. Pick the right one upfront.

URL
 │
 ├─ 1. Scrapling script (preferred)
 │     Run fetch.py — check the domain routing table to decide fast vs --stealth.
 │     Works for most sites. Returns clean Markdown directly.
 │
 └─ 2. Jina Reader (fallback: only if Scrapling fails or its dependencies are not installed)
       web_fetch("https://r.jina.ai/<url>")
       Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
       Does NOT work for: WeChat (403), some Chinese platforms.
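
The decision in the diagram above can be sketched in Python. This is only an illustration: `jina_reader_url` and `choose_method` are hypothetical helper names, not part of fetch.py, and the only detail taken from the skill is the `https://r.jina.ai/<url>` prefix form.

```python
JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(url: str) -> str:
    """Build the Jina Reader address by prefixing the target URL."""
    url = url.strip()
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    return JINA_READER_PREFIX + url


def choose_method(scrapling_available: bool, scrapling_failed: bool = False) -> str:
    """Pick one method: Scrapling first, Jina Reader only as the fallback."""
    if scrapling_available and not scrapling_failed:
        return "scrapling"
    return "jina"
```

For example, `jina_reader_url("https://example.com/a")` produces the address that would be passed to `web_fetch`.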

Scrapling script

python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth]

<SKILL_DIR> is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:

Default (fast): HTTP fetch, ~1-3s, works for most sites
--stealth: Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without --stealth, the script automatically falls back to stealth if the fast result has too little content. So you rarely need to specify --stealth manually — the only reason to force it is when you already know the site needs it (see routing table), which saves the initial fast attempt.
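
A minimal sketch of that fallback behavior, under stated assumptions: the two fetchers are passed in as plain callables, and the 200-character threshold is invented for illustration, since the script's real cutoff for "too little content" is not given here.

```python
from typing import Callable

MIN_CHARS = 200  # hypothetical threshold, not the script's actual value


def fetch_with_fallback(
    url: str,
    fast_fetch: Callable[[str], str],
    stealth_fetch: Callable[[str], str],
    force_stealth: bool = False,
    min_chars: int = MIN_CHARS,
) -> tuple[str, str]:
    """Return (mode, markdown); try fast mode first unless stealth is forced."""
    if force_stealth:
        # Known JS-heavy site: skip the fast attempt entirely.
        return "stealth", stealth_fetch(url)
    text = fast_fetch(url)
    if len(text.strip()) >= min_chars:
        return "fast", text
    # Fast mode returned too little content, so retry with the headless browser.
    return "stealth", stealth_fetch(url)
```

Forcing stealth costs nothing extra on sites that need it, while on other sites the fast attempt usually succeeds and the slower browser is never started.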

Domain Routing

Use this table to pick the right mode on the first call:

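Domain	Command	Why
`mp.weixin.qq.com`	`fetch.py <url> --stealth`	JS-rendered content
`zhuanlan.zhihu.com`	`fetch.py <url> --stealth`	Anti-scraping + JS
`juejin.cn`	`fetch.py <url> --stealth`	JS-rendered SPA
`sspai.com`	`fetch.py <url>`	Static HTML
`blog.csdn.net`	`fetch.py <url>`	Static HTML
`ruanyifeng.com`	`fetch.py <url>`	Static blog
`openai.com`	`fetch.py <url>`	Static HTML
`blog.google`	`fetch.py <url>`	Static HTML
All other domains	`fetch.py <url>`	Auto-fallback handles it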
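
One way to apply the routing table before the first call is a small lookup on the URL's hostname. The sketch below is an assumption about how a caller might do this, not code from fetch.py; `needs_stealth` is a hypothetical name, and matching subdomains is a choice made for this example.

```python
from urllib.parse import urlparse

# Domains the routing table marks as needing --stealth on the first call.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True when the URL's host is a listed stealth domain."""
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself or any subdomain of it (an assumption of this sketch).
    return any(host == d or host.endswith("." + d) for d in STEALTH_DOMAINS)
```

Any host not in the set runs in the default mode and relies on the script's automatic fallback.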

Script Options

# Basic — auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
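
When the script is driven from Python rather than a shell, the same options can be assembled as an argument list. This is a hedged sketch: `build_fetch_command` and `run_fetch_json` are hypothetical helpers, combining `max_chars` with `--json` in one call is an assumption, and `run_fetch_json` assumes the script prints a single JSON document to stdout.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def build_fetch_command(
    skill_dir: str,
    url: str,
    max_chars: Optional[int] = None,
    stealth: bool = False,
    as_json: bool = False,
) -> list[str]:
    """Assemble the fetch.py argument list from the documented options."""
    cmd = ["python3", str(Path(skill_dir) / "scripts" / "fetch.py"), url]
    if max_chars is not None:
        cmd.append(str(max_chars))
    if stealth:
        cmd.append("--stealth")
    if as_json:
        cmd.append("--json")
    return cmd


def run_fetch_json(skill_dir: str, url: str) -> dict:
    """Run fetch.py with --json and parse its stdout (assumed to be JSON)."""
    cmd = build_fetch_command(skill_dir, url, as_json=True)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing a list to `subprocess.run` avoids shell quoting problems with URLs that contain `&` or `?`.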

Install Dependencies

First use only — the script checks and tells you if anything is missing:

pip install scrapling html2text

If pip refuses to install into a system-managed Python (macOS/Linux), add --break-system-packages or use a venv.
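
A caller can also check for the two packages before invoking the script. The sketch below is not how fetch.py itself performs its check; it assumes the import names match the pip package names (`scrapling`, `html2text`), and `missing_dependencies` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the two packages installed above.
REQUIRED = ("scrapling", "html2text")


def missing_dependencies(modules=REQUIRED) -> list[str]:
    """Return the top-level modules that cannot be found on this interpreter."""
    return [name for name in modules if find_spec(name) is None]
```

If the returned list is non-empty, the Jina Reader fallback is the documented alternative until the packages are installed.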

Failure Rules

Same URL fails once → give up, tell the user "unable to extract content from this URL"
Do not retry — each failed call wastes context tokens
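
The no-retry rule can be enforced with a small record of URLs that have already failed. This is an illustrative sketch only; `FetchGate` is a hypothetical name, and the message string is the one quoted in the rule above.

```python
FAILURE_MESSAGE = "unable to extract content from this URL"


class FetchGate:
    """Remember failed URLs so each one is attempted at most once."""

    def __init__(self) -> None:
        self._failed: set[str] = set()

    def should_attempt(self, url: str) -> bool:
        """Return False for any URL that has already failed."""
        return url not in self._failed

    def record_failure(self, url: str) -> str:
        """Mark the URL as failed and return the message for the user."""
        self._failed.add(url)
        return FAILURE_MESSAGE
```

Checking `should_attempt` before each call keeps a failed URL from being fetched again and spending more context tokens.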

Weekly Installs: 1.1K
Repository: shirenchuang/we…-fetcher
GitHub Stars: 460
First Seen: Mar 9, 2026
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Warn
Installed on: opencode (1.0K), codex (1.0K), gemini-cli (1.0K), kimi-cli (1.0K), github-copilot (1.0K), cursor (1.0K)
