URL转Markdown工具：Chrome CDP抓取网页并转换为干净Markdown

baoyu-url-to-markdown by jimliu/baoyu-skills

12,200 周安装量

11,800 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/jimliu/baoyu-skills --skill baoyu-url-to-markdown

开发生产力

🇨🇳中文介绍

URL 转 Markdown

通过 Chrome CDP 获取任意 URL，保存渲染后的 HTML 快照，并将其转换为干净的 Markdown。

脚本目录

重要提示：所有脚本都位于此技能的 scripts/ 子目录中。

代理执行说明：

确定此 SKILL.md 文件的目录路径为 {baseDir}
脚本路径 = {baseDir}/scripts/<脚本名称>.ts
解析 ${BUN_X} 运行时：如果已安装 bun → 使用 bun；如果 npx 可用 → 使用 npx -y bun；否则建议安装 bun

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

脚本	用途
`scripts/main.ts`	URL 获取的 CLI 入口点
`scripts/html-to-markdown.ts`	Markdown 转换入口点及转换器选择
`scripts/parsers/index.ts`	统一解析器入口：在通用转换器之前分发 URL 特定规则
`scripts/parsers/types.ts`	所有规则文件共享的统一解析器接口
`scripts/parsers/rules/*.ts`	每个 URL 规则一个文件，例如 X 状态和 X 文章
`scripts/defuddle-converter.ts`	基于 Defuddle 的转换
`scripts/legacy-converter.ts`	Defuddle 之前的旧版提取和 Markdown 转换
`scripts/markdown-conversion-shared.ts`	共享的元数据解析和 Markdown 文档辅助函数

偏好设置 (EXTEND.md)

检查 EXTEND.md 是否存在（优先级顺序）：

# macOS, Linux, WSL, Git Bash
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"
test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"
test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"



# PowerShell (Windows)
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }
$xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }
if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }
if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }

┌────────────────────────────────────────────────────────┬───────────────────┐ │ 路径 │ 位置 │ ├────────────────────────────────────────────────────────┼───────────────────┤ │ .baoyu-skills/baoyu-url-to-markdown/EXTEND.md │ 项目目录 │ ├────────────────────────────────────────────────────────┼───────────────────┤ │ $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md │ 用户主目录 │ └────────────────────────────────────────────────────────┴───────────────────┘

┌───────────┬───────────────────────────────────────────────────────────────────────────┐ │ 结果 │ 操作 │ ├───────────┼───────────────────────────────────────────────────────────────────────────┤ │ 找到 │ 读取、解析、应用设置 │ ├───────────┼───────────────────────────────────────────────────────────────────────────┤ │ 未找到 │ 必须运行首次设置（见下文）——切勿静默创建默认值 │ └───────────┴───────────────────────────────────────────────────────────────────────────┘

EXTEND.md 支持：默认下载媒体 | 默认输出目录 | 默认捕获模式 | 超时设置

首次设置（阻塞式）

关键：当未找到 EXTEND.md 时，在创建 EXTEND.md 之前，你必须使用 AskUserQuestion 来询问用户的偏好设置。切勿在不询问的情况下使用默认值创建 EXTEND.md。这是一个阻塞式操作——在设置完成之前，请勿进行任何转换。

使用 AskUserQuestion 一次性提出所有问题：

问题 1 — 标题："媒体"，问题："如何处理页面中的图片和视频？"

"每次询问（推荐）" — 保存 Markdown 后，询问是否下载媒体
"总是下载" — 总是将媒体下载到本地的 imgs/ 和 videos/ 目录
"从不下载" — 在 Markdown 中保留原始的远程 URL

问题 2 — 标题："输出"，问题："默认输出目录？"

"url-to-markdown（推荐）" — 保存到 ./url-to-markdown/{domain}/{slug}.md
（用户可以选择"其他"来输入自定义路径）

问题 3 — 标题："保存"，问题："将偏好设置保存在哪里？"

"用户（推荐）" — ~/.baoyu-skills/（所有项目）
"项目" — .baoyu-skills/（仅限此项目）

用户回答后，在所选位置创建 EXTEND.md，确认"偏好设置已保存到 [路径]"，然后继续。

键	默认值	值	描述
`download_media`	`ask`	`ask` / `1` / `0`	`ask` = 每次提示，`1` = 总是下载，`0` = 从不
`default_output_dir`	空	路径或空	默认输出目录（空 = `./url-to-markdown/`）

EXTEND.md → CLI 映射：

EXTEND.md 键	CLI 参数	备注
`download_media: 1`	`--download-media`
`default_output_dir: ./posts/`	`--output-dir ./posts/`	目录路径。不要传递给 `-o`（它期望的是文件路径）

CLI 参数 (--download-media, -o, --output-dir)
EXTEND.md
技能默认值

使用 Chrome CDP 进行完整的 JavaScript 渲染
针对需要自定义 HTML 规则的网站，提供 URL 特定的解析器层，先于通用提取
两种捕获模式：自动或等待用户
将渲染的 HTML 保存为同级的 -captured.html 文件
带有元数据的干净 Markdown 输出
升级为 Defuddle 优先的 Markdown 转换，并自动回退到 git 历史记录中的 Defuddle 之前的提取器
X/Twitter 页面可以对推文和文章使用 HTML 特定解析，从而改进在 x.com / twitter.com 上的标题/正文/媒体提取
archive.ph / 相关的存档镜像可以从 input[name=q] 恢复原始 URL，并优先使用 #CONTENT，然后再回退到页面正文
在转换前具体化 Shadow DOM 内容，以便 Web 组件页面能更好地序列化
当 YouTube 公开字幕轨道时，YouTube 页面可以在 Markdown 中包含转录/字幕文本
如果本地浏览器捕获完全失败，可以回退到 defuddle.md/<url> 并仍然保存 Markdown
通过等待模式处理需要登录的页面
将图片和视频下载到本地目录

# 自动模式（默认）- 页面加载时捕获
${BUN_X} {baseDir}/scripts/main.ts <url>

# 等待模式 - 等待用户信号后再捕获
${BUN_X} {baseDir}/scripts/main.ts <url> --wait

# 保存到特定文件
${BUN_X} {baseDir}/scripts/main.ts <url> -o output.md

# 保存到自定义输出目录（自动生成文件名）
${BUN_X} {baseDir}/scripts/main.ts <url> --output-dir ./posts/

# 将图片和视频下载到本地目录
${BUN_X} {baseDir}/scripts/main.ts <url> --download-media

选项	描述
`<url>`	要获取的 URL
`-o <路径>`	输出文件路径 — 必须是文件路径，而非目录（默认：自动生成）
`--output-dir <目录>`	基础输出目录 — 自动生成 `{dir}/{domain}/{slug}.md`（默认：`./url-to-markdown/`）
`--wait`	捕获前等待用户信号
`--timeout <毫秒>`	页面加载超时时间（默认：30000）
`--download-media`	将图片/视频资源下载到本地的 `imgs/` 和 `videos/`，并将 Markdown 链接重写为本地相对路径

模式	行为	使用场景
自动（默认）	网络空闲时捕获	公开页面，静态内容
等待 (`--wait`)	用户就绪时发出信号	需要登录，懒加载，付费墙

等待模式工作流程：

使用 --wait 运行 → 脚本输出"准备就绪后按 Enter"
请用户确认页面已准备就绪
向 stdin 发送换行符以触发捕获

每次运行会并排保存两个文件：

Markdown：包含 url、title、description、author、published、可选的 coverImage 和 captured_at 的 YAML 前置元数据，后跟转换后的 Markdown 内容
HTML 快照：*-captured.html，包含从 Chrome 捕获的渲染页面 HTML

当 Defuddle 或页面元数据提供语言提示时，Markdown 前置元数据也会包含 language。

HTML 快照在任何 Markdown 媒体本地化之前保存，因此它忠实保留了用于转换的页面 DOM 捕获。如果使用了托管的 defuddle.md API 回退，Markdown 仍会被保存，但该次运行不会生成本地的 -captured.html 快照。

默认：url-to-markdown/<domain>/<slug>.md 使用 --output-dir ./posts/ 时：./posts/<domain>/<slug>.md

HTML 快照路径使用相同的基础名称：

url-to-markdown/<domain>/<slug>-captured.html
./posts/<domain>/<slug>-captured.html
<slug>：来自页面标题或 URL 路径（短横线分隔，2-6 个单词）
冲突解决：附加时间戳 <slug>-YYYYMMDD-HHMMSS.md

当启用 --download-media 时：

图片保存到 Markdown 文件旁边的 imgs/ 目录
视频保存到 Markdown 文件旁边的 videos/ 目录
Markdown 媒体链接被重写为本地相对路径

当站点规则匹配时，首先尝试 URL 特定的解析器层
如果没有专门的解析器匹配，则尝试 Defuddle
对于富页面（如 YouTube），优先使用 Defuddle 的特定提取器输出（包括可用的转录文本），而不是用旧版流程替换它
如果 Defuddle 抛出错误、无法加载、返回明显不完整的 Markdown，或者捕获的内容质量低于旧版流程，则自动回退到 Defuddle 之前的提取器
如果在生成 Markdown 之前整个本地浏览器捕获流程失败，则尝试托管的 https://defuddle.md/<url> API 并直接保存其 Markdown 输出
旧版回退路径使用从 git 历史记录中恢复的、基于旧版 Readability/选择器/Next.js 数据的 HTML 转 Markdown 实现

CLI 输出将显示：

Converter: parser:... 当 URL 特定解析器成功时
Converter: defuddle 当 Defuddle 成功时
Converter: legacy:... 加上 Fallback used: ... 当需要回退时
Converter: defuddle-api 当本地浏览器捕获失败并使用托管 API 时

媒体下载工作流程

基于 EXTEND.md 中的 download_media 设置：

设置	行为
`1`（总是）	使用 `--download-media` 标志运行脚本
`0`（从不）	不使用 `--download-media` 标志运行脚本
`ask`（默认）	遵循下面的每次询问流程

不使用 --download-media 运行脚本 → Markdown 已保存
检查已保存的 Markdown 中是否有远程媒体 URL（图片/视频链接中的 https://）
如果未找到远程媒体 → 完成，无需提示
如果找到远程媒体 → 使用 AskUserQuestion：
- 标题："媒体"，问题："将 N 个图片/视频下载到本地文件？"
- "是" — 下载到本地目录
- "否" — 保留远程 URL
如果用户确认 → 再次使用 --download-media 运行脚本（用本地化链接覆盖 Markdown）

变量	描述
`URL_CHROME_PATH`	自定义 Chrome 可执行文件路径
`URL_DATA_DIR`	自定义数据目录
`URL_CHROME_PROFILE_DIR`	自定义 Chrome 配置文件目录

故障排除：未找到 Chrome → 设置 URL_CHROME_PATH。超时 → 增加 --timeout。复杂页面 → 尝试 --wait 模式。如果 Markdown 质量差，请检查保存的 -captured.html 并查看运行是否记录了旧版回退。

升级后的 Defuddle 路径使用异步提取器，因此 YouTube 页面可以直接在 Markdown 正文中包含转录文本。
转录的可用性取决于 YouTube 是否公开字幕轨道。字幕被禁用、播放受限或区域访问被阻止的视频可能仍只生成描述性输出。
如果页面需要时间来完成加载描述、章节或播放器元数据，请优先使用 --wait，并在观看页面完全水合后捕获。

托管的回退端点是 https://defuddle.md/<url>。Shell 形式：curl https://defuddle.md/stephango.com
仅当本地 Chrome/CDP 捕获路径完全失败时才使用它。本地路径仍然具有更高的保真度，因为它可以保存捕获的 HTML 并处理经过身份验证的页面。
托管 API 已经返回带有 YAML 前置元数据的 Markdown，因此按原样保存该响应，然后根据需要应用正常的媒体本地化步骤。

通过 EXTEND.md 进行自定义配置。有关路径和支持的选项，请参阅偏好设置部分。

2026 年 1 月 22 日

🇺🇸English

URL to Markdown

Fetches any URL via Chrome CDP, saves the rendered HTML snapshot, and converts it to clean markdown.

Script Directory

Important : All scripts are located in the scripts/ subdirectory of this skill.

Agent Execution Instructions :

Determine this SKILL.md file's directory path as {baseDir}
Script path = {baseDir}/scripts/<script-name>.ts
Resolve ${BUN_X} runtime: if bun installed → bun; if npx available → npx -y bun; else suggest installing bun
Replace all {baseDir} and ${BUN_X} in this document with actual values

Script Reference :

Script	Purpose
`scripts/main.ts`	CLI entry point for URL fetching
`scripts/html-to-markdown.ts`	Markdown conversion entry point and converter selection
`scripts/parsers/index.ts`	Unified parser entry: dispatches URL-specific rules before generic converters
`scripts/parsers/types.ts`	Unified parser interface shared by all rule files
`scripts/parsers/rules/*.ts`	One file per URL rule, for example X status and X article
`scripts/defuddle-converter.ts`

Preferences (EXTEND.md)

Check EXTEND.md existence (priority order):

# macOS, Linux, WSL, Git Bash
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"
test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"
test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"



# PowerShell (Windows)
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }
$xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }
if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }
if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }

┌────────────────────────────────────────────────────────┬───────────────────┐ │ Path │ Location │ ├────────────────────────────────────────────────────────┼───────────────────┤ │ .baoyu-skills/baoyu-url-to-markdown/EXTEND.md │ Project directory │ ├────────────────────────────────────────────────────────┼───────────────────┤ │ $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md │ User home │ └────────────────────────────────────────────────────────┴───────────────────┘

┌───────────┬───────────────────────────────────────────────────────────────────────────┐ │ Result │ Action │ ├───────────┼───────────────────────────────────────────────────────────────────────────┤ │ Found │ Read, parse, apply settings │ ├───────────┼───────────────────────────────────────────────────────────────────────────┤ │ Not found │ MUST run first-time setup (see below) — do NOT silently create defaults │ └───────────┴───────────────────────────────────────────────────────────────────────────┘

EXTEND.md Supports : Download media by default | Default output directory | Default capture mode | Timeout settings

First-Time Setup (BLOCKING)

CRITICAL : When EXTEND.md is not found, you MUST useAskUserQuestion to ask the user for their preferences before creating EXTEND.md. NEVER create EXTEND.md with defaults without asking. This is a BLOCKING operation — do NOT proceed with any conversion until setup is complete.

Use AskUserQuestion with ALL questions in ONE call:

Question 1 — header: "Media", question: "How to handle images and videos in pages?"

"Ask each time (Recommended)" — After saving markdown, ask whether to download media
"Always download" — Always download media to local imgs/ and videos/ directories
"Never download" — Keep original remote URLs in markdown

Question 2 — header: "Output", question: "Default output directory?"

"url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md
(User may choose "Other" to type a custom path)

Question 3 — header: "Save", question: "Where to save preferences?"

"User (Recommended)" — ~/.baoyu-skills/ (all projects)
"Project" — .baoyu-skills/ (this project only)

After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.

Full reference: references/config/first-time-setup.md

Supported Keys

Key	Default	Values	Description
`download_media`	`ask`	`ask` / `1` / `0`	`ask` = prompt each time, `1` = always download, `0` = never
`default_output_dir`

EXTEND.md → CLI mapping :

EXTEND.md key	CLI argument	Notes
`download_media: 1`	`--download-media`
`default_output_dir: ./posts/`	`--output-dir ./posts/`	Directory path. Do NOT pass to `-o` (which expects a file path)

Value priority :

CLI arguments (--download-media, -o, --output-dir)
EXTEND.md
Skill defaults

Features

Chrome CDP for full JavaScript rendering
URL-specific parser layer for sites that need custom HTML rules before generic extraction
Two capture modes: auto or wait-for-user
Save rendered HTML as a sibling -captured.html file
Clean markdown output with metadata
Upgraded Defuddle-first markdown conversion with automatic fallback to the pre-Defuddle extractor from git history
X/Twitter pages can use HTML-specific parsing for Tweets and Articles, which improves title/body/media extraction on x.com / twitter.com
archive.ph / related archive mirrors can restore the original URL from input[name=q] and prefer #CONTENT before falling back to the page body
Materializes shadow DOM content before conversion so web-component pages survive serialization better
YouTube pages can include transcript/caption text in the markdown when YouTube exposes a caption track
If local browser capture fails completely, can fall back to defuddle.md/<url> and still save markdown

Usage

# Auto mode (default) - capture when page loads
${BUN_X} {baseDir}/scripts/main.ts <url>

# Wait mode - wait for user signal before capture
${BUN_X} {baseDir}/scripts/main.ts <url> --wait

# Save to specific file
${BUN_X} {baseDir}/scripts/main.ts <url> -o output.md

# Save to a custom output directory (auto-generates filename)
${BUN_X} {baseDir}/scripts/main.ts <url> --output-dir ./posts/

# Download images and videos to local directories
${BUN_X} {baseDir}/scripts/main.ts <url> --download-media

Options

Option	Description
`<url>`	URL to fetch
`-o <path>`	Output file path — must be a file path, not directory (default: auto-generated)
`--output-dir <dir>`	Base output directory — auto-generates `{dir}/{domain}/{slug}.md` (default: `./url-to-markdown/`)
`--wait`	Wait for user signal before capturing
`--timeout <ms>`

Capture Modes

Mode	Behavior	Use When
Auto (default)	Capture on network idle	Public pages, static content
Wait (`--wait`)	User signals when ready	Login-required, lazy loading, paywalls

Wait mode workflow :

Run with --wait → script outputs "Press Enter when ready"
Ask user to confirm page is ready
Send newline to stdin to trigger capture

Output Format

Each run saves two files side by side:

Markdown: YAML front matter with url, title, description, author, published, optional coverImage, and captured_at, followed by converted markdown content
HTML snapshot: *-captured.html, containing the rendered page HTML captured from Chrome

When Defuddle or page metadata provides a language hint, the markdown front matter also includes language.

The HTML snapshot is saved before any markdown media localization, so it stays a faithful capture of the page DOM used for conversion. If the hosted defuddle.md API fallback is used, markdown is still saved, but there is no local -captured.html snapshot for that run.

Output Directory

Default: url-to-markdown/<domain>/<slug>.md With --output-dir ./posts/: ./posts/<domain>/<slug>.md

HTML snapshot path uses the same basename:

url-to-markdown/<domain>/<slug>-captured.html
./posts/<domain>/<slug>-captured.html
<slug>: From page title or URL path (kebab-case, 2-6 words)
Conflict resolution: Append timestamp <slug>-YYYYMMDD-HHMMSS.md

When --download-media is enabled:

Images are saved to imgs/ next to the markdown file
Videos are saved to videos/ next to the markdown file
Markdown media links are rewritten to local relative paths

Conversion Fallback

Conversion order:

Try the URL-specific parser layer first when a site rule matches
If no specialized parser matches, try Defuddle
For rich pages such as YouTube, prefer Defuddle's extractor-specific output (including transcripts when available) instead of replacing it with the legacy pipeline
If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
If the entire local browser capture flow fails before markdown can be produced, try the hosted https://defuddle.md/<url> API and save its markdown output directly
The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history

CLI output will show:

Converter: parser:... when a URL-specific parser succeeded
Converter: defuddle when Defuddle succeeds
Converter: legacy:... plus Fallback used: ... when fallback was needed
Converter: defuddle-api when local browser capture failed and the hosted API was used instead

Media Download Workflow

Based on download_media setting in EXTEND.md:

Setting	Behavior
`1` (always)	Run script with `--download-media` flag
`0` (never)	Run script without `--download-media` flag
`ask` (default)	Follow the ask-each-time flow below

Ask-Each-Time Flow

Run script without --download-media → markdown saved
Check saved markdown for remote media URLs (https:// in image/video links)
If no remote media found → done, no prompt needed
If remote media found → use AskUserQuestion:
- header: "Media", question: "Download N images/videos to local files?"
- "Yes" — Download to local directories
- "No" — Keep remote URLs
If user confirms → run script again with --download-media (overwrites markdown with localized links)

Environment Variables

Variable	Description
`URL_CHROME_PATH`	Custom Chrome executable path
`URL_DATA_DIR`	Custom data directory
`URL_CHROME_PROFILE_DIR`	Custom Chrome profile directory

Troubleshooting : Chrome not found → set URL_CHROME_PATH. Timeout → increase --timeout. Complex pages → try --wait mode. If markdown quality is poor, inspect the saved -captured.html and check whether the run logged a legacy fallback.

YouTube Notes

The upgraded Defuddle path uses async extractors, so YouTube pages can include transcript text directly in the markdown body.
Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled, restricted playback, or blocked regional access may still produce description-only output.
If the page needs time to finish loading descriptions, chapters, or player metadata, prefer --wait and capture after the watch page is fully hydrated.

Hosted API Fallback

The hosted fallback endpoint is https://defuddle.md/<url>. In shell form: curl https://defuddle.md/stephango.com
Use it only when the local Chrome/CDP capture path fails outright. The local path still has higher fidelity because it can save the captured HTML and handle authenticated pages.
The hosted API already returns Markdown with YAML frontmatter, so save that response as-is and then apply the normal media-localization step if requested.

Extension Support

Custom configurations via EXTEND.md. See Preferences section for paths and supported options.

Weekly Installs

11.7K

Repository

jimliu/baoyu-skills

GitHub Stars

11.3K

First Seen

Jan 22, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykWarn

Installed on

opencode10.7K

gemini-cli10.4K

codex10.2K

cursor9.9K

github-copilot9.5K

amp9.2K

React 组合模式指南：Vercel 组件架构最佳实践，提升代码可维护性

102,200 周安装

Handles login-required pages via wait mode

Download images and videos to local directories

URL转Markdown工具：Chrome CDP抓取网页并转换为干净Markdown

🇨🇳中文介绍

URL 转 Markdown

脚本目录

相关 Skills

偏好设置 (EXTEND.md)

首次设置（阻塞式）

支持的键

功能特性

用法

选项

捕获模式

输出格式

输出目录

转换回退