Python Web Scraper: A Lightweight Crawler with HTML/Markdown Output and Image Download

web-scraper by agentbay-ai/agentbay-skills

136 weekly installs

31 GitHub Stars

Install command

npx skills add https://github.com/agentbay-ai/agentbay-skills --skill web-scraper


Web Scraper

Fetch web page content (text + images) and save as HTML or Markdown locally.

Minimal dependencies: Only requires requests and beautifulsoup4; no browser automation.

Default behavior: Downloads images to a local images/ directory automatically.

Quick start

Single page

{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md

Recursive (follow links)

{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive

Setup

Requires Python 3.8+ and minimal dependencies:

cd {baseDir}
pip install -r requirements.txt

Or install manually:

pip install requests beautifulsoup4

Note: No browser or driver is needed; the tool uses plain HTTP requests.
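Before running the script, it can help to confirm that both dependencies are importable. The following is a minimal sketch using only the standard library; `missing_dependencies` is a hypothetical helper, not part of the skill. Note that the beautifulsoup4 package is imported under the name `bs4`.

```python
import importlib.util
from typing import Iterable, List


def missing_dependencies(modules: Iterable[str] = ("requests", "bs4")) -> List[str]:
    """Return the import names that cannot be found in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

If anything is reported missing, the pip commands above should install it.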

Inputs to collect

Single page mode

URL: The web page to scrape (required)
Format: html or md (default: html)
Output path: Where to save the file (default: current directory with an auto-generated name)
Images: Downloaded by default (use --no-download-images to disable)

Recursive mode (--recursive)

URL: Starting point for recursive scraping
Format: html or md
Output directory: Where to save all scraped pages
Max depth: How many levels deep to follow links (default: 2)
Max pages: Maximum total pages to scrape (default: 50)
Domain filter: Whether to stay within the same domain (default: yes)
Images: Downloaded by default

Conversation Flow

Ask the user for the URL to scrape
Ask for the preferred output format (HTML or Markdown)
- Note: Both formats include text and images by default
- HTML: Preserves the original structure, with downloaded images
- Markdown: Clean text format, with downloaded images in the images/ folder
For recursive mode: Ask for max depth and max pages (optional; both have sensible defaults)
Ask where to save (or suggest a default path such as /tmp/ or ~/Downloads/)
Run the script and confirm success
Show the saved file or directory path
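The inputs gathered in this flow map directly onto the command-line flags used throughout this document. The sketch below shows one way an agent might assemble the argument list. `build_scrape_command` and its parameter names are hypothetical; only flags that appear in this document are emitted.

```python
from typing import List, Optional


def build_scrape_command(
    script: str,
    url: str,
    fmt: str = "html",
    output: Optional[str] = None,
    recursive: bool = False,
    max_depth: Optional[int] = None,
    max_pages: Optional[int] = None,
    download_images: bool = True,
) -> List[str]:
    """Assemble an argv list for scrape.py from the collected inputs (hypothetical helper)."""
    if fmt not in ("html", "md"):
        raise ValueError("format must be 'html' or 'md'")
    cmd = [script, "--url", url, "--format", fmt]
    if recursive:
        cmd.append("--recursive")
        # Pass limits only when the user chose them; otherwise the script's defaults apply.
        if max_depth is not None:
            cmd += ["--max-depth", str(max_depth)]
        if max_pages is not None:
            cmd += ["--max-pages", str(max_pages)]
    if not download_images:
        cmd.append("--no-download-images")
    if output:
        cmd += ["--output", output]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with URLs that contain `&` or `?`.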

Examples

Single Page Scraping

Save as HTML

{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html

Save as Markdown (with images, default)

{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md

Result: Creates web-scraping.md plus an images/ folder containing all downloaded images.

Without downloading images (optional)

{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images

Result: The output contains text plus the original image URLs; no images are downloaded.

Auto-generate filename

{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
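The naming pattern above suggests that the host name is turned into a slug and a timestamp is appended. The following sketch shows one way such a name could be derived. The slug rule and the timestamp layout are assumptions, since the document only shows `{timestamp}`, and `auto_filename` is a hypothetical helper rather than the script's own function.

```python
import re
import time
from urllib.parse import urlparse


def auto_filename(url: str, fmt: str = "html", timestamp: str = "") -> str:
    """Derive a filesystem-safe name from a URL (assumed scheme, not the script's own)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace every run of characters other than ASCII letters and digits with one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower() or "page"
    # The timestamp layout is an assumption; pass one explicitly for reproducible names.
    stamp = timestamp or time.strftime("%Y%m%d-%H%M%S")
    return f"{slug}-{stamp}.{fmt}"
```

Restricting the slug to letters, digits, and hyphens keeps the name valid on common filesystems, at the cost of possible collisions between similar URLs.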

Recursive Scraping

Basic recursive crawl (depth 2, same domain, with images)

{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive

Output structure (text + images for all pages):

docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
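The tree above mirrors the URL paths of the crawled pages. The sketch below shows one plausible mapping from a page URL to a relative file path. The rule is an assumption inferred from the example tree (site root becomes `index`, path segments become directories), and `output_path_for` is a hypothetical name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path_for(url: str, fmt: str = "md") -> str:
    """Map a page URL to a relative path under the output directory (assumed rule)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        # The site root becomes index.<fmt>.
        return f"index.{fmt}"
    # Drop any existing extension on the last segment, then add the output one.
    stem = PurePosixPath(parts[-1]).stem
    return str(PurePosixPath(*parts[:-1], f"{stem}.{fmt}"))
```

For example, `https://docs.example.com/api/authentication.html` would map to `api/authentication.md` under this rule. Query strings are ignored here, so two URLs that differ only in their query would map to the same file.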

Deep crawl with custom limits

{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup

Ignore robots.txt (use with caution)

{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0

Faster scraping (shorter delay between requests)

{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2
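The `--rate-limit` value is described elsewhere in this document as a delay between requests, so a smaller value means a faster crawl. A minimal sketch of that behavior follows. `fetch_sequentially` is a hypothetical helper, and the `fetch` and `sleep` callables are injected so the pacing logic can be exercised without network access.

```python
import time
from typing import Callable, Iterable, List


def fetch_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    rate_limit: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch URLs one at a time, pausing rate_limit seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Delay before every request except the first.
            sleep(rate_limit)
        results.append(fetch(url))
    return results
```

With a delay of 0.2 seconds, 50 pages spend roughly 10 seconds waiting, compared with about 25 seconds at the default of 0.5.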

Features

Single Page Mode

HTML output: Preserves the original page structure
- ✅ Clean, readable HTML document
- ✅ All images downloaded to the images/ folder
- ✅ Suitable for offline viewing
Markdown output: Extracts clean text content
- ✅ Auto-downloads images to the local images/ directory (default)
- ✅ Converts image URLs to relative paths
- ✅ Clean, readable format for archiving
- ✅ Falls back to the original URLs if a download fails
- Use the --no-download-images flag to keep the original URLs only
Simple and fast: Plain HTTP requests, no browser needed
Auto filename: Generates a safe filename from the URL if none is specified
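The image handling described above (download, rewrite to a relative path, fall back to the original URL on failure) can be sketched as follows. This is an illustration under assumptions: `localize_images` is a hypothetical helper, the `download` callable stands in for whatever HTTP code the script uses, and file-name collisions are not handled.

```python
import posixpath
from typing import Callable, Dict, Iterable
from urllib.parse import urlparse


def localize_images(
    image_urls: Iterable[str],
    download: Callable[[str, str], None],
    images_dir: str = "images",
) -> Dict[str, str]:
    """Map each image URL to the reference the saved page should use."""
    mapping = {}
    for url in image_urls:
        name = posixpath.basename(urlparse(url).path) or "image"
        local = f"{images_dir}/{name}"
        try:
            download(url, local)  # caller-supplied; expected to raise on failure
            mapping[url] = local  # relative path to the local copy
        except Exception:
            mapping[url] = url  # fall back to the original URL
    return mapping
```

The returned mapping can then be used to rewrite `src` attributes or Markdown image links, so a page with a failed download still shows the image whenever the machine is online.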

Recursive Mode (`--recursive`)

✅ Intelligent link discovery: Automatically follows links found on crawled pages, subject to the limits below
✅ Depth control: --max-depth limits how many levels deep to crawl (default: 2)
✅ Page limit: --max-pages caps total pages to prevent runaway crawls (default: 50)
✅ Domain filtering: --same-domain keeps the crawl within the starting domain (default: on)
✅ robots.txt compliance: Respects the site's crawling rules by default
✅ Rate limiting: --rate-limit adds a delay between requests (default: 0.5s)
✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
✅ Progress tracking: Real-time console output with success/fail/skip counts
✅ Organized output: Preserves the URL structure in the directory hierarchy
✅ Efficient crawling: Sequential, with rate limiting to respect servers
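Taken together, the depth, page, domain, and URL filters above describe a bounded breadth-first crawl. The sketch below illustrates that combination under assumptions: `crawl`, `get_links`, and `SKIPPED_SUFFIXES` are hypothetical names, the start page is treated as depth 0, and the script's real traversal order and filter list may differ.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed list of non-page resources to skip; the script's own list may differ.
SKIPPED_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 50,
    same_domain: bool = True,
) -> List[str]:
    """Breadth-first crawl; returns the pages visited, in order."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in get_links(url):
            link = urldefrag(urljoin(url, href)).url  # absolute URL, fragment removed
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue
            if same_domain and parsed.netloc != start_host:
                continue
            if parsed.path.lower().endswith(SKIPPED_SUFFIXES):
                continue
            if link not in seen:  # skip duplicates
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Here `get_links` is expected to fetch a page and return its `href` values. Keeping it as a parameter lets the traversal be tested against an in-memory link graph.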

Guardrails

Single Page Mode

Respect robots.txt and site terms of service
Some sites may block automated access; this tool uses standard HTTP requests
Large pages with many images may take time to download

Recursive Mode

Start small: Test with --max-depth 1 --max-pages 10 first
Respect robots.txt: On by default; only use --no-respect-robots for your own sites
Rate limiting: The default of 0.5s is polite; don't go below 0.2s for public sites
Same domain: Keeping --same-domain enabled is strongly recommended
Monitor progress: Watch for high fail rates (may indicate blocking)
Storage: Recursive crawls can generate many files; ensure sufficient disk space
Legal: Ensure you have permission to crawl and archive the target site
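To make the robots.txt guardrail concrete, the sketch below checks a URL against rules that have already been downloaded, using the standard library's `urllib.robotparser`. How the script itself performs this check is not stated in this document; `allowed_by_robots` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://example.com/docs/"))
    print(allowed_by_robots(rules, "https://example.com/private/page"))
```

A crawler that respects robots.txt would skip any URL for which this check returns `False`.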

Troubleshooting

Connection errors: Check your internet connection and that the URL is valid
403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
Timeout: Increase the --timeout value (in seconds) for slow-loading pages
Image download fails: Image references fall back to the original URLs
Missing images: Some sites load images dynamically with JavaScript, which is not supported

Repository: agentbay-ai/age…y-skills

First seen: Feb 13, 2026

Security audits: Gen Agent Trust Hub (Pass), Socket (Fail), Snyk (Warn)

Installed on: gemini-cli (17), codex (17), opencode (17), amp (16), github-copilot (16), kimi-cli (16)
