Tavily Crawl: a website crawler that fetches pages in bulk and saves them as Markdown files

tavily-crawl by tavily-ai/skills

1,400 weekly installs

138 GitHub Stars

Install command

npx skills add https://github.com/tavily-ai/skills --skill tavily-crawl

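tavily crawl

Crawl a website and extract content from multiple pages. Supports saving each page as a local Markdown file.

Before running any command

If tvly is not found on PATH, install it first:

curl -fsSL https://cli.tavily.com/install.sh | bash && tvly login

Do not skip this step or fall back to other tools.

See tavily-cli for alternative install methods and auth options.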
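
The PATH check described above can be scripted before any crawl. This is a minimal sketch: the helper name have_cli is hypothetical, and it only reports a missing tvly rather than running the installer for you.

```shell
# Return 0 if the named CLI is on PATH, non-zero otherwise.
# have_cli is a hypothetical helper name, not part of tvly.
have_cli() {
  command -v "$1" >/dev/null 2>&1
}

# Warn early instead of failing halfway through a crawl script.
if ! have_cli tvly; then
  echo "tvly not found on PATH; install it before crawling" >&2
fi
```

If the warning appears, run the install command shown above and then tvly login.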

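When to use

You need content from many pages on a site (e.g., all /docs/ pages)
You want to download documentation for offline use
Step 4 in the workflow: search → extract → map → crawl → research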

Quick start

# Basic crawl
tvly crawl "https://docs.example.com" --json

# Save each page as a Markdown file
tvly crawl "https://docs.example.com" --output-dir ./docs/

# Deeper crawl with limits
tvly crawl "https://docs.example.com" --max-depth 2 --limit 50 --json

# Filter to specific paths
tvly crawl "https://example.com" --select-paths "/api/.*,/guides/.*" --exclude-paths "/blog/.*" --json

# Semantic focus (returns relevant chunks, not full pages)
tvly crawl "https://docs.example.com" --instructions "Find authentication docs" --chunks-per-source 3 --json
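
Before spending a crawl on a path filter, it can help to preview locally which paths a select/exclude pair would keep. The sketch below is an approximation using grep -E. The helper name preview_paths is hypothetical, and it assumes patterns are matched from the start of the path, which may differ from how Tavily applies them.

```shell
# Preview which URL paths a select/exclude pattern pair would keep.
# Reads one path per line on stdin; both arguments must be non-empty.
# Assumption: patterns match from the start of the path. Tavily's own
# matching rules may differ, so treat the result as a rough preview.
preview_paths() {
  # Turn the comma-separated lists into regex alternations.
  select_re=$(printf '%s' "$1" | tr ',' '|')
  exclude_re=$(printf '%s' "$2" | tr ',' '|')
  grep -E "^(${select_re})" | grep -Ev "^(${exclude_re})"
}

printf '/api/auth\n/blog/news\n/guides/start\n/about\n' |
  preview_paths '/api/.*,/guides/.*' '/blog/.*'
```

Feeding it a list of known paths from the site shows whether the patterns are too broad or too narrow before the real crawl runs.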

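Options

Option	Description
`--max-depth`	Levels deep (1-5, default: 1)
`--max-breadth`	Links per page (default: 20)
`--limit`	Total pages cap (default: 50)
`--instructions`	Natural language guidance for semantic focus
`--chunks-per-source`	Chunks per page (1-5, requires `--instructions`)
`--extract-depth`	`basic` (default) or `advanced`
`--format`	`markdown` (default) or `text`
`--select-paths`	Comma-separated regex patterns for paths to include
`--exclude-paths`	Comma-separated regex patterns for paths to exclude
`--select-domains`	Comma-separated regex patterns for domains to include
`--exclude-domains`	Comma-separated regex patterns for domains to exclude
`--allow-external / --no-external`	Include external links (default: allow)
`--include-images`	Include images
`--timeout`	Maximum wait time (10-150 seconds)
`-o, --output`	Save JSON output to a file
`--output-dir`	Save each page as a .md file in the directory
`--json`	Structured JSON output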
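
The numeric ranges in the table can be checked in a wrapper script before tvly is called. This is a sketch only: in_range and check_crawl_opts are hypothetical helper names, and the bounds are copied from the table rather than read from the CLI.

```shell
# Succeed if $1 lies within the inclusive range [$2, $3].
in_range() {
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Usage: check_crawl_opts MAX_DEPTH CHUNKS_PER_SOURCE TIMEOUT_SECONDS
# Bounds mirror the options table; they are not queried from tvly.
check_crawl_opts() {
  in_range "$1" 1 5    || { echo "--max-depth must be 1-5" >&2; return 1; }
  in_range "$2" 1 5    || { echo "--chunks-per-source must be 1-5" >&2; return 1; }
  in_range "$3" 10 150 || { echo "--timeout must be 10-150 seconds" >&2; return 1; }
}

check_crawl_opts 2 3 60 && echo "options look valid"
```

Failing fast on an out-of-range value avoids a round trip to the service for a request that would be rejected.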

Crawl for context vs. data collection

For agentic use (feeding results to an LLM):

Always combine --instructions with --chunks-per-source. The crawl then returns only the relevant chunks instead of full pages, which prevents context explosion.

tvly crawl "https://docs.example.com" --instructions "API authentication" --chunks-per-source 3 --json

For data collection (saving to files):

Use --output-dir without --chunks-per-source to get full pages as markdown files.

tvly crawl "https://docs.example.com" --max-depth 2 --output-dir ./docs/
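
The two modes above can be captured in one helper so a script picks the right flags by intent. This sketch prints the command instead of running it. The function name crawl_cmd and the mode names context and files are hypothetical, and the flag values are the ones used in the examples above.

```shell
# Print (do not run) the tvly command for a given mode.
#   context URL "INSTRUCTIONS"  -> relevant chunks for an LLM
#   files   URL OUTPUT_DIR      -> full pages saved as Markdown files
crawl_cmd() {
  mode=$1
  url=$2
  arg=${3:-}
  case "$mode" in
    context)
      printf 'tvly crawl "%s" --instructions "%s" --chunks-per-source 3 --json\n' "$url" "$arg"
      ;;
    files)
      printf 'tvly crawl "%s" --max-depth 2 --output-dir %s\n' "$url" "$arg"
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1
      ;;
  esac
}

crawl_cmd context "https://docs.example.com" "API authentication"
crawl_cmd files "https://docs.example.com" ./docs/
```

Printing first keeps the helper safe to test; once the output looks right, the printed line can be run as is.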

Tips

Start conservative (--max-depth 1, --limit 20) and scale up.
Use --select-paths to focus on the section you need.
Run map first to understand the site structure before a full crawl.
Always set --limit to prevent runaway crawls.
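
One way to follow the "start conservative and scale up" tip is to generate the escalating commands up front and run them one at a time. In this sketch the function name crawl_plan, the three-step depth cap, and the doubling of --limit are all illustrative assumptions. Only the starting values come from the tip.

```shell
# Print a conservative-first series of crawl commands for one URL.
# Starts at --max-depth 1 --limit 20 as the tip suggests; the doubling
# schedule and the cap at depth 3 are assumptions for illustration.
crawl_plan() {
  url=$1
  depth=1
  limit=20
  while [ "$depth" -le 3 ]; do
    printf 'tvly crawl "%s" --max-depth %d --limit %d --json\n' "$url" "$depth" "$limit"
    depth=$((depth + 1))
    limit=$((limit * 2))
  done
}

crawl_plan "https://docs.example.com"
```

Run the first printed command, inspect the result, and move to the next only if the shallower crawl missed pages you need.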
