npx skills add https://github.com/tavily-ai/skills --skill crawl
Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
The script uses OAuth via the Tavily MCP server. No manual setup is required; on first run it checks for an existing token in ~/.mcp-auth/ before starting the login flow.
Note: You must have an existing Tavily account. The OAuth flow only supports login; account creation is not available through this flow. Sign up at tavily.com first if you don't have an account.
If you prefer using an API key, get one at https://tavily.com and add it to ~/.claude/settings.json:
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
./scripts/crawl.sh '<json>' [output_dir]
Examples:
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'
# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'
# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs
# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'
# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
When output_dir is provided, each crawled page is saved as a separate markdown file.
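The JSON argument is easy to get wrong with shell quoting; a minimal sketch (assuming jq is installed; the URL and instructions are placeholder values) that builds the payload programmatically:

```shell
# Build the crawl payload with jq so the URL and instructions are escaped safely.
payload=$(jq -n \
  --arg url "https://docs.example.com" \
  --arg instructions "Find API documentation" \
  '{url: $url, max_depth: 2, instructions: $instructions, chunks_per_source: 3}')
echo "$payload"
# Pass it straight through: ./scripts/crawl.sh "$payload" ./docs
```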
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 1,
"limit": 20
}'
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and code examples",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
POST https://api.tavily.com/crawl
| Header | Value |
|---|---|
| Authorization | Bearer <TAVILY_API_KEY> |
| Content-Type | application/json |
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | Required | Root URL to begin crawling |
| max_depth | integer | 1 | Levels deep to crawl (1-5) |
| max_breadth | integer | 20 | Links per page |
| limit | integer | 50 | Total pages cap |
| instructions | string | null | Natural language guidance for focus |
| chunks_per_source | integer | 3 | Chunks per page (1-5, requires instructions) |
| extract_depth | string | "basic" | basic or advanced |
| format | string | "markdown" | markdown or text |
| select_paths | array | null | Regex patterns to include |
| exclude_paths | array | null | Regex patterns to exclude |
| allow_external | boolean | true | Include external domain links |
| timeout | float | 150 | Max wait (10-150 seconds) |
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
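A response of this shape can be summarized with standard JSON tooling; a sketch (assuming jq) using a canned sample in the schema above rather than a live API call:

```shell
# Sample response matching the schema above (not real API output).
response='{"base_url":"https://docs.example.com","results":[{"url":"https://docs.example.com/page","raw_content":"# Page Title\n\nContent..."}],"response_time":12.5}'
# Count pages and total extracted characters.
echo "$response" | jq '{pages: (.results | length), total_chars: ([.results[].raw_content | length] | add)}'
```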
| Depth | Typical Pages | Time |
|---|---|---|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |
Start with max_depth=1 and increase only if needed.
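The growth in the table comes from max_breadth: each level fans out by up to max_breadth links, so the reachable-page bound compounds per level. A rough sketch of the arithmetic (an illustrative upper bound; real counts depend on the site's link graph):

```shell
# Upper bound on reachable pages: sum of max_breadth^d for d = 1..max_depth.
max_breadth=20
max_depth=2
total=0
pages=1
for _ in $(seq 1 "$max_depth"); do
  pages=$((pages * max_breadth))  # pages discoverable at this depth
  total=$((total + pages))
done
echo "up to $total pages before 'limit' caps the crawl"
```

Even at depth 2 with the default breadth, the bound runs into the hundreds, which is why the limit parameter (default 50) matters.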
For agentic use (feeding results into context): Always use instructions + chunks_per_source. This returns only relevant chunks instead of full pages, preventing context window explosion.
For data collection (saving to files): Omit chunks_per_source to get full page content.
Use when feeding crawl results into an LLM context:
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and authentication guides",
"chunks_per_source": 3
}'
Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com",
"max_depth": 2,
"instructions": "Find all documentation about authentication and security",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
Use when saving content to files for later processing:
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/blog",
"max_depth": 2,
"max_breadth": 50,
"select_paths": ["/blog/.*"],
"exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
}'
Returns full page content - use the script with output_dir to save as markdown files.
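The file-saving step can be sketched in plain shell. This is an illustration of the idea only, not the actual crawl.sh implementation (assumes jq; the response and file-naming scheme are made up for the example):

```shell
# Write each result's raw_content to its own Markdown file, named from the URL.
outdir=./docs
mkdir -p "$outdir"
response='{"results":[{"url":"https://example.com/blog/post-1","raw_content":"# Post 1"},{"url":"https://example.com/blog/post-2","raw_content":"# Post 2"}]}'
echo "$response" | jq -c '.results[]' | while read -r page; do
  name=$(echo "$page" | jq -r '.url' | sed -E 's|https?://||; s|/|_|g')
  echo "$page" | jq -r '.raw_content' > "$outdir/$name.md"
done
ls "$outdir"
```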
Use map instead of crawl when you only need URLs, not content:
curl --request POST \
--url https://api.tavily.com/map \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find all API docs and guides"
}'
Returns URLs only (faster than crawl):
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
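Because map returns bare URL strings, its output pipes cleanly into further processing, such as a follow-up crawl or extract per URL. A sketch over a canned sample (assuming jq; no live call):

```shell
# Iterate over mapped URLs; replace the echo with a real fetch step.
map_response='{"base_url":"https://docs.example.com","results":["https://docs.example.com/api/auth","https://docs.example.com/guides/quickstart"]}'
echo "$map_response" | jq -r '.results[]' | while read -r url; do
  echo "would fetch: $url"
done
```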
- Use chunks_per_source for agentic workflows - prevents context explosion when feeding results to LLMs
- Omit chunks_per_source for data collection - when saving full pages to files
- Start small (max_depth=1, limit=20) and scale up
- Set a limit to prevent runaway crawls

Weekly Installs: 3.5K
GitHub Stars: 143
First Seen: Jan 25, 2026
Security Audits: Gen Agent Trust Hub: Fail; Socket: Pass; Snyk: Warn
Installed on: opencode (3.2K), gemini-cli (3.1K), codex (3.1K), github-copilot (3.0K), kimi-cli (2.9K), amp (2.9K)