npx skills add https://github.com/tavily-ai/skills --skill crawl
Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
The script uses OAuth via the Tavily MCP server. No manual setup is required; on first run it checks for an existing token in ~/.mcp-auth/ before starting the login flow.
Note: You must have an existing Tavily account. The OAuth flow only supports login; account creation is not available through this flow. Sign up at tavily.com first if you don't have an account.
If you prefer using an API key, get one at https://tavily.com and add it to ~/.claude/settings.json:
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
./scripts/crawl.sh '<json>' [output_dir]
Examples:
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'
# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'
# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs
# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'
# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
When output_dir is provided, each crawled page is saved as a separate markdown file.
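The JSON argument is easy to get wrong with shell quoting; a minimal sketch (assuming jq is installed; the URL and instructions are placeholder values) that builds the payload programmatically:

```shell
# Build the crawl payload with jq so the URL and instructions are escaped safely.
payload=$(jq -n \
  --arg url "https://docs.example.com" \
  --arg instructions "Find API documentation" \
  '{url: $url, max_depth: 2, instructions: $instructions, chunks_per_source: 3}')
echo "$payload"
# Pass it straight through: ./scripts/crawl.sh "$payload" ./docs
```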
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 1,
"limit": 20
}'
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and code examples",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
POST https://api.tavily.com/crawl
| Header | Value |
|---|---|
| Authorization | Bearer <TAVILY_API_KEY> |
| Content-Type | application/json |
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | Required | Root URL to begin crawling |
| max_depth | integer | 1 | Levels deep to crawl (1-5) |
| max_breadth | integer | 20 | Links per page |
| limit | integer | 50 | Total pages cap |
| instructions | string | null | Natural language guidance for focus |
| chunks_per_source | integer | 3 | Chunks per page (1-5, requires instructions) |
| extract_depth | string | "basic" | basic or advanced |
| format | string | "markdown" | markdown or text |
| select_paths | array | null | Regex patterns to include |
| exclude_paths | array | null | Regex patterns to exclude |
| allow_external | boolean | true | Include external domain links |
| timeout | float | 150 | Max wait (10-150 seconds) |
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
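A response of this shape can be summarized with standard JSON tooling; a sketch (assuming jq) using a canned sample in the schema above rather than a live API call:

```shell
# Sample response matching the schema above (not real API output).
response='{"base_url":"https://docs.example.com","results":[{"url":"https://docs.example.com/page","raw_content":"# Page Title\n\nContent..."}],"response_time":12.5}'
# Count pages and total extracted characters.
echo "$response" | jq '{pages: (.results | length), total_chars: ([.results[].raw_content | length] | add)}'
```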
| Depth | Typical Pages | Time |
|---|---|---|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |
Start with max_depth=1 and increase only if needed.
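The growth in the table comes from max_breadth: each level fans out by up to max_breadth links, so the reachable-page bound compounds per level. A rough sketch of the arithmetic (an illustrative upper bound; real counts depend on the site's link graph):

```shell
# Upper bound on reachable pages: sum of max_breadth^d for d = 1..max_depth.
max_breadth=20
max_depth=2
total=0
pages=1
for _ in $(seq 1 "$max_depth"); do
  pages=$((pages * max_breadth))  # pages discoverable at this depth
  total=$((total + pages))
done
echo "up to $total pages before 'limit' caps the crawl"
```

Even at depth 2 with the default breadth, the bound runs into the hundreds, which is why the limit parameter (default 50) matters.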
For agentic use (feeding results into context): Always use instructions + chunks_per_source. This returns only relevant chunks instead of full pages, preventing context window explosion.
For data collection (saving to files): Omit chunks_per_source to get full page content.
Use when feeding crawl results into an LLM context:
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and authentication guides",
"chunks_per_source": 3
}'
Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com",
"max_depth": 2,
"instructions": "Find all documentation about authentication and security",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
Use when saving content to files for later processing:
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/blog",
"max_depth": 2,
"max_breadth": 50,
"select_paths": ["/blog/.*"],
"exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
}'
Returns full page content - use the script with output_dir to save as markdown files.
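The file-saving step can be sketched in plain shell. This is an illustration of the idea only, not the actual crawl.sh implementation (assumes jq; the response and file-naming scheme are made up for the example):

```shell
# Write each result's raw_content to its own Markdown file, named from the URL.
outdir=./docs
mkdir -p "$outdir"
response='{"results":[{"url":"https://example.com/blog/post-1","raw_content":"# Post 1"},{"url":"https://example.com/blog/post-2","raw_content":"# Post 2"}]}'
echo "$response" | jq -c '.results[]' | while read -r page; do
  name=$(echo "$page" | jq -r '.url' | sed -E 's|https?://||; s|/|_|g')
  echo "$page" | jq -r '.raw_content' > "$outdir/$name.md"
done
ls "$outdir"
```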
Use map instead of crawl when you only need URLs, not content:
curl --request POST \
--url https://api.tavily.com/map \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find all API docs and guides"
}'
Returns URLs only (faster than crawl):
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
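Because map returns bare URL strings, its output pipes cleanly into further processing, such as a follow-up crawl or extract per URL. A sketch over a canned sample (assuming jq; no live call):

```shell
# Iterate over mapped URLs; replace the echo with a real fetch step.
map_response='{"base_url":"https://docs.example.com","results":["https://docs.example.com/api/auth","https://docs.example.com/guides/quickstart"]}'
echo "$map_response" | jq -r '.results[]' | while read -r url; do
  echo "would fetch: $url"
done
```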
- Use chunks_per_source for agentic workflows - prevents context explosion when feeding results to LLMs
- Omit chunks_per_source for data collection - when saving full pages to files
- Start small (max_depth=1, limit=20) and scale up
- Set a limit to prevent runaway crawls

Weekly Installs: 3.5K
GitHub Stars: 143
First Seen: Jan 25, 2026
Security Audits: Gen Agent Trust Hub: Fail; Socket: Pass; Snyk: Warn
Installed on: opencode (3.2K), gemini-cli (3.1K), codex (3.1K), github-copilot (3.0K), kimi-cli (2.9K), amp (2.9K)