Firecrawl Scraper Skill: Intelligent Web Scraping, Site Crawling, and Content Extraction Tool

firecrawl-scraper by benedictking/firecrawl-scraper

424 weekly installs

5 GitHub Stars


Install command

npx skills add https://github.com/benedictking/firecrawl-scraper --skill firecrawl-scraper

Automation, Data Processing, SEO


Firecrawl Scraper Skill

Trigger Conditions & Endpoint Selection

Choose the Firecrawl endpoint based on user intent:

scrape : Extract content from a single web page (markdown, html, json, screenshot, pdf)
crawl : Crawl an entire website, with depth control and path filtering
map : Quickly get a list of all URLs on a website
batch-scrape : Scrape multiple URLs in parallel
crawl-status : Given a crawl job ID, check crawl progress and results (optional --wait)
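
The selection rules above can be sketched as a small function. This is only an illustration: the `chooseEndpoint` helper and the shape of its `intent` argument (`crawlId`, `urls`, `listUrlsOnly`, `wholeSite`) are hypothetical and are not part of the skill.

```javascript
// Hypothetical helper: maps a described intent to one of the five commands.
// The field names on `intent` are assumptions made for this sketch.
function chooseEndpoint(intent) {
  if (intent.crawlId) return "crawl-status"; // an existing crawl job ID was supplied
  if (Array.isArray(intent.urls) && intent.urls.length > 1) return "batch-scrape";
  if (intent.listUrlsOnly) return "map"; // only the URL inventory is wanted
  if (intent.wholeSite) return "crawl"; // follow links across the site
  return "scrape"; // default: a single page
}
```

For example, `chooseEndpoint({ wholeSite: true })` selects `crawl`, while an intent with no special flags falls back to `scrape`.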

Recommended Architecture (Main Skill + Sub-skill)

This skill uses a two-phase architecture:

Main skill (current context) : Understands the user's question → chooses the endpoint → assembles the JSON payload
Sub-skill (fork context) : Only executes the HTTP call; running it in a forked context keeps the conversation history from consuming extra tokens

Execution Method

Use the Task tool to invoke the firecrawl-fetcher sub-skill, passing the command and the JSON payload (via stdin):

Task parameters:
- subagent_type: Bash
- description: "Call Firecrawl API"
- prompt: cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs <scrape|crawl|map|batch-scrape|crawl-status> [--wait]
  { ...payload... }
  JSON
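
As a rough sketch of how the main skill might assemble that prompt string, the hypothetical `buildCommand` helper below serializes a payload into the heredoc form shown above. It covers only the four commands that read JSON from stdin; the helper name and its options object are assumptions for illustration.

```javascript
// Script path and command names are taken from the text above.
const SCRIPT = ".claude/skills/firecrawl-scraper/firecrawl-api.cjs";
const STDIN_COMMANDS = ["scrape", "crawl", "map", "batch-scrape"];

// Hypothetical helper: returns the shell command to place in the Task prompt.
function buildCommand(command, payload, { wait = false } = {}) {
  if (!STDIN_COMMANDS.includes(command)) {
    throw new Error(`Unknown command: ${command}`);
  }
  const body = JSON.stringify(payload, null, 2);
  const flag = wait ? " --wait" : "";
  // The quoted 'JSON' delimiter stops the shell from expanding $ or backticks in the body.
  return `cat <<'JSON' | node ${SCRIPT} ${command}${flag}\n${body}\nJSON`;
}
```

Because `JSON.stringify` quotes every string and escapes embedded newlines, the body cannot contain a bare `JSON` line that would end the heredoc early.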

Payload Examples

1) Scrape Single Page

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs scrape
{
  "url": "https://example.com",
  "formats": ["markdown", "links"],
  "onlyMainContent": true,
  "includeTags": [],
  "excludeTags": ["nav", "footer"],
  "waitFor": 0,
  "timeout": 30000
}
JSON

Available formats:

"markdown", "html", "rawHtml", "links", "images", "summary"
{"type": "json", "prompt": "Extract product info", "schema": {...}}
{"type": "screenshot", "fullPage": true, "quality": 85}
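
A payload's `formats` array can be checked against this list before it is sent. The sketch below is a hypothetical pre-flight check, not something the skill ships; it assumes that `json` and `screenshot` are always given in object form, as in the listing above.

```javascript
// Format names copied from the list above.
const STRING_FORMATS = ["markdown", "html", "rawHtml", "links", "images", "summary"];
const OBJECT_FORMATS = ["json", "screenshot"];

// Hypothetical helper: returns the entries that do not match the listed formats.
function invalidFormats(formats) {
  return formats.filter((format) => {
    if (typeof format === "string") return !STRING_FORMATS.includes(format);
    if (format && typeof format === "object") return !OBJECT_FORMATS.includes(format.type);
    return true; // numbers, null, etc. are never valid
  });
}
```

An empty result means every entry matched; anything returned can be reported to the user before an API call is spent.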

2) Scrape with Actions (Page Interaction)

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs scrape
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "actions": [
    {"type": "wait", "milliseconds": 2000},
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "milliseconds": 1000},
    {"type": "scroll", "direction": "down", "amount": 500}
  ]
}
JSON

Available actions:

wait, click, write, press, scroll, screenshot, scrape, executeJavascript

3) Parse PDF

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs scrape
{
  "url": "https://example.com/document.pdf",
  "formats": ["markdown"],
  "parsers": ["pdf"]
}
JSON

4) Extract Structured JSON

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs scrape
{
  "url": "https://example.com/product",
  "formats": [
    {
      "type": "json",
      "prompt": "Extract product information",
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"},
          "description": {"type": "string"}
        },
        "required": ["name", "price"]
      }
    }
  ]
}
JSON
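
Once the extracted object comes back, it can be worth confirming that it actually satisfies the schema that was sent. The following is a minimal sketch under stated assumptions: `checkAgainstSchema` is a hypothetical helper that handles only `required` and primitive `type` checks (`string`, `number`, `boolean`), not full JSON Schema.

```javascript
// Hypothetical helper: lists the ways `value` departs from a flat object schema.
function checkAgainstSchema(value, schema) {
  const problems = [];
  for (const key of schema.required || []) {
    if (!(key in value)) problems.push(`missing required field: ${key}`);
  }
  for (const [key, rule] of Object.entries(schema.properties || {})) {
    // typeof covers the primitive types used in the example schema above.
    if (key in value && typeof value[key] !== rule.type) {
      problems.push(`${key}: expected ${rule.type}, got ${typeof value[key]}`);
    }
  }
  return problems;
}
```

With the product schema above, an object whose `price` arrives as the string `"9.5"` would be reported as `price: expected number, got string`.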

5) Crawl Entire Website

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs crawl
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "includePaths": ["^/docs/.*"],
  "excludePaths": ["^/blog/.*"],
  "maxDiscoveryDepth": 3,
  "limit": 100,
  "allowExternalLinks": false,
  "allowSubdomains": false
}
JSON
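
To get a feel for how `includePaths` and `excludePaths` interact, the sketch below applies the same patterns locally. It assumes the patterns are tested against the URL's pathname and that an exclude match wins over an include match; the service's actual matching rules may differ, and `shouldCrawl` is a hypothetical name.

```javascript
// Hypothetical local preview of path filtering; not the service's implementation.
function shouldCrawl(url, { includePaths = [], excludePaths = [] } = {}) {
  const path = new URL(url).pathname;
  const matchesAny = (patterns) => patterns.some((pattern) => new RegExp(pattern).test(path));
  if (matchesAny(excludePaths)) return false; // assumption: excludes take precedence
  // With no include patterns, everything not excluded is allowed.
  return includePaths.length === 0 || matchesAny(includePaths);
}
```

With the payload above, `https://docs.example.com/docs/intro` passes, while `/blog/post` is excluded and `/pricing` is skipped because it matches no include pattern.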

5.1) Crawl + Wait for Completion

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs crawl --wait
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "limit": 100
}
JSON

6) Map Website URLs

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs map
{
  "url": "https://example.com",
  "search": "documentation",
  "limit": 5000
}
JSON

7) Batch Scrape Multiple URLs

cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.cjs batch-scrape
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "formats": ["markdown"]
}
JSON

8) Check Crawl Status

node .claude/skills/firecrawl-scraper/firecrawl-api.cjs crawl-status <crawl-id>

Wait for completion:

node .claude/skills/firecrawl-scraper/firecrawl-api.cjs crawl-status <crawl-id> --wait
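
Conceptually, waiting for completion amounts to polling the status until the job reaches a final state. The sketch below shows that loop in isolation. It is not the script's actual implementation: `waitForCrawl`, the injected `checkStatus` function, and the status names `completed` and `failed` are assumptions.

```javascript
// Hypothetical polling loop; `checkStatus` is any async function returning { status, ... }.
async function waitForCrawl(checkStatus, { intervalMs = 2000, maxAttempts = 30, sleep } = {}) {
  // A custom `sleep` can be injected so the loop is testable without real timers.
  const pause = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await checkStatus();
    // Assumed terminal states; anything else is treated as "still running".
    if (result.status === "completed" || result.status === "failed") return result;
    await pause(intervalMs);
  }
  throw new Error(`Crawl still running after ${maxAttempts} checks`);
}
```

Capping the number of attempts keeps a stalled job from blocking the sub-skill indefinitely.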

Key Features

Formats

markdown : Clean markdown content
html : Parsed HTML
rawHtml : Original HTML
links : All links on page
images : All images on page
summary : AI-generated summary
json : Structured data extraction with schema
screenshot : Page screenshot (PNG)

Content Control

onlyMainContent: Extract only main content (default: true)
includeTags: CSS selectors to include
excludeTags: CSS selectors to exclude
waitFor: Wait time before scraping (ms)
maxAge: Cache duration (default: 48 hours)
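
If `maxAge` is set explicitly, the value needs a unit. The sketch below assumes it is expressed in milliseconds, like `waitFor`; that unit is an assumption, and `hoursToMs` is a hypothetical helper.

```javascript
// Assumption: maxAge is given in milliseconds, like waitFor.
const hoursToMs = (hours) => hours * 60 * 60 * 1000;

// Example payload that spells out the 48-hour default explicitly.
const cachedScrapePayload = {
  url: "https://example.com",
  formats: ["markdown"],
  maxAge: hoursToMs(48), // 48 * 3,600,000 = 172,800,000 ms
};
```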

Actions (Browser Automation)

wait: Wait for specified time
click: Click element by selector
write: Input text into field
press: Press keyboard key
scroll: Scroll page
executeJavascript: Run custom JS

Crawl Options

includePaths: Regex patterns to include
excludePaths: Regex patterns to exclude
maxDiscoveryDepth: Maximum crawl depth
limit: Maximum pages to crawl
allowExternalLinks: Whether to follow external links
allowSubdomains: Whether to follow subdomains

Environment Variables & API Key

There are two ways to configure the API key (priority: environment variable > .env):

Environment variable: FIRECRAWL_API_KEY
.env file: Place it at .claude/skills/firecrawl-scraper/.env; it can be copied from .env.example
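
The stated priority can be sketched as follows. This is an illustration of the lookup order only, not the script's actual code: `parseDotenv` and `resolveApiKey` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and the key values in the usage note are placeholders.

```javascript
// Hypothetical minimal .env parser: KEY=value lines, optional matching quotes.
function parseDotenv(text) {
  const vars = {};
  for (const line of text.split(/\r?\n/)) {
    // Comment lines start with '#', so they never match the KEY=value pattern.
    const match = line.match(/^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$/);
    if (!match) continue;
    vars[match[1]] = match[2].replace(/^(['"])(.*)\1$/, "$2");
  }
  return vars;
}

// The environment variable wins; the .env file is the fallback.
function resolveApiKey(env, dotenvText = "") {
  return env.FIRECRAWL_API_KEY || parseDotenv(dotenvText).FIRECRAWL_API_KEY || null;
}
```

Calling `resolveApiKey(process.env, fileContents)` with both sources set returns the environment value; with neither set it returns `null`, which the caller can turn into a clear error message.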

Response Format

All endpoints return JSON with:

success: Boolean indicating success
data: Extracted content (format depends on endpoint)
For crawl: Returns a job ID; use crawl-status (or GET /v2/crawl/{id}) to check its status
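
A caller can treat the `success` flag as a gate before touching `data`. The sketch below shows one way to do that; `unwrap` is a hypothetical helper, and the `error` field it reads on failure is an assumption, since only `success` and `data` are described above.

```javascript
// Hypothetical helper: returns `data` on success, throws otherwise.
function unwrap(response) {
  if (!response || response.success !== true) {
    // `error` is an assumed field name for the failure reason.
    const reason = response && response.error ? response.error : "unknown error";
    throw new Error(`Firecrawl call failed: ${reason}`);
  }
  return response.data;
}
```

For instance, `unwrap(JSON.parse(stdout))` yields the extracted content directly, and a failed call surfaces as an exception instead of an undefined `data` further down the line.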

Weekly Installs: 420
Repository: benedictking/fi…-scraper
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
Installed on: opencode (384), codex (376), gemini-cli (369), github-copilot (320), cursor (315), amp (285)
