firecrawl-scraper by jezweb/claude-skills
npx skills add https://github.com/jezweb/claude-skills --skill firecrawl-scraper
Status: Production Ready
Last Updated: 2026-01-20
Official Docs: https://docs.firecrawl.dev
API Version: v2
SDK Versions: firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+
Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles:
| Endpoint | Purpose | Use Case |
|---|---|---|
| /scrape | Single page | Extract articles, product pages |
| /crawl | Full site | Index docs, archive sites |
| /map | URL discovery | Find all pages, plan strategy |
| /search | Web search + scrape | Research with live data |
| /extract | Structured data | Product prices, contacts |
| /agent | Autonomous gathering | No URLs needed, AI navigates |
| /batch-scrape | Multiple URLs | Bulk processing |
Scrape (/v2/scrape): scrapes a single webpage and returns clean, structured content.
from firecrawl import Firecrawl
import os
app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))
# Basic scrape
doc = app.scrape(
url="https://example.com/article",
formats=["markdown", "html"],
only_main_content=True
)
print(doc.markdown)
print(doc.metadata)
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const result = await app.scrape('https://example.com/article', {
formats: ['markdown', 'html'],
onlyMainContent: true
});
console.log(result.markdown);
| Format | Description |
|---|---|
| markdown | LLM-optimized content |
| html | Full HTML |
| rawHtml | Unprocessed HTML |
| screenshot | Page capture (with viewport options) |
| links | All URLs on page |
| json | Structured data extraction |
| summary | AI-generated summary |
| branding | Design system data |
| changeTracking | Content change detection |
doc = app.scrape(
url="https://example.com",
formats=["markdown", "screenshot"],
only_main_content=True,
remove_base64_images=True,
wait_for=5000, # Wait 5s for JS
timeout=30000,
# Location & language
location={"country": "AU", "languages": ["en-AU"]},
# Cache control
max_age=0, # Fresh content (no cache)
store_in_cache=True,
# Stealth mode for complex sites
stealth=True,
# Custom headers
headers={"User-Agent": "Custom Bot 1.0"}
)
Perform interactions before scraping:
doc = app.scrape(
url="https://example.com",
actions=[
{"type": "click", "selector": "button.load-more"},
{"type": "wait", "milliseconds": 2000},
{"type": "scroll", "direction": "down"},
{"type": "write", "selector": "input#search", "text": "query"},
{"type": "press", "key": "Enter"},
{"type": "screenshot"} # Capture state mid-action
]
)
# With schema
doc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"in_stock": {"type": "boolean"}
}
}
}
)
# Without schema (prompt-only)
doc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"prompt": "Extract the product name, price, and availability"
}
)
Extract design system and brand identity:
doc = app.scrape(
url="https://example.com",
formats=["branding"]
)
# Returns:
# - Color schemes and palettes
# - Typography (fonts, sizes, weights)
# - Spacing and layout metrics
# - UI component styles
# - Logo and imagery URLs
# - Brand personality traits
Crawl (/v2/crawl): crawls all accessible pages from a starting URL.
result = app.crawl(
url="https://docs.example.com",
limit=100,
max_depth=3,
allowed_domains=["docs.example.com"],
exclude_paths=["/api/*", "/admin/*"],
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for page in result.data:
print(f"Scraped: {page.metadata.source_url}")
print(f"Content: {page.markdown[:200]}...")
# Start crawl (returns immediately)
job = app.start_crawl(
url="https://docs.example.com",
limit=1000,
webhook="https://your-domain.com/webhook"
)
print(f"Job ID: {job.id}")
# Or poll for status
status = app.get_crawl_status(job.id)
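To block until an async crawl finishes, a simple polling loop works. This is a sketch: the `get_crawl_status` call and the "completed"/"failed" status values mirror the job-lifecycle examples later in this document, so treat them as assumptions and verify against the API reference.

```python
import time

def wait_for_crawl(app, job_id: str, interval: float = 5.0, timeout: float = 600.0):
    """Poll a crawl job until it reaches a terminal state.

    The "completed"/"failed" status values are assumptions based on the
    job-lifecycle examples elsewhere in this document.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = app.get_crawl_status(job_id)
        if status.status in ("completed", "failed"):
            return status
        time.sleep(interval)  # Avoid hammering the status endpoint
    raise TimeoutError(f"crawl {job_id} did not finish within {timeout}s")
```

Remember to wait a couple of seconds after `start_crawl` before the first check (see the "Job not found" issue below).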
Map (/v2/map): rapidly discover all URLs on a website without scraping content.
urls = app.map(url="https://example.com")
print(f"Found {len(urls)} pages")
for url in urls[:10]:
print(url)
Use for: sitemap discovery, crawl planning, website audits.
Search (/search) - NEW: perform web searches and optionally scrape the results in one operation.
# Basic search
results = app.search(
query="best practices for React server components",
limit=10
)
for result in results:
print(f"{result.title}: {result.url}")
# Search + scrape results
results = app.search(
query="React server components tutorial",
limit=5,
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for result in results:
print(f"{result.title}")
print(result.markdown[:500])
results = app.search(
query="machine learning papers",
limit=20,
# Filter by source type
sources=["web", "news", "images"],
# Filter by category
categories=["github", "research", "pdf"],
# Location
location={"country": "US"},
# Time filter
tbs="qdr:m", # Past month (qdr:h=hour, qdr:d=day, qdr:w=week, qdr:y=year)
timeout=30000
)
Cost: 2 credits per 10 results, plus scraping costs if enabled.
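As a rough budgeting sketch: the 2-credits-per-10-results figure comes from the line above, while rounding up to the next block of 10 and the 1-credit-per-scraped-page figure are assumptions (based on the basic-proxy pricing discussed later), so verify against your dashboard.

```python
import math

def estimate_search_credits(num_results: int, scrape_results: bool = False) -> int:
    """Estimate credits for a /search call.

    2 credits per (started) block of 10 results; if scraping is enabled,
    assume ~1 credit per scraped page. Rounding and per-page cost are
    assumptions -- check actual billing on your dashboard.
    """
    search_cost = math.ceil(num_results / 10) * 2
    scrape_cost = num_results if scrape_results else 0
    return search_cost + scrape_cost
```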
Extract (/v2/extract): AI-powered structured data extraction from single pages, multiple pages, or entire domains.
from pydantic import BaseModel
class Product(BaseModel):
name: str
price: float
description: str
in_stock: bool
result = app.extract(
urls=["https://example.com/product"],
schema=Product,
system_prompt="Extract product information"
)
print(result.data)
# Extract from entire domain using wildcard
result = app.extract(
urls=["example.com/*"], # All pages on domain
schema=Product,
system_prompt="Extract all products"
)
# Enable web search for additional context
result = app.extract(
urls=["example.com/products"],
schema=Product,
enable_web_search=True # Follow external links
)
result = app.extract(
urls=["https://example.com/about"],
prompt="Extract the company name, founding year, and key executives"
)
# LLM determines output structure
Agent (/agent) - NEW: autonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.
# Basic agent usage
result = app.agent(
prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"
)
print(result.data)
# With schema for structured output
from pydantic import BaseModel
from typing import List
class CMSPricing(BaseModel):
name: str
free_tier: bool
starter_price: float
features: List[str]
result = app.agent(
prompt="Find pricing for Contentful, Sanity, and Strapi",
schema=CMSPricing
)
# Optional: focus on specific URLs
result = app.agent(
prompt="Extract the enterprise pricing details",
urls=["https://contentful.com/pricing", "https://sanity.io/pricing"]
)
| Model | Best For | Cost |
|---|---|---|
| spark-1-mini (default) | Simple extractions, high volume | Standard |
| spark-1-pro | Complex analysis, ambiguous data | 60% more |
result = app.agent(
prompt="Analyze competitive positioning...",
model="spark-1-pro" # For complex tasks
)
# Start agent (returns immediately)
job = app.start_agent(
prompt="Research market trends..."
)
# Poll for results
status = app.check_agent_status(job.id)
if status.status == "completed":
print(status.data)
Note: Agent is in Research Preview. 5 free daily requests, then credit-based billing.
Process multiple URLs efficiently in a single operation.
results = app.batch_scrape(
urls=[
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
formats=["markdown"],
only_main_content=True
)
for page in results.data:
print(f"{page.metadata.source_url}: {len(page.markdown)} chars")
job = app.start_batch_scrape(
urls=url_list,
formats=["markdown"],
webhook="https://your-domain.com/webhook"
)
# Webhook receives events: started, page, completed, failed
const job = await app.startBatchScrape(urls, {
formats: ['markdown'],
webhook: 'https://your-domain.com/webhook'
});
// Poll for status
const status = await app.checkBatchScrapeStatus(job.id);
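A minimal webhook receiver for these events might look like the sketch below. The payload shape (a JSON body with an `event` field naming one of started/page/completed/failed) is an assumption for illustration; check the webhook docs for the exact schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(payload: dict) -> str:
    """Dispatch on the (assumed) "event" field of a webhook payload."""
    event = payload.get("event")
    if event == "started":
        return "job started"
    if event == "page":
        # One page finished; its scraped content is assumed to sit under "data"
        return "page received"
    if event in ("completed", "failed"):
        return f"job {event}"
    return "ignored"

class FirecrawlWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print(handle_event(payload))
        self.send_response(200)  # Ack quickly so the sender doesn't retry
        self.end_headers()

# To run: HTTPServer(("", 8000), FirecrawlWebhook).serve_forever()
```

Processing pages incrementally from the `page` event avoids holding the whole batch in memory while the job runs.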
Monitor content changes over time by comparing scrapes.
# Enable change tracking
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"]
)
# Response includes:
print(doc.change_tracking.status) # new, same, changed, removed
print(doc.change_tracking.previous_scrape_at)
print(doc.change_tracking.visibility) # visible, hidden
# Git-diff mode (default)
doc = app.scrape(
url="https://example.com/docs",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "diff"
}
)
print(doc.change_tracking.diff) # Line-by-line changes
# JSON mode (structured comparison)
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "json",
"schema": {"type": "object", "properties": {"price": {"type": "number"}}}
}
)
# Costs 5 credits per page
Change States:
- new: page not seen before
- same: no changes since last scrape
- changed: content modified
- removed: page no longer accessible

# Get API key from https://www.firecrawl.dev/app
# Store in environment
FIRECRAWL_API_KEY=fc-your-api-key-here
Never hardcode API keys!
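A small startup guard catches the most common key mistakes early; the fc- prefix requirement comes from the troubleshooting section below. The helper name is hypothetical.

```python
import os

def load_api_key() -> str:
    """Fetch the Firecrawl key from the environment and sanity-check it.

    Keys issued by Firecrawl start with "fc-"; failing fast here beats a
    confusing auth error on the first scrape.
    """
    key = os.environ.get("FIRECRAWL_API_KEY", "")
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    if not key.startswith("fc-"):
        raise RuntimeError("invalid key format: expected a key starting with 'fc-'")
    return key
```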
The Firecrawl SDK cannot run in Cloudflare Workers (requires Node.js). Use the REST API directly:
interface Env {
FIRECRAWL_API_KEY: string;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { url } = await request.json<{ url: string }>();
const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url,
formats: ['markdown'],
onlyMainContent: true
})
});
const result = await response.json();
return Response.json(result);
}
};
Stealth mode now costs 5 credits per request when actively used. Default behavior uses "auto" mode which only charges stealth credits if basic fails.
Recommended pattern:
# Use auto mode (default) - only charges 5 credits if stealth is needed
doc = app.scrape(url, formats=["markdown"])
# Or conditionally enable stealth for specific errors
if error_status_code in [401, 403, 500]:
doc = app.scrape(url, formats=["markdown"], proxy="stealth")
Credits and tokens have been merged into a single system. The extract endpoint now uses credits (15 tokens = 1 credit).
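Under the unified model, extract-endpoint token usage can be converted to credits for budgeting. Rounding up per request is an assumption; verify against your dashboard.

```python
import math

def tokens_to_credits(tokens: int) -> int:
    """Convert extract-endpoint token usage to credits (15 tokens = 1 credit).

    Rounding up per request is an assumption, not confirmed billing behavior.
    """
    return math.ceil(tokens / 15)
```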
| Tier | Credits/Month | Notes |
|---|---|---|
| Free | 500 | Good for testing |
| Hobby | 3,000 | $19/month |
| Standard | 100,000 | $99/month |
| Growth | 500,000 | $399/month |
Credit Costs:
| Issue | Cause | Solution |
|---|---|---|
| Empty content | JS not loaded | Add wait_for: 5000 or use actions |
| Rate limit exceeded | Over quota | Check dashboard, upgrade plan |
| Timeout error | Slow page | Increase timeout, use stealth: true |
| Bot detection | Anti-scraping | Use stealth: true, add location |
| Invalid API key | Wrong format | Must start with fc- |
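These fixes can be combined into one escalation wrapper. This is a sketch using parameter names that appear elsewhere in this document (`wait_for`, `proxy="stealth"`), not a canonical implementation; the blocked-page heuristics are assumptions.

```python
def robust_scrape(app, url: str):
    """Try a cheap scrape first, then escalate.

    Longer waits handle JS-heavy pages; the stealth proxy (5 credits when
    used) handles bot-protected ones. The error-page checks are heuristic.
    """
    doc = app.scrape(url, formats=["markdown"], wait_for=5000, timeout=60000)
    text = (doc.markdown or "").lower()
    if not text or "access denied" in text or "cloudflare" in text:
        # Escalate to stealth only when the basic attempt looks blocked
        doc = app.scrape(url, formats=["markdown"], proxy="stealth")
    return doc
```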
This skill prevents 10 documented issues:
Error: Unexpected credit costs when using stealth mode
Source: Stealth Mode Docs | Changelog
Why It Happens: Starting May 8th, 2025, Stealth Mode proxy requests cost 5 credits per request (previously included in standard pricing). This is a significant billing change.
Prevention: Use auto mode (the default), which only charges stealth credits if basic fails
# RECOMMENDED: Use auto mode (default)
doc = app.scrape(url, formats=['markdown'])
# Auto retries with stealth (5 credits) only if basic fails
# Or conditionally enable based on error status
try:
doc = app.scrape(url, formats=['markdown'], proxy='basic')
except Exception as e:
if e.status_code in [401, 403, 500]:
doc = app.scrape(url, formats=['markdown'], proxy='stealth')
Stealth Mode Options:
- auto (default): charges 5 credits only if stealth succeeds after basic fails
- basic: standard proxies, 1 credit per request
- stealth: 5 credits per request when actively used

Error: AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'
Source: v2.0.0 Release | Migration Guide
Why It Happens: v2.0.0 (August 2025) renamed SDK methods across all languages
Prevention: Use the new method names
JavaScript/TypeScript:
- scrapeUrl() → scrape()
- crawlUrl() → crawl() or startCrawl()
- asyncCrawlUrl() → startCrawl()
- checkCrawlStatus() → getCrawlStatus()

Python:
- scrape_url() → scrape()
- crawl_url() → crawl() or start_crawl()

# OLD (v1)
doc = app.scrape_url("https://example.com")
# NEW (v2)
doc = app.scrape("https://example.com")
Error: 'extract' is not a valid format
Source: v2.0.0 Release
Why It Happens: The old "extract" format was renamed to "json" in v2.0.0
Prevention: Use the new object format for JSON extraction
# OLD (v1)
doc = app.scrape_url(
url="https://example.com",
params={
"formats": ["extract"],
"extract": {"prompt": "Extract title"}
}
)
# NEW (v2)
doc = app.scrape(
url="https://example.com",
formats=[{"type": "json", "prompt": "Extract title"}]
)
# With schema
doc = app.scrape(
url="https://example.com",
formats=[{
"type": "json",
"prompt": "Extract product info",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"}
}
}
}]
)
Screenshot format also changed:
# NEW: Screenshot as object
formats=[{
"type": "screenshot",
"fullPage": True,
"quality": 80,
"viewport": {"width": 1920, "height": 1080}
}]
Error: 'allowBackwardCrawling' is not a valid parameter
Source: v2.0.0 Release
Why It Happens: Several crawl parameters were renamed or removed in v2.0.0
Prevention: Use the new parameter names

Parameter Changes:
- allowBackwardCrawling → use crawlEntireDomain instead
- maxDepth → use maxDiscoveryDepth instead
- ignoreSitemap (bool) → sitemap ("only", "skip", "include")
# OLD (v1)
app.crawl_url(
    url="https://docs.example.com",
    params={
        "allowBackwardCrawling": True,
        "maxDepth": 3,
        "ignoreSitemap": False
    }
)

# NEW (v2)
app.crawl(
    url="https://docs.example.com",
    crawl_entire_domain=True,
    max_discovery_depth=3,
    sitemap="include"  # "only", "skip", or "include"
)
Error: Stale cached content returned unexpectedly
Source: v2.0.0 Release
Why It Happens: v2.0.0 changed several defaults
Prevention: Be aware of the new defaults

Default Changes:
- maxAge now defaults to 2 days (cached by default)
- blockAds, skipTlsVerification, removeBase64Images enabled by default

# Force fresh data if needed
doc = app.scrape(url, formats=['markdown'], max_age=0)
# Disable caching entirely
doc = app.scrape(url, formats=['markdown'], store_in_cache=False)
Error: "Job not found" when checking crawl status immediately after creation
Source: GitHub Issue #2662
Why It Happens: Database replication delay between job creation and status endpoint availability
Prevention: Wait 1-3 seconds before the first status check, or implement retry logic
import time
# Start crawl
job = app.start_crawl(url="https://docs.example.com")
print(f"Job ID: {job.id}")
# REQUIRED: Wait before first status check
time.sleep(2) # 1-3 seconds recommended
# Now status check succeeds
status = app.get_crawl_status(job.id)
# Or implement retry logic
def get_status_with_retry(job_id, max_retries=3, delay=1):
for attempt in range(max_retries):
try:
return app.get_crawl_status(job_id)
except Exception as e:
if "Job not found" in str(e) and attempt < max_retries - 1:
time.sleep(delay)
continue
raise
status = get_status_with_retry(job.id)
Error: DNS resolution failures return success: false with HTTP 200 status instead of 4xx
Source: GitHub Issue #2402 | Fixed in v2.7.0
Why It Happens: Changed in v2.7.0 for consistent error handling
Prevention: Check the success field and the code field; don't rely on the HTTP status alone
const result = await app.scrape('https://nonexistent-domain-xyz.com');
// DON'T rely on HTTP status code
// Response: HTTP 200 with { success: false, code: "SCRAPE_DNS_RESOLUTION_ERROR" }
// DO check success field
if (!result.success) {
if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {
console.error('DNS resolution failed');
}
throw new Error(result.error);
}
Note: DNS resolution errors still charge 1 credit despite the failure.
Error: Cloudflare error page returned as a "successful" scrape, credits charged
Source: GitHub Issue #2413
Why It Happens: The Fire-1 engine charges credits even when bot detection prevents access
Prevention: Validate that the content isn't an error page before processing; use stealth mode for protected sites
# First attempt without stealth
doc = app.scrape(url="https://protected-site.com", formats=["markdown"])
# Validate content isn't an error page
if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():
# Retry with stealth (costs 5 credits if successful)
doc = app.scrape(url, formats=["markdown"], stealth=True)
Cost Impact: A basic scrape charges 1 credit even on failure; a stealth retry charges an additional 5 credits.
Error: "All scraping engines failed!" (SCRAPE_ALL_ENGINES_FAILED) on sites with anti-bot measures
Source: GitHub Issue #2257
Why It Happens: Self-hosted Firecrawl lacks the advanced anti-fingerprinting techniques present in the cloud service
Prevention: Use the Firecrawl cloud service for sites with strong anti-bot measures, or configure a proxy
# Self-hosted fails on Cloudflare-protected sites
curl -X POST 'http://localhost:3002/v2/scrape' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://www.example.com/",
"pageOptions": { "engine": "playwright" }
}'
# Error: "All scraping engines failed!"
# Workaround: Use cloud service instead
# Cloud service has better anti-fingerprinting
Note: This affects self-hosted v2.3.0+ with the default docker-compose setup. Warning present: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."
Suboptimal: Not leveraging the cache can make requests 500% slower
Source: Fast Scraping Docs | Blog Post
Why It Matters: The default maxAge is 2 days in v2+, but many use cases need different strategies
Prevention: Use an appropriate cache strategy for your content type
# Fresh data (real-time pricing, stock prices)
doc = app.scrape(url, formats=["markdown"], max_age=0)
# 10-minute cache (news, blogs)
doc = app.scrape(url, formats=["markdown"], max_age=600000) # milliseconds
# Use default cache (2 days) for static content
doc = app.scrape(url, formats=["markdown"]) # maxAge defaults to 172800000
# Don't store in cache (one-time scrape)
doc = app.scrape(url, formats=["markdown"], store_in_cache=False)
# Require minimum age before re-scraping (v2.7.0+)
doc = app.scrape(url, formats=["markdown"], min_age=3600000) # 1 hour minimum
Performance Impact:
| Package | Version | Last Checked |
|---|---|---|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API Version | v2 | Current |
Token Savings: ~65% vs manual integration
Error Prevention: 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)
Production Ready: Yes
Last verified: 2026-01-21 | Skill version: 2.0.0 | Changes: Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model