firecrawl-scraper by jezweb/claude-skills
npx skills add https://github.com/jezweb/claude-skills --skill firecrawl-scraper
Status: Production Ready
Last Updated: 2026-01-20
Official Docs: https://docs.firecrawl.dev
API Version: v2
SDK Versions: firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+
Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles:
| Endpoint | Purpose | Use Case |
|---|---|---|
| /scrape | Single page | Extract articles, product pages |
| /crawl | Full site | Index docs, archive sites |
| /map | URL discovery | Find all pages, plan strategy |
| /search | Web search + scrape | Research with live data |
| /extract | Structured data | Product prices, contacts |
| /agent | Autonomous gathering | No URLs needed, AI navigates |
| /batch-scrape | Multiple URLs | Bulk processing |
Scrape (/v2/scrape): scrapes a single webpage and returns clean, structured content.
from firecrawl import Firecrawl
import os
app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))
# Basic scrape
doc = app.scrape(
url="https://example.com/article",
formats=["markdown", "html"],
only_main_content=True
)
print(doc.markdown)
print(doc.metadata)
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const result = await app.scrape('https://example.com/article', {
formats: ['markdown', 'html'],
onlyMainContent: true
});
console.log(result.markdown);
| Format | Description |
|---|---|
| markdown | LLM-optimized content |
| html | Full HTML |
| rawHtml | Unprocessed HTML |
| screenshot | Page capture (with viewport options) |
| links | All URLs on page |
| json | Structured data extraction |
| summary | AI-generated summary |
| branding | Design system data |
| changeTracking | Content change detection |
doc = app.scrape(
url="https://example.com",
formats=["markdown", "screenshot"],
only_main_content=True,
remove_base64_images=True,
wait_for=5000, # Wait 5s for JS
timeout=30000,
# Location & language
location={"country": "AU", "languages": ["en-AU"]},
# Cache control
max_age=0, # Fresh content (no cache)
store_in_cache=True,
# Stealth mode for complex sites
stealth=True,
# Custom headers
headers={"User-Agent": "Custom Bot 1.0"}
)
Perform interactions before scraping:
doc = app.scrape(
url="https://example.com",
actions=[
{"type": "click", "selector": "button.load-more"},
{"type": "wait", "milliseconds": 2000},
{"type": "scroll", "direction": "down"},
{"type": "write", "selector": "input#search", "text": "query"},
{"type": "press", "key": "Enter"},
{"type": "screenshot"} # Capture state mid-action
]
)
# With schema
doc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"in_stock": {"type": "boolean"}
}
}
}
)
# Without schema (prompt-only)
doc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"prompt": "Extract the product name, price, and availability"
}
)
Extract design system and brand identity:
doc = app.scrape(
url="https://example.com",
formats=["branding"]
)
# Returns:
# - Color schemes and palettes
# - Typography (fonts, sizes, weights)
# - Spacing and layout metrics
# - UI component styles
# - Logo and imagery URLs
# - Brand personality traits
Crawl (/v2/crawl): crawls all accessible pages from a starting URL.
result = app.crawl(
url="https://docs.example.com",
limit=100,
max_depth=3,
allowed_domains=["docs.example.com"],
exclude_paths=["/api/*", "/admin/*"],
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for page in result.data:
print(f"Scraped: {page.metadata.source_url}")
print(f"Content: {page.markdown[:200]}...")
# Start crawl (returns immediately)
job = app.start_crawl(
url="https://docs.example.com",
limit=1000,
webhook="https://your-domain.com/webhook"
)
print(f"Job ID: {job.id}")
# Or poll for status
status = app.get_crawl_status(job.id)
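To block until an async crawl finishes, a simple polling loop works. This is a sketch: the `get_crawl_status` call and the "completed"/"failed" status values mirror the job-lifecycle examples later in this document, so treat them as assumptions and verify against the API reference.

```python
import time

def wait_for_crawl(app, job_id: str, interval: float = 5.0, timeout: float = 600.0):
    """Poll a crawl job until it reaches a terminal state.

    The "completed"/"failed" status values are assumptions based on the
    job-lifecycle examples elsewhere in this document.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = app.get_crawl_status(job_id)
        if status.status in ("completed", "failed"):
            return status
        time.sleep(interval)  # Avoid hammering the status endpoint
    raise TimeoutError(f"crawl {job_id} did not finish within {timeout}s")
```

Remember to wait a couple of seconds after `start_crawl` before the first check (see the "Job not found" issue below).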
Map (/v2/map): rapidly discover all URLs on a website without scraping content.
urls = app.map(url="https://example.com")
print(f"Found {len(urls)} pages")
for url in urls[:10]:
print(url)
Use for: sitemap discovery, crawl planning, website audits.
Search (/search) - NEW: perform web searches and optionally scrape the results in one operation.
# Basic search
results = app.search(
query="best practices for React server components",
limit=10
)
for result in results:
print(f"{result.title}: {result.url}")
# Search + scrape results
results = app.search(
query="React server components tutorial",
limit=5,
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for result in results:
print(f"{result.title}")
print(result.markdown[:500])
results = app.search(
query="machine learning papers",
limit=20,
# Filter by source type
sources=["web", "news", "images"],
# Filter by category
categories=["github", "research", "pdf"],
# Location
location={"country": "US"},
# Time filter
tbs="qdr:m", # Past month (qdr:h=hour, qdr:d=day, qdr:w=week, qdr:y=year)
timeout=30000
)
Cost: 2 credits per 10 results, plus scraping costs if enabled.
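As a rough budgeting sketch: the 2-credits-per-10-results figure comes from the line above, while rounding up to the next block of 10 and the 1-credit-per-scraped-page figure are assumptions (based on the basic-proxy pricing discussed later), so verify against your dashboard.

```python
import math

def estimate_search_credits(num_results: int, scrape_results: bool = False) -> int:
    """Estimate credits for a /search call.

    2 credits per (started) block of 10 results; if scraping is enabled,
    assume ~1 credit per scraped page. Rounding and per-page cost are
    assumptions -- check actual billing on your dashboard.
    """
    search_cost = math.ceil(num_results / 10) * 2
    scrape_cost = num_results if scrape_results else 0
    return search_cost + scrape_cost
```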
Extract (/v2/extract): AI-powered structured data extraction from single pages, multiple pages, or entire domains.
from pydantic import BaseModel
class Product(BaseModel):
name: str
price: float
description: str
in_stock: bool
result = app.extract(
urls=["https://example.com/product"],
schema=Product,
system_prompt="Extract product information"
)
print(result.data)
# Extract from entire domain using wildcard
result = app.extract(
urls=["example.com/*"], # All pages on domain
schema=Product,
system_prompt="Extract all products"
)
# Enable web search for additional context
result = app.extract(
urls=["example.com/products"],
schema=Product,
enable_web_search=True # Follow external links
)
result = app.extract(
urls=["https://example.com/about"],
prompt="Extract the company name, founding year, and key executives"
)
# LLM determines output structure
Agent (/agent) - NEW: autonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.
# Basic agent usage
result = app.agent(
prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"
)
print(result.data)
# With schema for structured output
from pydantic import BaseModel
from typing import List
class CMSPricing(BaseModel):
name: str
free_tier: bool
starter_price: float
features: List[str]
result = app.agent(
prompt="Find pricing for Contentful, Sanity, and Strapi",
schema=CMSPricing
)
# Optional: focus on specific URLs
result = app.agent(
prompt="Extract the enterprise pricing details",
urls=["https://contentful.com/pricing", "https://sanity.io/pricing"]
)
| Model | Best For | Cost |
|---|---|---|
| spark-1-mini (default) | Simple extractions, high volume | Standard |
| spark-1-pro | Complex analysis, ambiguous data | 60% more |
result = app.agent(
prompt="Analyze competitive positioning...",
model="spark-1-pro" # For complex tasks
)
# Start agent (returns immediately)
job = app.start_agent(
prompt="Research market trends..."
)
# Poll for results
status = app.check_agent_status(job.id)
if status.status == "completed":
print(status.data)
Note: Agent is in Research Preview. 5 free daily requests, then credit-based billing.
Process multiple URLs efficiently in a single operation.
results = app.batch_scrape(
urls=[
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
formats=["markdown"],
only_main_content=True
)
for page in results.data:
print(f"{page.metadata.source_url}: {len(page.markdown)} chars")
job = app.start_batch_scrape(
urls=url_list,
formats=["markdown"],
webhook="https://your-domain.com/webhook"
)
# Webhook receives events: started, page, completed, failed
const job = await app.startBatchScrape(urls, {
formats: ['markdown'],
webhook: 'https://your-domain.com/webhook'
});
// Poll for status
const status = await app.checkBatchScrapeStatus(job.id);
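A minimal webhook receiver for these events might look like the sketch below. The payload shape (a JSON body with an `event` field naming one of started/page/completed/failed) is an assumption for illustration; check the webhook docs for the exact schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(payload: dict) -> str:
    """Dispatch on the (assumed) "event" field of a webhook payload."""
    event = payload.get("event")
    if event == "started":
        return "job started"
    if event == "page":
        # One page finished; its scraped content is assumed to sit under "data"
        return "page received"
    if event in ("completed", "failed"):
        return f"job {event}"
    return "ignored"

class FirecrawlWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print(handle_event(payload))
        self.send_response(200)  # Ack quickly so the sender doesn't retry
        self.end_headers()

# To run: HTTPServer(("", 8000), FirecrawlWebhook).serve_forever()
```

Processing pages incrementally from the `page` event avoids holding the whole batch in memory while the job runs.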
Monitor content changes over time by comparing scrapes.
# Enable change tracking
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"]
)
# Response includes:
print(doc.change_tracking.status) # new, same, changed, removed
print(doc.change_tracking.previous_scrape_at)
print(doc.change_tracking.visibility) # visible, hidden
# Git-diff mode (default)
doc = app.scrape(
url="https://example.com/docs",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "diff"
}
)
print(doc.change_tracking.diff) # Line-by-line changes
# JSON mode (structured comparison)
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "json",
"schema": {"type": "object", "properties": {"price": {"type": "number"}}}
}
)
# Costs 5 credits per page
Change States:
- new: page not seen before
- same: no changes since last scrape
- changed: content modified
- removed: page no longer accessible

# Get API key from https://www.firecrawl.dev/app
# Store in environment
FIRECRAWL_API_KEY=fc-your-api-key-here
Never hardcode API keys!
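A small startup guard catches the most common key mistakes early; the fc- prefix requirement comes from the troubleshooting section below. The helper name is hypothetical.

```python
import os

def load_api_key() -> str:
    """Fetch the Firecrawl key from the environment and sanity-check it.

    Keys issued by Firecrawl start with "fc-"; failing fast here beats a
    confusing auth error on the first scrape.
    """
    key = os.environ.get("FIRECRAWL_API_KEY", "")
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    if not key.startswith("fc-"):
        raise RuntimeError("invalid key format: expected a key starting with 'fc-'")
    return key
```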
The Firecrawl SDK cannot run in Cloudflare Workers (requires Node.js). Use the REST API directly:
interface Env {
FIRECRAWL_API_KEY: string;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { url } = await request.json<{ url: string }>();
const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url,
formats: ['markdown'],
onlyMainContent: true
})
});
const result = await response.json();
return Response.json(result);
}
};
Stealth mode now costs 5 credits per request when actively used. Default behavior uses "auto" mode which only charges stealth credits if basic fails.
Recommended pattern:
# Use auto mode (default) - only charges 5 credits if stealth is needed
doc = app.scrape(url, formats=["markdown"])
# Or conditionally enable stealth for specific errors
if error_status_code in [401, 403, 500]:
doc = app.scrape(url, formats=["markdown"], proxy="stealth")
Credits and tokens have been merged into a single system. The extract endpoint now uses credits (15 tokens = 1 credit).
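Under the unified model, extract-endpoint token usage can be converted to credits for budgeting. Rounding up per request is an assumption; verify against your dashboard.

```python
import math

def tokens_to_credits(tokens: int) -> int:
    """Convert extract-endpoint token usage to credits (15 tokens = 1 credit).

    Rounding up per request is an assumption, not confirmed billing behavior.
    """
    return math.ceil(tokens / 15)
```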
| Tier | Credits/Month | Notes |
|---|---|---|
| Free | 500 | Good for testing |
| Hobby | 3,000 | $19/month |
| Standard | 100,000 | $99/month |
| Growth | 500,000 | $399/month |
Credit Costs:
| Issue | Cause | Solution |
|---|---|---|
| Empty content | JS not loaded | Add wait_for: 5000 or use actions |
| Rate limit exceeded | Over quota | Check dashboard, upgrade plan |
| Timeout error | Slow page | Increase timeout, use stealth: true |
| Bot detection | Anti-scraping | Use stealth: true, add location |
| Invalid API key | Wrong format | Must start with fc- |
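These fixes can be combined into one escalation wrapper. This is a sketch using parameter names that appear elsewhere in this document (`wait_for`, `proxy="stealth"`), not a canonical implementation; the blocked-page heuristics are assumptions.

```python
def robust_scrape(app, url: str):
    """Try a cheap scrape first, then escalate.

    Longer waits handle JS-heavy pages; the stealth proxy (5 credits when
    used) handles bot-protected ones. The error-page checks are heuristic.
    """
    doc = app.scrape(url, formats=["markdown"], wait_for=5000, timeout=60000)
    text = (doc.markdown or "").lower()
    if not text or "access denied" in text or "cloudflare" in text:
        # Escalate to stealth only when the basic attempt looks blocked
        doc = app.scrape(url, formats=["markdown"], proxy="stealth")
    return doc
```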
This skill prevents 10 documented issues:
Error: Unexpected credit costs when using stealth mode
Source: Stealth Mode Docs | Changelog
Why It Happens: Starting May 8th, 2025, Stealth Mode proxy requests cost 5 credits per request (previously included in standard pricing). This is a significant billing change.
Prevention: Use auto mode (the default), which only charges stealth credits if basic fails
# RECOMMENDED: Use auto mode (default)
doc = app.scrape(url, formats=['markdown'])
# Auto retries with stealth (5 credits) only if basic fails
# Or conditionally enable based on error status
try:
doc = app.scrape(url, formats=['markdown'], proxy='basic')
except Exception as e:
if e.status_code in [401, 403, 500]:
doc = app.scrape(url, formats=['markdown'], proxy='stealth')
Stealth Mode Options:
- auto (default): charges 5 credits only if stealth succeeds after basic fails
- basic: standard proxies, 1 credit per request
- stealth: 5 credits per request when actively used

Error: AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'
Source: v2.0.0 Release | Migration Guide
Why It Happens: v2.0.0 (August 2025) renamed SDK methods across all languages
Prevention: Use the new method names
JavaScript/TypeScript:
- scrapeUrl() → scrape()
- crawlUrl() → crawl() or startCrawl()
- asyncCrawlUrl() → startCrawl()
- checkCrawlStatus() → getCrawlStatus()

Python:
- scrape_url() → scrape()
- crawl_url() → crawl() or start_crawl()

# OLD (v1)
doc = app.scrape_url("https://example.com")
# NEW (v2)
doc = app.scrape("https://example.com")
Error: 'extract' is not a valid format
Source: v2.0.0 Release
Why It Happens: The old "extract" format was renamed to "json" in v2.0.0
Prevention: Use the new object format for JSON extraction
# OLD (v1)
doc = app.scrape_url(
url="https://example.com",
params={
"formats": ["extract"],
"extract": {"prompt": "Extract title"}
}
)
# NEW (v2)
doc = app.scrape(
url="https://example.com",
formats=[{"type": "json", "prompt": "Extract title"}]
)
# With schema
doc = app.scrape(
url="https://example.com",
formats=[{
"type": "json",
"prompt": "Extract product info",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"}
}
}
}]
)
Screenshot format also changed:
# NEW: Screenshot as object
formats=[{
"type": "screenshot",
"fullPage": True,
"quality": 80,
"viewport": {"width": 1920, "height": 1080}
}]
Error: 'allowBackwardCrawling' is not a valid parameter
Source: v2.0.0 Release
Why It Happens: Several crawl parameters were renamed or removed in v2.0.0
Prevention: Use the new parameter names

Parameter Changes:
- allowBackwardCrawling → use crawlEntireDomain instead
- maxDepth → use maxDiscoveryDepth instead
- ignoreSitemap (bool) → sitemap ("only", "skip", "include")
# OLD (v1)
app.crawl_url(
    url="https://docs.example.com",
    params={
        "allowBackwardCrawling": True,
        "maxDepth": 3,
        "ignoreSitemap": False
    }
)

# NEW (v2)
app.crawl(
    url="https://docs.example.com",
    crawl_entire_domain=True,
    max_discovery_depth=3,
    sitemap="include"  # "only", "skip", or "include"
)
Error: Stale cached content returned unexpectedly
Source: v2.0.0 Release
Why It Happens: v2.0.0 changed several defaults
Prevention: Be aware of the new defaults

Default Changes:
- maxAge now defaults to 2 days (cached by default)
- blockAds, skipTlsVerification, removeBase64Images enabled by default

# Force fresh data if needed
doc = app.scrape(url, formats=['markdown'], max_age=0)
# Disable caching entirely
doc = app.scrape(url, formats=['markdown'], store_in_cache=False)
Error: "Job not found" when checking crawl status immediately after creation
Source: GitHub Issue #2662
Why It Happens: Database replication delay between job creation and status endpoint availability
Prevention: Wait 1-3 seconds before the first status check, or implement retry logic
import time
# Start crawl
job = app.start_crawl(url="https://docs.example.com")
print(f"Job ID: {job.id}")
# REQUIRED: Wait before first status check
time.sleep(2) # 1-3 seconds recommended
# Now status check succeeds
status = app.get_crawl_status(job.id)
# Or implement retry logic
def get_status_with_retry(job_id, max_retries=3, delay=1):
for attempt in range(max_retries):
try:
return app.get_crawl_status(job_id)
except Exception as e:
if "Job not found" in str(e) and attempt < max_retries - 1:
time.sleep(delay)
continue
raise
status = get_status_with_retry(job.id)
Error: DNS resolution failures return success: false with HTTP 200 status instead of 4xx
Source: GitHub Issue #2402 | Fixed in v2.7.0
Why It Happens: Changed in v2.7.0 for consistent error handling
Prevention: Check the success field and the code field; don't rely on the HTTP status alone
const result = await app.scrape('https://nonexistent-domain-xyz.com');
// DON'T rely on HTTP status code
// Response: HTTP 200 with { success: false, code: "SCRAPE_DNS_RESOLUTION_ERROR" }
// DO check success field
if (!result.success) {
if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {
console.error('DNS resolution failed');
}
throw new Error(result.error);
}
Note: DNS resolution errors still charge 1 credit despite the failure.
Error: Cloudflare error page returned as a "successful" scrape, credits charged
Source: GitHub Issue #2413
Why It Happens: The Fire-1 engine charges credits even when bot detection prevents access
Prevention: Validate that the content isn't an error page before processing; use stealth mode for protected sites
# First attempt without stealth
doc = app.scrape(url="https://protected-site.com", formats=["markdown"])
# Validate content isn't an error page
if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():
# Retry with stealth (costs 5 credits if successful)
doc = app.scrape(url, formats=["markdown"], stealth=True)
Cost Impact: A basic scrape charges 1 credit even on failure; a stealth retry charges an additional 5 credits.
Error: "All scraping engines failed!" (SCRAPE_ALL_ENGINES_FAILED) on sites with anti-bot measures
Source: GitHub Issue #2257
Why It Happens: Self-hosted Firecrawl lacks the advanced anti-fingerprinting techniques present in the cloud service
Prevention: Use the Firecrawl cloud service for sites with strong anti-bot measures, or configure a proxy
# Self-hosted fails on Cloudflare-protected sites
curl -X POST 'http://localhost:3002/v2/scrape' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://www.example.com/",
"pageOptions": { "engine": "playwright" }
}'
# Error: "All scraping engines failed!"
# Workaround: Use cloud service instead
# Cloud service has better anti-fingerprinting
Note: This affects self-hosted v2.3.0+ with the default docker-compose setup. Warning present: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."
Suboptimal: Not leveraging the cache can make requests 500% slower
Source: Fast Scraping Docs | Blog Post
Why It Matters: The default maxAge is 2 days in v2+, but many use cases need different strategies
Prevention: Use an appropriate cache strategy for your content type
# Fresh data (real-time pricing, stock prices)
doc = app.scrape(url, formats=["markdown"], max_age=0)
# 10-minute cache (news, blogs)
doc = app.scrape(url, formats=["markdown"], max_age=600000) # milliseconds
# Use default cache (2 days) for static content
doc = app.scrape(url, formats=["markdown"]) # maxAge defaults to 172800000
# Don't store in cache (one-time scrape)
doc = app.scrape(url, formats=["markdown"], store_in_cache=False)
# Require minimum age before re-scraping (v2.7.0+)
doc = app.scrape(url, formats=["markdown"], min_age=3600000) # 1 hour minimum
Performance Impact:
| Package | Version | Last Checked |
|---|---|---|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API Version | v2 | Current |
Token Savings: ~65% vs manual integration
Error Prevention: 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)
Production Ready: Yes
Last verified: 2026-01-21 | Skill version: 2.0.0 | Changes: Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model