crawl4ai by brettdavies/crawl4ai-skill
npx skills add https://github.com/brettdavies/crawl4ai-skill --skill crawl4ai
Crawl4AI provides comprehensive web crawling and data extraction capabilities. This skill supports both CLI (recommended for quick tasks) and Python SDK (for programmatic control).
Choose your interface:
CLI (crwl) - Quick, scriptable commands: CLI Guide

pip install crawl4ai
crawl4ai-setup
# Verify installation
crawl4ai-doctor
# Basic crawling - returns markdown
crwl https://example.com
# Get markdown output
crwl https://example.com -o markdown
# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache
# See more examples
crwl --example
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
For SDK configuration details: SDK Guide - Configuration (lines 61-150)
Both CLI and SDK use the same underlying configuration:
| Concept | CLI | SDK |
|---|---|---|
| Browser settings | -B browser.yml or -b "param=value" | BrowserConfig(...) |
| Crawl settings | -C crawler.yml or -c "param=value" | CrawlerRunConfig(...) |
| Extraction | -e extract.yml -s schema.json | extraction_strategy=... |
| Content filter | -f filter.yml | markdown_generator=... |
Browser Configuration:
- headless: Run with/without GUI
- viewport_width / viewport_height: Browser dimensions
- user_agent: Custom user agent
- proxy_config: Proxy settings

Crawler Configuration:
- page_timeout: Max page load time (ms)
- wait_for: CSS selector or JS condition to wait for
- cache_mode: bypass, enabled, or disabled
- js_code: JavaScript to execute
- css_selector: Focus on specific elements

For complete parameters: CLI Config | SDK Config
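The `-c "param=value"` shorthand is just a comma-separated list of key=value pairs that map onto the same config keys as the YAML files. A rough sketch of how such a string could be parsed into a config dict (an illustration of the mapping, not the actual crwl parser):

```python
def parse_params(spec: str) -> dict:
    """Parse a -c/-b style "key=value,key=value" string into a dict.
    Splits only on the first '=' so values like css:.x survive intact."""
    out = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        # Coerce obvious scalars; everything else stays a string.
        if value.isdigit():
            out[key] = int(value)
        elif value in ("true", "false"):
            out[key] = value == "true"
        else:
            out[key] = value
    return out

print(parse_params("wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"))
# → {'wait_for': 'css:.ajax-content', 'scan_full_page': True, 'page_timeout': 60000}
```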
Every crawl returns a CrawlResult object; useful fields include result.markdown, result.html, result.links, and result.success.
Crawl4AI excels at generating clean, well-formatted markdown:
# Basic markdown
crwl https://docs.example.com -o markdown
# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit
# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
Filter configuration:
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
# Runs inside an `async with AsyncWebCrawler() as crawler:` block
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
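The BM25 filter keeps only the text chunks whose relevance score against the query clears the threshold. The scoring idea can be sketched in pure Python (a simplified scorer for intuition, not crawl4ai's implementation):

```python
import math
from collections import Counter

def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each text chunk against the query, BM25-style."""
    docs = [c.lower().split() for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for doc in docs if term in doc)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            f = tf[term]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

chunks = [
    "machine learning tutorials for beginners",
    "site navigation footer copyright",
]
scores = bm25_scores("machine learning", chunks)
print(scores[0] > scores[1])  # the relevant chunk outranks the boilerplate
```

A chunk scoring below the threshold (bm25_threshold=1.0 above) is dropped from fit_markdown.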
For content filters: Content Processing (lines 2481-3101)
Schema-based CSS extraction needs no LLM - fast, deterministic, and cost-free.
CLI:
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
Schema format:
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
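Since a malformed schema tends to extract nothing rather than fail loudly, it can help to sanity-check the JSON before a run. A stdlib-only validator sketch (the key names follow the schema format above; the script itself is not part of crawl4ai):

```python
import json

REQUIRED_FIELD_KEYS = {"name", "selector", "type"}

def validate_schema(schema: dict) -> list[str]:
    """Return a list of problems found in an extraction schema."""
    problems = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for i, field in enumerate(schema.get("fields", [])):
        missing = REQUIRED_FIELD_KEYS - field.keys()
        if missing:
            problems.append(f"field {i} missing: {sorted(missing)}")
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"field {i}: type 'attribute' needs an 'attribute' key")
    return problems

schema = json.loads("""
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
""")
print(validate_schema(schema))  # [] when the schema is well-formed
```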
For complex or irregular content:
CLI:
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
crwl https://shop.com -e extract_llm.yml -o json
For extraction details: Extraction Strategies (lines 4522-5429)
CLI:
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
Crawler config:
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
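Conceptually, wait_for boils down to polling a condition until it becomes true or page_timeout expires. The semantics can be sketched in plain Python (a toy model, not the browser-side implementation):

```python
import time

def wait_for(condition, timeout_ms: int = 60000, poll_ms: int = 100) -> bool:
    """Poll `condition` until it returns True or `timeout_ms` elapses,
    mirroring what a wait_for=css:... setting does inside the browser."""
    deadline = time.monotonic() + timeout_ms / 1000
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_ms / 1000)
    return False

# Toy condition: "content" appears only after a few polls.
state = {"polls": 0}
def content_loaded():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_for(content_loaded, timeout_ms=2000, poll_ms=10))  # True
```

If the condition never becomes true, the crawl fails with a timeout after page_timeout milliseconds, which is why raising it (60000 above) is the usual fix for slow AJAX pages.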
CLI (sequential):
for url in url1 url2 url3; do crwl "$url" -o markdown; done
Python SDK (concurrent):
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
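arun_many() manages concurrency internally. The same fan-out-with-a-limit pattern can be sketched with plain asyncio, where crawl_one stands in for crawler.arun():

```python
import asyncio

async def crawl_one(url: str) -> str:
    # Stand-in for crawler.arun(url); real code would return result.markdown.
    await asyncio.sleep(0.01)
    return f"markdown for {url}"

async def crawl_all(urls: list[str], limit: int = 2) -> list[str]:
    # Bounded concurrency: at most `limit` crawls in flight at once.
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with sem:
            return await crawl_one(url)

    # gather() preserves input order in its results.
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(crawl_all(["https://site1.com", "https://site2.com", "https://site3.com"]))
print(results)
```

Bounding concurrency matters for real crawls: an unbounded gather over many URLs opens that many browser pages at once.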
For batch processing: arun_many() Reference (lines 1057-1224)
CLI:
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
# Login
crwl https://site.com/login -C login_crawler.yml
# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
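Session reuse works because the same session_id maps back to the same live browser page, so cookies set during the login crawl are still present on the next crawl. A toy model of that mapping (for intuition only, not crawl4ai internals):

```python
class SessionPool:
    """Toy model of session_id reuse: the same id returns the same page object."""

    def __init__(self) -> None:
        self._pages: dict[str, dict] = {}

    def page_for(self, session_id: str) -> dict:
        # First use creates the page; later uses return the same object.
        if session_id not in self._pages:
            self._pages[session_id] = {"id": session_id, "cookies": {}}
        return self._pages[session_id]

pool = SessionPool()
login_page = pool.page_for("user_session")
login_page["cookies"]["auth"] = "token123"   # set during the login crawl
protected_page = pool.page_for("user_session")
print(protected_page["cookies"])  # the second crawl sees the login cookies
```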
For session management: Advanced Features (lines 5429-5940)
CLI:
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
crwl https://example.com -B browser.yml
crwl https://docs.example.com -o markdown > docs.md
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
# First view content
crwl https://example.com -o markdown
# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
| Document | Purpose |
|---|---|
| CLI Guide | Command-line interface reference |
| SDK Guide | Python SDK quick reference |
| Complete SDK Reference | Full API documentation (5900+ lines) |
Use --bypass-cache only when needed.

crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
crwl https://example.com -B browser.yml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
# Debug: see full output
crwl https://example.com -o all -v
# Try different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
For comprehensive API documentation, see Complete SDK Reference.
Weekly Installs: 285
Repository: brettdavies/crawl4ai-skill
GitHub Stars: 14
First Seen: Feb 6, 2026
Security Audits: Gen Agent Trust Hub: Pass | Socket: Pass | Snyk: Fail
Installed on: codex (262), opencode (262), gemini-cli (260), github-copilot (255), amp (247), kimi-cli (246)