playwright-web-scraper by dawiddutoit/custom-claude
npx skills add https://github.com/dawiddutoit/custom-claude --skill playwright-web-scraper
Extract structured data from multiple web pages with respectful, ethical crawling practices.
Use when the task involves extracting structured data from websites, e.g. "scrape data from", "extract information from pages", "collect data from site", or "crawl multiple pages".
Do NOT use for testing workflows (use playwright-e2e-testing), monitoring errors (use playwright-console-monitor), or analyzing network traffic (use playwright-network-analyzer). Always respect robots.txt and rate limits.
Scrape product listings from an e-commerce site:
// 1. Validate URLs
python scripts/validate_urls.py urls.txt
// 2. Scrape pages with rate limiting
const results = [];
for (const url of urls) {
  await browser_navigate({ url });
  await browser_wait_for({ time: Math.random() * 2 + 1 }); // 1-3s delay
  const data = await browser_evaluate({
    function: `
      Array.from(document.querySelectorAll('.product')).map(el => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
        url: el.querySelector('a')?.getAttribute('href')
      }))
    `
  });
  results.push(...data);
}
// 3. Process results
python scripts/process_results.py scraped.json -o products.csv
Create a text file with URLs to scrape (one per line):
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
Validate URLs and check robots.txt compliance:
python scripts/validate_urls.py urls.txt --user-agent "MyBot/1.0"
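The kind of robots.txt check validate_urls.py performs can be sketched with Python's standard library; this is an illustrative minimum (the robots.txt rules below are made up, and the script's actual behavior may differ):

```python
from urllib.robotparser import RobotFileParser

def allowed(url, robots_txt, user_agent="MyBot/1.0"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Illustrative robots.txt that disallows /private/ for all agents
rules = """User-agent: *
Disallow: /private/
"""

print(allowed("https://example.com/products?page=1", rules))  # True
print(allowed("https://example.com/private/admin", rules))    # False
```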
Navigate to the site and take a snapshot to understand structure:
await browser_navigate({ url: firstUrl });
await browser_snapshot();
Identify CSS selectors for data extraction using the snapshot.
Use random delays between requests (1-3 seconds minimum):
const results = [];
for (const url of urlList) {
  // Navigate to page
  await browser_navigate({ url });
  // Wait for content to load
  await browser_wait_for({ text: 'Expected content marker' });
  // Add respectful delay (1-3 seconds)
  const delay = Math.random() * 2 + 1;
  await browser_wait_for({ time: delay });
  // Extract data
  const pageData = await browser_evaluate({
    function: `/* extraction code */`
  });
  results.push(...pageData);
  // Check console messages for errors/warnings (don't shadow the global console object)
  const consoleMessages = await browser_console_messages();
  // Monitor for rate limit warnings
}
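The 1-3 second jitter used above is easy to express as a small helper; a sketch in Python (the bounds are this guide's suggested minimums, not a library default):

```python
import random

def polite_delay(min_s=1.0, max_s=3.0):
    """Random per-request delay in [min_s, max_s) to avoid a fixed request cadence."""
    return min_s + random.random() * (max_s - min_s)

delay = polite_delay()  # somewhere between 1 and 3 seconds
```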
Use browser_evaluate to extract data with JavaScript:
const data = await browser_evaluate({
  function: `
    try {
      return Array.from(document.querySelectorAll('.item')).map(el => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
        rating: el.querySelector('.rating')?.textContent?.trim(),
        url: el.querySelector('a')?.getAttribute('href')
      })).filter(item => item.title && item.price); // Filter incomplete records
    } catch (e) {
      console.error('Extraction failed:', e);
      return [];
    }
  `
});
See references/extraction-patterns.md for comprehensive extraction patterns.
Monitor for rate limiting indicators:
// Check HTTP responses via browser_network_requests
const requests = await browser_network_requests();
const rateLimited = requests.some(r => r.status === 429 || r.status === 503);
if (rateLimited) {
  // Back off exponentially
  await browser_wait_for({ time: 10 }); // Wait 10 seconds
  // Retry or skip
}
// Check console messages for blocking indicators (don't shadow the global console object)
const blockingMessages = await browser_console_messages({ pattern: 'rate limit|blocked|captcha' });
if (blockingMessages.length > 0) {
  // Handle blocking
}
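The back-off-and-retry logic generalizes to a small helper that doubles the wait after each rate-limited response; a sketch with a stubbed fetcher (fetch_fn, the injectable sleep, and the stub are illustrative assumptions, not part of this skill's API):

```python
import time

def fetch_with_backoff(fetch_fn, url, base_wait=5, retries=3, sleep=time.sleep):
    """Retry fetch_fn(url) on 429/503, doubling the wait each attempt."""
    wait = base_wait
    for attempt in range(retries):
        status, body = fetch_fn(url)
        if status not in (429, 503):
            return status, body
        sleep(wait)
        wait *= 2  # exponential backoff: 5s -> 10s -> 20s
    return status, body  # give up after the last attempt

# Stub that rate-limits the first two requests, then succeeds
calls = []
def stub(url):
    calls.append(url)
    return (200, "ok") if len(calls) >= 3 else (429, "")

waits = []
status, body = fetch_with_backoff(stub, "https://example.com", sleep=waits.append)
print(status, waits)  # 200 [5, 10]
```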
Save results to a JSON file:
// In your scraping script (Node.js)
const fs = require('fs');
fs.writeFileSync('scraped.json', JSON.stringify({ results }, null, 2));
Process and convert to desired format:
# View statistics
python scripts/process_results.py scraped.json --stats
# Convert to CSV
python scripts/process_results.py scraped.json -o output.csv
# Convert to Markdown table
python scripts/process_results.py scraped.json -o output.md
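The JSON-to-CSV conversion process_results.py performs can be sketched with the standard library (a minimal illustration, assuming records are flat dicts; the script's actual options and behavior may differ):

```python
import csv
import io

def records_to_csv(records):
    """Render a list of dicts as CSV text, header = union of keys in first-seen order."""
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells
    return out.getvalue()

print(records_to_csv([{"title": "Widget", "price": "$9.99"}]))
# title,price
# Widget,$9.99
```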
Always add delays between requests:
// Random delay between 1-3 seconds
const randomDelay = () => Math.random() * 2 + 1;
await browser_wait_for({ time: randomDelay() });

// Exponential backoff after rate limit
let backoffSeconds = 5;
for (let retry = 0; retry < 3; retry++) {
  try {
    await browser_navigate({ url });
    break; // Success
  } catch (e) {
    await browser_wait_for({ time: backoffSeconds });
    backoffSeconds *= 2; // Double the delay each retry
  }
}
Adjust delays based on response:
| Response Code | Action |
|---|---|
| 200 OK | Continue with normal delay (1-3s) |
| 429 Too Many Requests | Increase delay to 10s, retry |
| 503 Service Unavailable | Wait 60s, then retry |
| 403 Forbidden | Stop scraping this domain |
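The response-code table maps naturally onto a small dispatch function; a sketch (the action labels are illustrative, the delays are the table's suggestions):

```python
def next_action(status):
    """Map an HTTP status code to an action per the response-code table above."""
    if status == 200:
        return "continue"        # keep the normal 1-3s delay
    if status == 429:
        return "retry-after-10s"
    if status == 503:
        return "retry-after-60s"
    if status == 403:
        return "stop"            # do not keep hitting this domain
    return "retry-after-10s"     # other errors: back off conservatively
```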
See references/ethical-scraping.md for detailed rate limiting strategies.
Use validate_urls.py before scraping to ensure compliance:
# Basic validation
python scripts/validate_urls.py urls.txt
# Check robots.txt with specific user agent
python scripts/validate_urls.py urls.txt --user-agent "MyBot/1.0"
# Strict mode (exit on any invalid/disallowed URL)
python scripts/validate_urls.py urls.txt --strict
The output includes per-URL validation status, robots.txt compliance, and grouping by domain.
// Single page extraction
const data = await browser_evaluate({
  function: `
    Array.from(document.querySelectorAll('.item')).map(el => ({
      field1: el.querySelector('.selector1')?.textContent?.trim(),
      field2: el.querySelector('.selector2')?.getAttribute('href')
    }))
  `
});
// Paginated extraction
const results = [];
let hasMore = true;
let page = 1;
while (hasMore) {
  await browser_navigate({ url: `${baseUrl}?page=${page}` });
  await browser_wait_for({ time: randomDelay() });
  const pageData = await browser_evaluate({ function: extractionCode });
  results.push(...pageData);
  // Check for next page
  hasMore = await browser_evaluate({
    function: `document.querySelector('.next:not(.disabled)') !== null`
  });
  page++;
}
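The pagination loop can be sketched end to end with a stubbed page fetcher (fetch_page, has_next, and the hard page cap are illustrative assumptions standing in for browser_navigate / browser_evaluate):

```python
def scrape_all_pages(fetch_page, max_pages=100):
    """Collect items page by page until the fetcher reports no next page."""
    results = []
    page = 1
    while page <= max_pages:  # hard cap so a bad next-page check can't loop forever
        items, has_next = fetch_page(page)
        results.extend(items)
        if not has_next:
            break
        page += 1
    return results

# Stub: three pages of two items each
def stub(page):
    items = [f"item-{page}-{i}" for i in (1, 2)]
    return items, page < 3

print(len(scrape_all_pages(stub)))  # 6
```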
See references/extraction-patterns.md for more extraction patterns.
Handle navigation and extraction failures for each URL:
// Inside the per-URL loop:
try {
  await browser_navigate({ url });
} catch (e) {
  console.error(`Failed to load ${url}:`, e);
  failedUrls.push(url);
  continue; // Skip to next URL
}
const data = await browser_evaluate({ function: extractionCode });
if (!data || data.length === 0) {
  console.warn(`No data extracted from ${url}`);
  // Log for manual review
}
// Validate data structure
const validData = data.filter(item =>
  item.title && item.price // Ensure required fields exist
);
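The required-field filter generalizes to any record shape; a sketch (the field names match the example above):

```python
def valid_records(records, required=("title", "price")):
    """Keep only records where every required field is present and non-empty."""
    return [r for r in records if all(r.get(f) for f in required)]

rows = [
    {"title": "Widget", "price": "$9.99"},
    {"title": "", "price": "$1.00"},  # empty title: dropped
    {"title": "Gadget"},              # missing price: dropped
]
print(valid_records(rows))  # [{'title': 'Widget', 'price': '$9.99'}]
```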
Check for blocking/errors:
// Monitor console messages (don't shadow the global console object)
const messages = await browser_console_messages({
  pattern: 'error|rate|limit|captcha',
  onlyErrors: true
});
if (messages.length > 0) {
  console.log('Warnings detected:', messages);
}
// Monitor network
const requests = await browser_network_requests();
const errors = requests.filter(r => r.status >= 400);
python scripts/process_results.py scraped.json --stats
Output:
📊 Statistics:
Total records: 150
Fields (5): title, price, rating, url, image
Sample record: {...}
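The --stats summary amounts to a single pass over the records; a sketch of the idea (the returned shape is illustrative, not the script's exact output):

```python
def stats(records):
    """Summarize record count and the union of field names in first-seen order."""
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    return {"total": len(records), "fields": fields}

records = [
    {"title": "A", "price": "$1"},
    {"title": "B", "price": "$2", "rating": "4.5"},
]
print(stats(records))  # {'total': 2, 'fields': ['title', 'price', 'rating']}
```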
# To CSV
python scripts/process_results.py scraped.json -o products.csv
# To JSON (compact)
python scripts/process_results.py scraped.json -o products.json --compact
# To Markdown table
python scripts/process_results.py scraped.json -o products.md
# Convert and view statistics in one step
python scripts/process_results.py scraped.json -o products.csv --stats
scripts/validate_urls.py - Validate URL lists, check robots.txt compliance, group by domain
scripts/process_results.py - Convert scraped JSON to CSV/JSON/Markdown, view statistics
references/ethical-scraping.md - Comprehensive guide to rate limiting, robots.txt, error handling, and monitoring
references/extraction-patterns.md - JavaScript patterns for data extraction, selectors, pagination, tables

✅ Validated 50 URLs
✅ Scraped 50 pages in 5 minutes (6 req/min)
✅ Extracted 1,250 products
✅ Zero rate limit errors
✅ Exported to products.csv (1,250 rows)
⚠️ Validated 50 URLs (2 disallowed by robots.txt)
✅ Scraped 48 pages
⚠️ 3 pages returned no data (logged for review)
✅ Extracted 1,100 products
⚠️ 1 rate limit warning (backed off successfully)
✅ Exported to products.csv (1,100 rows)
❌ Rate limited after 20 pages (429 responses)
✅ Backed off exponentially (5s → 10s → 20s)
✅ Resumed scraping successfully
✅ Extracted 450 products from 25 pages
| Metric | Before | After |
|---|---|---|
| Setup time | 30-45 min | 5-10 min |
| Rate limit errors | Common | Rare |
| robots.txt violations | Possible | Prevented |
| Data format conversion | Manual | Automated |
| Error detection | Manual review | Automated monitoring |
Run validate_urls.py before scraping.

Weekly Installs: 52
First Seen: Jan 23, 2026
Installed on: gemini-cli (51), codex (51), opencode (51), github-copilot (50), cursor (50), amp (49)