npx skills add https://github.com/alphaonedev/openclaw-graph --skill playwright-scraper
This skill enables web scraping with Playwright, a Node.js library for browser automation. It handles dynamic content, authentication flows, pagination, data extraction, and screenshots so that modern websites can be scraped reliably.
Use this skill to scrape sites with JavaScript-rendered content (e.g., React or Angular apps) or sites that require login (e.g., dashboards), to handle multi-page results (e.g., search results), or to capture visual data (e.g., screenshots for verification). Avoid it for static HTML sites, where a simpler tool such as requests suffices.
Always initialize a browser context first, then create pages for navigation. Use async patterns for reliability. For authenticated scraping, handle cookies or sessions per context. Structure scripts to loop through pages for pagination and use try-catch for flaky elements. Pass configurations via JSON files or environment variables for reusability.
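The workflow above (browser, then context, then page, with try-catch around flaky elements) could be sketched roughly as follows. The function name `scrapeText` and its `{ url, selector }` argument are hypothetical, and the browser type is passed in as a parameter so the sketch does not depend on how Playwright is loaded; in real use it would be `require('playwright').chromium`.

```javascript
// Minimal sketch of the workflow: browser -> context -> page, with cleanup.
// `scrapeText` and its options are illustrative names, not part of Playwright.
async function scrapeText(browserType, { url, selector }) {
  const browser = await browserType.launch({ headless: true });
  try {
    // A context isolates cookies and session state from other contexts.
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto(url);
    try {
      // Flaky element: wait with a timeout, return null if it never appears.
      await page.waitForSelector(selector, { timeout: 10000 });
      return await page.textContent(selector);
    } catch (err) {
      console.error(`Selector ${selector} not found:`, err.message);
      return null;
    }
  } finally {
    // Always release the browser, even when navigation throws.
    await browser.close();
  }
}
```

A call might look like `await scrapeText(require('playwright').chromium, { url: 'https://example.com', selector: '#target' })`. Closing the browser in `finally` is a design choice that keeps failed runs from leaving browser processes behind.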
Use Playwright's Node.js API. Install via npm install playwright. Key methods include:
const playwright = require('playwright');
const browser = await playwright.chromium.launch({ headless: true });
// Open a page and navigate
const page = await browser.newPage();
await page.goto('https://example.com');
// Log in with credentials taken from the environment
await page.fill('#username', process.env.PLAYWRIGHT_USERNAME);
await page.fill('#password', process.env.PLAYWRIGHT_PASSWORD);
await page.click('#login');
// Extract text from an element
const data = await page.evaluate(() => document.querySelector('#target').innerText);
console.log(data);
// Paginate while a next button is present
while (await page.$('#next-button')) {
  await page.click('#next-button');
  await page.waitForSelector('.item');
}
// Capture a screenshot
await page.screenshot({ path: 'screenshot.png' });

CLI flags for running scripts: if the scraper is written as a Playwright Test spec, run it with npx playwright test and flags such as --headed for visible mode or --timeout 30000 for extended waits. A plain Node.js script is started with node directly, and these flags do not apply to it.

Integrate by importing Playwright in Node.js projects. For auth, use environment variables like $PLAYWRIGHT_USERNAME and $PLAYWRIGHT_PASSWORD to avoid hardcoding credentials.

Configuration format: use a JSON file for settings, e.g., { "url": "https://target.com", "selector": "#data-element" }. Pass it via script args: node scraper.js --config config.json.

For larger systems, run the scraper alongside existing Puppeteer scripts (if migrating) or write the values returned by page.evaluate to a database. Ensure compatibility with Node.js 14+ (newer Playwright releases may require a later Node.js version, so check the release you install) and handle proxy settings with playwright.chromium.launch({ proxy: { server: 'http://myproxy.com:8080' } }).
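One possible way to wire up the `--config config.json` argument and the credential variables is sketched below. The helper name `loadConfig` and the returned `username` and `password` fields are assumptions made for illustration; the source only specifies the JSON shape and the variable names.

```javascript
const fs = require('fs');

// Hypothetical helper: read the path that follows --config and parse it as JSON.
function loadConfig(argv = process.argv.slice(2), env = process.env) {
  const i = argv.indexOf('--config');
  if (i === -1 || !argv[i + 1]) {
    throw new Error('Usage: node scraper.js --config <path-to-json>');
  }
  const config = JSON.parse(fs.readFileSync(argv[i + 1], 'utf8'));
  // Fail early if a required setting is missing from the file.
  for (const key of ['url', 'selector']) {
    if (typeof config[key] !== 'string') {
      throw new Error(`Config is missing "${key}"`);
    }
  }
  // Credentials come from the environment, never from the JSON file.
  return {
    ...config,
    username: env.PLAYWRIGHT_USERNAME,
    password: env.PLAYWRIGHT_PASSWORD,
  };
}
```

Inside scraper.js, `const config = loadConfig();` would then supply `config.url` and `config.selector` to the Playwright calls. Keeping credentials out of the file means the JSON can be committed without exposing secrets.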
Anticipate common errors such as timeouts on dynamic loads and selector failures.

Use page.waitForSelector with a timeout:
await page.waitForSelector('#element', { timeout: 10000 }).catch(err => console.error('Element not found:', err));

For network issues, wrap page.goto in try-catch:
try { await page.goto(url, { waitUntil: 'networkidle' }); } catch (e) { console.error('Navigation failed:', e.message); await browser.close(); }

Handle authentication failures by checking for error elements:
if (await page.$('#error-message')) { throw new Error('Login failed'); }

Log errors with details and retry up to 3 times using a loop.
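The "retry up to 3 times using a loop" advice could be implemented along these lines. `withRetry` is a hypothetical helper, and the fixed delay between attempts is an assumption that the source does not specify.

```javascript
// Hypothetical helper: run an async action up to `attempts` times.
async function withRetry(action, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await action(attempt);
    } catch (err) {
      lastError = err;
      // Log the failure with enough detail to diagnose it later.
      console.error(`Attempt ${attempt}/${attempts} failed: ${err.message}`);
      if (attempt < attempts) {
        // Pause before the next attempt (fixed delay, an assumption).
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Navigation could then be wrapped as `await withRetry(() => page.goto(url, { waitUntil: 'networkidle' }))`, so a transient network failure does not abort the whole run.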
export PLAYWRIGHT_USERNAME='user@example.com'
export PLAYWRIGHT_PASSWORD='securepass'

Then, run:
const playwright = require('playwright');
(async () => {
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://dashboard.com/login');
  await page.fill('#username', process.env.PLAYWRIGHT_USERNAME);
  await page.fill('#password', process.env.PLAYWRIGHT_PASSWORD);
  await page.click('#submit');
  const data = await page.evaluate(() => document.querySelector('#dashboard-data').innerText);
  console.log(data);
  await browser.close();
})();
This extracts data from a protected page.

const playwright = require('playwright');
(async () => {
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://search.com?q=query');
  let items = [];
  while (true) {
    // Collect the text of every result on the current page
    items.push(...await page.$$eval('.result-item', elements => elements.map(el => el.innerText)));
    const nextButton = await page.$('#next-page');
    if (!nextButton) break;
    await nextButton.click();
    await page.waitForTimeout(2000);
  }
  console.log(items);
  await browser.close();
})();
This collects results across multiple pages.
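The pagination loop can run forever if the next button never disappears, so a bounded variant may be safer. The sketch below is one such variant. `collectAllPages` and its `maxPages` option are hypothetical names, and waiting on the item selector in place of a fixed two-second pause is a swapped-in choice that assumes results re-render under the same selector.

```javascript
// Hypothetical helper: collect text from every result page, with a page cap.
// `page` is expected to be a Playwright Page (or an object with the same methods).
async function collectAllPages(page, { itemSelector, nextSelector, maxPages = 50 }) {
  const items = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
    // Gather the text of every matching element on the current page.
    const texts = await page.$$eval(itemSelector, (els) => els.map((el) => el.innerText));
    items.push(...texts);
    const next = await page.$(nextSelector);
    // Stop when there is no next button or the cap has been reached.
    if (!next || pageNo === maxPages) break;
    await next.click();
    // Assumes the next page renders its results under the same selector.
    await page.waitForSelector(itemSelector);
  }
  return items;
}
```

With the selectors from the example, the call would be `await collectAllPages(page, { itemSelector: '.result-item', nextSelector: '#next-page' })`.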