Playwright Web Scraping Skill: Automate Scraping of Dynamic Websites, Handle Logins and Pagination, and Capture Screenshots

playwright-scraper by alphaonedev/openclaw-graph

845 weekly installs

GitHub

Install command

npx skills add https://github.com/alphaonedev/openclaw-graph --skill playwright-scraper

Development, automation, testing

playwright-scraper

Purpose

This skill enables web scraping using Playwright, a Node.js library for browser automation. It focuses on handling dynamic content, authentication flows, pagination, data extraction, and screenshots to reliably scrape modern websites.

When to Use

Use this skill to scrape sites with JavaScript-rendered content (e.g., React or Angular apps), sites that require login (e.g., dashboards), multi-page results (e.g., search results), or visual data (e.g., screenshots for verification). Avoid it for static HTML sites, where a plain HTTP client (for example, Python's requests) suffices.

Key Capabilities

Dynamically load and interact with content using Playwright's browser control.
Manage authentication flows, such as logging in via forms or API tokens.
Handle pagination by navigating pages, clicking "next" buttons, or parsing URLs.
Extract data using selectors, with options for JSON output or file saves.
Capture screenshots or full-page PDFs for debugging or reporting.
Run in headless or visible (headed) browser mode for flexibility.
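
The extraction capability above can be sketched as a small helper that collects text by selector and saves it as JSON. This is a minimal sketch, not the skill's own implementation: the function name extractToJson and its parameters are hypothetical, and page is assumed to be a Playwright Page object created elsewhere.

```javascript
const fs = require('node:fs/promises');

/**
 * Collect the trimmed text of every element matching `selector` and
 * write the result to `outPath` as pretty-printed JSON.
 * `page` is assumed to be a Playwright Page; `extractToJson` is a
 * hypothetical helper name, not part of Playwright.
 */
async function extractToJson(page, selector, outPath) {
  // $$eval runs the callback in the browser against all matching elements.
  const items = await page.$$eval(selector, (els) =>
    els.map((el) => el.textContent.trim())
  );
  await fs.writeFile(outPath, JSON.stringify(items, null, 2), 'utf8');
  return items;
}
```

With a real page, a call might look like await extractToJson(page, '.result-item', 'results.json'). Returning the array as well as writing the file lets the caller log or post-process the data without re-reading it.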

Usage Patterns

Always initialize a browser context first, then create pages for navigation. Playwright's Node.js API is promise-based, so await every call rather than relying on fixed delays. For authenticated scraping, handle cookies or sessions per context, since each context keeps its own storage. Structure scripts to loop through pages for pagination, and wrap interactions with flaky elements in try-catch. Pass configuration via JSON files or environment variables for reusability.
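
The context-first pattern described above might look like the following. It is a sketch under stated assumptions: withPage and storageStatePath are hypothetical names, and browser is assumed to be a Playwright Browser that was launched elsewhere. The storageState option is how a context can reuse cookies saved earlier with context.storageState({ path }).

```javascript
/**
 * Run `task(page)` in a fresh browser context and always close the
 * context, even when the task throws.
 * `browser` is assumed to be a Playwright Browser; `withPage` and
 * `storageStatePath` are hypothetical names for illustration.
 */
async function withPage(browser, task, storageStatePath) {
  // Passing storageState restores cookies and localStorage that were
  // saved earlier via context.storageState({ path }).
  const context = await browser.newContext(
    storageStatePath ? { storageState: storageStatePath } : {}
  );
  try {
    const page = await context.newPage();
    return await task(page);
  } finally {
    // Closing the context also closes every page opened in it.
    await context.close();
  }
}
```

One context per logical session keeps cookies from leaking between scraping jobs, and the finally block avoids leaving contexts open when a selector or navigation fails.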

Common Commands/API

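Use Playwright's Node.js API. Install via npm install playwright. The snippets below assume const playwright = require('playwright'); and run inside an async function. Key methods include: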

Launch browser: const browser = await playwright.chromium.launch({ headless: true });

Navigate page: const page = await browser.newPage(); await page.goto('https://example.com');

Handle auth: await page.fill('#username', process.env.PLAYWRIGHT_USERNAME); await page.fill('#password', process.env.PLAYWRIGHT_PASSWORD); await page.click('#login');

Extract data: const data = await page.evaluate(() => document.querySelector('#target').innerText); console.log(data);

Pagination: while (await page.$('#next-button')) { await page.click('#next-button'); await page.waitForSelector('.item'); }

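Take screenshot: await page.screenshot({ path: 'screenshot.png' }); add fullPage: true to capture the whole scrollable page.

CLI flags: --headed (visible mode) and --timeout 30000 (extended waits) are options of the Playwright Test runner, npx playwright test, and apply only to test files run through it. A standalone scraper script is run with node; set headless in launch() and pass timeouts to individual calls.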

Integration Notes

Integrate by importing Playwright in a Node.js project. For auth, read credentials from environment variables such as PLAYWRIGHT_USERNAME and PLAYWRIGHT_PASSWORD rather than hardcoding them. Configuration format: keep settings in a JSON file, e.g., { "url": "https://target.com", "selector": "#data-element" }, and pass it via script args: node scraper.js --config config.json. In larger systems, the scraper can sit alongside existing Puppeteer code during a migration, and values returned from page.evaluate can be written to a database. The source targets Node.js 14+; recent Playwright releases require a newer Node.js, so check the requirement of the installed version. Configure a proxy at launch: playwright.chromium.launch({ proxy: { server: 'http://myproxy.com:8080' } }).
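
The configuration flow described above could be sketched as follows. The helper names configPathFromArgs and loadConfig are hypothetical, and the validation of url and selector is an assumption based on the sample JSON, not a documented schema.

```javascript
const fs = require('node:fs');

/**
 * Return the path that follows `--config` in an argv-style array,
 * or null when the flag or its value is missing.
 */
function configPathFromArgs(argv) {
  const i = argv.indexOf('--config');
  return i !== -1 && i + 1 < argv.length ? argv[i + 1] : null;
}

/**
 * Load the JSON config and overlay credentials from the environment.
 * Assumption: `url` and `selector` are required string keys, as in
 * the sample config; the real skill may accept more.
 */
function loadConfig(argv, env) {
  const file = configPathFromArgs(argv);
  if (!file) throw new Error('Usage: node scraper.js --config config.json');
  const config = JSON.parse(fs.readFileSync(file, 'utf8'));
  for (const key of ['url', 'selector']) {
    if (typeof config[key] !== 'string') {
      throw new Error(`config.${key} is required`);
    }
  }
  return {
    ...config,
    username: env.PLAYWRIGHT_USERNAME,
    password: env.PLAYWRIGHT_PASSWORD,
  };
}
```

In a script this would typically be called as loadConfig(process.argv, process.env). Taking argv and env as parameters, rather than reading the globals directly, keeps the function easy to exercise without touching the real environment.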

Error Handling

Anticipate common errors such as timeouts on dynamic loads and selector failures. Use page.waitForSelector with a timeout: await page.waitForSelector('#element', { timeout: 10000 }).catch(err => console.error('Element not found:', err));. This logs and continues, so later code must tolerate the missing element. For network issues, wrap page.goto in try-catch and rethrow after cleanup so the script does not continue with a closed browser: try { await page.goto(url, { waitUntil: 'networkidle' }); } catch (e) { console.error('Navigation failed:', e.message); await browser.close(); throw e; }. Detect authentication failures by checking for an error element after submitting the form: if (await page.$('#error-message')) { throw new Error('Login failed'); }. Log errors with details and retry up to 3 times using a loop.
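
The retry loop mentioned above is not spelled out in the source; one way to write it is sketched below. The helper name withRetry and the delayMs parameter are hypothetical, and the default of 3 attempts follows the text.

```javascript
/**
 * Call `action(attempt)` until it succeeds, at most `attempts` times,
 * waiting `delayMs` between tries. Rethrows the last error if every
 * attempt fails. `withRetry` is a hypothetical helper name.
 */
async function withRetry(action, attempts = 3, delayMs = 1000) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await action(attempt);
    } catch (err) {
      lastError = err;
      console.error(`Attempt ${attempt}/${attempts} failed: ${err.message}`);
      if (attempt < attempts) {
        // Pause before the next try; no pause after the final failure.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

With a real page, navigation could be wrapped as await withRetry(() => page.goto(url, { waitUntil: 'networkidle' }));. Rethrowing the last error leaves the decision to close the browser or skip the URL with the caller.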

Concrete Usage Examples

Scraping a logged-in dashboard: First, set env vars: export PLAYWRIGHT_USERNAME='user@example.com' and export PLAYWRIGHT_PASSWORD='securepass'. Then, inside an async function, run: const playwright = require('playwright'); const browser = await playwright.chromium.launch(); const page = await browser.newPage(); await page.goto('https://dashboard.com/login'); await page.fill('#username', process.env.PLAYWRIGHT_USERNAME); await page.fill('#password', process.env.PLAYWRIGHT_PASSWORD); await page.click('#submit'); await page.waitForSelector('#dashboard-data'); const data = await page.evaluate(() => document.querySelector('#dashboard-data').innerText); console.log(data); await browser.close(); This extracts data from a protected page; the waitForSelector call keeps the extraction from running before the dashboard has rendered.
Handling pagination on a search site: Script (inside an async function): const playwright = require('playwright'); const browser = await playwright.chromium.launch(); const page = await browser.newPage(); await page.goto('https://search.com?q=query'); let items = []; while (true) { items.push(...await page.$$eval('.result-item', elements => elements.map(el => el.innerText))); const nextButton = await page.$('#next-page'); if (!nextButton) break; await nextButton.click(); await page.waitForTimeout(2000); } console.log(items); await browser.close(); This collects results across multiple pages. The fixed 2-second delay is a blunt instrument; waiting for a selector or load state specific to the site is more reliable where one exists.
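
The pagination example can also be written as a reusable function with an upper bound on pages, which guards against a "next" control that never disappears. This is a sketch, not the skill's own code: collectAllPages and maxPages are hypothetical names, page is assumed to be a Playwright Page, and waiting for the networkidle load state after each click is an assumption that will not suit every site.

```javascript
/**
 * Collect text from `itemSelector` on each page, following
 * `nextSelector` until it is absent or `maxPages` is reached.
 * `page` is assumed to be a Playwright Page; `collectAllPages` and
 * `maxPages` are hypothetical names for illustration.
 */
async function collectAllPages(page, itemSelector, nextSelector, maxPages = 50) {
  const items = [];
  for (let n = 1; n <= maxPages; n++) {
    const texts = await page.$$eval(itemSelector, (els) =>
      els.map((el) => el.innerText)
    );
    items.push(...texts);
    const next = await page.$(nextSelector);
    if (!next) break; // no further pages
    await next.click();
    // Assumption: results are ready once the network goes quiet;
    // adjust this wait to whatever signal the target site provides.
    await page.waitForLoadState('networkidle');
  }
  return items;
}
```

For the search example above, the call would be await collectAllPages(page, '.result-item', '#next-page').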

Graph Relationships

Related to: "selenium-automation" (alternative browser automation tool)
Depends on: "node-runtime" (for Playwright execution)
Complements: "data-extraction" (for post-processing scraped data)
In cluster: "community" (shared with other open-source tools)
