scraper-builder by jwynia/agent-skills
npx skills add https://github.com/jwynia/agent-skills --skill scraper-builder
Generate complete, runnable web scraper projects using the PageObject pattern with Playwright and TypeScript. This skill produces site-specific scrapers with typed data extraction, Docker deployment, and optional agent-browser integration for automated site analysis.
Use this skill when:
Do NOT use this skill when:
Each page on the target site maps to one PageObject class. Locators are defined in the constructor, and scraping logic lives in methods. Page objects never contain assertions or business logic — they extract and return data.
Prefer selectors in this order: data-testid > id > semantic HTML (role, aria-label) > structured CSS classes > text content. Avoid positional selectors (nth-child) and layout-dependent paths. See references/playwright-selectors.md for the full hierarchy.
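The preference order can be encoded as a small fallback probe. This is an illustrative sketch, not part of the generated templates: `firstMatching` and the `ProbePage`/`ProbeLocator` interfaces are hypothetical names written against a structural subset of Playwright's real API (`page.locator()` and `locator.count()` both exist), so a real `Page` satisfies them.

```typescript
// Structural subset of the Playwright Page/Locator API used here.
interface ProbeLocator {
  count(): Promise<number>;
}
interface ProbePage {
  locator(selector: string): ProbeLocator;
}

// Try candidate selectors in resilience order and return the first one
// that matches at least one element, or null if none match.
async function firstMatching(
  page: ProbePage,
  candidates: string[],
): Promise<string | null> {
  for (const selector of candidates) {
    if ((await page.locator(selector).count()) > 0) return selector;
  }
  return null;
}

// Hypothetical candidates for a product title, ordered per the hierarchy:
// data-testid > id > semantic HTML > structured CSS class.
const titleCandidates = [
  '[data-testid="product-title"]',
  "#product-title",
  '[itemprop="name"]',
  ".product-title",
];
```

With a real Playwright `Page`, `firstMatching(page, titleCandidates)` picks the most resilient selector actually present on the target page.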
Reusable UI patterns (pagination, data tables, search bars) are modeled as component classes that page objects compose via properties. Only BasePage uses inheritance — everything else composes.
All scraped data flows through Zod schemas for validation. This catches selector drift (when a site changes its markup) at extraction time rather than downstream. See assets/templates/data-schema.ts.md.
Generated projects include a Dockerfile using Microsoft's official Playwright images and a docker-compose.yml with volume mounts for output data and debug screenshots. This ensures consistent browser environments across machines.
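The generated Dockerfile is roughly this shape — a sketch, assuming an npm-based layout with a `src/index.ts` entry point (the actual template lives in assets/configs/dockerfile.md):

```dockerfile
# Official Playwright image, pinned as in the defaults table
FROM mcr.microsoft.com/playwright:v1.48.0-jammy

WORKDIR /app

# Install dependencies first so this layer stays cacheable
COPY package*.json ./
RUN npm ci

COPY . .

# Entry point path is an assumption; output data and debug screenshots
# are volume-mounted via docker-compose
CMD ["npx", "tsx", "src/index.ts"]
```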
Use agent-browser to navigate the target site, capture accessibility tree snapshots, and automatically discover selectors. This is the preferred mode when the agent has access to the agent-browser CLI.
Prerequisites: If agent-browser is not already installed, add it as a skill first:
npx skills add vercel-labs/agent-browser
Workflow:
# 1. Open the target page
agent-browser open https://example.com/products
# 2. Capture interactive snapshot with element references
agent-browser snapshot -i --json > snapshot.json
# 3. Capture scoped sections for focused analysis
agent-browser snapshot -i --json -s "main" > main-content.json
agent-browser snapshot -i --json -s "nav" > navigation.json
# 4. Test dynamic behavior (pagination, load-more)
agent-browser click @e3
agent-browser wait --load networkidle
agent-browser snapshot -i --json > after-click.json
# 5. Close when done
agent-browser close
What the agent does with snapshots:
- Element references (@e1, @e2, etc.) and their roles

See references/agent-browser-workflow.md for the complete workflow reference.
The user describes the target site's page structure and the agent maps it to page objects. The agent asks structured questions:
The agent then:
- Matches against the site archetypes in data/site-archetypes.json

Generate a complete runnable project in one operation using the scaffolder script:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "my-scraper" \
--url "https://example.com" \
--pages "ProductListing,ProductDetail" \
--fields "title,price,image_url,description"
This produces a project with all source files, configuration, Docker setup, and an entry point ready to run. See the Scripts Reference section for full options.
| Category | Approach | Details |
|---|---|---|
| Framework | Playwright | playwright package, not @playwright/test |
| Language | TypeScript | Strict mode, ES2022 target |
| Pattern | PageObject | One class per page, compose components |
| Selectors | Resilient | data-testid > id > role > CSS class > text |
| Wait strategy | Auto-wait | Playwright built-in, plus networkidle for navigation |
| Validation | Zod | Schema per page object's output type |
| Output | JSON + CSV | Configurable via storage utility |
| Docker | Official image | mcr.microsoft.com/playwright:v1.48.0-jammy |
| Retry | Exponential backoff | 3 attempts default, configurable |
| Screenshots | On error | Saved to screenshots/ for debugging |
Follow this sequence when generating a scraper:
Ask the user for:
Use Mode 1 (agent-browser) or Mode 2 (manual description) to understand:
Create a plan listing:
Show the user the page object map before generating code. Include class names, field names, and the execution flow. Wait for confirmation.
Use the templates in assets/templates/ as the foundation:
- base-page.ts.md — BasePage abstract class
- page-object.ts.md — Site-specific page object
- component.ts.md — Reusable components
- scraper-runner.ts.md — Orchestrator
- data-schema.ts.md — Zod validation schemas

Provide the complete project with:
- Config files from assets/configs/

Abstract class providing navigate(), waitForPageLoad(), screenshot(), and getText() helpers. All page objects extend this.
export abstract class BasePage {
constructor(protected readonly page: Page) {}
async navigate(url: string): Promise<void> { /* ... */ }
async screenshot(name: string): Promise<void> { /* ... */ }
}
See: assets/templates/base-page.ts.md
Site-specific class with locators as readonly properties, scrape methods returning typed data, and navigation methods for multi-page flows.
export class ProductListingPage extends BasePage {
readonly productCards: Locator;
readonly nextButton: Locator;
async scrapeProducts(): Promise<Product[]> { /* ... */ }
async goToNextPage(): Promise<boolean> { /* ... */ }
}
See: assets/templates/page-object.ts.md
Reusable UI pattern (Pagination, DataTable) that receives a parent locator scope and provides extraction methods.
export class Pagination {
constructor(private page: Page, private scope: Locator) {}
async hasNextPage(): Promise<boolean> { /* ... */ }
async goToNext(): Promise<void> { /* ... */ }
}
See: assets/templates/component.ts.md
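A filled-in sketch of such a component, with the constructor parameters simplified relative to the template preview above. It is written against a structural subset of Playwright's Locator API (`locator()`, `isVisible()`, and `click()` all exist on the real `Locator`), and the default next-button selector is an assumption:

```typescript
// Structural subset of Playwright's Locator; the real type satisfies it.
interface ScopeLocator {
  locator(selector: string): ScopeLocator;
  isVisible(): Promise<boolean>;
  click(): Promise<void>;
}

export class Pagination {
  private readonly next: ScopeLocator;

  // The default selector is illustrative; real sites vary.
  constructor(scope: ScopeLocator, nextSelector = 'a[rel="next"]') {
    this.next = scope.locator(nextSelector);
  }

  async hasNextPage(): Promise<boolean> {
    return this.next.isVisible();
  }

  async goToNext(): Promise<void> {
    if (!(await this.hasNextPage())) {
      throw new Error("goToNext() called with no next page");
    }
    await this.next.click();
  }
}
```

Because the component only receives a parent scope, the same class works for any page object that composes it.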
Orchestrator that launches the browser, creates page objects, iterates through pages, collects data, validates with schemas, and writes output.
export class SiteScraper {
async run(): Promise<void> {
const browser = await chromium.launch();
const page = await browser.newPage();
// navigate, scrape, validate, write
}
}
See: assets/templates/scraper-runner.ts.md
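The iterate-and-collect core of run() can be sketched in isolation. Browser launch and output writing are elided here, and the `ListingPage` interface and `collectAll` helper are illustrative names, not the template's API:

```typescript
// Minimal shape a listing page object exposes to the runner.
interface ListingPage<T> {
  scrapeProducts(): Promise<T[]>;
  goToNextPage(): Promise<boolean>; // false when there is no next page
}

// Walk pages until pagination is exhausted (maxPages is a safety cap),
// collecting every record. Validation and output writing happen after.
async function collectAll<T>(
  page: ListingPage<T>,
  maxPages = 50,
): Promise<T[]> {
  const records: T[] = [];
  for (let i = 0; i < maxPages; i++) {
    records.push(...(await page.scrapeProducts()));
    if (!(await page.goToNextPage())) break;
  }
  return records;
}
```

Keeping the loop generic over `T` lets the same runner skeleton serve any page object whose scrape method returns typed records.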
Zod schemas that validate scraped records, catching selector drift and malformed data at extraction time.
export const ProductSchema = z.object({
title: z.string().min(1),
price: z.number().positive(),
});
See: assets/templates/data-schema.ts.md
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Monolith Scraper | All scraping logic in one file | Split into PageObject classes per page |
| Sleep Waiter | Using setTimeout/fixed delays | Use Playwright auto-wait and networkidle |
| Unvalidated Pipeline | No schema validation on output | Add Zod schemas for every data type |
| Selector Lottery | Fragile positional selectors | Use resilient selector hierarchy |
| Silent Failure | Swallowing errors without logging | Log failures and save debug screenshots |
| Unthrottled Crawler | No delay between requests | Add configurable request delays |
| Hardcoded Config | URLs and selectors in code | Use environment variables and config files |
| No Retry Logic | Single attempt per request | Implement exponential backoff |
See references/anti-patterns.md for the extended catalog with examples and fixes.
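The retry fix from the catalog can be sketched as one dependency-free helper; the defaults mirror the table (3 attempts, exponential backoff), while the base delay value is illustrative:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Retry an async operation with exponential backoff: wait baseDelayMs,
// then 2x, then 4x, ... up to `attempts` tries. Rethrows the last error.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

The same `sleep` helper doubles as the throttling fix: awaiting a configurable delay between page requests addresses the Unthrottled Crawler row.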
Generate a complete scraper project:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts [options]
Options:
--name <name> Project name (required)
--path <path> Target directory (default: ./)
--url <url> Target site base URL
--pages <pages> Comma-separated page names (e.g., ProductListing,ProductDetail)
--fields <fields> Comma-separated data fields (e.g., title,price,rating)
--no-docker Skip Docker setup
--no-validation Skip Zod validation setup
--json Output as JSON
-h, --help Show help
Examples:
# Scaffold a product scraper
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "shop-scraper" --url "https://shop.example.com" \
--pages "ProductListing,ProductDetail" --fields "title,price,image_url"
# Minimal scraper without Docker
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "blog-scraper" --no-docker
Generate a single PageObject class for an existing project:
deno run --allow-read --allow-write scripts/generate-page-object.ts [options]
Options:
--name <name> Class name (required)
--url <url> Page URL (for documentation comment)
--fields <fields> Comma-separated data fields
--selectors <json> JSON map of field to selector
--with-pagination Include pagination methods
--output <path> Output file path (default: stdout)
--json Output as JSON
-h, --help Show help
Examples:
# Generate a page object with known selectors
deno run --allow-read --allow-write scripts/generate-page-object.ts \
--name "ProductListing" --url "https://shop.example.com/products" \
--fields "title,price,rating" \
--selectors '{"title":".product-title","price":".product-price","rating":".star-rating"}' \
--with-pagination --output src/pages/ProductListingPage.ts
# Quick generation to stdout
deno run --allow-read scripts/generate-page-object.ts \
--name "SearchResults" --fields "title,url,snippet"
| Template | Purpose |
|---|---|
base-page.ts.md | Abstract BasePage with navigation, screenshots, text helpers |
page-object.ts.md | Site-specific page object with locators and scrape methods |
component.ts.md | Reusable components: Pagination, DataTable |
scraper-runner.ts.md | Orchestrator: browser launch, iteration, collection, output |
data-schema.ts.md | Zod schemas for scraped data validation |
| Config | Purpose |
|---|---|
dockerfile.md | Multi-stage Dockerfile using official Playwright image |
docker-compose.yml.md | Service with data/screenshots volume mounts |
tsconfig.json.md | Strict TypeScript with ES2022 target |
package.json.md | playwright, zod, tsx dependencies |
playwright.config.ts.md | Scraper-focused Playwright configuration |
| Reference | Purpose |
|---|---|
pageobject-pattern.md | PageObject pattern adapted for scraping |
playwright-selectors.md | Selector strategies and resilience hierarchy |
docker-setup.md | Docker configuration and deployment |
agent-browser-workflow.md | Agent-browser analysis workflow |
anti-patterns.md | Extended anti-pattern catalog |
| Example | Purpose |
|---|---|
ecommerce-scraper.md | Complete multi-page product scraper walkthrough |
multi-page-pagination.md | Pagination handling strategies |
| File | Purpose |
|---|---|
selector-patterns.json | Common selectors organized by UI element type |
site-archetypes.json | Website structure archetypes with typical pages and fields |
User: "I need a scraper for an online bookstore. I want to get book titles, authors, prices, and ratings from the catalog pages."
Agent workflow:
- site-archetypes.json — matches the ecommerce archetype
- BookListingPage — catalog with pagination
- BookDetailPage — individual book page (if detail scraping needed)
- Pagination component — shared pagination handler
- title → [itemprop="name"] or .book-title
- author → [itemprop="author"] or .book-author
- price → [itemprop="price"] or .price
- rating → .star-rating or [data-rating]
- Project output with a Zod schema for the Book type

This skill connects to:
This skill does NOT:
Weekly Installs: 105
Repository: jwynia/agent-skills
GitHub Stars: 37
First Seen: Feb 4, 2026
Security Audits: Gen Agent Trust Hub — Pass · Socket — Pass · Snyk — Warn
Installed on: opencode 96 · codex 95 · gemini-cli 94 · github-copilot 89 · kimi-cli 84 · amp 84