scraper-builder by jwynia/agent-skills
npx skills add https://github.com/jwynia/agent-skills --skill scraper-builder
Generate complete, runnable web scraper projects using the PageObject pattern with Playwright and TypeScript. This skill produces site-specific scrapers with typed data extraction, Docker deployment, and optional agent-browser integration for automated site analysis.
Use this skill when:
Do NOT use this skill when:
Each page on the target site maps to one PageObject class. Locators are defined in the constructor, and scraping logic lives in methods. Page objects never contain assertions or business logic — they extract and return data.
Prefer selectors in this order: data-testid > id > semantic HTML (role, aria-label) > structured CSS classes > text content. Avoid positional selectors (nth-child) and layout-dependent paths. See references/playwright-selectors.md for the full hierarchy.
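The preference order can be encoded as a small fallback probe. This is an illustrative sketch, not part of the generated templates: `firstMatching` and the `ProbePage`/`ProbeLocator` interfaces are hypothetical names written against a structural subset of Playwright's real API (`page.locator()` and `locator.count()` both exist), so a real `Page` satisfies them.

```typescript
// Structural subset of the Playwright Page/Locator API used here.
interface ProbeLocator {
  count(): Promise<number>;
}
interface ProbePage {
  locator(selector: string): ProbeLocator;
}

// Try candidate selectors in resilience order and return the first one
// that matches at least one element, or null if none match.
async function firstMatching(
  page: ProbePage,
  candidates: string[],
): Promise<string | null> {
  for (const selector of candidates) {
    if ((await page.locator(selector).count()) > 0) return selector;
  }
  return null;
}

// Hypothetical candidates for a product title, ordered per the hierarchy:
// data-testid > id > semantic HTML > structured CSS class.
const titleCandidates = [
  '[data-testid="product-title"]',
  "#product-title",
  '[itemprop="name"]',
  ".product-title",
];
```

With a real Playwright `Page`, `firstMatching(page, titleCandidates)` picks the most resilient selector actually present on the target page.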
Reusable UI patterns (pagination, data tables, search bars) are modeled as component classes that page objects compose via properties. Only BasePage uses inheritance — everything else composes.
All scraped data flows through Zod schemas for validation. This catches selector drift (when a site changes its markup) at extraction time rather than downstream. See assets/templates/data-schema.ts.md.
Generated projects include a Dockerfile using Microsoft's official Playwright images and a docker-compose.yml with volume mounts for output data and debug screenshots. This ensures consistent browser environments across machines.
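The generated Dockerfile is roughly this shape — a sketch, assuming an npm-based layout with a `src/index.ts` entry point (the actual template lives in assets/configs/dockerfile.md):

```dockerfile
# Official Playwright image, pinned as in the defaults table
FROM mcr.microsoft.com/playwright:v1.48.0-jammy

WORKDIR /app

# Install dependencies first so this layer stays cacheable
COPY package*.json ./
RUN npm ci

COPY . .

# Entry point path is an assumption; output data and debug screenshots
# are volume-mounted via docker-compose
CMD ["npx", "tsx", "src/index.ts"]
```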
Use agent-browser to navigate the target site, capture accessibility tree snapshots, and automatically discover selectors. This is the preferred mode when the agent has access to the agent-browser CLI.
Prerequisites: If agent-browser is not already installed, add it as a skill first:
npx skills add vercel-labs/agent-browser
Workflow:
# 1. Open the target page
agent-browser open https://example.com/products
# 2. Capture interactive snapshot with element references
agent-browser snapshot -i --json > snapshot.json
# 3. Capture scoped sections for focused analysis
agent-browser snapshot -i --json -s "main" > main-content.json
agent-browser snapshot -i --json -s "nav" > navigation.json
# 4. Test dynamic behavior (pagination, load-more)
agent-browser click @e3
agent-browser wait --load networkidle
agent-browser snapshot -i --json > after-click.json
# 5. Close when done
agent-browser close
What the agent does with snapshots:
- Element references (@e1, @e2, etc.) and their roles

See references/agent-browser-workflow.md for the complete workflow reference.
The user describes the target site's page structure and the agent maps it to page objects. The agent asks structured questions:
The agent then:
- Matches against the site archetypes in data/site-archetypes.json

Generate a complete runnable project in one operation using the scaffolder script:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "my-scraper" \
--url "https://example.com" \
--pages "ProductListing,ProductDetail" \
--fields "title,price,image_url,description"
This produces a project with all source files, configuration, Docker setup, and an entry point ready to run. See the Scripts Reference section for full options.
| Category | Approach | Details |
|---|---|---|
| Framework | Playwright | playwright package, not @playwright/test |
| Language | TypeScript | Strict mode, ES2022 target |
| Pattern | PageObject | One class per page, compose components |
| Selectors | Resilient | data-testid > id > role > CSS class > text |
| Wait strategy | Auto-wait | Playwright built-in, plus networkidle for navigation |
| Validation | Zod | Schema per page object's output type |
| Output | JSON + CSV | Configurable via storage utility |
| Docker | Official image | mcr.microsoft.com/playwright:v1.48.0-jammy |
| Retry | Exponential backoff | 3 attempts default, configurable |
| Screenshots | On error | Saved to screenshots/ for debugging |
Follow this sequence when generating a scraper:
Ask the user for:
Use Mode 1 (agent-browser) or Mode 2 (manual description) to understand:
Create a plan listing:
Show the user the page object map before generating code. Include class names, field names, and the execution flow. Wait for confirmation.
Use the templates in assets/templates/ as the foundation:
- base-page.ts.md — BasePage abstract class
- page-object.ts.md — Site-specific page object
- component.ts.md — Reusable components
- scraper-runner.ts.md — Orchestrator
- data-schema.ts.md — Zod validation schemas

Provide the complete project with:
- Config files from assets/configs/

Abstract class providing navigate(), waitForPageLoad(), screenshot(), and getText() helpers. All page objects extend this.
export abstract class BasePage {
constructor(protected readonly page: Page) {}
async navigate(url: string): Promise<void> { /* ... */ }
async screenshot(name: string): Promise<void> { /* ... */ }
}
See: assets/templates/base-page.ts.md
Site-specific class with locators as readonly properties, scrape methods returning typed data, and navigation methods for multi-page flows.
export class ProductListingPage extends BasePage {
readonly productCards: Locator;
readonly nextButton: Locator;
async scrapeProducts(): Promise<Product[]> { /* ... */ }
async goToNextPage(): Promise<boolean> { /* ... */ }
}
See: assets/templates/page-object.ts.md
Reusable UI pattern (Pagination, DataTable) that receives a parent locator scope and provides extraction methods.
export class Pagination {
constructor(private page: Page, private scope: Locator) {}
async hasNextPage(): Promise<boolean> { /* ... */ }
async goToNext(): Promise<void> { /* ... */ }
}
See: assets/templates/component.ts.md
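A filled-in sketch of such a component, with the constructor parameters simplified relative to the template preview above. It is written against a structural subset of Playwright's Locator API (`locator()`, `isVisible()`, and `click()` all exist on the real `Locator`), and the default next-button selector is an assumption:

```typescript
// Structural subset of Playwright's Locator; the real type satisfies it.
interface ScopeLocator {
  locator(selector: string): ScopeLocator;
  isVisible(): Promise<boolean>;
  click(): Promise<void>;
}

export class Pagination {
  private readonly next: ScopeLocator;

  // The default selector is illustrative; real sites vary.
  constructor(scope: ScopeLocator, nextSelector = 'a[rel="next"]') {
    this.next = scope.locator(nextSelector);
  }

  async hasNextPage(): Promise<boolean> {
    return this.next.isVisible();
  }

  async goToNext(): Promise<void> {
    if (!(await this.hasNextPage())) {
      throw new Error("goToNext() called with no next page");
    }
    await this.next.click();
  }
}
```

Because the component only receives a parent scope, the same class works for any page object that composes it.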
Orchestrator that launches the browser, creates page objects, iterates through pages, collects data, validates with schemas, and writes output.
export class SiteScraper {
async run(): Promise<void> {
const browser = await chromium.launch();
const page = await browser.newPage();
// navigate, scrape, validate, write
}
}
See: assets/templates/scraper-runner.ts.md
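The iterate-and-collect core of run() can be sketched in isolation. Browser launch and output writing are elided here, and the `ListingPage` interface and `collectAll` helper are illustrative names, not the template's API:

```typescript
// Minimal shape a listing page object exposes to the runner.
interface ListingPage<T> {
  scrapeProducts(): Promise<T[]>;
  goToNextPage(): Promise<boolean>; // false when there is no next page
}

// Walk pages until pagination is exhausted (maxPages is a safety cap),
// collecting every record. Validation and output writing happen after.
async function collectAll<T>(
  page: ListingPage<T>,
  maxPages = 50,
): Promise<T[]> {
  const records: T[] = [];
  for (let i = 0; i < maxPages; i++) {
    records.push(...(await page.scrapeProducts()));
    if (!(await page.goToNextPage())) break;
  }
  return records;
}
```

Keeping the loop generic over `T` lets the same runner skeleton serve any page object whose scrape method returns typed records.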
Zod schemas that validate scraped records, catching selector drift and malformed data at extraction time.
export const ProductSchema = z.object({
title: z.string().min(1),
price: z.number().positive(),
});
See: assets/templates/data-schema.ts.md
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Monolith Scraper | All scraping logic in one file | Split into PageObject classes per page |
| Sleep Waiter | Using setTimeout/fixed delays | Use Playwright auto-wait and networkidle |
| Unvalidated Pipeline | No schema validation on output | Add Zod schemas for every data type |
| Selector Lottery | Fragile positional selectors | Use resilient selector hierarchy |
| Silent Failure | Swallowing errors without logging | Log failures and save debug screenshots |
| Unthrottled Crawler | No delay between requests | Add configurable request delays |
| Hardcoded Config | URLs and selectors in code | Use environment variables and config files |
| No Retry Logic | Single attempt per request | Implement exponential backoff |
See references/anti-patterns.md for the extended catalog with examples and fixes.
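The retry fix from the catalog can be sketched as one dependency-free helper; the defaults mirror the table (3 attempts, exponential backoff), while the base delay value is illustrative:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Retry an async operation with exponential backoff: wait baseDelayMs,
// then 2x, then 4x, ... up to `attempts` tries. Rethrows the last error.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

The same `sleep` helper doubles as the throttling fix: awaiting a configurable delay between page requests addresses the Unthrottled Crawler row.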
Generate a complete scraper project:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts [options]
Options:
--name <name> Project name (required)
--path <path> Target directory (default: ./)
--url <url> Target site base URL
--pages <pages> Comma-separated page names (e.g., ProductListing,ProductDetail)
--fields <fields> Comma-separated data fields (e.g., title,price,rating)
--no-docker Skip Docker setup
--no-validation Skip Zod validation setup
--json Output as JSON
-h, --help Show help
Examples:
# Scaffold a product scraper
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "shop-scraper" --url "https://shop.example.com" \
--pages "ProductListing,ProductDetail" --fields "title,price,image_url"
# Minimal scraper without Docker
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "blog-scraper" --no-docker
Generate a single PageObject class for an existing project:
deno run --allow-read --allow-write scripts/generate-page-object.ts [options]
Options:
--name <name> Class name (required)
--url <url> Page URL (for documentation comment)
--fields <fields> Comma-separated data fields
--selectors <json> JSON map of field to selector
--with-pagination Include pagination methods
--output <path> Output file path (default: stdout)
--json Output as JSON
-h, --help Show help
Examples:
# Generate a page object with known selectors
deno run --allow-read --allow-write scripts/generate-page-object.ts \
--name "ProductListing" --url "https://shop.example.com/products" \
--fields "title,price,rating" \
--selectors '{"title":".product-title","price":".product-price","rating":".star-rating"}' \
--with-pagination --output src/pages/ProductListingPage.ts
# Quick generation to stdout
deno run --allow-read scripts/generate-page-object.ts \
--name "SearchResults" --fields "title,url,snippet"
| Template | Purpose |
|---|---|
base-page.ts.md | Abstract BasePage with navigation, screenshots, text helpers |
page-object.ts.md | Site-specific page object with locators and scrape methods |
component.ts.md | Reusable components: Pagination, DataTable |
scraper-runner.ts.md | Orchestrator: browser launch, iteration, collection, output |
data-schema.ts.md | Zod schemas for scraped data validation |
| Config | Purpose |
|---|---|
dockerfile.md | Multi-stage Dockerfile using official Playwright image |
docker-compose.yml.md | Service with data/screenshots volume mounts |
tsconfig.json.md | Strict TypeScript with ES2022 target |
package.json.md | playwright, zod, tsx dependencies |
playwright.config.ts.md | Scraper-focused Playwright configuration |
| Reference | Purpose |
|---|---|
pageobject-pattern.md | PageObject pattern adapted for scraping |
playwright-selectors.md | Selector strategies and resilience hierarchy |
docker-setup.md | Docker configuration and deployment |
agent-browser-workflow.md | Agent-browser analysis workflow |
anti-patterns.md | Extended anti-pattern catalog |
| Example | Purpose |
|---|---|
ecommerce-scraper.md | Complete multi-page product scraper walkthrough |
multi-page-pagination.md | Pagination handling strategies |
| File | Purpose |
|---|---|
selector-patterns.json | Common selectors organized by UI element type |
site-archetypes.json | Website structure archetypes with typical pages and fields |
User: "I need a scraper for an online bookstore. I want to get book titles, authors, prices, and ratings from the catalog pages."
Agent workflow:
- site-archetypes.json — matches the ecommerce archetype
- BookListingPage — catalog with pagination
- BookDetailPage — individual book page (if detail scraping needed)
- Pagination component — shared pagination handler
- title → [itemprop="name"] or .book-title
- author → [itemprop="author"] or .book-author
- price → [itemprop="price"] or .price
- rating → .star-rating or [data-rating]
- Project output with a Zod schema for the Book type

This skill connects to:
This skill does NOT:
Weekly Installs: 105
Repository: jwynia/agent-skills
GitHub Stars: 37
First Seen: Feb 4, 2026
Security Audits: Gen Agent Trust Hub — Pass · Socket — Pass · Snyk — Warn
Installed on: opencode 96 · codex 95 · gemini-cli 94 · github-copilot 89 · kimi-cli 84 · amp 84