actionbook-scraper by actionbook/actionbook
Install:

```bash
npx skills add https://github.com/actionbook/actionbook --skill actionbook-scraper
```
Every generated script MUST pass BOTH checks:
| Check | What to Verify | Failure Example |
|---|---|---|
| Part 1: Script Runs | No errors, no timeouts | Selector not found |
| Part 2: Data Correct | Content matches expected | Extracted "Click to expand" instead of name |
```
┌─────────────────────────────────────────────────────┐
│ 1. Generate Script │
│ ↓ │
│ 2. Execute Script │
│ ↓ │
│ 3. Check Part 1: Script runs without errors? │
│ ↓ │
│ 4. Check Part 2: Data content is correct? │
│ - Not empty │
│ - Not placeholder text ("Loading...") │
│ - Not UI text ("Click to expand") │
│ - Fields mapped correctly │
│ ↓ │
│ ┌───┴───┐ │
│ BOTH Pass Either Fails │
│ │ │ │
│ │ ↓ │
│ │ Is it Actionbook data issue? │
│ │ │ │
│ │ ┌───┴───┐ │
│ │ Yes No │
│ │ │ │ │
│ │ ↓ ↓ │
│ │ Log to Fix script │
│ │ .actionbook-issues.log │
│ │ │ │ │
│ │ └───┬───┘ │
│ │ ↓ │
│ │ Retry (max 3x) │
│ ↓ │
│ Output Script │
└─────────────────────────────────────────────────────┘
```
`/actionbook-scraper:generate <url>`
DEFAULT = agent-browser script (bash commands):

```bash
agent-browser open "https://example.com"
agent-browser scroll down 2000
agent-browser get text ".selector"
agent-browser close
```

`/actionbook-scraper:generate <url> --standalone`
Output = Playwright JavaScript code
Every generated script must pass BOTH checks:
| Check | What to Verify | Failure Action |
|---|---|---|
| 1. Script Runs | No errors, no timeouts | Fix syntax/selector errors |
| 2. Data Correct | Content matches expected fields | Fix extraction logic |
Verify extracted data matches the expected structure:
Expected: Company name, description, website, year founded
Actual: "Click to expand", "Loading...", empty strings
→ FAIL: Data content incorrect, need to fix extraction logic
Data validation rules:
| Rule | Example Failure | Fix |
|---|---|---|
| Fields not empty | name: "" | Check selector targets correct element |
| No placeholder text | name: "Loading..." | Add wait for dynamic content |
| No UI text | name: "Click to expand" | Extract after expanding, not button text |
| Correct data type | year: "View Details" | Wrong selector, fix field mapping |
| Reasonable count | Expected ~100, got 3 | Add scroll/pagination handling |
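These rules can be sketched as a small validation helper. This is a minimal sketch, not the plugin's actual implementation: the placeholder list, field names, and tolerance threshold are illustrative assumptions.

```javascript
// Sketch of the validation rules above. The placeholder list and
// the tolerance threshold are illustrative assumptions.
const PLACEHOLDER_TEXT = ["Loading...", "Click to expand", "View Details"];

function validateRecord(record, requiredFields) {
  const errors = [];
  for (const field of requiredFields) {
    const value = record[field];
    if (value === undefined || value === "") {
      errors.push(`${field}: empty`); // rule: fields not empty
    } else if (PLACEHOLDER_TEXT.includes(value)) {
      // rules: no placeholder text, no UI text
      errors.push(`${field}: placeholder/UI text "${value}"`);
    }
  }
  return errors;
}

// Rule: reasonable count. Flag a scrape that returns far fewer
// items than expected.
function validateCount(items, expected, tolerance = 0.5) {
  return items.length >= expected * tolerance;
}
```

For the "Expected ~100, got 3" case, `validateCount(items, 100)` fails, signalling missing scroll/pagination handling.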
`/generate <url>` → OUTPUT: agent-browser bash commands
`/generate <url> --standalone` → OUTPUT: Playwright .js file
````
┌─────────────────────────────────────────────────────────────┐
│ /generate <url> │
│ │
│ 1. Search Actionbook → get selectors │
│ 2. Generate OUTPUT: │
│ │
│ WITHOUT --standalone │ WITH --standalone │
│ ───────────────────── │ ────────────────── │
│ agent-browser commands │ Playwright .js code │
│ │ │
│ ```bash │ ```javascript │
│ agent-browser open ... │ const { chromium } = ... │
│ agent-browser get ... │ await page.goto(...) │
│ agent-browser close │ ``` │
│ ``` │ │
└─────────────────────────────────────────────────────────────┘
````
| Operation | Primary Tool | Fallback | Notes |
|---|---|---|---|
| Find selectors for URL | search_actions | None | Search by domain/keywords |
| Get full selector details | get_action_by_id | None | Use action_id from search |
| List available sources | list_sources | search_sources | Browse all indexed sites |
| Generate agent-browser script | Agent (sonnet) | - | Default mode for /generate |
| Generate Playwright script | Agent (sonnet) | - | Use --standalone flag |
| Structure analysis | Agent (haiku) | - | Parse Actionbook response |
| Request new website | agent-browser | Manual | Submit to actionbook.dev (ONLY command that executes agent-browser) |
Every generated script MUST be verified by executing it.
| Step | Action |
|---|---|
| 1 | Generate script with Actionbook selectors |
| 2 | Execute script to verify it works |
| 3 | If failed: analyze error, fix script, go to step 2 |
| 4 | If success: output verified script + data preview |
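The loop above can be sketched as a small driver function. This is a hypothetical synchronous sketch: `generate`, `execute`, and `fix` stand in for the agent's real steps and are not plugin APIs.

```javascript
// Hypothetical sketch of the generate → execute → fix → retry loop.
function verifyScript(generate, execute, fix, maxRetries = 3) {
  let script = generate(); // step 1: generate with Actionbook selectors
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const result = execute(script); // step 2: run it
    if (result.ok) {
      // step 4: output verified script + data preview
      return { script, data: result.data, attempts: attempt };
    }
    // step 3: analyze error, fix script, go back to step 2
    script = fix(script, result.error);
  }
  throw new Error(`Script failed verification after ${maxRetries} attempts`);
}
```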
For agent-browser scripts:

```bash
# Execute each command
agent-browser open "https://example.com"
agent-browser wait --load networkidle
agent-browser get text ".selector"
# Check if data is returned
# If error → fix and retry
agent-browser close
```
For Playwright scripts (--standalone):

```bash
# Write to temp file and execute
node /tmp/scraper.js
# Check if output file has data
# If error → fix and retry
agent-browser close
```

| Error | Example | Fix |
|---|---|---|
| Extracted button text | name: "Click to expand" | Extract content after expanding |
| Extracted placeholder | desc: "Loading..." | Add wait for dynamic content |
| Empty fields | name: "" | Fix selector |
| Wrong field mapping | year: "San Francisco" | Fix selector for each field |
| Too few items | Expected 100, got 3 | Add scroll/pagination |
If Actionbook selectors are wrong or outdated, record to the local file `.actionbook-issues.log`.

When to record:

Log format:

```
[YYYY-MM-DD HH:MM] URL: {url}
Action ID: {action_id}
Issue Type: {selector_error | outdated | missing}
Details: {description}
Selector: {selector}
Expected: {what it should select}
Actual: {what it actually selects or error}
---
```
When Actionbook provides multiple selectors, prefer in this order:
1. `data-testid` - Most stable, designed for automation
2. `aria-label` - Accessibility-based, semantic
3. `css` - Class-based selectors
4. `xpath` - Last resort, most fragile

| Command | Description | Agent |
|---|---|---|
| `/actionbook-scraper:analyze <url>` | Analyze page structure and show available selectors | structure-analyzer |
| `/actionbook-scraper:generate <url>` | Generate agent-browser scraper script | code-generator |
| `/actionbook-scraper:generate <url> --standalone` | Generate Playwright/Puppeteer script | code-generator |
| `/actionbook-scraper:list-sources` | List websites with Actionbook data | - |
| `/actionbook-scraper:request-website <url>` | Request new website to be indexed (uses agent-browser) | website-requester |
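Returning to selector priority: the preference order above can be sketched as a small helper. The input shape (a type → selector-string map) is an assumption about how the selector data might be held, not the plugin's actual API.

```javascript
// Sketch: pick the most stable selector per the priority order above.
// The input shape (type → selector string) is an assumption.
const SELECTOR_PRIORITY = ["data-testid", "aria-label", "css", "xpath"];

function pickSelector(selectors) {
  for (const type of SELECTOR_PRIORITY) {
    if (selectors[type]) return { type, selector: selectors[type] };
  }
  return null; // nothing usable
}
```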
1. User: /actionbook-scraper:analyze https://example.com/page
2. Extract domain from URL → "example.com"
3. search_actions("example page") → [action_ids]
4. For best match: get_action_by_id(action_id) → full selector data
5. Structure-analyzer agent formats and presents findings
User: /actionbook-scraper:generate https://example.com/page
Step 1: Search Actionbook
search_actions("example.com page") → action_ids
Step 2: Get selectors
get_action_by_id(best_match) → selectors
Step 3: Generate agent-browser script
```bash
agent-browser open "https://example.com/page"
agent-browser wait --load networkidle
agent-browser scroll down 2000
agent-browser get text ".item-container"
agent-browser close
```
Step 4: VERIFY script (REQUIRED)
Execute the commands and check if data is extracted
If failed → analyze error → fix script → retry (max 3x)
Step 5: Return verified script + data preview
**Example Output:**
````markdown
## Verified Scraper (agent-browser)
**Status**: ✅ Verified (extracted 50 items)
Run these commands to scrape:
```bash
agent-browser open "https://example.com/page"
agent-browser wait --load networkidle
agent-browser scroll down 2000
agent-browser get text ".item-container"
agent-browser close
```

### Data Preview

```json
[
{"name": "Item 1", "description": "..."},
{"name": "Item 2", "description": "..."},
// ... showing first 3 items
]
```
````
### Generate Command (--standalone: Playwright script)
```
User: /actionbook-scraper:generate https://example.com/page --standalone
Step 1: Search Actionbook for selectors
Step 2: Get full selector data
Step 3: Generate Playwright/Puppeteer script
Step 4: VERIFY script (REQUIRED)
Write to temp file → node /tmp/scraper.js → check output
If failed → analyze error → fix script → retry (max 3x)
Step 5: Return verified script + data preview
```
**Example Output:**
````markdown
## Verified Scraper (Playwright)
**Status**: ✅ Verified (extracted 50 items)
```javascript
const { chromium } = require('playwright');
// ... generated code with Actionbook selectors
```
Usage:
```bash
npm install playwright
node scraper.js
```
### Data Preview
```json
[
{"name": "Item 1", "description": "..."},
// ... first 3 items
]
```
````
1. User: /actionbook-scraper:request-website https://newsite.com/page
2. Launch website-requester agent (uses agent-browser)
3. Agent workflow:
a. agent-browser open "https://actionbook.dev/request-website"
b. agent-browser snapshot -i (discover form selectors)
c. agent-browser type <url-field> "https://newsite.com/page"
d. agent-browser type <email-field> (optional)
e. agent-browser type <usecase-field> (optional)
f. agent-browser click <submit-button>
g. agent-browser snapshot -i (verify submission)
h. agent-browser close
4. Output: Confirmation of submission
Actionbook returns selector data in this format:

```json
{
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "## Selector Reference\n\n| Element | CSS | XPath | Type |\n..."
}
```
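Since the selectors arrive as a markdown table inside `content`, a consumer has to parse them back out. A minimal sketch, assuming the column order shown above (`Element | CSS | XPath | Type`):

```javascript
// Sketch: parse the markdown "Selector Reference" table out of the
// `content` field. Column order is assumed from the example above.
function parseSelectorTable(content) {
  const rows = [];
  for (const line of content.split("\n")) {
    const cells = line.split("|").map(c => c.trim()).filter(c => c !== "");
    // Skip non-table lines, the header row, and the |---| separator
    if (cells.length < 4 || cells[0] === "Element" || /^-+$/.test(cells[0])) continue;
    const [element, css, xpath, type] = cells;
    rows.push({ element, css, xpath, type });
  }
  return rows;
}
```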
Card-based layouts:

```
Container: .card-list, .grid-container
Card item: .card, .list-item
Card name: .card__title, .card-name
Card description: .card__description
Expand button: .card__expand, button.expand
```
Detail extraction (dt/dd pattern):

```javascript
// Common pattern for key-value pairs: collect each dt label
// with its dd value into an object
const details = {};
container.querySelectorAll('.info-item').forEach(item => {
  const label = item.querySelector('dt')?.textContent.trim();
  const value = item.querySelector('dd')?.textContent.trim();
  if (label) details[label] = value;
});
```
Table layouts:

```
Table: table, .data-table
Header: thead th, .table-header
Row: tbody tr, .table-row
Cell: td, .table-cell
```
| Indicator | Page Type | Template |
|---|---|---|
| Scroll to load more | Dynamic/Infinite | playwright-js (with scroll) |
| Click to expand | Card-based | playwright-js (with click) |
| Pagination links | Paginated | playwright-js (with pagination) |
| Static content | Static | puppeteer or playwright |
| SPA framework detected | SPA | playwright-js (network idle) |
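The indicator-to-template mapping above can be sketched as a simple decision function. The boolean flag names are hypothetical outputs of page analysis, not plugin identifiers:

```javascript
// Sketch of the template decision table above.
// Flag names are hypothetical analysis results.
function chooseTemplate({ infiniteScroll, clickToExpand, pagination, isSpa } = {}) {
  if (infiniteScroll) return "playwright-js (with scroll)";
  if (clickToExpand) return "playwright-js (with click)";
  if (pagination) return "playwright-js (with pagination)";
  if (isSpa) return "playwright-js (network idle)";
  return "puppeteer or playwright"; // static content
}
```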
## Page Analysis: {url}
### Matched Action
- **Action ID**: {action_id}
- **Confidence**: HIGH | MEDIUM | LOW
### Available Selectors
| Element | Selector | Type | Methods |
|---------|----------|------|---------|
| {name} | {selector} | {type} | {methods} |
### Page Structure
- **Type**: {static|dynamic|spa}
- **Data Pattern**: {cards|table|list}
- **Lazy Loading**: {yes|no}
- **Expand/Collapse**: {yes|no}
### Recommendations
- Suggested template: {template}
- Special handling needed: {notes}
## Generated Scraper
**Target URL**: {url}
**Template**: {template}
**Expected Output**: {description}
### Dependencies

```bash
npm install playwright
```

{generated_code}

Usage:

```bash
node scraper.js
```

Results saved to {output_file}
## Templates Reference
| Template | Flag | Output | Run With |
|----------|------|--------|----------|
| **agent-browser** | (default) | CLI commands | `agent-browser` CLI |
| playwright-js | --standalone | .js file | `node scraper.js` |
| playwright-python | --standalone --template playwright-python | .py file | `python scraper.py` |
| puppeteer | --standalone --template puppeteer | .js file | `node scraper.js` |
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| No actions found | URL not indexed | Use `/actionbook-scraper:request-website` to request indexing |
| Selectors not working | Page updated | Report to Actionbook, try alternative selectors |
| Timeout | Slow page load | Increase timeout, add retry logic |
| Empty data | Dynamic content | Add scroll/wait handling |
| Form submission failed | Network/page issue | Retry or submit manually at actionbook.dev |
## agent-browser Usage
For the `request-website` command, the plugin uses **agent-browser CLI** to automate form submission.
### agent-browser Commands
```bash
# Open a URL
agent-browser open "https://actionbook.dev/request-website"
# Get page snapshot (discover selectors)
agent-browser snapshot -i
# Type into form field
agent-browser type "input[name='url']" "https://example.com"
# Click button
agent-browser click "button[type='submit']"
# Close browser (ALWAYS do this)
agent-browser close
```
If form selectors are unknown, use snapshot to discover them:
```bash
agent-browser open "https://actionbook.dev/request-website"
agent-browser snapshot -i  # Returns page structure with selectors
```
**Critical**: Always run `agent-browser close` at the end of any agent-browser session, even if errors occur.
### Example 1: Generate agent-browser Commands (Default)
/actionbook-scraper:generate https://firstround.com/companies
Output: agent-browser commands
```bash
agent-browser open "https://firstround.com/companies"
agent-browser scroll down 2000
agent-browser get text ".company-list-card-small"
agent-browser close
```
User runs these commands to scrape.
### Example 2: Generate Playwright Script
/actionbook-scraper:generate https://firstround.com/companies --standalone
Output: Playwright JavaScript code

```javascript
const { chromium } = require('playwright');
// ... full script
```

User runs: `node scraper.js`
### Example 3: Analyze Page Structure
/actionbook-scraper:analyze https://example.com/products
Output: Analysis showing:
- Available selectors
- Page structure
- Recommended approach
### Example 4: Request a New Website
/actionbook-scraper:request-website https://newsite.com/data
Action: Submits form to actionbook.dev (this command DOES execute agent-browser)
## Best Practices
1. **Always analyze before generating** - Understand the page structure first
2. **Check list-sources** - Verify the site is indexed before attempting
3. **Review generated code** - Verify selectors match expected elements
4. **Add appropriate delays** - Be respectful to target servers
5. **Handle edge cases** - Empty states, loading states, errors
6. **Test incrementally** - Run on small subset before full scrape
## Stats

- **Weekly Installs**: 125
- **Repository**: actionbook/actionbook
- **GitHub Stars**: 1.4K
- **First Seen**: Feb 4, 2026
- **Security Audits**: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
- **Installed on**: opencode (101), codex (101), gemini-cli (94), github-copilot (80), claude-code (75), amp (70)