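extract skill: an automated web data extraction tool that generates Playwright scripts and JSON/CSV data

extract by actionbook/actionbook

257 weekly installs

1,400 GitHub Stars

GitHub

Install command

npx skills add https://github.com/actionbook/actionbook --skill extract

Automation · Data processing · Testing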

🇺🇸English

When to Use This Skill

Activate when the user wants to obtain data from a website:

"Extract all product prices from this page"
"Scrape the table of results from ..."
"Pull the list of authors and titles from arXiv search results"
"Collect all job listings from this page"
"Get the data from this dashboard table"
"Harvest review scores from ..."
"Download all the links/images/cards from ..."

The deliverable is always two artifacts:

Executable Playwright script — a standalone .cjs file that reproduces the extraction without Actionbook at runtime.
Extracted data — JSON (default), CSV, or user-specified format written to disk.

Decision Strategy

Use Actionbook as a conditional accelerator, not a mandatory step. The goal is to reach reliable selectors by the shortest path.

User request
  │
  ├─► actionbook search "<site> <intent>"
  │     ├─ Results with Health Score ≥ 70%  ──► actionbook get "<ID>" ──► use selectors
  │     └─ No results / low score  ──► Fallback
  │
  └─► Fallback: actionbook browser open <url>
        ├─ actionbook browser snapshot   (accessibility tree → find selectors)
        ├─ actionbook browser screenshot (visual confirmation)
        └─ manual selector discovery via DOM inspection

Priority order for selector sources:

Priority	Source	When
1	`actionbook get`	Site is indexed, health score ≥ 70%
2	`actionbook browser snapshot`	Not indexed or selectors outdated
3	DOM inspection via screenshot + snapshot	Complex SPA / dynamic content

Non-negotiable rule: if search + get already provides usable selectors for required fields, start from get selectors and do not jump to full fallback (snapshot/screenshot) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to snapshot/screenshot only when probes/sample validation indicate selector gaps or instability.

Mechanism-Aware Script Strategy

Websites use patterns that break naive scraping. The generated Playwright script must account for these:

Streaming / SSR / RSC hydration

Pages may render a shell first, then stream or hydrate content.

// Wait for hydration to complete — not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
  const items = document.querySelectorAll('[data-item]');
  return items.length > 0 && !document.querySelector('[data-pending]');
});

Detection cues: React root with data-reactroot, Next.js __NEXT_DATA__, empty containers that fill after JS runs. If actionbook browser text "<selector>" returns empty but the screenshot shows content, hydration hasn't completed.

Virtualized lists / virtual DOM

Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.

// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;

const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');

let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
  const items = await page.$$eval('[data-row]', rows =>
    rows.map(r => ({ text: r.textContent.trim() }))
  );
  for (const item of items) {
    if (!allItems.find(i => i.text === item.text)) allItems.push(item);
  }

  await container.evaluate(el => el.scrollBy(0, 600));
  await page.waitForTimeout(300);

  const currentTop = await container.evaluate(el => el.scrollTop);
  if (currentTop === previousTop) break;

  previousTop = currentTop;
  scrolls += 1;
}

Detection cues: Container has fixed height with overflow: auto/scroll, row count in DOM is much smaller than stated total, rows have transform: translateY(...) or position: absolute; top: ...px.
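
Deduplicating by full row text, as the loop above does, would also merge two distinct rows that happen to render the same text. A minimal sketch of a keyed merge, assuming each row exposes some stable identifier; the `mergeRows` helper and the `id` field are hypothetical and not part of Actionbook or Playwright:

```javascript
// Hypothetical helper: accumulate rows across scroll steps, keyed by a stable id.
// `seen` is a Map from key to row; `keyOf` extracts the key (assumed to exist).
function mergeRows(seen, batch, keyOf) {
  let added = 0;
  for (const row of batch) {
    const key = keyOf(row);
    if (!seen.has(key)) {
      seen.set(key, row);
      added += 1;
    }
  }
  return added; // number of rows not seen before
}

// Simulated batches from two overlapping scroll positions.
const seen = new Map();
const firstAdded = mergeRows(seen, [{ id: 1, text: 'a' }, { id: 2, text: 'a' }], r => r.id);
const secondAdded = mergeRows(seen, [{ id: 2, text: 'a' }, { id: 3, text: 'b' }], r => r.id);
const allRows = [...seen.values()];
```

Inside the scroll loop, the return value could also serve as an extra stop signal: several consecutive steps that add zero rows suggest the end of the list has been reached.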

Infinite scroll / lazy loading

New content appends when the user scrolls near the bottom.

// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;

while (scrolls < maxScrolls && noGrowthStreak < 3) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1200);

  const newCount = await page.$$eval('.item', els => els.length);
  if (newCount > itemCount) {
    itemCount = newCount;
    noGrowthStreak = 0;
  } else {
    noGrowthStreak += 1;
  }

  scrolls += 1;
}

Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.

Pagination

Multi-page results behind "Next" buttons or numbered pages.

// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
  const pageData = await page.$$eval('.result-item', items =>
    items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
  );
  allData.push(...pageData);

  const nextBtn = await page.$('a.next-page:not([disabled])');
  if (!nextBtn) break;

  const previousUrl = page.url();
  const previousFirstItem = await page
    .$eval('.result-item', el => el.textContent?.trim() || '')
    .catch(() => '');

  await nextBtn.click();

  // Post-click detection only: advance must be caused by this click
  const advanced = await Promise.any([
    page
      .waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
      .then(() => true),
    page
      .waitForFunction(
        prev => {
          const first = document.querySelector('.result-item');
          return !!first && (first.textContent || '').trim() !== prev;
        },
        previousFirstItem,
        { timeout: 5000 }
      )
      .then(() => true),
  ]).catch(() => false);

  if (!advanced) break;

  await page.waitForLoadState('networkidle').catch(() => {});
  pageIndex += 1;
}

Execution Chain

Step 1: Understand the target

Identify from the user request:

URL — the page to extract from
Data shape — what fields / columns are needed
Scope — single page, paginated, infinite scroll, or multi-page crawl
Output format — JSON (default), CSV, or other
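
The four items above can be captured in a small plan object before any selector work begins. A sketch, assuming a hypothetical `buildTarget` helper; the field names and the allowed `scope` values are illustrative, while the JSON default comes from the list above:

```javascript
// Hypothetical normalizer for the Step 1 inputs. Field names are illustrative.
const SCOPES = ['single', 'paginated', 'infinite-scroll', 'crawl'];

function buildTarget({ url, fields, scope = 'single', format = 'json' }) {
  const parsed = new URL(url); // throws on a malformed URL
  if (!Array.isArray(fields) || fields.length === 0) {
    throw new Error('At least one field is required');
  }
  if (!SCOPES.includes(scope)) {
    throw new Error(`Unknown scope: ${scope}`);
  }
  return {
    url: parsed.href,
    domain: parsed.hostname,
    fields,
    scope,
    format: format.toLowerCase(),
  };
}

const target = buildTarget({
  url: 'https://example.com/jobs?page=1',
  fields: ['title', 'company'],
  scope: 'paginated',
});
```

Failing early on a missing URL or an empty field list keeps the later steps from producing a script that extracts nothing.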

Step 2: Obtain selectors and choose execution path

# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>

# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"

Use this routing strictly:

Path A (default when get is good): requested fields are covered by get selectors and quality is acceptable.
- Start from get selectors and move to script draft quickly.
- You may run lightweight mechanism probes (browser text, quick scroll checks) before finalizing script strategy.
- Do not run full fallback (snapshot / screenshot) before first draft unless probe/sample validation shows mismatch.
- Field mapping must default to get selectors and mark source as actionbook_get.
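Path B (partial / unstable): get exists but required fields are missing, a selector resolves to 0 elements, or validation fails.
- Run targeted fallback only for failed fields/steps.
Path C (no usable coverage): search/get has no usable result.
- Run full fallback discovery.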
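
The routing between Path A, Path B, and Path C can be sketched as a pure function. The input shape below (`healthScore`, `selectors`, `requiredFields`, `failedFields`) is an assumption made for illustration; it is not the actual output format of `actionbook search` or `actionbook get`:

```javascript
// Hypothetical routing helper. Input shapes are illustrative assumptions.
const MIN_HEALTH = 70; // percent, the threshold used in the decision strategy

function choosePath({ healthScore, selectors, requiredFields, failedFields = [] }) {
  const usable =
    typeof healthScore === 'number' && healthScore >= MIN_HEALTH && Boolean(selectors);
  if (!usable) return { path: 'C', fallbackFields: requiredFields };

  // Fields with no selector, plus fields whose selector failed a sample run.
  const missing = requiredFields.filter(f => !(f in selectors));
  const fallbackFields = [...new Set([...missing, ...failedFields])];

  return fallbackFields.length === 0
    ? { path: 'A', fallbackFields: [] }
    : { path: 'B', fallbackFields };
}
```

Returning the list of fields that still need discovery keeps a Path B fallback targeted: only those fields go through `snapshot` or `screenshot`.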

Step 3: Probe page mechanisms and fallback only when needed

Path A mechanism detection timing:

Run minimal probes either before final script draft or during sample validation.
Before any probe command, ensure the correct page context is open:
- actionbook browser open "<url>" (if current tab context is unknown/stale)
If probes/sample run indicate mismatch (missing rows, unstable selectors, wrong pagination behavior), escalate to Path B targeted fallback.

Fallback discovery by path:

Path B targeted fallback (only failed fields/steps):

actionbook browser open "<url>"     # if not already open
actionbook browser snapshot          # focus on failed field/container mapping
# actionbook browser screenshot      # optional visual confirmation for failed area

Path C full fallback (no usable coverage):

actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot

Mechanism probes (run when script strategy needs confirmation):

# Hydration / streaming check
actionbook browser text "<container-selector>"

# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length"   # before
actionbook browser click "<scroll-container-selector-or-body>"                    # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length"   # after
# If count increases, treat page as lazy-load/infinite-scroll.

Fallback trigger conditions:

actionbook get cannot map all required fields.
actionbook get selectors return empty/unstable values in sample run.
Runtime behavior conflicts with expected mechanism (e.g., virtualized container, delayed hydration).

Step 4: Generate Playwright script

Write a standalone Playwright script (extract_<domain>_<slug>.cjs) that:

Navigates to the target URL.
Waits for the correct readiness signal (not just load — see mechanisms above).
Handles the detected mechanism (virtual scroll, pagination, etc.).
Extracts data into structured objects.
Writes output to disk (JSON.stringify / CSV).
Closes the browser.
Enforces guardrails (maxPages, maxScrolls, timeout budget) to avoid infinite loops.
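
The guardrail item names a timeout budget alongside `maxPages` and `maxScrolls`. One possible shape for such a budget is a shared deadline that every loop consults. The `createBudget` helper is hypothetical, and the injectable clock exists only to make the sketch testable:

```javascript
// Hypothetical guardrail helper: one time budget shared by all loops in a script.
function createBudget(totalMs, now = () => Date.now()) {
  const deadline = now() + totalMs;
  const remaining = () => Math.max(0, deadline - now());
  return {
    remaining,
    expired: () => now() >= deadline,
    // Cap a per-step timeout so it never exceeds what is left of the budget.
    stepTimeout: preferredMs => Math.min(preferredMs, remaining()),
  };
}
```

A loop condition would then read `pageIndex < maxPages && !budget.expired()`. Check `expired()` before passing `stepTimeout(...)` to Playwright, because Playwright treats a `timeout` of `0` as "no timeout" rather than "fail immediately".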

Script template:

// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs

const fs = require('fs');
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();

    await page.goto('<URL>', { waitUntil: 'domcontentloaded' });

    // -- wait for readiness --
    await page.waitForSelector('<container>', { state: 'visible' });

    // -- extract --
    const data = await page.$$eval('<item-selector>', items =>
      items.map(el => ({
        // fields mapped from user request
      }))
    );

    // -- output --
    fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
    console.log(`Extracted ${data.length} items → output.json`);
  } finally {
    // Close the browser even if navigation or extraction throws
    await browser.close();
  }
})().catch(err => {
  // Report the failure and exit non-zero so validation can detect it
  console.error(err);
  process.exit(1);
});
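
The template writes JSON only, while CSV is also a supported output format. A minimal sketch of a CSV serializer using RFC 4180 style quoting; `toCsv` is a hypothetical helper, and it assumes flat records whose values are strings, numbers, or null:

```javascript
// Minimal CSV serializer (RFC 4180 style quoting). `toCsv` is a hypothetical helper.
function toCsv(records, columns = Object.keys(records[0] || {})) {
  const escapeField = value => {
    const s = value === null || value === undefined ? '' : String(value);
    // Quote fields containing a comma, quote, or line break; double inner quotes.
    return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const lines = [columns.map(escapeField).join(',')];
  for (const record of records) {
    lines.push(columns.map(col => escapeField(record[col])).join(','));
  }
  return lines.join('\r\n') + '\r\n';
}

const csv = toCsv([
  { title: 'Widget, large', price: 9.5 },
  { title: 'He said "hi"', price: null },
]);
```

In the template, the output step would then become `fs.writeFileSync('output.csv', toCsv(data))`. Nested objects or arrays would need flattening first, which this sketch does not attempt.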

Step 5: Execute and validate

Run the script to confirm it works:

node extract_<domain>_<slug>.cjs

Validation rules:

Check	Pass condition
Script exits 0	No runtime errors
Output file exists	Non-empty file written
Record count > 0	At least one item extracted
No null/empty fields	Every declared field has a value in ≥ 90% of records
Data matches page	Spot-check first and last record against `actionbook browser text`

If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
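
The record-level rules (record count above zero, each declared field filled in at least 90% of records) are straightforward to automate. A sketch, assuming a hypothetical `validateRecords` helper that treats `null`, `undefined`, and whitespace-only strings as empty:

```javascript
// Hypothetical validator for the record-level validation rules.
function validateRecords(records, fields, minFillRate = 0.9) {
  if (!Array.isArray(records) || records.length === 0) {
    return { ok: false, problems: ['no records extracted'] };
  }
  const problems = [];
  for (const field of fields) {
    const filled = records.filter(r => {
      const v = r[field];
      return v !== null && v !== undefined && String(v).trim() !== '';
    }).length;
    const rate = filled / records.length;
    if (rate < minFillRate) {
      problems.push(`${field}: filled in ${(rate * 100).toFixed(0)}% of records`);
    }
  }
  return { ok: problems.length === 0, problems };
}
```

The spot-check against the live page still has to be done separately; this only covers what can be judged from the output file alone.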

Step 6: Deliver

Present to the user:

Script path — the .cjs file they can re-run anytime.
Data path — the output JSON/CSV file.
Record count — how many items were extracted.
Notes — any mechanism-specific caveats (e.g., "this site uses infinite scroll; the script scrolls up to 50 pages by default").

Output Contract

Every extract invocation produces:

Artifact	Path	Format
Playwright script	`./extract_<domain>_<slug>.cjs`	Standalone Node.js script using `playwright`
Extracted data	`./output.json` (default) or user-specified path	JSON array of objects (default), CSV, or user-specified

The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.

Selector Priority

When multiple selector types are available from actionbook get:

Priority	Type	Reason
1	`data-testid`	Stable, test-oriented, rarely changes
2	`aria-label`	Accessibility-driven, semantically meaningful
3	CSS selector	Structural, may break on redesign
4	XPath	Last resort, most brittle
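
The priority table can be applied mechanically once the candidates for a field are known. A sketch, assuming a hypothetical candidate shape of `{ type, value }`; the type strings mirror the table, and how `actionbook get` actually labels selector types is not specified here:

```javascript
// Hypothetical candidate shape: { type, value }. Types mirror the priority table.
const TYPE_PRIORITY = ['data-testid', 'aria-label', 'css', 'xpath'];

function pickSelector(candidates) {
  const ranked = candidates
    .filter(c => TYPE_PRIORITY.includes(c.type)) // ignore unknown types
    .sort((a, b) => TYPE_PRIORITY.indexOf(a.type) - TYPE_PRIORITY.indexOf(b.type));
  return ranked[0] || null;
}

const best = pickSelector([
  { type: 'xpath', value: '//div[3]/span' },
  { type: 'css', value: '.card > .price' },
  { type: 'aria-label', value: '[aria-label="Price"]' },
]);
```

Here `best` is the `aria-label` candidate, since no `data-testid` selector is available.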

Error Handling

Error	Action
`actionbook search` returns no results	Fall back to `snapshot` + `screenshot`
Selector returns 0 elements	Re-snapshot, compare with screenshot, update selector
Script times out	Lengthen the wait (a larger `timeout` option or a longer `waitForTimeout`); check for anti-bot measures
Partial data (some fields empty)	Check if content is lazy-loaded; add scroll/wait
Anti-bot / CAPTCHA	Inform user; suggest running with `headless: false` or using their own browser session via `actionbook setup` extension mode

Weekly Installs: 143
Repository: actionbook/actionbook
GitHub Stars: 1.4K
First Seen: Feb 23, 2026
Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
Installed on: codex (140), github-copilot (139), cursor (138), kimi-cli (138), amp (138), gemini-cli (138)
