extract by actionbook/actionbook
npx skills add https://github.com/actionbook/actionbook --skill extract
Activate when the user wants to obtain data from a website.

The deliverable is always two artifacts: the extracted data, and a standalone Playwright `.cjs` file that reproduces the extraction without Actionbook at runtime.

Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.
```text
User request
  │
  ├─► actionbook search "<site> <intent>"
  │     ├─ Results with Health Score ≥ 70% ──► actionbook get "<ID>" ──► use selectors
  │     └─ No results / low score ──► Fallback
  │
  └─► Fallback: actionbook browser open <url>
        ├─ actionbook browser snapshot (accessibility tree → find selectors)
        ├─ actionbook browser screenshot (visual confirmation)
        └─ manual selector discovery via DOM inspection
```
Priority order for selector sources:
| Priority | Source | When |
|---|---|---|
| 1 | actionbook get | Site is indexed, health score ≥ 70% |
| 2 | actionbook browser snapshot | Not indexed or selectors outdated |
| 3 | DOM inspection via screenshot + snapshot | Complex SPA / dynamic content |
Non-negotiable rule: if search + get already provides usable selectors for required fields, start from get selectors and do not jump to full fallback (snapshot/screenshot) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to snapshot/screenshot only when probes/sample validation indicate selector gaps or instability.
Websites use patterns that break naive scraping. The generated Playwright script must account for these:
Pages may render a shell first, then stream or hydrate content.
```javascript
// Wait for hydration to complete, not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
  const items = document.querySelectorAll('[data-item]');
  return items.length > 0 && !document.querySelector('[data-pending]');
});
```
Detection cues: React root with data-reactroot, Next.js __NEXT_DATA__, empty containers that fill after JS runs. If actionbook browser text "<selector>" returns empty but the screenshot shows content, hydration hasn't completed.
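
The detection cues above can be checked mechanically against raw HTML before choosing a wait strategy. The following is a minimal sketch, not part of Actionbook: `detectHydrationCues` is a hypothetical helper, and the container ids it looks for (`root`, `app`, `__next`) are assumptions about common framework defaults rather than anything the skill defines.

```javascript
// Hypothetical helper: scan a raw HTML string for the hydration cues listed above.
function detectHydrationCues(html) {
  const cues = [];
  // React root marker
  if (/data-reactroot/.test(html)) cues.push('react-root');
  // Next.js embeds page data in a script tag with this id
  if (/id=["']__NEXT_DATA__["']/.test(html)) cues.push('next-data');
  // Assumed ids: an empty mount point usually fills only after JS runs
  if (/<div[^>]*\bid=["'](?:root|app|__next)["'][^>]*>\s*<\/div>/i.test(html)) {
    cues.push('empty-container');
  }
  return cues;
}

const sampleHtml =
  '<div id="root"></div><script id="__NEXT_DATA__" type="application/json">{}</script>';
console.log(detectHydrationCues(sampleHtml)); // [ 'next-data', 'empty-container' ]
```

Any non-empty result suggests waiting on a content selector, as in the hydration snippet, rather than relying on the initial document load.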
Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.
```javascript
// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;
const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');
let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
  const items = await page.$$eval('[data-row]', rows =>
    rows.map(r => ({ text: r.textContent.trim() }))
  );
  for (const item of items) {
    if (!allItems.find(i => i.text === item.text)) allItems.push(item);
  }
  await container.evaluate(el => el.scrollBy(0, 600));
  await page.waitForTimeout(300);
  const currentTop = await container.evaluate(el => el.scrollTop);
  if (currentTop === previousTop) break;
  previousTop = currentTop;
  scrolls += 1;
}
```
Detection cues: Container has fixed height with overflow: auto/scroll, row count in DOM is much smaller than stated total, rows have transform: translateY(...) or position: absolute; top: ...px.
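
Because consecutive scroll windows overlap, the collect step has to deduplicate rows. The sketch below shows one way to do that with a `Set` instead of a linear `find` on every row. `mergeUnique` and its `keyOf` callback are hypothetical names introduced for illustration; keying on text is an assumption carried over from the loop above and will merge rows that are legitimately identical, so prefer a stable row id if the page exposes one.

```javascript
// Merge a freshly scraped batch into the accumulator, skipping rows already seen.
// `keyOf` is a hypothetical callback that returns the dedupe key for a row.
function mergeUnique(seen, allItems, batch, keyOf = item => item.text) {
  let added = 0;
  for (const item of batch) {
    const key = keyOf(item);
    if (seen.has(key)) continue; // row was captured in an earlier window
    seen.add(key);
    allItems.push(item);
    added += 1;
  }
  return added; // number of rows that were new in this batch
}

const seen = new Set();
const collected = [];
mergeUnique(seen, collected, [{ text: 'row 1' }, { text: 'row 2' }]);
mergeUnique(seen, collected, [{ text: 'row 2' }, { text: 'row 3' }]); // overlapping window after a scroll
console.log(collected.map(i => i.text)); // [ 'row 1', 'row 2', 'row 3' ]
```

The return value can double as a stop signal: several consecutive batches that add nothing suggest the list is exhausted.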
New content appends when the user scrolls near the bottom.
```javascript
// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;
while (scrolls < maxScrolls && noGrowthStreak < 3) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1200);
  const newCount = await page.$$eval('.item', els => els.length);
  if (newCount > itemCount) {
    itemCount = newCount;
    noGrowthStreak = 0;
  } else {
    noGrowthStreak += 1;
  }
  scrolls += 1;
}
```
Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.
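
The no-growth stop condition can be exercised without a browser by injecting the scroll and count steps. This is a sketch under stated assumptions: `scrollUntilStable`, `scrollOnce`, `countItems`, and `makeFakeFeed` are hypothetical names, and in a real script the two callbacks would wrap `page.evaluate(...)` and `page.$$eval(...)` as in the loop above.

```javascript
// Generic no-growth loop. `scrollOnce` and `countItems` are hypothetical async callbacks.
async function scrollUntilStable(scrollOnce, countItems, { maxScrolls = 80, tolerance = 3 } = {}) {
  let itemCount = await countItems();
  let noGrowthStreak = 0;
  let scrolls = 0;
  while (scrolls < maxScrolls && noGrowthStreak < tolerance) {
    await scrollOnce();
    const newCount = await countItems();
    if (newCount > itemCount) {
      itemCount = newCount;
      noGrowthStreak = 0; // growth resets the streak
    } else {
      noGrowthStreak += 1; // tolerate a few slow or empty loads before stopping
    }
    scrolls += 1;
  }
  return { itemCount, scrolls };
}

// Fake feed for a dry run: starts with one page of items and adds one page per scroll up to `total`.
function makeFakeFeed(total = 35, pageSize = 10) {
  let loaded = pageSize;
  return {
    scrollOnce: async () => { loaded = Math.min(total, loaded + pageSize); },
    countItems: async () => loaded,
  };
}
```

With the default fake feed the loop grows for three scrolls (20, 30, 35 items) and then stops after three scrolls without growth, so both bounds (`maxScrolls` and the streak tolerance) can be checked before running against a live page.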
Multi-page results behind "Next" buttons or numbered pages.
```javascript
// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
  const pageData = await page.$$eval('.result-item', items =>
    items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
  );
  allData.push(...pageData);
  const nextBtn = await page.$('a.next-page:not([disabled])');
  if (!nextBtn) break;
  const previousUrl = page.url();
  const previousFirstItem = await page
    .$eval('.result-item', el => el.textContent?.trim() || '')
    .catch(() => '');
  await nextBtn.click();
  // Post-click detection only: advance must be caused by this click
  const advanced = await Promise.any([
    page
      .waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
      .then(() => true),
    page
      .waitForFunction(
        prev => {
          const first = document.querySelector('.result-item');
          return !!first && (first.textContent || '').trim() !== prev;
        },
        previousFirstItem,
        { timeout: 5000 }
      )
      .then(() => true),
  ]).catch(() => false);
  if (!advanced) break;
  await page.waitForLoadState('networkidle').catch(() => {});
  pageIndex += 1;
}
```
Identify the inputs from the user request (site, domain, data description), then query the index:

```bash
# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>
# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"
```
Use this routing strictly:

- Path A (default when `get` is good): requested fields are covered by `get` selectors and quality is acceptable.
  - Start from `get` selectors and move to script draft quickly.
  - Run lightweight mechanism probes (`browser text`, quick scroll checks) before finalizing script strategy.
  - Do not run the full fallback (`snapshot` / `screenshot`) before the first draft unless probe/sample validation shows a mismatch.
  - Use `get` selectors and mark the source as `actionbook_get`.
- Path B (partial / unstable): `get` exists but required fields are missing, a selector resolves 0 elements, or validation fails.
- Path C (no usable coverage): `search`/`get` has no usable result.
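
The static part of this routing can be sketched as a small function. Everything about the input shape is an assumption made for illustration: `chooseRoute` is a hypothetical helper, and `match` is an invented object carrying a 0-100 `healthScore` and a field-to-selector map, not the actual output format of `actionbook get`. Treating a low health score as Path C follows the "No results / low score" fallback branch described earlier. The runtime triggers for Path B (a selector resolving 0 elements, validation failing) are not covered here.

```javascript
// Hypothetical input shape: `match` is null when search/get returned nothing usable;
// otherwise { healthScore: number (0-100), selectors: { [field]: selectorString } }.
function chooseRoute(match, requiredFields) {
  // No result or low health score: no usable coverage, full fallback
  if (!match || match.healthScore < 70) {
    return { path: 'C', missing: requiredFields };
  }
  // Fields that `get` could not map to a selector
  const missing = requiredFields.filter(field => !match.selectors[field]);
  return missing.length === 0 ? { path: 'A', missing } : { path: 'B', missing };
}

console.log(chooseRoute({ healthScore: 85, selectors: { title: 'h3' } }, ['title', 'price']));
// { path: 'B', missing: [ 'price' ] }
```

Returning the `missing` fields alongside the path keeps the Path B fallback targeted at only the fields that failed.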
Path A mechanism detection timing:

- `actionbook browser open "<url>"` (if current tab context is unknown/stale)

Fallback discovery by path:
Path B targeted fallback (only failed fields/steps):

```bash
actionbook browser open "<url>" # if not already open
actionbook browser snapshot # focus on failed field/container mapping
# actionbook browser screenshot # optional visual confirmation for failed area
```

Path C full fallback (no usable coverage):

```bash
actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot
```
Mechanism probes (run when script strategy needs confirmation):

```bash
# Hydration / streaming check
actionbook browser text "<container-selector>"
# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # before
actionbook browser click "<scroll-container-selector-or-body>" # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # after
# If count increases, treat page as lazy-load/infinite-scroll.
```
Fallback trigger conditions:

- `actionbook get` cannot map all required fields.
- `actionbook get` selectors return empty/unstable values in sample run.

Write a standalone Playwright script (`extract_<domain>_<slug>.cjs`) that:

- Waits for content readiness, not just `load` (see mechanisms above).
- Serializes the extracted records (`JSON.stringify` / CSV).
- Bounds every loop (`maxPages`, `maxScrolls`, timeout budget) to avoid infinite loops.

Script template:
```javascript
// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs
const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('<URL>', { waitUntil: 'domcontentloaded' });
  // -- wait for readiness --
  await page.waitForSelector('<container>', { state: 'visible' });
  // -- extract --
  const data = await page.$$eval('<item-selector>', items =>
    items.map(el => ({
      // fields mapped from user request
    }))
  );
  // -- output --
  const fs = require('fs');
  fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
  console.log(`Extracted ${data.length} items → output.json`);
  await browser.close();
})();
```
Run the script to confirm it works:

```bash
node extract_<domain>_<slug>.cjs
```
Validation rules:
| Check | Pass condition |
|---|---|
| Script exits 0 | No runtime errors |
| Output file exists | Non-empty file written |
| Record count > 0 | At least one item extracted |
| No null/empty fields | Every declared field has a value in ≥ 90% of records |
| Data matches page | Spot-check first and last record against actionbook browser text |
If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
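
The data-level checks in the table (record count and the ≥ 90% fill rate) can be automated after each run. The sketch below is illustrative only: `validateRecords` is a hypothetical helper, and treating `null`, `undefined`, and blank strings as "empty" is an assumption about what the table means by null/empty fields.

```javascript
// Checks two rules from the table: record count > 0, and every declared field
// has a value in at least `minFillRate` of the records.
function validateRecords(records, fields, minFillRate = 0.9) {
  if (!Array.isArray(records) || records.length === 0) {
    return { ok: false, failures: ['record count is 0'] };
  }
  const failures = [];
  for (const field of fields) {
    // Assumption: null, undefined, and whitespace-only strings count as empty
    const filled = records.filter(record => {
      const value = record[field];
      return value !== null && value !== undefined && String(value).trim() !== '';
    }).length;
    const rate = filled / records.length;
    if (rate < minFillRate) {
      failures.push(`${field}: filled in ${(rate * 100).toFixed(0)}% of records`);
    }
  }
  return { ok: failures.length === 0, failures };
}

// Usage against the script's output file:
// const data = JSON.parse(require('fs').readFileSync('output.json', 'utf8'));
// console.log(validateRecords(data, ['title', 'price']));
```

The spot-check against `actionbook browser text` still has to be done by hand, since it compares the output with the live page.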
Present to the user the extracted data and the `.cjs` file they can re-run anytime.

Every `extract` invocation produces:
| Artifact | Path | Format |
|---|---|---|
| Playwright script | ./extract_<domain>_<slug>.cjs | Standalone Node.js script using playwright |
| Extracted data | ./output.json (default) or user-specified path | JSON array of objects (default), CSV, or user-specified |
The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.
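
When the user asks for CSV instead of the default JSON, the script needs its own serializer so that it stays free of extra dependencies. A minimal sketch, assuming every record has the same keys as the first one: `toCsv` is a hypothetical helper, quoting follows the usual CSV convention (fields containing commas, quotes, or line breaks are wrapped in double quotes, with embedded quotes doubled), and rows are joined with `\n` rather than CRLF.

```javascript
// Minimal CSV serializer: header row from the first record's keys.
function toCsv(records) {
  if (records.length === 0) return '';
  const headers = Object.keys(records[0]);
  const escapeCell = value => {
    const text = value === null || value === undefined ? '' : String(value);
    // Quote only when the cell contains a delimiter, a quote, or a line break
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = [headers.map(escapeCell).join(',')];
  for (const record of records) {
    lines.push(headers.map(header => escapeCell(record[header])).join(','));
  }
  return lines.join('\n');
}

// Usage inside the generated script, in place of the JSON output step:
// require('fs').writeFileSync('output.csv', toCsv(data));
```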
When multiple selector types are available from actionbook get:
| Priority | Type | Reason |
|---|---|---|
| 1 | data-testid | Stable, test-oriented, rarely changes |
| 2 | aria-label | Accessibility-driven, semantically meaningful |
| 3 | CSS selector | Structural, may break on redesign |
| 4 | XPath | Last resort, most brittle |
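
The priority table translates directly into a ranking step. The sketch below is an assumption-laden illustration: `pickSelector` is a hypothetical helper and the `{ type, value }` candidate shape is invented for the example, not the documented output of `actionbook get`.

```javascript
// Priority order from the table above; a lower index wins.
const SELECTOR_PRIORITY = ['data-testid', 'aria-label', 'css', 'xpath'];

// `candidates` is a hypothetical shape: [{ type: 'css', value: '.price' }, ...]
function pickSelector(candidates) {
  const ranked = candidates
    .filter(candidate => SELECTOR_PRIORITY.includes(candidate.type))
    .sort((a, b) => SELECTOR_PRIORITY.indexOf(a.type) - SELECTOR_PRIORITY.indexOf(b.type));
  return ranked[0] ?? null; // null when no known selector type is available
}

const best = pickSelector([
  { type: 'xpath', value: '//h3' },
  { type: 'aria-label', value: '[aria-label="Title"]' },
  { type: 'css', value: 'h3' },
]);
console.log(best.type); // aria-label
```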

Error handling:

| Error | Action |
|---|---|
| actionbook search returns no results | Fall back to snapshot + screenshot |
| Selector returns 0 elements | Re-snapshot, compare with screenshot, update selector |
| Script times out | Add longer waitForTimeout, check for anti-bot measures |
| Partial data (some fields empty) | Check if content is lazy-loaded; add scroll/wait |
| Anti-bot / CAPTCHA | Inform user; suggest running with headless: false or using their own browser session via actionbook setup extension mode |

- Weekly Installs: 143
- Repository: https://github.com/actionbook/actionbook
- GitHub Stars: 1.4K
- First Seen: Feb 23, 2026
- Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
- Installed on: codex (140), github-copilot (139), cursor (138), kimi-cli (138), amp (138), gemini-cli (138)