web-scraping by yfe404/web-scraper
npx skills add https://github.com/yfe404/web-scraper --skill web-scraping
Activate automatically when the user requests web scraping or data extraction, anti-blocking help (strategies/anti-blocking.md), or Apify Actor work (apify/ subdirectory).

Determine reconnaissance depth from the user request:
| User Says | Mode | Phases Run |
|---|---|---|
| "quick recon", "just check", "what framework" | Quick | Phase 0 only |
| "scrape X", "extract data from X" (default) | Standard | Phases 0-3 + 5, Phase 4 only if protection signals detected |
| "full recon", "deep scan", "production scraping" | Full | All phases (0-5) including protection testing |
Default is Standard mode. Escalate to Full if protection signals appear during any phase.
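The phrasing-to-mode mapping above can be sketched as a simple keyword match. The phrases and mode names come straight from the table; the function itself is only an illustration, not part of the skill:

```javascript
// Pick a reconnaissance mode from the user's phrasing (illustrative only).
function selectMode(request) {
  const text = request.toLowerCase();
  if (/full recon|deep scan|production scraping/.test(text)) return 'full';
  if (/quick recon|just check|what framework/.test(text)) return 'quick';
  return 'standard'; // default, per the table above
}

// Escalate to Full if protection signals appear during any phase.
function escalate(mode, protectionSignalsSeen) {
  return protectionSignalsSeen ? 'full' : mode;
}

console.log(selectMode('scrape example.com')); // 'standard'
console.log(escalate('standard', true));       // 'full'
```

Note the ordering: the "full" patterns are checked first so that an explicit request for depth always wins.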
This skill uses an adaptive phased workflow with quality gates. Each gate asks "Do I have enough information?" and continues to the next phase only when the answer is no.
See: strategies/framework-signatures.md for the framework detection tables referenced throughout.
Gather maximum intelligence with minimum cost — a single HTTP request.
Step 0a: Fetch raw HTML and headers
curl -s -D- -L "https://target.com/page" -o response.html
Step 0b: Check response headers
- Match against strategies/framework-signatures.md → Response Header Signatures table
- Inspect Server, X-Powered-By, X-Shopify-Stage, Set-Cookie (protection markers)

Step 0c: Check Known Major Sites table
- Match against strategies/framework-signatures.md → Known Major Sites

Step 0d: Detect framework from HTML
- Match against strategies/framework-signatures.md → HTML Signatures table
- Look for __NEXT_DATA__, __NUXT__, ld+json, /wp-content/, data-reactroot

Step 0e: Search for target data points
- curl -s https://[site]/robots.txt | grep -i Sitemap

Step 0f: Note protection signals
- cf-ray header

See: strategies/cheerio-vs-browser-test.md for the Cheerio viability assessment
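As a rough sketch, Steps 0b and 0d reduce to matching headers and raw HTML against a signature table. The markers below are well-known public signatures, but the authoritative tables live in strategies/framework-signatures.md; this function is an illustration, not the skill's implementation:

```javascript
// Minimal signature matcher for Steps 0b/0d (illustrative subset of the
// tables in strategies/framework-signatures.md).
const headerSignatures = [
  { header: 'x-shopify-stage', framework: 'Shopify' },
  { header: 'x-powered-by', value: /express/i, framework: 'Express' },
];
const htmlSignatures = [
  { marker: '__NEXT_DATA__', framework: 'Next.js' },
  { marker: '__NUXT__', framework: 'Nuxt' },
  { marker: '/wp-content/', framework: 'WordPress' },
  { marker: 'data-reactroot', framework: 'React' },
];

function detectFramework(headers, html) {
  for (const sig of headerSignatures) {
    const value = headers[sig.header];
    if (value !== undefined && (!sig.value || sig.value.test(value))) return sig.framework;
  }
  for (const sig of htmlSignatures) {
    if (html.includes(sig.marker)) return sig.framework;
  }
  return null; // unknown: fall through to Phase 1 (browser)
}

console.log(detectFramework({}, '<script id="__NEXT_DATA__">{}</script>')); // 'Next.js'
```

Header signatures are checked before HTML markers because they are cheaper and less ambiguous.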
QUALITY GATE A: All target data points found in raw HTML and no protection signals?
→ YES: Skip to Phase 3 (Validate Findings). No browser needed.
→ NO: Continue to Phase 1.
Launch browser only for data points missing from raw HTML or when JavaScript rendering is required.
Step 1a: Initialize browser session
- proxy_start() → Start the traffic interception proxy
- interceptor_chrome_launch(url, stealthMode: true) → Launch Chrome with anti-detection
- interceptor_chrome_devtools_attach(target_id) → Attach the DevTools bridge
- interceptor_chrome_devtools_screenshot() → Capture visual state

Step 1b: Capture traffic and rendered DOM
- proxy_list_traffic() → Review all traffic from the page load
- proxy_search_traffic(query: "application/json") → Find JSON responses
- interceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"]) → XHR/fetch calls
- interceptor_chrome_devtools_snapshot() → Accessibility tree (rendered DOM)

Step 1c: Search rendered DOM for missing data points
- Use the framework-specific strategies in strategies/framework-signatures.md → Framework → Search Strategy table

Step 1d: Inspect discovered endpoints
- proxy_get_exchange(exchange_id) → Full request/response for promising endpoints

QUALITY GATE B: All target data points now covered (raw HTML + rendered DOM + traffic)?
→ YES: Skip to Phase 3 (Validate Findings). No deep scan needed.
→ NO: Continue to Phase 2 for missing data points only.
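Quality Gates A and B both amount to a coverage check over the target data points. A minimal sketch, assuming a hypothetical "findings" map from data point to the sources where it was located (raw HTML, rendered DOM, traffic):

```javascript
// Quality-gate helper: which target data points are still uncovered?
// The "findings" shape is hypothetical, for illustration only.
function missingDataPoints(targets, findings) {
  return targets.filter(t => !findings[t] || findings[t].length === 0);
}

const targets = ['title', 'price', 'reviews'];
const findings = { title: ['rawHtml'], price: ['trafficJson'], reviews: [] };

console.log(missingDataPoints(targets, findings)); // [ 'reviews' ] → continue to Phase 2
```

An empty result means the gate passes and the workflow can skip ahead to Phase 3.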
Targeted investigation for data points not yet found. Only search for what's missing.
Step 2a: Test interactions for missing data
- proxy_clear_traffic() before each action → Isolate API calls
- humanizer_click(target_id, selector) → Trigger dynamic content loads
- humanizer_scroll(target_id, direction, amount) → Trigger lazy loading / infinite scroll
- humanizer_idle(target_id, duration_ms) → Wait for delayed content
- proxy_list_traffic() → Check for new API calls

Step 2b: Sniff APIs (framework-aware)
- Next.js: proxy_list_traffic(url_filter: "/_next/data/")
- WordPress: proxy_list_traffic(url_filter: "/wp-json/")
- GraphQL: proxy_search_traffic(query: "graphql")
- Generic: proxy_list_traffic(url_filter: "/api/") + proxy_search_traffic(query: "application/json")

Step 2c: Test pagination and filtering
- proxy_clear_traffic() → click next page → proxy_list_traffic(url_filter: "page=")

QUALITY GATE C: Enough data points covered for a useful report?
→ YES: Go to Phase 3.
→ NO: Document gaps and go to Phase 3 anyway (the report will note missing data in the self-critique).
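The url_filter calls in Step 2b amount to substring filtering over the captured exchanges. A stand-in sketch; the exchange shape here is invented for illustration, while the real data comes from proxy_list_traffic:

```javascript
// Stand-in for proxy_list_traffic(url_filter: ...): substring filter
// over captured exchanges (hypothetical exchange objects).
function filterTraffic(exchanges, urlFilter) {
  return exchanges.filter(e => e.url.includes(urlFilter));
}

const captured = [
  { url: 'https://shop.com/_next/data/abc/products.json' },
  { url: 'https://shop.com/wp-json/wp/v2/posts' },
  { url: 'https://shop.com/styles.css' },
];

console.log(filterTraffic(captured, '/_next/data/').length); // 1
```

The same helper works for the pagination check in Step 2c by filtering on "page=".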
Every claimed extraction method must be verified. A data point is not "found" until the extraction path is specified and tested.
See: strategies/cheerio-vs-browser-test.md for the validation methodology
Step 3a: Validate CSS selectors
Step 3b: Validate JSON paths
- For JSON sources (__NEXT_DATA__, API responses): confirm the path resolves

Step 3c: Validate API endpoints
- Verify the endpoint response (proxy_get_exchange)

Step 3d: Downgrade or re-investigate failures
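The "confirm the path resolves" check in Step 3b can be sketched as walking a dotted path through the parsed JSON. The blob and paths below are made up for illustration:

```javascript
// Resolve a dotted JSON path; returns undefined when any segment is missing.
function resolvePath(obj, path) {
  return path.split('.').reduce((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

// Hypothetical parsed __NEXT_DATA__ blob
const nextData = { props: { pageProps: { product: { price: 19.99 } } } };

console.log(resolvePath(nextData, 'props.pageProps.product.price')); // 19.99 → validated
console.log(resolvePath(nextData, 'props.pageProps.reviews'));       // undefined → downgrade (Step 3d)
```

A path that resolves to undefined fails validation and triggers the downgrade/re-investigation in Step 3d.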
See: strategies/proxy-escalation.md for the complete skip/run decision logic
Skip Phase 4 when ALL of the following are true:

Run Phase 4 when ANY of the following is true:

If running:
Step 4a: Test raw HTTP access
curl -s -o /dev/null -w "%{http_code}" "https://target.com/page"
Step 4b: Test with stealth browser (if needed)
- interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare") → Protection cookies
- interceptor_chrome_devtools_list_storage_keys(storage_type: "local") → Fingerprint markers
- proxy_get_tls_fingerprints() → TLS fingerprint analysis

Step 4c: Test with upstream proxy (if needed)
- proxy_set_upstream("http://user:pass@proxy-provider:port")

Step 4d: Document protection profile
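The raw-HTTP probe in Step 4a can be interpreted with a small classifier. The status/header heuristics below are a common convention (a cf-ray header indicates Cloudflare), not the skill's definitive escalation logic:

```javascript
// Interpret the Step 4a probe: HTTP status + headers → suggested next step.
// Thresholds are illustrative only.
function classifyProtection(status, headers = {}) {
  if (headers['cf-ray'] && (status === 403 || status === 503)) return 'cloudflare-challenge';
  if (status === 403 || status === 429) return 'blocked-try-stealth-browser';
  if (status >= 200 && status < 300) return 'raw-http-ok';
  return 'inconclusive';
}

console.log(classifyProtection(200));                         // 'raw-http-ok'
console.log(classifyProtection(403, { 'cf-ray': '8a1b2c' })); // 'cloudflare-challenge'
```

A 'raw-http-ok' result supports skipping the rest of Phase 4; anything else feeds into Steps 4b-4d.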
Generate the intelligence report, then critically review it for gaps.
See: reference/report-schema.md for the complete report format
Step 5a: Generate report
- Follow the reference/report-schema.md schema (Sections 1-6)
- Record a "Validated?" status for every strategy (YES / PARTIAL / NO)

Step 5b: Self-critique
- Write Section 7 (self-critique) per reference/report-schema.md:
Step 5c: Fix gaps with targeted re-investigation
Step 5d: Record session (if browser was used)
- proxy_session_start(name) → proxy_session_stop(session_id) → proxy_export_har(session_id, path)
- See strategies/session-workflows.md

After the reconnaissance report is accepted, implement the scraper iteratively.
Core Pattern:

See: workflows/implementation.md for complete implementation patterns and code examples
Convert scraper to production-ready Apify Actor.
Activation triggers: "Make this an Apify Actor", "Productionize this", "Deploy to Apify"

Core Pattern:
- Initialize with the apify create command (CRITICAL)

Note: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.

See: workflows/productionization.md for the complete workflow, and apify/ for Actor development guides
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Adaptive Phases 0-5 | workflows/reconnaissance.md |
| Framework detection | Header + HTML signature matching | strategies/framework-signatures.md |
| Cheerio vs Browser | Three-way test + early exit | strategies/cheerio-vs-browser-test.md |
| Traffic analysis | proxy_list_traffic() + proxy_get_exchange() | strategies/traffic-interception.md |
```javascript
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const data = {
      title: $('h1').text().trim(),
      // ... extract data
    };
    await Dataset.pushData(data);
  },
});

await crawler.addRequests(urls);
await crawler.run();
```
See examples/sitemap-basic.js for complete example.
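A common follow-up to the sitemap pattern above is filtering the discovered URLs to one page type before crawling. The /products/ pattern below is hypothetical; see reference/regex-patterns.md for curated patterns:

```javascript
// Keep only the URLs matching a page-type pattern (pattern is illustrative).
function filterSitemapUrls(urls, pattern = /\/products\/\d+/) {
  return urls.filter(u => pattern.test(u));
}

const urls = [
  'https://shop.com/products/123',
  'https://shop.com/about',
  'https://shop.com/products/456',
];

console.log(filterSitemapUrls(urls)); // the two /products/ URLs
```

Filtering before crawler.addRequests() keeps the request queue small and avoids scraping irrelevant pages.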
```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
  const response = await gotScraping({
    url: `https://api.example.com/products/${id}`,
    responseType: 'json',
  });
  console.log(response.body);
}
```
See examples/api-scraper.js for complete example.
```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
  .map(url => url.match(/\/products\/(\d+)/)?.[1])
  .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
  const data = await gotScraping({
    url: `https://api.shop.com/v1/products/${id}`,
    responseType: 'json',
  });
  // Process data
}
```
See examples/hybrid-sitemap-api.js for complete example.
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
For: Step-by-step workflow guides for each phase
- workflows/reconnaissance.md - Phase 1 interactive reconnaissance (CRITICAL)
- workflows/implementation.md - Phase 4 iterative implementation patterns
- workflows/productionization.md - Phase 5 Apify Actor creation workflow

For: Detailed guides on specific scraping approaches
- strategies/framework-signatures.md - Framework detection lookup tables (Phase 0/1)
- strategies/cheerio-vs-browser-test.md - Cheerio vs Browser decision test with early exit
- strategies/proxy-escalation.md - Protection testing skip/run conditions (Phase 4)
- strategies/traffic-interception.md - Traffic interception via MITM proxy
- strategies/sitemap-discovery.md - Complete sitemap guide (4 patterns)
- strategies/api-discovery.md - Finding and using APIs
- strategies/dom-scraping.md - DOM scraping via DevTools bridge
- strategies/cheerio-scraping.md - HTTP-only scraping
- strategies/hybrid-approaches.md - Combining strategies
- strategies/anti-blocking.md - Multi-layer anti-detection (stealth, humanizer, proxies, TLS)
- strategies/session-workflows.md - Session recording, HAR export, replay

For: Working code to reference or execute
JavaScript Learning Examples (simple standalone scripts):
- examples/sitemap-basic.js - Simple sitemap scraper
- examples/api-scraper.js - Pure API approach
- examples/traffic-interception-basic.js - Proxy-based reconnaissance
- examples/hybrid-sitemap-api.js - Combined approach
- examples/iterative-fallback.js - Try traffic interception → sitemap → API → DOM scraping

TypeScript Production Examples (complete Actors):
- apify/examples/basic-scraper/ - Sitemap + Playwright
- apify/examples/anti-blocking/ - Fingerprinting + proxies
- apify/examples/hybrid-api/ - Sitemap + API (optimal)

For: Quick patterns and troubleshooting
- reference/report-schema.md - Intelligence report format (Sections 1-7 + self-critique)
- reference/proxy-tool-reference.md - Proxy-MCP tool reference (all 80+ tools)
- reference/regex-patterns.md - Common URL regex patterns
- reference/fingerprint-patterns.md - Stealth mode + TLS fingerprint presets
- reference/anti-patterns.md - What NOT to do

For: Creating production Apify Actors
- apify/README.md - When and how to use Apify
- apify/typescript-first.md - Why TypeScript for Actors
- apify/cli-workflow.md - apify create workflow (CRITICAL)
- apify/initialization.md - Complete setup guide
- apify/input-schemas.md - Input validation patterns
- apify/configuration.md - actor.json setup
- apify/deployment.md - Testing and deployment
- apify/templates/ - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
Start cheap (curl), escalate only when needed:
Use framework detection to focus searches:
- Match against strategies/framework-signatures.md before scanning
- Skip markers that cannot be present (e.g. no __NEXT_DATA__ on Amazon)

Every claimed extraction method must be tested:

Build incrementally:

When productionizing:
- Initialize with apify create (never manual setup)

Remember: Traffic interception first, sitemaps second, APIs third, DOM scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.
Weekly Installs: 122
Repository: https://github.com/yfe404/web-scraper
GitHub Stars: 22
First Seen: Jan 31, 2026
Security Audits: Gen Agent Trust Hub: Fail, Socket: Warn, Snyk: Fail
Installed on: codex (120), gemini-cli (119), github-copilot (119), opencode (118), kimi-cli (117), amp (117)
| Protection testing | Conditional escalation | strategies/proxy-escalation.md |
| Report format | Sections 1-7 with self-critique | reference/report-schema.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | Traffic capture (automatic) | strategies/api-discovery.md |
| DOM scraping | DevTools bridge + humanizer | strategies/dom-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | Stealth mode + upstream proxies | strategies/anti-blocking.md |
| Session recording | proxy_session_start() / proxy_export_har() | strategies/session-workflows.md |
| Proxy-MCP tools | Complete reference | reference/proxy-tool-reference.md |
| Fingerprint configs | Stealth + TLS presets | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |