web-scraping by yfe404/web-scraper
npx skills add https://github.com/yfe404/web-scraper --skill web-scraping
Activate automatically when the user requests web scraping or data extraction, anti-blocking help (strategies/anti-blocking.md), or Apify Actor work (apify/ subdirectory).

Determine reconnaissance depth from the user request:
| User Says | Mode | Phases Run |
|---|---|---|
| "quick recon", "just check", "what framework" | Quick | Phase 0 only |
| "scrape X", "extract data from X" (default) | Standard | Phases 0-3 + 5, Phase 4 only if protection signals detected |
| "full recon", "deep scan", "production scraping" | Full | All phases (0-5) including protection testing |
Default is Standard mode. Escalate to Full if protection signals appear during any phase.
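The phrasing-to-mode mapping above can be sketched as a simple keyword match. The phrases and mode names come straight from the table; the function itself is only an illustration, not part of the skill:

```javascript
// Pick a reconnaissance mode from the user's phrasing (illustrative only).
function selectMode(request) {
  const text = request.toLowerCase();
  if (/full recon|deep scan|production scraping/.test(text)) return 'full';
  if (/quick recon|just check|what framework/.test(text)) return 'quick';
  return 'standard'; // default, per the table above
}

// Escalate to Full if protection signals appear during any phase.
function escalate(mode, protectionSignalsSeen) {
  return protectionSignalsSeen ? 'full' : mode;
}

console.log(selectMode('scrape example.com')); // 'standard'
console.log(escalate('standard', true));       // 'full'
```

Note the ordering: the "full" patterns are checked first so that an explicit request for depth always wins.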
This skill uses an adaptive phased workflow with quality gates. Each gate asks "Do I have enough information?" and continues to the next phase only when the answer is no.
See: strategies/framework-signatures.md for the framework detection tables referenced throughout.
Gather maximum intelligence with minimum cost — a single HTTP request.
Step 0a: Fetch raw HTML and headers
curl -s -D- -L "https://target.com/page" -o response.html
Step 0b: Check response headers
- Match against strategies/framework-signatures.md → Response Header Signatures table
- Inspect Server, X-Powered-By, X-Shopify-Stage, Set-Cookie (protection markers)

Step 0c: Check Known Major Sites table
- Match against strategies/framework-signatures.md → Known Major Sites

Step 0d: Detect framework from HTML
- Match against strategies/framework-signatures.md → HTML Signatures table
- Look for __NEXT_DATA__, __NUXT__, ld+json, /wp-content/, data-reactroot

Step 0e: Search for target data points
- curl -s https://[site]/robots.txt | grep -i Sitemap

Step 0f: Note protection signals
- cf-ray header

See: strategies/cheerio-vs-browser-test.md for the Cheerio viability assessment
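As a rough sketch, Steps 0b and 0d reduce to matching headers and raw HTML against a signature table. The markers below are well-known public signatures, but the authoritative tables live in strategies/framework-signatures.md; this function is an illustration, not the skill's implementation:

```javascript
// Minimal signature matcher for Steps 0b/0d (illustrative subset of the
// tables in strategies/framework-signatures.md).
const headerSignatures = [
  { header: 'x-shopify-stage', framework: 'Shopify' },
  { header: 'x-powered-by', value: /express/i, framework: 'Express' },
];
const htmlSignatures = [
  { marker: '__NEXT_DATA__', framework: 'Next.js' },
  { marker: '__NUXT__', framework: 'Nuxt' },
  { marker: '/wp-content/', framework: 'WordPress' },
  { marker: 'data-reactroot', framework: 'React' },
];

function detectFramework(headers, html) {
  for (const sig of headerSignatures) {
    const value = headers[sig.header];
    if (value !== undefined && (!sig.value || sig.value.test(value))) return sig.framework;
  }
  for (const sig of htmlSignatures) {
    if (html.includes(sig.marker)) return sig.framework;
  }
  return null; // unknown: fall through to Phase 1 (browser)
}

console.log(detectFramework({}, '<script id="__NEXT_DATA__">{}</script>')); // 'Next.js'
```

Header signatures are checked before HTML markers because they are cheaper and less ambiguous.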
QUALITY GATE A: All target data points found in raw HTML and no protection signals?
→ YES: Skip to Phase 3 (Validate Findings). No browser needed.
→ NO: Continue to Phase 1.
Launch browser only for data points missing from raw HTML or when JavaScript rendering is required.
Step 1a: Initialize browser session
- proxy_start() → Start the traffic interception proxy
- interceptor_chrome_launch(url, stealthMode: true) → Launch Chrome with anti-detection
- interceptor_chrome_devtools_attach(target_id) → Attach the DevTools bridge
- interceptor_chrome_devtools_screenshot() → Capture visual state

Step 1b: Capture traffic and rendered DOM
- proxy_list_traffic() → Review all traffic from the page load
- proxy_search_traffic(query: "application/json") → Find JSON responses
- interceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"]) → XHR/fetch calls
- interceptor_chrome_devtools_snapshot() → Accessibility tree (rendered DOM)

Step 1c: Search rendered DOM for missing data points
- Use the framework-specific strategies in strategies/framework-signatures.md → Framework → Search Strategy table

Step 1d: Inspect discovered endpoints
- proxy_get_exchange(exchange_id) → Full request/response for promising endpoints

QUALITY GATE B: All target data points now covered (raw HTML + rendered DOM + traffic)?
→ YES: Skip to Phase 3 (Validate Findings). No deep scan needed.
→ NO: Continue to Phase 2 for missing data points only.
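Quality Gates A and B both amount to a coverage check over the target data points. A minimal sketch, assuming a hypothetical "findings" map from data point to the sources where it was located (raw HTML, rendered DOM, traffic):

```javascript
// Quality-gate helper: which target data points are still uncovered?
// The "findings" shape is hypothetical, for illustration only.
function missingDataPoints(targets, findings) {
  return targets.filter(t => !findings[t] || findings[t].length === 0);
}

const targets = ['title', 'price', 'reviews'];
const findings = { title: ['rawHtml'], price: ['trafficJson'], reviews: [] };

console.log(missingDataPoints(targets, findings)); // [ 'reviews' ] → continue to Phase 2
```

An empty result means the gate passes and the workflow can skip ahead to Phase 3.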
Targeted investigation for data points not yet found. Only search for what's missing.
Step 2a: Test interactions for missing data
- proxy_clear_traffic() before each action → Isolate API calls
- humanizer_click(target_id, selector) → Trigger dynamic content loads
- humanizer_scroll(target_id, direction, amount) → Trigger lazy loading / infinite scroll
- humanizer_idle(target_id, duration_ms) → Wait for delayed content
- proxy_list_traffic() → Check for new API calls

Step 2b: Sniff APIs (framework-aware)
- Next.js: proxy_list_traffic(url_filter: "/_next/data/")
- WordPress: proxy_list_traffic(url_filter: "/wp-json/")
- GraphQL: proxy_search_traffic(query: "graphql")
- Generic: proxy_list_traffic(url_filter: "/api/") + proxy_search_traffic(query: "application/json")

Step 2c: Test pagination and filtering
- proxy_clear_traffic() → click next page → proxy_list_traffic(url_filter: "page=")

QUALITY GATE C: Enough data points covered for a useful report?
→ YES: Go to Phase 3.
→ NO: Document gaps and go to Phase 3 anyway (the report will note missing data in the self-critique).
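The url_filter calls in Step 2b amount to substring filtering over the captured exchanges. A stand-in sketch; the exchange shape here is invented for illustration, while the real data comes from proxy_list_traffic:

```javascript
// Stand-in for proxy_list_traffic(url_filter: ...): substring filter
// over captured exchanges (hypothetical exchange objects).
function filterTraffic(exchanges, urlFilter) {
  return exchanges.filter(e => e.url.includes(urlFilter));
}

const captured = [
  { url: 'https://shop.com/_next/data/abc/products.json' },
  { url: 'https://shop.com/wp-json/wp/v2/posts' },
  { url: 'https://shop.com/styles.css' },
];

console.log(filterTraffic(captured, '/_next/data/').length); // 1
```

The same helper works for the pagination check in Step 2c by filtering on "page=".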
Every claimed extraction method must be verified. A data point is not "found" until the extraction path is specified and tested.
See: strategies/cheerio-vs-browser-test.md for the validation methodology
Step 3a: Validate CSS selectors
Step 3b: Validate JSON paths
- For JSON sources (__NEXT_DATA__, API responses): confirm the path resolves

Step 3c: Validate API endpoints
- Verify the endpoint response (proxy_get_exchange)

Step 3d: Downgrade or re-investigate failures
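The "confirm the path resolves" check in Step 3b can be sketched as walking a dotted path through the parsed JSON. The blob and paths below are made up for illustration:

```javascript
// Resolve a dotted JSON path; returns undefined when any segment is missing.
function resolvePath(obj, path) {
  return path.split('.').reduce((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

// Hypothetical parsed __NEXT_DATA__ blob
const nextData = { props: { pageProps: { product: { price: 19.99 } } } };

console.log(resolvePath(nextData, 'props.pageProps.product.price')); // 19.99 → validated
console.log(resolvePath(nextData, 'props.pageProps.reviews'));       // undefined → downgrade (Step 3d)
```

A path that resolves to undefined fails validation and triggers the downgrade/re-investigation in Step 3d.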
See: strategies/proxy-escalation.md for the complete skip/run decision logic
Skip Phase 4 when ALL of the following are true:

Run Phase 4 when ANY of the following is true:

If running:
Step 4a: Test raw HTTP access
curl -s -o /dev/null -w "%{http_code}" "https://target.com/page"
Step 4b: Test with stealth browser (if needed)
- interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare") → Protection cookies
- interceptor_chrome_devtools_list_storage_keys(storage_type: "local") → Fingerprint markers
- proxy_get_tls_fingerprints() → TLS fingerprint analysis

Step 4c: Test with upstream proxy (if needed)
- proxy_set_upstream("http://user:pass@proxy-provider:port")

Step 4d: Document protection profile
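The raw-HTTP probe in Step 4a can be interpreted with a small classifier. The status/header heuristics below are a common convention (a cf-ray header indicates Cloudflare), not the skill's definitive escalation logic:

```javascript
// Interpret the Step 4a probe: HTTP status + headers → suggested next step.
// Thresholds are illustrative only.
function classifyProtection(status, headers = {}) {
  if (headers['cf-ray'] && (status === 403 || status === 503)) return 'cloudflare-challenge';
  if (status === 403 || status === 429) return 'blocked-try-stealth-browser';
  if (status >= 200 && status < 300) return 'raw-http-ok';
  return 'inconclusive';
}

console.log(classifyProtection(200));                         // 'raw-http-ok'
console.log(classifyProtection(403, { 'cf-ray': '8a1b2c' })); // 'cloudflare-challenge'
```

A 'raw-http-ok' result supports skipping the rest of Phase 4; anything else feeds into Steps 4b-4d.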
Generate the intelligence report, then critically review it for gaps.
See: reference/report-schema.md for the complete report format
Step 5a: Generate report
- Follow the reference/report-schema.md schema (Sections 1-6)
- Record a "Validated?" status for every strategy (YES / PARTIAL / NO)

Step 5b: Self-critique
- Write Section 7 (self-critique) per reference/report-schema.md:
Step 5c: Fix gaps with targeted re-investigation
Step 5d: Record session (if browser was used)
- proxy_session_start(name) → proxy_session_stop(session_id) → proxy_export_har(session_id, path)
- See strategies/session-workflows.md

After the reconnaissance report is accepted, implement the scraper iteratively.
Core Pattern:

See: workflows/implementation.md for complete implementation patterns and code examples
Convert scraper to production-ready Apify Actor.
Activation triggers: "Make this an Apify Actor", "Productionize this", "Deploy to Apify"

Core Pattern:
- Initialize with the apify create command (CRITICAL)

Note: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.

See: workflows/productionization.md for the complete workflow, and apify/ for Actor development guides
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Adaptive Phases 0-5 | workflows/reconnaissance.md |
| Framework detection | Header + HTML signature matching | strategies/framework-signatures.md |
| Cheerio vs Browser | Three-way test + early exit | strategies/cheerio-vs-browser-test.md |
| Traffic analysis | proxy_list_traffic() + proxy_get_exchange() | strategies/traffic-interception.md |
```javascript
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const data = {
      title: $('h1').text().trim(),
      // ... extract data
    };
    await Dataset.pushData(data);
  },
});

await crawler.addRequests(urls);
await crawler.run();
```
See examples/sitemap-basic.js for complete example.
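A common follow-up to the sitemap pattern above is filtering the discovered URLs to one page type before crawling. The /products/ pattern below is hypothetical; see reference/regex-patterns.md for curated patterns:

```javascript
// Keep only the URLs matching a page-type pattern (pattern is illustrative).
function filterSitemapUrls(urls, pattern = /\/products\/\d+/) {
  return urls.filter(u => pattern.test(u));
}

const urls = [
  'https://shop.com/products/123',
  'https://shop.com/about',
  'https://shop.com/products/456',
];

console.log(filterSitemapUrls(urls)); // the two /products/ URLs
```

Filtering before crawler.addRequests() keeps the request queue small and avoids scraping irrelevant pages.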
```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
  const response = await gotScraping({
    url: `https://api.example.com/products/${id}`,
    responseType: 'json',
  });
  console.log(response.body);
}
```
See examples/api-scraper.js for complete example.
```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
  .map(url => url.match(/\/products\/(\d+)/)?.[1])
  .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
  const data = await gotScraping({
    url: `https://api.shop.com/v1/products/${id}`,
    responseType: 'json',
  });
  // Process data
}
```
See examples/hybrid-sitemap-api.js for complete example.
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
For: Step-by-step workflow guides for each phase
- workflows/reconnaissance.md - Phase 1 interactive reconnaissance (CRITICAL)
- workflows/implementation.md - Phase 4 iterative implementation patterns
- workflows/productionization.md - Phase 5 Apify Actor creation workflow

For: Detailed guides on specific scraping approaches
- strategies/framework-signatures.md - Framework detection lookup tables (Phase 0/1)
- strategies/cheerio-vs-browser-test.md - Cheerio vs Browser decision test with early exit
- strategies/proxy-escalation.md - Protection testing skip/run conditions (Phase 4)
- strategies/traffic-interception.md - Traffic interception via MITM proxy
- strategies/sitemap-discovery.md - Complete sitemap guide (4 patterns)
- strategies/api-discovery.md - Finding and using APIs
- strategies/dom-scraping.md - DOM scraping via DevTools bridge
- strategies/cheerio-scraping.md - HTTP-only scraping
- strategies/hybrid-approaches.md - Combining strategies
- strategies/anti-blocking.md - Multi-layer anti-detection (stealth, humanizer, proxies, TLS)
- strategies/session-workflows.md - Session recording, HAR export, replay

For: Working code to reference or execute
JavaScript Learning Examples (simple standalone scripts):
- examples/sitemap-basic.js - Simple sitemap scraper
- examples/api-scraper.js - Pure API approach
- examples/traffic-interception-basic.js - Proxy-based reconnaissance
- examples/hybrid-sitemap-api.js - Combined approach
- examples/iterative-fallback.js - Try traffic interception → sitemap → API → DOM scraping

TypeScript Production Examples (complete Actors):
- apify/examples/basic-scraper/ - Sitemap + Playwright
- apify/examples/anti-blocking/ - Fingerprinting + proxies
- apify/examples/hybrid-api/ - Sitemap + API (optimal)

For: Quick patterns and troubleshooting
- reference/report-schema.md - Intelligence report format (Sections 1-7 + self-critique)
- reference/proxy-tool-reference.md - Proxy-MCP tool reference (all 80+ tools)
- reference/regex-patterns.md - Common URL regex patterns
- reference/fingerprint-patterns.md - Stealth mode + TLS fingerprint presets
- reference/anti-patterns.md - What NOT to do

For: Creating production Apify Actors
- apify/README.md - When and how to use Apify
- apify/typescript-first.md - Why TypeScript for Actors
- apify/cli-workflow.md - apify create workflow (CRITICAL)
- apify/initialization.md - Complete setup guide
- apify/input-schemas.md - Input validation patterns
- apify/configuration.md - actor.json setup
- apify/deployment.md - Testing and deployment
- apify/templates/ - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
Start cheap (curl), escalate only when needed:
Use framework detection to focus searches:
- Match against strategies/framework-signatures.md before scanning
- Skip markers that cannot be present (e.g. no __NEXT_DATA__ on Amazon)

Every claimed extraction method must be tested:

Build incrementally:

When productionizing:
- Initialize with apify create (never manual setup)

Remember: Traffic interception first, sitemaps second, APIs third, DOM scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.
Weekly Installs: 122
Repository: https://github.com/yfe404/web-scraper
GitHub Stars: 22
First Seen: Jan 31, 2026
Security Audits: Gen Agent Trust Hub: Fail, Socket: Warn, Snyk: Fail
Installed on: codex (120), gemini-cli (119), github-copilot (119), opencode (118), kimi-cli (117), amp (117)
| Protection testing | Conditional escalation | strategies/proxy-escalation.md |
| Report format | Sections 1-7 with self-critique | reference/report-schema.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | Traffic capture (automatic) | strategies/api-discovery.md |
| DOM scraping | DevTools bridge + humanizer | strategies/dom-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | Stealth mode + upstream proxies | strategies/anti-blocking.md |
| Session recording | proxy_session_start() / proxy_export_har() | strategies/session-workflows.md |
| Proxy-MCP tools | Complete reference | reference/proxy-tool-reference.md |
| Fingerprint configs | Stealth + TLS presets | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |