Apify Automation Scraping Tool: Run Web Scraping Actors and Manage Datasets Directly in Claude

Apify Automation by composiohq/awesome-claude-skills

43,100 GitHub Stars

Install command

npx skills add https://github.com/composiohq/awesome-claude-skills --skill 'Apify Automation'

Automation, Data Processing, API

🇨🇳 Chinese Introduction

Apify Automation

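Run Apify web scraping Actors and manage datasets directly from Claude Code. Execute crawlers synchronously or asynchronously, retrieve structured data, create reusable tasks, and inspect run logs without leaving your terminal.

Toolkit docs: composio.dev/toolkits/apify

Setup

Add the Composio MCP server to your configuration: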
```
https://rube.app/mcp
```
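Connect your Apify account when prompted. The agent will provide an authentication link.
Browse available Actors at apify.com/store. Each Actor has its own unique input schema -- always check the Actor's documentation before running.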

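Core Workflows

1. Run an Actor Synchronously and Get Results

Execute an Actor and immediately retrieve its dataset items in a single call. Best for quick scraping jobs.

Tool: APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS

Key parameters:

actorId (required) -- Actor ID in format username/actor-name (e.g., compass/crawler-google-places)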

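2. Run an Actor Asynchronously

Trigger an Actor run without waiting for completion. Use for long-running scraping jobs.

Tool: APIFY_RUN_ACTOR

Key parameters:

actorId (required) -- Actor slug or ID
body -- JSON input object for the Actor
memory -- memory limit in MB (must be a power of 2, minimum 128)
timeout -- run timeout in seconds
maxItems -- cap on returned items
build -- specific build tag (e.g., latest, beta)

Follow up with APIFY_GET_DATASET_ITEMS to retrieve results using the run's datasetId.

Example prompt: "Start the web scraper Actor for example.com asynchronously with 1024MB memory"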

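3. Retrieve Dataset Items

Fetch data from a specific dataset with pagination, field selection, and filtering.

Tool: APIFY_GET_DATASET_ITEMS

Key parameters:

datasetId (required) -- dataset identifier
limit (default/max 1000) -- items per page
offset (default 0) -- pagination offset
format -- json (recommended), csv, xlsx
fields -- include only specific fields
omit -- exclude specific fields
clean -- remove Apify-specific metadata
desc -- reverse order (newest first)

Example prompt: "Get the first 500 items from dataset myDatasetId in JSON format"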

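4. Inspect Actor Details

View Actor metadata, input schema, and configuration before running it.

Tool: APIFY_GET_ACTOR

Key parameters:

actorId (required) -- Actor ID in format username/actor-name or hex ID

Example prompt: "Show me the details and input schema for the apify/web-scraper Actor"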

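5. Create Reusable Tasks

Configure reusable Actor tasks with preset inputs for recurring scraping jobs.

Tool: APIFY_CREATE_TASK

Configure a task once, then trigger it repeatedly with consistent input parameters. Useful for scheduled or recurring data collection workflows.

Example prompt: "Create an Apify task for the Google Search scraper with default query 'AI startups' and US location"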

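6. Manage Runs and Datasets

List Actor runs, browse datasets, and inspect run details for monitoring and debugging.

Tools: APIFY_GET_LIST_OF_RUNS, APIFY_DATASETS_GET, APIFY_DATASET_GET, APIFY_GET_LOG

For listing runs:

Filter by Actor and optionally by status
Get datasetId from run details for data retrieval

For dataset management:

APIFY_DATASETS_GET -- list all your datasets with pagination
APIFY_DATASET_GET -- get metadata for a specific dataset

For debugging:

APIFY_GET_LOG -- retrieve execution logs for a run or build

Example prompt: "List the last 10 runs for the web scraper Actor and show logs for the most recent one"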

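Known Pitfalls

Actor input schemas vary wildly: Every Actor has its own unique input fields. Generic field names like queries or search_terms will be rejected. Always check the Actor's page on apify.com/store for exact field names (e.g., searchStringsArray for Google Maps, startUrls for web scrapers).
URL format requirements: Always include the full protocol (https:// or http://) in URLs. Many Actors require URLs as objects with a url property: {"startUrls": [{"url": "https://example.com"}]}.
Dataset pagination cap: APIFY_GET_DATASET_ITEMS has a max limit of 1000 per call. For large datasets, loop with offset to collect all items.
Enum values are lowercase: Most Actors expect lowercase enum values (e.g., relevance not RELEVANCE, all not ALL).
Sync timeout at 5 minutes: APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS has a maximum waitForFinish of 300 seconds. For longer runs, use APIFY_RUN_ACTOR (async) and poll with APIFY_GET_DATASET_ITEMS.
Data volume costs: Large datasets can be expensive to fetch. Prefer moderate limits and incremental processing to avoid timeouts or memory pressure.
JSON format recommended: CSV and XLSX are available, but JSON is the most reliable format for automated processing. Avoid CSV/XLSX for downstream automation.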

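Quick Reference

Tool Slug	Description
`APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS`	Run Actor synchronously and get results immediately
`APIFY_RUN_ACTOR`	Run Actor asynchronously (trigger and return)
`APIFY_RUN_ACTOR_SYNC`	Run Actor synchronously, return output record
`APIFY_GET_ACTOR`	Get Actor metadata and input schema
`APIFY_GET_DATASET_ITEMS`	Retrieve items from a dataset (paginated)
`APIFY_DATASET_GET`	Get dataset metadata (item count, etc.)
`APIFY_DATASETS_GET`	List all user datasets
`APIFY_CREATE_TASK`	Create a reusable Actor task
`APIFY_GET_TASK_INPUT`	Inspect a task's stored input
`APIFY_GET_LIST_OF_RUNS`	List runs for an Actor
`APIFY_GET_LOG`	Get execution logs for a run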

Powered by Composio

🇺🇸English

Apify Automation

Run Apify web scraping Actors and manage datasets directly from Claude Code. Execute crawlers synchronously or asynchronously, retrieve structured data, create reusable tasks, and inspect run logs without leaving your terminal.

Toolkit docs: composio.dev/toolkits/apify

Setup

Add the Composio MCP server to your configuration:
```
https://rube.app/mcp
```
Connect your Apify account when prompted. The agent will provide an authentication link.
Browse available Actors at apify.com/store. Each Actor has its own unique input schema -- always check the Actor's documentation before running.

Core Workflows

1. Run an Actor Synchronously and Get Results

Execute an Actor and immediately retrieve its dataset items in a single call. Best for quick scraping jobs.

Tool: APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS

Key parameters:

actorId (required) -- Actor ID in format username/actor-name (e.g., compass/crawler-google-places)
input -- JSON input object matching the Actor's schema. Each Actor has unique field names -- check apify.com/store for the exact schema.
limit -- max items to return
offset -- skip items for pagination
format -- json (default), csv, jsonl, html, xlsx, xml
timeout -- run timeout in seconds
waitForFinish -- max wait time (0-300 seconds)
fields -- comma-separated list of fields to include
omit -- comma-separated list of fields to exclude

Example prompt: "Run the Google Places scraper for 'restaurants in New York' and return the first 50 results"
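The parameters above could be assembled along these lines. This is a minimal sketch, not the toolkit's own client: `choose_run_tool` and `build_sync_args` are hypothetical helpers that only build and validate an argument dictionary, and nothing here calls Composio or Apify. The `searchStringsArray` field is the name this document cites for Google Maps; confirm it against the Actor's schema before relying on it.

```python
from typing import Any

SYNC_TOOL = "APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS"
ASYNC_TOOL = "APIFY_RUN_ACTOR"
MAX_SYNC_WAIT_SECONDS = 300  # upper bound of waitForFinish in the list above


def choose_run_tool(expected_seconds: int) -> str:
    """Pick the sync tool only when the run should fit in the sync wait window."""
    if expected_seconds <= MAX_SYNC_WAIT_SECONDS:
        return SYNC_TOOL
    return ASYNC_TOOL


def build_sync_args(actor_id: str, actor_input: dict[str, Any],
                    limit: int = 50, wait_for_finish: int = 120) -> dict[str, Any]:
    """Build the argument dict (hypothetical helper), rejecting out-of-range waits."""
    if not 0 <= wait_for_finish <= MAX_SYNC_WAIT_SECONDS:
        raise ValueError("waitForFinish must be between 0 and 300 seconds")
    return {
        "actorId": actor_id,
        "input": actor_input,  # field names must match the Actor's own schema
        "limit": limit,
        "format": "json",
        "waitForFinish": wait_for_finish,
    }


args = build_sync_args(
    "compass/crawler-google-places",
    {"searchStringsArray": ["restaurants in New York"]},
)
print(choose_run_tool(90), args["limit"])
```

Validating `waitForFinish` locally surfaces an out-of-range value before a run is started.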

2. Run an Actor Asynchronously

Trigger an Actor run without waiting for completion. Use for long-running scraping jobs.

Tool: APIFY_RUN_ACTOR

Key parameters:

actorId (required) -- Actor slug or ID
body -- JSON input object for the Actor
memory -- memory limit in MB (must be power of 2, minimum 128)
timeout -- run timeout in seconds
maxItems -- cap on returned items
build -- specific build tag (e.g., latest, beta)

Follow up with APIFY_GET_DATASET_ITEMS to retrieve results using the run's datasetId.

Example prompt: "Start the web scraper Actor for example.com asynchronously with 1024MB memory"
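The memory rule and the input body can be checked before a run is triggered. The sketch below assumes hypothetical helper names (`is_valid_memory_mb`, `to_start_urls`, `build_async_args`) and a `startUrls` input field, which this document mentions for web scrapers; the actual field names depend on the Actor. It only builds the argument dictionary and makes no network call.

```python
from typing import Any


def is_valid_memory_mb(memory_mb: int) -> bool:
    """True when memory_mb is a power of 2 and at least 128."""
    return memory_mb >= 128 and (memory_mb & (memory_mb - 1)) == 0


def to_start_urls(urls: list[str]) -> list[dict[str, str]]:
    """Wrap each URL as {"url": ...}, adding https:// when the protocol is missing."""
    wrapped = []
    for url in urls:
        if not url.startswith(("http://", "https://")):
            url = "https://" + url
        wrapped.append({"url": url})
    return wrapped


def build_async_args(actor_id: str, urls: list[str],
                     memory_mb: int = 1024, timeout_s: int = 3600) -> dict[str, Any]:
    """Build arguments for APIFY_RUN_ACTOR (hypothetical helper)."""
    if not is_valid_memory_mb(memory_mb):
        raise ValueError("memory must be a power of 2 and at least 128 MB")
    return {
        "actorId": actor_id,
        "body": {"startUrls": to_start_urls(urls)},  # assumed input field name
        "memory": memory_mb,
        "timeout": timeout_s,
    }


async_args = build_async_args("apify/web-scraper", ["example.com"])
print(async_args["body"])
```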

3. Retrieve Dataset Items

Fetch data from a specific dataset with pagination, field selection, and filtering.

Tool: APIFY_GET_DATASET_ITEMS

Key parameters:

datasetId (required) -- dataset identifier
limit (default/max 1000) -- items per page
offset (default 0) -- pagination offset
format -- json (recommended), csv, xlsx
fields -- include only specific fields
omit -- exclude specific fields
clean -- remove Apify-specific metadata
desc -- reverse order (newest first)

Example prompt: "Get the first 500 items from dataset myDatasetId in JSON format"
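Since limit tops out at 1000 per call, larger datasets need a loop over offset. The sketch below shows one way to page through results. `fetch_page` is a hypothetical callable standing in for a call to APIFY_GET_DATASET_ITEMS, and the in-memory `fake_dataset` exists only so the example runs without a network connection.

```python
from typing import Any, Callable

PAGE_LIMIT = 1000  # per-call maximum stated for limit

Item = dict[str, Any]


def fetch_all_items(fetch_page: Callable[[int, int], list[Item]],
                    page_limit: int = PAGE_LIMIT) -> list[Item]:
    """Collect every item by advancing offset until a short page is returned."""
    if not 1 <= page_limit <= PAGE_LIMIT:
        raise ValueError("page_limit must be between 1 and 1000")
    items: list[Item] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_limit)
        items.extend(page)
        if len(page) < page_limit:  # last (possibly empty) page reached
            return items
        offset += page_limit


# Stand-in data source; a real fetch_page would call the dataset tool instead.
fake_dataset = [{"id": i} for i in range(2500)]


def fake_fetch_page(offset: int, limit: int) -> list[Item]:
    return fake_dataset[offset:offset + limit]


all_items = fetch_all_items(fake_fetch_page)
print(len(all_items))
```

For very large datasets, processing each page as it arrives instead of accumulating everything in one list keeps memory use bounded.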

4. Inspect Actor Details

View Actor metadata, input schema, and configuration before running it.

Tool: APIFY_GET_ACTOR

Key parameters:

actorId (required) -- Actor ID in format username/actor-name or hex ID

Example prompt: "Show me the details and input schema for the apify/web-scraper Actor"

5. Create Reusable Tasks

Configure reusable Actor tasks with preset inputs for recurring scraping jobs.

Tool: APIFY_CREATE_TASK

Configure a task once, then trigger it repeatedly with consistent input parameters. Useful for scheduled or recurring data collection workflows.

Example prompt: "Create an Apify task for the Google Search scraper with default query 'AI startups' and US location"

6. Manage Runs and Datasets

List Actor runs, browse datasets, and inspect run details for monitoring and debugging.

Tools: APIFY_GET_LIST_OF_RUNS, APIFY_DATASETS_GET, APIFY_DATASET_GET, APIFY_GET_LOG

For listing runs:

Filter by Actor and optionally by status
Get datasetId from run details for data retrieval

For dataset management:

APIFY_DATASETS_GET -- list all your datasets with pagination
APIFY_DATASET_GET -- get metadata for a specific dataset

For debugging:

APIFY_GET_LOG -- retrieve execution logs for a run or build

Example prompt: "List the last 10 runs for the web scraper Actor and show logs for the most recent one"
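Picking the dataset of the most recent successful run from a run listing might look like the sketch below. The field names `status`, `startedAt`, and `defaultDatasetId` are assumptions about the shape of a run record; the response returned by APIFY_GET_LIST_OF_RUNS may wrap or name them differently, so inspect a real response first. `latest_dataset_id` and the sample records are hypothetical.

```python
from typing import Any, Optional


def latest_dataset_id(runs: list[dict[str, Any]],
                      status: str = "SUCCEEDED") -> Optional[str]:
    """Return the dataset ID of the newest run with the given status, if any."""
    matching = [run for run in runs if run.get("status") == status]
    if not matching:
        return None
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    newest = max(matching, key=lambda run: run["startedAt"])
    return newest.get("defaultDatasetId")


# Hypothetical run records, shaped per the assumptions stated above.
sample_runs = [
    {"id": "run1", "status": "SUCCEEDED",
     "startedAt": "2024-05-01T10:00:00.000Z", "defaultDatasetId": "ds1"},
    {"id": "run2", "status": "FAILED",
     "startedAt": "2024-05-03T10:00:00.000Z", "defaultDatasetId": "ds2"},
    {"id": "run3", "status": "SUCCEEDED",
     "startedAt": "2024-05-02T10:00:00.000Z", "defaultDatasetId": "ds3"},
]

print(latest_dataset_id(sample_runs))
```

The returned ID can then be passed as datasetId to APIFY_GET_DATASET_ITEMS.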

Known Pitfalls

Actor input schemas vary wildly: Every Actor has its own unique input fields. Generic field names like queries or search_terms will be rejected. Always check the Actor's page on apify.com/store for exact field names (e.g., searchStringsArray for Google Maps, startUrls for web scrapers).
URL format requirements: Always include the full protocol (https:// or http://) in URLs. Many Actors require URLs as objects with a url property: {"startUrls": [{"url": "https://example.com"}]}.
Dataset pagination cap: APIFY_GET_DATASET_ITEMS has a max limit of 1000 per call. For large datasets, loop with offset to collect all items.
Enum values are lowercase: Most Actors expect lowercase enum values (e.g., relevance not RELEVANCE, all not ALL).
Sync timeout at 5 minutes: APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS has a maximum waitForFinish of 300 seconds. For longer runs, use APIFY_RUN_ACTOR (async) and poll with APIFY_GET_DATASET_ITEMS.
Data volume costs: Large datasets can be expensive to fetch. Prefer moderate limits and incremental processing to avoid timeouts or memory pressure.
JSON format recommended: CSV and XLSX are available, but JSON is the most reliable format for automated processing. Avoid CSV/XLSX for downstream automation.

Quick Reference

Tool Slug	Description
`APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS`	Run Actor synchronously and get results immediately
`APIFY_RUN_ACTOR`	Run Actor asynchronously (trigger and return)
`APIFY_RUN_ACTOR_SYNC`	Run Actor synchronously, return output record
`APIFY_GET_ACTOR`	Get Actor metadata and input schema
`APIFY_GET_DATASET_ITEMS`	Retrieve items from a dataset (paginated)
`APIFY_DATASET_GET`	Get dataset metadata (item count, etc.)
`APIFY_DATASETS_GET`	List all user datasets
`APIFY_CREATE_TASK`	Create a reusable Actor task
`APIFY_GET_TASK_INPUT`	Inspect a task's stored input
`APIFY_GET_LIST_OF_RUNS`	List runs for an Actor
`APIFY_GET_LOG`	Get execution logs for a run

Powered by Composio
