npx skills add https://github.com/vm0-ai/vm0-skills --skill apify
Web scraping and automation platform. Run pre-built Actors (scrapers) or create your own. Access thousands of ready-to-use scrapers for popular websites.
Official docs: https://docs.apify.com/api/v2
Use this skill when you need to run pre-built scrapers, crawl websites, or automate data collection from popular sites via the Apify API.
Set environment variable:
export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxxxxxxxxxx"
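Before making API calls, it can help to confirm the token is actually exported. A minimal sketch; the commented-out request against the /v2/users/me endpoint, which returns your account info for a valid token, is optional:

```shell
# Optional sanity check: make sure the token is set before any API call.
if [ -n "${APIFY_TOKEN:-}" ]; then
  TOKEN_STATE=set
  # curl -s "https://api.apify.com/v2/users/me" \
  #   --header "Authorization: Bearer ${APIFY_TOKEN}" | jq -r '.data.username'
else
  TOKEN_STATE=missing
fi
echo "APIFY_TOKEN is ${TOKEN_STATE}"
```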
Start an Actor run asynchronously:
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10,
"pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
The response contains id (the run ID) and defaultDatasetId for fetching results.
Wait for completion and get results directly (max 5 min):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://news.ycombinator.com"}],
"maxPagesPerCrawl": 1,
"pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/run-sync-get-dataset-items" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
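Because the sync endpoint returns the dataset items as a bare JSON array, the curl output can be piped straight into jq. A sketch using a sample item of the shape the pageFunction above produces; in practice, append the same filter to the curl command:

```shell
# Sample item in the shape returned by the pageFunction above.
ITEMS='[{"url":"https://news.ycombinator.com","title":"Hacker News"}]'
# Pull one line of "title <url>" per item.
TITLES=$(echo "$ITEMS" | jq -r '.[] | "\(.title) <\(.url)>"')
echo "$TITLES"
```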
⚠️ Important: {runId} below is a placeholder. Replace it with the actual run ID from your async run response (found in .data.id). See the complete workflow example below.
Poll the run status:
# Replace {runId} with actual ID like "HG7ML7M8z78YcAPEB"
curl -s "https://api.apify.com/v2/actor-runs/{runId}" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status'
Complete workflow example (capture run ID and check status):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10
}
Then run:
# Step 1: Start an async run and capture the run ID
RUN_ID=$(curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json | jq -r '.data.id')
# Step 2: Check the run status
curl -s "https://api.apify.com/v2/actor-runs/${RUN_ID}" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status'
Statuses: READY, RUNNING, SUCCEEDED, FAILED, ABORTED, TIMED-OUT
⚠️ Important: {datasetId} below is a placeholder; do not use it literally! Replace it with the actual dataset ID from your run response (found in .data.defaultDatasetId). See the complete workflow example below for how to capture and use the real ID.
Fetch results from a completed run:
# Replace {datasetId} with actual ID like "WkzbQMuFYuamGv3YF"
curl -s "https://api.apify.com/v2/datasets/{datasetId}/items" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
Complete workflow example (run async, wait, and fetch results):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10
}
Then run:
# Step 1: Start async run and capture IDs
RESPONSE=$(curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json)
RUN_ID=$(echo "$RESPONSE" | jq -r '.data.id')
DATASET_ID=$(echo "$RESPONSE" | jq -r '.data.defaultDatasetId')
# Step 2: Wait for completion (poll status)
while true; do
STATUS=$(curl -s "https://api.apify.com/v2/actor-runs/${RUN_ID}" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status')
echo "Status: $STATUS"
[[ "$STATUS" == "SUCCEEDED" ]] && break
[[ "$STATUS" == "FAILED" || "$STATUS" == "ABORTED" || "$STATUS" == "TIMED-OUT" ]] && exit 1
sleep 5
done
# Step 3: Fetch the dataset items
curl -s "https://api.apify.com/v2/datasets/${DATASET_ID}/items" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
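The polling loop above runs forever if the run never reaches a terminal state. A bounded variant, sketched here as a reusable function (the function name is illustrative), gives up after a fixed number of polls:

```shell
# Bounded variant of the polling loop: give up after max_polls attempts
# instead of looping forever.
wait_for_run() {
  local run_id="$1" max_polls="${2:-60}" status
  for _ in $(seq "$max_polls"); do
    status=$(curl -s "https://api.apify.com/v2/actor-runs/${run_id}" \
      --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status')
    case "$status" in
      SUCCEEDED) return 0 ;;
      FAILED|ABORTED|TIMED-OUT) return 1 ;;
    esac
    sleep 5
  done
  return 1  # still not finished after max_polls attempts
}
```

Usage: `wait_for_run "$RUN_ID" 60 && echo "finished"`.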
With pagination:
# Replace {datasetId} with actual ID
curl -s "https://api.apify.com/v2/datasets/{datasetId}/items?limit=100&offset=0" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
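For datasets larger than one page, the limit/offset pattern can be wrapped in a loop that advances the offset until a short page comes back. A sketch (the function name is illustrative):

```shell
# Page through a dataset with limit/offset until a page returns fewer
# than page_size items.
fetch_all_dataset_items() {
  local dataset_id="$1" page_size=100 offset=0 page count
  while :; do
    page=$(curl -s "https://api.apify.com/v2/datasets/${dataset_id}/items?limit=${page_size}&offset=${offset}" \
      --header "Authorization: Bearer $(printenv APIFY_TOKEN)")
    echo "$page" | jq -c '.[]'           # emit one item per line
    count=$(echo "$page" | jq 'length')
    [ "$count" -lt "$page_size" ] && break
    offset=$((offset + page_size))
  done
}
```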
Write to /tmp/apify_request.json:
{
"queries": "web scraping tools",
"maxPagesPerQuery": 1,
"resultsPerPage": 10
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items?timeout=120" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
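The output schema of google-search-scraper is not documented here; the field names below (organicResults, title, url) are assumptions based on typical output, so verify them against a real response. The same jq filter can then be appended to the curl command above:

```shell
# Sample of the assumed item shape; field names are assumptions, not
# confirmed from the Actor's documentation.
SAMPLE='[{"organicResults":[{"title":"Apify","url":"https://apify.com"}]}]'
RESULTS=$(echo "$SAMPLE" | jq -c '.[0].organicResults[] | {title, url}')
echo "$RESULTS"
```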
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://docs.example.com"}],
"maxCrawlPages": 10,
"crawlerType": "cheerio"
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~website-content-crawler/run-sync-get-dataset-items?timeout=300" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
Write to /tmp/apify_request.json:
{
"directUrls": ["https://www.instagram.com/apaborotnikov/"],
"resultsType": "posts",
"resultsLimit": 10
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~instagram-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
Write to /tmp/apify_request.json:
{
"categoryOrProductUrls": [{"url": "https://www.amazon.com/dp/B0BSHF7WHW"}],
"maxItemsPerStartUrl": 1
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/junglee~amazon-crawler/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
Get recent Actor runs:
curl -s "https://api.apify.com/v2/actor-runs?limit=10&desc=true" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq '.data.items[] | {id, actId, status, startedAt}'
⚠️ Important: {runId} below is a placeholder. Replace it with the actual run ID. See the complete workflow example below.
Stop a running Actor:
# Replace {runId} with actual ID like "HG7ML7M8z78YcAPEB"
curl -s -X POST "https://api.apify.com/v2/actor-runs/{runId}/abort" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
Complete workflow example (start a run and abort it):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 100
}
Then run:
# Step 1: Start an async run and capture the run ID
RUN_ID=$(curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json | jq -r '.data.id')
echo "Started run: $RUN_ID"
# Step 2: Abort the run
curl -s -X POST "https://api.apify.com/v2/actor-runs/${RUN_ID}/abort" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
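If a script starts expensive runs, it may be worth aborting them automatically when the script exits early. A sketch using a shell EXIT trap (assumes RUN_ID was captured as in step 1; if it was never set, the trap is a no-op):

```shell
# Abort the captured run (if any) whenever this script exits.
abort_run() {
  if [ -n "${RUN_ID:-}" ]; then
    curl -s -X POST "https://api.apify.com/v2/actor-runs/${RUN_ID}/abort" \
      --header "Authorization: Bearer $(printenv APIFY_TOKEN)" >/dev/null
  fi
}
trap abort_run EXIT
```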
Browse public Actors:
curl -s "https://api.apify.com/v2/store?limit=20&category=ECOMMERCE" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq '.data.items[] | {name, username, title}'
| Actor ID | Description |
|---|---|
| apify/web-scraper | General web scraper |
| apify/website-content-crawler | Crawl entire websites |
| apify/google-search-scraper | Google search results |
| apify/instagram-scraper | Instagram posts/profiles |
| junglee/amazon-crawler | Amazon products |
| apify/twitter-scraper | Twitter/X posts |
| apify/youtube-scraper | YouTube videos |
| apify/linkedin-scraper | LinkedIn profiles |
| lukaskrivka/google-maps | Google Maps places |
Find more at: https://apify.com/store
| Parameter | Type | Description |
|---|---|---|
| timeout | number | Run timeout in seconds |
| memory | number | Memory in MB (128, 256, 512, 1024, 2048, 4096) |
| maxItems | number | Max items to return (for sync endpoints) |
| build | string | Actor build tag (default: "latest") |
| waitForFinish | number | Seconds to wait for the run to finish (for async runs) |
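These parameters are passed as query-string options on the run endpoints. A sketch building such a URL (the option values are illustrative; the actual request is left commented out):

```shell
# Run options go in the query string of the run endpoint.
ACTOR="apify~web-scraper"
RUN_URL="https://api.apify.com/v2/acts/${ACTOR}/runs?timeout=300&memory=1024&waitForFinish=60"
echo "$RUN_URL"
# curl -s -X POST "$RUN_URL" \
#   --header "Authorization: Bearer $(printenv APIFY_TOKEN)" \
#   --header "Content-Type: application/json" \
#   -d @/tmp/apify_request.json
```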
Run object:
{
"data": {
"id": "HG7ML7M8z78YcAPEB",
"actId": "HDSasDasz78YcAPEB",
"status": "SUCCEEDED",
"startedAt": "2024-01-01T00:00:00.000Z",
"finishedAt": "2024-01-01T00:01:00.000Z",
"defaultDatasetId": "WkzbQMuFYuamGv3YF",
"defaultKeyValueStoreId": "tbhFDFDh78YcAPEB"
}
}
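The fields most workflows need can be pulled out of this object with jq. The sample JSON below mirrors the response shown above; in practice, pipe the curl output in instead:

```shell
# Extract the commonly used fields from a run object.
RUN_JSON='{"data":{"id":"HG7ML7M8z78YcAPEB","status":"SUCCEEDED","defaultDatasetId":"WkzbQMuFYuamGv3YF"}}'
RUN_ID=$(echo "$RUN_JSON" | jq -r '.data.id')
STATUS=$(echo "$RUN_JSON" | jq -r '.data.status')
DATASET_ID=$(echo "$RUN_JSON" | jq -r '.data.defaultDatasetId')
echo "$RUN_ID $STATUS $DATASET_ID"
```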
Tips: use run-sync-get-dataset-items for quick tasks (under 5 minutes) and asynchronous runs for longer jobs; use limit and offset when fetching large datasets.