npx skills add https://github.com/vm0-ai/vm0-skills --skill apify
Web scraping and automation platform. Run pre-built Actors (scrapers) or create your own. Access thousands of ready-to-use scrapers for popular websites.
Official docs: https://docs.apify.com/api/v2
Use this skill when you need to run pre-built scrapers, crawl websites, or automate data collection from popular sites via the Apify API.
Set environment variable:
export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxxxxxxxxxx"
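Before making API calls, it can help to confirm the token is actually exported. A minimal sketch; the commented-out request against the /v2/users/me endpoint, which returns your account info for a valid token, is optional:

```shell
# Optional sanity check: make sure the token is set before any API call.
if [ -n "${APIFY_TOKEN:-}" ]; then
  TOKEN_STATE=set
  # curl -s "https://api.apify.com/v2/users/me" \
  #   --header "Authorization: Bearer ${APIFY_TOKEN}" | jq -r '.data.username'
else
  TOKEN_STATE=missing
fi
echo "APIFY_TOKEN is ${TOKEN_STATE}"
```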
Start an Actor run asynchronously:
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10,
"pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
The response contains id (the run ID) and defaultDatasetId for fetching results.
Wait for completion and get results directly (max 5 min):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://news.ycombinator.com"}],
"maxPagesPerCrawl": 1,
"pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/run-sync-get-dataset-items" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
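Because the sync endpoint returns the dataset items as a bare JSON array, the curl output can be piped straight into jq. A sketch using a sample item of the shape the pageFunction above produces; in practice, append the same filter to the curl command:

```shell
# Sample item in the shape returned by the pageFunction above.
ITEMS='[{"url":"https://news.ycombinator.com","title":"Hacker News"}]'
# Pull one line of "title <url>" per item.
TITLES=$(echo "$ITEMS" | jq -r '.[] | "\(.title) <\(.url)>"')
echo "$TITLES"
```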
⚠️ Important: {runId} below is a placeholder. Replace it with the actual run ID from your async run response (found in .data.id). See the complete workflow example below.
Poll the run status:
# Replace {runId} with actual ID like "HG7ML7M8z78YcAPEB"
curl -s "https://api.apify.com/v2/actor-runs/{runId}" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status'
Complete workflow example (capture run ID and check status):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10
}
Then run:
# Step 1: Start an async run and capture the run ID
RUN_ID=$(curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json | jq -r '.data.id')
# Step 2: Check the run status
curl -s "https://api.apify.com/v2/actor-runs/${RUN_ID}" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status'
Statuses: READY, RUNNING, SUCCEEDED, FAILED, ABORTED, TIMED-OUT
⚠️ Important: {datasetId} below is a placeholder; do not use it literally! Replace it with the actual dataset ID from your run response (found in .data.defaultDatasetId). See the complete workflow example below for how to capture and use the real ID.
Fetch results from a completed run:
# Replace {datasetId} with actual ID like "WkzbQMuFYuamGv3YF"
curl -s "https://api.apify.com/v2/datasets/{datasetId}/items" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
Complete workflow example (run async, wait, and fetch results):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10
}
Then run:
# Step 1: Start async run and capture IDs
RESPONSE=$(curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json)
RUN_ID=$(echo "$RESPONSE" | jq -r '.data.id')
DATASET_ID=$(echo "$RESPONSE" | jq -r '.data.defaultDatasetId')
# Step 2: Wait for completion (poll status)
while true; do
STATUS=$(curl -s "https://api.apify.com/v2/actor-runs/${RUN_ID}" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status')
echo "Status: $STATUS"
[[ "$STATUS" == "SUCCEEDED" ]] && break
[[ "$STATUS" == "FAILED" || "$STATUS" == "ABORTED" || "$STATUS" == "TIMED-OUT" ]] && exit 1
sleep 5
done
# Step 3: Fetch the dataset items
curl -s "https://api.apify.com/v2/datasets/${DATASET_ID}/items" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
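The polling loop above runs forever if the run never reaches a terminal state. A bounded variant, sketched here as a reusable function (the function name is illustrative), gives up after a fixed number of polls:

```shell
# Bounded variant of the polling loop: give up after max_polls attempts
# instead of looping forever.
wait_for_run() {
  local run_id="$1" max_polls="${2:-60}" status
  for _ in $(seq "$max_polls"); do
    status=$(curl -s "https://api.apify.com/v2/actor-runs/${run_id}" \
      --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq -r '.data.status')
    case "$status" in
      SUCCEEDED) return 0 ;;
      FAILED|ABORTED|TIMED-OUT) return 1 ;;
    esac
    sleep 5
  done
  return 1  # still not finished after max_polls attempts
}
```

Usage: `wait_for_run "$RUN_ID" 60 && echo "finished"`.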
With pagination:
# Replace {datasetId} with actual ID
curl -s "https://api.apify.com/v2/datasets/{datasetId}/items?limit=100&offset=0" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
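For datasets larger than one page, the limit/offset pattern can be wrapped in a loop that advances the offset until a short page comes back. A sketch (the function name is illustrative):

```shell
# Page through a dataset with limit/offset until a page returns fewer
# than page_size items.
fetch_all_dataset_items() {
  local dataset_id="$1" page_size=100 offset=0 page count
  while :; do
    page=$(curl -s "https://api.apify.com/v2/datasets/${dataset_id}/items?limit=${page_size}&offset=${offset}" \
      --header "Authorization: Bearer $(printenv APIFY_TOKEN)")
    echo "$page" | jq -c '.[]'           # emit one item per line
    count=$(echo "$page" | jq 'length')
    [ "$count" -lt "$page_size" ] && break
    offset=$((offset + page_size))
  done
}
```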
Write to /tmp/apify_request.json:
{
"queries": "web scraping tools",
"maxPagesPerQuery": 1,
"resultsPerPage": 10
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items?timeout=120" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
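The output schema of google-search-scraper is not documented here; the field names below (organicResults, title, url) are assumptions based on typical output, so verify them against a real response. The same jq filter can then be appended to the curl command above:

```shell
# Sample of the assumed item shape; field names are assumptions, not
# confirmed from the Actor's documentation.
SAMPLE='[{"organicResults":[{"title":"Apify","url":"https://apify.com"}]}]'
RESULTS=$(echo "$SAMPLE" | jq -c '.[0].organicResults[] | {title, url}')
echo "$RESULTS"
```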
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://docs.example.com"}],
"maxCrawlPages": 10,
"crawlerType": "cheerio"
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~website-content-crawler/run-sync-get-dataset-items?timeout=300" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
Write to /tmp/apify_request.json:
{
"directUrls": ["https://www.instagram.com/apaborotnikov/"],
"resultsType": "posts",
"resultsLimit": 10
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/apify~instagram-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
Write to /tmp/apify_request.json:
{
"categoryOrProductUrls": [{"url": "https://www.amazon.com/dp/B0BSHF7WHW"}],
"maxItemsPerStartUrl": 1
}
Then run:
curl -s -X POST "https://api.apify.com/v2/acts/junglee~amazon-crawler/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json
Get recent Actor runs:
curl -s "https://api.apify.com/v2/actor-runs?limit=10&desc=true" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq '.data.items[] | {id, actId, status, startedAt}'
⚠️ Important: {runId} below is a placeholder. Replace it with the actual run ID. See the complete workflow example below.
Stop a running Actor:
# Replace {runId} with actual ID like "HG7ML7M8z78YcAPEB"
curl -s -X POST "https://api.apify.com/v2/actor-runs/{runId}/abort" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
Complete workflow example (start a run and abort it):
Write to /tmp/apify_request.json:
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 100
}
Then run:
# Step 1: Start an async run and capture the run ID
RUN_ID=$(curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" --header "Content-Type: application/json" -d @/tmp/apify_request.json | jq -r '.data.id')
echo "Started run: $RUN_ID"
# Step 2: Abort the run
curl -s -X POST "https://api.apify.com/v2/actor-runs/${RUN_ID}/abort" --header "Authorization: Bearer $(printenv APIFY_TOKEN)"
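If a script starts expensive runs, it may be worth aborting them automatically when the script exits early. A sketch using a shell EXIT trap (assumes RUN_ID was captured as in step 1; if it was never set, the trap is a no-op):

```shell
# Abort the captured run (if any) whenever this script exits.
abort_run() {
  if [ -n "${RUN_ID:-}" ]; then
    curl -s -X POST "https://api.apify.com/v2/actor-runs/${RUN_ID}/abort" \
      --header "Authorization: Bearer $(printenv APIFY_TOKEN)" >/dev/null
  fi
}
trap abort_run EXIT
```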
Browse public Actors:
curl -s "https://api.apify.com/v2/store?limit=20&category=ECOMMERCE" --header "Authorization: Bearer $(printenv APIFY_TOKEN)" | jq '.data.items[] | {name, username, title}'
| Actor ID | Description |
|---|---|
| apify/web-scraper | General web scraper |
| apify/website-content-crawler | Crawl entire websites |
| apify/google-search-scraper | Google search results |
| apify/instagram-scraper | Instagram posts/profiles |
| junglee/amazon-crawler | Amazon products |
| apify/twitter-scraper | Twitter/X posts |
| apify/youtube-scraper | YouTube videos |
| apify/linkedin-scraper | LinkedIn profiles |
| lukaskrivka/google-maps | Google Maps places |
Find more at: https://apify.com/store
| Parameter | Type | Description |
|---|---|---|
| timeout | number | Run timeout in seconds |
| memory | number | Memory in MB (128, 256, 512, 1024, 2048, 4096) |
| maxItems | number | Max items to return (for sync endpoints) |
| build | string | Actor build tag (default: "latest") |
| waitForFinish | number | Seconds to wait for the run to finish (for async runs) |
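These parameters are passed as query-string options on the run endpoints. A sketch building such a URL (the option values are illustrative; the actual request is left commented out):

```shell
# Run options go in the query string of the run endpoint.
ACTOR="apify~web-scraper"
RUN_URL="https://api.apify.com/v2/acts/${ACTOR}/runs?timeout=300&memory=1024&waitForFinish=60"
echo "$RUN_URL"
# curl -s -X POST "$RUN_URL" \
#   --header "Authorization: Bearer $(printenv APIFY_TOKEN)" \
#   --header "Content-Type: application/json" \
#   -d @/tmp/apify_request.json
```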
Run object:
{
"data": {
"id": "HG7ML7M8z78YcAPEB",
"actId": "HDSasDasz78YcAPEB",
"status": "SUCCEEDED",
"startedAt": "2024-01-01T00:00:00.000Z",
"finishedAt": "2024-01-01T00:01:00.000Z",
"defaultDatasetId": "WkzbQMuFYuamGv3YF",
"defaultKeyValueStoreId": "tbhFDFDh78YcAPEB"
}
}
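The fields most workflows need can be pulled out of this object with jq. The sample JSON below mirrors the response shown above; in practice, pipe the curl output in instead:

```shell
# Extract the commonly used fields from a run object.
RUN_JSON='{"data":{"id":"HG7ML7M8z78YcAPEB","status":"SUCCEEDED","defaultDatasetId":"WkzbQMuFYuamGv3YF"}}'
RUN_ID=$(echo "$RUN_JSON" | jq -r '.data.id')
STATUS=$(echo "$RUN_JSON" | jq -r '.data.status')
DATASET_ID=$(echo "$RUN_JSON" | jq -r '.data.defaultDatasetId')
echo "$RUN_ID $STATUS $DATASET_ID"
```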
Tips: use run-sync-get-dataset-items for quick tasks (under 5 minutes) and asynchronous runs for longer jobs; use limit and offset when fetching large datasets.