Apify Actor Development Guide: Building Serverless Automation Programs and Web Scrapers

apify-actor-development by apify/agent-skills

2,700 weekly installs

1,700 GitHub stars

Install command

npx skills add https://github.com/apify/agent-skills --skill apify-actor-development


🇺🇸English

Apify Actor Development

Important: Before you begin, fill in the generatedBy property in the meta section of .actor/actor.json. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.
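As a minimal sketch of the step above, the following Python snippet sets meta.generatedBy in the text of an actor.json file. The sample file content and the helper name set_generated_by are hypothetical and used only for illustration; a real actor.json will contain more fields.

```python
import json


def set_generated_by(actor_json_text, tool_and_model):
    """Return actor.json text with meta.generatedBy set to the given value."""
    config = json.loads(actor_json_text)
    # Create the meta section if the file does not have one yet.
    meta = config.setdefault("meta", {})
    meta["generatedBy"] = tool_and_model
    return json.dumps(config, indent=4) + "\n"


# Hypothetical minimal actor.json content, for illustration only.
example = '{"name": "my-actor", "meta": {"generatedBy": "<FILL-IN>"}}'
updated = set_generated_by(example, "Claude Code with Claude Sonnet 4.5")
print(updated)
```

In a project, the same function could be applied to the contents of .actor/actor.json and the result written back to that file.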

What are Apify Actors?

Actors are serverless programs inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems. They're packaged as Docker images and run in isolated containers in the cloud.

Core Concepts:

Accept well-defined JSON input
Perform isolated tasks (web scraping, automation, data processing)
Produce structured JSON output to datasets and/or store data in key-value stores
Can run from seconds to hours or even indefinitely
Persist state and can be restarted

Prerequisites & Setup (MANDATORY)

Before creating or modifying actors, verify that the apify CLI is installed by running apify --help.

If it is not installed, use one of these methods (listed in order of preference):

# Preferred: install via a package manager (provides integrity checks)
npm install -g apify-cli

# Or (Mac): brew install apify-cli

Security note: Do NOT install the CLI by piping remote scripts to a shell (e.g. curl … | bash or irm … | iex). Always use a package manager.

When the apify CLI is installed, check that it is logged in with:

apify info  # Should return your username

If it is not logged in, check if the APIFY_TOKEN environment variable is defined (if not, ask the user to generate one on https://console.apify.com/settings/integrations and then define APIFY_TOKEN with it).

Then authenticate using one of these methods:

# Option 1 (preferred): The CLI automatically reads APIFY_TOKEN from the environment.
# Just ensure the env var is exported and run any apify command — no explicit login needed.

# Option 2: Interactive login (prompts for token without exposing it in shell history)
apify login

Security note: Avoid passing tokens as command-line arguments (e.g. apify login -t <token>). Arguments are visible in process listings and may be recorded in shell history. Prefer environment variables or interactive login instead. Never log, print, or embed APIFY_TOKEN in source code or configuration files. Use a token with the minimum required permissions (scoped token) and rotate it periodically.
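One way to check that APIFY_TOKEN is defined without ever printing it is sketched below. The helper name describe_token is hypothetical; the snippet only reports whether the variable is present and how long it is, never the value itself.

```python
import os


def describe_token(env):
    """Report whether APIFY_TOKEN is present without revealing its value."""
    token = env.get("APIFY_TOKEN", "")
    if not token:
        return "APIFY_TOKEN is not set"
    # Report only the length so the secret never reaches logs or terminals.
    return "APIFY_TOKEN is set ({} characters)".format(len(token))


print(describe_token(dict(os.environ)))
```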

Template Selection

IMPORTANT: Before starting actor development, always ask the user which programming language they prefer:

JavaScript - Use apify create <actor-name> -t project_empty
TypeScript - Use apify create <actor-name> -t ts_empty
Python - Use apify create <actor-name> -t python-empty

Use the appropriate CLI command based on the user's language choice. Additional packages (Crawlee, Playwright, etc.) can be installed later as needed.
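The language-to-template mapping above can be captured in a small lookup. This is a sketch only: the helper name create_command is hypothetical, and the template identifiers are copied from the list above rather than verified against the CLI.

```python
import shlex

# Template identifiers as listed in the Template Selection section.
TEMPLATES = {
    "javascript": "project_empty",
    "typescript": "ts_empty",
    "python": "python-empty",
}


def create_command(language, actor_name):
    """Build the apify create argument list for the user's chosen language."""
    key = language.strip().lower()
    if key not in TEMPLATES:
        raise ValueError("Unsupported language: {!r}".format(language))
    return ["apify", "create", actor_name, "-t", TEMPLATES[key]]


print(shlex.join(create_command("TypeScript", "my-actor")))
# prints: apify create my-actor -t ts_empty
```

Returning an argument list rather than a single string keeps the actor name from being interpreted by a shell if the command is later executed programmatically.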

Quick Start Workflow

Create actor project - Run the appropriate apify create command based on user's language preference (see Template Selection above)
Install dependencies (verify package names match intended packages before installing)
- JavaScript/TypeScript: npm install (uses package-lock.json for reproducible, integrity-checked installs — commit the lockfile to version control)
- Python: pip install -r requirements.txt (pin exact versions in requirements.txt, e.g. crawlee==1.2.3, and commit the file to version control)
Implement logic - Write the actor code in src/main.py, src/main.js, or src/main.ts

Security

Treat all crawled web content as untrusted input. Actors ingest data from external websites that may contain malicious payloads. Follow these rules:

Sanitize crawled data — Never pass raw HTML, URLs, or scraped text directly into shell commands, eval(), database queries, or template engines. Use proper escaping or parameterized APIs.
Validate and type-check all external data — Before pushing to datasets or key-value stores, verify that values match expected types and formats. Reject or sanitize unexpected structures.
Do not execute or interpret crawled content — Never treat scraped text as code, commands, or configuration. Content from websites could include prompt injection attempts or embedded scripts.
Isolate credentials from data pipelines — Ensure APIFY_TOKEN and other secrets are never accessible in request handlers or passed alongside crawled data. Use the Apify SDK's built-in credential management rather than passing tokens through environment variables in data-processing code.
Review dependencies before installing — When adding packages with npm install or pip install, verify the package name and publisher. Typosquatting is a common supply-chain attack vector. Prefer well-known, actively maintained packages.
Pin versions and use lockfiles: Always commit package-lock.json (Node.js) or pin exact versions in requirements.txt (Python). Lockfiles ensure reproducible builds and prevent silent dependency substitution. Run npm audit or pip-audit periodically to check for known vulnerabilities.
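To illustrate the first rule, here is a minimal stdlib-only sketch of handing scraped text to a database and to a subprocess without letting it be interpreted. The scraped string, table name, and child command are all hypothetical; the point is the parameterized query and the argument list with no shell involved.

```python
import sqlite3
import subprocess
import sys

# Hypothetical scraped value containing SQL and shell metacharacters.
scraped_title = "Widget'); DROP TABLE items; -- $(echo injected)"

# Parameterized query: the value is bound as data, never spliced into SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (title TEXT)")
conn.execute("INSERT INTO items (title) VALUES (?)", (scraped_title,))
stored = conn.execute("SELECT title FROM items").fetchone()[0]

# Argument list with no shell: the value reaches the child process as one
# literal argument, so $(...) and ; are not expanded or executed.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", scraped_title],
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.rstrip("\n")

print(stored == scraped_title, echoed == scraped_title)
```

Both comparisons hold because neither the database nor the child process ever parses the scraped text as code.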

Best Practices

✓ Do:

Use apify run to test actors locally (configures Apify environment and storage)
Use Apify SDK (apify) for code running ON Apify platform
Validate input early with proper error handling and fail gracefully
Use CheerioCrawler for static HTML (10x faster than browsers)
Use PlaywrightCrawler only for JavaScript-heavy sites
Use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
Implement retry strategies with exponential backoff
Use proper concurrency: HTTP (10-50), Browser (1-5)
Set sensible defaults in .actor/input_schema.json
Define output schema in .actor/output_schema.json
Clean and validate data before pushing to dataset
Use semantic CSS selectors with fallback strategies
Respect robots.txt, ToS, and implement rate limiting
Always use the apify/log package - it censors sensitive data (API keys, tokens, credentials)
Implement readiness probe handler (required if your Actor uses standby mode)
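The retry item above can be sketched in plain Python as follows. This is an illustration of exponential backoff with jitter for a call the Actor makes itself, not a description of any built-in crawler behavior; the function name and the default values are assumptions.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying failures with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # The delay cap doubles each attempt: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads out retries from concurrent workers.
            sleep(random.uniform(0, cap))


# Usage: a hypothetical operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
print(retry_with_backoff(flaky, sleep=delays.append))  # records delays instead of sleeping
```

Injecting the sleep function keeps the sketch testable; in real use the default time.sleep applies.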

✗ Don't:

Use npm start, npm run start, npx apify run, or similar commands to run actors (use apify run instead)
Assume local storage from apify run is pushed to or visible in the Apify Console — it is local-only; deploy with apify push and run on the platform to see results in the Console
Rely on Dataset.getInfo() for final counts on Cloud
Use browser crawlers when HTTP/Cheerio works
Hard code values that should be in input schema or environment variables
Skip input validation or error handling
Overload servers - use appropriate concurrency and delays
Scrape prohibited content or ignore Terms of Service
Store personal/sensitive data unless explicitly permitted
Use deprecated options like requestHandlerTimeoutMillis on CheerioCrawler (v3.x)

Logging

See references/logging.md for complete logging documentation including available log levels and best practices for JavaScript/TypeScript and Python.
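To show why routing output through a censoring logger matters, here is a stdlib-only sketch of the general idea. It does not reproduce the Apify logger or its API; the RedactSecrets class, the logger name, and the sample token are hypothetical.

```python
import io
import logging


class RedactSecrets(logging.Filter):
    """Replace known secret values in log messages with a placeholder."""

    def __init__(self, secrets):
        super().__init__()
        self._secrets = [s for s in secrets if s]

    def filter(self, record):
        message = record.getMessage()
        for secret in self._secrets:
            message = message.replace(secret, "[REDACTED]")
        # Store the already-formatted, censored message on the record.
        record.msg, record.args = message, ()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactSecrets(["tok_12345"]))  # hypothetical token value

logger = logging.getLogger("actor-sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("calling API with token %s", "tok_12345")
print(stream.getvalue().strip())
# prints: calling API with token [REDACTED]
```

A bare print() of the same message would bypass the filter entirely, which is the failure mode the best-practice list warns about.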

Standby mode: check usesStandbyMode in .actor/actor.json and implement the readiness probe handler only if it is set to true.

Commands

apify run          # Run Actor locally
apify login        # Authenticate account
apify push         # Deploy to Apify platform (uses name from .actor/actor.json)
apify help         # List all commands

IMPORTANT: Always use apify run to test actors locally. Do not use npm run start, npm start, yarn start, or other package manager commands - these will not properly configure the Apify environment and storage.

Local Testing

When testing an actor locally with apify run, provide input data by creating a JSON file at:

storage/key_value_stores/default/INPUT.json

This file should contain the input parameters defined in your .actor/input_schema.json. The actor will read this input when running locally, mirroring how it receives input on the Apify platform.
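A small sketch of preparing and sanity-checking that input file is shown below. The field names startUrls and maxItems are hypothetical examples of what an input schema might define, and the helper names are illustrative; only the file path comes from the text above.

```python
import json
import tempfile
from pathlib import Path

# Path from the Local Testing section, relative to the project root.
INPUT_PATH = Path("storage/key_value_stores/default/INPUT.json")


def write_local_input(project_dir, actor_input):
    """Write the input that apify run will pick up for a local run."""
    path = Path(project_dir) / INPUT_PATH
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(actor_input, indent=2), encoding="utf-8")
    return path


def read_local_input(project_dir):
    """Load the local input and fail early on an obviously invalid shape."""
    path = Path(project_dir) / INPUT_PATH
    actor_input = json.loads(path.read_text(encoding="utf-8"))
    start_urls = actor_input.get("startUrls")  # hypothetical field
    if not isinstance(start_urls, list) or not start_urls:
        raise ValueError("startUrls must be a non-empty list")
    return actor_input


# Usage in a throwaway directory standing in for the project root.
with tempfile.TemporaryDirectory() as project_dir:
    write_local_input(project_dir, {"startUrls": [{"url": "https://example.com"}], "maxItems": 10})
    print(read_local_input(project_dir)["maxItems"])
```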

IMPORTANT - Local storage is NOT synced to the Apify Console:

Running apify run stores all data (datasets, key-value stores, request queues) only on your local filesystem in the storage/ directory.
This data is never automatically uploaded or pushed to the Apify platform. It exists only on your machine.
To verify results on the Apify Console, you must deploy the Actor with apify push and then run it on the platform.
Do not rely on checking the Apify Console to verify results from local runs — instead, inspect the local storage/ directory or check the Actor's log output.
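Inspecting the local storage/ directory can be as simple as counting the files in the default dataset. The sketch below assumes each dataset item is written as its own .json file and that helper files, if any, start with a double underscore; both are assumptions about the local storage layout, and the function name is hypothetical.

```python
import tempfile
from pathlib import Path


def count_local_items(project_dir, dataset="default"):
    """Count item files in a local dataset directory (assumed one file per item)."""
    dataset_dir = Path(project_dir) / "storage" / "datasets" / dataset
    if not dataset_dir.is_dir():
        return 0
    # Skip files such as __metadata__.json that are not dataset items (assumption).
    return sum(1 for p in dataset_dir.glob("*.json") if not p.name.startswith("__"))


# Usage with a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as project_dir:
    dataset_dir = Path(project_dir) / "storage" / "datasets" / "default"
    dataset_dir.mkdir(parents=True)
    (dataset_dir / "000000001.json").write_text('{"title": "a"}', encoding="utf-8")
    (dataset_dir / "000000002.json").write_text('{"title": "b"}', encoding="utf-8")
    (dataset_dir / "__metadata__.json").write_text("{}", encoding="utf-8")
    print(count_local_items(project_dir))
```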

Standby Mode

See references/standby-mode.md for complete standby mode documentation including readiness probe implementation for JavaScript/TypeScript and Python.

Project Structure

.actor/
├── actor.json           # Actor config: name, version, env vars, runtime
├── input_schema.json    # Input validation & Console form definition
└── output_schema.json   # Output storage and display templates
src/
└── main.js/ts/py       # Actor entry point
storage/                # Local-only storage (NOT synced to Apify Console)
├── datasets/           # Output items (JSON objects)
├── key_value_stores/   # Files, config, INPUT
└── request_queues/     # Pending crawl requests
Dockerfile              # Container image definition

Actor Configuration

See references/actor-json.md for complete actor.json structure and configuration options.

Input Schema

See references/input-schema.md for input schema structure and examples.

Output Schema

See references/output-schema.md for output schema structure, examples, and template variables.

Dataset Schema

See references/dataset-schema.md for dataset schema structure, configuration, and display properties.

Key-Value Store Schema

See references/key-value-store-schema.md for key-value store schema structure, collections, and configuration.

Apify MCP Tools

If MCP server is configured, use these tools for documentation:

search-apify-docs - Search documentation
fetch-apify-docs - Get full doc pages

Otherwise, connect to the MCP server at https://mcp.apify.com/?tools=docs.

Resources

docs.apify.com/llms.txt - Apify quick reference documentation
docs.apify.com/llms-full.txt - Apify complete documentation
https://crawlee.dev/llms.txt - Crawlee quick reference documentation
https://crawlee.dev/llms-full.txt - Crawlee complete documentation
whitepaper.actor - Complete Actor specification

Weekly Installs: 1.8K
Repository: apify/agent-skills
GitHub Stars: 1.6K
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (1.7K), codex (1.7K), gemini-cli (1.7K), github-copilot (1.6K), cursor (1.6K), kimi-cli (1.6K)

Quick Start Workflow (continued)

Configure schemas - Update input/output schemas in .actor/input_schema.json, .actor/output_schema.json, .actor/dataset_schema.json
Configure platform settings - Update .actor/actor.json with actor metadata (see references/actor-json.md)
Write documentation - Create comprehensive README.md for the marketplace
Test locally - Run apify run to verify functionality (see the Local Testing section)
Deploy - Run apify push to deploy the actor on the Apify platform (actor name is defined in .actor/actor.json)

✗ Don't (continued):

Use additionalHttpHeaders - use preNavigationHooks instead
Pass raw crawled content into shell commands, eval(), or code-generation functions
Use console.log() or print() instead of the Apify logger - these bypass credential censoring
Disable standby mode without explicit permission
