desktop-computer-automation by web-infra-dev/midscene-skills
npx skills add https://github.com/web-infra-dev/midscene-skills --skill desktop-computer-automation
CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
- Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
- Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
- Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
- Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user, including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
Control your desktop (macOS, Windows, Linux) using npx @midscene/computer@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
What act can do: Inside a single act call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen 3.5
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCENE_MODEL_NAME="qwen/qwen3.5-plus"
# MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
Example: Doubao Seed 2.0 Lite
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-2-0-lite"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-seed"
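Any of the configurations above can be placed in a .env file in the current working directory, which Midscene loads automatically. A minimal sketch with placeholder values:

```shell
# Write a .env in the working directory; Midscene picks it up automatically.
# All values below are placeholders; substitute real credentials before use.
cat > .env <<'EOF'
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
EOF
```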
Commonly used models: Doubao Seed 2.0 Lite, Qwen 3.5, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
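Before starting a session, you can fail fast when the model is not configured. A hedged sketch using the variable names listed above (the helper function is hypothetical, and it checks only environment variables, not a .env file):

```shell
# Hypothetical helper: prints the first missing required variable and returns
# non-zero if the Midscene model settings are absent from the environment.
check_model_config() {
  for v in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
           MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "missing: $v (ask the user to configure the model)"
      return 1
    fi
  done
  echo "model configured"
}
```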
npx @midscene/computer@1 connect
npx @midscene/computer@1 connect --displayId <id>
npx @midscene/computer@1 list_displays
npx @midscene/computer@1 take_screenshot
After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.
Use act to interact with the computer and get the result. It autonomously handles all UI interactions internally — clicking, typing, scrolling, waiting, and navigating — so you should give it complex, high-level tasks as a whole rather than breaking them into small steps. Describe what you want to do and the desired effect in natural language:
# specific instructions
npx @midscene/computer@1 act --prompt "type hello world in the search field and press Enter"
npx @midscene/computer@1 act --prompt "drag the file icon to the Trash"
# or target-driven instructions
npx @midscene/computer@1 act --prompt "search for the weather in Shanghai using the Chrome browser, tell me the result"
npx @midscene/computer@1 disconnect
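The commands above can be strung together into one session. A dry-run sketch: the run helper only echoes each CLI invocation; a real session executes them one at a time, reading each screenshot before deciding the next step.

```shell
# Dry-run sketch of one session. 'run' echoes instead of executing, since a
# real run needs a configured model and an attached display.
run() { echo "npx @midscene/computer@1 $*"; }

run connect                                                # attach to the desktop
run take_screenshot                                        # health check: screenshot works
run act --prompt "move the mouse to a random position"     # health check: mouse works
run act --prompt "open the File menu and click New Window" # the actual task
run take_screenshot                                        # verify the result visually
run disconnect                                             # release the session
```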
Since CLI commands are stateless between invocations, follow this pattern:
connect, verify health, use act to perform the desired actions, then disconnect at the end.
1. Always run a health check first: After connecting, observe the output of the connect command. If connect already performed a health check (a screenshot and a mouse-movement test), no additional check is needed. If it did not, do one manually: take a screenshot and verify it succeeds, then move the mouse to a random position (act --prompt "move the mouse to a random position") and verify that succeeds too. If either step fails, stop and troubleshoot before continuing; only proceed once both checks pass without errors. This catches environment issues early.
2. Bring the target app to the foreground before using this skill: For best efficiency, bring the app to the foreground using other means (e.g., open -a <AppName> on macOS, start <AppName> on Windows) before invoking any midscene commands. Then take a screenshot to confirm the app is actually in the foreground. Only after visual confirmation should you proceed with UI automation using this skill. Avoid using Spotlight, Start menu search, or other launcher-based approaches through midscene: they involve transient UI and multiple AI inference steps, and are significantly slower.
3. Be specific about UI elements: Instead of vague descriptions, provide clear, specific details. Say "the red close button in the top-left corner of the Safari window" instead of "the close button".
4. Always report results after completion: After finishing the automation task, you MUST proactively present the results to the user without waiting for them to ask. This includes: (1) the answer to the user's original question or the outcome of the requested task, (2) key data extracted or observed during execution, (3) screenshots and other generated files with their paths, (4) a brief summary of steps taken. Do NOT silently finish after the last automation command; the user expects complete results in a single interaction.
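The foreground step from the tips above can be sketched per OS. The app name "Safari" is a placeholder, and the wmctrl fallback on Linux is an assumption, not part of the skill:

```shell
# Pick a foregrounding command for the current OS before any midscene call.
# "Safari" is a placeholder app name; wmctrl on Linux is an assumption.
app="Safari"
case "$(uname -s)" in
  Darwin)  fg_cmd="open -a $app" ;;
  Linux)   fg_cmd="wmctrl -a $app" ;;
  *)       fg_cmd="start $app" ;;   # Windows shells
esac
echo "$fg_cmd"   # run this, then take a screenshot to confirm visually
```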
Example — Context menu interaction:
npx @midscene/computer@1 act --prompt "right-click the file icon and select Delete from the context menu"
npx @midscene/computer@1 take_screenshot
Example — Dropdown menu:
npx @midscene/computer@1 act --prompt "open the File menu and click New Window"
npx @midscene/computer@1 take_screenshot
- Your terminal app does not have Accessibility access: grant it in System Settings > Privacy & Security > Accessibility (macOS), then rerun the command.
- Command-line developer tools missing: install them with xcode-select --install.
- Model authentication errors: check that the .env file contains MIDSCENE_MODEL_API_KEY=<your-key>.
system_profiler not found: If take_screenshot fails with an error like system_profiler: command not found, the PATH environment variable is likely incomplete. Fix it by running:
export PATH="/usr/sbin:/usr/bin:/bin:/sbin:$PATH"
Then retry the screenshot command.
Weekly Installs
1.4K
Repository
GitHub Stars
141
First Seen
Mar 6, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Warn · Snyk: Warn
Installed on
openclaw 1.1K
codex 1.0K
opencode 1.0K
cursor 1.0K
gemini-cli 1.0K
github-copilot 1.0K
5. Describe locations when possible: Help target elements by describing their position (e.g., "the icon in the top-right corner of the menu bar", "the third item in the left sidebar").
6. Never run in background: Every midscene command must run synchronously; background execution breaks the screenshot-analyze-act loop.
7. Check for multiple displays: If you launched an app but cannot see it in the screenshot, the app window may have opened on a different display. Use list_displays to check available displays. You have two options: either move the app window to the current display, or use connect --displayId <id> to switch to the display where the app is.
8. Batch related operations into a single act command: When performing consecutive operations within the same app, combine them into one act prompt instead of splitting them into separate commands. For example, "search for X, click the first result, and scroll down to see more details" should be a single act call, not three. This reduces round-trips, avoids unnecessary screenshot-analyze cycles, and is significantly faster.
9. Set up PATH before running (macOS): On macOS, some commands (e.g., system_profiler) may not be found if the PATH is incomplete. Before running any midscene commands, ensure the PATH includes the standard system directories:
export PATH="/usr/sbin:/usr/bin:/bin:/sbin:$PATH"
This prevents screenshot failures caused by missing system utilities.
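You can verify the fix took effect before retrying the screenshot command. A small sketch:

```shell
# Prepend the standard system directories, then confirm /usr/sbin is present.
export PATH="/usr/sbin:/usr/bin:/bin:/sbin:$PATH"
case ":$PATH:" in
  *:/usr/sbin:*) path_ok=1 ;;
  *)             path_ok=0 ;;
esac
echo "path_ok=$path_ok"   # prints: path_ok=1
```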