desktop-computer-automation by web-infra-dev/midscene-skills
npx skills add https://github.com/web-infra-dev/midscene-skills --skill desktop-computer-automation
CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
- Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
- Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
- Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
- Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user, including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
Control your desktop (macOS, Windows, Linux) using npx @midscene/computer@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
What act can do: Inside a single act call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen 3.5
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCENE_MODEL_NAME="qwen/qwen3.5-plus"
# MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
Example: Doubao Seed 2.0 Lite
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-2-0-lite"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-seed"
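Any of the configurations above can be placed in a .env file in the current working directory, which Midscene loads automatically. A minimal sketch with placeholder values:

```shell
# Write a .env in the working directory; Midscene picks it up automatically.
# All values below are placeholders; substitute real credentials before use.
cat > .env <<'EOF'
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
EOF
```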
Commonly used models: Doubao Seed 2.0 Lite, Qwen 3.5, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
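Before starting a session, you can fail fast when the model is not configured. A hedged sketch using the variable names listed above (the helper function is hypothetical, and it checks only environment variables, not a .env file):

```shell
# Hypothetical helper: prints the first missing required variable and returns
# non-zero if the Midscene model settings are absent from the environment.
check_model_config() {
  for v in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
           MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "missing: $v (ask the user to configure the model)"
      return 1
    fi
  done
  echo "model configured"
}
```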
npx @midscene/computer@1 connect
npx @midscene/computer@1 connect --displayId <id>
npx @midscene/computer@1 list_displays
npx @midscene/computer@1 take_screenshot
After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.
Use act to interact with the computer and get the result. It autonomously handles all UI interactions internally — clicking, typing, scrolling, waiting, and navigating — so you should give it complex, high-level tasks as a whole rather than breaking them into small steps. Describe what you want to do and the desired effect in natural language:
# specific instructions
npx @midscene/computer@1 act --prompt "type hello world in the search field and press Enter"
npx @midscene/computer@1 act --prompt "drag the file icon to the Trash"
# or target-driven instructions
npx @midscene/computer@1 act --prompt "search for the weather in Shanghai using the Chrome browser, tell me the result"
npx @midscene/computer@1 disconnect
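The commands above can be strung together into one session. A dry-run sketch: the run helper only echoes each CLI invocation; a real session executes them one at a time, reading each screenshot before deciding the next step.

```shell
# Dry-run sketch of one session. 'run' echoes instead of executing, since a
# real run needs a configured model and an attached display.
run() { echo "npx @midscene/computer@1 $*"; }

run connect                                                # attach to the desktop
run take_screenshot                                        # health check: screenshot works
run act --prompt "move the mouse to a random position"     # health check: mouse works
run act --prompt "open the File menu and click New Window" # the actual task
run take_screenshot                                        # verify the result visually
run disconnect                                             # release the session
```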
Since CLI commands are stateless between invocations, follow this pattern:
connect, verify health, use act to perform the desired actions, then disconnect at the end.
1. Always run a health check first: After connecting, observe the output of the connect command. If connect already performed a health check (a screenshot and a mouse-movement test), no additional check is needed. If it did not, do one manually: take a screenshot and verify it succeeds, then move the mouse to a random position (act --prompt "move the mouse to a random position") and verify that succeeds too. If either step fails, stop and troubleshoot before continuing; only proceed once both checks pass without errors. This catches environment issues early.
2. Bring the target app to the foreground before using this skill: For best efficiency, bring the app to the foreground using other means (e.g., open -a <AppName> on macOS, start <AppName> on Windows) before invoking any midscene commands. Then take a screenshot to confirm the app is actually in the foreground. Only after visual confirmation should you proceed with UI automation using this skill. Avoid using Spotlight, Start menu search, or other launcher-based approaches through midscene: they involve transient UI and multiple AI inference steps, and are significantly slower.
3. Be specific about UI elements: Instead of vague descriptions, provide clear, specific details. Say "the red close button in the top-left corner of the Safari window" instead of "the close button".
4. Always report results after completion: After finishing the automation task, you MUST proactively present the results to the user without waiting for them to ask. This includes: (1) the answer to the user's original question or the outcome of the requested task, (2) key data extracted or observed during execution, (3) screenshots and other generated files with their paths, (4) a brief summary of steps taken. Do NOT silently finish after the last automation command; the user expects complete results in a single interaction.
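The foreground step from the tips above can be sketched per OS. The app name "Safari" is a placeholder, and the wmctrl fallback on Linux is an assumption, not part of the skill:

```shell
# Pick a foregrounding command for the current OS before any midscene call.
# "Safari" is a placeholder app name; wmctrl on Linux is an assumption.
app="Safari"
case "$(uname -s)" in
  Darwin)  fg_cmd="open -a $app" ;;
  Linux)   fg_cmd="wmctrl -a $app" ;;
  *)       fg_cmd="start $app" ;;   # Windows shells
esac
echo "$fg_cmd"   # run this, then take a screenshot to confirm visually
```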
Example — Context menu interaction:
npx @midscene/computer@1 act --prompt "right-click the file icon and select Delete from the context menu"
npx @midscene/computer@1 take_screenshot
Example — Dropdown menu:
npx @midscene/computer@1 act --prompt "open the File menu and click New Window"
npx @midscene/computer@1 take_screenshot
- Your terminal app does not have Accessibility access: grant it in System Settings > Privacy & Security > Accessibility (macOS), then rerun the command.
- Command-line developer tools missing: install them with xcode-select --install.
- Model authentication errors: check that the .env file contains MIDSCENE_MODEL_API_KEY=<your-key>.
system_profiler not found: If take_screenshot fails with an error like system_profiler: command not found, the PATH environment variable is likely incomplete. Fix it by running:
export PATH="/usr/sbin:/usr/bin:/bin:/sbin:$PATH"
Then retry the screenshot command.
Weekly Installs
1.4K
Repository
GitHub Stars
141
First Seen
Mar 6, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Warn · Snyk: Warn
Installed on
openclaw 1.1K
codex 1.0K
opencode 1.0K
cursor 1.0K
gemini-cli 1.0K
github-copilot 1.0K
5. Describe locations when possible: Help target elements by describing their position (e.g., "the icon in the top-right corner of the menu bar", "the third item in the left sidebar").
6. Never run in background: Every midscene command must run synchronously; background execution breaks the screenshot-analyze-act loop.
7. Check for multiple displays: If you launched an app but cannot see it in the screenshot, the app window may have opened on a different display. Use list_displays to check available displays. You have two options: either move the app window to the current display, or use connect --displayId <id> to switch to the display where the app is.
8. Batch related operations into a single act command: When performing consecutive operations within the same app, combine them into one act prompt instead of splitting them into separate commands. For example, "search for X, click the first result, and scroll down to see more details" should be a single act call, not three. This reduces round-trips, avoids unnecessary screenshot-analyze cycles, and is significantly faster.
9. Set up PATH before running (macOS): On macOS, some commands (e.g., system_profiler) may not be found if the PATH is incomplete. Before running any midscene commands, ensure the PATH includes the standard system directories:
export PATH="/usr/sbin:/usr/bin:/bin:/sbin:$PATH"
This prevents screenshot failures caused by missing system utilities.
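You can verify the fix took effect before retrying the screenshot command. A small sketch:

```shell
# Prepend the standard system directories, then confirm /usr/sbin is present.
export PATH="/usr/sbin:/usr/bin:/bin:/sbin:$PATH"
case ":$PATH:" in
  *:/usr/sbin:*) path_ok=1 ;;
  *)             path_ok=0 ;;
esac
echo "path_ok=$path_ok"   # prints: path_ok=1
```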