ubuntu-desktop-control by lommaj/ubuntu-desktop-control
npx skills add https://github.com/lommaj/ubuntu-desktop-control --skill ubuntu-desktop-control
Control the desktop GUI using semantic element targeting. Find and click UI elements by name instead of coordinates.
Key Features:
Install dependencies:
bash install.sh
Or manually:
# System packages
sudo apt-get install -y xdotool scrot imagemagick \
at-spi2-core libatk-adaptor python3-gi gir1.2-atspi-2.0 \
tesseract-ocr tesseract-ocr-eng python3-pip
# Python packages
pip3 install -r requirements.txt
For headless Xvfb sessions:
export GTK_MODULES=gail:atk-bridge
export QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1
/usr/lib/at-spi2-core/at-spi-bus-launcher &
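When GUI applications are launched from Python rather than from a shell, the same variables can be passed through the child environment. The sketch below is only an illustration: `headless_env` is a hypothetical helper name, the default display value mirrors the `:10.0` used in this document, and it assumes the AT-SPI bus launcher has already been started as shown above.

```python
import os


def headless_env(display=":10.0", base=None):
    """Return a copy of the environment with accessibility variables set.

    `base` defaults to the current process environment; passing a dict
    makes the helper easy to test without touching os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["DISPLAY"] = display
    # Same values as the shell exports for headless Xvfb sessions.
    env["GTK_MODULES"] = "gail:atk-bridge"
    env["QT_LINUX_ACCESSIBILITY_ALWAYS_ON"] = "1"
    return env


env = headless_env(base={"PATH": "/usr/bin"})
print(sorted(env))
```

The returned dict can be handed to `subprocess.Popen(..., env=env)` when starting an application inside the Xvfb session.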
All commands use DISPLAY=:10.0 by default. Override with the --display flag.
Find a UI element via AT-SPI, with OCR fallback.
python3 scripts/desktop.py find-element --name "Confirm" [--role button] [--app Firefox]
| Parameter | Type | Required | Description |
|---|---|---|---|
| --name, -n | string | No | Element name/text to find |
| --role, -r | string | No | Element role (button, entry, etc.) |
| --app, -a | string | No | Application name filter |
| --all | flag | No | Find all matches |
| --clickable | flag | No | Only clickable elements |
| --max-results | int | No | Maximum results (default: 50) |
Returns:
{
"element": {
"name": "Confirm",
"bounds": { "x": 400, "y": 300, "width": 100, "height": 30 },
"center": { "x": 450, "y": 315 },
"role": "push button",
"source": "atspi",
"visible": true,
"enabled": true,
"clickable": true
}
}
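A caller will usually want the click point from this result. The following is a minimal sketch that assumes the command prints exactly one JSON object shaped like the example above; `click_point` is a hypothetical helper, not part of the skill.

```python
import json


def click_point(output_text):
    """Return the (x, y) center from find-element output, or None."""
    result = json.loads(output_text)
    element = result.get("element")
    if not element:
        return None
    # Skip elements that are reported but cannot be acted on.
    if not (element.get("visible") and element.get("enabled")):
        return None
    center = element["center"]
    return center["x"], center["y"]


sample = '''{"element": {"name": "Confirm",
  "bounds": {"x": 400, "y": 300, "width": 100, "height": 30},
  "center": {"x": 450, "y": 315}, "role": "push button",
  "source": "atspi", "visible": true, "enabled": true, "clickable": true}}'''
print(click_point(sample))  # (450, 315)
```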
Find text on screen via OCR only.
python3 scripts/desktop.py find-text "I have an existing wallet" [--exact] [--all]
| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to find |
| --exact | flag | No | Require exact match |
| --case-sensitive | flag | No | Case-sensitive matching |
| --all | flag | No | Find all occurrences |
| --max-results | int | No | Maximum results (default: 50) |
Returns:
{
"match": {
"name": "I have an existing wallet",
"bounds": { "x": 200, "y": 400, "width": 180, "height": 20 },
"center": { "x": 290, "y": 410 },
"source": "ocr",
"confidence": 95.2
}
}
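Because OCR matches carry a confidence value, a caller may want to reject weak matches before acting on them. This sketch assumes the confidence is on a 0 to 100 scale (the 95.2 in the example suggests so, but the source does not say), and both `confident_center` and the threshold of 80 are illustrative choices.

```python
def confident_center(result, min_confidence=80.0):
    """Return the (x, y) center of an OCR match, or None if it is too weak."""
    match = result.get("match")
    if not match:
        return None
    # Assumed scale: 0-100, as the example value 95.2 suggests.
    if match.get("confidence", 0.0) < min_confidence:
        return None
    center = match["center"]
    return center["x"], center["y"]


result = {
    "match": {
        "name": "I have an existing wallet",
        "bounds": {"x": 200, "y": 400, "width": 180, "height": 20},
        "center": {"x": 290, "y": 410},
        "source": "ocr",
        "confidence": 95.2,
    }
}
print(confident_center(result))  # (290, 410)
```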
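Click an element by name/role (finds the element first, then clicks at its center).
python3 scripts/desktop.py click-element --name "Next" [--role button] [--verify]
| Parameter | Type | Required | Description |
|---|---|---|---|
| --name, -n | string | No | Element name/text |
| --role, -r | string | No | Element role |
| --app, -a | string | No | Application name filter |
| --right | flag | No | Right click |
| --double | flag | No | Double click |
| --verify | flag | No | OCR verify before click |
click-element requires at least one selector: --name or --role. When --verify is used, OCR must be available and text must be provided (typically via --name).
Returns: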
{
"clicked": {
"element": { "name": "Next", "..." },
"x": 450,
"y": 315,
"button": "left",
"double": false
}
}
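The selector rules for click-element can be checked before the command is ever run. Below is a sketch of such a check; `build_click_element_args` is a hypothetical helper that only assembles the argument list from the flags documented above, and it treats --name as the only source of verification text, which is a simplification.

```python
def build_click_element_args(name=None, role=None, app=None,
                             right=False, double=False, verify=False):
    """Build the click-element argument list, enforcing the selector rules."""
    if not name and not role:
        raise ValueError("click-element needs --name or --role")
    if verify and not name:
        # Simplification: this sketch only accepts text supplied via --name.
        raise ValueError("--verify needs text to check, typically via --name")
    args = ["click-element"]
    if name:
        args += ["--name", name]
    if role:
        args += ["--role", role]
    if app:
        args += ["--app", app]
    if right:
        args.append("--right")
    if double:
        args.append("--double")
    if verify:
        args.append("--verify")
    return args


print(build_click_element_args(name="Next", role="button", verify=True))
```

The resulting list can be appended to `["python3", "scripts/desktop.py"]` and passed to `subprocess.run`.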
Wait for an element or text to appear (with timeout and exponential backoff).
python3 scripts/desktop.py wait-for --name "Success" --timeout 30
python3 scripts/desktop.py wait-for --text "Transaction complete" --timeout 60
python3 scripts/desktop.py wait-for --name "Loading" --gone --timeout 30
| Parameter | Type | Required | Description |
|---|---|---|---|
| --name, -n | string | No | Element name (AT-SPI + OCR) |
| --role, -r | string | No | Element role (AT-SPI only) |
| --app, -a | string | No | Application filter |
| --text, -t | string | No | Text to find (OCR only) |
| --exact | flag | No | Exact text match |
| --gone | flag | No | Wait until the element disappears |
| --timeout | float | No | Timeout in seconds (default: 30) |
Use either --text or the element selectors (--name, --role, --app) in a single call, not both. For element waits (with or without --gone), provide at least one of --name or --role.
Returns:
{ "found": { "name": "Success", "..." } }
// or
{ "gone": true, "name": "Loading" }
// or
{ "error": "Element not found within 30s", "timeout": true }
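To make the "timeout and exponential backoff" behavior concrete, here is a generic polling loop that returns results in the same shapes as above. It is a sketch of the technique only: the script's actual delays are not documented in the source, so `initial_delay`, `factor`, and `max_delay` are assumed values, and `wait_with_backoff` is a hypothetical name.

```python
import time


def wait_with_backoff(probe, timeout=30.0, initial_delay=0.1, factor=2.0,
                      max_delay=2.0, clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns a truthy value or the timeout elapses.

    The pause between attempts grows geometrically (delay * factor) and is
    capped at max_delay, so early checks are quick and later ones are cheap.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        found = probe()
        if found:
            return {"found": found}
        remaining = deadline - clock()
        if remaining <= 0:
            return {"error": f"Element not found within {timeout:g}s",
                    "timeout": True}
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * factor, max_delay)
```

`clock` and `sleep` are injectable so the loop can be exercised in tests without real waiting.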
List all interactive elements (buttons, inputs, links, etc.).
python3 scripts/desktop.py list-elements [--app Firefox] [--role button]
| Parameter | Type | Required | Description |
|---|---|---|---|
| --app, -a | string | No | Application name filter |
| --role, -r | string | No | Filter by role |
| --include-hidden | flag | No | Include hidden elements |
| --max-results | int | No | Maximum results (default: 100) |
Returns:
{
"elements": [
{ "name": "Sign In", "role": "push button", "..." },
{ "name": "Email", "role": "entry", "..." }
],
"count": 2
}
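A listing like this is often easier to scan when grouped by role. The sketch below assumes the parsed output has the shape shown above; `names_by_role` is a hypothetical helper.

```python
from collections import defaultdict


def names_by_role(listing):
    """Group element names by their reported role."""
    grouped = defaultdict(list)
    for element in listing.get("elements", []):
        grouped[element.get("role", "unknown")].append(element.get("name", ""))
    return dict(grouped)


listing = {
    "elements": [
        {"name": "Sign In", "role": "push button"},
        {"name": "Email", "role": "entry"},
    ],
    "count": 2,
}
print(names_by_role(listing))
```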
Check AT-SPI and OCR availability.
python3 scripts/desktop.py status
Returns:
{
"atspi": {
"available": true,
"applications": ["Firefox", "gnome-calculator"]
},
"ocr": { "available": true },
"display": ":10.0"
}
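A script can use this output as a preflight check before issuing semantic commands. This is a minimal sketch that assumes the parsed status object has the shape shown above; `missing_backends` is a hypothetical helper.

```python
def missing_backends(status):
    """Return the names of backends that the status output reports as unavailable."""
    missing = []
    for backend in ("atspi", "ocr"):
        if not status.get(backend, {}).get("available", False):
            missing.append(backend)
    return missing


status = {
    "atspi": {"available": True, "applications": ["Firefox", "gnome-calculator"]},
    "ocr": {"available": True},
    "display": ":10.0",
}
print(missing_backends(status))  # []
```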
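These coordinate-based commands are still available:
| Command | Description |
|---|---|
| screenshot [--output PATH] | Take a screenshot |
| click X Y [--right] [--double] | Click at coordinates |
| type "TEXT" [--type-delay MS] | Type text |
| key "KEYS" | Press a key combination |
| move X Y | Move the mouse |
| active | Get active window info |
| find-window "NAME" | Find windows by name |
| focus "NAME" | Focus a window by name |
| position | Get the mouse position |
| windows | List all windows |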
# Wait for and click the Confirm button
python3 scripts/desktop.py wait-for --name "Confirm" --role button --timeout 30
python3 scripts/desktop.py click-element --name "Confirm" --role button
# Wait for the success message
python3 scripts/desktop.py wait-for --text "Transaction submitted" --timeout 60
# Click "I already have a wallet"
python3 scripts/desktop.py click-element --name "I already have a wallet"
# Wait for the seed phrase input
python3 scripts/desktop.py wait-for --role entry --timeout 10
# Type the seed phrase
python3 scripts/desktop.py type "word1 word2 word3..."
# Click Import
python3 scripts/desktop.py click-element --name "Import"
# Use semantic targeting for known buttons
python3 scripts/desktop.py click-element --name "Settings"
# Fall back to coordinates for unlabeled icons
python3 scripts/desktop.py screenshot --output /tmp/screen.png
# (analyze the screenshot to get coordinates)
python3 scripts/desktop.py click 850 120
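The hybrid pattern above can also be driven from Python. The sketch below rests on two assumptions that the source does not confirm: that every command prints a single JSON object on stdout, and that a failed click-element yields an object without a "clicked" key. `run_desktop` and `click_with_fallback` are hypothetical helper names.

```python
import json
import subprocess


def run_desktop(args, script="scripts/desktop.py"):
    """Run one desktop.py command and parse its stdout as JSON (assumed format)."""
    proc = subprocess.run(["python3", script, *args],
                          capture_output=True, text=True)
    return json.loads(proc.stdout)


def click_with_fallback(name, fallback_xy, run=run_desktop):
    """Try a semantic click first, then fall back to fixed coordinates."""
    result = run(["click-element", "--name", name])
    if "clicked" in result:
        return result
    # Assumption: no "clicked" key means the element was not found.
    x, y = fallback_xy
    return run(["click", str(x), str(y)])
```

A call such as `click_with_fallback("Settings", (850, 120))` mirrors the shell sequence above; the `run` parameter exists so the logic can be tested with a stub instead of a live display.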
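- click-element and wait-for are more robust than coordinates.
- Use status to verify that AT-SPI and OCR are available.
- find-text uses OCR.
- wait-for is more reliable than fixed delays.
Weekly Installs: 83
Repository: https://github.com/lommaj/ubuntu-desktop-control
First Seen: Feb 10, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Warn), Snyk (Fail)
Installed on: gemini-cli (78), openclaw (78), codex (77), kimi-cli (77), amp (77), opencode (77)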