Ubuntu Desktop Automation Control Tool: GUI Element Location and Clicking Based on AT-SPI and OCR

ubuntu-desktop-control by lommaj/ubuntu-desktop-control

99 weekly installs

1 GitHub star

Install command

npx skills add https://github.com/lommaj/ubuntu-desktop-control --skill ubuntu-desktop-control

Automation, system administration, testing


Desktop Control Skill

Control the desktop GUI using semantic element targeting. Find and click UI elements by name instead of coordinates.

Key Features:

AT-SPI - Primary method, using the accessibility tree (knows element roles, states, actions)
OCR Fallback - Tesseract-based text finding when AT-SPI can't find the element
Wait Utilities - Poll for elements to appear, with exponential backoff
Click Verification - Optional pre-click OCR verification (--verify)
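
The AT-SPI-first, OCR-second lookup described above can be sketched as a small helper. This is a minimal illustration, not the skill's actual code: `find_with_fallback`, `fake_atspi`, and `fake_ocr` are hypothetical names, and the two finders are stand-ins that return canned results.

```python
from typing import Callable, Optional

# A finder takes a name and returns an element dict, or None if nothing matched.
Finder = Callable[[str], Optional[dict]]


def find_with_fallback(name: str, primary: Finder, fallback: Finder) -> Optional[dict]:
    """Try the primary finder first; consult the fallback only if it finds nothing."""
    hit = primary(name)
    if hit is not None:
        return {**hit, "source": "atspi"}
    hit = fallback(name)
    if hit is not None:
        return {**hit, "source": "ocr"}
    return None


# Stand-in finders for illustration only (hypothetical, not the skill's code).
def fake_atspi(name: str) -> Optional[dict]:
    return {"name": name, "role": "push button"} if name == "Confirm" else None


def fake_ocr(name: str) -> Optional[dict]:
    return {"name": name, "confidence": 95.2} if name == "Transaction submitted" else None


print(find_with_fallback("Confirm", fake_atspi, fake_ocr))
print(find_with_fallback("Transaction submitted", fake_atspi, fake_ocr))
```

The `source` field mirrors the `"source": "atspi"` and `"source": "ocr"` values shown in the command outputs on this page, so a caller can tell which path produced the match.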

Prerequisites

Install dependencies:

bash install.sh

Or manually:

# System packages
sudo apt-get install -y xdotool scrot imagemagick \
    at-spi2-core libatk-adaptor python3-gi gir1.2-atspi-2.0 \
    tesseract-ocr tesseract-ocr-eng python3-pip

# Python packages
pip3 install -r requirements.txt

For headless Xvfb sessions:

export GTK_MODULES=gail:atk-bridge
export QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1
/usr/lib/at-spi2-core/at-spi-bus-launcher &

Commands

All commands use DISPLAY=:10.0 by default. Override it with the --display flag.

find-element

Find UI element via AT-SPI with OCR fallback.

python3 scripts/desktop.py find-element --name "Confirm" [--role button] [--app Firefox]

Parameter	Type	Required	Description
`--name`, `-n`	string	No	Element name/text to find
`--role`, `-r`	string	No	Element role (button, entry, etc.)
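`--app`, `-a`	string	No	Application name filter
`--all`	flag	No	Find all matches
`--clickable`	flag	No	Only clickable elements
`--max-results`	int	No	Maximum number of results (default: 50)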

Returns:

{
  "element": {
    "name": "Confirm",
    "bounds": { "x": 400, "y": 300, "width": 100, "height": 30 },
    "center": { "x": 450, "y": 315 },
    "role": "push button",
    "source": "atspi",
    "visible": true,
    "enabled": true,
    "clickable": true
  }
}
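
As a rough sketch of how a caller might consume this output, the snippet below parses the sample JSON, recomputes the center from `bounds`, and checks the state flags before clicking. The helper names `center_of` and `is_safe_to_click` are hypothetical, and it is an assumption that `center` is simply the midpoint of `bounds`, which the sample values are consistent with.

```python
import json
from typing import Tuple

SAMPLE = """
{
  "element": {
    "name": "Confirm",
    "bounds": { "x": 400, "y": 300, "width": 100, "height": 30 },
    "center": { "x": 450, "y": 315 },
    "role": "push button",
    "source": "atspi",
    "visible": true,
    "enabled": true,
    "clickable": true
  }
}
"""


def center_of(bounds: dict) -> Tuple[int, int]:
    """Midpoint of a bounding box given as x, y, width, height."""
    return (bounds["x"] + bounds["width"] // 2, bounds["y"] + bounds["height"] // 2)


def is_safe_to_click(element: dict) -> bool:
    """Only click elements reported as visible, enabled, and clickable."""
    return all(element.get(flag) is True for flag in ("visible", "enabled", "clickable"))


element = json.loads(SAMPLE)["element"]
print(center_of(element["bounds"]), is_safe_to_click(element))
```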

find-text

Find text on screen via OCR only.

python3 scripts/desktop.py find-text "I have an existing wallet" [--exact] [--all]

Parameter	Type	Required	Description
`text`	string	Yes	Text to find
`--exact`	flag	No	Require exact match
`--case-sensitive`	flag	No	Case-sensitive matching
`--all`	flag	No	Find all occurrences
`--max-results`	int	No	Maximum number of results (default: 50)

Returns:

{
  "match": {
    "name": "I have an existing wallet",
    "bounds": { "x": 200, "y": 400, "width": 180, "height": 20 },
    "center": { "x": 290, "y": 410 },
    "source": "ocr",
    "confidence": 95.2
  }
}
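
Because OCR matches carry a `confidence` value, a caller may want to reject weak matches before acting on them. The sketch below shows one way to do that. The helper name `accept_match`, the threshold of 80, and the assumption that confidence is on a 0 to 100 scale are illustrative choices, not documented behavior of the skill.

```python
from typing import Optional


def accept_match(result: dict, min_confidence: float = 80.0) -> Optional[dict]:
    """Return the match if it exists and is confident enough, else None.

    Assumes confidence is reported on a 0-100 scale (an assumption here).
    """
    match = result.get("match")
    if not match:
        return None
    if match.get("source") == "ocr" and match.get("confidence", 0.0) < min_confidence:
        return None
    return match


sample = {
    "match": {
        "name": "I have an existing wallet",
        "bounds": {"x": 200, "y": 400, "width": 180, "height": 20},
        "center": {"x": 290, "y": 410},
        "source": "ocr",
        "confidence": 95.2,
    }
}
print(accept_match(sample))
```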

click-element

Click element by name/role (finds element first, then clicks at center).

python3 scripts/desktop.py click-element --name "Next" [--role button] [--verify]

Parameter	Type	Required	Description
`--name`, `-n`	string	No	Element name/text
`--role`, `-r`	string	No	Element role
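`--app`, `-a`	string	No	Application name filter
`--right`	flag	No	Right click
`--double`	flag	No	Double click
`--verify`	flag	No	OCR verification before clicking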

click-element requires at least one selector: --name or --role. When --verify is used, OCR must be available and text must be provided (typically via --name).
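
The selector rules above can be checked before the command is ever launched. The following sketch builds the argument list for `click-element` and enforces those rules. `click_element_argv` is a hypothetical helper, and requiring `--name` whenever `--verify` is set is a simplification of "typically via --name".

```python
from typing import List, Optional


def click_element_argv(
    name: Optional[str] = None,
    role: Optional[str] = None,
    app: Optional[str] = None,
    right: bool = False,
    double: bool = False,
    verify: bool = False,
) -> List[str]:
    """Build the click-element command line, validating the selector rules."""
    if not name and not role:
        raise ValueError("click-element needs at least one selector: --name or --role")
    if verify and not name:
        # Simplification: this sketch only accepts the verification text via --name.
        raise ValueError("--verify needs text to check, typically via --name")

    argv = ["python3", "scripts/desktop.py", "click-element"]
    if name:
        argv += ["--name", name]
    if role:
        argv += ["--role", role]
    if app:
        argv += ["--app", app]
    if right:
        argv.append("--right")
    if double:
        argv.append("--double")
    if verify:
        argv.append("--verify")
    return argv


print(click_element_argv(name="Next", role="button", verify=True))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting problems with names that contain spaces.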

Returns:

{
  "clicked": {
    "element": { "name": "Next", "..." },
    "x": 450,
    "y": 315,
    "button": "left",
    "double": false
  }
}

wait-for

Wait for element or text to appear (with timeout and exponential backoff).

python3 scripts/desktop.py wait-for --name "Success" --timeout 30
python3 scripts/desktop.py wait-for --text "Transaction complete" --timeout 60
python3 scripts/desktop.py wait-for --name "Loading" --gone --timeout 30

Parameter	Type	Required	Description
`--name`, `-n`	string	No	Element name (AT-SPI + OCR)
`--role`, `-r`	string	No	Element role (AT-SPI only)
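`--app`, `-a`	string	No	Application filter
`--text`, `-t`	string	No	Text to find (OCR only)
`--exact`	flag	No	Exact text match
`--gone`	flag	No	Wait until the element disappears
`--timeout`	float	No	Timeout in seconds (default: 30)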

Use either --text or element selectors (--name, --role, --app) for a single call, not both. For element waits (with or without --gone), provide at least one of --name or --role.

Returns:

{ "found": { "name": "Success", "..." } }
// or
{ "gone": true, "name": "Loading" }
// or
{ "error": "Element not found within 30s", "timeout": true }
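
For readers unfamiliar with polling under exponential backoff, here is a minimal sketch of the general technique. It is not the skill's implementation: the function name `poll_with_backoff` and the parameters `initial_delay`, `factor`, and `max_delay` (with their default values) are assumptions, since the page does not document the actual backoff schedule. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


def poll_with_backoff(
    probe: Callable[[], Optional[dict]],
    timeout: float = 30.0,
    initial_delay: float = 0.1,   # assumed starting delay
    factor: float = 2.0,          # assumed growth factor
    max_delay: float = 2.0,       # assumed cap on the delay
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call probe() until it returns a result or the timeout expires."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        result = probe()
        if result is not None:
            return {"found": result}
        remaining = deadline - clock()
        if remaining <= 0:
            return {"error": f"Element not found within {timeout:g}s", "timeout": True}
        # Never sleep past the deadline; grow the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * factor, max_delay)
```

Short early delays keep the wait responsive when the element appears quickly, while the cap stops the polling interval from growing so large that a late-appearing element is noticed only long after it shows up.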

list-elements

List all interactive elements (buttons, inputs, links, etc.).

python3 scripts/desktop.py list-elements [--app Firefox] [--role button]

Parameter	Type	Required	Description
`--app`, `-a`	string	No	Application name filter
`--role`, `-r`	string	No	Filter by role
`--include-hidden`	flag	No	Include hidden elements
`--max-results`	int	No	Maximum number of results (default: 100)

Returns:

{
  "elements": [
    { "name": "Sign In", "role": "push button", "..." },
    { "name": "Email", "role": "entry", "..." }
  ],
  "count": 2
}

status

Check AT-SPI and OCR availability.

python3 scripts/desktop.py status

Returns:

{
  "atspi": {
    "available": true,
    "applications": ["Firefox", "gnome-calculator"]
  },
  "ocr": { "available": true },
  "display": ":10.0"
}
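
A script can use this output to decide which commands are worth attempting. The sketch below summarizes a parsed `status` result. The helper names `readiness` and `app_is_exposed` are hypothetical, and the case-insensitive application match is an assumption about how one might want to compare names, not a description of the skill's `--app` filter.

```python
def readiness(status: dict) -> dict:
    """Summarize which lookup methods are usable according to a status result."""
    atspi = status.get("atspi", {})
    ocr = status.get("ocr", {})
    return {
        "semantic": bool(atspi.get("available")),
        "ocr": bool(ocr.get("available")),
        "apps": list(atspi.get("applications", [])),
    }


def app_is_exposed(status: dict, app: str) -> bool:
    """True if the application shows up in the AT-SPI application list."""
    apps = status.get("atspi", {}).get("applications", [])
    return app.lower() in (a.lower() for a in apps)


sample_status = {
    "atspi": {"available": True, "applications": ["Firefox", "gnome-calculator"]},
    "ocr": {"available": True},
    "display": ":10.0",
}
print(readiness(sample_status), app_is_exposed(sample_status, "firefox"))
```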

Original Commands

These coordinate-based commands are still available:

Command	Description
`screenshot [--output PATH]`	Take screenshot
`click X Y [--right] [--double]`	Click at coordinates
`type "TEXT" [--type-delay MS]`	Type text
`key "KEYS"`	Press key combination
`move X Y`	Move mouse
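`active`	Get active window info
`find-window "NAME"`	Find window by name
`focus "NAME"`	Focus window by name
`position`	Get mouse position
`windows`	List all windows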

Example Workflows

MetaMask Transaction (Semantic)

# Wait for and click Confirm button
python3 scripts/desktop.py wait-for --name "Confirm" --role button --timeout 30
python3 scripts/desktop.py click-element --name "Confirm" --role button

# Wait for success message
python3 scripts/desktop.py wait-for --text "Transaction submitted" --timeout 60

Phantom Wallet Import (Semantic)

# Click "I already have a wallet"
python3 scripts/desktop.py click-element --name "I already have a wallet"

# Wait for seed phrase input
python3 scripts/desktop.py wait-for --role entry --timeout 10

# Type seed phrase
python3 scripts/desktop.py type "word1 word2 word3..."

# Click Import
python3 scripts/desktop.py click-element --name "Import"

Hybrid Approach (Semantic + Coordinates)

# Use semantic for known buttons
python3 scripts/desktop.py click-element --name "Settings"

# Fall back to coordinates for unlabeled icons
python3 scripts/desktop.py screenshot --output /tmp/screen.png
# (analyze screenshot to get coordinates)
python3 scripts/desktop.py click 850 120
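
The hybrid approach can also be driven from a script. The sketch below tries `click-element` first and falls back to a coordinate `click` when the semantic attempt does not report a click. Several things here are assumptions: the helper names `run_desktop` and `click_semantic_or_xy` are hypothetical, the script is assumed to print JSON on stdout, and a failed `click-element` is assumed to return a result without a `clicked` key.

```python
import json
import subprocess
from typing import Callable, List, Tuple

Runner = Callable[[List[str]], dict]


def run_desktop(args: List[str]) -> dict:
    """Run scripts/desktop.py with the given arguments and parse its JSON output."""
    proc = subprocess.run(
        ["python3", "scripts/desktop.py", *args],
        capture_output=True,
        text=True,
    )
    try:
        return json.loads(proc.stdout)
    except ValueError:
        # Output was empty or not JSON; report it instead of raising.
        return {"error": proc.stderr.strip() or "no JSON output"}


def click_semantic_or_xy(name: str, fallback_xy: Tuple[int, int], run: Runner = run_desktop) -> dict:
    """Click by name if possible, otherwise click at known coordinates."""
    result = run(["click-element", "--name", name])
    if "clicked" in result:  # assumed success marker, as in the sample output
        return result
    x, y = fallback_xy
    return run(["click", str(x), str(y)])
```

Passing the runner as a parameter keeps the fallback logic testable without a live display, since a stub can stand in for the real command.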

Tips

Prefer semantic commands - click-element and wait-for are more robust than coordinates
Check status first - Run status to verify AT-SPI and OCR are available
Use --role for precision - Distinguish between buttons and text with the same name
Fall back to OCR - If AT-SPI doesn't expose an element, find-text uses OCR
Wait instead of sleep - wait-for is more reliable than fixed delays
Use --verify for critical clicks - Adds OCR verification before clicking

Repository

lommaj/ubuntu-d…-control

First Seen

Feb 10, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Warn
Snyk: Fail

Installed on

gemini-cli: 78
openclaw: 78
codex: 77
kimi-cli: 77
amp: 77
opencode: 77
