dd-monitors by datadog-labs/agent-skills
npx skills add https://github.com/datadog-labs/agent-skills --skill dd-monitors
Create, manage, and maintain monitors for alerting.
This requires Go or the pup binary in your path.
Install pup with `go install github.com/datadog-labs/pup@latest`, then ensure `~/go/bin` is in `$PATH`.
pup auth login
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"
pup monitors get <id> --json
pup monitors create \
--name "High CPU on web servers" \
--type "metric alert" \
--query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
--message "CPU above 80% @slack-ops"
# Mute with duration
pup monitors mute --id 12345 --duration 1h
# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"
# Unmute
pup monitors unmute --id 12345
| Rule | How |
|---|---|
| No flapping alerts | Evaluate over several minutes (`last_5m` or longer), not `last_1m` |
| Meaningful thresholds | Base them on SLOs, not guesses |
| Actionable alerts | If no action is needed, don't alert |
| Include runbook | Reference the runbook in the message (`@runbook-url`) |
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50" # ❌ Too sensitive
# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80" # ✅ Reasonable window
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80" # ❌ No scope
# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80" # ✅
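The window and scope rules above can be checked mechanically before a monitor is created. The following is a minimal sketch: `lint_query` is a hypothetical helper, and both the 5-minute floor and the `{*}` check are assumptions taken from the examples above, not limits that Datadog enforces.

```python
import re

# Hypothetical lint helper; the thresholds here mirror the examples above
# and are assumptions, not Datadog-enforced rules.
_WINDOW = re.compile(r"\(last_(\d+)([mhd])\)")
_MINUTES = {"m": 1, "h": 60, "d": 1440}


def lint_query(query: str, min_window_minutes: int = 5) -> list[str]:
    """Return a list of warnings for a monitor query string."""
    warnings = []
    match = _WINDOW.search(query)
    if match is None:
        warnings.append("no evaluation window found")
    else:
        # Convert the window (e.g. last_1m, last_2h) to minutes.
        minutes = int(match.group(1)) * _MINUTES[match.group(2)]
        if minutes < min_window_minutes:
            warnings.append(
                f"window of {minutes}m is shorter than {min_window_minutes}m"
            )
    if "{*}" in query:
        warnings.append("query is unscoped ({*})")
    return warnings


bad = lint_query("avg(last_1m):avg:system.cpu.user{*} > 50")
good = lint_query(
    "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"
)
```

Here `bad` collects two warnings (short window, no scope) and `good` is empty. A check like this could run in CI against monitor definitions kept in version control.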
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
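To see why a recovery threshold reduces flapping, it can help to replay a few samples through a simplified state machine. This sketch is an illustrative model only: `simulate_alert_states` and the sample values are invented for the example, and it does not reproduce Datadog's actual evaluation logic.

```python
def simulate_alert_states(values, critical, critical_recovery=None):
    """Return "ALERT" or "OK" per sample, with an optional recovery threshold."""
    if critical_recovery is None:
        # No hysteresis: recover as soon as the value is back at or below critical.
        critical_recovery = critical
    states = []
    alerting = False
    for value in values:
        if not alerting and value > critical:
            alerting = True
        elif alerting and value <= critical_recovery:
            alerting = False
        states.append("ALERT" if alerting else "OK")
    return states


def count_transitions(states):
    """Count state changes between consecutive samples."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Made-up CPU samples hovering around the critical threshold of 80.
samples = [78, 82, 79, 81, 78, 83, 75, 69]
without_recovery = simulate_alert_states(samples, critical=80)
with_recovery = simulate_alert_states(samples, critical=80, critical_recovery=70)

print(count_transitions(without_recovery))  # 6 state changes
print(count_transitions(with_recovery))     # 2 state changes
```

With the same samples, the recovery threshold keeps the monitor in one continuous alert until the value drops to 70 or below, instead of toggling on every dip under 80.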
message = """
## High CPU Alert
Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}
### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed
@slack-ops @pagerduty-oncall
"""
Use the safe deletion workflow (the same as for dashboards): rename the monitor to mark it instead of deleting it outright.
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    # `client` is any object exposing get_monitor() and update_monitor().
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
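The marking function can be exercised without touching a real account. The sketch below assumes a hypothetical `InMemoryMonitorClient` that implements only the two methods the function calls; it is a test stand-in, not a Datadog client. The function is repeated (with the prints dropped) so the block runs on its own.

```python
class InMemoryMonitorClient:
    """Hypothetical stand-in exposing get_monitor() and update_monitor()."""

    def __init__(self, monitors):
        # Copy each record so the caller's data is not mutated.
        self._monitors = {mid: dict(m) for mid, m in monitors.items()}

    def get_monitor(self, monitor_id):
        return self._monitors[monitor_id]

    def update_monitor(self, monitor_id, fields):
        self._monitors[monitor_id].update(fields)


def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting (same logic as above, prints omitted)."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        return False
    client.update_monitor(monitor_id, {"name": f"[MARKED FOR DELETION] {name}"})
    return True


client = InMemoryMonitorClient({"12345": {"name": "High CPU on web servers"}})
first = safe_mark_monitor_for_deletion("12345", client)   # marks the monitor
second = safe_mark_monitor_for_deletion("12345", client)  # already marked
final_name = client.get_monitor("12345")["name"]
```

The second call returns `False` and leaves the name unchanged, so running the workflow repeatedly does not stack prefixes.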
| Type | Use Case |
|---|---|
| metric alert | CPU, memory, custom metrics |
| query alert | Complex metric queries |
| service check | Agent check status |
| event alert | Event stream patterns |
| log alert | Log pattern matching |
| composite | Combine multiple monitors |
| apm | APM metrics |
# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'
# Find possibly noisy monitors (most recent state change first; a rough proxy, not an alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
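The ownership audit can also be done in Python when the result feeds further tooling. This is a minimal sketch under assumptions: the sample records contain only the fields the jq filter reads, and the real output of `pup monitors list --json` may include more fields or a different envelope. `monitors_without_team` is a hypothetical helper name.

```python
import json


def monitors_without_team(monitors):
    """Return {id, name} for monitors with no tag starting with "team:"."""
    unowned = []
    for monitor in monitors:
        # Treat a missing or null tags field as an empty list.
        tags = monitor.get("tags") or []
        if not any(tag.startswith("team:") for tag in tags):
            unowned.append({"id": monitor.get("id"), "name": monitor.get("name")})
    return unowned


# Made-up sample data for illustration.
sample = json.loads("""
[
  {"id": 1, "name": "High CPU", "tags": ["team:platform", "env:prod"]},
  {"id": 2, "name": "Disk full", "tags": ["env:prod"]},
  {"id": 3, "name": "Orphan", "tags": null}
]
""")
unowned = monitors_without_team(sample)
```

Unlike the jq filter, which matches `team:` as a substring anywhere in a tag, this version requires the tag to start with `team:` and tolerates monitors whose `tags` field is null.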
| Use | When |
|---|---|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |
# Downtime (preferred)
pup downtime create \
--scope "env:prod" \
--monitor-tags "team:platform" \
--start "2024-01-15T02:00:00Z" \
--end "2024-01-15T06:00:00Z"
| Problem | Fix |
|---|---|
| Alert not firing | Check that the query returns data and that the thresholds are correct |
| Too many alerts | Increase the evaluation window and add a recovery threshold |
| No data alerts | Check agent connectivity and that the metric exists |
| Auth error | Run `pup auth refresh` |