Datadog Monitor Management: Best Practices for Creating, Managing, and Muting Alert Monitors

dd-monitors by datadog-labs/agent-skills

Install command

npx skills add https://github.com/datadog-labs/agent-skills --skill dd-monitors


Datadog Monitors

Create, manage, and maintain monitors for alerting.

Prerequisites

This requires Go or the pup binary in your path.

Install pup with Go:

go install github.com/datadog-labs/pup@latest

Ensure ~/go/bin is in $PATH.

Quick Start

pup auth login

Common Operations

List Monitors

pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"

Get Monitor

pup monitors get <id> --json

Create Monitor

pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"

Mute/Unmute

# Mute with duration
pup monitors mute --id 12345 --duration 1h

# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345

⚠️ Monitor Creation Best Practices

1. Avoid Alert Fatigue

Rule	Why
No flapping alerts	Use a longer evaluation window such as `last_5m`, not `last_1m`
Meaningful thresholds	Based on SLOs, not guesses
Actionable alerts	If no action needed, don't alert
Include runbook	`@runbook-url` in message

# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
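
To see why a longer window reduces flapping, here is a minimal simulation. It is only a sketch of a rolling-average check, not Datadog's evaluation engine; the function name and the sample readings are hypothetical.

```python
from statistics import mean

def count_alert_transitions(samples, window, threshold):
    """Count how often a rolling-average check flips between OK and ALERT."""
    transitions = 0
    alerting = False
    for i in range(window, len(samples) + 1):
        # Average the most recent `window` samples, like avg(last_Xm).
        now_alerting = mean(samples[i - window:i]) > threshold
        if now_alerting != alerting:
            transitions += 1
            alerting = now_alerting
    return transitions

# Hypothetical per-minute CPU readings: short spikes over a low baseline.
cpu = [60, 90, 62, 88, 61, 91, 63, 89, 60, 92]

print(count_alert_transitions(cpu, window=1, threshold=80))  # flips on every spike
print(count_alert_transitions(cpu, window=5, threshold=80))  # spikes are averaged out
```

With a 1-sample window every spike opens and closes an alert; with a 5-sample window the same data never crosses the threshold, while a sustained rise still would.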

2. Use Proper Scoping

# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
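
One way to make scoping hard to forget is to build queries through a small helper that refuses an empty scope. The helper below is a hypothetical sketch (its name and parameters are not part of pup or the Datadog API) and only covers the `avg` form used in these examples.

```python
def build_metric_query(metric, scope, threshold, window="5m", group_by=()):
    """Build an avg-over-window metric query string with a mandatory scope."""
    if not scope:
        # An empty scope would be equivalent to {*}, which alerts on everything.
        raise ValueError("refusing to build an unscoped query")
    scope_str = ",".join(f"{key}:{value}" for key, value in scope.items())
    group_str = " by {" + ",".join(group_by) + "}" if group_by else ""
    return (
        f"avg(last_{window}):avg:{metric}"
        + "{" + scope_str + "}"
        + group_str
        + f" > {threshold}"
    )

query = build_metric_query(
    "system.cpu.user",
    scope={"env": "prod", "service": "api"},
    threshold=80,
    group_by=("host",),
)
print(query)
```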

3. Set Recovery Thresholds

monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
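
The effect of a recovery threshold can be illustrated with a simplified state machine. This is a sketch under stated assumptions, not Datadog's actual evaluation logic: it assumes the monitor enters ALERT when the value exceeds `critical` and leaves ALERT only once the value is at or below the recovery threshold. The function name and sample values are hypothetical.

```python
def evaluate_states(values, critical, critical_recovery=None):
    """Return "OK" or "ALERT" for each evaluated value, with optional hysteresis."""
    # Without a recovery threshold, the monitor recovers as soon as it is
    # no longer above the critical threshold.
    recover_at = critical if critical_recovery is None else critical_recovery
    states = []
    alerting = False
    for value in values:
        if not alerting and value > critical:
            alerting = True
        elif alerting and value <= recover_at:
            alerting = False
        states.append("ALERT" if alerting else "OK")
    return states

# Hypothetical evaluated values hovering around the critical threshold of 80.
values = [82, 78, 81, 79, 83, 69, 75]

print(evaluate_states(values, critical=80))
print(evaluate_states(values, critical=80, critical_recovery=70))
```

Without the recovery threshold the monitor alerts three separate times; with `critical_recovery` set to 70 it stays in a single ALERT until the value genuinely drops.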

4. Include Context in Messages

message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""

⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (the same one used for dashboards):

def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
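
The function above can be exercised without touching a real account. The sketch below uses a hypothetical in-memory client that exposes only the two methods the function calls (`get_monitor` and `update_monitor`); a real client may look different. The function is repeated in compact form so the block runs on its own.

```python
class InMemoryMonitorClient:
    """Hypothetical stand-in for a Datadog client, for dry runs and tests."""

    def __init__(self, monitors):
        self._monitors = {str(m["id"]): dict(m) for m in monitors}

    def get_monitor(self, monitor_id):
        return dict(self._monitors[str(monitor_id)])

    def update_monitor(self, monitor_id, changes):
        self._monitors[str(monitor_id)].update(changes)


MARKER = "[MARKED FOR DELETION]"

def safe_mark_monitor_for_deletion(monitor_id, client):
    """Rename the monitor with a marker prefix instead of deleting it."""
    name = client.get_monitor(monitor_id).get("name", "")
    if MARKER in name:
        return False  # already marked, nothing to do
    client.update_monitor(monitor_id, {"name": f"{MARKER} {name}"})
    return True


client = InMemoryMonitorClient([{"id": 12345, "name": "High CPU on web servers"}])
first = safe_mark_monitor_for_deletion("12345", client)   # marks the monitor
second = safe_mark_monitor_for_deletion("12345", client)  # no-op on the second call
print(first, second, client.get_monitor("12345")["name"])
```

Because the second call is a no-op, the marking step is safe to re-run over the same list of monitors.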

Monitor Types

Type	Use Case
`metric alert`	CPU, memory, custom metrics
`query alert`	Complex metric queries
`service check`	Agent check status
`event alert`	Event stream patterns
`log alert`	Log pattern matching
`composite`	Combine multiple monitors
`apm`	APM metrics

Audit Monitors

# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# List the 10 monitors whose state changed most recently (a rough proxy for noisy monitors, not a true alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
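
The ownership audit can also be done in Python when jq is not available. This is a minimal sketch that assumes the JSON output is a list of objects with `id`, `name`, and `tags` fields, as the jq filters here imply; the sample data is hypothetical. It checks for tags that start with `team:`, which is slightly stricter than jq's substring-based `contains`.

```python
import json

def monitors_without_team(monitors):
    """Return id and name of every monitor that has no team:* tag."""
    unowned = []
    for monitor in monitors:
        tags = monitor.get("tags") or []  # tolerate a missing or null tag list
        if not any(tag.startswith("team:") for tag in tags):
            unowned.append({"id": monitor["id"], "name": monitor["name"]})
    return unowned

# Hypothetical output of `pup monitors list --json`.
sample = json.loads("""
[
  {"id": 1, "name": "High CPU", "tags": ["team:platform", "env:prod"]},
  {"id": 2, "name": "Disk full", "tags": ["env:prod"]},
  {"id": 3, "name": "Old check", "tags": null}
]
""")

print(monitors_without_team(sample))
```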

Downtime vs Muting

Use	When
Mute monitor	Quick one-off, < 1 hour
Downtime	Scheduled maintenance, recurring

# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
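
The `--start` and `--end` values are UTC timestamps, which are easy to get wrong by hand. As a small sketch, the hypothetical helper below derives both strings from a timezone-aware start time and a duration, using only the standard library.

```python
from datetime import datetime, timedelta, timezone

def downtime_window(start, hours):
    """Return (start, end) as UTC timestamps in the format used above."""
    if start.tzinfo is None:
        # Naive datetimes are ambiguous; require an explicit timezone.
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    end_utc = start_utc + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start_utc.strftime(fmt), end_utc.strftime(fmt)

start, end = downtime_window(datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc), hours=4)
print(start, end)
```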

Failure Handling

Problem	Fix
Alert not firing	Check that the query returns data and that the thresholds are correct
Too many alerts	Increase the evaluation window, add a recovery threshold
No data alerts	Check Agent connectivity and that the metric exists
Auth error	`pup auth refresh`
