monitoring-observability by ahmedasmar/devops-claude-skills
npx skills add https://github.com/ahmedasmar/devops-claude-skills --skill monitoring-observability
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
├─ YES → Go to "9. Troubleshooting & Analysis"
└─ NO → Are you improving existing monitoring?
├─ Alerts → Go to "3. Alert Design"
├─ Dashboards → Go to "4. Dashboard & Visualization"
├─ SLOs → Go to "5. SLO & Error Budgets"
├─ Tool selection → Read references/tool_comparison.md
└─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
Every service should monitor:
For request-driven services, use the RED Method:
For infrastructure resources, use the USE Method:
Quick Start - Web Application Example:
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
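The same three signals can also be computed offline from raw request records; a minimal pure-Python sketch (the (status, duration) record format is a hypothetical example, not part of this skill's scripts):

```python
# Offline computation of the three RED signals from raw request records.
# Record format (status_code, duration_seconds) is a hypothetical example.
def red_signals(records, window_seconds):
    n = len(records)
    rate = n / window_seconds                       # Rate: requests/sec
    errors = sum(1 for status, _ in records if 500 <= status < 600)
    error_pct = 100.0 * errors / n if n else 0.0    # Errors: % of requests
    durations = sorted(d for _, d in records)
    p95 = (durations[min(len(durations) - 1, int(0.95 * len(durations)))]
           if durations else 0.0)                   # Duration: p95 latency
    return rate, error_pct, p95

requests = [(200, 0.05)] * 95 + [(500, 0.30)] * 5   # 5% server errors
print(red_signals(requests, window_seconds=10))     # (10.0, 5.0, 0.3)
```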
For comprehensive metric design guidance including:
→ Read: references/metrics_design.md
Detect anomalies and trends in your metrics:
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
--endpoint http://localhost:9090 \
--query 'rate(http_requests_total[5m])' \
--hours 24
# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
--namespace AWS/EC2 \
--metric CPUUtilization \
--dimensions InstanceId=i-1234567890abcdef0 \
--hours 48
→ Script: scripts/analyze_metrics.py
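Under the hood, this kind of anomaly detection often reduces to flagging points far from the series mean; a minimal z-score sketch (an illustrative approach, not necessarily the script's actual algorithm):

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    # Flag indices whose value lies more than `threshold` standard
    # deviations from the mean of the series.
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

series = [100, 102, 98, 101, 99, 100, 400, 101]  # one obvious spike
print(zscore_anomalies(series, threshold=2.0))   # [6]
```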
Every log entry should include:
Example structured log (JSON):
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
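An entry like this can be emitted with the standard library alone; a minimal sketch (field names follow the example above; `log_json` is a hypothetical helper):

```python
import json
from datetime import datetime, timezone

def log_json(level, message, **fields):
    # Build one structured log entry and emit it as a single JSON line.
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "message": message,
        **fields,
    }
    print(json.dumps(entry))
    return entry

log_json("error", "Payment processing failed",
         service="payment-service", error_type="GatewayTimeout", duration_ms=5000)
```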
ELK Stack (Elasticsearch, Logstash, Kibana):
Grafana Loki:
CloudWatch Logs:
Analyze logs for errors, patterns, and anomalies:
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log
# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors
# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
→ Script: scripts/log_analyzer.py
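The pattern grouping such an analyzer performs can be sketched with a regex and a Counter (illustrative only, not the script's real implementation):

```python
import re
from collections import Counter

def top_error_patterns(lines, n=3):
    # Normalize numbers/ids so similar errors group together, then count.
    counts = Counter()
    for line in lines:
        if "ERROR" not in line:
            continue
        pattern = re.sub(r"\d+", "<N>", line.strip())
        counts[pattern] += 1
    return counts.most_common(n)

logs = [
    "2024-10-28 ERROR timeout calling payment gateway after 5000 ms",
    "2024-10-28 ERROR timeout calling payment gateway after 3000 ms",
    "2024-10-28 INFO request completed",
    "2024-10-28 ERROR user 42 not found",
]
print(top_error_patterns(logs))
```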
For comprehensive logging guidance including:
→ Read : references/logging_guide.md
| Severity | Response Time | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Alert when error budget is consumed too quickly:
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
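The 14.4 and 6 thresholds follow from error-budget arithmetic: a burn rate of 14.4 spends 2% of a 30-day budget in one hour, and a rate of 6 spends 5% in six hours. A quick check:

```python
def burn_rate_threshold(budget_fraction, window_hours, period_days=30):
    # Burn rate at which `budget_fraction` of the period's error budget
    # is consumed within `window_hours`.
    period_hours = period_days * 24
    return budget_fraction * period_hours / window_hours

# Common multiwindow choices: 2% of budget in 1h, 5% of budget in 6h
print(burn_rate_threshold(0.02, 1))   # 14.4
print(burn_rate_threshold(0.05, 6))   # 6.0
```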
Audit your alert rules against best practices:
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml
# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
Checks for:
→ Script: scripts/alert_quality_checker.py
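The checks such an auditor performs are mechanical; a minimal sketch over one already-parsed rule (hypothetical rule dict and rule list, not the script's actual checks):

```python
def audit_alert_rule(rule):
    # Flag common quality problems in a Prometheus alerting rule.
    problems = []
    if rule.get("labels", {}).get("severity") not in {"critical", "warning", "info"}:
        problems.append("missing or invalid severity label")
    if "runbook_url" not in rule.get("annotations", {}):
        problems.append("no runbook_url annotation")
    if "for" not in rule:
        problems.append("no 'for' duration (alert may flap)")
    return problems

rule = {
    "alert": "HighErrorRate",
    "expr": "error_rate > 0.05",
    "labels": {"severity": "critical"},
    "annotations": {},
}
print(audit_alert_rule(rule))
```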
Production-ready alert rule templates:
→ Templates:
For comprehensive alerting guidance including:
→ Read: references/alerting_best_practices.md
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
┌─────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────┘
Automatically generate dashboards from templates:
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
--title "My API Dashboard" \
--service my_api \
--output dashboard.json
# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
--title "K8s Production" \
--namespace production \
--output k8s-dashboard.json
# Database dashboard
python3 scripts/dashboard_generator.py database \
--title "PostgreSQL" \
--db-type postgres \
--instance db.example.com:5432 \
--output db-dashboard.json
Supports:
→ Script: scripts/dashboard_generator.py
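The generated output is ordinary Grafana dashboard JSON; for orientation, a minimal hand-built skeleton (one hypothetical panel; the real generator's output will differ):

```python
import json

# Minimal Grafana-style dashboard JSON with a single timeseries panel.
dashboard = {
    "title": "My API Dashboard",
    "refresh": "30s",
    "panels": [
        {
            "title": "Request Rate",
            "type": "timeseries",
            "targets": [
                {"expr": 'sum(rate(http_requests_total{service="my_api"}[5m]))'}
            ],
        }
    ],
}
print(json.dumps(dashboard, indent=2))
```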
SLI (Service Level Indicator): Measurement of service quality
SLO (Service Level Objective): Target value for an SLI
Error Budget: Allowed failure amount = (100% - SLO)
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |
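The downtime column follows directly from the SLO; a quick check of the table (values in minutes; 432 min = 7.2 h):

```python
def downtime_per_month(slo_percent, days=30):
    # Allowed downtime in minutes for a given availability SLO
    # over a 30-day month.
    return (1 - slo_percent / 100) * days * 24 * 60

for slo in (99.0, 99.9, 99.95, 99.99):
    print(slo, round(downtime_per_month(slo), 1), "minutes")
```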
Calculate compliance, error budgets, and burn rates:
# Show SLO reference table
python3 scripts/slo_calculator.py --table
# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
--slo 99.9 \
--total-requests 1000000 \
--failed-requests 1500 \
--period-days 30
# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
--slo 99.9 \
--errors 50 \
--requests 10000 \
--window-hours 1
→ Script: scripts/slo_calculator.py
For comprehensive SLO/SLA guidance including:
→ Read: references/slo_sla_guide.md
Use distributed tracing when you need to:
Python example:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
Error-based sampling (always sample errors, 1% of successes):
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        if trace_id & 0xFF < 3:  # low byte < 3 -> ~1% of successes
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
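A sanity check on the mask above: `trace_id & 0xFF < 3` keeps 3 of the 256 possible low-byte values, i.e. about 1.2% of successful requests:

```python
# Count how many 8-bit trace-id suffixes pass the sampler's mask check.
passing = sum(1 for low_byte in range(256) if low_byte & 0xFF < 3)
print(passing, f"{passing / 256:.2%}")  # 3 1.17%
```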
Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
For comprehensive tracing guidance including:
→ Read: references/tracing_guide.md
If your Datadog bill is growing out of control, start by identifying waste:
Automatically analyze your Datadog usage and find cost optimization opportunities:
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY
# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY \
--show-details
What it checks:
→ Script: scripts/datadog_cost_analyzer.py
1. Custom Metrics Optimization (typical savings: 20-40%):
2. Log Management (typical savings: 30-50%):
3. APM Optimization (typical savings: 15-25%):
4. Infrastructure Monitoring (typical savings: 10-20%):
If you're considering migrating to a more cost-effective open-source stack:
From Datadog → To Open Source Stack:
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for a 100-host environment)
Phase 1: Run Parallel (Month 1-2):
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
Phase 3: Migrate Logs & Traces (Month 3-4):
Phase 4: Decommission Datadog (Month 4-5):
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
Quick examples:
# Average CPU
Datadog: avg:system.cpu.user{*}
Prometheus: avg(node_cpu_seconds_total{mode="user"})
# Request rate
Datadog: sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))
# P95 latency
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate percentage
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
→ Full Translation Guide: references/dql_promql_translation.md
Example: 100-host infrastructure
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|---|---|---|---|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| Total | $79,800 | $18,000 | $61,800 (77%) |
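The table's totals reduce to simple arithmetic; a quick check:

```python
# Annual costs from the comparison table above (USD).
datadog = {"infrastructure": 18000, "custom_metrics": 600, "logs": 24000, "apm": 37200}
open_source = {"infrastructure": 10000, "custom_metrics": 0, "logs": 3000, "apm": 5000}

dd_total = sum(datadog.values())          # 79,800
oss_total = sum(open_source.values())     # 18,000
savings = dd_total - oss_total            # 61,800
pct = round(100 * savings / dd_total)     # 77
print(dd_total, oss_total, savings, f"{pct}%")
```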
For comprehensive migration guidance including:
→ Read: references/datadog_migration.md
Choose Prometheus + Grafana if:
Choose Datadog if:
Choose Grafana Stack (LGTM) if:
Choose ELK Stack if:
Choose Cloud Native (CloudWatch, etc.) if:
| Solution | Monthly Cost | Setup | Ops Burden |
|---|---|---|---|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
For comprehensive tool comparison including:
→ Read: references/tool_comparison.md
Validate health check endpoints against best practices:
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health
# Check multiple endpoints
python3 scripts/health_check_validator.py \
https://api.example.com/health \
https://api.example.com/readiness \
--verbose
Checks for:
→ Script: scripts/health_check_validator.py
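An endpoint that passes such checks typically returns structured JSON with per-dependency results; a minimal sketch of the response logic (dependency names are hypothetical):

```python
import json

def health_response(checks):
    # Aggregate per-dependency boolean checks into one health payload.
    # The HTTP status should be 200 only when every dependency is healthy.
    healthy = all(checks.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": {name: ("ok" if ok else "failed") for name, ok in checks.items()},
    }
    return (200 if healthy else 503), body

status, body = health_response(
    {"database": True, "cache": True, "payment_gateway": False}
)
print(status, json.dumps(body))
```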
High Latency Investigation:
High Error Rate Investigation:
Service Down Investigation:
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Check pod status
kubectl get pods -n <namespace>
# View pod logs
kubectl logs -f <pod-name> -n <namespace>
# Check pod resources
kubectl top pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
Elasticsearch:
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
Loki (LogQL):
{job="app", level="error"} |= "error" | json
CloudWatch Insights:
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
Scripts:
- analyze_metrics.py - Detect anomalies in Prometheus/CloudWatch metrics
- alert_quality_checker.py - Audit alert rules against best practices
- slo_calculator.py - Calculate SLO compliance and error budgets
- log_analyzer.py - Parse logs for errors and patterns
- dashboard_generator.py - Generate Grafana dashboards from templates
- health_check_validator.py - Validate health check endpoints
- datadog_cost_analyzer.py - Analyze Datadog usage and find cost waste

References:
- metrics_design.md - Four Golden Signals, RED/USE methods, metric types
- alerting_best_practices.md - Alert design, runbooks, on-call practices
- logging_guide.md - Structured logging, aggregation patterns
- tracing_guide.md - OpenTelemetry, distributed tracing
- slo_sla_guide.md - SLI/SLO/SLA definitions, error budgets
- tool_comparison.md - Comprehensive comparison of monitoring tools
- datadog_migration.md - Complete guide for migrating from Datadog to OSS stack
- dql_promql_translation.md - Datadog Query Language to PromQL translation reference

Templates:
- prometheus-alerts/webapp-alerts.yml - Production-ready web app alerts
- prometheus-alerts/kubernetes-alerts.yml - Kubernetes monitoring alerts
- otel-config/collector-config.yaml - OpenTelemetry Collector configuration
- runbooks/incident-runbook-template.md - Incident response template

Weekly Installs
137
Repository
GitHub Stars
89
First Seen
Jan 23, 2026
Security Audits
Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass
Installed on
opencode: 110
claude-code: 100
gemini-cli: 97
codex: 92
cursor: 89
github-copilot: 88