monitoring-observability by ahmedasmar/devops-claude-skills
npx skills add https://github.com/ahmedasmar/devops-claude-skills --skill monitoring-observability
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
├─ YES → Go to "9. Troubleshooting & Analysis"
└─ NO → Are you improving existing monitoring?
├─ Alerts → Go to "3. Alert Design"
├─ Dashboards → Go to "4. Dashboard & Visualization"
├─ SLOs → Go to "5. SLO & Error Budgets"
├─ Tool selection → Read references/tool_comparison.md
└─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
Every service should monitor:
For request-driven services, use the RED Method:
For infrastructure resources, use the USE Method:
Quick Start - Web Application Example:
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
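The same three signals can also be computed offline from raw request records; a minimal pure-Python sketch (the (status, duration) record format is a hypothetical example, not part of this skill's scripts):

```python
# Offline computation of the three RED signals from raw request records.
# Record format (status_code, duration_seconds) is a hypothetical example.
def red_signals(records, window_seconds):
    n = len(records)
    rate = n / window_seconds                       # Rate: requests/sec
    errors = sum(1 for status, _ in records if 500 <= status < 600)
    error_pct = 100.0 * errors / n if n else 0.0    # Errors: % of requests
    durations = sorted(d for _, d in records)
    p95 = (durations[min(len(durations) - 1, int(0.95 * len(durations)))]
           if durations else 0.0)                   # Duration: p95 latency
    return rate, error_pct, p95

requests = [(200, 0.05)] * 95 + [(500, 0.30)] * 5   # 5% server errors
print(red_signals(requests, window_seconds=10))     # (10.0, 5.0, 0.3)
```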
For comprehensive metric design guidance including:
→ Read: references/metrics_design.md
Detect anomalies and trends in your metrics:
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
--endpoint http://localhost:9090 \
--query 'rate(http_requests_total[5m])' \
--hours 24
# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
--namespace AWS/EC2 \
--metric CPUUtilization \
--dimensions InstanceId=i-1234567890abcdef0 \
--hours 48
→ Script: scripts/analyze_metrics.py
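Under the hood, this kind of anomaly detection often reduces to flagging points far from the series mean; a minimal z-score sketch (an illustrative approach, not necessarily the script's actual algorithm):

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    # Flag indices whose value lies more than `threshold` standard
    # deviations from the mean of the series.
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

series = [100, 102, 98, 101, 99, 100, 400, 101]  # one obvious spike
print(zscore_anomalies(series, threshold=2.0))   # [6]
```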
Every log entry should include:
Example structured log (JSON):
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
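An entry like this can be emitted with the standard library alone; a minimal sketch (field names follow the example above; `log_json` is a hypothetical helper):

```python
import json
from datetime import datetime, timezone

def log_json(level, message, **fields):
    # Build one structured log entry and emit it as a single JSON line.
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "message": message,
        **fields,
    }
    print(json.dumps(entry))
    return entry

log_json("error", "Payment processing failed",
         service="payment-service", error_type="GatewayTimeout", duration_ms=5000)
```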
ELK Stack (Elasticsearch, Logstash, Kibana):
Grafana Loki:
CloudWatch Logs:
Analyze logs for errors, patterns, and anomalies:
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log
# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors
# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
→ Script: scripts/log_analyzer.py
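The pattern grouping such an analyzer performs can be sketched with a regex and a Counter (illustrative only, not the script's real implementation):

```python
import re
from collections import Counter

def top_error_patterns(lines, n=3):
    # Normalize numbers/ids so similar errors group together, then count.
    counts = Counter()
    for line in lines:
        if "ERROR" not in line:
            continue
        pattern = re.sub(r"\d+", "<N>", line.strip())
        counts[pattern] += 1
    return counts.most_common(n)

logs = [
    "2024-10-28 ERROR timeout calling payment gateway after 5000 ms",
    "2024-10-28 ERROR timeout calling payment gateway after 3000 ms",
    "2024-10-28 INFO request completed",
    "2024-10-28 ERROR user 42 not found",
]
print(top_error_patterns(logs))
```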
For comprehensive logging guidance including:
→ Read : references/logging_guide.md
| Severity | Response Time | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Alert when error budget is consumed too quickly:
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
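The 14.4 and 6 thresholds follow from error-budget arithmetic: a burn rate of 14.4 spends 2% of a 30-day budget in one hour, and a rate of 6 spends 5% in six hours. A quick check:

```python
def burn_rate_threshold(budget_fraction, window_hours, period_days=30):
    # Burn rate at which `budget_fraction` of the period's error budget
    # is consumed within `window_hours`.
    period_hours = period_days * 24
    return budget_fraction * period_hours / window_hours

# Common multiwindow choices: 2% of budget in 1h, 5% of budget in 6h
print(burn_rate_threshold(0.02, 1))   # 14.4
print(burn_rate_threshold(0.05, 6))   # 6.0
```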
Audit your alert rules against best practices:
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml
# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
Checks for:
→ Script: scripts/alert_quality_checker.py
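The checks such an auditor performs are mechanical; a minimal sketch over one already-parsed rule (hypothetical rule dict and rule list, not the script's actual checks):

```python
def audit_alert_rule(rule):
    # Flag common quality problems in a Prometheus alerting rule.
    problems = []
    if rule.get("labels", {}).get("severity") not in {"critical", "warning", "info"}:
        problems.append("missing or invalid severity label")
    if "runbook_url" not in rule.get("annotations", {}):
        problems.append("no runbook_url annotation")
    if "for" not in rule:
        problems.append("no 'for' duration (alert may flap)")
    return problems

rule = {
    "alert": "HighErrorRate",
    "expr": "error_rate > 0.05",
    "labels": {"severity": "critical"},
    "annotations": {},
}
print(audit_alert_rule(rule))
```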
Production-ready alert rule templates:
→ Templates:
For comprehensive alerting guidance including:
→ Read: references/alerting_best_practices.md
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
┌─────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────┘
Automatically generate dashboards from templates:
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
--title "My API Dashboard" \
--service my_api \
--output dashboard.json
# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
--title "K8s Production" \
--namespace production \
--output k8s-dashboard.json
# Database dashboard
python3 scripts/dashboard_generator.py database \
--title "PostgreSQL" \
--db-type postgres \
--instance db.example.com:5432 \
--output db-dashboard.json
Supports:
→ Script: scripts/dashboard_generator.py
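The generated output is ordinary Grafana dashboard JSON; for orientation, a minimal hand-built skeleton (one hypothetical panel; the real generator's output will differ):

```python
import json

# Minimal Grafana-style dashboard JSON with a single timeseries panel.
dashboard = {
    "title": "My API Dashboard",
    "refresh": "30s",
    "panels": [
        {
            "title": "Request Rate",
            "type": "timeseries",
            "targets": [
                {"expr": 'sum(rate(http_requests_total{service="my_api"}[5m]))'}
            ],
        }
    ],
}
print(json.dumps(dashboard, indent=2))
```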
SLI (Service Level Indicator): Measurement of service quality
SLO (Service Level Objective): Target value for an SLI
Error Budget: Allowed failure amount = (100% - SLO)
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |
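The downtime column follows directly from the SLO; a quick check of the table (values in minutes; 432 min = 7.2 h):

```python
def downtime_per_month(slo_percent, days=30):
    # Allowed downtime in minutes for a given availability SLO
    # over a 30-day month.
    return (1 - slo_percent / 100) * days * 24 * 60

for slo in (99.0, 99.9, 99.95, 99.99):
    print(slo, round(downtime_per_month(slo), 1), "minutes")
```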
Calculate compliance, error budgets, and burn rates:
# Show SLO reference table
python3 scripts/slo_calculator.py --table
# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
--slo 99.9 \
--total-requests 1000000 \
--failed-requests 1500 \
--period-days 30
# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
--slo 99.9 \
--errors 50 \
--requests 10000 \
--window-hours 1
→ Script: scripts/slo_calculator.py
For comprehensive SLO/SLA guidance including:
→ Read: references/slo_sla_guide.md
Use distributed tracing when you need to:
Python example:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
Error-based sampling (always sample errors, 1% of successes):
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        if trace_id & 0xFF < 3:  # low byte < 3 -> ~1% of successes
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
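A sanity check on the mask above: `trace_id & 0xFF < 3` keeps 3 of the 256 possible low-byte values, i.e. about 1.2% of successful requests:

```python
# Count how many 8-bit trace-id suffixes pass the sampler's mask check.
passing = sum(1 for low_byte in range(256) if low_byte & 0xFF < 3)
print(passing, f"{passing / 256:.2%}")  # 3 1.17%
```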
Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
For comprehensive tracing guidance including:
→ Read: references/tracing_guide.md
If your Datadog bill is growing out of control, start by identifying waste:
Automatically analyze your Datadog usage and find cost optimization opportunities:
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY
# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY \
--show-details
What it checks:
→ Script: scripts/datadog_cost_analyzer.py
1. Custom Metrics Optimization (typical savings: 20-40%):
2. Log Management (typical savings: 30-50%):
3. APM Optimization (typical savings: 15-25%):
4. Infrastructure Monitoring (typical savings: 10-20%):
If you're considering migrating to a more cost-effective open-source stack:
From Datadog → To Open Source Stack:
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for a 100-host environment)
Phase 1: Run Parallel (Month 1-2):
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
Phase 3: Migrate Logs & Traces (Month 3-4):
Phase 4: Decommission Datadog (Month 4-5):
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
Quick examples:
# Average CPU
Datadog: avg:system.cpu.user{*}
Prometheus: avg(node_cpu_seconds_total{mode="user"})
# Request rate
Datadog: sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))
# P95 latency
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate percentage
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
→ Full Translation Guide: references/dql_promql_translation.md
Example: 100-host infrastructure
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|---|---|---|---|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| Total | $79,800 | $18,000 | $61,800 (77%) |
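The table's totals reduce to simple arithmetic; a quick check:

```python
# Annual costs from the comparison table above (USD).
datadog = {"infrastructure": 18000, "custom_metrics": 600, "logs": 24000, "apm": 37200}
open_source = {"infrastructure": 10000, "custom_metrics": 0, "logs": 3000, "apm": 5000}

dd_total = sum(datadog.values())          # 79,800
oss_total = sum(open_source.values())     # 18,000
savings = dd_total - oss_total            # 61,800
pct = round(100 * savings / dd_total)     # 77
print(dd_total, oss_total, savings, f"{pct}%")
```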
For comprehensive migration guidance including:
→ Read: references/datadog_migration.md
Choose Prometheus + Grafana if:
Choose Datadog if:
Choose Grafana Stack (LGTM) if:
Choose ELK Stack if:
Choose Cloud Native (CloudWatch, etc.) if:
| Solution | Monthly Cost | Setup | Ops Burden |
|---|---|---|---|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
For comprehensive tool comparison including:
→ Read: references/tool_comparison.md
Validate health check endpoints against best practices:
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health
# Check multiple endpoints
python3 scripts/health_check_validator.py \
https://api.example.com/health \
https://api.example.com/readiness \
--verbose
Checks for:
→ Script: scripts/health_check_validator.py
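An endpoint that passes such checks typically returns structured JSON with per-dependency results; a minimal sketch of the response logic (dependency names are hypothetical):

```python
import json

def health_response(checks):
    # Aggregate per-dependency boolean checks into one health payload.
    # The HTTP status should be 200 only when every dependency is healthy.
    healthy = all(checks.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": {name: ("ok" if ok else "failed") for name, ok in checks.items()},
    }
    return (200 if healthy else 503), body

status, body = health_response(
    {"database": True, "cache": True, "payment_gateway": False}
)
print(status, json.dumps(body))
```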
High Latency Investigation:
High Error Rate Investigation:
Service Down Investigation:
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Check pod status
kubectl get pods -n <namespace>
# View pod logs
kubectl logs -f <pod-name> -n <namespace>
# Check pod resources
kubectl top pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
Elasticsearch:
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
Loki (LogQL):
{job="app", level="error"} |= "error" | json
CloudWatch Insights:
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
Scripts:
- analyze_metrics.py - Detect anomalies in Prometheus/CloudWatch metrics
- alert_quality_checker.py - Audit alert rules against best practices
- slo_calculator.py - Calculate SLO compliance and error budgets
- log_analyzer.py - Parse logs for errors and patterns
- dashboard_generator.py - Generate Grafana dashboards from templates
- health_check_validator.py - Validate health check endpoints
- datadog_cost_analyzer.py - Analyze Datadog usage and find cost waste

References:
- metrics_design.md - Four Golden Signals, RED/USE methods, metric types
- alerting_best_practices.md - Alert design, runbooks, on-call practices
- logging_guide.md - Structured logging, aggregation patterns
- tracing_guide.md - OpenTelemetry, distributed tracing
- slo_sla_guide.md - SLI/SLO/SLA definitions, error budgets
- tool_comparison.md - Comprehensive comparison of monitoring tools
- datadog_migration.md - Complete guide for migrating from Datadog to OSS stack
- dql_promql_translation.md - Datadog Query Language to PromQL translation reference

Templates:
- prometheus-alerts/webapp-alerts.yml - Production-ready web app alerts
- prometheus-alerts/kubernetes-alerts.yml - Kubernetes monitoring alerts
- otel-config/collector-config.yaml - OpenTelemetry Collector configuration
- runbooks/incident-runbook-template.md - Incident response template

Weekly Installs
137
Repository
GitHub Stars
89
First Seen
Jan 23, 2026
Security Audits
Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass
Installed on
opencode: 110
claude-code: 100
gemini-cli: 97
codex: 92
cursor: 89
github-copilot: 88