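Observability & Monitoring by ariegoldkin/ai-agent-hub
npx skills add https://github.com/ariegoldkin/ai-agent-hub --skill 'Observability & Monitoring'
Comprehensive frameworks for implementing observability, including structured logging, metrics collection, distributed tracing, and alerting configuration.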
┌─────────────────┬─────────────────┬─────────────────┐
│ LOGS │ METRICS │ TRACES │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened │ How is system │ How do requests │
│ at specific │ performing │ flow through │
│ point in time │ over time │ services │
└─────────────────┴─────────────────┴─────────────────┘
| Level | Use Case |
|---|---|
| ERROR | Unhandled exceptions, failed operations |
| WARN | Deprecated API, retry attempts |
| INFO | Business events, successful operations |
| DEBUG | Development troubleshooting |
// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});
// Bad: String interpolation
logger.info(`User ${user.id} completed purchase`);
See templates/structured-logging.ts for Winston setup and request middleware.
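To make the level table and the structured example above concrete, here is a minimal, dependency-free sketch of a leveled JSON logger. It is not the Winston setup from templates/structured-logging.ts; the names `createLogger`, `LogLevel`, `LogFields`, and the `sink` parameter are hypothetical and exist only for illustration.

```typescript
// Hypothetical minimal logger: shows level filtering and structured JSON
// output only. It is NOT the Winston configuration from the template.
type LogLevel = 'error' | 'warn' | 'info' | 'debug';

// Lower number = higher severity, following the order of the level table.
const LEVEL_PRIORITY: Record<LogLevel, number> = { error: 0, warn: 1, info: 2, debug: 3 };

type LogFields = Record<string, unknown>;
type LogSink = (line: string) => void;

function createLogger(minLevel: LogLevel, sink: LogSink = (line) => console.log(line)) {
  function write(level: LogLevel, message: string, fields: LogFields = {}): void {
    // Drop entries that are less severe than the configured minimum level.
    if (LEVEL_PRIORITY[level] > LEVEL_PRIORITY[minLevel]) return;
    // One JSON object per line keeps every field queryable by log tooling.
    const entry = { level, message, timestamp: new Date().toISOString(), ...fields };
    sink(JSON.stringify(entry));
  }
  return {
    error: (message: string, fields?: LogFields) => write('error', message, fields),
    warn: (message: string, fields?: LogFields) => write('warn', message, fields),
    info: (message: string, fields?: LogFields) => write('info', message, fields),
    debug: (message: string, fields?: LogFields) => write('debug', message, fields),
  };
}

// Usage: capture lines in memory instead of printing them.
const capturedLines: string[] = [];
const demoLogger = createLogger('info', (line) => capturedLines.push(line));

demoLogger.info('User action completed', {
  action: 'purchase',
  userId: 'u-1', // placeholder IDs for the example
  orderId: 'o-1',
  duration_ms: 150,
});
demoLogger.debug('Cache lookup', { key: 'cart:u-1' }); // filtered out at 'info'
```

Passing the sink as a parameter is a design choice for the sketch: it keeps the logger testable without touching stdout, and a real transport could be swapped in later.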
Essential metrics for any service:
// HTTP request latency
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
// Database query latency
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
See templates/prometheus-metrics.ts for full metrics configuration.
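The bucket arrays above are easier to tune once it is clear how a histogram uses them. The sketch below assumes the values are upper bounds in seconds (the usual Prometheus convention, not stated in the source) and that buckets are cumulative, so each observation counts toward every bucket whose bound is at least the observed value. `createHistogram` and `BucketHistogram` are hypothetical names, not part of any metrics library.

```typescript
// Hypothetical sketch of cumulative histogram buckets. It only illustrates
// the counting scheme and is not the metrics library used by the template.
const HTTP_LATENCY_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5]; // assumed seconds

interface BucketHistogram {
  observe(value: number): void;
  // Cumulative counts keyed by upper bound, plus a catch-all "+Inf" bucket.
  snapshot(): Record<string, number>;
}

function createHistogram(upperBounds: number[]): BucketHistogram {
  const bounds = [...upperBounds].sort((a, b) => a - b);
  const counts = bounds.map(() => 0);
  let total = 0;
  return {
    observe(value: number): void {
      total += 1;
      // A value counts toward every bucket whose upper bound is >= the value.
      bounds.forEach((bound, index) => {
        if (value <= bound) counts[index] += 1;
      });
    },
    snapshot(): Record<string, number> {
      const result: Record<string, number> = {};
      bounds.forEach((bound, index) => {
        result[String(bound)] = counts[index];
      });
      result['+Inf'] = total;
      return result;
    },
  };
}

// Usage: four request durations, in seconds.
const httpLatency = createHistogram(HTTP_LATENCY_BUCKETS);
[0.02, 0.3, 0.3, 4].forEach((seconds) => httpLatency.observe(seconds));
const latencySnapshot = httpLatency.snapshot();
// latencySnapshot['0.5'] is 3: the 0.02 s and both 0.3 s requests fit under 0.5 s.
```

Because percentiles are estimated from these counts, bucket bounds are most useful when they cluster around the latencies that matter for alerting, such as the 2 s threshold used for HighLatency.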
Auto-instrument common libraries:
tracer.startActiveSpan('processOrder', async (span) => {
  try {
    span.setAttribute('order.id', orderId);
    // ... work
  } finally {
    // End the span even if the work throws
    span.end();
  }
});
See templates/opentelemetry-tracing.ts for full setup.
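The manual span pattern above tends to get repeated at every call site, so it can help to wrap it once. The following is a sketch under stated assumptions: `SpanLike` and `TracerLike` are locally defined stand-ins that only mirror the calls used in the example, `withSpan` is a hypothetical helper, and `memoryTracer` is an in-memory fake that exists solely to make the snippet runnable without a tracing dependency.

```typescript
// Locally defined stand-ins for the span and tracer used in the example.
// Assumption: a real setup would use the tracing library's own types.
type AttributeValue = string | number | boolean;

interface SpanLike {
  setAttribute(key: string, value: AttributeValue): void;
  recordException(error: unknown): void;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(spanName: string, fn: (span: SpanLike) => Promise<T>): Promise<T>;
}

// Hypothetical helper: runs `work` inside a span and always ends the span.
function withSpan<T>(
  tracer: TracerLike,
  spanName: string,
  attributes: Record<string, AttributeValue>,
  work: (span: SpanLike) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(spanName, async (span) => {
    Object.keys(attributes).forEach((key) => span.setAttribute(key, attributes[key]));
    try {
      return await work(span);
    } catch (error) {
      // Attach the failure to the span, then let the caller handle it.
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// In-memory tracer, used only so the sketch runs without a tracing backend.
interface RecordedSpan {
  spanName: string;
  attributes: Record<string, AttributeValue>;
  errors: unknown[];
  ended: boolean;
}

const recordedSpans: RecordedSpan[] = [];

const memoryTracer: TracerLike = {
  startActiveSpan<T>(spanName: string, fn: (span: SpanLike) => Promise<T>): Promise<T> {
    const record: RecordedSpan = { spanName, attributes: {}, errors: [], ended: false };
    recordedSpans.push(record);
    return fn({
      setAttribute: (key, value) => { record.attributes[key] = value; },
      recordException: (error) => { record.errors.push(error); },
      end: () => { record.ended = true; },
    });
  },
};

// Usage: 'o-1' is a placeholder order ID.
const tracedOrder = withSpan(memoryTracer, 'processOrder', { 'order.id': 'o-1' }, async () => 'done');
```

Rethrowing after recording the exception keeps error handling with the caller, while the `finally` block ensures the span is closed on both the success and failure paths.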
| Level | Response Time | Examples |
|---|---|---|
| Critical (P1) | < 15 min | Service down, data loss |
| High (P2) | < 1 hour | Major feature broken |
| Medium (P3) | < 4 hours | Increased error rate |
| Low (P4) | Next day | Warnings |
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | up == 0 for 1m | Critical |
| HighErrorRate | 5xx > 5% for 5m | Critical |
| HighLatency | p95 > 2s for 5m | High |
| LowCacheHitRate | < 70% for 10m | Medium |
See templates/alerting-rules.yml for Prometheus alerting rules.
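Each condition in the table pairs a threshold with a "for" duration, meaning the breach has to persist before the alert fires. The sketch below illustrates that idea in plain TypeScript; it is not how Prometheus evaluates rules internally, and `RateSample`, `hasBreachedFor`, and the sample data are hypothetical.

```typescript
// Hypothetical evaluator for "value > threshold for N minutes" conditions.
interface RateSample {
  timestampMs: number;
  value: number;
}

const MINUTE_MS = 60 * 1000;

// True when the most recent unbroken run of breaching samples spans at
// least `forMs`. A single sample at or below the threshold resets the run.
function hasBreachedFor(samples: RateSample[], threshold: number, forMs: number): boolean {
  if (samples.length === 0) return false;
  const ordered = [...samples].sort((a, b) => a.timestampMs - b.timestampMs);
  const latest = ordered[ordered.length - 1];
  let breachStartMs: number | null = null;
  // Walk backwards from the newest sample while the condition still holds.
  for (let index = ordered.length - 1; index >= 0; index -= 1) {
    if (ordered[index].value <= threshold) break;
    breachStartMs = ordered[index].timestampMs;
  }
  return breachStartMs !== null && latest.timestampMs - breachStartMs >= forMs;
}

// Usage: made-up 5xx ratios sampled once per minute, checked against the
// HighErrorRate condition (> 5% for 5 minutes).
const errorRatioSamples: RateSample[] = [0.02, 0.08, 0.09, 0.07, 0.06, 0.1, 0.12].map(
  (value, minute) => ({ timestampMs: minute * MINUTE_MS, value }),
);

const firingNow = hasBreachedFor(errorRatioSamples, 0.05, 5 * MINUTE_MS);
const firingOneMinuteEarlier = hasBreachedFor(errorRatioSamples.slice(0, 6), 0.05, 5 * MINUTE_MS);
```

In this data the breach starts at minute 1, so the condition is met at minute 6 but not yet at minute 5. The duration requirement is what keeps short spikes from paging anyone.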
| Probe | Purpose | Endpoint |
|---|---|---|
| Liveness | Is app running? | /health |
| Readiness | Ready for traffic? | /ready |
| Startup | Finished starting? | /startup |
{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}
See templates/health-checks.ts for implementation.
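The response shape above leaves open how individual checks roll up into the overall status. One possible rule, sketched here as an assumption rather than as what templates/health-checks.ts does: any failing critical dependency makes the service `unhealthy`, any other failing dependency makes it `degraded`, and otherwise it is `healthy`. `buildHealthReport`, the `criticalDependencies` parameter, and the reading of `uptime` as seconds are all hypothetical.

```typescript
type CheckStatus = 'pass' | 'fail';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface DependencyCheck {
  status: CheckStatus;
  latency_ms: number;
}

interface HealthReport {
  status: OverallStatus;
  checks: Record<string, DependencyCheck>;
  version: string;
  uptime: number; // assumed to be seconds since process start
}

// Hypothetical aggregation rule (an assumption, not taken from the template):
//   failing critical dependency     -> 'unhealthy'
//   failing non-critical dependency -> 'degraded'
//   everything passing              -> 'healthy'
function buildHealthReport(
  checks: Record<string, DependencyCheck>,
  criticalDependencies: string[],
  version: string,
  uptimeSeconds: number,
): HealthReport {
  const failing = Object.keys(checks).filter((dependency) => checks[dependency].status === 'fail');
  const criticalFailure = failing.some(
    (dependency) => criticalDependencies.indexOf(dependency) !== -1,
  );

  let overall: OverallStatus = 'healthy';
  if (criticalFailure) {
    overall = 'unhealthy';
  } else if (failing.length > 0) {
    overall = 'degraded';
  }
  return { status: overall, checks, version, uptime: uptimeSeconds };
}

// Usage: the cache is down but the database is fine, so the service can
// still serve traffic in a reduced state.
const exampleReport = buildHealthReport(
  {
    database: { status: 'pass', latency_ms: 5 },
    redis: { status: 'fail', latency_ms: 2 },
  },
  ['database'],
  '1.0.0',
  3600,
);
```

Separating critical from non-critical dependencies lets a readiness endpoint keep accepting traffic while a cache is unavailable, yet stop accepting it when the primary datastore fails.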
Use Opus 4.5 extended thinking for:
| Template | Purpose |
|---|---|
| structured-logging.ts | Winston logger with request middleware |
| prometheus-metrics.ts | HTTP, DB, cache metrics with middleware |
| opentelemetry-tracing.ts | Distributed tracing setup |
| alerting-rules.yml | Prometheus alerting rules |
| health-checks.ts | Liveness, readiness, startup probes |