golang-observability by samber/cc-skills-golang
```shell
npx skills add https://github.com/samber/cc-skills-golang --skill golang-observability
```
Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
- Community default. A company skill that explicitly supersedes the samber/cc-skills-golang@golang-observability skill takes precedence.
Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.
- log/slog — production services MUST emit structured logs (JSON), not freeform strings.
- Use slog.InfoContext(ctx, ...) to correlate logs with traces.
- Track percentiles (P50, P90, P99, P99.9) with histogram_quantile() in PromQL.

See samber/cc-skills-golang@golang-error-handling skill for the single handling rule. See samber/cc-skills-golang@golang-troubleshooting skill for using observability signals to diagnose production issues. See samber/cc-skills-golang@golang-security skill for protecting pprof endpoints and avoiding PII in logs. See samber/cc-skills-golang@golang-context skill for propagating trace context across service boundaries. See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.
| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |
Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
Structured Logging — Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
Metrics Collection — Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation). Deep dive: why Histograms beat Summaries (server-side aggregation, supports histogram_quantile PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
Distributed Tracing — When and how to use OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.
otelslog bridge:

```go
import "go.opentelemetry.io/contrib/bridges/otelslog"

// Create a handler that automatically injects trace_id and span_id
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))

// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}
```

```go
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace.
// Type-assert to prometheus.ExemplarObserver to get ObserveWithExemplar.
histogram.WithLabelValues("POST", "/orders").(prometheus.ExemplarObserver).
	ObserveWithExemplar(duration.Seconds(), prometheus.Labels{"trace_id": traceID})
```
If the project currently uses zap, logrus, or zerolog, migrate to log/slog. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
- Introduce slog as the new logger with slog.SetDefault().
- Replace zap.L().Info(...) / logrus.Info(...) / log.Info().Msg(...) calls with slog.Info(...).

A feature is not production-ready until it is observable. Before marking a feature as done, verify:
- Logging: structured key-value pairs via slog, context variants used (slog.InfoContext), no PII in logs, errors MUST be either logged OR returned (NEVER both).
- Tracing: errors recorded with span.RecordError().
- RUM: identity key is user_id (not email), consent checked before tracking.

```go
// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
	slog.Error("query failed", "error", err)
	return fmt.Errorf("query: %w", err)
}

// ✓ Good — return with context, log once at the top level
if err != nil {
	return fmt.Errorf("querying users: %w", err)
}
```

```go
// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()

// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()
```

```go
// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")

// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")
```

```go
// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
	Name:       "http_request_duration_seconds",
	Objectives: map[float64]float64{0.99: 0.001},
})

// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Buckets: prometheus.DefBuckets,
})
```
Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles) — how to enable it in production, secure it with auth, and toggle via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
Alerting — Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with ~500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, missing for: duration to avoid flapping).
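As an illustrative Prometheus rule along those lines (group name, threshold, and durations are placeholders, not taken from the skill):

```yaml
groups:
  - name: my-service-alerts
    rules:
      - alert: HighErrorRate
        # rate (not irate) so the alert tracks a sustained trend, not the last two samples
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m   # condition must hold for 10 minutes, which avoids flapping
        labels:
          severity: critical
        annotations:
          summary: "5xx error ratio above 5% for 10 minutes"
```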
Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.