golang-observability by samber/cc-skills-golang
```shell
npx skills add https://github.com/samber/cc-skills-golang --skill golang-observability
```
Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
- Community default. A company skill that explicitly supersedes the samber/cc-skills-golang@golang-observability skill takes precedence.
Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.
- log/slog — production services MUST emit structured logs (JSON), not freeform strings.
- Use slog.InfoContext(ctx, ...) to correlate logs with traces.
- Track percentiles (P50, P90, P99, P99.9) with histogram_quantile() in PromQL.

See samber/cc-skills-golang@golang-error-handling skill for the single handling rule. See samber/cc-skills-golang@golang-troubleshooting skill for using observability signals to diagnose production issues. See samber/cc-skills-golang@golang-security skill for protecting pprof endpoints and avoiding PII in logs. See samber/cc-skills-golang@golang-context skill for propagating trace context across service boundaries. See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.
| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |
Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
Structured Logging — Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
Metrics Collection — Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation). Deep dive: why Histograms beat Summaries (server-side aggregation, supports histogram_quantile PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
Distributed Tracing — When and how to use OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.
otelslog bridge:

```go
import "go.opentelemetry.io/contrib/bridges/otelslog"

// Create a handler that automatically injects trace_id and span_id
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))

// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}
```

```go
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace.
// Type-assert to prometheus.ExemplarObserver to get ObserveWithExemplar.
histogram.WithLabelValues("POST", "/orders").(prometheus.ExemplarObserver).
	ObserveWithExemplar(duration.Seconds(), prometheus.Labels{"trace_id": traceID})
```
If the project currently uses zap, logrus, or zerolog, migrate to log/slog. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
- Introduce slog as the new logger with slog.SetDefault().
- Replace zap.L().Info(...) / logrus.Info(...) / log.Info().Msg(...) calls with slog.Info(...).

A feature is not production-ready until it is observable. Before marking a feature as done, verify:
- Logging: structured key-value pairs via slog, context variants used (slog.InfoContext), no PII in logs, errors MUST be either logged OR returned (NEVER both).
- Tracing: errors recorded with span.RecordError().
- RUM: identity key is user_id (not email), consent checked before tracking.

```go
// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
	slog.Error("query failed", "error", err)
	return fmt.Errorf("query: %w", err)
}

// ✓ Good — return with context, log once at the top level
if err != nil {
	return fmt.Errorf("querying users: %w", err)
}
```

```go
// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()

// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()
```

```go
// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")

// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")
```

```go
// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
	Name:       "http_request_duration_seconds",
	Objectives: map[float64]float64{0.99: 0.001},
})

// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Buckets: prometheus.DefBuckets,
})
```
Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles) — how to enable it in production, secure it with auth, and toggle via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
Alerting — Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with ~500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, missing for: duration to avoid flapping).
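As an illustrative Prometheus rule along those lines (group name, threshold, and durations are placeholders, not taken from the skill):

```yaml
groups:
  - name: my-service-alerts
    rules:
      - alert: HighErrorRate
        # rate (not irate) so the alert tracks a sustained trend, not the last two samples
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m   # condition must hold for 10 minutes, which avoids flapping
        labels:
          severity: critical
        annotations:
          summary: "5xx error ratio above 5% for 10 minutes"
```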
Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.