observability-service-health by elastic/agent-skills

npx skills add https://github.com/elastic/agent-skills --skill observability-service-health
Assess APM service health using Observability APIs, ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.
Data sources:
- Query traces*apm*,traces*otel* and metrics*apm*,metrics*otel* with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. POST _query for ES|QL, or Query DSL) as documented in the Elasticsearch repo for indices and search.
- For correlations, run the apm-correlations script; when the Kibana API is unavailable it falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*. See APM Correlations script.
- Use resource attributes (k8s.pod.name, container.id, host.name) in traces; query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
- Filter log indices by service.name or trace.id to explain behavior and root cause.

Synthesize health from all of the following when available:
| Signal | What to check |
|---|---|
| SLOs | Burn rate, status (healthy/degrading/violated), error budget. |
| Firing alerts | Open or recently fired alerts for the service or dependencies. |
| ML anomalies | Anomaly jobs; score and severity for latency, throughput, or error rate. |
| Throughput | Request rate; compare to baseline or previous period. |
| Latency | Avg, p95, p99; compare to SLO targets or history. |
| Error rate | Failed/total requests; spikes or sustained elevation. |
| Dependency health | Downstream latency, error rate, availability (ES\|QL/APIs / Kibana repo). |
| Infrastructure | CPU usage, memory; OOM and CPU throttling on pods/containers/hosts. |
| Logs | App logs filtered by service or trace ID for context and root cause. |
Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to explain why and suggest next steps.
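The decision rule above can be sketched as a small helper. This is illustrative only; the input signals and the 75-point anomaly threshold are assumptions, not part of the skill's actual code:

```python
def classify_health(slo_violated: bool,
                    critical_alert_firing: bool,
                    max_anomaly_score: float = 0.0,
                    severe_anomaly_threshold: float = 75.0) -> str:
    """Return 'unhealthy', 'degraded', or 'healthy' from the core signals."""
    if slo_violated or critical_alert_firing or max_anomaly_score >= severe_anomaly_threshold:
        return "unhealthy"
    # Mild anomalies without SLO or alert impact suggest degradation worth explaining.
    if max_anomaly_score > 0:
        return "degraded"
    return "healthy"
```

Whatever the exact thresholds, the summary should always state which signal tripped the classification.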
When querying APM data from Elasticsearch (traces*apm*,traces*otel*, metrics*apm*,metrics*otel*), use ES|QL by default where available.
Availability: ES|QL is available in Elasticsearch 8.11+ (technical preview; GA in 8.14). It is always available in the Elastic Observability Serverless Complete tier.
Scoping to a service: Always filter by service.name (and service.environment when relevant). Combine with a time range on @timestamp:
WHERE service.name == "my-service-name" AND service.environment == "production" AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
Example patterns: Throughput, latency, and error rate over time: see Kibana trace_charts_definition.ts (getThroughputChart, getLatencyChart, getErrorRateChart). Use from(index) → where(...) → stats(...) / evaluate(...) with BUCKET(@timestamp, ...) and WHERE service.name == "<service_name>".
Performance: Add LIMIT n to cap rows and token usage. Prefer coarser BUCKET(@timestamp, ...) (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.
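This pipeline pattern can be sketched as a small query builder. The function name and defaults are illustrative; the index pattern and field names come from the example query later in this document:

```python
def build_health_query(service_name: str, start: str, end: str,
                       bucket: str = "1 hour", limit: int = 500) -> str:
    """Assemble a throughput/error-rate ES|QL query following the
    FROM -> WHERE -> STATS ... BY BUCKET(...) pattern described above."""
    return (
        f'FROM traces*apm*,traces*otel*\n'
        f'| WHERE service.name == "{service_name}"\n'
        f'  AND @timestamp >= "{start}" AND @timestamp <= "{end}"\n'
        f'| STATS request_count = COUNT(*), '
        f'failures = COUNT(*) WHERE event.outcome == "failure" '
        f'BY BUCKET(@timestamp, {bucket})\n'
        f'| EVAL error_rate = failures / request_count\n'
        f'| SORT @timestamp\n'
        f'| LIMIT {limit}'
    )
```

Parameterizing the bucket size makes it easy to coarsen granularity when only trends are needed.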
When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*.
# Latency correlations (attributes over-represented in slow transactions)
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
# Failed transaction correlations
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
# Test Kibana connection
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]
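A hedged sketch of what the significant_terms fallback request could look like. The candidate field list and the 1-second slow-transaction threshold are assumptions for illustration; the actual script may choose differently:

```python
def significant_terms_fallback(service_name: str, start: str, end: str,
                               slow_threshold_us: int = 1_000_000,
                               fields=("host.name", "k8s.pod.name",
                                       "container.id", "service.version")) -> dict:
    """Build an Elasticsearch search body asking, per candidate field,
    which values are over-represented in slow transactions."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"service.name": service_name}},
            {"range": {"@timestamp": {"gte": start, "lte": end}}},
            # Foreground set: transactions slower than the threshold.
            {"range": {"transaction.duration.us": {"gte": slow_threshold_us}}},
        ]}},
        "aggs": {field: {"significant_terms": {"field": field}}
                 for field in fields},
    }
```

The body would be POSTed to traces*apm*,traces*otel*/_search; significant_terms compares the slow subset against the index background to surface over-represented attribute values.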
Environment: KIBANA_URL and KIBANA_API_KEY (or KIBANA_USERNAME/KIBANA_PASSWORD) for Kibana; for fallback, ELASTICSEARCH_URL and ELASTICSEARCH_API_KEY. Use the same time range as the investigation.
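As a sketch of how a client could resolve these variables (illustrative only, not the script's actual code): prefer the API key, fall back to basic auth, and fail loudly when neither is set:

```python
import base64
import os

def kibana_auth_headers(env=None) -> dict:
    """Build an Authorization header from the environment variables above."""
    env = os.environ if env is None else env
    if env.get("KIBANA_API_KEY"):
        return {"Authorization": f"ApiKey {env['KIBANA_API_KEY']}"}
    user, pwd = env.get("KIBANA_USERNAME"), env.get("KIBANA_PASSWORD")
    if user and pwd:
        token = base64.b64encode(f"{user}:{pwd}".encode()).decode()
        return {"Authorization": f"Basic {token}"}
    raise RuntimeError("Set KIBANA_API_KEY or KIBANA_USERNAME/KIBANA_PASSWORD")
```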
Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs / Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions
Confirm service name and time range. Resolve the service from the request; if multiple are in scope, target the most relevant. Use ES|QL on traces*apm*,traces*otel* or metrics*apm*,metrics*otel* (e.g. WHERE service.name == "<name>") or Kibana repo APM routes to obtain service-level data. If the user has not provided the time range, assume last hour.
SLOs: Call the SLOs API to get SLO definitions and status for the service (latency, availability), healthy/degrading/violated, burn rate, error budget. Alerts: For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, include both rules where params.serviceName matches the service and rules where params.serviceName is absent (all-services rules). Do not query .alerts* indices for active-state checks. Correlate with SLO violations or metric changes.
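The two rule types can be filtered with a small helper, assuming the rule objects returned by the Alerting API carry a params.serviceName field as described above:

```python
def rules_for_service(active_rules: list, service_name: str) -> list:
    """Keep rules that target this service explicitly, plus all-services
    rules where params.serviceName is absent."""
    applicable = []
    for rule in active_rules:
        target = rule.get("params", {}).get("serviceName")
        if target is None or target == service_name:
            applicable.append(rule)
    return applicable
```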
If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to narrow Steps 4–5.
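A minimal sketch for selecting high-severity records, assuming anomaly records expose a record_score as in the Elasticsearch ML results schema (the 75-point cutoff is an illustrative default):

```python
def severe_anomalies(records: list, min_score: float = 75.0) -> list:
    """Pick high-severity anomaly records, worst first, so Steps 4-5
    can zoom into the most anomalous time windows."""
    hits = [r for r in records if r.get("record_score", 0) >= min_score]
    return sorted(hits, key=lambda r: r["record_score"], reverse=True)
```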
Use ES|QL against traces*apm*,traces*otel* or metrics*apm*,metrics*otel* for the service and time range to get throughput (e.g. req/min), latency (avg, p95, p99), error rate (failed/total or 5xx/total). Example: FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS .... Compare to prior period or SLO targets. See Using ES|QL for APM metrics.
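The prior-period comparison can be sketched as follows; the 50% spike threshold is an illustrative assumption, not a prescribed value:

```python
def rate_change(current: float, baseline: float) -> float:
    """Relative change vs the prior period; positive means the metric rose."""
    if baseline == 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline

def is_spike(current: float, baseline: float, threshold: float = 0.5) -> bool:
    """Flag an increase of more than `threshold` over the prior period."""
    return rate_change(current, baseline) > threshold
```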
Obtain dependency and service-map data via ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (e.g. downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or failing dependencies as likely causes.
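Flagging slow or failing dependencies might look like this sketch; the field names and thresholds are assumptions about how the dependency aggregation results are shaped:

```python
def flag_dependencies(deps: list, latency_ms: float = 500.0,
                      max_error_rate: float = 0.05) -> list:
    """Return names of downstream dependencies whose latency or error rate
    exceeds the (illustrative) thresholds, as likely causes to investigate."""
    return [d["name"] for d in deps
            if d.get("avg_latency_ms", 0) > latency_ms
            or d.get("error_rate", 0) > max_error_rate]
```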
Run the correlations script to get correlated attributes:

node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...]

Filter by those attributes and fetch trace samples or errors to confirm root cause. See APM Correlations script. Use resource attributes (k8s.pod.name, container.id, host.name) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation. Filter logs by service.name == "<service_name>" or trace.id == "<trace_id>" to explain behavior and root cause (exceptions, timeouts, restarts). State health (healthy / degraded / unhealthy) with reasons; list concrete next steps.
Scope with WHERE service.name == "<service_name>" and time range. Throughput and error rate (1-hour buckets; LIMIT caps rows and tokens):
FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure" BY BUCKET(@timestamp, 1 hour)
| EVAL error_rate = failures / request_count
| SORT @timestamp
| LIMIT 500
Latency percentiles and exact field names: see Kibana trace_charts_definition.ts.
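When a query like the one above is sent via POST _query, the JSON response uses a columns/values shape; a minimal parser for that shape:

```python
def esql_rows(response: dict) -> list:
    """Turn the columns/values shape returned by POST _query into row dicts."""
    names = [c["name"] for c in response["columns"]]
    return [dict(zip(names, row)) for row in response["values"]]
```

Row dicts keyed by column name are easier to compare against SLO targets or a baseline period than positional value lists.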
Run ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* for throughput, latency, and error rate; query dependency/service-map data (ES|QL or Kibana repo). Use resource attributes on spans/traces to get the runtimes (pods, containers, hosts) for the service. Then check CPU and memory for those resources in the same time window as the APM issue:
Identify the pod, container, or host via k8s.pod.name, k8s.namespace.name, container.id, or host.name. Query CPU and memory metrics (e.g. system.cpu.total.norm.pct); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that aligns with APM latency or error spikes. To understand behavior for a specific service or a single trace, filter logs accordingly:
Filter by service.name == "<service_name>" and the time range to get application logs (errors, warnings, restarts) in the service context. Take trace.id from the APM trace and filter logs by trace.id == "<trace_id>" (or the equivalent field in your log schema); logs with that trace ID show the full request path and help explain failures or latency.

Tooling recap: use ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (8.11+ or Serverless), filtering by service.name (and service.environment when relevant). For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, evaluate both rule types: rules where params.serviceName matches the target service, and rules where params.serviceName is absent (all-services rules). Treat either as applicable to the service before declaring health. Do not query .alerts* indices when determining currently active alerts; use the Alerting API response above as the source of truth. For APM correlations, run the apm-correlations script (see APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route handlers. For Elasticsearch index and search behavior, see the APIs in the Elasticsearch repo.

Weekly Installs
122
Repository
GitHub Stars
89
First Seen
11 days ago
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
cursor: 106
opencode: 99
gemini-cli: 99
github-copilot: 99
codex: 99
amp: 98