observability-service-health by elastic/agent-skills

npx skills add https://github.com/elastic/agent-skills --skill observability-service-health
Assess APM service health using Observability APIs, ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.
Data sources:
- Query traces*apm*,traces*otel* and metrics*apm*,metrics*otel* with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. POST _query for ES|QL, or Query DSL) as documented in the Elasticsearch repo for indices and search.
- For correlations, run the apm-correlations script; when the Kibana API is unavailable it falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*. See APM Correlations script.
- Use resource attributes (k8s.pod.name, container.id, host.name) in traces; query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
- Filter log indices by service.name or trace.id to explain behavior and root cause.

Synthesize health from all of the following when available:
| Signal | What to check |
|---|---|
| SLOs | Burn rate, status (healthy/degrading/violated), error budget. |
| Firing alerts | Open or recently fired alerts for the service or dependencies. |
| ML anomalies | Anomaly jobs; score and severity for latency, throughput, or error rate. |
| Throughput | Request rate; compare to baseline or previous period. |
| Latency | Avg, p95, p99; compare to SLO targets or history. |
| Error rate | Failed/total requests; spikes or sustained elevation. |
| Dependency health | Downstream latency, error rate, availability (ES\|QL/APIs / Kibana repo). |
| Infrastructure | CPU usage, memory; OOM and CPU throttling on pods/containers/hosts. |
| Logs | App logs filtered by service or trace ID for context and root cause. |
Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to explain why and suggest next steps.
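The decision rule above can be sketched as a small helper. This is illustrative only; the input signals and the 75-point anomaly threshold are assumptions, not part of the skill's actual code:

```python
def classify_health(slo_violated: bool,
                    critical_alert_firing: bool,
                    max_anomaly_score: float = 0.0,
                    severe_anomaly_threshold: float = 75.0) -> str:
    """Return 'unhealthy', 'degraded', or 'healthy' from the core signals."""
    if slo_violated or critical_alert_firing or max_anomaly_score >= severe_anomaly_threshold:
        return "unhealthy"
    # Mild anomalies without SLO or alert impact suggest degradation worth explaining.
    if max_anomaly_score > 0:
        return "degraded"
    return "healthy"
```

Whatever the exact thresholds, the summary should always state which signal tripped the classification.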
When querying APM data from Elasticsearch (traces*apm*,traces*otel*, metrics*apm*,metrics*otel*), use ES|QL by default where available.
Availability: ES|QL is available in Elasticsearch 8.11+ (technical preview; GA in 8.14). It is always available in the Elastic Observability Serverless Complete tier.
Scoping to a service: Always filter by service.name (and service.environment when relevant). Combine with a time range on @timestamp:
WHERE service.name == "my-service-name" AND service.environment == "production" AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
Example patterns: Throughput, latency, and error rate over time: see Kibana trace_charts_definition.ts (getThroughputChart, getLatencyChart, getErrorRateChart). Use from(index) → where(...) → stats(...) / evaluate(...) with BUCKET(@timestamp, ...) and WHERE service.name == "<service_name>".
Performance: Add LIMIT n to cap rows and token usage. Prefer coarser BUCKET(@timestamp, ...) (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.
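This pipeline pattern can be sketched as a small query builder. The function name and defaults are illustrative; the index pattern and field names come from the example query later in this document:

```python
def build_health_query(service_name: str, start: str, end: str,
                       bucket: str = "1 hour", limit: int = 500) -> str:
    """Assemble a throughput/error-rate ES|QL query following the
    FROM -> WHERE -> STATS ... BY BUCKET(...) pattern described above."""
    return (
        f'FROM traces*apm*,traces*otel*\n'
        f'| WHERE service.name == "{service_name}"\n'
        f'  AND @timestamp >= "{start}" AND @timestamp <= "{end}"\n'
        f'| STATS request_count = COUNT(*), '
        f'failures = COUNT(*) WHERE event.outcome == "failure" '
        f'BY BUCKET(@timestamp, {bucket})\n'
        f'| EVAL error_rate = failures / request_count\n'
        f'| SORT @timestamp\n'
        f'| LIMIT {limit}'
    )
```

Parameterizing the bucket size makes it easy to coarsen granularity when only trends are needed.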
When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*.
# Latency correlations (attributes over-represented in slow transactions)
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
# Failed transaction correlations
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
# Test Kibana connection
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]
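A hedged sketch of what the significant_terms fallback request could look like. The candidate field list and the 1-second slow-transaction threshold are assumptions for illustration; the actual script may choose differently:

```python
def significant_terms_fallback(service_name: str, start: str, end: str,
                               slow_threshold_us: int = 1_000_000,
                               fields=("host.name", "k8s.pod.name",
                                       "container.id", "service.version")) -> dict:
    """Build an Elasticsearch search body asking, per candidate field,
    which values are over-represented in slow transactions."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"service.name": service_name}},
            {"range": {"@timestamp": {"gte": start, "lte": end}}},
            # Foreground set: transactions slower than the threshold.
            {"range": {"transaction.duration.us": {"gte": slow_threshold_us}}},
        ]}},
        "aggs": {field: {"significant_terms": {"field": field}}
                 for field in fields},
    }
```

The body would be POSTed to traces*apm*,traces*otel*/_search; significant_terms compares the slow subset against the index background to surface over-represented attribute values.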
Environment: KIBANA_URL and KIBANA_API_KEY (or KIBANA_USERNAME/KIBANA_PASSWORD) for Kibana; for fallback, ELASTICSEARCH_URL and ELASTICSEARCH_API_KEY. Use the same time range as the investigation.
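As a sketch of how a client could resolve these variables (illustrative only, not the script's actual code): prefer the API key, fall back to basic auth, and fail loudly when neither is set:

```python
import base64
import os

def kibana_auth_headers(env=None) -> dict:
    """Build an Authorization header from the environment variables above."""
    env = os.environ if env is None else env
    if env.get("KIBANA_API_KEY"):
        return {"Authorization": f"ApiKey {env['KIBANA_API_KEY']}"}
    user, pwd = env.get("KIBANA_USERNAME"), env.get("KIBANA_PASSWORD")
    if user and pwd:
        token = base64.b64encode(f"{user}:{pwd}".encode()).decode()
        return {"Authorization": f"Basic {token}"}
    raise RuntimeError("Set KIBANA_API_KEY or KIBANA_USERNAME/KIBANA_PASSWORD")
```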
Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs / Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions
Confirm service name and time range. Resolve the service from the request; if multiple are in scope, target the most relevant. Use ES|QL on traces*apm*,traces*otel* or metrics*apm*,metrics*otel* (e.g. WHERE service.name == "<name>") or Kibana repo APM routes to obtain service-level data. If the user has not provided the time range, assume last hour.
SLOs: Call the SLOs API to get SLO definitions and status for the service (latency, availability), healthy/degrading/violated, burn rate, error budget. Alerts: For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, include both rules where params.serviceName matches the service and rules where params.serviceName is absent (all-services rules). Do not query .alerts* indices for active-state checks. Correlate with SLO violations or metric changes.
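The two rule types can be filtered with a small helper, assuming the rule objects returned by the Alerting API carry a params.serviceName field as described above:

```python
def rules_for_service(active_rules: list, service_name: str) -> list:
    """Keep rules that target this service explicitly, plus all-services
    rules where params.serviceName is absent."""
    applicable = []
    for rule in active_rules:
        target = rule.get("params", {}).get("serviceName")
        if target is None or target == service_name:
            applicable.append(rule)
    return applicable
```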
If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to narrow Steps 4–5.
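A minimal sketch for selecting high-severity records, assuming anomaly records expose a record_score as in the Elasticsearch ML results schema (the 75-point cutoff is an illustrative default):

```python
def severe_anomalies(records: list, min_score: float = 75.0) -> list:
    """Pick high-severity anomaly records, worst first, so Steps 4-5
    can zoom into the most anomalous time windows."""
    hits = [r for r in records if r.get("record_score", 0) >= min_score]
    return sorted(hits, key=lambda r: r["record_score"], reverse=True)
```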
Use ES|QL against traces*apm*,traces*otel* or metrics*apm*,metrics*otel* for the service and time range to get throughput (e.g. req/min), latency (avg, p95, p99), error rate (failed/total or 5xx/total). Example: FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS .... Compare to prior period or SLO targets. See Using ES|QL for APM metrics.
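The prior-period comparison can be sketched as follows; the 50% spike threshold is an illustrative assumption, not a prescribed value:

```python
def rate_change(current: float, baseline: float) -> float:
    """Relative change vs the prior period; positive means the metric rose."""
    if baseline == 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline

def is_spike(current: float, baseline: float, threshold: float = 0.5) -> bool:
    """Flag an increase of more than `threshold` over the prior period."""
    return rate_change(current, baseline) > threshold
```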
Obtain dependency and service-map data via ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (e.g. downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or failing dependencies as likely causes.
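Flagging slow or failing dependencies might look like this sketch; the field names and thresholds are assumptions about how the dependency aggregation results are shaped:

```python
def flag_dependencies(deps: list, latency_ms: float = 500.0,
                      max_error_rate: float = 0.05) -> list:
    """Return names of downstream dependencies whose latency or error rate
    exceeds the (illustrative) thresholds, as likely causes to investigate."""
    return [d["name"] for d in deps
            if d.get("avg_latency_ms", 0) > latency_ms
            or d.get("error_rate", 0) > max_error_rate]
```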
Run the correlations script to get correlated attributes:

node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...]

Filter by those attributes and fetch trace samples or errors to confirm root cause. See APM Correlations script. Use resource attributes (k8s.pod.name, container.id, host.name) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation. Filter logs by service.name == "<service_name>" or trace.id == "<trace_id>" to explain behavior and root cause (exceptions, timeouts, restarts). State health (healthy / degraded / unhealthy) with reasons; list concrete next steps.
Scope with WHERE service.name == "<service_name>" and time range. Throughput and error rate (1-hour buckets; LIMIT caps rows and tokens):
FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure" BY BUCKET(@timestamp, 1 hour)
| EVAL error_rate = failures / request_count
| SORT @timestamp
| LIMIT 500
Latency percentiles and exact field names: see Kibana trace_charts_definition.ts.
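When a query like the one above is sent via POST _query, the JSON response uses a columns/values shape; a minimal parser for that shape:

```python
def esql_rows(response: dict) -> list:
    """Turn the columns/values shape returned by POST _query into row dicts."""
    names = [c["name"] for c in response["columns"]]
    return [dict(zip(names, row)) for row in response["values"]]
```

Row dicts keyed by column name are easier to compare against SLO targets or a baseline period than positional value lists.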
Run ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* for throughput, latency, and error rate; query dependency/service-map data (ES|QL or Kibana repo). Use resource attributes on spans/traces to get the runtimes (pods, containers, hosts) for the service. Then check CPU and memory for those resources in the same time window as the APM issue:
Identify the pod, container, or host via k8s.pod.name, k8s.namespace.name, container.id, or host.name. Query CPU and memory metrics (e.g. system.cpu.total.norm.pct); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that aligns with APM latency or error spikes. To understand behavior for a specific service or a single trace, filter logs accordingly:
Filter by service.name == "<service_name>" and the time range to get application logs (errors, warnings, restarts) in the service context. Take trace.id from the APM trace and filter logs by trace.id == "<trace_id>" (or the equivalent field in your log schema); logs with that trace ID show the full request path and help explain failures or latency.

Tooling recap: use ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (8.11+ or Serverless), filtering by service.name (and service.environment when relevant). For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, evaluate both rule types: rules where params.serviceName matches the target service, and rules where params.serviceName is absent (all-services rules). Treat either as applicable to the service before declaring health. Do not query .alerts* indices when determining currently active alerts; use the Alerting API response above as the source of truth. For APM correlations, run the apm-correlations script (see APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route handlers. For Elasticsearch index and search behavior, see the APIs in the Elasticsearch repo.

Weekly Installs
122
Repository
GitHub Stars
89
First Seen
11 days ago
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
cursor: 106
opencode: 99
gemini-cli: 99
github-copilot: 99
codex: 99
amp: 98