observability-llm-obs by elastic/agent-skills
npx skills add https://github.com/elastic/agent-skills --skill observability-llm-obs
Answer user questions about monitoring LLMs and agentic components using data ingested into Elastic only. Focus on LLM performance, cost and token utilization, response quality, and call chaining or agentic workflow orchestration. Use ES|QL, Elasticsearch APIs, and (where needed) Kibana APIs. Do not rely on the Kibana UI; the skill works without it. A given deployment typically uses one or more ingestion paths (APM/OTLP traces and/or integration metrics/logs): discover what is available before querying.
Traces land in traces-apm* when collected by the Elastic APM Agent, and in traces-generic.otel-default (and similar) when collected by OpenTelemetry. Use the generic pattern traces* to find all trace data regardless of source. When the application is instrumented with OpenTelemetry (e.g. Elastic Distributions of OpenTelemetry (EDOT), OpenLLMetry, OpenLIT, or Langtrace exporting to OTLP), LLM and agent spans land in these trace data streams; metrics may land in metrics-apm* or metrics-generic. Query traces* and metrics* data streams for per-request and aggregated LLM signals.

Integration data lands in metrics* and logs*, with a dataset/namespace per integration. Check which data streams exist.

Discover field names from the mappings (GET _data_stream, or GET traces*/_mapping, GET metrics*/_mapping) and optionally sample a document to see which LLM-related fields are present. Do not assume both APM and integration data exist.

When building queries against traces* or metrics data streams, use the elasticsearch-esql skill for ES|QL syntax, commands, and query patterns.

Check SLOs and alerting rules (on services backed by traces*, or on integration metrics). Firing alerts or violated/degrading SLOs point to potential degraded performance.

Spans from OTel/EDOT (and compatible SDKs) carry span attributes that may follow OpenTelemetry GenAI semantic conventions or provider-specific names. In Elasticsearch, attributes typically appear under span.attributes (exact key names depend on ingestion). Common attributes:
| Purpose | Example attribute names (OTel GenAI) |
|---|---|
| Operation / provider | gen_ai.operation.name, gen_ai.provider.name |
| Model | gen_ai.request.model, gen_ai.response.model |
| Token usage | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens |
| Request config | gen_ai.request.temperature, gen_ai.request.max_tokens |
| Errors | error.type |
| Conversation / agent | gen_ai.conversation.id; tool/agent spans as child spans |
Cost is not in the OTel spec; some instrumentations add custom attributes (e.g. llm.response.cost.usd_estimate). Discover actual field names from the index mapping or a sample document (e.g. span.attributes.* or flattened keys).
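As a sketch, once such a cost attribute is confirmed in the mapping, it can be rolled up per model; the field below is the hypothetical llm.response.cost.usd_estimate example from above, not a standard name:

```esql
FROM traces*
| WHERE @timestamp >= NOW() - 24 hours
    AND span.attributes.llm.response.cost.usd_estimate IS NOT NULL
| STATS est_cost_usd = SUM(span.attributes.llm.response.cost.usd_estimate)
  BY span.attributes.gen_ai.request.model
| SORT est_cost_usd DESC
| LIMIT 100
```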
Use duration and event.outcome on spans for latency and success/failure. Use trace.id, span.id, and parent/child span relationships to analyze call chaining and agentic workflows (e.g. one root span, multiple LLM or tool-call child spans).
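To check which of these attributes exist in a given deployment, one option is to sample a single LLM span in Kibana Dev Tools (the exists field assumes OTel GenAI naming under span.attributes; adjust to what the mapping shows):

```
GET traces*/_search
{
  "size": 1,
  "query": {
    "exists": { "field": "span.attributes.gen_ai.operation.name" }
  }
}
```

The returned document shows the actual attribute keys to use in ES|QL filters.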
Integrations (OpenAI, Azure OpenAI, Azure AI Foundry, Bedrock, Bedrock AgentCore, Vertex AI, etc.) ship metrics (and where supported logs) to Elastic. Metrics typically include token usage, request counts, latency, and—where the integration supports it—cost-related fields. Logs may include prompt/response or guardrail events. Exact field names and data streams are defined by each integration package; discover them from the integration docs or from the target data stream mapping.
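A quick way to confirm that a given integration is actually ingesting is to sample one document by dataset name (the wildcard value below is an assumption; match it to the integration package's dataset):

```
POST metrics-*/_search
{
  "size": 1,
  "query": { "wildcard": { "data_stream.dataset": "*bedrock*" } }
}
```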
1. Run GET _data_stream and filter for traces*, metrics-apm* (or metrics*), and metrics-* / logs-* that match known LLM integration datasets (e.g. from Elastic LLM observability).
2. For traces*, run a small search or use the mapping to see if spans contain gen_ai.* or llm.* (or similar) attributes. Confirm presence of token, model, and duration fields.
3. Performance: query traces* filtered by span attributes (e.g. gen_ai.operation.name or gen_ai.provider.name when present). Compute throughput (count per time bucket), latency (e.g. duration.us or span duration), and error rate (event.outcome == "failure") by model, service, or time.
4. Cost and token usage: aggregate over spans in traces*: sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (or equivalent attribute names) by time, model, or service. If a cost attribute exists (e.g. custom llm.response.cost.*), sum it for cost views.
5. Response quality: use event.outcome, error.type, and span attributes (e.g. gen_ai.response.finish_reasons) in traces* to identify failures, timeouts, or content filters. Correlate with prompts/responses if captured in attributes (e.g. gen_ai.input.messages, gen_ai.output.messages) and not redacted.
6. Orchestration: use the trace hierarchy in traces*. Filter by root service or trace attributes; group by trace.id and use parent/child span relationships (e.g. parent.id, span.id) to reconstruct chains (e.g. orchestration span → multiple LLM or tool-call spans). Aggregate by span name or gen_ai.operation.name to see the distribution of steps (e.g. retrieval, LLM, tool use). Duration per span and per trace gives bottleneck and end-to-end latency.
7. Always constrain queries by time range (@timestamp). When present, add service.name and optionally service.environment. For LLM-specific spans, filter by span attributes once you know the field names (e.g. a keyword field for gen_ai.provider.name or gen_ai.operation.name).
8. Use LIMIT, use coarse time buckets when only trends are needed, and avoid full scans over large windows.

LLM observability progress:
- [ ] Step 1: Determine available data (traces*, metrics-apm* or metrics*, or integration data streams)
- [ ] Step 2: Discover LLM-related field names (mapping or sample doc)
- [ ] Step 3: Run ES|QL or Elasticsearch queries for the user's question (performance, cost, quality, orchestration)
- [ ] Step 4: Check for active alerts or SLOs defined on LLM-related data (Alerting API, SLOs API); field names from Step 2 help identify related rules; firing alerts or violated/degrading SLOs indicate potential degraded performance
- [ ] Step 5: Summarize findings from ingested data only; include alert/SLO status when relevant
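For Step 4, both checks work without the UI via Kibana APIs; a sketch in Dev Tools kbn: syntax, where the search term gen_ai assumes the rules reference or are named after those fields:

```
GET kbn:/api/alerting/rules/_find?search=gen_ai
GET kbn:/api/observability/slos
```

Inspect rule params and SLO status in the responses; the SLOs endpoint is only present when the SLO feature is available in the deployment.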
Assume span attributes are available as span.attributes.gen_ai.usage.input_tokens and span.attributes.gen_ai.usage.output_tokens (adjust to actual field names from mapping):
```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.provider.name IS NOT NULL
| STATS
    input_tokens = SUM(span.attributes.gen_ai.usage.input_tokens),
    output_tokens = SUM(span.attributes.gen_ai.usage.output_tokens)
  BY hour = BUCKET(@timestamp, 1 hour), span.attributes.gen_ai.request.model
| SORT hour
| LIMIT 500
```
Request count, error rate, and average latency by model (TO_DOUBLE avoids integer division when computing the rate):

```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.request.model IS NOT NULL
| STATS
    request_count = COUNT(*),
    failures = COUNT(*) WHERE event.outcome == "failure",
    avg_duration_us = AVG(span.duration.us)
  BY span.attributes.gen_ai.request.model
| EVAL error_rate = TO_DOUBLE(failures) / request_count
| LIMIT 100
```
Get trace IDs that contain LLM spans and count those spans per trace to estimate chain length:

```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.operation.name IS NOT NULL
| STATS span_count = COUNT(*), total_duration_us = SUM(span.duration.us) BY trace.id
| WHERE span_count > 1
| SORT total_duration_us DESC
| LIMIT 50
```
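To drill into one chain from the results above, the spans of a single trace can be listed with their parent/child links (replace the placeholder trace ID):

```esql
FROM traces*
| WHERE trace.id == "<trace-id-from-previous-query>"
| KEEP @timestamp, span.id, parent.id, span.name, span.duration.us, span.attributes.gen_ai.operation.name
| SORT @timestamp
| LIMIT 200
```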
The Amazon Bedrock AgentCore integration ships metrics to the metrics-aws_bedrock_agentcore.metrics-* data stream (a time series index). Use TS for aggregations on time series data streams (Elasticsearch 9.2+); use a time range with TRANGE (9.3+). The integration also ships dashboards and alerting rule templates. Example: token usage (counter), invocations (counter), and average latency (gauge) by hour and agent:
```esql
TS metrics-aws_bedrock_agentcore.metrics-*
| WHERE TRANGE(7 days)
    AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(RATE(aws.bedrock_agentcore.metrics.TokenCount.sum)),
    total_invocations = SUM(RATE(aws.bedrock_agentcore.metrics.Invocations.sum)),
    avg_latency_ms = AVG(AVG_OVER_TIME(aws.bedrock_agentcore.metrics.Latency.avg))
  BY TBUCKET(1 hour), aws.bedrock_agentcore.agent_name
| SORT TBUCKET(1 hour) DESC
```
For Elasticsearch 8.x or when TS is not available, use FROM with BUCKET(@timestamp, 1 hour) and SUM/AVG over the metric fields (as in the integration's alert rule templates). For other LLM integrations (OpenAI, Azure OpenAI, Vertex AI, etc.), use that integration’s data stream index pattern and field names from its package (see Elastic LLM observability).
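A sketch of that 8.x fallback for the same data stream, reusing the field names from the TS example (verify them against the integration package):

```esql
FROM metrics-aws_bedrock_agentcore.metrics-*
| WHERE @timestamp >= NOW() - 7 days
    AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(aws.bedrock_agentcore.metrics.TokenCount.sum),
    total_invocations = SUM(aws.bedrock_agentcore.metrics.Invocations.sum),
    avg_latency_ms = AVG(aws.bedrock_agentcore.metrics.Latency.avg)
  BY hour = BUCKET(@timestamp, 1 hour), aws.bedrock_agentcore.agent_name
| SORT hour DESC
| LIMIT 500
```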
Only answer from data ingested into Elastic (traces in traces*, metrics, or integration metrics/logs). Do not describe or rely on other vendors' UIs or products. Confirm LLM-related attribute or metric names from _mapping or a sample document; naming may differ (e.g. gen_ai.* vs llm.* or integration-specific fields).

Weekly Installs: 158
Repository: elastic/agent-skills (GitHub)
GitHub Stars: 206
First Seen: 13 days ago
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass
Installed on: cursor (145), gemini-cli (138), github-copilot (138), codex (138), opencode (138), kimi-cli (137)