observability-llm-obs by elastic/agent-skills
npx skills add https://github.com/elastic/agent-skills --skill observability-llm-obs
Answer user questions about monitoring LLMs and agentic components using data ingested into Elastic only. Focus on LLM performance, cost and token utilization, response quality, and call chaining or agentic workflow orchestration. Use ES|QL, Elasticsearch APIs, and (where needed) Kibana APIs. Do not rely on the Kibana UI; the skill works without it. A given deployment typically uses one or more ingestion paths (APM/OTLP traces and/or integration metrics/logs): discover what is available before querying.
Traces land in traces-apm* when collected by the Elastic APM Agent, and in traces-generic.otel-default (and similar) when collected by OpenTelemetry. Use the generic pattern traces* to find all trace data regardless of source. When the application is instrumented with OpenTelemetry (e.g. Elastic Distributions of OpenTelemetry (EDOT), OpenLLMetry, OpenLIT, or Langtrace exporting to OTLP), LLM and agent spans land in these trace data streams; metrics may land in metrics-apm* or metrics-generic. Query traces* and metrics* data streams for per-request and aggregated LLM signals.

Integration data lands in metrics* and logs*, with a dataset/namespace per integration. Check which data streams exist.

Discover field names from the mappings (GET _data_stream, or GET traces*/_mapping, GET metrics*/_mapping) and optionally sample a document to see which LLM-related fields are present. Do not assume both APM and integration data exist.

When building queries against traces* or metrics data streams, use the elasticsearch-esql skill for ES|QL syntax, commands, and query patterns.

Check SLOs and alerting rules (on services backed by traces*, or on integration metrics). Firing alerts or violated/degrading SLOs point to potential degraded performance.

Spans from OTel/EDOT (and compatible SDKs) carry span attributes that may follow OpenTelemetry GenAI semantic conventions or provider-specific names. In Elasticsearch, attributes typically appear under span.attributes (exact key names depend on ingestion). Common attributes:
| Purpose | Example attribute names (OTel GenAI) |
|---|---|
| Operation / provider | gen_ai.operation.name, gen_ai.provider.name |
| Model | gen_ai.request.model, gen_ai.response.model |
| Token usage | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens |
| Request config | gen_ai.request.temperature, gen_ai.request.max_tokens |
| Errors | error.type |
| Conversation / agent | gen_ai.conversation.id; tool/agent spans as child spans |
Cost is not in the OTel spec; some instrumentations add custom attributes (e.g. llm.response.cost.usd_estimate). Discover actual field names from the index mapping or a sample document (e.g. span.attributes.* or flattened keys).
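As a sketch, once such a cost attribute is confirmed in the mapping, it can be rolled up per model; the field below is the hypothetical llm.response.cost.usd_estimate example from above, not a standard name:

```esql
FROM traces*
| WHERE @timestamp >= NOW() - 24 hours
    AND span.attributes.llm.response.cost.usd_estimate IS NOT NULL
| STATS est_cost_usd = SUM(span.attributes.llm.response.cost.usd_estimate)
  BY span.attributes.gen_ai.request.model
| SORT est_cost_usd DESC
| LIMIT 100
```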
Use duration and event.outcome on spans for latency and success/failure. Use trace.id, span.id, and parent/child span relationships to analyze call chaining and agentic workflows (e.g. one root span, multiple LLM or tool-call child spans).
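To check which of these attributes exist in a given deployment, one option is to sample a single LLM span in Kibana Dev Tools (the exists field assumes OTel GenAI naming under span.attributes; adjust to what the mapping shows):

```
GET traces*/_search
{
  "size": 1,
  "query": {
    "exists": { "field": "span.attributes.gen_ai.operation.name" }
  }
}
```

The returned document shows the actual attribute keys to use in ES|QL filters.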
Integrations (OpenAI, Azure OpenAI, Azure AI Foundry, Bedrock, Bedrock AgentCore, Vertex AI, etc.) ship metrics (and where supported logs) to Elastic. Metrics typically include token usage, request counts, latency, and—where the integration supports it—cost-related fields. Logs may include prompt/response or guardrail events. Exact field names and data streams are defined by each integration package; discover them from the integration docs or from the target data stream mapping.
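A quick way to confirm that a given integration is actually ingesting is to sample one document by dataset name (the wildcard value below is an assumption; match it to the integration package's dataset):

```
POST metrics-*/_search
{
  "size": 1,
  "query": { "wildcard": { "data_stream.dataset": "*bedrock*" } }
}
```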
1. Run GET _data_stream and filter for traces*, metrics-apm* (or metrics*), and metrics-* / logs-* that match known LLM integration datasets (e.g. from Elastic LLM observability).
2. For traces*, run a small search or use the mapping to see if spans contain gen_ai.* or llm.* (or similar) attributes. Confirm presence of token, model, and duration fields.
3. Performance: query traces* filtered by span attributes (e.g. gen_ai.operation.name or gen_ai.provider.name when present). Compute throughput (count per time bucket), latency (e.g. duration.us or span duration), and error rate (event.outcome == "failure") by model, service, or time.
4. Cost and token usage: aggregate over spans in traces*: sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (or equivalent attribute names) by time, model, or service. If a cost attribute exists (e.g. custom llm.response.cost.*), sum it for cost views.
5. Response quality: use event.outcome, error.type, and span attributes (e.g. gen_ai.response.finish_reasons) in traces* to identify failures, timeouts, or content filters. Correlate with prompts/responses if captured in attributes (e.g. gen_ai.input.messages, gen_ai.output.messages) and not redacted.
6. Orchestration: use the trace hierarchy in traces*. Filter by root service or trace attributes; group by trace.id and use parent/child span relationships (e.g. parent.id, span.id) to reconstruct chains (e.g. orchestration span → multiple LLM or tool-call spans). Aggregate by span name or gen_ai.operation.name to see the distribution of steps (e.g. retrieval, LLM, tool use). Duration per span and per trace gives bottleneck and end-to-end latency.
7. Always constrain queries by time range (@timestamp). When present, add service.name and optionally service.environment. For LLM-specific spans, filter by span attributes once you know the field names (e.g. a keyword field for gen_ai.provider.name or gen_ai.operation.name).
8. Use LIMIT, use coarse time buckets when only trends are needed, and avoid full scans over large windows.

LLM observability progress:
- [ ] Step 1: Determine available data (traces*, metrics-apm* or metrics*, or integration data streams)
- [ ] Step 2: Discover LLM-related field names (mapping or sample doc)
- [ ] Step 3: Run ES|QL or Elasticsearch queries for the user's question (performance, cost, quality, orchestration)
- [ ] Step 4: Check for active alerts or SLOs defined on LLM-related data (Alerting API, SLOs API); field names from Step 2 help identify related rules; firing alerts or violated/degrading SLOs indicate potential degraded performance
- [ ] Step 5: Summarize findings from ingested data only; include alert/SLO status when relevant
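For Step 4, both checks work without the UI via Kibana APIs; a sketch in Dev Tools kbn: syntax, where the search term gen_ai assumes the rules reference or are named after those fields:

```
GET kbn:/api/alerting/rules/_find?search=gen_ai
GET kbn:/api/observability/slos
```

Inspect rule params and SLO status in the responses; the SLOs endpoint is only present when the SLO feature is available in the deployment.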
Assume span attributes are available as span.attributes.gen_ai.usage.input_tokens and span.attributes.gen_ai.usage.output_tokens (adjust to actual field names from mapping):
```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.provider.name IS NOT NULL
| STATS
    input_tokens = SUM(span.attributes.gen_ai.usage.input_tokens),
    output_tokens = SUM(span.attributes.gen_ai.usage.output_tokens)
  BY hour = BUCKET(@timestamp, 1 hour), span.attributes.gen_ai.request.model
| SORT hour
| LIMIT 500
```
Request count, error rate, and average latency by model (TO_DOUBLE avoids integer division when computing the rate):

```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.request.model IS NOT NULL
| STATS
    request_count = COUNT(*),
    failures = COUNT(*) WHERE event.outcome == "failure",
    avg_duration_us = AVG(span.duration.us)
  BY span.attributes.gen_ai.request.model
| EVAL error_rate = TO_DOUBLE(failures) / request_count
| LIMIT 100
```
Get trace IDs that contain LLM spans and count those spans per trace to estimate chain length:

```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.operation.name IS NOT NULL
| STATS span_count = COUNT(*), total_duration_us = SUM(span.duration.us) BY trace.id
| WHERE span_count > 1
| SORT total_duration_us DESC
| LIMIT 50
```
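To drill into one chain from the results above, the spans of a single trace can be listed with their parent/child links (replace the placeholder trace ID):

```esql
FROM traces*
| WHERE trace.id == "<trace-id-from-previous-query>"
| KEEP @timestamp, span.id, parent.id, span.name, span.duration.us, span.attributes.gen_ai.operation.name
| SORT @timestamp
| LIMIT 200
```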
The Amazon Bedrock AgentCore integration ships metrics to the metrics-aws_bedrock_agentcore.metrics-* data stream (a time series index). Use TS for aggregations on time series data streams (Elasticsearch 9.2+); use a time range with TRANGE (9.3+). The integration also ships dashboards and alerting rule templates. Example: token usage (counter), invocations (counter), and average latency (gauge) by hour and agent:
```esql
TS metrics-aws_bedrock_agentcore.metrics-*
| WHERE TRANGE(7 days)
    AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(RATE(aws.bedrock_agentcore.metrics.TokenCount.sum)),
    total_invocations = SUM(RATE(aws.bedrock_agentcore.metrics.Invocations.sum)),
    avg_latency_ms = AVG(AVG_OVER_TIME(aws.bedrock_agentcore.metrics.Latency.avg))
  BY TBUCKET(1 hour), aws.bedrock_agentcore.agent_name
| SORT TBUCKET(1 hour) DESC
```
For Elasticsearch 8.x or when TS is not available, use FROM with BUCKET(@timestamp, 1 hour) and SUM/AVG over the metric fields (as in the integration's alert rule templates). For other LLM integrations (OpenAI, Azure OpenAI, Vertex AI, etc.), use that integration’s data stream index pattern and field names from its package (see Elastic LLM observability).
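A sketch of that 8.x fallback for the same data stream, reusing the field names from the TS example (verify them against the integration package):

```esql
FROM metrics-aws_bedrock_agentcore.metrics-*
| WHERE @timestamp >= NOW() - 7 days
    AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(aws.bedrock_agentcore.metrics.TokenCount.sum),
    total_invocations = SUM(aws.bedrock_agentcore.metrics.Invocations.sum),
    avg_latency_ms = AVG(aws.bedrock_agentcore.metrics.Latency.avg)
  BY hour = BUCKET(@timestamp, 1 hour), aws.bedrock_agentcore.agent_name
| SORT hour DESC
| LIMIT 500
```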
Only answer from data ingested into Elastic (traces in traces*, metrics, or integration metrics/logs). Do not describe or rely on other vendors' UIs or products. Confirm LLM-related attribute or metric names from _mapping or a sample document; naming may differ (e.g. gen_ai.* vs llm.* or integration-specific fields).

Weekly Installs: 158
Repository: elastic/agent-skills (GitHub)
GitHub Stars: 206
First Seen: 13 days ago
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass
Installed on: cursor (145), gemini-cli (138), github-copilot (138), codex (138), opencode (138), kimi-cli (137)