observability-logs-search by elastic/agent-skills
npx skills add https://github.com/elastic/agent-skills --skill observability-logs-search
Search and filter logs to support incident investigation. The workflow mirrors Kibana Discover: apply a time range and scope filter, then iteratively add exclusion filters (NOT) until a small, interesting subset of logs remains—either the root cause or the key document. Optionally view logs in context (preceding and following that document) or pivot to another entity and start a fresh search. Use ES|QL only (POST /_query); do not use Query DSL.
Use consistent parameter names for Observability log search:
| Parameter | Type | Description |
|---|---|---|
| start | string | Start of time range (Elasticsearch date math, e.g. now-1h) |
| end | string | End of time range (e.g. now) |
| kqlFilter | string | KQL query string to narrow results. Not query, filter, or kql. |
| limit | number | Maximum log samples to return (e.g. 10–100) |
| groupBy | string | Optional field to group the histogram by (e.g. log.level, service.name) |
For entity filters, use ECS field names: service.name, host.name, service.environment, kubernetes.pod.name, kubernetes.namespace. Query ECS names only; OpenTelemetry aliases map automatically in Observability indices.
Keep the context window small. In the sample branch of the query, KEEP only a subset of fields; do not return full documents by default. A small summary (e.g. 10 docs with KEEP) stays under ~1000 tokens; a single full JSON doc can exceed 4000 tokens.
Recommended KEEP list for sample logs:
message, error.message, service.name, container.name, host.name, container.id, agent.name, kubernetes.container.name, kubernetes.node.name, kubernetes.namespace, kubernetes.pod.name
Message fallback: If present, use the first non-empty of: body.text (OTel), message, error.message, event.original, exception.message, error.exception.message, attributes.exception.message (OTel). Observability index templates often alias these; when building a single “message” for display, prefer that order.
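The fallback order above can also be applied client-side when assembling a display message from a returned sample row. A minimal Python sketch, assuming the row has been flattened to a field-name-to-value dict (that dict shape is an assumption, not part of the API):

```python
# Field order taken from the "Message fallback" guidance above.
FALLBACK_ORDER = [
    "body.text", "message", "error.message", "event.original",
    "exception.message", "error.exception.message",
    "attributes.exception.message",
]

def display_message(doc):
    """Return the first non-empty fallback field from a flat field->value dict."""
    for field in FALLBACK_ORDER:
        value = doc.get(field)
        if value:  # skip missing and empty values
            return value
    return ""
```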
Limit samples: Default to a small sample (10–20 logs) per query. Cap at 500; do not fetch thousands in one call. Each funnel step is only to decide the next call—only the final narrowed result is the one to keep in context and summarize.
You must iterate. Do not stop after one query. Keep excluding noise with NOT until fewer than 20 log patterns (distinct message categories) remain. Always keep the full filter when iterating: concatenate new NOTs to the previous KQL; do not “zoom out” or drop earlier exclusions.
The funnel workflow:
1. Start with a scope filter (e.g. service.name: advertService) and time range. Get total count, histogram, sample logs, and message categorization (common + rare patterns).
2. Add NOT clauses to the KQL filter for the dominant noise patterns. Run the query again with the full filter (all previous NOTs plus new ones).
3. Repeat: keep adding NOT clauses and re-running with the full filter. Do not stop after one or two rounds. Continue until fewer than 20 log patterns remain (use the categorization result to count distinct message categories). Then the remaining set is small enough to interpret as the interesting bits (errors, anomalies, root cause).
4. If the remaining logs point to a specific entity (e.g. container.id, host.name), run one more query focused on that entity to see its “dying words” or surrounding context.
If you stop before reaching fewer than 20 log patterns, you will report noise instead of the actual failures. Each intermediate result is only for deciding the next call; only the final narrowed result should be kept in context and summarized.
Use ES|QL (POST /_query) only; do not use Query DSL. Always return in one request: a time-series histogram, total count, a small sample of logs, and message categorization (common and rare patterns). The histogram is the primary signal—it shows when spikes or drops occur and guides the next filter. Use FORK to compute trend, total, samples, and categorization in a single query.
FORK output interpretation: The response contains multiple result sets identified by a _fork column (or equivalent). Map them as: fork1 = trend (count per time bucket), fork2 = total count (single row), fork3 = sample logs, fork4 = common message patterns (top 20 by count, from up to 10k logs), fork5 = rare message patterns (bottom 20 by count, from up to 10k logs). Use fork1 to spot when to narrow the time range; use fork2 to see how much noise remains; use fork3 to decide which NOTs to add next; use fork4 and fork5 to see how many distinct log patterns remain and to choose the next exclusions—continue iterating until fewer than 20 log patterns remain.
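As a sketch of consuming that response shape in a client: the exact `_fork` column name and the columns/values layout below are assumptions about the POST /_query response, as the text notes (“or equivalent”); adjust to what your Elasticsearch version returns.

```python
def split_forks(columns, values):
    """Group ES|QL result rows into per-fork lists keyed by the _fork column.

    columns: list of {"name": ...} dicts as returned by POST /_query.
    values:  list of rows aligned positionally with columns.
    """
    names = [c["name"] for c in columns]
    fork_idx = names.index("_fork")
    result = {}
    for row in values:
        fork_id = row[fork_idx]
        # Keep every column except the _fork discriminator itself.
        record = {n: v for n, v in zip(names, row) if n != "_fork"}
        result.setdefault(fork_id, []).append(record)
    return result
```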
Quote exact phrases (e.g. message: "GET /health", service.name: "advertService"). Use wildcards for substring matches (e.g. message: *Returning*, message: *WARNING*). Do not put wildcard characters inside quoted phrases.
Example filters: service.name: "payment-api", message: "GET /health", NOT kubernetes.namespace: "kube-system", error.message: * AND NOT message: "Known benign warning".
Filtering on log.level (e.g. log.level: error) can be useful but is often flawed: many logs have missing or incorrect level metadata (e.g. everything is "info", or the level only appears in the message text). When looking for failures, prefer funneling by message content or error.message; treat log.level as a hint, not a reliable filter.
Include message categorization so you can count distinct log patterns and iterate until fewer than 20 remain. Use a five-way FORK: trend, total, samples, common patterns, rare patterns.
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= TO_DATETIME(\"2025-03-06T10:00:00.000Z\") AND @timestamp <= TO_DATETIME(\"2025-03-06T11:00:00.000Z\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 10 | KEEP _id, _index, message, error.message, service.name, container.name, host.name, kubernetes.container.name, kubernetes.node.name, kubernetes.namespace, kubernetes.pod.name) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` DESC | LIMIT 20) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` ASC | LIMIT 20)"
}
Adjust the index pattern (e.g. logs-*, logs-*-*), time range, and bucket size (e.g. 30s, 5m, 1h). Keep sample LIMIT small (10–20 by default; cap at 500). Use KEEP so the sample branch returns only summary fields, not full documents.
Narrow results with KQL("..."). The KQL expression is a single double-quoted string in ES|QL.
Escaping in the request body: The query is sent inside JSON, so every double quote that is part of the ES|QL string must be escaped. Use \" for the quotes that wrap the KQL expression. If the KQL expression itself contains double quotes (e.g. a phrase like message: "GET /health"), escape those in the JSON as \\\" so the KQL parser receives literal quote characters.
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= TO_DATETIME(\"2025-03-06T10:00:00.000Z\") AND @timestamp <= TO_DATETIME(\"2025-03-06T11:00:00.000Z\") | WHERE KQL(\"service.name: checkout AND log.level: error\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 10 | KEEP _id, _index, message, error.message, service.name, host.name, kubernetes.pod.name) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` DESC | LIMIT 20) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` ASC | LIMIT 20)"
}
Build the funnel by excluding known noise. In the request body, wrap the KQL string in \"...\" and escape any quotes inside the KQL expression as \\\":
"query": "... | WHERE KQL(\"NOT message: \\\"GET /health\\\" AND NOT kubernetes.namespace: \\\"kube-system\\\"\") | ..."
"query": "... | WHERE KQL(\"error.message: * AND NOT message: \\\"Known benign warning\\\"\") | ..."
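When building these bodies programmatically, it is easier to escape only the ES|QL layer by hand and let a JSON encoder add the outer layer. A minimal Python sketch; the pipeline around the KQL clause is illustrative, not prescribed:

```python
import json

def build_query_body(kql):
    """Build a POST /_query body from a raw KQL expression.

    We escape double quotes at the ES|QL string level; json.dumps then adds
    the JSON-level escaping, yielding \" for the KQL("...") wrapper and
    \\\" for quotes inside the expression, as described above.
    """
    esql_kql = kql.replace('"', '\\"')
    esql = (
        "FROM logs-* METADATA _id, _index "
        f'| WHERE KQL("{esql_kql}") '
        "| SORT @timestamp DESC | LIMIT 10"
    )
    return json.dumps({"query": esql})

body = build_query_body('NOT message: "GET /health"')
# body contains ... WHERE KQL(\"NOT message: \\\"GET /health\\\"\") ...
```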
Break down the trend by a second dimension (e.g. log.level, service.name) to see which level or entity drives the spike:
STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m), log.level
Use a limited set of group values in the response to avoid explosion (e.g. top N by count, rest as _other).
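One way to get “top N by count, rest as _other” is to collapse the breakdown client-side after the query returns. A hedged sketch; the (count, bucket, group) row shape is an assumption about how you read the trend fork:

```python
from collections import Counter

def collapse_groups(rows, n=3):
    """Keep the top-n groups by overall count; fold the rest into "_other".

    rows: iterable of (count, bucket, group) tuples from the breakdown query.
    Returns {(bucket, group_or_other): count}.
    """
    totals = Counter()
    for count, _bucket, group in rows:
        totals[group] += count
    top = {g for g, _ in totals.most_common(n)}
    collapsed = Counter()
    for count, bucket, group in rows:
        collapsed[(bucket, group if group in top else "_other")] += count
    return dict(collapsed)
```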
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= NOW() - 1 hour AND @timestamp <= NOW() | WHERE KQL(\"service.name: api-gateway\") | SORT @timestamp DESC | LIMIT 20"
}
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= NOW() - 2 hours AND @timestamp <= NOW() | WHERE KQL(\"log.level: error\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 5m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 15)"
}
Do not stop after one exclusion. Each round, add more NOTs for the current top noise, then run again.
Round 1: KQL("service.name: advertService") → e.g. 55k logs; samples show "Returning N ads", "WARNING: request...", "received ad request".
Round 2: Exclude the biggest noise:
KQL("service.name: advertService AND NOT message: *Returning* AND NOT message: *WARNING*") → re-run, check new total and samples.
Round 3: Exclude next noise (e.g. request/cache chatter):
KQL("service.name: advertService AND NOT message: *Returning* AND NOT message: *WARNING* AND NOT message: *received ad request* AND NOT message: *Adding* AND NOT message: *Cache miss*") → re-run.
Round 4+: Keep adding NOTs for whatever still dominates the samples (use fork4/fork5 categorization to see patterns). Continue until fewer than 20 log patterns remain; then what remains is the signal to report (e.g. "error fetching ads", encoding issues).
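The round-by-round growth of the filter can be mechanized: keep the exclusion list and regenerate the full KQL each round instead of editing it by hand. A minimal sketch matching the rounds above (the NOT message: *pattern* form is taken from those examples):

```python
def funnel_kql(scope, exclusions):
    """Build the full KQL filter: scope plus one NOT wildcard clause per pattern."""
    clauses = [scope] + [f"NOT message: *{p}*" for p in exclusions]
    return " AND ".join(clauses)
```

Each round, append the newly identified noise patterns to the list and rebuild, so no earlier exclusion is ever dropped.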
Escaping: wrap the KQL string in \"...\" in the JSON; for quoted phrases inside KQL use \\\".
The query value is JSON. Escape double quotes in the ES|QL string: \" for the KQL wrapper, \\\" for quotes inside the KQL expression (e.g. phrase values). Use Elasticsearch date math for start and end (e.g. now-1h, now-15m) when building queries programmatically.

Weekly Installs: 145
Repository: elastic/agent-skills
GitHub Stars: 89
First Seen: 11 days ago
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: cursor (124), codex (116), gemini-cli (115), github-copilot (115), opencode (115), amp (114)
Bucket size: pick one appropriate to the time range (e.g. 1m or 2m). log.level: Filtering or grouping by it can be OK but is often unreliable when levels are missing or mis-set; prefer message content or error.message for finding failures.