observability-logs-search by elastic/agent-skills
npx skills add https://github.com/elastic/agent-skills --skill observability-logs-search
Search and filter logs to support incident investigation. The workflow mirrors Kibana Discover: apply a time range and scope filter, then iteratively add exclusion filters (NOT) until a small, interesting subset of logs remains—either the root cause or the key document. Optionally view logs in context (preceding and following that document) or pivot to another entity and start a fresh search. Use ES|QL only (POST /_query); do not use Query DSL.
Use consistent parameter names for Observability log search:
| Parameter | Type | Description |
|---|---|---|
| start | string | Start of time range (Elasticsearch date math, e.g. now-1h) |
| end | string | End of time range (e.g. now) |
| kqlFilter | string | KQL query string to narrow results. Not query, filter, or kql. |
| limit | number | Maximum log samples to return (e.g. 10–100) |
| groupBy | string | Optional field to group the histogram by (e.g. log.level, service.name) |
For entity filters, use ECS field names: service.name, host.name, service.environment, kubernetes.pod.name, kubernetes.namespace. Query ECS names only; OpenTelemetry aliases map automatically in Observability indices.
Keep the context window small. In the sample branch of the query, KEEP only a subset of fields; do not return full documents by default. A small summary (e.g. 10 docs with KEEP) stays under ~1000 tokens; a single full JSON doc can exceed 4000 tokens.
Recommended KEEP list for sample logs:
message, error.message, service.name, container.name, host.name, container.id, agent.name, kubernetes.container.name, kubernetes.node.name, kubernetes.namespace, kubernetes.pod.name
Message fallback: If present, use the first non-empty of: body.text (OTel), message, error.message, event.original, exception.message, error.exception.message, attributes.exception.message (OTel). Observability index templates often alias these; when building a single “message” for display, prefer that order.
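The fallback order above can also be applied client-side when assembling a display message from a returned sample row. A minimal Python sketch, assuming the row has been flattened to a field-name-to-value dict (that dict shape is an assumption, not part of the API):

```python
# Field order taken from the "Message fallback" guidance above.
FALLBACK_ORDER = [
    "body.text", "message", "error.message", "event.original",
    "exception.message", "error.exception.message",
    "attributes.exception.message",
]

def display_message(doc):
    """Return the first non-empty fallback field from a flat field->value dict."""
    for field in FALLBACK_ORDER:
        value = doc.get(field)
        if value:  # skip missing and empty values
            return value
    return ""
```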
Limit samples: Default to a small sample (10–20 logs) per query. Cap at 500; do not fetch thousands in one call. Each funnel step is only to decide the next call—only the final narrowed result is the one to keep in context and summarize.
You must iterate. Do not stop after one query. Keep excluding noise with NOT until fewer than 20 log patterns (distinct message categories) remain. Always keep the full filter when iterating: concatenate new NOTs to the previous KQL; do not “zoom out” or drop earlier exclusions.
The funnel workflow:
1. Start with a scope filter (e.g. service.name: advertService) and time range. Get total count, histogram, sample logs, and message categorization (common + rare patterns).
2. Add NOT clauses to the KQL filter for the dominant noise patterns. Run the query again with the full filter (all previous NOTs plus new ones).
3. Repeat: keep adding NOT clauses and re-running with the full filter. Do not stop after one or two rounds. Continue until fewer than 20 log patterns remain (use the categorization result to count distinct message categories). Then the remaining set is small enough to interpret as the interesting bits (errors, anomalies, root cause).
4. If the remaining logs point to a specific entity (e.g. container.id, host.name), run one more query focused on that entity to see its “dying words” or surrounding context.
If you stop before reaching fewer than 20 log patterns, you will report noise instead of the actual failures. Each intermediate result is only for deciding the next call; only the final narrowed result should be kept in context and summarized.
Use ES|QL (POST /_query) only; do not use Query DSL. Always return in one request: a time-series histogram, total count, a small sample of logs, and message categorization (common and rare patterns). The histogram is the primary signal—it shows when spikes or drops occur and guides the next filter. Use FORK to compute trend, total, samples, and categorization in a single query.
FORK output interpretation: The response contains multiple result sets identified by a _fork column (or equivalent). Map them as: fork1 = trend (count per time bucket), fork2 = total count (single row), fork3 = sample logs, fork4 = common message patterns (top 20 by count, from up to 10k logs), fork5 = rare message patterns (bottom 20 by count, from up to 10k logs). Use fork1 to spot when to narrow the time range; use fork2 to see how much noise remains; use fork3 to decide which NOTs to add next; use fork4 and fork5 to see how many distinct log patterns remain and to choose the next exclusions—continue iterating until fewer than 20 log patterns remain.
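As a sketch of consuming that response shape in a client: the exact `_fork` column name and the columns/values layout below are assumptions about the POST /_query response, as the text notes (“or equivalent”); adjust to what your Elasticsearch version returns.

```python
def split_forks(columns, values):
    """Group ES|QL result rows into per-fork lists keyed by the _fork column.

    columns: list of {"name": ...} dicts as returned by POST /_query.
    values:  list of rows aligned positionally with columns.
    """
    names = [c["name"] for c in columns]
    fork_idx = names.index("_fork")
    result = {}
    for row in values:
        fork_id = row[fork_idx]
        # Keep every column except the _fork discriminator itself.
        record = {n: v for n, v in zip(names, row) if n != "_fork"}
        result.setdefault(fork_id, []).append(record)
    return result
```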
Quote exact phrases (e.g. message: "GET /health", service.name: "advertService"). Use wildcards for substring matches (e.g. message: *Returning*, message: *WARNING*). Do not put wildcard characters inside quoted phrases.
Example filters: service.name: "payment-api", message: "GET /health", NOT kubernetes.namespace: "kube-system", error.message: * AND NOT message: "Known benign warning".
Filtering on log.level (e.g. log.level: error) can be useful but is often flawed: many logs have missing or incorrect level metadata (e.g. everything is "info", or the level only appears in the message text). When looking for failures, prefer funneling by message content or error.message; treat log.level as a hint, not a reliable filter.
Include message categorization so you can count distinct log patterns and iterate until fewer than 20 remain. Use a five-way FORK: trend, total, samples, common patterns, rare patterns.
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= TO_DATETIME(\"2025-03-06T10:00:00.000Z\") AND @timestamp <= TO_DATETIME(\"2025-03-06T11:00:00.000Z\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 10 | KEEP _id, _index, message, error.message, service.name, container.name, host.name, kubernetes.container.name, kubernetes.node.name, kubernetes.namespace, kubernetes.pod.name) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` DESC | LIMIT 20) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` ASC | LIMIT 20)"
}
Adjust the index pattern (e.g. logs-*, logs-*-*), time range, and bucket size (e.g. 30s, 5m, 1h). Keep sample LIMIT small (10–20 by default; cap at 500). Use KEEP so the sample branch returns only summary fields, not full documents.
Narrow results with KQL("..."). The KQL expression is a single double-quoted string in ES|QL.
Escaping in the request body: The query is sent inside JSON, so every double quote that is part of the ES|QL string must be escaped. Use \" for the quotes that wrap the KQL expression. If the KQL expression itself contains double quotes (e.g. a phrase like message: "GET /health"), escape those in the JSON as \\\" so the KQL parser receives literal quote characters.
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= TO_DATETIME(\"2025-03-06T10:00:00.000Z\") AND @timestamp <= TO_DATETIME(\"2025-03-06T11:00:00.000Z\") | WHERE KQL(\"service.name: checkout AND log.level: error\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 10 | KEEP _id, _index, message, error.message, service.name, host.name, kubernetes.pod.name) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` DESC | LIMIT 20) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` ASC | LIMIT 20)"
}
Build the funnel by excluding known noise. In the request body, wrap the KQL string in \"...\" and escape any quotes inside the KQL expression as \\\":
"query": "... | WHERE KQL(\"NOT message: \\\"GET /health\\\" AND NOT kubernetes.namespace: \\\"kube-system\\\"\") | ..."
"query": "... | WHERE KQL(\"error.message: * AND NOT message: \\\"Known benign warning\\\"\") | ..."
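When building these bodies programmatically, it is easier to escape only the ES|QL layer by hand and let a JSON encoder add the outer layer. A minimal Python sketch; the pipeline around the KQL clause is illustrative, not prescribed:

```python
import json

def build_query_body(kql):
    """Build a POST /_query body from a raw KQL expression.

    We escape double quotes at the ES|QL string level; json.dumps then adds
    the JSON-level escaping, yielding \" for the KQL("...") wrapper and
    \\\" for quotes inside the expression, as described above.
    """
    esql_kql = kql.replace('"', '\\"')
    esql = (
        "FROM logs-* METADATA _id, _index "
        f'| WHERE KQL("{esql_kql}") '
        "| SORT @timestamp DESC | LIMIT 10"
    )
    return json.dumps({"query": esql})

body = build_query_body('NOT message: "GET /health"')
# body contains ... WHERE KQL(\"NOT message: \\\"GET /health\\\"\") ...
```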
Break down the trend by a second dimension (e.g. log.level, service.name) to see which level or entity drives the spike:
STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m), log.level
Use a limited set of group values in the response to avoid explosion (e.g. top N by count, rest as _other).
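One way to get “top N by count, rest as _other” is to collapse the breakdown client-side after the query returns. A hedged sketch; the (count, bucket, group) row shape is an assumption about how you read the trend fork:

```python
from collections import Counter

def collapse_groups(rows, n=3):
    """Keep the top-n groups by overall count; fold the rest into "_other".

    rows: iterable of (count, bucket, group) tuples from the breakdown query.
    Returns {(bucket, group_or_other): count}.
    """
    totals = Counter()
    for count, _bucket, group in rows:
        totals[group] += count
    top = {g for g, _ in totals.most_common(n)}
    collapsed = Counter()
    for count, bucket, group in rows:
        collapsed[(bucket, group if group in top else "_other")] += count
    return dict(collapsed)
```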
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= NOW() - 1 hour AND @timestamp <= NOW() | WHERE KQL(\"service.name: api-gateway\") | SORT @timestamp DESC | LIMIT 20"
}
POST /_query
{
"query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= NOW() - 2 hours AND @timestamp <= NOW() | WHERE KQL(\"log.level: error\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 5m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 15)"
}
Do not stop after one exclusion. Each round, add more NOTs for the current top noise, then run again.
Round 1: KQL("service.name: advertService") → e.g. 55k logs; samples show "Returning N ads", "WARNING: request...", "received ad request".
Round 2: Exclude the biggest noise:
KQL("service.name: advertService AND NOT message: *Returning* AND NOT message: *WARNING*") → re-run, check new total and samples.
Round 3: Exclude next noise (e.g. request/cache chatter):
KQL("service.name: advertService AND NOT message: *Returning* AND NOT message: *WARNING* AND NOT message: *received ad request* AND NOT message: *Adding* AND NOT message: *Cache miss*") → re-run.
Round 4+: Keep adding NOTs for whatever still dominates the samples (use fork4/fork5 categorization to see patterns). Continue until fewer than 20 log patterns remain; then what remains is the signal to report (e.g. "error fetching ads", encoding issues).
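The round-by-round growth of the filter can be mechanized: keep the exclusion list and regenerate the full KQL each round instead of editing it by hand. A minimal sketch matching the rounds above (the NOT message: *pattern* form is taken from those examples):

```python
def funnel_kql(scope, exclusions):
    """Build the full KQL filter: scope plus one NOT wildcard clause per pattern."""
    clauses = [scope] + [f"NOT message: *{p}*" for p in exclusions]
    return " AND ".join(clauses)
```

Each round, append the newly identified noise patterns to the list and rebuild, so no earlier exclusion is ever dropped.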
Escaping: wrap the KQL string in \"...\" in the JSON; for quoted phrases inside KQL use \\\".
The query value is JSON. Escape double quotes in the ES|QL string: \" for the KQL wrapper, \\\" for quotes inside the KQL expression (e.g. phrase values). Use Elasticsearch date math for start and end (e.g. now-1h, now-15m) when building queries programmatically.

Weekly Installs: 145
Repository: elastic/agent-skills
GitHub Stars: 89
First Seen: 11 days ago
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: cursor (124), codex (116), gemini-cli (115), github-copilot (115), opencode (115), amp (114)
Bucket size: pick one appropriate to the time range (e.g. 1m or 2m). log.level: Filtering or grouping by it can be OK but is often unreliable when levels are missing or mis-set; prefer message content or error.message for finding failures.