analyzing-mlflow-trace by mlflow/skills
npx skills add https://github.com/mlflow/skills --skill analyzing-mlflow-trace
A trace captures the full execution of an AI/ML application as a tree of spans. Each span represents one operation (LLM call, tool invocation, retrieval step, etc.) and records its inputs, outputs, timing, and status. Traces also carry assessments — feedback from humans or LLM judges about quality.
It is recommended to read references/trace-structure.md before analyzing a trace — it covers the complete data model, all fields and types, analysis guidance, and OpenTelemetry compatibility notes.
Traces can be 100KB+ for complex agent executions. Always redirect output to a file — do not pipe mlflow traces get directly to jq, head, or other commands, as piping can silently produce no output.
# Fetch full trace to a file (traces get always outputs JSON, no --output flag needed)
mlflow traces get --trace-id <ID> > /tmp/trace.json
# Then process the file
jq '.info.state' /tmp/trace.json
jq '.data.spans | length' /tmp/trace.json
Prefer fetching the full trace and parsing the JSON directly rather than using --extract-fields. The --extract-fields flag has limited support for nested span data (e.g., span inputs/outputs may return empty objects). Fetch the complete trace once and parse it as needed.
The trace JSON has two top-level keys: info (metadata, assessments) and data (spans).
{
"info": { "trace_id", "state", "request_time", "assessments", ... },
"data": { "spans": [ { "span_id", "name", "status", "attributes", ... } ] }
}
Key paths (verified against actual CLI output):
| What | jq path |
|---|---|
| Trace state | .info.state |
| All spans | .data.spans |
| Root span | .data.spans[] \| select(.parent_id == null) |
| Span status code | .data.spans[].status.code (values: STATUS_CODE_OK, STATUS_CODE_ERROR, STATUS_CODE_UNSET) |
| Span status message | .data.spans[].status.message |
| Span inputs | .data.spans[].attributes["mlflow.spanInputs"] |
| Span outputs | .data.spans[].attributes["mlflow.spanOutputs"] |
| Assessments | .info.assessments |
| Assessment name | .info.assessments[].assessment_name |
| Feedback value | .info.assessments[].feedback.value |
| Feedback error | .info.assessments[].feedback.error |
| Assessment rationale | .info.assessments[].rationale |
Important: Span inputs and outputs are stored as serialized JSON strings inside attributes, not as top-level span fields. Traces from third-party OpenTelemetry clients may use different attribute names (e.g., GenAI Semantic Conventions, OpenInference, or custom keys) — check the raw attributes dict to find the equivalent fields.
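Because inputs and outputs are serialized strings, a plain path lookup returns a string, not an object; pipe the value through jq's fromjson to decode it. A minimal sketch against an illustrative sample file (a real trace comes from mlflow traces get and has many more fields):

```shell
# Illustrative sample mirroring the structure above
cat > /tmp/sample_span_io.json <<'EOF'
{
  "info": {"state": "OK", "assessments": []},
  "data": {"spans": [
    {"name": "generate_response",
     "status": {"code": "STATUS_CODE_OK"},
     "attributes": {
       "mlflow.spanInputs": "{\"query\": \"What is our refund policy?\"}",
       "mlflow.spanOutputs": "{\"content\": \"...\"}"
     }}
  ]}
}
EOF

# Without fromjson this prints a quoted string; with it, a JSON object
jq -c '.data.spans[0].attributes["mlflow.spanInputs"] | fromjson' /tmp/sample_span_io.json
# → {"query":"What is our refund policy?"}
```

The same filter works unchanged on /tmp/trace.json once the attribute key matches your client's naming.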
If paths don't match (structure may vary by MLflow version), discover them:
# Top-level keys
jq 'keys' /tmp/trace.json
# Span keys
jq '.data.spans[0] | keys' /tmp/trace.json
# Status structure
jq '.data.spans[0].status' /tmp/trace.json
After fetching a trace to a file, run this to get a summary:
jq '{
state: .info.state,
span_count: (.data.spans | length),
error_spans: [.data.spans[] | select(.status.code == "STATUS_CODE_ERROR") | .name],
assessment_errors: [.info.assessments[] | select(.feedback.error) | .assessment_name]
}' /tmp/trace.json
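If the summary lists error spans, drill into each failing span's status message next. A sketch over an illustrative sample file (the same filter applies to a real /tmp/trace.json):

```shell
# Illustrative sample with one failing span
cat > /tmp/sample_errors.json <<'EOF'
{
  "info": {"state": "ERROR", "assessments": []},
  "data": {"spans": [
    {"name": "plan_action", "status": {"code": "STATUS_CODE_OK", "message": ""}},
    {"name": "search_kb", "status": {"code": "STATUS_CODE_ERROR", "message": "timeout after 30s"}}
  ]}
}
EOF

# Name and status message of every failed span
jq -c '.data.spans[]
       | select(.status.code == "STATUS_CODE_ERROR")
       | {name, message: .status.message}' /tmp/sample_errors.json
# → {"name":"search_kb","message":"timeout after 30s"}
```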
state: OK does not mean correct output. It only means no unhandled exception occurred. Check assessments for quality signals, and if none exist, analyze the trace's inputs, outputs, and intermediate span data directly for issues.

Read the rationale when interpreting assessment values. The value alone can be misleading — for example, a user_frustration assessment with value: "no" could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The .rationale field (a top-level assessment field, not nested under .feedback) explains what the value means in context and often describes the issue in plain language before you need to examine any spans.

MLflow Tracing captures inputs, outputs, and metadata from different parts of an application's call stack. By correlating trace contents with the source code, issues can be root-caused more precisely than from the trace alone.
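To read values and rationales together rather than values alone, one jq sketch (the sample assessment below is illustrative; real ones live in .info.assessments):

```shell
# Illustrative sample with one assessment
cat > /tmp/sample_assessments.json <<'EOF'
{
  "info": {
    "state": "OK",
    "assessments": [
      {"assessment_name": "user_frustration",
       "feedback": {"value": "no"},
       "rationale": "No signs of frustration in the conversation."}
    ]
  },
  "data": {"spans": []}
}
EOF

# Pair each feedback value with its sibling rationale
jq -c '[.info.assessments[]
        | {name: .assessment_name, value: .feedback.value, rationale}]' /tmp/sample_assessments.json
```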
Span names match the functions decorated with @mlflow.trace or wrapped in mlflow.start_span(). For autologged spans (LangChain, OpenAI, etc.), names follow framework conventions instead (e.g., ChatOpenAI, RetrievalQA).

Example: a user reports that their customer support agent gave an incorrect answer for the query "What is our refund policy?" There are no assessments on the trace.
1. Fetch the trace and check high-level signals.
The trace has state: OK — no crash occurred. No assessments are present, so examine the trace's inputs and outputs directly. The response_preview says "Our shipping policy states that orders are delivered within 3-5 business days..." — this answers a different question than what was asked.
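The step-1 check can be done in one jq call. Note that request_preview and response_preview are assumed here to be fields on .info; the exact names can vary by MLflow version, so verify with jq '.info | keys' first. Illustrative sample:

```shell
# Illustrative sample of the trace-level previews
cat > /tmp/sample_preview.json <<'EOF'
{
  "info": {
    "state": "OK",
    "request_preview": "What is our refund policy?",
    "response_preview": "Our shipping policy states that orders are delivered within 3-5 business days..."
  },
  "data": {"spans": []}
}
EOF

# State plus request/response previews in one view
jq '{state: .info.state, request: .info.request_preview, response: .info.response_preview}' /tmp/sample_preview.json
```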
2. Examine spans to locate the problem.
The span tree shows:
customer_support_agent (AGENT) — OK
├── plan_action (LLM) — OK
│ outputs: {"tool_call": "search_knowledge_base", "args": {"query": "refund policy"}}
├── search_knowledge_base (TOOL) — OK
│ inputs: {"query": "refund policy"}
│ outputs: [{"doc": "Shipping takes 3-5 business days...", "score": 0.82}]
├── generate_response (LLM) — OK
│ inputs: {"messages": [..., {"role": "user", "content": "Context: Shipping takes 3-5 business days..."}]}
│ outputs: {"content": "Our shipping policy states..."}
The agent correctly decided to search for "refund policy," but the search_knowledge_base tool returned a shipping document. The LLM then faithfully answered using the wrong context. The problem is in the tool's retrieval, not the agent's reasoning or the LLM's generation.
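A flat listing of span names and statuses, like the tree above minus the indentation, can be pulled with jq. The sample file below mirrors the spans shown and is illustrative only:

```shell
# Illustrative sample mirroring the span tree above
cat > /tmp/sample_tree.json <<'EOF'
{"data": {"spans": [
  {"name": "customer_support_agent", "status": {"code": "STATUS_CODE_OK"}},
  {"name": "plan_action", "status": {"code": "STATUS_CODE_OK"}},
  {"name": "search_knowledge_base", "status": {"code": "STATUS_CODE_OK"}},
  {"name": "generate_response", "status": {"code": "STATUS_CODE_OK"}}
]}}
EOF

# One line per span: name plus status code
jq -r '.data.spans[] | "\(.name) [\(.status.code)]"' /tmp/sample_tree.json
# → customer_support_agent [STATUS_CODE_OK]  (then one line per remaining span)
```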
3. Correlate with the codebase.
The span search_knowledge_base maps to a function in the application code. Investigating reveals the vector index was built from only the shipping FAQ — the refund policy documents were never indexed.
4. Recommendations.
Index the refund policy documents so search_knowledge_base can return them; the agent's planning and generation steps worked correctly and need no change.
If an assessment's feedback carries an error field, it means the scorer or judge that evaluated the trace failed — not that the trace itself has a problem. The trace may be perfectly fine; the assessment's value is just unreliable. This can happen when a scorer crashes (e.g., timed out, returned unparseable output) or when a scorer was applied to a trace type it wasn't designed for (e.g., a retrieval relevance scorer applied to a trace with no retrieval steps). The latter is a scorer configuration issue, not a trace issue.

Look for token usage at the trace level (mlflow.trace.tokenUsage) or in span attributes (e.g., mlflow.chat.tokenUsage). Not all clients set these — check the raw attributes dict for equivalent fields. Spikes in input tokens may indicate prompt injection or overly large context.
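A sketch for pulling token usage. Two assumptions here, both worth verifying against a real trace with jq 'keys': that the mlflow.trace.tokenUsage key sits under .info.trace_metadata, and that its value is a serialized JSON string like span inputs/outputs:

```shell
# Illustrative sample; the key's location and encoding vary by client
cat > /tmp/sample_tokens.json <<'EOF'
{
  "info": {
    "state": "OK",
    "trace_metadata": {
      "mlflow.trace.tokenUsage": "{\"input_tokens\": 1200, \"output_tokens\": 85, \"total_tokens\": 1285}"
    }
  },
  "data": {"spans": []}
}
EOF

# Decode the serialized usage record
jq -c '.info.trace_metadata["mlflow.trace.tokenUsage"] | fromjson' /tmp/sample_tokens.json
# → {"input_tokens":1200,"output_tokens":85,"total_tokens":1285}
```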