analyzing-mlflow-trace by mlflow/skills
npx skills add https://github.com/mlflow/skills --skill analyzing-mlflow-trace
A trace captures the full execution of an AI/ML application as a tree of spans. Each span represents one operation (LLM call, tool invocation, retrieval step, etc.) and records its inputs, outputs, timing, and status. Traces also carry assessments — feedback from humans or LLM judges about quality.
It is recommended to read references/trace-structure.md before analyzing a trace — it covers the complete data model, all fields and types, analysis guidance, and OpenTelemetry compatibility notes.
Traces can be 100KB+ for complex agent executions. Always redirect output to a file — do not pipe mlflow traces get directly to jq, head, or other commands, as piping can silently produce no output.
# Fetch full trace to a file (traces get always outputs JSON, no --output flag needed)
mlflow traces get --trace-id <ID> > /tmp/trace.json
# Then process the file
jq '.info.state' /tmp/trace.json
jq '.data.spans | length' /tmp/trace.json
Prefer fetching the full trace and parsing the JSON directly rather than using --extract-fields. The --extract-fields flag has limited support for nested span data (e.g., span inputs/outputs may return empty objects). Fetch the complete trace once and parse it as needed.
The trace JSON has two top-level keys: info (metadata, assessments) and data (spans).
{
"info": { "trace_id", "state", "request_time", "assessments", ... },
"data": { "spans": [ { "span_id", "name", "status", "attributes", ... } ] }
}
Key paths (verified against actual CLI output):
| What | jq path |
|---|---|
| Trace state | .info.state |
| All spans | .data.spans |
| Root span | .data.spans[] \| select(.parent_id == null) |
| Span status code | .data.spans[].status.code (values: STATUS_CODE_OK, STATUS_CODE_ERROR, STATUS_CODE_UNSET) |
| Span status message | .data.spans[].status.message |
| Span inputs | .data.spans[].attributes["mlflow.spanInputs"] |
| Span outputs | .data.spans[].attributes["mlflow.spanOutputs"] |
| Assessments | .info.assessments |
| Assessment name | .info.assessments[].assessment_name |
| Feedback value | .info.assessments[].feedback.value |
| Feedback error | .info.assessments[].feedback.error |
| Assessment rationale | .info.assessments[].rationale |
Important: Span inputs and outputs are stored as serialized JSON strings inside attributes, not as top-level span fields. Traces from third-party OpenTelemetry clients may use different attribute names (e.g., GenAI Semantic Conventions, OpenInference, or custom keys) — check the raw attributes dict to find the equivalent fields.
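Because inputs and outputs are serialized strings, a plain path lookup returns a string, not an object; pipe the value through jq's fromjson to decode it. A minimal sketch against an illustrative sample file (a real trace comes from mlflow traces get and has many more fields):

```shell
# Illustrative sample mirroring the structure above
cat > /tmp/sample_span_io.json <<'EOF'
{
  "info": {"state": "OK", "assessments": []},
  "data": {"spans": [
    {"name": "generate_response",
     "status": {"code": "STATUS_CODE_OK"},
     "attributes": {
       "mlflow.spanInputs": "{\"query\": \"What is our refund policy?\"}",
       "mlflow.spanOutputs": "{\"content\": \"...\"}"
     }}
  ]}
}
EOF

# Without fromjson this prints a quoted string; with it, a JSON object
jq -c '.data.spans[0].attributes["mlflow.spanInputs"] | fromjson' /tmp/sample_span_io.json
# → {"query":"What is our refund policy?"}
```

The same filter works unchanged on /tmp/trace.json once the attribute key matches your client's naming.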
If paths don't match (structure may vary by MLflow version), discover them:
# Top-level keys
jq 'keys' /tmp/trace.json
# Span keys
jq '.data.spans[0] | keys' /tmp/trace.json
# Status structure
jq '.data.spans[0].status' /tmp/trace.json
After fetching a trace to a file, run this to get a summary:
jq '{
state: .info.state,
span_count: (.data.spans | length),
error_spans: [.data.spans[] | select(.status.code == "STATUS_CODE_ERROR") | .name],
assessment_errors: [.info.assessments[] | select(.feedback.error) | .assessment_name]
}' /tmp/trace.json
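If the summary lists error spans, drill into each failing span's status message next. A sketch over an illustrative sample file (the same filter applies to a real /tmp/trace.json):

```shell
# Illustrative sample with one failing span
cat > /tmp/sample_errors.json <<'EOF'
{
  "info": {"state": "ERROR", "assessments": []},
  "data": {"spans": [
    {"name": "plan_action", "status": {"code": "STATUS_CODE_OK", "message": ""}},
    {"name": "search_kb", "status": {"code": "STATUS_CODE_ERROR", "message": "timeout after 30s"}}
  ]}
}
EOF

# Name and status message of every failed span
jq -c '.data.spans[]
       | select(.status.code == "STATUS_CODE_ERROR")
       | {name, message: .status.message}' /tmp/sample_errors.json
# → {"name":"search_kb","message":"timeout after 30s"}
```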
state: OK does not mean correct output. It only means no unhandled exception occurred. Check assessments for quality signals, and if none exist, analyze the trace's inputs, outputs, and intermediate span data directly for issues.

Read the rationale when interpreting assessment values. The value alone can be misleading — for example, a user_frustration assessment with value: "no" could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The .rationale field (a top-level assessment field, not nested under .feedback) explains what the value means in context and often describes the issue in plain language before you need to examine any spans.

MLflow Tracing captures inputs, outputs, and metadata from different parts of an application's call stack. By correlating trace contents with the source code, issues can be root-caused more precisely than from the trace alone.
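To read values and rationales together rather than values alone, one jq sketch (the sample assessment below is illustrative; real ones live in .info.assessments):

```shell
# Illustrative sample with one assessment
cat > /tmp/sample_assessments.json <<'EOF'
{
  "info": {
    "state": "OK",
    "assessments": [
      {"assessment_name": "user_frustration",
       "feedback": {"value": "no"},
       "rationale": "No signs of frustration in the conversation."}
    ]
  },
  "data": {"spans": []}
}
EOF

# Pair each feedback value with its sibling rationale
jq -c '[.info.assessments[]
        | {name: .assessment_name, value: .feedback.value, rationale}]' /tmp/sample_assessments.json
```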
Span names match the functions decorated with @mlflow.trace or wrapped in mlflow.start_span(). For autologged spans (LangChain, OpenAI, etc.), names follow framework conventions instead (e.g., ChatOpenAI, RetrievalQA).

Example: a user reports that their customer support agent gave an incorrect answer for the query "What is our refund policy?" There are no assessments on the trace.
1. Fetch the trace and check high-level signals.
The trace has state: OK — no crash occurred. No assessments are present, so examine the trace's inputs and outputs directly. The response_preview says "Our shipping policy states that orders are delivered within 3-5 business days..." — this answers a different question than what was asked.
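The step-1 check can be done in one jq call. Note that request_preview and response_preview are assumed here to be fields on .info; the exact names can vary by MLflow version, so verify with jq '.info | keys' first. Illustrative sample:

```shell
# Illustrative sample of the trace-level previews
cat > /tmp/sample_preview.json <<'EOF'
{
  "info": {
    "state": "OK",
    "request_preview": "What is our refund policy?",
    "response_preview": "Our shipping policy states that orders are delivered within 3-5 business days..."
  },
  "data": {"spans": []}
}
EOF

# State plus request/response previews in one view
jq '{state: .info.state, request: .info.request_preview, response: .info.response_preview}' /tmp/sample_preview.json
```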
2. Examine spans to locate the problem.
The span tree shows:
customer_support_agent (AGENT) — OK
├── plan_action (LLM) — OK
│ outputs: {"tool_call": "search_knowledge_base", "args": {"query": "refund policy"}}
├── search_knowledge_base (TOOL) — OK
│ inputs: {"query": "refund policy"}
│ outputs: [{"doc": "Shipping takes 3-5 business days...", "score": 0.82}]
├── generate_response (LLM) — OK
│ inputs: {"messages": [..., {"role": "user", "content": "Context: Shipping takes 3-5 business days..."}]}
│ outputs: {"content": "Our shipping policy states..."}
The agent correctly decided to search for "refund policy," but the search_knowledge_base tool returned a shipping document. The LLM then faithfully answered using the wrong context. The problem is in the tool's retrieval, not the agent's reasoning or the LLM's generation.
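A flat listing of span names and statuses, like the tree above minus the indentation, can be pulled with jq. The sample file below mirrors the spans shown and is illustrative only:

```shell
# Illustrative sample mirroring the span tree above
cat > /tmp/sample_tree.json <<'EOF'
{"data": {"spans": [
  {"name": "customer_support_agent", "status": {"code": "STATUS_CODE_OK"}},
  {"name": "plan_action", "status": {"code": "STATUS_CODE_OK"}},
  {"name": "search_knowledge_base", "status": {"code": "STATUS_CODE_OK"}},
  {"name": "generate_response", "status": {"code": "STATUS_CODE_OK"}}
]}}
EOF

# One line per span: name plus status code
jq -r '.data.spans[] | "\(.name) [\(.status.code)]"' /tmp/sample_tree.json
# → customer_support_agent [STATUS_CODE_OK]  (then one line per remaining span)
```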
3. Correlate with the codebase.
The span search_knowledge_base maps to a function in the application code. Investigating reveals the vector index was built from only the shipping FAQ — the refund policy documents were never indexed.
4. Recommendations.
Index the refund policy documents so search_knowledge_base can return them; the agent's planning and generation steps worked correctly and need no change.
If an assessment's feedback carries an error field, it means the scorer or judge that evaluated the trace failed — not that the trace itself has a problem. The trace may be perfectly fine; the assessment's value is just unreliable. This can happen when a scorer crashes (e.g., timed out, returned unparseable output) or when a scorer was applied to a trace type it wasn't designed for (e.g., a retrieval relevance scorer applied to a trace with no retrieval steps). The latter is a scorer configuration issue, not a trace issue.

Look for token usage at the trace level (mlflow.trace.tokenUsage) or in span attributes (e.g., mlflow.chat.tokenUsage). Not all clients set these — check the raw attributes dict for equivalent fields. Spikes in input tokens may indicate prompt injection or overly large context.
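A sketch for pulling token usage. Two assumptions here, both worth verifying against a real trace with jq 'keys': that the mlflow.trace.tokenUsage key sits under .info.trace_metadata, and that its value is a serialized JSON string like span inputs/outputs:

```shell
# Illustrative sample; the key's location and encoding vary by client
cat > /tmp/sample_tokens.json <<'EOF'
{
  "info": {
    "state": "OK",
    "trace_metadata": {
      "mlflow.trace.tokenUsage": "{\"input_tokens\": 1200, \"output_tokens\": 85, \"total_tokens\": 1285}"
    }
  },
  "data": {"spans": []}
}
EOF

# Decode the serialized usage record
jq -c '.info.trace_metadata["mlflow.trace.tokenUsage"] | fromjson' /tmp/sample_tokens.json
# → {"input_tokens":1200,"output_tokens":85,"total_tokens":1285}
```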