npx skills add https://github.com/mlflow/skills --skill analyzing-mlflow-session
A session groups multiple traces that belong to the same chat conversation or user interaction. Each trace in the session represents one turn: the user's input and the system's response. Traces within a session are linked by a shared session ID stored in trace metadata.
The session ID is stored in trace metadata under the key mlflow.trace.session. This key contains dots, which affects filter syntax (see below). All traces sharing the same value for this key belong to the same session.
Reconstructing a session's conversation is a multi-step process: discover the input/output schema from the first trace, extract those fields efficiently across all session traces, then inspect specific turns as needed. Do NOT fetch full traces for every turn — use --extract-fields on the search command instead.
Step 1: Discover the schema. First, find a trace ID from the session, then fetch its full JSON to inspect the schema:
# Get the first trace in the session
mlflow traces search \
--experiment-id <EXPERIMENT_ID> \
--filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
--order-by "timestamp_ms ASC" \
--extract-fields 'info.trace_id' \
--output json \
--max-results 1 > /tmp/first_trace.json
# Fetch the full trace (always outputs JSON, no --output flag needed)
mlflow traces get \
--trace-id <TRACE_ID_FROM_ABOVE> > /tmp/trace_detail.json
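A note on the `--filter-string` quoting above: the metadata key contains dots, so it is wrapped in backticks, and the outer single quotes keep those backticks away from the shell (inside double quotes, bash would attempt command substitution). A quick self-contained check of what the CLI actually receives (`abc123` is a placeholder session ID):

```shell
# Single quotes pass the backticks through to the CLI literally
FILTER='metadata.`mlflow.trace.session` = "abc123"'
printf '%s\n' "$FILTER"
# → metadata.`mlflow.trace.session` = "abc123"
```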
Find the root span — the span with parent_span_id equal to null (i.e., it has no parent). This is the top-level operation in the trace:
# Find the root span
jq '.data.spans[] | select(.parent_span_id == null)' /tmp/trace_detail.json
Examine its attributes dict to identify which keys hold the user input and system output. These could be:
- mlflow.spanInputs and mlflow.spanOutputs (set by the MLflow Python client)
- application-specific keys set by @mlflow.trace or mlflow.start_span() with custom attribute logging

The structure of these values also varies by application (e.g., a query string, a messages array, a dict with multiple fields). Inspect the actual attribute values to understand the format.
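To enumerate the candidate keys without reading whole span payloads, you can list the root span's attribute names. A minimal sketch, shown here against an inline sample; in practice point the same filter at /tmp/trace_detail.json:

```shell
# List the attribute keys on the root span (candidate input/output fields)
jq -r '.data.spans[]
       | select(.parent_span_id == null)
       | .attributes | keys[]' <<'EOF'
{"data":{"spans":[
  {"span_id":"a","parent_span_id":null,"name":"agent",
   "attributes":{"mlflow.spanInputs":"{\"query\":\"hi\"}",
                 "mlflow.spanOutputs":"\"hello\""}}
]}}
EOF
# → mlflow.spanInputs
# → mlflow.spanOutputs
```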
If the root span has empty or missing inputs/outputs, it may be a wrapper span (e.g., an orchestrator or middleware) that doesn't directly carry the chat turn data. In that case, look at its immediate children — find the closest span to the top of the hierarchy that has meaningful inputs and outputs corresponding to a chat turn:
The following example assumes the trace comes from the MLflow Python client (which stores inputs/outputs in mlflow.spanInputs/mlflow.spanOutputs) and that the relevant span is a direct child of root. In practice, the relevant span may be deeper in the hierarchy, and traces from other clients may use different attribute keys — explore the span tree as needed:
# Get the root span's ID
ROOT_ID=$(jq -r '.data.spans[] | select(.parent_span_id == null) | .span_id' /tmp/trace_detail.json)
# List immediate children of the root span with their inputs/outputs
jq --arg root "$ROOT_ID" '.data.spans[] | select(.parent_span_id == $root) | {name: .name, inputs: .attributes["mlflow.spanInputs"], outputs: .attributes["mlflow.spanOutputs"]}' /tmp/trace_detail.json
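When the root is a wrapper, it can also help to list every span that carries the input key at all, then pick the one sitting highest in the hierarchy. A sketch under the same assumption (MLflow Python client keys); the inline sample stands in for /tmp/trace_detail.json:

```shell
# List all spans carrying mlflow.spanInputs; the one whose parent is the
# root wrapper ("agent" below) is the chat-turn span
jq '[.data.spans[]
     | select(.attributes["mlflow.spanInputs"] != null)
     | {name, parent_span_id}]' <<'EOF'
{"data":{"spans":[
  {"span_id":"a","parent_span_id":null,"name":"middleware","attributes":{}},
  {"span_id":"b","parent_span_id":"a","name":"agent",
   "attributes":{"mlflow.spanInputs":"{\"query\":\"hi\"}",
                 "mlflow.spanOutputs":"\"hello\""}},
  {"span_id":"c","parent_span_id":"b","name":"llm",
   "attributes":{"mlflow.spanInputs":"{\"messages\":[]}"}}
]}}
EOF
```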
Also check the first trace's assessments. Session-level assessments are attached to the first trace in the session — these evaluate the session as a whole (e.g., overall conversation quality, multi-turn coherence) and can indicate the presence of issues somewhere across the entire session, not just the first turn. The first trace may also have per-turn assessments for that specific turn.
Both types appear in .info.assessments. Session-level assessments are identified by the presence of mlflow.trace.session in their metadata field:
# Show session-level assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"]) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json
# Show per-turn assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"] == null) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json
Assessment errors are not trace errors. If an assessment has a feedback.error field, it means the scorer or judge failed — not that the trace itself has a problem. Exclude these when using assessments to identify trace issues.
Always consult the rationale when interpreting assessment values. The value alone can be misleading — for example, a user_frustration assessment with value: "no" could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The .rationale field (a top-level assessment field, not nested under .feedback) explains what the value means in context. Include rationale when extracting assessments:
jq '[.info.assessments[] | select(.feedback.error == null) | {name: .assessment_name, value: .feedback.value, rationale: .rationale}]' /tmp/trace_detail.json
Step 2: Extract across all session traces. Once you know which attribute keys hold inputs and outputs, search for all traces in the session using --extract-fields to pull those fields along with assessments (see Handling CLI Output for why output is written to a file):
mlflow traces search \
--experiment-id <EXPERIMENT_ID> \
--filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
--order-by "timestamp_ms ASC" \
--extract-fields 'info.trace_id,info.state,info.request_time,info.assessments,info.trace_metadata.`mlflow.traceInputs`,info.trace_metadata.`mlflow.traceOutputs`' \
--output json \
--max-results 100 > /tmp/session_traces.json
Then use bash commands (e.g., jq, wc, head) on the file to analyze it.
The --extract-fields example above uses mlflow.traceInputs/mlflow.traceOutputs from trace metadata — adjust the field paths based on what you discovered in step 1.
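Once the fields are extracted, the conversation can be rendered as a readable transcript. A sketch assuming the extracted values are plain strings (in real traces, mlflow.traceInputs/mlflow.traceOutputs are often JSON-encoded, in which case add a fromjson step); the inline sample stands in for /tmp/session_traces.json:

```shell
# Print one User/Assistant pair per turn, in the search's chronological order
jq -r '.traces[]
       | "User: \(.info.trace_metadata["mlflow.traceInputs"] // "<missing>")",
         "Assistant: \(.info.trace_metadata["mlflow.traceOutputs"] // "<missing>")",
         ""' <<'EOF'
{"traces":[
  {"info":{"trace_id":"t1","trace_metadata":{
    "mlflow.traceInputs":"What is MLflow?",
    "mlflow.traceOutputs":"An open-source ML platform."}}},
  {"info":{"trace_id":"t2","trace_metadata":{
    "mlflow.traceInputs":"Who maintains it?",
    "mlflow.traceOutputs":"Databricks and the community."}}}
]}
EOF
```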
Assessments contain quality judgments (e.g., correctness, relevance) that can pinpoint which turns had issues without needing to read every trace in detail. To identify which turns have assessment signals (excluding scorer errors):
# List turns with their valid assessments (scorer errors filtered out)
jq '.traces[] | {
trace_id: .info.trace_id,
time: .info.request_time,
state: .info.state,
assessments: [.info.assessments[]? | select(.feedback.error == null) | {
name: .assessment_name,
value: .feedback.value
}]
}' /tmp/session_traces.json
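Error states give another quick triage signal. Assuming info.state uses MLflow's trace states (e.g., "OK", "ERROR"), failed turns can be listed directly; the inline sample stands in for /tmp/session_traces.json:

```shell
# List trace IDs of turns that did not finish in state OK
jq -r '.traces[] | select(.info.state != "OK") | .info.trace_id' <<'EOF'
{"traces":[
  {"info":{"trace_id":"t1","state":"OK"}},
  {"info":{"trace_id":"t2","state":"ERROR"}},
  {"info":{"trace_id":"t3","state":"OK"}}
]}
EOF
# → t2
```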
CLI syntax notes:
- --experiment-id is required for all mlflow traces search commands. The command will fail without it.
- The metadata key contains dots, so wrap it in backticks in filter strings: metadata.`mlflow.trace.session`. Always use single quotes for the outer string when the value contains backticks (inside double quotes, the shell would try to run mlflow.trace.session as a command). For example: --filter-string 'metadata.`mlflow.trace.session` = "value"'
- --max-results defaults to 100, which is sufficient for most sessions. Increase up to 500 (the maximum) for longer conversations. If 500 results are returned, use pagination to retrieve the rest.

MLflow trace output can be large, and Claude Code's Bash tool has a ~30KB output limit for piped commands. When output exceeds this threshold, it gets saved to a file instead of being piped, causing silent failures.
Safe approach (always works):
# Step 1: Save to file
mlflow traces search \
--experiment-id <EXPERIMENT_ID> \
[...] \
--output json > /tmp/output.json
# Step 2: Process the file
cat /tmp/output.json | jq '.traces[0].info.trace_id'
head -50 /tmp/output.json
wc -l /tmp/output.json
Never pipe MLflow CLI output directly (e.g., mlflow traces search ... | jq '.'). This can silently produce no output. Always redirect to a file first, then run commands on the file.
To inspect a specific turn in detail (e.g., after identifying a problematic turn), fetch its full trace:
mlflow traces get --trace-id <TRACE_ID> > /tmp/turn_detail.json
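Walking the span tree of a single turn is easier with an indented view. A sketch using only the span_id/parent_span_id/name fields shown earlier; the inline sample stands in for /tmp/turn_detail.json:

```shell
# Print the span hierarchy, indented two spaces per level
jq -r '
  .data.spans as $spans
  | def tree($id; $depth):
      $spans[]
      | select(.parent_span_id == $id)
      | ((("  " * $depth) // "") + .name),
        tree(.span_id; $depth + 1);
    tree(null; 0)
' <<'EOF'
{"data":{"spans":[
  {"span_id":"a","parent_span_id":null,"name":"agent"},
  {"span_id":"b","parent_span_id":"a","name":"retriever"},
  {"span_id":"c","parent_span_id":"a","name":"llm"},
  {"span_id":"d","parent_span_id":"c","name":"tool_call"}
]}}
EOF
```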
Check where mlflow.trace.session is set in the application code to understand how sessions are created — per user login, per browser tab, per explicit "new conversation" action, etc.

The scripts/ subdirectory contains ready-to-run bash scripts for each analysis step. All scripts follow the output handling rules above (redirect to file, then process).
- scripts/discover_schema.sh <EXPERIMENT_ID> <SESSION_ID> — Finds the first trace in the session, fetches its full detail, and prints the root span's attribute keys and input/output values.
- scripts/inspect_turn.sh <TRACE_ID> — Fetches a specific trace, lists all spans, highlights error spans, and shows assessments.

A user reports that their chatbot gave an incorrect answer on the 5th message of a chat conversation.
1. Discover the schema and reconstruct the conversation.
Fetch the first trace in the session and inspect the root span's attributes to find which keys hold inputs and outputs. In this case, mlflow.spanInputs contains the user query and mlflow.spanOutputs contains the assistant response. Then search all session traces, extracting those fields in chronological order. Scanning the extracted inputs and outputs confirms that turn 5's response is wrong, and reveals whether earlier turns look correct.
2. Check if the error originated in an earlier turn.
Turn 3's response contains a factual error that the user didn't challenge. Turn 4 builds on that incorrect information, and turn 5 compounds it. The root cause is in turn 3, not turn 5.
3. Analyze the root-cause turn as a single trace.
Fetch the full trace for turn 3 and analyze it — examine assessments (if any), walk the span tree, check retriever results, and correlate with code. The retriever returned an outdated document, causing the wrong answer.
4. Recommendations.