npx skills add https://github.com/mlflow/skills --skill analyzing-mlflow-session
A session groups multiple traces that belong to the same chat conversation or user interaction. Each trace in the session represents one turn: the user's input and the system's response. Traces within a session are linked by a shared session ID stored in trace metadata.
The session ID is stored in trace metadata under the key mlflow.trace.session. This key contains dots, which affects filter syntax (see below). All traces sharing the same value for this key belong to the same session.
Reconstructing a session's conversation is a multi-step process: discover the input/output schema from the first trace, extract those fields efficiently across all session traces, then inspect specific turns as needed. Do NOT fetch full traces for every turn — use --extract-fields on the search command instead.
Step 1: Discover the schema. First, find a trace ID from the session, then fetch its full JSON to inspect the schema:
# Get the first trace in the session
mlflow traces search \
--experiment-id <EXPERIMENT_ID> \
--filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
--order-by "timestamp_ms ASC" \
--extract-fields 'info.trace_id' \
--output json \
--max-results 1 > /tmp/first_trace.json
# Fetch the full trace (always outputs JSON, no --output flag needed)
mlflow traces get \
--trace-id <TRACE_ID_FROM_ABOVE> > /tmp/trace_detail.json
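A note on the `--filter-string` quoting above: the metadata key contains dots, so it is wrapped in backticks, and the outer single quotes keep those backticks away from the shell (inside double quotes, bash would attempt command substitution). A quick self-contained check of what the CLI actually receives (`abc123` is a placeholder session ID):

```shell
# Single quotes pass the backticks through to the CLI literally
FILTER='metadata.`mlflow.trace.session` = "abc123"'
printf '%s\n' "$FILTER"
# → metadata.`mlflow.trace.session` = "abc123"
```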
Find the root span — the span with parent_span_id equal to null (i.e., it has no parent). This is the top-level operation in the trace:
# Find the root span
jq '.data.spans[] | select(.parent_span_id == null)' /tmp/trace_detail.json
Examine its attributes dict to identify which keys hold the user input and system output. These could be:
- mlflow.spanInputs and mlflow.spanOutputs (set by the MLflow Python client)
- application-specific keys set by @mlflow.trace or mlflow.start_span() with custom attribute logging

The structure of these values also varies by application (e.g., a query string, a messages array, a dict with multiple fields). Inspect the actual attribute values to understand the format.
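To enumerate the candidate keys without reading whole span payloads, you can list the root span's attribute names. A minimal sketch, shown here against an inline sample; in practice point the same filter at /tmp/trace_detail.json:

```shell
# List the attribute keys on the root span (candidate input/output fields)
jq -r '.data.spans[]
       | select(.parent_span_id == null)
       | .attributes | keys[]' <<'EOF'
{"data":{"spans":[
  {"span_id":"a","parent_span_id":null,"name":"agent",
   "attributes":{"mlflow.spanInputs":"{\"query\":\"hi\"}",
                 "mlflow.spanOutputs":"\"hello\""}}
]}}
EOF
# → mlflow.spanInputs
# → mlflow.spanOutputs
```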
If the root span has empty or missing inputs/outputs, it may be a wrapper span (e.g., an orchestrator or middleware) that doesn't directly carry the chat turn data. In that case, look at its immediate children — find the closest span to the top of the hierarchy that has meaningful inputs and outputs corresponding to a chat turn:
The following example assumes the trace comes from the MLflow Python client (which stores inputs/outputs in mlflow.spanInputs/mlflow.spanOutputs) and that the relevant span is a direct child of root. In practice, the relevant span may be deeper in the hierarchy, and traces from other clients may use different attribute keys — explore the span tree as needed:
# Get the root span's ID
ROOT_ID=$(jq -r '.data.spans[] | select(.parent_span_id == null) | .span_id' /tmp/trace_detail.json)
# List immediate children of the root span with their inputs/outputs
jq --arg root "$ROOT_ID" '.data.spans[] | select(.parent_span_id == $root) | {name: .name, inputs: .attributes["mlflow.spanInputs"], outputs: .attributes["mlflow.spanOutputs"]}' /tmp/trace_detail.json
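When the root is a wrapper, it can also help to list every span that carries the input key at all, then pick the one sitting highest in the hierarchy. A sketch under the same assumption (MLflow Python client keys); the inline sample stands in for /tmp/trace_detail.json:

```shell
# List all spans carrying mlflow.spanInputs; the one whose parent is the
# root wrapper ("agent" below) is the chat-turn span
jq '[.data.spans[]
     | select(.attributes["mlflow.spanInputs"] != null)
     | {name, parent_span_id}]' <<'EOF'
{"data":{"spans":[
  {"span_id":"a","parent_span_id":null,"name":"middleware","attributes":{}},
  {"span_id":"b","parent_span_id":"a","name":"agent",
   "attributes":{"mlflow.spanInputs":"{\"query\":\"hi\"}",
                 "mlflow.spanOutputs":"\"hello\""}},
  {"span_id":"c","parent_span_id":"b","name":"llm",
   "attributes":{"mlflow.spanInputs":"{\"messages\":[]}"}}
]}}
EOF
```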
Also check the first trace's assessments. Session-level assessments are attached to the first trace in the session — these evaluate the session as a whole (e.g., overall conversation quality, multi-turn coherence) and can indicate the presence of issues somewhere across the entire session, not just the first turn. The first trace may also have per-turn assessments for that specific turn.
Both types appear in .info.assessments. Session-level assessments are identified by the presence of mlflow.trace.session in their metadata field:
# Show session-level assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"]) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json
# Show per-turn assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"] == null) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json
Assessment errors are not trace errors. If an assessment has a feedback.error field, it means the scorer or judge failed — not that the trace itself has a problem. Exclude these when using assessments to identify trace issues.
Always consult the rationale when interpreting assessment values. The value alone can be misleading — for example, a user_frustration assessment with value: "no" could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The .rationale field (a top-level assessment field, not nested under .feedback) explains what the value means in context. Include rationale when extracting assessments:
jq '[.info.assessments[] | select(.feedback.error == null) | {name: .assessment_name, value: .feedback.value, rationale: .rationale}]' /tmp/trace_detail.json
Step 2: Extract across all session traces. Once you know which attribute keys hold inputs and outputs, search for all traces in the session using --extract-fields to pull those fields along with assessments (see Handling CLI Output for why output is written to a file):
mlflow traces search \
--experiment-id <EXPERIMENT_ID> \
--filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
--order-by "timestamp_ms ASC" \
--extract-fields 'info.trace_id,info.state,info.request_time,info.assessments,info.trace_metadata.`mlflow.traceInputs`,info.trace_metadata.`mlflow.traceOutputs`' \
--output json \
--max-results 100 > /tmp/session_traces.json
Then use bash commands (e.g., jq, wc, head) on the file to analyze it.
The --extract-fields example above uses mlflow.traceInputs/mlflow.traceOutputs from trace metadata — adjust the field paths based on what you discovered in step 1.
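Once the fields are extracted, the conversation can be rendered as a readable transcript. A sketch assuming the extracted values are plain strings (in real traces, mlflow.traceInputs/mlflow.traceOutputs are often JSON-encoded, in which case add a fromjson step); the inline sample stands in for /tmp/session_traces.json:

```shell
# Print one User/Assistant pair per turn, in the search's chronological order
jq -r '.traces[]
       | "User: \(.info.trace_metadata["mlflow.traceInputs"] // "<missing>")",
         "Assistant: \(.info.trace_metadata["mlflow.traceOutputs"] // "<missing>")",
         ""' <<'EOF'
{"traces":[
  {"info":{"trace_id":"t1","trace_metadata":{
    "mlflow.traceInputs":"What is MLflow?",
    "mlflow.traceOutputs":"An open-source ML platform."}}},
  {"info":{"trace_id":"t2","trace_metadata":{
    "mlflow.traceInputs":"Who maintains it?",
    "mlflow.traceOutputs":"Databricks and the community."}}}
]}
EOF
```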
Assessments contain quality judgments (e.g., correctness, relevance) that can pinpoint which turns had issues without needing to read every trace in detail. To identify which turns have assessment signals (excluding scorer errors):
# List turns with their valid assessments (scorer errors filtered out)
jq '.traces[] | {
trace_id: .info.trace_id,
time: .info.request_time,
state: .info.state,
assessments: [.info.assessments[]? | select(.feedback.error == null) | {
name: .assessment_name,
value: .feedback.value
}]
}' /tmp/session_traces.json
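Error states give another quick triage signal. Assuming info.state uses MLflow's trace states (e.g., "OK", "ERROR"), failed turns can be listed directly; the inline sample stands in for /tmp/session_traces.json:

```shell
# List trace IDs of turns that did not finish in state OK
jq -r '.traces[] | select(.info.state != "OK") | .info.trace_id' <<'EOF'
{"traces":[
  {"info":{"trace_id":"t1","state":"OK"}},
  {"info":{"trace_id":"t2","state":"ERROR"}},
  {"info":{"trace_id":"t3","state":"OK"}}
]}
EOF
# → t2
```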
CLI syntax notes:
- --experiment-id is required for all mlflow traces search commands. The command will fail without it.
- The metadata key contains dots, so wrap it in backticks in filter strings: metadata.`mlflow.trace.session`. Always use single quotes for the outer string when the value contains backticks (inside double quotes, the shell would try to run mlflow.trace.session as a command). For example: --filter-string 'metadata.`mlflow.trace.session` = "value"'
- --max-results defaults to 100, which is sufficient for most sessions. Increase up to 500 (the maximum) for longer conversations. If 500 results are returned, use pagination to retrieve the rest.

MLflow trace output can be large, and Claude Code's Bash tool has a ~30KB output limit for piped commands. When output exceeds this threshold, it gets saved to a file instead of being piped, causing silent failures.
Safe approach (always works):
# Step 1: Save to file
mlflow traces search \
--experiment-id <EXPERIMENT_ID> \
[...] \
--output json > /tmp/output.json
# Step 2: Process the file
cat /tmp/output.json | jq '.traces[0].info.trace_id'
head -50 /tmp/output.json
wc -l /tmp/output.json
Never pipe MLflow CLI output directly (e.g., mlflow traces search ... | jq '.'). This can silently produce no output. Always redirect to a file first, then run commands on the file.
To inspect a specific turn in detail (e.g., after identifying a problematic turn), fetch its full trace:
mlflow traces get --trace-id <TRACE_ID> > /tmp/turn_detail.json
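Walking the span tree of a single turn is easier with an indented view. A sketch using only the span_id/parent_span_id/name fields shown earlier; the inline sample stands in for /tmp/turn_detail.json:

```shell
# Print the span hierarchy, indented two spaces per level
jq -r '
  .data.spans as $spans
  | def tree($id; $depth):
      $spans[]
      | select(.parent_span_id == $id)
      | ((("  " * $depth) // "") + .name),
        tree(.span_id; $depth + 1);
    tree(null; 0)
' <<'EOF'
{"data":{"spans":[
  {"span_id":"a","parent_span_id":null,"name":"agent"},
  {"span_id":"b","parent_span_id":"a","name":"retriever"},
  {"span_id":"c","parent_span_id":"a","name":"llm"},
  {"span_id":"d","parent_span_id":"c","name":"tool_call"}
]}}
EOF
```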
Check where mlflow.trace.session is set in the application code to understand how sessions are created — per user login, per browser tab, per explicit "new conversation" action, etc.

The scripts/ subdirectory contains ready-to-run bash scripts for each analysis step. All scripts follow the output handling rules above (redirect to file, then process).
- scripts/discover_schema.sh <EXPERIMENT_ID> <SESSION_ID> — Finds the first trace in the session, fetches its full detail, and prints the root span's attribute keys and input/output values.
- scripts/inspect_turn.sh <TRACE_ID> — Fetches a specific trace, lists all spans, highlights error spans, and shows assessments.

A user reports that their chatbot gave an incorrect answer on the 5th message of a chat conversation.
1. Discover the schema and reconstruct the conversation.
Fetch the first trace in the session and inspect the root span's attributes to find which keys hold inputs and outputs. In this case, mlflow.spanInputs contains the user query and mlflow.spanOutputs contains the assistant response. Then search all session traces, extracting those fields in chronological order. Scanning the extracted inputs and outputs confirms that turn 5's response is wrong, and reveals whether earlier turns look correct.
2. Check if the error originated in an earlier turn.
Turn 3's response contains a factual error that the user didn't challenge. Turn 4 builds on that incorrect information, and turn 5 compounds it. The root cause is in turn 3, not turn 5.
3. Analyze the root-cause turn as a single trace.
Fetch the full trace for turn 3 and analyze it — examine assessments (if any), walk the span tree, check retriever results, and correlate with code. The retriever returned an outdated document, causing the wrong answer.
4. Recommendations.