npx skills add https://github.com/mlflow/skills --skill agent-evaluation
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:

* mlflow.genai.datasets.create_dataset() - NOT custom test case files
* mlflow.genai.scorers and mlflow.genai.judges.make_judge() - NOT custom scorer functions
* mlflow.genai.evaluate() - NOT custom evaluation loops
* scripts/ directory templates - NOT custom evaluation/ directories

Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.
If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.
⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.
Setup (prerequisite): Install MLflow 3.8+, configure the environment, integrate tracing
Evaluation workflow in 4 steps (each uses MLflow APIs):
Always use uv run for MLflow and Python commands:
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
All MLflow documentation must be accessed through llms.txt:
https://mlflow.org/docs/latest/llms.txt

This applies to all steps, especially:
Each project has unique structure. Use dynamic exploration instead of assumptions:
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"  # -E enables alternation
# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null
# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py" # FastAPI/Flask; -E enables alternation
grep -r "def.*route" . --include="*.py"
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"
# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100
# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
Before doing ANY setup, check if MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID are already set:
echo "MLFLOW_TRACKING_URI=$MLFLOW_TRACKING_URI"
echo "MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID"
If BOTH are already set, skip Steps 1-2 entirely. The environment is pre-configured. Do NOT run setup_mlflow.py, do NOT create a .env file, do NOT override these values. Go directly to Step 3 (tracing integration) and the evaluation workflow.
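The two-variable precheck can be mirrored in Python — a minimal sketch; the function name check_env is illustrative, not part of the skill:

```python
import os

def check_env():
    """Return 'preconfigured' when both MLflow variables are non-empty, else 'needs-setup'."""
    if os.environ.get("MLFLOW_TRACKING_URI") and os.environ.get("MLFLOW_EXPERIMENT_ID"):
        return "preconfigured"
    return "needs-setup"
```

When this returns "preconfigured", skip Steps 1-2 and go straight to tracing integration.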
* references/setup-guide.md Steps 1-2
* instrumenting-with-mlflow-tracing skill for tracing setup
* scripts/validate_tracing_runtime.py after implementing

⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
Validation scripts:
uv run python scripts/validate_environment.py # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py # Test authentication before expensive operations
Check registered scorers in your experiment:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
IMPORTANT: If there are registered scorers in the experiment, they must be used for evaluation.
See references/scorers.md for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.
If needed, create additional scorers using the make_judge() API. See references/scorers.md on how to create custom scorers and references/scorers-constraints.md for best practices.
REQUIRED: Register new scorers before evaluation using Python API:
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import BuiltinScorerName
import os

scorer = make_judge(...)  # Or: scorer = BuiltinScorerName()
scorer.register()
IMPORTANT: See references/scorers.md → "Model Selection for Scorers" to configure the model parameter of scorers before registration.
⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in mlflow scorers list and won't be reusable.
Verify registration:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID # Should show your scorers
ALWAYS discover existing datasets first to prevent duplicate work:
Run dataset discovery (mandatory):
uv run python scripts/list_datasets.py               # Lists, compares, recommends datasets
uv run python scripts/list_datasets.py --format json # Machine-readable output
uv run python scripts/list_datasets.py --help        # All options
Present findings to the user:
Ask the user about existing datasets:
Create a new dataset only if the user declined the existing ones:
uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
uv run python scripts/create_dataset_template.py --help # See all options
Generated code uses mlflow.genai.datasets APIs - review and execute the script.
IMPORTANT : Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For the complete dataset guide, see references/dataset-preparation.md
Checkpoint - verify before proceeding:
1. Generate and run the evaluation script:
uv run python scripts/run_evaluation_template.py \
--module mlflow_agent.agent \
--entry-point run_agent
# Review the generated script, then execute it
uv run python run_agent_evaluation.py
The generated script creates a wrapper function that:
* Accepts keyword arguments matching the dataset's input keys
* Provides any additional arguments the agent needs (like `llm_provider`)
* Runs `mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)`
* Saves results to `evaluation_results.csv`
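A minimal sketch of such a wrapper, assuming a hypothetical agent function run_agent(query, llm_provider) — names are illustrative; the real template generates this for you:

```python
def make_wrapper(agent_fn, **extra_kwargs):
    """Build a predict_fn that forwards the dataset's input keys plus fixed extras."""
    def wrapper(query):  # parameter name must match the dataset's input key
        return agent_fn(query, **extra_kwargs)
    return wrapper

# Hypothetical agent, used for illustration only
def run_agent(query, llm_provider="openai"):
    return f"[{llm_provider}] answer to: {query}"

predict_fn = make_wrapper(run_agent, llm_provider="anthropic")
```

The closure pins llm_provider while leaving the signature that MLflow sees matching the dataset's input keys.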
⚠️ CRITICAL: The wrapper signature must match the dataset's input keys
MLflow calls predict_fn(**inputs) - it unpacks the inputs dict as keyword arguments.
| Dataset Record | MLflow Calls | predict_fn Must Be |
|---|---|---|
{"inputs": {"query": "..."}} | predict_fn(query="...") | def wrapper(query): |
{"inputs": {"question": "...", "context": "..."}} | predict_fn(question="...", context="...") | def wrapper(question, context): |
Common Mistake (WRONG):
def wrapper(inputs):  # ❌ WRONG - raises TypeError: MLflow calls wrapper(query=...), and there is no 'query' parameter
    return agent(inputs["query"])  # never reached
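A quick pure-Python illustration of the unpacking contract (no MLflow required; mlflow.genai.evaluate invokes predict_fn the same way):

```python
record = {"inputs": {"query": "What is MLflow?"}}

def good_wrapper(query):   # parameter matches the input key
    return f"answer: {query}"

def bad_wrapper(inputs):   # ❌ no 'query' parameter
    return inputs["query"]

result = good_wrapper(**record["inputs"])  # works

try:
    bad_wrapper(**record["inputs"])        # TypeError: unexpected keyword argument 'query'
    error = ""
except TypeError as exc:
    error = str(exc)
```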
2. Analyze results:
# Pattern detection, failure analysis, recommendations
uv run python scripts/analyze_results.py evaluation_results.csv
Generates evaluation_report.md with pass rates and improvement suggestions.
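The internals of analyze_results.py aren't shown here; a pass-rate computation over the results CSV might look like this sketch (the column name "passed" is an assumption, not the script's actual schema):

```python
import csv
import io

def pass_rate(csv_text, column="passed"):
    """Fraction of rows whose `column` equals 'true' (case-insensitive)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if r[column].strip().lower() == "true")
    return hits / len(rows)

sample = "query,passed\nq1,true\nq2,false\nq3,true\n"
```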
Detailed guides in references/ (load as needed):
instrumenting-with-mlflow-tracing skill (authoritative guide for autolog, decorators, session tracking, verification)

Scripts are self-documenting - run with --help for usage details.
Weekly Installs: 96
GitHub Stars: 19
First Seen: Feb 4, 2026
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Fail
Installed on: gemini-cli (94), github-copilot (93), codex (92), opencode (91), amp (90), kimi-cli (90)