npx skills add https://github.com/mlflow/skills --skill agent-evaluation
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:

* mlflow.genai.datasets.create_dataset() - NOT custom test case files
* mlflow.genai.scorers and mlflow.genai.judges.make_judge() - NOT custom scorer functions
* mlflow.genai.evaluate() - NOT custom evaluation loops
* scripts/ directory templates - NOT custom evaluation/ directories

Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.
If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.
⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.
Setup (prerequisite): Install MLflow 3.8+, configure the environment, integrate tracing
Evaluation workflow in 4 steps (each uses MLflow APIs):
Always use uv run for MLflow and Python commands:
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
All MLflow documentation must be accessed through llms.txt:
https://mlflow.org/docs/latest/llms.txt

This applies to all steps, especially:
Each project has unique structure. Use dynamic exploration instead of assumptions:
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"  # -E enables alternation
# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null
# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py" # FastAPI/Flask; -E enables alternation
grep -r "def.*route" . --include="*.py"
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"
# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100
# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
Before doing ANY setup, check if MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID are already set:
echo "MLFLOW_TRACKING_URI=$MLFLOW_TRACKING_URI"
echo "MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID"
If BOTH are already set, skip Steps 1-2 entirely. The environment is pre-configured. Do NOT run setup_mlflow.py, do NOT create a .env file, do NOT override these values. Go directly to Step 3 (tracing integration) and the evaluation workflow.
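The two-variable precheck can be mirrored in Python — a minimal sketch; the function name check_env is illustrative, not part of the skill:

```python
import os

def check_env():
    """Return 'preconfigured' when both MLflow variables are non-empty, else 'needs-setup'."""
    if os.environ.get("MLFLOW_TRACKING_URI") and os.environ.get("MLFLOW_EXPERIMENT_ID"):
        return "preconfigured"
    return "needs-setup"
```

When this returns "preconfigured", skip Steps 1-2 and go straight to tracing integration.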
* references/setup-guide.md Steps 1-2
* instrumenting-with-mlflow-tracing skill for tracing setup
* scripts/validate_tracing_runtime.py after implementing

⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
Validation scripts:
uv run python scripts/validate_environment.py # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py # Test authentication before expensive operations
Check registered scorers in your experiment:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
IMPORTANT: If there are registered scorers in the experiment, they must be used for evaluation.
See references/scorers.md for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.
If needed, create additional scorers using the make_judge() API. See references/scorers.md on how to create custom scorers and references/scorers-constraints.md for best practices.
REQUIRED: Register new scorers before evaluation using Python API:
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import BuiltinScorerName
import os

scorer = make_judge(...)  # Or: scorer = BuiltinScorerName()
scorer.register()
IMPORTANT: See references/scorers.md → "Model Selection for Scorers" to configure the model parameter of scorers before registration.
⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in mlflow scorers list and won't be reusable.
Verify registration:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID # Should show your scorers
ALWAYS discover existing datasets first to prevent duplicate work:
Run dataset discovery (mandatory):
uv run python scripts/list_datasets.py               # Lists, compares, recommends datasets
uv run python scripts/list_datasets.py --format json # Machine-readable output
uv run python scripts/list_datasets.py --help        # All options
Present findings to the user:
Ask the user about existing datasets:
Create a new dataset only if the user declined the existing ones:
uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
uv run python scripts/create_dataset_template.py --help # See all options
Generated code uses mlflow.genai.datasets APIs - review and execute the script.
IMPORTANT : Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For the complete dataset guide, see references/dataset-preparation.md
Checkpoint - verify before proceeding:
1. Generate and run the evaluation script:
uv run python scripts/run_evaluation_template.py \
--module mlflow_agent.agent \
--entry-point run_agent
# Review the generated script, then execute it
uv run python run_agent_evaluation.py
The generated script creates a wrapper function that:
* Accepts keyword arguments matching the dataset's input keys
* Provides any additional arguments the agent needs (like `llm_provider`)
* Runs `mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)`
* Saves results to `evaluation_results.csv`
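A minimal sketch of such a wrapper, assuming a hypothetical agent function run_agent(query, llm_provider) — names are illustrative; the real template generates this for you:

```python
def make_wrapper(agent_fn, **extra_kwargs):
    """Build a predict_fn that forwards the dataset's input keys plus fixed extras."""
    def wrapper(query):  # parameter name must match the dataset's input key
        return agent_fn(query, **extra_kwargs)
    return wrapper

# Hypothetical agent, used for illustration only
def run_agent(query, llm_provider="openai"):
    return f"[{llm_provider}] answer to: {query}"

predict_fn = make_wrapper(run_agent, llm_provider="anthropic")
```

The closure pins llm_provider while leaving the signature that MLflow sees matching the dataset's input keys.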
⚠️ CRITICAL: The wrapper signature must match the dataset's input keys
MLflow calls predict_fn(**inputs) - it unpacks the inputs dict as keyword arguments.
| Dataset Record | MLflow Calls | predict_fn Must Be |
|---|---|---|
{"inputs": {"query": "..."}} | predict_fn(query="...") | def wrapper(query): |
{"inputs": {"question": "...", "context": "..."}} | predict_fn(question="...", context="...") | def wrapper(question, context): |
Common Mistake (WRONG):
def wrapper(inputs):  # ❌ WRONG - raises TypeError: MLflow calls wrapper(query=...), and there is no 'query' parameter
    return agent(inputs["query"])  # never reached
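A quick pure-Python illustration of the unpacking contract (no MLflow required; mlflow.genai.evaluate invokes predict_fn the same way):

```python
record = {"inputs": {"query": "What is MLflow?"}}

def good_wrapper(query):   # parameter matches the input key
    return f"answer: {query}"

def bad_wrapper(inputs):   # ❌ no 'query' parameter
    return inputs["query"]

result = good_wrapper(**record["inputs"])  # works

try:
    bad_wrapper(**record["inputs"])        # TypeError: unexpected keyword argument 'query'
    error = ""
except TypeError as exc:
    error = str(exc)
```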
2. Analyze results:
# Pattern detection, failure analysis, recommendations
uv run python scripts/analyze_results.py evaluation_results.csv
Generates evaluation_report.md with pass rates and improvement suggestions.
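The internals of analyze_results.py aren't shown here; a pass-rate computation over the results CSV might look like this sketch (the column name "passed" is an assumption, not the script's actual schema):

```python
import csv
import io

def pass_rate(csv_text, column="passed"):
    """Fraction of rows whose `column` equals 'true' (case-insensitive)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if r[column].strip().lower() == "true")
    return hits / len(rows)

sample = "query,passed\nq1,true\nq2,false\nq3,true\n"
```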
Detailed guides in references/ (load as needed):
instrumenting-with-mlflow-tracing skill (authoritative guide for autolog, decorators, session tracking, verification)

Scripts are self-documenting - run with --help for usage details.
Weekly Installs: 96
GitHub Stars: 19
First Seen: Feb 4, 2026
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Fail
Installed on: gemini-cli (94), github-copilot (93), codex (92), opencode (91), amp (90), kimi-cli (90)