adk-eval-guide by google/adk-docs
npx skills add https://github.com/google/adk-docs --skill adk-eval-guide
Scaffolded project? If you used /adk-scaffold, you already have make eval, tests/eval/evalsets/, and tests/eval/eval_config.json. Start with make eval and iterate from there.
Non-scaffolded project? Use adk eval directly — see Running Evaluations below.
| File | Contents |
|---|---|
| references/criteria-guide.md | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| references/user-simulation.md | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| references/builtin-tools-eval.md | google_search and model-internal tools — trajectory behavior, metric compatibility |
| references/multimodal-eval.md | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |
Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.
Run make eval (or adk eval if there is no Makefile) and expect 5-10+ iterations. This is normal — each iteration makes the agent better.
| Failure | What to change |
|---|---|
| tool_trajectory_avg_score low | Fix agent instructions (tool ordering), update evalset tool_uses, or switch to IN_ORDER/ANY_ORDER match type |
| response_match_score low | Adjust agent instruction wording, or relax the expected response |
| final_response_match_v2 low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| rubric_based score low | Refine agent instructions to address the specific rubric that failed |
| hallucinations_v1 low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use IN_ORDER/ANY_ORDER match type, add strict stop instructions, or switch to rubric_based_tool_use_quality_v1 |
| Goal | Recommended Metric |
|---|---|
| Regression testing / CI/CD (fast, deterministic) | tool_trajectory_avg_score + response_match_score |
| Semantic response correctness (flexible phrasing OK) | final_response_match_v2 |
| Response quality without reference answer | rubric_based_final_response_quality_v1 |
| Validate tool usage reasoning | rubric_based_tool_use_quality_v1 |
| Detect hallucinated claims | hallucinations_v1 |
| Safety compliance | safety_v1 |
| Dynamic multi-turn conversations | User simulation + hallucinations_v1 / safety_v1 (see references/user-simulation.md) |
| Multimodal input (image, audio, file) | tool_trajectory_avg_score + custom metric for response quality (see references/multimodal-eval.md) |
For the complete metrics reference with config examples, match types, and custom metrics, see references/criteria-guide.md.
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json
# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results
# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2
# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
CLI options: --config_file_path, --print_detailed_results, --eval_storage_uri, --log_level
Eval set management:
adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>
Eval config (eval_config.json): Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.
{
"criteria": {
"tool_trajectory_avg_score": {
"threshold": 1.0,
"match_type": "IN_ORDER"
},
"final_response_match_v2": {
"threshold": 0.8,
"judge_model_options": {
"judge_model": "gemini-2.5-flash",
"num_samples": 5
}
},
"rubric_based_final_response_quality_v1": {
"threshold": 0.8,
"rubrics": [
{
"rubric_id": "professionalism",
"rubric_content": { "text_property": "The response must be professional and helpful." }
},
{
"rubric_id": "safety",
"rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
}
]
}
}
}
Simple threshold shorthand is also valid: "response_match_score": 0.8
For custom metrics, judge_model_options details, and user_simulator_config, see references/criteria-guide.md.
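The pass/fail semantics of the criteria entries above can be sketched in a few lines. This is illustrative only (real scoring is done by adk eval); it just shows how the object form and the bare-number shorthand both resolve to a threshold:

```python
import json

# Illustrative only: how threshold pass/fail works for criteria entries.
# Both the object form and the bare-number shorthand from above are handled.
config = json.loads("""{
  "criteria": {
    "tool_trajectory_avg_score": {"threshold": 1.0, "match_type": "IN_ORDER"},
    "response_match_score": 0.8
  }
}""")

def threshold_of(criterion):
    # Shorthand form is a bare number; full form carries a "threshold" key.
    return criterion if isinstance(criterion, (int, float)) else criterion["threshold"]

scores = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.72}
results = {
    name: scores[name] >= threshold_of(crit)
    for name, crit in config["criteria"].items()
}
# tool_trajectory_avg_score passes (1.0 >= 1.0); response_match_score fails (0.72 < 0.8)
```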
Evalset schema (evalset.json):
{
"eval_set_id": "my_eval_set",
"name": "My Eval Set",
"description": "Tests core capabilities",
"eval_cases": [
{
"eval_id": "search_test",
"conversation": [
{
"invocation_id": "inv_1",
"user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
"final_response": {
"role": "model",
"parts": [{ "text": "I found a flight for $500. Want to book?" }]
},
"intermediate_data": {
"tool_uses": [
{ "name": "search_flights", "args": { "destination": "NYC" } }
],
"intermediate_responses": [
["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
]
}
}
],
"session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
}
]
}
Key fields:
- intermediate_data.tool_uses — expected tool call trajectory (chronological order)
- intermediate_data.intermediate_responses — expected sub-agent responses (for multi-agent systems)
- session_input.state — initial session state (overrides Python-level initialization)
- conversation_scenario — alternative to conversation for user simulation (see references/user-simulation.md)

LLMs often perform extra actions not asked for (e.g., google_search after save_preferences). This causes tool_trajectory_avg_score failures with EXACT match. Solutions:
- Use the IN_ORDER or ANY_ORDER match type — tolerates extra tool calls between expected ones
- Use rubric_based_tool_use_quality_v1 instead of trajectory matching

The tool_trajectory_avg_score metric evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.
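That per-invocation averaging can be sketched as follows. This is a simplified model, not the ADK implementation; it only shows why one unspecified intermediate turn drags the score below a 1.0 threshold:

```python
# Simplified model, not the ADK implementation: the metric averages a
# per-invocation match score, so one unmatched turn sinks the mean.
def invocation_score(expected, actual):
    # EXACT-style comparison for illustration: full credit only on a perfect match.
    return 1.0 if expected == actual else 0.0

def trajectory_avg(turns):
    scores = [invocation_score(t["expected"], t["actual"]) for t in turns]
    return sum(scores) / len(scores)

turns = [
    {"expected": ["search_flights"], "actual": ["search_flights"]},
    # Intermediate turn whose expected tool calls were left out of the evalset:
    {"expected": [], "actual": ["check_availability"]},
    {"expected": ["book_flight"], "actual": ["book_flight"]},
]
# trajectory_avg(turns) is 2/3, which fails a 1.0 threshold even though
# every tool the evalset did specify was called correctly.
```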
{
"conversation": [
{
"invocation_id": "inv_1",
"user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
"intermediate_data": {
"tool_uses": [
{ "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
]
}
},
{
"invocation_id": "inv_2",
"user_content": { "parts": [{"text": "Book the first option"}] },
"final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
"intermediate_data": {
"tool_uses": [
{ "name": "book_flight", "args": {"flight_id": "1"} }
]
}
}
]
}
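A quick lint pass over a conversation like the one above can catch the missing-turn problem before a full eval run. This is a sketch, not part of the ADK; the field names follow the evalset schema shown earlier:

```python
# Sketch: lint a conversation (schema as above) for turns that omit
# intermediate_data.tool_uses, a common cause of surprise trajectory failures.
def turns_missing_tool_uses(conversation):
    missing = []
    for turn in conversation:
        if not turn.get("intermediate_data", {}).get("tool_uses"):
            missing.append(turn.get("invocation_id", "?"))
    return missing

conversation = [
    {"invocation_id": "inv_1",
     "intermediate_data": {"tool_uses": [{"name": "search_flights"}]}},
    {"invocation_id": "inv_2"},  # expected tool calls were forgotten here
]
# turns_missing_tool_uses(conversation) flags "inv_2"
```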
The App object's name parameter MUST match the directory containing your agent:
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")
# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
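One defensive pattern (plain Python, not an ADK API) is to derive the name from the directory that contains the agent module, so the name and the directory can never drift apart:

```python
from pathlib import Path

# Defensive sketch: compute the App name from the directory containing the
# agent module, so it always matches what the ADK CLI expects.
def app_name_for(module_file: str) -> str:
    return Path(module_file).resolve().parent.name

# For an agent defined in <project>/app/agent.py this yields "app", so you
# could write: app = App(root_agent=root_agent, name=app_name_for(__file__))
```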
before_agent_callback pattern (state initialization): Always use a callback to initialize session state variables used in your instruction template. This prevents KeyError crashes on the first turn:
from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext

async def initialize_state(callback_context: CallbackContext) -> None:
state = callback_context.state
if "user_preferences" not in state:
state["user_preferences"] = {}
root_agent = Agent(
name="my_agent",
before_agent_callback=initialize_state,
instruction="Based on preferences: {user_preferences}...",
)
Be careful with session_input.state in your evalset. It overrides Python-level initialization:
// WRONG — initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }
// CORRECT — matches the Python type (list)
"state": { "feedback_history": [] }
// NOTE: Remove these // comments before using — JSON does not support comments.
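A type-mismatch like this can also be caught defensively in the before_agent_callback. The sketch below is an assumption-labeled pattern, not ADK machinery; the keys mirror the examples above:

```python
# Hedged sketch: normalize state types inside the before_agent_callback so a
# wrong-typed value from session_input.state (e.g. "" where a list is expected)
# cannot break later .append() calls. Keys mirror the examples above.
EXPECTED_TYPES = {"feedback_history": list, "user_preferences": dict}

def normalize_state(state):
    for key, expected in EXPECTED_TYPES.items():
        if not isinstance(state.get(key), expected):
            state[key] = expected()  # reset to the empty value of the right type
    return state

state = normalize_state({"feedback_history": ""})
state["feedback_history"].append("great service")  # safe now
```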
Models with "thinking" enabled may skip tool calls. Use tool_config with mode="ANY" to force tool usage, or switch to a non-thinking model for predictable tool calling.
| Symptom | Cause | Fix |
|---|---|---|
| Missing tool_uses in intermediate turns | Trajectory expects a match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add hallucinations_v1 metric |
| "Session not found" error | App name mismatch | Ensure App name matches directory name |
| Score fluctuates between runs | Non-deterministic model | Set temperature=0 or use rubric-based eval |
| tool_trajectory_avg_score always 0 | Agent uses google_search (model-internal) | Remove trajectory metric; see references/builtin-tools-eval.md |
| Trajectory fails but tools are correct | Extra tools called | Switch to IN_ORDER/ANY_ORDER match type |
| LLM judge ignores image/audio in eval | get_text_from_content() skips non-text parts | Use custom metric with vision-capable judge (see references/multimodal-eval.md) |
For the official evaluation documentation, fetch these pages:
https://google.github.io/adk-docs/evaluate/index.md
https://google.github.io/adk-docs/evaluate/criteria/index.md
https://google.github.io/adk-docs/evaluate/user-sim/index.md

User says: "tool_trajectory_avg_score is 0, what's wrong?" Check:
- whether the agent uses google_search — if so, see references/builtin-tools-eval.md
- whether EXACT match is in effect and the agent calls extra tools — try IN_ORDER
- tool_uses in the evalset against actual agent behavior

Weekly Installs: 1.0K
GitHub Stars: 1.2K
First Seen: Mar 9, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (1.0K), gemini-cli (1.0K), cursor (1.0K), opencode (1.0K), github-copilot (1.0K), amp (1.0K)