adk-eval-guide by google/adk-docs
npx skills add https://github.com/google/adk-docs --skill adk-eval-guide
Scaffolded project? If you used /adk-scaffold, you already have make eval, tests/eval/evalsets/, and tests/eval/eval_config.json. Start with make eval and iterate from there.
Non-scaffolded project? Use adk eval directly — see Running Evaluations below.
| File | Contents |
|---|---|
| references/criteria-guide.md | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| references/user-simulation.md | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| references/builtin-tools-eval.md | google_search and model-internal tools — trajectory behavior, metric compatibility |
| references/multimodal-eval.md | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |
Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.
Run make eval (or adk eval if there is no Makefile) and expect 5-10+ iterations. This is normal — each iteration makes the agent better.
| Failure | What to change |
|---|---|
| tool_trajectory_avg_score low | Fix agent instructions (tool ordering), update evalset tool_uses, or switch to IN_ORDER/ANY_ORDER match type |
| response_match_score low | Adjust agent instruction wording, or relax the expected response |
| final_response_match_v2 low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| rubric_based score low | Refine agent instructions to address the specific rubric that failed |
| hallucinations_v1 low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use IN_ORDER/ANY_ORDER match type, add strict stop instructions, or switch to rubric_based_tool_use_quality_v1 |
| Goal | Recommended Metric |
|---|---|
| Regression testing / CI/CD (fast, deterministic) | tool_trajectory_avg_score + response_match_score |
| Semantic response correctness (flexible phrasing OK) | final_response_match_v2 |
| Response quality without reference answer | rubric_based_final_response_quality_v1 |
| Validate tool usage reasoning | rubric_based_tool_use_quality_v1 |
| Detect hallucinated claims | hallucinations_v1 |
| Safety compliance | safety_v1 |
| Dynamic multi-turn conversations | User simulation + hallucinations_v1 / safety_v1 (see references/user-simulation.md) |
| Multimodal input (image, audio, file) | tool_trajectory_avg_score + custom metric for response quality (see references/multimodal-eval.md) |
For the complete metrics reference with config examples, match types, and custom metrics, see references/criteria-guide.md.
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json
# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results
# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2
# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
CLI options: --config_file_path, --print_detailed_results, --eval_storage_uri, --log_level
Eval set management:
adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>
Eval config (eval_config.json): Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.
{
"criteria": {
"tool_trajectory_avg_score": {
"threshold": 1.0,
"match_type": "IN_ORDER"
},
"final_response_match_v2": {
"threshold": 0.8,
"judge_model_options": {
"judge_model": "gemini-2.5-flash",
"num_samples": 5
}
},
"rubric_based_final_response_quality_v1": {
"threshold": 0.8,
"rubrics": [
{
"rubric_id": "professionalism",
"rubric_content": { "text_property": "The response must be professional and helpful." }
},
{
"rubric_id": "safety",
"rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
}
]
}
}
}
Simple threshold shorthand is also valid: "response_match_score": 0.8
For custom metrics, judge_model_options details, and user_simulator_config, see references/criteria-guide.md.
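The pass/fail semantics of the criteria entries above can be sketched in a few lines. This is illustrative only (real scoring is done by adk eval); it just shows how the object form and the bare-number shorthand both resolve to a threshold:

```python
import json

# Illustrative only: how threshold pass/fail works for criteria entries.
# Both the object form and the bare-number shorthand from above are handled.
config = json.loads("""{
  "criteria": {
    "tool_trajectory_avg_score": {"threshold": 1.0, "match_type": "IN_ORDER"},
    "response_match_score": 0.8
  }
}""")

def threshold_of(criterion):
    # Shorthand form is a bare number; full form carries a "threshold" key.
    return criterion if isinstance(criterion, (int, float)) else criterion["threshold"]

scores = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.72}
results = {
    name: scores[name] >= threshold_of(crit)
    for name, crit in config["criteria"].items()
}
# tool_trajectory_avg_score passes (1.0 >= 1.0); response_match_score fails (0.72 < 0.8)
```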
Evalset schema (evalset.json):
{
"eval_set_id": "my_eval_set",
"name": "My Eval Set",
"description": "Tests core capabilities",
"eval_cases": [
{
"eval_id": "search_test",
"conversation": [
{
"invocation_id": "inv_1",
"user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
"final_response": {
"role": "model",
"parts": [{ "text": "I found a flight for $500. Want to book?" }]
},
"intermediate_data": {
"tool_uses": [
{ "name": "search_flights", "args": { "destination": "NYC" } }
],
"intermediate_responses": [
["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
]
}
}
],
"session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
}
]
}
Key fields:
- intermediate_data.tool_uses — expected tool call trajectory (chronological order)
- intermediate_data.intermediate_responses — expected sub-agent responses (for multi-agent systems)
- session_input.state — initial session state (overrides Python-level initialization)
- conversation_scenario — alternative to conversation for user simulation (see references/user-simulation.md)

LLMs often perform extra actions not asked for (e.g., google_search after save_preferences). This causes tool_trajectory_avg_score failures with EXACT match. Solutions:
- Use the IN_ORDER or ANY_ORDER match type — tolerates extra tool calls between expected ones
- Use rubric_based_tool_use_quality_v1 instead of trajectory matching

The tool_trajectory_avg_score metric evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.
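That per-invocation averaging can be sketched as follows. This is a simplified model, not the ADK implementation; it only shows why one unspecified intermediate turn drags the score below a 1.0 threshold:

```python
# Simplified model, not the ADK implementation: the metric averages a
# per-invocation match score, so one unmatched turn sinks the mean.
def invocation_score(expected, actual):
    # EXACT-style comparison for illustration: full credit only on a perfect match.
    return 1.0 if expected == actual else 0.0

def trajectory_avg(turns):
    scores = [invocation_score(t["expected"], t["actual"]) for t in turns]
    return sum(scores) / len(scores)

turns = [
    {"expected": ["search_flights"], "actual": ["search_flights"]},
    # Intermediate turn whose expected tool calls were left out of the evalset:
    {"expected": [], "actual": ["check_availability"]},
    {"expected": ["book_flight"], "actual": ["book_flight"]},
]
# trajectory_avg(turns) is 2/3, which fails a 1.0 threshold even though
# every tool the evalset did specify was called correctly.
```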
{
"conversation": [
{
"invocation_id": "inv_1",
"user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
"intermediate_data": {
"tool_uses": [
{ "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
]
}
},
{
"invocation_id": "inv_2",
"user_content": { "parts": [{"text": "Book the first option"}] },
"final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
"intermediate_data": {
"tool_uses": [
{ "name": "book_flight", "args": {"flight_id": "1"} }
]
}
}
]
}
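A quick lint pass over a conversation like the one above can catch the missing-turn problem before a full eval run. This is a sketch, not part of the ADK; the field names follow the evalset schema shown earlier:

```python
# Sketch: lint a conversation (schema as above) for turns that omit
# intermediate_data.tool_uses, a common cause of surprise trajectory failures.
def turns_missing_tool_uses(conversation):
    missing = []
    for turn in conversation:
        if not turn.get("intermediate_data", {}).get("tool_uses"):
            missing.append(turn.get("invocation_id", "?"))
    return missing

conversation = [
    {"invocation_id": "inv_1",
     "intermediate_data": {"tool_uses": [{"name": "search_flights"}]}},
    {"invocation_id": "inv_2"},  # expected tool calls were forgotten here
]
# turns_missing_tool_uses(conversation) flags "inv_2"
```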
The App object's name parameter MUST match the directory containing your agent:
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")
# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
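One defensive pattern (plain Python, not an ADK API) is to derive the name from the directory that contains the agent module, so the name and the directory can never drift apart:

```python
from pathlib import Path

# Defensive sketch: compute the App name from the directory containing the
# agent module, so it always matches what the ADK CLI expects.
def app_name_for(module_file: str) -> str:
    return Path(module_file).resolve().parent.name

# For an agent defined in <project>/app/agent.py this yields "app", so you
# could write: app = App(root_agent=root_agent, name=app_name_for(__file__))
```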
before_agent_callback pattern (state initialization): Always use a callback to initialize session state variables used in your instruction template. This prevents KeyError crashes on the first turn:
from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext

async def initialize_state(callback_context: CallbackContext) -> None:
state = callback_context.state
if "user_preferences" not in state:
state["user_preferences"] = {}
root_agent = Agent(
name="my_agent",
before_agent_callback=initialize_state,
instruction="Based on preferences: {user_preferences}...",
)
Be careful with session_input.state in your evalset. It overrides Python-level initialization:
// WRONG — initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }
// CORRECT — matches the Python type (list)
"state": { "feedback_history": [] }
// NOTE: Remove these // comments before using — JSON does not support comments.
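A type-mismatch like this can also be caught defensively in the before_agent_callback. The sketch below is an assumption-labeled pattern, not ADK machinery; the keys mirror the examples above:

```python
# Hedged sketch: normalize state types inside the before_agent_callback so a
# wrong-typed value from session_input.state (e.g. "" where a list is expected)
# cannot break later .append() calls. Keys mirror the examples above.
EXPECTED_TYPES = {"feedback_history": list, "user_preferences": dict}

def normalize_state(state):
    for key, expected in EXPECTED_TYPES.items():
        if not isinstance(state.get(key), expected):
            state[key] = expected()  # reset to the empty value of the right type
    return state

state = normalize_state({"feedback_history": ""})
state["feedback_history"].append("great service")  # safe now
```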
Models with "thinking" enabled may skip tool calls. Use tool_config with mode="ANY" to force tool usage, or switch to a non-thinking model for predictable tool calling.
| Symptom | Cause | Fix |
|---|---|---|
| Missing tool_uses in intermediate turns | Trajectory expects a match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add hallucinations_v1 metric |
| "Session not found" error | App name mismatch | Ensure App name matches directory name |
| Score fluctuates between runs | Non-deterministic model | Set temperature=0 or use rubric-based eval |
| tool_trajectory_avg_score always 0 | Agent uses google_search (model-internal) | Remove trajectory metric; see references/builtin-tools-eval.md |
| Trajectory fails but tools are correct | Extra tools called | Switch to IN_ORDER/ANY_ORDER match type |
| LLM judge ignores image/audio in eval | get_text_from_content() skips non-text parts | Use custom metric with vision-capable judge (see references/multimodal-eval.md) |
For the official evaluation documentation, fetch these pages:
https://google.github.io/adk-docs/evaluate/index.md
https://google.github.io/adk-docs/evaluate/criteria/index.md
https://google.github.io/adk-docs/evaluate/user-sim/index.md

User says: "tool_trajectory_avg_score is 0, what's wrong?" Check:
- whether the agent uses google_search — if so, see references/builtin-tools-eval.md
- whether EXACT match is in effect and the agent calls extra tools — try IN_ORDER
- tool_uses in the evalset against actual agent behavior

Weekly Installs: 1.0K
GitHub Stars: 1.2K
First Seen: Mar 9, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (1.0K), gemini-cli (1.0K), cursor (1.0K), opencode (1.0K), github-copilot (1.0K), amp (1.0K)