npx skills add https://github.com/crinkj/common-claude-setting --skill evaluation
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
Activate this skill when evaluating agent behavior, building test sets, or validating context engineering choices.
Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight: agents may find alternative paths to goals, so evaluation should judge whether they achieve the right outcomes while following reasonable processes.
Performance Drivers: The 95% Finding
Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
This finding has significant implications for evaluation design: comparisons between agents or context strategies should control for token budgets and tool-call counts, since resource differences can dominate quality differences.
Non-Determinism and Multiple Valid Paths
Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context.
The solution is outcome-focused evaluation that judges whether agents achieve the right outcomes while following reasonable processes.
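As a concrete sketch (the result fields `answer` and `tool_calls` are hypothetical, not part of any fixed schema), an outcome-focused check compares the final answer and applies only a loose sanity bound on process:

```python
def outcome_passes(result, expected_answer, max_tool_calls=25):
    """Outcome-focused check: judge the answer, not the exact path."""
    answer_ok = result["answer"].strip().lower() == expected_answer.strip().lower()
    # The process check is deliberately loose: any reasonable number of
    # tool calls is fine; we only reject empty or runaway runs.
    process_ok = 1 <= len(result["tool_calls"]) <= max_tool_calls
    return answer_ok and process_ok

# Two agents with different paths both pass:
run_a = {"answer": "Paris", "tool_calls": ["search"] * 3}
run_b = {"answer": "paris", "tool_calls": ["search"] * 10}
```

Asserting the exact tool sequence instead would fail `run_b` despite it reaching the same correct outcome.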
Context-Dependent Failures
Agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates.
Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.
Composite Quality Dimensions
Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.
Evaluation rubrics must capture multiple dimensions, weighted appropriately for the use case.
Multi-Dimensional Rubric
Effective rubrics cover key dimensions with descriptive levels:
Factual accuracy: Claims match ground truth (excellent to failed)
Completeness: Output covers requested aspects (excellent to failed)
Citation accuracy: Citations match claimed sources (excellent to failed)
Source quality: Uses appropriate primary sources (excellent to failed)
Tool efficiency: Uses the right tools a reasonable number of times (excellent to failed)
Rubric Scoring
Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate a weighted overall score. Determine the passing threshold based on use case requirements.
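A minimal scoring sketch, assuming five descriptive levels mapped to fixed numeric values (the level names and weights below are illustrative, not prescribed):

```python
# Illustrative mapping from descriptive levels to numeric scores.
LEVELS = {"excellent": 1.0, "good": 0.75, "adequate": 0.5, "poor": 0.25, "failed": 0.0}

def rubric_score(assessments, weights):
    """Weighted overall score (0.0-1.0) from per-dimension level assessments."""
    total = sum(weights.values())
    return sum(LEVELS[assessments[d]] * w for d, w in weights.items()) / total

assessments = {"factual_accuracy": "excellent", "completeness": "good",
               "tool_efficiency": "adequate"}
weights = {"factual_accuracy": 0.5, "completeness": 0.3, "tool_efficiency": 0.2}
overall = rubric_score(assessments, weights)  # 0.825
```

With a 0.7 passing threshold this response passes overall despite only adequate tool efficiency, which is the intended behavior of weighted rubrics.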
LLM-as-Judge
LLM-based evaluation scales to large test sets and provides consistent judgments. The key is designing effective evaluation prompts that capture the dimensions of interest.
Provide a clear task description, the agent output, ground truth (if available), and an evaluation scale with level descriptions, then request a structured judgment.
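A sketch of such a judge prompt (the wording, dimensions, and JSON schema here are assumptions to adapt, not a fixed format):

```python
JUDGE_PROMPT = """You are evaluating an agent's output.

Task: {task}
Agent output: {output}
Ground truth (may be empty): {ground_truth}

Rate each dimension on this scale: excellent / good / adequate / poor / failed
Dimensions: factual_accuracy, completeness, citation_accuracy.

Respond with JSON only:
{{"factual_accuracy": "...", "completeness": "...",
  "citation_accuracy": "...", "justification": "..."}}"""

def build_judge_prompt(task, output, ground_truth=""):
    """Fill the template; the result is sent to the judge model."""
    return JUDGE_PROMPT.format(task=task, output=output, ground_truth=ground_truth)
```

Requesting JSON-only output with an explicit level scale makes the judge's responses parseable and keeps scoring consistent across a large test set.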
Human Evaluation
Human evaluation catches what automation misses: hallucinated answers on unusual queries, systemic failures, and subtle biases.
Effective human evaluation covers edge cases, samples systematically, tracks patterns, and provides contextual understanding.
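Systematic sampling for human review can be sketched as stratified selection across complexity levels (the `complexity` field and stratum size are assumptions):

```python
import random

def sample_for_human_review(interactions, per_stratum=5, seed=0):
    """Pick a fixed number of interactions from each complexity level."""
    rng = random.Random(seed)  # seeded for reproducible review batches
    by_level = {}
    for item in interactions:
        by_level.setdefault(item["complexity"], []).append(item)
    picked = []
    for level in sorted(by_level):
        items = by_level[level]
        picked.extend(rng.sample(items, min(per_stratum, len(items))))
    return picked
```

Stratifying guarantees that rare complex interactions are reviewed even when simple queries dominate the traffic.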
End-State Evaluation
For agents that mutate persistent state, end-state evaluation focuses on whether the final state matches expectations rather than how the agent got there.
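A minimal end-state check might compare only the fields the test case specifies (the order-cancellation example is hypothetical):

```python
def end_state_matches(expected, actual):
    """True when every field the test specifies has the expected value;
    incidental fields and the path taken are ignored."""
    return all(actual.get(key) == value for key, value in expected.items())

# e.g. final database row after a "cancel order 42" task:
final_row = {"order_id": 42, "status": "cancelled", "updated_by": "agent"}
```

Checking only specified fields lets different agents take different paths (and write different incidental metadata) while still being judged on the outcome.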
Sample Selection
Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit. Small test sets reveal large effects.
Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels.
Complexity Stratification
Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
Testing Context Strategies
Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics.
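One way to sketch such a comparison (`run_agent`, `evaluate`, and the `tokens_used` result field are hypothetical interfaces, not part of any library):

```python
def compare_strategies(run_agent, evaluate, test_set, strategies):
    """Run each context strategy over the same test set; report quality and cost."""
    report = {}
    for name, strategy in strategies.items():
        scores, tokens = [], 0
        for case in test_set:
            result = run_agent(case["input"], strategy)
            scores.append(evaluate(result, case))
            tokens += result["tokens_used"]
        report[name] = {"mean_score": sum(scores) / len(scores),
                        "total_tokens": tokens}
    return report
```

Reporting cost alongside quality matters here: given the 80% variance explained by token usage, a strategy that scores slightly lower at a third of the token cost may still be the better choice.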
Degradation Testing
Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic. Establish safe operating limits.
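A sketch for locating the cliff, assuming a `score_at(context_size)` callback that returns mean quality at that size (the 0.15 drop threshold is an arbitrary example):

```python
def find_performance_cliff(score_at, context_sizes, max_drop=0.15):
    """Return the first context size whose score falls more than
    max_drop below the best score seen at smaller sizes, else None."""
    best = float("-inf")
    for size in sorted(context_sizes):
        score = score_at(size)
        if score < best - max_drop:
            return size
        best = max(best, score)
    return None
```

The size returned (or the last safe size below it) becomes the safe operating limit for the deployment.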
Evaluation Pipeline
Build evaluation pipelines that run automatically on agent changes. Track results over time. Compare versions to identify improvements or regressions.
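Version comparison can be sketched as a per-dimension regression check (the 0.02 tolerance is an illustrative choice to absorb run-to-run noise):

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Dimensions where the candidate scores meaningfully below the baseline,
    mapped to the (negative) score delta."""
    return {dim: round(candidate.get(dim, 0.0) - baseline[dim], 4)
            for dim in baseline
            if candidate.get(dim, 0.0) < baseline[dim] - tolerance}
```

An empty result means the candidate is safe to promote; a non-empty one names exactly which quality dimensions regressed and by how much.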
Production Monitoring
Track evaluation metrics in production by sampling interactions and evaluating them at random. Set alerts for quality drops. Maintain dashboards for trend analysis.
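A minimal rolling-window alert sketch (the threshold and window size are illustrative):

```python
def quality_alert(sampled_scores, threshold=0.7, window=50):
    """Mean quality over the most recent sampled interactions,
    plus an alert flag when it drops below the threshold."""
    recent = sampled_scores[-window:]
    mean = sum(recent) / len(recent)
    return {"mean": round(mean, 4), "alert": mean < threshold}
```

In practice the scores would come from the LLM-as-judge rubric applied to a random sample of production interactions, and the alert would feed a dashboard or pager.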
Overfitting to specific paths: Evaluate outcomes, not specific steps.
Ignoring edge cases: Include diverse test scenarios.
Single-metric obsession: Use multi-dimensional rubrics.
Neglecting context effects: Test with realistic context sizes.
Skipping human evaluation: Automated evaluation misses subtle issues.
Example 1: Simple Evaluation
def evaluate_agent_response(response, expected):
    # rubric maps each dimension to its config, e.g. {"weight": 0.4, ...}
    rubric = load_rubric()
    scores = {}
    for dimension, config in rubric.items():
        scores[dimension] = assess_dimension(response, expected, dimension)
    # Weight by each dimension's own config, not the last loop iteration's.
    weights = {dim: cfg["weight"] for dim, cfg in rubric.items()}
    overall = weighted_average(scores, weights)
    return {"passed": overall >= 0.7, "scores": scores}
Example 2: Test Set Structure
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
test_set = [
{
"name": "simple_lookup",
"input": "What is the capital of France?",
"expected": {"type": "fact", "answer": "Paris"},
"complexity": "simple",
"description": "Single tool call, factual lookup"
},
{
"name": "medium_query",
"input": "Compare the revenue of Apple and Microsoft last quarter",
"complexity": "medium",
"description": "Multiple tool calls, comparison logic"
},
{
"name": "multi_step_reasoning",
"input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
"complexity": "complex",
"description": "Many tool calls, aggregation, analysis"
},
{
"name": "research_synthesis",
"input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
"complexity": "very_complex",
"description": "Extended interaction, deep reasoning, synthesis"
}
]
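The examples above can be tied together with a small runner that tallies pass rates per complexity level (`agent` and `evaluate` are stand-ins for your own implementations):

```python
def run_test_set(agent, evaluate, test_set):
    """Run every case and tally pass/total counts per complexity level."""
    tally = {}
    for case in test_set:
        output = agent(case["input"])
        passed = evaluate(output, case.get("expected"))
        bucket = tally.setdefault(case["complexity"], {"passed": 0, "total": 0})
        bucket["total"] += 1
        bucket["passed"] += int(passed)
    return tally
```

Breaking results down by complexity level exposes the context-dependent failures discussed earlier: an agent can show a perfect simple-query pass rate while degrading sharply on very complex cases.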
This skill connects to all other skills as a cross-cutting concern.
Created: 2025-12-20 | Last Updated: 2025-12-20 | Author: Agent Skills for Context Engineering Contributors | Version: 1.0.0