npx skills add https://github.com/crinkj/common-claude-setting --skill evaluation
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
Activate this skill when evaluating agent behavior, building test sets, or validating context engineering choices.
Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight: agents may find alternative paths to goals, so evaluation should judge whether they achieve the right outcomes while following reasonable processes.
Performance Drivers: The 95% Finding
Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
This finding has significant implications for evaluation design: comparisons between agents or context strategies should control for token budgets and tool-call counts, since resource differences can dominate quality differences.
Non-Determinism and Multiple Valid Paths
Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context.
The solution is outcome-focused evaluation that judges whether agents achieve the right outcomes while following reasonable processes.
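As a concrete sketch (the result fields `answer` and `tool_calls` are hypothetical, not part of any fixed schema), an outcome-focused check compares the final answer and applies only a loose sanity bound on process:

```python
def outcome_passes(result, expected_answer, max_tool_calls=25):
    """Outcome-focused check: judge the answer, not the exact path."""
    answer_ok = result["answer"].strip().lower() == expected_answer.strip().lower()
    # The process check is deliberately loose: any reasonable number of
    # tool calls is fine; we only reject empty or runaway runs.
    process_ok = 1 <= len(result["tool_calls"]) <= max_tool_calls
    return answer_ok and process_ok

# Two agents with different paths both pass:
run_a = {"answer": "Paris", "tool_calls": ["search"] * 3}
run_b = {"answer": "paris", "tool_calls": ["search"] * 10}
```

Asserting the exact tool sequence instead would fail `run_b` despite it reaching the same correct outcome.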
Context-Dependent Failures
Agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates.
Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.
Composite Quality Dimensions
Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.
Evaluation rubrics must capture multiple dimensions, weighted appropriately for the use case.
Multi-Dimensional Rubric
Effective rubrics cover key dimensions with descriptive levels:
Factual accuracy: Claims match ground truth (excellent to failed)
Completeness: Output covers requested aspects (excellent to failed)
Citation accuracy: Citations match claimed sources (excellent to failed)
Source quality: Uses appropriate primary sources (excellent to failed)
Tool efficiency: Uses the right tools a reasonable number of times (excellent to failed)
Rubric Scoring
Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate a weighted overall score. Determine the passing threshold based on use case requirements.
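A minimal scoring sketch, assuming five descriptive levels mapped to fixed numeric values (the level names and weights below are illustrative, not prescribed):

```python
# Illustrative mapping from descriptive levels to numeric scores.
LEVELS = {"excellent": 1.0, "good": 0.75, "adequate": 0.5, "poor": 0.25, "failed": 0.0}

def rubric_score(assessments, weights):
    """Weighted overall score (0.0-1.0) from per-dimension level assessments."""
    total = sum(weights.values())
    return sum(LEVELS[assessments[d]] * w for d, w in weights.items()) / total

assessments = {"factual_accuracy": "excellent", "completeness": "good",
               "tool_efficiency": "adequate"}
weights = {"factual_accuracy": 0.5, "completeness": 0.3, "tool_efficiency": 0.2}
overall = rubric_score(assessments, weights)  # 0.825
```

With a 0.7 passing threshold this response passes overall despite only adequate tool efficiency, which is the intended behavior of weighted rubrics.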
LLM-as-Judge
LLM-based evaluation scales to large test sets and provides consistent judgments. The key is designing effective evaluation prompts that capture the dimensions of interest.
Provide a clear task description, the agent output, ground truth (if available), and an evaluation scale with level descriptions, then request a structured judgment.
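A sketch of such a judge prompt (the wording, dimensions, and JSON schema here are assumptions to adapt, not a fixed format):

```python
JUDGE_PROMPT = """You are evaluating an agent's output.

Task: {task}
Agent output: {output}
Ground truth (may be empty): {ground_truth}

Rate each dimension on this scale: excellent / good / adequate / poor / failed
Dimensions: factual_accuracy, completeness, citation_accuracy.

Respond with JSON only:
{{"factual_accuracy": "...", "completeness": "...",
  "citation_accuracy": "...", "justification": "..."}}"""

def build_judge_prompt(task, output, ground_truth=""):
    """Fill the template; the result is sent to the judge model."""
    return JUDGE_PROMPT.format(task=task, output=output, ground_truth=ground_truth)
```

Requesting JSON-only output with an explicit level scale makes the judge's responses parseable and keeps scoring consistent across a large test set.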
Human Evaluation
Human evaluation catches what automation misses: hallucinated answers on unusual queries, systemic failures, and subtle biases.
Effective human evaluation covers edge cases, samples systematically, tracks patterns, and provides contextual understanding.
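Systematic sampling for human review can be sketched as stratified selection across complexity levels (the `complexity` field and stratum size are assumptions):

```python
import random

def sample_for_human_review(interactions, per_stratum=5, seed=0):
    """Pick a fixed number of interactions from each complexity level."""
    rng = random.Random(seed)  # seeded for reproducible review batches
    by_level = {}
    for item in interactions:
        by_level.setdefault(item["complexity"], []).append(item)
    picked = []
    for level in sorted(by_level):
        items = by_level[level]
        picked.extend(rng.sample(items, min(per_stratum, len(items))))
    return picked
```

Stratifying guarantees that rare complex interactions are reviewed even when simple queries dominate the traffic.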
End-State Evaluation
For agents that mutate persistent state, end-state evaluation focuses on whether the final state matches expectations rather than how the agent got there.
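A minimal end-state check might compare only the fields the test case specifies (the order-cancellation example is hypothetical):

```python
def end_state_matches(expected, actual):
    """True when every field the test specifies has the expected value;
    incidental fields and the path taken are ignored."""
    return all(actual.get(key) == value for key, value in expected.items())

# e.g. final database row after a "cancel order 42" task:
final_row = {"order_id": 42, "status": "cancelled", "updated_by": "agent"}
```

Checking only specified fields lets different agents take different paths (and write different incidental metadata) while still being judged on the outcome.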
Sample Selection
Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit. Small test sets reveal large effects.
Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels.
Complexity Stratification
Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
Testing Context Strategies
Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics.
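One way to sketch such a comparison (`run_agent`, `evaluate`, and the `tokens_used` result field are hypothetical interfaces, not part of any library):

```python
def compare_strategies(run_agent, evaluate, test_set, strategies):
    """Run each context strategy over the same test set; report quality and cost."""
    report = {}
    for name, strategy in strategies.items():
        scores, tokens = [], 0
        for case in test_set:
            result = run_agent(case["input"], strategy)
            scores.append(evaluate(result, case))
            tokens += result["tokens_used"]
        report[name] = {"mean_score": sum(scores) / len(scores),
                        "total_tokens": tokens}
    return report
```

Reporting cost alongside quality matters here: given the 80% variance explained by token usage, a strategy that scores slightly lower at a third of the token cost may still be the better choice.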
Degradation Testing
Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic. Establish safe operating limits.
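A sketch for locating the cliff, assuming a `score_at(context_size)` callback that returns mean quality at that size (the 0.15 drop threshold is an arbitrary example):

```python
def find_performance_cliff(score_at, context_sizes, max_drop=0.15):
    """Return the first context size whose score falls more than
    max_drop below the best score seen at smaller sizes, else None."""
    best = float("-inf")
    for size in sorted(context_sizes):
        score = score_at(size)
        if score < best - max_drop:
            return size
        best = max(best, score)
    return None
```

The size returned (or the last safe size below it) becomes the safe operating limit for the deployment.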
Evaluation Pipeline
Build evaluation pipelines that run automatically on agent changes. Track results over time. Compare versions to identify improvements or regressions.
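Version comparison can be sketched as a per-dimension regression check (the 0.02 tolerance is an illustrative choice to absorb run-to-run noise):

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Dimensions where the candidate scores meaningfully below the baseline,
    mapped to the (negative) score delta."""
    return {dim: round(candidate.get(dim, 0.0) - baseline[dim], 4)
            for dim in baseline
            if candidate.get(dim, 0.0) < baseline[dim] - tolerance}
```

An empty result means the candidate is safe to promote; a non-empty one names exactly which quality dimensions regressed and by how much.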
Production Monitoring
Track evaluation metrics in production by sampling interactions and evaluating them at random. Set alerts for quality drops. Maintain dashboards for trend analysis.
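A minimal rolling-window alert sketch (the threshold and window size are illustrative):

```python
def quality_alert(sampled_scores, threshold=0.7, window=50):
    """Mean quality over the most recent sampled interactions,
    plus an alert flag when it drops below the threshold."""
    recent = sampled_scores[-window:]
    mean = sum(recent) / len(recent)
    return {"mean": round(mean, 4), "alert": mean < threshold}
```

In practice the scores would come from the LLM-as-judge rubric applied to a random sample of production interactions, and the alert would feed a dashboard or pager.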
Overfitting to specific paths: Evaluate outcomes, not specific steps.
Ignoring edge cases: Include diverse test scenarios.
Single-metric obsession: Use multi-dimensional rubrics.
Neglecting context effects: Test with realistic context sizes.
Skipping human evaluation: Automated evaluation misses subtle issues.
Example 1: Simple Evaluation
def evaluate_agent_response(response, expected):
    # rubric maps each dimension to its config, e.g. {"weight": 0.4, ...}
    rubric = load_rubric()
    scores = {}
    for dimension, config in rubric.items():
        scores[dimension] = assess_dimension(response, expected, dimension)
    # Weight by each dimension's own config, not the last loop iteration's.
    weights = {dim: cfg["weight"] for dim, cfg in rubric.items()}
    overall = weighted_average(scores, weights)
    return {"passed": overall >= 0.7, "scores": scores}
Example 2: Test Set Structure
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
test_set = [
{
"name": "simple_lookup",
"input": "What is the capital of France?",
"expected": {"type": "fact", "answer": "Paris"},
"complexity": "simple",
"description": "Single tool call, factual lookup"
},
{
"name": "medium_query",
"input": "Compare the revenue of Apple and Microsoft last quarter",
"complexity": "medium",
"description": "Multiple tool calls, comparison logic"
},
{
"name": "multi_step_reasoning",
"input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
"complexity": "complex",
"description": "Many tool calls, aggregation, analysis"
},
{
"name": "research_synthesis",
"input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
"complexity": "very_complex",
"description": "Extended interaction, deep reasoning, synthesis"
}
]
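The examples above can be tied together with a small runner that tallies pass rates per complexity level (`agent` and `evaluate` are stand-ins for your own implementations):

```python
def run_test_set(agent, evaluate, test_set):
    """Run every case and tally pass/total counts per complexity level."""
    tally = {}
    for case in test_set:
        output = agent(case["input"])
        passed = evaluate(output, case.get("expected"))
        bucket = tally.setdefault(case["complexity"], {"passed": 0, "total": 0})
        bucket["total"] += 1
        bucket["passed"] += int(passed)
    return tally
```

Breaking results down by complexity level exposes the context-dependent failures discussed earlier: an agent can show a perfect simple-query pass rate while degrading sharply on very complex cases.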
This skill connects to all other skills as a cross-cutting concern.
Created: 2025-12-20 | Last Updated: 2025-12-20 | Author: Agent Skills for Context Engineering Contributors | Version: 1.0.0