evaluation by sickn33/antigravity-awesome-skills

```shell
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill evaluation
```
Build evaluation frameworks for agent systems
Use this skill when you need to build evaluation frameworks for agent systems.
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
Activate this skill when designing evaluations for agent systems, validating context engineering choices, or checking for quality regressions.
Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight: agents may find alternative paths to the same goal, so evaluation should judge whether they achieve the right outcome while following a reasonable process.
Performance Drivers: The 95% Finding

Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of the performance variance:
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
This finding has significant implications for evaluation design:
Non-Determinism and Multiple Valid Paths

Agents may take completely different yet valid paths to reach a goal. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this setting.

The solution is outcome-focused evaluation: judge whether the agent achieves the right outcome while following a reasonable process.
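A minimal sketch of such an outcome-focused check. The result schema (`answer`, `tool_calls`, `errors` fields) is illustrative, not part of the skill:

```python
def outcome_passes(result, expected):
    """Outcome-focused check: any trajectory that reaches the expected
    answer via a reasonable process passes, regardless of the path taken."""
    answer_ok = result["answer"] == expected["answer"]
    # "Reasonable process" here just means: at least one tool call, no errors.
    process_ok = len(result["tool_calls"]) >= 1 and not result.get("errors")
    return answer_ok and process_ok

# Two agents take different paths to the same goal; both should pass.
run_a = {"answer": "Paris", "tool_calls": ["search"] * 3}
run_b = {"answer": "Paris", "tool_calls": ["search"] * 10}
expected = {"answer": "Paris"}
```

The point of the check is what it does not inspect: it never asks which tools were called or in what order, only that the outcome is right and the process was plausible.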
Context-Dependent Failures

Agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction, once context accumulates.

Evaluation must therefore cover a range of complexity levels and test extended interactions, not just isolated queries.
Composite Quality Dimensions

Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.

Evaluation rubrics must capture multiple dimensions, weighted appropriately for the use case.
Multi-Dimensional Rubric

Effective rubrics cover key dimensions, each graded on descriptive levels (excellent to failed):

- Factual accuracy: claims match ground truth
- Completeness: output covers all requested aspects
- Citation accuracy: citations match the claimed sources
- Source quality: uses appropriate primary sources
- Tool efficiency: uses the right tools a reasonable number of times
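A sketch of how such a rubric might be represented; the weights are illustrative assumptions that sum to 1.0:

```python
# Hypothetical rubric with per-dimension weights that sum to 1.0.
RUBRIC = {
    "factual_accuracy":  {"weight": 0.35},
    "completeness":      {"weight": 0.25},
    "citation_accuracy": {"weight": 0.15},
    "source_quality":    {"weight": 0.15},
    "tool_efficiency":   {"weight": 0.10},
}

def weighted_overall(scores):
    """Combine per-dimension scores (each 0.0-1.0) into one weighted score."""
    return sum(RUBRIC[dim]["weight"] * score for dim, score in scores.items())

# Example assessment converted to numeric scores per dimension.
scores = {"factual_accuracy": 1.0, "completeness": 0.8,
          "citation_accuracy": 1.0, "source_quality": 0.6,
          "tool_efficiency": 0.9}
```

Weighting factual accuracy highest fits a research-style use case; a latency-sensitive assistant might instead upweight tool efficiency.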
Rubric Scoring

Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate a weighted overall score, then set a passing threshold based on use case requirements.
LLM-as-Judge

LLM-based evaluation scales to large test sets and provides consistent judgments. The key is designing evaluation prompts that capture the dimensions of interest.

Provide a clear task description, the agent's output, ground truth (if available), and an evaluation scale with level descriptions, then request a structured judgment.
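A minimal judge-prompt builder sketched for two of the rubric dimensions; the template wording, the 0-4 scale, and the JSON response shape are assumptions, and `build_judge_prompt` is a hypothetical helper:

```python
# Sketch of an LLM-as-judge prompt. The filled-in prompt would be sent to
# whatever chat-completion client acts as the judge model.
JUDGE_TEMPLATE = """You are evaluating an agent's output.

Task: {task}
Agent output: {output}
Ground truth (may be empty): {ground_truth}

Rate each dimension from 0 (failed) to 4 (excellent):
- factual_accuracy: claims match the ground truth
- completeness: output covers all requested aspects

Respond as JSON: {{"factual_accuracy": 0-4, "completeness": 0-4, "rationale": "one sentence"}}"""

def build_judge_prompt(task, output, ground_truth=""):
    """Fill the template with one evaluation instance."""
    return JUDGE_TEMPLATE.format(task=task, output=output, ground_truth=ground_truth)
```

Requesting JSON with a one-sentence rationale keeps judgments machine-parseable while still leaving a trace a human reviewer can audit.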
Human Evaluation

Human evaluation catches what automation misses: hallucinated answers on unusual queries, system failures, and subtle biases.

Effective human evaluation covers edge cases, samples systematically, tracks patterns, and provides contextual understanding.
End-State Evaluation

For agents that mutate persistent state, end-state evaluation focuses on whether the final state matches expectations rather than on how the agent got there.
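A sketch of end-state comparison, assuming the persistent state can be snapshotted as a dictionary (the ticket schema is purely illustrative):

```python
def evaluate_end_state(final_state, expected_state):
    """Compare only the fields the task cares about in the final state,
    ignoring how the agent got there."""
    diffs = {key: (expected_state[key], final_state.get(key))
             for key in expected_state
             if final_state.get(key) != expected_state[key]}
    return {"passed": not diffs, "diffs": diffs}

# e.g. an agent asked to close and reassign a ticket: extra fields such as
# the audit log are ignored by the check.
final = {"ticket_status": "closed", "assignee": "bot", "audit_log": ["a", "b"]}
expected = {"ticket_status": "closed", "assignee": "bot"}
```

Reporting the diffs, not just pass/fail, makes failures actionable: you see exactly which part of the state the agent left wrong.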
Sample Selection

Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit, so small test sets reveal large effects.

Sample from real usage patterns, add known edge cases, and ensure coverage across complexity levels.
Complexity Stratification

Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
Testing Context Strategies

Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set, then compare quality scores, token usage, and efficiency metrics.
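This comparison can be sketched as a small harness; `run_agent` and its return shape (`score`, `tokens`) are assumptions standing in for your own agent runner:

```python
def compare_strategies(strategy_names, test_set, run_agent):
    """Run every test case under each context strategy and aggregate
    quality and token cost so the strategies can be compared side by side."""
    report = {}
    for name in strategy_names:
        runs = [run_agent(case, strategy=name) for case in test_set]
        report[name] = {
            "mean_score": sum(r["score"] for r in runs) / len(runs),
            "total_tokens": sum(r["tokens"] for r in runs),
        }
    return report
```

Because every strategy sees the identical test set, any difference in the report is attributable to the context strategy rather than to the sampled tasks.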
Degradation Testing

Test how context degradation affects performance by running agents at different context sizes. Identify the performance cliffs where context becomes problematic, and establish safe operating limits.
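A minimal cliff-finding probe, assuming a `run_at_context_size` hook (your harness would supply it) that returns a quality score in [0, 1]:

```python
def find_performance_cliff(run_at_context_size, sizes, floor=0.7):
    """Probe ascending context sizes and report the largest size that still
    clears the quality floor, plus the first size that falls below it."""
    last_safe = None
    for size in sorted(sizes):
        if run_at_context_size(size) >= floor:
            last_safe = size
        else:
            return {"safe_limit": last_safe, "cliff_at": size}
    return {"safe_limit": last_safe, "cliff_at": None}
```

The `safe_limit` it reports is a candidate operating limit; in practice you would average several runs per size, since agent quality is non-deterministic.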
Evaluation Pipeline

Build evaluation pipelines that run automatically when the agent changes. Track results over time and compare versions to identify improvements or regressions.
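A regression gate such a pipeline might use, comparing per-test scores against a stored baseline; the tolerance value is an illustrative default:

```python
def check_regressions(baseline, current, tolerance=0.05):
    """Flag any test whose score fell more than `tolerance` below the
    stored baseline; the pipeline fails if any regression is found."""
    regressions = {
        name: (baseline[name], score)
        for name, score in current.items()
        if name in baseline and score < baseline[name] - tolerance
    }
    return {"passed": not regressions, "regressions": regressions}
```

The tolerance absorbs run-to-run noise from non-determinism, so only drops larger than normal variance fail the gate.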
Monitoring Production

Track evaluation metrics in production by sampling interactions and evaluating them at random. Set alerts for quality drops and maintain dashboards for trend analysis.
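A sketch of the sampling and alerting pieces; the 5% sample rate and 0.75 alert threshold are illustrative assumptions:

```python
import random

def should_sample(rate=0.05, rng=random):
    """Decide whether to pull this production interaction into evaluation."""
    return rng.random() < rate

def quality_alert(recent_scores, threshold=0.75):
    """Alert when the rolling mean quality of sampled interactions drops
    below the threshold; an empty window never alerts."""
    if not recent_scores:
        return False
    return sum(recent_scores) / len(recent_scores) < threshold
```

Passing the random source in as `rng` keeps the sampler testable and lets you seed it for reproducible dashboards.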
Common Pitfalls

- Overfitting to specific paths: evaluate outcomes, not specific steps.
- Ignoring edge cases: include diverse test scenarios.
- Single-metric obsession: use multi-dimensional rubrics.
- Neglecting context effects: test with realistic context sizes.
- Skipping human evaluation: automated evaluation misses subtle issues.
Example 1: Simple Evaluation
```python
def evaluate_agent_response(response, expected):
    # load_rubric, assess_dimension, and weighted_average are helpers
    # assumed to be provided elsewhere by the evaluation harness.
    rubric = load_rubric()
    scores = {}
    for dimension in rubric:
        scores[dimension] = assess_dimension(response, expected, dimension)
    # Weight each dimension's score by its configured rubric weight.
    weights = {dim: cfg["weight"] for dim, cfg in rubric.items()}
    overall = weighted_average(scores, weights)
    return {"passed": overall >= 0.7, "scores": scores}
```
Example 2: Test Set Structure
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
```python
test_set = [
    {
        "name": "simple_lookup",
        "input": "What is the capital of France?",
        "expected": {"type": "fact", "answer": "Paris"},
        "complexity": "simple",
        "description": "Single tool call, factual lookup"
    },
    {
        "name": "medium_query",
        "input": "Compare the revenue of Apple and Microsoft last quarter",
        "complexity": "medium",
        "description": "Multiple tool calls, comparison logic"
    },
    {
        "name": "multi_step_reasoning",
        "input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
        "complexity": "complex",
        "description": "Many tool calls, aggregation, analysis"
    },
    {
        "name": "research_synthesis",
        "input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
        "complexity": "very_complex",
        "description": "Extended interaction, deep reasoning, synthesis"
    }
]
```
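One way to drive such a test set end to end, aggregating pass rate per complexity tier; `agent` and `evaluate` are stand-ins for your agent entry point and a rubric-based evaluator like Example 1:

```python
from collections import defaultdict

def run_test_set(test_set, agent, evaluate):
    """Run every case through the agent and report pass rate per tier."""
    by_complexity = defaultdict(list)
    for case in test_set:
        result = agent(case["input"])
        passed = evaluate(result, case.get("expected"))["passed"]
        by_complexity[case["complexity"]].append(passed)
    return {tier: sum(flags) / len(flags)
            for tier, flags in by_complexity.items()}
```

Reporting per-tier pass rates, rather than one global number, surfaces the context-dependent failures discussed above, such as an agent that aces simple lookups but collapses on very complex synthesis tasks.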
This skill connects to all other skills as a cross-cutting concern.
Created: 2025-12-20 · Last Updated: 2025-12-20 · Author: Agent Skills for Context Engineering Contributors · Version: 1.0.0
Weekly Installs: 76
Repository: https://github.com/sickn33/antigravity-awesome-skills
GitHub Stars: 27.1K
First Seen: Feb 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: gemini-cli (73), codex (73), opencode (73), cursor (71), github-copilot (71), kimi-cli (71)