evaluation by guanyang/antigravity-skills
Install with: `npx skills add https://github.com/guanyang/antigravity-skills --skill evaluation`
Evaluate agent systems differently from traditional software because agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Build evaluation frameworks that account for these characteristics, provide actionable feedback, catch regressions, and validate that context engineering choices achieve intended effects.
Activate this skill when:
Focus evaluation on outcomes rather than execution paths, because agents may find alternative valid routes to goals. Judge whether the agent achieves the right outcome via a reasonable process, not whether it followed a specific sequence of steps.
Use multi-dimensional rubrics instead of single scores because one number hides critical failures in specific dimensions. Capture factual accuracy, completeness, citation accuracy, source quality, and tool efficiency as separate dimensions, then weight them for the use case.
Deploy LLM-as-judge for scalable evaluation across large test sets while supplementing with human review to catch edge cases, hallucinations, and subtle biases that automated evaluation misses.
Performance Drivers: The 95% Finding
Apply the BrowseComp research finding when designing evaluation budgets: three factors explain 95% of browsing agent performance variance.
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Act on these implications when designing evaluations:
Handle Non-Determinism and Multiple Valid Paths
Design evaluations that tolerate path variation because agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten; both may produce correct answers. Avoid checking for specific steps. Instead, define outcome criteria (correctness, completeness, quality) and score against those, treating the execution path as informational rather than evaluative.
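An outcome-only scorer along these lines might look like the following minimal sketch; the field names and criteria are hypothetical, not part of the skill:

```python
def score_outcome(result: dict, expected: dict) -> dict:
    """Score an agent run on outcome criteria only; the execution path is
    recorded for analysis but never affects the score."""
    correct = result.get("answer") == expected.get("answer")
    complete = all(
        fact in result.get("facts", {}) for fact in expected.get("required_facts", [])
    )
    return {
        "correct": correct,
        "complete": complete,
        "passed": correct and complete,
        # Informational only: never scored.
        "path_info": {"tool_calls": len(result.get("steps", []))},
    }

# Two runs take different paths (3 vs. 10 tool calls) to the same answer.
run_short = {"answer": "Paris", "facts": {}, "steps": ["search"] * 3}
run_long = {"answer": "Paris", "facts": {}, "steps": ["search"] * 10}
expected = {"answer": "Paris", "required_facts": []}
```

Both runs pass despite very different paths; the tool-call count survives in `path_info` for later analysis without influencing the verdict.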
Test Context-Dependent Failures
Evaluate across a range of complexity levels and interaction lengths because agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones, work well with one tool set but fail with another, or degrade after extended interaction as context accumulates. Include simple, medium, complex, and very complex test cases to surface these patterns.
Score Composite Quality Dimensions Separately
Break agent quality into separate dimensions (factual accuracy, completeness, coherence, tool efficiency, process quality) and score each independently because an agent might score high on accuracy but low on efficiency, or vice versa. Then compute weighted aggregates tuned to use-case priorities. This approach reveals which dimensions need improvement rather than averaging away the signal.
Build Multi-Dimensional Rubrics
Define rubrics covering key dimensions with descriptive levels from excellent to failed. Include these core dimensions and adapt weights per use case:
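As a sketch, such a rubric can be encoded as plain data. The dimension names follow this section; the weights and level descriptions below are example values, not prescribed by the skill:

```python
# Illustrative rubric: weights and level wording are example values.
RUBRIC = {
    "factual_accuracy": {
        "weight": 0.40,
        "levels": {
            "excellent": "All claims verifiably correct",
            "good": "Minor inaccuracies that do not change conclusions",
            "poor": "Material errors in key claims",
            "failed": "Fabricated or contradicted facts",
        },
    },
    "completeness": {
        "weight": 0.35,
        "levels": {
            "excellent": "Covers every required aspect",
            "good": "Covers most aspects with small gaps",
            "poor": "Misses several required aspects",
            "failed": "Answers a different question",
        },
    },
    "tool_efficiency": {
        "weight": 0.25,
        "levels": {
            "excellent": "No redundant tool calls",
            "good": "Occasional redundant calls",
            "poor": "Frequent redundant or failed calls",
            "failed": "Tool use never converges",
        },
    },
}

# Sanity check on the rubric itself: weights should sum to 1.0.
total_weight = sum(d["weight"] for d in RUBRIC.values())
```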
Convert Rubrics to Numeric Scores
Map dimension assessments to numeric scores (0.0 to 1.0), apply per-dimension weights, and calculate weighted overall scores. Set passing thresholds based on use-case requirements, typically 0.7 for general use and 0.9 for high-stakes applications. Store individual dimension scores alongside the aggregate because the breakdown drives targeted improvement.
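A minimal sketch of that mapping; the scale values and weights here are illustrative:

```python
# Example mapping from rubric levels to the 0.0-1.0 scale.
LEVEL_SCORES = {"excellent": 1.0, "good": 0.75, "fair": 0.5, "poor": 0.25, "failed": 0.0}

def aggregate(assessments: dict, weights: dict, threshold: float = 0.7) -> dict:
    """Convert per-dimension level labels into a weighted overall score."""
    scores = {dim: LEVEL_SCORES[level] for dim, level in assessments.items()}
    overall = sum(scores[dim] * weights[dim] for dim in scores)
    # Keep the per-dimension breakdown alongside the aggregate:
    # the breakdown is what drives targeted improvement.
    return {"overall": overall, "passed": overall >= threshold, "scores": scores}

result = aggregate(
    {"factual_accuracy": "excellent", "completeness": "good", "tool_efficiency": "fair"},
    {"factual_accuracy": 0.5, "completeness": 0.3, "tool_efficiency": 0.2},
)
```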
Use LLM-as-Judge for Scale
Build LLM-based evaluation prompts that include: clear task description, the agent output under test, ground truth when available, an evaluation scale with explicit level descriptions, and a request for structured judgment with reasoning. LLM judges provide consistent, scalable evaluation across large test sets. Use a different model family than the agent being evaluated to avoid self-enhancement bias.
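A sketch of a judge-prompt builder covering those ingredients; the wording and JSON schema are assumptions, and the actual call to a judge model (from a different family than the agent under test) is omitted:

```python
def build_judge_prompt(task, agent_output, ground_truth=None):
    """Assemble an LLM-as-judge evaluation prompt from the pieces the
    section lists: task, output, optional ground truth, scale, and a
    request for structured judgment with reasoning."""
    parts = [
        "You are evaluating an AI agent's response to a task.",
        f"Task:\n{task}",
        f"Agent output:\n{agent_output}",
    ]
    if ground_truth is not None:
        parts.append(f"Ground truth:\n{ground_truth}")
    parts.append(
        "Rate each dimension as excellent, good, fair, poor, or failed:\n"
        "- excellent: fully correct and complete\n"
        "- good: minor issues that do not change conclusions\n"
        "- fair: notable gaps\n"
        "- poor: major errors\n"
        "- failed: wrong, fabricated, or off-task"
    )
    parts.append(
        'Respond with JSON only: {"scores": {<dimension>: <level>}, "reasoning": <string>}'
    )
    return "\n\n".join(parts)

prompt = build_judge_prompt("What is the capital of France?", "Paris", ground_truth="Paris")
```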
Supplement with Human Evaluation
Route edge cases, unusual queries, and a random sample of production traffic to human reviewers because humans notice hallucinated answers, system failures, and subtle biases that automated evaluation misses. Track patterns across human reviews to identify systematic issues and feed findings back into automated evaluation criteria.
Apply End-State Evaluation for Stateful Agents
For agents that mutate persistent state (files, databases, configurations), evaluate whether the final state matches expectations rather than how the agent got there. Define expected end-state assertions and verify them programmatically after each test run.
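A minimal sketch of end-state assertions, with the file and database checks simulated by an in-memory state dict for illustration:

```python
def verify_end_state(state: dict, assertions) -> list:
    """Return names of failed assertions (empty list means the run passed)."""
    return [name for name, check in assertions if not check(state)]

# Simulated final state; a real harness would read files / query the DB.
final_state = {"files": {"report.md": "# Q4 Report"}, "db": {"orders_migrated": 120}}
assertions = [
    ("report_created", lambda s: "report.md" in s["files"]),
    ("report_nonempty", lambda s: len(s["files"].get("report.md", "")) > 0),
    ("all_orders_migrated", lambda s: s["db"].get("orders_migrated") == 120),
]
failures = verify_end_state(final_state, assertions)
```

How the agent produced `report.md` (one write or ten revisions) never enters the check; only the final state does.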
Select Representative Samples
Start with small samples (20-30 cases) during early development when changes have dramatic impacts and low-hanging fruit is abundant. Scale to 50+ cases for reliable signal as the system matures. Sample from real usage patterns, add known edge cases, and ensure coverage across complexity levels.
Stratify by Complexity
Structure test sets across complexity levels to prevent easy examples from inflating scores:
Report scores per stratum alongside overall scores to reveal where the agent actually struggles.
Validate Context Strategies Systematically
Run agents with different context strategies on the same test set and compare quality scores, token usage, and efficiency metrics. This isolates the effect of context engineering from other variables and prevents anecdote-driven decisions.
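An A/B harness for context strategies might be sketched as follows; `run_agent` is a stub, and the strategy names ("compacted", "full_history") are hypothetical:

```python
def run_agent(case, strategy):
    """Stub: a real harness would invoke the agent on `case` with
    `strategy` applied and return the judge's score plus token usage."""
    return {"score": 0.9 if strategy == "compacted" else 0.8, "tokens": 1200}

def compare_strategies(test_set, strategies):
    """Run every strategy on the same test set and summarize quality
    and cost, isolating the effect of the context strategy."""
    report = {}
    for strategy in strategies:
        runs = [run_agent(case, strategy) for case in test_set]
        report[strategy] = {
            "mean_score": sum(r["score"] for r in runs) / len(runs),
            "mean_tokens": sum(r["tokens"] for r in runs) / len(runs),
        }
    return report

report = compare_strategies([{"name": "c1"}, {"name": "c2"}], ["compacted", "full_history"])
```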
Run Degradation Tests
Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic and establish safe operating limits. Feed these limits back into context management strategies.
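A sketch of a degradation sweep: step through context sizes and report the first size where quality falls below a floor. `score_at` is a stub with an artificial cliff standing in for real agent runs:

```python
def score_at(context_tokens):
    """Stub with an artificial cliff; a real test would run the agent
    at this context size and score the result."""
    return 0.9 if context_tokens <= 50_000 else 0.6

def find_performance_cliff(sizes, floor=0.7):
    """Return the first context size where quality drops below `floor`,
    or None if no cliff appears in the tested range."""
    for size in sorted(sizes):
        if score_at(size) < floor:
            return size
    return None

cliff = find_performance_cliff([10_000, 25_000, 50_000, 100_000, 200_000])
```

The cliff value then becomes the safe operating limit fed back into context management.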
Build Automated Evaluation Pipelines
Integrate evaluation into the development workflow so evaluations run automatically on agent changes. Track results over time, compare versions, and block deployments that regress on key metrics.
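A deployment gate for such a pipeline might be sketched like this; the metric names and tolerance are illustrative:

```python
def regression_gate(baseline: dict, candidate: dict, key_metrics, tolerance=0.02):
    """Block deployment when any key metric drops more than `tolerance`
    below the baseline version's value."""
    regressions = {
        metric: {"baseline": baseline[metric], "candidate": candidate[metric]}
        for metric in key_metrics
        if candidate[metric] < baseline[metric] - tolerance
    }
    return {"deploy": not regressions, "regressions": regressions}

decision = regression_gate(
    baseline={"pass_rate": 0.85, "accuracy": 0.90},
    candidate={"pass_rate": 0.80, "accuracy": 0.91},
    key_metrics=["pass_rate", "accuracy"],
)
```

Here the candidate's pass rate drops beyond tolerance, so the gate blocks the deploy even though accuracy improved.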
Monitor Production Quality
Sample production interactions and evaluate them continuously. Set alerts for quality drops below warning (0.85 pass rate) and critical (0.70 pass rate) thresholds. Maintain dashboards showing trend analysis over time windows to detect gradual degradation.
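The alerting logic reduces to a small classifier over a monitoring window's pass rate, using the warning and critical thresholds above:

```python
def alert_level(pass_rate, warning=0.85, critical=0.70):
    """Classify a window's pass rate against the warning (0.85) and
    critical (0.70) thresholds from this section."""
    if pass_rate < critical:
        return "critical"
    if pass_rate < warning:
        return "warning"
    return "ok"
```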
Follow this sequence to build an evaluation framework, because skipping early steps leads to measurements that do not reflect real quality:
Guard against these common failures that undermine evaluation reliability:
Example 1: Simple Evaluation
```python
def evaluate_agent_response(response, expected):
    rubric = load_rubric()  # {dimension: {"weight": ..., "levels": ...}}
    scores = {}
    for dimension in rubric:
        scores[dimension] = assess_dimension(response, expected, dimension)
    # Per-dimension weights come from the rubric itself.
    weights = {dim: cfg["weight"] for dim, cfg in rubric.items()}
    overall = weighted_average(scores, weights)
    return {"passed": overall >= 0.7, "overall": overall, "scores": scores}
```
Example 2: Test Set Structure
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
```python
test_set = [
    {
        "name": "simple_lookup",
        "input": "What is the capital of France?",
        "expected": {"type": "fact", "answer": "Paris"},
        "complexity": "simple",
        "description": "Single tool call, factual lookup",
    },
    {
        "name": "medium_query",
        "input": "Compare the revenue of Apple and Microsoft last quarter",
        "complexity": "medium",
        "description": "Multiple tool calls, comparison logic",
    },
    {
        "name": "multi_step_reasoning",
        "input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
        "complexity": "complex",
        "description": "Many tool calls, aggregation, analysis",
    },
    {
        "name": "research_synthesis",
        "input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
        "complexity": "very_complex",
        "description": "Extended interaction, deep reasoning, synthesis",
    },
]
```
This skill connects to all other skills as a cross-cutting concern.
Created: 2025-12-20 · Last Updated: 2026-03-17 · Author: Agent Skills for Context Engineering Contributors · Version: 1.1.0