evaluation by guanyang/antigravity-skills
Install with: `npx skills add https://github.com/guanyang/antigravity-skills --skill evaluation`
Evaluate agent systems differently from traditional software because agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Build evaluation frameworks that account for these characteristics, provide actionable feedback, catch regressions, and validate that context engineering choices achieve intended effects.
Activate this skill when:
Focus evaluation on outcomes rather than execution paths, because agents may find alternative valid routes to goals. Judge whether the agent achieves the right outcome via a reasonable process, not whether it followed a specific sequence of steps.
Use multi-dimensional rubrics instead of single scores because one number hides critical failures in specific dimensions. Capture factual accuracy, completeness, citation accuracy, source quality, and tool efficiency as separate dimensions, then weight them for the use case.
Deploy LLM-as-judge for scalable evaluation across large test sets while supplementing with human review to catch edge cases, hallucinations, and subtle biases that automated evaluation misses.
Performance Drivers: The 95% Finding
Apply the BrowseComp research finding when designing evaluation budgets: three factors explain 95% of browsing agent performance variance.
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Act on these implications when designing evaluations:
Handle Non-Determinism and Multiple Valid Paths
Design evaluations that tolerate path variation because agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten; both may produce correct answers. Avoid checking for specific steps. Instead, define outcome criteria (correctness, completeness, quality) and score against those, treating the execution path as informational rather than evaluative.
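An outcome-only scorer along these lines might look like the following minimal sketch; the field names and criteria are hypothetical, not part of the skill:

```python
def score_outcome(result: dict, expected: dict) -> dict:
    """Score an agent run on outcome criteria only; the execution path is
    recorded for analysis but never affects the score."""
    correct = result.get("answer") == expected.get("answer")
    complete = all(
        fact in result.get("facts", {}) for fact in expected.get("required_facts", [])
    )
    return {
        "correct": correct,
        "complete": complete,
        "passed": correct and complete,
        # Informational only: never scored.
        "path_info": {"tool_calls": len(result.get("steps", []))},
    }

# Two runs take different paths (3 vs. 10 tool calls) to the same answer.
run_short = {"answer": "Paris", "facts": {}, "steps": ["search"] * 3}
run_long = {"answer": "Paris", "facts": {}, "steps": ["search"] * 10}
expected = {"answer": "Paris", "required_facts": []}
```

Both runs pass despite very different paths; the tool-call count survives in `path_info` for later analysis without influencing the verdict.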
Test Context-Dependent Failures
Evaluate across a range of complexity levels and interaction lengths because agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones, work well with one tool set but fail with another, or degrade after extended interaction as context accumulates. Include simple, medium, complex, and very complex test cases to surface these patterns.
Score Composite Quality Dimensions Separately
Break agent quality into separate dimensions (factual accuracy, completeness, coherence, tool efficiency, process quality) and score each independently because an agent might score high on accuracy but low on efficiency, or vice versa. Then compute weighted aggregates tuned to use-case priorities. This approach reveals which dimensions need improvement rather than averaging away the signal.
Build Multi-Dimensional Rubrics
Define rubrics covering key dimensions with descriptive levels from excellent to failed. Include these core dimensions and adapt weights per use case:
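As a sketch, such a rubric can be encoded as plain data. The dimension names follow this section; the weights and level descriptions below are example values, not prescribed by the skill:

```python
# Illustrative rubric: weights and level wording are example values.
RUBRIC = {
    "factual_accuracy": {
        "weight": 0.40,
        "levels": {
            "excellent": "All claims verifiably correct",
            "good": "Minor inaccuracies that do not change conclusions",
            "poor": "Material errors in key claims",
            "failed": "Fabricated or contradicted facts",
        },
    },
    "completeness": {
        "weight": 0.35,
        "levels": {
            "excellent": "Covers every required aspect",
            "good": "Covers most aspects with small gaps",
            "poor": "Misses several required aspects",
            "failed": "Answers a different question",
        },
    },
    "tool_efficiency": {
        "weight": 0.25,
        "levels": {
            "excellent": "No redundant tool calls",
            "good": "Occasional redundant calls",
            "poor": "Frequent redundant or failed calls",
            "failed": "Tool use never converges",
        },
    },
}

# Sanity check on the rubric itself: weights should sum to 1.0.
total_weight = sum(d["weight"] for d in RUBRIC.values())
```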
Convert Rubrics to Numeric Scores
Map dimension assessments to numeric scores (0.0 to 1.0), apply per-dimension weights, and calculate weighted overall scores. Set passing thresholds based on use-case requirements, typically 0.7 for general use and 0.9 for high-stakes applications. Store individual dimension scores alongside the aggregate because the breakdown drives targeted improvement.
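A minimal sketch of that mapping; the scale values and weights here are illustrative:

```python
# Example mapping from rubric levels to the 0.0-1.0 scale.
LEVEL_SCORES = {"excellent": 1.0, "good": 0.75, "fair": 0.5, "poor": 0.25, "failed": 0.0}

def aggregate(assessments: dict, weights: dict, threshold: float = 0.7) -> dict:
    """Convert per-dimension level labels into a weighted overall score."""
    scores = {dim: LEVEL_SCORES[level] for dim, level in assessments.items()}
    overall = sum(scores[dim] * weights[dim] for dim in scores)
    # Keep the per-dimension breakdown alongside the aggregate:
    # the breakdown is what drives targeted improvement.
    return {"overall": overall, "passed": overall >= threshold, "scores": scores}

result = aggregate(
    {"factual_accuracy": "excellent", "completeness": "good", "tool_efficiency": "fair"},
    {"factual_accuracy": 0.5, "completeness": 0.3, "tool_efficiency": 0.2},
)
```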
Use LLM-as-Judge for Scale
Build LLM-based evaluation prompts that include: clear task description, the agent output under test, ground truth when available, an evaluation scale with explicit level descriptions, and a request for structured judgment with reasoning. LLM judges provide consistent, scalable evaluation across large test sets. Use a different model family than the agent being evaluated to avoid self-enhancement bias.
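A sketch of a judge-prompt builder covering those ingredients; the wording and JSON schema are assumptions, and the actual call to a judge model (from a different family than the agent under test) is omitted:

```python
def build_judge_prompt(task, agent_output, ground_truth=None):
    """Assemble an LLM-as-judge evaluation prompt from the pieces the
    section lists: task, output, optional ground truth, scale, and a
    request for structured judgment with reasoning."""
    parts = [
        "You are evaluating an AI agent's response to a task.",
        f"Task:\n{task}",
        f"Agent output:\n{agent_output}",
    ]
    if ground_truth is not None:
        parts.append(f"Ground truth:\n{ground_truth}")
    parts.append(
        "Rate each dimension as excellent, good, fair, poor, or failed:\n"
        "- excellent: fully correct and complete\n"
        "- good: minor issues that do not change conclusions\n"
        "- fair: notable gaps\n"
        "- poor: major errors\n"
        "- failed: wrong, fabricated, or off-task"
    )
    parts.append(
        'Respond with JSON only: {"scores": {<dimension>: <level>}, "reasoning": <string>}'
    )
    return "\n\n".join(parts)

prompt = build_judge_prompt("What is the capital of France?", "Paris", ground_truth="Paris")
```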
Supplement with Human Evaluation
Route edge cases, unusual queries, and a random sample of production traffic to human reviewers because humans notice hallucinated answers, system failures, and subtle biases that automated evaluation misses. Track patterns across human reviews to identify systematic issues and feed findings back into automated evaluation criteria.
Apply End-State Evaluation for Stateful Agents
For agents that mutate persistent state (files, databases, configurations), evaluate whether the final state matches expectations rather than how the agent got there. Define expected end-state assertions and verify them programmatically after each test run.
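A minimal sketch of end-state assertions, with the file and database checks simulated by an in-memory state dict for illustration:

```python
def verify_end_state(state: dict, assertions) -> list:
    """Return names of failed assertions (empty list means the run passed)."""
    return [name for name, check in assertions if not check(state)]

# Simulated final state; a real harness would read files / query the DB.
final_state = {"files": {"report.md": "# Q4 Report"}, "db": {"orders_migrated": 120}}
assertions = [
    ("report_created", lambda s: "report.md" in s["files"]),
    ("report_nonempty", lambda s: len(s["files"].get("report.md", "")) > 0),
    ("all_orders_migrated", lambda s: s["db"].get("orders_migrated") == 120),
]
failures = verify_end_state(final_state, assertions)
```

How the agent produced `report.md` (one write or ten revisions) never enters the check; only the final state does.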
Select Representative Samples
Start with small samples (20-30 cases) during early development when changes have dramatic impacts and low-hanging fruit is abundant. Scale to 50+ cases for reliable signal as the system matures. Sample from real usage patterns, add known edge cases, and ensure coverage across complexity levels.
Stratify by Complexity
Structure test sets across complexity levels to prevent easy examples from inflating scores:
Report scores per stratum alongside overall scores to reveal where the agent actually struggles.
Validate Context Strategies Systematically
Run agents with different context strategies on the same test set and compare quality scores, token usage, and efficiency metrics. This isolates the effect of context engineering from other variables and prevents anecdote-driven decisions.
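An A/B harness for context strategies might be sketched as follows; `run_agent` is a stub, and the strategy names ("compacted", "full_history") are hypothetical:

```python
def run_agent(case, strategy):
    """Stub: a real harness would invoke the agent on `case` with
    `strategy` applied and return the judge's score plus token usage."""
    return {"score": 0.9 if strategy == "compacted" else 0.8, "tokens": 1200}

def compare_strategies(test_set, strategies):
    """Run every strategy on the same test set and summarize quality
    and cost, isolating the effect of the context strategy."""
    report = {}
    for strategy in strategies:
        runs = [run_agent(case, strategy) for case in test_set]
        report[strategy] = {
            "mean_score": sum(r["score"] for r in runs) / len(runs),
            "mean_tokens": sum(r["tokens"] for r in runs) / len(runs),
        }
    return report

report = compare_strategies([{"name": "c1"}, {"name": "c2"}], ["compacted", "full_history"])
```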
Run Degradation Tests
Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic and establish safe operating limits. Feed these limits back into context management strategies.
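A sketch of a degradation sweep: step through context sizes and report the first size where quality falls below a floor. `score_at` is a stub with an artificial cliff standing in for real agent runs:

```python
def score_at(context_tokens):
    """Stub with an artificial cliff; a real test would run the agent
    at this context size and score the result."""
    return 0.9 if context_tokens <= 50_000 else 0.6

def find_performance_cliff(sizes, floor=0.7):
    """Return the first context size where quality drops below `floor`,
    or None if no cliff appears in the tested range."""
    for size in sorted(sizes):
        if score_at(size) < floor:
            return size
    return None

cliff = find_performance_cliff([10_000, 25_000, 50_000, 100_000, 200_000])
```

The cliff value then becomes the safe operating limit fed back into context management.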
Build Automated Evaluation Pipelines
Integrate evaluation into the development workflow so evaluations run automatically on agent changes. Track results over time, compare versions, and block deployments that regress on key metrics.
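A deployment gate for such a pipeline might be sketched like this; the metric names and tolerance are illustrative:

```python
def regression_gate(baseline: dict, candidate: dict, key_metrics, tolerance=0.02):
    """Block deployment when any key metric drops more than `tolerance`
    below the baseline version's value."""
    regressions = {
        metric: {"baseline": baseline[metric], "candidate": candidate[metric]}
        for metric in key_metrics
        if candidate[metric] < baseline[metric] - tolerance
    }
    return {"deploy": not regressions, "regressions": regressions}

decision = regression_gate(
    baseline={"pass_rate": 0.85, "accuracy": 0.90},
    candidate={"pass_rate": 0.80, "accuracy": 0.91},
    key_metrics=["pass_rate", "accuracy"],
)
```

Here the candidate's pass rate drops beyond tolerance, so the gate blocks the deploy even though accuracy improved.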
Monitor Production Quality
Sample production interactions and evaluate them continuously. Set alerts for quality drops below warning (0.85 pass rate) and critical (0.70 pass rate) thresholds. Maintain dashboards showing trend analysis over time windows to detect gradual degradation.
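The alerting logic reduces to a small classifier over a monitoring window's pass rate, using the warning and critical thresholds above:

```python
def alert_level(pass_rate, warning=0.85, critical=0.70):
    """Classify a window's pass rate against the warning (0.85) and
    critical (0.70) thresholds from this section."""
    if pass_rate < critical:
        return "critical"
    if pass_rate < warning:
        return "warning"
    return "ok"
```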
Follow this sequence to build an evaluation framework, because skipping early steps leads to measurements that do not reflect real quality:
Guard against these common failures that undermine evaluation reliability:
Example 1: Simple Evaluation
```python
def evaluate_agent_response(response, expected):
    rubric = load_rubric()  # {dimension: {"weight": ..., "levels": ...}}
    scores = {}
    for dimension in rubric:
        scores[dimension] = assess_dimension(response, expected, dimension)
    # Per-dimension weights come from the rubric itself.
    weights = {dim: cfg["weight"] for dim, cfg in rubric.items()}
    overall = weighted_average(scores, weights)
    return {"passed": overall >= 0.7, "overall": overall, "scores": scores}
```
Example 2: Test Set Structure
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
```python
test_set = [
    {
        "name": "simple_lookup",
        "input": "What is the capital of France?",
        "expected": {"type": "fact", "answer": "Paris"},
        "complexity": "simple",
        "description": "Single tool call, factual lookup",
    },
    {
        "name": "medium_query",
        "input": "Compare the revenue of Apple and Microsoft last quarter",
        "complexity": "medium",
        "description": "Multiple tool calls, comparison logic",
    },
    {
        "name": "multi_step_reasoning",
        "input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
        "complexity": "complex",
        "description": "Many tool calls, aggregation, analysis",
    },
    {
        "name": "research_synthesis",
        "input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
        "complexity": "very_complex",
        "description": "Extended interaction, deep reasoning, synthesis",
    },
]
```
This skill connects to all other skills as a cross-cutting concern.
Created: 2025-12-20 · Last Updated: 2026-03-17 · Author: Agent Skills for Context Engineering Contributors · Version: 1.1.0