evaluation by sickn33/antigravity-awesome-skills

```shell
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill evaluation
```
Build evaluation frameworks for agent systems
Use this skill when you need to build evaluation frameworks for agent systems.
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
Activate this skill when designing evaluations for agent systems, validating context engineering choices, or checking for quality regressions.
Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight: agents may find alternative paths to the same goal, so evaluation should judge whether they achieve the right outcome while following a reasonable process.
Performance Drivers: The 95% Finding

Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of the performance variance:
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
This finding has significant implications for evaluation design:
Non-Determinism and Multiple Valid Paths

Agents may take completely different yet valid paths to reach a goal. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this setting.

The solution is outcome-focused evaluation: judge whether the agent achieves the right outcome while following a reasonable process.
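A minimal sketch of such an outcome-focused check. The result schema (`answer`, `tool_calls`, `errors` fields) is illustrative, not part of the skill:

```python
def outcome_passes(result, expected):
    """Outcome-focused check: any trajectory that reaches the expected
    answer via a reasonable process passes, regardless of the path taken."""
    answer_ok = result["answer"] == expected["answer"]
    # "Reasonable process" here just means: at least one tool call, no errors.
    process_ok = len(result["tool_calls"]) >= 1 and not result.get("errors")
    return answer_ok and process_ok

# Two agents take different paths to the same goal; both should pass.
run_a = {"answer": "Paris", "tool_calls": ["search"] * 3}
run_b = {"answer": "Paris", "tool_calls": ["search"] * 10}
expected = {"answer": "Paris"}
```

The point of the check is what it does not inspect: it never asks which tools were called or in what order, only that the outcome is right and the process was plausible.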
Context-Dependent Failures

Agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction, once context accumulates.

Evaluation must therefore cover a range of complexity levels and test extended interactions, not just isolated queries.
Composite Quality Dimensions

Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.

Evaluation rubrics must capture multiple dimensions, weighted appropriately for the use case.
Multi-Dimensional Rubric

Effective rubrics cover key dimensions, each graded on descriptive levels (excellent to failed):

- Factual accuracy: claims match ground truth
- Completeness: output covers all requested aspects
- Citation accuracy: citations match the claimed sources
- Source quality: uses appropriate primary sources
- Tool efficiency: uses the right tools a reasonable number of times
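A sketch of how such a rubric might be represented; the weights are illustrative assumptions that sum to 1.0:

```python
# Hypothetical rubric with per-dimension weights that sum to 1.0.
RUBRIC = {
    "factual_accuracy":  {"weight": 0.35},
    "completeness":      {"weight": 0.25},
    "citation_accuracy": {"weight": 0.15},
    "source_quality":    {"weight": 0.15},
    "tool_efficiency":   {"weight": 0.10},
}

def weighted_overall(scores):
    """Combine per-dimension scores (each 0.0-1.0) into one weighted score."""
    return sum(RUBRIC[dim]["weight"] * score for dim, score in scores.items())

# Example assessment converted to numeric scores per dimension.
scores = {"factual_accuracy": 1.0, "completeness": 0.8,
          "citation_accuracy": 1.0, "source_quality": 0.6,
          "tool_efficiency": 0.9}
```

Weighting factual accuracy highest fits a research-style use case; a latency-sensitive assistant might instead upweight tool efficiency.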
Rubric Scoring

Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate a weighted overall score, then set a passing threshold based on use case requirements.
LLM-as-Judge

LLM-based evaluation scales to large test sets and provides consistent judgments. The key is designing evaluation prompts that capture the dimensions of interest.

Provide a clear task description, the agent's output, ground truth (if available), and an evaluation scale with level descriptions, then request a structured judgment.
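A minimal judge-prompt builder sketched for two of the rubric dimensions; the template wording, the 0-4 scale, and the JSON response shape are assumptions, and `build_judge_prompt` is a hypothetical helper:

```python
# Sketch of an LLM-as-judge prompt. The filled-in prompt would be sent to
# whatever chat-completion client acts as the judge model.
JUDGE_TEMPLATE = """You are evaluating an agent's output.

Task: {task}
Agent output: {output}
Ground truth (may be empty): {ground_truth}

Rate each dimension from 0 (failed) to 4 (excellent):
- factual_accuracy: claims match the ground truth
- completeness: output covers all requested aspects

Respond as JSON: {{"factual_accuracy": 0-4, "completeness": 0-4, "rationale": "one sentence"}}"""

def build_judge_prompt(task, output, ground_truth=""):
    """Fill the template with one evaluation instance."""
    return JUDGE_TEMPLATE.format(task=task, output=output, ground_truth=ground_truth)
```

Requesting JSON with a one-sentence rationale keeps judgments machine-parseable while still leaving a trace a human reviewer can audit.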
Human Evaluation

Human evaluation catches what automation misses: hallucinated answers on unusual queries, system failures, and subtle biases.

Effective human evaluation covers edge cases, samples systematically, tracks patterns, and provides contextual understanding.
End-State Evaluation

For agents that mutate persistent state, end-state evaluation focuses on whether the final state matches expectations rather than on how the agent got there.
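A sketch of end-state comparison, assuming the persistent state can be snapshotted as a dictionary (the ticket schema is purely illustrative):

```python
def evaluate_end_state(final_state, expected_state):
    """Compare only the fields the task cares about in the final state,
    ignoring how the agent got there."""
    diffs = {key: (expected_state[key], final_state.get(key))
             for key in expected_state
             if final_state.get(key) != expected_state[key]}
    return {"passed": not diffs, "diffs": diffs}

# e.g. an agent asked to close and reassign a ticket: extra fields such as
# the audit log are ignored by the check.
final = {"ticket_status": "closed", "assignee": "bot", "audit_log": ["a", "b"]}
expected = {"ticket_status": "closed", "assignee": "bot"}
```

Reporting the diffs, not just pass/fail, makes failures actionable: you see exactly which part of the state the agent left wrong.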
Sample Selection

Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit, so small test sets reveal large effects.

Sample from real usage patterns, add known edge cases, and ensure coverage across complexity levels.
Complexity Stratification

Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
Testing Context Strategies

Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set, then compare quality scores, token usage, and efficiency metrics.
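This comparison can be sketched as a small harness; `run_agent` and its return shape (`score`, `tokens`) are assumptions standing in for your own agent runner:

```python
def compare_strategies(strategy_names, test_set, run_agent):
    """Run every test case under each context strategy and aggregate
    quality and token cost so the strategies can be compared side by side."""
    report = {}
    for name in strategy_names:
        runs = [run_agent(case, strategy=name) for case in test_set]
        report[name] = {
            "mean_score": sum(r["score"] for r in runs) / len(runs),
            "total_tokens": sum(r["tokens"] for r in runs),
        }
    return report
```

Because every strategy sees the identical test set, any difference in the report is attributable to the context strategy rather than to the sampled tasks.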
Degradation Testing

Test how context degradation affects performance by running agents at different context sizes. Identify the performance cliffs where context becomes problematic, and establish safe operating limits.
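A minimal cliff-finding probe, assuming a `run_at_context_size` hook (your harness would supply it) that returns a quality score in [0, 1]:

```python
def find_performance_cliff(run_at_context_size, sizes, floor=0.7):
    """Probe ascending context sizes and report the largest size that still
    clears the quality floor, plus the first size that falls below it."""
    last_safe = None
    for size in sorted(sizes):
        if run_at_context_size(size) >= floor:
            last_safe = size
        else:
            return {"safe_limit": last_safe, "cliff_at": size}
    return {"safe_limit": last_safe, "cliff_at": None}
```

The `safe_limit` it reports is a candidate operating limit; in practice you would average several runs per size, since agent quality is non-deterministic.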
Evaluation Pipeline

Build evaluation pipelines that run automatically when the agent changes. Track results over time and compare versions to identify improvements or regressions.
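A regression gate such a pipeline might use, comparing per-test scores against a stored baseline; the tolerance value is an illustrative default:

```python
def check_regressions(baseline, current, tolerance=0.05):
    """Flag any test whose score fell more than `tolerance` below the
    stored baseline; the pipeline fails if any regression is found."""
    regressions = {
        name: (baseline[name], score)
        for name, score in current.items()
        if name in baseline and score < baseline[name] - tolerance
    }
    return {"passed": not regressions, "regressions": regressions}
```

The tolerance absorbs run-to-run noise from non-determinism, so only drops larger than normal variance fail the gate.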
Monitoring Production

Track evaluation metrics in production by sampling interactions and evaluating them at random. Set alerts for quality drops and maintain dashboards for trend analysis.
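A sketch of the sampling and alerting pieces; the 5% sample rate and 0.75 alert threshold are illustrative assumptions:

```python
import random

def should_sample(rate=0.05, rng=random):
    """Decide whether to pull this production interaction into evaluation."""
    return rng.random() < rate

def quality_alert(recent_scores, threshold=0.75):
    """Alert when the rolling mean quality of sampled interactions drops
    below the threshold; an empty window never alerts."""
    if not recent_scores:
        return False
    return sum(recent_scores) / len(recent_scores) < threshold
```

Passing the random source in as `rng` keeps the sampler testable and lets you seed it for reproducible dashboards.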
Common Pitfalls

- Overfitting to specific paths: evaluate outcomes, not specific steps.
- Ignoring edge cases: include diverse test scenarios.
- Single-metric obsession: use multi-dimensional rubrics.
- Neglecting context effects: test with realistic context sizes.
- Skipping human evaluation: automated evaluation misses subtle issues.
Example 1: Simple Evaluation
```python
def evaluate_agent_response(response, expected):
    # load_rubric, assess_dimension, and weighted_average are helpers
    # assumed to be provided elsewhere by the evaluation harness.
    rubric = load_rubric()
    scores = {}
    for dimension in rubric:
        scores[dimension] = assess_dimension(response, expected, dimension)
    # Weight each dimension's score by its configured rubric weight.
    weights = {dim: cfg["weight"] for dim, cfg in rubric.items()}
    overall = weighted_average(scores, weights)
    return {"passed": overall >= 0.7, "scores": scores}
```
Example 2: Test Set Structure
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
```python
test_set = [
    {
        "name": "simple_lookup",
        "input": "What is the capital of France?",
        "expected": {"type": "fact", "answer": "Paris"},
        "complexity": "simple",
        "description": "Single tool call, factual lookup"
    },
    {
        "name": "medium_query",
        "input": "Compare the revenue of Apple and Microsoft last quarter",
        "complexity": "medium",
        "description": "Multiple tool calls, comparison logic"
    },
    {
        "name": "multi_step_reasoning",
        "input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
        "complexity": "complex",
        "description": "Many tool calls, aggregation, analysis"
    },
    {
        "name": "research_synthesis",
        "input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
        "complexity": "very_complex",
        "description": "Extended interaction, deep reasoning, synthesis"
    }
]
```
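One way to drive such a test set end to end, aggregating pass rate per complexity tier; `agent` and `evaluate` are stand-ins for your agent entry point and a rubric-based evaluator like Example 1:

```python
from collections import defaultdict

def run_test_set(test_set, agent, evaluate):
    """Run every case through the agent and report pass rate per tier."""
    by_complexity = defaultdict(list)
    for case in test_set:
        result = agent(case["input"])
        passed = evaluate(result, case.get("expected"))["passed"]
        by_complexity[case["complexity"]].append(passed)
    return {tier: sum(flags) / len(flags)
            for tier, flags in by_complexity.items()}
```

Reporting per-tier pass rates, rather than one global number, surfaces the context-dependent failures discussed above, such as an agent that aces simple lookups but collapses on very complex synthesis tasks.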
This skill connects to all other skills as a cross-cutting concern.
Created: 2025-12-20 · Last Updated: 2025-12-20 · Author: Agent Skills for Context Engineering Contributors · Version: 1.0.0
Weekly Installs: 76
Repository: https://github.com/sickn33/antigravity-awesome-skills
GitHub Stars: 27.1K
First Seen: Feb 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: gemini-cli (73), codex (73), opencode (73), cursor (71), github-copilot (71), kimi-cli (71)