Claude代码智能体评估方法：多维标准与LLM评判者框架，提升AI代理性能

customaize-agent%3Aagent-evaluation by neolabhq/context-engineering-kit

244 周安装量

708 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/neolabhq/context-engineering-kit --skill customaize-agent:agent-evaluation

AI/机器学习自动化测试

🇨🇳中文介绍

Claude 代码智能体的评估方法

智能体系统的评估需要不同于传统软件甚至标准语言模型应用的方法。智能体做出动态决策，在不同运行之间存在非确定性，并且通常缺乏单一正确答案。有效的评估必须考虑这些特性，同时提供可操作的反馈。一个稳健的评估框架能够实现持续改进、捕捉回归问题，并验证上下文工程选择是否达到预期效果。

核心概念

智能体评估需要以结果为中心的方法，考虑非确定性和多种有效路径。多维评估标准能捕捉质量的各个方面：事实准确性、完整性、引用准确性、来源质量和工具效率。LLM 作为评判者提供了可扩展的评估，而人工评估则能捕捉边缘情况。

关键见解是，智能体可能找到实现目标的替代路径——评估应判断它们是否在遵循合理流程的同时实现了正确结果。

性能驱动因素：95% 发现 对 BrowseComp 评估（测试浏览智能体查找难以找到信息的能力）的研究发现，三个因素解释了 95% 的性能差异：

因素	解释的差异	含义
令牌使用量	80%	更多令牌 = 更好性能
工具调用次数	~10%	更多探索有帮助
模型选择	~5%	更好的模型能成倍提高效率

对 Claude Code 开发的启示：

令牌预算很重要：在现实的令牌约束下进行评估
模型升级优于增加令牌：升级模型比增加令牌预算能带来更大的收益
多智能体验证：验证跨子智能体分配工作并使用独立上下文窗口的架构

评估挑战

非确定性和多种有效路径

智能体可能采取完全不同的有效路径来实现目标。一个智能体可能搜索三个来源，而另一个可能搜索十个。它们可能使用不同的工具来找到相同的答案。在这种情况下，检查特定步骤的传统评估会失效。

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

上下文相关的失败

智能体失败通常以微妙的方式依赖于上下文。一个智能体可能在复杂查询上成功，但在简单查询上失败。它可能在一组工具上工作良好，但在另一组上失败。失败可能只在上下文累积后的长时间交互中才会出现。

解决方案：评估必须覆盖一系列复杂度级别，并测试扩展交互，而不仅仅是孤立的查询。

智能体质量不是单一维度。它包括事实准确性、完整性、连贯性、工具效率和流程质量。一个智能体可能在准确性上得分高，但在效率上得分低，反之亦然。

一个智能体可能在准确性上得分高，但在效率上得分低。

解决方案：评估标准必须捕捉多个维度，并根据用例进行适当加权。

有效的评估标准涵盖关键维度并包含描述性级别：

指令遵循（权重：0.30）

优秀 (1.0)：所有指令都精确遵循
良好 (0.8)：不影响结果的微小偏差
可接受 (0.6)：主要指令已遵循，次要指令被遗漏
差 (0.3)：重要指令被忽略
失败 (0.0)：从根本上误解了任务

输出完整性（权重：0.25）

优秀：所有请求的方面都得到全面覆盖
良好：大多数方面已覆盖，存在微小差距
可接受：关键方面已覆盖，存在一些差距
差：主要方面缺失
失败：基本方面未涉及

工具效率（权重：0.20）

优秀：工具选择最优且调用次数最少
良好：工具选择良好，存在微小低效
可接受：使用适当工具，存在一些冗余
差：使用了错误的工具或调用次数过多
失败：严重误用工具或调用次数极度过量

推理质量（权重：0.15）

优秀：整个过程中推理清晰、逻辑性强
良好：推理基本合理，存在微小差距
可接受：存在基本推理
差：推理不清晰或有缺陷
失败：无明显推理

响应连贯性（权重：0.10）

优秀：结构良好，易于理解
良好：基本连贯，存在微小问题
可接受：可以理解，但可以更清晰
差：难以理解
失败：不连贯

将维度评估转换为数值分数（0.0 到 1.0）并进行适当加权。计算加权总分。根据用例要求设置通过阈值（通常通用用途为 0.7，关键操作为 0.85）。

使用 LLM 评估智能体输出具有良好的扩展性，并能提供一致的判断。设计评估提示以捕捉感兴趣的维度。基于 LLM 的评估可以扩展到大型测试集，并提供一致的判断。关键在于设计有效的评估提示以捕捉感兴趣的维度。

提供清晰的任务描述、智能体输出、真实情况（如果可用）、包含级别描述的评估量表，并请求结构化判断。

评估提示模板：

You are evaluating the output of a Claude Code agent.

## Original Task
{task_description}

## Agent Output
{agent_output}

## Ground Truth (if available)
{expected_output}

## Evaluation Criteria
For each criterion, assess the output and provide:
1. Score (1-5)
2. Specific evidence supporting your score
3. One improvement suggestion

### Criteria
1. Instruction Following: Did the agent follow all instructions?
2. Completeness: Are all requested aspects covered?
3. Tool Efficiency: Were appropriate tools used efficiently?
4. Reasoning Quality: Is the reasoning clear and sound?
5. Response Coherence: Is the output well-structured?

Provide your evaluation as a structured assessment with scores and justifications.

思维链要求：在给出分数之前始终要求提供理由。研究表明，与先给分的方法相比，这能提高 15-25% 的可靠性。

人工评估能捕捉自动化评估遗漏的内容：

不寻常查询上的幻觉答案
微妙的上下文误解
自动化评估忽略的边缘情况
语气或方法上的定性问题

对于 Claude Code 开发，请用户：

手动审查智能体输出的边缘情况
在不同复杂度级别上进行系统抽样
跟踪失败模式以指导提示改进

对于产生工件（文件、配置、代码）的命令，评估最终输出而非过程：

生成的代码能工作吗？
配置有效吗？
输出满足要求吗？

样本选择 在开发过程中从少量样本开始。在智能体开发早期，更改会产生巨大影响，因为存在大量低垂的果实。小测试集能揭示大的效果。

从实际使用模式中抽样。添加已知的边缘情况。确保覆盖不同复杂度级别。

复杂度分层 测试集应跨越复杂度级别：简单（单次工具调用）、中等（多次工具调用）、复杂（多次工具调用、显著模糊性）和非常复杂（扩展交互、深度推理）。

上下文工程评估

在迭代 Claude Code 提示时，进行系统评估：

基线：在测试案例上运行当前提示
变体：在相同案例上运行修改后的提示
比较：衡量质量分数、令牌使用量、效率
分析：确定哪些更改改进了哪些维度

测试上下文策略

应通过系统评估验证上下文工程选择。在同一测试集上使用不同上下文策略运行智能体。比较质量分数、令牌使用量和效率指标。

通过在不同上下文大小下运行智能体来测试上下文退化如何影响性能。识别上下文变得有问题的性能悬崖。建立安全操作限制。

高级评估：LLM 作为评判者

关键见解：LLM 作为评判者不是单一技术，而是一系列方法，每种方法适用于不同的评估上下文。选择正确的方法并减轻已知偏见是此技能发展的核心能力。

评估方法主要分为两类，具有不同的可靠性特征：

直接评分：单个 LLM 根据定义的标准对一个响应进行评分。

最适合：客观标准（事实准确性、指令遵循、毒性）
可靠性：对于明确定义的标准，可靠性中等至高
失败模式：分数校准漂移、量表解释不一致

成对比较：LLM 比较两个响应并选择更好的一个。

最适合：主观偏好（语气、风格、说服力）
可靠性：对于偏好，比直接评分更高
失败模式：位置偏见、长度偏见

来自 MT-Bench 论文（Zheng 等人，2023）的研究表明，对于基于偏好的评估，成对比较比直接评分与人类评判者达成更高的一致性，而直接评分仍然适用于具有明确真实情况的客观标准。

LLM 评判者表现出必须积极缓解的系统性偏见：

位置偏见：在成对比较中，第一位置的响应受到优待。缓解方法：交换位置评估两次，使用多数投票或一致性检查。

长度偏见：无论质量如何，更长的响应评分更高。缓解方法：明确提示忽略长度，使用长度归一化评分。

自我增强偏见：模型对自己的输出评分更高。缓解方法：使用不同的模型进行生成和评估，或承认局限性。

冗长偏见：详细的解释即使不必要也会获得更高的分数。缓解方法：使用特定标准来惩罚不相关的细节。

权威偏见：自信、权威的语气无论准确性如何都会获得更高的评分。缓解方法：要求提供证据引用、事实检查层。

根据评估任务结构选择指标：

任务类型	主要指标	次要指标
二元分类（通过/失败）	召回率、精确率、F1	Cohen's κ
序数量表（1-5 评分）	Spearman's ρ、Kendall's τ	Cohen's κ（加权）
成对偏好	一致率、位置一致性	置信度校准
多标签	宏平均 F1、微平均 F1	每标签精确率/召回率

关键见解：高绝对一致性不如系统性不一致模式重要。一个在特定标准上持续与人类不一致的评判者比具有随机噪声的评判者问题更大。

分类指标（通过/失败任务）

精确率：在所有标记为通过的响应中，真正通过的比例是多少？

当误报成本高时使用

召回率：在所有实际通过的响应中，我们识别出了多少比例？

当漏报成本高时使用

F1 分数：精确率和召回率的调和平均数

用于平衡的单数字摘要

一致性指标（与人类判断比较）

Cohen's Kappa：调整了偶然性的一致程度

0.8：几乎完美一致

0.6-0.8：实质性一致
0.4-0.6：中等一致
< 0.4：一般到差的一致

良好评估系统指标

指标	良好	可接受	令人担忧
Spearman's rho	> 0.8	0.6-0.8	< 0.6
Cohen's Kappa	> 0.7	0.5-0.7	< 0.5
位置一致性	> 0.9	0.8-0.9	< 0.8
长度-分数相关性	< 0.2	0.2-0.4	> 0.4

直接评分需要三个组成部分：明确的标准、校准的量表和结构化的输出格式。

标准定义模式：

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]

1-3 量表：带有中性选项的二元选择，认知负荷最低
1-5 量表：标准李克特量表，在粒度和可靠性之间取得良好平衡
1-10 量表：高粒度但更难校准，仅在具有详细评估标准时使用

直接评分提示结构：

You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with structured JSON containing scores, justifications, and summary.

思维链要求：所有评分提示必须在给出分数之前要求提供理由。研究表明，与先给分的方法相比，这能提高 15-25% 的可靠性。

成对比较对于基于偏好的评估本质上更可靠，但需要缓解偏见。

位置偏见缓解协议：

第一轮：响应 A 在第一位置，响应 B 在第二位置
第二轮：响应 B 在第一位置，响应 A 在第二位置
一致性检查：如果两轮结果不一致，则返回 TIE 并降低置信度
最终裁决：一致胜出者，取平均置信度

成对比较提示结构：

You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.

置信度校准：置信度分数应反映位置一致性：

两轮一致：置信度 = 各轮置信度的平均值
两轮不一致：置信度 = 0.5，裁决 = TIE

明确定义的评估标准与开放式评分相比，能将评估方差降低 40-60%。

评估标准组成部分

级别描述：每个分数级别的明确边界
特征：定义每个级别的可观察特征
示例：每个级别的代表性输出（如果可能）
边缘情况：针对模糊情况的指导
评分指南：一致应用的一般原则

宽松：通过分数门槛较低，适用于鼓励迭代
平衡：公平，适用于生产使用的典型期望
严格：高标准，适用于安全关键或高风险评估

评估标准应使用领域特定术语：

“代码可读性”评估标准提到变量、函数和注释。
文档评估标准参考清晰度、准确性、完整性
分析评估标准关注深度、准确性、可操作性

评估流水线设计

生产评估系统需要多个层次：

┌─────────────────────────────────────────────────┐
│                 Evaluation Pipeline              │
├─────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│           │                                       │
│           ▼                                       │
│  ┌─────────────────────┐                         │
│  │   Criteria Loader   │ ◄── Rubrics, weights    │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Bias Mitigation   │ ◄── Position swap, etc. │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │ Confidence Scoring  │ ◄── Calibration         │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└─────────────────────────────────────────────────┘

反模式：评分而不提供理由

问题：分数缺乏依据，难以调试或改进
解决方案：始终要求在给出分数之前提供基于证据的理由

反模式：单轮成对比较

问题：位置偏见破坏结果
解决方案：始终交换位置并检查一致性

反模式：标准过载

问题：衡量多个方面的标准不可靠
解决方案：一个标准 = 一个可衡量的方面

反模式：缺少边缘情况指导

问题：评估者处理模糊情况时不一致
解决方案：在评估标准中包含边缘情况并提供明确指导

反模式：忽略置信度校准

问题：高置信度的错误判断比低置信度的更糟糕
解决方案：根据位置一致性和证据强度校准置信度

决策框架：直接评分 vs 成对比较

使用此决策树：

Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, instruction following, format compliance
│
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, persuasiveness, creativity
    │
    └── No → Consider reference-based evaluation
        └── Examples: summarization (compare to source), translation (compare to reference)

对于高容量评估：

LLM 评审团：使用多个模型作为评判者，汇总投票
- 减少单个模型偏见
- 成本更高，但对于高风险决策更可靠
分层评估：使用快速廉价模型进行筛选，昂贵模型处理边缘情况
- 对于大容量具有成本效益
- 需要校准筛选阈值
人在回路：清晰案例使用自动化评估，低置信度案例进行人工审查
- 对于关键应用具有最佳可靠性
- 设计反馈循环以改进自动化评估

示例 1：准确性的直接评分

Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun, 
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5

{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct 
scientific reasoning. Both the axial tilt and its effect on sunlight distribution 
are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}

示例 2：带位置交换的成对比较

Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]

第一轮（A 在前）：

{ "winner": "B", "confidence": 0.8 }

第二轮（B 在前）：

{ "winner": "A", "confidence": 0.6 }

（注意：胜出者是 A，因为 B 在第一位置）

映射后的第二轮：

{ "winner": "B", "confidence": 0.6 }

{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}

示例 3：评估标准生成

criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"

输出（缩写）：

{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}

迭代改进工作流程

识别弱点：使用评估找出智能体在何处遇到困难
假设原因：是提示问题？上下文问题？示例问题？
修改提示：基于假设进行有针对性的更改
重新评估：使用修改后的提示运行相同的测试案例
比较：更改是否改进了目标维度？
检查回归：其他维度是否受到影响？
迭代：重复直到质量达到阈值

始终要求在给出分数之前提供理由 - 思维链提示将可靠性提高 15-25%
在成对比较中始终交换位置 - 单轮比较会受到位置偏见的影响
使量表粒度与评估标准特异性相匹配 - 没有详细级别描述时不要使用 1-10 量表
区分客观和主观标准 - 客观标准使用直接评分，主观标准使用成对比较
包含置信度分数 - 根据位置一致性和证据强度进行校准
明确界定边缘情况 - 模糊情况导致最大的评估方差
使用领域特定的评估标准 - 通用评估标准产生通用（不太有用）的评估
根据人类判断进行验证 - 自动化评估只有在与人类评估相关时才有价值
监控系统性偏见 - 按标准和响应类型跟踪不一致模式
为迭代而设计 - 评估系统通过反馈循环改进

示例：评估 Claude Code 命令

假设您创建了一个 /refactor 命令并希望评估其质量：

简单：在单个文件中重命名变量
中等：从现有代码中提取函数
复杂：重构类以使用新的设计模式
非常复杂：重构模块依赖关系

正确性：重构后的代码能工作吗？
完整性：是否更新了所有实例？
风格：是否遵循项目约定？
效率：是否避免了不必要的更改？

Evaluate this refactoring output:

Original Code:
{original}

Refactored Code:
{refactored}

Request:
{user_request}

Score 1-5 on each dimension with evidence:
1. Correctness: Does the code still work correctly?
2. Completeness: Were all relevant instances updated?
3. Style: Does it follow the project's coding patterns?
4. Efficiency: Were only necessary changes made?

Provide scores with specific evidence from the code.

迭代：如果评估显示命令经常遗漏实例：

添加明确指令：“在整个代码库中搜索所有出现位置”
使用相同的测试案例重新评估
比较完整性分数
检查正确性是否没有回归

LLM 评估的偏见缓解技术

此参考详细介绍了缓解 LLM 作为评判者系统中已知偏见的具体技术。

在成对比较中，LLM 系统性地偏好某些位置的响应。研究表明：

GPT 有轻微的第一位置偏见（在平局情况下约 55% 偏好第一位置）
Claude 表现出类似的模式
较小的模型通常表现出更强的偏见

缓解：位置交换协议

async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # Pass 1: Original order
    result_ab = await compare(response_a, response_b, prompt, criteria)
    
    # Pass 2: Swapped order
    result_ba = await compare(response_b, response_a, prompt, criteria)
    
    # Map second result (A in second position → B in first)
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }
    
    # Consistency check
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # Disagreement indicates position bias was a factor
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }

替代方案：多次洗牌

为了更高的可靠性，使用多个位置排序：

async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)
    
    # Majority vote
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)
    
    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }

LLM 倾向于给更长的响应更高的评分，无论质量如何。这表现为：

冗长的响应获得虚高的分数
简洁但完整的响应受到惩罚
填充和重复得到奖励

缓解：明确提示

在提示中包含反长度偏见的指令：

CRITICAL EVALUATION GUIDELINES:
- Do NOT prefer responses because they are longer
- Concise, complete answers are as valuable as detailed ones
- Penalize unnecessary verbosity or repetition
- Focus on information density, not word count

缓解：长度归一化评分

def length_normalized_score(score, response_length, target_length=500):
    """Adjust score based on response length."""
    length_ratio = response_length / target_length
    
    if length_ratio > 2.0:
        # Penalize excessively long responses
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # Penalize excessively short responses
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score

缓解：单独的长度标准

将长度作为一个单独的、明确的标准，使其不会隐含地得到奖励：

criteria = [
    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3}  # Explicit
]

模型对自己（或类似模型）生成的输出评分高于来自不同模型的输出。

缓解：跨模型评估

使用与生成模型不同的模型系列进行评估：

def get_evaluator_model(generator_model):
    """Select evaluator to avoid self-enhancement bias."""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # Default

在评估前从响应中移除模型归属：

def anonymize_response(response, model_name):
    """Remove model-identifying patterns."""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # Model-specific patterns
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized

详细的解释即使额外细节不相关或不正确，也会获得更高的分数。

缓解：相关性加权评分

async def relevance_weighted_evaluation(response, prompt, criteria):
    # First, assess relevance of each segment
    relevance_scores = await assess_relevance(response, prompt)
    
    # Weight evaluation by relevance
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # Only count relevant segments
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)
    
    return sum(weighted_scores) / len(weighted_scores)

缓解：带有冗长惩罚的评估标准

在评估标准中包含明确的冗长惩罚：

rubric_levels = [
    {
        "score": 5,
        "description": "Complete and concise. All necessary information, nothing extraneous.",
        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
    },
    {
        "score": 3,
        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
    },
    # ... etc
]

自信、权威的语气无论准确性如何都会获得更高的评分。

缓解：证据要求

要求为声明提供明确证据：

For each claim in the response:
1. Identify whether it's a factual claim
2. Note if evidence or sources are provided
3. Score based on verifiability, not confidence

IMPORTANT: Confident claims without evidence should NOT receive higher scores than 
hedged claims with evidence.

缓解：事实检查层

在评分前添加事实检查步骤：

async def fact_checked_evaluation(response, prompt, criteria):
    # Extract claims
    claims = await extract_claims(response)
    
    # Fact-check each claim
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])
    
    # Adjust score based on fact-check results
    accuracy_factor = sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
    
    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # At least 70% of score

监控生产中的系统性偏见：

class BiasMonitor:
    def __init__(self):
        self.evaluations = []
    
    def record(self, evaluation):
        self.evaluations.append(evaluation)
    
    def detect_position_bias(self):
        """Detect if first position wins more often than expected."""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}
    
    def detect_length_bias(self):
        """Detect if longer responses score higher."""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}

偏见	主要缓解方法	次要缓解方法	检测方法
位置	位置交换	多次洗牌	一致性检查
长度	明确提示	长度归一化	长度-分数相关性
自我增强	跨模型评估	匿名化	模型比较研究
冗长	相关性加权	评估标准惩罚	相关性评分
权威	证据要求	事实检查层	置信度-准确性相关性

Claude Code 的 LLM 作为评判者实现模式

此参考为评估 Claude Code 命令、技能和智能体在开发过程中提供了实用的提示模式和工作流程。

模式 1：结构化评估工作流程

最可靠的评估遵循结构化工作流程，将关注点分离：

Define Criteria → Gather Test Cases → Run Evaluation → Mitigate Bias → Interpret Results

步骤 1：定义评估标准

在评估之前，建立明确的标准。以可重用格式记录它们：

## Evaluation Criteria for [Command/Skill Name]

### Criterion 1: Instruction Following (weight: 0.30)
- **Description**: Does the output follow all explicit instructions?
- **1 (Poor)**: Ignores or misunderstands core instructions
- **3 (Adequate)**: Follows main instructions, misses some details
- **5 (Excellent)**: Follows all instructions precisely

### Criterion 2: Output Completeness (weight: 0.25)
- **Description**: Are all requested aspects covered?
- **1 (Poor)**: Major aspects missing
- **3 (Adequate)**: Core aspects covered with gaps
- **5 (Excellent)**: All aspects thoroughly addressed

### Criterion 3: Tool Efficiency (weight: 0.20)
- **Description**: Were appropriate

🇺🇸English

Evaluation Methods for Claude Code Agents

Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.

Core Concepts

Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.

The key insight is that agents may find alternative paths to goals—the evaluation should judge whether they achieve right outcomes while following reasonable processes.

Performance Drivers: The 95% Finding Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:

Factor	Variance Explained	Implication
Token usage	80%	More tokens = better performance
Number of tool calls	~10%	More exploration helps
Model choice	~5%	Better models multiply efficiency

Implications for Claude Code development:

Token budgets matter : Evaluate with realistic token constraints
Model upgrades beat token increases : Upgrading models provides larger gains than increasing token budgets
Multi-agent validation : Validates architectures that distribute work across subagents with separate context windows

Evaluation Challenges

Non-Determinism and Multiple Valid Paths

Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context.

Solution : The solution is outcomes, not exact execution paths. Judge whether the agent achieves the right result through a reasonable process.

Context-Dependent Failures

Agent failures often depend on context in subtle ways. An agent might succeed on complex queries but fail on simple ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates.

Solution : Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.

Composite Quality Dimensions

Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low in efficiency, or vice versa.

An agent might score high on accuracy but low in efficiency.

Solution : Evaluation rubrics must capture multiple dimensions with appropriate weighting for the use case.

Evaluation Rubric Design

Multi-Dimensional Rubric

Effective rubrics cover key dimensions with descriptive levels:

Instruction Following (weight: 0.30)

Excellent (1.0): All instructions followed precisely
Good (0.8): Minor deviations that don't affect outcome
Acceptable (0.6): Major instructions followed, minor ones missed
Poor (0.3): Significant instructions ignored
Failed (0.0): Fundamentally misunderstood the task

Output Completeness (weight: 0.25)

Excellent: All requested aspects thoroughly covered
Good: Most aspects covered with minor gaps
Acceptable: Key aspects covered, some gaps
Poor: Major aspects missing
Failed: Fundamental aspects not addressed

Tool Efficiency (weight: 0.20)

Excellent: Optimal tool selection and minimal calls
Good: Good tool selection with minor inefficiencies
Acceptable: Appropriate tools with some redundancy
Poor: Wrong tools or excessive calls
Failed: Severe tool misuse or extremely excessive calls

Reasoning Quality (weight: 0.15)

Excellent: Clear, logical reasoning throughout
Good: Generally sound reasoning with minor gaps
Acceptable: Basic reasoning present
Poor: Reasoning unclear or flawed
Failed: No apparent reasoning

Response Coherence (weight: 0.10)

Excellent: Well-structured, easy to follow
Good: Generally coherent with minor issues
Acceptable: Understandable but could be clearer
Poor: Difficult to follow
Failed: Incoherent

Scoring Approach

Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate weighted overall scores. Set passing thresholds based on use case requirements (typically 0.7 for general use, 0.85 for critical operations).

Evaluation Methodologies

LLM-as-Judge

Using an LLM to evaluate agent outputs scales well and provides consistent judgments. Design evaluation prompts that capture the dimensions of interest. LLM-based evaluation scales to large test sets and provides consistent judgments. The key is designing effective evaluation prompts that capture the dimensions of interest.

Provide clear task description, agent output, ground truth (if available), evaluation scale with level descriptions, and request structured judgment.

Evaluation Prompt Template :

You are evaluating the output of a Claude Code agent.

## Original Task
{task_description}

## Agent Output
{agent_output}

## Ground Truth (if available)
{expected_output}

## Evaluation Criteria
For each criterion, assess the output and provide:
1. Score (1-5)
2. Specific evidence supporting your score
3. One improvement suggestion

### Criteria
1. Instruction Following: Did the agent follow all instructions?
2. Completeness: Are all requested aspects covered?
3. Tool Efficiency: Were appropriate tools used efficiently?
4. Reasoning Quality: Is the reasoning clear and sound?
5. Response Coherence: Is the output well-structured?

Provide your evaluation as a structured assessment with scores and justifications.

Chain-of-Thought Requirement : Always require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.

Human Evaluation

Human evaluation catches what automation misses:

Hallucinated answers on unusual queries
Subtle context misunderstandings
Edge cases that automated evaluation overlooks
Qualitative issues with tone or approach

For Claude Code development, ask users this:

Review agent outputs manually for edge cases
Sample systematically across complexity levels
Track patterns in failures to inform prompt improvements

End-State Evaluation

For commands that produce artifacts (files, configurations, code), evaluate the final output rather than the process:

Does the generated code work?
Is the configuration valid?
Does the output meet requirements?

Test Set Design

Sample Selection Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit. Small test sets reveal large effects.

Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels.

Complexity Stratification Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).

Context Engineering Evaluation

Testing Prompt Variations

When iterating on Claude Code prompts, evaluate systematically:

Baseline : Run current prompt on test cases
Variation : Run modified prompt on same cases
Compare : Measure quality scores, token usage, efficiency
Analyze : Identify which changes improved which dimensions

Testing Context Strategies

Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics.

Degradation Testing

Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic. Establish safe operating limits.

Advanced Evaluation: LLM-as-Judge

Key insight : LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

Direct Scoring : A single LLM rates one response on a defined scale.

Best for: Objective criteria (factual accuracy, instruction following, toxicity)
Reliability: Moderate to high for well-defined criteria
Failure mode: Score calibration drift, inconsistent scale interpretation

Pairwise Comparison : An LLM compares two responses and selects the better one.

Best for: Subjective preferences (tone, style, persuasiveness)
Reliability: Higher than direct scoring for preferences
Failure mode: Position bias, length bias

Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.

The Bias Landscape

LLM judges exhibit systematic biases that must be actively mitigated:

Position Bias : First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.

Length Bias : Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.

Self-Enhancement Bias : Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.

Verbosity Bias : Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.

Authority Bias : Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.

Metric Selection Framework

Choose metrics based on the evaluation task structure:

Task Type	Primary Metrics	Secondary Metrics
Binary classification (pass/fail)	Recall, Precision, F1	Cohen's κ
Ordinal scale (1-5 rating)	Spearman's ρ, Kendall's τ	Cohen's κ (weighted)
Pairwise preference	Agreement rate, Position consistency	Confidence calibration
Multi-label	Macro-F1, Micro-F1	Per-label precision/recall

The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.

Evaluation Metrics Reference

Classification Metrics (Pass/Fail Tasks)

Precision : Of all responses marked as passing, what fraction truly passed?

Use when false positives are costly

Recall : Of all actually passing responses, what fraction did we identify?

Use when false negatives are costly

F1 Score : Harmonic mean of precision and recall

Use for balanced single-number summary

Agreement Metrics (Comparing to Human Judgment)

Cohen's Kappa : Agreement adjusted for chance

0.8: Almost perfect agreement

0.6-0.8: Substantial agreement
0.4-0.6: Moderate agreement
< 0.4: Fair to poor agreement

Correlation Metrics (Ordinal Scores)

Spearman's Rank Correlation : Correlation between rankings

0.9: Very strong correlation

0.7-0.9: Strong correlation
0.5-0.7: Moderate correlation
< 0.5: Weak correlation

Good Evaluation System Indicators

Metric	Good	Acceptable	Concerning
Spearman's rho	> 0.8	0.6-0.8	< 0.6
Cohen's Kappa	> 0.7	0.5-0.7	< 0.5
Position consistency	> 0.9	0.8-0.9	< 0.8
Length-score correlation	< 0.2	0.2-0.4	> 0.4

Evaluation Approaches

Direct Scoring Implementation

Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.

Criteria Definition Pattern :

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]

Scale Calibration :

1-3 scales: Binary with neutral option, lowest cognitive load
1-5 scales: Standard Likert, good balance of granularity and reliability
1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics

Prompt Structure for Direct Scoring :

You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with structured JSON containing scores, justifications, and summary.

Chain-of-Thought Requirement : All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.

Pairwise Comparison Implementation

Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.

Position Bias Mitigation Protocol :

First pass: Response A in first position, Response B in second
Second pass: Response B in first position, Response A in second
Consistency check: If passes disagree, return TIE with reduced confidence
Final verdict: Consistent winner with averaged confidence

Prompt Structure for Pairwise Comparison :

You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.

Confidence Calibration : Confidence scores should reflect position consistency:

Both passes agree: confidence = average of individual confidences
Passes disagree: confidence = 0.5, verdict = TIE

Rubric Generation

Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.

Rubric Components

Level descriptions : Clear boundaries for each score level
Characteristics : Observable features that define each level
Examples : Representative outputs for each level (when possible)
Edge cases : Guidance for ambiguous situations
Scoring guidelines : General principles for consistent application

Strictness Calibration

Lenient : Lower bar for passing scores, appropriate for encouraging iteration
Balanced : Fair, typical expectations for production use
Strict : High standards, appropriate for safety-critical or high-stakes evaluation

Domain Adaptation

Rubrics should use domain-specific terminology:

A "code readability" rubric mentions variables, functions, and comments.
Documentation rubrics reference clarity, accuracy, completeness
Analysis rubrics focus on depth, accuracy, actionability

Practical Guidance

Evaluation Pipeline Design

Production evaluation systems require multiple layers:

┌─────────────────────────────────────────────────┐
│                 Evaluation Pipeline              │
├─────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│           │                                       │
│           ▼                                       │
│  ┌─────────────────────┐                         │
│  │   Criteria Loader   │ ◄── Rubrics, weights    │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Bias Mitigation   │ ◄── Position swap, etc. │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │ Confidence Scoring  │ ◄── Calibration         │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└─────────────────────────────────────────────────┘

Avoiding Evaluation Pitfalls

Anti-pattern: Scoring without justification

Problem: Scores lack grounding, difficult to debug or improve
Solution: Always require evidence-based justification before score

Anti-pattern: Single-pass pairwise comparison

Problem: Position bias corrupts results
Solution: Always swap positions and check consistency

Anti-pattern: Overloaded criteria

Problem: Criteria measuring multiple things are unreliable
Solution: One criterion = one measurable aspect

Anti-pattern: Missing edge case guidance

Problem: Evaluators handle ambiguous cases inconsistently
Solution: Include edge cases in rubrics with explicit guidance

Anti-pattern: Ignoring confidence calibration

Problem: High-confidence wrong judgments are worse than low-confidence
Solution: Calibrate confidence to position consistency and evidence strength

Decision Framework: Direct vs. Pairwise

Use this decision tree:

Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, instruction following, format compliance
│
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, persuasiveness, creativity
    │
    └── No → Consider reference-based evaluation
        └── Examples: summarization (compare to source), translation (compare to reference)

Scaling Evaluation

For high-volume evaluation:

Panel of LLMs (PoLL) : Use multiple models as judges, aggregate votes
- Reduces individual model bias
- More expensive but more reliable for high-stakes decisions
Hierarchical evaluation : Fast cheap model for screening, expensive model for edge cases
- Cost-effective for large volumes
- Requires calibration of screening threshold
Human-in-the-loop : Automated evaluation for clear cases, human review for low-confidence
- Best reliability for critical applications
- Design feedback loop to improve automated evaluation

Examples

Example 1: Direct Scoring for Accuracy

Input :

Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun, 
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5

Output :

{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct 
scientific reasoning. Both the axial tilt and its effect on sunlight distribution 
are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}

Example 2: Pairwise Comparison with Position Swap

Input :

Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]

First Pass (A first) :

{ "winner": "B", "confidence": 0.8 }

Second Pass (B first) :

{ "winner": "A", "confidence": 0.6 }

(Note: Winner is A because B was in first position)

Mapped Second Pass :

{ "winner": "B", "confidence": 0.6 }

Final Result :

{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}

Example 3: Rubric Generation

Input :

criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"

Output (abbreviated):

{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}

Iterative Improvement Workflow

Identify weakness : Use evaluation to find where agent struggles
Hypothesize cause : Is it the prompt? The context? The examples?
Modify prompt : Make targeted changes based on hypothesis
Re-evaluate : Run same test cases with modified prompt
Compare : Did the change improve the target dimension?
Check regression : Did other dimensions suffer?
Iterate : Repeat until quality meets threshold

Guidelines

Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%
Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias
Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions
Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective
Include confidence scores - Calibrate to position consistency and evidence strength
Define edge cases explicitly - Ambiguous situations cause the most evaluation variance
Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations
Validate against human judgments - Automated evaluation is only valuable if it correlates with human assessment
Monitor for systematic bias - Track disagreement patterns by criterion and response type
Design for iteration - Evaluation systems improve with feedback loops

Example: Evaluating a Claude Code Command

Suppose you've created a /refactor command and want to evaluate its quality:

Test Cases :

Simple: Rename a variable across a single file
Medium: Extract a function from existing code
Complex: Refactor a class to use a new design pattern
Very Complex: Restructure module dependencies

Evaluation Rubric :

Correctness: Does the refactored code work?
Completeness: Were all instances updated?
Style: Does it follow project conventions?
Efficiency: Were unnecessary changes avoided?

Evaluation Prompt :

Evaluate this refactoring output:

Original Code:
{original}

Refactored Code:
{refactored}

Request:
{user_request}

Score 1-5 on each dimension with evidence:
1. Correctness: Does the code still work correctly?
2. Completeness: Were all relevant instances updated?
3. Style: Does it follow the project's coding patterns?
4. Efficiency: Were only necessary changes made?

Provide scores with specific evidence from the code.

Iteration : If evaluation reveals the command often misses instances:

Add explicit instruction: "Search the entire codebase for all occurrences"
Re-evaluate with same test cases
Compare completeness scores
Check that correctness didn't regress

Bias Mitigation Techniques for LLM Evaluation

This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems.

Position Bias

The Problem

In pairwise comparison, LLMs systematically prefer responses in certain positions. Research shows:

GPT has mild first-position bias (~55% preference for first position in ties)
Claude shows similar patterns
Smaller models often show stronger bias

Mitigation: Position Swapping Protocol

async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # Pass 1: Original order
    result_ab = await compare(response_a, response_b, prompt, criteria)
    
    # Pass 2: Swapped order
    result_ba = await compare(response_b, response_a, prompt, criteria)
    
    # Map second result (A in second position → B in first)
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }
    
    # Consistency check
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # Disagreement indicates position bias was a factor
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }

Alternative: Multiple Shuffles

For higher reliability, use multiple position orderings:

async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)
    
    # Majority vote
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)
    
    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }

Length Bias

The Problem

LLMs tend to rate longer responses higher, regardless of quality. This manifests as:

Verbose responses receiving inflated scores
Concise but complete responses penalized
Padding and repetition being rewarded

Mitigation: Explicit Prompting

Include anti-length-bias instructions in the prompt:

CRITICAL EVALUATION GUIDELINES:
- Do NOT prefer responses because they are longer
- Concise, complete answers are as valuable as detailed ones
- Penalize unnecessary verbosity or repetition
- Focus on information density, not word count

Mitigation: Length-Normalized Scoring

def length_normalized_score(score, response_length, target_length=500):
    """Adjust score based on response length."""
    length_ratio = response_length / target_length
    
    if length_ratio > 2.0:
        # Penalize excessively long responses
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # Penalize excessively short responses
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score

Mitigation: Separate Length Criterion

Make length a separate, explicit criterion so it's not implicitly rewarded:

criteria = [
    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3}  # Explicit
]

Self-Enhancement Bias

The Problem

Models rate outputs generated by themselves (or similar models) higher than outputs from different models.

Mitigation: Cross-Model Evaluation

Use a different model family for evaluation than generation:

def get_evaluator_model(generator_model):
    """Select evaluator to avoid self-enhancement bias."""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # Default

Mitigation: Blind Evaluation

Remove model attribution from responses before evaluation:

def anonymize_response(response, model_name):
    """Remove model-identifying patterns."""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # Model-specific patterns
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized

Verbosity Bias

The Problem

Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect.

Mitigation: Relevance-Weighted Scoring

async def relevance_weighted_evaluation(response, prompt, criteria):
    # First, assess relevance of each segment
    relevance_scores = await assess_relevance(response, prompt)
    
    # Weight evaluation by relevance
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # Only count relevant segments
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)
    
    return sum(weighted_scores) / len(weighted_scores)

Mitigation: Rubric with Verbosity Penalty

Include explicit verbosity penalties in rubrics:

rubric_levels = [
    {
        "score": 5,
        "description": "Complete and concise. All necessary information, nothing extraneous.",
        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
    },
    {
        "score": 3,
        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
    },
    # ... etc
]

Authority Bias

The Problem

Confident, authoritative tone is rated higher regardless of accuracy.

Mitigation: Evidence Requirement

Require explicit evidence for claims:

For each claim in the response:
1. Identify whether it's a factual claim
2. Note if evidence or sources are provided
3. Score based on verifiability, not confidence

IMPORTANT: Confident claims without evidence should NOT receive higher scores than 
hedged claims with evidence.

Mitigation: Fact-Checking Layer

Add a fact-checking step before scoring:

async def fact_checked_evaluation(response, prompt, criteria):
    # Extract claims
    claims = await extract_claims(response)
    
    # Fact-check each claim
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])
    
    # Adjust score based on fact-check results
    accuracy_factor = sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
    
    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # At least 70% of score

Aggregate Bias Detection

Monitor for systematic biases in production:

class BiasMonitor:
    def __init__(self):
        self.evaluations = []
    
    def record(self, evaluation):
        self.evaluations.append(evaluation)
    
    def detect_position_bias(self):
        """Detect if first position wins more often than expected."""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}
    
    def detect_length_bias(self):
        """Detect if longer responses score higher."""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}

Summary Table

Bias	Primary Mitigation	Secondary Mitigation	Detection Method
Position	Position swapping	Multiple shuffles	Consistency check
Length	Explicit prompting	Length normalization	Length-score correlation
Self-enhancement	Cross-model evaluation	Anonymization	Model comparison study
Verbosity	Relevance weighting	Rubric penalties	Relevance scoring
Authority	Evidence requirement	Fact-checking layer	Confidence-accuracy correlation

LLM-as-Judge Implementation Patterns for Claude Code

This reference provides practical prompt patterns and workflows for evaluating Claude Code commands, skills, and agents during development.

Pattern 1: Structured Evaluation Workflow

The most reliable evaluation follows a structured workflow that separates concerns:

Define Criteria → Gather Test Cases → Run Evaluation → Mitigate Bias → Interpret Results

Step 1: Define Evaluation Criteria

Before evaluating, establish clear criteria. Document them in a reusable format:

## Evaluation Criteria for [Command/Skill Name]

### Criterion 1: Instruction Following (weight: 0.30)
- **Description**: Does the output follow all explicit instructions?
- **1 (Poor)**: Ignores or misunderstands core instructions
- **3 (Adequate)**: Follows main instructions, misses some details
- **5 (Excellent)**: Follows all instructions precisely

### Criterion 2: Output Completeness (weight: 0.25)
- **Description**: Are all requested aspects covered?
- **1 (Poor)**: Major aspects missing
- **3 (Adequate)**: Core aspects covered with gaps
- **5 (Excellent)**: All aspects thoroughly addressed

### Criterion 3: Tool Efficiency (weight: 0.20)
- **Description**: Were appropriate tools used efficiently?
- **1 (Poor)**: Wrong tools or excessive redundant calls
- **3 (Adequate)**: Appropriate tools with some redundancy
- **5 (Excellent)**: Optimal tool selection, minimal calls

### Criterion 4: Reasoning Quality (weight: 0.15)
- **Description**: Is the reasoning clear and sound?
- **1 (Poor)**: No apparent reasoning or flawed logic
- **3 (Adequate)**: Basic reasoning present
- **5 (Excellent)**: Clear, logical reasoning throughout

### Criterion 5: Response Coherence (weight: 0.10)
- **Description**: Is the output well-structured and clear?
- **1 (Poor)**: Difficult to follow or incoherent
- **3 (Adequate)**: Understandable but could be clearer
- **5 (Excellent)**: Well-structured, easy to follow

Step 2: Create Test Cases

Structure test cases by complexity level:

## Test Cases for /refactor Command

### Simple (Single Operation)
- **Input**: Rename variable `x` to `count` in a single file
- **Expected**: All instances renamed, code still runs
- **Complexity**: Low

### Medium (Multiple Operations)
- **Input**: Extract function from 20-line code block
- **Expected**: New function created, original call site updated, behavior preserved
- **Complexity**: Medium

### Complex (Cross-File Changes)
- **Input**: Refactor class to use Strategy pattern
- **Expected**: Interface created, implementations separated, all usages updated
- **Complexity**: High

### Edge Case
- **Input**: Refactor code with conflicting variable names in nested scopes
- **Expected**: Correct scoping preserved, no accidental shadowing
- **Complexity**: Edge case

Step 3: Run Direct Scoring Evaluation

Use this prompt template to evaluate a single output:

You are evaluating the output of a Claude Code command.

## Original Task
{paste the user's original request}

## Command Output
{paste the full command output including tool calls}

## Evaluation Criteria
{paste your criteria definitions from Step 1}

## Instructions
For each criterion:
1. Find specific evidence in the output that supports your assessment
2. Assign a score (1-5) based on the rubric levels
3. Write a 1-2 sentence justification citing the evidence
4. Suggest one specific improvement

IMPORTANT: Provide your justification BEFORE stating the score. This improves evaluation reliability.

## Output Format
For each criterion, respond with:

### [Criterion Name]
**Evidence**: [Quote or describe specific parts of the output]
**Justification**: [Explain how the evidence maps to the rubric level]
**Score**: [1-5]
**Improvement**: [One actionable suggestion]

### Overall Assessment
**Weighted Score**: [Calculate: sum of (score × weight)]
**Pass/Fail**: [Pass if weighted score ≥ 3.5]
**Summary**: [2-3 sentences summarizing strengths and weaknesses]

Step 4: Mitigate Position Bias in Comparisons

When comparing two prompt variants (A vs B), use this two-pass workflow:

Pass 1 (A First):

You are comparing two outputs from different prompt variants.

## Original Task
{task description}

## Output A (First Variant)
{output from prompt variant A}

## Output B (Second Variant)
{output from prompt variant B}

## Comparison Criteria
- Instruction Following
- Output Completeness
- Reasoning Quality

## Critical Instructions
- Do NOT prefer outputs because they are longer
- Do NOT prefer outputs based on their position (first vs second)
- Focus ONLY on quality differences
- TIE is acceptable when outputs are equivalent

## Analysis Process
1. Analyze Output A independently: [strengths, weaknesses]
2. Analyze Output B independently: [strengths, weaknesses]
3. Compare on each criterion
4. Determine winner with confidence (0-1)

## Output
Reasoning: [Explain why]
Winner: [A/B/TIE]
Confidence: [0.0-1.0]

Pass 2 (B First): Repeat the same prompt but swap the order—put Output B first and Output A second.

Interpret Results:

If both passes agree → Winner confirmed, average the confidences
If passes disagree → Result is TIE with confidence 0.5 (position bias detected)

Pattern 2: Hierarchical Evaluation Workflow

For complex evaluations, use a hierarchical approach:

Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases)

Tier 1: Quick Screen (Use Haiku)

Rate this command output 0-10 for basic adequacy.

Task: {brief task description}
Output: {command output}

Quick assessment: Does this output reasonably address the task?
Score (0-10):
One-line reasoning:

Decision rule : Score < 5 → Fail, Score ≥ 7 → Pass, Score 5-7 → Escalate to detailed evaluation

Tier 2: Detailed Evaluation (Use Opus)

Use the full direct scoring prompt from Pattern 1 for borderline cases.

Tier 3: Human Review

For low-confidence automated evaluations (confidence < 0.6), queue for manual review:

## Human Review Request

**Automated Score**: 3.2/5 (Confidence: 0.45)
**Reason for Escalation**: Low confidence, evaluator disagreed across passes

### What to Review
1. Does the output actually complete the task?
2. Are the automated criterion scores reasonable?
3. What did the automation miss?

### Original Task
{task}

### Output
{output}

### Automated Assessment
{paste automated evaluation}

### Human Override
[ ] Agree with automation
[ ] Override to PASS - Reason: ___
[ ] Override to FAIL - Reason: ___

Pattern 3: Panel of LLM Judges (PoLL)

For high-stakes evaluation, use multiple models::

Workflow

Run 3 independent evaluations with different prompt framings:
- Evaluation 1: Standard criteria prompt
- Evaluation 2: Adversarial framing ("Find problems with this output")
- Evaluation 3: User perspective ("Would a developer be satisfied?")
Aggregate results :
- Take median score per criterion (robust to outliers)
- Flag criteria with high variance (std > 1.0) for review
- Overall pass requires majority agreement

Multi-Judge Prompt Variants

Standard Framing:

Evaluate this output against the specified criteria. Be fair and balanced.

Adversarial Framing:

Your role is to find problems with this output. Be critical and thorough.
Look for: factual errors, missing requirements, inefficiencies, unclear explanations.

User Perspective:

Imagine you're a developer who requested this task.
Would you be satisfied with this result? Would you need to redo any work?

Agreement Analysis

After running all judges, check consistency:

Criterion	Judge 1	Judge 2	Judge 3	Median	Std Dev
Instruction Following	4	4	5	4	0.58
Completeness	3	4	3	3	0.58
Tool Efficiency	2	3	4	3	1.00 ⚠️

⚠️ High variance on Tool Efficiency suggests the criterion needs clearer definition or the output has ambiguous efficiency characteristics.

Pattern 4: Confidence Calibration

Confidence scores should be calibrated to actual reliability:

Confidence Factors

Factor	High Confidence	Low Confidence
Position consistency	Both passes agree	Passes disagree
Evidence count	3+ specific citations	Vague or no citations
Criterion agreement	All criteria align	Criteria scores vary widely
Edge case match	Similar to known cases	Novel situation

Calibration Prompt Addition

Add this to evaluation prompts:

## Confidence Assessment

After scoring, assess your confidence:

1. **Evidence Strength**: How specific was the evidence you cited?
   - Strong: Quoted exact passages, precise observations
   - Moderate: General observations, reasonable inferences
   - Weak: Vague impressions, assumptions

2. **Criterion Clarity**: How clear were the criterion boundaries?
   - Clear: Easy to map output to rubric levels
   - Ambiguous: Output fell between levels
   - Unclear: Rubric didn't fit this case

3. **Overall Confidence**: [0.0-1.0]
   - 0.9+: Very confident, clear evidence, obvious rubric fit
   - 0.7-0.9: Confident, good evidence, minor ambiguity
   - 0.5-0.7: Moderate confidence, some ambiguity
   - <0.5: Low confidence, significant uncertainty

Confidence: [score]
Confidence Reasoning: [explain what factors affected confidence]

Pattern 5: Structured Output Format

Request consistent output structure for easier analysis:

Evaluation Output Template

## Evaluation Results

### Metadata
- **Evaluated**: [command/skill name]
- **Test Case**: [test case ID or description]
- **Evaluator**: [model used]
- **Timestamp**: [when evaluated]

### Criterion Scores

| Criterion | Score | Weight | Weighted | Confidence |
|-----------|-------|--------|----------|------------|
| Instruction Following | 4/5 | 0.30 | 1.20 | 0.85 |
| Output Completeness | 3/5 | 0.25 | 0.75 | 0.70 |
| Tool Efficiency | 5/5 | 0.20 | 1.00 | 0.90 |
| Reasoning Quality | 4/5 | 0.15 | 0.60 | 0.75 |
| Response Coherence | 4/5 | 0.10 | 0.40 | 0.80 |

### Summary
- **Overall Score**: 3.95/5.0
- **Pass Threshold**: 3.5/5.0
- **Result**: ✅ PASS

### Evidence Summary
- **Strengths**: [bullet points]
- **Weaknesses**: [bullet points]
- **Improvements**: [prioritized suggestions]

### Confidence Assessment
- **Overall Confidence**: 0.78
- **Flags**: [any concerns or caveats]

Evaluation Workflows for Claude Code Development

Workflow: Testing a New Command

Write 5-10 test cases spanning complexity levels
Run command on each test case, capture full output
Quick screen all outputs with Tier 1 evaluation
Detailed evaluate failures and borderline cases
Identify patterns in failures to guide prompt improvements
Iterate prompt based on specific weaknesses found
Re-evaluate same test cases to measure improvement

Workflow: Comparing Prompt Variants

Create variant prompts (e.g., different instruction phrasings)
Run both variants on identical test cases
Pairwise compare with position swapping
Calculate win rate for each variant
Analyze which cases each variant handles better
Decide : Pick winner or create hybrid

Workflow: Regression Testing

Maintain test suite of representative cases
Before changes : Run evaluation, record baseline scores
After changes : Re-run evaluation
Compare : Flag regressions (score drops > 0.5)
Investigate : Why did specific cases regress?
Accept or revert : Based on overall impact

Workflow: Continuous Quality Monitoring

Sample production usage (if available)
Run lightweight evaluation on samples
Track metrics over time :
- Average scores by criterion
- Failure rate
- Low-confidence rate
Alert on degradation : Score drop > 10% from baseline
Periodic deep dive : Monthly detailed evaluation on random sample

Anti-Patterns to Avoid

❌ Scoring Without Justification

Problem : Scores lack grounding, difficult to debug Solution : Always require evidence before score

❌ Single-Pass Pairwise Comparison

Problem : Position bias corrupts results Solution : Always swap positions and check consistency

❌ Overloaded Criteria

Problem : Criteria measuring multiple things are unreliable Solution : One criterion = one measurable aspect

❌ Missing Edge Case Guidance

Problem : Evaluators handle ambiguous cases inconsistently Solution : Include edge cases in rubrics with explicit guidance

❌ Ignoring Low Confidence

Problem : Acting on uncertain evaluations leads to wrong conclusions Solution : Escalate low-confidence cases for human review

❌ Generic Rubrics

Problem : Generic criteria produce vague, unhelpful evaluations Solution : Create domain-specific rubrics (code commands vs documentation commands vs analysis commands)

Handling Evaluation Failures

When evaluations fail or produce unreliable results, use these recovery strategies:

Malformed Output Disregard

When the evaluator produces unparseable or incomplete output:

Mark as invalid and ingore for analysis - incorrect output, usally means halicunations during thinking process
Retry initial prompt without chagnes - multiple retries usally more consistent rahter one shot prompt
if still produce incorrect output, flag for human review : Mark as "evaluation failed, needs manual check" and queue for later

Validation Checklist

Before trusting evaluation results, verify:

All criteria have scores in valid range (1-5)
Each score has a justification referencing specific evidence
Confidence score is provided and reasonable
No contradictions between justification and assigned score
Weighted total calculation is correct

Validating Evaluation Prompts (Meta-Evaluation)

Before using an evaluation prompt in production, test it against known cases:

Calibration Test Cases

Create a small set of outputs with known quality levels:

Test Type	Description	Expected Score
Known-good	Clearly excellent output	4.5+ / 5.0
Known-bad	Clearly poor output	< 2.5 / 5.0
Boundary	Borderline case	3.0-3.5 with nuanced explanation

Validation Workflow

Known-good test : Evaluate a clearly excellent output
- If score < 4.0 → Rubric is too strict or evidence requirements unclear
Known-bad test : Evaluate a clearly poor output
- If score > 3.0 → Rubric is too lenient or criteria not specific enough
Boundary test : Evaluate a borderline case
- Should produce moderate score (3.0-3.5) with detailed explanation
- If confident high/low score → Criteria lack nuance
Consistency test : Run same evaluation 3 times
- Score variance should be < 0.5
- If higher variance → Criteria need tighter definitions

Position Bias Validation

Test for position bias before using pairwise comparisons:

## Position Bias Test

Run this test with IDENTICAL outputs in both positions:

Test Case: [Same output text]
Position A: [Paste output]
Position B: [Paste identical output]

Expected Result: TIE with high confidence (>0.9)

If Result Shows Winner:
- Position bias detected
- Add stronger anti-bias instructions to prompt
- Re-test until TIE achieved consistently

Evaluation Prompt Iteration

When calibration tests fail:

Identify failure mode : Too strict? Too lenient? Inconsistent?
Adjust specific rubric levels : Add examples, clarify boundaries
Re-run calibration tests : All 4 tests must pass
Document changes : Track what adjustments improved reliability

Metric Selection Guide for LLM Evaluation

This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.

Metric Categories

Classification Metrics

Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).

Precision

Precision = True Positives / (True Positives + False Positives)

Interpretation : Of all responses the judge said were good, what fraction were actually good?

Use when : False positives are costly (e.g., approving unsafe content)

Recall

Recall = True Positives / (True Positives + False Negatives)

Interpretation : Of all actually good responses, what fraction did the judge identify?

Use when : False negatives are costly (e.g., missing good content in filtering)

F1 Score

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Interpretation : Harmonic mean of precision and recall

Use when : You need a single number balancing both concerns

Agreement Metrics

Use for comparing automated evaluation with human judgment.

Cohen's Kappa (κ)

κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)

Interpretation : Agreement adjusted for chance

κ > 0.8: Almost perfect agreement
κ 0.6-0.8: Substantial agreement
κ 0.4-0.6: Moderate agreement
κ < 0.4: Fair to poor agreement

Use for : Binary or categorical judgments

Weighted Kappa

For ordinal scales where disagreement severity matters:

Interpretation : Penalizes large disagreements more than small ones

Correlation Metrics

Use for ordinal/continuous scores.

Spearman's Rank Correlation (ρ)

Interpretation : Correlation between rankings, not absolute values

ρ > 0.9: Very strong correlation
ρ 0.7-0.9: Strong correlation
ρ 0.5-0.7: Moderate correlation
ρ < 0.5: Weak correlation

Use when : Order matters more than exact values

Kendall's Tau (τ)

Interpretation : Similar to Spearman but based on pairwise concordance

Use when : You have many tied values

Pearson Correlation (r)

Interpretation : Linear correlation between scores

Use when : Exact score values matter, not just order

Pairwise Comparison Metrics

Agreement Rate

Agreement = (Matching Decisions) / (Total Comparisons)

Interpretation : Simple percentage of agreement

Position Consistency

Consistency = (Consistent across position swaps) / (Total comparisons)

Interpretation : How often does swapping position change the decision?

Selection Decision Tree

What type of evaluation task?
│
├── Binary classification (pass/fail)
│   └── Use: Precision, Recall, F1, Cohen's κ
│
├── Ordinal scale (1-5 rating)
│   ├── Comparing to human judgments?
│   │   └── Use: Spearman's ρ, Weighted κ
│   └── Comparing two automated judges?
│       └── Use: Kendall's τ, Spearman's ρ
│
├── Pairwise preference
│   └── Use: Agreement rate, Position consistency
│
└── Multi-label classification
    └── Use: Macro-F1, Micro-F1, Per-label metrics

Metric Selection by Use Case

Use Case 1: Validating Automated Evaluation

Goal : Ensure automated evaluation correlates with human judgment

Recommended Metrics :

Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
Secondary: Per-criterion agreement
Diagnostic: Confusion matrix for systematic errors

Use Case 2: Comparing Two Models

Goal : Determine which model produces better outputs

Recommended Metrics :

Primary: Win rate (from pairwise comparison)
Secondary: Position consistency (bias check)
Diagnostic: Per-criterion breakdown

Use Case 3: Quality Monitoring

Goal : Track evaluation quality over time

Recommended Metrics :

Primary: Rolling agreement with human spot-checks
Secondary: Score distribution stability
Diagnostic: Bias indicators (position, length)

Interpreting Metric Results

Good Evaluation System Indicators

Metric	Good	Acceptable	Concerning
Spearman's ρ	> 0.8	0.6-0.8	< 0.6
Cohen's κ	> 0.7	0.5-0.7	< 0.5
Position consistency	> 0.9	0.8-0.9	< 0.8
Length correlation	< 0.2	0.2-0.4	> 0.4

Warning Signs

High agreement but low correlation : May indicate calibration issues
Low position consistency : Position bias affecting results
High length correlation : Length bias inflating scores
Per-criterion variance : Some criteria may be poorly defined

Reporting Template

## Evaluation System Metrics Report

### Human Agreement
- Spearman's ρ: 0.82 (p < 0.001)
- Cohen's κ: 0.74
- Sample size: 500 evaluations

### Bias Indicators
- Position consistency: 91%
- Length-score correlation: 0.12

### Per-Criterion Performance
| Criterion | Spearman's ρ | κ |
|-----------|--------------|---|
| Accuracy | 0.88 | 0.79 |
| Clarity | 0.76 | 0.68 |
| Completeness | 0.81 | 0.72 |

### Recommendations
- All metrics within acceptable ranges
- Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement

Weekly Installs

244

Repository

neolabhq/contex…ring-kit

GitHub Stars

708

First Seen

Feb 19, 2026

Installed on

opencode239

gemini-cli237

codex237

github-copilot237

cursor235

amp234

AI Elements：基于shadcn/ui的AI原生应用组件库，快速构建对话界面

56,200 周安装

Claude代码智能体评估方法：多维标准与LLM评判者框架，提升AI代理性能

🇨🇳中文介绍

Claude 代码智能体的评估方法

核心概念

评估挑战

非确定性和多种有效路径

相关 Skills

上下文相关的失败

复合质量维度

评估标准设计

多维评估标准

评分方法

评估方法

LLM 作为评判者

人工评估

最终状态评估

测试集设计

上下文工程评估

测试提示变体

测试上下文策略

退化测试

高级评估：LLM 作为评判者

评估分类法

偏见格局

指标选择框架

评估指标参考

分类指标（通过/失败任务）

一致性指标（与人类判断比较）

相关性指标（序数分数）

良好评估系统指标

评估方法

直接评分实现

成对比较实现

评估标准生成

评估标准组成部分

严格度校准

领域适应

实践指导

评估流水线设计

避免评估陷阱

决策框架：直接评分 vs 成对比较

扩展评估

示例

示例 1：准确性的直接评分

示例 2：带位置交换的成对比较

示例 3：评估标准生成

迭代改进工作流程

指南

示例：评估 Claude Code 命令

LLM 评估的偏见缓解技术

位置偏见

问题

缓解：位置交换协议

替代方案：多次洗牌

长度偏见

问题

缓解：明确提示

缓解：长度归一化评分

缓解：单独的长度标准

自我增强偏见

问题

缓解：跨模型评估

缓解：盲评估

冗长偏见

问题

缓解：相关性加权评分

缓解：带有冗长惩罚的评估标准

权威偏见

问题

缓解：证据要求

缓解：事实检查层

聚合偏见检测

汇总表

Claude Code 的 LLM 作为评判者实现模式

模式 1：结构化评估工作流程

步骤 1：定义评估标准

🇺🇸English

Evaluation Methods for Claude Code Agents

Core Concepts

Evaluation Challenges