validate-evaluator by hamelsmu/evals-skills
npx skills add https://github.com/hamelsmu/evals-skills --skill validate-evaluator
Calibrate an LLM judge against human judgment.
Split human-labeled data into three disjoint sets:
| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.
```python
from sklearn.model_selection import train_test_split

# First split: separate the test set
train_dev, test = train_test_split(
    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
)

# Second split: separate training examples from the dev set
train, dev = train_test_split(
    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
)
# Result: ~15% train, ~45% dev, ~40% test
```
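The two-stage split above can be sanity-checked on synthetic balanced labels (the data and label values here are illustrative, not from the source):

```python
from sklearn.model_selection import train_test_split

# 100 synthetic examples with balanced Pass/Fail labels (illustrative).
data = list(range(100))
labels = ['Pass'] * 50 + ['Fail'] * 50

# First split: 40% held out as the test set, stratified by label.
train_dev, test, y_train_dev, y_test = train_test_split(
    data, labels, test_size=0.4, stratify=labels, random_state=42
)
# Second split: 75% of the remainder becomes dev, the rest train.
train, dev, y_train, y_dev = train_test_split(
    train_dev, y_train_dev, test_size=0.75, stratify=y_train_dev, random_state=42
)

print(len(train), len(dev), len(test))  # 15 45 40
```

Stratification keeps the Pass/Fail ratio roughly equal in every split, which is what makes the TPR/TNR measurements in the next step reliable.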
Run the judge on every example in the dev set. Compare predictions to human labels.
TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?
TPR = (judge says Pass AND human says Pass) / (human says Pass)
TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?
TNR = (judge says Fail AND human says Fail) / (human says Fail)
```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(
    human_labels, evaluator_labels, labels=['Fail', 'Pass']
).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```
Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.
Examine every case where the judge disagrees with human labels:
| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
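Collecting the two disagreement types for review can be sketched as follows (function and variable names are illustrative; labels are assumed to be the strings 'Pass'/'Fail'):

```python
# Sketch: surface dev-set disagreements for manual review.
def disagreements(human_labels, eval_labels, examples):
    false_passes, false_fails = [], []
    for h, e, ex in zip(human_labels, eval_labels, examples):
        if e == 'Pass' and h == 'Fail':
            false_passes.append(ex)  # judge too lenient
        elif e == 'Fail' and h == 'Pass':
            false_fails.append(ex)   # judge too strict
    return false_passes, false_fails

fp_cases, ff_cases = disagreements(
    ['Pass', 'Fail', 'Pass'],  # human labels
    ['Pass', 'Pass', 'Fail'],  # judge labels
    ['a', 'b', 'c'],           # the traces themselves
)
print(fp_cases, ff_cases)  # ['b'] ['c']
```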
For each disagreement, determine whether to fix the judge (refine the prompt or examples) or fix the label (the human annotation itself was wrong).
Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
Stopping criteria: TPR and TNR both meet your targets, or they have plateaued across consecutive iterations.
If alignment stalls:
| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
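The "decompose into smaller, more atomic checks" remedy can be sketched like this (the check names and rubric questions are illustrative; each atomic check would get its own calibrated judge):

```python
# Sketch: split one broad rubric into atomic checks, each judged separately.
atomic_checks = {
    "answers_question": "Does the response address the user's question?",
    "cites_sources": "Does the response cite at least one retrieved document?",
    "no_fabrication": "Does the response avoid claims absent from the sources?",
}

def overall_pass(results: dict) -> str:
    # A trace passes overall only if every atomic check passes.
    return "Pass" if all(v == "Pass" for v in results.values()) else "Fail"

print(overall_pass({"answers_question": "Pass",
                    "cites_sources": "Pass",
                    "no_fabrication": "Fail"}))  # Fail
```

Atomic checks are easier to calibrate individually than one compound criterion, because each disagreement points at exactly one failure mode.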
Run the judge exactly once on the held-out test set. Record final TPR and TNR.
Do not iterate after seeing test set results. Go back to step 4 with new dev data if needed.
Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
Where:
p_obs = fraction of unlabeled traces the judge scored as Pass
TPR, TNR = values measured on the test set
theta_hat = corrected estimate of the true success rate

Clip theta_hat to [0, 1]. The correction is invalid when TPR + TNR - 1 is near 0 (the judge is no better than random).
Example:
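With illustrative numbers (assumed here, not measured): TPR = 0.90 and TNR = 0.85 from the test set, and an observed judge pass rate of 0.80 on production data:

```python
# Bias correction: theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
tpr, tnr = 0.90, 0.85  # measured on the test set (illustrative values)
p_obs = 0.80           # judge's observed pass rate on unlabeled production data

theta_hat = (p_obs + tnr - 1) / (tpr + tnr - 1)  # 0.65 / 0.75
theta_hat = min(max(theta_hat, 0.0), 1.0)        # clip to [0, 1]
print(f"{theta_hat:.3f}")  # 0.867
```

The judge's false passes and false fails here net out to an observed 80% that understates the true rate: 0.867 × 0.90 + 0.133 × 0.15 ≈ 0.80, recovering p_obs.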
Compute a bootstrap confidence interval. A point estimate alone is not enough.
```python
import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for the corrected success rate."""
    n = len(human_labels)
    estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, size=n, replace=True)
        h = np.array(human_labels)[idx]
        e = np.array(eval_labels)[idx]
        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()
        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1
        if abs(denom) < 1e-6:
            continue  # resampled judge is no better than random; skip
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))
    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```
Or use judgy (pip install judgy):
```python
from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels,
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
```
Pin the exact judge model version (e.g. gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.