validate-evaluator by hamelsmu/evals-skills
npx skills add https://github.com/hamelsmu/evals-skills --skill validate-evaluator
Calibrate an LLM judge against human judgment.
Split human-labeled data into three disjoint sets:
| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.
```python
from sklearn.model_selection import train_test_split

# First split: separate the test set
train_dev, test = train_test_split(
    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
)

# Second split: separate training examples from the dev set
train, dev = train_test_split(
    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
)
# Result: ~15% train, ~45% dev, ~40% test
```
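The two-stage split above can be sanity-checked on synthetic balanced labels (the data and label values here are illustrative, not from the source):

```python
from sklearn.model_selection import train_test_split

# 100 synthetic examples with balanced Pass/Fail labels (illustrative).
data = list(range(100))
labels = ['Pass'] * 50 + ['Fail'] * 50

# First split: 40% held out as the test set, stratified by label.
train_dev, test, y_train_dev, y_test = train_test_split(
    data, labels, test_size=0.4, stratify=labels, random_state=42
)
# Second split: 75% of the remainder becomes dev, the rest train.
train, dev, y_train, y_dev = train_test_split(
    train_dev, y_train_dev, test_size=0.75, stratify=y_train_dev, random_state=42
)

print(len(train), len(dev), len(test))  # 15 45 40
```

Stratification keeps the Pass/Fail ratio roughly equal in every split, which is what makes the TPR/TNR measurements in the next step reliable.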
Run the judge on every example in the dev set. Compare predictions to human labels.
TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?
TPR = (judge says Pass AND human says Pass) / (human says Pass)
TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?
TNR = (judge says Fail AND human says Fail) / (human says Fail)
```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(
    human_labels, evaluator_labels, labels=['Fail', 'Pass']
).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```
Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.
Examine every case where the judge disagrees with human labels:
| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
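Collecting the two disagreement types for review can be sketched as follows (function and variable names are illustrative; labels are assumed to be the strings 'Pass'/'Fail'):

```python
# Sketch: surface dev-set disagreements for manual review.
def disagreements(human_labels, eval_labels, examples):
    false_passes, false_fails = [], []
    for h, e, ex in zip(human_labels, eval_labels, examples):
        if e == 'Pass' and h == 'Fail':
            false_passes.append(ex)  # judge too lenient
        elif e == 'Fail' and h == 'Pass':
            false_fails.append(ex)   # judge too strict
    return false_passes, false_fails

fp_cases, ff_cases = disagreements(
    ['Pass', 'Fail', 'Pass'],  # human labels
    ['Pass', 'Pass', 'Fail'],  # judge labels
    ['a', 'b', 'c'],           # the traces themselves
)
print(fp_cases, ff_cases)  # ['b'] ['c']
```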
For each disagreement, determine whether to fix the judge (refine the prompt or examples) or fix the label (the human annotation itself was wrong).
Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
Stopping criteria: TPR and TNR both meet your targets, or they have plateaued across consecutive iterations.
If alignment stalls:
| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
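The "decompose into smaller, more atomic checks" remedy can be sketched like this (the check names and rubric questions are illustrative; each atomic check would get its own calibrated judge):

```python
# Sketch: split one broad rubric into atomic checks, each judged separately.
atomic_checks = {
    "answers_question": "Does the response address the user's question?",
    "cites_sources": "Does the response cite at least one retrieved document?",
    "no_fabrication": "Does the response avoid claims absent from the sources?",
}

def overall_pass(results: dict) -> str:
    # A trace passes overall only if every atomic check passes.
    return "Pass" if all(v == "Pass" for v in results.values()) else "Fail"

print(overall_pass({"answers_question": "Pass",
                    "cites_sources": "Pass",
                    "no_fabrication": "Fail"}))  # Fail
```

Atomic checks are easier to calibrate individually than one compound criterion, because each disagreement points at exactly one failure mode.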
Run the judge exactly once on the held-out test set. Record final TPR and TNR.
Do not iterate after seeing test set results. Go back to step 4 with new dev data if needed.
Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
Where:
p_obs = fraction of unlabeled traces the judge scored as Pass
TPR, TNR = values measured on the test set
theta_hat = corrected estimate of the true success rate

Clip theta_hat to [0, 1]. The correction is invalid when TPR + TNR - 1 is near 0 (the judge is no better than random).
Example:
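With illustrative numbers (assumed here, not measured): TPR = 0.90 and TNR = 0.85 from the test set, and an observed judge pass rate of 0.80 on production data:

```python
# Bias correction: theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
tpr, tnr = 0.90, 0.85  # measured on the test set (illustrative values)
p_obs = 0.80           # judge's observed pass rate on unlabeled production data

theta_hat = (p_obs + tnr - 1) / (tpr + tnr - 1)  # 0.65 / 0.75
theta_hat = min(max(theta_hat, 0.0), 1.0)        # clip to [0, 1]
print(f"{theta_hat:.3f}")  # 0.867
```

The judge's false passes and false fails here net out to an observed 80% that understates the true rate: 0.867 × 0.90 + 0.133 × 0.15 ≈ 0.80, recovering p_obs.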
Compute a bootstrap confidence interval. A point estimate alone is not enough.
```python
import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for the corrected success rate."""
    n = len(human_labels)
    estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, size=n, replace=True)
        h = np.array(human_labels)[idx]
        e = np.array(eval_labels)[idx]
        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()
        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1
        if abs(denom) < 1e-6:
            continue  # resampled judge is no better than random; skip
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))
    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```
Or use judgy (pip install judgy):
```python
from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels,
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
```
Pin the exact judge model version (e.g. gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.