npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit
Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.
Access to eval artifacts (traces, evaluator configs, judge prompts, labeled data) via an observability MCP server or local files. If none exist, skip to "No Eval Infrastructure."
Check whether the user has an observability MCP server connected (Phoenix, Braintrust, LangSmith, Truesight or similar). If available, use it to pull traces, evaluator definitions, and experiment results. If not, ask for local files: CSVs, JSON trace exports, notebooks, or evaluation scripts.
Work through each area below. Inspect available artifacts, determine whether the problem exists, and record a finding if it does.
Prioritize findings by impact on the user's product. Present the most impactful findings first.
Check: Has the user done systematic error analysis on real or synthetic traces?
Look for: labeled trace datasets, failure category definitions, notes from trace review. If evaluators exist but no documented failure categories, error analysis was likely skipped.
Finding if missing: Evaluators built without error analysis measure generic qualities ("helpfulness", "coherence") instead of actual failure modes. Start with error-analysis; if no traces exist, run generate-synthetic-data first.
See: Your AI Product Needs Evals, LLM Evals FAQ
Check: Were failure categories brainstormed or observed?
Generic labels borrowed from research ("hallucination score", "toxicity", "coherence") suggest brainstorming. Application-grounded categories ("missing query constraints", "wrong client tone", "fabricated property features") suggest observation.
Finding if brainstormed: Generic categories miss application-specific failures and produce evaluators that score well on paper but miss real problems. Re-do with error-analysis, starting from traces.
See: Who Validates the Validators?
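To make the contrast concrete, here is a sketch of what an observed taxonomy can look like when written down next to the labeling code. The category names and the assistant domain (real estate) are hypothetical, echoing the examples above:

```python
# Hypothetical failure taxonomy for a real-estate assistant, derived from
# trace review rather than brainstormed. Each category has a concrete,
# checkable definition so annotators apply it consistently.
FAILURE_CATEGORIES = {
    "missing_query_constraints": "Response ignores a constraint the user stated (budget, location, bedrooms).",
    "wrong_client_tone": "Response uses a tone inappropriate for the client (e.g. casual with a commercial buyer).",
    "fabricated_property_features": "Response mentions a feature not present in the retrieved listing data.",
}

def label_trace(trace_id: str, category: str, note: str = "") -> dict:
    """Record a single annotation, rejecting categories outside the taxonomy."""
    if category not in FAILURE_CATEGORIES:
        raise ValueError(f"Unknown category: {category}")
    return {"trace_id": trace_id, "category": category, "note": note}
```

Pinning the taxonomy in code (instead of free-text labels) is what keeps later failure-rate counts comparable across annotators.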
Check: Are evaluators binary pass/fail?
Flag any that use Likert scales (1-5), letter grades (A-F), or numeric scores without a clear pass/fail threshold.
Finding if not binary: Likert scales are difficult to calibrate. Annotators disagree on the difference between a 3 and a 4, and judges inherit that noise. Consider converting to binary pass/fail with explicit definitions using write-judge-prompt.
See: Creating an LLM Judge That Drives Business Results
Check: Do LLM judge prompts target specific failure modes?
Flag any that evaluate holistically ("Is this response helpful?", "Rate the quality of this output").
Finding if vague: Holistic judges produce unactionable verdicts. Each judge should check exactly one failure mode with explicit pass/fail definitions and few-shot examples. Use write-judge-prompt.
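As an illustration only (this is not the actual output of write-judge-prompt), a single-failure-mode judge prompt might look like:

```python
# Illustrative single-failure-mode judge prompt: one criterion, explicit
# pass/fail definitions, few-shot examples. Contrast with "Is this helpful?".
JUDGE_PROMPT = """You are checking ONE thing: does the response fabricate
property features not present in the listing data?

Fail: the response mentions any feature (pool, garage, square footage, etc.)
that does not appear in the provided listing.
Pass: every feature mentioned appears in the listing.

Listing: {listing}
Response: {response}

Example 1:
Listing: "2BR condo, balcony"
Response: "This 2BR condo has a balcony and in-unit laundry."
Verdict: Fail (in-unit laundry is fabricated)

Example 2:
Listing: "3BR house, garage"
Response: "A 3BR house with a garage."
Verdict: Pass

Answer with exactly "Pass" or "Fail", then one sentence of reasoning.
"""

def render_judge_prompt(listing: str, response: str) -> str:
    """Fill the template for one trace before sending it to the judge model."""
    return JUDGE_PROMPT.format(listing=listing, response=response)
```

Because the judge checks exactly one failure mode, a Fail verdict maps directly to one fix, which is what makes the verdict actionable.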
Check: Are code-based checks used where possible?
Flag LLM judges used for objectively checkable criteria: format validation, constraint satisfaction, keyword presence, schema conformance.
Finding if over-relying on judges: Replace objective checks with code (regex, parsing, schema validation, execution tests). Reserve LLM judges for criteria requiring interpretation.
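A sketch of what such code-based checks can look like; the three criteria here are hypothetical examples of the categories listed above:

```python
import json
import re

# Objective checks implemented in code instead of an LLM judge.

def check_format(output: str) -> bool:
    """Format validation: output must be valid JSON with a 'summary' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "summary" in data

def check_no_placeholder(output: str) -> bool:
    """Constraint satisfaction: no leftover template placeholders like {name}."""
    return re.search(r"\{[a-z_]+\}", output) is None

def check_word_limit(output: str, limit: int = 100) -> bool:
    """Constraint satisfaction: stay under the word limit."""
    return len(output.split()) <= limit
```

These run in milliseconds, are perfectly reproducible, and need no validation against human labels, which is exactly why they should be preferred wherever the criterion is objective.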
Check: Are similarity metrics used as primary evaluation?
Flag ROUGE, BERTScore, cosine similarity, or embedding distance used as the main evaluator for generation quality.
Finding if present: These metrics measure surface-level overlap, not correctness. They suit retrieval ranking but not generation evaluation. Replace with binary evaluators grounded in specific failure modes.
See: LLM Evals FAQ
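A quick way to demonstrate the problem, using unigram Jaccard overlap as a simple stand-in for ROUGE-style metrics: one negation flips correctness while barely moving the score.

```python
def jaccard(a: str, b: str) -> float:
    """Unigram Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

reference = "the patient should take the medication with food"
wrong     = "the patient should not take the medication with food"
# One added token ("not") inverts the meaning, yet the overlap stays high
# (7 shared tokens out of 8), so a similarity metric calls this a near-match.
```

A binary evaluator grounded in the actual failure mode ("contradicts the reference instruction") would catch this case that surface overlap cannot.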
Check: Are LLM judges validated against human labels?
Look for: confusion matrices, TPR/TNR measurements, alignment scores. A judge running in production with no validation data is a critical finding.
Finding if unvalidated: An unvalidated judge may consistently miss failures or flag passing traces. Measure alignment using TPR and TNR on a held-out test set. Use validate-evaluator.
See: Creating an LLM Judge That Drives Business Results
Check: Is alignment measured with TPR/TNR or with raw accuracy?
Flag "accuracy", "percent agreement", or Cohen's Kappa as the primary alignment metric.
Finding if using accuracy: With class imbalance, raw accuracy is misleading: a judge that always says "Pass" gets 90% accuracy when 90% of traces pass but catches zero failures. Use TPR and TNR, which map directly to bias correction. Use validate-evaluator.
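The trap is easy to reproduce. A minimal sketch with a degenerate always-Pass judge on a 90/10 split:

```python
def alignment(human: list[bool], judge: list[bool]) -> dict:
    """Compare judge verdicts to human labels (True = Pass)."""
    tp = sum(h and j for h, j in zip(human, judge))          # Pass correctly labeled Pass
    tn = sum(not h and not j for h, j in zip(human, judge))  # Fail correctly labeled Fail
    pos, neg = sum(human), len(human) - sum(human)
    return {
        "accuracy": (tp + tn) / len(human),
        "tpr": tp / pos,  # true positive rate over genuine Passes
        "tnr": tn / neg,  # true negative rate over genuine Fails
    }

human = [True] * 90 + [False] * 10   # 90% of traces genuinely pass
judge = [True] * 100                 # degenerate judge: always says "Pass"
# alignment(human, judge) -> accuracy 0.90, tpr 1.0, tnr 0.0
```

Accuracy looks fine at 0.90, while TNR = 0.0 immediately exposes that the judge catches zero failures; reporting both rates separately is what makes the degenerate case visible.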
Check: Is there a proper train/dev/test split?
Check whether few-shot examples in judge prompts come from the same data used to measure judge performance.
Finding if leaking: Using evaluation data as few-shot examples inflates alignment scores and hides real judge failures. Split into train (few-shot source), dev (iteration), and test (final measurement). Use validate-evaluator.
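A minimal sketch of such a split (the 20/40/40 proportions are illustrative, not prescribed by the source):

```python
import random

def split_labeled_traces(traces: list, seed: int = 0) -> dict:
    """Shuffle once, then split: train feeds few-shot examples, dev is for
    prompt iteration, and test is touched only for the final alignment number."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = traces[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    return {
        "train": shuffled[: int(0.2 * n)],               # few-shot example source
        "dev":   shuffled[int(0.2 * n): int(0.6 * n)],   # iterate the judge prompt here
        "test":  shuffled[int(0.6 * n):],                # final TPR/TNR measurement only
    }
```

The one rule that matters: no trace that appears (or could appear) in the judge prompt ever contributes to the reported alignment score.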
Check: Who is reviewing traces?
Determine whether domain experts or outsourced annotators are labeling data.
Finding if outsourced without domain expertise: General annotators catch formatting errors but miss domain-specific failures (wrong medical dosage, incorrect legal citation, mismatched property features). Involve a domain expert.
See: A Field Guide to Improving AI Products
Check: Are reviewers seeing full traces or just final outputs?
Finding if output-only: Reviewing only the final output hides where the pipeline broke. Show the full trace: input, intermediate steps, tool calls, retrieved context, and final output.
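One way to picture "the full trace" as a data structure (field names are illustrative, not a schema from the source):

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """A full trace as a reviewer should see it: every stage, not just the end."""
    input: str                                                   # the user's original request
    intermediate_steps: list[str] = field(default_factory=list)  # reasoning / planning steps
    tool_calls: list[dict] = field(default_factory=list)         # name, args, and result per call
    retrieved_context: list[str] = field(default_factory=list)   # what retrieval surfaced
    final_output: str = ""                                       # the only field output-only review sees
```

If a reviewer only sees `final_output`, a failure caused by bad `retrieved_context` and a failure caused by a wrong `tool_calls` argument look identical, so the fix cannot be localized.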
Check: How is data displayed to reviewers?
Flag raw JSON, unformatted text, or spreadsheets with trace data in cells.
Finding if raw format: Reviewers spend effort parsing data instead of judging quality. Format data in its natural representation: render markdown, syntax-highlight code, display tables as tables. Use build-review-interface.
See: LLM Evals FAQ
Check: Is there enough labeled data?
For error analysis, ~100 traces is the rough target for saturation. For judge validation, ~50 Pass and ~50 Fail examples are needed for reliable TPR/TNR.
Finding if insufficient: Small datasets produce unreliable failure rates and wide confidence intervals. If labeled data is sparse, sample traces more effectively to collect more labels, or supplement with generate-synthetic-data.
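A back-of-envelope illustration of why sample size matters: the normal-approximation (Wald) 95% interval around the same observed 20% failure rate, at n=20 versus n=100.

```python
import math

def failure_rate_ci(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) confidence interval for a failure rate."""
    p = failures / n
    half = z * math.sqrt(p * (1 - p) / n)   # half-width shrinks with sqrt(n)
    return (max(0.0, p - half), min(1.0, p + half))

# Same observed 20% failure rate, very different certainty:
small = failure_rate_ci(4, 20)    # n=20  -> roughly (0.02, 0.38)
large = failure_rate_ci(20, 100)  # n=100 -> roughly (0.12, 0.28)
```

At n=20 the true failure rate could plausibly be anywhere from negligible to nearly 40%, which is why conclusions drawn from a handful of labeled traces are unreliable. (For very small n or rates near 0/1, a Wilson interval is a better choice than Wald; the point about width survives either way.)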
Check: Is error analysis re-run after significant changes?
Check when error analysis was last performed relative to model switches, prompt rewrites, new features, or production incidents.
Finding if stale: Failure modes shift after pipeline changes, and evaluators built for the old pipeline miss new failure types. Re-run error analysis after every significant change.
Check: Are evaluators maintained?
Look for periodic re-validation of judges or refreshed evaluation datasets.
Finding if set-and-forget: Evaluators degrade as the pipeline evolves. Re-validate judges against fresh human labels and update eval datasets to reflect current usage.
If the user has no eval artifacts (no traces, no evaluators, no labeled data):
- Run error-analysis on a sample of real traces, if any can be collected.
- Otherwise, use generate-synthetic-data to create test inputs, run them through the pipeline, then apply error-analysis to the resulting traces.

Present findings ordered by impact. For each:
### [Problem Title]
**Status:** [Problem exists / OK / Cannot determine]
[1-2 sentence explanation of the specific problem found]
**Fix:** [Concrete action, referencing a skill or article]
Group under the six diagnostic areas. Omit areas where no problems were found.
Weekly Installs: 144
Repository: https://github.com/hamelsmu/evals-skills
GitHub Stars: 955
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (141), gemini-cli (140), kimi-cli (140), github-copilot (140), cursor (140), amp (140)