npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit
Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.
Access to eval artifacts (traces, evaluator configs, judge prompts, labeled data) via an observability MCP server or local files. If none exist, skip to "No Eval Infrastructure."
Check whether the user has an observability MCP server connected (Phoenix, Braintrust, LangSmith, Truesight or similar). If available, use it to pull traces, evaluator definitions, and experiment results. If not, ask for local files: CSVs, JSON trace exports, notebooks, or evaluation scripts.
Work through each area below. Inspect available artifacts, determine whether the problem exists, and record a finding if it does.
Prioritize findings by impact on the user's product. Present the most impactful findings first.
Check: Has the user done systematic error analysis on real or synthetic traces?
Look for: labeled trace datasets, failure category definitions, notes from trace review. If evaluators exist but no documented failure categories, error analysis was likely skipped.
Finding if missing: Evaluators built without error analysis measure generic qualities ("helpfulness", "coherence") instead of actual failure modes. Start with error-analysis; if no traces exist, run generate-synthetic-data first.
See: Your AI Product Needs Evals, LLM Evals FAQ
Check: Were failure categories brainstormed or observed?
Generic labels borrowed from research ("hallucination score", "toxicity", "coherence") suggest brainstorming. Application-grounded categories ("missing query constraints", "wrong client tone", "fabricated property features") suggest observation.
Finding if brainstormed: Generic categories miss application-specific failures and produce evaluators that score well on paper but miss real problems. Re-do with error-analysis, starting from traces.
See: Who Validates the Validators?
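To make the contrast concrete, here is a sketch of what an observed taxonomy can look like when written down next to the labeling code. The category names and the assistant domain (real estate) are hypothetical, echoing the examples above:

```python
# Hypothetical failure taxonomy for a real-estate assistant, derived from
# trace review rather than brainstormed. Each category has a concrete,
# checkable definition so annotators apply it consistently.
FAILURE_CATEGORIES = {
    "missing_query_constraints": "Response ignores a constraint the user stated (budget, location, bedrooms).",
    "wrong_client_tone": "Response uses a tone inappropriate for the client (e.g. casual with a commercial buyer).",
    "fabricated_property_features": "Response mentions a feature not present in the retrieved listing data.",
}

def label_trace(trace_id: str, category: str, note: str = "") -> dict:
    """Record a single annotation, rejecting categories outside the taxonomy."""
    if category not in FAILURE_CATEGORIES:
        raise ValueError(f"Unknown category: {category}")
    return {"trace_id": trace_id, "category": category, "note": note}
```

Pinning the taxonomy in code (instead of free-text labels) is what keeps later failure-rate counts comparable across annotators.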
Check: Are evaluators binary pass/fail?
Flag any that use Likert scales (1-5), letter grades (A-F), or numeric scores without a clear pass/fail threshold.
Finding if not binary: Likert scales are difficult to calibrate. Annotators disagree on the difference between a 3 and a 4, and judges inherit that noise. Consider converting to binary pass/fail with explicit definitions using write-judge-prompt.
See: Creating an LLM Judge That Drives Business Results
Check: Do LLM judge prompts target specific failure modes?
Flag any that evaluate holistically ("Is this response helpful?", "Rate the quality of this output").
Finding if vague: Holistic judges produce unactionable verdicts. Each judge should check exactly one failure mode with explicit pass/fail definitions and few-shot examples. Use write-judge-prompt.
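As an illustration only (this is not the actual output of write-judge-prompt), a single-failure-mode judge prompt might look like:

```python
# Illustrative single-failure-mode judge prompt: one criterion, explicit
# pass/fail definitions, few-shot examples. Contrast with "Is this helpful?".
JUDGE_PROMPT = """You are checking ONE thing: does the response fabricate
property features not present in the listing data?

Fail: the response mentions any feature (pool, garage, square footage, etc.)
that does not appear in the provided listing.
Pass: every feature mentioned appears in the listing.

Listing: {listing}
Response: {response}

Example 1:
Listing: "2BR condo, balcony"
Response: "This 2BR condo has a balcony and in-unit laundry."
Verdict: Fail (in-unit laundry is fabricated)

Example 2:
Listing: "3BR house, garage"
Response: "A 3BR house with a garage."
Verdict: Pass

Answer with exactly "Pass" or "Fail", then one sentence of reasoning.
"""

def render_judge_prompt(listing: str, response: str) -> str:
    """Fill the template for one trace before sending it to the judge model."""
    return JUDGE_PROMPT.format(listing=listing, response=response)
```

Because the judge checks exactly one failure mode, a Fail verdict maps directly to one fix, which is what makes the verdict actionable.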
Check: Are code-based checks used where possible?
Flag LLM judges used for objectively checkable criteria: format validation, constraint satisfaction, keyword presence, schema conformance.
Finding if over-relying on judges: Replace objective checks with code (regex, parsing, schema validation, execution tests). Reserve LLM judges for criteria requiring interpretation.
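A sketch of what such code-based checks can look like; the three criteria here are hypothetical examples of the categories listed above:

```python
import json
import re

# Objective checks implemented in code instead of an LLM judge.

def check_format(output: str) -> bool:
    """Format validation: output must be valid JSON with a 'summary' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "summary" in data

def check_no_placeholder(output: str) -> bool:
    """Constraint satisfaction: no leftover template placeholders like {name}."""
    return re.search(r"\{[a-z_]+\}", output) is None

def check_word_limit(output: str, limit: int = 100) -> bool:
    """Constraint satisfaction: stay under the word limit."""
    return len(output.split()) <= limit
```

These run in milliseconds, are perfectly reproducible, and need no validation against human labels, which is exactly why they should be preferred wherever the criterion is objective.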
Check: Are similarity metrics used as primary evaluation?
Flag ROUGE, BERTScore, cosine similarity, or embedding distance used as the main evaluator for generation quality.
Finding if present: These metrics measure surface-level overlap, not correctness. They suit retrieval ranking but not generation evaluation. Replace with binary evaluators grounded in specific failure modes.
See: LLM Evals FAQ
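A quick way to demonstrate the problem, using unigram Jaccard overlap as a simple stand-in for ROUGE-style metrics: one negation flips correctness while barely moving the score.

```python
def jaccard(a: str, b: str) -> float:
    """Unigram Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

reference = "the patient should take the medication with food"
wrong     = "the patient should not take the medication with food"
# One added token ("not") inverts the meaning, yet the overlap stays high
# (7 shared tokens out of 8), so a similarity metric calls this a near-match.
```

A binary evaluator grounded in the actual failure mode ("contradicts the reference instruction") would catch this case that surface overlap cannot.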
Check: Are LLM judges validated against human labels?
Look for: confusion matrices, TPR/TNR measurements, alignment scores. A judge running in production with no validation data is a critical finding.
Finding if unvalidated: An unvalidated judge may consistently miss failures or flag passing traces. Measure alignment using TPR and TNR on a held-out test set. Use validate-evaluator.
See: Creating an LLM Judge That Drives Business Results
Check: Is alignment measured with TPR/TNR or with raw accuracy?
Flag "accuracy", "percent agreement", or Cohen's Kappa as the primary alignment metric.
Finding if using accuracy: With class imbalance, raw accuracy is misleading: a judge that always says "Pass" gets 90% accuracy when 90% of traces pass but catches zero failures. Use TPR and TNR, which map directly to bias correction. Use validate-evaluator.
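The trap is easy to reproduce. A minimal sketch with a degenerate always-Pass judge on a 90/10 split:

```python
def alignment(human: list[bool], judge: list[bool]) -> dict:
    """Compare judge verdicts to human labels (True = Pass)."""
    tp = sum(h and j for h, j in zip(human, judge))          # Pass correctly labeled Pass
    tn = sum(not h and not j for h, j in zip(human, judge))  # Fail correctly labeled Fail
    pos, neg = sum(human), len(human) - sum(human)
    return {
        "accuracy": (tp + tn) / len(human),
        "tpr": tp / pos,  # true positive rate over genuine Passes
        "tnr": tn / neg,  # true negative rate over genuine Fails
    }

human = [True] * 90 + [False] * 10   # 90% of traces genuinely pass
judge = [True] * 100                 # degenerate judge: always says "Pass"
# alignment(human, judge) -> accuracy 0.90, tpr 1.0, tnr 0.0
```

Accuracy looks fine at 0.90, while TNR = 0.0 immediately exposes that the judge catches zero failures; reporting both rates separately is what makes the degenerate case visible.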
Check: Is there a proper train/dev/test split?
Check whether few-shot examples in judge prompts come from the same data used to measure judge performance.
Finding if leaking: Using evaluation data as few-shot examples inflates alignment scores and hides real judge failures. Split into train (few-shot source), dev (iteration), and test (final measurement). Use validate-evaluator.
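A minimal sketch of such a split (the 20/40/40 proportions are illustrative, not prescribed by the source):

```python
import random

def split_labeled_traces(traces: list, seed: int = 0) -> dict:
    """Shuffle once, then split: train feeds few-shot examples, dev is for
    prompt iteration, and test is touched only for the final alignment number."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = traces[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    return {
        "train": shuffled[: int(0.2 * n)],               # few-shot example source
        "dev":   shuffled[int(0.2 * n): int(0.6 * n)],   # iterate the judge prompt here
        "test":  shuffled[int(0.6 * n):],                # final TPR/TNR measurement only
    }
```

The one rule that matters: no trace that appears (or could appear) in the judge prompt ever contributes to the reported alignment score.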
Check: Who is reviewing traces?
Determine whether domain experts or outsourced annotators are labeling data.
Finding if outsourced without domain expertise: General annotators catch formatting errors but miss domain-specific failures (wrong medical dosage, incorrect legal citation, mismatched property features). Involve a domain expert.
See: A Field Guide to Improving AI Products
Check: Are reviewers seeing full traces or just final outputs?
Finding if output-only: Reviewing only the final output hides where the pipeline broke. Show the full trace: input, intermediate steps, tool calls, retrieved context, and final output.
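One way to picture "the full trace" as a data structure (field names are illustrative, not a schema from the source):

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """A full trace as a reviewer should see it: every stage, not just the end."""
    input: str                                                   # the user's original request
    intermediate_steps: list[str] = field(default_factory=list)  # reasoning / planning steps
    tool_calls: list[dict] = field(default_factory=list)         # name, args, and result per call
    retrieved_context: list[str] = field(default_factory=list)   # what retrieval surfaced
    final_output: str = ""                                       # the only field output-only review sees
```

If a reviewer only sees `final_output`, a failure caused by bad `retrieved_context` and a failure caused by a wrong `tool_calls` argument look identical, so the fix cannot be localized.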
Check: How is data displayed to reviewers?
Flag raw JSON, unformatted text, or spreadsheets with trace data in cells.
Finding if raw format: Reviewers spend effort parsing data instead of judging quality. Format data in its natural representation: render markdown, syntax-highlight code, display tables as tables. Use build-review-interface.
See: LLM Evals FAQ
Check: Is there enough labeled data?
For error analysis, ~100 traces is the rough target for saturation. For judge validation, ~50 Pass and ~50 Fail examples are needed for reliable TPR/TNR.
Finding if insufficient: Small datasets produce unreliable failure rates and wide confidence intervals. If labeled data is sparse, sample traces more effectively to collect more labels, or supplement with generate-synthetic-data.
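A back-of-envelope illustration of why sample size matters: the normal-approximation (Wald) 95% interval around the same observed 20% failure rate, at n=20 versus n=100.

```python
import math

def failure_rate_ci(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) confidence interval for a failure rate."""
    p = failures / n
    half = z * math.sqrt(p * (1 - p) / n)   # half-width shrinks with sqrt(n)
    return (max(0.0, p - half), min(1.0, p + half))

# Same observed 20% failure rate, very different certainty:
small = failure_rate_ci(4, 20)    # n=20  -> roughly (0.02, 0.38)
large = failure_rate_ci(20, 100)  # n=100 -> roughly (0.12, 0.28)
```

At n=20 the true failure rate could plausibly be anywhere from negligible to nearly 40%, which is why conclusions drawn from a handful of labeled traces are unreliable. (For very small n or rates near 0/1, a Wilson interval is a better choice than Wald; the point about width survives either way.)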
Check: Is error analysis re-run after significant changes?
Check when error analysis was last performed relative to model switches, prompt rewrites, new features, or production incidents.
Finding if stale: Failure modes shift after pipeline changes, and evaluators built for the old pipeline miss new failure types. Re-run error analysis after every significant change.
Check: Are evaluators maintained?
Look for periodic re-validation of judges or refreshed evaluation datasets.
Finding if set-and-forget: Evaluators degrade as the pipeline evolves. Re-validate judges against fresh human labels and update eval datasets to reflect current usage.
If the user has no eval artifacts (no traces, no evaluators, no labeled data):
- Run error-analysis on a sample of real traces, if any can be collected.
- Otherwise, use generate-synthetic-data to create test inputs, run them through the pipeline, then apply error-analysis to the resulting traces.

Present findings ordered by impact. For each:
### [Problem Title]
**Status:** [Problem exists / OK / Cannot determine]
[1-2 sentence explanation of the specific problem found]
**Fix:** [Concrete action, referencing a skill or article]
Group under the six diagnostic areas. Omit areas where no problems were found.
Weekly Installs: 144
Repository: https://github.com/hamelsmu/evals-skills
GitHub Stars: 955
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (141), gemini-cli (140), kimi-cli (140), github-copilot (140), cursor (140), amp (140)