npx skills add https://github.com/hamelsmu/evals-skills --skill error-analysis
Guide the user through reading LLM pipeline traces and building a catalog of how the system fails.
Capture the full trace: input, all intermediate LLM calls, tool uses, retrieved documents, reasoning steps, and final output.
Target: ~100 traces. This is roughly where new traces stop revealing new kinds of failures. The number depends on system complexity.
From real user data (preferred):
From synthetic data (when real data is sparse): use the generate-synthetic-data skill.
Present each trace to the user. For each one, ask: did the system produce a good result? Pass or Fail.
For failures, note what went wrong. Focus on the first thing that went wrong in the trace — errors cascade, so downstream symptoms disappear when the root cause is fixed. Don't chase every issue in a single trace.
Write observations, not explanations. "SQL missed the budget constraint" not "The model probably didn't understand the budget."
Template:
| Trace ID | Trace | What went wrong | Pass/Fail |
|----------|-------|-----------------|-----------|
| 001 | [full trace] | Missing filter: pet-friendly requirement ignored in SQL | Fail |
| 002 | [full trace] | Proposed unavailable times despite calendar conflicts | Fail |
| 003 | [full trace] | Used casual tone for luxury client; wrong property type | Fail |
| 004 | [full trace] | - | Pass |
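The annotation template above can live in a spreadsheet, but a plain CSV works just as well for downstream scripting. A minimal sketch using only the standard library (the file name and rows are illustrative, not prescribed by the skill):

```python
import csv

# Illustrative annotation rows mirroring the template columns.
rows = [
    {"trace_id": "001",
     "what_went_wrong": "Missing filter: pet-friendly requirement ignored in SQL",
     "pass_fail": "Fail"},
    {"trace_id": "004", "what_went_wrong": "", "pass_fail": "Pass"},
]

with open("annotations.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["trace_id", "what_went_wrong", "pass_fail"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

Keeping annotations in a machine-readable file makes the later labeling and failure-rate steps scriptable.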
Heuristics:
After reviewing 30-50 traces, start grouping similar notes into categories. Don't wait until all 100 are done — grouping early helps sharpen what to look for in the remaining traces. The categories will evolve. The goal is names that are specific and actionable, not perfect.
When to split vs. group:
Split these (different root causes):
Group these (same root cause):
LLM-assisted clustering (use only after the user has reviewed 30-50 traces):
Here are failure annotations from reviewing LLM pipeline traces.
Group similar failures into 5-10 distinct categories.
For each category, provide:
- A clear name
- A one-sentence definition
- Which annotations belong to it
Annotations:
[paste annotations]
Always review LLM-suggested groupings with the user. LLMs tend to cluster by surface similarity (e.g., grouping "app crashes on login" and "login is slow" together because both mention login, even though the root causes differ).
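The clustering prompt can be assembled from the annotation notes programmatically before pasting it into whatever LLM the user prefers. A sketch (the helper name is hypothetical; the prompt text mirrors the template above):

```python
def build_clustering_prompt(annotations: list[str]) -> str:
    """Fill the clustering prompt template with failure annotations."""
    header = (
        "Here are failure annotations from reviewing LLM pipeline traces.\n"
        "Group similar failures into 5-10 distinct categories.\n"
        "For each category, provide:\n"
        "- A clear name\n"
        "- A one-sentence definition\n"
        "- Which annotations belong to it\n\n"
        "Annotations:\n"
    )
    return header + "\n".join(f"- {a}" for a in annotations)

prompt = build_clustering_prompt([
    "SQL missed the budget constraint",
    "Proposed unavailable times despite calendar conflicts",
])
```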
Aim for 5-10 categories that are:
Go back through all traces and apply binary labels (pass/fail) for each failure category. Each trace gets a column per category. Use whatever tool the user prefers — spreadsheet, annotation app (see build-review-interface), or a simple script.
failure_rates = labeled_df[failure_columns].sum() / len(labeled_df)
failure_rates.sort_values(ascending=False)
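The snippet above assumes a labeled DataFrame already exists. A self-contained toy version, with illustrative category columns (assumes pandas is installed; real column names come from the user's failure taxonomy):

```python
import pandas as pd

# One row per trace; one binary column per failure category (1 = failed that way).
labeled_df = pd.DataFrame({
    "trace_id": ["001", "002", "003", "004", "005"],
    "missing_filter": [1, 1, 0, 0, 0],
    "calendar_conflict": [0, 0, 1, 0, 0],
    "wrong_tone": [0, 0, 0, 1, 0],
})
failure_columns = ["missing_filter", "calendar_conflict", "wrong_tone"]

failure_rates = labeled_df[failure_columns].sum() / len(labeled_df)
print(failure_rates.sort_values(ascending=False))
```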
The most frequent failure category is where to focus first.
Work through each category with the user in this order:
Can we just fix it? Many failures have obvious fixes that don't need an evaluator at all:
If a clear fix resolves the failure, do that first. Only consider an evaluator for failures that persist after fixing.
Is an evaluator worth the effort? Not every remaining failure needs one. Building and maintaining evaluators has real cost. Ask the user:
Reserve evaluators for failures the user will iterate on repeatedly. Start with the highest-frequency, highest-impact category.
For failures that warrant an evaluator: prefer code-based checks (regex, parsing, schema validation) for anything objective. Use write-judge-prompt only for failures that require judgment. Critical requirements (safety, compliance) may warrant an evaluator even after fixing the prompt, as a guardrail.
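For objective failures, a code-based check is often only a few lines. A sketch of a regex evaluator for the "missing budget filter" failure from the template above (the pattern and column names are illustrative assumptions, not part of the skill):

```python
import re

def sql_has_budget_filter(sql: str) -> bool:
    """Code-based evaluator: does the generated SQL constrain price/budget?

    Illustrative heuristic: look for a comparison on a price-like column.
    """
    pattern = r"\b(price|budget|rent)\b\s*(<=|<|BETWEEN)"
    return re.search(pattern, sql, re.IGNORECASE) is not None

# Usage: run the check over every trace's generated SQL.
sql_ok = "SELECT * FROM listings WHERE price <= 2000 AND pets_allowed = 1"
sql_bad = "SELECT * FROM listings WHERE pets_allowed = 1"
```

A check like this is cheap to run on every trace, which is exactly why code-based evaluators are preferred over judges for objective properties.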
Expect 2-3 rounds of reviewing and refining categories. After each round:
Stop reviewing when new traces aren't revealing new kinds of failures. Roughly: ~100 traces reviewed with no new failure types appearing in the last 20. The exact number depends on system complexity.
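This stopping rule can be checked mechanically. A sketch, assuming each reviewed trace's annotation has been mapped to a set of category names (the function and category names are illustrative):

```python
def review_is_saturated(trace_categories: list[set[str]], window: int = 20) -> bool:
    """True when the last `window` reviewed traces introduced no new failure category."""
    if len(trace_categories) <= window:
        return False  # too few traces to judge saturation
    seen_before = set().union(*trace_categories[:-window])
    recent = set().union(*trace_categories[-window:])
    return recent <= seen_before
```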
When production volume is high, use a mix:
| Strategy | When to Use | Method |
|---|---|---|
| Random | Default starting point | Sample uniformly from recent traces |
| Outlier | Surface unusual behavior | Sort by response length, latency, tool call count; review extremes |
| Failure-driven | After guardrail violations or user complaints | Prioritize flagged traces |
| Uncertainty | When automated judges exist | Focus on traces where judges disagree or have low confidence |
| Stratified | Ensure coverage across user segments | Sample within each dimension |
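A mix of these strategies can be scripted against the trace log. A sketch combining the random and outlier rows of the table (column names are illustrative; assumes pandas):

```python
import pandas as pd

def sample_traces(traces: pd.DataFrame, n_random: int = 20,
                  n_outliers: int = 5, seed: int = 0) -> pd.DataFrame:
    """Combine uniform random sampling with extremes by response length."""
    random_part = traces.sample(n=min(n_random, len(traces)), random_state=seed)
    outlier_part = traces.nlargest(n_outliers, "response_length")
    return pd.concat([random_part, outlier_part]).drop_duplicates(subset="trace_id")
```

The other strategies slot in the same way: failure-driven sampling filters on a flagged column first, and stratified sampling applies this per user segment.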
Weekly Installs: 143
Repository: https://github.com/hamelsmu/evals-skills
GitHub Stars: 955
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (140), gemini-cli (138), kimi-cli (138), github-copilot (138), cursor (138), amp (138)