npx skills add https://github.com/corygabrielsen/skills --skill postmortem
An incident happened. Understand why, document it properly, and make it harder to repeat. No hand-waving, no minimizing.
Bad postmortems dismiss ("no actual impact"), deflect ("edge case"), or rush ("wrong command, fixed it, moving on"). These teach nothing.
A good postmortem makes the reader feel the weight of what could have gone wrong, traces the failure to its root, and produces concrete changes. It's a document you'd send to your team lead without embarrassment.
Write the postmortem. Don't ask clarifying questions unless the incident is genuinely ambiguous — you usually have full context from the conversation. Output the complete document in one pass.
Use every section. No section is optional. If a section seems inapplicable, you haven't thought hard enough.
## Postmortem: <Title>
**Date**: YYYY-MM-DD
**Severity**: Critical | High | Medium | Low
**Duration**: Time from incident to resolution
**Detection**: Who/what caught it (user, CI, automated check, self)
---
### Summary
Two to three sentences. What happened, what was the impact, how
was it resolved. A stranger should understand the incident from
this paragraph alone.
### Impact
**Actual**: What damage occurred. Be honest — if none, say none
and explain why (e.g., "user caught it before execution").
**Potential**: What would have happened if undetected. This is the
important part. Trace the counterfactual — how far would the
damage have propagated? What downstream decisions would have been
corrupted? How long until someone noticed?
### Timeline
Chronological sequence of events. Include timestamps or relative
ordering. Start from the triggering instruction, end at resolution.
| Time/Order | Event |
| ---------- | -------------------- |
| T+0 | User instructs X |
| T+1 | Agent does Y instead |
| T+2 | User catches error |
| T+3 | Corrected to Z |
### Root Cause
The deepest "why" you can reach. Not "I ran the wrong command" —
why did you run the wrong command? Pattern matching? Assumption?
Fatigue? Familiarity bias? Pressure to recover from a prior
mistake?
Use the 5 Whys if helpful:
1. Why did tests run against the wrong version? Because the environment wasn't rebuilt after code changes.
2. Why wasn't it rebuilt? Because I chose the fast-restart command over the full-rebuild command.
3. Why didn't I check whether a rebuild was needed? Because I had no pre-flight checklist for build commands.
### Contributing Factors
Other conditions that made the failure more likely or more
dangerous. These aren't the root cause but they shaped the
incident. Examples:
- Session fatigue from prior errors
- No automated guard against stale binaries
- Time pressure (real or perceived)
- Ambiguity in the instruction (only if genuine)
### Lessons
What this incident teaches. Not platitudes ("be more careful")
— specific, falsifiable insights.
Bad: "I should read instructions more carefully."
Good: "After committing code changes, the environment must be
rebuilt before testing. A fast-restart command is never correct
when source has changed — it tests against the old build."
### Action Items
Concrete changes. Each item should be specific enough that you
could verify whether it was done.
- [ ] Before any restart/rebuild command, check: has source
changed since the last build? If yes, use the full rebuild.
- [ ] Learn the project's build commands and when each applies.
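The first action item's check can be sketched as a small shell guard. This is a minimal sketch, not part of the skill itself: the `needs_rebuild` function, the `src/` directory, and the artifact path are all hypothetical placeholders to adapt to your project's layout.

```shell
#!/bin/sh
# Pre-flight check: is a full rebuild needed before a fast restart?
# Returns success (0) when any file under the source dir is newer than
# the build artifact, or when the artifact is missing entirely.
# All paths here are hypothetical placeholders.
needs_rebuild() {
    src_dir="$1"
    artifact="$2"
    # No artifact at all: a full build is definitely needed.
    [ -e "$artifact" ] || return 0
    # Any source file modified after the artifact means the build is stale.
    [ -n "$(find "$src_dir" -type f -newer "$artifact" | head -n 1)" ]
}

# Demo in a throwaway directory so the sketch runs as-is.
demo=$(mktemp -d)
mkdir -p "$demo/src"
touch "$demo/build_app"      # pretend this is the last build
sleep 1                      # ensure a distinct mtime on coarse filesystems
touch "$demo/src/main.c"     # source edited after the build

if needs_rebuild "$demo/src" "$demo/build_app"; then
    echo "stale: use the full rebuild"
else
    echo "up to date: fast restart is safe"
fi
rm -rf "$demo"
```

Wiring a guard like this in front of the restart command turns the checklist item into something automated rather than remembered, which is the point of the action item.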
| Severity | Criteria |
|---|---|
| Critical | Data loss, published incorrect content, irreversible action taken |
| High | Silent correctness risk, stale data, action in wrong scope |
| Medium | Wrong output caught before use, wasted significant time |
| Low | Wrong command self-corrected, cosmetic error |
Severity is based on potential damage, not actual. A test that passed against stale code is still High — the failure mode is silent and the results would have been trusted.
Who caught it matters. If the user caught it, say so — that's a failure of your own validation. If CI caught it, the system worked. If you caught it yourself before any effect, note that too. The goal is honest accounting of where the safety net was.
These phrases are banned in postmortems:
Every one of these is a signal that the postmortem is being written to close a ticket, not to learn from a failure.
When multiple failures occur in one session, each gets its own postmortem unless they share a root cause. If they share a root cause, write one postmortem that covers all incidents and explicitly names the shared root cause.
Look for escalation patterns: did the first failure create pressure that caused the second? If so, the escalation itself is a finding worth documenting.
The postmortem's length should match the incident's severity and instructional value. A Critical incident with a novel failure mode deserves a full writeup. A Low severity typo that was self-corrected needs a few sentences, not a page.
But when in doubt, err on the side of thoroughness. A postmortem that's too detailed teaches something. A postmortem that's too brief teaches nothing.