npx skills add https://github.com/corygabrielsen/skills --skill postmortem
An incident happened. Understand why, document it properly, and make it harder to repeat. No hand-waving, no minimizing.
Bad postmortems dismiss ("no actual impact"), deflect ("edge case"), or rush ("wrong command, fixed it, moving on"). These teach nothing.
A good postmortem makes the reader feel the weight of what could have gone wrong, traces the failure to its root, and produces concrete changes. It's a document you'd send to your team lead without embarrassment.
Write the postmortem. Don't ask clarifying questions unless the incident is genuinely ambiguous — you usually have full context from the conversation. Output the complete document in one pass.
Use every section. No section is optional. If a section seems inapplicable, you haven't thought hard enough.
## Postmortem: <Title>
**Date**: YYYY-MM-DD
**Severity**: Critical | High | Medium | Low
**Duration**: Time from incident to resolution
**Detection**: Who/what caught it (user, CI, automated check, self)
---
### Summary
Two to three sentences. What happened, what was the impact, how
was it resolved. A stranger should understand the incident from
this paragraph alone.
### Impact
**Actual**: What damage occurred. Be honest — if none, say none
and explain why (e.g., "user caught it before execution").
**Potential**: What would have happened if undetected. This is the
important part. Trace the counterfactual — how far would the
damage have propagated? What downstream decisions would have been
corrupted? How long until someone noticed?
### Timeline
Chronological sequence of events. Include timestamps or relative
ordering. Start from the triggering instruction, end at resolution.
| Time/Order | Event |
| ---------- | -------------------- |
| T+0 | User instructs X |
| T+1 | Agent does Y instead |
| T+2 | User catches error |
| T+3 | Corrected to Z |
### Root Cause
The deepest "why" you can reach. Not "I ran the wrong command" —
why did you run the wrong command? Pattern matching? Assumption?
Fatigue? Familiarity bias? Pressure to recover from a prior
mistake?
Use the 5 Whys if helpful:
1. Why did tests run against the wrong version? Because the environment wasn't rebuilt after code changes.
2. Why wasn't it rebuilt? Because I chose the fast-restart command over the full-rebuild command.
3. Why didn't I check whether a rebuild was needed? Because I had no pre-flight checklist for build commands.
### Contributing Factors
Other conditions that made the failure more likely or more
dangerous. These aren't the root cause but they shaped the
incident. Examples:
- Session fatigue from prior errors
- No automated guard against stale binaries
- Time pressure (real or perceived)
- Ambiguity in the instruction (only if genuine)
### Lessons
What this incident teaches. Not platitudes ("be more careful")
— specific, falsifiable insights.
Bad: "I should read instructions more carefully."
Good: "After committing code changes, the environment must be
rebuilt before testing. A fast-restart command is never correct
when source has changed — it tests against the old build."
### Action Items
Concrete changes. Each item should be specific enough that you
could verify whether it was done.
- [ ] Before any restart/rebuild command, check: has source
changed since the last build? If yes, use the full rebuild.
- [ ] Learn the project's build commands and when each applies.
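The first action item's check can be sketched as a small shell guard. This is a minimal sketch, not part of the skill itself: the `needs_rebuild` function, the `src/` directory, and the artifact path are all hypothetical placeholders to adapt to your project's layout.

```shell
#!/bin/sh
# Pre-flight check: is a full rebuild needed before a fast restart?
# Returns success (0) when any file under the source dir is newer than
# the build artifact, or when the artifact is missing entirely.
# All paths here are hypothetical placeholders.
needs_rebuild() {
    src_dir="$1"
    artifact="$2"
    # No artifact at all: a full build is definitely needed.
    [ -e "$artifact" ] || return 0
    # Any source file modified after the artifact means the build is stale.
    [ -n "$(find "$src_dir" -type f -newer "$artifact" | head -n 1)" ]
}

# Demo in a throwaway directory so the sketch runs as-is.
demo=$(mktemp -d)
mkdir -p "$demo/src"
touch "$demo/build_app"      # pretend this is the last build
sleep 1                      # ensure a distinct mtime on coarse filesystems
touch "$demo/src/main.c"     # source edited after the build

if needs_rebuild "$demo/src" "$demo/build_app"; then
    echo "stale: use the full rebuild"
else
    echo "up to date: fast restart is safe"
fi
rm -rf "$demo"
```

Wiring a guard like this in front of the restart command turns the checklist item into something automated rather than remembered, which is the point of the action item.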
| Severity | Criteria |
|---|---|
| Critical | Data loss, published incorrect content, irreversible action taken |
| High | Silent correctness risk, stale data, action in wrong scope |
| Medium | Wrong output caught before use, wasted significant time |
| Low | Wrong command self-corrected, cosmetic error |
Severity is based on potential damage, not actual. A test that passed against stale code is still High — the failure mode is silent and the results would have been trusted.
Who caught it matters. If the user caught it, say so — that's a failure of your own validation. If CI caught it, the system worked. If you caught it yourself before any effect, note that too. The goal is honest accounting of where the safety net was.
These phrases are banned in postmortems:
Every one of these is a signal that the postmortem is being written to close a ticket, not to learn from a failure.
When multiple failures occur in one session, each gets its own postmortem unless they share a root cause. If they share a root cause, write one postmortem that covers all incidents and explicitly names the shared root cause.
Look for escalation patterns: did the first failure create pressure that caused the second? If so, the escalation itself is a finding worth documenting.
The postmortem's length should match the incident's severity and instructional value. A Critical incident with a novel failure mode deserves a full writeup. A Low severity typo that was self-corrected needs a few sentences, not a page.
But when in doubt, err on the side of thoroughness. A postmortem that's too detailed teaches something. A postmortem that's too brief teaches nothing.