# postmortem-writing by wshobson/agents

Install: `npx skills add https://github.com/wshobson/agents --skill postmortem-writing`
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
- **Day 0**: Incident occurs
- **Day 1-2**: Draft postmortem document
- **Day 3-5**: Postmortem meeting
- **Day 5-7**: Finalize document, create tickets
- **Week 2+**: Action item completion
- **Quarterly**: Review patterns across incidents
# Postmortem: [Incident Title]
**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes
## Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion triggered by a code change shipped in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing the connection pool limits.
**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
## Timeline (All times UTC)
| Time | Event |
| ----- | ----------------------------------------------- |
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision made to roll back the deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
## Root Cause Analysis
### What Happened
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently called endpoint. Each request opened a new database connection instead of reusing a pooled one.
### Why It Happened
1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
2. **Contributing Factors**:
- Code review did not catch the connection handling change
- No integration tests specifically for connection pool behavior
- Staging environment has lower traffic, masking the issue
- Database connection metrics alert threshold was too high (90%)
3. **5 Whys Analysis**:
- Why did the service fail? → Database connections were exhausted
- Why were connections exhausted? → Each request opened a new connection
- Why did each request open a new connection? → The code bypassed the connection pool
- Why did the code bypass the connection pool? → The developer was unfamiliar with codebase patterns
- Why was the developer unfamiliar? → No documentation on connection management patterns
### System Diagram
[Client] → [Load Balancer] → [Payment Service] → [Database]

- Connection pool: broken (bypassed by the new code)
- Direct per-request connections: the cause of exhaustion
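The bypass described above can be illustrated with a toy simulation (Python for brevity; the incident code was Java). The 100-connection cap mirrors the 100/100 seen in metrics; the request count and pool size are made up:

```python
MAX_DB_CONNECTIONS = 100  # mirrors the 100/100 cap seen in metrics

def serve_requests(n_requests, pooled, pool_size=10):
    """Toy model: return how many requests fail from connection exhaustion.

    Pooled clients reuse a fixed set of connections; the buggy pattern
    opens (and effectively leaks) one connection per request.
    """
    open_connections = pool_size if pooled else 0
    failures = 0
    for _ in range(n_requests):
        if pooled:
            continue  # a pooled connection is checked out and returned
        if open_connections >= MAX_DB_CONNECTIONS:
            failures += 1  # database refuses the new connection
        else:
            open_connections += 1  # new direct connection, never returned
    return failures

assert serve_requests(500, pooled=True) == 0     # pool absorbs the load
assert serve_requests(500, pooled=False) == 400  # everything past the cap fails
```

The numbers line up with the incident shape: once direct connections hit the database's cap, every further request fails, which is exactly the 100/100-with-500-pending picture the metrics showed.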
## Detection
### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2-minute acknowledgment)
### What Didn't Work
- Database connection metric alert threshold was too high
- No deployment-correlated alerting
- A canary deployment would have caught this earlier
### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
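The gap suggests an obvious automation: when an alert fires, check whether a recent deploy falls inside a correlation window and attach it to the page. A minimal sketch, where the window and function name are illustrative rather than an existing tool:

```python
from datetime import datetime, timedelta

# Illustrative correlation window; not a value from the incident.
DEPLOY_CORRELATION_WINDOW = timedelta(minutes=15)

def correlate_alert_with_deploy(alert_time, deploy_times):
    """Return the most recent deploy preceding the alert within the
    window, or None. Attaching this context to the page ("error rate
    breached 8 min after deploy") removes the manual history check."""
    candidates = [d for d in deploy_times
                  if timedelta(0) <= alert_time - d <= DEPLOY_CORRELATION_WINDOW]
    return max(candidates, default=None)

# The incident's own timestamps: deploy at 14:23, first alert at 14:31.
deploy = datetime(2024, 1, 15, 14, 23)
alert = datetime(2024, 1, 15, 14, 31)
assert correlate_alert_with_deploy(alert, [deploy]) == deploy
```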
## Response
### What Worked
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel
### What Could Be Improved
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)
## Impact
### Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (~7% of affected users)
- Customer satisfaction score dropped 12 points
### Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours
### Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems
## Lessons Learned
### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely
### What Went Wrong
1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly
### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously
## Action Items
| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
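The first action item (ENG-1234) could take the shape below: drive the repository through a counting fake in place of the real pool and assert that connections are reused. `CountingDataSource` and `process_payment` are hypothetical stand-ins, not the actual codebase, and the sketch is Python rather than the service's Java:

```python
class CountingDataSource:
    """Counting fake standing in for a pooled DataSource."""
    def __init__(self):
        self.physical_connections = 0  # connections actually opened
        self._pool = []

    def get_connection(self):
        if self._pool:
            return self._pool.pop()    # reuse a pooled connection
        self.physical_connections += 1
        return object()                # stand-in for a real connection

    def release(self, conn):
        self._pool.append(conn)        # return the connection to the pool

def process_payment(ds):
    """Hypothetical repository call: acquire, query, release."""
    conn = ds.get_connection()
    try:
        pass  # the query would run here
    finally:
        ds.release(conn)

# A pooled repository opens one physical connection for 50 sequential
# calls; the regressed code would have opened 50.
ds = CountingDataSource()
for _ in range(50):
    process_payment(ds)
assert ds.physical_connections == 1
```

The point of the counting fake is that the test fails loudly if a refactor swaps pooled acquisition for direct connections, which is exactly the change code review missed.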
## Appendix
### Supporting Data
#### Error Rate Graph
[Link to Grafana dashboard snapshot]
#### Database Connection Graph
[Link to metrics]
### Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
### References
- [Connection Pool Best Practices](internal-wiki/connection-pools)
- [Deployment Runbook](internal-wiki/deployment-runbook)
# 5 Whys Analysis: [Incident]
## Problem Statement
The payment service experienced a 47-minute outage due to database connection exhaustion.
## Analysis
### Why #1: Why did the service fail?
**Answer**: Database connections were exhausted, causing all new requests to fail.
**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
---
### Why #2: Why were database connections exhausted?
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
---
### Why #3: Why did the code bypass the connection pool?
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
---
### Why #4: Why wasn't this caught in code review?
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
**Evidence**: Review comments only discuss business logic.
---
### Why #5: Why isn't there a safety net for this type of change?
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
## Root Causes Identified
1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations
## Systemic Improvements
| Root Cause | Improvement | Type |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
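The "no canary" row can be made concrete: a canary gate compares the canary's error rate against the baseline before promoting a release. A minimal sketch in which the function name and tolerance are illustrative:

```python
def canary_healthy(canary_error_rate, baseline_error_rate, tolerance=0.02):
    """Promotion gate: the canary may not exceed the baseline error
    rate by more than `tolerance` (absolute). True means promote."""
    return canary_error_rate - baseline_error_rate <= tolerance

# With the incident's numbers, a canary slice running at a 23% error
# rate against a ~1% baseline fails the gate, halting the rollout
# before it reaches all customers.
assert canary_healthy(0.23, 0.01) is False   # v2.3.4 would not promote
assert canary_healthy(0.012, 0.01) is True   # normal jitter passes
```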
# Quick Postmortem: [Brief Title]
**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
## What Happened
API latency spiked to 5 seconds due to a cache-miss storm after a full cache flush.
## Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized
## Root Cause
A full cache flush for a minor config update caused a thundering herd of misses against the backing store.
## Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)
## Lessons
Don't fully flush the cache in production; use targeted invalidation.
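The lesson can be demonstrated with a toy read-through cache: after a full flush every read misses and hammers the backing store (the herd), while targeted invalidation re-fetches only the changed key. An illustrative sketch of the ENG-999 direction, not the actual service code:

```python
class Cache:
    """Toy read-through cache; every miss is a hit on the backing store."""
    def __init__(self, backing):
        self.backing = backing  # e.g. the config database
        self.store = {}
        self.misses = 0

    def get(self, key):
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.backing[key]
        return self.store[key]

    def flush(self):            # what the incident did
        self.store.clear()

    def invalidate(self, key):  # targeted alternative (the ENG-999 direction)
        self.store.pop(key, None)

backing = {f"cfg{i}": i for i in range(100)}
cache = Cache(backing)
for k in backing:               # warm the cache
    cache.get(k)

cache.misses = 0
cache.invalidate("cfg0")        # targeted: one miss on the next read pass
for k in backing:
    cache.get(k)
assert cache.misses == 1

cache.misses = 0
cache.flush()                   # full flush: the whole herd misses at once
for k in backing:
    cache.get(k)
assert cache.misses == 100
```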
## Meeting Structure (60 minutes)
### 1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms
### 2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline
### 3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?
### 4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates
### 5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed
## Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |