# postmortem-writing by wshobson/agents

Install: `npx skills add https://github.com/wshobson/agents --skill postmortem-writing`
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
- **Day 0**: Incident occurs
- **Day 1-2**: Draft postmortem document
- **Day 3-5**: Postmortem meeting
- **Day 5-7**: Finalize document, create tickets
- **Week 2+**: Action item completion
- **Quarterly**: Review patterns across incidents
# Postmortem: [Incident Title]
**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes
## Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion triggered by a code change shipped in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing the connection pool limits.
**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
## Timeline (All times UTC)
| Time | Event |
| ----- | ----------------------------------------------- |
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision made to roll back the deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
## Root Cause Analysis
### What Happened
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently called endpoint. Each request opened a new database connection instead of reusing a pooled one.
### Why It Happened
1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
2. **Contributing Factors**:
- Code review did not catch the connection handling change
- No integration tests specifically for connection pool behavior
- Staging environment has lower traffic, masking the issue
- Database connection metrics alert threshold was too high (90%)
3. **5 Whys Analysis**:
- Why did the service fail? → Database connections were exhausted
- Why were connections exhausted? → Each request opened a new connection
- Why did each request open a new connection? → The code bypassed the connection pool
- Why did the code bypass the connection pool? → The developer was unfamiliar with codebase patterns
- Why was the developer unfamiliar? → No documentation on connection management patterns
### System Diagram
[Client] → [Load Balancer] → [Payment Service] → [Database]

- Connection pool: broken (bypassed by the new code)
- Direct per-request connections: the cause of exhaustion
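The bypass described above can be illustrated with a toy simulation (Python for brevity; the incident code was Java). The 100-connection cap mirrors the 100/100 seen in metrics; the request count and pool size are made up:

```python
MAX_DB_CONNECTIONS = 100  # mirrors the 100/100 cap seen in metrics

def serve_requests(n_requests, pooled, pool_size=10):
    """Toy model: return how many requests fail from connection exhaustion.

    Pooled clients reuse a fixed set of connections; the buggy pattern
    opens (and effectively leaks) one connection per request.
    """
    open_connections = pool_size if pooled else 0
    failures = 0
    for _ in range(n_requests):
        if pooled:
            continue  # a pooled connection is checked out and returned
        if open_connections >= MAX_DB_CONNECTIONS:
            failures += 1  # database refuses the new connection
        else:
            open_connections += 1  # new direct connection, never returned
    return failures

assert serve_requests(500, pooled=True) == 0     # pool absorbs the load
assert serve_requests(500, pooled=False) == 400  # everything past the cap fails
```

The numbers line up with the incident shape: once direct connections hit the database's cap, every further request fails, which is exactly the 100/100-with-500-pending picture the metrics showed.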
## Detection
### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2-minute acknowledgment)
### What Didn't Work
- Database connection metric alert threshold was too high
- No deployment-correlated alerting
- A canary deployment would have caught this earlier
### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
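The gap suggests an obvious automation: when an alert fires, check whether a recent deploy falls inside a correlation window and attach it to the page. A minimal sketch, where the window and function name are illustrative rather than an existing tool:

```python
from datetime import datetime, timedelta

# Illustrative correlation window; not a value from the incident.
DEPLOY_CORRELATION_WINDOW = timedelta(minutes=15)

def correlate_alert_with_deploy(alert_time, deploy_times):
    """Return the most recent deploy preceding the alert within the
    window, or None. Attaching this context to the page ("error rate
    breached 8 min after deploy") removes the manual history check."""
    candidates = [d for d in deploy_times
                  if timedelta(0) <= alert_time - d <= DEPLOY_CORRELATION_WINDOW]
    return max(candidates, default=None)

# The incident's own timestamps: deploy at 14:23, first alert at 14:31.
deploy = datetime(2024, 1, 15, 14, 23)
alert = datetime(2024, 1, 15, 14, 31)
assert correlate_alert_with_deploy(alert, [deploy]) == deploy
```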
## Response
### What Worked
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel
### What Could Be Improved
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)
## Impact
### Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (~7% of affected users)
- Customer satisfaction score dropped 12 points
### Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours
### Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems
## Lessons Learned
### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely
### What Went Wrong
1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly
### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously
## Action Items
| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
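The first action item (ENG-1234) could take the shape below: drive the repository through a counting fake in place of the real pool and assert that connections are reused. `CountingDataSource` and `process_payment` are hypothetical stand-ins, not the actual codebase, and the sketch is Python rather than the service's Java:

```python
class CountingDataSource:
    """Counting fake standing in for a pooled DataSource."""
    def __init__(self):
        self.physical_connections = 0  # connections actually opened
        self._pool = []

    def get_connection(self):
        if self._pool:
            return self._pool.pop()    # reuse a pooled connection
        self.physical_connections += 1
        return object()                # stand-in for a real connection

    def release(self, conn):
        self._pool.append(conn)        # return the connection to the pool

def process_payment(ds):
    """Hypothetical repository call: acquire, query, release."""
    conn = ds.get_connection()
    try:
        pass  # the query would run here
    finally:
        ds.release(conn)

# A pooled repository opens one physical connection for 50 sequential
# calls; the regressed code would have opened 50.
ds = CountingDataSource()
for _ in range(50):
    process_payment(ds)
assert ds.physical_connections == 1
```

The point of the counting fake is that the test fails loudly if a refactor swaps pooled acquisition for direct connections, which is exactly the change code review missed.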
## Appendix
### Supporting Data
#### Error Rate Graph
[Link to Grafana dashboard snapshot]
#### Database Connection Graph
[Link to metrics]
### Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
### References
- [Connection Pool Best Practices](internal-wiki/connection-pools)
- [Deployment Runbook](internal-wiki/deployment-runbook)
# 5 Whys Analysis: [Incident]
## Problem Statement
The payment service experienced a 47-minute outage due to database connection exhaustion.
## Analysis
### Why #1: Why did the service fail?
**Answer**: Database connections were exhausted, causing all new requests to fail.
**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
---
### Why #2: Why were database connections exhausted?
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
---
### Why #3: Why did the code bypass the connection pool?
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
---
### Why #4: Why wasn't this caught in code review?
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
**Evidence**: Review comments only discuss business logic.
---
### Why #5: Why isn't there a safety net for this type of change?
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
## Root Causes Identified
1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations
## Systemic Improvements
| Root Cause | Improvement | Type |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
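The "no canary" row can be made concrete: a canary gate compares the canary's error rate against the baseline before promoting a release. A minimal sketch in which the function name and tolerance are illustrative:

```python
def canary_healthy(canary_error_rate, baseline_error_rate, tolerance=0.02):
    """Promotion gate: the canary may not exceed the baseline error
    rate by more than `tolerance` (absolute). True means promote."""
    return canary_error_rate - baseline_error_rate <= tolerance

# With the incident's numbers, a canary slice running at a 23% error
# rate against a ~1% baseline fails the gate, halting the rollout
# before it reaches all customers.
assert canary_healthy(0.23, 0.01) is False   # v2.3.4 would not promote
assert canary_healthy(0.012, 0.01) is True   # normal jitter passes
```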
# Quick Postmortem: [Brief Title]
**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
## What Happened
API latency spiked to 5 seconds due to a cache-miss storm after a full cache flush.
## Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized
## Root Cause
A full cache flush for a minor config update caused a thundering herd of misses against the backing store.
## Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)
## Lessons
Don't fully flush the cache in production; use targeted invalidation.
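The lesson can be demonstrated with a toy read-through cache: after a full flush every read misses and hammers the backing store (the herd), while targeted invalidation re-fetches only the changed key. An illustrative sketch of the ENG-999 direction, not the actual service code:

```python
class Cache:
    """Toy read-through cache; every miss is a hit on the backing store."""
    def __init__(self, backing):
        self.backing = backing  # e.g. the config database
        self.store = {}
        self.misses = 0

    def get(self, key):
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.backing[key]
        return self.store[key]

    def flush(self):            # what the incident did
        self.store.clear()

    def invalidate(self, key):  # targeted alternative (the ENG-999 direction)
        self.store.pop(key, None)

backing = {f"cfg{i}": i for i in range(100)}
cache = Cache(backing)
for k in backing:               # warm the cache
    cache.get(k)

cache.misses = 0
cache.invalidate("cfg0")        # targeted: one miss on the next read pass
for k in backing:
    cache.get(k)
assert cache.misses == 1

cache.misses = 0
cache.flush()                   # full flush: the whole herd misses at once
for k in backing:
    cache.get(k)
assert cache.misses == 100
```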
## Meeting Structure (60 minutes)
### 1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms
### 2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline
### 3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?
### 4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates
### 5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed
## Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |