重要前提
安装AI Skills的关键前提是:必须科学上网,且开启TUN模式,这一点至关重要,直接决定安装能否顺利完成,在此郑重提醒三遍:科学上网,科学上网,科学上网。查看完整安装教程 →
npx skills add https://github.com/patricio0312rev/skills --skill postmortem-writer记录事件以供学习和改进。
# 事后分析:API 中断 - 数据库连接池耗尽
**日期:** 2024-01-15
**作者:** Jane Doe(值班人员),John Smith(数据库管理员)
**状态:** 已完成
**严重性:** P1(严重)
## 摘要
2024年1月15日,我们的 API 经历了持续25分钟的完全中断(UTC时间14:32 - 14:57),影响了100%的用户。根本原因是部署版本 v2.3.4 中引入的连接泄漏触发了数据库连接池耗尽。
**影响:**
- 持续时间:25分钟
- 受影响用户:约 50,000
- 失败请求:约 125,000
- 收入影响:约 15,000 美元
## 时间线(所有时间为 UTC)
| 时间 | 事件 |
| ----- | ------------------------------------------------ |
| 14:15 | v2.3.4 部署到生产环境 |
| 14:32 | 第一个 CloudWatch 警报:高错误率 |
| 14:33 | PagerDuty 向值班人员(Jane)发送警报 |
| 14:35 | Jane 确认警报,开始调查 |
| 14:38 | 确认问题:数据库连接池使用率达到 100% |
| 14:40 | 尝试:终止长时间运行的查询(无效) |
| 14:43 | 决策:回滚到 v2.3.3 |
| 14:45 | 开始回滚 |
| 14:47 | 回滚完成,连接开始下降 |
| 14:50 | 错误率恢复正常 |
| 14:57 | 所有系统恢复,事件关闭 |
| 15:30 | 安排事后分析会议 |
## 根本原因
v2.3.4 版本中的代码变更在用户个人资料端点引入了连接泄漏。新的缓存层在查询完成后没有正确释放数据库连接。
**代码差异:**
```diff
- await prisma.user.findUnique({ where: { id } });
* const client = await pool.connect();
* const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
* // 缺失:client.release() ❌
测试不充分: 负载测试未发现泄漏
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
监控缺失: 没有针对连接池指标的警报
代码审查不足: 审查者未发现缺失的 release()
部署流程: 没有渐进式发布
| 待办事项 | 负责人 | 截止日期 | 优先级 | 状态 |
|---|---|---|---|---|
| 将连接池指标添加到仪表板 | Jane | 2024-01-20 | P0 | ✅ 已完成 |
| 创建连接管理的 PR 检查清单 | John | 2024-01-22 | P0 | ✅ 已完成 |
| 将负载测试最低时长延长至 30 分钟 | QA 团队 | 2024-01-25 | P1 | 🔄 进行中 |
| 实施金丝雀部署(10% → 100%) | DevOps | 2024-02-01 | P1 | 📋 已计划 |
| 在测试中添加连接泄漏检测 | Jane | 2024-01-27 | P1 | 🔄 进行中 |
| 审查所有数据库连接使用模式 | John | 2024-02-05 | P2 | 📋 已计划 |
| 改进警报路由(更快升级) | DevOps | 2024-02-10 | P2 | 📋 已计划 |
为防止类似事件发生:
[插入事件期间的连接池、错误率、延迟图表]
[插入状态页面更新和客户沟通记录]
PR #1235:修复用户个人资料端点的连接泄漏
const client = await pool.connect();
try {
const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
return user;
} finally {
client.release(); // ✅ 始终释放
}
## 事后分析最佳实践
## 输出检查清单
* 时间线已创建
* 根本原因已确定
* 促成因素已记录
* 待办事项已有负责人
* 经验教训已捕获
* 事后分析会议已召开
* 文档已广泛分享
* 已安排后续跟进 ENDFILE
每周安装次数
51
代码仓库
[patricio0312rev/skills](https://github.com/patricio0312rev/skills "patricio0312rev/skills")
GitHub 星标数
27
首次出现
2026年1月24日
安全审计
[Gen Agent Trust HubPass](/patricio0312rev/skills/postmortem-writer/security/agent-trust-hub)[SocketPass](/patricio0312rev/skills/postmortem-writer/security/socket)[SnykPass](/patricio0312rev/skills/postmortem-writer/security/snyk)
安装于
codex44
gemini-cli43
opencode43
github-copilot42
cursor41
claude-code40
Document incidents for learning and improvement.
# Postmortem: API Outage - Database Connection Pool Exhausted
**Date:** 2024-01-15
**Authors:** Jane Doe (On-call), John Smith (DBA)
**Status:** Complete
**Severity:** P1 (Critical)
## Summary
On January 15, 2024, our API experienced a complete outage for 25 minutes (14:32 - 14:57 UTC) affecting 100% of users. The root cause was database connection pool exhaustion triggered by a connection leak introduced in deployment v2.3.4.
**Impact:**
- Duration: 25 minutes
- Users affected: ~50,000
- Requests failed: ~125,000
- Revenue impact: ~$15,000
## Timeline (All times UTC)
| Time | Event |
| ----- | ------------------------------------------------ |
| 14:15 | v2.3.4 deployed to production |
| 14:32 | First CloudWatch alarm: HighErrorRate |
| 14:33 | PagerDuty alert sent to on-call (Jane) |
| 14:35 | Jane acknowledges, begins investigation |
| 14:38 | Identified: Database connection pool at 100% |
| 14:40 | Attempted: Kill long-running queries (no effect) |
| 14:43 | Decision: Rollback to v2.3.3 |
| 14:45 | Rollback initiated |
| 14:47 | Rollback complete, connections dropping |
| 14:50 | Error rate returning to normal |
| 14:57 | All systems recovered, incident closed |
| 15:30 | Postmortem meeting scheduled |
## Root Cause
A code change in v2.3.4 introduced a connection leak in the user profile endpoint. The new caching layer was not properly releasing database connections after queries completed.
**Code diff:**
\`\`\`diff
- await prisma.user.findUnique({ where: { id } });
* const client = await pool.connect();
* const user = await client.query('SELECT \* FROM users WHERE id = $1', [id]);
* // Missing: client.release() ❌
\`\`\`
## Contributing Factors
1. **Insufficient testing:** Load tests didn't catch the leak
- Tests only ran for 5 minutes
- Not enough concurrent connections to exhaust pool
2. **Missing monitoring:** No alerts on connection pool metrics
- Had alarms for query latency
- No alarms for active connections count
3. **Inadequate code review:** Reviewer didn't spot missing release()
- PR approved without running locally
- No checklist for connection management
4. **Deployment process:** No gradual rollout
- Deployed to 100% of production immediately
- No canary deployment
## What Went Well
1. ✅ **Fast detection:** Alert fired within 3 minutes
2. ✅ **Clear runbook:** DBA runbook had exact steps to follow
3. ✅ **Quick decision:** Made rollback decision in 8 minutes
4. ✅ **Communication:** Status page updated every 5 minutes
5. ✅ **Rollback capability:** Automated rollback took <2 minutes
## What Went Wrong
1. ❌ **Code review missed bug:** Connection leak not caught
2. ❌ **Testing gaps:** Load tests insufficient duration
3. ❌ **No canary:** Deployed to all instances at once
4. ❌ **Late detection:** 17 minutes between deploy and alert
## Action Items
| Action | Owner | Due Date | Priority | Status |
| --------------------------------------------- | ------- | ---------- | -------- | -------------- |
| Add connection pool metrics to dashboards | Jane | 2024-01-20 | P0 | ✅ Done |
| Create PR checklist for connection management | John | 2024-01-22 | P0 | ✅ Done |
| Extend load tests to 30 minutes minimum | QA Team | 2024-01-25 | P1 | 🔄 In Progress |
| Implement canary deployment (10% → 100%) | DevOps | 2024-02-01 | P1 | 📋 Planned |
| Add connection leak detection to tests | Jane | 2024-01-27 | P1 | 🔄 In Progress |
| Review all DB connection usage patterns | John | 2024-02-05 | P2 | 📋 Planned |
| Improve alert routing (faster escalation) | DevOps | 2024-02-10 | P2 | 📋 Planned |
## Lessons Learned
1. **Code review checklists work:** Need specific items for common issues
2. **Load tests need realistic duration:** 5min insufficient for leaks
3. **Gradual rollouts catch issues:** 10% canary would have limited impact
4. **Monitoring gaps are dangerous:** Add metrics before you need them
5. **Runbooks save time:** Clear procedures enabled fast response
## Related Incidents
- [2023-11-20] Database CPU spike (similar connection pool issue)
- [2023-08-15] Memory leak in cache layer
## Prevention
To prevent similar incidents:
1. ✅ Add connection management to code review checklist
2. ✅ Monitor connection pool utilization
3. ✅ Extend load test duration
4. ✅ Implement canary deployments
5. ✅ Add automated connection leak detection
## Appendix
### Monitoring Graphs
[Insert graphs of connection pool, error rates, latency during incident]
### Communication Log
[Insert status page updates and customer communication]
### Code Fix
PR #1235: Fix connection leak in user profile endpoint
\`\`\`typescript
const client = await pool.connect();
try {
const user = await client.query('SELECT \* FROM users WHERE id = $1', [id]);
return user;
} finally {
client.release(); // ✅ Always release
}
\`\`\`
# Blameless Postmortem Guidelines
## Do ✅
- Focus on systems and processes, not people
- Use timeline with exact timestamps
- Identify contributing factors, not just root cause
- Create actionable items with owners and dates
- Document what went well (positive reinforcement)
- Share widely for organizational learning
## Don't ❌
- Blame individuals or teams
- Hide or minimize the incident
- Skip the postmortem (even for small incidents)
- Create action items without owners
- Forget to follow up on action items
- Make it a blame session
## Template Sections
1. **Summary** (2-3 sentences)
2. **Impact** (numbers: users, revenue, duration)
3. **Timeline** (chronological events)
4. **Root Cause** (technical explanation)
5. **Contributing Factors** (broader context)
6. **What Went Well** (positive reinforcement)
7. **What Went Wrong** (improvement areas)
8. **Action Items** (concrete, owned, dated)
9. **Lessons Learned** (key takeaways)
Weekly Installs
51
Repository
GitHub Stars
27
First Seen
Jan 24, 2026
Security Audits
Gen Agent Trust HubPassSocketPassSnykPass
Installed on
codex44
gemini-cli43
opencode43
github-copilot42
cursor41
claude-code40
论文复现上下文解析器 - 解决AI论文复现中的关键信息缺口与冲突检测
22,600 周安装