incident-runbook-templates by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill incident-runbook-templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
resources/implementation-playbook.md.

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
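
The severity table above can be encoded as a small lookup so that paging or reporting scripts print the response target for a given level. This is a minimal sketch; the function name `sev_response_target` is hypothetical and not part of the skill.

```bash
# Hypothetical helper: map a severity level to the response-time target
# listed in the severity table.
sev_response_target() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

sev_response_target SEV2
```

An unknown level returns a non-zero status, so a calling script can fail loudly instead of paging with an empty target.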
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
### Template 1: Service Outage Runbook
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
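
The three alert conditions above can be checked offline against sampled values, for example when replaying an incident. A minimal sketch, assuming the values are passed in as plain numbers; the function name `firing_alerts` is hypothetical, and `awk` is used only because the shell has no floating-point comparison.

```bash
# Hypothetical helper: print the runbook alert conditions that would fire.
# Arguments: error rate (%), p99 latency (seconds), success rate (%).
firing_alerts() {
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    if (err > 5) print "payment_error_rate > 5% (PagerDuty)"
    if (p99 > 2) print "payment_latency_p99 > 2s (Slack)"
    if (ok < 95) print "payment_success_rate < 95% (PagerDuty)"
  }'
}

# Illustrative values only.
firing_alerts 7.5 1.2 92.5
```

With the illustrative values, the error-rate and success-rate conditions are reported and the latency condition is not.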
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
# Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Check error rates (-G with --data-urlencode stops curl from globbing the {} and [] in the query)
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
# Check that the health endpoint is reachable
curl -I https://api.company.com/payments/health
```

| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
## Mitigation Steps
### 4.1 Service Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments
# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100
# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments
# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10
# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
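
Step 2 depends on spotting crash-looping pods quickly. The sketch below counts them from `kubectl get pods` output read on stdin. It assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE); the function name `count_crashlooping` and the sample rows are hypothetical.

```bash
# Hypothetical helper: count pods whose STATUS column reads CrashLoopBackOff.
# Skips the header line; STATUS is the third field in the default layout.
count_crashlooping() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { n++ } END { print n + 0 }'
}

# Illustrative sample only, not output from a real cluster.
sample='NAME                 READY   STATUS             RESTARTS   AGE
payment-service-a    0/1     CrashLoopBackOff   7          12m
payment-service-b    1/1     Running            0          3h'

printf '%s\n' "$sample" | count_crashlooping
```

Against a live cluster the same function would be fed with `kubectl get pods -n payments -l app=payment-service | count_crashlooping`.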
### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool
# Step 2: Check slow queries (if DB issue)
# The WHERE clause repeats the expression because SQL cannot reference the "duration" alias there
psql -h "$DB_HOST" -U "$DB_USER" -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"
# Step 3: Kill long-running queries if needed (replace <pid> with a pid returned by Step 2)
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend(<pid>);"
# Step 4: Check external dependency latency (expects a curl-format.txt in the working directory)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
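
Step 4 reads its output format from `curl-format.txt`, whose contents the runbook does not show. The following is one possible layout, offered as an assumption rather than the author's file; the `%{...}` names are standard curl write-out variables.

```bash
# Write a format file for `curl -w "@curl-format.txt"`.
# The quoted heredoc keeps %{...} and \n literal so curl interprets them.
cat > curl-format.txt <<'EOF'
   time_namelookup: %{time_namelookup}s\n
      time_connect: %{time_connect}s\n
   time_appconnect: %{time_appconnect}s\n
time_starttransfer: %{time_starttransfer}s\n
        time_total: %{time_total}s\n
EOF
```

A large gap between `time_appconnect` and `time_starttransfer` points at the dependency's server-side processing rather than DNS or TLS setup.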
### 4.3 Partial Failures
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20
# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments
# Step 3: If a specific endpoint is failing, set the feature flag that disables it
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
### 4.4 Traffic Surge
```bash
# Step 1: Check current load (kubectl top reports CPU and memory, a proxy for request rate)
kubectl top pods -n payments
# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20
# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments
# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
## Verification & Rollback
### Verification
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq
# Verify error rate is back to normal
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
# Verify latency is acceptable
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq
# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
### Rollback
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"
# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
## Escalation Matrix
| Condition | Escalate To | Contact |
|---|---|---|
| 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
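
The escalation conditions can also be expressed as a checklist function, so that an incident bot or on-call script lists who to contact. A minimal sketch; the function name `escalation_targets` and its argument order are hypothetical, and "15 min unresolved" is read as 15 minutes or more.

```bash
# Hypothetical helper: print the contacts to escalate to.
# Arguments: severity, minutes unresolved, financial impact in USD,
#            breach suspected (yes/no), customer communication needed (yes/no).
escalation_targets() {
  local sev="$1" minutes="$2" impact_usd="$3" breach="$4" customer_comms="$5"
  if [ "$sev" = "SEV1" ] && [ "$minutes" -ge 15 ]; then
    echo "Engineering Manager: @manager (Slack)"
  fi
  if [ "$breach" = "yes" ]; then
    echo "Security Team: #security-incidents"
  fi
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal: @finance-oncall"
  fi
  if [ "$customer_comms" = "yes" ]; then
    echo "Support Lead: @support-lead"
  fi
}

# Illustrative call: SEV1 open for 20 minutes, $25,000 impact, customers must be told.
escalation_targets SEV1 20 25000 no yes
```

Each condition is evaluated independently, so several contacts can be printed for one incident.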
## Communication Templates
```
🚨 INCIDENT: Payment Service Degradation
Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
```

```
📊 UPDATE: Payment Service Incident
Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes
```

```
✅ RESOLVED: Payment Service Incident
Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
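
Filling in the initial notification by hand is error-prone under pressure, so the template can be rendered from arguments. A minimal sketch that covers only the header fields of the first message; the function name `initial_notification` and its argument order are hypothetical.

```bash
# Hypothetical helper: render the initial incident notification.
# Arguments: title, severity, impact, start time, incident commander, channel.
initial_notification() {
  local title="$1" severity="$2" impact="$3" start="$4" commander="$5" channel="$6"
  cat <<EOF
🚨 INCIDENT: ${title}
Severity: ${severity}
Status: Investigating
Impact: ${impact}
Start Time: ${start}
Incident Commander: ${commander}
Updates in ${channel}
EOF
}

initial_notification "Payment Service Degradation" SEV2 \
  "~20% of payment requests failing" "[TIME]" "[NAME]" "#payments-incidents"
```

The output is plain text, so it can be piped to whatever chat or paging client the team already uses.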
### Template 2: Database Incident Runbook
# Database Incident Runbook
## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;
-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
## Disk Space
```bash
# Check disk usage
df -h /var/lib/postgresql/data
# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
# VACUUM to reclaim space
# Caution: VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock and
# needs free disk space for the new copy while it runs
psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
```
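
The disk check above can be turned into a scripted gate by parsing the capacity column. The sketch below reads `df -P` output on stdin, because the POSIX format keeps each filesystem on one line. The function name `disk_used_pct`, the sample row, and the 90% threshold are all assumptions; the runbook does not define what counts as critical.

```bash
# Hypothetical helper: extract the capacity percentage from `df -P` output.
# Line 1 is the header; the percentage is the fifth field of line 2.
disk_used_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Illustrative sample only, not output from a real host.
sample='Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/sda1        103081248  92773123  10308125      91% /var/lib/postgresql/data'

used="$(printf '%s\n' "$sample" | disk_used_pct)"
# Assumed threshold of 90%.
if [ "$used" -ge 90 ]; then
  echo "disk usage ${used}%: reclaim space or expand the volume"
fi
```

On a real host the input would come from `df -P /var/lib/postgresql/data | disk_used_pct`.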
## Best Practices
### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
### Don'ts
- **Don't assume knowledge** - Write for a 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
## Resources
- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
- Weekly installs: 87
- Repository: sickn33/antigravity-awesome-skills
- GitHub stars: 27.4K
- First seen: Jan 28, 2026
- Security audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
- Installed on: gemini-cli (82), opencode (82), cursor (81), claude-code (80), github-copilot (78), codex (78)