Incident Runbook Templates - Production-Ready Incident Response Procedures and SRE Operations Guide

incident-runbook-templates by wshobson/agents

3,300 weekly installs

32,200 GitHub Stars

GitHub

Install command

npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates

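DevOps, Monitoring, Security

🇨🇳 Chinese Introduction

# Incident Runbook Templates

Production-ready templates for incident response, covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Training on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |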


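### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

## Runbook Templates

### Template 1: Service Outage Runbook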

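# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates (-G with --data-urlencode sends the PromQL URL-encoded,
# so curl does not treat the {} and [] in the query as URL globbing)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |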

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A column alias such as "duration" cannot be referenced in WHERE, so the expression is repeated.
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# PID is a placeholder: set it to a pid returned by Step 2.
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# curl-format.txt is a --write-out format file that must exist in the working directory.
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, set the feature flag that disables it
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current CPU/memory usage per pod (kubectl top does not report request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
# This policy allows ingress from everywhere except the listed range.
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq .

# Verify error rate is back to normal
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" \
  | jq .

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

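## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```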

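### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change is when the connection entered its current state;
-- the current session is excluded)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
```

## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# VACUUM FULL rewrites the table, so it needs free space for the new copy
# and holds an exclusive lock on the table while it runs.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```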

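## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for a 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident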

🇺🇸English

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
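The response-time targets in the table can be turned into a concrete deadline when an alert fires. The following is a minimal sketch, not part of the original skill: the function names `sev_response_minutes` and `sev_ack_deadline` are hypothetical, and the detection time is assumed to be given as Unix epoch seconds. SEV4 has no minute-based target in the table, so the sketch reports it as text rather than as a timestamp.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (assumed names): map a severity level to the
# response-time target listed in the severity table.

# Print the response target in minutes, or "next-business-day" for SEV4.
sev_response_minutes() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 120 ;;
    SEV4) echo "next-business-day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Print the latest acceptable response time as Unix epoch seconds.
# $1 = severity, $2 = detection time in epoch seconds.
sev_ack_deadline() {
  local minutes
  minutes=$(sev_response_minutes "$1") || return 1
  if [ "$minutes" = "next-business-day" ]; then
    echo "$minutes"
  else
    echo $(( $2 + minutes * 60 ))
  fi
}
```

A call such as `sev_ack_deadline SEV2 "$(date +%s)"` would print the epoch second by which a SEV2 page should have a responder.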

### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

## Runbook Templates

### Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
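To show how the three alert conditions combine, here is a small sketch that reports which of them would fire for a given set of readings. It is an illustration only: the function name `firing_alerts` and its argument order are assumptions, and real alerting would normally live in the monitoring system's rule configuration rather than in a shell function.

```bash
# Hypothetical helper (assumed name): print each alert from the list above
# that would fire, together with the channel it routes to.
# $1 = error rate in percent, $2 = p99 latency in seconds, $3 = success rate in percent
firing_alerts() {
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    if (err + 0 > 5)  print "payment_error_rate -> PagerDuty"
    if (p99 + 0 > 2)  print "payment_latency_p99 -> Slack"
    if (ok + 0 < 95)  print "payment_success_rate -> PagerDuty"
  }'
}
```

For example, `firing_alerts 7.5 1.2 96` prints only the `payment_error_rate` line, because the latency and success-rate readings are inside their thresholds.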

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates (-G with --data-urlencode sends the PromQL URL-encoded,
# so curl does not treat the {} and [] in the query as URL globbing)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A column alias such as "duration" cannot be referenced in WHERE, so the expression is repeated.
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# PID is a placeholder: set it to a pid returned by Step 2.
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# curl-format.txt is a --write-out format file that must exist in the working directory.
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, set the feature flag that disables it
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current CPU/memory usage per pod (kubectl top does not report request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
# This policy allows ingress from everywhere except the listed range.
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq .

# Verify error rate is back to normal
# (-G with --data-urlencode keeps curl from globbing the {} and [] in the PromQL)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" \
  | jq .

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
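The matrix can be read as four independent rules, any number of which may apply at once. The sketch below encodes them as a shell function so the combination is explicit. The function name `escalation_targets`, its argument order, and the reading of "15 min unresolved" as "15 minutes or more" are assumptions made for illustration.

```bash
# Hypothetical helper (assumed name): print every escalation target from the
# matrix whose condition is met.
# $1 = severity (e.g. SEV1)
# $2 = minutes the incident has been unresolved
# $3 = estimated financial impact in whole dollars
# $4 = "yes" if a data breach is suspected
# $5 = "yes" if customer communication is needed
escalation_targets() {
  local sev="$1" minutes="$2" impact="$3" breach="$4" comms="$5"
  if [ "$sev" = "SEV1" ] && [ "$minutes" -ge 15 ]; then
    echo "Engineering Manager: @manager (Slack)"
  fi
  if [ "$breach" = "yes" ]; then
    echo "Security Team: #security-incidents"
  fi
  if [ "$impact" -gt 10000 ]; then
    echo "Finance + Legal: @finance-oncall"
  fi
  if [ "$comms" = "yes" ]; then
    echo "Support Lead: @support-lead"
  fi
  return 0
}
```

Calling `escalation_targets SEV2 5 25000 yes no` lists the Security Team and Finance + Legal, since neither rule depends on severity or elapsed time.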

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```
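Filling in the bracketed fields by hand is easy to get wrong under pressure, so the initial notification can be rendered from a few arguments. This is a minimal sketch with an assumed function name, `render_initial_notification`; it omits the Current Actions list for brevity and leaves posting the text to Slack out of scope.

```bash
# Hypothetical helper (assumed name): fill in the initial-notification template.
# $1 = severity, $2 = impact summary, $3 = start time, $4 = incident commander
render_initial_notification() {
  cat <<EOF
🚨 INCIDENT: Payment Service Degradation

Severity: $1
Status: Investigating
Impact: $2
Start Time: $3
Incident Commander: $4

Updates in #payments-incidents
EOF
}
```

For example, `render_initial_notification SEV2 '~20% of payment requests failing' '14:05 UTC' 'A. Example'` prints a message ready to paste into the incident channel.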

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```



### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change is when the connection entered its current state;
-- the current session is excluded)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
```

## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
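The 60-second rule can be applied mechanically once `lag_seconds` has been read. The sketch below assumes a hypothetical function named `replication_lag_action` that takes the reading as its only argument; treating an empty reading as "no reading" is also an assumption, made so that a missing value is not mistaken for zero lag.

```bash
# Hypothetical helper (assumed name): turn a lag_seconds reading into the
# next action, using the 60-second threshold from the runbook.
replication_lag_action() {
  awk -v lag="$1" 'BEGIN {
    if (lag == "")
      print "no reading"
    else if (lag + 0 > 60)
      print "investigate: check network, replica disk I/O, consider failover"
    else
      print "ok"
  }'
}
```

One way to feed it, assuming a `REPLICA_HOST` variable that is not part of the original runbook, is to run the lag query with `psql -h "$REPLICA_HOST" -Atc "..."` and pass the captured output as the argument.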

## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# VACUUM FULL rewrites the table, so it needs free space for the new copy
# and holds an exclusive lock on the table while it runs.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
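Whether the disk is "critical" can be decided from `df` output instead of by eye. The following sketch is an illustration under stated assumptions: the function names `df_use_percent` and `disk_is_critical` are hypothetical, the input is expected in the POSIX `df -P` layout (capacity in the fifth column), and the 90% default threshold is an assumed value that does not come from the runbook.

```bash
# Hypothetical helper (assumed name): read `df -P <path>` output on stdin and
# print the use percentage of the last data line as a bare number.
df_use_percent() {
  awk 'NR > 1 { pct = $5 } END { gsub(/%/, "", pct); print pct }'
}

# Hypothetical helper (assumed name): exit 0 when usage read from stdin is at
# or above the threshold percentage. 90 is an assumed default.
disk_is_critical() {
  local pct
  pct=$(df_use_percent)
  [ "$pct" -ge "${1:-90}" ]
}
```

A typical call would be `df -P /var/lib/postgresql/data | disk_is_critical && echo "disk critical"`.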



## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for a 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident


First Seen

Jan 20, 2026

Security Audits

- Gen Agent Trust Hub: Warn
- Socket: Pass
- Snyk: Pass

Installed on

- claude-code: 2.6K
- gemini-cli: 2.5K
- opencode: 2.5K
- codex: 2.4K
- cursor: 2.4K
- github-copilot: 2.1K
