on-call-handoff-patterns by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill on-call-handoff-patterns
Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.
resources/implementation-playbook.md

| Component | Purpose |
|---|---|
| Active Incidents | What's currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |
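A handoff document covering these components can be scaffolded at the start of each rotation. A minimal sketch; the output filename is an illustrative choice, and the section names follow the template later in this playbook:

```shell
#!/bin/sh
# Scaffold an empty handoff document with one section per component above
out="handoff-$(date +%Y-%m-%d).md"
cat > "$out" <<'EOF'
# On-Call Handoff

## 🔴 Active Incidents

## 🟡 Ongoing Investigations

## 📋 Recent Changes

## ⚠️ Known Issues & Workarounds

## 📅 Upcoming Events
EOF
echo "created $out"
```

The outgoing on-call then fills in each section during the 15-minute write-up slot described below.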
Recommended: 30 min overlap between shifts
Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming
Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
# On-Call Handoff: Platform Team
**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC
---
## 🔴 Active Incidents
### None currently active
No active incidents at handoff time.
---
## 🟡 Ongoing Investigations
### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out
**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)
**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed
**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)
---
### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)
**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly
**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%
**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
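The "restart if usage > 80%" criterion can be checked mechanically rather than by eyeballing the dashboard. A minimal sketch; the captured `kubectl top pod` line, the pod name, and the 1024 Mi limit are all illustrative assumptions:

```shell
# In practice capture live data, e.g.:
# usage_line=$(kubectl top pod -l app=auth-service --no-headers | head -1)
usage_line="auth-service-7d9f4b 210m 850Mi"   # sample line for illustration
limit_mi=1024                                 # assumed container memory limit

# Third column is memory usage; strip the Mi suffix and compare to the limit
used_mi=$(echo "$usage_line" | awk '{gsub(/Mi$/,"",$3); print $3}')
pct=$(( used_mi * 100 / limit_mi ))
echo "auth-service memory: ${pct}% of limit"
if [ "$pct" -gt 80 ]; then
  echo "over 80% - consider a rolling restart (see ENG-1235)"
fi
```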
---
## 🟢 Resolved This Shift
### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231
---
## 📋 Recent Changes
### Deployments
| Service | Version | Time | Notes |
|---------|---------|------|-------|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75
### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0
---
## ⚠️ Known Issues & Workarounds
### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)
### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)
---
## 📅 Upcoming Events
| Date | Event | Impact | Contact |
|------|-------|--------|---------|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
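The 2x-traffic expectation for the 01/25 campaign is worth sanity-checking against the 1500 RPS rate limit set on 01/21. A rough sketch; the current peak figure is a hypothetical number, not a measured one:

```shell
# Rough capacity check for the expected traffic doubling
peak_rps=900        # assumed current peak request rate
limit_rps=1500      # API rate limit set on 01/21
expected=$(( peak_rps * 2 ))
echo "expected peak: ${expected} RPS (limit: ${limit_rps} RPS)"
if [ "$expected" -gt "$limit_rps" ]; then
  echo "expected load exceeds the rate limit - raise the limit or plan load shedding"
fi
```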
---
## 📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
|------------|------------------|-------------------|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
---
## 🔧 Quick Reference
### Common Commands
```bash
# Check service health
kubectl get pods -A | grep -v Running
# Recent cluster events (deployments, restarts, etc.)
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Clear cache (emergency only)
redis-cli FLUSHDB
```

### Incoming Handoff Checklist
- [ ] Read this document
- [ ] Join the sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications are working
- [ ] Check VPN/access is working
- [ ] Review critical dashboards
# Quick Handoff: @alice → @bob
## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release tomorrow (01/24) - be ready for issues
## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)
## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS
## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release
## Questions?
I'll be available on Slack until 17:00 today.
# INCIDENT HANDOFF: Payment Service Degradation
**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2
---
## Current State
- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min
## What We Know
1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow
## What We've Done
- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features
## What Needs to Happen
1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation
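The "should reach <1%" target in step 1 can be checked with a quick calculation; the error and request counts below are sample numbers for illustration, not live metrics:

```shell
# Sample counts; in practice pull these from your metrics store
errors=150
total=1000

rate=$(awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.1f", e * 100 / t }')
echo "error rate: ${rate}%"
if awk -v r="$rate" 'BEGIN { exit !(r >= 1.0) }'; then
  echo "above target - keep monitoring, escalate if no improvement"
else
  echo "below 1% - incident stable"
fi
```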
## Key People
- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)
## Communication
- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
## Resources
- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)
---
**Incoming on-call (@bob) - Please confirm you have:**
- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understood the current state
- [ ] Identified the escalation path
## Handoff Sync: @alice → @bob
1. **Active Issues** (5 min)
- Walk through any ongoing incidents
- Discuss investigation status
- Transfer context and theories
2. **Recent Changes** (3 min)
- Deployments to watch
- Config changes
- Known regressions
3. **Upcoming Events** (3 min)
- Maintenance windows
- Expected traffic changes
- Releases planned
4. **Questions** (4 min)
- Clarify anything unclear
- Confirm access and alerting
- Exchange contact info
## Pre-Shift Checklist
### Access Verification
- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
- [ ] Log aggregator access (Splunk/Datadog)
- [ ] PagerDuty app installed and logged in
### Alerting Setup
- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged
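The test-alert step can be driven from the command line via the PagerDuty Events API v2; a sketch with the actual send commented out (the routing key is a placeholder for your service's integration key):

```shell
# Build a test-alert payload for the PagerDuty Events API v2
routing_key="YOUR_INTEGRATION_KEY"   # placeholder
payload=$(cat <<EOF
{
  "routing_key": "${routing_key}",
  "event_action": "trigger",
  "payload": {
    "summary": "On-call handoff test alert - please acknowledge",
    "source": "pre-shift-checklist",
    "severity": "info"
  }
}
EOF
)
echo "$payload"
# Uncomment to actually send:
# curl -s -X POST https://events.pagerduty.com/v2/enqueue \
#   -H 'Content-Type: application/json' -d "$payload"
```

Acknowledge the resulting page on your phone to confirm the whole notification path works end to end.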
### Knowledge Refresh
- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts
### Environment Ready
- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls
- [ ] Secondary contact identified (if traveling)
## Daily On-Call Routine
### Morning (start of day)
- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context
### Throughout Day
- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages
### End of Day
- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift
## Post-Shift Checklist
- [ ] Complete handoff document
- [ ] Sync with incoming on-call
- [ ] Verify PagerDuty routing changed
- [ ] Close/update investigation tickets
- [ ] File postmortems for any incidents
- [ ] Take time off if shift was stressful
## Escalation Triggers
### Immediate Escalation
- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received
### Consider Escalation
- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps
### How to Escalate
1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until escalation acknowledges
4. Hand off cleanly, don't just disappear