on-call-handoff-patterns by wshobson/agents
npx skills add https://github.com/wshobson/agents --skill on-call-handoff-patterns
Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.
| Component | Purpose |
|---|---|
| Active Incidents | What's currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |
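The five components in the table can double as a completeness check for a written handoff. The sketch below assumes the handoff is a markdown file whose headings contain the component names; the function name `check_handoff_sections` and the heading-matching rule are illustrative assumptions, not part of the skill.

```bash
#!/usr/bin/env bash
# Sketch: verify that a handoff document has a heading for each component.
# check_handoff_sections is a hypothetical helper, not an existing tool.

check_handoff_sections() {
  local file="$1"
  local missing=0
  local section
  for section in "Active Incidents" "Ongoing Investigations" "Recent Changes" \
                 "Known Issues" "Upcoming Events"; do
    # Accept any markdown heading line that contains the component name.
    if ! grep -Eq "^#+ .*${section}" "$file"; then
      echo "missing section: ${section}"
      missing=1
    fi
  done
  # Exit status 0 when every component is present, 1 otherwise.
  return "$missing"
}
```

Running `check_handoff_sections handoff.md` (the file name is a placeholder) before the sync call prints one line per missing component and returns a non-zero status if any are absent.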
Recommended: 30 min overlap between shifts
Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming
Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
# On-Call Handoff: Platform Team
**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC
---
## 🔴 Active Incidents
### None currently active
No active incidents at handoff time.
---
## 🟡 Ongoing Investigations
### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out
**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)
**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed
**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)
---
### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)
**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly
**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%
**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
---
## 🟢 Resolved This Shift
### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231
---
## 📋 Recent Changes
### Deployments
| Service | Version | Time | Notes |
| ------------ | ------- | ----------- | -------------------------- |
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75
### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0
---
## ⚠️ Known Issues & Workarounds
### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)
### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)
---
## 📅 Upcoming Events
| Date | Event | Impact | Contact |
| ----------- | -------------------- | ------------------- | ------------- |
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
---
## 📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
| --------------- | -------------------- | ----------------- |
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
---
## 🔧 Quick Reference
### Common Commands
```bash
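# Check service health: list pods whose status is not Running
kubectl get pods -A | grep -v Running
# Recent cluster events in the current namespace, such as rollouts and restarts
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Database connections: count sessions in pg_stat_activity
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Clear cache (emergency only): deletes every key in the selected Redis database
redis-cli FLUSHDB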
```
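The `grep -v Running` health check above also prints the header row and any `Completed` pods, which can hide a real problem in a long list. As a hedged alternative, the sketch below counts only pods that are neither `Running` nor `Completed`. It assumes the input is `kubectl get pods -A --no-headers` output, where STATUS is the fourth column; the function name `count_unhealthy_pods` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count pods that are neither Running nor Completed.
# Expects `kubectl get pods -A --no-headers` output on stdin, with columns
# NAMESPACE NAME READY STATUS RESTARTS AGE (STATUS is field 4).
# count_unhealthy_pods is a hypothetical helper name.

count_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}
```

One way to use it is `kubectl get pods -A --no-headers | count_unhealthy_pods`. Reading from stdin keeps the function easy to try against saved output without cluster access.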
- [ ] Read this document
- [ ] Join sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications working
- [ ] Check VPN/access working
- [ ] Review critical dashboards
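If the checklist and the next steps are kept as markdown task items (the `- [ ]` form used for next steps in the handoff template), the number of items still open can be counted before the shift is considered handed over. This is a minimal sketch; the function name `count_open_items` and the idea of gating on zero open items are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch: count unchecked markdown task items ("- [ ]") in a file.
# count_open_items is a hypothetical helper name.

count_open_items() {
  # grep -c prints the number of matching lines. It exits with status 1 when
  # the count is 0, so `|| true` keeps the function usable under `set -e`.
  grep -c -E '^[[:space:]]*- \[ \]' "$1" || true
}
```

For example, `count_open_items handoff.md` (placeholder file name) prints the number of items that have not been ticked off.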
# Quick Handoff: @alice → @bob
## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release v5.0 on 01/24 - be ready for issues
## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)
## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS
## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release
## Questions?
I'll be available on Slack until 17:00 today.
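The watch list says to restart the auth service if memory exceeds 80%, and the full handoff records growth of roughly 5% per day. A rough projection of when that threshold will be reached helps the incoming engineer plan the restart. The sketch below assumes linear growth in whole percentage points per day and takes the current usage as an input, since the source does not state it; the function name `days_until_threshold` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: days until memory usage reaches a threshold, assuming linear growth.
# Arguments are whole percentages: current usage, growth per day, threshold.
# days_until_threshold is a hypothetical helper; linear growth is an assumption.

days_until_threshold() {
  local current="$1" growth="$2" threshold="$3"
  if [ "$growth" -le 0 ]; then
    echo "growth must be positive" >&2
    return 1
  fi
  if [ "$current" -ge "$threshold" ]; then
    echo 0
    return 0
  fi
  # Integer ceiling of (threshold - current) / growth.
  echo $(( (threshold - current + growth - 1) / growth ))
}
```

With an illustrative current usage of 55%, `days_until_threshold 55 5 80` prints `5`. If the real growth is compounding rather than linear, the threshold will be reached sooner than this estimate.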
# INCIDENT HANDOFF: Payment Service Degradation
**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2
---
## Current State
- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min
## What We Know
1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow
## What We've Done
- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features
## What Needs to Happen
1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation
## Key People
- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)
## Communication
- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
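The three steps under "What Needs to Happen" amount to a small decision rule on the error rate: investigate once it is below 1%, keep monitoring while it is falling, and escalate if it is not improving. The sketch below encodes that rule for two consecutive readings. The function name `next_action`, the one-word outputs, and the choice to compare only two samples are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch: decide the next step from two consecutive error-rate readings (%).
# next_action is a hypothetical helper; the <1% target comes from the handoff.

next_action() {
  local previous="$1" current="$2"
  awk -v prev="$previous" -v cur="$current" 'BEGIN {
    if (cur < 1)         print "investigate"  # stable: begin root cause work
    else if (cur < prev) print "monitor"      # improving: keep watching
    else                 print "escalate"     # not improving: page the manager
  }'
}
```

For the readings in the handoff, `next_action 40 15` prints `monitor`. A reading that stays at 15% on the next check (`next_action 15 15`) prints `escalate`, which in this template means contacting @payments-manager.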