On-Call Handoff Patterns Guide: Best Practices for Efficient Rotations, Incident Response, and Team Collaboration

on-call-handoff-patterns by wshobson/agents

3,200 weekly installs

32,200 GitHub Stars

Install command

npx skills add https://github.com/wshobson/agents --skill on-call-handoff-patterns

DevOps, Monitoring, Productivity



On-Call Handoff Patterns

Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.

When to Use This Skill

- Transitioning on-call responsibilities
- Writing shift handoff summaries
- Documenting ongoing investigations
- Establishing on-call rotation procedures
- Improving handoff quality
- Onboarding new on-call engineers

Core Concepts

1. Handoff Components

| Component              | Purpose                 |
| ---------------------- | ----------------------- |
| Active Incidents       | What's currently broken |
| Ongoing Investigations | Issues being debugged   |
| Recent Changes         | Deployments, configs    |
| Known Issues           | Workarounds in place    |
| Upcoming Events        | Maintenance, releases   |
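The five components above map naturally onto the sections of a handoff document. As a minimal sketch, the shell function below prints an empty skeleton with one section per component. The function name `handoff_skeleton` and the placeholder text are illustrative assumptions, not part of the skill itself.

```bash
# Print a skeleton handoff document with one section per component.
# handoff_skeleton and the "None recorded" placeholder are hypothetical.
handoff_skeleton() {
  team="$1"
  printf '# On-Call Handoff: %s\n' "$team"
  for section in \
    'Active Incidents' \
    'Ongoing Investigations' \
    'Recent Changes' \
    'Known Issues' \
    'Upcoming Events'
  do
    printf '\n## %s\n\n_None recorded._\n' "$section"
  done
}

handoff_skeleton 'Platform Team'
```

Starting from a fixed skeleton makes an omitted section visible at a glance, rather than leaving the reader to guess whether "nothing listed" means "nothing happened".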

2. Handoff Timing

Recommended: 30 min overlap between shifts, plus about 5 min afterward for the incoming engineer to verify alerting

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
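To check the schedule's arithmetic, the sketch below totals the minutes listed for each role. The helper name `total_minutes` is an assumption introduced for illustration; the figures are the ones from the schedule above.

```bash
# Sum a list of durations in minutes.
# total_minutes is a hypothetical helper, not part of the skill.
total_minutes() {
  total=0
  for minutes in "$@"; do
    total=$((total + minutes))
  done
  echo "$total"
}

outgoing=$(total_minutes 15 15)     # write document + sync call
incoming=$(total_minutes 15 15 5)   # review document + sync call + alert check

echo "outgoing: ${outgoing} min"
echo "incoming: ${incoming} min"
```

The outgoing steps total 30 minutes and the incoming steps total 35, so the alerting check runs about five minutes past a 30-minute overlap unless it is done in parallel with another step.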

Templates

Template 1: Shift Handoff Document

# On-Call Handoff: Platform Team

**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC

---

## 🔴 Active Incidents

### None currently active

No active incidents at handoff time.

---

## 🟡 Ongoing Investigations

### 1. Intermittent API Timeouts (ENG-1234)

**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out

**Context**:

- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)

**Next Steps**:

- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed

**Resources**:

- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)

---

### 2. Memory Growth in Auth Service (ENG-1235)

**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)

**Context**:

- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly

**Next Steps**:

- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%

**Resources**:

- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)

---

## 🟢 Resolved This Shift

### Payment Service Outage (2024-01-19)

- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231

---

## 📋 Recent Changes

### Deployments

| Service      | Version | Time        | Notes                      |
| ------------ | ------- | ----------- | -------------------------- |
| api-gateway  | v3.2.1  | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0  | 01/20 10:00 | New profile features       |
| auth-service | v4.1.2  | 01/19 16:00 | Security patch             |

### Configuration Changes

- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75

### Infrastructure

- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0

---

## ⚠️ Known Issues & Workarounds

### 1. Slow Dashboard Loading

**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)

### 2. Flaky Integration Test

**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)

---

## 📅 Upcoming Events

| Date        | Event                | Impact              | Contact       |
| ----------- | -------------------- | ------------------- | ------------- |
| 01/23 02:00 | Database maintenance | 5 min read-only     | @dba-team     |
| 01/24 14:00 | Major release v5.0   | Monitor closely     | @release-team |
| 01/25       | Marketing campaign   | 2x traffic expected | @platform     |

---

## 📞 Escalation Reminders

| Issue Type      | First Escalation     | Second Escalation |
| --------------- | -------------------- | ----------------- |
| Payment issues  | @payments-oncall     | @payments-manager |
| Auth issues     | @auth-oncall         | @security-team    |
| Database issues | @dba-team            | @infra-manager    |
| Unknown/severe  | @engineering-manager | @vp-engineering   |

---

## 🔧 Quick Reference

### Common Commands

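```bash
# List pods that are not Running or Completed (the header row is also shown)
kubectl get pods -A | grep -Ev 'Running|Completed'

# Recent events, including rollouts, in the current namespace (add -A for all namespaces)
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Count current database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only): deletes every key in the selected Redis database
redis-cli FLUSHDB
```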

### Important Links

## Handoff Checklist

### Outgoing Engineer

- [ ] Document active incidents
- [ ] Document ongoing investigations
- [ ] List recent changes
- [ ] Note known issues
- [ ] Add upcoming events
- [ ] Sync with incoming engineer

### Incoming Engineer

- [ ] Read this document
- [ ] Join sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications working
- [ ] Check VPN/access working
- [ ] Review critical dashboards
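If the checklist and the "Next Steps" lists are kept as Markdown task items (`- [ ]` for open, `- [x]` for done), a quick count of open items shows whether a handoff is ready to close. The sketch below is one way to do that; the function name `open_items` is an assumption for illustration.

```bash
# Count unchecked Markdown task items ("- [ ]").
# Reads the file named in $1, or standard input when no file is given.
# open_items is a hypothetical helper name.
open_items() {
  # -F treats the pattern as a fixed string; grep -c exits 1 on zero
  # matches but still prints 0, so the status is normalized with "|| true".
  grep -c -F -e '- [ ]' "$@" || true
}

printf '%s\n' \
  '- [x] Document active incidents' \
  '- [ ] Sync with incoming engineer' \
  '- [ ] Verify PagerDuty is routing to you' |
  open_items
```

For the three sample lines the function prints `2`, the number of items still unchecked.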

Template 2: Quick Handoff (Async)

# Quick Handoff: @alice → @bob

## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release on 01/24 - be ready for issues

## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)

## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS

## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release

## Questions?
I'll be available on Slack until 17:00 today.
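A watch-list rule such as "restart if > 80%" is easier to apply consistently when written as a check. The sketch below compares a usage percentage against a threshold. How the usage figure is obtained is left to the reader's own monitoring; the value `83.5` and the function name `above_threshold` are placeholders assumed for illustration.

```bash
# Exit status 0 when usage is strictly above the threshold, 1 otherwise.
# Both arguments are percentages; the threshold defaults to 80, the
# figure used in the watch list. above_threshold is a hypothetical name.
above_threshold() {
  usage="$1"
  threshold="${2:-80}"
  awk -v u="$usage" -v t="$threshold" 'BEGIN { exit !(u > t) }'
}

# 83.5 is a placeholder; supply the real value from your monitoring.
if above_threshold 83.5; then
  echo 'auth-service memory above 80%: consider a restart'
else
  echo 'auth-service memory within limits'
fi
```

`awk` handles the comparison so that fractional percentages work, which shell integer arithmetic would reject.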

Template 3: Incident Handoff (Mid-Incident)

# INCIDENT HANDOFF: Payment Service Degradation

**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2

---

## Current State

- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min

## What We Know

1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow

## What We've Done

- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features

## What Needs to Happen

1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation

## Key People

- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)

## Communication

- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
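The "What Needs to Happen" steps in Template 3 amount to a small decision rule: keep monitoring while the error rate falls, escalate if it does not, and move on to root cause analysis once it drops below 1%. The sketch below encodes that rule for two consecutive error-rate samples. The function name `next_action` and its output labels are assumptions for illustration, and the template's 15-minute time bound is not modeled.

```bash
# Decide the next action from two error-rate samples (percentages).
# Prints "stable", "improving", or "escalate".
# next_action and its labels are hypothetical.
next_action() {
  previous="$1"
  current="$2"
  target="${3:-1}"   # template target: below 1%
  awk -v p="$previous" -v c="$current" -v t="$target" 'BEGIN {
    if (c < t)      print "stable"
    else if (c < p) print "improving"
    else            print "escalate"
  }'
}

next_action 40 15    # prints "improving"
next_action 15 15    # prints "escalate"
next_action 15 0.4   # prints "stable"
```

Writing the rule down before the handoff gives the incoming engineer an unambiguous trigger for escalating to @payments-manager.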


First Seen

Jan 20, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Installed on

claude-code: 2.5K
gemini-cli: 2.4K
opencode: 2.4K
cursor: 2.3K
codex: 2.3K
github-copilot: 2.0K