on-call-handoff-patterns by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill on-call-handoff-patterns
Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.
resources/implementation-playbook.md

| Component | Purpose |
|---|---|
| Active Incidents | What's currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |
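A handoff document covering these components can be scaffolded at the start of each rotation. A minimal sketch; the output filename is an illustrative choice, and the section names follow the template later in this playbook:

```shell
#!/bin/sh
# Scaffold an empty handoff document with one section per component above
out="handoff-$(date +%Y-%m-%d).md"
cat > "$out" <<'EOF'
# On-Call Handoff

## 🔴 Active Incidents

## 🟡 Ongoing Investigations

## 📋 Recent Changes

## ⚠️ Known Issues & Workarounds

## 📅 Upcoming Events
EOF
echo "created $out"
```

The outgoing on-call then fills in each section during the 15-minute write-up slot described below.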
Recommended: 30 min overlap between shifts
Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming
Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
# On-Call Handoff: Platform Team
**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC
---
## 🔴 Active Incidents
### None currently active
No active incidents at handoff time.
---
## 🟡 Ongoing Investigations
### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out
**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)
**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed
**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)
---
### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)
**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly
**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%
**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
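The "restart if usage > 80%" criterion can be checked mechanically rather than by eyeballing the dashboard. A minimal sketch; the captured `kubectl top pod` line, the pod name, and the 1024 Mi limit are all illustrative assumptions:

```shell
# In practice capture live data, e.g.:
# usage_line=$(kubectl top pod -l app=auth-service --no-headers | head -1)
usage_line="auth-service-7d9f4b 210m 850Mi"   # sample line for illustration
limit_mi=1024                                 # assumed container memory limit

# Third column is memory usage; strip the Mi suffix and compare to the limit
used_mi=$(echo "$usage_line" | awk '{gsub(/Mi$/,"",$3); print $3}')
pct=$(( used_mi * 100 / limit_mi ))
echo "auth-service memory: ${pct}% of limit"
if [ "$pct" -gt 80 ]; then
  echo "over 80% - consider a rolling restart (see ENG-1235)"
fi
```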
---
## 🟢 Resolved This Shift
### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231
---
## 📋 Recent Changes
### Deployments
| Service | Version | Time | Notes |
|---------|---------|------|-------|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75
### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0
---
## ⚠️ Known Issues & Workarounds
### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)
### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)
---
## 📅 Upcoming Events
| Date | Event | Impact | Contact |
|------|-------|--------|---------|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
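The 2x-traffic expectation for the 01/25 campaign is worth sanity-checking against the 1500 RPS rate limit set on 01/21. A rough sketch; the current peak figure is a hypothetical number, not a measured one:

```shell
# Rough capacity check for the expected traffic doubling
peak_rps=900        # assumed current peak request rate
limit_rps=1500      # API rate limit set on 01/21
expected=$(( peak_rps * 2 ))
echo "expected peak: ${expected} RPS (limit: ${limit_rps} RPS)"
if [ "$expected" -gt "$limit_rps" ]; then
  echo "expected load exceeds the rate limit - raise the limit or plan load shedding"
fi
```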
---
## 📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
|------------|------------------|-------------------|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
---
## 🔧 Quick Reference
### Common Commands
```bash
# Check service health
kubectl get pods -A | grep -v Running
# Recent cluster events (deployments, restarts, etc.)
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Clear cache (emergency only)
redis-cli FLUSHDB
```

### Incoming Handoff Checklist
- [ ] Read this document
- [ ] Join the sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications are working
- [ ] Check VPN/access is working
- [ ] Review critical dashboards
# Quick Handoff: @alice → @bob
## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release tomorrow (01/24) - be ready for issues
## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)
## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS
## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release
## Questions?
I'll be available on Slack until 17:00 today.
# INCIDENT HANDOFF: Payment Service Degradation
**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2
---
## Current State
- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min
## What We Know
1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow
## What We've Done
- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features
## What Needs to Happen
1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation
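The "should reach <1%" target in step 1 can be checked with a quick calculation; the error and request counts below are sample numbers for illustration, not live metrics:

```shell
# Sample counts; in practice pull these from your metrics store
errors=150
total=1000

rate=$(awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.1f", e * 100 / t }')
echo "error rate: ${rate}%"
if awk -v r="$rate" 'BEGIN { exit !(r >= 1.0) }'; then
  echo "above target - keep monitoring, escalate if no improvement"
else
  echo "below 1% - incident stable"
fi
```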
## Key People
- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)
## Communication
- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
## Resources
- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)
---
**Incoming on-call (@bob) - Please confirm you have:**
- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understood the current state
- [ ] Identified the escalation path
## Handoff Sync: @alice → @bob
1. **Active Issues** (5 min)
- Walk through any ongoing incidents
- Discuss investigation status
- Transfer context and theories
2. **Recent Changes** (3 min)
- Deployments to watch
- Config changes
- Known regressions
3. **Upcoming Events** (3 min)
- Maintenance windows
- Expected traffic changes
- Releases planned
4. **Questions** (4 min)
- Clarify anything unclear
- Confirm access and alerting
- Exchange contact info
## Pre-Shift Checklist
### Access Verification
- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
- [ ] Log aggregator access (Splunk/Datadog)
- [ ] PagerDuty app installed and logged in
### Alerting Setup
- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged
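The test-alert step can be driven from the command line via the PagerDuty Events API v2; a sketch with the actual send commented out (the routing key is a placeholder for your service's integration key):

```shell
# Build a test-alert payload for the PagerDuty Events API v2
routing_key="YOUR_INTEGRATION_KEY"   # placeholder
payload=$(cat <<EOF
{
  "routing_key": "${routing_key}",
  "event_action": "trigger",
  "payload": {
    "summary": "On-call handoff test alert - please acknowledge",
    "source": "pre-shift-checklist",
    "severity": "info"
  }
}
EOF
)
echo "$payload"
# Uncomment to actually send:
# curl -s -X POST https://events.pagerduty.com/v2/enqueue \
#   -H 'Content-Type: application/json' -d "$payload"
```

Acknowledge the resulting page on your phone to confirm the whole notification path works end to end.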
### Knowledge Refresh
- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts
### Environment Ready
- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls
- [ ] Secondary contact identified (if traveling)
## Daily On-Call Routine
### Morning (start of day)
- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context
### Throughout Day
- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages
### End of Day
- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift
## Post-Shift Checklist
- [ ] Complete handoff document
- [ ] Sync with incoming on-call
- [ ] Verify PagerDuty routing changed
- [ ] Close/update investigation tickets
- [ ] File postmortems for any incidents
- [ ] Take time off if shift was stressful
## Escalation Triggers
### Immediate Escalation
- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received
### Consider Escalation
- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps
### How to Escalate
1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until escalation acknowledges
4. Hand off cleanly, don't just disappear