emergency-release-workflow by bobmatnyc/claude-mpm-skills
npx skills add https://github.com/bobmatnyc/claude-mpm-skills --skill emergency-release-workflow
Fast-track workflow for critical production issues requiring immediate deployment. Covers urgency assessment, expedited PR process, deployment verification, and post-incident analysis.
| Level | Type | Response Time | Deployment | Example |
|---|---|---|---|---|
| P0 | Security vulnerability | < 2 hours | Immediate to production | Auth bypass, data leak, active exploit |
| P1 | Production down | < 4 hours | Same day | App crash, complete feature failure, payment down |
| P2 | Major bug | < 24 hours | Next business day | Critical feature broken, significant user impact |
| P3 | Business critical | < 48 hours | Scheduled release | Marketing campaign blocker, partner deadline |
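The response-time column can be turned into concrete deadlines. The following is a minimal sketch, assuming the hour values from the table above; `RESPONSE_HOURS` and `responseDeadline` are hypothetical names introduced for illustration, not part of the skill itself.

```typescript
// Response-time targets (in hours) copied from the urgency table above.
type Priority = "P0" | "P1" | "P2" | "P3";

const RESPONSE_HOURS: Record<Priority, number> = {
  P0: 2,
  P1: 4,
  P2: 24,
  P3: 48,
};

// Hypothetical helper: the latest time by which a response is due.
function responseDeadline(priority: Priority, detectedAt: Date): Date {
  const ms = RESPONSE_HOURS[priority] * 60 * 60 * 1000;
  return new Date(detectedAt.getTime() + ms);
}

const detected = new Date("2024-01-01T14:23:00Z");
console.log(responseDeadline("P0", detected).toISOString()); // 2024-01-01T16:23:00.000Z
```

Keeping the targets in one lookup table means an alerting script and a ticket template can share the same numbers.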
# Branch from current production (main)
git checkout main
git pull origin main
# Create hotfix branch
git checkout -b hotfix/ENG-XXX-brief-description
# Example:
git checkout -b hotfix/ENG-1234-fix-auth-bypass
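If branch names are generated by tooling, the `hotfix/ENG-XXX-brief-description` pattern above could be produced along these lines. This is a sketch only: `hotfixBranchName` is a hypothetical helper, and the ticket format check (uppercase prefix, dash, digits) is an assumption based on the `ENG-1234` example.

```typescript
// Hypothetical helper that builds a branch name in the
// hotfix/<TICKET>-<brief-description> shape shown above.
function hotfixBranchName(ticket: string, description: string): string {
  // Assumption: tickets look like ENG-1234 (uppercase prefix, dash, digits).
  if (!/^[A-Z]+-\d+$/.test(ticket)) {
    throw new Error(`Unexpected ticket format: ${ticket}`);
  }
  const slug = description
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse spaces and punctuation into dashes
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `hotfix/${ticket}-${slug}`;
}

console.log(hotfixBranchName("ENG-1234", "Fix auth bypass")); // hotfix/ENG-1234-fix-auth-bypass
```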
⚠️ CRITICAL: Minimal change only
DO:
✅ Fix the immediate issue
✅ Add regression test
✅ Document root cause in comments
DON'T:
❌ Refactor surrounding code
❌ Fix unrelated issues
❌ Add new features
❌ Update dependencies (unless that's the fix)
# Run full test suite
pnpm test
# Type check
pnpm tsc --noEmit
# Build verification
pnpm build
# Manual testing checklist:
# - [ ] Reproduce original issue
# - [ ] Verify fix resolves issue
# - [ ] Test happy path
# - [ ] Test edge cases
# - [ ] Verify no new issues introduced
git add .
git commit -m "fix: [brief description of fix]

Fixes critical issue where [description].
Root cause: [explanation].

Ticket: ENG-XXX
Priority: P0"
git push origin hotfix/ENG-XXX-brief-description
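For teams that script their commits, the message layout above can be assembled from structured fields. The sketch below assumes the same five pieces of information as the template; `HotfixCommit` and `hotfixCommitMessage` are hypothetical names. Git treats the first line as the subject and expects a blank line before the body, so the blank entries are deliberate.

```typescript
// Hypothetical shape for the fields used in the commit template above.
interface HotfixCommit {
  summary: string; // brief description of the fix
  issue: string; // what was going wrong
  rootCause: string; // why it was going wrong
  ticket: string; // e.g. ENG-1234
  priority: "P0" | "P1" | "P2" | "P3";
}

// Builds a commit message: subject line, blank line, body, blank line, trailers.
function hotfixCommitMessage(c: HotfixCommit): string {
  return [
    `fix: ${c.summary}`,
    "",
    `Fixes critical issue where ${c.issue}.`,
    `Root cause: ${c.rootCause}.`,
    "",
    `Ticket: ${c.ticket}`,
    `Priority: ${c.priority}`,
  ].join("\n");
}

console.log(
  hotfixCommitMessage({
    summary: "reject expired auth tokens",
    issue: "expired tokens were accepted",
    rootCause: "expiry check compared seconds to milliseconds",
    ticket: "ENG-1234",
    priority: "P0",
  })
);
```

The example values (including the root cause) are made up for illustration.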
Use clear labels in PR title:
- [RELEASE] - Direct to production
- [HOTFIX] - Critical fix, expedited review
- [P0] or [P1] - Priority indicator

## 🚨 [RELEASE] ENG-XXX: Brief description of fix
### Urgency
- [x] P0 - Security vulnerability
- [ ] P1 - Production down
- [ ] P2 - Major bug
- [ ] P3 - Business critical
### Impact
**Users affected**: [All users / Premium tier / Specific region / etc.]
**Severity**: [Choose one]
- [ ] Service completely unavailable
- [ ] Critical feature broken
- [ ] Security vulnerability
- [ ] Data integrity issue
- [ ] Degraded performance
**User impact**:
Describe how this affects end users.
### Root Cause
[Brief explanation of what caused the issue]
**How it happened:**
1. [Step 1]
2. [Step 2]
3. [Result: issue manifested]
**Why it wasn't caught:**
- [ ] Missing test coverage
- [ ] Race condition in production
- [ ] External service behavior changed
- [ ] Recent deployment introduced regression
- [ ] Other: [explain]
### The Fix
[What this PR changes to resolve the issue]
**Changes made:**
- Modified `file.ts` to [specific change]
- Added validation for [specific case]
- Fixed logic in [specific function]
**Why this fixes it:**
[Explanation of how the change resolves the root cause]
### Testing
- [ ] ✅ Reproduced issue locally
- [ ] ✅ Verified fix resolves issue
- [ ] ✅ Regression test added
- [ ] ✅ No other functionality affected
- [ ] ✅ Tested edge cases
- [ ] ✅ Deployed to staging and verified
### Regression Test
```typescript
// Test added to prevent recurrence
describe('ENG-XXX: Auth bypass fix', () => {
  it('should reject expired tokens', async () => {
    const expiredToken = generateExpiredToken();
    const response = await fetch('/api/protected', {
      headers: { Authorization: `Bearer ${expiredToken}` }
    });
    expect(response.status).toBe(401);
  });
});
```
If this causes issues:
# Option 1: Revert commit
git revert <commit-hash>
git push origin main
# Option 2: Deploy previous version
vercel rollback # or your platform's rollback command
# Option 3: Feature flag
Set FEATURE_FIX_XXX=false in environment
Monitoring:
Immediately after deploy:
Metrics to watch:
Update Linear ticket with resolution
Schedule post-incident review (if P0/P1)
Create tickets for proper fix (if this was a band-aid)
Update runbook/documentation
# 1. Merge PR to main
# (After approval or P0 emergency waiver)
# 2. Pull latest
git checkout main
git pull origin main
# 3. Verify commit
git log -1
# Confirm this is your hotfix commit
# 4. Tag release (if using semantic versioning)
git tag -a v2.3.5 -m "Hotfix: Fix auth bypass vulnerability"
git push origin v2.3.5
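A hotfix normally bumps only the patch component of a semantic version. As a rough sketch, assuming tags follow the `vMAJOR.MINOR.PATCH` shape used in the `v2.3.5` example, the next tag could be derived from the previous one like this (`nextPatchTag` is a hypothetical helper):

```typescript
// Hypothetical helper: given the latest release tag, return the tag for a hotfix.
// Assumes plain vMAJOR.MINOR.PATCH tags with no pre-release or build suffix.
function nextPatchTag(tag: string): string {
  const m = /^v(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  if (!m) {
    throw new Error(`Not a vMAJOR.MINOR.PATCH tag: ${tag}`);
  }
  const [, major, minor, patch] = m;
  return `v${major}.${minor}.${Number(patch) + 1}`;
}

console.log(nextPatchTag("v2.3.4")); // v2.3.5
```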
# Trigger production deployment
vercel --prod
# Or use Vercel dashboard:
# Deployments → Select commit → Deploy to Production
# Monitor deployment
vercel logs --follow
# Deploy via CLI
netlify deploy --prod
# Or trigger from dashboard:
# Deploys → Select commit → Publish deploy
# Push to main triggers deployment automatically
# Monitor in dashboard: railway.app/project/logs
# Follow platform-specific deployment process
# Example for AWS Elastic Beanstalk:
eb deploy production --staged
# Monitor:
eb logs --stream
# Test the specific fix
curl -X POST https://api.production.com/auth/login \
-H "Content-Type: application/json" \
-d '{"token": "expired_token"}'
# Expected: 401 Unauthorized
✅ Check Sentry/Rollbar/etc.:
- Error rate should drop
- No new errors introduced
⏱️ Monitor for 15-30 minutes after deployment
Check monitoring dashboard:
- API response times (should be normal)
- Error rates (should drop)
- Database performance (should be stable)
- Third-party service health
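The "error rates should drop" check can be made explicit during the 15-30 minute watch window. The sketch below is one possible rule, not a prescribed policy: `assessErrorRate` is a hypothetical function, and both thresholds (roll back if errors exceed the pre-deploy baseline, call it healthy below 10% of baseline) are assumptions chosen only for illustration.

```typescript
type Verdict = "healthy" | "keep-watching" | "rollback";

// baselinePerMin: error rate just before the hotfix was deployed.
// samplesPerMin: error rates observed after deployment, oldest first.
function assessErrorRate(baselinePerMin: number, samplesPerMin: number[]): Verdict {
  if (samplesPerMin.length === 0) {
    return "keep-watching"; // nothing observed yet
  }
  const latest = samplesPerMin[samplesPerMin.length - 1];
  const peak = Math.max(...samplesPerMin);
  // Assumed rule: errors above the pre-deploy level mean the fix made things worse.
  if (peak > baselinePerMin) {
    return "rollback";
  }
  // Assumed rule: below 10% of the baseline counts as recovered.
  if (latest < baselinePerMin * 0.1) {
    return "healthy";
  }
  return "keep-watching";
}

console.log(assessErrorRate(450, [300, 40, 0])); // healthy
console.log(assessErrorRate(450, [450, 600])); // rollback
```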
Monitor support channels:
- Support tickets
- In-app chat
- Social media
- Status page comments
🚨 **Production Hotfix Deployed**
**Issue**: [Brief description]
**Ticket**: ENG-XXX
**Priority**: P0
**Status**: ✅ Resolved
**Timeline:**
- Issue discovered: 14:23 UTC
- Fix deployed: 15:47 UTC
- Duration: 1h 24m
**Impact**:
[Who was affected and how]
**Root Cause**:
[Brief explanation]
**Fix**:
[What was changed]
**Verification**:
✅ Error rate dropped from 450/min to 0/min
✅ All systems operating normally
**PR**: https://github.com/org/repo/pull/XXX
**Follow-up**:
- [ ] Post-incident review scheduled for [date]
- [ ] Documentation updated
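The duration in the timeline is simple clock arithmetic, and computing it avoids mistakes when filling in the notice under pressure. A minimal sketch, assuming both times are `HH:MM` strings in UTC and the incident lasted less than 24 hours (`incidentDuration` is a hypothetical helper):

```typescript
// Returns the elapsed time between two HH:MM (UTC) timestamps as "Xh Ym".
function incidentDuration(discovered: string, deployed: string): string {
  const toMinutes = (hhmm: string): number => {
    const m = /^(\d{2}):(\d{2})$/.exec(hhmm);
    if (!m) {
      throw new Error(`Expected HH:MM, got ${hhmm}`);
    }
    return Number(m[1]) * 60 + Number(m[2]);
  };
  let diff = toMinutes(deployed) - toMinutes(discovered);
  if (diff < 0) {
    diff += 24 * 60; // assumption: the fix landed after midnight UTC
  }
  return `${Math.floor(diff / 60)}h ${diff % 60}m`;
}

console.log(incidentDuration("14:23", "15:47")); // 1h 24m
```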
🟢 Resolved - [Issue Title]
We've resolved an issue that was affecting [feature/service].
**What happened:**
Between 14:23 and 15:47 UTC, users experienced [specific issue].
**Current status:**
The issue has been fully resolved. All systems are operating normally.
**Next steps:**
We're conducting a thorough review to prevent similar issues in the future.
We apologize for any inconvenience.
Subject: Update on [Service] Issue - Resolved
Hi [User],
We're writing to update you on an issue that affected [feature/service] earlier today.
**What happened:**
Between [time] and [time], you may have experienced [specific issue].
**Resolution:**
Our team quickly identified and resolved the root cause. The service is now operating normally.
**What we're doing:**
We take these issues seriously and are:
- Conducting a full review of the incident
- Implementing additional safeguards
- Improving our monitoring
We apologize for any inconvenience this may have caused.
If you have any questions or concerns, please reach out to support@company.com.
Thank you for your patience.
The [Company] Team
Schedule within 72 hours
# Post-Incident Review: [ENG-XXX]
Date: YYYY-MM-DD
Severity: P0/P1
Duration: Xh Xm
## Summary
Brief description of the incident.
## Timeline (UTC)
- 14:23 - Issue first detected
- 14:25 - On-call engineer alerted
- 14:30 - Root cause identified
- 14:45 - Fix PR opened
- 15:20 - PR approved and merged
- 15:47 - Fix deployed to production
- 16:00 - Verified resolved
## Impact
- **Users affected**: ~5,000 users
- **Duration**: 1h 24m
- **User experience**: Unable to log in
- **Revenue impact**: Estimated $X in lost transactions
- **Reputation impact**: 23 support tickets, 5 social media mentions
## Root Cause
Detailed technical explanation of what caused the issue.
[Include code snippets, sequence diagrams if helpful]
## Resolution
What was changed to fix the issue.
## What Went Well
- ✅ Fast detection (2 minutes after deploy)
- ✅ Clear reproduction steps identified quickly
- ✅ Team collaborated effectively
- ✅ Fix deployed in under 90 minutes
## What Went Wrong
- ❌ Missing test coverage for expired token edge case
- ❌ Staging didn't catch the issue (different token expiry settings)
- ❌ No automatic rollback triggered
- ❌ Monitoring alert threshold too high
## Action Items
- [ ] ENG-XXX-1: Add test for expired token validation (@engineer, by YYYY-MM-DD)
- [ ] ENG-XXX-2: Align staging token expiry with production (@devops, by YYYY-MM-DD)
- [ ] ENG-XXX-3: Implement automatic rollback on error spike (@platform, by YYYY-MM-DD)
- [ ] ENG-XXX-4: Lower monitoring alert threshold (@observability, by YYYY-MM-DD)
- [ ] ENG-XXX-5: Add runbook for similar issues (@oncall, by YYYY-MM-DD)
## Prevention
How we'll prevent this from happening again:
1. **Testing**: Add test coverage for edge cases
2. **Monitoring**: Improve alerting thresholds
3. **Process**: Update deployment checklist
4. **Documentation**: Create runbook for on-call
## Lessons Learned
Key takeaways for the team.
When fix needs to go to multiple branches/environments:
# 1. Fix applied to main (production)
git checkout main
git cherry-pick <hotfix-commit-hash>
# 2. Backport to release candidate
git checkout release-candidate
git cherry-pick <hotfix-commit-hash>
git push origin release-candidate
# 3. Backport to develop
git checkout develop
git cherry-pick <hotfix-commit-hash>
git push origin develop
# If these branches are protected, cherry-pick onto a backport branch instead and create a PR for each:
gh pr create --base release-candidate --head backport/rc/hotfix-ENG-XXX
gh pr create --base develop --head backport/dev/hotfix-ENG-XXX
# If cherry-pick fails due to conflicts
git cherry-pick <hotfix-commit-hash>
# CONFLICT in file.ts
# Resolve conflicts manually
# Then:
git add file.ts
git cherry-pick --continue
# Create patch from hotfix
git format-patch -1 <hotfix-commit-hash>
# Creates: 0001-fix-auth-bypass.patch
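# Apply to other branch (git apply only changes the working tree; commit afterwards, or use git am to keep the original commit message)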
git checkout release-candidate
git apply 0001-fix-auth-bypass.patch
Is production broken?
├─ Yes → Severity level?
│ ├─ P0 (security/down) → Deploy immediately, inform after
│ ├─ P1 (critical bug) → Fast-track PR, deploy same day
│ └─ P2 (major bug) → Standard expedited process
└─ No → Use normal deployment process
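The decision tree above maps directly onto a small function, which can be handy in a chat bot or on-call tool. This is a sketch under the assumption that only P0 to P2 apply once production is broken, as in the tree; `releasePath` is a hypothetical name.

```typescript
type Level = "P0" | "P1" | "P2";

// Mirrors the decision tree: first ask whether production is broken,
// then branch on the severity level.
function releasePath(productionBroken: boolean, level?: Level): string {
  if (!productionBroken) {
    return "Use normal deployment process";
  }
  switch (level) {
    case "P0":
      return "Deploy immediately, inform after";
    case "P1":
      return "Fast-track PR, deploy same day";
    case "P2":
      return "Standard expedited process";
    default:
      throw new Error("Severity level is required when production is broken");
  }
}

console.log(releasePath(true, "P1")); // Fast-track PR, deploy same day
console.log(releasePath(false)); // Use normal deployment process
```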
Use this skill when production issues require immediate attention and fast-track deployment outside normal release processes.
First seen: Jan 23, 2026