release-it by wondelai/skills
npx skills add https://github.com/wondelai/skills --skill release-it
Framework for designing, deploying, and operating production-ready software systems. Based on a fundamental truth: the software that passes QA is not the software that survives production. Production is a hostile environment -- and your system must be built to expect and handle failure at every level.
Every system will eventually be pushed beyond its design limits. The question is not whether failures will happen, but whether your system degrades gracefully or collapses catastrophically. Production-ready software is not just correct -- it is resilient, observable, and designed to operate through partial failures without human intervention.
Goal: 10/10. When reviewing or creating production systems, rate them 0-10 based on adherence to the principles below. A 10/10 means full alignment with all guidelines; lower scores indicate gaps to address. Always provide the current score and specific improvements needed to reach 10/10.
Six areas that determine whether software survives contact with production:
Core concept: Failures propagate through integration points, cascading across system boundaries. The most dangerous patterns are not bugs in your code -- they are emergent behaviors that arise when systems interact under stress.
Why it works: Recognizing anti-patterns lets you identify and eliminate the cracks before production traffic finds them. Every production outage traces back to one or more of these patterns. They are predictable, recurring, and preventable.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| HTTP calls | Assume every remote call can fail, hang, or return garbage | Wrap all external calls with timeout + circuit breaker |
| Database queries | Enforce result set limits on every query | Add LIMIT clause; paginate all list endpoints |
| Thread pools | Isolate pools per dependency to prevent cross-contamination | Separate thread pool for payment gateway vs. search |
| Load testing | Simulate realistic traffic including spikes and abuse patterns | Use production traffic replays, not synthetic happy-path scripts |
| Marketing events | Coordinate launches with capacity planning | Pre-scale before Black Friday; add queue for coupon redemption |
See: references/anti-patterns.md for detailed analysis of each anti-pattern with failure scenarios and detection strategies.
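The "Database queries" row above — enforce result-set limits on every query — can be sketched as a small guard. This is an illustrative Python sketch, not code from the skill's references; the function names, table, and columns are hypothetical:

```python
def clamp_page_params(page: int, limit: int, max_limit: int = 100) -> tuple[int, int]:
    """Never trust caller-supplied page sizes: clamp to a hard ceiling."""
    page = max(page, 1)
    limit = min(max(limit, 1), max_limit)
    return page, limit

def paged_query(page: int, limit: int) -> str:
    """Build a bounded query: every list endpoint gets LIMIT/OFFSET."""
    page, limit = clamp_page_params(page, limit)
    offset = (page - 1) * limit
    return f"SELECT id, name FROM items ORDER BY id LIMIT {limit} OFFSET {offset}"
```

Even a caller asking for 5,000 rows gets at most `max_limit`, so a single abusive or buggy client cannot drag an unbounded result set through the service.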
Core concept: Counter each anti-pattern with a stability pattern. Circuit breakers stop cascading failures. Bulkheads isolate blast radius. Timeouts reclaim stuck resources. Together they create a system that bends under load but does not break.
Why it works: These patterns work because they accept failure as inevitable and design the system's response to failure, rather than trying to prevent all failures. A circuit breaker that trips is the system working correctly -- it is protecting itself from a downstream failure.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Service calls | Circuit Breaker with threshold and recovery timeout | Open after 5 failures in 60s; half-open after 30s |
| Resource isolation | Bulkhead with dedicated pools per dependency | Separate connection pools for critical vs. non-critical services |
| Network calls | Timeout with propagation | Connect: 1s, read: 5s; propagate deadline to downstream calls |
| Retries | Exponential backoff + jitter + retry budget | Base 100ms, max 3 retries, 20% retry budget across fleet |
| Data cleanup | Steady State with automated purging | Delete sessions older than 24h; rotate logs at 500MB |
See: references/stability-patterns.md for implementation details, state machines, threshold tuning, and pattern combinations.
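The circuit breaker row above (open after 5 failures in 60s, half-open after 30s) can be sketched as a small state machine. The class and its API are illustrative, not a prescribed implementation; the injectable `clock` is an assumption added to make the sketch testable:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` failures inside
    `window` seconds, half-open after `recovery` seconds, close on success."""

    def __init__(self, threshold=5, window=60.0, recovery=30.0, clock=time.monotonic):
        self.threshold, self.window, self.recovery = threshold, window, recovery
        self.clock = clock
        self.failures = []      # timestamps of recent failures
        self.opened_at = None   # None => closed (or probing after success)

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery:
            return "half-open"
        return "open"

    def allow(self):
        """Callers check this before making the protected call."""
        return self.state() != "open"

    def record_success(self):
        self.failures.clear()
        self.opened_at = None   # half-open probe succeeded: close

    def record_failure(self):
        now = self.clock()
        if self.state() == "half-open":
            self.opened_at = now    # probe failed: re-open immediately
            return
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

A tripped breaker returning `allow() == False` is the system working correctly: callers fail fast instead of queuing behind a dead dependency.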
Core concept: Capacity is not a single number -- it is a multi-dimensional function of CPU, memory, network, disk I/O, connection pools, and thread counts. Capacity planning means understanding which resource becomes the bottleneck first and at what load level.
Why it works: Systems that are not capacity-tested fail in production at the worst possible moment -- during peak load. Understanding your system's actual limits (not theoretical limits) lets you set realistic SLAs and plan scaling before users hit the wall.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Load testing | Ramp to expected peak, then 2x, observe degradation curve | Gradually increase RPS until latency exceeds SLO |
| Connection pools | Size based on measured concurrency, not defaults | Measure active connections under load; set pool to P99 + 20% headroom |
| Auto-scaling | Define scaling triggers with appropriate cooldown | Scale on CPU > 70% sustained 3 min; cooldown 5 min |
| Soak testing | Run at 80% capacity for 24-72 hours | Catch memory leaks, connection leaks, file handle exhaustion |
| Capacity model | Document resource bottleneck per service | "Service X is memory-bound at 2000 RPS; needs 4GB per instance" |
See: references/capacity-planning.md for testing methodologies, resource pool management, and scalability modeling.
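The connection-pool sizing rule above (P99 of measured concurrency plus 20% headroom) reduces to a small calculation. A sketch, assuming you have collected samples of active-connection counts during a load test; the nearest-rank percentile method is an implementation choice, not mandated by the source:

```python
import math

def pool_size_from_samples(active_connections: list[int], headroom: float = 0.20) -> int:
    """Size a connection pool from measured concurrency under load:
    nearest-rank P99 of observed active connections, plus headroom, rounded up."""
    samples = sorted(active_connections)
    p99 = samples[min(len(samples) - 1, math.ceil(0.99 * len(samples)) - 1)]
    return math.ceil(p99 * (1 + headroom))
```

Sizing from measurement rather than framework defaults ties the pool to the resource that actually bottlenecks first under your traffic.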
Core concept: Deployment (putting code on servers) and release (exposing code to users) are separate operations that should be decoupled. Separating them gives you the ability to deploy without risk and release with confidence.
Why it works: Most outages are caused by changes -- deployments, configuration updates, database migrations. Decoupling deployment from release means you can deploy code to production, verify it works, and only then route traffic to it. If something goes wrong, you roll back the release, not the deployment.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Deploys | Blue-green with health check gate | Deploy to green; run smoke tests; swap router |
| Progressive rollout | Canary with automated rollback | Route 5% traffic to canary; auto-rollback if error rate > 1% |
| Feature launch | Feature flags with emergency off switch | Ship code behind flag; enable for 10% of users; monitor; ramp |
| Schema changes | Expand-contract migration pattern | Add new column; deploy code that writes both; backfill; drop old column |
| Rollback | Instant rollback via traffic routing | Keep previous version running; rollback = switch load balancer target |
See: references/deployment-strategies.md for deployment patterns, migration strategies, and infrastructure-as-code practices.
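The canary row's auto-rollback rule (roll back if error rate > 1%) can be expressed as a simple decision function. A sketch with illustrative thresholds; the `min_requests` guard is an assumption added here so a single early error cannot trigger a rollback:

```python
def canary_decision(canary_errors: int, canary_requests: int,
                    max_error_rate: float = 0.01, min_requests: int = 100) -> str:
    """Decide a canary's fate from its observed error rate.
    Returns 'wait' until enough traffic has been seen, then
    'rollback' if the error rate exceeds the threshold, else 'promote'."""
    if canary_requests < min_requests:
        return "wait"  # not enough data to judge yet
    error_rate = canary_errors / canary_requests
    return "rollback" if error_rate > max_error_rate else "promote"
```

Because release is decoupled from deployment, "rollback" here means re-routing traffic to the previous version, not redeploying code.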
Core concept: You cannot operate what you cannot observe. Observability is not an afterthought -- it is a first-class design concern. Health checks, metrics, logs, and traces are the sensory organs of your system in production.
Why it works: Production systems fail in ways that are invisible without proper instrumentation. A health check that only returns "OK" tells you nothing. Metrics without context are noise. Observability done right gives you the ability to answer questions about your system that you did not anticipate at design time.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Health endpoints | Deep health check with dependency status | /health returns status of DB, cache, queue, and disk space |
| Service metrics | RED method instrumentation | Track request rate, error rate, and p50/p95/p99 latency per endpoint |
| Resource metrics | USE method for infrastructure | Track CPU utilization, request queue depth, and error counts per host |
| Distributed tracing | Propagate trace context across service boundaries | Inject trace ID in headers; correlate logs across services |
| Alerting | Alert on SLO burn rate, not raw thresholds | "Error budget burning 10x normal rate" vs. "CPU > 80%" |
See: references/observability.md for health check design, metrics instrumentation, SLO frameworks, and alerting strategies.
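A deep health check like the one in the table might aggregate dependency probes as follows. The response shape and status-code choice (503 when degraded) are illustrative assumptions, not a prescribed format:

```python
def deep_health(checks: dict) -> tuple[int, dict]:
    """Aggregate dependency probes into a deep health response.
    `checks` maps dependency name -> zero-arg probe returning True/False
    (or raising on failure). Returns (http_status, body) for /health."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            results[name] = "fail"  # a raising probe is a failing dependency
    healthy = all(v == "ok" for v in results.values())
    body = {"status": "ok" if healthy else "degraded", "dependencies": results}
    return (200 if healthy else 503), body
```

Returning per-dependency status (rather than a bare "OK") lets the load balancer pull a partially broken instance and lets operators see which dependency failed.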
Safety note: Chaos engineering experiments are design-time planning activities. The patterns below describe what to test and what to verify, not actions for an AI agent to execute autonomously. All failure injection must be performed by authorized engineers using dedicated tooling (e.g., Gremlin, Litmus, AWS FIS) with proper approvals, rollback plans, and blast radius controls in place.
Core concept: Confidence in your system's resilience comes from testing it under realistic failure conditions. Chaos engineering is the discipline of experimenting on a system in a controlled environment to build confidence in its ability to withstand turbulent conditions.
Why it works: You cannot know how your system handles failure until it actually fails. Waiting for production incidents to discover weaknesses is reactive and expensive. Chaos engineering proactively injects failures in a controlled way, turning unknown-unknowns into known-knowns before they cause real outages.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Process failure | Controlled instance termination (via chaos tooling) | Terminate one pod using Gremlin/Litmus; verify service recovers within SLO |
| Network failure | Inject latency or partition between services (via chaos tooling) | Add 500ms latency to DB calls; verify circuit breaker trips |
| Dependency failure | Simulate downstream service outage (via chaos tooling) | Return 503 from payment API; verify graceful degradation |
| Resource exhaustion | Simulate resource pressure (via chaos tooling) | Stress-test memory limits; verify process restarts cleanly |
| GameDay | Scheduled team exercise with realistic failure scenario | "Primary database goes read-only at 2pm" -- practice response |
See: references/chaos-engineering.md for experiment design, blast radius management, and building a chaos engineering practice.
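Consistent with the safety note, a chaos experiment is best captured first as a declarative plan that engineers review before anything is injected. A hypothetical experiment definition; the field names and values are illustrative:

```python
# A planning artifact for review and approval -- not something to execute.
EXPERIMENT = {
    "hypothesis": "If one pod is terminated, p99 latency stays within SLO",
    "method": "terminate one pod via chaos tooling (e.g., Litmus)",
    "blast_radius": {"environment": "staging", "max_affected_instances": 1},
    "abort_conditions": ["error rate > 1%", "p99 latency > 2x baseline"],
    "rollback_plan": "halt injection; scheduler replaces the pod automatically",
}
```

Writing the hypothesis and abort conditions down before the GameDay is what turns failure injection from risk-taking into a controlled experiment.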
| Mistake | Why It Fails | Fix |
|---|---|---|
| No timeouts on outbound calls | One slow dependency freezes the entire system | Set connect and read timeouts on every external call |
| Unbounded retries | Retry storms amplify failures instead of recovering from them | Use exponential backoff, jitter, and fleet-wide retry budgets |
| Shared thread/connection pools | One failing dependency drains resources from all features | Bulkhead: isolate pools per dependency or feature |
| Shallow health checks only | Load balancer routes traffic to instances with broken dependencies | Implement deep health checks that verify downstream connectivity |
| Testing only the happy path | System works perfectly until the first real failure | Load test, soak test, and chaos test before every major release |
| Coupling deploy and release | Every deployment is a high-risk event with all-or-nothing rollout | Use feature flags, canary releases, and blue-green deployments |
| Alerting on causes, not symptoms | High CPU alerts fire but users are fine; errors spike but no alert fires | Alert on user-facing SLIs: error rate, latency, availability |
| No capacity model | System falls over at 2x load during an event nobody planned for | Model bottleneck resources; load test to 3x expected peak |
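The fix for unbounded retries (exponential backoff with jitter and a bounded retry count) can be sketched as a delay schedule. This is the full-jitter variant; the parameter values and injectable `rng` are illustrative assumptions:

```python
import random

def backoff_delays(base: float = 0.1, retries: int = 3, cap: float = 2.0,
                   rng=None) -> list[float]:
    """Full-jitter exponential backoff: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], bounding both retry count and spread."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * (2 ** attempt)))
            for attempt in range(retries)]
```

Jitter spreads retries out so that a fleet of clients recovering from the same failure does not synchronize into a retry storm; the fleet-wide retry budget from the stability table bounds the total extra load on top of this.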
Audit any production system:
| Question | If No | Action |
|---|---|---|
| Does every outbound call have a timeout? | Calls can hang indefinitely, blocking threads | Add connect and read timeouts to all external calls |
| Are circuit breakers in place for critical dependencies? | One dependency failure takes down the whole system | Add circuit breakers with appropriate thresholds |
| Are thread/connection pools isolated per dependency? | Shared pools allow cross-contamination of failures | Implement bulkhead pattern with dedicated pools |
| Can you deploy without downtime? | Deployments cause user-visible outages | Implement rolling, blue-green, or canary deployment |
| Do health checks verify dependency connectivity? | Dead instances receive traffic; partial failures go undetected | Add deep health checks that test DB, cache, queue |
| Are logs, metrics, and traces correlated? | Debugging requires manual log searching across services | Implement distributed tracing with correlated IDs |
| Have you load-tested beyond expected peak? | Unknown failure mode under real load | Load test to 2-3x expected peak; document breaking point |
| Do you practice failure injection? | Resilience is theoretical, not verified | Start chaos engineering with low-risk experiments |
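Burn-rate alerting, which the observability section recommends over raw thresholds, reduces to one ratio: the observed error rate divided by the error rate the SLO budgets for. A sketch assuming a request-based availability SLO:

```python
def burn_rate(errors: int, requests: int, slo_availability: float = 0.999) -> float:
    """How fast the error budget is burning: observed error rate divided by
    the budgeted error rate (1 - SLO). 1.0 means burning exactly on budget."""
    budget = 1.0 - slo_availability
    observed = errors / requests if requests else 0.0
    return observed / budget
```

A burn rate of 10 means the month's error budget would be exhausted in a tenth of the month, which is pageable regardless of what CPU looks like.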
This skill is based on Michael Nygard's essential guide to building production-ready software. For the complete methodology, war stories, and implementation details, read Release It! (Second Edition, 2018).
Michael T. Nygard is a software architect and author with over 30 years of experience building and operating large-scale production systems. He has worked across industries including finance, retail, and government, and has been responsible for systems handling millions of transactions per day. Nygard is known for bridging the gap between development and operations, advocating that architects must be responsible for the systems they design long after the code is written. The first edition of Release It! (2007) became a foundational text in the DevOps and site reliability engineering movements. The second edition (2018) expands coverage to cloud-native architectures, containerization, and modern deployment practices. Nygard is a frequent conference speaker and has contributed to the broader conversation about resilience engineering, sociotechnical systems, and the human factors that influence production stability.
Weekly installs: 195 · GitHub stars: 255 · First seen: Feb 23, 2026
Security audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: codex (187), gemini-cli (186), kimi-cli (186), cursor (186), opencode (186), github-copilot (186)