release-it by wondelai/skills
npx skills add https://github.com/wondelai/skills --skill release-it
Framework for designing, deploying, and operating production-ready software systems. Based on a fundamental truth: the software that passes QA is not the software that survives production. Production is a hostile environment -- and your system must be built to expect and handle failure at every level.
Every system will eventually be pushed beyond its design limits. The question is not whether failures will happen, but whether your system degrades gracefully or collapses catastrophically. Production-ready software is not just correct -- it is resilient, observable, and designed to operate through partial failures without human intervention.
Goal: 10/10. When reviewing or creating production systems, rate them 0-10 based on adherence to the principles below. A 10/10 means full alignment with all guidelines; lower scores indicate gaps to address. Always provide the current score and specific improvements needed to reach 10/10.
Six areas that determine whether software survives contact with production:
Core concept: Failures propagate through integration points, cascading across system boundaries. The most dangerous patterns are not bugs in your code -- they are emergent behaviors that arise when systems interact under stress.
Why it works: Recognizing anti-patterns lets you identify and eliminate the cracks before production traffic finds them. Every production outage traces back to one or more of these patterns. They are predictable, recurring, and preventable.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| HTTP calls | Assume every remote call can fail, hang, or return garbage | Wrap all external calls with timeout + circuit breaker |
| Database queries | Enforce result set limits on every query | Add LIMIT clause; paginate all list endpoints |
| Thread pools | Isolate pools per dependency to prevent cross-contamination | Separate thread pool for payment gateway vs. search |
| Load testing | Simulate realistic traffic including spikes and abuse patterns | Use production traffic replays, not synthetic happy-path scripts |
| Marketing events | Coordinate launches with capacity planning | Pre-scale before Black Friday; add queue for coupon redemption |
See: references/anti-patterns.md for detailed analysis of each anti-pattern with failure scenarios and detection strategies.
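The "Database queries" row above — enforce result-set limits on every query — can be sketched as a small guard. This is an illustrative Python sketch, not code from the skill's references; the function names, table, and columns are hypothetical:

```python
def clamp_page_params(page: int, limit: int, max_limit: int = 100) -> tuple[int, int]:
    """Never trust caller-supplied page sizes: clamp to a hard ceiling."""
    page = max(page, 1)
    limit = min(max(limit, 1), max_limit)
    return page, limit

def paged_query(page: int, limit: int) -> str:
    """Build a bounded query: every list endpoint gets LIMIT/OFFSET."""
    page, limit = clamp_page_params(page, limit)
    offset = (page - 1) * limit
    return f"SELECT id, name FROM items ORDER BY id LIMIT {limit} OFFSET {offset}"
```

Even a caller asking for 5,000 rows gets at most `max_limit`, so a single abusive or buggy client cannot drag an unbounded result set through the service.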
Core concept: Counter each anti-pattern with a stability pattern. Circuit breakers stop cascading failures. Bulkheads isolate blast radius. Timeouts reclaim stuck resources. Together they create a system that bends under load but does not break.
Why it works: These patterns work because they accept failure as inevitable and design the system's response to failure, rather than trying to prevent all failures. A circuit breaker that trips is the system working correctly -- it is protecting itself from a downstream failure.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Service calls | Circuit Breaker with threshold and recovery timeout | Open after 5 failures in 60s; half-open after 30s |
| Resource isolation | Bulkhead with dedicated pools per dependency | Separate connection pools for critical vs. non-critical services |
| Network calls | Timeout with propagation | Connect: 1s, read: 5s; propagate deadline to downstream calls |
| Retries | Exponential backoff + jitter + retry budget | Base 100ms, max 3 retries, 20% retry budget across fleet |
| Data cleanup | Steady State with automated purging | Delete sessions older than 24h; rotate logs at 500MB |
See: references/stability-patterns.md for implementation details, state machines, threshold tuning, and pattern combinations.
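The circuit breaker row above (open after 5 failures in 60s, half-open after 30s) can be sketched as a small state machine. The class and its API are illustrative, not a prescribed implementation; the injectable `clock` is an assumption added to make the sketch testable:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` failures inside
    `window` seconds, half-open after `recovery` seconds, close on success."""

    def __init__(self, threshold=5, window=60.0, recovery=30.0, clock=time.monotonic):
        self.threshold, self.window, self.recovery = threshold, window, recovery
        self.clock = clock
        self.failures = []      # timestamps of recent failures
        self.opened_at = None   # None => closed (or probing after success)

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery:
            return "half-open"
        return "open"

    def allow(self):
        """Callers check this before making the protected call."""
        return self.state() != "open"

    def record_success(self):
        self.failures.clear()
        self.opened_at = None   # half-open probe succeeded: close

    def record_failure(self):
        now = self.clock()
        if self.state() == "half-open":
            self.opened_at = now    # probe failed: re-open immediately
            return
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

A tripped breaker returning `allow() == False` is the system working correctly: callers fail fast instead of queuing behind a dead dependency.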
Core concept: Capacity is not a single number -- it is a multi-dimensional function of CPU, memory, network, disk I/O, connection pools, and thread counts. Capacity planning means understanding which resource becomes the bottleneck first and at what load level.
Why it works: Systems that are not capacity-tested fail in production at the worst possible moment -- during peak load. Understanding your system's actual limits (not theoretical limits) lets you set realistic SLAs and plan scaling before users hit the wall.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Load testing | Ramp to expected peak, then 2x, observe degradation curve | Gradually increase RPS until latency exceeds SLO |
| Connection pools | Size based on measured concurrency, not defaults | Measure active connections under load; set pool to P99 + 20% headroom |
| Auto-scaling | Define scaling triggers with appropriate cooldown | Scale on CPU > 70% sustained 3 min; cooldown 5 min |
| Soak testing | Run at 80% capacity for 24-72 hours | Catch memory leaks, connection leaks, file handle exhaustion |
| Capacity model | Document resource bottleneck per service | "Service X is memory-bound at 2000 RPS; needs 4GB per instance" |
See: references/capacity-planning.md for testing methodologies, resource pool management, and scalability modeling.
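The connection-pool sizing rule above (P99 of measured concurrency plus 20% headroom) reduces to a small calculation. A sketch, assuming you have collected samples of active-connection counts during a load test; the nearest-rank percentile method is an implementation choice, not mandated by the source:

```python
import math

def pool_size_from_samples(active_connections: list[int], headroom: float = 0.20) -> int:
    """Size a connection pool from measured concurrency under load:
    nearest-rank P99 of observed active connections, plus headroom, rounded up."""
    samples = sorted(active_connections)
    p99 = samples[min(len(samples) - 1, math.ceil(0.99 * len(samples)) - 1)]
    return math.ceil(p99 * (1 + headroom))
```

Sizing from measurement rather than framework defaults ties the pool to the resource that actually bottlenecks first under your traffic.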
Core concept: Deployment (putting code on servers) and release (exposing code to users) are separate operations that should be decoupled. Separating them gives you the ability to deploy without risk and release with confidence.
Why it works: Most outages are caused by changes -- deployments, configuration updates, database migrations. Decoupling deployment from release means you can deploy code to production, verify it works, and only then route traffic to it. If something goes wrong, you roll back the release, not the deployment.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Deploys | Blue-green with health check gate | Deploy to green; run smoke tests; swap router |
| Progressive rollout | Canary with automated rollback | Route 5% traffic to canary; auto-rollback if error rate > 1% |
| Feature launch | Feature flags with emergency off switch | Ship code behind flag; enable for 10% of users; monitor; ramp |
| Schema changes | Expand-contract migration pattern | Add new column; deploy code that writes both; backfill; drop old column |
| Rollback | Instant rollback via traffic routing | Keep previous version running; rollback = switch load balancer target |
See: references/deployment-strategies.md for deployment patterns, migration strategies, and infrastructure-as-code practices.
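The canary row's auto-rollback rule (roll back if error rate > 1%) can be expressed as a simple decision function. A sketch with illustrative thresholds; the `min_requests` guard is an assumption added here so a single early error cannot trigger a rollback:

```python
def canary_decision(canary_errors: int, canary_requests: int,
                    max_error_rate: float = 0.01, min_requests: int = 100) -> str:
    """Decide a canary's fate from its observed error rate.
    Returns 'wait' until enough traffic has been seen, then
    'rollback' if the error rate exceeds the threshold, else 'promote'."""
    if canary_requests < min_requests:
        return "wait"  # not enough data to judge yet
    error_rate = canary_errors / canary_requests
    return "rollback" if error_rate > max_error_rate else "promote"
```

Because release is decoupled from deployment, "rollback" here means re-routing traffic to the previous version, not redeploying code.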
Core concept: You cannot operate what you cannot observe. Observability is not an afterthought -- it is a first-class design concern. Health checks, metrics, logs, and traces are the sensory organs of your system in production.
Why it works: Production systems fail in ways that are invisible without proper instrumentation. A health check that only returns "OK" tells you nothing. Metrics without context are noise. Observability done right gives you the ability to answer questions about your system that you did not anticipate at design time.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Health endpoints | Deep health check with dependency status | /health returns status of DB, cache, queue, and disk space |
| Service metrics | RED method instrumentation | Track request rate, error rate, and p50/p95/p99 latency per endpoint |
| Resource metrics | USE method for infrastructure | Track CPU utilization, request queue depth, and error counts per host |
| Distributed tracing | Propagate trace context across service boundaries | Inject trace ID in headers; correlate logs across services |
| Alerting | Alert on SLO burn rate, not raw thresholds | "Error budget burning 10x normal rate" vs. "CPU > 80%" |
See: references/observability.md for health check design, metrics instrumentation, SLO frameworks, and alerting strategies.
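A deep health check like the one in the table might aggregate dependency probes as follows. The response shape and status-code choice (503 when degraded) are illustrative assumptions, not a prescribed format:

```python
def deep_health(checks: dict) -> tuple[int, dict]:
    """Aggregate dependency probes into a deep health response.
    `checks` maps dependency name -> zero-arg probe returning True/False
    (or raising on failure). Returns (http_status, body) for /health."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            results[name] = "fail"  # a raising probe is a failing dependency
    healthy = all(v == "ok" for v in results.values())
    body = {"status": "ok" if healthy else "degraded", "dependencies": results}
    return (200 if healthy else 503), body
```

Returning per-dependency status (rather than a bare "OK") lets the load balancer pull a partially broken instance and lets operators see which dependency failed.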
Safety note: Chaos engineering experiments are design-time planning activities. The patterns below describe what to test and what to verify, not actions for an AI agent to execute autonomously. All failure injection must be performed by authorized engineers using dedicated tooling (e.g., Gremlin, Litmus, AWS FIS) with proper approvals, rollback plans, and blast radius controls in place.
Core concept: Confidence in your system's resilience comes from testing it under realistic failure conditions. Chaos engineering is the discipline of experimenting on a system in a controlled environment to build confidence in its ability to withstand turbulent conditions.
Why it works: You cannot know how your system handles failure until it actually fails. Waiting for production incidents to discover weaknesses is reactive and expensive. Chaos engineering proactively injects failures in a controlled way, turning unknown-unknowns into known-knowns before they cause real outages.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Process failure | Controlled instance termination (via chaos tooling) | Terminate one pod using Gremlin/Litmus; verify service recovers within SLO |
| Network failure | Inject latency or partition between services (via chaos tooling) | Add 500ms latency to DB calls; verify circuit breaker trips |
| Dependency failure | Simulate downstream service outage (via chaos tooling) | Return 503 from payment API; verify graceful degradation |
| Resource exhaustion | Simulate resource pressure (via chaos tooling) | Stress-test memory limits; verify process restarts cleanly |
| GameDay | Scheduled team exercise with realistic failure scenario | "Primary database goes read-only at 2pm" -- practice response |
See: references/chaos-engineering.md for experiment design, blast radius management, and building a chaos engineering practice.
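Consistent with the safety note, a chaos experiment is best captured first as a declarative plan that engineers review before anything is injected. A hypothetical experiment definition; the field names and values are illustrative:

```python
# A planning artifact for review and approval -- not something to execute.
EXPERIMENT = {
    "hypothesis": "If one pod is terminated, p99 latency stays within SLO",
    "method": "terminate one pod via chaos tooling (e.g., Litmus)",
    "blast_radius": {"environment": "staging", "max_affected_instances": 1},
    "abort_conditions": ["error rate > 1%", "p99 latency > 2x baseline"],
    "rollback_plan": "halt injection; scheduler replaces the pod automatically",
}
```

Writing the hypothesis and abort conditions down before the GameDay is what turns failure injection from risk-taking into a controlled experiment.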
| Mistake | Why It Fails | Fix |
|---|---|---|
| No timeouts on outbound calls | One slow dependency freezes the entire system | Set connect and read timeouts on every external call |
| Unbounded retries | Retry storms amplify failures instead of recovering from them | Use exponential backoff, jitter, and fleet-wide retry budgets |
| Shared thread/connection pools | One failing dependency drains resources from all features | Bulkhead: isolate pools per dependency or feature |
| Shallow health checks only | Load balancer routes traffic to instances with broken dependencies | Implement deep health checks that verify downstream connectivity |
| Testing only the happy path | System works perfectly until the first real failure | Load test, soak test, and chaos test before every major release |
| Coupling deploy and release | Every deployment is a high-risk event with all-or-nothing rollout | Use feature flags, canary releases, and blue-green deployments |
| Alerting on causes, not symptoms | High CPU alerts fire but users are fine; errors spike but no alert fires | Alert on user-facing SLIs: error rate, latency, availability |
| No capacity model | System falls over at 2x load during an event nobody planned for | Model bottleneck resources; load test to 3x expected peak |
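The fix for unbounded retries (exponential backoff with jitter and a bounded retry count) can be sketched as a delay schedule. This is the full-jitter variant; the parameter values and injectable `rng` are illustrative assumptions:

```python
import random

def backoff_delays(base: float = 0.1, retries: int = 3, cap: float = 2.0,
                   rng=None) -> list[float]:
    """Full-jitter exponential backoff: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], bounding both retry count and spread."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * (2 ** attempt)))
            for attempt in range(retries)]
```

Jitter spreads retries out so that a fleet of clients recovering from the same failure does not synchronize into a retry storm; the fleet-wide retry budget from the stability table bounds the total extra load on top of this.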
Audit any production system:
| Question | If No | Action |
|---|---|---|
| Does every outbound call have a timeout? | Calls can hang indefinitely, blocking threads | Add connect and read timeouts to all external calls |
| Are circuit breakers in place for critical dependencies? | One dependency failure takes down the whole system | Add circuit breakers with appropriate thresholds |
| Are thread/connection pools isolated per dependency? | Shared pools allow cross-contamination of failures | Implement bulkhead pattern with dedicated pools |
| Can you deploy without downtime? | Deployments cause user-visible outages | Implement rolling, blue-green, or canary deployment |
| Do health checks verify dependency connectivity? | Dead instances receive traffic; partial failures go undetected | Add deep health checks that test DB, cache, queue |
| Are logs, metrics, and traces correlated? | Debugging requires manual log searching across services | Implement distributed tracing with correlated IDs |
| Have you load-tested beyond expected peak? | Unknown failure mode under real load | Load test to 2-3x expected peak; document breaking point |
| Do you practice failure injection? | Resilience is theoretical, not verified | Start chaos engineering with low-risk experiments |
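Burn-rate alerting, which the observability section recommends over raw thresholds, reduces to one ratio: the observed error rate divided by the error rate the SLO budgets for. A sketch assuming a request-based availability SLO:

```python
def burn_rate(errors: int, requests: int, slo_availability: float = 0.999) -> float:
    """How fast the error budget is burning: observed error rate divided by
    the budgeted error rate (1 - SLO). 1.0 means burning exactly on budget."""
    budget = 1.0 - slo_availability
    observed = errors / requests if requests else 0.0
    return observed / budget
```

A burn rate of 10 means the month's error budget would be exhausted in a tenth of the month, which is pageable regardless of what CPU looks like.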
This skill is based on Michael Nygard's essential guide to building production-ready software. For the complete methodology, war stories, and implementation details, read Release It! (Second Edition, 2018).
Michael T. Nygard is a software architect and author with over 30 years of experience building and operating large-scale production systems. He has worked across industries including finance, retail, and government, and has been responsible for systems handling millions of transactions per day. Nygard is known for bridging the gap between development and operations, advocating that architects must be responsible for the systems they design long after the code is written. The first edition of Release It! (2007) became a foundational text in the DevOps and site reliability engineering movements. The second edition (2018) expands coverage to cloud-native architectures, containerization, and modern deployment practices. Nygard is a frequent conference speaker and has contributed to the broader conversation about resilience engineering, sociotechnical systems, and the human factors that influence production stability.
Weekly installs: 195 · GitHub stars: 255 · First seen: Feb 23, 2026
Security audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: codex (187), gemini-cli (186), kimi-cli (186), cursor (186), opencode (186), github-copilot (186)