qa-resilience by vasilyu1983/ai-agents-public
`npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-resilience`

This skill provides execution-ready patterns for building resilient, fault-tolerant systems that handle failures gracefully, and for validating those behaviors with tests.
Core sources are curated in data/sources.json.
Use this skill when a user requests:
When NOT to use this skill:
If key context is missing, ask for: critical user journeys, dependency inventory (including third parties), SLO/SLI targets, current timeout/retry/circuit-breaker settings, idempotency/dedup strategy, and where fault injection is allowed (local/staging/prod).
Do:
Avoid:
| Pattern | Mechanism / Tooling | When to Use | Configuration (Starting Point) |
|---|---|---|---|
| Circuit Breaker | App-level breaker or service mesh; emit breaker state changes | Sustained downstream failures or timeouts | Open on sustained error/timeout rates; use half-open probes; tune windows to traffic + error budget |
| Retry with Backoff | Client retry libs; respect Retry-After for 429/503 | Transient failures and rate limiting | 2-3 retries max for user-facing paths; backoff + jitter; per-try timeouts; never exceed the remaining deadline |
| Timeout Budgets | Deadlines/cancellation + DB statement timeouts | Any remote call or query | Budget per hop; fail fast; propagate deadlines; set DB query timeout and pool wait timeout |
| Bulkheads + Backpressure | Concurrency limiters, separate pools/queues, admission control | Overload/saturation risk | Separate pools per dependency; bound queues; reject early (429/503) rather than allow uncontrolled latency growth |
| Graceful Degradation | Feature flags, cached/stale fallbacks, partial responses | Non-critical features and partial outages | Define data freshness + UX; instrument fallback rate; avoid silent degradation |
| Health Checks | K8s liveness/readiness/startup probes | Orchestration and load balancing | Liveness shallow; readiness checks critical deps (bounded); startup probe for slow init; add graceful shutdown |
| Chaos / Fault Injection | Fault proxies, service-mesh faults, managed chaos tools | Validate behavior under real failure modes | Start in non-prod; control blast radius; timebox; predefine stop conditions; record experiment parameters |
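The retry row above can be made concrete with a short sketch. This is a minimal illustration, not part of the skill itself: the helper name `retry_with_backoff` and its parameters are ours, and the "full jitter" strategy is one common choice among several.

```python
import random
import time


def retry_with_backoff(call, *, max_retries=3, base_delay=0.1,
                       max_delay=2.0, deadline=None):
    """Retry a callable with exponential backoff and full jitter.

    Stops early if sleeping again would cross the overall deadline
    (a time.monotonic() timestamp), so retries never outlive the
    caller's timeout budget.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Full jitter: sleep a random amount up to the capped backoff.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            if deadline is not None and time.monotonic() + delay >= deadline:
                raise  # not enough budget left for another attempt
            time.sleep(delay)
```

In a user-facing path you would pair this with a per-try timeout on `call` itself, so a hung attempt cannot consume the whole budget.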
Failure scenario: [System Dependency Type]
├─ External API/Service?
│  ├─ Transient errors? → Retry with exponential backoff + jitter
│  ├─ Cascading failures? → Circuit breaker + fallback
│  ├─ Rate limiting? → Retry, honoring the Retry-After header
│  └─ Slow response? → Timeout + circuit breaker
│
├─ Database Dependency?
│  ├─ Connection pool exhaustion? → Bulkhead isolation + timeout
│  ├─ Query timeout? → Statement timeout (5-10s)
│  ├─ Replica lag? → Fall back to reading from the primary
│  └─ Connection failures? → Retry + circuit breaker
│
├─ Overload/Saturation?
│  ├─ Queue/pool growing? → Backpressure + bound queues + admission control
│  ├─ Thundering herd? → Jitter + request coalescing + caching
│  └─ Expensive paths? → Load shedding + feature-flag degradation
│
├─ Non-Critical Feature?
│  ├─ ML recommendations? → Feature flag + default-values fallback
│  ├─ Search service? → Cached results or basic SQL fallback
│  ├─ Email/notifications? → Log the error; don't block the main flow
│  └─ Analytics? → Fire-and-forget; circuit breaker for protection
│
├─ Kubernetes/Orchestration?
│  ├─ Service discovery? → Liveness + readiness + startup probes
│  ├─ Slow startup? → Startup probe (failureThreshold: 30)
│  ├─ Load balancing? → Readiness probe (checks dependencies)
│  └─ Auto-restart? → Liveness probe (simple check)
│
└─ Testing Resilience?
   ├─ Pre-production? → Chaos Toolkit experiments
   ├─ Production (low risk)? → Feature flags + canary deployments
   ├─ Scheduled testing? → Game days (quarterly)
   └─ Continuous chaos? → Low-blast-radius fault injection with strong guardrails
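The "circuit breaker + fallback" branches above are a small state machine (CLOSED → OPEN → HALF_OPEN). Here is a minimal single-threaded sketch; the class name, thresholds, and consecutive-failure counting are illustrative choices, not the skill's prescribed implementation (production breakers usually use rate-based windows):

```python
import time


class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN breaker (not thread-safe).

    Opens after `failure_threshold` consecutive failures; after
    `reset_timeout` seconds one half-open probe call is allowed, and a
    success closes the breaker again.
    """

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # allow one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"  # trip (or re-trip after a failed probe)
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```

While OPEN, callers fail fast instead of piling load onto a struggling dependency; that fast failure is where the fallback (cached/stale/default response) plugs in.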
Circuit Breaker Patterns - Prevent cascading failures
Retry Patterns - Handle transient failures
Bulkhead Isolation - Resource compartmentalization
Timeout Policies - Prevent resource exhaustion
Graceful Degradation - Maintain partial functionality
Health Check Patterns - Service availability monitoring
Load Shedding & Backpressure - Overload protection patterns
Cascading Failure Prevention - Multi-layer containment
Disaster Recovery Testing - DR drill execution
Resilience Checklists - Production hardening checklists
Chaos Engineering Guide - Safe reliability experiments
Resilience Runbook Template - Service hardening profile
Fault Injection Playbook - Chaos testing script
Resilience Test Plan Template - Failure mode test plan (timeouts/retries/degraded mode)
| Scenario | Recommendation |
|---|---|
| External API calls | Circuit breaker + retry with exponential backoff |
| Database queries | Timeout + connection pooling + circuit breaker |
| Slow dependency | Bulkhead isolation + timeout |
| Overload/saturation | Bulkheads + backpressure + load shedding |
| Non-critical feature | Feature flag + graceful degradation |
| Kubernetes deployment | Liveness + readiness + startup probes |
| Testing resilience | Chaos engineering experiments |
| Transient failures | Retry with exponential backoff + jitter |
| Cascading failures | Circuit breaker + bulkhead |
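For the overload/saturation rows, a bulkhead can be as simple as a bounded semaphore that rejects instead of queuing. A sketch under our own naming (the `Bulkhead` class and its limit are illustrative; at a service edge the rejection would map to HTTP 429/503):

```python
import threading


class RejectedError(Exception):
    """Raised when the bulkhead is full; map to 429/503 at the edge."""


class Bulkhead:
    """Cap concurrent calls to one dependency; reject early when full.

    One bulkhead per dependency keeps a slow downstream from draining
    the shared thread/connection pool for everything else.
    """

    def __init__(self, max_concurrent=10):
        self._slots = threading.Semaphore(max_concurrent)

    def run(self, fn):
        # Non-blocking acquire: fail fast instead of letting queues
        # (and latency) grow without bound.
        if not self._slots.acquire(blocking=False):
            raise RejectedError("bulkhead full")
        try:
            return fn()
        finally:
            self._slots.release()
```

Rejecting early is the deliberate trade-off here: a bounded number of fast 429s is healthier than an unbounded queue where every request eventually times out.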
Do:
Avoid:
Pattern Selection:
Observability:
Testing:
Success criteria: systems gracefully handle failures, recover automatically, maintain partial functionality during outages, and fail fast to prevent cascading failures. Resilience is tested proactively through fault injection and game days (with guardrails).
Weekly Installs: 66
GitHub Stars: 49
First Seen: Jan 23, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on: cursor 50 · codex 50 · opencode 49 · gemini-cli 48 · github-copilot 46 · claude-code 43