qa-resilience by vasilyu1983/ai-agents-public
`npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-resilience`

This skill provides execution-ready patterns for building resilient, fault-tolerant systems that handle failures gracefully, and for validating those behaviors with tests.
Core sources are curated in data/sources.json.
Use this skill when a user requests:
When NOT to use this skill:
If key context is missing, ask for: critical user journeys, dependency inventory (including third parties), SLO/SLI targets, current timeout/retry/circuit-breaker settings, idempotency/dedup strategy, and where fault injection is allowed (local/staging/prod).
Do:
Avoid:
| Pattern | Mechanism / Tooling | When to Use | Configuration (Starting Point) |
|---|---|---|---|
| Circuit Breaker | App-level breaker or service mesh; emit breaker state changes | Sustained downstream failures or timeouts | Open on sustained error/timeout rates; use half-open probes; tune windows to traffic + error budget |
| Retry with Backoff | Client retry libs; respect Retry-After for 429/503 | Transient failures and rate limiting | 2-3 retries max for user-facing paths; backoff + jitter; per-try timeouts; never exceed the remaining deadline |
| Timeout Budgets | Deadlines/cancellation + DB statement timeouts | Any remote call or query | Budget per hop; fail fast; propagate deadlines; set DB query timeout and pool wait timeout |
| Bulkheads + Backpressure | Concurrency limiters, separate pools/queues, admission control | Overload/saturation risk | Separate pools per dependency; bound queues; reject early (429/503) rather than allow uncontrolled latency growth |
| Graceful Degradation | Feature flags, cached/stale fallbacks, partial responses | Non-critical features and partial outages | Define data freshness + UX; instrument fallback rate; avoid silent degradation |
| Health Checks | K8s liveness/readiness/startup probes | Orchestration and load balancing | Liveness shallow; readiness checks critical deps (bounded); startup probe for slow init; add graceful shutdown |
| Chaos / Fault Injection | Fault proxies, service-mesh faults, managed chaos tools | Validate behavior under real failure modes | Start in non-prod; control blast radius; timebox; predefine stop conditions; record experiment parameters |
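The retry row above can be made concrete with a short sketch. This is a minimal illustration, not part of the skill itself: the helper name `retry_with_backoff` and its parameters are ours, and the "full jitter" strategy is one common choice among several.

```python
import random
import time


def retry_with_backoff(call, *, max_retries=3, base_delay=0.1,
                       max_delay=2.0, deadline=None):
    """Retry a callable with exponential backoff and full jitter.

    Stops early if sleeping again would cross the overall deadline
    (a time.monotonic() timestamp), so retries never outlive the
    caller's timeout budget.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Full jitter: sleep a random amount up to the capped backoff.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            if deadline is not None and time.monotonic() + delay >= deadline:
                raise  # not enough budget left for another attempt
            time.sleep(delay)
```

In a user-facing path you would pair this with a per-try timeout on `call` itself, so a hung attempt cannot consume the whole budget.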
Failure scenario: [System Dependency Type]
├─ External API/Service?
│  ├─ Transient errors? → Retry with exponential backoff + jitter
│  ├─ Cascading failures? → Circuit breaker + fallback
│  ├─ Rate limiting? → Retry, honoring the Retry-After header
│  └─ Slow response? → Timeout + circuit breaker
│
├─ Database Dependency?
│  ├─ Connection pool exhaustion? → Bulkhead isolation + timeout
│  ├─ Query timeout? → Statement timeout (5-10s)
│  ├─ Replica lag? → Fall back to reading from the primary
│  └─ Connection failures? → Retry + circuit breaker
│
├─ Overload/Saturation?
│  ├─ Queue/pool growing? → Backpressure + bound queues + admission control
│  ├─ Thundering herd? → Jitter + request coalescing + caching
│  └─ Expensive paths? → Load shedding + feature-flag degradation
│
├─ Non-Critical Feature?
│  ├─ ML recommendations? → Feature flag + default-values fallback
│  ├─ Search service? → Cached results or basic SQL fallback
│  ├─ Email/notifications? → Log the error; don't block the main flow
│  └─ Analytics? → Fire-and-forget; circuit breaker for protection
│
├─ Kubernetes/Orchestration?
│  ├─ Service discovery? → Liveness + readiness + startup probes
│  ├─ Slow startup? → Startup probe (failureThreshold: 30)
│  ├─ Load balancing? → Readiness probe (checks dependencies)
│  └─ Auto-restart? → Liveness probe (simple check)
│
└─ Testing Resilience?
   ├─ Pre-production? → Chaos Toolkit experiments
   ├─ Production (low risk)? → Feature flags + canary deployments
   ├─ Scheduled testing? → Game days (quarterly)
   └─ Continuous chaos? → Low-blast-radius fault injection with strong guardrails
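The "circuit breaker + fallback" branches above are a small state machine (CLOSED → OPEN → HALF_OPEN). Here is a minimal single-threaded sketch; the class name, thresholds, and consecutive-failure counting are illustrative choices, not the skill's prescribed implementation (production breakers usually use rate-based windows):

```python
import time


class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN breaker (not thread-safe).

    Opens after `failure_threshold` consecutive failures; after
    `reset_timeout` seconds one half-open probe call is allowed, and a
    success closes the breaker again.
    """

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # allow one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"  # trip (or re-trip after a failed probe)
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```

While OPEN, callers fail fast instead of piling load onto a struggling dependency; that fast failure is where the fallback (cached/stale/default response) plugs in.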
Circuit Breaker Patterns - Prevent cascading failures
Retry Patterns - Handle transient failures
Bulkhead Isolation - Resource compartmentalization
Timeout Policies - Prevent resource exhaustion
Graceful Degradation - Maintain partial functionality
Health Check Patterns - Service availability monitoring
Load Shedding & Backpressure - Overload protection patterns
Cascading Failure Prevention - Multi-layer containment
Disaster Recovery Testing - DR drill execution
Resilience Checklists - Production hardening checklists
Chaos Engineering Guide - Safe reliability experiments
Resilience Runbook Template - Service hardening profile
Fault Injection Playbook - Chaos testing script
Resilience Test Plan Template - Failure mode test plan (timeouts/retries/degraded mode)
| Scenario | Recommendation |
|---|---|
| External API calls | Circuit breaker + retry with exponential backoff |
| Database queries | Timeout + connection pooling + circuit breaker |
| Slow dependency | Bulkhead isolation + timeout |
| Overload/saturation | Bulkheads + backpressure + load shedding |
| Non-critical feature | Feature flag + graceful degradation |
| Kubernetes deployment | Liveness + readiness + startup probes |
| Testing resilience | Chaos engineering experiments |
| Transient failures | Retry with exponential backoff + jitter |
| Cascading failures | Circuit breaker + bulkhead |
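For the overload/saturation rows, a bulkhead can be as simple as a bounded semaphore that rejects instead of queuing. A sketch under our own naming (the `Bulkhead` class and its limit are illustrative; at a service edge the rejection would map to HTTP 429/503):

```python
import threading


class RejectedError(Exception):
    """Raised when the bulkhead is full; map to 429/503 at the edge."""


class Bulkhead:
    """Cap concurrent calls to one dependency; reject early when full.

    One bulkhead per dependency keeps a slow downstream from draining
    the shared thread/connection pool for everything else.
    """

    def __init__(self, max_concurrent=10):
        self._slots = threading.Semaphore(max_concurrent)

    def run(self, fn):
        # Non-blocking acquire: fail fast instead of letting queues
        # (and latency) grow without bound.
        if not self._slots.acquire(blocking=False):
            raise RejectedError("bulkhead full")
        try:
            return fn()
        finally:
            self._slots.release()
```

Rejecting early is the deliberate trade-off here: a bounded number of fast 429s is healthier than an unbounded queue where every request eventually times out.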
Do:
Avoid:
Pattern Selection:
Observability:
Testing:
Success criteria: systems gracefully handle failures, recover automatically, maintain partial functionality during outages, and fail fast to prevent cascading failures. Resilience is tested proactively through fault injection and game days (with guardrails).
Weekly Installs: 66
GitHub Stars: 49
First Seen: Jan 23, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on: cursor 50 · codex 50 · opencode 49 · gemini-cli 48 · github-copilot 46 · claude-code 43