chaos-engineer by jeffallan/claude-skills
npx skills add https://github.com/jeffallan/claude-skills --skill chaos-engineer
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Experiments | references/experiment-design.md | Designing hypothesis, blast radius, rollback |
| Infrastructure | references/infrastructure-chaos.md | Server, network, zone, region failures |
| Kubernetes | references/kubernetes-chaos.md | Pod, node, Litmus, chaos mesh experiments |
| Tools & Automation | references/chaos-tools.md | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | references/game-days.md | Planning, executing, learning from game days |
Non-obvious constraints that must be enforced on every experiment:
When implementing chaos engineering, provide:
The following shows a complete experiment — from hypothesis to rollback — using Litmus Chaos on Kubernetes.
```shell
# Verify baseline: p99 latency < 200ms, error rate < 0.1%
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service
```
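
The baseline in the comment above (p99 latency under 200 ms, error rate under 0.1%) can be turned into a go/no-go check before any fault is injected. This is a minimal sketch, assuming the two numbers have already been read from your metrics system; the function name `steady_state_ok` and its argument order are hypothetical.

```shell
# steady_state_ok <p99_ms> <error_rate_pct>
# Returns 0 when both values are strictly inside the baseline limits
# (p99 < 200 ms, error rate < 0.1%), and 1 otherwise.
# awk does the comparison because shell arithmetic is integer-only.
steady_state_ok() {
  local p99_ms="$1" error_rate_pct="$2"
  awk -v p99="$p99_ms" -v err="$error_rate_pct" \
    'BEGIN { exit !(p99 < 200 && err < 0.1) }'
}
```

Calling `steady_state_ok 150 0.05` succeeds, while `steady_state_ok 250 0.05` fails, so the check can gate the step that applies the experiment.
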
```yaml
# chaos-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-service-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=my-service"
    appkind: deployment
  # Limit blast radius: only 1 replica at a time
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60" # seconds
            - name: CHAOS_INTERVAL
              value: "20" # delete one pod every 20s
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "33" # max 33% of replicas affected
```
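
The environment values in this manifest determine the blast radius, and the arithmetic is worth checking against the real replica count before applying it. The sketch below is illustrative only: the helper names `pods_affected` and `kill_rounds` are hypothetical, and the rounding rule (round down, never below one pod) is an assumption that should be verified against the pod-delete documentation for your Litmus version.

```shell
# pods_affected <replicas> <pods_affected_perc>
# Prints how many pods one chaos round would target.
# ASSUMPTION: the percentage is rounded down, with a minimum of 1 pod.
pods_affected() {
  local replicas="$1" perc="$2" n
  n=$(( replicas * perc / 100 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# kill_rounds <total_chaos_duration_s> <chaos_interval_s>
# Prints roughly how many delete rounds fit in the chaos window.
kill_rounds() {
  local duration="$1" interval="$2"
  echo $(( duration / interval ))
}
```

With the values above, `pods_affected 3 33` prints 1 and `kill_rounds 60 20` prints 3. Under the same assumption, a 10-replica deployment would lose 3 pods per round, so the percentage alone does not guarantee "one replica at a time".
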
```shell
# Apply the experiment
kubectl apply -f chaos-pod-delete.yaml

# Watch experiment status
kubectl describe chaosengine my-service-pod-delete -n production
kubectl get chaosresult my-service-pod-delete-pod-delete -n production -w

# Tail application logs for errors
kubectl logs -l app=my-service -n production --since=2m -f

# Check ChaosResult verdict when complete
kubectl get chaosresult my-service-pod-delete-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}'
```
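
The verdict string returned by the last command can drive a CI gate. The following is a sketch under stated assumptions: the function name `verdict_to_exit_code` and the exit-code scheme are this example's own convention, not part of Litmus.

```shell
# verdict_to_exit_code <verdict>
# Maps a ChaosResult verdict string to an exit code a pipeline can act on.
verdict_to_exit_code() {
  case "$1" in
    Pass)    return 0 ;;  # hypothesis held
    Fail)    return 1 ;;  # steady state was violated
    Awaited) return 2 ;;  # experiment has not finished yet
    *)       return 3 ;;  # Stopped, empty, or any other value
  esac
}

# Example wiring (commented out so the sketch runs without a cluster):
# verdict_to_exit_code "$(kubectl get chaosresult my-service-pod-delete-pod-delete \
#   -n production -o jsonpath='{.status.experimentStatus.verdict}')"
```

Treating anything other than `Pass` as non-zero keeps an unfinished or aborted run from being mistaken for a successful one.
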
```shell
# Immediately stop the experiment
kubectl patch chaosengine my-service-pod-delete \
  -n production --type merge -p '{"spec":{"engineState":"stop"}}'

# Confirm all pods are healthy
kubectl rollout status deployment/my-service -n production
```
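
The manual stop above can also be triggered automatically when a health probe fails. This is a minimal sketch of such a guard loop, not a definitive implementation: `guard_experiment`, the `GUARD_INTERVAL` variable, and the probe and stop callbacks are all hypothetical names introduced for illustration.

```shell
# guard_experiment <max_checks> <probe_fn> <stop_fn>
# Calls <probe_fn> up to <max_checks> times, sleeping GUARD_INTERVAL
# seconds (default 5) after each passing probe. On the first failed
# probe it calls <stop_fn> and returns 1; if every probe passes it
# returns 0.
guard_experiment() {
  local max_checks="$1" probe_fn="$2" stop_fn="$3" i=0
  while [ "$i" -lt "$max_checks" ]; do
    if ! "$probe_fn"; then
      "$stop_fn"
      return 1
    fi
    i=$(( i + 1 ))
    sleep "${GUARD_INTERVAL:-5}"
  done
  return 0
}

# Example wiring (commented out so the sketch runs without a cluster;
# the health URL is a placeholder):
# probe() { curl -fsS http://my-service.example/healthz >/dev/null; }
# stop_chaos() {
#   kubectl patch chaosengine my-service-pod-delete \
#     -n production --type merge -p '{"spec":{"engineState":"stop"}}'
# }
# guard_experiment 12 probe stop_chaos
```

Passing the probe and the stop action as function names keeps the loop testable with stubs before it is pointed at a real cluster.
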
```shell
# Install toxiproxy CLI
brew install toxiproxy # macOS; use the binary release on Linux

# Start toxiproxy server (runs alongside your service)
toxiproxy-server &

# Create a proxy for your downstream dependency
toxiproxy-cli create -l 0.0.0.0:22222 -u downstream-db:5432 db-proxy

# Inject 300 ms latency with ±30 ms jitter (10%); blast radius: this proxy only
# (recent toxiproxy-cli releases expect flags first and the proxy name last)
toxiproxy-cli toxic add -t latency -a latency=300 -a jitter=30 db-proxy

# Run your load test / observe metrics here ...

# Remove the toxic to restore normal behaviour
toxiproxy-cli toxic remove -n latency_downstream db-proxy
```
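
The `jitter` attribute is given in milliseconds rather than as a percentage, so "10% jitter" on 300 ms has to be converted to `jitter=30` by hand. The sketch below shows that arithmetic and the resulting delay range; the helper names `jitter_ms` and `latency_bounds` are hypothetical, and clamping the lower bound at zero is this example's own choice.

```shell
# jitter_ms <latency_ms> <jitter_percent>
# Converts a percentage into the millisecond value the toxic expects.
jitter_ms() {
  echo $(( $1 * $2 / 100 ))
}

# latency_bounds <latency_ms> <jitter_ms>
# Prints "<min> <max>", the range of added delay (latency +/- jitter).
latency_bounds() {
  local latency="$1" jitter="$2" min
  min=$(( latency - jitter ))
  if [ "$min" -lt 0 ]; then
    min=0  # a negative delay is not meaningful, so clamp at zero
  fi
  echo "$min $(( latency + jitter ))"
}
```

For the values used above, `jitter_ms 300 10` prints 30 and `latency_bounds 300 30` prints `270 330`, which is the delay window to expect on the dependency during the test.
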
```yaml
# chaos-monkey-config.yml: restrict to a single ASG
deployment:
  enabled: true
  regionIndependence: false
chaos:
  enabled: true
  meanTimeBetweenKillsInWorkDays: 2
  minTimeBetweenKillsInWorkDays: 1
  grouping: APP # kill one instance per app, not per cluster
  exceptions:
    - account: production
      region: us-east-1
      detail: "*-canary" # never kill canary instances
```
```shell
# Apply and trigger a manual kill for testing
chaos-monkey --app my-service --account staging --dry-run false
```
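| Field | Value |
|---|---|
| Weekly Installs | 736 |
| Repository | jeffallan/claude-skills |
| GitHub Stars | 7.2K |
| First Seen | Jan 20, 2026 |
| Security Audits | Gen Agent Trust Hub: Warn; Socket: Pass; Snyk: Warn |
| Installed on | opencode (607), gemini-cli (589), claude-code (584), codex (573), github-copilot (542), cursor (539) |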