senior-devops by borghei/claude-skills
npx skills add https://github.com/borghei/claude-skills --skill senior-devops

Production-grade DevOps engineering toolkit covering the full infrastructure lifecycle: CI/CD pipeline design, container orchestration, infrastructure as code, cloud platform architecture, deployment strategies, observability, security hardening, cost optimization, and incident response.
Use this skill when you encounter:
| Category | Terms |
|---|---|
| CI/CD | pipeline, GitHub Actions, GitLab CI, Jenkins, CircleCI, build automation, artifact registry, continuous integration, continuous delivery, continuous deployment |
| Containers | Docker, Dockerfile, docker-compose, container image, multi-stage build, OCI, container registry, ECR, GCR, ACR |
| Orchestration | Kubernetes, k8s, kubectl, Helm, pod, deployment, service, ingress, HPA, VPA, StatefulSet, DaemonSet, CronJob |
| IaC | Terraform, OpenTofu, CloudFormation, Pulumi, Ansible, state management, tfstate, modules, workspaces, drift detection |
| Cloud | AWS, GCP, Azure, EC2, EKS, GKE, AKS, Lambda, Cloud Functions, S3, VPC, IAM, load balancer, auto-scaling |
| Monitoring | Prometheus, Grafana, Datadog, ELK, Loki, Jaeger, OpenTelemetry, alerting, SLO, SLI, SLA, dashboards |
| Deployment | blue-green, canary, rolling update, feature flags, rollback, zero-downtime, A/B deployment, progressive delivery |
| Security | Vault, secrets management, RBAC, network policy, supply chain security, SBOM, image scanning, Trivy, Falco |
| Reliability | incident response, runbook, postmortem, SRE, error budget, chaos engineering, disaster recovery, RTO, RPO |
| Cost | FinOps, right-sizing, spot instances, reserved capacity, cost allocation, tagging strategy, savings plans |
This skill provides three core automation tools:
# Generate CI/CD pipelines for any platform (GitHub Actions, GitLab CI, Jenkins, CircleCI)
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose
# Scaffold Terraform infrastructure with modules, state config, and environment separation
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose
# Manage deployments with strategy selection, health checks, and rollback support
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
| Tool | Purpose | Key Flags |
|---|---|---|
| pipeline_generator.py | Generates CI/CD pipeline configurations from project analysis | --platform, --stages, --json |
| terraform_scaffolder.py | Creates Terraform module structure with best-practice patterns | --provider, --env, --modules |
| deployment_manager.py | Orchestrates deployments with strategy selection and rollback | --strategy, --target, --dry-run |
Every production Dockerfile should follow this layered pattern:
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
# Copy dependency manifests first (cache layer)
COPY package.json package-lock.json ./
RUN npm ci
# Copy source and build
COPY . .
RUN npm run build
# Drop devDependencies so only production modules ship to the final stage
RUN npm prune --production && npm cache clean --force
# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
# Run as a non-root user
RUN addgroup -g 1001 appgroup && \
adduser -u 1001 -G appgroup -s /bin/sh -D appuser
# Copy only production artifacts
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
Critical rules:

- Never deploy the latest tag to production; pin explicit version tags
- Use a .dockerignore that excludes .git, node_modules, test files, and docs
- Use npm ci, not npm install, and always copy the lock file

Production-ready compose file for a typical microservice stack:
version: "3.9"
x-common: &common
restart: unless-stopped
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
services:
app:
<<: *common
build:
context: .
dockerfile: Dockerfile
target: production
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://app:${DB_PASSWORD}@db:5432/appdb
- REDIS_URL=redis://redis:6379
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
deploy:
resources:
limits:
cpus: "1.0"
memory: 512M
reservations:
cpus: "0.25"
memory: 128M
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/healthz"]
interval: 15s
timeout: 5s
retries: 3
db:
<<: *common
image: postgres:16-alpine
volumes:
- pgdata:/var/lib/postgresql/data
environment:
POSTGRES_DB: appdb
POSTGRES_USER: app
POSTGRES_PASSWORD: ${DB_PASSWORD}
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app -d appdb"]
interval: 10s
timeout: 5s
retries: 5
redis:
<<: *common
image: redis:7-alpine
command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
volumes:
pgdata:
redisdata:
Container security checklist:

- Scan images before release: trivy image --severity HIGH,CRITICAL myapp:latest
- Run as non-root via a USER directive
- Run with a read-only filesystem: --read-only --tmpfs /tmp
- Verify no secrets leaked into layers with docker history --no-trunc

Sidecar pattern -- add capabilities without modifying the main container:
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
labels:
app: web
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
spec:
serviceAccountName: app-sa
securityContext:
runAsNonRoot: true
fsGroup: 1001
containers:
- name: app
image: myapp:1.2.3
ports:
- containerPort: 3000
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: 3000
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
startupProbe:
httpGet:
path: /healthz
port: 3000
failureThreshold: 30
periodSeconds: 10
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: app-secrets
key: db-password
- name: log-shipper
image: fluent/fluent-bit:2.2
volumeMounts:
- name: app-logs
mountPath: /var/log/app
volumes:
- name: app-logs
emptyDir: {}
Probe decision framework: use a startupProbe for slow-starting apps (it suspends the other probes until it first succeeds), a readinessProbe to gate traffic on dependencies, and a livenessProbe only for unrecoverable states -- a failed liveness probe restarts the container.
charts/myapp/
Chart.yaml
values.yaml
values-staging.yaml
values-production.yaml
templates/
deployment.yaml
service.yaml
ingress.yaml
hpa.yaml
networkpolicy.yaml
serviceaccount.yaml
_helpers.tpl
Key values.yaml patterns:
replicaCount: 3
image:
repository: myapp
tag: "1.2.3"
pullPolicy: IfNotPresent
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: app.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: app-tls
hosts:
- app.example.com
HPA (Horizontal Pod Autoscaler):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 120
Decision: HPA vs VPA vs KEDA

| Scaler | Use When | Avoid When |
|---|---|---|
| HPA | Stateless services, predictable CPU/memory patterns | Stateful workloads, bursty event-driven loads |
| VPA | Right-sizing requests/limits, batch jobs, single-replica workloads | Used alone for latency-sensitive services |
| KEDA | Event-driven scaling (queue depth, HTTP rate, cron) | Simple CPU-based scaling (HPA is simpler) |
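The decision table above can be encoded as a small helper. A hedged sketch -- the workload traits and return labels are illustrative, not part of this skill's tooling:

```python
def choose_scaler(stateless: bool, event_driven: bool, latency_sensitive: bool) -> str:
    """Pick an autoscaler following the HPA/VPA/KEDA decision table (illustrative)."""
    if event_driven:
        return "KEDA"   # queue depth, HTTP rate, or cron triggers
    if stateless:
        return "HPA"    # predictable CPU/memory patterns
    # VPA right-sizes requests/limits, but should not act alone for
    # latency-sensitive services; pair it with HPA there.
    return "VPA+HPA" if latency_sensitive else "VPA"

print(choose_scaler(stateless=True, event_driven=False, latency_sensitive=False))   # HPA
print(choose_scaler(stateless=False, event_driven=True, latency_sensitive=False))   # KEDA
```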
Production pipeline with caching, matrix testing, and deployment gates:
name: CI/CD
on:
push:
branches: [main]
pull_request:
branches: [main]
permissions:
contents: read
packages: write
id-token: write
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
node-version: [18, 20]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: npm
- run: npm ci
- run: npm run lint
- run: npm test -- --coverage
- uses: actions/upload-artifact@v4
if: matrix.node-version == 20
with:
name: coverage
path: coverage/
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: fs
severity: HIGH,CRITICAL
exit-code: 1
build:
needs: [test, security]
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha
type=ref,event=branch
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
cache-from: type=gha
cache-to: type=gha,mode=max
deploy-staging:
needs: build
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Deploy to staging
run: |
helm upgrade --install app charts/myapp \
--namespace staging \
--values charts/myapp/values-staging.yaml \
--set image.tag=${{ github.sha }} \
--wait --timeout 300s
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Deploy to production (canary)
run: |
helm upgrade --install app charts/myapp \
--namespace production \
--values charts/myapp/values-production.yaml \
--set image.tag=${{ github.sha }} \
--set canary.enabled=true \
--set canary.weight=10 \
--wait --timeout 300s
stages:
- test
- build
- deploy
variables:
DOCKER_BUILDKIT: 1
test:
stage: test
image: node:20-alpine
cache:
key: ${CI_COMMIT_REF_SLUG}
paths:
- node_modules/
script:
- npm ci
- npm run lint
- npm test -- --coverage
coverage: '/Lines\s*:\s*(\d+\.?\d*)%/'
artifacts:
reports:
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
build:
stage: build
image: docker:24
services:
- docker:24-dind
only:
- main
script:
- docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
- docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
deploy_staging:
stage: deploy
environment:
name: staging
url: https://staging.example.com
only:
- main
script:
- helm upgrade --install app charts/myapp
--namespace staging
--set image.tag=$CI_COMMIT_SHA
--wait
deploy_production:
stage: deploy
environment:
name: production
url: https://app.example.com
only:
- main
when: manual
script:
- helm upgrade --install app charts/myapp
--namespace production
--set image.tag=$CI_COMMIT_SHA
--wait
infrastructure/
modules/
vpc/
main.tf
variables.tf
outputs.tf
eks/
main.tf
variables.tf
outputs.tf
rds/
main.tf
variables.tf
outputs.tf
environments/
staging/
main.tf # Calls modules with staging values
terraform.tfvars
backend.tf # S3 + DynamoDB state backend
production/
main.tf
terraform.tfvars
backend.tf
Remote state with locking (AWS):
# backend.tf
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "production/infrastructure.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
State management rules:

- Run terraform plan in CI; run terraform apply only after approval
- Use terraform state list and terraform state show for debugging; never edit state manually

| Pattern | Use When | Trade-offs |
|---|---|---|
| Workspaces | Same config, different scale (dev/staging/prod with identical topology) | Shared state backend, easy switching, but harder to diverge configs |
| Directories | Environments need different resources or topology | Full isolation, clear boundaries, but duplicated boilerplate |

Recommendation: use directories for environment separation and modules for shared logic. Workspaces are better suited to ephemeral environments (PR previews, load-test environments).
Integrate drift detection into CI:

# Run in CI on a schedule (daily)
terraform plan -detailed-exitcode -out=plan.tfplan
ec=$?
# Exit code 0 = no changes (clean)
# Exit code 1 = error
# Exit code 2 = changes detected (drift)
# Capture the exit code immediately so later commands don't clobber $?
if [ "$ec" -eq 2 ]; then
  # Send alert to Slack/PagerDuty
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d '{"text":"Terraform drift detected in production. Review required."}'
fi
Safety rules:

- Set prevent_destroy on critical resources (databases, S3 buckets with data)
- Require explicit review for destroy and replace actions

| Pillar | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Numeric time-series data (CPU, latency, error rates) |
| Logs | Loki / ELK (Elasticsearch, Logstash, Kibana) | Structured event records for debugging |
| Traces | Jaeger / Tempo + OpenTelemetry | Request flow across services for latency analysis |
groups:
- name: application
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate exceeds 5% for 5 minutes"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
> 2.0
for: 10m
labels:
severity: warning
annotations:
summary: "P99 latency exceeds 2s for 10 minutes"
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} restarting frequently"
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space below 15% on {{ $labels.instance }}"
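The HighErrorRate rule above fires when the ratio of 5xx request rate to total request rate exceeds 5%. The same check in plain Python, with made-up counter values for illustration:

```python
def error_rate(status_counts: dict) -> float:
    """Fraction of requests with a 5xx status, mirroring the PromQL ratio."""
    total = sum(status_counts.values())
    errors = sum(v for k, v in status_counts.items() if k.startswith("5"))
    return errors / total if total else 0.0

# Hypothetical per-status request rates over the last 5 minutes
rates = {"200": 940.0, "404": 10.0, "500": 30.0, "503": 20.0}
print(error_rate(rates))                   # 0.05 -- exactly at the 5% threshold
print("alert fires:", error_rate(rates) > 0.05)
```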
Define SLIs first, then set SLOs:

| Service | SLI (what you measure) | SLO (target) | Error Budget |
|---|---|---|---|
| API Gateway | Successful requests / total requests | 99.9% availability (43.8 min/month downtime) | 0.1% |
| API Latency | Requests under 500ms / total requests | 99th percentile < 500ms | 1% |
| Data Pipeline | Successful pipeline runs / total runs | 99.5% success rate | 0.5% |
| Deployment | Successful deploys / total deploys | 99% success rate | 1% |

Error budget policy: when the error budget is exhausted, freeze feature deployments and prioritize reliability work until the budget recovers.
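The monthly downtime figures in the SLO table follow directly from the availability target. A small helper to derive them, assuming a 30.42-day average month (which is what the 43.8-minute figure implies):

```python
def monthly_error_budget_minutes(slo_percent: float, days: float = 30.42) -> float:
    """Minutes of allowed downtime per month for a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0

# 99.9% availability -> roughly 43.8 minutes/month, matching the table
print(round(monthly_error_budget_minutes(99.9), 1))
```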
Every service dashboard should include panels for the "Four Golden Signals": latency, traffic, errors, and saturation.
| Capability | AWS | GCP | Azure |
|---|---|---|---|
| Managed Kubernetes | EKS | GKE | AKS |
| Serverless compute | Lambda | Cloud Functions / Cloud Run | Azure Functions |
| Container service | ECS / Fargate | Cloud Run | Container Apps |
| Object storage | S3 | Cloud Storage | Blob Storage |
| Managed database | RDS / Aurora | Cloud SQL / AlloyDB | Azure SQL / Cosmos DB |
| Message queue | SQS / SNS | Pub/Sub | Service Bus |
| CDN | CloudFront | Cloud CDN | Azure CDN / Front Door |
| DNS | Route 53 | Cloud DNS | Azure DNS |
| Secrets management | Secrets Manager | Secret Manager | Key Vault |
| IAM | IAM + STS | IAM + Workload Identity | Entra ID + RBAC |
| IaC | CloudFormation / CDK | Deployment Manager | Bicep / ARM |
When multi-cloud makes sense:
When it does not:
If you go multi-cloud:
| Strategy | Risk | Rollback Speed | Infrastructure Cost | Best For |
|---|---|---|---|---|
| Rolling update | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-green | Low | Seconds (DNS/LB switch) | 2x during deploy | Mission-critical, zero-downtime required |
| Canary | Low | Seconds (shift traffic back) | 1.1x | User-facing services, gradual validation |
| Feature flags | Lowest | Instant (toggle) | 1x | Granular control, A/B testing, trunk-based development |
# Blue (current production)
apiVersion: v1
kind: Service
metadata:
name: app-production
spec:
selector:
app: myapp
version: blue # Points to the current version
ports:
- port: 80
targetPort: 3000
---
# Green (new version) -- deployed alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: green
template:
metadata:
labels:
app: myapp
version: green
spec:
containers:
- name: app
image: myapp:2.0.0
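The actual blue-to-green cutover is a one-field change to the Service selector. A hedged sketch that builds the kubectl patch payload -- service and label names are taken from the manifests above, and the kubectl invocation is shown only as a comment:

```python
import json

def selector_patch(version: str) -> str:
    """JSON merge patch that flips the Service's version selector."""
    return json.dumps({"spec": {"selector": {"app": "myapp", "version": version}}})

patch = selector_patch("green")
# Apply with: kubectl patch service app-production --type merge -p "$patch"
print(patch)
```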
Cutover steps:

1. Deploy the green Deployment alongside blue and wait for all replicas to pass readiness
2. Switch the Service selector from version: blue to version: green
3. Keep blue running until green is verified; rollback is a selector change back to blue

# Istio VirtualService for canary routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: app-canary
spec:
hosts:
- app.example.com
http:
- route:
- destination:
host: app-stable
port:
number: 80
weight: 90
- destination:
host: app-canary
port:
number: 80
weight: 10
Canary promotion ladder:
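A promotion ladder steps the canary traffic weight up only while observed error rates stay below a threshold, and shifts all traffic back to stable on any regression. A minimal sketch -- the weights, threshold, and metrics callable are illustrative:

```python
def promote(weights, canary_error_rate, threshold=0.01):
    """Walk the weight ladder; stop and roll back if the canary misbehaves.

    canary_error_rate: callable returning the observed error rate at each weight.
    Returns the final weight reached (0 means rolled back).
    """
    current = 0
    for w in weights:
        if canary_error_rate(w) > threshold:
            return 0  # roll back: shift all traffic to stable
        current = w
    return current

# Healthy canary climbs the full ladder to 100%
print(promote([10, 25, 50, 100], lambda w: 0.002))
# A regression appearing at 50% traffic triggers rollback
print(promote([10, 25, 50, 100], lambda w: 0.05 if w >= 50 else 0.002))
```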
Use feature flags to decouple deployment from release:
# Example with LaunchDarkly / Unleash / a simple config lookup
if feature_flags.is_enabled("new-checkout-flow", user_context):
return new_checkout_handler(request)
else:
return legacy_checkout_handler(request)
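The is_enabled call above can be backed by something as small as an in-memory store with deterministic percentage rollout. A hedged sketch -- a stand-in for LaunchDarkly/Unleash, not a reproduction of their APIs:

```python
import hashlib

class FeatureFlags:
    """Minimal flag store: on/off flags plus deterministic percentage rollout."""

    def __init__(self, flags: dict):
        # e.g. {"new-checkout-flow": {"enabled": True, "rollout": 25}}
        self.flags = flags

    def is_enabled(self, name: str, user_id: str) -> bool:
        flag = self.flags.get(name)
        if not flag or not flag.get("enabled", False):
            return False
        rollout = flag.get("rollout", 100)
        # Hash flag+user so each user lands in a stable bucket in [0, 100)
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout

flags = FeatureFlags({"new-checkout-flow": {"enabled": True, "rollout": 50}})
print(flags.is_enabled("new-checkout-flow", "user-42"))
```

The hash-based bucketing keeps each user's experience stable across requests while still honoring the rollout percentage in aggregate.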
Flag lifecycle: create the flag, ramp up the rollout, reach 100%, then remove the flag and its dead code path -- stale flags are technical debt.

Decision matrix:
| Tool | Best For | Avoid When |
|---|---|---|
| HashiCorp Vault | Dynamic secrets, PKI, encryption as a service, multi-cloud | Small teams, simple applications |
| AWS Secrets Manager | AWS-native workloads, automatic rotation | Multi-cloud or hybrid requirements |
| AWS SSM Parameter Store | Non-sensitive config, low-cost secret storage | Rotation or audit requirements at scale |
| Kubernetes Secrets | Pod-level injection (with encryption at rest enabled) | Long-term secret storage or cross-cluster sharing |
| SOPS / age | Encrypted secrets in Git (GitOps workflows) | Teams unfamiliar with key management |
Vault integration pattern for Kubernetes:
# Using the Vault Agent Injector
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
spec:
template:
metadata:
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "app-role"
vault.hashicorp.com/agent-inject-secret-db: "secret/data/app/db"
vault.hashicorp.com/agent-inject-template-db: |
{{- with secret "secret/data/app/db" -}}
export DB_HOST={{ .Data.data.host }}
export DB_PASSWORD={{ .Data.data.password }}
{{- end -}}
spec:
serviceAccountName: app-sa
containers:
- name: app
image: myapp:1.2.3
command: ["/bin/sh", "-c", "source /vault/secrets/db && node server.js"]
Default deny, explicit allow:
# Deny all ingress and egress traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
# Allow the app to receive traffic from the ingress controller and talk to the database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: app-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: web
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 3000
egress:
- to:
- podSelector:
matchLabels:
app: postgres
ports:
- protocol: TCP
port: 5432
- to: # Allow DNS resolution
- namespaceSelector: {}
ports:
- protocol: UDP
port: 53
RBAC hygiene:

- Audit what a service account can do: kubectl auth can-i --list --as=system:serviceaccount:production:app-sa
- Never grant workloads cluster-admin

# Sign container images with cosign
cosign sign --key cosign.key ghcr.io/myorg/myapp:1.2.3
# Verify before deploying
cosign verify --key cosign.pub ghcr.io/myorg/myapp:1.2.3
# Generate an SBOM
syft ghcr.io/myorg/myapp:1.2.3 -o spdx-json > sbom.json
# Scan the SBOM for vulnerabilities
grype sbom:sbom.json --fail-on high
Admission control: use Kyverno or OPA Gatekeeper to enforce policies such as requiring image signatures, blocking the latest tag, and mandating resource limits.
| Workload Type | Spot-Friendly? | Pattern |
|---|---|---|
| Stateless web servers |
Production-grade DevOps engineering toolkit covering the full infrastructure lifecycle: CI/CD pipeline design, container orchestration, infrastructure as code, cloud platform architecture, deployment strategies, observability, security hardening, cost optimization, and incident response.
Use this skill when you encounter:
| Category | Terms |
|---|---|
| CI/CD | pipeline, GitHub Actions, GitLab CI, Jenkins, CircleCI, build automation, artifact registry, continuous integration, continuous delivery, continuous deployment |
| Containers | Docker, Dockerfile, docker-compose, container image, multi-stage build, OCI, container registry, ECR, GCR, ACR |
| Orchestration | Kubernetes, k8s, kubectl, Helm, pod, deployment, service, ingress, HPA, VPA, StatefulSet, DaemonSet, CronJob |
| IaC | Terraform, OpenTofu, CloudFormation, Pulumi, Ansible, state management, tfstate, modules, workspaces, drift detection |
| Cloud | AWS, GCP, Azure, EC2, EKS, GKE, AKS, Lambda, Cloud Functions, S3, VPC, IAM, load balancer, auto-scaling |
| Monitoring | Prometheus, Grafana, Datadog, ELK, Loki, Jaeger, OpenTelemetry, alerting, SLO, SLI, SLA, dashboards |
| Deployment | blue-green, canary, rolling update, feature flags, rollback, zero-downtime, A/B deployment, progressive delivery |
| Security | Vault, secrets management, RBAC, network policy, supply chain security, SBOM, image scanning, Trivy, Falco |
| Reliability | incident response, runbook, postmortem, SRE, error budget, chaos engineering, disaster recovery, RTO, RPO |
| Cost | FinOps, right-sizing, spot instances, reserved capacity, cost allocation, tagging strategy, savings plans |
This skill provides three core automation tools:
# Generate CI/CD pipelines for any platform (GitHub Actions, GitLab CI, Jenkins, CircleCI)
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose
# Scaffold Terraform infrastructure with modules, state config, and environment separation
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose
# Manage deployments with strategy selection, health checks, and rollback support
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
| Tool | Purpose | Key Flags |
|---|---|---|
pipeline_generator.py | Generates CI/CD pipeline configurations from project analysis | --platform, --stages, --json |
terraform_scaffolder.py | Creates Terraform module structure with best-practice patterns | --provider, --env, --modules |
Every production Dockerfile should follow this layered pattern:
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
# Copy dependency manifests first (cache layer)
COPY package.json package-lock.json ./
RUN npm ci --only=production && npm cache clean --force
# Copy source and build
COPY . .
RUN npm run build
# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
# Run as non-root
RUN addgroup -g 1001 appgroup && \
adduser -u 1001 -G appgroup -s /bin/sh -D appuser
# Copy only production artifacts
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
Critical rules:
latest in production.dockerignore to exclude .git, node_modules, test files, docsnpm ci not npm install, lock files always copiedProduction-ready compose for a typical microservice stack:
version: "3.9"
x-common: &common
restart: unless-stopped
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
services:
app:
<<: *common
build:
context: .
dockerfile: Dockerfile
target: production
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://app:${DB_PASSWORD}@db:5432/appdb
- REDIS_URL=redis://redis:6379
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
deploy:
resources:
limits:
cpus: "1.0"
memory: 512M
reservations:
cpus: "0.25"
memory: 128M
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/healthz"]
interval: 15s
timeout: 5s
retries: 3
db:
<<: *common
image: postgres:16-alpine
volumes:
- pgdata:/var/lib/postgresql/data
environment:
POSTGRES_DB: appdb
POSTGRES_USER: app
POSTGRES_PASSWORD: ${DB_PASSWORD}
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app -d appdb"]
interval: 10s
timeout: 5s
retries: 5
redis:
<<: *common
image: redis:7-alpine
command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
volumes:
pgdata:
redisdata:
trivy image --severity HIGH,CRITICAL myapp:latestUSER directive--read-only --tmpfs /tmpdocker history --no-truncSidecar pattern -- add capabilities without modifying the main container:
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
labels:
app: web
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
spec:
serviceAccountName: app-sa
securityContext:
runAsNonRoot: true
fsGroup: 1001
containers:
- name: app
image: myapp:1.2.3
ports:
- containerPort: 3000
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: 3000
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
startupProbe:
httpGet:
path: /healthz
port: 3000
failureThreshold: 30
periodSeconds: 10
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: app-secrets
key: db-password
- name: log-shipper
image: fluent/fluent-bit:2.2
volumeMounts:
- name: app-logs
mountPath: /var/log/app
volumes:
- name: app-logs
emptyDir: {}
Probe decision framework:
charts/myapp/
Chart.yaml
values.yaml
values-staging.yaml
values-production.yaml
templates/
deployment.yaml
service.yaml
ingress.yaml
hpa.yaml
networkpolicy.yaml
serviceaccount.yaml
_helpers.tpl
Key values.yaml patterns:
replicaCount: 3
image:
repository: myapp
tag: "1.2.3"
pullPolicy: IfNotPresent
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: app.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: app-tls
hosts:
- app.example.com
HPA (Horizontal Pod Autoscaler):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 120
Decision: HPA vs VPA vs KEDA
| Scaler | Use When | Avoid When |
|---|---|---|
| HPA | Stateless services, predictable CPU/memory patterns | Stateful workloads, bursty event-driven loads |
| VPA | Right-sizing requests/limits, batch jobs, single-replica workloads | Used alone for latency-sensitive services |
| KEDA | Event-driven scaling (queue depth, HTTP rate, cron) | Simple CPU-based scaling (HPA is simpler) |
Production pipeline with caching, matrix testing, and deployment gates:
name: CI/CD
on:
push:
branches: [main]
pull_request:
branches: [main]
permissions:
contents: read
packages: write
id-token: write
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
node-version: [18, 20]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: npm
- run: npm ci
- run: npm run lint
- run: npm test -- --coverage
- uses: actions/upload-artifact@v4
if: matrix.node-version == 20
with:
name: coverage
path: coverage/
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: fs
severity: HIGH,CRITICAL
exit-code: 1
build:
needs: [test, security]
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha
type=ref,event=branch
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
cache-from: type=gha
cache-to: type=gha,mode=max
deploy-staging:
needs: build
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Deploy to staging
run: |
helm upgrade --install app charts/myapp \
--namespace staging \
--values charts/myapp/values-staging.yaml \
--set image.tag=${{ github.sha }} \
--wait --timeout 300s
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Deploy to production (canary)
run: |
helm upgrade --install app charts/myapp \
--namespace production \
--values charts/myapp/values-production.yaml \
--set image.tag=${{ github.sha }} \
--set canary.enabled=true \
--set canary.weight=10 \
--wait --timeout 300s
stages:
- test
- build
- deploy
variables:
DOCKER_BUILDKIT: 1
test:
stage: test
image: node:20-alpine
cache:
key: ${CI_COMMIT_REF_SLUG}
paths:
- node_modules/
script:
- npm ci
- npm run lint
- npm test -- --coverage
coverage: '/Lines\s*:\s*(\d+\.?\d*)%/'
artifacts:
reports:
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
build:
stage: build
image: docker:24
services:
- docker:24-dind
only:
- main
script:
- docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
- docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
deploy_staging:
stage: deploy
environment:
name: staging
url: https://staging.example.com
only:
- main
script:
- helm upgrade --install app charts/myapp
--namespace staging
--set image.tag=$CI_COMMIT_SHA
--wait
deploy_production:
stage: deploy
environment:
name: production
url: https://app.example.com
only:
- main
when: manual
script:
- helm upgrade --install app charts/myapp
--namespace production
--set image.tag=$CI_COMMIT_SHA
--wait
infrastructure/
modules/
vpc/
main.tf
variables.tf
outputs.tf
eks/
main.tf
variables.tf
outputs.tf
rds/
main.tf
variables.tf
outputs.tf
environments/
staging/
main.tf # Calls modules with staging values
terraform.tfvars
backend.tf # S3 + DynamoDB state backend
production/
main.tf
terraform.tfvars
backend.tf
Remote state with locking (AWS):
# backend.tf
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "production/infrastructure.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
State management rules:
terraform plan in CI, terraform apply only after approvalterraform state list and terraform state show for debugging, never edit state manually| Pattern | Use When | Trade-offs |
|---|---|---|
| Workspaces | Same config, different scale (dev/staging/prod with identical topology) | Shared state backend, easy switching, but harder to diverge configs |
| Directories | Different environments need different resources or topology | Full isolation, clear boundaries, but duplicated boilerplate |
Recommendation : Use directories for environment separation. Use modules for shared logic. Workspaces are better suited for ephemeral environments (PR previews, load test environments).
Integrate drift detection into CI:
# Run in CI on a schedule (daily)
terraform plan -detailed-exitcode -out=plan.tfplan
# Exit code 0 = no changes (clean)
# Exit code 1 = error
# Exit code 2 = changes detected (drift)
# Alert on exit code 2
if [ $? -eq 2 ]; then
# Send alert to Slack/PagerDuty
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d '{"text":"Terraform drift detected in production. Review required."}'
fi
prevent_destroy on critical resources (databases, S3 buckets with data).destroy and replace actions.| Pillar | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Numeric time-series data (CPU, latency, error rates) |
| Logs | Loki / ELK (Elasticsearch, Logstash, Kibana) | Structured event records for debugging |
| Traces | Jaeger / Tempo + OpenTelemetry | Request flow across services for latency analysis |
groups:
- name: application
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate exceeds 5% for 5 minutes"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
> 2.0
for: 10m
labels:
severity: warning
annotations:
summary: "P99 latency exceeds 2s for 10 minutes"
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} restarting frequently"
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space below 15% on {{ $labels.instance }}"
Define SLIs first, then set SLOs:
| Service | SLI (what you measure) | SLO (target) | Error Budget |
|---|---|---|---|
| API Gateway | Successful requests / Total requests | 99.9% availability (43.8 min/month downtime) | 0.1% |
| API Latency | Requests under 500ms / Total requests | 99th percentile < 500ms | 1% |
| Data Pipeline | Successful pipeline runs / Total runs | 99.5% success rate | 0.5% |
| Deployment | Successful deploys / Total deploys | 99% success rate | 1% |
Error budget policy : When the error budget is exhausted, freeze feature deployments and prioritize reliability work until the budget recovers.
Every service dashboard should include these panels (the "Four Golden Signals"):
| Capability | AWS | GCP | Azure |
|---|---|---|---|
| Managed Kubernetes | EKS | GKE | AKS |
| Serverless Compute | Lambda | Cloud Functions / Cloud Run | Azure Functions |
| Container Service | ECS/Fargate | Cloud Run | Container Apps |
| Object Storage | S3 | Cloud Storage | Blob Storage |
| Managed Database | RDS / Aurora | Cloud SQL / AlloyDB | Azure SQL / Cosmos DB |
| Message Queue | SQS / SNS | Pub/Sub | Service Bus |
When multi-cloud makes sense:
When it does not:
If you go multi-cloud:
| Strategy | Risk | Rollback Speed | Infrastructure Cost | Best For |
|---|---|---|---|---|
| Rolling Update | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-Green | Low | Seconds (DNS/LB switch) | 2x during deploy | Mission-critical, zero-downtime required |
| Canary | Low | Seconds (shift traffic back) | 1.1x | User-facing services, gradual validation |
| Feature Flags | Lowest | Instant (toggle) | 1x | Granular control, A/B testing, trunk-based dev |
```yaml
# Blue (current production)
apiVersion: v1
kind: Service
metadata:
  name: app-production
spec:
  selector:
    app: myapp
    version: blue  # Points to current version
  ports:
    - port: 80
      targetPort: 3000
---
# Green (new version) -- deploy alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:2.0.0
```
Cutover steps:
1. Deploy green alongside blue and wait until all green pods pass readiness checks.
2. Smoke-test green directly (port-forward or a temporary test Service).
3. Switch the Service selector from `version: blue` to `version: green`; traffic cuts over atomically.
4. Monitor error rates and latency; roll back by switching the selector back to `version: blue`.
5. Decommission the blue Deployment once green is stable.

```yaml
# Istio VirtualService for canary routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-canary
spec:
  hosts:
    - app.example.com
  http:
    - route:
        - destination:
            host: app-stable
            port:
              number: 80
          weight: 90
        - destination:
            host: app-canary
            port:
              number: 80
          weight: 10
```
Canary promotion ladder:
1. Shift 10% of traffic to the canary; bake while comparing error rate and latency against the stable baseline.
2. Promote to 25%, then 50%, with a bake period at each step.
3. Promote to 100% and retire the old version after a final bake.
4. At any step, if SLOs degrade, shift all traffic back to stable (seconds, per the strategy table above).
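If you run Argo Rollouts, this kind of ladder can be encoded declaratively. A sketch (weights and bake durations are illustrative; the Rollout's selector and pod template are omitted for brevity):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      stableService: app-stable
      canaryService: app-canary
      steps:
        - setWeight: 10
        - pause: {duration: 15m}
        - setWeight: 25
        - pause: {duration: 15m}
        - setWeight: 50
        - pause: {duration: 30m}
```

After the last step the controller promotes the canary to 100%; `kubectl argo rollouts abort` shifts traffic back to stable at any point.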
Use feature flags to decouple deployment from release:

```python
# Example with LaunchDarkly / Unleash / simple config
if feature_flags.is_enabled("new-checkout-flow", user_context):
    return new_checkout_handler(request)
else:
    return legacy_checkout_handler(request)
```
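The flag SDK above is a stand-in for whatever provider you use. As a minimal self-contained sketch, a deterministic percentage rollout can be implemented by hashing the flag/user pair into a bucket (function name and bucketing scheme are illustrative, not any SDK's API):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage rollout: bucket (flag, user) into 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# The same user keeps the same answer as the rollout percentage grows,
# so nobody flips back and forth between code paths mid-rollout.
print(is_enabled("new-checkout-flow", "user-42", 100))  # True
```

Hashing per-flag (not just per-user) keeps rollout populations independent across flags.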
Flag lifecycle: create the flag default-off, roll out gradually, reach 100%, then remove the flag and the dead code path. Stale flags accumulate as tech debt; set a removal date when the flag is created.
Decision matrix:
| Tool | Best For | Avoid When |
|---|---|---|
| HashiCorp Vault | Dynamic secrets, PKI, encryption as a service, multi-cloud | Small teams, simple applications |
| AWS Secrets Manager | AWS-native workloads, automatic rotation | Multi-cloud or hybrid requirements |
| AWS SSM Parameter Store | Non-sensitive config, low-cost secret storage | Rotation or audit requirements at scale |
| Kubernetes Secrets | Pod-level injection (with encryption at rest enabled) | Storing secrets long-term or sharing across clusters |
| SOPS / age | Encrypted secrets in git (gitops workflows) | Teams unfamiliar with key management |
Vault integration pattern for Kubernetes:
```yaml
# Using Vault Agent Injector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "app-role"
        vault.hashicorp.com/agent-inject-secret-db: "secret/data/app/db"
        vault.hashicorp.com/agent-inject-template-db: |
          {{- with secret "secret/data/app/db" -}}
          export DB_HOST={{ .Data.data.host }}
          export DB_PASSWORD={{ .Data.data.password }}
          {{- end -}}
    spec:
      serviceAccountName: app-sa
      containers:
        - name: app
          image: myapp:1.2.3
          command: ["/bin/sh", "-c", "source /vault/secrets/db && node server.js"]
```
Default-deny with explicit allow:
```yaml
# Default deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow app to receive traffic from ingress controller and talk to database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:  # Allow DNS resolution
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
```
RBAC hygiene:
- Audit effective permissions: `kubectl auth can-i --list --as=system:serviceaccount:production:app-sa`
- Never bind `cluster-admin` to application service accounts

```bash
# Sign container images with cosign
cosign sign --key cosign.key ghcr.io/myorg/myapp:1.2.3

# Verify before deployment
cosign verify --key cosign.pub ghcr.io/myorg/myapp:1.2.3

# Generate SBOM
syft ghcr.io/myorg/myapp:1.2.3 -o spdx-json > sbom.json

# Scan SBOM for vulnerabilities
grype sbom:sbom.json --fail-on high
```
Admission control: use Kyverno or OPA Gatekeeper to enforce policies such as requiring signed images, blocking `:latest` tags, mandating resource limits, and disallowing privileged containers.
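As a sketch of one such policy (Kyverno, rejecting unpinned image tags; the policy name and message are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

Start with `validationFailureAction: Audit` in existing clusters to surface violations before enforcing.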
| Workload Type | Spot Suitable? | Pattern |
|---|---|---|
| Stateless web servers (behind LB) | Yes | Mix 70% spot + 30% on-demand |
| CI/CD runners | Yes | 100% spot with retry logic |
| Batch processing / ETL | Yes | Spot fleet with checkpointing |
| Databases / stateful | No | Use reserved instances |
| Kubernetes control plane | No | On-demand or reserved |
| Dev/test environments | Yes | 100% spot, accept interruptions |
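The 70% spot / 30% on-demand mix for stateless fleets translates into a simple blended rate. A sketch with illustrative prices (real spot discounts vary by instance type, region, and time; 60-90% off on-demand is common):

```python
def blended_hourly_cost(on_demand_price: float, spot_discount: float,
                        spot_fraction: float, instances: int) -> float:
    """Blended fleet cost per hour for a spot + on-demand mix."""
    spot_price = on_demand_price * (1 - spot_discount)
    return instances * (spot_fraction * spot_price +
                        (1 - spot_fraction) * on_demand_price)

# 10 instances at $0.10/h on-demand, 70% spot at a 65% discount
print(round(blended_hourly_cost(0.10, 0.65, 0.70, 10), 3))  # 0.545
```

Against a full on-demand fleet ($1.00/h here), the mix cuts compute spend by roughly 45% while the on-demand floor absorbs spot interruptions.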
Tagging strategy: enforce `team`, `environment`, `service`, and `cost-center` tags on all resources so spend can be allocated and untagged resources flagged.

Severity matrix:

| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Complete service outage, data loss risk | 15 minutes | Production database down, payment system failure |
| SEV-2 | Significant degradation, partial outage | 30 minutes | High error rate, API latency > 10x normal |
| SEV-3 | Minor degradation, workaround available | 4 hours | Non-critical feature broken, elevated error rate < 1% |
| SEV-4 | Cosmetic / informational | Next business day | Dashboard rendering issue, log verbosity spike |
```markdown
# Runbook: [Service Name] - [Issue Type]

## Symptoms
- What alerts fire
- What users report
- What dashboards show

## Impact
- Which users/services affected
- Revenue impact estimate

## Diagnosis Steps
1. Check service health: `kubectl get pods -n production -l app=myapp`
2. Review recent deployments: `helm history myapp -n production`
3. Check error logs: `kubectl logs -l app=myapp -n production --tail=100`
4. Verify database connectivity: `kubectl exec -it app-pod -- pg_isready -h db-host`
5. Check resource utilization: Review Grafana dashboard [link]

## Remediation

### Quick Fix (< 5 min)
- Restart pods: `kubectl rollout restart deployment/myapp -n production`
- Scale up: `kubectl scale deployment/myapp --replicas=10 -n production`

### Rollback (< 10 min)
- `helm rollback myapp [previous-revision] -n production`

### Root Cause Fix
- [Document fix steps specific to this issue]

## Escalation
- L1: On-call engineer (PagerDuty)
- L2: Team lead / service owner
- L3: VP Engineering (SEV-1 only)

## Communication
- Statuspage update within 15 min of SEV-1/SEV-2
- Slack channel: #incidents
```
Every SEV-1 and SEV-2 incident requires a blameless postmortem within 3 business days:
Template: store postmortems in a shared wiki and link them from the incident channel for team visibility.
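A minimal postmortem skeleton, assuming a markdown wiki; this section list is a common baseline rather than this skill's canonical template:

```markdown
# Postmortem: [Incident title] ([date], SEV-[n])

## Summary
One paragraph: what happened, user impact, duration.

## Timeline (UTC)
Detection, escalation, mitigation, resolution, with timestamps.

## Root Cause
The technical cause. Blameless: name systems and gaps, not people.

## Contributing Factors

## What Went Well / What Went Poorly

## Action Items
Each item has an owner and a due date; track to completion.
```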
This skill includes three reference guides for deep-dive topics:
| Reference | Path | Covers |
|---|---|---|
| CI/CD Pipeline Guide | references/cicd_pipeline_guide.md | Pipeline patterns, platform comparisons, optimization techniques, testing strategies |
| Infrastructure as Code | references/infrastructure_as_code.md | Terraform patterns, module design, state management, provider configuration |
| Deployment Strategies | references/deployment_strategies.md | Strategy comparison, implementation details, rollback procedures, traffic management |
Use the reference files for extended examples and edge-case handling beyond what this skill file covers.
This skill works alongside other skills in the library:
| Skill | Integration |
|---|---|
| senior-secops | Security scanning in CI/CD pipelines, container image scanning, compliance checks |
| senior-architect | Infrastructure design decisions, service topology, dependency analysis |
| senior-backend | Application containerization, health check endpoints, config management |
| senior-cloud-architect | Cloud platform selection, multi-region architecture, disaster recovery planning |
| incident-commander | Incident escalation procedures, communication protocols, postmortem facilitation |
| code-reviewer | Infrastructure-as-code review standards, Terraform plan review, pipeline config review |
| aws-solution-architect | AWS-specific infrastructure patterns, service selection, cost optimization |
Last Updated: February 2026
Version: 2.0.0
Tools: 3 Python automation scripts
References: 3 deep-dive guides