senior-devops by borghei/claude-skills
npx skills add https://github.com/borghei/claude-skills --skill senior-devops

Production-grade DevOps engineering toolkit covering the full infrastructure lifecycle: CI/CD pipeline design, container orchestration, infrastructure as code, cloud platform architecture, deployment strategies, observability, security hardening, cost optimization, and incident response.
Use this skill when you encounter:
| Category | Terms |
|---|---|
| CI/CD | pipeline, GitHub Actions, GitLab CI, Jenkins, CircleCI, build automation, artifact registry, continuous integration, continuous delivery, continuous deployment |
| Containers | Docker, Dockerfile, docker-compose, container image, multi-stage build, OCI, container registry, ECR, GCR, ACR |
| Orchestration | Kubernetes, k8s, kubectl, Helm, pod, deployment, service, ingress, HPA, VPA, StatefulSet, DaemonSet, CronJob |
| IaC | Terraform, OpenTofu, CloudFormation, Pulumi, Ansible, state management, tfstate, modules, workspaces, drift detection |
| Cloud | AWS, GCP, Azure, EC2, EKS, GKE, AKS, Lambda, Cloud Functions, S3, VPC, IAM, load balancer, auto-scaling |
| Monitoring | Prometheus, Grafana, Datadog, ELK, Loki, Jaeger, OpenTelemetry, alerting, SLO, SLI, SLA, dashboards |
| Deployment | blue-green, canary, rolling update, feature flags, rollback, zero-downtime, A/B deployment, progressive delivery |
| Security | Vault, secrets management, RBAC, network policy, supply chain security, SBOM, image scanning, Trivy, Falco |
| Reliability | incident response, runbook, postmortem, SRE, error budget, chaos engineering, disaster recovery, RTO, RPO |
| Cost | FinOps, right-sizing, spot instances, reserved capacity, cost allocation, tagging strategy, savings plans |
This skill provides three core automation tools:
# Generate CI/CD pipelines for any platform (GitHub Actions, GitLab CI, Jenkins, CircleCI)
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose
# Scaffold Terraform infrastructure with modules, state config, and environment separation
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose
# Manage deployments with strategy selection, health checks, and rollback support
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
| Tool | Purpose | Key Flags |
|---|---|---|
| pipeline_generator.py | Generates CI/CD pipeline configurations from project analysis | --platform, --stages, --json |
| terraform_scaffolder.py | Creates Terraform module structure with best-practice patterns | --provider, --env, --modules |
| deployment_manager.py | Orchestrates deployments with strategy selection and rollback | --strategy, --target, --dry-run |
Every production Dockerfile should follow this layered pattern:
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
# Copy dependency manifests first (cache layer)
COPY package.json package-lock.json ./
RUN npm ci
# Copy source and build
COPY . .
RUN npm run build
# Drop devDependencies so only production modules ship to the final stage
RUN npm prune --production && npm cache clean --force
# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
# Run as a non-root user
RUN addgroup -g 1001 appgroup && \
adduser -u 1001 -G appgroup -s /bin/sh -D appuser
# Copy only production artifacts
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
Critical rules:

- Never deploy the latest tag to production; pin explicit version tags
- Use a .dockerignore that excludes .git, node_modules, test files, and docs
- Use npm ci, not npm install, and always copy the lock file

Production-ready compose file for a typical microservice stack:
version: "3.9"
x-common: &common
restart: unless-stopped
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
services:
app:
<<: *common
build:
context: .
dockerfile: Dockerfile
target: production
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://app:${DB_PASSWORD}@db:5432/appdb
- REDIS_URL=redis://redis:6379
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
deploy:
resources:
limits:
cpus: "1.0"
memory: 512M
reservations:
cpus: "0.25"
memory: 128M
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/healthz"]
interval: 15s
timeout: 5s
retries: 3
db:
<<: *common
image: postgres:16-alpine
volumes:
- pgdata:/var/lib/postgresql/data
environment:
POSTGRES_DB: appdb
POSTGRES_USER: app
POSTGRES_PASSWORD: ${DB_PASSWORD}
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app -d appdb"]
interval: 10s
timeout: 5s
retries: 5
redis:
<<: *common
image: redis:7-alpine
command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
volumes:
pgdata:
redisdata:
Container security checklist:

- Scan images before release: trivy image --severity HIGH,CRITICAL myapp:latest
- Run as non-root via a USER directive
- Run with a read-only filesystem: --read-only --tmpfs /tmp
- Verify no secrets leaked into layers with docker history --no-trunc

Sidecar pattern -- add capabilities without modifying the main container:
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
labels:
app: web
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
spec:
serviceAccountName: app-sa
securityContext:
runAsNonRoot: true
fsGroup: 1001
containers:
- name: app
image: myapp:1.2.3
ports:
- containerPort: 3000
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: 3000
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
startupProbe:
httpGet:
path: /healthz
port: 3000
failureThreshold: 30
periodSeconds: 10
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: app-secrets
key: db-password
- name: log-shipper
image: fluent/fluent-bit:2.2
volumeMounts:
- name: app-logs
mountPath: /var/log/app
volumes:
- name: app-logs
emptyDir: {}
Probe decision framework: use a startupProbe for slow-starting apps (it suspends the other probes until it first succeeds), a readinessProbe to gate traffic on dependencies, and a livenessProbe only for unrecoverable states -- a failed liveness probe restarts the container.
charts/myapp/
Chart.yaml
values.yaml
values-staging.yaml
values-production.yaml
templates/
deployment.yaml
service.yaml
ingress.yaml
hpa.yaml
networkpolicy.yaml
serviceaccount.yaml
_helpers.tpl
Key values.yaml patterns:
replicaCount: 3
image:
repository: myapp
tag: "1.2.3"
pullPolicy: IfNotPresent
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: app.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: app-tls
hosts:
- app.example.com
HPA (Horizontal Pod Autoscaler):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 120
Decision: HPA vs VPA vs KEDA

| Scaler | Use When | Avoid When |
|---|---|---|
| HPA | Stateless services, predictable CPU/memory patterns | Stateful workloads, bursty event-driven loads |
| VPA | Right-sizing requests/limits, batch jobs, single-replica workloads | Used alone for latency-sensitive services |
| KEDA | Event-driven scaling (queue depth, HTTP rate, cron) | Simple CPU-based scaling (HPA is simpler) |
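The decision table above can be encoded as a small helper. A hedged sketch -- the workload traits and return labels are illustrative, not part of this skill's tooling:

```python
def choose_scaler(stateless: bool, event_driven: bool, latency_sensitive: bool) -> str:
    """Pick an autoscaler following the HPA/VPA/KEDA decision table (illustrative)."""
    if event_driven:
        return "KEDA"   # queue depth, HTTP rate, or cron triggers
    if stateless:
        return "HPA"    # predictable CPU/memory patterns
    # VPA right-sizes requests/limits, but should not act alone for
    # latency-sensitive services; pair it with HPA there.
    return "VPA+HPA" if latency_sensitive else "VPA"

print(choose_scaler(stateless=True, event_driven=False, latency_sensitive=False))   # HPA
print(choose_scaler(stateless=False, event_driven=True, latency_sensitive=False))   # KEDA
```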
Production pipeline with caching, matrix testing, and deployment gates:
name: CI/CD
on:
push:
branches: [main]
pull_request:
branches: [main]
permissions:
contents: read
packages: write
id-token: write
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
node-version: [18, 20]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: npm
- run: npm ci
- run: npm run lint
- run: npm test -- --coverage
- uses: actions/upload-artifact@v4
if: matrix.node-version == 20
with:
name: coverage
path: coverage/
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: fs
severity: HIGH,CRITICAL
exit-code: 1
build:
needs: [test, security]
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha
type=ref,event=branch
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
cache-from: type=gha
cache-to: type=gha,mode=max
deploy-staging:
needs: build
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Deploy to staging
run: |
helm upgrade --install app charts/myapp \
--namespace staging \
--values charts/myapp/values-staging.yaml \
--set image.tag=${{ github.sha }} \
--wait --timeout 300s
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Deploy to production (canary)
run: |
helm upgrade --install app charts/myapp \
--namespace production \
--values charts/myapp/values-production.yaml \
--set image.tag=${{ github.sha }} \
--set canary.enabled=true \
--set canary.weight=10 \
--wait --timeout 300s
stages:
- test
- build
- deploy
variables:
DOCKER_BUILDKIT: 1
test:
stage: test
image: node:20-alpine
cache:
key: ${CI_COMMIT_REF_SLUG}
paths:
- node_modules/
script:
- npm ci
- npm run lint
- npm test -- --coverage
coverage: '/Lines\s*:\s*(\d+\.?\d*)%/'
artifacts:
reports:
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
build:
stage: build
image: docker:24
services:
- docker:24-dind
only:
- main
script:
- docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
- docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
deploy_staging:
stage: deploy
environment:
name: staging
url: https://staging.example.com
only:
- main
script:
- helm upgrade --install app charts/myapp
--namespace staging
--set image.tag=$CI_COMMIT_SHA
--wait
deploy_production:
stage: deploy
environment:
name: production
url: https://app.example.com
only:
- main
when: manual
script:
- helm upgrade --install app charts/myapp
--namespace production
--set image.tag=$CI_COMMIT_SHA
--wait
infrastructure/
modules/
vpc/
main.tf
variables.tf
outputs.tf
eks/
main.tf
variables.tf
outputs.tf
rds/
main.tf
variables.tf
outputs.tf
environments/
staging/
main.tf # Calls modules with staging values
terraform.tfvars
backend.tf # S3 + DynamoDB state backend
production/
main.tf
terraform.tfvars
backend.tf
Remote state with locking (AWS):
# backend.tf
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "production/infrastructure.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
State management rules:

- Run terraform plan in CI; run terraform apply only after approval
- Use terraform state list and terraform state show for debugging; never edit state manually

| Pattern | Use When | Trade-offs |
|---|---|---|
| Workspaces | Same config, different scale (dev/staging/prod with identical topology) | Shared state backend, easy switching, but harder to diverge configs |
| Directories | Environments need different resources or topology | Full isolation, clear boundaries, but duplicated boilerplate |

Recommendation: use directories for environment separation and modules for shared logic. Workspaces are better suited to ephemeral environments (PR previews, load-test environments).
Integrate drift detection into CI:

# Run in CI on a schedule (daily)
terraform plan -detailed-exitcode -out=plan.tfplan
ec=$?
# Exit code 0 = no changes (clean)
# Exit code 1 = error
# Exit code 2 = changes detected (drift)
# Capture the exit code immediately so later commands don't clobber $?
if [ "$ec" -eq 2 ]; then
  # Send alert to Slack/PagerDuty
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d '{"text":"Terraform drift detected in production. Review required."}'
fi
Safety rules:

- Set prevent_destroy on critical resources (databases, S3 buckets with data)
- Require explicit review for destroy and replace actions

| Pillar | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Numeric time-series data (CPU, latency, error rates) |
| Logs | Loki / ELK (Elasticsearch, Logstash, Kibana) | Structured event records for debugging |
| Traces | Jaeger / Tempo + OpenTelemetry | Request flow across services for latency analysis |
groups:
- name: application
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate exceeds 5% for 5 minutes"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
> 2.0
for: 10m
labels:
severity: warning
annotations:
summary: "P99 latency exceeds 2s for 10 minutes"
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} restarting frequently"
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space below 15% on {{ $labels.instance }}"
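The HighErrorRate rule above fires when the ratio of 5xx request rate to total request rate exceeds 5%. The same check in plain Python, with made-up counter values for illustration:

```python
def error_rate(status_counts: dict) -> float:
    """Fraction of requests with a 5xx status, mirroring the PromQL ratio."""
    total = sum(status_counts.values())
    errors = sum(v for k, v in status_counts.items() if k.startswith("5"))
    return errors / total if total else 0.0

# Hypothetical per-status request rates over the last 5 minutes
rates = {"200": 940.0, "404": 10.0, "500": 30.0, "503": 20.0}
print(error_rate(rates))                   # 0.05 -- exactly at the 5% threshold
print("alert fires:", error_rate(rates) > 0.05)
```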
Define SLIs first, then set SLOs:

| Service | SLI (what you measure) | SLO (target) | Error Budget |
|---|---|---|---|
| API Gateway | Successful requests / total requests | 99.9% availability (43.8 min/month downtime) | 0.1% |
| API Latency | Requests under 500ms / total requests | 99th percentile < 500ms | 1% |
| Data Pipeline | Successful pipeline runs / total runs | 99.5% success rate | 0.5% |
| Deployment | Successful deploys / total deploys | 99% success rate | 1% |

Error budget policy: when the error budget is exhausted, freeze feature deployments and prioritize reliability work until the budget recovers.
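The monthly downtime figures in the SLO table follow directly from the availability target. A small helper to derive them, assuming a 30.42-day average month (which is what the 43.8-minute figure implies):

```python
def monthly_error_budget_minutes(slo_percent: float, days: float = 30.42) -> float:
    """Minutes of allowed downtime per month for a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0

# 99.9% availability -> roughly 43.8 minutes/month, matching the table
print(round(monthly_error_budget_minutes(99.9), 1))
```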
Every service dashboard should include panels for the "Four Golden Signals": latency, traffic, errors, and saturation.
| Capability | AWS | GCP | Azure |
|---|---|---|---|
| Managed Kubernetes | EKS | GKE | AKS |
| Serverless compute | Lambda | Cloud Functions / Cloud Run | Azure Functions |
| Container service | ECS / Fargate | Cloud Run | Container Apps |
| Object storage | S3 | Cloud Storage | Blob Storage |
| Managed database | RDS / Aurora | Cloud SQL / AlloyDB | Azure SQL / Cosmos DB |
| Message queue | SQS / SNS | Pub/Sub | Service Bus |
| CDN | CloudFront | Cloud CDN | Azure CDN / Front Door |
| DNS | Route 53 | Cloud DNS | Azure DNS |
| Secrets management | Secrets Manager | Secret Manager | Key Vault |
| IAM | IAM + STS | IAM + Workload Identity | Entra ID + RBAC |
| IaC | CloudFormation / CDK | Deployment Manager | Bicep / ARM |
When multi-cloud makes sense:
When it does not:
If you go multi-cloud:
| Strategy | Risk | Rollback Speed | Infrastructure Cost | Best For |
|---|---|---|---|---|
| Rolling update | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-green | Low | Seconds (DNS/LB switch) | 2x during deploy | Mission-critical, zero-downtime required |
| Canary | Low | Seconds (shift traffic back) | 1.1x | User-facing services, gradual validation |
| Feature flags | Lowest | Instant (toggle) | 1x | Granular control, A/B testing, trunk-based development |
# Blue (current production)
apiVersion: v1
kind: Service
metadata:
name: app-production
spec:
selector:
app: myapp
version: blue # Points to the current version
ports:
- port: 80
targetPort: 3000
---
# Green (new version) -- deployed alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: green
template:
metadata:
labels:
app: myapp
version: green
spec:
containers:
- name: app
image: myapp:2.0.0
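The actual blue-to-green cutover is a one-field change to the Service selector. A hedged sketch that builds the kubectl patch payload -- service and label names are taken from the manifests above, and the kubectl invocation is shown only as a comment:

```python
import json

def selector_patch(version: str) -> str:
    """JSON merge patch that flips the Service's version selector."""
    return json.dumps({"spec": {"selector": {"app": "myapp", "version": version}}})

patch = selector_patch("green")
# Apply with: kubectl patch service app-production --type merge -p "$patch"
print(patch)
```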
Cutover steps:

1. Deploy the green Deployment alongside blue and wait for all replicas to pass readiness
2. Switch the Service selector from version: blue to version: green
3. Keep blue running until green is verified; rollback is a selector change back to blue

# Istio VirtualService for canary routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: app-canary
spec:
hosts:
- app.example.com
http:
- route:
- destination:
host: app-stable
port:
number: 80
weight: 90
- destination:
host: app-canary
port:
number: 80
weight: 10
Canary promotion ladder:
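A promotion ladder steps the canary traffic weight up only while observed error rates stay below a threshold, and shifts all traffic back to stable on any regression. A minimal sketch -- the weights, threshold, and metrics callable are illustrative:

```python
def promote(weights, canary_error_rate, threshold=0.01):
    """Walk the weight ladder; stop and roll back if the canary misbehaves.

    canary_error_rate: callable returning the observed error rate at each weight.
    Returns the final weight reached (0 means rolled back).
    """
    current = 0
    for w in weights:
        if canary_error_rate(w) > threshold:
            return 0  # roll back: shift all traffic to stable
        current = w
    return current

# Healthy canary climbs the full ladder to 100%
print(promote([10, 25, 50, 100], lambda w: 0.002))
# A regression appearing at 50% traffic triggers rollback
print(promote([10, 25, 50, 100], lambda w: 0.05 if w >= 50 else 0.002))
```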
Use feature flags to decouple deployment from release:
# Example with LaunchDarkly / Unleash / a simple config lookup
if feature_flags.is_enabled("new-checkout-flow", user_context):
return new_checkout_handler(request)
else:
return legacy_checkout_handler(request)
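The is_enabled call above can be backed by something as small as an in-memory store with deterministic percentage rollout. A hedged sketch -- a stand-in for LaunchDarkly/Unleash, not a reproduction of their APIs:

```python
import hashlib

class FeatureFlags:
    """Minimal flag store: on/off flags plus deterministic percentage rollout."""

    def __init__(self, flags: dict):
        # e.g. {"new-checkout-flow": {"enabled": True, "rollout": 25}}
        self.flags = flags

    def is_enabled(self, name: str, user_id: str) -> bool:
        flag = self.flags.get(name)
        if not flag or not flag.get("enabled", False):
            return False
        rollout = flag.get("rollout", 100)
        # Hash flag+user so each user lands in a stable bucket in [0, 100)
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout

flags = FeatureFlags({"new-checkout-flow": {"enabled": True, "rollout": 50}})
print(flags.is_enabled("new-checkout-flow", "user-42"))
```

The hash-based bucketing keeps each user's experience stable across requests while still honoring the rollout percentage in aggregate.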
Flag lifecycle: create the flag, ramp up the rollout, reach 100%, then remove the flag and its dead code path -- stale flags are technical debt.

Decision matrix:
| Tool | Best For | Avoid When |
|---|---|---|
| HashiCorp Vault | Dynamic secrets, PKI, encryption as a service, multi-cloud | Small teams, simple applications |
| AWS Secrets Manager | AWS-native workloads, automatic rotation | Multi-cloud or hybrid requirements |
| AWS SSM Parameter Store | Non-sensitive config, low-cost secret storage | Rotation or audit requirements at scale |
| Kubernetes Secrets | Pod-level injection (with encryption at rest enabled) | Long-term secret storage or cross-cluster sharing |
| SOPS / age | Encrypted secrets in Git (GitOps workflows) | Teams unfamiliar with key management |
Vault integration pattern for Kubernetes:
# Using the Vault Agent Injector
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
spec:
template:
metadata:
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "app-role"
vault.hashicorp.com/agent-inject-secret-db: "secret/data/app/db"
vault.hashicorp.com/agent-inject-template-db: |
{{- with secret "secret/data/app/db" -}}
export DB_HOST={{ .Data.data.host }}
export DB_PASSWORD={{ .Data.data.password }}
{{- end -}}
spec:
serviceAccountName: app-sa
containers:
- name: app
image: myapp:1.2.3
command: ["/bin/sh", "-c", "source /vault/secrets/db && node server.js"]
Default deny, explicit allow:
# Deny all ingress and egress traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
# Allow the app to receive traffic from the ingress controller and talk to the database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: app-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: web
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 3000
egress:
- to:
- podSelector:
matchLabels:
app: postgres
ports:
- protocol: TCP
port: 5432
- to: # Allow DNS resolution
- namespaceSelector: {}
ports:
- protocol: UDP
port: 53
RBAC hygiene:

- Audit what a service account can do: kubectl auth can-i --list --as=system:serviceaccount:production:app-sa
- Never grant workloads cluster-admin

# Sign container images with cosign
cosign sign --key cosign.key ghcr.io/myorg/myapp:1.2.3
# Verify before deploying
cosign verify --key cosign.pub ghcr.io/myorg/myapp:1.2.3
# Generate an SBOM
syft ghcr.io/myorg/myapp:1.2.3 -o spdx-json > sbom.json
# Scan the SBOM for vulnerabilities
grype sbom:sbom.json --fail-on high
Admission control: use Kyverno or OPA Gatekeeper to enforce policies such as requiring image signatures, blocking the latest tag, and mandating resource limits.
| Workload Type | Spot-Friendly? | Pattern |
|---|---|---|
| Stateless web servers |
Production-grade DevOps engineering toolkit covering the full infrastructure lifecycle: CI/CD pipeline design, container orchestration, infrastructure as code, cloud platform architecture, deployment strategies, observability, security hardening, cost optimization, and incident response.
Use this skill when you encounter:
| Category | Terms |
|---|---|
| CI/CD | pipeline, GitHub Actions, GitLab CI, Jenkins, CircleCI, build automation, artifact registry, continuous integration, continuous delivery, continuous deployment |
| Containers | Docker, Dockerfile, docker-compose, container image, multi-stage build, OCI, container registry, ECR, GCR, ACR |
| Orchestration | Kubernetes, k8s, kubectl, Helm, pod, deployment, service, ingress, HPA, VPA, StatefulSet, DaemonSet, CronJob |
| IaC | Terraform, OpenTofu, CloudFormation, Pulumi, Ansible, state management, tfstate, modules, workspaces, drift detection |
| Cloud | AWS, GCP, Azure, EC2, EKS, GKE, AKS, Lambda, Cloud Functions, S3, VPC, IAM, load balancer, auto-scaling |
| Monitoring | Prometheus, Grafana, Datadog, ELK, Loki, Jaeger, OpenTelemetry, alerting, SLO, SLI, SLA, dashboards |
| Deployment | blue-green, canary, rolling update, feature flags, rollback, zero-downtime, A/B deployment, progressive delivery |
| Security | Vault, secrets management, RBAC, network policy, supply chain security, SBOM, image scanning, Trivy, Falco |
| Reliability | incident response, runbook, postmortem, SRE, error budget, chaos engineering, disaster recovery, RTO, RPO |
| Cost | FinOps, right-sizing, spot instances, reserved capacity, cost allocation, tagging strategy, savings plans |
This skill provides three core automation tools:
# Generate CI/CD pipelines for any platform (GitHub Actions, GitLab CI, Jenkins, CircleCI)
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose
# Scaffold Terraform infrastructure with modules, state config, and environment separation
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose
# Manage deployments with strategy selection, health checks, and rollback support
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
| Tool | Purpose | Key Flags |
|---|---|---|
pipeline_generator.py | Generates CI/CD pipeline configurations from project analysis | --platform, --stages, --json |
terraform_scaffolder.py | Creates Terraform module structure with best-practice patterns | --provider, --env, --modules |
Every production Dockerfile should follow this layered pattern:
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
# Copy dependency manifests first (cache layer)
COPY package.json package-lock.json ./
RUN npm ci --only=production && npm cache clean --force
# Copy source and build
COPY . .
RUN npm run build
# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
# Run as non-root
RUN addgroup -g 1001 appgroup && \
adduser -u 1001 -G appgroup -s /bin/sh -D appuser
# Copy only production artifacts
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
Critical rules:
latest in production.dockerignore to exclude .git, node_modules, test files, docsnpm ci not npm install, lock files always copiedProduction-ready compose for a typical microservice stack:
version: "3.9"
x-common: &common
restart: unless-stopped
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
services:
app:
<<: *common
build:
context: .
dockerfile: Dockerfile
target: production
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://app:${DB_PASSWORD}@db:5432/appdb
- REDIS_URL=redis://redis:6379
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
deploy:
resources:
limits:
cpus: "1.0"
memory: 512M
reservations:
cpus: "0.25"
memory: 128M
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/healthz"]
interval: 15s
timeout: 5s
retries: 3
db:
<<: *common
image: postgres:16-alpine
volumes:
- pgdata:/var/lib/postgresql/data
environment:
POSTGRES_DB: appdb
POSTGRES_USER: app
POSTGRES_PASSWORD: ${DB_PASSWORD}
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app -d appdb"]
interval: 10s
timeout: 5s
retries: 5
redis:
<<: *common
image: redis:7-alpine
command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
volumes:
pgdata:
redisdata:
trivy image --severity HIGH,CRITICAL myapp:latestUSER directive--read-only --tmpfs /tmpdocker history --no-truncSidecar pattern -- add capabilities without modifying the main container:
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
labels:
app: web
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
spec:
serviceAccountName: app-sa
securityContext:
runAsNonRoot: true
fsGroup: 1001
containers:
- name: app
image: myapp:1.2.3
ports:
- containerPort: 3000
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: 3000
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
startupProbe:
httpGet:
path: /healthz
port: 3000
failureThreshold: 30
periodSeconds: 10
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: app-secrets
key: db-password
- name: log-shipper
image: fluent/fluent-bit:2.2
volumeMounts:
- name: app-logs
mountPath: /var/log/app
volumes:
- name: app-logs
emptyDir: {}
Probe decision framework:
charts/myapp/
Chart.yaml
values.yaml
values-staging.yaml
values-production.yaml
templates/
deployment.yaml
service.yaml
ingress.yaml
hpa.yaml
networkpolicy.yaml
serviceaccount.yaml
_helpers.tpl
Key values.yaml patterns:
replicaCount: 3
image:
repository: myapp
tag: "1.2.3"
pullPolicy: IfNotPresent
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: app.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: app-tls
hosts:
- app.example.com
HPA (Horizontal Pod Autoscaler):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 120
Decision: HPA vs VPA vs KEDA
| Scaler | Use When | Avoid When |
|---|---|---|
| HPA | Stateless services, predictable CPU/memory patterns | Stateful workloads, bursty event-driven loads |
| VPA | Right-sizing requests/limits, batch jobs, single-replica workloads | Used alone for latency-sensitive services |
| KEDA | Event-driven scaling (queue depth, HTTP rate, cron) | Simple CPU-based scaling (HPA is simpler) |
Production pipeline with caching, matrix testing, and deployment gates:
name: CI/CD
on:
push:
branches: [main]
pull_request:
branches: [main]
permissions:
contents: read
packages: write
id-token: write
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
node-version: [18, 20]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: npm
- run: npm ci
- run: npm run lint
- run: npm test -- --coverage
- uses: actions/upload-artifact@v4
if: matrix.node-version == 20
with:
name: coverage
path: coverage/
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: fs
severity: HIGH,CRITICAL
exit-code: 1
build:
needs: [test, security]
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha
type=ref,event=branch
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
cache-from: type=gha
cache-to: type=gha,mode=max
deploy-staging:
needs: build
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Deploy to staging
run: |
helm upgrade --install app charts/myapp \
--namespace staging \
--values charts/myapp/values-staging.yaml \
--set image.tag=${{ github.sha }} \
--wait --timeout 300s
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Deploy to production (canary)
run: |
helm upgrade --install app charts/myapp \
--namespace production \
--values charts/myapp/values-production.yaml \
--set image.tag=${{ github.sha }} \
--set canary.enabled=true \
--set canary.weight=10 \
--wait --timeout 300s
stages:
- test
- build
- deploy
variables:
DOCKER_BUILDKIT: 1
test:
stage: test
image: node:20-alpine
cache:
key: ${CI_COMMIT_REF_SLUG}
paths:
- node_modules/
script:
- npm ci
- npm run lint
- npm test -- --coverage
coverage: '/Lines\s*:\s*(\d+\.?\d*)%/'
artifacts:
reports:
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
build:
stage: build
image: docker:24
services:
- docker:24-dind
only:
- main
script:
- docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
- docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
deploy_staging:
stage: deploy
environment:
name: staging
url: https://staging.example.com
only:
- main
script:
- helm upgrade --install app charts/myapp
--namespace staging
--set image.tag=$CI_COMMIT_SHA
--wait
deploy_production:
stage: deploy
environment:
name: production
url: https://app.example.com
only:
- main
when: manual
script:
- helm upgrade --install app charts/myapp
--namespace production
--set image.tag=$CI_COMMIT_SHA
--wait
infrastructure/
modules/
vpc/
main.tf
variables.tf
outputs.tf
eks/
main.tf
variables.tf
outputs.tf
rds/
main.tf
variables.tf
outputs.tf
environments/
staging/
main.tf # Calls modules with staging values
terraform.tfvars
backend.tf # S3 + DynamoDB state backend
production/
main.tf
terraform.tfvars
backend.tf
Remote state with locking (AWS):
# backend.tf
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "production/infrastructure.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
State management rules:
terraform plan in CI, terraform apply only after approvalterraform state list and terraform state show for debugging, never edit state manually| Pattern | Use When | Trade-offs |
|---|---|---|
| Workspaces | Same config, different scale (dev/staging/prod with identical topology) | Shared state backend, easy switching, but harder to diverge configs |
| Directories | Different environments need different resources or topology | Full isolation, clear boundaries, but duplicated boilerplate |
Recommendation : Use directories for environment separation. Use modules for shared logic. Workspaces are better suited for ephemeral environments (PR previews, load test environments).
Integrate drift detection into CI:
# Run in CI on a schedule (daily)
terraform plan -detailed-exitcode -out=plan.tfplan
# Exit code 0 = no changes (clean)
# Exit code 1 = error
# Exit code 2 = changes detected (drift)
# Alert on exit code 2
if [ $? -eq 2 ]; then
# Send alert to Slack/PagerDuty
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d '{"text":"Terraform drift detected in production. Review required."}'
fi
prevent_destroy on critical resources (databases, S3 buckets with data).destroy and replace actions.| Pillar | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Numeric time-series data (CPU, latency, error rates) |
| Logs | Loki / ELK (Elasticsearch, Logstash, Kibana) | Structured event records for debugging |
| Traces | Jaeger / Tempo + OpenTelemetry | Request flow across services for latency analysis |
groups:
- name: application
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate exceeds 5% for 5 minutes"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
> 2.0
for: 10m
labels:
severity: warning
annotations:
summary: "P99 latency exceeds 2s for 10 minutes"
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} restarting frequently"
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space below 15% on {{ $labels.instance }}"
Define SLIs first, then set SLOs:
| Service | SLI (what you measure) | SLO (target) | Error Budget |
|---|---|---|---|
| API Gateway | Successful requests / Total requests | 99.9% availability (43.8 min/month downtime) | 0.1% |
| API Latency | Requests under 500ms / Total requests | 99th percentile < 500ms | 1% |
| Data Pipeline | Successful pipeline runs / Total runs | 99.5% success rate | 0.5% |
| Deployment | Successful deploys / Total deploys | 99% success rate | 1% |
Error budget policy : When the error budget is exhausted, freeze feature deployments and prioritize reliability work until the budget recovers.
Every service dashboard should include these panels (the "Four Golden Signals"):
| Capability | AWS | GCP | Azure |
|---|---|---|---|
| Managed Kubernetes | EKS | GKE | AKS |
| Serverless Compute | Lambda | Cloud Functions / Cloud Run | Azure Functions |
| Container Service | ECS/Fargate | Cloud Run | Container Apps |
| Object Storage | S3 | Cloud Storage | Blob Storage |
| Managed Database | RDS / Aurora | Cloud SQL / AlloyDB | Azure SQL / Cosmos DB |
| Message Queue | SQS / SNS | Pub/Sub | Service Bus |
When multi-cloud makes sense:
When it does not:
If you go multi-cloud:
| Strategy | Risk | Rollback Speed | Infrastructure Cost | Best For |
|---|---|---|---|---|
| Rolling Update | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-Green | Low | Seconds (DNS/LB switch) | 2x during deploy | Mission-critical, zero-downtime required |
| Canary | Low | Seconds (shift traffic back) | 1.1x | User-facing services, gradual validation |
| Feature Flags | Lowest | Instant (toggle) | 1x | Granular control, A/B testing, trunk-based dev |
```yaml
# Blue (current production)
apiVersion: v1
kind: Service
metadata:
  name: app-production
spec:
  selector:
    app: myapp
    version: blue  # Points to current version
  ports:
    - port: 80
      targetPort: 3000
---
# Green (new version) -- deploy alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:2.0.0
```
Cutover steps:
1. Deploy green alongside blue and wait until all green pods pass readiness checks.
2. Smoke-test green directly (port-forward or a temporary test Service).
3. Switch the Service selector from `version: blue` to `version: green`; traffic cuts over atomically.
4. Monitor error rates and latency; roll back by switching the selector back to `version: blue`.
5. Decommission the blue Deployment once green is stable.

```yaml
# Istio VirtualService for canary routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-canary
spec:
  hosts:
    - app.example.com
  http:
    - route:
        - destination:
            host: app-stable
            port:
              number: 80
          weight: 90
        - destination:
            host: app-canary
            port:
              number: 80
          weight: 10
```
Canary promotion ladder:
1. Shift 10% of traffic to the canary; bake while comparing error rate and latency against the stable baseline.
2. Promote to 25%, then 50%, with a bake period at each step.
3. Promote to 100% and retire the old version after a final bake.
4. At any step, if SLOs degrade, shift all traffic back to stable (seconds, per the strategy table above).
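If you run Argo Rollouts, this kind of ladder can be encoded declaratively. A sketch (weights and bake durations are illustrative; the Rollout's selector and pod template are omitted for brevity):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      stableService: app-stable
      canaryService: app-canary
      steps:
        - setWeight: 10
        - pause: {duration: 15m}
        - setWeight: 25
        - pause: {duration: 15m}
        - setWeight: 50
        - pause: {duration: 30m}
```

After the last step the controller promotes the canary to 100%; `kubectl argo rollouts abort` shifts traffic back to stable at any point.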
Use feature flags to decouple deployment from release:

```python
# Example with LaunchDarkly / Unleash / simple config
if feature_flags.is_enabled("new-checkout-flow", user_context):
    return new_checkout_handler(request)
else:
    return legacy_checkout_handler(request)
```
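The flag SDK above is a stand-in for whatever provider you use. As a minimal self-contained sketch, a deterministic percentage rollout can be implemented by hashing the flag/user pair into a bucket (function name and bucketing scheme are illustrative, not any SDK's API):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage rollout: bucket (flag, user) into 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# The same user keeps the same answer as the rollout percentage grows,
# so nobody flips back and forth between code paths mid-rollout.
print(is_enabled("new-checkout-flow", "user-42", 100))  # True
```

Hashing per-flag (not just per-user) keeps rollout populations independent across flags.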
Flag lifecycle: create the flag default-off, roll out gradually, reach 100%, then remove the flag and the dead code path. Stale flags accumulate as tech debt; set a removal date when the flag is created.
Decision matrix:
| Tool | Best For | Avoid When |
|---|---|---|
| HashiCorp Vault | Dynamic secrets, PKI, encryption as a service, multi-cloud | Small teams, simple applications |
| AWS Secrets Manager | AWS-native workloads, automatic rotation | Multi-cloud or hybrid requirements |
| AWS SSM Parameter Store | Non-sensitive config, low-cost secret storage | Rotation or audit requirements at scale |
| Kubernetes Secrets | Pod-level injection (with encryption at rest enabled) | Storing secrets long-term or sharing across clusters |
| SOPS / age | Encrypted secrets in git (gitops workflows) | Teams unfamiliar with key management |
Vault integration pattern for Kubernetes:
```yaml
# Using Vault Agent Injector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "app-role"
        vault.hashicorp.com/agent-inject-secret-db: "secret/data/app/db"
        vault.hashicorp.com/agent-inject-template-db: |
          {{- with secret "secret/data/app/db" -}}
          export DB_HOST={{ .Data.data.host }}
          export DB_PASSWORD={{ .Data.data.password }}
          {{- end -}}
    spec:
      serviceAccountName: app-sa
      containers:
        - name: app
          image: myapp:1.2.3
          command: ["/bin/sh", "-c", "source /vault/secrets/db && node server.js"]
```
Default-deny with explicit allow:
```yaml
# Default deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow app to receive traffic from ingress controller and talk to database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:  # Allow DNS resolution
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
```
RBAC hygiene:
- Audit effective permissions: `kubectl auth can-i --list --as=system:serviceaccount:production:app-sa`
- Never bind `cluster-admin` to application service accounts

```bash
# Sign container images with cosign
cosign sign --key cosign.key ghcr.io/myorg/myapp:1.2.3

# Verify before deployment
cosign verify --key cosign.pub ghcr.io/myorg/myapp:1.2.3

# Generate SBOM
syft ghcr.io/myorg/myapp:1.2.3 -o spdx-json > sbom.json

# Scan SBOM for vulnerabilities
grype sbom:sbom.json --fail-on high
```
Admission control: use Kyverno or OPA Gatekeeper to enforce policies such as requiring signed images, blocking `:latest` tags, mandating resource limits, and disallowing privileged containers.
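As a sketch of one such policy (Kyverno, rejecting unpinned image tags; the policy name and message are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

Start with `validationFailureAction: Audit` in existing clusters to surface violations before enforcing.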
| Workload Type | Spot Suitable? | Pattern |
|---|---|---|
| Stateless web servers (behind LB) | Yes | Mix 70% spot + 30% on-demand |
| CI/CD runners | Yes | 100% spot with retry logic |
| Batch processing / ETL | Yes | Spot fleet with checkpointing |
| Databases / stateful | No | Use reserved instances |
| Kubernetes control plane | No | On-demand or reserved |
| Dev/test environments | Yes | 100% spot, accept interruptions |
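The 70% spot / 30% on-demand mix for stateless fleets translates into a simple blended rate. A sketch with illustrative prices (real spot discounts vary by instance type, region, and time; 60-90% off on-demand is common):

```python
def blended_hourly_cost(on_demand_price: float, spot_discount: float,
                        spot_fraction: float, instances: int) -> float:
    """Blended fleet cost per hour for a spot + on-demand mix."""
    spot_price = on_demand_price * (1 - spot_discount)
    return instances * (spot_fraction * spot_price +
                        (1 - spot_fraction) * on_demand_price)

# 10 instances at $0.10/h on-demand, 70% spot at a 65% discount
print(round(blended_hourly_cost(0.10, 0.65, 0.70, 10), 3))  # 0.545
```

Against a full on-demand fleet ($1.00/h here), the mix cuts compute spend by roughly 45% while the on-demand floor absorbs spot interruptions.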
Tagging strategy: enforce `team`, `environment`, `service`, and `cost-center` tags on all resources so spend can be allocated and untagged resources flagged.

Severity matrix:

| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Complete service outage, data loss risk | 15 minutes | Production database down, payment system failure |
| SEV-2 | Significant degradation, partial outage | 30 minutes | High error rate, API latency > 10x normal |
| SEV-3 | Minor degradation, workaround available | 4 hours | Non-critical feature broken, elevated error rate < 1% |
| SEV-4 | Cosmetic / informational | Next business day | Dashboard rendering issue, log verbosity spike |
```markdown
# Runbook: [Service Name] - [Issue Type]

## Symptoms
- What alerts fire
- What users report
- What dashboards show

## Impact
- Which users/services affected
- Revenue impact estimate

## Diagnosis Steps
1. Check service health: `kubectl get pods -n production -l app=myapp`
2. Review recent deployments: `helm history myapp -n production`
3. Check error logs: `kubectl logs -l app=myapp -n production --tail=100`
4. Verify database connectivity: `kubectl exec -it app-pod -- pg_isready -h db-host`
5. Check resource utilization: Review Grafana dashboard [link]

## Remediation

### Quick Fix (< 5 min)
- Restart pods: `kubectl rollout restart deployment/myapp -n production`
- Scale up: `kubectl scale deployment/myapp --replicas=10 -n production`

### Rollback (< 10 min)
- `helm rollback myapp [previous-revision] -n production`

### Root Cause Fix
- [Document fix steps specific to this issue]

## Escalation
- L1: On-call engineer (PagerDuty)
- L2: Team lead / service owner
- L3: VP Engineering (SEV-1 only)

## Communication
- Statuspage update within 15 min of SEV-1/SEV-2
- Slack channel: #incidents
```
Every SEV-1 and SEV-2 incident requires a blameless postmortem within 3 business days:
Template: store postmortems in a shared wiki and link them from the incident channel for team visibility.
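A minimal postmortem skeleton, assuming a markdown wiki; this section list is a common baseline rather than this skill's canonical template:

```markdown
# Postmortem: [Incident title] ([date], SEV-[n])

## Summary
One paragraph: what happened, user impact, duration.

## Timeline (UTC)
Detection, escalation, mitigation, resolution, with timestamps.

## Root Cause
The technical cause. Blameless: name systems and gaps, not people.

## Contributing Factors

## What Went Well / What Went Poorly

## Action Items
Each item has an owner and a due date; track to completion.
```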
This skill includes three reference guides for deep-dive topics:
| Reference | Path | Covers |
|---|---|---|
| CI/CD Pipeline Guide | references/cicd_pipeline_guide.md | Pipeline patterns, platform comparisons, optimization techniques, testing strategies |
| Infrastructure as Code | references/infrastructure_as_code.md | Terraform patterns, module design, state management, provider configuration |
| Deployment Strategies | references/deployment_strategies.md | Strategy comparison, implementation details, rollback procedures, traffic management |
Use the reference files for extended examples and edge-case handling beyond what this skill file covers.
This skill works alongside other skills in the library:
| Skill | Integration |
|---|---|
| senior-secops | Security scanning in CI/CD pipelines, container image scanning, compliance checks |
| senior-architect | Infrastructure design decisions, service topology, dependency analysis |
| senior-backend | Application containerization, health check endpoints, config management |
| senior-cloud-architect | Cloud platform selection, multi-region architecture, disaster recovery planning |
| incident-commander | Incident escalation procedures, communication protocols, postmortem facilitation |
| code-reviewer | Infrastructure-as-code review standards, Terraform plan review, pipeline config review |
| aws-solution-architect | AWS-specific infrastructure patterns, service selection, cost optimization |
Last Updated: February 2026
Version: 2.0.0
Tools: 3 Python automation scripts
References: 3 deep-dive guides