SLO Implementation Guide: Define SLIs, SLOs, and Error Budgets to Achieve Measurable Service Reliability Targets

slo-implementation by sickn33/antigravity-awesome-skills


Install command

npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill slo-implementation

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Do not use this skill when

The task is unrelated to SLO implementation
You need a different domain or tool outside this scope

Instructions

Clarify goals, constraints, and required inputs.
Apply relevant best practices and validate outcomes.
Provide actionable steps and verification.
If detailed examples are required, open resources/implementation-playbook.md.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

Use this skill when

Defining service reliability targets
Measuring user-perceived reliability
Implementing error budgets
Creating SLO-based alerts
Tracking reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
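
The ratio above can also be checked by hand. The following is a minimal sketch, assuming per-status request counts for one window are already available as a dictionary; the function name `availability_sli` and the sample counts are hypothetical and not part of the skill.

```python
import re


def availability_sli(requests_by_status: dict[str, int]) -> float:
    """Return the fraction of requests whose status code is not 5xx."""
    total = sum(requests_by_status.values())
    if total == 0:
        raise ValueError("no requests observed in the window")
    # Mirror the PromQL matcher status!~"5..": exclude three-character
    # codes that start with 5.
    good = sum(
        count
        for status, count in requests_by_status.items()
        if not re.fullmatch(r"5..", status)
    )
    return good / total


# Hypothetical per-status request counts for one window.
counts = {"200": 9_950, "404": 30, "500": 15, "503": 5}
print(f"availability SLI: {availability_sli(counts):.4f}")
```

Note that 4xx responses count as successful here, exactly as in the query, because only 5xx codes are excluded.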

2. Latency SLI

# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
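
The latency ratio relies on Prometheus histogram buckets being cumulative: the `le="0.5"` bucket counts every request that took 0.5 s or less, and the `+Inf` bucket equals the total count. A minimal sketch of the same calculation, assuming the bucket counts have been exported into a dictionary (the name `latency_sli` and the sample values are hypothetical):

```python
import math


def latency_sli(cumulative_buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests at or below threshold_s, from cumulative buckets."""
    # As with le="0.5" in the query, the threshold must be an existing bucket bound.
    if threshold_s not in cumulative_buckets:
        raise ValueError(f"no bucket with upper bound {threshold_s}")
    total = cumulative_buckets[math.inf]  # the +Inf bucket holds the total count
    if total == 0:
        raise ValueError("no requests observed in the window")
    return cumulative_buckets[threshold_s] / total


# Hypothetical cumulative counts keyed by upper bound in seconds.
buckets = {0.1: 600, 0.5: 990, 1.0: 998, math.inf: 1000}
print(f"latency SLI: {latency_sli(buckets, 0.5):.2f}")
```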

3. Durability SLI

# Successful writes / Total writes over the SLO window
sum(increase(storage_writes_successful_total[28d]))
/
sum(increase(storage_writes_total[28d]))

Reference: See references/slo-definitions.md

Setting SLO Targets

Availability SLO Examples

SLO %	Downtime/Month (30 days)	Downtime/Year (365 days)
99%	7.2 hours	3.65 days
99.9%	43.2 minutes	8.76 hours
99.95%	21.6 minutes	4.38 hours
99.99%	4.32 minutes	52.56 minutes
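
The figures in this table follow from multiplying the allowed failure fraction by the length of the period. A minimal sketch that reproduces them, assuming a 30-day month and a 365-day year (the function name `allowed_downtime_minutes` is hypothetical):

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of full downtime that an availability SLO permits in a period."""
    return (1 - slo_percent / 100) * period_days * 24 * 60


for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(slo, 30)
    per_year = allowed_downtime_minutes(slo, 365)
    print(f"{slo}%: {per_month:.2f} min/month, {per_year / 60:.2f} h/year")
```

For a 28-day window, as used in the queries elsewhere in this guide, pass `period_days=28`; a 99.9% target then allows about 40.3 minutes.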

Choose Appropriate SLOs

Consider:

User expectations
Business requirements
Current performance
Cost of reliability
Competitor benchmarks

Example SLOs:

slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p99   # 99% of requests complete within 500 ms
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
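
A definition like the ones above can be carried around in code as a small value type. This is only a sketch of one possible shape; the `SLO` class and its field names are assumptions for illustration, not a schema defined by the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target_percent: float  # e.g. 99.9
    window_days: int       # e.g. 28

    def is_met(self, sli_ratio: float) -> bool:
        """True when the measured SLI (0.0 to 1.0) meets or exceeds the target."""
        return sli_ratio >= self.target_percent / 100


availability = SLO(name="api_availability", target_percent=99.9, window_days=28)
print(availability.is_met(0.9995))  # measured SLI above the target
print(availability.is_met(0.9985))  # measured SLI below the target
```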

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes/month
Current Error: 0.05% = 21.6 minutes/month
Remaining Budget: 50%
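
The example can be worked through in code. A minimal sketch, assuming the SLO target and the observed error rate are both expressed as fractions (the function name `error_budget_remaining` is hypothetical):

```python
def error_budget_remaining(slo_target: float, observed_error_rate: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = 1 - slo_target  # Error Budget = 1 - SLO Target
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1 - observed_error_rate / budget


# 99.9% target with 0.05% observed errors, as in the example above.
remaining = error_budget_remaining(slo_target=0.999, observed_error_rate=0.0005)
print(f"remaining budget: {remaining:.0%}")
```

With an observed error rate of 0.2% the result is -1.0, meaning the budget has been spent twice over.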

Error Budget Policy

error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
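
The policy lists four thresholds but does not say how values between them are handled. The sketch below assumes each action applies once the remaining budget has fallen to or below its threshold, and that anything above 50% counts as normal velocity; that reading, and the names `POLICY` and `policy_action`, are assumptions.

```python
# (threshold, action) pairs ordered from most to least severe.
POLICY = [
    (0.00, "Feature freeze, focus on reliability"),
    (0.10, "Freeze non-critical changes"),
    (0.50, "Consider postponing risky changes"),
]
DEFAULT_ACTION = "Normal development velocity"


def policy_action(remaining_budget: float) -> str:
    """Map a remaining-budget fraction (1.0 = untouched) to a policy action."""
    for threshold, action in POLICY:
        if remaining_budget <= threshold:
            return action
    return DEFAULT_ACTION


for remaining in (0.80, 0.50, 0.05, -0.20):
    print(f"{remaining:+.0%}: {policy_action(remaining)}")
```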

Reference: See references/error-budget.md

SLO Implementation

Prometheus Recording Rules

# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

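      # Error budget burn rate (1 = the budget is consumed exactly over the SLO window).
      # One rule per window referenced by the alerting rules.
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)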
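
The burn-rate expression divides the observed error rate by the error rate the SLO allows, so a value of 1 means the budget would be used up exactly at the end of the SLO window. A minimal sketch of that arithmetic, assuming good and total request counts for the window are known (the function name `burn_rate` and the sample counts are hypothetical):

```python
def burn_rate(good: float, total: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total <= 0:
        raise ValueError("no requests observed in the window")
    error_rate = 1 - good / total
    return error_rate / (1 - slo_target)


# 144 failures out of 10,000 requests against a 99.9% target.
print(round(burn_rate(good=9_856, total=10_000, slo_target=0.999), 1))
```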

SLO Alerting Rules

groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% of a 30-day error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% of a 30-day error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
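
The 14.4 and 6 thresholds come from a simple relation: the share of budget consumed equals the burn rate times the alert window, divided by the SLO period. The sketch below assumes a 30-day period, which is what makes the 2% and 5% figures in the rule comments come out exactly; the function names are hypothetical.

```python
def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_period_days: float) -> float:
    """Fraction of the error budget spent at burn_rate over the alert window."""
    return burn_rate * alert_window_hours / (slo_period_days * 24)


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_period_days: float) -> float:
    """Burn rate that spends budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_period_days * 24 / alert_window_hours


print(f"{budget_consumed(14.4, 1, 30):.1%}")  # fast-burn rule
print(f"{budget_consumed(6, 6, 30):.1%}")     # slow-burn rule
print(f"{budget_consumed(14.4, 1, 28):.1%}")  # same threshold, 28-day period
```

With the 28-day window used in the recording rules, the same thresholds correspond to slightly more than 2% and 5% of the budget; `burn_rate_threshold` can be used to recompute them if exact percentages matter.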

SLO Dashboard

Grafana Dashboard Structure:

┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 50%        │
│ █████░░░░░ 50%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

Example Queries:

# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at the 28-day average burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
((1 - sli:http_availability:ratio) / (1 - 0.999))
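
The projection works because a burn rate of 1 spends the whole budget in one SLO window, so a remaining fraction r lasts r times the window length divided by the burn rate. A minimal sketch of that arithmetic, assuming a 28-day window (the function name `days_until_exhausted` is hypothetical):

```python
import math


def days_until_exhausted(remaining_fraction: float, burn_rate: float,
                         window_days: float = 28.0) -> float:
    """Days left before the error budget reaches zero at a constant burn rate."""
    if remaining_fraction <= 0:
        return 0.0       # budget already spent
    if burn_rate <= 0:
        return math.inf  # no errors, so the budget is never exhausted
    return remaining_fraction * window_days / burn_rate


print(days_until_exhausted(remaining_fraction=0.5, burn_rate=0.5))
print(days_until_exhausted(remaining_fraction=0.5, burn_rate=2.0))
```

Substituting a short-window burn rate for the 28-day average gives a projection that reacts faster to an ongoing incident.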

Multi-Window Burn Rate Alerts

# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
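
The effect of pairing a long window with a short one can be seen by evaluating the same condition on a few sets of burn rates. This is a sketch of the boolean logic only, assuming the four burn rates have already been computed; the function name `should_alert` and the sample values are hypothetical.

```python
def should_alert(burn_rates: dict[str, float]) -> bool:
    """Evaluate the combined fast-burn / slow-burn condition."""
    fast = burn_rates["1h"] > 14.4 and burn_rates["5m"] > 14.4
    slow = burn_rates["6h"] > 6 and burn_rates["30m"] > 6
    return fast or slow


scenarios = {
    # A short spike: the 5m window is hot, the longer windows are not.
    "brief spike": {"5m": 20.0, "30m": 5.0, "1h": 3.0, "6h": 0.8},
    # A sustained outage: both fast-burn windows exceed 14.4.
    "sustained outage": {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 4.0},
    # A steady degradation: both slow-burn windows exceed 6.
    "slow degradation": {"5m": 7.0, "30m": 7.0, "1h": 7.0, "6h": 6.5},
    # Already recovered: long windows are still high, short windows are calm.
    "recovered": {"5m": 0.5, "30m": 2.0, "1h": 16.0, "6h": 7.0},
}
for name, rates in scenarios.items():
    print(f"{name}: {should_alert(rates)}")
```

The long window shows that a meaningful share of budget has been spent, while the short window confirms the problem is still happening, which is why the spike and the recovered case stay silent.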

SLO Review Process

Weekly Review

Current SLO compliance
Error budget status
Trend analysis
Incident impact

Monthly Review

SLO achievement
Error budget usage
Incident postmortems
SLO adjustments

Quarterly Review

SLO relevance
Target adjustments
Process improvements
Tooling enhancements

Best Practices

Start with user-facing services
Use multiple SLIs (availability, latency, etc.)
Set achievable SLOs (don't aim for 100%)
Implement multi-window alerts to reduce noise
Track error budget consistently
Review SLOs regularly
Document SLO decisions
Align with business goals
Automate SLO reporting
Use SLOs for prioritization

Reference Files

assets/slo-template.md - SLO definition template
references/slo-definitions.md - SLO definition patterns
references/error-budget.md - Error budget calculations

Related Skills

prometheus-configuration - For metric collection
grafana-dashboards - For SLO visualization

First Seen

Jan 28, 2026
