slo-implementation by wshobson/agents
npx skills add https://github.com/wshobson/agents --skill slo-implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
SLA (Service Level Agreement)
↓ Contract with customers
SLO (Service Level Objective)
↓ Internal reliability target
SLI (Service Level Indicator)
↓ Actual measurement
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
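# Successful writes / Total writes over the SLO window
# (increase() keeps the ratio scoped to 28 days instead of the counters' lifetime)
sum(increase(storage_writes_successful_total[28d]))
/
sum(increase(storage_writes_total[28d]))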
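All three SLIs above share one shape: good events divided by total events. A minimal Python sketch of that ratio follows; the event counts are hypothetical, and treating an empty window as fully compliant is an assumption, not something the queries above specify.

```python
def sli_ratio(good_events: float, total_events: float) -> float:
    """Return good / total for one measurement window."""
    if total_events == 0:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    return good_events / total_events


# Hypothetical 28-day totals, for illustration only.
availability = sli_ratio(good_events=999_500, total_events=1_000_000)
latency = sli_ratio(good_events=992_000, total_events=1_000_000)

print(f"availability SLI: {availability:.4%}")
print(f"latency SLI (<= 500 ms): {latency:.4%}")
```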
Reference: See references/slo-definitions.md
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
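The table values can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month and a 365-day year, which is what the figures above imply; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of full downtime a period can absorb without breaching the SLO."""
    error_budget = 1 - slo_percent / 100
    return error_budget * period_days * 24 * 60


for slo in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(slo, period_days=30)
    yearly = allowed_downtime_minutes(slo, period_days=365)
    print(f"{slo}%: {monthly:.2f} min/month, {yearly / 60:.2f} h/year")
```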
Consider:
Example SLOs:
slos:
  - name: api_availability
    target: 99.9   # percent of requests that must not return 5xx
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))
  # Despite the name, this SLI is the share of requests served within 500 ms
  # (target 99%), not a p95 quantile.
  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
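To show how such definitions might be consumed, here is a small Python sketch that checks a measured SLI against each target. The `SLO` class and the measured values are hypothetical; only the names, targets, and windows come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target_percent: float  # e.g. 99.9
    window: str            # e.g. "28d"

    def is_met(self, sli_ratio: float) -> bool:
        """True when the measured SLI (a 0..1 ratio) reaches the target."""
        return sli_ratio * 100 >= self.target_percent


slos = [
    SLO("api_availability", 99.9, "28d"),
    SLO("api_latency_p95", 99.0, "28d"),
]

# Hypothetical measurements, for illustration only.
measured = {"api_availability": 0.9995, "api_latency_p95": 0.985}

for slo in slos:
    status = "met" if slo.is_met(measured[slo.name]) else "violated"
    print(f"{slo.name} ({slo.target_percent}% over {slo.window}): {status}")
```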
Error Budget = 1 - SLO Target
Example:
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
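A rough sketch of how the budget formula and the policy could fit together in Python. The policy lists four points rather than ranges, so reading each percentage as an upper bound is an assumption, as are the function names.

```python
def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Percent of the error budget still unspent (negative once overspent)."""
    return (sli - slo_target) / (1 - slo_target) * 100


def policy_action(remaining_percent: float) -> str:
    """Map remaining budget to an action.

    Assumption: each threshold in the policy is treated as an upper bound.
    """
    if remaining_percent <= 0:
        return "Feature freeze, focus on reliability"
    if remaining_percent <= 10:
        return "Freeze non-critical changes"
    if remaining_percent <= 50:
        return "Consider postponing risky changes"
    return "Normal development velocity"


# Hypothetical measurement: 99.93% availability against a 99.9% target.
remaining = error_budget_remaining(sli=0.9993, slo_target=0.999)
print(f"{remaining:.1f}% of budget left -> {policy_action(remaining)}")
```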
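# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))
      # Latency SLI (requests <= 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))
  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999
      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99
      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
      # Error budget burn rate (observed error ratio / allowed error ratio)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)
      # The alert rules also reference 30m, 1h, and 6h burn rates.
      # They follow the same pattern with a different range.
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)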
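The burn-rate rule divides the observed error ratio by the error ratio the SLO allows, so a value of 1 means the budget is being spent exactly as fast as the SLO permits. A minimal Python sketch of the same arithmetic; the request counts are hypothetical, and returning 0 for an empty window is an assumption.

```python
def burn_rate(good: float, total: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - target)."""
    if total == 0:
        # Assumption: no traffic in the window burns no budget.
        return 0.0
    error_ratio = 1 - good / total
    return error_ratio / (1 - slo_target)


# Hypothetical 5-minute window: 10,000 requests, 144 of them failed.
rate = burn_rate(good=9_856, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```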
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes about 2% of the error budget in 1 hour (exactly 2% for a 30-day window)
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Slow burn: 6x rate, 6 hour window
      # Consumes about 5% of the error budget in 6 hours (exactly 5% for a 30-day window)
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
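The 14.4 and 6 thresholds follow from one relation: threshold = budget fraction consumed × SLO window / alert window. The sketch below works through that arithmetic; the function name is illustrative. It also shows that the thresholds match a 30-day window exactly, while the 28-day window used in the rules above would give slightly lower values.

```python
def burn_rate_threshold(
    budget_fraction: float, alert_window_hours: float, slo_window_days: float
) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


for days in (30, 28):
    fast = burn_rate_threshold(0.02, alert_window_hours=1, slo_window_days=days)
    slow = burn_rate_threshold(0.05, alert_window_hours=6, slo_window_days=days)
    print(f"{days}-day window: fast burn > {fast:.2f}x, slow burn > {slow:.2f}x")
```

Keeping 14.4 and 6 with a 28-day window is still a workable choice; the alerts then fire after slightly more than 2% and 5% of the budget has been spent.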
Grafana Dashboard Structure:
┌────────────────────────────────────┐
│ SLO Compliance (Current) │
│ ✓ 99.95% (Target: 99.9%) │
├────────────────────────────────────┤
│ Error Budget Remaining: 65% │
│ ████████░░ 65% │
├────────────────────────────────────┤
│ SLI Trend (28 days) │
│ [Time series graph] │
├────────────────────────────────────┤
│ Burn Rate Analysis │
│ [Burn rate by time window] │
└────────────────────────────────────┘
Example Queries:
# Current SLO compliance
sli:http_availability:ratio * 100
# Error budget remaining
slo:http_availability:error_budget_remaining
# Days until error budget exhausted
# (remaining budget fraction * window length / average burn rate over the 28-day window)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
((1 - sli:http_availability:ratio) / (1 - 0.999))
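The exhaustion query divides the remaining share of the budget, expressed in days of the window, by the burn rate. A small Python sketch of the same calculation; the handling of a zero burn rate and of an already-spent budget are assumptions added for illustration.

```python
def days_until_exhausted(
    remaining_percent: float, burn_rate: float, window_days: float = 28
) -> float:
    """Days left before the budget reaches zero at a constant burn rate."""
    if remaining_percent <= 0:
        return 0.0  # assumption: an overspent budget is already exhausted
    if burn_rate <= 0:
        return float("inf")  # assumption: no errors means the budget never runs out
    return remaining_percent / 100 * window_days / burn_rate


# Hypothetical values: 65% of the budget left, burning at twice the sustainable rate.
print(f"{days_until_exhausted(65, burn_rate=2.0):.1f} days")
```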
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
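To see why pairing windows cuts down on noise, the following Python sketch mirrors the alert expression as a plain boolean check. The long window shows that the burn is significant and the short window shows that it is still happening. The sample burn rates are hypothetical.

```python
from typing import Mapping


def should_alert(burn_rates: Mapping[str, float]) -> bool:
    """Mirror of the combined rule: both windows of a pair must exceed the threshold."""
    fast = burn_rates["1h"] > 14.4 and burn_rates["5m"] > 14.4
    slow = burn_rates["6h"] > 6 and burn_rates["30m"] > 6
    return fast or slow


# Hypothetical scenarios, for illustration only.
brief_spike = {"5m": 20.0, "30m": 4.0, "1h": 2.0, "6h": 0.5}
sustained_outage = {"5m": 16.0, "30m": 15.5, "1h": 15.0, "6h": 3.0}
already_recovered = {"5m": 0.2, "30m": 3.0, "1h": 15.0, "6h": 3.0}

for name, rates in [
    ("brief spike", brief_spike),
    ("sustained outage", sustained_outage),
    ("already recovered", already_recovered),
]:
    print(f"{name}: alert={should_alert(rates)}")
```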
Related skills:
prometheus-configuration - For metric collection
grafana-dashboards - For SLO visualization

Weekly Installs: 3.1K
GitHub Stars: 32.2K
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: claude-code (2.5K), gemini-cli (2.4K), opencode (2.4K), cursor (2.3K), codex (2.3K), github-copilot (2.0K)