slo-implementation by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill slo-implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
resources/implementation-playbook.md.
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
SLA (Service Level Agreement)
↓ Contract with customers
SLO (Service Level Objective)
↓ Internal reliability target
SLI (Service Level Indicator)
↓ Actual measurement
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
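The availability and latency ratios above can also be checked by hand against raw request data. The following is a minimal Python sketch, not part of the skill itself: the `Request` type, the function names, and the sample data are all hypothetical, and it assumes a non-empty list of requests.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # request duration in seconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_s: float = 0.5) -> float:
    """Fraction of requests that completed within the threshold."""
    fast = sum(1 for r in requests if r.duration_s <= threshold_s)
    return fast / len(requests)


# Hypothetical sample: one 5xx error and one slow request out of four.
sample = [
    Request(200, 0.12),
    Request(200, 0.80),
    Request(503, 0.05),
    Request(404, 0.30),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

As in the PromQL queries, a 4xx response counts as a successful request for availability, and the latency check uses "less than or equal", which matches the meaning of a histogram `le` bucket.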
Reference: See references/slo-definitions.md
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
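The table values follow from multiplying the period length by the error budget. Here is a short Python sketch that reproduces them, assuming a 30-day month and a 365-day year (which is what the figures above correspond to); the function name is hypothetical.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of full downtime permitted by an SLO over a period."""
    error_budget = 1 - slo_percent / 100
    return period_days * 24 * 60 * error_budget


for slo in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(slo, 30)
    yearly = allowed_downtime_minutes(slo, 365)
    print(f"{slo}%: {monthly:.2f} min/month, {yearly / 60:.2f} h/year")
```

For a 28-day SLO window, pass `period_days=28` instead; the allowance shrinks proportionally.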
Consider:
Example SLOs:
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))
  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
Error Budget = 1 - SLO Target
Example:
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
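To make the error budget formula and the policy tiers concrete, here is a minimal Python sketch. It assumes a request-based budget and reads each tier as "applies when the remaining budget is at or below this threshold", which is one possible interpretation of the policy; all function names and the sample numbers are hypothetical.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction, e.g. 0.999 -> 0.001."""
    return 1 - slo_target


def budget_remaining_percent(slo_target: float, total: int, failed: int) -> float:
    """Share of the error budget still unspent, as a percentage."""
    allowed_failures = error_budget(slo_target) * total
    return (1 - failed / allowed_failures) * 100


# (threshold in percent, action), checked from the strictest tier upward.
# Assumption: a tier applies once the remaining budget is at or below it.
POLICY = [
    (0, "Feature freeze, focus on reliability"),
    (10, "Freeze non-critical changes"),
    (50, "Consider postponing risky changes"),
]


def policy_action(remaining_percent: float) -> str:
    for threshold, action in POLICY:
        if remaining_percent <= threshold:
            return action
    return "Normal development velocity"


# Hypothetical window: 1,000,000 requests, 650 failures, 99.9% target.
remaining = budget_remaining_percent(0.999, total=1_000_000, failed=650)
print(round(remaining, 1), policy_action(remaining))
```

With a 99.9% target, 1,000,000 requests allow roughly 1,000 failures, so 650 failures leave about 35% of the budget and fall into the "Consider postponing risky changes" tier under this reading.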
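# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))
      # Latency SLI (requests <= 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))
  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999
      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99
      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)
      # Same expression over the longer windows that the alert rules reference
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)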
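The arithmetic behind the `error_budget_remaining` and `burn_rate` rules can be sketched in Python to sanity-check expected values before deploying them. The function names and the sample inputs are hypothetical; only the formulas are taken from the rules above.

```python
SLO_TARGET = 0.999


def error_budget_remaining(sli: float, target: float = SLO_TARGET) -> float:
    """Mirror of (sli - target) / (1 - target) * 100."""
    return (sli - target) / (1 - target) * 100


def burn_rate(good: float, total: float, target: float = SLO_TARGET) -> float:
    """Mirror of (1 - good / total) / (1 - target)."""
    return (1 - good / total) / (1 - target)


# An SLI of 99.95% against a 99.9% target leaves about half the budget.
print(error_budget_remaining(0.9995))
# A 0.2% error rate burns the budget about twice as fast as the target allows.
print(burn_rate(good=9_980, total=10_000))
```

A burn rate of 1 means the budget would be spent exactly at the end of the SLO window; a negative remaining budget means the SLO is already violated.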
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes about 2% of the error budget in 1 hour (exactly 2% for a 30-day window)
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Slow burn: 6x rate, 6 hour window
      # Consumes about 5% of the error budget in 6 hours (exactly 5% for a 30-day window)
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
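The 2% and 5% figures in the rule comments come from a simple proportion: burn rate times alert window, divided by the SLO window. The Python sketch below (hypothetical function name) shows the calculation for both a 30-day window, where the figures are exact, and the 28-day window used elsewhere in this document, where they come out slightly higher.

```python
def budget_consumed_percent(burn_rate: float, alert_window_hours: float,
                            slo_window_days: float) -> float:
    """Percent of the error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24) * 100


for days in (30, 28):
    fast = budget_consumed_percent(14.4, 1, days)
    slow = budget_consumed_percent(6, 6, days)
    print(f"{days}d window: fast={fast:.2f}%  slow={slow:.2f}%")
```

If the thresholds should correspond to exactly 2% and 5% of a 28-day budget, the burn-rate thresholds would need to be scaled by 28/30; whether that precision matters is a judgment call.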
Grafana Dashboard Structure:
┌────────────────────────────────────┐
│ SLO Compliance (Current) │
│ ✓ 99.95% (Target: 99.9%) │
├────────────────────────────────────┤
│ Error Budget Remaining: 65% │
│ ████████░░ 65% │
├────────────────────────────────────┤
│ SLI Trend (28 days) │
│ [Time series graph] │
├────────────────────────────────────┤
│ Burn Rate Analysis │
│ [Burn rate by time window] │
└────────────────────────────────────┘
Example Queries:
# Current SLO compliance
sli:http_availability:ratio * 100
# Error budget remaining
slo:http_availability:error_budget_remaining
# Days until error budget exhausted (at the 28-day average burn rate)
(slo:http_availability:error_budget_remaining / 100) * 28
/
((1 - sli:http_availability:ratio) / (1 - 0.999))
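The exhaustion estimate divides the remaining share of the window by the burn rate. Here is a small Python sketch of the same calculation, using the 65% remaining budget shown in the dashboard mock-up; the function name is hypothetical, and the guard for a zero burn rate is an addition for illustration.

```python
def days_until_exhausted(remaining_percent: float, sli: float,
                         target: float = 0.999, window_days: float = 28) -> float:
    """Days until the budget runs out if the window-average burn rate continues."""
    burn_rate = (1 - sli) / (1 - target)
    if burn_rate <= 0:
        return float("inf")  # no errors observed, so the budget never runs out
    return remaining_percent / 100 * window_days / burn_rate


# An SLI of 99.965% against a 99.9% target leaves 65% of the budget.
print(days_until_exhausted(remaining_percent=65, sli=0.99965))
```

With these inputs the burn rate is about 0.35, giving roughly 52 days. Because the estimate uses the 28-day average, it reacts slowly to a sudden incident; a short-window burn rate gives a more pessimistic and more current figure.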
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
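The boolean logic of the multi-window rule can be illustrated outside Prometheus. In this Python sketch the function name and the sample burn rates are hypothetical; the thresholds are the ones from the rule above.

```python
from typing import Mapping


def should_alert(burn: Mapping[str, float]) -> bool:
    """Return True when either long/short window pair exceeds its threshold."""
    fast = burn["1h"] > 14.4 and burn["5m"] > 14.4
    slow = burn["6h"] > 6 and burn["30m"] > 6
    return fast or slow


# A brief spike: the 5m window is hot but the 1h window is not, so no alert.
print(should_alert({"5m": 20.0, "30m": 4.0, "1h": 3.0, "6h": 1.0}))    # False
# A sustained incident: both the 1h and 5m windows exceed 14.4.
print(should_alert({"5m": 18.0, "30m": 16.0, "1h": 15.0, "6h": 4.0}))  # True
```

The long window confirms that a meaningful share of the budget has been spent, while the short window confirms the problem is still ongoing, so the alert clears soon after recovery.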
assets/slo-template.md - SLO definition template
references/slo-definitions.md - SLO definition patterns
references/error-budget.md - Error budget calculations
prometheus-configuration - For metric collection
grafana-dashboards - For SLO visualization
First Seen: Jan 28, 2026