HyperFleet Operational Readiness Audit - Production Readiness Checks and Reliability Verification Tool

HyperFleet Operational Readiness Audit by openshift-hyperfleet/hyperfleet-claude-plugins

5 GitHub Stars

Install command

npx skills add https://github.com/openshift-hyperfleet/hyperfleet-claude-plugins --skill 'HyperFleet Operational Readiness Audit'

Code Quality, DevOps, Monitoring

HyperFleet Operational Readiness Audit Skill

CRITICAL: READ-ONLY MODE

This skill MUST NOT modify any files in the repository being audited. All operations are read-only analysis. The skill produces reports but never changes code, configuration, or documentation.

When to Use This Skill

Activate this skill when the user:

Asks to "check operational readiness"
Asks "is this repo operationally ready?"
Asks to "audit for production readiness"
Asks "what operational gaps does this repo have?"
Asks to "run an operational readiness check"
Asks about "production readiness"
Asks "is this service ready for production operations?"
Asks to "validate operational requirements"

Operational Readiness Requirements Source

These checks are based on the HYPERFLEET-539 requirements for operational readiness. Unlike the standards-audit skill, which fetches standards dynamically, these operational requirements are hardcoded because they represent core reliability requirements that rarely change.

Repository Type Detection

Before running applicable checks, detect the repository type.

Detection Commands

# Check for API indicators
ls pkg/api/ 2>/dev/null && echo "HAS_API_PKG"
ls openapi.yaml 2>/dev/null || ls openapi/openapi.yaml 2>/dev/null && echo "HAS_OPENAPI"
grep -l "database" cmd/*.go 2>/dev/null && echo "HAS_DATABASE"

# Check for Sentinel indicators
basename "$(pwd)" | grep -i sentinel && echo "IS_SENTINEL"
# grep -q sets the exit status itself; piping through head would always succeed
grep -rq "polling\|reconcile" --include="*.go" . 2>/dev/null && echo "HAS_RECONCILE"

# Check for Adapter indicators
basename "$(pwd)" | grep "^adapter-" && echo "IS_ADAPTER"
grep -rq "cloudevents\|pubsub" --include="*.go" . 2>/dev/null && echo "HAS_CLOUDEVENTS"

# Check for Infrastructure
ls charts/Chart.yaml 2>/dev/null || ls Chart.yaml 2>/dev/null && echo "HAS_HELM"
ls *.tf 2>/dev/null && echo "HAS_TERRAFORM"

# Check for Go code (find recurses; pkg/**/*.go only recurses with bash globstar enabled)
find cmd pkg -name "*.go" 2>/dev/null | grep -q . && echo "IS_GO_REPO"

Repository Type Matrix

| Indicators | Repository Type |
|------------|-----------------|
| HAS_API_PKG + HAS_OPENAPI + HAS_DATABASE | API Service |
| IS_SENTINEL or HAS_RECONCILE | Sentinel |
| IS_ADAPTER or HAS_CLOUDEVENTS (without API) | Adapter |
| HAS_HELM or HAS_TERRAFORM (without Go) | Infrastructure |
| IS_GO_REPO (without service patterns) | Tooling |
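
The matrix above can be read as a first-match decision list. The following is a minimal sketch of that reading, not part of the skill itself: the `Indicators` struct, the `classifyRepo` function, the "Unknown" fallback, and the top-to-bottom precedence are all assumptions made for illustration, since the matrix does not say which row wins when several match.

```go
package main

import "fmt"

// Indicators mirrors the markers echoed by the detection commands.
type Indicators struct {
	HasAPIPkg, HasOpenAPI, HasDatabase bool
	IsSentinel, HasReconcile           bool
	IsAdapter, HasCloudEvents          bool
	HasHelm, HasTerraform              bool
	IsGoRepo                           bool
}

// classifyRepo applies the matrix rows top to bottom; the first match wins.
// That ordering is an assumption: the matrix itself does not define precedence.
func classifyRepo(in Indicators) string {
	isAPI := in.HasAPIPkg && in.HasOpenAPI && in.HasDatabase
	switch {
	case isAPI:
		return "API Service"
	case in.IsSentinel || in.HasReconcile:
		return "Sentinel"
	case in.IsAdapter || in.HasCloudEvents:
		return "Adapter" // "without API" holds because isAPI was ruled out above
	case (in.HasHelm || in.HasTerraform) && !in.IsGoRepo:
		return "Infrastructure"
	case in.IsGoRepo:
		return "Tooling" // Go code with no service pattern matched
	default:
		return "Unknown" // hypothetical fallback: ask the user to specify
	}
}

func main() {
	fmt.Println(classifyRepo(Indicators{IsSentinel: true, IsGoRepo: true, HasHelm: true}))
	fmt.Println(classifyRepo(Indicators{HasHelm: true}))
}
```

Under these assumptions a Sentinel repository that also ships a Helm chart is still classified as Sentinel, because the Infrastructure row only applies when no Go code is present.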

Operational Readiness Checks

Check 1: Functional Health Probes

Severity: Critical
Requirement: Health and readiness endpoints must verify actual dependencies (database, message broker, external services), not just return 200 OK.

Applies to: API, Sentinel, Adapter
Does NOT apply to: Infrastructure, Tooling

What to check:

Health endpoint exists (/healthz or /health)
Readiness endpoint exists (/readyz or /ready)
Health checks verify actual dependencies:
- Database connectivity
- Message broker connectivity
- Critical external service availability

Check commands:

# Check for health endpoints
grep -r "/healthz\|/health\|/readyz\|/ready" --include="*.go" -l 2>/dev/null

# Check if health checks verify dependencies (not just returning OK)
grep -r "healthz\|readyz" --include="*.go" -A 20 2>/dev/null | grep -i "ping\|check\|db\|database\|broker\|connect"

Pass criteria:

Health endpoints exist
Endpoints contain actual dependency checks (not just return 200)

Fail indicators:

No health endpoints found
Health handlers only return static OK without checking dependencies
Missing readiness probe separate from liveness
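
For reference while evaluating evidence, a readiness handler that would satisfy this check might look roughly like the sketch below. It uses only the Go standard library. The `Check` type, `failedChecks`, `readyz`, the two placeholder probes, and the 2-second timeout are hypothetical names and values chosen for illustration; a real service would plug in its own database and broker probes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// Check probes one dependency and returns an error when it is unreachable.
type Check func(ctx context.Context) error

// failedChecks runs every check under a shared timeout and returns the names that failed.
func failedChecks(checks map[string]Check, timeout time.Duration) []string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	var failed []string
	for name, check := range checks {
		if err := check(ctx); err != nil {
			failed = append(failed, name)
		}
	}
	return failed
}

// readyz answers 503 when any dependency check fails and 200 otherwise.
func readyz(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if failed := failedChecks(checks, 2*time.Second); len(failed) > 0 {
			http.Error(w, "not ready: "+strings.Join(failed, ", "), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// Placeholder probes; a real service might call (*sql.DB).PingContext here.
func pingOK(ctx context.Context) error   { return nil }
func pingDown(ctx context.Context) error { return errors.New("connection refused") }

func main() {
	checks := map[string]Check{"database": pingOK, "broker": pingDown}
	rec := httptest.NewRecorder()
	readyz(checks)(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	fmt.Println(rec.Code) // 503, because the broker probe fails
}
```

A handler that skips the loop and always writes 200 is the "static OK" pattern listed under the fail indicators.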

Check 2: Dead Man's Switch Metrics

Severity: Critical (REQUIRED for Sentinel services)
Requirement: Services must emit heartbeat/timestamp metrics that external monitoring can use to detect silent failures.

Applies to: Sentinel (CRITICAL), Adapter (Yes), API (Optional)
Does NOT apply to: Infrastructure, Tooling

What to check:

Heartbeat or timestamp metric exists
Metric is updated on each successful operation cycle
Metric follows naming convention: hyperfleet_*_last_success_timestamp or hyperfleet_*_heartbeat

Check commands:

# Check for dead man's switch / heartbeat metrics
grep -r "last_success\|heartbeat\|last_run\|last_processed" --include="*.go" -l 2>/dev/null

# Check for timestamp metric patterns
grep -r "SetToCurrentTime\|prometheus.NewGauge.*timestamp\|prometheus.NewGauge.*heartbeat" --include="*.go" -l 2>/dev/null

# Look for reconciliation loop metrics
grep -r "reconcile.*success\|loop.*completed\|cycle.*finished" --include="*.go" -l 2>/dev/null

Pass criteria:

Heartbeat or timestamp metric exists
Metric is updated in main processing loop

Fail indicators (CRITICAL for Sentinel):

No heartbeat/timestamp metrics found
Metrics exist but not updated on success path
Only error metrics without success indicators
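
The idea behind a dead man's switch can be sketched as follows. The check commands above look for a Prometheus gauge updated with `SetToCurrentTime`; to stay self-contained, this sketch instead keeps the timestamp in an atomic value and writes the gauge line by hand in the Prometheus text exposition format. The `reconcile` function, the `isStale` helper, and the 300-second threshold are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// lastSuccess holds the Unix time (seconds) of the last successful cycle.
var lastSuccess atomic.Int64

// markSuccess is called only on the success path of the processing loop.
func markSuccess(now time.Time) { lastSuccess.Store(now.Unix()) }

// isStale is the condition an external monitor would alert on.
func isStale(nowUnix, lastUnix, maxAgeSeconds int64) bool {
	return nowUnix-lastUnix > maxAgeSeconds
}

// metrics writes the gauge in Prometheus text exposition format.
func metrics(w http.ResponseWriter, _ *http.Request) {
	const name = "hyperfleet_sentinel_last_success_timestamp"
	fmt.Fprintf(w, "# TYPE %s gauge\n%s %d\n", name, name, lastSuccess.Load())
}

// reconcile stands in for one unit of work (hypothetical).
func reconcile() error { return nil }

func main() {
	if err := reconcile(); err == nil {
		markSuccess(time.Now()) // the error path never touches the timestamp
	}
	rec := httptest.NewRecorder()
	metrics(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
	fmt.Println("stale:", isStale(time.Now().Unix(), lastSuccess.Load(), 300))
}
```

Because the timestamp only moves on success, a loop that hangs or fails silently makes the metric go stale, which is what lets external monitoring notice a failure that produces no error metrics.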

Check 3: Retry Logic with Exponential Backoff

Severity: Major
Requirement: All HTTP clients and message broker interactions must implement retry logic with exponential backoff to handle transient failures.

Applies to: API, Sentinel, Adapter
Does NOT apply to: Infrastructure, Tooling

What to check:

HTTP client retry configuration
Message broker client retry configuration
Exponential backoff implementation (not just fixed delays)

Check commands:

# Check for retry libraries or patterns
grep -r "retry\|backoff\|Retry\|Backoff" --include="*.go" -l 2>/dev/null

# Check for exponential backoff specifically
grep -r "exponential\|ExponentialBackoff\|backoff.Exponential" --include="*.go" -l 2>/dev/null

# Check for common retry libraries
grep -r "cenkalti/backoff\|avast/retry-go\|hashicorp/go-retryablehttp" --include="*.go" -l 2>/dev/null

# Check if raw http.Client is wrapped
grep -r "http.Client\|http.NewRequest" --include="*.go" -A 5 2>/dev/null | grep -i "retry"

Pass criteria:

Retry library is imported/used
HTTP clients wrapped with retry logic
Exponential backoff configured (not fixed delays)

Fail indicators:

Raw http.Client used without retry wrapper
Fixed sleep delays instead of exponential backoff
No retry logic on broker/external service calls
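
To tell exponential backoff apart from a fixed delay when reading code, it may help to have the shape in mind. The sketch below is a hand-rolled, standard-library-only version with full jitter; it is not the API of any of the retry libraries named in the check commands, and `backoffDelay`, `retry`, and the chosen durations are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns base * 2^attempt, capped at max (attempt starts at 0).
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// retry calls fn up to attempts times, sleeping a jittered backoff between failures.
func retry(ctx context.Context, attempts int, base, max time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		// Full jitter: sleep a random duration in [0, delay] to spread out retries.
		delay := backoffDelay(i, base, max)
		sleep := time.Duration(rand.Int63n(int64(delay) + 1))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

func main() {
	calls := 0
	err := retry(context.Background(), 5, 10*time.Millisecond, time.Second, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```

A `time.Sleep` with a constant argument inside a retry loop, by contrast, is the fixed-delay pattern listed under the fail indicators.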

Check 4: PodDisruptionBudget

Severity: Major
Requirement: Helm charts must include PodDisruptionBudget templates to ensure availability during node maintenance and cluster upgrades.

Applies to: API, Sentinel, Adapter, Infrastructure
Does NOT apply to: Tooling

What to check:

PDB template exists in Helm chart
PDB has sensible defaults (minAvailable or maxUnavailable)

Check commands:

# Check for PDB template
ls charts/*/templates/pdb.yaml 2>/dev/null || ls charts/*/templates/poddisruptionbudget.yaml 2>/dev/null

# Check values.yaml for PDB configuration
grep -r "podDisruptionBudget\|pdb:" charts/*/values.yaml 2>/dev/null

# Check for PDB in any template
grep -r "PodDisruptionBudget" charts/*/templates/*.yaml 2>/dev/null

Pass criteria:

PDB template exists
PDB values configurable in values.yaml

Fail indicators:

No PDB template in Helm chart
PDB exists but hardcoded (not configurable)
No Helm chart exists (for services that should have one)

Check 5: Resource Limits

Severity: Major
Requirement: Deployment must have CPU and memory requests AND limits defined to ensure proper scheduling and prevent resource exhaustion.

Applies to: API, Sentinel, Adapter, Infrastructure
Does NOT apply to: Tooling

What to check:

Resource requests defined (cpu, memory)
Resource limits defined (cpu, memory)
Values are configurable in values.yaml

Check commands:

# Check values.yaml for resource configuration
grep -A 10 "resources:" charts/*/values.yaml 2>/dev/null

# Check for both requests and limits
grep -A 20 "resources:" charts/*/values.yaml 2>/dev/null | grep -E "requests:|limits:|cpu:|memory:"

# Check deployment template uses resources
grep -r "\.Values.resources\|resources:" charts/*/templates/deployment.yaml 2>/dev/null

Pass criteria:

resources.requests.cpu defined
resources.requests.memory defined
resources.limits.cpu defined
resources.limits.memory defined
Values are templated (not hardcoded)

Fail indicators:

Missing requests or limits
Only requests without limits (or vice versa)
Hardcoded values instead of templated

Check 6: Graceful Shutdown

Severity: Critical
Requirement: Services must handle SIGTERM/SIGINT signals, stop accepting new work, drain existing work, and exit cleanly within the termination grace period.

Applies to: API, Sentinel, Adapter
Does NOT apply to: Infrastructure, Tooling

What to check:

Signal handling for SIGTERM and SIGINT
Server/listener graceful shutdown
In-flight request completion
Connection draining

Check commands:

# Check for signal handling
grep -r "SIGTERM\|SIGINT\|signal.Notify\|os.Signal" --include="*.go" -l 2>/dev/null

# Check for graceful shutdown
grep -r "Shutdown\|GracefulStop\|graceful" --include="*.go" -l 2>/dev/null

# Check for context cancellation on shutdown
grep -r "context.WithCancel\|ctx.Done" --include="*.go" -A 5 2>/dev/null | grep -i "shutdown\|signal"

Pass criteria:

Signal handlers registered for SIGTERM and SIGINT
Server Shutdown() or GracefulStop() called
Context cancellation propagated

Fail indicators:

No signal handling
Using os.Exit() directly without cleanup
No graceful server shutdown
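
As a point of comparison for the evidence this check looks for, a graceful shutdown path for an HTTP server could be sketched as below, using only the standard library. The `serve` and `newServer` helpers, the loopback address, and the 20-second grace period are assumptions; the self-sent SIGTERM exists only so the demo terminates on its own (on Unix-like systems) and would not appear in a real service.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func newServer(addr string) *http.Server {
	return &http.Server{Addr: addr, Handler: http.NewServeMux()}
}

// serve runs srv until stop is closed, then stops accepting connections and
// waits up to grace for in-flight requests to finish.
func serve(stop <-chan struct{}, srv *http.Server, grace time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.ListenAndServe() }()

	select {
	case err := <-errCh:
		return err // the listener failed before any shutdown request
	case <-stop:
	}

	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return err // grace period expired with requests still in flight
	}
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

func main() {
	// ctx is cancelled on the first SIGTERM or SIGINT.
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer cancel()

	// Demo only: signal ourselves so the example terminates without an orchestrator.
	time.AfterFunc(200*time.Millisecond, func() {
		if p, err := os.FindProcess(os.Getpid()); err == nil {
			p.Signal(syscall.SIGTERM)
		}
	})

	err := serve(ctx.Done(), newServer("127.0.0.1:0"), 20*time.Second)
	fmt.Println("shutdown complete, err =", err)
}
```

The grace period passed to `serve` should stay below the pod's termination grace period so that draining can finish before the process is killed.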

Check 7: Reliability Documentation

Severity: Minor
Requirement: Services should have operational documentation including runbooks, metrics documentation, and operational guides.

Applies to: API, Sentinel, Adapter, Infrastructure (Partial)
Does NOT apply to: Tooling

What to check:

Runbook exists (docs/runbook.md or similar)
Metrics documented
Operational guide or README with ops section

Check commands:

# Check for runbook
ls docs/runbook.md 2>/dev/null || ls docs/runbooks/*.md 2>/dev/null || ls RUNBOOK.md 2>/dev/null

# Check for metrics documentation
ls docs/metrics.md 2>/dev/null || grep -l "## Metrics" docs/*.md 2>/dev/null || grep -l "## Metrics" README.md 2>/dev/null

# Check for operational documentation
ls docs/operations.md 2>/dev/null || grep -l "## Operations\|## Operational" docs/*.md 2>/dev/null

Pass criteria:

At least one form of operational documentation exists
Metrics are documented somewhere

Fail indicators:

No runbook
No metrics documentation
No operational guidance

Applicability Matrix

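| Check | API | Sentinel | Adapter | Infrastructure | Tooling |
|-------|-----|----------|---------|----------------|---------|
| Functional Health Probes | Yes | Yes | Yes | No | No |
| Dead Man's Switch Metrics | Optional | CRITICAL | Yes | No | No |
| Retry Logic with Backoff | Yes | Yes | Yes | No | No |
| PodDisruptionBudget | Yes | Yes | Yes | Yes | No |
| Resource Limits | Yes | Yes | Yes | Yes | No |
| Graceful Shutdown | Yes | Yes | Yes | No | No |
| Reliability Documentation | Yes | Yes | Yes | Partial | No |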

Audit Execution

For Each Applicable Check

Determine applicability based on repository type
Execute check commands listed above
Evaluate results against pass/fail criteria
Record status as PASS, PARTIAL, or FAIL
Document specific gaps with file locations and remediation
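
The recorded statuses feed the "X/Y checks passing (Z%)" line of the report. One way to compute it is sketched below. The `Result` type and `score` function are hypothetical, and two rules are assumptions inferred from the example session rather than stated requirements: only PASS counts as passing (PARTIAL does not), and checks that do not apply are excluded from Y.

```go
package main

import "fmt"

// Result is one row of the report summary table.
type Result struct {
	Check      string
	Status     string // "PASS", "PARTIAL", or "FAIL"
	Applicable bool
}

// score returns X (passing), Y (applicable) and Z (percentage, rounded to nearest).
// Assumptions: only PASS counts toward X, and non-applicable checks are left out of Y.
func score(results []Result) (passing, applicable, pct int) {
	for _, r := range results {
		if !r.Applicable {
			continue
		}
		applicable++
		if r.Status == "PASS" {
			passing++
		}
	}
	if applicable > 0 {
		pct = (passing*100 + applicable/2) / applicable
	}
	return passing, applicable, pct
}

func main() {
	// Statuses taken from the example Sentinel audit session.
	results := []Result{
		{"Functional Health Probes", "PASS", true},
		{"Dead Man's Switch Metrics", "FAIL", true},
		{"Retry Logic with Backoff", "PARTIAL", true},
		{"PodDisruptionBudget", "PASS", true},
		{"Resource Limits", "PASS", true},
		{"Graceful Shutdown", "PASS", true},
		{"Reliability Documentation", "FAIL", true},
	}
	x, y, z := score(results)
	fmt.Printf("%d/%d checks passing (%d%%)\n", x, y, z) // 4/7 checks passing (57%)
}
```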

Output Format

Audit Report Structure

# HyperFleet Operational Readiness Audit Report

**Repository:** [repo name]
**Path:** [full path]
**Repository Type:** [API/Sentinel/Adapter/Infrastructure/Tooling]
**Audit Date:** [ISO timestamp]
**Requirements Source:** HYPERFLEET-539

---

## Summary

| Check | Status | Severity | Applicable |
|-------|--------|----------|------------|
| Functional Health Probes | PASS/PARTIAL/FAIL | Critical | Yes/No |
| Dead Man's Switch Metrics | PASS/PARTIAL/FAIL | Critical | Yes/No |
| Retry Logic with Backoff | PASS/PARTIAL/FAIL | Major | Yes/No |
| PodDisruptionBudget | PASS/PARTIAL/FAIL | Major | Yes/No |
| Resource Limits | PASS/PARTIAL/FAIL | Major | Yes/No |
| Graceful Shutdown | PASS/PARTIAL/FAIL | Critical | Yes/No |
| Reliability Documentation | PASS/PARTIAL/FAIL | Minor | Yes/No |

**Overall Operational Readiness:** X/Y checks passing (Z%)

---

## Detailed Findings

### [Check Name]

**Status:** PASS/PARTIAL/FAIL
**Severity:** Critical/Major/Minor
**Applicable:** Yes/No (reason if No)

#### Evidence Found
- [File path:line - what was found]

#### Gaps Found (if any)
- **Location:** [file path:line number or N/A]
- **Expected:** [what operational readiness requires]
- **Found:** [what was actually found]
- **Remediation:** [how to fix]

---

## Recommendations

**Critical Issues (address before production):**
1. [Issue description and remediation]

**Major Issues (address soon):**
1. [Issue description and remediation]

**Minor Issues (address when convenient):**
1. [Issue description and remediation]

Example Audit Session

User: "check operational readiness"

Skill Response:

# HyperFleet Operational Readiness Audit Report

**Repository:** hyperfleet-sentinel
**Path:** /home/croche/Projects/hyperfleet/hyperfleet-sentinel
**Repository Type:** Sentinel
**Audit Date:** 2026-01-29T14:30:00Z
**Requirements Source:** HYPERFLEET-539

---

## Summary

| Check | Status | Severity | Applicable |
|-------|--------|----------|------------|
| Functional Health Probes | PASS | Critical | Yes |
| Dead Man's Switch Metrics | FAIL | Critical | Yes (REQUIRED) |
| Retry Logic with Backoff | PARTIAL | Major | Yes |
| PodDisruptionBudget | PASS | Major | Yes |
| Resource Limits | PASS | Major | Yes |
| Graceful Shutdown | PASS | Critical | Yes |
| Reliability Documentation | FAIL | Minor | Yes |

**Overall Operational Readiness:** 4/7 checks passing (57%)

---

## Detailed Findings

### Functional Health Probes

**Status:** PASS
**Severity:** Critical
**Applicable:** Yes

#### Evidence Found
- cmd/server/health.go:23 - `/healthz` endpoint registered
- cmd/server/health.go:45 - `/readyz` endpoint registered
- cmd/server/health.go:52 - Database ping check in readiness handler

---

### Dead Man's Switch Metrics

**Status:** FAIL
**Severity:** Critical (REQUIRED for Sentinel)
**Applicable:** Yes - Sentinel services MUST have dead man's switch metrics

#### Evidence Found
- No heartbeat or timestamp metrics found

#### Gaps Found
- **Location:** N/A - not implemented
- **Expected:** Heartbeat or timestamp metric updated on each reconciliation cycle
- **Found:** No metrics that would alert on silent failures
- **Remediation:** Add `hyperfleet_sentinel_last_success_timestamp` gauge metric, call `SetToCurrentTime()` after each successful reconciliation cycle

---

### Retry Logic with Backoff

**Status:** PARTIAL
**Severity:** Major
**Applicable:** Yes

#### Evidence Found
- pkg/client/http.go:45 - Basic retry logic found
- No exponential backoff implementation

#### Gaps Found
- **Location:** pkg/client/http.go:45
- **Expected:** Exponential backoff with jitter
- **Found:** Fixed 1-second delay between retries
- **Remediation:** Replace fixed delays with exponential backoff using `cenkalti/backoff` or similar library

---

### PodDisruptionBudget

**Status:** PASS
**Severity:** Major
**Applicable:** Yes

#### Evidence Found
- charts/hyperfleet-sentinel/templates/pdb.yaml exists
- charts/hyperfleet-sentinel/values.yaml:78 - `podDisruptionBudget.minAvailable: 1`

---

### Resource Limits

**Status:** PASS
**Severity:** Major
**Applicable:** Yes

#### Evidence Found
- charts/hyperfleet-sentinel/values.yaml:45 - resources.requests.cpu: 100m
- charts/hyperfleet-sentinel/values.yaml:46 - resources.requests.memory: 128Mi
- charts/hyperfleet-sentinel/values.yaml:47 - resources.limits.cpu: 500m
- charts/hyperfleet-sentinel/values.yaml:48 - resources.limits.memory: 512Mi

---

### Graceful Shutdown

**Status:** PASS
**Severity:** Critical
**Applicable:** Yes

#### Evidence Found
- cmd/main.go:67 - `signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)`
- cmd/main.go:89 - `server.Shutdown(ctx)` called on signal

---

### Reliability Documentation

**Status:** FAIL
**Severity:** Minor
**Applicable:** Yes

#### Evidence Found
- No runbook found
- No metrics documentation found

#### Gaps Found
- **Location:** docs/ directory
- **Expected:** Runbook and metrics documentation
- **Found:** Only README.md with installation instructions
- **Remediation:** Create docs/runbook.md with operational procedures and docs/metrics.md documenting exposed metrics

---

## Recommendations

**Critical Issues (address before production):**
1. **Dead Man's Switch Metrics** - Add heartbeat metric to detect silent failures. This is REQUIRED for Sentinel services.

**Major Issues (address soon):**
1. **Retry Logic** - Replace fixed delays with exponential backoff to prevent thundering herd during outages.

**Minor Issues (address when convenient):**
1. **Reliability Documentation** - Add runbook and metrics documentation for on-call support.

Error Handling

If the skill cannot complete an audit:

Unknown repo type: Ask the user to specify, or default to "Tooling" (the type to which the fewest checks apply)
No Helm chart: Skip Helm-related checks and note in report
No Go code: Skip code-based checks and note in report
Partial checks: Report which checks could not be performed

Always provide partial results where possible and suggest manual verification steps for incomplete checks.

Notes

This skill is READ-ONLY - it never modifies files
Requirements are hardcoded based on HYPERFLEET-539 (not dynamically fetched)
Severity ratings: Critical > Major > Minor
Repository type affects which checks apply
Sentinel services have stricter requirements for dead man's switch metrics
All checks include file locations and specific remediation guidance
