HyperFleet Operational Readiness Audit by openshift-hyperfleet/hyperfleet-claude-plugins
npx skills add https://github.com/openshift-hyperfleet/hyperfleet-claude-plugins --skill 'HyperFleet Operational Readiness Audit'
This skill MUST NOT modify any files in the repository being audited. All operations are read-only analysis. The skill produces reports but never changes code, configuration, or documentation.
Activate this skill when the user:
These checks are based on the HYPERFLEET-539 requirements for operational readiness. Unlike the standards-audit skill, which fetches standards dynamically, these operational requirements are hardcoded because they represent core reliability requirements that rarely change.
Before running applicable checks, detect the repository type.
# Check for API indicators
ls pkg/api/ 2>/dev/null && echo "HAS_API_PKG"
ls openapi.yaml 2>/dev/null || ls openapi/openapi.yaml 2>/dev/null && echo "HAS_OPENAPI"
grep -l "database" cmd/*.go 2>/dev/null && echo "HAS_DATABASE"
# Check for Sentinel indicators
basename "$(pwd)" | grep -i sentinel && echo "IS_SENTINEL"
# head always exits 0, so pipe through `grep .` to fail when nothing matched
grep -rl "polling\|reconcile" --include="*.go" . 2>/dev/null | head -1 | grep . && echo "HAS_RECONCILE"
# Check for Adapter indicators
basename "$(pwd)" | grep "^adapter-" && echo "IS_ADAPTER"
# head always exits 0, so pipe through `grep .` to fail when nothing matched
grep -rl "cloudevents\|pubsub" --include="*.go" . 2>/dev/null | head -1 | grep . && echo "HAS_CLOUDEVENTS"
# Check for Infrastructure
ls charts/Chart.yaml 2>/dev/null || ls Chart.yaml 2>/dev/null && echo "HAS_HELM"
ls *.tf 2>/dev/null && echo "HAS_TERRAFORM"
# Check for Go code
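# `**` only recurses when bash globstar is enabled, so use find for nested packages
ls cmd/*.go 2>/dev/null || find pkg -name "*.go" 2>/dev/null | head -1 | grep . && echo "IS_GO_REPO"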
| Indicators | Repository Type |
|---|---|
| HAS_API_PKG + HAS_OPENAPI + HAS_DATABASE | API Service |
| IS_SENTINEL or HAS_RECONCILE | Sentinel |
| IS_ADAPTER or HAS_CLOUDEVENTS (without API) | Adapter |
| HAS_HELM or HAS_TERRAFORM (without Go) | Infrastructure |
| IS_GO_REPO (without service patterns) | Tooling |
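The mapping in the table can be sketched as a small shell function. This is an illustration only, not part of the skill: the name `classify_repo`, the top-to-bottom first-match evaluation order, and the `Unknown` fallback are all assumptions.

```shell
#!/usr/bin/env bash
# classify_repo INDICATOR...: print the repository type for a set of indicators.
# Assumptions (not in the source): rows are tried top to bottom, the first
# match wins, and "Unknown" is printed when no row matches.
classify_repo() {
  local indicators=" $* "
  # has NAME: succeed if NAME is among the indicators passed to classify_repo
  has() { [[ "$indicators" == *" $1 "* ]]; }

  if has HAS_API_PKG && has HAS_OPENAPI && has HAS_DATABASE; then
    echo "API Service"
  elif has IS_SENTINEL || has HAS_RECONCILE; then
    echo "Sentinel"
  elif has IS_ADAPTER || has HAS_CLOUDEVENTS; then
    echo "Adapter"
  elif { has HAS_HELM || has HAS_TERRAFORM; } && ! has IS_GO_REPO; then
    echo "Infrastructure"
  elif has IS_GO_REPO; then
    echo "Tooling"
  else
    echo "Unknown"
  fi
}
```

For example, `classify_repo HAS_HELM HAS_TERRAFORM` prints `Infrastructure`, while `classify_repo HAS_HELM IS_GO_REPO` prints `Tooling` because the Infrastructure row requires the absence of Go code.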
### Functional Health Probes
**Severity:** Critical
**Requirement:** Health and readiness endpoints must verify actual dependencies (database, message broker, external services), not just return 200 OK.
**Applies to:** API, Sentinel, Adapter
**Does NOT apply to:** Infrastructure, Tooling
What to check:
- Health endpoint (`/healthz` or `/health`)
- Readiness endpoint (`/readyz` or `/ready`)

Check commands:
# Check for health endpoints
grep -r "/healthz\|/health\|/readyz\|/ready" --include="*.go" -l 2>/dev/null
# Check if health checks verify dependencies (not just returning OK)
grep -r "healthz\|readyz" --include="*.go" -A 20 2>/dev/null | grep -i "ping\|check\|db\|database\|broker\|connect"
Pass criteria:
- Endpoints verify actual dependencies (not just `return 200`)

Fail indicators:
### Dead Man's Switch Metrics
**Severity:** Critical (REQUIRED for Sentinel services)
**Requirement:** Services must emit heartbeat/timestamp metrics that external monitoring can use to detect silent failures.
**Applies to:** Sentinel (CRITICAL), Adapter (Yes), API (Optional)
**Does NOT apply to:** Infrastructure, Tooling
What to check:
- Metrics named like `hyperfleet_*_last_success_timestamp` or `hyperfleet_*_heartbeat`

Check commands:
# Check for dead man's switch / heartbeat metrics
grep -r "last_success\|heartbeat\|last_run\|last_processed" --include="*.go" -l 2>/dev/null
# Check for timestamp metric patterns
grep -r "SetToCurrentTime\|prometheus.NewGauge.*timestamp\|prometheus.NewGauge.*heartbeat" --include="*.go" -l 2>/dev/null
# Look for reconciliation loop metrics
grep -r "reconcile.*success\|loop.*completed\|cycle.*finished" --include="*.go" -l 2>/dev/null
Pass criteria:
Fail indicators (CRITICAL for Sentinel):
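To illustrate why a timestamp metric lets external monitoring detect silent failures, the staleness test on the monitoring side can be sketched as follows. The function name `heartbeat_stale` and the idea of a fixed maximum age are assumptions for illustration; the source does not specify an alerting threshold.

```shell
#!/usr/bin/env bash
# heartbeat_stale LAST_SUCCESS MAX_AGE [NOW]: succeed (exit 0) when the last
# success timestamp is more than MAX_AGE seconds old. LAST_SUCCESS and NOW are
# Unix timestamps in seconds; NOW defaults to the current time.
heartbeat_stale() {
  local last_success="$1" max_age="$2" now="${3:-$(date +%s)}"
  [ $(( now - last_success )) -gt "$max_age" ]
}
```

A service that hangs without crashing stops advancing its timestamp, so a check such as `heartbeat_stale "$last_ts" 300` starts succeeding once five minutes pass without a completed cycle, even though the process is still running.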
### Retry Logic with Backoff
**Severity:** Major
**Requirement:** All HTTP clients and message broker interactions must implement retry logic with exponential backoff to handle transient failures.
**Applies to:** API, Sentinel, Adapter
**Does NOT apply to:** Infrastructure, Tooling
What to check:
Check commands:
# Check for retry libraries or patterns
grep -r "retry\|backoff\|Retry\|Backoff" --include="*.go" -l 2>/dev/null
# Check for exponential backoff specifically
grep -r "exponential\|ExponentialBackoff\|backoff.Exponential" --include="*.go" -l 2>/dev/null
# Check for common retry libraries
grep -r "cenkalti/backoff\|avast/retry-go\|hashicorp/go-retryablehttp" --include="*.go" -l 2>/dev/null
# Check if raw http.Client is wrapped
grep -r "http.Client\|http.NewRequest" --include="*.go" -A 5 2>/dev/null | grep -i "retry"
Pass criteria:
Fail indicators:
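For readers unfamiliar with the pattern this check looks for, here is a minimal shell sketch of exponential backoff with jitter. It only illustrates the delay schedule; the audited services are Go programs and would normally use a library such as those grepped for above. The function names, the 100 ms base, and the 10 s cap are assumptions.

```shell
#!/usr/bin/env bash
# backoff_delay ATTEMPT [BASE_MS] [CAP_MS]: exponential delay in milliseconds,
# BASE_MS * 2^ATTEMPT, capped at CAP_MS. The defaults are illustrative.
backoff_delay() {
  local attempt="$1" base="${2:-100}" cap="${3:-10000}"
  if [ "$attempt" -gt 20 ]; then attempt=20; fi   # keep the shift small
  local delay=$(( base * (1 << attempt) ))
  if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  echo "$delay"
}

# jittered_delay ATTEMPT [BASE_MS] [CAP_MS]: "full jitter", a random delay in
# [0, backoff_delay]. $RANDOM tops out at 32767, so keep CAP_MS below that.
jittered_delay() {
  local max
  max=$(backoff_delay "$@")
  echo $(( RANDOM % (max + 1) ))
}

# retry_with_backoff MAX_ATTEMPTS COMMAND...: run COMMAND until it succeeds or
# MAX_ATTEMPTS is reached, sleeping a jittered, growing delay between tries.
retry_with_backoff() {
  local max_attempts="$1" attempt=0 ms
  shift
  until "$@"; do
    attempt=$(( attempt + 1 ))
    if [ "$attempt" -ge "$max_attempts" ]; then return 1; fi
    ms=$(jittered_delay "$attempt")
    sleep "$(awk -v ms="$ms" 'BEGIN { printf "%.3f", ms / 1000 }')"
  done
}
```

A fixed delay between retries, by contrast, makes every client retry on the same schedule during an outage, which is the thundering-herd risk that the jitter is meant to spread out.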
### PodDisruptionBudget
**Severity:** Major
**Requirement:** Helm charts must include PodDisruptionBudget templates to ensure availability during node maintenance and cluster upgrades.
**Applies to:** API, Sentinel, Adapter, Infrastructure
**Does NOT apply to:** Tooling
What to check:
Check commands:
# Check for PDB template
ls charts/*/templates/pdb.yaml 2>/dev/null || ls charts/*/templates/poddisruptionbudget.yaml 2>/dev/null
# Check values.yaml for PDB configuration
grep -r "podDisruptionBudget\|pdb:" charts/*/values.yaml 2>/dev/null
# Check for PDB in any template
grep -r "PodDisruptionBudget" charts/*/templates/*.yaml 2>/dev/null
Pass criteria:
Fail indicators:
### Resource Limits
**Severity:** Major
**Requirement:** Deployments must define CPU and memory requests AND limits to ensure proper scheduling and prevent resource exhaustion.
**Applies to:** API, Sentinel, Adapter, Infrastructure
**Does NOT apply to:** Tooling
What to check:
Check commands:
# Check values.yaml for resource configuration
grep -A 10 "resources:" charts/*/values.yaml 2>/dev/null
# Check for both requests and limits
grep -A 20 "resources:" charts/*/values.yaml 2>/dev/null | grep -E "requests:|limits:|cpu:|memory:"
# Check deployment template uses resources
grep -r "\.Values.resources\|resources:" charts/*/templates/deployment.yaml 2>/dev/null
Pass criteria:
- `resources.requests.cpu` defined
- `resources.requests.memory` defined
- `resources.limits.cpu` defined
- `resources.limits.memory` defined

Fail indicators:
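The four pass criteria for this check can be turned into a status with a short script. The sketch below is an assumption-laden illustration: `resource_keys` and `resource_status` are hypothetical helpers, the parser only handles a single top-level `resources:` block with two-space indentation, and mapping "some but not all keys" to PARTIAL is an assumption rather than a rule stated by the source.

```shell
#!/usr/bin/env bash
# resource_keys FILE: print each resources.{requests,limits}.{cpu,memory} key
# defined in a values.yaml. Assumes a top-level `resources:` block indented
# with two spaces per level.
resource_keys() {
  awk '
    /^resources:/            { in_res = 1; sect = ""; next }
    in_res && /^[^ #]/       { in_res = 0; sect = "" }   # next top-level key
    in_res && /^  requests:/ { sect = "requests"; next }
    in_res && /^  limits:/   { sect = "limits"; next }
    in_res && sect != "" && /^    (cpu|memory):/ {
      key = $1; sub(":", "", key)
      print "resources." sect "." key
    }
  ' "$1"
}

# resource_status FILE: PASS when all four keys are defined, PARTIAL when only
# some are (assumption), FAIL when none are.
resource_status() {
  local n
  n=$(( $(resource_keys "$1" 2>/dev/null | sort -u | wc -l) ))
  if [ "$n" -eq 4 ]; then
    echo "PASS"
  elif [ "$n" -gt 0 ]; then
    echo "PARTIAL"
  else
    echo "FAIL"
  fi
}
```

Running `resource_status charts/<chart>/values.yaml` against a chart that sets requests but no limits would print `PARTIAL` under these assumptions.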
### Graceful Shutdown
**Severity:** Critical
**Requirement:** Services must handle SIGTERM/SIGINT signals, stop accepting new work, drain existing work, and exit cleanly within the termination grace period.
**Applies to:** API, Sentinel, Adapter
**Does NOT apply to:** Infrastructure, Tooling
What to check:
Check commands:
# Check for signal handling
grep -r "SIGTERM\|SIGINT\|signal.Notify\|os.Signal" --include="*.go" -l 2>/dev/null
# Check for graceful shutdown
grep -r "Shutdown\|GracefulStop\|graceful" --include="*.go" -l 2>/dev/null
# Check for context cancellation on shutdown
grep -r "context.WithCancel\|ctx.Done" --include="*.go" -A 5 2>/dev/null | grep -i "shutdown\|signal"
Pass criteria:
Fail indicators:
### Reliability Documentation
**Severity:** Minor
**Requirement:** Services should have operational documentation including runbooks, metrics documentation, and operational guides.
**Applies to:** API, Sentinel, Adapter, Infrastructure (Partial)
**Does NOT apply to:** Tooling
What to check:
Check commands:
# Check for runbook
ls docs/runbook.md 2>/dev/null || ls docs/runbooks/*.md 2>/dev/null || ls RUNBOOK.md 2>/dev/null
# Check for metrics documentation
ls docs/metrics.md 2>/dev/null || grep -l "## Metrics" docs/*.md 2>/dev/null || grep -l "## Metrics" README.md 2>/dev/null
# Check for operational documentation
ls docs/operations.md 2>/dev/null || grep -l "## Operations\|## Operational" docs/*.md 2>/dev/null
Pass criteria:
Fail indicators:
| Check | API | Sentinel | Adapter | Infrastructure | Tooling |
|---|---|---|---|---|---|
| Functional Health Probes | Yes | Yes | Yes | No | No |
| Dead Man's Switch Metrics | Optional | CRITICAL | Yes | No | No |
| Retry Logic with Backoff | Yes | Yes | Yes | No | No |
| PodDisruptionBudget | Yes | Yes | Yes | Yes | No |
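| Resource Limits | Yes | Yes | Yes | Yes | No |
| Graceful Shutdown | Yes | Yes | Yes | No | No |
| Reliability Documentation | Yes | Yes | Yes | Partial | No |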
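The applicability matrix can also be encoded as a lookup so that inapplicable checks are skipped before any commands run. This is a sketch under assumptions: the function name and the short check ids (`health`, `deadman`, `retry`, `pdb`, `resources`, `shutdown`, `docs`) are hypothetical labels for the seven checks, and the cell values follow the "Applies to" lines of each check.

```shell
#!/usr/bin/env bash
# applicability CHECK REPO_TYPE: print the matrix cell for a check and type.
# CHECK is one of the hypothetical ids health, deadman, retry, pdb, resources,
# shutdown, docs; REPO_TYPE is API, Sentinel, Adapter, Infrastructure, Tooling.
applicability() {
  local check="$1" repo_type="$2"
  case "$check" in
    health|deadman|retry|pdb|resources|shutdown|docs) ;;
    *) echo "Unknown"; return 1 ;;
  esac
  case "$check:$repo_type" in
    *:Tooling)                                   echo "No" ;;
    deadman:API)                                 echo "Optional" ;;
    deadman:Sentinel)                            echo "CRITICAL" ;;
    docs:Infrastructure)                         echo "Partial" ;;
    pdb:Infrastructure|resources:Infrastructure) echo "Yes" ;;
    *:Infrastructure)                            echo "No" ;;
    *:API|*:Sentinel|*:Adapter)                  echo "Yes" ;;
    *)                                           echo "Unknown"; return 1 ;;
  esac
}
```

For instance, `applicability deadman Sentinel` prints `CRITICAL` and `applicability health Infrastructure` prints `No`.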
# HyperFleet Operational Readiness Audit Report
**Repository:** [repo name]
**Path:** [full path]
**Repository Type:** [API/Sentinel/Adapter/Infrastructure/Tooling]
**Audit Date:** [ISO timestamp]
**Requirements Source:** HYPERFLEET-539
---
## Summary
| Check | Status | Severity | Applicable |
|-------|--------|----------|------------|
| Functional Health Probes | PASS/PARTIAL/FAIL | Critical | Yes/No |
| Dead Man's Switch Metrics | PASS/PARTIAL/FAIL | Critical | Yes/No |
| Retry Logic with Backoff | PASS/PARTIAL/FAIL | Major | Yes/No |
| PodDisruptionBudget | PASS/PARTIAL/FAIL | Major | Yes/No |
| Resource Limits | PASS/PARTIAL/FAIL | Major | Yes/No |
| Graceful Shutdown | PASS/PARTIAL/FAIL | Critical | Yes/No |
| Reliability Documentation | PASS/PARTIAL/FAIL | Minor | Yes/No |
**Overall Operational Readiness:** X/Y checks passing (Z%)
---
## Detailed Findings
### [Check Name]
**Status:** PASS/PARTIAL/FAIL
**Severity:** Critical/Major/Minor
**Applicable:** Yes/No (reason if No)
#### Evidence Found
- [File path:line - what was found]
#### Gaps Found (if any)
- **Location:** [file path:line number or N/A]
- **Expected:** [what operational readiness requires]
- **Found:** [what was actually found]
- **Remediation:** [how to fix]
---
## Recommendations
**Critical Issues (address before production):**
1. [Issue description and remediation]
**Major Issues (address soon):**
1. [Issue description and remediation]
**Minor Issues (address when convenient):**
1. [Issue description and remediation]
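The "X/Y checks passing (Z%)" line of the template can be computed from the per-check statuses. In the sketch below, `readiness_score` is a hypothetical helper; counting PARTIAL as not passing matches the example report that follows (4 PASS, 1 PARTIAL, 2 FAIL gives 4/7), while excluding inapplicable checks from Y and rounding to the nearest whole percent are assumptions.

```shell
#!/usr/bin/env bash
# readiness_score STATUS...: print "X/Y checks passing (Z%)".
# STATUS is PASS, PARTIAL, FAIL, or N/A. Only PASS counts toward X; N/A is
# left out of Y (assumption); Z is rounded to the nearest integer (assumption).
readiness_score() {
  local pass=0 total=0 status
  for status in "$@"; do
    case "$status" in
      PASS)         pass=$(( pass + 1 )); total=$(( total + 1 )) ;;
      PARTIAL|FAIL) total=$(( total + 1 )) ;;
      N/A)          ;;   # not applicable to this repository type
    esac
  done
  if [ "$total" -eq 0 ]; then
    echo "0/0 checks passing (n/a)"
    return 0
  fi
  echo "$pass/$total checks passing ($(( (pass * 100 + total / 2) / total ))%)"
}
```

Called as `readiness_score PASS FAIL PARTIAL PASS PASS PASS FAIL`, it prints `4/7 checks passing (57%)`.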
User: "check operational readiness"
Skill Response:
# HyperFleet Operational Readiness Audit Report
**Repository:** hyperfleet-sentinel
**Path:** /home/croche/Projects/hyperfleet/hyperfleet-sentinel
**Repository Type:** Sentinel
**Audit Date:** 2026-01-29T14:30:00Z
**Requirements Source:** HYPERFLEET-539
---
## Summary
| Check | Status | Severity | Applicable |
|-------|--------|----------|------------|
| Functional Health Probes | PASS | Critical | Yes |
| Dead Man's Switch Metrics | FAIL | Critical | Yes (REQUIRED) |
| Retry Logic with Backoff | PARTIAL | Major | Yes |
| PodDisruptionBudget | PASS | Major | Yes |
| Resource Limits | PASS | Major | Yes |
| Graceful Shutdown | PASS | Critical | Yes |
| Reliability Documentation | FAIL | Minor | Yes |
**Overall Operational Readiness:** 4/7 checks passing (57%)
---
## Detailed Findings
### Functional Health Probes
**Status:** PASS
**Severity:** Critical
**Applicable:** Yes
#### Evidence Found
- cmd/server/health.go:23 - `/healthz` endpoint registered
- cmd/server/health.go:45 - `/readyz` endpoint registered
- cmd/server/health.go:52 - Database ping check in readiness handler
---
### Dead Man's Switch Metrics
**Status:** FAIL
**Severity:** Critical (REQUIRED for Sentinel)
**Applicable:** Yes - Sentinel services MUST have dead man's switch metrics
#### Evidence Found
- No heartbeat or timestamp metrics found
#### Gaps Found
- **Location:** N/A - not implemented
- **Expected:** Heartbeat or timestamp metric updated on each reconciliation cycle
- **Found:** No metrics that would alert on silent failures
- **Remediation:** Add a `hyperfleet_sentinel_last_success_timestamp` gauge metric and call `SetToCurrentTime()` after each successful reconciliation cycle
---
### Retry Logic with Backoff
**Status:** PARTIAL
**Severity:** Major
**Applicable:** Yes
#### Evidence Found
- pkg/client/http.go:45 - Basic retry logic found
- No exponential backoff implementation
#### Gaps Found
- **Location:** pkg/client/http.go:45
- **Expected:** Exponential backoff with jitter
- **Found:** Fixed 1-second delay between retries
- **Remediation:** Replace fixed delays with exponential backoff using `cenkalti/backoff` or similar library
---
### PodDisruptionBudget
**Status:** PASS
**Severity:** Major
**Applicable:** Yes
#### Evidence Found
- charts/hyperfleet-sentinel/templates/pdb.yaml exists
- charts/hyperfleet-sentinel/values.yaml:78 - `podDisruptionBudget.minAvailable: 1`
---
### Resource Limits
**Status:** PASS
**Severity:** Major
**Applicable:** Yes
#### Evidence Found
- charts/hyperfleet-sentinel/values.yaml:45 - resources.requests.cpu: 100m
- charts/hyperfleet-sentinel/values.yaml:46 - resources.requests.memory: 128Mi
- charts/hyperfleet-sentinel/values.yaml:47 - resources.limits.cpu: 500m
- charts/hyperfleet-sentinel/values.yaml:48 - resources.limits.memory: 512Mi
---
### Graceful Shutdown
**Status:** PASS
**Severity:** Critical
**Applicable:** Yes
#### Evidence Found
- cmd/main.go:67 - `signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)`
- cmd/main.go:89 - `server.Shutdown(ctx)` called on signal
---
### Reliability Documentation
**Status:** FAIL
**Severity:** Minor
**Applicable:** Yes
#### Evidence Found
- No runbook found
- No metrics documentation found
#### Gaps Found
- **Location:** docs/ directory
- **Expected:** Runbook and metrics documentation
- **Found:** Only README.md with installation instructions
- **Remediation:** Create docs/runbook.md with operational procedures and docs/metrics.md documenting exposed metrics
---
## Recommendations
**Critical Issues (address before production):**
1. **Dead Man's Switch Metrics** - Add heartbeat metric to detect silent failures. This is REQUIRED for Sentinel services.
**Major Issues (address soon):**
1. **Retry Logic** - Replace fixed delays with exponential backoff to prevent thundering herd during outages.
**Minor Issues (address when convenient):**
1. **Reliability Documentation** - Add runbook and metrics documentation for on-call support.
If the skill cannot complete an audit:
Always provide partial results where possible and suggest manual verification steps for incomplete checks.