Kubernetes Debugging Toolkit: Systematic Diagnosis of Pod, Network, Storage, and Cluster Issues

k8s-debug by akin-ozer/cc-devops-skills

49 weekly installs

131 GitHub Stars

Install command

npx skills add https://github.com/akin-ozer/cc-devops-skills --skill k8s-debug

DevOps, Container Orchestration, Debugging

Kubernetes Debugging Skill

Overview

Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.

Trigger Phrases

Use this skill when requests resemble:

"My pod is in CrashLoopBackOff; help me find the root cause."
"Service DNS works in one pod but not another."
"Deployment rollout is stuck."
"Pods are Pending and not scheduling."
"Cluster health looks degraded after a change."
"PVC is pending and pods cannot mount storage."

Prerequisites

Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.

Required

kubectl installed and configured.
An active cluster context.
Read access to namespaces, pods, events, services, and nodes.

Quick preflight:

kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns

Optional but Recommended

jq for more precise filtering in ./scripts/cluster_health.sh.
Metrics API (metrics-server) for kubectl top.
In-container debug tools (nslookup, getent, curl, wget, ip) for deep network tests.

Fallback behavior:

If optional tools are missing, scripts continue and print warnings with reduced output.
If kubectl top is unavailable, continue with kubectl describe and events.
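
The `kubectl top` fallback can be sketched as a small shell function. This is an illustrative sketch, not code from the skill's scripts: the function name `node_resource_view` is hypothetical, and it assumes `kubectl` is on the PATH.

```shell
# node_resource_view: prefer live metrics, fall back to describe output.
# Hypothetical helper; "kubectl top" needs the Metrics API (metrics-server).
node_resource_view() {
  if kubectl top nodes 2>/dev/null; then
    return 0
  fi
  echo "WARN: kubectl top unavailable; falling back to kubectl describe nodes" >&2
  kubectl describe nodes
}
```

The warning goes to stderr so that a saved report on stdout contains only the diagnostic output.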

When to Use This Skill

Use this skill for:

Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
Service connectivity or DNS resolution issues
Network policy or ingress problems
Volume and storage mount failures
Deployment rollout issues
Cluster health or performance degradation
Resource exhaustion (CPU/memory)
Configuration problems (ConfigMaps, Secrets, RBAC)

Safety Rules for Disruptive Commands

The default mode is read-only diagnosis. Execute disruptive commands only after confirming the blast radius and a rollback plan.

Commands requiring explicit confirmation:

kubectl delete pod ... --force --grace-period=0
kubectl drain ...
kubectl rollout restart ...
kubectl rollout undo ...
kubectl debug ... --copy-to=...

Before disruptive actions:

# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt
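
The three snapshot commands above could be wrapped so that each incident gets its own directory. The sketch below is an assumption-laden convenience, not part of the skill: the function name `snapshot_state`, the directory naming, and the `workloads.txt` file name are all hypothetical.

```shell
# snapshot_state <namespace> <pod-name> [out-dir]
# Saves the same three views as the snapshot commands into one directory.
# Directory layout and file names are illustrative assumptions.
snapshot_state() {
  snap_ns=$1
  snap_pod=$2
  snap_dir=${3:-snapshot-$(date +%Y%m%d-%H%M%S)}
  mkdir -p "$snap_dir" || return 1
  kubectl get deploy,rs,pod,svc -n "$snap_ns" -o wide > "$snap_dir/workloads.txt" || return 1
  kubectl get pod "$snap_pod" -n "$snap_ns" -o yaml > "$snap_dir/before-$snap_pod.yaml" || return 1
  kubectl get events -n "$snap_ns" --sort-by='.lastTimestamp' > "$snap_dir/before-events.txt" || return 1
  # Print the directory so the caller can record it in incident notes.
  echo "$snap_dir"
}
```

The function stops at the first failed capture, so a partial snapshot is never mistaken for a complete one.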

Reference Navigation Map

Load only the section needed for the observed symptom.

Symptom / Need	Open	Start section
You need an end-to-end diagnosis path	`./references/troubleshooting_workflow.md`	`General Debugging Workflow`
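Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff`	`./references/troubleshooting_workflow.md`	`Pod Lifecycle Troubleshooting`
Service reachability or DNS failure	`./references/troubleshooting_workflow.md`	`Network Troubleshooting Workflow`
Node pressure or performance degradation	`./references/troubleshooting_workflow.md`	`Resource and Performance Workflow`
PVC / PV / storage class issues	`./references/troubleshooting_workflow.md`	`Storage Troubleshooting Workflow`
Quick symptom-to-fix lookup	`./references/common_issues.md`	Matching issue heading
Post-failure remediation options for a known issue	`./references/common_issues.md`	`Solutions` sections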

Scripts Overview

Script	Purpose	Required args	Optional args	Output	Fallback behavior
`./scripts/cluster_health.sh`	Cluster-wide health snapshot (nodes, workloads, events, common failure states)	None	`--strict`, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Continues on check failures, tracks them in summary and exit code
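`./scripts/network_debug.sh`	Pod-centric network and DNS diagnostics	`<pod-name>` (`<namespace>` defaults to `default`)	`--strict`, `--insecure`, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Uses secure API probes by default; insecure TLS requires explicit `--insecure`
`./scripts/pod_diagnostics.py`	Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context)	`<pod-name>`	`-n/--namespace`, `-o/--output`	Sectioned report to stdout or file	Fails fast on missing access; skips optional metrics/log blocks with explicit messages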

Script Exit Codes

./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:

0: checks completed with no check failures (warnings allowed unless --strict is set).
1: one or more checks failed, or warnings occurred in --strict mode.
2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).
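
A caller can branch on this contract with a plain `case` statement. The following is a minimal sketch; the function name `describe_exit_code` and the label wording are hypothetical, and only the three codes listed above are treated as documented.

```shell
# describe_exit_code <code>: map the documented exit codes to a short label.
# Hypothetical helper; anything other than 0, 1, or 2 is reported as unknown.
describe_exit_code() {
  case "$1" in
    0) echo "ok: checks completed without failures" ;;
    1) echo "failed: one or more checks failed (or warnings in --strict mode)" ;;
    2) echo "blocked: preconditions not met" ;;
    *) echo "unknown: exit code $1 is not part of the documented contract" ;;
  esac
}
```

For example, run `./scripts/cluster_health.sh > report.txt` and then `describe_exit_code $?` on the next line to label the result.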

Deterministic Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

1. Preflight and Scope

kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>

If preflight fails, stop and fix access/context first.
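
To make the "stop first" rule mechanical, the three checks can be chained so that the first failure ends the run. This is a sketch under assumptions: the function name `preflight` is hypothetical, and returning 2 on failure merely mirrors the scripts' "blocked preconditions" code rather than reflecting anything the skill itself defines for manual checks.

```shell
# preflight <namespace>: run the three checks in order and stop at the first failure.
# Hypothetical helper; the return value 2 is borrowed from the scripts' exit-code contract.
preflight() {
  pf_ns=$1
  kubectl config current-context >/dev/null 2>&1 || { echo "no active context" >&2; return 2; }
  kubectl get ns >/dev/null 2>&1 || { echo "cannot list namespaces" >&2; return 2; }
  # "kubectl auth can-i" exits non-zero when the answer is "no".
  kubectl auth can-i get pods -n "$pf_ns" >/dev/null 2>&1 || { echo "cannot get pods in $pf_ns" >&2; return 2; }
  echo "preflight ok for namespace $pf_ns"
}
```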

2. Identify the Problem Layer

Categorize the issue:

Application Layer: Application crashes, errors, bugs
Pod Layer: Pod not starting, restarting, or pending
Service Layer: Network connectivity, DNS issues
Node Layer: Node not ready, resource exhaustion
Cluster Layer: Control plane issues, API problems
Storage Layer: Volume mount failures, PVC issues
Configuration Layer: ConfigMap, Secret, RBAC issues

3. Gather Diagnostics with the Right Script

Use the appropriate diagnostic script based on scope:

Pod-Level Diagnostics

Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>

This script gathers:

Pod status and description
Pod events
Container logs (current and previous)
Resource usage
Node information
YAML configuration

Output can be saved for analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt

Cluster-Level Health Check

Use ./scripts/cluster_health.sh for overall cluster diagnostics:

./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

This script checks:

Cluster info and version
Node status and resources
Pods across all namespaces
Failed/pending pods
Recent events
Deployments, services, statefulsets, daemonsets
PVCs and PVs
Component health
Common error states (CrashLoopBackOff, ImagePullBackOff)
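
Redirecting the report to a file is easy, but the script's exit code is just as important as its output. The sketch below shows one way to keep both; the function name `run_and_save` is hypothetical and works with any command, not only the skill's scripts.

```shell
# run_and_save <report-file> <command> [args...]
# Writes stdout and stderr to the report file and returns the command's exit code.
# Hypothetical helper for illustration.
run_and_save() {
  ras_out=$1
  shift
  "$@" > "$ras_out" 2>&1
  ras_rc=$?
  echo "saved $ras_out (exit code $ras_rc)" >&2
  return "$ras_rc"
}
```

A possible invocation is `run_and_save "cluster-health-$(date +%Y%m%d-%H%M%S).txt" ./scripts/cluster_health.sh --strict`, after which `$?` still reflects the health check result.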

Network Diagnostics

Use ./scripts/network_debug.sh for connectivity issues:

./scripts/network_debug.sh <namespace> <pod-name>
# or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>

This script analyzes:

Pod network configuration
DNS setup and resolution
Service endpoints
Network policies
Connectivity tests
CoreDNS logs

4. Follow Issue-Specific Reference Workflow

Based on the identified issue, consult ./references/troubleshooting_workflow.md:

Pod Pending: Resource/scheduling workflow
CrashLoopBackOff: Application crash workflow
ImagePullBackOff: Image pull workflow
Service issues: Network connectivity workflow
DNS failures: DNS troubleshooting workflow
Resource exhaustion: Performance investigation workflow
Storage issues: PVC binding workflow
Deployment stuck: Rollout workflow
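
One way to keep this mapping at hand is a small lookup function. The sketch below is hypothetical: the short keys (`pending`, `dns`, and so on) are invented for illustration, and the returned names simply repeat the list above.

```shell
# workflow_for <symptom-key>: return the workflow name for a symptom.
# The keys are illustrative assumptions; the values repeat the documented list.
workflow_for() {
  case "$1" in
    pending)          echo "Resource/scheduling workflow" ;;
    crashloopbackoff) echo "Application crash workflow" ;;
    imagepullbackoff) echo "Image pull workflow" ;;
    service)          echo "Network connectivity workflow" ;;
    dns)              echo "DNS troubleshooting workflow" ;;
    resources)        echo "Performance investigation workflow" ;;
    storage)          echo "PVC binding workflow" ;;
    rollout)          echo "Rollout workflow" ;;
    *) echo "no matching workflow for: $1" >&2; return 1 ;;
  esac
}
```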

5. Apply Targeted Fixes

Refer to ./references/common_issues.md for symptom-specific fixes.

6. Verify and Close

Run final verification:

kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>

The issue is resolved when user-visible behavior is healthy and no new critical warning events appear.
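
As an extra verification pass, a field selector can list only the pods that are not in a healthy phase. This is a sketch with a hypothetical function name (`unhealthy_pods`). Note that a pod in CrashLoopBackOff usually still reports the `Running` phase, so this check complements the event and rollout checks rather than replacing them.

```shell
# unhealthy_pods <namespace>: list pods whose phase is neither Running nor Succeeded.
# Hypothetical helper; an empty result means no pod is Pending, Failed, or Unknown.
unhealthy_pods() {
  kubectl get pods -n "$1" \
    --field-selector=status.phase!=Running,status.phase!=Succeeded \
    -o wide
}
```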

Example Flows

Example 1: CrashLoopBackOff in `payments` Namespace

python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe

Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
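
Before choosing a fix, it can help to read why the previous container instance stopped. The sketch below prints the last termination reason and exit code for each container using a JSONPath query; the function name `last_termination` is hypothetical, and the output is empty for containers that have never terminated.

```shell
# last_termination <pod-name> <namespace>
# Prints "<container>\t<reason>\t<exit code>" for each container's last terminated state.
# Hypothetical helper for illustration.
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

A reason of `OOMKilled` points toward memory limits, while `Error` with an application-specific exit code points toward the logs gathered above.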

Example 2: Service DNS/Connectivity Failure

./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout

Then follow the Service Connectivity Workflow in ./references/troubleshooting_workflow.md.
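
When the endpoints list comes back empty, a common cause is a Service selector that does not match any pod labels. The sketch below prints both side by side for comparison; the function name `selector_vs_labels` is hypothetical.

```shell
# selector_vs_labels <service-name> <namespace>
# Prints the Service selector, then the labels of every pod in the namespace.
# Hypothetical helper for illustration.
selector_vs_labels() {
  echo "Service selector:"
  kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.selector}{"\n"}'
  echo "Pod labels:"
  kubectl get pods -n "$2" --show-labels
}
```

For the example above this would be called as `selector_vs_labels checkout-api checkout`.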

Essential Manual Commands

Pod Debugging

# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

Service and Network Debugging

# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
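
The DNS test above resolves `kubernetes.default`; testing the affected Service by its fully qualified name is often more telling. The sketch below builds that name. The function name `service_fqdn` is hypothetical, and `cluster.local` is only the common default cluster domain, so pass a third argument if the cluster uses another one.

```shell
# service_fqdn <service-name> <namespace> [cluster-domain]
# Builds <service>.<namespace>.svc.<cluster-domain>.
# Assumes the default cluster domain "cluster.local" unless one is given.
service_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}
```

A possible use is `kubectl exec <pod-name> -n <namespace> -- nslookup "$(service_fqdn checkout-api checkout)"`.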

Resource Monitoring

# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers

Emergency Operations

# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>
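
Since these operations require explicit confirmation, a thin wrapper can enforce that step at the terminal. This is a minimal sketch, not part of the skill: the function name `confirm_then_run` and the prompt wording are hypothetical.

```shell
# confirm_then_run <command> [args...]
# Prints the command and runs it only if the operator types "yes".
# Hypothetical helper for illustration.
confirm_then_run() {
  printf 'About to run: %s\nType "yes" to continue: ' "$*" >&2
  read -r ctr_answer
  if [ "$ctr_answer" = "yes" ]; then
    "$@"
  else
    echo "skipped" >&2
    return 1
  fi
}
```

For example, `confirm_then_run kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data` shows the full command before anything is executed.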

Completion Criteria

A troubleshooting session is complete when all of the following are true:

The cluster context and namespace are confirmed.
Relevant diagnostic script output is captured.
The root cause is identified and tied to evidence (events/logs/config/state).
Any disruptive action was preceded by a snapshot and a rollback plan.
Fix verification commands show a healthy state.
The reference path used (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

Related Tools

Useful additional tools for Kubernetes debugging:

kubectl-debug: Advanced debugging plugin
stern: Multi-pod log tailing
kubectx/kubens: Context and namespace switching
k9s: Terminal UI for Kubernetes
lens: Desktop IDE for Kubernetes
Prometheus/Grafana: Monitoring and alerting
Jaeger/Zipkin: Distributed tracing

Repository

akin-ozer/cc-de…s-skills

GitHub Stars

107

First Seen

Jan 31, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Warn
Snyk: Pass

Installed on

github-copilot: 28
opencode: 28
codex: 27
gemini-cli: 26
cursor: 26
amp: 24
