k8s-debug by akin-ozer/cc-devops-skills
npx skills add https://github.com/akin-ozer/cc-devops-skills --skill k8s-debug
Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.
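Use this skill when requests resemble:
- "... CrashLoopBackOff; help me find the root cause."
- "... Pending and not scheduling."
Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.
kubectl must be installed and configured. Quick preflight: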
kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns
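The preflight above can be wrapped so that a session stops early when access is missing. The following is a minimal Python sketch, assuming kubectl is on PATH; `run_command`, `run_preflight`, and the injectable `runner` parameter are hypothetical names for illustration and are not part of the skill's scripts.

```python
import subprocess
from typing import Callable, List, Sequence

# The four read-only preflight commands listed above.
PREFLIGHT_COMMANDS: List[List[str]] = [
    ["kubectl", "config", "current-context"],
    ["kubectl", "auth", "can-i", "get", "pods", "-A"],
    ["kubectl", "auth", "can-i", "get", "events", "-A"],
    ["kubectl", "get", "ns"],
]


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit code (output is captured and discarded)."""
    try:
        return subprocess.run(list(argv), capture_output=True, check=False).returncode
    except FileNotFoundError:
        return 127  # kubectl is not installed or not on PATH


def run_preflight(runner: Callable[[Sequence[str]], int] = run_command) -> List[str]:
    """Return the preflight commands that failed, as printable strings."""
    failed = []
    for argv in PREFLIGHT_COMMANDS:
        if runner(argv) != 0:
            failed.append(" ".join(argv))
    return failed
```

Calling `run_preflight()` with no arguments runs the real commands; an empty result means every check passed. Passing a fake `runner` keeps the function testable without a cluster.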
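- jq for more precise filtering in ./scripts/cluster_health.sh.
- Metrics API (metrics-server) for kubectl top.
- ... (nslookup, getent, curl, wget, ip) for deep network tests.
Fallback behavior:
- If kubectl top is unavailable, continue with kubectl describe and events.
Use this skill for: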
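Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.
Commands requiring explicit confirmation:
- kubectl delete pod ... --force --grace-period=0
- kubectl drain ...
- kubectl rollout restart ...
- kubectl rollout undo ...
- kubectl debug ... --copy-to=...
Before disruptive actions:
# Snapshot current state for rollback and incident notes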
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt
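The snapshot step above can be expressed as data, which makes it easy to review exactly what will be captured before anything disruptive runs. This is a sketch only: `snapshot_plan` is a hypothetical helper, and the file names simply mirror the ones used in the commands above.

```python
from typing import List, Tuple


def snapshot_plan(namespace: str, pod_name: str) -> List[Tuple[List[str], str]]:
    """Return (argv, output_file) pairs mirroring the snapshot commands above.

    An empty output_file means the command prints to the terminal.
    """
    return [
        (["kubectl", "get", "deploy,rs,pod,svc", "-n", namespace, "-o", "wide"], ""),
        (["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "yaml"],
         f"before-{pod_name}.yaml"),
        (["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
         "before-events.txt"),
    ]
```

A caller would iterate over the plan, run each argv, and write stdout to the named file when one is given.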
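Load only the section needed for the observed symptom.
| Symptom / Need | Open | Start section |
|---|---|---|
| You need an end-to-end diagnosis path | ./references/troubleshooting_workflow.md | General Debugging Workflow |
| Pod state is Pending, CrashLoopBackOff, or ImagePullBackOff | ./references/troubleshooting_workflow.md | Pod Lifecycle Troubleshooting |
| Service reachability or DNS failure | ./references/troubleshooting_workflow.md | Network Troubleshooting Workflow |
| Node pressure or performance regression | ./references/troubleshooting_workflow.md | Resource and Performance Workflow |
| PVC / PV / storage class issues | ./references/troubleshooting_workflow.md | Storage Troubleshooting Workflow |
| Quick symptom-to-fix lookup | ./references/common_issues.md | matching issue heading |
| Post-mortem fix options for known issues | ./references/common_issues.md | Solutions sections |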
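The routing table above amounts to a lookup from symptom to reference file and starting section. A small Python sketch of that lookup follows; the short keys such as `"pod-lifecycle"` are hypothetical labels invented for this example, while the file paths and section names are copied from the table.

```python
from typing import Dict, Tuple

WORKFLOW = "./references/troubleshooting_workflow.md"
COMMON = "./references/common_issues.md"

# Keys are hypothetical short labels; values come from the table above.
REFERENCE_MAP: Dict[str, Tuple[str, str]] = {
    "end-to-end": (WORKFLOW, "General Debugging Workflow"),
    "pod-lifecycle": (WORKFLOW, "Pod Lifecycle Troubleshooting"),
    "network": (WORKFLOW, "Network Troubleshooting Workflow"),
    "resources": (WORKFLOW, "Resource and Performance Workflow"),
    "storage": (WORKFLOW, "Storage Troubleshooting Workflow"),
    "quick-lookup": (COMMON, "matching issue heading"),
    "post-mortem": (COMMON, "Solutions sections"),
}


def reference_for(symptom: str) -> Tuple[str, str]:
    """Return (file, start section) for a symptom label, or raise ValueError."""
    try:
        return REFERENCE_MAP[symptom]
    except KeyError:
        known = ", ".join(sorted(REFERENCE_MAP))
        raise ValueError(f"unknown symptom {symptom!r}; expected one of: {known}")
```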
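| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
|---|---|---|---|---|---|
| ./scripts/cluster_health.sh | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | --strict, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| ./scripts/network_debug.sh | Pod-centric network and DNS diagnostics | <pod-name> (<namespace> defaults to default) | --strict, --insecure, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit --insecure |
| ./scripts/pod_diagnostics.py | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | <pod-name> | -n/--namespace, -o/--output | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |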
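./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same exit-code contract:
- 0: checks completed with no check failures (warnings allowed unless --strict is set).
- 1: one or more checks failed, or warnings occurred in --strict mode.
- 2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).
Follow this systematic approach for any Kubernetes issue: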
kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>
If preflight fails, stop and fix access/context first.
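The same "stop on blocked preconditions" rule shows up in the exit-code contract shared by ./scripts/cluster_health.sh and ./scripts/network_debug.sh, where exit code 2 signals missing access or context. A wrapper could interpret those codes as in the Python sketch below; `describe_exit` and `should_stop` are hypothetical helpers, and the wording of each meaning is taken from the contract stated earlier.

```python
from typing import Dict

# Meanings as documented for cluster_health.sh and network_debug.sh.
EXIT_MEANINGS: Dict[int, str] = {
    0: "checks completed with no check failures",
    1: "one or more checks failed, or warnings occurred in --strict mode",
    2: "blocked preconditions",
}


def describe_exit(code: int) -> str:
    """Translate a script exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def should_stop(code: int) -> bool:
    """Exit code 2 means access or context must be fixed before continuing."""
    return code == 2
```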
Categorize the issue:
Use the appropriate diagnostic script based on scope:
Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
This script gathers:
Output can be saved for analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
Use ./scripts/cluster_health.sh for overall cluster diagnostics:
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
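The shell command above names the report with `date +%Y%m%d-%H%M%S`. When the script is driven from Python instead, the same naming scheme can be reproduced with `strftime`; `report_name` below is a hypothetical helper shown only to illustrate the format.

```python
from datetime import datetime
from typing import Optional


def report_name(now: Optional[datetime] = None) -> str:
    """Build a file name equivalent to cluster-health-$(date +%Y%m%d-%H%M%S).txt."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"cluster-health-{stamp}.txt"
```

Passing an explicit `now` keeps the name reproducible in tests and incident notes.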
This script checks:
Use ./scripts/network_debug.sh for connectivity issues:
./scripts/network_debug.sh <namespace> <pod-name>
# Or, only when explicitly needed, treat warnings as failures (--strict) or allow insecure TLS (--insecure):
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>
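The three invocations above differ only in their optional flags, so the argument list can be assembled by one small function. The sketch below follows the flag-then-namespace-then-pod order shown in the commands; `network_debug_argv` is a hypothetical helper, and passing both flags at once is an assumption that the commands above do not demonstrate.

```python
from typing import List


def network_debug_argv(namespace: str, pod_name: str,
                       strict: bool = False, insecure: bool = False) -> List[str]:
    """Assemble the argument list for ./scripts/network_debug.sh."""
    argv = ["./scripts/network_debug.sh"]
    if strict:
        argv.append("--strict")  # warnings count as failures
    if insecure:
        argv.append("--insecure")  # opt in to insecure TLS explicitly
    argv += [namespace, pod_name]
    return argv
```

Keeping `insecure` off by default matches the documented behavior that insecure TLS requires an explicit flag.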
This script analyzes:
Based on the identified issue, consult ./references/troubleshooting_workflow.md:
Refer to ./references/common_issues.md for symptom-specific fixes.
Run final verification:
kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>
The issue is resolved when user-visible behavior is healthy and no new critical warning events appear.
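Part of that verification can be automated by reading the pod list as JSON (for example from `kubectl get pods -n <namespace> -o json`) rather than scanning the wide output by eye. The Python sketch below is one possible check, not the skill's own logic: `unhealthy_pods` is a hypothetical helper, and treating Succeeded pods as healthy is an assumption made for completed Jobs.

```python
from typing import Any, Dict, List


def unhealthy_pods(pod_list: Dict[str, Any]) -> List[str]:
    """Return names of pods that are not Running with every container ready."""
    bad = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # assumption: completed pods are not a problem
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            bad.append(pod["metadata"]["name"])
    return bad
```

An empty result covers only pod readiness; events and user-visible behavior still need the checks listed above.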
CrashLoopBackOff in the payments namespace
python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe
Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
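The `grep -A 8 livenessProbe` step above shows a fixed number of lines and can cut off a longer probe definition. An alternative is to read the Deployment as JSON (`-o json`) and pull the probe out by field path, as in this Python sketch; `liveness_probes` is a hypothetical helper.

```python
from typing import Any, Dict


def liveness_probes(deployment: Dict[str, Any]) -> Dict[str, Any]:
    """Map container name to its livenessProbe, skipping containers without one."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    return {c["name"]: c["livenessProbe"] for c in containers if "livenessProbe" in c}
```

A probe with a short `initialDelaySeconds` or `timeoutSeconds` is a common thing to look for when a container restarts before it finishes starting.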
./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout
Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.
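When reading the Endpoints object from the commands above, the key question is whether the Service has any ready backend addresses. Given the object as JSON (`kubectl get endpoints checkout-api -n checkout -o json`), a check could look like the Python sketch below; `ready_addresses` is a hypothetical helper.

```python
from typing import Any, Dict, List


def ready_addresses(endpoints: Dict[str, Any]) -> List[str]:
    """Collect the IPs listed under subsets[].addresses (ready backends only)."""
    ips = []
    for subset in endpoints.get("subsets") or []:
        for addr in subset.get("addresses") or []:
            ips.append(addr["ip"])
    return ips
```

An empty list usually points at a selector mismatch or at pods that are failing readiness, since not-ready pods appear under `notReadyAddresses` instead.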
# View pod status
kubectl get pods -n <namespace> -o wide
# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>
# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container instance
kubectl logs <pod-name> -n <namespace> -c <container> # Specific container
# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh
# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Check endpoints
kubectl get endpoints -n <namespace>
# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default
# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
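The event listing above mixes Normal and Warning events. When the events are fetched as JSON (`-o json`), the warnings can be isolated and ordered in a few lines; the Python sketch below assumes `lastTimestamp` holds UTC timestamps in the usual `2026-01-31T10:05:00Z` form, which sort correctly as plain strings. `recent_warnings` is a hypothetical helper.

```python
from typing import Any, Dict, List


def recent_warnings(event_list: Dict[str, Any], limit: int = 20) -> List[str]:
    """Return the newest Warning events, oldest first, as one-line summaries."""
    if limit <= 0:
        return []
    warnings = [e for e in event_list.get("items", []) if e.get("type") == "Warning"]
    # Events without a lastTimestamp sort first rather than raising an error.
    warnings.sort(key=lambda e: e.get("lastTimestamp") or "")
    return [
        f'{e.get("lastTimestamp")} {e.get("reason")}: {e.get("message")}'
        for e in warnings[-limit:]
    ]
```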
# Node resources
kubectl top nodes
kubectl describe nodes
# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>
# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>
# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Cordon node (prevent scheduling)
kubectl cordon <node-name>
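Several of the commands in this reference appear on the earlier list of commands requiring explicit confirmation. A wrapper could enforce that rule with a guard like the Python sketch below. It is illustrative only: `needs_confirmation` is a hypothetical helper, and it assumes the subcommand directly follows `kubectl` with no global flags in between.

```python
from typing import Sequence


def needs_confirmation(argv: Sequence[str]) -> bool:
    """True if argv matches one of the disruptive patterns that need confirmation."""
    args = list(argv)
    if not args or args[0] != "kubectl":
        return False
    rest = args[1:]  # assumption: no global flags before the subcommand
    if rest[:1] == ["drain"]:
        return True
    if rest[:2] in (["rollout", "restart"], ["rollout", "undo"]):
        return True
    if rest[:2] == ["delete", "pod"] and "--force" in rest:
        return True
    if rest[:1] == ["debug"] and any(a.startswith("--copy-to") for a in rest):
        return True
    return False
```

Read-only commands and `kubectl cordon` pass through, since cordon is not on the confirmation list.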
Troubleshooting session is complete when all are true:
- ... (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.
Useful additional tools for Kubernetes debugging:
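Weekly installs: 31
GitHub stars: 107
First seen: Jan 31, 2026
Security audits: Gen Agent Trust Hub (Pass), Socket (Warn), Snyk (Pass)
Installed on: github-copilot (28), opencode (28), codex (27), gemini-cli (26), cursor (26), amp (24)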