npx skills add https://github.com/technillogue/ptx-isa-markdown --skill cuda
Measure before guessing. GPU performance is deeply counterintuitive. Profile first, hypothesize second, change third, verify fourth.
Small, isolated changes. CUDA bugs compound. Make one change, test it, commit it. Resist the urge to "fix everything at once."
printf is your strongest tool. When debuggers fail, when tools produce inscrutable output, printf in device code reveals truth. Don't be embarrassed to use it extensively.
Sometimes, stare at the diff. Inscrutable segfaults are common. Tools often don't help. The human approach: minimize the diff, read it carefully, see the bug. This is legitimate and often faster than tooling.
Reproduce minimally — Isolate the failing kernel with smallest possible input
Add printf — Before any tool, add printf in device code to trace execution
Run compute-sanitizer — Catch memory errors non-interactively:
compute-sanitizer --tool memcheck ./your_program
compute-sanitizer --tool racecheck ./your_program   # for race conditions
compute-sanitizer --tool initcheck ./your_program   # uninitialized memory
If still stuck, try cuda-gdb non-interactively for a backtrace:
cuda-gdb -batch -ex "run" -ex "bt" ./your_program
When tools fail — Minimize the diff between working and broken code. Read it. The bug is in the diff.
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) { // Limit output
        printf("Kernel launched, n=%d, data[0]=%f\n", n, data[0]);
    }
    // ... kernel logic ...
    if (idx < 10) { // Sample a few threads
        printf("Thread %d: result=%f\n", idx, someValue); // someValue: whatever intermediate you want to inspect
    }
}
Key patterns:
Guard with if (idx == 0) or if (idx < N) to avoid flooding the output.
Common gotcha: "Invalid __shared__ write... out of bounds" usually means insufficient dynamic shared memory allocation at the kernel launch, not wrong array indexing. Check <<<grid, block, smem_size>>>.
# Memory errors (most common)
compute-sanitizer --tool memcheck ./program
# Other tools: racecheck, initcheck, synccheck
# For detailed options, see references/debugging-tools.md
# Get backtrace on crash
cuda-gdb -batch -ex "run" -ex "bt" ./program
# For breakpoints, thread inspection, see references/debugging-tools.md
Compile with debug info:
nvcc -g -G -lineinfo program.cu -o program
# Dump PTX and SASS
cuobjdump -ptx ./program
cuobjdump -sass ./program
# For resource usage, symbol listing, see references/debugging-tools.md
For complete debugging tool reference: See references/debugging-tools.md for detailed compute-sanitizer options, cuda-gdb workflows, and cuobjdump analysis patterns.
Never optimize without profiling first. Intuition about GPU bottlenecks is almost always wrong. The profile → fix → verify loop is the actual optimization work, not a preliminary step.
Use nsys for: "Where is time being spent?" — CPU/GPU interaction, kernel launch patterns, memory transfers, overall timeline.
# Basic profile
nsys profile -o report ./program
nsys stats report.nsys-rep --report cuda_gpu_kern_sum
# With NVTX markers
nsys profile --trace=cuda,nvtx -o report ./program
# Key reports: cuda_gpu_kern_sum, cuda_api_sum, cuda_gpu_mem_time_sum, nvtx_sum
# For detailed usage, see references/nsys-guide.md
For detailed nsys analysis patterns: See references/nsys-guide.md for timeline interpretation, identifying common bottlenecks, and analysis workflows.
Use ncu for: "Why is this kernel slow?" — Detailed metrics, roofline, memory analysis, occupancy.
# Profile specific kernel
ncu --kernel-name "myKernel" -o report ./program
# Quick summary to stdout
ncu --set basic ./program
# Sets: basic, full, memory, launch, roofline
# Sections: ComputeWorkloadAnalysis, MemoryWorkloadAnalysis, Occupancy
# For detailed metrics and interpretation, see references/ncu-guide.md
Warning: ncu expert system recommendations can be misleading. Always verify with actual metrics and experiments.
Scale matters: Optimizations that help at large scale can hurt at small scale. Always profile at your actual problem size, not theoretical maximums.
For detailed ncu metric interpretation: See references/ncu-guide.md for understanding roofline analysis, memory bottlenecks, occupancy limits, and warp scheduling.
When you need finer granularity than kernel-level, use NVTX:
#include <nvtx3/nvToolsExt.h>
nvtxRangePush("Operation Name");
// ... code to profile ...
nvtxRangePop();
Compile: -lnvToolsExt | Profile: nsys profile --trace=cuda,nvtx
For complete patterns: See references/nvtx-patterns.md for nested ranges, colors, and analysis workflows.
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Low GPU utilization | Kernel launch overhead, CPU bottleneck | nsys timeline, look for gaps |
| Memory bound | Poor access patterns, low cache hit | ncu memory section, check coalescing |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy, reduce registers |
| Lots of small kernels | Launch overhead dominates | nsys timeline, consider fusion |
| High memcpy time | Excessive H2D/D2H transfers | nsys cuda_gpu_mem, batch transfers |
| Most cycles stalled | Bank conflicts, memory stalls | ncu SchedulerStatistics, check shared memory |
| High sectors/request | Poor coalescing (>4 sectors/req) | ncu memory metrics, use vectorized loads |
Critical traps: Bank conflicts and memory coalescing issues often dominate performance but aren't obvious without profiling. See references/performance-traps.md for detailed diagnosis and fixes.
Reality check: Budget 80% of optimization time for problems you didn't predict. Profile-driven iteration discovers the real bottlenecks.
# Debug build
nvcc -g -G -lineinfo -O0 program.cu -o program_debug
# Release build
nvcc -O3 -lineinfo program.cu -o program
# Specific architecture
nvcc -arch=sm_80 program.cu -o program # Ampere
nvcc -arch=sm_89 program.cu -o program # Ada Lovelace
nvcc -arch=sm_90 program.cu -o program # Hopper
# Generate PTX (inspect it)
nvcc -ptx program.cu
# Verbose compilation (see register usage)
nvcc --ptxas-options=-v program.cu
# With NVTX
nvcc program.cu -lnvToolsExt -o program
Always compile with -lineinfo for production profiling — minimal overhead, enables source correlation.
Complete reference documentation available for grep-based search:
PTX ISA 9.1 — references/ptx-docs/ (405 files, 2.3MB); search guide: references/ptx-isa.md
CUDA Runtime API 13.1 — references/cuda-runtime-docs/ (107 files, 0.9MB); search guide: references/cuda-runtime.md — device properties (cudaDeviceProp), memory management, stream behavior
CUDA Driver API 13.1 — references/cuda-driver-docs/ (128 files, 0.8MB); search guide: references/cuda-driver.md — context creation (cuCtxCreate), module loading (cuModuleLoad), virtual memory, Driver errors (CUDA_ERROR_*), advanced features
Each search guide contains grep examples, documentation structure, and common usage patterns.
Search strategy: Use grep/ripgrep to search directly in the *-docs/ directories. The search guides (.md files) provide navigation patterns and common queries.
references/performance-traps.md — Bank conflicts, memory coalescing, scale-dependent optimizations
references/debugging-tools.md — compute-sanitizer, cuda-gdb, cuobjdump detailed usage
references/nsys-guide.md — nsys timeline analysis and bottleneck identification
references/ncu-guide.md — ncu metrics, roofline, occupancy interpretation
references/nvtx-patterns.md — NVTX instrumentation and profiling patterns

Weekly Installs
63
Repository
GitHub Stars
45
First Seen
Jan 28, 2026
Security Audits
Gen Agent Trust Hub: Pass | Socket: Pass | Snyk: Warn
Installed on
codex: 55
cursor: 53
opencode: 53
kimi-cli: 51
github-copilot: 51
gemini-cli: 50