npx skills add https://github.com/technillogue/ptx-isa-markdown --skill cuda
Measure before guessing. GPU performance is deeply counterintuitive. Profile first, hypothesize second, change third, verify fourth.
Small, isolated changes. CUDA bugs compound. Make one change, test it, commit it. Resist the urge to "fix everything at once."
printf is your strongest tool. When debuggers fail, when tools produce inscrutable output, printf in device code reveals truth. Don't be embarrassed to use it extensively.
Sometimes, stare at the diff. Inscrutable segfaults are common. Tools often don't help. The human approach: minimize the diff, read it carefully, see the bug. This is legitimate and often faster than tooling.
Reproduce minimally — Isolate the failing kernel with smallest possible input
Add printf — Before any tool, add printf in device code to trace execution
Run compute-sanitizer — Catch memory errors non-interactively:
compute-sanitizer --tool memcheck ./your_program
compute-sanitizer --tool racecheck ./your_program   # for race conditions
compute-sanitizer --tool initcheck ./your_program   # uninitialized memory
If still stuck, try cuda-gdb non-interactively for a backtrace:
cuda-gdb -batch -ex "run" -ex "bt" ./your_program
When tools fail — Minimize the diff between working and broken code. Read it. The bug is in the diff.
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) { // Limit output
        printf("Kernel launched, n=%d, data[0]=%f\n", n, data[0]);
    }
    // ... kernel logic ...
    if (idx < 10) { // Sample a few threads
        printf("Thread %d: result=%f\n", idx, someValue); // someValue: whatever intermediate you want to inspect
    }
}
Key patterns:
Guard with if (idx == 0) or if (idx < N) to avoid flooding the output.
Common gotcha: "Invalid __shared__ write... out of bounds" usually means insufficient dynamic shared memory allocation at the kernel launch, not wrong array indexing. Check <<<grid, block, smem_size>>>.
# Memory errors (most common)
compute-sanitizer --tool memcheck ./program
# Other tools: racecheck, initcheck, synccheck
# For detailed options, see references/debugging-tools.md
# Get backtrace on crash
cuda-gdb -batch -ex "run" -ex "bt" ./program
# For breakpoints, thread inspection, see references/debugging-tools.md
Compile with debug info:
nvcc -g -G -lineinfo program.cu -o program
# Dump PTX and SASS
cuobjdump -ptx ./program
cuobjdump -sass ./program
# For resource usage, symbol listing, see references/debugging-tools.md
For complete debugging tool reference: See references/debugging-tools.md for detailed compute-sanitizer options, cuda-gdb workflows, and cuobjdump analysis patterns.
Never optimize without profiling first. Intuition about GPU bottlenecks is almost always wrong. The profile → fix → verify loop is the actual optimization work, not a preliminary step.
Use nsys for: "Where is time being spent?" — CPU/GPU interaction, kernel launch patterns, memory transfers, overall timeline.
# Basic profile
nsys profile -o report ./program
nsys stats report.nsys-rep --report cuda_gpu_kern_sum
# With NVTX markers
nsys profile --trace=cuda,nvtx -o report ./program
# Key reports: cuda_gpu_kern_sum, cuda_api_sum, cuda_gpu_mem_time_sum, nvtx_sum
# For detailed usage, see references/nsys-guide.md
For detailed nsys analysis patterns: See references/nsys-guide.md for timeline interpretation, identifying common bottlenecks, and analysis workflows.
Use ncu for: "Why is this kernel slow?" — Detailed metrics, roofline, memory analysis, occupancy.
# Profile specific kernel
ncu --kernel-name "myKernel" -o report ./program
# Quick summary to stdout
ncu --set basic ./program
# Sets: basic, full, memory, launch, roofline
# Sections: ComputeWorkloadAnalysis, MemoryWorkloadAnalysis, Occupancy
# For detailed metrics and interpretation, see references/ncu-guide.md
Warning: ncu expert system recommendations can be misleading. Always verify with actual metrics and experiments.
Scale matters: Optimizations that help at large scale can hurt at small scale. Always profile at your actual problem size, not theoretical maximums.
For detailed ncu metric interpretation: See references/ncu-guide.md for understanding roofline analysis, memory bottlenecks, occupancy limits, and warp scheduling.
When you need finer granularity than kernel-level, use NVTX:
#include <nvtx3/nvToolsExt.h>
nvtxRangePush("Operation Name");
// ... code to profile ...
nvtxRangePop();
Compile: -lnvToolsExt | Profile: nsys profile --trace=cuda,nvtx
For complete patterns: See references/nvtx-patterns.md for nested ranges, colors, and analysis workflows.
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Low GPU utilization | Kernel launch overhead, CPU bottleneck | nsys timeline, look for gaps |
| Memory bound | Poor access patterns, low cache hit | ncu memory section, check coalescing |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy, reduce registers |
| Lots of small kernels | Launch overhead dominates | nsys timeline, consider fusion |
| High memcpy time | Excessive H2D/D2H transfers | nsys cuda_gpu_mem, batch transfers |
| Most cycles stalled | Bank conflicts, memory stalls | ncu SchedulerStatistics, check shared memory |
| High sectors/request | Poor coalescing (>4 sectors/req) | ncu memory metrics, use vectorized loads |
Critical traps: Bank conflicts and memory coalescing issues often dominate performance but aren't obvious without profiling. See references/performance-traps.md for detailed diagnosis and fixes.
Reality check: Budget 80% of optimization time for problems you didn't predict. Profile-driven iteration discovers the real bottlenecks.
# Debug build
nvcc -g -G -lineinfo -O0 program.cu -o program_debug
# Release build
nvcc -O3 -lineinfo program.cu -o program
# Specific architecture
nvcc -arch=sm_80 program.cu -o program # Ampere
nvcc -arch=sm_89 program.cu -o program # Ada Lovelace
nvcc -arch=sm_90 program.cu -o program # Hopper
# Generate PTX (inspect it)
nvcc -ptx program.cu
# Verbose compilation (see register usage)
nvcc --ptxas-options=-v program.cu
# With NVTX
nvcc program.cu -lnvToolsExt -o program
Always compile with -lineinfo for production profiling — minimal overhead, enables source correlation.
Complete reference documentation available for grep-based search:
PTX ISA 9.1 — references/ptx-docs/ (405 files, 2.3MB); search guide: references/ptx-isa.md
CUDA Runtime API 13.1 — references/cuda-runtime-docs/ (107 files, 0.9MB); search guide: references/cuda-runtime.md — device properties (cudaDeviceProp), memory management, stream behavior
CUDA Driver API 13.1 — references/cuda-driver-docs/ (128 files, 0.8MB); search guide: references/cuda-driver.md — context creation (cuCtxCreate), module loading (cuModuleLoad), virtual memory, Driver errors (CUDA_ERROR_*), advanced features
Each search guide contains grep examples, documentation structure, and common usage patterns.
Search strategy: Use grep/ripgrep to search directly in the *-docs/ directories. The search guides (.md files) provide navigation patterns and common queries.
references/performance-traps.md — Bank conflicts, memory coalescing, scale-dependent optimizations
references/debugging-tools.md — compute-sanitizer, cuda-gdb, cuobjdump detailed usage
references/nsys-guide.md — nsys timeline analysis and bottleneck identification
references/ncu-guide.md — ncu metrics, roofline, occupancy interpretation
references/nvtx-patterns.md — NVTX instrumentation and profiling patterns

Weekly Installs
63
Repository
GitHub Stars
45
First Seen
Jan 28, 2026
Security Audits
Gen Agent Trust Hub: Pass | Socket: Pass | Snyk: Warn
Installed on
codex: 55
cursor: 53
opencode: 53
kimi-cli: 51
github-copilot: 51
gemini-cli: 50