aoti-debug by pytorch/pytorch
npx skills add https://github.com/pytorch/pytorch --skill aoti-debug
This skill helps diagnose and fix common AOTInductor issues.
Check the error message and route to the appropriate sub-guide:
If the error matches this pattern:
```
Assertion `index out of bounds: 0 <= tmpN < ksM` failed
```
→ Follow the guide in triton-index-out-of-bounds.md
Continue with the sections below.
For ANY AOTI error (segfault, exception, crash, wrong output), ALWAYS check these first:
```python
import torch

# During compilation - note the device and shapes
model = MyModel().eval()  # What device? CPU or .cuda()?
inp = torch.randn(2, 10)  # What device? What shape?
compiled_so = torch._inductor.aot_compile(model, (inp,))

# During loading - device type MUST match compilation
loaded = torch._export.aot_load(compiled_so, "???")  # Must match model/input device above

# During inference - device and shapes MUST match
out = loaded(inp.to("???"))  # Must match compile device, shape must match
```
If any of these don't match, you will get errors ranging from segfaults to exceptions to wrong outputs.
AOTI requires compile and load to use the same device type.
Symptom: Segfault, exception, or crash during aot_load() or model execution.

Example error messages:

```
The specified pointer resides on host memory and is not registered with any CUDA device
Expected out tensor to have device cuda:0, but got cpu instead
```

Cause: Compile and load device types don't match (see "First Step" above).

Solution: Ensure compile and load use the same device type. If compiled on CPU, load on CPU. If compiled on CUDA, load on CUDA.
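For instance, if the model was moved to CUDA before compiling, load with the matching device string. A minimal sketch (assumes a CUDA build and `MyModel` as defined in the first-step snippet; it mirrors the same deprecated APIs used above):

```python
import torch

# Compile on CUDA - both model and example input live on the GPU.
model = MyModel().eval().cuda()
inp = torch.randn(2, 10, device="cuda")
compiled_so = torch._inductor.aot_compile(model, (inp,))

# Load with the SAME device type the model was compiled for.
loaded = torch._export.aot_load(compiled_so, "cuda")

# Inference inputs must also be on the compile device.
out = loaded(inp)
```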
Symptom: RuntimeError during model execution.

Cause: Input device doesn't match compile device (see "First Step" above).

Better debugging: Run with AOTI_RUNTIME_CHECK_INPUTS=1 for clearer errors. This flag validates all input properties, including device type, dtype, sizes, and strides:
```shell
AOTI_RUNTIME_CHECK_INPUTS=1 python your_script.py
```
This produces actionable error messages like:
```
Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)
```
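Conceptually, this check compares each input's runtime properties against metadata recorded at compile time. A pure-Python sketch of that idea (names like `CompiledInputSpec` and `check_inputs` are illustrative, not the actual AOTI runtime):

```python
from dataclasses import dataclass


@dataclass
class CompiledInputSpec:
    # Properties recorded at compile time (a stand-in for the
    # guards that AOTI_RUNTIME_CHECK_INPUTS=1 validates).
    device: str
    dtype: str
    shape: tuple


def check_inputs(specs, inputs):
    """Raise a descriptive error if any input property mismatches.

    `inputs` is a list of (device, dtype, shape) tuples describing
    the tensors actually passed at inference time.
    """
    for i, (spec, (device, dtype, shape)) in enumerate(zip(specs, inputs)):
        if device != spec.device:
            raise RuntimeError(
                f"input_handles[{i}]: unmatched device type, "
                f"expected: {spec.device}, but got: {device}"
            )
        if dtype != spec.dtype:
            raise RuntimeError(f"input_handles[{i}]: unmatched dtype")
        if shape != spec.shape:
            raise RuntimeError(f"input_handles[{i}]: unmatched shape")


# Example: compiled for a CPU float32 (2, 10) input, called with a CUDA tensor.
specs = [CompiledInputSpec("cpu", "float32", (2, 10))]
try:
    check_inputs(specs, [("cuda", "float32", (2, 10))])
except RuntimeError as e:
    print(e)  # input_handles[0]: unmatched device type, expected: cpu, but got: cuda
```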
If you encounter CUDA illegal memory access (IMA) errors, follow this systematic approach:
Before diving deep, try these debugging flags:
```shell
AOTI_RUNTIME_CHECK_INPUTS=1
TORCHINDUCTOR_NAN_ASSERTS=1
```
These flags take effect at compilation time (at codegen time):
- `AOTI_RUNTIME_CHECK_INPUTS=1` checks if inputs satisfy the same guards used during compilation
- `TORCHINDUCTOR_NAN_ASSERTS=1` adds codegen before and after each kernel to check for NaN

CUDA IMA errors can be non-deterministic. Use these flags to trigger the error deterministically:
```shell
PYTORCH_NO_CUDA_MEMORY_CACHING=1
CUDA_LAUNCH_BLOCKING=1
```
These flags take effect at runtime:
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1` disables PyTorch's caching allocator, which allocates bigger buffers than immediately needed; out-of-bounds accesses that land inside those oversized cached buffers don't fault, which is usually why CUDA IMA errors are non-deterministic.
- `CUDA_LAUNCH_BLOCKING=1` forces kernels to launch one at a time. Without this, you get "CUDA kernel errors might be asynchronously reported" warnings since kernels launch asynchronously.

Use the AOTI Intermediate Value Debugger to pinpoint the problematic kernel:
```shell
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3
```
This prints kernels one by one at runtime. Together with previous flags, this shows which kernel was launched right before the error.
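Putting the pieces together: when compilation and inference happen in the same script (as in the first-step snippet above, where `your_script.py` is a placeholder name), all three flags can go on one invocation. Note that the printer flag takes effect at codegen, so it must be set on the process that compiles:

```shell
# Disable the caching allocator and serialize kernel launches so the IMA
# fires deterministically, and print kernels as they execute - the last
# kernel printed before the crash is the prime suspect.
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
CUDA_LAUNCH_BLOCKING=1 \
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 \
python your_script.py
```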
To inspect inputs to a specific kernel:
```shell
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_add_ge_logical_and_logical_or_lt_231,_add_position_embeddings_kernel_5" \
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2
```
If inputs to the kernel are unexpected, inspect the kernel that produces the bad input.
Other useful flags:

- `TORCH_LOGS="+inductor,output_code"` to see more PT2 internal logs
- `TORCH_SHOW_CPP_STACKTRACES=1` to see more stack traces

The older AOTI entry points are deprecated in favor of the packaging API:

```python
torch._export.aot_compile()   # Deprecated
torch._export.aot_load()      # Deprecated
torch._inductor.aoti_compile_and_package()
torch._inductor.aoti_load_package()
```
The new API stores device metadata in the package, so aoti_load_package() automatically uses the correct device type. You can only change the device index (e.g., cuda:0 vs cuda:1), not the device type.
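A minimal sketch of the packaging workflow (the model, input shapes, and `model.pt2` path are illustrative, and exact keyword signatures vary across PyTorch versions):

```python
import torch


class MyModel(torch.nn.Module):
    def forward(self, x):
        return x * 2


model = MyModel().eval()
inp = torch.randn(2, 10)

# Export, then compile and package. Device metadata is stored
# inside the .pt2 package alongside the compiled artifact.
ep = torch.export.export(model, (inp,))
pkg = torch._inductor.aoti_compile_and_package(ep, package_path="model.pt2")

# Load elsewhere: the package already knows the device type, so only the
# device index (e.g. cuda:0 vs cuda:1) could differ - never the type.
loaded = torch._inductor.aoti_load_package(pkg)
out = loaded(inp)
```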
| Variable | When | Purpose |
|---|---|---|
| `AOTI_RUNTIME_CHECK_INPUTS=1` | Compile time | Validate inputs match compilation guards |
| `TORCHINDUCTOR_NAN_ASSERTS=1` | Compile time | Check for NaN before/after kernels |
| `PYTORCH_NO_CUDA_MEMORY_CACHING=1` | Runtime | Make IMA errors deterministic |
| `CUDA_LAUNCH_BLOCKING=1` | Runtime | Force synchronous kernel launches |
| `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3` | Compile time | Print kernels at runtime |
| `TORCH_LOGS="+inductor,output_code"` | Runtime | See PT2 internal logs |
| `TORCH_SHOW_CPP_STACKTRACES=1` | Runtime | Show C++ stack traces |