aoti-debug by pytorch/pytorch
npx skills add https://github.com/pytorch/pytorch --skill aoti-debug
This skill helps diagnose and fix common AOTInductor issues.
Check the error message and route to the appropriate sub-guide:
If the error matches this pattern:
```
Assertion `index out of bounds: 0 <= tmpN < ksM` failed
```
→ Follow the guide in triton-index-out-of-bounds.md
Continue with the sections below.
For ANY AOTI error (segfault, exception, crash, wrong output), ALWAYS check these first:
```python
import torch

# During compilation - note the device and shapes
model = MyModel().eval()  # What device? CPU or .cuda()?
inp = torch.randn(2, 10)  # What device? What shape?
compiled_so = torch._inductor.aot_compile(model, (inp,))

# During loading - device type MUST match compilation
loaded = torch._export.aot_load(compiled_so, "???")  # Must match model/input device above

# During inference - device and shapes MUST match
out = loaded(inp.to("???"))  # Must match compile device, shape must match
```
If any of these don't match, you will get errors ranging from segfaults to exceptions to wrong outputs.
AOTI requires compile and load to use the same device type.
Symptom: Segfault, exception, or crash during aot_load() or model execution.

Example error messages:

```
The specified pointer resides on host memory and is not registered with any CUDA device
Expected out tensor to have device cuda:0, but got cpu instead
```

Cause: Compile and load device types don't match (see "First Step" above).

Solution: Ensure compile and load use the same device type. If compiled on CPU, load on CPU. If compiled on CUDA, load on CUDA.
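For instance, if the model was moved to CUDA before compiling, load with the matching device string. A minimal sketch (assumes a CUDA build and `MyModel` as defined in the first-step snippet; it mirrors the same deprecated APIs used above):

```python
import torch

# Compile on CUDA - both model and example input live on the GPU.
model = MyModel().eval().cuda()
inp = torch.randn(2, 10, device="cuda")
compiled_so = torch._inductor.aot_compile(model, (inp,))

# Load with the SAME device type the model was compiled for.
loaded = torch._export.aot_load(compiled_so, "cuda")

# Inference inputs must also be on the compile device.
out = loaded(inp)
```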
Symptom: RuntimeError during model execution.

Cause: Input device doesn't match compile device (see "First Step" above).

Better debugging: Run with AOTI_RUNTIME_CHECK_INPUTS=1 for clearer errors. This flag validates all input properties, including device type, dtype, sizes, and strides:
```shell
AOTI_RUNTIME_CHECK_INPUTS=1 python your_script.py
```
This produces actionable error messages like:
```
Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)
```
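Conceptually, this check compares each input's runtime properties against metadata recorded at compile time. A pure-Python sketch of that idea (names like `CompiledInputSpec` and `check_inputs` are illustrative, not the actual AOTI runtime):

```python
from dataclasses import dataclass


@dataclass
class CompiledInputSpec:
    # Properties recorded at compile time (a stand-in for the
    # guards that AOTI_RUNTIME_CHECK_INPUTS=1 validates).
    device: str
    dtype: str
    shape: tuple


def check_inputs(specs, inputs):
    """Raise a descriptive error if any input property mismatches.

    `inputs` is a list of (device, dtype, shape) tuples describing
    the tensors actually passed at inference time.
    """
    for i, (spec, (device, dtype, shape)) in enumerate(zip(specs, inputs)):
        if device != spec.device:
            raise RuntimeError(
                f"input_handles[{i}]: unmatched device type, "
                f"expected: {spec.device}, but got: {device}"
            )
        if dtype != spec.dtype:
            raise RuntimeError(f"input_handles[{i}]: unmatched dtype")
        if shape != spec.shape:
            raise RuntimeError(f"input_handles[{i}]: unmatched shape")


# Example: compiled for a CPU float32 (2, 10) input, called with a CUDA tensor.
specs = [CompiledInputSpec("cpu", "float32", (2, 10))]
try:
    check_inputs(specs, [("cuda", "float32", (2, 10))])
except RuntimeError as e:
    print(e)  # input_handles[0]: unmatched device type, expected: cpu, but got: cuda
```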
If you encounter CUDA illegal memory access (IMA) errors, follow this systematic approach:
Before diving deep, try these debugging flags:
```shell
AOTI_RUNTIME_CHECK_INPUTS=1
TORCHINDUCTOR_NAN_ASSERTS=1
```
These flags take effect at compilation time (at codegen time):
- `AOTI_RUNTIME_CHECK_INPUTS=1` checks if inputs satisfy the same guards used during compilation
- `TORCHINDUCTOR_NAN_ASSERTS=1` adds codegen before and after each kernel to check for NaN

CUDA IMA errors can be non-deterministic. Use these flags to trigger the error deterministically:
```shell
PYTORCH_NO_CUDA_MEMORY_CACHING=1
CUDA_LAUNCH_BLOCKING=1
```
These flags take effect at runtime:
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1` disables PyTorch's caching allocator, which allocates bigger buffers than immediately needed; out-of-bounds accesses that land inside those oversized cached buffers don't fault, which is usually why CUDA IMA errors are non-deterministic.
- `CUDA_LAUNCH_BLOCKING=1` forces kernels to launch one at a time. Without this, you get "CUDA kernel errors might be asynchronously reported" warnings since kernels launch asynchronously.

Use the AOTI Intermediate Value Debugger to pinpoint the problematic kernel:
```shell
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3
```
This prints kernels one by one at runtime. Together with previous flags, this shows which kernel was launched right before the error.
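Putting the pieces together: when compilation and inference happen in the same script (as in the first-step snippet above, where `your_script.py` is a placeholder name), all three flags can go on one invocation. Note that the printer flag takes effect at codegen, so it must be set on the process that compiles:

```shell
# Disable the caching allocator and serialize kernel launches so the IMA
# fires deterministically, and print kernels as they execute - the last
# kernel printed before the crash is the prime suspect.
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
CUDA_LAUNCH_BLOCKING=1 \
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 \
python your_script.py
```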
To inspect inputs to a specific kernel:
```shell
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_add_ge_logical_and_logical_or_lt_231,_add_position_embeddings_kernel_5" \
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2
```
If inputs to the kernel are unexpected, inspect the kernel that produces the bad input.
Other useful flags:

- `TORCH_LOGS="+inductor,output_code"` to see more PT2 internal logs
- `TORCH_SHOW_CPP_STACKTRACES=1` to see more stack traces

The older AOTI entry points are deprecated in favor of the packaging API:

```python
torch._export.aot_compile()   # Deprecated
torch._export.aot_load()      # Deprecated
torch._inductor.aoti_compile_and_package()
torch._inductor.aoti_load_package()
```
The new API stores device metadata in the package, so aoti_load_package() automatically uses the correct device type. You can only change the device index (e.g., cuda:0 vs cuda:1), not the device type.
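A minimal sketch of the packaging workflow (the model, input shapes, and `model.pt2` path are illustrative, and exact keyword signatures vary across PyTorch versions):

```python
import torch


class MyModel(torch.nn.Module):
    def forward(self, x):
        return x * 2


model = MyModel().eval()
inp = torch.randn(2, 10)

# Export, then compile and package. Device metadata is stored
# inside the .pt2 package alongside the compiled artifact.
ep = torch.export.export(model, (inp,))
pkg = torch._inductor.aoti_compile_and_package(ep, package_path="model.pt2")

# Load elsewhere: the package already knows the device type, so only the
# device index (e.g. cuda:0 vs cuda:1) could differ - never the type.
loaded = torch._inductor.aoti_load_package(pkg)
out = loaded(inp)
```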
| Variable | When | Purpose |
|---|---|---|
| `AOTI_RUNTIME_CHECK_INPUTS=1` | Compile time | Validate inputs match compilation guards |
| `TORCHINDUCTOR_NAN_ASSERTS=1` | Compile time | Check for NaN before/after kernels |
| `PYTORCH_NO_CUDA_MEMORY_CACHING=1` | Runtime | Make IMA errors deterministic |
| `CUDA_LAUNCH_BLOCKING=1` | Runtime | Force synchronous kernel launches |
| `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3` | Compile time | Print kernels at runtime |
| `TORCH_LOGS="+inductor,output_code"` | Runtime | See PT2 internal logs |
| `TORCH_SHOW_CPP_STACKTRACES=1` | Runtime | Show C++ stack traces |