cuda-kernels by huggingface/kernels
npx skills add https://github.com/huggingface/kernels --skill cuda-kernels
This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.
For benchmarking kernel performance:
# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels
# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile
# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels && \
python generate_video.py --no-optimized-kernels --compile
For a minimal diffusers integration example (~150 lines):
python scripts/ltx_kernel_injection_example.py
For a minimal transformers integration example (~120 lines):
python scripts/transformers_injection_example.py
Load pre-compiled kernels from HuggingFace Hub (no local compilation):
from kernels import get_kernel
import torch

# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
For a complete HuggingFace Kernels example:
python scripts/huggingface_kernels_example.py
python benchmark_rmsnorm.py
| Library | Supported Models | Key Kernels |
|---|---|---|
| diffusers | LTX-Video, Stable Diffusion, FLUX, DiT | RMSNorm, GEGLU, RoPE, AdaLN |
| transformers | LLaMA, Mistral, Qwen, Falcon | RMSNorm, Attention |
| GPU | Compute Capability | Guide |
| --- | --- | --- |
| H100 | sm_90 | h100-optimization-guide.md |
| A100 | sm_80 | a100-optimization-guide.md |
| T4 | sm_75 | t4-optimization-guide.md |
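The capability-to-guide mapping above can be expressed as a small lookup; on a live system the `(major, minor)` pair would come from `torch.cuda.get_device_capability()`. This helper is an illustrative sketch, not part of the skill's scripts:

```python
# Map a CUDA compute capability to the matching optimization guide.
GUIDES = {
    (9, 0): ("H100", "h100-optimization-guide.md"),
    (8, 0): ("A100", "a100-optimization-guide.md"),
    (7, 5): ("T4", "t4-optimization-guide.md"),
}

def guide_for_capability(major: int, minor: int) -> str:
    """Return the guide filename for a compute capability, e.g. (9, 0) -> sm_90."""
    gpu, guide = GUIDES.get((major, minor), (None, None))
    if guide is None:
        raise ValueError(f"No guide for sm_{major}{minor}")
    return guide
```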
Use this skill when:
A complete working example is available at examples/ltx_video/. This demonstrates:
Use the benchmark script to measure kernel performance:
# Full benchmark with all options
python scripts/benchmark_example.py \
--use-optimized-kernels \
--compile \
--batch-size 1 \
--num-frames 161 \
--height 512 \
--width 768 \
--steps 50 \
--warmup-iterations 2
| Option | Default | Description |
|---|---|---|
| --use-optimized-kernels | auto | Use custom H100 CUDA kernels |
| --no-optimized-kernels | - | Use baseline implementation |
| --compile | false | Enable torch.compile on transformer |
| --batch-size | 1 | Number of videos per prompt |
| --num-frames | 161 | Number of frames to generate |
| --height | 512 | Video height in pixels |
| --width | 768 | Video width in pixels |
| --steps | 50 | Denoising steps |
| --warmup-iterations | 2 | Warmup runs before benchmark |
End-to-End Video Generation (49 frames, 30 steps, H100 80GB):
| Configuration | Time (s) | it/s | Speedup | Notes |
|---|---|---|---|---|
| Baseline (no compile) | 2.87 | 12.58 | 1.00x | Reference |
| Optimized Kernels | 2.70 | 13.52 | 1.06x | 6% faster |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x | 34% faster |
Important: --use-optimized-kernels and --compile are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.
Key metrics to capture:
The vectorized RMSNorm kernel achieves 2.67x average speedup over PyTorch baseline:
| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|---|---|---|---|
| [1×1024×2048] | 0.019 | 0.065 | 3.37x |
| [2×1024×2048] | 0.024 | 0.073 | 3.04x |
| [4×1024×2048] | 0.036 | 0.093 | 2.58x |
| [2×4096×3072] | 0.087 | 0.208 | 2.41x |
| [4×4096×3072] | 0.157 | 0.392 | 2.49x |
Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s
Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.
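Both figures can be sanity-checked from the tables. For a memory-bound kernel like RMSNorm, achieved bandwidth is roughly bytes moved divided by kernel time (assuming one read plus one write of the activation at 2 bytes per bf16 element), and Amdahl's law caps the end-to-end gain when only ~5% of the runtime is accelerated. A sketch using the [4×4096×3072] row; the traffic model is a simplification:

```python
# Achieved bandwidth for the [4 x 4096 x 3072] bf16 case (0.157 ms from the table)
elems = 4 * 4096 * 3072
bytes_moved = elems * 2 * 2            # 2 bytes/element (bf16), one read + one write
achieved_bw = bytes_moved / 0.157e-3   # bytes/s
frac_of_peak = achieved_bw / 3.35e12   # vs H100 peak memory bandwidth

# Amdahl's law: ~5% of end-to-end time is RMSNorm, sped up 2.67x
f, s = 0.05, 2.67
end_to_end = 1 / ((1 - f) + f / s)     # predicted end-to-end speedup from RMSNorm alone
```

The bandwidth term reproduces the stated 38%, and the Amdahl estimate (~3%) is the same order of magnitude as the measured 6% end-to-end gain.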
.claude/skills/cuda-kernels/
├── scripts/
│ ├── benchmark_example.py # End-to-end video generation benchmark
│ ├── benchmark_rmsnorm.py # Isolated RMSNorm micro-benchmark
│ ├── ltx_kernel_injection_example.py # Minimal diffusers integration (~150 lines)
│ ├── transformers_injection_example.py # Minimal transformers integration (~120 lines)
│ └── huggingface_kernels_example.py # HuggingFace Kernels Hub integration
├── references/
│ ├── diffusers-integration.md # Complete diffusers integration guide
│ ├── transformers-integration.md # Complete transformers integration guide
│ ├── huggingface-kernels-integration.md # HuggingFace Kernels Hub (get_kernel) guide
│ ├── troubleshooting.md # Common issues and solutions
│ ├── kernel-templates.md # CUDA kernel templates (includes vectorized)
│ ├── h100-optimization-guide.md # H100 (Hopper) optimization deep dive
│ ├── a100-optimization-guide.md # A100 (Ampere) optimization deep dive
│ └── t4-optimization-guide.md # T4 (Turing) optimization deep dive
└── SKILL.md # This file
examples/ltx_video/ # Complete working example
├── kernel_src/
│ └── rmsnorm.cu # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/ # PyTorch bindings
├── generate_video.py # Full benchmark script
├── benchmark_rmsnorm.py # Isolated kernel benchmark
└── setup.py # pip install -e .
| Spec | Value | Optimization Impact |
|---|---|---|
| SMs | 132 | Grid sizing: aim for multiples of 132 |
| Threads/SM | 2048 | Max 16 blocks of 128 threads per SM |
| Shared Memory | 192 KB/SM | Large tiles possible |
| L2 Cache | 50 MB | Reuse across blocks |
| Memory BW | 3.35 TB/s | Coalesced access critical |
| Warp Size | 32 | All reductions use warp shuffles |
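The "multiples of 132" guidance is about wave quantization: a grid whose block count is not a multiple of the SM count leaves a partially filled final wave of blocks. A sketch of the sizing rule (assumes one resident block per SM, ignoring higher occupancy):

```python
SM_COUNT = 132  # H100 SXM

def num_waves(num_blocks: int, sms: int = SM_COUNT) -> int:
    """Number of block 'waves', assuming one resident block per SM."""
    return -(-num_blocks // sms)  # ceiling division

def pad_grid(num_blocks: int, sms: int = SM_COUNT) -> int:
    """Round the grid up to a multiple of the SM count to fill the last wave."""
    return num_waves(num_blocks, sms) * sms
```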
| Spec | H100 | A100 | T4 |
|---|---|---|---|
| SMs | 132 | 108 | 40 |
| Memory BW | 3.35 TB/s | 2.0 TB/s | 320 GB/s |
| Shared Mem/SM | 192 KB | 164 KB | 64 KB |
| BF16 Support | Yes | Yes | No (FP16 only) |
| Compute Cap | sm_90 | sm_80 | sm_75 |
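For memory-bound kernels like RMSNorm, the bandwidth column dominates: kernel time is roughly bytes moved divided by achieved bandwidth. A rough cross-GPU estimator using the peak figures from the table (the 38% efficiency default comes from the measurement above; real efficiency varies per kernel):

```python
PEAK_BW = {"H100": 3.35e12, "A100": 2.0e12, "T4": 320e9}  # bytes/s

def est_time_ms(num_bytes: float, gpu: str, efficiency: float = 0.38) -> float:
    """Estimated memory-bound kernel time in ms at a fraction of peak bandwidth."""
    return num_bytes / (PEAK_BW[gpu] * efficiency) * 1e3
```

For the same traffic, this predicts a T4 taking about 10.5x longer than an H100 — the ratio of their peak bandwidths.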
BFloat16 vectorization using __nv_bfloat162:
// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);
#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
    __nv_bfloat162 v = vec_input[i];
    float v0 = __bfloat162float(v.x);
    float v1 = __bfloat162float(v.y);
    sum_sq += v0 * v0 + v1 * v1;
}
FP16 vectorization using __half2:
const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);
FP32 vectorization using float4:
const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    return val;
}
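The XOR-shuffle loop above implements a butterfly reduction: after log2(32) = 5 exchange steps, every lane holds the full warp sum. A Python model of the lane exchanges (illustrative only; on hardware all 32 lanes execute in lockstep):

```python
def warp_reduce_sum(vals):
    """Simulate the __shfl_xor_sync butterfly reduction over a 32-lane warp."""
    assert len(vals) == 32
    vals = list(vals)
    offset = 16
    while offset > 0:
        # Each lane adds the value currently held by lane (lane ^ offset).
        vals = [vals[i] + vals[i ^ offset] for i in range(32)]
        offset >>= 1
    return vals  # every lane now holds the total
```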
- BLOCK_SIZE_M = 128, BLOCK_SIZE_N = 64, BLOCK_SIZE_K = 64
- NUM_WARPS = 8

For element-wise ops (RoPE, GEGLU):
constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;
For reduction ops (LayerNorm, RMSNorm) with vectorization:
// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32; // Round to warp boundary
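The two launch-configuration rules above can be written as plain helpers (the `MAX_THREADS = 1024` cap is the usual CUDA per-block limit, assumed here):

```python
WARP_SIZE = 32
MAX_THREADS = 1024  # typical CUDA per-block thread limit

def elementwise_blocks(total_elements: int, block_size: int = 256) -> int:
    """Ceiling-divide the element count by the block size."""
    return (total_elements + block_size - 1) // block_size

def reduction_threads(hidden_size: int) -> int:
    """Threads per block for a vectorized (2 elements/thread) row reduction."""
    threads = min(hidden_size // 2, MAX_THREADS)
    threads = max(threads, WARP_SIZE)
    return (threads + WARP_SIZE - 1) // WARP_SIZE * WARP_SIZE  # round to warp boundary
```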
All kernels support three precision modes:
- __half (FP16) - Default for inference
- __nv_bfloat16 (BF16) - Preferred for training
- float (FP32) - Reference/debugging

nix run .#build-and-copy --max-jobs 2 --cores 8 -L
uv pip install -e .
[general]
name = "ltx_kernels"
backends = ["cuda"]
[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]
See huggingface-kernels-integration.md for the complete guide.
Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:
from kernels import get_kernel, has_kernel
import torch

# Check availability and load
if has_kernel("kernels-community/activation"):
    activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
Key functions:
- get_kernel(repo_id, version=None) - Download and load kernel from Hub
- has_kernel(repo_id) - Check if compatible build exists
- get_local_kernel(path) - Load from local directory (development)

Popular community kernels:
- kernels-community/activation - GELU, SiLU, etc.
- kernels-community/flash-attn - Flash Attention 2
- kernels-community/triton-layer-norm - LayerNorm, RMSNorm

See diffusers-integration.md for the complete guide.
See transformers-integration.md for the complete guide.
Key differences from diffusers:
- All RMSNorm modules carry a weight (no elementwise_affine=False)
- Match 'RMSNorm' in class_name to catch LlamaRMSNorm, MistralRMSNorm, etc.
- Read variance_epsilon (LLaMA) or eps (others) for the epsilon value
- No set_processor() pattern - use Flash Attention 2 instead

Minimal transformers pattern:
import torch
from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm

def patch_rmsnorm(model):
    for name, module in model.named_modules():
        if 'RMSNorm' in type(module).__name__:
            eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
            def make_forward(mod, epsilon):
                def forward(x):
                    return rmsnorm(x, mod.weight, eps=epsilon)
                return forward
            module.forward = make_forward(module, eps)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)
LTX-Video uses elementwise_affine=False for some RMSNorm modules:
# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)
# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)
Solution: Handle both cases:
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
    output = rmsnorm(x, module.weight, eps=eps)
else:
    weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    output = rmsnorm(x, weight, eps=eps)
# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):
# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':
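The distinction matters because diffusers defines its own RMSNorm class, which shares the name but is not a subclass of torch.nn.RMSNorm. A dependency-free sketch of why name matching catches both (the dummy classes below stand in for the real torch and diffusers classes):

```python
# Two distinct classes that both report __name__ == 'RMSNorm',
# mimicking torch.nn.RMSNorm and the separate diffusers RMSNorm class.
TorchRMSNorm = type('RMSNorm', (), {})
DiffusersRMSNorm = type('RMSNorm', (), {})
modules = [TorchRMSNorm(), DiffusersRMSNorm()]

# isinstance against one library's class misses the other's variant
by_isinstance = [m for m in modules if isinstance(m, TorchRMSNorm)]
# name matching catches both
by_name = [m for m in modules if type(m).__name__ == 'RMSNorm']
```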
LTX-Video uses activation_fn="gelu-approximate". Don't patch GEGLU for LTX-Video.
pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe) # BEFORE offloading
pipe.enable_model_cpu_offload() # Now safe
import torch
from diffusers import LTXPipeline
from ltx_kernels import rmsnorm

def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use custom kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None
            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)

# Usage
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
Per-kernel notes:
- RMSNorm: input shape [..., hidden_size]; handle elementwise_affine=False
- Vectorization: __nv_bfloat162 for BF16, __half2 for FP16, float4 for FP32
- Attention: [batch, seq, heads, head_dim] for text; [batch, t*h*w, heads, head_dim] for video
- RoPE: LTXVideoRotaryPosEmbed computes its own RoPE
- GEGLU: input [batch, seq, 2*hidden] -> output [batch, seq, hidden]
- AdaLN: norm(x) * weight * (1 + scale) + shift

# NVIDIA Nsight Systems
nsys profile -o profile python your_script.py
# NVIDIA Nsight Compute
ncu --set full -o metrics python your_script.py
See troubleshooting.md for all common issues and solutions.
Quick fixes:
- Kernels not applied? Match modules by type(module).__name__ instead of isinstance
- Inject custom kernels before enable_model_cpu_offload()

Custom CUDA kernels and torch.compile are mutually exclusive unless you register the kernel as a PyTorch custom op.
Error message:
torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
Workaround options:
- Use --use-optimized-kernels without --compile (6% speedup)
- Use --compile without custom kernels (34% speedup)
- Register the kernel as a PyTorch custom op (torch.library)

To register as custom op (for torch.compile compatibility):
import torch

# `ops` is the compiled extension module exposing rmsnorm_forward
@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
    ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)

@rmsnorm.register_fake
def _(out, input, weight, eps):
    pass  # No shape changes
Weekly installs: 64 · GitHub stars: 534 · First seen: Feb 14, 2026