cuda-kernels by huggingface/kernels
npx skills add https://github.com/huggingface/kernels --skill cuda-kernels
This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.
For benchmarking kernel performance:
# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels
# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile
# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels && \
python generate_video.py --no-optimized-kernels --compile
For a minimal diffusers integration example (~150 lines):
python scripts/ltx_kernel_injection_example.py
For a minimal transformers integration example (~120 lines):
python scripts/transformers_injection_example.py
Load pre-compiled kernels from HuggingFace Hub (no local compilation):
from kernels import get_kernel
import torch

# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
For a complete HuggingFace Kernels example:
python scripts/huggingface_kernels_example.py
python benchmark_rmsnorm.py
| Library | Supported Models | Key Kernels |
|---|---|---|
| diffusers | LTX-Video, Stable Diffusion, FLUX, DiT | RMSNorm, GEGLU, RoPE, AdaLN |
| transformers | LLaMA, Mistral, Qwen, Falcon | RMSNorm, Attention |
| GPU | Compute Capability | Guide |
| --- | --- | --- |
| H100 | sm_90 | h100-optimization-guide.md |
| A100 | sm_80 | a100-optimization-guide.md |
| T4 | sm_75 | t4-optimization-guide.md |
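The capability-to-guide mapping above can be expressed as a small lookup; on a live system the `(major, minor)` pair would come from `torch.cuda.get_device_capability()`. This helper is an illustrative sketch, not part of the skill's scripts:

```python
# Map a CUDA compute capability to the matching optimization guide.
GUIDES = {
    (9, 0): ("H100", "h100-optimization-guide.md"),
    (8, 0): ("A100", "a100-optimization-guide.md"),
    (7, 5): ("T4", "t4-optimization-guide.md"),
}

def guide_for_capability(major: int, minor: int) -> str:
    """Return the guide filename for a compute capability, e.g. (9, 0) -> sm_90."""
    gpu, guide = GUIDES.get((major, minor), (None, None))
    if guide is None:
        raise ValueError(f"No guide for sm_{major}{minor}")
    return guide
```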
Use this skill when:
A complete working example is available at examples/ltx_video/. This demonstrates:
Use the benchmark script to measure kernel performance:
# Full benchmark with all options
python scripts/benchmark_example.py \
--use-optimized-kernels \
--compile \
--batch-size 1 \
--num-frames 161 \
--height 512 \
--width 768 \
--steps 50 \
--warmup-iterations 2
| Option | Default | Description |
|---|---|---|
| --use-optimized-kernels | auto | Use custom H100 CUDA kernels |
| --no-optimized-kernels | - | Use baseline implementation |
| --compile | false | Enable torch.compile on transformer |
| --batch-size | 1 | Number of videos per prompt |
| --num-frames | 161 | Number of frames to generate |
| --height | 512 | Video height in pixels |
| --width | 768 | Video width in pixels |
| --steps | 50 | Denoising steps |
| --warmup-iterations | 2 | Warmup runs before benchmark |
End-to-End Video Generation (49 frames, 30 steps, H100 80GB):
| Configuration | Time (s) | it/s | Speedup | Notes |
|---|---|---|---|---|
| Baseline (no compile) | 2.87 | 12.58 | 1.00x | Reference |
| Optimized Kernels | 2.70 | 13.52 | 1.06x | 6% faster |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x | 34% faster |
Important: --use-optimized-kernels and --compile are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.
Key metrics to capture:
The vectorized RMSNorm kernel achieves 2.67x average speedup over PyTorch baseline:
| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|---|---|---|---|
| [1×1024×2048] | 0.019 | 0.065 | 3.37x |
| [2×1024×2048] | 0.024 | 0.073 | 3.04x |
| [4×1024×2048] | 0.036 | 0.093 | 2.58x |
| [2×4096×3072] | 0.087 | 0.208 | 2.41x |
| [4×4096×3072] | 0.157 | 0.392 | 2.49x |
Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s
Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.
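Both figures can be sanity-checked from the tables. For a memory-bound kernel like RMSNorm, achieved bandwidth is roughly bytes moved divided by kernel time (assuming one read plus one write of the activation at 2 bytes per bf16 element), and Amdahl's law caps the end-to-end gain when only ~5% of the runtime is accelerated. A sketch using the [4×4096×3072] row; the traffic model is a simplification:

```python
# Achieved bandwidth for the [4 x 4096 x 3072] bf16 case (0.157 ms from the table)
elems = 4 * 4096 * 3072
bytes_moved = elems * 2 * 2            # 2 bytes/element (bf16), one read + one write
achieved_bw = bytes_moved / 0.157e-3   # bytes/s
frac_of_peak = achieved_bw / 3.35e12   # vs H100 peak memory bandwidth

# Amdahl's law: ~5% of end-to-end time is RMSNorm, sped up 2.67x
f, s = 0.05, 2.67
end_to_end = 1 / ((1 - f) + f / s)     # predicted end-to-end speedup from RMSNorm alone
```

The bandwidth term reproduces the stated 38%, and the Amdahl estimate (~3%) is the same order of magnitude as the measured 6% end-to-end gain.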
.claude/skills/cuda-kernels/
├── scripts/
│ ├── benchmark_example.py # End-to-end video generation benchmark
│ ├── benchmark_rmsnorm.py # Isolated RMSNorm micro-benchmark
│ ├── ltx_kernel_injection_example.py # Minimal diffusers integration (~150 lines)
│ ├── transformers_injection_example.py # Minimal transformers integration (~120 lines)
│ └── huggingface_kernels_example.py # HuggingFace Kernels Hub integration
├── references/
│ ├── diffusers-integration.md # Complete diffusers integration guide
│ ├── transformers-integration.md # Complete transformers integration guide
│ ├── huggingface-kernels-integration.md # HuggingFace Kernels Hub (get_kernel) guide
│ ├── troubleshooting.md # Common issues and solutions
│ ├── kernel-templates.md # CUDA kernel templates (includes vectorized)
│ ├── h100-optimization-guide.md # H100 (Hopper) optimization deep dive
│ ├── a100-optimization-guide.md # A100 (Ampere) optimization deep dive
│ └── t4-optimization-guide.md # T4 (Turing) optimization deep dive
└── SKILL.md # This file
examples/ltx_video/ # Complete working example
├── kernel_src/
│ └── rmsnorm.cu # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/ # PyTorch bindings
├── generate_video.py # Full benchmark script
├── benchmark_rmsnorm.py # Isolated kernel benchmark
└── setup.py # pip install -e .
| Spec | Value | Optimization Impact |
|---|---|---|
| SMs | 132 | Grid sizing: aim for multiples of 132 |
| Threads/SM | 2048 | Max 16 blocks of 128 threads per SM |
| Shared Memory | 192 KB/SM | Large tiles possible |
| L2 Cache | 50 MB | Reuse across blocks |
| Memory BW | 3.35 TB/s | Coalesced access critical |
| Warp Size | 32 | All reductions use warp shuffles |
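The "multiples of 132" guidance is about wave quantization: a grid whose block count is not a multiple of the SM count leaves a partially filled final wave of blocks. A sketch of the sizing rule (assumes one resident block per SM, ignoring higher occupancy):

```python
SM_COUNT = 132  # H100 SXM

def num_waves(num_blocks: int, sms: int = SM_COUNT) -> int:
    """Number of block 'waves', assuming one resident block per SM."""
    return -(-num_blocks // sms)  # ceiling division

def pad_grid(num_blocks: int, sms: int = SM_COUNT) -> int:
    """Round the grid up to a multiple of the SM count to fill the last wave."""
    return num_waves(num_blocks, sms) * sms
```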
| Spec | H100 | A100 | T4 |
|---|---|---|---|
| SMs | 132 | 108 | 40 |
| Memory BW | 3.35 TB/s | 2.0 TB/s | 320 GB/s |
| Shared Mem/SM | 192 KB | 164 KB | 64 KB |
| BF16 Support | Yes | Yes | No (FP16 only) |
| Compute Cap | sm_90 | sm_80 | sm_75 |
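For memory-bound kernels like RMSNorm, the bandwidth column dominates: kernel time is roughly bytes moved divided by achieved bandwidth. A rough cross-GPU estimator using the peak figures from the table (the 38% efficiency default comes from the measurement above; real efficiency varies per kernel):

```python
PEAK_BW = {"H100": 3.35e12, "A100": 2.0e12, "T4": 320e9}  # bytes/s

def est_time_ms(num_bytes: float, gpu: str, efficiency: float = 0.38) -> float:
    """Estimated memory-bound kernel time in ms at a fraction of peak bandwidth."""
    return num_bytes / (PEAK_BW[gpu] * efficiency) * 1e3
```

For the same traffic, this predicts a T4 taking about 10.5x longer than an H100 — the ratio of their peak bandwidths.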
BFloat16 vectorization using __nv_bfloat162:
// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);
#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
    __nv_bfloat162 v = vec_input[i];
    float v0 = __bfloat162float(v.x);
    float v1 = __bfloat162float(v.y);
    sum_sq += v0 * v0 + v1 * v1;
}
FP16 vectorization using __half2:
const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);
FP32 vectorization using float4:
const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    return val;
}
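The XOR-shuffle loop above implements a butterfly reduction: after log2(32) = 5 exchange steps, every lane holds the full warp sum. A Python model of the lane exchanges (illustrative only; on hardware all 32 lanes execute in lockstep):

```python
def warp_reduce_sum(vals):
    """Simulate the __shfl_xor_sync butterfly reduction over a 32-lane warp."""
    assert len(vals) == 32
    vals = list(vals)
    offset = 16
    while offset > 0:
        # Each lane adds the value currently held by lane (lane ^ offset).
        vals = [vals[i] + vals[i ^ offset] for i in range(32)]
        offset >>= 1
    return vals  # every lane now holds the total
```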
- BLOCK_SIZE_M = 128, BLOCK_SIZE_N = 64, BLOCK_SIZE_K = 64
- NUM_WARPS = 8

For element-wise ops (RoPE, GEGLU):
constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;
For reduction ops (LayerNorm, RMSNorm) with vectorization:
// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32; // Round to warp boundary
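The two launch-configuration rules above can be written as plain helpers (the `MAX_THREADS = 1024` cap is the usual CUDA per-block limit, assumed here):

```python
WARP_SIZE = 32
MAX_THREADS = 1024  # typical CUDA per-block thread limit

def elementwise_blocks(total_elements: int, block_size: int = 256) -> int:
    """Ceiling-divide the element count by the block size."""
    return (total_elements + block_size - 1) // block_size

def reduction_threads(hidden_size: int) -> int:
    """Threads per block for a vectorized (2 elements/thread) row reduction."""
    threads = min(hidden_size // 2, MAX_THREADS)
    threads = max(threads, WARP_SIZE)
    return (threads + WARP_SIZE - 1) // WARP_SIZE * WARP_SIZE  # round to warp boundary
```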
All kernels support three precision modes:
- __half (FP16) - Default for inference
- __nv_bfloat16 (BF16) - Preferred for training
- float (FP32) - Reference/debugging

nix run .#build-and-copy --max-jobs 2 --cores 8 -L
uv pip install -e .
[general]
name = "ltx_kernels"
backends = ["cuda"]
[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]
See huggingface-kernels-integration.md for the complete guide.
Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:
from kernels import get_kernel, has_kernel
import torch

# Check availability and load
if has_kernel("kernels-community/activation"):
    activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
Key functions:
- get_kernel(repo_id, version=None) - Download and load kernel from Hub
- has_kernel(repo_id) - Check if compatible build exists
- get_local_kernel(path) - Load from local directory (development)

Popular community kernels:
- kernels-community/activation - GELU, SiLU, etc.
- kernels-community/flash-attn - Flash Attention 2
- kernels-community/triton-layer-norm - LayerNorm, RMSNorm

See diffusers-integration.md for the complete guide.
See transformers-integration.md for the complete guide.
Key differences from diffusers:
- All RMSNorm modules carry a weight (no elementwise_affine=False)
- Match 'RMSNorm' in class_name to catch LlamaRMSNorm, MistralRMSNorm, etc.
- Read variance_epsilon (LLaMA) or eps (others) for the epsilon value
- No set_processor() pattern - use Flash Attention 2 instead

Minimal transformers pattern:
import torch
from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm

def patch_rmsnorm(model):
    for name, module in model.named_modules():
        if 'RMSNorm' in type(module).__name__:
            eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
            def make_forward(mod, epsilon):
                def forward(x):
                    return rmsnorm(x, mod.weight, eps=epsilon)
                return forward
            module.forward = make_forward(module, eps)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)
LTX-Video uses elementwise_affine=False for some RMSNorm modules:
# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)
# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)
Solution: Handle both cases:
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
    output = rmsnorm(x, module.weight, eps=eps)
else:
    weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    output = rmsnorm(x, weight, eps=eps)
# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):
# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':
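The distinction matters because diffusers defines its own RMSNorm class, which shares the name but is not a subclass of torch.nn.RMSNorm. A dependency-free sketch of why name matching catches both (the dummy classes below stand in for the real torch and diffusers classes):

```python
# Two distinct classes that both report __name__ == 'RMSNorm',
# mimicking torch.nn.RMSNorm and the separate diffusers RMSNorm class.
TorchRMSNorm = type('RMSNorm', (), {})
DiffusersRMSNorm = type('RMSNorm', (), {})
modules = [TorchRMSNorm(), DiffusersRMSNorm()]

# isinstance against one library's class misses the other's variant
by_isinstance = [m for m in modules if isinstance(m, TorchRMSNorm)]
# name matching catches both
by_name = [m for m in modules if type(m).__name__ == 'RMSNorm']
```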
LTX-Video uses activation_fn="gelu-approximate". Don't patch GEGLU for LTX-Video.
pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe) # BEFORE offloading
pipe.enable_model_cpu_offload() # Now safe
import torch
from diffusers import LTXPipeline
from ltx_kernels import rmsnorm

def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use custom kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None
            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)

# Usage
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
Per-kernel notes:
- RMSNorm: input shape [..., hidden_size]; handle elementwise_affine=False
- Vectorization: __nv_bfloat162 for BF16, __half2 for FP16, float4 for FP32
- Attention: [batch, seq, heads, head_dim] for text; [batch, t*h*w, heads, head_dim] for video
- RoPE: LTXVideoRotaryPosEmbed computes its own RoPE
- GEGLU: input [batch, seq, 2*hidden] -> output [batch, seq, hidden]
- AdaLN: norm(x) * weight * (1 + scale) + shift

# NVIDIA Nsight Systems
nsys profile -o profile python your_script.py
# NVIDIA Nsight Compute
ncu --set full -o metrics python your_script.py
See troubleshooting.md for all common issues and solutions.
Quick fixes:
- Kernels not applied? Match modules by type(module).__name__ instead of isinstance
- Inject custom kernels before enable_model_cpu_offload()

Custom CUDA kernels and torch.compile are mutually exclusive unless you register the kernel as a PyTorch custom op.
Error message:
torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
Workaround options:
- Use --use-optimized-kernels without --compile (6% speedup)
- Use --compile without custom kernels (34% speedup)
- Register the kernel as a PyTorch custom op (torch.library)

To register as custom op (for torch.compile compatibility):
import torch

# `ops` is the compiled extension module exposing rmsnorm_forward
@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
    ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)

@rmsnorm.register_fake
def _(out, input, weight, eps):
    pass  # No shape changes
Weekly installs: 64 · GitHub stars: 534 · First seen: Feb 14, 2026