metal-kernel by pytorch/pytorch
npx skills add https://github.com/pytorch/pytorch --skill metal-kernel
This skill guides you through implementing Metal kernels for PyTorch operators on Apple Silicon.
Important: The goal of this skill is to use native Metal capabilities via the c10/metal/ infrastructure, NOT MPSGraph. Native Metal kernels provide better control, performance, and maintainability.
There are two workflows covered by this skill:
- Implementing a new operator with a native Metal kernel
- Migrating an existing MPSGraph-based operator to native Metal
Both workflows involve:
- Updating dispatch in aten/src/ATen/native/native_functions.yaml
- Writing the Metal kernel in aten/src/ATen/native/mps/kernels/
- Implementing the host-side stub in aten/src/ATen/native/mps/operations/
Location: aten/src/ATen/native/native_functions.yaml
Find the operator entry and add MPS dispatch:
# Simple MPS-specific implementation
- func: my_op(Tensor self) -> Tensor
dispatch:
CPU: my_op_cpu
CUDA: my_op_cuda
MPS: my_op_mps
# Shared implementation across devices (preferred for structured kernels)
- func: my_op.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
dispatch:
CPU, CUDA, MPS: my_op_out
# Structured kernel (preferred for new ops)
- func: my_op.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
structured: True
structured_inherits: TensorIteratorBase
dispatch:
CPU, CUDA, MPS: my_op_out
When migrating an existing operator from MPSGraph to native Metal, consolidate the dispatch entry:
# BEFORE (MPSGraph-based, separate dispatch)
- func: atan2.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
structured: True
structured_inherits: TensorIteratorBase
dispatch:
CPU, CUDA: atan2_out
MPS: atan2_out_mps # Separate MPS implementation
# AFTER (native Metal, shared dispatch via stub)
- func: atan2.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
structured: True
structured_inherits: TensorIteratorBase
dispatch:
CPU, CUDA, MPS: atan2_out # MPS now uses the same stub mechanism
Key change: Replace MPS: my_op_out_mps with adding MPS to the shared dispatch line (e.g., CPU, CUDA, MPS: my_op_out).
Dispatch naming conventions:
- MPS: function_name_mps - MPS-specific implementation (old MPSGraph pattern)
- CPU, CUDA, MPS: function_name - Shared stub implementation (native Metal pattern)
Location: aten/src/ATen/native/mps/kernels/
// MyKernel.metal
#include <c10/metal/indexing.h>
#include <c10/metal/utils.h>
#include <metal_stdlib>
using namespace metal;
using namespace c10::metal;
// Define operation functor
struct my_op_functor {
template <typename T>
inline T operator()(const T x) {
return /* your operation */;
}
};
// Register for supported types
REGISTER_UNARY_OP(my_op, float, float);
REGISTER_UNARY_OP(my_op, half, half);
REGISTER_UNARY_OP(my_op, bfloat, bfloat);
struct my_binary_functor {
template <typename T>
inline T operator()(const T a, const T b) {
return /* your operation */;
}
};
REGISTER_BINARY_OP(my_binary, float, float);
REGISTER_BINARY_OP(my_binary, half, half);
For binary operations, use the convenience macros defined in BinaryKernel.metal:
// Floating-point types only (float, half, bfloat)
REGISTER_FLOAT_BINARY_OP(my_op);
// Integral types with float output (for math ops like atan2, copysign)
// Registers: long->float, int->float, short->float, uchar->float, char->float, bool->float
REGISTER_INT2FLOAT_BINARY_OP(my_op);
// Integral types with same-type output (for bitwise/logical ops)
// Registers: long, int, short, uchar, char, bool
REGISTER_INTEGER_BINARY_OP(my_op);
// Floating-point with opmath precision (for ops needing higher precision)
REGISTER_OPMATH_FLOAT_BINARY_OP(my_op);
Common patterns:
- REGISTER_FLOAT_BINARY_OP and REGISTER_INT2FLOAT_BINARY_OP
- REGISTER_FLOAT_BINARY_OP and REGISTER_INTEGER_BINARY_OP
Example for atan2 (supports both float and int inputs):
struct atan2_functor {
template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
inline T operator()(const T a, const T b) {
return static_cast<T>(precise::atan2(float(a), float(b)));
}
template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
inline float operator()(const T a, const T b) {
return precise::atan2(float(a), float(b));
}
};
REGISTER_FLOAT_BINARY_OP(atan2);
REGISTER_INT2FLOAT_BINARY_OP(atan2);
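The int-to-float promotion that REGISTER_INT2FLOAT_BINARY_OP encodes is the same convention other numeric libraries follow; as a quick reference (not Metal code), NumPy's arctan2 promotes integer inputs to a floating-point result in just this way:

```python
import math
import numpy as np

# Integer inputs are promoted to a floating-point result, matching the
# long->float, int->float, ... registrations described above.
a = np.array([1, 0], dtype=np.int32)
b = np.array([1, 1], dtype=np.int32)
out = np.arctan2(a, b)

print(out.dtype)  # float64
print(out)        # [pi/4, 0.0]
```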
struct my_alpha_functor {
template <typename T>
inline T operator()(const T a, const T b, const T alpha) {
return a + c10::metal::mul(alpha, b);
}
};
REGISTER_UNARY_ALPHA_OP(my_alpha, float, float, float);
REGISTER_UNARY_ALPHA_OP(my_alpha, half, half, half);
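The functor body above computes a + alpha * b elementwise, the same semantics as the alpha argument of torch.add. A plain-Python sketch of that arithmetic (the helper name here is illustrative, not part of the c10/metal API):

```python
# Reference semantics of the alpha functor: out = a + alpha * b,
# applied elementwise, mirroring torch.add(a, b, alpha=...).
def add_alpha(a, b, alpha):
    return [x + alpha * y for x, y in zip(a, b)]

print(add_alpha([1.0, 2.0], [10.0, 20.0], 0.5))  # [6.0, 12.0]
```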
struct special_functor {
// Floating point types
template <typename T, enable_if_t<is_scalar_floating_point_v<T>, bool> = true>
inline T operator()(const T x) {
return precise::exp(x); // Use precise math
}
// Integral types
template <typename T, enable_if_t<is_scalar_integral_v<T>, bool> = true>
inline float operator()(const T x) {
return precise::exp(float(x));
}
// Complex types (float2 for cfloat, half2 for chalf)
template <typename T, enable_if_t<is_complex_v<T>, bool> = true>
inline T operator()(const T x) {
// x.x = real, x.y = imaginary
return T(/* real */, /* imag */);
}
};
Note on complex types: Complex numbers in Metal are represented as vector types:
- c10::complex<float> maps to float2 (x = real, y = imaginary)
- c10::complex<half> maps to half2
Use is_complex_v<T> to specialize for complex types in functors.
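To see what a complex functor must compute on the (x, y) = (real, imag) pair, here is a plain-Python model of the complex exponential over such pairs; this only illustrates the arithmetic a Metal functor would perform on a float2, it is not Metal code:

```python
import cmath
import math

# Model a Metal float2 complex value as a (real, imag) tuple and compute
# exp(z) componentwise: exp(re) * (cos(im), sin(im)).
def complex_exp(z):
    re, im = z
    scale = math.exp(re)
    return (scale * math.cos(im), scale * math.sin(im))

z = (1.0, 0.5)
got = complex_exp(z)
ref = cmath.exp(complex(*z))
print(got)  # approximately (2.3855, 1.3033)
```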
utils.h:
- opmath_t<T> - Operation math type (half->float)
- accum_t<T> - Accumulation type for reductions
- max(), min() with NaN propagation
special_math.h:
- precise::exp(), precise::log(), precise::sqrt()
- precise::sin(), precise::cos(), precise::tan()
- erf(), erfc(), erfinv()
indexing.h:
- REGISTER_UNARY_OP(name, in_type, out_type)
- REGISTER_BINARY_OP(name, in_type, out_type)
- REGISTER_UNARY_ALPHA_OP(name, in_type, alpha_type, out_type)
Location: aten/src/ATen/native/mps/operations/
Choose or create an appropriate file based on operation type:
- UnaryKernel.mm - Single-input operations via stub dispatch
- BinaryKernel.mm - Two-input operations via stub dispatch
- UnaryOps.mm / BinaryOps.mm - Legacy MPSGraph implementations (for reference)
- ReduceOps.mm - Reductions (sum, mean, max, etc.)
For structured kernels that use the TensorIterator pattern:
// In BinaryKernel.mm (or appropriate file)
static void my_op_mps_kernel(TensorIteratorBase& iter) {
lib.exec_binary_kernel(iter, "my_op"); // "my_op" matches the functor name in .metal
}
// Register the MPS stub - this connects to the dispatch system
REGISTER_DISPATCH(my_op_stub, &my_op_mps_kernel)
For unary operations:
static void my_unary_mps_kernel(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "my_unary");
}
REGISTER_DISPATCH(my_unary_stub, &my_unary_mps_kernel)
When migrating from MPSGraph, also remove the old implementation:
Remove from BinaryOps.mm (or UnaryOps.mm):
- TORCH_IMPL_FUNC(my_op_out_mps) implementation
- #include <ATen/ops/my_op_native.h> header
Add to BinaryKernel.mm (or UnaryKernel.mm):
- REGISTER_DISPATCH call
After making changes, compile to verify everything builds correctly:
cd build && ninja torch_cpu
Basic operator support is already tested by test_output_match in test/test_mps.py. After implementing an operator, enable testing by removing expected failures:
Location: torch/testing/_internal/common_mps.py
Find and remove the operator from skip/xfail lists:
# Remove entries like:
MPS_XFAILLIST = {
"my_op": ..., # Remove this line
}
MPS_SKIPLIST = {
"my_op": ..., # Remove this line
}
Location: torch/testing/_internal/common_methods_invocations.py (or related files)
Remove MPS-specific decorators from the OpInfo:
OpInfo(
"my_op",
# Remove decorators like:
# decorators=[skipMPS, expectedFailureMPS("reason")],
...
)
# Run the specific operator test
python test/test_mps.py -k test_output_match_my_op
# Or run full MPS test suite
python test/test_mps.py
Debugging Metal kernels with torch.mps.compile_shader
Use torch.mps.compile_shader to JIT-compile and test individual Metal kernels in isolation. This is invaluable for debugging multi-kernel pipelines where you need to verify each stage independently.
import torch
source = '''
#include <metal_stdlib>
using namespace metal;
kernel void my_kernel(
const device float* input [[buffer(0)]],
device float* output [[buffer(1)]],
uint tid [[thread_position_in_grid]]) {
output[tid] = input[tid] * 2.0;
}
'''
lib = torch.mps.compile_shader(source)
inp = torch.tensor([1.0, 2.0, 3.0], device='mps')
out = torch.zeros(3, device='mps')
lib.my_kernel(inp, out, threads=[3, 1, 1], group_size=[3, 1, 1])
torch.mps.synchronize()
print(out) # tensor([2., 4., 6.], device='mps:0')
compile_shader uses dispatchThreads semantics (same as mtl_dispatch1DJob in PyTorch):
- threads=[N, 1, 1] — total number of threads (NOT threadgroups)
- group_size=[G, 1, 1] — threads per threadgroup
This differs from the dispatchThreadgroups API used by some host-side code. To match dispatchThreadgroups:MTLSizeMake(num_tgs, num_slices, 1) threadsPerThreadgroup:MTLSizeMake(TG_SIZE, 1, 1):
# Equivalent compile_shader call:
lib.kernel(args...,
threads=[num_tgs * TG_SIZE, num_slices, 1],
group_size=[TG_SIZE, 1, 1])
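The conversion between the two dispatch conventions is mechanical, so a small helper can make the mapping explicit (the helper name is illustrative, not a PyTorch API):

```python
# Convert dispatchThreadgroups-style arguments (number of threadgroups
# plus threadgroup size) into compile_shader's dispatchThreads-style
# threads / group_size lists.
def to_dispatch_threads(num_tgs, tg_size, num_slices=1):
    threads = [num_tgs * tg_size, num_slices, 1]
    group_size = [tg_size, 1, 1]
    return threads, group_size

threads, group_size = to_dispatch_threads(num_tgs=5, tg_size=256)
print(threads)     # [1280, 1, 1]
print(group_size)  # [256, 1, 1]
```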
Pass scalar constants as single-element tensors:
slice_size = torch.tensor([1024], dtype=torch.int32, device='mps')
lib.my_kernel(data, output, slice_size, threads=[1024, 1, 1], group_size=[256, 1, 1])
When a pipeline of kernels (e.g., histogram → prefix_sum → scatter) produces wrong results, test each kernel individually and verify its output against a Python/NumPy reference:
# 1. Run GPU kernel
lib.histogram(keys, hist, ..., threads=[N, 1, 1], group_size=[256, 1, 1])
torch.mps.synchronize()
# 2. Compute reference in Python
ref_hist = compute_histogram_cpu(keys.cpu().numpy(), ...)
# 3. Compare
assert np.array_equal(hist.cpu().numpy(), ref_hist), "Histogram mismatch!"
This isolates which kernel in the pipeline is broken, rather than debugging the entire pipeline at once.
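The reference in step 2 can be any CPU implementation you trust. For the histogram stage, one possible sketch of the hypothetical compute_histogram_cpu helper used in the snippet above, assuming keys are non-negative integers below num_bins:

```python
import numpy as np

# Minimal CPU reference for a histogram kernel: count how many keys
# fall into each of num_bins buckets. Assumes keys are non-negative
# integers strictly less than num_bins.
def compute_histogram_cpu(keys, num_bins):
    hist = np.zeros(num_bins, dtype=np.int64)
    for k in keys:
        hist[k] += 1
    return hist

keys = np.array([0, 2, 2, 1, 3, 2], dtype=np.int64)
print(compute_histogram_cpu(keys, 4))  # [1 1 3 1]
```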
Common gotchas:
- threads count — threads is total threads, not threadgroups. For 5 threadgroups of 256, use threads=[1280, 1, 1].
- compile_shader doesn't support [[threadgroup(N)]] parameters directly. If your kernel needs threadgroup memory, restructure to use threadgroup arrays declared inside the kernel body instead.
Final checklist:
- Updated dispatch in native_functions.yaml
- Implemented the Metal kernel in kernels/
- Implemented the host-side operator in operations/
- Removed expected failures from torch/testing/_internal/common_mps.py