Custom CUDA Kernels in CuPy

Beyond the NumPy API: Writing Your Own GPU Kernels

CuPy covers most NumPy/SciPy operations, but sometimes you need a custom operation that has no built-in equivalent. CuPy provides three levels of kernel customization:

  1. ElementwiseKernel: apply a per-element operation (easiest)
  2. ReductionKernel: reduce an array (sum, max, custom)
  3. RawKernel: full CUDA C code (most flexible)

All three compile CUDA kernels at runtime and cache them, so the compilation cost is paid only once.

Definition:

ElementwiseKernel

ElementwiseKernel defines a per-element operation in CUDA C syntax:

import cupy as cp

clipped_relu = cp.ElementwiseKernel(
    'float64 x, float64 cap',    # input params
    'float64 y',                  # output param
    'y = min(max(x, 0.0), cap)',  # CUDA C body
    'clipped_relu'                # kernel name
)

x = cp.random.randn(10_000_000)
y = clipped_relu(x, 6.0)  # runs on GPU

CuPy handles grid/block configuration, broadcasting, and type promotion automatically. The kernel is JIT-compiled on first call and cached for subsequent calls.

ElementwiseKernel supports broadcasting: if x has shape (1000, 1000) and cap is a scalar, CuPy broadcasts correctly.
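A quick sketch of that behavior, reusing the clipped_relu kernel defined above (the per-row caps array is purely illustrative):

x = cp.arange(12, dtype=cp.float64).reshape(3, 4)
caps = cp.asarray([[2.0], [5.0], [8.0]])  # shape (3, 1), broadcast over columns
print(clipped_relu(x, caps))              # output has shape (3, 4)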

Definition:

ReductionKernel

ReductionKernel implements parallel reductions (sum, product, max, custom aggregations):

l2_norm_squared = cp.ReductionKernel(
    'float64 x',           # input
    'float64 y',           # output
    'x * x',               # map expression
    'a + b',               # reduce expression
    'y = a',               # post-map (assign result)
    '0',                   # identity value
    'l2_norm_sq'           # kernel name
)

x = cp.random.randn(10_000_000)
result = l2_norm_squared(x)  # sum of squares on GPU
print(float(cp.sqrt(result)))  # L2 norm

The reduction is performed in parallel using a tree-based algorithm, achieving $O(n/p + \log p)$ time with $p$ threads.
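As a quick sanity check (a small verification sketch, not part of the original listing), the fused kernel should agree with the built-in expression, which, unlike the ReductionKernel, allocates an intermediate x * x array:

x = cp.random.randn(1000)
assert cp.allclose(l2_norm_squared(x), (x * x).sum())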

Definition:

RawKernel: Full CUDA C Kernels

RawKernel lets you write arbitrary CUDA C code with full control over thread indexing, shared memory, and synchronization:

import cupy as cp

kernel_code = r'''
extern "C" __global__
void vector_add(const double* a, const double* b,
                double* c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}
'''

vector_add = cp.RawKernel(kernel_code, 'vector_add')

n = 1_000_000
a = cp.random.randn(n)
b = cp.random.randn(n)
c = cp.empty(n)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add((blocks,), (threads_per_block,), (a, b, c, cp.int32(n)))  # pass n as int32 to match 'int n'

RawKernel requires you to manage grid and block dimensions manually. Use this when you need shared memory, warp-level primitives, or complex thread cooperation patterns.

Theorem: Warp Divergence Penalty

A GPU warp consists of 32 threads that execute in lockstep (SIMT). When threads in a warp take different branches of an if statement, the warp must execute both branches serially, masking inactive threads. The worst-case penalty is:

$T_{\mathrm{divergent}} = T_{\mathrm{branch1}} + T_{\mathrm{branch2}}$

versus $\max(T_{\mathrm{branch1}}, T_{\mathrm{branch2}})$ for uniform branching.

If half the threads in a warp go left and half go right, the warp runs both paths at half efficiency. Minimize divergence by ensuring threads within a warp follow the same control flow.
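A schematic sketch of the difference (f and g stand for any two hypothetical device functions; tid is the global thread index):

// Divergent: even and odd threads in the same warp take different
// branches, so the warp executes both f and g serially.
if (tid % 2 == 0) { y[tid] = f(x[tid]); } else { y[tid] = g(x[tid]); }

// Uniform: branching by warp (32 consecutive tids), so every thread
// in a given warp takes the same path and nothing is serialized.
if ((tid / 32) % 2 == 0) { y[tid] = f(x[tid]); } else { y[tid] = g(x[tid]); }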

Example: Custom Kernel: Fast Complex Magnitude

Write an ElementwiseKernel that computes the squared magnitude of a complex array without creating intermediate arrays.
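One possible solution sketch (CuPy exposes complex128 values inside kernels as a C++ complex type with real() and imag() accessors; the kernel name abs2 is illustrative):

import cupy as cp

abs2 = cp.ElementwiseKernel(
    'complex128 z',                                      # input param
    'float64 out',                                       # output param
    'out = z.real() * z.real() + z.imag() * z.imag()',   # fused, no temporaries
    'abs2'
)

z = cp.random.randn(1_000_000) + 1j * cp.random.randn(1_000_000)
assert cp.allclose(abs2(z), cp.abs(z) ** 2)

The single fused kernel reads each complex value once and writes one real value, avoiding the intermediate arrays that cp.abs(z) ** 2 would allocate.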

Example: RawKernel with Shared Memory: Block Reduction

Write a RawKernel that computes a block-level sum using shared memory for a parallel reduction pattern.
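One possible solution sketch, assuming a power-of-two block size: each block reduces its slice in shared memory and writes one partial sum, which is then summed on the device.

import cupy as cp

block_sum_code = r'''
extern "C" __global__
void block_sum(const double* x, double* partial, int n) {
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element into shared memory (0 if out of bounds).
    sdata[tid] = (i < n) ? x[i] : 0.0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) {
        partial[blockIdx.x] = sdata[0];
    }
}
'''

block_sum = cp.RawKernel(block_sum_code, 'block_sum')

n = 1_000_000
threads = 256
blocks = (n + threads - 1) // threads
x = cp.random.randn(n)
partial = cp.empty(blocks)

# shared_mem is given in bytes: one double (8 bytes) per thread.
block_sum((blocks,), (threads,), (x, partial, cp.int32(n)),
          shared_mem=threads * 8)
print(float(partial.sum()), float(x.sum()))  # should agree closely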

CUDA Thread Hierarchy

CUDA organizes threads into a hierarchy: threads form warps (32), warps form blocks, and blocks form a grid. Threads within a block can synchronize and share memory; blocks execute independently.
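A small sketch for inspecting these limits on the current device; the attribute key names below come from CuPy's Device.attributes dictionary and should be verified against your CuPy version:

import cupy as cp

attrs = cp.cuda.Device(0).attributes
print(attrs['WarpSize'])            # 32 on current NVIDIA GPUs
print(attrs['MaxThreadsPerBlock'])  # typically 1024
print(attrs['MaxGridDimX'])         # maximum number of blocks along x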

Quick Check

You need to compute y[i] = sin(x[i]) * exp(-x[i]**2) for a large array. Which CuPy kernel type is most appropriate?

RawKernel: you need full CUDA control for transcendental functions

ElementwiseKernel: it is a per-element operation with no inter-element dependencies

ReductionKernel: sin and exp are reduction operations

Common Mistake: Missing Bounds Check in RawKernel

Mistake:

Writing a RawKernel without checking if the thread index is within the array bounds:

// BUG: no bounds check
c[tid] = a[tid] + b[tid];  // out-of-bounds if n % blockSize != 0

Correction:

Always guard against out-of-bounds access:

if (tid < n) {
    c[tid] = a[tid] + b[tid];
}

GPU threads are launched in multiples of the block size, so unless n is an exact multiple, the last block contains threads past the end of the array. For example, n = 1000 with 256-thread blocks launches 4 blocks = 1024 threads, leaving 24 threads that the bounds check must mask.

Warp

A group of 32 GPU threads that execute the same instruction in lockstep (SIMT model). Divergent branches within a warp cause serialization.

Shared Memory

Fast on-chip memory (48-164 KB per streaming multiprocessor, depending on GPU architecture) shared by all threads in a block. Used for inter-thread communication and data reuse.

Custom CUDA Kernels: Complete Examples

ElementwiseKernel, ReductionKernel, and RawKernel examples with benchmarking against built-in CuPy operations: ch11/python/s05_custom_kernels.py