Custom CUDA Kernels in CuPy

Beyond the NumPy API: Writing Your Own GPU Kernels

CuPy covers most NumPy/SciPy operations, but sometimes you need a custom operation that has no built-in equivalent. CuPy provides three levels of kernel customization:

  1. ElementwiseKernel: apply a per-element operation (easiest)
  2. ReductionKernel: reduce an array (sum, max, custom)
  3. RawKernel: full CUDA C code (most flexible)

All three compile CUDA kernels at runtime and cache them, so the compilation cost is paid only once.

Definition:

ElementwiseKernel

ElementwiseKernel defines a per-element operation in CUDA C syntax:

import cupy as cp

clipped_relu = cp.ElementwiseKernel(
    'float64 x, float64 cap',    # input params
    'float64 y',                  # output param
    'y = min(max(x, 0.0), cap)',  # CUDA C body
    'clipped_relu'                # kernel name
)

x = cp.random.randn(10_000_000)
y = clipped_relu(x, 6.0)  # runs on GPU

CuPy handles grid/block configuration, broadcasting, and type promotion automatically. The kernel is JIT-compiled on first call and cached for subsequent calls.

ElementwiseKernel supports broadcasting: if x has shape (1000, 1000) and cap is a scalar, CuPy broadcasts correctly.
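A quick sketch of that behavior, reusing the clipped_relu kernel defined above (the per-row caps array is purely illustrative):

x = cp.arange(12, dtype=cp.float64).reshape(3, 4)
caps = cp.asarray([[2.0], [5.0], [8.0]])  # shape (3, 1), broadcast over columns
print(clipped_relu(x, caps))              # output has shape (3, 4)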

Definition:

ReductionKernel

ReductionKernel implements parallel reductions (sum, product, max, custom aggregations):

l2_norm_squared = cp.ReductionKernel(
    'float64 x',           # input
    'float64 y',           # output
    'x * x',               # map expression
    'a + b',               # reduce expression
    'y = a',               # post-map (assign result)
    '0',                   # identity value
    'l2_norm_sq'           # kernel name
)

x = cp.random.randn(10_000_000)
result = l2_norm_squared(x)  # sum of squares on GPU
print(float(cp.sqrt(result)))  # L2 norm

The reduction is performed in parallel using a tree-based algorithm, achieving $O(n/p + \log p)$ time with $p$ threads.
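As a quick sanity check (a small verification sketch, not part of the original listing), the fused kernel should agree with the built-in expression, which, unlike the ReductionKernel, allocates an intermediate x * x array:

x = cp.random.randn(1000)
assert cp.allclose(l2_norm_squared(x), (x * x).sum())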

Definition:

RawKernel: Full CUDA C Kernels

RawKernel lets you write arbitrary CUDA C code with full control over thread indexing, shared memory, and synchronization:

import cupy as cp

kernel_code = r'''
extern "C" __global__
void vector_add(const double* a, const double* b,
                double* c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}
'''

vector_add = cp.RawKernel(kernel_code, 'vector_add')

n = 1_000_000
a = cp.random.randn(n)
b = cp.random.randn(n)
c = cp.empty(n)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add((blocks,), (threads_per_block,), (a, b, c, cp.int32(n)))  # pass n as int32 to match 'int n'

RawKernel requires you to manage grid and block dimensions manually. Use this when you need shared memory, warp-level primitives, or complex thread cooperation patterns.

Theorem: Warp Divergence Penalty

A GPU warp consists of 32 threads that execute in lockstep (SIMT). When threads in a warp take different branches of an if statement, the warp must execute both branches serially, masking inactive threads. The worst-case penalty is:

$T_{\mathrm{divergent}} = T_{\mathrm{branch1}} + T_{\mathrm{branch2}}$

versus $\max(T_{\mathrm{branch1}}, T_{\mathrm{branch2}})$ for uniform branching.

If half the threads in a warp go left and half go right, the warp runs both paths at half efficiency. Minimize divergence by ensuring threads within a warp follow the same control flow.
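A schematic sketch of the difference (f and g stand for any two hypothetical device functions; tid is the global thread index):

// Divergent: even and odd threads in the same warp take different
// branches, so the warp executes both f and g serially.
if (tid % 2 == 0) { y[tid] = f(x[tid]); } else { y[tid] = g(x[tid]); }

// Uniform: branching by warp (32 consecutive tids), so every thread
// in a given warp takes the same path and nothing is serialized.
if ((tid / 32) % 2 == 0) { y[tid] = f(x[tid]); } else { y[tid] = g(x[tid]); }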

Example: Custom Kernel: Fast Complex Magnitude

Write an ElementwiseKernel that computes the squared magnitude of a complex array without creating intermediate arrays.
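One possible solution sketch (CuPy exposes complex128 values inside kernels as a C++ complex type with real() and imag() accessors; the kernel name abs2 is illustrative):

import cupy as cp

abs2 = cp.ElementwiseKernel(
    'complex128 z',                                      # input param
    'float64 out',                                       # output param
    'out = z.real() * z.real() + z.imag() * z.imag()',   # fused, no temporaries
    'abs2'
)

z = cp.random.randn(1_000_000) + 1j * cp.random.randn(1_000_000)
assert cp.allclose(abs2(z), cp.abs(z) ** 2)

The single fused kernel reads each complex value once and writes one real value, avoiding the intermediate arrays that cp.abs(z) ** 2 would allocate.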

Example: RawKernel with Shared Memory: Block Reduction

Write a RawKernel that computes a block-level sum using shared memory for a parallel reduction pattern.
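One possible solution sketch, assuming a power-of-two block size: each block reduces its slice in shared memory and writes one partial sum, which is then summed on the device.

import cupy as cp

block_sum_code = r'''
extern "C" __global__
void block_sum(const double* x, double* partial, int n) {
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element into shared memory (0 if out of bounds).
    sdata[tid] = (i < n) ? x[i] : 0.0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) {
        partial[blockIdx.x] = sdata[0];
    }
}
'''

block_sum = cp.RawKernel(block_sum_code, 'block_sum')

n = 1_000_000
threads = 256
blocks = (n + threads - 1) // threads
x = cp.random.randn(n)
partial = cp.empty(blocks)

# shared_mem is given in bytes: one double (8 bytes) per thread.
block_sum((blocks,), (threads,), (x, partial, cp.int32(n)),
          shared_mem=threads * 8)
print(float(partial.sum()), float(x.sum()))  # should agree closely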

CUDA Thread Hierarchy

CUDA organizes threads into a hierarchy: threads form warps (32), warps form blocks, and blocks form a grid. Threads within a block can synchronize and share memory; blocks execute independently.
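A small sketch for inspecting these limits on the current device; the attribute key names below come from CuPy's Device.attributes dictionary and should be verified against your CuPy version:

import cupy as cp

attrs = cp.cuda.Device(0).attributes
print(attrs['WarpSize'])            # 32 on current NVIDIA GPUs
print(attrs['MaxThreadsPerBlock'])  # typically 1024
print(attrs['MaxGridDimX'])         # maximum number of blocks along x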

Quick Check

You need to compute y[i] = sin(x[i]) * exp(-x[i]**2) for a large array. Which CuPy kernel type is most appropriate?

RawKernel: you need full CUDA control for transcendental functions

ElementwiseKernel: it is a per-element operation with no inter-element dependencies

ReductionKernel: sin and exp are reduction operations

Common Mistake: Missing Bounds Check in RawKernel

Mistake:

Writing a RawKernel without checking if the thread index is within the array bounds:

// BUG: no bounds check
c[tid] = a[tid] + b[tid];  // out-of-bounds if n % blockSize != 0

Correction:

Always guard against out-of-bounds access:

if (tid < n) {
    c[tid] = a[tid] + b[tid];
}

GPU threads are launched in multiples of the block size, so unless n is an exact multiple, the last block contains threads past the end of the array. For example, n = 1000 with 256-thread blocks launches 4 blocks = 1024 threads, leaving 24 threads that the bounds check must mask.

Warp

A group of 32 GPU threads that execute the same instruction in lockstep (SIMT model). Divergent branches within a warp cause serialization.

Shared Memory

Fast on-chip memory (48-164 KB per streaming multiprocessor, depending on GPU architecture) shared by all threads in a block. Used for inter-thread communication and data reuse.

Custom CUDA Kernels: Complete Examples

ElementwiseKernel, ReductionKernel, and RawKernel examples with benchmarking against built-in CuPy operations: ch11/python/s05_custom_kernels.py