Custom CUDA Kernels in CuPy
Beyond the NumPy API: Writing Your Own GPU Kernels
CuPy covers most NumPy/SciPy operations, but sometimes you need a custom operation that has no built-in equivalent. CuPy provides three levels of kernel customization:
- ElementwiseKernel – apply a per-element operation (easiest)
- ReductionKernel – reduce an array (sum, max, custom)
- RawKernel – full CUDA C code (most flexible)
All three compile CUDA kernels at runtime and cache them, so the compilation cost is paid only once.
Definition: ElementwiseKernel
ElementwiseKernel defines a per-element operation in CUDA C syntax:
import cupy as cp

clipped_relu = cp.ElementwiseKernel(
    'float64 x, float64 cap',      # input params
    'float64 y',                   # output param
    'y = min(max(x, 0.0), cap)',   # CUDA C body
    'clipped_relu'                 # kernel name
)

x = cp.random.randn(10_000_000)
y = clipped_relu(x, 6.0)  # runs on GPU
CuPy handles grid/block configuration, broadcasting, and type promotion automatically. The kernel is JIT-compiled on first call and cached for subsequent calls.
ElementwiseKernel supports broadcasting: if x has shape (1000, 1000) and cap is a scalar, CuPy broadcasts correctly.
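Since CuPy follows NumPy's broadcasting rules, the behavior can be sketched on the CPU with NumPy (clipped_relu_ref below is a reference reimplementation for illustration, not the GPU kernel):

```python
import numpy as np

# CPU sketch of what clipped_relu computes; CuPy's ElementwiseKernel
# applies the same NumPy-style broadcasting rules on the GPU.
def clipped_relu_ref(x, cap):
    return np.minimum(np.maximum(x, 0.0), cap)

x = np.array([[-1.0,  2.0, 9.0],
              [ 4.0, -3.0, 7.0]])

# A scalar cap broadcasts over every element
print(clipped_relu_ref(x, 6.0))

# A shape-(3,) cap broadcasts across rows, giving a per-column cap
caps = np.array([1.0, 5.0, 8.0])
print(clipped_relu_ref(x, caps))
```

The same two calls work unchanged on cupy.ndarray inputs once the kernel is defined.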
Definition: ReductionKernel
ReductionKernel implements parallel reductions (sum, product, max, custom aggregations):
l2_norm_squared = cp.ReductionKernel(
    'float64 x',   # input
    'float64 y',   # output
    'x * x',       # map expression
    'a + b',       # reduce expression
    'y = a',       # post-map (assign result)
    '0',           # identity value
    'l2_norm_sq'   # kernel name
)
x = cp.random.randn(10_000_000)
result = l2_norm_squared(x) # sum of squares on GPU
print(float(cp.sqrt(result))) # L2 norm
The reduction is performed in parallel using a tree-based algorithm, achieving O(log n) time with n threads, rather than the O(n) steps of a sequential loop.
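The pairwise scheme can be sketched in plain Python (a CPU model of the parallel passes, not CuPy code): each pass combines pairs of values simultaneously, so n inputs need only ceil(log2(n)) passes.

```python
import math

# CPU model of a tree-based (pairwise) reduction: each pass combines
# element i with element i + half, halving the number of live values.
def tree_reduce(values, combine):
    values = list(values)
    passes = 0
    while len(values) > 1:
        half = (len(values) + 1) // 2
        values = [
            combine(values[i], values[i + half]) if i + half < len(values) else values[i]
            for i in range(half)
        ]
        passes += 1
    return values[0], passes

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
total, passes = tree_reduce(data, lambda a, b: a + b)
print(total, passes)  # 31.0 3  (log2(8) = 3 passes)
assert passes == math.ceil(math.log2(len(data)))
```

On a GPU each pass runs as one synchronized parallel step, which is where the O(log n) depth comes from.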
Definition: RawKernel – Full CUDA C Kernels
RawKernel lets you write arbitrary CUDA C code with full control over thread indexing, shared memory, and synchronization:
import cupy as cp

kernel_code = r'''
extern "C" __global__
void vector_add(const double* a, const double* b,
                double* c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}
'''

vector_add = cp.RawKernel(kernel_code, 'vector_add')

n = 1_000_000
a = cp.random.randn(n)
b = cp.random.randn(n)
c = cp.empty(n)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Pass n as int32 so its width matches the kernel's 'int n' parameter
vector_add((blocks,), (threads_per_block,), (a, b, c, cp.int32(n)))
RawKernel requires you to manage grid and block dimensions manually. Use this when you need shared memory, warp-level primitives, or complex thread cooperation patterns.
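The manual sizing can be captured in a small helper (hypothetical, not part of CuPy) that computes a 1-D launch configuration by ceiling division, the same arithmetic used above:

```python
# Hypothetical helper: pick a 1-D grid/block configuration for a
# RawKernel launch. CuPy does this automatically for Elementwise-
# and ReductionKernel, but RawKernel leaves it to the caller.
def launch_config_1d(n, threads_per_block=256):
    # Ceil division: enough blocks so blocks * threads_per_block >= n
    blocks = (n + threads_per_block - 1) // threads_per_block
    return (blocks,), (threads_per_block,)

grid, block = launch_config_1d(1_000_000)
print(grid, block)  # (3907,), (256,) -- 3907 * 256 = 1,000,192 threads >= n
```

Because the launched thread count is rounded up to a whole number of blocks, the kernel body still needs its `if (tid < n)` guard.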
Theorem: Warp Divergence Penalty
A GPU warp consists of 32 threads that execute in lockstep (SIMT).
When threads in a warp take different branches of an if statement,
the warp must execute both branches serially, masking inactive
threads. The worst-case execution time is T_A + T_B (the two branch bodies run back to back) versus T_A or T_B alone when all 32 threads branch uniformly.
If half the threads in a warp go left and half go right, the warp runs both paths at half efficiency. Minimize divergence by ensuring threads within a warp follow the same control flow.
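One common remedy is to replace the branch with arithmetic that every thread executes identically. A CPU check (in NumPy, using leaky ReLU as a stand-in example not taken from the source) that the branchless form matches the branchy one:

```python
import numpy as np

# Branchy form: models "x if x > 0 else alpha * x" (divergent on GPU
# when a warp's threads straddle zero).
def leaky_relu_branchy(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

# Branchless form: every thread runs the same instructions; the
# comparison becomes a 0.0/1.0 mask folded into the arithmetic.
def leaky_relu_branchless(x, alpha=0.1):
    mask = (x > 0).astype(x.dtype)
    return mask * x + (1.0 - mask) * (alpha * x)

x = np.linspace(-2.0, 2.0, 9)
assert np.allclose(leaky_relu_branchy(x), leaky_relu_branchless(x))
```

The same rewrite applies inside an ElementwiseKernel body, where the mask trick keeps all 32 lanes of a warp on one code path.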
Example: Custom Kernel: Fast Complex Magnitude
Write an ElementwiseKernel that computes the squared magnitude of a complex array without creating intermediate arrays.
Implementation
import cupy as cp

abs2_kernel = cp.ElementwiseKernel(
    'complex128 z',
    'float64 mag2',
    'mag2 = z.real() * z.real() + z.imag() * z.imag()',
    'abs_squared'
)

z = cp.random.randn(10_000_000) + 1j * cp.random.randn(10_000_000)

# Custom kernel – no intermediate arrays
mag2_custom = abs2_kernel(z)

# Standard CuPy – creates intermediate array
mag2_standard = cp.abs(z) ** 2

print(cp.allclose(mag2_custom, mag2_standard))  # True
Performance comparison
The custom kernel avoids allocating a temporary array for cp.abs(z) and avoids the square root (which cp.abs computes, only for the result to be squared again). For large arrays, this gives a measurable speedup (~1.5-2x) and reduces peak GPU memory usage.
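A rough accounting of the memory the fused kernel saves in this example (the sizes follow directly from the dtypes: 16 bytes per complex128 element, 8 per float64):

```python
# Back-of-envelope memory accounting for the abs-squared example:
# cp.abs(z) materializes a temporary float64 array before squaring,
# while the fused kernel writes the final result directly.
n = 10_000_000
bytes_input = n * 16    # complex128 input: 16 bytes/element
bytes_temp = n * 8      # float64 temporary from cp.abs(z)
bytes_output = n * 8    # float64 result

print(bytes_temp / 1e6, "MB of temporary avoided")  # 80.0 MB
```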
[Figure: CUDA Thread Hierarchy]
Quick Check
You need to compute y[i] = sin(x[i]) * exp(-x[i]**2) for a large array. Which CuPy kernel type is most appropriate?
- RawKernel – you need full CUDA control for transcendental functions
- ElementwiseKernel – it is a per-element operation with no inter-element dependencies
- ReductionKernel – sin and exp are reduction operations
This is a classic element-wise operation. ElementwiseKernel handles broadcasting, grid configuration, and type promotion automatically. RawKernel would work but adds unnecessary complexity.
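For reference, the quiz expression checked on the CPU with NumPy (which CuPy mirrors); the matching ElementwiseKernel body would be a CUDA expression along the lines of 'y = sin(x) * exp(-x * x)', a sketch rather than code from the source:

```python
import numpy as np

# CPU reference for the per-element expression from the quiz.
def f_ref(x):
    return np.sin(x) * np.exp(-x**2)

x = np.array([0.0, 0.5, 1.0])
print(f_ref(x))
```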
Common Mistake: Missing Bounds Check in RawKernel
Mistake:
Writing a RawKernel without checking if the thread index is within the array bounds:
// BUG: no bounds check
c[tid] = a[tid] + b[tid]; // out-of-bounds if n % blockSize != 0
Correction:
Always guard against out-of-bounds access:
if (tid < n) {
    c[tid] = a[tid] + b[tid];
}
GPU threads are launched in whole blocks, so unless n is an exact multiple of the block size, the last block contains extra threads past the end of the array.
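Concretely, a quick count of the surplus threads for a hypothetical n = 1000 with 256-thread blocks:

```python
# With n = 1000 and 256-thread blocks, 4 blocks launch 1024 threads;
# threads 1000-1023 index past the array and must be masked by
# the 'if (tid < n)' guard.
n = 1000
block = 256
blocks = (n + block - 1) // block
launched = blocks * block
print(blocks, launched, launched - n)  # 4 1024 24
```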
Warp
A group of 32 GPU threads that execute the same instruction in lockstep (SIMT model). Divergent branches within a warp cause serialization.
Custom CUDA Kernels – Complete Examples
# Code from: ch11/python/s05_custom_kernels.py