CuPy FFT and Signal Processing

cuFFT: The World's Fastest FFT

NVIDIA's cuFFT library is one of the most heavily optimized FFT implementations available. CuPy wraps it behind interfaces that mirror numpy.fft and scipy.fft, so GPU-accelerating an FFT-heavy workflow is often little more than an import change. For large transforms ($N > 2^{16}$), cuFFT can deliver 10-50x speedups over FFTW on CPU, depending on hardware and precision.

In radar and wireless communications, the FFT is the computational workhorse: OFDM modulation/demodulation, range-Doppler processing, spectral analysis, and channel estimation all rely on it.

Definition: CuPy FFT Interface

CuPy provides FFT functions in two locations, both backed by cuFFT:

import cupy as cp

x = cp.random.randn(2**20)

# NumPy-compatible interface
X = cp.fft.fft(x)
x_back = cp.fft.ifft(X)

# SciPy-compatible interface (recommended)
import cupyx.scipy.fft as cufft
X = cufft.fft(x)
X2d = cufft.fft2(cp.random.randn(1024, 1024))

The cupyx.scipy.fft module supports the SciPy FFT backend protocol, allowing it to be registered as a global backend:

import scipy.fft
scipy.fft.set_global_backend(cufft)
# scipy.fft.fft() now dispatches to cuFFT when it is called with CuPy arrays
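A scoped alternative to the global registration is sketched below, assuming SciPy, CuPy, and a CUDA device are available. The helper name fft_on_gpu_via_scipy and the HAVE_GPU flag are illustrative, not part of either library. The input is moved to the GPU explicitly, because the backend acts on CuPy arrays.

```python
import numpy as np

try:
    import scipy.fft
    import cupy as cp
    import cupyx.scipy.fft as cufft
    cp.zeros(1)  # force CUDA initialization so a missing GPU is caught here
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False


def fft_on_gpu_via_scipy(x_host):
    """Hypothetical helper: run scipy.fft.fft on the GPU for one call only."""
    x_gpu = cp.asarray(x_host)               # host -> device copy
    with scipy.fft.set_backend(cufft):       # backend active inside this block
        X_gpu = scipy.fft.fft(x_gpu)         # dispatched to cupyx.scipy.fft
    return cp.asnumpy(X_gpu)                 # device -> host copy


if HAVE_GPU:
    X_host = fft_on_gpu_via_scipy(np.random.randn(2**16))
```

The context manager keeps the backend change local, which can be easier to reason about than a process-wide setting when CPU and GPU code paths coexist.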

Definition: Real FFT on GPU

For real-valued signals, cp.fft.rfft computes only the non-redundant positive-frequency half, saving memory and compute:

x = cp.random.randn(2**20)     # real signal
X = cp.fft.rfft(x)             # N//2 + 1 complex values
x_back = cp.fft.irfft(X)       # inverse real FFT

Memory savings: rfft returns $N/2 + 1$ complex values instead of $N$, roughly halving the output size. cuFFT internally uses an optimized real-to-complex kernel that is ~40% faster than the full complex FFT.
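The output size and round trip can be checked with a short sketch. It assumes an even transform length and falls back to NumPy when no GPU is present (the xp alias is a convention chosen here, not something CuPy requires).

```python
# Sketch: rfft output size and round trip; falls back to NumPy without a GPU.
try:
    import cupy as xp
    xp.zeros(1)  # force CUDA initialization so a missing GPU is caught here
except Exception:
    import numpy as xp

n = 2**12
x = xp.random.randn(n)            # real-valued test signal

X_full = xp.fft.fft(x)            # n complex values
X_half = xp.fft.rfft(x)           # n // 2 + 1 complex values

# The half spectrum equals the first n // 2 + 1 bins of the full spectrum
half_matches_full = bool(xp.allclose(X_half, X_full[: n // 2 + 1]))

# Passing n explicitly makes the inverse unambiguous (needed for odd lengths)
x_back = xp.fft.irfft(X_half, n=n)
round_trip_ok = bool(xp.allclose(x_back, x))
```

The remaining bins of the full spectrum are complex conjugates of the ones kept, which is why they can be dropped for real input.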

Theorem: FFT on GPU - Memory-Bound Reality

The FFT has arithmetic intensity:

$$I_{\mathrm{FFT}} = \frac{5N\log_2 N}{2 \cdot N \cdot w} = \frac{5\log_2 N}{2w}$$

where $w$ is the element width in bytes. For $N = 2^{20}$ and $w = 16$ (complex128), $I_{\mathrm{FFT}} \approx 3.1$ FLOP/byte, which is below the GPU's compute-memory crossover point ($I^* \approx 10$ to $50$ FLOP/byte). Therefore, GPU FFTs are memory-bandwidth bound.

The FFT does relatively little arithmetic per byte loaded. GPU speedup comes from the GPU's much higher memory bandwidth (1-3 TB/s vs 50-100 GB/s on CPU), not from its compute throughput. This is why using float32 (halving memory traffic) can nearly double FFT throughput.
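The arithmetic behind this estimate can be reproduced directly. The sketch below evaluates the same model, $5N\log_2 N$ FLOPs over $2Nw$ bytes (one read and one write per element), for complex128 and complex64. The function name is illustrative.

```python
import math


def fft_arithmetic_intensity(n, bytes_per_element):
    """FLOP per byte for a length-n FFT under the 5 n log2(n) / (2 n w) model."""
    flops = 5 * n * math.log2(n)
    bytes_moved = 2 * n * bytes_per_element   # one read and one write per element
    return flops / bytes_moved


i_c128 = fft_arithmetic_intensity(2**20, 16)  # complex128: 5 * 20 / 32 = 3.125
i_c64 = fft_arithmetic_intensity(2**20, 8)    # complex64:  5 * 20 / 16 = 6.25
```

Both values sit below a crossover of roughly 10 FLOP/byte, so under this model the transform remains memory-bound in either precision, and halving the element width halves the bytes moved.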

Example: Range-Doppler Map on GPU

Compute a range-Doppler map from a $256 \times 2048$ radar data cube (256 pulses, 2048 range bins) using a 2-D FFT on the GPU.
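A minimal sketch of this computation follows. The data cube is synthetic: a single point target is modelled as a 2-D complex sinusoid plus noise, and the target bins, noise level, and lack of windowing are assumptions made for illustration. The code falls back to NumPy when no GPU is present.

```python
# Sketch: range-Doppler map of a synthetic single-target data cube.
try:
    import cupy as xp
    xp.zeros(1)  # force CUDA initialization so a missing GPU is caught here
except Exception:
    import numpy as xp

n_pulses, n_range = 256, 2048
target_range_bin = 300        # hypothetical target position
target_doppler_bin = -40      # hypothetical (negative) Doppler bin

xp.random.seed(0)
pulse_idx = xp.arange(n_pulses)[:, None]     # slow time (pulse index)
sample_idx = xp.arange(n_range)[None, :]     # fast time (sample index)
phase = 2 * xp.pi * (target_range_bin * sample_idx / n_range
                     + target_doppler_bin * pulse_idx / n_pulses)
noise = 0.1 * (xp.random.randn(n_pulses, n_range)
               + 1j * xp.random.randn(n_pulses, n_range))
cube = (xp.exp(1j * phase) + noise).astype(xp.complex64)

rd = xp.fft.fft2(cube)                # FFT over slow time and fast time
rd = xp.fft.fftshift(rd, axes=0)      # centre zero Doppler
rd_db = 20 * xp.log10(xp.abs(rd) + 1e-12)

# Locate the strongest cell and convert its row back to a signed Doppler bin
peak_flat = int(xp.argmax(rd_db))
peak_row, peak_range_bin = divmod(peak_flat, n_range)
peak_doppler_bin = peak_row - n_pulses // 2
```

A practical pipeline would usually apply a window along each axis before the transform to control sidelobes. That step is left out here to keep the sketch short.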

GPU vs CPU FFT Benchmark

Compare 1-D FFT throughput on CPU (SciPy, or FFTW via a wrapper such as pyFFTW) vs GPU (CuPy/cuFFT) across transform sizes.
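One way such a comparison could be timed is sketched below. It uses numpy.fft as the CPU baseline rather than SciPy or FFTW, which is a substitution made to keep the example dependency-free, and the helper time_fft is hypothetical. GPU kernels launch asynchronously, so the timer synchronizes before reading the clock.

```python
import time

import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # force CUDA initialization so a missing GPU is caught here
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False


def time_fft(fft_fn, x, sync=lambda: None, repeats=10):
    """Hypothetical helper: mean seconds per call after one warm-up call."""
    fft_fn(x)            # warm-up (also builds the cuFFT plan on GPU)
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fft_fn(x)
    sync()               # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / repeats


results = {}
for log2_n in (10, 14, 18):
    n = 2**log2_n
    x_cpu = (np.random.randn(n) + 1j * np.random.randn(n)).astype(np.complex64)
    row = {"cpu_s": time_fft(np.fft.fft, x_cpu)}
    if HAVE_GPU:
        x_gpu = cp.asarray(x_cpu)
        row["gpu_s"] = time_fft(cp.fft.fft, x_gpu,
                                sync=cp.cuda.Stream.null.synchronize)
    results[n] = row
```

The measured ratio depends strongly on the hardware, the transform size, and whether host-to-device transfers are counted, so any numbers from this sketch are indicative only.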

Range-Doppler Map Explorer

Interactive range-Doppler map generated on GPU with adjustable target parameters.

Quick Check

You switch your GPU FFT from complex128 to complex64. What happens to throughput and why?

No change: FFT speed is determined by algorithm complexity O(N log N)

Roughly 2x faster: FFT is memory-bound, so halving the data size halves memory traffic

Exactly 4x faster: GPU has dedicated FP32 Tensor Cores

Common Mistake: Not Reusing FFT Plans

Mistake:

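Calling cp.fft.fft(x) in a loop on same-size inputs without making sure the cuFFT plan is reused, so that plan creation overhead is paid more often than necessary (for example on the first, timed iteration, or when the plan cache is disabled or too small for the number of distinct sizes in use).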

Correction:

Pass an explicit, reusable plan to the cupyx.scipy.fft functions, or call the FFT once outside the loop to warm up the plan cache:

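import cupy as cp
import cupyx.scipy.fft as cufft

n_iterations = 100                                # example value
x_batch = cp.random.randn(n_iterations, 2**16)    # same-size inputs (example data)

# Warm up the plan cache: the first call for this size and dtype builds the plan
_ = cufft.fft(x_batch[0])

# The plan is now cached; subsequent same-size calls reuse it
for i in range(n_iterations):
    X = cufft.fft(x_batch[i])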

CuPy caches plans internally, but the first call for each new transform size and dtype incurs roughly 1 ms of plan-creation overhead.
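Explicit plan reuse could look like the sketch below, which builds one plan with get_fft_plan and passes it to every call. The helper name fft_rows_with_plan and the HAVE_GPU flag are illustrative, and the code only runs its demo when CuPy and a CUDA device are available.

```python
try:
    import cupy as cp
    import cupyx.scipy.fft as cufft
    cp.zeros(1)  # force CUDA initialization so a missing GPU is caught here
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False


def fft_rows_with_plan(x_batch):
    """Hypothetical helper: FFT each row of a 2-D array with one shared plan."""
    # The plan is tied to shape and dtype, so fix both up front
    x_batch = cp.ascontiguousarray(x_batch, dtype=cp.complex128)
    plan = cufft.get_fft_plan(x_batch[0], value_type="C2C")  # built once
    out = cp.empty_like(x_batch)
    for i in range(x_batch.shape[0]):
        # Arguments must match those used when the plan was generated
        out[i] = cufft.fft(x_batch[i], plan=plan)
    return out


if HAVE_GPU:
    spectra = fft_rows_with_plan(cp.random.randn(8, 4096))
```

When all rows are already stacked in one array, a single batched call such as cufft.fft(x_batch, axis=-1) is usually simpler than looping and lets cuFFT process the rows together.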

Why This Matters: OFDM and the GPU FFT

In OFDM (used in 5G NR, Wi-Fi, LTE), every symbol requires an IFFT at the transmitter and an FFT at the receiver. 5G NR uses FFT sizes up to 4096, and the base station must process all antenna ports and subframes in real time. A GPU running cuFFT can handle hundreds of simultaneous FFTs, which is exactly the workload of a massive MIMO OFDM base station.
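The batched IFFT/FFT pattern can be sketched as an ideal-channel round trip. Several values are illustrative assumptions rather than a faithful 5G NR configuration: the port count, the symbol count, the cyclic-prefix length, and the fact that all 4096 subcarriers carry QPSK. The code falls back to NumPy when no GPU is present.

```python
# Sketch: batched OFDM modulation and demodulation over an ideal channel.
try:
    import cupy as xp
    xp.zeros(1)  # force CUDA initialization so a missing GPU is caught here
except Exception:
    import numpy as xp

n_fft, cp_len = 4096, 288        # FFT size and cyclic-prefix length (assumed)
n_ports, n_symbols = 8, 14       # hypothetical antenna ports and OFDM symbols

xp.random.seed(0)
bits = xp.random.randint(0, 2, size=(n_ports, n_symbols, n_fft, 2))
qpsk = ((1 - 2 * bits[..., 0]) + 1j * (1 - 2 * bits[..., 1])) / 2**0.5

# Transmitter: one IFFT per (port, symbol), all issued as a single batched call
tx_time = xp.fft.ifft(qpsk, axis=-1)
tx_with_cp = xp.concatenate([tx_time[..., -cp_len:], tx_time], axis=-1)

# Receiver: drop the cyclic prefix, then one batched FFT
rx_time = tx_with_cp[..., cp_len:]
rx_freq = xp.fft.fft(rx_time, axis=-1)

max_error = float(xp.max(xp.abs(rx_freq - qpsk)))
```

Here 8 x 14 = 112 transforms of length 4096 are computed per call. Stacking them along leading axes is what lets the FFT library treat them as one batch.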

CuPy FFT and Signal Processing: Complete Examples

FFT, real FFT, 2-D FFT, range-Doppler processing, and benchmarking examples on GPU.
# Code from: ch11/python/s03_cupy_fft.py