CuPy FFT and Signal Processing
cuFFT: The World's Fastest FFT
NVIDIA's cuFFT library is one of the most optimized FFT
implementations available. CuPy wraps it with an interface
identical to scipy.fft and numpy.fft, making it trivial to
GPU-accelerate any FFT-heavy workflow. For large transforms
(on the order of a million points or more), cuFFT typically delivers
10-50x speedups over FFTW on CPU.
In radar and wireless communications, the FFT is the computational workhorse: OFDM modulation/demodulation, range-Doppler processing, spectral analysis, and channel estimation all rely on it.
Definition: CuPy FFT Interface
CuPy provides FFT functions in two locations, both backed by cuFFT:
import cupy as cp
x = cp.random.randn(2**20)
# NumPy-compatible interface
X = cp.fft.fft(x)
x_back = cp.fft.ifft(X)
# SciPy-compatible interface (recommended)
import cupyx.scipy.fft as cufft
X = cufft.fft(x)
X2d = cufft.fft2(cp.random.randn(1024, 1024))
The cupyx.scipy.fft module supports the SciPy FFT backend
protocol, allowing it to be registered as a global backend:
import scipy.fft
scipy.fft.set_global_backend(cufft)
# Now scipy.fft.fft() uses GPU automatically
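As an alternative to a global backend, scipy.fft.set_backend also works as a context manager that scopes the backend change to a block. The sketch below uses SciPy's built-in 'scipy' backend so it runs without a GPU; on a GPU machine, passing cupyx.scipy.fft instead routes the same calls through cuFFT:

```python
import numpy as np
import scipy.fft

x = np.random.randn(4096)

# set_backend as a context manager scopes the backend change;
# pass cupyx.scipy.fft here instead of 'scipy' on a GPU machine.
with scipy.fft.set_backend('scipy'):
    X = scipy.fft.fft(x)

# Round trip recovers the signal to floating-point precision
x_back = scipy.fft.ifft(X).real
assert np.allclose(x, x_back)
```

The context-manager form is useful when only part of a pipeline should run on the GPU, leaving other scipy.fft callers untouched.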
Definition: Real FFT on GPU
For real-valued signals, cp.fft.rfft computes only the
non-redundant positive-frequency half, saving memory and compute:
x = cp.random.randn(2**20) # real signal
X = cp.fft.rfft(x) # N//2 + 1 complex values
x_back = cp.fft.irfft(X) # inverse real FFT
Memory savings: rfft returns N//2 + 1 complex values instead
of N, roughly halving the output size. cuFFT internally uses an
optimized real-to-complex kernel that is ~40% faster than the
full complex FFT.
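The size bookkeeping can be verified with NumPy, whose rfft is API-identical to CuPy's cp.fft.rfft:

```python
import numpy as np

N = 2**10
x = np.random.randn(N)            # real-valued signal

X_full = np.fft.fft(x)            # N complex outputs
X_half = np.fft.rfft(x)           # only N//2 + 1 non-redundant bins

# rfft matches the positive-frequency half of the full FFT
assert X_half.shape == (N // 2 + 1,)
assert np.allclose(X_half, X_full[:N // 2 + 1])

# irfft reconstructs the original real signal
assert np.allclose(np.fft.irfft(X_half, n=N), x)
```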
Theorem: The FFT on GPU Is Memory-Bound
The radix-2 FFT performs about 5 N log2(N) floating-point operations while moving at least 2 N b bytes (one read and one write per element), giving arithmetic intensity

    I = 5 N log2(N) / (2 N b) = 5 log2(N) / (2b)  FLOP/byte,

where b is the element width in bytes. For N = 2^20 and b = 16 (complex128), I ≈ 3.1 FLOP/byte, which is well below the GPU's compute-memory crossover point (on the order of 10-50 FLOP/byte on modern hardware). Therefore, GPU FFTs are memory-bandwidth bound.
The FFT does relatively little arithmetic per byte loaded.
GPU speedup comes from the GPU's much higher memory bandwidth
(1-3 TB/s vs 50-100 GB/s on CPU), not from its compute
throughput. This is why using float32 (halving memory traffic)
can nearly double FFT throughput.
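Plugging numbers into the roofline argument above makes the float32 effect concrete (the 2 TB/s bandwidth figure is illustrative, not a measurement of any particular GPU):

```python
# Roofline sketch: estimate FFT arithmetic intensity and the
# bandwidth-limited runtime for both complex element widths.
import math

N = 2**20
flops = 5 * N * math.log2(N)          # standard FFT FLOP estimate

for name, bytes_per_elem in [("complex128", 16), ("complex64", 8)]:
    traffic = 2 * N * bytes_per_elem  # one read + one write per element
    intensity = flops / traffic       # FLOP per byte
    # At an illustrative 2 TB/s, runtime is traffic / bandwidth,
    # independent of the GPU's compute throughput.
    t_us = traffic / 2e12 * 1e6
    print(f"{name}: {intensity:.2f} FLOP/byte, ~{t_us:.1f} us at 2 TB/s")
```

Halving the element width halves the traffic and the bandwidth-limited runtime, which is exactly the observed near-2x throughput gain.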
Example: Range-Doppler Map on GPU
Compute a range-Doppler map from a radar data cube (256 pulses, 2048 range bins) using 2-D FFT on the GPU.
Implementation
import cupy as cp
n_pulses, n_range = 256, 2048
# Simulate radar data cube (pulse x range)
data = (cp.random.randn(n_pulses, n_range)
+ 1j * cp.random.randn(n_pulses, n_range))
# Apply window functions
range_window = cp.hamming(n_range)
doppler_window = cp.hamming(n_pulses)[:, None]
data_windowed = data * range_window * doppler_window
# 2-D FFT: range FFT along axis=1, Doppler FFT along axis=0
rd_map = cp.fft.fftshift(cp.fft.fft2(data_windowed))
# Power in dB
rd_power_db = 20 * cp.log10(cp.abs(rd_map) + 1e-12)
Why GPU wins
The 2-D FFT on a 256 x 2048 matrix is a natural GPU operation: cuFFT processes all range-bin FFTs in parallel, then all Doppler-bin FFTs in parallel. For real-time radar, this pipeline must run every coherent processing interval (CPI), making GPU acceleration essential.
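A minimal CPU sketch of the same pipeline, using NumPy's API-identical fft2/fftshift and a single synthetic point target (the bin indices k_dop and k_rng are made-up illustration values); replacing np with cp runs the identical code through cuFFT:

```python
import numpy as np

n_pulses, n_range = 64, 128
k_dop, k_rng = 10, 20   # hypothetical Doppler and range frequency bins

p = np.arange(n_pulses)[:, None]
r = np.arange(n_range)[None, :]
# Single point target: a complex sinusoid in both dimensions
data = np.exp(2j * np.pi * (k_dop * p / n_pulses + k_rng * r / n_range))

rd_map = np.fft.fftshift(np.fft.fft2(data))
peak = np.unravel_index(np.argmax(np.abs(rd_map)), rd_map.shape)

# fftshift moves bin k to (k + N//2) mod N, centering zero frequency
assert peak == (k_dop + n_pulses // 2, k_rng + n_range // 2)
```

The target appears as a single bright cell in the shifted map, at the shifted Doppler and range bins, which is what the dB image in the full example visualizes.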
GPU vs CPU FFT Benchmark
Compare 1-D FFT throughput on CPU (SciPy/FFTW) vs GPU (CuPy/cuFFT) across transform sizes.
Range-Doppler Map Explorer
Interactive range-Doppler map generated on GPU with adjustable target parameters.
Quick Check
You switch your GPU FFT from complex128 to complex64.
What happens to throughput and why?
No change – FFT speed is determined by algorithm complexity O(N log N)
Roughly 2x faster – FFT is memory-bound; halving data size halves memory traffic
Exactly 4x faster – GPU has dedicated FP32 Tensor Cores
GPU FFTs are memory-bandwidth bound. Switching from 16-byte complex128 to 8-byte complex64 halves memory traffic, nearly doubling throughput. This is the roofline model in action.
Common Mistake: Not Reusing FFT Plans
Mistake:
Benchmarking or pipelining cp.fft.fft(x) without warming up the plan
cache, so the first call's cuFFT plan creation is counted against
every iteration's budget.
Correction:
Use cupyx.scipy.fft with explicit plan caching, or call the
FFT once outside the loop to warm up the plan cache:
import cupyx.scipy.fft as cufft

# Warm up the plan cache with an input of the same shape and dtype
_ = cufft.fft(x_batch[0])

# The plan is now cached; subsequent same-shape calls skip plan creation
for i in range(n_iterations):
    X = cufft.fft(x_batch[i])
CuPy caches plans internally (keyed by transform shape and dtype), but the first call for each new shape pays ~1 ms of plan-creation overhead.
Why This Matters: OFDM and the GPU FFT
In OFDM (used in 5G NR, Wi-Fi, LTE), every symbol requires an IFFT at the transmitter and an FFT at the receiver. 5G NR uses FFT sizes up to 4096, and the base station must process all antenna ports and subframes in real time. A GPU running cuFFT can handle hundreds of simultaneous FFTs: exactly the workload of a massive MIMO OFDM base station.
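The IFFT/FFT symmetry at the heart of OFDM can be sketched with NumPy (n_fft = 64 and the cyclic-prefix length here are toy values, not a 5G NR numerology); the same code runs unchanged on the GPU by swapping np for cp:

```python
import numpy as np

n_fft, n_cp = 64, 16           # toy FFT size and cyclic-prefix length
rng = np.random.default_rng(0)

# QPSK symbols, one per subcarrier
bits = rng.integers(0, 2, (2, n_fft))
symbols = ((2 * bits[0] - 1) + 1j * (2 * bits[1] - 1)) / np.sqrt(2)

# Transmitter: IFFT to time domain, then prepend cyclic prefix
time_signal = np.fft.ifft(symbols)
tx = np.concatenate([time_signal[-n_cp:], time_signal])

# Receiver: strip cyclic prefix, FFT back to subcarriers
rx_symbols = np.fft.fft(tx[n_cp:])

assert np.allclose(rx_symbols, symbols)
```

On an ideal channel the round trip is exact; in a real link, a per-subcarrier channel estimate would be divided out after the receive FFT.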
CuPy FFT and Signal Processing β Complete Examples
# Code from: ch11/python/s03_cupy_fft.py