CuPy FFT and Signal Processing
cuFFT: The World's Fastest FFT
NVIDIA's cuFFT library is one of the most optimized FFT
implementations available. CuPy wraps it with an interface
identical to scipy.fft and numpy.fft, making it trivial to
GPU-accelerate any FFT-heavy workflow. For large transforms
(on the order of a million points or more), cuFFT typically delivers
10-50x speedups over FFTW on CPU.
In radar and wireless communications, the FFT is the computational workhorse: OFDM modulation/demodulation, range-Doppler processing, spectral analysis, and channel estimation all rely on it.
Definition: CuPy FFT Interface
CuPy provides FFT functions in two locations, both backed by cuFFT:
import cupy as cp
x = cp.random.randn(2**20)
# NumPy-compatible interface
X = cp.fft.fft(x)
x_back = cp.fft.ifft(X)
# SciPy-compatible interface (recommended)
import cupyx.scipy.fft as cufft
X = cufft.fft(x)
X2d = cufft.fft2(cp.random.randn(1024, 1024))
The cupyx.scipy.fft module supports the SciPy FFT backend
protocol, allowing it to be registered as a global backend:
import scipy.fft
scipy.fft.set_global_backend(cufft)
# Now scipy.fft.fft() uses GPU automatically
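As an alternative to a global backend, scipy.fft.set_backend also works as a context manager that scopes the backend change to a block. The sketch below uses SciPy's built-in 'scipy' backend so it runs without a GPU; on a GPU machine, passing cupyx.scipy.fft instead routes the same calls through cuFFT:

```python
import numpy as np
import scipy.fft

x = np.random.randn(4096)

# set_backend as a context manager scopes the backend change;
# pass cupyx.scipy.fft here instead of 'scipy' on a GPU machine.
with scipy.fft.set_backend('scipy'):
    X = scipy.fft.fft(x)

# Round trip recovers the signal to floating-point precision
x_back = scipy.fft.ifft(X).real
assert np.allclose(x, x_back)
```

The context-manager form is useful when only part of a pipeline should run on the GPU, leaving other scipy.fft callers untouched.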
Definition: Real FFT on GPU
For real-valued signals, cp.fft.rfft computes only the
non-redundant positive-frequency half, saving memory and compute:
x = cp.random.randn(2**20) # real signal
X = cp.fft.rfft(x) # N//2 + 1 complex values
x_back = cp.fft.irfft(X) # inverse real FFT
Memory savings: rfft returns N//2 + 1 complex values instead
of N, roughly halving the output size. cuFFT internally uses an
optimized real-to-complex kernel that is ~40% faster than the
full complex FFT.
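The size bookkeeping can be verified with NumPy, whose rfft is API-identical to CuPy's cp.fft.rfft:

```python
import numpy as np

N = 2**10
x = np.random.randn(N)            # real-valued signal

X_full = np.fft.fft(x)            # N complex outputs
X_half = np.fft.rfft(x)           # only N//2 + 1 non-redundant bins

# rfft matches the positive-frequency half of the full FFT
assert X_half.shape == (N // 2 + 1,)
assert np.allclose(X_half, X_full[:N // 2 + 1])

# irfft reconstructs the original real signal
assert np.allclose(np.fft.irfft(X_half, n=N), x)
```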
Theorem: The FFT on GPU Is Memory-Bound
The radix-2 FFT performs about 5 N log2(N) floating-point operations while moving at least 2 N b bytes (one read and one write per element), giving arithmetic intensity

    I = 5 N log2(N) / (2 N b) = 5 log2(N) / (2b)  FLOP/byte,

where b is the element width in bytes. For N = 2^20 and b = 16 (complex128), I ≈ 3.1 FLOP/byte, which is well below the GPU's compute-memory crossover point (on the order of 10-50 FLOP/byte on modern hardware). Therefore, GPU FFTs are memory-bandwidth bound.
The FFT does relatively little arithmetic per byte loaded.
GPU speedup comes from the GPU's much higher memory bandwidth
(1-3 TB/s vs 50-100 GB/s on CPU), not from its compute
throughput. This is why using float32 (halving memory traffic)
can nearly double FFT throughput.
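Plugging numbers into the roofline argument above makes the float32 effect concrete (the 2 TB/s bandwidth figure is illustrative, not a measurement of any particular GPU):

```python
# Roofline sketch: estimate FFT arithmetic intensity and the
# bandwidth-limited runtime for both complex element widths.
import math

N = 2**20
flops = 5 * N * math.log2(N)          # standard FFT FLOP estimate

for name, bytes_per_elem in [("complex128", 16), ("complex64", 8)]:
    traffic = 2 * N * bytes_per_elem  # one read + one write per element
    intensity = flops / traffic       # FLOP per byte
    # At an illustrative 2 TB/s, runtime is traffic / bandwidth,
    # independent of the GPU's compute throughput.
    t_us = traffic / 2e12 * 1e6
    print(f"{name}: {intensity:.2f} FLOP/byte, ~{t_us:.1f} us at 2 TB/s")
```

Halving the element width halves the traffic and the bandwidth-limited runtime, which is exactly the observed near-2x throughput gain.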
Example: Range-Doppler Map on GPU
Compute a range-Doppler map from a radar data cube (256 pulses, 2048 range bins) using 2-D FFT on the GPU.
Implementation
import cupy as cp
n_pulses, n_range = 256, 2048
# Simulate radar data cube (pulse x range)
data = (cp.random.randn(n_pulses, n_range)
+ 1j * cp.random.randn(n_pulses, n_range))
# Apply window functions
range_window = cp.hamming(n_range)
doppler_window = cp.hamming(n_pulses)[:, None]
data_windowed = data * range_window * doppler_window
# 2-D FFT: range FFT along axis=1, Doppler FFT along axis=0
rd_map = cp.fft.fftshift(cp.fft.fft2(data_windowed))
# Power in dB
rd_power_db = 20 * cp.log10(cp.abs(rd_map) + 1e-12)
Why GPU wins
The 2-D FFT on a 256 x 2048 matrix is a natural GPU operation: cuFFT processes all range-bin FFTs in parallel, then all Doppler-bin FFTs in parallel. For real-time radar, this pipeline must run every coherent processing interval (CPI), making GPU acceleration essential.
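A minimal CPU sketch of the same pipeline, using NumPy's API-identical fft2/fftshift and a single synthetic point target (the bin indices k_dop and k_rng are made-up illustration values); replacing np with cp runs the identical code through cuFFT:

```python
import numpy as np

n_pulses, n_range = 64, 128
k_dop, k_rng = 10, 20   # hypothetical Doppler and range frequency bins

p = np.arange(n_pulses)[:, None]
r = np.arange(n_range)[None, :]
# Single point target: a complex sinusoid in both dimensions
data = np.exp(2j * np.pi * (k_dop * p / n_pulses + k_rng * r / n_range))

rd_map = np.fft.fftshift(np.fft.fft2(data))
peak = np.unravel_index(np.argmax(np.abs(rd_map)), rd_map.shape)

# fftshift moves bin k to (k + N//2) mod N, centering zero frequency
assert peak == (k_dop + n_pulses // 2, k_rng + n_range // 2)
```

The target appears as a single bright cell in the shifted map, at the shifted Doppler and range bins, which is what the dB image in the full example visualizes.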
GPU vs CPU FFT Benchmark
Compare 1-D FFT throughput on CPU (SciPy/FFTW) vs GPU (CuPy/cuFFT) across transform sizes.
Range-Doppler Map Explorer
Interactive range-Doppler map generated on GPU with adjustable target parameters.
Quick Check
You switch your GPU FFT from complex128 to complex64.
What happens to throughput and why?
No change – FFT speed is determined by algorithm complexity O(N log N)
Roughly 2x faster – FFT is memory-bound; halving data size halves memory traffic
Exactly 4x faster – GPU has dedicated FP32 Tensor Cores
GPU FFTs are memory-bandwidth bound. Switching from 16-byte complex128 to 8-byte complex64 halves memory traffic, nearly doubling throughput. This is the roofline model in action.
Common Mistake: Not Reusing FFT Plans
Mistake:
Benchmarking or pipelining cp.fft.fft(x) without warming up the plan
cache, so the first call's cuFFT plan creation is counted against
every iteration's budget.
Correction:
Use cupyx.scipy.fft with explicit plan caching, or call the
FFT once outside the loop to warm up the plan cache:
import cupyx.scipy.fft as cufft

# Warm up the plan cache with an input of the same shape and dtype
_ = cufft.fft(x_batch[0])

# The plan is now cached; subsequent same-shape calls skip plan creation
for i in range(n_iterations):
    X = cufft.fft(x_batch[i])
CuPy caches plans internally (keyed by transform shape and dtype), but the first call for each new shape pays ~1 ms of plan-creation overhead.
Why This Matters: OFDM and the GPU FFT
In OFDM (used in 5G NR, Wi-Fi, LTE), every symbol requires an IFFT at the transmitter and an FFT at the receiver. 5G NR uses FFT sizes up to 4096, and the base station must process all antenna ports and subframes in real time. A GPU running cuFFT can handle hundreds of simultaneous FFTs: exactly the workload of a massive MIMO OFDM base station.
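The IFFT/FFT symmetry at the heart of OFDM can be sketched with NumPy (n_fft = 64 and the cyclic-prefix length here are toy values, not a 5G NR numerology); the same code runs unchanged on the GPU by swapping np for cp:

```python
import numpy as np

n_fft, n_cp = 64, 16           # toy FFT size and cyclic-prefix length
rng = np.random.default_rng(0)

# QPSK symbols, one per subcarrier
bits = rng.integers(0, 2, (2, n_fft))
symbols = ((2 * bits[0] - 1) + 1j * (2 * bits[1] - 1)) / np.sqrt(2)

# Transmitter: IFFT to time domain, then prepend cyclic prefix
time_signal = np.fft.ifft(symbols)
tx = np.concatenate([time_signal[-n_cp:], time_signal])

# Receiver: strip cyclic prefix, FFT back to subcarriers
rx_symbols = np.fft.fft(tx[n_cp:])

assert np.allclose(rx_symbols, symbols)
```

On an ideal channel the round trip is exact; in a real link, a per-subcarrier channel estimate would be divided out after the receive FFT.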
CuPy FFT and Signal Processing β Complete Examples
# Code from: ch11/python/s03_cupy_fft.py