CuPy Linear Algebra
cuBLAS and cuSOLVER: The GPU Linear Algebra Stack
CuPy's linear algebra functions dispatch to NVIDIA's optimized libraries: cuBLAS for matrix operations (gemm, gemv, trsm) and cuSOLVER for decompositions (LU, QR, SVD, eigenvalue). These are GPU-native equivalents of LAPACK, tuned for each GPU architecture. You get this performance for free through CuPy's NumPy-compatible API.
Definition: Solving Linear Systems on GPU
cp.linalg.solve(A, b) solves the linear system $Ax = b$ using LU factorization via cuSOLVER:
import cupy as cp
A = cp.random.randn(2000, 2000)
b = cp.random.randn(2000)
x = cp.linalg.solve(A, b) # GPU-accelerated LU solve
For multiple right-hand sides, pass a matrix $B$:
B = cp.random.randn(2000, 100)
X = cp.linalg.solve(A, B) # solves all 100 systems
The API is identical to np.linalg.solve.
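A quick way to sanity-check a solve is to confirm the residual is at round-off level. The sketch below (hypothetical values; it falls back to NumPy when CuPy is unavailable, since the two share the same `linalg.solve` signature) solves two right-hand sides with a single factorization:

```python
# Minimal sketch: solve AX = B for two right-hand sides, then check the residual.
try:
    import cupy as xp  # GPU path
except ImportError:
    import numpy as xp  # CPU fallback; same solve() signature

A = xp.asarray([[3.0, 1.0], [1.0, 2.0]])
B = xp.asarray([[9.0, 8.0], [8.0, 13.0]])    # two right-hand sides as columns
X = xp.linalg.solve(A, B)                    # one LU factorization, both columns solved
residual = float(xp.linalg.norm(A @ X - B))  # ~1e-15: X satisfies AX = B up to round-off
```

Because the factorization cost dominates, solving 100 right-hand sides this way costs little more than solving one.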
Definition: Singular Value Decomposition on GPU
cp.linalg.svd computes the SVD $A = U \Sigma V^H$ via cuSOLVER:
U, s, Vh = cp.linalg.svd(A, full_matrices=False)
For the common case of needing only singular values:
s = cp.linalg.svd(A, compute_uv=False)
GPU SVD becomes faster than CPU SVD around $n \approx 512$ for double precision (see the crossover table below) and at somewhat smaller sizes for single precision.
Use full_matrices=False for the compact SVD when $m \neq n$: it avoids allocating the full $m \times m$ or $n \times n$ unitary matrices.
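The compact SVD still reconstructs the matrix exactly, and truncating it gives the best low-rank approximation. A minimal sketch (hypothetical 4x3 matrix; NumPy fallback when CuPy is absent):

```python
# Sketch: compact SVD, exact reconstruction, and a rank-1 truncation.
try:
    import cupy as xp
except ImportError:
    import numpy as xp  # same svd() signature on CPU

A = xp.arange(12, dtype=xp.float64).reshape(4, 3)   # small 4x3 example
U, s, Vh = xp.linalg.svd(A, full_matrices=False)    # U:(4,3), s:(3,), Vh:(3,3)
A_rec = (U * s) @ Vh                                # U @ diag(s) @ Vh, equals A
A1 = s[0] * xp.outer(U[:, 0], Vh[0])                # best rank-1 approximation
err_full = float(xp.linalg.norm(A - A_rec))         # ~0 up to round-off
```

The `(U * s)` broadcast multiplies each column of U by the matching singular value, avoiding an explicit `diag(s)` allocation.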
Definition: Hermitian Eigenvalue Decomposition on GPU
For Hermitian (real symmetric) matrices, cp.linalg.eigh returns
real eigenvalues and orthonormal eigenvectors:
eigenvalues, eigenvectors = cp.linalg.eigh(R)
This uses cuSOLVER's syevd algorithm, which is significantly
faster than the general cp.linalg.eig for Hermitian matrices.
For the general (non-Hermitian) case:
eigenvalues, eigenvectors = cp.linalg.eig(A) # complex eigenvalues
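For a concrete case, a sample covariance matrix is Hermitian positive semi-definite by construction, so eigh applies directly. A minimal real-valued sketch (hypothetical data; NumPy fallback when CuPy is absent):

```python
# Sketch: eigendecomposition of a Hermitian PSD sample covariance via eigh.
try:
    import cupy as xp
except ImportError:
    import numpy as xp  # same eigh() signature on CPU

X = xp.asarray([[1.0, 2.0, 3.0],
                [2.0, 1.0, 0.0]])    # 2 sensors x 3 snapshots
R = X @ X.T / X.shape[1]             # Hermitian PSD sample covariance
w, V = xp.linalg.eigh(R)             # real eigenvalues, ascending order
err = float(xp.linalg.norm((V * w) @ V.T - R))  # V diag(w) V^T reconstructs R
```

Note that eigh returns eigenvalues in ascending order, whereas svd returns singular values in descending order; reverse with `w[::-1]` if needed.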
Theorem: Batched Linear Algebra on GPU
For a batch of $K$ independent systems $A_k x_k = b_k$, $k = 1, \dots, K$, with each $A_k \in \mathbb{C}^{n \times n}$, the batched GPU solve processes all systems in parallel:
$$T_{\text{GPU}} \approx T_{\text{solve}}(n) + T_{\text{overhead}},$$
where the overhead is negligible for moderate $K$. The speedup over sequential CPU solves is:
$$S \approx \frac{K \, T_{\text{CPU}}(n)}{T_{\text{GPU}}},$$
because the GPU can saturate its thousands of cores across the batch.
Think of it as "SIMD on steroids": instead of processing one matrix at a time, the GPU processes all matrices simultaneously across its streaming multiprocessors. This is the killer application for OFDM processing, where you need to solve one system per subcarrier.
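A batched solve is invoked simply by stacking the systems into a 3-D array. The sketch below uses hypothetical sizes (64 subcarriers, 4x4 matrices) and falls back to NumPy, which shares the same stacked-array semantics, when no GPU is present:

```python
# Sketch: one small solve per OFDM subcarrier, batched into a single 3-D call.
try:
    import cupy as xp  # GPU path
except ImportError:
    import numpy as xp  # CPU fallback with identical stacked-array semantics

n_sub, n = 64, 4                      # hypothetical: 64 subcarriers, 4x4 systems
xp.random.seed(0)
H = xp.random.randn(n_sub, n, n) + 5.0 * xp.eye(n)  # (64, 4, 4), safely invertible
y = xp.random.randn(n_sub, n, 1)                    # one receive vector per subcarrier
x = xp.linalg.solve(H, y)                           # all 64 systems in one call
print(x.shape)                                      # (64, 4, 1)
residual = float(xp.abs(H @ x - y).max())
```

Keeping the right-hand sides as `(n_sub, n, 1)` column stacks makes the broadcasting unambiguous across library versions.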
Example: Batched SVD for OFDM Channel Matrices
Compute the SVD of 1024 independent MIMO channel matrices (one per OFDM subcarrier) on the GPU.
Batched computation
import cupy as cp
import numpy as np
import time
n_sub = 1024 # subcarriers
n_rx, n_tx = 8, 4
# Generate batch of channel matrices
H = (cp.random.randn(n_sub, n_rx, n_tx)
+ 1j * cp.random.randn(n_sub, n_rx, n_tx)) / cp.sqrt(2)
# Batched SVD: processes all 1024 matrices at once
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
U, s, Vh = cp.linalg.svd(H, full_matrices=False)
cp.cuda.Stream.null.synchronize()
t_gpu = time.perf_counter() - t0
print(f"GPU batched SVD: {t_gpu*1000:.2f} ms")
print(f"U shape: {U.shape}") # (1024, 8, 4)
print(f"s shape: {s.shape}") # (1024, 4)
print(f"Vh shape: {Vh.shape}") # (1024, 4, 4)
CPU comparison
H_np = cp.asnumpy(H)
t0 = time.perf_counter()
for i in range(n_sub):
    np.linalg.svd(H_np[i], full_matrices=False)
t_cpu = time.perf_counter() - t0
print(f"CPU sequential SVD: {t_cpu*1000:.2f} ms")
print(f"Speedup: {t_cpu/t_gpu:.1f}x")
Typical speedup: 20-50x, because the GPU processes all 1024 small matrices concurrently.
GPU vs CPU Linear Algebra Benchmark
Compare solve, SVD, and eigh performance on CPU vs GPU as matrix size grows. Identifies the crossover point for each operation.
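A minimal sketch of such a benchmark follows. The sizes and the symmetric test matrix are illustrative choices, and the script degrades gracefully to CPU-only timings when CuPy is unavailable:

```python
# Sketch: time solve / SVD / eigh on CPU (NumPy) and, if present, GPU (CuPy).
import time
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # CPU-only environment: report NumPy timings alone

def bench(xp, op, n, sync=lambda: None):
    """Time one operation of the given size; sync() flushes GPU work."""
    A = xp.asarray(np.random.randn(n, n))
    A = A @ A.T + n * xp.eye(n)          # well-conditioned and symmetric
    b = xp.asarray(np.random.randn(n))
    sync()
    t0 = time.perf_counter()
    if op == "solve":
        xp.linalg.solve(A, b)
    elif op == "svd":
        xp.linalg.svd(A, compute_uv=False)
    else:
        xp.linalg.eigh(A)
    sync()
    return time.perf_counter() - t0

for n in (128, 512):
    for op in ("solve", "svd", "eigh"):
        line = f"n={n:4d} {op:5s} CPU {bench(np, op, n)*1e3:7.2f} ms"
        if cp is not None:
            t_gpu = bench(cp, op, n, cp.cuda.Stream.null.synchronize)
            line += f"  GPU {t_gpu*1e3:7.2f} ms"
        print(line)
```

In practice the first GPU call should be discarded as warm-up (it includes kernel compilation and memory-pool growth), and each point should be the median of several runs.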
Quick Check
Your covariance matrix is Hermitian positive semi-definite. Which CuPy function should you use?
- cp.linalg.eig: it handles all matrix types
- cp.linalg.eigh: it exploits Hermitian structure for roughly 3x speedup
- cp.linalg.svd: eigenvalues equal singular values for PSD matrices

Answer: cp.linalg.eigh. Correct. eigh uses the cuSOLVER syevd kernel, which exploits symmetry for roughly a 3x speedup and guarantees real, sorted eigenvalues.
Common Mistake: Using GPU for Small Matrices
Mistake:
Running cp.linalg.solve on a single small system (e.g., $n < 256$) and expecting a speedup over NumPy.
Correction:
Small matrices do not saturate GPU parallelism: the kernel launch overhead (~5-20 µs) exceeds the compute time. Use the GPU only when $n$ reaches the crossover sizes in the table below (roughly 256-512 for solve/SVD), or batch many small operations together. For OFDM processing with many small per-subcarrier matrices, always batch them into a 3-D array.
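The difference between looping and batching can be measured directly. A minimal sketch (hypothetical sizes: 512 systems of size 8x8; NumPy fallback when CuPy is absent, where the gap reflects Python loop overhead rather than kernel launches):

```python
# Sketch: Python loop of tiny solves vs one batched call.
import time
try:
    import cupy as xp
except ImportError:
    import numpy as xp  # CPU fallback; same stacked-array solve semantics

def sync():
    if hasattr(xp, "cuda"):               # only meaningful on the GPU path
        xp.cuda.Stream.null.synchronize()

K, n = 512, 8
A = xp.random.randn(K, n, n) + 5.0 * xp.eye(n)   # (512, 8, 8), safely invertible
b = xp.random.randn(K, n, 1)

sync()
t0 = time.perf_counter()
for k in range(K):                        # one launch (and Python round-trip) per system
    xp.linalg.solve(A[k], b[k])
sync()
t_loop = time.perf_counter() - t0

sync()
t0 = time.perf_counter()
X = xp.linalg.solve(A, b)                 # single batched launch
sync()
t_batch = time.perf_counter() - t0
print(f"loop: {t_loop*1e3:.1f} ms  batched: {t_batch*1e3:.1f} ms")
```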
CPU vs GPU Crossover Sizes (Approximate)
| Operation | CPU Library | GPU Library | Crossover n (f64) | GPU Speedup at n=4096 |
|---|---|---|---|---|
| Matrix multiply | OpenBLAS | cuBLAS | ~128 | 30-50x |
| LU solve | LAPACK | cuSOLVER | ~256 | 10-20x |
| SVD (full) | LAPACK | cuSOLVER | ~512 | 8-15x |
| eigh | LAPACK | cuSOLVER | ~384 | 10-20x |
| Cholesky | LAPACK | cuSOLVER | ~256 | 15-30x |
CuPy Linear Algebra: Complete Examples
# Code from: ch11/python/s02_cupy_linalg.py