CuPy Linear Algebra
cuBLAS and cuSOLVER: The GPU Linear Algebra Stack
CuPy's linear algebra functions dispatch to NVIDIA's optimized libraries: cuBLAS for matrix operations (gemm, gemv, trsm) and cuSOLVER for decompositions (LU, QR, SVD, eigenvalue). These are GPU-native equivalents of LAPACK, tuned for each GPU architecture. You get this performance for free through CuPy's NumPy-compatible API.
Definition: Solving Linear Systems on GPU
cp.linalg.solve(A, b) solves the linear system $Ax = b$ using LU factorization via cuSOLVER:
import cupy as cp
A = cp.random.randn(2000, 2000)
b = cp.random.randn(2000)
x = cp.linalg.solve(A, b) # GPU-accelerated LU solve
For multiple right-hand sides, pass a matrix $B$:
B = cp.random.randn(2000, 100)
X = cp.linalg.solve(A, B) # solves all 100 systems
The API is identical to np.linalg.solve.
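A quick way to sanity-check a solve is to confirm the residual is at round-off level. The sketch below (hypothetical values; it falls back to NumPy when CuPy is unavailable, since the two share the same `linalg.solve` signature) solves two right-hand sides with a single factorization:

```python
# Minimal sketch: solve AX = B for two right-hand sides, then check the residual.
try:
    import cupy as xp  # GPU path
except ImportError:
    import numpy as xp  # CPU fallback; same solve() signature

A = xp.asarray([[3.0, 1.0], [1.0, 2.0]])
B = xp.asarray([[9.0, 8.0], [8.0, 13.0]])    # two right-hand sides as columns
X = xp.linalg.solve(A, B)                    # one LU factorization, both columns solved
residual = float(xp.linalg.norm(A @ X - B))  # ~1e-15: X satisfies AX = B up to round-off
```

Because the factorization cost dominates, solving 100 right-hand sides this way costs little more than solving one.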
Definition: Singular Value Decomposition on GPU
cp.linalg.svd computes the SVD $A = U \Sigma V^H$ via cuSOLVER:
U, s, Vh = cp.linalg.svd(A, full_matrices=False)
For the common case of needing only singular values:
s = cp.linalg.svd(A, compute_uv=False)
GPU SVD becomes faster than CPU SVD around $n \approx 512$ for double precision (see the crossover table below) and at somewhat smaller sizes for single precision.
Use full_matrices=False for the compact SVD when $m \neq n$: it avoids allocating the full $m \times m$ or $n \times n$ unitary matrices.
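The compact SVD still reconstructs the matrix exactly, and truncating it gives the best low-rank approximation. A minimal sketch (hypothetical 4x3 matrix; NumPy fallback when CuPy is absent):

```python
# Sketch: compact SVD, exact reconstruction, and a rank-1 truncation.
try:
    import cupy as xp
except ImportError:
    import numpy as xp  # same svd() signature on CPU

A = xp.arange(12, dtype=xp.float64).reshape(4, 3)   # small 4x3 example
U, s, Vh = xp.linalg.svd(A, full_matrices=False)    # U:(4,3), s:(3,), Vh:(3,3)
A_rec = (U * s) @ Vh                                # U @ diag(s) @ Vh, equals A
A1 = s[0] * xp.outer(U[:, 0], Vh[0])                # best rank-1 approximation
err_full = float(xp.linalg.norm(A - A_rec))         # ~0 up to round-off
```

The `(U * s)` broadcast multiplies each column of U by the matching singular value, avoiding an explicit `diag(s)` allocation.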
Definition: Hermitian Eigenvalue Decomposition on GPU
For Hermitian (real symmetric) matrices, cp.linalg.eigh returns
real eigenvalues and orthonormal eigenvectors:
eigenvalues, eigenvectors = cp.linalg.eigh(R)
This uses cuSOLVER's syevd algorithm, which is significantly
faster than the general cp.linalg.eig for Hermitian matrices.
For the general (non-Hermitian) case:
eigenvalues, eigenvectors = cp.linalg.eig(A) # complex eigenvalues
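For a concrete case, a sample covariance matrix is Hermitian positive semi-definite by construction, so eigh applies directly. A minimal real-valued sketch (hypothetical data; NumPy fallback when CuPy is absent):

```python
# Sketch: eigendecomposition of a Hermitian PSD sample covariance via eigh.
try:
    import cupy as xp
except ImportError:
    import numpy as xp  # same eigh() signature on CPU

X = xp.asarray([[1.0, 2.0, 3.0],
                [2.0, 1.0, 0.0]])    # 2 sensors x 3 snapshots
R = X @ X.T / X.shape[1]             # Hermitian PSD sample covariance
w, V = xp.linalg.eigh(R)             # real eigenvalues, ascending order
err = float(xp.linalg.norm((V * w) @ V.T - R))  # V diag(w) V^T reconstructs R
```

Note that eigh returns eigenvalues in ascending order, whereas svd returns singular values in descending order; reverse with `w[::-1]` if needed.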
Theorem: Batched Linear Algebra on GPU
For a batch of $K$ independent systems $A_k x_k = b_k$, $k = 1, \dots, K$, with each $A_k \in \mathbb{C}^{n \times n}$, the batched GPU solve processes all systems in parallel:
$$T_{\text{GPU}} \approx T_{\text{solve}}(n) + T_{\text{overhead}},$$
where the overhead is negligible for moderate $K$. The speedup over sequential CPU solves is:
$$S \approx \frac{K \, T_{\text{CPU}}(n)}{T_{\text{GPU}}},$$
because the GPU can saturate its thousands of cores across the batch.
Think of it as "SIMD on steroids": instead of processing one matrix at a time, the GPU processes all matrices simultaneously across its streaming multiprocessors. This is the killer application for OFDM processing, where you need to solve one system per subcarrier.
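A batched solve is invoked simply by stacking the systems into a 3-D array. The sketch below uses hypothetical sizes (64 subcarriers, 4x4 matrices) and falls back to NumPy, which shares the same stacked-array semantics, when no GPU is present:

```python
# Sketch: one small solve per OFDM subcarrier, batched into a single 3-D call.
try:
    import cupy as xp  # GPU path
except ImportError:
    import numpy as xp  # CPU fallback with identical stacked-array semantics

n_sub, n = 64, 4                      # hypothetical: 64 subcarriers, 4x4 systems
xp.random.seed(0)
H = xp.random.randn(n_sub, n, n) + 5.0 * xp.eye(n)  # (64, 4, 4), safely invertible
y = xp.random.randn(n_sub, n, 1)                    # one receive vector per subcarrier
x = xp.linalg.solve(H, y)                           # all 64 systems in one call
print(x.shape)                                      # (64, 4, 1)
residual = float(xp.abs(H @ x - y).max())
```

Keeping the right-hand sides as `(n_sub, n, 1)` column stacks makes the broadcasting unambiguous across library versions.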
Example: Batched SVD for OFDM Channel Matrices
Compute the SVD of 1024 independent MIMO channel matrices (one per OFDM subcarrier) on the GPU.
Batched computation
import cupy as cp
import numpy as np
import time
n_sub = 1024 # subcarriers
n_rx, n_tx = 8, 4
# Generate batch of channel matrices
H = (cp.random.randn(n_sub, n_rx, n_tx)
+ 1j * cp.random.randn(n_sub, n_rx, n_tx)) / cp.sqrt(2)
# Batched SVD: processes all 1024 matrices at once
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
U, s, Vh = cp.linalg.svd(H, full_matrices=False)
cp.cuda.Stream.null.synchronize()
t_gpu = time.perf_counter() - t0
print(f"GPU batched SVD: {t_gpu*1000:.2f} ms")
print(f"U shape: {U.shape}") # (1024, 8, 4)
print(f"s shape: {s.shape}") # (1024, 4)
print(f"Vh shape: {Vh.shape}") # (1024, 4, 4)
CPU comparison
H_np = cp.asnumpy(H)
t0 = time.perf_counter()
for i in range(n_sub):
    np.linalg.svd(H_np[i], full_matrices=False)
t_cpu = time.perf_counter() - t0
print(f"CPU sequential SVD: {t_cpu*1000:.2f} ms")
print(f"Speedup: {t_cpu/t_gpu:.1f}x")
Typical speedup: 20-50x, because the GPU processes all 1024 small matrices concurrently.
GPU vs CPU Linear Algebra Benchmark
Compare solve, SVD, and eigh performance on CPU vs GPU as matrix size grows. Identifies the crossover point for each operation.
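A minimal sketch of such a benchmark follows. The sizes and the symmetric test matrix are illustrative choices, and the script degrades gracefully to CPU-only timings when CuPy is unavailable:

```python
# Sketch: time solve / SVD / eigh on CPU (NumPy) and, if present, GPU (CuPy).
import time
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # CPU-only environment: report NumPy timings alone

def bench(xp, op, n, sync=lambda: None):
    """Time one operation of the given size; sync() flushes GPU work."""
    A = xp.asarray(np.random.randn(n, n))
    A = A @ A.T + n * xp.eye(n)          # well-conditioned and symmetric
    b = xp.asarray(np.random.randn(n))
    sync()
    t0 = time.perf_counter()
    if op == "solve":
        xp.linalg.solve(A, b)
    elif op == "svd":
        xp.linalg.svd(A, compute_uv=False)
    else:
        xp.linalg.eigh(A)
    sync()
    return time.perf_counter() - t0

for n in (128, 512):
    for op in ("solve", "svd", "eigh"):
        line = f"n={n:4d} {op:5s} CPU {bench(np, op, n)*1e3:7.2f} ms"
        if cp is not None:
            t_gpu = bench(cp, op, n, cp.cuda.Stream.null.synchronize)
            line += f"  GPU {t_gpu*1e3:7.2f} ms"
        print(line)
```

In practice the first GPU call should be discarded as warm-up (it includes kernel compilation and memory-pool growth), and each point should be the median of several runs.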
Quick Check
Your covariance matrix is Hermitian positive semi-definite. Which CuPy function should you use?
- cp.linalg.eig: it handles all matrix types
- cp.linalg.eigh: it exploits Hermitian structure for roughly 3x speedup
- cp.linalg.svd: eigenvalues equal singular values for PSD matrices

Answer: cp.linalg.eigh. Correct. eigh uses the cuSOLVER syevd kernel, which exploits symmetry for roughly a 3x speedup and guarantees real, sorted eigenvalues.
Common Mistake: Using GPU for Small Matrices
Mistake:
Running cp.linalg.solve on a single small system (e.g., $n < 256$) and expecting a speedup over NumPy.
Correction:
Small matrices do not saturate GPU parallelism: the kernel launch overhead (~5-20 µs) exceeds the compute time. Use the GPU only when $n$ reaches the crossover sizes in the table below (roughly 256-512 for solve/SVD), or batch many small operations together. For OFDM processing with many small per-subcarrier matrices, always batch them into a 3-D array.
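The difference between looping and batching can be measured directly. A minimal sketch (hypothetical sizes: 512 systems of size 8x8; NumPy fallback when CuPy is absent, where the gap reflects Python loop overhead rather than kernel launches):

```python
# Sketch: Python loop of tiny solves vs one batched call.
import time
try:
    import cupy as xp
except ImportError:
    import numpy as xp  # CPU fallback; same stacked-array solve semantics

def sync():
    if hasattr(xp, "cuda"):               # only meaningful on the GPU path
        xp.cuda.Stream.null.synchronize()

K, n = 512, 8
A = xp.random.randn(K, n, n) + 5.0 * xp.eye(n)   # (512, 8, 8), safely invertible
b = xp.random.randn(K, n, 1)

sync()
t0 = time.perf_counter()
for k in range(K):                        # one launch (and Python round-trip) per system
    xp.linalg.solve(A[k], b[k])
sync()
t_loop = time.perf_counter() - t0

sync()
t0 = time.perf_counter()
X = xp.linalg.solve(A, b)                 # single batched launch
sync()
t_batch = time.perf_counter() - t0
print(f"loop: {t_loop*1e3:.1f} ms  batched: {t_batch*1e3:.1f} ms")
```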
CPU vs GPU Crossover Sizes (Approximate)
| Operation | CPU Library | GPU Library | Crossover n (f64) | GPU Speedup at n=4096 |
|---|---|---|---|---|
| Matrix multiply | OpenBLAS | cuBLAS | ~128 | 30-50x |
| LU solve | LAPACK | cuSOLVER | ~256 | 10-20x |
| SVD (full) | LAPACK | cuSOLVER | ~512 | 8-15x |
| eigh | LAPACK | cuSOLVER | ~384 | 10-20x |
| Cholesky | LAPACK | cuSOLVER | ~256 | 15-30x |
CuPy Linear Algebra: Complete Examples
# Code from: ch11/python/s02_cupy_linalg.py