CuPy Linear Algebra

cuBLAS and cuSOLVER: The GPU Linear Algebra Stack

CuPy's linear algebra functions dispatch to NVIDIA's optimized libraries: cuBLAS for matrix operations (gemm, gemv, trsm) and cuSOLVER for decompositions (LU, QR, SVD, eigenvalue). These are GPU-native equivalents of LAPACK, tuned for each GPU architecture. You get this performance for free through CuPy's NumPy-compatible API.

Definition:

Solving Linear Systems on GPU

cp.linalg.solve(A, b) solves Ax=b\mathbf{A}\mathbf{x} = \mathbf{b} using LU factorization via cuSOLVER:

import cupy as cp

A = cp.random.randn(2000, 2000)
b = cp.random.randn(2000)
x = cp.linalg.solve(A, b)  # GPU-accelerated LU solve

For multiple right-hand sides, pass a matrix B\mathbf{B}:

B = cp.random.randn(2000, 100)
X = cp.linalg.solve(A, B)  # solves all 100 systems

The API is identical to np.linalg.solve.

Definition:

Singular Value Decomposition on GPU

cp.linalg.svd computes the SVD A=UΞ£VH\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H via cuSOLVER:

U, s, Vh = cp.linalg.svd(A, full_matrices=False)

For the common case of needing only singular values:

s = cp.linalg.svd(A, compute_uv=False)

GPU SVD becomes faster than CPU SVD around nβ‰ˆ500n \approx 500 for double precision and nβ‰ˆ256n \approx 256 for single precision.

Use full_matrices=False for the compact SVD when m≠nm \neq n — it avoids allocating the full m×mm \times m or n×nn \times n unitary matrices.

Definition:

Hermitian Eigenvalue Decomposition on GPU

For Hermitian (symmetric) matrices, cp.linalg.eigh gives real eigenvalues and orthogonal eigenvectors:

eigenvalues, eigenvectors = cp.linalg.eigh(R)

This uses cuSOLVER's syevd algorithm, which is significantly faster than the general cp.linalg.eig for Hermitian matrices.

For the general (non-Hermitian) case:

eigenvalues, eigenvectors = cp.linalg.eig(A)  # complex eigenvalues

Theorem: Batched Linear Algebra on GPU

For a batch of BB independent systems Aixi=bi\mathbf{A}_i \mathbf{x}_i = \mathbf{b}_i, i=1,…,Bi = 1, \ldots, B, with each Ai∈CnΓ—n\mathbf{A}_i \in \mathbb{C}^{n \times n}, the batched GPU solve processes all systems in parallel:

Tbatchedβ‰ˆTsingle+O(B)T_{\mathrm{batched}} \approx T_{\mathrm{single}} + O(B)

where the O(B)O(B) overhead is negligible for moderate nn. The speedup over sequential CPU solves is:

Sβ‰ˆBβ‹…TCPUTGPU,batched≫BS \approx \frac{B \cdot T_{\mathrm{CPU}}}{T_{\mathrm{GPU,batched}}} \gg B

because the GPU can saturate its thousands of cores across the batch.

Think of it as "SIMD on steroids": instead of processing one matrix at a time, the GPU processes all BB matrices simultaneously across its streaming multiprocessors. This is the killer application for OFDM processing, where you need to solve one system per subcarrier.

Example: Batched SVD for OFDM Channel Matrices

Compute the SVD of 1024 independent 8Γ—48 \times 4 MIMO channel matrices (one per OFDM subcarrier) on the GPU.

GPU vs CPU Linear Algebra Benchmark

Compare solve, SVD, and eigh performance on CPU vs GPU as matrix size grows. Identifies the crossover point for each operation.

Parameters

Quick Check

Your covariance matrix R=HHH\mathbf{R} = \mathbf{H}^H \mathbf{H} is Hermitian positive semi-definite. Which CuPy function should you use?

cp.linalg.eig β€” it handles all matrix types

cp.linalg.eigh β€” it exploits Hermitian structure for 3x speed

cp.linalg.svd β€” eigenvalues equal singular values for PSD matrices

Common Mistake: Using GPU for Small Matrices

Mistake:

Running cp.linalg.solve on a 10Γ—1010 \times 10 system and expecting a speedup over NumPy.

Correction:

Small matrices do not saturate GPU parallelism. The kernel launch overhead (~5-20 us) exceeds the compute time. Use the GPU only when n>256n > 256 for solve/SVD, or batch many small operations together. For OFDM processing with many small per-subcarrier matrices, always batch them into a 3-D array.

CPU vs GPU Crossover Sizes (Approximate)

OperationCPU LibraryGPU LibraryCrossover n (f64)GPU Speedup at n=4096
Matrix multiplyOpenBLAScuBLAS~12830-50x
LU solveLAPACKcuSOLVER~25610-20x
SVD (full)LAPACKcuSOLVER~5128-15x
eighLAPACKcuSOLVER~38410-20x
CholeskyLAPACKcuSOLVER~25615-30x

CuPy Linear Algebra β€” Complete Examples

python
Solve, SVD, eigh, and batched operations on GPU with benchmarking against NumPy.
# Code from: ch11/python/s02_cupy_linalg.py
# Load from backend supplements endpoint