CuPy Sparse Matrices

cuSPARSE: Sparse Linear Algebra on the GPU

CuPy provides GPU-accelerated sparse matrix operations through NVIDIA's cuSPARSE library. The cupyx.scipy.sparse module mirrors SciPy's sparse API, supporting CSR, CSC, COO, and DIA formats. For large sparse matrices (millions of rows), GPU sparse matvec can be 5-20x faster than SciPy on CPU.

Definition:

Sparse Matrix Formats on GPU

CuPy supports the standard sparse formats, all backed by cuSPARSE:

import cupy as cp
import cupyx.scipy.sparse as csp

# COO format -- good for construction
row = cp.array([0, 1, 2, 0])
col = cp.array([0, 1, 2, 2])
data = cp.array([1.0, 2.0, 3.0, 4.0])
A_coo = csp.coo_matrix((data, (row, col)), shape=(3, 3))

# CSR format -- best for matvec and row slicing
A_csr = A_coo.tocsr()

# CSC format -- best for column slicing
A_csc = A_coo.tocsc()

The rule of thumb: build in COO, compute in CSR (same as SciPy, but on the GPU).

CuPy sparse matrices store their index and value arrays in GPU memory. Use csp.csr_matrix(scipy_sparse_matrix) to transfer a SciPy sparse matrix to the GPU.
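A minimal sketch of the host-to-device round trip. It hedges on hardware: when CuPy is not installed it falls back to SciPy, so the same lines run on CPU; `on_gpu` and the aliases `xp`/`xsp` are our naming, not part of either library.

```python
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp                  # GPU arrays
    import cupyx.scipy.sparse as xsp   # GPU sparse matrices
    on_gpu = True
except ImportError:                    # no CuPy: run the same code on CPU
    xp, xsp, on_gpu = np, sp, False

# Build on the host with SciPy, then copy to the device: the
# cupyx.scipy.sparse constructors accept SciPy sparse matrices directly.
A_host = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
A_dev = xsp.csr_matrix(A_host)

x = xp.ones(1000)
y = A_dev @ x                          # SpMV on the device

# .get() copies a CuPy result back to host memory
y_host = y.get() if on_gpu else y
```

The transfer copies the CSR index and value arrays once; after that, all products stay on the device until you explicitly call `.get()`.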

Definition:

Sparse Matrix-Vector Product on GPU

The sparse matrix-vector product $\mathbf{y} = \mathbf{A}\mathbf{x}$ for a CSR matrix dispatches to cuSPARSE's SpMV routine (csrmv in older CUDA toolkits):

n = A_csr.shape[0]   # A_csr from the example above
x = cp.random.randn(n)
y = A_csr @ x        # or A_csr.dot(x)

The operation touches only the $\mathrm{nnz}$ nonzero elements, giving $O(\mathrm{nnz})$ complexity instead of $O(n^2)$ for dense.

For sparse matrix-matrix products:

C = A_csr @ B_csr   # sparse-sparse multiply
D = A_csr @ X_dense  # sparse-dense multiply (SpMM)
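The two product types can be sketched in a self-contained form; as above, the snippet falls back to SciPy when CuPy is not installed, and the matrices here are random illustrations, not from the text.

```python
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp
    import cupyx.scipy.sparse as xsp
except ImportError:                    # CPU fallback: same API via SciPy
    xp, xsp = np, sp

n = 500
A = xsp.random(n, n, density=0.02, format="csr")
B = xsp.random(n, n, density=0.02, format="csr")
X = xp.ones((n, 8))                    # tall-skinny dense block

C = A @ B      # sparse-sparse multiply: result is sparse
D = A @ X      # sparse-dense multiply (SpMM): result is a dense array
```

SpMM amortizes the irregular reads of A's indices over all columns of X, so it is usually much faster than calling matvec once per column.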

Theorem: Sparse vs Dense Crossover on GPU

For a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ with density $\rho = \mathrm{nnz} / n^2$, the sparse CSR matvec is faster than dense matvec when:

$$\rho < \rho^* \approx \frac{B_{\mathrm{mem}}}{n \cdot P_{\mathrm{peak}} \cdot w_{\mathrm{idx}}}$$

where $B_{\mathrm{mem}}$ is the memory bandwidth, $P_{\mathrm{peak}}$ the peak arithmetic throughput, and $w_{\mathrm{idx}}$ the per-nonzero index overhead in bytes. In practice, sparse GPU matvec wins when $\rho \lesssim 0.01$ to $0.05$ (1-5% density), depending on matrix size and GPU model. Below this threshold, the sparse format's irregular memory access pattern is outweighed by the reduction in data volume.

Dense matvec has perfect memory access patterns (coalesced reads) but touches $n^2$ elements. Sparse matvec touches only $\mathrm{nnz}$ elements but with irregular access. The crossover depends on how much irregularity penalty the GPU incurs.

Example: Sparse Laplacian Matvec on GPU

Build the 2-D Laplacian stencil as a sparse CSR matrix on the GPU and apply it to a random vector. Compare speed against dense matvec.
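One way to build the stencil is the Kronecker-sum construction sketched below; the helper name `laplacian_2d` is ours, and SciPy again serves as the CPU fallback when CuPy is absent.

```python
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp
    import cupyx.scipy.sparse as xsp
except ImportError:                    # no CuPy: same code on CPU
    xp, xsp = np, sp

def laplacian_2d(n):
    """5-point Laplacian on an n x n grid as an n^2 x n^2 CSR matrix."""
    # 1-D second difference: tridiag(-1, 2, -1)
    T = xsp.diags([-xp.ones(n - 1), 2.0 * xp.ones(n), -xp.ones(n - 1)],
                  [-1, 0, 1], format="csr")
    I = xsp.identity(n, format="csr")
    # Kronecker sum: L = T (x) I + I (x) T
    return (xsp.kron(T, I) + xsp.kron(I, T)).tocsr()

n = 64
L = laplacian_2d(n)
x = xp.ones(n * n)
y = L @ x          # rows away from the boundary sum to zero
```

Applying L to the all-ones vector is a quick sanity check: interior rows have coefficients (-1, -1, 4, -1, -1) and sum to zero, while boundary rows do not.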

Sparse vs Dense GPU Matvec

Compare sparse CSR and dense matrix-vector product performance on GPU across different matrix sizes and densities.
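The comparison can be sketched as below. Timings are illustrative only: the fallback path times NumPy/SciPy on CPU when CuPy is unavailable, and on GPU a correct measurement needs the explicit synchronize shown, since kernels launch asynchronously.

```python
import time
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp
    import cupyx.scipy.sparse as xsp
    def sync():                            # wait for pending GPU kernels
        xp.cuda.Stream.null.synchronize()  # before reading the clock
except ImportError:
    xp, xsp = np, sp
    def sync():
        pass                               # nothing to wait for on CPU

def bench(fn, reps=10):
    fn(); sync()                           # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    sync()
    return (time.perf_counter() - t0) / reps

n = 2000
x = xp.ones(n)
times = {}                                 # density -> (sparse_s, dense_s)
for density in (0.001, 0.01, 0.05):
    A_sp = xsp.random(n, n, density=density, format="csr")
    A_dn = xp.asarray(A_sp.toarray())
    times[density] = (bench(lambda: A_sp @ x), bench(lambda: A_dn @ x))

for d, (ts, td) in sorted(times.items()):
    print(f"density {d:>5}: sparse {ts*1e6:8.1f} us  dense {td*1e6:8.1f} us")
```

Sweeping the density and matrix size this way locates the crossover point from the theorem empirically for your particular GPU.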


Quick Check

You are building a sparse matrix by inserting elements one at a time on the GPU. Which format should you use for construction?

CSR -- it is the fastest for computation

COO -- it allows O(1) element insertion

Dense -- convert to sparse only after filling

CSR (Compressed Sparse Row)

Sparse format storing values, column indices, and row pointers. Optimal for sparse matvec and row slicing on both CPU and GPU.

Related: COO (Coordinate Format)

COO (Coordinate Format)

Sparse format storing (row, col, value) triples. Best for construction; convert to CSR for computation.

Related: CSR (Compressed Sparse Row)

CuPy Sparse Matrices: Complete Examples

Sparse matrix creation, format conversion, matvec, and benchmarking on GPU.

# Code from: ch11/python/s04_cupy_sparse.py