CuPy Sparse Matrices

cuSPARSE: Sparse Linear Algebra on the GPU

CuPy provides GPU-accelerated sparse matrix operations through NVIDIA's cuSPARSE library. The cupyx.scipy.sparse module mirrors SciPy's sparse API, supporting CSR, CSC, COO, and DIA formats. For large sparse matrices (millions of rows), GPU sparse matvec can be 5-20x faster than SciPy on CPU.

Definition:

Sparse Matrix Formats on GPU

CuPy supports the standard sparse formats, all backed by cuSPARSE:

import cupy as cp
import cupyx.scipy.sparse as csp

# COO format -- good for construction
row = cp.array([0, 1, 2, 0])
col = cp.array([0, 1, 2, 2])
data = cp.array([1.0, 2.0, 3.0, 4.0])
A_coo = csp.coo_matrix((data, (row, col)), shape=(3, 3))

# CSR format -- best for matvec and row slicing
A_csr = A_coo.tocsr()

# CSC format -- best for column slicing
A_csc = A_coo.tocsc()

The rule of thumb: build in COO, compute in CSR (same as SciPy, but on the GPU).

CuPy sparse matrices store their index and value arrays in GPU memory. Use csp.csr_matrix(scipy_sparse_matrix) to transfer a SciPy sparse matrix to the GPU.
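A minimal sketch of the host-to-device round trip. It hedges on hardware: when CuPy is not installed it falls back to SciPy, so the same lines run on CPU; `on_gpu` and the aliases `xp`/`xsp` are our naming, not part of either library.

```python
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp                  # GPU arrays
    import cupyx.scipy.sparse as xsp   # GPU sparse matrices
    on_gpu = True
except ImportError:                    # no CuPy: run the same code on CPU
    xp, xsp, on_gpu = np, sp, False

# Build on the host with SciPy, then copy to the device: the
# cupyx.scipy.sparse constructors accept SciPy sparse matrices directly.
A_host = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
A_dev = xsp.csr_matrix(A_host)

x = xp.ones(1000)
y = A_dev @ x                          # SpMV on the device

# .get() copies a CuPy result back to host memory
y_host = y.get() if on_gpu else y
```

The transfer copies the CSR index and value arrays once; after that, all products stay on the device until you explicitly call `.get()`.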

Definition:

Sparse Matrix-Vector Product on GPU

The sparse matrix-vector product $\mathbf{y} = \mathbf{A}\mathbf{x}$ for a CSR matrix dispatches to cuSPARSE's SpMV routine (csrmv in older CUDA toolkits):

n = A_csr.shape[0]   # A_csr from the example above
x = cp.random.randn(n)
y = A_csr @ x        # or A_csr.dot(x)

The operation touches only the $\mathrm{nnz}$ nonzero elements, giving $O(\mathrm{nnz})$ complexity instead of $O(n^2)$ for dense.

For sparse matrix-matrix products:

C = A_csr @ B_csr   # sparse-sparse multiply
D = A_csr @ X_dense  # sparse-dense multiply (SpMM)
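The two product types can be sketched in a self-contained form; as above, the snippet falls back to SciPy when CuPy is not installed, and the matrices here are random illustrations, not from the text.

```python
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp
    import cupyx.scipy.sparse as xsp
except ImportError:                    # CPU fallback: same API via SciPy
    xp, xsp = np, sp

n = 500
A = xsp.random(n, n, density=0.02, format="csr")
B = xsp.random(n, n, density=0.02, format="csr")
X = xp.ones((n, 8))                    # tall-skinny dense block

C = A @ B      # sparse-sparse multiply: result is sparse
D = A @ X      # sparse-dense multiply (SpMM): result is a dense array
```

SpMM amortizes the irregular reads of A's indices over all columns of X, so it is usually much faster than calling matvec once per column.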

Theorem: Sparse vs Dense Crossover on GPU

For a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ with density $\rho = \mathrm{nnz} / n^2$, the sparse CSR matvec is faster than dense matvec when:

$$\rho < \rho^* \approx \frac{B_{\mathrm{mem}}}{n \cdot P_{\mathrm{peak}} \cdot w_{\mathrm{idx}}}$$

where $B_{\mathrm{mem}}$ is the memory bandwidth, $P_{\mathrm{peak}}$ the peak arithmetic throughput, and $w_{\mathrm{idx}}$ the per-nonzero index overhead in bytes. In practice, sparse GPU matvec wins when $\rho \lesssim 0.01$ to $0.05$ (1-5% density), depending on matrix size and GPU model. Below this threshold, the sparse format's irregular memory access pattern is outweighed by the reduction in data volume.

Dense matvec has perfect memory access patterns (coalesced reads) but touches $n^2$ elements. Sparse matvec touches only $\mathrm{nnz}$ elements but with irregular access. The crossover depends on how much irregularity penalty the GPU incurs.

Example: Sparse Laplacian Matvec on GPU

Build the 2-D Laplacian stencil as a sparse CSR matrix on the GPU and apply it to a random vector. Compare speed against dense matvec.
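One way to build the stencil is the Kronecker-sum construction sketched below; the helper name `laplacian_2d` is ours, and SciPy again serves as the CPU fallback when CuPy is absent.

```python
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp
    import cupyx.scipy.sparse as xsp
except ImportError:                    # no CuPy: same code on CPU
    xp, xsp = np, sp

def laplacian_2d(n):
    """5-point Laplacian on an n x n grid as an n^2 x n^2 CSR matrix."""
    # 1-D second difference: tridiag(-1, 2, -1)
    T = xsp.diags([-xp.ones(n - 1), 2.0 * xp.ones(n), -xp.ones(n - 1)],
                  [-1, 0, 1], format="csr")
    I = xsp.identity(n, format="csr")
    # Kronecker sum: L = T (x) I + I (x) T
    return (xsp.kron(T, I) + xsp.kron(I, T)).tocsr()

n = 64
L = laplacian_2d(n)
x = xp.ones(n * n)
y = L @ x          # rows away from the boundary sum to zero
```

Applying L to the all-ones vector is a quick sanity check: interior rows have coefficients (-1, -1, 4, -1, -1) and sum to zero, while boundary rows do not.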

Sparse vs Dense GPU Matvec

Compare sparse CSR and dense matrix-vector product performance on GPU across different matrix sizes and densities.
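The comparison can be sketched as below. Timings are illustrative only: the fallback path times NumPy/SciPy on CPU when CuPy is unavailable, and on GPU a correct measurement needs the explicit synchronize shown, since kernels launch asynchronously.

```python
import time
import numpy as np
import scipy.sparse as sp

try:
    import cupy as xp
    import cupyx.scipy.sparse as xsp
    def sync():                            # wait for pending GPU kernels
        xp.cuda.Stream.null.synchronize()  # before reading the clock
except ImportError:
    xp, xsp = np, sp
    def sync():
        pass                               # nothing to wait for on CPU

def bench(fn, reps=10):
    fn(); sync()                           # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    sync()
    return (time.perf_counter() - t0) / reps

n = 2000
x = xp.ones(n)
times = {}                                 # density -> (sparse_s, dense_s)
for density in (0.001, 0.01, 0.05):
    A_sp = xsp.random(n, n, density=density, format="csr")
    A_dn = xp.asarray(A_sp.toarray())
    times[density] = (bench(lambda: A_sp @ x), bench(lambda: A_dn @ x))

for d, (ts, td) in sorted(times.items()):
    print(f"density {d:>5}: sparse {ts*1e6:8.1f} us  dense {td*1e6:8.1f} us")
```

Sweeping the density and matrix size this way locates the crossover point from the theorem empirically for your particular GPU.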


Quick Check

You are building a sparse matrix by inserting elements one at a time on the GPU. Which format should you use for construction?

CSR -- it is the fastest for computation

COO -- it allows O(1) element insertion

Dense -- convert to sparse only after filling

CSR (Compressed Sparse Row)

Sparse format storing values, column indices, and row pointers. Optimal for sparse matvec and row slicing on both CPU and GPU.

Related: COO (Coordinate Format)

COO (Coordinate Format)

Sparse format storing (row, col, value) triples. Best for construction; convert to CSR for computation.

Related: CSR (Compressed Sparse Row)

CuPy Sparse Matrices: Complete Examples

Sparse matrix creation, format conversion, matvec, and benchmarking on GPU.

# Code from: ch11/python/s04_cupy_sparse.py