The Kronecker Matvec on GPU
Kronecker Products on GPU: Exploiting Structure for Massive Speedups
The Kronecker product appears in MIMO channel models, 2-D signal processing, and multidimensional PDEs. Forming the full Kronecker product of an $m \times n$ matrix with a $p \times q$ matrix creates an $mp \times nq$ matrix, a memory and compute disaster. The reshape trick from Chapter 6 eliminates this, and the GPU makes the remaining matrix multiplies blazingly fast.
Definition: Kronecker Product and the Vec Trick
The Kronecker product of $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ is the $mp \times nq$ block matrix:

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}$$
The vec trick avoids forming this matrix:

$$(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}\bigl(B X A^{\top}\bigr), \qquad X \in \mathbb{R}^{q \times n}$$
This converts an $O(mnpq)$ dense matvec into two matrix multiplies costing $O(pqn + pnm)$.
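As a quick sanity check, the identity can be verified numerically against an explicitly formed Kronecker product on small matrices (a minimal NumPy sketch; the sizes here are arbitrary):

import numpy as np

# Small, arbitrary sizes so the explicit Kronecker product is cheap to form
m, n, p, q = 4, 3, 5, 2
A = np.random.randn(m, n)
B = np.random.randn(p, q)
X = np.random.randn(q, n)

lhs = np.kron(A, B) @ X.ravel(order='F')   # explicit (A kron B) @ vec(X)
rhs = (B @ X @ A.T).ravel(order='F')       # vec trick
print(np.allclose(lhs, rhs))               # True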
Definition: Efficient Kronecker Matvec on GPU
The GPU implementation of the vec trick:
import cupy as cp
def kron_matvec_gpu(A, B, x):
    """Compute (A kron B) @ x without forming the full product."""
    m, n = A.shape
    p, q = B.shape
    X = x.reshape(q, n, order='F')   # vec(X) -> q x n matrix (column-major)
    Y = B @ X @ A.T                  # two GPU matmuls
    return Y.ravel(order='F')        # back to a length m*p vector
Both matrix multiplications dispatch to cuBLAS, running at near-peak GPU throughput. For $m = n = p = q = 1000$, this uses ~24 MB instead of ~8 TB for the full Kronecker product.
When $A$ and $B$ are square of size $n \times n$, the full Kronecker product is $n^2 \times n^2$. For $n = 1000$, that is $10^{12}$ elements, impossible to store. The vec trick makes it trivial.
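Correctness is easy to confirm by comparing against an explicitly formed Kronecker product at a small size where that is still affordable (a sketch; the sizes are arbitrary):

import cupy as cp

# Small sizes so cp.kron is affordable; at realistic sizes it would not be
A = cp.random.randn(8, 6)
B = cp.random.randn(5, 7)
x = cp.random.randn(6 * 7)           # length n*q to match A kron B

y_fast = kron_matvec_gpu(A, B, x)    # vec trick
y_ref = cp.kron(A, B) @ x            # explicit product (reference only)
print(cp.allclose(y_fast, y_ref))    # True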
Theorem: Kronecker Matvec Complexity: Naive vs Efficient vs GPU
For $A, B \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^{n^2}$:
| Method | Time | Memory |
|---|---|---|
| Naive (form $A \otimes B$) | $O(n^4)$ | $O(n^4)$ |
| Vec trick (CPU) | $O(n^3)$ | $O(n^2)$ |
| Vec trick (GPU) | $O(n^3 / P)$ | $O(n^2)$ |
where $P$ is the effective GPU parallelism. The GPU vec trick achieves an additional 10-50x speedup over the CPU vec trick for large $n$.
The vec trick reduces the problem to two matrix multiplies, which are the GPU's strongest operation. cuBLAS delivers near-peak TFLOPS on matrix multiplication, so the GPU further accelerates an already-efficient algorithm.
Theorem: Multi-Factor Kronecker Product
For a $K$-factor Kronecker product $A_1 \otimes A_2 \otimes \cdots \otimes A_K$, with each $A_k \in \mathbb{R}^{n_k \times n_k}$, the matvec can be computed via sequential mode-$k$ products:

$$(A_1 \otimes \cdots \otimes A_K)\,\mathrm{vec}(\mathcal{X}) = \mathrm{vec}\bigl(\mathcal{X} \times_1 A_K \times_2 A_{K-1} \cdots \times_K A_1\bigr), \qquad \mathcal{X} \in \mathbb{R}^{n_K \times \cdots \times n_1}$$

where $\mathrm{vec}$ stacks entries in column-major order.
Total cost: $O(K n^{K+1})$ for equal factor sizes $n_k = n$, versus $O(n^{2K})$ for the naive approach.
Each factor acts on one "dimension" of the reshaped tensor. This is the same idea as separable convolution in image processing or the tensor-train decomposition in numerical linear algebra.
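A general routine for an arbitrary number of square factors can be sketched as follows (a sketch only; kron_matvec_multi_gpu is a hypothetical helper, not a library function, and it assumes the column-major layout used above):

import cupy as cp

def kron_matvec_multi_gpu(factors, x):
    """Compute (factors[0] kron ... kron factors[-1]) @ x via mode-k products."""
    sizes = [A.shape[0] for A in factors]        # assumes square factors
    # Column-major reshape: the first factor acts on the last tensor axis
    X = x.reshape(sizes[::-1], order='F')
    K = len(factors)
    for k, A in enumerate(factors):
        axis = K - 1 - k                         # factor k acts on axis K-1-k
        X = cp.moveaxis(cp.tensordot(A, X, axes=([1], [axis])), 0, axis)
    return X.ravel(order='F')

# Usage: matches the explicit product at small sizes
A = cp.random.randn(3, 3)
B = cp.random.randn(4, 4)
C = cp.random.randn(2, 2)
x = cp.random.randn(3 * 4 * 2)
y = kron_matvec_multi_gpu([A, B, C], x)
print(cp.allclose(y, cp.kron(cp.kron(A, B), C) @ x))   # True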
Example: Kronecker MIMO Channel on GPU
In the Kronecker channel model, the MIMO channel matrix is $H = R_r^{1/2} H_w R_t^{1/2}$, where $R_r$ and $R_t$ are receive and transmit correlation matrices and $H_w$ has i.i.d. entries. The full correlation matrix of $\mathrm{vec}(H)$ is $R_t^{\top} \otimes R_r$. Apply this correlation to a random vector on the GPU without forming the Kronecker product.
Implementation
import cupy as cp
nr, nt = 64, 32 # massive MIMO dimensions
rho_r, rho_t = 0.7, 0.5
# Correlation matrices
R_r = cp.array([[rho_r ** abs(i - j) for j in range(nr)]
                for i in range(nr)])
R_t = cp.array([[rho_t ** abs(i - j) for j in range(nt)]
                for i in range(nt)])
# Random vector (i.i.d. channel coefficients)
x = cp.random.randn(nr * nt) + 1j * cp.random.randn(nr * nt)
# Apply (R_t^T kron R_r) @ x via vec trick
X = x.reshape(nr, nt, order='F')
Y = R_r @ X @ R_t # R_t^T -> R_t for symmetric R_t
h_corr = Y.ravel(order='F')
print(f"Input shape: {x.shape}")
print(f"Output shape: {h_corr.shape}")
print(f"Memory saved: {nr*nt*nr*nt*16/1e9:.1f} GB avoided")
Memory analysis
Full Kronecker product: $(64 \cdot 32)^2 \approx 4.2$ million complex elements $\approx 67$ MB at complex128. The vec trick stores only the two correlation matrices and the channel vector, roughly $10^4$ elements and well under 1 MB, a reduction of more than two orders of magnitude.
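Because the full product is only about 67 MB here, the shortcut of using R_t in place of R_t.T can be verified directly against an explicit Kronecker product (a sketch reusing the arrays defined above):

# Explicit reference (feasible only because nr*nt = 2048 is small)
R_full = cp.kron(R_t.T, R_r)
h_ref = R_full @ x
print(cp.allclose(h_corr, h_ref))   # True: R_t is symmetric, so R_t.T == R_t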
Example: Three-Factor Kronecker Matvec on GPU
Compute $y = (A \otimes B \otimes C)\,x$ for $A, B, C \in \mathbb{R}^{50 \times 50}$ and $x \in \mathbb{R}^{50^3}$ without forming any Kronecker products.
Implementation
import cupy as cp
n = 50
A = cp.random.randn(n, n)
B = cp.random.randn(n, n)
C = cp.random.randn(n, n)
x = cp.random.randn(n**3)
# Reshape to 3-D tensor, apply factors sequentially
X = x.reshape(n, n, n, order='F')
# Mode-3 multiply (A acts on last dimension)
Y = cp.einsum('ijk,lk->ijl', X, A)
# Mode-2 multiply (B acts on middle dimension)
Y = cp.einsum('ijk,lj->ilk', Y, B)
# Mode-1 multiply (C acts on first dimension)
Y = cp.einsum('ijk,li->ljk', Y, C)
y = Y.ravel(order='F')
print(f"Full Kronecker size: {n**3}x{n**3} = {n**6:.0e} elements")
print(f"Memory for full: {n**6 * 8 / 1e12:.1f} TB")
print(f"Memory used: {3 * n**2 * 8 / 1e6:.1f} MB")
Why einsum on GPU
CuPy's einsum dispatches to cuBLAS under the hood, making each mode-$k$ product a GPU-optimized matrix multiply.
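To see why each mode product is really a single matrix multiply, the mode-3 step above can be rewritten as an explicit reshape-plus-matmul (a sketch reusing X, A, and n from the example; both forms give the same result):

# Mode-3 product via einsum (as in the example)
Y_einsum = cp.einsum('ijk,lk->ijl', X, A)

# Same product as one GEMM: collapse the first two axes, multiply by A^T,
# then restore the 3-D shape
Y_matmul = (X.reshape(n * n, n) @ A.T).reshape(n, n, n)
print(cp.allclose(Y_einsum, Y_matmul))   # True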
The three-factor case would require $50^6 \approx 1.6 \times 10^{10}$ elements (about 125 GB) to store the full product, far beyond typical GPU memory.
Kronecker Matvec: Naive vs Efficient vs GPU
Benchmark comparing naive Kronecker product construction, CPU vec trick, and GPU vec trick across factor sizes.
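A minimal version of such a benchmark can be sketched as follows (a sketch only: the sizes, repetition count, and timing approach are assumptions, and the naive method is skipped once the full product would no longer fit in memory):

import time
import numpy as np
import cupy as cp

def bench(fn, sync=None, reps=5):
    """Best-of-reps wall time in milliseconds."""
    best = float('inf')
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        if sync is not None:
            sync()                                    # wait for the GPU to finish
        best = min(best, time.perf_counter() - t0)
    return best * 1e3

for n in (50, 100, 1000):
    A_cpu, B_cpu = np.random.randn(n, n), np.random.randn(n, n)
    x_cpu = np.random.randn(n * n)
    A_gpu, B_gpu, x_gpu = map(cp.asarray, (A_cpu, B_cpu, x_cpu))

    if n <= 100:                                      # full product is n^2 x n^2 doubles
        t_naive = bench(lambda: np.kron(A_cpu, B_cpu) @ x_cpu)
    else:
        t_naive = float('nan')                        # ~8 TB at n = 1000: skipped

    t_cpu = bench(lambda: B_cpu @ x_cpu.reshape(n, n, order='F') @ A_cpu.T)
    t_gpu = bench(lambda: B_gpu @ x_gpu.reshape(n, n, order='F') @ A_gpu.T,
                  sync=cp.cuda.Device().synchronize)

    print(f"n={n:5d}  naive={t_naive:8.1f} ms  CPU vec={t_cpu:8.1f} ms  GPU vec={t_gpu:8.1f} ms")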
Key Takeaway
Never form the full Kronecker product. The vec trick reduces $O(n^4)$ to $O(n^3)$, and the GPU further accelerates the remaining matrix multiplies by 10-50x. This combination of algorithmic optimization and hardware acceleration is the paradigm for high-performance scientific computing.
Kronecker Matvec Methods Compared
| Method | Time Complexity | Memory | n=1000 Wall Time | Implementation |
|---|---|---|---|---|
| np.kron then matvec | $O(n^4)$ | $O(n^4)$ | Impossible (8 TB) | 1 line but unusable |
| CPU vec trick | $O(n^3)$ | $O(n^2)$ | ~500 ms | 3 lines |
| GPU vec trick | $O(n^3 / P)$ | $O(n^2)$ | ~20 ms | 3 lines + cp |
| Multi-factor (K=3, GPU) | $O(K n^{K+1})$ | $O(n^K + K n^2)$ | ~50 ms | einsum |
Kronecker Matvec on GPU: Complete Examples
# Code from: ch11/python/s06_kronecker_gpu.py