The Kronecker Matvec on GPU

Kronecker Products on GPU: Exploiting Structure for Massive Speedups

The Kronecker product $\mathbf{A} \otimes \mathbf{B}$ appears in MIMO channel models, 2-D signal processing, and multidimensional PDEs. Forming the full Kronecker product of two $n \times n$ matrices creates an $n^2 \times n^2$ matrix, an $O(n^4)$ disaster in both memory and compute. The reshape trick from Chapter 6 eliminates this, and the GPU makes the remaining $O(n^3)$ matrix multiplies blazingly fast.

Definition: Kronecker Product and the Vec Trick

The Kronecker product of $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{p \times q}$ is the $mp \times nq$ block matrix:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}$$
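
For example, with $2 \times 2$ factors the block structure expands as

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 0 & 5 \\ 6 & 7 \end{bmatrix} = \begin{bmatrix} 0 & 5 & 0 & 10 \\ 6 & 7 & 12 & 14 \\ 0 & 15 & 0 & 20 \\ 18 & 21 & 24 & 28 \end{bmatrix}$$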

The vec trick avoids forming this matrix:

$$(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{X}) = \mathrm{vec}(\mathbf{B}\mathbf{X}\mathbf{A}^T)$$

Here $\mathbf{X} \in \mathbb{R}^{q \times n}$ is the matrix whose column-major flattening is the input vector. This converts an $O(mpnq)$ dense matvec into two matrix multiplies costing $O(npq + mnp)$.

Definition: Efficient Kronecker Matvec on GPU

The GPU implementation of the vec trick:

```python
import cupy as cp

def kron_matvec_gpu(A, B, x):
    """Compute (A kron B) @ x without forming the full product.

    A, B, and x are expected to already be CuPy arrays on the GPU.
    """
    m, n = A.shape
    p, q = B.shape
    X = x.reshape((q, n), order='F')   # unflatten x into the q x n matrix X
    Y = B @ X @ A.T                    # two GPU matmuls, dispatched to cuBLAS
    return Y.ravel(order='F')          # column-major flatten back to a vector
```

Both matrix multiplications dispatch to cuBLAS, running at near-peak GPU throughput. For $n = 1000$, this uses ~24 MB instead of ~8 TB for the full Kronecker product.
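
As a sanity check (a minimal sketch; the sizes are arbitrary), the routine can be compared against an explicitly formed Kronecker product at small sizes:

```python
import cupy as cp

# Sanity check on a small problem: compare against the explicit Kronecker product
m, n, p, q = 4, 3, 5, 2
A = cp.random.standard_normal((m, n))
B = cp.random.standard_normal((p, q))
x = cp.random.standard_normal(n * q)

y_fast = kron_matvec_gpu(A, B, x)   # vec trick, never forms A kron B
y_ref = cp.kron(A, B) @ x           # explicit product, feasible only at tiny sizes

print(cp.allclose(y_fast, y_ref))   # expect True
```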

When $\mathbf{A}$ and $\mathbf{B}$ are square of size $n$, the full Kronecker product is $n^2 \times n^2$. For $n = 1000$, that is $10^6 \times 10^6 = 10^{12}$ elements, impossible to store. The vec trick makes the matvec trivial.

Theorem: Kronecker Matvec Complexity: Naive vs Efficient vs GPU

For $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$ and $\mathbf{x} \in \mathbb{R}^{n^2}$:

| Method | Time | Memory |
|---|---|---|
| Naive (form $\mathbf{A} \otimes \mathbf{B}$) | $O(n^4)$ | $O(n^4)$ |
| Vec trick (CPU) | $O(n^3)$ | $O(n^2)$ |
| Vec trick (GPU) | $O(n^3 / p)$ | $O(n^2)$ |

where $p$ is the effective GPU parallelism. The GPU vec trick achieves an additional $10$--$50\times$ speedup over the CPU vec trick for large $n$.

The vec trick reduces the problem to two $n \times n$ matrix multiplies, which are the GPU's strongest operation. cuBLAS delivers near-peak TFLOPS on matrix multiplication, so the GPU further accelerates an already-efficient algorithm.
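
A rough way to see the CPU-to-GPU gap (a minimal sketch, assuming a CUDA-capable GPU and kron_matvec_gpu as defined above; the size n and the timing style are arbitrary choices):

```python
import time
import numpy as np
import cupy as cp

n = 2000
A_cpu, B_cpu = np.random.rand(n, n), np.random.rand(n, n)
x_cpu = np.random.rand(n * n)
A_gpu, B_gpu, x_gpu = map(cp.asarray, (A_cpu, B_cpu, x_cpu))

def kron_matvec_cpu(A, B, x):
    # Same vec trick, but with NumPy on the CPU
    X = x.reshape((B.shape[1], A.shape[1]), order='F')
    return (B @ X @ A.T).ravel(order='F')

t0 = time.perf_counter()
kron_matvec_cpu(A_cpu, B_cpu, x_cpu)
print(f"CPU vec trick: {time.perf_counter() - t0:.3f} s")

kron_matvec_gpu(A_gpu, B_gpu, x_gpu)   # warm-up call (kernel/plan caching)
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
kron_matvec_gpu(A_gpu, B_gpu, x_gpu)
cp.cuda.Stream.null.synchronize()      # wait for the GPU to finish before timing
print(f"GPU vec trick: {time.perf_counter() - t0:.3f} s")
```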

Theorem: Multi-Factor Kronecker Product

For a $K$-factor Kronecker product $\mathbf{M} = \mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \cdots \otimes \mathbf{A}_K$, with each $\mathbf{A}_k \in \mathbb{R}^{n_k \times n_k}$, the matvec $\mathbf{M}\mathbf{x}$ can be computed via $K$ sequential mode-$k$ products:

$$\mathbf{y} = (\mathbf{A}_1 \otimes \cdots \otimes \mathbf{A}_K)\mathbf{x} \quad\Longleftrightarrow\quad \text{reshape and multiply along each mode}$$

Total cost: $O\!\left(\sum_{k=1}^{K} n_k \cdot \prod_{j=1}^{K} n_j\right)$ versus $O\!\left(\left(\prod_k n_k\right)^2\right)$ for the naive approach.

Each factor acts on one "dimension" of the reshaped tensor. This is the same idea as separable convolution in image processing or the tensor-train decomposition in numerical linear algebra.
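
One way to realize this on the GPU (a minimal sketch; kron_matvec_multi_gpu is an illustrative name, and the loop assumes a C-order flattening of the input vector):

```python
import cupy as cp

def kron_matvec_multi_gpu(factors, x):
    """Compute (A_1 kron ... kron A_K) @ x via sequential mode-k products."""
    shape = [A.shape[1] for A in factors]
    T = x.reshape(shape)                  # unflatten x into a K-way tensor (C order)
    for k, A in enumerate(factors):
        # Contract A_k with mode k, then move the new axis back to position k
        T = cp.moveaxis(cp.tensordot(A, T, axes=(1, k)), 0, k)
    return T.ravel()                      # flatten back to a vector (C order)
```

Each pass multiplies one factor along one mode of the tensor, matching the summed cost above.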

Example: Kronecker MIMO Channel on GPU

In the Kronecker channel model, the MIMO channel matrix is $\mathbf{H} = \mathbf{R}_r^{1/2}\,\mathbf{H}_w\,\mathbf{R}_t^{1/2}$, where $\mathbf{R}_r$ and $\mathbf{R}_t$ are the receive and transmit correlation matrices. The full correlation matrix of $\mathrm{vec}(\mathbf{H})$ is $\mathbf{R}_t^T \otimes \mathbf{R}_r$. Apply this correlation to a random vector on the GPU without forming the Kronecker product.
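
One possible solution using kron_matvec_gpu from above (a minimal sketch; the antenna counts and the exponential correlation model are illustrative assumptions, not part of the exercise statement):

```python
import cupy as cp

nr, nt = 64, 32                      # receive / transmit antennas (illustrative)
rho_r, rho_t = 0.7, 0.5              # assumed exponential correlation coefficients

# Exponential correlation model: R[i, j] = rho^|i - j|
ir = cp.arange(nr)
it = cp.arange(nt)
Rr = cp.power(rho_r, cp.abs(ir[:, None] - ir[None, :]))
Rt = cp.power(rho_t, cp.abs(it[:, None] - it[None, :]))

x = cp.random.standard_normal(nr * nt)   # vec of an i.i.d. channel realization

# (Rt^T kron Rr) @ x == vec(Rr X Rt) with X = unvec(x), via the vec trick
y = kron_matvec_gpu(Rt.T, Rr, x)
print(y.shape)                           # (2048,)
```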

Example: Three-Factor Kronecker Matvec on GPU

Compute $(\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})\mathbf{x}$ for $n = 50$ without forming any Kronecker products.
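
One possible solution with a single einsum (a minimal sketch; it assumes the same C-order flattening convention as the multi-factor routine above, with random factors for illustration):

```python
import cupy as cp

n = 50
A, B, C = (cp.random.standard_normal((n, n)) for _ in range(3))
x = cp.random.standard_normal(n ** 3)

# Reshape x into an n x n x n tensor and contract one factor with each mode;
# the full Kronecker product would have n^6 = 1.56e10 entries (~125 GB in float64).
X = x.reshape(n, n, n)
y = cp.einsum('ai,bj,ck,ijk->abc', A, B, C, X).ravel()

print(y.shape)   # (125000,)
```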

Kronecker Matvec: Naive vs Efficient vs GPU

Benchmark comparing naive Kronecker product construction, CPU vec trick, and GPU vec trick across factor sizes.


Key Takeaway

Never form the full Kronecker product. The vec trick $(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{X}) = \mathrm{vec}(\mathbf{B}\mathbf{X}\mathbf{A}^T)$ reduces $O(n^4)$ to $O(n^3)$, and the GPU further accelerates the remaining matrix multiplies by $10$--$50\times$. This combination of algorithmic optimization and hardware acceleration is the paradigm for high-performance scientific computing.

Kronecker Matvec Methods Compared

| Method | Time Complexity | Memory | n = 1000 Wall Time | Implementation |
|---|---|---|---|---|
| np.kron then matvec | $O(n^4)$ | $O(n^4)$ | Impossible (8 TB) | 1 line but unusable |
| CPU vec trick | $O(n^3)$ | $O(n^2)$ | ~500 ms | 3 lines |
| GPU vec trick | $O(n^3/p)$ | $O(n^2)$ | ~20 ms | 3 lines + cp |
| Multi-factor (K=3, GPU) | $O(Kn \cdot n^K / p)$ | $O(Kn^2)$ | ~50 ms | einsum |

Kronecker Matvec on GPU: Complete Examples

Complete examples for this section (the efficient Kronecker matvec via the vec trick on GPU, the multi-factor extension, and the MIMO Kronecker channel model) are collected in ch11/python/s06_kronecker_gpu.py.