The Kronecker Matvec on GPU

Kronecker Products on GPU: Exploiting Structure for Massive Speedups

The Kronecker product $\mathbf{A} \otimes \mathbf{B}$ appears in MIMO channel models, 2-D signal processing, and multidimensional PDEs. Forming the full Kronecker product of two $n \times n$ matrices creates an $n^2 \times n^2$ matrix, an $O(n^4)$ disaster in both memory and compute. The reshape trick from Chapter 6 eliminates this, and the GPU makes the remaining $O(n^3)$ matrix multiplies blazingly fast.

Definition: Kronecker Product and the Vec Trick

The Kronecker product of $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{p \times q}$ is the $mp \times nq$ block matrix:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}$$
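
For example, with $2 \times 2$ factors the block structure expands as

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 0 & 5 \\ 6 & 7 \end{bmatrix} = \begin{bmatrix} 0 & 5 & 0 & 10 \\ 6 & 7 & 12 & 14 \\ 0 & 15 & 0 & 20 \\ 18 & 21 & 24 & 28 \end{bmatrix}$$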

The vec trick avoids forming this matrix:

$$(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{X}) = \mathrm{vec}(\mathbf{B}\mathbf{X}\mathbf{A}^T)$$

Here $\mathbf{X} \in \mathbb{R}^{q \times n}$ is the matrix whose column-major flattening is the input vector. This converts an $O(mpnq)$ dense matvec into two matrix multiplies costing $O(npq + mnp)$.

Definition: Efficient Kronecker Matvec on GPU

The GPU implementation of the vec trick:

```python
import cupy as cp

def kron_matvec_gpu(A, B, x):
    """Compute (A kron B) @ x without forming the full product.

    A, B, and x are expected to already be CuPy arrays on the GPU.
    """
    m, n = A.shape
    p, q = B.shape
    X = x.reshape((q, n), order='F')   # unflatten x into the q x n matrix X
    Y = B @ X @ A.T                    # two GPU matmuls, dispatched to cuBLAS
    return Y.ravel(order='F')          # column-major flatten back to a vector
```

Both matrix multiplications dispatch to cuBLAS, running at near-peak GPU throughput. For $n = 1000$, this uses ~24 MB instead of ~8 TB for the full Kronecker product.
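
As a sanity check (a minimal sketch; the sizes are arbitrary), the routine can be compared against an explicitly formed Kronecker product at small sizes:

```python
import cupy as cp

# Sanity check on a small problem: compare against the explicit Kronecker product
m, n, p, q = 4, 3, 5, 2
A = cp.random.standard_normal((m, n))
B = cp.random.standard_normal((p, q))
x = cp.random.standard_normal(n * q)

y_fast = kron_matvec_gpu(A, B, x)   # vec trick, never forms A kron B
y_ref = cp.kron(A, B) @ x           # explicit product, feasible only at tiny sizes

print(cp.allclose(y_fast, y_ref))   # expect True
```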

When $\mathbf{A}$ and $\mathbf{B}$ are square of size $n$, the full Kronecker product is $n^2 \times n^2$. For $n = 1000$, that is $10^6 \times 10^6 = 10^{12}$ elements, impossible to store. The vec trick makes the matvec trivial.

Theorem: Kronecker Matvec Complexity: Naive vs Efficient vs GPU

For $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$ and $\mathbf{x} \in \mathbb{R}^{n^2}$:

| Method | Time | Memory |
|---|---|---|
| Naive (form $\mathbf{A} \otimes \mathbf{B}$) | $O(n^4)$ | $O(n^4)$ |
| Vec trick (CPU) | $O(n^3)$ | $O(n^2)$ |
| Vec trick (GPU) | $O(n^3 / p)$ | $O(n^2)$ |

where $p$ is the effective GPU parallelism. The GPU vec trick achieves an additional $10$--$50\times$ speedup over the CPU vec trick for large $n$.

The vec trick reduces the problem to two $n \times n$ matrix multiplies, which are the GPU's strongest operation. cuBLAS delivers near-peak TFLOPS on matrix multiplication, so the GPU further accelerates an already-efficient algorithm.
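
A rough way to see the CPU-to-GPU gap (a minimal sketch, assuming a CUDA-capable GPU and kron_matvec_gpu as defined above; the size n and the timing style are arbitrary choices):

```python
import time
import numpy as np
import cupy as cp

n = 2000
A_cpu, B_cpu = np.random.rand(n, n), np.random.rand(n, n)
x_cpu = np.random.rand(n * n)
A_gpu, B_gpu, x_gpu = map(cp.asarray, (A_cpu, B_cpu, x_cpu))

def kron_matvec_cpu(A, B, x):
    # Same vec trick, but with NumPy on the CPU
    X = x.reshape((B.shape[1], A.shape[1]), order='F')
    return (B @ X @ A.T).ravel(order='F')

t0 = time.perf_counter()
kron_matvec_cpu(A_cpu, B_cpu, x_cpu)
print(f"CPU vec trick: {time.perf_counter() - t0:.3f} s")

kron_matvec_gpu(A_gpu, B_gpu, x_gpu)   # warm-up call (kernel/plan caching)
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
kron_matvec_gpu(A_gpu, B_gpu, x_gpu)
cp.cuda.Stream.null.synchronize()      # wait for the GPU to finish before timing
print(f"GPU vec trick: {time.perf_counter() - t0:.3f} s")
```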

Theorem: Multi-Factor Kronecker Product

For a $K$-factor Kronecker product $\mathbf{M} = \mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \cdots \otimes \mathbf{A}_K$, with each $\mathbf{A}_k \in \mathbb{R}^{n_k \times n_k}$, the matvec $\mathbf{M}\mathbf{x}$ can be computed via $K$ sequential mode-$k$ products:

$$\mathbf{y} = (\mathbf{A}_1 \otimes \cdots \otimes \mathbf{A}_K)\mathbf{x} \quad\Longleftrightarrow\quad \text{reshape and multiply along each mode}$$

Total cost: $O\!\left(\sum_{k=1}^{K} n_k \cdot \prod_{j=1}^{K} n_j\right)$ versus $O\!\left(\left(\prod_k n_k\right)^2\right)$ for the naive approach.

Each factor acts on one "dimension" of the reshaped tensor. This is the same idea as separable convolution in image processing or the tensor-train decomposition in numerical linear algebra.
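
One way to realize this on the GPU (a minimal sketch; kron_matvec_multi_gpu is an illustrative name, and the loop assumes a C-order flattening of the input vector):

```python
import cupy as cp

def kron_matvec_multi_gpu(factors, x):
    """Compute (A_1 kron ... kron A_K) @ x via sequential mode-k products."""
    shape = [A.shape[1] for A in factors]
    T = x.reshape(shape)                  # unflatten x into a K-way tensor (C order)
    for k, A in enumerate(factors):
        # Contract A_k with mode k, then move the new axis back to position k
        T = cp.moveaxis(cp.tensordot(A, T, axes=(1, k)), 0, k)
    return T.ravel()                      # flatten back to a vector (C order)
```

Each pass multiplies one factor along one mode of the tensor, matching the summed cost above.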

Example: Kronecker MIMO Channel on GPU

In the Kronecker channel model, the MIMO channel matrix is $\mathbf{H} = \mathbf{R}_r^{1/2}\,\mathbf{H}_w\,\mathbf{R}_t^{1/2}$, where $\mathbf{R}_r$ and $\mathbf{R}_t$ are the receive and transmit correlation matrices. The full correlation matrix of $\mathrm{vec}(\mathbf{H})$ is $\mathbf{R}_t^T \otimes \mathbf{R}_r$. Apply this correlation to a random vector on the GPU without forming the Kronecker product.
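
One possible solution using kron_matvec_gpu from above (a minimal sketch; the antenna counts and the exponential correlation model are illustrative assumptions, not part of the exercise statement):

```python
import cupy as cp

nr, nt = 64, 32                      # receive / transmit antennas (illustrative)
rho_r, rho_t = 0.7, 0.5              # assumed exponential correlation coefficients

# Exponential correlation model: R[i, j] = rho^|i - j|
ir = cp.arange(nr)
it = cp.arange(nt)
Rr = cp.power(rho_r, cp.abs(ir[:, None] - ir[None, :]))
Rt = cp.power(rho_t, cp.abs(it[:, None] - it[None, :]))

x = cp.random.standard_normal(nr * nt)   # vec of an i.i.d. channel realization

# (Rt^T kron Rr) @ x == vec(Rr X Rt) with X = unvec(x), via the vec trick
y = kron_matvec_gpu(Rt.T, Rr, x)
print(y.shape)                           # (2048,)
```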

Example: Three-Factor Kronecker Matvec on GPU

Compute $(\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})\mathbf{x}$ for $n = 50$ without forming any Kronecker products.
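
One possible solution with a single einsum (a minimal sketch; it assumes the same C-order flattening convention as the multi-factor routine above, with random factors for illustration):

```python
import cupy as cp

n = 50
A, B, C = (cp.random.standard_normal((n, n)) for _ in range(3))
x = cp.random.standard_normal(n ** 3)

# Reshape x into an n x n x n tensor and contract one factor with each mode;
# the full Kronecker product would have n^6 = 1.56e10 entries (~125 GB in float64).
X = x.reshape(n, n, n)
y = cp.einsum('ai,bj,ck,ijk->abc', A, B, C, X).ravel()

print(y.shape)   # (125000,)
```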

Kronecker Matvec: Naive vs Efficient vs GPU

Benchmark comparing naive Kronecker product construction, CPU vec trick, and GPU vec trick across factor sizes.


Key Takeaway

Never form the full Kronecker product. The vec trick $(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{X}) = \mathrm{vec}(\mathbf{B}\mathbf{X}\mathbf{A}^T)$ reduces $O(n^4)$ to $O(n^3)$, and the GPU further accelerates the remaining matrix multiplies by $10$--$50\times$. This combination of algorithmic optimization and hardware acceleration is the paradigm for high-performance scientific computing.

Kronecker Matvec Methods Compared

| Method | Time Complexity | Memory | n = 1000 Wall Time | Implementation |
|---|---|---|---|---|
| np.kron then matvec | $O(n^4)$ | $O(n^4)$ | Impossible (8 TB) | 1 line but unusable |
| CPU vec trick | $O(n^3)$ | $O(n^2)$ | ~500 ms | 3 lines |
| GPU vec trick | $O(n^3/p)$ | $O(n^2)$ | ~20 ms | 3 lines + cp |
| Multi-factor (K=3, GPU) | $O(Kn \cdot n^K / p)$ | $O(Kn^2)$ | ~50 ms | einsum |

Kronecker Matvec on GPU: Complete Examples

Complete examples for this section (the efficient Kronecker matvec via the vec trick on GPU, the multi-factor extension, and the MIMO Kronecker channel model) are collected in ch11/python/s06_kronecker_gpu.py.