Efficient LMMSE via Kronecker Structure

The Computational Bottleneck of OAMP

The OAMP algorithm (Section 17.2) requires computing the LMMSE filter $\mathbf{W}_t = v_2^{t-1}\mathbf{A}^{H}(\sigma^2 \mathbf{I} + v_2^{t-1}\mathbf{A}\mathbf{A}^{H})^{-1}$ at each iteration. For a general $M \times N$ matrix, this costs $O(\min(M,N) \cdot MN)$, dominated by the SVD preprocessing.

In RF imaging, however, $\mathbf{A}$ has Kronecker structure: $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ (and sometimes $\mathbf{A}_{1} \otimes \mathbf{A}_{2} \otimes \mathbf{A}_{3}$ for 3D scenes). This structure can be exploited to reduce the LMMSE computation from $O(N^3)$ to $O(N_1^3 + N_2^3)$, where $N = N_1 N_2$, often a reduction by a factor of $10^3$ or more.

Definition: Kronecker-Structured Sensing Matrix

When the transmit array, receive array, and frequency sampling contribute independently to the sensing operator, the discretized forward model has the form

$$\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2},$$

where $\mathbf{A}_{1} \in \mathbb{C}^{M_1 \times N_1}$ captures one spatial/frequency dimension and $\mathbf{A}_{2} \in \mathbb{C}^{M_2 \times N_2}$ captures the other. The product dimensions are $M = M_1 M_2$ and $N = N_1 N_2$.

More generally, for 3D imaging with independent array and frequency sampling:

$$\mathbf{A} = \mathbf{A}_{\text{freq}} \otimes \mathbf{A}_{\text{Rx}} \otimes \mathbf{A}_{\text{Tx}}.$$

The Kronecker structure arises from the separability of the steering vectors in the forward model: $[\mathbf{A}]_{(m_1, m_2), (n_1, n_2)} = [\mathbf{A}_{1}]_{m_1, n_1} \cdot [\mathbf{A}_{2}]_{m_2, n_2}$. This is exact for UPAs/ULAs in the far field, and approximate for near-field arrays where cross-coupling is small.
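This separability can be checked directly: with toy phase matrices standing in for the actual steering responses, building $\mathbf{A}$ entry by entry from the separable product reproduces the Kronecker form. A minimal NumPy sketch (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M1, N1, M2, N2 = 3, 4, 2, 5
# Toy "steering" factors: unit-modulus phase entries, one per dimension
A1 = np.exp(1j * rng.uniform(0, 2 * np.pi, (M1, N1)))
A2 = np.exp(1j * rng.uniform(0, 2 * np.pi, (M2, N2)))

# Build A entry-wise from the separable product, using the index
# convention rows (m1, m2) -> m1*M2 + m2, cols (n1, n2) -> n1*N2 + n2
A = np.zeros((M1 * M2, N1 * N2), dtype=complex)
for m1 in range(M1):
    for m2 in range(M2):
        for n1 in range(N1):
            for n2 in range(N2):
                A[m1 * M2 + m2, n1 * N2 + n2] = A1[m1, n1] * A2[m2, n2]

# The entry-wise construction coincides with the Kronecker product
assert np.allclose(A, np.kron(A1, A2))
```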

Theorem: LMMSE Factorization via Kronecker Structure

Let $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ with SVDs $\mathbf{A}_{k} = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^H$ for $k = 1, 2$. Then the LMMSE filter in the OAMP iteration can be written as

$$\mathbf{W}_t = (\mathbf{V}_1 \otimes \mathbf{V}_2)\, \boldsymbol{\Lambda}_t\, (\mathbf{U}_1 \otimes \mathbf{U}_2)^H,$$

where $\boldsymbol{\Lambda}_t$ is a diagonal matrix with entries

$$[\boldsymbol{\Lambda}_t]_{(i,j),(i,j)} = \frac{v_2^{t-1}\,s_{1,i}\,s_{2,j}}{\sigma^2 + v_2^{t-1}\,s_{1,i}^2\,s_{2,j}^2},$$

and $s_{1,i}$, $s_{2,j}$ are the singular values of $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$.

The LMMSE MSE is

$$v_1^t = \frac{1}{N}\sum_{i=1}^{r_1}\sum_{j=1}^{r_2} \frac{\sigma^2\,v_2^{t-1}}{\sigma^2 + v_2^{t-1}\,s_{1,i}^2\,s_{2,j}^2} + \frac{N - r_1 r_2}{N}\,v_2^{t-1},$$

where $r_k = \operatorname{rank}(\mathbf{A}_{k})$.
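The MSE expression can be checked numerically: by the standard LMMSE error-covariance identity, the MSE equals $\frac{1}{N}\operatorname{tr}\big((\mathbf{I}/v_2^{t-1} + \mathbf{A}^H\mathbf{A}/\sigma^2)^{-1}\big)$, and the factored double sum must reproduce it. A small real-valued NumPy sketch with arbitrary toy dimensions and parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
M1, N1, M2, N2 = 3, 4, 2, 5
A1 = rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2))
sigma2, v2 = 0.1, 0.5          # sigma^2 and v_2^{t-1}

A = np.kron(A1, A2)
N = N1 * N2

# Direct: v1 = (1/N) tr( (I/v2 + A^H A / sigma^2)^{-1} )
E = np.linalg.inv(np.eye(N) / v2 + A.T @ A / sigma2)
v1_direct = np.trace(E) / N

# Factored: singular values of A are the products s1[i] * s2[j]
s1 = np.linalg.svd(A1, compute_uv=False)
s2 = np.linalg.svd(A2, compute_uv=False)
s = np.outer(s1, s2).ravel()
r = s.size                      # r1 * r2 (both factors full rank here)
v1_fact = (np.sum(sigma2 * v2 / (sigma2 + v2 * s**2)) + (N - r) * v2) / N

assert np.isclose(v1_direct, v1_fact)
```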

The Kronecker product of the two SVDs gives an SVD of the full matrix: $\mathbf{A} = (\mathbf{U}_1 \otimes \mathbf{U}_2) (\boldsymbol{\Sigma}_1 \otimes \boldsymbol{\Sigma}_2) (\mathbf{V}_1 \otimes \mathbf{V}_2)^H$ (up to a reordering of the singular values, which is immaterial here). The LMMSE filter, which depends on $\mathbf{A}$ only through its SVD, therefore factors as well. We never need to form or decompose the full $M \times N$ matrix; all computations use the small $M_k \times N_k$ factors.
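A small NumPy sanity check (random complex toy matrices, arbitrary dimensions and variances) that builds the filter both directly and via the two small SVDs:

```python
import numpy as np

rng = np.random.default_rng(0)
M1, N1, M2, N2 = 3, 4, 2, 5
A1 = rng.standard_normal((M1, N1)) + 1j * rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2)) + 1j * rng.standard_normal((M2, N2))
sigma2, v2 = 0.1, 0.5          # sigma^2 and v_2^{t-1}

# Direct filter: W = v2 A^H (sigma^2 I + v2 A A^H)^{-1}
A = np.kron(A1, A2)
M, N = A.shape
W_full = v2 * A.conj().T @ np.linalg.inv(sigma2 * np.eye(M) + v2 * A @ A.conj().T)

# Factored filter: two small SVDs plus a rectangular diagonal core Lambda_t
U1, s1, V1h = np.linalg.svd(A1)
U2, s2, V2h = np.linalg.svd(A2)
Lam = np.zeros((N, M))
for i, s1i in enumerate(s1):
    for j, s2j in enumerate(s2):
        s = s1i * s2j                       # singular value of A1 (x) A2
        Lam[i * N2 + j, i * M2 + j] = v2 * s / (sigma2 + v2 * s**2)
W_fact = np.kron(V1h.conj().T, V2h.conj().T) @ Lam @ np.kron(U1, U2).conj().T

assert np.allclose(W_full, W_fact)
```

In practice the Kronecker factors of $\mathbf{V}$ and $\mathbf{U}$ are of course never multiplied out either; they are applied with the vec trick described next.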

Definition: Efficient Matrix-Vector Product via Kronecker Structure

For $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ and $\mathbf{c} \in \mathbb{C}^{N_1 N_2}$, the matrix-vector product $\mathbf{A}\mathbf{c}$ can be computed as:

  1. Reshape $\mathbf{c}$ into a matrix $\mathbf{C} \in \mathbb{C}^{N_2 \times N_1}$, whose $n_1$-th column is the $n_1$-th length-$N_2$ block of $\mathbf{c}$.
  2. Compute $\mathbf{Y} = \mathbf{A}_{2} \mathbf{C} \mathbf{A}_{1}^{\mathsf{T}}$.
  3. Vectorize: $\mathbf{y} = \operatorname{vec}(\mathbf{Y})$ (column-stacking).

Cost: $O(M_2 N_1 N_2 + M_1 M_2 N_1)$ instead of $O(M_1 M_2 N_1 N_2) = O(MN)$. For square problems ($M_k \approx N_k \approx \sqrt{N}$), this is $O(N^{3/2})$ versus $O(N^2)$.
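The three steps above can be sketched in NumPy as follows (toy dimensions; the naive product is formed only to verify the result):

```python
import numpy as np

rng = np.random.default_rng(1)
M1, N1, M2, N2 = 4, 6, 3, 5
A1 = rng.standard_normal((M1, N1)) + 1j * rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2)) + 1j * rng.standard_normal((M2, N2))
c = rng.standard_normal(N1 * N2) + 1j * rng.standard_normal(N1 * N2)

# Naive product: forms the full M x N Kronecker matrix
y_naive = np.kron(A1, A2) @ c

# Vec trick: reshape, two small GEMMs, re-vectorize.
C = c.reshape(N1, N2).T        # N2 x N1: column n1 holds the n1-th block of c
Y = A2 @ C @ A1.T              # M2 x M1
y_fast = Y.T.reshape(-1)       # vec(Y): stack the columns of Y

assert np.allclose(y_naive, y_fast)
```

Note the plain transpose $\mathbf{A}_{1}^{\mathsf{T}}$ (not the conjugate transpose): this is the standard identity $(\mathbf{A}_1 \otimes \mathbf{A}_2)\operatorname{vec}(\mathbf{C}) = \operatorname{vec}(\mathbf{A}_2 \mathbf{C} \mathbf{A}_1^{\mathsf{T}})$.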

Example: Speedup from Kronecker Factorization

Compare the computational cost of OAMP with and without Kronecker exploitation for a 2D RF imaging problem with $N_1 = N_2 = 64$ (so $N = 4096$) and $M_1 = M_2 = 40$ (so $M = 1600$).
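A back-of-envelope comparison in Python, using the $\min(M,N)\cdot MN$ SVD cost model from above with constants dropped:

```python
# Rough SVD-preprocessing FLOP counts for the example
# (constants dropped; cost model min(M, N) * M * N per SVD).
M1 = M2 = 40
N1 = N2 = 64
M, N = M1 * M2, N1 * N2        # 1600 measurements, 4096 voxels

naive = min(M, N) * M * N                                   # one big SVD
factored = min(M1, N1) * M1 * N1 + min(M2, N2) * M2 * N2    # two small SVDs

print(f"naive:    {naive:.2e} FLOPs")
print(f"factored: {factored:.2e} FLOPs")
print(f"speedup:  ~{naive // factored}x")
```

Under this crude model the two small SVDs are cheaper by a factor of roughly $5 \times 10^4$, consistent with the claimed reductions of $10^3$ or more.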

[Interactive plot: FLOP cost of OAMP with naive SVD vs Kronecker-factored SVD as the problem size grows.]
🎓 CommIT Contribution (2026)

Kronecker-Structured OAMP for RF Imaging

G. Caire, A. Rezaei, TU Berlin Technical Report

The CommIT group's RF imaging simulator implements the Kronecker-factored OAMP algorithm described in this section. The key contribution is the end-to-end pipeline:

  1. Forward model construction that automatically detects and exploits Kronecker separability in the array/frequency geometry.
  2. GPU-accelerated LMMSE using batched small-matrix SVDs instead of a single large SVD, enabling real-time reconstruction for moderate problem sizes.
  3. Hutchinson trace estimator integration for the divergence computation, avoiding the need to differentiate through the LMMSE step.

The simulator achieves reconstruction times under 1 second for $N = 4096$ voxels on a single GPU, compared to $\sim$30 seconds for naive OAMP.


Definition: Hutchinson Trace Estimator

The OAMP divergence computation requires $\operatorname{div}(\eta) = \frac{1}{N}\operatorname{tr}(\mathbf{J}_\eta)$, where $\mathbf{J}_\eta$ is the Jacobian of the denoiser. For black-box denoisers, we use the Hutchinson estimator:

$$\widehat{\operatorname{div}}(\eta) = \frac{1}{N}\, \boldsymbol{\epsilon}^H\, \frac{\eta(\mathbf{r} + h\boldsymbol{\epsilon}) - \eta(\mathbf{r})}{h},$$

where ϡ∼CN(0,I)\boldsymbol{\epsilon} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}) (or Rademacher entries) and h>0h > 0 is a small perturbation.

Properties:

  • Unbiased as $h \to 0$: $\mathbb{E}[\widehat{\operatorname{div}}] = \operatorname{div}(\eta)$.
  • One extra denoiser evaluation per estimate.
  • Variance scales as $O(1/N)$, so a single probe vector suffices for large $N$.

For analytical denoisers (soft thresholding, BG-MMSE), the divergence can be computed in closed form. The Hutchinson estimator is essential for learned denoisers (DnCNN, U-Net) where the Jacobian is not available analytically.
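As a sanity check, the following sketch compares the single-probe Hutchinson estimate against the closed-form divergence of real-valued soft thresholding, which is simply the fraction of entries surviving the threshold; all parameter values are arbitrary:

```python
import numpy as np

def soft_threshold(r, tau):
    """eta(r) = sign(r) * max(|r| - tau, 0), applied component-wise."""
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

rng = np.random.default_rng(2)
N, tau, h = 4096, 0.5, 1e-4
r = rng.standard_normal(N)

# Closed form: the Jacobian is diagonal, 1 where |r_i| > tau and 0 elsewhere
div_exact = np.mean(np.abs(r) > tau)

# Hutchinson estimate: one Rademacher probe, one extra denoiser call
eps = rng.choice([-1.0, 1.0], size=N)
div_hat = eps @ (soft_threshold(r + h * eps, tau) - soft_threshold(r, tau)) / (h * N)

assert abs(div_hat - div_exact) < 1e-2
```

Because the denoiser is piecewise linear and the probe is Rademacher, the finite difference is exact except for the few entries within $h$ of a kink, so the estimate matches the closed form almost perfectly here.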

⚠️ Engineering Note

GPU Implementation of Kronecker OAMP

The Kronecker-factored OAMP maps naturally to GPU computation:

  • Batched SVD: The SVDs of $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ are small ($\sim 64 \times 64$) and computed once. Libraries like cuSOLVER handle this efficiently.
  • Matrix-matrix products: The LMMSE step reduces to multiplications of the form $\mathbf{A}_{2} \mathbf{C} \mathbf{A}_{1}^{\mathsf{T}}$, dense GEMM operations that achieve near-peak GPU throughput.
  • Denoiser: Component-wise denoisers are embarrassingly parallel. Learned denoisers (CNNs) run natively on GPU.

Practical timing (NVIDIA A100, single precision):

  • $N = 4096$ ($64 \times 64$ scene): $\sim$0.8 s for 20 OAMP iterations.
  • $N = 16384$ ($128 \times 128$ scene): $\sim$6 s.
  • $N = 65536$ ($256 \times 256$ scene): $\sim$45 s (memory-limited).

Practical Constraints

  • GPU memory limits the maximum scene size; for $N > 2^{16}$, the singular value arrays alone require >1 GB.
  • The Kronecker factorization is exact only for separable geometries; non-separable corrections require iterative methods.
🔧 Engineering Note

Numerical Precision of the Hutchinson Estimator

The perturbation size $h$ in the Hutchinson estimator creates a bias-variance tradeoff:

  • Too small ($h < 10^{-7}$ in single precision): Numerical cancellation dominates, and the estimate is pure noise.
  • Too large ($h > 0.1$): The finite-difference approximation to the directional derivative is biased.
  • Sweet spot: $h \approx N^{-1/3} \cdot \|\mathbf{r}\|$ balances bias and variance.

For learned denoisers with ReLU activations, the Jacobian is piecewise constant, so the finite-difference approximation is exact for any $h$ small enough to stay within one linear region.

Practical Constraints

  • Use float64 for the Hutchinson estimate even if the rest of the computation is float32.
  • Average over 3–5 probe vectors for variance reduction when $N < 500$.

Common Mistake: Kronecker Structure Is Approximate in Practice

Mistake:

Assuming exact Kronecker structure $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ for all RF imaging geometries and applying the factored LMMSE without checking the approximation quality.

Correction:

The Kronecker factorization is exact for:

  • Far-field UPA/ULA arrays with uniform frequency sampling.
  • Planar target regions perpendicular to the boresight.

It is approximate (but usually good) for:

  • Non-uniform array geometries.
  • Near-field imaging (where the wavefront curvature introduces cross-terms).

Always compute the relative approximation error $\|\mathbf{A} - \mathbf{A}_{1} \otimes \mathbf{A}_{2}\|_F / \|\mathbf{A}\|_F$ before using the factored LMMSE. If the error exceeds a few percent, use iterative refinement or the full (unfactored) LMMSE.
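One way to obtain the factors and the error together is the nearest-Kronecker-product rearrangement of Van Loan and Pitsianis: reshape $\mathbf{A}$ so that $\mathbf{A}_{1} \otimes \mathbf{A}_{2}$ becomes a rank-one matrix, then take the leading singular pair. A NumPy sketch (function name and interface are illustrative):

```python
import numpy as np

def kron_fit(A, M1, N1, M2, N2):
    """Best Kronecker fit A ~ kron(A1, A2) and its relative Frobenius error.

    Rearranges A so that kron(A1, A2) maps to the rank-one matrix
    vec(A1) vec(A2)^T, then takes the leading singular pair.
    """
    R = (A.reshape(M1, M2, N1, N2)      # [m1, m2, n1, n2]
          .transpose(0, 2, 1, 3)        # [m1, n1, m2, n2]
          .reshape(M1 * N1, M2 * N2))
    U, s, Vh = np.linalg.svd(R, full_matrices=False)
    A1 = (np.sqrt(s[0]) * U[:, 0]).reshape(M1, N1)
    A2 = (np.sqrt(s[0]) * Vh[0]).reshape(M2, N2)
    err = np.linalg.norm(A - np.kron(A1, A2)) / np.linalg.norm(A)
    return A1, A2, err

# For an exactly separable A, the recovered error is (numerically) zero
rng = np.random.default_rng(3)
A = np.kron(rng.standard_normal((3, 4)), rng.standard_normal((2, 5)))
_, _, err = kron_fit(A, 3, 4, 2, 5)
assert err < 1e-10
```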

Quick Check

For a 2D imaging problem with $N_1 = N_2 = n$ and $M_1 = M_2 = m$, what is the SVD preprocessing cost of Kronecker-factored OAMP?

  • $O(n^3)$
  • $O(2 \cdot m n \cdot \min(m,n))$
  • $O(m^2 n^2 \min(m^2, n^2))$

Kronecker product

For $\mathbf{A} \in \mathbb{C}^{m \times n}$ and $\mathbf{B} \in \mathbb{C}^{p \times q}$, $\mathbf{A} \otimes \mathbf{B} \in \mathbb{C}^{mp \times nq}$ is the block matrix with $(i,j)$-th block $A_{ij}\mathbf{B}$. Key property: $(\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = (\mathbf{A}\mathbf{C}) \otimes (\mathbf{B}\mathbf{D})$.

Related: Kronecker-Structured Sensing Matrix, LMMSE Factorization via Kronecker Structure
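The mixed-product property can be verified in a few lines of NumPy (shapes chosen only so the products are conformable):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 3)); C = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 2)); D = rng.standard_normal((2, 3))

# (A (x) B)(C (x) D) == (AC) (x) (BD)
lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
assert np.allclose(lhs, rhs)
```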

Hutchinson trace estimator

A stochastic estimator for the trace of a matrix: $\operatorname{tr}(\mathbf{A}) \approx \boldsymbol{\epsilon}^H \mathbf{A}\boldsymbol{\epsilon}$, with $\boldsymbol{\epsilon}$ a random vector with i.i.d. entries satisfying $\mathbb{E}[\epsilon_i \bar{\epsilon}_j] = \delta_{ij}$. Used in OAMP to estimate the denoiser divergence without computing the full Jacobian.

Related: Hutchinson Trace Estimator