Efficient LMMSE via Kronecker Structure

The Computational Bottleneck of OAMP

The OAMP algorithm (Section 17.2) requires computing the LMMSE filter $\mathbf{W}_t = v_2^{t-1}\mathbf{A}^{H}(\sigma^2 \mathbf{I} + v_2^{t-1}\mathbf{A}\mathbf{A}^{H})^{-1}$ at each iteration. For a general $M \times N$ matrix, this costs $O(\min(M,N) \cdot MN)$, dominated by the SVD preprocessing.

In RF imaging, however, $\mathbf{A}$ has Kronecker structure: $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ (and sometimes $\mathbf{A}_{1} \otimes \mathbf{A}_{2} \otimes \mathbf{A}_{3}$ for 3D scenes). This structure can be exploited to reduce the LMMSE computation from $O(N^3)$ to $O(N_1^3 + N_2^3)$, where $N = N_1 N_2$, often a reduction by a factor of $10^3$ or more.

Definition: Kronecker-Structured Sensing Matrix

When the transmit array, receive array, and frequency sampling contribute independently to the sensing operator, the discretized forward model has the form

$$\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2},$$

where $\mathbf{A}_{1} \in \mathbb{C}^{M_1 \times N_1}$ captures one spatial/frequency dimension and $\mathbf{A}_{2} \in \mathbb{C}^{M_2 \times N_2}$ captures the other. The product dimensions are $M = M_1 M_2$ and $N = N_1 N_2$.

More generally, for 3D imaging with independent array and frequency sampling:

$$\mathbf{A} = \mathbf{A}_{\text{freq}} \otimes \mathbf{A}_{\text{Rx}} \otimes \mathbf{A}_{\text{Tx}}.$$

The Kronecker structure arises from the separability of the steering vectors in the forward model: $[\mathbf{A}]_{(m_1, m_2), (n_1, n_2)} = [\mathbf{A}_{1}]_{m_1, n_1} \cdot [\mathbf{A}_{2}]_{m_2, n_2}$. This is exact for UPAs/ULAs in the far field, and approximate for near-field arrays where cross-coupling is small.
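This separability can be checked directly: with toy phase matrices standing in for the actual steering responses, building $\mathbf{A}$ entry by entry from the separable product reproduces the Kronecker form. A minimal NumPy sketch (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M1, N1, M2, N2 = 3, 4, 2, 5
# Toy "steering" factors: unit-modulus phase entries, one per dimension
A1 = np.exp(1j * rng.uniform(0, 2 * np.pi, (M1, N1)))
A2 = np.exp(1j * rng.uniform(0, 2 * np.pi, (M2, N2)))

# Build A entry-wise from the separable product, using the index
# convention rows (m1, m2) -> m1*M2 + m2, cols (n1, n2) -> n1*N2 + n2
A = np.zeros((M1 * M2, N1 * N2), dtype=complex)
for m1 in range(M1):
    for m2 in range(M2):
        for n1 in range(N1):
            for n2 in range(N2):
                A[m1 * M2 + m2, n1 * N2 + n2] = A1[m1, n1] * A2[m2, n2]

# The entry-wise construction coincides with the Kronecker product
assert np.allclose(A, np.kron(A1, A2))
```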

Theorem: LMMSE Factorization via Kronecker Structure

Let $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ with SVDs $\mathbf{A}_{k} = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^H$ for $k = 1, 2$. Then the LMMSE filter in the OAMP iteration can be written as

$$\mathbf{W}_t = (\mathbf{V}_1 \otimes \mathbf{V}_2)\, \boldsymbol{\Lambda}_t\, (\mathbf{U}_1 \otimes \mathbf{U}_2)^H,$$

where $\boldsymbol{\Lambda}_t$ is a diagonal matrix with entries

$$[\boldsymbol{\Lambda}_t]_{(i,j),(i,j)} = \frac{v_2^{t-1}\,s_{1,i}\,s_{2,j}}{\sigma^2 + v_2^{t-1}\,s_{1,i}^2\,s_{2,j}^2},$$

and $s_{1,i}$, $s_{2,j}$ are the singular values of $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$.

The LMMSE MSE is

$$v_1^t = \frac{1}{N}\sum_{i=1}^{r_1}\sum_{j=1}^{r_2} \frac{\sigma^2\,v_2^{t-1}}{\sigma^2 + v_2^{t-1}\,s_{1,i}^2\,s_{2,j}^2} + \frac{N - r_1 r_2}{N}\,v_2^{t-1},$$

where $r_k = \operatorname{rank}(\mathbf{A}_{k})$.
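The MSE expression can be checked numerically: by the standard LMMSE error-covariance identity, the MSE equals $\frac{1}{N}\operatorname{tr}\big((\mathbf{I}/v_2^{t-1} + \mathbf{A}^H\mathbf{A}/\sigma^2)^{-1}\big)$, and the factored double sum must reproduce it. A small real-valued NumPy sketch with arbitrary toy dimensions and parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
M1, N1, M2, N2 = 3, 4, 2, 5
A1 = rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2))
sigma2, v2 = 0.1, 0.5          # sigma^2 and v_2^{t-1}

A = np.kron(A1, A2)
N = N1 * N2

# Direct: v1 = (1/N) tr( (I/v2 + A^H A / sigma^2)^{-1} )
E = np.linalg.inv(np.eye(N) / v2 + A.T @ A / sigma2)
v1_direct = np.trace(E) / N

# Factored: singular values of A are the products s1[i] * s2[j]
s1 = np.linalg.svd(A1, compute_uv=False)
s2 = np.linalg.svd(A2, compute_uv=False)
s = np.outer(s1, s2).ravel()
r = s.size                      # r1 * r2 (both factors full rank here)
v1_fact = (np.sum(sigma2 * v2 / (sigma2 + v2 * s**2)) + (N - r) * v2) / N

assert np.isclose(v1_direct, v1_fact)
```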

The Kronecker product of the two SVDs gives an SVD of the full matrix: $\mathbf{A} = (\mathbf{U}_1 \otimes \mathbf{U}_2) (\boldsymbol{\Sigma}_1 \otimes \boldsymbol{\Sigma}_2) (\mathbf{V}_1 \otimes \mathbf{V}_2)^H$ (up to a reordering of the singular values, which is immaterial here). The LMMSE filter, which depends on $\mathbf{A}$ only through its SVD, therefore factors as well. We never need to form or decompose the full $M \times N$ matrix; all computations use the small $M_k \times N_k$ factors.
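A small NumPy sanity check (random complex toy matrices, arbitrary dimensions and variances) that builds the filter both directly and via the two small SVDs:

```python
import numpy as np

rng = np.random.default_rng(0)
M1, N1, M2, N2 = 3, 4, 2, 5
A1 = rng.standard_normal((M1, N1)) + 1j * rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2)) + 1j * rng.standard_normal((M2, N2))
sigma2, v2 = 0.1, 0.5          # sigma^2 and v_2^{t-1}

# Direct filter: W = v2 A^H (sigma^2 I + v2 A A^H)^{-1}
A = np.kron(A1, A2)
M, N = A.shape
W_full = v2 * A.conj().T @ np.linalg.inv(sigma2 * np.eye(M) + v2 * A @ A.conj().T)

# Factored filter: two small SVDs plus a rectangular diagonal core Lambda_t
U1, s1, V1h = np.linalg.svd(A1)
U2, s2, V2h = np.linalg.svd(A2)
Lam = np.zeros((N, M))
for i, s1i in enumerate(s1):
    for j, s2j in enumerate(s2):
        s = s1i * s2j                       # singular value of A1 (x) A2
        Lam[i * N2 + j, i * M2 + j] = v2 * s / (sigma2 + v2 * s**2)
W_fact = np.kron(V1h.conj().T, V2h.conj().T) @ Lam @ np.kron(U1, U2).conj().T

assert np.allclose(W_full, W_fact)
```

In practice the Kronecker factors of $\mathbf{V}$ and $\mathbf{U}$ are of course never multiplied out either; they are applied with the vec trick described next.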

Definition: Efficient Matrix-Vector Product via Kronecker Structure

For $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ and $\mathbf{c} \in \mathbb{C}^{N_1 N_2}$, the matrix-vector product $\mathbf{A}\mathbf{c}$ can be computed as:

  1. Reshape $\mathbf{c}$ into a matrix $\mathbf{C} \in \mathbb{C}^{N_2 \times N_1}$, whose $n_1$-th column is the $n_1$-th length-$N_2$ block of $\mathbf{c}$.
  2. Compute $\mathbf{Y} = \mathbf{A}_{2} \mathbf{C} \mathbf{A}_{1}^{\mathsf{T}}$.
  3. Vectorize: $\mathbf{y} = \operatorname{vec}(\mathbf{Y})$ (column-stacking).

Cost: $O(M_2 N_1 N_2 + M_1 M_2 N_1)$ instead of $O(M_1 M_2 N_1 N_2) = O(MN)$. For square problems ($M_k \approx N_k \approx \sqrt{N}$), this is $O(N^{3/2})$ versus $O(N^2)$.
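The three steps above can be sketched in NumPy as follows (toy dimensions; the naive product is formed only to verify the result):

```python
import numpy as np

rng = np.random.default_rng(1)
M1, N1, M2, N2 = 4, 6, 3, 5
A1 = rng.standard_normal((M1, N1)) + 1j * rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2)) + 1j * rng.standard_normal((M2, N2))
c = rng.standard_normal(N1 * N2) + 1j * rng.standard_normal(N1 * N2)

# Naive product: forms the full M x N Kronecker matrix
y_naive = np.kron(A1, A2) @ c

# Vec trick: reshape, two small GEMMs, re-vectorize.
C = c.reshape(N1, N2).T        # N2 x N1: column n1 holds the n1-th block of c
Y = A2 @ C @ A1.T              # M2 x M1
y_fast = Y.T.reshape(-1)       # vec(Y): stack the columns of Y

assert np.allclose(y_naive, y_fast)
```

Note the plain transpose $\mathbf{A}_{1}^{\mathsf{T}}$ (not the conjugate transpose): this is the standard identity $(\mathbf{A}_1 \otimes \mathbf{A}_2)\operatorname{vec}(\mathbf{C}) = \operatorname{vec}(\mathbf{A}_2 \mathbf{C} \mathbf{A}_1^{\mathsf{T}})$.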

Example: Speedup from Kronecker Factorization

Compare the computational cost of OAMP with and without Kronecker exploitation for a 2D RF imaging problem with $N_1 = N_2 = 64$ (so $N = 4096$) and $M_1 = M_2 = 40$ (so $M = 1600$).
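A back-of-envelope comparison in Python, using the $\min(M,N)\cdot MN$ SVD cost model from above with constants dropped:

```python
# Rough SVD-preprocessing FLOP counts for the example
# (constants dropped; cost model min(M, N) * M * N per SVD).
M1 = M2 = 40
N1 = N2 = 64
M, N = M1 * M2, N1 * N2        # 1600 measurements, 4096 voxels

naive = min(M, N) * M * N                                   # one big SVD
factored = min(M1, N1) * M1 * N1 + min(M2, N2) * M2 * N2    # two small SVDs

print(f"naive:    {naive:.2e} FLOPs")
print(f"factored: {factored:.2e} FLOPs")
print(f"speedup:  ~{naive // factored}x")
```

Under this crude model the two small SVDs are cheaper by a factor of roughly $5 \times 10^4$, consistent with the claimed reductions of $10^3$ or more.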

[Interactive plot: FLOP cost of OAMP with naive SVD vs Kronecker-factored SVD as the problem size grows.]
🎓 CommIT Contribution (2026)

Kronecker-Structured OAMP for RF Imaging

G. Caire, A. Rezaei, TU Berlin Technical Report

The CommIT group's RF imaging simulator implements the Kronecker-factored OAMP algorithm described in this section. The key contribution is the end-to-end pipeline:

  1. Forward model construction that automatically detects and exploits Kronecker separability in the array/frequency geometry.
  2. GPU-accelerated LMMSE using batched small-matrix SVDs instead of a single large SVD, enabling real-time reconstruction for moderate problem sizes.
  3. Hutchinson trace estimator integration for the divergence computation, avoiding the need to differentiate through the LMMSE step.

The simulator achieves reconstruction times under 1 second for $N = 4096$ voxels on a single GPU, compared to $\sim$30 seconds for naive OAMP.


Definition: Hutchinson Trace Estimator

The OAMP divergence computation requires $\operatorname{div}(\eta) = \frac{1}{N}\operatorname{tr}(\mathbf{J}_\eta)$, where $\mathbf{J}_\eta$ is the Jacobian of the denoiser. For black-box denoisers, we use the Hutchinson estimator:

$$\widehat{\operatorname{div}}(\eta) = \frac{1}{N}\, \boldsymbol{\epsilon}^H\, \frac{\eta(\mathbf{r} + h\boldsymbol{\epsilon}) - \eta(\mathbf{r})}{h},$$

where ϡ∼CN(0,I)\boldsymbol{\epsilon} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}) (or Rademacher entries) and h>0h > 0 is a small perturbation.

Properties:

  • Unbiased as $h \to 0$: $\mathbb{E}[\widehat{\operatorname{div}}] = \operatorname{div}(\eta)$.
  • One extra denoiser evaluation per estimate.
  • Variance scales as $O(1/N)$, so a single probe vector suffices for large $N$.

For analytical denoisers (soft thresholding, BG-MMSE), the divergence can be computed in closed form. The Hutchinson estimator is essential for learned denoisers (DnCNN, U-Net) where the Jacobian is not available analytically.
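As a sanity check, the following sketch compares the single-probe Hutchinson estimate against the closed-form divergence of real-valued soft thresholding, which is simply the fraction of entries surviving the threshold; all parameter values are arbitrary:

```python
import numpy as np

def soft_threshold(r, tau):
    """eta(r) = sign(r) * max(|r| - tau, 0), applied component-wise."""
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

rng = np.random.default_rng(2)
N, tau, h = 4096, 0.5, 1e-4
r = rng.standard_normal(N)

# Closed form: the Jacobian is diagonal, 1 where |r_i| > tau and 0 elsewhere
div_exact = np.mean(np.abs(r) > tau)

# Hutchinson estimate: one Rademacher probe, one extra denoiser call
eps = rng.choice([-1.0, 1.0], size=N)
div_hat = eps @ (soft_threshold(r + h * eps, tau) - soft_threshold(r, tau)) / (h * N)

assert abs(div_hat - div_exact) < 1e-2
```

Because the denoiser is piecewise linear and the probe is Rademacher, the finite difference is exact except for the few entries within $h$ of a kink, so the estimate matches the closed form almost perfectly here.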

⚠️ Engineering Note

GPU Implementation of Kronecker OAMP

The Kronecker-factored OAMP maps naturally to GPU computation:

  • Batched SVD: The SVDs of $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ are small ($\sim 64 \times 64$) and computed once. Libraries like cuSOLVER handle this efficiently.
  • Matrix-matrix products: The LMMSE step reduces to multiplications of the form $\mathbf{A}_{2} \mathbf{C} \mathbf{A}_{1}^{\mathsf{T}}$, dense GEMM operations that achieve near-peak GPU throughput.
  • Denoiser: Component-wise denoisers are embarrassingly parallel. Learned denoisers (CNNs) run natively on GPU.

Practical timing (NVIDIA A100, single precision):

  • $N = 4096$ ($64 \times 64$ scene): $\sim$0.8 s for 20 OAMP iterations.
  • $N = 16384$ ($128 \times 128$ scene): $\sim$6 s.
  • $N = 65536$ ($256 \times 256$ scene): $\sim$45 s (memory-limited).

Practical Constraints

  • GPU memory limits the maximum scene size; for $N > 2^{16}$, the singular value arrays alone require >1 GB.
  • The Kronecker factorization is exact only for separable geometries; non-separable corrections require iterative methods.
🔧 Engineering Note

Numerical Precision of the Hutchinson Estimator

The perturbation size $h$ in the Hutchinson estimator creates a bias-variance tradeoff:

  • Too small ($h < 10^{-7}$ in single precision): Numerical cancellation dominates, and the estimate is pure noise.
  • Too large ($h > 0.1$): The finite-difference approximation to the directional derivative is biased.
  • Sweet spot: $h \approx N^{-1/3} \cdot \|\mathbf{r}\|$ balances bias and variance.

For learned denoisers with ReLU activations, the Jacobian is piecewise constant, so the finite-difference approximation is exact for any $h$ small enough to stay within one linear region.

Practical Constraints

  • Use float64 for the Hutchinson estimate even if the rest of the computation is float32.
  • Average over 3–5 probe vectors for variance reduction when $N < 500$.

Common Mistake: Kronecker Structure Is Approximate in Practice

Mistake:

Assuming exact Kronecker structure $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ for all RF imaging geometries and applying the factored LMMSE without checking the approximation quality.

Correction:

The Kronecker factorization is exact for:

  • Far-field UPA/ULA arrays with uniform frequency sampling.
  • Planar target regions perpendicular to the boresight.

It is approximate (but usually good) for:

  • Non-uniform array geometries.
  • Near-field imaging (where the wavefront curvature introduces cross-terms).

Always compute the relative approximation error $\|\mathbf{A} - \mathbf{A}_{1} \otimes \mathbf{A}_{2}\|_F / \|\mathbf{A}\|_F$ before using the factored LMMSE. If the error exceeds a few percent, use iterative refinement or the full (unfactored) LMMSE.
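One way to obtain the factors and the error together is the nearest-Kronecker-product rearrangement of Van Loan and Pitsianis: reshape $\mathbf{A}$ so that $\mathbf{A}_{1} \otimes \mathbf{A}_{2}$ becomes a rank-one matrix, then take the leading singular pair. A NumPy sketch (function name and interface are illustrative):

```python
import numpy as np

def kron_fit(A, M1, N1, M2, N2):
    """Best Kronecker fit A ~ kron(A1, A2) and its relative Frobenius error.

    Rearranges A so that kron(A1, A2) maps to the rank-one matrix
    vec(A1) vec(A2)^T, then takes the leading singular pair.
    """
    R = (A.reshape(M1, M2, N1, N2)      # [m1, m2, n1, n2]
          .transpose(0, 2, 1, 3)        # [m1, n1, m2, n2]
          .reshape(M1 * N1, M2 * N2))
    U, s, Vh = np.linalg.svd(R, full_matrices=False)
    A1 = (np.sqrt(s[0]) * U[:, 0]).reshape(M1, N1)
    A2 = (np.sqrt(s[0]) * Vh[0]).reshape(M2, N2)
    err = np.linalg.norm(A - np.kron(A1, A2)) / np.linalg.norm(A)
    return A1, A2, err

# For an exactly separable A, the recovered error is (numerically) zero
rng = np.random.default_rng(3)
A = np.kron(rng.standard_normal((3, 4)), rng.standard_normal((2, 5)))
_, _, err = kron_fit(A, 3, 4, 2, 5)
assert err < 1e-10
```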

Quick Check

For a 2D imaging problem with $N_1 = N_2 = n$ and $M_1 = M_2 = m$, what is the SVD preprocessing cost of Kronecker-factored OAMP?

  • $O(n^3)$
  • $O(2 \cdot m n \cdot \min(m,n))$
  • $O(m^2 n^2 \min(m^2, n^2))$

Kronecker product

For $\mathbf{A} \in \mathbb{C}^{m \times n}$ and $\mathbf{B} \in \mathbb{C}^{p \times q}$, $\mathbf{A} \otimes \mathbf{B} \in \mathbb{C}^{mp \times nq}$ is the block matrix with $(i,j)$-th block $A_{ij}\mathbf{B}$. Key property: $(\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = (\mathbf{A}\mathbf{C}) \otimes (\mathbf{B}\mathbf{D})$.

Related: Kronecker-Structured Sensing Matrix, LMMSE Factorization via Kronecker Structure
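The mixed-product property can be verified in a few lines of NumPy (shapes chosen only so the products are conformable):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 3)); C = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 2)); D = rng.standard_normal((2, 3))

# (A (x) B)(C (x) D) == (AC) (x) (BD)
lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
assert np.allclose(lhs, rhs)
```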

Hutchinson trace estimator

A stochastic estimator for the trace of a matrix: $\operatorname{tr}(\mathbf{A}) \approx \boldsymbol{\epsilon}^H \mathbf{A}\boldsymbol{\epsilon}$, with $\boldsymbol{\epsilon}$ a random vector with i.i.d. entries satisfying $\mathbb{E}[\epsilon_i \bar{\epsilon}_j] = \delta_{ij}$. Used in OAMP to estimate the denoiser divergence without computing the full Jacobian.

Related: Hutchinson Trace Estimator