Ferkans — Interactive Telecom Tutor

The Computational Wall at $N_t = 10^3$

Once the aperture carries thousands of antennas, the $N_t^{3}$ complexity of full-aperture MMSE estimation becomes prohibitive. A naive per-user MMSE on an $N_t = 4096$ array requires $\sim 7 \times 10^{10}$ flops per coherence block per user, which is orders of magnitude above any realistic baseband budget. The redeeming structural fact, grounded in Section 18.1, is that the visibility region (VR) of every user is spatially localized: no user benefits from processing antennas far outside its VR. This motivates partitioning the array into subarrays and decoupling the estimation problem at the subarray level, trading global optimality for an enormous complexity reduction.


Definition: Subarray Partition of an XL-MIMO Array

A subarray partition of an $N_t$-element array is a collection of disjoint index sets $\{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_S\}, \qquad \bigcup_{s=1}^S \mathcal{S}_s = \{1,\ldots,N_t\}, \qquad \mathcal{S}_s \cap \mathcal{S}_{s'} = \emptyset \text{ for } s \neq s',$ each of cardinality $|\mathcal{S}_s| = M_s$ with $\sum_s M_s = N_t$. Typical partitions split an $N_1 \times N_2$ UPA into $P \times Q$ rectangular tiles of size $(N_1/P) \times (N_2/Q)$; we denote by $S = PQ$ the number of subarrays and by $M_s = N_1 N_2 / S$ the per-subarray antenna count.

For a given user $k$, the active subarrays are those that intersect the user's VR: $\mathcal{A}_k = \{ s : \mathcal{S}_s \cap \mathcal{V}_k \neq \emptyset \}.$

The subarray grid is a receiver-side processing construct, not a hardware constraint — although it matches the natural structure of panel-based arrays where each panel has its own digital baseband unit. When the hardware is already partitioned (e.g., a wall of $8 \times 8$ panels), the obvious choice is to align $\mathcal{S}_s$ with the panel boundaries.
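The definition above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names `tile_partition` and `active_subarrays` are hypothetical, antennas are indexed row-major from 0 (the text indexes from 1), and the VR used in the usage lines is an invented 16 x 16 corner block.

```python
def tile_partition(n1, n2, p, q):
    """Split an n1 x n2 UPA into p x q rectangular tiles.

    Antenna (i, j) gets the row-major index i * n2 + j (0-based here,
    whereas the text indexes antennas from 1). Returns a list of
    S = p * q disjoint index sets, ordered row-major over tiles.
    """
    if n1 % p or n2 % q:
        raise ValueError("tile grid must divide the array evenly")
    rows, cols = n1 // p, n2 // q  # tile size (N_1/P) x (N_2/Q)
    tiles = []
    for a in range(p):
        for b in range(q):
            tiles.append({
                i * n2 + j
                for i in range(a * rows, (a + 1) * rows)
                for j in range(b * cols, (b + 1) * cols)
            })
    return tiles


def active_subarrays(tiles, vr):
    """Return A_k = {s : S_s intersects V_k} for a VR given as an index set."""
    return {s for s, tile in enumerate(tiles) if tile & vr}


tiles = tile_partition(64, 64, 8, 8)
# Hypothetical VR: the 16 x 16 block of antennas in the top-left corner.
vr = {i * 64 + j for i in range(16) for j in range(16)}
active = active_subarrays(tiles, vr)
print(len(tiles), len(tiles[0]), sorted(active))  # 64 64 [0, 1, 8, 9]
```

With 8 x 8 antenna tiles, the corner VR overlaps exactly the four tiles in the top-left 2 x 2 block of the tile grid, so only 4 of the 64 subarrays are active for this user.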


Theorem: Complexity Reduction from Subarray Decomposition

Assume the subarray partition has $S$ equal tiles of size $M = N_t/S$, each subarray processes its own pilots with an independent MMSE estimator, and the CPU combines subarray outputs with a linear weighted sum (no cross-subarray covariance computation). Then the total flop count to compute all $K$ channel estimates is $\mathcal{C}_{\text{sub}} = \mathcal{O}\!\left(K \cdot S \cdot M^3\right) = \mathcal{O}\!\left(K\,\frac{N_t^{3}}{S^2}\right),$ compared with the full-aperture MMSE complexity of $\mathcal{C}_{\text{full}} = \mathcal{O}(K\,N_t^{3})$. The subarray decomposition therefore yields a factor $S^2$ speedup.

The MMSE estimator for user $k$ inverts an $N_t \times N_t$ covariance matrix at cost $N_t^{3}$. Splitting into $S$ subarrays replaces this with $S$ independent $M \times M$ inversions at cost $S \cdot M^3 = N_t^{3} / S^2$. The cubic cost shrinks by $S^3$ per subarray because each one handles $M = N_t/S$ antennas instead of $N_t$, while looping over the $S$ subarrays gives back only a single factor of $S$, leaving a net gain of $S^2$.

Proof

Full-aperture cost

MMSE for one user: $\hat{\mathbf{H}}_k = \mathbf{R}_k (\mathbf{R}_k + \sigma^2 \mathbf{I})^{-1} \mathbf{Y}_p \mathbf{s}_k^*$. The inverse costs $\mathcal{O}(N_t^{3})$ and the matrix-vector multiplications cost $\mathcal{O}(N_t^{2})$, so the inverse dominates and the per-user cost is $\mathcal{O}(N_t^{3})$. Total: $\mathcal{O}(K \cdot N_t^{3})$.

Subarray cost

For each subarray $s$, the local MMSE inverts an $M \times M$ matrix at cost $\mathcal{O}(M^3)$. Summing across subarrays: $\mathcal{O}(S \cdot M^3)$. Substituting $M = N_t/S$ gives $\mathcal{O}(N_t^{3}/S^2)$ per user. Across users: $\mathcal{O}(K \cdot N_t^{3}/S^2)$.

Speedup ratio

$\mathcal{C}_{\text{full}} / \mathcal{C}_{\text{sub}} = S^2$. For $N_t = 4096$ and $S = 64$ (an $8 \times 8$ grid of tiles with $M = 64$ antennas each), this is a speedup factor of $4096$. $\blacksquare$

Why Active-Subarray Pruning Matters

The subarray decomposition alone does not yet use the VR structure. We gain a second multiplier, typically $S / |\mathcal{A}_k|$ per user, by processing only the active subarrays $\mathcal{A}_k$ for user $k$ and skipping the rest. On a large aperture where each user touches only a fraction $|\mathcal{A}_k|/S \approx 0.1$ of the subarrays, this is another 10x reduction. The full complexity of the VR-aware subarray pipeline is therefore $\mathcal{O}(K \cdot |\mathcal{A}_k|_{\text{avg}} \cdot M^3)$.
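The two multipliers can be checked with a small cost model. This is a minimal sketch that counts only the dominant cubic inversion term, as the theorem does; the helper name `mmse_flops` is hypothetical, and the numbers are those of the 4096-element configuration used in this section.

```python
def mmse_flops(n_t, s=1, active=None):
    """Dominant per-user flop count: (number of tiles processed) * M^3.

    s=1 is full-aperture MMSE; active=None means all s tiles are processed.
    Only the cubic inversion term is counted, as in the theorem.
    """
    if n_t % s:
        raise ValueError("s must divide n_t")
    m = n_t // s  # antennas per tile, M = N_t / S
    processed = s if active is None else active
    return processed * m ** 3


n_t, s, a_k = 4096, 64, 6
full = mmse_flops(n_t)             # full-aperture MMSE
plain = mmse_flops(n_t, s)         # plain subarray MMSE
pruned = mmse_flops(n_t, s, a_k)   # VR-pruned subarray MMSE
print(full // plain)               # 4096 (= S^2)
print(round(plain / pruned, 1))    # 10.7 (~ S / |A_k|)
```

The first ratio depends only on $S$, the second only on how many tiles a user actually touches, which is why the two gains multiply.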

Subarray-Based Channel Estimation Pipeline

Complexity: $\mathcal{O}\bigl(K \cdot |\mathcal{A}_k|_{\text{avg}} \cdot M^3\bigr)$, embarrassingly parallel across subarrays.

Input: Pilot matrix $\mathbf{S}_{i,k}$, pilot observations $\mathbf{Y}_p \in \mathbb{C}^{N_t \times \tau_p}$, subarray partition $\{\mathcal{S}_s\}$, per-subarray covariances $\{\mathbf{R}_{s,k}\}$, noise power $\sigma^2$, activation threshold $\eta > 0$.

Output: Channel estimates $\{\hat{\mathbf{H}}_k\}_{k=1}^{K}$.

1. Per-subarray matched filtering (parallel across $s$):
2.     for $s = 1, \ldots, S$ do
3.         $\mathbf{Z}_{s,k} \leftarrow \mathbf{Y}_p[\mathcal{S}_s, :]\, {\mathbf{S}_{i,k}}_{k}^{H} / \|{\mathbf{S}_{i,k}}_{k}\|^2$ for each user $k$
4.         Declare subarray $s$ active for user $k$ if $\|\mathbf{Z}_{s,k}\|^2 / M > \eta\, \sigma^2$
5.     end for
6. Local MMSE on active subarrays:
7.     for each user $k$ and each active subarray $s \in \mathcal{A}_k$ do
8.         $\hat{\mathbf{H}}_k[\mathcal{S}_s] \leftarrow \mathbf{R}_{s,k}(\mathbf{R}_{s,k} + \sigma^2\mathbf{I})^{-1} \mathbf{Z}_{s,k}$
9.     end for
10. Zero-fill inactive subarrays: $\hat{\mathbf{H}}_k[\mathcal{S}_s] \leftarrow \mathbf{0}$ for $s \notin \mathcal{A}_k$.
11. return $\{\hat{\mathbf{H}}_k\}$.

Steps 1–5 run independently per subarray and can be mapped to panel-local baseband units; only the user-level zero-fill in step 10 requires a central aggregator. The algorithm has no cross-subarray matrix inversion, which is what unlocks the $S^2$ complexity reduction of the theorem "Complexity Reduction from Subarray Decomposition".
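The pipeline can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation: the function name `subarray_mmse`, the array layouts, and the toy setup (orthonormal pilots, i.i.d. unit-variance priors, one visible tile per user, $\eta = 4$) are all assumptions made for the example. The two loops of the pipeline are merged into one, and the zero-fill of step 10 is obtained by initializing the output to zero.

```python
import numpy as np


def subarray_mmse(Y, pilots, tiles, R, sigma2, eta):
    """VR-pruned subarray MMSE following steps 1-11 of the pipeline.

    Y      : (N_t, tau_p) pilot observations
    pilots : (K, tau_p) array, row k is user k's pilot sequence
    tiles  : list of S integer index arrays partitioning range(N_t)
    R      : R[s][k] is the (M, M) covariance of user k on tile s
    Returns the (K, N_t) estimates and a (K, S) boolean activity mask.
    """
    K, n_t = pilots.shape[0], Y.shape[0]
    H_hat = np.zeros((K, n_t), dtype=complex)   # step 10: zero-fill by default
    active = np.zeros((K, len(tiles)), dtype=bool)
    for s, idx in enumerate(tiles):             # independent per tile
        for k in range(K):
            p = pilots[k]
            # step 3: matched filter Y[S_s, :] p^H / ||p||^2
            z = Y[idx] @ p.conj() / np.vdot(p, p).real
            # step 4: energy test against eta * sigma^2
            if np.vdot(z, z).real / len(idx) <= eta * sigma2:
                continue
            active[k, s] = True
            A = R[s][k] + sigma2 * np.eye(len(idx))
            # step 8: R (R + sigma^2 I)^{-1} z, via a solve instead of an inverse
            H_hat[k, idx] = R[s][k] @ np.linalg.solve(A, z)
    return H_hat, active


rng = np.random.default_rng(0)
n_t, m, K, tau_p, sigma2, eta = 64, 16, 2, 2, 0.01, 4.0
tiles = [np.arange(s * m, (s + 1) * m) for s in range(n_t // m)]
pilots = np.eye(K, tau_p, dtype=complex)        # orthonormal pilots (assumed)
# Toy channels: user 0 is visible on tile 0 only, user 1 on tile 3 only.
H = np.zeros((K, n_t), dtype=complex)
for k, s in enumerate([0, 3]):
    H[k, tiles[s]] = (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
R = [[np.eye(m) for _ in range(K)] for _ in tiles]  # i.i.d. unit-variance prior
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((n_t, tau_p))
                               + 1j * rng.standard_normal((n_t, tau_p)))
Y = H.T @ pilots + noise
H_hat, active = subarray_mmse(Y, pilots, tiles, R, sigma2, eta)
print(active.astype(int))
```

In this toy setup the signal energy on a visible tile is about 100 times the noise floor, so the detector flags tile 0 for user 0 and tile 3 for user 1 and the remaining estimates stay at zero. Using `np.linalg.solve` rather than forming the inverse keeps the same cubic cost order while being numerically better behaved.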

NMSE vs Number of Subarrays (With and Without VR Pruning)

Compare the NMSE of full-aperture MMSE, plain subarray MMSE, and VR-pruned subarray MMSE as a function of the subarray count $S$ . Notice that plain subarray MMSE degrades as $S$ grows (smaller tiles lose covariance information), while VR-pruned subarray MMSE remains near the full-aperture NMSE until the tiles become smaller than the VR boundary features.

Parameters: $N_1 = 32$, $N_2 = 32$, VR fraction 0.3, SNR 5 dB.

Example: A 4096-Element Array: How Much Do We Save?

An XL-MIMO array has $N_t = 4096$ elements arranged as $64 \times 64$ . A design uses $8 \times 8$ subarray tiles, giving $S = 64$ subarrays of $M = 64$ each. Users touch on average $|\mathcal{A}_k| = 6$ active subarrays. Compare the per-user, per-coherence-block flop count of: (a) full-aperture MMSE, (b) plain subarray MMSE, (c) VR-pruned subarray MMSE.

Solution

Full-aperture MMSE

Dominant cost $N_t^{3} = 4096^3 \approx 6.87 \times 10^{10}$ flops per user. For $K = 16$ users and a coherence time of $1$ ms at $20$ MHz, this is $\sim 10^{15}$ flops/s — orders of magnitude above any realistic baseband budget.

Plain subarray MMSE

Per user: $S \cdot M^3 = 64 \cdot 64^3 = 1.68 \times 10^7$ flops. Speedup over (a): $N_t^{3} / (S M^3) = S^2 = 4096$. The per-coherence-block cost drops from $\sim 10^{12}$ flops (all users) to $\sim 2.7 \times 10^8$ flops, which is tractable.

VR-pruned subarray MMSE

Only $|\mathcal{A}_k| = 6$ of 64 subarrays are processed. Per user: $6 \cdot 64^3 \approx 1.57 \times 10^6$ flops. Extra speedup over (b): $64/6 \approx 10.7$ . Total over all users: $\sim 2.5 \times 10^7$ flops per coherence block — well within budget.

Interpretation

The subarray decomposition provides the $S^2 = 4096$ speedup; the VR-aware pruning adds another $S/|\mathcal{A}_k| \approx 10$ . The resulting estimator runs in real time on commodity hardware and loses at most $0.5$ dB of NMSE relative to the intractable full-aperture MMSE (see the interactive plot above). $\blacksquare$

Full-Aperture vs Subarray vs VR-Pruned Subarray MMSE

Attribute	Full-aperture MMSE	Plain subarray MMSE	VR-pruned subarray MMSE
Per-user flops	$N_t^{3}$	$S \cdot M^3 = N_t^{3}/S^2$	$|\mathcal{A}_k| \cdot M^3$
Parallelism	Serial inverse	Embarrassingly parallel across $S$	Embarrassingly parallel across $|\mathcal{A}_k|$
NMSE (stationary channel)	Optimal	0.3-1 dB penalty	0.3-1 dB penalty
NMSE (VR with low $|\mathcal{V}_k|$)	Wastes pilots on dead antennas	Same as full if tiles cover VR	Near full-aperture, ~10x cheaper
Requires VR detector?	No	No	Yes (Section 18.5)
Cross-subarray coupling?	Full	None	None
Typical use case	$N_t \lesssim 128$	Regular massive MIMO panels	XL-MIMO with blockage / multipath clustering

What Subarray Processing Does Not Buy You

Subarray decomposition is a computational decoupling, not an information-theoretic one. The subarray estimators ignore cross-subarray covariance $\mathbf{R}_{s,s'} = \mathbb{E}[\mathbf{H}_{k}[\mathcal{S}_s] \mathbf{H}_{k}[\mathcal{S}_{s'}]^H]$ , which is non-zero whenever the spatial correlation is non-trivial. In practice the loss is small ( $< 1$ dB) because most of the per-user covariance mass lives within a single subarray, but the approximation is visible at high SNR where fine-grained correlation matters. If absolute fidelity is needed at $\text{SNR} \geq 20$ dB, use a two-stage estimator: subarray MMSE first, then a low-rank refinement that couples neighbouring subarrays.
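The cost of dropping $\mathbf{R}_{s,s'}$ can be quantified in closed form for any linear estimator. The sketch below compares the full MMSE filter with its block-diagonal (per-subarray) counterpart. The exponential correlation model $[\mathbf{R}]_{ij} = \rho^{|i-j|}$ on a 64-element linear array, the value $\rho = 0.9$, and the helper names are assumptions chosen purely for illustration; they are not taken from the text, and the resulting gaps should not be read as representative of a real XL-MIMO channel.

```python
import numpy as np


def linear_mse(W, R, sigma2):
    """MSE of h_hat = W (h + n) with h ~ CN(0, R), n ~ CN(0, sigma2 I)."""
    E = np.eye(R.shape[0]) - W
    return (np.trace(E @ R @ E.conj().T) + sigma2 * np.trace(W @ W.conj().T)).real


def mmse_filter(R, sigma2):
    """Return R (R + sigma2 I)^{-1}, using that both factors are Hermitian."""
    return np.linalg.solve(R + sigma2 * np.eye(R.shape[0]), R).conj().T


n_t, m, rho = 64, 16, 0.9
idx = np.arange(n_t)
R = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed exponential correlation
for snr_db in (0, 10, 20):
    sigma2 = 10 ** (-snr_db / 10)
    W_full = mmse_filter(R, sigma2)
    W_blk = np.zeros_like(R)
    for s in range(0, n_t, m):                   # keep only diagonal blocks R_{s,s}
        W_blk[s:s + m, s:s + m] = mmse_filter(R[s:s + m, s:s + m], sigma2)
    nmse_full = linear_mse(W_full, R, sigma2) / np.trace(R)
    nmse_blk = linear_mse(W_blk, R, sigma2) / np.trace(R)
    print(snr_db, 10 * np.log10(nmse_blk / nmse_full))  # NMSE gap in dB
```

Because the full MMSE filter is optimal among linear estimators, the printed gap is never negative; how large it gets depends on how much correlation crosses tile boundaries in the assumed model.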

Common Mistake: Do Not Make Subarrays Smaller Than VR Features

Mistake:

Push $S$ as large as possible to maximize the $S^2$ speedup.

Correction:

When the subarray side length $\sqrt{M}$ drops below the typical VR cluster diameter, individual tiles no longer see enough antennas to estimate the in-tile covariance reliably, and the VR detector starts flipping whole subarrays on and off based on a handful of samples. The sweet spot is a side length $\sqrt{M}$ of at least $1$ to $2$ times the expected VR border thickness; for a $32 \times 32$ VR on a $64 \times 64$ array, $8 \times 8$ subarrays work well. Smaller tiles force the detector to rely on the MRF prior to glue fragments back together, which works but eats into the prior's noise-cleaning budget.

Quick Check

An XL-MIMO array has $N_t = 1024$ antennas. It is partitioned into $S = 16$ square subarrays. What is the flop-count speedup of plain subarray MMSE over full-aperture MMSE, per user?

16

256

1024

4096

Correction:

256

By the theorem "Complexity Reduction from Subarray Decomposition", the speedup is $S^2 = 16^2 = 256$. Each subarray inverts an $M \times M$ matrix with $M = 1024/16 = 64$, so the total cost is $S \cdot M^3 = N_t^{3}/S^2$.

Quick Check

Why should subarray tiles not be made much smaller than the typical VR cluster diameter?

The MRF prior cannot run on small tiles

Local in-tile covariance estimation becomes unreliable, and VR detection starts flipping tiles on evidence too noisy to trust

The subarray count must be a power of two

The fronthaul cost grows faster than $S$

Correction:

Local in-tile covariance estimation becomes unreliable, and VR detection starts flipping tiles on evidence too noisy to trust

Each tile must contain enough antennas to form a reliable per-tile covariance estimate and make a stable activation decision. When the tile size drops below the VR boundary feature scale, the detector starts toggling whole tiles on a handful of noisy samples. The sweet spot is tiles slightly larger than the characteristic VR boundary thickness (Section 18.3).

🔧Engineering Note

Align Subarrays with Hardware Panels

A production XL-MIMO array is rarely a single monolithic panel. Ericsson, Nokia, and Huawei commercial XL-MIMO products expose the array as a grid of panels, each with its own baseband unit and its own front-haul link to the central processor. The natural subarray partition is one subarray per panel:

• Cross-panel traffic stays at the fronthaul level (a weighted sum of per-panel estimates), not at the baseband level.
• The panel boundary matches a natural discontinuity in the spatial covariance (different oscillators, different calibration).
• A panel that is blocked or powered down simply drops out of $\mathcal{A}_k$ without any algorithm reconfiguration.

Practical Constraints

• Panel size: 4-16 antennas per panel at sub-6 GHz; 64-256 at mmWave
• Inter-panel fronthaul: ~1 Gbps per panel for weighted-sum output
• Per-panel computation budget: $< 1$ ms for $M \leq 256$

📋 Ref: 3GPP TR 38.867 — Release 18 study item on multi-panel operation
