Subarray-Based Processing

The Computational Wall at $N_t = 10^3$

Once the aperture carries thousands of antennas, the $N_t^{3}$ complexity of full-aperture MMSE estimation becomes prohibitive. A naive per-user MMSE on an $N_t = 4096$ array requires $\sim 7 \times 10^{10}$ flops per coherence block per user, which is orders of magnitude above any realistic baseband budget. The redeeming structural fact, grounded in Section 18.1, is that the VR of every user is spatially localized: no user benefits from processing antennas far outside its VR. This motivates partitioning the array into subarrays and decoupling the estimation problem at the subarray level, trading global optimality for an enormous complexity reduction.


Definition:

Subarray Partition of an XL-MIMO Array

A subarray partition of an $N_t$-element array is a collection of disjoint index sets
$$\{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_S\}, \qquad \bigcup_{s=1}^S \mathcal{S}_s = \{1,\ldots,N_t\}, \qquad \mathcal{S}_s \cap \mathcal{S}_{s'} = \emptyset \text{ for } s \neq s',$$
each of cardinality $|\mathcal{S}_s| = M_s$ with $\sum_s M_s = N_t$. Typical partitions split an $N_1 \times N_2$ UPA into $P \times Q$ rectangular tiles of size $(N_1/P) \times (N_2/Q)$; we denote by $S = PQ$ the number of subarrays and by $M_s = N_1 N_2 / S$ the per-subarray antenna count.

For a given user $k$, the active subarrays are those that intersect the user's VR:
$$\mathcal{A}_k = \{ s : \mathcal{S}_s \cap \mathcal{V}_k \neq \emptyset \}.$$
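The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `tile_partition` and `active_subarrays`, the 0-based tile indices, and the rectangular VR mask are all assumptions made for the example.

```python
import numpy as np

def tile_partition(n1, n2, p, q):
    """Map each antenna of an n1 x n2 UPA to a tile index in 0..p*q-1 (row-major)."""
    if n1 % p or n2 % q:
        raise ValueError("p must divide n1 and q must divide n2")
    tile_row = np.arange(n1) // (n1 // p)   # tile row of each antenna row
    tile_col = np.arange(n2) // (n2 // q)   # tile column of each antenna column
    return tile_row[:, None] * q + tile_col[None, :]

def active_subarrays(tile_map, vr_mask):
    """Tiles that contain at least one antenna of the VR (boolean mask, same shape)."""
    return np.unique(tile_map[vr_mask])

# Hypothetical example: 64 x 64 array, 8 x 8 grid of tiles, rectangular VR.
tile_map = tile_partition(64, 64, 8, 8)
vr_mask = np.zeros((64, 64), dtype=bool)
vr_mask[10:30, 5:20] = True
active = active_subarrays(tile_map, vr_mask)
print(active)
```

Because the tiles are disjoint and cover the array, every antenna receives exactly one tile index, and the active set is simply the set of indices that appear under the VR mask.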

The subarray grid is a receiver-side processing construct, not a hardware constraint, although it matches the natural structure of panel-based arrays where each panel has its own digital baseband unit. When the hardware is already partitioned (e.g., a wall of $8 \times 8$ panels), the obvious choice is to align $\mathcal{S}_s$ with the panel boundaries.


Theorem: Complexity Reduction from Subarray Decomposition

Assume the subarray partition has $S$ equal tiles of size $M = N_t/S$, each subarray processes its own pilots with an independent MMSE estimator, and the CPU combines subarray outputs with a linear weighted sum (no cross-subarray covariance computation). Then the total flop count to compute all $K$ channel estimates is
$$\mathcal{C}_{\text{sub}} = \mathcal{O}\!\left(K \cdot S \cdot M^3\right) = \mathcal{O}\!\left(K\,\frac{N_t^{3}}{S^2}\right),$$
compared with the full-aperture MMSE complexity of $\mathcal{C}_{\text{full}} = \mathcal{O}(K\,N_t^{3})$. The subarray decomposition therefore yields a factor-$S^2$ speedup.

The MMSE estimator for user $k$ inverts an $N_t \times N_t$ covariance matrix, at cost $N_t^{3}$. Splitting into $S$ subarrays inverts $S$ independent $M \times M$ matrices at cost $S \cdot M^3 = N_t^{3} / S^2$. Two powers of $S$ are saved because each inversion shrinks by a factor $S^3$ (it handles $M = N_t/S$ antennas instead of $N_t$), while the number of inversions grows only linearly in $S$ (one loop over subarrays).

Why Active-Subarray Pruning Matters

The subarray decomposition alone does not yet use the VR structure. We gain a second multiplier, typically $S / |\mathcal{A}_k|$ per user, by processing only the active subarrays $\mathcal{A}_k$ for user $k$ and skipping the rest. On a large aperture where each user touches only a fraction $|\mathcal{A}_k|/S \approx 0.1$ of the subarrays, this is another 10x reduction. The full complexity of the VR-aware subarray pipeline is therefore $\mathcal{O}(K \cdot |\mathcal{A}_k|_{\text{avg}} \cdot M^3)$.
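To make the two multipliers concrete, the sketch below evaluates the leading-order per-user flop counts for the three schemes. The helper names are hypothetical, constants hidden by the $\mathcal{O}(\cdot)$ notation are dropped, and the array size (an $80 \times 80$ aperture with 100 tiles and 10 active tiles on average) is an assumed example chosen so that the active fraction is 0.1.

```python
def full_mmse_flops(n_t):
    """Leading-order per-user cost of one N_t x N_t inversion."""
    return n_t ** 3

def subarray_mmse_flops(n_t, s):
    """Leading-order per-user cost of S independent M x M inversions, M = N_t / S."""
    m, rem = divmod(n_t, s)
    if rem:
        raise ValueError("s must divide n_t for equal tiles")
    return s * m ** 3

def pruned_mmse_flops(n_t, s, n_active):
    """Same as above, but only n_active of the S tiles are processed."""
    m, rem = divmod(n_t, s)
    if rem:
        raise ValueError("s must divide n_t for equal tiles")
    return n_active * m ** 3

# Assumed example: N_t = 6400, S = 100 tiles of M = 64, 10 active tiles per user.
n_t, s, n_active = 6400, 100, 10
full = full_mmse_flops(n_t)
plain = subarray_mmse_flops(n_t, s)
pruned = pruned_mmse_flops(n_t, s, n_active)
print(full / plain)    # subarray multiplier, S**2
print(plain / pruned)  # pruning multiplier, S / n_active
```

The first ratio is the $S^2$ factor from the theorem; the second is the ratio of total to active subarrays, which stacks on top of it.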

Subarray-Based Channel Estimation Pipeline

Complexity: $\mathcal{O}\bigl(K \cdot |\mathcal{A}_k|_{\text{avg}} \cdot M^3\bigr)$, embarrassingly parallel across subarrays.
Input: Pilot matrix $\mathbf{S}_{i,k}$, pilot observations $\mathbf{Y}_p \in \mathbb{C}^{N_t \times \tau_p}$, subarray partition $\{\mathcal{S}_s\}$, per-subarray covariances $\{\mathbf{R}_{s,k}\}$, noise power $\sigma^2$, activation threshold $\eta > 0$.
Output: Channel estimates $\{\hat{\mathbf{H}}_k\}_{k=1}^{K}$.
1. Per-subarray matched filtering (parallel across $s$):
2.     for $s = 1, \ldots, S$ do
3.         $\mathbf{Z}_{s,k} \leftarrow \mathbf{Y}_p[\mathcal{S}_s, :]\, {\mathbf{S}_{i,k}}_{k}^{H} / \|{\mathbf{S}_{i,k}}_{k}\|^2$ for each user $k$
4.         Declare subarray $s$ active for user $k$ if $\|\mathbf{Z}_{s,k}\|^2 / M > \eta\, \sigma^2$
5.     end for
6. Local MMSE on active subarrays:
7.     for each user $k$ and each active subarray $s \in \mathcal{A}_k$ do
8.         $\hat{\mathbf{H}}_k[\mathcal{S}_s] \leftarrow \mathbf{R}_{s,k}(\mathbf{R}_{s,k} + \sigma^2\mathbf{I})^{-1} \mathbf{Z}_{s,k}$
9.     end for
10. Zero-fill inactive subarrays: $\hat{\mathbf{H}}_k[\mathcal{S}_s] \leftarrow \mathbf{0}$ for $s \notin \mathcal{A}_k$.
11. return $\{\hat{\mathbf{H}}_k\}$.

Steps 1-5 run independently per subarray and can be mapped to panel-local baseband units; only the user-level zero-fill in step 10 requires a central aggregator. The algorithm has no cross-subarray matrix inversion, which is what unlocks the $S^2$ complexity reduction of the theorem "Complexity Reduction from Subarray Decomposition".
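The pipeline can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the definitive implementation: the function name and argument layout are hypothetical, each user's pilot is taken to be one row of a `(K, tau_p)` array, subarray indices are 0-based, and detection and local MMSE are fused into a single loop for brevity. The noise power `sigma2` is used exactly as in the listing; the toy data use a unit-norm pilot so that matched filtering does not rescale the noise.

```python
import numpy as np

def subarray_channel_estimate(Y_p, pilots, subarrays, R, sigma2, eta):
    """VR-pruned subarray MMSE (sketch).

    Y_p       : (N_t, tau_p) complex pilot observations
    pilots    : (K, tau_p) complex pilot sequences, one row per user (assumed layout)
    subarrays : list of S disjoint integer index arrays (0-based)
    R         : R[s][k] is the (M_s, M_s) covariance of user k on subarray s
    Returns (H_hat, active) with shapes (K, N_t) and (K, S).
    """
    N_t, K, S = Y_p.shape[0], pilots.shape[0], len(subarrays)
    H_hat = np.zeros((K, N_t), dtype=complex)  # inactive tiles stay zero (step 10)
    active = np.zeros((K, S), dtype=bool)
    for s, idx in enumerate(subarrays):
        Y_s, M = Y_p[idx, :], len(idx)
        for k in range(K):
            p = pilots[k]
            # Step 3: matched filter against user k's pilot.
            z = Y_s @ p.conj() / np.vdot(p, p).real
            # Step 4: energy test against the noise floor.
            active[k, s] = np.vdot(z, z).real / M > eta * sigma2
            if active[k, s]:
                # Step 8: local MMSE; solve() avoids forming the explicit inverse.
                R_sk = R[s][k]
                H_hat[k, idx] = R_sk @ np.linalg.solve(R_sk + sigma2 * np.eye(M), z)
    return H_hat, active

# Toy, noise-free usage: one user visible on the first of two 8-antenna tiles.
M, tau_p = 8, 4
subarrays = [np.arange(0, 8), np.arange(8, 16)]
pilots = np.full((1, tau_p), 0.5, dtype=complex)          # unit-norm pilot
h = np.concatenate([np.ones(8), np.zeros(8)]).astype(complex)
Y_p = np.outer(h, pilots[0])
R = [[np.eye(M)], [np.eye(M)]]                            # assumed identity covariances
H_hat, active = subarray_channel_estimate(Y_p, pilots, subarrays, R, sigma2=0.01, eta=2.0)
```

In the toy run the second tile carries no signal energy, so it is declared inactive and its estimate is left at zero, while the first tile's estimate is the matched-filter output shrunk by the MMSE factor $1/(1 + \sigma^2)$.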

NMSE vs Number of Subarrays (With and Without VR Pruning)

Compare the NMSE of full-aperture MMSE, plain subarray MMSE, and VR-pruned subarray MMSE as a function of the subarray count $S$. Notice that plain subarray MMSE degrades as $S$ grows (smaller tiles lose covariance information), while VR-pruned subarray MMSE remains near the full-aperture NMSE until the tiles become smaller than the VR boundary features.


Example: A 4096-Element Array: How Much Do We Save?

An XL-MIMO array has $N_t = 4096$ elements arranged as $64 \times 64$. A design uses $8 \times 8$ subarray tiles, giving $S = 64$ subarrays of $M = 64$ antennas each. Users touch on average $|\mathcal{A}_k| = 6$ active subarrays. Compare the per-user, per-coherence-block flop count of: (a) full-aperture MMSE, (b) plain subarray MMSE, (c) VR-pruned subarray MMSE.
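A worked evaluation of the three cases, using only the leading-order counts $N_t^3$, $S \cdot M^3$, and $|\mathcal{A}_k| \cdot M^3$ (constants hidden by the $\mathcal{O}(\cdot)$ notation are dropped, so the figures indicate scale rather than exact operation counts):

```latex
\begin{align*}
\text{(a) full aperture:} \quad & N_t^{3} = 4096^{3} = 2^{36} \approx 6.9 \times 10^{10}, \\
\text{(b) plain subarray:} \quad & S \cdot M^{3} = 64 \cdot 64^{3} = 2^{24} \approx 1.7 \times 10^{7}, \\
\text{(c) VR-pruned:} \quad & |\mathcal{A}_k| \cdot M^{3} = 6 \cdot 64^{3} \approx 1.6 \times 10^{6}.
\end{align*}
```

Going from (a) to (b) saves a factor $S^2 = 4096$; going from (b) to (c) saves a further $S / |\mathcal{A}_k| = 64/6 \approx 10.7$, for a combined reduction of $64^3 / 6 \approx 4.4 \times 10^{4}$.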

Full-Aperture vs Subarray vs VR-Pruned Subarray MMSE

| Attribute | Full-aperture MMSE | Plain subarray MMSE | VR-pruned subarray MMSE |
|---|---|---|---|
| Per-user flops | $N_t^{3}$ | $S \cdot M^3 = N_t^{3}/S^2$ | $\lvert\mathcal{A}_k\rvert \cdot M^3$ |
| Parallelism | Serial inverse | Embarrassingly parallel across $S$ | Embarrassingly parallel across $\lvert\mathcal{A}_k\rvert$ |
| NMSE (stationary channel) | Optimal | 0.3-1 dB penalty | 0.3-1 dB penalty |
| NMSE (VR with low $\lvert\mathcal{V}_k\rvert$) | Wastes pilots on dead antennas | Same as full if tiles cover VR | Near full-aperture, ~10x cheaper |
| Requires VR detector? | No | No | Yes (Section 18.5) |
| Cross-subarray coupling? | Full | None | None |
| Typical use case | $N_t \lesssim 128$ | Regular massive MIMO panels | XL-MIMO with blockage / multipath clustering |

What Subarray Processing Does Not Buy You

Subarray decomposition is a computational decoupling, not an information-theoretic one. The subarray estimators ignore cross-subarray covariance $\mathbf{R}_{s,s'} = \mathbb{E}[\mathbf{H}_{k}[\mathcal{S}_s] \mathbf{H}_{k}[\mathcal{S}_{s'}]^H]$, which is non-zero whenever the spatial correlation is non-trivial. In practice the loss is small ($< 1$ dB) because most of the per-user covariance mass lives within a single subarray, but the approximation is visible at high SNR where fine-grained correlation matters. If absolute fidelity is needed at $\text{SNR} \geq 20$ dB, use a two-stage estimator: subarray MMSE first, then a low-rank refinement that couples neighbouring subarrays.
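The cost of ignoring $\mathbf{R}_{s,s'}$ can be probed with a closed-form toy calculation. The sketch below is an assumption-laden illustration, not the chapter's channel model: it uses a one-dimensional 64-element array with an exponential correlation $\rho^{|i-j|}$, the observation model $\mathbf{y} = \mathbf{h} + \mathbf{n}$, and hypothetical helper names. It compares the MMSE error trace of the full estimator with that of independent per-block estimators.

```python
import numpy as np

def exp_corr(n, rho):
    """Exponential correlation matrix rho^|i-j| (assumed toy model)."""
    i = np.arange(n)
    return rho ** np.abs(i[:, None] - i[None, :])

def mmse_error(R, sigma2):
    """Trace of the MMSE error covariance for y = h + n, cov(h) = R, cov(n) = sigma2*I."""
    n = R.shape[0]
    return float(np.trace(R - R @ np.linalg.solve(R + sigma2 * np.eye(n), R)))

def subarray_mmse_error(R, sigma2, s):
    """Sum of per-block MMSE errors when cross-block covariance is ignored."""
    m = R.shape[0] // s
    return sum(mmse_error(R[b * m:(b + 1) * m, b * m:(b + 1) * m], sigma2)
               for b in range(s))

R = exp_corr(64, 0.9)
sigma2 = 0.01                      # 20 dB SNR for unit-variance channel entries
full = mmse_error(R, sigma2)
block_counts = (1, 2, 4, 8, 16)
losses_db = [10 * np.log10(subarray_mmse_error(R, sigma2, s) / full)
             for s in block_counts]
for s, loss in zip(block_counts, losses_db):
    print(f"S = {s:2d}: loss over full MMSE = {loss:.3f} dB")
```

Each block estimator is the MMSE estimator given only its own observations, so its error can never fall below that of the full estimator, and refining the partition can only increase the total. How large the penalty is depends entirely on the assumed correlation model and SNR.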

Common Mistake: Do Not Make Subarrays Smaller Than VR Features

Mistake:

Push $S$ as large as possible to maximize the $S^2$ speedup.

Correction:

When the tile side $M^{1/2}$ drops below the typical VR cluster diameter, individual tiles no longer see enough antennas to estimate the in-tile covariance reliably, and the VR detector starts flipping whole subarrays on and off based on a handful of samples. The sweet spot is a tile side $M^{1/2}$ of at least $1$ to $2$ times the expected VR border thickness; for a $32 \times 32$ VR on a $64 \times 64$ array, $8 \times 8$ subarrays work well. Smaller tiles force the detector to rely on the MRF prior to glue fragments back together, which works but eats into the prior's noise-cleaning budget.

Quick Check

An XL-MIMO array has $N_t = 1024$ antennas. It is partitioned into $S = 16$ square subarrays. What is the flop-count speedup of plain subarray MMSE over full-aperture MMSE, per user?

16

256

1024

4096

Quick Check

Why should subarray tiles not be made much smaller than the typical VR cluster diameter?

The MRF prior cannot run on small tiles

Local in-tile covariance estimation becomes unreliable, and VR detection starts flipping tiles on evidence too noisy to trust

The subarray count must be a power of two

The fronthaul cost grows faster than $S$

🔧 Engineering Note

Align Subarrays with Hardware Panels

A production XL-MIMO array is rarely a single monolithic panel. Ericsson, Nokia, and Huawei commercial XL-MIMO products expose the array as a grid of panels, each with its own baseband unit and its own fronthaul link to the central processor. The natural subarray partition is one subarray per panel:

  • Cross-panel traffic stays at the fronthaul level (a weighted sum of per-panel estimates), not at the baseband level.
  • The panel boundary matches a natural discontinuity in the spatial covariance (different oscillators, different calibration).
  • A panel that is blocked or powered down simply drops out of $\mathcal{A}_k$ without any algorithm reconfiguration.

Practical Constraints

  • Panel size: 4-16 antennas per panel at sub-6 GHz; 64-256 at mmWave
  • Inter-panel fronthaul: ~1 Gbps per panel for weighted-sum output
  • Per-panel computation budget: $< 1$ ms for $M \leq 256$

📋 Ref: 3GPP TR 38.867, Release 18 study item on multi-panel operation