Real-Time Implementation and the Slot Budget

The Slot Clock Is Merciless

In the theoretical chapters we treated channel estimation and precoder computation as free operations: we wrote $\mathbf{W} = \mathbf{H}(\mathbf{H}^{H}\mathbf{H})^{-1}$ and moved on. In a real testbed, the moment the uplink pilots are received, a clock starts ticking down to the deadline by which the downlink samples must leave the DACs. For 5G NR at $\Delta f = 30$ kHz numerology that deadline is roughly $T_{\rm slot} = 500$ microseconds. Every stage of the pipeline — channel estimation, precoder computation, data encoding, DAC playout, PA nonlinearity compensation — gets a slice of that budget, and if any stage overruns, the entire slot is dropped.

This section has two goals. First, we lay out the real-time architecture of a massive MIMO testbed — the partition of work between FPGA and SoC, the fixed-point vs floating-point decision, and the latency budget decomposition. Second, we derive the SINR loss caused by truncating the baseband multiplies to $b$ bits and show how that loss determines the required mantissa width.

Definition:

Massive MIMO Real-Time Processing Pipeline

A real-time massive MIMO baseband ingests an $N_t \times N_{\rm sc}$ matrix of per-subcarrier uplink samples each OFDM symbol and, over one slot of $L$ symbols, must produce an $N_t \times N_{\rm sc}$ matrix of downlink samples for transmission. The pipeline comprises the following stages:

  1. RF sampling and synchronization. ADCs digitize $N_t$ antenna signals at sample rate $f_s \geq W$. A per-antenna CFO and timing correction is applied.

  2. FFT and pilot extraction. A length-$N_{\rm sc}$ FFT per antenna produces the frequency-domain samples; pilot resources are pulled out to form the channel-estimation input.

  3. Channel estimation. Least-squares or LMMSE estimation produces $\hat{\mathbf{H}} \in \mathbb{C}^{N_t \times K}$ (or, in frequency-selective channels, one such matrix per resource block).

  4. Combiner / precoder computation. Linear detection ($\hat{\mathbf{H}}^H$, ZF, MMSE) is applied to the uplink data; or linear precoding is computed and stored to be applied to the downlink data in the next slot.

  5. Symbol processing. Modulation, per-layer interference cancellation if used, and IFFT.

  6. DAC playout. The downlink samples are clocked to the DACs with sub-nanosecond relative jitter across antennas so the coherent beam is preserved.

The entire chain must complete within $T_{\rm slot}$; stages 3 and 4 are the compute-heavy kernels and typically consume most of the budget.

⚠️ Engineering Note

5G NR Numerology and the Available Compute Window

The 5G NR frame structure fixes the slot duration as a function of the subcarrier spacing $\Delta f = 15 \cdot 2^\mu$ kHz:

  • $\mu = 0$ (15 kHz, sub-6 GHz legacy): $T_{\rm slot} = 1000~\mu$s
  • $\mu = 1$ (30 kHz, sub-6 GHz main): $T_{\rm slot} = 500~\mu$s
  • $\mu = 3$ (120 kHz, mmWave FR2): $T_{\rm slot} = 125~\mu$s

In TDD, the slot must accommodate uplink pilots, uplink data, guard symbols, and downlink data. A typical experimental TDD pattern leaves roughly $300~\mu$s of compute window in sub-6 and under $80~\mu$s in FR2. That window is all the time the FPGA and SoC have to (i) estimate the channel, (ii) form the combiner or precoder, and (iii) apply it to payload data. Any stage that overruns causes the slot to be dropped and the user to see an outage event.

Practical Constraints
  • Sub-6 GHz (30 kHz) testbeds have $\sim 300~\mu$s of compute window per slot

  • FR2 (120 kHz) testbeds have $\sim 80~\mu$s of compute window per slot

  • Missing the deadline drops the slot; repeated drops trigger HARQ retransmission and SLA violations

📋 Ref: 3GPP TS 38.211, Table 4.3.2-1
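To make the numerology concrete, here is a minimal sketch that tabulates $T_{\rm slot}$ and a rough compute window for each $\mu$. The 60% window fraction is an assumption chosen to match the experimental TDD pattern quoted above, not a 3GPP-mandated value.

```python
def slot_duration_us(mu: int) -> float:
    """T_slot in microseconds for subcarrier spacing 15 * 2**mu kHz."""
    return 1000.0 / (2 ** mu)

def compute_window_us(mu: int, window_fraction: float = 0.6) -> float:
    """Rough compute window after pilots and TDD guards (fraction assumed)."""
    return window_fraction * slot_duration_us(mu)

for mu in (0, 1, 3):
    print(f"mu={mu}: T_slot = {slot_duration_us(mu):6.1f} us, "
          f"window ~ {compute_window_us(mu):5.1f} us")
```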

Theorem: Per-Slot Compute Budget for Linear Precoding

For a uniform-linear BS with $N_t$ antennas serving $K$ single-antenna users with channel estimate $\hat{\mathbf{H}} \in \mathbb{C}^{N_t \times K}$, the dominant compute cost per slot of computing and applying the precoder scales as:

  • Maximum ratio (MR): $\Theta(N_t K)$ complex multiplies to form $\hat{\mathbf{H}}^*$ and apply it to the data.
  • Zero forcing (ZF): $\Theta(N_t K^{2} + K^{3})$ multiplies; the first term dominates when $N_t \gg K$.
  • Regularized MMSE: same asymptotics as ZF with a slightly larger constant.

Multiplying by the number of resource blocks $N_{\rm RB}$ (each with its own precoder) and dividing by the compute window $T_c < T_{\rm slot}$ gives the required throughput in complex multiplies per second.

Channel estimation is usually cheap — a matrix product. The expensive step is the $K^{3}$ matrix inverse in ZF or MMSE, which dominates for moderate $K$ but is eclipsed by the $N_t K^{2}$ Gram step as $N_t$ grows large. Massive MIMO operates in exactly the regime where $N_t \gg K$, so the Gram matrix construction is the single most expensive operation per slot.
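The following sketch turns the theorem's operation counts into a required complex-MAC rate. The resource-block count and compute window in the example call are illustrative assumptions, not values fixed by the text.

```python
def required_cmacs_per_s(Nt: int, K: int, N_RB: int, Tc_s: float,
                         scheme: str = "zf") -> float:
    """Complex multiplies per second needed to form the precoder each slot."""
    if scheme == "mr":
        per_rb = Nt * K                # form the conjugate of H-hat
    elif scheme == "zf":
        per_rb = Nt * K ** 2 + K ** 3  # Gram matrix + K x K inverse
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return N_RB * per_rb / Tc_s

# Example: Nt = 100, K = 10, 50 resource blocks, 300 us compute window.
print(f"{required_cmacs_per_s(100, 10, 50, 300e-6):.3g} complex MACs/s")
```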

⚠️ Engineering Note

FPGA vs SoC vs GPU for Massive MIMO Baseband

Real-time massive MIMO basebands live on four architectural points:

  • FPGA (Xilinx UltraScale+, Intel Stratix). Hand-written RTL with fixed-point DSP slices. Best latency (microseconds), best determinism, worst developer productivity. Used in LuMaMi, ArgosV3, and most FR2 prototypes where the compute window is under 100 microseconds.

  • SoC with CPU + FPGA fabric (Xilinx Zynq UltraScale+, RFSoC). A compromise: the FPGA fabric handles the per-sample processing (FFT, channel estimation kernels) while the CPU handles per-slot control and the matrix inverse. This is the current mainstream choice for academic testbeds.

  • Pure x86 (OAI, srsRAN). The entire pipeline runs as a real-time Linux process with AVX2/AVX-512 vectorization. Scales to roughly $N_t=64$ in sub-6 numerology on a single high-end Xeon. Easiest to develop and debug, worst latency jitter. Acceptable when the slot window is loose (sub-6 only).

  • GPU offload. An emerging option using NVIDIA Aerial or similar frameworks. Fits the matrix-inverse step in particular but suffers from PCIe transfer latency. Still niche for real-time MIMO processing.

The hybrid FPGA+SoC architecture has become dominant because it matches the per-sample/per-slot partition of the pipeline stages.

Practical Constraints
  • FPGA DSP slices are 18-by-18 bit fixed-point — larger mantissas need multiple slices per multiply (see the sketch after this list)

  • CPU AVX-512 gives 16 single-precision multiplies per cycle per core

  • GPU PCIe transfer alone costs $\sim 10~\mu$s each way, eating a large fraction of the FR2 compute window
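As a rough feasibility check on the first constraint above, a sketch that converts a required complex-MAC rate into a DSP-slice count. The 400 MHz fabric clock and the three-real-multiply complex product are assumptions, not vendor figures.

```python
import math

def dsp_slices_needed(cmacs_per_s: float, f_clk: float = 400e6,
                      real_mults_per_cmac: int = 3) -> int:
    """Slices at one real multiply per slice per cycle."""
    return math.ceil(cmacs_per_s * real_mults_per_cmac / f_clk)

# e.g. the ~1.8e9 complex MACs/s ZF precoder-formation rate computed
# earlier fits in a handful of slices at a 400 MHz fabric clock:
print(dsp_slices_needed(1.8e9))   # -> 14
```

The three-multiply complex product (Karatsuba-style) is a common slice-saving choice over the naive four-multiply form.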


Definition:

Fixed-Point vs Floating-Point Baseband

The baseband multiplier-accumulator operates on each complex sample in one of two number formats:

  • Fixed-point with $b$ mantissa bits: the sample is represented as a signed integer, with an implied scale factor. Dynamic range is $\sim 6b$ dB; arithmetic is one cycle per DSP slice; power consumption and area scale as $b^2$. This is the standard FPGA format.

  • Floating-point (IEEE 754 single or half precision): the sample has a mantissa of 23 or 10 bits and an exponent. Dynamic range is effectively unlimited for massive MIMO purposes; area and power are 5--10$\times$ larger than the equivalent fixed-point. Standard on CPU and GPU.

The engineering decision is: what is the smallest $b$ we can use in fixed-point before the quantization-induced SINR loss becomes visible in the per-user rate?

Theorem: Fixed-Point Quantization SINR Penalty

Suppose each complex multiply in a zero-forcing massive MIMO combiner uses $b$-bit fixed-point arithmetic with saturation-free headroom, so the per-multiply relative error is a zero-mean uniform random variable with variance $\sigma_q^2 = 2^{-2b}/3$. For a floating-point SINR of $\gamma_0$, the fixed-point effective SINR degrades as
$$
\gamma_{\rm fxp}(b) \approx \frac{\gamma_0}{1 + \gamma_0 \, c_{\rm arch} \cdot 2^{-2b}},
$$
where $c_{\rm arch}$ is a pipeline-dependent constant of order $N_t K$.

Quantization noise is multiplicative per multiply, and its power accumulates across the $N_t K$ multiplies of the array. At low reference SINR the channel noise dominates and quantization is invisible; at high SINR the quantization noise becomes the floor, setting a ceiling on the achievable rate that is determined entirely by the mantissa width. Each extra bit of mantissa pushes the ceiling up by roughly 6 dB.
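A short sketch evaluates the theorem's formula and shows the roughly 6 dB-per-bit behavior directly. The value $c_{\rm arch} = N_t K$ with $N_t = 64$, $K = 8$ is an assumed example.

```python
import numpy as np

def sinr_fxp_db(gamma0_db: float, b: int, c_arch: float) -> float:
    """Effective fixed-point SINR (dB) from the theorem above."""
    g0 = 10 ** (gamma0_db / 10)
    return 10 * np.log10(g0 / (1 + g0 * c_arch * 2.0 ** (-2 * b)))

for b in (8, 10, 12, 14):
    print(f"b = {b:2d}: {sinr_fxp_db(20.0, b, c_arch=64 * 8):6.2f} dB")
```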

Fixed-Point SINR Loss as a Function of Bit Width

Scan the mantissa bit width $b$ and observe the effective SINR ceiling. For a given floating-point reference SINR, there is a minimum $b$ below which the per-user rate is dominated by quantization rather than channel noise.

(Interactive plot: sweep the mantissa bit width $b$; default parameters: floating-point reference SINR 20 dB, $N_t = 64$, $K = 8$.)

Example: Sizing the Mantissa for a 10-User LuMaMi-Class Testbed

A LuMaMi-class testbed has $N_t=100$ antennas, serves $K=10$ users, and runs a ZF combiner. The desired operating point is $\gamma_0 = 20$ dB floating-point, and the design tolerance for quantization loss is 0.3 dB. Using $c_{\rm arch} \approx N_t K/4 = 250$, what is the smallest mantissa bit width $b$ that meets this target?
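One way to answer: sweep $b$ upward until the loss $10\log_{10}(1 + \gamma_0 c_{\rm arch} 2^{-2b})$ drops below the tolerance. A minimal sketch using exactly the numbers above lands at $b = 10$ bits:

```python
import math

gamma0 = 10 ** (20 / 10)       # 20 dB reference SINR, in linear units
c_arch = 100 * 10 / 4          # Nt * K / 4 = 250
tol_db = 0.3                   # design tolerance for quantization loss

b = 1
while True:
    loss_db = 10 * math.log10(1 + gamma0 * c_arch * 2.0 ** (-2 * b))
    if loss_db <= tol_db:
        break
    b += 1
print(f"smallest b = {b} (loss = {loss_db:.2f} dB)")   # -> b = 10
```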

Per-Slot Latency Budget Allocation

Complexity: $O(1)$ — the budget is a linear sum of precomputed per-stage costs.
Input: Slot duration $T_{\rm slot}$, TDD pattern, array size $N_t$,
users $K$, bandwidth $W$, combiner type (MR/ZF/MMSE).
Output: Per-stage latency budget $\{\tau_i\}_{i=1}^{6}$ summing to $T_{\rm slot}$.
1. Reserve guard intervals for TDD switching: $\tau_{\rm guard} \leftarrow$ two OFDM symbols.
2. Compute the pilot window $\tau_{\rm pilot}$ from the SRS/DMRS configuration.
3. Compute window $T_c \leftarrow T_{\rm slot} - \tau_{\rm guard} - \tau_{\rm pilot}$.
4. Budget by stage based on expected compute cost:
5. $\quad \tau_{\rm FFT} \leftarrow N_t \log_2 N_{\rm sc} / \mathrm{throughput}_{\rm FFT}$
6. $\quad \tau_{\rm chest} \leftarrow N_t K / \mathrm{throughput}_{\rm MAC}$
7. $\quad \tau_{\rm gram} \leftarrow N_t K^{2} / \mathrm{throughput}_{\rm MAC}$
8. $\quad \tau_{\rm inv} \leftarrow K^{3} / \mathrm{throughput}_{\rm MAC}$
9. $\quad \tau_{\rm apply} \leftarrow N_t K N_{\rm data} / \mathrm{throughput}_{\rm MAC}$
10. $\quad \tau_{\rm DAC} \leftarrow$ fixed DAC playout delay
11. If $\sum_i \tau_i > T_c$ then halve $K$, fall back to MR, or switch numerology.
12. Return $\{\tau_i\}$.

This algorithm runs once at testbed configuration time. In operation the stages execute in a pipeline; only step 11 (overflow detection) is revisited when the channel or user count changes.
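A sketch of the same procedure as code. The throughput figures, the one-pilot-symbol assumption, and the $N_{\rm data}$ value in the example call are placeholders, not measurements from any particular testbed.

```python
import math

def latency_budget(T_slot: float, sym_dur: float, Nt: int, K: int,
                   N_sc: int, N_data: int,
                   thr_fft: float = 5e9, thr_mac: float = 1e11,
                   t_dac: float = 5e-6) -> dict:
    """Per-stage latency budgets in seconds, following the steps above."""
    tau = {"guard": 2 * sym_dur,                 # step 1: TDD switching guard
           "pilot": sym_dur}                     # step 2: assume one pilot symbol
    Tc = T_slot - tau["guard"] - tau["pilot"]    # step 3: compute window
    tau["fft"]   = Nt * math.log2(N_sc) / thr_fft
    tau["chest"] = Nt * K / thr_mac
    tau["gram"]  = Nt * K ** 2 / thr_mac
    tau["inv"]   = K ** 3 / thr_mac
    tau["apply"] = Nt * K * N_data / thr_mac
    tau["dac"]   = t_dac                         # step 10: fixed playout delay
    if sum(tau.values()) - tau["guard"] - tau["pilot"] > Tc:
        # step 11: caller should halve K, fall back to MR, or switch numerology
        raise RuntimeError("compute window overflow")
    return tau

# 30 kHz numerology, 14 symbols per slot, 64-antenna array:
budget = latency_budget(T_slot=500e-6, sym_dur=500e-6 / 14,
                        Nt=64, K=8, N_sc=2048, N_data=12 * 2048)
```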

Latency Budget Breakdown Across the Pipeline

Stacked bar of where the slot deadline is spent: channel estimation, Gram matrix, inverse, precoder application, DAC playout, and guard. Sweep the user count $K$ or the array size $N_t$ to see which stage becomes the bottleneck first.

(Interactive plot: default parameters $N_t = 64$, $K = 8$.)
🎓 CommIT Contribution (2023)

Distributed Real-Time Processing for Cell-Free Massive MIMO

F. Gottsch, K. Ito, G. Caire. IEEE Transactions on Wireless Communications.

Gottsch, Ito, and Caire show that the Gram-matrix bottleneck of centralized ZF in massive MIMO can be spatially decomposed across cell-free access points with only a modest fronthaul overhead, provided the user-centric clustering keeps the per-AP user count bounded. Their analysis traces the per-AP compute cost as a function of cluster size and gives the exact breakpoint at which distributed processing overtakes centralized in terms of latency. The result underpins the Massive Beams startup's real-time architecture and shows, concretely, that the theoretical cell-free performance gains can be realized within the 5G NR slot budget.


Common Mistake: Fixed-Point Accumulator Overflow

Mistake:

Assuming the mantissa width is adequate based only on the per-sample dynamic range. An accumulator that sums $N$ independent samples grows in variance by a factor of $N$ — and a bit width sized for a single sample will saturate long before the sum is complete.

Correction:

Size the accumulator to hold $\log_2 N$ extra bits above the per-sample width. In a massive MIMO inner product over $N_t=256$ antennas that is 8 extra bits, so a 16-bit input sample needs at least a 24-bit accumulator. LuMaMi's 36-bit accumulators are a deliberately generous choice that eliminated accumulator overflow as a failure mode across every tested configuration.
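The sizing rule as a one-liner. Note this sizes the sum of $N$ full-scale samples; if products are accumulated at full $2b$-bit width, the base term grows accordingly.

```python
import math

def accumulator_bits(b_sample: int, N: int) -> int:
    """Bits needed to sum N full-scale b_sample-bit values without overflow."""
    return b_sample + math.ceil(math.log2(N))

print(accumulator_bits(16, 256))   # -> 24, matching the rule above
```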

Common Mistake: Debugging on Float, Shipping on Fixed

Mistake:

Prototyping the MIMO algorithm in MATLAB or NumPy with double-precision floating-point, getting the expected capacity curves, and assuming the fixed-point implementation will match.

Correction:

Always include a bit-accurate fixed-point simulation in the development loop. The SINR loss computed in Theorem 26.2 is the minimum you can expect; real pipelines also suffer from clipping, non-uniform bit-growth, and implementation bugs that only surface in the bit-accurate model. The golden reference should be a cycle-accurate fixed-point simulator, not a floating-point prototype.
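A toy stand-in for such a model, assuming a uniform saturating quantizer with a shared full-scale: quantize a ZF combiner to $b$ bits and measure the residual inter-user leakage against the float64 reference.

```python
import numpy as np

def quantize(x: np.ndarray, b: int) -> np.ndarray:
    """Uniform b-bit quantizer on [-1, 1), saturating at the rails."""
    scale = 2 ** (b - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1) / scale

rng = np.random.default_rng(0)
Nt, K, b = 64, 8, 10
H = (rng.standard_normal((Nt, K))
     + 1j * rng.standard_normal((Nt, K))) / np.sqrt(2)

W = np.linalg.pinv(H)          # float64 ZF combiner (K x Nt)
s = np.abs(W).max()            # shared scale into [-1, 1]
Wq = s * (quantize(W.real / s, b) + 1j * quantize(W.imag / s, b))

E = Wq @ H - np.eye(K)         # residual inter-user leakage
print("worst-user leakage power:", (np.abs(E) ** 2).sum(axis=1).max())
```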

Quick Check

A 30 kHz 5G NR numerology has $T_{\rm slot} = 500~\mu$s. Roughly how much of that is actually available for compute after pilot reception and TDD guard intervals?

  • Essentially all 500 μs
  • Around 300 μs
  • Under 50 μs
  • Exactly 250 μs

Key Takeaway

The slot budget is real. Massive MIMO is a compute problem dominated by the $N_t K^{2}$ Gram-matrix step and bounded by a sub-millisecond deadline. Fixed-point mantissa widths around 10--16 bits are sufficient for the algorithm, but accumulator headroom, FPGA/SoC partitioning, and the latency of each pipeline stage determine whether the theory can actually run on a real testbed. Every design decision in this section trades one of three currencies: latency, silicon area, or mantissa bits.