Real-Time Implementation and the Slot Budget

The Slot Clock Is Merciless

In the theoretical chapters we treated channel estimation and precoder computation as free operations: we wrote $\mathbf{W} = \mathbf{H}(\mathbf{H}^{H}\mathbf{H})^{-1}$ and moved on. In a real testbed, the moment the uplink pilots are received, a clock starts ticking down to the deadline by which the downlink samples must leave the DACs. For 5G NR at $\Delta f = 30$ kHz numerology that deadline is roughly $T_{\rm slot} = 500$ microseconds. Every stage of the pipeline — channel estimation, precoder computation, data encoding, DAC playout, PA nonlinearity compensation — gets a slice of that budget, and if any stage overruns, the entire slot is dropped.

This section has two goals. First, we lay out the real-time architecture of a massive MIMO testbed — the partition of work between FPGA and SoC, the fixed-point vs floating-point decision, and the latency budget decomposition. Second, we derive the SINR loss caused by truncating the baseband multiplies to $b$ bits and show how that loss determines the required mantissa width.

Definition:

Massive MIMO Real-Time Processing Pipeline

A real-time massive MIMO baseband ingests an $N_t \times N_{\rm sc}$ matrix of per-subcarrier uplink samples each OFDM symbol and, over one slot of $L$ symbols, must produce an $N_t \times N_{\rm sc}$ matrix of downlink samples for transmission. The pipeline comprises the following stages:

  1. RF sampling and synchronization. ADCs digitize $N_t$ antenna signals at sample rate $f_s \geq W$. A per-antenna CFO and timing correction is applied.

  2. FFT and pilot extraction. A length-$N_{\rm sc}$ FFT per antenna produces the frequency-domain samples; pilot resources are pulled out to form the channel-estimation input.

  3. Channel estimation. Least-squares or LMMSE estimation produces $\hat{\mathbf{H}} \in \mathbb{C}^{N_t \times K}$ (or, in frequency-selective channels, one such matrix per resource block).

  4. Combiner / precoder computation. Linear detection ($\hat{\mathbf{H}}^H$, ZF, MMSE) is applied to the uplink data; or linear precoding is computed and stored to be applied to the downlink data in the next slot.

  5. Symbol processing. Modulation, per-layer interference cancellation if used, and IFFT.

  6. DAC playout. The downlink samples are clocked to the DACs with sub-nanosecond relative jitter across antennas so the coherent beam is preserved.

The entire chain must complete within $T_{\rm slot}$; stages 3 and 4 are the compute-heavy kernels and typically consume most of the budget.

⚠️ Engineering Note

5G NR Numerology and the Available Compute Window

The 5G NR frame structure fixes the slot duration as a function of the subcarrier spacing $\Delta f = 15 \cdot 2^\mu$ kHz:

  • $\mu = 0$ (15 kHz, sub-6 GHz legacy): $T_{\rm slot} = 1000~\mu$s
  • $\mu = 1$ (30 kHz, sub-6 GHz main): $T_{\rm slot} = 500~\mu$s
  • $\mu = 3$ (120 kHz, mmWave FR2): $T_{\rm slot} = 125~\mu$s

In TDD, the slot must accommodate uplink pilots, uplink data, guard symbols, and downlink data. A typical experimental TDD pattern leaves roughly $300~\mu$s of compute window in sub-6 and under $80~\mu$s in FR2. That window is all the time the FPGA and SoC have to (i) estimate the channel, (ii) form the combiner or precoder, and (iii) apply it to payload data. Any stage that overruns causes the slot to be dropped and the user to see an outage event.

Practical Constraints
  • Sub-6 GHz (30 kHz) testbeds have $\sim 300~\mu$s of compute window per slot

  • FR2 (120 kHz) testbeds have $\sim 80~\mu$s of compute window per slot

  • Missing the deadline drops the slot; repeated drops trigger HARQ retransmission and SLA violations

📋 Ref: 3GPP TS 38.211, Table 4.3.2-1
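To make the numerology concrete, here is a minimal sketch that tabulates $T_{\rm slot}$ and a rough compute window for each $\mu$. The 60% window fraction is an assumption chosen to match the experimental TDD pattern quoted above, not a 3GPP-mandated value.

```python
def slot_duration_us(mu: int) -> float:
    """T_slot in microseconds for subcarrier spacing 15 * 2**mu kHz."""
    return 1000.0 / (2 ** mu)

def compute_window_us(mu: int, window_fraction: float = 0.6) -> float:
    """Rough compute window after pilots and TDD guards (fraction assumed)."""
    return window_fraction * slot_duration_us(mu)

for mu in (0, 1, 3):
    print(f"mu={mu}: T_slot = {slot_duration_us(mu):6.1f} us, "
          f"window ~ {compute_window_us(mu):5.1f} us")
```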

Theorem: Per-Slot Compute Budget for Linear Precoding

For a uniform-linear BS with $N_t$ antennas serving $K$ single-antenna users with channel estimate $\hat{\mathbf{H}} \in \mathbb{C}^{N_t \times K}$, the dominant compute cost per slot of computing and applying the precoder scales as:

  • Maximum ratio (MR): $\Theta(N_t K)$ complex multiplies to form $\hat{\mathbf{H}}^*$ and apply it to the data.
  • Zero forcing (ZF): $\Theta(N_t K^{2} + K^{3})$ multiplies; the first term dominates when $N_t \gg K$.
  • Regularized MMSE: same asymptotics as ZF with a slightly larger constant.

Multiplying by the number of resource blocks $N_{\rm RB}$ (each with its own precoder) and dividing by the compute window $T_c < T_{\rm slot}$ gives the required throughput in complex multiplies per second.

Channel estimation is usually cheap — a matrix product. The expensive step is the $K^{3}$ matrix inverse in ZF or MMSE, which dominates for moderate $K$ but is eclipsed by the $N_t K^{2}$ Gram step as $N_t$ grows large. Massive MIMO operates in exactly the regime where $N_t \gg K$, so the Gram matrix construction is the single most expensive operation per slot.
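The following sketch turns the theorem's operation counts into a required complex-MAC rate. The resource-block count and compute window in the example call are illustrative assumptions, not values fixed by the text.

```python
def required_cmacs_per_s(Nt: int, K: int, N_RB: int, Tc_s: float,
                         scheme: str = "zf") -> float:
    """Complex multiplies per second needed to form the precoder each slot."""
    if scheme == "mr":
        per_rb = Nt * K                # form the conjugate of H-hat
    elif scheme == "zf":
        per_rb = Nt * K ** 2 + K ** 3  # Gram matrix + K x K inverse
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return N_RB * per_rb / Tc_s

# Example: Nt = 100, K = 10, 50 resource blocks, 300 us compute window.
print(f"{required_cmacs_per_s(100, 10, 50, 300e-6):.3g} complex MACs/s")
```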

⚠️ Engineering Note

FPGA vs SoC vs GPU for Massive MIMO Baseband

Real-time massive MIMO basebands live on four architectural points:

  • FPGA (Xilinx UltraScale+, Intel Stratix). Hand-written RTL with fixed-point DSP slices. Best latency (microseconds), best determinism, worst developer productivity. Used in LuMaMi, ArgosV3, and most FR2 prototypes where the compute window is under 100 microseconds.

  • SoC with CPU + FPGA fabric (Xilinx Zynq UltraScale+, RFSoC). A compromise: the FPGA fabric handles the per-sample processing (FFT, channel estimation kernels) while the CPU handles per-slot control and the matrix inverse. This is the current mainstream choice for academic testbeds.

  • Pure x86 (OAI, srsRAN). The entire pipeline runs as a real-time Linux process with AVX2/AVX-512 vectorization. Scales to roughly $N_t=64$ in sub-6 numerology on a single high-end Xeon. Easiest to develop and debug, worst latency jitter. Acceptable when the slot window is loose (sub-6 only).

  • GPU offload. An emerging option using NVIDIA Aerial or similar frameworks. Fits the matrix-inverse step in particular but suffers from PCIe transfer latency. Still niche for real-time MIMO processing.

The hybrid FPGA+SoC architecture has become dominant because it matches the per-sample/per-slot partition of the pipeline stages.

Practical Constraints
  • FPGA DSP slices are 18-by-18 bit fixed-point — larger mantissas need multiple slices per multiply (see the sketch after this list)

  • CPU AVX-512 gives 16 single-precision multiplies per cycle per core

  • GPU PCIe transfer alone costs $\sim 10~\mu$s each way, eating a large fraction of the FR2 compute window
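As a rough feasibility check on the first constraint above, a sketch that converts a required complex-MAC rate into a DSP-slice count. The 400 MHz fabric clock and the three-real-multiply complex product are assumptions, not vendor figures.

```python
import math

def dsp_slices_needed(cmacs_per_s: float, f_clk: float = 400e6,
                      real_mults_per_cmac: int = 3) -> int:
    """Slices at one real multiply per slice per cycle."""
    return math.ceil(cmacs_per_s * real_mults_per_cmac / f_clk)

# e.g. the ~1.8e9 complex MACs/s ZF precoder-formation rate computed
# earlier fits in a handful of slices at a 400 MHz fabric clock:
print(dsp_slices_needed(1.8e9))   # -> 14
```

The three-multiply complex product (Karatsuba-style) is a common slice-saving choice over the naive four-multiply form.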


Definition:

Fixed-Point vs Floating-Point Baseband

The baseband multiplier-accumulator operates on each complex sample in one of two number formats:

  • Fixed-point with $b$ mantissa bits: the sample is represented as a signed integer, with an implied scale factor. Dynamic range is $\sim 6b$ dB; arithmetic is one cycle per DSP slice; power consumption and area scale as $b^2$. This is the standard FPGA format.

  • Floating-point (IEEE 754 single or half precision): the sample has a mantissa of 23 or 10 bits and an exponent. Dynamic range is effectively unlimited for massive MIMO purposes; area and power are 5--10$\times$ larger than the equivalent fixed-point. Standard on CPU and GPU.

The engineering decision is: what is the smallest $b$ we can use in fixed-point before the quantization-induced SINR loss becomes visible in the per-user rate?

Theorem: Fixed-Point Quantization SINR Penalty

Suppose each complex multiply in a zero-forcing massive MIMO combiner uses $b$-bit fixed-point arithmetic with saturation-free headroom, so the per-multiply relative error is a zero-mean uniform random variable with variance $\sigma_q^2 = 2^{-2b}/3$. For a floating-point SINR of $\gamma_0$, the fixed-point effective SINR degrades as
$$
\gamma_{\rm fxp}(b) \approx \frac{\gamma_0}{1 + \gamma_0 \, c_{\rm arch} \cdot 2^{-2b}},
$$
where $c_{\rm arch}$ is a pipeline-dependent constant of order $N_t K$.

Quantization noise is multiplicative per multiply, and its power accumulates across the $N_t K$ multiplies of the array. At low reference SINR the channel noise dominates and quantization is invisible; at high SINR the quantization noise becomes the floor, setting a ceiling on the achievable rate that is determined entirely by the mantissa width. Each extra bit of mantissa pushes the ceiling up by roughly 6 dB.
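A short sketch evaluates the theorem's formula and shows the roughly 6 dB-per-bit behavior directly. The value $c_{\rm arch} = N_t K$ with $N_t = 64$, $K = 8$ is an assumed example.

```python
import numpy as np

def sinr_fxp_db(gamma0_db: float, b: int, c_arch: float) -> float:
    """Effective fixed-point SINR (dB) from the theorem above."""
    g0 = 10 ** (gamma0_db / 10)
    return 10 * np.log10(g0 / (1 + g0 * c_arch * 2.0 ** (-2 * b)))

for b in (8, 10, 12, 14):
    print(f"b = {b:2d}: {sinr_fxp_db(20.0, b, c_arch=64 * 8):6.2f} dB")
```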

Fixed-Point SINR Loss as a Function of Bit Width

Scan the mantissa bit width $b$ and observe the effective SINR ceiling. For a given floating-point reference SINR, there is a minimum $b$ below which the per-user rate is dominated by quantization rather than channel noise.

(Interactive plot: sweep the mantissa bit width $b$; default parameters: floating-point reference SINR 20 dB, $N_t = 64$, $K = 8$.)

Example: Sizing the Mantissa for a 10-User LuMaMi-Class Testbed

A LuMaMi-class testbed has $N_t=100$ antennas, serves $K=10$ users, and runs a ZF combiner. The desired operating point is $\gamma_0 = 20$ dB floating-point, and the design tolerance for quantization loss is 0.3 dB. Using $c_{\rm arch} \approx N_t K/4 = 250$, what is the smallest mantissa bit width $b$ that meets this target?
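One way to answer: sweep $b$ upward until the loss $10\log_{10}(1 + \gamma_0 c_{\rm arch} 2^{-2b})$ drops below the tolerance. A minimal sketch using exactly the numbers above lands at $b = 10$ bits:

```python
import math

gamma0 = 10 ** (20 / 10)       # 20 dB reference SINR, in linear units
c_arch = 100 * 10 / 4          # Nt * K / 4 = 250
tol_db = 0.3                   # design tolerance for quantization loss

b = 1
while True:
    loss_db = 10 * math.log10(1 + gamma0 * c_arch * 2.0 ** (-2 * b))
    if loss_db <= tol_db:
        break
    b += 1
print(f"smallest b = {b} (loss = {loss_db:.2f} dB)")   # -> b = 10
```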

Per-Slot Latency Budget Allocation

Complexity: $O(1)$ — the budget is a linear sum of precomputed per-stage costs.
Input: Slot duration $T_{\rm slot}$, TDD pattern, array size $N_t$,
users $K$, bandwidth $W$, combiner type (MR/ZF/MMSE).
Output: Per-stage latency budget $\{\tau_i\}_{i=1}^{6}$ summing to $T_{\rm slot}$.
1. Reserve guard intervals for TDD switching: $\tau_{\rm guard} \leftarrow$ two OFDM symbols.
2. Compute the pilot window $\tau_{\rm pilot}$ from the SRS/DMRS configuration.
3. Compute window $T_c \leftarrow T_{\rm slot} - \tau_{\rm guard} - \tau_{\rm pilot}$.
4. Budget by stage based on expected compute cost:
5. $\quad \tau_{\rm FFT} \leftarrow N_t \log_2 N_{\rm sc} / \mathrm{throughput}_{\rm FFT}$
6. $\quad \tau_{\rm chest} \leftarrow N_t K / \mathrm{throughput}_{\rm MAC}$
7. $\quad \tau_{\rm gram} \leftarrow N_t K^{2} / \mathrm{throughput}_{\rm MAC}$
8. $\quad \tau_{\rm inv} \leftarrow K^{3} / \mathrm{throughput}_{\rm MAC}$
9. $\quad \tau_{\rm apply} \leftarrow N_t K N_{\rm data} / \mathrm{throughput}_{\rm MAC}$
10. $\quad \tau_{\rm DAC} \leftarrow$ fixed DAC playout delay
11. If $\sum_i \tau_i > T_c$ then halve $K$, fall back to MR, or switch numerology.
12. Return $\{\tau_i\}$.

This algorithm runs once at testbed configuration time. In operation the stages execute in a pipeline; only step 11 (overflow detection) is revisited when the channel or user count changes.
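A sketch of the same procedure as code. The throughput figures, the one-pilot-symbol assumption, and the $N_{\rm data}$ value in the example call are placeholders, not measurements from any particular testbed.

```python
import math

def latency_budget(T_slot: float, sym_dur: float, Nt: int, K: int,
                   N_sc: int, N_data: int,
                   thr_fft: float = 5e9, thr_mac: float = 1e11,
                   t_dac: float = 5e-6) -> dict:
    """Per-stage latency budgets in seconds, following the steps above."""
    tau = {"guard": 2 * sym_dur,                 # step 1: TDD switching guard
           "pilot": sym_dur}                     # step 2: assume one pilot symbol
    Tc = T_slot - tau["guard"] - tau["pilot"]    # step 3: compute window
    tau["fft"]   = Nt * math.log2(N_sc) / thr_fft
    tau["chest"] = Nt * K / thr_mac
    tau["gram"]  = Nt * K ** 2 / thr_mac
    tau["inv"]   = K ** 3 / thr_mac
    tau["apply"] = Nt * K * N_data / thr_mac
    tau["dac"]   = t_dac                         # step 10: fixed playout delay
    if sum(tau.values()) - tau["guard"] - tau["pilot"] > Tc:
        # step 11: caller should halve K, fall back to MR, or switch numerology
        raise RuntimeError("compute window overflow")
    return tau

# 30 kHz numerology, 14 symbols per slot, 64-antenna array:
budget = latency_budget(T_slot=500e-6, sym_dur=500e-6 / 14,
                        Nt=64, K=8, N_sc=2048, N_data=12 * 2048)
```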

Latency Budget Breakdown Across the Pipeline

Stacked bar of where the slot deadline is spent: channel estimation, Gram matrix, inverse, precoder application, DAC playout, and guard. Sweep the user count $K$ or the array size $N_t$ to see which stage becomes the bottleneck first.

(Interactive plot: default parameters $N_t = 64$, $K = 8$.)
🎓 CommIT Contribution (2023)

Distributed Real-Time Processing for Cell-Free Massive MIMO

F. Gottsch, K. Ito, G. Caire. IEEE Transactions on Wireless Communications.

Gottsch, Ito, and Caire show that the Gram-matrix bottleneck of centralized ZF in massive MIMO can be spatially decomposed across cell-free access points with only a modest fronthaul overhead, provided the user-centric clustering keeps the per-AP user count bounded. Their analysis traces the per-AP compute cost as a function of cluster size and gives the exact breakpoint at which distributed processing overtakes centralized in terms of latency. The result underpins the Massive Beams startup's real-time architecture and shows, concretely, that the theoretical cell-free performance gains can be realized within the 5G NR slot budget.


Common Mistake: Fixed-Point Accumulator Overflow

Mistake:

Assuming the mantissa width is adequate based only on the per-sample dynamic range. An accumulator that sums $N$ independent samples grows in variance by a factor of $N$ — and a bit width sized for a single sample will saturate long before the sum is complete.

Correction:

Size the accumulator to hold $\log_2 N$ extra bits above the per-sample width. In a massive MIMO inner product over $N_t=256$ antennas that is 8 extra bits, so a 16-bit input sample needs at least a 24-bit accumulator. LuMaMi's 36-bit accumulators are a deliberately generous choice that eliminated accumulator overflow as a failure mode across every tested configuration.
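The sizing rule as a one-liner. Note this sizes the sum of $N$ full-scale samples; if products are accumulated at full $2b$-bit width, the base term grows accordingly.

```python
import math

def accumulator_bits(b_sample: int, N: int) -> int:
    """Bits needed to sum N full-scale b_sample-bit values without overflow."""
    return b_sample + math.ceil(math.log2(N))

print(accumulator_bits(16, 256))   # -> 24, matching the rule above
```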

Common Mistake: Debugging on Float, Shipping on Fixed

Mistake:

Prototyping the MIMO algorithm in MATLAB or NumPy with double-precision floating-point, getting the expected capacity curves, and assuming the fixed-point implementation will match.

Correction:

Always include a bit-accurate fixed-point simulation in the development loop. The SINR loss computed in Theorem 26.2 is the minimum you can expect; real pipelines also suffer from clipping, non-uniform bit-growth, and implementation bugs that only surface in the bit-accurate model. The golden reference should be a cycle-accurate fixed-point simulator, not a floating-point prototype.
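A toy stand-in for such a model, assuming a uniform saturating quantizer with a shared full-scale: quantize a ZF combiner to $b$ bits and measure the residual inter-user leakage against the float64 reference.

```python
import numpy as np

def quantize(x: np.ndarray, b: int) -> np.ndarray:
    """Uniform b-bit quantizer on [-1, 1), saturating at the rails."""
    scale = 2 ** (b - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1) / scale

rng = np.random.default_rng(0)
Nt, K, b = 64, 8, 10
H = (rng.standard_normal((Nt, K))
     + 1j * rng.standard_normal((Nt, K))) / np.sqrt(2)

W = np.linalg.pinv(H)          # float64 ZF combiner (K x Nt)
s = np.abs(W).max()            # shared scale into [-1, 1]
Wq = s * (quantize(W.real / s, b) + 1j * quantize(W.imag / s, b))

E = Wq @ H - np.eye(K)         # residual inter-user leakage
print("worst-user leakage power:", (np.abs(E) ** 2).sum(axis=1).max())
```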

Quick Check

A 30 kHz 5G NR numerology has $T_{\rm slot} = 500~\mu$s. Roughly how much of that is actually available for compute after pilot reception and TDD guard intervals?

  • Essentially all 500 μs
  • Around 300 μs
  • Under 50 μs
  • Exactly 250 μs

Key Takeaway

The slot budget is real. Massive MIMO is a compute problem dominated by the $N_t K^{2}$ Gram-matrix step and bounded by a sub-millisecond deadline. Fixed-point mantissa widths around 10--16 bits are sufficient for the algorithm, but accumulator headroom, FPGA/SoC partitioning, and the latency of each pipeline stage determine whether the theory can actually run on a real testbed. Every design decision in this section trades one of three currencies: latency, silicon area, or mantissa bits.