Real-Time Implementation and the Slot Budget
The Slot Clock Is Merciless
In the theoretical chapters we treated channel estimation and precoder computation as free operations: we wrote $\mathbf{W} = \hat{\mathbf{H}}(\hat{\mathbf{H}}^{\mathsf{H}}\hat{\mathbf{H}})^{-1}$ and moved on. In a real testbed, the moment the uplink pilots are received, a clock starts ticking down to the deadline by which the downlink samples must leave the DACs. For 5G NR at 30 kHz numerology that deadline is roughly 500 microseconds. Every stage of the pipeline (channel estimation, precoder computation, data encoding, DAC playout, PA nonlinearity compensation) gets a slice of that budget, and if any stage overruns, the entire slot is dropped.
This section has two goals. First, we lay out the real-time architecture of a massive MIMO testbed: the partition of work between FPGA and SoC, the fixed-point vs floating-point decision, and the latency budget decomposition. Second, we derive the SINR loss caused by truncating the baseband multiplies to $b$ bits and show how that loss determines the required mantissa width.
Definition: Massive MIMO Real-Time Processing Pipeline
A real-time massive MIMO baseband ingests an $M \times N_{\mathrm{sc}}$ matrix of per-subcarrier uplink samples each OFDM symbol and, over one slot of 14 symbols, must produce an $M \times N_{\mathrm{sc}}$ matrix of downlink samples for transmission. The pipeline has the following stages:
1. RF sampling and synchronization. ADCs digitize the $M$ antenna signals at sample rate $f_s$. A per-antenna CFO and timing correction is applied.
2. FFT and pilot extraction. A length-$N_{\mathrm{FFT}}$ FFT per antenna produces the frequency-domain samples; pilot resources are pulled out to form the channel-estimation input.
3. Channel estimation. Least-squares or LMMSE estimation produces $\hat{\mathbf{H}} \in \mathbb{C}^{M \times K}$ (or, in frequency-selective channels, one such matrix per resource block).
4. Combiner / precoder computation. Linear detection (MR, ZF, MMSE) is applied to the uplink data; or linear precoding is computed and stored to be applied to the downlink data in the next slot.
5. Symbol processing. Modulation, per-layer interference cancellation if used, and IFFT.
6. DAC playout. The downlink samples are clocked to the DACs with sub-nanosecond relative jitter across antennas so the coherent beam is preserved.
The entire chain must complete within the slot's compute window; stages 3 and 4 are the compute-heavy kernels and typically consume most of the budget.
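To make the two heavy kernels concrete, here is a minimal NumPy sketch of stages 3 and 4 under simplifying assumptions: unitary DFT pilots, a single resource block, and illustrative dimensions $M = 100$, $K = 16$ (these values are ours, not part of the pipeline definition).

```python
import numpy as np

M, K = 100, 16  # illustrative array and user counts (assumptions)
rng = np.random.default_rng(0)

# Stage 3: least-squares channel estimation from unitary pilots.
# With pilots P (K x K, unitary), Y_p = H P + N, so H_hat = Y_p P^H.
P = np.fft.fft(np.eye(K)) / np.sqrt(K)  # unitary DFT pilot matrix
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
N = 0.01 * (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K)))
H_hat = (H @ P + N) @ P.conj().T        # LS estimate, ~M*K^2 multiplies

# Stage 4: ZF precoder W = H_hat (H_hat^H H_hat)^{-1}.
G = H_hat.conj().T @ H_hat              # Gram matrix, M*K^2 multiplies
W = H_hat @ np.linalg.inv(G)            # K^3 inverse + M*K^2 combine
# (a real implementation would use a Cholesky solve, not an explicit inverse)

assert np.allclose(H_hat.conj().T @ W, np.eye(K), atol=1e-8)  # ZF removes inter-user interference
```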
5G NR Numerology and the Available Compute Window
The 5G NR frame structure fixes the slot duration as a function of the subcarrier spacing $\Delta f = 2^{\mu} \cdot 15$ kHz:
- $\mu = 0$ (15 kHz, sub-6 GHz legacy): $T_{\mathrm{slot}} = 1000$ µs
- $\mu = 1$ (30 kHz, sub-6 GHz main): $T_{\mathrm{slot}} = 500$ µs
- $\mu = 3$ (120 kHz, mmWave FR2): $T_{\mathrm{slot}} = 125$ µs
In TDD, the slot must accommodate uplink pilots, uplink data, guard symbols, and downlink data. A typical experimental TDD pattern leaves roughly 300 µs of compute window in sub-6 GHz and under 80 µs in FR2. That window is all the time the FPGA and SoC have to (i) estimate the channel, (ii) form the combiner or precoder, and (iii) apply it to payload data. Any stage that overruns causes the slot to be dropped and the user to see an outage event.
- Sub-6 GHz (30 kHz) testbeds have roughly 300 µs of compute window per slot
- FR2 (120 kHz) testbeds have under 80 µs of compute window per slot
- Missing the deadline drops the slot; repeated drops trigger HARQ retransmissions and SLA violations
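The arithmetic is one line; the sketch below computes the compute window from the numerology, with the guard and pilot overheads of this section's example TDD pattern (see the Quick Check at the end) treated as assumed inputs.

```python
# T_slot = 1 ms / 2^mu (3GPP numerology); the overheads are this section's
# example TDD pattern, not values fixed by the standard.
def compute_window_us(mu: int, guard_us: float, pilot_us: float) -> float:
    t_slot_us = 1000.0 / (2 ** mu)
    return t_slot_us - guard_us - pilot_us

print(compute_window_us(mu=1, guard_us=70.0, pilot_us=130.0))  # 300.0 (sub-6, 30 kHz)
print(compute_window_us(mu=3, guard_us=20.0, pilot_us=30.0))   # 75.0  (FR2; overheads assumed)
```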
Theorem: Per-Slot Compute Budget for Linear Precoding
For a uniform-linear BS array with $M$ antennas serving $K$ single-antenna users with channel estimate $\hat{\mathbf{H}} \in \mathbb{C}^{M \times K}$, the dominant compute cost per slot of computing and applying the precoder scales as:
- Maximum ratio (MR): $MK$ complex multiplies per data sample to form $\mathbf{W}_{\mathrm{MR}} = \hat{\mathbf{H}}^{*}$ and apply it to data.
- Zero forcing (ZF): $2MK^{2} + O(K^{3})$ multiplies once per slot plus $MK$ per data sample; the first term dominates when $M \gg K$.
- Regularized MMSE: same asymptotics as ZF with a slightly larger constant.
Multiplying by the number of resource blocks (each with its own precoder) and dividing by the compute window gives the required throughput in complex multiplies per second.
Channel estimation is usually cheap: an $M \times K$ matrix product against the pilot matrix. The expensive step is the $K \times K$ matrix inverse in ZF or MMSE, which dominates for moderate $M$ but is eclipsed by the Gram step as $M$ grows large. Massive MIMO operates in exactly the regime where $M \gg K$, so the Gram matrix construction is the single most expensive operation per slot.
Write the precoder as $\mathbf{W}_{\mathrm{ZF}} = \hat{\mathbf{H}}(\hat{\mathbf{H}}^{\mathsf{H}}\hat{\mathbf{H}})^{-1}$ and count the complex multiplies for each factor.
The Gram matrix is $\mathbf{G} = \hat{\mathbf{H}}^{\mathsf{H}}\hat{\mathbf{H}}$ and costs $MK^{2}$ multiplies.
Inverting a $K \times K$ matrix is $O(K^{3})$ via Cholesky; multiplying $\hat{\mathbf{H}}$ by $\mathbf{G}^{-1}$ is another $MK^{2}$.
Sum the dominant terms and compare to MR to quantify the complexity overhead of ZF over MR in the massive-MIMO regime.
MR precoder cost
The maximum-ratio precoder is $\mathbf{W}_{\mathrm{MR}} = \hat{\mathbf{H}}^{*}$. Forming the conjugate is free; applying it to a per-user data symbol is a matrix-vector product of size $M \times K$, costing $MK$ multiplies per data sample. Over a slot with $N_{d}$ data samples per RB and $N_{\mathrm{RB}}$ RBs, the total cost is $MK\,N_{d}\,N_{\mathrm{RB}}$.
ZF: Gram matrix construction
The ZF precoder requires first forming the Gram matrix $\mathbf{G} = \hat{\mathbf{H}}^{\mathsf{H}}\hat{\mathbf{H}}$. Each entry is an inner product of two $M$-dimensional vectors, costing $M$ multiplies; there are $K^{2}$ entries (exploiting Hermitian symmetry cuts this in half but does not change the asymptotic order). Total: $MK^{2}$.
ZF: matrix inversion
Inverting $\mathbf{G}$ by Cholesky factorization costs about $K^{3}/3$ multiplies. In the massive-MIMO regime $M \gg K$ this term is dominated by the Gram construction, but for moderate $M/K$ it is still the step with the worst constant factor.
ZF: combine and apply
Finally, $\mathbf{W}_{\mathrm{ZF}} = \hat{\mathbf{H}}\mathbf{G}^{-1}$ is formed at cost $MK^{2}$, and applied to each data sample at cost $MK$. Summing all four contributions: once-per-slot costs are $2MK^{2} + O(K^{3})$, per-sample costs are $MK$.
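The counts translate directly into a throughput requirement. The sketch below evaluates the theorem for a LuMaMi-class configuration; the resource-block and data-sample counts are illustrative assumptions, not testbed values.

```python
# Complex-multiply counts from the theorem, and the implied throughput.
def zf_multiplies_per_slot(M: int, K: int, n_rb: int, n_d: int) -> int:
    once_per_slot = M * K**2 + K**3 // 3 + M * K**2  # Gram + Cholesky + combine
    per_sample = M * K                               # apply W to one data vector
    return n_rb * (once_per_slot + n_d * per_sample)

def mr_multiplies_per_slot(M: int, K: int, n_rb: int, n_d: int) -> int:
    return n_rb * n_d * M * K                        # conjugate is free

M, K = 100, 16        # LuMaMi-class array (see the example below)
n_rb, n_d = 50, 120   # resource blocks, data samples per RB (assumed)
window = 300e-6       # sub-6 compute window from this section

zf = zf_multiplies_per_slot(M, K, n_rb, n_d)
mr = mr_multiplies_per_slot(M, K, n_rb, n_d)
print(f"ZF: {zf / window / 1e9:.1f} Gmultiply/s, overhead vs MR: {zf / mr:.2f}x")
```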
FPGA vs SoC vs GPU for Massive MIMO Baseband
Real-time massive MIMO basebands live on three architectural points:
- FPGA (Xilinx UltraScale+, Intel Stratix). Hand-written RTL with fixed-point DSP slices. Best latency (microseconds), best determinism, worst developer productivity. Used in LuMaMi, ArgosV3, and most FR2 prototypes where the compute window is under 100 microseconds.
- SoC with CPU + FPGA fabric (Xilinx Zynq UltraScale+, RFSoC). A compromise: the FPGA fabric handles the per-sample processing (FFT, channel estimation kernels) while the CPU handles per-slot control and the matrix inverse. This is the current mainstream choice for academic testbeds.
- Pure x86 (OAI, srsRAN). The entire pipeline runs as a real-time Linux process with AVX2/AVX-512 vectorization. Scales to modest array sizes in sub-6 numerology on a single high-end Xeon. Easiest to develop and debug, worst latency jitter. Acceptable when the slot window is loose (sub-6 only).
- GPU offload. An emerging option using NVIDIA Aerial or similar frameworks. Fits the matrix-inverse step in particular but suffers from PCIe transfer latency. Still niche for real-time MIMO processing.
The hybrid FPGA+SoC architecture has become dominant because it matches the per-sample/per-slot partition of the pipeline stages.
- FPGA DSP slices are 18-by-18-bit fixed-point multipliers; larger mantissas need multiple slices per multiply
- CPU AVX-512 gives 16 single-precision multiplies per cycle per core
- GPU PCIe transfer alone costs tens of microseconds each way, eating a large fraction of the FR2 compute window
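As a sanity check on the pure-x86 option, the sketch below converts the ZF throughput computed earlier into a core count, using the AVX-512 figure from the list above. The sustained clock rate and the four-real-multiply complex product are our assumptions.

```python
# Can AVX-512 cores sustain the ZF multiply rate? Back-of-envelope only:
# ignores memory stalls, FFTs, and everything outside the precoder kernels.
required_cmult_per_s = 40.8e9  # ZF rate from the theorem-based count above
real_per_complex = 4           # schoolbook complex multiply = 4 real multiplies
clock_hz = 3.0e9               # assumed sustained clock
mult_per_cycle = 16            # AVX-512 single-precision lanes (see list above)

per_core = clock_hz * mult_per_cycle / real_per_complex
print(f"cores needed: {required_cmult_per_s / per_core:.1f}")  # ~3-4 cores
```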
Definition: Fixed-Point vs Floating-Point Baseband
The baseband multiplier-accumulator operates on each complex sample in one of two number formats:
- Fixed-point with $b$ mantissa bits: the sample is represented as a signed $b$-bit integer with an implied scale factor. Dynamic range is roughly $6b$ dB; arithmetic is one cycle per DSP slice; multiplier power consumption and area scale roughly as $b^{2}$. This is the standard FPGA format.
- Floating-point (IEEE 754 single or half precision): the sample has a mantissa of 23 or 10 bits and an exponent. Dynamic range is effectively unlimited for massive MIMO purposes; area and power are 5--10$\times$ larger than the equivalent fixed-point. Standard on CPU and GPU.
The engineering decision is: what is the smallest $b$ we can use in fixed-point before the quantization-induced SINR loss becomes visible in the per-user rate?
Theorem: Fixed-Point Quantization SINR Penalty
Suppose each complex multiply in a zero-forcing massive MIMO combiner uses $b$-bit fixed-point arithmetic with saturation-free headroom, so the per-multiply relative error is a zero-mean uniform random variable with variance $2^{-2b}/12$. For a floating-point SINR of $\mathrm{SINR}_{\infty}$, the fixed-point effective SINR $\mathrm{SINR}_{b}$ degrades as
$$\frac{1}{\mathrm{SINR}_{b}} = \frac{1}{\mathrm{SINR}_{\infty}} + c\,M\,2^{-2b},$$
where $c$ is a pipeline-dependent constant of order unity.
Quantization noise is injected multiplicatively at every multiply, and its power accumulates along the signal path. At low reference SINR the channel noise dominates and quantization is invisible; at high SINR the quantization noise becomes the floor, setting a ceiling on the achievable rate that is determined entirely by the mantissa width. Each extra bit of mantissa pushes the ceiling up by roughly 6 dB.
Model each multiply as $\widehat{xy} = xy(1 + \varepsilon)$ with $\varepsilon$ uniform on $[-2^{-b-1}, 2^{-b-1}]$.
The total quantization error at the output is the sum of per-multiply errors along the signal path; each term is independent.
There are $O(M)$ multiplies on the signal path, so the error variance grows as $M\,2^{-2b}$.
Adding this to the noise floor and dividing by the signal power yields the reciprocal-SINR relation.
Per-multiply error model
A round-to-nearest $b$-bit fixed-point multiply incurs a relative error $\varepsilon$ uniform on $[-2^{-b-1}, 2^{-b-1}]$ (in units of the full-scale amplitude), with variance $2^{-2b}/12$. We include both I and Q rails: a complex multiply comprises four real multiplies, so the effective complex error variance is $4 \times 2^{-2b}/12$.
Accumulation along the pipeline
Each output sample of the ZF combiner passes through the Gram construction, the Cholesky-based inverse, and the final application to the input vector. Counting multiplies along the signal path gives $cM$ of them, with $c$ a pipeline-dependent constant of order unity. Under the Widrow model the per-multiply errors are mutually independent, so the total quantization error at the combined output has variance $\sigma_{q}^{2} = c\,M\,\sigma_{\varepsilon}^{2}\,P_{s}$, where $P_{s}$ is the nominal signal power and $\sigma_{\varepsilon}^{2}$ is the per-multiply relative-error variance.
Effective SINR
Denote the channel-noise variance by $\sigma_{n}^{2}$ and the floating-point SINR by $\mathrm{SINR}_{\infty} = P_{s}/\sigma_{n}^{2}$. Adding the quantization variance to the noise floor:
$$\mathrm{SINR}_{b} = \frac{P_{s}}{\sigma_{n}^{2} + c\,M\,\sigma_{\varepsilon}^{2}\,P_{s}} \quad\Longrightarrow\quad \frac{1}{\mathrm{SINR}_{b}} = \frac{1}{\mathrm{SINR}_{\infty}} + c\,M\,\sigma_{\varepsilon}^{2}.$$
Substituting the bit width
Using $\sigma_{\varepsilon}^{2} = 2^{-2b}/12$ (and absorbing the factor of four from the I/Q split, together with the $1/12$, into $c$) produces the claimed expression. In the high-SINR limit $\mathrm{SINR}_{b} \to 1/(c\,M\,2^{-2b})$, independent of $\mathrm{SINR}_{\infty}$: a rate ceiling set purely by the mantissa width.
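Evaluating the ceiling is a one-liner; the sketch below scans $b$ at a fixed reference SINR. The value $c = 1$ matches the worked example below and is an assumption, not a derived constant.

```python
import numpy as np

# Effective SINR from the theorem: 1/SINR_b = 1/SINR_inf + c*M*2^(-2b).
def sinr_b_db(b: int, sinr_inf_db: float, M: int, c: float = 1.0) -> float:
    sinr_inf = 10 ** (sinr_inf_db / 10)
    return 10 * np.log10(1.0 / (1.0 / sinr_inf + c * M * 2.0 ** (-2 * b)))

for b in range(6, 15, 2):
    print(b, round(sinr_b_db(b, sinr_inf_db=25.0, M=100), 2))
# The ceiling climbs ~12 dB per two bits (6 dB/bit) until the channel-noise
# floor takes over around b = 10 for this operating point.
```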
Fixed-Point SINR Loss as a Function of Bit Width
Scan the mantissa bit width $b$ and observe the effective SINR ceiling. For a given floating-point reference SINR, there is a minimum $b$ below which the per-user rate is dominated by quantization rather than channel noise.
Example: Sizing the Mantissa for a 16-User LuMaMi-Class Testbed
A LuMaMi-class testbed has $M = 100$ antennas, serves $K = 16$ users, and runs a ZF combiner. The desired operating point is $\mathrm{SINR}_{\infty} = 25$ dB floating-point, and the design tolerance for quantization loss is 0.3 dB. Using $c = 1$, what is the smallest mantissa bit width $b$ that meets this target?
Translate the tolerance into an SINR ratio
A 0.3 dB loss means $\mathrm{SINR}_{b} = 10^{-0.03}\,\mathrm{SINR}_{\infty} \approx 0.933\,\mathrm{SINR}_{\infty}$. From the theorem:
$$c\,M\,2^{-2b} = \frac{1}{\mathrm{SINR}_{b}} - \frac{1}{\mathrm{SINR}_{\infty}} = \frac{1}{\mathrm{SINR}_{\infty}}\left(\frac{1}{0.933} - 1\right) \approx \frac{0.0715}{\mathrm{SINR}_{\infty}}.$$
Solve for $2^{-2b}$
Rearranging: $2^{-2b} = 0.0715 / (c\,M\,\mathrm{SINR}_{\infty})$. With $\mathrm{SINR}_{\infty} = 10^{2.5} \approx 316$ (linear) and $cM = 100$: $2^{-2b} \approx 2.26 \times 10^{-6}$.
Take logs
$2b = \log_{2}\!\left(1 / (2.26 \times 10^{-6})\right) \approx 18.75$, so $b \approx 9.4$. Rounding up to avoid the exact edge, $b = 10$ mantissa bits are sufficient. In practice LuMaMi used 16-bit mantissas throughout, leaving a generous 6-bit headroom for accumulation growth, a standard FPGA design choice.
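The same solve in code, for checking other operating points; the function simply inverts the theorem and is a sketch under the same $c = 1$ assumption.

```python
import math

# Invert the quantization theorem: smallest b such that the SINR loss
# at the given operating point stays within loss_db.
def min_mantissa_bits(sinr_inf_db: float, loss_db: float, M: int, c: float = 1.0) -> int:
    sinr_inf = 10 ** (sinr_inf_db / 10)
    sinr_b = sinr_inf * 10 ** (-loss_db / 10)
    gap = 1.0 / sinr_b - 1.0 / sinr_inf  # equals c*M*2^(-2b) at the tolerance limit
    return math.ceil(0.5 * math.log2(c * M / gap))

print(min_mantissa_bits(sinr_inf_db=25.0, loss_db=0.3, M=100))  # -> 10
```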
Per-Slot Latency Budget Allocation
Complexity: O(1); the budget is a linear sum of precomputed per-stage costs. This algorithm runs once at testbed configuration time. In operation the stages execute in a pipeline; only the final overflow-detection stage is revisited when the channel or user count changes.
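A configuration-time budget check is a few lines; the per-stage figures below are illustrative placeholders for the measured costs of a particular build, not canonical numbers.

```python
# One-shot budget check at configuration time: per-stage latency estimates
# (microseconds, illustrative) must fit inside the compute window with slack.
STAGE_BUDGET_US = {
    "channel_estimation": 40.0,
    "gram_matrix":        80.0,
    "matrix_inverse":     30.0,
    "precoder_apply":     80.0,
    "dac_playout":        40.0,
    "guard":              20.0,
}

def check_budget(window_us: float = 300.0) -> None:
    total = sum(STAGE_BUDGET_US.values())
    slack = window_us - total
    assert slack >= 0.0, f"overrun by {-slack:.0f} us: the slot would be dropped"
    print(f"total {total:.0f} us, slack {slack:.0f} us")

check_budget()  # re-run only when M, K, or the numerology changes
```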
Latency Budget Breakdown Across the Pipeline
Stacked bar of where the slot deadline is spent: channel estimation, Gram matrix, inverse, precoder application, DAC playout, and guard. Sweep the user count $K$ or the array size $M$ to see which stage becomes the bottleneck first.
Distributed Real-Time Processing for Cell-Free Massive MIMO
Gottsch, Ito, and Caire show that the Gram-matrix bottleneck of centralized ZF in massive MIMO can be spatially decomposed across cell-free access points with only a modest fronthaul overhead, provided the user-centric clustering keeps the per-AP user count bounded. Their analysis traces the per-AP compute cost as a function of cluster size and gives the exact breakpoint at which distributed processing overtakes centralized in terms of latency. The result underpins the Massive Beams startup's real-time architecture and shows, concretely, that the theoretical cell-free performance gains can be realized within the 5G NR slot budget.
Common Mistake: Fixed-Point Accumulator Overflow
Mistake:
Assuming the mantissa width is adequate based only on the per-sample dynamic range. An accumulator that sums $N$ independent samples grows in variance by a factor of $N$, and in worst-case amplitude by a factor of $N$; a bit width sized for a single sample will saturate long before the sum is complete.
Correction:
Size the accumulator to hold $\lceil \log_{2} N \rceil$ extra bits above the per-sample width. In a massive MIMO inner product over $M = 256$ antennas that is 8 extra bits, so a 16-bit input sample needs at least a 24-bit accumulator. LuMaMi's 36-bit accumulators are a deliberately generous choice that eliminated accumulator overflow as a failure mode across every tested configuration.
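The sizing rule as a one-liner; the function name and the demonstration values are ours.

```python
import math

# Accumulator width for a length-M inner product: worst-case growth is a
# factor of M in amplitude, i.e. ceil(log2(M)) extra integer bits.
def accumulator_bits(sample_bits: int, M: int) -> int:
    return sample_bits + math.ceil(math.log2(M))

print(accumulator_bits(16, 256))  # -> 24 (the example in the correction above)
print(accumulator_bits(16, 100))  # -> 23 (a LuMaMi-class array needs 7 growth bits)
```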
Common Mistake: Debugging on Float, Shipping on Fixed
Mistake:
Prototyping the MIMO algorithm in MATLAB or NumPy with double-precision floating-point, getting the expected capacity curves, and assuming the fixed-point implementation will match.
Correction:
Always include a bit-accurate fixed-point simulation in the development loop. The SINR loss computed in Theorem 26.2 is the minimum you can expect; real pipelines also suffer from clipping, non-uniform bit-growth, and implementation bugs that only surface in the bit-accurate model. The golden reference should be a cycle-accurate fixed-point simulator, not a floating-point prototype.
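A bit-accurate golden reference starts from a quantizer that models the datapath format exactly: round-to-nearest with saturation, not Python's default float behavior. The sketch below is a minimal such model under an assumed signed full-scale format; it is not tied to any particular FPGA toolchain.

```python
import numpy as np

# b-bit signed fixed-point quantizer: round to nearest, saturate at full scale.
def quantize(x: np.ndarray, b: int, full_scale: float = 1.0) -> np.ndarray:
    step = full_scale * 2.0 ** (1 - b)                 # LSB of a signed b-bit word
    q = np.round(x / step) * step
    return np.clip(q, -full_scale, full_scale - step)  # saturation, not wraparound

# Run the float path and the quantized path on the same input and compare:
# the measured SQNR should track the ~6 dB/bit rule from the theorem.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000) * 0.25                 # headroom below full scale
err = x - quantize(x, b=12)
print(f"measured SQNR: {10 * np.log10(np.mean(x**2) / np.mean(err**2)):.1f} dB")
```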
Quick Check
A 30 kHz 5G NR numerology has $T_{\mathrm{slot}} = 500$ µs. Roughly how much of that is actually available for compute after pilot reception and TDD guard intervals?
Essentially all 500 µs
Around 300 µs
Under 50 µs
Exactly 250 µs
After reserving two OFDM symbols (~70 µs) for TDD guard and the pilot window (~130 µs) for SRS and UL data, the compute window in a typical TDD pattern is around 300 µs. That is the hard ceiling for channel estimation plus precoder computation.
Key Takeaway
The slot budget is real. Massive MIMO is a compute problem dominated by the Gram-matrix step and bounded by a sub-millisecond deadline. Fixed-point mantissa widths around 10--16 bits are sufficient for the algorithm, but accumulator headroom, FPGA/SoC partitioning, and the latency of each pipeline stage determine whether the theory can actually run on a real testbed. Every design decision in this section trades one of three currencies: latency, silicon area, or mantissa bits.