Exercises
ex-mimo-ch25-01
Easy: Under what conditions on the channel model and the noise model is the linear MMSE channel estimator provably Bayes-optimal? Give the precise statement and identify the two assumptions that must hold.
Think about the loss function and the prior.
Under what joint distribution is the conditional expectation linear?
The orthogonality principle requires both signal and noise to have known second-order statistics.
Assumptions
MMSE is Bayes-optimal under squared-error loss when (i) the channel is jointly Gaussian with the observation and (ii) the channel covariance and the noise covariance are known exactly.
Why
Under these assumptions the conditional expectation $\mathbb{E}[\mathbf{h} \mid \mathbf{y}]$ is linear in the observation $\mathbf{y}$ and equals the LMMSE formula. The Bayes rule under squared-error loss is the conditional expectation, which therefore equals the MMSE estimator. If either assumption fails the argument breaks: under non-Gaussianity the conditional expectation is no longer linear, and with imperfect covariance knowledge the linear estimator is no longer the conditional expectation.
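A minimal numerical check of this fact in Python, using real-valued Gaussians for simplicity; the dimensions, pilot matrix, and covariances below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # channel dimension (illustrative)
A = rng.standard_normal((n, n))         # pilot/observation matrix (assumed known)
R_h = np.eye(n)                         # channel covariance (assumed known)
R_n = 0.1 * np.eye(n)                   # noise covariance (assumed known)

# LMMSE estimator for y = A h + n:  W = R_h A^T (A R_h A^T + R_n)^{-1}
W = R_h @ A.T @ np.linalg.inv(A @ R_h @ A.T + R_n)

# Monte Carlo check: for jointly Gaussian (h, y), W y attains the MMSE
h = rng.standard_normal((100_000, n))
y = h @ A.T + rng.multivariate_normal(np.zeros(n), R_n, size=100_000)
mse_emp = np.mean(np.sum((h - y @ W.T) ** 2, axis=1))
mse_theory = np.trace(R_h - W @ A @ R_h)   # MMSE = tr(R_h - W A R_h)
print(mse_emp, mse_theory)                  # agree up to sampling noise
```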
ex-mimo-ch25-02
Easy: Compute the Gaussian rate-distortion function for a 64-dimensional channel with geometrically decaying eigenvalues $\lambda_i$, $i = 1, \dots, 64$, at distortion target $D$. How many bits per channel instance are required?
Apply the reverse water-filling formula: $R(D) = \sum_i \max\{0, \tfrac{1}{2}\log_2(\lambda_i/\theta)\}$ with $D = \sum_i \min(\lambda_i, \theta)$.
Find the water level $\theta$ such that $\sum_i \min(\lambda_i, \theta) = D$.
For a geometric spectrum most of the mass is on the first few eigenvalues.
Total power
The total power is the geometric series $\sum_{i=1}^{64} \lambda_i$; the distortion budget is the target $D$, a small fraction of that total.
Find the water level
We need $\sum_i \min(\lambda_i, \theta) = D$. Bisect on $\theta$: a candidate level that collects too much distortion from the tail eigenvalues is too high, one that collects too little is too low; interpolating between the two brackets gives the $\theta$ that meets the budget exactly. For a geometric spectrum only a handful of eigenvalues sit above the final water level.
Compute the rate
$R = \sum_i \tfrac{1}{2}\log_2^{+}(\lambda_i/\theta)$, summed over the eigenvalues above the water level. For a geometric spectrum only the first few terms contribute, giving on the order of tens of bits per channel instance. A CsiNet operating at this channel would thus be within a factor of 2-3 of the information-theoretic optimum at roughly 20-30 bits of actual feedback budget.
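A short reverse water-filling computation in Python. The decay rate and distortion target below ($\lambda_i = 2^{-i/8}$, $D$ set to 5 % of the total power) are illustrative assumptions, not the exercise's stated values:

```python
import numpy as np

# Hypothetical geometric spectrum and distortion target (illustrative only)
i = np.arange(64)
lam = 2.0 ** (-i / 8.0)          # eigenvalues, geometric decay (assumed rate)
D = 0.05 * lam.sum()             # distortion target: 5% of total power (assumed)

# Bisect on the water level theta so that sum_i min(lam_i, theta) = D
lo, hi = 0.0, lam.max()
for _ in range(100):
    theta = 0.5 * (lo + hi)
    if np.minimum(lam, theta).sum() > D:
        hi = theta               # collecting too much distortion: lower the level
    else:
        lo = theta

# Rate: 1/2 log2(lam/theta) over eigenvalues above the water level
R = 0.5 * np.log2(np.maximum(lam / theta, 1.0)).sum()
print(f"theta={theta:.4g}, R={R:.1f} bits per channel instance")
```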
ex-mimo-ch25-03
Medium: Explain why CsiNet preprocesses the channel by taking the IFFT to the delay domain before applying convolutional layers. Identify the physical prior being injected, and describe what happens to NMSE if the IFFT preprocessing is removed.
Think about what is sparse in the delay domain but not in the frequency domain.
Convolutional layers exploit local spatial correlation.
The IFFT is a fixed, non-trainable transformation; what does that buy you?
Sparsity in delay
For typical urban/suburban multipath, the channel impulse response has only a handful of significant taps (roughly equal to the number of clusters in the environment). In the frequency domain the same channel looks "smooth" but not sparse; in the delay domain it is explicitly sparse.
Convolutions exploit local correlation
The 2D conv layers in CsiNet have small kernels that can efficiently represent sparse + locally correlated signals. They cannot efficiently represent a "dense smooth" signal; a Transformer or MLP would handle that better.
Without the IFFT
Removing the IFFT preprocessing worsens NMSE by 3-5 dB at the same bit budget on CDL-C channels. The convolutional inductive bias is mismatched to the frequency-domain representation; the network has to spend capacity learning the IFFT, and it does so imperfectly. Hard-coding the IFFT is a free gain and a textbook example of the model-based DL principle: use the physics, do not learn it from scratch.
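A sketch of the preprocessing step in Python; the tap positions, array sizes, and truncation length are illustrative assumptions:

```python
import numpy as np

def to_delay_domain(H_freq, n_taps=32):
    """CsiNet-style preprocessing sketch: IFFT across subcarriers, then
    truncate to the first n_taps delay bins where the multipath energy lives.
    H_freq: (n_subcarriers, n_antennas) complex frequency-domain channel."""
    H_delay = np.fft.ifft(H_freq, axis=0)     # frequency -> delay domain
    return H_delay[:n_taps]                   # sparse head; tail is ~0

# Toy check: a 3-tap channel is dense in frequency but sparse in delay
rng = np.random.default_rng(0)
h = np.zeros((256, 4), dtype=complex)
h[[2, 7, 19]] = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
H = np.fft.fft(h, axis=0)                     # dense, smooth in frequency
print(np.abs(to_delay_domain(H))[:, 0].round(2))  # energy only at taps 2, 7, 19
```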
ex-mimo-ch25-04
Medium: A 64-beam mmWave BS uses an LSTM beam predictor trained on vehicular-speed traces. In a deployment where half the UEs are vehicular and half are pedestrian, the deployed network reaches 90 % top-5 accuracy on vehicular UEs and only 60 % on pedestrians. Diagnose the failure and propose a fix that stays within the data-driven paradigm.
Distribution shift: the training set had no pedestrian traces.
What is different about the pedestrian beam dynamics compared to vehicular?
Can the network be made velocity-aware?
Diagnosis
The training distribution covers only vehicular velocities. Pedestrian UEs have much slower, more randomly distributed beam transitions, dominated by blocking and body shadowing rather than deterministic mobility. The LSTM's learned temporal features do not transfer.
Velocity-stratified training
Retrain on a mixture of 50 % pedestrian and 50 % vehicular traces. The mixed training teaches the network to handle both temporal scales. Expected top-5 accuracy after retraining: approximately 92 % on vehicular and 88 % on pedestrian; slightly below what a vehicular-only specialist could reach on vehicular traffic (the capacity sacrifice), but far more uniform across UE types.
Velocity conditioning
Better: pass UE velocity as an auxiliary input to the LSTM. Now the same network handles both modes without sacrificing either. Velocity is a standard measurement the UE reports to the BS, so this is free at inference time. This is how the Rel-18 AI/ML study item recommends training sequence models for beam management.
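A minimal PyTorch sketch of the velocity-conditioning idea; the network sizes, feature layout, and class name are hypothetical, not from the chapter:

```python
import torch
import torch.nn as nn

class VelocityConditionedBeamPredictor(nn.Module):
    """Sketch of the velocity-conditioning fix: the UE-reported speed is
    appended to every timestep's beam-measurement feature vector, so one
    LSTM covers both pedestrian and vehicular temporal scales.
    Dimensions are illustrative, not from the exercise."""
    def __init__(self, n_beams=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_beams + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_beams)   # logits over next-slot beams

    def forward(self, beam_rsrp, velocity):
        # beam_rsrp: (B, T, n_beams), velocity: (B,) in m/s
        v = velocity[:, None, None].expand(-1, beam_rsrp.size(1), 1)
        x = torch.cat([beam_rsrp, v], dim=-1)    # condition on velocity
        out, _ = self.lstm(x)
        return self.head(out[:, -1])             # predict next-slot beam logits

model = VelocityConditionedBeamPredictor()
logits = model(torch.randn(8, 16, 64), torch.rand(8) * 30.0)
print(logits.shape)  # torch.Size([8, 64])
```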
ex-mimo-ch25-05
Medium: Describe the reward gaming failure mode in the "maximize sum-rate" formulation of RL power control and propose three different reward modifications that prevent it. Rank the modifications by how much they sacrifice peak efficiency versus fairness.
Sum-rate maximization is achieved by serving the strongest user.
Proportional fairness and max-min are two standard alternatives.
Each has a different tradeoff between efficiency and fairness.
Failure mode
"Maximize sum-rate" leads the RL agent to starve the weakest users β serving only the strongest users maximizes instantaneous sum-rate. The training reward looks excellent while the fairness metric (Jain's index, 95%-likely rate) collapses.
Fix 1: proportional fairness
Reward $r = \sum_k \log R_k$. The log penalizes starvation heavily. Sacrifices a small amount of peak sum-rate for a much better worst-user rate. Corresponds to the log-utility scheduler of Chapter 5.
Fix 2: max-min
Reward $r = \min_k R_k$. The weakest user's rate becomes the entire objective. Strong fairness, but very large efficiency sacrifice (sum-rate can drop by 30-50 %). Best when URLLC-style minimum-rate guarantees matter.
Fix 3: alpha-fairness
Reward $r = \sum_k \frac{R_k^{1-\alpha}}{1-\alpha}$ with tunable $\alpha \ge 0$. Interpolates between sum-rate ($\alpha = 0$) and max-min ($\alpha \to \infty$) via a single parameter, with proportional fairness as the $\alpha = 1$ (log) limit. Most flexible choice; lets the network designer dial the efficiency-fairness tradeoff post-training. Ranking by efficiency sacrifice: sum-rate (none) < $\alpha = 0.5$ < PF (log, $\alpha = 1$) < max-min.
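The whole family in a few lines of Python; the example rates are made up to show how the starvation penalty grows with $\alpha$:

```python
import numpy as np

def alpha_fair_reward(rates, alpha):
    """Alpha-fair utility: alpha=0 -> sum-rate, alpha=1 -> proportional
    fairness (log limit), alpha -> inf approaches max-min."""
    rates = np.asarray(rates, dtype=float)
    if np.isclose(alpha, 1.0):
        return np.log(rates).sum()            # PF limit of the alpha family
    return (rates ** (1.0 - alpha) / (1.0 - alpha)).sum()

rates = np.array([10.0, 2.0, 0.1])            # one nearly starved user (made up)
for a in [0.0, 0.5, 1.0, 4.0]:
    print(a, alpha_fair_reward(rates, a))     # starvation penalty grows with alpha
```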
ex-mimo-ch25-06
Medium: Unroll 5 iterations of ISTA for sparse channel recovery. Write the explicit layer-by-layer forward pass of the deep-unfolded network and identify the trainable parameters at each layer. How many trainable parameters are there in total for a 32-subcarrier problem?
Classical ISTA has step size $\mu$ and threshold $\lambda$.
The unfolded network can have per-layer or full trainable matrices.
Keep it minimal: per-layer scalars only.
Forward pass
Layer $t$ ($t = 0, \dots, 4$): $\mathbf{x}^{(t+1)} = \eta_{\lambda_t}\big(\mathbf{x}^{(t)} + \mu_t \mathbf{A}^{H}(\mathbf{y} - \mathbf{A}\mathbf{x}^{(t)})\big)$ with $\mathbf{x}^{(0)} = \mathbf{0}$, where $\eta_{\tau}$ is the soft-threshold operator. Final output is $\hat{\mathbf{x}} = \mathbf{x}^{(5)}$.
Trainable parameters (minimal parameterization)
$\{\mu_t, \lambda_t\}_{t=0}^{4}$: 2 scalars per layer, 5 layers, total 10 trainable parameters. The measurement matrix $\mathbf{A}$ stays fixed (it is the pilot pattern, known).
Why not more parameters
Adding a trainable per-layer matrix or a trainable measurement correction turns the network into a generic MLP and loses generalization. With 10 parameters total, the unfolded network can only improve the classical step sizes and thresholds, which is exactly the model-based DL sweet spot. Compare: a CsiNet on the same task has orders of magnitude more parameters.
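The forward pass in Python, real-valued for brevity; the pilot matrix, sparsity pattern, and initial $(\mu_t, \lambda_t)$ values are illustrative assumptions:

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft-threshold eta_tau(x) = sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def unfolded_ista(y, A, mus, lams):
    """5-layer unfolded ISTA: each layer is one classical iteration with its
    own trainable step size mu_t and threshold lam_t (10 scalars in total)."""
    x = np.zeros(A.shape[1])
    for mu, lam in zip(mus, lams):
        x = soft_threshold(x + mu * A.T @ (y - A @ x), lam)
    return x

# Toy 32-subcarrier problem with a 4-sparse channel (illustrative sizes)
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32)) / np.sqrt(16)
x_true = np.zeros(32)
x_true[[3, 9, 20, 27]] = [1.0, -0.8, 0.5, 1.2]
y = A @ x_true
mus = [0.9] * 5                     # classical init; training tunes these
lams = [0.05] * 5
print(np.round(unfolded_ista(y, A, mus, lams), 2))
```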
ex-mimo-ch25-07
Hard: Prove that the PPO clipped surrogate objective reduces to the vanilla policy gradient when the clip parameter tends to infinity. Comment on why this limit is not used in practice.
When $\epsilon \to \infty$, the clip is never active.
The surrogate reduces to $\mathbb{E}_t[r_t(\theta)\hat{A}_t]$, which is the importance-weighted advantage.
Vanilla policy gradient is notoriously unstable; explain why.
Uncap the clip
PPO surrogate: $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\big]$, where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$. As $\epsilon \to \infty$, $\mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon) = r_t$, so both terms are equal and $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t[r_t(\theta)\hat{A}_t]$.
Connection to policy gradient
The expected value $\mathbb{E}_t[r_t(\theta)\hat{A}_t]$ is the importance-weighted advantage. At $\theta = \theta_{\mathrm{old}}$, $r_t(\theta) = 1$, and the gradient with respect to $\theta$ recovers the vanilla policy gradient $\mathbb{E}_t[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t]$.
Why the limit is not used
Without clipping, a single large positive advantage causes an arbitrarily large policy update, which pushes the new policy outside the trust region where the advantage estimate is valid. The training collapses and the reward trajectory becomes chaotic. The whole point of PPO is to keep the step small enough that the advantage estimate remains roughly correct. Empirically, $\epsilon = 0.2$ is the standard choice.
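A quick numerical illustration in Python of how the clipped surrogate converges to the unclipped importance-weighted objective as $\epsilon$ grows; the sampled ratios and advantages are synthetic:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps):
    """Clipped surrogate: mean over samples of
    min(ratio * adv, clip(ratio, 1-eps, 1+eps) * adv)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

rng = np.random.default_rng(0)
ratio = np.exp(rng.normal(0.0, 0.5, size=1000))   # pi_theta / pi_old samples
adv = rng.normal(0.0, 1.0, size=1000)             # advantage estimates

for eps in [0.2, 1.0, 10.0, 1e6]:
    print(eps, ppo_clip_objective(ratio, adv, eps))
# As eps grows, the objective converges to the unclipped importance-weighted
# advantage, i.e. the vanilla policy-gradient surrogate:
print("unclipped:", np.mean(ratio * adv))
```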
ex-mimo-ch25-08
Hard: Show formally that a deep-unfolded $T$-layer network, initialized at the classical ISTA parameter values, reproduces $T$ iterations of classical ISTA exactly. Conclude that the trained network's performance is never worse than $T$ steps of ISTA, provided the training loss is non-increasing in the layer parameters.
Initialization point: each layer matches one classical iteration.
End-to-end training decreases the loss monotonically under gradient descent.
The classical fallback is the initialization.
Initialization
Set $(\mu_t, \lambda_t) = (\mu, \lambda)$ for all $t$, where $(\mu, \lambda)$ are the classical ISTA hyperparameters. The unfolded network is now literally $T$ applications of the classical iteration.
Training loss at initialization
The training loss at the initialization $\theta_0$ therefore equals the recovery error of $T$ iterations of classical ISTA.
Gradient descent decreases loss
Any gradient-descent step with a small enough learning rate does not increase the loss: $\mathcal{L}(\theta_{k+1}) \le \mathcal{L}(\theta_k)$. Therefore $\mathcal{L}(\theta_{\mathrm{trained}}) \le \mathcal{L}(\theta_0)$, and the trained network is never worse than the classical $T$-iteration algorithm. In practice the inequality is typically strict and gives a 2-4 dB NMSE improvement.
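The initialization argument can be checked mechanically. A Python sketch with illustrative sizes, confirming that the unfolded network at the classical parameter values is numerically identical to $T$ classical iterations:

```python
import numpy as np

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def classical_ista(y, A, mu, lam, T):
    x = np.zeros(A.shape[1])
    for _ in range(T):
        x = soft(x + mu * A.T @ (y - A @ x), lam)
    return x

def unfolded(y, A, mus, lams):
    x = np.zeros(A.shape[1])
    for mu, lam in zip(mus, lams):          # one layer = one iteration
        x = soft(x + mu * A.T @ (y - A @ x), lam)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 32)) / 4.0     # illustrative measurement matrix
y = rng.standard_normal(16)
mu, lam, T = 0.9, 0.05, 5
# Initialized at the classical values, the unfolded net IS classical ISTA:
print(np.allclose(classical_ista(y, A, mu, lam, T),
                  unfolded(y, A, [mu] * T, [lam] * T)))   # True
```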
ex-mimo-ch25-09
Hard: Assume the optimal CSI feedback codec operates exactly on the Gaussian rate-distortion curve of Theorem 25.1. For a channel covariance with eigenvalues decaying as a power law $\lambda_i \propto i^{-\gamma}$, derive how $R(D)$ scales with $D$ in the regime where only a vanishing fraction of eigenvalues lies above the water level.
Integrate the continuous version of the reverse water-filling.
$\lambda(i) = i^{-\gamma}$ in the continuous limit.
A power-law eigenvalue density gives a straight-line dependence between $R$ and $D$ on log-log axes.
Continuous water-filling
Treat $\lambda(i) = i^{-\gamma}$ as a continuous function of $i$. The water level $\theta$ determines the cutoff index $i^*$ via $\lambda(i^*) = \theta$, i.e. $i^* = \theta^{-1/\gamma}$.
Distortion budget
The distortion in the discarded tail is $\int_{i^*}^{\infty} i^{-\gamma}\, di = \frac{(i^*)^{1-\gamma}}{\gamma - 1}$ (the in-band contribution $\theta\, i^* = (i^*)^{1-\gamma}$ is of the same order). Substituting gives $D \propto (i^*)^{1-\gamma}$, or equivalently $i^* \propto D^{-1/(\gamma-1)}$.
Rate
$R = \int_0^{i^*} \tfrac{1}{2}\log_2\frac{\lambda(i)}{\theta}\, di$. The dominant term is $\tfrac{\gamma}{2}\int_0^{i^*}\log_2\frac{i^*}{i}\, di$, which simplifies to $R \propto i^*$. Substituting $i^* \propto D^{-1/(\gamma-1)}$ gives $R(D) \propto D^{-1/(\gamma-1)}$.
Interpretation
For $\gamma \gg 1$ (fast decay, typical of massive MIMO) the exponent $1/(\gamma - 1)$ is small: the feedback overhead decays quickly as the distortion target relaxes and grows only slowly as it tightens. For $\gamma \to 1^{+}$ (slow decay, rich multipath) the exponent blows up: the feedback grows catastrophically as distortion tightens. Channels with strong angular structure (low rank, high $\gamma$) are intrinsically feedback-friendly; isotropic channels are not. This is the quantitative reason CsiNet helps for some channel types and not others.
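A numerical check of the scaling in Python; the exponent $\gamma = 2$ and the truncation at $10^5$ eigenvalues are illustrative:

```python
import numpy as np

def rd_rate(lam, D):
    """Reverse water-filling rate for spectrum lam at distortion D (bisection)."""
    lo, hi = 0.0, lam.max()
    for _ in range(200):
        theta = 0.5 * (lo + hi)
        hi, lo = (theta, lo) if np.minimum(lam, theta).sum() > D else (hi, theta)
    return 0.5 * np.log2(np.maximum(lam / theta, 1.0)).sum()

gamma = 2.0                                   # power-law exponent (assumed)
lam = np.arange(1, 100_001) ** (-gamma)       # lambda_i = i^{-gamma}
for D in [1e-1, 1e-2, 1e-3]:
    print(D, rd_rate(lam, D))
# R grows ~10x per decade of tightening in D here, matching the derived
# scaling R(D) ~ D^{-1/(gamma-1)} with gamma = 2.
```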
ex-mimo-ch25-10
Medium: Explain the concept of "hybrid deployment with safe fallback" as advocated by the 6G@TU Berlin / Huawei workshop. Design a concrete fallback trigger for a deep-unfolded channel estimator: what measurable quantity would you threshold, and what is the threshold value?
A good fallback trigger monitors the learned component's uncertainty or plausibility.
Residual norm, output entropy, or input-distribution distance are candidates.
The threshold should be tuned on a held-out matched validation set.
Hybrid deployment concept
A learned component (e.g. deep-unfolded ISTA channel estimator) runs in parallel with a classical fallback (e.g. sample-MMSE). A small monitor compares the learned output against a plausibility criterion; when the criterion fails the classical output is used instead. The classical branch guarantees worst-case correctness; the learned branch provides the gain when it is applicable.
Trigger choice
For a channel estimator, the natural trigger is the residual norm $\rho = \|\mathbf{y} - \mathbf{A}\hat{\mathbf{h}}\|_2$. On matched data this is small; on shifted data the learned network's output is mismatched to the measurement, which the residual detects.
Threshold tuning
Compute $\rho$ on a held-out matched validation set and set the threshold at the 95th percentile. On deployment, any input with $\rho$ above the threshold triggers the fallback. This yields roughly 5 % fallback rate on matched data (a controlled false-positive rate) and a much higher fallback rate on shifted data, where it is supposed to trigger. Monitor the fallback rate continuously: if it drifts above 20 %, the network needs retraining.
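A sketch of the trigger in Python; `learned_est` and `classical_est` stand in for the two estimator branches and are hypothetical names:

```python
import numpy as np

def calibrate_threshold(residuals_val, q=0.95):
    """Set the fallback threshold at the 95th percentile of residual norms
    measured on a held-out matched validation set."""
    return np.quantile(residuals_val, q)

def estimate_with_fallback(y, A, learned_est, classical_est, threshold):
    """Hybrid deployment sketch: use the learned estimate unless its
    measurement residual exceeds the calibrated threshold."""
    h_learned = learned_est(y)
    rho = np.linalg.norm(y - A @ h_learned)
    if rho > threshold:
        return classical_est(y), True    # fallback triggered
    return h_learned, False

# Usage sketch (learned_est / classical_est are placeholder branches):
# thr = calibrate_threshold(residuals_on_matched_validation_set)
# h_hat, fell_back = estimate_with_fallback(y, A, unfolded_net, lmmse, thr)
```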
ex-mimo-ch25-11
Easy: List three physical-layer tasks where pure data-driven deep learning is a poor choice and three where it is a reasonable default. Justify each briefly.
Poor: tasks where a closed-form Bayes solution exists.
Reasonable: tasks with rich temporal or spatial structure and no clean analytical model.
Generalization requirements matter too.
Poor choices
- Linear channel estimation in a Gaussian regime. MMSE is Bayes-optimal; no network can beat it.
- Downlink precoder design for a known channel. The closed-form RZF / water-filling formula is optimal.
- LDPC decoding. Belief propagation has decades of analysis and very strong performance. DL decoders have never beaten BP on a well-designed code.
Reasonable choices
- Beam prediction in mobility. Rich temporal structure, no clean rule-based model, 3GPP Rel-18 candidate.
- Scenario classification (indoor vs outdoor vs LoS vs NLoS). Pure pattern recognition, no analytical form, huge ML literature to draw on.
- Channel compression for CSI feedback in fixed deployments. Per-scenario training is acceptable, and the rate-distortion gain over Type II is real.
ex-mimo-ch25-12
Challenge: A CsiNet codec achieves strong NMSE at 128 bits on its training distribution (CDL-C) but degrades sharply on an unseen CDL-D deployment at the same bit budget. A deep-unfolded OAMP-Net with 12 trainable parameters is slightly worse than CsiNet on matched CDL-C but degrades only mildly on CDL-D at 128 bits. Which architecture is preferable for a commercial deployment seeing a mix of CDL-A/B/C/D scenarios, and why? Compare the expected worst-case NMSE under each choice.
Worst-case performance dominates commercial deployment decisions.
Compute the expected NMSE under a uniform mixture over scenarios.
Consider the retraining cost of each choice.
Worst-case NMSE comparison
CsiNet's worst case (CDL-D or a similar shift) falls far below its matched performance; OAMP-Net's worst case stays close to its matched performance. The model-based approach wins by 3 dB on the scenario it was not trained on.
Average NMSE under uniform mixture
Under a uniform mixture over CDL-A/B/C/D, CsiNet's average is dragged down by the three shifted scenarios (only one is matched), while OAMP-Net stays near its matched NMSE on all four. OAMP-Net wins on average too.
Retraining cost
CsiNet to match OAMP-Net would need to be retrained separately for each deployment, i.e. four separate trained models plus a model-distribution mechanism to swap them. OAMP-Net needs one trained model. Operational cost heavily favors OAMP-Net.
Recommendation
Deploy OAMP-Net. It gives up a little peak efficiency on matched CDL-C to gain several dB of worst-case efficiency and to eliminate the per-scenario retraining overhead. This is a textbook case of the 6G@TU Berlin / Huawei position: model-based DL beats data-driven DL on every commercial deployment metric. The CsiNet gain is only visible in academic papers, not in field trials.
ex-mimo-ch25-13
Medium: Derive the expected overhead saving of an LSTM beam predictor vs exhaustive beam sweep, as a function of the codebook size $N$ and the top-$K$ accuracy $p$. Assume that when the true beam is not in the top-$K$, a fallback to exhaustive search is triggered.
Predicted overhead = top-$K$ measurement cost + fallback probability $\times$ exhaustive cost.
Fallback probability = $1 - p$.
Overhead saving = $1 - \frac{K + (1-p)N}{N} = p - \frac{K}{N}$.
Overhead formula
Let $N$ be the number of measurements for an exhaustive sweep. LSTM + fallback overhead per slot: $C = K + (1 - p)N$.
Overhead saving
Saving $= 1 - \frac{C}{N} = 1 - \frac{K}{N} - (1 - p) = p - \frac{K}{N}$.
Numerical example
For $N = 64$, $K = 5$, $p = 0.95$: Saving $= 0.95 - 5/64 \approx 0.872$, i.e. 87.2 % overhead reduction, an 8x improvement, consistent with Example 25.3. For more modest accuracies the saving shrinks linearly but remains very useful. If $p$ drops to 0.5 (weak predictor): Saving $\approx 0.42$, marginal. This is why the top-5 accuracy is such a useful headline metric: it directly determines the deployment value.
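The formula in executable form (Python); the middle accuracy value is an illustrative fill-in:

```python
def beam_overhead_saving(N, K, p):
    """Overhead saving of top-K prediction with exhaustive-sweep fallback:
    per-slot cost is K measurements plus (1-p)*N for fallback, vs N."""
    return 1.0 - (K + (1.0 - p) * N) / N    # equals p - K/N

for p in [0.95, 0.8, 0.5]:                   # 0.8 is an assumed mid case
    print(p, beam_overhead_saving(N=64, K=5, p=p))
# 0.95 -> 0.872 (the 8x case); lower accuracy erodes the saving linearly.
```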
ex-mimo-ch25-14
Medium: In the context of wireless RL, explain the simulation-to-reality gap and list three concrete reasons why a policy trained on a 3GPP system-level simulator may fail on a real gNB. Propose one mitigation for each reason.
Simulators abstract away hardware details.
Simulators are also faster than real time, which changes the training dynamics.
Real deployments have non-stationary noise and delays the simulator does not model.
Reason 1: Hardware non-idealities
Real PA nonlinearity, oscillator phase noise, and IQ imbalance add structured noise that the simulator typically omits. Mitigation: Inject empirical hardware noise models into the simulator using measurements from real RRUs during data generation.
Reason 2: Timing / processing delays
Simulator assumes actions take effect instantly; real scheduling decisions have 1-2 slot processing delays and variable latency. Mitigation: Add a Markovian delay model to the simulator so the learned policy accounts for non-instantaneous control.
Reason 3: Traffic statistics
Real UE traffic is bursty, heavy-tailed, and correlated across users; simulators use independent Poisson or constant arrivals. Mitigation: Replay real traffic traces into the simulator instead of using synthetic models, and use domain randomization over multiple traffic profiles during training.
ex-mimo-ch25-15
Challenge: Consider a Transformer-based CSI encoder with attention across all subcarriers. Compute the number of multiply-accumulate operations per CSI update for $N$ subcarriers, $N_t$ antennas, embedding dimension $d$, and one attention layer. Compare with a CsiNet (convolutional) encoder on the same problem and identify which is deployable on a handset NPU with a 1 GMAC/s budget.
Attention cost is $4Nd^2 + 2N^2 d$ MACs per layer plus $8Nd^2$ for the feedforward block (4x expansion).
Convolution cost is $HW k^2 C_{\mathrm{in}} C_{\mathrm{out}}$ MACs for kernel size $k \times k$ and channels $C_{\mathrm{in}} \to C_{\mathrm{out}}$ over an $H \times W$ feature map.
1 GMAC/s at a 5 ms update rate gives a budget of 5 MMAC per inference.
Transformer MAC count
Attention: $4Nd^2 + 2N^2 d$ MACs (QKV/output projections plus score computation and weighted sum). Feedforward block (2 layers, 4x expansion): $8Nd^2$ MACs. Total per attention layer: approximately 9.5 MMAC at the stated problem size.
CsiNet MAC count
3 conv layers, $3 \times 3$ kernels, 32 channels over the truncated delay-domain map: roughly 3.5 MMAC total for the 3 layers.
Deployability
At a 5 ms CSI update rate, the 1 GMAC/s NPU budget provides 5 MMAC per inference. CsiNet at 3.5 MMAC fits comfortably. Transformer-CSI at 9.5 MMAC does not fit: it would need either a larger NPU, a reduced update rate (10 ms instead of 5 ms), or architectural compression (distillation to a smaller network). In the current generation of handset NPUs the CsiNet-class models are deployable; Transformer-CSI is not yet. This is why CsiNet remains the 3GPP reference despite Transformers' better NMSE on paper.
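The two counts as a small Python calculator. The token count, embedding dimension, feature-map size, and channel progression are assumed values chosen to land in the same few-MMAC ballpark as the figures above, not the exercise's stated sizes:

```python
def transformer_macs(N, d):
    """MACs for one attention layer: QKV + output projections (4 N d^2),
    attention scores and weighted sum (2 N^2 d), FFN with 4x expansion
    (8 N d^2). N = tokens (subcarriers), d = embedding dimension."""
    return 4 * N * d**2 + 2 * N**2 * d + 8 * N * d**2

def conv_macs(H, W, k, c_in, c_out):
    """MACs for one conv layer with k x k kernels on an H x W feature map."""
    return H * W * k * k * c_in * c_out

# Illustrative sizes (assumed, not the exercise's stated values)
print(transformer_macs(N=64, d=96) / 1e6, "MMAC (one attention layer)")
total_conv = sum(conv_macs(32, 32, 3, c_in, c_out)
                 for c_in, c_out in [(2, 16), (16, 16), (16, 2)])
print(total_conv / 1e6, "MMAC (3 conv layers)")
```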