Exercises
ex-mimo-ch25-01
Easy: Under what conditions on the channel model and the noise model is the linear MMSE channel estimator provably Bayes-optimal? Give the precise statement and identify the two assumptions that must hold.
Think about the loss function and the prior.
Under what joint distribution is the conditional expectation linear?
The orthogonality principle requires both signal and noise to have known second-order statistics.
Assumptions
MMSE is Bayes-optimal under squared-error loss when (i) the channel is jointly Gaussian with the observation and (ii) the channel covariance and the noise covariance are known exactly.
Why
Under these assumptions the conditional expectation $\mathbb{E}[\mathbf{h} \mid \mathbf{y}]$ is linear in the observation $\mathbf{y}$ and equals the LMMSE formula. The Bayes rule under squared-error loss is the conditional expectation, which therefore equals the MMSE estimator. If either assumption fails the argument breaks: under non-Gaussianity the conditional expectation is no longer linear, and with imperfect covariance knowledge the linear estimator is no longer the conditional expectation.
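A minimal numerical check of this fact in Python, using real-valued Gaussians for simplicity; the dimensions, pilot matrix, and covariances below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # channel dimension (illustrative)
A = rng.standard_normal((n, n))         # pilot/observation matrix (assumed known)
R_h = np.eye(n)                         # channel covariance (assumed known)
R_n = 0.1 * np.eye(n)                   # noise covariance (assumed known)

# LMMSE estimator for y = A h + n:  W = R_h A^T (A R_h A^T + R_n)^{-1}
W = R_h @ A.T @ np.linalg.inv(A @ R_h @ A.T + R_n)

# Monte Carlo check: for jointly Gaussian (h, y), W y attains the MMSE
h = rng.standard_normal((100_000, n))
y = h @ A.T + rng.multivariate_normal(np.zeros(n), R_n, size=100_000)
mse_emp = np.mean(np.sum((h - y @ W.T) ** 2, axis=1))
mse_theory = np.trace(R_h - W @ A @ R_h)   # MMSE = tr(R_h - W A R_h)
print(mse_emp, mse_theory)                  # agree up to sampling noise
```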
ex-mimo-ch25-02
Easy: Compute the Gaussian rate-distortion function for a 64-dimensional channel with geometrically decaying eigenvalues $\lambda_i$, $i = 1, \dots, 64$, at distortion target $D$. How many bits per channel instance are required?
Apply the reverse water-filling formula: $R(D) = \sum_i \max\{0, \tfrac{1}{2}\log_2(\lambda_i/\theta)\}$ with $D = \sum_i \min(\lambda_i, \theta)$.
Find the water level $\theta$ such that $\sum_i \min(\lambda_i, \theta) = D$.
For a geometric spectrum most of the mass is on the first few eigenvalues.
Total power
The total power is the geometric series $\sum_{i=1}^{64} \lambda_i$; the distortion budget is the target $D$, a small fraction of that total.
Find the water level
We need $\sum_i \min(\lambda_i, \theta) = D$. Bisect on $\theta$: a candidate level that collects too much distortion from the tail eigenvalues is too high, one that collects too little is too low; interpolating between the two brackets gives the $\theta$ that meets the budget exactly. For a geometric spectrum only a handful of eigenvalues sit above the final water level.
Compute the rate
$R = \sum_i \tfrac{1}{2}\log_2^{+}(\lambda_i/\theta)$, summed over the eigenvalues above the water level. For a geometric spectrum only the first few terms contribute, giving on the order of tens of bits per channel instance. A CsiNet operating at this channel would thus be within a factor of 2-3 of the information-theoretic optimum at roughly 20-30 bits of actual feedback budget.
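A short reverse water-filling computation in Python. The decay rate and distortion target below ($\lambda_i = 2^{-i/8}$, $D$ set to 5 % of the total power) are illustrative assumptions, not the exercise's stated values:

```python
import numpy as np

# Hypothetical geometric spectrum and distortion target (illustrative only)
i = np.arange(64)
lam = 2.0 ** (-i / 8.0)          # eigenvalues, geometric decay (assumed rate)
D = 0.05 * lam.sum()             # distortion target: 5% of total power (assumed)

# Bisect on the water level theta so that sum_i min(lam_i, theta) = D
lo, hi = 0.0, lam.max()
for _ in range(100):
    theta = 0.5 * (lo + hi)
    if np.minimum(lam, theta).sum() > D:
        hi = theta               # collecting too much distortion: lower the level
    else:
        lo = theta

# Rate: 1/2 log2(lam/theta) over eigenvalues above the water level
R = 0.5 * np.log2(np.maximum(lam / theta, 1.0)).sum()
print(f"theta={theta:.4g}, R={R:.1f} bits per channel instance")
```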
ex-mimo-ch25-03
Medium: Explain why CsiNet preprocesses the channel by taking the IFFT to the delay domain before applying convolutional layers. Identify the physical prior being injected, and describe what happens to NMSE if the IFFT preprocessing is removed.
Think about what is sparse in the delay domain but not in the frequency domain.
Convolutional layers exploit local spatial correlation.
The IFFT is a fixed, non-trainable transformation; what does that buy you?
Sparsity in delay
For typical urban/suburban multipath, the channel impulse response has only a handful of significant taps (roughly equal to the number of clusters in the environment). In the frequency domain the same channel looks "smooth" but not sparse; in the delay domain it is explicitly sparse.
Convolutions exploit local correlation
The 2D conv layers in CsiNet have small kernels that can efficiently represent sparse + locally correlated signals. They cannot efficiently represent a "dense smooth" signal; a Transformer or MLP would handle that better.
Without the IFFT
Removing the IFFT preprocessing worsens NMSE by 3-5 dB at the same bit budget on CDL-C channels. The convolutional inductive bias is mismatched to the frequency-domain representation; the network has to spend capacity learning the IFFT, and it does so imperfectly. Hard-coding the IFFT is a free gain and a textbook example of the model-based DL principle: use the physics, do not learn it from scratch.
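A sketch of the preprocessing step in Python; the tap positions, array sizes, and truncation length are illustrative assumptions:

```python
import numpy as np

def to_delay_domain(H_freq, n_taps=32):
    """CsiNet-style preprocessing sketch: IFFT across subcarriers, then
    truncate to the first n_taps delay bins where the multipath energy lives.
    H_freq: (n_subcarriers, n_antennas) complex frequency-domain channel."""
    H_delay = np.fft.ifft(H_freq, axis=0)     # frequency -> delay domain
    return H_delay[:n_taps]                   # sparse head; tail is ~0

# Toy check: a 3-tap channel is dense in frequency but sparse in delay
rng = np.random.default_rng(0)
h = np.zeros((256, 4), dtype=complex)
h[[2, 7, 19]] = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
H = np.fft.fft(h, axis=0)                     # dense, smooth in frequency
print(np.abs(to_delay_domain(H))[:, 0].round(2))  # energy only at taps 2, 7, 19
```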
ex-mimo-ch25-04
Medium: A 64-beam mmWave BS uses an LSTM beam predictor trained on vehicular-speed traces. In a deployment where half the UEs are vehicular and half are pedestrian, the deployed network reaches 90 % top-5 accuracy on vehicular UEs and only 60 % on pedestrians. Diagnose the failure and propose a fix that stays within the data-driven paradigm.
Distribution shift: the training set had no pedestrian traces.
What is different about the pedestrian beam dynamics compared to vehicular?
Can the network be made velocity-aware?
Diagnosis
The training distribution covers only vehicular velocities. Pedestrian UEs have much slower, more randomly distributed beam transitions, dominated by blocking and body shadowing rather than deterministic mobility. The LSTM's learned temporal features do not transfer.
Velocity-stratified training
Retrain on a mixture of 50 % pedestrian and 50 % vehicular traces. The mixed training teaches the network to handle both temporal scales. Expected top-5 accuracy after retraining: approximately 92 % on vehicular and 88 % on pedestrian; slightly below what a vehicular-only specialist could reach on vehicular traffic (the capacity sacrifice), but far more uniform across UE types.
Velocity conditioning
Better: pass UE velocity as an auxiliary input to the LSTM. Now the same network handles both modes without sacrificing either. Velocity is a standard measurement the UE reports to the BS, so this is free at inference time. This is how the Rel-18 AI/ML study item recommends training sequence models for beam management.
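A minimal PyTorch sketch of the velocity-conditioning idea; the network sizes, feature layout, and class name are hypothetical, not from the chapter:

```python
import torch
import torch.nn as nn

class VelocityConditionedBeamPredictor(nn.Module):
    """Sketch of the velocity-conditioning fix: the UE-reported speed is
    appended to every timestep's beam-measurement feature vector, so one
    LSTM covers both pedestrian and vehicular temporal scales.
    Dimensions are illustrative, not from the exercise."""
    def __init__(self, n_beams=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_beams + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_beams)   # logits over next-slot beams

    def forward(self, beam_rsrp, velocity):
        # beam_rsrp: (B, T, n_beams), velocity: (B,) in m/s
        v = velocity[:, None, None].expand(-1, beam_rsrp.size(1), 1)
        x = torch.cat([beam_rsrp, v], dim=-1)    # condition on velocity
        out, _ = self.lstm(x)
        return self.head(out[:, -1])             # predict next-slot beam logits

model = VelocityConditionedBeamPredictor()
logits = model(torch.randn(8, 16, 64), torch.rand(8) * 30.0)
print(logits.shape)  # torch.Size([8, 64])
```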
ex-mimo-ch25-05
Medium: Describe the reward gaming failure mode in the "maximize sum-rate" formulation of RL power control and propose three different reward modifications that prevent it. Rank the modifications by how much they sacrifice peak efficiency versus fairness.
Sum-rate maximization is achieved by serving the strongest user.
Proportional fairness and max-min are two standard alternatives.
Each has a different tradeoff between efficiency and fairness.
Failure mode
"Maximize sum-rate" leads the RL agent to starve the weakest users β serving only the strongest users maximizes instantaneous sum-rate. The training reward looks excellent while the fairness metric (Jain's index, 95%-likely rate) collapses.
Fix 1: proportional fairness
Reward $r = \sum_k \log R_k$. The log penalizes starvation heavily. Sacrifices a small amount of peak sum-rate for a much better worst-user rate. Corresponds to the log-utility scheduler of Chapter 5.
Fix 2: max-min
Reward $r = \min_k R_k$. The weakest user's rate becomes the entire objective. Strong fairness, but very large efficiency sacrifice (sum-rate can drop by 30-50 %). Best when URLLC-style minimum-rate guarantees matter.
Fix 3: alpha-fairness
Reward $r = \sum_k \frac{R_k^{1-\alpha}}{1-\alpha}$ with tunable $\alpha \ge 0$. Interpolates between sum-rate ($\alpha = 0$) and max-min ($\alpha \to \infty$) via a single parameter, with proportional fairness as the $\alpha = 1$ (log) limit. Most flexible choice; lets the network designer dial the efficiency-fairness tradeoff post-training. Ranking by efficiency sacrifice: sum-rate (none) < $\alpha = 0.5$ < PF (log, $\alpha = 1$) < max-min.
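The whole family in a few lines of Python; the example rates are made up to show how the starvation penalty grows with $\alpha$:

```python
import numpy as np

def alpha_fair_reward(rates, alpha):
    """Alpha-fair utility: alpha=0 -> sum-rate, alpha=1 -> proportional
    fairness (log limit), alpha -> inf approaches max-min."""
    rates = np.asarray(rates, dtype=float)
    if np.isclose(alpha, 1.0):
        return np.log(rates).sum()            # PF limit of the alpha family
    return (rates ** (1.0 - alpha) / (1.0 - alpha)).sum()

rates = np.array([10.0, 2.0, 0.1])            # one nearly starved user (made up)
for a in [0.0, 0.5, 1.0, 4.0]:
    print(a, alpha_fair_reward(rates, a))     # starvation penalty grows with alpha
```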
ex-mimo-ch25-06
Medium: Unroll 5 iterations of ISTA for sparse channel recovery. Write the explicit layer-by-layer forward pass of the deep-unfolded network and identify the trainable parameters at each layer. How many trainable parameters are there in total for a 32-subcarrier problem?
Classical ISTA has step size $\mu$ and threshold $\lambda$.
The unfolded network can have per-layer or full trainable matrices.
Keep it minimal: per-layer scalars only.
Forward pass
Layer $t$ ($t = 0, \dots, 4$): $\mathbf{x}^{(t+1)} = \eta_{\lambda_t}\big(\mathbf{x}^{(t)} + \mu_t \mathbf{A}^{H}(\mathbf{y} - \mathbf{A}\mathbf{x}^{(t)})\big)$ with $\mathbf{x}^{(0)} = \mathbf{0}$, where $\eta_{\tau}$ is the soft-threshold operator. Final output is $\hat{\mathbf{x}} = \mathbf{x}^{(5)}$.
Trainable parameters (minimal parameterization)
$\{\mu_t, \lambda_t\}_{t=0}^{4}$: 2 scalars per layer, 5 layers, total 10 trainable parameters. The measurement matrix $\mathbf{A}$ stays fixed (it is the pilot pattern, known).
Why not more parameters
Adding a trainable per-layer matrix or a trainable measurement correction turns the network into a generic MLP and loses generalization. With 10 parameters total, the unfolded network can only improve the classical step sizes and thresholds, which is exactly the model-based DL sweet spot. Compare: a CsiNet on the same task has orders of magnitude more parameters.
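The forward pass in Python, real-valued for brevity; the pilot matrix, sparsity pattern, and initial $(\mu_t, \lambda_t)$ values are illustrative assumptions:

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft-threshold eta_tau(x) = sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def unfolded_ista(y, A, mus, lams):
    """5-layer unfolded ISTA: each layer is one classical iteration with its
    own trainable step size mu_t and threshold lam_t (10 scalars in total)."""
    x = np.zeros(A.shape[1])
    for mu, lam in zip(mus, lams):
        x = soft_threshold(x + mu * A.T @ (y - A @ x), lam)
    return x

# Toy 32-subcarrier problem with a 4-sparse channel (illustrative sizes)
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32)) / np.sqrt(16)
x_true = np.zeros(32)
x_true[[3, 9, 20, 27]] = [1.0, -0.8, 0.5, 1.2]
y = A @ x_true
mus = [0.9] * 5                     # classical init; training tunes these
lams = [0.05] * 5
print(np.round(unfolded_ista(y, A, mus, lams), 2))
```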
ex-mimo-ch25-07
Hard: Prove that the PPO clipped surrogate objective reduces to the vanilla policy gradient when the clip parameter tends to infinity. Comment on why this limit is not used in practice.
When $\epsilon \to \infty$, the clip is never active.
The surrogate reduces to $\mathbb{E}_t[r_t(\theta)\hat{A}_t]$, which is the importance-weighted advantage.
Vanilla policy gradient is notoriously unstable; explain why.
Uncap the clip
PPO surrogate: $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\big]$, where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$. As $\epsilon \to \infty$, $\mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon) = r_t$, so both terms are equal and $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t[r_t(\theta)\hat{A}_t]$.
Connection to policy gradient
The expected value $\mathbb{E}_t[r_t(\theta)\hat{A}_t]$ is the importance-weighted advantage. At $\theta = \theta_{\mathrm{old}}$, $r_t(\theta) = 1$, and the gradient with respect to $\theta$ recovers the vanilla policy gradient $\mathbb{E}_t[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t]$.
Why the limit is not used
Without clipping, a single large positive advantage causes an arbitrarily large policy update, which pushes the new policy outside the trust region where the advantage estimate is valid. The training collapses and the reward trajectory becomes chaotic. The whole point of PPO is to keep the step small enough that the advantage estimate remains roughly correct. Empirically, $\epsilon = 0.2$ is the standard choice.
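A quick numerical illustration in Python of how the clipped surrogate converges to the unclipped importance-weighted objective as $\epsilon$ grows; the sampled ratios and advantages are synthetic:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps):
    """Clipped surrogate: mean over samples of
    min(ratio * adv, clip(ratio, 1-eps, 1+eps) * adv)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

rng = np.random.default_rng(0)
ratio = np.exp(rng.normal(0.0, 0.5, size=1000))   # pi_theta / pi_old samples
adv = rng.normal(0.0, 1.0, size=1000)             # advantage estimates

for eps in [0.2, 1.0, 10.0, 1e6]:
    print(eps, ppo_clip_objective(ratio, adv, eps))
# As eps grows, the objective converges to the unclipped importance-weighted
# advantage, i.e. the vanilla policy-gradient surrogate:
print("unclipped:", np.mean(ratio * adv))
```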
ex-mimo-ch25-08
Hard: Show formally that a deep-unfolded $T$-layer network, initialized at the classical ISTA parameter values, reproduces $T$ iterations of classical ISTA exactly. Conclude that the trained network's performance is never worse than $T$ steps of ISTA, provided the training loss is non-increasing in the layer parameters.
Initialization point: each layer matches one classical iteration.
End-to-end training decreases the loss monotonically under gradient descent.
The classical fallback is the initialization.
Initialization
Set $(\mu_t, \lambda_t) = (\mu, \lambda)$ for all $t$, where $(\mu, \lambda)$ are the classical ISTA hyperparameters. The unfolded network is now literally $T$ applications of the classical iteration.
Training loss at initialization
The training loss at the initialization $\theta_0$ therefore equals the recovery error of $T$ iterations of classical ISTA.
Gradient descent decreases loss
Any gradient-descent step with a small enough learning rate does not increase the loss: $\mathcal{L}(\theta_{k+1}) \le \mathcal{L}(\theta_k)$. Therefore $\mathcal{L}(\theta_{\mathrm{trained}}) \le \mathcal{L}(\theta_0)$, and the trained network is never worse than the classical $T$-iteration algorithm. In practice the inequality is typically strict and gives a 2-4 dB NMSE improvement.
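The initialization argument can be checked mechanically. A Python sketch with illustrative sizes, confirming that the unfolded network at the classical parameter values is numerically identical to $T$ classical iterations:

```python
import numpy as np

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def classical_ista(y, A, mu, lam, T):
    x = np.zeros(A.shape[1])
    for _ in range(T):
        x = soft(x + mu * A.T @ (y - A @ x), lam)
    return x

def unfolded(y, A, mus, lams):
    x = np.zeros(A.shape[1])
    for mu, lam in zip(mus, lams):          # one layer = one iteration
        x = soft(x + mu * A.T @ (y - A @ x), lam)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 32)) / 4.0     # illustrative measurement matrix
y = rng.standard_normal(16)
mu, lam, T = 0.9, 0.05, 5
# Initialized at the classical values, the unfolded net IS classical ISTA:
print(np.allclose(classical_ista(y, A, mu, lam, T),
                  unfolded(y, A, [mu] * T, [lam] * T)))   # True
```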
ex-mimo-ch25-09
Hard: Assume the optimal CSI feedback codec operates exactly on the Gaussian rate-distortion curve of Theorem 25.1. For a channel covariance with eigenvalues decaying as a power law $\lambda_i \propto i^{-\gamma}$, derive how $R(D)$ scales with $D$ in the regime where only a vanishing fraction of eigenvalues lies above the water level.
Integrate the continuous version of the reverse water-filling.
$\lambda(i) = i^{-\gamma}$ in the continuous limit.
A power-law eigenvalue density gives a straight-line dependence between $R$ and $D$ on log-log axes.
Continuous water-filling
Treat $\lambda(i) = i^{-\gamma}$ as a continuous function of $i$. The water level $\theta$ determines the cutoff index $i^*$ via $\lambda(i^*) = \theta$, i.e. $i^* = \theta^{-1/\gamma}$.
Distortion budget
The distortion in the discarded tail is $\int_{i^*}^{\infty} i^{-\gamma}\, di = \frac{(i^*)^{1-\gamma}}{\gamma - 1}$ (the in-band contribution $\theta\, i^* = (i^*)^{1-\gamma}$ is of the same order). Substituting gives $D \propto (i^*)^{1-\gamma}$, or equivalently $i^* \propto D^{-1/(\gamma-1)}$.
Rate
$R = \int_0^{i^*} \tfrac{1}{2}\log_2\frac{\lambda(i)}{\theta}\, di$. The dominant term is $\tfrac{\gamma}{2}\int_0^{i^*}\log_2\frac{i^*}{i}\, di$, which simplifies to $R \propto i^*$. Substituting $i^* \propto D^{-1/(\gamma-1)}$ gives $R(D) \propto D^{-1/(\gamma-1)}$.
Interpretation
For $\gamma \gg 1$ (fast decay, typical of massive MIMO) the exponent $1/(\gamma - 1)$ is small: the feedback overhead decays quickly as the distortion target relaxes and grows only slowly as it tightens. For $\gamma \to 1^{+}$ (slow decay, rich multipath) the exponent blows up: the feedback grows catastrophically as distortion tightens. Channels with strong angular structure (low rank, high $\gamma$) are intrinsically feedback-friendly; isotropic channels are not. This is the quantitative reason CsiNet helps for some channel types and not others.
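A numerical check of the scaling in Python; the exponent $\gamma = 2$ and the truncation at $10^5$ eigenvalues are illustrative:

```python
import numpy as np

def rd_rate(lam, D):
    """Reverse water-filling rate for spectrum lam at distortion D (bisection)."""
    lo, hi = 0.0, lam.max()
    for _ in range(200):
        theta = 0.5 * (lo + hi)
        hi, lo = (theta, lo) if np.minimum(lam, theta).sum() > D else (hi, theta)
    return 0.5 * np.log2(np.maximum(lam / theta, 1.0)).sum()

gamma = 2.0                                   # power-law exponent (assumed)
lam = np.arange(1, 100_001) ** (-gamma)       # lambda_i = i^{-gamma}
for D in [1e-1, 1e-2, 1e-3]:
    print(D, rd_rate(lam, D))
# R grows ~10x per decade of tightening in D here, matching the derived
# scaling R(D) ~ D^{-1/(gamma-1)} with gamma = 2.
```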
ex-mimo-ch25-10
Medium: Explain the concept of "hybrid deployment with safe fallback" as advocated by the 6G@TU Berlin / Huawei workshop. Design a concrete fallback trigger for a deep-unfolded channel estimator: what measurable quantity would you threshold, and what is the threshold value?
A good fallback trigger monitors the learned component's uncertainty or plausibility.
Residual norm, output entropy, or input-distribution distance are candidates.
The threshold should be tuned on a held-out matched validation set.
Hybrid deployment concept
A learned component (e.g. deep-unfolded ISTA channel estimator) runs in parallel with a classical fallback (e.g. sample-MMSE). A small monitor compares the learned output against a plausibility criterion; when the criterion fails the classical output is used instead. The classical branch guarantees worst-case correctness; the learned branch provides the gain when it is applicable.
Trigger choice
For a channel estimator, the natural trigger is the residual norm $\rho = \|\mathbf{y} - \mathbf{A}\hat{\mathbf{h}}\|_2$. On matched data this is small; on shifted data the learned network's output is mismatched to the measurement, which the residual detects.
Threshold tuning
Compute $\rho$ on a held-out matched validation set and set the threshold at the 95th percentile. On deployment, any input with $\rho$ above the threshold triggers the fallback. This yields roughly 5 % fallback rate on matched data (a controlled false-positive rate) and a much higher fallback rate on shifted data, where it is supposed to trigger. Monitor the fallback rate continuously: if it drifts above 20 %, the network needs retraining.
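A sketch of the trigger in Python; `learned_est` and `classical_est` stand in for the two estimator branches and are hypothetical names:

```python
import numpy as np

def calibrate_threshold(residuals_val, q=0.95):
    """Set the fallback threshold at the 95th percentile of residual norms
    measured on a held-out matched validation set."""
    return np.quantile(residuals_val, q)

def estimate_with_fallback(y, A, learned_est, classical_est, threshold):
    """Hybrid deployment sketch: use the learned estimate unless its
    measurement residual exceeds the calibrated threshold."""
    h_learned = learned_est(y)
    rho = np.linalg.norm(y - A @ h_learned)
    if rho > threshold:
        return classical_est(y), True    # fallback triggered
    return h_learned, False

# Usage sketch (learned_est / classical_est are placeholder branches):
# thr = calibrate_threshold(residuals_on_matched_validation_set)
# h_hat, fell_back = estimate_with_fallback(y, A, unfolded_net, lmmse, thr)
```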
ex-mimo-ch25-11
Easy: List three physical-layer tasks where pure data-driven deep learning is a poor choice and three where it is a reasonable default. Justify each briefly.
Poor: tasks where a closed-form Bayes solution exists.
Reasonable: tasks with rich temporal or spatial structure and no clean analytical model.
Generalization requirements matter too.
Poor choices
- Linear channel estimation in a Gaussian regime. MMSE is Bayes-optimal; no network can beat it.
- Downlink precoder design for a known channel. The closed-form RZF / water-filling formula is optimal.
- LDPC decoding. Belief propagation has decades of analysis and very strong performance. DL decoders have never beaten BP on a well-designed code.
Reasonable choices
- Beam prediction in mobility. Rich temporal structure, no clean rule-based model, 3GPP Rel-18 candidate.
- Scenario classification (indoor vs outdoor vs LoS vs NLoS). Pure pattern recognition, no analytical form, huge ML literature to draw on.
- Channel compression for CSI feedback in fixed deployments. Per-scenario training is acceptable, and the rate-distortion gain over Type II is real.
ex-mimo-ch25-12
Challenge: A CsiNet codec achieves strong NMSE at 128 bits on its training distribution (CDL-C) but degrades sharply on an unseen CDL-D deployment at the same bit budget. A deep-unfolded OAMP-Net with 12 trainable parameters is slightly worse than CsiNet on matched CDL-C but degrades only mildly on CDL-D at 128 bits. Which architecture is preferable for a commercial deployment seeing a mix of CDL-A/B/C/D scenarios, and why? Compare the expected worst-case NMSE under each choice.
Worst-case performance dominates commercial deployment decisions.
Compute the expected NMSE under a uniform mixture over scenarios.
Consider the retraining cost of each choice.
Worst-case NMSE comparison
CsiNet's worst case (CDL-D or a similar shift) falls far below its matched performance; OAMP-Net's worst case stays close to its matched performance. The model-based approach wins by 3 dB on the scenario it was not trained on.
Average NMSE under uniform mixture
Under a uniform mixture over CDL-A/B/C/D, CsiNet's average is dragged down by the three shifted scenarios (only one is matched), while OAMP-Net stays near its matched NMSE on all four. OAMP-Net wins on average too.
Retraining cost
CsiNet to match OAMP-Net would need to be retrained separately for each deployment, i.e. four separate trained models plus a model-distribution mechanism to swap them. OAMP-Net needs one trained model. Operational cost heavily favors OAMP-Net.
Recommendation
Deploy OAMP-Net. It gives up a little peak efficiency on matched CDL-C to gain several dB of worst-case efficiency and to eliminate the per-scenario retraining overhead. This is a textbook case of the 6G@TU Berlin / Huawei position: model-based DL beats data-driven DL on every commercial deployment metric. The CsiNet gain is only visible in academic papers, not in field trials.
ex-mimo-ch25-13
Medium: Derive the expected overhead saving of an LSTM beam predictor vs exhaustive beam sweep, as a function of the codebook size $N$ and the top-$K$ accuracy $p$. Assume that when the true beam is not in the top-$K$, a fallback to exhaustive search is triggered.
Predicted overhead = top-$K$ measurement cost + fallback probability $\times$ exhaustive cost.
Fallback probability = $1 - p$.
Overhead saving = $1 - \frac{K + (1-p)N}{N} = p - \frac{K}{N}$.
Overhead formula
Let $N$ be the number of measurements for an exhaustive sweep. LSTM + fallback overhead per slot: $C = K + (1 - p)N$.
Overhead saving
Saving $= 1 - \frac{C}{N} = 1 - \frac{K}{N} - (1 - p) = p - \frac{K}{N}$.
Numerical example
For $N = 64$, $K = 5$, $p = 0.95$: Saving $= 0.95 - 5/64 \approx 0.872$, i.e. 87.2 % overhead reduction, an 8x improvement, consistent with Example 25.3. For more modest accuracies the saving shrinks linearly but remains very useful. If $p$ drops to 0.5 (weak predictor): Saving $\approx 0.42$, marginal. This is why the top-5 accuracy is such a useful headline metric: it directly determines the deployment value.
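The formula in executable form (Python); the middle accuracy value is an illustrative fill-in:

```python
def beam_overhead_saving(N, K, p):
    """Overhead saving of top-K prediction with exhaustive-sweep fallback:
    per-slot cost is K measurements plus (1-p)*N for fallback, vs N."""
    return 1.0 - (K + (1.0 - p) * N) / N    # equals p - K/N

for p in [0.95, 0.8, 0.5]:                   # 0.8 is an assumed mid case
    print(p, beam_overhead_saving(N=64, K=5, p=p))
# 0.95 -> 0.872 (the 8x case); lower accuracy erodes the saving linearly.
```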
ex-mimo-ch25-14
Medium: In the context of wireless RL, explain the simulation-to-reality gap and list three concrete reasons why a policy trained on a 3GPP system-level simulator may fail on a real gNB. Propose one mitigation for each reason.
Simulators abstract away hardware details.
Simulators are also faster than real time, which changes the training dynamics.
Real deployments have non-stationary noise and delays the simulator does not model.
Reason 1: Hardware non-idealities
Real PA nonlinearity, oscillator phase noise, and IQ imbalance add structured noise that the simulator typically omits. Mitigation: Inject empirical hardware noise models into the simulator using measurements from real RRUs during data generation.
Reason 2: Timing / processing delays
Simulator assumes actions take effect instantly; real scheduling decisions have 1-2 slot processing delays and variable latency. Mitigation: Add a Markovian delay model to the simulator so the learned policy accounts for non-instantaneous control.
Reason 3: Traffic statistics
Real UE traffic is bursty, heavy-tailed, and correlated across users; simulators use independent Poisson or constant arrivals. Mitigation: Replay real traffic traces into the simulator instead of using synthetic models, and use domain randomization over multiple traffic profiles during training.
ex-mimo-ch25-15
Challenge: Consider a Transformer-based CSI encoder with attention across all subcarriers. Compute the number of multiply-accumulate operations per CSI update for $N$ subcarriers, $N_t$ antennas, embedding dimension $d$, and one attention layer. Compare with a CsiNet (convolutional) encoder on the same problem and identify which is deployable on a handset NPU with a 1 GMAC/s budget.
Attention cost is $4Nd^2 + 2N^2 d$ MACs per layer plus $8Nd^2$ for the feedforward block (4x expansion).
Convolution cost is $HW k^2 C_{\mathrm{in}} C_{\mathrm{out}}$ MACs for kernel size $k \times k$ and channels $C_{\mathrm{in}} \to C_{\mathrm{out}}$ over an $H \times W$ feature map.
1 GMAC/s at a 5 ms update rate gives a budget of 5 MMAC per inference.
Transformer MAC count
Attention: $4Nd^2 + 2N^2 d$ MACs (QKV/output projections plus score computation and weighted sum). Feedforward block (2 layers, 4x expansion): $8Nd^2$ MACs. Total per attention layer: approximately 9.5 MMAC at the stated problem size.
CsiNet MAC count
3 conv layers, $3 \times 3$ kernels, 32 channels over the truncated delay-domain map: roughly 3.5 MMAC total for the 3 layers.
Deployability
At a 5 ms CSI update rate, the 1 GMAC/s NPU budget provides 5 MMAC per inference. CsiNet at 3.5 MMAC fits comfortably. Transformer-CSI at 9.5 MMAC does not fit: it would need either a larger NPU, a reduced update rate (10 ms instead of 5 ms), or architectural compression (distillation to a smaller network). In the current generation of handset NPUs the CsiNet-class models are deployable; Transformer-CSI is not yet. This is why CsiNet remains the 3GPP reference despite Transformers' better NMSE on paper.
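The two counts as a small Python calculator. The token count, embedding dimension, feature-map size, and channel progression are assumed values chosen to land in the same few-MMAC ballpark as the figures above, not the exercise's stated sizes:

```python
def transformer_macs(N, d):
    """MACs for one attention layer: QKV + output projections (4 N d^2),
    attention scores and weighted sum (2 N^2 d), FFN with 4x expansion
    (8 N d^2). N = tokens (subcarriers), d = embedding dimension."""
    return 4 * N * d**2 + 2 * N**2 * d + 8 * N * d**2

def conv_macs(H, W, k, c_in, c_out):
    """MACs for one conv layer with k x k kernels on an H x W feature map."""
    return H * W * k * k * c_in * c_out

# Illustrative sizes (assumed, not the exercise's stated values)
print(transformer_macs(N=64, d=96) / 1e6, "MMAC (one attention layer)")
total_conv = sum(conv_macs(32, 32, 3, c_in, c_out)
                 for c_in, c_out in [(2, 16), (16, 16), (16, 2)])
print(total_conv / 1e6, "MMAC (3 conv layers)")
```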