Exercises
ex-ch31-01
Easy: A two-layer fully connected neural network is used for channel estimation in an OFDM system with $N$ subcarriers and $P$ pilots. The hidden layer has $M$ units. The input is the vector of pilot observations (dimension $d_{\text{in}}$) and the output is the full channel estimate (dimension $d_{\text{out}}$).
(a) Write the dimensions of the weight matrices $\mathbf{W}_1$, $\mathbf{W}_2$ and bias vectors $\mathbf{b}_1$, $\mathbf{b}_2$.
(b) Compute the total number of trainable parameters.
(c) If the training dataset has 300 channel realisations, is this likely sufficient? Justify using the rule-of-thumb ratio.
The input dimension is $d_{\text{in}}$ and the output dimension is $d_{\text{out}}$.
Count parameters as: $N_{\text{params}} = M(d_{\text{in}} + 1) + d_{\text{out}}(M + 1)$.
Weight matrix dimensions
$$\mathbf{W}_1 \in \mathbb{R}^{M \times d_{\text{in}}}, \quad \mathbf{b}_1 \in \mathbb{R}^{M}, \quad \mathbf{W}_2 \in \mathbb{R}^{d_{\text{out}} \times M}, \quad \mathbf{b}_2 \in \mathbb{R}^{d_{\text{out}}}$$
Parameter count
$$N_{\text{params}} = \underbrace{M d_{\text{in}} + M}_{\text{layer 1}} + \underbrace{d_{\text{out}} M + d_{\text{out}}}_{\text{layer 2}} = M(d_{\text{in}} + 1) + d_{\text{out}}(M + 1)$$
Data sufficiency assessment
The rule of thumb requires roughly 5--10 training samples per trainable parameter for good generalisation. Here the network has $N_{\text{params}}$ parameters, but we only have 300 samples, so the ratio $300/N_{\text{params}}$ is far below the recommended 5--10.
The 300 samples are grossly insufficient for this network. Options: (1) reduce the hidden width $M$ to 8 (giving 2600 parameters), (2) use a model-based approach (deep unfolding) with far fewer parameters, or (3) augment the data by training at multiple SNR values.
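The counting rule above is mechanical; a quick sketch with made-up layer sizes (the exercise fixes its own $N$, $P$, $M$, so the dimensions below are purely illustrative) shows how the rule-of-thumb check plays out:

```python
# Parameter count of a two-layer fully connected network.
# Dimensions here are hypothetical placeholders, not the exercise's values.
def count_params(d_in, hidden, d_out):
    """W1: hidden x d_in, b1: hidden, W2: d_out x hidden, b2: d_out."""
    return hidden * (d_in + 1) + d_out * (hidden + 1)

# Rule-of-thumb check: want samples >= 5-10x parameters.
n_params = count_params(d_in=64, hidden=128, d_out=256)  # assumed sizes
ratio = 300 / n_params                                   # 300 training samples
print(n_params, round(ratio, 5))
```

With these assumed sizes the network has tens of thousands of parameters, so 300 samples give a ratio of well under 0.01 samples per parameter.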
ex-ch31-02
Easy: An end-to-end autoencoder for 8-ary signalling over one complex channel use (2D constellation) is trained under AWGN at a fixed training SNR. After training, the encoder produces the following constellation points (in I-Q coordinates):
| Symbol | I | Q |
|---|---|---|
| 0 | 1.05 | 0.02 |
| 1 | 0.51 | 0.88 |
| 2 | 0.89 | |
| 3 | 0.03 | |
| 4 | ||
| 5 | 0.50 | |
| 6 | 0.01 | 0.01 |
| 7 | 0.00 | |
(a) Compute the average symbol energy.
(b) Does this constellation resemble any classical modulation scheme? Explain any differences.
Average symbol energy: $E_s = \frac{1}{8}\sum_{m=0}^{7}\left(I_m^2 + Q_m^2\right)$.
Compare the point positions with 8-PSK (all on a unit circle) and think about which points are anomalous.
Average symbol energy
Computing $E_m = I_m^2 + Q_m^2$ for each symbol (entries whose coordinates are incomplete in the table above are left blank):
| Symbol | $E_m$ |
|---|---|
| 0 | 1.10 |
| 1 | 1.03 |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | 0.00 |
| 7 | |
Comparison with classical schemes
Symbols 0--5 lie approximately on a circle of radius $\approx 1$ at angles spaced by $60^\circ$ --- this is essentially 6-PSK (a regular hexagonal arrangement). Symbols 6 and 7, however, are clustered near the origin.
This is a (6,2) constellation: 6 points on a circle plus 2 near the centre. Classical 8-PSK places all 8 points on a circle with angular spacing $45^\circ$. The autoencoder has discovered an alternative arrangement that sacrifices the energy of symbols 6 and 7 (placing them at the origin where they are easily confused with each other under noise) to increase the minimum distance among the remaining 6 symbols.
Whether this is truly optimal depends on the prior probability of each symbol. If all symbols are equiprobable, the near-origin symbols are a weakness. This illustrates that small-scale autoencoder training can sometimes converge to suboptimal local minima.
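For reference, the average energy of the idealised (6,2) layout described above (six unit-circle points at $60^\circ$ spacing plus two points at the origin) can be computed directly. This is a sketch with an assumed ideal constellation, not the learned points themselves:

```python
import numpy as np

# Idealised (6,2) constellation: 6 points on the unit circle, 2 at the origin.
angles = np.deg2rad(np.arange(0, 360, 60))
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 6 unit-energy points
origin = np.zeros((2, 2))                                    # 2 zero-energy points
const = np.vstack([circle, origin])                          # 8 symbols in I-Q

E_avg = np.mean(np.sum(const**2, axis=1))
print(E_avg)  # 6/8 = 0.75 for this idealised layout
```

The learned constellation's energy will be close to this only to the extent that its six outer points really sit on a unit circle.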
ex-ch31-03
Easy: In a Q-learning power control problem with $K = 8$ users and $L = 4$ discrete power levels per user:
(a) How many entries does the joint Q-table have if the state space has $|\mathcal{S}| = 27$ states (3 quantisation levels per user)?
(b) If an alternative per-user decomposition is used (each user has an independent Q-table of size $27 \times 4$), how many total entries are needed?
(c) What is the reduction factor?
Joint action space: $|\mathcal{A}| = L^K = 4^8 = 65{,}536$.
Per-user approach: $K = 8$ independent tables of size $27 \times 4$.
Joint Q-table size
$$|\mathcal{S}| \times |\mathcal{A}| = 27 \times 4^8 = 27 \times 65{,}536 = 1{,}769{,}472 \text{ entries}$$
Per-user Q-table size
Each user has a table of size $27 \times 4 = 108$. Total for $K = 8$ users: $8 \times 108 = 864$ entries.
Reduction factor
$$\frac{27 \times 4^8}{8 \times 27 \times 4} = \frac{1{,}769{,}472}{864} = 2048$$
A $2048\times$ reduction. However, the per-user decomposition ignores inter-user coupling and may converge to a suboptimal (Nash equilibrium) rather than the globally optimal allocation.
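The table-size arithmetic can be checked in a few lines, using the exercise's counts:

```python
# Table sizes from the exercise: |S| = 27 states, K = 8 users, L = 4 power levels.
S, K, L = 27, 8, 4

joint_entries = S * L**K          # one Q-value per (state, joint action) pair
per_user_entries = K * S * L      # K independent tables of size |S| x L

print(joint_entries, per_user_entries, joint_entries // per_user_entries)
# 1769472 864 2048
```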
ex-ch31-04
Easy: In federated learning with $C$ clients, each client has a model with $D$ parameters (32-bit floats).
(a) Compute the upload cost per client per round (in KB).
(b) If FedAvg runs for $T$ rounds with full client participation, compute the total communication cost (in MB).
(c) If only $C_s$ clients are randomly selected per round, what is the new total cost?
32-bit float = 4 bytes.
Total cost = rounds $\times$ participating clients per round $\times$ cost per client.
Per-client upload cost
$$\text{cost per client per round} = 4D \text{ bytes} = \frac{4D}{1024} \text{ KB}$$
Full participation total cost
$$\text{total upload} = T \times C \times 4D \text{ bytes}$$
(Counting the download of the updated global model as well, the exercise's values give $\sim$31 MB of total bidirectional communication.)
Partial participation cost
$$\text{total upload} = T \times C_s \times 4D \text{ bytes},$$
a $C/C_s \approx 3.3\times$ reduction. Partial participation also reduces the per-round latency (fewer clients to wait for), but may slow convergence because less data is represented per round.
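A quick sketch with placeholder values (the exercise's $C$, $D$, $T$ are fixed in its statement; the numbers below are illustrative only):

```python
# Communication cost of FedAvg uploads, 32-bit floats (4 bytes per parameter).
def upload_kb(D):
    return 4 * D / 1024                       # per client, per round, in KB

def total_mb(T, clients, D):
    return T * clients * 4 * D / (1024**2)    # upload-only total, in MB

D, T, C, C_sel = 100_000, 50, 10, 3           # hypothetical values
print(upload_kb(D))                            # ~390.6 KB per client per round
print(total_mb(T, C, D), total_mb(T, C_sel, D))
```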
ex-ch31-05
Medium: A neural-network channel estimator uses the MSE loss $\mathcal{L} = \|\hat{\mathbf{h}} - \mathbf{h}\|^2$. The network has a single hidden layer, $\hat{\mathbf{h}} = \mathbf{W}_2\,\sigma(\mathbf{W}_1\mathbf{y} + \mathbf{b}_1) + \mathbf{b}_2$, with ReLU activation $\sigma$.
(a) Derive the gradient $\partial\mathcal{L}/\partial\mathbf{W}_2$.
(b) Derive the gradient $\partial\mathcal{L}/\partial\mathbf{W}_1$, expressing the chain rule through the ReLU.
(c) If all pilot observations are real-valued and $\mathbf{y} \geq \mathbf{0}$ element-wise, explain why the gradient simplifies.
Let $\mathbf{a} = \sigma(\mathbf{W}_1\mathbf{y} + \mathbf{b}_1)$ be the hidden activations.
The ReLU derivative is $\sigma'(z) = \mathbb{1}[z > 0]$ (indicator function).
Gradient w.r.t. $\mathbf{W}_2$
Define the error $\mathbf{e} = \hat{\mathbf{h}} - \mathbf{h}$ and the hidden activation $\mathbf{a} = \sigma(\mathbf{W}_1\mathbf{y} + \mathbf{b}_1)$. Then $\dfrac{\partial\mathcal{L}}{\partial\mathbf{W}_2} = 2\,\mathbf{e}\,\mathbf{a}^{\top}$.
In matrix form, for a batch of $N$ samples with the loss averaged over the batch: $\dfrac{\partial\mathcal{L}}{\partial\mathbf{W}_2} = \dfrac{2}{N}\,\mathbf{E}^{\top}\mathbf{A}$, where $\mathbf{E}$ and $\mathbf{A}$ are the error and activation matrices with samples as rows.
Gradient w.r.t. $\mathbf{W}_1$ (backpropagation through ReLU)
By the chain rule:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}_1} = 2\,\big(\mathrm{diag}(\mathbb{1}[\mathbf{z} > 0])\,\mathbf{W}_2^{\top}\mathbf{e}\big)\,\mathbf{y}^{\top},$$
where $\mathbf{z} = \mathbf{W}_1\mathbf{y} + \mathbf{b}_1$ is the pre-activation. The diagonal matrix zeroes out gradients for hidden units whose pre-activation was negative (inactive ReLU units).
Simplification for non-negative inputs
If $\mathbf{y} \geq \mathbf{0}$ element-wise and $\mathbf{W}_1$, $\mathbf{b}_1$ have non-negative entries, then $\mathbf{z} = \mathbf{W}_1\mathbf{y} + \mathbf{b}_1 \geq \mathbf{0}$. In this case, $\mathbb{1}[\mathbf{z} > 0] = \mathbf{1}$ (all ones) and the ReLU becomes the identity:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}_1} = 2\,\mathbf{W}_2^{\top}\mathbf{e}\,\mathbf{y}^{\top}.$$
This simplification only holds at initialisation with non-negative weights; during training, some weights become negative and dead ReLU units appear.
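The derived gradients can be validated against a finite-difference check; a minimal sketch with small, arbitrary dimensions:

```python
import numpy as np

# Finite-difference check of the backprop formula for W1 (single sample,
# squared-error loss, one hidden ReLU layer). Dimensions are arbitrary.
rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 5, 3
W1 = rng.normal(size=(d_hid, d_in)); b1 = rng.normal(size=d_hid)
W2 = rng.normal(size=(d_out, d_hid)); b2 = rng.normal(size=d_out)
y = rng.normal(size=d_in); h = rng.normal(size=d_out)

def loss(W1_):
    a = np.maximum(W1_ @ y + b1, 0.0)        # ReLU hidden activations
    e = (W2 @ a + b2) - h                    # error vector e = h_hat - h
    return e @ e

# Analytic gradient: 2 * diag(1[z > 0]) W2^T e y^T
z = W1 @ y + b1
e = (W2 @ np.maximum(z, 0.0) + b2) - h
grad_W1 = 2.0 * np.outer((z > 0) * (W2.T @ e), y)

# Central-difference check of one entry
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
num = (loss(Wp) - loss(Wm)) / (2 * eps)
print(abs(num - grad_W1[0, 0]))   # should be tiny
```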
ex-ch31-06
Medium: Consider a LISTA network with $T = 10$ layers applied to the sparse recovery problem $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n}$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x}$ is sparse.
(a) If only the per-layer thresholds $\theta_t$ are learned (with $\mathbf{W}_t$ and $\mathbf{S}_t$ fixed at the ISTA values), how many trainable parameters does the network have?
(b) If all matrices and thresholds are learned, how many parameters?
(c) ISTA achieves the target NMSE after $K$ iterations. Based on the linear convergence theorem for LISTA, estimate the number of LISTA layers needed to achieve the same NMSE, assuming a per-layer contraction factor $\rho = 0.5$.
For (c), use $\text{NMSE} \approx c\,\rho^{T}$ and solve $\rho^{T} = 10^{-3}$.
Assume $c = 1$ for a rough estimate (the constant absorbs the initial error).
Thresholds only
With only the per-layer scalar thresholds $\theta_1, \dots, \theta_T$ trainable:
$$N_{\text{params}} = T = 10.$$
For $T = 10$ layers: only 10 trainable parameters. This is remarkably few and explains the excellent sample efficiency of model-based deep unfolding.
All parameters learned
Per layer:
- $\mathbf{W}_t \in \mathbb{R}^{n \times m}$: $nm$ parameters
- $\mathbf{S}_t \in \mathbb{R}^{n \times n}$: $n^2$ parameters
- $\theta_t$: 1 parameter
Per layer: $nm + n^2 + 1$ parameters. Total for $T$ layers: $T(nm + n^2 + 1)$.
For $T = 10$: $10(nm + n^2 + 1)$ parameters --- still far fewer than a general black-box network of comparable depth would need.
Estimating LISTA layers from convergence rate
Target NMSE: $-30$ dB $= 10^{-3}$.
Using $\rho^{T} = 10^{-3}$ with $c = 1$ and $\rho = 0.5$:
$$T = \frac{\ln 10^{-3}}{\ln 0.5} = \frac{-6.91}{-0.693} \approx 10.$$
So $T \approx 10$ LISTA layers suffice, compared to the $K$ ISTA iterations --- a $K/10$ reduction. With optimised (trained) thresholds and matrices, the effective $\rho$ can be smaller, potentially allowing even fewer layers.
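The layer-count estimate is a one-liner; a sketch using the illustrative values above ($c = 1$, contraction factor $\rho = 0.5$, target NMSE $10^{-3}$, all of which are assumptions of this estimate):

```python
import math

# Layers needed for a target NMSE under linear convergence NMSE ~ c * rho**T,
# taking c = 1 (the constant absorbs the initial error).
def lista_layers(target_nmse, rho):
    return math.ceil(math.log(target_nmse) / math.log(rho))

print(lista_layers(1e-3, 0.5))   # 10 layers for -30 dB at rho = 0.5
```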
ex-ch31-07
Medium: A Q-learning agent controls the power of a single transmitter communicating with one receiver. The state is the quantised channel gain $g \in \{g_{\text{low}}, g_{\text{med}}, g_{\text{high}}\}$ (3 states). The action is the power level $P \in \{P_1, P_2, P_3\}$ (3 actions). The discount factor is $\gamma = 0$ (myopic, no future reward consideration).
The reward is the rate $r = \log_2(1 + Pg/\sigma^2)$ with noise power $\sigma^2$. For the three states, the typical channel gains are $g_{\text{low}} < g_{\text{med}} < g_{\text{high}}$.
(a) Compute the optimal Q-table exactly (9 entries).
(b) What is the optimal policy?
(c) If the agent has explored each state-action pair exactly once with learning rate $\alpha = 1$, what does the Q-table look like? Is it already optimal?
With $\gamma = 0$, $Q^*(s, a) = r(s, a)$.
Since the channel gain is deterministic given the state (typical value), the Q-value is just the rate.
Optimal Q-table
With $\gamma = 0$, $Q^*(s, a) = \log_2\!\big(1 + a\,g_s/\sigma^2\big)$:
| State | $P_1$ | $P_2$ | $P_3$ |
|---|---|---|---|
| low ($g_{\text{low}}$) | $\log_2(1 + P_1 g_{\text{low}}/\sigma^2)$ | $\log_2(1 + P_2 g_{\text{low}}/\sigma^2)$ | $\log_2(1 + P_3 g_{\text{low}}/\sigma^2)$ |
| med ($g_{\text{med}}$) | $\log_2(1 + P_1 g_{\text{med}}/\sigma^2)$ | $\log_2(1 + P_2 g_{\text{med}}/\sigma^2)$ | $\log_2(1 + P_3 g_{\text{med}}/\sigma^2)$ |
| high ($g_{\text{high}}$) | $\log_2(1 + P_1 g_{\text{high}}/\sigma^2)$ | $\log_2(1 + P_2 g_{\text{high}}/\sigma^2)$ | $\log_2(1 + P_3 g_{\text{high}}/\sigma^2)$ |
Optimal policy
For every state, the maximum Q-value is at the largest power $P_3$, because the rate is monotonically increasing in $P$.
This is unsurprising for a single-user system without a power constraint: more power always means more rate. In a multi-user setting with interference, the optimal policy would be state-dependent.
Q-table after one visit per pair
With $\alpha = 1$ and $\gamma = 0$, the update $Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\,r$ directly sets $Q(s, a) = r(s, a)$ after one visit.
Therefore, the Q-table is already optimal after one visit per pair, because the reward is deterministic (given the quantised state) and $\gamma = 0$ eliminates the need to estimate future returns. This is a special case; with stochastic rewards or $\gamma > 0$, convergence requires many visits.
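A sketch of the one-visit argument, using hypothetical channel gains and power levels (the exercise's actual values are not reproduced here):

```python
import numpy as np

# With alpha = 1 and gamma = 0, one visit per (state, action) pair fills the
# Q-table with the deterministic rewards. Gains and powers are placeholders.
gains = {"low": 0.1, "med": 1.0, "high": 10.0}   # assumed typical gains
powers = [0.1, 1.0, 10.0]                         # assumed power levels
sigma2 = 1.0

Q = {}
for s, g in gains.items():
    for a in powers:
        r = np.log2(1.0 + a * g / sigma2)         # reward = achievable rate
        Q[(s, a)] = r                             # alpha = 1, gamma = 0 update

policy = {s: max(powers, key=lambda a: Q[(s, a)]) for s in gains}
print(policy)   # maximum power is greedy-optimal in every state
```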
ex-ch31-08
Medium: In FedAvg with $C$ clients, the global model is $\mathbf{w}^{t+1} = \frac{1}{C}\sum_{c=1}^{C}\mathbf{w}_c$. Each client performs $E$ local SGD steps with learning rate $\eta$. The local gradients at client $c$ have mean $\mathbf{g}_c$ and the global gradient is $\mathbf{g} = \frac{1}{C}\sum_c \mathbf{g}_c$.
(a) Write the local model after $E$ steps (assuming constant gradient, i.e., the quadratic approximation): $\mathbf{w}_c = \mathbf{w}^t - \eta E\,\mathbf{g}_c$.
(b) Write the aggregated model $\mathbf{w}^{t+1}$.
(c) Compare with a single centralised SGD step using the global gradient $\mathbf{g}$. When are they equivalent?
Substitute the expression from (a) into the averaging formula.
The centralised update would be $\mathbf{w}^{t+1} = \mathbf{w}^t - \tilde{\eta}\,\mathbf{g}$.
Local model after $E$ steps
Under the constant-gradient approximation:
$$\mathbf{w}_c = \mathbf{w}^t - \eta E\,\mathbf{g}_c.$$
Aggregated model
$$\mathbf{w}^{t+1} = \frac{1}{C}\sum_{c=1}^{C}\mathbf{w}_c = \mathbf{w}^t - \eta E\,\frac{1}{C}\sum_{c=1}^{C}\mathbf{g}_c = \mathbf{w}^t - \eta E\,\mathbf{g}.$$
Comparison with centralised SGD
Centralised SGD: $\mathbf{w}^{t+1} = \mathbf{w}^t - \tilde{\eta}\,\mathbf{g}$.
FedAvg: $\mathbf{w}^{t+1} = \mathbf{w}^t - \eta E\,\mathbf{g}$.
These are identical when $\tilde{\eta} = \eta E$, i.e., the effective federated learning rate is $E$ times the local learning rate.
Crucially, this equivalence holds only under the constant-gradient assumption (i.e., the objective is quadratic and the gradient is evaluated once). For non-linear objectives, the gradient changes during local SGD, and each client drifts toward its own local optimum. The resulting "client drift" creates the heterogeneity bias in the FedAvg convergence bound. This is why the bias term in Theorem 31.4 grows with $E$.
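The equivalence in the constant-gradient case is easy to verify numerically; a sketch with arbitrary synthetic gradients:

```python
import numpy as np

# For constant per-client gradients g_c, E local steps followed by averaging
# equals one centralised step with effective rate eta * E. Values illustrative.
rng = np.random.default_rng(1)
C, E, eta, d = 5, 4, 0.01, 3
g = rng.normal(size=(C, d))                 # fixed per-client gradients
w0 = rng.normal(size=d)

# FedAvg round: each client takes E steps against its constant gradient
local = np.stack([w0 - eta * E * g[c] for c in range(C)])
w_fedavg = local.mean(axis=0)

# Centralised step with the global gradient and eta_tilde = eta * E
w_central = w0 - eta * E * g.mean(axis=0)

print(np.allclose(w_fedavg, w_central))     # True: the updates coincide
```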
ex-ch31-09
Medium: A soft-thresholding operator $\mathcal{S}_\theta(z) = \mathrm{sign}(z)\max(|z| - \theta, 0)$ is used in each layer of a LISTA network. During backpropagation, we need $\partial\mathcal{S}_\theta/\partial z$ and $\partial\mathcal{S}_\theta/\partial\theta$.
(a) Derive $\partial\mathcal{S}_\theta/\partial z$ for $z \neq \pm\theta$.
(b) Derive $\partial\mathcal{S}_\theta/\partial\theta$.
(c) Explain why the gradient w.r.t. $\theta$ provides a learning signal for adapting the sparsity level per layer.
Consider the three regions $z > \theta$, $|z| \leq \theta$, and $z < -\theta$ separately.
In the active region ($|z| > \theta$), $\mathcal{S}_\theta(z) = z - \theta\,\mathrm{sign}(z)$.
Gradient w.r.t. $z$
$$\frac{\partial\mathcal{S}_\theta}{\partial z} = \mathbb{1}\big[|z| > \theta\big].$$
At $z = \pm\theta$, the function is not differentiable, but the subgradient is conventionally taken as 0 (consistent with treating the ReLU-like kink as having zero derivative).
This is identical to the derivative of the "dead zone" function --- values inside the threshold band have zero gradient (they contribute nothing to the output and receive no learning signal), while values outside pass gradients unchanged.
Gradient w.r.t. $\theta$
For $z > \theta$: $\mathcal{S}_\theta(z) = z - \theta$, so:
$$\frac{\partial\mathcal{S}_\theta}{\partial\theta} = -1.$$
For $z < -\theta$: $\mathcal{S}_\theta(z) = z + \theta$, so:
$$\frac{\partial\mathcal{S}_\theta}{\partial\theta} = +1.$$
Summarising:
$$\frac{\partial\mathcal{S}_\theta}{\partial\theta} = -\mathrm{sign}(z)\,\mathbb{1}\big[|z| > \theta\big].$$
Interpretation for sparsity adaptation
The gradient $\partial\mathcal{S}_\theta/\partial\theta = -\mathrm{sign}(z)$ (for active elements) tells us:
- If increasing $\theta$ would reduce the output magnitude of an element that should be large (positive contribution to loss), the gradient will push $\theta$ to decrease.
- If an element is spuriously active (non-zero output when the ground truth is zero), the loss gradient through this element will push $\theta$ to increase, killing the spurious entry.
In early LISTA layers, the training objective encourages a large threshold (aggressive sparsification to identify the correct support). In later layers, the objective encourages a small threshold (fine-tune the amplitudes of the identified non-zero entries). This automatic adaptation of sparsity level per layer is the key advantage over ISTA's fixed threshold.
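The two derivative formulas can be checked by finite differences away from the kinks $z = \pm\theta$:

```python
import numpy as np

# Finite-difference check of dS/dz = 1[|z| > theta] and
# dS/dtheta = -sign(z) * 1[|z| > theta], at points off the kinks.
def soft(z, theta):
    return np.sign(z) * max(abs(z) - theta, 0.0)

theta, eps = 0.5, 1e-6
for z in (-2.0, -0.2, 0.3, 1.5):          # two active, two dead-zone points
    dz_num = (soft(z + eps, theta) - soft(z - eps, theta)) / (2 * eps)
    dth_num = (soft(z, theta + eps) - soft(z, theta - eps)) / (2 * eps)
    dz_ana = float(abs(z) > theta)
    dth_ana = -np.sign(z) * float(abs(z) > theta)
    assert abs(dz_num - dz_ana) < 1e-6 and abs(dth_num - dth_ana) < 1e-6
print("gradients verified")
```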
ex-ch31-10
Medium: An autoencoder for $M$ symbols over AWGN using $n$ channel uses has encoder output (after power normalisation):
$$\mathbf{x} = \sqrt{n}\,\frac{\tilde{\mathbf{x}}}{\|\tilde{\mathbf{x}}\|},$$
where $\tilde{\mathbf{x}} = f_{\text{enc}}(\mathbf{s})$ is the raw encoder output and $\mathbf{s}$ is the one-hot encoding.
(a) Show that the power constraint $\|\mathbf{x}\|^2 = n$ is automatically satisfied by this normalisation.
(b) The decoder receives $\mathbf{y} = \mathbf{x} + \mathbf{n}$ and outputs $\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{l})$. Write the cross-entropy loss for a batch of $B$ samples.
(c) Compute $\partial\mathcal{L}/\partial\mathbf{l}$ for a single sample, where $\mathbf{l}$ is the logit vector.
For (a), substitute the normalisation formula and simplify.
The gradient of cross-entropy w.r.t. logits has the elegant form $\mathbf{p} - \mathbf{e}_m$, where $\mathbf{e}_m$ is one-hot.
Power constraint verification
Let $\tilde{\mathbf{x}} = f_{\text{enc}}(\mathbf{s})$. Then:
$$\|\mathbf{x}\|^2 = \left\|\sqrt{n}\,\frac{\tilde{\mathbf{x}}}{\|\tilde{\mathbf{x}}\|}\right\|^2 = n\,\frac{\|\tilde{\mathbf{x}}\|^2}{\|\tilde{\mathbf{x}}\|^2} = n.$$
Cross-entropy loss
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\log\big[\mathrm{softmax}(\mathbf{l}_i)\big]_{m_i},$$
where $m_i$ is the transmitted message index and $\mathbf{l}_i = \mathbf{W}_4\,\sigma(\mathbf{W}_3\mathbf{y}_i + \mathbf{b}_3) + \mathbf{b}_4$ is the logit vector.
Gradient w.r.t. logits
For a single sample with true label $m$:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{l}} = \mathbf{p} - \mathbf{e}_m,$$
where $\mathbf{p} = \mathrm{softmax}(\mathbf{l})$ and $\mathbf{e}_m$ is the one-hot vector with a 1 at position $m$.
Proof sketch: $\mathcal{L} = -l_m + \log\sum_k e^{l_k}$. For $j = m$: $\partial\mathcal{L}/\partial l_m = -1 + p_m$, giving $p_m - 1$. For $j \neq m$: $\partial\mathcal{L}/\partial l_j = p_j$. Combining: $\partial\mathcal{L}/\partial\mathbf{l} = \mathbf{p} - \mathbf{e}_m$.
This elegant form is why softmax + cross-entropy is the standard choice for classification --- the gradient is simply the prediction error.
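The identity $\partial\mathcal{L}/\partial\mathbf{l} = \mathbf{p} - \mathbf{e}_m$ can be verified numerically; a minimal sketch:

```python
import numpy as np

# Check that the cross-entropy gradient w.r.t. the logits is p - e_m.
rng = np.random.default_rng(2)
M, m = 8, 3                                   # 8 classes, true label 3
l = rng.normal(size=M)                        # logits

p = np.exp(l - l.max()); p /= p.sum()         # numerically stable softmax

def ce(l_):
    return -l_[m] + np.log(np.sum(np.exp(l_)))  # -log softmax(l)[m]

e_m = np.zeros(M); e_m[m] = 1.0
grad_ana = p - e_m

eps = 1e-6
grad_num = np.array([
    (ce(l + eps * np.eye(M)[j]) - ce(l - eps * np.eye(M)[j])) / (2 * eps)
    for j in range(M)
])
print(np.allclose(grad_num, grad_ana, atol=1e-5))
```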
ex-ch31-11
Hard (MMSE as optimal NN target). Consider the channel estimation problem where the channel is $\mathbf{h} \sim \mathcal{CN}(\mathbf{0}, \mathbf{C}_h)$ and the observation is $\mathbf{y} = \mathbf{h} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$.
(a) Show that the MMSE estimator is $\hat{\mathbf{h}}_{\text{MMSE}} = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}\mathbf{y}$.
(b) Prove that among all estimators $g(\mathbf{y})$, the MMSE estimator minimises $\mathbb{E}\|\mathbf{h} - g(\mathbf{y})\|^2$.
(c) Show that a linear neural network (no activation function) with $\hat{\mathbf{h}} = \mathbf{W}\mathbf{y}$ trained with MSE loss on infinite data converges to the MMSE weight matrix $\mathbf{W}^* = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}$.
(d) Explain why a non-linear network cannot do better than the linear MMSE in this Gaussian setting.
For (a), use the conditional mean of jointly Gaussian vectors.
For (b), use the law of total expectation and the orthogonality principle.
For (c), minimise $\mathbb{E}\|\mathbf{h} - \mathbf{W}\mathbf{y}\|^2$ over $\mathbf{W}$.
MMSE estimator derivation
Since $\mathbf{h}$ and $\mathbf{y}$ are jointly Gaussian:
$$\hat{\mathbf{h}}_{\text{MMSE}} = \mathbb{E}[\mathbf{h} \mid \mathbf{y}] = \mathbf{C}_{hy}\,\mathbf{C}_y^{-1}\,\mathbf{y},$$
where $\mathbf{C}_{hy} = \mathbf{C}_h$ (since $\mathbf{y} = \mathbf{h} + \mathbf{n}$ and $\mathbf{n}$ is independent of $\mathbf{h}$) and $\mathbf{C}_y = \mathbf{C}_h + \sigma^2\mathbf{I}$.
Therefore: $\hat{\mathbf{h}}_{\text{MMSE}} = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}\mathbf{y}$.
MMSE optimality proof
For any estimator $g(\mathbf{y})$, write $\mathbf{h} - g(\mathbf{y}) = \big(\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}}\big) + \big(\hat{\mathbf{h}}_{\text{MMSE}} - g(\mathbf{y})\big)$.
Expanding:
$$\mathbb{E}\|\mathbf{h} - g(\mathbf{y})\|^2 = \mathbb{E}\|\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}}\|^2 + \mathbb{E}\|\hat{\mathbf{h}}_{\text{MMSE}} - g(\mathbf{y})\|^2 + 2\,\mathrm{Re}\,\mathbb{E}\big[(\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}})^{H}(\hat{\mathbf{h}}_{\text{MMSE}} - g(\mathbf{y}))\big].$$
By the orthogonality principle, the MMSE error is orthogonal to any function of $\mathbf{y}$, so the cross-term vanishes. The second term is non-negative. Therefore:
$$\mathbb{E}\|\mathbf{h} - g(\mathbf{y})\|^2 \geq \mathbb{E}\|\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}}\|^2,$$
with equality iff $g(\mathbf{y}) = \hat{\mathbf{h}}_{\text{MMSE}}$ a.s.
Linear NN convergence
Minimise $J(\mathbf{W}) = \mathbb{E}\|\mathbf{h} - \mathbf{W}\mathbf{y}\|^2$:
$$\frac{\partial J}{\partial\mathbf{W}} = -2\,\mathbb{E}[\mathbf{h}\mathbf{y}^{H}] + 2\,\mathbf{W}\,\mathbb{E}[\mathbf{y}\mathbf{y}^{H}] = -2\,\mathbf{C}_h + 2\,\mathbf{W}(\mathbf{C}_h + \sigma^2\mathbf{I}).$$
Setting to zero: $\mathbf{W}^* = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}$, which is exactly the MMSE weight.
Non-linear network cannot improve
For jointly Gaussian $(\mathbf{h}, \mathbf{y})$, the conditional mean $\mathbb{E}[\mathbf{h} \mid \mathbf{y}]$ is linear in $\mathbf{y}$. This is a fundamental property of Gaussian distributions. Since the MMSE estimator (conditional mean) is already linear, no non-linear function of $\mathbf{y}$ can achieve a lower MSE. A non-linear NN trained with MSE loss will converge to the same linear mapping (plus some noise from finite-sample training).
Non-linear NNs offer an advantage only when the channel distribution is non-Gaussian (e.g., sparse channels, channels with discrete components, or channels corrupted by non-Gaussian interference).
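A Monte-Carlo sketch of part (c): ordinary least squares on sampled $(\mathbf{y}, \mathbf{h})$ pairs, which is the infinite-data limit of a linear network trained with MSE, recovers $\mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}$. Real-valued covariances are used here for simplicity:

```python
import numpy as np

# Least squares on (y, h) pairs approaches W* = C_h (C_h + sigma^2 I)^{-1}
# when h is Gaussian and y = h + n.
rng = np.random.default_rng(3)
d, sigma2, N = 3, 0.5, 200_000
A = rng.normal(size=(d, d)); C_h = A @ A.T / d        # a valid covariance

H = rng.multivariate_normal(np.zeros(d), C_h, size=N)
Y = H + np.sqrt(sigma2) * rng.normal(size=(N, d))

# Linear "network" trained with MSE loss == ordinary least squares h ~ y W
W_ls, *_ = np.linalg.lstsq(Y, H, rcond=None)          # row-vector convention
W_mmse = C_h @ np.linalg.inv(C_h + sigma2 * np.eye(d))

print(np.max(np.abs(W_ls.T - W_mmse)))   # small sampling error
```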
ex-ch31-12
Hard (LISTA convergence analysis). Consider a simplified LISTA with scalar signal ($n = 1$), scalar measurement $y = ax + n$ where $a$ is known, $x$ has a Laplacian prior $p(x) \propto e^{-\lambda|x|}$, and $n \sim \mathcal{N}(0, \sigma^2)$.
(a) Write one ISTA iteration for this 1D problem.
(b) Show that a LISTA layer with learned parameters $(w, s, \theta)$ computes the update $x^{(k+1)} = \mathcal{S}_\theta\big(wy + sx^{(k)}\big)$. What are the ISTA-initialised values of $(w, s, \theta)$?
(c) Consider the fixed point $x^\star = \mathcal{S}_\theta(wy + sx^\star)$. Show that for $x^\star > 0$, the fixed point satisfies $x^\star = \frac{wy - \theta}{1 - s}$ (assuming $s < 1$).
(d) For the optimal Bayesian estimator (MAP), derive the MAP estimate and compare with the LISTA fixed point.
The Lipschitz constant is $L = a^2$ in 1D.
The MAP estimate minimises $\frac{1}{2\sigma^2}(y - ax)^2 + \lambda|x|$.
ISTA iteration in 1D
With $L = a^2$, step size $1/a^2$, and threshold $\lambda/a^2$:
$$x^{(k+1)} = \mathcal{S}_{\lambda/a^2}\!\Big(x^{(k)} + \frac{a}{a^2}\big(y - ax^{(k)}\big)\Big).$$
Simplifying: $x^{(k+1)} = \mathcal{S}_{\lambda/a^2}(y/a)$.
Note: in this 1D case, ISTA converges in one step because the gradient step already reaches the proximal minimiser.
LISTA layer and initialisation
LISTA update: $x^{(k+1)} = \mathcal{S}_\theta\big(wy + sx^{(k)}\big)$.
ISTA-initialised values: $w = a/L = 1/a$, $s = 1 - a^2/L = 0$, $\theta = \lambda/a^2$.
With $s = 0$, the update becomes $x^{(k+1)} = \mathcal{S}_{\lambda/a^2}(y/a)$ regardless of the current iterate, confirming the one-step convergence of 1D ISTA.
Fixed point analysis
At a fixed point $x^\star = \mathcal{S}_\theta(wy + sx^\star)$ with $x^\star > 0$:
$$x^\star = wy + sx^\star - \theta.$$
For $x^\star > 0$ (and assuming $s < 1$):
$$x^\star = \frac{wy - \theta}{1 - s}.$$
With ISTA initialisation ($s = 0$, $w = 1/a$, $\theta = \lambda/a^2$): $x^\star = \dfrac{y}{a} - \dfrac{\lambda}{a^2}$.
MAP estimate comparison
The MAP estimate minimises:
$$J(x) = \frac{1}{2\sigma^2}(y - ax)^2 + \lambda|x|.$$
Taking the subdifferential and setting to zero:
$$-\frac{a}{\sigma^2}(y - ax) + \lambda\,\partial|x| \ni 0.$$
For $x > 0$: $x_{\text{MAP}} = \dfrac{y}{a} - \dfrac{\lambda\sigma^2}{a^2}$.
This is $x_{\text{MAP}} = \mathcal{S}_{\lambda\sigma^2/a^2}(y/a)$, i.e., soft-thresholding with threshold $\lambda\sigma^2/a^2$ rather than $\lambda/a^2$.
Comparison: ISTA uses threshold $\lambda/a^2$; MAP uses $\lambda\sigma^2/a^2$. They coincide when $\sigma^2 = 1$. LISTA can learn the correct threshold from data, thereby matching the MAP estimator, while ISTA uses a fixed (possibly mismatched) threshold.
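Both the one-step convergence and the ISTA/MAP threshold match at $\sigma^2 = 1$ can be checked numerically; a sketch with arbitrary $a$, $\lambda$, $y$:

```python
# 1D check: one ISTA step from any start lands on S_{lam/a^2}(y/a), and this
# equals the MAP estimate when sigma^2 = 1 (threshold lam*sigma^2/a^2).
def soft(z, t):
    return (1 if z > 0 else -1) * max(abs(z) - t, 0.0)

a, lam, y = 2.0, 0.8, 3.0
L = a * a                                    # Lipschitz constant in 1D

for x0 in (-5.0, 0.0, 7.0):                  # arbitrary starting points
    x1 = soft(x0 + (a / L) * (y - a * x0), lam / L)
    assert abs(x1 - soft(y / a, lam / L)) < 1e-12   # one-step convergence

# MAP with sigma^2 = 1: minimiser of (y - a x)^2 / 2 + lam |x|
sigma2 = 1.0
x_map = soft(y / a, lam * sigma2 / (a * a))
print(x_map)  # 1.3 -- equals the ISTA fixed point when sigma^2 = 1
```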
ex-ch31-13
Hard (Convergence of Q-learning). Consider a two-state, two-action MDP with known transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a)$.
Discount factor $\gamma$.
(a) Write the Bellman optimality equations for $Q^*$.
(b) Solve the system of 4 equations to find $Q^*$ exactly.
(c) What is the optimal policy?
(d) Simulate 5 Q-learning updates starting from $Q = 0$ with learning rate $\alpha$, initial state $s_1$, and $\epsilon = 0$ (greedy). Show that the agent may take suboptimal actions initially.
The Bellman equation is $Q^*(s, a) = R(s, a) + \gamma\sum_{s'}P(s' \mid s, a)\max_{a'}Q^*(s', a')$.
Define $V_1 = \max_a Q^*(s_1, a)$ and $V_2 = \max_a Q^*(s_2, a)$ and substitute.
Bellman optimality equations
The four equations are coupled through $V^*(s) = \max_a Q^*(s, a)$:
$$Q^*(s, a) = R(s, a) + \gamma\sum_{s'}P(s' \mid s, a)\,V^*(s'), \qquad V^*(s) = \max_a Q^*(s, a).$$
Solve the system
Let $V_1 = \max_a Q^*(s_1, a)$ and $V_2 = \max_a Q^*(s_2, a)$. Substituting the transition probabilities makes each Bellman equation linear in $(V_1, V_2)$:
$$Q^*(s, a) = R(s, a) + \gamma\big[P(s_1 \mid s, a)\,V_1 + P(s_2 \mid s, a)\,V_2\big].$$
Identifying the maximising action in each state reduces the four equations to two linear equations in $V_1$ and $V_2$, which can be solved directly; back-substitution then yields the full Q-table.
Optimal policy
In both states, the optimal action is the one with the larger immediate reward ($R = 5$ in $s_1$, $R = 8$ in $s_2$). The optimal policy happens to be greedy w.r.t. immediate rewards in this example, but this is coincidental.
Q-learning simulation (initial suboptimality)
Start: $Q = \mathbf{0}$, state $s_1$, greedy ($\epsilon = 0$).
Step 1: All Q-values equal 0; greedy picks $a_1$ (arbitrary tie-breaking). Update: $Q(s_1, a_1) \leftarrow \alpha R(s_1, a_1)$. Next state: $s_1$ (prob 0.8).
Step 2: Now $Q(s_1, a_1) > 0$, so greedy picks $a_1$ again and updates $Q(s_1, a_1)$. Next: $s_2$ (prob 0.2, suppose this transition occurs).
Step 3: State $s_2$, $Q(s_2, a_1) = Q(s_2, a_2) = 0$, pick $a_1$ (tie). Update $Q(s_2, a_1)$. Next state drawn from $P(\cdot \mid s_2, a_1)$.
Note: in step 3, the agent picks $a_1$ in $s_2$ (suboptimal --- the optimal action is $a_2$ with $R = 8$). It takes $a_1$ because both Q-values were 0 and the tie was broken in favour of $a_1$. Without exploration ($\epsilon = 0$), the agent may lock into $a_1$ for $s_2$ for many steps before the Q-value of $a_2$ gets updated. This illustrates the necessity of exploration.
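A sketch of the greedy lock-in effect on a toy MDP with assumed transitions and rewards (the exercise's exact matrices are not reproduced here):

```python
import numpy as np

# Greedy (eps = 0) Q-learning on a toy two-state, two-action MDP with
# *hypothetical* transitions and rewards; illustrates tie-breaking lock-in.
rng = np.random.default_rng(5)
P = np.array([[[0.8, 0.2], [0.2, 0.8]],       # P[s, a, s']  (assumed)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[5.0, 0.0], [0.0, 8.0]])        # R[s, a]       (assumed)
gamma, alpha = 0.9, 0.5

Q = np.zeros((2, 2))
s = 0
for step in range(5):
    a = int(np.argmax(Q[s]))                  # greedy; ties break toward a_0
    s_next = rng.choice(2, p=P[s, a])
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
print(Q)   # only the tie-broken actions have been updated so far
```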
ex-ch31-14
Hard (FedAvg heterogeneity bias). Consider $C = 2$ clients with quadratic objectives:
$$F_1(w) = \tfrac{1}{2}(w - 1)^2, \qquad F_2(w) = \tfrac{1}{2}(w + 1)^2.$$
The global objective is $F(w) = \tfrac{1}{2}\big(F_1(w) + F_2(w)\big)$.
(a) Find the global optimum $w^*$ and the local optima $w_1^*$, $w_2^*$.
(b) Run one round of FedAvg: starting from $w^t = 0$, each client performs $E$ gradient descent steps with learning rate $\eta$. Derive the aggregated model $\bar{w}$ as a function of $E$ and $\eta$.
(c) Show that $\bar{w} = w^*$ regardless of $E$ and $\eta$. Explain why the heterogeneity bias is zero in this symmetric example.
(d) Now consider asymmetric objectives: $F_1(w) = \tfrac{1}{2}(w - 1)^2$, $F_2(w) = (w + 1)^2$ (i.e., $F_2$ has curvature 2 instead of 1). Find $w^*$ and show that FedAvg with $E > 1$ local steps does not converge to $w^*$.
The gradient of $F_c$ is $F_c'(w) = a_c(w - w_c^*)$.
For (c), exploit the symmetry of the two objectives about $w = 0$.
Optima
Local optima: $w_1^* = 1$, $w_2^* = -1$.
Global: $w^* = \tfrac{1}{2}(w_1^* + w_2^*) = 0$.
FedAvg round
Client 1: $F_1'(w) = w - 1$. Starting from $w^t = 0$:
- After 1 step: $w = \eta$
- After 2 steps: $w = \eta + \eta(1 - \eta) = 1 - (1 - \eta)^2$
- After $E$ steps: $w_1 = 1 - (1 - \eta)^E$ (geometric series for gradient descent on a quadratic).
Client 2: $F_2'(w) = w + 1$. Starting from $w^t = 0$:
- After $E$ steps: $w_2 = -\big(1 - (1 - \eta)^E\big)$
Aggregation and zero bias
$$\bar{w} = \tfrac{1}{2}(w_1 + w_2) = \tfrac{1}{2}\Big[\big(1 - (1 - \eta)^E\big) - \big(1 - (1 - \eta)^E\big)\Big] = 0$$
for every $E$ and $\eta$, because the problem is symmetric ($a_1 = a_2 = 1$ and $w_1^* = -w_2^* = 1$). The client drifts exactly cancel upon averaging.
Asymmetric case: non-zero bias
With $F_2(w) = (w + 1)^2$: $F_2'(w) = 2(w + 1)$.
Global optimum: $F'(w) = \tfrac{1}{2}\big[(w - 1) + 2(w + 1)\big] = \tfrac{1}{2}(3w + 1)$. Setting $F'(w) = 0$: $w^* = -\tfrac{1}{3}$.
Client 1 after $E$ steps from 0: $w_1 = 1 - (1 - \eta)^E$.
Client 2 after $E$ steps from 0 (with step size $\eta$ and gradient $2(w + 1)$): $w_2 = -1 + (1 - 2\eta)^E$.
Aggregated: $\bar{w} = \tfrac{1}{2}\big[(1 - 2\eta)^E - (1 - \eta)^E\big]$.
For one step ($E = 1$): $\bar{w} = \tfrac{1}{2}\big[(1 - 2\eta) - (1 - \eta)\big] = -\tfrac{\eta}{2}$. For $\eta$ small, this approaches 0, not $-\tfrac{1}{3}$.
For $E \to \infty$ (with $\eta < \tfrac{1}{2}$): $\bar{w} \to \tfrac{1}{2}\big[1 + (-1)\big] = 0$.
The asymmetric curvatures cause the client with larger curvature (client 2) to drift faster toward its local optimum, and the averaging does not correct this imbalance. This is the mechanism behind the heterogeneity bias in the FedAvg convergence theorem.
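Iterating FedAvg rounds (rather than the single round analysed above) makes the bias visible as a fixed point; a sketch:

```python
# Iterate FedAvg rounds on the asymmetric pair F1 = (w-1)^2/2, F2 = (w+1)^2.
# With E = 1 the rounds reduce to centralised GD and reach w* = -1/3; with
# E > 1 local steps the fixed point is biased toward 0.
def fedavg_fixed_point(E, eta, rounds=2000):
    w = 0.0
    for _ in range(rounds):
        w1 = w
        for _ in range(E):
            w1 -= eta * (w1 - 1.0)        # client 1 gradient: w - 1
        w2 = w
        for _ in range(E):
            w2 -= eta * 2.0 * (w2 + 1.0)  # client 2 gradient: 2(w + 1)
        w = 0.5 * (w1 + w2)               # server averaging
    return w

print(fedavg_fixed_point(E=1, eta=0.1))   # close to -1/3
print(fedavg_fixed_point(E=20, eta=0.1))  # noticeably biased toward 0
```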
ex-ch31-15
Hard (Autoencoder capacity and channel coding). An autoencoder maps $M = 2^k$ messages to $n$ channel uses over AWGN, achieving a block error rate (BLER) $P_e$.
(a) The communication rate is $R = k/n$ bits per channel use. For $n = 7$, $k = 4$, compute $R$ and compare with the AWGN capacity $C$ at the operating SNR.
(b) The sphere-packing (Shannon) lower bound on the BLER of $M$ codewords in $\mathbb{R}^{2n}$ under AWGN limits how reliably short blocks can be decoded. Explain qualitatively why the autoencoder's BLER approaches this bound as the network capacity increases.
(c) A key limitation of autoencoder-based "codes" is that they are fixed-length, fixed-$M$ systems. Explain why scaling to $M = 2^{100}$ (typical for modern codes with 100-bit messages) is fundamentally challenging for the autoencoder approach.
AWGN capacity: $C = \log_2(1 + \text{SNR})$ bits/use.
The one-hot representation of messages requires an $M$-dimensional input, which grows exponentially in the number of bits.
Rate calculation
$$R = \frac{k}{n} = \frac{4}{7} \approx 0.571 \text{ bits per channel use},$$
giving $R/C \approx 0.44$ at the operating SNR: the code runs well below capacity, the price of a very short block. Closing the gap requires larger $M$ and $n$, but the autoencoder architecture limits how far either can go.
Approaching the sphere-packing bound
The encoder maps messages to points in (the 2D constellation space over channel uses). Under the power constraint, these points lie on or near a sphere. The optimal arrangement maximises the minimum distance between points, which corresponds to packing non-overlapping "decoding cones" on the sphere.
The sphere-packing bound computes the maximum fraction of the sphere that can be covered by such cones. A sufficiently expressive autoencoder, by the universal approximation theorem, can learn the optimal point placement (or a close approximation), thereby approaching the sphere-packing bound. The decoder learns the optimal decision regions, which are the Voronoi cells of the constellation.
In practice, autoencoders have been shown to match or slightly exceed the performance of known short block codes for small $M$ and $n$.
Scaling challenge
For $M = 2^{100}$ messages:
- Input representation: the one-hot vector has dimension $2^{100} \approx 1.3 \times 10^{30}$ --- astronomically large and impossible to represent in any computer.
- Output layer: the decoder softmax has $2^{100}$ outputs, requiring a correspondingly enormous number of weights in the last layer alone.
- Training: each mini-batch can only cover a vanishing fraction of the $2^{100}$ messages, making training infeasible.
This is the fundamental exponential scaling barrier of the autoencoder approach. Classical channel codes circumvent this by using structured encoders (linear codes, convolutional codes, LDPC, Polar) whose encoding complexity scales polynomially in the block length.
Proposed solutions include: (a) Turbo Autoencoder (using interleaved convolutional structure), (b) bit-level autoencoders that process one bit at a time, and (c) neural network decoders for existing structured codes (keeping the encoder classical).
ex-ch31-16
Hard (Over-the-air federated aggregation). In over-the-air computation (AirComp), $C$ clients simultaneously transmit their model updates over a MAC channel. The server receives:
$$\mathbf{y} = \sum_{c=1}^{C} h_c\,\mathbf{x}_c + \mathbf{n},$$
where $h_c$ is the channel coefficient from client $c$, $\mathbf{x}_c$ is the transmitted signal, and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$.
(a) To achieve the desired aggregation $\overline{\Delta\mathbf{w}} = \frac{1}{C}\sum_c \Delta\mathbf{w}_c$, each client must pre-equalise: $\mathbf{x}_c = \Delta\mathbf{w}_c / h_c$. Show that the server estimate $\mathbf{y}/C$ is then $\overline{\Delta\mathbf{w}} + \mathbf{n}/C$.
(b) The pre-equalisation requires transmit power $\|\mathbf{x}_c\|^2 = \|\Delta\mathbf{w}_c\|^2 / |h_c|^2$. If client $c$ has a power constraint $P_c$, show that the maximum model update norm is $\|\Delta\mathbf{w}_c\| \leq |h_c|\sqrt{P_c}$.
(c) To handle power-limited clients, a common approach is to truncate (clip) the update: $\tilde{\Delta}\mathbf{w}_c = \Delta\mathbf{w}_c \cdot \min\big(1, \gamma_c/\|\Delta\mathbf{w}_c\|\big)$, where $\gamma_c = |h_c|\sqrt{P_c}$. Derive the MSE of the aggregation when $C_{\text{clip}}$ out of $C$ clients are clipped.
(d) Compare the communication cost (channel uses) of AirComp vs orthogonal TDMA for $C$ clients with $D$ parameters.
AirComp transmits all clients simultaneously in $D$ channel uses. TDMA requires $CD$ channel uses.
For (c), the clipping introduces a deterministic bias plus the aggregation noise.
AirComp aggregation
With $\mathbf{x}_c = \Delta\mathbf{w}_c / h_c$:
$$\mathbf{y} = \sum_{c=1}^{C} h_c\,\frac{\Delta\mathbf{w}_c}{h_c} + \mathbf{n} = \sum_{c=1}^{C}\Delta\mathbf{w}_c + \mathbf{n}.$$
The server computes:
$$\frac{\mathbf{y}}{C} = \frac{1}{C}\sum_{c=1}^{C}\Delta\mathbf{w}_c + \frac{\mathbf{n}}{C} = \overline{\Delta\mathbf{w}} + \frac{\mathbf{n}}{C}.$$
The aggregation noise $\mathbf{n}/C$ has variance $\sigma^2/C^2$ per dimension, which decreases as $1/C^2$ --- more clients means less noise per dimension due to the averaging effect.
Power constraint
From $\|\mathbf{x}_c\|^2 = \|\Delta\mathbf{w}_c\|^2 / |h_c|^2 \leq P_c$ we get $\|\Delta\mathbf{w}_c\| \leq |h_c|\sqrt{P_c}$. Clients with weak channels ($|h_c|$ small) have a tighter constraint on the update norm. This is the fundamental fairness issue in AirComp: power-limited clients contribute distorted updates.
MSE with clipping
Partition clients into unclipped ($\|\Delta\mathbf{w}_c\| \leq \gamma_c$) and clipped ($\|\Delta\mathbf{w}_c\| > \gamma_c$).
For unclipped clients: $\tilde{\Delta}\mathbf{w}_c = \Delta\mathbf{w}_c$ (no error).
For clipped clients: $\tilde{\Delta}\mathbf{w}_c = \gamma_c\,\Delta\mathbf{w}_c / \|\Delta\mathbf{w}_c\|$ (scaled to norm $\gamma_c$). The clipping error is $\boldsymbol{\epsilon}_c = \tilde{\Delta}\mathbf{w}_c - \Delta\mathbf{w}_c$, with $\|\boldsymbol{\epsilon}_c\| = \|\Delta\mathbf{w}_c\| - \gamma_c$.
The aggregated estimate:
$$\frac{\mathbf{y}}{C} = \overline{\Delta\mathbf{w}} + \frac{1}{C}\sum_{c\,\in\,\text{clipped}}\boldsymbol{\epsilon}_c + \frac{\mathbf{n}}{C}.$$
The MSE is:
$$\text{MSE} = \Big\|\frac{1}{C}\sum_{c\,\in\,\text{clipped}}\boldsymbol{\epsilon}_c\Big\|^2 + \frac{D\sigma^2}{C^2}.$$
The first term is the clipping bias (deterministic, grows with the number of clipped clients and the severity of clipping) and the second is the aggregation noise (stochastic, decreases with ).
Communication cost comparison
AirComp: all clients transmit simultaneously; each dimension of $\Delta\mathbf{w}$ occupies one channel use. Total: $D$ channel uses.
Orthogonal TDMA: each client transmits in a separate time slot. Total: $CD$ channel uses.
Reduction: $CD/D = C$, equal to the number of clients.
AirComp achieves a $C\times$ speedup by exploiting the superposition property of the wireless MAC channel. The trade-off is the aggregation noise ($\sigma^2/C^2$ per dimension) and the need for channel pre-equalisation, which requires CSI at the clients and may be power-limited.
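A simulation sketch of part (a) with synthetic channels and updates confirms the $\sigma^2/C^2$ noise scaling:

```python
import numpy as np

# AirComp sketch: pre-equalised simultaneous transmission yields the exact
# average plus noise scaled by 1/C. Channels and updates are synthetic.
rng = np.random.default_rng(4)
C, D, sigma = 10, 1000, 0.1
h = rng.normal(size=C) + 1j * rng.normal(size=C)      # client channels
dw = rng.normal(size=(C, D))                          # model updates

x = dw / h[:, None]                                   # pre-equalisation
y = (h[:, None] * x).sum(axis=0) \
    + sigma * rng.normal(size=D)                      # MAC superposition
est = y.real / C                                      # server estimate

target = dw.mean(axis=0)
mse = np.mean((est - target) ** 2)
print(mse, sigma**2 / C**2)    # empirical MSE close to sigma^2 / C^2
```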