Exercises
ex-ch31-01
Easy: A two-layer fully connected neural network is used for channel estimation in an OFDM system with $N$ subcarriers and $P$ pilots. The hidden layer has $M$ units. The input is the vector of pilot observations (dimension $d_{\text{in}}$) and the output is the full channel estimate (dimension $d_{\text{out}}$).
(a) Write the dimensions of the weight matrices $\mathbf{W}_1$, $\mathbf{W}_2$ and bias vectors $\mathbf{b}_1$, $\mathbf{b}_2$.
(b) Compute the total number of trainable parameters.
(c) If the training dataset has 300 channel realisations, is this likely sufficient? Justify using the rule-of-thumb ratio.
The input dimension is $d_{\text{in}}$ and the output dimension is $d_{\text{out}}$.
Count parameters as: $N_{\text{params}} = M(d_{\text{in}} + 1) + d_{\text{out}}(M + 1)$.
Weight matrix dimensions
$$\mathbf{W}_1 \in \mathbb{R}^{M \times d_{\text{in}}}, \quad \mathbf{b}_1 \in \mathbb{R}^{M}, \quad \mathbf{W}_2 \in \mathbb{R}^{d_{\text{out}} \times M}, \quad \mathbf{b}_2 \in \mathbb{R}^{d_{\text{out}}}$$
Parameter count
$$N_{\text{params}} = \underbrace{M d_{\text{in}} + M}_{\text{layer 1}} + \underbrace{d_{\text{out}} M + d_{\text{out}}}_{\text{layer 2}} = M(d_{\text{in}} + 1) + d_{\text{out}}(M + 1)$$
Data sufficiency assessment
The rule of thumb requires roughly 5--10 training samples per trainable parameter for good generalisation. Here the network has $N_{\text{params}}$ parameters, but we only have 300 samples, so the ratio $300/N_{\text{params}}$ is far below the recommended 5--10.
The 300 samples are grossly insufficient for this network. Options: (1) reduce the hidden width $M$ to 8 (giving 2600 parameters), (2) use a model-based approach (deep unfolding) with far fewer parameters, or (3) augment the data by training at multiple SNR values.
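The counting rule above is mechanical; a quick sketch with made-up layer sizes (the exercise fixes its own $N$, $P$, $M$, so the dimensions below are purely illustrative) shows how the rule-of-thumb check plays out:

```python
# Parameter count of a two-layer fully connected network.
# Dimensions here are hypothetical placeholders, not the exercise's values.
def count_params(d_in, hidden, d_out):
    """W1: hidden x d_in, b1: hidden, W2: d_out x hidden, b2: d_out."""
    return hidden * (d_in + 1) + d_out * (hidden + 1)

# Rule-of-thumb check: want samples >= 5-10x parameters.
n_params = count_params(d_in=64, hidden=128, d_out=256)  # assumed sizes
ratio = 300 / n_params                                   # 300 training samples
print(n_params, round(ratio, 5))
```

With these assumed sizes the network has tens of thousands of parameters, so 300 samples give a ratio of well under 0.01 samples per parameter.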
ex-ch31-02
Easy: An end-to-end autoencoder for 8-ary signalling over one complex channel use (2D constellation) is trained under AWGN at a fixed training SNR. After training, the encoder produces the following constellation points (in I-Q coordinates):
| Symbol | I | Q |
|---|---|---|
| 0 | 1.05 | 0.02 |
| 1 | 0.51 | 0.88 |
| 2 | 0.89 | |
| 3 | 0.03 | |
| 4 | ||
| 5 | 0.50 | |
| 6 | 0.01 | 0.01 |
| 7 | 0.00 | |
(a) Compute the average symbol energy.
(b) Does this constellation resemble any classical modulation scheme? Explain any differences.
Average symbol energy: $E_s = \frac{1}{8}\sum_{m=0}^{7}\left(I_m^2 + Q_m^2\right)$.
Compare the point positions with 8-PSK (all on a unit circle) and think about which points are anomalous.
Average symbol energy
Computing $E_m = I_m^2 + Q_m^2$ for each symbol (entries whose coordinates are incomplete in the table above are left blank):
| Symbol | $E_m$ |
|---|---|
| 0 | 1.10 |
| 1 | 1.03 |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | 0.00 |
| 7 | |
Comparison with classical schemes
Symbols 0--5 lie approximately on a circle of radius $\approx 1$ at angles spaced by $60^\circ$ --- this is essentially 6-PSK (a regular hexagonal arrangement). Symbols 6 and 7, however, are clustered near the origin.
This is a (6,2) constellation: 6 points on a circle plus 2 near the centre. Classical 8-PSK places all 8 points on a circle with angular spacing $45^\circ$. The autoencoder has discovered an alternative arrangement that sacrifices the energy of symbols 6 and 7 (placing them at the origin where they are easily confused with each other under noise) to increase the minimum distance among the remaining 6 symbols.
Whether this is truly optimal depends on the prior probability of each symbol. If all symbols are equiprobable, the near-origin symbols are a weakness. This illustrates that small-scale autoencoder training can sometimes converge to suboptimal local minima.
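For reference, the average energy of the idealised (6,2) layout described above (six unit-circle points at $60^\circ$ spacing plus two points at the origin) can be computed directly. This is a sketch with an assumed ideal constellation, not the learned points themselves:

```python
import numpy as np

# Idealised (6,2) constellation: 6 points on the unit circle, 2 at the origin.
angles = np.deg2rad(np.arange(0, 360, 60))
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 6 unit-energy points
origin = np.zeros((2, 2))                                    # 2 zero-energy points
const = np.vstack([circle, origin])                          # 8 symbols in I-Q

E_avg = np.mean(np.sum(const**2, axis=1))
print(E_avg)  # 6/8 = 0.75 for this idealised layout
```

The learned constellation's energy will be close to this only to the extent that its six outer points really sit on a unit circle.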
ex-ch31-03
Easy: In a Q-learning power control problem with $K = 8$ users and $L = 4$ discrete power levels per user:
(a) How many entries does the joint Q-table have if the state space has $|\mathcal{S}| = 27$ states (3 quantisation levels per user)?
(b) If an alternative per-user decomposition is used (each user has an independent Q-table of size $27 \times 4$), how many total entries are needed?
(c) What is the reduction factor?
Joint action space: $|\mathcal{A}| = L^K = 4^8 = 65{,}536$.
Per-user approach: $K = 8$ independent tables of size $27 \times 4$.
Joint Q-table size
$$|\mathcal{S}| \times |\mathcal{A}| = 27 \times 4^8 = 27 \times 65{,}536 = 1{,}769{,}472 \text{ entries}$$
Per-user Q-table size
Each user has a table of size $27 \times 4 = 108$. Total for $K = 8$ users: $8 \times 108 = 864$ entries.
Reduction factor
$$\frac{27 \times 4^8}{8 \times 27 \times 4} = \frac{1{,}769{,}472}{864} = 2048$$
A $2048\times$ reduction. However, the per-user decomposition ignores inter-user coupling and may converge to a suboptimal (Nash equilibrium) rather than the globally optimal allocation.
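The table-size arithmetic can be checked in a few lines, using the exercise's counts:

```python
# Table sizes from the exercise: |S| = 27 states, K = 8 users, L = 4 power levels.
S, K, L = 27, 8, 4

joint_entries = S * L**K          # one Q-value per (state, joint action) pair
per_user_entries = K * S * L      # K independent tables of size |S| x L

print(joint_entries, per_user_entries, joint_entries // per_user_entries)
# 1769472 864 2048
```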
ex-ch31-04
Easy: In federated learning with $C$ clients, each client has a model with $D$ parameters (32-bit floats).
(a) Compute the upload cost per client per round (in KB).
(b) If FedAvg runs for $T$ rounds with full client participation, compute the total communication cost (in MB).
(c) If only $C_s$ clients are randomly selected per round, what is the new total cost?
32-bit float = 4 bytes.
Total cost = rounds $\times$ participating clients per round $\times$ cost per client.
Per-client upload cost
$$\text{cost per client per round} = 4D \text{ bytes} = \frac{4D}{1024} \text{ KB}$$
Full participation total cost
$$\text{total upload} = T \times C \times 4D \text{ bytes}$$
(Counting the download of the updated global model as well, the exercise's values give $\sim$31 MB of total bidirectional communication.)
Partial participation cost
$$\text{total upload} = T \times C_s \times 4D \text{ bytes},$$
a $C/C_s \approx 3.3\times$ reduction. Partial participation also reduces the per-round latency (fewer clients to wait for), but may slow convergence because less data is represented per round.
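A quick sketch with placeholder values (the exercise's $C$, $D$, $T$ are fixed in its statement; the numbers below are illustrative only):

```python
# Communication cost of FedAvg uploads, 32-bit floats (4 bytes per parameter).
def upload_kb(D):
    return 4 * D / 1024                       # per client, per round, in KB

def total_mb(T, clients, D):
    return T * clients * 4 * D / (1024**2)    # upload-only total, in MB

D, T, C, C_sel = 100_000, 50, 10, 3           # hypothetical values
print(upload_kb(D))                            # ~390.6 KB per client per round
print(total_mb(T, C, D), total_mb(T, C_sel, D))
```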
ex-ch31-05
Medium: A neural-network channel estimator uses the MSE loss $\mathcal{L} = \|\hat{\mathbf{h}} - \mathbf{h}\|^2$. The network has a single hidden layer, $\hat{\mathbf{h}} = \mathbf{W}_2\,\sigma(\mathbf{W}_1\mathbf{y} + \mathbf{b}_1) + \mathbf{b}_2$, with ReLU activation $\sigma$.
(a) Derive the gradient $\partial\mathcal{L}/\partial\mathbf{W}_2$.
(b) Derive the gradient $\partial\mathcal{L}/\partial\mathbf{W}_1$, expressing the chain rule through the ReLU.
(c) If all pilot observations are real-valued and $\mathbf{y} \geq \mathbf{0}$ element-wise, explain why the gradient simplifies.
Let $\mathbf{a} = \sigma(\mathbf{W}_1\mathbf{y} + \mathbf{b}_1)$ be the hidden activations.
The ReLU derivative is $\sigma'(z) = \mathbb{1}[z > 0]$ (indicator function).
Gradient w.r.t. $\mathbf{W}_2$
Define the error $\mathbf{e} = \hat{\mathbf{h}} - \mathbf{h}$ and the hidden activation $\mathbf{a} = \sigma(\mathbf{W}_1\mathbf{y} + \mathbf{b}_1)$. Then $\dfrac{\partial\mathcal{L}}{\partial\mathbf{W}_2} = 2\,\mathbf{e}\,\mathbf{a}^{\top}$.
In matrix form, for a batch of $N$ samples with the loss averaged over the batch: $\dfrac{\partial\mathcal{L}}{\partial\mathbf{W}_2} = \dfrac{2}{N}\,\mathbf{E}^{\top}\mathbf{A}$, where $\mathbf{E}$ and $\mathbf{A}$ are the error and activation matrices with samples as rows.
Gradient w.r.t. $\mathbf{W}_1$ (backpropagation through ReLU)
By the chain rule:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}_1} = 2\,\big(\mathrm{diag}(\mathbb{1}[\mathbf{z} > 0])\,\mathbf{W}_2^{\top}\mathbf{e}\big)\,\mathbf{y}^{\top},$$
where $\mathbf{z} = \mathbf{W}_1\mathbf{y} + \mathbf{b}_1$ is the pre-activation. The diagonal matrix zeroes out gradients for hidden units whose pre-activation was negative (inactive ReLU units).
Simplification for non-negative inputs
If $\mathbf{y} \geq \mathbf{0}$ element-wise and $\mathbf{W}_1$, $\mathbf{b}_1$ have non-negative entries, then $\mathbf{z} = \mathbf{W}_1\mathbf{y} + \mathbf{b}_1 \geq \mathbf{0}$. In this case, $\mathbb{1}[\mathbf{z} > 0] = \mathbf{1}$ (all ones) and the ReLU becomes the identity:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}_1} = 2\,\mathbf{W}_2^{\top}\mathbf{e}\,\mathbf{y}^{\top}.$$
This simplification only holds at initialisation with non-negative weights; during training, some weights become negative and dead ReLU units appear.
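The derived gradients can be validated against a finite-difference check; a minimal sketch with small, arbitrary dimensions:

```python
import numpy as np

# Finite-difference check of the backprop formula for W1 (single sample,
# squared-error loss, one hidden ReLU layer). Dimensions are arbitrary.
rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 5, 3
W1 = rng.normal(size=(d_hid, d_in)); b1 = rng.normal(size=d_hid)
W2 = rng.normal(size=(d_out, d_hid)); b2 = rng.normal(size=d_out)
y = rng.normal(size=d_in); h = rng.normal(size=d_out)

def loss(W1_):
    a = np.maximum(W1_ @ y + b1, 0.0)        # ReLU hidden activations
    e = (W2 @ a + b2) - h                    # error vector e = h_hat - h
    return e @ e

# Analytic gradient: 2 * diag(1[z > 0]) W2^T e y^T
z = W1 @ y + b1
e = (W2 @ np.maximum(z, 0.0) + b2) - h
grad_W1 = 2.0 * np.outer((z > 0) * (W2.T @ e), y)

# Central-difference check of one entry
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
num = (loss(Wp) - loss(Wm)) / (2 * eps)
print(abs(num - grad_W1[0, 0]))   # should be tiny
```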
ex-ch31-06
Medium: Consider a LISTA network with $T = 10$ layers applied to the sparse recovery problem $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n}$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x}$ is sparse.
(a) If only the per-layer thresholds $\theta_t$ are learned (with $\mathbf{W}_t$ and $\mathbf{S}_t$ fixed at the ISTA values), how many trainable parameters does the network have?
(b) If all matrices and thresholds are learned, how many parameters?
(c) ISTA achieves the target NMSE after $K$ iterations. Based on the linear convergence theorem for LISTA, estimate the number of LISTA layers needed to achieve the same NMSE, assuming a per-layer contraction factor $\rho = 0.5$.
For (c), use $\text{NMSE} \approx c\,\rho^{T}$ and solve $\rho^{T} = 10^{-3}$.
Assume $c = 1$ for a rough estimate (the constant absorbs the initial error).
Thresholds only
With only the per-layer scalar thresholds $\theta_1, \dots, \theta_T$ trainable:
$$N_{\text{params}} = T = 10.$$
For $T = 10$ layers: only 10 trainable parameters. This is remarkably few and explains the excellent sample efficiency of model-based deep unfolding.
All parameters learned
Per layer:
- $\mathbf{W}_t \in \mathbb{R}^{n \times m}$: $nm$ parameters
- $\mathbf{S}_t \in \mathbb{R}^{n \times n}$: $n^2$ parameters
- $\theta_t$: 1 parameter
Per layer: $nm + n^2 + 1$ parameters. Total for $T$ layers: $T(nm + n^2 + 1)$.
For $T = 10$: $10(nm + n^2 + 1)$ parameters --- still far fewer than a general black-box network of comparable depth would need.
Estimating LISTA layers from convergence rate
Target NMSE: $-30$ dB $= 10^{-3}$.
Using $\rho^{T} = 10^{-3}$ with $c = 1$ and $\rho = 0.5$:
$$T = \frac{\ln 10^{-3}}{\ln 0.5} = \frac{-6.91}{-0.693} \approx 10.$$
So $T \approx 10$ LISTA layers suffice, compared to the $K$ ISTA iterations --- a $K/10$ reduction. With optimised (trained) thresholds and matrices, the effective $\rho$ can be smaller, potentially allowing even fewer layers.
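The layer-count estimate is a one-liner; a sketch using the illustrative values above ($c = 1$, contraction factor $\rho = 0.5$, target NMSE $10^{-3}$, all of which are assumptions of this estimate):

```python
import math

# Layers needed for a target NMSE under linear convergence NMSE ~ c * rho**T,
# taking c = 1 (the constant absorbs the initial error).
def lista_layers(target_nmse, rho):
    return math.ceil(math.log(target_nmse) / math.log(rho))

print(lista_layers(1e-3, 0.5))   # 10 layers for -30 dB at rho = 0.5
```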
ex-ch31-07
Medium: A Q-learning agent controls the power of a single transmitter communicating with one receiver. The state is the quantised channel gain $g \in \{g_{\text{low}}, g_{\text{med}}, g_{\text{high}}\}$ (3 states). The action is the power level $P \in \{P_1, P_2, P_3\}$ (3 actions). The discount factor is $\gamma = 0$ (myopic, no future reward consideration).
The reward is the rate $r = \log_2(1 + Pg/\sigma^2)$ with noise power $\sigma^2$. For the three states, the typical channel gains are $g_{\text{low}} < g_{\text{med}} < g_{\text{high}}$.
(a) Compute the optimal Q-table exactly (9 entries).
(b) What is the optimal policy?
(c) If the agent has explored each state-action pair exactly once with learning rate $\alpha = 1$, what does the Q-table look like? Is it already optimal?
With $\gamma = 0$, $Q^*(s, a) = r(s, a)$.
Since the channel gain is deterministic given the state (typical value), the Q-value is just the rate.
Optimal Q-table
With $\gamma = 0$, $Q^*(s, a) = \log_2\!\big(1 + a\,g_s/\sigma^2\big)$:
| State | $P_1$ | $P_2$ | $P_3$ |
|---|---|---|---|
| low ($g_{\text{low}}$) | $\log_2(1 + P_1 g_{\text{low}}/\sigma^2)$ | $\log_2(1 + P_2 g_{\text{low}}/\sigma^2)$ | $\log_2(1 + P_3 g_{\text{low}}/\sigma^2)$ |
| med ($g_{\text{med}}$) | $\log_2(1 + P_1 g_{\text{med}}/\sigma^2)$ | $\log_2(1 + P_2 g_{\text{med}}/\sigma^2)$ | $\log_2(1 + P_3 g_{\text{med}}/\sigma^2)$ |
| high ($g_{\text{high}}$) | $\log_2(1 + P_1 g_{\text{high}}/\sigma^2)$ | $\log_2(1 + P_2 g_{\text{high}}/\sigma^2)$ | $\log_2(1 + P_3 g_{\text{high}}/\sigma^2)$ |
Optimal policy
For every state, the maximum Q-value is at the largest power $P_3$, because the rate is monotonically increasing in $P$.
This is unsurprising for a single-user system without a power constraint: more power always means more rate. In a multi-user setting with interference, the optimal policy would be state-dependent.
Q-table after one visit per pair
With $\alpha = 1$ and $\gamma = 0$, the update $Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\,r$ directly sets $Q(s, a) = r(s, a)$ after one visit.
Therefore, the Q-table is already optimal after one visit per pair, because the reward is deterministic (given the quantised state) and $\gamma = 0$ eliminates the need to estimate future returns. This is a special case; with stochastic rewards or $\gamma > 0$, convergence requires many visits.
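A sketch of the one-visit argument, using hypothetical channel gains and power levels (the exercise's actual values are not reproduced here):

```python
import numpy as np

# With alpha = 1 and gamma = 0, one visit per (state, action) pair fills the
# Q-table with the deterministic rewards. Gains and powers are placeholders.
gains = {"low": 0.1, "med": 1.0, "high": 10.0}   # assumed typical gains
powers = [0.1, 1.0, 10.0]                         # assumed power levels
sigma2 = 1.0

Q = {}
for s, g in gains.items():
    for a in powers:
        r = np.log2(1.0 + a * g / sigma2)         # reward = achievable rate
        Q[(s, a)] = r                             # alpha = 1, gamma = 0 update

policy = {s: max(powers, key=lambda a: Q[(s, a)]) for s in gains}
print(policy)   # maximum power is greedy-optimal in every state
```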
ex-ch31-08
Medium: In FedAvg with $C$ clients, the global model is $\mathbf{w}^{t+1} = \frac{1}{C}\sum_{c=1}^{C}\mathbf{w}_c$. Each client performs $E$ local SGD steps with learning rate $\eta$. The local gradients at client $c$ have mean $\mathbf{g}_c$ and the global gradient is $\mathbf{g} = \frac{1}{C}\sum_c \mathbf{g}_c$.
(a) Write the local model after $E$ steps (assuming constant gradient, i.e., the quadratic approximation): $\mathbf{w}_c = \mathbf{w}^t - \eta E\,\mathbf{g}_c$.
(b) Write the aggregated model $\mathbf{w}^{t+1}$.
(c) Compare with a single centralised SGD step using the global gradient $\mathbf{g}$. When are they equivalent?
Substitute the expression from (a) into the averaging formula.
The centralised update would be $\mathbf{w}^{t+1} = \mathbf{w}^t - \tilde{\eta}\,\mathbf{g}$.
Local model after $E$ steps
Under the constant-gradient approximation:
$$\mathbf{w}_c = \mathbf{w}^t - \eta E\,\mathbf{g}_c.$$
Aggregated model
$$\mathbf{w}^{t+1} = \frac{1}{C}\sum_{c=1}^{C}\mathbf{w}_c = \mathbf{w}^t - \eta E\,\frac{1}{C}\sum_{c=1}^{C}\mathbf{g}_c = \mathbf{w}^t - \eta E\,\mathbf{g}.$$
Comparison with centralised SGD
Centralised SGD: $\mathbf{w}^{t+1} = \mathbf{w}^t - \tilde{\eta}\,\mathbf{g}$.
FedAvg: $\mathbf{w}^{t+1} = \mathbf{w}^t - \eta E\,\mathbf{g}$.
These are identical when $\tilde{\eta} = \eta E$, i.e., the effective federated learning rate is $E$ times the local learning rate.
Crucially, this equivalence holds only under the constant-gradient assumption (i.e., the objective is quadratic and the gradient is evaluated once). For non-linear objectives, the gradient changes during local SGD, and each client drifts toward its own local optimum. The resulting "client drift" creates the heterogeneity bias in the FedAvg convergence bound. This is why the bias term in Theorem 31.4 grows with $E$.
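The equivalence in the constant-gradient case is easy to verify numerically; a sketch with arbitrary synthetic gradients:

```python
import numpy as np

# For constant per-client gradients g_c, E local steps followed by averaging
# equals one centralised step with effective rate eta * E. Values illustrative.
rng = np.random.default_rng(1)
C, E, eta, d = 5, 4, 0.01, 3
g = rng.normal(size=(C, d))                 # fixed per-client gradients
w0 = rng.normal(size=d)

# FedAvg round: each client takes E steps against its constant gradient
local = np.stack([w0 - eta * E * g[c] for c in range(C)])
w_fedavg = local.mean(axis=0)

# Centralised step with the global gradient and eta_tilde = eta * E
w_central = w0 - eta * E * g.mean(axis=0)

print(np.allclose(w_fedavg, w_central))     # True: the updates coincide
```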
ex-ch31-09
Medium: A soft-thresholding operator $\mathcal{S}_\theta(z) = \mathrm{sign}(z)\max(|z| - \theta, 0)$ is used in each layer of a LISTA network. During backpropagation, we need $\partial\mathcal{S}_\theta/\partial z$ and $\partial\mathcal{S}_\theta/\partial\theta$.
(a) Derive $\partial\mathcal{S}_\theta/\partial z$ for $z \neq \pm\theta$.
(b) Derive $\partial\mathcal{S}_\theta/\partial\theta$.
(c) Explain why the gradient w.r.t. $\theta$ provides a learning signal for adapting the sparsity level per layer.
Consider the three regions $z > \theta$, $|z| \leq \theta$, and $z < -\theta$ separately.
In the active region ($|z| > \theta$), $\mathcal{S}_\theta(z) = z - \theta\,\mathrm{sign}(z)$.
Gradient w.r.t. $z$
$$\frac{\partial\mathcal{S}_\theta}{\partial z} = \mathbb{1}\big[|z| > \theta\big].$$
At $z = \pm\theta$, the function is not differentiable, but the subgradient is conventionally taken as 0 (consistent with treating the ReLU-like kink as having zero derivative).
This is identical to the derivative of the "dead zone" function --- values inside the threshold band have zero gradient (they contribute nothing to the output and receive no learning signal), while values outside pass gradients unchanged.
Gradient w.r.t. $\theta$
For $z > \theta$: $\mathcal{S}_\theta(z) = z - \theta$, so:
$$\frac{\partial\mathcal{S}_\theta}{\partial\theta} = -1.$$
For $z < -\theta$: $\mathcal{S}_\theta(z) = z + \theta$, so:
$$\frac{\partial\mathcal{S}_\theta}{\partial\theta} = +1.$$
Summarising:
$$\frac{\partial\mathcal{S}_\theta}{\partial\theta} = -\mathrm{sign}(z)\,\mathbb{1}\big[|z| > \theta\big].$$
Interpretation for sparsity adaptation
The gradient $\partial\mathcal{S}_\theta/\partial\theta = -\mathrm{sign}(z)$ (for active elements) tells us:
- If increasing $\theta$ would reduce the output magnitude of an element that should be large (positive contribution to loss), the gradient will push $\theta$ to decrease.
- If an element is spuriously active (non-zero output when the ground truth is zero), the loss gradient through this element will push $\theta$ to increase, killing the spurious entry.
In early LISTA layers, the training objective encourages a large threshold (aggressive sparsification to identify the correct support). In later layers, the objective encourages a small threshold (fine-tune the amplitudes of the identified non-zero entries). This automatic adaptation of sparsity level per layer is the key advantage over ISTA's fixed threshold.
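The two derivative formulas can be checked by finite differences away from the kinks $z = \pm\theta$:

```python
import numpy as np

# Finite-difference check of dS/dz = 1[|z| > theta] and
# dS/dtheta = -sign(z) * 1[|z| > theta], at points off the kinks.
def soft(z, theta):
    return np.sign(z) * max(abs(z) - theta, 0.0)

theta, eps = 0.5, 1e-6
for z in (-2.0, -0.2, 0.3, 1.5):          # two active, two dead-zone points
    dz_num = (soft(z + eps, theta) - soft(z - eps, theta)) / (2 * eps)
    dth_num = (soft(z, theta + eps) - soft(z, theta - eps)) / (2 * eps)
    dz_ana = float(abs(z) > theta)
    dth_ana = -np.sign(z) * float(abs(z) > theta)
    assert abs(dz_num - dz_ana) < 1e-6 and abs(dth_num - dth_ana) < 1e-6
print("gradients verified")
```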
ex-ch31-10
Medium: An autoencoder for $M$ symbols over AWGN using $n$ channel uses has encoder output (after power normalisation):
$$\mathbf{x} = \sqrt{n}\,\frac{\tilde{\mathbf{x}}}{\|\tilde{\mathbf{x}}\|},$$
where $\tilde{\mathbf{x}} = f_{\text{enc}}(\mathbf{s})$ is the raw encoder output and $\mathbf{s}$ is the one-hot encoding.
(a) Show that the power constraint $\|\mathbf{x}\|^2 = n$ is automatically satisfied by this normalisation.
(b) The decoder receives $\mathbf{y} = \mathbf{x} + \mathbf{n}$ and outputs $\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{l})$. Write the cross-entropy loss for a batch of $B$ samples.
(c) Compute $\partial\mathcal{L}/\partial\mathbf{l}$ for a single sample, where $\mathbf{l}$ is the logit vector.
For (a), substitute the normalisation formula and simplify.
The gradient of cross-entropy w.r.t. logits has the elegant form $\mathbf{p} - \mathbf{e}_m$, where $\mathbf{e}_m$ is one-hot.
Power constraint verification
Let $\tilde{\mathbf{x}} = f_{\text{enc}}(\mathbf{s})$. Then:
$$\|\mathbf{x}\|^2 = \left\|\sqrt{n}\,\frac{\tilde{\mathbf{x}}}{\|\tilde{\mathbf{x}}\|}\right\|^2 = n\,\frac{\|\tilde{\mathbf{x}}\|^2}{\|\tilde{\mathbf{x}}\|^2} = n.$$
Cross-entropy loss
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\log\big[\mathrm{softmax}(\mathbf{l}_i)\big]_{m_i},$$
where $m_i$ is the transmitted message index and $\mathbf{l}_i = \mathbf{W}_4\,\sigma(\mathbf{W}_3\mathbf{y}_i + \mathbf{b}_3) + \mathbf{b}_4$ is the logit vector.
Gradient w.r.t. logits
For a single sample with true label $m$:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{l}} = \mathbf{p} - \mathbf{e}_m,$$
where $\mathbf{p} = \mathrm{softmax}(\mathbf{l})$ and $\mathbf{e}_m$ is the one-hot vector with a 1 at position $m$.
Proof sketch: $\mathcal{L} = -l_m + \log\sum_k e^{l_k}$. For $j = m$: $\partial\mathcal{L}/\partial l_m = -1 + p_m$, giving $p_m - 1$. For $j \neq m$: $\partial\mathcal{L}/\partial l_j = p_j$. Combining: $\partial\mathcal{L}/\partial\mathbf{l} = \mathbf{p} - \mathbf{e}_m$.
This elegant form is why softmax + cross-entropy is the standard choice for classification --- the gradient is simply the prediction error.
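The identity $\partial\mathcal{L}/\partial\mathbf{l} = \mathbf{p} - \mathbf{e}_m$ can be verified numerically; a minimal sketch:

```python
import numpy as np

# Check that the cross-entropy gradient w.r.t. the logits is p - e_m.
rng = np.random.default_rng(2)
M, m = 8, 3                                   # 8 classes, true label 3
l = rng.normal(size=M)                        # logits

p = np.exp(l - l.max()); p /= p.sum()         # numerically stable softmax

def ce(l_):
    return -l_[m] + np.log(np.sum(np.exp(l_)))  # -log softmax(l)[m]

e_m = np.zeros(M); e_m[m] = 1.0
grad_ana = p - e_m

eps = 1e-6
grad_num = np.array([
    (ce(l + eps * np.eye(M)[j]) - ce(l - eps * np.eye(M)[j])) / (2 * eps)
    for j in range(M)
])
print(np.allclose(grad_num, grad_ana, atol=1e-5))
```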
ex-ch31-11
Hard (MMSE as optimal NN target). Consider the channel estimation problem where the channel is $\mathbf{h} \sim \mathcal{CN}(\mathbf{0}, \mathbf{C}_h)$ and the observation is $\mathbf{y} = \mathbf{h} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$.
(a) Show that the MMSE estimator is $\hat{\mathbf{h}}_{\text{MMSE}} = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}\mathbf{y}$.
(b) Prove that among all estimators $g(\mathbf{y})$, the MMSE estimator minimises $\mathbb{E}\|\mathbf{h} - g(\mathbf{y})\|^2$.
(c) Show that a linear neural network (no activation function) with $\hat{\mathbf{h}} = \mathbf{W}\mathbf{y}$ trained with MSE loss on infinite data converges to the MMSE weight matrix $\mathbf{W}^* = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}$.
(d) Explain why a non-linear network cannot do better than the linear MMSE in this Gaussian setting.
For (a), use the conditional mean of jointly Gaussian vectors.
For (b), use the law of total expectation and the orthogonality principle.
For (c), minimise $\mathbb{E}\|\mathbf{h} - \mathbf{W}\mathbf{y}\|^2$ over $\mathbf{W}$.
MMSE estimator derivation
Since $\mathbf{h}$ and $\mathbf{y}$ are jointly Gaussian:
$$\hat{\mathbf{h}}_{\text{MMSE}} = \mathbb{E}[\mathbf{h} \mid \mathbf{y}] = \mathbf{C}_{hy}\,\mathbf{C}_y^{-1}\,\mathbf{y},$$
where $\mathbf{C}_{hy} = \mathbf{C}_h$ (since $\mathbf{y} = \mathbf{h} + \mathbf{n}$ and $\mathbf{n}$ is independent of $\mathbf{h}$) and $\mathbf{C}_y = \mathbf{C}_h + \sigma^2\mathbf{I}$.
Therefore: $\hat{\mathbf{h}}_{\text{MMSE}} = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}\mathbf{y}$.
MMSE optimality proof
For any estimator $g(\mathbf{y})$, write $\mathbf{h} - g(\mathbf{y}) = \big(\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}}\big) + \big(\hat{\mathbf{h}}_{\text{MMSE}} - g(\mathbf{y})\big)$.
Expanding:
$$\mathbb{E}\|\mathbf{h} - g(\mathbf{y})\|^2 = \mathbb{E}\|\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}}\|^2 + \mathbb{E}\|\hat{\mathbf{h}}_{\text{MMSE}} - g(\mathbf{y})\|^2 + 2\,\mathrm{Re}\,\mathbb{E}\big[(\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}})^{H}(\hat{\mathbf{h}}_{\text{MMSE}} - g(\mathbf{y}))\big].$$
By the orthogonality principle, the MMSE error is orthogonal to any function of $\mathbf{y}$, so the cross-term vanishes. The second term is non-negative. Therefore:
$$\mathbb{E}\|\mathbf{h} - g(\mathbf{y})\|^2 \geq \mathbb{E}\|\mathbf{h} - \hat{\mathbf{h}}_{\text{MMSE}}\|^2,$$
with equality iff $g(\mathbf{y}) = \hat{\mathbf{h}}_{\text{MMSE}}$ a.s.
Linear NN convergence
Minimise $J(\mathbf{W}) = \mathbb{E}\|\mathbf{h} - \mathbf{W}\mathbf{y}\|^2$:
$$\frac{\partial J}{\partial\mathbf{W}} = -2\,\mathbb{E}[\mathbf{h}\mathbf{y}^{H}] + 2\,\mathbf{W}\,\mathbb{E}[\mathbf{y}\mathbf{y}^{H}] = -2\,\mathbf{C}_h + 2\,\mathbf{W}(\mathbf{C}_h + \sigma^2\mathbf{I}).$$
Setting to zero: $\mathbf{W}^* = \mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}$, which is exactly the MMSE weight.
Non-linear network cannot improve
For jointly Gaussian $(\mathbf{h}, \mathbf{y})$, the conditional mean $\mathbb{E}[\mathbf{h} \mid \mathbf{y}]$ is linear in $\mathbf{y}$. This is a fundamental property of Gaussian distributions. Since the MMSE estimator (conditional mean) is already linear, no non-linear function of $\mathbf{y}$ can achieve a lower MSE. A non-linear NN trained with MSE loss will converge to the same linear mapping (plus some noise from finite-sample training).
Non-linear NNs offer an advantage only when the channel distribution is non-Gaussian (e.g., sparse channels, channels with discrete components, or channels corrupted by non-Gaussian interference).
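A Monte-Carlo sketch of part (c): ordinary least squares on sampled $(\mathbf{y}, \mathbf{h})$ pairs, which is the infinite-data limit of a linear network trained with MSE, recovers $\mathbf{C}_h(\mathbf{C}_h + \sigma^2\mathbf{I})^{-1}$. Real-valued covariances are used here for simplicity:

```python
import numpy as np

# Least squares on (y, h) pairs approaches W* = C_h (C_h + sigma^2 I)^{-1}
# when h is Gaussian and y = h + n.
rng = np.random.default_rng(3)
d, sigma2, N = 3, 0.5, 200_000
A = rng.normal(size=(d, d)); C_h = A @ A.T / d        # a valid covariance

H = rng.multivariate_normal(np.zeros(d), C_h, size=N)
Y = H + np.sqrt(sigma2) * rng.normal(size=(N, d))

# Linear "network" trained with MSE loss == ordinary least squares h ~ y W
W_ls, *_ = np.linalg.lstsq(Y, H, rcond=None)          # row-vector convention
W_mmse = C_h @ np.linalg.inv(C_h + sigma2 * np.eye(d))

print(np.max(np.abs(W_ls.T - W_mmse)))   # small sampling error
```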
ex-ch31-12
Hard (LISTA convergence analysis). Consider a simplified LISTA with scalar signal ($n = 1$), scalar measurement $y = ax + n$ where $a$ is known, $x$ has a Laplacian prior $p(x) \propto e^{-\lambda|x|}$, and $n \sim \mathcal{N}(0, \sigma^2)$.
(a) Write one ISTA iteration for this 1D problem.
(b) Show that a LISTA layer with learned parameters $(w, s, \theta)$ computes the update $x^{(k+1)} = \mathcal{S}_\theta\big(wy + sx^{(k)}\big)$. What are the ISTA-initialised values of $(w, s, \theta)$?
(c) Consider the fixed point $x^\star = \mathcal{S}_\theta(wy + sx^\star)$. Show that for $x^\star > 0$, the fixed point satisfies $x^\star = \frac{wy - \theta}{1 - s}$ (assuming $s < 1$).
(d) For the optimal Bayesian estimator (MAP), derive the MAP estimate and compare with the LISTA fixed point.
The Lipschitz constant is $L = a^2$ in 1D.
The MAP estimate minimises $\frac{1}{2\sigma^2}(y - ax)^2 + \lambda|x|$.
ISTA iteration in 1D
With $L = a^2$, step size $1/a^2$, and threshold $\lambda/a^2$:
$$x^{(k+1)} = \mathcal{S}_{\lambda/a^2}\!\Big(x^{(k)} + \frac{a}{a^2}\big(y - ax^{(k)}\big)\Big).$$
Simplifying: $x^{(k+1)} = \mathcal{S}_{\lambda/a^2}(y/a)$.
Note: in this 1D case, ISTA converges in one step because the gradient step already reaches the proximal minimiser.
LISTA layer and initialisation
LISTA update: $x^{(k+1)} = \mathcal{S}_\theta\big(wy + sx^{(k)}\big)$.
ISTA-initialised values: $w = a/L = 1/a$, $s = 1 - a^2/L = 0$, $\theta = \lambda/a^2$.
With $s = 0$, the update becomes $x^{(k+1)} = \mathcal{S}_{\lambda/a^2}(y/a)$ regardless of the current iterate, confirming the one-step convergence of 1D ISTA.
Fixed point analysis
At a fixed point $x^\star = \mathcal{S}_\theta(wy + sx^\star)$ with $x^\star > 0$:
$$x^\star = wy + sx^\star - \theta.$$
For $x^\star > 0$ (and assuming $s < 1$):
$$x^\star = \frac{wy - \theta}{1 - s}.$$
With ISTA initialisation ($s = 0$, $w = 1/a$, $\theta = \lambda/a^2$): $x^\star = \dfrac{y}{a} - \dfrac{\lambda}{a^2}$.
MAP estimate comparison
The MAP estimate minimises:
$$J(x) = \frac{1}{2\sigma^2}(y - ax)^2 + \lambda|x|.$$
Taking the subdifferential and setting to zero:
$$-\frac{a}{\sigma^2}(y - ax) + \lambda\,\partial|x| \ni 0.$$
For $x > 0$: $x_{\text{MAP}} = \dfrac{y}{a} - \dfrac{\lambda\sigma^2}{a^2}$.
This is $x_{\text{MAP}} = \mathcal{S}_{\lambda\sigma^2/a^2}(y/a)$, i.e., soft-thresholding with threshold $\lambda\sigma^2/a^2$ rather than $\lambda/a^2$.
Comparison: ISTA uses threshold $\lambda/a^2$; MAP uses $\lambda\sigma^2/a^2$. They coincide when $\sigma^2 = 1$. LISTA can learn the correct threshold from data, thereby matching the MAP estimator, while ISTA uses a fixed (possibly mismatched) threshold.
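Both the one-step convergence and the ISTA/MAP threshold match at $\sigma^2 = 1$ can be checked numerically; a sketch with arbitrary $a$, $\lambda$, $y$:

```python
# 1D check: one ISTA step from any start lands on S_{lam/a^2}(y/a), and this
# equals the MAP estimate when sigma^2 = 1 (threshold lam*sigma^2/a^2).
def soft(z, t):
    return (1 if z > 0 else -1) * max(abs(z) - t, 0.0)

a, lam, y = 2.0, 0.8, 3.0
L = a * a                                    # Lipschitz constant in 1D

for x0 in (-5.0, 0.0, 7.0):                  # arbitrary starting points
    x1 = soft(x0 + (a / L) * (y - a * x0), lam / L)
    assert abs(x1 - soft(y / a, lam / L)) < 1e-12   # one-step convergence

# MAP with sigma^2 = 1: minimiser of (y - a x)^2 / 2 + lam |x|
sigma2 = 1.0
x_map = soft(y / a, lam * sigma2 / (a * a))
print(x_map)  # 1.3 -- equals the ISTA fixed point when sigma^2 = 1
```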
ex-ch31-13
Hard (Convergence of Q-learning). Consider a two-state, two-action MDP with known transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a)$.
Discount factor $\gamma$.
(a) Write the Bellman optimality equations for $Q^*$.
(b) Solve the system of 4 equations to find $Q^*$ exactly.
(c) What is the optimal policy?
(d) Simulate 5 Q-learning updates starting from $Q = 0$ with learning rate $\alpha$, initial state $s_1$, and $\epsilon = 0$ (greedy). Show that the agent may take suboptimal actions initially.
The Bellman equation is $Q^*(s, a) = R(s, a) + \gamma\sum_{s'}P(s' \mid s, a)\max_{a'}Q^*(s', a')$.
Define $V_1 = \max_a Q^*(s_1, a)$ and $V_2 = \max_a Q^*(s_2, a)$ and substitute.
Bellman optimality equations
The four equations are coupled through $V^*(s) = \max_a Q^*(s, a)$:
$$Q^*(s, a) = R(s, a) + \gamma\sum_{s'}P(s' \mid s, a)\,V^*(s'), \qquad V^*(s) = \max_a Q^*(s, a).$$
Solve the system
Let $V_1 = \max_a Q^*(s_1, a)$ and $V_2 = \max_a Q^*(s_2, a)$. Substituting the transition probabilities makes each Bellman equation linear in $(V_1, V_2)$:
$$Q^*(s, a) = R(s, a) + \gamma\big[P(s_1 \mid s, a)\,V_1 + P(s_2 \mid s, a)\,V_2\big].$$
Identifying the maximising action in each state reduces the four equations to two linear equations in $V_1$ and $V_2$, which can be solved directly; back-substitution then yields the full Q-table.
Optimal policy
In both states, the optimal action is the one with the larger immediate reward ($R = 5$ in $s_1$, $R = 8$ in $s_2$). The optimal policy happens to be greedy w.r.t. immediate rewards in this example, but this is coincidental.
Q-learning simulation (initial suboptimality)
Start: $Q = \mathbf{0}$, state $s_1$, greedy ($\epsilon = 0$).
Step 1: All Q-values equal 0; greedy picks $a_1$ (arbitrary tie-breaking). Update: $Q(s_1, a_1) \leftarrow \alpha R(s_1, a_1)$. Next state: $s_1$ (prob 0.8).
Step 2: Now $Q(s_1, a_1) > 0$, so greedy picks $a_1$ again and updates $Q(s_1, a_1)$. Next: $s_2$ (prob 0.2, suppose this transition occurs).
Step 3: State $s_2$, $Q(s_2, a_1) = Q(s_2, a_2) = 0$, pick $a_1$ (tie). Update $Q(s_2, a_1)$. Next state drawn from $P(\cdot \mid s_2, a_1)$.
Note: in step 3, the agent picks $a_1$ in $s_2$ (suboptimal --- the optimal action is $a_2$ with $R = 8$). It takes $a_1$ because both Q-values were 0 and the tie was broken in favour of $a_1$. Without exploration ($\epsilon = 0$), the agent may lock into $a_1$ for $s_2$ for many steps before the Q-value of $a_2$ gets updated. This illustrates the necessity of exploration.
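A sketch of the greedy lock-in effect on a toy MDP with assumed transitions and rewards (the exercise's exact matrices are not reproduced here):

```python
import numpy as np

# Greedy (eps = 0) Q-learning on a toy two-state, two-action MDP with
# *hypothetical* transitions and rewards; illustrates tie-breaking lock-in.
rng = np.random.default_rng(5)
P = np.array([[[0.8, 0.2], [0.2, 0.8]],       # P[s, a, s']  (assumed)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[5.0, 0.0], [0.0, 8.0]])        # R[s, a]       (assumed)
gamma, alpha = 0.9, 0.5

Q = np.zeros((2, 2))
s = 0
for step in range(5):
    a = int(np.argmax(Q[s]))                  # greedy; ties break toward a_0
    s_next = rng.choice(2, p=P[s, a])
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
print(Q)   # only the tie-broken actions have been updated so far
```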
ex-ch31-14
Hard (FedAvg heterogeneity bias). Consider $C = 2$ clients with quadratic objectives:
$$F_1(w) = \tfrac{1}{2}(w - 1)^2, \qquad F_2(w) = \tfrac{1}{2}(w + 1)^2.$$
The global objective is $F(w) = \tfrac{1}{2}\big(F_1(w) + F_2(w)\big)$.
(a) Find the global optimum $w^*$ and the local optima $w_1^*$, $w_2^*$.
(b) Run one round of FedAvg: starting from $w^t = 0$, each client performs $E$ gradient descent steps with learning rate $\eta$. Derive the aggregated model $\bar{w}$ as a function of $E$ and $\eta$.
(c) Show that $\bar{w} = w^*$ regardless of $E$ and $\eta$. Explain why the heterogeneity bias is zero in this symmetric example.
(d) Now consider asymmetric objectives: $F_1(w) = \tfrac{1}{2}(w - 1)^2$, $F_2(w) = (w + 1)^2$ (i.e., $F_2$ has curvature 2 instead of 1). Find $w^*$ and show that FedAvg with $E > 1$ local steps does not converge to $w^*$.
The gradient of $F_c$ is $F_c'(w) = a_c(w - w_c^*)$.
For (c), exploit the symmetry of the two objectives about $w = 0$.
Optima
Local optima: $w_1^* = 1$, $w_2^* = -1$.
Global: $w^* = \tfrac{1}{2}(w_1^* + w_2^*) = 0$.
FedAvg round
Client 1: $F_1'(w) = w - 1$. Starting from $w^t = 0$:
- After 1 step: $w = \eta$
- After 2 steps: $w = \eta + \eta(1 - \eta) = 1 - (1 - \eta)^2$
- After $E$ steps: $w_1 = 1 - (1 - \eta)^E$ (geometric series for gradient descent on a quadratic).
Client 2: $F_2'(w) = w + 1$. Starting from $w^t = 0$:
- After $E$ steps: $w_2 = -\big(1 - (1 - \eta)^E\big)$
Aggregation and zero bias
$$\bar{w} = \tfrac{1}{2}(w_1 + w_2) = \tfrac{1}{2}\Big[\big(1 - (1 - \eta)^E\big) - \big(1 - (1 - \eta)^E\big)\Big] = 0$$
for every $E$ and $\eta$, because the problem is symmetric ($a_1 = a_2 = 1$ and $w_1^* = -w_2^* = 1$). The client drifts exactly cancel upon averaging.
Asymmetric case: non-zero bias
With $F_2(w) = (w + 1)^2$: $F_2'(w) = 2(w + 1)$.
Global optimum: $F'(w) = \tfrac{1}{2}\big[(w - 1) + 2(w + 1)\big] = \tfrac{1}{2}(3w + 1)$. Setting $F'(w) = 0$: $w^* = -\tfrac{1}{3}$.
Client 1 after $E$ steps from 0: $w_1 = 1 - (1 - \eta)^E$.
Client 2 after $E$ steps from 0 (with step size $\eta$ and gradient $2(w + 1)$): $w_2 = -1 + (1 - 2\eta)^E$.
Aggregated: $\bar{w} = \tfrac{1}{2}\big[(1 - 2\eta)^E - (1 - \eta)^E\big]$.
For one step ($E = 1$): $\bar{w} = \tfrac{1}{2}\big[(1 - 2\eta) - (1 - \eta)\big] = -\tfrac{\eta}{2}$. For $\eta$ small, this approaches 0, not $-\tfrac{1}{3}$.
For $E \to \infty$ (with $\eta < \tfrac{1}{2}$): $\bar{w} \to \tfrac{1}{2}\big[1 + (-1)\big] = 0$.
The asymmetric curvatures cause the client with larger curvature (client 2) to drift faster toward its local optimum, and the averaging does not correct this imbalance. This is the mechanism behind the heterogeneity bias in the FedAvg convergence theorem.
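Iterating FedAvg rounds (rather than the single round analysed above) makes the bias visible as a fixed point; a sketch:

```python
# Iterate FedAvg rounds on the asymmetric pair F1 = (w-1)^2/2, F2 = (w+1)^2.
# With E = 1 the rounds reduce to centralised GD and reach w* = -1/3; with
# E > 1 local steps the fixed point is biased toward 0.
def fedavg_fixed_point(E, eta, rounds=2000):
    w = 0.0
    for _ in range(rounds):
        w1 = w
        for _ in range(E):
            w1 -= eta * (w1 - 1.0)        # client 1 gradient: w - 1
        w2 = w
        for _ in range(E):
            w2 -= eta * 2.0 * (w2 + 1.0)  # client 2 gradient: 2(w + 1)
        w = 0.5 * (w1 + w2)               # server averaging
    return w

print(fedavg_fixed_point(E=1, eta=0.1))   # close to -1/3
print(fedavg_fixed_point(E=20, eta=0.1))  # noticeably biased toward 0
```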
ex-ch31-15
Hard (Autoencoder capacity and channel coding). An autoencoder maps $M = 2^k$ messages to $n$ channel uses over AWGN, achieving a block error rate (BLER) $P_e$.
(a) The communication rate is $R = k/n$ bits per channel use. For $n = 7$, $k = 4$, compute $R$ and compare with the AWGN capacity $C$ at the operating SNR.
(b) The sphere-packing (Shannon) lower bound on the BLER of $M$ codewords in $\mathbb{R}^{2n}$ under AWGN limits how reliably short blocks can be decoded. Explain qualitatively why the autoencoder's BLER approaches this bound as the network capacity increases.
(c) A key limitation of autoencoder-based "codes" is that they are fixed-length, fixed-$M$ systems. Explain why scaling to $M = 2^{100}$ (typical for modern codes with 100-bit messages) is fundamentally challenging for the autoencoder approach.
AWGN capacity: $C = \log_2(1 + \text{SNR})$ bits/use.
The one-hot representation of messages requires an $M$-dimensional input, which grows exponentially in the number of bits.
Rate calculation
$$R = \frac{k}{n} = \frac{4}{7} \approx 0.571 \text{ bits per channel use},$$
giving $R/C \approx 0.44$ at the operating SNR: the code runs well below capacity, the price of a very short block. Closing the gap requires larger $M$ and $n$, but the autoencoder architecture limits how far either can go.
Approaching the sphere-packing bound
The encoder maps messages to points in (the 2D constellation space over channel uses). Under the power constraint, these points lie on or near a sphere. The optimal arrangement maximises the minimum distance between points, which corresponds to packing non-overlapping "decoding cones" on the sphere.
The sphere-packing bound computes the maximum fraction of the sphere that can be covered by such cones. A sufficiently expressive autoencoder, by the universal approximation theorem, can learn the optimal point placement (or a close approximation), thereby approaching the sphere-packing bound. The decoder learns the optimal decision regions, which are the Voronoi cells of the constellation.
In practice, autoencoders have been shown to match or slightly exceed the performance of known short block codes for small $M$ and $n$.
Scaling challenge
For $M = 2^{100}$ messages:
- Input representation: the one-hot vector has dimension $2^{100} \approx 1.3 \times 10^{30}$ --- astronomically large and impossible to represent in any computer.
- Output layer: the decoder softmax has $2^{100}$ outputs, requiring a correspondingly enormous number of weights in the last layer alone.
- Training: each mini-batch can only cover a vanishing fraction of the $2^{100}$ messages, making training infeasible.
This is the fundamental exponential scaling barrier of the autoencoder approach. Classical channel codes circumvent this by using structured encoders (linear codes, convolutional codes, LDPC, Polar) whose encoding complexity scales polynomially in the block length.
Proposed solutions include: (a) Turbo Autoencoder (using interleaved convolutional structure), (b) bit-level autoencoders that process one bit at a time, and (c) neural network decoders for existing structured codes (keeping the encoder classical).
ex-ch31-16
Hard (Over-the-air federated aggregation). In over-the-air computation (AirComp), $C$ clients simultaneously transmit their model updates over a MAC channel. The server receives:
$$\mathbf{y} = \sum_{c=1}^{C} h_c\,\mathbf{x}_c + \mathbf{n},$$
where $h_c$ is the channel coefficient from client $c$, $\mathbf{x}_c$ is the transmitted signal, and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$.
(a) To achieve the desired aggregation $\overline{\Delta\mathbf{w}} = \frac{1}{C}\sum_c \Delta\mathbf{w}_c$, each client must pre-equalise: $\mathbf{x}_c = \Delta\mathbf{w}_c / h_c$. Show that the server estimate $\mathbf{y}/C$ is then $\overline{\Delta\mathbf{w}} + \mathbf{n}/C$.
(b) The pre-equalisation requires transmit power $\|\mathbf{x}_c\|^2 = \|\Delta\mathbf{w}_c\|^2 / |h_c|^2$. If client $c$ has a power constraint $P_c$, show that the maximum model update norm is $\|\Delta\mathbf{w}_c\| \leq |h_c|\sqrt{P_c}$.
(c) To handle power-limited clients, a common approach is to truncate (clip) the update: $\tilde{\Delta}\mathbf{w}_c = \Delta\mathbf{w}_c \cdot \min\big(1, \gamma_c/\|\Delta\mathbf{w}_c\|\big)$, where $\gamma_c = |h_c|\sqrt{P_c}$. Derive the MSE of the aggregation when $C_{\text{clip}}$ out of $C$ clients are clipped.
(d) Compare the communication cost (channel uses) of AirComp vs orthogonal TDMA for $C$ clients with $D$ parameters.
AirComp transmits all clients simultaneously in $D$ channel uses. TDMA requires $CD$ channel uses.
For (c), the clipping introduces a deterministic bias plus the aggregation noise.
AirComp aggregation
With $\mathbf{x}_c = \Delta\mathbf{w}_c / h_c$:
$$\mathbf{y} = \sum_{c=1}^{C} h_c\,\frac{\Delta\mathbf{w}_c}{h_c} + \mathbf{n} = \sum_{c=1}^{C}\Delta\mathbf{w}_c + \mathbf{n}.$$
The server computes:
$$\frac{\mathbf{y}}{C} = \frac{1}{C}\sum_{c=1}^{C}\Delta\mathbf{w}_c + \frac{\mathbf{n}}{C} = \overline{\Delta\mathbf{w}} + \frac{\mathbf{n}}{C}.$$
The aggregation noise $\mathbf{n}/C$ has variance $\sigma^2/C^2$ per dimension, which decreases as $1/C^2$ --- more clients means less noise per dimension due to the averaging effect.
Power constraint
From $\|\mathbf{x}_c\|^2 = \|\Delta\mathbf{w}_c\|^2 / |h_c|^2 \leq P_c$ we get $\|\Delta\mathbf{w}_c\| \leq |h_c|\sqrt{P_c}$. Clients with weak channels ($|h_c|$ small) have a tighter constraint on the update norm. This is the fundamental fairness issue in AirComp: power-limited clients contribute distorted updates.
MSE with clipping
Partition clients into unclipped ($\|\Delta\mathbf{w}_c\| \leq \gamma_c$) and clipped ($\|\Delta\mathbf{w}_c\| > \gamma_c$).
For unclipped clients: $\tilde{\Delta}\mathbf{w}_c = \Delta\mathbf{w}_c$ (no error).
For clipped clients: $\tilde{\Delta}\mathbf{w}_c = \gamma_c\,\Delta\mathbf{w}_c / \|\Delta\mathbf{w}_c\|$ (scaled to norm $\gamma_c$). The clipping error is $\boldsymbol{\epsilon}_c = \tilde{\Delta}\mathbf{w}_c - \Delta\mathbf{w}_c$, with $\|\boldsymbol{\epsilon}_c\| = \|\Delta\mathbf{w}_c\| - \gamma_c$.
The aggregated estimate:
$$\frac{\mathbf{y}}{C} = \overline{\Delta\mathbf{w}} + \frac{1}{C}\sum_{c\,\in\,\text{clipped}}\boldsymbol{\epsilon}_c + \frac{\mathbf{n}}{C}.$$
The MSE is:
$$\text{MSE} = \Big\|\frac{1}{C}\sum_{c\,\in\,\text{clipped}}\boldsymbol{\epsilon}_c\Big\|^2 + \frac{D\sigma^2}{C^2}.$$
The first term is the clipping bias (deterministic, grows with the number of clipped clients and the severity of clipping) and the second is the aggregation noise (stochastic, decreases with ).
Communication cost comparison
AirComp: all clients transmit simultaneously; each dimension of $\Delta\mathbf{w}$ occupies one channel use. Total: $D$ channel uses.
Orthogonal TDMA: each client transmits in a separate time slot. Total: $CD$ channel uses.
Reduction: $CD/D = C$, equal to the number of clients.
AirComp achieves a $C\times$ speedup by exploiting the superposition property of the wireless MAC channel. The trade-off is the aggregation noise ($\sigma^2/C^2$ per dimension) and the need for channel pre-equalisation, which requires CSI at the clients and may be power-limited.
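A simulation sketch of part (a) with synthetic channels and updates confirms the $\sigma^2/C^2$ noise scaling:

```python
import numpy as np

# AirComp sketch: pre-equalised simultaneous transmission yields the exact
# average plus noise scaled by 1/C. Channels and updates are synthetic.
rng = np.random.default_rng(4)
C, D, sigma = 10, 1000, 0.1
h = rng.normal(size=C) + 1j * rng.normal(size=C)      # client channels
dw = rng.normal(size=(C, D))                          # model updates

x = dw / h[:, None]                                   # pre-equalisation
y = (h[:, None] * x).sum(axis=0) \
    + sigma * rng.normal(size=D)                      # MAC superposition
est = y.real / C                                      # server estimate

target = dw.mean(axis=0)
mse = np.mean((est - target) ** 2)
print(mse, sigma**2 / C**2)    # empirical MSE close to sigma^2 / C^2
```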