Exercises

ex-ch31-01

Easy

A two-layer fully connected neural network is used for channel estimation in an OFDM system with $N_c = 128$ subcarriers and $N_p = 16$ pilots. The hidden layer has $H = 64$ units. The input is $[\operatorname{Re}(\hat{\mathbf{h}}_p), \operatorname{Im}(\hat{\mathbf{h}}_p)] \in \mathbb{R}^{2N_p}$ and the output is the full channel estimate $[\operatorname{Re}(\hat{\mathbf{h}}), \operatorname{Im}(\hat{\mathbf{h}})] \in \mathbb{R}^{2N_c}$.

(a) Write the dimensions of the weight matrices $\mathbf{W}_1$, $\mathbf{W}_2$ and bias vectors $\mathbf{b}_1$, $\mathbf{b}_2$.

(b) Compute the total number of trainable parameters.

(c) If the training dataset has 300 channel realisations, is this likely sufficient? Justify using the rule-of-thumb ratio.
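Parts (a) and (b) can be verified with a short count; a minimal sketch, with the layer sizes taken from the exercise statement:

```python
# Parameter count for the two-layer estimator (sizes from the exercise).
N_p, N_c, H = 16, 128, 64
d_in, d_out = 2 * N_p, 2 * N_c  # real/imaginary parts stacked

# W1: H x d_in, b1: H, W2: d_out x H, b2: d_out
total = H * d_in + H + d_out * H + d_out
print(total)  # 18752
```

With only 300 realisations, the sample-to-parameter ratio is far below the usual rule-of-thumb of roughly 10 samples per parameter, which is the point of part (c).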

ex-ch31-02

Easy

An end-to-end autoencoder for 8-ary signalling over $n = 2$ channel uses (2D constellation) is trained under AWGN at SNR $= 12$ dB. After training, the encoder produces the following constellation points (in I-Q coordinates):

Symbol      I       Q
0        1.05    0.02
1        0.51    0.88
2       -0.49    0.89
3       -1.04    0.03
4       -0.52   -0.87
5        0.50   -0.88
6        0.01    0.01
7        0.00   -0.01

(a) Compute the average symbol energy.

(b) Does this constellation resemble any classical modulation scheme? Explain any differences.
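For part (a), the average symbol energy follows directly from the tabulated points; a sketch:

```python
# Average symbol energy E_avg = (1/M) * sum_s (I_s^2 + Q_s^2).
points = [(1.05, 0.02), (0.51, 0.88), (-0.49, 0.89), (-1.04, 0.03),
          (-0.52, -0.87), (0.50, -0.88), (0.01, 0.01), (0.00, -0.01)]
E_avg = sum(i**2 + q**2 for i, q in points) / len(points)
print(round(E_avg, 3))  # 0.788
```

Note for part (b): six points sit near a ring of radius one and two near the origin, which is worth comparing against classical ring constellations.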

ex-ch31-03

Easy

In a Q-learning power control problem with $K = 3$ users and $J = 4$ discrete power levels per user:

(a) How many entries does the joint Q-table have if the state space has $|\mathcal{S}| = 27$ states (3 quantisation levels per user)?

(b) If an alternative per-user decomposition is used (each user has an independent Q-table of size $|\mathcal{S}| \times J$), how many total entries are needed?

(c) What is the reduction factor?
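The counting in (a)-(c) is mechanical; a sketch:

```python
K, J, S = 3, 4, 27  # users, power levels per user, joint states

joint = S * J**K          # joint action space has J^K action combinations
per_user = K * S * J      # one |S| x J table per user
print(joint, per_user, round(joint / per_user, 2))  # 1728 324 5.33
```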

ex-ch31-04

Easy

In federated learning with $C = 10$ clients, each client has a model with $|\mathbf{w}| = 5000$ parameters (32-bit floats).

(a) Compute the upload cost per client per round (in KB).

(b) If FedAvg runs for $R = 80$ rounds with full client participation, compute the total communication cost (in MB).

(c) If only $C_r = 3$ clients are randomly selected per round, what is the new total cost?
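A sketch of the arithmetic for (a)-(c), counting uplink traffic only and taking 1 KB = 1024 bytes (both conventions are assumptions; stating them matters for the final numbers):

```python
C, d, bytes_per_param = 10, 5000, 4  # 32-bit floats = 4 bytes
R, C_r = 80, 3

per_client = d * bytes_per_param               # bytes uploaded per round
full = C * per_client * R                      # full participation
partial = C_r * per_client * R                 # 3 sampled clients per round
print(round(per_client / 1024, 2))             # ~19.53 KB
print(round(full / 1024**2, 2))                # ~15.26 MB
print(round(partial / 1024**2, 2))             # ~4.58 MB
```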

ex-ch31-05

Medium

A neural-network channel estimator uses MSE loss: $\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^N \|\hat{\mathbf{h}}_i - \mathbf{h}_i\|^2$. The network has a single hidden layer: $\hat{\mathbf{h}} = \mathbf{W}_2\,\sigma(\mathbf{W}_1 \mathbf{y}_p + \mathbf{b}_1) + \mathbf{b}_2$ with ReLU activation.

(a) Derive the gradient $\partial\mathcal{L}/\partial\mathbf{W}_2$.

(b) Derive the gradient $\partial\mathcal{L}/\partial\mathbf{W}_1$, expressing the chain rule through the ReLU.

(c) If all pilot observations are real-valued and $\mathbf{y}_p \geq 0$ element-wise, explain why $\partial\mathcal{L}/\partial\mathbf{W}_1$ simplifies.

ex-ch31-06

Medium

Consider a LISTA network with $L$ layers applied to the sparse recovery problem $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n}$, where $\mathbf{A} \in \mathbb{R}^{25 \times 50}$, $s = 3$, and SNR $= 20$ dB.

(a) If only the per-layer thresholds $\{\theta^{(l)}\}_{l=0}^{L-1}$ are learned (with $\mathbf{W}^{(l)}$ and $\mathbf{S}^{(l)}$ fixed at the ISTA values), how many trainable parameters does the network have?

(b) If all matrices and thresholds are learned, how many parameters?

(c) ISTA achieves NMSE $= -15$ dB after $T = 30$ iterations. Based on the linear convergence theorem for LISTA, estimate the number of LISTA layers needed to achieve the same NMSE, assuming $\rho = 0.7$.
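For part (c), one way to turn the linear-convergence statement into a layer count is to assume the error amplitude contracts by $\rho$ per layer, i.e. NMSE in dB $\approx 20 L \log_{10}\rho$ (this power-law reading is an assumption, not stated in the theorem itself); a sketch:

```python
import math

rho = 0.7            # assumed per-layer contraction factor
target_db = -15.0    # NMSE target in dB

# Solve rho^L <= 10^(target_db/20) for the smallest integer L.
L = math.ceil((-target_db) / (-20 * math.log10(rho)))
print(L)  # 5
```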

ex-ch31-07

Medium

A Q-learning agent controls the power of a single transmitter communicating with one receiver. The state is the quantised channel gain: $s \in \{\text{low}, \text{medium}, \text{high}\}$ (3 states). The action is the power level: $a \in \{0.5, 1.0, 2.0\}$ (3 actions). The discount factor is $\gamma = 0$ (myopic, no future reward consideration).

The reward is the rate $r = \log_2(1 + |h|^2 p / \sigma^2)$ with $\sigma^2 = 1$. For the three states, the typical channel gains are $|h|^2 \in \{0.3, 1.0, 3.0\}$.

(a) Compute the optimal Q-table $Q^*(s, a)$ exactly (9 entries).

(b) What is the optimal policy?

(c) If the agent has explored each state-action pair exactly once with $\alpha = 1$, what does the Q-table look like? Is it already optimal?
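Since $\gamma = 0$, $Q^*(s,a)$ is just the immediate reward $\log_2(1 + |h|^2 p)$; a sketch that tabulates all 9 entries for (a) and reads off the policy for (b):

```python
import math

gains = {"low": 0.3, "medium": 1.0, "high": 3.0}   # |h|^2 per state
powers = [0.5, 1.0, 2.0]

Q = {s: [math.log2(1 + g * p) for p in powers] for s, g in gains.items()}
for s, row in Q.items():
    best = powers[row.index(max(row))]
    print(s, [round(q, 3) for q in row], "-> p =", best)
# the rate is increasing in p, so every state picks p = 2.0
```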

ex-ch31-08

Medium

In FedAvg with $C = 5$ clients, the global model is $\mathbf{w} \in \mathbb{R}^d$. Each client performs $E = 5$ local SGD steps with learning rate $\eta$. The local gradients at client $c$ have mean $\mathbf{g}_c = \nabla F_c(\mathbf{w})$ and the global gradient is $\mathbf{g} = \frac{1}{5}\sum_{c=1}^5 \mathbf{g}_c$.

(a) Write the local model after $E$ steps (assuming constant gradient, i.e., the quadratic approximation): $\mathbf{w}_c^{(E)} = \mathbf{w} - E\eta\,\mathbf{g}_c$.

(b) Write the aggregated model $\bar{\mathbf{w}} = \frac{1}{5}\sum_c \mathbf{w}_c^{(E)}$.

(c) Compare $\bar{\mathbf{w}}$ with a single centralised SGD step using the global gradient. When are they equivalent?

ex-ch31-09

Medium

A soft-thresholding operator $\mathcal{S}_\theta(z) = \mathrm{sign}(z)\max(|z| - \theta, 0)$ is used in each layer of a LISTA network. During backpropagation, we need $\frac{\partial \mathcal{S}_\theta}{\partial z}$ and $\frac{\partial \mathcal{S}_\theta}{\partial \theta}$.

(a) Derive $\frac{\partial \mathcal{S}_\theta(z)}{\partial z}$ for $z \neq \pm\theta$.

(b) Derive $\frac{\partial \mathcal{S}_\theta(z)}{\partial \theta}$.

(c) Explain why the gradient w.r.t. $\theta$ provides a learning signal for adapting the sparsity level per layer.
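The derivatives in (a) and (b) can be sanity-checked with finite differences; a sketch (test points chosen away from the kinks at $z = \pm\theta$):

```python
def soft(z, theta):
    # S_theta(z) = sign(z) * max(|z| - theta, 0)
    s = 1.0 if z >= 0 else -1.0
    return s * max(abs(z) - theta, 0.0)

def num_grad(f, x, eps=1e-6):
    # central finite difference
    return (f(x + eps) - f(x - eps)) / (2 * eps)

z, theta = 2.0, 0.5   # active region: |z| > theta
dz = num_grad(lambda v: soft(v, theta), z)        # expect 1
dth = num_grad(lambda t: soft(z, t), theta)       # expect -sign(z) = -1
print(round(dz, 4), round(dth, 4))

z_dead = 0.2          # dead zone: |z| < theta, both derivatives vanish
print(num_grad(lambda v: soft(v, theta), z_dead))  # 0.0
```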

ex-ch31-10

Medium

An autoencoder for $M = 4$ symbols over AWGN with $n = 2$ channel uses has encoder output (after power normalisation):

$$\mathbf{x}_s = \frac{\mathbf{z}_s}{\sqrt{\frac{1}{M}\sum_{m=1}^M\|\mathbf{z}_m\|^2}}$$

where $\mathbf{z}_s = \mathbf{W}_2\sigma(\mathbf{W}_1\mathbf{e}_s + \mathbf{b}_1) + \mathbf{b}_2$ and $\mathbf{e}_s$ is the one-hot encoding.

(a) Show that the power constraint $\frac{1}{M}\sum_s \|\mathbf{x}_s\|^2 = 1$ is automatically satisfied by this normalisation.

(b) The decoder receives $\mathbf{y} = \mathbf{x}_s + \mathbf{n}$ and outputs $\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{W}_4\sigma(\mathbf{W}_3\mathbf{y} + \mathbf{b}_3) + \mathbf{b}_4)$. Write the cross-entropy loss for a batch of $B$ samples $\{(s_i, \mathbf{y}_i)\}_{i=1}^B$.

(c) Compute $\frac{\partial\mathcal{L}}{\partial\text{logits}}$ for a single sample, where logits $= \mathbf{W}_4\sigma(\mathbf{W}_3\mathbf{y} + \mathbf{b}_3) + \mathbf{b}_4$.
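For part (c), the classic result is that the gradient of the cross-entropy w.r.t. the logits is $\hat{\mathbf{p}} - \mathbf{e}_s$; a finite-difference check (the logit values below are arbitrary illustrations):

```python
import math

def softmax(v):
    m = max(v)                       # shift for numerical stability
    e = [math.exp(x - m) for x in v]
    Z = sum(e)
    return [x / Z for x in e]

logits = [2.0, 0.5, -1.0, 0.3]   # arbitrary example values
s = 0                            # true symbol index

p = softmax(logits)
analytic = [pi - (1.0 if i == s else 0.0) for i, pi in enumerate(p)]

# finite-difference gradient of L = -log p_s w.r.t. each logit
eps = 1e-6
for i in range(len(logits)):
    bumped = logits[:]
    bumped[i] += eps
    dL = (-math.log(softmax(bumped)[s]) + math.log(p[s])) / eps
    assert abs(dL - analytic[i]) < 1e-4
print("gradient matches p_hat - e_s")
```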

ex-ch31-11

Hard

(MMSE as optimal NN target.) Consider the channel estimation problem where the channel $\mathbf{h} \sim \mathcal{CN}(\mathbf{0}, \mathbf{R}_{HH})$ and the observation is $\mathbf{y} = \mathbf{h} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$.

(a) Show that the MMSE estimator is $\hat{\mathbf{h}}_{\mathrm{MMSE}} = \mathbf{R}_{HH}(\mathbf{R}_{HH} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$.

(b) Prove that among all estimators $g(\mathbf{y})$, the MMSE estimator minimises $\mathbb{E}[\|\mathbf{h} - g(\mathbf{y})\|^2]$.

(c) Show that a linear neural network (no activation function) with $\hat{\mathbf{h}} = \mathbf{W}\mathbf{y}$ trained with MSE loss on infinite data converges to the MMSE weight matrix $\mathbf{W}^* = \mathbf{R}_{HH}(\mathbf{R}_{HH} + \sigma^2\mathbf{I})^{-1}$.

(d) Explain why a non-linear network cannot do better than the linear MMSE in this Gaussian setting.

ex-ch31-12

Hard

(LISTA convergence analysis.) Consider a simplified LISTA with scalar signal ($N = 1$), scalar measurement $y = ax + n$ where $a > 0$ is known, $x$ has a Laplacian prior $p(x) \propto e^{-\lambda|x|}$, and $n \sim \mathcal{N}(0, \sigma^2)$.

(a) Write one ISTA iteration for this 1D problem: $x^{(t+1)} = \mathcal{S}_{\lambda/a^2}(x^{(t)} + \frac{1}{a}(y - ax^{(t)}))$.

(b) Show that in a LISTA layer with learned parameters $w, s, \theta$, the update is $x^{(l+1)} = \mathcal{S}_\theta(wx^{(l)} + sy)$. What are the ISTA-initialised values of $w, s, \theta$?

(c) Consider the fixed point $x^* = \mathcal{S}_\theta(wx^* + sy)$. Show that for $wx^* + sy > \theta$, the fixed point satisfies $x^* = \frac{sy - \theta}{1 - w}$ (assuming $|w| < 1$), which approaches $\frac{sy}{1 - w}$ as $\theta \to 0$.

(d) For the optimal Bayesian estimator (MAP), derive the MAP estimate $\hat{x}_{\mathrm{MAP}}$ and compare with the LISTA fixed point.
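The fixed point in (c) can be checked by simply iterating the layer map; a sketch with arbitrary illustrative values of $w, s, \theta, y$ (on the active branch the soft threshold subtracts $\theta$, so the iteration settles at $(sy - \theta)/(1 - w)$):

```python
def soft(z, theta):
    # S_theta(z) = sign(z) * max(|z| - theta, 0)
    s = 1.0 if z >= 0 else -1.0
    return s * max(abs(z) - theta, 0.0)

w, s_gain, theta, y = 0.3, 0.7, 0.1, 2.0   # illustrative values, |w| < 1
x = 0.0
for _ in range(200):                        # iterate x <- S_theta(w x + s y)
    x = soft(w * x + s_gain * y, theta)

predicted = (s_gain * y - theta) / (1 - w)  # active-branch fixed point
print(round(x, 6), round(predicted, 6))
```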

ex-ch31-13

Hard

(Convergence of Q-learning.) Consider a two-state, two-action MDP with transition matrix and rewards:

$$P(s'|s,a) = \begin{cases} 0.8 & s' = s \\ 0.2 & s' \neq s \end{cases} \quad \forall a$$

$$R(s_1, a_1) = 5, \quad R(s_1, a_2) = 1, \quad R(s_2, a_1) = 2, \quad R(s_2, a_2) = 8$$

Discount factor $\gamma = 0.9$.

(a) Write the Bellman optimality equations for $Q^*(s, a)$.

(b) Solve the system of 4 equations to find $Q^*(s, a)$ exactly.

(c) What is the optimal policy?

(d) Simulate 5 Q-learning updates starting from $Q = 0$ with $\alpha = 0.5$, initial state $s_1$, and $\epsilon = 0$ (greedy). Show that the agent may take suboptimal actions initially.
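The exact solution in (b) can be cross-checked with value iteration on the given MDP; a sketch:

```python
GAMMA = 0.9
R = [[5.0, 1.0], [2.0, 8.0]]    # R[s][a]
P_STAY = 0.8                    # P(s' = s | s, a) for every action

Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):            # value iteration to the fixed point
    V = [max(Q[0]), max(Q[1])]
    Q = [[R[s][a] + GAMMA * (P_STAY * V[s] + (1 - P_STAY) * V[1 - s])
          for a in range(2)] for s in range(2)]

print([[round(q, 2) for q in row] for row in Q])
# optimal policy: a1 in state s1, a2 in state s2
```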

ex-ch31-14

Hard

(FedAvg heterogeneity bias.) Consider $C = 2$ clients with quadratic objectives:

$$F_1(w) = \frac{1}{2}(w - 1)^2, \qquad F_2(w) = \frac{1}{2}(w + 1)^2$$

The global objective is $F(w) = \frac{1}{2}[F_1(w) + F_2(w)]$.

(a) Find the global optimum $w^*$ and the local optima $w_1^*, w_2^*$.

(b) Run one round of FedAvg: starting from $w^{(0)} = 0$, each client performs $E$ gradient descent steps with learning rate $\eta$. Derive the aggregated model $w^{(1)}$ as a function of $E$ and $\eta$.

(c) Show that $w^{(1)} = 0 = w^*$ regardless of $E$ and $\eta$. Explain why the heterogeneity bias is zero in this symmetric example.

(d) Now consider asymmetric objectives: $F_1(w) = \frac{1}{2}(w - 1)^2$, $F_2(w) = (w + 1)^2$ (i.e., $F_2$ has curvature 2 instead of 1). Find $w^*$ and show that FedAvg with $E > 1$ does not converge to $w^*$.
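A quick simulation of part (d) (step size and round count chosen arbitrarily for illustration) shows the FedAvg round map settling at a point different from the global optimum $w^* = -1/3$:

```python
E, eta = 5, 0.1            # local epochs and step size (arbitrary choices)
w = 0.0
for _ in range(200):       # FedAvg rounds on the asymmetric objective pair
    w1 = w
    for _ in range(E):
        w1 -= eta * (w1 - 1)          # grad F1 = w - 1
    w2 = w
    for _ in range(E):
        w2 -= eta * 2 * (w2 + 1)      # grad F2 = 2(w + 1)
    w = 0.5 * (w1 + w2)

print(round(w, 4), round(-1 / 3, 4))  # FedAvg fixed point != w*
```

Client 2's larger curvature makes it pull harder per local epoch, so the averaged fixed point is biased toward $w_2^* = -1$ relative to plain gradient descent on $F$.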

ex-ch31-15

Hard

(Autoencoder capacity and channel coding.) An autoencoder maps $M$ messages to $n$ channel uses over AWGN, achieving block error rate (BLER) $P_e$.

(a) The communication rate is $R = \frac{\log_2 M}{n}$ bits per channel use. For $M = 16$, $n = 7$, compute $R$ and compare with the AWGN capacity $C = \frac{1}{2}\log_2(1 + \text{SNR})$ at SNR $= 7$ dB.

(b) The sphere-packing (Shannon) lower bound on BLER for $M$ codewords in $\mathbb{R}^{2n}$ under AWGN is:

$$P_e \geq 1 - \frac{\text{Vol}(\text{cone})}{\text{Vol}(\text{sphere})}$$

Explain qualitatively why the autoencoder's BLER approaches this bound as the network capacity increases.

(c) A key limitation of autoencoder-based "codes" is that they are fixed-length, fixed-$M$ systems. Explain why scaling to $M = 2^{100}$ (typical for modern codes with 100-bit messages) is fundamentally challenging for the autoencoder approach.
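Part (a) is a two-line computation; a sketch:

```python
import math

M, n = 16, 7
R = math.log2(M) / n             # bits per channel use
snr = 10 ** (7 / 10)             # 7 dB converted to linear scale
C = 0.5 * math.log2(1 + snr)     # AWGN capacity per real channel use
print(round(R, 3), round(C, 3))  # rate ~0.571 sits well below C ~1.294
```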

ex-ch31-16

Hard

(Over-the-air federated aggregation.) In over-the-air computation (AirComp), $C$ clients simultaneously transmit their model updates $\Delta\mathbf{w}_c \in \mathbb{R}^d$ over a MAC channel. The server receives:

$$\mathbf{y} = \sum_{c=1}^C h_c \, \mathbf{x}_c + \mathbf{n}$$

where $h_c$ is the channel coefficient from client $c$, $\mathbf{x}_c$ is the transmitted signal, and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$.

(a) To achieve the desired aggregation $\bar{\mathbf{w}} = \frac{1}{C}\sum_c \Delta\mathbf{w}_c$, each client must pre-equalise: $\mathbf{x}_c = \frac{1}{h_c}\Delta\mathbf{w}_c$. Show that the server estimate is then $\hat{\bar{\mathbf{w}}} = \frac{1}{C}\mathbf{y} = \bar{\mathbf{w}} + \frac{1}{C}\mathbf{n}$.

(b) The pre-equalisation requires transmit power $P_c = \frac{1}{|h_c|^2}\|\Delta\mathbf{w}_c\|^2$. If client $c$ has a power constraint $P_c \leq P_{\max}$, show that the maximum model update norm is $\|\Delta\mathbf{w}_c\| \leq |h_c|\sqrt{P_{\max}}$.

(c) To handle power-limited clients, a common approach is to truncate (clip) the update: $\Delta\mathbf{w}_c^{\text{clip}} = \Delta\mathbf{w}_c \cdot \min(1, \tau / \|\Delta\mathbf{w}_c\|)$ where $\tau = |h_c|\sqrt{P_{\max}}$. Derive the MSE of the aggregation when $K$ out of $C$ clients are clipped.

(d) Compare the communication cost (channel uses) of AirComp vs orthogonal TDMA for $C = 20$ clients with $d = 10^4$ parameters.
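A small simulation of part (a)'s pre-equalisation (dimensions, channel distribution, and noise level below are illustrative choices) confirms that superposition plus scaling by $1/C$ leaves only the $\mathbf{n}/C$ residual, and that AirComp needs $d$ channel uses versus $Cd$ for orthogonal TDMA:

```python
import random

random.seed(0)
C, d, sigma = 20, 1000, 0.1               # illustrative sizes

updates = [[random.gauss(0, 1) for _ in range(d)] for _ in range(C)]
h = [random.uniform(0.5, 2.0) for _ in range(C)]

# each client pre-equalises: x_c = delta_w_c / h_c, so h_c * x_c = delta_w_c
est, true = [], []
for i in range(d):
    y_i = sum(h[c] * (updates[c][i] / h[c]) for c in range(C))
    y_i += random.gauss(0, sigma)         # receiver noise on this channel use
    est.append(y_i / C)
    true.append(sum(updates[c][i] for c in range(C)) / C)

mse = sum((e - t) ** 2 for e, t in zip(est, true)) / d
print(mse)                                # ~ (sigma / C)^2 = 2.5e-5
print(d, C * d)                           # channel uses: AirComp vs TDMA
```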