Exercises
ex-ch28-01
Easy: Show that the information bottleneck Lagrangian $\mathcal{L} = I(X;T) - \beta I(T;Y)$ is bounded below by $-\beta I(X;Y)$ for any encoder $p(t|x)$ satisfying the Markov chain $T - X - Y$.
Use the data processing inequality on the chain $T - X - Y$.
What is the maximum possible value of $I(T;Y)$?
Apply the data processing inequality
By the Markov chain $T - X - Y$ and the data processing inequality: $I(T;Y) \le I(X;Y)$.
Bound the Lagrangian
Since $I(X;T) \ge 0$: $\mathcal{L} = I(X;T) - \beta I(T;Y) \ge -\beta I(T;Y) \ge -\beta I(X;Y)$.
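A quick numerical sanity check of this bound (a minimal sketch in Python/numpy; the alphabet sizes, $\beta$, and the random joint distribution and encoder are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(pxy):
    """Mutual information (bits) of a joint distribution given as a matrix."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return float(np.sum(pxy * np.log2(pxy / (px @ py))))

# Random joint p(x,y) and random encoder p(t|x); T depends on (X,Y) only through X.
nx, ny, nt, beta = 4, 3, 5, 2.0
pxy = rng.random((nx, ny)); pxy /= pxy.sum()
pt_x = rng.random((nx, nt)); pt_x /= pt_x.sum(axis=1, keepdims=True)

px = pxy.sum(axis=1)
pxt = px[:, None] * pt_x            # joint p(x,t)
pty = pt_x.T @ pxy                  # joint p(t,y) via the Markov chain T - X - Y
L = mi(pxt) - beta * mi(pty)
print(L, -beta * mi(pxy), L >= -beta * mi(pxy))   # the lower bound holds
```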
ex-ch28-02
Easy: For the binary IB example (Section 1), compute $I(X;Y)$ when $Y$ is the output of a BSC with crossover probability $p$ and $X$ is uniform.
Use $I(X;Y) = H(Y) - H(Y|X)$.
Compute conditional entropy
$H(Y|X) = H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$ bits.
Compute mutual information
Since $X$ is uniform, $Y$ is also uniform (by symmetry of the BSC with uniform input), so $H(Y) = 1$ bit. Therefore $I(X;Y) = 1 - H_b(p)$ bits; for example, $p = 0.1$ gives $1 - 0.469 = 0.531$ bits.
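The computation is easy to verify numerically (a sketch; the value $p = 0.1$ is the illustrative crossover probability used above):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.1
print(h2(p), 1.0 - h2(p))   # H(Y|X) = 0.469 bits, I(X;Y) = 0.531 bits
```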
ex-ch28-03
Easy: Verify that the Xu-Raginsky bound gives a trivial (vacuous) result when $I(W;S) \ge 2n$ for a loss bounded in $[0,1]$. What does this mean for deterministic learning algorithms?
The bound is $\big|\mathbb{E}[\mathrm{gen}(W,S)]\big| \le \sqrt{\frac{2\sigma^2}{n} I(W;S)}$. When is this $\ge 1$?
Check when the bound exceeds 1
A loss bounded in $[0,1]$ is $\sigma$-sub-Gaussian with $\sigma = 1/2$, so the bound gives $\sqrt{I(W;S)/(2n)}$. When $I(W;S) \ge 2n$, the bound is $\ge 1$, which is vacuous since the loss is bounded in $[0,1]$.
Deterministic algorithms
For a deterministic algorithm, $W = \mathcal{A}(S)$ is a function of $S$, so $I(W;S) = H(W)$. For continuous $W$ (e.g., neural network weights), $I(W;S) = \infty$, and the bound is infinite. This is why MI bounds require randomized algorithms.
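A minimal numeric illustration of the vacuousness threshold (the value of $n$ is arbitrary; `np.inf` stands in for a deterministic algorithm with continuous weights):

```python
import numpy as np

# Xu-Raginsky bound sqrt(I / (2n)) for a [0,1]-bounded loss (sigma = 1/2), I in nats.
n = 1000
for I in (10.0, 2.0 * n, np.inf):
    print(I, np.sqrt(I / (2 * n)))   # reaches 1 (vacuous) at I = 2n; infinite beyond
```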
ex-ch28-04
Easy: In AirComp with $K$ users and perfect CSI, if all channels are equal ($h_k = h$ for all $k$), what is the MSE of estimating the mean $\bar{x} = \frac{1}{K}\sum_{k=1}^K x_k$ with channel inversion? Assume unit-variance sources ($\mathbb{E}[x_k^2] = 1$), per-user power constraint $P$, and receiver noise variance $\sigma^2$.
With equal channels, channel inversion is just a common scaling by $1/h$.
Compute the scaling
With $h_k = h$ for all $k$, channel inversion gives precoders $b_k = \sqrt{\eta}/h$. Power constraint: $|b_k|^2\,\mathbb{E}[x_k^2] = \eta/h^2 \le P$, so $\eta = P h^2$.
Compute MSE
The received signal is $y = \sqrt{\eta}\sum_k x_k + n$, so $\hat{\bar{x}} = \frac{y}{K\sqrt{\eta}} = \bar{x} + \frac{n}{K\sqrt{\eta}}$ and $\mathrm{MSE} = \frac{\sigma^2}{K^2\eta} = \frac{\sigma^2}{K^2 P h^2}$. With equal channels, the MSE decreases as $1/K^2$, reflecting perfect coherent combining of $K$ users (array gain).
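A Monte Carlo check of the closed form (a sketch; $K$, $P$, $h$, and $\sigma$ are arbitrary test values, with unit-variance sources as assumed above):

```python
import numpy as np

rng = np.random.default_rng(1)
K, P, h, sigma, trials = 10, 1.0, 0.5, 0.2, 200_000

x = rng.standard_normal((trials, K))       # unit-variance sources
eta = P * h**2                             # channel-inversion gain: b_k = sqrt(eta)/h
n = sigma * rng.standard_normal(trials)
y = np.sqrt(eta) * x.sum(axis=1) + n       # received superposition
est = y / (K * np.sqrt(eta))               # server's estimate of the mean
mse = np.mean((est - x.mean(axis=1)) ** 2)
print(mse, sigma**2 / (K**2 * P * h**2))   # empirical vs analytical MSE
```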
ex-ch28-05
Easy: Show that the information curve $I_Y(R) = \max\{I(T;Y) : I(X;T) \le R,\ T - X - Y\}$ is concave in $R$.
Use time-sharing between two bottleneck mappings.
Time-sharing argument
Let $p_1(t|x)$ achieve the point $(R_1, I_1) = (I(X;T_1), I(T_1;Y))$ and $p_2(t|x)$ achieve $(R_2, I_2)$. For $\lambda \in [0,1]$, consider the time-shared mapping $T = (T_Q, Q)$, where $Q \in \{1,2\}$ is independent of $(X,Y)$ with $\Pr(Q=1) = \lambda$, and $T_q$ is generated from $X$ by $p_q(t|x)$.
Evaluate the time-shared mapping
Since $Q$ is independent of $(X,Y)$: $I(X;T) = I(X;Q) + I(X;T_Q \mid Q) = \lambda R_1 + (1-\lambda)R_2$, and likewise $I(T;Y) = \lambda I_1 + (1-\lambda)I_2$. The point $\big(\lambda R_1 + (1-\lambda)R_2,\ \lambda I_1 + (1-\lambda)I_2\big)$ is therefore achievable, so the information curve lies above each of its chords, i.e., it is concave.
ex-ch28-06
Medium: Derive the PAC-Bayes bound from the Donsker-Varadhan inequality. Specifically, show that for any measurable $f$ and distributions $P, Q$ over hypotheses: $\mathbb{E}_{h\sim Q}[f(h)] \le D_{\mathrm{KL}}(Q\,\|\,P) + \log \mathbb{E}_{h\sim P}\big[e^{f(h)}\big]$.
Start with the definition of KL divergence $D_{\mathrm{KL}}(Q\|P) = \mathbb{E}_Q\big[\log\frac{dQ}{dP}\big]$.
Use the log-sum inequality or Jensen's inequality.
Write the KL divergence
$D_{\mathrm{KL}}(Q\|P) = \mathbb{E}_Q\big[\log\frac{dQ}{dP}\big]$, so $\mathbb{E}_Q\big[\log\frac{dP}{dQ}\big] = -D_{\mathrm{KL}}(Q\|P)$.
Rearrange
$\mathbb{E}_Q[f(h)] = \mathbb{E}_Q\big[\log\big(e^{f(h)}\tfrac{dP}{dQ}\big)\big] + D_{\mathrm{KL}}(Q\|P)$.
Bound using Jensen
By Jensen's inequality (concavity of $\log$): $\mathbb{E}_Q\big[\log\big(e^{f(h)}\tfrac{dP}{dQ}\big)\big] \le \log\mathbb{E}_Q\big[e^{f(h)}\tfrac{dP}{dQ}\big] = \log\mathbb{E}_P\big[e^{f(h)}\big]$. Combining: $\mathbb{E}_Q[f(h)] \le D_{\mathrm{KL}}(Q\|P) + \log\mathbb{E}_P\big[e^{f(h)}\big]$.
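The inequality is easy to test on random discrete distributions (a sketch; the support size and random draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 8
P = rng.random(m); P /= P.sum()
Q = rng.random(m); Q /= Q.sum()
f = rng.standard_normal(m)

kl = np.sum(Q * np.log(Q / P))              # KL divergence in nats
lhs = np.sum(Q * f)                         # E_Q[f]
rhs = kl + np.log(np.sum(P * np.exp(f)))    # KL + log E_P[e^f]
print(lhs <= rhs, lhs, rhs)                 # True: Donsker-Varadhan holds
```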
ex-ch28-07
Medium: A distributed mean estimation system has $K$ users, each observing $n$ samples from $\mathcal{N}(\theta, \sigma^2 I_d)$ with $\theta \in \mathbb{R}^d$. Each user can send $B = d/2$ bits. Is communication or statistics the bottleneck? What is the expected MSE?
Compare $B$ to $d$.
The MSE is $\frac{\sigma^2 d}{nK}\max\big\{1, \frac{d}{B}\big\}$.
Compute parameters
Total samples: $Kn$. Total budget: $KB = Kd/2$ bits. Threshold: communication limits the accuracy once $B < d$ (less than one bit per coordinate). Since $B = d/2 < d$, communication is the bottleneck.
Compute MSE
$\mathrm{MSE} = \frac{\sigma^2 d}{nK}\cdot\frac{d}{B} = \frac{2\sigma^2 d}{nK}$. The communication-limited MSE ($2\sigma^2 d/nK$) is twice the statistical limit ($\sigma^2 d/nK$).
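A small calculator for the two regimes (a sketch; the helper `dme_mse` and the parameter values are illustrative):

```python
def dme_mse(sigma2, d, n, K, B):
    """Minimax MSE sketch: communication-limited when B < d, else statistics-limited."""
    stat_limit = sigma2 * d / (n * K)
    return stat_limit * max(1.0, d / B)

sigma2, d, n, K = 1.0, 1000, 100, 50
print(dme_mse(sigma2, d, n, K, B=d // 2))   # B = d/2: twice the statistical limit
print(dme_mse(sigma2, d, n, K, B=2 * d))    # B = 2d: statistics is the bottleneck
```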
ex-ch28-08
Medium: Prove that the IB self-consistent equations have the same structure as the Blahut-Arimoto algorithm for rate-distortion. Specifically, identify the "distortion measure" and show that the encoder update has the exponential form $p(t|x) \propto p(t)\,e^{-\beta\, d(x,t)}$.
Define $d(x,t) = D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)$.
Define the distortion
Set $d(x,t) = D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)$. This measures how well the cluster representative $p(y|t)$ approximates the true conditional $p(y|x)$.
Rewrite the IB Lagrangian
Using the Markov chain $T - X - Y$, one can show $I(X;Y) - I(T;Y) = \mathbb{E}_{p(x,t)}[d(X,T)]$, hence $\mathcal{L} = I(X;T) - \beta I(T;Y) = I(X;T) + \beta\,\mathbb{E}[d(X,T)] - \beta I(X;Y)$. Since $I(X;Y)$ is constant, minimizing $\mathcal{L}$ is equivalent to minimizing $I(X;T) + \beta\,\mathbb{E}[d(X,T)]$.
Apply the BA structure
This is exactly the rate-distortion Lagrangian with distortion $d(x,t)$ and multiplier $\beta$. The Blahut-Arimoto encoder update is $p(t|x) = \frac{p(t)\,e^{-\beta d(x,t)}}{Z(x,\beta)}$, matching the IB self-consistent equation.
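The correspondence suggests a direct implementation of the self-consistent updates. Below is a minimal sketch (the helper `ib_update` is hypothetical, assumes strictly positive distributions, and does no convergence checking; the binary joint is an illustrative uniform-input BSC(0.1)):

```python
import numpy as np

def ib_update(pxy, nt, beta, iters=300, seed=0):
    """Iterate the IB self-consistent equations (Blahut-Arimoto style)."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)
    py_x = pxy / px[:, None]                     # p(y|x)
    pt_x = rng.random((px.size, nt))
    pt_x /= pt_x.sum(axis=1, keepdims=True)      # random initial encoder p(t|x)
    for _ in range(iters):
        pt = np.maximum(px @ pt_x, 1e-12)        # p(t) = sum_x p(x) p(t|x)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]   # p(y|t)
        # Distortion d(x,t) = KL( p(y|x) || p(y|t) )
        d = np.einsum('xy,xty->xt', py_x,
                      np.log(py_x[:, None, :] / py_t[None, :, :]))
        pt_x = pt[None, :] * np.exp(-beta * d)   # exponential encoder update
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    return pt_x

pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])                   # uniform X through a BSC(0.1)
print(ib_update(pxy, nt=2, beta=5.0).round(3))   # non-trivial, near-hard clustering
```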
ex-ch28-09
Medium: For the random dithering quantizer with $b$ bits per coordinate (Section 3, Example 1), compute the expected number of bits needed to achieve total MSE $\epsilon$ for the mean of $K$ users' $d$-dimensional gradients, assuming i.i.d. components bounded in $[-1,1]$.
Per-user quantization variance is $O\big(1/(2^b-1)^2\big)$ per coordinate. After averaging $K$ users, the variance is divided by $K$.
Compute MSE per coordinate after averaging
Each user quantizes with variance at most $\frac{1}{(2^b-1)^2}$ per coordinate (uniform dithered quantization of $[-1,1]$ with step $\Delta = \frac{2}{2^b-1}$ has variance $\le \Delta^2/4$). After averaging $K$ independent quantized vectors, the MSE per coordinate is $\frac{1}{K(2^b-1)^2}$. Total MSE across $d$ dimensions: $\frac{d}{K(2^b-1)^2}$.
Solve for b
Require $\frac{d}{K(2^b-1)^2} \le \epsilon$, so $2^b - 1 \ge \sqrt{\frac{d}{K\epsilon}}$, giving $b \ge \log_2\Big(1 + \sqrt{\frac{d}{K\epsilon}}\Big)$.
Total communication
Total bits per round $= Kdb$. For example, with $K = 100$, $d = 10^6$, $\epsilon = 10^{-2}$: $\sqrt{d/(K\epsilon)} = 10^3$, so $b = 10$ bits, i.e., $db = 10$ Mbits per user per round.
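The bit computation as a sketch (using the illustrative parameter values above; `bits_needed` is a hypothetical helper):

```python
import numpy as np

def bits_needed(d, K, eps):
    """Smallest integer b with d / (K * (2^b - 1)^2) <= eps."""
    return int(np.ceil(np.log2(1.0 + np.sqrt(d / (K * eps)))))

K, d, eps = 100, 10**6, 1e-2
b = bits_needed(d, K, eps)
print(b, d * b / 1e6, "Mbit per user per round")   # b = 10, 10 Mbit per user
```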
ex-ch28-10
Medium: Show that for a nomographic function $f(x_1,\dots,x_K) = \psi\big(\sum_{k=1}^K \phi_k(x_k)\big)$, the geometric mean $\big(\prod_{k=1}^K x_k\big)^{1/K}$ (with $x_k > 0$) is nomographic. What are the pre- and post-processing functions?
Take the logarithm to convert a product into a sum.
Logarithm trick
$\big(\prod_k x_k\big)^{1/K} = \exp\big(\frac{1}{K}\sum_k \ln x_k\big)$. Set $\phi_k(x) = \ln x$ and $\psi(s) = e^{s/K}$.
Verify
$\psi\big(\sum_k \phi_k(x_k)\big) = \exp\big(\frac{1}{K}\sum_k \ln x_k\big) = \big(\prod_k x_k\big)^{1/K}$. This is nomographic, so the geometric mean can be computed over the air using the MAC with logarithmic pre-processing and exponential post-processing.
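A one-line verification of the identity (a sketch; the inputs are random positive values):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.1, 5.0, size=8)        # positive inputs x_k

def phi(v):                               # pre-processing at each user
    return np.log(v)

def psi(s):                               # post-processing at the server
    return np.exp(s / x.size)

over_the_air = psi(np.sum(phi(x)))        # the MAC delivers the sum of phi(x_k)
direct = np.prod(x) ** (1.0 / x.size)     # geometric mean computed directly
print(np.isclose(over_the_air, direct))   # True
```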
ex-ch28-11
Hard: Derive the individual-sample MI bound. Start with the decomposition $\mathbb{E}[\mathrm{gen}(W,S)] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\ell(W,\bar{Z}_i) - \ell(W,Z_i)\big]$, where $\bar{Z}_i$ is an independent copy of $Z_i$, and show that $\big|\mathbb{E}[\mathrm{gen}(W,S)]\big| \le \frac{1}{n}\sum_{i=1}^n \sqrt{2\sigma^2\, I(W;Z_i)}$.
Use the supersample technique: consider $\bar{S} = (\bar{Z}_1,\dots,\bar{Z}_n)$, where each $\bar{Z}_i$ is an independent copy of $Z_i$.
Apply the single-sample Xu-Raginsky bound to each term.
Supersample decomposition
Let $\bar{Z}_1,\dots,\bar{Z}_n$ be independent copies of $Z_1,\dots,Z_n$. Define $S^{(i)} = (Z_1,\dots,Z_{i-1},\bar{Z}_i,Z_{i+1},\dots,Z_n)$ (replace sample $i$ with its copy). Then: $\mathbb{E}[\mathrm{gen}(W,S)] = \frac{1}{n}\sum_{i=1}^n \big(\mathbb{E}[\ell(W,\bar{Z}_i)] - \mathbb{E}[\ell(W,Z_i)]\big)$, since $\mathbb{E}[\ell(W,\bar{Z}_i)] = \mathbb{E}[L_\mu(W)]$ for $\bar{Z}_i$ independent of $W$.
Apply the single-sample bound
For each $i$, note that $\bar{Z}_i$ is independent of $W$, so $\mathbb{E}[\ell(W,\bar{Z}_i)]$ is an expectation under the product of marginals $P_W \otimes P_{Z_i}$, while $\mathbb{E}[\ell(W,Z_i)]$ is taken under the joint $P_{W,Z_i}$. Applying the Donsker-Varadhan (single-sample Xu-Raginsky) bound to the pair $(W,Z_i)$, with $\ell$ $\sigma$-sub-Gaussian: $\big|\mathbb{E}[\ell(W,Z_i)] - \mathbb{E}[\ell(W,\bar{Z}_i)]\big| \le \sqrt{2\sigma^2\, I(W;Z_i)}$.
Sum over samples
Summing over $i$ and using the triangle inequality: $\big|\mathbb{E}[\mathrm{gen}(W,S)]\big| \le \frac{1}{n}\sum_{i=1}^n \sqrt{2\sigma^2\, I(W;Z_i)}$. By Jensen's inequality and $\sum_i I(W;Z_i) \le I(W;S)$ (which holds for independent samples), this bound is never weaker than the full-sample bound $\sqrt{2\sigma^2 I(W;S)/n}$.
ex-ch28-12
Hard: Consider the Gaussian IB: $X \sim \mathcal{N}(0, \Sigma_x)$ and $Y = AX + \xi$ for some matrix $A$ and Gaussian noise $\xi \sim \mathcal{N}(0, \Sigma_\xi)$ independent of $X$. Show that the optimal IB mapping is linear: $T = BX + Z$, where $Z \sim \mathcal{N}(0, \Sigma_z)$ is independent of $X$. Derive the information curve.
Jointly Gaussian distributions are preserved under linear transformations.
Use the entropy-power inequality or direct Gaussian optimization.
Establish Gaussianity
For jointly Gaussian $(X,Y)$, the maximum of $I(T;Y)$ for a given $I(X;T)$ is achieved by a $T$ jointly Gaussian with $X$, because Gaussian distributions maximize entropy for a given covariance (this can be made rigorous via the entropy-power inequality). A Gaussian $T$ jointly distributed with the Gaussian $X$ can always be written in the form $T = BX + Z$ with $Z \sim \mathcal{N}(0,\Sigma_z)$ independent of $X$.
Compute the mutual informations
$I(X;T) = \frac{1}{2}\log\frac{\det\big(B\Sigma_x B^\top + \Sigma_z\big)}{\det\Sigma_z}$. Similarly, $I(T;Y) = \frac{1}{2}\log\frac{\det\Sigma_T\,\det\Sigma_Y}{\det\Sigma_{[T;Y]}}$, computable in closed form from the joint covariance of $(T,Y)$.
Water-filling structure
The optimization over $B$ and $\Sigma_z$ has a water-filling solution in the canonical correlation basis of $(X,Y)$. Each canonical component $i$ (with correlation $\rho_i$) is either "on" (included in $T$) or "off" (discarded), with the threshold determined by $\beta$: component $i$ turns on once $\beta > 1/\rho_i^2$. The information curve is piecewise smooth in the canonical correlation basis, with a new segment beginning each time an additional component turns on.
ex-ch28-13
Hard: In the distributed estimation setting, consider the case where users have heterogeneous sample sizes: user $k$ has $n_k$ samples, with $\sum_k n_k = N$, but the $n_k$ may differ. Show that the optimal bit allocation is $b_k^* = \frac{B}{K} + \frac{1}{2}\log_2\frac{n_k}{(\prod_j n_j)^{1/K}}$ (allocate more bits to users with more data) and derive the resulting MSE.
The local Fisher information at user $k$ is proportional to $n_k$.
Think of it as a rate-distortion problem: allocate bits to minimize total MSE.
Local sufficient statistics
User $k$'s sufficient statistic is the local mean $\hat{\theta}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} X_{k,i} \sim \mathcal{N}(\theta, \sigma^2/n_k)$. The local Fisher information is $n_k/\sigma^2$.
Rate-distortion formulation
User $k$ quantizes $\hat{\theta}_k$ with $b_k$ bits. The quantization MSE per dimension is at least $\frac{\sigma^2}{n_k}2^{-2b_k}$ (Gaussian rate-distortion with source variance $\sigma^2/n_k$). The server forms the weighted average $\hat{\theta} = \sum_k w_k \hat{\theta}_k^{(q)}$ with weights $w_k = n_k/N$.
Optimize bit allocation
The total MSE is $\sum_k w_k^2\big(\frac{\sigma^2}{n_k} + \frac{\sigma^2}{n_k}2^{-2b_k}\big) = \frac{\sigma^2}{N} + \sum_k \frac{\sigma^2 n_k}{N^2}2^{-2b_k}$. Minimizing subject to $\sum_k b_k = B$ using Lagrange multipliers gives $b_k^* = \frac{B}{K} + \frac{1}{2}\log_2\frac{n_k}{(\prod_j n_j)^{1/K}}$ (growing with $\log n_k$, so users with more data get more bits). The resulting MSE is $\frac{\sigma^2}{N}\Big(1 + \frac{K(\prod_j n_j)^{1/K}}{N}\,2^{-2B/K}\Big)$, matching the homogeneous case when $n_k = n$ for all $k$.
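A sketch of the allocation and the resulting MSE, compared against uniform allocation (assumes $\sigma^2 = 1$ and a budget $B$ large enough that all $b_k \ge 0$; `allocate_bits` and `mse` are hypothetical helpers):

```python
import numpy as np

def allocate_bits(n, B):
    """b_k = B/K + 0.5*log2(n_k / geometric_mean(n))."""
    geo = np.exp(np.mean(np.log(n)))
    return B / len(n) + 0.5 * np.log2(n / geo)

n = np.array([10.0, 100.0, 1000.0, 10000.0])   # heterogeneous sample sizes
B = 40.0                                       # total bit budget
b = allocate_bits(n, B)
N = n.sum()

def mse(bits):
    """Statistical term plus quantization term from the solution (sigma^2 = 1)."""
    return 1.0 / N + np.sum(n / N**2 * 2.0 ** (-2.0 * bits))

print(b, b.sum())                              # allocation sums to B
print(mse(b), mse(np.full(4, B / 4)))          # optimal allocation beats uniform
```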
ex-ch28-14
Hard: Analyze the MSE of truncated channel inversion for AirComp. Users with $|h_k|^2 < g_{\mathrm{th}}$ are excluded from the current round. Derive the bias-variance tradeoff as a function of the threshold $g_{\mathrm{th}}$ under Rayleigh fading.
Under Rayleigh fading, $|h_k|^2 \sim \mathrm{Exp}(1)$. The probability of exclusion is $1 - e^{-g_{\mathrm{th}}}$.
The bias comes from missing users; the variance comes from the noise amplification.
Compute participation probability
Under Rayleigh fading, $|h_k|^2 \sim \mathrm{Exp}(1)$, so $\Pr\big(|h_k|^2 \ge g_{\mathrm{th}}\big) = e^{-g_{\mathrm{th}}}$. On average, $K e^{-g_{\mathrm{th}}}$ users participate per round.
Compute bias
The server estimates $\hat{f} = \frac{1}{K}\sum_{k\in\mathcal{K}} x_k$, where $\mathcal{K} = \{k : |h_k|^2 \ge g_{\mathrm{th}}\}$ is the participating set. The bias is $\mathbb{E}[\hat{f}] - \mu = -(1 - e^{-g_{\mathrm{th}}})\,\mu$ (assuming $\mathbb{E}[x_k] = \mu$ for all $k$). The squared bias is $\mu^2\big(1 - e^{-g_{\mathrm{th}}}\big)^2$.
Compute variance
Conditioned on $\mathcal{K}$, the variance from noise is $\frac{\sigma^2}{K^2\eta}$, where $\eta = P\,g_{\mathrm{th}}$ (every participating channel satisfies $|h_k|^2 \ge g_{\mathrm{th}}$, so the power constraint binds at the threshold). The total MSE is $\mathrm{MSE}(g_{\mathrm{th}}) = \mu^2\big(1-e^{-g_{\mathrm{th}}}\big)^2 + \frac{\sigma^2}{K^2 P\, g_{\mathrm{th}}}$: raising $g_{\mathrm{th}}$ reduces noise amplification but excludes more users. Optimizing over $g_{\mathrm{th}}$ yields the optimal threshold.
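Numerically optimizing the derived expression over the threshold (a sketch; $K$, $P$, $\sigma$, and $\mu$ are illustrative values):

```python
import numpy as np

K, P, sigma, mu = 20, 1.0, 0.5, 1.0

def mse(g_th):
    """Squared bias plus noise variance at truncation threshold g_th (Rayleigh fading)."""
    bias2 = mu**2 * (1.0 - np.exp(-g_th)) ** 2   # excluded users shrink the estimate
    noise = sigma**2 / (K**2 * P * g_th)         # eta = P * g_th
    return bias2 + noise

ths = np.linspace(0.01, 1.0, 200)
best = ths[int(np.argmin([mse(g) for g in ths]))]
print(best, mse(best))                           # numerically optimal threshold
```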
ex-ch28-15
Challenge: Prove that adding Gaussian noise to the output of a learning algorithm achieves the optimal rate in the MI generalization bound, in the following sense: for the class of all algorithms with output $W \in \mathbb{R}^d$ satisfying $\mathbb{E}\|W\|^2 \le dP$, the Gaussian mechanism $W' = W + N$ with $N \sim \mathcal{N}(0, \sigma_N^2 I_d)$ achieves $I(W';S) \le \frac{d}{2}\log\big(1 + \frac{P}{\sigma_N^2}\big)$, and show this is tight (no other noise distribution can achieve smaller MI for the same output distortion).
Use the maximum entropy property of the Gaussian distribution.
The MI is bounded by the AWGN channel capacity.
Upper bound on MI
$I(W';S) \le I(W';W) = h(W+N) - h(N)$ by the data processing inequality on $S \to W \to W'$, and $h(W+N)$ is maximized when $W$ is Gaussian, since the Gaussian maximizes entropy under the norm constraint. By the capacity of the AWGN channel: $I(W';S) \le \frac{d}{2}\log\big(1 + \frac{P}{\sigma_N^2}\big)$.
Tightness: Gaussian noise is optimal
For a fixed noise power $\mathbb{E}\|N\|^2$ (the output distortion), we want to minimize $I(W';S) = h(W') - h(W'|S)$. By the entropy-power inequality, Gaussian noise maximizes the conditional entropy $h(W'|S)$ for a given noise variance. Meanwhile $h(W')$ is bounded above regardless of the noise distribution (since $\mathbb{E}\|W'\|^2 \le \mathbb{E}\|W\|^2 + \mathbb{E}\|N\|^2$ caps it at the entropy of a Gaussian with the same second moment), so maximizing $h(W'|S)$ minimizes the MI. Gaussian noise achieves this maximum.
Combine
The Gaussian mechanism is MI-optimal: it minimizes $I(W';S)$ for a given output perturbation level $\sigma_N^2$. Combined with the Xu-Raginsky bound, this gives the tightest MI generalization bound achievable by output perturbation.
ex-ch28-16
Medium: Show that the computation capacity of the $K$-user Gaussian MAC for the sum function $f(x_1,\dots,x_K) = \sum_k x_k$ equals the sum-rate capacity $C_{\mathrm{sum}} = \frac{1}{2}\log\big(1 + \frac{KP}{\sigma^2}\big)$. Explain why this is not the case for the MAX function $f(x_1,\dots,x_K) = \max_k x_k$.
For the sum, all users can transmit coherently. For the MAX, the receiver needs to distinguish individual values.
Sum: achievability
Each user transmits with power $P$. The receiver gets $y = \sum_k x_k + n$ and needs to decode only the sum, not the individual codewords. The superposition is an AWGN channel with total received power $KP$ and noise $\sigma^2$, so the computation rate is $\frac{1}{2}\log\big(1 + \frac{KP}{\sigma^2}\big)$, matching the sum-rate MAC capacity.
MAX: separation required
To compute $\max_k x_k$, the receiver must decode each $x_k$ individually (or at least determine the ordering). This requires the MAC to support individual decoding, which limits the rates to the MAC capacity region. The per-user computation rate is at most $\frac{1}{K}\cdot\frac{1}{2}\log\big(1 + \frac{KP}{\sigma^2}\big)$, which is $K$ times smaller than the sum computation rate. The MAX function is not nomographic and cannot exploit the MAC superposition.
ex-ch28-17
Medium: In federated learning with $K$ users, model dimension $d$, and a communication budget of $B = 20d$ bits per round per user (20 bits per coordinate), how many SGD rounds $T$ are needed to achieve final MSE $\epsilon$ for a strongly convex objective with parameter $\mu$? Compare with unlimited communication.
For a strongly convex objective, SGD converges as $O\big(\sigma_g^2/(\mu T)\big)$ with exact (unquantized) gradient communication, where $\sigma_g^2$ is the stochastic gradient variance.
With quantization noise, the convergence rate becomes $O\big((\sigma_g^2 + \sigma_q^2)/(\mu T)\big)$.
Quantization MSE
Bits per coordinate: $b = B/d = 20$. Quantization MSE per coordinate (after averaging $K$ users, unit dynamic range): $\sigma_q^2 = \frac{1}{K(2^b-1)^2} \approx \frac{10^{-12}}{K}$. This is negligible for any reasonable $K$.
Convergence rate
With 20 bits per coordinate, the quantization noise is essentially zero compared to the stochastic gradient noise; convergence is dominated by SGD variance, not communication. This shows that $b = 20$ bits is more than sufficient. In practice, 4-8 bits per coordinate often suffice for federated learning.
Conclusion
The number of rounds is $T = O\big(\sigma_g^2/(\mu\epsilon)\big)$, the same as with unlimited communication. Communication becomes the bottleneck only when $b$ is very small (1-3 bits per coordinate) or when the dimension $d$ is extremely large relative to the per-round budget $B$.
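The negligibility claim in numbers (illustrative values):

```python
# Quantization noise per coordinate at b = 20 bits, unit dynamic range,
# after averaging K = 100 users (both values illustrative):
K, b = 100, 20
print(1.0 / (K * (2**b - 1) ** 2))   # ~9e-15, far below typical gradient noise
```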
ex-ch28-18
Challenge: Prove that the information bottleneck undergoes a phase transition at a critical $\beta_c$: for $\beta < \beta_c$, the optimal solution is independent of $X$ (trivial), and for $\beta > \beta_c$, a non-trivial solution bifurcates. Compute $\beta_c$ for binary $X$ and $Y$ related by a BSC with crossover probability $p$.
This is analogous to a second-order phase transition in statistical mechanics.
Analyze the stability of the trivial solution under small perturbations.
Trivial solution
For $\beta < \beta_c$, the optimal encoder is independent of $X$: $p(t|x) = p(t)$ for all $x$. This achieves $I(X;T) = 0$ and $I(T;Y) = 0$, hence $\mathcal{L} = 0$.
Perturbation analysis
Perturb around the trivial solution: $p(t|x) = p(t)\big(1 + \epsilon_t(x)\big)$, where $\sum_x p(x)\,\epsilon_t(x) = 0$ for each $t$. Expanding the IB Lagrangian to second order in $\epsilon$ gives $\Delta\mathcal{L} \propto (1 - \beta\lambda_2)\,\|\epsilon\|^2$, where $\lambda_2$ is the largest non-trivial eigenvalue of the matrix $Q(x,x') = \sum_y \frac{p(y|x)\,p(y|x')\,p(x')}{p(y)}$ (a normalized version of the conditional expectation operator).
Critical beta
The trivial solution is stable (a local minimum) when $\beta\lambda_2 < 1$, i.e., for $\beta < \beta_c = 1/\lambda_2$. For binary $X, Y$ related by a BSC($p$): $\lambda_2 = (1-2p)^2$, so $\beta_c = \frac{1}{(1-2p)^2}$. For $p = 0.1$: $\beta_c = 1/0.64 \approx 1.56$.
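The critical value as a function of $p$ (a sketch; running the `ib_update` sketch from ex-ch28-08 slightly below and above $\beta_c$ exhibits the bifurcation numerically):

```python
def beta_c(p):
    """Critical IB trade-off for a BSC(p); the trivial solution destabilizes here."""
    return 1.0 / (1.0 - 2.0 * p) ** 2

for p in (0.05, 0.1, 0.2):
    print(p, beta_c(p))    # p = 0.1 gives beta_c ~ 1.56
```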
ex-ch28-19
Hard: Derive the MSE of AirComp with MMSE-based aggregation instead of channel inversion. The server computes $\hat{f} = c\,y$, where $c$ minimizes the MSE $\mathbb{E}\big[(\hat{f} - \bar{x})^2\big]$ and $\bar{x} = \frac{1}{K}\sum_k x_k$. Show that the MMSE receiver avoids the power penalty of channel inversion for users in deep fade.
This is a standard LMMSE estimation problem.
The MMSE weights favor strong channels and suppress weak ones.
Signal model
Each user transmits $\sqrt{P}\,x_k$ (equal power, no channel inversion). The receiver observes $y = \sqrt{P}\sum_k h_k x_k + n$ with $n \sim \mathcal{N}(0, \sigma^2)$. Let the $x_k$ be i.i.d. with zero mean and unit variance, and define $g = \sum_k h_k$ and $G = \sum_k h_k^2$.
LMMSE receiver
The LMMSE estimate of $\bar{x}$ from $y$ is: $\hat{f} = \frac{\mathbb{E}[\bar{x}\,y]}{\mathbb{E}[y^2]}\;y = \frac{\sqrt{P}\,g/K}{P G + \sigma^2}\;y$.
MSE
$\mathrm{MSE} = \frac{1}{K}\Big(1 - \frac{P g^2/K}{P G + \sigma^2}\Big)$. Unlike channel inversion, the MMSE receiver does not amplify noise from weak channels. The MSE is bounded by $1/K$ (the variance of the sample mean) and decreases with SNR. Users in deep fade contribute less but do not degrade the estimate.
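A Monte Carlo check that the LMMSE receiver stays below the prior variance $1/K$ under Rayleigh fading (a sketch; the parameters are illustrative and the fading is normalized so $\mathbb{E}[h_k^2] = 1$):

```python
import numpy as np

rng = np.random.default_rng(5)
K, P, sigma, trials = 10, 1.0, 0.3, 100_000

h = rng.rayleigh(scale=np.sqrt(0.5), size=(trials, K))   # E[h^2] = 1
x = rng.standard_normal((trials, K))                     # unit-variance sources
noise = sigma * rng.standard_normal(trials)
y = np.sqrt(P) * (h * x).sum(axis=1) + noise

g = h.sum(axis=1)                                # g = sum_k h_k per realization
G = (h**2).sum(axis=1)                           # G = sum_k h_k^2 per realization
c = (np.sqrt(P) * g / K) / (P * G + sigma**2)    # LMMSE coefficient
mse = np.mean((c * y - x.mean(axis=1)) ** 2)
print(mse, 1.0 / K)                              # empirical MSE < Var(xbar) = 1/K
```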
ex-ch28-20
Challenge: Consider the multi-round communication model for distributed estimation. At each round $t$, user $k$ sends $b$ bits based on its data and all previous server broadcasts $m_1,\dots,m_{t-1}$. The server broadcasts $m_t$ after each round. Show that $T$ rounds of interaction can reduce the communication cost compared to the one-shot lower bound $\mathrm{MSE} = \Omega\big(\frac{\sigma^2 d}{nK}\cdot\frac{d}{B}\big)$, and characterize the improvement.
Interactive protocols allow the server to "steer" future messages based on past estimates.
Think of successive refinement: each round refines the estimate.
One-shot baseline
Without interaction, the MSE is $\Theta\big(\frac{\sigma^2 d}{nK}\max\big\{1, \frac{d}{B}\big\}\big)$, where $B$ is the total one-shot budget per user.
Interactive protocol
In each round, the server broadcasts its current estimate $\hat{\theta}_t$. Users then send bits describing the residual $\hat{\theta}_k - \hat{\theta}_t$, which has a smaller dynamic range than the raw estimate. This is successive refinement of a Gaussian source.
MSE after $T$ rounds
After $T$ rounds with $b$ bits per round per user (total communication $B = Tb$ per user), the MSE improves as $\mathrm{MSE}_T \approx \frac{\sigma^2 d}{nK}\big(1 + c\,2^{-2Tb/d}\big)$ for a constant $c$: the gap to the statistical limit decays exponentially in $T$. This shows that interaction exponentially improves over one-shot communication, whose MSE decays only polynomially in the budget. The key is that each round quantizes a residual with decreasing variance, requiring fewer bits for the same relative accuracy.
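A scalar simulation of the successive-refinement mechanism (a minimal sketch; the quantizer, ranges, and constants are illustrative, not an optimal protocol):

```python
import numpy as np

b, T = 4, 6                                       # bits per round, number of rounds

def quantize(v, span, bits):
    """Uniform scalar quantizer of v on [-span, span] with 2^bits levels."""
    levels = np.linspace(-span, span, 2**bits)
    return levels[int(np.argmin(np.abs(levels - v)))]

theta_local = 0.7317                              # a user's local estimate, in [-1, 1]
server, span = 0.0, 1.0
for t in range(T):
    q = quantize(theta_local - server, span, b)   # user describes the residual
    server += q                                   # server refines its estimate
    span *= 2.0 ** (1 - b)                        # residual range shrinks each round
    print(t, abs(theta_local - server))           # error shrinks ~2^(b-1)-fold per round
```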