Exercises

ex-ch28-01

Easy

Show that the information bottleneck Lagrangian $\mathcal{L}_{\text{IB}} = I(X;T) - \beta I(T;Y)$ is bounded below by $-\beta I(X;Y)$ for any $P_{T|X}$ satisfying the Markov chain $Y \multimap X \multimap T$.
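
This is a pencil-and-paper exercise, but the claim is easy to test numerically before proving it. The sketch below (small alphabets, random joint distributions and encoders; all names are illustrative) checks the bound, exploiting that the Markov chain holds automatically when $T$ is generated from $X$ alone:

```python
# Numerical sanity check (not a proof): for random encoders P_{T|X}, verify
# I(X;T) - beta*I(T;Y) >= -beta*I(X;Y), which follows from I(X;T) >= 0 and
# the data-processing inequality I(T;Y) <= I(X;Y).
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(p_joint):
    """I(A;B) in nats for a joint pmf given as a 2-D array."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask])).sum())

nx, ny, nt, beta = 4, 3, 5, 2.0
for trial in range(1000):
    p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()        # random joint P_{XY}
    p_t_given_x = rng.random((nx, nt))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)  # random encoder P_{T|X}
    p_x = p_xy.sum(axis=1)
    p_xt = p_x[:, None] * p_t_given_x                      # Y -o X -o T holds by construction
    p_ty = p_t_given_x.T @ p_xy                            # P_{TY}(t,y) = sum_x P_{T|X} P_{XY}
    lagrangian = mutual_information(p_xt) - beta * mutual_information(p_ty)
    assert lagrangian >= -beta * mutual_information(p_xy) - 1e-9
print("bound held on all random instances")
```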

ex-ch28-02

Easy

For the binary IB example (Section 1), compute $I(X;Y)$ when $p = 0.25$ and $X$ is uniform.
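
A one-line numerical check, assuming the Section 1 example is the binary symmetric pair $Y = X \oplus N$ with $\Pr(N = 1) = p$, for which $I(X;Y) = 1 - h_b(p)$ bits under uniform $X$:

```python
# Quick check: I(X;Y) = 1 - h_b(p) bits for the binary symmetric pair
# with uniform X and crossover probability p.
from math import log2

def binary_entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

p = 0.25
print(f"I(X;Y) = 1 - h_b({p}) = {1 - binary_entropy(p):.4f} bits")  # ~0.1887
```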

ex-ch28-03

Easy

Verify that the Xu-Raginsky bound gives a trivial (vacuous) result when $I(W;S) \geq n/2$. What does this mean for deterministic learning algorithms?

ex-ch28-04

Easy

In AirComp with $K$ users and perfect CSI, if all channels are equal ($h_k = h$ for all $k$), what is the MSE of estimating $\bar{s} = \frac{1}{K}\sum_k s_k$ with channel inversion?
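
A candidate answer can be checked by simulation. The sketch below assumes channel inversion $x_k = s_k / h$ (no transmit power constraint), receiver noise of variance $\sigma^2$, and the estimate $\hat{\bar{s}} = y / K$:

```python
# Monte Carlo sketch (assumptions labeled above): with equal channels and
# channel inversion, the received signal is y = sum_k s_k + n, so the MSE of
# y/K should be sigma2 / K^2, independent of the s_k themselves.
import numpy as np

rng = np.random.default_rng(1)
K, h, sigma2, trials = 20, 0.7, 1.0, 200_000

s = rng.normal(size=(trials, K))                 # users' source values
y = (h * (s / h)).sum(axis=1) + rng.normal(scale=np.sqrt(sigma2), size=trials)
sbar_hat = y / K
mse = np.mean((sbar_hat - s.mean(axis=1)) ** 2)
print(f"empirical MSE = {mse:.5f}, predicted sigma2/K^2 = {sigma2 / K**2:.5f}")
```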

ex-ch28-05

Easy

Show that the information curve $\mathcal{I}(R)$ is concave.

ex-ch28-06

Medium

Derive the PAC-Bayes bound from the Donsker-Varadhan inequality. Specifically, show that for any measurable $f(w)$ and distributions $Q, P$ over hypotheses: $\mathbb{E}_{Q}[f(W)] \leq D(Q \| P) + \log \mathbb{E}_{P}[e^{f(W)}]$.
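
The inequality itself can be verified numerically on a finite hypothesis set before attempting the derivation; the sketch below draws random $Q$, $P$, and $f$:

```python
# Numerical illustration of Donsker-Varadhan, E_Q[f] <= D(Q||P) + log E_P[e^f],
# on a finite hypothesis set (an illustration, not a substitute for the proof).
import numpy as np

rng = np.random.default_rng(2)
m = 8
for _ in range(1000):
    q = rng.random(m); q /= q.sum()
    p = rng.random(m); p /= p.sum()
    f = rng.normal(scale=3.0, size=m)
    kl = float(np.sum(q * np.log(q / p)))
    lhs = float(q @ f)
    rhs = kl + float(np.log(p @ np.exp(f)))
    assert lhs <= rhs + 1e-9
print("Donsker-Varadhan held on all random instances")
```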

ex-ch28-07

Medium

A distributed mean estimation system has $K = 50$ users, each observing $n_k = 20$ samples from $\mathcal{N}(\theta, I_d)$ with $d = 10$. Each user can send $B_k = 100$ bits. Is communication or statistics the bottleneck? What is the expected MSE?
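
A back-of-envelope sketch for comparing the two error sources. The statistical floor for averaging all $K n_k$ Gaussian samples is $d/(K n_k)$; for the communication side we assume, purely for illustration, a per-coordinate quantizer whose per-user distortion decays as $4^{-B_k/d}$ up to a constant:

```python
# Back-of-envelope comparison: statistical floor vs. quantization error.
# The 4**(-B_k/d) scaling is an illustrative assumption (distortion ~ 2^{-2b}
# per coordinate for a b-bit quantizer), not the chapter's exact constant.
K, n_k, d, B_k = 50, 20, 10, 100

stat_mse = d / (K * n_k)                    # = 0.01
bits_per_coord = B_k / d                    # = 10 bits/coordinate
quant_factor = 4.0 ** (-bits_per_coord)     # ~ 1e-6: decays as 2^{-2b}
print(f"statistical MSE floor : {stat_mse}")
print(f"bits per coordinate   : {bits_per_coord}")
print(f"quantization factor   : {quant_factor:.1e}  (per-user, up to a constant)")
```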

ex-ch28-08

Medium

Prove that the IB self-consistent equations have the same structure as the Blahut-Arimoto algorithm for rate-distortion. Specifically, identify the "distortion measure" and show that the encoder update has the exponential form $P_{T|X}(t|x) \propto P_T(t)\, e^{-\beta d(x,t)}$.
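
For intuition, the self-consistent equations can be run as a fixed-point iteration that exactly mirrors Blahut-Arimoto. A minimal sketch on a random small-alphabet joint distribution, where the distortion is $d(x,t) = D(P_{Y|X=x} \,\|\, P_{Y|T=t})$:

```python
# Minimal iterative-IB sketch: alternate P_T, P_{Y|T}, and the exponential
# encoder update P_{T|X}(t|x) proportional to P_T(t) * exp(-beta * d(x,t)).
import numpy as np

rng = np.random.default_rng(3)

def kl_rows(a, b):
    """KL(a_i || b_j) for all row pairs; a: (nx,ny), b: (nt,ny)."""
    eps = 1e-12
    return (a[:, None, :] * (np.log(a[:, None, :] + eps)
                             - np.log(b[None, :, :] + eps))).sum(-1)

nx, ny, nt, beta = 4, 3, 3, 5.0
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

enc = rng.random((nx, nt)); enc /= enc.sum(axis=1, keepdims=True)  # P_{T|X}
for _ in range(200):
    p_t = p_x @ enc                                   # P_T(t)
    p_x_given_t = (p_x[:, None] * enc) / p_t[None, :] # decoder P_{X|T}
    p_y_given_t = p_x_given_t.T @ p_y_given_x         # P_{Y|T}(y|t)
    d = kl_rows(p_y_given_x, p_y_given_t)             # distortion d(x,t)
    enc = p_t[None, :] * np.exp(-beta * d)            # exponential encoder form
    enc /= enc.sum(axis=1, keepdims=True)
print("converged encoder P_{T|X}:\n", np.round(enc, 3))
```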

ex-ch28-09

Medium

For the random dithering quantizer with $b$ bits per coordinate (Section 3, Example 1), compute the expected number of bits needed to achieve total MSE $\leq \epsilon^2$ for the mean of $K$ users' $d$-dimensional gradients, assuming i.i.d. components with $\|g_k\|_\infty \leq G$.
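
A simulation sketch of the quantizer in question, assuming subtractive dithering with step $\Delta = 2G/(2^b - 1)$, under which the error of the $K$-user mean should have total MSE $d\Delta^2/(12K)$ (parameter values are illustrative):

```python
# Subtractive dithered uniform quantization: the per-coordinate error is
# uniform on [-Delta/2, Delta/2] with variance Delta^2/12, so the K-user mean
# has total MSE d * Delta^2 / (12 * K).
import numpy as np

rng = np.random.default_rng(4)
K, d, b, G, trials = 10, 8, 4, 1.0, 20_000

delta = 2 * G / (2**b - 1)
g = rng.uniform(-G, G, size=(K, d))                        # fixed gradients
err = 0.0
for _ in range(trials):
    u = rng.uniform(-delta / 2, delta / 2, size=(K, d))    # per-user dither
    q = np.round((g + u) / delta) * delta - u              # quantize, subtract dither
    err += np.sum((q.mean(axis=0) - g.mean(axis=0)) ** 2)
print(f"empirical total MSE = {err / trials:.6f}")
print(f"predicted d*Delta^2/(12K) = {d * delta**2 / (12 * K):.6f}")
```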

ex-ch28-10

Medium

A nomographic function has the form $f(s_1, \ldots, s_K) = \psi\bigl(\sum_k \phi_k(s_k)\bigr)$. Show that the geometric mean $\bigl(\prod_k s_k\bigr)^{1/K}$ (with $s_k > 0$) is nomographic. What are the pre- and post-processing functions?
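
A quick numerical check of one natural candidate, $\phi_k(s) = \log s$ and $\psi(u) = e^{u/K}$:

```python
# Verify that psi(sum_k phi_k(s_k)) with phi_k = log and psi(u) = exp(u/K)
# reproduces the geometric mean.
import math

s = [2.0, 8.0, 4.0]
K = len(s)
via_nomographic = math.exp(sum(math.log(x) for x in s) / K)
direct = math.prod(s) ** (1 / K)
print(via_nomographic, direct)   # both 4.0
```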

ex-ch28-11

Hard

Derive the individual-sample MI bound. Start with the decomposition $\text{gen}(W,S) = \frac{1}{n}\sum_{i=1}^n [\ell(W, Z_i') - \ell(W, Z_i)]$, where $Z_i'$ is an independent copy of $Z_i$, and show that $|\mathbb{E}[\text{gen}]| \leq \frac{1}{n}\sum_{i=1}^n \sqrt{2 I(W; Z_i)}$.

ex-ch28-12

Hard

Consider the Gaussian IB: $X \sim \mathcal{N}(0, \Sigma_X)$ and $Y \mid X \sim \mathcal{N}(AX, \Sigma_N)$ for some matrix $A$ and noise covariance $\Sigma_N$. Show that the optimal IB mapping is linear: $T = BX + \xi$, where $\xi \sim \mathcal{N}(0, \Sigma_\xi)$ is independent of $X$. Derive the information curve.

ex-ch28-13

Hard

In the distributed estimation setting, consider the case where users have heterogeneous sample sizes: user $k$ has $n_k$ samples, with $\sum_k n_k = n$ but the $n_k$ may differ. Show that the optimal bit allocation is $B_k^* \propto n_k$ (allocate more bits to users with more data) and derive the resulting MSE.

ex-ch28-14

Hard

Analyze the MSE of truncated channel inversion for AirComp. Users with $|h_k| < h_{\min}$ are excluded from the current round. Derive the bias-variance tradeoff as a function of the threshold $h_{\min}$ under Rayleigh fading.
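
A Monte Carlo sketch of the tradeoff, under stated assumptions: $|h_k|$ Rayleigh with $\mathbb{E}|h_k|^2 = 1$, users below the threshold dropped, active users inverting their channels, and the server dividing the received sum by $K$. Raising $h_{\min}$ increases the exclusion bias but caps the transmit power $s_k^2/|h_k|^2$, which is the power penalty the threshold is meant to avoid:

```python
# Truncated channel inversion under Rayleigh fading: track both the MSE
# (bias from excluded users + receiver-noise variance) and the transmit
# power that full inversion would require.
import numpy as np

rng = np.random.default_rng(5)
K, sigma2, trials = 20, 0.1, 50_000
s = rng.uniform(0.5, 1.5, size=K)      # fixed positive source values

for h_min in [0.05, 0.2, 0.5, 1.0]:
    mse, power = 0.0, 0.0
    for _ in range(trials):
        h = np.abs(rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2)
        active = h >= h_min
        y = s[active].sum() + rng.normal(scale=np.sqrt(sigma2))
        mse += (y / K - s.mean()) ** 2
        power += np.sum((s[active] / h[active]) ** 2)    # total transmit power
    print(f"h_min={h_min:4.2f}  MSE={mse/trials:.4f}  avg power={power/trials:8.2f}")
```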

ex-ch28-15

Challenge

Prove that adding Gaussian noise to the output of a learning algorithm achieves the optimal rate in the MI generalization bound, in the following sense: for the class of all algorithms with output in $\mathbb{R}^d$ satisfying $\|W\| \leq B$, the Gaussian mechanism $W + \mathcal{N}(0, \sigma^2 I_d)$ achieves $I(W;S) = O\!\left(d \log\!\left(1 + \frac{B^2}{\sigma^2}\right)\right)$, and show that this is tight (no other noise distribution can achieve smaller MI for the same output distortion).

ex-ch28-16

Medium

Show that the computation capacity of the $K$-user Gaussian MAC for the sum function equals the sum-rate capacity $\frac{1}{2}\log(1 + KP/\sigma^2)$. Explain why this is not the case for the MAX function $f(s_1, \ldots, s_K) = \max_k s_k$.

ex-ch28-17

Medium

In federated learning with $K = 10$ users, $d = 500$, and a communication budget of $B = 10{,}000$ bits per round per user, what is the maximum number of SGD rounds $T$ to achieve final MSE $\epsilon^2 = 0.01$ for a strongly convex objective with parameter $\mu$? Compare with unlimited communication.

ex-ch28-18

Challenge

Prove that the information bottleneck undergoes a phase transition at a critical $\beta_c$: for $\beta < \beta_c$, the optimal solution is $T$ independent of $X$ (trivial), while for $\beta > \beta_c$ a non-trivial solution bifurcates. Compute $\beta_c$ for binary $X$ and $Y$ with crossover probability $p$.
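
Before attempting the proof, the transition can be located empirically: sweep $\beta$, run the self-consistent IB iteration from a slightly perturbed encoder, and watch $I(X;T)$ switch from $\approx 0$ to positive. The observed switch point can then be compared with the closed-form $\beta_c$ derived in the exercise:

```python
# Empirical beta sweep for the binary-symmetric pair (crossover p): below the
# transition the perturbed encoder decays back to the trivial solution with
# I(X;T) ~ 0; above it, a non-trivial solution survives.
import numpy as np

def mi(pab):
    pa, pb = pab.sum(1, keepdims=True), pab.sum(0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa @ pb)[m])).sum())

p = 0.2
p_xy = 0.5 * np.array([[1 - p, p], [p, 1 - p]])      # uniform X, BSC(p)
p_x = p_xy.sum(1); p_ygx = p_xy / p_x[:, None]

for beta in np.arange(1.0, 4.01, 0.25):
    enc = np.array([[0.6, 0.4], [0.4, 0.6]])          # perturbed initialization
    for _ in range(3000):
        p_t = p_x @ enc
        p_ygt = (enc * p_x[:, None] / p_t[None, :]).T @ p_ygx
        d = (p_ygx[:, None, :] *
             np.log(p_ygx[:, None, :] / p_ygt[None, :, :])).sum(-1)
        enc = p_t[None, :] * np.exp(-beta * d)
        enc /= enc.sum(1, keepdims=True)
    print(f"beta = {beta:4.2f}: I(X;T) = {mi(p_x[:, None] * enc):.4f} nats")
```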

ex-ch28-19

Hard

Derive the MSE of AirComp with MMSE-based aggregation instead of channel inversion. The server computes $\hat{\bar{s}} = \mathbf{a}^\top \mathbf{Y}$, where $\mathbf{a}$ minimizes the MSE $\mathbb{E}[|\hat{\bar{s}} - \bar{s}|^2]$. Show that the MMSE receiver avoids the power penalty of channel inversion for users in deep fade.
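
A Monte Carlo comparison under simplifying assumptions (uncoded transmission $x_k = s_k$ with i.i.d. $s_k \sim \mathcal{N}(0,1)$, real channel gains, a scalar receive coefficient rather than the vector $\mathbf{a}$): the MMSE coefficient has the closed form $a^* = \mathbb{E}[y\bar{s}]/\mathbb{E}[y^2]$, while naive inversion would need transmit power $\sum_k 1/h_k^2$:

```python
# Scalar-MMSE sketch: for y = sum_k h_k s_k + n with i.i.d. s_k ~ N(0,1),
# a* = (sum_k h_k / K) / (sum_k h_k^2 + sigma2) minimizes E[(a*y - sbar)^2],
# and requires no per-user power boost even when some h_k is in deep fade.
import numpy as np

rng = np.random.default_rng(6)
K, sigma2, trials = 10, 0.5, 200_000
h = np.abs(rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2)  # one fading draw

a_star = (h.sum() / K) / (np.sum(h**2) + sigma2)   # closed-form MMSE scalar

s = rng.normal(size=(trials, K))
n = rng.normal(scale=np.sqrt(sigma2), size=trials)
y = s @ h + n
mse_mmse = np.mean((a_star * y - s.mean(axis=1)) ** 2)

print(f"MMSE receiver MSE       : {mse_mmse:.4f}")
print(f"channel-inversion power : {np.sum(1 / h**2):.1f} (vs {K} for x_k = s_k)")
print(f"weakest channel |h_min| : {h.min():.3f}")
```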

ex-ch28-20

Challenge

Consider the multi-round communication model for distributed estimation. At each round $t$, user $k$ sends $B$ bits based on its data $X_k^n$ and all previous server broadcasts $M_{\text{server}}^{1:t-1}$. The server broadcasts $B_{\text{down}}$ bits after each round. Show that $T$ rounds of interaction can reduce the communication cost compared to the one-shot lower bound $d^2/B$, and characterize the improvement.