Exercises
ex-ch28-01
Easy: Show that the information bottleneck Lagrangian $\mathcal{L} = I(X;T) - \beta I(T;Y)$ is bounded below by $-\beta I(X;Y)$ for any encoder $p(t|x)$ satisfying the Markov chain $T - X - Y$.
Use the data processing inequality on the chain $T - X - Y$.
What is the maximum possible value of $I(T;Y)$?
Apply the data processing inequality
By the Markov chain $T - X - Y$ and the data processing inequality: $I(T;Y) \le I(X;Y)$.
Bound the Lagrangian
Since $I(X;T) \ge 0$: $\mathcal{L} = I(X;T) - \beta I(T;Y) \ge -\beta I(T;Y) \ge -\beta I(X;Y)$.
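A quick numerical sanity check of this bound (a minimal sketch in Python/numpy; the alphabet sizes, $\beta$, and the random joint distribution and encoder are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(pxy):
    """Mutual information (bits) of a joint distribution given as a matrix."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return float(np.sum(pxy * np.log2(pxy / (px @ py))))

# Random joint p(x,y) and random encoder p(t|x); T depends on (X,Y) only through X.
nx, ny, nt, beta = 4, 3, 5, 2.0
pxy = rng.random((nx, ny)); pxy /= pxy.sum()
pt_x = rng.random((nx, nt)); pt_x /= pt_x.sum(axis=1, keepdims=True)

px = pxy.sum(axis=1)
pxt = px[:, None] * pt_x            # joint p(x,t)
pty = pt_x.T @ pxy                  # joint p(t,y) via the Markov chain T - X - Y
L = mi(pxt) - beta * mi(pty)
print(L, -beta * mi(pxy), L >= -beta * mi(pxy))   # the lower bound holds
```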
ex-ch28-02
Easy: For the binary IB example (Section 1), compute $I(X;Y)$ when $Y$ is the output of a BSC with crossover probability $p$ and $X$ is uniform.
Use $I(X;Y) = H(Y) - H(Y|X)$.
Compute conditional entropy
$H(Y|X) = H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$ bits.
Compute mutual information
Since $X$ is uniform, $Y$ is also uniform (by symmetry of the BSC with uniform input), so $H(Y) = 1$ bit. Therefore $I(X;Y) = 1 - H_b(p)$ bits; for example, $p = 0.1$ gives $1 - 0.469 = 0.531$ bits.
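The computation is easy to verify numerically (a sketch; the value $p = 0.1$ is the illustrative crossover probability used above):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.1
print(h2(p), 1.0 - h2(p))   # H(Y|X) = 0.469 bits, I(X;Y) = 0.531 bits
```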
ex-ch28-03
Easy: Verify that the Xu-Raginsky bound gives a trivial (vacuous) result when $I(W;S) \ge 2n$ for a loss bounded in $[0,1]$. What does this mean for deterministic learning algorithms?
The bound is $\big|\mathbb{E}[\mathrm{gen}(W,S)]\big| \le \sqrt{\frac{2\sigma^2}{n} I(W;S)}$. When is this $\ge 1$?
Check when the bound exceeds 1
A loss bounded in $[0,1]$ is $\sigma$-sub-Gaussian with $\sigma = 1/2$, so the bound gives $\sqrt{I(W;S)/(2n)}$. When $I(W;S) \ge 2n$, the bound is $\ge 1$, which is vacuous since the loss is bounded in $[0,1]$.
Deterministic algorithms
For a deterministic algorithm, $W = \mathcal{A}(S)$ is a function of $S$, so $I(W;S) = H(W)$. For continuous $W$ (e.g., neural network weights), $I(W;S) = \infty$, and the bound is infinite. This is why MI bounds require randomized algorithms.
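A minimal numeric illustration of the vacuousness threshold (the value of $n$ is arbitrary; `np.inf` stands in for a deterministic algorithm with continuous weights):

```python
import numpy as np

# Xu-Raginsky bound sqrt(I / (2n)) for a [0,1]-bounded loss (sigma = 1/2), I in nats.
n = 1000
for I in (10.0, 2.0 * n, np.inf):
    print(I, np.sqrt(I / (2 * n)))   # reaches 1 (vacuous) at I = 2n; infinite beyond
```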
ex-ch28-04
Easy: In AirComp with $K$ users and perfect CSI, if all channels are equal ($h_k = h$ for all $k$), what is the MSE of estimating the mean $\bar{x} = \frac{1}{K}\sum_{k=1}^K x_k$ with channel inversion? Assume unit-variance sources ($\mathbb{E}[x_k^2] = 1$), per-user power constraint $P$, and receiver noise variance $\sigma^2$.
With equal channels, channel inversion is just a common scaling by $1/h$.
Compute the scaling
With $h_k = h$ for all $k$, channel inversion gives precoders $b_k = \sqrt{\eta}/h$. Power constraint: $|b_k|^2\,\mathbb{E}[x_k^2] = \eta/h^2 \le P$, so $\eta = P h^2$.
Compute MSE
The received signal is $y = \sqrt{\eta}\sum_k x_k + n$, so $\hat{\bar{x}} = \frac{y}{K\sqrt{\eta}} = \bar{x} + \frac{n}{K\sqrt{\eta}}$ and $\mathrm{MSE} = \frac{\sigma^2}{K^2\eta} = \frac{\sigma^2}{K^2 P h^2}$. With equal channels, the MSE decreases as $1/K^2$, reflecting perfect coherent combining of $K$ users (array gain).
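A Monte Carlo check of the closed form (a sketch; $K$, $P$, $h$, and $\sigma$ are arbitrary test values, with unit-variance sources as assumed above):

```python
import numpy as np

rng = np.random.default_rng(1)
K, P, h, sigma, trials = 10, 1.0, 0.5, 0.2, 200_000

x = rng.standard_normal((trials, K))       # unit-variance sources
eta = P * h**2                             # channel-inversion gain: b_k = sqrt(eta)/h
n = sigma * rng.standard_normal(trials)
y = np.sqrt(eta) * x.sum(axis=1) + n       # received superposition
est = y / (K * np.sqrt(eta))               # server's estimate of the mean
mse = np.mean((est - x.mean(axis=1)) ** 2)
print(mse, sigma**2 / (K**2 * P * h**2))   # empirical vs analytical MSE
```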
ex-ch28-05
Easy: Show that the information curve $I_Y(R) = \max\{I(T;Y) : I(X;T) \le R,\ T - X - Y\}$ is concave in $R$.
Use time-sharing between two bottleneck mappings.
Time-sharing argument
Let $p_1(t|x)$ achieve the point $(R_1, I_1) = (I(X;T_1), I(T_1;Y))$ and $p_2(t|x)$ achieve $(R_2, I_2)$. For $\lambda \in [0,1]$, consider the time-shared mapping $T = (T_Q, Q)$, where $Q \in \{1,2\}$ is independent of $(X,Y)$ with $\Pr(Q=1) = \lambda$, and $T_q$ is generated from $X$ by $p_q(t|x)$.
Evaluate the time-shared mapping
Since $Q$ is independent of $(X,Y)$: $I(X;T) = I(X;Q) + I(X;T_Q \mid Q) = \lambda R_1 + (1-\lambda)R_2$, and likewise $I(T;Y) = \lambda I_1 + (1-\lambda)I_2$. The point $\big(\lambda R_1 + (1-\lambda)R_2,\ \lambda I_1 + (1-\lambda)I_2\big)$ is therefore achievable, so the information curve lies above each of its chords, i.e., it is concave.
ex-ch28-06
Medium: Derive the PAC-Bayes bound from the Donsker-Varadhan inequality. Specifically, show that for any measurable $f$ and distributions $P, Q$ over hypotheses: $\mathbb{E}_{h\sim Q}[f(h)] \le D_{\mathrm{KL}}(Q\,\|\,P) + \log \mathbb{E}_{h\sim P}\big[e^{f(h)}\big]$.
Start with the definition of KL divergence $D_{\mathrm{KL}}(Q\|P) = \mathbb{E}_Q\big[\log\frac{dQ}{dP}\big]$.
Use the log-sum inequality or Jensen's inequality.
Write the KL divergence
$D_{\mathrm{KL}}(Q\|P) = \mathbb{E}_Q\big[\log\frac{dQ}{dP}\big]$, so $\mathbb{E}_Q\big[\log\frac{dP}{dQ}\big] = -D_{\mathrm{KL}}(Q\|P)$.
Rearrange
$\mathbb{E}_Q[f(h)] = \mathbb{E}_Q\big[\log\big(e^{f(h)}\tfrac{dP}{dQ}\big)\big] + D_{\mathrm{KL}}(Q\|P)$.
Bound using Jensen
By Jensen's inequality (concavity of $\log$): $\mathbb{E}_Q\big[\log\big(e^{f(h)}\tfrac{dP}{dQ}\big)\big] \le \log\mathbb{E}_Q\big[e^{f(h)}\tfrac{dP}{dQ}\big] = \log\mathbb{E}_P\big[e^{f(h)}\big]$. Combining: $\mathbb{E}_Q[f(h)] \le D_{\mathrm{KL}}(Q\|P) + \log\mathbb{E}_P\big[e^{f(h)}\big]$.
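The inequality is easy to test on random discrete distributions (a sketch; the support size and random draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 8
P = rng.random(m); P /= P.sum()
Q = rng.random(m); Q /= Q.sum()
f = rng.standard_normal(m)

kl = np.sum(Q * np.log(Q / P))              # KL divergence in nats
lhs = np.sum(Q * f)                         # E_Q[f]
rhs = kl + np.log(np.sum(P * np.exp(f)))    # KL + log E_P[e^f]
print(lhs <= rhs, lhs, rhs)                 # True: Donsker-Varadhan holds
```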
ex-ch28-07
Medium: A distributed mean estimation system has $K$ users, each observing $n$ samples from $\mathcal{N}(\theta, \sigma^2 I_d)$ with $\theta \in \mathbb{R}^d$. Each user can send $B = d/2$ bits. Is communication or statistics the bottleneck? What is the expected MSE?
Compare $B$ to $d$.
The MSE is $\frac{\sigma^2 d}{nK}\max\big\{1, \frac{d}{B}\big\}$.
Compute parameters
Total samples: $Kn$. Total budget: $KB = Kd/2$ bits. Threshold: communication limits the accuracy once $B < d$ (less than one bit per coordinate). Since $B = d/2 < d$, communication is the bottleneck.
Compute MSE
$\mathrm{MSE} = \frac{\sigma^2 d}{nK}\cdot\frac{d}{B} = \frac{2\sigma^2 d}{nK}$. The communication-limited MSE ($2\sigma^2 d/nK$) is twice the statistical limit ($\sigma^2 d/nK$).
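A small calculator for the two regimes (a sketch; the helper `dme_mse` and the parameter values are illustrative):

```python
def dme_mse(sigma2, d, n, K, B):
    """Minimax MSE sketch: communication-limited when B < d, else statistics-limited."""
    stat_limit = sigma2 * d / (n * K)
    return stat_limit * max(1.0, d / B)

sigma2, d, n, K = 1.0, 1000, 100, 50
print(dme_mse(sigma2, d, n, K, B=d // 2))   # B = d/2: twice the statistical limit
print(dme_mse(sigma2, d, n, K, B=2 * d))    # B = 2d: statistics is the bottleneck
```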
ex-ch28-08
Medium: Prove that the IB self-consistent equations have the same structure as the Blahut-Arimoto algorithm for rate-distortion. Specifically, identify the "distortion measure" and show that the encoder update has the exponential form $p(t|x) \propto p(t)\,e^{-\beta\, d(x,t)}$.
Define $d(x,t) = D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)$.
Define the distortion
Set $d(x,t) = D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)$. This measures how well the cluster representative $p(y|t)$ approximates the true conditional $p(y|x)$.
Rewrite the IB Lagrangian
Using the Markov chain $T - X - Y$, one can show $I(X;Y) - I(T;Y) = \mathbb{E}_{p(x,t)}[d(X,T)]$, hence $\mathcal{L} = I(X;T) - \beta I(T;Y) = I(X;T) + \beta\,\mathbb{E}[d(X,T)] - \beta I(X;Y)$. Since $I(X;Y)$ is constant, minimizing $\mathcal{L}$ is equivalent to minimizing $I(X;T) + \beta\,\mathbb{E}[d(X,T)]$.
Apply the BA structure
This is exactly the rate-distortion Lagrangian with distortion $d(x,t)$ and multiplier $\beta$. The Blahut-Arimoto encoder update is $p(t|x) = \frac{p(t)\,e^{-\beta d(x,t)}}{Z(x,\beta)}$, matching the IB self-consistent equation.
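The correspondence suggests a direct implementation of the self-consistent updates. Below is a minimal sketch (the helper `ib_update` is hypothetical, assumes strictly positive distributions, and does no convergence checking; the binary joint is an illustrative uniform-input BSC(0.1)):

```python
import numpy as np

def ib_update(pxy, nt, beta, iters=300, seed=0):
    """Iterate the IB self-consistent equations (Blahut-Arimoto style)."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)
    py_x = pxy / px[:, None]                     # p(y|x)
    pt_x = rng.random((px.size, nt))
    pt_x /= pt_x.sum(axis=1, keepdims=True)      # random initial encoder p(t|x)
    for _ in range(iters):
        pt = np.maximum(px @ pt_x, 1e-12)        # p(t) = sum_x p(x) p(t|x)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]   # p(y|t)
        # Distortion d(x,t) = KL( p(y|x) || p(y|t) )
        d = np.einsum('xy,xty->xt', py_x,
                      np.log(py_x[:, None, :] / py_t[None, :, :]))
        pt_x = pt[None, :] * np.exp(-beta * d)   # exponential encoder update
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    return pt_x

pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])                   # uniform X through a BSC(0.1)
print(ib_update(pxy, nt=2, beta=5.0).round(3))   # non-trivial, near-hard clustering
```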
ex-ch28-09
Medium: For the random dithering quantizer with $b$ bits per coordinate (Section 3, Example 1), compute the expected number of bits needed to achieve total MSE $\epsilon$ for the mean of $K$ users' $d$-dimensional gradients, assuming i.i.d. components bounded in $[-1,1]$.
Per-user quantization variance is $O\big(1/(2^b-1)^2\big)$ per coordinate. After averaging $K$ users, the variance is divided by $K$.
Compute MSE per coordinate after averaging
Each user quantizes with variance at most $\frac{1}{(2^b-1)^2}$ per coordinate (uniform dithered quantization of $[-1,1]$ with step $\Delta = \frac{2}{2^b-1}$ has variance $\le \Delta^2/4$). After averaging $K$ independent quantized vectors, the MSE per coordinate is $\frac{1}{K(2^b-1)^2}$. Total MSE across $d$ dimensions: $\frac{d}{K(2^b-1)^2}$.
Solve for b
Require $\frac{d}{K(2^b-1)^2} \le \epsilon$, so $2^b - 1 \ge \sqrt{\frac{d}{K\epsilon}}$, giving $b \ge \log_2\Big(1 + \sqrt{\frac{d}{K\epsilon}}\Big)$.
Total communication
Total bits per round $= Kdb$. For example, with $K = 100$, $d = 10^6$, $\epsilon = 10^{-2}$: $\sqrt{d/(K\epsilon)} = 10^3$, so $b = 10$ bits, i.e., $db = 10$ Mbits per user per round.
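The bit computation as a sketch (using the illustrative parameter values above; `bits_needed` is a hypothetical helper):

```python
import numpy as np

def bits_needed(d, K, eps):
    """Smallest integer b with d / (K * (2^b - 1)^2) <= eps."""
    return int(np.ceil(np.log2(1.0 + np.sqrt(d / (K * eps)))))

K, d, eps = 100, 10**6, 1e-2
b = bits_needed(d, K, eps)
print(b, d * b / 1e6, "Mbit per user per round")   # b = 10, 10 Mbit per user
```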
ex-ch28-10
Medium: Show that for a nomographic function $f(x_1,\dots,x_K) = \psi\big(\sum_{k=1}^K \phi_k(x_k)\big)$, the geometric mean $\big(\prod_{k=1}^K x_k\big)^{1/K}$ (with $x_k > 0$) is nomographic. What are the pre- and post-processing functions?
Take the logarithm to convert a product into a sum.
Logarithm trick
$\big(\prod_k x_k\big)^{1/K} = \exp\big(\frac{1}{K}\sum_k \ln x_k\big)$. Set $\phi_k(x) = \ln x$ and $\psi(s) = e^{s/K}$.
Verify
$\psi\big(\sum_k \phi_k(x_k)\big) = \exp\big(\frac{1}{K}\sum_k \ln x_k\big) = \big(\prod_k x_k\big)^{1/K}$. This is nomographic, so the geometric mean can be computed over the air using the MAC with logarithmic pre-processing and exponential post-processing.
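A one-line verification of the identity (a sketch; the inputs are random positive values):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.1, 5.0, size=8)        # positive inputs x_k

def phi(v):                               # pre-processing at each user
    return np.log(v)

def psi(s):                               # post-processing at the server
    return np.exp(s / x.size)

over_the_air = psi(np.sum(phi(x)))        # the MAC delivers the sum of phi(x_k)
direct = np.prod(x) ** (1.0 / x.size)     # geometric mean computed directly
print(np.isclose(over_the_air, direct))   # True
```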
ex-ch28-11
Hard: Derive the individual-sample MI bound. Start with the decomposition $\mathbb{E}[\mathrm{gen}(W,S)] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\ell(W,\bar{Z}_i) - \ell(W,Z_i)\big]$, where $\bar{Z}_i$ is an independent copy of $Z_i$, and show that $\big|\mathbb{E}[\mathrm{gen}(W,S)]\big| \le \frac{1}{n}\sum_{i=1}^n \sqrt{2\sigma^2\, I(W;Z_i)}$.
Use the supersample technique: consider $\bar{S} = (\bar{Z}_1,\dots,\bar{Z}_n)$, where each $\bar{Z}_i$ is an independent copy of $Z_i$.
Apply the single-sample Xu-Raginsky bound to each term.
Supersample decomposition
Let $\bar{Z}_1,\dots,\bar{Z}_n$ be independent copies of $Z_1,\dots,Z_n$. Define $S^{(i)} = (Z_1,\dots,Z_{i-1},\bar{Z}_i,Z_{i+1},\dots,Z_n)$ (replace sample $i$ with its copy). Then: $\mathbb{E}[\mathrm{gen}(W,S)] = \frac{1}{n}\sum_{i=1}^n \big(\mathbb{E}[\ell(W,\bar{Z}_i)] - \mathbb{E}[\ell(W,Z_i)]\big)$, since $\mathbb{E}[\ell(W,\bar{Z}_i)] = \mathbb{E}[L_\mu(W)]$ for $\bar{Z}_i$ independent of $W$.
Apply the single-sample bound
For each $i$, note that $\bar{Z}_i$ is independent of $W$, so $\mathbb{E}[\ell(W,\bar{Z}_i)]$ is an expectation under the product of marginals $P_W \otimes P_{Z_i}$, while $\mathbb{E}[\ell(W,Z_i)]$ is taken under the joint $P_{W,Z_i}$. Applying the Donsker-Varadhan (single-sample Xu-Raginsky) bound to the pair $(W,Z_i)$, with $\ell$ $\sigma$-sub-Gaussian: $\big|\mathbb{E}[\ell(W,Z_i)] - \mathbb{E}[\ell(W,\bar{Z}_i)]\big| \le \sqrt{2\sigma^2\, I(W;Z_i)}$.
Sum over samples
Summing over $i$ and using the triangle inequality: $\big|\mathbb{E}[\mathrm{gen}(W,S)]\big| \le \frac{1}{n}\sum_{i=1}^n \sqrt{2\sigma^2\, I(W;Z_i)}$. By Jensen's inequality and $\sum_i I(W;Z_i) \le I(W;S)$ (which holds for independent samples), this bound is never weaker than the full-sample bound $\sqrt{2\sigma^2 I(W;S)/n}$.
ex-ch28-12
Hard: Consider the Gaussian IB: $X \sim \mathcal{N}(0, \Sigma_x)$ and $Y = AX + \xi$ for some matrix $A$ and Gaussian noise $\xi \sim \mathcal{N}(0, \Sigma_\xi)$ independent of $X$. Show that the optimal IB mapping is linear: $T = BX + Z$, where $Z \sim \mathcal{N}(0, \Sigma_z)$ is independent of $X$. Derive the information curve.
Jointly Gaussian distributions are preserved under linear transformations.
Use the entropy-power inequality or direct Gaussian optimization.
Establish Gaussianity
For jointly Gaussian $(X,Y)$, the maximum of $I(T;Y)$ for a given $I(X;T)$ is achieved by a $T$ jointly Gaussian with $X$, because Gaussian distributions maximize entropy for a given covariance (this can be made rigorous via the entropy-power inequality). A Gaussian $T$ jointly distributed with the Gaussian $X$ can always be written in the form $T = BX + Z$ with $Z \sim \mathcal{N}(0,\Sigma_z)$ independent of $X$.
Compute the mutual informations
$I(X;T) = \frac{1}{2}\log\frac{\det\big(B\Sigma_x B^\top + \Sigma_z\big)}{\det\Sigma_z}$. Similarly, $I(T;Y) = \frac{1}{2}\log\frac{\det\Sigma_T\,\det\Sigma_Y}{\det\Sigma_{[T;Y]}}$, computable in closed form from the joint covariance of $(T,Y)$.
Water-filling structure
The optimization over $B$ and $\Sigma_z$ has a water-filling solution in the canonical correlation basis of $(X,Y)$. Each canonical component $i$ (with correlation $\rho_i$) is either "on" (included in $T$) or "off" (discarded), with the threshold determined by $\beta$: component $i$ turns on once $\beta > 1/\rho_i^2$. The information curve is piecewise smooth in the canonical correlation basis, with a new segment beginning each time an additional component turns on.
ex-ch28-13
Hard: In the distributed estimation setting, consider the case where users have heterogeneous sample sizes: user $k$ has $n_k$ samples, with $\sum_k n_k = N$, but the $n_k$ may differ. Show that the optimal bit allocation is $b_k^* = \frac{B}{K} + \frac{1}{2}\log_2\frac{n_k}{(\prod_j n_j)^{1/K}}$ (allocate more bits to users with more data) and derive the resulting MSE.
The local Fisher information at user $k$ is proportional to $n_k$.
Think of it as a rate-distortion problem: allocate bits to minimize total MSE.
Local sufficient statistics
User $k$'s sufficient statistic is the local mean $\hat{\theta}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} X_{k,i} \sim \mathcal{N}(\theta, \sigma^2/n_k)$. The local Fisher information is $n_k/\sigma^2$.
Rate-distortion formulation
User $k$ quantizes $\hat{\theta}_k$ with $b_k$ bits. The quantization MSE per dimension is at least $\frac{\sigma^2}{n_k}2^{-2b_k}$ (Gaussian rate-distortion with source variance $\sigma^2/n_k$). The server forms the weighted average $\hat{\theta} = \sum_k w_k \hat{\theta}_k^{(q)}$ with weights $w_k = n_k/N$.
Optimize bit allocation
The total MSE is $\sum_k w_k^2\big(\frac{\sigma^2}{n_k} + \frac{\sigma^2}{n_k}2^{-2b_k}\big) = \frac{\sigma^2}{N} + \sum_k \frac{\sigma^2 n_k}{N^2}2^{-2b_k}$. Minimizing subject to $\sum_k b_k = B$ using Lagrange multipliers gives $b_k^* = \frac{B}{K} + \frac{1}{2}\log_2\frac{n_k}{(\prod_j n_j)^{1/K}}$ (growing with $\log n_k$, so users with more data get more bits). The resulting MSE is $\frac{\sigma^2}{N}\Big(1 + \frac{K(\prod_j n_j)^{1/K}}{N}\,2^{-2B/K}\Big)$, matching the homogeneous case when $n_k = n$ for all $k$.
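A sketch of the allocation and the resulting MSE, compared against uniform allocation (assumes $\sigma^2 = 1$ and a budget $B$ large enough that all $b_k \ge 0$; `allocate_bits` and `mse` are hypothetical helpers):

```python
import numpy as np

def allocate_bits(n, B):
    """b_k = B/K + 0.5*log2(n_k / geometric_mean(n))."""
    geo = np.exp(np.mean(np.log(n)))
    return B / len(n) + 0.5 * np.log2(n / geo)

n = np.array([10.0, 100.0, 1000.0, 10000.0])   # heterogeneous sample sizes
B = 40.0                                       # total bit budget
b = allocate_bits(n, B)
N = n.sum()

def mse(bits):
    """Statistical term plus quantization term from the solution (sigma^2 = 1)."""
    return 1.0 / N + np.sum(n / N**2 * 2.0 ** (-2.0 * bits))

print(b, b.sum())                              # allocation sums to B
print(mse(b), mse(np.full(4, B / 4)))          # optimal allocation beats uniform
```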
ex-ch28-14
Hard: Analyze the MSE of truncated channel inversion for AirComp. Users with $|h_k|^2 < g_{\mathrm{th}}$ are excluded from the current round. Derive the bias-variance tradeoff as a function of the threshold $g_{\mathrm{th}}$ under Rayleigh fading.
Under Rayleigh fading, $|h_k|^2 \sim \mathrm{Exp}(1)$. The probability of exclusion is $1 - e^{-g_{\mathrm{th}}}$.
The bias comes from missing users; the variance comes from the noise amplification.
Compute participation probability
Under Rayleigh fading, $|h_k|^2 \sim \mathrm{Exp}(1)$, so $\Pr\big(|h_k|^2 \ge g_{\mathrm{th}}\big) = e^{-g_{\mathrm{th}}}$. On average, $K e^{-g_{\mathrm{th}}}$ users participate per round.
Compute bias
The server estimates $\hat{f} = \frac{1}{K}\sum_{k\in\mathcal{K}} x_k$, where $\mathcal{K} = \{k : |h_k|^2 \ge g_{\mathrm{th}}\}$ is the participating set. The bias is $\mathbb{E}[\hat{f}] - \mu = -(1 - e^{-g_{\mathrm{th}}})\,\mu$ (assuming $\mathbb{E}[x_k] = \mu$ for all $k$). The squared bias is $\mu^2\big(1 - e^{-g_{\mathrm{th}}}\big)^2$.
Compute variance
Conditioned on $\mathcal{K}$, the variance from noise is $\frac{\sigma^2}{K^2\eta}$, where $\eta = P\,g_{\mathrm{th}}$ (every participating channel satisfies $|h_k|^2 \ge g_{\mathrm{th}}$, so the power constraint binds at the threshold). The total MSE is $\mathrm{MSE}(g_{\mathrm{th}}) = \mu^2\big(1-e^{-g_{\mathrm{th}}}\big)^2 + \frac{\sigma^2}{K^2 P\, g_{\mathrm{th}}}$: raising $g_{\mathrm{th}}$ reduces noise amplification but excludes more users. Optimizing over $g_{\mathrm{th}}$ yields the optimal threshold.
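Numerically optimizing the derived expression over the threshold (a sketch; $K$, $P$, $\sigma$, and $\mu$ are illustrative values):

```python
import numpy as np

K, P, sigma, mu = 20, 1.0, 0.5, 1.0

def mse(g_th):
    """Squared bias plus noise variance at truncation threshold g_th (Rayleigh fading)."""
    bias2 = mu**2 * (1.0 - np.exp(-g_th)) ** 2   # excluded users shrink the estimate
    noise = sigma**2 / (K**2 * P * g_th)         # eta = P * g_th
    return bias2 + noise

ths = np.linspace(0.01, 1.0, 200)
best = ths[int(np.argmin([mse(g) for g in ths]))]
print(best, mse(best))                           # numerically optimal threshold
```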
ex-ch28-15
Challenge: Prove that adding Gaussian noise to the output of a learning algorithm achieves the optimal rate in the MI generalization bound, in the following sense: for the class of all algorithms with output $W \in \mathbb{R}^d$ satisfying $\mathbb{E}\|W\|^2 \le dP$, the Gaussian mechanism $W' = W + N$ with $N \sim \mathcal{N}(0, \sigma_N^2 I_d)$ achieves $I(W';S) \le \frac{d}{2}\log\big(1 + \frac{P}{\sigma_N^2}\big)$, and show this is tight (no other noise distribution can achieve smaller MI for the same output distortion).
Use the maximum entropy property of the Gaussian distribution.
The MI is bounded by the AWGN channel capacity.
Upper bound on MI
$I(W';S) \le I(W';W) = h(W+N) - h(N)$ by the data processing inequality on $S \to W \to W'$, and $h(W+N)$ is maximized when $W$ is Gaussian, since the Gaussian maximizes entropy under the norm constraint. By the capacity of the AWGN channel: $I(W';S) \le \frac{d}{2}\log\big(1 + \frac{P}{\sigma_N^2}\big)$.
Tightness: Gaussian noise is optimal
For a fixed noise power $\mathbb{E}\|N\|^2$ (the output distortion), we want to minimize $I(W';S) = h(W') - h(W'|S)$. By the entropy-power inequality, Gaussian noise maximizes the conditional entropy $h(W'|S)$ for a given noise variance. Meanwhile $h(W')$ is bounded above regardless of the noise distribution (since $\mathbb{E}\|W'\|^2 \le \mathbb{E}\|W\|^2 + \mathbb{E}\|N\|^2$ caps it at the entropy of a Gaussian with the same second moment), so maximizing $h(W'|S)$ minimizes the MI. Gaussian noise achieves this maximum.
Combine
The Gaussian mechanism is MI-optimal: it minimizes $I(W';S)$ for a given output perturbation level $\sigma_N^2$. Combined with the Xu-Raginsky bound, this gives the tightest MI generalization bound achievable by output perturbation.
ex-ch28-16
Medium: Show that the computation capacity of the $K$-user Gaussian MAC for the sum function $f(x_1,\dots,x_K) = \sum_k x_k$ equals the sum-rate capacity $C_{\mathrm{sum}} = \frac{1}{2}\log\big(1 + \frac{KP}{\sigma^2}\big)$. Explain why this is not the case for the MAX function $f(x_1,\dots,x_K) = \max_k x_k$.
For the sum, all users can transmit coherently. For the MAX, the receiver needs to distinguish individual values.
Sum: achievability
Each user transmits with power $P$. The receiver gets $y = \sum_k x_k + n$ and needs to decode only the sum, not the individual codewords. The superposition is an AWGN channel with total received power $KP$ and noise $\sigma^2$, so the computation rate is $\frac{1}{2}\log\big(1 + \frac{KP}{\sigma^2}\big)$, matching the sum-rate MAC capacity.
MAX: separation required
To compute $\max_k x_k$, the receiver must decode each $x_k$ individually (or at least determine the ordering). This requires the MAC to support individual decoding, which limits the rates to the MAC capacity region. The per-user computation rate is at most $\frac{1}{K}\cdot\frac{1}{2}\log\big(1 + \frac{KP}{\sigma^2}\big)$, which is $K$ times smaller than the sum computation rate. The MAX function is not nomographic and cannot exploit the MAC superposition.
ex-ch28-17
Medium: In federated learning with $K$ users, model dimension $d$, and a communication budget of $B = 20d$ bits per round per user (20 bits per coordinate), how many SGD rounds $T$ are needed to achieve final MSE $\epsilon$ for a strongly convex objective with parameter $\mu$? Compare with unlimited communication.
For a strongly convex objective, SGD converges as $O\big(\sigma_g^2/(\mu T)\big)$ with exact (unquantized) gradient communication, where $\sigma_g^2$ is the stochastic gradient variance.
With quantization noise, the convergence rate becomes $O\big((\sigma_g^2 + \sigma_q^2)/(\mu T)\big)$.
Quantization MSE
Bits per coordinate: $b = B/d = 20$. Quantization MSE per coordinate (after averaging $K$ users, unit dynamic range): $\sigma_q^2 = \frac{1}{K(2^b-1)^2} \approx \frac{10^{-12}}{K}$. This is negligible for any reasonable $K$.
Convergence rate
With 20 bits per coordinate, the quantization noise is essentially zero compared to the stochastic gradient noise; convergence is dominated by SGD variance, not communication. This shows that $b = 20$ bits is more than sufficient. In practice, 4-8 bits per coordinate often suffice for federated learning.
Conclusion
The number of rounds is $T = O\big(\sigma_g^2/(\mu\epsilon)\big)$, the same as with unlimited communication. Communication becomes the bottleneck only when $b$ is very small (1-3 bits per coordinate) or when the dimension $d$ is extremely large relative to the per-round budget $B$.
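The negligibility claim in numbers (illustrative values):

```python
# Quantization noise per coordinate at b = 20 bits, unit dynamic range,
# after averaging K = 100 users (both values illustrative):
K, b = 100, 20
print(1.0 / (K * (2**b - 1) ** 2))   # ~9e-15, far below typical gradient noise
```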
ex-ch28-18
Challenge: Prove that the information bottleneck undergoes a phase transition at a critical $\beta_c$: for $\beta < \beta_c$, the optimal solution is independent of $X$ (trivial), and for $\beta > \beta_c$, a non-trivial solution bifurcates. Compute $\beta_c$ for binary $X$ and $Y$ related by a BSC with crossover probability $p$.
This is analogous to a second-order phase transition in statistical mechanics.
Analyze the stability of the trivial solution under small perturbations.
Trivial solution
For $\beta < \beta_c$, the optimal encoder is independent of $X$: $p(t|x) = p(t)$ for all $x$. This achieves $I(X;T) = 0$ and $I(T;Y) = 0$, hence $\mathcal{L} = 0$.
Perturbation analysis
Perturb around the trivial solution: $p(t|x) = p(t)\big(1 + \epsilon_t(x)\big)$, where $\sum_x p(x)\,\epsilon_t(x) = 0$ for each $t$. Expanding the IB Lagrangian to second order in $\epsilon$ gives $\Delta\mathcal{L} \propto (1 - \beta\lambda_2)\,\|\epsilon\|^2$, where $\lambda_2$ is the largest non-trivial eigenvalue of the matrix $Q(x,x') = \sum_y \frac{p(y|x)\,p(y|x')\,p(x')}{p(y)}$ (a normalized version of the conditional expectation operator).
Critical beta
The trivial solution is stable (a local minimum) when $\beta\lambda_2 < 1$, i.e., for $\beta < \beta_c = 1/\lambda_2$. For binary $X, Y$ related by a BSC($p$): $\lambda_2 = (1-2p)^2$, so $\beta_c = \frac{1}{(1-2p)^2}$. For $p = 0.1$: $\beta_c = 1/0.64 \approx 1.56$.
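The critical value as a function of $p$ (a sketch; running the `ib_update` sketch from ex-ch28-08 slightly below and above $\beta_c$ exhibits the bifurcation numerically):

```python
def beta_c(p):
    """Critical IB trade-off for a BSC(p); the trivial solution destabilizes here."""
    return 1.0 / (1.0 - 2.0 * p) ** 2

for p in (0.05, 0.1, 0.2):
    print(p, beta_c(p))    # p = 0.1 gives beta_c ~ 1.56
```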
ex-ch28-19
Hard: Derive the MSE of AirComp with MMSE-based aggregation instead of channel inversion. The server computes $\hat{f} = c\,y$, where $c$ minimizes the MSE $\mathbb{E}\big[(\hat{f} - \bar{x})^2\big]$ and $\bar{x} = \frac{1}{K}\sum_k x_k$. Show that the MMSE receiver avoids the power penalty of channel inversion for users in deep fade.
This is a standard LMMSE estimation problem.
The MMSE weights favor strong channels and suppress weak ones.
Signal model
Each user transmits $\sqrt{P}\,x_k$ (equal power, no channel inversion). The receiver observes $y = \sqrt{P}\sum_k h_k x_k + n$ with $n \sim \mathcal{N}(0, \sigma^2)$. Let the $x_k$ be i.i.d. with zero mean and unit variance, and define $g = \sum_k h_k$ and $G = \sum_k h_k^2$.
LMMSE receiver
The LMMSE estimate of $\bar{x}$ from $y$ is: $\hat{f} = \frac{\mathbb{E}[\bar{x}\,y]}{\mathbb{E}[y^2]}\;y = \frac{\sqrt{P}\,g/K}{P G + \sigma^2}\;y$.
MSE
$\mathrm{MSE} = \frac{1}{K}\Big(1 - \frac{P g^2/K}{P G + \sigma^2}\Big)$. Unlike channel inversion, the MMSE receiver does not amplify noise from weak channels. The MSE is bounded by $1/K$ (the variance of the sample mean) and decreases with SNR. Users in deep fade contribute less but do not degrade the estimate.
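A Monte Carlo check that the LMMSE receiver stays below the prior variance $1/K$ under Rayleigh fading (a sketch; the parameters are illustrative and the fading is normalized so $\mathbb{E}[h_k^2] = 1$):

```python
import numpy as np

rng = np.random.default_rng(5)
K, P, sigma, trials = 10, 1.0, 0.3, 100_000

h = rng.rayleigh(scale=np.sqrt(0.5), size=(trials, K))   # E[h^2] = 1
x = rng.standard_normal((trials, K))                     # unit-variance sources
noise = sigma * rng.standard_normal(trials)
y = np.sqrt(P) * (h * x).sum(axis=1) + noise

g = h.sum(axis=1)                                # g = sum_k h_k per realization
G = (h**2).sum(axis=1)                           # G = sum_k h_k^2 per realization
c = (np.sqrt(P) * g / K) / (P * G + sigma**2)    # LMMSE coefficient
mse = np.mean((c * y - x.mean(axis=1)) ** 2)
print(mse, 1.0 / K)                              # empirical MSE < Var(xbar) = 1/K
```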
ex-ch28-20
Challenge: Consider the multi-round communication model for distributed estimation. At each round $t$, user $k$ sends $b$ bits based on its data and all previous server broadcasts $m_1,\dots,m_{t-1}$. The server broadcasts $m_t$ after each round. Show that $T$ rounds of interaction can reduce the communication cost compared to the one-shot lower bound $\mathrm{MSE} = \Omega\big(\frac{\sigma^2 d}{nK}\cdot\frac{d}{B}\big)$, and characterize the improvement.
Interactive protocols allow the server to "steer" future messages based on past estimates.
Think of successive refinement: each round refines the estimate.
One-shot baseline
Without interaction, the MSE is $\Theta\big(\frac{\sigma^2 d}{nK}\max\big\{1, \frac{d}{B}\big\}\big)$, where $B$ is the total one-shot budget per user.
Interactive protocol
In each round, the server broadcasts its current estimate $\hat{\theta}_t$. Users then send bits describing the residual $\hat{\theta}_k - \hat{\theta}_t$, which has a smaller dynamic range than the raw estimate. This is successive refinement of a Gaussian source.
MSE after $T$ rounds
After $T$ rounds with $b$ bits per round per user (total communication $B = Tb$ per user), the MSE improves as $\mathrm{MSE}_T \approx \frac{\sigma^2 d}{nK}\big(1 + c\,2^{-2Tb/d}\big)$ for a constant $c$: the gap to the statistical limit decays exponentially in $T$. This shows that interaction exponentially improves over one-shot communication, whose MSE decays only polynomially in the budget. The key is that each round quantizes a residual with decreasing variance, requiring fewer bits for the same relative accuracy.
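A scalar simulation of the successive-refinement mechanism (a minimal sketch; the quantizer, ranges, and constants are illustrative, not an optimal protocol):

```python
import numpy as np

b, T = 4, 6                                       # bits per round, number of rounds

def quantize(v, span, bits):
    """Uniform scalar quantizer of v on [-span, span] with 2^bits levels."""
    levels = np.linspace(-span, span, 2**bits)
    return levels[int(np.argmin(np.abs(levels - v)))]

theta_local = 0.7317                              # a user's local estimate, in [-1, 1]
server, span = 0.0, 1.0
for t in range(T):
    q = quantize(theta_local - server, span, b)   # user describes the residual
    server += q                                   # server refines its estimate
    span *= 2.0 ** (1 - b)                        # residual range shrinks each round
    print(t, abs(theta_local - server))           # error shrinks ~2^(b-1)-fold per round
```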