Exercises
ex-ch08-01
Easy. For jointly Gaussian $(X, Y)$ with correlation coefficient $\rho \in (0, 1)$, show that Wyner's common information is $C(X; Y) = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$, and compare it with the mutual information $I(X; Y) = \frac{1}{2}\log\frac{1}{1-\rho^2}$.
For Gaussian $(X, Y)$, the optimal $W$ is the shared component in a symmetric representation of $X$ and $Y$; conditioning on it must leave independent residuals.
Show that the representation $X = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_1$, $Y = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_2$, with $W, N_1, N_2$ i.i.d. standard Gaussian, suffices.
Construct optimal W
Let $W \sim \mathcal{N}(0, 1)$ be the common component in the representation $X = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_1$, $Y = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_2$, where $N_1, N_2$ are independent standard Gaussians. This reproduces the joint law of $(X, Y)$, and given $W$ the conditional distributions of $X$ and $Y$ are independent Gaussians (removing the common component leaves independent residuals in the Gaussian case).
Compute rate
By the conditional independence from the previous step, $I(X, Y; W) = h(X, Y) - h(X \mid W) - h(Y \mid W)$.
With $h(X, Y) = \frac{1}{2}\log\left((2\pi e)^2 (1-\rho^2)\right)$ and $h(X \mid W) = h(Y \mid W) = \frac{1}{2}\log\left(2\pi e (1-\rho)\right)$, this gives $C(X; Y) = \frac{1}{2}\log\frac{1-\rho^2}{(1-\rho)^2} = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$. Comparing with $I(X; Y) = \frac{1}{2}\log\frac{1}{1-\rho^2}$, the difference is $C - I = \log(1+\rho) \ge 0$, so the common information dominates the mutual information, with equality only at $\rho = 0$.
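A quick numerical check of this construction (a sketch, assuming NumPy; variable names are ours): sampling from the symmetric representation confirms the correlation is $\rho$ and compares the two closed-form expressions.

```python
# Sample X = sqrt(rho)*W + sqrt(1-rho)*N1, Y = sqrt(rho)*W + sqrt(1-rho)*N2 and
# verify (i) Corr(X, Y) = rho empirically, (ii) C = 0.5*log2((1+rho)/(1-rho))
# dominates I = 0.5*log2(1/(1-rho^2)).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
for rho in (0.3, 0.6, 0.9):
    W, N1, N2 = rng.standard_normal((3, n))
    X = np.sqrt(rho) * W + np.sqrt(1 - rho) * N1
    Y = np.sqrt(rho) * W + np.sqrt(1 - rho) * N2
    emp_rho = np.corrcoef(X, Y)[0, 1]            # should be close to rho
    C = 0.5 * np.log2((1 + rho) / (1 - rho))     # Wyner common information (bits)
    I = 0.5 * np.log2(1 / (1 - rho ** 2))        # mutual information (bits)
    print(f"rho={rho}: empirical corr={emp_rho:.3f}, C={C:.3f} >= I={I:.3f} bits")
```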
ex-ch08-02
Medium. For the doubly symmetric binary source ($X \sim \text{Bern}(1/2)$, $Y = X \oplus Z$, $Z \sim \text{Bern}(p)$ independent of $X$), show that Wyner's common information is $C(X; Y) = 1 + h(p) - 2h(a_0)$ with $a_0 = \frac{1}{2}\left(1 - \sqrt{1-2p}\right)$, which is strictly greater than $I(X; Y) = 1 - h(p)$ for $0 < p < 1/2$.
Compute $I(X; Y) = H(Y) - H(Y \mid X) = 1 - h(p)$.
For the DSBS, look for a binary $W$ with $X \perp Y \mid W$; try $X = W \oplus A$, $Y = W \oplus B$ with $A, B$ independent.
Compute mutual information
$I(X; Y) = H(Y) - H(Y \mid X) = 1 - h(p)$.
Check common information
For the DSBS, the optimal construction (Wyner, 1975) takes $W \sim \text{Bern}(1/2)$ with $X = W \oplus A$ and $Y = W \oplus B$, where $A, B \sim \text{Bern}(a_0)$ are independent and $2a_0(1-a_0) = p$, i.e., $a_0 = \frac{1}{2}\left(1 - \sqrt{1-2p}\right)$. This makes $X \perp Y \mid W$, and the minimum over all valid $W$ equals $C(X; Y) = I(X, Y; W) = 1 + h(p) - 2h(a_0)$: intuitively, $W$ must carry nearly a full bit to break the dependence between $X$ and $Y$. Since $a_0 < p \le 1/2$ implies $h(a_0) < h(p)$, we get $C(X; Y) > I(X; Y)$ for $0 < p < 1/2$.
This shows that Wyner's common information can strictly exceed mutual information for discrete sources.
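The gap is easy to see numerically. A minimal sketch (assuming NumPy; `h` is the binary entropy in bits) comparing the two formulas across $p$:

```python
# Compare C(X;Y) = 1 + h(p) - 2 h(a0) with I(X;Y) = 1 - h(p) for the DSBS,
# where a0 = (1 - sqrt(1 - 2p))/2 solves p = 2 a0 (1 - a0).
import numpy as np

def h(q):  # binary entropy in bits
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

for p in (0.01, 0.1, 0.25, 0.4):
    a0 = (1 - np.sqrt(1 - 2 * p)) / 2
    C = 1 + h(p) - 2 * h(a0)
    I = 1 - h(p)
    print(f"p={p}: C={C:.4f} bits > I={I:.4f} bits")
```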
ex-ch08-03
Medium. Show that the information bottleneck objective at $\beta = 1$ reduces to minimizing $I(X; T \mid Y)$, the information in $T$ about $X$ that is not relevant to $Y$.
Write $\mathcal{L}_{\text{IB}} = I(X; T) - \beta I(T; Y)$.
Use the Markov chain $T - X - Y$ to simplify $I(X; T)$.
Expand the IB objective
By the chain rule, $I(X; T) = I(X; T \mid Y) + I(T; Y) - I(T; Y \mid X)$. The Markov chain $T - X - Y$ gives $I(T; Y \mid X) = 0$, so $I(X; T) = I(X; T \mid Y) + I(T; Y)$.
Substitute into IB
$\mathcal{L}_{\text{IB}} = I(X; T \mid Y) + I(T; Y) - \beta I(T; Y) = I(X; T \mid Y) + (1 - \beta) I(T; Y)$. At $\beta = 1$ the second term vanishes, leaving $\mathcal{L}_{\text{IB}} = I(X; T \mid Y)$: minimizing the objective means discarding exactly the information in $T$ about $X$ that is irrelevant to $Y$.
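The identity is also easy to verify numerically. A small sketch (assuming NumPy; the alphabet sizes are arbitrary) that builds a random joint satisfying $T - X - Y$ and checks $I(X; T) = I(X; T \mid Y) + I(T; Y)$:

```python
# Build p(t,x,y) = p(x) p(t|x) p(y|x), which enforces the Markov chain T - X - Y,
# then compare I(X;T) against I(X;T|Y) + I(T;Y).
import numpy as np

rng = np.random.default_rng(1)
nT, nX, nY = 3, 4, 2
px = rng.dirichlet(np.ones(nX))
pt_x = rng.dirichlet(np.ones(nT), size=nX)      # p(t|x), one row per x
py_x = rng.dirichlet(np.ones(nY), size=nX)      # p(y|x), one row per x
p = np.einsum('x,xt,xy->txy', px, pt_x, py_x)   # joint p(t,x,y)

def MI(pab):  # mutual information of a 2-D joint, in bits
    pa, pb = pab.sum(1, keepdims=True), pab.sum(0, keepdims=True)
    mask = pab > 0
    return (pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])).sum()

I_XT = MI(p.sum(axis=2))                         # from p(t,x)
I_TY = MI(p.sum(axis=1))                         # from p(t,y)
py = p.sum(axis=(0, 1))
I_XT_Y = sum(py[y] * MI(p[:, :, y] / py[y]) for y in range(nY))
print(f"I(X;T)            = {I_XT:.6f}")
print(f"I(X;T|Y) + I(T;Y) = {I_XT_Y + I_TY:.6f}")
```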
ex-ch08-04
Medium. Show that the VAE KL term $\mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z))\right]$ is an upper bound on the mutual information $I_q(X; Z)$ under the variational joint distribution.
Write $I_q(X; Z) = \mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, q_\phi(z))\right]$ and compare with the KL term.
Use the fact that $q_\phi(z) = \mathbb{E}_{p_{\text{data}}(x)}[q_\phi(z \mid x)]$ is the marginal (aggregated posterior).
Write MI in terms of variational distribution
Under the joint distribution $q(x, z) = p_{\text{data}}(x)\, q_\phi(z \mid x)$:
$$I_q(X; Z) = \mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, q_\phi(z))\right],$$
where $q_\phi(z) = \int p_{\text{data}}(x)\, q_\phi(z \mid x)\, dx$ is the aggregated posterior.
Compare with KL term
$$\mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z))\right] = I_q(X; Z) + D_{\text{KL}}(q_\phi(z) \,\|\, p(z)) \ge I_q(X; Z).$$
The gap is $D_{\text{KL}}(q_\phi(z) \,\|\, p(z))$, which measures how far the aggregated posterior deviates from the prior.
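A discrete toy check of this decomposition (a sketch assuming NumPy; the encoder $q_\phi(z \mid x)$ here is just a random stochastic matrix, not a trained network):

```python
# With finite alphabets, E_x[ KL(q(z|x) || p(z)) ] should equal
# I_q(X;Z) + KL(q(z) || p(z)), hence upper-bound I_q(X;Z).
import numpy as np

rng = np.random.default_rng(2)
nX, nZ = 5, 3
px = rng.dirichlet(np.ones(nX))              # data distribution p(x)
qz_x = rng.dirichlet(np.ones(nZ), size=nX)   # encoder q(z|x), one row per x
pz = np.full(nZ, 1 / nZ)                     # prior p(z)

def KL(a, b):
    return np.sum(a * np.log2(a / b))

qz = px @ qz_x                               # aggregated posterior q(z)
kl_term = np.sum(px * [KL(qz_x[i], pz) for i in range(nX)])
I_q = np.sum(px * [KL(qz_x[i], qz) for i in range(nX)])
print(f"E_x KL(q(z|x)||p(z))      = {kl_term:.6f}")
print(f"I_q(X;Z) + KL(q(z)||p(z)) = {I_q + KL(qz, pz):.6f}")
```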
ex-ch08-05
Easy. In a DVC system using LDPC-based Slepian-Wolf coding, the BSC correlation model has crossover probability $p = 0.05$. What is the minimum syndrome rate needed, and what LDPC code rate should be used?
The syndrome rate must be at least $H(X \mid Y) = h(0.05)$.
The LDPC code rate is $R = 1 - h(0.05)$.
Compute syndrome rate
$$R_{\text{syndrome}} \ge H(X \mid Y) = h(0.05) = -0.05 \log_2 0.05 - 0.95 \log_2 0.95 \approx 0.286 \text{ bits per source bit.}$$
LDPC code rate
The syndrome of a rate-$R$ code occupies $1 - R$ bits per source bit, so the ideal LDPC code rate is $R = 1 - h(0.05) \approx 0.714$. This equals the BSC(0.05) capacity, which $R$ must not exceed. In practice, we use a code rate slightly below this (e.g., 0.70) to allow for the gap to capacity of practical LDPC codes.
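A two-line numeric check (a sketch assuming NumPy):

```python
# Syndrome rate is h(0.05); the matching LDPC code rate is 1 - h(0.05).
import numpy as np

p = 0.05
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print(f"syndrome rate H(X|Y) = h({p}) = {h:.4f} bits/bit")
print(f"ideal LDPC code rate = {1 - h:.4f}")  # practical designs back off, e.g. ~0.70
```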
ex-ch08-06
Hard. Derive the self-consistent equations for the information bottleneck by taking the functional derivative of $\mathcal{L} = I(X; T) - \beta I(T; Y)$ with respect to $p(t \mid x)$, subject to the normalization constraint $\sum_t p(t \mid x) = 1$.
Use Lagrange multipliers for the normalization constraint.
Express $I(X; T)$ and $I(T; Y)$ in terms of $p(t \mid x)$, $p(x)$, and $p(y \mid x)$.
The derivative involves $D_{\text{KL}}(p(y \mid x) \,\|\, p(y \mid t))$.
Write the Lagrangian
$$\tilde{\mathcal{L}} = I(X; T) - \beta I(T; Y) - \sum_x \lambda(x) \sum_t p(t \mid x).$$
Take functional derivative
Setting $\delta \tilde{\mathcal{L}} / \delta p(t \mid x) = 0$ (with $p(t)$ and $p(y \mid t)$ self-consistently determined by $p(t \mid x)$) gives
$$p(x)\left[\log \frac{p(t \mid x)}{p(t)} + \beta\, D_{\text{KL}}\big(p(y \mid x) \,\|\, p(y \mid t)\big)\right] - \lambda(x) = 0.$$
Rearranging and using the normalization constraint to determine $\lambda(x)$:
$$p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\big(-\beta\, D_{\text{KL}}(p(y \mid x) \,\|\, p(y \mid t))\big),$$
where $Z(x, \beta)$ ensures normalization.
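These equations suggest alternating updates for $p(t)$, $p(y \mid t)$, and $p(t \mid x)$, the Blahut-Arimoto-style IB iteration. A minimal sketch (assuming NumPy; the joint $p(x, y)$, alphabet sizes, and $\beta$ are arbitrary choices for the demo):

```python
# Fixed-point iteration for the IB self-consistent equations on a random
# discrete joint p(x, y) with a two-symbol bottleneck variable T.
import numpy as np

rng = np.random.default_rng(3)
nX, nY, nT, beta = 6, 2, 2, 5.0
pxy = rng.dirichlet(np.ones(nX * nY)).reshape(nX, nY)
px, py_x = pxy.sum(1), pxy / pxy.sum(1, keepdims=True)

pt_x = rng.dirichlet(np.ones(nT), size=nX)        # initial p(t|x)
for _ in range(200):
    pt = px @ pt_x                                # p(t) = sum_x p(x) p(t|x)
    # p(y|t) = sum_x p(t|x) p(x) p(y|x) / p(t)
    py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]
    # KL(p(y|x) || p(y|t)) for every (x, t) pair
    kl = np.array([[np.sum(py_x[x] * np.log(py_x[x] / py_t[t]))
                    for t in range(nT)] for x in range(nX)])
    pt_x = pt[None, :] * np.exp(-beta * kl)       # exponential-family update
    pt_x /= pt_x.sum(1, keepdims=True)            # normalize by Z(x, beta)
print("p(t|x) after iteration:\n", np.round(pt_x, 3))
```

Larger $\beta$ drives the converged $p(t \mid x)$ toward hard assignments, matching the temperature interpretation used in ex-ch08-09 below.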
ex-ch08-07
Hard. Show that for a $\beta$-VAE with Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu(x), \text{diag}(\sigma^2(x)))$ and standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, the KL term has the closed-form expression:
$$D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z)) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right).$$
Use the formula for KL divergence between two multivariate Gaussians.
The KL between $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ is $\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$.
Apply Gaussian KL formula
For a diagonal covariance $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and prior $\mathcal{N}(0, I)$, the dimensions decouple. Summing the univariate formula with $\mu_2 = 0$, $\sigma_2 = 1$:
$$D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z)) = \sum_{j=1}^{d}\left(-\frac{1}{2}\log \sigma_j^2 + \frac{\sigma_j^2 + \mu_j^2 - 1}{2}\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right).$$
Interpretation
Each latent dimension contributes independently to the KL cost. The per-dimension term satisfies $\frac{1}{2}\left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right) \ge 0$, with equality iff $\mu_j = 0$ and $\sigma_j^2 = 1$ (matching the prior).
Dimensions with large $|\mu_j|$ carry more "information" about $x$ (the mean shifts away from the prior) and incur higher rate cost. This is the rate-distortion tradeoff in action: each dimension acts as a "channel" with capacity determined by how far the encoder pushes the posterior from the prior.
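A quick Monte Carlo sanity check of the closed form (a sketch assuming NumPy; $\mu$ and $\sigma$ are drawn arbitrarily):

```python
# Compare the analytic KL with a Monte Carlo estimate of
# E_q[ log q(z|x) - log p(z) ] for a random (mu, sigma).
import numpy as np

rng = np.random.default_rng(4)
d = 4
mu, sigma = rng.normal(size=d), np.exp(0.3 * rng.normal(size=d))

closed = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

z = mu + sigma * rng.standard_normal((100_000, d))      # z ~ q(z|x)
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
print(f"closed form : {closed:.4f} nats")
print(f"Monte Carlo : {np.mean(log_q - log_p):.4f} nats")
```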
ex-ch08-08
Medium. A source coding system with a helper has $X$ uniform on $\{0, 1, 2, 3\}$ and $Y = X \bmod 2$ (the helper observes only the parity of $X$). What is the minimum rate $R_X$ as a function of the helper rate $R_{\text{helper}}$?
$Y = X \bmod 2$ takes values in $\{0, 1\}$ with $H(Y) = 1$ bit.
Given $Y$, $X$ is uniform on two values, so $H(X \mid Y) = 1$ bit.
Compute entropies
$H(X) = 2$ bits, $H(Y) = 1$ bit. $H(X \mid Y) = 1$ bit (given the parity, two equally likely values remain). $H(Y \mid X) = 0$ (parity is a deterministic function of $X$).
Rate region
- At $R_{\text{helper}} = 0$: $R_X = H(X) = 2$ bits (no help).
- At $R_{\text{helper}} \ge H(Y) = 1$: $R_X = H(X \mid Y) = 1$ bit (full help).
Note that the Slepian-Wolf corner point $R_Y = H(Y \mid X) = 0$ does not apply here: it presumes the decoder can first recover $X$ and use it as side information for $Y$, which is circular when the main message alone is insufficient. The helper must genuinely convey $Y$, at rate up to $H(Y) = 1$ bit.
For intermediate rates, the one-helper (Ahlswede-Körner-Wyner) region $R_X \ge H(X \mid U)$, $R_{\text{helper}} \ge I(Y; U)$ with $U - Y - X$ gives $R_X \ge H(X) - I(X; U) \ge 2 - I(Y; U) \ge 2 - R_{\text{helper}}$, achievable by describing $Y$ through an erasure channel.
The transition is therefore gradual rather than sharp: $R_X(R_{\text{helper}}) = 2 - \min(R_{\text{helper}}, 1)$ bits, as traced out in the sketch below.
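A tiny sketch of the boundary via the erasure auxiliary (assuming NumPy; `alpha` is the probability that the helper's description of $Y$ survives rather than being erased):

```python
# U = Y with probability alpha, an erasure symbol otherwise. For uniform binary Y,
# I(Y;U) = alpha; and H(X|U) = alpha*H(X|Y) + (1-alpha)*H(X) = 2 - alpha.
import numpy as np

for alpha in np.linspace(0, 1, 6):
    R_helper = alpha          # I(Y;U) = alpha
    R_X = 2 - alpha           # H(X|U): 1 bit left if U = Y, 2 bits if erased
    print(f"R_helper = {R_helper:.1f} -> R_X = {R_X:.1f} bits")
```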
ex-ch08-09
Challenge. Consider the information bottleneck for binary classification: $X$ is a feature vector, $Y \in \{0, 1\}$ is a binary label, and $T \in \{0, 1\}$ is a binary compressed representation. Show that the optimal $p(t \mid x)$ at any $\beta$ is a soft clustering that groups inputs with similar posterior $p(y \mid x)$.
The IB self-consistent equation involves $D_{\text{KL}}(p(y \mid x) \,\|\, p(y \mid t))$.
With binary $T$, this becomes a two-cluster problem.
Inputs with similar $p(y \mid x)$ should go to the same cluster.
Apply the IB equations
With binary $T$:
$$\frac{p(t=1 \mid x)}{p(t=0 \mid x)} = \frac{p(t=1)}{p(t=0)} \exp\left(-\beta \left[D_{\text{KL}}\big(p(y \mid x) \,\|\, p(y \mid t=1)\big) - D_{\text{KL}}\big(p(y \mid x) \,\|\, p(y \mid t=0)\big)\right]\right).$$
For large $\beta$, $x$ is assigned to cluster $0$ or $1$ based on which cluster's average posterior is closer (in the KL sense) to $p(y \mid x)$.
Interpretation
For binary $Y$ and binary $T$, the KL divergence reduces to comparing $p(y=1 \mid x)$ with $p(y=1 \mid t)$. Inputs with $p(y=1 \mid x)$ close to $p(y=1 \mid t=1)$ tend to cluster $1$, and those with $p(y=1 \mid x)$ close to $p(y=1 \mid t=0)$ tend to cluster $0$.
As $\beta \to \infty$ (hard clustering), this becomes a deterministic threshold on $p(y=1 \mid x)$. At finite $\beta$, the clustering is soft, with the "temperature" $1/\beta$ controlling the fuzziness.
This is precisely what a logistic regression classifier does; the IB provides an information-theoretic justification for soft classification.
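A small illustration (assuming NumPy; the cluster posteriors `q0`, `q1`, the uniform cluster prior, and the sample values of $p(y{=}1 \mid x)$ are made up for the demo) of how the assignment sharpens with $\beta$:

```python
# Soft IB assignment p(t=1|x) for binary Y, binary T, uniform cluster prior:
# the log-odds are -beta * [KL(p(y|x)||q1) - KL(p(y|x)||q0)], a sigmoid in the end,
# which is exactly the logistic-regression form noted above.
import numpy as np

def kl_bern(a, b):  # KL between Bernoulli(a) and Bernoulli(b), in nats
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

q0, q1 = 0.2, 0.8                              # cluster-average posteriors (assumed)
py1_x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # p(y=1|x) for five inputs
for beta in (1.0, 4.0, 50.0):
    logits = -beta * (kl_bern(py1_x, q1) - kl_bern(py1_x, q0))
    pt1_x = 1 / (1 + np.exp(-logits))          # sigmoid of the log-odds
    print(f"beta={beta:5.1f}: p(t=1|x) =", np.round(pt1_x, 3))
```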
ex-ch08-10
Medium. In the DISCUS framework for Slepian-Wolf coding, explain why the syndrome decoding problem with $X = Y \oplus Z$, $Z \sim \text{Bern}(p)$, is equivalent to channel decoding on a BSC($p$).
Think of $\mathbf{z} = \mathbf{x} \oplus \mathbf{y}$ as a codeword error pattern on a BSC.
The syndrome difference $H\mathbf{x} \oplus H\mathbf{y} = H\mathbf{z}$ is the syndrome of the error pattern.
Channel coding analogy
In standard channel coding on a BSC($p$): a codeword $\mathbf{c}$ is transmitted, and the received vector is $\mathbf{r} = \mathbf{c} \oplus \mathbf{e}$, where $\mathbf{e}$ is the error pattern with i.i.d. $\text{Bern}(p)$ entries. The decoder uses the syndrome $H\mathbf{r} = H\mathbf{e}$ (since $H\mathbf{c} = \mathbf{0}$ for valid codewords) to estimate $\mathbf{e}$.
DISCUS equivalence
In DISCUS: $\mathbf{x}$ is the source, $\mathbf{y}$ is the side information, and $\mathbf{z} = \mathbf{x} \oplus \mathbf{y}$ is the "correlation noise." The encoder sends $\mathbf{s} = H\mathbf{x}$, and the decoder computes $\mathbf{s} \oplus H\mathbf{y} = H(\mathbf{x} \oplus \mathbf{y}) = H\mathbf{z}$.
This is identical to the channel decoding problem: the decoder has the syndrome $H\mathbf{z}$ and must recover $\mathbf{z}$, where $\mathbf{z}$ has i.i.d. $\text{Bern}(p)$ entries. Any BSC($p$)-capacity-approaching decoder (e.g., belief propagation for LDPC codes) solves both problems.
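A toy end-to-end demo of this equivalence (a sketch assuming NumPy), using the standard Hamming (7,4) parity-check matrix: since Hamming codes correct any single error, the decoder recovers $\mathbf{z}$, and hence $\mathbf{x} = \mathbf{y} \oplus \hat{\mathbf{z}}$, whenever the correlation noise has weight at most one.

```python
# Syndrome-based Slepian-Wolf coding with Hamming(7,4): the encoder sends only
# the 3-bit syndrome H x; the decoder forms (H x) xor (H y) = H z and corrects z.
import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])          # column j is j in binary, LSB first

rng = np.random.default_rng(5)
y = rng.integers(0, 2, 7)                      # side information at the decoder
z = np.zeros(7, int)
z[rng.integers(7)] = 1                         # one-bit correlation noise
x = (y + z) % 2                                # source: X = Y xor Z

s = H @ x % 2                                  # 3-bit syndrome sent by the encoder
s_z = (s + H @ y) % 2                          # decoder computes H(x xor y) = H z
# For Hamming codes, the syndrome read as a binary number is the error position.
z_hat = np.zeros(7, int)
pos = int(s_z @ [1, 2, 4]) - 1                 # 0-indexed error location
if pos >= 0:
    z_hat[pos] = 1
x_hat = (y + z_hat) % 2
print("recovered x correctly:", np.array_equal(x_hat, x))
```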