Exercises
ex-ch08-01
Easy. For jointly Gaussian $(X, Y)$ with correlation coefficient $\rho \in (0, 1)$, show that Wyner's common information is $C(X; Y) = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$, and compare it with the mutual information $I(X; Y) = \frac{1}{2}\log\frac{1}{1-\rho^2}$.
For Gaussian $(X, Y)$, the optimal $W$ is the shared component in a symmetric representation of $X$ and $Y$; conditioning on it must leave independent residuals.
Show that the representation $X = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_1$, $Y = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_2$, with $W, N_1, N_2$ i.i.d. standard Gaussian, suffices.
Construct optimal W
Let $W \sim \mathcal{N}(0, 1)$ be the common component in the representation $X = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_1$, $Y = \sqrt{\rho}\, W + \sqrt{1-\rho}\, N_2$, where $N_1, N_2$ are independent standard Gaussians. This reproduces the joint law of $(X, Y)$, and given $W$ the conditional distributions of $X$ and $Y$ are independent Gaussians (removing the common component leaves independent residuals in the Gaussian case).
Compute rate
By the conditional independence from the previous step, $I(X, Y; W) = h(X, Y) - h(X \mid W) - h(Y \mid W)$.
With $h(X, Y) = \frac{1}{2}\log\left((2\pi e)^2 (1-\rho^2)\right)$ and $h(X \mid W) = h(Y \mid W) = \frac{1}{2}\log\left(2\pi e (1-\rho)\right)$, this gives $C(X; Y) = \frac{1}{2}\log\frac{1-\rho^2}{(1-\rho)^2} = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$. Comparing with $I(X; Y) = \frac{1}{2}\log\frac{1}{1-\rho^2}$, the difference is $C - I = \log(1+\rho) \ge 0$, so the common information dominates the mutual information, with equality only at $\rho = 0$.
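A quick numerical check of this construction (a sketch, assuming NumPy; variable names are ours): sampling from the symmetric representation confirms the correlation is $\rho$ and compares the two closed-form expressions.

```python
# Sample X = sqrt(rho)*W + sqrt(1-rho)*N1, Y = sqrt(rho)*W + sqrt(1-rho)*N2 and
# verify (i) Corr(X, Y) = rho empirically, (ii) C = 0.5*log2((1+rho)/(1-rho))
# dominates I = 0.5*log2(1/(1-rho^2)).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
for rho in (0.3, 0.6, 0.9):
    W, N1, N2 = rng.standard_normal((3, n))
    X = np.sqrt(rho) * W + np.sqrt(1 - rho) * N1
    Y = np.sqrt(rho) * W + np.sqrt(1 - rho) * N2
    emp_rho = np.corrcoef(X, Y)[0, 1]            # should be close to rho
    C = 0.5 * np.log2((1 + rho) / (1 - rho))     # Wyner common information (bits)
    I = 0.5 * np.log2(1 / (1 - rho ** 2))        # mutual information (bits)
    print(f"rho={rho}: empirical corr={emp_rho:.3f}, C={C:.3f} >= I={I:.3f} bits")
```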
ex-ch08-02
Medium. For the doubly symmetric binary source ($X \sim \text{Bern}(1/2)$, $Y = X \oplus Z$, $Z \sim \text{Bern}(p)$ independent of $X$), show that Wyner's common information is $C(X; Y) = 1 + h(p) - 2h(a_0)$ with $a_0 = \frac{1}{2}\left(1 - \sqrt{1-2p}\right)$, which is strictly greater than $I(X; Y) = 1 - h(p)$ for $0 < p < 1/2$.
Compute $I(X; Y) = H(Y) - H(Y \mid X) = 1 - h(p)$.
For the DSBS, look for a binary $W$ with $X \perp Y \mid W$; try $X = W \oplus A$, $Y = W \oplus B$ with $A, B$ independent.
Compute mutual information
$I(X; Y) = H(Y) - H(Y \mid X) = 1 - h(p)$.
Check common information
For the DSBS, the optimal construction (Wyner, 1975) takes $W \sim \text{Bern}(1/2)$ with $X = W \oplus A$ and $Y = W \oplus B$, where $A, B \sim \text{Bern}(a_0)$ are independent and $2a_0(1-a_0) = p$, i.e., $a_0 = \frac{1}{2}\left(1 - \sqrt{1-2p}\right)$. This makes $X \perp Y \mid W$, and the minimum over all valid $W$ equals $C(X; Y) = I(X, Y; W) = 1 + h(p) - 2h(a_0)$: intuitively, $W$ must carry nearly a full bit to break the dependence between $X$ and $Y$. Since $a_0 < p \le 1/2$ implies $h(a_0) < h(p)$, we get $C(X; Y) > I(X; Y)$ for $0 < p < 1/2$.
This shows that Wyner's common information can strictly exceed mutual information for discrete sources.
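The gap is easy to see numerically. A minimal sketch (assuming NumPy; `h` is the binary entropy in bits) comparing the two formulas across $p$:

```python
# Compare C(X;Y) = 1 + h(p) - 2 h(a0) with I(X;Y) = 1 - h(p) for the DSBS,
# where a0 = (1 - sqrt(1 - 2p))/2 solves p = 2 a0 (1 - a0).
import numpy as np

def h(q):  # binary entropy in bits
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

for p in (0.01, 0.1, 0.25, 0.4):
    a0 = (1 - np.sqrt(1 - 2 * p)) / 2
    C = 1 + h(p) - 2 * h(a0)
    I = 1 - h(p)
    print(f"p={p}: C={C:.4f} bits > I={I:.4f} bits")
```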
ex-ch08-03
Medium. Show that the information bottleneck objective at $\beta = 1$ reduces to minimizing $I(X; T \mid Y)$, the information in $T$ about $X$ that is not relevant to $Y$.
Write $\mathcal{L}_{\text{IB}} = I(X; T) - \beta I(T; Y)$.
Use the Markov chain $T - X - Y$ to simplify $I(X; T)$.
Expand the IB objective
By the chain rule, $I(X; T) = I(X; T \mid Y) + I(T; Y) - I(T; Y \mid X)$. The Markov chain $T - X - Y$ gives $I(T; Y \mid X) = 0$, so $I(X; T) = I(X; T \mid Y) + I(T; Y)$.
Substitute into IB
$\mathcal{L}_{\text{IB}} = I(X; T \mid Y) + I(T; Y) - \beta I(T; Y) = I(X; T \mid Y) + (1 - \beta) I(T; Y)$. At $\beta = 1$ the second term vanishes, leaving $\mathcal{L}_{\text{IB}} = I(X; T \mid Y)$: minimizing the objective means discarding exactly the information in $T$ about $X$ that is irrelevant to $Y$.
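The identity is also easy to verify numerically. A small sketch (assuming NumPy; the alphabet sizes are arbitrary) that builds a random joint satisfying $T - X - Y$ and checks $I(X; T) = I(X; T \mid Y) + I(T; Y)$:

```python
# Build p(t,x,y) = p(x) p(t|x) p(y|x), which enforces the Markov chain T - X - Y,
# then compare I(X;T) against I(X;T|Y) + I(T;Y).
import numpy as np

rng = np.random.default_rng(1)
nT, nX, nY = 3, 4, 2
px = rng.dirichlet(np.ones(nX))
pt_x = rng.dirichlet(np.ones(nT), size=nX)      # p(t|x), one row per x
py_x = rng.dirichlet(np.ones(nY), size=nX)      # p(y|x), one row per x
p = np.einsum('x,xt,xy->txy', px, pt_x, py_x)   # joint p(t,x,y)

def MI(pab):  # mutual information of a 2-D joint, in bits
    pa, pb = pab.sum(1, keepdims=True), pab.sum(0, keepdims=True)
    mask = pab > 0
    return (pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])).sum()

I_XT = MI(p.sum(axis=2))                         # from p(t,x)
I_TY = MI(p.sum(axis=1))                         # from p(t,y)
py = p.sum(axis=(0, 1))
I_XT_Y = sum(py[y] * MI(p[:, :, y] / py[y]) for y in range(nY))
print(f"I(X;T)            = {I_XT:.6f}")
print(f"I(X;T|Y) + I(T;Y) = {I_XT_Y + I_TY:.6f}")
```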
ex-ch08-04
Medium. Show that the VAE KL term $\mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z))\right]$ is an upper bound on the mutual information $I_q(X; Z)$ under the variational joint distribution.
Write $I_q(X; Z) = \mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, q_\phi(z))\right]$ and compare with the KL term.
Use the fact that $q_\phi(z) = \mathbb{E}_{p_{\text{data}}(x)}[q_\phi(z \mid x)]$ is the marginal (aggregated posterior).
Write MI in terms of variational distribution
Under the joint distribution $q(x, z) = p_{\text{data}}(x)\, q_\phi(z \mid x)$:
$$I_q(X; Z) = \mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, q_\phi(z))\right],$$
where $q_\phi(z) = \int p_{\text{data}}(x)\, q_\phi(z \mid x)\, dx$ is the aggregated posterior.
Compare with KL term
$$\mathbb{E}_{p_{\text{data}}(x)}\left[D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z))\right] = I_q(X; Z) + D_{\text{KL}}(q_\phi(z) \,\|\, p(z)) \ge I_q(X; Z).$$
The gap is $D_{\text{KL}}(q_\phi(z) \,\|\, p(z))$, which measures how far the aggregated posterior deviates from the prior.
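A discrete toy check of this decomposition (a sketch assuming NumPy; the encoder $q_\phi(z \mid x)$ here is just a random stochastic matrix, not a trained network):

```python
# With finite alphabets, E_x[ KL(q(z|x) || p(z)) ] should equal
# I_q(X;Z) + KL(q(z) || p(z)), hence upper-bound I_q(X;Z).
import numpy as np

rng = np.random.default_rng(2)
nX, nZ = 5, 3
px = rng.dirichlet(np.ones(nX))              # data distribution p(x)
qz_x = rng.dirichlet(np.ones(nZ), size=nX)   # encoder q(z|x), one row per x
pz = np.full(nZ, 1 / nZ)                     # prior p(z)

def KL(a, b):
    return np.sum(a * np.log2(a / b))

qz = px @ qz_x                               # aggregated posterior q(z)
kl_term = np.sum(px * [KL(qz_x[i], pz) for i in range(nX)])
I_q = np.sum(px * [KL(qz_x[i], qz) for i in range(nX)])
print(f"E_x KL(q(z|x)||p(z))      = {kl_term:.6f}")
print(f"I_q(X;Z) + KL(q(z)||p(z)) = {I_q + KL(qz, pz):.6f}")
```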
ex-ch08-05
Easy. In a DVC system using LDPC-based Slepian-Wolf coding, the BSC correlation model has crossover probability $p = 0.05$. What is the minimum syndrome rate needed, and what LDPC code rate should be used?
The syndrome rate must be at least $H(X \mid Y) = h(0.05)$.
The LDPC code rate is $R = 1 - h(0.05)$.
Compute syndrome rate
$$R_{\text{syndrome}} \ge H(X \mid Y) = h(0.05) = -0.05 \log_2 0.05 - 0.95 \log_2 0.95 \approx 0.286 \text{ bits per source bit.}$$
LDPC code rate
The syndrome of a rate-$R$ code occupies $1 - R$ bits per source bit, so the ideal LDPC code rate is $R = 1 - h(0.05) \approx 0.714$. This equals the BSC(0.05) capacity, which $R$ must not exceed. In practice, we use a code rate slightly below this (e.g., 0.70) to allow for the gap to capacity of practical LDPC codes.
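A two-line numeric check (a sketch assuming NumPy):

```python
# Syndrome rate is h(0.05); the matching LDPC code rate is 1 - h(0.05).
import numpy as np

p = 0.05
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print(f"syndrome rate H(X|Y) = h({p}) = {h:.4f} bits/bit")
print(f"ideal LDPC code rate = {1 - h:.4f}")  # practical designs back off, e.g. ~0.70
```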
ex-ch08-06
Hard. Derive the self-consistent equations for the information bottleneck by taking the functional derivative of $\mathcal{L} = I(X; T) - \beta I(T; Y)$ with respect to $p(t \mid x)$, subject to the normalization constraint $\sum_t p(t \mid x) = 1$.
Use Lagrange multipliers for the normalization constraint.
Express $I(X; T)$ and $I(T; Y)$ in terms of $p(t \mid x)$, $p(x)$, and $p(y \mid x)$.
The derivative involves $D_{\text{KL}}(p(y \mid x) \,\|\, p(y \mid t))$.
Write the Lagrangian
$$\tilde{\mathcal{L}} = I(X; T) - \beta I(T; Y) - \sum_x \lambda(x) \sum_t p(t \mid x).$$
Take functional derivative
Setting $\delta \tilde{\mathcal{L}} / \delta p(t \mid x) = 0$ (with $p(t)$ and $p(y \mid t)$ self-consistently determined by $p(t \mid x)$) gives
$$p(x)\left[\log \frac{p(t \mid x)}{p(t)} + \beta\, D_{\text{KL}}\big(p(y \mid x) \,\|\, p(y \mid t)\big)\right] - \lambda(x) = 0.$$
Rearranging and using the normalization constraint to determine $\lambda(x)$:
$$p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\big(-\beta\, D_{\text{KL}}(p(y \mid x) \,\|\, p(y \mid t))\big),$$
where $Z(x, \beta)$ ensures normalization.
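These equations suggest alternating updates for $p(t)$, $p(y \mid t)$, and $p(t \mid x)$, the Blahut-Arimoto-style IB iteration. A minimal sketch (assuming NumPy; the joint $p(x, y)$, alphabet sizes, and $\beta$ are arbitrary choices for the demo):

```python
# Fixed-point iteration for the IB self-consistent equations on a random
# discrete joint p(x, y) with a two-symbol bottleneck variable T.
import numpy as np

rng = np.random.default_rng(3)
nX, nY, nT, beta = 6, 2, 2, 5.0
pxy = rng.dirichlet(np.ones(nX * nY)).reshape(nX, nY)
px, py_x = pxy.sum(1), pxy / pxy.sum(1, keepdims=True)

pt_x = rng.dirichlet(np.ones(nT), size=nX)        # initial p(t|x)
for _ in range(200):
    pt = px @ pt_x                                # p(t) = sum_x p(x) p(t|x)
    # p(y|t) = sum_x p(t|x) p(x) p(y|x) / p(t)
    py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]
    # KL(p(y|x) || p(y|t)) for every (x, t) pair
    kl = np.array([[np.sum(py_x[x] * np.log(py_x[x] / py_t[t]))
                    for t in range(nT)] for x in range(nX)])
    pt_x = pt[None, :] * np.exp(-beta * kl)       # exponential-family update
    pt_x /= pt_x.sum(1, keepdims=True)            # normalize by Z(x, beta)
print("p(t|x) after iteration:\n", np.round(pt_x, 3))
```

Larger $\beta$ drives the converged $p(t \mid x)$ toward hard assignments, matching the temperature interpretation used in ex-ch08-09 below.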
ex-ch08-07
Hard. Show that for a $\beta$-VAE with Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu(x), \text{diag}(\sigma^2(x)))$ and standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, the KL term has the closed-form expression:
$$D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z)) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right).$$
Use the formula for KL divergence between two multivariate Gaussians.
The KL between $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ is $\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$.
Apply Gaussian KL formula
For a diagonal covariance $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and prior $\mathcal{N}(0, I)$, the dimensions decouple. Summing the univariate formula with $\mu_2 = 0$, $\sigma_2 = 1$:
$$D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z)) = \sum_{j=1}^{d}\left(-\frac{1}{2}\log \sigma_j^2 + \frac{\sigma_j^2 + \mu_j^2 - 1}{2}\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right).$$
Interpretation
Each latent dimension contributes independently to the KL cost. The per-dimension term satisfies $\frac{1}{2}\left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right) \ge 0$, with equality iff $\mu_j = 0$ and $\sigma_j^2 = 1$ (matching the prior).
Dimensions with large $|\mu_j|$ carry more "information" about $x$ (the mean shifts away from the prior) and incur higher rate cost. This is the rate-distortion tradeoff in action: each dimension acts as a "channel" with capacity determined by how far the encoder pushes the posterior from the prior.
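A quick Monte Carlo sanity check of the closed form (a sketch assuming NumPy; $\mu$ and $\sigma$ are drawn arbitrarily):

```python
# Compare the analytic KL with a Monte Carlo estimate of
# E_q[ log q(z|x) - log p(z) ] for a random (mu, sigma).
import numpy as np

rng = np.random.default_rng(4)
d = 4
mu, sigma = rng.normal(size=d), np.exp(0.3 * rng.normal(size=d))

closed = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

z = mu + sigma * rng.standard_normal((100_000, d))      # z ~ q(z|x)
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
print(f"closed form : {closed:.4f} nats")
print(f"Monte Carlo : {np.mean(log_q - log_p):.4f} nats")
```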
ex-ch08-08
Medium. A source coding system with a helper has $X$ uniform on $\{0, 1, 2, 3\}$ and $Y = X \bmod 2$ (the helper observes only the parity of $X$). What is the minimum rate $R_X$ as a function of the helper rate $R_{\text{helper}}$?
$Y = X \bmod 2$ takes values in $\{0, 1\}$ with $H(Y) = 1$ bit.
Given $Y$, $X$ is uniform on two values, so $H(X \mid Y) = 1$ bit.
Compute entropies
$H(X) = 2$ bits, $H(Y) = 1$ bit. $H(X \mid Y) = 1$ bit (given the parity, two equally likely values remain). $H(Y \mid X) = 0$ (parity is a deterministic function of $X$).
Rate region
- At $R_{\text{helper}} = 0$: $R_X = H(X) = 2$ bits (no help).
- At $R_{\text{helper}} \ge H(Y) = 1$: $R_X = H(X \mid Y) = 1$ bit (full help).
Note that the Slepian-Wolf corner point $R_Y = H(Y \mid X) = 0$ does not apply here: it presumes the decoder can first recover $X$ and use it as side information for $Y$, which is circular when the main message alone is insufficient. The helper must genuinely convey $Y$, at rate up to $H(Y) = 1$ bit.
For intermediate rates, the one-helper (Ahlswede-Körner-Wyner) region $R_X \ge H(X \mid U)$, $R_{\text{helper}} \ge I(Y; U)$ with $U - Y - X$ gives $R_X \ge H(X) - I(X; U) \ge 2 - I(Y; U) \ge 2 - R_{\text{helper}}$, achievable by describing $Y$ through an erasure channel.
The transition is therefore gradual rather than sharp: $R_X(R_{\text{helper}}) = 2 - \min(R_{\text{helper}}, 1)$ bits, as traced out in the sketch below.
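A tiny sketch of the boundary via the erasure auxiliary (assuming NumPy; `alpha` is the probability that the helper's description of $Y$ survives rather than being erased):

```python
# U = Y with probability alpha, an erasure symbol otherwise. For uniform binary Y,
# I(Y;U) = alpha; and H(X|U) = alpha*H(X|Y) + (1-alpha)*H(X) = 2 - alpha.
import numpy as np

for alpha in np.linspace(0, 1, 6):
    R_helper = alpha          # I(Y;U) = alpha
    R_X = 2 - alpha           # H(X|U): 1 bit left if U = Y, 2 bits if erased
    print(f"R_helper = {R_helper:.1f} -> R_X = {R_X:.1f} bits")
```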
ex-ch08-09
Challenge. Consider the information bottleneck for binary classification: $X$ is a feature vector, $Y \in \{0, 1\}$ is a binary label, and $T \in \{0, 1\}$ is a binary compressed representation. Show that the optimal $p(t \mid x)$ at any $\beta$ is a soft clustering that groups inputs with similar posterior $p(y \mid x)$.
The IB self-consistent equation involves $D_{\text{KL}}(p(y \mid x) \,\|\, p(y \mid t))$.
With binary $T$, this becomes a two-cluster problem.
Inputs with similar $p(y \mid x)$ should go to the same cluster.
Apply the IB equations
With binary $T$:
$$\frac{p(t=1 \mid x)}{p(t=0 \mid x)} = \frac{p(t=1)}{p(t=0)} \exp\left(-\beta \left[D_{\text{KL}}\big(p(y \mid x) \,\|\, p(y \mid t=1)\big) - D_{\text{KL}}\big(p(y \mid x) \,\|\, p(y \mid t=0)\big)\right]\right).$$
For large $\beta$, $x$ is assigned to cluster $0$ or $1$ based on which cluster's average posterior is closer (in the KL sense) to $p(y \mid x)$.
Interpretation
For binary $Y$ and binary $T$, the KL divergence reduces to comparing $p(y=1 \mid x)$ with $p(y=1 \mid t)$. Inputs with $p(y=1 \mid x)$ close to $p(y=1 \mid t=1)$ tend to cluster $1$, and those with $p(y=1 \mid x)$ close to $p(y=1 \mid t=0)$ tend to cluster $0$.
As $\beta \to \infty$ (hard clustering), this becomes a deterministic threshold on $p(y=1 \mid x)$. At finite $\beta$, the clustering is soft, with the "temperature" $1/\beta$ controlling the fuzziness.
This is precisely what a logistic regression classifier does; the IB provides an information-theoretic justification for soft classification.
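A small illustration (assuming NumPy; the cluster posteriors `q0`, `q1`, the uniform cluster prior, and the sample values of $p(y{=}1 \mid x)$ are made up for the demo) of how the assignment sharpens with $\beta$:

```python
# Soft IB assignment p(t=1|x) for binary Y, binary T, uniform cluster prior:
# the log-odds are -beta * [KL(p(y|x)||q1) - KL(p(y|x)||q0)], a sigmoid in the end,
# which is exactly the logistic-regression form noted above.
import numpy as np

def kl_bern(a, b):  # KL between Bernoulli(a) and Bernoulli(b), in nats
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

q0, q1 = 0.2, 0.8                              # cluster-average posteriors (assumed)
py1_x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # p(y=1|x) for five inputs
for beta in (1.0, 4.0, 50.0):
    logits = -beta * (kl_bern(py1_x, q1) - kl_bern(py1_x, q0))
    pt1_x = 1 / (1 + np.exp(-logits))          # sigmoid of the log-odds
    print(f"beta={beta:5.1f}: p(t=1|x) =", np.round(pt1_x, 3))
```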
ex-ch08-10
Medium. In the DISCUS framework for Slepian-Wolf coding, explain why the syndrome decoding problem with $X = Y \oplus Z$, $Z \sim \text{Bern}(p)$, is equivalent to channel decoding on a BSC($p$).
Think of $\mathbf{z} = \mathbf{x} \oplus \mathbf{y}$ as a codeword error pattern on a BSC.
The syndrome difference $H\mathbf{x} \oplus H\mathbf{y} = H\mathbf{z}$ is the syndrome of the error pattern.
Channel coding analogy
In standard channel coding on a BSC($p$): a codeword $\mathbf{c}$ is transmitted, and the received vector is $\mathbf{r} = \mathbf{c} \oplus \mathbf{e}$, where $\mathbf{e}$ is the error pattern with i.i.d. $\text{Bern}(p)$ entries. The decoder uses the syndrome $H\mathbf{r} = H\mathbf{e}$ (since $H\mathbf{c} = \mathbf{0}$ for valid codewords) to estimate $\mathbf{e}$.
DISCUS equivalence
In DISCUS: $\mathbf{x}$ is the source, $\mathbf{y}$ is the side information, and $\mathbf{z} = \mathbf{x} \oplus \mathbf{y}$ is the "correlation noise." The encoder sends $\mathbf{s} = H\mathbf{x}$, and the decoder computes $\mathbf{s} \oplus H\mathbf{y} = H(\mathbf{x} \oplus \mathbf{y}) = H\mathbf{z}$.
This is identical to the channel decoding problem: the decoder has the syndrome $H\mathbf{z}$ and must recover $\mathbf{z}$, where $\mathbf{z}$ has i.i.d. $\text{Bern}(p)$ entries. Any BSC($p$)-capacity-approaching decoder (e.g., belief propagation for LDPC codes) solves both problems.
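A toy end-to-end demo of this equivalence (a sketch assuming NumPy), using the standard Hamming (7,4) parity-check matrix: since Hamming codes correct any single error, the decoder recovers $\mathbf{z}$, and hence $\mathbf{x} = \mathbf{y} \oplus \hat{\mathbf{z}}$, whenever the correlation noise has weight at most one.

```python
# Syndrome-based Slepian-Wolf coding with Hamming(7,4): the encoder sends only
# the 3-bit syndrome H x; the decoder forms (H x) xor (H y) = H z and corrects z.
import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])          # column j is j in binary, LSB first

rng = np.random.default_rng(5)
y = rng.integers(0, 2, 7)                      # side information at the decoder
z = np.zeros(7, int)
z[rng.integers(7)] = 1                         # one-bit correlation noise
x = (y + z) % 2                                # source: X = Y xor Z

s = H @ x % 2                                  # 3-bit syndrome sent by the encoder
s_z = (s + H @ y) % 2                          # decoder computes H(x xor y) = H z
# For Hamming codes, the syndrome read as a binary number is the error position.
z_hat = np.zeros(7, int)
pos = int(s_z @ [1, 2, 4]) - 1                 # 0-indexed error location
if pos >= 0:
    z_hat[pos] = 1
x_hat = (y + z_hat) % 2
print("recovered x correctly:", np.array_equal(x_hat, x))
```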