The Rate-Distortion Function

The Information-Theoretic Minimum

What is the absolute minimum rate needed to represent a source at distortion $D$? Shannon answered this in 1959 with the rate-distortion function, a convex optimization over "test channels" $P_{\hat{X}|X}$ that maps each distortion level to its minimum rate. The answer is elegant: $R(D) = \min I(X;\hat{X})$ subject to $\mathbb{E}[d] \leq D$. Intuitively, the mutual information measures how many bits we need to "communicate" the reconstruction $\hat{X}$ to the decoder, and we minimize it subject to the distortion constraint.

Definition:

Rate-Distortion Function

For a source $X \sim P_X$ with distortion measure $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$, the rate-distortion function is
$$R(D) = \min_{\substack{P_{\hat{X}|X} : \\ \mathbb{E}[d(X, \hat{X})] \leq D}} I(X; \hat{X}).$$
The minimization is over all conditional distributions (test channels) $P_{\hat{X}|X}$ that achieve average distortion at most $D$. The joint distribution is $P_{X,\hat{X}}(x, \hat{x}) = P_X(x)\, P_{\hat{X}|X}(\hat{x}|x)$.

The test channel $P_{\hat{X}|X}$ is a mathematical device; it does not correspond to a physical channel. It represents the "best possible" stochastic mapping from source to reconstruction. The source distribution $P_X$ is fixed; only the test channel is optimized.
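As a concrete sketch of the objects in the definition (the source parameter and channel entries are illustrative choices, not from the text): fix a Bernoulli(0.3) source, pick an arbitrary test channel, build the joint distribution $P_{X,\hat{X}}(x,\hat{x}) = P_X(x)\,P_{\hat{X}|X}(\hat{x}|x)$, and evaluate both the objective $I(X;\hat{X})$ and the constraint $\mathbb{E}[d]$.

```python
import math

# Illustrative sketch: P_X is fixed; the test channel P(xhat|x) is the free
# variable that the rate-distortion program optimizes over.
p_x = [0.7, 0.3]                       # Bernoulli(0.3) source
channel = [[0.9, 0.1],                 # P(xhat | x=0)  (arbitrary example)
           [0.2, 0.8]]                 # P(xhat | x=1)
d = [[0, 1], [1, 0]]                   # Hamming distortion

# Joint distribution P(x, xhat) = P_X(x) P(xhat|x) and output marginal q(xhat)
joint = [[p_x[x] * channel[x][y] for y in range(2)] for x in range(2)]
q = [joint[0][y] + joint[1][y] for y in range(2)]

# Mutual information I(X; Xhat) in bits, and average distortion E[d]
I = sum(joint[x][y] * math.log2(joint[x][y] / (p_x[x] * q[y]))
        for x in range(2) for y in range(2))
E_d = sum(joint[x][y] * d[x][y] for x in range(2) for y in range(2))
print(I, E_d)
```

Any such channel gives one feasible point; $R(D)$ is the smallest $I(X;\hat{X})$ achievable among all channels whose $\mathbb{E}[d]$ is at most $D$.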

Theorem: Properties of the Rate-Distortion Function

For any source and bounded distortion measure:

  1. $R(D)$ is non-increasing in $D$: allowing more distortion never requires more rate.
  2. $R(D)$ is convex in $D$.
  3. $R(D) \geq 0$ for all $D$, with $R(D) = 0$ iff $D \geq D_{\max} = \min_{\hat{x}} \mathbb{E}[d(X, \hat{x})]$.
  4. For discrete sources with Hamming distortion: $R(0) = H(X)$.
  5. $R(D)$ is continuous on $[0, D_{\max}]$.
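Property 3 can be checked directly: $D_{\max}$ is the distortion of the best constant reconstruction, since at that distortion the encoder can send zero bits and the decoder always outputs the same $\hat{x}$. A minimal sketch for a Bernoulli(0.3) source with Hamming distortion (the numbers are illustrative):

```python
# D_max = min over xhat of E[d(X, xhat)]: the best single guess for X.
p_x = [0.7, 0.3]                      # Bernoulli(0.3) source
d = [[0, 1], [1, 0]]                  # Hamming distortion
D_max = min(sum(p_x[x] * d[x][y] for x in range(2)) for y in range(2))
print(D_max)                          # always guessing xhat = 0 costs p = 0.3
```

For Hamming distortion this reduces to $D_{\max} = \min(p, 1-p)$, which matches the endpoint $R(p) = 0$ for $p \leq 1/2$.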

Example: Binary Source with Hamming Distortion

Compute $R(D)$ for a Bernoulli($p$) source with Hamming distortion $d(x, \hat{x}) = \mathbf{1}\{x \neq \hat{x}\}$, where $p \leq 1/2$.
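The closed form is the one plotted below: $R(D) = H(p) - H(D)$ for $0 \leq D \leq p$, and $R(D) = 0$ for $D \geq p$. A small sketch evaluating it (function names are illustrative):

```python
import math

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def binary_rd(p, D):
    """R(D) = H(p) - H(D) for 0 <= D <= p, clipped to 0 beyond D = p."""
    return max(h2(p) - h2(min(D, p)), 0.0)

# Endpoints match the general properties: R(0) = H(p) and R(p) = 0
print(binary_rd(0.3, 0.0), binary_rd(0.3, 0.3))
```

The curve is convex and non-increasing, exactly as the theorem above requires.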

Binary Rate-Distortion Function

Plot $R(D) = H(p) - H(D)$ for a binary source. Adjust the source parameter $p$ and observe how the curve changes. Note the convex, non-increasing shape and the endpoints $R(0) = H(p)$ and $R(p) = 0$.


Rate-Distortion as a Convex Program

The rate-distortion optimization $\min_{P_{\hat{X}|X}} I(X;\hat{X})$ subject to $\mathbb{E}[d] \leq D$ is a convex program: the mutual information is convex in $P_{\hat{X}|X}$ for fixed $P_X$ (this is the opposite of the channel capacity problem, where $I$ is concave in $P_X$), and the distortion constraint is linear. Convexity means KKT conditions are sufficient for optimality, and iterative algorithms like Blahut-Arimoto converge to the global optimum. This is not just a mathematical convenience; it is why rate-distortion theory is computationally tractable.

Rate-distortion function $R(D)$

The minimum mutual information $I(X;\hat{X})$ over all test channels achieving average distortion at most $D$. The fundamental limit of lossy compression: no code at rate $R < R(D)$ can achieve distortion $D$.

Related: Rate-Distortion Function

Test channel

The conditional distribution $P_{\hat{X}|X}$ in the rate-distortion optimization. It represents the stochastic mapping from source to reconstruction that minimizes the required rate at a given distortion level. Not a physical channel.

Key Takeaway

The rate-distortion function $R(D) = \min_{P_{\hat{X}|X}: \mathbb{E}[d] \leq D} I(X;\hat{X})$ is the information-theoretic minimum rate for lossy compression at distortion $D$. It is convex and non-increasing in $D$, ranges from $H(X)$ (lossless) to 0 ($D = D_{\max}$), and is computed by a convex optimization over test channels. For a binary source with Hamming distortion: $R(D) = H(p) - H(D)$ for $0 \leq D \leq p$.