Fundamental Inequalities

The Workhorses of Converse Proofs

Two inequalities dominate the converse (impossibility) arguments throughout information theory: the data processing inequality and Fano's inequality. The data processing inequality says that no processing can create information. Fano's inequality converts a small probability of error into an upper bound on conditional entropy — and from there, into a bound on rate. Together, they are the tools that prove fundamental limits are tight: you truly cannot do better than capacity.

Definition:

Markov Chain

Random variables X, Y, Z form a Markov chain in that order, written X \multimap Y \multimap Z (or X \to Y \to Z), if the conditional distribution of Z given (X, Y) depends only on Y:

P_{Z|XY}(z|x,y) = P_{Z|Y}(z|y) \quad \text{for all } x, y, z.

Equivalently, X and Z are conditionally independent given Y.

Markov chain

A sequence of random variables X \to Y \to Z where Z is conditionally independent of X given Y. In information theory, Markov chains arise naturally in channel models: the message X (the channel input), the channel output Y, and the decoder output \hat{X} form a chain X \to Y \to \hat{X}.

Related: Data processing inequality

Theorem: Data Processing Inequality

If X \multimap Y \multimap Z forms a Markov chain, then:

I(X;Z) \leq I(X;Y).

In particular, if Z = g(Y) is a deterministic function of Y, then I(X;g(Y)) \leq I(X;Y).

You cannot increase information about X by processing Y. The channel output Y is the "bottleneck" — any further processing can only lose information, never gain it. This is why sufficient statistics are so valuable: they are the functions of Y that preserve all information about X.
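As a quick numerical check, the sketch below cascades two binary symmetric channels to form a Markov chain X \to Y \to Z and verifies I(X;Z) \leq I(X;Y) for a uniform input bit; the crossover probabilities 0.1 and 0.05 are arbitrary illustrative choices.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(eps):
    """I(X;Y) in bits for a uniform input bit over a BSC with crossover probability eps."""
    return 1.0 - hb(eps)

# Markov chain X -> Y -> Z: a BSC(0.1) followed by a BSC(0.05).
# The cascade is again a BSC; its crossover probability is the chance of an odd number of flips.
e1, e2 = 0.1, 0.05
e_cascade = e1 * (1 - e2) + (1 - e1) * e2

I_XY = bsc_mi(e1)          # ~0.531 bits
I_XZ = bsc_mi(e_cascade)   # ~0.416 bits
print(f"I(X;Y) = {I_XY:.4f} bits, I(X;Z) = {I_XZ:.4f} bits")
assert I_XZ <= I_XY        # data processing inequality
```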

Data processing inequality

If X \to Y \to Z is a Markov chain, then I(X;Z) \leq I(X;Y). Processing data cannot create information. Equality holds iff Z is a sufficient statistic of Y for X.

Related: Markov chain, Sufficient statistic

Sufficient statistic

A function T(Y) is a sufficient statistic for X based on Y if X \multimap T(Y) \multimap Y, i.e., T(Y) captures all the information that Y contains about X. Equivalently, I(X; T(Y)) = I(X;Y).

Related: Data processing inequality

Theorem: Fano's Inequality

Let X be a discrete random variable with alphabet \mathcal{X}, |\mathcal{X}| = M. Let \hat{X} = g(Y) be an estimate of X based on observation Y. Define the probability of error P_e = \Pr(\hat{X} \neq X). Then:

H(X|Y) \leq h_b(P_e) + P_e \log(M-1) \triangleq H(P_e, M).

A useful weakened form: H(X|Y) \leq 1 + P_e \log(M-1).

Fano's inequality provides a converse bridge: if we can decode X from Y with small error probability P_e, then the conditional entropy H(X|Y) must be small. Equivalently, if H(X|Y) is large, then P_e must be large — reliable decoding is impossible.

This is the key tool in every converse proof. The argument always goes: assume a rate R > C, use Fano to bound the conditional entropy of the message in terms of P_e, then show this is incompatible with small P_e (or, in the strong-converse form, that P_e \to 1).
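As a sketch of how the two inequalities combine, here is the standard weak-converse chain, assuming a code with 2^{nR} equally likely messages W, channel input X^n, channel output Y^n, decoder output \hat{W}, and a memoryless channel of capacity C used n times (so W \to X^n \to Y^n \to \hat{W} is a Markov chain):

```latex
nR = H(W) = H(W \mid \hat{W}) + I(W; \hat{W})
   \le \underbrace{1 + P_e \, nR}_{\text{Fano, } M = 2^{nR}}
     + \underbrace{I(X^n; Y^n)}_{\text{data processing}}
   \le 1 + P_e \, nR + nC.
```

Rearranging gives R \le (C + 1/n)/(1 - P_e): if R > C, the error probability cannot vanish as n grows.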

Example: Fano's Inequality for the BSC

A binary symmetric channel has crossover probability \epsilon = 0.1. We send n i.i.d. bits over this channel without coding (rate R = 1). Use Fano's inequality to lower-bound the bit error rate.
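A sketch of one way to work the example: for uniform input bits, H(X|Y) = h_b(\epsilon) per bit, and with M = 2 Fano reduces to H(X|Y) \leq h_b(P_e); inverting h_b on [0, 1/2] then gives P_e \geq \epsilon. The snippet below carries out that inversion numerically.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.1                      # BSC crossover probability from the example
H_X_given_Y = hb(eps)          # per-bit conditional entropy for uniform inputs

# Fano with M = 2: H(X|Y) <= h_b(P_e), so P_e >= h_b^{-1}(H(X|Y)).
# Invert h_b on [0, 1/2], where it is increasing, by bisection.
lo, hi = 0.0, 0.5
for _ in range(60):
    mid = (lo + hi) / 2
    if hb(mid) < H_X_given_Y:
        lo = mid
    else:
        hi = mid

print(f"H(X|Y) = {H_X_given_Y:.4f} bits")
print(f"Fano lower bound on bit error rate: {lo:.4f}")  # ~0.1
```

The bound P_e \geq 0.1 matches the actual uncoded bit error rate, so Fano is tight here, as expected for a binary alphabet.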

Common Mistake: Fano's Inequality Is Not Always Tight

Mistake:

Treating the Fano bound P_e \log(M-1) as if it gives the exact conditional entropy. In particular, using the weakened form H(X|Y) \leq 1 + P_e \log M when a tighter bound is needed.

Correction:

Fano's inequality gives an upper bound on H(X|Y), not an equality. For binary alphabets (M = 2), Fano's inequality is tight (it becomes H(X|Y) \leq h_b(P_e), which can be achieved). For larger alphabets, the bound can be loose because the entropy of X conditioned on an error may be much less than \log(M-1). Tighter bounds exist (e.g., using the list size in list decoding), but Fano's inequality suffices for most converse proofs.
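To make the possible looseness concrete, here is a small hypothetical construction: M = 4, and whenever the estimate is wrong, X is always one specific alternative symbol rather than being spread over all M - 1 of them.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

M, Pe = 4, 0.1
# Hypothetical channel: given Y, X equals the decoder's guess with probability 0.9;
# when the guess is wrong, X is always one *specific* other symbol, never the remaining two.
actual_H = hb(Pe)                              # the error carries only a binary ambiguity
fano_bound = hb(Pe) + Pe * math.log2(M - 1)    # Fano allows errors spread over all M-1 symbols
print(f"actual H(X|Y) = {actual_H:.3f} bits, Fano bound = {fano_bound:.3f} bits")
```

Here the true conditional entropy is about 0.47 bits while the Fano bound is about 0.63 bits: an upper bound, not an equality.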

⚠️Engineering Note

Data Processing Inequality and Receiver Architecture

The data processing inequality has direct implications for receiver design. Consider the processing chain in a typical wireless receiver:

X \to Y_{\text{analog}} \to Y_{\text{digital}} \to Y_{\text{filtered}} \to \hat{X}

At each stage, I(X; \cdot) can only decrease. This means:

  • ADC resolution matters: If the ADC quantizes too coarsely, information is lost irreversibly. For a given SNR, there is a minimum ADC resolution below which capacity is degraded.
  • Matched filtering is not optional: The matched-filter output is a sufficient statistic for the AWGN channel — it preserves all the information the received waveform carries about the transmitted symbol. A mismatched front-end filter generally discards information.
  • Early decisions are costly: Making hard decisions (e.g., hard demapping before decoding) discards soft information that the decoder could use. Soft-in/soft-out processing preserves mutual information through the decoding chain (a numerical sketch follows this note).
Practical Constraints
  • ADC quantization introduces irreversible information loss

  • Hard decisions before decoding lose soft information

  • Each processing stage must be designed to preserve mutual information
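The sketch below puts a number on the cost of early decisions for one illustrative case: equiprobable BPSK (X = \pm 1) over a real AWGN channel, comparing the mutual information carried by the soft observation Y with that carried by the hard decision \mathrm{sign}(Y). The BPSK/AWGN model and the noise levels are assumptions made for this example only.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def soft_mi(sigma):
    """I(X;Y) in bits for equiprobable BPSK X = +/-1 over Y = X + N, N ~ N(0, sigma^2),
    computed as h(Y) - h(N) with h(Y) obtained by numerical integration."""
    def pdf(y, mean):
        return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    h_noise = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)
    h_out, dy, y = 0.0, 1e-3, -1 - 10 * sigma
    while y < 1 + 10 * sigma:
        p = 0.5 * pdf(y, +1) + 0.5 * pdf(y, -1)
        if p > 0.0:
            h_out -= p * math.log2(p) * dy
        y += dy
    return h_out - h_noise

def hard_mi(sigma):
    """I(X; sign(Y)): hard decisions turn the channel into a BSC with crossover Q(1/sigma)."""
    return 1.0 - hb(q_func(1.0 / sigma))

for sigma in (0.5, 0.7, 1.0):
    print(f"sigma = {sigma}: soft {soft_mi(sigma):.3f} bits >= hard {hard_mi(sigma):.3f} bits")
```

The hard decision is one more stage in the chain, X \to Y \to \mathrm{sign}(Y), so the gap between the two numbers is exactly the mutual information discarded by deciding early.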

Fano's Inequality Bound

Visualize the Fano bound: for a given alphabet size M and error probability P_e, the conditional entropy H(X|Y) is upper bounded by h_b(P_e) + P_e \log(M-1). See how the bound tightens as P_e \to 0.

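A minimal sketch that tabulates the bound for an assumed alphabet size of M = 8, showing how it tightens as P_e \to 0:

```python
import math

def fano_bound(pe, M):
    """Fano upper bound on H(X|Y): h_b(P_e) + P_e * log2(M - 1)."""
    hb = 0.0 if pe in (0.0, 1.0) else -pe * math.log2(pe) - (1 - pe) * math.log2(1 - pe)
    return hb + pe * math.log2(M - 1)

M = 8  # assumed number of possible values of X
for pe in (0.01, 0.05, 0.1, 0.2, 0.5):
    print(f"P_e = {pe:4.2f}: H(X|Y) <= {fano_bound(pe, M):.3f} bits (out of log2 M = {math.log2(M):.0f})")
```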

Key Takeaway

Data processing and Fano's inequality are the two pillars of converse proofs. The data processing inequality (I(X;Z) \leq I(X;Y) for Markov chains X \to Y \to Z) says processing cannot create information. Fano's inequality (H(X|Y) \leq h_b(P_e) + P_e \log(M-1)) converts small error probability into small conditional entropy. Together, they prove that reliable communication above capacity is impossible.