The Normal Approximation

Why Finite Blocklength Matters

Shannon's channel coding theorem tells us that reliable communication is possible at any rate below capacity, provided we use long enough codes. But how long is long enough? The classical theory is silent on this point: it gives us the limit ($n \to \infty$) but says nothing about what happens at $n = 100$ or $n = 1000$.

This gap between theory and practice has always existed, but it became acute with the emergence of ultra-reliable low-latency communication (URLLC) in 5G. URLLC targets error probabilities of $10^{-5}$ to $10^{-9}$ at blocklengths of 100-1000 symbols and latencies of 1 ms. At these operating points, the classical capacity formula $C = \frac{1}{2}\log(1 + \text{SNR})$ is a terrible predictor of achievable performance. We need a finer tool.

That tool is the normal approximation, which describes the maximum coding rate as a function of three quantities: the capacity $C$, the channel dispersion $V$, and the blocklength $n$.

Normal Approximation: Convergence to Capacity

Animated sweep of $R^*(n, \varepsilon)$ as the blocklength $n$ increases from 50 to 2000. The gap to capacity $C$ shrinks as $\sqrt{V/n}$, with the dispersion $V$ governing the speed of convergence.

Definition:

Information Density

For a channel $p_{Y|X}$ with input distribution $p_X$ and induced output distribution $p_Y(y) = \sum_x p_X(x) p_{Y|X}(y|x)$, the information density is the random variable:

$$\iota(X; Y) = \log \frac{p_{Y|X}(Y|X)}{p_Y(Y)}.$$

The information density measures the "amount of information" that a specific input-output pair $(x, y)$ conveys. Its expectation is the mutual information:

$$\mathbb{E}[\iota(X; Y)] = I(X; Y).$$

For a memoryless channel over $n$ uses with i.i.d. inputs, the cumulative information density is:

$$\iota(\mathbf{X}; \mathbf{Y}) = \sum_{i=1}^{n} \iota(X_i; Y_i).$$

The information density is a random variable, not a number. This is the conceptual shift from classical information theory: instead of averaging over channel realizations (which makes sense for $n \to \infty$ by the law of large numbers), we track the full distribution of $\iota$, because for finite $n$, the fluctuations around the mean matter.
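To make the "random variable" point concrete, here is a minimal numerical sketch (assuming Python with NumPy; the BSC parameters are illustrative). It tabulates the four possible values of $\iota$ for a BSC and checks that their mean equals the mutual information:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.1, 0.9]])       # BSC(0.1): row x is p_{Y|X}(. | x)
p_x = np.array([0.5, 0.5])       # uniform input distribution

p_y = p_x @ P                    # induced output distribution p_Y
iota = np.log2(P / p_y[None, :]) # iota(x; y) for every (x, y) pair, in bits
joint = p_x[:, None] * P         # joint law p_{X,Y}: the distribution of iota

print(iota)                      # the four possible values of the random variable
print(np.sum(joint * iota))      # E[iota] = I(X;Y) = 1 - h(0.1) ≈ 0.5310 bits
```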

Information density

The log-likelihood ratio $\iota(x; y) = \log(p_{Y|X}(y|x)/p_Y(y))$ measuring the information conveyed by a specific input-output pair. Its mean is the mutual information; its variance (at the capacity-achieving input) is the channel dispersion.

Related: Channel dispersion, Normal approximation

Definition:

Channel Dispersion

The channel dispersion of a DMC $p_{Y|X}$ at the capacity-achieving input distribution $p_X^*$ is:

$$V = \text{Var}[\iota(X; Y)] = \mathbb{E}\!\left[\left(\iota(X; Y) - C\right)^2\right]$$

where $(X, Y) \sim p_X^* \cdot p_{Y|X}$ and $C = I(X; Y)$ at $p_X^*$.

Equivalently:

$$V = \sum_{x, y} p_X^*(x) p_{Y|X}(y|x) \left(\log \frac{p_{Y|X}(y|x)}{p_Y(y)} - C\right)^2.$$

The dispersion has units of (nats/channel use)$^2$ (or (bits/channel use)$^2$ with $\log_2$).

The channel dispersion is the information-theoretic analog of volatility in finance. A channel with high dispersion is "unpredictable" at the level of individual symbols, even when its capacity matches that of a more predictable channel. Intuitively, high dispersion means we need longer codes to concentrate the empirical information density around its mean.
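The double-sum formula translates directly into code. A minimal sketch assuming NumPy (the helper name `dmc_dispersion` is ours, not from any library):

```python
import numpy as np

def dmc_dispersion(P, p_x):
    """Mutual information (bits) and Var[iota] (bits^2) of a DMC at input law p_x."""
    p_y = p_x @ P                           # induced output distribution
    joint = p_x[:, None] * P                # joint law p_{X,Y}
    mask = joint > 0                        # zero-probability pairs contribute nothing
    iota = np.zeros_like(P, dtype=float)
    py = np.broadcast_to(p_y, P.shape)
    iota[mask] = np.log2(P[mask] / py[mask])
    I = np.sum(joint * iota)                # mean of the information density
    V = np.sum(joint * (iota - I) ** 2)     # its variance: the dispersion at p_x
    return I, V

# BSC(0.1) at the capacity-achieving uniform input; closed form gives
# V = p(1-p) (log2((1-p)/p))^2 ≈ 0.9043 bits^2.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
print(dmc_dispersion(P, np.array([0.5, 0.5])))   # ≈ (0.5310, 0.9043)
```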

Channel dispersion

The variance of the information density under the capacity-achieving input distribution. Governs the second-order correction to capacity at finite blocklength: higher dispersion means slower convergence to the capacity limit.

Related: Information density, Normal approximation

Theorem: The Normal Approximation (Polyanskiy-Poor-Verdú)

For a DMC with capacity $C > 0$ and dispersion $V > 0$, the maximum coding rate at blocklength $n$ and error probability $\epsilon \in (0, 1)$ satisfies:

$$R^*(n, \epsilon) = C - \sqrt{\frac{V}{n}}\, Q^{-1}(\epsilon) + O\!\left(\frac{\log n}{n}\right)$$

where $Q^{-1}(\cdot)$ is the inverse of the Gaussian Q-function $Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-t^2/2}\,dt$.

The $O(\log n / n)$ remainder depends on the third moment of the information density (via the Berry-Esseen theorem) and can be bounded explicitly.

The normal approximation says that the capacity penalty at finite blocklength is $\sqrt{V/n}\, Q^{-1}(\epsilon)$. This penalty:

  • Grows with $V$ (more dispersive channels converge more slowly),
  • Shrinks as $1/\sqrt{n}$ (doubling the blocklength reduces the penalty only by a factor of $\sqrt{2}$),
  • Grows with reliability (smaller $\epsilon$ means larger $Q^{-1}(\epsilon)$).

The point is that $V$ is the fundamental "speed of convergence" parameter, much like the variance in the central limit theorem. Two channels with the same capacity but different dispersions behave very differently at short blocklengths.
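The expansion itself is one line to evaluate. A sketch assuming SciPy, whose `norm.isf` is the inverse survival function $Q^{-1}$ (the $O(\log n/n)$ term is dropped, and the channel parameters below are illustrative):

```python
import numpy as np
from scipy.stats import norm

def normal_approx_rate(C, V, n, eps):
    """Normal approximation to R*(n, eps); the O(log n / n) term is dropped."""
    return C - np.sqrt(V / n) * norm.isf(eps)    # norm.isf(eps) = Q^{-1}(eps)

# Same capacity, different dispersions: the gap to C at n = 200 differs markedly.
for V in (0.25, 1.0):
    print(V, normal_approx_rate(C=1.0, V=V, n=200, eps=1e-3))
```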

Example: Dispersion of the Binary Symmetric Channel

Compute the channel dispersion of the BSC with crossover probability $p = 0.1$, and determine the blocklength needed to achieve rate $R = 0.4$ bits/use at error probability $\epsilon = 10^{-3}$.
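One way to carry out the computation, as a sketch assuming SciPy (the closed-form $C$ and $V$ match the table later in this section; the $O(\log n/n)$ remainder is ignored, so the blocklength is approximate):

```python
import numpy as np
from scipy.stats import norm

p, R, eps = 0.1, 0.4, 1e-3
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)    # binary entropy h(p)
C = 1 - h                                         # ≈ 0.5310 bits/use
V = p * (1 - p) * np.log2((1 - p) / p) ** 2       # ≈ 0.9043 bits^2

# Solve C - sqrt(V/n) Q^{-1}(eps) = R for n, dropping the O(log n / n) term:
n = V * (norm.isf(eps) / (C - R)) ** 2
print(f"C ≈ {C:.4f}, V ≈ {V:.4f}, required n ≈ {n:.0f}")   # n ≈ 503
```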

Example: Dispersion of the AWGN Channel

For the real AWGN channel $Y = X + Z$ with $Z \sim \mathcal{N}(0, \sigma^2)$ and power constraint $P$, show that:

$$C = \frac{1}{2}\log(1 + \text{SNR}), \qquad V = \frac{\text{SNR}(2 + \text{SNR})}{2(1 + \text{SNR})^2} \cdot (\log e)^2$$

where $\text{SNR} = P/\sigma^2$. Compute $V$ for $\text{SNR} = 0$ dB and $10$ dB.
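A sketch of the requested numerical evaluation (assuming NumPy, and reading $\log e$ as $\log_2 e$ so that $V$ comes out in bits$^2$):

```python
import numpy as np

def awgn_dispersion_bits(snr_db):
    """Dispersion of the real AWGN channel in bits^2 (log e read as log2 e)."""
    snr = 10 ** (snr_db / 10)
    v_nats = snr * (2 + snr) / (2 * (1 + snr) ** 2)   # dispersion in nats^2
    return v_nats * np.log2(np.e) ** 2                # convert nats^2 -> bits^2

for snr_db in (0, 10):
    print(f"{snr_db} dB: V ≈ {awgn_dispersion_bits(snr_db):.4f} bits^2")
# 0 dB: 3/8 nats^2 ≈ 0.7805 bits^2;  10 dB: ≈ 0.4959 nats^2 ≈ 1.0321 bits^2
```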

Rate vs Blocklength: The Normal Approximation

Visualize $R^*(n, \epsilon)$ for different channels (AWGN, BSC, BEC), error probabilities, and SNR levels. The asymptotic capacity is shown as a horizontal line. Observe how the gap to capacity shrinks as $1/\sqrt{n}$.


Common Mistake: Using Capacity as a Design Target at Short Blocklengths

Mistake:

Designing a communication system for $n = 200$ symbols at a rate equal to $90\%$ of the Shannon capacity, expecting error probability $\sim 10^{-5}$.

Correction:

At $n = 200$ and $\epsilon = 10^{-5}$, the normal approximation gives a rate well below capacity. For example, for the AWGN channel at $\text{SNR} = 6$ dB: $C \approx 1.16$ bits/use and $V \approx 1.0$ bits$^2$, so $R^*(200, 10^{-5}) \approx 1.16 - \sqrt{1.0/200} \times 4.26 \approx 0.86$ bits/use. That is only $74\%$ of capacity, not $90\%$. Using the capacity formula for short-blocklength system design leads to under-provisioning of SNR or over-estimation of throughput.
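These numbers are easy to reproduce. A quick sanity check, as a sketch assuming SciPy:

```python
import numpy as np
from scipy.stats import norm

snr = 10 ** (6 / 10)                                   # 6 dB in linear scale
C = 0.5 * np.log2(1 + snr)                             # ≈ 1.16 bits/use
V = snr * (2 + snr) / (2 * (1 + snr) ** 2) * np.log2(np.e) ** 2   # ≈ 1.0 bits^2
R = C - np.sqrt(V / 200) * norm.isf(1e-5)              # normal approximation at n = 200
print(f"C = {C:.3f} bits/use, R* ≈ {R:.3f} bits/use ({R / C:.0%} of capacity)")
```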

Historical Note: The Polyanskiy-Poor-Verdú Revolution

2010

For decades, the gap between Shannon's asymptotic results and practical finite-length codes was addressed heuristically: engineers used union bounds, error exponents, and simulation. The 2010 paper by Polyanskiy, Poor, and Verdú changed the game by providing tight non-asymptotic bounds that could be computed for any channel and any blocklength. The normal approximation emerged as a corollary, but the real contribution was the toolkit: the random coding union (RCU) bound for achievability and the meta-converse for the converse. These tools replaced the AEP-based arguments of classical information theory with hypothesis-testing machinery that works for any $n$. The impact was immediate: within five years, finite-blocklength analysis became a standard design tool for 5G URLLC. Polyanskiy's 2010 PhD thesis is one of the most influential dissertations in modern information theory.

Key Takeaway

The normal approximation $R^*(n, \epsilon) \approx C - \sqrt{V/n}\,Q^{-1}(\epsilon)$ is the fundamental formula for finite-blocklength communication. The channel dispersion $V$ governs the speed of convergence to capacity: it is the variance of the information density. Two channels with the same capacity but different dispersions can behave very differently at short blocklengths. For URLLC design, the normal approximation is far more accurate than the capacity formula.

Quick Check

Two channels have the same capacity $C = 1$ bit/use, but Channel A has dispersion $V_A = 0.5$ bits$^2$ and Channel B has $V_B = 2$ bits$^2$. At blocklength $n = 500$ and $\epsilon = 10^{-3}$, which channel achieves a higher rate?

Channel A (lower dispersion)

Channel B (higher dispersion)

Both achieve the same rate (same capacity)

Normal approximation

The second-order asymptotic expansion of the maximum coding rate: $R^*(n, \epsilon) = C - \sqrt{V/n}\,Q^{-1}(\epsilon) + O(\log n/n)$, valid for DMCs with positive capacity and dispersion.

Related: Channel dispersion, Information density

Definition:

Maximal Coding Rate

The maximal coding rate at blocklength $n$ and error probability $\epsilon$ is:

$$R^*(n, \epsilon) = \frac{1}{n} \log M^*(n, \epsilon)$$

where $M^*(n, \epsilon)$ is the maximum number of codewords in a code of blocklength $n$ whose average (or maximal) error probability does not exceed $\epsilon$:

$$M^*(n, \epsilon) = \max\{M : \exists \text{ an } (n, M, \epsilon)\text{-code}\}.$$

Shannon's coding theorem states $\lim_{n \to \infty} R^*(n, \epsilon) = C$ for any $\epsilon \in (0, 1)$. The normal approximation refines this by characterizing the speed of convergence.

Dispersions of Common Channels

| Channel | Capacity $C$ | Dispersion $V$ |
|---|---|---|
| BSC($p$) | $1 - h(p)$ | $p(1-p)\left(\log\frac{1-p}{p}\right)^2$ |
| BEC($\delta$) | $1 - \delta$ | $\delta(1-\delta)(\log 2)^2$ |
| AWGN($\text{SNR}$) | $\frac{1}{2}\log(1+\text{SNR})$ | $\frac{\text{SNR}(\text{SNR}+2)}{2(1+\text{SNR})^2}(\log e)^2$ |
| Z-channel($p$) | see Cover & Thomas | computed numerically |

All formulas use an arbitrary logarithm base: take $\log = \log_2$ (so $\log 2 = 1$ and $\log e = \log_2 e$) for bits and bits$^2$, or natural logs for nats and nats$^2$. In bits$^2$, the BEC dispersion is simply $\delta(1-\delta)$.

Note: Among the canonical binary-input channels, the BSC has a notably high dispersion-to-capacity ratio, making it one of the harder channels to code for at short blocklengths.
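As a cross-check of the closed forms, the sketch below (assuming NumPy; it repeats the generic variance computation from the dispersion sketch above so the block stays self-contained) recovers the BSC and BEC entries numerically. Uniform inputs are capacity-achieving for both channels:

```python
import numpy as np

def dispersion_bits(P, p_x):
    """Var of the information density (bits^2) of a DMC at input law p_x."""
    p_y = p_x @ P
    joint = p_x[:, None] * P
    mask = joint > 0
    iota = np.zeros_like(P, dtype=float)
    py = np.broadcast_to(p_y, P.shape)
    iota[mask] = np.log2(P[mask] / py[mask])
    I = np.sum(joint * iota)
    return np.sum(joint * (iota - I) ** 2)

u = np.array([0.5, 0.5])          # uniform input: capacity-achieving for BSC and BEC
p, d = 0.1, 0.5
bsc = np.array([[1 - p, p], [p, 1 - p]])
bec = np.array([[1 - d, 0.0, d], [0.0, 1 - d, d]])   # output alphabet: {0, 1, erasure}
print(dispersion_bits(bsc, u), p * (1 - p) * np.log2((1 - p) / p) ** 2)  # ≈ 0.9043 both
print(dispersion_bits(bec, u), d * (1 - d))                              # ≈ 0.25 both
```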

Quick Check

The BEC with erasure probability $\delta = 0.5$ has capacity $C = 0.5$ bits/use and dispersion $V = \delta(1-\delta) = 0.25$ bits$^2$. At $n = 1000$ and $\epsilon = 10^{-5}$, what is the approximate maximum coding rate?

$R^* \approx 0.5 - \sqrt{0.25/1000} \times 4.26 \approx 0.433$ bits/use

$R^* \approx 0.45$ bits/use

$R^* \approx 0.5$ bits/use (close to capacity)
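To verify the arithmetic in the first option, a one-line check (a sketch assuming SciPy):

```python
import numpy as np
from scipy.stats import norm

d, n, eps = 0.5, 1000, 1e-5
C, V = 1 - d, d * (1 - d)                    # BEC(0.5): C = 0.5 bits, V = 0.25 bits^2
print(C - np.sqrt(V / n) * norm.isf(eps))    # ≈ 0.433 bits/use
```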