The Normal Approximation
Why Finite Blocklength Matters
Shannon's channel coding theorem tells us that reliable communication is possible at any rate below capacity, provided we use long enough codes. But how long is long enough? The classical theory is silent on this point: it gives us the limit ($n \to \infty$) but says nothing about what happens at $n = 100$ or $n = 1000$.
This gap between theory and practice has always existed, but it became acute with the emergence of ultra-reliable low-latency communication (URLLC) in 5G. URLLC targets error probabilities of $10^{-5}$ to $10^{-9}$ at blocklengths of 100-1000 symbols and latencies of 1 ms. At these operating points, the classical capacity formula is a terrible predictor of achievable performance. We need a finer tool.
That tool is the normal approximation, which describes the maximum coding rate $R^*(n, \epsilon)$ as a function of three quantities: the capacity $C$, the channel dispersion $V$, and the blocklength $n$.
Normal Approximation: Convergence to Capacity
Definition: Information Density
Information Density
For a channel $P_{Y|X}$ with input distribution $P_X$ and induced output distribution $P_Y$, the information density is the random variable:

$$i(X; Y) = \log \frac{P_{Y|X}(Y \mid X)}{P_Y(Y)}$$

The information density measures the "amount of information" that a specific input-output pair conveys. Its expectation is the mutual information:

$$\mathbb{E}[i(X; Y)] = I(X; Y)$$

For a memoryless channel over $n$ uses with i.i.d. inputs, the cumulative information density is:

$$i(X^n; Y^n) = \sum_{k=1}^{n} i(X_k; Y_k)$$
The information density is a random variable, not a number. This is the conceptual shift from classical information theory: instead of averaging over channel realizations (which makes sense as $n \to \infty$ by the law of large numbers), we track the full distribution of $i(X^n; Y^n)$, because for finite $n$, the fluctuations around the mean matter.
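To make this concrete, here is a minimal numerical sketch (assuming Python with NumPy; the crossover probability 0.11 is an illustrative choice). It samples input-output pairs of a BSC and checks that the empirical mean of the information density matches the mutual information:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.11            # illustrative BSC crossover probability
n_samples = 1_000_000

# Uniform input (capacity-achieving for the BSC); the channel flips each bit w.p. delta.
x = rng.integers(0, 2, n_samples)
flip = (rng.random(n_samples) < delta).astype(x.dtype)
y = x ^ flip

# i(x;y) = log2 P(y|x) - log2 P(y), with P_Y uniform (= 1/2) under uniform input.
p_y_given_x = np.where(x == y, 1 - delta, delta)
i_density = np.log2(p_y_given_x) - np.log2(0.5)

C = 1 + delta * np.log2(delta) + (1 - delta) * np.log2(1 - delta)  # 1 - h(delta)
print(f"empirical mean of i(X;Y): {i_density.mean():.4f} bits")
print(f"mutual information C:     {C:.4f} bits")
```

The sample variance `i_density.var()` already previews the next definition: it estimates the channel dispersion.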
Information density
The log-likelihood ratio measuring the information conveyed by a specific input-output pair. Its mean is the mutual information; its variance is the channel dispersion.
Related: Channel dispersion, Normal approximation
Definition: Channel Dispersion
Channel Dispersion
The channel dispersion $V$ of a DMC at the capacity-achieving input distribution is:

$$V = \mathrm{Var}[i(X; Y)]$$

where $X \sim P_X^\star$, $Y \sim P_{Y|X}(\cdot \mid X)$, and $P_X^\star$ is the input distribution achieving $C$.
Equivalently:

$$V = \mathbb{E}\!\left[i(X; Y)^2\right] - C^2$$
The dispersion has units of (nats/channel use)$^2$ (or (bits/channel use)$^2$ with $\log_2$).
The channel dispersion is the information-theoretic analog of volatility in finance. A channel with high dispersion is "unpredictable" at the individual symbol level, even if its average capacity is well-defined. Intuitively, high dispersion means we need longer codes to concentrate the empirical mutual information around its mean.
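For a discrete channel, both moments can be computed exactly from the transition matrix. A small sketch (Python/NumPy assumed; the function name and the choice of BSC(0.11) are ours for illustration); at the capacity-achieving input it returns exactly $C$ and $V$:

```python
import numpy as np

def information_density_moments(p_x, p_y_given_x):
    """Mean (mutual information) and variance (dispersion) of i(X;Y), in bits.

    p_x:         input distribution, shape (|X|,)
    p_y_given_x: channel transition matrix, shape (|X|, |Y|)
    """
    p_xy = p_x[:, None] * p_y_given_x              # joint distribution P(x, y)
    p_y = p_xy.sum(axis=0)                         # output marginal P(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        i = np.log2(p_y_given_x / p_y[None, :])    # information density values
    i = np.where(p_xy > 0, i, 0.0)                 # zero-probability pairs contribute nothing
    mean = (p_xy * i).sum()
    var = (p_xy * (i - mean) ** 2).sum()
    return mean, var

delta = 0.11
W = np.array([[1 - delta, delta],
              [delta, 1 - delta]])
C, V = information_density_moments(np.array([0.5, 0.5]), W)
print(f"C = {C:.3f} bits/use, V = {V:.3f} bits^2")   # ~0.500 and ~0.891
```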
Channel dispersion
The variance of the information density under the capacity-achieving input distribution. Governs the second-order correction to capacity at finite blocklength: higher dispersion means slower convergence to the capacity limit.
Related: Information density, Normal approximation
Theorem: The Normal Approximation (Polyanskiy-Poor-Verdu)
For a DMC with capacity $C$ and dispersion $V > 0$, the maximum coding rate at blocklength $n$ and error probability $\epsilon \in (0, 1)$ satisfies:

$$R^*(n, \epsilon) = C - \sqrt{\frac{V}{n}}\, Q^{-1}(\epsilon) + O\!\left(\frac{\log n}{n}\right)$$

where $Q^{-1}$ is the inverse of the Gaussian Q-function $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt$.
The remainder $O\!\left(\frac{\log n}{n}\right)$ depends on the third moment of the information density (via the Berry-Esseen theorem) and can be bounded explicitly.
The normal approximation says that the capacity penalty at finite blocklength is $\sqrt{V/n}\, Q^{-1}(\epsilon)$. This penalty:
- Grows with $\sqrt{V}$ (more dispersive channels converge more slowly),
- Shrinks as $1/\sqrt{n}$ (quadrupling the blocklength only halves the penalty),
- Grows with reliability (smaller $\epsilon$ means larger $Q^{-1}(\epsilon)$).
The point is that $V$ is the fundamental "speed of convergence" parameter, much like the variance in the central limit theorem. Two channels with the same capacity but different dispersions behave very differently at short blocklengths.
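The formula itself is one line of code. A sketch (Python with SciPy assumed; `norm.isf` is SciPy's inverse survival function, i.e. $Q^{-1}$; the channel numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

def normal_approximation(C, V, n, eps):
    """R*(n, eps) ~ C - sqrt(V/n) * Q^{-1}(eps); units follow those of C and V."""
    return C - np.sqrt(V / n) * norm.isf(eps)      # norm.isf(eps) = Q^{-1}(eps)

# Same capacity, different dispersions: the gap at n = 500, eps = 1e-3 differs a lot.
for V in (0.25, 1.0):
    R = normal_approximation(C=0.5, V=V, n=500, eps=1e-3)
    print(f"V = {V:.2f}: R* ~ {R:.3f} bits/use")
```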
CLT for information density
For a memoryless channel with i.i.d. inputs, the cumulative information density $i(X^n; Y^n)$ is a sum of $n$ i.i.d. random variables, each with mean $C$ and variance $V$. By the CLT:

$$\frac{i(X^n; Y^n) - nC}{\sqrt{nV}} \xrightarrow{d} \mathcal{N}(0, 1)$$
Connection to error probability
A code of rate $R$ and blocklength $n$ can be decoded reliably when $i(X^n; Y^n) > nR$ (roughly speaking, the channel provides enough information). The error probability is approximately:

$$\epsilon \approx \Pr\!\left[i(X^n; Y^n) \le nR\right] \approx Q\!\left(\frac{\sqrt{n}\,(C - R)}{\sqrt{V}}\right)$$
Inverting for $R^*(n, \epsilon)$
Setting the error probability to $\epsilon$ and solving for $R$:

$$Q\!\left(\frac{\sqrt{n}\,(C - R)}{\sqrt{V}}\right) = \epsilon \quad\Longrightarrow\quad R = C - \sqrt{\frac{V}{n}}\, Q^{-1}(\epsilon)$$
The Berry-Esseen theorem quantifies the error in the CLT approximation and makes the remainder term in the theorem explicit.
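The heuristic can be checked directly by simulation. A Monte Carlo sketch for the BSC (Python with NumPy/SciPy assumed; the parameters are illustrative): it estimates $\Pr[i(X^n; Y^n) \le nR]$ and compares it with the Q-function expression above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
delta, n, R = 0.11, 500, 0.40      # BSC(0.11), blocklength 500, rate 0.40 bits/use
trials = 200_000

# For the BSC, the cumulative information density depends only on the flip count:
# each unflipped use contributes 1 + log2(1-delta) bits, each flipped use 1 + log2(delta).
flips = rng.binomial(n, delta, trials)
i_sum = (n - flips) * (1 + np.log2(1 - delta)) + flips * (1 + np.log2(delta))

C = 1 + delta * np.log2(delta) + (1 - delta) * np.log2(1 - delta)
V = delta * (1 - delta) * np.log2((1 - delta) / delta) ** 2

mc = np.mean(i_sum <= n * R)                          # Pr[i(X^n;Y^n) <= nR]
clt = norm.sf(np.sqrt(n) * (C - R) / np.sqrt(V))      # Q(sqrt(n)(C - R)/sqrt(V))
print(f"Monte Carlo: {mc:.4f}   CLT approximation: {clt:.4f}")
```

The small residual gap between the two numbers is exactly what the Berry-Esseen correction accounts for.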
Example: Dispersion of the Binary Symmetric Channel
Compute the channel dispersion of the BSC with crossover probability $\delta = 0.11$, and determine the blocklength needed to achieve rate $R = 0.4$ bits/use at error probability $\epsilon = 10^{-3}$.
Capacity
$C = 1 - h(\delta) = 1 - h(0.11) \approx 0.5$ bits/use, where $h(\cdot)$ is the binary entropy function.
Dispersion
The information density for the BSC at input $x$ and output $y$ takes two values:
- When $y = x$ (probability $1 - \delta = 0.89$): $i = 1 + \log_2(1 - \delta) \approx 0.832$ bits
- When $y \neq x$ (probability $\delta = 0.11$): $i = 1 + \log_2 \delta \approx -2.184$ bits
(using $P_Y(y) = 1/2$ at the capacity-achieving uniform input). The variance of this two-point distribution is $V = \delta(1-\delta)\log_2^2\frac{1-\delta}{\delta} \approx 0.891$ bits$^2$.
Required blocklength
From the normal approximation: $R = C - \sqrt{V/n}\, Q^{-1}(\epsilon)$.
$0.4 = 0.5 - \sqrt{0.891/n} \cdot 3.09$, using $Q^{-1}(10^{-3}) \approx 3.09$.
$\sqrt{0.891/n} = 0.1/3.09 \approx 0.0324$, so $0.891/n \approx 0.00105$.
$n \approx 0.891/0.00105 \approx 850$.
We need approximately $850$ channel uses to achieve rate 0.4 at the BSC with $\delta = 0.11$.
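A numerical cross-check of the steps above (a sketch assuming Python with SciPy):

```python
import numpy as np
from scipy.stats import norm

delta, R, eps = 0.11, 0.40, 1e-3
C = 1 + delta * np.log2(delta) + (1 - delta) * np.log2(1 - delta)  # ~0.500 bits/use
V = delta * (1 - delta) * np.log2((1 - delta) / delta) ** 2        # ~0.891 bits^2

# Invert R = C - sqrt(V/n) * Q^{-1}(eps) for the blocklength n:
n = V * (norm.isf(eps) / (C - R)) ** 2
print(f"C = {C:.3f}, V = {V:.3f}, required n ~ {n:.0f}")           # n ~ 850
```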
Example: Dispersion of the AWGN Channel
For the real AWGN channel $Y = X + Z$ with $Z \sim \mathcal{N}(0, 1)$ and power constraint $\mathbb{E}[X^2] \le P$, show that:

$$V(P) = \frac{P(P+2)}{2(P+1)^2} \ \text{nats}^2$$

where $P$ is the SNR. Compute $V$ for $\mathrm{SNR} = 0$ dB and $\mathrm{SNR} = 10$ dB.
Information density for AWGN
With Gaussian input $X \sim \mathcal{N}(0, P)$:

$$i(x; y) = \ln \frac{P_{Y|X}(y \mid x)}{P_Y(y)} = \frac{1}{2}\ln(1 + P) - \frac{(y - x)^2}{2} + \frac{y^2}{2(1 + P)}$$

After simplification with $y = x + z$, $z \sim \mathcal{N}(0, 1)$:

$$i = \frac{1}{2}\ln(1 + P) - \frac{z^2}{2} + \frac{(x + z)^2}{2(1 + P)}$$
Dispersion computation
The relevant variance is the conditional information variance $V = \mathbb{E}_X[\mathrm{Var}(i \mid X)]$; with i.i.d. Gaussian inputs the unconditional variance, $\mathrm{Var}[i] = \frac{P}{1+P}$, is strictly larger, and it is the conditional form that governs the second-order term for the AWGN channel. It is computed using $\mathbb{E}[Z^2] = 1$, $\mathbb{E}[Z^4] = 3$, and the independence of $X$ and $Z$:

$$V = \mathbb{E}_X\!\left[\mathrm{Var}(i \mid X)\right] = \frac{P}{(1+P)^2} + \frac{P^2}{2(1+P)^2} = \frac{2P + P^2}{2(1+P)^2}$$

The clean form is:

$$V(P) = \frac{P(P+2)}{2(P+1)^2} = \frac{1}{2}\left(1 - \frac{1}{(1+P)^2}\right)$$

This is in natural units (nats$^2$); multiply by $\log_2^2 e \approx 2.08$ to convert to bits$^2$.
Numerical values
$\mathrm{SNR} = 0$ dB ($P = 1$): $V = \frac{1 \cdot 3}{2 \cdot 4} = 0.375$ nats$^2 \approx 0.781$ bits$^2$.
$\mathrm{SNR} = 10$ dB ($P = 10$): $V = \frac{10 \cdot 12}{2 \cdot 121} \approx 0.496$ nats$^2 \approx 1.03$ bits$^2$.
At high SNR, $V \to \frac{1}{2}$ nats$^2$ (a universal constant). At low SNR, $V \approx P$ nats$^2$.
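A short sketch reproducing these numbers (Python assumed):

```python
import numpy as np

def awgn_dispersion_nats(P):
    """V(P) = P(P+2) / (2(P+1)^2) in nats^2, for the real AWGN channel at SNR P."""
    return P * (P + 2) / (2 * (P + 1) ** 2)

for snr_db in (0.0, 10.0):
    P = 10 ** (snr_db / 10)
    V_nats = awgn_dispersion_nats(P)
    V_bits = V_nats / np.log(2) ** 2               # nats^2 -> bits^2
    print(f"SNR = {snr_db:4.0f} dB: V = {V_nats:.3f} nats^2 = {V_bits:.3f} bits^2")
```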
Rate vs Blocklength: The Normal Approximation
Visualize $R^*(n, \epsilon)$ for different channels (AWGN, BSC, BEC), error probabilities, and SNR levels. The asymptotic capacity $C$ is shown as a horizontal line. Observe how the gap to capacity shrinks as $1/\sqrt{n}$.
Common Mistake: Using Capacity as a Design Target at Short Blocklengths
Mistake:
Designing a communication system for $n = 200$ symbols at a rate equal to $90\%$ of the Shannon capacity, expecting error probability $10^{-5}$.
Correction:
At $n = 200$ and $\epsilon = 10^{-5}$, the normal approximation gives a rate well below capacity. For example, at $\mathrm{SNR} = 0$ dB (AWGN): $C = 0.5$ bits/use, but $R^*(200, 10^{-5}) \approx 0.23$ bits/use. That is only about $47\%$ of capacity, not $90\%$. Using the capacity formula for short-blocklength system design leads to under-provisioning of SNR or over-estimation of throughput.
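The corrected calculation in code (a sketch assuming Python with SciPy, using the illustrative numbers above):

```python
import numpy as np
from scipy.stats import norm

P, n, eps = 1.0, 200, 1e-5                 # AWGN at 0 dB, short block, URLLC-grade eps
C = 0.5 * np.log2(1 + P)                   # capacity: 0.500 bits/use
V = (P * (P + 2) / (2 * (P + 1) ** 2)) / np.log(2) ** 2   # dispersion in bits^2

R = C - np.sqrt(V / n) * norm.isf(eps)     # normal approximation
print(f"R*({n}, {eps:g}) ~ {R:.3f} bits/use = {R / C:.0%} of capacity")
```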
Historical Note: The Polyanskiy-Poor-Verdu Revolution
For decades, the gap between Shannon's asymptotic results and practical finite-length codes was addressed heuristically: engineers used union bounds, error exponents, and simulation. The 2010 paper by Polyanskiy, Poor, and Verdu changed the game by providing tight non-asymptotic bounds that could be computed for any channel and any blocklength. The normal approximation emerged as a corollary, but the real contribution was the toolkit: the random coding union (RCU) bound for achievability and the meta-converse for the converse. These tools replaced the AEP-based arguments of classical information theory with hypothesis-testing machinery that works for any $n$. The impact was immediate: within five years, finite-blocklength analysis became a standard design tool for 5G URLLC. Polyanskiy's 2010 PhD thesis is one of the most influential dissertations in modern information theory.
Key Takeaway
The normal approximation is the fundamental formula for finite-blocklength communication. The channel dispersion governs the speed of convergence to capacity: it is the variance of the information density. Two channels with the same capacity but different dispersions can behave very differently at short blocklengths. For URLLC design, the normal approximation is far more accurate than the capacity formula.
Quick Check
Two channels have the same capacity $C = 1$ bit/use, but Channel A has dispersion $V_A = 1$ bits$^2$ and Channel B has $V_B = 4$ bits$^2$. At blocklength $n = 400$ and $\epsilon = 10^{-3}$, which channel achieves a higher rate?
Channel A (lower dispersion)
Channel B (higher dispersion)
Both achieve the same rate (same capacity)
$R_A \approx 1 - \sqrt{1/400} \cdot 3.09 \approx 0.85$ bits/use. $R_B \approx 1 - \sqrt{4/400} \cdot 3.09 \approx 0.69$ bits/use. Lower dispersion means faster convergence to capacity at finite blocklength.
Normal approximation
The second-order asymptotic expansion of the maximum coding rate: $R^*(n, \epsilon) \approx C - \sqrt{V/n}\, Q^{-1}(\epsilon)$, valid for DMCs with positive capacity and dispersion.
Related: Channel dispersion, Information density
Definition: Maximal Coding Rate
Maximal Coding Rate
The maximal coding rate at blocklength $n$ and error probability $\epsilon$ is:

$$R^*(n, \epsilon) = \frac{\log_2 M^*(n, \epsilon)}{n}$$

where $M^*(n, \epsilon)$ is the maximum number of codewords in a code of blocklength $n$ whose average (or maximal) error probability does not exceed $\epsilon$:

$$M^*(n, \epsilon) = \max\{M : \text{an } (n, M, \epsilon)\text{-code exists}\}$$

Shannon's coding theorem states that $\lim_{n \to \infty} R^*(n, \epsilon) = C$ for any $\epsilon \in (0, 1)$. The normal approximation refines this by characterizing the speed of convergence.
Dispersions of Common Channels
| Channel | Capacity $C$ | Dispersion $V$ |
|---|---|---|
| BSC($\delta$) | $1 - h(\delta)$ | $\delta(1-\delta)\log_2^2\frac{1-\delta}{\delta}$ (bits$^2$) |
| BEC($e$) | $1 - e$ | $e(1-e)$ (bits$^2$) |
| AWGN($P$) | $\frac{1}{2}\log_2(1+P)$ | $\frac{P(P+2)}{2(P+1)^2}$ (nats$^2$) |
| Z-channel($\delta$) | see Cover & Thomas | computed numerically |
Note: The BSC has the highest dispersion-to-capacity ratio among binary-input channels, making it one of the hardest channels to code for at short blocklengths.
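The closed-form rows of the table are easy to encode and compare; a sketch (Python assumed; the parameter values are illustrative) that also reports the dispersion-to-capacity ratio behind the note above:

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc(delta):
    return 1 - h2(delta), delta * (1 - delta) * np.log2((1 - delta) / delta) ** 2

def bec(e):
    return 1 - e, e * (1 - e)

def awgn(P):  # capacity in bits/use, dispersion converted from nats^2 to bits^2
    return 0.5 * np.log2(1 + P), (P * (P + 2) / (2 * (P + 1) ** 2)) / np.log(2) ** 2

for name, (C, V) in [("BSC(0.11)", bsc(0.11)), ("BEC(0.5)", bec(0.5)), ("AWGN(P=1)", awgn(1.0))]:
    print(f"{name:<10} C = {C:.3f} bits/use, V = {V:.3f} bits^2, V/C^2 = {V / C**2:.2f}")
```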
Quick Check
The BEC with erasure probability $e = 0.5$ has capacity $C = 0.5$ bits/use. Its dispersion is $V = e(1-e) = 0.25$ bits$^2$. At $n = 1000$ and $\epsilon = 10^{-3}$, what is the approximate maximum coding rate?
$0.40$ bits/use
$0.45$ bits/use
$0.50$ bits/use (close to capacity)
$Q^{-1}(10^{-3}) \approx 3.09$. The penalty is $\sqrt{0.25/1000} \times 3.09 \approx 0.049$. So $R^* \approx 0.45$ bits/use, about $90\%$ of capacity.