The Normal Approximation

Why Finite Blocklength Matters

Shannon's channel coding theorem tells us that reliable communication is possible at any rate below capacity, provided we use long enough codes. But how long is long enough? The classical theory is silent on this point: it gives us the limit ($n \to \infty$) but says nothing about what happens at $n = 100$ or $n = 1000$.

This gap between theory and practice has always existed, but it became acute with the emergence of ultra-reliable low-latency communication (URLLC) in 5G. URLLC targets error probabilities of $10^{-5}$ to $10^{-9}$ at blocklengths of 100-1000 symbols and latencies of 1 ms. At these operating points, the classical capacity formula $C = \frac{1}{2}\log(1 + \text{SNR})$ is a terrible predictor of achievable performance. We need a finer tool.

That tool is the normal approximation, which describes the maximum coding rate as a function of three quantities: the capacity $C$, the channel dispersion $V$, and the blocklength $n$.

Normal Approximation: Convergence to Capacity

Animated sweep of $R^*(n, \varepsilon)$ as the blocklength $n$ increases from 50 to 2000. The gap to capacity $C$ shrinks as $\sqrt{V/n}$, with the dispersion $V$ governing the speed of convergence.

Definition:

Information Density

For a channel $p_{Y|X}$ with input distribution $p_X$ and induced output distribution $p_Y(y) = \sum_x p_X(x) p_{Y|X}(y|x)$, the information density is the random variable:

$$\iota(X; Y) = \log \frac{p_{Y|X}(Y|X)}{p_Y(Y)}.$$

The information density measures the "amount of information" that a specific input-output pair $(x, y)$ conveys. Its expectation is the mutual information:

$$\mathbb{E}[\iota(X; Y)] = I(X; Y).$$

For a memoryless channel over $n$ uses with i.i.d. inputs, the cumulative information density is:

$$\iota(\mathbf{X}; \mathbf{Y}) = \sum_{i=1}^{n} \iota(X_i; Y_i).$$

The information density is a random variable, not a number. This is the conceptual shift from classical information theory: instead of averaging over channel realizations (which makes sense for $n \to \infty$ by the law of large numbers), we track the full distribution of $\iota$, because for finite $n$, the fluctuations around the mean matter.
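To make the "random variable" point concrete, here is a minimal numerical sketch (assuming Python with NumPy; the BSC parameters are illustrative). It tabulates the four possible values of $\iota$ for a BSC and checks that their mean equals the mutual information:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.1, 0.9]])       # BSC(0.1): row x is p_{Y|X}(. | x)
p_x = np.array([0.5, 0.5])       # uniform input distribution

p_y = p_x @ P                    # induced output distribution p_Y
iota = np.log2(P / p_y[None, :]) # iota(x; y) for every (x, y) pair, in bits
joint = p_x[:, None] * P         # joint law p_{X,Y}: the distribution of iota

print(iota)                      # the four possible values of the random variable
print(np.sum(joint * iota))      # E[iota] = I(X;Y) = 1 - h(0.1) ≈ 0.5310 bits
```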

Information density

The log-likelihood ratio $\iota(x; y) = \log(p_{Y|X}(y|x)/p_Y(y))$ measuring the information conveyed by a specific input-output pair. Its mean is the mutual information; its variance (at the capacity-achieving input) is the channel dispersion.

Related: Channel dispersion, Normal approximation

Definition:

Channel Dispersion

The channel dispersion of a DMC $p_{Y|X}$ at the capacity-achieving input distribution $p_X^*$ is:

$$V = \text{Var}[\iota(X; Y)] = \mathbb{E}\!\left[\left(\iota(X; Y) - C\right)^2\right]$$

where $(X, Y) \sim p_X^* \cdot p_{Y|X}$ and $C = I(X; Y)$ at $p_X^*$.

Equivalently:

$$V = \sum_{x, y} p_X^*(x) p_{Y|X}(y|x) \left(\log \frac{p_{Y|X}(y|x)}{p_Y(y)} - C\right)^2.$$

The dispersion has units of (nats/channel use)$^2$ (or (bits/channel use)$^2$ with $\log_2$).

The channel dispersion is the information-theoretic analog of volatility in finance. A channel with high dispersion is "unpredictable" at the level of individual symbols, even when its capacity matches that of a more predictable channel. Intuitively, high dispersion means we need longer codes to concentrate the empirical information density around its mean.
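The double-sum formula translates directly into code. A minimal sketch assuming NumPy (the helper name `dmc_dispersion` is ours, not from any library):

```python
import numpy as np

def dmc_dispersion(P, p_x):
    """Mutual information (bits) and Var[iota] (bits^2) of a DMC at input law p_x."""
    p_y = p_x @ P                           # induced output distribution
    joint = p_x[:, None] * P                # joint law p_{X,Y}
    mask = joint > 0                        # zero-probability pairs contribute nothing
    iota = np.zeros_like(P, dtype=float)
    py = np.broadcast_to(p_y, P.shape)
    iota[mask] = np.log2(P[mask] / py[mask])
    I = np.sum(joint * iota)                # mean of the information density
    V = np.sum(joint * (iota - I) ** 2)     # its variance: the dispersion at p_x
    return I, V

# BSC(0.1) at the capacity-achieving uniform input; closed form gives
# V = p(1-p) (log2((1-p)/p))^2 ≈ 0.9043 bits^2.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
print(dmc_dispersion(P, np.array([0.5, 0.5])))   # ≈ (0.5310, 0.9043)
```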

Channel dispersion

The variance of the information density under the capacity-achieving input distribution. Governs the second-order correction to capacity at finite blocklength: higher dispersion means slower convergence to the capacity limit.

Related: Information density, Normal approximation

Theorem: The Normal Approximation (Polyanskiy-Poor-Verdú)

For a DMC with capacity $C > 0$ and dispersion $V > 0$, the maximum coding rate at blocklength $n$ and error probability $\epsilon \in (0, 1)$ satisfies:

$$R^*(n, \epsilon) = C - \sqrt{\frac{V}{n}}\, Q^{-1}(\epsilon) + O\!\left(\frac{\log n}{n}\right)$$

where $Q^{-1}(\cdot)$ is the inverse of the Gaussian Q-function $Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-t^2/2}\,dt$.

The $O(\log n / n)$ remainder depends on the third moment of the information density (via the Berry-Esseen theorem) and can be bounded explicitly.

The normal approximation says that the capacity penalty at finite blocklength is $\sqrt{V/n}\, Q^{-1}(\epsilon)$. This penalty:

  • Grows with $V$ (more dispersive channels converge more slowly),
  • Shrinks as $1/\sqrt{n}$ (doubling the blocklength reduces the penalty only by a factor of $\sqrt{2}$),
  • Grows with reliability (smaller $\epsilon$ means larger $Q^{-1}(\epsilon)$).

The point is that $V$ is the fundamental "speed of convergence" parameter, much like the variance in the central limit theorem. Two channels with the same capacity but different dispersions behave very differently at short blocklengths.
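The expansion itself is one line to evaluate. A sketch assuming SciPy, whose `norm.isf` is the inverse survival function $Q^{-1}$ (the $O(\log n/n)$ term is dropped, and the channel parameters below are illustrative):

```python
import numpy as np
from scipy.stats import norm

def normal_approx_rate(C, V, n, eps):
    """Normal approximation to R*(n, eps); the O(log n / n) term is dropped."""
    return C - np.sqrt(V / n) * norm.isf(eps)    # norm.isf(eps) = Q^{-1}(eps)

# Same capacity, different dispersions: the gap to C at n = 200 differs markedly.
for V in (0.25, 1.0):
    print(V, normal_approx_rate(C=1.0, V=V, n=200, eps=1e-3))
```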

Example: Dispersion of the Binary Symmetric Channel

Compute the channel dispersion of the BSC with crossover probability $p = 0.1$, and determine the blocklength needed to achieve rate $R = 0.4$ bits/use at error probability $\epsilon = 10^{-3}$.
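One way to carry out the computation, as a sketch assuming SciPy (the closed-form $C$ and $V$ match the table later in this section; the $O(\log n/n)$ remainder is ignored, so the blocklength is approximate):

```python
import numpy as np
from scipy.stats import norm

p, R, eps = 0.1, 0.4, 1e-3
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)    # binary entropy h(p)
C = 1 - h                                         # ≈ 0.5310 bits/use
V = p * (1 - p) * np.log2((1 - p) / p) ** 2       # ≈ 0.9043 bits^2

# Solve C - sqrt(V/n) Q^{-1}(eps) = R for n, dropping the O(log n / n) term:
n = V * (norm.isf(eps) / (C - R)) ** 2
print(f"C ≈ {C:.4f}, V ≈ {V:.4f}, required n ≈ {n:.0f}")   # n ≈ 503
```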

Example: Dispersion of the AWGN Channel

For the real AWGN channel $Y = X + Z$ with $Z \sim \mathcal{N}(0, \sigma^2)$ and power constraint $P$, show that:

$$C = \frac{1}{2}\log(1 + \text{SNR}), \qquad V = \frac{\text{SNR}(2 + \text{SNR})}{2(1 + \text{SNR})^2} \cdot (\log e)^2$$

where $\text{SNR} = P/\sigma^2$. Compute $V$ for $\text{SNR} = 0$ dB and $10$ dB.
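A sketch of the requested numerical evaluation (assuming NumPy, and reading $\log e$ as $\log_2 e$ so that $V$ comes out in bits$^2$):

```python
import numpy as np

def awgn_dispersion_bits(snr_db):
    """Dispersion of the real AWGN channel in bits^2 (log e read as log2 e)."""
    snr = 10 ** (snr_db / 10)
    v_nats = snr * (2 + snr) / (2 * (1 + snr) ** 2)   # dispersion in nats^2
    return v_nats * np.log2(np.e) ** 2                # convert nats^2 -> bits^2

for snr_db in (0, 10):
    print(f"{snr_db} dB: V ≈ {awgn_dispersion_bits(snr_db):.4f} bits^2")
# 0 dB: 3/8 nats^2 ≈ 0.7805 bits^2;  10 dB: ≈ 0.4959 nats^2 ≈ 1.0321 bits^2
```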

Rate vs Blocklength: The Normal Approximation

Visualize $R^*(n, \epsilon)$ for different channels (AWGN, BSC, BEC), error probabilities, and SNR levels. The asymptotic capacity is shown as a horizontal line. Observe how the gap to capacity shrinks as $1/\sqrt{n}$.


Common Mistake: Using Capacity as a Design Target at Short Blocklengths

Mistake:

Designing a communication system for $n = 200$ symbols at a rate equal to $90\%$ of the Shannon capacity, expecting error probability $\sim 10^{-5}$.

Correction:

At $n = 200$ and $\epsilon = 10^{-5}$, the normal approximation gives a rate well below capacity. For example, for the AWGN channel at $\text{SNR} = 6$ dB: $C \approx 1.16$ bits/use and $V \approx 1.0$ bits$^2$, so $R^*(200, 10^{-5}) \approx 1.16 - \sqrt{1.0/200} \times 4.26 \approx 0.86$ bits/use. That is only $74\%$ of capacity, not $90\%$. Using the capacity formula for short-blocklength system design leads to under-provisioning of SNR or over-estimation of throughput.
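These numbers are easy to reproduce. A quick sanity check, as a sketch assuming SciPy:

```python
import numpy as np
from scipy.stats import norm

snr = 10 ** (6 / 10)                                   # 6 dB in linear scale
C = 0.5 * np.log2(1 + snr)                             # ≈ 1.16 bits/use
V = snr * (2 + snr) / (2 * (1 + snr) ** 2) * np.log2(np.e) ** 2   # ≈ 1.0 bits^2
R = C - np.sqrt(V / 200) * norm.isf(1e-5)              # normal approximation at n = 200
print(f"C = {C:.3f} bits/use, R* ≈ {R:.3f} bits/use ({R / C:.0%} of capacity)")
```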

Historical Note: The Polyanskiy-Poor-Verdú Revolution

2010

For decades, the gap between Shannon's asymptotic results and practical finite-length codes was addressed heuristically: engineers used union bounds, error exponents, and simulation. The 2010 paper by Polyanskiy, Poor, and Verdú changed the game by providing tight non-asymptotic bounds that could be computed for any channel and any blocklength. The normal approximation emerged as a corollary, but the real contribution was the toolkit: the random coding union (RCU) bound for achievability and the meta-converse for the converse. These tools replaced the AEP-based arguments of classical information theory with hypothesis-testing machinery that works for any $n$. The impact was immediate: within five years, finite-blocklength analysis became a standard design tool for 5G URLLC. Polyanskiy's 2010 PhD thesis is one of the most influential dissertations in modern information theory.

Key Takeaway

The normal approximation $R^*(n, \epsilon) \approx C - \sqrt{V/n}\,Q^{-1}(\epsilon)$ is the fundamental formula for finite-blocklength communication. The channel dispersion $V$ governs the speed of convergence to capacity: it is the variance of the information density. Two channels with the same capacity but different dispersions can behave very differently at short blocklengths. For URLLC design, the normal approximation is far more accurate than the capacity formula.

Quick Check

Two channels have the same capacity $C = 1$ bit/use, but Channel A has dispersion $V_A = 0.5$ bits$^2$ and Channel B has $V_B = 2$ bits$^2$. At blocklength $n = 500$ and $\epsilon = 10^{-3}$, which channel achieves a higher rate?

Channel A (lower dispersion)

Channel B (higher dispersion)

Both achieve the same rate (same capacity)

Normal approximation

The second-order asymptotic expansion of the maximum coding rate: $R^*(n, \epsilon) = C - \sqrt{V/n}\,Q^{-1}(\epsilon) + O(\log n/n)$, valid for DMCs with positive capacity and dispersion.

Related: Channel dispersion, Information density

Definition:

Maximal Coding Rate

The maximal coding rate at blocklength $n$ and error probability $\epsilon$ is:

$$R^*(n, \epsilon) = \frac{1}{n} \log M^*(n, \epsilon)$$

where $M^*(n, \epsilon)$ is the maximum number of codewords in a code of blocklength $n$ whose average (or maximal) error probability does not exceed $\epsilon$:

$$M^*(n, \epsilon) = \max\{M : \exists \text{ an } (n, M, \epsilon)\text{-code}\}.$$

Shannon's coding theorem states $\lim_{n \to \infty} R^*(n, \epsilon) = C$ for any $\epsilon \in (0, 1)$. The normal approximation refines this by characterizing the speed of convergence.

Dispersions of Common Channels

| Channel | Capacity $C$ | Dispersion $V$ |
|---|---|---|
| BSC($p$) | $1 - h(p)$ | $p(1-p)\left(\log\frac{1-p}{p}\right)^2$ |
| BEC($\delta$) | $1 - \delta$ | $\delta(1-\delta)(\log 2)^2$ |
| AWGN($\text{SNR}$) | $\frac{1}{2}\log(1+\text{SNR})$ | $\frac{\text{SNR}(\text{SNR}+2)}{2(1+\text{SNR})^2}(\log e)^2$ |
| Z-channel($p$) | see Cover & Thomas | computed numerically |

All formulas use an arbitrary logarithm base: take $\log = \log_2$ (so $\log 2 = 1$ and $\log e = \log_2 e$) for bits and bits$^2$, or natural logs for nats and nats$^2$. In bits$^2$, the BEC dispersion is simply $\delta(1-\delta)$.

Note: Among the canonical binary-input channels, the BSC has a notably high dispersion-to-capacity ratio, making it one of the harder channels to code for at short blocklengths.
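As a cross-check of the closed forms, the sketch below (assuming NumPy; it repeats the generic variance computation from the dispersion sketch above so the block stays self-contained) recovers the BSC and BEC entries numerically. Uniform inputs are capacity-achieving for both channels:

```python
import numpy as np

def dispersion_bits(P, p_x):
    """Var of the information density (bits^2) of a DMC at input law p_x."""
    p_y = p_x @ P
    joint = p_x[:, None] * P
    mask = joint > 0
    iota = np.zeros_like(P, dtype=float)
    py = np.broadcast_to(p_y, P.shape)
    iota[mask] = np.log2(P[mask] / py[mask])
    I = np.sum(joint * iota)
    return np.sum(joint * (iota - I) ** 2)

u = np.array([0.5, 0.5])          # uniform input: capacity-achieving for BSC and BEC
p, d = 0.1, 0.5
bsc = np.array([[1 - p, p], [p, 1 - p]])
bec = np.array([[1 - d, 0.0, d], [0.0, 1 - d, d]])   # output alphabet: {0, 1, erasure}
print(dispersion_bits(bsc, u), p * (1 - p) * np.log2((1 - p) / p) ** 2)  # ≈ 0.9043 both
print(dispersion_bits(bec, u), d * (1 - d))                              # ≈ 0.25 both
```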

Quick Check

The BEC with erasure probability $\delta = 0.5$ has capacity $C = 0.5$ bits/use and dispersion $V = \delta(1-\delta) = 0.25$ bits$^2$. At $n = 1000$ and $\epsilon = 10^{-5}$, what is the approximate maximum coding rate?

$R^* \approx 0.5 - \sqrt{0.25/1000} \times 4.26 \approx 0.433$ bits/use

$R^* \approx 0.45$ bits/use

$R^* \approx 0.5$ bits/use (close to capacity)
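To verify the arithmetic in the first option, a one-line check (a sketch assuming SciPy):

```python
import numpy as np
from scipy.stats import norm

d, n, eps = 0.5, 1000, 1e-5
C, V = 1 - d, d * (1 - d)                    # BEC(0.5): C = 0.5 bits, V = 0.25 bits^2
print(C - np.sqrt(V / n) * norm.isf(eps))    # ≈ 0.433 bits/use
```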