Large Deviations and Cramér's Theorem

Beyond the CLT: Rare Events

The CLT tells us that $S_n/n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$, so the probability that $S_n/n$ deviates from $\mu$ by more than $\epsilon$ is roughly $2Q(\epsilon\sqrt{n}/\sigma)$, which decays like $e^{-cn}$ with $c = \epsilon^2/(2\sigma^2)$ (up to polynomial factors).

But the CLT only captures the behavior in the bulk of the distribution. For large deviations (the probability that $S_n$ exceeds $na$ for $a > \mu$), the CLT's Gaussian approximation gives the wrong exponential rate. Cramér's theorem provides the exact rate, expressed through the Fenchel-Legendre transform of the cumulant generating function.
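To see the mismatch numerically, here is a small Python sketch (the Bernoulli(1/2) example and the threshold $a = 0.75$ are illustrative choices, not from the text) comparing the rate the CLT heuristic would predict, $(a-\mu)^2/(2\sigma^2)$, with the true Cramér rate:

```python
import numpy as np

# Bernoulli(1/2): mu = 1/2, sigma^2 = 1/4 (illustrative choice).
p, a = 0.5, 0.75

# Rate the CLT heuristic would predict: (a - mu)^2 / (2 sigma^2).
clt_rate = (a - p) ** 2 / (2 * p * (1 - p))

# True Cramér rate for Bernoulli(p): the KL divergence D(a || p).
cramer_rate = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

print(f"CLT rate:    {clt_rate:.4f}")     # 0.1250
print(f"Cramér rate: {cramer_rate:.4f}")  # 0.1308
```

The true tail decays with exponent $0.1308$ per sample, strictly faster than the naive Gaussian extrapolation's $0.1250$, and the gap widens as $a$ moves further from $\mu$.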

Definition:

Fenchel-Legendre Transform (Rate Function)

Let $m_X(t) = \log M_X(t)$ be the cumulant generating function. The Fenchel-Legendre transform (or rate function) is

$$m_X^*(a) = \sup_{t \in \mathbb{R}} \{ at - m_X(t) \}.$$

For $a > \mu = \mathbb{E}[X]$, the supremum is achieved at the unique $\tau > 0$ satisfying $m_X'(\tau) = a$.

Geometrically, $m_X^*(a)$ is the gap between the line $y = at$ and the convex curve $y = m_X(t)$, maximized over $t$. Since $m_X$ is convex with $m_X'(0) = \mu$, the rate function satisfies $m_X^*(a) > 0$ for $a \neq \mu$ and $m_X^*(\mu) = 0$.
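As a sanity check on the definition, the following sketch evaluates the supremum numerically (assuming SciPy is available; the Gaussian example is an illustrative choice) and compares it with the known closed form $m_X^*(a) = (a - \mu)^2/(2\sigma^2)$ for $\mathcal{N}(\mu, \sigma^2)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

mu, sigma = 0.0, 1.0  # illustrative: standard Gaussian

def cgf(t):
    # CGF of N(mu, sigma^2): m_X(t) = mu*t + sigma^2 * t^2 / 2.
    return mu * t + 0.5 * sigma**2 * t**2

def rate(a):
    # m_X^*(a) = sup_t { a*t - m_X(t) }; minimize the negative.
    res = minimize_scalar(lambda t: -(a * t - cgf(t)))
    return -res.fun

for a in [0.5, 1.0, 2.0]:
    print(f"a={a}: numeric={rate(a):.4f}, closed form={(a - mu)**2 / (2 * sigma**2):.4f}")
```

For the Gaussian, the tangency condition $m_X'(\tau) = a$ gives $\tau = (a - \mu)/\sigma^2$, and substituting back yields the quadratic rate function.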

Rate Function

The Fenchel-Legendre transform $m_X^*(a) = \sup_t \{ at - m_X(t) \}$. It gives the exponential decay rate of $\mathbb{P}(S_n/n > a)$ for i.i.d. sums: $\mathbb{P}(S_n > na) \doteq e^{-n m_X^*(a)}$.

Related: Cumulant Generating Function, Chernoff Bound

Theorem: The Chernoff Bound

For any $a \geq \mu$:

$$\mathbb{P}(S_n > na) \leq e^{-n \sup_{t \geq 0} \{ at - m_X(t) \}} = e^{-n m_X^*(a)}.$$

Apply Markov's inequality to $e^{tS_n}$: $\mathbb{P}(S_n > na) = \mathbb{P}(e^{tS_n} > e^{tna}) \leq e^{-tna}\,\mathbb{E}[e^{tS_n}] = e^{-n(at - m_X(t))}$, where the last equality uses independence, $\mathbb{E}[e^{tS_n}] = M_X(t)^n = e^{n m_X(t)}$. Optimizing over $t \geq 0$ gives the tightest exponential bound.
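The bound is easy to check numerically. A minimal sketch (the Bernoulli(1/2) setup is an illustrative choice; it uses the exact binomial tail from SciPy):

```python
import numpy as np
from scipy.stats import binom

p, a = 0.5, 0.75
# m_X^*(a) for Bernoulli(p) is the KL divergence D(a || p).
rate = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

for n in [20, 100, 500]:
    exact = binom.sf(np.floor(n * a), n, p)  # P(S_n > na), exact binomial tail
    bound = np.exp(-n * rate)                # Chernoff bound e^{-n m_X^*(a)}
    print(f"n={n:4d}  exact={exact:.3e}  Chernoff={bound:.3e}")
```

The bound holds at every $n$, and $-\frac{1}{n}\log$ of the exact tail converges to the same exponent, which is exactly the content of Cramér's theorem below.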

Chernoff Bound

The exponential upper bound $\mathbb{P}(X > a) \leq \inf_{t > 0} e^{-ta} M_X(t)$. For sums of i.i.d. RVs, it gives $\mathbb{P}(S_n > na) \leq e^{-n m_X^*(a)}$.

Related: Rate Function

Theorem: Cramér's Theorem

Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$, variance $\sigma^2 > 0$, and MGF finite in a neighborhood of $0$. For $a > \mu$ with $\mathbb{P}(X_1 > a) > 0$:

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(S_n > na) = m_X^*(a).$$

Equivalently, $\mathbb{P}(S_n > na) \doteq e^{-n m_X^*(a)}$ (exponential equivalence).

The Chernoff bound gives the upper bound. The matching lower bound comes from an exponential change of measure (tilting): reweight the distribution so that $a$ becomes the new mean, apply the CLT under the tilted measure, and translate back. The two bounds meet, proving that the Chernoff bound is exponentially tight.
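The tilting argument also doubles as a practical rare-event estimator: sample under the tilted measure, where the event is typical, and correct with the likelihood ratio. A minimal importance-sampling sketch (the standard Gaussian and the parameters below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, trials = 50, 0.5, 100_000   # estimate P(S_n > na) for X_i ~ N(0, 1)
tau = a                           # for N(0,1), m_X'(t) = t, so tau = a tilts the mean to a

# Sample S_n under the tilted measure: X_i ~ N(tau, 1).
S = rng.normal(tau, 1.0, size=(trials, n)).sum(axis=1)

# Likelihood ratio dP/dP_tau = exp(-tau * S_n + n * m_X(tau)), with m_X(t) = t^2/2.
w = np.exp(-tau * S + n * tau**2 / 2)
est = ((S > n * a) * w).mean()

print(f"tilted estimate: {est:.3e}")                     # approx Q(a*sqrt(n)) ~ 2.0e-4
print(f"Chernoff bound:  {np.exp(-n * a**2 / 2):.3e}")   # e^{-n m_X^*(a)} ~ 1.9e-3
```

Under the tilted measure roughly half the samples land in the event $\{S_n > na\}$, so the estimator has low relative variance; naive Monte Carlo would need on the order of $1/\mathbb{P}(S_n > na)$ samples to see the event at all.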


The Rate Function and Large Deviation Probabilities

Explore the rate function $m_X^*(a)$ for different distributions. The left panel shows the CGF $m_X(t)$ with the tangent line of slope $a$. The right panel shows $m_X^*(a)$ as a function of $a$. Adjust the threshold $a$ to see how the exponential decay rate changes.
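For readers viewing a static copy, a rough matplotlib version of the two panels can be sketched as follows (assuming the standard Gaussian; this mimics the layout of the interactive figure rather than reproducing it):

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma, a = 0.0, 1.0, 1.5   # illustrative parameters

def cgf(t):
    return mu * t + 0.5 * sigma**2 * t**2   # m_X(t) for N(mu, sigma^2)

t = np.linspace(-0.5, 3.0, 200)
tau = (a - mu) / sigma**2                    # maximizer: m_X'(tau) = a

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 3.5))

# Left panel: CGF and the line y = a*t; the maximal vertical gap is m_X^*(a).
left.plot(t, cgf(t), label=r"$m_X(t)$")
left.plot(t, a * t, "--", label=rf"$y = at$, $a = {a}$")
left.vlines(tau, cgf(tau), a * tau, colors="k")  # the gap at t = tau
left.set_xlabel("t"); left.legend()

# Right panel: the rate function, (a - mu)^2 / (2 sigma^2) for the Gaussian.
aa = np.linspace(-3, 3, 200)
right.plot(aa, (aa - mu) ** 2 / (2 * sigma**2))
right.set_xlabel("a"); right.set_ylabel(r"$m_X^*(a)$")

plt.tight_layout(); plt.show()
```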

🔧 Engineering Note

Large Deviations and Error Exponents in Coding Theory

The large deviations framework is not just an abstract curiosity; it is the mathematical engine behind error exponents in information theory. When a decoder makes an error, it is because the empirical statistics of the noise sequence deviate from their typical behavior. The rate at which the error probability decays with blocklength $n$ is precisely a large deviation rate function (the random coding error exponent).

The tilted distribution technique used in the lower bound of Cramér's theorem reappears as the "change of measure" in the sphere-packing bound for channel coding (Gallager, 1965).

Practical Constraints
• Requires finite MGF (light-tailed distributions)
• Heavy-tailed noise requires different techniques

Common Mistake: Cramér's Theorem Requires Light Tails

Mistake:

Applying Cramér's theorem to distributions without a finite MGF in a neighborhood of the origin. For example, Pareto or log-normal distributions have an infinite MGF for all $t > 0$.

Correction:

For heavy-tailed distributions, the Chernoff bound and Cramér's theorem do not apply. Large deviation probabilities decay subexponentially, often polynomially. Use the theory of subexponential distributions or specialized results (e.g., Nagaev's theorem) instead.
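To see the failure concretely, the sketch below (an illustrative simulation; the tail index and threshold are choices made here) simulates Pareto sums. The MGF of a Pareto distribution is infinite for every $t > 0$, and the simulated tail follows the single-big-jump approximation $n\,\mathbb{P}(X_1 > n(a - \mu))$ from Nagaev-type results rather than any exponential rate:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.5                    # Pareto tail index: P(X > x) = x**(-alpha) for x >= 1
mu = alpha / (alpha - 1)       # mean = 5/3
a, trials = 4.0, 100_000       # threshold above the mean

for n in [10, 20, 40]:
    # Inverse-CDF sampling: U**(-1/alpha) is Pareto(alpha) on [1, inf).
    X = rng.uniform(size=(trials, n)) ** (-1.0 / alpha)
    p_emp = np.mean(X.sum(axis=1) > n * a)
    # Single-big-jump approximation: one sample carries the whole excess.
    p_sbj = n * (n * (a - mu)) ** (-alpha)
    print(f"n={n:3d}  empirical={p_emp:.2e}  single-jump={p_sbj:.2e}")
```

In this simulation the tail shrinks only polynomially in $n$, in line with the single-big-jump heuristic, while any fit of the form $e^{-n \cdot \text{const}}$ fails badly as $n$ grows.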