Large Deviations and Cramér's Theorem

Beyond the CLT: Rare Events

The CLT tells us that $S_n/n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$, so the probability that $S_n/n$ deviates from $\mu$ by more than $\epsilon$ is roughly $2Q(\epsilon\sqrt{n}/\sigma)$, which decays like $e^{-cn}$ with $c = \epsilon^2/(2\sigma^2)$ (up to polynomial factors).

But the CLT only captures the behavior in the bulk of the distribution. For large deviations (the probability that $S_n$ exceeds $na$ for $a > \mu$), the CLT's Gaussian approximation gives the wrong exponential rate. Cramér's theorem provides the exact rate, expressed through the Fenchel-Legendre transform of the cumulant generating function.
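To see the mismatch numerically, here is a small Python sketch (the Bernoulli(1/2) example and the threshold $a = 0.75$ are illustrative choices, not from the text) comparing the rate the CLT heuristic would predict, $(a-\mu)^2/(2\sigma^2)$, with the true Cramér rate:

```python
import numpy as np

# Bernoulli(1/2): mu = 1/2, sigma^2 = 1/4 (illustrative choice).
p, a = 0.5, 0.75

# Rate the CLT heuristic would predict: (a - mu)^2 / (2 sigma^2).
clt_rate = (a - p) ** 2 / (2 * p * (1 - p))

# True Cramér rate for Bernoulli(p): the KL divergence D(a || p).
cramer_rate = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

print(f"CLT rate:    {clt_rate:.4f}")     # 0.1250
print(f"Cramér rate: {cramer_rate:.4f}")  # 0.1308
```

The true tail decays with exponent $0.1308$ per sample, strictly faster than the naive Gaussian extrapolation's $0.1250$, and the gap widens as $a$ moves further from $\mu$.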

Definition:

Fenchel-Legendre Transform (Rate Function)

Let $m_X(t) = \log M_X(t)$ be the cumulant generating function. The Fenchel-Legendre transform (or rate function) is

$$m_X^*(a) = \sup_{t \in \mathbb{R}} \{ at - m_X(t) \}.$$

For $a > \mu = \mathbb{E}[X]$, the supremum is achieved at the unique $\tau > 0$ satisfying $m_X'(\tau) = a$.

Geometrically, $m_X^*(a)$ is the gap between the line $y = at$ and the convex curve $y = m_X(t)$, maximized over $t$. Since $m_X$ is convex with $m_X'(0) = \mu$, the rate function satisfies $m_X^*(a) > 0$ for $a \neq \mu$ and $m_X^*(\mu) = 0$.
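As a sanity check on the definition, the following sketch evaluates the supremum numerically (assuming SciPy is available; the Gaussian example is an illustrative choice) and compares it with the known closed form $m_X^*(a) = (a - \mu)^2/(2\sigma^2)$ for $\mathcal{N}(\mu, \sigma^2)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

mu, sigma = 0.0, 1.0  # illustrative: standard Gaussian

def cgf(t):
    # CGF of N(mu, sigma^2): m_X(t) = mu*t + sigma^2 * t^2 / 2.
    return mu * t + 0.5 * sigma**2 * t**2

def rate(a):
    # m_X^*(a) = sup_t { a*t - m_X(t) }; minimize the negative.
    res = minimize_scalar(lambda t: -(a * t - cgf(t)))
    return -res.fun

for a in [0.5, 1.0, 2.0]:
    print(f"a={a}: numeric={rate(a):.4f}, closed form={(a - mu)**2 / (2 * sigma**2):.4f}")
```

For the Gaussian, the tangency condition $m_X'(\tau) = a$ gives $\tau = (a - \mu)/\sigma^2$, and substituting back yields the quadratic rate function.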

Rate Function

The Fenchel-Legendre transform $m_X^*(a) = \sup_t \{ at - m_X(t) \}$. It gives the exponential decay rate of $\mathbb{P}(S_n/n > a)$ for i.i.d. sums: $\mathbb{P}(S_n > na) \doteq e^{-n m_X^*(a)}$.

Related: Cumulant Generating Function, Chernoff Bound

Theorem: The Chernoff Bound

For any $a \geq \mu$:

$$\mathbb{P}(S_n > na) \leq e^{-n \sup_{t \geq 0} \{ at - m_X(t) \}} = e^{-n m_X^*(a)}.$$

Apply Markov's inequality to $e^{tS_n}$: $\mathbb{P}(S_n > na) = \mathbb{P}(e^{tS_n} > e^{tna}) \leq e^{-tna}\,\mathbb{E}[e^{tS_n}] = e^{-n(at - m_X(t))}$, where the last equality uses independence, $\mathbb{E}[e^{tS_n}] = M_X(t)^n = e^{n m_X(t)}$. Optimizing over $t \geq 0$ gives the tightest exponential bound.
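The bound is easy to check numerically. A minimal sketch (the Bernoulli(1/2) setup is an illustrative choice; it uses the exact binomial tail from SciPy):

```python
import numpy as np
from scipy.stats import binom

p, a = 0.5, 0.75
# m_X^*(a) for Bernoulli(p) is the KL divergence D(a || p).
rate = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

for n in [20, 100, 500]:
    exact = binom.sf(np.floor(n * a), n, p)  # P(S_n > na), exact binomial tail
    bound = np.exp(-n * rate)                # Chernoff bound e^{-n m_X^*(a)}
    print(f"n={n:4d}  exact={exact:.3e}  Chernoff={bound:.3e}")
```

The bound holds at every $n$, and $-\frac{1}{n}\log$ of the exact tail converges to the same exponent, which is exactly the content of Cramér's theorem below.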

Chernoff Bound

The exponential upper bound $\mathbb{P}(X > a) \leq \inf_{t > 0} e^{-ta} M_X(t)$. For sums of i.i.d. RVs, it gives $\mathbb{P}(S_n > na) \leq e^{-n m_X^*(a)}$.

Related: Rate Function

Theorem: Cramér's Theorem

Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$, variance $\sigma^2 > 0$, and MGF finite in a neighborhood of $0$. For $a > \mu$ with $\mathbb{P}(X_1 > a) > 0$:

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(S_n > na) = m_X^*(a).$$

Equivalently, $\mathbb{P}(S_n > na) \doteq e^{-n m_X^*(a)}$ (exponential equivalence).

The Chernoff bound gives the upper bound. The matching lower bound comes from an exponential change of measure (tilting): reweight the distribution so that $a$ becomes the new mean, apply the CLT under the tilted measure, and translate back. The two bounds meet, proving that the Chernoff bound is exponentially tight.
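The tilting argument also doubles as a practical rare-event estimator: sample under the tilted measure, where the event is typical, and correct with the likelihood ratio. A minimal importance-sampling sketch (the standard Gaussian and the parameters below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, trials = 50, 0.5, 100_000   # estimate P(S_n > na) for X_i ~ N(0, 1)
tau = a                           # for N(0,1), m_X'(t) = t, so tau = a tilts the mean to a

# Sample S_n under the tilted measure: X_i ~ N(tau, 1).
S = rng.normal(tau, 1.0, size=(trials, n)).sum(axis=1)

# Likelihood ratio dP/dP_tau = exp(-tau * S_n + n * m_X(tau)), with m_X(t) = t^2/2.
w = np.exp(-tau * S + n * tau**2 / 2)
est = ((S > n * a) * w).mean()

print(f"tilted estimate: {est:.3e}")                     # approx Q(a*sqrt(n)) ~ 2.0e-4
print(f"Chernoff bound:  {np.exp(-n * a**2 / 2):.3e}")   # e^{-n m_X^*(a)} ~ 1.9e-3
```

Under the tilted measure roughly half the samples land in the event $\{S_n > na\}$, so the estimator has low relative variance; naive Monte Carlo would need on the order of $1/\mathbb{P}(S_n > na)$ samples to see the event at all.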


The Rate Function and Large Deviation Probabilities

Explore the rate function $m_X^*(a)$ for different distributions. The left panel shows the CGF $m_X(t)$ with the tangent line of slope $a$. The right panel shows $m_X^*(a)$ as a function of $a$. Adjust the threshold $a$ to see how the exponential decay rate changes.
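For readers viewing a static copy, a rough matplotlib version of the two panels can be sketched as follows (assuming the standard Gaussian; this mimics the layout of the interactive figure rather than reproducing it):

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma, a = 0.0, 1.0, 1.5   # illustrative parameters

def cgf(t):
    return mu * t + 0.5 * sigma**2 * t**2   # m_X(t) for N(mu, sigma^2)

t = np.linspace(-0.5, 3.0, 200)
tau = (a - mu) / sigma**2                    # maximizer: m_X'(tau) = a

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 3.5))

# Left panel: CGF and the line y = a*t; the maximal vertical gap is m_X^*(a).
left.plot(t, cgf(t), label=r"$m_X(t)$")
left.plot(t, a * t, "--", label=rf"$y = at$, $a = {a}$")
left.vlines(tau, cgf(tau), a * tau, colors="k")  # the gap at t = tau
left.set_xlabel("t"); left.legend()

# Right panel: the rate function, (a - mu)^2 / (2 sigma^2) for the Gaussian.
aa = np.linspace(-3, 3, 200)
right.plot(aa, (aa - mu) ** 2 / (2 * sigma**2))
right.set_xlabel("a"); right.set_ylabel(r"$m_X^*(a)$")

plt.tight_layout(); plt.show()
```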

🔧 Engineering Note

Large Deviations and Error Exponents in Coding Theory

The large deviations framework is not just an abstract curiosity; it is the mathematical engine behind error exponents in information theory. When a decoder makes an error, it is because the empirical statistics of the noise sequence deviate from their typical behavior. The rate at which the error probability decays with blocklength $n$ is precisely a large deviation rate function (the random coding error exponent).

The tilted distribution technique used in the lower bound of Cramér's theorem reappears as the "change of measure" in the sphere-packing bound for channel coding (Gallager, 1965).

Practical Constraints
• Requires finite MGF (light-tailed distributions)
• Heavy-tailed noise requires different techniques

Common Mistake: Cramér's Theorem Requires Light Tails

Mistake:

Applying Cramér's theorem to distributions without a finite MGF in a neighborhood of the origin. For example, Pareto or log-normal distributions have an infinite MGF for all $t > 0$.

Correction:

For heavy-tailed distributions, the Chernoff bound and Cramér's theorem do not apply. Large deviation probabilities decay subexponentially, often polynomially. Use the theory of subexponential distributions or specialized results (e.g., Nagaev's theorem) instead.
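To see the failure concretely, the sketch below (an illustrative simulation; the tail index and threshold are choices made here) simulates Pareto sums. The MGF of a Pareto distribution is infinite for every $t > 0$, and the simulated tail follows the single-big-jump approximation $n\,\mathbb{P}(X_1 > n(a - \mu))$ from Nagaev-type results rather than any exponential rate:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.5                    # Pareto tail index: P(X > x) = x**(-alpha) for x >= 1
mu = alpha / (alpha - 1)       # mean = 5/3
a, trials = 4.0, 100_000       # threshold above the mean

for n in [10, 20, 40]:
    # Inverse-CDF sampling: U**(-1/alpha) is Pareto(alpha) on [1, inf).
    X = rng.uniform(size=(trials, n)) ** (-1.0 / alpha)
    p_emp = np.mean(X.sum(axis=1) > n * a)
    # Single-big-jump approximation: one sample carries the whole excess.
    p_sbj = n * (n * (a - mu)) ** (-alpha)
    print(f"n={n:3d}  empirical={p_emp:.2e}  single-jump={p_sbj:.2e}")
```

In this simulation the tail shrinks only polynomially in $n$, in line with the single-big-jump heuristic, while any fit of the form $e^{-n \cdot \text{const}}$ fails badly as $n$ grows.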