Cramér's Theorem

Beyond the CLT: The Regime of Rare Events

The central limit theorem tells us that the sample mean $\bar{X}_n$ fluctuates around $\mu$ at scale $1/\sqrt{n}$, and that the fluctuations are approximately Gaussian. But what about large deviations — the probability that $\bar{X}_n$ deviates from $\mu$ by a fixed amount $\delta > 0$, regardless of $n$? The CLT says this probability goes to zero, but it does not tell us how fast. Large deviations theory provides the answer: the decay is exponential in $n$, and the rate is governed by a function $I(x)$ that encodes the cost of forcing the sample mean to take the value $x$. This is the rate function, and Cramér's theorem identifies it as the Legendre-Fenchel transform of the log-MGF.

Definition:

Cumulant Generating Function

Let $X$ be a real-valued random variable with moment generating function $M_X(\theta) = \mathbb{E}[e^{\theta X}]$. The cumulant generating function (CGF) is defined as
$$\Lambda(\theta) \triangleq \log M_X(\theta) = \log \mathbb{E}[e^{\theta X}].$$
The effective domain of $\Lambda$ is $\mathcal{D}_\Lambda = \{\theta \in \mathbb{R} : \Lambda(\theta) < \infty\}$.

The CGF is always convex (a consequence of Hölder's inequality). Its derivatives at $\theta = 0$ give the cumulants: $\Lambda'(0) = \mu$, $\Lambda''(0) = \sigma^2$.
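These identities can be checked numerically. A minimal sketch (not from the source): estimate the CGF of a Gaussian sample by Monte Carlo and recover the mean and variance by finite differences at $\theta = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

def cgf(theta, samples):
    """Empirical CGF: log of the sample average of exp(theta * X)."""
    return np.log(np.mean(np.exp(theta * samples)))

# Gaussian sample; the exact CGF is mu*theta + sigma^2*theta^2/2.
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Central finite differences of the CGF at theta = 0
h = 1e-3
d1 = (cgf(h, x) - cgf(-h, x)) / (2 * h)                  # Lambda'(0), close to mu
d2 = (cgf(h, x) - 2 * cgf(0.0, x) + cgf(-h, x)) / h**2   # Lambda''(0), close to sigma^2

print(d1, d2)
```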

Example: CGF of a $\mathcal{N}(\mu, \sigma^2)$ Random Variable

Compute the cumulant generating function $\Lambda(\theta)$ for $X \sim \mathcal{N}(\mu, \sigma^2)$.
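A worked solution (standard computation: write $X = \mu + \sigma Z$ with $Z \sim \mathcal{N}(0,1)$ and complete the square in the Gaussian integral):

```latex
\mathbb{E}\bigl[e^{\theta X}\bigr]
  = e^{\theta\mu}\,\mathbb{E}\bigl[e^{\theta\sigma Z}\bigr]
  = e^{\theta\mu} \int_{-\infty}^{\infty} e^{\theta\sigma z}\,
      \frac{e^{-z^{2}/2}}{\sqrt{2\pi}}\,dz
  = e^{\theta\mu}\,e^{\sigma^{2}\theta^{2}/2}
\quad\Longrightarrow\quad
\Lambda(\theta) = \mu\theta + \frac{\sigma^{2}\theta^{2}}{2},
\qquad \mathcal{D}_\Lambda = \mathbb{R}.
```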

Definition:

Rate Function (Fenchel-Legendre Transform)

The rate function associated with a random variable $X$ is the Fenchel-Legendre transform of the CGF:
$$I(x) \triangleq \Lambda^*(x) = \sup_{\theta \in \mathbb{R}} \bigl\{\theta x - \Lambda(\theta)\bigr\}.$$
Since $\Lambda$ is convex, $I$ is also convex. It satisfies $I(x) \geq 0$ for all $x$ and $I(\mu) = 0$, where $\mu = \mathbb{E}[X]$.

The rate function $I(x)$ quantifies the "cost" of observing the sample mean at $x$. The minimum cost is zero, attained at the true mean — the most likely value. The further $x$ deviates from $\mu$, the larger $I(x)$, and the faster the probability decays.
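The transform is easy to approximate numerically. A minimal sketch (assumption: the supremum is well approximated by a maximum over a fine grid of $\theta$ values), applied to the standard normal CGF:

```python
import numpy as np

def rate_function(x, cgf, thetas):
    """Fenchel-Legendre transform I(x) = sup_theta {theta*x - Lambda(theta)},
    approximated by a maximum over a fine grid of theta values."""
    return np.max(thetas * x - cgf(thetas))

# Standard normal: Lambda(theta) = theta^2 / 2, so I(x) = x^2 / 2 exactly.
std_normal_cgf = lambda t: 0.5 * t**2
thetas = np.linspace(-10.0, 10.0, 100_001)

for x in [0.0, 1.0, 2.0]:
    print(x, rate_function(x, std_normal_cgf, thetas))  # approximately x**2 / 2
```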

Example: Rate Function for $\mathcal{N}(\mu, \sigma^2)$

Compute the rate function $I(x)$ for $X \sim \mathcal{N}(\mu, \sigma^2)$ using the Legendre transform of $\Lambda(\theta) = \mu\theta + \sigma^2\theta^2/2$.
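A worked solution: the objective $\theta x - \mu\theta - \sigma^2\theta^2/2$ is concave in $\theta$, so setting its derivative to zero gives the maximizer.

```latex
\frac{d}{d\theta}\Bigl[\theta x - \mu\theta - \frac{\sigma^{2}\theta^{2}}{2}\Bigr]
  = x - \mu - \sigma^{2}\theta = 0
\;\Longrightarrow\;
\theta^{*} = \frac{x-\mu}{\sigma^{2}},
\qquad
I(x) = \theta^{*}(x-\mu) - \frac{\sigma^{2}(\theta^{*})^{2}}{2}
     = \frac{(x-\mu)^{2}}{2\sigma^{2}}.
```

The Gaussian rate function is a parabola centered at $\mu$: the CLT scaling is recovered exactly, with no higher-order corrections.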

Example: Rate Function for Bernoulli($p$)

Compute $I(x)$ for $X \sim \text{Bernoulli}(p)$.
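A worked solution: the Bernoulli CGF is $\Lambda(\theta) = \log(1 - p + pe^{\theta})$. Setting the derivative of $\theta x - \Lambda(\theta)$ to zero gives $e^{\theta^*} = \frac{x(1-p)}{p(1-x)}$ for $x \in (0,1)$, and substituting back yields

```latex
I(x) =
\begin{cases}
  x\log\dfrac{x}{p} + (1-x)\log\dfrac{1-x}{1-p}, & x \in [0,1],\\[1ex]
  +\infty, & \text{otherwise},
\end{cases}
```

which is exactly the KL divergence $D(\mathrm{Bern}(x)\,\|\,\mathrm{Bern}(p))$, with the endpoint convention $0\log 0 = 0$.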


Rate Function

A non-negative, lower semicontinuous, convex function $I(x)$ such that $\mathbb{P}(\bar{X}_n \in A) \approx \exp(-n \inf_{x \in A} I(x))$ for large $n$. Obtained as the Legendre-Fenchel transform of the cumulant generating function.

Related: Cumulant Generating Function, Large Deviation Principle

Theorem: Cramér's Theorem

Let $X_1, X_2, \ldots$ be i.i.d. real-valued random variables with cumulant generating function $\Lambda(\theta) = \log \mathbb{E}[e^{\theta X_1}]$ finite in a neighborhood of the origin. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and define $I(x) = \sup_\theta \{\theta x - \Lambda(\theta)\}$. Then for any closed set $F$:
$$\limsup_{n \to \infty} \frac{1}{n} \log \mathbb{P}(\bar{X}_n \in F) \leq -\inf_{x \in F} I(x),$$
and for any open set $G$:
$$\liminf_{n \to \infty} \frac{1}{n} \log \mathbb{P}(\bar{X}_n \in G) \geq -\inf_{x \in G} I(x).$$
In particular, for $a > \mu = \mathbb{E}[X_1]$:
$$\mathbb{P}(\bar{X}_n \geq a) \doteq e^{-nI(a)},$$
where $\doteq$ denotes logarithmic equivalence (equal exponential rates).

Rare events happen, and when they do, they happen in the "most likely unlikely way." To force $\bar{X}_n = a > \mu$, the i.i.d. samples conspire in the cheapest possible way, and the cost of this conspiracy is exactly $I(a)$ nats per sample, so the probability of the event decays exponentially at rate $I(a)$.
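A minimal simulation sketch (not from the source; the parameters $p = 0.5$, $a = 0.7$ are illustrative) comparing the empirical decay rate $-\frac{1}{n}\log \mathbb{P}(\bar{X}_n \geq a)$ for Bernoulli samples with the rate function $I(a) = D(\mathrm{Bern}(a)\,\|\,\mathrm{Bern}(p))$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli(p) with threshold a > p.  The rate function at a is the
# KL divergence between Bernoulli(a) and Bernoulli(p).
p, a = 0.5, 0.7
I_a = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

trials = 200_000
results = {}
for n in [10, 20, 40, 60]:
    means = rng.binomial(n, p, size=trials) / n
    p_hat = np.mean(means >= a)              # empirical tail probability
    results[n] = (p_hat, -np.log(p_hat) / n) # empirical decay rate
    print(n, p_hat, -np.log(p_hat) / n)

print("I(a) =", I_a)
```

The empirical rate overshoots $I(a)$ at small $n$ because of the sub-exponential prefactor, and approaches it from above as $n$ grows.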


Historical Note: Harald Cramér and the Birth of Large Deviations


Harald Cramér (1893--1985), a Swedish mathematician and statistician, published his foundational result on the exponential decay of tail probabilities in 1938. Working on problems in insurance mathematics — specifically, the ruin probability of an insurance company — Cramér needed precise asymptotics for sums of i.i.d. random variables far from the mean. His 1938 paper established the exponential decay rate and the role of the Legendre transform, though the full measure-theoretic framework was developed later by Varadhan, who received the Abel Prize in 2007 partly for his work on large deviations.

Logarithmic Equivalence

Two sequences $a_n$ and $b_n$ are logarithmically equivalent, written $a_n \doteq b_n$, if $\lim_{n \to \infty} \frac{1}{n}\log(a_n/b_n) = 0$. In large deviations, $\mathbb{P}(\bar{X}_n \geq a) \doteq e^{-nI(a)}$ means the probability decays at exactly the exponential rate $I(a)$, up to sub-exponential factors.

Related: Rate Function (Fenchel-Legendre Transform)

Exponential Tilting as Importance Sampling

The tilted distribution $dP_\theta = e^{\theta x - \Lambda(\theta)}\,dP$ shifts the mean of $X$ from $\mu$ to $\Lambda'(\theta)$. This is exactly the importance-sampling change of measure used in Monte Carlo simulation of rare events. To estimate $\mathbb{P}(\bar{X}_n \geq a)$ efficiently, one simulates under $P_{\theta^*}$ (where $\theta^*$ makes $a$ the new mean) and corrects with the likelihood ratio. This is not just a proof technique — it is the foundation of the most effective rare-event simulation methods.
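A minimal sketch of this idea (not from the source; the Gaussian case is used because the tilted measure $P_\theta$ is simply $\mathcal{N}(\theta, 1)$ and $\Lambda(\theta) = \theta^2/2$), estimating $\mathbb{P}(\bar{X}_n \geq a)$ for i.i.d. $\mathcal{N}(0,1)$ samples:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)

# Goal: estimate P(mean of n iid N(0,1) samples >= a), a rare event.
n, a, trials = 100, 0.4, 50_000

# Tilt by theta* = a: under P_theta the samples are N(theta, 1), so the
# tilted mean Lambda'(theta) = theta equals a and the event is no longer rare.
theta = a

x = rng.normal(theta, 1.0, size=(trials, n))   # simulate under P_theta
s = x.sum(axis=1)

# Likelihood ratio dP/dP_theta = exp(-theta * S_n + n * Lambda(theta)),
# with Lambda(theta) = theta**2 / 2 for the standard normal.
weights = np.exp(-theta * s + n * theta**2 / 2)
est = np.mean((s / n >= a) * weights)

exact = 0.5 * erfc(a * sqrt(n) / sqrt(2))      # P(N(0, 1/n) >= a)
print(est, exact)
```

Under the tilted measure the event $\{\bar{X}_n \geq a\}$ occurs about half the time, whereas naive simulation under $P$ with the same budget would observe only a handful of hits.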

Rate Function $I(x)$ for Common Distributions

Visualize the rate function $I(x) = \sup_\theta\{\theta x - \Lambda(\theta)\}$ for Gaussian, Bernoulli, and Poisson distributions. Observe how the parabolic shape of the Gaussian rate function contrasts with the KL-divergence shape for discrete distributions.


Empirical $\mathbb{P}(\bar{X}_n \geq a)$ vs. Cramér Prediction

Compare the empirical tail probability of the sample mean (from Monte Carlo simulation) with the large deviations prediction $e^{-nI(a)}$. Watch the log-probability become linear in $n$ with slope $-I(a)$.


Quick Check

Which of the following is always true about the rate function $I(x)$?

$I(x) \geq 0$ for all $x$, with $I(\mu) = 0$

$I(x)$ is always symmetric about $\mu$

$I(x)$ is always bounded

$I(x)$ is always concave

Common Mistake: Cramér's Theorem Requires i.i.d. Samples

Mistake:

Applying Cramér's theorem to dependent sequences or non-identically distributed sums without checking the assumptions.

Correction:

Cramér's theorem as stated requires i.i.d. random variables. Extensions to weakly dependent sequences exist (the Gärtner-Ellis theorem) but require the limit $\lim_{n \to \infty} \frac{1}{n}\Lambda_{S_n}(\theta)$ to exist and be differentiable, where $S_n = \sum_{i=1}^n X_i$. Markov chains satisfy this under mild conditions.

Definition:

Large Deviation Principle

A sequence of probability measures $\{\mu_n\}$ on $\mathbb{R}$ satisfies a large deviation principle (LDP) with speed $n$ and rate function $I$ if:

  1. Upper bound: For every closed set $F$, $\limsup_{n \to \infty} \frac{1}{n}\log \mu_n(F) \leq -\inf_{x \in F} I(x)$.

  2. Lower bound: For every open set $G$, $\liminf_{n \to \infty} \frac{1}{n}\log \mu_n(G) \geq -\inf_{x \in G} I(x)$.

  3. $I$ is lower semicontinuous, i.e., its level sets $\{x : I(x) \leq \alpha\}$ are closed for every $\alpha$.

Cramér's theorem states that $\mathbb{P}(\bar{X}_n \in \cdot)$ satisfies an LDP with rate function $I(x) = \Lambda^*(x)$.

Good Rate Function

A rate function $I : \mathbb{R} \to [0, \infty]$ is called good if its level sets $\{x : I(x) \leq \alpha\}$ are compact for every $\alpha \geq 0$. Cramér's rate function is always good when the MGF is finite in a neighborhood of zero.

Related: Rate Function (Fenchel-Legendre Transform), Large Deviation Principle

The Gärtner-Ellis Extension

When the $X_i$ are not identically distributed or have weak dependence, one can still obtain an LDP if the limiting cumulant generating function $\Lambda(\theta) = \lim_{n \to \infty} \frac{1}{n}\log \mathbb{E}[e^{\theta S_n}]$ exists and is differentiable on its effective domain. This is the Gärtner-Ellis theorem, which reduces to Cramér's theorem in the i.i.d. case. It applies, for instance, to Markov chains and to sums with slowly varying distributions.

Key Takeaway

Cramér's theorem is the fundamental result of large deviations theory for i.i.d. sums: the probability of the sample mean exceeding a threshold $a > \mu$ decays as $e^{-nI(a)}$, where $I(a) = \sup_\theta\{\theta a - \log \mathbb{E}[e^{\theta X}]\}$ is the Legendre-Fenchel transform of the log-MGF. The rate function $I$ is convex, non-negative, and zero only at the true mean.