Cramér's Theorem
Beyond the CLT: The Regime of Rare Events
The central limit theorem tells us that the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ fluctuates around the mean $\mu$ at scale $1/\sqrt{n}$, and the fluctuations are approximately Gaussian. But what about large deviations — the probability that $\bar{X}_n$ deviates from $\mu$ by a fixed amount $\delta > 0$, regardless of $n$? The CLT says this probability goes to zero, but it does not tell us how fast. Large deviations theory provides the answer: the decay is exponential in $n$, and the rate is governed by a function $I(x)$ that encodes the cost of forcing the sample mean to take the value $x$. This is the rate function, and Cramér's theorem identifies it as the Legendre-Fenchel transform of the log-MGF.
Definition: Cumulant Generating Function
Cumulant Generating Function
Let $X$ be a real-valued random variable with moment generating function $M(\theta) = \mathbb{E}[e^{\theta X}]$. The cumulant generating function (CGF) is defined as
$$\Lambda(\theta) = \log M(\theta) = \log \mathbb{E}[e^{\theta X}].$$
The effective domain of $\Lambda$ is $\mathcal{D}_\Lambda = \{\theta \in \mathbb{R} : \Lambda(\theta) < \infty\}$.
The CGF is always convex (by Hölder's inequality, $\Lambda(\lambda\theta_1 + (1-\lambda)\theta_2) \leq \lambda\Lambda(\theta_1) + (1-\lambda)\Lambda(\theta_2)$). Its derivatives at $\theta = 0$ give the cumulants: $\Lambda'(0) = \mathbb{E}[X]$, $\Lambda''(0) = \mathrm{Var}(X)$.
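As a quick numerical sanity check of the derivative identities (a sketch, not from the text), we can apply finite differences to a CGF known in closed form; the assumed example here is an Exponential(1) variable, whose CGF is $\Lambda(\theta) = -\log(1-\theta)$ for $\theta < 1$:

```python
import math

# Assumed example: X ~ Exponential(1), with CGF Lambda(theta) = -log(1 - theta).
def cgf(theta: float) -> float:
    return -math.log(1.0 - theta)

# Central finite differences approximate the derivatives at theta = 0.
h = 1e-4
first = (cgf(h) - cgf(-h)) / (2 * h)               # ~ Lambda'(0) = E[X]
second = (cgf(h) - 2 * cgf(0.0) + cgf(-h)) / h**2  # ~ Lambda''(0) = Var(X)
print(first, second)  # both close to 1, the mean and variance of Exponential(1)
```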
Example: CGF of a Gaussian Random Variable
Compute the cumulant generating function for $X \sim \mathcal{N}(\mu, \sigma^2)$.
Compute the MGF
For $X \sim \mathcal{N}(\mu, \sigma^2)$:
$$M(\theta) = \mathbb{E}[e^{\theta X}] = \exp\!\left(\mu\theta + \frac{\sigma^2\theta^2}{2}\right).$$
Take the logarithm
$$\Lambda(\theta) = \log M(\theta) = \mu\theta + \frac{\sigma^2\theta^2}{2},$$
finite for all $\theta \in \mathbb{R}$. Check: $\Lambda'(0) = \mu$ and $\Lambda''(0) = \sigma^2$, recovering the mean and variance.
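The closed form can be sanity-checked by Monte Carlo (a minimal sketch; the sample size and parameter values below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, theta = 1.0, 2.0, 0.3   # illustrative parameters

# Monte Carlo estimate of Lambda(theta) = log E[exp(theta X)] for X ~ N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=1_000_000)
lambda_mc = np.log(np.mean(np.exp(theta * samples)))

# Closed form derived above: mu*theta + sigma^2 * theta^2 / 2.
lambda_exact = mu * theta + sigma**2 * theta**2 / 2
print(lambda_mc, lambda_exact)  # lambda_exact = 0.48; the estimate is close
```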
Definition: Rate Function (Fenchel-Legendre Transform)
Rate Function (Fenchel-Legendre Transform)
The rate function $I$ associated with a random variable $X$ is the Fenchel-Legendre transform of the CGF:
$$I(x) = \sup_{\theta \in \mathbb{R}} \left\{\theta x - \Lambda(\theta)\right\}.$$
Since $\Lambda$ is convex, $I$ is also convex. It satisfies $I(x) \geq 0$ for all $x$ and $I(\mu) = 0$, where $\mu = \mathbb{E}[X]$.
The rate function quantifies the "cost" of observing the sample mean at $x$. The minimum cost is zero, attained at the true mean $\mu$ — the most likely value. The further $x$ deviates from $\mu$, the larger $I(x)$, and the faster the probability decays.
Example: Rate Function for $\mathcal{N}(\mu, \sigma^2)$
Compute the rate function for $X \sim \mathcal{N}(\mu, \sigma^2)$ using the Legendre transform of $\Lambda(\theta) = \mu\theta + \sigma^2\theta^2/2$.
Set up the optimization
$$I(x) = \sup_{\theta \in \mathbb{R}} \left\{\theta x - \mu\theta - \frac{\sigma^2\theta^2}{2}\right\}.$$
Take derivative and set to zero
$$\frac{d}{d\theta}\left[\theta x - \mu\theta - \frac{\sigma^2\theta^2}{2}\right] = x - \mu - \sigma^2\theta = 0 \quad\Longrightarrow\quad \theta^* = \frac{x - \mu}{\sigma^2}.$$
Substitute back
Substituting $\theta^* = (x-\mu)/\sigma^2$ gives
$$I(x) = \frac{(x-\mu)^2}{2\sigma^2}.$$
The rate function is a parabola with minimum at $\mu$, so for $a > \mu$, $\mathbb{P}(\bar{X}_n \geq a) \approx \exp\!\left(-n \cdot \frac{(a-\mu)^2}{2\sigma^2}\right)$.
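One way to double-check the Legendre transform is to evaluate the supremum numerically on a dense grid of $\theta$ values and compare with the parabola (a sketch; the grid bounds are ad hoc):

```python
import numpy as np

mu, sigma = 0.0, 1.0
thetas = np.linspace(-10.0, 10.0, 200_001)  # ad hoc grid covering the optimizer

def rate_numeric(x: float) -> float:
    # I(x) = sup_theta { theta*x - Lambda(theta) }, maximized over the grid.
    return float(np.max(thetas * x - (mu * thetas + sigma**2 * thetas**2 / 2)))

for x in [0.5, 1.0, 2.0]:
    print(x, rate_numeric(x), (x - mu)**2 / (2 * sigma**2))  # numeric sup vs parabola
```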
Example: Rate Function for Bernoulli($p$)
Compute $I(x)$ for $X \sim \mathrm{Bernoulli}(p)$.
CGF
$M(\theta) = \mathbb{E}[e^{\theta X}] = 1 - p + pe^{\theta}$, so $\Lambda(\theta) = \log(1 - p + pe^{\theta})$.
Legendre transform
Setting $\frac{d}{d\theta}\left[\theta x - \Lambda(\theta)\right] = 0$:
$$x = \Lambda'(\theta) = \frac{pe^{\theta}}{1 - p + pe^{\theta}}.$$
Solving for $\theta$ and substituting:
$$I(x) = x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p}$$
for $x \in [0, 1]$. This is exactly the KL divergence $D_{\mathrm{KL}}(\mathrm{Ber}(x)\,\|\,\mathrm{Ber}(p))$.
Interpretation
The rate function for Bernoulli i.i.d. sums is the binary KL divergence. The probability that the empirical frequency of ones equals $x$ decays as $e^{-n D_{\mathrm{KL}}(\mathrm{Ber}(x)\|\mathrm{Ber}(p))}$. This foreshadows Sanov's theorem in the next section.
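The identity $I(x) = D_{\mathrm{KL}}(\mathrm{Ber}(x)\|\mathrm{Ber}(p))$ can likewise be verified by brute-force maximization over a grid of $\theta$ (a sketch with arbitrary parameters):

```python
import numpy as np

p = 0.3
thetas = np.linspace(-20.0, 20.0, 400_001)   # ad hoc grid
cgf = np.log(1 - p + p * np.exp(thetas))     # Lambda(theta) for Bernoulli(p)

def rate_numeric(x: float) -> float:
    # Legendre transform sup_theta { theta*x - Lambda(theta) } on the grid.
    return float(np.max(thetas * x - cgf))

def kl(x: float) -> float:
    # Binary KL divergence D(Ber(x) || Ber(p)).
    return x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

for x in [0.1, 0.5, 0.8]:
    print(x, rate_numeric(x), kl(x))  # the two columns agree
```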
Rate Function
A non-negative, lower semicontinuous, convex function $I$ such that $\mathbb{P}(\bar{X}_n \approx x) \approx e^{-nI(x)}$ for large $n$. Obtained as the Legendre-Fenchel transform of the cumulant generating function.
Related: Cumulant Generating Function, Large Deviation Principle
Theorem: Cramér's Theorem
Let $X_1, X_2, \ldots$ be i.i.d. real-valued random variables with cumulant generating function $\Lambda(\theta) = \log\mathbb{E}[e^{\theta X_1}]$ finite in a neighborhood of the origin. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and define $I(x) = \sup_{\theta \in \mathbb{R}}\{\theta x - \Lambda(\theta)\}$. Then for any closed set $F \subseteq \mathbb{R}$:
$$\limsup_{n\to\infty} \frac{1}{n}\log\mathbb{P}(\bar{X}_n \in F) \leq -\inf_{x \in F} I(x),$$
and for any open set $G \subseteq \mathbb{R}$:
$$\liminf_{n\to\infty} \frac{1}{n}\log\mathbb{P}(\bar{X}_n \in G) \geq -\inf_{x \in G} I(x).$$
In particular, for $a > \mathbb{E}[X_1]$:
$$\mathbb{P}(\bar{X}_n \geq a) \asymp e^{-nI(a)},$$
where $\asymp$ denotes logarithmic equivalence (equal exponential rates).
Rare events happen, and when they do, they happen in the "most likely unlikely way." To force $\bar{X}_n \approx a$, the i.i.d. samples conspire in the cheapest possible way, and the cost of this conspiracy is exactly $I(a)$ nats per sample. The probability of the event decays exponentially at rate $I(a)$ per sample.
Start with the Chernoff bound: $\mathbb{P}(\bar{X}_n \geq a) \leq e^{-n(\theta a - \Lambda(\theta))}$ for any $\theta > 0$.
For the matching lower bound, use the exponential change of measure (tilting) technique.
Under the tilted measure $d\mathbb{P}_\theta(x) = e^{\theta x - \Lambda(\theta)}\,d\mathbb{P}(x)$, the mean shifts to $\Lambda'(\theta)$.
Upper bound (Chernoff)
For any $\theta > 0$, Markov's inequality applied to $e^{\theta \sum_i X_i}$ gives
$$\mathbb{P}(\bar{X}_n \geq a) = \mathbb{P}\!\left(e^{\theta \sum_i X_i} \geq e^{n\theta a}\right) \leq e^{-n\theta a}\,\mathbb{E}\!\left[e^{\theta X_1}\right]^n = e^{-n(\theta a - \Lambda(\theta))}.$$
Optimizing over $\theta > 0$: $\mathbb{P}(\bar{X}_n \geq a) \leq e^{-nI(a)}$.
Lower bound via tilting
Choose $\theta^*$ such that $\Lambda'(\theta^*) = a$ (the tilting parameter). Define the tilted distribution $d\mathbb{P}_{\theta^*}(x) = e^{\theta^* x - \Lambda(\theta^*)}\,d\mathbb{P}(x)$. Under $\mathbb{P}_{\theta^*}$, each $X_i$ has mean $a$ and finite variance $\Lambda''(\theta^*)$.
Apply CLT under tilted measure
Under $\mathbb{P}_{\theta^*}$, $\bar{X}_n$ concentrates around $a$. For any $\epsilon > 0$, the CLT gives
$$\mathbb{P}_{\theta^*}\!\left(a \leq \bar{X}_n \leq a + \epsilon\right) \longrightarrow \tfrac{1}{2} > 0.$$
On the event $\{a \leq \bar{X}_n \leq a + \epsilon\}$, the likelihood ratio satisfies
$$\frac{d\mathbb{P}}{d\mathbb{P}_{\theta^*}} = e^{-\theta^*\sum_i X_i + n\Lambda(\theta^*)} \geq e^{-n(\theta^*(a+\epsilon) - \Lambda(\theta^*))}.$$
Combine
$$\mathbb{P}(\bar{X}_n \geq a) \geq \mathbb{E}_{\theta^*}\!\left[\frac{d\mathbb{P}}{d\mathbb{P}_{\theta^*}}\,\mathbf{1}\{a \leq \bar{X}_n \leq a + \epsilon\}\right] \geq e^{-n(\theta^*(a+\epsilon) - \Lambda(\theta^*))}\,\mathbb{P}_{\theta^*}\!\left(a \leq \bar{X}_n \leq a + \epsilon\right).$$
Taking $\frac{1}{n}\log$, letting $n \to \infty$ and then $\epsilon \to 0$ yields $\liminf_{n\to\infty} \frac{1}{n}\log\mathbb{P}(\bar{X}_n \geq a) \geq -(\theta^* a - \Lambda(\theta^*)) = -I(a)$.
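For the Gaussian case, the Chernoff/Cramér upper bound can be compared against the exact tail of $\bar{X}_n \sim \mathcal{N}(\mu, \sigma^2/n)$ (a sketch; parameter values are illustrative):

```python
import math

mu, sigma, a, n = 0.0, 1.0, 0.5, 100   # illustrative parameters

# Exact tail: X_bar ~ N(mu, sigma^2/n), so P(X_bar >= a) = Q(sqrt(n)(a - mu)/sigma).
z = math.sqrt(n) * (a - mu) / sigma
exact = 0.5 * math.erfc(z / math.sqrt(2))

# Chernoff/Cramer bound: exp(-n * I(a)) with I(a) = (a - mu)^2 / (2 sigma^2).
bound = math.exp(-n * (a - mu)**2 / (2 * sigma**2))

print(exact, bound)  # the exact tail sits below the bound, with the same exponent
```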
Historical Note: Harald Cramér and the Birth of Large Deviations
Harald Cramér (1893--1985), a Swedish mathematician and statistician, published his foundational result on the exponential decay of tail probabilities in 1938. Working on problems in insurance mathematics — specifically, the ruin probability of an insurance company — Cramér needed precise asymptotics for sums of i.i.d. random variables far from the mean. His 1938 paper established the exponential decay rate and the role of the Legendre transform, though the full measure-theoretic framework was developed later by Varadhan, who received the Abel Prize in 2007 partly for his work on large deviations.
Logarithmic Equivalence
Two sequences $a_n$ and $b_n$ are logarithmically equivalent, written $a_n \asymp b_n$, if $\lim_{n\to\infty} \frac{1}{n}\log\frac{a_n}{b_n} = 0$. In large deviations, $\mathbb{P}(\bar{X}_n \geq a) \asymp e^{-nI(a)}$ means the probability decays at exactly the exponential rate $I(a)$, up to sub-exponential factors.
Exponential Tilting as Importance Sampling
The tilted distribution $d\mathbb{P}_\theta \propto e^{\theta x}\,d\mathbb{P}$ shifts the mean of $X$ from $\mu$ to $\Lambda'(\theta)$. This is exactly the importance sampling change of measure used in Monte Carlo simulation of rare events. To estimate $\mathbb{P}(\bar{X}_n \geq a)$ efficiently, one simulates under $\mathbb{P}_{\theta^*}$ (where $\theta^*$ makes the new mean $a$) and corrects with the likelihood ratio. This is not just a proof technique — it is the foundation of the most effective rare-event simulation methods.
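A minimal sketch of this importance sampler for Gaussian samples (parameters are illustrative; for $\mathcal{N}(\mu, \sigma^2)$, tilting by $\theta^*=(a-\mu)/\sigma^2$ yields $\mathcal{N}(a, \sigma^2)$):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, a, n, trials = 0.0, 1.0, 0.5, 100, 20_000   # illustrative parameters

theta = (a - mu) / sigma**2                  # tilt so the new mean is a
cgf = mu * theta + sigma**2 * theta**2 / 2   # Lambda(theta) for the Gaussian

# Simulate under the tilted measure N(a, sigma^2), then reweight by dP/dP_theta.
x = rng.normal(a, sigma, size=(trials, n))
s = x.sum(axis=1)
weights = np.exp(-theta * s + n * cgf)       # likelihood ratio dP/dP_theta
estimate = float(np.mean(weights * (s / n >= a)))

# Exact answer for comparison: P(X_bar >= a) with X_bar ~ N(mu, sigma^2/n).
exact = 0.5 * math.erfc(math.sqrt(n) * (a - mu) / sigma / math.sqrt(2))
print(estimate, exact)  # both around 3e-7, from only 20,000 tilted trials
```

Naive Monte Carlo would need on the order of $10^7$ samples just to see this event once; the tilted sampler hits it about half the time and corrects with the exponentially small weights.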
Rate Function for Common Distributions
Visualize the rate function for Gaussian, Bernoulli, and Poisson distributions. Observe how the parabolic shape of the Gaussian rate function contrasts with the KL-divergence shape for discrete distributions.
Empirical vs Cramér Prediction
Compare the empirical tail probability $\mathbb{P}(\bar{X}_n \geq a)$ of the sample mean (from Monte Carlo simulation) with the large deviations prediction $e^{-nI(a)}$. Watch the log-probability become linear in $n$ with slope $-I(a)$.
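This experiment can be carried out exactly for Bernoulli samples, where the tail probability is a binomial sum (a sketch with illustrative parameters):

```python
import math

p, a = 0.5, 0.7   # illustrative: fair coin, threshold 0.7

def kl(x: float, q: float) -> float:
    # Binary KL divergence = Cramer rate function for Bernoulli(q).
    return x * math.log(x / q) + (1 - x) * math.log((1 - x) / (1 - q))

def log_tail(n: int) -> float:
    # Exact log P(X_bar_n >= a): an upper binomial tail.
    k0 = math.ceil(n * a)
    tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))
    return math.log(tail)

for n in [50, 100, 200, 400]:
    print(n, log_tail(n) / n)   # creeps up toward -I(a) as n grows
print(-kl(a, p))                # the predicted slope, about -0.0823
```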
Quick Check
Which of the following is always true about the rate function $I(x)$?
$I(x) \geq 0$ for all $x$, with $I(\mu) = 0$
$I$ is always symmetric about $\mu$
$I$ is always bounded
$I$ is always concave
Since $\Lambda(0) = 0$ and $\Lambda$ is convex, the supremum satisfies $I(x) = \sup_\theta\{\theta x - \Lambda(\theta)\} \geq 0 \cdot x - \Lambda(0) = 0$, with equality at $x = \mu$.
Common Mistake: Cramér's Theorem Requires i.i.d. Samples
Mistake:
Applying Cramér's theorem to dependent sequences or non-identically distributed sums without checking the assumptions.
Correction:
Cramér's theorem as stated requires i.i.d. random variables. Extensions to weakly dependent sequences exist (Gärtner-Ellis theorem) but require the limit $\Lambda(\theta) = \lim_{n\to\infty} \frac{1}{n}\log\mathbb{E}[e^{\theta S_n}]$ to exist and be differentiable, where $S_n = X_1 + \cdots + X_n$. Markov chains satisfy this under mild conditions.
Definition: Large Deviation Principle
Large Deviation Principle
A sequence of probability measures $(\mu_n)$ on $\mathbb{R}$ satisfies a large deviation principle (LDP) with speed $n$ and rate function $I$ if:
- Upper bound: For every closed set $F$, $\limsup_{n\to\infty} \frac{1}{n}\log\mu_n(F) \leq -\inf_{x \in F} I(x)$.
- Lower bound: For every open set $G$, $\liminf_{n\to\infty} \frac{1}{n}\log\mu_n(G) \geq -\inf_{x \in G} I(x)$.
- $I$ is lower semicontinuous with compact level sets $\{x : I(x) \leq c\}$ for all $c < \infty$.
Cramér's theorem states that the laws of $\bar{X}_n$ satisfy an LDP with speed $n$ and rate function $I(x) = \sup_\theta\{\theta x - \Lambda(\theta)\}$.
Good Rate Function
A rate function $I$ is called good if its level sets $\{x : I(x) \leq c\}$ are compact for every $c < \infty$. Cramér's rate function is always good when the MGF is finite in a neighborhood of zero.
Related: Rate Function (Fenchel-Legendre Transform), Large Deviation Principle
The Gärtner-Ellis Extension
When the $X_i$ are not identically distributed or have weak dependence, one can still obtain an LDP if the limiting cumulant generating function $\Lambda(\theta) = \lim_{n\to\infty} \frac{1}{n}\log\mathbb{E}[e^{\theta S_n}]$ exists and is differentiable on its effective domain. This is the Gärtner-Ellis theorem, which reduces to Cramér's theorem in the i.i.d. case. It applies, for instance, to Markov chains and to sums with slowly varying distributions.
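For a finite-state Markov chain, the limiting CGF is the log of the Perron eigenvalue of a tilted transition matrix. The following sketch checks this for a hypothetical two-state chain with $S_n = f(X_1) + \cdots + f(X_n)$ and $f(j) = j$ (all numbers illustrative):

```python
import numpy as np

# Hypothetical two-state Markov chain on {0, 1}.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
theta = 0.5

# Tilt: multiply column j of P by exp(theta * f(j)), f(j) = j.
tilted = P * np.exp(theta * np.array([0.0, 1.0]))

# Gartner-Ellis limit: Lambda(theta) = log of the largest (Perron) eigenvalue.
lam = float(np.log(np.max(np.linalg.eigvals(tilted).real)))

# Direct computation of (1/n) log E[exp(theta * S_n)], started in state 0,
# with per-step normalization to avoid overflow.
n = 2000
v = np.array([1.0, 0.0])
log_e = 0.0
for _ in range(n):
    v = v @ tilted
    s = v.sum()
    log_e += np.log(s)
    v /= s
direct = log_e / n

print(lam, direct)  # agree to roughly 1e-3
```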
Key Takeaway
Cramér's theorem is the fundamental result of large deviations theory for i.i.d. sums: the probability of the sample mean exceeding a threshold $a$ decays as $e^{-nI(a)}$, where $I(x) = \sup_\theta\{\theta x - \Lambda(\theta)\}$ is the Legendre-Fenchel transform of the log-MGF. The rate function is convex, non-negative, and zero only at the true mean.