Cramér's Theorem

Beyond the CLT: The Regime of Rare Events

The central limit theorem tells us that the sample mean $\bar{X}_n$ fluctuates around $\mu$ at scale $1/\sqrt{n}$, and that the fluctuations are approximately Gaussian. But what about large deviations — the probability that $\bar{X}_n$ deviates from $\mu$ by a fixed amount $\delta > 0$, regardless of $n$? The CLT says this probability goes to zero, but it does not tell us how fast. Large deviations theory provides the answer: the decay is exponential in $n$, and the rate is governed by a function $I(x)$ that encodes the cost of forcing the sample mean to take the value $x$. This is the rate function, and Cramér's theorem identifies it as the Legendre-Fenchel transform of the log-MGF.

Definition:

Cumulant Generating Function

Let $X$ be a real-valued random variable with moment generating function $M_X(\theta) = \mathbb{E}[e^{\theta X}]$. The cumulant generating function (CGF) is defined as
$$\Lambda(\theta) \triangleq \log M_X(\theta) = \log \mathbb{E}[e^{\theta X}].$$
The effective domain of $\Lambda$ is $\mathcal{D}_\Lambda = \{\theta \in \mathbb{R} : \Lambda(\theta) < \infty\}$.

The CGF is always convex (a consequence of Hölder's inequality). Its derivatives at $\theta = 0$ give the cumulants: $\Lambda'(0) = \mu$, $\Lambda''(0) = \sigma^2$.
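These identities can be checked numerically. A minimal sketch (not from the source): estimate the CGF of a Gaussian sample by Monte Carlo and recover the mean and variance by finite differences at $\theta = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

def cgf(theta, samples):
    """Empirical CGF: log of the sample average of exp(theta * X)."""
    return np.log(np.mean(np.exp(theta * samples)))

# Gaussian sample; the exact CGF is mu*theta + sigma^2*theta^2/2.
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Central finite differences of the CGF at theta = 0
h = 1e-3
d1 = (cgf(h, x) - cgf(-h, x)) / (2 * h)                  # Lambda'(0), close to mu
d2 = (cgf(h, x) - 2 * cgf(0.0, x) + cgf(-h, x)) / h**2   # Lambda''(0), close to sigma^2

print(d1, d2)
```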

Example: CGF of a $\mathcal{N}(\mu, \sigma^2)$ Random Variable

Compute the cumulant generating function $\Lambda(\theta)$ for $X \sim \mathcal{N}(\mu, \sigma^2)$.
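A worked solution (standard computation: write $X = \mu + \sigma Z$ with $Z \sim \mathcal{N}(0,1)$ and complete the square in the Gaussian integral):

```latex
\mathbb{E}\bigl[e^{\theta X}\bigr]
  = e^{\theta\mu}\,\mathbb{E}\bigl[e^{\theta\sigma Z}\bigr]
  = e^{\theta\mu} \int_{-\infty}^{\infty} e^{\theta\sigma z}\,
      \frac{e^{-z^{2}/2}}{\sqrt{2\pi}}\,dz
  = e^{\theta\mu}\,e^{\sigma^{2}\theta^{2}/2}
\quad\Longrightarrow\quad
\Lambda(\theta) = \mu\theta + \frac{\sigma^{2}\theta^{2}}{2},
\qquad \mathcal{D}_\Lambda = \mathbb{R}.
```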

Definition:

Rate Function (Fenchel-Legendre Transform)

The rate function associated with a random variable $X$ is the Fenchel-Legendre transform of the CGF:
$$I(x) \triangleq \Lambda^*(x) = \sup_{\theta \in \mathbb{R}} \bigl\{\theta x - \Lambda(\theta)\bigr\}.$$
Since $\Lambda$ is convex, $I$ is also convex. It satisfies $I(x) \geq 0$ for all $x$ and $I(\mu) = 0$, where $\mu = \mathbb{E}[X]$.

The rate function $I(x)$ quantifies the "cost" of observing the sample mean at $x$. The minimum cost is zero, attained at the true mean — the most likely value. The further $x$ deviates from $\mu$, the larger $I(x)$, and the faster the probability decays.
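The transform is easy to approximate numerically. A minimal sketch (assumption: the supremum is well approximated by a maximum over a fine grid of $\theta$ values), applied to the standard normal CGF:

```python
import numpy as np

def rate_function(x, cgf, thetas):
    """Fenchel-Legendre transform I(x) = sup_theta {theta*x - Lambda(theta)},
    approximated by a maximum over a fine grid of theta values."""
    return np.max(thetas * x - cgf(thetas))

# Standard normal: Lambda(theta) = theta^2 / 2, so I(x) = x^2 / 2 exactly.
std_normal_cgf = lambda t: 0.5 * t**2
thetas = np.linspace(-10.0, 10.0, 100_001)

for x in [0.0, 1.0, 2.0]:
    print(x, rate_function(x, std_normal_cgf, thetas))  # approximately x**2 / 2
```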

Example: Rate Function for $\mathcal{N}(\mu, \sigma^2)$

Compute the rate function $I(x)$ for $X \sim \mathcal{N}(\mu, \sigma^2)$ using the Legendre transform of $\Lambda(\theta) = \mu\theta + \sigma^2\theta^2/2$.
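A worked solution: the objective $\theta x - \mu\theta - \sigma^2\theta^2/2$ is concave in $\theta$, so setting its derivative to zero gives the maximizer.

```latex
\frac{d}{d\theta}\Bigl[\theta x - \mu\theta - \frac{\sigma^{2}\theta^{2}}{2}\Bigr]
  = x - \mu - \sigma^{2}\theta = 0
\;\Longrightarrow\;
\theta^{*} = \frac{x-\mu}{\sigma^{2}},
\qquad
I(x) = \theta^{*}(x-\mu) - \frac{\sigma^{2}(\theta^{*})^{2}}{2}
     = \frac{(x-\mu)^{2}}{2\sigma^{2}}.
```

The Gaussian rate function is a parabola centered at $\mu$: the CLT scaling is recovered exactly, with no higher-order corrections.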

Example: Rate Function for Bernoulli($p$)

Compute $I(x)$ for $X \sim \text{Bernoulli}(p)$.
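A worked solution: the Bernoulli CGF is $\Lambda(\theta) = \log(1 - p + pe^{\theta})$. Setting the derivative of $\theta x - \Lambda(\theta)$ to zero gives $e^{\theta^*} = \frac{x(1-p)}{p(1-x)}$ for $x \in (0,1)$, and substituting back yields

```latex
I(x) =
\begin{cases}
  x\log\dfrac{x}{p} + (1-x)\log\dfrac{1-x}{1-p}, & x \in [0,1],\\[1ex]
  +\infty, & \text{otherwise},
\end{cases}
```

which is exactly the KL divergence $D(\mathrm{Bern}(x)\,\|\,\mathrm{Bern}(p))$, with the endpoint convention $0\log 0 = 0$.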


Rate Function

A non-negative, lower semicontinuous, convex function $I(x)$ such that $\mathbb{P}(\bar{X}_n \in A) \approx \exp(-n \inf_{x \in A} I(x))$ for large $n$. Obtained as the Legendre-Fenchel transform of the cumulant generating function.

Related: Cumulant Generating Function, Large Deviation Principle

Theorem: Cramér's Theorem

Let $X_1, X_2, \ldots$ be i.i.d. real-valued random variables with cumulant generating function $\Lambda(\theta) = \log \mathbb{E}[e^{\theta X_1}]$ finite in a neighborhood of the origin. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and define $I(x) = \sup_\theta \{\theta x - \Lambda(\theta)\}$. Then for any closed set $F$:
$$\limsup_{n \to \infty} \frac{1}{n} \log \mathbb{P}(\bar{X}_n \in F) \leq -\inf_{x \in F} I(x),$$
and for any open set $G$:
$$\liminf_{n \to \infty} \frac{1}{n} \log \mathbb{P}(\bar{X}_n \in G) \geq -\inf_{x \in G} I(x).$$
In particular, for $a > \mu = \mathbb{E}[X_1]$:
$$\mathbb{P}(\bar{X}_n \geq a) \doteq e^{-nI(a)},$$
where $\doteq$ denotes logarithmic equivalence (equal exponential rates).

Rare events happen, and when they do, they happen in the "most likely unlikely way." To force $\bar{X}_n = a > \mu$, the i.i.d. samples conspire in the cheapest possible way, and the cost of this conspiracy is exactly $I(a)$ nats per sample, so the probability of the event decays exponentially at rate $I(a)$.
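A minimal simulation sketch (not from the source; the parameters $p = 0.5$, $a = 0.7$ are illustrative) comparing the empirical decay rate $-\frac{1}{n}\log \mathbb{P}(\bar{X}_n \geq a)$ for Bernoulli samples with the rate function $I(a) = D(\mathrm{Bern}(a)\,\|\,\mathrm{Bern}(p))$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli(p) with threshold a > p.  The rate function at a is the
# KL divergence between Bernoulli(a) and Bernoulli(p).
p, a = 0.5, 0.7
I_a = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

trials = 200_000
results = {}
for n in [10, 20, 40, 60]:
    means = rng.binomial(n, p, size=trials) / n
    p_hat = np.mean(means >= a)              # empirical tail probability
    results[n] = (p_hat, -np.log(p_hat) / n) # empirical decay rate
    print(n, p_hat, -np.log(p_hat) / n)

print("I(a) =", I_a)
```

The empirical rate overshoots $I(a)$ at small $n$ because of the sub-exponential prefactor, and approaches it from above as $n$ grows.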


Historical Note: Harald Cramér and the Birth of Large Deviations


Harald Cramér (1893--1985), a Swedish mathematician and statistician, published his foundational result on the exponential decay of tail probabilities in 1938. Working on problems in insurance mathematics — specifically, the ruin probability of an insurance company — Cramér needed precise asymptotics for sums of i.i.d. random variables far from the mean. His 1938 paper established the exponential decay rate and the role of the Legendre transform, though the full measure-theoretic framework was developed later by Varadhan, who received the Abel Prize in 2007 partly for his work on large deviations.

Logarithmic Equivalence

Two sequences $a_n$ and $b_n$ are logarithmically equivalent, written $a_n \doteq b_n$, if $\lim_{n \to \infty} \frac{1}{n}\log(a_n/b_n) = 0$. In large deviations, $\mathbb{P}(\bar{X}_n \geq a) \doteq e^{-nI(a)}$ means the probability decays at exactly the exponential rate $I(a)$, up to sub-exponential factors.

Related: Rate Function (Fenchel-Legendre Transform)

Exponential Tilting as Importance Sampling

The tilted distribution $dP_\theta = e^{\theta x - \Lambda(\theta)}\,dP$ shifts the mean of $X$ from $\mu$ to $\Lambda'(\theta)$. This is exactly the importance-sampling change of measure used in Monte Carlo simulation of rare events. To estimate $\mathbb{P}(\bar{X}_n \geq a)$ efficiently, one simulates under $P_{\theta^*}$ (where $\theta^*$ makes $a$ the new mean) and corrects with the likelihood ratio. This is not just a proof technique — it is the foundation of the most effective rare-event simulation methods.
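A minimal sketch of this idea (not from the source; the Gaussian case is used because the tilted measure $P_\theta$ is simply $\mathcal{N}(\theta, 1)$ and $\Lambda(\theta) = \theta^2/2$), estimating $\mathbb{P}(\bar{X}_n \geq a)$ for i.i.d. $\mathcal{N}(0,1)$ samples:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)

# Goal: estimate P(mean of n iid N(0,1) samples >= a), a rare event.
n, a, trials = 100, 0.4, 50_000

# Tilt by theta* = a: under P_theta the samples are N(theta, 1), so the
# tilted mean Lambda'(theta) = theta equals a and the event is no longer rare.
theta = a

x = rng.normal(theta, 1.0, size=(trials, n))   # simulate under P_theta
s = x.sum(axis=1)

# Likelihood ratio dP/dP_theta = exp(-theta * S_n + n * Lambda(theta)),
# with Lambda(theta) = theta**2 / 2 for the standard normal.
weights = np.exp(-theta * s + n * theta**2 / 2)
est = np.mean((s / n >= a) * weights)

exact = 0.5 * erfc(a * sqrt(n) / sqrt(2))      # P(N(0, 1/n) >= a)
print(est, exact)
```

Under the tilted measure the event $\{\bar{X}_n \geq a\}$ occurs about half the time, whereas naive simulation under $P$ with the same budget would observe only a handful of hits.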

Rate Function $I(x)$ for Common Distributions

Visualize the rate function $I(x) = \sup_\theta\{\theta x - \Lambda(\theta)\}$ for Gaussian, Bernoulli, and Poisson distributions. Observe how the parabolic shape of the Gaussian rate function contrasts with the KL-divergence shape for discrete distributions.


Empirical $\mathbb{P}(\bar{X}_n \geq a)$ vs. Cramér Prediction

Compare the empirical tail probability of the sample mean (from Monte Carlo simulation) with the large deviations prediction $e^{-nI(a)}$. Watch the log-probability become linear in $n$ with slope $-I(a)$.


Quick Check

Which of the following is always true about the rate function $I(x)$?

$I(x) \geq 0$ for all $x$, with $I(\mu) = 0$

$I(x)$ is always symmetric about $\mu$

$I(x)$ is always bounded

$I(x)$ is always concave

Common Mistake: Cramér's Theorem Requires i.i.d. Samples

Mistake:

Applying Cramér's theorem to dependent sequences or non-identically distributed sums without checking the assumptions.

Correction:

Cramér's theorem as stated requires i.i.d. random variables. Extensions to weakly dependent sequences exist (the Gärtner-Ellis theorem) but require the limit $\lim_{n \to \infty} \frac{1}{n}\Lambda_{S_n}(\theta)$ to exist and be differentiable, where $S_n = \sum_{i=1}^n X_i$. Markov chains satisfy this under mild conditions.

Definition:

Large Deviation Principle

A sequence of probability measures $\{\mu_n\}$ on $\mathbb{R}$ satisfies a large deviation principle (LDP) with speed $n$ and rate function $I$ if:

  1. Upper bound: For every closed set $F$, $\limsup_{n \to \infty} \frac{1}{n}\log \mu_n(F) \leq -\inf_{x \in F} I(x)$.

  2. Lower bound: For every open set $G$, $\liminf_{n \to \infty} \frac{1}{n}\log \mu_n(G) \geq -\inf_{x \in G} I(x)$.

  3. $I$ is lower semicontinuous, i.e., its level sets $\{x : I(x) \leq \alpha\}$ are closed for every $\alpha$.

Cramér's theorem states that $\mathbb{P}(\bar{X}_n \in \cdot)$ satisfies an LDP with rate function $I(x) = \Lambda^*(x)$.

Good Rate Function

A rate function $I : \mathbb{R} \to [0, \infty]$ is called good if its level sets $\{x : I(x) \leq \alpha\}$ are compact for every $\alpha \geq 0$. Cramér's rate function is always good when the MGF is finite in a neighborhood of zero.

Related: Rate Function (Fenchel-Legendre Transform), Large Deviation Principle

The Gärtner-Ellis Extension

When the $X_i$ are not identically distributed or have weak dependence, one can still obtain an LDP if the limiting cumulant generating function $\Lambda(\theta) = \lim_{n \to \infty} \frac{1}{n}\log \mathbb{E}[e^{\theta S_n}]$ exists and is differentiable on its effective domain. This is the Gärtner-Ellis theorem, which reduces to Cramér's theorem in the i.i.d. case. It applies, for instance, to Markov chains and to sums with slowly varying distributions.

Key Takeaway

Cramér's theorem is the fundamental result of large deviations theory for i.i.d. sums: the probability of the sample mean exceeding a threshold $a > \mu$ decays as $e^{-nI(a)}$, where $I(a) = \sup_\theta\{\theta a - \log \mathbb{E}[e^{\theta X}]\}$ is the Legendre-Fenchel transform of the log-MGF. The rate function $I$ is convex, non-negative, and zero only at the true mean.