Sub-Gaussian Random Variables

A Hierarchy of Tail Behavior

The Chernoff bound from Chapter 10 gives exponential tail bounds whenever the MGF exists. But for many applications — finite blocklength information theory, compressed sensing, machine learning generalization bounds — we need a systematic way to classify random variables by their tail behavior. Sub-Gaussian random variables form the "well-behaved" class: their tails decay at least as fast as a Gaussian's, even though they may not be Gaussian at all. Bounded random variables, Rademacher variables, and any random variable with a light enough tail fall into this class — and all of them inherit the same clean concentration inequalities.

Definition:

Sub-Gaussian Random Variable

A zero-mean random variable $X$ is sub-Gaussian with parameter $\sigma_{\mathrm{sg}}$ (or $\sigma_{\mathrm{sg}}$-sub-Gaussian) if its moment generating function satisfies
$$\mathbb{E}[e^{tX}] \leq \exp\!\left(\frac{\sigma_{\mathrm{sg}}^2 t^2}{2}\right) \quad \text{for all } t \in \mathbb{R}.$$
The parameter $\sigma_{\mathrm{sg}}^2$ is called the sub-Gaussian variance proxy (or sub-Gaussian parameter). A general (non-zero-mean) random variable $X$ is $\sigma_{\mathrm{sg}}$-sub-Gaussian if $X - \mathbb{E}[X]$ is.

The condition says the MGF of $X$ is dominated by the MGF of a $\mathcal{N}(0, \sigma_{\mathrm{sg}}^2)$ random variable. This does not mean $X$ is Gaussian; it means its tails are at most as heavy as a Gaussian's. The sub-Gaussian parameter $\sigma_{\mathrm{sg}}^2$ may be larger than $\mathrm{Var}(X)$.
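The MGF-domination condition can be checked directly for a concrete case. A minimal sketch: a Rademacher variable ($\pm 1$ with probability $1/2$ each) has exact MGF $\cosh(t)$, which is dominated by the $\mathcal{N}(0,1)$ MGF $e^{t^2/2}$, so it is $1$-sub-Gaussian even though it is as far from Gaussian as a two-point distribution can be.

```python
import math

# A Rademacher variable (+1 or -1 with prob 1/2) has MGF cosh(t).
# Term-by-term, t^{2k}/(2k)! <= t^{2k}/(2^k k!), so cosh(t) <= exp(t^2/2):
# the Rademacher variable is 1-sub-Gaussian.
def rademacher_mgf(t: float) -> float:
    return math.cosh(t)

def gaussian_mgf(t: float, sigma: float = 1.0) -> float:
    return math.exp(sigma**2 * t**2 / 2)

for t in [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]:
    assert rademacher_mgf(t) <= gaussian_mgf(t)
```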

Definition:

Sub-Gaussian Norm

The sub-Gaussian norm (or $\psi_2$-norm) of a random variable $X$ is
$$\|X\|_{\psi_2} = \inf\!\left\{t > 0 : \mathbb{E}\!\left[\exp\!\left(\frac{X^2}{t^2}\right)\right] \leq 2\right\}.$$
A random variable is sub-Gaussian if and only if $\|X\|_{\psi_2} < \infty$. The sub-Gaussian parameter satisfies $\sigma_{\mathrm{sg}} \leq C\|X\|_{\psi_2}$ for a universal constant $C$.

Theorem: Hoeffding's Lemma

Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \leq X \leq b$ almost surely. Then $X$ is sub-Gaussian with parameter $\sigma_{\mathrm{sg}} = (b-a)/2$:
$$\mathbb{E}[e^{tX}] \leq \exp\!\left(\frac{t^2(b-a)^2}{8}\right) \quad \text{for all } t \in \mathbb{R}.$$

A bounded random variable cannot have heavier tails than a Gaussian: boundedness constrains the MGF to grow at most quadratically in the exponent. The factor of 8 (not 2) reflects that $(b-a)/2$ is a conservative proxy for the standard deviation.
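The lemma can be sanity-checked against a distribution whose MGF is known in closed form. For a uniform$[-1,1]$ variable ($a = -1$, $b = 1$), the exact MGF is $\sinh(t)/t$, and Hoeffding's lemma asserts it is at most $\exp(t^2(b-a)^2/8) = \exp(t^2/2)$:

```python
import math

# Hoeffding's lemma check for uniform[-1, 1]: a = -1, b = 1, so the lemma
# gives E[e^{tX}] <= exp(t^2 (b-a)^2 / 8) = exp(t^2 / 2).
# The exact MGF of uniform[-1, 1] is sinh(t)/t (with value 1 at t = 0).
def uniform_mgf(t: float) -> float:
    return math.sinh(t) / t if t != 0 else 1.0

for t in [-4.0, -1.0, -0.01, 0.0, 0.01, 1.0, 4.0]:
    assert uniform_mgf(t) <= math.exp(t * t / 2)
```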

Theorem: Hoeffding's Inequality

Let $X_1, \ldots, X_n$ be independent (not necessarily identically distributed) random variables with $a_i \leq X_i \leq b_i$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any $t > 0$:
$$\mathbb{P}\!\left(\bar{X}_n - \mathbb{E}[\bar{X}_n] \geq t\right) \leq \exp\!\left(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
For identically bounded variables ($a_i = a$, $b_i = b$ for all $i$):
$$\mathbb{P}\!\left(|\bar{X}_n - \mu| \geq t\right) \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right).$$

The sum of independent sub-Gaussian variables is sub-Gaussian, and the variance proxies add: if each $X_i$ is $\sigma$-sub-Gaussian, then $\sum_i X_i$ is $\sqrt{n}\,\sigma$-sub-Gaussian. Hoeffding's inequality is simply the Chernoff bound applied to the MGF bound from Hoeffding's lemma.

Example: Polling Accuracy via Hoeffding's Inequality

A poll samples $n$ voters, each independently supporting candidate A with unknown probability $p$; the empirical proportion is $\hat{p} = \bar{X}_n$ with $X_i \in \{0, 1\}$. How large must $n$ be to guarantee $\mathbb{P}(|\hat{p} - p| \geq 0.03) \leq 0.05$? Applying the two-sided Hoeffding bound with $b - a = 1$, we need $2\exp(-2n(0.03)^2) \leq 0.05$, i.e. $n \geq \ln(40)/(2 \cdot 0.0009) \approx 2050$: notably more than the $n = 1000$ a typical poll uses.
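The sample-size calculation above generalizes to any accuracy $t$ and failure probability $\delta$: solving $2e^{-2nt^2} \leq \delta$ for $n$ gives $n \geq \ln(2/\delta)/(2t^2)$. A short sketch:

```python
import math

# Required sample size from the two-sided Hoeffding bound for variables in
# [0, 1]:  2 exp(-2 n t^2) <= delta  =>  n >= ln(2/delta) / (2 t^2).
def hoeffding_sample_size(t: float, delta: float) -> int:
    return math.ceil(math.log(2 / delta) / (2 * t**2))

n_req = hoeffding_sample_size(t=0.03, delta=0.05)
print(n_req)  # 2050
```

Note that Hoeffding is distribution-free; a variance-aware bound (e.g. using $p(1-p) \leq 1/4$, or Bernstein when $p$ is far from $1/2$) can reduce the required $n$.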

Comparison of Tail Bounds

| Bound | Requirement | Tail Decay | Tightness |
|---|---|---|---|
| Markov | Non-negative RV | $O(1/t)$ | Very loose |
| Chebyshev | Finite variance | $O(1/t^2)$ | Loose |
| Chernoff (general) | Finite MGF in an interval | $e^{-ct}$ for some $c$ | Tight rate, loose constant |
| Hoeffding | Bounded RVs | $e^{-2nt^2/(b-a)^2}$ | Good for bounded, loose for small variance |
| Sub-Gaussian | Sub-Gaussian | $e^{-t^2/(2\sigma_{\mathrm{sg}}^2)}$ | Tight for Gaussian-like tails |
| Bernstein | Bounded + variance info | $e^{-cnt^2/(\sigma^2 + bt)}$ | Best of Hoeffding and Chebyshev |

Comparing Tail Bounds: Markov, Chebyshev, Hoeffding, Chernoff

Compare the quality of different tail bounds for $\mathbb{P}(\bar{X}_n \geq a)$ where $X_i$ are i.i.d. bounded random variables. See how the Hoeffding and Chernoff bounds dramatically improve on Markov and Chebyshev.

[Interactive demo; default parameters: 50, 0.7]
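The same comparison can be reproduced without the interactive demo. A minimal sketch, assuming the demo's two parameters are the sample size $n = 50$ and the threshold $a = 0.7$, with $X_i$ i.i.d. Bernoulli($1/2$) on $[0,1]$:

```python
import math

# Tail bounds on P(Xbar_n >= a) for X_i i.i.d. Bernoulli(1/2) on [0, 1].
# Assumed parameters: n = 50, a = 0.7, so t = a - mu = 0.2.
n, a_thr, mu, var = 50, 0.7, 0.5, 0.25
t = a_thr - mu

markov = mu / a_thr                   # needs only non-negativity
chebyshev = var / (n * t**2)          # needs finite variance
hoeffding = math.exp(-2 * n * t**2)   # needs boundedness

print(markov, chebyshev, hoeffding)
# Hoeffding is orders of magnitude tighter than Markov/Chebyshev here.
assert hoeffding < chebyshev < markov
```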

Definition:

Sub-Exponential Random Variable

A zero-mean random variable $X$ is sub-exponential with parameters $(\nu^2, \alpha)$ if
$$\mathbb{E}[e^{tX}] \leq \exp\!\left(\frac{\nu^2 t^2}{2}\right) \quad \text{for all } |t| < \frac{1}{\alpha}.$$
Equivalently, $X$ is sub-exponential if $\|X\|_{\psi_1} < \infty$, where
$$\|X\|_{\psi_1} = \inf\!\left\{t > 0 : \mathbb{E}\!\left[\exp\!\left(\frac{|X|}{t}\right)\right] \leq 2\right\}.$$

Sub-exponential is a weaker condition than sub-Gaussian. The key distinction: a sub-Gaussian variable has its MGF bounded by a Gaussian MGF for all $t$, while a sub-exponential variable only needs this for $|t|$ small enough. Products and squares of sub-Gaussian variables are sub-exponential but typically not sub-Gaussian.
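The canonical example is the square of a standard Gaussian, which is $\chi^2_1$-distributed: its MGF $(1-2t)^{-1/2}$ is finite only for $t < 1/2$, so no Gaussian MGF bound can hold for all $t$. A short sketch:

```python
import math

# The square of a standard Gaussian is chi-squared with 1 degree of
# freedom. Its MGF is (1 - 2t)^(-1/2), finite only for t < 1/2:
# sub-exponential behavior, not sub-Gaussian.
def chi2_mgf(t: float) -> float:
    if t >= 0.5:
        return math.inf  # MGF diverges at and beyond t = 1/2
    return (1 - 2 * t) ** -0.5

assert chi2_mgf(0.4) < math.inf   # fine below the threshold
assert chi2_mgf(0.6) == math.inf  # no Gaussian domination for all t
```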

Theorem: Bernstein's Inequality

Let $X_1, \ldots, X_n$ be independent zero-mean random variables with $|X_i| \leq K$ almost surely. Then for any $t > 0$:
$$\mathbb{P}\!\left(\frac{1}{n}\sum_{i=1}^n X_i \geq t\right) \leq \exp\!\left(-\frac{nt^2/2}{\sigma^2 + Kt/3}\right),$$
where $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i^2]$.

Bernstein's inequality interpolates between Hoeffding (for large $t$, where the linear term $Kt/3$ dominates) and a variance-sensitive Gaussian bound (for small $t$, where $\sigma^2$ dominates). When $\sigma^2 \ll K^2$, Bernstein is much tighter than Hoeffding.
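The small-variance advantage is easy to quantify. A sketch with illustrative values ($K = 1$, $\sigma^2 = 0.01$, $n = 1000$, $t = 0.05$): Hoeffding only sees the range $[-1, 1]$, while Bernstein exploits the small variance.

```python
import math

# Hoeffding vs Bernstein for zero-mean X_i with |X_i| <= K = 1 but small
# variance sigma^2 = 0.01 (illustrative values).
n, t, K, sigma2 = 1000, 0.05, 1.0, 0.01

# Hoeffding with b - a = 2K (range [-K, K]).
hoeffding = math.exp(-2 * n * t**2 / (2 * K) ** 2)
# Bernstein uses the actual variance sigma^2.
bernstein = math.exp(-(n * t**2 / 2) / (sigma2 + K * t / 3))

print(hoeffding, bernstein)
# Bernstein is dramatically smaller because sigma^2 << K^2.
assert bernstein < hoeffding
```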

Definition:

Matrix-Valued Random Variables

A random Hermitian matrix $\mathbf{X} \in \mathbb{C}^{d \times d}$ with $\mathbb{E}[\mathbf{X}] = \mathbf{0}$ is called matrix sub-Gaussian if its matrix MGF satisfies
$$\mathbb{E}[e^{t\mathbf{X}}] \preceq \exp\!\left(\frac{t^2 \sigma^2}{2}\mathbf{I}\right)$$
in the positive semidefinite (Löwner) order, for all $t \in \mathbb{R}$.

Theorem: Matrix Bernstein Inequality

Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be independent random Hermitian matrices of dimension $d$ with $\mathbb{E}[\mathbf{X}_k] = \mathbf{0}$ and $\|\mathbf{X}_k\| \leq K$ a.s. (operator norm). Define the matrix variance parameter
$$\sigma^2 = \left\|\sum_{k=1}^n \mathbb{E}[\mathbf{X}_k^2]\right\|.$$
Then for any $t > 0$:
$$\mathbb{P}\!\left(\left\|\sum_{k=1}^n \mathbf{X}_k\right\| \geq t\right) \leq 2d \cdot \exp\!\left(-\frac{t^2/2}{\sigma^2 + Kt/3}\right).$$

This is the matrix analogue of Bernstein's inequality. The extra factor of $2d$ (the dimension) comes from a union bound over eigenvalues, a remarkably small price for extending scalar concentration to matrices. The bound is tight enough to be useful for $d$ up to thousands.
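The theorem can be stress-tested with a Monte Carlo simulation. A minimal sketch under an assumed toy model: $\mathbf{X}_k = \varepsilon_k \mathbf{A}$ with $\varepsilon_k$ Rademacher and $\mathbf{A}$ a fixed symmetric matrix with $\|\mathbf{A}\| \leq K$, so $\mathbb{E}[\mathbf{X}_k] = \mathbf{0}$ and $\sum_k \mathbb{E}[\mathbf{X}_k^2] = n\mathbf{A}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, K = 5, 2000, 1.0

# Toy model: X_k = eps_k * A with eps_k Rademacher, A fixed symmetric,
# ||A|| <= K. Then sum_k E[X_k^2] = n A^2.
A = np.diag(np.linspace(0.2, 1.0, d))
sigma2 = np.linalg.norm(n * (A @ A), 2)   # matrix variance parameter

def matrix_bernstein_bound(t: float) -> float:
    return 2 * d * np.exp(-(t**2 / 2) / (sigma2 + K * t / 3))

# Monte Carlo estimate of P(||sum_k X_k|| >= t) at one threshold.
t, trials, hits = 150.0, 500, 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = eps.sum() * A                     # sum_k eps_k A
    if np.linalg.norm(S, 2) >= t:
        hits += 1

# The empirical exceedance frequency respects the theoretical bound.
assert hits / trials <= matrix_bernstein_bound(t)
```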

⚠️Engineering Note

Matrix Bernstein in Massive MIMO Channel Estimation

In massive MIMO, the sample covariance matrix $\hat{\mathbf{R}} = \frac{1}{n}\sum_{k=1}^n \mathbf{h}_k\mathbf{h}_k^H$ estimates the true spatial covariance $\mathbf{R}$. The matrix Bernstein inequality bounds the operator norm error $\|\hat{\mathbf{R}} - \mathbf{R}\|$: with $n = O(d\log d / \epsilon^2)$ samples, the error is at most $\epsilon\|\mathbf{R}\|$ with high probability. This tells the system designer how many pilot transmissions are needed to learn the channel covariance to a given accuracy, a question that directly impacts the overhead of covariance-aided beamforming in 5G NR massive MIMO.
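The qualitative prediction, that the relative operator-norm error shrinks as $n$ grows, is easy to observe in simulation. A sketch under assumed toy choices: an exponential-correlation model for $\mathbf{R}$ and real Gaussian channel vectors as a proxy for the complex case.

```python
import numpy as np

# Assumed toy model: exponential correlation R[i,j] = 0.9^|i-j|, with
# real Gaussian channel vectors as a stand-in for complex ones.
rng = np.random.default_rng(1)
d = 16
R = np.array([[0.9 ** abs(i - j) for j in range(d)] for i in range(d)])
L = np.linalg.cholesky(R)

def cov_error(n: int) -> float:
    # Relative operator-norm error of the sample covariance from n samples.
    H = L @ rng.standard_normal((d, n))   # columns are h_k ~ N(0, R)
    R_hat = H @ H.T / n
    return np.linalg.norm(R_hat - R, 2) / np.linalg.norm(R, 2)

errs = [cov_error(n) for n in (100, 1000, 10000)]
assert errs[0] > errs[2]   # more samples, smaller relative error
```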


Sub-Gaussian

A random variable whose moment generating function is dominated by that of a Gaussian. Equivalently, its tails decay at least as fast as $e^{-ct^2}$ for some constant $c$. The class includes all bounded random variables, Gaussian variables, and their sums.

Related: Sub-Exponential Random Variable, Hoeffding's Inequality

Sub-Exponential

A random variable whose tails decay at least as fast as $e^{-ct}$ for some constant $c$. Weaker than sub-Gaussian: squares of sub-Gaussian variables are sub-exponential. Examples include $\chi^2$ random variables and products of Gaussians.

Related: Sub-Gaussian, Bernstein's Inequality

Common Mistake: Sub-Gaussian Parameter $\neq$ Variance

Mistake:

Treating the sub-Gaussian parameter $\sigma_{\mathrm{sg}}^2$ as equal to $\mathrm{Var}(X)$.

Correction:

The sub-Gaussian parameter is an upper bound on the variance: $\mathrm{Var}(X) \leq \sigma_{\mathrm{sg}}^2$. For a Bernoulli($1/2$) variable (centered), $\mathrm{Var}(X) = 1/4$ and Hoeffding's lemma gives $\sigma_{\mathrm{sg}}^2 = 1/4$ (tight in this case). For a uniform $[-1,1]$ variable, $\mathrm{Var}(X) = 1/3$ but $\sigma_{\mathrm{sg}}^2 = 1$ (the Hoeffding proxy is loose by a factor of 3). Using variance-sensitive bounds (Bernstein) can recover this gap.
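Both examples above reduce to short arithmetic checks:

```python
# Variance vs Hoeffding variance proxy ((b - a)/2)^2.

# Centered Bernoulli(1/2): range [-1/2, 1/2], Var = 1/4, proxy = 1/4.
var_bern, proxy_bern = 0.25, ((0.5 - (-0.5)) / 2) ** 2
assert var_bern == proxy_bern            # tight case

# Uniform[-1, 1]: Var = (b - a)^2 / 12 = 1/3, proxy = 1.
a, b = -1.0, 1.0
var_unif = (b - a) ** 2 / 12
proxy_unif = ((b - a) / 2) ** 2
assert abs(var_unif - 1 / 3) < 1e-12
assert proxy_unif == 1.0                 # loose by a factor of 3
assert var_unif <= proxy_unif            # proxy upper-bounds the variance
```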

Historical Note: Wassily Hoeffding and the Art of Concentration

1963

Wassily Hoeffding (1914--1991), born in Finland to a Russian family, published his celebrated inequality in 1963 while at the University of North Carolina. His paper "Probability Inequalities for Sums of Bounded Random Variables" is one of the most cited papers in probability and statistics. The beauty of Hoeffding's approach is its simplicity: the bounded-variable MGF lemma plus the Chernoff technique yields a distribution-free concentration inequality that remains the default tool whenever variables are known to be bounded. Hoeffding also made fundamental contributions to U-statistics and nonparametric statistics.

Quick Check

If $X_1, \ldots, X_n$ are independent $\sigma$-sub-Gaussian random variables, what is the sub-Gaussian variance proxy of $S_n = \sum_{i=1}^n X_i$?

$n\sigma^2$ (variance proxies add)

$n^2\sigma^2$

$\sigma^2$ (unchanged)

$\sqrt{n}\,\sigma^2$

Key Takeaway

Sub-Gaussian random variables form a natural class for concentration inequalities: their MGFs are bounded by Gaussian MGFs, their tails decay at least as fast as Gaussian tails, and sums of independent sub-Gaussians are sub-Gaussian with additive parameters. Hoeffding's inequality, Bernstein's inequality, and the matrix Bernstein inequality are the workhorses of modern high-dimensional probability and are essential tools for analyzing wireless communication systems with finite block lengths.