Sufficient Statistics and the Exponential Family

Compressing Data Without Losing Information

A raw observation vector $\mathbf{y} \in \mathbb{R}^n$ may have thousands of components, but the inference task may only need a handful of summaries: a sum, a sum of squares, an inner product with a known pilot. Sufficiency makes this idea precise: a statistic $T(\mathbf{Y})$ is sufficient for $\theta$ if, once $T(\mathbf{Y})$ is known, the remaining variability in $\mathbf{Y}$ says nothing about $\theta$. Conditioning on $T$ is a lossless compression of the data for the purpose of inference about $\theta$ --- this is the rigorous statement behind every matched filter, every correlator, every frequency-bin summary in a receiver.

Definition: Sufficient Statistic

A statistic $T : \mathcal{Y} \to \mathbb{R}^d$ is sufficient for $\theta$ in the family $\{f_\theta : \theta \in \Lambda\}$ if the conditional distribution of $\mathbf{Y}$ given $T(\mathbf{Y})$ does not depend on $\theta$:
$$f_\theta(\mathbf{y} \mid T(\mathbf{y}) = t) \;=\; h(\mathbf{y} \mid t) \qquad \forall\, \theta \in \Lambda.$$
Equivalently, the parameter $\theta$ is conditionally independent of $\mathbf{Y}$ given $T(\mathbf{Y})$.

Sufficiency is a property of the statistic, not of an estimator. Every one-to-one transform of a sufficient statistic is sufficient. The trivial statistic $T(\mathbf{y}) = \mathbf{y}$ is always sufficient --- the interesting question is how far we can compress while preserving sufficiency.

Theorem: Fisher--Neyman Factorization Theorem

A statistic $T(\mathbf{Y})$ is sufficient for $\theta$ in the family $\{f_\theta : \theta \in \Lambda\}$ if and only if the density admits the factorization
$$f_\theta(\mathbf{y}) \;=\; g_\theta\!\bigl(T(\mathbf{y})\bigr) \cdot h(\mathbf{y}) \qquad \forall\, \theta \in \Lambda,$$
for some measurable $g_\theta \geq 0$ and $h \geq 0$ with $h$ independent of $\theta$.

Read the theorem as a certificate of sufficiency: if the likelihood's $\theta$-dependence enters only through $T(\mathbf{y})$, then $T$ is sufficient. In practice we never compute the conditional distribution --- we stare at the density, group every occurrence of $\theta$, and read off $T$ directly.


Example: Factorization for the Gaussian Location--Scale Family

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown, $\boldsymbol{\theta} = (\mu, \sigma^2)^T$. Identify a sufficient statistic.
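Before factoring symbolically, the claim can be sanity-checked numerically: the log-likelihood computed from the raw data should match one computed from $T = (\sum_i y_i, \sum_i y_i^2)$ alone. A minimal NumPy sketch (the sample size and parameter values are arbitrary illustrations, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik(y, mu, sigma2):
    """Exact i.i.d. Gaussian log-likelihood from the raw data."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

def log_lik_from_T(T1, T2, n, mu, sigma2):
    """Same log-likelihood computed only from T = (sum y_i, sum y_i^2),
    using sum (y_i - mu)^2 = T2 - 2*mu*T1 + n*mu^2."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (T2 - 2 * mu * T1 + n * mu**2) / (2 * sigma2)

y = rng.normal(1.5, 2.0, size=200)
T1, T2 = y.sum(), (y ** 2).sum()

# The two computations agree for every (mu, sigma2): the likelihood
# sees the data only through T.
for mu, s2 in [(0.0, 1.0), (1.5, 4.0), (-2.0, 0.5)]:
    assert np.isclose(log_lik(y, mu, s2), log_lik_from_T(T1, T2, len(y), mu, s2))
print("likelihood depends on y only through (sum y, sum y^2)")
```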

Example: Factorization for Signal Amplitude in AWGN

Observe $\mathbf{Y} = A \mathbf{s} + \mathbf{Z}$ with known $\mathbf{s}$ and $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \sigma^2\, \mathbf{I}_n)$. Find a sufficient statistic for the scalar amplitude $A$.
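A hedged numerical check of the intended factorization: the $A$-dependent part of the log-likelihood should be unchanged by any perturbation of $\mathbf{y}$ orthogonal to $\mathbf{s}$, i.e. it depends on $\mathbf{y}$ only through $\mathbf{s}^T\mathbf{y}$. A NumPy sketch with arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 6, 0.7
s = rng.normal(size=n)

def log_lik(y, A):
    """theta-dependent part of the AWGN log-likelihood (constants dropped)."""
    return -np.sum((y - A * s) ** 2) / (2 * sigma2)

y1 = rng.normal(size=n)
# Perturb y1 in a direction orthogonal to s: s^T y is unchanged.
v = rng.normal(size=n)
v -= (v @ s) / (s @ s) * s
y2 = y1 + v

assert np.isclose(s @ y1, s @ y2)
# The A-dependent part of the log-likelihood ratio matches for y1 and y2,
# so the dependence on A enters only through s^T y.
for A in [0.5, -1.0, 2.3]:
    assert np.isclose(log_lik(y1, A) - log_lik(y1, 0.0),
                      log_lik(y2, A) - log_lik(y2, 0.0))
print("A-dependence enters only through s^T y")
```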

Definition: Minimal Sufficient Statistic

A sufficient statistic $T(\mathbf{Y})$ is minimal if, for every other sufficient statistic $T'(\mathbf{Y})$, there exists a function $\varphi$ such that $T(\mathbf{y}) = \varphi(T'(\mathbf{y}))$ almost surely. Equivalently, $T$ induces the coarsest partition of $\mathcal{Y}$ among sufficient statistics.

The Lehmann--Scheffé criterion for minimality: $T$ is minimal sufficient iff $T(\mathbf{y}) = T(\mathbf{y}')$ exactly when the likelihood ratio $f_\theta(\mathbf{y}) / f_\theta(\mathbf{y}')$ does not depend on $\theta$. For the Gaussian example above, $(\sum Y_i, \sum Y_i^2)$ is minimal sufficient when both $\mu$ and $\sigma^2$ are unknown.
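The criterion can be probed numerically: two samples sharing $(\sum_i y_i, \sum_i y_i^2)$ should give a likelihood ratio that is constant in $(\mu, \sigma^2)$, while samples with different statistics should not. A NumPy sketch (the sampler and the target values $S$, $Q$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
S, Q = 3.0, 12.0  # target sum and sum of squares (must satisfy Q > S^2/n)

def sample_with_stats(S, Q, n):
    """Random y with sum(y) = S and sum(y^2) = Q."""
    m = S / n
    u = rng.normal(size=n)
    u -= u.mean()                                  # force sum(u) = 0
    u *= np.sqrt(Q - n * m * m) / np.linalg.norm(u)  # force sum(y^2) = Q
    return m + u

def log_lik(y, mu, s2):
    return -0.5 * n * np.log(2 * np.pi * s2) - np.sum((y - mu) ** 2) / (2 * s2)

ya, yb = sample_with_stats(S, Q, n), sample_with_stats(S, Q, n)
yc = rng.normal(size=n)  # a sample with a different sufficient statistic

thetas = [(0.0, 1.0), (1.0, 2.0), (-0.5, 0.3)]
ratios_ab = [log_lik(ya, m, v) - log_lik(yb, m, v) for m, v in thetas]
ratios_ac = [log_lik(ya, m, v) - log_lik(yc, m, v) for m, v in thetas]

# Equal statistics => the log-likelihood ratio is theta-free ...
assert np.allclose(ratios_ab, ratios_ab[0])
# ... different statistics => the ratio varies with theta.
assert not np.allclose(ratios_ac, ratios_ac[0])
print("ratio is theta-free exactly when the statistics match")
```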

Definition: Exponential Family

A parametric family $\{f_\theta : \theta \in \Lambda\}$ is an exponential family in canonical form if there exist measurable functions $h : \mathcal{Y} \to \mathbb{R}_{\geq 0}$, $\eta : \Lambda \to \mathbb{R}^k$, $T : \mathcal{Y} \to \mathbb{R}^k$, and $A : \Lambda \to \mathbb{R}$ such that
$$f_\theta(\mathbf{y}) \;=\; h(\mathbf{y}) \exp\!\bigl(\eta(\theta)^T T(\mathbf{y}) - A(\theta)\bigr).$$
The vector $\eta(\theta)$ is the natural parameter, $T(\mathbf{y})$ the natural sufficient statistic, and $A(\theta) = \log \int h(\mathbf{y}) \exp(\eta(\theta)^T T(\mathbf{y}))\, d\mathbf{y}$ the log-partition (or cumulant) function.

By the Fisher--Neyman factorization, $T(\mathbf{Y})$ is automatically sufficient for $\theta$. Members: Bernoulli, binomial, Poisson, Gaussian (with known or unknown variance), exponential, gamma, beta, Dirichlet, multinomial --- most families you will meet in practice.
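As a concrete instance (a sketch with an arbitrary rate, not from the text): the Poisson pmf $\lambda^y e^{-\lambda}/y!$ fits the canonical form with $h(y) = 1/y!$, $\eta = \log\lambda$, $T(y) = y$, and $A = \lambda$, which can be verified numerically:

```python
import math

lam = 3.7  # an arbitrary Poisson rate for illustration

def poisson_pmf(y, lam):
    """Textbook Poisson pmf: lam^y exp(-lam) / y!."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def canonical_form(y, lam):
    """Same pmf as h(y) exp(eta * T(y) - A)."""
    h = 1.0 / math.factorial(y)        # base measure
    eta, T, A = math.log(lam), y, lam  # natural parameter, statistic, log-partition
    return h * math.exp(eta * T - A)

for y in range(10):
    assert math.isclose(poisson_pmf(y, lam), canonical_form(y, lam))
print("Poisson pmf matches h(y) exp(eta T(y) - A)")
```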

Theorem: Complete Sufficient Statistic in the Exponential Family

Consider an exponential family $f_\theta(\mathbf{y}) = h(\mathbf{y}) \exp(\eta(\theta)^T T(\mathbf{y}) - A(\theta))$ with $T(\mathbf{y}) \in \mathbb{R}^k$. If the natural-parameter image $\{\eta(\theta) : \theta \in \Lambda\}$ contains a $k$-dimensional open rectangle, then $T(\mathbf{Y})$ is a complete sufficient statistic: any function $\psi$ with $\mathbb{E}_\theta[\psi(T(\mathbf{Y}))] = 0$ for all $\theta \in \Lambda$ satisfies $\psi \equiv 0$ almost surely.

Up to the base factor $h$, the normalizer $e^{A(\eta)}$ is a moment generating function of $T$, evaluated at the natural parameter $\eta$. If a function of $T$ integrates to zero against every density in the family, it integrates to zero against an open set of exponential tilts --- which forces it to be zero by analytic continuation. Completeness is what upgrades a sufficient statistic into a tool for producing unique unbiased estimators (Lehmann--Scheffé).
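To make the vanishing argument concrete, here is a sketch for the simplest member (a worked case not in the original text): $n$ i.i.d. Bernoulli($p$) samples with $T = \sum_i Y_i \sim \mathrm{Binomial}(n, p)$.

```latex
% Completeness of T = sum_i Y_i for Bernoulli(p), 0 < p < 1.
\mathbb{E}_p[\psi(T)]
  = \sum_{t=0}^{n} \psi(t) \binom{n}{t} p^t (1-p)^{n-t}
  = (1-p)^n \sum_{t=0}^{n} \psi(t) \binom{n}{t} \rho^t,
  \qquad \rho = \frac{p}{1-p}.
% If this vanishes for every p in (0,1), the polynomial in rho vanishes on
% (0, infinity), so each coefficient psi(t) binom(n,t) is zero: psi == 0.
```

Here no analytic continuation is needed: a polynomial that vanishes on an open interval is identically zero, coefficient by coefficient.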


Example: Gaussian as an Exponential Family

Write the i.i.d. Gaussian $\mathcal{N}(\mu, \sigma^2)$ model for $n$ samples in canonical exponential form and identify the natural sufficient statistic.
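One way to carry out the computation (a solution sketch; the constants can be bookkept differently):

```latex
f_{\boldsymbol{\theta}}(\mathbf{y})
  = (2\pi\sigma^2)^{-n/2}
    \exp\!\Bigl(-\tfrac{1}{2\sigma^2}\textstyle\sum_{i=1}^{n}(y_i-\mu)^2\Bigr)
  = h(\mathbf{y})
    \exp\!\bigl(\eta(\boldsymbol{\theta})^T T(\mathbf{y})
                - A(\boldsymbol{\theta})\bigr),
% with
h(\mathbf{y}) = (2\pi)^{-n/2}, \qquad
\eta(\boldsymbol{\theta})
  = \Bigl(\tfrac{\mu}{\sigma^2},\; -\tfrac{1}{2\sigma^2}\Bigr)^T, \qquad
T(\mathbf{y})
  = \Bigl(\textstyle\sum_i y_i,\; \textstyle\sum_i y_i^2\Bigr)^T, \qquad
A(\boldsymbol{\theta})
  = \frac{n\mu^2}{2\sigma^2} + \frac{n}{2}\log\sigma^2.
```

Expanding the square $\sum_i (y_i - \mu)^2 = \sum_i y_i^2 - 2\mu \sum_i y_i + n\mu^2$ and collecting terms recovers the first line, so the natural sufficient statistic is the pair $(\sum_i y_i, \sum_i y_i^2)$, consistent with the factorization example above.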

Likelihood as a Function of $(T, \theta)$: Sufficient Statistic as Dimensionality Reduction

Draw $n$ i.i.d. samples from $\mathcal{N}(\mu, 1)$. Plot the log-likelihood $\ell_\mu(\mathbf{y})$ as a function of $\mu$ for several random realizations of $\mathbf{y}$. Observe that the shape of each curve in $\mu$ is determined entirely by $\sum_i y_i$: realizations sharing that sum produce curves that differ only by a vertical shift. The sufficient statistic is all the likelihood sees.

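The vertical-shift behavior can be verified without plotting: force two realizations to share $\sum_i y_i$ and check that their log-likelihood curves differ by a constant. A NumPy sketch (sample size and mean are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30

def log_lik(y, mu):
    """Log-likelihood of n i.i.d. N(mu, 1) samples."""
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - mu) ** 2)

y1 = rng.normal(0.8, 1.0, size=n)
y2 = rng.normal(0.8, 1.0, size=n)
y2 += (y1.sum() - y2.sum()) / n   # force sum(y2) = sum(y1)

mus = np.linspace(-2.0, 3.0, 50)
diff = [log_lik(y1, m) - log_lik(y2, m) for m in mus]

# Equal sums => the two curves differ by a constant vertical offset
# (coming from sum y_i^2), not in shape.
assert np.allclose(diff, diff[0])
print("curves with equal sum(y) differ only by a vertical shift")
```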

Common Mistake: Sufficient vs. Minimal Sufficient

Mistake:

Treating the full observation $\mathbf{Y}$ as "a sufficient statistic" and concluding nothing interesting has been gained.

Correction:

Sufficiency holds vacuously for the identity statistic. The useful question is minimal sufficiency: how far can we compress before losing information? In the exponential family, the natural sufficient statistic $T(\mathbf{y})$ typically has dimension $k \ll n$ and is minimal. That dimension gap is the compression ratio.

Fisher--Neyman Factorization for the Gaussian Family

A step-by-step visualization of the factorization $f_{\mu,\sigma^2}(\mathbf{y}) = g_{\mu,\sigma^2}(T(\mathbf{y}))\, h(\mathbf{y})$ with $T = (\sum_i y_i, \sum_i y_i^2)$.
The $\theta$-dependence concentrates into $T(\mathbf{y})$; what is left is the $\theta$-independent factor $h(\mathbf{y})$.

Sufficient Statistic

A function $T(\mathbf{Y})$ such that the conditional distribution of $\mathbf{Y}$ given $T(\mathbf{Y})$ is free of $\theta$. Equivalently, it is the only feature of the data the likelihood ever sees.

Related: Minimal Sufficient Statistic, Exponential Family, Complete Statistic

Exponential Family

A family of distributions of the form $h(\mathbf{y}) \exp(\eta(\theta)^T T(\mathbf{y}) - A(\theta))$. Contains most workhorse distributions of practice and enjoys automatic sufficiency and (under mild conditions) completeness.

Related: Sufficient Statistic, Complete Statistic, Conjugate Prior

Complete Statistic

A statistic $T$ whose only unbiased estimator of zero is zero itself: $\mathbb{E}_\theta[\psi(T(\mathbf{Y}))] = 0$ for all $\theta$ implies $\psi \equiv 0$ a.s. Completeness is the key ingredient in uniqueness of the MVUE.

Related: Sufficient Statistic, Exponential Family, A Procedure for Building the MVUE

Quick Check

For $\mathbf{Y} = A\mathbf{s} + \mathbf{Z}$ with $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ and $A \in \mathbb{R}$ unknown, which of the following is a sufficient statistic for $A$?

$\|\mathbf{y}\|$

$\mathbf{s}^T\mathbf{y}$

$\mathbf{y}$ itself

$\mathbf{s}^T\mathbf{s}$