Sufficient Statistics and the Exponential Family

Compressing Data Without Losing Information

A raw observation vector $\mathbf{y} \in \mathbb{R}^n$ may have thousands of components, but the inference task may only need a handful of summaries: a sum, a sum of squares, an inner product with a known pilot. Sufficiency makes this idea precise: a statistic $T(\mathbf{Y})$ is sufficient for $\theta$ if, once $T(\mathbf{Y})$ is known, the remaining variability in $\mathbf{Y}$ says nothing about $\theta$. Conditioning on $T$ is a lossless compression of the data for the purpose of inference about $\theta$ --- this is the rigorous statement behind every matched filter, every correlator, every frequency-bin summary in a receiver.

Definition: Sufficient Statistic

A statistic $T : \mathcal{Y} \to \mathbb{R}^d$ is sufficient for $\theta$ in the family $\{f_\theta : \theta \in \Lambda\}$ if the conditional distribution of $\mathbf{Y}$ given $T(\mathbf{Y})$ does not depend on $\theta$:
$$f_\theta(\mathbf{y} \mid T(\mathbf{y}) = t) \;=\; h(\mathbf{y} \mid t) \qquad \forall\, \theta \in \Lambda.$$
Equivalently, the parameter $\theta$ is conditionally independent of $\mathbf{Y}$ given $T(\mathbf{Y})$.

Sufficiency is a property of the statistic, not of an estimator. Every one-to-one transform of a sufficient statistic is sufficient. The trivial statistic $T(\mathbf{y}) = \mathbf{y}$ is always sufficient --- the interesting question is how far we can compress while preserving sufficiency.

Theorem: Fisher--Neyman Factorization Theorem

A statistic $T(\mathbf{Y})$ is sufficient for $\theta$ in the family $\{f_\theta : \theta \in \Lambda\}$ if and only if the density admits the factorization
$$f_\theta(\mathbf{y}) \;=\; g_\theta\!\bigl(T(\mathbf{y})\bigr) \cdot h(\mathbf{y}) \qquad \forall\, \theta \in \Lambda,$$
for some measurable $g_\theta \geq 0$ and $h \geq 0$ with $h$ independent of $\theta$.

Read the theorem as a certificate of sufficiency: if the likelihood's $\theta$-dependence enters only through $T(\mathbf{y})$, then $T$ is sufficient. In practice we never compute the conditional distribution --- we stare at the density, group every occurrence of $\theta$, and read off $T$ directly.


Example: Factorization for the Gaussian Location--Scale Family

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown, $\boldsymbol{\theta} = (\mu, \sigma^2)^T$. Identify a sufficient statistic.
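Before factoring symbolically, the claim can be sanity-checked numerically: the log-likelihood computed from the raw data should match one computed from $T = (\sum_i y_i, \sum_i y_i^2)$ alone. A minimal NumPy sketch (the sample size and parameter values are arbitrary illustrations, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik(y, mu, sigma2):
    """Exact i.i.d. Gaussian log-likelihood from the raw data."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

def log_lik_from_T(T1, T2, n, mu, sigma2):
    """Same log-likelihood computed only from T = (sum y_i, sum y_i^2),
    using sum (y_i - mu)^2 = T2 - 2*mu*T1 + n*mu^2."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (T2 - 2 * mu * T1 + n * mu**2) / (2 * sigma2)

y = rng.normal(1.5, 2.0, size=200)
T1, T2 = y.sum(), (y ** 2).sum()

# The two computations agree for every (mu, sigma2): the likelihood
# sees the data only through T.
for mu, s2 in [(0.0, 1.0), (1.5, 4.0), (-2.0, 0.5)]:
    assert np.isclose(log_lik(y, mu, s2), log_lik_from_T(T1, T2, len(y), mu, s2))
print("likelihood depends on y only through (sum y, sum y^2)")
```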

Example: Factorization for Signal Amplitude in AWGN

Observe $\mathbf{Y} = A \mathbf{s} + \mathbf{Z}$ with known $\mathbf{s}$ and $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \sigma^2\, \mathbf{I}_n)$. Find a sufficient statistic for the scalar amplitude $A$.
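A hedged numerical check of the intended factorization: the $A$-dependent part of the log-likelihood should be unchanged by any perturbation of $\mathbf{y}$ orthogonal to $\mathbf{s}$, i.e. it depends on $\mathbf{y}$ only through $\mathbf{s}^T\mathbf{y}$. A NumPy sketch with arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 6, 0.7
s = rng.normal(size=n)

def log_lik(y, A):
    """theta-dependent part of the AWGN log-likelihood (constants dropped)."""
    return -np.sum((y - A * s) ** 2) / (2 * sigma2)

y1 = rng.normal(size=n)
# Perturb y1 in a direction orthogonal to s: s^T y is unchanged.
v = rng.normal(size=n)
v -= (v @ s) / (s @ s) * s
y2 = y1 + v

assert np.isclose(s @ y1, s @ y2)
# The A-dependent part of the log-likelihood ratio matches for y1 and y2,
# so the dependence on A enters only through s^T y.
for A in [0.5, -1.0, 2.3]:
    assert np.isclose(log_lik(y1, A) - log_lik(y1, 0.0),
                      log_lik(y2, A) - log_lik(y2, 0.0))
print("A-dependence enters only through s^T y")
```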

Definition: Minimal Sufficient Statistic

A sufficient statistic $T(\mathbf{Y})$ is minimal if, for every other sufficient statistic $T'(\mathbf{Y})$, there exists a function $\varphi$ such that $T(\mathbf{y}) = \varphi(T'(\mathbf{y}))$ almost surely. Equivalently, $T$ induces the coarsest partition of $\mathcal{Y}$ among sufficient statistics.

The Lehmann--Scheffé criterion for minimality: $T$ is minimal sufficient iff $T(\mathbf{y}) = T(\mathbf{y}')$ exactly when the likelihood ratio $f_\theta(\mathbf{y}) / f_\theta(\mathbf{y}')$ does not depend on $\theta$. For the Gaussian example above, $(\sum Y_i, \sum Y_i^2)$ is minimal sufficient when both $\mu$ and $\sigma^2$ are unknown.
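The criterion can be probed numerically: two samples sharing $(\sum_i y_i, \sum_i y_i^2)$ should give a likelihood ratio that is constant in $(\mu, \sigma^2)$, while samples with different statistics should not. A NumPy sketch (the sampler and the target values $S$, $Q$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
S, Q = 3.0, 12.0  # target sum and sum of squares (must satisfy Q > S^2/n)

def sample_with_stats(S, Q, n):
    """Random y with sum(y) = S and sum(y^2) = Q."""
    m = S / n
    u = rng.normal(size=n)
    u -= u.mean()                                  # force sum(u) = 0
    u *= np.sqrt(Q - n * m * m) / np.linalg.norm(u)  # force sum(y^2) = Q
    return m + u

def log_lik(y, mu, s2):
    return -0.5 * n * np.log(2 * np.pi * s2) - np.sum((y - mu) ** 2) / (2 * s2)

ya, yb = sample_with_stats(S, Q, n), sample_with_stats(S, Q, n)
yc = rng.normal(size=n)  # a sample with a different sufficient statistic

thetas = [(0.0, 1.0), (1.0, 2.0), (-0.5, 0.3)]
ratios_ab = [log_lik(ya, m, v) - log_lik(yb, m, v) for m, v in thetas]
ratios_ac = [log_lik(ya, m, v) - log_lik(yc, m, v) for m, v in thetas]

# Equal statistics => the log-likelihood ratio is theta-free ...
assert np.allclose(ratios_ab, ratios_ab[0])
# ... different statistics => the ratio varies with theta.
assert not np.allclose(ratios_ac, ratios_ac[0])
print("ratio is theta-free exactly when the statistics match")
```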

Definition: Exponential Family

A parametric family $\{f_\theta : \theta \in \Lambda\}$ is an exponential family in canonical form if there exist measurable functions $h : \mathcal{Y} \to \mathbb{R}_{\geq 0}$, $\eta : \Lambda \to \mathbb{R}^k$, $T : \mathcal{Y} \to \mathbb{R}^k$, and $A : \Lambda \to \mathbb{R}$ such that
$$f_\theta(\mathbf{y}) \;=\; h(\mathbf{y}) \exp\!\bigl(\eta(\theta)^T T(\mathbf{y}) - A(\theta)\bigr).$$
The vector $\eta(\theta)$ is the natural parameter, $T(\mathbf{y})$ the natural sufficient statistic, and $A(\theta) = \log \int h(\mathbf{y}) \exp(\eta(\theta)^T T(\mathbf{y}))\, d\mathbf{y}$ the log-partition (or cumulant) function.

By the Fisher--Neyman factorization, $T(\mathbf{Y})$ is automatically sufficient for $\theta$. Members: Bernoulli, binomial, Poisson, Gaussian (with known or unknown variance), exponential, gamma, beta, Dirichlet, multinomial --- most families you will meet in practice.
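As a concrete instance (a sketch with an arbitrary rate, not from the text): the Poisson pmf $\lambda^y e^{-\lambda}/y!$ fits the canonical form with $h(y) = 1/y!$, $\eta = \log\lambda$, $T(y) = y$, and $A = \lambda$, which can be verified numerically:

```python
import math

lam = 3.7  # an arbitrary Poisson rate for illustration

def poisson_pmf(y, lam):
    """Textbook Poisson pmf: lam^y exp(-lam) / y!."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def canonical_form(y, lam):
    """Same pmf as h(y) exp(eta * T(y) - A)."""
    h = 1.0 / math.factorial(y)        # base measure
    eta, T, A = math.log(lam), y, lam  # natural parameter, statistic, log-partition
    return h * math.exp(eta * T - A)

for y in range(10):
    assert math.isclose(poisson_pmf(y, lam), canonical_form(y, lam))
print("Poisson pmf matches h(y) exp(eta T(y) - A)")
```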

Theorem: Complete Sufficient Statistic in the Exponential Family

Consider an exponential family $f_\theta(\mathbf{y}) = h(\mathbf{y}) \exp(\eta(\theta)^T T(\mathbf{y}) - A(\theta))$ with $T(\mathbf{y}) \in \mathbb{R}^k$. If the natural-parameter image $\{\eta(\theta) : \theta \in \Lambda\}$ contains a $k$-dimensional open rectangle, then $T(\mathbf{Y})$ is a complete sufficient statistic: any function $\psi$ with $\mathbb{E}_\theta[\psi(T(\mathbf{Y}))] = 0$ for all $\theta \in \Lambda$ satisfies $\psi \equiv 0$ almost surely.

Up to the base factor $h$, the normalizer $e^{A(\eta)}$ is a moment generating function of $T$, evaluated at the natural parameter $\eta$. If a function of $T$ integrates to zero against every density in the family, it integrates to zero against an open set of exponential tilts --- which forces it to be zero by analytic continuation. Completeness is what upgrades a sufficient statistic into a tool for producing unique unbiased estimators (Lehmann--Scheffé).
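To make the vanishing argument concrete, here is a sketch for the simplest member (a worked case not in the original text): $n$ i.i.d. Bernoulli($p$) samples with $T = \sum_i Y_i \sim \mathrm{Binomial}(n, p)$.

```latex
% Completeness of T = sum_i Y_i for Bernoulli(p), 0 < p < 1.
\mathbb{E}_p[\psi(T)]
  = \sum_{t=0}^{n} \psi(t) \binom{n}{t} p^t (1-p)^{n-t}
  = (1-p)^n \sum_{t=0}^{n} \psi(t) \binom{n}{t} \rho^t,
  \qquad \rho = \frac{p}{1-p}.
% If this vanishes for every p in (0,1), the polynomial in rho vanishes on
% (0, infinity), so each coefficient psi(t) binom(n,t) is zero: psi == 0.
```

Here no analytic continuation is needed: a polynomial that vanishes on an open interval is identically zero, coefficient by coefficient.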


Example: Gaussian as an Exponential Family

Write the i.i.d. Gaussian $\mathcal{N}(\mu, \sigma^2)$ model for $n$ samples in canonical exponential form and identify the natural sufficient statistic.
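One way to carry out the computation (a solution sketch; the constants can be bookkept differently):

```latex
f_{\boldsymbol{\theta}}(\mathbf{y})
  = (2\pi\sigma^2)^{-n/2}
    \exp\!\Bigl(-\tfrac{1}{2\sigma^2}\textstyle\sum_{i=1}^{n}(y_i-\mu)^2\Bigr)
  = h(\mathbf{y})
    \exp\!\bigl(\eta(\boldsymbol{\theta})^T T(\mathbf{y})
                - A(\boldsymbol{\theta})\bigr),
% with
h(\mathbf{y}) = (2\pi)^{-n/2}, \qquad
\eta(\boldsymbol{\theta})
  = \Bigl(\tfrac{\mu}{\sigma^2},\; -\tfrac{1}{2\sigma^2}\Bigr)^T, \qquad
T(\mathbf{y})
  = \Bigl(\textstyle\sum_i y_i,\; \textstyle\sum_i y_i^2\Bigr)^T, \qquad
A(\boldsymbol{\theta})
  = \frac{n\mu^2}{2\sigma^2} + \frac{n}{2}\log\sigma^2.
```

Expanding the square $\sum_i (y_i - \mu)^2 = \sum_i y_i^2 - 2\mu \sum_i y_i + n\mu^2$ and collecting terms recovers the first line, so the natural sufficient statistic is the pair $(\sum_i y_i, \sum_i y_i^2)$, consistent with the factorization example above.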

Likelihood as a Function of $(T, \theta)$: Sufficient Statistic as Dimensionality Reduction

Draw $n$ i.i.d. samples from $\mathcal{N}(\mu, 1)$. Plot the log-likelihood $\ell_\mu(\mathbf{y})$ as a function of $\mu$ for several random realizations of $\mathbf{y}$. Observe that the shape of each curve in $\mu$ is determined entirely by $\sum_i y_i$: realizations sharing that sum produce curves that differ only by a vertical shift. The sufficient statistic is all the likelihood sees.

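The vertical-shift behavior can be verified without plotting: force two realizations to share $\sum_i y_i$ and check that their log-likelihood curves differ by a constant. A NumPy sketch (sample size and mean are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30

def log_lik(y, mu):
    """Log-likelihood of n i.i.d. N(mu, 1) samples."""
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - mu) ** 2)

y1 = rng.normal(0.8, 1.0, size=n)
y2 = rng.normal(0.8, 1.0, size=n)
y2 += (y1.sum() - y2.sum()) / n   # force sum(y2) = sum(y1)

mus = np.linspace(-2.0, 3.0, 50)
diff = [log_lik(y1, m) - log_lik(y2, m) for m in mus]

# Equal sums => the two curves differ by a constant vertical offset
# (coming from sum y_i^2), not in shape.
assert np.allclose(diff, diff[0])
print("curves with equal sum(y) differ only by a vertical shift")
```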

Common Mistake: Sufficient vs. Minimal Sufficient

Mistake:

Treating the full observation $\mathbf{Y}$ as "a sufficient statistic" and concluding nothing interesting has been gained.

Correction:

Sufficiency holds vacuously for the identity statistic. The useful question is minimal sufficiency: how far can we compress before losing information? In the exponential family, the natural sufficient statistic $T(\mathbf{y})$ typically has dimension $k \ll n$ and is minimal. That dimension gap is the compression ratio.

Fisher--Neyman Factorization for the Gaussian Family

A step-by-step visualization of the factorization $f_{\mu,\sigma^2}(\mathbf{y}) = g_{\mu,\sigma^2}(T(\mathbf{y}))\, h(\mathbf{y})$ with $T = (\sum_i y_i, \sum_i y_i^2)$.
The $\theta$-dependence concentrates into $T(\mathbf{y})$; what is left is the $\theta$-independent factor $h(\mathbf{y})$.

Sufficient Statistic

A function $T(\mathbf{Y})$ such that the conditional distribution of $\mathbf{Y}$ given $T(\mathbf{Y})$ is free of $\theta$. Equivalently, it is the only feature of the data the likelihood ever sees.

Related: Minimal Sufficient Statistic, Exponential Family, Complete Statistic

Exponential Family

A family of distributions of the form $h(\mathbf{y}) \exp(\eta(\theta)^T T(\mathbf{y}) - A(\theta))$. Contains most workhorse distributions of practice and enjoys automatic sufficiency and (under mild conditions) completeness.

Related: Sufficient Statistic, Complete Statistic, Conjugate Prior

Complete Statistic

A statistic $T$ whose only unbiased estimator of zero is zero itself: $\mathbb{E}_\theta[\psi(T(\mathbf{Y}))] = 0$ for all $\theta$ implies $\psi \equiv 0$ a.s. Completeness is the key ingredient in uniqueness of the MVUE.

Related: Sufficient Statistic, Exponential Family, A Procedure for Building the MVUE

Quick Check

For $\mathbf{Y} = A\mathbf{s} + \mathbf{Z}$ with $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ and $A \in \mathbb{R}$ unknown, which of the following is a sufficient statistic for $A$?

$\|\mathbf{y}\|$

$\mathbf{s}^T\mathbf{y}$

$\mathbf{y}$ itself

$\mathbf{s}^T\mathbf{s}$