Properties of the MLE

Why Do We Trust the MLE?

The MLE is just one function of the data among infinitely many. Why has it come to dominate engineering practice? The answer is a package of four large-sample guarantees: consistency (the estimator concentrates on the truth), asymptotic normality (its deviation is $O(1/\sqrt{n})$ and Gaussian), asymptotic efficiency (its asymptotic variance matches the Cramér-Rao lower bound), and invariance (the MLE of any smooth function of the parameter is that function applied to the MLE). In this section we state and prove these four properties.

Definition: Regularity Conditions

The i.i.d. model $\{f_\theta : \theta \in \Lambda\}$ is called regular at $\theta_0$ if the following hold:

  1. $\Lambda \subseteq \mathbb{R}^m$ is open and $\theta_0$ is an interior point.
  2. The support $\{y : f_\theta(y) > 0\}$ is the same for all $\theta$.
  3. $\log f_\theta(y)$ is twice continuously differentiable in $\theta$.
  4. Differentiation under the integral sign is permitted, so $\mathbb{E}_\theta[s(\theta; Y)] = 0$ and $J(\theta) = -\mathbb{E}_\theta\bigl[\partial^2 \log f_\theta(Y)/\partial\theta^2\bigr]$.
  5. $J(\theta_0)$ is finite and strictly positive (non-singular in the vector case).
  6. The parameter is identifiable: $\theta \neq \theta_0 \Rightarrow f_\theta \neq f_{\theta_0}$ on a set of positive measure.

These conditions are inherited from Chapter 5 and are what the proofs in this section require. They fail for support-dependent models (uniform on $[0,\theta]$) and for non-identifiable models (over-parameterized mixtures, label switching).
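As a quick sanity check (not worked in the text), the one-parameter exponential model $f_\theta(y) = \theta e^{-\theta y}$, $y > 0$, satisfies them: the support $(0, \infty)$ does not depend on $\theta$, and

$$\log f_\theta(y) = \log\theta - \theta y, \qquad s(\theta; y) = \frac{1}{\theta} - y, \qquad \mathbb{E}_\theta[s(\theta; Y)] = \frac{1}{\theta} - \frac{1}{\theta} = 0, \qquad J_1(\theta) = \frac{1}{\theta^2} > 0.$$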

Theorem: Consistency of the MLE

Let $Y_1, Y_2, \ldots$ be i.i.d. from $f_{\theta_0}$ in a regular model with compact $\Lambda$. Let $\hat\theta_n = g_{\text{ml}}(\mathbf{Y})$ be the MLE from the first $n$ observations. Then

$$\hat\theta_n \;\xrightarrow{\;p\;}\; \theta_0 \qquad \text{as } n \to \infty.$$

The per-sample log-likelihood is, by the WLLN, close to its expectation $\mathbb{E}_{\theta_0}[\log f_\theta(Y)]$. That expectation is maximized at $\theta = \theta_0$ by Gibbs' inequality (the KL-divergence argument). Hence the maximizer of the empirical log-likelihood converges to the maximizer of its population counterpart.
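A minimal Monte Carlo sketch of this concentration, using an assumed exponential model with true rate $\theta_0 = 2$ (the model and constants are illustrative, not from the text):

```python
# Consistency in action: the exponential-rate MLE concentrates on theta_0 as n grows.
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                                                 # assumed true rate

for n in [10, 100, 1000, 10000]:
    y = rng.exponential(scale=1 / theta0, size=(500, n))     # 500 replicates of size n
    theta_hat = 1.0 / y.mean(axis=1)                          # MLE per replicate
    frac_close = np.mean(np.abs(theta_hat - theta0) < 0.1)    # P(|theta_hat - theta0| < 0.1)
    print(f"n={n:6d}  fraction within 0.1 of theta0: {frac_close:.3f}")
```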


Theorem: Asymptotic Normality of the MLE

Under regularity, with $\hat\theta_n = g_{\text{ml}}(\mathbf{Y})$ the MLE from $n$ i.i.d. observations and $J_1(\theta_0)$ the per-sample Fisher information,

$$\sqrt{n}\,\bigl(\hat\theta_n - \theta_0\bigr) \;\xrightarrow{\;d\;}\; \mathcal{N}\!\bigl(0,\; J_1(\theta_0)^{-1}\bigr).$$

Equivalently, for large $n$, $\hat\theta_n \approx \mathcal{N}\bigl(\theta_0,\; J(\theta_0)^{-1}\bigr)$ with $J(\theta_0) = n J_1(\theta_0)$.

A first-order Taylor expansion of the score around $\theta_0$ gives $0 = s_n(\hat\theta_n) \approx s_n(\theta_0) + s_n'(\tilde\theta)(\hat\theta_n - \theta_0)$. The CLT makes $n^{-1/2} s_n(\theta_0)$ Gaussian with variance $J_1(\theta_0)$; the WLLN makes $-n^{-1} s_n'(\tilde\theta)$ converge to $J_1(\theta_0)$. Dividing the two gives $\sqrt{n}\,(\hat\theta_n - \theta_0) \approx J_1(\theta_0)^{-1}\, n^{-1/2} s_n(\theta_0)$, whose limiting variance is $J_1(\theta_0)^{-1} J_1(\theta_0) J_1(\theta_0)^{-1} = J_1(\theta_0)^{-1}$, the Gaussian limit stated above.
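As a concrete instance (an illustration, not taken from the text), for i.i.d. Bernoulli($\theta_0$) observations the MLE is the sample mean $\hat\theta_n = \bar{Y}$, the per-sample information is $J_1(\theta) = 1/\bigl(\theta(1-\theta)\bigr)$, and the theorem reduces to the familiar CLT statement

$$\sqrt{n}\,\bigl(\bar{Y} - \theta_0\bigr) \;\xrightarrow{\;d\;}\; \mathcal{N}\bigl(0,\; \theta_0(1-\theta_0)\bigr).$$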


Asymptotic Efficiency

The asymptotic variance of the MLE is $J_1(\theta_0)^{-1}/n = J(\theta_0)^{-1}$, which is exactly the Cramér-Rao lower bound. In the large-$n$ limit the MLE achieves the CRLB; this is the asymptotic efficiency property. For finite $n$ the MLE may be biased or inefficient, but in regular models and under mild conditions it is the gold-standard estimator, because no other estimator can beat it asymptotically.
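As a concrete instance (a standard calculation, not worked in the text), take the Gaussian mean with known variance $\sigma^2$: the score is $(Y-\mu)/\sigma^2$, so

$$J_1(\mu) = \mathbb{E}\!\left[\left(\frac{Y-\mu}{\sigma^2}\right)^{\!2}\right] = \frac{1}{\sigma^2}, \qquad \frac{J_1(\mu)^{-1}}{n} = \frac{\sigma^2}{n},$$

which matches the CRLB for estimating $\mu$. Note that the $\sigma^2/n$ form is specific to this model, not universal.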

Theorem: Invariance of the MLE under Reparameterization

Let $u: \Lambda \to \Lambda'$ be any function (not necessarily one-to-one). If $\hat\theta_{\text{ml}}$ is the MLE of $\theta$, then the MLE of $\alpha = u(\theta)$ is

$$\hat\alpha_{\text{ml}}(\mathbf{y}) \;=\; u\!\bigl(g_{\text{ml}}(\mathbf{y})\bigr)$$

whenever $u$ is one-to-one. In general, $\hat\alpha_{\text{ml}}(\mathbf{y}) = \arg\max_{\alpha'} \max\{f_\theta(\mathbf{y}) : u(\theta) = \alpha'\}$, and this induced-likelihood maximizer is again $u\bigl(g_{\text{ml}}(\mathbf{y})\bigr)$.

The likelihood is a statement about the data, not the parameterization. Whichever coordinate system we use to describe $\theta$, the MLE identifies the same point in the data-evidence sense.
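For the one-to-one case the argument is one line (a sketch in the section's notation): reparameterizing by $\alpha = u(\theta)$ gives the likelihood $L^*(\alpha) = f_{u^{-1}(\alpha)}(\mathbf{y})$, which is largest exactly when $u^{-1}(\alpha) = g_{\text{ml}}(\mathbf{y})$, that is, at $\hat\alpha = u\bigl(g_{\text{ml}}(\mathbf{y})\bigr)$.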


Example: Invariance: Noise Power in dB

Let $Z_1, \ldots, Z_n$ be i.i.d. $\mathcal{N}(0, \sigma^2)$. Find the MLE of $\alpha = 10\log_{10}(\sigma^2)$, the noise power expressed in dB. Since the MLE of $\sigma^2$ under a known zero mean is $\hat\sigma^2_{\text{ml}} = \frac{1}{n}\sum_{i=1}^n Z_i^2$, invariance gives $\hat\alpha_{\text{ml}} = 10\log_{10}\!\bigl(\frac{1}{n}\sum_{i=1}^n Z_i^2\bigr)$.
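A minimal numerical sketch of this example (the true noise power and sample size below are illustrative assumptions):

```python
# Invariance in action: MLE of the noise power in dB for zero-mean Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
sigma2_true = 0.5                      # assumed true noise power (illustrative)
z = rng.normal(0.0, np.sqrt(sigma2_true), size=1000)

sigma2_hat = np.mean(z**2)             # MLE of sigma^2 when the mean is known to be 0
alpha_hat = 10 * np.log10(sigma2_hat)  # by invariance, the MLE of the power in dB

print(f"true power: {10 * np.log10(sigma2_true):.2f} dB, MLE: {alpha_hat:.2f} dB")
```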

Sampling Distribution of the MLE

Monte Carlo histogram of $\sqrt{n}(\hat\theta_n - \theta_0)$ for the Gaussian mean MLE, overlaid with the asymptotic $\mathcal{N}(0, \sigma^2)$ prediction. Increase $n$ to see the empirical distribution converge to the Gaussian limit.

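A non-interactive sketch of the same experiment (the sample size, replicate count, and Gaussian parameters below are assumptions chosen for illustration):

```python
# Monte Carlo check of asymptotic normality for the Gaussian mean MLE (sample mean).
import numpy as np

rng = np.random.default_rng(2)
mu0, sigma = 0.0, 1.0          # true mean and standard deviation (assumed)
n, trials = 30, 2000           # sample size and number of Monte Carlo replicates

y = rng.normal(mu0, sigma, size=(trials, n))
z = np.sqrt(n) * (y.mean(axis=1) - mu0)   # sqrt(n) * (theta_hat - theta_0)

# The empirical mean and variance should be close to the N(0, sigma^2) prediction.
print(f"empirical mean {z.mean():.3f}, empirical var {z.var():.3f}, predicted var {sigma**2:.3f}")
```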

MLE MSE vs CRLB (Exponential Rate)

For the exponential model $f_\theta(y) = \theta e^{-\theta y}$, compare the empirical MSE of $g_{\text{ml}}(\mathbf{y}) = 1/\bar{y}$ with the Cramér-Rao bound $\theta^2/n$. The MLE is biased for small $n$ but its MSE approaches the CRLB as $n$ grows, illustrating asymptotic efficiency.

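A minimal sketch of this comparison (the true rate and replicate count are illustrative assumptions):

```python
# Empirical MSE of the exponential-rate MLE 1/ybar versus the CRLB theta^2/n.
import numpy as np

rng = np.random.default_rng(3)
theta0, trials = 1.0, 2000     # assumed true rate and number of replicates

for n in [5, 20, 100, 1000]:
    y = rng.exponential(scale=1 / theta0, size=(trials, n))
    theta_hat = 1.0 / y.mean(axis=1)           # MLE of the rate per replicate
    mse = np.mean((theta_hat - theta0) ** 2)   # empirical mean squared error
    crlb = theta0**2 / n                       # Cramer-Rao lower bound
    print(f"n={n:5d}  MSE={mse:.4f}  CRLB={crlb:.4f}  ratio={mse / crlb:.2f}")
```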

Common Mistake: Asymptotic Properties Are Not Finite-Sample Guarantees

Mistake:

Using $\hat\theta \sim \mathcal{N}(\theta_0, J(\theta_0)^{-1})$ as if it were exact, for example to construct confidence intervals with very small $n$.

Correction:

Asymptotic normality is a limit statement and can be a poor approximation when $n$ is small or when the log-likelihood is asymmetric (skewed models, near-boundary truths). For small $n$, use bootstrap, profile-likelihood intervals, or exact finite-sample distributions when available.
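As one way to act on this advice, here is a minimal sketch (assumed exponential model and constants) contrasting a Wald interval based on the normal approximation with a percentile-bootstrap interval at small $n$:

```python
# Wald vs. percentile-bootstrap intervals for the exponential rate with small n.
import numpy as np

rng = np.random.default_rng(4)
theta0, n = 1.0, 10                      # assumed true rate and (small) sample size
y = rng.exponential(scale=1 / theta0, size=n)
theta_hat = 1.0 / y.mean()               # MLE of the rate

# Wald interval: plug theta_hat into the asymptotic variance theta^2 / n.
se = theta_hat / np.sqrt(n)
wald = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

# Nonparametric percentile bootstrap of the same estimator.
boot = 1.0 / rng.choice(y, size=(2000, n), replace=True).mean(axis=1)
boot_ci = tuple(np.percentile(boot, [2.5, 97.5]))

print("Wald:", wald, "bootstrap:", boot_ci)   # the bootstrap interval is typically asymmetric
```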

Common Mistake: Non-Identifiable Models

Mistake:

Applying MLE theory to over-parameterized models: Gaussian mixtures with label switching, overcomplete signal dictionaries, or observations of the form $\mathbf{H}\mathbf{v}$ where both $\mathbf{H}$ and $\mathbf{v}$ are unknown up to a scalar.

Correction:

When $\theta \neq \theta_0$ can yield the same distribution, the MLE is not unique and consistency fails. Fix the symmetry by a normalization, a canonical orientation, or a pivot element. In Gaussian mixtures, resolve label switching by post-processing, for example by ordering the components before assigning cluster labels.

Key Takeaway

Under regularity, the MLE satisfies four guarantees: consistency, asymptotic normality with variance $J(\theta_0)^{-1}$, asymptotic efficiency (CRLB attainment), and invariance under reparameterization. These four together justify ML as the default frequentist estimator.

Quick Check

For an i.i.d. sample of size $n$ from $f_{\theta_0}$ under regularity, the asymptotic variance of $\hat\theta_n^{\text{ml}}$ equals:

$J_1(\theta_0)$

$1/(nJ_1(\theta_0)) = 1/J(\theta_0)$

$\sigma^2/n$ regardless of the model

$J_1(\theta_0)^2/n$