Properties of the MLE

Why Do We Trust the MLE?

The MLE is just one function of the data among infinitely many. Why has it come to dominate engineering practice? The answer is a package of four large-sample guarantees: consistency (the estimator concentrates on the truth), asymptotic normality (its deviation is $O(1/\sqrt{n})$ and Gaussian), asymptotic efficiency (its asymptotic variance matches the Cramér-Rao lower bound), and invariance (the MLE of any smooth function of the parameter is that function applied to the MLE). In this section we state and prove these four properties.

Definition: Regularity Conditions

The i.i.d. model $\{f_\theta : \theta \in \Lambda\}$ is called regular at $\theta_0$ if the following hold:

  1. $\Lambda \subseteq \mathbb{R}^m$ is open and $\theta_0$ is an interior point.
  2. The support $\{y : f_\theta(y) > 0\}$ is the same for all $\theta$.
  3. $\log f_\theta(y)$ is twice continuously differentiable in $\theta$.
  4. Differentiation under the integral sign is permitted, so $\mathbb{E}_\theta[s(\theta; Y)] = 0$ and $J(\theta) = -\mathbb{E}_\theta\bigl[\partial^2 \log f_\theta(Y)/\partial\theta^2\bigr]$.
  5. $J(\theta_0)$ is finite and strictly positive (non-singular in the vector case).
  6. The parameter is identifiable: $\theta \neq \theta_0 \Rightarrow f_\theta \neq f_{\theta_0}$ on a set of positive measure.

These conditions are inherited from Chapter 5 and are what the proofs in this section require. They fail for support-dependent models (uniform on $[0,\theta]$) and for non-identifiable models (over-parameterized mixtures, label switching).
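As a quick sanity check (not worked in the text), the one-parameter exponential model $f_\theta(y) = \theta e^{-\theta y}$, $y > 0$, satisfies them: the support $(0, \infty)$ does not depend on $\theta$, and

$$\log f_\theta(y) = \log\theta - \theta y, \qquad s(\theta; y) = \frac{1}{\theta} - y, \qquad \mathbb{E}_\theta[s(\theta; Y)] = \frac{1}{\theta} - \frac{1}{\theta} = 0, \qquad J_1(\theta) = \frac{1}{\theta^2} > 0.$$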

Theorem: Consistency of the MLE

Let $Y_1, Y_2, \ldots$ be i.i.d. from $f_{\theta_0}$ in a regular model with compact $\Lambda$. Let $\hat\theta_n = g_{\text{ml}}(\mathbf{Y})$ be the MLE from the first $n$ observations. Then

$$\hat\theta_n \;\xrightarrow{\;p\;}\; \theta_0 \qquad \text{as } n \to \infty.$$

The per-sample log-likelihood is, by the WLLN, close to its expectation $\mathbb{E}_{\theta_0}[\log f_\theta(Y)]$. That expectation is maximized at $\theta = \theta_0$ by Gibbs' inequality (the KL-divergence argument). Hence the maximizer of the empirical log-likelihood converges to the maximizer of its population counterpart.
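A minimal Monte Carlo sketch of this concentration, using an assumed exponential model with true rate $\theta_0 = 2$ (the model and constants are illustrative, not from the text):

```python
# Consistency in action: the exponential-rate MLE concentrates on theta_0 as n grows.
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                                                 # assumed true rate

for n in [10, 100, 1000, 10000]:
    y = rng.exponential(scale=1 / theta0, size=(500, n))     # 500 replicates of size n
    theta_hat = 1.0 / y.mean(axis=1)                          # MLE per replicate
    frac_close = np.mean(np.abs(theta_hat - theta0) < 0.1)    # P(|theta_hat - theta0| < 0.1)
    print(f"n={n:6d}  fraction within 0.1 of theta0: {frac_close:.3f}")
```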


Theorem: Asymptotic Normality of the MLE

Under regularity, with $\hat\theta_n = g_{\text{ml}}(\mathbf{Y})$ the MLE from $n$ i.i.d. observations and $J_1(\theta_0)$ the per-sample Fisher information,

$$\sqrt{n}\,\bigl(\hat\theta_n - \theta_0\bigr) \;\xrightarrow{\;d\;}\; \mathcal{N}\!\bigl(0,\; J_1(\theta_0)^{-1}\bigr).$$

Equivalently, for large $n$, $\hat\theta_n \approx \mathcal{N}\bigl(\theta_0,\; J(\theta_0)^{-1}\bigr)$ with $J(\theta_0) = n J_1(\theta_0)$.

A first-order Taylor expansion of the score around $\theta_0$ gives $0 = s_n(\hat\theta_n) \approx s_n(\theta_0) + s_n'(\tilde\theta)(\hat\theta_n - \theta_0)$. The CLT makes $n^{-1/2} s_n(\theta_0)$ Gaussian with variance $J_1(\theta_0)$; the WLLN makes $-n^{-1} s_n'(\tilde\theta)$ converge to $J_1(\theta_0)$. Dividing the two gives $\sqrt{n}\,(\hat\theta_n - \theta_0) \approx J_1(\theta_0)^{-1}\, n^{-1/2} s_n(\theta_0)$, whose limiting variance is $J_1(\theta_0)^{-1} J_1(\theta_0) J_1(\theta_0)^{-1} = J_1(\theta_0)^{-1}$, the Gaussian limit stated above.
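As a concrete instance (an illustration, not taken from the text), for i.i.d. Bernoulli($\theta_0$) observations the MLE is the sample mean $\hat\theta_n = \bar{Y}$, the per-sample information is $J_1(\theta) = 1/\bigl(\theta(1-\theta)\bigr)$, and the theorem reduces to the familiar CLT statement

$$\sqrt{n}\,\bigl(\bar{Y} - \theta_0\bigr) \;\xrightarrow{\;d\;}\; \mathcal{N}\bigl(0,\; \theta_0(1-\theta_0)\bigr).$$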


Asymptotic Efficiency

The asymptotic variance of the MLE is $J_1(\theta_0)^{-1}/n = J(\theta_0)^{-1}$, which is exactly the Cramér-Rao lower bound. In the large-$n$ limit the MLE achieves the CRLB; this is the asymptotic efficiency property. For finite $n$ the MLE may be biased or inefficient, but in regular models and under mild conditions it is the gold-standard estimator, because no other estimator can beat it asymptotically.
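As a concrete instance (a standard calculation, not worked in the text), take the Gaussian mean with known variance $\sigma^2$: the score is $(Y-\mu)/\sigma^2$, so

$$J_1(\mu) = \mathbb{E}\!\left[\left(\frac{Y-\mu}{\sigma^2}\right)^{\!2}\right] = \frac{1}{\sigma^2}, \qquad \frac{J_1(\mu)^{-1}}{n} = \frac{\sigma^2}{n},$$

which matches the CRLB for estimating $\mu$. Note that the $\sigma^2/n$ form is specific to this model, not universal.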

Theorem: Invariance of the MLE under Reparameterization

Let $u: \Lambda \to \Lambda'$ be any function (not necessarily one-to-one). If $\hat\theta_{\text{ml}}$ is the MLE of $\theta$, then the MLE of $\alpha = u(\theta)$ is

$$\hat\alpha_{\text{ml}}(\mathbf{y}) \;=\; u\!\bigl(g_{\text{ml}}(\mathbf{y})\bigr)$$

whenever $u$ is one-to-one. In general, $\hat\alpha_{\text{ml}}(\mathbf{y}) = \arg\max_{\alpha'} \max\{f_\theta(\mathbf{y}) : u(\theta) = \alpha'\}$, and this induced-likelihood maximizer is again $u\bigl(g_{\text{ml}}(\mathbf{y})\bigr)$.

The likelihood is a statement about the data, not the parameterization. Whichever coordinate system we use to describe $\theta$, the MLE identifies the same point in the data-evidence sense.
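For the one-to-one case the argument is one line (a sketch in the section's notation): reparameterizing by $\alpha = u(\theta)$ gives the likelihood $L^*(\alpha) = f_{u^{-1}(\alpha)}(\mathbf{y})$, which is largest exactly when $u^{-1}(\alpha) = g_{\text{ml}}(\mathbf{y})$, that is, at $\hat\alpha = u\bigl(g_{\text{ml}}(\mathbf{y})\bigr)$.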


Example: Invariance: Noise Power in dB

Let $Z_1, \ldots, Z_n$ be i.i.d. $\mathcal{N}(0, \sigma^2)$. Find the MLE of $\alpha = 10\log_{10}(\sigma^2)$, the noise power expressed in dB. Since the MLE of $\sigma^2$ under a known zero mean is $\hat\sigma^2_{\text{ml}} = \frac{1}{n}\sum_{i=1}^n Z_i^2$, invariance gives $\hat\alpha_{\text{ml}} = 10\log_{10}\!\bigl(\frac{1}{n}\sum_{i=1}^n Z_i^2\bigr)$.
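A minimal numerical sketch of this example (the true noise power and sample size below are illustrative assumptions):

```python
# Invariance in action: MLE of the noise power in dB for zero-mean Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
sigma2_true = 0.5                      # assumed true noise power (illustrative)
z = rng.normal(0.0, np.sqrt(sigma2_true), size=1000)

sigma2_hat = np.mean(z**2)             # MLE of sigma^2 when the mean is known to be 0
alpha_hat = 10 * np.log10(sigma2_hat)  # by invariance, the MLE of the power in dB

print(f"true power: {10 * np.log10(sigma2_true):.2f} dB, MLE: {alpha_hat:.2f} dB")
```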

Sampling Distribution of the MLE

Monte Carlo histogram of $\sqrt{n}(\hat\theta_n - \theta_0)$ for the Gaussian mean MLE, overlaid with the asymptotic $\mathcal{N}(0, \sigma^2)$ prediction. Increase $n$ to see the empirical distribution converge to the Gaussian limit.

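A non-interactive sketch of the same experiment (the sample size, replicate count, and Gaussian parameters below are assumptions chosen for illustration):

```python
# Monte Carlo check of asymptotic normality for the Gaussian mean MLE (sample mean).
import numpy as np

rng = np.random.default_rng(2)
mu0, sigma = 0.0, 1.0          # true mean and standard deviation (assumed)
n, trials = 30, 2000           # sample size and number of Monte Carlo replicates

y = rng.normal(mu0, sigma, size=(trials, n))
z = np.sqrt(n) * (y.mean(axis=1) - mu0)   # sqrt(n) * (theta_hat - theta_0)

# The empirical mean and variance should be close to the N(0, sigma^2) prediction.
print(f"empirical mean {z.mean():.3f}, empirical var {z.var():.3f}, predicted var {sigma**2:.3f}")
```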

MLE MSE vs CRLB (Exponential Rate)

For the exponential model $f_\theta(y) = \theta e^{-\theta y}$, compare the empirical MSE of $g_{\text{ml}}(\mathbf{y}) = 1/\bar{y}$ with the Cramér-Rao bound $\theta^2/n$. The MLE is biased for small $n$ but its MSE approaches the CRLB as $n$ grows, illustrating asymptotic efficiency.

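A minimal sketch of this comparison (the true rate and replicate count are illustrative assumptions):

```python
# Empirical MSE of the exponential-rate MLE 1/ybar versus the CRLB theta^2/n.
import numpy as np

rng = np.random.default_rng(3)
theta0, trials = 1.0, 2000     # assumed true rate and number of replicates

for n in [5, 20, 100, 1000]:
    y = rng.exponential(scale=1 / theta0, size=(trials, n))
    theta_hat = 1.0 / y.mean(axis=1)           # MLE of the rate per replicate
    mse = np.mean((theta_hat - theta0) ** 2)   # empirical mean squared error
    crlb = theta0**2 / n                       # Cramer-Rao lower bound
    print(f"n={n:5d}  MSE={mse:.4f}  CRLB={crlb:.4f}  ratio={mse / crlb:.2f}")
```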

Common Mistake: Asymptotic Properties Are Not Finite-Sample Guarantees

Mistake:

Using $\hat\theta \sim \mathcal{N}(\theta_0, J(\theta_0)^{-1})$ as if it were exact, for example to construct confidence intervals with very small $n$.

Correction:

Asymptotic normality is a limit statement and can be a poor approximation when $n$ is small or when the log-likelihood is asymmetric (skewed models, near-boundary truths). For small $n$, use bootstrap, profile-likelihood intervals, or exact finite-sample distributions when available.
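As one way to act on this advice, here is a minimal sketch (assumed exponential model and constants) contrasting a Wald interval based on the normal approximation with a percentile-bootstrap interval at small $n$:

```python
# Wald vs. percentile-bootstrap intervals for the exponential rate with small n.
import numpy as np

rng = np.random.default_rng(4)
theta0, n = 1.0, 10                      # assumed true rate and (small) sample size
y = rng.exponential(scale=1 / theta0, size=n)
theta_hat = 1.0 / y.mean()               # MLE of the rate

# Wald interval: plug theta_hat into the asymptotic variance theta^2 / n.
se = theta_hat / np.sqrt(n)
wald = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

# Nonparametric percentile bootstrap of the same estimator.
boot = 1.0 / rng.choice(y, size=(2000, n), replace=True).mean(axis=1)
boot_ci = tuple(np.percentile(boot, [2.5, 97.5]))

print("Wald:", wald, "bootstrap:", boot_ci)   # the bootstrap interval is typically asymmetric
```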

Common Mistake: Non-Identifiable Models

Mistake:

Applying MLE theory to over-parameterized models: Gaussian mixtures with label switching, overcomplete signal dictionaries, or observations of the form $\mathbf{H}\mathbf{v}$ where both $\mathbf{H}$ and $\mathbf{v}$ are unknown up to a scalar.

Correction:

When $\theta \neq \theta_0$ can yield the same distribution, the MLE is not unique and consistency fails. Fix the symmetry by a normalization, a canonical orientation, or a pivot element. In Gaussian mixtures, resolve label switching by post-processing, for example by ordering the components before assigning cluster labels.

Key Takeaway

Under regularity, the MLE satisfies four guarantees: consistency, asymptotic normality with variance $J(\theta_0)^{-1}$, asymptotic efficiency (CRLB attainment), and invariance under reparameterization. These four together justify ML as the default frequentist estimator.

Quick Check

For an i.i.d. sample of size $n$ from $f_{\theta_0}$ under regularity, the asymptotic variance of $\hat\theta_n^{\text{ml}}$ equals:

$J_1(\theta_0)$

$1/(nJ_1(\theta_0)) = 1/J(\theta_0)$

$\sigma^2/n$ regardless of the model

$J_1(\theta_0)^2/n$