Properties of the MLE
Why Do We Trust the MLE?
The MLE is just one function of the data among infinitely many. Why has it come to dominate engineering practice? The answer is a package of four large-sample guarantees: consistency (the estimator concentrates on the truth), asymptotic normality (the scaled deviation $\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically Gaussian), asymptotic efficiency (its asymptotic variance matches the Cramer-Rao lower bound), and invariance (the MLE of any smooth function of the parameter is that function applied to the MLE). In this section we state and prove these four properties.
Definition: Regularity Conditions
The i.i.d. model $\{f(x;\theta) : \theta \in \Theta\}$ is called regular at $\theta_0$ if the following hold:
- $\Theta$ is open and $\theta_0$ is an interior point.
- The support $\{x : f(x;\theta) > 0\}$ is the same for all $\theta \in \Theta$.
- $\log f(x;\theta)$ is twice continuously differentiable in $\theta$.
- Differentiation under the integral sign is permitted, so $\mathbb{E}_\theta[\partial_\theta \log f(X;\theta)] = 0$ and $I(\theta) = \mathbb{E}_\theta\big[(\partial_\theta \log f(X;\theta))^2\big] = -\mathbb{E}_\theta\big[\partial_\theta^2 \log f(X;\theta)\big]$.
- The Fisher information $I(\theta_0)$ is finite and strictly positive (non-singular in the vector case).
- The parameter is identifiable: $\theta \neq \theta_0$ implies $f(x;\theta) \neq f(x;\theta_0)$ on a set of positive measure.
These conditions are inherited from Chapter 5 and are what the proofs in this section require. They fail for support-dependent models (uniform on $[0, \theta]$) and for non-identifiable models (over-parameterized mixtures, label switching).
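The two identities in the fourth condition are easy to sanity-check numerically. Below is a minimal Monte Carlo sketch (assuming NumPy; the exponential model is our illustrative choice, not mandated by the text). For $f(x;\lambda) = \lambda e^{-\lambda x}$ the score is $\partial_\lambda \log f = 1/\lambda - x$ and the Fisher information is $1/\lambda^2$.

```python
# Sanity check of the regularity identities for the exponential model
# f(x; lam) = lam * exp(-lam * x): expected score is 0, and
# Var(score) = -E[second derivative of log f] = 1 / lam^2.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1.0 / lam, size=1_000_000)

score = 1.0 / lam - x       # d/dlam log f(x; lam)
hess = -1.0 / lam**2        # d^2/dlam^2 log f(x; lam), constant in x here

print(score.mean())         # ~ 0.0   (expected score vanishes)
print(score.var())          # ~ 0.25  (Var of score = I(lam) = 1/lam^2)
print(-hess)                # 0.25    (information identity)
```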
Theorem: Consistency of the MLE
Let $X_1, \dots, X_n$ be i.i.d. from $f(x;\theta_0)$ in a regular model with compact $\Theta$. Let $\hat\theta_n$ be the MLE from the first $n$ observations. Then $\hat\theta_n \xrightarrow{p} \theta_0$ as $n \to \infty$.
The per-sample log-likelihood $\frac{1}{n}\sum_{i=1}^n \log f(X_i;\theta)$ is, by the WLLN, close to its expectation $L(\theta) = \mathbb{E}_{\theta_0}[\log f(X;\theta)]$. That expectation is maximized at $\theta = \theta_0$ by Gibbs' inequality (the KL divergence argument). Hence the maximizer of the empirical log-likelihood converges to the maximizer of its population counterpart.
Population objective and KL divergence
Define $L(\theta) = \mathbb{E}_{\theta_0}[\log f(X;\theta)]$. By the non-negativity of KL divergence, $L(\theta_0) - L(\theta) = D_{\mathrm{KL}}\big(f(\cdot;\theta_0) \,\|\, f(\cdot;\theta)\big) \ge 0$, with equality only at $\theta = \theta_0$ by identifiability. So $\theta_0$ is the unique maximizer of $L$.
Uniform convergence by WLLN
The empirical objective is $L_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i;\theta)$. Under moment assumptions and compactness of $\Theta$, standard uniform-WLLN arguments (Wald, 1949) give $\sup_{\theta \in \Theta} |L_n(\theta) - L(\theta)| \xrightarrow{p} 0$.
Argmax continuous mapping
The MLE is $\hat\theta_n = \arg\max_{\theta \in \Theta} L_n(\theta)$. Since $L_n \to L$ uniformly and $L$ has a unique well-separated maximum at $\theta_0$, the argmax map is continuous, giving $\hat\theta_n \xrightarrow{p} \theta_0$.
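To see consistency in action, here is a minimal simulation sketch (assuming NumPy; the exponential model is again our illustrative choice). The MLE of the exponential rate is $\hat\lambda_n = 1/\bar{X}_n$, and its error shrinks as $n$ grows.

```python
# Consistency demo: the exponential-rate MLE 1/x-bar approaches the truth.
import numpy as np

rng = np.random.default_rng(1)
lam0 = 2.0
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=1.0 / lam0, size=n)
    lam_hat = 1.0 / x.mean()            # MLE of the exponential rate
    print(n, abs(lam_hat - lam0))       # absolute error shrinks with n
```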
Theorem: Asymptotic Normality of the MLE
Under regularity, with $\hat\theta_n$ the MLE from $n$ i.i.d. observations and $I(\theta_0)$ the per-sample Fisher information, $\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta_0)^{-1}\big).$ Equivalently, for large $n$, $\hat\theta_n$ is approximately $\mathcal{N}\big(\theta_0,\, \tfrac{1}{n I(\theta_0)}\big)$, with standard error $1/\sqrt{n I(\theta_0)}$.
A first-order Taylor expansion of the score $S_n(\theta) = \sum_{i=1}^n \partial_\theta \log f(X_i;\theta)$ around $\theta_0$ gives $\sqrt{n}\,(\hat\theta_n - \theta_0) \approx \frac{n^{-1/2} S_n(\theta_0)}{-n^{-1} S_n'(\theta_0)}$. The CLT makes the numerator Gaussian with variance $I(\theta_0)$; the WLLN makes the denominator converge to $I(\theta_0)$. Dividing gives the Gaussian limit.
Taylor-expand the score around the truth
By consistency, $\hat\theta_n$ lies in a neighborhood of $\theta_0$ eventually. Since $S_n(\hat\theta_n) = 0$, a mean-value expansion gives $0 = S_n(\theta_0) + S_n'(\tilde\theta_n)\,(\hat\theta_n - \theta_0)$ for some $\tilde\theta_n$ between $\hat\theta_n$ and $\theta_0$.
Apply CLT to the score at the truth
Define $U_i = \partial_\theta \log f(X_i;\theta_0)$. The $U_i$ are i.i.d. with mean zero (the expected score vanishes) and variance $I(\theta_0)$. By the CLT, $n^{-1/2} S_n(\theta_0) = n^{-1/2}\sum_{i=1}^n U_i \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta_0)\big)$.
Apply WLLN to the curvature
Also $-n^{-1} S_n'(\tilde\theta_n) = -\frac{1}{n}\sum_{i=1}^n \partial_\theta^2 \log f(X_i;\tilde\theta_n)$. By consistency of $\tilde\theta_n$ and the WLLN, this converges in probability to $I(\theta_0)$.
Assemble via Slutsky
Rearranging the Taylor expansion: $\sqrt{n}\,(\hat\theta_n - \theta_0) = \frac{n^{-1/2} S_n(\theta_0)}{-n^{-1} S_n'(\tilde\theta_n)} \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta_0)^{-1}\big).$ This uses Slutsky's theorem (a quotient of a converging numerator and a probability-limit denominator).
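The practical payoff of this theorem is the Wald confidence interval $\hat\theta_n \pm z_{\alpha/2}/\sqrt{n I(\hat\theta_n)}$. A minimal sketch, assuming NumPy and a Poisson model (our choice for illustration): the Poisson MLE is $\hat\lambda = \bar{X}$ and $I(\lambda) = 1/\lambda$, so the standard error is approximately $\sqrt{\hat\lambda / n}$.

```python
# Wald interval from asymptotic normality of the MLE (Poisson rate).
import numpy as np

rng = np.random.default_rng(2)
lam0, n = 4.0, 500
x = rng.poisson(lam=lam0, size=n)

lam_hat = x.mean()                  # MLE of the Poisson rate
se = np.sqrt(lam_hat / n)           # 1 / sqrt(n * I(lam_hat)), I(lam) = 1/lam
z = 1.96                            # ~97.5% standard-normal quantile
print(lam_hat - z * se, lam_hat + z * se)   # approximate 95% Wald interval
```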
Asymptotic Efficiency
The asymptotic variance of the MLE is $\frac{1}{n I(\theta_0)}$, which is exactly the Cramer-Rao lower bound. In the large-$n$ limit, the MLE achieves the CRLB. This is the asymptotic efficiency property. For finite $n$ the MLE may be biased or inefficient, but among regular models and under mild conditions it is the gold-standard estimator: no regular estimator can improve on its asymptotic variance.
Theorem: Invariance of the MLE under Reparameterization
Let $g$ be any function of the parameter (not necessarily one-to-one). If $\hat\theta$ is the MLE of $\theta$, then the MLE of $u = g(\theta)$ is $\hat{u} = g(\hat\theta)$. When $g$ is one-to-one this follows by direct substitution; in general it holds with the induced (profile) likelihood defining the MLE of $u$.
The likelihood is a statement about the data, not the parameterization. Whichever coordinate system we use to describe the model, the MLE identifies the same point in the data-evidence sense.
Profile likelihood under $u$
Define the induced likelihood $L^*(u) = \sup_{\theta :\, g(\theta) = u} L(\theta)$. Maximizing $L^*$ over $u$ returns the same supremum as maximizing $L$ over $\theta$.
One-to-one case
If $g$ is bijective, the set $\{\theta : g(\theta) = u\}$ has one element, $g^{-1}(u)$, and $L^*(u) = L(g^{-1}(u))$. Maximizing over $u$ is equivalent to maximizing over $\theta$ via the substitution $\theta = g^{-1}(u)$, so the maximizer is $\hat{u} = g(\hat\theta)$.
General case
For non-injective $g$ the same profile-likelihood argument applies, but $\hat{u} = g(\hat\theta)$ is the value whose fiber $\{\theta : g(\theta) = \hat{u}\}$ contains the MLE $\hat\theta$.
Example: Invariance: Noise Power in dB
Let $X_1, \dots, X_n$ be i.i.d. $\mathcal{N}(0, \sigma^2)$. Find the MLE of $P_{\mathrm{dB}} = 10 \log_{10} \sigma^2$, the noise power expressed in dB.
MLE of $\sigma^2$
From the example Gaussian MLE: Joint Mean and Variance, with the mean fixed at $0$, $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n X_i^2$.
Apply invariance
Since $g(\sigma^2) = 10 \log_{10} \sigma^2$ is one-to-one on $(0, \infty)$, $\widehat{P}_{\mathrm{dB}} = 10 \log_{10} \hat\sigma^2 = 10 \log_{10}\!\Big(\frac{1}{n}\sum_{i=1}^n X_i^2\Big).$ No re-derivation from scratch is needed.
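As a quick numerical companion, here is a minimal sketch of the same computation (assuming NumPy; the true $\sigma^2$ below is an arbitrary choice for illustration):

```python
# Invariance in practice: the dB-power MLE is 10*log10 of the variance MLE.
import numpy as np

rng = np.random.default_rng(3)
sigma2_true = 0.5
x = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=2_000)

sigma2_hat = np.mean(x**2)               # MLE of sigma^2 (zero-mean Gaussian)
p_db_hat = 10.0 * np.log10(sigma2_hat)   # MLE of the power in dB, by invariance
print(sigma2_hat, p_db_hat)
```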
Sampling Distribution of the MLE
Monte Carlo histogram of $\sqrt{n}\,(\hat\theta_n - \theta_0)$ for the Gaussian mean MLE, overlaid with the asymptotic $\mathcal{N}\big(0,\, I(\theta_0)^{-1}\big)$ prediction. Increase $n$ to see the empirical distribution converge to the Gaussian limit.
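For readers without the interactive demo, here is a minimal non-interactive stand-in (assuming NumPy): it simulates $\sqrt{n}(\bar{X}_n - \mu_0)$ across many trials and checks its spread against the predicted limit, which for the Gaussian mean with known variance is $\mathcal{N}(0, \sigma^2)$ since $I(\mu) = 1/\sigma^2$.

```python
# Sampling distribution of the Gaussian mean MLE: the scaled deviation
# sqrt(n)*(x-bar - mu0) should have spread ~ sigma, matching N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(4)
mu0, sigma, n, trials = 1.0, 2.0, 200, 20_000
x = rng.normal(loc=mu0, scale=sigma, size=(trials, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu0)

print(z.mean())   # ~ 0.0
print(z.std())    # ~ sigma = 2.0, the asymptotic prediction
```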
MLE MSE vs CRLB (Exponential Rate)
For the exponential model $f(x;\lambda) = \lambda e^{-\lambda x}$, compare the empirical MSE of $\hat\lambda = 1/\bar{X}$ with the Cramer-Rao bound $\lambda^2/n$. The MLE is biased for small $n$ but its MSE approaches the CRLB as $n$ grows, illustrating asymptotic efficiency.
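A minimal script version of this comparison (assuming NumPy; sample sizes and trial counts are our choices):

```python
# Empirical MSE of the exponential-rate MLE versus the Cramer-Rao bound.
import numpy as np

rng = np.random.default_rng(5)
lam0, trials = 2.0, 10_000
for n in [5, 20, 100, 500]:
    x = rng.exponential(scale=1.0 / lam0, size=(trials, n))
    lam_hat = 1.0 / x.mean(axis=1)           # MLE of the rate on each trial
    mse = np.mean((lam_hat - lam0) ** 2)
    print(n, mse, lam0**2 / n)               # MSE exceeds CRLB, ratio -> 1
```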
Common Mistake: Asymptotic Properties Are Not Finite-Sample Guarantees
Mistake:
Using the approximation $\hat\theta_n \sim \mathcal{N}\big(\theta_0,\, \tfrac{1}{n I(\theta_0)}\big)$ as if it were exact, for example to construct confidence intervals with very small $n$.
Correction:
Asymptotic normality is a limit statement and can be a poor approximation when $n$ is small or when the log-likelihood is asymmetric (skewed models, near-boundary truths). For small $n$ use the bootstrap, profile-likelihood intervals, or exact finite-sample distributions when available.
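For concreteness, here is a minimal percentile-bootstrap sketch (assuming NumPy; the exponential sample and its size are illustrative). Instead of trusting the Gaussian approximation at $n = 15$, resample the data, recompute the MLE each time, and take percentile endpoints.

```python
# Nonparametric percentile bootstrap for a small-sample MLE interval.
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=0.5, size=15)           # small sample, true rate 2

boot = np.empty(5_000)
for b in range(boot.size):
    xb = rng.choice(x, size=x.size, replace=True)  # resample with replacement
    boot[b] = 1.0 / xb.mean()                      # MLE on the resample

print(np.percentile(boot, [2.5, 97.5]))           # percentile ~95% interval
```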
Common Mistake: Non-Identifiable Models
Mistake:
Applying MLE theory to over-parameterized models: Gaussian mixtures with label switching, or overcomplete signal dictionaries where both the dictionary $A$ and the coefficients $x$ are unknown, so that $Ax = (cA)(x/c)$ for any scalar $c \neq 0$.
Correction:
When distinct parameters $\theta \neq \theta'$ can yield the same distribution, the MLE is not unique and consistency fails. Fix the symmetry by a normalization, a canonical orientation, or a pivot element. In Gaussian mixtures, fix label switching by ordering the components (for example by their means) or by relabeling clusters in post-processing.
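One concrete symmetry fix, sketched under our own conventions (assuming NumPy; `canonicalize` is a hypothetical helper, not a library function): after fitting a mixture by any method, impose a canonical order on the component means so the labeling is unique.

```python
# Break label-switching symmetry by sorting components by their means.
import numpy as np

def canonicalize(means, weights):
    """Sort mixture components by mean so (means, weights) has a unique form."""
    order = np.argsort(means)
    return np.asarray(means)[order], np.asarray(weights)[order]

# Two label-swapped parameterizations of the same mixture map to one form:
print(canonicalize([3.0, -1.0], [0.3, 0.7]))
print(canonicalize([-1.0, 3.0], [0.7, 0.3]))
```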
Key Takeaway
Under regularity, the MLE satisfies four guarantees: consistency, asymptotic normality with variance $\frac{1}{n I(\theta_0)}$, asymptotic efficiency (CRLB attainment), and invariance under reparameterization. These four together justify ML as the default frequentist estimator.
Quick Check
For an i.i.d. sample of size $n$ from $f(x;\theta_0)$ under regularity, the asymptotic variance of the MLE $\hat\theta_n$ equals:
$\frac{1}{n I(\theta_0)}$, regardless of the model
This is the CRLB and the MLE attains it asymptotically.