The ML Principle

Why Maximum Likelihood?

Chapter 5 developed the Cramér-Rao bound as a universal benchmark for unbiased estimators but did not answer a fundamental practical question: given a parametric model and data, how do we actually produce an estimate? Sufficient statistics and Rao-Blackwellization require that we first guess an unbiased estimator. The maximum likelihood principle replaces that guesswork with a single, model-agnostic recipe: pick the parameter value that makes the observed data most probable. The resulting estimator is remarkably effective: under regularity conditions it is consistent, asymptotically unbiased, asymptotically efficient, and equivariant under reparameterization. For engineering problems with moderate sample sizes and well-posed models, ML is the default frequentist estimator.


Definition: Likelihood and Log-Likelihood

Let $\{f_\theta : \theta \in \Lambda\}$ be a parametric family of densities on $\mathcal{Y} \subseteq \mathbb{R}^n$ with $\Lambda \subseteq \mathbb{R}^m$. For a fixed observation $\mathbf{y} \in \mathcal{Y}$, the likelihood function is the map

$$\theta \;\longmapsto\; L_n(\theta) \;\triangleq\; f_\theta(\mathbf{y}),$$

and the log-likelihood function is

$$\ell_n(\theta) \;\triangleq\; \log L_n(\theta) \;=\; \log f_\theta(\mathbf{y}).$$

When the observations are i.i.d. with marginal $f_\theta$, the log-likelihood is a sum of per-sample contributions,

$$\ell_n(\theta) \;=\; \sum_{i=1}^n \log f_\theta(y_i).$$

The likelihood is not a probability density over $\theta$: integrating $L_n(\theta)$ over $\Lambda$ need not give one. It is simply the value of the sampling density evaluated at the observed data, viewed as a function of the parameter.
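To make the definition concrete, here is a minimal Python sketch (NumPy and SciPy assumed; the data and grid are invented for illustration) that evaluates $\ell_n(\theta)$ for an i.i.d. Gaussian sample over a grid of candidate means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)  # i.i.d. N(2, 1) sample

def log_likelihood(theta, y, sigma=1.0):
    # ell_n(theta) = sum_i log f_theta(y_i) for the Gaussian mean model
    return stats.norm.logpdf(y, loc=theta, scale=sigma).sum()

thetas = np.linspace(0.0, 4.0, 401)
ell = np.array([log_likelihood(t, y) for t in thetas])
print("grid maximizer:", thetas[ell.argmax()], "| sample mean:", y.mean())
```

The grid maximizer lands essentially at the sample mean, previewing the first example below.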

Definition: Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is any maximizer of the likelihood:

$$g_{\text{ml}}(\mathbf{y}) \;=\; \arg\max_{\theta \in \Lambda}\; f_\theta(\mathbf{y}) \;=\; \arg\max_{\theta \in \Lambda}\; \ell_n(\theta).$$

When the maximum is attained in the interior of $\Lambda$ and $\ell_n$ is differentiable, the MLE satisfies the score equation

$$\nabla_\theta \ell_n(\theta)\,\Big|_{\theta = g_{\text{ml}}(\mathbf{y})} \;=\; \mathbf{0}.$$

The MLE is defined up to ties: if several parameters achieve the same maximum likelihood, the definition is ambiguous. In practice, ties arise only on sets of measure zero for continuous models with identifiable parameters.
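When the score equation has no closed-form solution, the likelihood is maximized numerically. A minimal sketch, using a Gamma shape parameter as an illustrative model (not one of this chapter's examples; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(1)
y = rng.gamma(shape=3.0, scale=1.0, size=500)

def neg_ell(k):
    # Negative log-likelihood for Gamma(k, scale=1): f_k(y) = y^(k-1) e^(-y) / Gamma(k)
    return -np.sum((k - 1.0) * np.log(y) - y - gammaln(k))

res = minimize_scalar(neg_ell, bounds=(0.1, 20.0), method="bounded")
print("numerical MLE of the shape k:", res.x)  # should land near the true value 3
```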


Definition: Score Function

The score function for a single observation $Y$ is the parameter gradient of the log-density,

$$s(\theta; y) \;\triangleq\; \nabla_\theta \log f_\theta(y),$$

and for the full sample $\mathbf{y} = (y_1, \ldots, y_n)$ the total score is $s_n(\theta; \mathbf{y}) = \sum_{i=1}^n s(\theta; y_i)$.

Under regularity, the expected score is zero: $\mathbb{E}_\theta[s(\theta; Y)] = 0$. The variance of the score is the Fisher information $J(\theta) = \operatorname{Var}_\theta(s(\theta; Y))$.
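Both identities are easy to check by simulation. A minimal sketch for the Gaussian mean model, where $s(\theta; y) = (y - \theta)/\sigma^2$ and $J(\theta) = 1/\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma = 2.0, 1.0
Y = rng.normal(theta, sigma, size=100_000)

score = (Y - theta) / sigma**2  # s(theta; Y) evaluated at the true parameter

print("mean of the score (should be ~0):", score.mean())
print("variance of the score:", score.var(), "| Fisher info 1/sigma^2:", 1 / sigma**2)
```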

Example: Gaussian MLE: Known Variance, Unknown Mean

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known. Compute the MLE of $\theta$.
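A sketch of the standard computation: differentiate the Gaussian log-likelihood in $\theta$ and solve the score equation.

$$\ell_n(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \theta)^2, \qquad \frac{\partial \ell_n}{\partial \theta} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \theta).$$

Setting the score to zero gives $g_{\text{ml}}(\mathbf{y}) = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}$, the sample mean; the second derivative $-n/\sigma^2 < 0$ confirms a maximum.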

Example: Gaussian MLE: Joint Mean and Variance

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown, $\boldsymbol{\theta} = (\mu, \sigma^2)$. Find the joint MLE.
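A sketch of the standard computation: set both partial derivatives of $\ell_n(\mu, \sigma^2)$ to zero. The $\mu$-equation gives $\hat{\mu} = \bar{y}$ for every fixed $\sigma^2$; substituting into the $\sigma^2$-equation yields

$$\hat{\mu} = \bar{y}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2.$$

Note the $1/n$ normalization: the variance MLE is biased in finite samples but asymptotically unbiased, consistent with the general claims above.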

Example: Exponential Rate MLE

Let $Y_1, \ldots, Y_n$ be i.i.d. exponential with rate $\theta > 0$: $f_\theta(y) = \theta e^{-\theta y}$ for $y \geq 0$. Find $g_{\text{ml}}$.
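A sketch of the standard computation: $\ell_n(\theta) = n\log\theta - \theta\sum_{i=1}^n y_i$, so the score is $n/\theta - \sum_{i=1}^n y_i$, which vanishes at

$$g_{\text{ml}}(\mathbf{y}) = \frac{n}{\sum_{i=1}^n y_i} = \frac{1}{\bar{y}}.$$

The second derivative $-n/\theta^2 < 0$ confirms this is the global maximizer.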

Example: Uniform Endpoint: A Pathological Case

Let $Y_1, \ldots, Y_n$ be i.i.d. uniform on $[0, \theta]$, $\theta > 0$. Find the MLE and discuss why the standard regularity analysis breaks down.
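A sketch of the argument: the joint density is $f_\theta(\mathbf{y}) = \theta^{-n}\,\mathbf{1}\{\max_i y_i \le \theta\}$, which vanishes for $\theta < \max_i y_i$ and is strictly decreasing in $\theta$ beyond that point. The maximizer is therefore the boundary point

$$g_{\text{ml}}(\mathbf{y}) = \max_{1 \le i \le n} y_i,$$

and there is no interior stationary point: because the support of $f_\theta$ depends on $\theta$, the likelihood is not differentiable at $\theta = \max_i y_i$ and the score equation has no solution. This is exactly the failure mode flagged in the Common Mistake box at the end of the section.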

Log-Likelihood Surface for $\mathcal{N}(\mu, \sigma^2)$

The joint log-likelihood $\ell_n(\mu, \sigma^2)$ as a function of $(\mu, \sigma^2)$ for an i.i.d. Gaussian sample. The MLE sits at the peak of the surface; the contours show how curvature relates to the Fisher information.

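A matplotlib sketch that reproduces a static contour view of this surface (synthetic data; the star marks the joint MLE $(\hat{\mu}, \hat{\sigma}^2)$):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=50)
n = len(y)

mu = np.linspace(0.0, 2.0, 200)
var = np.linspace(0.3, 3.0, 200)
MU, VAR = np.meshgrid(mu, var)

# ell_n(mu, sigma^2) evaluated on the (mu, sigma^2) grid
ELL = (-n / 2 * np.log(2 * np.pi * VAR)
       - ((y[:, None, None] - MU) ** 2).sum(axis=0) / (2 * VAR))

plt.contour(MU, VAR, ELL, levels=30)
plt.plot(y.mean(), y.var(), "r*", markersize=12)  # np.var uses 1/n: the MLE
plt.xlabel(r"$\mu$"); plt.ylabel(r"$\sigma^2$")
plt.title(r"Gaussian log-likelihood surface $\ell_n(\mu, \sigma^2)$")
plt.show()
```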

Score Function and the MLE

Plot of the score $s_n(\theta; \mathbf{y}) = \partial \ell_n / \partial \theta$ for the Gaussian mean model. The MLE is the zero-crossing of the score; the slope at the zero is minus the observed Fisher information.

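A companion sketch: for the Gaussian mean model the score $s_n(\theta; \mathbf{y}) = \sum_i (y_i - \theta)/\sigma^2$ is linear in $\theta$ with slope $-n/\sigma^2$, so the zero-crossing and the observed Fisher information are both visible at a glance (synthetic data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n, sigma = 30, 1.0
y = rng.normal(0.5, sigma, size=n)

theta = np.linspace(-0.5, 1.5, 400)
score = (y.sum() - n * theta) / sigma**2  # s_n(theta; y), slope -n/sigma^2

plt.plot(theta, score)
plt.axhline(0.0, color="k", lw=0.8)
plt.axvline(y.mean(), ls="--", color="r")  # the MLE: zero-crossing of the score
plt.xlabel(r"$\theta$"); plt.ylabel(r"$s_n(\theta; \mathbf{y})$")
plt.show()
```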

Key Takeaway

The MLE is a zero of the score, i.e., a stationary point of the log-likelihood. Under regularity it is the global maximizer; when regularity fails (as with parameter-dependent supports), the stationarity characterization breaks down and we must maximize the likelihood directly.

Quick Check

Which of the following statements about the likelihood function is correct?

It is a probability density on the parameter space $\Lambda$.

It is the joint density of the data, viewed as a function of the parameter.

It equals the posterior density up to a multiplicative constant.

It is always maximized by the sample mean.

Common Mistake: Support-Dependent Distributions Break Regularity

Mistake:

Assuming that the score equation always characterizes the MLE, including for models like uniform on $[0, \theta]$, Pareto, or shifted exponential.

Correction:

When the support of $f_\theta$ depends on $\theta$, the density is not differentiable in $\theta$ and the score equation has no interior solution. Maximize the likelihood directly: typically the MLE sits on the boundary and is an order statistic (min, max, etc.). Asymptotic normality and the CRLB do not apply.