The ML Principle
Why Maximum Likelihood?
Chapter 5 developed the Cramér-Rao bound as a universal benchmark for unbiased estimators but did not answer a fundamental practical question: given a parametric model and data, how do we actually produce an estimate? Sufficient statistics and Rao-Blackwellization require that we first guess an unbiased estimator. The maximum likelihood principle replaces that guesswork with a single, model-agnostic recipe: pick the parameter value that makes the observed data most probable. The recipe is astonishingly effective: under regularity it is consistent, asymptotically unbiased, asymptotically efficient, and equivariant under reparameterization. For engineering problems with moderate sample sizes and well-posed models, ML is the default frequentist estimator.
Definition: Likelihood and Log-Likelihood
Let $\{p(x;\theta) : \theta \in \Theta\}$ be a parametric family of densities on $\mathcal{X}$ with $\Theta \subseteq \mathbb{R}^d$. For a fixed observation $x \in \mathcal{X}$, the likelihood function is the map
$$L(\theta) = p(x;\theta), \qquad \theta \in \Theta,$$
and the log-likelihood function is
$$\ell(\theta) = \log L(\theta).$$
When the observations $x = (x_1, \dots, x_n)$ are i.i.d. with marginal $p(x_i;\theta)$, the log-likelihood is a sum of per-sample contributions,
$$\ell(\theta) = \sum_{i=1}^{n} \log p(x_i;\theta).$$
The likelihood is not a probability density over $\theta$: integrating over $\Theta$ need not give one. It is simply the value of the sampling density evaluated at the observed data, viewed as a function of the parameter.
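To make the definition concrete, here is a minimal Python sketch (not part of the original text; the sample `x`, the variance `sigma2`, and the helper name `log_likelihood` are illustrative choices) that evaluates the i.i.d. Gaussian log-likelihood as a sum of per-sample contributions for a few candidate means:

```python
import numpy as np

def log_likelihood(mu, x, sigma2=1.0):
    """Sum of per-sample Gaussian log-densities log p(x_i; mu)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (x - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # simulated data, true mu = 2

# The likelihood is a function of the parameter for fixed data:
for mu in (0.0, 1.0, 2.0, 3.0):
    print(f"mu = {mu:.1f}  log-likelihood = {log_likelihood(mu, x):.2f}")
```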
Definition: Maximum Likelihood Estimator
The maximum likelihood estimator (MLE) is any maximizer of the likelihood:
$$\hat\theta_{\mathrm{ML}} \in \arg\max_{\theta \in \Theta} L(\theta) = \arg\max_{\theta \in \Theta} \ell(\theta).$$
When the maximum is attained in the interior of $\Theta$ and $\ell$ is differentiable, the MLE satisfies the score equation
$$\nabla_\theta \ell(\theta)\big|_{\theta = \hat\theta_{\mathrm{ML}}} = 0.$$
The MLE is defined up to ties: if several parameters achieve the same maximum likelihood, the definition is ambiguous. In practice, ties arise only on sets of measure zero for continuous models with identifiable parameters.
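Operationally, the maximizer is usually found by minimizing the negative log-likelihood numerically. A hedged sketch using SciPy, continuing the illustrative Gaussian-mean model above (the bounds, seed, and sample are arbitrary choices, not from the original text):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)

def neg_log_likelihood(mu, x, sigma2=1.0):
    # Negative of the i.i.d. Gaussian log-likelihood in the mean.
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) \
        + np.sum((x - mu) ** 2) / (2 * sigma2)

# Any maximizer of the likelihood = any minimizer of the negative log-likelihood.
result = minimize_scalar(neg_log_likelihood, args=(x,),
                         bounds=(-10, 10), method="bounded")
print("numerical MLE :", result.x)
print("sample mean   :", x.mean())   # closed form for this model
```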
Definition: Score Function
The score function for a single observation is the parameter gradient of the log-density,
$$s(\theta; x_i) = \nabla_\theta \log p(x_i;\theta),$$
and for the full sample the total score is $s(\theta; x) = \sum_{i=1}^{n} s(\theta; x_i) = \nabla_\theta \ell(\theta)$.
Under regularity, the expected score is zero: $\mathbb{E}_\theta[s(\theta; X)] = 0$. The variance of the score is the Fisher information, $I(\theta) = \mathrm{Var}_\theta[s(\theta; X)] = \mathbb{E}_\theta\!\left[s(\theta; X)\, s(\theta; X)^\top\right]$.
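These two identities are easy to check by simulation. A quick Monte Carlo sketch for the Gaussian mean model, where the per-observation score is $(x - \mu)/\sigma^2$ and the Fisher information is $1/\sigma^2$ (the parameter values and draw count below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 2.0, 1.5
n_draws = 200_000

# Score of one Gaussian observation w.r.t. the mean: s(mu; x) = (x - mu) / sigma2
x = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=n_draws)
score = (x - mu) / sigma2

print("mean of score     :", score.mean())   # should be close to 0
print("variance of score :", score.var())    # should be close to 1/sigma2
print("Fisher information:", 1 / sigma2)
```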
Example: Gaussian MLE: Known Variance, Unknown Mean
Let $x_1, \dots, x_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Compute the MLE of $\mu$.
Write the log-likelihood
The log-density of one sample is $\log p(x_i;\mu) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{(x_i - \mu)^2}{2\sigma^2}$. Summing,
$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Differentiate and solve the score equation
The score is $\ell'(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)$. Setting it to zero yields $\sum_{i=1}^{n} x_i = n\mu$, i.e. $\hat\mu_{\mathrm{ML}} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Verify the maximum
The second derivative is $\ell''(\mu) = -n/\sigma^2 < 0$, so $\hat\mu_{\mathrm{ML}} = \bar{x}$ is the unique global maximum. The sample mean is the MLE.
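A small numerical sanity check of this derivation (a sketch on simulated data; the finite-difference step `h` is an arbitrary choice): the score should vanish at the sample mean, and the curvature there should be $-n/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0
x = rng.normal(loc=5.0, scale=np.sqrt(sigma2), size=100)
mu_hat = x.mean()

def score(mu):
    # d/dmu of the log-likelihood: sum_i (x_i - mu) / sigma2
    return np.sum(x - mu) / sigma2

print("score at MLE    :", score(mu_hat))        # ~ 0 (up to round-off)

# Finite-difference curvature of the log-likelihood at the MLE:
h = 1e-4
curvature = (score(mu_hat + h) - score(mu_hat - h)) / (2 * h)
print("curvature at MLE:", curvature)            # ~ -n / sigma2
print("-n / sigma2     :", -len(x) / sigma2)
```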
Example: Gaussian MLE: Joint Mean and Variance
Let $x_1, \dots, x_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown, $\theta = (\mu, \sigma^2)$. Find the joint MLE.
Log-likelihood of the pair
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Profile the mean
$\partial\ell/\partial\mu = 0$ gives $\hat\mu_{\mathrm{ML}} = \bar{x}$ regardless of $\sigma^2$. Substitute $\hat\mu_{\mathrm{ML}}$ back into $\ell$.
Differentiate in the variance
$\partial\ell/\partial\sigma^2 = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0$ yields
$$\hat\sigma^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Observe the bias
Note the divisor $n$ rather than $n-1$. The MLE of $\sigma^2$ is biased in finite samples, $\mathbb{E}[\hat\sigma^2_{\mathrm{ML}}] = \frac{n-1}{n}\sigma^2 < \sigma^2$, but asymptotically unbiased. This exemplifies a general pattern: the finite-sample bias of the MLE is $O(1/n)$, while its variance is $O(1/n)$, so the squared bias is $O(1/n^2)$ and the MSE is dominated by variance for large $n$.
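The bias is easy to see in simulation. A sketch comparing the divisor-$n$ MLE with the divisor-$(n-1)$ estimator over repeated samples (the true parameters and trial count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n = 0.0, 4.0, 10
n_trials = 100_000

samples = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(n_trials, n))
sigma2_mle = samples.var(axis=1, ddof=0)   # divisor n     (the MLE)
sigma2_unb = samples.var(axis=1, ddof=1)   # divisor n - 1 (unbiased)

print("true sigma^2          :", sigma2)
print("mean of MLE           :", sigma2_mle.mean())   # ~ (n-1)/n * sigma2 = 3.6
print("mean of unbiased est. :", sigma2_unb.mean())   # ~ 4.0
```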
Example: Exponential Rate MLE
Let $x_1, \dots, x_n$ be i.i.d. exponential with rate $\lambda > 0$: $p(x;\lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$. Find $\hat\lambda_{\mathrm{ML}}$.
Log-likelihood
$\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^{n} x_i$.
Score equation
$\ell'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$ gives $\hat\lambda_{\mathrm{ML}} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$.
Interpret
The MLE is the reciprocal of the sample mean. Since $\mathbb{E}[X] = 1/\lambda$, the estimator inverts the natural moment relationship. By the invariance property (Section 6.2), this is consistent with $\bar{x}$ being the MLE of the mean $1/\lambda$.
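A short numerical check (the true rate and sample size are illustrative; note that NumPy's `exponential` is parameterized by the scale $1/\lambda$, not the rate):

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 2.5                                      # true rate
x = rng.exponential(scale=1 / lam, size=1000)  # NumPy uses the mean 1/lambda

lam_hat = 1 / x.mean()   # MLE of the rate
mean_hat = x.mean()      # MLE of the mean 1/lambda, by invariance

print("true rate   :", lam)
print("MLE of rate :", lam_hat)
print("MLE of mean :", mean_hat, " vs 1/lambda =", 1 / lam)
```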
Example: Uniform Endpoint: a Pathological Case
Let $x_1, \dots, x_n$ be i.i.d. uniform on $[0, \theta]$, $\theta > 0$. Find the MLE and discuss why the standard regularity analysis breaks down.
Likelihood has a discontinuity
$p(x;\theta) = \frac{1}{\theta}\mathbf{1}\{0 \le x \le \theta\}$, so $L(\theta) = \theta^{-n}$ if $\theta \ge \max_i x_i$ and zero otherwise. The likelihood is maximized by making $\theta$ as small as possible while still covering all observations, giving $\hat\theta_{\mathrm{ML}} = \max_i x_i = x_{(n)}$.
Why standard theory fails
The parameter $\theta$ controls the support of the density, so $L(\theta)$ is not differentiable at $\theta = x_{(n)}$ and the support of $p(x;\theta)$ depends on $\theta$. The regularity conditions that give us the score-equation characterization and asymptotic normality at rate $\sqrt{n}$ fail.
Non-standard convergence rate
Here the MLE converges at rate $n$, not $\sqrt{n}$: $n(\theta - \hat\theta_{\mathrm{ML}}) \xrightarrow{d} \mathrm{Exp}(1/\theta)$. The limit distribution is exponential, not Gaussian. The CRLB machinery does not apply.
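A simulation sketch of the boundary MLE and its rate-$n$ convergence (the true $\theta$, sample sizes, and trial count are illustrative): the rescaled error $n(\theta - \hat\theta_{\mathrm{ML}})$ should stabilize around the exponential-limit mean $\theta$.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 3.0
n_trials = 10_000

for n in (10, 100, 1000):
    x = rng.uniform(0.0, theta, size=(n_trials, n))
    theta_hat = x.max(axis=1)   # MLE: the largest observation
    err = theta - theta_hat     # always positive: the MLE underestimates theta
    # Rescaled error n*(theta - theta_hat) is roughly Exp with mean theta:
    print(f"n = {n:5d}  mean error = {err.mean():.4f}  "
          f"mean of n*error = {(n * err).mean():.3f}  (limit: {theta})")
```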
Log-Likelihood Surface for $(\mu, \sigma^2)$
The joint log-likelihood as a function of $(\mu, \sigma^2)$ for an i.i.d. Gaussian sample. The MLE sits at the peak of the surface; the contours show how curvature relates to the Fisher information.
Score Function and the MLE
Plot of the score for the Gaussian mean model. The MLE is the zero-crossing of the score; the slope at the zero is minus the observed Fisher information.
Key Takeaway
The MLE is the zero of the score, a stationary point of the log-likelihood. Under regularity it is the global maximizer; when regularity fails (as with boundary-dependent supports), the stationarity characterization breaks and we must maximize the likelihood directly.
Quick Check
Which of the following statements about the likelihood function is correct?
It is a probability density on the parameter space $\Theta$.
It is the joint density of the data, viewed as a function of the parameter.
It equals the posterior density up to a multiplicative constant.
It is always maximized by the sample mean.
Exactly. The data are fixed, the parameter varies.
Common Mistake: Support-Dependent Distributions Break Regularity
Mistake:
Assuming that the score equation always characterizes the MLE, including for models like uniform on $[0, \theta]$, Pareto, or shifted exponential.
Correction:
When the support of $p(x;\theta)$ depends on $\theta$, the density is not differentiable in $\theta$ and the score equation has no interior solution. Maximize the likelihood directly; typically the MLE sits on the boundary and is an order statistic (min, max, etc.). Asymptotic normality and the CRLB do not apply.