The ML Principle

Why Maximum Likelihood?

Chapter 5 developed the Cramér-Rao bound as a universal benchmark for unbiased estimators but did not answer a fundamental practical question: given a parametric model and data, how do we actually produce an estimate? Sufficient statistics and Rao-Blackwellization require that we first guess an unbiased estimator. The maximum likelihood principle replaces that guesswork with a single, model-agnostic recipe: pick the parameter value that makes the observed data most probable. The resulting estimator is remarkably effective: under regularity conditions it is consistent, asymptotically unbiased, asymptotically efficient, and equivariant under reparameterization. For engineering problems with moderate sample sizes and well-posed models, ML is the default frequentist estimator.


Definition: Likelihood and Log-Likelihood

Let $\{f_\theta : \theta \in \Lambda\}$ be a parametric family of densities on $\mathcal{Y} \subseteq \mathbb{R}^n$ with $\Lambda \subseteq \mathbb{R}^m$. For a fixed observation $\mathbf{y} \in \mathcal{Y}$, the likelihood function is the map

$$\theta \;\longmapsto\; L_n(\theta) \;\triangleq\; f_\theta(\mathbf{y}),$$

and the log-likelihood function is

$$\ell_n(\theta) \;\triangleq\; \log L_n(\theta) \;=\; \log f_\theta(\mathbf{y}).$$

When the observations are i.i.d. with marginal $f_\theta$, the log-likelihood is a sum of per-sample contributions,

$$\ell_n(\theta) \;=\; \sum_{i=1}^n \log f_\theta(y_i).$$

The likelihood is not a probability density over $\theta$: integrating $L_n(\theta)$ over $\Lambda$ need not give one. It is simply the value of the sampling density evaluated at the observed data, viewed as a function of the parameter.
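To make the definition concrete, here is a minimal Python sketch (NumPy and SciPy assumed; the data and grid are invented for illustration) that evaluates $\ell_n(\theta)$ for an i.i.d. Gaussian sample over a grid of candidate means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)  # i.i.d. N(2, 1) sample

def log_likelihood(theta, y, sigma=1.0):
    # ell_n(theta) = sum_i log f_theta(y_i) for the Gaussian mean model
    return stats.norm.logpdf(y, loc=theta, scale=sigma).sum()

thetas = np.linspace(0.0, 4.0, 401)
ell = np.array([log_likelihood(t, y) for t in thetas])
print("grid maximizer:", thetas[ell.argmax()], "| sample mean:", y.mean())
```

The grid maximizer lands essentially at the sample mean, previewing the first example below.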

Definition: Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is any maximizer of the likelihood:

$$g_{\text{ml}}(\mathbf{y}) \;=\; \arg\max_{\theta \in \Lambda}\; f_\theta(\mathbf{y}) \;=\; \arg\max_{\theta \in \Lambda}\; \ell_n(\theta).$$

When the maximum is attained in the interior of $\Lambda$ and $\ell_n$ is differentiable, the MLE satisfies the score equation

$$\nabla_\theta \ell_n(\theta)\,\Big|_{\theta = g_{\text{ml}}(\mathbf{y})} \;=\; \mathbf{0}.$$

The MLE is defined up to ties: if several parameters achieve the same maximum likelihood, the definition is ambiguous. In practice, ties arise only on sets of measure zero for continuous models with identifiable parameters.
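When the score equation has no closed-form solution, the likelihood is maximized numerically. A minimal sketch, using a Gamma shape parameter as an illustrative model (not one of this chapter's examples; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(1)
y = rng.gamma(shape=3.0, scale=1.0, size=500)

def neg_ell(k):
    # Negative log-likelihood for Gamma(k, scale=1): f_k(y) = y^(k-1) e^(-y) / Gamma(k)
    return -np.sum((k - 1.0) * np.log(y) - y - gammaln(k))

res = minimize_scalar(neg_ell, bounds=(0.1, 20.0), method="bounded")
print("numerical MLE of the shape k:", res.x)  # should land near the true value 3
```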


Definition: Score Function

The score function for a single observation $Y$ is the parameter gradient of the log-density,

$$s(\theta; y) \;\triangleq\; \nabla_\theta \log f_\theta(y),$$

and for the full sample $\mathbf{y} = (y_1, \ldots, y_n)$ the total score is $s_n(\theta; \mathbf{y}) = \sum_{i=1}^n s(\theta; y_i)$.

Under regularity, the expected score is zero: $\mathbb{E}_\theta[s(\theta; Y)] = 0$. The variance of the score is the Fisher information $J(\theta) = \operatorname{Var}_\theta(s(\theta; Y))$.
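Both identities are easy to check by simulation. A minimal sketch for the Gaussian mean model, where $s(\theta; y) = (y - \theta)/\sigma^2$ and $J(\theta) = 1/\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma = 2.0, 1.0
Y = rng.normal(theta, sigma, size=100_000)

score = (Y - theta) / sigma**2  # s(theta; Y) evaluated at the true parameter

print("mean of the score (should be ~0):", score.mean())
print("variance of the score:", score.var(), "| Fisher info 1/sigma^2:", 1 / sigma**2)
```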

Example: Gaussian MLE: Known Variance, Unknown Mean

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known. Compute the MLE of $\theta$.
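A sketch of the standard computation: differentiate the Gaussian log-likelihood in $\theta$ and solve the score equation.

$$\ell_n(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \theta)^2, \qquad \frac{\partial \ell_n}{\partial \theta} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \theta).$$

Setting the score to zero gives $g_{\text{ml}}(\mathbf{y}) = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}$, the sample mean; the second derivative $-n/\sigma^2 < 0$ confirms a maximum.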

Example: Gaussian MLE: Joint Mean and Variance

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown, $\boldsymbol{\theta} = (\mu, \sigma^2)$. Find the joint MLE.
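A sketch of the standard computation: set both partial derivatives of $\ell_n(\mu, \sigma^2)$ to zero. The $\mu$-equation gives $\hat{\mu} = \bar{y}$ for every fixed $\sigma^2$; substituting into the $\sigma^2$-equation yields

$$\hat{\mu} = \bar{y}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2.$$

Note the $1/n$ normalization: the variance MLE is biased in finite samples but asymptotically unbiased, consistent with the general claims above.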

Example: Exponential Rate MLE

Let $Y_1, \ldots, Y_n$ be i.i.d. exponential with rate $\theta > 0$: $f_\theta(y) = \theta e^{-\theta y}$ for $y \geq 0$. Find $g_{\text{ml}}$.
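A sketch of the standard computation: $\ell_n(\theta) = n\log\theta - \theta\sum_{i=1}^n y_i$, so the score is $n/\theta - \sum_{i=1}^n y_i$, which vanishes at

$$g_{\text{ml}}(\mathbf{y}) = \frac{n}{\sum_{i=1}^n y_i} = \frac{1}{\bar{y}}.$$

The second derivative $-n/\theta^2 < 0$ confirms this is the global maximizer.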

Example: Uniform Endpoint: A Pathological Case

Let $Y_1, \ldots, Y_n$ be i.i.d. uniform on $[0, \theta]$, $\theta > 0$. Find the MLE and discuss why the standard regularity analysis breaks down.
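A sketch of the argument: the joint density is $f_\theta(\mathbf{y}) = \theta^{-n}\,\mathbf{1}\{\max_i y_i \le \theta\}$, which vanishes for $\theta < \max_i y_i$ and is strictly decreasing in $\theta$ beyond that point. The maximizer is therefore the boundary point

$$g_{\text{ml}}(\mathbf{y}) = \max_{1 \le i \le n} y_i,$$

and there is no interior stationary point: because the support of $f_\theta$ depends on $\theta$, the likelihood is not differentiable at $\theta = \max_i y_i$ and the score equation has no solution. This is exactly the failure mode flagged in the Common Mistake box at the end of the section.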

Log-Likelihood Surface for $\mathcal{N}(\mu, \sigma^2)$

The joint log-likelihood $\ell_n(\mu, \sigma^2)$ as a function of $(\mu, \sigma^2)$ for an i.i.d. Gaussian sample. The MLE sits at the peak of the surface; the contours show how curvature relates to the Fisher information.

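A matplotlib sketch that reproduces a static contour view of this surface (synthetic data; the star marks the joint MLE $(\hat{\mu}, \hat{\sigma}^2)$):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=50)
n = len(y)

mu = np.linspace(0.0, 2.0, 200)
var = np.linspace(0.3, 3.0, 200)
MU, VAR = np.meshgrid(mu, var)

# ell_n(mu, sigma^2) evaluated on the (mu, sigma^2) grid
ELL = (-n / 2 * np.log(2 * np.pi * VAR)
       - ((y[:, None, None] - MU) ** 2).sum(axis=0) / (2 * VAR))

plt.contour(MU, VAR, ELL, levels=30)
plt.plot(y.mean(), y.var(), "r*", markersize=12)  # np.var uses 1/n: the MLE
plt.xlabel(r"$\mu$"); plt.ylabel(r"$\sigma^2$")
plt.title(r"Gaussian log-likelihood surface $\ell_n(\mu, \sigma^2)$")
plt.show()
```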

Score Function and the MLE

Plot of the score $s_n(\theta; \mathbf{y}) = \partial \ell_n / \partial \theta$ for the Gaussian mean model. The MLE is the zero-crossing of the score; the slope at the zero is minus the observed Fisher information.

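A companion sketch: for the Gaussian mean model the score $s_n(\theta; \mathbf{y}) = \sum_i (y_i - \theta)/\sigma^2$ is linear in $\theta$ with slope $-n/\sigma^2$, so the zero-crossing and the observed Fisher information are both visible at a glance (synthetic data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n, sigma = 30, 1.0
y = rng.normal(0.5, sigma, size=n)

theta = np.linspace(-0.5, 1.5, 400)
score = (y.sum() - n * theta) / sigma**2  # s_n(theta; y), slope -n/sigma^2

plt.plot(theta, score)
plt.axhline(0.0, color="k", lw=0.8)
plt.axvline(y.mean(), ls="--", color="r")  # the MLE: zero-crossing of the score
plt.xlabel(r"$\theta$"); plt.ylabel(r"$s_n(\theta; \mathbf{y})$")
plt.show()
```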

Key Takeaway

The MLE is a zero of the score, i.e., a stationary point of the log-likelihood. Under regularity it is the global maximizer; when regularity fails (as with parameter-dependent supports), the stationarity characterization breaks down and we must maximize the likelihood directly.

Quick Check

Which of the following statements about the likelihood function is correct?

It is a probability density on the parameter space $\Lambda$.

It is the joint density of the data, viewed as a function of the parameter.

It equals the posterior density up to a multiplicative constant.

It is always maximized by the sample mean.

Common Mistake: Support-Dependent Distributions Break Regularity

Mistake:

Assuming that the score equation always characterizes the MLE, including for models like uniform on $[0, \theta]$, Pareto, or shifted exponential.

Correction:

When the support of $f_\theta$ depends on $\theta$, the density is not differentiable in $\theta$ and the score equation has no interior solution. Maximize the likelihood directly: typically the MLE sits on the boundary and is an order statistic (min, max, etc.). Asymptotic normality and the CRLB do not apply.