Computing the MLE

When the Score Equation Has No Closed Form

Gaussian-mean and exponential-rate MLEs fall out of the score equation directly because the log-likelihood is quadratic or concave with a simple closed-form stationary point. Most engineering models are not so kind: frequency estimation from sinusoidal signals, MIMO channel estimation with unknown phase, and mixture models all require iterative optimization. This section covers the standard numerical ML machinery: Newton-Raphson (uses the Hessian), Fisher scoring (replaces the Hessian with its expectation), and constrained MLE via Lagrange multipliers. The EM algorithm, which exploits hidden-variable structure, appears in Chapter 8.


Theorem: Gaussian Linear Model: Closed-Form MLE

Consider $\mathbf{Y} = \mathbf{A}\boldsymbol{\theta} + \mathbf{Z}$ with $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$, $\boldsymbol{\Sigma}$ known and positive definite, and $\mathbf{A} \in \mathbb{R}^{n\times m}$ with $\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A}$ invertible. The MLE is
$$\hat{\boldsymbol{\theta}}_{\text{ml}}(\mathbf{y}) = (\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A})^{-1} \mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{y},$$
with covariance $\operatorname{Var}(\hat{\boldsymbol{\theta}}_{\text{ml}}) = (\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A})^{-1}$, achieving the Cramér-Rao bound exactly (not just asymptotically).

The log-likelihood is quadratic in $\boldsymbol{\theta}$, so the MLE is a weighted least-squares projection built on the noise-whitened Gram matrix $\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A}$. Gaussianity plus linearity collapses the iterative machinery into a single closed-form matrix inverse.
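To make the closed form concrete, here is a minimal NumPy sketch; the dimensions, covariance, and synthetic data below are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n observations, m parameters (assumed values).
n, m = 50, 3
A = rng.standard_normal((n, m))
theta_true = np.array([1.0, -2.0, 0.5])
Sigma = 0.5 * np.eye(n)                      # known noise covariance
y = A @ theta_true + rng.multivariate_normal(np.zeros(n), Sigma)

# Whitened normal equations: theta_hat = (A^T Sigma^{-1} A)^{-1} A^T Sigma^{-1} y.
Sigma_inv = np.linalg.inv(Sigma)
gram = A.T @ Sigma_inv @ A                   # noise-whitened Gram matrix
theta_hat = np.linalg.solve(gram, A.T @ Sigma_inv @ y)

# The estimator covariance equals the CRB: (A^T Sigma^{-1} A)^{-1}.
crb = np.linalg.inv(gram)
print(theta_hat, np.sqrt(np.diag(crb)))
```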

Newton-Raphson for ML Estimation

Complexity: $O(K(c_g + c_H + m^3))$ for $K$ iterations and $m$ parameters, where $c_g$ and $c_H$ are the costs of one gradient and one Hessian evaluation.
Input: log-likelihood $\ell(\boldsymbol{\theta})$, initialization $\boldsymbol{\theta}^{(0)}$, tolerance $\varepsilon$, max iterations $K_{\max}$.
Output: MLE estimate $\hat{\boldsymbol{\theta}}$.
1. for $k = 0, 1, 2, \ldots, K_{\max}-1$ do
2.     Compute gradient $\mathbf{g}^{(k)} = \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{(k)})$
3.     Compute Hessian $\mathbf{H}^{(k)} = \nabla^2_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{(k)})$
4.     if $\mathbf{H}^{(k)}$ is not negative definite then damp or add ridge
5.     Update $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - (\mathbf{H}^{(k)})^{-1}\mathbf{g}^{(k)}$
6.     if $\|\boldsymbol{\theta}^{(k+1)} - \boldsymbol{\theta}^{(k)}\| < \varepsilon$ then break
7. end for
8. return $\boldsymbol{\theta}^{(k+1)}$

Newton-Raphson has quadratic local convergence near a well-conditioned stationary point. The step $-(\mathbf{H}^{(k)})^{-1}\mathbf{g}^{(k)}$ carries the minus sign because we are maximizing $\ell$, whose Hessian is negative definite at a maximum. In regions where the Hessian is not negative definite the step can diverge; standard remedies are Levenberg-Marquardt damping, trust regions, or line search.
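A minimal NumPy sketch of the damped update, following the pseudocode above; the function names, the ridge rule, and the eigenvalue check are illustrative choices rather than anything prescribed here.

```python
import numpy as np

def newton_raphson_mle(grad, hess, theta0, tol=1e-8, max_iter=50, ridge=1e-3):
    """Maximize a log-likelihood given callables for its gradient and Hessian."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        g = np.atleast_1d(grad(theta))
        H = np.atleast_2d(hess(theta))
        # Step 4 of the pseudocode: if H is not negative definite, shift its
        # eigenvalues below zero with a ridge so the step is an ascent direction.
        eigmax = np.max(np.linalg.eigvalsh(H))
        if eigmax > -ridge:
            H = H - (eigmax + ridge) * np.eye(H.shape[0])
        step = np.linalg.solve(H, g)          # H^{-1} g
        theta_new = theta - step              # theta - H^{-1} g (maximization sign)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```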

Fisher Scoring

Complexity: $O(K(c_s + c_{\text{FIM}} + m^3))$, where $c_s$ and $c_{\text{FIM}}$ are the costs of one score and one FIM evaluation.
Input: log-likelihood $\ell(\boldsymbol{\theta})$, Fisher information matrix $\mathbf{J}(\boldsymbol{\theta})$, initialization $\boldsymbol{\theta}^{(0)}$, tolerance $\varepsilon$.
Output: MLE estimate $\hat{\boldsymbol{\theta}}$.
1. for $k = 0, 1, \ldots$ do
2.     Compute score $\mathbf{s}^{(k)} = \nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}^{(k)})$
3.     Compute FIM $\mathbf{J}^{(k)} = \mathbf{J}(\boldsymbol{\theta}^{(k)})$
4.     Update $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + (\mathbf{J}^{(k)})^{-1}\mathbf{s}^{(k)}$
5.     if $\|\boldsymbol{\theta}^{(k+1)} - \boldsymbol{\theta}^{(k)}\| < \varepsilon$ then break
6. end for
7. return $\boldsymbol{\theta}^{(k+1)}$

Fisher scoring replaces the Hessian $\mathbf{H} = \nabla^2\ell$ by its negative expectation, the Fisher information matrix $\mathbf{J}$. Since $\mathbf{J}$ is positive definite by construction (under the usual regularity conditions), the update is always an ascent-like step and avoids the sign indefiniteness of Newton-Raphson. Near the optimum the observed Hessian concentrates around $-\mathbf{J}$ by the law of large numbers, so Newton and scoring agree locally. In exponential families with canonical parameterization the two methods are identical.
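As a concrete instance, here is a hedged sketch of Fisher scoring for logistic regression, the canonical-link GLM case where scoring and Newton coincide; the synthetic data and names are illustrative, not taken from the text.

```python
import numpy as np

def fisher_scoring_logistic(X, y, tol=1e-8, max_iter=100):
    """Fisher scoring (here identical to IRLS) for the logistic-regression MLE."""
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # model probabilities
        score = X.T @ (y - p)                  # gradient of the log-likelihood
        W = p * (1.0 - p)                      # per-sample information weights
        J = X.T @ (W[:, None] * X)             # Fisher information matrix
        step = np.linalg.solve(J, score)
        beta_new = beta + step                 # theta + J^{-1} s
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Illustrative usage on synthetic data (assumed, not from the text).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
beta_true = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
print(fisher_scoring_logistic(X, y))
```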

Newton-Raphson vs Fisher Scoring

| Aspect | Newton-Raphson | Fisher Scoring |
| --- | --- | --- |
| Curvature matrix | Observed Hessian $\mathbf{H} = \nabla^2 \ell$ | Expected Hessian $-\mathbf{J}(\boldsymbol{\theta})$ |
| Definiteness | Can be indefinite far from the optimum | Positive definite everywhere (under regularity) |
| Local convergence rate | Quadratic | Linear in general; nearly quadratic for large samples since $\mathbf{H} \approx -\mathbf{J}$ (exactly quadratic in canonical exponential families) |
| Sensitivity to outliers | Higher (data-dependent curvature) | Lower (population curvature) |
| Exponential family (canonical) | Identical to scoring | Identical to Newton |
| Implementation effort | Compute $\mathbf{H}$ analytically or numerically | Requires closed-form $\mathbf{J}$ |
| Typical use | Generic nonlinear problems | GLMs, estimation in Caire's courses |

Example: Scoring Update for a Single-Parameter Exponential Family

Let $Y_1, \ldots, Y_n$ be i.i.d. from a single-parameter exponential family $f_\theta(y) = h(y)\exp(\eta(\theta) T(y) - A(\theta))$. Derive the Fisher scoring update for $\theta$.
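A sketch of one route, restricted to the canonical case $\eta(\theta) = \theta$ for brevity: the score is $s(\theta) = \sum_{i=1}^{n} T(y_i) - n A'(\theta)$, using $\mathbb{E}_\theta[T(Y)] = A'(\theta)$, and the Fisher information is $J(\theta) = n A''(\theta) = n \operatorname{Var}_\theta(T(Y))$, so the scoring update reads
$$\theta^{(k+1)} = \theta^{(k)} + \frac{\sum_{i=1}^{n} T(y_i) - n A'(\theta^{(k)})}{n A''(\theta^{(k)})}.$$
For a general $\eta(\theta)$ the same steps go through with the chain rule applied to $\eta$.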

Constrained MLE via Lagrange Multipliers

When the parameter must satisfy $h(\boldsymbol{\theta}) = \mathbf{0}$ (e.g. a unit-norm beamforming vector, sum-to-one mixture weights), form the Lagrangian $\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \ell(\boldsymbol{\theta}) - \boldsymbol{\lambda}^\mathsf{T} h(\boldsymbol{\theta})$ and solve the KKT system $\nabla_{\boldsymbol{\theta}}\ell - \mathbf{J}_h^\mathsf{T}\boldsymbol{\lambda} = \mathbf{0}$, $h(\boldsymbol{\theta}) = \mathbf{0}$ jointly, where $\mathbf{J}_h$ is the Jacobian of the constraint map. For linear constraints this reduces to a constrained least-squares problem with the same closed-form inverse structure as the Gaussian linear model theorem above.
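For a linear constraint $\mathbf{C}\boldsymbol{\theta} = \mathbf{d}$ in the Gaussian linear model, the KKT system is itself linear and can be solved in one shot. A minimal sketch, with $\mathbf{C}$, $\mathbf{d}$, and the data chosen purely for illustration:

```python
import numpy as np

def constrained_gaussian_mle(A, Sigma, y, C, d):
    """Equality-constrained MLE for y = A theta + z, z ~ N(0, Sigma), C theta = d.
    Solves the linear KKT system [[G, C^T], [C, 0]] [theta; lam] = [A^T Sigma^{-1} y; d]."""
    Sigma_inv = np.linalg.inv(Sigma)
    G = A.T @ Sigma_inv @ A                       # whitened Gram matrix
    b = A.T @ Sigma_inv @ y
    p, m = C.shape[0], A.shape[1]
    KKT = np.block([[G, C.T], [C, np.zeros((p, p))]])
    rhs = np.concatenate([b, np.atleast_1d(d)])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:m]                                # theta_hat; sol[m:] are the multipliers

# Illustrative: three unit-variance Gaussian means constrained to sum to zero
# (the example that follows), with assumed data values.
y = np.array([2.0, -1.0, 4.0])
theta_hat = constrained_gaussian_mle(np.eye(3), np.eye(3), y, np.ones((1, 3)), 0.0)
print(theta_hat)   # equals y - mean(y)
```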

Example: Constrained MLE: Gaussian Means Summing to Zero

Let $Y_i \sim \mathcal{N}(\theta_i, 1)$ independently for $i = 1, 2, 3$, subject to $\theta_1 + \theta_2 + \theta_3 = 0$. Find the constrained MLE.
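A sketch of the solution via the Lagrangian: dropping constants, $\mathcal{L} = -\tfrac{1}{2}\sum_{i=1}^{3}(y_i - \theta_i)^2 - \lambda \sum_{i=1}^{3}\theta_i$. Setting $\partial\mathcal{L}/\partial\theta_i = (y_i - \theta_i) - \lambda = 0$ gives $\hat{\theta}_i = y_i - \lambda$; the constraint forces $\lambda = \bar{y}$, so $\hat{\theta}_i = y_i - \bar{y}$, i.e. the unconstrained MLE projected onto the zero-sum subspace.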

Newton-Raphson Trajectory on a 1-D Log-Likelihood

Iterate Newton-Raphson on the log-likelihood of an i.i.d. Cauchy sample (location parameter). This likelihood is non-concave and multi-modal, illustrating why good initialization matters and how Newton can overshoot or converge to the wrong local maximum.
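Since the interactive demo is not reproduced here, the following self-contained sketch runs the same experiment; the sample size, seed, and starting points are assumed values, so the exact trajectories will differ from the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_cauchy(20)          # i.i.d. standard Cauchy, true location 0 (assumed)

def loglik(theta):
    return -np.sum(np.log1p((y - theta) ** 2))

def score(theta):
    u = y - theta
    return np.sum(2.0 * u / (1.0 + u ** 2))

def hessian(theta):
    u = y - theta
    return np.sum(2.0 * (u ** 2 - 1.0) / (1.0 + u ** 2) ** 2)

# Pure (undamped) Newton from a good and a poor start; watch whether loglik increases.
for theta0 in (np.median(y), 8.0):
    theta = theta0
    print(f"start {theta0:.3f}")
    for k in range(8):
        theta = theta - score(theta) / hessian(theta)
        print(f"  iter {k + 1}: theta = {theta: .4f}   loglik = {loglik(theta): .3f}")
```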


Common Mistake: Newton-Raphson Can Diverge

Mistake:

Running Newton-Raphson from an arbitrary starting point and trusting the output without monitoring whether $\ell$ is increasing.

Correction:

Newton's guarantees are local. Far from the optimum, the Hessian can be indefinite or near-singular and the step can overshoot or move to a saddle. Use line search (Armijo backtracking), trust regions, or Fisher scoring (which uses the positive-definite FIM). Always log $\ell$ values across iterations to catch divergence.
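A minimal sketch of one safeguarded Newton step with Armijo backtracking; the function names and constants are illustrative, not a prescribed implementation.

```python
import numpy as np

def damped_newton_step(theta, loglik, grad, hess, beta=0.5, c=1e-4):
    """One Newton step with Armijo backtracking so the log-likelihood never decreases."""
    g = np.atleast_1d(grad(theta))
    H = np.atleast_2d(hess(theta))
    direction = -np.linalg.solve(H, g)           # full Newton direction
    if g @ direction <= 0:                       # not an ascent direction: fall back
        direction = g                            # to plain gradient ascent
    t, ll0 = 1.0, loglik(theta)
    # Shrink the step until the sufficient-increase (Armijo) condition holds.
    while loglik(theta + t * direction) < ll0 + c * t * (g @ direction):
        t *= beta
        if t < 1e-12:
            break
    return theta + t * direction
```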

Common Mistake: Multiple Local Maxima

Mistake:

Assuming the score equation has a unique solution and returning the first stationary point the optimizer finds.

Correction:

Non-concave log-likelihoods (Cauchy location, mixture models, sinusoid frequency, phase estimation) have multiple local maxima. Best practice is multi-start: run the optimizer from several initializations and retain the one with the highest $\ell$. Grid-refinement searches (coarse grid then Newton polish) are standard for frequency and DOA problems.
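A minimal sketch of the coarse-grid-then-Newton-polish recipe, using the Cauchy-location likelihood as the running example; the grid limits and data are assumed for illustration.

```python
import numpy as np

def multistart_newton(loglik, score, hessian, starts, n_iter=20):
    """Run undamped Newton from several starts and keep the highest-loglik result."""
    best_theta, best_ll = None, -np.inf
    for theta in starts:
        for _ in range(n_iter):
            H = hessian(theta)
            if H >= 0:               # not locally concave: stop polishing this start
                break
            theta = theta - score(theta) / H
        if loglik(theta) > best_ll:
            best_theta, best_ll = theta, loglik(theta)
    return best_theta, best_ll

# Coarse grid stage, then Newton polish (Cauchy location, assumed data).
rng = np.random.default_rng(3)
y = rng.standard_cauchy(20)
ll = lambda t: -np.sum(np.log1p((y - t) ** 2))
s = lambda t: np.sum(2.0 * (y - t) / (1.0 + (y - t) ** 2))
h = lambda t: np.sum(2.0 * ((y - t) ** 2 - 1.0) / (1.0 + (y - t) ** 2) ** 2)
grid = np.linspace(-10.0, 10.0, 21)
coarse = grid[np.argmax([ll(t) for t in grid])]       # best coarse-grid point
print(multistart_newton(ll, s, h, [coarse, np.median(y)]))
```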

⚠️ Engineering Note

Practical Notes on Iterative MLE

In production receivers and estimators, iterative ML comes with several real-world constraints: (i) evaluate log-likelihoods in the log domain to avoid underflow; (ii) damp the Newton step when the Hessian is poorly conditioned (add $\epsilon \mathbf{I}$); (iii) cap iterations to enforce real-time constraints; (iv) warm-start with the previous frame's estimate in tracking applications. Fisher scoring is preferred when the FIM is available in closed form because its positive-definite curvature gives robustness without the need for damping heuristics. A minimal log-domain sketch follows the constraints list below.

Practical Constraints
  • Log-domain likelihood to prevent underflow

  • Maximum iterations bounded by latency budget

  • Initialization from previous frame (tracking)

  • Ridge regularization when FIM/Hessian is ill-conditioned
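As referenced above, a minimal log-domain sketch: evaluating a mixture log-likelihood with a log-sum-exp instead of exponentiating densities directly. The component parameters and test point are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_loglik(y, weights, means, var=1.0):
    """Log-likelihood of a 1-D Gaussian mixture evaluated entirely in the log domain.
    Naive exp-then-log underflows for points far from every component."""
    y = np.asarray(y)[:, None]                               # shape (n, 1)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (y - np.asarray(means)[None, :]) ** 2 / var)   # (n, K)
    return np.sum(logsumexp(log_comp, axis=1))

# A point 40 sigma from every component: the direct density underflows to zero,
# while the log-domain value stays finite.
print(mixture_loglik([40.0], weights=[0.5, 0.5], means=[0.0, 1.0]))
```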

🎓 CommIT Contribution (2021)

Fisher-Scoring-Type Updates in Sparse Bayesian Learning

A. Fengler, P. Jung, G. Caire, IEEE Transactions on Information Theory

In the massive random-access problem, activity detection via sparse Bayesian learning (SBL) uses an EM algorithm whose M-step is a Fisher-scoring-style update on the hyperparameters (activity variances). The CommIT work of Fengler, Jung, and Caire analyzes the convergence and computational structure of these updates in the many-user regime. The connection to ML computation is direct: SBL is an ML-II procedure (maximizing the marginal likelihood in the hyperparameters), and the hyperparameter updates inherit the quadratic local convergence of Fisher scoring. Chapter 8 develops the EM framework in which these updates sit; this chapter explains the Fisher-scoring machinery that SBL uses at its core.

Tags: sparse-bayesian-learning, massive-access, em-fisher-scoring

Key Takeaway

For linear-Gaussian models the MLE is closed form; otherwise iterate. Newton-Raphson uses the observed Hessian and converges quadratically near the optimum but can diverge far from it. Fisher scoring replaces the Hessian by the FIM, which is always positive definite, making it the safer default when the FIM is available. Good initialization (grid search, moment estimator, previous frame) is essential for non-concave likelihoods.

Quick Check

Which statement about Fisher scoring vs Newton-Raphson is correct?

  • Fisher scoring never converges faster than Newton-Raphson.

  • In exponential families, the two algorithms are identical.

  • Newton-Raphson always uses a positive-definite curvature matrix.

  • Fisher scoring does not require computing derivatives of $\ell$.