Computing the MLE
When the Score Equation Has No Closed Form
Gaussian-mean and exponential-rate MLEs fall out of the score equation directly because the log-likelihood is quadratic or log-concave in simple form. Most engineering models are not so kind — frequency estimation from sinusoidal signals, MIMO channel estimation with unknown phase, and mixture models all require iterative optimization. This section covers the standard numerical ML machinery: Newton-Raphson (uses the Hessian), Fisher scoring (replaces the Hessian with its expectation), and constrained MLE via Lagrange multipliers. The EM algorithm, which exploits hidden-variable structure, appears in Chapter 8.
Theorem: Gaussian Linear Model: Closed-Form MLE
Consider $\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$, $\mathbf{C}$ known and positive definite, and $\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}$ invertible. The MLE is $\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}$ with covariance $(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}$, achieving the Cramér-Rao bound exactly (not just asymptotically).
The log-likelihood is quadratic in $\boldsymbol{\theta}$, so the MLE is a weighted least-squares projection with the noise-whitened Gram matrix $\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}$. Gaussianity plus linearity collapses the iterative machinery into a single closed-form matrix inverse.
Write the log-likelihood
$$\ln p(\mathbf{x};\boldsymbol{\theta}) = -\tfrac{1}{2}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T \mathbf{C}^{-1} (\mathbf{x} - \mathbf{H}\boldsymbol{\theta}) + \text{const.}$$
Set the gradient to zero
Setting $\nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x};\boldsymbol{\theta}) = \mathbf{H}^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}) = \mathbf{0}$ gives the normal equations $\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\,\hat{\boldsymbol{\theta}} = \mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}$.
Invert and identify the FIM
Inverting yields the stated MLE. The Hessian is $-\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}$, independent of $\boldsymbol{\theta}$; hence the FIM $\mathbf{I}(\boldsymbol{\theta}) = \mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}$ is constant, and the CRLB is attained with equality.
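A minimal NumPy sketch of the theorem (the problem sizes and the AR(1)-style noise covariance are illustrative choices, not from the text): form the whitened normal equations and read off both the MLE and its CRLB-attaining covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes for illustration
n, p = 50, 3
H = rng.standard_normal((n, p))               # known observation matrix
theta_true = np.array([1.0, -2.0, 0.5])

# Known positive-definite noise covariance (AR(1)-style correlation, illustrative)
C = 0.5 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
w = rng.multivariate_normal(np.zeros(n), C)
x = H @ theta_true + w

# Closed-form MLE: theta_hat = (H^T C^{-1} H)^{-1} H^T C^{-1} x
Cinv_H = np.linalg.solve(C, H)                # C^{-1} H without forming C^{-1} explicitly
FIM = H.T @ Cinv_H                            # Fisher information matrix H^T C^{-1} H
theta_hat = np.linalg.solve(FIM, Cinv_H.T @ x)
cov_theta = np.linalg.inv(FIM)                # estimator covariance = CRLB

print("theta_hat:", theta_hat)
print("CRLB diagonal:", np.diag(cov_theta))
```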
Newton-Raphson for ML Estimation
Complexity: $O(Kp^3)$ with $K$ iterations and $p$ parameters. Newton-Raphson iterates $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - [\nabla^2_{\boldsymbol{\theta}} \ln p(\mathbf{x};\boldsymbol{\theta}^{(k)})]^{-1}\nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x};\boldsymbol{\theta}^{(k)})$ and has quadratic local convergence near a well-conditioned stationary point. The step uses the minus sign because we are maximizing: the Hessian of the log-likelihood is negative definite at a maximum, so the Newton step points uphill along the score. In regions where the Hessian is not negative definite the step can diverge; standard remedies are Levenberg-Marquardt damping, trust regions, or line search.
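A sketch of Newton-Raphson on a score equation with no closed-form root, using the Gamma shape parameter (with the scale profiled out) as an illustrative example not taken from the text; the moment estimator provides the initialization.

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.5, scale=1.7, size=400)   # synthetic data, true shape 2.5

n = x.size
log_xbar = np.log(x.mean())
mean_logx = np.log(x).mean()

def score(a):
    # Profile score d/d(alpha) of the Gamma log-likelihood with scale = xbar/alpha
    return n * (np.log(a) - digamma(a) + mean_logx - log_xbar)

def dscore(a):
    # Derivative of the profile score (observed curvature)
    return n * (1.0 / a - polygamma(1, a))

alpha = x.mean() ** 2 / x.var()                 # moment-based initialization
for k in range(20):
    step = score(alpha) / dscore(alpha)
    alpha_new = alpha - step                    # Newton-Raphson update
    if abs(alpha_new - alpha) < 1e-10:
        alpha = alpha_new
        break
    alpha = alpha_new

print("Newton iterations:", k + 1, " alpha_hat:", alpha, " scale_hat:", x.mean() / alpha)
```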
Fisher Scoring
Complexity: $O(Kp^3)$ with $K$ iterations and $p$ parameters. Fisher scoring replaces the Hessian by its negative expectation, the Fisher information matrix $\mathbf{I}(\boldsymbol{\theta}) = -\mathbb{E}[\nabla^2_{\boldsymbol{\theta}} \ln p(\mathbf{x};\boldsymbol{\theta})]$, giving the update $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + \mathbf{I}(\boldsymbol{\theta}^{(k)})^{-1}\nabla_{\boldsymbol{\theta}}\ln p(\mathbf{x};\boldsymbol{\theta}^{(k)})$. Since $\mathbf{I}(\boldsymbol{\theta})$ is positive definite by construction, the update is an ascent-type step that avoids the sign indefiniteness of Newton-Raphson. Near the optimum, the observed Hessian concentrates around its expectation $-\mathbf{I}(\boldsymbol{\theta})$ by the WLLN, so Newton and scoring agree locally. In exponential families (in the natural parameterization), the two methods are identical.
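A sketch of Fisher scoring for a canonical-link GLM, logistic regression, chosen here purely as an illustration (not an example from this section). Because the link is canonical, the observed Hessian equals $-\mathbf{I}(\boldsymbol{\beta})$, so the scoring iteration below is simultaneously the exact Newton iteration (this is the familiar IRLS recursion); the data and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta = np.zeros(p)                               # start at zero
for k in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))         # mean under current parameters
    score = X.T @ (y - mu)                       # gradient of the log-likelihood
    W = mu * (1.0 - mu)                          # Bernoulli variance function
    I_fisher = X.T @ (W[:, None] * X)            # FIM; equals -Hessian for the canonical link
    step = np.linalg.solve(I_fisher, score)
    beta = beta + step                           # scoring step == Newton step here
    if np.linalg.norm(step) < 1e-10:
        break

print("iterations:", k + 1, " beta_hat:", beta)
```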
Newton-Raphson vs Fisher Scoring
| Aspect | Newton-Raphson | Fisher Scoring |
|---|---|---|
| Curvature matrix | Observed Hessian $\nabla^2_{\boldsymbol{\theta}} \ln p(\mathbf{x};\boldsymbol{\theta})$ | Expected curvature $-\mathbf{I}(\boldsymbol{\theta}) = \mathbb{E}[\nabla^2_{\boldsymbol{\theta}} \ln p(\mathbf{x};\boldsymbol{\theta})]$ |
| Definiteness | Can be indefinite far from optimum | Positive definite everywhere (under regularity) |
| Local convergence rate | Quadratic | Quadratic (rate depends on how closely $\mathbf{I}(\boldsymbol{\theta})$ tracks the observed Hessian) |
| Sensitivity to outliers | Higher (data-dependent curvature) | Lower (population curvature) |
| Exponential family | Identical to scoring | Identical to Newton |
| Implementation effort | Compute $\nabla^2_{\boldsymbol{\theta}} \ln p$ analytically or numerically | Requires closed-form $\mathbf{I}(\boldsymbol{\theta})$ |
| Typical use | Generic nonlinear problems | GLMs, estimation in Caire's courses |
Example: Scoring Update for a Single-Parameter Exponential Family
Let $x_1, \dots, x_n$ be i.i.d. from a single-parameter exponential family $p(x;\theta) = h(x)\exp\{\theta\, T(x) - A(\theta)\}$. Derive the Fisher scoring update for $\theta$.
Score and Fisher information
The score is $s(\theta) = \sum_{i=1}^n T(x_i) - n\,A'(\theta)$. The per-sample Fisher information is $I_1(\theta) = A''(\theta) = \operatorname{Var}_\theta[T(x)]$. Since $\mathbb{E}_\theta[T(x)] = A'(\theta)$, the score simplifies to $s(\theta) = n(\bar{T} - A'(\theta))$ with $\bar{T} = \frac{1}{n}\sum_{i=1}^n T(x_i)$.
Scoring update
$$\theta^{(k+1)} = \theta^{(k)} + \frac{s(\theta^{(k)})}{n\,A''(\theta^{(k)})} = \theta^{(k)} + \frac{\bar{T} - A'(\theta^{(k)})}{A''(\theta^{(k)})}.$$
Interpret as moment matching
At convergence, the sample moment $\bar{T}$ equals the population moment $A'(\hat{\theta}) = \mathbb{E}_{\hat{\theta}}[T(x)]$. This recovers the classical moment-matching characterization of exponential-family MLEs.
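A concrete numerical instance, using the Poisson family in its natural parameterization as an illustration (my choice, not the text's): with $\theta = \ln\lambda$, $T(x) = x$, and $A(\theta) = e^\theta$, the scoring update becomes $\theta \leftarrow \theta + (\bar{x} - e^\theta)/e^\theta$, and the fixed point satisfies the moment-matching condition $e^{\hat\theta} = \bar{x}$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(lam=4.2, size=1000)    # synthetic counts; true natural parameter ln(4.2)
xbar = x.mean()                        # sample moment of T(x) = x

theta = 0.0                            # deliberately poor initialization (lambda = 1)
for k in range(50):
    A1 = np.exp(theta)                 # A'(theta)  = E[T(x)]   = lambda
    A2 = np.exp(theta)                 # A''(theta) = Var[T(x)] = per-sample FIM
    step = (xbar - A1) / A2            # Fisher scoring step
    theta += step
    if abs(step) < 1e-12:
        break

print("iterations:", k + 1)
print("theta_hat:", theta, " vs log(xbar):", np.log(xbar))   # moment matching: e^theta_hat = xbar
```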
Constrained MLE via Lagrange Multipliers
When the parameter must satisfy $g(\boldsymbol{\theta}) = \mathbf{0}$ (e.g. unit-norm beamforming vector, sum-to-one mixture weights), form the Lagrangian $\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \ln p(\mathbf{x};\boldsymbol{\theta}) + \boldsymbol{\lambda}^T g(\boldsymbol{\theta})$ and solve the KKT system $\nabla_{\boldsymbol{\theta}} \mathcal{L} = \mathbf{0}$, $g(\boldsymbol{\theta}) = \mathbf{0}$ jointly. For linear constraints this reduces to a constrained least squares with the same closed-form inverse structure as the Gaussian Linear Model theorem above.
Example: Constrained MLE: Gaussian Means Summing to Zero
Let $x_i \sim \mathcal{N}(\theta_i, \sigma^2)$ independently for $i = 1, \dots, p$, subject to $\sum_{i=1}^p \theta_i = 0$. Find the constrained MLE.
Unconstrained MLE
Without the constraint, $\hat{\theta}_i = x_i$ for each $i$.
Project onto the constraint
The constraint $\mathbf{1}^T\boldsymbol{\theta} = 0$ is linear with normal vector $\mathbf{1}$. The Lagrangian solution is the orthogonal projection of $\mathbf{x}$ onto the hyperplane $\{\boldsymbol{\theta} : \mathbf{1}^T\boldsymbol{\theta} = 0\}$, which subtracts the mean: $\hat{\theta}_i = x_i - \bar{x}$, where $\bar{x} = \frac{1}{p}\sum_{i=1}^p x_i$.
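A quick numerical check of the projection result (the data and $\sigma$ are arbitrary): compare the mean-subtraction formula against a general equality-constrained optimizer from SciPy.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
p, sigma = 6, 1.3
x = 2.0 * rng.normal(size=p)                      # one observation per mean theta_i

# Closed-form constrained MLE: orthogonal projection onto {theta : sum(theta) = 0}
theta_proj = x - x.mean()

# Numerical check: maximize the Gaussian log-likelihood subject to sum(theta) = 0
neg_loglik = lambda th: 0.5 * np.sum((x - th) ** 2) / sigma ** 2
res = minimize(neg_loglik, np.zeros(p), method="SLSQP",
               constraints=[{"type": "eq", "fun": lambda th: th.sum()}])

print("projection:", np.round(theta_proj, 4))
print("SLSQP     :", np.round(res.x, 4))
```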
Newton-Raphson Trajectory on a 1-D Log-Likelihood
Iterate Newton-Raphson on the log-likelihood of an i.i.d. Cauchy sample (location parameter). This likelihood is non-concave and multi-modal, illustrating why good initialization matters and how Newton can overshoot or converge to the wrong local maximum.
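A sketch of this experiment (the sample size, true location, and the three starting points are my own choices): raw Newton-Raphson on the Cauchy location log-likelihood, printing each trajectory so that overshoot and convergence to different stationary points are visible.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_cauchy(size=15) + 3.0        # small i.i.d. Cauchy sample, true location 3.0

def loglik(t):
    # Cauchy location log-likelihood (up to an additive constant)
    return -np.sum(np.log1p((x - t) ** 2))

def score(t):
    u = x - t
    return np.sum(2 * u / (1 + u ** 2))

def hess(t):
    u = x - t
    return np.sum(2 * (u ** 2 - 1) / (1 + u ** 2) ** 2)

for t0 in [np.median(x), 0.0, 10.0]:          # good vs. increasingly poor initializations
    t = t0
    print(f"\nstart {t0:7.3f}  ll = {loglik(t):9.4f}")
    for k in range(15):
        t = t - score(t) / hess(t)            # raw (undamped) Newton step
        print(f"  iter {k+1:2d}: theta = {t:9.4f}  ll = {loglik(t):9.4f}")
```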
Common Mistake: Newton-Raphson Can Diverge
Mistake:
Running Newton-Raphson from an arbitrary starting point and trusting the output without monitoring whether the log-likelihood $\ln p(\mathbf{x};\boldsymbol{\theta}^{(k)})$ is increasing.
Correction:
Newton's guarantees are local. Far from the optimum, the Hessian can be indefinite or near-singular and the step can overshoot or move toward a saddle. Use line search (Armijo backtracking), trust regions, or Fisher scoring (which uses the positive-definite FIM). Always log the log-likelihood values across iterations to catch divergence.
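A sketch of the Armijo safeguard on the same kind of Cauchy-location likelihood as above (the constant $c = 10^{-4}$ and shrink factor $0.5$ are conventional defaults, not values from the text): accept the Newton step only if it increases the log-likelihood sufficiently, otherwise backtrack; fall back to a gradient step when the Hessian is not negative.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_cauchy(size=15) + 3.0        # Cauchy-location setup, true location 3.0

def loglik(t):
    return -np.sum(np.log1p((x - t) ** 2))

def score(t):
    u = x - t
    return np.sum(2 * u / (1 + u ** 2))

def hess(t):
    u = x - t
    return np.sum(2 * (u ** 2 - 1) / (1 + u ** 2) ** 2)

theta, c, rho = 10.0, 1e-4, 0.5               # bad start; Armijo constant and shrink factor
for k in range(50):
    g, H = score(theta), hess(theta)
    d = -g / H if H < 0 else g                # Newton direction, or gradient ascent if H >= 0
    step = 1.0
    # Armijo backtracking: shrink until the log-likelihood increases enough
    while loglik(theta + step * d) < loglik(theta) + c * step * g * d:
        step *= rho
        if step < 1e-12:
            break
    theta += step * d
    if abs(step * d) < 1e-10:
        break

print("iterations:", k + 1, " theta_hat:", theta, " ll:", loglik(theta))
```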
Common Mistake: Multiple Local Maxima
Mistake:
Assuming the score equation has a unique solution and returning the first stationary point the optimizer finds.
Correction:
Non-concave log-likelihoods (Cauchy location, mixture models, sinusoid frequency, phase estimation) have multiple local maxima. Best practice is multi-start: run the optimizer from several initializations and retain the one with the highest log-likelihood. Grid-refinement searches (coarse grid, then Newton polish) are standard for frequency and DOA problems.
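The grid-then-polish recipe, sketched for single-tone frequency estimation of a complex exponential in white Gaussian noise (the signal length, frequency, and grid density are illustrative choices, not from the text). For this model the profile likelihood in the frequency is the periodogram, so the coarse stage is a periodogram peak search and the polish is a bounded one-dimensional optimization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
N, f0 = 128, 0.2137                            # samples and true frequency (cycles/sample)
n = np.arange(N)
x = np.exp(2j * np.pi * f0 * n)                # unit-amplitude complex exponential
x += (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)  # CN(0,1) noise

def neg_profile(f):
    # Negative profile likelihood over the amplitude: the (negated) periodogram
    return -np.abs(np.sum(x * np.exp(-2j * np.pi * f * n))) ** 2 / N

# Stage 1: coarse grid at 4x the FFT bin spacing
grid = np.arange(0.0, 1.0, 1.0 / (4 * N))
f_coarse = grid[np.argmin([neg_profile(f) for f in grid])]

# Stage 2: local polish within one coarse bin of the grid maximizer
res = minimize_scalar(neg_profile, bounds=(f_coarse - 1 / (4 * N), f_coarse + 1 / (4 * N)),
                      method="bounded")
print("coarse:", f_coarse, " refined:", res.x, " true:", f0)
```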
Practical Notes on Iterative MLE
In production receivers and estimators, iterative ML comes with several real-world constraints: (i) evaluate log-likelihoods in the log domain to avoid underflow; (ii) damp the Newton step when the Hessian is poorly conditioned (add $\epsilon\mathbf{I}$); (iii) cap iterations to enforce real-time constraints; (iv) warm-start with the previous frame's estimate in tracking applications. Fisher scoring is preferred when the FIM is available in closed form because its positive-definite curvature gives robustness without the need for damping heuristics. A damped-Newton sketch follows the checklist below.
- Log-domain likelihood to prevent underflow
- Maximum iterations bounded by latency budget
- Initialization from previous frame (tracking)
- Ridge regularization when FIM/Hessian is ill-conditioned
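A sketch combining the damping and ridge-regularization items above (the two-parameter Gaussian fit and the damping schedule are illustrative, not prescribed by the text): when the raw Newton step fails to increase the log-likelihood, add an increasing ridge $\epsilon\mathbf{I}$ to the negated Hessian, which interpolates toward a small gradient-ascent step in the Levenberg-Marquardt spirit.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=5.0, scale=2.0, size=200)
n = x.size

def loglik(th):                       # Gaussian log-likelihood in (mu, s = log sigma)
    mu, s = th
    return -n * s - 0.5 * np.sum((x - mu) ** 2) * np.exp(-2 * s)

def grad(th):
    mu, s = th
    u = x - mu
    return np.array([np.sum(u) * np.exp(-2 * s),
                     -n + np.sum(u ** 2) * np.exp(-2 * s)])

def hess(th):
    mu, s = th
    u = x - mu
    e = np.exp(-2 * s)
    return np.array([[-n * e,             -2 * np.sum(u) * e],
                     [-2 * np.sum(u) * e, -2 * np.sum(u ** 2) * e]])

theta = np.array([50.0, 3.0])         # deliberately bad start: Hessian is indefinite here
for k in range(100):
    g, H = grad(theta), hess(theta)
    eps = 0.0
    while True:                       # increase the ridge until the step improves the log-likelihood
        step = np.linalg.solve(-H + eps * np.eye(2), g)
        if loglik(theta + step) > loglik(theta) or eps > 1e8:
            break
        eps = 10 * eps if eps > 0 else 1e-4
    theta = theta + step
    if np.linalg.norm(step) < 1e-10:
        break

print("iterations:", k + 1, " mu_hat:", theta[0], " sigma_hat:", np.exp(theta[1]))
```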
Fisher-Scoring-Type Updates in Sparse Bayesian Learning
In the massive random-access problem, activity detection via sparse Bayesian learning (SBL) uses an EM algorithm whose M-step is a Fisher-scoring-style update on the hyperparameters (activity variances). The CommIT work of Fengler, Jung, and Caire analyzes the convergence and computational structure of these updates in the many-user regime. The connection to ML computation is direct: SBL is an ML-II procedure (maximizing the marginal likelihood in the hyperparameters), and the hyperparameter updates inherit the quadratic local convergence of Fisher scoring. Chapter 8 develops the EM framework in which these updates sit; this chapter explains the Fisher-scoring machinery that SBL uses at its core.
Key Takeaway
For linear-Gaussian models the MLE is closed form; otherwise, iterate. Newton-Raphson uses the observed Hessian and converges quadratically near the optimum but can diverge far from it. Fisher scoring replaces the Hessian by the FIM, whose guaranteed positive definiteness makes it the safer default whenever the FIM is available in closed form. Good initialization (grid search, moment estimator, previous frame) is essential for non-concave likelihoods.
Quick Check
Which statement about Fisher scoring vs Newton-Raphson is correct?
Fisher scoring never converges faster than Newton-Raphson.
In exponential families, the two algorithms are identical.
Newton-Raphson always uses a positive-definite curvature matrix.
Fisher scoring does not require computing derivatives of $\ln p(\mathbf{x};\boldsymbol{\theta})$.
Because in a natural-parameter exponential family the Hessian of the log-likelihood is data-independent and equals $-\mathbf{I}(\boldsymbol{\theta})$, the Newton and scoring steps coincide.