Computing the MLE

When the Score Equation Has No Closed Form

Gaussian-mean and exponential-rate MLEs fall out of the score equation directly because the log-likelihood is quadratic or concave with a simple closed-form stationary point. Most engineering models are not so kind: frequency estimation from sinusoidal signals, MIMO channel estimation with unknown phase, and mixture models all require iterative optimization. This section covers the standard numerical ML machinery: Newton-Raphson (uses the Hessian), Fisher scoring (replaces the Hessian with its expectation), and constrained MLE via Lagrange multipliers. The EM algorithm, which exploits hidden-variable structure, appears in Chapter 8.


Theorem: Gaussian Linear Model: Closed-Form MLE

Consider $\mathbf{Y} = \mathbf{A}\boldsymbol{\theta} + \mathbf{Z}$ with $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$, $\boldsymbol{\Sigma}$ known and positive definite, and $\mathbf{A} \in \mathbb{R}^{n\times m}$ with $\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A}$ invertible. The MLE is
$$\hat{\boldsymbol{\theta}}_{\text{ml}}(\mathbf{y}) = (\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A})^{-1} \mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{y},$$
with covariance $\operatorname{Var}(\hat{\boldsymbol{\theta}}_{\text{ml}}) = (\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A})^{-1}$, achieving the Cramér-Rao bound exactly (not just asymptotically).

The log-likelihood is quadratic in $\boldsymbol{\theta}$, so the MLE is a weighted least-squares projection built on the noise-whitened Gram matrix $\mathbf{A}^\mathsf{T}\boldsymbol{\Sigma}^{-1}\mathbf{A}$. Gaussianity plus linearity collapses the iterative machinery into a single closed-form matrix inverse.
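To make the closed form concrete, here is a minimal NumPy sketch; the dimensions, covariance, and synthetic data below are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n observations, m parameters (assumed values).
n, m = 50, 3
A = rng.standard_normal((n, m))
theta_true = np.array([1.0, -2.0, 0.5])
Sigma = 0.5 * np.eye(n)                      # known noise covariance
y = A @ theta_true + rng.multivariate_normal(np.zeros(n), Sigma)

# Whitened normal equations: theta_hat = (A^T Sigma^{-1} A)^{-1} A^T Sigma^{-1} y.
Sigma_inv = np.linalg.inv(Sigma)
gram = A.T @ Sigma_inv @ A                   # noise-whitened Gram matrix
theta_hat = np.linalg.solve(gram, A.T @ Sigma_inv @ y)

# The estimator covariance equals the CRB: (A^T Sigma^{-1} A)^{-1}.
crb = np.linalg.inv(gram)
print(theta_hat, np.sqrt(np.diag(crb)))
```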

Newton-Raphson for ML Estimation

Complexity: $O(K(c_g + c_H + m^3))$ for $K$ iterations and $m$ parameters, where $c_g$ and $c_H$ are the costs of one gradient and one Hessian evaluation.
Input: log-likelihood $\ell(\boldsymbol{\theta})$, initialization $\boldsymbol{\theta}^{(0)}$, tolerance $\varepsilon$, max iterations $K_{\max}$.
Output: MLE estimate $\hat{\boldsymbol{\theta}}$.
1. for $k = 0, 1, 2, \ldots, K_{\max}-1$ do
2.     Compute gradient $\mathbf{g}^{(k)} = \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{(k)})$
3.     Compute Hessian $\mathbf{H}^{(k)} = \nabla^2_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{(k)})$
4.     if $\mathbf{H}^{(k)}$ is not negative definite then damp or add ridge
5.     Update $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - (\mathbf{H}^{(k)})^{-1}\mathbf{g}^{(k)}$
6.     if $\|\boldsymbol{\theta}^{(k+1)} - \boldsymbol{\theta}^{(k)}\| < \varepsilon$ then break
7. end for
8. return $\boldsymbol{\theta}^{(k+1)}$

Newton-Raphson has quadratic local convergence near a well-conditioned stationary point. The step $-(\mathbf{H}^{(k)})^{-1}\mathbf{g}^{(k)}$ carries the minus sign because we are maximizing $\ell$, whose Hessian is negative definite at a maximum. In regions where the Hessian is not negative definite the step can diverge; standard remedies are Levenberg-Marquardt damping, trust regions, or line search.
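A minimal NumPy sketch of the damped update, following the pseudocode above; the function names, the ridge rule, and the eigenvalue check are illustrative choices rather than anything prescribed here.

```python
import numpy as np

def newton_raphson_mle(grad, hess, theta0, tol=1e-8, max_iter=50, ridge=1e-3):
    """Maximize a log-likelihood given callables for its gradient and Hessian."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        g = np.atleast_1d(grad(theta))
        H = np.atleast_2d(hess(theta))
        # Step 4 of the pseudocode: if H is not negative definite, shift its
        # eigenvalues below zero with a ridge so the step is an ascent direction.
        eigmax = np.max(np.linalg.eigvalsh(H))
        if eigmax > -ridge:
            H = H - (eigmax + ridge) * np.eye(H.shape[0])
        step = np.linalg.solve(H, g)          # H^{-1} g
        theta_new = theta - step              # theta - H^{-1} g (maximization sign)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```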

Fisher Scoring

Complexity: $O(K(c_s + c_{\text{FIM}} + m^3))$, where $c_s$ and $c_{\text{FIM}}$ are the costs of one score and one FIM evaluation.
Input: log-likelihood $\ell(\boldsymbol{\theta})$, Fisher information matrix $\mathbf{J}(\boldsymbol{\theta})$, initialization $\boldsymbol{\theta}^{(0)}$, tolerance $\varepsilon$.
Output: MLE estimate $\hat{\boldsymbol{\theta}}$.
1. for $k = 0, 1, \ldots$ do
2.     Compute score $\mathbf{s}^{(k)} = \nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}^{(k)})$
3.     Compute FIM $\mathbf{J}^{(k)} = \mathbf{J}(\boldsymbol{\theta}^{(k)})$
4.     Update $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + (\mathbf{J}^{(k)})^{-1}\mathbf{s}^{(k)}$
5.     if $\|\boldsymbol{\theta}^{(k+1)} - \boldsymbol{\theta}^{(k)}\| < \varepsilon$ then break
6. end for
7. return $\boldsymbol{\theta}^{(k+1)}$

Fisher scoring replaces the Hessian $\mathbf{H} = \nabla^2\ell$ by its negative expectation, the Fisher information matrix $\mathbf{J}$. Since $\mathbf{J}$ is positive definite by construction (under the usual regularity conditions), the update is always an ascent-like step and avoids the sign indefiniteness of Newton-Raphson. Near the optimum the observed Hessian concentrates around $-\mathbf{J}$ by the law of large numbers, so Newton and scoring agree locally. In exponential families with canonical parameterization the two methods are identical.
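As a concrete instance, here is a hedged sketch of Fisher scoring for logistic regression, the canonical-link GLM case where scoring and Newton coincide; the synthetic data and names are illustrative, not taken from the text.

```python
import numpy as np

def fisher_scoring_logistic(X, y, tol=1e-8, max_iter=100):
    """Fisher scoring (here identical to IRLS) for the logistic-regression MLE."""
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # model probabilities
        score = X.T @ (y - p)                  # gradient of the log-likelihood
        W = p * (1.0 - p)                      # per-sample information weights
        J = X.T @ (W[:, None] * X)             # Fisher information matrix
        step = np.linalg.solve(J, score)
        beta_new = beta + step                 # theta + J^{-1} s
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Illustrative usage on synthetic data (assumed, not from the text).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
beta_true = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
print(fisher_scoring_logistic(X, y))
```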

Newton-Raphson vs Fisher Scoring

| Aspect | Newton-Raphson | Fisher Scoring |
| --- | --- | --- |
| Curvature matrix | Observed Hessian $\mathbf{H} = \nabla^2 \ell$ | Expected Hessian $-\mathbf{J}(\boldsymbol{\theta})$ |
| Definiteness | Can be indefinite far from the optimum | Positive definite everywhere (under regularity) |
| Local convergence rate | Quadratic | Linear in general; nearly quadratic for large samples since $\mathbf{H} \approx -\mathbf{J}$ (exactly quadratic in canonical exponential families) |
| Sensitivity to outliers | Higher (data-dependent curvature) | Lower (population curvature) |
| Exponential family (canonical) | Identical to scoring | Identical to Newton |
| Implementation effort | Compute $\mathbf{H}$ analytically or numerically | Requires closed-form $\mathbf{J}$ |
| Typical use | Generic nonlinear problems | GLMs, estimation in Caire's courses |

Example: Scoring Update for a Single-Parameter Exponential Family

Let $Y_1, \ldots, Y_n$ be i.i.d. from a single-parameter exponential family $f_\theta(y) = h(y)\exp(\eta(\theta) T(y) - A(\theta))$. Derive the Fisher scoring update for $\theta$.
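A sketch of one route, restricted to the canonical case $\eta(\theta) = \theta$ for brevity: the score is $s(\theta) = \sum_{i=1}^{n} T(y_i) - n A'(\theta)$, using $\mathbb{E}_\theta[T(Y)] = A'(\theta)$, and the Fisher information is $J(\theta) = n A''(\theta) = n \operatorname{Var}_\theta(T(Y))$, so the scoring update reads
$$\theta^{(k+1)} = \theta^{(k)} + \frac{\sum_{i=1}^{n} T(y_i) - n A'(\theta^{(k)})}{n A''(\theta^{(k)})}.$$
For a general $\eta(\theta)$ the same steps go through with the chain rule applied to $\eta$.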

Constrained MLE via Lagrange Multipliers

When the parameter must satisfy $h(\boldsymbol{\theta}) = \mathbf{0}$ (e.g. a unit-norm beamforming vector, sum-to-one mixture weights), form the Lagrangian $\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \ell(\boldsymbol{\theta}) - \boldsymbol{\lambda}^\mathsf{T} h(\boldsymbol{\theta})$ and solve the KKT system $\nabla_{\boldsymbol{\theta}}\ell - \mathbf{J}_h^\mathsf{T}\boldsymbol{\lambda} = \mathbf{0}$, $h(\boldsymbol{\theta}) = \mathbf{0}$ jointly, where $\mathbf{J}_h$ is the Jacobian of the constraint map. For linear constraints this reduces to a constrained least-squares problem with the same closed-form inverse structure as the Gaussian linear model theorem above.
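For a linear constraint $\mathbf{C}\boldsymbol{\theta} = \mathbf{d}$ in the Gaussian linear model, the KKT system is itself linear and can be solved in one shot. A minimal sketch, with $\mathbf{C}$, $\mathbf{d}$, and the data chosen purely for illustration:

```python
import numpy as np

def constrained_gaussian_mle(A, Sigma, y, C, d):
    """Equality-constrained MLE for y = A theta + z, z ~ N(0, Sigma), C theta = d.
    Solves the linear KKT system [[G, C^T], [C, 0]] [theta; lam] = [A^T Sigma^{-1} y; d]."""
    Sigma_inv = np.linalg.inv(Sigma)
    G = A.T @ Sigma_inv @ A                       # whitened Gram matrix
    b = A.T @ Sigma_inv @ y
    p, m = C.shape[0], A.shape[1]
    KKT = np.block([[G, C.T], [C, np.zeros((p, p))]])
    rhs = np.concatenate([b, np.atleast_1d(d)])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:m]                                # theta_hat; sol[m:] are the multipliers

# Illustrative: three unit-variance Gaussian means constrained to sum to zero
# (the example that follows), with assumed data values.
y = np.array([2.0, -1.0, 4.0])
theta_hat = constrained_gaussian_mle(np.eye(3), np.eye(3), y, np.ones((1, 3)), 0.0)
print(theta_hat)   # equals y - mean(y)
```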

Example: Constrained MLE: Gaussian Means Summing to Zero

Let $Y_i \sim \mathcal{N}(\theta_i, 1)$ independently for $i = 1, 2, 3$, subject to $\theta_1 + \theta_2 + \theta_3 = 0$. Find the constrained MLE.
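A sketch of the solution via the Lagrangian: dropping constants, $\mathcal{L} = -\tfrac{1}{2}\sum_{i=1}^{3}(y_i - \theta_i)^2 - \lambda \sum_{i=1}^{3}\theta_i$. Setting $\partial\mathcal{L}/\partial\theta_i = (y_i - \theta_i) - \lambda = 0$ gives $\hat{\theta}_i = y_i - \lambda$; the constraint forces $\lambda = \bar{y}$, so $\hat{\theta}_i = y_i - \bar{y}$, i.e. the unconstrained MLE projected onto the zero-sum subspace.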

Newton-Raphson Trajectory on a 1-D Log-Likelihood

Iterate Newton-Raphson on the log-likelihood of an i.i.d. Cauchy sample (location parameter). This likelihood is non-concave and multi-modal, illustrating why good initialization matters and how Newton can overshoot or converge to the wrong local maximum.
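Since the interactive demo is not reproduced here, the following self-contained sketch runs the same experiment; the sample size, seed, and starting points are assumed values, so the exact trajectories will differ from the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_cauchy(20)          # i.i.d. standard Cauchy, true location 0 (assumed)

def loglik(theta):
    return -np.sum(np.log1p((y - theta) ** 2))

def score(theta):
    u = y - theta
    return np.sum(2.0 * u / (1.0 + u ** 2))

def hessian(theta):
    u = y - theta
    return np.sum(2.0 * (u ** 2 - 1.0) / (1.0 + u ** 2) ** 2)

# Pure (undamped) Newton from a good and a poor start; watch whether loglik increases.
for theta0 in (np.median(y), 8.0):
    theta = theta0
    print(f"start {theta0:.3f}")
    for k in range(8):
        theta = theta - score(theta) / hessian(theta)
        print(f"  iter {k + 1}: theta = {theta: .4f}   loglik = {loglik(theta): .3f}")
```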


Common Mistake: Newton-Raphson Can Diverge

Mistake:

Running Newton-Raphson from an arbitrary starting point and trusting the output without monitoring whether $\ell$ is increasing.

Correction:

Newton's guarantees are local. Far from the optimum, the Hessian can be indefinite or near-singular and the step can overshoot or move to a saddle. Use line search (Armijo backtracking), trust regions, or Fisher scoring (which uses the positive-definite FIM). Always log $\ell$ values across iterations to catch divergence.
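A minimal sketch of one safeguarded Newton step with Armijo backtracking; the function names and constants are illustrative, not a prescribed implementation.

```python
import numpy as np

def damped_newton_step(theta, loglik, grad, hess, beta=0.5, c=1e-4):
    """One Newton step with Armijo backtracking so the log-likelihood never decreases."""
    g = np.atleast_1d(grad(theta))
    H = np.atleast_2d(hess(theta))
    direction = -np.linalg.solve(H, g)           # full Newton direction
    if g @ direction <= 0:                       # not an ascent direction: fall back
        direction = g                            # to plain gradient ascent
    t, ll0 = 1.0, loglik(theta)
    # Shrink the step until the sufficient-increase (Armijo) condition holds.
    while loglik(theta + t * direction) < ll0 + c * t * (g @ direction):
        t *= beta
        if t < 1e-12:
            break
    return theta + t * direction
```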

Common Mistake: Multiple Local Maxima

Mistake:

Assuming the score equation has a unique solution and returning the first stationary point the optimizer finds.

Correction:

Non-concave log-likelihoods (Cauchy location, mixture models, sinusoid frequency, phase estimation) have multiple local maxima. Best practice is multi-start: run the optimizer from several initializations and retain the one with the highest $\ell$. Grid-refinement searches (coarse grid then Newton polish) are standard for frequency and DOA problems.
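A minimal sketch of the coarse-grid-then-Newton-polish recipe, using the Cauchy-location likelihood as the running example; the grid limits and data are assumed for illustration.

```python
import numpy as np

def multistart_newton(loglik, score, hessian, starts, n_iter=20):
    """Run undamped Newton from several starts and keep the highest-loglik result."""
    best_theta, best_ll = None, -np.inf
    for theta in starts:
        for _ in range(n_iter):
            H = hessian(theta)
            if H >= 0:               # not locally concave: stop polishing this start
                break
            theta = theta - score(theta) / H
        if loglik(theta) > best_ll:
            best_theta, best_ll = theta, loglik(theta)
    return best_theta, best_ll

# Coarse grid stage, then Newton polish (Cauchy location, assumed data).
rng = np.random.default_rng(3)
y = rng.standard_cauchy(20)
ll = lambda t: -np.sum(np.log1p((y - t) ** 2))
s = lambda t: np.sum(2.0 * (y - t) / (1.0 + (y - t) ** 2))
h = lambda t: np.sum(2.0 * ((y - t) ** 2 - 1.0) / (1.0 + (y - t) ** 2) ** 2)
grid = np.linspace(-10.0, 10.0, 21)
coarse = grid[np.argmax([ll(t) for t in grid])]       # best coarse-grid point
print(multistart_newton(ll, s, h, [coarse, np.median(y)]))
```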

⚠️ Engineering Note

Practical Notes on Iterative MLE

In production receivers and estimators, iterative ML comes with several real-world constraints: (i) evaluate log-likelihoods in the log domain to avoid underflow; (ii) damp the Newton step when the Hessian is poorly conditioned (add $\epsilon \mathbf{I}$); (iii) cap iterations to enforce real-time constraints; (iv) warm-start with the previous frame's estimate in tracking applications. Fisher scoring is preferred when the FIM is available in closed form because its positive-definite curvature gives robustness without the need for damping heuristics. A minimal log-domain sketch follows the constraints list below.

Practical Constraints
  • Log-domain likelihood to prevent underflow

  • Maximum iterations bounded by latency budget

  • Initialization from previous frame (tracking)

  • Ridge regularization when FIM/Hessian is ill-conditioned
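As referenced above, a minimal log-domain sketch: evaluating a mixture log-likelihood with a log-sum-exp instead of exponentiating densities directly. The component parameters and test point are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_loglik(y, weights, means, var=1.0):
    """Log-likelihood of a 1-D Gaussian mixture evaluated entirely in the log domain.
    Naive exp-then-log underflows for points far from every component."""
    y = np.asarray(y)[:, None]                               # shape (n, 1)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (y - np.asarray(means)[None, :]) ** 2 / var)   # (n, K)
    return np.sum(logsumexp(log_comp, axis=1))

# A point 40 sigma from every component: the direct density underflows to zero,
# while the log-domain value stays finite.
print(mixture_loglik([40.0], weights=[0.5, 0.5], means=[0.0, 1.0]))
```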

🎓 CommIT Contribution (2021)

Fisher-Scoring-Type Updates in Sparse Bayesian Learning

A. Fengler, P. Jung, G. Caire, IEEE Transactions on Information Theory

In the massive random-access problem, activity detection via sparse Bayesian learning (SBL) uses an EM algorithm whose M-step is a Fisher-scoring-style update on the hyperparameters (activity variances). The CommIT work of Fengler, Jung, and Caire analyzes the convergence and computational structure of these updates in the many-user regime. The connection to ML computation is direct: SBL is an ML-II procedure (maximizing the marginal likelihood in the hyperparameters), and the hyperparameter updates inherit the quadratic local convergence of Fisher scoring. Chapter 8 develops the EM framework in which these updates sit; this chapter explains the Fisher-scoring machinery that SBL uses at its core.

Tags: sparse-bayesian-learning, massive-access, em-fisher-scoring

Key Takeaway

For linear-Gaussian models the MLE is closed form; otherwise iterate. Newton-Raphson uses the observed Hessian and converges quadratically near the optimum but can diverge far from it. Fisher scoring replaces the Hessian by the FIM, which is always positive definite, making it the safer default when the FIM is available. Good initialization (grid search, moment estimator, previous frame) is essential for non-concave likelihoods.

Quick Check

Which statement about Fisher scoring vs Newton-Raphson is correct?

  • Fisher scoring never converges faster than Newton-Raphson.

  • In exponential families, the two algorithms are identical.

  • Newton-Raphson always uses a positive-definite curvature matrix.

  • Fisher scoring does not require computing derivatives of $\ell$.