Prerequisites & Notation

Before You Begin

This chapter takes a step back from the classical Gaussian playbook that dominated Chapters 8–12 and asks two uncomfortable questions: What if the noise is not Gaussian? and What if we do not know the model at all? Robust estimation answers the first by controlling worst-case damage from outliers; non-parametric and data-driven methods answer the second by letting the data do the modelling. The prerequisites are standard: MAP/MMSE estimation, basic convex optimisation, and comfort with the idea of a regularized empirical risk.

  • MAP, MMSE, and the likelihood function (Ch 8, Ch 9)

    Self-check: Why does the Gaussian assumption lead to a least-squares estimator, and how does that change if the noise is Laplacian?

  • Convex functions, gradients, and subdifferentials

    Self-check: Can you compute the derivative of the Huber loss at $u = \delta$ and verify that it is continuous there? (A numerical check appears after this list.)

  • Kernel functions and inner-product spaces

    Self-check: Why is $K(\mathbf{x},\mathbf{y}) = \exp(-\|\mathbf{x}-\mathbf{y}\|^2/(2h^2))$ positive-definite? (A numerical probe appears after this list.)

  • Empirical risk minimization (Ch 21)

    Self-check: What is the difference between empirical risk and population risk, and when do they coincide?

  • Gradient descent and backpropagation

    Self-check: Can you write one step of SGD for a scalar linear regression and identify the step size? (See the sketch after this list.)
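
To accompany the Huber self-check above, here is a minimal NumPy sketch; the threshold $\delta = 1.5$ and the function names are illustrative choices, not the chapter's. It compares a central finite difference of the loss at $u = \delta$ against the score $\psi$ evaluated on both sides of the transition:

```python
import numpy as np

DELTA = 1.5  # illustrative threshold; any delta > 0 behaves the same

def huber_loss(u, delta=DELTA):
    """Huber rho(u): quadratic for |u| <= delta, linear beyond."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * (np.abs(u) - 0.5 * delta))

def huber_score(u, delta=DELTA):
    """Score psi(u) = rho'(u): the identity on [-delta, delta], clipped outside."""
    return np.clip(u, -delta, delta)

# Central finite difference of rho at u = delta matches psi(delta) = delta,
# and psi agrees from both sides of the transition: rho' is continuous there.
eps = 1e-6
numeric = (huber_loss(DELTA + eps) - huber_loss(DELTA - eps)) / (2 * eps)
print(numeric, huber_score(DELTA - 1e-9), huber_score(DELTA + 1e-9))
```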
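The kernel self-check has a classical answer (the Gaussian kernel is shift-invariant with a non-negative Fourier transform, so Bochner's theorem applies), but it can also be probed numerically: the Gram matrix built from a positive-definite kernel on any finite point set must be positive semi-definite. A sketch with arbitrary random points and $h = 1$:

```python
import numpy as np

def gaussian_gram(X, h=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 arbitrary points in R^3
eigs = np.linalg.eigvalsh(gaussian_gram(X))
print(eigs.min())                     # non-negative up to round-off: PSD Gram
```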
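And for the SGD self-check: one explicit update for a scalar linear model under squared loss. The step size $\mu = 0.01$, the zero initialization, and the single random sample are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, theta1 = 0.0, 0.0     # parameters of the model y_hat = theta1 * x + theta0
mu = 0.01                     # the step size (learning rate)

x, y = rng.normal(), rng.normal()    # a single arbitrary training sample

# One SGD step on the per-sample squared loss 0.5 * (y - y_hat)**2.
residual = y - (theta1 * x + theta0)
theta1 += mu * residual * x   # -d(loss)/d(theta1) = residual * x
theta0 += mu * residual       # -d(loss)/d(theta0) = residual
print(theta1, theta0)         # parameters after one step
```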

Notation for This Chapter

Symbols used throughout Chapter 23. Robust estimation introduces the loss $\rho$, the score $\psi$, the influence function, and the breakdown point; kernel methods introduce the bandwidth $h$ and the Nadaraya–Watson weights (a small worked sketch follows the table); deep learning adds the network parameters $\boldsymbol{\theta}$ and the estimator map $g_{\boldsymbol{\theta}}$.

| Symbol | Meaning | Introduced |
| --- | --- | --- |
| $\rho(u)$ | Loss function applied to the residual $u = y - \hat{y}$ | s01 |
| $\psi(u) = \rho'(u)$ | Score (influence) function, the derivative of the loss | s01 |
| $\delta$ | Huber transition threshold between the quadratic and linear regimes | s01 |
| $\text{IF}(y; T, F)$ | Influence function of functional $T$ at distribution $F$, evaluated at $y$ | s01 |
| $\varepsilon^\star$ | Asymptotic breakdown point of an estimator | s01 |
| $\hat{f}_h(x)$ | Kernel density estimate with bandwidth $h$ | s02 |
| $K(\cdot)$ | Kernel function (non-negative, integrates to one) | s02 |
| $\hat{m}_h(x)$ | Nadaraya–Watson regression estimate | s02 |
| $\mathcal{H}_K$ | Reproducing kernel Hilbert space associated with kernel $K$ | s02 |
| $\mathbf{K} \in \mathbb{R}^{n \times n}$ | Gram matrix, $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ | s02 |
| $g_{\boldsymbol{\theta}}(\mathbf{y})$ | Parameterized estimator (e.g., a neural network) with parameters $\boldsymbol{\theta}$ | s03 |
| $\mathcal{L}(\boldsymbol{\theta})$ | Empirical loss minimized during training | s03 |
| $T$ | Number of unfolded iterations in a deep-unfolded architecture | s03 |
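
To see how $K(\cdot)$, the bandwidth $h$, and $\hat{m}_h$ fit together before they reappear in s02, here is a minimal Nadaraya–Watson sketch. The Gaussian kernel, the bandwidth $h = 0.3$, and the synthetic sine data are illustrative choices, not the chapter's:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h=0.3):
    """m_hat_h(x0): kernel-weighted average of the y_i, bandwidth h."""
    w = np.exp(-((x0 - x) ** 2) / (2.0 * h**2))   # weights K((x0 - x_i) / h)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=200)  # arbitrary noisy target
print(nadaraya_watson(0.0, x, y))                 # should be near sin(0) = 0
```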