Prerequisites & Notation

Before You Begin

This chapter takes a step back from the classical Gaussian playbook that dominated Chapters 8–12 and asks two uncomfortable questions: What if the noise is not Gaussian? and What if we do not know the model at all? Robust estimation answers the first by controlling worst-case damage from outliers; non-parametric and data-driven methods answer the second by letting the data do the modelling. The prerequisites are standard: MAP/MMSE estimation, basic convex optimisation, and comfort with the idea of a regularized empirical risk.

  • MAP, MMSE, and the likelihood function (Ch 8, Ch 9)

    Self-check: Why does the Gaussian assumption lead to a least-squares estimator, and how does that change if the noise is Laplacian?

  • Convex functions, gradients, and subdifferentials

    Self-check: Can you compute the derivative of the Huber loss at $u = \delta$ and verify that it is continuous there? (A numerical check appears after this list.)

  • Kernel functions and inner-product spaces

    Self-check: Why is $K(\mathbf{x},\mathbf{y}) = \exp(-\|\mathbf{x}-\mathbf{y}\|^2/(2h^2))$ positive-definite? (A numerical probe appears after this list.)

  • Empirical risk minimization (Ch 21)

    Self-check: What is the difference between empirical risk and population risk, and when do they coincide?

  • Gradient descent and backpropagation

    Self-check: Can you write one step of SGD for a scalar linear regression and identify the step size? (See the sketch after this list.)
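
To accompany the Huber self-check above, here is a minimal NumPy sketch; the threshold $\delta = 1.5$ and the function names are illustrative choices, not the chapter's. It compares a central finite difference of the loss at $u = \delta$ against the score $\psi$ evaluated on both sides of the transition:

```python
import numpy as np

DELTA = 1.5  # illustrative threshold; any delta > 0 behaves the same

def huber_loss(u, delta=DELTA):
    """Huber rho(u): quadratic for |u| <= delta, linear beyond."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * (np.abs(u) - 0.5 * delta))

def huber_score(u, delta=DELTA):
    """Score psi(u) = rho'(u): the identity on [-delta, delta], clipped outside."""
    return np.clip(u, -delta, delta)

# Central finite difference of rho at u = delta matches psi(delta) = delta,
# and psi agrees from both sides of the transition: rho' is continuous there.
eps = 1e-6
numeric = (huber_loss(DELTA + eps) - huber_loss(DELTA - eps)) / (2 * eps)
print(numeric, huber_score(DELTA - 1e-9), huber_score(DELTA + 1e-9))
```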
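The kernel self-check has a classical answer (the Gaussian kernel is shift-invariant with a non-negative Fourier transform, so Bochner's theorem applies), but it can also be probed numerically: the Gram matrix built from a positive-definite kernel on any finite point set must be positive semi-definite. A sketch with arbitrary random points and $h = 1$:

```python
import numpy as np

def gaussian_gram(X, h=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 arbitrary points in R^3
eigs = np.linalg.eigvalsh(gaussian_gram(X))
print(eigs.min())                     # non-negative up to round-off: PSD Gram
```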
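And for the SGD self-check: one explicit update for a scalar linear model under squared loss. The step size $\mu = 0.01$, the zero initialization, and the single random sample are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, theta1 = 0.0, 0.0     # parameters of the model y_hat = theta1 * x + theta0
mu = 0.01                     # the step size (learning rate)

x, y = rng.normal(), rng.normal()    # a single arbitrary training sample

# One SGD step on the per-sample squared loss 0.5 * (y - y_hat)**2.
residual = y - (theta1 * x + theta0)
theta1 += mu * residual * x   # -d(loss)/d(theta1) = residual * x
theta0 += mu * residual       # -d(loss)/d(theta0) = residual
print(theta1, theta0)         # parameters after one step
```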

Notation for This Chapter

Symbols used throughout Chapter 23. Robust estimation introduces the loss $\rho$, the score $\psi$, the influence function, and the breakdown point; kernel methods introduce the bandwidth $h$ and the Nadaraya–Watson weights (a small worked sketch follows the table); deep learning adds the network parameters $\boldsymbol{\theta}$ and the estimator map $g_{\boldsymbol{\theta}}$.

| Symbol | Meaning | Introduced |
| --- | --- | --- |
| $\rho(u)$ | Loss function applied to the residual $u = y - \hat{y}$ | s01 |
| $\psi(u) = \rho'(u)$ | Score (influence) function, the derivative of the loss | s01 |
| $\delta$ | Huber transition threshold between the quadratic and linear regimes | s01 |
| $\text{IF}(y; T, F)$ | Influence function of functional $T$ at distribution $F$, evaluated at $y$ | s01 |
| $\varepsilon^\star$ | Asymptotic breakdown point of an estimator | s01 |
| $\hat{f}_h(x)$ | Kernel density estimate with bandwidth $h$ | s02 |
| $K(\cdot)$ | Kernel function (non-negative, integrates to one) | s02 |
| $\hat{m}_h(x)$ | Nadaraya–Watson regression estimate | s02 |
| $\mathcal{H}_K$ | Reproducing kernel Hilbert space associated with kernel $K$ | s02 |
| $\mathbf{K} \in \mathbb{R}^{n \times n}$ | Gram matrix, $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ | s02 |
| $g_{\boldsymbol{\theta}}(\mathbf{y})$ | Parameterized estimator (e.g., a neural network) with parameters $\boldsymbol{\theta}$ | s03 |
| $\mathcal{L}(\boldsymbol{\theta})$ | Empirical loss minimized during training | s03 |
| $T$ | Number of unfolded iterations in a deep-unfolded architecture | s03 |
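
To see how $K(\cdot)$, the bandwidth $h$, and $\hat{m}_h$ fit together before they reappear in s02, here is a minimal Nadaraya–Watson sketch. The Gaussian kernel, the bandwidth $h = 0.3$, and the synthetic sine data are illustrative choices, not the chapter's:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h=0.3):
    """m_hat_h(x0): kernel-weighted average of the y_i, bandwidth h."""
    w = np.exp(-((x0 - x) ** 2) / (2.0 * h**2))   # weights K((x0 - x_i) / h)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=200)  # arbitrary noisy target
print(nadaraya_watson(0.0, x, y))                 # should be near sin(0) = 0
```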