Prerequisites & Notation
Before You Begin
This chapter takes a step back from the classical Gaussian playbook that dominated Chapters 8–12 and asks two uncomfortable questions: What if the noise is not Gaussian? and What if we do not know the model at all? Robust estimation answers the first by controlling the worst-case damage from outliers; non-parametric and data-driven methods answer the second by letting the data do the modeling. The prerequisites are standard: MAP/MMSE estimation, basic convex optimization, and comfort with the idea of a regularized empirical risk.
- MAP, MMSE, and the likelihood function (Ch 8, Ch 9)
Self-check: Why does the Gaussian assumption lead to a least-squares estimator, and how does that change if the noise is Laplacian? (A one-line derivation follows this list.)
- Convex functions, gradients, and subdifferentials
Self-check: Can you compute the derivative of the Huber loss at the transition points $r = \pm\delta$ and verify that it is continuous there? (A numeric check follows this list.)
- Kernel functions and inner-product spaces
Self-check: Why is the Gram matrix $\mathbf{K}$ positive semi-definite for any valid kernel $k$? (See the check after this list.)
- Empirical risk minimization (Ch 21)
Self-check: What is the difference between empirical risk and population risk, and when do they coincide?
- Gradient descent and backpropagation
Self-check: Can you write one step of SGD for a scalar linear regression and identify the step size? (A sketch follows this list.)
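For the first self-check, the connection is direct: maximum-likelihood estimation minimizes the negative log-likelihood of the residuals, so the noise density determines the loss. With residuals $r_i = y_i - f_\theta(x_i)$,

$$
\hat{\theta}_{\mathrm{ML}} = \arg\min_\theta \sum_i -\log p(r_i),
$$

so Gaussian noise, $p(r) \propto e^{-r^2/(2\sigma^2)}$, yields the least-squares objective $\sum_i r_i^2$, while Laplacian noise, $p(r) \propto e^{-|r|/b}$, yields the least-absolute-deviations objective $\sum_i |r_i|$.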
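For the Huber self-check, a minimal numeric sketch (plain NumPy, not code from this book) evaluates the standard quadratic-to-linear Huber form and its derivative on both sides of the threshold $\delta$ to confirm that the score $\psi = \rho'$ is continuous at the transition:

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

def huber_score(r, delta=1.0):
    """Score psi(r) = d/dr huber_loss(r): r in the quadratic regime,
    delta * sign(r) in the linear regime."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

# Check continuity of the score at the transition r = delta.
delta, eps = 1.0, 1e-8
left = huber_score(delta - eps, delta)   # quadratic side: psi(r) = r
right = huber_score(delta + eps, delta)  # linear side: psi(r) = delta * sign(r)
print(left, right)  # both ~1.0, so psi is continuous at r = +/- delta
```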
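For the kernel self-check, the key fact is that $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ implies $v^\top \mathbf{K} v = \big\| \sum_i v_i \phi(x_i) \big\|^2 \ge 0$ for any $v$. A quick numeric illustration, assuming a Gaussian kernel (any valid kernel would do):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # 50 points in R^3

# Gaussian (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# All eigenvalues of the Gram matrix should be >= 0 (up to round-off).
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())  # ~0 or positive: K is positive semi-definite
```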
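For the last self-check, here is one SGD update for the scalar model $y \approx \theta x$ under squared loss; a sketch in which the step size (written `eta` below, an illustrative name) is the only tuning knob:

```python
def sgd_step(theta, x, y, eta=0.01):
    """One SGD step for scalar linear regression y ~ theta * x
    under squared loss 0.5 * (theta * x - y)**2."""
    residual = theta * x - y      # prediction error on this sample
    grad = residual * x           # d/dtheta of 0.5 * residual**2
    return theta - eta * grad     # gradient step with step size eta

theta = 0.0
theta = sgd_step(theta, x=2.0, y=4.0, eta=0.1)  # one update toward theta = 2
print(theta)  # 0.8
```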
Notation for This Chapter
Symbols used throughout Chapter 23. Robust estimation introduces the loss $\rho$, the score $\psi$, the influence function, and the breakdown point $\varepsilon^*$; kernel methods introduce the bandwidth $h$ and the Nadaraya–Watson weights; deep learning adds network parameters $\theta$ and the estimator map $f_\theta$.
| Symbol | Meaning | Introduced |
|---|---|---|
| $\rho(r)$ | Loss function applied to the residual $r$ | §23.1 |
| $\psi(r) = \rho'(r)$ | Score (influence) function, the derivative of the loss | §23.1 |
| $\delta$ | Huber transition threshold between quadratic and linear regimes | §23.1 |
| $\mathrm{IF}(x; T, F)$ | Influence function of functional $T$ at distribution $F$, evaluated at $x$ | §23.1 |
| $\varepsilon^*$ | Asymptotic breakdown point of an estimator | §23.1 |
| $\hat{p}_h(x)$ | Kernel density estimate with bandwidth $h$ | §23.2 |
| $K(\cdot)$ | Kernel function (non-negative, integrates to one) | §23.2 |
| $\hat{m}(x)$ | Nadaraya–Watson regression estimate | §23.2 |
| $\mathcal{H}_k$ | Reproducing kernel Hilbert space associated with kernel $k$ | §23.2 |
| $\mathbf{K}$ | Gram matrix, $K_{ij} = k(x_i, x_j)$ | §23.2 |
| $f_\theta$ | Parameterized estimator (e.g., a neural network) with parameters $\theta$ | §23.3 |
| $\mathcal{L}(\theta)$ | Empirical loss minimized during training | §23.3 |
| $L$ | Number of unfolded iterations in a deep-unfolded architecture | §23.3 |
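To see several of these symbols together, here is a minimal Nadaraya–Watson sketch (plain NumPy; the Gaussian kernel and the bandwidth value are illustrative assumptions, not the chapter's choices). The estimate $\hat{m}(x)$ is a kernel-weighted average of the samples $y_i$:

```python
import numpy as np

def nadaraya_watson(x_query, X, y, h=0.5):
    """Nadaraya-Watson estimate m_hat(x) = sum_i w_i(x) * y_i, with
    weights w_i(x) = K((x - x_i)/h) / sum_j K((x - x_j)/h)."""
    u = (x_query - X) / h          # scaled distances to the samples
    K = np.exp(-0.5 * u**2)        # Gaussian kernel (illustrative choice)
    w = K / K.sum()                # normalized Nadaraya-Watson weights
    return np.dot(w, y)

rng = np.random.default_rng(1)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * rng.normal(size=X.size)
print(nadaraya_watson(np.pi / 2, X, y))  # close to sin(pi/2) = 1
```

Shrinking the bandwidth $h$ makes the weights concentrate on the nearest samples (lower bias, higher variance); growing it averages over more of the data.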