Chapter Summary

Key Points

1. When the noise distribution has heavier tails than Gaussian, or when a small fraction of samples is contaminated by outliers, the Gaussian-based ML estimator (least squares) can perform arbitrarily badly. Robust M-estimators $\hat{\theta} = \arg\min_\theta \sum_i \rho(y_i - f(x_i;\theta))$ replace the quadratic loss with a function $\rho$ whose derivative $\psi = \rho'$ is bounded, trading a small loss of Gaussian efficiency for a large gain in contamination resistance.
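
A minimal sketch of why a bounded $\psi$ matters, using a hypothetical location-estimation setup (the contamination level and Huber threshold below are illustrative assumptions, not from the chapter):

```python
import numpy as np

# Hypothetical setup: estimate a location parameter theta from samples
# y_i = theta + noise, with a small fraction of gross outliers mixed in.
rng = np.random.default_rng(0)
theta_true = 2.0
y = theta_true + rng.standard_normal(100)
y[:5] = 1000.0                        # 5% contamination by gross outliers

# Least squares (Gaussian ML) for location is the sample mean: a few
# outliers drag it arbitrarily far from theta_true.
print("LS / mean estimate:", y.mean())

# An M-estimator with bounded psi caps each sample's pull. Huber score
# with an illustrative threshold delta:
delta = 1.345
def psi(r):
    return np.clip(r, -delta, delta)

# Solve sum_i psi(y_i - theta) = 0 by fixed-point iteration,
# starting from the (already robust) median.
theta = np.median(y)
for _ in range(100):
    theta = theta + psi(y - theta).mean()
print("Huber M-estimate  :", theta)   # close to theta_true
```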

2. The Huber loss is the unique convex loss that is minimax-optimal over the $\varepsilon$-contamination neighborhood of the Gaussian: it is quadratic for $|r| \le \delta$ and linear for $|r| > \delta$. Convexity guarantees a unique global minimum reachable by IRLS, while the bounded $\psi$ keeps the influence function bounded. Setting $\delta = 1.345$ (in units of the noise standard deviation) yields 95% asymptotic efficiency at the Gaussian.
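
A sketch of IRLS for Huber regression under an assumed linear model; the data, contamination level, and iteration count are illustrative:

```python
import numpy as np

# Illustrative setup: linear model y = X theta + noise, with 5% of the
# responses grossly corrupted.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.standard_normal((n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(n)
y[:10] += 50.0

delta = 1.345
theta = np.linalg.lstsq(X, y, rcond=None)[0]    # LS initialization
for _ in range(30):
    r = y - X @ theta
    # Huber weights w(r) = psi(r)/r: 1 in the quadratic zone, delta/|r| beyond.
    absr = np.maximum(np.abs(r), 1e-12)
    w = np.where(absr <= delta, 1.0, delta / absr)
    # Weighted least-squares step: solve (X^T W X) theta = X^T W y.
    Xw = X * w[:, None]
    theta = np.linalg.solve(Xw.T @ X, Xw.T @ y)

print("LS estimate   :", np.linalg.lstsq(X, y, rcond=None)[0])
print("Huber estimate:", theta)                 # near theta_true
```

Because the Huber loss is convex, this IRLS iteration cannot get trapped in a spurious local minimum, so the (non-robust) LS initialization is safe.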

3. The influence function $\mathrm{IF}(x;T,F) = \psi(x) / \mathbb{E}_F[\psi']$ measures the infinitesimal effect of contaminating $F$ at a point $x$. The mean, median, Huber, and Tukey biweight estimators span the full efficiency–robustness spectrum: the mean has unbounded IF and breakdown point $0$; the median has bounded IF and breakdown point $1/2$; Huber is a convex compromise; Tukey is non-convex but fully rejects large outliers.
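
The contrast can be checked empirically by sweeping a single contaminating point $z$ and tracking each estimate's response, which mirrors the influence function (the setup below is illustrative):

```python
import numpy as np

# Illustrative sensitivity check: append one contaminating point z to a
# clean sample and see how far each location estimate moves.
rng = np.random.default_rng(2)
base = rng.standard_normal(99)

def tukey_psi(r, c=4.685):
    """Tukey biweight score: redescends to exactly 0 for |r| > c."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r * (1 - (r / c) ** 2) ** 2, 0.0)

for z in [2.0, 10.0, 1e6]:
    y = np.append(base, z)
    mean_shift = y.mean() - base.mean()          # unbounded in z (IF ~ z)
    med_shift = np.median(y) - np.median(base)   # bounded (IF ~ sign(z))
    print(f"z={z:>9.0f}  mean shift={mean_shift:10.3f}  "
          f"median shift={med_shift:6.3f}  tukey psi(z)={float(tukey_psi(z)):.3f}")
```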

4. Non-parametric estimation avoids committing to a finite-dimensional model. Kernel density estimation places a smoothing kernel at each sample, and the Nadaraya–Watson estimator is the conditional expectation under a joint KDE. Both obey the bias–variance tradeoff through the bandwidth $h$, with $h^\star \propto n^{-1/5}$ and optimal MSE $O(n^{-4/5})$ in one dimension, the classical non-parametric rate.
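
A sketch of both estimators with a Gaussian kernel, using a Silverman-style bandwidth $h \propto n^{-1/5}$; the data-generating function is an illustrative assumption:

```python
import numpy as np

# Illustrative setup: y = sin(x) + noise; Gaussian kernel throughout.
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.3 * rng.standard_normal(n)

h = 1.06 * x.std() * n ** (-1 / 5)    # Silverman-style rule, h ~ n^(-1/5)

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x0):
    """Kernel density estimate p_hat(x0) = (1/nh) sum_i K((x0 - x_i)/h)."""
    return gaussian_kernel((x0 - x) / h).mean() / h

def nadaraya_watson(x0):
    """m_hat(x0) = sum_i K_h(x0 - x_i) y_i / sum_i K_h(x0 - x_i)."""
    w = gaussian_kernel((x0 - x) / h)
    return (w * y).sum() / w.sum()

for x0 in [-2.0, 0.0, 1.5]:
    print(f"x0={x0:+.1f}  p_hat={kde(x0):.3f}  "
          f"m_hat={nadaraya_watson(x0):+.3f}  sin(x0)={np.sin(x0):+.3f}")
```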

5. A reproducing kernel Hilbert space (RKHS) equips a function space with a positive-definite kernel $k(x,x')$ satisfying the reproducing property. The representer theorem reduces the infinite-dimensional problem $\min_f \sum_i L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$ to the finite-dimensional problem $\min_{\boldsymbol{\alpha}} \sum_i L(y_i, (\mathbf{K}\boldsymbol{\alpha})_i) + \lambda \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}$, the "kernel trick" that makes KRR, SVMs, and GP regression tractable.
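
For the squared loss, the finite-dimensional problem has the closed form $\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$, which the following kernel ridge regression sketch implements (the RBF kernel and hyperparameters are illustrative choices):

```python
import numpy as np

# Illustrative setup: noisy samples of sin(x), RBF kernel.
rng = np.random.default_rng(4)
n = 100
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.2 * rng.standard_normal(n)

def rbf(a, b, ell=0.7):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# Representer theorem + squared loss: alpha = (K + lam*I)^{-1} y.
lam = 0.1
K = rbf(x, x)
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# The fitted function is f(x0) = sum_i alpha_i k(x0, x_i).
x_test = np.array([-1.0, 0.5, 2.0])
f_hat = rbf(x_test, x) @ alpha
print("f_hat :", np.round(f_hat, 3))
print("truth :", np.round(np.sin(x_test), 3))
```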

6. Gaussian process regression is a Bayesian non-parametric model with closed-form posterior mean and variance. The posterior mean coincides with KRR under a specific choice of regularization; the posterior variance provides calibrated uncertainty. GPs scale as $O(n^3)$ due to the kernel-matrix inversion, the dominant limitation for large datasets.
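
A sketch of the exact GP posterior under an assumed zero-mean prior with RBF kernel and known noise level; the Cholesky factorization is the $O(n^3)$ step the text refers to:

```python
import numpy as np

# Illustrative setup: noisy samples of sin(x), zero-mean GP prior.
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 30)
sigma = 0.2
y = np.sin(x) + sigma * rng.standard_normal(x.size)

def rbf(a, b, ell=0.8):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

K = rbf(x, x) + sigma ** 2 * np.eye(x.size)   # K(X,X) + sigma^2 I
L = np.linalg.cholesky(K)                     # the O(n^3) step
x_star = np.linspace(-3, 3, 5)
K_s = rbf(x_star, x)                          # K(X*, X)

# Posterior mean: K_s (K + sigma^2 I)^{-1} y, via two solves with L.
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = K_s @ alpha

# Posterior variance: k(x*,x*) - K_s (K + sigma^2 I)^{-1} K_s^T (diagonal).
v = np.linalg.solve(L, K_s.T)
var = rbf(x_star, x_star).diagonal() - (v ** 2).sum(axis=0)

for xs, m, s2 in zip(x_star, mean, var):
    print(f"x*={xs:+.2f}  mean={m:+.3f}  std={np.sqrt(s2):.3f}")
```

Once $L$ is computed, each new test point costs only $O(n^2)$, which is why the cubic cost is tied to the training set size rather than to prediction.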

7. Deep neural networks trained under MSE loss converge to the MMSE estimator $\mathbb{E}[\boldsymbol{\theta} \mid \mathbf{y}]$ in the limit of infinite data and capacity. End-to-end learning replaces hand-crafted pipelines with a single data-trained mapping $\mathbf{y} \mapsto \hat{\boldsymbol{\theta}}$, at the cost of interpretability, calibration, and robustness to distribution shift.
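
The linear-Gaussian case makes this concrete: for $\theta \sim \mathcal{N}(0,1)$ and $y = \theta + n$ with $n \sim \mathcal{N}(0,\sigma^2)$, the MMSE estimator is the shrinkage $y/(1+\sigma^2)$, and an MSE-trained regressor recovers it. The sketch below uses a linear fit as a toy stand-in for a deep network:

```python
import numpy as np

# Toy check of "MSE training -> conditional mean": here E[theta|y] is the
# shrinkage y / (1 + sigma^2), so the learned slope should approach it.
rng = np.random.default_rng(6)
sigma = 0.5
n = 100_000
theta = rng.standard_normal(n)
y = theta + sigma * rng.standard_normal(n)

# Least-squares fit of theta_hat = a*y + b; the "network" is linear,
# which is rich enough for this linear-Gaussian problem.
A = np.column_stack([y, np.ones(n)])
a, b = np.linalg.lstsq(A, theta, rcond=None)[0]

print(f"learned slope a          = {a:.4f}")
print(f"MMSE slope 1/(1+sigma^2) = {1 / (1 + sigma**2):.4f}")
```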

8. Deep unfolding bridges model-based and data-driven estimation: unroll $T$ iterations of an iterative algorithm (ISTA, AMP, belief propagation) into $T$ network layers with learnable per-layer parameters. The result inherits the architecture and inductive bias of the iterative method while enjoying the flexibility of supervised learning. LISTA achieves sparse-recovery accuracy comparable to ISTA with $10$–$100\times$ fewer iterations.
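
A sketch of the unfolding idea: $T$ ISTA iterations written out as $T$ layers. Here the per-layer step sizes and thresholds are fixed at their classical values; in LISTA they (and the matrices) would be trained by backpropagating through this same computation graph. All problem sizes below are illustrative:

```python
import numpy as np

# Illustrative sparse-recovery problem: y = A x + noise with k-sparse x.
rng = np.random.default_rng(7)
m, d, k, T = 50, 100, 5, 20
A = rng.standard_normal((m, d)) / np.sqrt(m)
x_true = np.zeros(d)
x_true[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(m)

def soft(v, tau):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

L_const = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the LS gradient
step = [1.0 / L_const] * T            # per-layer step sizes (learned in LISTA)
thresh = [0.05 / L_const] * T         # per-layer thresholds (learned in LISTA)

x = np.zeros(d)
for t in range(T):                    # one loop body == one unfolded layer
    x = soft(x + step[t] * (A.T @ (y - A @ x)), thresh[t])

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```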

Looking Ahead

Chapter 24 returns to information-theoretic limits, and Chapter 25 closes the book with case studies that combine the estimation primitives developed throughout Part IV: CRLB benchmarking, EM, AMP, kernels, and unfolded networks all play a role in modern 5G/6G channel estimation, sparse recovery, and integrated sensing.