Chapter Summary
Key Points
1. When the noise distribution has heavier tails than Gaussian, or when a small fraction of samples is contaminated by outliers, the Gaussian-based ML estimator (least squares) can degrade arbitrarily badly. Robust M-estimators replace the quadratic loss with a loss whose derivative is bounded, trading a small loss of Gaussian efficiency for a large gain in contamination resistance.
2. The Huber loss is the unique convex loss that is minimax-optimal over the $\epsilon$-contamination neighborhood of the Gaussian: it is quadratic for $|r| \le \delta$ and linear for $|r| > \delta$. Convexity guarantees a unique global minimum reachable by IRLS, while the bounded score function $\psi$ keeps the influence function bounded. Setting $\delta \approx 1.345\sigma$ yields 95% asymptotic efficiency at the Gaussian (see the IRLS sketch after this list).
3. The influence function $\mathrm{IF}(x; T, F)$ measures the infinitesimal effect on an estimator $T$ of contaminating the distribution $F$ at a point $x$. Mean, median, Huber, and Tukey biweight span the full efficiency–robustness spectrum: the mean has unbounded IF and breakdown point $0$; the median has bounded IF and breakdown point $1/2$; Huber is a convex compromise; Tukey is non-convex but fully rejects large outliers (compare the contamination demo after this list).
4. Non-parametric estimation avoids committing to a finite-dimensional model. Kernel density estimation places a smoothing kernel at each sample, and the Nadaraya–Watson estimator is the conditional expectation under a joint KDE. Both obey the bias–variance tradeoff through the bandwidth $h$, with optimal choice $h \propto n^{-1/5}$ and optimal MSE $O(n^{-4/5})$, the classical non-parametric rate (see the KDE sketch after this list).
5. A reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ equips a function space with a positive-definite kernel $k$ satisfying the reproducing property $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$. The representer theorem reduces the infinite-dimensional problem $\min_{f \in \mathcal{H}} \sum_{i} \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$ to the finite-dimensional problem of solving for the coefficients of $f^\star = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i)$, the "kernel trick" that makes KRR, SVMs, and GP regression tractable (see the KRR sketch after this list).
6. Gaussian process regression is a Bayesian non-parametric model with closed-form posterior mean and variance. The posterior mean coincides with KRR when the noise variance plays the role of the ridge parameter; the posterior variance provides calibrated uncertainty. GPs scale as $O(n^3)$ due to the kernel matrix inversion, the dominant limitation for large datasets (see the GP sketch after this list).
7. Deep neural networks trained under MSE loss converge to the MMSE estimator $\hat{x}(y) = \mathbb{E}[x \mid y]$ in the limit of infinite data and capacity. End-to-end learning replaces hand-crafted pipelines with a single data-trained mapping $y \mapsto \hat{x}$, at the cost of interpretability, calibration, and robustness to distribution shift (the derivation appears after this list).
8. Deep unfolding bridges model-based and data-driven estimation: unroll a fixed number of iterations of an iterative algorithm (ISTA, AMP, belief propagation) into network layers with learnable per-layer parameters. The result inherits the architecture and inductive bias of the iterative method while enjoying the flexibility of supervised learning. LISTA achieves sparse-recovery accuracy comparable to ISTA with far fewer iterations (see the unrolled-ISTA sketch after this list).
Looking Ahead
Chapter 24 returns to information-theoretic limits, and Chapter 25 closes the book with case studies that combine the estimation primitives developed throughout Part IV: CRLB benchmarking, EM, AMP, kernels, and unfolded networks all play a role in modern 5G/6G channel estimation, sparse recovery, and integrated sensing.