Chapter Summary

Key Points

1. When the noise distribution has heavier tails than Gaussian, or when a small fraction of samples is contaminated by outliers, the Gaussian-based ML estimator (least squares) can perform arbitrarily badly. Robust M-estimators $\hat{\theta} = \arg\min_\theta \sum_i \rho(y_i - f(x_i;\theta))$ replace the quadratic loss with a function $\rho$ whose derivative $\psi = \rho'$ is bounded, trading a small loss of Gaussian efficiency for a large gain in contamination resistance.
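
A minimal sketch of why a bounded $\psi$ matters, using a hypothetical location-estimation setup (the contamination level and Huber threshold below are illustrative assumptions, not from the chapter):

```python
import numpy as np

# Hypothetical setup: estimate a location parameter theta from samples
# y_i = theta + noise, with a small fraction of gross outliers mixed in.
rng = np.random.default_rng(0)
theta_true = 2.0
y = theta_true + rng.standard_normal(100)
y[:5] = 1000.0                        # 5% contamination by gross outliers

# Least squares (Gaussian ML) for location is the sample mean: a few
# outliers drag it arbitrarily far from theta_true.
print("LS / mean estimate:", y.mean())

# An M-estimator with bounded psi caps each sample's pull. Huber score
# with an illustrative threshold delta:
delta = 1.345
def psi(r):
    return np.clip(r, -delta, delta)

# Solve sum_i psi(y_i - theta) = 0 by fixed-point iteration,
# starting from the (already robust) median.
theta = np.median(y)
for _ in range(100):
    theta = theta + psi(y - theta).mean()
print("Huber M-estimate  :", theta)   # close to theta_true
```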

2. The Huber loss is the unique convex loss that is minimax-optimal over the $\varepsilon$-contamination neighborhood of the Gaussian: it is quadratic for $|r| \le \delta$ and linear for $|r| > \delta$. Convexity guarantees a unique global minimum reachable by IRLS, while the bounded $\psi$ keeps the influence function bounded. Setting $\delta = 1.345$ (in units of the noise standard deviation) yields 95% asymptotic efficiency at the Gaussian.
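
A sketch of IRLS for Huber regression under an assumed linear model; the data, contamination level, and iteration count are illustrative:

```python
import numpy as np

# Illustrative setup: linear model y = X theta + noise, with 5% of the
# responses grossly corrupted.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.standard_normal((n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(n)
y[:10] += 50.0

delta = 1.345
theta = np.linalg.lstsq(X, y, rcond=None)[0]    # LS initialization
for _ in range(30):
    r = y - X @ theta
    # Huber weights w(r) = psi(r)/r: 1 in the quadratic zone, delta/|r| beyond.
    absr = np.maximum(np.abs(r), 1e-12)
    w = np.where(absr <= delta, 1.0, delta / absr)
    # Weighted least-squares step: solve (X^T W X) theta = X^T W y.
    Xw = X * w[:, None]
    theta = np.linalg.solve(Xw.T @ X, Xw.T @ y)

print("LS estimate   :", np.linalg.lstsq(X, y, rcond=None)[0])
print("Huber estimate:", theta)                 # near theta_true
```

Because the Huber loss is convex, this IRLS iteration cannot get trapped in a spurious local minimum, so the (non-robust) LS initialization is safe.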

3. The influence function $\mathrm{IF}(x;T,F) = \psi(x) / \mathbb{E}_F[\psi']$ measures the infinitesimal effect of contaminating $F$ at a point $x$. The mean, median, Huber, and Tukey biweight estimators span the full efficiency–robustness spectrum: the mean has unbounded IF and breakdown point $0$; the median has bounded IF and breakdown point $1/2$; Huber is a convex compromise; Tukey is non-convex but fully rejects large outliers.
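
The contrast can be checked empirically by sweeping a single contaminating point $z$ and tracking each estimate's response, which mirrors the influence function (the setup below is illustrative):

```python
import numpy as np

# Illustrative sensitivity check: append one contaminating point z to a
# clean sample and see how far each location estimate moves.
rng = np.random.default_rng(2)
base = rng.standard_normal(99)

def tukey_psi(r, c=4.685):
    """Tukey biweight score: redescends to exactly 0 for |r| > c."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r * (1 - (r / c) ** 2) ** 2, 0.0)

for z in [2.0, 10.0, 1e6]:
    y = np.append(base, z)
    mean_shift = y.mean() - base.mean()          # unbounded in z (IF ~ z)
    med_shift = np.median(y) - np.median(base)   # bounded (IF ~ sign(z))
    print(f"z={z:>9.0f}  mean shift={mean_shift:10.3f}  "
          f"median shift={med_shift:6.3f}  tukey psi(z)={float(tukey_psi(z)):.3f}")
```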

4. Non-parametric estimation avoids committing to a finite-dimensional model. Kernel density estimation places a smoothing kernel at each sample, and the Nadaraya–Watson estimator is the conditional expectation under a joint KDE. Both obey the bias–variance tradeoff through the bandwidth $h$, with $h^\star \propto n^{-1/5}$ and optimal MSE $O(n^{-4/5})$ in one dimension, the classical non-parametric rate.
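
A sketch of both estimators with a Gaussian kernel, using a Silverman-style bandwidth $h \propto n^{-1/5}$; the data-generating function is an illustrative assumption:

```python
import numpy as np

# Illustrative setup: y = sin(x) + noise; Gaussian kernel throughout.
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.3 * rng.standard_normal(n)

h = 1.06 * x.std() * n ** (-1 / 5)    # Silverman-style rule, h ~ n^(-1/5)

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x0):
    """Kernel density estimate p_hat(x0) = (1/nh) sum_i K((x0 - x_i)/h)."""
    return gaussian_kernel((x0 - x) / h).mean() / h

def nadaraya_watson(x0):
    """m_hat(x0) = sum_i K_h(x0 - x_i) y_i / sum_i K_h(x0 - x_i)."""
    w = gaussian_kernel((x0 - x) / h)
    return (w * y).sum() / w.sum()

for x0 in [-2.0, 0.0, 1.5]:
    print(f"x0={x0:+.1f}  p_hat={kde(x0):.3f}  "
          f"m_hat={nadaraya_watson(x0):+.3f}  sin(x0)={np.sin(x0):+.3f}")
```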

5. A reproducing kernel Hilbert space (RKHS) equips a function space with a positive-definite kernel $k(x,x')$ satisfying the reproducing property. The representer theorem reduces the infinite-dimensional problem $\min_f \sum_i L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$ to the finite-dimensional problem $\min_{\boldsymbol{\alpha}} \sum_i L(y_i, (\mathbf{K}\boldsymbol{\alpha})_i) + \lambda \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}$, the "kernel trick" that makes KRR, SVMs, and GP regression tractable.
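
For the squared loss, the finite-dimensional problem has the closed form $\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$, which the following kernel ridge regression sketch implements (the RBF kernel and hyperparameters are illustrative choices):

```python
import numpy as np

# Illustrative setup: noisy samples of sin(x), RBF kernel.
rng = np.random.default_rng(4)
n = 100
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.2 * rng.standard_normal(n)

def rbf(a, b, ell=0.7):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# Representer theorem + squared loss: alpha = (K + lam*I)^{-1} y.
lam = 0.1
K = rbf(x, x)
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# The fitted function is f(x0) = sum_i alpha_i k(x0, x_i).
x_test = np.array([-1.0, 0.5, 2.0])
f_hat = rbf(x_test, x) @ alpha
print("f_hat :", np.round(f_hat, 3))
print("truth :", np.round(np.sin(x_test), 3))
```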

6. Gaussian process regression is a Bayesian non-parametric model with closed-form posterior mean and variance. The posterior mean coincides with KRR under a specific choice of regularization; the posterior variance provides calibrated uncertainty. GPs scale as $O(n^3)$ due to the kernel-matrix inversion, the dominant limitation for large datasets.
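
A sketch of the exact GP posterior under an assumed zero-mean prior with RBF kernel and known noise level; the Cholesky factorization is the $O(n^3)$ step the text refers to:

```python
import numpy as np

# Illustrative setup: noisy samples of sin(x), zero-mean GP prior.
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 30)
sigma = 0.2
y = np.sin(x) + sigma * rng.standard_normal(x.size)

def rbf(a, b, ell=0.8):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

K = rbf(x, x) + sigma ** 2 * np.eye(x.size)   # K(X,X) + sigma^2 I
L = np.linalg.cholesky(K)                     # the O(n^3) step
x_star = np.linspace(-3, 3, 5)
K_s = rbf(x_star, x)                          # K(X*, X)

# Posterior mean: K_s (K + sigma^2 I)^{-1} y, via two solves with L.
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = K_s @ alpha

# Posterior variance: k(x*,x*) - K_s (K + sigma^2 I)^{-1} K_s^T (diagonal).
v = np.linalg.solve(L, K_s.T)
var = rbf(x_star, x_star).diagonal() - (v ** 2).sum(axis=0)

for xs, m, s2 in zip(x_star, mean, var):
    print(f"x*={xs:+.2f}  mean={m:+.3f}  std={np.sqrt(s2):.3f}")
```

Once $L$ is computed, each new test point costs only $O(n^2)$, which is why the cubic cost is tied to the training set size rather than to prediction.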

7. Deep neural networks trained under MSE loss converge to the MMSE estimator $\mathbb{E}[\boldsymbol{\theta} \mid \mathbf{y}]$ in the limit of infinite data and capacity. End-to-end learning replaces hand-crafted pipelines with a single data-trained mapping $\mathbf{y} \mapsto \hat{\boldsymbol{\theta}}$, at the cost of interpretability, calibration, and robustness to distribution shift.
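
The linear-Gaussian case makes this concrete: for $\theta \sim \mathcal{N}(0,1)$ and $y = \theta + n$ with $n \sim \mathcal{N}(0,\sigma^2)$, the MMSE estimator is the shrinkage $y/(1+\sigma^2)$, and an MSE-trained regressor recovers it. The sketch below uses a linear fit as a toy stand-in for a deep network:

```python
import numpy as np

# Toy check of "MSE training -> conditional mean": here E[theta|y] is the
# shrinkage y / (1 + sigma^2), so the learned slope should approach it.
rng = np.random.default_rng(6)
sigma = 0.5
n = 100_000
theta = rng.standard_normal(n)
y = theta + sigma * rng.standard_normal(n)

# Least-squares fit of theta_hat = a*y + b; the "network" is linear,
# which is rich enough for this linear-Gaussian problem.
A = np.column_stack([y, np.ones(n)])
a, b = np.linalg.lstsq(A, theta, rcond=None)[0]

print(f"learned slope a          = {a:.4f}")
print(f"MMSE slope 1/(1+sigma^2) = {1 / (1 + sigma**2):.4f}")
```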

8. Deep unfolding bridges model-based and data-driven estimation: unroll $T$ iterations of an iterative algorithm (ISTA, AMP, belief propagation) into $T$ network layers with learnable per-layer parameters. The result inherits the architecture and inductive bias of the iterative method while enjoying the flexibility of supervised learning. LISTA achieves sparse-recovery accuracy comparable to ISTA with $10$–$100\times$ fewer iterations.
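
A sketch of the unfolding idea: $T$ ISTA iterations written out as $T$ layers. Here the per-layer step sizes and thresholds are fixed at their classical values; in LISTA they (and the matrices) would be trained by backpropagating through this same computation graph. All problem sizes below are illustrative:

```python
import numpy as np

# Illustrative sparse-recovery problem: y = A x + noise with k-sparse x.
rng = np.random.default_rng(7)
m, d, k, T = 50, 100, 5, 20
A = rng.standard_normal((m, d)) / np.sqrt(m)
x_true = np.zeros(d)
x_true[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(m)

def soft(v, tau):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

L_const = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the LS gradient
step = [1.0 / L_const] * T            # per-layer step sizes (learned in LISTA)
thresh = [0.05 / L_const] * T         # per-layer thresholds (learned in LISTA)

x = np.zeros(d)
for t in range(T):                    # one loop body == one unfolded layer
    x = soft(x + step[t] * (A.T @ (y - A @ x)), thresh[t])

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```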

Looking Ahead

Chapter 24 returns to information-theoretic limits, and Chapter 25 closes the book with case studies that combine the estimation primitives developed throughout Part IV: CRLB benchmarking, EM, AMP, kernels, and unfolded networks all play a role in modern 5G/6G channel estimation, sparse recovery, and integrated sensing.