Exercises
ex-fsi-ch23-01
Easy: Show that the Huber loss $\rho_\delta(r)$ is continuously differentiable everywhere, including at $r = \pm\delta$.
Compute the left and right derivatives of $\rho_\delta$ at $r = \pm\delta$.
The quadratic piece has derivative $r$ at $|r| = \delta$; the linear piece has derivative $\delta\,\mathrm{sign}(r)$.
Piecewise derivative
For $|r| \le \delta$, $\rho_\delta'(r) = r$. For $|r| > \delta$, $\rho_\delta'(r) = \delta\,\mathrm{sign}(r)$.
Matching at the kink
At $r = \delta$, the quadratic piece gives $\rho_\delta'(\delta^-) = \delta$ and the linear piece gives $\rho_\delta'(\delta^+) = \delta\,\mathrm{sign}(\delta) = \delta$. The same match holds at $r = -\delta$. Hence $\rho_\delta'$ is continuous everywhere and equals $\psi_\delta$.
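As a quick sanity check, here is a minimal numerical sketch (assuming a hypothetical $\delta = 1.5$) that compares a central finite difference of the Huber loss against the piecewise $\psi_\delta$ on a grid straddling the kinks:

```python
import numpy as np

# Minimal check of continuity of the Huber derivative (delta = 1.5 is an assumed tuning).
delta = 1.5

def huber(r):
    return np.where(np.abs(r) <= delta, 0.5 * r**2, delta * np.abs(r) - 0.5 * delta**2)

def psi(r):
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

r = np.linspace(-4.0, 4.0, 100001)                      # grid straddling the kinks at +/- delta
num_deriv = (huber(r + 1e-6) - huber(r - 1e-6)) / 2e-6  # central finite differences
print(np.max(np.abs(num_deriv - psi(r))))               # tiny: the derivative matches psi everywhere
```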
ex-fsi-ch23-02
Easy: Compute the breakdown point of the trimmed mean that discards the $k$ smallest and $k$ largest of $n$ samples. Express it in terms of $k$ and $n$.
Breakdown point is the smallest fraction of samples whose replacement can drive the estimator to infinity.
How many contaminated points can the trimmed mean tolerate?
Count the tolerable contamination
If fewer than $k+1$ points are moved to infinity, the trimmed mean still averages only finite values; with $k+1$ contaminated points, at least one infinite value survives the trimming. Hence the finite-sample breakdown point is $(k+1)/n$, which tends to the trimming fraction $\alpha = \lim k/n$ asymptotically.
Interpretation
With trimming fraction $\alpha = k/n$, the breakdown point is approximately $\alpha$: intermediate between the mean ($0$) and the median ($1/2$).
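A small illustrative sketch (assuming $n = 100$, $k = 10$, and the value $10^9$ standing in for "infinity"): $k$ contaminated points are absorbed by the trimming, $k+1$ are not.

```python
import numpy as np

# Contaminate m points with a huge value; the 10%-trimmed mean breaks only once m = k + 1.
rng = np.random.default_rng(0)
n, k = 100, 10
x = rng.normal(size=n)

def trimmed_mean(v, k):
    vs = np.sort(v)
    return vs[k:len(vs) - k].mean()              # drop the k smallest and k largest

for m in (k, k + 1):
    y = x.copy()
    y[:m] = 1e9                                  # contaminate m points
    print(m, trimmed_mean(y, k), np.median(y))   # median stays finite in both cases
```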
ex-fsi-ch23-03
Easy: Verify that the influence function of the sample mean under the standard Gaussian is $\mathrm{IF}(x) = x$. Interpret the unbounded growth.
$\psi(x) = x - \theta$ for the squared-error loss; $\mathrm{IF}(x) = \psi(x)/\mathbb{E}_F[\psi'(X)]$.
Interpret: a single arbitrarily large $x$ moves the mean arbitrarily far.
Direct computation
The mean is the M-estimator with $\psi(x) = x - \theta$, so $\mathrm{IF}(x) = (x - \theta)/\mathbb{E}_F[\psi'(X)]$. Under $F = \mathcal{N}(0,1)$, $\theta = 0$ and $\mathbb{E}_F[\psi'(X)] = 1$. Hence $\mathrm{IF}(x) = x$.
Unboundedness
$|\mathrm{IF}(x)| \to \infty$ as $|x| \to \infty$: a single outlier of magnitude $x$ shifts the mean by roughly $x/n$, which is unbounded in $x$.
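The empirical counterpart is the sensitivity curve; a minimal sketch (assuming $n = 200$ Gaussian samples) contrasts the mean with the median:

```python
import numpy as np

# Sensitivity curve, an empirical proxy for the influence function:
# SC(x) = n * (T(x_1, ..., x_{n-1}, x) - T(x_1, ..., x_{n-1})).
rng = np.random.default_rng(1)
n = 200
base = rng.normal(size=n - 1)

for x in (1.0, 10.0, 100.0):
    sc_mean = n * (np.mean(np.append(base, x)) - np.mean(base))
    sc_median = n * (np.median(np.append(base, x)) - np.median(base))
    print(x, round(sc_mean, 2), round(sc_median, 2))   # mean grows like x, median stays bounded
```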
ex-fsi-ch23-04
Easy: Write the Gaussian KDE with bandwidth $h$ on samples $x_1, \dots, x_n$ in closed form. Verify that it integrates to $1$.
Use the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$.
Interchange sum and integral, then use $\int K(u)\,du = 1$.
Density
$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,h} \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right)$.
Normalization
$\int \hat f_h(x)\,dx = \frac{1}{n} \sum_{i=1}^{n} \int \frac{1}{h} K\!\left(\frac{x - x_i}{h}\right) dx = \frac{1}{n} \sum_{i=1}^{n} \int K(u)\,du = 1$, using the substitution $u = (x - x_i)/h$.
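A minimal numerical sketch (assuming a hypothetical bandwidth $h = 0.3$); a Riemann sum on a wide grid confirms the normalization:

```python
import numpy as np

# Gaussian KDE evaluated on a grid; its numerical integral should be ~1.
rng = np.random.default_rng(2)
samples = rng.normal(size=50)
h = 0.3

def kde(x, samples, h):
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]
print((kde(grid, samples, h) * dx).sum())   # ~1.0
```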
ex-fsi-ch23-05
Easy: State Mercer's condition for a function $k(x, x')$ to be a valid positive-definite kernel.
Symmetry plus positive-semidefinite Gram matrix for every finite set.
Equivalently: $k$ admits an eigen-expansion with non-negative eigenvalues.
PSD condition
$k$ is symmetric ($k(x, x') = k(x', x)$), and for every finite set $\{x_1, \dots, x_n\}$ the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semidefinite: $\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0$ for all $c \in \mathbb{R}^n$.
Mercer expansion
Equivalently, $k(x, x') = \sum_{j} \lambda_j \phi_j(x) \phi_j(x')$ with $\lambda_j \ge 0$ (in $L^2$ of a compact domain).
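An empirical sketch of the PSD condition for the RBF kernel $k(x, x') = \exp(-\|x - x'\|^2/2)$ (unit lengthscale assumed): the Gram matrix on random points has no eigenvalue below round-off.

```python
import numpy as np

# Build an RBF Gram matrix on random points and check its smallest eigenvalue.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))

sq_dists = ((X[:, None, :] - X[None, :, :])**2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)
print(np.linalg.eigvalsh(K).min())   # >= -1e-12, i.e. positive semidefinite
```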
ex-fsi-ch23-06
Medium: Derive the IRLS update for the Huber M-estimator of location. Show that each IRLS iteration is a weighted least-squares problem with weights $w_i = \psi_\delta(r_i)/r_i$.
Write the first-order condition $\sum_{i} \psi_\delta(y_i - \theta) = 0$.
Replace $\psi_\delta(r_i)$ by $w_i r_i$ and iterate.
Psi as weighted identity
$\psi_\delta(r) = r$ for $|r| \le \delta$ and $\psi_\delta(r) = \delta\,\mathrm{sign}(r)$ for $|r| > \delta$. Thus $\psi_\delta(r) = w(r)\, r$ with $w(r) = \min(1, \delta/|r|)$.
Fixed-point update
The score equation becomes $\sum_i w_i^{(t)} (y_i - \theta) = 0$, giving the weighted-mean update $\theta^{(t+1)} = \sum_i w_i^{(t)} y_i \big/ \sum_i w_i^{(t)}$ with $w_i^{(t)} = \min\!\bigl(1, \delta / |y_i - \theta^{(t)}|\bigr)$.
Interpretation
IRLS is a majorization-minimization scheme: at each step outliers (residuals with $|r_i| > \delta$) get downweighted to $w_i = \delta/|r_i| < 1$.
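A minimal IRLS sketch for the Huber location estimate (assuming $\delta = 1.345$, the classic 95%-efficiency tuning, and 5% gross outliers in the data): every iteration is the weighted mean with the derived weights.

```python
import numpy as np

# IRLS for the Huber location estimate on contaminated Gaussian data.
rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(size=95), rng.normal(50.0, 1.0, size=5)])  # 5% gross outliers
delta = 1.345

theta = np.median(y)                                              # robust starting point
for _ in range(100):
    r = y - theta
    w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))     # w_i = min(1, delta / |r_i|)
    theta_new = (w * y).sum() / w.sum()                           # weighted least-squares solution
    if abs(theta_new - theta) < 1e-10:
        break
    theta = theta_new

print(theta, y.mean())   # Huber estimate stays near 0; the plain mean is dragged to ~2.5
```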
ex-fsi-ch23-07
Medium: For the Gaussian KDE with bandwidth $h$, the asymptotic integrated MSE is $\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4}{4}\,\mu_2(K)^2 R(f'')$, where $R(g) = \int g(u)^2\,du$ and $\mu_2(K) = \int u^2 K(u)\,du$. Find the optimal $h$ and the resulting AMISE rate.
Differentiate with respect to $h$ and set to zero.
Report $h^* \propto n^{-1/5}$ and $\mathrm{AMISE}(h^*) \propto n^{-4/5}$.
First-order condition
$\frac{d}{dh}\mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + h^3 \mu_2(K)^2 R(f'') = 0$, giving $h^* = \left(\frac{R(K)}{\mu_2(K)^2 R(f'')}\right)^{1/5} n^{-1/5}$.
Optimal rate
Substituting back, $\mathrm{AMISE}(h^*) = O(n^{-4/5})$. This is the classical non-parametric rate for densities with bounded second derivative, slower than the parametric $O(n^{-1})$.
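A small sketch that plugs the Gaussian-reference constants into the $h^*$ formula (assuming a Gaussian kernel and an assumed $\mathcal{N}(0,1)$ target, so $R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$, $R(f'') = 3/(8\sqrt{\pi})$), which recovers the familiar $h^* \approx 1.06\, n^{-1/5}$:

```python
import numpy as np

# Evaluate h* and AMISE(h*) for a few sample sizes under the Gaussian reference.
R_K = 1.0 / (2.0 * np.sqrt(np.pi))
mu2 = 1.0
R_f2 = 3.0 / (8.0 * np.sqrt(np.pi))

for n in (100, 1000, 10000):
    h_star = (R_K / (mu2**2 * R_f2 * n))**0.2
    amise = R_K / (n * h_star) + 0.25 * h_star**4 * mu2**2 * R_f2
    print(n, round(h_star, 4), amise)   # h* shrinks like n^(-1/5), AMISE like n^(-4/5)
```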
ex-fsi-ch23-08
Medium: Prove that kernel ridge regression has solution $\hat f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$ with $\alpha = (K + \lambda I)^{-1} y$.
Invoke the representer theorem: $\hat f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$.
Substitute and minimize over $\alpha$.
Representer form
By the representer theorem, $\hat f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$ for some $\alpha \in \mathbb{R}^n$.
Finite-dim objective
Using the reproducing property and $\|f\|_{\mathcal{H}}^2 = \alpha^\top K \alpha$, the objective is $\|y - K\alpha\|^2 + \lambda\,\alpha^\top K \alpha$.
Closed-form solution
Setting the gradient to zero: $-2K(y - K\alpha) + 2\lambda K\alpha = 0$, so $K\bigl((K + \lambda I)\alpha - y\bigr) = 0$ and $\alpha = (K + \lambda I)^{-1} y$.
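A minimal KRR sketch (assuming an RBF kernel with lengthscale $0.5$ and $\lambda = 0.1$, fitting noisy samples of $\sin x$): compute $\alpha = (K + \lambda I)^{-1} y$ and predict with $\hat f(x) = \sum_i \alpha_i k(x_i, x)$.

```python
import numpy as np

# Kernel ridge regression in closed form with an RBF kernel.
rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0.0, 6.0, size=40))
y = np.sin(X) + 0.1 * rng.normal(size=40)
lam, ell = 0.1, 0.5

def rbf(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = rbf(X, X, ell)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)      # (K + lambda I)^{-1} y

X_test = np.linspace(0.5, 5.5, 6)
f_hat = rbf(X_test, X, ell) @ alpha                       # sum_i alpha_i k(x_i, x)
print(np.round(f_hat, 3), np.round(np.sin(X_test), 3))    # close to the noiseless sin(x)
```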
ex-fsi-ch23-09
Medium: Compute the posterior mean and variance of a GP at a test point $x_*$, given training data $(X, y)$, noise variance $\sigma^2$, and covariance function $k$.
Joint prior: $\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} K + \sigma^2 I & k_* \\ k_*^\top & k(x_*, x_*) \end{bmatrix}\right)$ with $k_* = [k(x_i, x_*)]_{i=1}^n$.
Apply the Gaussian conditioning formula.
Gaussian conditioning
The posterior is Gaussian with mean $\mu_* = k_*^\top (K + \sigma^2 I)^{-1} y$ and variance $\sigma_*^2 = k(x_*, x_*) - k_*^\top (K + \sigma^2 I)^{-1} k_*$.
Comparison with KRR
The posterior mean equals the KRR solution with $\lambda = \sigma^2$. The posterior variance is a pure GP feature: KRR gives no uncertainty.
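A minimal GP-posterior sketch (assuming an RBF kernel with lengthscale $0.5$ and noise $\sigma^2 = 0.01$): mean and variance at a single test point via the conditioning formulas above.

```python
import numpy as np

# GP posterior mean and variance at one test point x_*.
rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0.0, 6.0, size=30))
y = np.sin(X) + 0.1 * rng.normal(size=30)
sigma2, ell = 0.01, 0.5

def rbf(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = rbf(X, X, ell)
x_star = 3.0
k_star = rbf(X, np.array([x_star]), ell)[:, 0]                   # k(x_i, x_*)
sol = np.linalg.solve(K + sigma2 * np.eye(len(X)), np.column_stack([y, k_star]))

mu_star = k_star @ sol[:, 0]             # k_*^T (K + sigma^2 I)^{-1} y
var_star = 1.0 - k_star @ sol[:, 1]      # k(x_*, x_*) - k_*^T (K + sigma^2 I)^{-1} k_*
print(mu_star, var_star, np.sin(x_star)) # mean near sin(3), small positive variance
```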
ex-fsi-ch23-10
Medium: The Tukey biweight loss has derivative $\psi_c(r) = r\bigl(1 - (r/c)^2\bigr)^2$ for $|r| \le c$ and $\psi_c(r) = 0$ otherwise. Show that the corresponding loss $\rho_c$ is NOT convex and give one consequence.
Differentiate $\psi_c$ at a chosen $r \in (0, c)$ to show $\psi_c'$ changes sign.
If $\psi_c'(r) < 0$, then $\rho_c''(r) < 0$ at that point.
Non-monotone psi
$\psi_c(0) = 0$, $\psi_c$ grows, reaches a maximum, then decreases back to $0$ at $r = c$. Therefore $\psi_c' = \rho_c''$ is negative for some $r \in (0, c)$, so $\rho_c$ is non-convex.
Consequence
Gradient descent / IRLS may converge to a local minimum. Good initialization (e.g. by a convex M-estimate like Huber) is essential in practice.
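A tiny sketch (assuming the standard tuning $c = 4.685$) that samples $\psi_c'$ on $(0, c)$ and confirms it goes negative, i.e. $\rho_c$ has regions of negative curvature:

```python
import numpy as np

# psi_c'(r) = d/dr [ r (1 - (r/c)^2)^2 ] = (1 - t)(1 - 5t) with t = (r/c)^2.
c = 4.685
r = np.linspace(1e-3, c - 1e-3, 1000)
t = (r / c)**2
psi_prime = (1 - t) * (1 - 5 * t)
print(psi_prime.min() < 0, r[np.argmin(psi_prime)] / c)   # True; most negative near r ~ 0.775 c
```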
ex-fsi-ch23-11
Medium: Show the "kernel trick" for the polynomial kernel $k(x, z) = (1 + x^\top z)^2$ in $\mathbb{R}^2$: write the explicit feature map $\phi$ such that $k(x, z) = \phi(x)^\top \phi(z)$.
Expand $(1 + x_1 z_1 + x_2 z_2)^2$.
Match monomials of $(x_1, x_2)$ to coordinates of $\phi$.
Expansion
$(1 + x^\top z)^2 = 1 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2$.
Feature map
$\phi(x) = \bigl(1,\; x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2\bigr)^\top$. Then $\phi(x)^\top \phi(z) = (1 + x^\top z)^2 = k(x, z)$.
Lesson
We compute inner products in a 6-dimensional feature space at the cost of a 2-dimensional dot product, one addition, and one squaring.
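A two-line numerical sketch verifying $\phi(x)^\top \phi(z) = (1 + x^\top z)^2$ for random 2-D points:

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel in R^2.
rng = np.random.default_rng(7)
x, z = rng.normal(size=2), rng.normal(size=2)

def phi(v):
    return np.array([1.0, v[0]**2, v[1]**2,
                     np.sqrt(2) * v[0] * v[1],
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1]])

print(phi(x) @ phi(z), (1.0 + x @ z)**2)   # identical up to floating-point round-off
```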
ex-fsi-ch23-12
Medium: For ISTA with step $1/L$ and soft threshold $S_{\lambda/L}$, one iteration is $x^{(t+1)} = S_{\lambda/L}\bigl(x^{(t)} + \tfrac{1}{L} A^\top (y - A x^{(t)})\bigr)$. Rewrite LISTA as $T$ stacked layers with learnable $(W_1^{(t)}, W_2^{(t)}, \theta_t)$ per layer, and give the number of parameters.
$W_1 = \tfrac{1}{L} A^\top$ and $W_2 = I - \tfrac{1}{L} A^\top A$.
Count: parameters per layer for $W_1 \in \mathbb{R}^{n \times m}$, $W_2 \in \mathbb{R}^{n \times n}$.
LISTA recursion
$x^{(t+1)} = S_{\theta_t}\bigl(W_1^{(t)} y + W_2^{(t)} x^{(t)}\bigr)$, initialized at the ISTA values $W_1 = \tfrac{1}{L} A^\top$, $W_2 = I - \tfrac{1}{L} A^\top A$, $\theta = \lambda/L$. All $(W_1^{(t)}, W_2^{(t)}, \theta_t)$ are learned by backprop.
Parameter count
Per layer: $W_1^{(t)} \in \mathbb{R}^{n \times m}$ has $nm$ parameters, $W_2^{(t)} \in \mathbb{R}^{n \times n}$ has $n^2$, and the threshold $\theta_t$ gives $1$ more. Total per layer: $nm + n^2 + 1$. Over $T$ layers: $T(nm + n^2 + 1)$.
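A forward-pass sketch of the unfolded network, frozen at its ISTA initialization (assuming $m = 20$ measurements, $n = 50$ coefficients, $T = 8$ layers, $\lambda = 0.1$); in practice the per-layer $(W_1, W_2, \theta)$ would be trained by backprop.

```python
import numpy as np

# LISTA-style layers x <- S_theta(W1 y + W2 x), here left at their ISTA values.
rng = np.random.default_rng(8)
m, n, T, lam = 20, 50, 8, 0.1
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.normal(size=5)
y = A @ x_true

L = np.linalg.norm(A, 2)**2                        # Lipschitz constant of the data-fit gradient
W1 = [A.T / L] * T
W2 = [np.eye(n) - A.T @ A / L] * T
theta = [lam / L] * T

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = np.zeros(n)
for t in range(T):
    x = soft(W1[t] @ y + W2[t] @ x, theta[t])      # one unfolded layer
print("params per layer:", n * m + n * n + 1, "| error:", round(np.linalg.norm(x - x_true), 3))
```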
ex-fsi-ch23-13
Hard: Prove that at the minimax point, the Huber loss corresponds to ML estimation under the density proportional to $e^{-\rho_\delta(x)}$. Identify this density (hint: Gaussian center, Laplace tails).
ML under $f \propto e^{-\rho_\delta}$ gives $-\log f(x) = \rho_\delta(x) + \mathrm{const}$.
For $|x| \le \delta$: Gaussian-like $e^{-x^2/2}$; for $|x| > \delta$: Laplace-like $e^{-\delta|x| + \delta^2/2}$.
Least-favorable density
Huber showed that the minimax (least-favorable) density over the $\varepsilon$-contamination class is $f_0(x) \propto e^{-\rho_\delta(x)}$: Gaussian on the core $|x| \le \delta$, Laplace tails outside.
ML correspondence
Under $f_0$, the log-likelihood of a location parameter $\theta$ is $-\sum_i \rho_\delta(x_i - \theta) + \mathrm{const}$. Maximizing the log-likelihood is exactly Huber M-estimation.
Minimax interpretation
The Huber M-estimator is the MLE for the worst-case density in the contamination ball — hence minimax-optimal.
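A short sketch (assuming $\delta = 1$) of the density's structure: a Gaussian core $e^{-x^2/2}$, Laplace tails $e^{-(\delta|x| - \delta^2/2)}$, and $-\log f_0 = \rho_\delta$ on the whole line.

```python
import numpy as np

# Unnormalized least-favorable density and its negative log, compared to the Huber loss.
delta = 1.0
x = np.linspace(-5.0, 5.0, 2001)

f0 = np.where(np.abs(x) <= delta,
              np.exp(-0.5 * x**2),
              np.exp(-(delta * np.abs(x) - 0.5 * delta**2)))
rho = np.where(np.abs(x) <= delta, 0.5 * x**2, delta * np.abs(x) - 0.5 * delta**2)
print(np.max(np.abs(-np.log(f0) - rho)))   # 0: ML under f0 is exactly Huber M-estimation
```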
ex-fsi-ch23-14
Hard: For the Nadaraya–Watson estimator with Gaussian kernel and bandwidth $h$, derive the leading-order bias at an interior point $x$ and show it depends on $m'(x) f'(x)/f(x)$, where $m(x) = \mathbb{E}[Y \mid X = x]$ and $f$ is the design density.
Expand numerator and denominator around $x$ in powers of $h$.
Use $\int u K(u)\,du = 0$ and $\mu_2(K) = \int u^2 K(u)\,du$.
Numerator expansion
$\mathbb{E}[\hat g(x)] = m(x) f(x) + \tfrac{h^2}{2}\mu_2(K)\,(m f)''(x) + o(h^2)$, where $\hat g(x) = \frac{1}{nh}\sum_i Y_i K\!\left(\frac{x - X_i}{h}\right)$ is the numerator, the NW estimator is $\hat m(x) = \hat g(x)/\hat f(x)$, and $\hat f$ is the KDE of the design density.
Denominator expansion
$\mathbb{E}[\hat f(x)] = f(x) + \tfrac{h^2}{2}\mu_2(K) f''(x) + o(h^2)$.
Leading bias
Ratio expansion gives $\mathrm{bias}\{\hat m(x)\} = h^2 \mu_2(K)\left(\tfrac{1}{2} m''(x) + \tfrac{m'(x) f'(x)}{f(x)}\right) + o(h^2)$. The second term is the "design bias": it vanishes only when $f'(x) = 0$ or $m$ is flat.
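A Monte Carlo sketch of the design bias under assumed choices ($m(x) = x^2$, an exponential design so $f'/f = -1$, Gaussian kernel, $h = 0.3$, evaluated at $x_0 = 1$): the empirical bias roughly tracks $h^2\bigl(m''/2 + m' f'/f\bigr)$.

```python
import numpy as np

# Compare the Monte Carlo bias of the NW estimator to the leading-order formula.
rng = np.random.default_rng(9)
h, x0, reps, n = 0.3, 1.0, 300, 2000

def nw(xq, X, Y, h):
    w = np.exp(-0.5 * ((xq - X) / h)**2)
    return (w * Y).sum() / w.sum()

estimates = []
for _ in range(reps):
    X = rng.exponential(scale=1.0, size=n)      # design density f(x) = exp(-x), so f'/f = -1
    Y = X**2 + 0.1 * rng.normal(size=n)         # m(x) = x^2, m'(1) = 2, m''(1) = 2
    estimates.append(nw(x0, X, Y, h))

empirical_bias = np.mean(estimates) - x0**2
theory_bias = h**2 * (2.0 / 2.0 + 2.0 * (-1.0)) # mu2(K) = 1 for the Gaussian kernel
print(round(empirical_bias, 3), theory_bias)
```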
ex-fsi-ch23-15
Hard: Let $\hat f_n$ be the MSE-trained neural estimator learned from $n$ i.i.d. pairs $(X_i, Y_i)$. Under universal approximation and $n \to \infty$, prove that $\hat f_n(Y)$ converges to the MMSE estimator $\mathbb{E}[X \mid Y]$ almost surely.
Split the MSE into irreducible noise plus approximation error.
Universal approximation + law of large numbers.
MSE decomposition
$\mathbb{E}\|X - f(Y)\|^2 = \mathbb{E}\|X - \mathbb{E}[X \mid Y]\|^2 + \mathbb{E}\|\mathbb{E}[X \mid Y] - f(Y)\|^2$. The first term is irreducible; the second is minimized at $f(Y) = \mathbb{E}[X \mid Y]$.
Universal approximation
Neural nets with sufficient width can approximate $\mathbb{E}[X \mid Y = y]$ uniformly on compacta to any accuracy $\epsilon > 0$. Thus the population minimizer over the network class matches $\mathbb{E}[X \mid Y]$.
Consistency
By the uniform law of large numbers, the empirical MSE converges to the population MSE, so the minimizer of the empirical MSE converges to the population minimizer as $n \to \infty$, i.e. to the MMSE estimator $\mathbb{E}[X \mid Y]$.
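A small numerical sketch of the decomposition in a jointly Gaussian case where the MMSE estimator is known in closed form (assuming $X \sim \mathcal{N}(0,1)$, $Y = X + \mathcal{N}(0,1)$, so $\mathbb{E}[X \mid Y] = Y/2$):

```python
import numpy as np

# Any estimator f(Y) = a*Y pays the approximation penalty on top of the irreducible floor.
rng = np.random.default_rng(10)
X = rng.normal(size=200_000)
Y = X + rng.normal(size=200_000)
mmse = 0.5 * Y                                   # E[X | Y] for this model

for a in (0.3, 0.5, 0.7):
    total = np.mean((X - a * Y)**2)
    floor = np.mean((X - mmse)**2)               # irreducible term
    gap = np.mean((mmse - a * Y)**2)             # approximation term
    print(a, round(total, 3), round(floor + gap, 3))   # columns agree up to Monte Carlo error
```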
ex-fsi-ch23-16
Hard: Suppose a LISTA network trained on $y = A x + v$ is deployed where $y = (A + \Delta) x + v$ with $\|\Delta\|_2 \le \epsilon$. Give an upper bound on the recovery error in terms of $\epsilon$ and the operator norms of the learned weights.
Propagate the perturbation layer by layer; the soft threshold is 1-Lipschitz.
Telescope the per-layer Lipschitz constants over the $T$ layers.
Per-layer Lipschitz
Since $S_\theta$ is $1$-Lipschitz, a perturbation $e^{(t)} = \tilde x^{(t)} - x^{(t)}$ at layer $t$ is amplified by $\|W_2^{(t)}\|$ plus an injected term: $\|e^{(t+1)}\| \le \|W_2^{(t)}\|\,\|e^{(t)}\| + \|W_1^{(t)}\|\,\epsilon\,\|x\|$.
Telescoped bound
With $\gamma = \max_t \|W_2^{(t)}\|$ and uniform $\beta = \max_t \|W_1^{(t)}\|$, $\|e^{(T)}\| \le \beta\,\epsilon\,\|x\| \sum_{t=0}^{T-1} \gamma^t = \beta\,\epsilon\,\|x\|\,\frac{\gamma^T - 1}{\gamma - 1}$ (and $\le T\beta\epsilon\|x\|$ when $\gamma = 1$).
Lesson
If $\gamma > 1$, the bound grows exponentially with depth $T$: a concrete failure mode of distribution-shifted unfolded networks.