Exercises
ex-fsi-ch23-01
Easy: Show that the Huber loss $\rho_\delta(r)$ is continuously differentiable everywhere, including at $r = \pm\delta$.
Compute the left and right derivatives of $\rho_\delta$ at $r = \pm\delta$.
The quadratic piece has derivative $r$ at $|r| = \delta$; the linear piece has derivative $\delta\,\mathrm{sign}(r)$.
Piecewise derivative
For $|r| \le \delta$, $\rho_\delta'(r) = r$. For $|r| > \delta$, $\rho_\delta'(r) = \delta\,\mathrm{sign}(r)$.
Matching at the kink
At $r = \delta$, the quadratic piece gives $\rho_\delta'(\delta^-) = \delta$ and the linear piece gives $\rho_\delta'(\delta^+) = \delta\,\mathrm{sign}(\delta) = \delta$. The same match holds at $r = -\delta$. Hence $\rho_\delta'$ is continuous everywhere and equals $\psi_\delta$.
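As a quick sanity check, here is a minimal numerical sketch (assuming a hypothetical $\delta = 1.5$) that compares a central finite difference of the Huber loss against the piecewise $\psi_\delta$ on a grid straddling the kinks:

```python
import numpy as np

# Minimal check of continuity of the Huber derivative (delta = 1.5 is an assumed tuning).
delta = 1.5

def huber(r):
    return np.where(np.abs(r) <= delta, 0.5 * r**2, delta * np.abs(r) - 0.5 * delta**2)

def psi(r):
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

r = np.linspace(-4.0, 4.0, 100001)                      # grid straddling the kinks at +/- delta
num_deriv = (huber(r + 1e-6) - huber(r - 1e-6)) / 2e-6  # central finite differences
print(np.max(np.abs(num_deriv - psi(r))))               # tiny: the derivative matches psi everywhere
```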
ex-fsi-ch23-02
Easy: Compute the breakdown point of the trimmed mean that discards the $k$ smallest and $k$ largest of $n$ samples. Express it in terms of $k$ and $n$.
Breakdown point is the smallest fraction of samples whose replacement can drive the estimator to infinity.
How many contaminated points can the trimmed mean tolerate?
Count the tolerable contamination
If fewer than $k+1$ points are moved to infinity, the trimmed mean still averages only finite values; with $k+1$ contaminated points, at least one infinite value survives the trimming. Hence the finite-sample breakdown point is $(k+1)/n$, which tends to the trimming fraction $\alpha = \lim k/n$ asymptotically.
Interpretation
With trimming fraction $\alpha = k/n$, the breakdown point is approximately $\alpha$: intermediate between the mean ($0$) and the median ($1/2$).
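A small illustrative sketch (assuming $n = 100$, $k = 10$, and the value $10^9$ standing in for "infinity"): $k$ contaminated points are absorbed by the trimming, $k+1$ are not.

```python
import numpy as np

# Contaminate m points with a huge value; the 10%-trimmed mean breaks only once m = k + 1.
rng = np.random.default_rng(0)
n, k = 100, 10
x = rng.normal(size=n)

def trimmed_mean(v, k):
    vs = np.sort(v)
    return vs[k:len(vs) - k].mean()              # drop the k smallest and k largest

for m in (k, k + 1):
    y = x.copy()
    y[:m] = 1e9                                  # contaminate m points
    print(m, trimmed_mean(y, k), np.median(y))   # median stays finite in both cases
```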
ex-fsi-ch23-03
Easy: Verify that the influence function of the sample mean under the standard Gaussian is $\mathrm{IF}(x) = x$. Interpret the unbounded growth.
$\psi(x) = x - \theta$ for the squared-error loss; $\mathrm{IF}(x) = \psi(x)/\mathbb{E}_F[\psi'(X)]$.
Interpret: a single arbitrarily large $x$ moves the mean arbitrarily far.
Direct computation
The mean is the M-estimator with $\psi(x) = x - \theta$, so $\mathrm{IF}(x) = (x - \theta)/\mathbb{E}_F[\psi'(X)]$. Under $F = \mathcal{N}(0,1)$, $\theta = 0$ and $\mathbb{E}_F[\psi'(X)] = 1$. Hence $\mathrm{IF}(x) = x$.
Unboundedness
$|\mathrm{IF}(x)| \to \infty$ as $|x| \to \infty$: a single outlier of magnitude $x$ shifts the mean by roughly $x/n$, which is unbounded in $x$.
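The empirical counterpart is the sensitivity curve; a minimal sketch (assuming $n = 200$ Gaussian samples) contrasts the mean with the median:

```python
import numpy as np

# Sensitivity curve, an empirical proxy for the influence function:
# SC(x) = n * (T(x_1, ..., x_{n-1}, x) - T(x_1, ..., x_{n-1})).
rng = np.random.default_rng(1)
n = 200
base = rng.normal(size=n - 1)

for x in (1.0, 10.0, 100.0):
    sc_mean = n * (np.mean(np.append(base, x)) - np.mean(base))
    sc_median = n * (np.median(np.append(base, x)) - np.median(base))
    print(x, round(sc_mean, 2), round(sc_median, 2))   # mean grows like x, median stays bounded
```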
ex-fsi-ch23-04
Easy: Write the Gaussian KDE with bandwidth $h$ on samples $x_1, \dots, x_n$ in closed form. Verify that it integrates to $1$.
Use the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$.
Interchange sum and integral, then use $\int K(u)\,du = 1$.
Density
$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,h} \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right)$.
Normalization
$\int \hat f_h(x)\,dx = \frac{1}{n} \sum_{i=1}^{n} \int \frac{1}{h} K\!\left(\frac{x - x_i}{h}\right) dx = \frac{1}{n} \sum_{i=1}^{n} \int K(u)\,du = 1$, using the substitution $u = (x - x_i)/h$.
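A minimal numerical sketch (assuming a hypothetical bandwidth $h = 0.3$); a Riemann sum on a wide grid confirms the normalization:

```python
import numpy as np

# Gaussian KDE evaluated on a grid; its numerical integral should be ~1.
rng = np.random.default_rng(2)
samples = rng.normal(size=50)
h = 0.3

def kde(x, samples, h):
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]
print((kde(grid, samples, h) * dx).sum())   # ~1.0
```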
ex-fsi-ch23-05
Easy: State Mercer's condition for a function $k(x, x')$ to be a valid positive-definite kernel.
Symmetry plus positive-semidefinite Gram matrix for every finite set.
Equivalently: $k$ admits an eigen-expansion with non-negative eigenvalues.
PSD condition
$k$ is symmetric ($k(x, x') = k(x', x)$), and for every finite set $\{x_1, \dots, x_n\}$ the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semidefinite: $\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0$ for all $c \in \mathbb{R}^n$.
Mercer expansion
Equivalently, $k(x, x') = \sum_{j} \lambda_j \phi_j(x) \phi_j(x')$ with $\lambda_j \ge 0$ (in $L^2$ of a compact domain).
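An empirical sketch of the PSD condition for the RBF kernel $k(x, x') = \exp(-\|x - x'\|^2/2)$ (unit lengthscale assumed): the Gram matrix on random points has no eigenvalue below round-off.

```python
import numpy as np

# Build an RBF Gram matrix on random points and check its smallest eigenvalue.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))

sq_dists = ((X[:, None, :] - X[None, :, :])**2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)
print(np.linalg.eigvalsh(K).min())   # >= -1e-12, i.e. positive semidefinite
```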
ex-fsi-ch23-06
Medium: Derive the IRLS update for the Huber M-estimator of location. Show that each IRLS iteration is a weighted least-squares problem with weights $w_i = \psi_\delta(r_i)/r_i$.
Write the first-order condition $\sum_{i} \psi_\delta(y_i - \theta) = 0$.
Replace $\psi_\delta(r_i)$ by $w_i r_i$ and iterate.
Psi as weighted identity
$\psi_\delta(r) = r$ for $|r| \le \delta$ and $\psi_\delta(r) = \delta\,\mathrm{sign}(r)$ for $|r| > \delta$. Thus $\psi_\delta(r) = w(r)\, r$ with $w(r) = \min(1, \delta/|r|)$.
Fixed-point update
The score equation becomes $\sum_i w_i^{(t)} (y_i - \theta) = 0$, giving the weighted-mean update $\theta^{(t+1)} = \sum_i w_i^{(t)} y_i \big/ \sum_i w_i^{(t)}$ with $w_i^{(t)} = \min\!\bigl(1, \delta / |y_i - \theta^{(t)}|\bigr)$.
Interpretation
IRLS is a majorization-minimization scheme: at each step outliers (residuals with $|r_i| > \delta$) get downweighted to $w_i = \delta/|r_i| < 1$.
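A minimal IRLS sketch for the Huber location estimate (assuming $\delta = 1.345$, the classic 95%-efficiency tuning, and 5% gross outliers in the data): every iteration is the weighted mean with the derived weights.

```python
import numpy as np

# IRLS for the Huber location estimate on contaminated Gaussian data.
rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(size=95), rng.normal(50.0, 1.0, size=5)])  # 5% gross outliers
delta = 1.345

theta = np.median(y)                                              # robust starting point
for _ in range(100):
    r = y - theta
    w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))     # w_i = min(1, delta / |r_i|)
    theta_new = (w * y).sum() / w.sum()                           # weighted least-squares solution
    if abs(theta_new - theta) < 1e-10:
        break
    theta = theta_new

print(theta, y.mean())   # Huber estimate stays near 0; the plain mean is dragged to ~2.5
```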
ex-fsi-ch23-07
Medium: For the Gaussian KDE with bandwidth $h$, the asymptotic integrated MSE is $\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4}{4}\,\mu_2(K)^2 R(f'')$, where $R(g) = \int g(u)^2\,du$ and $\mu_2(K) = \int u^2 K(u)\,du$. Find the optimal $h$ and the resulting AMISE rate.
Differentiate with respect to $h$ and set to zero.
Report $h^* \propto n^{-1/5}$ and $\mathrm{AMISE}(h^*) \propto n^{-4/5}$.
First-order condition
$\frac{d}{dh}\mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + h^3 \mu_2(K)^2 R(f'') = 0$, giving $h^* = \left(\frac{R(K)}{\mu_2(K)^2 R(f'')}\right)^{1/5} n^{-1/5}$.
Optimal rate
Substituting back, $\mathrm{AMISE}(h^*) = O(n^{-4/5})$. This is the classical non-parametric rate for densities with bounded second derivative, slower than the parametric $O(n^{-1})$.
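A small sketch that plugs the Gaussian-reference constants into the $h^*$ formula (assuming a Gaussian kernel and an assumed $\mathcal{N}(0,1)$ target, so $R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$, $R(f'') = 3/(8\sqrt{\pi})$), which recovers the familiar $h^* \approx 1.06\, n^{-1/5}$:

```python
import numpy as np

# Evaluate h* and AMISE(h*) for a few sample sizes under the Gaussian reference.
R_K = 1.0 / (2.0 * np.sqrt(np.pi))
mu2 = 1.0
R_f2 = 3.0 / (8.0 * np.sqrt(np.pi))

for n in (100, 1000, 10000):
    h_star = (R_K / (mu2**2 * R_f2 * n))**0.2
    amise = R_K / (n * h_star) + 0.25 * h_star**4 * mu2**2 * R_f2
    print(n, round(h_star, 4), amise)   # h* shrinks like n^(-1/5), AMISE like n^(-4/5)
```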
ex-fsi-ch23-08
Medium: Prove that kernel ridge regression has solution $\hat f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$ with $\alpha = (K + \lambda I)^{-1} y$.
Invoke the representer theorem: $\hat f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$.
Substitute and minimize over $\alpha$.
Representer form
By the representer theorem, $\hat f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$ for some $\alpha \in \mathbb{R}^n$.
Finite-dim objective
Using the reproducing property and $\|f\|_{\mathcal{H}}^2 = \alpha^\top K \alpha$, the objective is $\|y - K\alpha\|^2 + \lambda\,\alpha^\top K \alpha$.
Closed-form solution
Setting the gradient to zero: $-2K(y - K\alpha) + 2\lambda K\alpha = 0$, so $K\bigl((K + \lambda I)\alpha - y\bigr) = 0$ and $\alpha = (K + \lambda I)^{-1} y$.
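A minimal KRR sketch (assuming an RBF kernel with lengthscale $0.5$ and $\lambda = 0.1$, fitting noisy samples of $\sin x$): compute $\alpha = (K + \lambda I)^{-1} y$ and predict with $\hat f(x) = \sum_i \alpha_i k(x_i, x)$.

```python
import numpy as np

# Kernel ridge regression in closed form with an RBF kernel.
rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0.0, 6.0, size=40))
y = np.sin(X) + 0.1 * rng.normal(size=40)
lam, ell = 0.1, 0.5

def rbf(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = rbf(X, X, ell)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)      # (K + lambda I)^{-1} y

X_test = np.linspace(0.5, 5.5, 6)
f_hat = rbf(X_test, X, ell) @ alpha                       # sum_i alpha_i k(x_i, x)
print(np.round(f_hat, 3), np.round(np.sin(X_test), 3))    # close to the noiseless sin(x)
```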
ex-fsi-ch23-09
Medium: Compute the posterior mean and variance of a GP at a test point $x_*$, given training data $(X, y)$, noise variance $\sigma^2$, and covariance function $k$.
Joint prior: $\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} K + \sigma^2 I & k_* \\ k_*^\top & k(x_*, x_*) \end{bmatrix}\right)$ with $k_* = [k(x_i, x_*)]_{i=1}^n$.
Apply the Gaussian conditioning formula.
Gaussian conditioning
The posterior is Gaussian with mean $\mu_* = k_*^\top (K + \sigma^2 I)^{-1} y$ and variance $\sigma_*^2 = k(x_*, x_*) - k_*^\top (K + \sigma^2 I)^{-1} k_*$.
Comparison with KRR
The posterior mean equals the KRR solution with $\lambda = \sigma^2$. The posterior variance is a pure GP feature: KRR gives no uncertainty.
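A minimal GP-posterior sketch (assuming an RBF kernel with lengthscale $0.5$ and noise $\sigma^2 = 0.01$): mean and variance at a single test point via the conditioning formulas above.

```python
import numpy as np

# GP posterior mean and variance at one test point x_*.
rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0.0, 6.0, size=30))
y = np.sin(X) + 0.1 * rng.normal(size=30)
sigma2, ell = 0.01, 0.5

def rbf(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = rbf(X, X, ell)
x_star = 3.0
k_star = rbf(X, np.array([x_star]), ell)[:, 0]                   # k(x_i, x_*)
sol = np.linalg.solve(K + sigma2 * np.eye(len(X)), np.column_stack([y, k_star]))

mu_star = k_star @ sol[:, 0]             # k_*^T (K + sigma^2 I)^{-1} y
var_star = 1.0 - k_star @ sol[:, 1]      # k(x_*, x_*) - k_*^T (K + sigma^2 I)^{-1} k_*
print(mu_star, var_star, np.sin(x_star)) # mean near sin(3), small positive variance
```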
ex-fsi-ch23-10
Medium: The Tukey biweight loss has derivative $\psi_c(r) = r\bigl(1 - (r/c)^2\bigr)^2$ for $|r| \le c$ and $\psi_c(r) = 0$ otherwise. Show that the corresponding loss $\rho_c$ is NOT convex and give one consequence.
Differentiate $\psi_c$ at a chosen $r \in (0, c)$ to show $\psi_c'$ changes sign.
If $\psi_c'(r) < 0$, then $\rho_c''(r) < 0$ at that point.
Non-monotone psi
$\psi_c(0) = 0$, $\psi_c$ grows, reaches a maximum, then decreases back to $0$ at $r = c$. Therefore $\psi_c' = \rho_c''$ is negative for some $r \in (0, c)$, so $\rho_c$ is non-convex.
Consequence
Gradient descent / IRLS may converge to a local minimum. Good initialization (e.g. by a convex M-estimate like Huber) is essential in practice.
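A tiny sketch (assuming the standard tuning $c = 4.685$) that samples $\psi_c'$ on $(0, c)$ and confirms it goes negative, i.e. $\rho_c$ has regions of negative curvature:

```python
import numpy as np

# psi_c'(r) = d/dr [ r (1 - (r/c)^2)^2 ] = (1 - t)(1 - 5t) with t = (r/c)^2.
c = 4.685
r = np.linspace(1e-3, c - 1e-3, 1000)
t = (r / c)**2
psi_prime = (1 - t) * (1 - 5 * t)
print(psi_prime.min() < 0, r[np.argmin(psi_prime)] / c)   # True; most negative near r ~ 0.775 c
```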
ex-fsi-ch23-11
Medium: Show the "kernel trick" for the polynomial kernel $k(x, z) = (1 + x^\top z)^2$ in $\mathbb{R}^2$: write the explicit feature map $\phi$ such that $k(x, z) = \phi(x)^\top \phi(z)$.
Expand $(1 + x_1 z_1 + x_2 z_2)^2$.
Match monomials of $(x_1, x_2)$ to coordinates of $\phi$.
Expansion
$(1 + x^\top z)^2 = 1 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2$.
Feature map
$\phi(x) = \bigl(1,\; x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2\bigr)^\top$. Then $\phi(x)^\top \phi(z) = (1 + x^\top z)^2 = k(x, z)$.
Lesson
We compute inner products in a 6-dimensional feature space at the cost of a 2-dimensional dot product, one addition, and one squaring.
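A two-line numerical sketch verifying $\phi(x)^\top \phi(z) = (1 + x^\top z)^2$ for random 2-D points:

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel in R^2.
rng = np.random.default_rng(7)
x, z = rng.normal(size=2), rng.normal(size=2)

def phi(v):
    return np.array([1.0, v[0]**2, v[1]**2,
                     np.sqrt(2) * v[0] * v[1],
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1]])

print(phi(x) @ phi(z), (1.0 + x @ z)**2)   # identical up to floating-point round-off
```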
ex-fsi-ch23-12
Medium: For ISTA with step $1/L$ and soft threshold $S_{\lambda/L}$, one iteration is $x^{(t+1)} = S_{\lambda/L}\bigl(x^{(t)} + \tfrac{1}{L} A^\top (y - A x^{(t)})\bigr)$. Rewrite LISTA as $T$ stacked layers with learnable $(W_1^{(t)}, W_2^{(t)}, \theta_t)$ per layer, and give the number of parameters.
$W_1 = \tfrac{1}{L} A^\top$ and $W_2 = I - \tfrac{1}{L} A^\top A$.
Count: parameters per layer for $W_1 \in \mathbb{R}^{n \times m}$, $W_2 \in \mathbb{R}^{n \times n}$.
LISTA recursion
$x^{(t+1)} = S_{\theta_t}\bigl(W_1^{(t)} y + W_2^{(t)} x^{(t)}\bigr)$, initialized at the ISTA values $W_1 = \tfrac{1}{L} A^\top$, $W_2 = I - \tfrac{1}{L} A^\top A$, $\theta = \lambda/L$. All $(W_1^{(t)}, W_2^{(t)}, \theta_t)$ are learned by backprop.
Parameter count
Per layer: $W_1^{(t)} \in \mathbb{R}^{n \times m}$ has $nm$ parameters, $W_2^{(t)} \in \mathbb{R}^{n \times n}$ has $n^2$, and the threshold $\theta_t$ gives $1$ more. Total per layer: $nm + n^2 + 1$. Over $T$ layers: $T(nm + n^2 + 1)$.
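A forward-pass sketch of the unfolded network, frozen at its ISTA initialization (assuming $m = 20$ measurements, $n = 50$ coefficients, $T = 8$ layers, $\lambda = 0.1$); in practice the per-layer $(W_1, W_2, \theta)$ would be trained by backprop.

```python
import numpy as np

# LISTA-style layers x <- S_theta(W1 y + W2 x), here left at their ISTA values.
rng = np.random.default_rng(8)
m, n, T, lam = 20, 50, 8, 0.1
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.normal(size=5)
y = A @ x_true

L = np.linalg.norm(A, 2)**2                        # Lipschitz constant of the data-fit gradient
W1 = [A.T / L] * T
W2 = [np.eye(n) - A.T @ A / L] * T
theta = [lam / L] * T

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = np.zeros(n)
for t in range(T):
    x = soft(W1[t] @ y + W2[t] @ x, theta[t])      # one unfolded layer
print("params per layer:", n * m + n * n + 1, "| error:", round(np.linalg.norm(x - x_true), 3))
```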
ex-fsi-ch23-13
Hard: Prove that at the minimax point, the Huber loss corresponds to ML estimation under the density proportional to $e^{-\rho_\delta(x)}$. Identify this density (hint: Gaussian center, Laplace tails).
ML under $f \propto e^{-\rho_\delta}$ gives $-\log f(x) = \rho_\delta(x) + \mathrm{const}$.
For $|x| \le \delta$: Gaussian-like $e^{-x^2/2}$; for $|x| > \delta$: Laplace-like $e^{-\delta|x| + \delta^2/2}$.
Least-favorable density
Huber showed that the minimax (least-favorable) density over the $\varepsilon$-contamination class is $f_0(x) \propto e^{-\rho_\delta(x)}$: Gaussian on the core $|x| \le \delta$, Laplace tails outside.
ML correspondence
Under $f_0$, the log-likelihood of a location parameter $\theta$ is $-\sum_i \rho_\delta(x_i - \theta) + \mathrm{const}$. Maximizing the log-likelihood is exactly Huber M-estimation.
Minimax interpretation
The Huber M-estimator is the MLE for the worst-case density in the contamination ball — hence minimax-optimal.
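A short sketch (assuming $\delta = 1$) of the density's structure: a Gaussian core $e^{-x^2/2}$, Laplace tails $e^{-(\delta|x| - \delta^2/2)}$, and $-\log f_0 = \rho_\delta$ on the whole line.

```python
import numpy as np

# Unnormalized least-favorable density and its negative log, compared to the Huber loss.
delta = 1.0
x = np.linspace(-5.0, 5.0, 2001)

f0 = np.where(np.abs(x) <= delta,
              np.exp(-0.5 * x**2),
              np.exp(-(delta * np.abs(x) - 0.5 * delta**2)))
rho = np.where(np.abs(x) <= delta, 0.5 * x**2, delta * np.abs(x) - 0.5 * delta**2)
print(np.max(np.abs(-np.log(f0) - rho)))   # 0: ML under f0 is exactly Huber M-estimation
```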
ex-fsi-ch23-14
Hard: For the Nadaraya–Watson estimator with Gaussian kernel and bandwidth $h$, derive the leading-order bias at an interior point $x$ and show it depends on $m'(x) f'(x)/f(x)$, where $m(x) = \mathbb{E}[Y \mid X = x]$ and $f$ is the design density.
Expand numerator and denominator around $x$ in powers of $h$.
Use $\int u K(u)\,du = 0$ and $\mu_2(K) = \int u^2 K(u)\,du$.
Numerator expansion
$\mathbb{E}[\hat g(x)] = m(x) f(x) + \tfrac{h^2}{2}\mu_2(K)\,(m f)''(x) + o(h^2)$, where $\hat g(x) = \frac{1}{nh}\sum_i Y_i K\!\left(\frac{x - X_i}{h}\right)$ is the numerator, the NW estimator is $\hat m(x) = \hat g(x)/\hat f(x)$, and $\hat f$ is the KDE of the design density.
Denominator expansion
$\mathbb{E}[\hat f(x)] = f(x) + \tfrac{h^2}{2}\mu_2(K) f''(x) + o(h^2)$.
Leading bias
Ratio expansion gives $\mathrm{bias}\{\hat m(x)\} = h^2 \mu_2(K)\left(\tfrac{1}{2} m''(x) + \tfrac{m'(x) f'(x)}{f(x)}\right) + o(h^2)$. The second term is the "design bias": it vanishes only when $f'(x) = 0$ or $m$ is flat.
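A Monte Carlo sketch of the design bias under assumed choices ($m(x) = x^2$, an exponential design so $f'/f = -1$, Gaussian kernel, $h = 0.3$, evaluated at $x_0 = 1$): the empirical bias roughly tracks $h^2\bigl(m''/2 + m' f'/f\bigr)$.

```python
import numpy as np

# Compare the Monte Carlo bias of the NW estimator to the leading-order formula.
rng = np.random.default_rng(9)
h, x0, reps, n = 0.3, 1.0, 300, 2000

def nw(xq, X, Y, h):
    w = np.exp(-0.5 * ((xq - X) / h)**2)
    return (w * Y).sum() / w.sum()

estimates = []
for _ in range(reps):
    X = rng.exponential(scale=1.0, size=n)      # design density f(x) = exp(-x), so f'/f = -1
    Y = X**2 + 0.1 * rng.normal(size=n)         # m(x) = x^2, m'(1) = 2, m''(1) = 2
    estimates.append(nw(x0, X, Y, h))

empirical_bias = np.mean(estimates) - x0**2
theory_bias = h**2 * (2.0 / 2.0 + 2.0 * (-1.0)) # mu2(K) = 1 for the Gaussian kernel
print(round(empirical_bias, 3), theory_bias)
```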
ex-fsi-ch23-15
Hard: Let $\hat f_n$ be the MSE-trained neural estimator learned from $n$ i.i.d. pairs $(X_i, Y_i)$. Under universal approximation and $n \to \infty$, prove that $\hat f_n(Y)$ converges to the MMSE estimator $\mathbb{E}[X \mid Y]$ almost surely.
Split the MSE into irreducible noise plus approximation error.
Universal approximation + law of large numbers.
MSE decomposition
$\mathbb{E}\|X - f(Y)\|^2 = \mathbb{E}\|X - \mathbb{E}[X \mid Y]\|^2 + \mathbb{E}\|\mathbb{E}[X \mid Y] - f(Y)\|^2$. The first term is irreducible; the second is minimized at $f(Y) = \mathbb{E}[X \mid Y]$.
Universal approximation
Neural nets with sufficient width can approximate $\mathbb{E}[X \mid Y = y]$ uniformly on compacta to any accuracy $\epsilon > 0$. Thus the population minimizer over the network class matches $\mathbb{E}[X \mid Y]$.
Consistency
By the uniform law of large numbers, the empirical MSE converges to the population MSE, so the minimizer of the empirical MSE converges to the population minimizer as $n \to \infty$, i.e. to the MMSE estimator $\mathbb{E}[X \mid Y]$.
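A small numerical sketch of the decomposition in a jointly Gaussian case where the MMSE estimator is known in closed form (assuming $X \sim \mathcal{N}(0,1)$, $Y = X + \mathcal{N}(0,1)$, so $\mathbb{E}[X \mid Y] = Y/2$):

```python
import numpy as np

# Any estimator f(Y) = a*Y pays the approximation penalty on top of the irreducible floor.
rng = np.random.default_rng(10)
X = rng.normal(size=200_000)
Y = X + rng.normal(size=200_000)
mmse = 0.5 * Y                                   # E[X | Y] for this model

for a in (0.3, 0.5, 0.7):
    total = np.mean((X - a * Y)**2)
    floor = np.mean((X - mmse)**2)               # irreducible term
    gap = np.mean((mmse - a * Y)**2)             # approximation term
    print(a, round(total, 3), round(floor + gap, 3))   # columns agree up to Monte Carlo error
```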
ex-fsi-ch23-16
Hard: Suppose a LISTA network trained on $y = A x + v$ is deployed where $y = (A + \Delta) x + v$ with $\|\Delta\|_2 \le \epsilon$. Give an upper bound on the recovery error in terms of $\epsilon$ and the operator norms of the learned weights.
Propagate the perturbation layer by layer; the soft threshold is 1-Lipschitz.
Telescope the per-layer Lipschitz constants over the $T$ layers.
Per-layer Lipschitz
Since $S_\theta$ is $1$-Lipschitz, a perturbation $e^{(t)} = \tilde x^{(t)} - x^{(t)}$ at layer $t$ is amplified by $\|W_2^{(t)}\|$ plus an injected term: $\|e^{(t+1)}\| \le \|W_2^{(t)}\|\,\|e^{(t)}\| + \|W_1^{(t)}\|\,\epsilon\,\|x\|$.
Telescoped bound
With $\gamma = \max_t \|W_2^{(t)}\|$ and uniform $\beta = \max_t \|W_1^{(t)}\|$, $\|e^{(T)}\| \le \beta\,\epsilon\,\|x\| \sum_{t=0}^{T-1} \gamma^t = \beta\,\epsilon\,\|x\|\,\frac{\gamma^T - 1}{\gamma - 1}$ (and $\le T\beta\epsilon\|x\|$ when $\gamma = 1$).
Lesson
If $\gamma > 1$, the bound grows exponentially with depth $T$: a concrete failure mode of distribution-shifted unfolded networks.