Robust Estimation

Why the Gaussian Assumption Fails

Every estimator we have derived so far paid implicit tribute to the Gaussian distribution. Least squares is maximum likelihood under additive Gaussian noise; the Kalman filter assumes Gaussian innovations; the MMSE estimator is optimal when prior and noise are both Gaussian. This is a comfortable world — but it is not the world a radar receiver lives in when a jammer is active, not the world a sensor network lives in when one node fails, and not the world a financial time series lives in during a crash.

One bad observation can wreck a least-squares fit. A single sample at three standard deviations contributes nine times as much to the sum of squared residuals as a sample at one, and the sample mean tracks it faithfully. Robust estimation is the discipline of designing estimators that refuse to be captured by a small number of outliers: estimators that, when the data distribution drifts slightly away from the assumed model, drift only slightly in response.

Definition:

M-Estimator

Given observations $y_1, \dots, y_n \in \mathbb{R}$ and a loss function $\rho: \mathbb{R} \to \mathbb{R}_{\geq 0}$, an M-estimator of location is any minimizer $$\hat{\theta}_n \in \arg\min_{\theta \in \mathbb{R}} \sum_{i=1}^n \rho(y_i - \theta).$$ Writing $\psi = \rho'$ (the score function), the estimator is characterized by the implicit equation $\sum_{i=1}^n \psi(y_i - \hat{\theta}_n) = 0$.

The letter "M" stands for maximum-likelihood-type: if ρ=logp\rho = -\log p for a density pp, then θ^n\hat{\theta}_n is the MLE under that density. By choosing ρ\rho different from the negative log-likelihood, we decouple the estimator from any fixed noise distribution.

M-estimation extends immediately to regression: replace $y_i - \theta$ with the residual $y_i - \mathbf{x}_i^T \boldsymbol{\beta}$ and minimize over $\boldsymbol{\beta}$.
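As a concrete sketch (ours, not from the source; the function name `m_estimate_location` and the simulated data are illustrative), the location M-estimator can be computed by direct numerical minimization of the loss sum:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_estimate_location(y, rho):
    """Generic M-estimator of location: minimize sum_i rho(y_i - theta)."""
    objective = lambda theta: np.sum(rho(y - theta))
    # For a convex loss the minimizer lies inside the data range.
    return minimize_scalar(objective, bounds=(y.min(), y.max()),
                           method="bounded").x

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=200)
y[:10] += 40.0                              # contaminate 5% of the sample

theta_ls  = m_estimate_location(y, lambda u: 0.5 * u**2)  # recovers the mean
theta_lad = m_estimate_location(y, np.abs)                # recovers the median
print(theta_ls, np.mean(y))    # LS estimate is dragged by the contamination
print(theta_lad, np.median(y)) # LAD estimate stays near the true location 3
```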

M-estimator

A parameter estimate defined as the minimizer of a sum of losses over the data, $\hat{\theta}_n = \arg\min_\theta \sum_i \rho(y_i - \theta)$. Includes MLE ($\rho = -\log p$), least squares ($\rho(u) = u^2/2$), least absolute deviations ($\rho(u) = |u|$), and Huber's proposal.

Related: Huber Loss, Influence Function, Finite-Sample and Asymptotic Breakdown Point

Example: Three M-Estimators of Location

Consider the location model $y_i = \theta + w_i$ with $w_i$ i.i.d. Using the three losses $\rho_{\text{LS}}(u) = u^2/2$, $\rho_{\text{LAD}}(u) = |u|$, and the Huber loss $\rho_\delta(u) = u^2/2$ for $|u| \leq \delta$, $\delta|u| - \delta^2/2$ otherwise, identify the resulting estimators.
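Worked solution sketch (added here): setting the score sum to zero identifies each estimator.

$$\rho_{\text{LS}}: \quad \sum_{i=1}^n (y_i - \hat\theta) = 0 \;\Longrightarrow\; \hat\theta = \bar{y} \quad \text{(the sample mean)};$$

$$\rho_{\text{LAD}}: \quad \sum_{i=1}^n \operatorname{sign}(y_i - \hat\theta) = 0 \;\Longrightarrow\; \hat\theta = \operatorname{median}(y_1, \dots, y_n);$$

$$\rho_\delta: \quad \sum_{i=1}^n \psi_\delta(y_i - \hat\theta) = 0, \qquad \psi_\delta(u) = \max(-\delta, \min(\delta, u)),$$

so the Huber estimator acts like the mean on residuals inside $[-\delta, \delta]$ and like the median on clipped residuals outside, interpolating between the two.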

Definition:

Huber Loss

For $\delta > 0$ the Huber loss is $$\rho_\delta(u) = \begin{cases} \tfrac{1}{2} u^2 & |u| \leq \delta, \\ \delta|u| - \tfrac{1}{2}\delta^2 & |u| > \delta. \end{cases}$$ It is continuously differentiable (the pieces match in value and derivative at $u = \pm\delta$), convex, and has bounded score $\psi_\delta(u) = \max(-\delta, \min(\delta, u))$. The parameter $\delta$ tunes the transition between quadratic and linear regimes.
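A few lines of NumPy (ours; names illustrative) confirm the value-and-derivative matching at the transition:

```python
import numpy as np

def huber_rho(u, delta):
    """Huber loss: quadratic for |u| <= delta, linear beyond."""
    return np.where(np.abs(u) <= delta, 0.5 * u**2,
                    delta * np.abs(u) - 0.5 * delta**2)

def huber_psi(u, delta):
    """Score psi = rho': the residual clipped to [-delta, delta]."""
    return np.clip(u, -delta, delta)

delta, eps = 1.345, 1e-6
# The two pieces match in value and slope at the transition u = delta.
print(huber_rho(delta - eps, delta), huber_rho(delta + eps, delta))
print(huber_psi(delta - eps, delta), huber_psi(delta + eps, delta))
```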

Huber Loss

The convex, $C^1$ loss $\rho_\delta$ that is quadratic near the origin (so that small residuals are handled as in least squares) and linear in the tails (so that large residuals cannot dominate). Its score is clipped at $\pm\delta$.

Related: M-Estimator, Influence Function

Convexity Reflex

The Huber loss is convex. This is not a cosmetic fact: convexity guarantees that any local minimum of the Huber M-estimation problem is a global minimum, that the problem is tractable by first-order methods, and that the estimator is unique when the loss is strictly convex on the relevant region. Non-convex robust losses (Tukey's biweight, Hampel's three-part) trade this away for better rejection of extreme outliers, and they pay the price in optimization hardness and sensitivity to initialization. We will return to this trade-off when we discuss breakdown.

Theorem: Huber's Minimax Theorem

Consider the $\varepsilon$-contamination neighborhood $\mathcal{F}_\varepsilon = \{(1-\varepsilon)\Phi + \varepsilon H : H \text{ any distribution}\}$ centered at the standard Gaussian $\Phi$. Over location estimators with bounded asymptotic variance, the estimator that minimizes the worst-case asymptotic variance over $\mathcal{F}_\varepsilon$ is the Huber M-estimator with $\psi_\delta(u) = \max(-\delta, \min(\delta, u))$, where $\delta = \delta(\varepsilon)$ solves $$\frac{2\phi(\delta)}{\delta} - 2\Phi(-\delta) = \frac{\varepsilon}{1-\varepsilon}.$$

Without contamination ($\varepsilon = 0$) the sample mean is optimal. With contamination, the least favourable distribution is Gaussian in the centre and exponential in the tails — which is exactly the density whose negative log is the Huber loss. Huber's estimator is the MLE for the worst-case noise.
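The left side of the defining equation decreases monotonically in $\delta$, so a bracketing root-finder recovers $\delta(\varepsilon)$ directly. A short sketch (ours), using SciPy:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def huber_delta(eps):
    """Solve 2*phi(delta)/delta - 2*Phi(-delta) = eps/(1-eps) for delta."""
    g = lambda d: 2 * norm.pdf(d) / d - 2 * norm.cdf(-d) - eps / (1 - eps)
    # LHS blows up as d -> 0 and vanishes as d -> infinity: one root.
    return brentq(g, 1e-6, 10.0)

for eps in (0.01, 0.05, 0.10):
    print(f"eps = {eps:.2f}  ->  delta = {huber_delta(eps):.3f}")
```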


Historical Note: Huber's 1964 Paper


Peter J. Huber introduced robust location estimation in his 1964 Annals of Mathematical Statistics paper "Robust Estimation of a Location Parameter." Working at Berkeley, he asked a question that in retrospect seems obvious but was heretical at the time: what is the best estimator if we are almost — but not quite — sure that the noise is Gaussian? His minimax framework, and the loss that now bears his name, founded the field of robust statistics and remain the textbook example of how to extract a rigorous estimator from a plausible model of imperfect knowledge. Huber's 1981 book Robust Statistics (second edition with E. M. Ronchetti, 2009) remains the standard reference.

Definition:

Influence Function

Let $T$ be a statistical functional (a map from distributions to $\mathbb{R}$, e.g., $T(F) = \int u\,dF(u)$ for the mean). The influence function of $T$ at $F$ is $$\text{IF}(y; T, F) = \lim_{\varepsilon \downarrow 0} \frac{T((1-\varepsilon)F + \varepsilon\delta_y) - T(F)}{\varepsilon},$$ where $\delta_y$ is a point mass at $y$. It measures the first-order change in the functional when an infinitesimal mass is added at $y$. For an M-estimator with score $\psi$, $$\text{IF}(y; T, F) = \frac{\psi(y - T(F))}{\int \psi'(u - T(F))\,dF(u)}.$$
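The M-estimator formula can be checked numerically by actually perturbing the distribution: contaminate a fine Gaussian grid with a small mass at $y_0$, re-solve the population score equation, and compare the finite-$\varepsilon$ difference quotient to the closed form. The sketch below (our construction; all names illustrative) does this for the Huber functional:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

delta = 1.345
psi = lambda u: np.clip(u, -delta, delta)

def T(eps, y0, z):
    """Population M-functional: root of E_F[psi(Y - t)] = 0 under the
    contaminated distribution (1-eps)*N(0,1) + eps*point mass at y0."""
    score = lambda t: (1 - eps) * psi(z - t).mean() + eps * psi(y0 - t)
    return brentq(score, -10.0, 10.0)

# Quantile grid approximating N(0,1) for the expectation.
z = norm.ppf((np.arange(1, 200001) - 0.5) / 200000)
y0, eps = 5.0, 1e-4
numeric_if = (T(eps, y0, z) - T(0.0, y0, z)) / eps

# Closed form at F = N(0,1): psi(y0) / E[psi'(Z)], E[psi'] = P(|Z| <= delta).
analytic_if = psi(y0) / (2 * norm.cdf(delta) - 1)
print(numeric_if, analytic_if)      # should agree to roughly 1e-3
```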

Influence Function

The Gâteaux derivative of a statistical functional at a distribution $F$, evaluated in the direction of a point mass $\delta_y$. A bounded influence function means the estimator cannot be arbitrarily dragged by a single extreme observation.

Related: M-Estimator, Finite-Sample and Asymptotic Breakdown Point, Gross Error Sensitivity

Influence Functions of M-Estimators

Compare the influence functions of the mean, median, Huber, and Tukey biweight M-estimators on standard Gaussian data. The mean has an unbounded influence function — hence one bad observation does unbounded damage. The median, Huber, and biweight have bounded influence functions, with the biweight redescending to zero (a bad point eventually contributes nothing).

Parameters
  • Huber transition threshold: $1.345$ (gives 95% efficiency under Gaussian noise)
  • Tukey biweight cut-off: $4.685$ (gives 95% efficiency)

Definition:

Finite-Sample and Asymptotic Breakdown Point

For an estimator $T_n(y_1, \dots, y_n)$, the finite-sample breakdown point is $$\varepsilon_n^\star(T_n; y_1, \dots, y_n) = \frac{1}{n} \min\left\{ k : \sup_{y'_{i_1}, \dots, y'_{i_k}} |T_n(\text{corrupted sample})| = \infty \right\},$$ i.e., the smallest fraction of observations an adversary can replace with arbitrary values to send $T_n$ to infinity. The asymptotic breakdown point is $\varepsilon^\star = \lim_{n\to\infty} \varepsilon_n^\star$.

A larger breakdown point means better resistance to outliers, with a theoretical maximum of $\varepsilon^\star = 1/2$ (beyond half contamination, no estimator can distinguish signal from noise).

Breakdown Point

The largest fraction of contaminated observations an estimator can tolerate before it can be pushed to an arbitrary value. The mean has breakdown point $0$, the median has breakdown point $1/2$, and Huber estimators inherit the median's breakdown point asymptotically when used with a well-chosen scale estimate.

Related: Influence Function, M-Estimator

Example: Breakdown of the Sample Mean vs. the Median

Show that the sample mean has finite-sample breakdown point $\varepsilon_n^\star = 1/n$ and the sample median has finite-sample breakdown point $\varepsilon_n^\star = \lfloor (n-1)/2 \rfloor / n$, which tends to $1/2$.
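A numerical illustration of the two claims (our sketch, with $n = 101$ so that $\lfloor (n-1)/2 \rfloor = 50$):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=101)

def corrupt(y, k, value=1e9):
    """Replace k observations with an arbitrarily large value."""
    z = y.copy()
    z[:k] = value
    return z

# One corrupted point already controls the mean (breakdown 1/n)...
print(np.mean(corrupt(y, 1)))      # ~ 1e9/101: grows without bound in `value`
# ...while the median survives up to floor((n-1)/2) = 50 corrupted points.
print(np.median(corrupt(y, 50)))   # still near 0
print(np.median(corrupt(y, 51)))   # breaks: the corrupted points are the bulk
```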

Theorem: Influence Function Determines Asymptotic Variance

Let $T$ be a sufficiently smooth functional with influence function $\text{IF}(\cdot; T, F)$. Under regularity, the plug-in estimator $T_n = T(F_n)$ (with $F_n$ the empirical distribution) is asymptotically normal: $$\sqrt{n}(T_n - T(F)) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \int \text{IF}(y; T, F)^2 \,dF(y)\right).$$

Differentiating a functional and plugging in the empirical measure is a delta-method argument. The "derivative" is the influence function, and the asymptotic variance is its second moment under $F$ — exactly analogous to how the variance of a maximum likelihood estimator is the reciprocal of Fisher information.
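For the sample median at the standard Gaussian, $\text{IF}(y) = \operatorname{sign}(y)/(2\phi(0))$, so the theorem predicts asymptotic variance $\int \text{IF}^2\,d\Phi = 1/(4\phi(0)^2) = \pi/2 \approx 1.571$. A quick Monte Carlo check (our sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 400, 20000
medians = np.median(rng.normal(size=(reps, n)), axis=1)

# n * Var(median) should approach the IF second moment pi/2 ~ 1.571.
print(n * medians.var())
print(np.pi / 2)
```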

Common Mistake: Huber Without a Scale Estimate

Mistake:

Applying the Huber loss with a fixed $\delta$ (e.g., $\delta=1.345$) directly to raw residuals without first estimating the noise scale.

Correction:

The recommended $\delta$ values (1.345 for 95% efficiency, 1.0 for 90%) are calibrated for residuals that have been standardized to unit scale. In practice one estimates a robust scale $\hat{s}$ — commonly the median absolute deviation $\hat{s} = 1.4826 \cdot \text{median}_i |y_i - \hat{\theta}|$ — and applies $\rho_\delta((y_i - \hat{\theta})/\hat{s})$. Without this, the "efficiency at the Gaussian" guarantees are lost, and the estimator may be more or less robust than you intended. The factor 1.4826 makes the MAD a consistent estimate of $\sigma$ at the Gaussian.
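A small demonstration of why the scale itself must be robust (our sketch): the sample standard deviation is captured by the very contamination we want to resist, while the MAD is not.

```python
import numpy as np

def mad(r):
    """Median absolute deviation, rescaled by 1.4826 so that it consistently
    estimates sigma when the bulk of the data is Gaussian."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))

rng = np.random.default_rng(3)
r = rng.normal(scale=2.0, size=1000)
r[:50] *= 20.0                      # 5% impulsive contamination
print(np.std(r))                    # sample std: inflated to ~9 by the bursts
print(mad(r))                       # MAD: still close to the true sigma = 2
```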

Robust vs. Least-Squares Regression Under Outlier Contamination

Watch the least-squares fit drift toward a growing outlier while the Huber fit holds its ground.
Sweep: a single outlier's $y$-coordinate grows from its Gaussian draw to $+50$. The least-squares slope tilts visibly even at moderate contamination levels; the Huber fit ($\delta=1.345$) remains anchored to the bulk of the data.

Robust vs. Least-Squares Regression

Generate a scatter of $n$ points with true slope $\beta=1$, intercept $0$, and add a fraction $\varepsilon$ of outliers with amplified noise. The plot shows the least-squares fit, the Huber-M fit, and the least-absolute-deviations (LAD) fit. As $\varepsilon$ grows, LS tips over — Huber and LAD do not.

Parameters
  • $n = 100$ points
  • outlier fraction $\varepsilon = 0.15$
  • outlier noise amplification $10$
  • Huber threshold $\delta = 1.345$

Iteratively Reweighted Least Squares (IRLS) for Huber Regression

Complexity: $O(p^2 n)$ per iteration; typically 5–20 iterations
Input: Design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, response $\mathbf{y} \in \mathbb{R}^n$, threshold $\delta > 0$, tolerance $\eta$
Output: Huber-M regression coefficients $\hat{\boldsymbol{\beta}}$
1. Initialize $\hat{\boldsymbol{\beta}}^{(0)} \leftarrow (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (least squares)
2. for $k = 0, 1, 2, \ldots$ do
3. $\quad$ Compute residuals $r_i^{(k)} \leftarrow y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}}^{(k)}$
4. $\quad$ Estimate scale: $\hat{s}^{(k)} \leftarrow 1.4826 \cdot \text{median}_i |r_i^{(k)}|$
5. $\quad$ Form weights $w_i^{(k)} \leftarrow \min\!\left(1,\, \delta \hat{s}^{(k)} / |r_i^{(k)}| \right)$
6. $\quad$ Update $\hat{\boldsymbol{\beta}}^{(k+1)} \leftarrow (\mathbf{X}^T \mathbf{W}^{(k)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(k)} \mathbf{y}$, where $\mathbf{W}^{(k)} = \text{diag}(w_i^{(k)})$
7. $\quad$ if $\|\hat{\boldsymbol{\beta}}^{(k+1)} - \hat{\boldsymbol{\beta}}^{(k)}\| < \eta$ then break
8. end for
9. return $\hat{\boldsymbol{\beta}}^{(k+1)}$

IRLS converts the Huber problem into a sequence of weighted least-squares problems, each solved in closed form. Because the Huber loss is convex, IRLS converges to the global minimum. The weight $w_i \in (0,1]$ equals $1$ on "inliers" (residuals smaller than $\delta \hat{s}$) and shrinks as $1/|r_i|$ on "outliers," giving them linear rather than quadratic pull.
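A direct translation of the pseudocode into NumPy (our sketch; the function name `huber_irls` and the synthetic data are illustrative):

```python
import numpy as np

def huber_irls(X, y, delta=1.345, tol=1e-8, max_iter=100):
    """Huber-M regression via IRLS, following the pseudocode above."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # step 1: LS start
    for _ in range(max_iter):
        r = y - X @ beta                              # step 3: residuals
        s = 1.4826 * np.median(np.abs(r))             # step 4: MAD scale
        w = np.minimum(1.0, delta * s / np.maximum(np.abs(r), 1e-12))  # step 5
        Xw = X * w[:, None]                           # step 6: weighted LS,
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)  # (X^T W X) b = X^T W y
        if np.linalg.norm(beta_new - beta) < tol:     # step 7: convergence
            return beta_new
        beta = beta_new
    return beta

# Illustrative data: slope 1, intercept 0, 15% gross outliers.
rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
y = X @ np.array([0.0, 1.0]) + rng.normal(size=n)
y[:15] += 40.0
print(np.linalg.lstsq(X, y, rcond=None)[0])   # LS: visibly pulled off target
print(huber_irls(X, y))                       # Huber: close to [0, 1]
```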

Why This Matters: Robust Estimation Under Jamming

A jammer that injects sporadic high-power pulses into a radar or communication receiver turns the noise distribution into a Gaussian mixture: most of the time, thermal noise; occasionally, jammer bursts. This is exactly the $\varepsilon$-contamination model that motivated Huber. Replacing an LS channel estimator with a Huber M-estimator, or an LAD one, buys the receiver 5–15 dB of robustness against impulsive interference at the cost of a small asymptotic efficiency loss (a few percent) under clean Gaussian conditions. The bounded influence function is the receiver's insurance policy: no single pulse, no matter how strong, can deflect the channel estimate beyond a fixed bound.

See full treatment in The Ziv-Zakai Bound

🎓 CommIT Contribution (2018)

Robust Channel Estimation in Impulsive Noise

S. Haghighatshoar, G. Caire, IEEE Trans. Signal Processing, vol. 66, no. 7

The CommIT group has investigated how massive-MIMO channel estimators behave when the assumed Gaussian noise model is violated by impulsive interference. A recurring theme is that subspace-based estimators, when combined with robust scale measures (MAD or trimmed Frobenius norms), retain most of their statistical efficiency while gaining orders of magnitude of outlier tolerance. This is a concrete instance of Huber's minimax philosophy applied to a structured estimation problem.

⚠️ Engineering Note

Tuning δ\delta in Practice

The Huber threshold $\delta$ trades asymptotic efficiency against robustness. Standard calibrations:

  • $\delta = 1.345$ gives 95% asymptotic efficiency under clean Gaussian noise (the textbook default).
  • $\delta = 1.0$ gives 90% efficiency, considerably more robust.
  • $\delta \to \infty$ degenerates to least squares (0 breakdown).
  • $\delta \to 0^+$ degenerates to least-absolute-deviations (median, breakdown 0.5).

In a modern pipeline, $\delta$ is often cross-validated on held-out data rather than fixed a priori. When noise statistics are known to be heavier-tailed (e.g., from prior measurement campaigns), $\delta$ should be chosen smaller.
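The efficiency figures above follow from the formula $\text{eff}(\delta) = (\mathbb{E}[\psi_\delta'(Z)])^2 / \mathbb{E}[\psi_\delta(Z)^2]$ with $Z \sim \mathcal{N}(0,1)$, which is easy to evaluate in closed form (our sketch):

```python
import numpy as np
from scipy.stats import norm

def huber_efficiency(delta):
    """Asymptotic efficiency of the Huber estimator at the clean Gaussian:
    (E[psi'(Z)])^2 / E[psi(Z)^2], Z ~ N(0,1), psi the clipped score."""
    p_in = 2 * norm.cdf(delta) - 1          # E[psi'] = P(|Z| <= delta)
    # E[psi^2] = E[Z^2; |Z|<=delta] + delta^2 * P(|Z|>delta)
    e_psi2 = p_in - 2 * delta * norm.pdf(delta) + delta**2 * (1 - p_in)
    return p_in**2 / e_psi2

for d in (0.5, 1.0, 1.345, 2.0):
    print(f"delta = {d:5.3f} -> Gaussian efficiency = {huber_efficiency(d):.3f}")
```

Running this reproduces the calibrations quoted above: $\delta = 1.345$ gives efficiency $\approx 0.95$ and $\delta = 1.0$ gives $\approx 0.90$.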

Practical Constraints
  • Scale must be estimated robustly (MAD, not sample standard deviation)
  • $\delta$ applies to standardized residuals $r_i/\hat{s}$, not raw residuals
  • In Monte Carlo studies, the Huber estimator typically costs 0.1–0.3 dB relative to least squares under clean Gaussian conditions

Common Mistake: Redescending M-Estimators Need Good Initialization

Mistake:

Using a redescending (non-convex) estimator like Tukey's biweight with a naive initial guess (e.g., zeros) and expecting IRLS to converge to the global minimum.

Correction:

Redescending estimators have multiple local minima — a point with $|r_i|$ large enough gets weight zero, so IRLS can get stuck in a basin that ignores a cluster of inliers. The standard remedy is a two-stage procedure: first fit a monotone M-estimator (Huber or an S-estimator) to get a robust initialization, then run redescending IRLS from there. Without this, the "better" robustness of the biweight is an illusion.
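A minimal sketch of the two-stage procedure (ours; it reuses the `huber_irls` function from the IRLS sketch above and the biweight weight $w(u) = (1-u^2)^2\,\mathbf{1}\{|u|<1\}$ on residuals standardized by $c\hat{s}$):

```python
import numpy as np

def tukey_irls(X, y, c=4.685, iters=30):
    """Two-stage redescending fit: monotone Huber fit for a robust start,
    then Tukey biweight IRLS. Assumes huber_irls (defined earlier) in scope."""
    beta = huber_irls(X, y)                 # stage 1: convex, monotone score
    for _ in range(iters):                  # stage 2: redescending refinement
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r))   # robust MAD scale
        u = r / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)  # biweight
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta
```

Started from the Huber fit, the zero-weight mechanism rejects gross outliers entirely; started from a naive guess, the same mechanism is what lets IRLS discard genuine inliers.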

Quick Check

An estimator has influence function $\text{IF}(y) = y/(1+y^2)$. What can you say about its breakdown point?

$\varepsilon^\star = 0$: unbounded influence

$\varepsilon^\star > 0$: bounded influence, so robust

$\varepsilon^\star = 1/2$ exactly

Cannot tell from the influence function alone

Quick Check

A radar receiver has been designed assuming Gaussian thermal noise, but field testing reveals occasional strong interference pulses. You retrofit the receiver with a Huber M-estimator. In what direction should you move $\delta$ relative to the textbook default $1.345$?

Increase $\delta$ — need more efficiency

Decrease $\delta$ — need more robustness

Keep $\delta=1.345$

Set $\delta=0$ for maximum robustness

Key Takeaway

Robust M-estimation replaces the Gaussian-optimal sum of squared residuals with a loss whose score is bounded. The Huber estimator is convex (tractable), has a bounded influence function (no single outlier dominates), and is minimax-optimal over $\varepsilon$-contamination neighborhoods of the Gaussian. The price, about $5\%$ of asymptotic efficiency at the clean Gaussian, is a good deal in any application where the noise model is approximately, but not exactly, Gaussian.