Minimax Estimation

When There Is No Single Best Estimator

The admissibility results of Section 22.3 tell us which estimators can be ruled out. They do not tell us which single estimator to use. Different estimators trade off risk across different parts of the parameter space: some perform well when $\boldsymbol{\theta}$ is small, others when it is large. Minimax theory resolves this tension by picking the estimator whose worst-case risk is smallest. The minimax estimator gives a guarantee that holds uniformly over the parameter class, and this guarantee becomes the natural benchmark whenever no reliable prior is available.

Beyond ranking individual estimators, minimax rates carry deep information: they characterise how much data is required to estimate a parameter of a given complexity. For sparse signals the minimax rate is $O(s\log(N/s)/M)$, a formula that packages together the three pillars of high-dimensional statistics: the dimension $N$, the sample size $M$, and the structural complexity $s$.

Definition:

Minimax Risk

Let $\Theta$ be a parameter space, $L$ a loss function, and $R(\hat\theta,\theta)=\mathbb{E}_\theta[L(\hat\theta,\theta)]$ the frequentist risk. The minimax risk over $\Theta$ is

$$r_*^{\text{mm}}(\Theta)=\inf_{\hat\theta}\sup_{\theta\in\Theta}R(\hat\theta,\theta),$$

where the infimum is taken over all (measurable) estimators. An estimator $\hat\theta^*$ is minimax if it attains this infimum: $\sup_{\theta\in\Theta}R(\hat\theta^*,\theta)=r_*^{\text{mm}}(\Theta)$.

Minimax makes no statement about what $\theta$ actually is, only about the worst case. This is conservative by design: minimax estimators hedge against adversarial parameter choices.
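A quick way to make the definition concrete is to compute worst-case risks in closed form for a simple family of rules. The sketch below (plain NumPy, with an illustrative bound $\tau=0.5$) compares the MLE $\hat\theta=Y$ with linear shrinkage rules $\hat\theta_c=cY$ for a bounded normal mean $Y\sim\mathcal{N}(\theta,1)$, $\theta\in[-\tau,\tau]$; the best shrinkage factor and its worst-case risk are found by a grid search over $c$.

```python
import numpy as np

# Worst-case risk of linear rules delta_c(y) = c*y for Y ~ N(theta, 1), squared-error
# loss, theta restricted to [-tau, tau].  The exact risk is variance + bias^2:
#   R(delta_c, theta) = c^2 + (1 - c)^2 * theta^2,
# so the supremum over [-tau, tau] is attained at theta = +/- tau.
tau = 0.5                                   # illustrative bound on the mean (assumption)
c_grid = np.linspace(0.0, 1.0, 1001)
worst_case = c_grid**2 + (1 - c_grid)**2 * tau**2

best = np.argmin(worst_case)
print(f"MLE (c = 1) worst-case risk : {1.0:.4f}")
print(f"best linear c               : {c_grid[best]:.3f}   (tau^2/(1+tau^2) = {tau**2/(1+tau**2):.3f})")
print(f"its worst-case risk         : {worst_case[best]:.4f}")
```

Even within this restricted family, shrinking toward zero cuts the worst-case risk from $1$ to $\tau^2/(1+\tau^2)$, which foreshadows the bounded-mean theorem below.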

Definition:

Bayes Risk and Least-Favourable Priors

For a prior $\pi$ on $\Theta$, the Bayes risk of an estimator is $r(\hat\theta,\pi)=\int R(\hat\theta,\theta)\,d\pi(\theta)$, and the Bayes-optimal estimator $\hat\theta_\pi$ minimises $r(\cdot,\pi)$. A prior $\pi^*$ is least favourable if $r(\hat\theta_{\pi^*},\pi^*)=\sup_\pi r(\hat\theta_\pi,\pi)$.

The minimax = maximin duality (when it holds) states

$$\inf_{\hat\theta}\sup_\theta R(\hat\theta,\theta)=\sup_\pi\inf_{\hat\theta}r(\hat\theta,\pi).$$

When a least-favourable prior exists, the Bayes estimator under that prior is minimax. This is the workhorse construction for finding minimax estimators: guess the least-favourable prior, compute the Bayes estimator, and check that its worst-case risk matches the Bayes risk under $\pi^*$.
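The sketch below runs this recipe numerically for the bounded normal mean $Y\sim\mathcal{N}(\theta,1)$, $\theta\in[-\tau,\tau]$ (the same setting as the theorem that follows), under the guess that the symmetric two-point prior at $\pm\tau$ is least favourable; this guess is known to be correct only for small $\tau$, and the value of $\tau$ below is illustrative. The Bayes estimator under that prior works out to $\hat\theta(y)=\tau\tanh(\tau y)$.

```python
import numpy as np

# Guess-and-verify recipe for Y ~ N(theta, 1), theta in [-tau, tau], squared-error loss.
# Guess: the two-point prior pi = (delta_{+tau} + delta_{-tau}) / 2.
# Its posterior mean (Bayes estimator) is delta(y) = tau * tanh(tau * y).
tau = 0.5                                    # illustrative; the two-point prior is least
                                             # favourable only for small tau (assumption)
z = np.linspace(-8.0, 8.0, 4001)             # grid for expectations over Z ~ N(0, 1)
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
dz = z[1] - z[0]

def risk(theta):
    """Frequentist risk R(delta, theta) = E[(tau*tanh(tau*(theta + Z)) - theta)^2]."""
    est = tau * np.tanh(tau * (theta + z))
    return np.sum((est - theta) ** 2 * phi) * dz

thetas = np.linspace(-tau, tau, 201)
worst_case = max(risk(t) for t in thetas)
bayes_risk = 0.5 * (risk(tau) + risk(-tau))  # average risk under the two-point prior

print(f"worst-case risk over [-tau, tau] : {worst_case:.4f}")
print(f"Bayes risk under two-point prior : {bayes_risk:.4f}")
# The two numbers agree, so the Bayes rule is minimax over [-tau, tau] and the
# two-point prior is least favourable (for this small tau).
```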

Theorem: Minimax Risk for Bounded Gaussian Mean

Let $Y\sim\mathcal{N}(\theta,1)$ with $\theta\in[-\tau,\tau]$ for some $\tau>0$. Under squared-error loss, the minimax risk satisfies

$$r_*^{\text{mm}}([-\tau,\tau])=\sup_\pi\Bigl[\mathrm{var}_\pi(\theta)-\mathrm{var}_\pi(\mathbb{E}[\theta\mid Y])\Bigr]=\frac{\tau^2}{1+\tau^2}\cdot(1+o(1))\quad\text{as }\tau\to 0.$$

As $\tau\to\infty$ the minimax risk approaches $1$, the unbounded Gaussian case.

The least-favourable prior on a bounded interval concentrates its mass at the endpoints: the adversary places $\theta$ at the extremes, where estimation is hardest. For small $\tau$ the problem is easy and the risk scales as $\tau^2$; for large $\tau$ the constraint becomes vacuous and the risk plateaus at the unrestricted Gaussian value.
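To see where the value $\tau^2/(1+\tau^2)$ comes from, it helps to restrict attention to the linear rules $\hat\theta_c(y)=cy$ used in the sketch above (a heuristic only: the exact minimax rule is nonlinear, but it agrees with the best linear rule to first order as $\tau\to 0$). The risk splits into variance plus squared bias,

$$R(\hat\theta_c,\theta)=c^2+(1-c)^2\theta^2,\qquad \sup_{|\theta|\le\tau}R(\hat\theta_c,\theta)=c^2+(1-c)^2\tau^2,$$

and minimising the right-hand side over $c$ gives $c^*=\tau^2/(1+\tau^2)$ with worst-case risk exactly $\tau^2/(1+\tau^2)$.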

Theorem: Minimax Rate for Sparse Estimation

Consider the model $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{w}$ with $\mathbf{A}\in\mathbb{R}^{M\times N}$ having i.i.d. $\mathcal{N}(0,1)$ entries, $\mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_M)$, and $\mathbf{x}\in\Theta_s=\{\mathbf{x}\in\mathbb{R}^N:\|\mathbf{x}\|_0\leq s\}$. Under squared-error loss, the minimax risk satisfies

$$\inf_{\hat{\mathbf{x}}}\sup_{\mathbf{x}\in\Theta_s}\mathbb{E}\bigl[\|\hat{\mathbf{x}}-\mathbf{x}\|^2\bigr]\;\asymp\;\frac{\sigma^2\,s\log(N/s)}{M}$$

as $N,M,s\to\infty$ with $s\log(N/s)/M\to 0$. Both the upper bound (achieved by LASSO with appropriate $\lambda$, or by best-subset selection) and the matching lower bound hold.

The rate $s\log(N/s)/M$ is the price of not knowing which $s$ coordinates are active. The $\log(N/s)$ factor is the cost of a union bound over the $\binom{N}{s}$ possible supports, and $s/M$ is the variance of estimating $s$ known parameters from $M$ samples. Together they capture the essence of sparse estimation.
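The union-bound interpretation of the $\log(N/s)$ factor can be checked directly, since $\log\binom{N}{s}\approx s\log(N/s)$ up to lower-order terms. A small sketch with illustrative values of $N$ and $s$:

```python
from math import comb, log

# The log(N/s) factor comes from a union bound over the binom(N, s) possible supports:
# log binom(N, s) is of order s*log(N/s).  Illustrative sizes (assumption):
N, s = 10_000, 50
print(f"log C(N, s)        = {log(comb(N, s)):.1f}")
print(f"s * log(N/s)       = {s * log(N / s):.1f}")
print(f"s * (log(N/s) + 1) = {s * (log(N / s) + 1):.1f}   # from C(N,s) <= (eN/s)^s")
```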

Key Takeaway

The minimax rate $s\log(N/s)/M$ is the information-theoretic floor for sparse estimation. It tells you the sample complexity required to estimate an $s$-sparse signal in $\mathbb{R}^N$ to a target accuracy, a formula as fundamental to high-dimensional statistics as the Shannon capacity is to communication.

Sparse Minimax Rate vs. Sample Size

Overlay the LASSO empirical MSE, the oracle-OLS risk $s\sigma^2/M$, and the minimax rate $s\log(N/s)\,\sigma^2/M$ as a function of $M$, for fixed $N$ and $s$. Observe that LASSO sits at the minimax rate (up to log factors) while oracle OLS beats it by exactly the $\log(N/s)$ price of unknown support.
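A self-contained simulation in this spirit is sketched below: it solves the LASSO with a bare-bones ISTA (proximal gradient) loop and compares its empirical MSE against the oracle and minimax curves. All problem sizes, the penalty constant, and the iteration counts are illustrative assumptions, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft(v, t):
    """Soft-thresholding, the proximal map of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=500):
    """Minimise (1/(2M))*||y - A x||^2 + lam*||x||_1 by ISTA."""
    M, N = A.shape
    L = np.linalg.norm(A, 2) ** 2 / M            # Lipschitz constant of the smooth part
    x = np.zeros(N)
    for _ in range(n_iter):
        x = soft(x - (A.T @ (A @ x - y) / M) / L, lam / L)
    return x

N, s, sigma = 500, 10, 0.1                        # illustrative problem sizes (assumptions)
for M in [100, 200, 400, 800]:
    lam = sigma * np.sqrt(2 * np.log(N) / M)      # universal-threshold-style penalty
    mse = 0.0
    for _ in range(10):                           # small Monte Carlo average
        x_true = np.zeros(N)
        x_true[rng.choice(N, s, replace=False)] = 1.0
        A = rng.standard_normal((M, N))           # i.i.d. N(0, 1) entries
        y = A @ x_true + sigma * rng.standard_normal(M)
        mse += np.sum((lasso_ista(A, y, lam) - x_true) ** 2) / 10
    oracle = s * sigma**2 / M                     # oracle OLS risk (support known)
    minimax = sigma**2 * s * np.log(N / s) / M    # minimax rate (constants ignored)
    print(f"M={M:4d}   LASSO MSE={mse:.4f}   minimax rate={minimax:.4f}   oracle={oracle:.5f}")
```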


Least-Favourable Prior for Bounded Gaussian Mean

For $Y\sim\mathcal{N}(\theta,1)$ with $\theta\in[-\tau,\tau]$, show the two-point / three-point least-favourable prior, the associated Bayes (minimax) estimator, and its risk profile over the interval. Slide $\tau$ to watch the prior bifurcate from a delta at zero through a two-point prior to a more spread-out configuration.


Example: Is the James–Stein Estimator Minimax?

For estimating $\boldsymbol{\theta}\in\mathbb{R}^N$ from $\mathbf{y}\sim\mathcal{N}(\boldsymbol{\theta},\mathbf{I}_N)$ with $N\geq 3$ and squared-error loss on $\Theta=\mathbb{R}^N$, is the James–Stein estimator minimax? What is the minimax risk?
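One way to probe this question numerically is a quick Monte Carlo of the James–Stein risk at a few values of $\|\boldsymbol{\theta}\|$, compared against the MLE's constant risk $N$; the dimension and trial counts below are illustrative, and the printed numbers effectively sketch the answer.

```python
import numpy as np

rng = np.random.default_rng(1)

def js_risk(theta, n_trials=200_000):
    """Monte Carlo risk of the James-Stein estimator (1 - (N-2)/||y||^2) * y."""
    N = theta.size
    y = theta + rng.standard_normal((n_trials, N))
    shrink = 1.0 - (N - 2) / np.sum(y ** 2, axis=1)
    err = shrink[:, None] * y - theta
    return np.mean(np.sum(err ** 2, axis=1))

N = 10
print("MLE risk (constant in theta):", N)
for r in [0.0, 2.0, 5.0, 20.0]:
    theta = np.zeros(N)
    theta[0] = r                                  # risk depends on theta only via ||theta||
    print(f"||theta|| = {r:5.1f}   JS risk ~ {js_risk(theta):.3f}")
# JS risk stays below N and climbs back towards N as ||theta|| grows, so its
# worst-case risk equals N, the same as the MLE's. Both are therefore minimax,
# and the minimax risk on R^N is N.
```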

Criteria for Choosing an Estimator

| Criterion | What it asks | Sample answer |
| --- | --- | --- |
| Unbiasedness | $\mathbb{E}[\hat\theta]=\theta$ for all $\theta$? | MLE (often) |
| Efficiency | Achieves the CRLB? | MLE (asymptotically) |
| Admissibility | No estimator dominates? | Bayes rules under proper priors |
| Minimax | Best worst-case risk? | JS on $\mathbb{R}^N$, Bayes under $\pi^*$ |
| Bayes-optimal | Minimises average risk under $\pi$? | Posterior mean (for squared loss) |
| Minimax rate | Optimal scaling in $N,M,s$? | LASSO for sparse estimation |

Common Mistake: Minimax Can Be Too Pessimistic

Mistake:

Treating the minimax estimator as the "right" answer in every application.

Correction:

Minimax hedges against the worst case. If prior information is available — even rough — a Bayes or empirical-Bayes estimator will typically outperform the minimax rule on the problems that actually occur. Minimax is the correct choice when one genuinely has no prior information or when a worst-case guarantee is required.

Common Mistake: Rates Are Not Constants

Mistake:

Using the minimax rate $s\log(N/s)/M$ as an equality and ignoring constants.

Correction:

The rate is an order-of-magnitude statement: $r_*^{\text{mm}}=\Theta\bigl(\sigma^2 s\log(N/s)/M\bigr)$. Constants matter for sample-complexity calculations in practice. Donoho–Johnstone and Raskutti–Wainwright–Yu give sharp constants for specific settings; one should consult these when designing real systems.
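As a toy illustration of why the constant matters, the sketch below solves the rate (with an explicit constant $C$) for the sample size needed to reach a target MSE; every number, including $C$, is an assumption for illustration only.

```python
import numpy as np

# M needed so that C * sigma^2 * s * log(N/s) / M hits a target MSE,
# i.e. M ~ C * sigma^2 * s * log(N/s) / mse_target.  Illustrative values only.
N, s, sigma, mse_target = 10_000, 50, 1.0, 0.5
for C in [0.5, 1.0, 2.0, 4.0]:
    M = C * sigma**2 * s * np.log(N / s) / mse_target
    print(f"C = {C:.1f}  ->  M ~ {M:,.0f} samples")
```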

Historical Note: Donoho and Johnstone: Wavelets Meet Minimax

1994–1998

David Donoho and Iain Johnstone connected minimax theory to wavelet-based signal estimation in a series of landmark papers in the 1990s. Their "wavelet shrinkage" estimator is exactly soft-thresholding in the wavelet basis, with the threshold chosen to match the minimax rate over Besov spaces. The same mathematics ($\ell_1$ penalty, soft-thresholding, minimax rate) re-emerged ten years later as LASSO and compressed sensing, now reframed for general sensing matrices.

The Minimax Game

Animated illustration of minimax as a two-player game: the statistician picks an estimator, then Nature picks the worst-case parameter. The saddle-point value is the minimax risk.
The saddle-point characterisation: $\min_{\hat\theta}\max_\theta R=\max_\pi\min_{\hat\theta}r(\hat\theta,\pi)$ when duality holds.

Quick Check

For an $s$-sparse signal in $\mathbb{R}^N$ observed through $M$ Gaussian measurements, the minimax MSE scales as:

  • $s/M$

  • $N/M$

  • $s\log(N/s)/M$

  • $N\log N/M$

Why This Matters: Minimax Rates for Massive Machine-Type Communications

In grant-free massive machine-type communications (mMTC), only $s$ out of $N$ IoT devices are active in any given slot, and the base station must detect which ones from $M$ received samples. The active-device vector is $s$-sparse in $\mathbb{R}^N$. The minimax theory of the sparse-estimation theorem above tells us that $M=\Theta(s\log(N/s))$ samples are necessary and sufficient to identify the active set reliably. This is the information-theoretic floor for grant-free access protocols in 5G-Advanced and 6G.
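A back-of-the-envelope calculation shows the kind of number this rate produces; the device counts and the constant $C$ below are purely illustrative assumptions.

```python
from math import ceil, log

# Rough sample count for grant-free access, using M ~ C * s * log(N/s).
# Device counts and the constant C are illustrative assumptions.
N = 100_000          # devices in the cell
s = 200              # simultaneously active devices
for C in (1, 2, 4):
    print(f"C = {C}:  M ~ {ceil(C * s * log(N / s)):,} received samples")
```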

Minimax Estimator

An estimator that minimises the worst-case risk over a parameter class: $\hat\theta^*=\arg\min_{\hat\theta}\sup_{\theta\in\Theta}R(\hat\theta,\theta)$.

Related: Bayes Estimator, Least-Favourable Prior

Least-Favourable Prior

A prior $\pi^*$ whose Bayes risk is largest among all priors. When duality holds, the Bayes estimator under $\pi^*$ is minimax.

Related: Minimax Estimator, Bayes Risk and Least-Favourable Priors

🔧Engineering Note

Choosing $\lambda$ in Practice

The minimax-optimal LASSO penalty for the sparse-recovery rate is $\lambda=c\sqrt{\sigma^2\log N/M}$ for a modest constant $c$ (with $c=\sqrt{2}$ recovering the Donoho–Johnstone universal threshold $\sigma\sqrt{2\log N/M}$). In practice one tunes $\lambda$ by cross-validation, but this formula provides an excellent starting point and a sanity check; a sketch of the recipe follows the list below.

Practical Constraints
  • When $\sigma^2$ is unknown, estimate it from the residuals of an initial LASSO fit and iterate.

  • Scaled LASSO and square-root LASSO tune $\lambda$ automatically in a noise-variance-free manner.
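A minimal sketch of this recipe, assuming the plug-in formula above and a residual-based noise estimate (the solver is a bare-bones ISTA; constants and iteration counts are illustrative assumptions):

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=500):
    """Minimise (1/(2M))*||y - A x||^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    M, N = A.shape
    L = np.linalg.norm(A, 2) ** 2 / M
    x = np.zeros(N)
    for _ in range(n_iter):
        v = x - (A.T @ (A @ x - y) / M) / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)   # soft-thresholding step
    return x

def lasso_unknown_sigma(A, y, c=np.sqrt(2), n_refits=3):
    """Plug-in lambda = c * sigma_hat * sqrt(log(N)/M), re-estimating sigma from residuals."""
    M, N = A.shape
    sigma_hat = np.std(y)                           # crude start: treat everything as noise
    for _ in range(n_refits):
        x_hat = lasso_ista(A, y, c * sigma_hat * np.sqrt(np.log(N) / M))
        resid = y - A @ x_hat
        dof = max(M - np.count_nonzero(x_hat), 1)   # degrees of freedom left for the noise
        sigma_hat = np.sqrt(resid @ resid / dof)    # residual-based noise estimate
    return x_hat, sigma_hat
```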