Minimax Estimation

When There Is No Single Best Estimator

The admissibility results of Section 22.3 tell us which estimators can be ruled out. They do not tell us which single estimator to use. Different estimators trade off risk across different parts of the parameter space: some perform well when $\boldsymbol{\theta}$ is small, others when it is large. Minimax theory resolves this tension by picking the estimator whose worst-case risk is smallest. The minimax estimator gives a guarantee that holds uniformly over the parameter class, and this guarantee becomes the natural benchmark whenever no reliable prior is available.

Beyond ranking individual estimators, minimax rates carry deep information: they characterise how much data is required to estimate a parameter of a given complexity. For sparse signals the minimax rate is $O(s\log(N/s)/M)$, a formula that packages together the three pillars of high-dimensional statistics: the dimension $N$, the sample size $M$, and the structural complexity $s$.

Definition:

Minimax Risk

Let $\Theta$ be a parameter space, $L$ a loss function, and $R(\hat\theta,\theta)=\mathbb{E}_\theta[L(\hat\theta,\theta)]$ the frequentist risk. The minimax risk over $\Theta$ is

$$r_*^{\text{mm}}(\Theta)=\inf_{\hat\theta}\sup_{\theta\in\Theta}R(\hat\theta,\theta),$$

where the infimum is taken over all (measurable) estimators. An estimator $\hat\theta^*$ is minimax if it attains this infimum: $\sup_{\theta\in\Theta}R(\hat\theta^*,\theta)=r_*^{\text{mm}}(\Theta)$.

Minimax makes no statement about what $\theta$ actually is, only about the worst case. This is conservative by design: minimax estimators hedge against adversarial parameter choices.
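A quick way to make the definition concrete is to compute worst-case risks in closed form for a simple family of rules. The sketch below (plain NumPy, with an illustrative bound $\tau=0.5$) compares the MLE $\hat\theta=Y$ with linear shrinkage rules $\hat\theta_c=cY$ for a bounded normal mean $Y\sim\mathcal{N}(\theta,1)$, $\theta\in[-\tau,\tau]$; the best shrinkage factor and its worst-case risk are found by a grid search over $c$.

```python
import numpy as np

# Worst-case risk of linear rules delta_c(y) = c*y for Y ~ N(theta, 1), squared-error
# loss, theta restricted to [-tau, tau].  The exact risk is variance + bias^2:
#   R(delta_c, theta) = c^2 + (1 - c)^2 * theta^2,
# so the supremum over [-tau, tau] is attained at theta = +/- tau.
tau = 0.5                                   # illustrative bound on the mean (assumption)
c_grid = np.linspace(0.0, 1.0, 1001)
worst_case = c_grid**2 + (1 - c_grid)**2 * tau**2

best = np.argmin(worst_case)
print(f"MLE (c = 1) worst-case risk : {1.0:.4f}")
print(f"best linear c               : {c_grid[best]:.3f}   (tau^2/(1+tau^2) = {tau**2/(1+tau**2):.3f})")
print(f"its worst-case risk         : {worst_case[best]:.4f}")
```

Even within this restricted family, shrinking toward zero cuts the worst-case risk from $1$ to $\tau^2/(1+\tau^2)$, which foreshadows the bounded-mean theorem below.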

Definition:

Bayes Risk and Least-Favourable Priors

For a prior $\pi$ on $\Theta$, the Bayes risk of an estimator is $r(\hat\theta,\pi)=\int R(\hat\theta,\theta)\,d\pi(\theta)$, and the Bayes-optimal estimator $\hat\theta_\pi$ minimises $r(\cdot,\pi)$. A prior $\pi^*$ is least favourable if $r(\hat\theta_{\pi^*},\pi^*)=\sup_\pi r(\hat\theta_\pi,\pi)$.

The minimax = maximin duality (when it holds) states

$$\inf_{\hat\theta}\sup_\theta R(\hat\theta,\theta)=\sup_\pi\inf_{\hat\theta}r(\hat\theta,\pi).$$

When a least-favourable prior exists, the Bayes estimator under that prior is minimax. This is the workhorse construction for finding minimax estimators: guess the least-favourable prior, compute the Bayes estimator, and check that its worst-case risk matches the Bayes risk under $\pi^*$.
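The sketch below runs this recipe numerically for the bounded normal mean $Y\sim\mathcal{N}(\theta,1)$, $\theta\in[-\tau,\tau]$ (the same setting as the theorem that follows), under the guess that the symmetric two-point prior at $\pm\tau$ is least favourable; this guess is known to be correct only for small $\tau$, and the value of $\tau$ below is illustrative. The Bayes estimator under that prior works out to $\hat\theta(y)=\tau\tanh(\tau y)$.

```python
import numpy as np

# Guess-and-verify recipe for Y ~ N(theta, 1), theta in [-tau, tau], squared-error loss.
# Guess: the two-point prior pi = (delta_{+tau} + delta_{-tau}) / 2.
# Its posterior mean (Bayes estimator) is delta(y) = tau * tanh(tau * y).
tau = 0.5                                    # illustrative; the two-point prior is least
                                             # favourable only for small tau (assumption)
z = np.linspace(-8.0, 8.0, 4001)             # grid for expectations over Z ~ N(0, 1)
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
dz = z[1] - z[0]

def risk(theta):
    """Frequentist risk R(delta, theta) = E[(tau*tanh(tau*(theta + Z)) - theta)^2]."""
    est = tau * np.tanh(tau * (theta + z))
    return np.sum((est - theta) ** 2 * phi) * dz

thetas = np.linspace(-tau, tau, 201)
worst_case = max(risk(t) for t in thetas)
bayes_risk = 0.5 * (risk(tau) + risk(-tau))  # average risk under the two-point prior

print(f"worst-case risk over [-tau, tau] : {worst_case:.4f}")
print(f"Bayes risk under two-point prior : {bayes_risk:.4f}")
# The two numbers agree, so the Bayes rule is minimax over [-tau, tau] and the
# two-point prior is least favourable (for this small tau).
```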

Theorem: Minimax Risk for Bounded Gaussian Mean

Let $Y\sim\mathcal{N}(\theta,1)$ with $\theta\in[-\tau,\tau]$ for some $\tau>0$. Under squared-error loss, the minimax risk satisfies

$$r_*^{\text{mm}}([-\tau,\tau])=\sup_\pi\Bigl[\mathrm{var}_\pi(\theta)-\mathrm{var}_\pi(\mathbb{E}[\theta\mid Y])\Bigr]=\frac{\tau^2}{1+\tau^2}\cdot(1+o(1))\quad\text{as }\tau\to 0.$$

As $\tau\to\infty$ the minimax risk approaches $1$, the unbounded Gaussian case.

The least-favourable prior on a bounded interval concentrates its mass at the endpoints: the adversary places $\theta$ at the extremes, where estimation is hardest. For small $\tau$ the problem is easy and the risk scales as $\tau^2$; for large $\tau$ the constraint becomes vacuous and the risk plateaus at the unrestricted Gaussian value.
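To see where the value $\tau^2/(1+\tau^2)$ comes from, it helps to restrict attention to the linear rules $\hat\theta_c(y)=cy$ used in the sketch above (a heuristic only: the exact minimax rule is nonlinear, but it agrees with the best linear rule to first order as $\tau\to 0$). The risk splits into variance plus squared bias,

$$R(\hat\theta_c,\theta)=c^2+(1-c)^2\theta^2,\qquad \sup_{|\theta|\le\tau}R(\hat\theta_c,\theta)=c^2+(1-c)^2\tau^2,$$

and minimising the right-hand side over $c$ gives $c^*=\tau^2/(1+\tau^2)$ with worst-case risk exactly $\tau^2/(1+\tau^2)$.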

Theorem: Minimax Rate for Sparse Estimation

Consider the model $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{w}$ with $\mathbf{A}\in\mathbb{R}^{M\times N}$ having i.i.d. $\mathcal{N}(0,1)$ entries, $\mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_M)$, and $\mathbf{x}\in\Theta_s=\{\mathbf{x}\in\mathbb{R}^N:\|\mathbf{x}\|_0\leq s\}$. Under squared-error loss, the minimax risk satisfies

$$\inf_{\hat{\mathbf{x}}}\sup_{\mathbf{x}\in\Theta_s}\mathbb{E}\bigl[\|\hat{\mathbf{x}}-\mathbf{x}\|^2\bigr]\;\asymp\;\frac{\sigma^2\,s\log(N/s)}{M}$$

as $N,M,s\to\infty$ with $s\log(N/s)/M\to 0$. Both the upper bound (achieved by LASSO with appropriate $\lambda$, or by best-subset selection) and the matching lower bound hold.

The rate $s\log(N/s)/M$ is the price of not knowing which $s$ coordinates are active. The $\log(N/s)$ factor is the cost of a union bound over the $\binom{N}{s}$ possible supports, and $s/M$ is the variance of estimating $s$ known parameters from $M$ samples. Together they capture the essence of sparse estimation.
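The union-bound interpretation of the $\log(N/s)$ factor can be checked directly, since $\log\binom{N}{s}\approx s\log(N/s)$ up to lower-order terms. A small sketch with illustrative values of $N$ and $s$:

```python
from math import comb, log

# The log(N/s) factor comes from a union bound over the binom(N, s) possible supports:
# log binom(N, s) is of order s*log(N/s).  Illustrative sizes (assumption):
N, s = 10_000, 50
print(f"log C(N, s)        = {log(comb(N, s)):.1f}")
print(f"s * log(N/s)       = {s * log(N / s):.1f}")
print(f"s * (log(N/s) + 1) = {s * (log(N / s) + 1):.1f}   # from C(N,s) <= (eN/s)^s")
```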

Key Takeaway

The minimax rate $s\log(N/s)/M$ is the information-theoretic floor for sparse estimation. It tells you the sample complexity required to estimate an $s$-sparse signal in $\mathbb{R}^N$ to a target accuracy, a formula as fundamental to high-dimensional statistics as the Shannon capacity is to communication.

Sparse Minimax Rate vs. Sample Size

Overlay the LASSO empirical MSE, the oracle-OLS risk $s\sigma^2/M$, and the minimax rate $s\log(N/s)\,\sigma^2/M$ as a function of $M$, for fixed $N$ and $s$. Observe that LASSO sits at the minimax rate (up to log factors) while oracle OLS beats it by exactly the $\log(N/s)$ price of unknown support.
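A self-contained simulation in this spirit is sketched below: it solves the LASSO with a bare-bones ISTA (proximal gradient) loop and compares its empirical MSE against the oracle and minimax curves. All problem sizes, the penalty constant, and the iteration counts are illustrative assumptions, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft(v, t):
    """Soft-thresholding, the proximal map of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=500):
    """Minimise (1/(2M))*||y - A x||^2 + lam*||x||_1 by ISTA."""
    M, N = A.shape
    L = np.linalg.norm(A, 2) ** 2 / M            # Lipschitz constant of the smooth part
    x = np.zeros(N)
    for _ in range(n_iter):
        x = soft(x - (A.T @ (A @ x - y) / M) / L, lam / L)
    return x

N, s, sigma = 500, 10, 0.1                        # illustrative problem sizes (assumptions)
for M in [100, 200, 400, 800]:
    lam = sigma * np.sqrt(2 * np.log(N) / M)      # universal-threshold-style penalty
    mse = 0.0
    for _ in range(10):                           # small Monte Carlo average
        x_true = np.zeros(N)
        x_true[rng.choice(N, s, replace=False)] = 1.0
        A = rng.standard_normal((M, N))           # i.i.d. N(0, 1) entries
        y = A @ x_true + sigma * rng.standard_normal(M)
        mse += np.sum((lasso_ista(A, y, lam) - x_true) ** 2) / 10
    oracle = s * sigma**2 / M                     # oracle OLS risk (support known)
    minimax = sigma**2 * s * np.log(N / s) / M    # minimax rate (constants ignored)
    print(f"M={M:4d}   LASSO MSE={mse:.4f}   minimax rate={minimax:.4f}   oracle={oracle:.5f}")
```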


Least-Favourable Prior for Bounded Gaussian Mean

For $Y\sim\mathcal{N}(\theta,1)$ with $\theta\in[-\tau,\tau]$, show the two-point / three-point least-favourable prior, the associated Bayes (minimax) estimator, and its risk profile over the interval. Slide $\tau$ to watch the prior bifurcate from a delta at zero through a two-point prior to a more spread-out configuration.


Example: Is the James–Stein Estimator Minimax?

For estimating $\boldsymbol{\theta}\in\mathbb{R}^N$ from $\mathbf{y}\sim\mathcal{N}(\boldsymbol{\theta},\mathbf{I}_N)$ with $N\geq 3$ and squared-error loss on $\Theta=\mathbb{R}^N$, is the James–Stein estimator minimax? What is the minimax risk?
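One way to probe this question numerically is a quick Monte Carlo of the James–Stein risk at a few values of $\|\boldsymbol{\theta}\|$, compared against the MLE's constant risk $N$; the dimension and trial counts below are illustrative, and the printed numbers effectively sketch the answer.

```python
import numpy as np

rng = np.random.default_rng(1)

def js_risk(theta, n_trials=200_000):
    """Monte Carlo risk of the James-Stein estimator (1 - (N-2)/||y||^2) * y."""
    N = theta.size
    y = theta + rng.standard_normal((n_trials, N))
    shrink = 1.0 - (N - 2) / np.sum(y ** 2, axis=1)
    err = shrink[:, None] * y - theta
    return np.mean(np.sum(err ** 2, axis=1))

N = 10
print("MLE risk (constant in theta):", N)
for r in [0.0, 2.0, 5.0, 20.0]:
    theta = np.zeros(N)
    theta[0] = r                                  # risk depends on theta only via ||theta||
    print(f"||theta|| = {r:5.1f}   JS risk ~ {js_risk(theta):.3f}")
# JS risk stays below N and climbs back towards N as ||theta|| grows, so its
# worst-case risk equals N, the same as the MLE's. Both are therefore minimax,
# and the minimax risk on R^N is N.
```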

Criteria for Choosing an Estimator

| Criterion | What it asks | Sample answer |
| --- | --- | --- |
| Unbiasedness | $\mathbb{E}[\hat\theta]=\theta$ for all $\theta$? | MLE (often) |
| Efficiency | Achieves the CRLB? | MLE (asymptotically) |
| Admissibility | No estimator dominates? | Bayes rules under proper priors |
| Minimax | Best worst-case risk? | JS on $\mathbb{R}^N$, Bayes under $\pi^*$ |
| Bayes-optimal | Minimises average risk under $\pi$? | Posterior mean (for squared loss) |
| Minimax rate | Optimal scaling in $N,M,s$? | LASSO for sparse estimation |

Common Mistake: Minimax Can Be Too Pessimistic

Mistake:

Treating the minimax estimator as the "right" answer in every application.

Correction:

Minimax hedges against the worst case. If prior information is available — even rough — a Bayes or empirical-Bayes estimator will typically outperform the minimax rule on the problems that actually occur. Minimax is the correct choice when one genuinely has no prior information or when a worst-case guarantee is required.

Common Mistake: Rates Are Not Constants

Mistake:

Using the minimax rate $s\log(N/s)/M$ as an equality and ignoring constants.

Correction:

The rate is an order-of-magnitude statement: $r_*^{\text{mm}}=\Theta\bigl(\sigma^2 s\log(N/s)/M\bigr)$. Constants matter for sample-complexity calculations in practice. Donoho–Johnstone and Raskutti–Wainwright–Yu give sharp constants for specific settings; one should consult these when designing real systems.
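As a toy illustration of why the constant matters, the sketch below solves the rate (with an explicit constant $C$) for the sample size needed to reach a target MSE; every number, including $C$, is an assumption for illustration only.

```python
import numpy as np

# M needed so that C * sigma^2 * s * log(N/s) / M hits a target MSE,
# i.e. M ~ C * sigma^2 * s * log(N/s) / mse_target.  Illustrative values only.
N, s, sigma, mse_target = 10_000, 50, 1.0, 0.5
for C in [0.5, 1.0, 2.0, 4.0]:
    M = C * sigma**2 * s * np.log(N / s) / mse_target
    print(f"C = {C:.1f}  ->  M ~ {M:,.0f} samples")
```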

Historical Note: Donoho and Johnstone: Wavelets Meet Minimax

1994–1998

David Donoho and Iain Johnstone connected minimax theory to wavelet-based signal estimation in a series of landmark papers in the 1990s. Their "wavelet shrinkage" estimator is exactly soft-thresholding in the wavelet basis, with the threshold chosen to match the minimax rate over Besov spaces. The same mathematics ($\ell_1$ penalty, soft-thresholding, minimax rate) re-emerged ten years later as LASSO and compressed sensing, now reframed for general sensing matrices.

The Minimax Game

Animated illustration of minimax as a two-player game: the statistician picks an estimator, then Nature picks the worst-case parameter. The saddle-point value is the minimax risk.
The saddle-point characterisation: $\min_{\hat\theta}\max_\theta R=\max_\pi\min_{\hat\theta}r(\hat\theta,\pi)$ when duality holds.

Quick Check

For an $s$-sparse signal in $\mathbb{R}^N$ observed through $M$ Gaussian measurements, the minimax MSE scales as:

  • $s/M$

  • $N/M$

  • $s\log(N/s)/M$

  • $N\log N/M$

Why This Matters: Minimax Rates for Massive Machine-Type Communications

In grant-free massive machine-type communications (mMTC), only $s$ out of $N$ IoT devices are active in any given slot, and the base station must detect which ones from $M$ received samples. The active-device vector is $s$-sparse in $\mathbb{R}^N$. The minimax theory of the sparse-estimation theorem above tells us that $M=\Theta(s\log(N/s))$ samples are necessary and sufficient to identify the active set reliably. This is the information-theoretic floor for grant-free access protocols in 5G-Advanced and 6G.
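A back-of-the-envelope calculation shows the kind of number this rate produces; the device counts and the constant $C$ below are purely illustrative assumptions.

```python
from math import ceil, log

# Rough sample count for grant-free access, using M ~ C * s * log(N/s).
# Device counts and the constant C are illustrative assumptions.
N = 100_000          # devices in the cell
s = 200              # simultaneously active devices
for C in (1, 2, 4):
    print(f"C = {C}:  M ~ {ceil(C * s * log(N / s)):,} received samples")
```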

Minimax Estimator

An estimator that minimises the worst-case risk over a parameter class: $\hat\theta^*=\arg\min_{\hat\theta}\sup_{\theta\in\Theta}R(\hat\theta,\theta)$.

Related: Bayes Estimator, Least-Favourable Prior

Least-Favourable Prior

A prior $\pi^*$ whose Bayes risk is largest among all priors. When duality holds, the Bayes estimator under $\pi^*$ is minimax.

Related: Minimax Estimator, Bayes Risk and Least-Favourable Priors

🔧Engineering Note

Choosing $\lambda$ in Practice

The minimax-optimal LASSO penalty for the sparse-recovery rate is $\lambda=c\sqrt{\sigma^2\log N/M}$ for a modest constant $c$ (with $c=\sqrt{2}$ recovering the Donoho–Johnstone universal threshold $\sigma\sqrt{2\log N/M}$). In practice one tunes $\lambda$ by cross-validation, but this formula provides an excellent starting point and a sanity check; a sketch of the recipe follows the list below.

Practical Constraints
  • When $\sigma^2$ is unknown, estimate it from the residuals of an initial LASSO fit and iterate.

  • Scaled LASSO and square-root LASSO tune $\lambda$ automatically in a noise-variance-free manner.
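A minimal sketch of this recipe, assuming the plug-in formula above and a residual-based noise estimate (the solver is a bare-bones ISTA; constants and iteration counts are illustrative assumptions):

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=500):
    """Minimise (1/(2M))*||y - A x||^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    M, N = A.shape
    L = np.linalg.norm(A, 2) ** 2 / M
    x = np.zeros(N)
    for _ in range(n_iter):
        v = x - (A.T @ (A @ x - y) / M) / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)   # soft-thresholding step
    return x

def lasso_unknown_sigma(A, y, c=np.sqrt(2), n_refits=3):
    """Plug-in lambda = c * sigma_hat * sqrt(log(N)/M), re-estimating sigma from residuals."""
    M, N = A.shape
    sigma_hat = np.std(y)                           # crude start: treat everything as noise
    for _ in range(n_refits):
        x_hat = lasso_ista(A, y, c * sigma_hat * np.sqrt(np.log(N) / M))
        resid = y - A @ x_hat
        dof = max(M - np.count_nonzero(x_hat), 1)   # degrees of freedom left for the noise
        sigma_hat = np.sqrt(resid @ resid / dof)    # residual-based noise estimate
    return x_hat, sigma_hat
```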