Minimax Estimation
When There Is No Single Best Estimator
The admissibility results of Section 22.3 tell us which estimators can be ruled out. They do not tell us which single estimator to use. Different estimators trade risk at different parts of the parameter space: some are great when $\theta$ is small, others when it is large. Minimax theory resolves this tension by picking the estimator whose worst-case risk is smallest. The minimax estimator gives a guarantee that holds uniformly over the parameter class — a guarantee that becomes the natural benchmark whenever no reliable prior is available.
Beyond picking estimators, minimax rates carry deep information: they characterise how much data is required to estimate a parameter of a given complexity. For sparse signals the minimax rate is $\sigma^2 s \log(N/s)/n$ — a formula that packages together the three pillars of high-dimensional statistics: dimension $N$, sample size $n$, and structural complexity $s$.
Definition: Minimax Risk
Let $\Theta$ be a parameter space, $L(\theta, a)$ a loss function, and $R(\theta, \hat\theta) = \mathbb{E}_\theta[L(\theta, \hat\theta(X))]$ the frequentist risk. The minimax risk over $\Theta$ is
$$R^* = \inf_{\hat\theta}\, \sup_{\theta \in \Theta} R(\theta, \hat\theta),$$
where the infimum is taken over all (measurable) estimators. An estimator $\hat\theta^*$ is minimax if it attains this infimum: $\sup_{\theta \in \Theta} R(\theta, \hat\theta^*) = R^*$.
Minimax makes no statement about what $\theta$ actually is — only about the worst case. This is conservative by design: minimax estimators hedge against adversarial parameter choices.
Definition: Bayes Risk and Least-Favourable Priors
For a prior $\pi$ on $\Theta$, the Bayes risk of an estimator $\hat\theta$ is $r(\pi, \hat\theta) = \int_\Theta R(\theta, \hat\theta)\, \pi(\mathrm{d}\theta)$, and the Bayes-optimal estimator $\hat\theta_\pi$ minimises $r(\pi, \hat\theta)$. A prior $\pi^*$ is least favourable if $r(\pi^*, \hat\theta_{\pi^*}) = \sup_\pi r(\pi, \hat\theta_\pi)$.
The minimax = maximin duality (when it holds) states
$$\inf_{\hat\theta}\, \sup_{\theta \in \Theta} R(\theta, \hat\theta) \;=\; \sup_{\pi}\, \inf_{\hat\theta}\, r(\pi, \hat\theta).$$
When a least-favourable prior exists, the Bayes estimator under that prior is minimax. This is the workhorse construction for finding minimax estimators: guess the least-favourable prior, compute the Bayes estimator, check that its worst-case risk matches the Bayes risk under $\pi^*$.
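A two-line argument shows why the check in the last step suffices. If $\sup_{\theta} R(\theta, \hat\theta_{\pi^*}) = r(\pi^*, \hat\theta_{\pi^*})$, then
$$\inf_{\hat\theta} \sup_{\theta} R(\theta, \hat\theta) \;\le\; \sup_{\theta} R(\theta, \hat\theta_{\pi^*}) \;=\; r(\pi^*, \hat\theta_{\pi^*}) \;=\; \inf_{\hat\theta} r(\pi^*, \hat\theta) \;\le\; \inf_{\hat\theta} \sup_{\theta} R(\theta, \hat\theta),$$
so every inequality is an equality and $\hat\theta_{\pi^*}$ attains the minimax risk.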
Theorem: Minimax Risk for Bounded Gaussian Mean
Let $X \sim \mathcal{N}(\theta, \sigma^2)$ with $|\theta| \le \tau$ for some $\tau > 0$. Under squared-error loss, the minimax risk satisfies
$$R^*(\tau) \asymp \min\{\tau^2, \sigma^2\}.$$
As $\tau \to \infty$ the minimax risk approaches $\sigma^2$, the unbounded Gaussian case.
The least-favourable prior on a bounded interval concentrates its mass at the endpoints — the adversary places $\theta$ at the extremes where estimation is hardest. For small $\tau$ the problem is easy and the risk scales as $\tau^2$; for large $\tau$ the constraint becomes vacuous and the risk plateaus at the unrestricted Gaussian value $\sigma^2$.
Use the minimax-maximin duality: solve the Bayes problem for each prior and maximise.
For $\tau$ small, the two-point prior at $\pm\tau$ is asymptotically least favourable.
Identify the Bayes estimator under a two-point prior as a hyperbolic-tangent shrinker.
Bayes estimator for two-point prior
Under $\pi = \tfrac12(\delta_{-\tau} + \delta_{+\tau})$, the Bayes estimator is $\hat\theta(x) = \tau \tanh(\tau x / \sigma^2)$ and its Bayes risk is $\tau^2\big(1 - \mathbb{E}[\tanh^2(\tau X/\sigma^2)]\big)$, where $X$ is drawn from the marginal mixture.
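To see where the hyperbolic tangent comes from, write out the posterior mean; the common factor $e^{-(x^2 + \tau^2)/2\sigma^2}$ cancels:
$$\hat\theta(x) = \frac{\tau\, e^{-(x - \tau)^2/2\sigma^2} - \tau\, e^{-(x + \tau)^2/2\sigma^2}}{e^{-(x - \tau)^2/2\sigma^2} + e^{-(x + \tau)^2/2\sigma^2}} = \tau \cdot \frac{e^{\tau x/\sigma^2} - e^{-\tau x/\sigma^2}}{e^{\tau x/\sigma^2} + e^{-\tau x/\sigma^2}} = \tau \tanh\!\Big(\frac{\tau x}{\sigma^2}\Big).$$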
Small-$\tau$ expansion
Taylor-expanding around zero, $\tanh(u) = u + O(u^3)$, so $\hat\theta(x) \approx \tau^2 x / \sigma^2$ and the Bayes risk is $\tau^2(1 + o(1))$, which matches the leading $\tau^2$ scaling. Verifying the matching supremum argument (showing that the two-point prior is least favourable in the limit $\tau \to 0$) completes the proof.
Large-$\tau$ limit
For $\tau \to \infty$ the constraint disappears and $R^*(\tau) \to \sigma^2$, the risk of the MLE $\hat\theta(x) = x$, which is minimax on the unrestricted parameter space $\mathbb{R}$.
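The construction can be checked numerically. Below is a minimal sketch (assuming $\sigma = 1$; the quadrature order and the $\tau$ grid are ad-hoc choices) that compares the worst-case risk of the tanh shrinker over $[-\tau, \tau]$ with its Bayes risk under the two-point prior:

```python
import numpy as np

def risk_tanh(theta, tau, n_nodes=120):
    """Risk E[(tau*tanh(tau*X) - theta)^2] for X ~ N(theta, 1), computed
    by Gauss-Hermite quadrature with weight exp(-z^2/2)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    x = theta + nodes                      # X = theta + Z,  Z ~ N(0, 1)
    est = tau * np.tanh(tau * x)           # Bayes rule under the two-point prior
    return np.sum(weights * (est - theta) ** 2) / np.sqrt(2 * np.pi)

for tau in [0.1, 0.5, 1.0, 3.0]:
    grid = np.linspace(-tau, tau, 201)
    sup_risk = max(risk_tanh(t, tau) for t in grid)
    bayes_risk = 0.5 * (risk_tanh(-tau, tau) + risk_tanh(tau, tau))
    print(f"tau={tau:3.1f}  sup risk={sup_risk:.4f}  "
          f"Bayes risk={bayes_risk:.4f}  tau^2={tau * tau:.4f}")
# Small tau: sup risk ~ Bayes risk ~ tau^2 (two-point prior nearly least
# favourable). Large tau: the gap blows up -- the tanh rule is no longer
# minimax, and the unrestricted value sigma^2 = 1 takes over.
```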
Theorem: Minimax Rate for Sparse Estimation
Consider the model $y = X\theta + \varepsilon$ with $X \in \mathbb{R}^{n \times N}$ having i.i.d. $\mathcal{N}(0, 1)$ entries, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, and $\theta \in \Theta_s = \{\theta \in \mathbb{R}^N : \|\theta\|_0 \le s\}$. Under squared-error loss, the minimax risk satisfies
$$\inf_{\hat\theta}\, \sup_{\theta \in \Theta_s} \mathbb{E}\,\|\hat\theta - \theta\|_2^2 \;\asymp\; \frac{\sigma^2\, s \log(N/s)}{n}$$
as $n, N \to \infty$ with $s \ll N$. Both the upper bound (achieved by LASSO with appropriate $\lambda$ or by best-subset selection) and the matching lower bound hold.
The rate $\sigma^2 s \log(N/s)/n$ is the price of not knowing which coordinates are active. The factor $\log(N/s)$ is a union bound over the $\binom{N}{s}$ possible supports, and $\sigma^2 s/n$ is the variance of estimating $s$ known parameters from $n$ samples. Together they capture the essence of sparse estimation.
Upper bound: analyse best-subset selection via a union bound over supports.
Lower bound: use Fano's inequality with a packing of widely-separated sparse vectors.
The connection to information theory makes $s \log(N/s)$ appear as a channel-coding penalty.
Upper bound via best-subset
For each support $S \subseteq \{1, \dots, N\}$ with $|S| = s$, restrict OLS to the columns in $S$. The restricted OLS has risk $\sigma^2 s / n$. A union-bound argument over all $\binom{N}{s}$ supports inflates the risk by a $\log(N/s)$ factor in the worst case, giving $\sup_\theta \mathbb{E}\,\|\hat\theta - \theta\|_2^2 \lesssim \sigma^2 s \log(N/s)/n$.
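A brute-force sketch of this estimator on a deliberately tiny instance ($n = 40$, $N = 16$, $s = 2$ are hypothetical sizes chosen so that enumerating all $\binom{N}{s}$ supports is feasible):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, N, s, sigma = 40, 16, 2, 1.0   # tiny sizes: C(16, 2) = 120 supports

errs = []
for _ in range(100):
    X = rng.standard_normal((n, N))
    theta = np.zeros(N); theta[:s] = 1.0
    y = X @ theta + sigma * rng.standard_normal(n)
    best_rss, best_est = np.inf, None
    for S in combinations(range(N), s):       # exhaustive search over supports
        cols = list(S)
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = np.sum((y - X[:, cols] @ beta) ** 2)
        if rss < best_rss:
            best_rss = rss
            best_est = np.zeros(N); best_est[cols] = beta
    errs.append(np.sum((best_est - theta) ** 2))

print(f"best-subset MSE          = {np.mean(errs):.4f}")
print(f"oracle sigma^2 s/n       = {sigma**2 * s / n:.4f}")
print(f"rate sigma^2 s log(N/s)/n = {sigma**2 * s * np.log(N / s) / n:.4f}")
```

With a signal this strong the search usually recovers the true support, so the empirical MSE sits near the oracle value; it is the worst-case $\theta$, with entries at the detection boundary, that forces the extra $\log(N/s)$ factor.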
Lower bound via Fano
Construct a $\delta$-packing of $\Theta_s$ consisting of $M$ $s$-sparse vectors at mutual separation $\delta$, with $\log M \asymp s \log(N/s)$. Fano's inequality says that any estimator has average error probability bounded below by $1 - (I(\theta; y) + \log 2)/\log M$. Bounding the mutual information $I(\theta; y)$ gives the matching lower bound.
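Concretely, with $\theta$ uniform on the packing $\{\theta_1, \dots, \theta_M\}$, the Fano step reads
$$\mathbb{P}\big[\hat\theta \ne \theta\big] \;\ge\; 1 - \frac{I(\theta; y) + \log 2}{\log M}, \qquad I(\theta; y) \;\lesssim\; \frac{n\,\delta^2}{\sigma^2},$$
so choosing $\delta^2 \asymp \sigma^2 s \log(N/s)/n$ keeps the error probability bounded away from zero and forces $\sup_\theta \mathbb{E}\,\|\hat\theta - \theta\|_2^2 \gtrsim \delta^2$.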
Combine
Upper and lower bounds are of the same order, so the minimax rate is $\sigma^2 s \log(N/s)/n$. The $\asymp$ can be made explicit; sharp constants are known (Donoho–Johnstone).
Key Takeaway
The minimax rate $\sigma^2 s \log(N/s)/n$ is the information-theoretic floor for sparse estimation. It tells you the sample complexity required to estimate an $s$-sparse signal in $\mathbb{R}^N$ to a target accuracy — a formula as fundamental to high-dimensional statistics as the Shannon capacity is to communication.
Sparse Minimax Rate vs. Sample Size
Overlay the LASSO empirical MSE, the oracle-OLS risk $\sigma^2 s/n$, and the minimax rate $\sigma^2 s \log(N/s)/n$ as a function of $n$, for fixed $N$ and $s$. Observe that LASSO sits at the minimax rate (up to log factors) while oracle OLS beats it by exactly the price-of-unknown-support.
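In lieu of the interactive figure, here is a minimal simulation sketch (the sizes $N = 256$, $s = 8$, the unit signal values, and the trial count are arbitrary choices; sklearn's `Lasso` minimises $\tfrac{1}{2n}\|y - X\theta\|_2^2 + \alpha\|\theta\|_1$, so its `alpha` plays the role of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
N, s, sigma, trials = 256, 8, 1.0, 20

for n in [100, 200, 400, 800]:
    lasso_mse, oracle_mse = [], []
    for _ in range(trials):
        X = rng.standard_normal((n, N))
        theta = np.zeros(N); theta[:s] = 1.0
        y = X @ theta + sigma * rng.standard_normal(n)
        alpha = sigma * np.sqrt(2 * np.log(N) / n)           # minimax-rate penalty
        coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
        lasso_mse.append(np.sum((coef - theta) ** 2))
        beta, *_ = np.linalg.lstsq(X[:, :s], y, rcond=None)  # oracle: true support
        oracle_mse.append(np.sum((beta - theta[:s]) ** 2))
    rate = sigma**2 * s * np.log(N / s) / n
    print(f"n={n:4d}  LASSO={np.mean(lasso_mse):.4f}  "
          f"oracle={np.mean(oracle_mse):.4f}  minimax rate={rate:.4f}")
```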
Least-Favourable Prior for Bounded Gaussian Mean
For $X \sim \mathcal{N}(\theta, \sigma^2)$ with $|\theta| \le \tau$, show the two-point / three-point least-favourable prior, the associated Bayes (minimax) estimator, and its risk profile over the interval. Slide $\tau$ to watch the prior bifurcate from a delta at zero through a two-point prior to a more spread-out configuration.
Example: Is the James–Stein Estimator Minimax?
For estimating $\theta$ from $X \sim \mathcal{N}(\theta, I_N)$ with $N \ge 3$ and squared-error loss on $\Theta = \mathbb{R}^N$, is the James–Stein estimator minimax? What is the minimax risk?
Minimax risk on unrestricted $\mathbb{R}^N$
The MLE $\hat\theta(X) = X$ has constant risk $N$, and no estimator can do better uniformly. A careful argument (use priors supported on balls of growing radius, whose Bayes risks approach $N$) shows $R^* = N$.
Is JS minimax?
Yes. The JS risk satisfies $R(\theta, \hat\theta_{\mathrm{JS}}) < N$ for every $\theta$ and $\sup_\theta R(\theta, \hat\theta_{\mathrm{JS}}) = N$ (the supremum is approached as $\|\theta\| \to \infty$). Therefore JS attains the minimax risk — it is minimax, and it strictly dominates the (other) minimax estimator, the MLE.
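Both claims can be checked with a quick Monte Carlo sketch (the dimension $N = 10$, the trial count, and the grid of $\|\theta\|$ values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 10, 200_000

def james_stein(x):
    """James-Stein shrinker for X ~ N(theta, I_N), applied row-wise."""
    return (1 - (N - 2) / np.sum(x**2, axis=1, keepdims=True)) * x

for r in [0.0, 2.0, 5.0, 20.0]:
    theta = np.zeros(N); theta[0] = r        # risk depends only on ||theta||
    X = theta + rng.standard_normal((trials, N))
    mle_risk = np.mean(np.sum((X - theta) ** 2, axis=1))
    js_risk = np.mean(np.sum((james_stein(X) - theta) ** 2, axis=1))
    print(f"||theta||={r:5.1f}   MLE risk={mle_risk:6.3f}   JS risk={js_risk:6.3f}")
# MLE risk is flat at N = 10; JS risk stays below 10 and creeps up to it
# as ||theta|| grows -- both suprema equal N, so both rules are minimax.
```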
Lesson
Minimax estimators are not unique. Two estimators can both be minimax even though one dominates the other, as long as the suprema match. This is why minimax alone is not enough — one often combines it with admissibility to pin down an estimator.
Criteria for Choosing an Estimator
| Criterion | What it asks | Sample answer |
|---|---|---|
| Unbiasedness | $\mathbb{E}_\theta[\hat\theta] = \theta$ for all $\theta$? | MLE (often) |
| Efficiency | Achieves the CRLB? | MLE (asymptotically) |
| Admissibility | No estimator dominates? | Bayes rules under proper priors |
| Minimax | Best worst-case risk? | JS on $\mathbb{R}^N$, Bayes under $\pi^*$ |
| Bayes-optimal | Minimises average risk under $\pi$? | Posterior mean (for squared loss) |
| Minimax rate | Optimal scaling in $(n, N, s)$? | LASSO for sparse estimation |
Common Mistake: Minimax Can Be Too Pessimistic
Mistake:
Treating the minimax estimator as the "right" answer in every application.
Correction:
Minimax hedges against the worst case. If prior information is available — even rough — a Bayes or empirical-Bayes estimator will typically outperform the minimax rule on the problems that actually occur. Minimax is the correct choice when one genuinely has no prior information or when a worst-case guarantee is required.
Common Mistake: Rates Are Not Constants
Mistake:
Using the minimax rate as an equality and ignoring constants.
Correction:
The rate is an order-of-magnitude statement: $R^* \asymp \sigma^2 s \log(N/s)/n$, not $R^* = \sigma^2 s \log(N/s)/n$. Constants matter for sample-complexity calculations in practice. Donoho–Johnstone and Raskutti–Wainwright–Yu give sharp constants for specific settings; one should consult these when designing real systems.
Historical Note: Donoho and Johnstone — Wavelets Meet Minimax
1994–1998. David Donoho and Iain Johnstone connected minimax theory to wavelet-based signal estimation in a series of landmark papers in the 1990s. Their "wavelet shrinkage" estimator is exactly soft-thresholding in the wavelet basis, with the threshold (the universal threshold $\sigma\sqrt{2\log n}$) chosen to match the minimax rate over Besov spaces. The same mathematics — the $\ell_1$ penalty, soft-thresholding, the $s \log(N/s)$ minimax rate — re-emerged ten years later as LASSO and compressed sensing, now reframed for general sensing matrices.
Quick Check
For an $s$-sparse signal in $\mathbb{R}^N$ observed through $n$ Gaussian measurements, the minimax MSE scales as $\sigma^2 s \log(N/s)/n$.
The factor $\log(N/s)$ is the price of not knowing which coordinates are active.
Why This Matters: Minimax Rates for Massive Machine-Type Communications
In grant-free massive machine-type communications (mMTC), only $s$ out of $N$ IoT devices are active in any given slot, and the base station must detect which ones from $n$ received samples. The active-device vector is $s$-sparse in $\mathbb{R}^N$. The minimax rate for sparse estimation (theorem above) tells us that $n \gtrsim s \log(N/s)$ samples are necessary and sufficient to identify the active set reliably. This is the information-theoretic floor for grant-free access protocols in 5G-Advanced and 6G.
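As a back-of-the-envelope calculation (a sketch: the helper name and the constant $C = 1$ are illustrative placeholders, not sharp values from the mMTC literature):

```python
import math

def mmtc_sample_floor(N, s, C=1.0):
    """Order-of-magnitude floor n ~ C * s * log(N/s) for reliable active-set
    detection among N devices; the constant C is a placeholder."""
    return math.ceil(C * s * math.log(N / s))

# e.g. 100 active devices out of 100,000:
print(mmtc_sample_floor(N=100_000, s=100))   # -> 691 samples (with C = 1)
```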
Minimax Estimator
An estimator that minimises the worst-case risk over a parameter class: $\hat\theta^* = \arg\min_{\hat\theta}\, \sup_{\theta \in \Theta} R(\theta, \hat\theta)$.
Related: Bayes Estimator, Least-Favourable Prior
Least-Favourable Prior
A prior $\pi^*$ whose Bayes risk is largest among all priors. When duality holds, the Bayes estimator under $\pi^*$ is minimax.
Related: Minimax Estimator, Bayes Risk and Least-Favourable Priors
Choosing $\lambda$ in Practice
The minimax-optimal LASSO penalty for the sparse-recovery rate is $\lambda = c\,\sigma\sqrt{2\log N / n}$ for a modest constant $c$ (typically $c \approx 1$, which gives the Donoho–Johnstone universal threshold). In practice one tunes $\lambda$ by cross-validation, but this formula provides an excellent starting point and a sanity check.
- When $\sigma$ is unknown, estimate it from the residuals of an initial LASSO fit and iterate (see the code sketch after this list).
- Scaled-LASSO and square-root LASSO tune $\lambda$ automatically in a noise-variance-free manner.
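The recipe in code (a minimal sketch: the helper name, the pilot estimate $\hat\sigma = \mathrm{std}(y)$, and the two-pass iteration count are illustrative choices; sklearn's `alpha` corresponds to $\lambda$ in its $\tfrac{1}{2n}$-scaled objective):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_with_minimax_lambda(X, y, c=1.0, n_iter=2):
    """Fit LASSO with lambda = c * sigma_hat * sqrt(2 log N / n), re-estimating
    sigma from the residuals of the previous fit (pilot: sigma_hat = std(y))."""
    n, N = X.shape
    sigma_hat = np.std(y)                           # crude pilot estimate
    for _ in range(n_iter):
        alpha = c * sigma_hat * np.sqrt(2 * np.log(N) / n)
        model = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
        resid = y - X @ model.coef_
        dof = max(n - np.count_nonzero(model.coef_), 1)
        sigma_hat = np.sqrt(resid @ resid / dof)    # refit sigma, re-tune lambda
    return model, alpha

# usage sketch:
# model, alpha = lasso_with_minimax_lambda(X, y)
```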