Regularized Estimation
Regularization as a Statistical Necessity
The previous section demonstrated that the MLE is disastrous when the aspect ratio $\gamma = p/n$ is close to one. The point is that any sensible estimator must be biased in this regime. The question is therefore not "should we regularise?" but "how much, and in what geometry?". We treat the two canonical penalties — the ridge ($\ell_2$) penalty and the LASSO ($\ell_1$) penalty — and in each case derive the optimal regularization strength as a function of $\gamma$ and the signal-to-noise ratio. These are convex problems, and the solutions admit a closed-form analysis in the Marchenko–Pastur regime.
Definition: Ridge Regression (Tikhonov Regularization)
For $\lambda > 0$, the ridge estimator is the solution of the penalised least-squares programme
$$\hat\beta_\lambda \;=\; \arg\min_{\beta \in \mathbb{R}^p} \; \tfrac12 \lVert y - X\beta \rVert_2^2 + \tfrac{\lambda}{2} \lVert \beta \rVert_2^2 .$$
The objective is strictly convex and admits the closed form
$$\hat\beta_\lambda = (X^\top X + \lambda I_p)^{-1} X^\top y .$$
Ridge trades variance against squared bias. The problem is convex for every $\lambda \ge 0$, strictly convex for $\lambda > 0$, and the unique minimiser is available in closed form — the reader should verify that this follows from differentiating and setting the gradient to zero.
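Spelling out that verification: setting the gradient of the objective to zero gives
$$-X^\top(y - X\beta) + \lambda\beta = 0 \;\Longleftrightarrow\; (X^\top X + \lambda I_p)\,\beta = X^\top y,$$
and for $\lambda > 0$ the matrix $X^\top X + \lambda I_p \succeq \lambda I_p \succ 0$ is invertible, yielding the closed form above.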
Definition: LASSO Estimator
For $\lambda > 0$, the LASSO estimator is
$$\hat\beta_\lambda^{\text{lasso}} \;=\; \arg\min_{\beta \in \mathbb{R}^p} \; \tfrac12 \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 .$$
The problem is convex but not strictly convex; the $\ell_1$ penalty is non-smooth, which produces the sparsity-promoting "corners" of its sub-level sets. No closed form exists in general, but proximal algorithms (ISTA, FISTA, AMP) solve it efficiently.
The convexity reflex applies: whenever we see $\ell_1$ + squared-error we should flag the problem as convex and reach for proximal-gradient machinery.
Definition: Elastic Net
The elastic net interpolates between ridge and LASSO by combining both penalties:
$$\hat\beta_{\lambda,\alpha} \;=\; \arg\min_{\beta \in \mathbb{R}^p} \; \tfrac12 \lVert y - X\beta \rVert_2^2 + \lambda\Big(\alpha \lVert \beta \rVert_1 + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2\Big).$$
The parameter $\alpha \in [0,1]$ sets the mix: $\alpha = 0$ gives ridge, $\alpha = 1$ gives LASSO. The $\ell_2$ term restores strict convexity and stabilises solutions in the presence of correlated columns.
Elastic net is the default in practice when columns of $X$ are correlated — pure LASSO then selects capriciously among correlated predictors, whereas the $\ell_2$ term spreads weight across them.
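A small simulation makes the capricious-selection point concrete. This is a sketch using scikit-learn's `Lasso` and `ElasticNet`; the design has two nearly identical columns, and all parameter values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
# Two almost perfectly correlated predictors plus one independent column.
X = np.column_stack([z + 0.01 * rng.standard_normal(n),
                     z + 0.01 * rng.standard_normal(n),
                     rng.standard_normal(n)])
y = X @ np.array([1.0, 1.0, 0.5]) + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("LASSO:      ", lasso.coef_)   # typically piles weight onto one twin
print("Elastic net:", enet.coef_)    # spreads weight across both twins
```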
Theorem: Optimal Ridge Regularization in the Marchenko–Pastur Regime
Consider the Gaussian Linear Observation Model with i.i.d. prior coordinates $\beta_i \sim \mathcal{N}(0, \sigma_\beta^2)$, and suppose $\beta$ is independent of $X$. Define the signal-to-noise ratio $\mathrm{SNR} = \sigma_\beta^2/\sigma^2$. In the proportional asymptotic regime $p/n \to \gamma$, the per-coordinate MSE of the ridge estimator,
$$\mathrm{MSE}(\lambda) = \frac{1}{p}\,\mathbb{E}\,\lVert \hat\beta_\lambda - \beta \rVert_2^2,$$
is minimised at
$$\lambda^\star = \frac{\sigma^2}{\sigma_\beta^2} = \frac{1}{\mathrm{SNR}}.$$
The minimum value depends on $(\gamma, \mathrm{SNR})$, is strictly smaller than the OLS risk for every $\gamma < 1$, and remains finite even for $\gamma \ge 1$, where OLS breaks down.
The Bayesian story tells us the answer immediately: under a Gaussian prior $\beta \sim \mathcal{N}(0, \sigma_\beta^2 I_p)$, the MAP estimator is ridge regression with $\lambda = \sigma^2/\sigma_\beta^2$, and it coincides with the Bayes-optimal LMMSE estimator. The theorem confirms that this Bayesian-motivated choice is also frequentist-optimal in the proportional regime.
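Spelled out: with likelihood $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I_n)$ and prior $\beta \sim \mathcal{N}(0, \sigma_\beta^2 I_p)$, the negative log-posterior is, up to an additive constant,
$$-\log p(\beta \mid y) \;=\; \frac{1}{2\sigma^2}\lVert y - X\beta \rVert_2^2 + \frac{1}{2\sigma_\beta^2}\lVert \beta \rVert_2^2 + \text{const};$$
multiplying through by $\sigma^2$ gives exactly the ridge objective with $\lambda = \sigma^2/\sigma_\beta^2$.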
1. Write the ridge estimator in the singular-value basis of $X$ to obtain a scalar shrinkage in each direction.
2. Compute the MSE as an integral against the MP density, then differentiate in $\lambda$.
3. The optimum is characterised by a Stieltjes-transform identity; alternatively, recognise the LMMSE connection and use orthogonality of the error.
LMMSE characterisation
Under a Gaussian prior, the joint distribution of $(\beta, y)$ is Gaussian. The LMMSE (= MAP = posterior mean) estimator is
$$\hat\beta = \sigma_\beta^2\, X^\top \big(\sigma_\beta^2\, X X^\top + \sigma^2 I_n\big)^{-1} y .$$
A push-through identity rewrites this as $\hat\beta = \big(X^\top X + \tfrac{\sigma^2}{\sigma_\beta^2} I_p\big)^{-1} X^\top y$, which is exactly $\hat\beta_\lambda$ with $\lambda = \sigma^2/\sigma_\beta^2$.
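The push-through identity is elementary to verify: for any $c > 0$,
$$(X^\top X + c I_p)\, X^\top = X^\top (X X^\top + c I_n) \;\Longrightarrow\; (X^\top X + c I_p)^{-1} X^\top = X^\top (X X^\top + c I_n)^{-1},$$
applied here with $c = \sigma^2/\sigma_\beta^2$.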
Orthogonality implies optimality
By the orthogonality principle, the LMMSE estimator is the minimum-MSE linear estimator under the Bayesian risk (averaged over both $\beta$ and the noise $\varepsilon$). Any other ridge parameter $\lambda \neq \sigma^2/\sigma_\beta^2$ produces a sub-optimal linear estimator, so $\lambda^\star = \sigma^2/\sigma_\beta^2$.
Closed-form risk
The minimum per-coordinate MSE, in the proportional regime, is
$$\mathrm{MSE}(\lambda^\star) = \sigma^2\, m_\gamma(-\lambda^\star),$$
where $m_\gamma$ is the Stieltjes transform of the MP law, itself the fixed point of the scalar equation $m_\gamma(z) = \big(1 - \gamma - z - \gamma z\, m_\gamma(z)\big)^{-1}$. This is finite for all $\gamma$, including $\gamma = 1$.
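As a numerical sanity check, the Stieltjes transform can be computed from the MP quadratic and compared with a finite-$p$ estimate of $\tfrac{1}{p}\operatorname{tr}(X^\top X + \lambda I_p)^{-1}$. A sketch, assuming the normalisation $X_{ij} \sim \mathcal{N}(0, 1/n)$ under which $X^\top X$ follows the MP law:

```python
import numpy as np

def mp_stieltjes(z, gamma):
    """Stieltjes transform m(z) of the Marchenko-Pastur law with ratio gamma,
    evaluated at real z < 0, via the root of gamma*z*m^2 - (1-gamma-z)*m + 1 = 0."""
    a, b, c = gamma * z, -(1.0 - gamma - z), 1.0
    disc = np.sqrt(b * b - 4.0 * a * c)
    roots = np.array([(-b + disc) / (2 * a), (-b - disc) / (2 * a)])
    # Stieltjes transforms of a positive measure are positive on the
    # negative real axis; pick that root.
    return roots[roots > 0][0]

gamma, lam, n = 1.0, 0.5, 1000        # gamma = 1: OLS diverges, ridge stays finite
p = int(gamma * n)
X = np.random.default_rng(1).standard_normal((n, p)) / np.sqrt(n)
mc = np.trace(np.linalg.inv(X.T @ X + lam * np.eye(p))) / p
print(mc, mp_stieltjes(-lam, gamma))  # the two numbers should be close
```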
Key Takeaway
Under a Gaussian prior, optimal ridge regularisation is $\lambda^\star = \sigma^2/\sigma_\beta^2$. This is the single most useful heuristic in all of high-dimensional estimation: if you can estimate the signal power and the noise power, you can set the ridge penalty correctly.
ISTA for the LASSO
The iteration is
$$\beta^{(k+1)} = \eta_{\tau\lambda}\big(\beta^{(k)} - \tau\, X^\top(X\beta^{(k)} - y)\big), \qquad \tau \le 1/\lVert X \rVert_2^2 .$$
Complexity: each iteration costs $O(np)$; the objective gap converges at rate $O(1/k)$. The soft-threshold operator is $\eta_\tau(x) = \operatorname{sign}(x)\max(|x| - \tau, 0)$, and it is the proximal operator of $\tau\lVert\cdot\rVert_1$. FISTA accelerates ISTA to $O(1/k^2)$ via Nesterov momentum.
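A minimal NumPy sketch of ISTA under the convention $\tfrac12\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$ (function names here are illustrative):

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1, applied component-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(X, y, lam, n_iters=500):
    """ISTA for  min_beta  0.5*||y - X beta||^2 + lam*||beta||_1 ."""
    tau = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1/L, L = Lipschitz const of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)             # gradient of the smooth part
        beta = soft_threshold(beta - tau * grad, tau * lam)
    return beta
```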
Example: Bias–Variance Tradeoff for Ridge at Fixed $\gamma$
For the Gaussian model of the theorem Optimal Ridge Regularization in the Marchenko–Pastur Regime, with $\gamma$, $\sigma^2$, and the SNR held fixed, numerically sketch the per-coordinate MSE of ridge as a function of $\lambda$ and identify the optimum. Compare with the OLS risk.
OLS baseline
OLS per-coordinate risk: $\sigma^2/(1-\gamma)$ for $\gamma < 1$, which diverges as $\gamma \to 1$.
Optimal ridge
By the theorem Optimal Ridge Regularization in the Marchenko–Pastur Regime, $\lambda^\star = \sigma^2/\sigma_\beta^2$. The MSE at this $\lambda$ is strictly below the OLS risk $\sigma^2/(1-\gamma)$; see the interactive plot Ridge Risk vs. Regularization (with $\gamma$ and SNR sliders) for the numerical evaluation, and the Monte-Carlo sketch after the regime list below.
Regimes
- $\lambda \to 0$: ridge $\to$ OLS, risk $\to \sigma^2/(1-\gamma)$.
- $\lambda \to \infty$: $\hat\beta_\lambda \to 0$, risk $\to \sigma_\beta^2$.
- Intermediate $\lambda$: convex trade-off with minimum at $\lambda^\star = \sigma^2/\sigma_\beta^2$.
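The Monte-Carlo sketch referenced above, using the eigenbasis trick from the proof sketch. The parameter values are illustrative placeholders, since the example's original settings are not fixed here:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sigma2, sigma_beta2 = 0.8, 1.0, 4.0   # illustrative values
n, trials = 1000, 20
p = int(gamma * n)
lams = np.logspace(-3, 2, 40)
mse = np.zeros_like(lams)

for _ in range(trials):
    X = rng.standard_normal((n, p)) / np.sqrt(n)
    beta = rng.normal(0.0, np.sqrt(sigma_beta2), size=p)
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    # Diagonalise X^T X once; ridge is scalar shrinkage in the eigenbasis.
    evals, V = np.linalg.eigh(X.T @ X)
    b = V.T @ (X.T @ y)
    beta_rot = V.T @ beta
    for i, lam in enumerate(lams):
        mse[i] += np.sum((b / (evals + lam) - beta_rot) ** 2) / (p * trials)

print("empirical argmin lambda:", lams[np.argmin(mse)])
print("theory lambda* = sigma2/sigma_beta2 =", sigma2 / sigma_beta2)
```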
Ridge Risk vs. Regularization (with $\gamma$ and SNR sliders)
The per-coordinate MSE of the ridge estimator as a function of $\lambda$, with both theoretical curves and Monte-Carlo dots, for adjustable $\gamma$ and SNR. Observe that the minimum always sits at $\lambda^\star = \sigma^2/\sigma_\beta^2$, regardless of $\gamma$.
LASSO Regularization Path
Plot the coordinate paths of $\hat\beta_\lambda$ as $\lambda$ sweeps from $\lambda_{\max}$ (full sparsity) down to zero (OLS). The sparsity pattern is visible: each coordinate "activates" at a distinct $\lambda$.
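One way to compute such a path numerically is scikit-learn's `lasso_path`; a sketch with synthetic data (all values illustrative):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
beta_true = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0])   # sparse ground truth
y = X @ beta_true + 0.5 * rng.standard_normal(100)

# alphas are returned in decreasing order, from full sparsity toward OLS.
alphas, coefs, _ = lasso_path(X, y)
for a, c in zip(alphas[::10], coefs.T[::10]):
    print(f"alpha={a:7.4f}  active coordinates={np.flatnonzero(c)}")
```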
Ridge vs. LASSO vs. Elastic Net
| Property | Ridge | LASSO | Elastic Net |
|---|---|---|---|
| Penalty | $\tfrac{\lambda}{2}\lVert\beta\rVert_2^2$ | $\lambda\lVert\beta\rVert_1$ | $\lambda\big(\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\big)$ |
| Closed form | Yes | No (coordinate-wise soft-threshold) | No |
| Sparse solution | Never | Yes, for $\lambda$ large enough | Yes (when $\alpha > 0$) |
| Strictly convex | Yes ($\lambda > 0$) | No | Yes ($\alpha < 1$) |
| Bayesian MAP under | Gaussian prior | Laplace prior | Gaussian × Laplace |
| Behaviour on correlated columns | Spreads weight | Picks one arbitrarily | Spreads, with sparsity |
| Optimal $\lambda$ (Gaussian $\beta$) | $\lambda^\star = \sigma^2/\sigma_\beta^2$ | No closed form | No closed form |
Common Mistake: Forgetting to Centre and Scale Before Ridge / LASSO
Mistake:
Applying ridge or LASSO to un-normalised columns and expecting the penalty to treat all predictors symmetrically.
Correction:
The $\ell_1$ and $\ell_2$ penalties are not scale-invariant: a column with ten-times-larger values gets ten-times-less regularisation in effective terms. Standard practice is to centre and scale each column of $X$ before applying the penalty, and to centre $y$ (or include an un-penalised intercept).
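A typical preprocessing pipeline, sketched with scikit-learn (the data here are synthetic placeholders):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 0] *= 10.0                        # one column on a 10x larger scale
y = X @ np.ones(5) + rng.standard_normal(100)

# StandardScaler centres and scales each column, so the ridge penalty
# treats all predictors symmetrically; the intercept stays un-penalised.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```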
Why This Matters: Massive-MIMO Channel Estimation is Ridge Regression
In massive MIMO, the downlink channel matrix is estimated from uplink pilot symbols. When the number of pilot observations is smaller than the channel dimension (as is typical), OLS is ill-defined, and LMMSE estimation — which uses prior knowledge of the channel's second-order statistics — is the standard choice. The LMMSE channel estimator is exactly a ridge estimator whose regularization parameter is determined by the SNR (see the theorem Optimal Ridge Regularization in the Marchenko–Pastur Regime). This is why practical MIMO receivers are built around covariance-based estimators, not OLS.
Quick Check
Which of the following best explains why LASSO produces exactly-zero coefficients but ridge does not?
- The LASSO objective is non-convex.
- The $\ell_1$ ball has corners on the coordinate axes.
- LASSO optimises the $\ell_0$ norm directly.
- Ridge penalises large coefficients more than LASSO does.
Answer: the $\ell_1$ ball has corners on the coordinate axes. The sub-level sets of $\lVert\beta\rVert_1$ are diamond-shaped with corners on the axes; the solution tends to land on these corners, zeroing out coordinates.
Soft-Thresholding Operator
For $\tau > 0$, $\eta_\tau(x) = \operatorname{sign}(x)\,\max(|x| - \tau, 0)$. It is the proximal operator of $\tau\lVert\cdot\rVert_1$ and appears as the per-coordinate update in ISTA, FISTA, and AMP.
Related: ISTA for the LASSO, Proximal operator