Regularized Estimation

Regularization as a Statistical Necessity

The previous section demonstrated that the MLE is disastrous when $\gamma$ is close to one. The point is that any sensible estimator must be biased in this regime. The question is therefore not "should we regularise?" but "how much, and in what geometry?". We treat the two canonical penalties — the $\ell_2$ ridge penalty and the $\ell_1$ LASSO penalty — and in each case derive the optimal regularization strength as a function of $\gamma$ and the signal-to-noise ratio. These are convex problems, and the solutions admit a closed-form analysis in the Marchenko–Pastur regime.

Definition: Ridge Regression (Tikhonov Regularization)

For $\lambda>0$, the ridge estimator is the solution of the penalised least-squares programme

$$\hat{\mathbf{x}}_{\text{ridge}}(\lambda)=\arg\min_{\mathbf{x}\in\mathbb{R}^N}\tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2+\tfrac{\lambda}{2}\|\mathbf{x}\|^2.$$

The objective is strictly convex and admits the closed form

$$\hat{\mathbf{x}}_{\text{ridge}}(\lambda)=(\mathbf{A}^{T}\mathbf{A}+\lambda\mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{y}.$$

Ridge trades variance against squared bias. The problem is convex for every $\lambda\geq 0$, strictly convex for $\lambda>0$, and the unique minimiser is available in closed form — the reader should verify that this follows from differentiating and setting the gradient to zero.
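To make the closed form concrete, here is a minimal NumPy sketch; the problem sizes, noise level, and random seed are illustrative choices, not values from the text.

```python
import numpy as np

def ridge(A, y, lam):
    """Closed-form ridge estimate (A^T A + lam*I)^{-1} A^T y."""
    N = A.shape[1]
    # Solve the regularised normal equations rather than forming an explicit inverse.
    return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

# Illustrative sizes: M observations, N unknowns.
rng = np.random.default_rng(0)
M, N = 200, 100
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))   # i.i.d. N(0, 1/M) entries
x = rng.normal(size=N)
y = A @ x + 0.1 * rng.normal(size=M)

x_hat = ridge(A, y, lam=0.1)
print(np.mean((x_hat - x) ** 2))   # per-coordinate squared error
```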

Definition: LASSO Estimator

For $\lambda>0$, the LASSO estimator is

$$\hat{\mathbf{x}}_{\text{LASSO}}(\lambda)=\arg\min_{\mathbf{x}\in\mathbb{R}^N}\tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2+\lambda\|\mathbf{x}\|_1.$$

The problem is convex but not strictly convex; the $\ell_1$ penalty is non-smooth, which produces the sparsity-promoting "corners" of its sub-level sets. No closed form exists in general, but proximal algorithms (ISTA, FISTA, AMP) solve it efficiently.

The convexity reflex applies: whenever we see an $\ell_1$ penalty added to a squared-error loss, we should flag the problem as convex and reach for proximal-gradient machinery.

Definition: Elastic Net

The elastic net interpolates between ridge and LASSO by combining both penalties:

$$\hat{\mathbf{x}}_{\text{EN}}(\lambda,\alpha)=\arg\min_{\mathbf{x}}\tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2+\lambda\bigl(\alpha\|\mathbf{x}\|_1+\tfrac{1-\alpha}{2}\|\mathbf{x}\|^2\bigr).$$

The parameter $\alpha\in[0,1]$ sets the mix: $\alpha=0$ gives ridge, $\alpha=1$ gives LASSO. The $\ell_2$ term restores strict convexity and stabilises solutions in the presence of correlated columns.

Elastic net is a common default in practice when columns of $\mathbf{A}$ are correlated — pure LASSO then selects capriciously among correlated predictors, whereas the $\ell_2$ term spreads weight across them.
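For readers who implement the elastic net by proximal gradient, the proximal operator of the combined penalty (under the parameterisation of the display above) factorises into a soft-threshold followed by a multiplicative shrinkage. The sketch below is one way to write it; the function name is an illustrative choice.

```python
import numpy as np

def prox_elastic_net(z, tau, lam, alpha):
    """Proximal operator of tau*lam*(alpha*||x||_1 + (1-alpha)/2*||x||_2^2),
    applied coordinate-wise to the vector z."""
    soft = np.sign(z) * np.maximum(np.abs(z) - tau * lam * alpha, 0.0)  # l1 part
    return soft / (1.0 + tau * lam * (1.0 - alpha))                     # l2 part
```

Setting $\alpha=1$ recovers the plain soft-threshold used by ISTA below; $\alpha=0$ gives the pure multiplicative shrinkage of a ridge step.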

Theorem: Optimal Ridge Regularization in the Marchenko–Pastur Regime

Consider the Gaussian Linear Observation Model with $\mathbf{A}$ having i.i.d. $\mathcal{N}(0,1/M)$ entries, and suppose $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\sigma_x^2\mathbf{I}_N)$ is independent of $\mathbf{A}$. Define the signal-to-noise ratio $\text{SNR}=\sigma_x^2/\sigma^2$. In the proportional asymptotic regime, the per-coordinate MSE of the ridge estimator

$$\mathrm{MSE}(\lambda):=\tfrac{1}{N}\,\mathbb{E}\bigl[\|\hat{\mathbf{x}}_{\text{ridge}}(\lambda)-\mathbf{x}\|^2\bigr]$$

is minimised at

$$\lambda^\star=\frac{\sigma^2}{\sigma_x^2}=\text{SNR}^{-1}.$$

The minimum value depends on $\gamma$ and is strictly smaller than the OLS risk $\gamma\sigma^2/(1-\gamma)$ for every $\gamma<1$, and finite even for $\gamma\geq 1$.

The Bayesian story tells us the answer immediately: under a Gaussian prior $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\sigma_x^2\mathbf{I})$, the MAP estimator is ridge regression with $\lambda=\sigma^2/\sigma_x^2$, and it coincides with the Bayes-optimal LMMSE estimator. The theorem confirms that this Bayesian-motivated choice is also frequentist-optimal in the proportional regime.
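To spell the correspondence out, the MAP computation under this Gaussian model is a single line: maximising the posterior is minimising the negative log-posterior,

$$\hat{\mathbf{x}}_{\text{MAP}}=\arg\min_{\mathbf{x}}\;\frac{1}{2\sigma^{2}}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^{2}+\frac{1}{2\sigma_{x}^{2}}\|\mathbf{x}\|^{2}=\arg\min_{\mathbf{x}}\;\tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^{2}+\frac{\lambda}{2}\|\mathbf{x}\|^{2}\quad\text{with}\quad\lambda=\frac{\sigma^{2}}{\sigma_{x}^{2}},$$

since multiplying the objective by $\sigma^{2}$ does not change the minimiser.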

Key Takeaway

Under a Gaussian prior, optimal ridge regularisation is $\lambda^\star=1/\text{SNR}$. This is the single most useful heuristic in all of high-dimensional estimation: if you can estimate the signal power and the noise power, you can set the ridge penalty correctly.

ISTA for the LASSO

Complexity: each iteration costs $O(MN)$; the objective gap decreases at rate $O(1/k)$.
Input: $\mathbf{A}\in\mathbb{R}^{M\times N}$, $\mathbf{y}\in\mathbb{R}^M$, $\lambda>0$, step size $\tau\leq 1/\|\mathbf{A}\|_2^2$
Output: LASSO minimiser $\hat{\mathbf{x}}$
1. Initialise $\mathbf{x}^{(0)}\leftarrow\mathbf{0}$
2. repeat for $k=0,1,2,\ldots$
3. $\quad\mathbf{r}^{(k)}\leftarrow\mathbf{y}-\mathbf{A}\mathbf{x}^{(k)}$ (residual)
4. $\quad\mathbf{z}^{(k)}\leftarrow\mathbf{x}^{(k)}+\tau\,\mathbf{A}^{T}\mathbf{r}^{(k)}$ (gradient step)
5. $\quad\mathbf{x}^{(k+1)}\leftarrow\eta_{\tau\lambda}(\mathbf{z}^{(k)})$ (soft-threshold, coordinate-wise)
6. until convergence

The soft-threshold operator is $\eta_\theta(z)=\text{sign}(z)\max(|z|-\theta,0)$, and it is the proximal operator of $\theta\|\cdot\|_1$. FISTA accelerates ISTA to $O(1/k^2)$ via Nesterov momentum.
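A minimal NumPy translation of the pseudocode above, offered as a sketch: the function names, the iteration cap, and the stopping tolerance are illustrative choices.

```python
import numpy as np

def soft_threshold(z, theta):
    """eta_theta(z) = sign(z) * max(|z| - theta, 0), applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def ista(A, y, lam, max_iter=5000, tol=1e-8):
    """ISTA for min_x 0.5 * ||y - A x||^2 + lam * ||x||_1."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2      # step size tau <= 1 / ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        r = y - A @ x                          # residual
        z = x + tau * (A.T @ r)                # gradient step on the smooth part
        x_new = soft_threshold(z, tau * lam)   # proximal step on the l1 part
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1.0):
            return x_new
        x = x_new
    return x
```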

Example: Bias–Variance Tradeoff for Ridge at Fixed $\gamma$

For the Gaussian model of the theorem Optimal Ridge Regularization in the Marchenko–Pastur Regime, with $\gamma=0.5$, $\sigma_x^2=1$, $\sigma^2=0.1$, numerically sketch the per-coordinate MSE of ridge as a function of $\lambda$ and identify the optimum. Compare with the OLS risk.
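One way to carry out the sketch is the following Monte-Carlo script (a hedged example: it takes $\gamma=N/M$, and the grid of $\lambda$ values, the problem size, and the trial count are arbitrary choices). The empirical minimum should land near $\lambda^\star=\sigma^2/\sigma_x^2=0.1$, with an MSE below the OLS risk $\gamma\sigma^2/(1-\gamma)=0.1$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sigma_x2, sigma2 = 0.5, 1.0, 0.1        # gamma = N/M, signal and noise variances
M = 400
N = int(gamma * M)
lambdas = np.logspace(-3, 1, 40)
trials = 50

mse = np.zeros_like(lambdas)
for _ in range(trials):
    A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
    x = rng.normal(0.0, np.sqrt(sigma_x2), size=N)
    y = A @ x + rng.normal(0.0, np.sqrt(sigma2), size=M)
    G = A.T @ A
    Aty = A.T @ y
    for i, lam in enumerate(lambdas):
        x_hat = np.linalg.solve(G + lam * np.eye(N), Aty)
        mse[i] += np.mean((x_hat - x) ** 2) / trials

best = lambdas[np.argmin(mse)]
print(f"empirical optimum lambda ~ {best:.3f}  (theory: {sigma2 / sigma_x2:.3f})")
print(f"min ridge MSE ~ {mse.min():.4f}  vs OLS risk {gamma * sigma2 / (1 - gamma):.4f}")
```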

Ridge Risk vs. Regularization (with $\gamma$ and SNR sliders)

The per-coordinate MSE of the ridge estimator as a function of $\lambda$, with both theoretical curves and Monte-Carlo dots, for adjustable $\gamma$ and SNR. Observe that the minimum always sits at $\lambda^\star=1/\text{SNR}$, regardless of $\gamma$.


LASSO Regularization Path

Plot the $N$ coordinate paths of $\hat{\mathbf{x}}_{\text{LASSO}}(\lambda)$ as $\lambda$ sweeps from $\|\mathbf{A}^{T}\mathbf{y}\|_\infty$ (full sparsity) down to zero (OLS). The sparsity pattern is visible: each coordinate "activates" at a distinct $\lambda$.

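One possible way to trace such a path is sketched below with scikit-learn's lasso_path (an assumption of this example, not a tool named in the text); scikit-learn scales the data-fit term by $1/(2M)$, so its alpha corresponds to $\lambda/M$ in this section's notation. The problem sizes are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# Illustrative sizes: M observations, N unknowns, k non-zero coefficients.
rng = np.random.default_rng(0)
M, N, k = 60, 40, 5
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
x = np.zeros(N)
x[rng.choice(N, size=k, replace=False)] = rng.normal(size=k)
y = A @ x + 0.05 * rng.normal(size=M)

# scikit-learn minimises (1/(2M))*||y - Ax||^2 + alpha*||x||_1, so lambda = alpha * M.
alphas, coefs, _ = lasso_path(A, y, n_alphas=100)
lambdas = alphas * M

for j in range(N):
    plt.plot(lambdas, coefs[j], lw=1)
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("coefficient value")
plt.title("LASSO regularisation path")
plt.show()
```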

Ridge vs. LASSO vs. Elastic Net

| Property | Ridge | LASSO | Elastic Net |
|---|---|---|---|
| Penalty | $\tfrac{\lambda}{2}\Vert\mathbf{x}\Vert_2^2$ | $\lambda\Vert\mathbf{x}\Vert_1$ | $\lambda\bigl(\alpha\Vert\mathbf{x}\Vert_1+\tfrac{1-\alpha}{2}\Vert\mathbf{x}\Vert_2^2\bigr)$ |
| Closed form | Yes | No (coordinate-wise soft-threshold) | No |
| Sparse solution | Never | Yes, for $\lambda$ large enough | Yes (when $\alpha>0$) |
| Strictly convex | Yes ($\lambda>0$) | No | Yes ($\alpha<1$) |
| Bayesian MAP under | Gaussian prior | Laplace prior | Gaussian × Laplace |
| Behaviour on correlated columns | Spreads weight | Picks one arbitrarily | Spreads, with sparsity |
| Optimal $\lambda$ (Gaussian) | $1/\text{SNR}$ | No closed form | No closed form |

Common Mistake: Forgetting to Centre and Scale Before Ridge / LASSO

Mistake:

Applying ridge or LASSO to un-normalised columns and expecting the penalty to treat all predictors symmetrically.

Correction:

The $\ell_2$ and $\ell_1$ penalties are not scale-invariant: a column whose entries are ten times larger is effectively penalised ten times less under $\ell_1$ (and a hundred times less under $\ell_2$). Standard practice is to centre and scale each column of $\mathbf{A}$ before applying the penalty, and to centre $\mathbf{y}$ (or include an un-penalised intercept).
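A minimal sketch of that preprocessing in NumPy, assuming a dense $\mathbf{A}$ and a real-valued response; the mapping back to the original scale is noted in a comment.

```python
import numpy as np

def standardize_columns(A, y):
    """Centre y, centre and unit-scale each column of A."""
    A_mean = A.mean(axis=0)
    A_std = A.std(axis=0)
    A_std[A_std == 0.0] = 1.0            # guard against constant columns
    y_mean = y.mean()
    A_s = (A - A_mean) / A_std
    y_c = y - y_mean
    return A_s, y_c, A_mean, A_std, y_mean

# After fitting coefficients x_s on (A_s, y_c), map back to the original scale:
#   x_orig = x_s / A_std,  intercept = y_mean - A_mean @ x_orig
```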

Why This Matters: Massive-MIMO Channel Estimation is Ridge Regression

In massive MIMO, the downlink channel matrix $\mathbf{H}\in\mathbb{C}^{N_t\times K}$ is estimated from $M$ uplink pilot symbols. When $M<N_t$ (as is typical), OLS is ill-defined, and LMMSE estimation — which uses prior knowledge of the channel second-order statistics — is the standard choice. The LMMSE channel estimator is exactly a ridge estimator whose regularization parameter is determined by the SNR (see the theorem Optimal Ridge Regularization in the Marchenko–Pastur Regime). This is why practical MIMO receivers are built around covariance-based estimators, not OLS.
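In the white, identity-covariance case of the theorem above (and written in that theorem's notation), the identification is the matrix push-through identity; the practical MIMO estimator replaces transposes by conjugate transposes and the identity prior by the channel covariance:

$$\hat{\mathbf{x}}_{\text{LMMSE}}=\sigma_x^{2}\,\mathbf{A}^{T}\bigl(\sigma_x^{2}\,\mathbf{A}\mathbf{A}^{T}+\sigma^{2}\mathbf{I}\bigr)^{-1}\mathbf{y}=\Bigl(\mathbf{A}^{T}\mathbf{A}+\tfrac{\sigma^{2}}{\sigma_x^{2}}\mathbf{I}\Bigr)^{-1}\mathbf{A}^{T}\mathbf{y}=\hat{\mathbf{x}}_{\text{ridge}}\bigl(1/\text{SNR}\bigr).$$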

Quick Check

Which of the following best explains why LASSO produces exactly-zero coefficients but ridge does not?

The LASSO objective is non-convex.

The $\ell_1$ ball has corners on the coordinate axes.

LASSO optimises the $\ell_0$ norm directly.

Ridge penalises large coefficients more than LASSO does.

Soft-Thresholding Operator

For $\theta\geq 0$, $\eta_\theta(z)=\text{sign}(z)\max(|z|-\theta,0)$. It is the proximal operator of $\theta\|\cdot\|_1$ and appears as the per-coordinate update in ISTA, FISTA, and AMP.

Related: ISTA for the LASSO, Proximal operator