Ferkans — Interactive Telecom Tutor

Regularization as a Statistical Necessity

The previous section demonstrated that the MLE is disastrous when $\gamma$ is close to one. The point is that any sensible estimator must be biased in this regime. The question is therefore not "should we regularise?" but "how much, and in what geometry?". We treat the two canonical penalties — the $\ell_2$ ridge penalty and the $\ell_1$ LASSO penalty — and in each case derive the optimal regularization strength as a function of $\gamma$ and the signal-to-noise ratio. These are convex problems, and the solutions admit a closed-form analysis in the Marchenko–Pastur regime.

Definition:
Ridge Regression (Tikhonov Regularization)

For $\lambda>0$ , the ridge estimator is the solution of the penalised least-squares programme

$\hat{\mathbf{x}}_{\text{ridge}}(\lambda)=\arg\min_{\mathbf{x}\in\mathbb{R}^N}\tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2+\tfrac{\lambda}{2}\|\mathbf{x}\|^2.$

The objective is strictly convex and admits the closed form

$\hat{\mathbf{x}}_{\text{ridge}}(\lambda)=(\mathbf{A}^{T}\mathbf{A}+\lambda\mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{y}.$

Ridge trades variance against squared bias. The problem is convex for every $\lambda\geq 0$ , strictly convex for $\lambda>0$ , and the unique minimiser is available in closed form — the reader should verify that this follows from differentiating and setting the gradient to zero.

Definition:
LASSO Estimator

For $\lambda>0$ , the LASSO estimator is

$\hat{\mathbf{z}}_{\text{LASSO}}(\lambda)=\arg\min_{\mathbf{x}\in\mathbb{R}^N}\tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2+\lambda\|\mathbf{x}\|_1.$

The problem is convex but not strictly convex; the $\ell_1$ penalty is non-smooth, which produces the sparsity-promoting "corners" of its sub-level sets. No closed form exists in general, but proximal algorithms (ISTA, FISTA, AMP) solve it efficiently.

The convexity reflex applies: whenever we see $\ell_1$ + squared-error we should flag the problem as convex and reach for proximal-gradient machinery.

Definition:
Elastic Net

The elastic net interpolates between ridge and LASSO by combining both penalties:

$\hat{\mathbf{x}}_{\text{EN}}(\lambda,\alpha)=\arg\min_{\mathbf{x}}\tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2+\lambda\bigl(\alpha\|\mathbf{x}\|_1+\tfrac{1-\alpha}{2}\|\mathbf{x}\|^2\bigr).$

The parameter $\alpha\in[0,1]$ sets the mix: $\alpha=0$ gives ridge, $\alpha=1$ gives LASSO. The $\ell_2$ term restores strict convexity and stabilises solutions in the presence of correlated columns.

Elastic net is the default in practice when columns of $\mathbf{A}$ are correlated — pure LASSO then selects capriciously among correlated predictors, whereas the $\ell_2$ term spreads weight across them.

Theorem: Optimal Ridge Regularization in the Marchenko–Pastur Regime

Consider the model of DGaussian Linear Observation Model with $\mathbf{A}$ i.i.d. $\mathcal{N}(0,1/M)$ , and suppose $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\sigma_x^2\mathbf{I}_N)$ is independent of $\mathbf{A}$ . Define the signal-to-noise ratio $\text{SNR}=\sigma_x^2/\sigma^2$ . In the proportional asymptotic regime, the per-coordinate MSE of the ridge estimator

$\mathrm{MSE}(\lambda):=\tfrac{1}{N}\mathbb{E}\bigl[\|\hat{\mathbf{x}}_{\text{ridge}}(\lambda)-\mathbf{x}\|^2\bigr]$

is minimised at

$\lambda^\star=\frac{\sigma^2}{\sigma_x^2}=\text{SNR}^{-1}.$

The minimum value depends on $\gamma$ and is strictly smaller than the OLS risk $\gamma\sigma^2/(1-\gamma)$ for every $\gamma<1$ , and finite even for $\gamma\geq 1$ .

The Bayesian story tells us the answer immediately: under a Gaussian prior $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\sigma_x^2\mathbf{I})$ , the MAP estimator is ridge regression with $\lambda=\sigma^2/\sigma_x^2$ , and it coincides with the Bayes-optimal LMMSE estimator. The theorem confirms that this Bayesian-motivated choice is also frequentist-optimal in the proportional regime.

Show Hint

Write the ridge estimator in the singular-value basis of $\mathbf{A}$ to obtain a scalar shrinkage in each direction.

Compute the MSE as an integral against the MP density, then differentiate in $\lambda$ .

The optimum is characterised by a Stieltjes-transform identity; alternatively, recognise the LMMSE connection and use orthogonality of the error.

Proof

LMMSE characterisation

Under a Gaussian prior, the joint distribution of $(\mathbf{x},\mathbf{y})$ given $\mathbf{A}$ is Gaussian. The LMMSE (=MAP=posterior mean) estimator is $\mathbb{E}[\mathbf{x}\mid\mathbf{y},\mathbf{A}]=\sigma_x^2\mathbf{A}^{T}(\sigma_x^2\mathbf{A}\,\mathbf{A}^{T}+\sigma^2\mathbf{I}_M)^{-1}\mathbf{y}.$ A push-through identity rewrites this as $(\mathbf{A}^{T}\mathbf{A}+\tfrac{\sigma^2}{\sigma_x^2}\mathbf{I}_N)^{-1}\mathbf{A}^{T}\mathbf{y}$ , which is exactly $\hat{\mathbf{x}}_{\text{ridge}}(\sigma^2/\sigma_x^2)$ .

Orthogonality implies optimality

By the orthogonality principle, the LMMSE estimator is the minimum-MSE linear estimator in the Bayesian risk (averaged over both $\mathbf{x}$ and $\mathbf{w}$ ). Any other ridge parameter produces a sub-optimal linear estimator, so $\lambda^\star=\sigma^2/\sigma_x^2=\text{SNR}^{-1}$ .

Closed-form risk

The minimum per-coordinate MSE, in the proportional regime, is the fixed point of the scalar equation $\mathrm{MSE}^\star=\sigma^2\cdot m(-\lambda^\star),$ where $m(z)=\int(\mu-z)^{-1}f_\gamma(\mu)\,d\mu$ is the Stieltjes transform of the MP law. This is finite for all $\gamma>0$ , including $\gamma\geq 1$ . $\blacksquare$

Key Takeaway

Under a Gaussian prior, optimal ridge regularisation is $\lambda^\star=1/\text{SNR}$ . This is the single most useful heuristic in all of high-dimensional estimation: if you can estimate the signal power and the noise power, you can set the ridge penalty correctly.

ISTA for the LASSO

Complexity: Each iteration costs

O(MN)

; converges at rate

O(1/k)

.

Input:

\mathbf{A}\in\mathbb{R}^{M\times N}

,

\mathbf{y}\in\mathbb{R}^M

,

\lambda>0

, step size

\tau\leq 1/\|\mathbf{A}\|_2^2

Output: LASSO minimiser

\hat{\mathbf{x}}

1. Initialise

\mathbf{x}^{(0)}\leftarrow\mathbf{0}

2. for

k=0,1,2,\ldots

do

3.

\quad\mathbf{r}^{(k)}\leftarrow\mathbf{y}-\mathbf{A}\mathbf{x}^{(k)}

(residual)

4.

\quad\mathbf{z}^{(k)}\leftarrow\mathbf{x}^{(k)}+\tau\,\mathbf{A}^{T}\mathbf{r}^{(k)}

(gradient step)

5.

\quad\mathbf{x}^{(k+1)}\leftarrow\eta_{\tau\lambda}(\mathbf{z}^{(k)})

(soft-threshold, coordinate-wise)

6. until convergence

The soft-threshold operator is $\eta_\theta(z)=\text{sign}(z)\max(|z|-\theta,0)$ , and it is the proximal operator of $\theta\|\cdot\|_1$ . FISTA accelerates ISTA to $O(1/k^2)$ via Nesterov momentum.

Example: Bias–Variance Tradeoff for Ridge at Fixed $\gamma$

For the Gaussian model of TOptimal Ridge Regularization in the Marchenko–Pastur Regime with $\gamma=0.5$ , $\sigma_x^2=1$ , $\sigma^2=0.1$ , numerically sketch the per-coordinate MSE of ridge as a function of $\lambda$ and identify the optimum. Compare with the OLS risk.

Solution

OLS baseline

OLS per-coordinate risk: $\gamma\sigma^2/(1-\gamma)=0.5\times 0.1/0.5=0.1$ .

Optimal ridge

By TOptimal Ridge Regularization in the Marchenko–Pastur Regime, $\lambda^\star=\sigma^2/\sigma_x^2=0.1$ . The MSE at this $\lambda^\star$ is strictly below $0.1$ ; numerical evaluation (see $\gamma$ $γ$ and SNR sliders)" data-ref-type="interactive_plot">📊Ridge Risk vs. Regularization (with $\gamma$ and SNR sliders)) gives $\mathrm{MSE}^\star\approx 0.0854$ .

Regimes

$\lambda\to 0^+$ : ridge $\to$ OLS, risk $\to 0.1$ .
$\lambda\to\infty$ : $\hat{\mathbf{x}}\to\mathbf{0}$ , risk $\to \sigma_x^2=1$ .
Intermediate $\lambda$ : convex trade-off with minimum at $\text{SNR}^{-1}$ .

Ridge Risk vs. Regularization (with $\gamma$ and SNR sliders)

The per-coordinate MSE of the ridge estimator as a function of $\lambda$ , with both theoretical curves and Monte-Carlo dots, for adjustable $\gamma$ and SNR. Observe that the minimum always sits at $\lambda^\star=1/\text{SNR}$ , regardless of $\gamma$ .

Parameters

\gamma

0.5

SNR (dB)10

LASSO Regularization Path

Plot the $N$ coordinate paths of $\hat{\mathbf{z}}_{\text{LASSO}}(\lambda)$ as $\lambda$ sweeps from $\|\mathbf{A}^{T}\mathbf{y}\|_\infty$ (full sparsity) down to zero (OLS). The sparsity pattern is visible: each coordinate "activates" at a distinct $\lambda$ .

Parameters

N

40

M

60

s

(true non-zeros)5

Ridge vs. LASSO vs. Elastic Net

Property	Ridge	LASSO	Elastic Net
Penalty	$\tfrac{\lambda}{2}\\|\mathbf{x}\\|_2^2$	$\lambda\\|\mathbf{x}\\|_1$	$\lambda\bigl(\alpha\\|\mathbf{x}\\|_1+\tfrac{1-\alpha}{2}\\|\mathbf{x}\\|_2^2\bigr)$
Closed form	Yes	No (coordinate-wise soft-threshold)	No
Sparse solution	Never	Yes, for $\lambda$ large enough	Yes (when $\alpha>0$ )
Strictly convex	Yes ( $\lambda>0$ )	No	Yes ( $\alpha<1$ )
Bayesian MAP under	Gaussian prior	Laplace prior	Gaussian × Laplace
Behaviour on correlated columns	Spreads weight	Picks one arbitrarily	Spreads, with sparsity
Optimal $\lambda$ (Gaussian)	$1/\text{SNR}$	No closed form	No closed form

Common Mistake: Forgetting to Centre and Scale Before Ridge / LASSO

Mistake:

Applying ridge or LASSO to un-normalised columns and expecting the penalty to treat all predictors symmetrically.

Correction:

The $\ell_2$ and $\ell_1$ penalties are not scale-invariant: a column with ten-times-larger values gets ten-times-less regularisation in effective terms. Standard practice is to centre and scale each column of $\mathbf{A}$ before applying the penalty, and to centre $\mathbf{y}$ (or include an un-penalised intercept).

Why This Matters: Massive-MIMO Channel Estimation is Ridge Regression

In massive MIMO, the downlink channel matrix $\mathbf{H}\in\mathbb{C}^{N_t\times K}$ is estimated from $M$ uplink pilot symbols. When $M<N_t$ (as is typical), OLS is ill-defined, and LMMSE estimation — which uses prior knowledge of the channel second-order statistics — is the standard choice. The LMMSE channel estimator is exactly a ridge estimator whose regularization parameter is determined by the SNR (see TOptimal Ridge Regularization in the Marchenko–Pastur Regime). This is why practical MIMO receivers are built around covariance-based estimators, not OLS.

Quick Check

Which of the following best explains why LASSO produces exactly-zero coefficients but ridge does not?

The LASSO objective is non-convex.

The $\ell_1$ ball has corners on the coordinate axes.

LASSO optimises the $\ell_0$ norm directly.

Ridge penalises large coefficients more than LASSO does.

Correction:

The

\ell_1

ball has corners on the coordinate axes.

The sub-level sets of $\|\mathbf{x}\|_1$ are diamond-shaped with corners on the axes; the solution tends to land on these corners, zeroing out coordinates.

Soft-Thresholding Operator

For $\theta\geq 0$ , $\eta_\theta(z)=\text{sign}(z)\max(|z|-\theta,0)$ . It is the proximal operator of $\theta\|\cdot\|_1$ and appears as the per-coordinate update in ISTA, FISTA, and AMP.

Regularized Estimation

Regularization as a Statistical Necessity

Definition: Ridge Regression (Tikhonov Regularization)

Definition: LASSO Estimator

Definition: Elastic Net

Theorem: Optimal Ridge Regularization in the Marchenko–Pastur Regime

LMMSE characterisation

Orthogonality implies optimality

Closed-form risk

Key Takeaway

ISTA for the LASSO

Example: Bias–Variance Tradeoff for Ridge at Fixed γ\gammaγ

OLS baseline

Optimal ridge

Regimes

Ridge Risk vs. Regularization (with γ\gammaγ and SNR sliders)

Parameters

LASSO Regularization Path

Parameters

Ridge vs. LASSO vs. Elastic Net

Common Mistake: Forgetting to Centre and Scale Before Ridge / LASSO

Why This Matters: Massive-MIMO Channel Estimation is Ridge Regression

Quick Check

Soft-Thresholding Operator

Definition:
Ridge Regression (Tikhonov Regularization)

Definition:
LASSO Estimator

Definition:
Elastic Net

Example: Bias–Variance Tradeoff for Ridge at Fixed $\gamma$

Ridge Risk vs. Regularization (with $\gamma$ and SNR sliders)