Basis Pursuit and LASSO

From $\ell_0$ to $\ell_1$: A Tractable Surrogate

The $\ell_0$ "norm" is a discrete, non-convex objective, and minimizing it directly is computationally hopeless at scale, so we need a convex surrogate. Among the $\ell_p$ norms with $p \in [0, \infty]$, the $\ell_1$ norm is the unique convex choice that still promotes sparsity: it is the convex envelope of $\|\cdot\|_0$ on the unit $\ell_\infty$ ball. Geometrically, the $\ell_1$ unit ball has corners exactly on the coordinate axes, and a generic random affine constraint touches those corners first. Algorithmically, $\|\cdot\|_1$ is a sum of absolute values β€” a linear program after variable splitting. The $\ell_1$ relaxation is not a mere computational trick: we will prove in Section 13.4 that, under structural conditions on $\mathbf{A}$, $\ell_1$ recovery coincides with $\ell_0$ recovery.

Definition:

Basis Pursuit (BP)

Given $\mathbf{A} \in \mathbb{R}^{M \times N}$ and $\mathbf{y} \in \mathbb{R}^M$, Basis Pursuit is the convex optimization problem
$$(P_1)\qquad \min_{\mathbf{x} \in \mathbb{R}^N} \|\mathbf{x}\|_1 \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{y}.$$
This is the noiseless $\ell_1$ relaxation of $\ell_0$ minimization. It becomes a linear program after introducing the split $\mathbf{x} = \mathbf{x}^+ - \mathbf{x}^-$ with $\mathbf{x}^+, \mathbf{x}^- \geq \mathbf{0}$.
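To make the variable-splitting reformulation concrete, here is a minimal sketch that solves $(P_1)$ with SciPy's general-purpose LP solver; the data are randomly generated for illustration and are not part of the text.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 s.t. Ax = y via the LP split x = x_plus - x_minus, with x_plus, x_minus >= 0."""
    M, N = A.shape
    c = np.ones(2 * N)                    # sum(x_plus) + sum(x_minus) equals ||x||_1 at the optimum
    A_eq = np.hstack([A, -A])             # enforces A (x_plus - x_minus) = y
    res = linprog(c, A_eq=A_eq, b_eq=y)   # linprog's default bounds are x >= 0, as required
    return res.x[:N] - res.x[N:]

# Tiny sanity check: a 1-sparse vector from M = 10 random Gaussian measurements in N = 30 dimensions.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30))
x_true = np.zeros(30)
x_true[4] = 2.0
x_hat = basis_pursuit(A, A @ x_true)
print(np.max(np.abs(x_hat - x_true)))     # recovery error (expected to be tiny here)
```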

Definition:

Basis Pursuit Denoising (BPDN)

When the measurements are noisy, $\mathbf{y} = \mathbf{A}\mathbf{x}^\star + \mathbf{w}$ with $\|\mathbf{w}\|_2 \leq \eta$, the equality constraint is relaxed to a ball:
$$(P_{1,\eta})\qquad \min_{\mathbf{x} \in \mathbb{R}^N} \|\mathbf{x}\|_1 \quad \text{subject to} \quad \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2 \leq \eta.$$
This is a second-order cone program (SOCP).
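Because $(P_{1,\eta})$ is an SOCP, an off-the-shelf conic modeling tool handles it directly. A minimal sketch with the CVXPY library follows; the problem sizes, noise level, and variable names are illustrative assumptions, not part of the text.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
M, N, eta = 40, 100, 0.1
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(M)   # noise scaled so that ||w||_2 <= eta

x = cp.Variable(N)
problem = cp.Problem(cp.Minimize(cp.norm1(x)),
                     [cp.norm(A @ x - y, 2) <= eta])
problem.solve()                                   # dispatched to a conic (SOCP) solver
print(problem.status, np.linalg.norm(x.value - x_true))
```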

Definition:

LASSO (Least Absolute Shrinkage and Selection Operator)

The LASSO estimator is the Lagrangian form
$$\hat{\mathbf{z}}_{\text{LASSO}} = \arg\min_{\mathbf{x} \in \mathbb{R}^N} \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1,$$
where $\lambda > 0$ is a regularization parameter. Equivalently, by convex duality, the LASSO solves the constrained problem
$$\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 \quad \text{subject to} \quad \|\mathbf{x}\|_1 \leq \tau$$
for a corresponding $\tau = \tau(\lambda)$.
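In practice the Lagrangian form is what most libraries implement. A minimal sketch with scikit-learn's Lasso follows; note that scikit-learn minimizes $\frac{1}{2M}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \alpha\|\mathbf{x}\|_1$, so $\alpha = \lambda / M$ matches the objective above. The data here are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
M, N, lam = 50, 200, 0.5
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[[5, 60, 150]] = [1.5, -1.0, 2.0]
y = A @ x_true + 0.01 * rng.standard_normal(M)

# scikit-learn's objective is (1/(2M)) ||y - Ax||_2^2 + alpha ||x||_1, hence alpha = lambda / M.
model = Lasso(alpha=lam / M, fit_intercept=False, max_iter=50_000)
model.fit(A, y)
print(np.flatnonzero(model.coef_))   # indices of the selected (nonzero) coefficients
```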

The three problems BP, BPDN, and LASSO are related but not identical: they trade off data fit against sparsity in different ways. They are connected through their solutions: if $\hat{\mathbf{z}}_{\text{LASSO}}$ solves the LASSO for some $\lambda > 0$ and has residual $\eta = \|\mathbf{y} - \mathbf{A}\hat{\mathbf{z}}_{\text{LASSO}}\|_2$, then it also solves BPDN with that $\eta$; conversely, under mild conditions every BPDN solution solves the LASSO for some $\lambda$. The correspondence between $\lambda$ and $\eta$, however, depends on the data and is not known in advance.
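This correspondence can be checked numerically: solve the LASSO for some $\lambda$, read off its residual $\eta$, and confirm that BPDN with that $\eta$ returns (essentially) the same point. A sketch using CVXPY with illustrative data, assuming the LASSO solution is unique:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
M, N, lam = 30, 80, 0.5
A = rng.standard_normal((M, N))
y = A @ np.concatenate([rng.standard_normal(4), np.zeros(N - 4)]) + 0.05 * rng.standard_normal(M)

# LASSO with a fixed lambda.
x_lasso = cp.Variable(N)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - A @ x_lasso) + lam * cp.norm1(x_lasso))).solve()
eta = np.linalg.norm(y - A @ x_lasso.value)

# BPDN with eta taken from the LASSO residual.
x_bpdn = cp.Variable(N)
cp.Problem(cp.Minimize(cp.norm1(x_bpdn)), [cp.norm(y - A @ x_bpdn, 2) <= eta]).solve()

print(np.linalg.norm(x_lasso.value - x_bpdn.value))   # expected to be small (up to solver tolerance)
```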

Why Convexity Matters Here

All three programs $(P_1)$, BPDN, and LASSO are convex β€” the objective and the constraint set are convex. Convexity guarantees that any local optimum is global, that the problem is solvable in polynomial time by interior-point methods, and that we can derive duality-based certificates of optimality (the KKT conditions in the next theorem). None of this holds for $(P_0)$.

Theorem: KKT Conditions for LASSO

A vector $\hat{\mathbf{z}}_{\text{LASSO}} \in \mathbb{R}^N$ is a LASSO minimizer if and only if there exists a subgradient $\mathbf{g} \in \partial\|\hat{\mathbf{z}}_{\text{LASSO}}\|_1$ such that
$$\mathbf{A}^{T}(\mathbf{y} - \mathbf{A}\,\hat{\mathbf{z}}_{\text{LASSO}}) = \lambda\, \mathbf{g},$$
where the subdifferential satisfies $g_i = \mathrm{sign}(\hat{x}_i)$ if $\hat{x}_i \neq 0$ and $g_i \in [-1, 1]$ otherwise, with $\hat{x}_i$ denoting the $i$-th entry of $\hat{\mathbf{z}}_{\text{LASSO}}$.

The KKT stationarity condition says that the correlation of each column of $\mathbf{A}$ with the residual is bounded in magnitude by $\lambda$, with equality on the support of $\hat{\mathbf{z}}_{\text{LASSO}}$. Columns whose correlation with the residual stays strictly below $\lambda$ are not selected; the threshold is exactly $\lambda$. This is the mechanism behind LASSO's variable selection.
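The theorem doubles as a numerical optimality certificate for any candidate solution. A minimal sketch (the helper name and tolerances are my own, not from the text):

```python
import numpy as np

def check_lasso_kkt(A, y, x_hat, lam, support_tol=1e-8, cert_tol=1e-4):
    """Check the LASSO KKT conditions: A^T (y - A x_hat) = lam * g for some subgradient g of ||.||_1."""
    corr = A.T @ (y - A @ x_hat)            # correlation of each column with the residual
    on_support = np.abs(x_hat) > support_tol
    # On the support: the correlation must equal lam * sign(x_hat_i).
    sign_ok = np.allclose(corr[on_support], lam * np.sign(x_hat[on_support]), atol=cert_tol)
    # Off the support: the correlation must be at most lam in magnitude.
    off_ok = np.all(np.abs(corr[~on_support]) <= lam + cert_tol)
    return bool(sign_ok and off_ok)
```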


Example: Orthogonal LASSO is Soft Thresholding

Consider the LASSO with orthonormal $\mathbf{A}$, i.e., $\mathbf{A}^{T}\mathbf{A} = \mathbf{I}_N$ (in particular $M = N$). Let $\tilde{\mathbf{x}} = \mathbf{A}^{T} \mathbf{y}$ be the least-squares solution. Derive a closed-form expression for $\hat{\mathbf{z}}_{\text{LASSO}}$.
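For reference β€” and because ISTA below reuses it β€” the derivation leads to coordinatewise soft thresholding, $\hat{x}_i = \mathrm{sign}(\tilde{x}_i)\max(|\tilde{x}_i| - \lambda, 0)$. A one-line NumPy sketch of the operator:

```python
import numpy as np

def soft_threshold(t, lam):
    """Coordinatewise soft-thresholding operator S_lam(t) = sign(t) * max(|t| - lam, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

# Orthonormal-design LASSO: apply S_lambda to the least-squares coefficients x_tilde = A^T y.
```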

Iterative Shrinkage-Thresholding Algorithm (ISTA)

Complexity: $O(KMN)$ per ISTA run; $O(1/K)$ suboptimality rate
Input: A (sensing matrix), y (observations), Ξ» (regularization parameter), L β‰₯ ||A^T A||_2 (Lipschitz constant; step size 1/L), K (iterations)
Output: xΜ‚ (LASSO estimate)
1: initialize x⁰ ← 0
2: for k = 0, 1, …, K-1 do
3: g ← A^T (A xᡏ βˆ’ y) # gradient of data-fit term
4: z ← xᡏ βˆ’ (1/L) g # gradient step
5: xᡏ⁺¹ ← S_{Ξ»/L}(z) # coordinatewise soft threshold
6: end for
7: return xΜ‚ = x^K

FISTA (Beck & Teboulle, 2009) accelerates ISTA to an $O(1/K^2)$ rate via Nesterov momentum. Both are parameter-free once $L$ is fixed, and they extend to large $N$ where interior-point LPs become too expensive.
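A minimal NumPy version of the algorithm above, with an optional flag for the FISTA momentum step (the function and argument names are my own, not from the text):

```python
import numpy as np

def soft_threshold(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def ista(A, y, lam, K=500, fista=False):
    """Minimize 0.5 ||y - A x||_2^2 + lam ||x||_1 by (optionally accelerated) proximal gradient steps."""
    L = np.linalg.norm(A.T @ A, 2)            # spectral norm of A^T A: Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0                       # extrapolation point and momentum scalar (used when fista=True)
    for _ in range(K):
        grad = A.T @ (A @ z - y)               # gradient of the data-fit term at z
        x_new = soft_threshold(z - grad / L, lam / L)   # gradient step followed by the l1 prox
        if fista:
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            z = x_new + ((t - 1.0) / t_new) * (x_new - x)
            t = t_new
        else:
            z = x_new
        x = x_new
    return x
```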

Geometry of $\ell_1$ vs $\ell_2$ Recovery

For $N = 2$ and a single linear constraint $a_1 x_1 + a_2 x_2 = y$, we draw the affine line of feasible solutions and the smallest $\ell_1$ (or $\ell_2$) ball that touches it. Vary the slope of the line and observe how the $\ell_1$ ball touches at a vertex (on an axis β†’ sparse solution) while the $\ell_2$ ball touches at a generic (dense) point.
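A static version of this picture is easy to reproduce. The sketch below uses Matplotlib with illustrative constraint coefficients, together with the closed forms $\min\|\mathbf{x}\|_1 = |y|/\|\mathbf{a}\|_\infty$ and $\min\|\mathbf{x}\|_2 = |y|/\|\mathbf{a}\|_2$ on the line $\mathbf{a}^{T}\mathbf{x} = y$.

```python
import numpy as np
import matplotlib.pyplot as plt

a1, a2, yval = 1.0, 0.6, 1.0                  # illustrative constraint a1*x1 + a2*x2 = yval
x1 = np.linspace(-2, 2, 400)
x2 = (yval - a1 * x1) / a2

r1 = abs(yval) / max(abs(a1), abs(a2))        # radius of the smallest l1 ball touching the line
r2 = abs(yval) / np.hypot(a1, a2)             # radius of the smallest l2 ball touching the line

theta = np.linspace(0, 2 * np.pi, 400)
diamond = np.array([[r1, 0], [0, r1], [-r1, 0], [0, -r1], [r1, 0]])

plt.plot(x1, x2, "k", label="feasible line")
plt.plot(diamond[:, 0], diamond[:, 1], label="smallest $\\ell_1$ ball")
plt.plot(r2 * np.cos(theta), r2 * np.sin(theta), label="smallest $\\ell_2$ ball")
plt.gca().set_aspect("equal")
plt.legend()
plt.show()
```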


LASSO Regularization Path

As $\lambda$ sweeps from large to small, the LASSO estimate traces a piecewise-linear path: more coefficients enter the active set as the penalty weakens. As $\lambda \to 0$ we recover the least-squares (or minimum-norm) solution; for $\lambda \geq \lambda_{\max} = \|\mathbf{A}^{T}\mathbf{y}\|_\infty$ we get $\hat{\mathbf{z}}_{\text{LASSO}} = \mathbf{0}$.
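One way to trace this path numerically is scikit-learn's lasso_path (which again uses the $1/(2M)$-scaled objective, so its alphas correspond to $\lambda/M$); a minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
M, N = 30, 10
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[[1, 4, 7]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.05 * rng.standard_normal(M)

# alphas are returned from large to small; coefs has shape (N, n_alphas).
alphas, coefs, _ = lasso_path(A, y, n_alphas=50)
active = (np.abs(coefs) > 1e-8).sum(axis=0)
print(list(zip(alphas.round(3), active)))   # the active set grows as the penalty weakens
```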


Why $\ell_1$ Balls Touch at Corners

Animation in $\mathbb{R}^2$: a random affine line is pushed from infinity toward the origin; the $\ell_1$ and $\ell_2$ balls inflate until they touch it. The $\ell_1$ ball almost always makes contact at a vertex, yielding a coordinate-axis (sparse) solution.

Least Squares vs Ridge vs LASSO

| Estimator | Objective | Promotes | Closed form? | Behavior as $\lambda \to \infty$ |
| --- | --- | --- | --- | --- |
| Least squares (LS) | $\lVert\mathbf{y} - \mathbf{A}\mathbf{x}\rVert_2^2$ | Low residual | Yes (if $M \geq N$) | N/A |
| Ridge (Tikhonov) | $\lVert\mathbf{y} - \mathbf{A}\mathbf{x}\rVert_2^2 + \lambda\lVert\mathbf{x}\rVert_2^2$ | Small $\ell_2$ norm | Yes: $(\mathbf{A}^{T}\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{y}$ | $\hat{\mathbf{x}} \to \mathbf{0}$ smoothly |
| LASSO | $\lVert\mathbf{y} - \mathbf{A}\mathbf{x}\rVert_2^2 + \lambda\lVert\mathbf{x}\rVert_1$ | Sparsity | Only if $\mathbf{A}$ orthogonal (soft thresholding) | Hits $\mathbf{0}$ at finite $\lambda_{\max} = \lVert\mathbf{A}^{T}\mathbf{y}\rVert_\infty$ |

Quick Check

Which of the following is true about the LASSO objective $\frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1$?

It is strictly convex in $\mathbf{x}$ even when $\mathbf{A}$ has a nontrivial null space.

It is convex but not everywhere differentiable.

It has a unique minimizer for every $\lambda > 0$.

It is equivalent to ridge regression with the same $\lambda$.

Quick Check

In the orthonormal-design LASSO, what happens to $\hat{x}_i$ when the least-squares coefficient satisfies $|\tilde{x}_i| < \lambda$?

$\hat{x}_i = 0$.

$\hat{x}_i = \tilde{x}_i$.

$\hat{x}_i = \lambda$.

$\hat{x}_i = \tilde{x}_i - \lambda$.

Common Mistake: $\ell_p$ for $p < 1$ is Non-Convex

Mistake:

Believing that $\ell_{0.5}$ minimization is "more sparsity-promoting" and therefore better than $\ell_1$.

Correction:

The $\ell_p$ pseudo-norms with $p < 1$ do promote sparsity more aggressively, but their optimization is non-convex and NP-hard in general. $\ell_1$ is the unique $\ell_p$ norm that is both convex and sparsity-promoting. Iteratively reweighted $\ell_1$ methods approximate $\ell_p$-type objectives via a sequence of convex $\ell_1$ problems.
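A minimal sketch of one common reweighting scheme, $w_i \leftarrow 1/(|x_i| + \epsilon)$, written with CVXPY; the number of rounds and the value of $\epsilon$ are illustrative choices, not prescriptions from the text.

```python
import numpy as np
import cvxpy as cp

def reweighted_l1(A, y, n_rounds=5, eps=1e-3):
    """Approximate an l_p-style (p < 1) penalty by a sequence of weighted Basis Pursuit problems."""
    N = A.shape[1]
    w = np.ones(N)                              # round 0 reduces to plain Basis Pursuit
    x = cp.Variable(N)
    for _ in range(n_rounds):
        cp.Problem(cp.Minimize(cp.norm1(cp.multiply(w, x))), [A @ x == y]).solve()
        w = 1.0 / (np.abs(x.value) + eps)       # upweight coordinates that are already small
    return x.value
```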

Historical Note: LASSO in Statistics, Basis Pursuit in Signal Processing

1996-1998

LASSO was introduced by Tibshirani in 1996 as a regression technique combining variable selection and shrinkage. Independently, Chen, Donoho, and Saunders introduced Basis Pursuit in 1998 for signal decomposition over overcomplete dictionaries. The two communities β€” statistics and signal processing β€” developed the same idea with different names and motivations. Compressed sensing (2004–2006) united them by proving sharp recovery guarantees under the RIP.


Key Takeaway

The $\ell_1$ norm is the convex envelope of $\ell_0$ on the unit $\ell_\infty$ ball. Its geometry (pointed vertices on the coordinate axes) makes sparse solutions generic under random affine slicing, and its convexity makes optimization tractable. Basis Pursuit (exact), BPDN (ball-constrained), and LASSO (Lagrangian) are three closely related views of the same recovery idea, and the KKT conditions give a computable certificate of optimality.