Basis Pursuit and LASSO

From $\ell_0$ to $\ell_1$: A Tractable Surrogate

The $\ell_0$ "norm" is a discrete, non-convex objective, and minimizing it directly is computationally hopeless at scale, so we need a convex surrogate. Among the $\ell_p$ norms with $p \in [0, \infty]$, the $\ell_1$ norm is the unique convex choice that still promotes sparsity: it is the convex envelope of $\|\cdot\|_0$ on the unit $\ell_\infty$ ball. Geometrically, the $\ell_1$ unit ball has corners exactly on the coordinate axes, and a generic random affine constraint touches those corners first. Algorithmically, $\|\cdot\|_1$ is a sum of absolute values β€” a linear program after variable splitting. The $\ell_1$ relaxation is not a mere computational trick: we will prove in Section 13.4 that, under structural conditions on $\mathbf{A}$, $\ell_1$ recovery coincides with $\ell_0$ recovery.

Definition:

Basis Pursuit (BP)

Given $\mathbf{A} \in \mathbb{R}^{M \times N}$ and $\mathbf{y} \in \mathbb{R}^M$, Basis Pursuit is the convex optimization problem
$$(P_1)\qquad \min_{\mathbf{x} \in \mathbb{R}^N} \|\mathbf{x}\|_1 \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{y}.$$
This is the noiseless $\ell_1$ relaxation of $\ell_0$ minimization. It becomes a linear program after introducing the split $\mathbf{x} = \mathbf{x}^+ - \mathbf{x}^-$ with $\mathbf{x}^+, \mathbf{x}^- \geq \mathbf{0}$.
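To make the variable-splitting reformulation concrete, here is a minimal sketch that solves $(P_1)$ with SciPy's general-purpose LP solver; the data are randomly generated for illustration and are not part of the text.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 s.t. Ax = y via the LP split x = x_plus - x_minus, with x_plus, x_minus >= 0."""
    M, N = A.shape
    c = np.ones(2 * N)                    # sum(x_plus) + sum(x_minus) equals ||x||_1 at the optimum
    A_eq = np.hstack([A, -A])             # enforces A (x_plus - x_minus) = y
    res = linprog(c, A_eq=A_eq, b_eq=y)   # linprog's default bounds are x >= 0, as required
    return res.x[:N] - res.x[N:]

# Tiny sanity check: a 1-sparse vector from M = 10 random Gaussian measurements in N = 30 dimensions.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30))
x_true = np.zeros(30)
x_true[4] = 2.0
x_hat = basis_pursuit(A, A @ x_true)
print(np.max(np.abs(x_hat - x_true)))     # recovery error (expected to be tiny here)
```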

Definition:

Basis Pursuit Denoising (BPDN)

When the measurements are noisy, $\mathbf{y} = \mathbf{A}\mathbf{x}^\star + \mathbf{w}$ with $\|\mathbf{w}\|_2 \leq \eta$, the equality constraint is relaxed to a ball:
$$(P_{1,\eta})\qquad \min_{\mathbf{x} \in \mathbb{R}^N} \|\mathbf{x}\|_1 \quad \text{subject to} \quad \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2 \leq \eta.$$
This is a second-order cone program (SOCP).
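Because $(P_{1,\eta})$ is an SOCP, an off-the-shelf conic modeling tool handles it directly. A minimal sketch with the CVXPY library follows; the problem sizes, noise level, and variable names are illustrative assumptions, not part of the text.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
M, N, eta = 40, 100, 0.1
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(M)   # noise scaled so that ||w||_2 <= eta

x = cp.Variable(N)
problem = cp.Problem(cp.Minimize(cp.norm1(x)),
                     [cp.norm(A @ x - y, 2) <= eta])
problem.solve()                                   # dispatched to a conic (SOCP) solver
print(problem.status, np.linalg.norm(x.value - x_true))
```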

Definition:

LASSO (Least Absolute Shrinkage and Selection Operator)

The LASSO estimator is the Lagrangian form
$$\hat{\mathbf{z}}_{\text{LASSO}} = \arg\min_{\mathbf{x} \in \mathbb{R}^N} \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1,$$
where $\lambda > 0$ is a regularization parameter. Equivalently, by convex duality, the LASSO solves the constrained problem
$$\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 \quad \text{subject to} \quad \|\mathbf{x}\|_1 \leq \tau$$
for a corresponding $\tau = \tau(\lambda)$.
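In practice the Lagrangian form is what most libraries implement. A minimal sketch with scikit-learn's Lasso follows; note that scikit-learn minimizes $\frac{1}{2M}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \alpha\|\mathbf{x}\|_1$, so $\alpha = \lambda / M$ matches the objective above. The data here are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
M, N, lam = 50, 200, 0.5
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[[5, 60, 150]] = [1.5, -1.0, 2.0]
y = A @ x_true + 0.01 * rng.standard_normal(M)

# scikit-learn's objective is (1/(2M)) ||y - Ax||_2^2 + alpha ||x||_1, hence alpha = lambda / M.
model = Lasso(alpha=lam / M, fit_intercept=False, max_iter=50_000)
model.fit(A, y)
print(np.flatnonzero(model.coef_))   # indices of the selected (nonzero) coefficients
```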

The three problems BP, BPDN, and LASSO are related but not identical: they trade off data fit against sparsity in different ways. They are connected through their solutions: if $\hat{\mathbf{z}}_{\text{LASSO}}$ solves the LASSO for some $\lambda > 0$ and has residual $\eta = \|\mathbf{y} - \mathbf{A}\hat{\mathbf{z}}_{\text{LASSO}}\|_2$, then it also solves BPDN with that $\eta$; conversely, under mild conditions every BPDN solution solves the LASSO for some $\lambda$. The correspondence between $\lambda$ and $\eta$, however, depends on the data and is not known in advance.
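This correspondence can be checked numerically: solve the LASSO for some $\lambda$, read off its residual $\eta$, and confirm that BPDN with that $\eta$ returns (essentially) the same point. A sketch using CVXPY with illustrative data, assuming the LASSO solution is unique:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
M, N, lam = 30, 80, 0.5
A = rng.standard_normal((M, N))
y = A @ np.concatenate([rng.standard_normal(4), np.zeros(N - 4)]) + 0.05 * rng.standard_normal(M)

# LASSO with a fixed lambda.
x_lasso = cp.Variable(N)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - A @ x_lasso) + lam * cp.norm1(x_lasso))).solve()
eta = np.linalg.norm(y - A @ x_lasso.value)

# BPDN with eta taken from the LASSO residual.
x_bpdn = cp.Variable(N)
cp.Problem(cp.Minimize(cp.norm1(x_bpdn)), [cp.norm(y - A @ x_bpdn, 2) <= eta]).solve()

print(np.linalg.norm(x_lasso.value - x_bpdn.value))   # expected to be small (up to solver tolerance)
```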

Why Convexity Matters Here

All three programs $(P_1)$, BPDN, and LASSO are convex β€” the objective and the constraint set are convex. Convexity guarantees that any local optimum is global, that the problem is solvable in polynomial time by interior-point methods, and that we can derive duality-based certificates of optimality (the KKT conditions in the next theorem). None of this holds for $(P_0)$.

Theorem: KKT Conditions for LASSO

A vector $\hat{\mathbf{z}}_{\text{LASSO}} \in \mathbb{R}^N$ is a LASSO minimizer if and only if there exists a subgradient $\mathbf{g} \in \partial\|\hat{\mathbf{z}}_{\text{LASSO}}\|_1$ such that
$$\mathbf{A}^{T}(\mathbf{y} - \mathbf{A}\,\hat{\mathbf{z}}_{\text{LASSO}}) = \lambda\, \mathbf{g},$$
where the subdifferential satisfies $g_i = \mathrm{sign}(\hat{x}_i)$ if $\hat{x}_i \neq 0$ and $g_i \in [-1, 1]$ otherwise, with $\hat{x}_i$ denoting the $i$-th entry of $\hat{\mathbf{z}}_{\text{LASSO}}$.

The KKT stationarity condition says that the correlation of each column of $\mathbf{A}$ with the residual is bounded in magnitude by $\lambda$, with equality on the support of $\hat{\mathbf{z}}_{\text{LASSO}}$. Columns whose correlation with the residual stays strictly below $\lambda$ are not selected; the threshold is exactly $\lambda$. This is the mechanism behind LASSO's variable selection.
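The theorem doubles as a numerical optimality certificate for any candidate solution. A minimal sketch (the helper name and tolerances are my own, not from the text):

```python
import numpy as np

def check_lasso_kkt(A, y, x_hat, lam, support_tol=1e-8, cert_tol=1e-4):
    """Check the LASSO KKT conditions: A^T (y - A x_hat) = lam * g for some subgradient g of ||.||_1."""
    corr = A.T @ (y - A @ x_hat)            # correlation of each column with the residual
    on_support = np.abs(x_hat) > support_tol
    # On the support: the correlation must equal lam * sign(x_hat_i).
    sign_ok = np.allclose(corr[on_support], lam * np.sign(x_hat[on_support]), atol=cert_tol)
    # Off the support: the correlation must be at most lam in magnitude.
    off_ok = np.all(np.abs(corr[~on_support]) <= lam + cert_tol)
    return bool(sign_ok and off_ok)
```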


Example: Orthogonal LASSO is Soft Thresholding

Consider the LASSO with orthonormal $\mathbf{A}$, i.e., $\mathbf{A}^{T}\mathbf{A} = \mathbf{I}_N$ (in particular $M = N$). Let $\tilde{\mathbf{x}} = \mathbf{A}^{T} \mathbf{y}$ be the least-squares solution. Derive a closed-form expression for $\hat{\mathbf{z}}_{\text{LASSO}}$.
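For reference β€” and because ISTA below reuses it β€” the derivation leads to coordinatewise soft thresholding, $\hat{x}_i = \mathrm{sign}(\tilde{x}_i)\max(|\tilde{x}_i| - \lambda, 0)$. A one-line NumPy sketch of the operator:

```python
import numpy as np

def soft_threshold(t, lam):
    """Coordinatewise soft-thresholding operator S_lam(t) = sign(t) * max(|t| - lam, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

# Orthonormal-design LASSO: apply S_lambda to the least-squares coefficients x_tilde = A^T y.
```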

Iterative Shrinkage-Thresholding Algorithm (ISTA)

Complexity: $O(KMN)$ per ISTA run; $O(1/K)$ suboptimality rate
Input: A (sensing matrix), y (observations), Ξ» (regularization parameter), L β‰₯ ||A^T A||_2 (Lipschitz constant; step size 1/L), K (iterations)
Output: xΜ‚ (LASSO estimate)
1: initialize x⁰ ← 0
2: for k = 0, 1, …, K-1 do
3: g ← A^T (A xᡏ βˆ’ y) # gradient of data-fit term
4: z ← xᡏ βˆ’ (1/L) g # gradient step
5: xᡏ⁺¹ ← S_{Ξ»/L}(z) # coordinatewise soft threshold
6: end for
7: return xΜ‚ = x^K

FISTA (Beck & Teboulle, 2009) accelerates ISTA to an $O(1/K^2)$ rate via Nesterov momentum. Both are parameter-free once $L$ is fixed, and they extend to large $N$ where interior-point LPs become too expensive.
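A minimal NumPy version of the algorithm above, with an optional flag for the FISTA momentum step (the function and argument names are my own, not from the text):

```python
import numpy as np

def soft_threshold(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def ista(A, y, lam, K=500, fista=False):
    """Minimize 0.5 ||y - A x||_2^2 + lam ||x||_1 by (optionally accelerated) proximal gradient steps."""
    L = np.linalg.norm(A.T @ A, 2)            # spectral norm of A^T A: Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0                       # extrapolation point and momentum scalar (used when fista=True)
    for _ in range(K):
        grad = A.T @ (A @ z - y)               # gradient of the data-fit term at z
        x_new = soft_threshold(z - grad / L, lam / L)   # gradient step followed by the l1 prox
        if fista:
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            z = x_new + ((t - 1.0) / t_new) * (x_new - x)
            t = t_new
        else:
            z = x_new
        x = x_new
    return x
```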

Geometry of $\ell_1$ vs $\ell_2$ Recovery

For $N = 2$ and a single linear constraint $a_1 x_1 + a_2 x_2 = y$, we draw the affine line of feasible solutions and the smallest $\ell_1$ (or $\ell_2$) ball that touches it. Vary the slope of the line and observe how the $\ell_1$ ball touches at a vertex (on an axis β†’ sparse solution) while the $\ell_2$ ball touches at a generic (dense) point.
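A static version of this picture is easy to reproduce. The sketch below uses Matplotlib with illustrative constraint coefficients, together with the closed forms $\min\|\mathbf{x}\|_1 = |y|/\|\mathbf{a}\|_\infty$ and $\min\|\mathbf{x}\|_2 = |y|/\|\mathbf{a}\|_2$ on the line $\mathbf{a}^{T}\mathbf{x} = y$.

```python
import numpy as np
import matplotlib.pyplot as plt

a1, a2, yval = 1.0, 0.6, 1.0                  # illustrative constraint a1*x1 + a2*x2 = yval
x1 = np.linspace(-2, 2, 400)
x2 = (yval - a1 * x1) / a2

r1 = abs(yval) / max(abs(a1), abs(a2))        # radius of the smallest l1 ball touching the line
r2 = abs(yval) / np.hypot(a1, a2)             # radius of the smallest l2 ball touching the line

theta = np.linspace(0, 2 * np.pi, 400)
diamond = np.array([[r1, 0], [0, r1], [-r1, 0], [0, -r1], [r1, 0]])

plt.plot(x1, x2, "k", label="feasible line")
plt.plot(diamond[:, 0], diamond[:, 1], label="smallest $\\ell_1$ ball")
plt.plot(r2 * np.cos(theta), r2 * np.sin(theta), label="smallest $\\ell_2$ ball")
plt.gca().set_aspect("equal")
plt.legend()
plt.show()
```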


LASSO Regularization Path

As $\lambda$ sweeps from large to small, the LASSO estimate traces a piecewise-linear path: more coefficients enter the active set as the penalty weakens. As $\lambda \to 0$ we recover the least-squares (or minimum-norm) solution; for $\lambda \geq \lambda_{\max} = \|\mathbf{A}^{T}\mathbf{y}\|_\infty$ we get $\hat{\mathbf{z}}_{\text{LASSO}} = \mathbf{0}$.
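One way to trace this path numerically is scikit-learn's lasso_path (which again uses the $1/(2M)$-scaled objective, so its alphas correspond to $\lambda/M$); a minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
M, N = 30, 10
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[[1, 4, 7]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.05 * rng.standard_normal(M)

# alphas are returned from large to small; coefs has shape (N, n_alphas).
alphas, coefs, _ = lasso_path(A, y, n_alphas=50)
active = (np.abs(coefs) > 1e-8).sum(axis=0)
print(list(zip(alphas.round(3), active)))   # the active set grows as the penalty weakens
```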


Why $\ell_1$ Balls Touch at Corners

Animation in $\mathbb{R}^2$: a random affine line is pushed from infinity toward the origin; the $\ell_1$ and $\ell_2$ balls inflate until they touch it. The $\ell_1$ ball almost always makes contact at a vertex, yielding a coordinate-axis (sparse) solution.

Least Squares vs Ridge vs LASSO

| Estimator | Objective | Promotes | Closed form? | Behavior as $\lambda \to \infty$ |
| --- | --- | --- | --- | --- |
| Least squares (LS) | $\lVert\mathbf{y} - \mathbf{A}\mathbf{x}\rVert_2^2$ | Low residual | Yes (if $M \geq N$) | N/A |
| Ridge (Tikhonov) | $\lVert\mathbf{y} - \mathbf{A}\mathbf{x}\rVert_2^2 + \lambda\lVert\mathbf{x}\rVert_2^2$ | Small $\ell_2$ norm | Yes: $(\mathbf{A}^{T}\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{y}$ | $\hat{\mathbf{x}} \to \mathbf{0}$ smoothly |
| LASSO | $\lVert\mathbf{y} - \mathbf{A}\mathbf{x}\rVert_2^2 + \lambda\lVert\mathbf{x}\rVert_1$ | Sparsity | Only if $\mathbf{A}$ orthogonal (soft thresholding) | Hits $\mathbf{0}$ at finite $\lambda_{\max} = \lVert\mathbf{A}^{T}\mathbf{y}\rVert_\infty$ |

Quick Check

Which of the following is true about the LASSO objective $\frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1$?

It is strictly convex in $\mathbf{x}$ even when $\mathbf{A}$ has a nontrivial null space.

It is convex but not everywhere differentiable.

It has a unique minimizer for every $\lambda > 0$.

It is equivalent to ridge regression with the same $\lambda$.

Quick Check

In the orthonormal-design LASSO, what happens to $\hat{x}_i$ when the least-squares coefficient satisfies $|\tilde{x}_i| < \lambda$?

$\hat{x}_i = 0$.

$\hat{x}_i = \tilde{x}_i$.

$\hat{x}_i = \lambda$.

$\hat{x}_i = \tilde{x}_i - \lambda$.

Common Mistake: $\ell_p$ for $p < 1$ is Non-Convex

Mistake:

Believing that $\ell_{0.5}$ minimization is "more sparsity-promoting" and therefore better than $\ell_1$.

Correction:

The $\ell_p$ pseudo-norms with $p < 1$ do promote sparsity more aggressively, but their optimization is non-convex and NP-hard in general. $\ell_1$ is the unique $\ell_p$ norm that is both convex and sparsity-promoting. Iteratively reweighted $\ell_1$ methods approximate $\ell_p$-type objectives via a sequence of convex $\ell_1$ problems.
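A minimal sketch of one common reweighting scheme, $w_i \leftarrow 1/(|x_i| + \epsilon)$, written with CVXPY; the number of rounds and the value of $\epsilon$ are illustrative choices, not prescriptions from the text.

```python
import numpy as np
import cvxpy as cp

def reweighted_l1(A, y, n_rounds=5, eps=1e-3):
    """Approximate an l_p-style (p < 1) penalty by a sequence of weighted Basis Pursuit problems."""
    N = A.shape[1]
    w = np.ones(N)                              # round 0 reduces to plain Basis Pursuit
    x = cp.Variable(N)
    for _ in range(n_rounds):
        cp.Problem(cp.Minimize(cp.norm1(cp.multiply(w, x))), [A @ x == y]).solve()
        w = 1.0 / (np.abs(x.value) + eps)       # upweight coordinates that are already small
    return x.value
```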

Historical Note: LASSO in Statistics, Basis Pursuit in Signal Processing

1996-1998

LASSO was introduced by Tibshirani in 1996 as a regression technique combining variable selection and shrinkage. Independently, Chen, Donoho, and Saunders introduced Basis Pursuit in 1998 for signal decomposition over overcomplete dictionaries. The two communities β€” statistics and signal processing β€” developed the same idea with different names and motivations. Compressed sensing (2004–2006) united them by proving sharp recovery guarantees under the RIP.


Key Takeaway

The $\ell_1$ norm is the convex envelope of $\ell_0$ on the unit $\ell_\infty$ ball. Its geometry (pointed vertices on the coordinate axes) makes sparse solutions generic under random affine slicing, and its convexity makes optimization tractable. Basis Pursuit (exact), BPDN (ball-constrained), and LASSO (Lagrangian) are three closely related views of the same recovery idea, and the KKT conditions give a computable certificate of optimality.