Basis Pursuit and LASSO
From $\ell_0$ to $\ell_1$: A Tractable Surrogate
The $\ell_0$ "norm" is a discrete, non-convex objective and therefore computationally hopeless at scale. We need a convex surrogate. Among the $\ell_p$ penalties with $p > 0$, the $\ell_1$ norm is the unique choice that is both convex and sparsity-promoting: its unit ball is the convex hull of the 1-sparse vectors in the $\ell_\infty$ unit ball (equivalently, $\|\cdot\|_1$ is the convex envelope of $\|\cdot\|_0$ on that ball). Geometrically, the $\ell_1$ ball has corners exactly on the coordinate axes, and a generic random affine constraint touches those corners first. Algorithmically, $\|x\|_1$ is a sum of absolute values, so minimizing it becomes a linear program after variable splitting. The relaxation is not a mere computational trick: we will prove in Section 13.4 that, under structural conditions on $A$, $\ell_1$ recovery coincides with $\ell_0$ recovery.
Definition: Basis Pursuit (BP)
Given $A \in \mathbb{R}^{m \times n}$ and measurements $y = Ax^\star \in \mathbb{R}^m$, Basis Pursuit is the convex optimization problem
$$\min_{x \in \mathbb{R}^n} \|x\|_1 \quad \text{subject to} \quad Ax = y.$$
This is the noiseless $\ell_1$ relaxation of $\ell_0$ minimization. It is a linear program after introducing the split $x = x^+ - x^-$ with $x^+, x^- \ge 0$.
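As a sanity check on the variable-splitting reduction, here is a minimal sketch using SciPy's `linprog` (the HiGHS backend and the dimensions `m, n, k` are assumptions made for illustration, not part of the definition):

```python
# Minimal sketch: Basis Pursuit as a linear program via the split x = u - v.
# Assumes SciPy (HiGHS backend); dimensions m, n, k are illustrative.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, k = 40, 80, 4
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                                # noiseless measurements

c = np.ones(2 * n)                            # objective: sum(u) + sum(v) = ||x||_1
A_eq = np.hstack([A, -A])                     # A u - A v = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_bp = res.x[:n] - res.x[n:]

print("BP recovery error:", np.linalg.norm(x_bp - x_true))
```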
Definition: Basis Pursuit Denoising (BPDN)
When the measurements are noisy, $y = Ax^\star + e$ with $\|e\|_2 \le \varepsilon$, the equality constraint is relaxed to an $\ell_2$ ball:
$$\min_{x \in \mathbb{R}^n} \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le \varepsilon.$$
This is a second-order cone program (SOCP).
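A hedged sketch of the SOCP form, using CVXPY as one possible modeling layer (its availability, the noise level `eps`, and the dimensions are assumptions for illustration):

```python
# Sketch: BPDN as an SOCP via CVXPY (assumed installed); eps is illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, k, eps = 40, 80, 4, 0.05
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
e = rng.standard_normal(m)
y = A @ x_true + eps * e / np.linalg.norm(e)  # noise scaled to ||e||_2 = eps

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm(A @ x - y, 2) <= eps])
prob.solve()                                  # dispatched to an SOCP-capable solver
print("BPDN recovery error:", np.linalg.norm(x.value - x_true))
```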
Definition: LASSO (Least Absolute Shrinkage and Selection Operator)
The LASSO estimator is the Lagrangian form
$$\hat{x}_\lambda \in \arg\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1,$$
where $\lambda > 0$ is a regularization parameter. Equivalently, by convex duality, the LASSO solves
$$\min_{x \in \mathbb{R}^n} \|Ax - y\|_2^2 \quad \text{subject to} \quad \|x\|_1 \le \tau$$
for a corresponding $\tau = \tau(\lambda)$.
The three problems BP, BPDN, and LASSO are related but not identical: they trade off data fit against sparsity differently. For any LASSO solution $\hat{x}_\lambda$ with residual $r = y - A\hat{x}_\lambda$, there is an $\varepsilon$ (namely $\varepsilon = \|r\|_2$) such that $\hat{x}_\lambda$ solves BPDN with that $\varepsilon$, and conversely.
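The correspondence can be illustrated numerically. The sketch below (again assuming CVXPY, with arbitrary sizes and $\lambda$) solves the LASSO, reads off its residual norm as $\varepsilon$, and re-solves BPDN with that $\varepsilon$; when the LASSO minimizer is unique, the two solutions agree up to solver tolerance.

```python
# Sketch: a LASSO solution also solves BPDN with eps set to its residual norm.
# Assumes CVXPY; sizes and lambda are arbitrary.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
m, n, lam = 40, 100, 0.1
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[:5] = rng.standard_normal(5)
y = A @ x_true + 0.01 * rng.standard_normal(m)

x1 = cp.Variable(n)                           # LASSO (Lagrangian form)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x1 - y) + lam * cp.norm1(x1))).solve()
eps = np.linalg.norm(A @ x1.value - y)        # residual norm of the LASSO solution

x2 = cp.Variable(n)                           # BPDN with that eps
cp.Problem(cp.Minimize(cp.norm1(x2)), [cp.norm(A @ x2 - y, 2) <= eps]).solve()
print("gap between LASSO and BPDN solutions:", np.linalg.norm(x1.value - x2.value))
```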
Why Convexity Matters Here
All three programs (BP, BPDN, LASSO) are convex: the objective and the constraint set are convex. Convexity guarantees that any local optimum is global, that the problem is tractable by interior-point methods in polynomial time, and that we can derive duality-based certificates of optimality (the KKT conditions in the next theorem). None of this holds for $\ell_0$ minimization.
Theorem: KKT Conditions for LASSO
A vector $\hat{x} \in \mathbb{R}^n$ is a LASSO minimizer if and only if there exists a subgradient $v \in \partial\|\hat{x}\|_1$ such that
$$A^\top (y - A\hat{x}) = \lambda v,$$
where the subdifferential satisfies $v_i = \operatorname{sign}(\hat{x}_i)$ if $\hat{x}_i \neq 0$ and $v_i \in [-1, 1]$ otherwise.
The LASSO stationarity condition says that the correlation of each column $a_i$ with the residual $y - A\hat{x}$ is bounded by $\lambda$ in absolute value, with equality on the support of $\hat{x}$. Columns whose correlation stays below $\lambda$ are not selected; the selection threshold is exactly $\lambda$. This is the mechanism behind LASSO's variable selection.
Set up the subgradient condition
The LASSO objective $f(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$ is convex. By the convex optimality condition, $\hat{x}$ is a minimizer iff $0 \in \partial f(\hat{x})$.
Compute the subdifferential
The first term is differentiable with gradient $A^\top(Ax - y)$. The second term has subdifferential $\lambda\,\partial\|x\|_1$, where $(\partial\|x\|_1)_i = \{\operatorname{sign}(x_i)\}$ for $x_i \neq 0$ and $(\partial\|x\|_1)_i = [-1, 1]$ for $x_i = 0$.
Combine
Thus $0 \in \partial f(\hat{x})$ iff there is $v \in \partial\|\hat{x}\|_1$ satisfying the stated identity. Equivalently, letting $r = y - A\hat{x}$ be the residual,
$$|a_i^\top r| \le \lambda \ \text{ for all } i, \qquad a_i^\top r = \lambda \operatorname{sign}(\hat{x}_i) \ \text{ whenever } \hat{x}_i \neq 0.$$
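These stationarity conditions can be verified numerically at any candidate minimizer. The sketch below uses the orthogonal-design closed form derived in the next example (here $A = I$, where the minimizer is the soft threshold of $y$) so that the candidate is exactly optimal; only NumPy is assumed.

```python
# Sketch: numerically verify the KKT conditions at an exactly-known minimizer.
# With A = I the LASSO minimizer is the soft threshold of y (next example).
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 0.5
A = np.eye(n)
y = rng.standard_normal(n)

x_hat = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)   # soft threshold of y
corr = A.T @ (y - A @ x_hat)                            # column-residual correlations

on = x_hat != 0
assert np.allclose(corr[on], lam * np.sign(x_hat[on]))  # equality on the support
assert np.all(np.abs(corr[~on]) <= lam + 1e-12)         # bounded off the support
print("KKT conditions hold at the soft-thresholded point")
```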
Example: Orthogonal LASSO is Soft Thresholding
Consider the LASSO with orthonormal $A$, i.e., $A^\top A = I_n$ (in particular $\|Ax\|_2 = \|x\|_2$ for all $x$). Let $\tilde{x} = A^\top y$ be the least-squares solution. Derive a closed-form expression for the LASSO minimizer $\hat{x}_\lambda$.
Reduce to a separable problem
With $A^\top A = I_n$, expand:
$$\tfrac{1}{2}\|Ax - y\|_2^2 = \tfrac{1}{2}\|x\|_2^2 - x^\top A^\top y + \tfrac{1}{2}\|y\|_2^2 = \tfrac{1}{2}\|x - \tilde{x}\|_2^2 + \text{const}.$$
Dropping the constant and adding the penalty, the objective decouples coordinatewise:
$$\sum_{i=1}^{n} \Big( \tfrac{1}{2}(x_i - \tilde{x}_i)^2 + \lambda |x_i| \Big).$$
Solve each scalar problem
For each $i$, minimize $g(t) = \tfrac{1}{2}(t - \tilde{x}_i)^2 + \lambda|t|$. Its subgradient is $t - \tilde{x}_i + \lambda\operatorname{sign}(t)$ for $t \neq 0$ and $-\tilde{x}_i + \lambda[-1, 1]$ at $t = 0$. Setting it to zero gives
$$\hat{x}_i = S_\lambda(\tilde{x}_i) := \operatorname{sign}(\tilde{x}_i)\,\max(|\tilde{x}_i| - \lambda,\, 0).$$
Interpretation
The soft-thresholding operator $S_\lambda$ shrinks each coordinate toward zero by $\lambda$ and kills those smaller than $\lambda$ in magnitude. Orthogonal LASSO is simply coordinatewise soft thresholding; no iterative algorithm is needed. This operator is the building block of the iterative shrinkage-thresholding algorithm (ISTA) for general $A$.
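The scalar closed form can be sanity-checked against brute-force grid minimization of $\tfrac{1}{2}(t - \tilde{x}_i)^2 + \lambda|t|$; the grid resolution and test values below are arbitrary choices for illustration.

```python
# Sketch: check the scalar closed form against brute-force grid minimization.
import numpy as np

lam = 0.7
grid = np.linspace(-4, 4, 400001)                         # fine grid over [-4, 4]
for xt in [-2.0, -0.5, 0.3, 1.5]:
    t_star = grid[np.argmin(0.5 * (grid - xt) ** 2 + lam * np.abs(grid))]
    closed = np.sign(xt) * max(abs(xt) - lam, 0.0)        # soft threshold S_lam(xt)
    print(f"xt={xt:+.2f}  grid argmin={t_star:+.4f}  closed form={closed:+.4f}")
```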
Iterative Shrinkage-Thresholding Algorithm (ISTA)
Complexity: $O(mn)$ per ISTA iteration, with an $O(1/k)$ suboptimality rate after $k$ iterations. FISTA (Beck & Teboulle, 2009) accelerates ISTA to $O(1/k^2)$ via Nesterov momentum. Both are essentially parameter-free once $\lambda$ is fixed, and they extend to large $n$ where interior-point LPs become too expensive.
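A minimal ISTA sketch (NumPy only; the step size $1/L$ with $L = \|A\|_2^2$, the fixed iteration count, and the problem sizes are choices made for illustration):

```python
# Minimal ISTA sketch for the LASSO. Assumptions: NumPy only, step size 1/L
# with L = ||A||_2^2, and a fixed iteration count instead of a stopping rule.
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                  # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L) # proximal (shrinkage) step
    return x

rng = np.random.default_rng(0)
m, n, k, lam = 50, 200, 5, 0.05
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[:k] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(m)

x_hat = ista(A, y, lam)
print("estimation error:", np.linalg.norm(x_hat - x_true))
```

Swapping the plain iteration for the FISTA momentum update gives the accelerated $O(1/k^2)$ variant at the same per-iteration cost.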
Geometry of $\ell_1$ vs $\ell_2$ Recovery
For $n = 2$ and a single linear constraint $Ax = y$, we draw the affine line of feasible solutions and the smallest $\ell_1$ (or $\ell_2$) ball that touches it. Move the slope of the line and observe how the $\ell_1$ ball touches at a vertex (on an axis, i.e., a sparse solution) while the $\ell_2$ ball touches at a generic (dense) point.
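In place of the interactive figure, the same contrast can be reproduced numerically (a sketch assuming SciPy; the constraint coefficients are arbitrary): the minimum-$\ell_2$ point on the feasible line is the pseudoinverse solution and is generically dense, while the minimum-$\ell_1$ point lands on a coordinate axis.

```python
# Sketch: on the feasible line a^T x = y0 in R^2, the min-l2 point is dense
# while the min-l1 point is sparse. Assumes SciPy; coefficients are arbitrary.
import numpy as np
from scipy.optimize import linprog

a = np.array([[1.0, 2.0]])                    # one constraint; slope is arbitrary
y0 = np.array([1.0])

x_l2 = np.linalg.pinv(a) @ y0                 # min ||x||_2 s.t. a x = y0
res = linprog(np.ones(4), A_eq=np.hstack([a, -a]), b_eq=y0,
              bounds=(0, None), method="highs")
x_l1 = res.x[:2] - res.x[2:]                  # min ||x||_1 via splitting

print("l2-minimal point:", x_l2)              # both coordinates nonzero
print("l1-minimal point:", x_l1)              # one coordinate exactly zero
```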
LASSO Regularization Path
As $\lambda$ sweeps from large to small, the LASSO estimate $\hat{x}_\lambda$ traces a piecewise-linear path: more coefficients enter the active set as the penalty weakens. At $\lambda \to 0$ we recover the least-squares (or minimum-norm) solution; at $\lambda \ge \|A^\top y\|_\infty$ we get $\hat{x}_\lambda = 0$.
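One way to trace the path numerically is scikit-learn's `lasso_path` (its availability is an assumption; note that scikit-learn scales the data-fit term by $1/(2m)$, so its `alpha` corresponds to $\lambda/m$ in our notation):

```python
# Sketch of the regularization path with scikit-learn's lasso_path (assumed
# installed). scikit-learn minimizes (1/(2m))||y - Ax||^2 + alpha ||x||_1,
# so alpha plays the role of lambda / m here.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
m, n, k = 60, 150, 5
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:k] = rng.standard_normal(k)
y = A @ x_true + 0.1 * rng.standard_normal(m)

alphas, coefs, _ = lasso_path(A, y, n_alphas=50)   # alphas in decreasing order
for alpha, c in zip(alphas[::10], coefs.T[::10]):
    print(f"alpha={alpha:8.4f}  active coefficients: {np.count_nonzero(c)}")
```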
Why $\ell_1$ Balls Touch at Corners
Least Squares vs Ridge vs LASSO
| Estimator | Objective | Promotes | Closed form? | Behavior as $\lambda \to \infty$ |
|---|---|---|---|---|
| Least squares (LS) | $\min_x \Vert Ax - y\Vert_2^2$ | Low residual | Yes (if $A$ has full column rank) | N/A |
| Ridge (Tikhonov) | $\min_x \Vert Ax - y\Vert_2^2 + \lambda\Vert x\Vert_2^2$ | Small $\ell_2$ norm | Yes: $\hat{x} = (A^\top A + \lambda I)^{-1}A^\top y$ | $\hat{x} \to 0$ smoothly |
| LASSO | $\min_x \tfrac{1}{2}\Vert Ax - y\Vert_2^2 + \lambda\Vert x\Vert_1$ | Sparsity | Only if $A$ orthogonal (soft thresh.) | Hits $0$ at finite $\lambda$ |
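The last column is easy to see in the orthogonal case. In the sketch below (with $A = I$ and arbitrary data), ridge rescales every coefficient by $1/(1+\lambda)$ and never produces exact zeros, while soft thresholding does.

```python
# Sketch (A = I): ridge shrinks every coefficient by 1/(1 + lam) and never
# hits zero; the LASSO soft threshold produces exact zeros at finite lam.
import numpy as np

y = np.array([3.0, 1.2, 0.4, -0.2, -2.5])
for lam in [0.5, 1.0, 5.0]:
    ridge = y / (1.0 + lam)                                   # closed form (1 + lam)^{-1} y
    lasso = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)     # soft threshold
    print(f"lam={lam}: ridge zeros={np.count_nonzero(ridge == 0)}, "
          f"lasso zeros={np.count_nonzero(lasso == 0)}")
```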
Quick Check
Which of the following is true about the LASSO objective $f(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$?
It is strictly convex in $x$ even when $A$ has a nontrivial null space.
It is convex but not everywhere differentiable.
It has a unique minimizer for every $A$ and $y$.
It is equivalent to ridge regression with the same $\lambda$.
$f$ is convex but non-smooth on the coordinate hyperplanes $\{x : x_i = 0\}$; the KKT conditions therefore require subgradients.
Quick Check
In the orthonormal-design LASSO, what happens to $\hat{x}_i$ when the least-squares coefficient satisfies $|\tilde{x}_i| \le \lambda$?
$\hat{x}_i = 0$.
$\hat{x}_i = \tilde{x}_i$.
$\hat{x}_i = \tilde{x}_i - \lambda$.
$\hat{x}_i = \lambda\,\operatorname{sign}(\tilde{x}_i)$.
Soft thresholding sets to zero any coordinate whose magnitude does not exceed $\lambda$: if $|\tilde{x}_i| \le \lambda$, then $\hat{x}_i = S_\lambda(\tilde{x}_i) = 0$.
Common Mistake: $\ell_p$ for $p < 1$ is Non-Convex
Mistake:
Believing that $\ell_p$ minimization with $p < 1$ is "more sparsity-promoting" and therefore better than $\ell_1$.
Correction:
$\ell_p$ pseudo-norms with $0 < p < 1$ do promote sparsity more aggressively, but their optimization is non-convex and NP-hard in general. $\ell_1$ is the unique norm that is both convex and sparsity-promoting. Iteratively reweighted $\ell_1$ methods approximate such non-convex $\ell_p$-type objectives via a sequence of convex problems.
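A hedged sketch of one such scheme, with weights $w_i = 1/(|x_i| + \delta)$ and each round a convex weighted basis-pursuit problem (CVXPY is assumed available; $\delta$ and the number of rounds are arbitrary choices):

```python
# Sketch of iteratively reweighted l1: each round is a convex weighted
# basis-pursuit problem; weights 1/(|x_i| + delta) mimic an lp-type penalty.
# Assumes CVXPY; delta and the number of rounds are arbitrary choices.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, k, delta = 25, 80, 4, 1e-3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

w = np.ones(n)
for _ in range(4):                             # a few reweighting rounds
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.norm1(cp.multiply(w, x))), [A @ x == y]).solve()
    w = 1.0 / (np.abs(x.value) + delta)        # emphasize small coordinates next round

print("recovery error after reweighting:", np.linalg.norm(x.value - x_true))
```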
Historical Note: LASSO in Statistics, Basis Pursuit in Signal Processing
1996–1998. LASSO was introduced by Tibshirani in 1996 as a regression technique combining variable selection and shrinkage. Independently, Chen, Donoho, and Saunders introduced Basis Pursuit in 1998 for signal decomposition over overcomplete dictionaries. The two communities, statistics and signal processing, developed the same idea under different names and with different motivations. Compressed sensing (2004–2006) united them by proving sharp recovery guarantees under the RIP.
Key Takeaway
The $\ell_1$ norm is the convex envelope of $\ell_0$ on the $\ell_\infty$ unit ball. Its geometry (pointed vertices on the coordinate axes) makes sparse solutions generic under random affine slicing, and its convexity makes optimization tractable. Basis Pursuit (exact constraint), BPDN ($\ell_2$-ball constraint), and LASSO (Lagrangian penalty) are three equivalent views of the same recovery idea, and the KKT conditions give a computable certificate of optimality.