ISTA and FISTA
Why We Need Dedicated Sparse Solvers
In Chapter 13 we formulated the LASSO problem and argued that it is a convex relaxation of the combinatorial $\ell_0$ problem. Good, but an off-the-shelf interior-point solver scales as $O(N^3)$ per iteration, which is hopeless once $N$ reaches the tens of thousands or more. Realistic compressed-sensing dictionaries are enormous: an $M \times N$ sensing matrix with $M$ in the tens of thousands and $N$ in the millions is routine in imaging and massive-connectivity problems. We need first-order methods that touch $A$ only through the matrix-vector products $Av$ and $A^\top r$. This section develops two such methods, ISTA and its accelerated cousin FISTA, derives them from the proximal-gradient perspective, and proves their convergence rates. Both exploit convexity: the objective is convex, so every local minimum is global, and every monotone descent strategy succeeds.
Definition: Soft-Threshold Operator
The soft-threshold operator with threshold $\tau > 0$ is
$$S_\tau(x) \;=\; \operatorname{sign}(x)\,\max(|x| - \tau,\, 0).$$
Applied componentwise to a vector $x \in \mathbb{R}^N$, we write $S_\tau(x)$.
The name "soft" distinguishes it from the hard-threshold which keeps or kills components but never shrinks them. Soft-thresholding is continuous; hard-thresholding is not.
Theorem: Proximal Operator of the $\ell_1$ Norm
For any $v \in \mathbb{R}^N$ and $\tau > 0$,
$$\operatorname{prox}_{\tau\|\cdot\|_1}(v) \;=\; \arg\min_{x}\;\tfrac{1}{2}\|x - v\|_2^2 + \tau\|x\|_1 \;=\; S_\tau(v).$$
The proximal operator asks: what vector is close to $v$ (in $\ell_2$) and small (in $\ell_1$)? The answer decouples componentwise, and for each coordinate it is a one-dimensional shrinkage. We do not get "almost sparse" vectors: the operator sets coordinates exactly to zero whenever $|v_i| \le \tau$.
Separability
Both $\tfrac{1}{2}\|x - v\|_2^2$ and $\tau\|x\|_1$ separate over coordinates, so the minimizer is obtained by solving the 1-D problem $\min_{x_i} \tfrac{1}{2}(x_i - v_i)^2 + \tau|x_i|$ for each $i$.
Subgradient optimality
The 1-D objective $\phi(x_i) = \tfrac{1}{2}(x_i - v_i)^2 + \tau|x_i|$ is convex. Its subdifferential is $\partial\phi(x_i) = x_i - v_i + \tau\,\partial|x_i|$, where $\partial|x_i| = \{\operatorname{sign}(x_i)\}$ for $x_i \neq 0$ and $\partial|x_i| = [-1, 1]$ for $x_i = 0$. Setting $0 \in \partial\phi(x_i)$: for $x_i \neq 0$ this gives $x_i = v_i - \tau\operatorname{sign}(x_i)$, requiring $|v_i| > \tau$; otherwise $x_i = 0$ needs $|v_i| \le \tau$. Combining both cases gives $x_i = S_\tau(v_i)$.
Assemble componentwise
Since the separable 1-D solutions are $x_i = S_\tau(v_i)$, the vector minimizer is $S_\tau$ applied componentwise.
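As a quick sanity check (not part of the proof), one can minimize the 1-D objective on a fine grid for a few values of $v_i$ and compare with the closed form. The NumPy sketch below, with illustrative values of my own, does exactly that.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

tau = 0.7
grid = np.linspace(-3.0, 3.0, 600001)          # fine grid over [-3, 3]
for vi in [1.5, -0.3, 0.7, -2.0]:
    objective = 0.5 * (grid - vi) ** 2 + tau * np.abs(grid)
    brute = grid[np.argmin(objective)]          # brute-force 1-D minimizer
    closed = soft_threshold(vi, tau)            # closed-form S_tau(v_i)
    print(f"v_i = {vi:+.1f}:  grid argmin = {brute:+.4f},  S_tau = {closed:+.4f}")
```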
The Proximal-Gradient View of LASSO
Split the objective as $F(x) = f(x) + g(x)$, where $f(x) = \tfrac{1}{2}\|Ax - y\|_2^2$ is smooth and convex with Lipschitz gradient $\nabla f(x) = A^\top(Ax - y)$, and $g(x) = \lambda\|x\|_1$ is convex but non-smooth. Proximal gradient methods alternate a gradient step on $f$ with a proximal step on $g$:
$$x^{(k+1)} \;=\; \operatorname{prox}_{\eta g}\!\big(x^{(k)} - \eta\,\nabla f(x^{(k)})\big).$$
For the LASSO, the proximal step is soft-thresholding with threshold $\eta\lambda$, and we have just derived ISTA.
ISTA (Iterative Shrinkage-Thresholding Algorithm)
Complexity: $O(MN)$ per iteration for the two mat-vec products $Ax^{(k)}$ and $A^\top r$. Convergence: $O(1/k)$ on the objective value. The threshold in the shrinkage step is $\eta\lambda$: the product of the step size and the regularization weight. If $A$ has orthonormal columns ($A^\top A = I$) and we pick $\eta = 1$, one ISTA step reaches the exact LASSO solution.
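Below is a minimal NumPy sketch of ISTA under these assumptions ($\eta = 1/L$, threshold $\eta\lambda$); the function names, the stopping rule (a fixed iteration count), and the toy data are my own choices, not a reference implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, n_iters=500):
    L = np.linalg.norm(A, 2) ** 2                    # spectral norm squared = lambda_max(A^T A)
    eta = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)                     # gradient of 0.5 * ||Ax - y||^2
        x = soft_threshold(x - eta * grad, eta * lam)  # prox step = shrinkage
    return x

# Toy usage: a 3-sparse vector observed through 50 noisy random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200)) / np.sqrt(50)
x_true = np.zeros(200)
x_true[[3, 17, 80]] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ista(A, y, lam=0.1, n_iters=2000)
print("indices of large coefficients:", np.flatnonzero(np.abs(x_hat) > 0.05))
```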
Theorem: ISTA Convergence Rate
Let $F = f + g$ with $f$ convex and differentiable with $L$-Lipschitz gradient, and $g$ convex and lower-semicontinuous. Let $x^\star$ be any minimizer and $\{x^{(k)}\}$ the ISTA iterates with constant step size $\eta = 1/L$. Then for every $k \ge 1$,
$$F(x^{(k)}) - F(x^\star) \;\le\; \frac{L\,\|x^{(0)} - x^\star\|_2^2}{2k}.$$
Thus the suboptimality decays as $O(1/k)$.
At each step ISTA minimizes a quadratic upper bound (majorizer) of $f$ plus $g$ exactly. This "majorize-minimize" structure guarantees monotone descent and an $O(1/k)$ rate: the same rate as plain gradient descent on a smooth convex function. The non-smooth $\ell_1$ term does not slow us down because the proximal operator handles it exactly.
Descent lemma
$L$-smoothness of $f$ gives the quadratic upper bound
$$f(x) \;\le\; f(z) + \nabla f(z)^\top (x - z) + \frac{L}{2}\|x - z\|_2^2 \quad \text{for all } x, z.$$
Adding $g(x)$ to both sides and minimizing in $x$ reveals that ISTA is exactly this minimization: the minimizer is $x^{+} = \operatorname{prox}_{g/L}\!\big(z - \tfrac{1}{L}\nabla f(z)\big)$.
Key inequality
Applying the descent lemma with $z = x^{(k)}$ and $x = x^{(k+1)}$, then using convexity of $f$ and $g$ at $x^\star$, after rearrangement one obtains the three-point inequality:
$$F(x^{(k+1)}) - F(x^\star) \;\le\; \frac{L}{2}\Big(\|x^{(k)} - x^\star\|_2^2 - \|x^{(k+1)} - x^\star\|_2^2\Big).$$
Telescoping sum
Sum the inequality from $k = 0$ to $k - 1$. The right-hand side telescopes:
$$\sum_{j=0}^{k-1}\Big(F(x^{(j+1)}) - F(x^\star)\Big) \;\le\; \frac{L}{2}\Big(\|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\Big) \;\le\; \frac{L}{2}\|x^{(0)} - x^\star\|_2^2.$$
Since $F(x^{(k)})$ is monotone non-increasing (descent lemma), the $k$-th term is bounded by the average:
$$F(x^{(k)}) - F(x^\star) \;\le\; \frac{1}{k}\sum_{j=0}^{k-1}\Big(F(x^{(j+1)}) - F(x^\star)\Big) \;\le\; \frac{L\,\|x^{(0)} - x^\star\|_2^2}{2k}.$$
On the Step Size
For $f(x) = \tfrac{1}{2}\|Ax - y\|_2^2$, the Hessian is $A^\top A$ and the Lipschitz constant is $L = \lambda_{\max}(A^\top A) = \|A\|_2^2$. In practice we compute $L$ via the power method on $A^\top A$ (a handful of mat-vec products). A slightly smaller step is always safe; a larger step can cause divergence. Backtracking line search is a robust fallback when $L$ is unknown.
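A power-method estimate of $L$ touches $A$ only through mat-vec products; the sketch below (function name, seed, and iteration count are my own choices) illustrates the idea.

```python
import numpy as np

def estimate_lipschitz(A, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = A.T @ (A @ v)                  # one application of A^T A (two mat-vecs)
        v = w / np.linalg.norm(w)
    return v @ (A.T @ (A @ v))             # Rayleigh quotient, approximates lambda_max(A^T A)

A = np.random.default_rng(1).standard_normal((300, 1000))
print(estimate_lipschitz(A), np.linalg.norm(A, 2) ** 2)   # the two numbers roughly agree
```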
FISTA (Fast ISTA, Beck-Teboulle 2009)
Complexity: same per-iteration cost as ISTA, $O(MN)$. Convergence: $O(1/k^2)$ on the objective value. FISTA performs the ISTA proximal step starting from the extrapolated point $z^{(k)} = x^{(k)} + \frac{t_k - 1}{t_{k+1}}\big(x^{(k)} - x^{(k-1)}\big)$, not from $x^{(k)}$. The extrapolation weight $\frac{t_k - 1}{t_{k+1}}$ approaches $1$ as $k$ grows: "look a little bit ahead in the direction you were moving." This is Nesterov's momentum, adapted to the composite (smooth + proximal) setting.
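Here is a minimal FISTA sketch under the same assumptions as the ISTA sketch above; the $t_k$ recursion and extrapolation follow the Beck-Teboulle scheme, while names and defaults are my own.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, y, lam, n_iters=500):
    L = np.linalg.norm(A, 2) ** 2
    eta = 1.0 / L
    x = np.zeros(A.shape[1])
    z = x.copy()                                     # extrapolated point z^{(k)}
    t = 1.0                                          # momentum parameter t_1
    for _ in range(n_iters):
        x_next = soft_threshold(z - eta * (A.T @ (A @ z - y)), eta * lam)  # ISTA step at z
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))                  # t_{k+1}
        z = x_next + ((t - 1.0) / t_next) * (x_next - x)                   # extrapolate
        x, t = x_next, t_next
    return x
```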
Theorem: FISTA Convergence Rate
Under the same hypotheses as the ISTA Convergence Rate theorem, the FISTA iterates satisfy
$$F(x^{(k)}) - F(x^\star) \;\le\; \frac{2L\,\|x^{(0)} - x^\star\|_2^2}{(k+1)^2}.$$
The suboptimality decays as $O(1/k^2)$: a full order of magnitude faster than ISTA.
To reach accuracy $\epsilon$, ISTA needs $O(1/\epsilon)$ iterations; FISTA needs $O(1/\sqrt{\epsilon})$. For $\epsilon = 10^{-6}$, for example, this is the difference between roughly $10^6$ and $10^3$ iterations. Nesterov showed in 1983 that this rate is optimal among first-order methods for smooth convex optimization; Beck and Teboulle extended the result to the composite setting where $g$ is non-smooth.
Energy function
Define the energy $E_k = t_k^2\big(F(x^{(k)}) - F(x^\star)\big) + \tfrac{L}{2}\|u^{(k)} - x^\star\|_2^2$, where $u^{(k)} = x^{(k-1)} + t_k\big(x^{(k)} - x^{(k-1)}\big)$. One shows $E_{k+1} \le E_k$ using the ISTA descent inequality applied at $z^{(k)}$ and the identity $t_{k+1}^2 - t_{k+1} = t_k^2$ (which is how the momentum sequence is chosen).
Unfold
$t_k^2\big(F(x^{(k)}) - F(x^\star)\big) \le E_k \le E_1 \le \tfrac{L}{2}\|x^{(0)} - x^\star\|_2^2$ (direct calculation using $t_1 = 1$ and $x^{(1)}$ coming from one ISTA step from $x^{(0)}$).
Bound $t_k$ from below
Induction on the recursion $t_{k+1} = \tfrac{1}{2}\big(1 + \sqrt{1 + 4t_k^2}\big)$ with $t_1 = 1$ gives $t_k \ge \tfrac{k+1}{2}$. Therefore $F(x^{(k)}) - F(x^\star) \le E_1 / t_k^2 \le \dfrac{2L\,\|x^{(0)} - x^\star\|_2^2}{(k+1)^2}$.
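The two facts used here, the identity $t_{k+1}^2 - t_{k+1} = t_k^2$ and the bound $t_k \ge (k+1)/2$, are easy to verify numerically; the snippet below is such a check, not a proof.

```python
import numpy as np

t = 1.0                                                    # t_1
for k in range(1, 20):
    assert t >= (k + 1) / 2                                # claimed lower bound
    t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    assert abs((t_next ** 2 - t_next) - t ** 2) < 1e-9     # defining identity
    t = t_next
print("identity and lower bound hold for k = 1, ..., 19")
```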
Example: One ISTA Step in a Two-Dimensional LASSO
Let , , , and . Compute , perform one ISTA step (with ), and report .
Compute $L$
. Its eigenvalues solve , i.e. . So , giving . Hence .
Gradient step
. (Check: .) .
Soft-threshold
The threshold is $\eta\lambda$. Both components survive, and both are shrunk toward zero. The shrinkage is what injects sparsity: components below the threshold would have been zeroed out.
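The same three steps (compute $L$, gradient step, soft-threshold) can be replayed in a few lines of NumPy. The sketch below uses a hypothetical 2-D problem with values of my own choosing, not the numbers of the example above.

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                 # hypothetical 2x2 sensing matrix
y = np.array([1.0, -0.5])                  # hypothetical measurements
lam = 0.2                                  # hypothetical regularization weight
x0 = np.zeros(2)

L = np.linalg.norm(A, 2) ** 2              # step 1: L = lambda_max(A^T A)
eta = 1.0 / L
grad = A.T @ (A @ x0 - y)                  # step 2: gradient of 0.5*||Ax - y||^2 at x0
v = x0 - eta * grad                        # gradient step
x1 = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)   # step 3: soft-threshold at eta*lam
print("L =", L, " x1 =", x1)
```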
ISTA vs. FISTA Convergence
Compare ISTA and FISTA on a synthetic LASSO problem with a Gaussian sensing matrix. The plot shows $F(x^{(k)}) - F^\star$ on a log scale. Adjust the sparsity level, the regularization $\lambda$, and the problem size to see how the $O(1/k)$ versus $O(1/k^2)$ gap manifests in practice.
Soft- vs. Hard-Thresholding
Visualize the shrinkage operators on a scalar input $x$. Soft-thresholding is continuous and shrinks even the surviving coordinates; hard-thresholding is discontinuous but preserves magnitudes above the threshold.
ISTA Iterates on a 2D LASSO
Practical Deployment of ISTA/FISTA
In real systems, ISTA and FISTA are rarely run with a fixed step size. Production solvers combine three tricks: (i) backtracking line search, to avoid computing $L$ exactly; (ii) warm starts across $\lambda$ values (the regularization path), so solving a new problem requires only a few extra iterations; (iii) restart strategies for FISTA (O'Donoghue-Candès 2015), which reset the momentum whenever monotonicity fails, recovering linear convergence under strong convexity while preserving the $O(1/k^2)$ worst case; see the sketch after the checklist below. Libraries such as SPAMS, SPGL1, and scikit-learn's Lasso ship with these heuristics.
- β’
Compute via power iteration on (O(MN) per iter, 5-10 iters suffice).
- β’
Stop when relative iterate change or after a fixed budget.
- β’
FISTA is non-monotone in β monitor the best iterate seen so far.
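Below is a sketch of FISTA with the function-value restart mentioned above: whenever the objective increases, the momentum is reset and the step is retaken from the current iterate. The restart test, names, and defaults are my own choices; O'Donoghue and Candès discuss several variants.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(A, y, lam, x):
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def fista_with_restart(A, y, lam, n_iters=500):
    L = np.linalg.norm(A, 2) ** 2
    eta = 1.0 / L
    x = np.zeros(A.shape[1])
    z = x.copy()
    t = 1.0
    F = lasso_objective(A, y, lam, x)
    best_x, best_F = x.copy(), F                   # FISTA is non-monotone: track the best iterate
    for _ in range(n_iters):
        x_next = soft_threshold(z - eta * (A.T @ (A @ z - y)), eta * lam)
        F_next = lasso_objective(A, y, lam, x_next)
        if F_next > F:                             # monotonicity failed: restart the momentum
            z, t = x.copy(), 1.0
            continue                               # retake the step as plain ISTA from x
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t, F = x_next, t_next, F_next
        if F < best_F:
            best_x, best_F = x.copy(), F
    return best_x
```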
ISTA vs. FISTA vs. Subgradient vs. Interior Point
| Method | Per-Iter Cost | Iterations to $\epsilon$-Accuracy | Memory | Non-Smoothness |
|---|---|---|---|---|
| Subgradient | $O(MN)$ | $O(1/\epsilon^2)$ | $O(N)$ | handles $\|x\|_1$ via subgradients |
| ISTA | $O(MN)$ | $O(1/\epsilon)$ | $O(N)$ | handles $\|x\|_1$ via prox |
| FISTA | $O(MN)$ | $O(1/\sqrt{\epsilon})$ | $O(N)$ | handles $\|x\|_1$ via prox |
| Interior Point | $O(N^3)$ | $O(\log(1/\epsilon))$ | $O(N^2)$ | requires smoothing |
Common Mistake: Using a Step Size $\eta > 1/L$
Mistake:
Picking the step size $\eta$ too large (well above $1/L$) in the hope of converging faster.
Correction:
The ISTA/FISTA convergence proofs require $\eta \le 1/L$. With a larger step the objective may diverge: the algorithm oscillates or explodes. Always estimate $L$ first, or use a backtracking line search that doubles the estimate of $L$ whenever the descent condition fails.
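In sketch form, one backtracking step looks as follows: start from a guess $\hat L$ and double it (equivalently, halve the step) until the descent-lemma condition holds. The function below is illustrative, not any particular library's API.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def backtracking_prox_step(A, y, lam, x, L_hat):
    """One prox-gradient step, backtracking on the Lipschitz estimate L_hat."""
    f = lambda u: 0.5 * np.sum((A @ u - y) ** 2)
    fx = f(x)
    grad = A.T @ (A @ x - y)
    while True:
        x_new = soft_threshold(x - grad / L_hat, lam / L_hat)
        d = x_new - x
        # Accept when f(x_new) <= f(x) + <grad, d> + (L_hat/2)||d||^2 (descent lemma).
        if f(x_new) <= fx + grad @ d + 0.5 * L_hat * np.sum(d ** 2):
            return x_new, L_hat
        L_hat *= 2.0                        # condition failed: double the estimate of L
```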
Common Mistake: Forgetting That $\lambda$ Depends on the Scale of $A$ and $y$
Mistake:
Copying a $\lambda$ value from a paper where $A$ was normalized (columns with unit $\ell_2$ norm) to a problem where it is not.
Correction:
The LASSO is not scale-invariant. A useful guideline: start with $\lambda_{\max} = \|A^\top y\|_\infty$ (above which the solution is exactly $x = 0$) and scan $\lambda$ downward on a log grid.
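In code the guideline amounts to two lines; the helper below (its name, grid size, and lower endpoint are arbitrary choices of mine) builds such a grid.

```python
import numpy as np

def lambda_grid(A, y, n_points=20, ratio=1e-3):
    lam_max = np.linalg.norm(A.T @ y, np.inf)      # above this, the LASSO solution is x = 0
    return np.logspace(np.log10(lam_max), np.log10(ratio * lam_max), n_points)
```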
Historical Note: From EM to FISTA: 40 Years of Shrinkage
1964-2009. The soft-thresholding operator goes back to wavelet denoising (Donoho-Johnstone 1994) and earlier robust statistics (Huber 1964). ISTA itself was rediscovered several times as an EM-type iteration: Figueiredo and Nowak (2003) derived it from a Gaussian-Laplace hierarchical model; Daubechies, Defrise, and De Mol (2004) gave the first complete convergence analysis for inverse problems under $\ell_1$ penalties. The method was considered "too slow to be useful" until Beck and Teboulle's FISTA paper in 2009: they showed that Nesterov's 1983 momentum trick, originally developed for smooth optimization, extends to the composite case. Overnight, problems with millions of unknowns became routine on a laptop.
Proximal operator
For a closed convex function $g$, $\operatorname{prox}_g(v) = \arg\min_x \tfrac{1}{2}\|x - v\|_2^2 + g(x)$. Generalizes projection (for indicator functions) and shrinkage (for norms) to arbitrary convex regularizers.
Related: Soft-Threshold Operator, Proximal Operator of the $\ell_1$ Norm
ISTA
Iterative Shrinkage-Thresholding Algorithm. A proximal-gradient method for the LASSO: alternate a gradient step on the quadratic data-fit term with a soft-threshold of the result. Converges at rate $O(1/k)$ on the objective.
FISTA
Fast ISTA. ISTA plus Nesterov momentum, converging at $O(1/k^2)$: the optimal rate for first-order methods on composite convex problems.
Quick Check
To reduce the LASSO suboptimality by four orders of magnitude, roughly how many more iterations does ISTA need compared with FISTA?
The same number; both methods converge at the same rate.
ISTA needs 100 times more iterations.
FISTA is slower for high-precision targets.
ISTA needs 4 times more iterations.
Four orders of magnitude is a 10000x reduction. ISTA needs 10000x more iterations (suboptimality $\propto 1/k$); FISTA needs only $\sqrt{10000} = 100$x more iterations (suboptimality $\propto 1/k^2$). Ratio: 100x.
Quick Check
What is ?
Componentwise soft-threshold at 0.3: ; ; .
Key Takeaway
ISTA is gradient descent plus a soft-threshold at every step; FISTA is ISTA plus Nesterov momentum. The convergence rates $O(1/k)$ and $O(1/k^2)$ follow from descent-lemma arguments and are both consequences of the convexity of the LASSO objective. In practice, FISTA with backtracking and adaptive restarts is the default first-order LASSO solver: it requires only matrix-vector products with $A$ and $A^\top$, memory $O(N)$, and two mat-vecs per iteration.