ISTA and FISTA

Why We Need Dedicated Sparse Solvers

In Chapter 13 we formulated the LASSO problem $\min_{\mathbf{x}\in\mathbb{R}^N} F(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1$ and argued that it is a convex relaxation of the combinatorial $\ell_0$ problem. Good, but an off-the-shelf interior-point solver scales as $O(N^3)$ per iteration, which is hopeless once $N$ reaches $10^4$ or more. Realistic compressed-sensing dictionaries are enormous: an $M\times N$ sensing matrix with $M\sim 10^3$ and $N\sim 10^5$ is routine in imaging and massive-connectivity problems. We need first-order methods that touch $\mathbf{A}$ only through matrix-vector products $\mathbf{A}\mathbf{x}$ and $\mathbf{A}^\top\mathbf{r}$. This section develops two such methods, ISTA and its accelerated cousin FISTA, derives them from the proximal-gradient perspective, and proves their convergence rates. Both exploit convexity: the objective $F$ is convex, so every local minimum is global, and every monotone descent strategy succeeds.

Definition: Soft-Threshold Operator

The soft-threshold operator $S_\tau:\mathbb{R}\to\mathbb{R}$ with threshold $\tau\geq 0$ is $$S_\tau(u) = \mathrm{sign}(u)\,\max\{|u|-\tau,\,0\} = \begin{cases}u-\tau, & u>\tau,\\ 0, & |u|\leq \tau,\\ u+\tau, & u<-\tau.\end{cases}$$ Applied componentwise to a vector $\mathbf{u}\in\mathbb{R}^N$, we write $S_\tau(\mathbf{u})$.

The name "soft" distinguishes it from the hard-threshold $H_\tau(u)=u\cdot\mathbf{1}\{|u|>\tau\}$, which keeps or kills components but never shrinks them. Soft-thresholding is continuous; hard-thresholding is not.
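A minimal NumPy sketch of the two operators (the function names soft_threshold and hard_threshold are our choices, not from a particular library):

```python
import numpy as np

def soft_threshold(u, tau):
    """Componentwise soft-threshold: sign(u) * max(|u| - tau, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def hard_threshold(u, tau):
    """Componentwise hard-threshold: keep entries with |u| > tau, zero the rest."""
    return u * (np.abs(u) > tau)

u = np.array([-0.5, 0.2, 1.0])
print(soft_threshold(u, 0.3))   # [-0.2  0.   0.7]  (survivors are shrunk)
print(hard_threshold(u, 0.3))   # [-0.5  0.   1. ]  (survivors keep their magnitude)
```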

Theorem: Proximal Operator of the $\ell_1$ Norm

For any $\mathbf{u}\in\mathbb{R}^N$ and $\tau>0$, $$\mathrm{prox}_{\tau\|\cdot\|_1}(\mathbf{u}) \triangleq \arg\min_{\mathbf{x}\in\mathbb{R}^N} \tfrac{1}{2}\|\mathbf{x}-\mathbf{u}\|_2^2 + \tau\|\mathbf{x}\|_1 = S_\tau(\mathbf{u}).$$

The proximal operator asks: what vector is close to $\mathbf{u}$ (in $\ell_2$) and small (in $\ell_1$)? The answer decouples componentwise, and for each coordinate it is a one-dimensional shrinkage. We do not get "almost sparse" vectors: the operator sets coordinates exactly to zero whenever $|u_i|\leq\tau$.
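The scalar case makes the decoupling concrete; a sketch of the one-dimensional argument, using the subgradient of $|x|$:

```latex
% One coordinate: minimize  phi(x) = (1/2)(x - u)^2 + tau*|x|  over x in R.
% Optimality condition: 0 \in (x - u) + tau * \partial|x|.
\begin{aligned}
x > 0:&\quad x - u + \tau = 0 \;\Rightarrow\; x = u - \tau \quad (\text{valid iff } u > \tau),\\
x < 0:&\quad x - u - \tau = 0 \;\Rightarrow\; x = u + \tau \quad (\text{valid iff } u < -\tau),\\
x = 0:&\quad 0 \in -u + \tau[-1,1] \;\Leftrightarrow\; |u| \le \tau.
\end{aligned}
% The three cases are exactly S_tau(u); the vector problem is a sum of such
% scalar problems, so the prox acts componentwise.
```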

The Proximal-Gradient View of LASSO

Split the objective as $F(\mathbf{x}) = f(\mathbf{x}) + g(\mathbf{x})$, where $f(\mathbf{x})=\tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2$ is smooth and convex with Lipschitz gradient $\nabla f(\mathbf{x})=\mathbf{A}^\top(\mathbf{A}\mathbf{x}-\mathbf{y})$, and $g(\mathbf{x})=\lambda\|\mathbf{x}\|_1$ is convex but non-smooth. Proximal-gradient methods alternate a gradient step on $f$ with a proximal step on $g$: $$\mathbf{x}^{(k+1)} = \mathrm{prox}_{\eta g}\!\left(\mathbf{x}^{(k)} - \eta\nabla f(\mathbf{x}^{(k)})\right).$$ For the LASSO, the proximal step is soft-thresholding, and we have just derived ISTA.

ISTA (Iterative Shrinkage-Thresholding Algorithm)

Complexity: $O(MN)$ per iteration for the two mat-vec products $\mathbf{A}\mathbf{x}$ and $\mathbf{A}^\top\mathbf{r}$. Convergence: $O(1/k)$ on the objective value.
Input: sensing matrix A, measurements y, regularization lambda,
step size eta (e.g., eta = 1/L where L = ||A||_2^2),
initial iterate x_0, tolerance tol.
1: for k = 0, 1, 2, ... do
2: grad <- A^T (A x_k - y) # gradient of smooth part
3: x_{k+1} <- soft_threshold(x_k - eta * grad, eta * lambda)
4: if ||x_{k+1} - x_k|| / ||x_k|| < tol then stop
5: end for
6: return x_{k+1}

In line 3, the threshold is $\eta\lambda$, the product of the step size and the regularization parameter. If $\mathbf{A}$ has orthonormal columns ($\mathbf{A}^\top\mathbf{A}=\mathbf{I}$) and we pick $\eta=1$, one ISTA step reaches the exact LASSO solution.
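For concreteness, a NumPy sketch of the loop above (the function name ista and its default tolerances are our choices, not from a specific library):

```python
import numpy as np

def ista(A, y, lam, eta=None, x0=None, tol=1e-4, max_iter=5000):
    """ISTA for min_x 0.5*||y - A x||_2^2 + lam*||x||_1."""
    M, N = A.shape
    if eta is None:
        eta = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L with L = sigma_max(A)^2
    x = np.zeros(N) if x0 is None else x0.copy()
    for _ in range(max_iter):
        grad = A.T @ (A @ x - y)                     # gradient of the smooth part
        v = x - eta * grad
        x_new = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)   # soft-threshold
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1e-12):
            return x_new
        x = x_new
    return x
```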

Theorem: ISTA Convergence Rate

Let $F=f+g$ with $f$ convex and differentiable with $L$-Lipschitz gradient, and $g$ convex and lower-semicontinuous. Let $\mathbf{x}^\star$ be any minimizer and $\{\mathbf{x}^{(k)}\}$ the ISTA iterates with constant step size $\eta=1/L$. Then for every $k\geq 1$, $$F(\mathbf{x}^{(k)}) - F(\mathbf{x}^\star) \;\leq\; \frac{L\,\|\mathbf{x}^{(0)}-\mathbf{x}^\star\|_2^2}{2k}.$$ Thus the suboptimality decays as $O(1/k)$.

At each step ISTA exactly minimizes a quadratic upper bound (majorizer) of $f$ plus $g$. This majorize-minimize strategy guarantees monotone descent and a $1/k$ rate, the same rate as plain gradient descent on a smooth convex function. The non-smooth $g$ does not slow us down because the proximal operator handles it exactly.
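Spelled out, the majorizer minimized at iterate $\mathbf{x}^{(k)}$ (with step $\eta=1/L$) is the standard descent-lemma bound; completing the square shows its minimizer is exactly the ISTA update:

```latex
% Descent-lemma majorizer (valid because \nabla f is L-Lipschitz):
Q_L(\mathbf{x};\mathbf{x}^{(k)})
  = f(\mathbf{x}^{(k)})
  + \nabla f(\mathbf{x}^{(k)})^\top(\mathbf{x}-\mathbf{x}^{(k)})
  + \tfrac{L}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|_2^2
  + g(\mathbf{x})
  \;\ge\; F(\mathbf{x}),
% and completing the square in x gives
\arg\min_{\mathbf{x}} Q_L(\mathbf{x};\mathbf{x}^{(k)})
  = \mathrm{prox}_{g/L}\!\Bigl(\mathbf{x}^{(k)} - \tfrac{1}{L}\nabla f(\mathbf{x}^{(k)})\Bigr)
  = \mathbf{x}^{(k+1)}.
```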

On the Step Size $\eta$

For $f(\mathbf{x})=\tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2$, the Hessian is $\mathbf{A}^\top\mathbf{A}$ and the Lipschitz constant is $L = \|\mathbf{A}\|_2^2 = \sigma_{\max}^2(\mathbf{A})$. In practice we compute $L$ via the power method on $\mathbf{A}^\top\mathbf{A}$ (a handful of mat-vec products). A slightly smaller step $\eta < 1/L$ is always safe; a larger step can cause divergence. Backtracking line search is a robust fallback when $L$ is unknown.
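A minimal power-iteration estimate of $L$ using only mat-vec products (a sketch; the function name and iteration count are our choices):

```python
import numpy as np

def estimate_lipschitz(A, n_iter=10, seed=0):
    """Estimate L = sigma_max(A)^2 by power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = A.T @ (A @ v)              # one application of A^T A (two mat-vecs)
        v = w / np.linalg.norm(w)
    return v @ (A.T @ (A @ v))         # Rayleigh quotient ~ largest eigenvalue of A^T A
```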

FISTA (Fast ISTA, Beck-Teboulle 2009)

Complexity: same per-iteration cost as ISTA, $O(MN)$. Convergence: $O(1/k^2)$ on the objective value.
Input: A, y, lambda, L = ||A||_2^2, x_0, tol.
Initialize: z_1 = x_0, t_1 = 1.
1: for k = 1, 2, ... do
2: grad <- A^T (A z_k - y)
3: x_k <- soft_threshold(z_k - (1/L)*grad, lambda/L) # ISTA step from z_k
4: t_{k+1} <- (1 + sqrt(1 + 4*t_k^2)) / 2 # Nesterov momentum
5: z_{k+1} <- x_k + ((t_k - 1)/t_{k+1}) * (x_k - x_{k-1}) # extrapolation
6: if ||x_k - x_{k-1}|| / ||x_{k-1}|| < tol then stop
7: end for
8: return x_k

FISTA performs the ISTA proximal step starting from the extrapolated point $\mathbf{z}_k$, not from $\mathbf{x}_{k-1}$. The extrapolation weight approaches $(k-1)/(k+2)$ as $k$ grows: "look a little bit ahead in the direction you were moving." This is Nesterov's momentum, adapted to the composite (smooth + proximal) setting.
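A NumPy sketch of the FISTA loop mirroring the pseudocode (the function name fista and its defaults are our choices):

```python
import numpy as np

def fista(A, y, lam, L=None, x0=None, tol=1e-4, max_iter=5000):
    """FISTA for min_x 0.5*||y - A x||_2^2 + lam*||x||_1."""
    if L is None:
        L = np.linalg.norm(A, 2) ** 2                # sigma_max(A)^2
    x_prev = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    z, t = x_prev.copy(), 1.0
    for _ in range(max_iter):
        grad = A.T @ (A @ z - y)
        v = z - grad / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)   # ISTA step from z
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0       # Nesterov momentum
        z = x + ((t - 1.0) / t_next) * (x - x_prev)              # extrapolation
        if np.linalg.norm(x - x_prev) <= tol * max(np.linalg.norm(x_prev), 1e-12):
            return x
        x_prev, t = x, t_next
    return x_prev
```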

Theorem: FISTA Convergence Rate

Under the same hypotheses as the ISTA convergence theorem, the FISTA iterates satisfy $$F(\mathbf{x}^{(k)}) - F(\mathbf{x}^\star) \;\leq\; \frac{2L\,\|\mathbf{x}^{(0)}-\mathbf{x}^\star\|_2^2}{(k+1)^2}.$$ The suboptimality decays as $O(1/k^2)$, a full order faster in $k$ than ISTA.

To reach accuracy $\varepsilon$, ISTA needs $O(1/\varepsilon)$ iterations; FISTA needs $O(1/\sqrt{\varepsilon})$. For $\varepsilon=10^{-4}$ this is the difference between $10^4$ and $10^2$ iterations. Nesterov showed in 1983 that this $O(1/k^2)$ rate is optimal among first-order methods for smooth convex optimization; Beck and Teboulle extended the result to the composite setting where $g$ is non-smooth.


Example: One ISTA Step in a Two-Dimensional LASSO

Let $\mathbf{A} = \begin{pmatrix}1 & 0.5\\ 0 & 1\end{pmatrix}$, $\mathbf{y}=(0.8,\,0.3)^\top$, $\lambda=0.2$, and $\mathbf{x}^{(0)}=(0,0)^\top$. Compute $L$, perform one ISTA step (with $\eta=1/L$), and report $\mathbf{x}^{(1)}$.
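A quick NumPy check of this example (the printed values are approximate):

```python
import numpy as np

A = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.array([0.8, 0.3])
lam, x = 0.2, np.zeros(2)

L = np.linalg.norm(A, 2) ** 2          # sigma_max(A)^2, about 1.64 here
eta = 1.0 / L
grad = A.T @ (A @ x - y)               # = -A^T y = [-0.8, -0.7]
v = x - eta * grad
x1 = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
print(L, x1)                           # x1 is roughly [0.37, 0.30]
```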

ISTA vs. FISTA Convergence

Compare ISTA and FISTA on a synthetic LASSO problem with a Gaussian sensing matrix. The plot shows $F(\mathbf{x}^{(k)}) - F^\star$ on a log scale. Adjust the sparsity level, the regularization $\lambda$, and the problem size to see how the $O(1/k)$ versus $O(1/k^2)$ gap manifests in practice.


Soft- vs. Hard-Thresholding

Visualize the shrinkage operators on a scalar input $u$. Soft-thresholding is continuous and shrinks even the surviving coordinates; hard-thresholding is discontinuous but preserves magnitudes above the threshold.


ISTA Iterates on a 2D LASSO

Animated trajectory of $\mathbf{x}^{(k)}$ under ISTA on a 2D LASSO. Contours of $F(\mathbf{x})$ are shown; the iterate slides toward the minimizer along the shrinkage diagonals.
⚠️ Engineering Note

Practical Deployment of ISTA/FISTA

In real systems, ISTA and FISTA are rarely run with a fixed step size. Production solvers combine three tricks: (i) backtracking line search, to avoid computing $L$ exactly; (ii) warm starts across $\lambda$ values (the regularization path), so solving a new problem requires only a few extra iterations; (iii) restart strategies for FISTA (O'Donoghue-CandΓ¨s 2015), which reset $t_k\leftarrow 1$ whenever monotonicity fails, recovering linear convergence under strong convexity while preserving the $O(1/k^2)$ worst case. Libraries such as SPAMS, SPGL1, and scikit-learn's Lasso ship with these heuristics.
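A sketch of the restart trick (iii) grafted onto the FISTA loop above, using a simple function-value test; this is an illustration, not the exact scheme of any particular library:

```python
import numpy as np

def fista_restart(A, y, lam, L, x0, max_iter=5000):
    """FISTA with a function-value restart: reset momentum whenever F increases."""
    F = lambda x: 0.5 * np.linalg.norm(y - A @ x) ** 2 + lam * np.linalg.norm(x, 1)
    x_prev, z, t = x0.copy(), x0.copy(), 1.0
    F_prev = F(x_prev)
    for _ in range(max_iter):
        v = z - A.T @ (A @ z - y) / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
        if F(x) > F_prev:                              # monotonicity failed
            z, t = x_prev.copy(), 1.0                  # restart: t_k <- 1, drop momentum
            continue
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t, F_prev = x, t_next, F(x)
    return x_prev
```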

Practical Constraints
  • β€’

    Compute LL via power iteration on A⊀A\mathbf{A}^\top\mathbf{A} (O(MN) per iter, 5-10 iters suffice).

  • β€’

    Stop when relative iterate change βˆ₯x(k+1)βˆ’x(k)βˆ₯/βˆ₯x(k)βˆ₯<10βˆ’4\|\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\|/\|\mathbf{x}^{(k)}\| < 10^{-4} or after a fixed budget.

  • β€’

    FISTA is non-monotone in FF β€” monitor the best iterate seen so far.

ISTA vs. FISTA vs. Subgradient vs. Interior Point

| Method | Per-Iter Cost | Iterations to $\varepsilon$ | Memory | Non-Smoothness |
| --- | --- | --- | --- | --- |
| Subgradient | $O(MN)$ | $O(1/\varepsilon^2)$ | $O(N)$ | handles via $\partial g$ |
| ISTA | $O(MN)$ | $O(1/\varepsilon)$ | $O(N)$ | handles via prox |
| FISTA | $O(MN)$ | $O(1/\sqrt{\varepsilon})$ | $O(N)$ | handles via prox |
| Interior Point | $O(N^3)$ | $O(\log(1/\varepsilon))$ | $O(N^2)$ | requires smoothing |

Common Mistake: Using $\eta > 1/L$

Mistake:

Picking the step size too large (e.g., $\eta=1$ on a problem where $L=10$) in the hope of converging faster.

Correction:

The ISTA/FISTA convergence proofs require $\eta\leq 1/L$. With $\eta>1/L$ the objective may diverge: the algorithm oscillates or explodes. Always estimate $L$ first, or use a backtracking line search that doubles $L$ whenever the descent condition fails.
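A sketch of that backtracking rule for one proximal-gradient step: start from a guess of $L$ and double it until the sufficient-decrease condition holds. The function name and constants are our choices:

```python
import numpy as np

def backtracking_prox_step(A, y, lam, x, L0=1.0, factor=2.0):
    """One ISTA step with backtracking on L (no prior knowledge of ||A||_2^2)."""
    f = lambda v: 0.5 * np.linalg.norm(y - A @ v) ** 2
    grad = A.T @ (A @ x - y)
    L = L0
    while True:
        v = x - grad / L
        x_new = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
        d = x_new - x
        # Descent condition: f(x_new) <= f(x) + grad.d + (L/2)||d||^2
        if f(x_new) <= f(x) + grad @ d + 0.5 * L * (d @ d):
            return x_new, L        # reuse this L as the starting guess next iteration
        L *= factor                # condition failed: increase L (shrink the step)
```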

Common Mistake: Forgetting That $\lambda$ Depends on the Scale of $\mathbf{A}$ and $\mathbf{y}$

Mistake:

Copying a $\lambda$ value from a paper where $\mathbf{A}$ was normalized (columns with unit $\ell_2$ norm) to a problem where it is not.

Correction:

The LASSO is not scale-invariant. A useful guideline: start with $\lambda_{\max} = \|\mathbf{A}^\top\mathbf{y}\|_\infty$ (above which the solution is $\mathbf{0}$) and scan $\lambda \in [10^{-3}\lambda_{\max},\,\lambda_{\max}]$ on a log grid.
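A small sketch of that guideline (the grid size of 20 points is an arbitrary choice):

```python
import numpy as np

def lambda_grid(A, y, n=20, ratio=1e-3):
    """Log-spaced regularization path from lambda_max down to ratio*lambda_max."""
    lam_max = np.linalg.norm(A.T @ y, np.inf)   # above this value the LASSO solution is 0
    return np.geomspace(lam_max, ratio * lam_max, n)
```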

Historical Note: From EM to FISTA: 40 Years of Shrinkage

1964-2009

The soft-thresholding operator goes back to wavelet denoising (Donoho-Johnstone 1994) and earlier robust statistics (Huber 1964). ISTA itself was rediscovered several times as an EM-type iteration: Figueiredo and Nowak (2003) derived it from a Gaussian-Laplace hierarchical model; Daubechies, Defrise and De Mol (2004) gave the first complete convergence analysis for inverse problems under $\ell_1$ penalties. The method was considered "too slow to be useful" until Beck and Teboulle's FISTA paper in 2009: they showed that Nesterov's 1983 momentum trick, originally developed for smooth optimization, extends to the composite case. Overnight, $\ell_1$ problems with $N=10^5$ became routine on a laptop.

Proximal operator

For a closed convex function $g$, $\mathrm{prox}_{\tau g}(\mathbf{u}) = \arg\min_{\mathbf{x}}\tfrac{1}{2}\|\mathbf{x}-\mathbf{u}\|_2^2 + \tau g(\mathbf{x})$. Generalizes projection (for indicator functions) and shrinkage (for norms) to arbitrary convex regularizers.

Related: Soft-Threshold Operator, Proximal Operator of the $\ell_1$ Norm

ISTA

Iterative Shrinkage-Thresholding Algorithm. A proximal-gradient method for the LASSO: alternate a gradient step on the quadratic data-fit term with a soft-threshold of the result. Converges at rate $O(1/k)$ on the objective.

Related: ISTA (Iterative Shrinkage-Thresholding Algorithm)

FISTA

Fast ISTA. ISTA plus Nesterov momentum, converging at $O(1/k^2)$, the optimal rate for first-order methods on composite convex problems.

Related: FISTA (Fast ISTA, Beck-Teboulle 2009)

Quick Check

To reduce the LASSO suboptimality from $10^{-2}$ to $10^{-6}$ (four orders of magnitude), roughly how many more iterations does ISTA need compared with FISTA?

The same number; both rates are $O(1/k)$.

ISTA needs $\sim$100 times more iterations.

FISTA is slower for high-precision targets.

ISTA needs 4 times more iterations.

Quick Check

What is $\mathrm{prox}_{0.3\|\cdot\|_1}\bigl((-0.5,\,0.2,\,1.0)^\top\bigr)$?

$(-0.2,\,0,\,0.7)^\top$

$(-0.5,\,0.2,\,1.0)^\top$

$(-0.5,\,0,\,1.0)^\top$

$(0,\,0,\,0.7)^\top$

Key Takeaway

ISTA is gradient descent plus a soft-threshold at every step; FISTA is ISTA plus Nesterov momentum. The convergence rates $O(1/k)$ and $O(1/k^2)$ follow from descent-lemma arguments and are both consequences of the convexity of the LASSO objective. In practice, FISTA with backtracking and adaptive restarts is the default first-order LASSO solver: it requires only matrix-vector products with $\mathbf{A}$, $O(N)$ memory, and two mat-vecs per iteration.