ISTA and FISTA

Why We Need Dedicated Sparse Solvers

In Chapter 13 we formulated the LASSO problem $\min_{\mathbf{x}\in\mathbb{R}^N} F(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1$ and argued that it is a convex relaxation of the combinatorial $\ell_0$ problem. Good, but an off-the-shelf interior-point solver scales as $O(N^3)$ per iteration, which is hopeless once $N$ reaches $10^4$ or more. Realistic compressed-sensing dictionaries are enormous: an $M\times N$ sensing matrix with $M\sim 10^3$ and $N\sim 10^5$ is routine in imaging and massive-connectivity problems. We need first-order methods that touch $\mathbf{A}$ only through matrix-vector products $\mathbf{A}\mathbf{x}$ and $\mathbf{A}^\top\mathbf{r}$. This section develops two such methods, ISTA and its accelerated cousin FISTA, derives them from the proximal-gradient perspective, and proves their convergence rates. Both exploit convexity: the objective $F$ is convex, so every local minimum is global, and every monotone descent strategy succeeds.

Definition: Soft-Threshold Operator

The soft-threshold operator $S_\tau:\mathbb{R}\to\mathbb{R}$ with threshold $\tau\geq 0$ is $$S_\tau(u) = \mathrm{sign}(u)\,\max\{|u|-\tau,\,0\} = \begin{cases}u-\tau, & u>\tau,\\ 0, & |u|\leq \tau,\\ u+\tau, & u<-\tau.\end{cases}$$ Applied componentwise to a vector $\mathbf{u}\in\mathbb{R}^N$, we write $S_\tau(\mathbf{u})$.

The name "soft" distinguishes it from the hard-threshold $H_\tau(u)=u\cdot\mathbf{1}\{|u|>\tau\}$, which keeps or kills components but never shrinks them. Soft-thresholding is continuous; hard-thresholding is not.
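A minimal NumPy sketch of the two operators (the function names soft_threshold and hard_threshold are our choices, not from a particular library):

```python
import numpy as np

def soft_threshold(u, tau):
    """Componentwise soft-threshold: sign(u) * max(|u| - tau, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def hard_threshold(u, tau):
    """Componentwise hard-threshold: keep entries with |u| > tau, zero the rest."""
    return u * (np.abs(u) > tau)

u = np.array([-0.5, 0.2, 1.0])
print(soft_threshold(u, 0.3))   # [-0.2  0.   0.7]  (survivors are shrunk)
print(hard_threshold(u, 0.3))   # [-0.5  0.   1. ]  (survivors keep their magnitude)
```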

Theorem: Proximal Operator of the $\ell_1$ Norm

For any $\mathbf{u}\in\mathbb{R}^N$ and $\tau>0$, $$\mathrm{prox}_{\tau\|\cdot\|_1}(\mathbf{u}) \triangleq \arg\min_{\mathbf{x}\in\mathbb{R}^N} \tfrac{1}{2}\|\mathbf{x}-\mathbf{u}\|_2^2 + \tau\|\mathbf{x}\|_1 = S_\tau(\mathbf{u}).$$

The proximal operator asks: what vector is close to $\mathbf{u}$ (in $\ell_2$) and small (in $\ell_1$)? The answer decouples componentwise, and for each coordinate it is a one-dimensional shrinkage. We do not get "almost sparse" vectors: the operator sets coordinates exactly to zero whenever $|u_i|\leq\tau$.
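The scalar case makes the decoupling concrete; a sketch of the one-dimensional argument, using the subgradient of $|x|$:

```latex
% One coordinate: minimize  phi(x) = (1/2)(x - u)^2 + tau*|x|  over x in R.
% Optimality condition: 0 \in (x - u) + tau * \partial|x|.
\begin{aligned}
x > 0:&\quad x - u + \tau = 0 \;\Rightarrow\; x = u - \tau \quad (\text{valid iff } u > \tau),\\
x < 0:&\quad x - u - \tau = 0 \;\Rightarrow\; x = u + \tau \quad (\text{valid iff } u < -\tau),\\
x = 0:&\quad 0 \in -u + \tau[-1,1] \;\Leftrightarrow\; |u| \le \tau.
\end{aligned}
% The three cases are exactly S_tau(u); the vector problem is a sum of such
% scalar problems, so the prox acts componentwise.
```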

The Proximal-Gradient View of LASSO

Split the objective as $F(\mathbf{x}) = f(\mathbf{x}) + g(\mathbf{x})$, where $f(\mathbf{x})=\tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2$ is smooth and convex with Lipschitz gradient $\nabla f(\mathbf{x})=\mathbf{A}^\top(\mathbf{A}\mathbf{x}-\mathbf{y})$, and $g(\mathbf{x})=\lambda\|\mathbf{x}\|_1$ is convex but non-smooth. Proximal-gradient methods alternate a gradient step on $f$ with a proximal step on $g$: $$\mathbf{x}^{(k+1)} = \mathrm{prox}_{\eta g}\!\left(\mathbf{x}^{(k)} - \eta\nabla f(\mathbf{x}^{(k)})\right).$$ For the LASSO, the proximal step is soft-thresholding, and we have just derived ISTA.

ISTA (Iterative Shrinkage-Thresholding Algorithm)

Complexity: $O(MN)$ per iteration for the two mat-vec products $\mathbf{A}\mathbf{x}$ and $\mathbf{A}^\top\mathbf{r}$. Convergence: $O(1/k)$ on the objective value.
Input: sensing matrix A, measurements y, regularization lambda,
step size eta (e.g., eta = 1/L where L = ||A||_2^2),
initial iterate x_0, tolerance tol.
1: for k = 0, 1, 2, ... do
2: grad <- A^T (A x_k - y) # gradient of smooth part
3: x_{k+1} <- soft_threshold(x_k - eta * grad, eta * lambda)
4: if ||x_{k+1} - x_k|| / ||x_k|| < tol then stop
5: end for
6: return x_{k+1}

In line 3, the threshold is $\eta\lambda$, the product of the step size and the regularization parameter. If $\mathbf{A}$ has orthonormal columns ($\mathbf{A}^\top\mathbf{A}=\mathbf{I}$) and we pick $\eta=1$, one ISTA step reaches the exact LASSO solution.
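For concreteness, a NumPy sketch of the loop above (the function name ista and its default tolerances are our choices, not from a specific library):

```python
import numpy as np

def ista(A, y, lam, eta=None, x0=None, tol=1e-4, max_iter=5000):
    """ISTA for min_x 0.5*||y - A x||_2^2 + lam*||x||_1."""
    M, N = A.shape
    if eta is None:
        eta = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L with L = sigma_max(A)^2
    x = np.zeros(N) if x0 is None else x0.copy()
    for _ in range(max_iter):
        grad = A.T @ (A @ x - y)                     # gradient of the smooth part
        v = x - eta * grad
        x_new = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)   # soft-threshold
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1e-12):
            return x_new
        x = x_new
    return x
```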

Theorem: ISTA Convergence Rate

Let $F=f+g$ with $f$ convex and differentiable with $L$-Lipschitz gradient, and $g$ convex and lower-semicontinuous. Let $\mathbf{x}^\star$ be any minimizer and $\{\mathbf{x}^{(k)}\}$ the ISTA iterates with constant step size $\eta=1/L$. Then for every $k\geq 1$, $$F(\mathbf{x}^{(k)}) - F(\mathbf{x}^\star) \;\leq\; \frac{L\,\|\mathbf{x}^{(0)}-\mathbf{x}^\star\|_2^2}{2k}.$$ Thus the suboptimality decays as $O(1/k)$.

At each step ISTA exactly minimizes a quadratic upper bound (majorizer) of $f$ plus $g$. This majorize-minimize strategy guarantees monotone descent and a $1/k$ rate, the same rate as plain gradient descent on a smooth convex function. The non-smooth $g$ does not slow us down because the proximal operator handles it exactly.
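Spelled out, the majorizer minimized at iterate $\mathbf{x}^{(k)}$ (with step $\eta=1/L$) is the standard descent-lemma bound; completing the square shows its minimizer is exactly the ISTA update:

```latex
% Descent-lemma majorizer (valid because \nabla f is L-Lipschitz):
Q_L(\mathbf{x};\mathbf{x}^{(k)})
  = f(\mathbf{x}^{(k)})
  + \nabla f(\mathbf{x}^{(k)})^\top(\mathbf{x}-\mathbf{x}^{(k)})
  + \tfrac{L}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|_2^2
  + g(\mathbf{x})
  \;\ge\; F(\mathbf{x}),
% and completing the square in x gives
\arg\min_{\mathbf{x}} Q_L(\mathbf{x};\mathbf{x}^{(k)})
  = \mathrm{prox}_{g/L}\!\Bigl(\mathbf{x}^{(k)} - \tfrac{1}{L}\nabla f(\mathbf{x}^{(k)})\Bigr)
  = \mathbf{x}^{(k+1)}.
```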

On the Step Size $\eta$

For $f(\mathbf{x})=\tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2$, the Hessian is $\mathbf{A}^\top\mathbf{A}$ and the Lipschitz constant is $L = \|\mathbf{A}\|_2^2 = \sigma_{\max}^2(\mathbf{A})$. In practice we compute $L$ via the power method on $\mathbf{A}^\top\mathbf{A}$ (a handful of mat-vec products). A slightly smaller step $\eta < 1/L$ is always safe; a larger step can cause divergence. Backtracking line search is a robust fallback when $L$ is unknown.
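A minimal power-iteration estimate of $L$ using only mat-vec products (a sketch; the function name and iteration count are our choices):

```python
import numpy as np

def estimate_lipschitz(A, n_iter=10, seed=0):
    """Estimate L = sigma_max(A)^2 by power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = A.T @ (A @ v)              # one application of A^T A (two mat-vecs)
        v = w / np.linalg.norm(w)
    return v @ (A.T @ (A @ v))         # Rayleigh quotient ~ largest eigenvalue of A^T A
```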

FISTA (Fast ISTA, Beck-Teboulle 2009)

Complexity: same per-iteration cost as ISTA, $O(MN)$. Convergence: $O(1/k^2)$ on the objective value.
Input: A, y, lambda, L = ||A||_2^2, x_0, tol.
Initialize: z_1 = x_0, t_1 = 1.
1: for k = 1, 2, ... do
2: grad <- A^T (A z_k - y)
3: x_k <- soft_threshold(z_k - (1/L)*grad, lambda/L) # ISTA step from z_k
4: t_{k+1} <- (1 + sqrt(1 + 4*t_k^2)) / 2 # Nesterov momentum
5: z_{k+1} <- x_k + ((t_k - 1)/t_{k+1}) * (x_k - x_{k-1}) # extrapolation
6: if ||x_k - x_{k-1}|| / ||x_{k-1}|| < tol then stop
7: end for
8: return x_k

FISTA performs the ISTA proximal step starting from the extrapolated point $\mathbf{z}_k$, not from $\mathbf{x}_{k-1}$. The extrapolation weight approaches $(k-1)/(k+2)$ as $k$ grows: "look a little bit ahead in the direction you were moving." This is Nesterov's momentum, adapted to the composite (smooth + proximal) setting.
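A NumPy sketch of the FISTA loop mirroring the pseudocode (the function name fista and its defaults are our choices):

```python
import numpy as np

def fista(A, y, lam, L=None, x0=None, tol=1e-4, max_iter=5000):
    """FISTA for min_x 0.5*||y - A x||_2^2 + lam*||x||_1."""
    if L is None:
        L = np.linalg.norm(A, 2) ** 2                # sigma_max(A)^2
    x_prev = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    z, t = x_prev.copy(), 1.0
    for _ in range(max_iter):
        grad = A.T @ (A @ z - y)
        v = z - grad / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)   # ISTA step from z
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0       # Nesterov momentum
        z = x + ((t - 1.0) / t_next) * (x - x_prev)              # extrapolation
        if np.linalg.norm(x - x_prev) <= tol * max(np.linalg.norm(x_prev), 1e-12):
            return x
        x_prev, t = x, t_next
    return x_prev
```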

Theorem: FISTA Convergence Rate

Under the same hypotheses as the ISTA convergence theorem, the FISTA iterates satisfy $$F(\mathbf{x}^{(k)}) - F(\mathbf{x}^\star) \;\leq\; \frac{2L\,\|\mathbf{x}^{(0)}-\mathbf{x}^\star\|_2^2}{(k+1)^2}.$$ The suboptimality decays as $O(1/k^2)$, a full order faster in $k$ than ISTA.

To reach accuracy $\varepsilon$, ISTA needs $O(1/\varepsilon)$ iterations; FISTA needs $O(1/\sqrt{\varepsilon})$. For $\varepsilon=10^{-4}$ this is the difference between $10^4$ and $10^2$ iterations. Nesterov showed in 1983 that this $O(1/k^2)$ rate is optimal among first-order methods for smooth convex optimization; Beck and Teboulle extended the result to the composite setting where $g$ is non-smooth.


Example: One ISTA Step in a Two-Dimensional LASSO

Let $\mathbf{A} = \begin{pmatrix}1 & 0.5\\ 0 & 1\end{pmatrix}$, $\mathbf{y}=(0.8,\,0.3)^\top$, $\lambda=0.2$, and $\mathbf{x}^{(0)}=(0,0)^\top$. Compute $L$, perform one ISTA step (with $\eta=1/L$), and report $\mathbf{x}^{(1)}$.
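A quick NumPy check of this example (the printed values are approximate):

```python
import numpy as np

A = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.array([0.8, 0.3])
lam, x = 0.2, np.zeros(2)

L = np.linalg.norm(A, 2) ** 2          # sigma_max(A)^2, about 1.64 here
eta = 1.0 / L
grad = A.T @ (A @ x - y)               # = -A^T y = [-0.8, -0.7]
v = x - eta * grad
x1 = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
print(L, x1)                           # x1 is roughly [0.37, 0.30]
```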

ISTA vs. FISTA Convergence

Compare ISTA and FISTA on a synthetic LASSO problem with a Gaussian sensing matrix. The plot shows $F(\mathbf{x}^{(k)}) - F^\star$ on a log scale. Adjust the sparsity level, the regularization $\lambda$, and the problem size to see how the $O(1/k)$ versus $O(1/k^2)$ gap manifests in practice.


Soft- vs. Hard-Thresholding

Visualize the shrinkage operators on a scalar input $u$. Soft-thresholding is continuous and shrinks even the surviving coordinates; hard-thresholding is discontinuous but preserves magnitudes above the threshold.


ISTA Iterates on a 2D LASSO

Animated trajectory of $\mathbf{x}^{(k)}$ under ISTA on a 2D LASSO. Contours of $F(\mathbf{x})$ are shown; the iterate slides toward the minimizer along the shrinkage diagonals.
⚠️ Engineering Note

Practical Deployment of ISTA/FISTA

In real systems, ISTA and FISTA are rarely run with a fixed step size. Production solvers combine three tricks: (i) backtracking line search, to avoid computing $L$ exactly; (ii) warm starts across $\lambda$ values (the regularization path), so solving a new problem requires only a few extra iterations; (iii) restart strategies for FISTA (O'Donoghue-CandΓ¨s 2015), which reset $t_k\leftarrow 1$ whenever monotonicity fails, recovering linear convergence under strong convexity while preserving the $O(1/k^2)$ worst case. Libraries such as SPAMS, SPGL1, and scikit-learn's Lasso ship with these heuristics.
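A sketch of the restart trick (iii) grafted onto the FISTA loop above, using a simple function-value test; this is an illustration, not the exact scheme of any particular library:

```python
import numpy as np

def fista_restart(A, y, lam, L, x0, max_iter=5000):
    """FISTA with a function-value restart: reset momentum whenever F increases."""
    F = lambda x: 0.5 * np.linalg.norm(y - A @ x) ** 2 + lam * np.linalg.norm(x, 1)
    x_prev, z, t = x0.copy(), x0.copy(), 1.0
    F_prev = F(x_prev)
    for _ in range(max_iter):
        v = z - A.T @ (A @ z - y) / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
        if F(x) > F_prev:                              # monotonicity failed
            z, t = x_prev.copy(), 1.0                  # restart: t_k <- 1, drop momentum
            continue
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t, F_prev = x, t_next, F(x)
    return x_prev
```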

Practical Constraints
  • β€’

    Compute LL via power iteration on A⊀A\mathbf{A}^\top\mathbf{A} (O(MN) per iter, 5-10 iters suffice).

  • β€’

    Stop when relative iterate change βˆ₯x(k+1)βˆ’x(k)βˆ₯/βˆ₯x(k)βˆ₯<10βˆ’4\|\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\|/\|\mathbf{x}^{(k)}\| < 10^{-4} or after a fixed budget.

  • β€’

    FISTA is non-monotone in FF β€” monitor the best iterate seen so far.

ISTA vs. FISTA vs. Subgradient vs. Interior Point

| Method | Per-Iter Cost | Iterations to $\varepsilon$ | Memory | Non-Smoothness |
| --- | --- | --- | --- | --- |
| Subgradient | $O(MN)$ | $O(1/\varepsilon^2)$ | $O(N)$ | handles via $\partial g$ |
| ISTA | $O(MN)$ | $O(1/\varepsilon)$ | $O(N)$ | handles via prox |
| FISTA | $O(MN)$ | $O(1/\sqrt{\varepsilon})$ | $O(N)$ | handles via prox |
| Interior Point | $O(N^3)$ | $O(\log(1/\varepsilon))$ | $O(N^2)$ | requires smoothing |

Common Mistake: Using $\eta > 1/L$

Mistake:

Picking the step size too large (e.g., $\eta=1$ on a problem where $L=10$) in the hope of converging faster.

Correction:

The ISTA/FISTA convergence proofs require $\eta\leq 1/L$. With $\eta>1/L$ the objective may diverge: the algorithm oscillates or explodes. Always estimate $L$ first, or use a backtracking line search that doubles $L$ whenever the descent condition fails.
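A sketch of that backtracking rule for one proximal-gradient step: start from a guess of $L$ and double it until the sufficient-decrease condition holds. The function name and constants are our choices:

```python
import numpy as np

def backtracking_prox_step(A, y, lam, x, L0=1.0, factor=2.0):
    """One ISTA step with backtracking on L (no prior knowledge of ||A||_2^2)."""
    f = lambda v: 0.5 * np.linalg.norm(y - A @ v) ** 2
    grad = A.T @ (A @ x - y)
    L = L0
    while True:
        v = x - grad / L
        x_new = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
        d = x_new - x
        # Descent condition: f(x_new) <= f(x) + grad.d + (L/2)||d||^2
        if f(x_new) <= f(x) + grad @ d + 0.5 * L * (d @ d):
            return x_new, L        # reuse this L as the starting guess next iteration
        L *= factor                # condition failed: increase L (shrink the step)
```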

Common Mistake: Forgetting That $\lambda$ Depends on the Scale of $\mathbf{A}$ and $\mathbf{y}$

Mistake:

Copying a $\lambda$ value from a paper where $\mathbf{A}$ was normalized (columns with unit $\ell_2$ norm) to a problem where it is not.

Correction:

The LASSO is not scale-invariant. A useful guideline: start with $\lambda_{\max} = \|\mathbf{A}^\top\mathbf{y}\|_\infty$ (above which the solution is $\mathbf{0}$) and scan $\lambda \in [10^{-3}\lambda_{\max},\,\lambda_{\max}]$ on a log grid.
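A small sketch of that guideline (the grid size of 20 points is an arbitrary choice):

```python
import numpy as np

def lambda_grid(A, y, n=20, ratio=1e-3):
    """Log-spaced regularization path from lambda_max down to ratio*lambda_max."""
    lam_max = np.linalg.norm(A.T @ y, np.inf)   # above this value the LASSO solution is 0
    return np.geomspace(lam_max, ratio * lam_max, n)
```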

Historical Note: From EM to FISTA: 40 Years of Shrinkage

1964-2009

The soft-thresholding operator goes back to wavelet denoising (Donoho-Johnstone 1994) and earlier robust statistics (Huber 1964). ISTA itself was rediscovered several times as an EM-type iteration: Figueiredo and Nowak (2003) derived it from a Gaussian-Laplace hierarchical model; Daubechies, Defrise and De Mol (2004) gave the first complete convergence analysis for inverse problems under $\ell_1$ penalties. The method was considered "too slow to be useful" until Beck and Teboulle's FISTA paper in 2009: they showed that Nesterov's 1983 momentum trick, originally developed for smooth optimization, extends to the composite case. Overnight, $\ell_1$ problems with $N=10^5$ became routine on a laptop.

Proximal operator

For a closed convex function $g$, $\mathrm{prox}_{\tau g}(\mathbf{u}) = \arg\min_{\mathbf{x}}\tfrac{1}{2}\|\mathbf{x}-\mathbf{u}\|_2^2 + \tau g(\mathbf{x})$. Generalizes projection (for indicator functions) and shrinkage (for norms) to arbitrary convex regularizers.

Related: Soft-Threshold Operator, Proximal Operator of the $\ell_1$ Norm

ISTA

Iterative Shrinkage-Thresholding Algorithm. A proximal-gradient method for the LASSO: alternate a gradient step on the quadratic data-fit term with a soft-threshold of the result. Converges at rate $O(1/k)$ on the objective.

Related: ISTA (Iterative Shrinkage-Thresholding Algorithm)

FISTA

Fast ISTA. ISTA plus Nesterov momentum, converging at $O(1/k^2)$, the optimal rate for first-order methods on composite convex problems.

Related: FISTA (Fast ISTA, Beck-Teboulle 2009)

Quick Check

To reduce the LASSO suboptimality from $10^{-2}$ to $10^{-6}$ (four orders of magnitude), roughly how many more iterations does ISTA need compared with FISTA?

The same number; both rates are $O(1/k)$.

ISTA needs $\sim$100 times more iterations.

FISTA is slower for high-precision targets.

ISTA needs 4 times more iterations.

Quick Check

What is $\mathrm{prox}_{0.3\|\cdot\|_1}\bigl((-0.5,\,0.2,\,1.0)^\top\bigr)$?

$(-0.2,\,0,\,0.7)^\top$

$(-0.5,\,0.2,\,1.0)^\top$

$(-0.5,\,0,\,1.0)^\top$

$(0,\,0,\,0.7)^\top$

Key Takeaway

ISTA is gradient descent plus a soft-threshold at every step; FISTA is ISTA plus Nesterov momentum. The convergence rates $O(1/k)$ and $O(1/k^2)$ follow from descent-lemma arguments and are both consequences of the convexity of the LASSO objective. In practice, FISTA with backtracking and adaptive restarts is the default first-order LASSO solver: it requires only matrix-vector products with $\mathbf{A}$, $O(N)$ memory, and two mat-vecs per iteration.