Greedy Algorithms: OMP, CoSaMP, IHT

The Case for Greediness

Convex relaxation solves $\ell_0$-sparse recovery by replacing the combinatorial $\|\mathbf{x}\|_0$ penalty with $\|\mathbf{x}\|_1$ and invoking ISTA/FISTA/ADMM. This works, but in three important regimes greedy algorithms are faster, simpler, and often more accurate: (i) known sparsity level $s$, so we can stop after exactly $s$ selections; (ii) very sparse signals ($s \ll M$), where the greedy search touches only a handful of columns of $\mathbf{A}$; (iii) speed-critical applications like real-time channel estimation or radar, where even FISTA is too expensive. This section develops the three canonical greedy methods: OMP, CoSaMP, and IHT. All three share one idea: identify a candidate support, solve a restricted least-squares problem on it, update. They differ in how they build the support.

Definition: Restricted Least-Squares

Given a support set $\mathcal{S} \subseteq \{1, \ldots, N\}$ with $|\mathcal{S}| \leq M$, the restricted least-squares estimate is

$$\hat{\mathbf{x}}_\mathcal{S} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{A}_\mathcal{S}\,\mathbf{w}\|_2^2 = \mathbf{A}_\mathcal{S}^{+}\,\mathbf{y},$$

where $\mathbf{A}_\mathcal{S}$ is the submatrix containing only the columns of $\mathbf{A}$ indexed by $\mathcal{S}$, and $\mathbf{A}_\mathcal{S}^{+}$ is its pseudo-inverse. The full estimate places these values on $\mathcal{S}$ and zeros elsewhere.

All three greedy algorithms produce their final iterate by a restricted least-squares solve on their estimated support. The cost is a QR or Cholesky factorization of $\mathbf{A}_\mathcal{S}^\top \mathbf{A}_\mathcal{S}$, which is cheap when $|\mathcal{S}|$ is small.
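The definition translates line-for-line into NumPy. A minimal sketch (the helper name restricted_ls is ours; it assumes a dense A and a small support):

import numpy as np

def restricted_ls(A, y, S):
    # Solve min_w ||y - A_S w||_2 on the columns indexed by S, then
    # embed the solution in a full-length vector with zeros off-support.
    S = sorted(S)
    w, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # applies the pseudo-inverse A_S^+
    x_hat = np.zeros(A.shape[1])
    x_hat[S] = w
    return x_hat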

OMP (Orthogonal Matching Pursuit)

Complexity: $O(s \cdot MN)$ for the correlations plus $O(s^3)$ for the incremental QR of $\mathbf{A}_{\mathcal{S}_k}$. Much cheaper than FISTA when $s \ll M$.
Input: A, y, sparsity s (or residual tolerance eps).
Initialize: residual r_0 = y, support S_0 = {}, iterate x_0 = 0.
1: for k = 1, 2, ..., s do
2: j* <- argmax_j | a_j^T r_{k-1} | # column best correlated with residual
3: S_k <- S_{k-1} U {j*} # enlarge support
4: w <- argmin_w || y - A_{S_k} w ||_2 # solve LS on current support
5: x_k <- embed(w, S_k) # put w at S_k, zeros elsewhere
6: r_k <- y - A_{S_k} w # update residual
7: if ||r_k||_2 < eps then break
8: end for
9: return x_k, S_k

The name orthogonal refers to step 4: after each selection we re-solve the least-squares problem over the current support, which makes the residual $\mathbf{r}_k$ orthogonal to every column in $\mathcal{S}_k$. Plain matching pursuit (Mallat-Zhang 1993) skipped this orthogonalization and was forced to revisit coordinates; OMP never revisits a chosen index.
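As a concrete reference, here is a minimal NumPy sketch of the pseudocode above (not a tuned implementation: it re-solves the full least-squares problem with lstsq each iteration instead of maintaining an incremental QR):

import numpy as np

def omp(A, y, s, eps=1e-10):
    N = A.shape[1]
    r = y.copy()                                   # r_0 = y
    S = []                                         # S_0 = {}
    x = np.zeros(N)
    for _ in range(s):
        j = int(np.argmax(np.abs(A.T @ r)))        # step 2: best-correlated column
        if j not in S:                             # numerical safeguard; in exact
            S.append(j)                            # arithmetic OMP never re-picks
        w, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # step 4: LS on support
        x = np.zeros(N)
        x[S] = w                                   # step 5: embed w on S
        r = y - A[:, S] @ w                        # step 6: residual, orthogonal to A_S
        if np.linalg.norm(r) < eps:                # step 7: early stop
            break
    return x, S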

Theorem: OMP Exact Recovery under Coherence

Let $\mathbf{A} \in \mathbb{R}^{M \times N}$ have unit-norm columns and coherence $\mu(\mathbf{A}) = \max_{i \neq j} |\mathbf{a}_i^\top \mathbf{a}_j|$. If $\mathbf{x}_\star$ is $s$-sparse with $s < \tfrac{1}{2}(1 + 1/\mu)$ and $\mathbf{y} = \mathbf{A}\mathbf{x}_\star$ (noiseless), then OMP recovers the true support and $\mathbf{x}_\star$ exactly in $s$ steps.

Low coherence means the columns of $\mathbf{A}$ are nearly orthogonal, so the inner product $\mathbf{a}_j^\top \mathbf{r}_{k-1}$ peaks at a true-support index. OMP then identifies one correct index per iteration. The condition $s < (1 + 1/\mu)/2$ ensures the peak is always on a correct index: the "interference" from other active coordinates cannot overpower the signal of the unselected true index.
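The hypothesis is cheap to check numerically. A small sketch on an illustrative random instance (note the coherence bound is notoriously pessimistic for random matrices, which is one reason RIP-based analyses are preferred there):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 160))
A /= np.linalg.norm(A, axis=0)               # unit-norm columns, as the theorem requires

G = np.abs(A.T @ A)
np.fill_diagonal(G, 0.0)                     # ignore the unit diagonal
mu = G.max()                                 # mutual coherence mu(A)
s_max = int(np.ceil(0.5 * (1 + 1/mu))) - 1   # largest s with s < (1 + 1/mu)/2
print(mu, s_max)                             # at this size, typically mu around 0.4-0.5, so s_max = 1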

CoSaMP (Compressive Sampling Matching Pursuit)

Complexity: per iteration, $O(MN)$ for the correlation plus $O(s^2 M)$ for LS on the merged support. Linear convergence under RIP: reaches $\varepsilon$ accuracy in $O(\log(\|\mathbf{x}_\star\|/\varepsilon))$ iterations.
Input: A, y, sparsity s, max iterations K, residual tolerance eps.
Initialize: x_0 = 0, residual r_0 = y.
1: for k = 1, 2, ..., K do
2: p <- A^T r_{k-1} # proxy (correlation with residual)
3: T <- indices of 2s largest |p_i| # top-2s candidates
4: S <- T U supp(x_{k-1}) # merge with current support (up to 3s)
5: w <- argmin_w || y - A_S w ||_2 # LS on merged support
6: x_k <- H_s(embed(w, S)) # keep s largest of w
7: r_k <- y - A x_k # update residual
8: if ||r_k||_2 < eps then break
9: end for
10: return x_k

CoSaMP (Needell-Tropp 2008) differs from OMP in two ways: it adds $2s$ candidates per iteration (not one), and it can remove wrongly selected indices via the final hard threshold $H_s$. This makes CoSaMP robust to noise and RIP-friendly. It converges in logarithmically many iterations, while OMP needs $s$.
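A direct NumPy transcription of the pseudocode (a sketch under the same inputs; the prune keeps the $s$ largest-magnitude entries):

import numpy as np

def cosamp(A, y, s, K=50, eps=1e-10):
    N = A.shape[1]
    x = np.zeros(N)
    r = y.copy()
    for _ in range(K):
        p = A.T @ r                                      # step 2: proxy correlations
        T = np.argsort(np.abs(p))[-2*s:]                 # step 3: top-2s candidates
        S = np.union1d(T, np.flatnonzero(x))             # step 4: merge supports (up to 3s)
        w, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # step 5: LS on merged support
        x = np.zeros(N)
        x[S] = w
        x[np.argsort(np.abs(x))[:-s]] = 0.0              # step 6: hard-threshold H_s
        r = y - A @ x                                    # step 7: residual
        if np.linalg.norm(r) < eps:                      # step 8: early stop
            break
    return x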

IHT (Iterative Hard Thresholding)

Complexity: $O(MN)$ per iteration. Linear convergence under RIP: $\|\mathbf{x}_k - \mathbf{x}_\star\| \leq c^k \|\mathbf{x}_\star\|$ for some $c < 1$.
Input: A, y, sparsity s, step size eta (e.g., eta = 1), tolerance tol.
Initialize: x_0 = 0.
1: for k = 0, 1, ... do
2: g <- A^T (A x_k - y) # gradient
3: x_{k+1} <- H_s( x_k - eta * g ) # gradient step + hard-threshold to s terms
4: if ||x_{k+1} - x_k|| <= tol * max(||x_k||, 1) then stop # safe at x_0 = 0
5: end for
6: return x_{k+1}

IHT is the combinatorial sibling of ISTA: replace the soft threshold with a hard threshold at sparsity $s$. It solves the non-convex $\ell_0$ problem directly: $\min_\mathbf{x} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2$ s.t. $\|\mathbf{x}\|_0 \leq s$. Convergence to the global minimum is guaranteed under RIP (Blumensath-Davies 2009).
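The whole method is a few lines of NumPy (a sketch; as in the pseudocode, the default eta = 1 presumes suitably normalized A):

import numpy as np

def iht(A, y, s, eta=1.0, max_iter=500, tol=1e-8):
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T @ (A @ x - y)                     # gradient of (1/2)||y - A x||_2^2
        z = x - eta * g                           # gradient step
        z[np.argsort(np.abs(z))[:-s]] = 0.0       # hard-threshold H_s: keep top-s entries
        if np.linalg.norm(z - x) <= tol * max(np.linalg.norm(x), 1.0):
            return z                              # stop on relative change (safe at x = 0)
        x = z
    return x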

Theorem: IHT Convergence under RIP

Let $\mathbf{A}$ satisfy the RIP of order $3s$ with constant $\delta_{3s} < 1/\sqrt{8} \approx 0.354$. Let $\mathbf{x}_\star$ be $s$-sparse and $\mathbf{y} = \mathbf{A}\mathbf{x}_\star + \mathbf{w}$. Then the IHT iterates with step size $\eta = 1$ satisfy

$$\|\mathbf{x}_k - \mathbf{x}_\star\|_2 \leq (2\sqrt{2}\,\delta_{3s})^k \|\mathbf{x}_\star\|_2 + \frac{4}{1 - 2\sqrt{2}\,\delta_{3s}}\|\mathbf{w}\|_2.$$

This is linear convergence to a noise-floor plateau: each iteration contracts the error by a factor $2\sqrt{2}\,\delta_{3s} < 1$, while the noise term leaves an unavoidable floor proportional to $\|\mathbf{w}\|_2$. Same qualitative story as CoSaMP.

OMP: Residual Energy vs. Iteration

Watch the residual $\|\mathbf{r}_k\|_2^2$ drop as OMP greedily picks columns. With low coherence, the residual hits zero exactly at iteration $s$. With high coherence (or noise), the residual plateaus at the noise floor.


Support Recovery: Greedy vs. LASSO

Sweep the true sparsity level $s_\star$ and plot the probability that each algorithm recovers the exact support. Compares OMP, CoSaMP, IHT, and LASSO/FISTA on random Gaussian sensing matrices.


OMP: Selecting Columns of $\mathbf{A}$ One at a Time

Visualization of OMP picking the column $\mathbf{a}_{j^\star}$ most aligned with the current residual.

Example: OMP on a 3-Column Problem

Let $\mathbf{A} = [\mathbf{a}_1 \,|\, \mathbf{a}_2 \,|\, \mathbf{a}_3]$ with unit-norm columns and $\mathbf{a}_1^\top \mathbf{a}_2 = 0.1$, $\mathbf{a}_1^\top \mathbf{a}_3 = 0.2$, $\mathbf{a}_2^\top \mathbf{a}_3 = 0.05$. Let $\mathbf{x}_\star = (2,\, 0,\, -1)^\top$ (so $s_\star = 2$) and $\mathbf{y} = \mathbf{A}\mathbf{x}_\star$. Execute two iterations of OMP.
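Worked solution (OMP sees $\mathbf{A}$ only through inner products, so everything follows from the Gram data above). Iteration 1: the correlations with $\mathbf{y}$ are $\mathbf{a}_1^\top\mathbf{y} = 2 - 0.2 = 1.8$, $\mathbf{a}_2^\top\mathbf{y} = 0.2 - 0.05 = 0.15$, $\mathbf{a}_3^\top\mathbf{y} = 0.4 - 1 = -0.6$, so OMP selects $j^\star = 1$; restricted LS on $\{1\}$ gives $w = 1.8$ and residual $\mathbf{r}_1 = \mathbf{y} - 1.8\,\mathbf{a}_1$. Iteration 2: $\mathbf{a}_2^\top\mathbf{r}_1 = 0.15 - 1.8(0.1) = -0.03$ and $\mathbf{a}_3^\top\mathbf{r}_1 = -0.6 - 1.8(0.2) = -0.96$, so OMP selects $j^\star = 3$. The final LS on $\mathcal{S} = \{1, 3\}$ solves the $2 \times 2$ normal equations $\begin{pmatrix}1 & 0.2\\ 0.2 & 1\end{pmatrix}\mathbf{w} = \begin{pmatrix}1.8\\ -0.6\end{pmatrix}$, giving $\mathbf{w} = (2, -1)^\top$: exact recovery in two steps.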

πŸ”§ Engineering Note

When to Pick Greedy Over LASSO

Rule of thumb: if $s$ is known (a priori or from a reliable estimator) and the measurement matrix has low coherence or satisfies a strong RIP, greedy methods are faster and simpler. If $s$ is unknown, or the measurement matrix has high coherence, LASSO/FISTA is more forgiving. A practical compromise: run OMP first to identify a support, then refine with a debiased LS solve on that support (the "OMP-debiasing" trick); this combines the speed of greedy selection with the unbiased coefficients of LS.

Practical Constraints
β€’ OMP is not robust to measurement noise once the residual drops below the noise floor; stop early.
β€’ CoSaMP and IHT need RIP, which is hard to verify but holds w.h.p. for random Gaussian/Bernoulli matrices.
β€’ For streaming data, IHT with a small step size acts as an online sparse estimator.

OMP vs. CoSaMP vs. IHT vs. LASSO

Algorithm     | Needs $s$?    | Per-Iter Cost       | Iterations                | Guarantee
OMP           | yes (usually) | $O(MN) + O(k^2)$    | $s$                       | Coherence / ERC
CoSaMP        | yes           | $O(MN) + O(s^2 M)$  | $O(\log(1/\varepsilon))$  | RIP-$4s$
IHT           | yes           | $O(MN)$             | $O(\log(1/\varepsilon))$  | RIP-$3s$
LASSO / FISTA | no            | $O(MN)$             | $O(1/\sqrt{\varepsilon})$ | RIP-$2s$

Common Mistake: Running OMP on a High-Coherence Dictionary

Mistake:

Applying OMP to an overcomplete dictionary with $\mu(\mathbf{A})$ close to $1$ and expecting exact support recovery.

Correction:

The ERC guarantee $s < (1 + 1/\mu)/2$ degrades rapidly as $\mu \to 1$. For $\mu = 0.9$ only $s = 1$ is safe; OMP will almost certainly pick a wrong index at iteration 2. Either pre-process with dictionary decoherence (e.g., Gram-Schmidt), or use LASSO, which is forgiving of coherence when the signal has distinguishable magnitudes.

Common Mistake: IHT Step Size Too Small

Mistake:

Choosing $\eta = 1/L$ with $L = \|\mathbf{A}\|_2^2$, expecting ISTA-like convergence.

Correction:

IHT's RIP convergence analysis requires $\eta \approx 1$, assuming $\mathbf{A}$ has columns normalized to unit $\ell_2$ norm. With $\eta = 1/L$, convergence is typically much slower. Rescale $\mathbf{A}$ so that its columns have unit norm, then use $\eta = 1$.
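A sketch of the rescaling (assuming A, y, s as before and the iht sketch above; note the recovered coefficients must be un-scaled at the end):

import numpy as np

norms = np.linalg.norm(A, axis=0)   # column l2 norms of the raw sensing matrix
A_unit = A / norms                  # rescale columns to unit norm
x_unit = iht(A_unit, y, s)          # eta = 1 is now the right step size
x_hat = x_unit / norms              # undo the scaling: A @ x_hat == A_unit @ x_unit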

Historical Note: Matching Pursuit, Orthogonal MP, and Compressed Sensing

1993-2009

Matching Pursuit was introduced by Mallat and Zhang (1993) in the context of wavelet dictionaries. Pati, Rezaiifar and Krishnaprasad (1993) and independently Davis, Mallat and Avellaneda (1994) added the orthogonalization step, producing OMP. Its role exploded in the 2000s with compressed sensing: Tropp's "Greed is Good" (2004) gave the first modern coherence analysis. CoSaMP (Needell-Tropp 2008), Subspace Pursuit (Dai-Milenkovic 2009), and IHT (Blumensath-Davies 2009) emerged almost simultaneously, all motivated by the need for provable RIP-based guarantees matching those of BPDN/LASSO but with lower per-iteration cost.

OMP

Orthogonal Matching Pursuit. At each step, picks the column of $\mathbf{A}$ most correlated with the current residual, adds it to the support, and solves restricted least-squares. Terminates after $s$ steps.

Related: OMP (Orthogonal Matching Pursuit)

CoSaMP

Compressive Sampling Matching Pursuit. Adds $2s$ candidates per iteration and uses a final hard threshold to prune back to $s$. Linear convergence under RIP.

Related: CoSaMP (Compressive Sampling Matching Pursuit)

IHT

Iterative Hard Thresholding. Projected gradient descent onto the set of $s$-sparse vectors. Solves the non-convex $\ell_0$ problem and converges linearly under RIP.

Related: IHT (Iterative Hard Thresholding)

Quick Check

Which scenario favours OMP over LASSO?

Unknown sparsity level and high measurement noise.

Known small sparsity, low coherence, ultra-fast inference required.

Overcomplete dictionary with high mutual coherence.

Distributed data across multiple servers.

Quick Check

IHT converges at what rate under RIP?

Sublinear $O(1/k)$

Sublinear $O(1/k^2)$

Linear ($c^k$) for some $c < 1$

No convergence guarantee

Key Takeaway

Greedy algorithms trade the convex-optimization guarantee for combinatorial directness. OMP picks one column per iteration (runtime $\propto s$); CoSaMP and IHT pick or adjust many indices per iteration and converge in $O(\log(1/\varepsilon))$ steps under RIP. When you know $s$, the sensing matrix is well-conditioned, and the SNR is moderate to high, greedy methods are often the fastest path to a sparse estimate.