Greedy Algorithms: OMP, CoSaMP, IHT

The Case for Greediness

Convex relaxation solves $\ell_0$-sparse recovery by replacing the combinatorial $\|\mathbf{x}\|_0$ penalty with $\|\mathbf{x}\|_1$ and invoking ISTA/FISTA/ADMM. This works, but in three important regimes greedy algorithms are faster, simpler, and often more accurate: (i) known sparsity level $s$, so we can stop after exactly $s$ selections; (ii) very sparse signals ($s \ll M$), where the greedy search touches only a handful of columns of $\mathbf{A}$; (iii) speed-critical applications like real-time channel estimation or radar, where even FISTA is too expensive. This section develops the three canonical greedy methods: OMP, CoSaMP, and IHT. All three share one idea: identify a candidate support, solve a restricted least-squares problem on it, update. They differ in how they build the support.

Definition: Restricted Least-Squares

Given a support set $\mathcal{S} \subseteq \{1, \ldots, N\}$ with $|\mathcal{S}| \leq M$, the restricted least-squares estimate is

$$\hat{\mathbf{x}}_\mathcal{S} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{A}_\mathcal{S}\,\mathbf{w}\|_2^2 = \mathbf{A}_\mathcal{S}^{+}\,\mathbf{y},$$

where $\mathbf{A}_\mathcal{S}$ is the submatrix containing only the columns of $\mathbf{A}$ indexed by $\mathcal{S}$, and $\mathbf{A}_\mathcal{S}^{+}$ is its pseudo-inverse. The full estimate places these values on $\mathcal{S}$ and zeros elsewhere.

All three greedy algorithms produce their final iterate by a restricted least-squares solve on their estimated support. The cost is a QR or Cholesky factorization of $\mathbf{A}_\mathcal{S}^\top \mathbf{A}_\mathcal{S}$, which is cheap when $|\mathcal{S}|$ is small.
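The definition translates line-for-line into NumPy. A minimal sketch (the helper name restricted_ls is ours; it assumes a dense A and a small support):

import numpy as np

def restricted_ls(A, y, S):
    # Solve min_w ||y - A_S w||_2 on the columns indexed by S, then
    # embed the solution in a full-length vector with zeros off-support.
    S = sorted(S)
    w, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # applies the pseudo-inverse A_S^+
    x_hat = np.zeros(A.shape[1])
    x_hat[S] = w
    return x_hat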

OMP (Orthogonal Matching Pursuit)

Complexity: $O(s \cdot MN)$ for the correlations plus $O(s^3)$ for the incremental QR of $\mathbf{A}_{\mathcal{S}_k}$. Much cheaper than FISTA when $s \ll M$.
Input: A, y, sparsity s (or residual tolerance eps).
Initialize: residual r_0 = y, support S_0 = {}, iterate x_0 = 0.
1: for k = 1, 2, ..., s do
2: j* <- argmax_j | a_j^T r_{k-1} | # column best correlated with residual
3: S_k <- S_{k-1} U {j*} # enlarge support
4: w <- argmin_w || y - A_{S_k} w ||_2 # solve LS on current support
5: x_k <- embed(w, S_k) # put w at S_k, zeros elsewhere
6: r_k <- y - A_{S_k} w # update residual
7: if ||r_k||_2 < eps then break
8: end for
9: return x_k, S_k

The name orthogonal refers to step 4: after each selection we re-solve the least-squares problem over the current support, which makes the residual $\mathbf{r}_k$ orthogonal to every column in $\mathcal{S}_k$. Plain matching pursuit (Mallat-Zhang 1993) skipped this orthogonalization and was forced to revisit coordinates; OMP never revisits a chosen index.
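As a concrete reference, here is a minimal NumPy sketch of the pseudocode above (not a tuned implementation: it re-solves the full least-squares problem with lstsq each iteration instead of maintaining an incremental QR):

import numpy as np

def omp(A, y, s, eps=1e-10):
    N = A.shape[1]
    r = y.copy()                                   # r_0 = y
    S = []                                         # S_0 = {}
    x = np.zeros(N)
    for _ in range(s):
        j = int(np.argmax(np.abs(A.T @ r)))        # step 2: best-correlated column
        if j not in S:                             # numerical safeguard; in exact
            S.append(j)                            # arithmetic OMP never re-picks
        w, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # step 4: LS on support
        x = np.zeros(N)
        x[S] = w                                   # step 5: embed w on S
        r = y - A[:, S] @ w                        # step 6: residual, orthogonal to A_S
        if np.linalg.norm(r) < eps:                # step 7: early stop
            break
    return x, S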

Theorem: OMP Exact Recovery under Coherence

Let $\mathbf{A} \in \mathbb{R}^{M \times N}$ have unit-norm columns and coherence $\mu(\mathbf{A}) = \max_{i \neq j} |\mathbf{a}_i^\top \mathbf{a}_j|$. If $\mathbf{x}_\star$ is $s$-sparse with $s < \tfrac{1}{2}(1 + 1/\mu)$ and $\mathbf{y} = \mathbf{A}\mathbf{x}_\star$ (noiseless), then OMP recovers the true support and $\mathbf{x}_\star$ exactly in $s$ steps.

Low coherence means the columns of $\mathbf{A}$ are nearly orthogonal, so the inner product $\mathbf{a}_j^\top \mathbf{r}_{k-1}$ peaks at a true-support index. OMP then identifies one correct index per iteration. The condition $s < (1 + 1/\mu)/2$ ensures the peak is always on a correct index: the "interference" from other active coordinates cannot overpower the signal of the unselected true index.
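The hypothesis is cheap to check numerically. A small sketch on an illustrative random instance (note the coherence bound is notoriously pessimistic for random matrices, which is one reason RIP-based analyses are preferred there):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 160))
A /= np.linalg.norm(A, axis=0)               # unit-norm columns, as the theorem requires

G = np.abs(A.T @ A)
np.fill_diagonal(G, 0.0)                     # ignore the unit diagonal
mu = G.max()                                 # mutual coherence mu(A)
s_max = int(np.ceil(0.5 * (1 + 1/mu))) - 1   # largest s with s < (1 + 1/mu)/2
print(mu, s_max)                             # at this size, typically mu around 0.4-0.5, so s_max = 1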

CoSaMP (Compressive Sampling Matching Pursuit)

Complexity: per iteration, $O(MN)$ for the correlation plus $O(s^2 M)$ for LS on the merged support. Linear convergence under RIP: reaches $\varepsilon$ accuracy in $O(\log(\|\mathbf{x}_\star\|/\varepsilon))$ iterations.
Input: A, y, sparsity s, max iterations K, residual tolerance eps.
Initialize: x_0 = 0, residual r_0 = y.
1: for k = 1, 2, ..., K do
2: p <- A^T r_{k-1} # proxy (correlation with residual)
3: T <- indices of 2s largest |p_i| # top-2s candidates
4: S <- T U supp(x_{k-1}) # merge with current support (up to 3s)
5: w <- argmin_w || y - A_S w ||_2 # LS on merged support
6: x_k <- H_s(embed(w, S)) # keep s largest of w
7: r_k <- y - A x_k # update residual
8: if ||r_k||_2 < eps then break
9: end for
10: return x_k

CoSaMP (Needell-Tropp 2008) differs from OMP in two ways: it adds $2s$ candidates per iteration (not one), and it can remove wrongly selected indices via the final hard threshold $H_s$. This makes CoSaMP robust to noise and RIP-friendly. It converges in logarithmically many iterations, while OMP needs $s$.
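A direct NumPy transcription of the pseudocode (a sketch under the same inputs; the prune keeps the $s$ largest-magnitude entries):

import numpy as np

def cosamp(A, y, s, K=50, eps=1e-10):
    N = A.shape[1]
    x = np.zeros(N)
    r = y.copy()
    for _ in range(K):
        p = A.T @ r                                      # step 2: proxy correlations
        T = np.argsort(np.abs(p))[-2*s:]                 # step 3: top-2s candidates
        S = np.union1d(T, np.flatnonzero(x))             # step 4: merge supports (up to 3s)
        w, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # step 5: LS on merged support
        x = np.zeros(N)
        x[S] = w
        x[np.argsort(np.abs(x))[:-s]] = 0.0              # step 6: hard-threshold H_s
        r = y - A @ x                                    # step 7: residual
        if np.linalg.norm(r) < eps:                      # step 8: early stop
            break
    return x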

IHT (Iterative Hard Thresholding)

Complexity: $O(MN)$ per iteration. Linear convergence under RIP: $\|\mathbf{x}_k - \mathbf{x}_\star\| \leq c^k \|\mathbf{x}_\star\|$ for some $c < 1$.
Input: A, y, sparsity s, step size eta (e.g., eta = 1), tolerance tol.
Initialize: x_0 = 0.
1: for k = 0, 1, ... do
2: g <- A^T (A x_k - y) # gradient
3: x_{k+1} <- H_s( x_k - eta * g ) # gradient step + hard-threshold to s terms
4: if ||x_{k+1} - x_k|| <= tol * max(||x_k||, 1) then stop # safe at x_0 = 0
5: end for
6: return x_{k+1}

IHT is the combinatorial sibling of ISTA: replace the soft threshold with a hard threshold at sparsity $s$. It solves the non-convex $\ell_0$ problem directly: $\min_\mathbf{x} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2$ s.t. $\|\mathbf{x}\|_0 \leq s$. Convergence to the global minimum is guaranteed under RIP (Blumensath-Davies 2009).
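The whole method is a few lines of NumPy (a sketch; as in the pseudocode, the default eta = 1 presumes suitably normalized A):

import numpy as np

def iht(A, y, s, eta=1.0, max_iter=500, tol=1e-8):
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T @ (A @ x - y)                     # gradient of (1/2)||y - A x||_2^2
        z = x - eta * g                           # gradient step
        z[np.argsort(np.abs(z))[:-s]] = 0.0       # hard-threshold H_s: keep top-s entries
        if np.linalg.norm(z - x) <= tol * max(np.linalg.norm(x), 1.0):
            return z                              # stop on relative change (safe at x = 0)
        x = z
    return x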

Theorem: IHT Convergence under RIP

Let $\mathbf{A}$ satisfy the RIP of order $3s$ with constant $\delta_{3s} < 1/\sqrt{8} \approx 0.354$. Let $\mathbf{x}_\star$ be $s$-sparse and $\mathbf{y} = \mathbf{A}\mathbf{x}_\star + \mathbf{w}$. Then the IHT iterates with step size $\eta = 1$ satisfy

$$\|\mathbf{x}_k - \mathbf{x}_\star\|_2 \leq (2\sqrt{2}\,\delta_{3s})^k \|\mathbf{x}_\star\|_2 + \frac{4}{1 - 2\sqrt{2}\,\delta_{3s}}\|\mathbf{w}\|_2.$$

This is linear convergence to a noise-floor plateau: each iteration contracts the error by a factor $2\sqrt{2}\,\delta_{3s} < 1$, while the noise term leaves an unavoidable floor proportional to $\|\mathbf{w}\|_2$. Same qualitative story as CoSaMP.

OMP: Residual Energy vs. Iteration

Watch the residual $\|\mathbf{r}_k\|_2^2$ drop as OMP greedily picks columns. With low coherence, the residual hits zero exactly at iteration $s$. With high coherence (or noise), the residual plateaus at the noise floor.


Support Recovery: Greedy vs. LASSO

Sweep the true sparsity level $s_\star$ and plot the probability that each algorithm recovers the exact support. Compares OMP, CoSaMP, IHT, and LASSO/FISTA on random Gaussian sensing matrices.


OMP: Selecting Columns of $\mathbf{A}$ One at a Time

Visualization of OMP picking the column $\mathbf{a}_{j^\star}$ most aligned with the current residual.

Example: OMP on a 3-Column Problem

Let $\mathbf{A} = [\mathbf{a}_1 \,|\, \mathbf{a}_2 \,|\, \mathbf{a}_3]$ with unit-norm columns and $\mathbf{a}_1^\top \mathbf{a}_2 = 0.1$, $\mathbf{a}_1^\top \mathbf{a}_3 = 0.2$, $\mathbf{a}_2^\top \mathbf{a}_3 = 0.05$. Let $\mathbf{x}_\star = (2,\, 0,\, -1)^\top$ (so $s_\star = 2$) and $\mathbf{y} = \mathbf{A}\mathbf{x}_\star$. Execute two iterations of OMP.
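Worked solution (OMP sees $\mathbf{A}$ only through inner products, so everything follows from the Gram data above). Iteration 1: the correlations with $\mathbf{y}$ are $\mathbf{a}_1^\top\mathbf{y} = 2 - 0.2 = 1.8$, $\mathbf{a}_2^\top\mathbf{y} = 0.2 - 0.05 = 0.15$, $\mathbf{a}_3^\top\mathbf{y} = 0.4 - 1 = -0.6$, so OMP selects $j^\star = 1$; restricted LS on $\{1\}$ gives $w = 1.8$ and residual $\mathbf{r}_1 = \mathbf{y} - 1.8\,\mathbf{a}_1$. Iteration 2: $\mathbf{a}_2^\top\mathbf{r}_1 = 0.15 - 1.8(0.1) = -0.03$ and $\mathbf{a}_3^\top\mathbf{r}_1 = -0.6 - 1.8(0.2) = -0.96$, so OMP selects $j^\star = 3$. The final LS on $\mathcal{S} = \{1, 3\}$ solves the $2 \times 2$ normal equations $\begin{pmatrix}1 & 0.2\\ 0.2 & 1\end{pmatrix}\mathbf{w} = \begin{pmatrix}1.8\\ -0.6\end{pmatrix}$, giving $\mathbf{w} = (2, -1)^\top$: exact recovery in two steps.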

πŸ”§ Engineering Note

When to Pick Greedy Over LASSO

Rule of thumb: if $s$ is known (a priori or from a reliable estimator) and the measurement matrix has low coherence or satisfies a strong RIP, greedy methods are faster and simpler. If $s$ is unknown, or the measurement matrix has high coherence, LASSO/FISTA is more forgiving. A practical compromise: run OMP first to identify a support, then refine with a debiased LS solve on that support (the "OMP-debiasing" trick); this combines the speed of greedy selection with the unbiased coefficients of LS.

Practical Constraints
β€’ OMP is not robust to measurement noise once the residual drops below the noise floor; stop early.
β€’ CoSaMP and IHT need RIP, which is hard to verify but holds w.h.p. for random Gaussian/Bernoulli matrices.
β€’ For streaming data, IHT with a small step size acts as an online sparse estimator.

OMP vs. CoSaMP vs. IHT vs. LASSO

Algorithm     | Needs $s$?    | Per-Iter Cost       | Iterations                | Guarantee
OMP           | yes (usually) | $O(MN) + O(k^2)$    | $s$                       | Coherence / ERC
CoSaMP        | yes           | $O(MN) + O(s^2 M)$  | $O(\log(1/\varepsilon))$  | RIP-$4s$
IHT           | yes           | $O(MN)$             | $O(\log(1/\varepsilon))$  | RIP-$3s$
LASSO / FISTA | no            | $O(MN)$             | $O(1/\sqrt{\varepsilon})$ | RIP-$2s$

Common Mistake: Running OMP on a High-Coherence Dictionary

Mistake:

Applying OMP to an overcomplete dictionary with $\mu(\mathbf{A})$ close to $1$ and expecting exact support recovery.

Correction:

The ERC guarantee $s < (1 + 1/\mu)/2$ degrades rapidly as $\mu \to 1$. For $\mu = 0.9$ only $s = 1$ is safe; OMP will almost certainly pick a wrong index at iteration 2. Either pre-process with dictionary decoherence (e.g., Gram-Schmidt), or use LASSO, which is forgiving of coherence when the signal has distinguishable magnitudes.

Common Mistake: IHT Step Size Too Small

Mistake:

Choosing $\eta = 1/L$ with $L = \|\mathbf{A}\|_2^2$, expecting ISTA-like convergence.

Correction:

IHT's RIP convergence analysis requires $\eta \approx 1$, assuming $\mathbf{A}$ has columns normalized to unit $\ell_2$ norm. With $\eta = 1/L$, convergence is typically much slower. Rescale $\mathbf{A}$ so that its columns have unit norm, then use $\eta = 1$.
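A sketch of the rescaling (assuming A, y, s as before and the iht sketch above; note the recovered coefficients must be un-scaled at the end):

import numpy as np

norms = np.linalg.norm(A, axis=0)   # column l2 norms of the raw sensing matrix
A_unit = A / norms                  # rescale columns to unit norm
x_unit = iht(A_unit, y, s)          # eta = 1 is now the right step size
x_hat = x_unit / norms              # undo the scaling: A @ x_hat == A_unit @ x_unit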

Historical Note: Matching Pursuit, Orthogonal MP, and Compressed Sensing

1993-2009

Matching Pursuit was introduced by Mallat and Zhang (1993) in the context of wavelet dictionaries. Pati, Rezaiifar and Krishnaprasad (1993) and independently Davis, Mallat and Avellaneda (1994) added the orthogonalization step, producing OMP. Its role exploded in the 2000s with compressed sensing: Tropp's "Greed is Good" (2004) gave the first modern coherence analysis. CoSaMP (Needell-Tropp 2008), Subspace Pursuit (Dai-Milenkovic 2009), and IHT (Blumensath-Davies 2009) emerged almost simultaneously, all motivated by the need for provable RIP-based guarantees matching those of BPDN/LASSO but with lower per-iteration cost.

OMP

Orthogonal Matching Pursuit. At each step, picks the column of $\mathbf{A}$ most correlated with the current residual, adds it to the support, and solves restricted least-squares. Terminates after $s$ steps.

Related: OMP (Orthogonal Matching Pursuit)

CoSaMP

Compressive Sampling Matching Pursuit. Adds $2s$ candidates per iteration and uses a final hard threshold to prune back to $s$. Linear convergence under RIP.

Related: CoSaMP (Compressive Sampling Matching Pursuit)

IHT

Iterative Hard Thresholding. Projected gradient descent onto the set of $s$-sparse vectors. Solves the non-convex $\ell_0$ problem and converges linearly under RIP.

Related: IHT (Iterative Hard Thresholding)

Quick Check

Which scenario favours OMP over LASSO?

Unknown sparsity level and high measurement noise.

Known small sparsity, low coherence, ultra-fast inference required.

Overcomplete dictionary with high mutual coherence.

Distributed data across multiple servers.

Quick Check

IHT converges at what rate under RIP?

Sublinear $O(1/k)$

Sublinear $O(1/k^2)$

Linear ($c^k$) for some $c < 1$

No convergence guarantee

Key Takeaway

Greedy algorithms trade the convex-optimization guarantee for combinatorial directness. OMP picks one column per iteration (runtime $\propto s$); CoSaMP and IHT pick or adjust many indices per iteration and converge in $O(\log(1/\varepsilon))$ steps under RIP. When you know $s$, the sensing matrix is well-conditioned, and the SNR is moderate to high, greedy methods are often the fastest path to a sparse estimate.