The Sparse Recovery Problem

Why Bother with Underdetermined Systems?

Classical linear estimation asks: given $M \geq N$ noisy observations $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$, how well can we recover $\mathbf{x} \in \mathbb{R}^N$? The least-squares answer is complete, and for $M = N$ with $\mathbf{A}$ invertible we recover $\mathbf{x}$ exactly in the noiseless case.

But many modern problems fall on the opposite side: $M \ll N$. A camera takes 10-megapixel images, but the underlying scene has a few dozen edges. A medical MRI scan could take an hour if all $k$-space samples were acquired; patients cannot hold still that long. An RF imager observes a few hundred pilot responses but must locate scatterers on a grid of $10^5$ voxels. In each case the system $\mathbf{y} = \mathbf{A}\mathbf{x}$ is massively underdetermined: infinitely many $\mathbf{x}$ fit the data.

What saves us is that the true $\mathbf{x}$ is sparse, or nearly so. Sparsity is a structural prior that collapses the infinite solution set to (hopefully) a unique point. This chapter makes that intuition rigorous. We prove that, under certain random designs, a number of measurements $M$ proportional to $s \log(N/s)$ (far fewer than $N$) suffices to recover an $s$-sparse vector exactly, via a tractable convex program.

Definition: $s$-Sparse Vector

A vector $\mathbf{x} \in \mathbb{R}^N$ is called $s$-sparse if it has at most $s$ nonzero entries, i.e.,

$$\|\mathbf{x}\|_0 := |\mathrm{supp}(\mathbf{x})| = |\{i : x_i \neq 0\}| \leq s.$$

The set of all $s$-sparse vectors is denoted $\Sigma_s = \{\mathbf{x} \in \mathbb{R}^N : \|\mathbf{x}\|_0 \leq s\}$.

The quantity $\|\mathbf{x}\|_0$ is called the "$\ell_0$ norm" by convention, but it is not a norm: it fails homogeneity ($\|\alpha\mathbf{x}\|_0 = \|\mathbf{x}\|_0$ for $\alpha \neq 0$), and the set $\Sigma_s$ is a non-convex union of $\binom{N}{s}$ linear subspaces.
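
As a quick numerical check of these properties, a minimal NumPy sketch (the example vector is an arbitrary illustration):

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -1.5, 0.0])   # a 2-sparse vector in R^5

l0 = np.count_nonzero(x)                    # the "l0 norm": number of nonzero entries
print(l0)                                   # 2

# Scaling does not change the support, so homogeneity fails:
print(np.count_nonzero(2 * x))              # still 2, not 2 * l0
```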

Definition: Compressible Vector

A vector $\mathbf{x} \in \mathbb{R}^N$ is compressible if its sorted magnitudes decay rapidly. Formally, if $|x_{(1)}| \geq |x_{(2)}| \geq \cdots \geq |x_{(N)}|$ denote the order statistics of $|\mathbf{x}|$, then $\mathbf{x}$ is compressible with exponent $p \in (0, 1]$ if

$$|x_{(k)}| \leq C \cdot k^{-1/p}, \quad k = 1, \ldots, N,$$

for some constant $C > 0$.

Real signals (images in wavelet bases, audio in DCT) are rarely exactly sparse but are typically compressible. The $s$-term approximation error $\sigma_s(\mathbf{x})_2 = \|\mathbf{x} - \mathbf{x}_{[s]}\|_2$, where $\mathbf{x}_{[s]}$ keeps the $s$ largest entries, decays like $s^{1/2 - 1/p}$ for $p < 2$.
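
The decay of the $s$-term approximation error is easy to verify numerically. A minimal sketch, assuming an idealized power-law signal with illustrative constants:

```python
import numpy as np

N, p, C = 10_000, 0.5, 1.0
k = np.arange(1, N + 1)
x = C * k ** (-1.0 / p)          # already magnitude-sorted: |x_(k)| = C k^{-1/p}

def sigma_s(x_sorted, s):
    """l2 error of the best s-term approximation of a magnitude-sorted vector."""
    return np.linalg.norm(x_sorted[s:])

for s in (10, 100, 1000):
    # the error tracks s^{1/2 - 1/p} up to a constant factor
    print(s, sigma_s(x, s), s ** (0.5 - 1.0 / p))
```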

Definition: Sparse Recovery Problem

Let $\mathbf{A} \in \mathbb{R}^{M \times N}$ with $M \ll N$, and let $\mathbf{x}^\star \in \Sigma_s$. The sparse recovery problem is: given

$$\mathbf{y} = \mathbf{A}\mathbf{x}^\star + \mathbf{w},$$

where $\mathbf{w}$ is a noise vector with $\|\mathbf{w}\|_2 \leq \eta$, produce an estimate $\hat{\mathbf{x}}$ such that $\hat{\mathbf{x}} \approx \mathbf{x}^\star$.

In the noiseless case ($\mathbf{w} = \mathbf{0}$) we seek exact recovery: $\hat{\mathbf{x}} = \mathbf{x}^\star$.

Infinitely Many Candidates

Without any prior, the equation $\mathbf{A}\mathbf{x} = \mathbf{y}$ with $M < N$ has either no solutions (inconsistent) or an affine subspace of solutions of dimension $N - \mathrm{rank}(\mathbf{A}) \geq N - M$. Every $\mathbf{x}$ in that affine subspace is consistent with the data. Sparsity picks out one.
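
This non-uniqueness is easy to exhibit: take one particular solution and add any null-space vector. A minimal sketch on an arbitrary random instance (dimensions chosen only for illustration):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
M, N = 4, 10
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)

x_p = np.linalg.lstsq(A, y, rcond=None)[0]   # one particular (minimum-norm) solution
Z = null_space(A)                             # basis for null(A), shape (N, N - M)

# Every x_p + Z @ c is also consistent with the data:
c = rng.standard_normal(Z.shape[1])
x_alt = x_p + Z @ c
print(np.allclose(A @ x_p, y), np.allclose(A @ x_alt, y))   # True True
```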

Theorem: $\ell_0$ Minimization: The Ideal but Intractable Recovery

Consider the noiseless problem $\mathbf{y} = \mathbf{A}\mathbf{x}^\star$ with $\|\mathbf{x}^\star\|_0 \leq s$. If every $2s$ columns of $\mathbf{A}$ are linearly independent (equivalently, $\mathrm{spark}(\mathbf{A}) > 2s$), then $\mathbf{x}^\star$ is the unique solution of

$$(P_0)\qquad \min_{\mathbf{x} \in \mathbb{R}^N} \|\mathbf{x}\|_0 \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{y}.$$

Proof sketch: if two different $s$-sparse vectors $\mathbf{x}^\star, \tilde{\mathbf{x}}$ both satisfied $\mathbf{A}\mathbf{x} = \mathbf{y}$, then their difference $\mathbf{x}^\star - \tilde{\mathbf{x}}$ would be $2s$-sparse and lie in the null space of $\mathbf{A}$. If no $2s$ columns are linearly dependent, the only such vector is zero, forcing $\mathbf{x}^\star = \tilde{\mathbf{x}}$.


Theorem: $\ell_0$ Minimization is NP-Hard

The problem $(P_0)$, minimizing $\|\mathbf{x}\|_0$ subject to $\mathbf{A}\mathbf{x} = \mathbf{y}$, is NP-hard in general. More precisely, no algorithm solves all instances in time polynomial in $(M, N)$ unless $\text{P} = \text{NP}$.

The feasible set is an affine subspace; the objective $\|\mathbf{x}\|_0$ counts supports. Solving $(P_0)$ amounts to searching over all $\binom{N}{s}$ supports of size $s$ and checking feasibility on each, which is combinatorial.
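
To make the combinatorial cost concrete, here is a brute-force sketch of $(P_0)$ that enumerates supports of increasing size and checks feasibility on each; it is only viable for tiny $N$, and the feasibility tolerance is an assumption:

```python
import numpy as np
from itertools import combinations

def l0_min_bruteforce(A, y, tol=1e-10):
    """Exhaustive (P0): return a sparsest x with A @ x = y, or None if infeasible."""
    M, N = A.shape
    if np.linalg.norm(y) <= tol:                      # x = 0 is already feasible
        return np.zeros(N)
    for s in range(1, N + 1):                         # try supports of growing size
        for S in combinations(range(N), s):
            cols = list(S)
            xS, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
            if np.linalg.norm(A[:, cols] @ xS - y) <= tol:
                x = np.zeros(N)
                x[cols] = xS
                return x
    return None
```

In the worst case the loop visits every one of the $\binom{N}{s}$ supports before terminating, which makes the exponential scaling explicit.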

Example: Sparse Recovery in Dimension $N = 2$

Let $N = 2$, $M = 1$, and $\mathbf{A} = \begin{bmatrix} 1 & 2 \end{bmatrix}$. Suppose we observe $y = 2$ and know $\mathbf{x}^\star$ is $1$-sparse. Recover $\mathbf{x}^\star$.
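
Working through the two candidate supports by hand: support $\{1\}$ gives $\mathbf{x} = (2, 0)^\top$ and support $\{2\}$ gives $\mathbf{x} = (0, 1)^\top$. Both are $1$-sparse and consistent with $y = 2$, so $\ell_0$ minimization alone cannot distinguish them; this is exactly the situation the spark condition rules out, since here $\mathrm{spark}(\mathbf{A}) = 2$ is not greater than $2s = 2$.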

$\ell_0$ Solutions: Counting Feasible Sparse Vectors

For a small system $\mathbf{A}\mathbf{x} = \mathbf{y}$ with random Gaussian $\mathbf{A}$, we vary $s$ and count how many distinct supports $\mathcal{S}$ of size exactly $s$ yield a feasible $\mathbf{x}_\mathcal{S}$ (restricted to the support $\mathcal{S}$). Observe the transition from "many solutions" to "unique solution" as $M$ grows; a minimal code sketch of the experiment follows below.

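A minimal sketch of this experiment in NumPy; the dimensions, sparsity level, and seed below are placeholder choices rather than the demo's actual settings:

```python
import numpy as np
from itertools import combinations

def count_feasible_supports(A, y, s, tol=1e-8):
    """Number of size-s supports S such that some x supported on S satisfies A @ x = y."""
    count = 0
    for S in combinations(range(A.shape[1]), s):
        cols = list(S)
        xS, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
        if np.linalg.norm(A[:, cols] @ xS - y) <= tol:
            count += 1
    return count

rng = np.random.default_rng(7)
N, s = 10, 2
x_true = np.zeros(N)
x_true[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)

for M in (1, 2, 3, 4, 5, 6):
    A = rng.standard_normal((M, N))
    y = A @ x_true
    print(M, count_feasible_supports(A, y, s))
```

For $M \leq s$ every size-$s$ support is trivially feasible; once $M$ exceeds the sparsity, the count typically collapses to a single support, matching the uniqueness discussion above.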

Common Mistake: $\|\cdot\|_0$ is Not a Norm

Mistake:

Students often invoke the triangle inequality or treat $\ell_0$ as a continuous objective, e.g., "the gradient of $\|\mathbf{x}\|_0$".

Correction:

$\|\cdot\|_0$ is neither homogeneous ($\|2\mathbf{x}\|_0 = \|\mathbf{x}\|_0$) nor continuous. It is discontinuous everywhere on the coordinate hyperplanes and has no useful subgradient. Any algorithmic approach requires a surrogate (the $\ell_1$ norm, iteratively reweighted $\ell_2$, or greedy methods like OMP).
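
As an illustration of the first surrogate, basis pursuit ($\min \|\mathbf{x}\|_1$ subject to $\mathbf{A}\mathbf{x} = \mathbf{y}$) can be cast as a linear program by splitting $\mathbf{x}$ into nonnegative parts. A minimal sketch using SciPy's linprog on an arbitrary random instance:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
M, N, s = 20, 50, 3
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x_true

# Write x = u - v with u, v >= 0, then minimize 1^T(u + v) s.t. A(u - v) = y.
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:N] - res.x[N:]
print(np.linalg.norm(x_hat - x_true))   # typically near zero in this regime
```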

Common Mistake: $\mathbf{A}$ is the Sensing Matrix, not the Channel

Mistake:

Writing the CS model as $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w}$, confusing it with MIMO channel notation.

Correction:

In this book we reserve $\mathbf{H}$ for wireless channels and $\mathbf{A}$ for sensing/measurement matrices in compressed sensing and imaging. This follows Caire's convention and prevents notational collisions when both appear in the same analysis (e.g., a CS problem posed in the OFDM frequency domain).

Key Takeaway

Sparse recovery replaces the question "which $\mathbf{x}$ fits the data?" with "which sparse $\mathbf{x}$ fits the data?" Combinatorial $\ell_0$ minimization yields the right answer when every $2s$ columns of $\mathbf{A}$ are linearly independent, but it is NP-hard. The art of compressed sensing is to find tractable surrogates that succeed on the same problem instances.

Historical Note: The Birth of Compressed Sensing (2004–2006)


Although sparse recovery had been studied in geophysics (seismic deconvolution) and statistics (LASSO, 1996) for decades, compressed sensing as a general theory crystallized in a remarkable 2004–2006 period. Candès, Romberg, and Tao proved that random Fourier samples could recover sparse signals via $\ell_1$ minimization with overwhelming probability. Simultaneously, Donoho introduced the term "compressed sensing" and developed the phase-transition analysis. The confluence of these results reframed sampling theory: Nyquist is a sufficient but not necessary rate when structure (sparsity) is available.


Support of a vector

$\mathrm{supp}(\mathbf{x}) = \{i : x_i \neq 0\} \subseteq \{1, \ldots, N\}$. A vector is $s$-sparse iff $|\mathrm{supp}(\mathbf{x})| \leq s$.

Related: $s$-Sparse Vector, $\ell_0$ Norm

Spark of a matrix

The smallest number of columns of $\mathbf{A}$ that are linearly dependent: $\mathrm{spark}(\mathbf{A}) = \min\{k : \exists \text{ a set of } k \text{ linearly dependent columns}\}$. Unlike rank, spark is hard to compute. $\mathrm{spark}(\mathbf{A}) > 2s$ is equivalent to uniqueness of $s$-sparse solutions.
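
Since the definition involves a combinatorial minimum, even verifying spark is expensive; a brute-force sketch, viable only for very small matrices:

```python
import numpy as np
from itertools import combinations

def spark(A):
    """Smallest k such that some k columns of A are linearly dependent.
    Returns N + 1 if all N columns are linearly independent (a common convention)."""
    M, N = A.shape
    for k in range(1, N + 1):
        for S in combinations(range(N), k):
            if np.linalg.matrix_rank(A[:, list(S)]) < k:
                return k
    return N + 1

A = np.array([[1.0, 2.0]])   # the 1 x 2 example from this chapter
print(spark(A))              # 2: any two columns of a 1 x N matrix are dependent
```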