The Sparse Recovery Problem

Why Bother with Underdetermined Systems?

Classical linear estimation asks: given $M \geq N$ noisy observations $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$, how well can we recover $\mathbf{x} \in \mathbb{R}^N$? The least-squares answer is complete, and for $M = N$ with $\mathbf{A}$ invertible we recover $\mathbf{x}$ exactly in the noiseless case.

But many modern problems fall on the opposite side: $M \ll N$. A camera takes 10-megapixel images, but the underlying scene has a few dozen edges. A medical MRI scan could take an hour if all $k$-space samples were acquired; patients cannot hold still that long. An RF imager observes a few hundred pilot responses but must locate scatterers on a grid of $10^5$ voxels. In each case the system $\mathbf{y} = \mathbf{A}\mathbf{x}$ is massively underdetermined: infinitely many $\mathbf{x}$ fit the data.

What saves us is that the true $\mathbf{x}$ is sparse, or nearly so. Sparsity is a structural prior that collapses the infinite solution set to (hopefully) a unique point. This chapter makes that intuition rigorous. We prove that, under certain random designs, a number of measurements $M$ proportional to $s \log(N/s)$ (far fewer than $N$) suffices to recover an $s$-sparse vector exactly, via a tractable convex program.

Definition: $s$-Sparse Vector

A vector $\mathbf{x} \in \mathbb{R}^N$ is called $s$-sparse if it has at most $s$ nonzero entries, i.e.,

$$\|\mathbf{x}\|_0 := |\mathrm{supp}(\mathbf{x})| = |\{i : x_i \neq 0\}| \leq s.$$

The set of all $s$-sparse vectors is denoted $\Sigma_s = \{\mathbf{x} \in \mathbb{R}^N : \|\mathbf{x}\|_0 \leq s\}$.

The quantity $\|\mathbf{x}\|_0$ is called the "$\ell_0$ norm" by convention, but it is not a norm: it fails homogeneity ($\|\alpha\mathbf{x}\|_0 = \|\mathbf{x}\|_0$ for $\alpha \neq 0$), and the set $\Sigma_s$ is a non-convex union of $\binom{N}{s}$ linear subspaces.
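
As a quick numerical check of these properties, a minimal NumPy sketch (the example vector is an arbitrary illustration):

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -1.5, 0.0])   # a 2-sparse vector in R^5

l0 = np.count_nonzero(x)                    # the "l0 norm": number of nonzero entries
print(l0)                                   # 2

# Scaling does not change the support, so homogeneity fails:
print(np.count_nonzero(2 * x))              # still 2, not 2 * l0
```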

Definition: Compressible Vector

A vector $\mathbf{x} \in \mathbb{R}^N$ is compressible if its sorted magnitudes decay rapidly. Formally, if $|x_{(1)}| \geq |x_{(2)}| \geq \cdots \geq |x_{(N)}|$ denote the order statistics of $|\mathbf{x}|$, then $\mathbf{x}$ is compressible with exponent $p \in (0, 1]$ if

$$|x_{(k)}| \leq C \cdot k^{-1/p}, \quad k = 1, \ldots, N,$$

for some constant $C > 0$.

Real signals (images in wavelet bases, audio in DCT) are rarely exactly sparse but are typically compressible. The $s$-term approximation error $\sigma_s(\mathbf{x})_2 = \|\mathbf{x} - \mathbf{x}_{[s]}\|_2$, where $\mathbf{x}_{[s]}$ keeps the $s$ largest entries, decays like $s^{1/2 - 1/p}$ for $p < 2$.
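
The decay of the $s$-term approximation error is easy to verify numerically. A minimal sketch, assuming an idealized power-law signal with illustrative constants:

```python
import numpy as np

N, p, C = 10_000, 0.5, 1.0
k = np.arange(1, N + 1)
x = C * k ** (-1.0 / p)          # already magnitude-sorted: |x_(k)| = C k^{-1/p}

def sigma_s(x_sorted, s):
    """l2 error of the best s-term approximation of a magnitude-sorted vector."""
    return np.linalg.norm(x_sorted[s:])

for s in (10, 100, 1000):
    # the error tracks s^{1/2 - 1/p} up to a constant factor
    print(s, sigma_s(x, s), s ** (0.5 - 1.0 / p))
```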

Definition: Sparse Recovery Problem

Let $\mathbf{A} \in \mathbb{R}^{M \times N}$ with $M \ll N$, and let $\mathbf{x}^\star \in \Sigma_s$. The sparse recovery problem is: given

$$\mathbf{y} = \mathbf{A}\mathbf{x}^\star + \mathbf{w},$$

where $\mathbf{w}$ is a noise vector with $\|\mathbf{w}\|_2 \leq \eta$, produce an estimate $\hat{\mathbf{x}}$ such that $\hat{\mathbf{x}} \approx \mathbf{x}^\star$.

In the noiseless case ($\mathbf{w} = \mathbf{0}$) we seek exact recovery: $\hat{\mathbf{x}} = \mathbf{x}^\star$.

Infinitely Many Candidates

Without any prior, the equation $\mathbf{A}\mathbf{x} = \mathbf{y}$ with $M < N$ has either no solutions (inconsistent) or an affine subspace of solutions of dimension $N - \mathrm{rank}(\mathbf{A}) \geq N - M$. Every $\mathbf{x}$ in that affine subspace is consistent with the data. Sparsity picks out one.
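
This non-uniqueness is easy to exhibit: take one particular solution and add any null-space vector. A minimal sketch on an arbitrary random instance (dimensions chosen only for illustration):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
M, N = 4, 10
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)

x_p = np.linalg.lstsq(A, y, rcond=None)[0]   # one particular (minimum-norm) solution
Z = null_space(A)                             # basis for null(A), shape (N, N - M)

# Every x_p + Z @ c is also consistent with the data:
c = rng.standard_normal(Z.shape[1])
x_alt = x_p + Z @ c
print(np.allclose(A @ x_p, y), np.allclose(A @ x_alt, y))   # True True
```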

Theorem: $\ell_0$ Minimization: The Ideal but Intractable Recovery

Consider the noiseless problem $\mathbf{y} = \mathbf{A}\mathbf{x}^\star$ with $\|\mathbf{x}^\star\|_0 \leq s$. If every $2s$ columns of $\mathbf{A}$ are linearly independent (equivalently, $\mathrm{spark}(\mathbf{A}) > 2s$), then $\mathbf{x}^\star$ is the unique solution of

$$(P_0)\qquad \min_{\mathbf{x} \in \mathbb{R}^N} \|\mathbf{x}\|_0 \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{y}.$$

Proof sketch: if two different $s$-sparse vectors $\mathbf{x}^\star, \tilde{\mathbf{x}}$ both satisfied $\mathbf{A}\mathbf{x} = \mathbf{y}$, then their difference $\mathbf{x}^\star - \tilde{\mathbf{x}}$ would be $2s$-sparse and lie in the null space of $\mathbf{A}$. If no $2s$ columns are linearly dependent, the only such vector is zero, forcing $\mathbf{x}^\star = \tilde{\mathbf{x}}$.


Theorem: $\ell_0$ Minimization is NP-Hard

The problem $(P_0)$, minimizing $\|\mathbf{x}\|_0$ subject to $\mathbf{A}\mathbf{x} = \mathbf{y}$, is NP-hard in general. More precisely, no algorithm solves all instances in time polynomial in $(M, N)$ unless $\text{P} = \text{NP}$.

The feasible set is an affine subspace; the objective $\|\mathbf{x}\|_0$ counts supports. Solving $(P_0)$ amounts to searching over all $\binom{N}{s}$ supports of size $s$ and checking feasibility on each, which is combinatorial.
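
To make the combinatorial cost concrete, here is a brute-force sketch of $(P_0)$ that enumerates supports of increasing size and checks feasibility on each; it is only viable for tiny $N$, and the feasibility tolerance is an assumption:

```python
import numpy as np
from itertools import combinations

def l0_min_bruteforce(A, y, tol=1e-10):
    """Exhaustive (P0): return a sparsest x with A @ x = y, or None if infeasible."""
    M, N = A.shape
    if np.linalg.norm(y) <= tol:                      # x = 0 is already feasible
        return np.zeros(N)
    for s in range(1, N + 1):                         # try supports of growing size
        for S in combinations(range(N), s):
            cols = list(S)
            xS, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
            if np.linalg.norm(A[:, cols] @ xS - y) <= tol:
                x = np.zeros(N)
                x[cols] = xS
                return x
    return None
```

In the worst case the loop visits every one of the $\binom{N}{s}$ supports before terminating, which makes the exponential scaling explicit.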

Example: Sparse Recovery in Dimension $N = 2$

Let $N = 2$, $M = 1$, and $\mathbf{A} = \begin{bmatrix} 1 & 2 \end{bmatrix}$. Suppose we observe $y = 2$ and know $\mathbf{x}^\star$ is $1$-sparse. Recover $\mathbf{x}^\star$.
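
Working through the two candidate supports by hand: support $\{1\}$ gives $\mathbf{x} = (2, 0)^\top$ and support $\{2\}$ gives $\mathbf{x} = (0, 1)^\top$. Both are $1$-sparse and consistent with $y = 2$, so $\ell_0$ minimization alone cannot distinguish them; this is exactly the situation the spark condition rules out, since here $\mathrm{spark}(\mathbf{A}) = 2$ is not greater than $2s = 2$.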

$\ell_0$ Solutions: Counting Feasible Sparse Vectors

For a small system $\mathbf{A}\mathbf{x} = \mathbf{y}$ with random Gaussian $\mathbf{A}$, we vary $s$ and count how many distinct supports $\mathcal{S}$ of size exactly $s$ yield a feasible $\mathbf{x}_\mathcal{S}$ (restricted to the support $\mathcal{S}$). Observe the transition from "many solutions" to "unique solution" as $M$ grows; a minimal code sketch of the experiment follows below.

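A minimal sketch of this experiment in NumPy; the dimensions, sparsity level, and seed below are placeholder choices rather than the demo's actual settings:

```python
import numpy as np
from itertools import combinations

def count_feasible_supports(A, y, s, tol=1e-8):
    """Number of size-s supports S such that some x supported on S satisfies A @ x = y."""
    count = 0
    for S in combinations(range(A.shape[1]), s):
        cols = list(S)
        xS, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
        if np.linalg.norm(A[:, cols] @ xS - y) <= tol:
            count += 1
    return count

rng = np.random.default_rng(7)
N, s = 10, 2
x_true = np.zeros(N)
x_true[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)

for M in (1, 2, 3, 4, 5, 6):
    A = rng.standard_normal((M, N))
    y = A @ x_true
    print(M, count_feasible_supports(A, y, s))
```

For $M \leq s$ every size-$s$ support is trivially feasible; once $M$ exceeds the sparsity, the count typically collapses to a single support, matching the uniqueness discussion above.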

Common Mistake: $\|\cdot\|_0$ is Not a Norm

Mistake:

Students often invoke the triangle inequality or treat $\ell_0$ as a continuous objective, e.g., "the gradient of $\|\mathbf{x}\|_0$".

Correction:

$\|\cdot\|_0$ is neither homogeneous ($\|2\mathbf{x}\|_0 = \|\mathbf{x}\|_0$) nor continuous. It is discontinuous everywhere on the coordinate hyperplanes and has no useful subgradient. Any algorithmic approach requires a surrogate (the $\ell_1$ norm, iteratively reweighted $\ell_2$, or greedy methods like OMP).
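
As an illustration of the first surrogate, basis pursuit ($\min \|\mathbf{x}\|_1$ subject to $\mathbf{A}\mathbf{x} = \mathbf{y}$) can be cast as a linear program by splitting $\mathbf{x}$ into nonnegative parts. A minimal sketch using SciPy's linprog on an arbitrary random instance:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
M, N, s = 20, 50, 3
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x_true

# Write x = u - v with u, v >= 0, then minimize 1^T(u + v) s.t. A(u - v) = y.
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:N] - res.x[N:]
print(np.linalg.norm(x_hat - x_true))   # typically near zero in this regime
```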

Common Mistake: $\mathbf{A}$ is the Sensing Matrix, not the Channel

Mistake:

Writing the CS model as $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w}$, confusing it with MIMO channel notation.

Correction:

In this book we reserve $\mathbf{H}$ for wireless channels and $\mathbf{A}$ for sensing/measurement matrices in compressed sensing and imaging. This follows Caire's convention and prevents notational collisions when both appear in the same analysis (e.g., a CS problem posed in the OFDM frequency domain).

Key Takeaway

Sparse recovery replaces the question "which $\mathbf{x}$ fits the data?" with "which sparse $\mathbf{x}$ fits the data?" Combinatorial $\ell_0$ minimization yields the right answer when every $2s$ columns of $\mathbf{A}$ are linearly independent, but it is NP-hard. The art of compressed sensing is to find tractable surrogates that succeed on the same problem instances.

Historical Note: The Birth of Compressed Sensing (2004–2006)


Although sparse recovery had been studied in geophysics (seismic deconvolution) and statistics (LASSO, 1996) for decades, compressed sensing as a general theory crystallized in a remarkable 2004–2006 period. Candès, Romberg, and Tao proved that random Fourier samples could recover sparse signals via $\ell_1$ minimization with overwhelming probability. Simultaneously, Donoho introduced the term "compressed sensing" and developed the phase-transition analysis. The confluence of these results reframed sampling theory: Nyquist is a sufficient but not necessary rate when structure (sparsity) is available.


Support of a vector

$\mathrm{supp}(\mathbf{x}) = \{i : x_i \neq 0\} \subseteq \{1, \ldots, N\}$. A vector is $s$-sparse iff $|\mathrm{supp}(\mathbf{x})| \leq s$.

Related: $s$-Sparse Vector, $\ell_0$ Norm

Spark of a matrix

The smallest number of columns of $\mathbf{A}$ that are linearly dependent: $\mathrm{spark}(\mathbf{A}) = \min\{k : \exists \text{ a set of } k \text{ linearly dependent columns}\}$. Unlike rank, spark is hard to compute. $\mathrm{spark}(\mathbf{A}) > 2s$ is equivalent to uniqueness of $s$-sparse solutions.
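
Since the definition involves a combinatorial minimum, even verifying spark is expensive; a brute-force sketch, viable only for very small matrices:

```python
import numpy as np
from itertools import combinations

def spark(A):
    """Smallest k such that some k columns of A are linearly dependent.
    Returns N + 1 if all N columns are linearly independent (a common convention)."""
    M, N = A.shape
    for k in range(1, N + 1):
        for S in combinations(range(N), k):
            if np.linalg.matrix_rank(A[:, list(S)]) < k:
                return k
    return N + 1

A = np.array([[1.0, 2.0]])   # the 1 x 2 example from this chapter
print(spark(A))              # 2: any two columns of a 1 x N matrix are dependent
```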