Exercises
ch18-ex01-ista-to-lista
(Easy) Write out the ISTA iteration for the LASSO problem and identify which quantities become learnable parameters in LISTA. State the LISTA initialisation for each learnable parameter.
Rewrite the ISTA iteration in the form $\mathbf{x}^{(k+1)} = \eta_{\tau}\big(\mathbf{W}_e\mathbf{y} + \mathbf{W}_t\mathbf{x}^{(k)}\big)$ and read off which matrices and thresholds can be freed.
ISTA iteration
$\mathbf{x}^{(k+1)} = \eta_{\lambda/L}\!\Big(\mathbf{x}^{(k)} + \tfrac{1}{L}\mathbf{A}^{\mathsf T}\big(\mathbf{y} - \mathbf{A}\mathbf{x}^{(k)}\big)\Big), \qquad L = \|\mathbf{A}\|_2^2.$
LISTA parameterisation

| ISTA quantity | LISTA parameter | Initialisation |
|---|---|---|
| $\tfrac{1}{L}\mathbf{A}^{\mathsf T}$ (input mixing) | $\mathbf{W}_e$ | $\mathbf{W}_e \leftarrow \tfrac{1}{L}\mathbf{A}^{\mathsf T}$ |
| $\mathbf{I} - \tfrac{1}{L}\mathbf{A}^{\mathsf T}\mathbf{A}$ (state mixing) | $\mathbf{W}_t$ | $\mathbf{W}_t \leftarrow \mathbf{I} - \tfrac{1}{L}\mathbf{A}^{\mathsf T}\mathbf{A}$ |
| $\lambda/L$ (threshold) | $\tau_k$ (per layer) | $\tau_k \leftarrow \lambda/L$ |
At initialisation, the untrained LISTA network performs exact ISTA. Training then refines the parameters.
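A minimal numerical check of this correspondence, on a small random problem with assumed dimensions: at the initialisation in the table above, one LISTA layer reproduces one ISTA step exactly.

```python
import numpy as np

def soft_threshold(z, tau):
    """Element-wise soft-thresholding eta_tau(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(0)
M, N, lam = 20, 50, 0.1                      # illustrative sizes and regularisation
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, 5, replace=False)] = 1.0
y = A @ x_true
L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of the data-fidelity gradient

x = np.zeros(N)
x_ista = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)   # one ISTA step

W_e = A.T / L                                # LISTA initialisation from the table
W_t = np.eye(N) - A.T @ A / L
tau = lam / L
x_lista = soft_threshold(W_e @ y + W_t @ x, tau)              # one untrained LISTA layer

print(np.allclose(x_ista, x_lista))          # True: untrained LISTA == ISTA
```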
ch18-ex02-admm-split
(Easy) Write out the three ADMM updates for the LASSO problem $\min_{\mathbf{x},\mathbf{z}} \tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1$ subject to the splitting constraint $\mathbf{x} = \mathbf{z}$. Identify the closed-form solution for the $\mathbf{x}$-update.
The $\mathbf{x}$-update is a quadratic minimisation.
Write the ADMM updates
$\mathbf{x}$-update: $\mathbf{x}^{k+1} = \arg\min_{\mathbf{x}} \tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}^{k}+\mathbf{u}^{k}\|_2^2$
$\mathbf{z}$-update: $\mathbf{z}^{k+1} = \eta_{\lambda/\rho}\big(\mathbf{x}^{k+1}+\mathbf{u}^{k}\big)$
Dual update: $\mathbf{u}^{k+1} = \mathbf{u}^{k} + \mathbf{x}^{k+1} - \mathbf{z}^{k+1}$
Closed-form solution
The $\mathbf{x}$-update solves the linear system $(\mathbf{A}^{\mathsf T}\mathbf{A} + \rho\mathbf{I})\,\mathbf{x}^{k+1} = \mathbf{A}^{\mathsf T}\mathbf{y} + \rho(\mathbf{z}^{k} - \mathbf{u}^{k})$. Since $\mathbf{A}^{\mathsf T}\mathbf{A} + \rho\mathbf{I}$ is positive definite for $\rho > 0$, the solution is unique. For Kronecker-structured $\mathbf{A}$, this system can be solved efficiently via the FFT.
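A compact sketch of the resulting algorithm, assuming a generic dense $\mathbf{A}$; the normal-equations matrix is factorised once and reused in every $\mathbf{x}$-update.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm_lasso(A, y, lam, rho=1.0, n_iter=100):
    """ADMM for the LASSO with the closed-form x-update derived above."""
    N = A.shape[1]
    factor = cho_factor(A.T @ A + rho * np.eye(N))   # positive definite for rho > 0
    Aty = A.T @ y
    x = z = u = np.zeros(N)
    for _ in range(n_iter):
        x = cho_solve(factor, Aty + rho * (z - u))                           # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)      # z-update
        u = u + x - z                                                        # dual update
    return z
```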
ch18-ex03-weight-tying
(Easy) A 15-layer LISTA network with $\mathbf{x} \in \mathbb{R}^{N}$ and $\mathbf{y} \in \mathbb{R}^{M}$ has weight matrices $\mathbf{W}_e \in \mathbb{R}^{N\times M}$ and $\mathbf{W}_t \in \mathbb{R}^{N\times N}$. Compute the parameter count with and without weight tying.
Without weight tying, every layer carries its own copy of $\mathbf{W}_e$, $\mathbf{W}_t$ and $\tau$.
Without weight tying
Per layer: $NM + N^2 + 1$ parameters ($\mathbf{W}_e$, $\mathbf{W}_t$ and one threshold). Total: $15(NM + N^2 + 1)$ parameters.
With weight tying
Shared: $NM + N^2$ (a single $\mathbf{W}_e$ and $\mathbf{W}_t$ used by all layers). Per-layer thresholds: 15. Total: $NM + N^2 + 15$ parameters, roughly a 15$\times$ reduction.
ch18-ex04-oamp-unrolled-structure
(Easy) Draw a block diagram of one layer of the unrolled OAMP-ProxNet architecture. Label the inputs, outputs, and learnable components. Identify which components use the sensing matrix $\mathbf{A}$ and which do not.
The LMMSE step uses $\mathbf{A}$. The ProxNet denoiser does not.
Block diagram description
Input: previous estimate $\mathbf{x}^{(t)}$, measurements $\mathbf{y}$, current error variance $v_t^2$
LMMSE block (uses $\mathbf{A}$, not learned): $\mathbf{r}^{(t)} = \mathbf{x}^{(t)} + \mathbf{W}_t\big(\mathbf{y} - \mathbf{A}\mathbf{x}^{(t)}\big)$
State evolution (uses the singular values of $\mathbf{A}$, not learned): compute the effective noise level $\tau_t$ of the pseudo-data $\mathbf{r}^{(t)}$
ProxNet (does NOT use $\mathbf{A}$, LEARNED): $\mathbf{x}^{(t+1)} = f_{\boldsymbol{\theta}_t}\big(\mathbf{r}^{(t)}, \tau_t\big)$
Output: updated estimate $\mathbf{x}^{(t+1)}$
Summary of components
| Component | Uses $\mathbf{A}$? | Learnable? |
|---|---|---|
| LMMSE step | Yes | No (physics) |
| Orthogonalisation | Yes (SVD) | No (analytical) |
| State evolution | Yes (singular values) | No (analytical) |
| ProxNet denoiser | No | Yes (denoiser parameters $\boldsymbol{\theta}_t$) |
Only the denoiser is learned. This is why unrolled OAMP has far fewer parameters than LISTA.
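A simplified PyTorch sketch of one such layer, with assumed shapes and a toy MLP standing in for the ProxNet; the error variance $v_t^2$ is assumed to be supplied by the state evolution. It makes the split explicit: the LMMSE step is built from $\mathbf{A}$ and is not trained, while the denoiser never sees $\mathbf{A}$.

```python
import torch

class OAMPLayer(torch.nn.Module):
    """One unrolled OAMP layer: analytical LMMSE step + learned denoiser (sketch)."""

    def __init__(self, A, sigma2, hidden=64):
        super().__init__()
        self.register_buffer("A", A)              # M x N sensing matrix: physics, not learned
        self.sigma2 = sigma2                      # measurement noise variance
        N = A.shape[1]
        self.proxnet = torch.nn.Sequential(       # learned denoiser: never sees A
            torch.nn.Linear(N, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, N))

    def forward(self, x, y, v2):
        A = self.A
        M, N = A.shape
        # LMMSE linear estimator, then trace normalisation (the orthogonalisation step)
        W_hat = v2 * A.T @ torch.linalg.inv(v2 * A @ A.T + self.sigma2 * torch.eye(M, device=A.device))
        W = (N / torch.trace(W_hat @ A)) * W_hat
        r = x + W @ (y - A @ x)                   # pseudo-data with approximately white error
        return self.proxnet(r), r                 # learned nonlinear refinement
```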
ch18-ex05-convergence-check
(Easy) Given the state evolution recursion $\tau_{t+1}^2 = \sigma^2 + \tfrac{N}{M}\,\mathrm{mmse}(\tau_t^2)$ for a Bernoulli-Gaussian signal with sparsity 0.1 and unit variance, compute the state evolution trajectory over the unrolled layers, starting from the given $\tau_0^2$.
Evaluate the scalar MMSE at the current $\tau_t^2$, update $\tau_{t+1}^2$, and repeat layer by layer.
Iterate the state evolution
Iterating the recursion layer by layer gives a monotonically decreasing sequence $\tau_1^2 > \tau_2^2 > \dots$; the numerical values follow directly from the stated parameters (a sketch for regenerating the trajectory is given below).
Analysis
The trajectory converges to a fixed point $\tau_\infty^2$ (equivalently, a floor in NMSE). Convergence is rapid in the first 3 layers and slows near the fixed point. A ProxNet denoiser achieving a lower MMSE at each step would shift the fixed point downward.
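A short sketch for regenerating such a trajectory, with illustrative (assumed) values for the noise variance, measurement ratio and initial state; the Bernoulli-Gaussian MMSE is estimated by Monte Carlo with the Bayes-optimal scalar denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, sigma2, delta = 0.1, 0.01, 0.5        # sparsity, noise variance, M/N (assumed values)
tau2, n_mc, n_layers = 1.0, 200_000, 5     # initial state, Monte Carlo samples, layers

def mmse_bg(tau2):
    """Monte Carlo MMSE of the Bayes denoiser for a Bernoulli-Gaussian(eps, 1) prior."""
    x = rng.binomial(1, eps, n_mc) * rng.standard_normal(n_mc)
    y = x + np.sqrt(tau2) * rng.standard_normal(n_mc)
    num = eps * np.exp(-y**2 / (2 * (1 + tau2))) / np.sqrt(1 + tau2)     # active component
    den = num + (1 - eps) * np.exp(-y**2 / (2 * tau2)) / np.sqrt(tau2)   # + inactive component
    x_hat = (num / den) * y / (1 + tau2)                                 # posterior mean
    return np.mean((x - x_hat) ** 2)

for t in range(1, n_layers + 1):
    tau2 = sigma2 + mmse_bg(tau2) / delta   # state evolution update
    print(f"layer {t}: tau^2 = {tau2:.4f}")
```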
ch18-ex06-gradient-flow
(Medium) For a $K$-layer LISTA network, derive the gradient of the MSE loss with respect to the threshold $\tau_j$ at layer $j$. Show that the Jacobian of soft-thresholding with respect to its input is a diagonal matrix with entries in $\{0,1\}$.
The Jacobian of $\eta_\tau(\mathbf{z})$ w.r.t. $\mathbf{z}$ is $\operatorname{diag}\!\big(\mathbf{1}\{|z_i| > \tau\}\big)$.
The derivative w.r.t. $\tau$ is $-\operatorname{sign}(z_i)\,\mathbf{1}\{|z_i| > \tau\}$.
Jacobian of soft-thresholding w.r.t. input
$\frac{\partial [\eta_\tau(\mathbf{z})]_i}{\partial z_j} = \begin{cases} 1 & i = j \text{ and } |z_i| > \tau,\\ 0 & \text{otherwise},\end{cases} \qquad\text{i.e.}\qquad \mathbf{J}_\eta(\mathbf{z}) = \operatorname{diag}\!\big(\mathbf{1}\{|z_i| > \tau\}\big),$
a diagonal matrix with entries in $\{0,1\}$ (away from the measure-zero set $|z_i| = \tau$).
Derivative w.r.t. threshold
$\frac{\partial [\eta_\tau(\mathbf{z})]_i}{\partial \tau} = -\operatorname{sign}(z_i)\,\mathbf{1}\{|z_i| > \tau\}.$
Chain rule through the network
With $\mathbf{z}^{(k)} = \mathbf{W}_e\mathbf{y} + \mathbf{W}_k\mathbf{x}^{(k-1)}$, $\mathbf{x}^{(k)} = \eta_{\tau_k}(\mathbf{z}^{(k)})$ and loss $\mathcal{L} = \tfrac12\|\mathbf{x}^{(K)} - \mathbf{x}^\star\|_2^2$, the chain rule gives
$\frac{\partial \mathcal{L}}{\partial \tau_j} = -\big(\mathbf{x}^{(K)} - \mathbf{x}^\star\big)^{\!\mathsf T}\left[\prod_{k=K}^{j+1}\operatorname{diag}\!\big(\mathbf{1}\{|\mathbf{z}^{(k)}| > \tau_k\}\big)\,\mathbf{W}_k\right]\Big(\operatorname{sign}(\mathbf{z}^{(j)}) \odot \mathbf{1}\{|\mathbf{z}^{(j)}| > \tau_j\}\Big),$
with the factors ordered from $k = K$ down to $k = j+1$. Each factor is a masked layer matrix $\operatorname{diag}(\mathbf{1}\{|\mathbf{z}^{(k)}| > \tau_k\})\,\mathbf{W}_k$, so coordinates that are thresholded to zero at any later layer pass no gradient back to $\tau_j$. $\square$
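A quick autograd check of both derivative formulas, on an assumed random input (PyTorch):

```python
import torch

def soft_threshold(z, tau):
    return torch.sign(z) * torch.clamp(torch.abs(z) - tau, min=0.0)

z = torch.randn(6, dtype=torch.double, requires_grad=True)
tau = torch.tensor(0.5, dtype=torch.double, requires_grad=True)
out = soft_threshold(z, tau)
mask = (z.abs() > tau).double()

# Jacobian w.r.t. the input: diagonal with entries in {0, 1}
J = torch.autograd.functional.jacobian(lambda z_: soft_threshold(z_, tau), z)
print(torch.allclose(J, torch.diag(mask)))

# Derivative w.r.t. the threshold: -sign(z_i) * 1{|z_i| > tau}
(g_tau,) = torch.autograd.grad(out.sum(), tau)
print(torch.isclose(g_tau, (-torch.sign(z) * mask).sum()))
```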
ch18-ex07-oamp-orthogonality
(Medium) Show that the OAMP linear (LMMSE) estimator $\hat{\mathbf{W}}_t = v_t^2\,\mathbf{A}^{\mathsf T}\big(v_t^2\mathbf{A}\mathbf{A}^{\mathsf T} + \sigma^2\mathbf{I}\big)^{-1}$ satisfies $\operatorname{tr}\big(\hat{\mathbf{W}}_t\mathbf{A}\big) = \sum_i \frac{v_t^2\sigma_i^2}{v_t^2\sigma_i^2 + \sigma^2}$, where $\sigma_i$ are the singular values of $\mathbf{A}$, and explain why the trace-normalised estimator is divergence-free.
Use the SVD $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}$ and simplify.
Substitute the SVD
With $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}$,
$\hat{\mathbf{W}}_t\mathbf{A} = v_t^2\,\mathbf{V}\boldsymbol{\Sigma}^{\mathsf T}\big(v_t^2\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\mathsf T} + \sigma^2\mathbf{I}\big)^{-1}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T} = \mathbf{V}\operatorname{diag}\!\left(\frac{v_t^2\sigma_i^2}{v_t^2\sigma_i^2 + \sigma^2}\right)\mathbf{V}^{\mathsf T}.$
Compute the trace
$\frac{1}{N}\operatorname{tr}\big(\hat{\mathbf{W}}_t\mathbf{A}\big) = \frac{1}{N}\sum_{i=1}^{M}\frac{v_t^2\sigma_i^2}{v_t^2\sigma_i^2 + \sigma^2},$
which tends to $0$ as $v_t \to 0$ and to $M/N$ as $v_t \to \infty$ (for a full-row-rank $\mathbf{A}$). The trace-normalised OAMP estimator $\mathbf{W}_t = \frac{N}{\operatorname{tr}(\hat{\mathbf{W}}_t\mathbf{A})}\hat{\mathbf{W}}_t$ therefore satisfies $\operatorname{tr}\big(\mathbf{I} - \mathbf{W}_t\mathbf{A}\big) = 0$, which is the divergence-free (orthogonality) property. $\square$
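A quick numerical check of the trace identity on an assumed random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, v2, sigma2 = 30, 80, 0.5, 0.1
A = rng.standard_normal((M, N)) / np.sqrt(M)

W_hat = v2 * A.T @ np.linalg.inv(v2 * A @ A.T + sigma2 * np.eye(M))
s = np.linalg.svd(A, compute_uv=False)
lhs = np.trace(W_hat @ A)                           # direct trace
rhs = np.sum(v2 * s**2 / (v2 * s**2 + sigma2))      # sum over singular values
print(np.isclose(lhs, rhs))
```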
ch18-ex08-kronecker-complexity
(Medium) For a Kronecker-structured sensing matrix $\mathbf{A} = \mathbf{A}_1 \otimes \mathbf{A}_2$ with $\mathbf{A}_1 \in \mathbb{R}^{M_1\times N_1}$ and $\mathbf{A}_2 \in \mathbb{R}^{M_2\times N_2}$, compare the cost of computing $\mathbf{A}\mathbf{x}$ and $\mathbf{A}^{\mathsf T}\mathbf{y}$ with and without exploiting the Kronecker structure.
With Kronecker structure: reshape $\mathbf{x}$ into a matrix $\mathbf{X} \in \mathbb{R}^{N_2\times N_1}$ and compute $\mathbf{A}_2\mathbf{X}\mathbf{A}_1^{\mathsf T}$.
Without Kronecker structure
$\mathbf{A} \in \mathbb{R}^{M\times N}$ with $M = M_1M_2$, $N = N_1N_2$. Cost of $\mathbf{A}\mathbf{x}$: $\mathcal{O}(MN) = \mathcal{O}(M_1M_2N_1N_2)$ operations, plus $\mathcal{O}(MN)$ memory to store $\mathbf{A}$ explicitly.
With Kronecker structure
Reshape $\mathbf{x}$ into $\mathbf{X} \in \mathbb{R}^{N_2\times N_1}$. Then $\mathbf{A}\mathbf{x} = \operatorname{vec}\big(\mathbf{A}_2\mathbf{X}\mathbf{A}_1^{\mathsf T}\big)$, and likewise $\mathbf{A}^{\mathsf T}\mathbf{y} = \operatorname{vec}\big(\mathbf{A}_2^{\mathsf T}\mathbf{Y}\mathbf{A}_1\big)$.
Cost: $\mathcal{O}(M_2N_2N_1 + M_1M_2N_1)$ operations, and only $\mathbf{A}_1$ and $\mathbf{A}_2$ need to be stored.
Speedup
For square factors ($M_i = N_i = n$, so $N = n^2$): $\mathcal{O}(n^4)$ without the structure versus $\mathcal{O}(n^3)$ with it, i.e. a speedup of order $n$ that grows with the problem size.
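A short check of the reshaping identity on assumed sizes; the dense path builds the full Kronecker matrix, the structured path never forms it.

```python
import numpy as np

rng = np.random.default_rng(2)
M1, N1, M2, N2 = 16, 32, 24, 40                 # illustrative factor sizes
A1 = rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2))
x = rng.standard_normal(N1 * N2)

# Dense: materialise the (M1*M2) x (N1*N2) matrix
y_dense = np.kron(A1, A2) @ x

# Structured: (A1 kron A2) vec(X) = vec(A2 X A1^T) with column-major vec, X in R^{N2 x N1}
X = x.reshape(N1, N2).T
y_kron = (A2 @ X @ A1.T).T.reshape(-1)
print(np.allclose(y_dense, y_kron))
```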
ch18-ex09-intermediate-supervision
(Medium) In intermediate supervision, the loss includes terms at every layer: $\mathcal{L} = \sum_{k=1}^{K} w_k\,\|\mathbf{x}^{(k)} - \mathbf{x}^\star\|_2^2$. Argue that this helps with vanishing gradients and propose a weight schedule $\{w_k\}$.
The direct gradient term at layer $k$ bypasses the vanishing chain.
Without intermediate supervision
The gradient that reaches layer $j$ from a loss on the final output alone passes through $K - j$ layer Jacobians. If each Jacobian has spectral norm at most $\rho < 1$, the gradient magnitude decays as $\rho^{K-j}$, so in deep unrolled networks the early layers receive a signal that is orders of magnitude weaker than the late layers'.
With intermediate supervision
The term $w_j\|\mathbf{x}^{(j)} - \mathbf{x}^\star\|_2^2$ provides a gradient signal directly at layer $j$ that bypasses the vanishing chain.
Proposed weight schedule
Use linearly increasing weights, e.g. $w_k = k/K$. This gives the most weight to the final output while providing non-trivial supervision at every layer. An alternative is a geometric schedule $w_k = \gamma^{K-k}$ with $\gamma \in (0,1)$.
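A minimal sketch of such a loss; the two schedules below match the proposals above, and the function and argument names are illustrative.

```python
import torch

def intermediate_supervision_loss(layer_outputs, x_star, schedule="linear", gamma=0.5):
    """Weighted per-layer MSE; layer_outputs = [x^(1), ..., x^(K)]."""
    K = len(layer_outputs)
    if schedule == "linear":
        weights = [k / K for k in range(1, K + 1)]             # w_k = k/K
    else:
        weights = [gamma ** (K - k) for k in range(1, K + 1)]  # w_k = gamma^(K-k)
    return sum(w * torch.mean((x_k - x_star) ** 2)
               for w, x_k in zip(weights, layer_outputs))
```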
ch18-ex10-ista-vs-oamp-condition
(Medium) A sensing matrix has singular values uniformly distributed in $[\sigma_{\min}, \sigma_{\max}]$ with a large condition number $\kappa = \sigma_{\max}/\sigma_{\min}$. Compute the ISTA step size and the ISTA contraction rate. Compare with the OAMP per-singular-value weights and argue why OAMP converges faster.
$L = \sigma_{\max}^2$, so the step size is $\mu = 1/\sigma_{\max}^2$.
ISTA analysis
$L = \|\mathbf{A}\|_2^2 = \sigma_{\max}^2$, so the step size is $\mu = 1/\sigma_{\max}^2$. The slowest mode contracts at rate $1 - \mu\sigma_{\min}^2 = 1 - 1/\kappa^2$. After 100 iterations that mode has shrunk only by $(1 - 1/\kappa^2)^{100}$, which stays close to 1 for large $\kappa$: almost no progress.
OAMP analysis
OAMP applies the per-singular-value weights $\frac{v^2\sigma_i^2}{v^2\sigma_i^2 + \sigma^2}$ through its LMMSE step. For the bulk of the singular values this weight is much larger than the ISTA factor $\mu\sigma_i^2 = \sigma_i^2/\sigma_{\max}^2$, so all modes are updated at comparable rates.
The effective noise after the LMMSE step involves an average over the whole spectrum of singular values, so it is dominated by the bulk of the spectrum rather than by the extreme $\sigma_{\min}$.
Conclusion
ISTA is bottlenecked by the worst singular value; OAMP adapts to the full spectrum. For such an ill-conditioned matrix, OAMP reaches its fixed point within a handful of layers, while the number of ISTA iterations needed grows like $\kappa^2$.
ch18-ex11-hst-weighted-l1
(Medium) Show that spatially-varying soft-thresholding with per-component thresholds $\tau_i$ is the proximal operator of the weighted $\ell_1$ norm $g(\mathbf{x}) = \sum_i \tau_i|x_i|$.
Write $\operatorname{prox}_g(\mathbf{v}) = \arg\min_{\mathbf{x}} \tfrac12\|\mathbf{x}-\mathbf{v}\|_2^2 + \sum_i \tau_i|x_i|$ and solve component-wise.
Proximal operator definition
$\operatorname{prox}_g(\mathbf{v}) = \arg\min_{\mathbf{x}} \tfrac12\|\mathbf{x}-\mathbf{v}\|_2^2 + \sum_i \tau_i|x_i|.$
The objective separates across components: $\sum_i \big[\tfrac12(x_i - v_i)^2 + \tau_i|x_i|\big]$.
Per-component solution
Each scalar problem is the standard $\ell_1$ proximal problem with weight $\tau_i$, whose solution is $x_i^\star = \operatorname{sign}(v_i)\max\big(|v_i| - \tau_i,\, 0\big) = \eta_{\tau_i}(v_i)$.
This is exactly the spatially-varying soft-thresholding operator, so HST is the proximal operator of the weighted $\ell_1$ norm $g(\mathbf{x}) = \sum_i \tau_i|x_i|$. $\square$
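A small numerical confirmation: the closed form matches a brute-force minimisation of each scalar prox objective (the input values are arbitrary).

```python
import numpy as np

def hst(v, tau):
    """Spatially-varying (per-component) soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(3)
v = rng.standard_normal(8)
tau = rng.uniform(0.1, 1.0, 8)
x_closed = hst(v, tau)

grid = np.linspace(-5.0, 5.0, 200_001)          # brute-force 1-D search per component
for i in range(len(v)):
    obj = 0.5 * (grid - v[i]) ** 2 + tau[i] * np.abs(grid)
    assert abs(grid[np.argmin(obj)] - x_closed[i]) < 1e-3
print("closed form matches the brute-force minimiser")
```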
ch18-ex12-lista-contraction
(Hard) For a weight-tied LISTA network, prove that the layer map $T(\mathbf{x}) = \eta_\tau\big(\mathbf{W}\mathbf{x} + \mathbf{b}\big)$ is contractive if $\|\mathbf{W}\|_2 < 1$. Derive the fixed-point equation.
Use nonexpansiveness of $\eta_\tau$ and submultiplicativity of the spectral norm.
Nonexpansiveness
Soft-thresholding is the proximal operator of a convex function, hence nonexpansive: $\|\eta_\tau(\mathbf{u}) - \eta_\tau(\mathbf{v})\|_2 \le \|\mathbf{u} - \mathbf{v}\|_2$.
Contraction of one layer
$\|T(\mathbf{x}) - T(\mathbf{x}')\|_2 \le \|\mathbf{W}(\mathbf{x} - \mathbf{x}')\|_2 \le \|\mathbf{W}\|_2\,\|\mathbf{x} - \mathbf{x}'\|_2$, so if $\|\mathbf{W}\|_2 < 1$ the map $T$ is a contraction with rate $\rho = \|\mathbf{W}\|_2$.
Fixed-point equation
By Banach's fixed-point theorem, the unique fixed point satisfies $\mathbf{x}^\star = \eta_\tau\big(\mathbf{W}\mathbf{x}^\star + \mathbf{b}\big)$. The $K$-layer network approximates it with error $\|\mathbf{x}^{(K)} - \mathbf{x}^\star\|_2 \le \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}^\star\|_2$.
ch18-ex13-ladmm-convergence
(Hard) Prove that if the learned proximal operator in L-ADMM is firmly nonexpansive, then the L-ADMM iterates converge to a fixed point (assuming the data-fidelity update remains an exact proximal step of a convex function).
ADMM is Douglas-Rachford splitting applied to the sum of two maximal monotone operators.
Douglas-Rachford reduction
ADMM is a special case of Douglas-Rachford splitting. If both proximal operators are firmly nonexpansive, the Douglas-Rachford operator is averaged, and iterates of an averaged operator converge to a fixed point (Krasnosel'skii-Mann).
Learned operators preserve structure
If the learned operator $f_\theta$ is firmly nonexpansive and the data-fidelity update is the true proximal operator of a convex function, both are resolvents of maximal monotone operators. The standard ADMM convergence theory then applies layer by layer, with the varying penalty parameters $\rho_k$ acting as a preconditioning.
ch18-ex14-spectral-normalisation
(Hard) Design a training procedure for a convergent LISTA network that guarantees $\|\mathbf{W}\|_2 \le c < 1$ at every training step using spectral normalisation. Analyse the effect on the convergence rate.
Spectral normalisation divides $\mathbf{W}$ by its spectral norm, then scales by $c$.
Spectral normalisation
After each gradient update, estimate $\|\mathbf{W}\|_2$ via power iteration (1--2 steps suffice, since $\mathbf{W}$ changes little between updates). Normalise: $\mathbf{W} \leftarrow c\,\mathbf{W}/\|\mathbf{W}\|_2$. This ensures $\|\mathbf{W}\|_2 = c < 1$ after every step.
Effect on convergence
By the contraction argument of ch18-ex12, the layer map then has contraction rate at most $c$, so the distance to the fixed point after $K$ layers is bounded by $c^K$ times the initial distance.
Tradeoff
Smaller $c$: stronger contraction guarantee and faster worst-case convergence, but the constraint on $\mathbf{W}$ limits expressivity. Larger $c$ (close to 1): more expressive, but a weaker contraction guarantee. A value of $c$ slightly below 1 works well in practice.
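A sketch of the projection step, applied after each optimiser update; the constant $c$ and the persistent power-iteration vector are the only extra state, and the names are illustrative.

```python
import torch

@torch.no_grad()
def spectrally_normalise_(W, c=0.95, n_power_iter=2, u=None):
    """Rescale W in place so that ||W||_2 = c, estimating the norm by power iteration."""
    if u is None:
        u = torch.randn(W.shape[0], device=W.device)
    for _ in range(n_power_iter):
        v = torch.nn.functional.normalize(W.T @ u, dim=0)
        u = torch.nn.functional.normalize(W @ v, dim=0)
    sigma = torch.dot(u, W @ v)                  # spectral norm estimate
    W.mul_(c / sigma.clamp(min=1e-12))
    return u                                     # reuse u at the next step

# After each optimiser step (W_t is the tied LISTA weight matrix, name illustrative):
# u = spectrally_normalise_(model.W_t.data, c=0.95, u=u)
```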
ch18-ex15-proxnet-divergence
(Hard) In unrolled OAMP, the state evolution requires the divergence of the ProxNet denoiser: $\operatorname{div} f_\theta(\mathbf{r}) = \sum_{i=1}^{N} \frac{\partial [f_\theta(\mathbf{r})]_i}{\partial r_i}$.
- Show that computing the exact divergence requires $N$ backward passes.
- Derive the Monte Carlo estimator using Stein's identity.
- Prove unbiasedness and compute the variance.
Stein's identity: $\mathbb{E}\big[\mathbf{z}^{\mathsf T}\mathbf{J}_f(\mathbf{r})\,\mathbf{z}\big] = \operatorname{tr}\mathbf{J}_f(\mathbf{r}) = \operatorname{div} f(\mathbf{r})$ when $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Exact divergence cost
Each partial derivative $\partial [f_\theta(\mathbf{r})]_i/\partial r_i$ requires one backward pass with the seed vector $\mathbf{e}_i$. Total: $N$ backward passes. For image-sized $N$ (tens of thousands of pixels or more), this is prohibitive.
Monte Carlo estimator
Draw $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and compute $\widehat{\operatorname{div}} = \mathbf{z}^{\mathsf T}\mathbf{J}_f(\mathbf{r})\,\mathbf{z}$. This requires only one backward pass: the row vector $\mathbf{z}^{\mathsf T}\mathbf{J}_f(\mathbf{r})$ is a vector-Jacobian product with seed $\mathbf{z}$.
Unbiasedness and variance
$\mathbb{E}\big[\widehat{\operatorname{div}}\big] = \operatorname{tr}(\mathbf{J}_f) = \operatorname{div} f$, since $\mathbb{E}[\mathbf{z}\mathbf{z}^{\mathsf T}] = \mathbf{I}$. The variance is $\operatorname{Var}\big[\widehat{\operatorname{div}}\big] = 2\|\mathbf{J}_f\|_F^2$ for symmetric $\mathbf{J}_f$. Averaging $L$ independent draws reduces the variance by a factor of $L$.
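A sketch of the estimator with a sanity check on a linear map, where the divergence equals the trace of the matrix (an assumed toy setup):

```python
import torch

def hutchinson_divergence(f, r, n_samples=1):
    """Monte Carlo estimate of div f(r) = tr(J_f(r)); one VJP per sample."""
    r = r.detach().requires_grad_(True)
    out = f(r)
    est = 0.0
    for _ in range(n_samples):
        z = torch.randn_like(r)
        (vjp,) = torch.autograd.grad(out, r, grad_outputs=z, retain_graph=True)  # z^T J_f(r)
        est = est + torch.dot(vjp.flatten(), z.flatten())                        # z^T J_f(r) z
    return est / n_samples

N = 50
A = 0.1 * torch.randn(N, N) + 2.0 * torch.eye(N)   # linear "denoiser" f(r) = A r, div f = tr(A)
f = lambda x: x @ A.T
print(hutchinson_divergence(f, torch.randn(N), n_samples=2000).item(), torch.trace(A).item())
```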
ch18-ex16-generalisation-weight-tying
(Hard) For a $K$-layer unrolled network with per-layer parameters $\{\mathbf{W}_k, \tau_k\}$, compare the generalisation bound with and without weight tying. Quantify the improvement as a function of $K$.
Without tying, the parameter count grows linearly in $K$; with tying, only the thresholds remain per-layer (shared weights + per-layer thresholds).
Without weight tying
Total parameters: $P_{\text{untied}} = K(N^2 + 1)$ (one $N\times N$ weight matrix and one threshold per layer). A parameter-counting generalisation bound gives a gap of order $\mathcal{O}\big(\sqrt{K N^2 / m}\big)$ for $m$ training samples.
With weight tying
Total parameters: $P_{\text{tied}} = N^2 + K$. Gap: $\mathcal{O}\big(\sqrt{(N^2 + K)/m}\big)$.
Improvement
Ratio of gaps: $\sqrt{K N^2/(N^2 + K)} \approx \sqrt{K}$ when $K \ll N^2$. Weight tying therefore reduces the generalisation gap by a factor of roughly $\sqrt{K}$.
ch18-ex17-alista-derivation
(Challenge) Derive the ALISTA optimal weight matrix. Starting from the LISTA-style update $\mathbf{x}^{(k+1)} = \eta_{\tau_k}\big(\mathbf{x}^{(k)} + \gamma_k\mathbf{W}^{\mathsf T}(\mathbf{y} - \mathbf{A}\mathbf{x}^{(k)})\big)$, show that the optimal $\mathbf{W}$ minimising the worst-case contraction rate over all $s$-sparse signals has the Tikhonov-regularised form
$\mathbf{W} = \big(\mathbf{A}\mathbf{A}^{\mathsf T} + \mu^\ast\mathbf{I}\big)^{-1}\mathbf{A},$
and find the equation determining $\mu^\ast$.
The contraction rate on support $S$ is governed by $\big\|\big(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\big)_{S,S}\big\|_2$ (the step size $\gamma_k$ can be absorbed into $\mathbf{W}$).
Use the Woodbury identity and the SVD of $\mathbf{A}$.
Formulate the minimax problem
After support identification, the contraction rate on support $S$ is $\big\|\big(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\big)_{S,S}\big\|_2$. We seek $\mathbf{W}^\ast = \arg\min_{\mathbf{W}} \max_{|S| \le s} \big\|\big(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\big)_{S,S}\big\|_2$.
Relaxation to spectral bound
Upper-bounding the restricted norm by the full spectral norm $\|\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\|_2$ and writing $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}$, the optimal $\mathbf{W}$ has the Tikhonov form $\mathbf{W} = (\mathbf{A}\mathbf{A}^{\mathsf T} + \mu\mathbf{I})^{-1}\mathbf{A}$.
Optimal weights and $\mu^*$
$\mathbf{W}^\ast = \big(\mathbf{A}\mathbf{A}^{\mathsf T} + \mu^\ast\mathbf{I}\big)^{-1}\mathbf{A}$, for which the restriction to a support $S$ is $\big(\mathbf{W}^{\ast\mathsf T}\mathbf{A}\big)_{S,S} = \mathbf{A}_S^{\mathsf T}\big(\mathbf{A}\mathbf{A}^{\mathsf T} + \mu^\ast\mathbf{I}\big)^{-1}\mathbf{A}_S$.
The equation for $\mu^\ast$ is the stationarity condition of the minimax problem: $\mu^\ast$ is the value at which the worst-case on-support contraction $\max_{|S|\le s}\|(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A})_{S,S}\|_2$ and the off-support leakage are equal, so that decreasing $\mu$ further no longer improves the overall rate.
This balances contraction on the support (favoured by small $\mu$) against stability off the support (favoured by large $\mu$).
ch18-ex18-rip-bound
(Challenge) Prove that for a sensing matrix satisfying the RIP of order $2s$ with constant $\delta_{2s}$, the LISTA network with ALISTA weights achieves stable recovery:
$\|\mathbf{x}^{(K)} - \mathbf{x}\|_2 \;\le\; \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}\|_2 + \frac{C}{1-\rho}\left(\frac{\sigma_s(\mathbf{x})_1}{\sqrt{s}} + \|\mathbf{w}\|_2\right),$
where $\rho < 1$ is a contraction rate determined by $\delta_{2s}$, $\sigma_s(\mathbf{x})_1$ is the best $s$-term approximation error in $\ell_1$, and $\mathbf{w}$ is the measurement noise.
The RIP controls $\|(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A})_{S,S}\|_2$ in terms of $\delta_{2s}$ for the optimal ALISTA weight.
The noise amplification involves $\|\mathbf{W}^{\mathsf T}\mathbf{w}\|_2$.
Contraction on the support
For the ALISTA weight with the optimal $\mu^\ast$, the restricted matrix satisfies $\|(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A})_{S,S}\|_2 \le \rho < 1$ for any support $S$ with $|S| \le 2s$, where $\rho$ is an increasing function of $\delta_{2s}$. After soft-thresholding, the per-layer error therefore contracts by at least the factor $\rho$.
Tail and noise terms
The approximation error from non-exact sparsity contributes a term proportional to $\sigma_s(\mathbf{x})_1/\sqrt{s}$ (the standard LASSO-type bound). The noise term is bounded by $\|\mathbf{W}^{\mathsf T}\mathbf{w}\|_2 \le \|\mathbf{W}\|_2\,\|\mathbf{w}\|_2$.
Combine via induction
By induction on the layer index $k$, the error after $K$ layers satisfies
$\|\mathbf{x}^{(K)} - \mathbf{x}\|_2 \le \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}\|_2 + \sum_{k=0}^{K-1}\rho^k\,C\left(\frac{\sigma_s(\mathbf{x})_1}{\sqrt{s}} + \|\mathbf{w}\|_2\right) \le \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}\|_2 + \frac{C}{1-\rho}\left(\frac{\sigma_s(\mathbf{x})_1}{\sqrt{s}} + \|\mathbf{w}\|_2\right).$
For $K \to \infty$ the geometric term vanishes and only the accumulated tail and noise terms remain.
ch18-ex19-hst-group-lasso
(Challenge) For the OTFS channel estimation problem with angular-delay-Doppler groups $\{\mathcal{G}_g\}_{g=1}^{G}$, derive the proximal operator of the group LASSO penalty $\mathcal{R}(\mathbf{x}) = \sum_{g=1}^{G}\lambda_g\,\|\mathbf{x}_{\mathcal{G}_g}\|_2$. Show that it performs group-wise soft-thresholding.
The proximal operator separates across groups since the groups are disjoint.
For the $\ell_2$ norm: $\operatorname{prox}_{\lambda\|\cdot\|_2}(\mathbf{v}) = \big(1 - \tfrac{\lambda}{\|\mathbf{v}\|_2}\big)_{+}\mathbf{v}$.
Separation across groups
Since the groups partition $\{1, \dots, N\}$, the prox objective splits into independent per-group problems: $\big[\operatorname{prox}_{\mathcal{R}}(\mathbf{v})\big]_{\mathcal{G}_g} = \arg\min_{\mathbf{u}} \tfrac12\|\mathbf{u} - \mathbf{v}_{\mathcal{G}_g}\|_2^2 + \lambda_g\|\mathbf{u}\|_2$.
Per-group proximal operator
For group $g$: $\mathbf{x}^\star_{\mathcal{G}_g} = \Big(1 - \frac{\lambda_g}{\|\mathbf{v}_{\mathcal{G}_g}\|_2}\Big)_{+}\mathbf{v}_{\mathcal{G}_g}$.
This is block soft-thresholding: if $\|\mathbf{v}_{\mathcal{G}_g}\|_2 \le \lambda_g$, the entire group is set to zero. Otherwise, the group is shrunk toward zero by the factor $1 - \lambda_g/\|\mathbf{v}_{\mathcal{G}_g}\|_2$.
Interpretation
Entire angular groups with small energy (noise only) are suppressed. Groups with signal energy are preserved but shrunk. The per-group thresholds $\lambda_g$ can be learned (as in HST) to adapt to the expected energy distribution.
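A minimal sketch of the resulting block soft-thresholding operator (group layout and thresholds are illustrative):

```python
import numpy as np

def group_soft_threshold(v, groups, lam):
    """Prox of sum_g lam[g] * ||x_{G_g}||_2: shrink or zero out each group as a block."""
    x = np.zeros_like(v)
    for g, idx in enumerate(groups):
        norm = np.linalg.norm(v[idx])
        if norm > lam[g]:
            x[idx] = (1.0 - lam[g] / norm) * v[idx]   # shrink the whole group
        # else: the group stays exactly zero
    return x

# Example: three groups of four coefficients; the low-energy group is zeroed out entirely.
v = np.concatenate([np.full(4, 2.0), np.full(4, 0.1), np.full(4, -1.0)])
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
print(group_soft_threshold(v, groups, lam=[1.0, 1.0, 1.0]))
```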
ch18-ex20-end-to-end-training
(Challenge) Design an end-to-end training procedure for a 10-layer unrolled OAMP network with ProxNet denoisers for an RF imaging problem. Address:
(a) Training data generation (how many scenes, what distribution).
(b) Loss function (MSE, perceptual, or combined).
(c) Training schedule (layer-wise pre-training followed by end-to-end).
(d) Memory management (gradient checkpointing).
(e) Validation strategy (in-distribution and out-of-distribution).
Start with layer-wise pre-training, then fine-tune end-to-end.
Use gradient checkpointing every 3 layers to fit in GPU memory.
Training data
Generate 10,000 synthetic scenes:
- 5,000 with 5--20 point scatterers (varying amplitudes)
- 3,000 with extended targets (rectangles, circles)
- 2,000 with mixed point + extended targets
Add noise with the SNR (in dB) sampled uniformly over the operating range. Split: 8,000 train / 1,000 validation / 1,000 test.
Loss function
A combined loss of the form
$\mathcal{L} = \|\hat{\mathbf{x}} - \mathbf{x}^\star\|_2^2 + \alpha\,\mathcal{L}_{\text{perceptual}}(\hat{\mathbf{x}}, \mathbf{x}^\star) + \beta\,\mathcal{L}_{\text{SE}},$
where $\mathcal{L}_{\text{SE}}$ regularises the learned noise variances against the state evolution prediction.
Training schedule
Phase 1: layer-wise pre-training (5 epochs per layer, Adam optimiser).
Phase 2: end-to-end fine-tuning (50 epochs, cosine-annealed learning rate, batch size 16).
Memory management
Gradient checkpointing every 3 layers: store activations only at the checkpoint layers and recompute the intermediate ones during the backward pass. The storage for the checkpoints themselves is negligible compared with keeping the ProxNet activations of all 10 layers.
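A minimal PyTorch sketch of this checkpointing pattern; the stacked layers here are simple stand-ins for the unrolled OAMP layers, and the sizes are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Ten stand-in layers for the unrolled network (real layers would take (x, y, v2)).
layers = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()) for _ in range(10)
])
x0 = torch.randn(16, 256, requires_grad=True)

# Split the 10 layers into 4 segments (~3 layers each); only segment-boundary
# activations are stored, the rest are recomputed in the backward pass.
x_out = checkpoint_sequential(layers, 4, x0)
loss = x_out.pow(2).mean()
loss.backward()
```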
Validation
In-distribution: same scene types and SNR range as training. Out-of-distribution:
- SNR values below and above the training range
- Different scene types (Shepp-Logan phantom, letter shapes)
- Numbers of scatterers outside the training range
Report NMSE, SSIM, and per-pixel uncertainty if available.