Exercises
ch18-ex01-ista-to-lista
(Easy) Write out the ISTA iteration for the LASSO problem and identify which quantities become learnable parameters in LISTA. State the LISTA initialisation for each learnable parameter.
Rewrite the ISTA iteration in the form $\mathbf{x}^{(k+1)} = \eta_{\tau}\big(\mathbf{W}_e\mathbf{y} + \mathbf{W}_t\mathbf{x}^{(k)}\big)$ and read off which matrices and thresholds can be freed.
ISTA iteration
$\mathbf{x}^{(k+1)} = \eta_{\lambda/L}\!\Big(\mathbf{x}^{(k)} + \tfrac{1}{L}\mathbf{A}^{\mathsf T}\big(\mathbf{y} - \mathbf{A}\mathbf{x}^{(k)}\big)\Big), \qquad L = \|\mathbf{A}\|_2^2.$
LISTA parameterisation

| ISTA quantity | LISTA parameter | Initialisation |
|---|---|---|
| $\tfrac{1}{L}\mathbf{A}^{\mathsf T}$ (input mixing) | $\mathbf{W}_e$ | $\mathbf{W}_e \leftarrow \tfrac{1}{L}\mathbf{A}^{\mathsf T}$ |
| $\mathbf{I} - \tfrac{1}{L}\mathbf{A}^{\mathsf T}\mathbf{A}$ (state mixing) | $\mathbf{W}_t$ | $\mathbf{W}_t \leftarrow \mathbf{I} - \tfrac{1}{L}\mathbf{A}^{\mathsf T}\mathbf{A}$ |
| $\lambda/L$ (threshold) | $\tau_k$ (per layer) | $\tau_k \leftarrow \lambda/L$ |
At initialisation, the untrained LISTA network performs exact ISTA. Training then refines the parameters.
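A minimal numerical check of this correspondence, on a small random problem with assumed dimensions: at the initialisation in the table above, one LISTA layer reproduces one ISTA step exactly.

```python
import numpy as np

def soft_threshold(z, tau):
    """Element-wise soft-thresholding eta_tau(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(0)
M, N, lam = 20, 50, 0.1                      # illustrative sizes and regularisation
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, 5, replace=False)] = 1.0
y = A @ x_true
L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of the data-fidelity gradient

x = np.zeros(N)
x_ista = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)   # one ISTA step

W_e = A.T / L                                # LISTA initialisation from the table
W_t = np.eye(N) - A.T @ A / L
tau = lam / L
x_lista = soft_threshold(W_e @ y + W_t @ x, tau)              # one untrained LISTA layer

print(np.allclose(x_ista, x_lista))          # True: untrained LISTA == ISTA
```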
ch18-ex02-admm-split
(Easy) Write out the three ADMM updates for the LASSO problem $\min_{\mathbf{x},\mathbf{z}} \tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1$ subject to the splitting constraint $\mathbf{x} = \mathbf{z}$. Identify the closed-form solution for the $\mathbf{x}$-update.
The $\mathbf{x}$-update is a quadratic minimisation.
Write the ADMM updates
$\mathbf{x}$-update: $\mathbf{x}^{k+1} = \arg\min_{\mathbf{x}} \tfrac12\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}^{k}+\mathbf{u}^{k}\|_2^2$
$\mathbf{z}$-update: $\mathbf{z}^{k+1} = \eta_{\lambda/\rho}\big(\mathbf{x}^{k+1}+\mathbf{u}^{k}\big)$
Dual update: $\mathbf{u}^{k+1} = \mathbf{u}^{k} + \mathbf{x}^{k+1} - \mathbf{z}^{k+1}$
Closed-form solution
The $\mathbf{x}$-update solves the linear system $(\mathbf{A}^{\mathsf T}\mathbf{A} + \rho\mathbf{I})\,\mathbf{x}^{k+1} = \mathbf{A}^{\mathsf T}\mathbf{y} + \rho(\mathbf{z}^{k} - \mathbf{u}^{k})$. Since $\mathbf{A}^{\mathsf T}\mathbf{A} + \rho\mathbf{I}$ is positive definite for $\rho > 0$, the solution is unique. For Kronecker-structured $\mathbf{A}$, this system can be solved efficiently via the FFT.
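A compact sketch of the resulting algorithm, assuming a generic dense $\mathbf{A}$; the normal-equations matrix is factorised once and reused in every $\mathbf{x}$-update.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm_lasso(A, y, lam, rho=1.0, n_iter=100):
    """ADMM for the LASSO with the closed-form x-update derived above."""
    N = A.shape[1]
    factor = cho_factor(A.T @ A + rho * np.eye(N))   # positive definite for rho > 0
    Aty = A.T @ y
    x = z = u = np.zeros(N)
    for _ in range(n_iter):
        x = cho_solve(factor, Aty + rho * (z - u))                           # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)      # z-update
        u = u + x - z                                                        # dual update
    return z
```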
ch18-ex03-weight-tying
(Easy) A 15-layer LISTA network with $\mathbf{x} \in \mathbb{R}^{N}$ and $\mathbf{y} \in \mathbb{R}^{M}$ has weight matrices $\mathbf{W}_e \in \mathbb{R}^{N\times M}$ and $\mathbf{W}_t \in \mathbb{R}^{N\times N}$. Compute the parameter count with and without weight tying.
Without weight tying, every layer carries its own copy of $\mathbf{W}_e$, $\mathbf{W}_t$ and $\tau$.
Without weight tying
Per layer: $NM + N^2 + 1$ parameters ($\mathbf{W}_e$, $\mathbf{W}_t$ and one threshold). Total: $15(NM + N^2 + 1)$ parameters.
With weight tying
Shared: $NM + N^2$ (a single $\mathbf{W}_e$ and $\mathbf{W}_t$ used by all layers). Per-layer thresholds: 15. Total: $NM + N^2 + 15$ parameters, roughly a 15$\times$ reduction.
ch18-ex04-oamp-unrolled-structure
(Easy) Draw a block diagram of one layer of the unrolled OAMP-ProxNet architecture. Label the inputs, outputs, and learnable components. Identify which components use the sensing matrix $\mathbf{A}$ and which do not.
The LMMSE step uses $\mathbf{A}$. The ProxNet denoiser does not.
Block diagram description
Input: previous estimate $\mathbf{x}^{(t)}$, measurements $\mathbf{y}$, current error variance $v_t^2$
LMMSE block (uses $\mathbf{A}$, not learned): $\mathbf{r}^{(t)} = \mathbf{x}^{(t)} + \mathbf{W}_t\big(\mathbf{y} - \mathbf{A}\mathbf{x}^{(t)}\big)$
State evolution (uses the singular values of $\mathbf{A}$, not learned): compute the effective noise level $\tau_t$ of the pseudo-data $\mathbf{r}^{(t)}$
ProxNet (does NOT use $\mathbf{A}$, LEARNED): $\mathbf{x}^{(t+1)} = f_{\boldsymbol{\theta}_t}\big(\mathbf{r}^{(t)}, \tau_t\big)$
Output: updated estimate $\mathbf{x}^{(t+1)}$
Summary of components
| Component | Uses $\mathbf{A}$? | Learnable? |
|---|---|---|
| LMMSE step | Yes | No (physics) |
| Orthogonalisation | Yes (SVD) | No (analytical) |
| State evolution | Yes (singular values) | No (analytical) |
| ProxNet denoiser | No | Yes (denoiser parameters $\boldsymbol{\theta}_t$) |
Only the denoiser is learned. This is why unrolled OAMP has far fewer parameters than LISTA.
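A simplified PyTorch sketch of one such layer, with assumed shapes and a toy MLP standing in for the ProxNet; the error variance $v_t^2$ is assumed to be supplied by the state evolution. It makes the split explicit: the LMMSE step is built from $\mathbf{A}$ and is not trained, while the denoiser never sees $\mathbf{A}$.

```python
import torch

class OAMPLayer(torch.nn.Module):
    """One unrolled OAMP layer: analytical LMMSE step + learned denoiser (sketch)."""

    def __init__(self, A, sigma2, hidden=64):
        super().__init__()
        self.register_buffer("A", A)              # M x N sensing matrix: physics, not learned
        self.sigma2 = sigma2                      # measurement noise variance
        N = A.shape[1]
        self.proxnet = torch.nn.Sequential(       # learned denoiser: never sees A
            torch.nn.Linear(N, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, N))

    def forward(self, x, y, v2):
        A = self.A
        M, N = A.shape
        # LMMSE linear estimator, then trace normalisation (the orthogonalisation step)
        W_hat = v2 * A.T @ torch.linalg.inv(v2 * A @ A.T + self.sigma2 * torch.eye(M, device=A.device))
        W = (N / torch.trace(W_hat @ A)) * W_hat
        r = x + W @ (y - A @ x)                   # pseudo-data with approximately white error
        return self.proxnet(r), r                 # learned nonlinear refinement
```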
ch18-ex05-convergence-check
(Easy) Given the state evolution recursion $\tau_{t+1}^2 = \sigma^2 + \tfrac{N}{M}\,\mathrm{mmse}(\tau_t^2)$ for a Bernoulli-Gaussian signal with sparsity 0.1 and unit variance, compute the state evolution trajectory over the unrolled layers, starting from the given $\tau_0^2$.
Evaluate the scalar MMSE at the current $\tau_t^2$, update $\tau_{t+1}^2$, and repeat layer by layer.
Iterate the state evolution
Iterating the recursion layer by layer gives a monotonically decreasing sequence $\tau_1^2 > \tau_2^2 > \dots$; the numerical values follow directly from the stated parameters (a sketch for regenerating the trajectory is given below).
Analysis
The trajectory converges to a fixed point $\tau_\infty^2$ (equivalently, a floor in NMSE). Convergence is rapid in the first 3 layers and slows near the fixed point. A ProxNet denoiser achieving a lower MMSE at each step would shift the fixed point downward.
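A short sketch for regenerating such a trajectory, with illustrative (assumed) values for the noise variance, measurement ratio and initial state; the Bernoulli-Gaussian MMSE is estimated by Monte Carlo with the Bayes-optimal scalar denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, sigma2, delta = 0.1, 0.01, 0.5        # sparsity, noise variance, M/N (assumed values)
tau2, n_mc, n_layers = 1.0, 200_000, 5     # initial state, Monte Carlo samples, layers

def mmse_bg(tau2):
    """Monte Carlo MMSE of the Bayes denoiser for a Bernoulli-Gaussian(eps, 1) prior."""
    x = rng.binomial(1, eps, n_mc) * rng.standard_normal(n_mc)
    y = x + np.sqrt(tau2) * rng.standard_normal(n_mc)
    num = eps * np.exp(-y**2 / (2 * (1 + tau2))) / np.sqrt(1 + tau2)     # active component
    den = num + (1 - eps) * np.exp(-y**2 / (2 * tau2)) / np.sqrt(tau2)   # + inactive component
    x_hat = (num / den) * y / (1 + tau2)                                 # posterior mean
    return np.mean((x - x_hat) ** 2)

for t in range(1, n_layers + 1):
    tau2 = sigma2 + mmse_bg(tau2) / delta   # state evolution update
    print(f"layer {t}: tau^2 = {tau2:.4f}")
```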
ch18-ex06-gradient-flow
(Medium) For a $K$-layer LISTA network, derive the gradient of the MSE loss with respect to the threshold $\tau_j$ at layer $j$. Show that the Jacobian of soft-thresholding with respect to its input is a diagonal matrix with entries in $\{0,1\}$.
The Jacobian of $\eta_\tau(\mathbf{z})$ w.r.t. $\mathbf{z}$ is $\operatorname{diag}\!\big(\mathbf{1}\{|z_i| > \tau\}\big)$.
The derivative w.r.t. $\tau$ is $-\operatorname{sign}(z_i)\,\mathbf{1}\{|z_i| > \tau\}$.
Jacobian of soft-thresholding w.r.t. input
$\frac{\partial [\eta_\tau(\mathbf{z})]_i}{\partial z_j} = \begin{cases} 1 & i = j \text{ and } |z_i| > \tau,\\ 0 & \text{otherwise},\end{cases} \qquad\text{i.e.}\qquad \mathbf{J}_\eta(\mathbf{z}) = \operatorname{diag}\!\big(\mathbf{1}\{|z_i| > \tau\}\big),$
a diagonal matrix with entries in $\{0,1\}$ (away from the measure-zero set $|z_i| = \tau$).
Derivative w.r.t. threshold
$\frac{\partial [\eta_\tau(\mathbf{z})]_i}{\partial \tau} = -\operatorname{sign}(z_i)\,\mathbf{1}\{|z_i| > \tau\}.$
Chain rule through the network
With $\mathbf{z}^{(k)} = \mathbf{W}_e\mathbf{y} + \mathbf{W}_k\mathbf{x}^{(k-1)}$, $\mathbf{x}^{(k)} = \eta_{\tau_k}(\mathbf{z}^{(k)})$ and loss $\mathcal{L} = \tfrac12\|\mathbf{x}^{(K)} - \mathbf{x}^\star\|_2^2$, the chain rule gives
$\frac{\partial \mathcal{L}}{\partial \tau_j} = -\big(\mathbf{x}^{(K)} - \mathbf{x}^\star\big)^{\!\mathsf T}\left[\prod_{k=K}^{j+1}\operatorname{diag}\!\big(\mathbf{1}\{|\mathbf{z}^{(k)}| > \tau_k\}\big)\,\mathbf{W}_k\right]\Big(\operatorname{sign}(\mathbf{z}^{(j)}) \odot \mathbf{1}\{|\mathbf{z}^{(j)}| > \tau_j\}\Big),$
with the factors ordered from $k = K$ down to $k = j+1$. Each factor is a masked layer matrix $\operatorname{diag}(\mathbf{1}\{|\mathbf{z}^{(k)}| > \tau_k\})\,\mathbf{W}_k$, so coordinates that are thresholded to zero at any later layer pass no gradient back to $\tau_j$. $\square$
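A quick autograd check of both derivative formulas, on an assumed random input (PyTorch):

```python
import torch

def soft_threshold(z, tau):
    return torch.sign(z) * torch.clamp(torch.abs(z) - tau, min=0.0)

z = torch.randn(6, dtype=torch.double, requires_grad=True)
tau = torch.tensor(0.5, dtype=torch.double, requires_grad=True)
out = soft_threshold(z, tau)
mask = (z.abs() > tau).double()

# Jacobian w.r.t. the input: diagonal with entries in {0, 1}
J = torch.autograd.functional.jacobian(lambda z_: soft_threshold(z_, tau), z)
print(torch.allclose(J, torch.diag(mask)))

# Derivative w.r.t. the threshold: -sign(z_i) * 1{|z_i| > tau}
(g_tau,) = torch.autograd.grad(out.sum(), tau)
print(torch.isclose(g_tau, (-torch.sign(z) * mask).sum()))
```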
ch18-ex07-oamp-orthogonality
(Medium) Show that the OAMP linear (LMMSE) estimator $\hat{\mathbf{W}}_t = v_t^2\,\mathbf{A}^{\mathsf T}\big(v_t^2\mathbf{A}\mathbf{A}^{\mathsf T} + \sigma^2\mathbf{I}\big)^{-1}$ satisfies $\operatorname{tr}\big(\hat{\mathbf{W}}_t\mathbf{A}\big) = \sum_i \frac{v_t^2\sigma_i^2}{v_t^2\sigma_i^2 + \sigma^2}$, where $\sigma_i$ are the singular values of $\mathbf{A}$, and explain why the trace-normalised estimator is divergence-free.
Use the SVD $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}$ and simplify.
Substitute the SVD
With $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}$,
$\hat{\mathbf{W}}_t\mathbf{A} = v_t^2\,\mathbf{V}\boldsymbol{\Sigma}^{\mathsf T}\big(v_t^2\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\mathsf T} + \sigma^2\mathbf{I}\big)^{-1}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T} = \mathbf{V}\operatorname{diag}\!\left(\frac{v_t^2\sigma_i^2}{v_t^2\sigma_i^2 + \sigma^2}\right)\mathbf{V}^{\mathsf T}.$
Compute the trace
$\frac{1}{N}\operatorname{tr}\big(\hat{\mathbf{W}}_t\mathbf{A}\big) = \frac{1}{N}\sum_{i=1}^{M}\frac{v_t^2\sigma_i^2}{v_t^2\sigma_i^2 + \sigma^2},$
which tends to $0$ as $v_t \to 0$ and to $M/N$ as $v_t \to \infty$ (for a full-row-rank $\mathbf{A}$). The trace-normalised OAMP estimator $\mathbf{W}_t = \frac{N}{\operatorname{tr}(\hat{\mathbf{W}}_t\mathbf{A})}\hat{\mathbf{W}}_t$ therefore satisfies $\operatorname{tr}\big(\mathbf{I} - \mathbf{W}_t\mathbf{A}\big) = 0$, which is the divergence-free (orthogonality) property. $\square$
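A quick numerical check of the trace identity on an assumed random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, v2, sigma2 = 30, 80, 0.5, 0.1
A = rng.standard_normal((M, N)) / np.sqrt(M)

W_hat = v2 * A.T @ np.linalg.inv(v2 * A @ A.T + sigma2 * np.eye(M))
s = np.linalg.svd(A, compute_uv=False)
lhs = np.trace(W_hat @ A)                           # direct trace
rhs = np.sum(v2 * s**2 / (v2 * s**2 + sigma2))      # sum over singular values
print(np.isclose(lhs, rhs))
```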
ch18-ex08-kronecker-complexity
(Medium) For a Kronecker-structured sensing matrix $\mathbf{A} = \mathbf{A}_1 \otimes \mathbf{A}_2$ with $\mathbf{A}_1 \in \mathbb{R}^{M_1\times N_1}$ and $\mathbf{A}_2 \in \mathbb{R}^{M_2\times N_2}$, compare the cost of computing $\mathbf{A}\mathbf{x}$ and $\mathbf{A}^{\mathsf T}\mathbf{y}$ with and without exploiting the Kronecker structure.
With Kronecker structure: reshape $\mathbf{x}$ into a matrix $\mathbf{X} \in \mathbb{R}^{N_2\times N_1}$ and compute $\mathbf{A}_2\mathbf{X}\mathbf{A}_1^{\mathsf T}$.
Without Kronecker structure
$\mathbf{A} \in \mathbb{R}^{M\times N}$ with $M = M_1M_2$, $N = N_1N_2$. Cost of $\mathbf{A}\mathbf{x}$: $\mathcal{O}(MN) = \mathcal{O}(M_1M_2N_1N_2)$ operations, plus $\mathcal{O}(MN)$ memory to store $\mathbf{A}$ explicitly.
With Kronecker structure
Reshape $\mathbf{x}$ into $\mathbf{X} \in \mathbb{R}^{N_2\times N_1}$. Then $\mathbf{A}\mathbf{x} = \operatorname{vec}\big(\mathbf{A}_2\mathbf{X}\mathbf{A}_1^{\mathsf T}\big)$, and likewise $\mathbf{A}^{\mathsf T}\mathbf{y} = \operatorname{vec}\big(\mathbf{A}_2^{\mathsf T}\mathbf{Y}\mathbf{A}_1\big)$.
Cost: $\mathcal{O}(M_2N_2N_1 + M_1M_2N_1)$ operations, and only $\mathbf{A}_1$ and $\mathbf{A}_2$ need to be stored.
Speedup
For square factors ($M_i = N_i = n$, so $N = n^2$): $\mathcal{O}(n^4)$ without the structure versus $\mathcal{O}(n^3)$ with it, i.e. a speedup of order $n$ that grows with the problem size.
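A short check of the reshaping identity on assumed sizes; the dense path builds the full Kronecker matrix, the structured path never forms it.

```python
import numpy as np

rng = np.random.default_rng(2)
M1, N1, M2, N2 = 16, 32, 24, 40                 # illustrative factor sizes
A1 = rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2))
x = rng.standard_normal(N1 * N2)

# Dense: materialise the (M1*M2) x (N1*N2) matrix
y_dense = np.kron(A1, A2) @ x

# Structured: (A1 kron A2) vec(X) = vec(A2 X A1^T) with column-major vec, X in R^{N2 x N1}
X = x.reshape(N1, N2).T
y_kron = (A2 @ X @ A1.T).T.reshape(-1)
print(np.allclose(y_dense, y_kron))
```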
ch18-ex09-intermediate-supervision
(Medium) In intermediate supervision, the loss includes terms at every layer: $\mathcal{L} = \sum_{k=1}^{K} w_k\,\|\mathbf{x}^{(k)} - \mathbf{x}^\star\|_2^2$. Argue that this helps with vanishing gradients and propose a weight schedule $\{w_k\}$.
The direct gradient term at layer $k$ bypasses the vanishing chain.
Without intermediate supervision
The gradient that reaches layer $j$ from a loss on the final output alone passes through $K - j$ layer Jacobians. If each Jacobian has spectral norm at most $\rho < 1$, the gradient magnitude decays as $\rho^{K-j}$, so in deep unrolled networks the early layers receive a signal that is orders of magnitude weaker than the late layers'.
With intermediate supervision
The term $w_j\|\mathbf{x}^{(j)} - \mathbf{x}^\star\|_2^2$ provides a gradient signal directly at layer $j$ that bypasses the vanishing chain.
Proposed weight schedule
Use linearly increasing weights, e.g. $w_k = k/K$. This gives the most weight to the final output while providing non-trivial supervision at every layer. An alternative is a geometric schedule $w_k = \gamma^{K-k}$ with $\gamma \in (0,1)$.
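A minimal sketch of such a loss; the two schedules below match the proposals above, and the function and argument names are illustrative.

```python
import torch

def intermediate_supervision_loss(layer_outputs, x_star, schedule="linear", gamma=0.5):
    """Weighted per-layer MSE; layer_outputs = [x^(1), ..., x^(K)]."""
    K = len(layer_outputs)
    if schedule == "linear":
        weights = [k / K for k in range(1, K + 1)]             # w_k = k/K
    else:
        weights = [gamma ** (K - k) for k in range(1, K + 1)]  # w_k = gamma^(K-k)
    return sum(w * torch.mean((x_k - x_star) ** 2)
               for w, x_k in zip(weights, layer_outputs))
```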
ch18-ex10-ista-vs-oamp-condition
(Medium) A sensing matrix has singular values uniformly distributed in $[\sigma_{\min}, \sigma_{\max}]$ with a large condition number $\kappa = \sigma_{\max}/\sigma_{\min}$. Compute the ISTA step size and the ISTA contraction rate. Compare with the OAMP per-singular-value weights and argue why OAMP converges faster.
$L = \sigma_{\max}^2$, so the step size is $\mu = 1/\sigma_{\max}^2$.
ISTA analysis
$L = \|\mathbf{A}\|_2^2 = \sigma_{\max}^2$, so the step size is $\mu = 1/\sigma_{\max}^2$. The slowest mode contracts at rate $1 - \mu\sigma_{\min}^2 = 1 - 1/\kappa^2$. After 100 iterations that mode has shrunk only by $(1 - 1/\kappa^2)^{100}$, which stays close to 1 for large $\kappa$: almost no progress.
OAMP analysis
OAMP applies the per-singular-value weights $\frac{v^2\sigma_i^2}{v^2\sigma_i^2 + \sigma^2}$ through its LMMSE step. For the bulk of the singular values this weight is much larger than the ISTA factor $\mu\sigma_i^2 = \sigma_i^2/\sigma_{\max}^2$, so all modes are updated at comparable rates.
The effective noise after the LMMSE step involves an average over the whole spectrum of singular values, so it is dominated by the bulk of the spectrum rather than by the extreme $\sigma_{\min}$.
Conclusion
ISTA is bottlenecked by the worst singular value; OAMP adapts to the full spectrum. For such an ill-conditioned matrix, OAMP reaches its fixed point within a handful of layers, while the number of ISTA iterations needed grows like $\kappa^2$.
ch18-ex11-hst-weighted-l1
(Medium) Show that spatially-varying soft-thresholding with per-component thresholds $\tau_i$ is the proximal operator of the weighted $\ell_1$ norm $g(\mathbf{x}) = \sum_i \tau_i|x_i|$.
Write $\operatorname{prox}_g(\mathbf{v}) = \arg\min_{\mathbf{x}} \tfrac12\|\mathbf{x}-\mathbf{v}\|_2^2 + \sum_i \tau_i|x_i|$ and solve component-wise.
Proximal operator definition
$\operatorname{prox}_g(\mathbf{v}) = \arg\min_{\mathbf{x}} \tfrac12\|\mathbf{x}-\mathbf{v}\|_2^2 + \sum_i \tau_i|x_i|.$
The objective separates across components: $\sum_i \big[\tfrac12(x_i - v_i)^2 + \tau_i|x_i|\big]$.
Per-component solution
Each scalar problem is the standard $\ell_1$ proximal problem with weight $\tau_i$, whose solution is $x_i^\star = \operatorname{sign}(v_i)\max\big(|v_i| - \tau_i,\, 0\big) = \eta_{\tau_i}(v_i)$.
This is exactly the spatially-varying soft-thresholding operator, so HST is the proximal operator of the weighted $\ell_1$ norm $g(\mathbf{x}) = \sum_i \tau_i|x_i|$. $\square$
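A small numerical confirmation: the closed form matches a brute-force minimisation of each scalar prox objective (the input values are arbitrary).

```python
import numpy as np

def hst(v, tau):
    """Spatially-varying (per-component) soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(3)
v = rng.standard_normal(8)
tau = rng.uniform(0.1, 1.0, 8)
x_closed = hst(v, tau)

grid = np.linspace(-5.0, 5.0, 200_001)          # brute-force 1-D search per component
for i in range(len(v)):
    obj = 0.5 * (grid - v[i]) ** 2 + tau[i] * np.abs(grid)
    assert abs(grid[np.argmin(obj)] - x_closed[i]) < 1e-3
print("closed form matches the brute-force minimiser")
```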
ch18-ex12-lista-contraction
(Hard) For a weight-tied LISTA network, prove that the layer map $T(\mathbf{x}) = \eta_\tau\big(\mathbf{W}\mathbf{x} + \mathbf{b}\big)$ is contractive if $\|\mathbf{W}\|_2 < 1$. Derive the fixed-point equation.
Use nonexpansiveness of $\eta_\tau$ and submultiplicativity of the spectral norm.
Nonexpansiveness
Soft-thresholding is the proximal operator of a convex function, hence nonexpansive: $\|\eta_\tau(\mathbf{u}) - \eta_\tau(\mathbf{v})\|_2 \le \|\mathbf{u} - \mathbf{v}\|_2$.
Contraction of one layer
$\|T(\mathbf{x}) - T(\mathbf{x}')\|_2 \le \|\mathbf{W}(\mathbf{x} - \mathbf{x}')\|_2 \le \|\mathbf{W}\|_2\,\|\mathbf{x} - \mathbf{x}'\|_2$, so if $\|\mathbf{W}\|_2 < 1$ the map $T$ is a contraction with rate $\rho = \|\mathbf{W}\|_2$.
Fixed-point equation
By Banach's fixed-point theorem, the unique fixed point satisfies $\mathbf{x}^\star = \eta_\tau\big(\mathbf{W}\mathbf{x}^\star + \mathbf{b}\big)$. The $K$-layer network approximates it with error $\|\mathbf{x}^{(K)} - \mathbf{x}^\star\|_2 \le \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}^\star\|_2$.
ch18-ex13-ladmm-convergence
(Hard) Prove that if the learned proximal operator in L-ADMM is firmly nonexpansive, then the L-ADMM iterates converge to a fixed point (assuming the data-fidelity update remains an exact proximal step of a convex function).
ADMM is Douglas-Rachford splitting applied to the sum of two maximal monotone operators.
Douglas-Rachford reduction
ADMM is a special case of Douglas-Rachford splitting. If both proximal operators are firmly nonexpansive, the Douglas-Rachford operator is averaged, and iterates of an averaged operator converge to a fixed point (Krasnosel'skii-Mann).
Learned operators preserve structure
If the learned operator $f_\theta$ is firmly nonexpansive and the data-fidelity update is the true proximal operator of a convex function, both are resolvents of maximal monotone operators. The standard ADMM convergence theory then applies layer by layer, with the varying penalty parameters $\rho_k$ acting as a preconditioning.
ch18-ex14-spectral-normalisation
(Hard) Design a training procedure for a convergent LISTA network that guarantees $\|\mathbf{W}\|_2 \le c < 1$ at every training step using spectral normalisation. Analyse the effect on the convergence rate.
Spectral normalisation divides $\mathbf{W}$ by its spectral norm, then scales by $c$.
Spectral normalisation
After each gradient update, estimate $\|\mathbf{W}\|_2$ via power iteration (1--2 steps suffice, since $\mathbf{W}$ changes little between updates). Normalise: $\mathbf{W} \leftarrow c\,\mathbf{W}/\|\mathbf{W}\|_2$. This ensures $\|\mathbf{W}\|_2 = c < 1$ after every step.
Effect on convergence
By the contraction argument of ch18-ex12, the layer map then has contraction rate at most $c$, so the distance to the fixed point after $K$ layers is bounded by $c^K$ times the initial distance.
Tradeoff
Smaller $c$: stronger contraction guarantee and faster worst-case convergence, but the constraint on $\mathbf{W}$ limits expressivity. Larger $c$ (close to 1): more expressive, but a weaker contraction guarantee. A value of $c$ slightly below 1 works well in practice.
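A sketch of the projection step, applied after each optimiser update; the constant $c$ and the persistent power-iteration vector are the only extra state, and the names are illustrative.

```python
import torch

@torch.no_grad()
def spectrally_normalise_(W, c=0.95, n_power_iter=2, u=None):
    """Rescale W in place so that ||W||_2 = c, estimating the norm by power iteration."""
    if u is None:
        u = torch.randn(W.shape[0], device=W.device)
    for _ in range(n_power_iter):
        v = torch.nn.functional.normalize(W.T @ u, dim=0)
        u = torch.nn.functional.normalize(W @ v, dim=0)
    sigma = torch.dot(u, W @ v)                  # spectral norm estimate
    W.mul_(c / sigma.clamp(min=1e-12))
    return u                                     # reuse u at the next step

# After each optimiser step (W_t is the tied LISTA weight matrix, name illustrative):
# u = spectrally_normalise_(model.W_t.data, c=0.95, u=u)
```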
ch18-ex15-proxnet-divergence
(Hard) In unrolled OAMP, the state evolution requires the divergence of the ProxNet denoiser: $\operatorname{div} f_\theta(\mathbf{r}) = \sum_{i=1}^{N} \frac{\partial [f_\theta(\mathbf{r})]_i}{\partial r_i}$.
- Show that computing the exact divergence requires $N$ backward passes.
- Derive the Monte Carlo estimator using Stein's identity.
- Prove unbiasedness and compute the variance.
Stein's identity: $\mathbb{E}\big[\mathbf{z}^{\mathsf T}\mathbf{J}_f(\mathbf{r})\,\mathbf{z}\big] = \operatorname{tr}\mathbf{J}_f(\mathbf{r}) = \operatorname{div} f(\mathbf{r})$ when $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Exact divergence cost
Each partial derivative $\partial [f_\theta(\mathbf{r})]_i/\partial r_i$ requires one backward pass with the seed vector $\mathbf{e}_i$. Total: $N$ backward passes. For image-sized $N$ (tens of thousands of pixels or more), this is prohibitive.
Monte Carlo estimator
Draw $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and compute $\widehat{\operatorname{div}} = \mathbf{z}^{\mathsf T}\mathbf{J}_f(\mathbf{r})\,\mathbf{z}$. This requires only one backward pass: the row vector $\mathbf{z}^{\mathsf T}\mathbf{J}_f(\mathbf{r})$ is a vector-Jacobian product with seed $\mathbf{z}$.
Unbiasedness and variance
$\mathbb{E}\big[\widehat{\operatorname{div}}\big] = \operatorname{tr}(\mathbf{J}_f) = \operatorname{div} f$, since $\mathbb{E}[\mathbf{z}\mathbf{z}^{\mathsf T}] = \mathbf{I}$. The variance is $\operatorname{Var}\big[\widehat{\operatorname{div}}\big] = 2\|\mathbf{J}_f\|_F^2$ for symmetric $\mathbf{J}_f$. Averaging $L$ independent draws reduces the variance by a factor of $L$.
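A sketch of the estimator with a sanity check on a linear map, where the divergence equals the trace of the matrix (an assumed toy setup):

```python
import torch

def hutchinson_divergence(f, r, n_samples=1):
    """Monte Carlo estimate of div f(r) = tr(J_f(r)); one VJP per sample."""
    r = r.detach().requires_grad_(True)
    out = f(r)
    est = 0.0
    for _ in range(n_samples):
        z = torch.randn_like(r)
        (vjp,) = torch.autograd.grad(out, r, grad_outputs=z, retain_graph=True)  # z^T J_f(r)
        est = est + torch.dot(vjp.flatten(), z.flatten())                        # z^T J_f(r) z
    return est / n_samples

N = 50
A = 0.1 * torch.randn(N, N) + 2.0 * torch.eye(N)   # linear "denoiser" f(r) = A r, div f = tr(A)
f = lambda x: x @ A.T
print(hutchinson_divergence(f, torch.randn(N), n_samples=2000).item(), torch.trace(A).item())
```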
ch18-ex16-generalisation-weight-tying
(Hard) For a $K$-layer unrolled network with per-layer parameters $\{\mathbf{W}_k, \tau_k\}$, compare the generalisation bound with and without weight tying. Quantify the improvement as a function of $K$.
Without tying, the parameter count grows linearly in $K$; with tying, only the thresholds remain per-layer (shared weights + per-layer thresholds).
Without weight tying
Total parameters: $P_{\text{untied}} = K(N^2 + 1)$ (one $N\times N$ weight matrix and one threshold per layer). A parameter-counting generalisation bound gives a gap of order $\mathcal{O}\big(\sqrt{K N^2 / m}\big)$ for $m$ training samples.
With weight tying
Total parameters: $P_{\text{tied}} = N^2 + K$. Gap: $\mathcal{O}\big(\sqrt{(N^2 + K)/m}\big)$.
Improvement
Ratio of gaps: $\sqrt{K N^2/(N^2 + K)} \approx \sqrt{K}$ when $K \ll N^2$. Weight tying therefore reduces the generalisation gap by a factor of roughly $\sqrt{K}$.
ch18-ex17-alista-derivation
(Challenge) Derive the ALISTA optimal weight matrix. Starting from the LISTA-style update $\mathbf{x}^{(k+1)} = \eta_{\tau_k}\big(\mathbf{x}^{(k)} + \gamma_k\mathbf{W}^{\mathsf T}(\mathbf{y} - \mathbf{A}\mathbf{x}^{(k)})\big)$, show that the optimal $\mathbf{W}$ minimising the worst-case contraction rate over all $s$-sparse signals has the Tikhonov-regularised form
$\mathbf{W} = \big(\mathbf{A}\mathbf{A}^{\mathsf T} + \mu^\ast\mathbf{I}\big)^{-1}\mathbf{A},$
and find the equation determining $\mu^\ast$.
The contraction rate on support $S$ is governed by $\big\|\big(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\big)_{S,S}\big\|_2$ (the step size $\gamma_k$ can be absorbed into $\mathbf{W}$).
Use the Woodbury identity and the SVD of $\mathbf{A}$.
Formulate the minimax problem
After support identification, the contraction rate on support $S$ is $\big\|\big(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\big)_{S,S}\big\|_2$. We seek $\mathbf{W}^\ast = \arg\min_{\mathbf{W}} \max_{|S| \le s} \big\|\big(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\big)_{S,S}\big\|_2$.
Relaxation to spectral bound
Upper-bounding the restricted norm by the full spectral norm $\|\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A}\|_2$ and writing $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}$, the optimal $\mathbf{W}$ has the Tikhonov form $\mathbf{W} = (\mathbf{A}\mathbf{A}^{\mathsf T} + \mu\mathbf{I})^{-1}\mathbf{A}$.
Optimal weights and $\mu^*$
$\mathbf{W}^\ast = \big(\mathbf{A}\mathbf{A}^{\mathsf T} + \mu^\ast\mathbf{I}\big)^{-1}\mathbf{A}$, for which the restriction to a support $S$ is $\big(\mathbf{W}^{\ast\mathsf T}\mathbf{A}\big)_{S,S} = \mathbf{A}_S^{\mathsf T}\big(\mathbf{A}\mathbf{A}^{\mathsf T} + \mu^\ast\mathbf{I}\big)^{-1}\mathbf{A}_S$.
The equation for $\mu^\ast$ is the stationarity condition of the minimax problem: $\mu^\ast$ is the value at which the worst-case on-support contraction $\max_{|S|\le s}\|(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A})_{S,S}\|_2$ and the off-support leakage are equal, so that decreasing $\mu$ further no longer improves the overall rate.
This balances contraction on the support (favoured by small $\mu$) against stability off the support (favoured by large $\mu$).
ch18-ex18-rip-bound
(Challenge) Prove that for a sensing matrix satisfying the RIP of order $2s$ with constant $\delta_{2s}$, the LISTA network with ALISTA weights achieves stable recovery:
$\|\mathbf{x}^{(K)} - \mathbf{x}\|_2 \;\le\; \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}\|_2 + \frac{C}{1-\rho}\left(\frac{\sigma_s(\mathbf{x})_1}{\sqrt{s}} + \|\mathbf{w}\|_2\right),$
where $\rho < 1$ is a contraction rate determined by $\delta_{2s}$, $\sigma_s(\mathbf{x})_1$ is the best $s$-term approximation error in $\ell_1$, and $\mathbf{w}$ is the measurement noise.
The RIP controls $\|(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A})_{S,S}\|_2$ in terms of $\delta_{2s}$ for the optimal ALISTA weight.
The noise amplification involves $\|\mathbf{W}^{\mathsf T}\mathbf{w}\|_2$.
Contraction on the support
For the ALISTA weight with the optimal $\mu^\ast$, the restricted matrix satisfies $\|(\mathbf{I} - \mathbf{W}^{\mathsf T}\mathbf{A})_{S,S}\|_2 \le \rho < 1$ for any support $S$ with $|S| \le 2s$, where $\rho$ is an increasing function of $\delta_{2s}$. After soft-thresholding, the per-layer error therefore contracts by at least the factor $\rho$.
Tail and noise terms
The approximation error from non-exact sparsity contributes a term proportional to $\sigma_s(\mathbf{x})_1/\sqrt{s}$ (the standard LASSO-type bound). The noise term is bounded by $\|\mathbf{W}^{\mathsf T}\mathbf{w}\|_2 \le \|\mathbf{W}\|_2\,\|\mathbf{w}\|_2$.
Combine via induction
By induction on the layer index $k$, the error after $K$ layers satisfies
$\|\mathbf{x}^{(K)} - \mathbf{x}\|_2 \le \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}\|_2 + \sum_{k=0}^{K-1}\rho^k\,C\left(\frac{\sigma_s(\mathbf{x})_1}{\sqrt{s}} + \|\mathbf{w}\|_2\right) \le \rho^K\,\|\mathbf{x}^{(0)} - \mathbf{x}\|_2 + \frac{C}{1-\rho}\left(\frac{\sigma_s(\mathbf{x})_1}{\sqrt{s}} + \|\mathbf{w}\|_2\right).$
For $K \to \infty$ the geometric term vanishes and only the accumulated tail and noise terms remain.
ch18-ex19-hst-group-lasso
(Challenge) For the OTFS channel estimation problem with angular-delay-Doppler groups $\{\mathcal{G}_g\}_{g=1}^{G}$, derive the proximal operator of the group LASSO penalty $\mathcal{R}(\mathbf{x}) = \sum_{g=1}^{G}\lambda_g\,\|\mathbf{x}_{\mathcal{G}_g}\|_2$. Show that it performs group-wise soft-thresholding.
The proximal operator separates across groups since the groups are disjoint.
For the $\ell_2$ norm: $\operatorname{prox}_{\lambda\|\cdot\|_2}(\mathbf{v}) = \big(1 - \tfrac{\lambda}{\|\mathbf{v}\|_2}\big)_{+}\mathbf{v}$.
Separation across groups
Since the groups partition $\{1, \dots, N\}$, the prox objective splits into independent per-group problems: $\big[\operatorname{prox}_{\mathcal{R}}(\mathbf{v})\big]_{\mathcal{G}_g} = \arg\min_{\mathbf{u}} \tfrac12\|\mathbf{u} - \mathbf{v}_{\mathcal{G}_g}\|_2^2 + \lambda_g\|\mathbf{u}\|_2$.
Per-group proximal operator
For group $g$: $\mathbf{x}^\star_{\mathcal{G}_g} = \Big(1 - \frac{\lambda_g}{\|\mathbf{v}_{\mathcal{G}_g}\|_2}\Big)_{+}\mathbf{v}_{\mathcal{G}_g}$.
This is block soft-thresholding: if $\|\mathbf{v}_{\mathcal{G}_g}\|_2 \le \lambda_g$, the entire group is set to zero. Otherwise, the group is shrunk toward zero by the factor $1 - \lambda_g/\|\mathbf{v}_{\mathcal{G}_g}\|_2$.
Interpretation
Entire angular groups with small energy (noise only) are suppressed. Groups with signal energy are preserved but shrunk. The per-group thresholds $\lambda_g$ can be learned (as in HST) to adapt to the expected energy distribution.
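A minimal sketch of the resulting block soft-thresholding operator (group layout and thresholds are illustrative):

```python
import numpy as np

def group_soft_threshold(v, groups, lam):
    """Prox of sum_g lam[g] * ||x_{G_g}||_2: shrink or zero out each group as a block."""
    x = np.zeros_like(v)
    for g, idx in enumerate(groups):
        norm = np.linalg.norm(v[idx])
        if norm > lam[g]:
            x[idx] = (1.0 - lam[g] / norm) * v[idx]   # shrink the whole group
        # else: the group stays exactly zero
    return x

# Example: three groups of four coefficients; the low-energy group is zeroed out entirely.
v = np.concatenate([np.full(4, 2.0), np.full(4, 0.1), np.full(4, -1.0)])
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
print(group_soft_threshold(v, groups, lam=[1.0, 1.0, 1.0]))
```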
ch18-ex20-end-to-end-training
(Challenge) Design an end-to-end training procedure for a 10-layer unrolled OAMP network with ProxNet denoisers for an RF imaging problem. Address:
(a) Training data generation (how many scenes, what distribution).
(b) Loss function (MSE, perceptual, or combined).
(c) Training schedule (layer-wise pre-training followed by end-to-end).
(d) Memory management (gradient checkpointing).
(e) Validation strategy (in-distribution and out-of-distribution).
Start with layer-wise pre-training, then fine-tune end-to-end.
Use gradient checkpointing every 3 layers to fit in GPU memory.
Training data
Generate 10,000 synthetic scenes:
- 5,000 with 5--20 point scatterers (varying amplitudes)
- 3,000 with extended targets (rectangles, circles)
- 2,000 with mixed point + extended targets
Add noise with the SNR (in dB) sampled uniformly over the operating range. Split: 8,000 train / 1,000 validation / 1,000 test.
Loss function
A combined loss of the form
$\mathcal{L} = \|\hat{\mathbf{x}} - \mathbf{x}^\star\|_2^2 + \alpha\,\mathcal{L}_{\text{perceptual}}(\hat{\mathbf{x}}, \mathbf{x}^\star) + \beta\,\mathcal{L}_{\text{SE}},$
where $\mathcal{L}_{\text{SE}}$ regularises the learned noise variances against the state evolution prediction.
Training schedule
Phase 1: layer-wise pre-training (5 epochs per layer, Adam optimiser).
Phase 2: end-to-end fine-tuning (50 epochs, cosine-annealed learning rate, batch size 16).
Memory management
Gradient checkpointing every 3 layers: store activations only at the checkpoint layers and recompute the intermediate ones during the backward pass. The storage for the checkpoints themselves is negligible compared with keeping the ProxNet activations of all 10 layers.
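A minimal PyTorch sketch of this checkpointing pattern; the stacked layers here are simple stand-ins for the unrolled OAMP layers, and the sizes are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Ten stand-in layers for the unrolled network (real layers would take (x, y, v2)).
layers = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()) for _ in range(10)
])
x0 = torch.randn(16, 256, requires_grad=True)

# Split the 10 layers into 4 segments (~3 layers each); only segment-boundary
# activations are stored, the rest are recomputed in the backward pass.
x_out = checkpoint_sequential(layers, 4, x0)
loss = x_out.pow(2).mean()
loss.backward()
```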
Validation
In-distribution: same scene types and SNR range as training. Out-of-distribution:
- SNR values below and above the training range
- Different scene types (Shepp-Logan phantom, letter shapes)
- Numbers of scatterers outside the training range
Report NMSE, SSIM, and per-pixel uncertainty if available.