Exercises
ex-ch04-01
Easy: Let and .
(a) What are the dimensions of ?
(b) How many complex entries does the full Kronecker product contain?
(c) How many complex entries do the two factors contain together?
(d) What is the ratio of (b) to (c)?
The Kronecker product of an M1 × N1 matrix and an M2 × N2 matrix has size M1·M2 × N1·N2.
Dimensions
.
Entry counts
Full product: entries. Factors: entries. Ratio: .
ex-ch04-02
Easy: Verify the vec trick by hand for matrices. Let and . Compute both sides of
for and verify they are equal.
First write out as a block matrix.
.
Left-hand side
.
Multiplying by :
Right-hand side
.
Vectorizing column-by-column: .
This equals the LHS.
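As a cross-check, the same identity can be verified numerically. The sketch below uses random 2×2 matrices (the specific matrices from the statement are not repeated here) and the column-stacking vec convention used in this chapter:

import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((2, 2))
A2 = rng.standard_normal((2, 2))
C = rng.standard_normal((2, 2))

c = C.ravel(order='F')                      # column-stacking vec(C)
lhs = np.kron(A1, A2) @ c                   # (A1 kron A2) vec(C)
rhs = (A2 @ C @ A1.T).ravel(order='F')      # vec(A2 C A1^T)
print(np.allclose(lhs, rhs))                # True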
ex-ch04-03
Medium: Show that the spectral norm of a Kronecker product satisfies .
Use the fact that the singular values of are all products .
The spectral norm is the largest singular value.
SVD of the Kronecker product
From the theorem on Spectral Properties of Kronecker Products, the SVD of is .
The singular values are .
Maximum singular value
The largest singular value of the Kronecker product is therefore the product of the largest factor singular values, i.e., the spectral norm of the Kronecker product equals the product of the spectral norms. ∎
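A quick numerical sanity check of the norm identity, with hypothetical factor sizes:

import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))
A2 = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))

lhs = np.linalg.norm(np.kron(A1, A2), 2)             # spectral norm of the Kronecker product
rhs = np.linalg.norm(A1, 2) * np.linalg.norm(A2, 2)  # product of the factor spectral norms
print(np.isclose(lhs, rhs))                           # True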
ex-ch04-04
Medium: The Lipschitz constant of the gradient is .
(a) When , express in terms of the spectral norms of the factors.
(b) How does the ISTA step size change when we add a second Kronecker factor?
.
Use the spectral norm property of Kronecker products.
Lipschitz constant
.
Step size impact
.
Adding a second factor multiplies by , which shrinks the step size by the same factor. Sensing operators with larger spectral norms therefore force smaller step sizes and potentially slower convergence.
ex-ch04-05
Medium: Implement the Kronecker-factored forward operator in Python. Given A1 (shape M1 × N1) and A2 (shape M2 × N2), write a function kron_forward(c, A1, A2) that computes y = (A1 ⊗ A2) c using the vec trick. Verify by comparing against np.kron(A1, A2) @ c for random inputs.
Reshape c into shape (N2, N1) before multiplying.
The order matters: A2 @ C @ A1.T, not A1 @ C @ A2.T.
Implementation
import numpy as np

def kron_forward(c, A1, A2):
    # Column-stacking (Fortran-order) vec convention:
    # (A1 kron A2) c = vec(A2 C A1^T), where C = unvec(c) has shape (N2, N1).
    N2, N1 = A2.shape[1], A1.shape[1]
    C = c.reshape(N2, N1, order='F')
    return (A2 @ C @ A1.T).ravel(order='F')
Verification
M1, N1, M2, N2 = 4, 8, 6, 8
A1 = np.random.randn(M1, N1) + 1j*np.random.randn(M1, N1)
A2 = np.random.randn(M2, N2) + 1j*np.random.randn(M2, N2)
c = np.random.randn(N1*N2) + 1j*np.random.randn(N1*N2)
y_naive = np.kron(A1, A2) @ c
y_fast = kron_forward(c, A1, A2)
print(np.allclose(y_naive, y_fast)) # True
ex-ch04-06
Medium: For the NUFFT with oversampling factor and interpolation kernel width :
(a) What is the total memory required for the oversampled grid when the original grid has points?
(b) What is the computational cost of the FFT on the oversampled grid?
(c) For , , , and non-uniform points, compare the total NUFFT cost against the direct evaluation cost.
The oversampled grid has points.
An FFT on a grid costs .
Memory
Oversampled grid: complex entries. At 8 bytes each (complex64): MB.
FFT cost
.
Comparison
NUFFT total: .
Direct: .
Speedup: .
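The arithmetic above can be packaged as a small cost model. The sketch below is a 1-D version with commonly used (but here assumed) constants: a 5 N log2 N FFT FLOP count, an O(J) interpolation cost per non-uniform point, and 8 real FLOPs per complex multiply-add for the direct evaluation; the example values of N, M, sigma, and J are hypothetical, not the ones from the statement.

import numpy as np

def nufft_cost(N, M, sigma, J, bytes_per_entry=8):
    # Rough 1-D NUFFT cost model (assumed constants, see lead-in).
    N_os = int(sigma * N)                      # oversampled grid size
    memory_bytes = N_os * bytes_per_entry      # complex64 = 8 bytes per entry
    fft_flops = 5 * N_os * np.log2(N_os)       # ~5 N log2 N FFT estimate
    interp_flops = 2 * J * M                   # spreading/interpolation, O(J) per point
    direct_flops = 8 * N * M                   # one complex MAC per (grid point, sample) pair
    return memory_bytes, fft_flops + interp_flops, direct_flops

mem, nufft_flops, direct_flops = nufft_cost(N=2**20, M=10**5, sigma=2.0, J=6)
print(f"grid memory : {mem / 1e6:.1f} MB")
print(f"NUFFT FLOPs : {nufft_flops:.3e}")
print(f"direct FLOPs: {direct_flops:.3e}  (speedup ~{direct_flops / nufft_flops:.0f}x)")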
ex-ch04-07
Hard: Prove that for the three-factor Kronecker product , the factored computation via mode products has cost compared to the naive . Determine the conditions under which the speedup exceeds .
Count the cost of each mode- product separately.
The output of mode- multiplication changes one dimension from to .
Mode-1 product cost
: contract mode-1 of the tensor ( dimension) with (). Cost: .
Mode-2 product cost
After mode-1, the tensor is . Mode-2 contraction with (): cost .
Mode-3 product cost
After mode-2, the tensor is . Mode-3 contraction with (): cost .
Total and speedup
Total: .
For the balanced case , : factored cost , naive cost . Speedup . For , speedup exceeds .
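The mode-product factorization can be checked numerically against the dense Kronecker product. The factor sizes below are hypothetical:

import numpy as np

rng = np.random.default_rng(2)
M, N = (2, 3, 4), (3, 4, 5)                        # hypothetical (M_i, N_i)
A1, A2, A3 = [rng.standard_normal((m, n)) for m, n in zip(M, N)]
X = rng.standard_normal(N)                         # coefficient tensor of shape (N1, N2, N3)

Y = np.einsum('ia,abc->ibc', A1, X)                # mode-1 product
Y = np.einsum('jb,ibc->ijc', A2, Y)                # mode-2 product
Y = np.einsum('kc,ijc->ijk', A3, Y)                # mode-3 product

dense = np.kron(np.kron(A1, A2), A3) @ X.ravel()   # naive dense application
print(np.allclose(Y.ravel(), dense))               # True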
ex-ch04-08
Easy: A GPU has 4,096 CUDA cores, each capable of one complex multiply-add per clock cycle at 1.5 GHz.
(a) What is the theoretical peak throughput in complex GFLOP/s?
(b) How long does a complex matrix times a complex matrix take at peak throughput?
(c) Why is the actual time significantly longer than (b)?
One complex multiply-add = 8 real FLOPs.
Memory bandwidth, not compute, often limits performance.
Peak throughput
real TFLOP/s complex TFLOP/s.
Theoretical time
complex multiply-adds. Time: ns.
Practical limitations
Memory bandwidth, not compute, is the binding constraint: loading the two matrices ( complex entries KB) at TB/s takes ns, already longer than the compute time. In addition, kernel-launch overhead (on the order of microseconds) dominates for such small matrices.
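For part (a), the peak-throughput arithmetic, under the counting convention from the hint (one complex multiply-add = 8 real FLOPs), is:

cores, clock_hz = 4096, 1.5e9
cmacs_per_s = cores * clock_hz           # complex multiply-adds per second
real_flops = cmacs_per_s * 8             # 1 complex MAC = 8 real FLOPs (per the hint)
print(f"{cmacs_per_s / 1e9:.0f} complex GMAC/s = {real_flops / 1e12:.1f} real TFLOP/s")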
ex-ch04-09
Medium: Explain why CuPy is preferred over PyTorch for GPU-accelerating an existing NumPy-based RF imaging reconstruction pipeline, whereas PyTorch is preferred for training an unrolled network.
Consider the API compatibility and the need for autograd.
CuPy advantage: minimal code changes
CuPy mirrors the NumPy API almost exactly. An existing reconstruction pipeline written with numpy can often be GPU-accelerated by simply replacing import numpy as np with import cupy as np. No code restructuring is needed.
PyTorch advantage: automatic differentiation
Training an unrolled network requires backpropagation through the reconstruction iterations. PyTorch's autograd engine automatically builds and differentiates the computation graph. CuPy has no native AD support, so implementing backpropagation would require manual gradient derivation.
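A minimal sketch of the drop-in pattern, assuming CuPy and a CUDA device are available; matched_filter here is a hypothetical stand-in for an arbitrary stage of the existing NumPy pipeline:

import cupy as np   # alias CuPy as np so the NumPy-style code below runs on the GPU

def matched_filter(A, y):
    # Unchanged NumPy-style code; with the alias above it executes on the GPU.
    return A.conj().T @ y

A = np.random.randn(64, 256) + 1j * np.random.randn(64, 256)
y = np.random.randn(64) + 1j * np.random.randn(64)
print(type(matched_filter(A, y)))   # <class 'cupy.ndarray'>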
ex-ch04-10
Hard: For a function with and (a scalar loss), compare the cost of computing the full gradient using:
(a) Forward-mode AD (one JVP per input dimension).
(b) Reverse-mode AD (one VJP with ).
(c) Finite differences with central differences.
Express costs in terms of (the cost of one forward evaluation of ).
Forward mode: passes. Reverse mode: 1 pass (plus storage). Finite differences: evaluations.
Forward-mode AD
One JVP evaluation per input dimension, each costing a small constant multiple of one forward evaluation. Total: .
Reverse-mode AD
One forward pass plus one backward pass whose cost is a small constant multiple of the forward pass (depending on the computation). The total is a few forward-pass equivalents, independent of the input dimension.
Finite differences
Central differences require two perturbed evaluations per input dimension. Total: . In addition, the estimate is subject to truncation and round-off error from the step size .
Comparison
Reverse-mode AD is far cheaper than forward-mode AD or finite differences for this scalar-output problem. This is why backpropagation is the standard in deep learning.
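The evaluation-count argument can be made concrete with PyTorch autograd. The quadratic test function below is a hypothetical stand-in for the scalar loss; the sketch counts function evaluations and checks that reverse mode and central differences agree:

import torch

torch.manual_seed(0)
N = 50
A = torch.randn(N, N, dtype=torch.double)
calls = {"f": 0}

def f(x):
    # Hypothetical scalar loss; every call is counted to compare evaluation budgets.
    calls["f"] += 1
    return 0.5 * torch.sum((A @ x) ** 2)

x = torch.randn(N, dtype=torch.double, requires_grad=True)

# Reverse mode: one forward pass plus one backward pass, independent of N.
f(x).backward()
grad_reverse = x.grad.clone()

# Central finite differences: 2N additional forward evaluations.
eps, x0 = 1e-6, x.detach()
grad_fd = torch.zeros(N, dtype=torch.double)
for i in range(N):
    e = torch.zeros(N, dtype=torch.double)
    e[i] = eps
    grad_fd[i] = (f(x0 + e) - f(x0 - e)) / (2 * eps)

print(calls["f"])                             # 1 + 2N forward evaluations in total
print(torch.allclose(grad_reverse, grad_fd))  # the two gradients agree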
ex-ch04-11
Hard: An unrolled ISTA algorithm with iterations processes images of size .
(a) How much memory does standard reverse-mode AD require to store all intermediate iterates?
(b) If gradient checkpointing stores only every -th iterate, how much memory is needed?
(c) What is the computational overhead of checkpointing (expressed as a fraction of the original forward pass cost)?
Each iterate is a complex vector of dimension .
, so store every 7th iterate.
Standard reverse-mode memory
bytes (complex128) MB. In complex64: MB.
Checkpointed memory
checkpoints. Memory: MB (complex128), or MB (complex64). Reduction: .
Computational overhead
During backpropagation, each segment between checkpoints ( iterations) must be recomputed. In the worst case, each of the iterations is recomputed once (during the backward pass through its segment). Total forward computation: (original + recomputed ). Overhead: of the original forward pass, but memory is reduced .
ex-ch04-12
Medium: Derive the ADMM primal and dual residuals for the -regularized imaging problem
with splitting .
The constraint is , so .
ADMM formulation
Minimize subject to , where and .
Primal residual
(constraint violation).
Dual residual
(since , the formula simplifies).
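One way to track both residuals in code is sketched below for the splitting c = z, with a least-squares c-update, a soft-thresholding z-update, and a scaled dual variable; the penalty parameter rho and the tolerances are hypothetical defaults, and the point is only the residual bookkeeping:

import numpy as np

def admm_l1_monitor(A, y, lam, rho=1.0, T=200, tol=1e-6):
    N = A.shape[1]
    AhA, Ahy = A.conj().T @ A, A.conj().T @ y
    z = np.zeros(N, dtype=complex)
    u = np.zeros(N, dtype=complex)
    for k in range(T):
        c = np.linalg.solve(AhA + rho * np.eye(N), Ahy + rho * (z - u))        # c-update
        w = c + u
        z_prev = z
        z = w * np.maximum(1 - (lam / rho) / np.maximum(np.abs(w), 1e-30), 0)  # soft threshold (complex-safe)
        u = u + c - z                                                          # scaled dual update
        r = np.linalg.norm(c - z)             # primal residual: constraint violation
        s = rho * np.linalg.norm(z - z_prev)  # dual residual
        if r < tol and s < tol:
            break
    return c, z, r, s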
ex-ch04-13
Easy: For a noise vector with and :
(a) Compute the expected value of .
(b) Compute (use the approximation ).
(c) What stopping threshold does the discrepancy principle with give?
For complex Gaussian noise, .
Expected squared norm
.
Expected norm
.
Discrepancy threshold
.
Stop when .
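The expectations and the threshold can be sanity-checked by Monte Carlo. The values of M, sigma, and tau below are hypothetical (the statement's numbers are not repeated here), and circularly symmetric complex noise with per-entry variance sigma^2 is assumed:

import numpy as np

rng = np.random.default_rng(3)
M, sigma, tau = 1000, 0.1, 1.0
n = (rng.standard_normal((2000, M)) + 1j * rng.standard_normal((2000, M))) * (sigma / np.sqrt(2))
sq = np.sum(np.abs(n) ** 2, axis=1)
print(sq.mean(), M * sigma ** 2)                # E||n||^2 = M sigma^2
print(np.sqrt(sq).mean(), sigma * np.sqrt(M))   # E||n|| ~ sigma sqrt(M) for large M
print(tau * sigma * np.sqrt(M))                 # discrepancy-principle threshold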
ex-ch04-14
Hard: The condition number of the Kronecker product is . For ISTA, the number of iterations to achieve -accuracy scales as .
(a) If and , how many ISTA iterations are needed for ? (Use the bound .)
(b) What is the total cost (in FLOPs) if each iteration uses the Kronecker vec trick with , ?
(c) Compare with the cost of forming the full Kronecker product once and using ISTA with dense matrix-vector products.
.
Each vec-trick iteration costs FLOPs.
Number of iterations
.
Vec-trick total cost
Each iteration: FLOPs (forward + adjoint). Total: .
Dense total cost
Forming : entries. Each iteration (dense matvec): . Total: .
Vec trick saves .
ex-ch04-15
Challenge: The matched-filter image is where .
(a) When , show that the diagonal of is the Kronecker product of the diagonals: .
(b) Use this to compute the matched-filter image using only the Kronecker factors.
(c) For the RF imaging system with , , what is the computational cost of computing via the factored approach?
The entry of is .
For Kronecker products, .
Diagonal structure
The -th diagonal entry with :
This is exactly the entry of the outer product of the column-norm-squared vectors of and .
Factored matched-filter computation
- Compute and .
- Compute adjoint: reshape into , then .
- Divide element-wise by .
Cost
Diagonals: . Adjoint: (same as before). Division: . Total: dominated by the adjoint, FLOPs.
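The diagonal identity is easy to confirm numerically with hypothetical factor sizes:

import numpy as np

rng = np.random.default_rng(4)
A1 = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
A2 = rng.standard_normal((5, 6)) + 1j * rng.standard_normal((5, 6))

d1 = np.sum(np.abs(A1) ** 2, axis=0)               # squared column norms of A1
d2 = np.sum(np.abs(A2) ** 2, axis=0)               # squared column norms of A2

A = np.kron(A1, A2)
diag_dense = np.real(np.diag(A.conj().T @ A))      # diagonal of the dense Gram matrix
print(np.allclose(diag_dense, np.kron(d1, d2)))    # True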
ex-ch04-16
Medium: Implement a simple convergence monitor in Python that tracks: (i) objective value, (ii) fixed-point residual, (iii) data residual for ISTA applied to .
The monitor should print a warning if the objective increases at any iteration.
The objective is .
The fixed-point residual is .
Monitor implementation
import numpy as np

def ista_with_monitor(A, y, lam, mu, T):
    c = np.zeros(A.shape[1], dtype=complex)
    obj_prev = np.inf
    for t in range(T):
        grad = A.conj().T @ (A @ c - y)
        c_new = soft_threshold(c - mu * grad, mu * lam)
        obj = 0.5 * np.linalg.norm(A @ c_new - y)**2 \
            + lam * np.linalg.norm(c_new, 1)
        fpr = np.linalg.norm(c_new - c)              # fixed-point residual
        data_res = np.linalg.norm(A @ c_new - y)     # data residual
        print(f"t={t:3d}  obj={obj:.4e}  fpr={fpr:.2e}  res={data_res:.2e}")
        if obj > obj_prev:
            print(f"WARNING: obj increased at t={t}")
        obj_prev = obj
        c = c_new
    return c
Soft thresholding
def soft_threshold(x, tau):
    # Complex-safe soft thresholding: shrink the magnitude, keep the phase.
    return x * np.maximum(1 - tau / np.maximum(np.abs(x), 1e-30), 0)
ex-ch04-17
Hard: Show that for FISTA (accelerated proximal gradient), the objective value is not guaranteed to decrease monotonically, unlike ISTA. Construct a simple 2D example where the FISTA objective oscillates.
FISTA uses momentum: .
The momentum step can overshoot, temporarily increasing the objective.
Why FISTA is non-monotone
FISTA evaluates the gradient at the extrapolated point rather than the current iterate . The extrapolation is designed to accelerate convergence in the long run, but it can overshoot in the short run.
Specifically, may lie in a region where , and the gradient step from may produce with .
Simple example
Take , , . The condition number .
Starting from with , FISTA's momentum causes the objective to oscillate for the first iterations before settling into monotone decrease.
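A sketch of such an experiment, with hypothetical values for the ill-conditioned matrix, the regularization weight, and the starting point, is given below; it runs FISTA and reports every iteration at which the objective increased:

import numpy as np

A = np.diag([1.0, 0.1])                      # hypothetical ill-conditioned 2-D example
y = np.array([1.0, 1.0])
lam = 0.01
mu = 1.0 / np.linalg.norm(A, 2) ** 2         # step size 1/L

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0)

def objective(v):
    return 0.5 * np.linalg.norm(A @ v - y) ** 2 + lam * np.abs(v).sum()

x = x_prev = np.array([10.0, 10.0])
t = 1.0
obj_hist = []
for k in range(100):
    t_next = 0.5 * (1 + np.sqrt(1 + 4 * t ** 2))
    z = x + ((t - 1) / t_next) * (x - x_prev)            # momentum / extrapolation step
    x_prev, x = x, soft(z - mu * A.T @ (A @ z - y), mu * lam)
    t = t_next
    obj_hist.append(objective(x))

ups = [k for k in range(1, len(obj_hist)) if obj_hist[k] > obj_hist[k - 1]]
print("iterations where the FISTA objective increased:", ups)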
ex-ch04-18
Challenge: Design a convergence diagnostic for Plug-and-Play ADMM where the proximal step is replaced by a neural denoiser . Since the algorithm does not minimize a well-defined objective, the standard objective decrease criterion does not apply.
(a) What quantities can be monitored?
(b) Under what condition on does the fixed-point residual guarantee convergence?
(c) How would you estimate the spectral norm in practice?
PnP ADMM converges if the denoiser is firmly nonexpansive.
The spectral norm can be estimated via power iteration on the JVP.
Monitorable quantities
- Fixed-point residual .
- Data residual .
- PSNR vs iteration (in simulation with ground truth).
- Constraint residual .
Convergence condition
If is firmly nonexpansive (i.e., ), then PnP ADMM converges to a fixed point and .
Spectral norm estimation
Power iteration: starting from random , iterate .
Each iteration requires one JVP (, via forward-mode AD) and one VJP (, via reverse-mode AD).
Typically iterations suffice for convergence of the leading singular value.
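A minimal sketch of this power iteration using torch.autograd.functional.jvp and vjp; the denoiser here is a hypothetical stand-in (a fixed local-averaging map), not a trained network:

import torch
from torch.autograd.functional import jvp, vjp

def denoiser(x):
    # Hypothetical stand-in for a trained denoiser: a fixed 3-tap local average.
    k = torch.tensor([[[0.25, 0.5, 0.25]]])
    return torch.nn.functional.conv1d(x.view(1, 1, -1), k, padding=1).view(-1)

x0 = torch.randn(256)          # linearization point for the Jacobian
v = torch.randn(256)
v = v / v.norm()

for _ in range(30):            # power iteration on J^T J
    _, Jv = jvp(denoiser, x0, v)       # forward-mode product J v
    _, JtJv = vjp(denoiser, x0, Jv)    # reverse-mode product J^T (J v)
    v = JtJv / JtJv.norm()

_, Jv = jvp(denoiser, x0, v)
print("estimated spectral norm of the Jacobian:", Jv.norm().item())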
ex-ch04-19
Medium: Explain why the near field breaks the exact Kronecker structure of the sensing operator, and propose a strategy that retains computational efficiency while handling near-field effects.
In the far field, the steering vector factors into azimuth and elevation components.
In the near field, the distance to each antenna depends on all spatial coordinates jointly.
Why Kronecker structure breaks
In the far field, the steering vector for a UPA with elements factors as . This gives the sensing matrix Kronecker structure.
In the near field, the phase depends on the full distance, which does not factor into separate and components.
Efficient near-field strategy
- Kronecker preconditioner: use the far-field Kronecker factorization as a preconditioner for the near-field operator. This dramatically improves the conditioning, while each preconditioner application remains fast (vec trick).
- Low-rank correction: write the operator as the far-field Kronecker part plus a near-field correction term. If the correction has low rank (which it often does for moderate Fresnel numbers), applying it is cheap; see the sketch below.
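A sketch of how the second strategy is applied, assuming the correction is stored in factored form with hypothetical factors U (M1*M2 x r) and V (N1*N2 x r):

import numpy as np

def nearfield_forward(c, A1, A2, U, V):
    # Far-field part via the vec trick, plus a rank-r near-field correction U V^H c.
    N2, N1 = A2.shape[1], A1.shape[1]
    C = c.reshape(N2, N1, order='F')
    y_far = (A2 @ C @ A1.T).ravel(order='F')
    return y_far + U @ (V.conj().T @ c)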
ex-ch04-20
Easy: State whether each of the following diagnostics requires access to the ground-truth image:
(a) Fixed-point residual. (b) PSNR vs iteration. (c) Discrepancy principle. (d) Primal-dual gap. (e) Objective value.
Think about what data each diagnostic uses.
Classification
(a) No – uses only consecutive iterates. (b) Yes – PSNR requires the ground-truth image. (c) No – uses only the noise level . (d) No – uses the primal and dual objectives. (e) No – uses only the objective function value.