Exercises
ex-ch04-01
Easy: Let and .
(a) What are the dimensions of ?
(b) How many complex entries does the full Kronecker product contain?
(c) How many complex entries do the two factors contain together?
(d) What is the ratio of (b) to (c)?
The Kronecker product of an M1 × N1 matrix and an M2 × N2 matrix has size M1·M2 × N1·N2.
Dimensions
.
Entry counts
Full product: entries. Factors: entries. Ratio: .
ex-ch04-02
Easy: Verify the vec trick by hand for matrices. Let and . Compute both sides of
for and verify they are equal.
First write out as a block matrix.
.
Left-hand side
.
Multiplying by :
Right-hand side
.
Vectorizing column-by-column: .
This equals the LHS.
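As a cross-check, the same identity can be verified numerically. The sketch below uses random 2×2 matrices (the specific matrices from the statement are not repeated here) and the column-stacking vec convention used in this chapter:

import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((2, 2))
A2 = rng.standard_normal((2, 2))
C = rng.standard_normal((2, 2))

c = C.ravel(order='F')                      # column-stacking vec(C)
lhs = np.kron(A1, A2) @ c                   # (A1 kron A2) vec(C)
rhs = (A2 @ C @ A1.T).ravel(order='F')      # vec(A2 C A1^T)
print(np.allclose(lhs, rhs))                # True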
ex-ch04-03
Medium: Show that the spectral norm of a Kronecker product satisfies .
Use the fact that the singular values of are all products .
The spectral norm is the largest singular value.
SVD of the Kronecker product
From the theorem on Spectral Properties of Kronecker Products, the SVD of is .
The singular values are .
Maximum singular value
The largest singular value of the Kronecker product is therefore the product of the largest factor singular values, i.e., the spectral norm of the Kronecker product equals the product of the spectral norms. ∎
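A quick numerical sanity check of the norm identity, with hypothetical factor sizes:

import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))
A2 = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))

lhs = np.linalg.norm(np.kron(A1, A2), 2)             # spectral norm of the Kronecker product
rhs = np.linalg.norm(A1, 2) * np.linalg.norm(A2, 2)  # product of the factor spectral norms
print(np.isclose(lhs, rhs))                           # True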
ex-ch04-04
Medium: The Lipschitz constant of the gradient is .
(a) When , express in terms of the spectral norms of the factors.
(b) How does the ISTA step size change when we add a second Kronecker factor?
.
Use the spectral norm property of Kronecker products.
Lipschitz constant
.
Step size impact
.
Adding a second factor multiplies by , which shrinks the step size by the same factor. Sensing operators with larger spectral norms therefore force smaller step sizes and potentially slower convergence.
ex-ch04-05
Medium: Implement the Kronecker-factored forward operator in Python. Given A1 (shape M1 × N1) and A2 (shape M2 × N2), write a function kron_forward(c, A1, A2) that computes y = (A1 ⊗ A2) c using the vec trick. Verify by comparing against np.kron(A1, A2) @ c for random inputs.
Reshape c into shape (N2, N1) before multiplying.
The order matters: A2 @ C @ A1.T, not A1 @ C @ A2.T.
Implementation
import numpy as np

def kron_forward(c, A1, A2):
    # Column-stacking (Fortran-order) vec convention:
    # (A1 kron A2) c = vec(A2 C A1^T), where C = unvec(c) has shape (N2, N1).
    N2, N1 = A2.shape[1], A1.shape[1]
    C = c.reshape(N2, N1, order='F')
    return (A2 @ C @ A1.T).ravel(order='F')
Verification
M1, N1, M2, N2 = 4, 8, 6, 8
A1 = np.random.randn(M1, N1) + 1j*np.random.randn(M1, N1)
A2 = np.random.randn(M2, N2) + 1j*np.random.randn(M2, N2)
c = np.random.randn(N1*N2) + 1j*np.random.randn(N1*N2)
y_naive = np.kron(A1, A2) @ c
y_fast = kron_forward(c, A1, A2)
print(np.allclose(y_naive, y_fast)) # True
ex-ch04-06
Medium: For the NUFFT with oversampling factor and interpolation kernel width :
(a) What is the total memory required for the oversampled grid when the original grid has points?
(b) What is the computational cost of the FFT on the oversampled grid?
(c) For , , , and non-uniform points, compare the total NUFFT cost against the direct evaluation cost.
The oversampled grid has points.
An FFT on a grid costs .
Memory
Oversampled grid: complex entries. At 8 bytes each (complex64): MB.
FFT cost
.
Comparison
NUFFT total: .
Direct: .
Speedup: .
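The arithmetic above can be packaged as a small cost model. The sketch below is a 1-D version with commonly used (but here assumed) constants: a 5 N log2 N FFT FLOP count, an O(J) interpolation cost per non-uniform point, and 8 real FLOPs per complex multiply-add for the direct evaluation; the example values of N, M, sigma, and J are hypothetical, not the ones from the statement.

import numpy as np

def nufft_cost(N, M, sigma, J, bytes_per_entry=8):
    # Rough 1-D NUFFT cost model (assumed constants, see lead-in).
    N_os = int(sigma * N)                      # oversampled grid size
    memory_bytes = N_os * bytes_per_entry      # complex64 = 8 bytes per entry
    fft_flops = 5 * N_os * np.log2(N_os)       # ~5 N log2 N FFT estimate
    interp_flops = 2 * J * M                   # spreading/interpolation, O(J) per point
    direct_flops = 8 * N * M                   # one complex MAC per (grid point, sample) pair
    return memory_bytes, fft_flops + interp_flops, direct_flops

mem, nufft_flops, direct_flops = nufft_cost(N=2**20, M=10**5, sigma=2.0, J=6)
print(f"grid memory : {mem / 1e6:.1f} MB")
print(f"NUFFT FLOPs : {nufft_flops:.3e}")
print(f"direct FLOPs: {direct_flops:.3e}  (speedup ~{direct_flops / nufft_flops:.0f}x)")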
ex-ch04-07
Hard: Prove that for the three-factor Kronecker product , the factored computation via mode products has cost compared to the naive . Determine the conditions under which the speedup exceeds .
Count the cost of each mode- product separately.
The output of mode- multiplication changes one dimension from to .
Mode-1 product cost
: contract mode-1 of the tensor ( dimension) with (). Cost: .
Mode-2 product cost
After mode-1, the tensor is . Mode-2 contraction with (): cost .
Mode-3 product cost
After mode-2, the tensor is . Mode-3 contraction with (): cost .
Total and speedup
Total: .
For the balanced case , : factored cost , naive cost . Speedup . For , speedup exceeds .
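The mode-product factorization can be checked numerically against the dense Kronecker product. The factor sizes below are hypothetical:

import numpy as np

rng = np.random.default_rng(2)
M, N = (2, 3, 4), (3, 4, 5)                        # hypothetical (M_i, N_i)
A1, A2, A3 = [rng.standard_normal((m, n)) for m, n in zip(M, N)]
X = rng.standard_normal(N)                         # coefficient tensor of shape (N1, N2, N3)

Y = np.einsum('ia,abc->ibc', A1, X)                # mode-1 product
Y = np.einsum('jb,ibc->ijc', A2, Y)                # mode-2 product
Y = np.einsum('kc,ijc->ijk', A3, Y)                # mode-3 product

dense = np.kron(np.kron(A1, A2), A3) @ X.ravel()   # naive dense application
print(np.allclose(Y.ravel(), dense))               # True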
ex-ch04-08
Easy: A GPU has 4,096 CUDA cores, each capable of one complex multiply-add per clock cycle at 1.5 GHz.
(a) What is the theoretical peak throughput in complex GFLOP/s?
(b) How long does a complex matrix times a complex matrix take at peak throughput?
(c) Why is the actual time significantly longer than (b)?
One complex multiply-add = 8 real FLOPs.
Memory bandwidth, not compute, often limits performance.
Peak throughput
real TFLOP/s complex TFLOP/s.
Theoretical time
complex multiply-adds. Time: ns.
Practical limitations
Memory bandwidth, not compute, is the binding constraint: loading the two matrices ( complex entries KB) at TB/s takes ns, already longer than the compute time. In addition, kernel-launch overhead (on the order of microseconds) dominates for such small matrices.
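For part (a), the peak-throughput arithmetic, under the counting convention from the hint (one complex multiply-add = 8 real FLOPs), is:

cores, clock_hz = 4096, 1.5e9
cmacs_per_s = cores * clock_hz           # complex multiply-adds per second
real_flops = cmacs_per_s * 8             # 1 complex MAC = 8 real FLOPs (per the hint)
print(f"{cmacs_per_s / 1e9:.0f} complex GMAC/s = {real_flops / 1e12:.1f} real TFLOP/s")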
ex-ch04-09
Medium: Explain why CuPy is preferred over PyTorch for GPU-accelerating an existing NumPy-based RF imaging reconstruction pipeline, whereas PyTorch is preferred for training an unrolled network.
Consider the API compatibility and the need for autograd.
CuPy advantage: minimal code changes
CuPy mirrors the NumPy API almost exactly. An existing reconstruction pipeline written with numpy can often be GPU-accelerated by simply replacing import numpy as np with import cupy as np. No code restructuring is needed.
PyTorch advantage: automatic differentiation
Training an unrolled network requires backpropagation through the reconstruction iterations. PyTorch's autograd engine automatically builds and differentiates the computation graph. CuPy has no native AD support, so implementing backpropagation would require manual gradient derivation.
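A minimal sketch of the drop-in pattern, assuming CuPy and a CUDA device are available; matched_filter here is a hypothetical stand-in for an arbitrary stage of the existing NumPy pipeline:

import cupy as np   # alias CuPy as np so the NumPy-style code below runs on the GPU

def matched_filter(A, y):
    # Unchanged NumPy-style code; with the alias above it executes on the GPU.
    return A.conj().T @ y

A = np.random.randn(64, 256) + 1j * np.random.randn(64, 256)
y = np.random.randn(64) + 1j * np.random.randn(64)
print(type(matched_filter(A, y)))   # <class 'cupy.ndarray'>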
ex-ch04-10
Hard: For a function with and (a scalar loss), compare the cost of computing the full gradient using:
(a) Forward-mode AD (one JVP per input dimension).
(b) Reverse-mode AD (one VJP with ).
(c) Finite differences with central differences.
Express costs in terms of (the cost of one forward evaluation of ).
Forward mode: passes. Reverse mode: 1 pass (plus storage). Finite differences: evaluations.
Forward-mode AD
One JVP evaluation per input dimension, each costing a small constant multiple of one forward evaluation. Total: .
Reverse-mode AD
One forward pass plus one backward pass whose cost is a small constant multiple of the forward pass (depending on the computation). The total is a few forward-pass equivalents, independent of the input dimension.
Finite differences
Central differences require two perturbed evaluations per input dimension. Total: . In addition, the estimate is subject to truncation and round-off error from the step size .
Comparison
Reverse-mode AD is far cheaper than forward-mode AD or finite differences for this scalar-output problem. This is why backpropagation is the standard in deep learning.
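The evaluation-count argument can be made concrete with PyTorch autograd. The quadratic test function below is a hypothetical stand-in for the scalar loss; the sketch counts function evaluations and checks that reverse mode and central differences agree:

import torch

torch.manual_seed(0)
N = 50
A = torch.randn(N, N, dtype=torch.double)
calls = {"f": 0}

def f(x):
    # Hypothetical scalar loss; every call is counted to compare evaluation budgets.
    calls["f"] += 1
    return 0.5 * torch.sum((A @ x) ** 2)

x = torch.randn(N, dtype=torch.double, requires_grad=True)

# Reverse mode: one forward pass plus one backward pass, independent of N.
f(x).backward()
grad_reverse = x.grad.clone()

# Central finite differences: 2N additional forward evaluations.
eps, x0 = 1e-6, x.detach()
grad_fd = torch.zeros(N, dtype=torch.double)
for i in range(N):
    e = torch.zeros(N, dtype=torch.double)
    e[i] = eps
    grad_fd[i] = (f(x0 + e) - f(x0 - e)) / (2 * eps)

print(calls["f"])                             # 1 + 2N forward evaluations in total
print(torch.allclose(grad_reverse, grad_fd))  # the two gradients agree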
ex-ch04-11
Hard: An unrolled ISTA algorithm with iterations processes images of size .
(a) How much memory does standard reverse-mode AD require to store all intermediate iterates?
(b) If gradient checkpointing stores only every -th iterate, how much memory is needed?
(c) What is the computational overhead of checkpointing (expressed as a fraction of the original forward pass cost)?
Each iterate is a complex vector of dimension .
, so store every 7th iterate.
Standard reverse-mode memory
bytes (complex128) MB. In complex64: MB.
Checkpointed memory
checkpoints. Memory: MB (complex128), or MB (complex64). Reduction: .
Computational overhead
During backpropagation, each segment between checkpoints ( iterations) must be recomputed. In the worst case, each of the iterations is recomputed once (during the backward pass through its segment). Total forward computation: (original + recomputed ). Overhead: of the original forward pass, but memory is reduced .
ex-ch04-12
Medium: Derive the ADMM primal and dual residuals for the -regularized imaging problem
with splitting .
The constraint is , so .
ADMM formulation
Minimize subject to , where and .
Primal residual
(constraint violation).
Dual residual
(since , the formula simplifies).
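One way to track both residuals in code is sketched below for the splitting c = z, with a least-squares c-update, a soft-thresholding z-update, and a scaled dual variable; the penalty parameter rho and the tolerances are hypothetical defaults, and the point is only the residual bookkeeping:

import numpy as np

def admm_l1_monitor(A, y, lam, rho=1.0, T=200, tol=1e-6):
    N = A.shape[1]
    AhA, Ahy = A.conj().T @ A, A.conj().T @ y
    z = np.zeros(N, dtype=complex)
    u = np.zeros(N, dtype=complex)
    for k in range(T):
        c = np.linalg.solve(AhA + rho * np.eye(N), Ahy + rho * (z - u))        # c-update
        w = c + u
        z_prev = z
        z = w * np.maximum(1 - (lam / rho) / np.maximum(np.abs(w), 1e-30), 0)  # soft threshold (complex-safe)
        u = u + c - z                                                          # scaled dual update
        r = np.linalg.norm(c - z)             # primal residual: constraint violation
        s = rho * np.linalg.norm(z - z_prev)  # dual residual
        if r < tol and s < tol:
            break
    return c, z, r, s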
ex-ch04-13
Easy: For a noise vector with and :
(a) Compute the expected value of .
(b) Compute (use the approximation ).
(c) What stopping threshold does the discrepancy principle with give?
For complex Gaussian noise, .
Expected squared norm
.
Expected norm
.
Discrepancy threshold
.
Stop when .
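The expectations and the threshold can be sanity-checked by Monte Carlo. The values of M, sigma, and tau below are hypothetical (the statement's numbers are not repeated here), and circularly symmetric complex noise with per-entry variance sigma^2 is assumed:

import numpy as np

rng = np.random.default_rng(3)
M, sigma, tau = 1000, 0.1, 1.0
n = (rng.standard_normal((2000, M)) + 1j * rng.standard_normal((2000, M))) * (sigma / np.sqrt(2))
sq = np.sum(np.abs(n) ** 2, axis=1)
print(sq.mean(), M * sigma ** 2)                # E||n||^2 = M sigma^2
print(np.sqrt(sq).mean(), sigma * np.sqrt(M))   # E||n|| ~ sigma sqrt(M) for large M
print(tau * sigma * np.sqrt(M))                 # discrepancy-principle threshold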
ex-ch04-14
Hard: The condition number of the Kronecker product is . For ISTA, the number of iterations to achieve -accuracy scales as .
(a) If and , how many ISTA iterations are needed for ? (Use the bound .)
(b) What is the total cost (in FLOPs) if each iteration uses the Kronecker vec trick with , ?
(c) Compare with the cost of forming the full Kronecker product once and using ISTA with dense matrix-vector products.
.
Each vec-trick iteration costs FLOPs.
Number of iterations
.
Vec-trick total cost
Each iteration: FLOPs (forward + adjoint). Total: .
Dense total cost
Forming : entries. Each iteration (dense matvec): . Total: .
Vec trick saves .
ex-ch04-15
Challenge: The matched-filter image is where .
(a) When , show that the diagonal of is the Kronecker product of the diagonals: .
(b) Use this to compute the matched-filter image using only the Kronecker factors.
(c) For the RF imaging system with , , what is the computational cost of computing via the factored approach?
The entry of is .
For Kronecker products, .
Diagonal structure
The -th diagonal entry with :
This is exactly the entry of the outer product of the column-norm-squared vectors of and .
Factored matched-filter computation
- Compute and .
- Compute adjoint: reshape into , then .
- Divide element-wise by .
Cost
Diagonals: . Adjoint: (same as before). Division: . Total: dominated by the adjoint, FLOPs.
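The diagonal identity is easy to confirm numerically with hypothetical factor sizes:

import numpy as np

rng = np.random.default_rng(4)
A1 = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
A2 = rng.standard_normal((5, 6)) + 1j * rng.standard_normal((5, 6))

d1 = np.sum(np.abs(A1) ** 2, axis=0)               # squared column norms of A1
d2 = np.sum(np.abs(A2) ** 2, axis=0)               # squared column norms of A2

A = np.kron(A1, A2)
diag_dense = np.real(np.diag(A.conj().T @ A))      # diagonal of the dense Gram matrix
print(np.allclose(diag_dense, np.kron(d1, d2)))    # True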
ex-ch04-16
Medium: Implement a simple convergence monitor in Python that tracks: (i) objective value, (ii) fixed-point residual, (iii) data residual for ISTA applied to .
The monitor should print a warning if the objective increases at any iteration.
The objective is .
The fixed-point residual is .
Monitor implementation
import numpy as np

def ista_with_monitor(A, y, lam, mu, T):
    c = np.zeros(A.shape[1], dtype=complex)
    obj_prev = np.inf
    for t in range(T):
        grad = A.conj().T @ (A @ c - y)
        c_new = soft_threshold(c - mu * grad, mu * lam)
        obj = 0.5 * np.linalg.norm(A @ c_new - y)**2 \
            + lam * np.linalg.norm(c_new, 1)
        fpr = np.linalg.norm(c_new - c)              # fixed-point residual
        data_res = np.linalg.norm(A @ c_new - y)     # data residual
        print(f"t={t:3d}  obj={obj:.4e}  fpr={fpr:.2e}  res={data_res:.2e}")
        if obj > obj_prev:
            print(f"WARNING: obj increased at t={t}")
        obj_prev = obj
        c = c_new
    return c
Soft thresholding
def soft_threshold(x, tau):
    # Complex-safe soft thresholding: shrink the magnitude, keep the phase.
    return x * np.maximum(1 - tau / np.maximum(np.abs(x), 1e-30), 0)
ex-ch04-17
Hard: Show that for FISTA (accelerated proximal gradient), the objective value is not guaranteed to decrease monotonically, unlike ISTA. Construct a simple 2D example where the FISTA objective oscillates.
FISTA uses momentum: .
The momentum step can overshoot, temporarily increasing the objective.
Why FISTA is non-monotone
FISTA evaluates the gradient at the extrapolated point rather than the current iterate . The extrapolation is designed to accelerate convergence in the long run, but it can overshoot in the short run.
Specifically, may lie in a region where , and the gradient step from may produce with .
Simple example
Take , , . The condition number .
Starting from with , FISTA's momentum causes the objective to oscillate for the first iterations before settling into monotone decrease.
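A sketch of such an experiment, with hypothetical values for the ill-conditioned matrix, the regularization weight, and the starting point, is given below; it runs FISTA and reports every iteration at which the objective increased:

import numpy as np

A = np.diag([1.0, 0.1])                      # hypothetical ill-conditioned 2-D example
y = np.array([1.0, 1.0])
lam = 0.01
mu = 1.0 / np.linalg.norm(A, 2) ** 2         # step size 1/L

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0)

def objective(v):
    return 0.5 * np.linalg.norm(A @ v - y) ** 2 + lam * np.abs(v).sum()

x = x_prev = np.array([10.0, 10.0])
t = 1.0
obj_hist = []
for k in range(100):
    t_next = 0.5 * (1 + np.sqrt(1 + 4 * t ** 2))
    z = x + ((t - 1) / t_next) * (x - x_prev)            # momentum / extrapolation step
    x_prev, x = x, soft(z - mu * A.T @ (A @ z - y), mu * lam)
    t = t_next
    obj_hist.append(objective(x))

ups = [k for k in range(1, len(obj_hist)) if obj_hist[k] > obj_hist[k - 1]]
print("iterations where the FISTA objective increased:", ups)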
ex-ch04-18
Challenge: Design a convergence diagnostic for Plug-and-Play ADMM where the proximal step is replaced by a neural denoiser . Since the algorithm does not minimize a well-defined objective, the standard objective decrease criterion does not apply.
(a) What quantities can be monitored?
(b) Under what condition on does the fixed-point residual guarantee convergence?
(c) How would you estimate the spectral norm in practice?
PnP ADMM converges if the denoiser is firmly nonexpansive.
The spectral norm can be estimated via power iteration on the JVP.
Monitorable quantities
- Fixed-point residual .
- Data residual .
- PSNR vs iteration (in simulation with ground truth).
- Constraint residual .
Convergence condition
If is firmly nonexpansive (i.e., ), then PnP ADMM converges to a fixed point and .
Spectral norm estimation
Power iteration: starting from random , iterate .
Each iteration requires one JVP (, via forward-mode AD) and one VJP (, via reverse-mode AD).
Typically iterations suffice for convergence of the leading singular value.
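A minimal sketch of this power iteration using torch.autograd.functional.jvp and vjp; the denoiser here is a hypothetical stand-in (a fixed local-averaging map), not a trained network:

import torch
from torch.autograd.functional import jvp, vjp

def denoiser(x):
    # Hypothetical stand-in for a trained denoiser: a fixed 3-tap local average.
    k = torch.tensor([[[0.25, 0.5, 0.25]]])
    return torch.nn.functional.conv1d(x.view(1, 1, -1), k, padding=1).view(-1)

x0 = torch.randn(256)          # linearization point for the Jacobian
v = torch.randn(256)
v = v / v.norm()

for _ in range(30):            # power iteration on J^T J
    _, Jv = jvp(denoiser, x0, v)       # forward-mode product J v
    _, JtJv = vjp(denoiser, x0, Jv)    # reverse-mode product J^T (J v)
    v = JtJv / JtJv.norm()

_, Jv = jvp(denoiser, x0, v)
print("estimated spectral norm of the Jacobian:", Jv.norm().item())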
ex-ch04-19
Medium: Explain why the near field breaks the exact Kronecker structure of the sensing operator, and propose a strategy that retains computational efficiency while handling near-field effects.
In the far field, the steering vector factors into azimuth and elevation components.
In the near field, the distance to each antenna depends on all spatial coordinates jointly.
Why Kronecker structure breaks
In the far field, the steering vector for a UPA with elements factors as . This gives the sensing matrix Kronecker structure.
In the near field, the phase depends on the full distance, which does not factor into separate and components.
Efficient near-field strategy
- Kronecker preconditioner: use the far-field Kronecker factorization as a preconditioner for the near-field operator. This dramatically improves the conditioning, while each preconditioner application remains fast (vec trick).
- Low-rank correction: write the operator as the far-field Kronecker part plus a near-field correction term. If the correction has low rank (which it often does for moderate Fresnel numbers), applying it is cheap; see the sketch below.
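A sketch of how the second strategy is applied, assuming the correction is stored in factored form with hypothetical factors U (M1*M2 x r) and V (N1*N2 x r):

import numpy as np

def nearfield_forward(c, A1, A2, U, V):
    # Far-field part via the vec trick, plus a rank-r near-field correction U V^H c.
    N2, N1 = A2.shape[1], A1.shape[1]
    C = c.reshape(N2, N1, order='F')
    y_far = (A2 @ C @ A1.T).ravel(order='F')
    return y_far + U @ (V.conj().T @ c)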
ex-ch04-20
Easy: State whether each of the following diagnostics requires access to the ground-truth image:
(a) Fixed-point residual. (b) PSNR vs iteration. (c) Discrepancy principle. (d) Primal-dual gap. (e) Objective value.
Think about what data each diagnostic uses.
Classification
(a) No – uses only consecutive iterates. (b) Yes – PSNR requires the ground-truth image. (c) No – uses only the noise level . (d) No – uses the primal and dual objectives. (e) No – uses only the objective function value.