Data-Consistency Layers and MoDL

Preventing the Network from Hallucinating

The sidelobe corruption problem (Section 20.1) arises because the U-Net post-processor has no way to check whether its output is consistent with the observed measurements $\mathbf{y}$. Once the matched filter is applied, the raw measurement information is discarded — the U-Net operates purely on the MF image.

Data-consistency layers restore this check by projecting the network output back onto the feasible set

$$\mathcal{F}_\epsilon = \bigl\{\mathbf{c} : \|\mathbf{A}\mathbf{c} - \mathbf{y}\|_2 \leq \epsilon\bigr\}.$$

When $\epsilon = 0$ (noiseless case), the network output is corrected so that its re-measurement exactly equals $\mathbf{y}$. In the noisy case, a gradient step toward data consistency replaces the hard projection.

MoDL (Model-Based Deep Learning, Aggarwal et al. 2019) alternates between a CNN denoiser and a data-consistency step solved via conjugate gradient, creating a principled bridge between post-processing and algorithm unrolling.


Definition: Data Consistency Layer

A data consistency (DC) layer maps a network estimate $\hat{\mathbf{c}}$ to a measurement-consistent output by taking a gradient step toward data fidelity:

$$\text{DC}_\lambda(\hat{\mathbf{c}}) = \hat{\mathbf{c}} - \lambda\,\mathbf{A}^{H}\!\bigl(\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}\bigr),$$

where $\lambda > 0$ is the step size. When $\lambda = 1$ and $\mathbf{A}\mathbf{A}^{H} = \mathbf{I}$ (orthonormal rows), the DC layer becomes a hard projection:

$$\text{DC}_1(\hat{\mathbf{c}}) = \hat{\mathbf{c}} - \mathbf{A}^{H}(\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}) = (\mathbf{I} - \mathbf{A}^{H}\mathbf{A})\hat{\mathbf{c}} + \mathbf{A}^{H}\mathbf{y}.$$

This replaces the measured components with $\mathbf{A}^{H}\mathbf{y}$ while preserving the network's prediction in the null space of $\mathbf{A}$.

In MRI reconstruction, the DC layer "keeps the acquired k-space samples and lets the network fill in the missing ones." In RF imaging with a physically structured $\mathbf{A}$, the DC layer provides the missing feedback loop that the pure MF→U-Net pipeline lacks.
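
A minimal NumPy sketch of both DC variants, using a toy partial-DFT operator with orthonormal rows (the operator, sizes, and names here are illustrative, not tied to any particular system):

```python
import numpy as np

def dc_gradient_step(c_hat, A, y, lam):
    """Soft DC: one gradient step on (1/2)||A c - y||^2 with step size lam."""
    return c_hat - lam * (A.conj().T @ (A @ c_hat - y))

def dc_hard_projection(c_hat, A, y):
    """Hard DC: projection onto {c : A c = y}; assumes A A^H = I."""
    return c_hat - A.conj().T @ (A @ c_hat - y)

# Toy operator: M rows of the N-point unitary DFT, so A A^H = I.
rng = np.random.default_rng(0)
N, M = 64, 16
F = np.fft.fft(np.eye(N), norm="ortho")        # unitary DFT matrix
A = F[rng.choice(N, size=M, replace=False)]    # random row subset

c_true = rng.standard_normal(N)
y = A @ c_true                                 # noiseless measurements
c_net = c_true + 0.1 * rng.standard_normal(N)  # stand-in network output

c_dc = dc_hard_projection(c_net, A, y)
print(np.allclose(A @ c_dc, y))                # True: re-measurement equals y
```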


Theorem: The Hard DC Projection is Idempotent

For $\mathbf{A}$ with orthonormal rows ($\mathbf{A}\mathbf{A}^{H} = \mathbf{I}$), the hard data-consistency layer

$$\text{DC}(\hat{\mathbf{c}}) = \hat{\mathbf{c}} - \mathbf{A}^{H}(\mathbf{A}\hat{\mathbf{c}} - \mathbf{y})$$

satisfies:

  1. Measurement consistency: $\mathbf{A}\,\text{DC}(\hat{\mathbf{c}}) = \mathbf{y}$.
  2. Idempotence: $\text{DC}(\text{DC}(\hat{\mathbf{c}})) = \text{DC}(\hat{\mathbf{c}})$.
  3. Null-space preservation: $\text{DC}(\hat{\mathbf{c}}) - \text{DC}(\hat{\mathbf{c}}') = (\mathbf{I} - \mathbf{A}^{H}\mathbf{A})(\hat{\mathbf{c}} - \hat{\mathbf{c}}')$.

The DC layer is a projection onto the affine measurement-consistent subspace $\{\mathbf{c} : \mathbf{A}\mathbf{c} = \mathbf{y}\}$. Projecting twice lands at the same point (idempotence). The network's contribution survives only in the null space of $\mathbf{A}$ — the degrees of freedom not constrained by the measurements.
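
All three properties can be checked numerically in a few lines (toy setup, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 32, 8
F = np.fft.fft(np.eye(N), norm="ortho")
A = F[:M]                                      # orthonormal rows: A A^H = I
y = A @ rng.standard_normal(N)

DC = lambda c: c - A.conj().T @ (A @ c - y)    # hard DC layer

c1 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
c2 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
P_null = np.eye(N) - A.conj().T @ A            # projector onto null(A)

print(np.allclose(A @ DC(c1), y))                        # 1. consistency
print(np.allclose(DC(DC(c1)), DC(c1)))                   # 2. idempotence
print(np.allclose(DC(c1) - DC(c2), P_null @ (c1 - c2)))  # 3. null space
```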

Definition: MoDL — Model-Based Deep Learning

MoDL (Aggarwal et al., 2019) is an alternating reconstruction architecture that interleaves CNN denoising with CG data-consistency steps. Starting from $\hat{\mathbf{c}}_0 = \hat{\mathbf{c}}^{\text{BP}}$, the $k$-th iteration is:

$$\mathbf{z}_k = \mathcal{D}_\theta(\hat{\mathbf{c}}_{k-1}), \qquad \hat{\mathbf{c}}_k = \arg\min_{\mathbf{c}} \;\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2 + \lambda_k\|\mathbf{c} - \mathbf{z}_k\|^2.$$

The regularised least-squares step has the closed-form solution

$$\hat{\mathbf{c}}_k = (\mathbf{A}^{H}\mathbf{A} + \lambda_k\mathbf{I})^{-1} (\mathbf{A}^{H}\mathbf{y} + \lambda_k\mathbf{z}_k),$$

solved efficiently by conjugate gradient (CG). The regularisation weights $\lambda_k$ may be fixed or learned as part of training.

The CNN denoiser $\mathcal{D}_\theta$ acts as an implicit prior on the scene $\mathbf{c}$. The CG step enforces data consistency. Weights are shared across iterations (a single $\mathcal{D}_\theta$ is reused), dramatically reducing the number of parameters compared to an unrolled network with distinct parameters at each step.

MoDL Forward Pass

Input: measurements y, sensing operator A, CNN denoiser D_θ,
       regularisation weights {λ_k}, number of iterations K
Initialize: ĉ₀ = Aᴴy (matched-filter warm start)
For k = 1, 2, ..., K:
  1. Denoise: z_k = D_θ(ĉ_{k−1})
  2. CG solve: ĉ_k = (AᴴA + λ_k I)⁻¹ (Aᴴy + λ_k z_k)
     [via conjugate gradient with early stopping, K_CG iterations]
  3. (Optional) Check the data residual ‖Aĉ_k − y‖
Return: ĉ_K

Complexity: each CG solve requires $\mathcal{O}(K_{\text{CG}})$ matrix-vector products with $\mathbf{A}$ and $\mathbf{A}^{H}$, each costing $\mathcal{O}(MN)$ for dense operators or $\mathcal{O}(N \log N)$ for structured ones. Total cost: $\mathcal{O}(K \cdot K_{\text{CG}} \cdot MN)$ for dense $\mathbf{A}$.

For Kronecker-structured $\mathbf{A}$ (Chapter 18), the CG solve can be carried out on the factors separately, reducing the cost to $\mathcal{O}(K \cdot K_{\text{CG}} \cdot (M_r N_r + M_c N_c))$.
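
A compact NumPy sketch of the forward pass, with a hand-rolled CG solver and a soft-threshold shrinkage standing in for the CNN denoiser $\mathcal{D}_\theta$ (the operator, weights, and denoiser are illustrative placeholders, not the paper's setup):

```python
import numpy as np

def cg_solve(apply_M, b, x0, n_iter=10):
    """Conjugate gradient for M x = b, with M Hermitian positive definite."""
    x = x0
    r = b - apply_M(x)
    p, rs = r.copy(), np.vdot(r, r)
    for _ in range(n_iter):
        Mp = apply_M(p)
        alpha = rs / np.vdot(p, Mp)
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = np.vdot(r, r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def modl_forward(y, A, denoise, lams, n_cg=10):
    """MoDL: alternate denoising with a CG data-consistency solve."""
    AH = A.conj().T
    x = AH @ y                                   # matched-filter warm start
    for lam in lams:                             # K = len(lams) iterations
        z = denoise(x)                           # z_k = D_theta(x_{k-1})
        apply_M = lambda v, l=lam: AH @ (A @ v) + l * v
        x = cg_solve(apply_M, AH @ y + lam * z, x0=z, n_iter=n_cg)
    return x

# Toy run: sparse scene, random A, shrinkage as the denoiser stand-in.
rng = np.random.default_rng(0)
M, N = 40, 100
A = rng.standard_normal((M, N)) / np.sqrt(M)
c = np.zeros(N)
c[rng.choice(N, size=5, replace=False)] = 1.0
y = A @ c + 0.01 * rng.standard_normal(M)
shrink = lambda x: np.sign(x) * np.maximum(np.abs(x) - 0.05, 0.0)

c_hat = modl_forward(y, A, shrink, lams=[1.0] * 10)
print(np.linalg.norm(A @ c_hat - y))             # small data residual
```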

Theorem: MoDL Fixed-Point Condition

A fixed point $\hat{\mathbf{c}}^* = \hat{\mathbf{c}}_k = \hat{\mathbf{c}}_{k-1}$ of the MoDL iteration satisfies

$$\hat{\mathbf{c}}^* = \arg\min_{\mathbf{c}} \;\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2 + \lambda\|\mathbf{c} - \mathcal{D}_\theta(\hat{\mathbf{c}}^*)\|^2.$$

If the denoiser $\mathcal{D}_\theta$ is the proximal operator of some regulariser $R(\cdot)$, i.e., $\mathcal{D}_\theta(\mathbf{c}) = \operatorname{prox}_{\lambda R}(\mathbf{c})$, then the fixed point minimises

$$\frac{1}{2}\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2 + R(\mathbf{c}).$$
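
To see why, set the gradient of the quadratic objective to zero at the fixed point (the factors of 2 cancel):

$$\mathbf{A}^{H}(\mathbf{A}\hat{\mathbf{c}}^* - \mathbf{y}) + \lambda\bigl(\hat{\mathbf{c}}^* - \mathcal{D}_\theta(\hat{\mathbf{c}}^*)\bigr) = \mathbf{0},$$

so the residual $\lambda(\hat{\mathbf{c}}^* - \mathcal{D}_\theta(\hat{\mathbf{c}}^*))$ plays the role of $\nabla R(\hat{\mathbf{c}}^*)$ in the first-order optimality condition of the regularised problem above.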

MoDL is a plug-and-play algorithm (Chapter 21) in which the denoiser implicitly defines a regulariser. When the denoiser is well calibrated, the fixed point is a regularised reconstruction that balances data fidelity against the implicit prior. Learned weights $\lambda_k$ let the network adapt this balance per iteration.

MoDL Iteration Convergence

Watch MoDL converge iteration by iteration. The left panel shows the reconstruction $\hat{\mathbf{c}}_k$ at each step $k$. The right panel plots the data residual $\|\mathbf{A}\hat{\mathbf{c}}_k - \mathbf{y}\|$ and the reconstruction NMSE as functions of iteration.

For the physical sensing matrix, note how the data-consistency steps rapidly suppress sidelobe artefacts that the MF image alone cannot resolve. Increasing $\lambda$ enforces tighter data consistency at the expense of relying less on the CNN denoiser.


Example: Data Consistency for Partial k-Space Acquisition

In MRI with partial k-space sampling, the sensing operator is $\mathbf{A} = \mathbf{P}_\Omega\mathbf{F}$, where $\mathbf{F}$ is the DFT matrix and $\mathbf{P}_\Omega$ selects $M < N$ frequency locations. The measurements are $\mathbf{y} = \mathbf{P}_\Omega\mathbf{F}\mathbf{c} + \mathbf{w}$.

(a) Write the explicit form of the hard DC layer for this operator. (b) Interpret the DC layer in terms of k-space and image-space operations.
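
One way to sketch the answer in NumPy, assuming a 1-D unitary FFT and a random sampling mask (illustrative setup): the hard DC layer transforms to k-space, overwrites the sampled bins with the acquired data, and transforms back — which answers (b) directly.

```python
import numpy as np

def hard_dc_partial_kspace(c_hat, mask, y):
    """Hard DC for A = P_Omega F: overwrite sampled k-space bins with y."""
    k = np.fft.fft(c_hat, norm="ortho")    # image -> k-space
    k[mask] = y                            # keep acquired samples verbatim
    return np.fft.ifft(k, norm="ortho")    # k-space -> image

rng = np.random.default_rng(0)
N = 128
mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, size=N // 4, replace=False)] = True   # Omega, M = N/4

c_true = rng.standard_normal(N)
y = np.fft.fft(c_true, norm="ortho")[mask]               # y = P_Omega F c
c_net = c_true + 0.2 * rng.standard_normal(N)            # network estimate

c_dc = hard_dc_partial_kspace(c_net, mask, y)
print(np.allclose(np.fft.fft(c_dc, norm="ortho")[mask], y))   # True
```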

Common Mistake: Failing to Differentiate Through the Physics in MoDL

Mistake:

When training MoDL end-to-end, treating $\mathbf{A}$ and $\mathbf{A}^{H}$ as black-box functions and not backpropagating gradients through them.

Correction:

If $\mathbf{A}$ and $\mathbf{A}^{H}$ appear in the computational graph between input and output, gradients must flow through them. For linear operators, the Jacobian of $\mathbf{A}\mathbf{c}$ with respect to $\mathbf{c}$ is $\mathbf{A}$, and the Jacobian of $\mathbf{A}^{H}\mathbf{r}$ with respect to $\mathbf{r}$ is $\mathbf{A}^{H}$.

PyTorch and JAX handle this automatically if $\mathbf{A}$ is implemented as a differentiable operation (e.g., via torch.fft.fft for DFT operators, or explicit matrix multiplication for small dense operators). For large custom operators, use the adjoint method to compute vector-Jacobian products.

Failing to backpropagate through the CG solve (treating it as a fixed, non-differentiable step) leads to inconsistent gradients and slower training convergence. Use unrolled CG or implicit differentiation, as in the sketch below.
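
A PyTorch sketch of the fix: the operator is applied as ordinary differentiable tensor ops, and the CG solve is unrolled so autograd traces every iteration (the toy dense operator, shapes, and placeholder loss are illustrative):

```python
import torch

def apply_A(c, A):                  # forward model as a differentiable op
    return A @ c

def apply_AH(r, A):                 # adjoint; autograd sees this too
    return A.conj().T @ r

def unrolled_cg(A, b, lam, x0, n_iter=5):
    """Unrolled CG for (A^H A + lam I) x = b; every step stays in the graph."""
    normal_op = lambda v: apply_AH(apply_A(v, A), A) + lam * v
    x = x0
    r = b - normal_op(x)
    p, rs = r, torch.sum(r.conj() * r).real
    for _ in range(n_iter):
        Mp = normal_op(p)
        alpha = rs / torch.sum(p.conj() * Mp).real
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = torch.sum(r.conj() * r).real
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Gradients flow from the output back to lam (and any upstream denoiser).
M, N = 20, 50
A = torch.randn(M, N, dtype=torch.complex64)
y = torch.randn(M, dtype=torch.complex64)
z = torch.zeros(N, dtype=torch.complex64)      # stand-in denoiser output
lam = torch.tensor(1.0, requires_grad=True)

x = unrolled_cg(A, apply_AH(y, A) + lam * z, lam, x0=z)
loss = (x.abs() ** 2).sum()                    # placeholder training loss
loss.backward()
print(lam.grad)                                # non-None: graph is intact
```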

MF-to-U-Net vs. MoDL

| Feature | MF-to-U-Net | MoDL |
| --- | --- | --- |
| Uses $\mathbf{A}$ at inference? | No (only preprocessing) | Yes (CG step at each layer) |
| Data-consistency guarantee? | No | Yes (by construction) |
| Handles structured PSF? | Poorly (sidelobe corruption) | Yes (DC corrects sidelobes) |
| Inference cost | $\mathcal{O}(1)$ forward pass | $\mathcal{O}(K \cdot K_{\text{CG}})$ forward passes |
| Generalises across operators? | No — retrain required | Partial — $\mathcal{D}_\theta$ may transfer |
| Parameters | ~23M (U-Net) | ~5M (shared denoiser) + $\{\lambda_k\}$ |
| Training requirement | Paired $(\mathbf{c}, \mathbf{y})$ data | Paired $(\mathbf{c}, \mathbf{y})$ data + differentiable $\mathbf{A}$ |

Data consistency

The requirement that a reconstructed image, when passed through the forward model, reproduces the observed measurements: $\mathbf{A}\hat{\mathbf{c}} \approx \mathbf{y}$. Enforced via gradient steps, hard projection layers (for orthonormal-row $\mathbf{A}$), or soft penalty terms in the loss. See Definition: Data Consistency Layer.

Related: Data Consistency Layer, MoDL — Model-Based Deep Learning, The Hard DC Projection is Idempotent

MoDL (Model-Based Deep Learning)

An alternating reconstruction architecture (Aggarwal et al., 2019) that interleaves a CNN denoiser with a conjugate-gradient data-consistency step. The denoiser weights are shared across iterations, reducing the parameter count. Learned regularisation weights $\lambda_k$ adapt the prior strength per iteration. See Definition: MoDL — Model-Based Deep Learning.

Related: MoDL — Model-Based Deep Learning, MoDL Forward Pass, MoDL Fixed-Point Condition

Quick Check

A hard data-consistency layer with orthonormal-row $\mathbf{A}$ modifies the network output by:

(a) Replacing all pixels with the matched-filter image

(b) Keeping measured components from $\mathbf{y}$ and network-predicted components in the null space

(c) Averaging the network output with the matched-filter image

(d) Applying a learned projection

Key Takeaway

  1. Data-consistency layers enforce hard or soft measurement constraints after each network block, preventing physically impossible reconstructions.

  2. For $\mathbf{A}$ with orthonormal rows, the DC layer is a hard projection: measured components are replaced by $\mathbf{A}^{H}\mathbf{y}$ and the network predicts only the null-space degrees of freedom.

  3. MoDL alternates between CNN denoising and CG data-consistency steps with learned regularisation weights $\lambda_k$, bridging post-processing and algorithm unrolling.

  4. At a fixed point, MoDL solves a regularised inverse problem whose regulariser is implicitly defined by the CNN denoiser.

  5. The CG step must be part of the computational graph for end-to-end training — use differentiable linear operators.