Training Strategies for Imaging

The Loss Function Shapes the Reconstruction

The choice of loss function determines what the network optimises for, and hence the character of the reconstruction.

  • MSE loss produces the posterior mean — smooth and unbiased but blurry when the posterior is multimodal.
  • Perceptual loss preserves textures and edges by measuring distance in VGG feature space rather than pixel space.
  • Adversarial loss generates sharp, realistic-looking images, but risks hallucination — inventing features not in the true scene.

For RF imaging, where fidelity to the true scene is paramount (a false target in a radar image could trigger a false alarm), the choice of loss function carries significant practical consequences. This section surveys the three main families and their implications, then discusses data augmentation and transfer learning strategies.

Definition: MSE (Mean Squared Error) Loss

The MSE loss between reconstruction $\hat{\mathbf{c}}$ and ground truth $\mathbf{c}$ is

$$\mathcal{L}_{\text{MSE}}(\hat{\mathbf{c}}, \mathbf{c}) = \frac{1}{N}\|\hat{\mathbf{c}} - \mathbf{c}\|_2^2 = \frac{1}{N}\sum_{i=1}^N |\hat{c}_i - c_i|^2.$$

Minimising $\mathcal{L}_{\text{MSE}}$ over the training distribution yields the posterior mean $\mathbb{E}[\mathbf{c} \mid \mathbf{y}]$.

MSE penalises all pixel errors equally. When the posterior $p(\mathbf{c} \mid \mathbf{y})$ is multimodal — multiple scenes are consistent with the measurements — the posterior mean lies between the modes, producing a blurry reconstruction.

Definition: Perceptual Loss

The perceptual loss measures distance in the feature space of a pretrained classification network (typically VGG-16):

$$\mathcal{L}_{\text{perc}}(\hat{\mathbf{c}}, \mathbf{c}) = \sum_{\ell \in \mathcal{S}} \frac{1}{C_\ell H_\ell W_\ell} \|\phi_\ell(\hat{\mathbf{c}}) - \phi_\ell(\mathbf{c})\|_F^2,$$

where $\phi_\ell$ extracts the feature map at VGG layer $\ell$, $\mathcal{S}$ is a set of selected layers, and $C_\ell, H_\ell, W_\ell$ are the channel, height, and width dimensions.

Perceptual loss encourages structural similarity at multiple scales. It preserves edges and textures better than MSE because VGG features are invariant to small spatial shifts that MSE penalises heavily. For RF scenes with point targets, the perceptual loss can preserve the sharp target signature while suppressing background clutter.
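
As a concrete illustration, here is a minimal PyTorch sketch of the perceptual loss above, using torchvision's pretrained VGG-16 as the feature extractor. The layer selection, and the assumption that inputs are 3-channel ImageNet-normalised images, are illustrative choices rather than prescribed by the text.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    """Feature-space MSE against a frozen pretrained VGG-16."""

    def __init__(self, layers=(3, 8, 15)):  # relu1_2, relu2_2, relu3_3 (illustrative choice)
        super().__init__()
        self.vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)  # VGG is a fixed feature extractor
        self.layers = set(layers)

    def forward(self, c_hat, c):
        # Inputs: (B, 3, H, W), ImageNet-normalised; single-channel RF
        # magnitude images would first be replicated to 3 channels.
        loss, x, y = 0.0, c_hat, c
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.layers:
                # Mean over the C_l, H_l, W_l (and batch) axes matches the
                # 1/(C_l H_l W_l) normalisation in the definition.
                loss = loss + torch.mean((x - y) ** 2)
            if idx >= max(self.layers):
                break
        return loss
```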

Definition: Adversarial (GAN) Loss

The adversarial loss trains a discriminator $D_\psi$ alongside the reconstruction network $f_\theta$:

$$\mathcal{L}_{\text{adv}}(\theta, \psi) = \mathbb{E}_{\mathbf{c}}\bigl[\log D_\psi(\mathbf{c})\bigr] + \mathbb{E}_{\mathbf{y}}\bigl[\log\bigl(1 - D_\psi(f_\theta(\hat{\mathbf{c}}^{\text{BP}}))\bigr)\bigr].$$

The generator $f_\theta$ minimises $\mathcal{L}_{\text{adv}}$ while the discriminator $D_\psi$ maximises it. At equilibrium, the generator produces images that the discriminator cannot distinguish from true scenes.

Adversarial training produces the sharpest, most visually realistic reconstructions. The danger is hallucination: the network may generate plausible-looking features not present in the true scene. For scientific imaging (radar, SAR), this is problematic — hallucinated targets could trigger false alarms in detection pipelines.
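
A minimal sketch of the minimax objective in PyTorch, assuming hypothetical generator and discriminator modules `f_theta` and `D_psi` and a backprojection input `c_bp`. In practice the non-saturating generator loss shown here is the common choice, rather than directly minimising $\mathcal{L}_{\text{adv}}$.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D_psi, f_theta, c_true, c_bp):
    """D_psi ascends L_adv: push real scenes toward 1, reconstructions toward 0."""
    with torch.no_grad():
        c_fake = f_theta(c_bp)  # detach the generator from D's update
    real = D_psi(c_true)
    fake = D_psi(c_fake)
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def generator_step(D_psi, f_theta, c_bp):
    """Non-saturating generator loss: maximise log D(f_theta(c_bp))."""
    fake = D_psi(f_theta(c_bp))
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```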

Theorem: Loss Functions and Bayesian Estimators

Under mild regularity conditions, the optimal reconstruction network $f_{\theta^*}$ trained with different losses converges to distinct Bayesian estimators:

| Loss | Optimal estimator |
| --- | --- |
| $\mathcal{L}_{\text{MSE}} = \lVert\hat{\mathbf{c}} - \mathbf{c}\rVert_2^2$ | Posterior mean $\mathbb{E}[\mathbf{c} \mid \mathbf{y}]$ |
| $\mathcal{L}_{\text{MAE}} = \lVert\hat{\mathbf{c}} - \mathbf{c}\rVert_1$ | Posterior median (component-wise) |
| $\mathcal{L}_{\text{adv}}$ (Wasserstein) | Sample from the posterior $p(\mathbf{c} \mid \mathbf{y})$ |

Each loss function defines a different notion of "best." MSE minimises average squared error (yields the mean). MAE minimises average absolute error (yields the median, which is mode-seeking for heavy-tailed priors). The adversarial loss matches distributions, so the generator produces samples from the posterior rather than a single point estimate.
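
The mean-versus-median behaviour is easy to verify numerically. The NumPy sketch below draws samples from a bimodal "posterior" and grid-searches the scalar estimate minimising each empirical risk: the squared-error minimiser lands between the modes, the absolute-error minimiser on the dominant mode.

```python
import numpy as np

rng = np.random.default_rng(0)
# Bimodal samples: two plausible scenes near 0 and 10, the second more likely
samples = np.concatenate([rng.normal(0.0, 0.1, 400), rng.normal(10.0, 0.1, 600)])

grid = np.linspace(-2, 12, 2001)
mse_risk = ((grid[:, None] - samples[None, :]) ** 2).mean(axis=1)
mae_risk = np.abs(grid[:, None] - samples[None, :]).mean(axis=1)

print(grid[mse_risk.argmin()])  # ~6.0: the posterior mean, between the modes
print(grid[mae_risk.argmin()])  # ~10.0: the posterior median, on a mode
```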

Effect of Loss Function on Reconstruction Quality

Compare reconstructions produced by networks trained with different loss functions on the same RF imaging scene. The MSE-trained network produces smooth but blurry estimates. The perceptual-loss network preserves more texture and target sharpness. The adversarial-loss network produces the sharpest images but may introduce hallucinated point targets. The combined loss balances all three.

Examine the error maps carefully: MSE has the lowest pixel-wise error but the worst perceptual quality, while the adversarial loss has higher pixel-wise error but better structural similarity (SSIM).


Definition: SSIM Loss

The structural similarity index (SSIM) measures perceptual image quality by comparing luminance, contrast, and structure between $\hat{\mathbf{c}}$ and $\mathbf{c}$ in local windows:

$$\text{SSIM}(\hat{\mathbf{c}}, \mathbf{c}) = \frac{(2\mu_{\hat{c}}\mu_c + C_1)(2\sigma_{\hat{c}c} + C_2)}{(\mu_{\hat{c}}^2 + \mu_c^2 + C_1)(\sigma_{\hat{c}}^2 + \sigma_c^2 + C_2)},$$

where $\mu$, $\sigma^2$, and $\sigma_{\hat{c}c}$ are local means, variances, and the cross-covariance, and $C_1, C_2$ are small constants for stability. The SSIM loss is $\mathcal{L}_{\text{SSIM}} = 1 - \text{SSIM}(\hat{\mathbf{c}}, \mathbf{c})$.

SSIM correlates better with human perceptual quality than MSE. For RF scenes, SSIM is particularly useful for evaluating target detection performance: it penalises blurry targets and sidelobe artefacts while being relatively tolerant of background noise variations.
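
The sketch below implements a simplified SSIM loss with uniform local windows via average pooling; standard implementations use an 11×11 Gaussian window (σ = 1.5), so treat this as an approximation. `L` is the assumed dynamic range, and the constants follow the usual $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$.

```python
import torch
import torch.nn.functional as F

def ssim_loss(c_hat, c, window=7, L=1.0):
    """1 - mean local SSIM, with uniform windows instead of Gaussian ones."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_x = F.avg_pool2d(c_hat, window, stride=1)
    mu_y = F.avg_pool2d(c, window, stride=1)
    var_x = F.avg_pool2d(c_hat ** 2, window, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(c ** 2, window, stride=1) - mu_y ** 2
    cov_xy = F.avg_pool2d(c_hat * c, window, stride=1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
               ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return 1.0 - ssim_map.mean()
```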

Example: Combined Loss Function for RF Imaging

Design a combined loss function for training an MF→U-Net reconstruction network for a MIMO radar imaging system. The loss should balance pixel accuracy, structural quality, and data consistency.
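
One possible design, as a hedged sketch: a weighted sum of the MSE, perceptual, and SSIM terms defined above plus a data-consistency penalty on the measurement residual. The weights `lam_*` are hypothetical starting points that would need tuning per system, `A` stands for the forward operator applied as a function, and `perc_loss`/`ssim_loss` are the callables sketched earlier in this section.

```python
import torch

def combined_loss(c_hat, c, y, A, perc_loss, ssim_loss,
                  lam_mse=1.0, lam_perc=0.1, lam_ssim=0.1, lam_dc=1.0):
    l_mse = torch.mean((c_hat - c) ** 2)          # pixel fidelity (posterior-mean pull)
    l_perc = perc_loss(c_hat, c)                  # texture / edge quality
    l_ssim = ssim_loss(c_hat, c)                  # local structural quality
    l_dc = torch.mean((A(c_hat) - y).abs() ** 2)  # physics: reconstruction must explain y
    return lam_mse * l_mse + lam_perc * l_perc + lam_ssim * l_ssim + lam_dc * l_dc
```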

Data Augmentation for Inverse Problems

Training data for RF imaging reconstruction consists of paired $(\mathbf{c}, \mathbf{y})$ samples generated from a simulator. Standard image augmentation (flips, rotations, crops) must be applied carefully in this setting:

Augmentation in scene space: Flipping the scene $\mathbf{c}$ must be accompanied by applying the corresponding transformation to the measurements $\mathbf{y} = \mathbf{A}\mathbf{c}$. For operators with spatial symmetry (e.g., symmetric antenna arrays), some transformations can be applied exactly.

Augmentation in measurement space: Randomly masking measurement components ($M' < M$ measurements) at training time teaches the network to be robust to partial aperture coverage. This is particularly effective for MoDL, where the CG step adapts automatically to different effective $\mathbf{A}$.

Noise-level augmentation: Training with a range of SNR values (e.g., 5–40 dB) prevents overfitting to a specific noise regime and produces networks that degrade gracefully at low SNR.

Scene diversity: RF imaging scenes are highly non-stationary (isolated point targets vs. extended objects vs. clutter). A diverse training set covering all these regimes is essential for generalisation.
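
The first three augmentations are mechanical, and a sketch of each follows, assuming the simulator exposes the forward operator `A` as a callable (all names here are illustrative):

```python
import torch

def flip_pair(c, A, noise_std):
    """Scene-space: transform the scene, then re-simulate the measurements
    so the pair (c, y) stays physically consistent."""
    c_aug = torch.flip(c, dims=[-1])
    y_clean = A(c_aug)
    return c_aug, y_clean + noise_std * torch.randn_like(y_clean)

def mask_measurements(y, keep_frac=0.8):
    """Measurement-space: randomly zero components to mimic partial
    aperture coverage (M' < M)."""
    mask = torch.rand(y.shape, device=y.device) < keep_frac
    return y * mask

def random_snr(y_clean, snr_db_range=(5.0, 40.0)):
    """Noise-level: draw a random SNR and add matching white noise."""
    snr_db = torch.empty(()).uniform_(*snr_db_range)
    noise_pow = y_clean.abs().pow(2).mean() / (10.0 ** (snr_db / 10.0))
    return y_clean + noise_pow.sqrt() * torch.randn_like(y_clean)
```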

Transfer Learning from Optical to RF Domains

RF imaging networks are often trained on synthetic data because paired ground truth $(\mathbf{c}, \mathbf{y})$ is unavailable in the field. Transfer learning can reduce the simulation-to-real gap:

Optical-to-RF transfer: Pretrain the U-Net backbone on large optical image datasets (ImageNet, COCO) where ground truth is abundant. Fine-tune on synthetic RF data with a physically accurate $\mathbf{A}$. The low-level feature detectors (edges, textures) transfer well; higher-level semantic features do not.

Simulation-to-real transfer: Train on high-fidelity simulated data with calibrated forward models, then fine-tune on a small set of real measurements (possibly without ground truth, using self-supervised losses from Chapter 23).

Transfer across array geometries: When the sensing geometry changes (different number of antennas, frequencies), the Gram matrix $\mathbf{G}$ changes. For MoDL, only the CG step changes — the denoiser $\mathcal{D}_\theta$ may transfer without retraining. For MF→U-Net, full retraining is needed.

Domain randomisation: During training, randomise the sensing geometry (antenna positions, frequencies) so the network learns to be geometry-agnostic. At inference, condition on the true geometry via physics-informed channels (Section 20.3).
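
A fine-tuning sketch for the optical-to-RF recipe: load whatever pretrained weights match, freeze the low-level block, and train the rest on synthetic RF data. The backbone, checkpoint path, and layer split are stand-ins, not a specific API.

```python
import torch
import torch.nn as nn

# Stand-in backbone; a real system would use the MF→U-Net from the text.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),   # low-level "encoder" block
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)

# Load pretrained weights (checkpoint path hypothetical); strict=False
# tolerates missing or extra keys from the optical pretraining stage.
state = torch.load("optical_pretrained.pt")
model.load_state_dict(state, strict=False)

# Freeze the first block: low-level edge/texture detectors transfer across
# domains, while the later layers are fine-tuned on synthetic RF data.
for p in model[0].parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```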


When Supervised Training is Impossible: The Real-World RF Scenario

The fundamental assumption of supervised training is the availability of paired ground-truth data $(\mathbf{c}, \mathbf{y})$. In real-world RF imaging deployments, this assumption frequently fails:

  • The true scene $\mathbf{c}$ is unknown — that is precisely what we want to measure.
  • Collecting calibrated ground truth requires a controlled environment that does not reflect real deployment conditions.
  • Scenes are non-stationary (people, vehicles, weather changes) and ground truth changes faster than data can be labelled.

This motivates the self-supervised and unsupervised approaches developed in Chapter 23: self-supervised losses (Noise2Noise variants, equivariant imaging, SURE-based estimation) that require only measurement pairs $\{(\mathbf{y}_i, \mathbf{y}_j)\}$ without ground truth.

In the short term, the CommIT group's approach is to train on synthetic data, validate with a small calibration target, and deploy with domain adaptation. Understanding when and why synthetic-to-real transfer works is one of the central open questions in RF imaging.

Common Mistake: Adversarial Loss Hallucination in Scientific Imaging

Mistake: Using a pure adversarial loss for radar or SAR image reconstruction without data-consistency constraints.

Correction: GAN-trained networks can hallucinate realistic-looking features (targets, lesions) that do not exist in the true scene. For scientific applications:

  1. Always include a data-consistency term in the combined loss.
  2. Prefer MSE + perceptual over pure adversarial training.
  3. If using adversarial training, add measurement-consistency constraints as hard layers (DC layers) rather than soft penalties.
  4. Monitor the measurement residual $\|\mathbf{y} - \mathbf{A}\hat{\mathbf{c}}\|$ on a held-out validation set to detect hallucination (a monitoring sketch follows this list).
  5. For detection tasks, evaluate false alarm rate (FAR) alongside SSIM and PSNR — hallucinated targets inflate FAR even when pixel-wise metrics look acceptable.
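
A sketch of the residual monitor from step 4, assuming the forward operator `A` is available as a callable and using an illustrative threshold `tau` that would be set from the expected noise level:

```python
import torch

def relative_residual(y, c_hat, A):
    """||y - A c_hat|| / ||y||, flattened over all measurement dimensions."""
    r = torch.linalg.vector_norm(y - A(c_hat))
    return (r / torch.linalg.vector_norm(y)).item()

def flag_hallucinations(val_batches, model, A, tau=0.1):
    """Flag validation samples whose reconstructions fail to explain the
    measurements; large residuals suggest invented image content."""
    flags = []
    for y, c_bp in val_batches:  # (measurements, backprojection input) pairs
        c_hat = model(c_bp)
        flags.append(relative_residual(y, c_hat, A) > tau)
    return flags
```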

Loss Functions for RF Imaging Reconstruction

| Loss | Optimal estimator | Strengths | Weaknesses for RF imaging |
| --- | --- | --- | --- |
| MSE | Posterior mean | Unbiased, mathematically tractable | Blurry for multimodal posteriors |
| MAE ($L^1$) | Posterior median | Robust to outliers | Non-smooth gradient, slower training |
| SSIM | Perceptual quality optimum | Preserves target structure | Non-convex, local optima |
| Perceptual (VGG) | Feature-space mean | Sharp targets, edge-preserving | Not a metric, hallucination risk |
| Adversarial | Posterior sample | Sharpest images | Hallucination, unstable training, false targets |
| Data-consistency | Measurement-feasible estimate | Physical plausibility | Does not promote sparsity or image quality alone |
| Combined (MSE + perceptual + DC) | Balanced tradeoff | Fidelity + quality + plausibility | Hyperparameter tuning required |

Quick Check

Why do MSE-trained networks tend to produce blurry reconstructions?

  • Because MSE penalises large errors too strongly
  • Because the optimal MSE estimate is the posterior mean, which averages over multiple plausible reconstructions
  • Because MSE ignores high-frequency components
  • Because the U-Net architecture cannot produce sharp images

Quick Check

When the sensing matrix $\mathbf{A}$ changes (different array geometry), which approach requires the least retraining?

  • Direct inversion network
  • MF-to-U-Net
  • MoDL (shared denoiser with new CG step)
  • Physics-informed U-Net (PSF channel only)

Perceptual loss

A loss function measuring distance between reconstructed and target images in the feature space of a pretrained network (VGG-16), rather than in pixel space. Encourages structural and textural similarity. See Definition: Perceptual Loss.

Related: Perceptual Loss, Adversarial (GAN) Loss

Hallucination (in reconstruction)

A failure mode of trained reconstruction networks (especially GAN-based) where the network generates plausible-looking features in the output that are not present in the true scene. In RF imaging, hallucinated targets are dangerous because they cause false alarms in detection systems. Mitigated by data-consistency losses and hard DC layers. See Common Mistake: Adversarial Loss Hallucination in Scientific Imaging.

Related: Adversarial Loss Hallucination in Scientific Imaging, Adversarial (GAN) Loss

Key Takeaway

  1. MSE loss yields the posterior mean — smooth but blurry when the posterior is multimodal.

  2. Perceptual loss preserves textures and edge sharpness by measuring distance in VGG feature space.

  3. Adversarial loss produces sharp images but risks hallucination — generating features not in the true scene. Avoid for radar and SAR without strong data-consistency constraints.

  4. For RF imaging, a combined loss (MSE + perceptual + SSIM + data-consistency) provides the best tradeoff between fidelity and image quality.

  5. Data augmentation must respect the physical relationship between scene and measurements — augment jointly in $(\mathbf{c}, \mathbf{y})$ space.

  6. Transfer learning reduces the sim-to-real gap: pretrain on synthetic data, fine-tune on real measurements. For MoDL, the denoiser transfers across geometries; for MF→U-Net, retraining is required.

  7. When ground truth is unavailable (the real-world RF scenario), supervised training fails — motivating Chapter 23's self-supervised methods.