Training Strategies for Imaging

The Loss Function Shapes the Reconstruction

The choice of loss function determines what the network optimises for, and hence the character of the reconstruction.

  • MSE loss produces the posterior mean — smooth and unbiased but blurry when the posterior is multimodal.
  • Perceptual loss preserves textures and edges by measuring distance in VGG feature space rather than pixel space.
  • Adversarial loss generates sharp, realistic-looking images, but risks hallucination — inventing features not in the true scene.

For RF imaging, where fidelity to the true scene is paramount (a false target in a radar image could trigger a false alarm), the choice of loss function carries significant practical consequences. This section surveys the three main families and their implications, then discusses data augmentation and transfer learning strategies.

Definition: MSE (Mean Squared Error) Loss

The MSE loss between reconstruction $\hat{\mathbf{c}}$ and ground truth $\mathbf{c}$ is

$$\mathcal{L}_{\text{MSE}}(\hat{\mathbf{c}}, \mathbf{c}) = \frac{1}{N}\|\hat{\mathbf{c}} - \mathbf{c}\|_2^2 = \frac{1}{N}\sum_{i=1}^N |\hat{c}_i - c_i|^2.$$

Minimising $\mathcal{L}_{\text{MSE}}$ over the training distribution yields the posterior mean $\mathbb{E}[\mathbf{c} \mid \mathbf{y}]$.

MSE penalises all pixel errors equally. When the posterior $p(\mathbf{c} \mid \mathbf{y})$ is multimodal — multiple scenes are consistent with the measurements — the posterior mean lies between the modes, producing a blurry reconstruction.

Definition: Perceptual Loss

The perceptual loss measures distance in the feature space of a pretrained classification network (typically VGG-16):

$$\mathcal{L}_{\text{perc}}(\hat{\mathbf{c}}, \mathbf{c}) = \sum_{\ell \in \mathcal{S}} \frac{1}{C_\ell H_\ell W_\ell} \|\phi_\ell(\hat{\mathbf{c}}) - \phi_\ell(\mathbf{c})\|_F^2,$$

where $\phi_\ell$ extracts the feature map at VGG layer $\ell$, $\mathcal{S}$ is a set of selected layers, and $C_\ell, H_\ell, W_\ell$ are the channel, height, and width dimensions.

Perceptual loss encourages structural similarity at multiple scales. It preserves edges and textures better than MSE because VGG features are invariant to small spatial shifts that MSE penalises heavily. For RF scenes with point targets, the perceptual loss can preserve the sharp target signature while suppressing background clutter.
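
As a concrete illustration, here is a minimal PyTorch sketch of the perceptual loss above, using torchvision's pretrained VGG-16 as the feature extractor. The layer selection, and the assumption that inputs are 3-channel ImageNet-normalised images, are illustrative choices rather than prescribed by the text.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    """Feature-space MSE against a frozen pretrained VGG-16."""

    def __init__(self, layers=(3, 8, 15)):  # relu1_2, relu2_2, relu3_3 (illustrative choice)
        super().__init__()
        self.vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)  # VGG is a fixed feature extractor
        self.layers = set(layers)

    def forward(self, c_hat, c):
        # Inputs: (B, 3, H, W), ImageNet-normalised; single-channel RF
        # magnitude images would first be replicated to 3 channels.
        loss, x, y = 0.0, c_hat, c
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.layers:
                # Mean over the C_l, H_l, W_l (and batch) axes matches the
                # 1/(C_l H_l W_l) normalisation in the definition.
                loss = loss + torch.mean((x - y) ** 2)
            if idx >= max(self.layers):
                break
        return loss
```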

Definition: Adversarial (GAN) Loss

The adversarial loss trains a discriminator $D_\psi$ alongside the reconstruction network $f_\theta$:

$$\mathcal{L}_{\text{adv}}(\theta, \psi) = \mathbb{E}_{\mathbf{c}}\bigl[\log D_\psi(\mathbf{c})\bigr] + \mathbb{E}_{\mathbf{y}}\bigl[\log\bigl(1 - D_\psi(f_\theta(\hat{\mathbf{c}}^{\text{BP}}))\bigr)\bigr].$$

The generator $f_\theta$ minimises $\mathcal{L}_{\text{adv}}$ while the discriminator $D_\psi$ maximises it. At equilibrium, the generator produces images that the discriminator cannot distinguish from true scenes.

Adversarial training produces the sharpest, most visually realistic reconstructions. The danger is hallucination: the network may generate plausible-looking features not present in the true scene. For scientific imaging (radar, SAR), this is problematic — hallucinated targets could trigger false alarms in detection pipelines.
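
A minimal sketch of the minimax objective in PyTorch, assuming hypothetical generator and discriminator modules `f_theta` and `D_psi` and a backprojection input `c_bp`. In practice the non-saturating generator loss shown here is the common choice, rather than directly minimising $\mathcal{L}_{\text{adv}}$.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D_psi, f_theta, c_true, c_bp):
    """D_psi ascends L_adv: push real scenes toward 1, reconstructions toward 0."""
    with torch.no_grad():
        c_fake = f_theta(c_bp)  # detach the generator from D's update
    real = D_psi(c_true)
    fake = D_psi(c_fake)
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def generator_step(D_psi, f_theta, c_bp):
    """Non-saturating generator loss: maximise log D(f_theta(c_bp))."""
    fake = D_psi(f_theta(c_bp))
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```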

Theorem: Loss Functions and Bayesian Estimators

Under mild regularity conditions, the optimal reconstruction network $f_{\theta^*}$ trained with different losses converges to distinct Bayesian estimators:

| Loss | Optimal estimator |
| --- | --- |
| $\mathcal{L}_{\text{MSE}} = \lVert\hat{\mathbf{c}} - \mathbf{c}\rVert_2^2$ | Posterior mean $\mathbb{E}[\mathbf{c} \mid \mathbf{y}]$ |
| $\mathcal{L}_{\text{MAE}} = \lVert\hat{\mathbf{c}} - \mathbf{c}\rVert_1$ | Posterior median (component-wise) |
| $\mathcal{L}_{\text{adv}}$ (Wasserstein) | Sample from the posterior $p(\mathbf{c} \mid \mathbf{y})$ |

Each loss function defines a different notion of "best." MSE minimises average squared error (yields the mean). MAE minimises average absolute error (yields the median, which is mode-seeking for heavy-tailed priors). The adversarial loss matches distributions, so the generator produces samples from the posterior rather than a single point estimate.
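
The mean-versus-median behaviour is easy to verify numerically. The NumPy sketch below draws samples from a bimodal "posterior" and grid-searches the scalar estimate minimising each empirical risk: the squared-error minimiser lands between the modes, the absolute-error minimiser on the dominant mode.

```python
import numpy as np

rng = np.random.default_rng(0)
# Bimodal samples: two plausible scenes near 0 and 10, the second more likely
samples = np.concatenate([rng.normal(0.0, 0.1, 400), rng.normal(10.0, 0.1, 600)])

grid = np.linspace(-2, 12, 2001)
mse_risk = ((grid[:, None] - samples[None, :]) ** 2).mean(axis=1)
mae_risk = np.abs(grid[:, None] - samples[None, :]).mean(axis=1)

print(grid[mse_risk.argmin()])  # ~6.0: the posterior mean, between the modes
print(grid[mae_risk.argmin()])  # ~10.0: the posterior median, on a mode
```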

Effect of Loss Function on Reconstruction Quality

Compare reconstructions produced by networks trained with different loss functions on the same RF imaging scene. The MSE-trained network produces smooth but blurry estimates. The perceptual-loss network preserves more texture and target sharpness. The adversarial-loss network produces the sharpest images but may introduce hallucinated point targets. The combined loss balances all three.

Examine the error maps carefully: MSE has the lowest pixel-wise error but the worst perceptual quality, while the adversarial loss has higher pixel-wise error but better structural similarity (SSIM).


Definition: SSIM Loss

The structural similarity index (SSIM) measures perceptual image quality by comparing luminance, contrast, and structure between $\hat{\mathbf{c}}$ and $\mathbf{c}$ in local windows:

$$\text{SSIM}(\hat{\mathbf{c}}, \mathbf{c}) = \frac{(2\mu_{\hat{c}}\mu_c + C_1)(2\sigma_{\hat{c}c} + C_2)}{(\mu_{\hat{c}}^2 + \mu_c^2 + C_1)(\sigma_{\hat{c}}^2 + \sigma_c^2 + C_2)},$$

where $\mu$, $\sigma^2$, and $\sigma_{\hat{c}c}$ are local means, variances, and the cross-covariance, and $C_1, C_2$ are small constants for stability. The SSIM loss is $\mathcal{L}_{\text{SSIM}} = 1 - \text{SSIM}(\hat{\mathbf{c}}, \mathbf{c})$.

SSIM correlates better with human perceptual quality than MSE. For RF scenes, SSIM is particularly useful for evaluating target detection performance: it penalises blurry targets and sidelobe artefacts while being relatively tolerant of background noise variations.
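
The sketch below implements a simplified SSIM loss with uniform local windows via average pooling; standard implementations use an 11×11 Gaussian window (σ = 1.5), so treat this as an approximation. `L` is the assumed dynamic range, and the constants follow the usual $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$.

```python
import torch
import torch.nn.functional as F

def ssim_loss(c_hat, c, window=7, L=1.0):
    """1 - mean local SSIM, with uniform windows instead of Gaussian ones."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_x = F.avg_pool2d(c_hat, window, stride=1)
    mu_y = F.avg_pool2d(c, window, stride=1)
    var_x = F.avg_pool2d(c_hat ** 2, window, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(c ** 2, window, stride=1) - mu_y ** 2
    cov_xy = F.avg_pool2d(c_hat * c, window, stride=1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
               ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return 1.0 - ssim_map.mean()
```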

Example: Combined Loss Function for RF Imaging

Design a combined loss function for training an MF→U-Net reconstruction network for a MIMO radar imaging system. The loss should balance pixel accuracy, structural quality, and data consistency.
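
One possible design, as a hedged sketch: a weighted sum of the MSE, perceptual, and SSIM terms defined above plus a data-consistency penalty on the measurement residual. The weights `lam_*` are hypothetical starting points that would need tuning per system, `A` stands for the forward operator applied as a function, and `perc_loss`/`ssim_loss` are the callables sketched earlier in this section.

```python
import torch

def combined_loss(c_hat, c, y, A, perc_loss, ssim_loss,
                  lam_mse=1.0, lam_perc=0.1, lam_ssim=0.1, lam_dc=1.0):
    l_mse = torch.mean((c_hat - c) ** 2)          # pixel fidelity (posterior-mean pull)
    l_perc = perc_loss(c_hat, c)                  # texture / edge quality
    l_ssim = ssim_loss(c_hat, c)                  # local structural quality
    l_dc = torch.mean((A(c_hat) - y).abs() ** 2)  # physics: reconstruction must explain y
    return lam_mse * l_mse + lam_perc * l_perc + lam_ssim * l_ssim + lam_dc * l_dc
```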

Data Augmentation for Inverse Problems

Training data for RF imaging reconstruction consists of paired $(\mathbf{c}, \mathbf{y})$ samples generated from a simulator. Standard image augmentation (flips, rotations, crops) must be applied carefully in this setting:

Augmentation in scene space: Flipping the scene $\mathbf{c}$ must be accompanied by applying the corresponding transformation to the measurements $\mathbf{y} = \mathbf{A}\mathbf{c}$. For operators with spatial symmetry (e.g., symmetric antenna arrays), some transformations can be applied exactly.

Augmentation in measurement space: Randomly masking measurement components ($M' < M$ measurements) at training time teaches the network to be robust to partial aperture coverage. This is particularly effective for MoDL, where the CG step adapts automatically to different effective $\mathbf{A}$.

Noise-level augmentation: Training with a range of SNR values (e.g., 5–40 dB) prevents overfitting to a specific noise regime and produces networks that degrade gracefully at low SNR.

Scene diversity: RF imaging scenes are highly non-stationary (isolated point targets vs. extended objects vs. clutter). A diverse training set covering all these regimes is essential for generalisation.
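
The first three augmentations are mechanical, and a sketch of each follows, assuming the simulator exposes the forward operator `A` as a callable (all names here are illustrative):

```python
import torch

def flip_pair(c, A, noise_std):
    """Scene-space: transform the scene, then re-simulate the measurements
    so the pair (c, y) stays physically consistent."""
    c_aug = torch.flip(c, dims=[-1])
    y_clean = A(c_aug)
    return c_aug, y_clean + noise_std * torch.randn_like(y_clean)

def mask_measurements(y, keep_frac=0.8):
    """Measurement-space: randomly zero components to mimic partial
    aperture coverage (M' < M)."""
    mask = torch.rand(y.shape, device=y.device) < keep_frac
    return y * mask

def random_snr(y_clean, snr_db_range=(5.0, 40.0)):
    """Noise-level: draw a random SNR and add matching white noise."""
    snr_db = torch.empty(()).uniform_(*snr_db_range)
    noise_pow = y_clean.abs().pow(2).mean() / (10.0 ** (snr_db / 10.0))
    return y_clean + noise_pow.sqrt() * torch.randn_like(y_clean)
```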

Transfer Learning from Optical to RF Domains

RF imaging networks are often trained on synthetic data because paired ground truth $(\mathbf{c}, \mathbf{y})$ is unavailable in the field. Transfer learning can reduce the simulation-to-real gap:

Optical-to-RF transfer: Pretrain the U-Net backbone on large optical image datasets (ImageNet, COCO) where ground truth is abundant. Fine-tune on synthetic RF data with a physically accurate $\mathbf{A}$. The low-level feature detectors (edges, textures) transfer well; higher-level semantic features do not.

Simulation-to-real transfer: Train on high-fidelity simulated data with calibrated forward models, then fine-tune on a small set of real measurements (possibly without ground truth, using self-supervised losses from Chapter 23).

Transfer across array geometries: When the sensing geometry changes (different number of antennas, frequencies), the Gram matrix $\mathbf{G}$ changes. For MoDL, only the CG step changes — the denoiser $\mathcal{D}_\theta$ may transfer without retraining. For MF→U-Net, full retraining is needed.

Domain randomisation: During training, randomise the sensing geometry (antenna positions, frequencies) so the network learns to be geometry-agnostic. At inference, condition on the true geometry via physics-informed channels (Section 20.3).
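
A fine-tuning sketch for the optical-to-RF recipe: load whatever pretrained weights match, freeze the low-level block, and train the rest on synthetic RF data. The backbone, checkpoint path, and layer split are stand-ins, not a specific API.

```python
import torch
import torch.nn as nn

# Stand-in backbone; a real system would use the MF→U-Net from the text.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),   # low-level "encoder" block
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)

# Load pretrained weights (checkpoint path hypothetical); strict=False
# tolerates missing or extra keys from the optical pretraining stage.
state = torch.load("optical_pretrained.pt")
model.load_state_dict(state, strict=False)

# Freeze the first block: low-level edge/texture detectors transfer across
# domains, while the later layers are fine-tuned on synthetic RF data.
for p in model[0].parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```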


When Supervised Training is Impossible: The Real-World RF Scenario

The fundamental assumption of supervised training is the availability of paired ground-truth data $(\mathbf{c}, \mathbf{y})$. In real-world RF imaging deployments, this assumption frequently fails:

  • The true scene $\mathbf{c}$ is unknown — that is precisely what we want to measure.
  • Collecting calibrated ground truth requires a controlled environment that does not reflect real deployment conditions.
  • Scenes are non-stationary (people, vehicles, weather changes) and ground truth changes faster than data can be labelled.

This motivates the self-supervised and unsupervised approaches developed in Chapter 23: self-supervised losses (Noise2Noise variants, equivariant imaging, SURE-based estimation) that require only measurement pairs $\{(\mathbf{y}_i, \mathbf{y}_j)\}$ without ground truth.

In the short term, the CommIT group's approach is to train on synthetic data, validate with a small calibration target, and deploy with domain adaptation. Understanding when and why synthetic-to-real transfer works is one of the central open questions in RF imaging.

Common Mistake: Adversarial Loss Hallucination in Scientific Imaging

Mistake: Using a pure adversarial loss for radar or SAR image reconstruction without data-consistency constraints.

Correction: GAN-trained networks can hallucinate realistic-looking features (targets, lesions) that do not exist in the true scene. For scientific applications:

  1. Always include a data-consistency term in the combined loss.
  2. Prefer MSE + perceptual over pure adversarial training.
  3. If using adversarial training, add measurement-consistency constraints as hard layers (DC layers) rather than soft penalties.
  4. Monitor the measurement residual $\|\mathbf{y} - \mathbf{A}\hat{\mathbf{c}}\|$ on a held-out validation set to detect hallucination (a monitoring sketch follows this list).
  5. For detection tasks, evaluate false alarm rate (FAR) alongside SSIM and PSNR — hallucinated targets inflate FAR even when pixel-wise metrics look acceptable.
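
A sketch of the residual monitor from step 4, assuming the forward operator `A` is available as a callable and using an illustrative threshold `tau` that would be set from the expected noise level:

```python
import torch

def relative_residual(y, c_hat, A):
    """||y - A c_hat|| / ||y||, flattened over all measurement dimensions."""
    r = torch.linalg.vector_norm(y - A(c_hat))
    return (r / torch.linalg.vector_norm(y)).item()

def flag_hallucinations(val_batches, model, A, tau=0.1):
    """Flag validation samples whose reconstructions fail to explain the
    measurements; large residuals suggest invented image content."""
    flags = []
    for y, c_bp in val_batches:  # (measurements, backprojection input) pairs
        c_hat = model(c_bp)
        flags.append(relative_residual(y, c_hat, A) > tau)
    return flags
```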

Loss Functions for RF Imaging Reconstruction

| Loss | Optimal estimator | Strengths | Weaknesses for RF imaging |
| --- | --- | --- | --- |
| MSE | Posterior mean | Unbiased, mathematically tractable | Blurry for multimodal posteriors |
| MAE ($L^1$) | Posterior median | Robust to outliers | Non-smooth gradient, slower training |
| SSIM | Perceptual quality optimum | Preserves target structure | Non-convex, local optima |
| Perceptual (VGG) | Feature-space mean | Sharp targets, edge-preserving | Not a metric, hallucination risk |
| Adversarial | Posterior sample | Sharpest images | Hallucination, unstable training, false targets |
| Data-consistency | Measurement-feasible estimate | Physical plausibility | Does not promote sparsity or image quality alone |
| Combined (MSE + perceptual + DC) | Balanced tradeoff | Fidelity + quality + plausibility | Hyperparameter tuning required |

Quick Check

Why do MSE-trained networks tend to produce blurry reconstructions?

  • Because MSE penalises large errors too strongly
  • Because the optimal MSE estimate is the posterior mean, which averages over multiple plausible reconstructions
  • Because MSE ignores high-frequency components
  • Because the U-Net architecture cannot produce sharp images

Quick Check

When the sensing matrix $\mathbf{A}$ changes (different array geometry), which approach requires the least retraining?

  • Direct inversion network
  • MF-to-U-Net
  • MoDL (shared denoiser with new CG step)
  • Physics-informed U-Net (PSF channel only)

Perceptual loss

A loss function measuring distance between reconstructed and target images in the feature space of a pretrained network (VGG-16), rather than in pixel space. Encourages structural and textural similarity. See Definition: Perceptual Loss.

Related: Perceptual Loss, Adversarial (GAN) Loss

Hallucination (in reconstruction)

A failure mode of trained reconstruction networks (especially GAN-based) where the network generates plausible-looking features in the output that are not present in the true scene. In RF imaging, hallucinated targets are dangerous because they cause false alarms in detection systems. Mitigated by data-consistency losses and hard DC layers. See Common Mistake: Adversarial Loss Hallucination in Scientific Imaging.

Related: Adversarial Loss Hallucination in Scientific Imaging, Adversarial (GAN) Loss

Key Takeaway

  1. MSE loss yields the posterior mean — smooth but blurry when the posterior is multimodal.

  2. Perceptual loss preserves textures and edge sharpness by measuring distance in VGG feature space.

  3. Adversarial loss produces sharp images but risks hallucination — generating features not in the true scene. Avoid for radar and SAR without strong data-consistency constraints.

  4. For RF imaging, a combined loss (MSE + perceptual + SSIM + data-consistency) provides the best tradeoff between fidelity and image quality.

  5. Data augmentation must respect the physical relationship between scene and measurements — augment jointly in $(\mathbf{c}, \mathbf{y})$ space.

  6. Transfer learning reduces the sim-to-real gap: pretrain on synthetic data, fine-tune on real measurements. For MoDL, the denoiser transfers across geometries; for MF→U-Net, retraining is required.

  7. When ground truth is unavailable (the real-world RF scenario), supervised training fails — motivating Chapter 23's self-supervised methods.