Foundation Models for Imaging
General-Purpose Priors: From Natural Images to RF
Foundation models --- large-scale models pretrained on diverse datasets --- are emerging as general-purpose priors for imaging. Models like Stable Diffusion, DINOv2, and SAM encode rich visual knowledge that can be transferred to domain-specific tasks, including RF imaging.
The key question is: can a model trained on billions of natural images provide useful priors for radar reflectivity maps, SAR images, and microwave scenes? The answer depends on the domain gap between natural images and RF data --- and on how effectively we can close that gap through transfer learning.
Definition: Foundation Models as Reconstruction Priors
A foundation model prior for inverse problems uses a pretrained generative model $p_\theta(x)$ as the prior in MAP reconstruction:
$$\hat{x} = \arg\min_{x} \; \frac{1}{2\sigma^2}\,\|y - A x\|_2^2 \;-\; \log p_\theta(x),$$
where $A$ is the forward operator and $\sigma^2$ the measurement noise variance.
For diffusion-based foundation models, this reduces to the DPS framework (Chapter 22) with the pretrained score function.
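To make the mechanism concrete, here is a toy DPS-style update on a small linear problem. Everything is illustrative: a unit-Gaussian score stands in for the pretrained diffusion network, and the step sizes and dimensions are arbitrary, not tuned values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear inverse problem y = A x + noise (dimensions are illustrative).
n, m = 16, 8
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + 0.01 * rng.standard_normal(m)

def prior_score(x):
    # Stand-in for the pretrained diffusion score: the score of a unit
    # Gaussian prior is simply -x. A real DPS step would call the network.
    return -x

def dps_step(x, step=0.05, guidance=0.5):
    # One posterior-sampling-style update: prior score plus the gradient
    # of the data-fidelity term, A^T (y - A x).
    data_grad = A.T @ (y - A @ x)
    return x + step * (prior_score(x) + guidance * data_grad)

x = np.zeros(n)
for _ in range(500):
    x = dps_step(x)
# The iterate trades off the prior against data fidelity, so the
# measurement residual ||y - A x|| shrinks relative to the zero image.
```

The same two-term structure, a learned score plus a measurement-consistency gradient, is what the full DPS framework of Chapter 22 applies with a real score network.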
For discriminative foundation models (DINOv2, CLIP), the features serve as a perceptual regulariser:
$$R(x) = \sum_{\ell} \left\| \phi_\ell(x) - \phi_\ell(x_0) \right\|_2^2,$$
where $\phi_\ell$ are intermediate features and $x_0$ is an initial reconstruction.
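The regulariser can be sketched with hand-crafted stand-ins for the feature maps $\phi_\ell$; the pooling and gradient "layers" below are illustrative toys, not actual DINOv2/CLIP features.

```python
import numpy as np

def features(x):
    # Stand-in for intermediate foundation-model features phi_l: two
    # hand-crafted "layers" (local means and finite differences) over
    # a 1-D signal of length 16.
    pooled = x.reshape(-1, 4).mean(axis=1)   # coarse, texture-like layer
    grads = np.diff(x)                        # edge-sensitive layer
    return [pooled, grads]

def feature_regulariser(x, x0):
    # R(x) = sum_l ||phi_l(x) - phi_l(x0)||^2, as in the definition above.
    return sum(np.sum((f - f0) ** 2)
               for f, f0 in zip(features(x), features(x0)))

x0 = np.linspace(0.0, 1.0, 16)   # initial reconstruction (illustrative)
r_same = feature_regulariser(x0, x0)          # zero: identical features
r_shift = feature_regulariser(x0 + 0.1, x0)   # positive: features moved
```

Minimising such a term keeps the reconstruction close to $x_0$ in feature space rather than pixel space, which is what makes the penalty perceptual.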
The domain gap between natural images and RF scenes is significant. Foundation model priors work best when the RF scene has visual similarity to natural images (e.g., SAR images of urban areas). For scenes with very different statistics (sparse point scatterers, subsurface imaging), domain-specific fine-tuning is necessary.
Definition: Reconstruct Anything Model (RAM)
The Reconstruct Anything Model (RAM) is a foundation model for image reconstruction trained across many forward operators:
$$\theta^\star = \arg\min_{\theta} \; \mathbb{E}_{x,\; A \in \mathcal{A},\; y}\left[ \left\| f_\theta(y, A) - x \right\|_2^2 \right],$$
where the operator family $\mathcal{A}$ includes denoising, inpainting, super-resolution, compressed sensing, and other forward operators.
At test time, RAM adapts to new forward operators via a conditioning mechanism (the operator is encoded as an input to the network), enabling zero-shot transfer to unseen inverse problems.
RAM represents a paradigm shift: instead of training a separate network for each forward operator, a single model handles all of them. The key enabler is a large and diverse training set that covers many forward operators and noise levels.
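A minimal sketch of operator conditioning follows. The encoding used here (row norms of the operator, tiled to a fixed length) is a hypothetical placeholder, not RAM's actual conditioning mechanism; the point is only that one set of weights serves different forward operators.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_operator(A):
    # Hypothetical fixed-length operator summary: per-row norms of A.
    return np.linalg.norm(A, axis=1)

def conditioned_forward(y, A, W):
    # A single conditioned "layer": the model sees the back-projection
    # A^T y concatenated with the operator encoding, so the same
    # weights W can be applied under different forward operators.
    code = np.resize(encode_operator(A), A.shape[1])  # tile/pad to length n
    inp = np.concatenate([A.T @ y, code])
    return W @ inp

n = 8
W = 0.1 * rng.standard_normal((n, 2 * n))   # shared weights (untrained toy)
A_denoise = np.eye(n)                       # denoising operator
A_cs = rng.standard_normal((4, n))          # compressed-sensing operator
y1 = rng.standard_normal(n)
y2 = A_cs @ rng.standard_normal(n)
x1 = conditioned_forward(y1, A_denoise, W)  # same W,
x2 = conditioned_forward(y2, A_cs, W)       # two different operators
```

Zero-shot transfer then amounts to evaluating the trained network with the encoding of an operator never seen during training.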
Example: Transfer Learning from Natural Images to RF Imaging
Assess the feasibility of using a Stable Diffusion model (trained on natural images) as a prior for SAR image reconstruction. Identify the domain gap and propose adaptation strategies.
Domain similarity assessment
SAR images of urban areas share some properties with aerial photographs:
- Similar: building footprints, road networks, geometric edges
- Different: speckle texture, layover/shadow artefacts, complex-valued signal, high dynamic range (40+ dB)
Domain gap is moderate for urban SAR but severe for subsurface imaging or sparse-target scenes.
Transfer approaches
- Direct transfer (zero-shot): Use the pretrained diffusion model with DPS. Works for perceptual quality but may miss domain-specific features (speckle statistics, sidelobes).
- LoRA adaptation: Low-rank adaptation of the diffusion model with a small SAR dataset. Parameter-efficient and effective for moderate domain gaps.
- Full fine-tuning: Retrain significant portions of the model on simulated SAR data. Best quality but requires large datasets and compute.
Recommendation for RF imaging
For SAR: LoRA fine-tuning of a pretrained diffusion model provides the best tradeoff between domain adaptation and data efficiency. For highly domain-specific RF imaging (MIMO radar, ground-penetrating radar): train a domain-specific model from scratch using simulation data from electromagnetic solvers.
Domain Gap: Natural Images vs. RF Scenes
Visualise the domain gap between natural images and RF scenes by comparing feature distributions (in a learned feature space). The plot shows the distribution overlap between natural image features and RF scene features for different scene types and adaptation strategies.
Observe that urban SAR has the smallest domain gap (highest overlap), while subsurface imaging has the largest. LoRA adaptation significantly closes the gap for all scene types.
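The overlap in such a plot can be quantified with a simple overlap coefficient. The Gaussian feature distributions below are synthetic stand-ins for learned features, chosen only to reproduce the qualitative ordering described above.

```python
import numpy as np

rng = np.random.default_rng(5)

def histogram_overlap(f1, f2, bins=50, lo=-6.0, hi=6.0):
    # Overlap coefficient of two 1-D feature distributions: the shared
    # area under their normalised histograms (1 = identical, 0 = disjoint).
    h1, _ = np.histogram(f1, bins=bins, range=(lo, hi), density=True)
    h2, _ = np.histogram(f2, bins=bins, range=(lo, hi), density=True)
    width = (hi - lo) / bins
    return np.sum(np.minimum(h1, h2)) * width

# Synthetic feature samples (illustrative means/spreads, not measured data):
natural = rng.normal(0.0, 1.0, 10_000)       # natural-image features
urban_sar = rng.normal(0.5, 1.2, 10_000)     # small shift: urban SAR
subsurface = rng.normal(3.0, 0.5, 10_000)    # large shift: subsurface

ovl_urban = histogram_overlap(natural, urban_sar)
ovl_subsurface = histogram_overlap(natural, subsurface)
```

With these stand-ins, `ovl_urban` exceeds `ovl_subsurface`, mirroring the plot: the smaller the overlap, the more adaptation the scene type needs.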
Computational Cost of Foundation Model Inference
Foundation models for image reconstruction are computationally expensive at inference time:
- Diffusion-based (DPS with Stable Diffusion): 100--1000 NFEs per image, making reconstruction orders of magnitude slower than a single forward pass, even on an A100 GPU.
- RAM: a single forward pass per image, but requires the forward operator as a conditioning input.
- DIP: per-image optimisation with no pretraining, so runtime scales with the number of optimisation iterations.
For real-time RF imaging, only feed-forward models (RAM, trained U-Nets) are feasible. Diffusion-based approaches are suitable for offline high-quality reconstruction.
Memory: both Stable Diffusion and RAM fit within the VRAM of modern GPUs.
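The feasibility argument reduces to a simple latency budget. The NFE counts and per-NFE times below are hypothetical placeholders (hardware-dependent), not benchmark results.

```python
def max_frame_rate(nfe_count, seconds_per_nfe):
    # Achievable frames per second given the number of network function
    # evaluations (NFEs) per image and the per-NFE latency.
    return 1.0 / (nfe_count * seconds_per_nfe)

# Illustrative, hypothetical latencies:
diffusion_fps = max_frame_rate(nfe_count=500, seconds_per_nfe=0.05)   # DPS
feedforward_fps = max_frame_rate(nfe_count=1, seconds_per_nfe=0.05)   # RAM
```

Under these assumptions the feed-forward model is 500x faster, which is the gap that rules diffusion sampling out for real-time use regardless of the exact per-NFE time.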
Caire's Vision: RF Imaging Foundation Model
Caire envisions a foundation model for RF imaging that generalises across sensing geometries, frequencies, and scene types. The key ingredients:
- Diverse simulation data: Electromagnetic solvers (Sionna, custom FDTD) generate training data for many array geometries, carrier frequencies, and scene types.
- Physics-conditioned architecture: The forward model is encoded as an input, so the model adapts to new sensing configurations without retraining.
- Self-supervised fine-tuning: For deployment with real measurements (where ground truth is unavailable), EI or SURE losses fine-tune the foundation model using techniques from Sections 23.3--23.4.
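One of the chapter's self-supervised losses, measurement splitting, can be sketched as follows: reconstruct from one subset of the measurements and score the prediction on the held-out subset, with no ground truth involved. The pseudoinverse reconstructor below is an illustrative stand-in for the network being fine-tuned.

```python
import numpy as np

rng = np.random.default_rng(3)

def splitting_loss(f, y, A, mask):
    # Measurement splitting: reconstruct from the rows selected by mask,
    # then evaluate the forward-projected estimate on the held-out rows.
    y_in, A_in = y[mask], A[mask]
    y_out, A_out = y[~mask], A[~mask]
    x_hat = f(y_in, A_in)
    return np.mean((A_out @ x_hat - y_out) ** 2)

def pinv_reconstructor(y, A):
    # Stand-in for the fine-tuned network: least-squares reconstruction.
    return np.linalg.pinv(A) @ y

m, n = 32, 16
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + 0.01 * rng.standard_normal(m)
mask = np.arange(m) % 4 != 0          # 24 rows in, 8 rows held out

loss = splitting_loss(pinv_reconstructor, y, A, mask)  # no ground truth used
```

In fine-tuning, this scalar would be backpropagated through the network in place of `pinv_reconstructor`, giving a training signal from real measurements alone.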
This vision connects the electromagnetic modelling of Part II (Chapters 5--8) with the learned reconstruction of Part VI, closing the loop from physics to data-driven methods.
Why This Matters: The Future of Foundation Models in RF Imaging
Foundation models for RF imaging are in their infancy, but the trajectory is clear:
-
Simulation-pretrained RF foundation models: Large-scale electromagnetic simulators can generate diverse RF scenes for pretraining domain-specific foundation models.
-
Multi-modal foundation models: Models trained jointly on optical and RF data can transfer visual knowledge to RF imaging, exploiting the structural similarity between optical and RF scenes (same underlying geometry, different wavelengths).
-
Physics-aware foundation models: Integrating the forward model into the foundation model architecture combines the generality of foundation models with the specificity of physics-based methods.
The convergence of foundation models and physics-based imaging is a frontier research direction. The self-supervised methods of this chapter (EI, SURE, measurement splitting) provide the fine-tuning tools needed to adapt foundation models to real RF measurements without ground truth.
Quick Check
What is the main challenge in using a natural-image foundation model (e.g., Stable Diffusion) as a prior for RF imaging reconstruction?
- The model is too large to run on available hardware
- The domain gap between natural images and RF scenes causes the model to hallucinate features not present in the RF data
- Foundation models cannot be conditioned on measurement operators
- Foundation models require too much training data
Correct. Natural images and RF scenes have different statistical properties: RF scenes have speckle, high dynamic range, complex values, and unique artefact patterns. The foundation model's prior may generate natural-looking features that do not correspond to the actual RF reflectivity.
Common Mistake: Foundation Models Can Hallucinate RF Features
Mistake: Using a natural-image foundation model for RF reconstruction without verifying that the reconstructed features are consistent with the physics of RF propagation.
Correction: Foundation models trained on natural images may generate textures, patterns, or structures that look plausible visually but have no physical basis in the RF measurements.
Mitigation strategies:
- Always enforce data consistency ($y \approx A\hat{x}$) as a hard constraint or strong loss term.
- Use physics-based sanity checks: verify that the reconstructed reflectivity is consistent with the radar cross-section budget.
- Prefer LoRA fine-tuning on domain-specific data over zero-shot transfer.
- Report uncertainty maps to flag regions where the reconstruction relies heavily on the prior rather than the measurements.
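The first mitigation, hard data consistency, amounts to a projection onto the measurement-consistent set. A minimal sketch for a linear forward operator, using the minimum-norm correction:

```python
import numpy as np

rng = np.random.default_rng(4)

def project_data_consistent(x, y, A):
    # Project a (possibly hallucinated) reconstruction onto the set
    # {x : A x = y} via the minimum-norm correction A^+ (y - A x).
    return x + np.linalg.pinv(A) @ (y - A @ x)

m, n = 8, 16                             # underdetermined toy problem
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x_prior = rng.standard_normal(n)         # e.g. a foundation-model output
x_dc = project_data_consistent(x_prior, y, A)
# After projection, the measurements are exactly explained: A x_dc == y.
```

Any prior-driven structure that survives this projection lies in the nullspace of $A$, which is exactly the component the measurements cannot confirm or deny, hence the call for uncertainty maps above.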
Historical Note: From GPT to RAM: The Foundation Model Revolution
2021--present. The term "foundation model" was coined by the Stanford CRFM in 2021 to describe large models trained on broad data that can be adapted to many downstream tasks. While the concept originated in NLP (GPT, BERT), it quickly spread to computer vision (CLIP, SAM, DINOv2) and then to scientific imaging.
The Reconstruct Anything Model (RAM, 2024) was among the first foundation models specifically designed for inverse problems. Trained on multiple forward operators simultaneously, RAM demonstrated that a single network can handle denoising, inpainting, super-resolution, and compressed sensing --- tasks that previously required separate specialised networks.
For RF imaging, the foundation model paradigm is still emerging. The closest analogue is Liyue Shen's work on diffusion models for medical image reconstruction, which Caire identified as a potential collaboration direction.
Foundation Model
A large-scale model pretrained on broad, diverse data that can be adapted (via fine-tuning, prompting, or conditioning) to many downstream tasks with minimal additional training.
Related: Domain Gap, Transfer Learning
Domain Gap
The statistical mismatch between the training data distribution (e.g., natural images) and the target domain (e.g., RF reflectivity maps), which causes pretrained models to produce inaccurate or hallucinated outputs in the new domain.
Related: Foundation Model
Transfer Learning
The practice of adapting a model trained on one task/domain to a different task/domain, typically via fine-tuning, LoRA, or feature extraction, reducing the need for large labelled datasets in the target domain.
Related: Foundation Model, Domain Gap
Key Takeaway
Foundation models provide general-purpose image priors that can be applied to RF imaging via DPS, feature-based regularisation, or multi-operator conditioning (RAM). The domain gap between natural images and RF scenes is the main challenge, requiring LoRA adaptation or domain-specific pretraining. Caire's vision of a simulation-pretrained RF foundation model that fine-tunes via EI/SURE losses represents a convergence of the physics-based and data-driven approaches developed throughout this book.