Foundation Models for Imaging

General-Purpose Priors: From Natural Images to RF

Foundation models --- large-scale models pretrained on diverse datasets --- are emerging as general-purpose priors for imaging. Models like Stable Diffusion, DINOv2, and SAM encode rich visual knowledge that can be transferred to domain-specific tasks, including RF imaging.

The key question is: can a model trained on billions of natural images provide useful priors for radar reflectivity maps, SAR images, and microwave scenes? The answer depends on the domain gap between natural images and RF data --- and on how effectively we can close that gap through transfer learning.

Definition:

Foundation Models as Reconstruction Priors

A foundation model prior for inverse problems uses a pretrained generative model $p_\theta(\mathbf{x})$ as the prior in MAP reconstruction:

$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} \; \log p(\mathbf{y} \mid \mathbf{x}) + \log p_\theta(\mathbf{x}).$$

For diffusion-based foundation models, this reduces to the DPS framework (Chapter 22) with the pretrained score function.
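To make the guidance mechanics concrete, here is a toy numerical sketch of a single DPS-style step. It is not the full sampler of Chapter 22: the pretrained foundation-model score $s_\theta(\mathbf{x}, t)$ is replaced by the closed-form score of a standard Gaussian prior, and Tweedie's formula supplies the denoised estimate, so every quantity can be checked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 8
A = rng.standard_normal((m, n)) / np.sqrt(n)   # linear forward operator
y = A @ rng.standard_normal(n)                 # measurements
x_t = rng.standard_normal(n)                   # current diffusion iterate
t = 0.5                                        # noise variance at this step

def score(x, t):
    # Stand-in for the pretrained score s_theta(x, t): for
    # x_t = x_0 + sqrt(t) * eps with x_0 ~ N(0, I), the score is -x_t / (1 + t).
    return -x / (1.0 + t)

x0_hat = x_t + t * score(x_t, t)               # Tweedie denoised estimate x_t / (1 + t)
# DPS guidance: gradient of 0.5 * ||A x0_hat(x_t) - y||^2 w.r.t. x_t.
# Chain rule through the (linear) denoiser: d x0_hat / d x_t = 1 / (1 + t).
grad_dc = (1.0 / (1.0 + t)) * A.T @ (A @ x0_hat - y)
zeta = 1.0                                     # guidance strength
x_next = x_t + t * score(x_t, t) - zeta * grad_dc
```

With a real diffusion foundation model, `score` is the network and the chain rule through `x0_hat` is handled by automatic differentiation; the structure of the update is unchanged.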

For discriminative foundation models (DINOv2, CLIP), the features serve as a perceptual regulariser:

$$R_{\text{FM}}(\mathbf{x}) = \sum_\ell \|\phi_\ell(\mathbf{x}) - \phi_\ell(\hat{\mathbf{x}}_{\text{ref}})\|^2,$$

where $\phi_\ell$ are intermediate features of the pretrained network and $\hat{\mathbf{x}}_{\text{ref}}$ is an initial reconstruction.
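A minimal sketch of this regulariser, assuming only numpy: the two `phi_*` functions below are placeholder multi-scale averages standing in for frozen backbone activations (e.g. DINOv2 layers), which this sketch does not load.

```python
import numpy as np

def feature_regulariser(x, x_ref, feature_maps):
    """Sum over layers of squared feature distances between x and x_ref.

    `feature_maps` is a list of callables phi_ell; here they are simple
    stand-ins for frozen layers of a pretrained backbone.
    """
    return sum(float(np.sum((phi(x) - phi(x_ref)) ** 2)) for phi in feature_maps)

def phi_coarse(x):
    return x.reshape(8, 8, 8, 8).mean(axis=(1, 3))    # 64x64 -> 8x8 averages

def phi_fine(x):
    return x.reshape(16, 4, 16, 4).mean(axis=(1, 3))  # 64x64 -> 16x16 averages

rng = np.random.default_rng(0)
x_ref = rng.standard_normal((64, 64))                 # initial reconstruction
x = x_ref + 0.1 * rng.standard_normal((64, 64))       # perturbed candidate

r = feature_regulariser(x, x_ref, [phi_coarse, phi_fine])
print(r)  # positive penalty; zero iff all features match
```

In practice `feature_maps` would wrap the frozen backbone's intermediate activations, and the penalty would be minimised jointly with the data-fidelity term.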

The domain gap between natural images and RF scenes is significant. Foundation model priors work best when the RF scene has visual similarity to natural images (e.g., SAR images of urban areas). For scenes with very different statistics (sparse point scatterers, subsurface imaging), domain-specific fine-tuning is necessary.

Definition:

Reconstruct Anything Model (RAM)

The Reconstruct Anything Model (RAM) is a foundation model for image reconstruction trained across many forward operators:

$$f_\theta^* = \arg\min_\theta \sum_{k=1}^{K} \mathbb{E}_{\mathbf{x}, \mathbf{w}}\bigl[\|f_\theta(\mathbf{A}_k\mathbf{x} + \mathbf{w}) - \mathbf{x}\|^2\bigr],$$

where $\{\mathbf{A}_k\}_{k=1}^K$ includes denoising, inpainting, super-resolution, compressed sensing, and other forward operators.

At test time, RAM adapts to new forward operators via a conditioning mechanism (the operator $\mathbf{A}$ is encoded as an input to the network), enabling zero-shot transfer to unseen inverse problems.

RAM represents a paradigm shift: instead of training a separate network for each forward operator, a single model handles all of them. The key enabler is a large and diverse training set that covers many forward operators and noise levels.
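The multi-operator objective above can be sketched with a deliberately tiny linear model. Everything here is a stand-in: the "network" is a single matrix applied to the back-projection $\mathbf{A}^\top \mathbf{y}$ (a crude substitute for RAM's learned operator encoding), and the operator bank holds three toy instances of denoising, compressed sensing, and inpainting.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 32  # signal dimension

# Small bank of forward operators A_k: identity (denoising),
# random rows (compressed sensing), random mask (inpainting).
ops = [
    np.eye(n),
    rng.standard_normal((n // 2, n)) / np.sqrt(n),
    np.diag((rng.random(n) > 0.3).astype(float)),
]

# Toy linear "network": f_theta(y, A) = W @ (A.T @ y). The operator enters
# only through the back-projection A.T y.
W = np.eye(n)

lr, sigma = 1e-3, 0.05
for _ in range(300):
    x = rng.standard_normal(n)              # sample a clean signal
    A = ops[rng.integers(len(ops))]         # sample a forward operator A_k
    y = A @ x + sigma * rng.standard_normal(A.shape[0])
    z = A.T @ y
    resid = W @ z - x                       # f_theta(A_k x + w) - x
    W -= lr * 2.0 * np.outer(resid, z)      # SGD on the squared error

# The same W now serves (imperfectly) for all three inverse problems.
x = rng.standard_normal(n)
for A in ops:
    err = np.sum((W @ (A.T @ (A @ x)) - x) ** 2)
    print(f"operator {A.shape}: squared error {err:.2f}")
```

The point is structural, not quantitative: one parameter set is trained against a distribution over operators, which is exactly the sum over $k$ in the objective; RAM replaces the linear map with a deep conditioned network.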

Example: Transfer Learning from Natural Images to RF Imaging

Assess the feasibility of using a Stable Diffusion model (trained on natural images) as a prior for SAR image reconstruction. Identify the domain gap and propose adaptation strategies.


Domain Gap: Natural Images vs. RF Scenes

Figure: the domain gap between natural images and RF scenes, visualised as the overlap between their feature distributions (in a learned feature space) for different scene types and adaptation strategies.

Urban SAR shows the smallest domain gap (highest overlap), while subsurface imaging shows the largest. LoRA adaptation significantly closes the gap for all scene types.
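LoRA itself is simple enough to sketch directly: the pretrained weight stays frozen and only a low-rank update $\mathbf{W}_0 + (\alpha/r)\,\mathbf{B}\mathbf{A}$ is trained. All dimensions below are illustrative, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 64, 64, 4  # rank r << d gives the parameter saving

W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen pretrained weight
B = np.zeros((d_out, r))                                 # LoRA init: B = 0, so the
A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)       # update starts at exactly 0
alpha = 8.0

def adapted(x):
    # Effective weight W0 + (alpha / r) * B @ A; only B and A are trained.
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(adapted(x), W0 @ x)  # at init, adaptation is a no-op

# Trainable parameters: (d_out + d_in) * r vs d_out * d_in for full fine-tuning.
print((d_out + d_in) * r, "LoRA params vs", d_out * d_in, "full fine-tuning")
```

Because only `B` and `A` receive gradients, domain adaptation on a modest set of RF scenes touches a small fraction of the model, which is why LoRA is the default adaptation strategy in this setting.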

⚠️ Engineering Note

Computational Cost of Foundation Model Inference

Foundation models for image reconstruction are computationally expensive at inference time:

  • Diffusion-based (DPS with Stable Diffusion): 100--1000 neural function evaluations (NFEs) at $\sim 0.5$ s each, i.e. 50--500 s per image on an A100 GPU.
  • RAM (single forward pass): $\sim 0.1$--$1$ s per image, but requires the forward operator as a conditioning input.
  • DIP (per-image optimisation): $\sim 60$--$300$ s per image.

For real-time RF imaging ($> 10$ frames/s), only feed-forward models (RAM, trained U-Nets) are feasible. Diffusion-based approaches are suitable for offline high-quality reconstruction.

Memory: Stable Diffusion requires $\sim 8$ GB of VRAM; RAM requires $\sim 4$ GB. Both fit on modern GPUs.

🎓 CommIT Contribution (2026)

Caire's Vision: RF Imaging Foundation Model

G. Caire β€” Internal note, TU Berlin

Caire envisions a foundation model for RF imaging that generalises across sensing geometries, frequencies, and scene types. The key ingredients:

  1. Diverse simulation data: Electromagnetic solvers (Sionna, custom FDTD) generate training data for many array geometries, carrier frequencies, and scene types.

  2. Physics-conditioned architecture: The forward model $\mathbf{A}$ is encoded as an input, so the model adapts to new sensing configurations without retraining.

  3. Self-supervised fine-tuning: For deployment with real measurements (where ground truth is unavailable), EI or SURE losses fine-tune the foundation model using techniques from Sections 23.3--23.4.

This vision connects the electromagnetic modelling of Part II (Chapters 5--8) with the learned reconstruction of Part VI, closing the loop from physics to data-driven methods.


Why This Matters: The Future of Foundation Models in RF Imaging

Foundation models for RF imaging are in their infancy, but the trajectory is clear:

  1. Simulation-pretrained RF foundation models: Large-scale electromagnetic simulators can generate diverse RF scenes for pretraining domain-specific foundation models.

  2. Multi-modal foundation models: Models trained jointly on optical and RF data can transfer visual knowledge to RF imaging, exploiting the structural similarity between optical and RF scenes (same underlying geometry, different wavelengths).

  3. Physics-aware foundation models: Integrating the forward model into the foundation model architecture combines the generality of foundation models with the specificity of physics-based methods.

The convergence of foundation models and physics-based imaging is a frontier research direction. The self-supervised methods of this chapter (EI, SURE, measurement splitting) provide the fine-tuning tools needed to adapt foundation models to real RF measurements without ground truth.

Quick Check

What is the main challenge in using a natural-image foundation model (e.g., Stable Diffusion) as a prior for RF imaging reconstruction?

  • The model is too large to run on available hardware.
  • The domain gap between natural images and RF scenes causes the model to hallucinate features not present in the RF data.
  • Foundation models cannot be conditioned on measurement operators.
  • Foundation models require too much training data.

Common Mistake: Foundation Models Can Hallucinate RF Features

Mistake:

Using a natural-image foundation model for RF reconstruction without verifying that the reconstructed features are consistent with the physics of RF propagation.

Correction:

Foundation models trained on natural images may generate textures, patterns, or structures that look plausible visually but have no physical basis in the RF measurements.

Mitigation strategies:

  • Always enforce data consistency ($\mathbf{A}\hat{\mathbf{x}} \approx \mathbf{y}$) as a hard constraint or a strong loss term.
  • Use physics-based sanity checks: verify that the reconstructed reflectivity is consistent with the radar cross-section budget.
  • Prefer LoRA fine-tuning on domain-specific data over zero-shot transfer.
  • Report uncertainty maps to flag regions where the reconstruction relies heavily on the prior rather than the measurements.
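The first mitigation, hard data consistency, amounts to projecting the prior-driven reconstruction onto the set $\{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{y}\}$. A minimal sketch, assuming a noiseless linear model and a random toy operator (in practice the projection uses the actual measurement operator and a noise-aware tolerance):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 64
A = rng.standard_normal((m, n)) / np.sqrt(n)      # toy forward operator
x_true = np.zeros(n)
x_true[rng.integers(0, n, 5)] = 1.0               # sparse point scatterers
y = A @ x_true                                    # noiseless measurements

# A prior-heavy reconstruction x_prior may hallucinate features; project it
# onto the affine set {x : A x = y} to enforce hard data consistency:
#   x_dc = x_prior - A^+ (A x_prior - y)
x_prior = x_true + 0.5 * rng.standard_normal(n)   # stand-in foundation-model output
x_dc = x_prior - np.linalg.pinv(A) @ (A @ x_prior - y)
```

After the projection the reconstruction reproduces the measurements exactly, so any remaining hallucination lives only in the null space of $\mathbf{A}$, which uncertainty maps can then flag.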

Historical Note: From GPT to RAM: The Foundation Model Revolution

2021--present

The term "foundation model" was coined by the Stanford CRFM in 2021 to describe large models trained on broad data that can be adapted to many downstream tasks. While the concept originated in NLP (GPT, BERT), it quickly spread to computer vision (CLIP, SAM, DINOv2) and then to scientific imaging.

The Reconstruct Anything Model (RAM, 2024) was among the first foundation models specifically designed for inverse problems. Trained on multiple forward operators simultaneously, RAM demonstrated that a single network can handle denoising, inpainting, super-resolution, and compressed sensing --- tasks that previously required separate specialised networks.

For RF imaging, the foundation model paradigm is still emerging. The closest analogue is Liyue Shen's work on diffusion models for medical image reconstruction, which Caire identified as a potential collaboration direction.

Foundation Model

A large-scale model pretrained on broad, diverse data that can be adapted (via fine-tuning, prompting, or conditioning) to many downstream tasks with minimal additional training.

Related: Domain Gap, Transfer Learning

Domain Gap

The statistical mismatch between the training data distribution (e.g., natural images) and the target domain (e.g., RF reflectivity maps), which causes pretrained models to produce inaccurate or hallucinated outputs in the new domain.

Related: Foundation Model

Transfer Learning

The practice of adapting a model trained on one task/domain to a different task/domain, typically via fine-tuning, LoRA, or feature extraction, reducing the need for large labelled datasets in the target domain.

Related: Foundation Model, Domain Gap

Key Takeaway

Foundation models provide general-purpose image priors that can be applied to RF imaging via DPS, feature-based regularisation, or multi-operator conditioning (RAM). The domain gap between natural images and RF scenes is the main challenge, requiring LoRA adaptation or domain-specific pretraining. Caire's vision of a simulation-pretrained RF foundation model that fine-tunes via EI/SURE losses represents a convergence of the physics-based and data-driven approaches developed throughout this book.