Datasets for RF Imaging Research

Data Is the Foundation

Every RF imaging algorithm must be evaluated on data. Real measured data is scarce and expensive to collect; synthetic data is abundant but risks the inverse crime. This section surveys the standard datasets available to the community, describes how to generate synthetic data responsibly, and formalises the inverse crime --- the most common methodological sin in computational imaging.

Definition: 3D Shape Datasets for Synthetic RF Imaging

ShapeNet: a large repository of 3D CAD models ($> 51{,}000$ models in 55 categories). Used to generate diverse target geometries for training learned reconstruction methods. Each model can be voxelised onto the imaging grid to produce a ground-truth reflectivity map.

THuman: a collection of 3D human body meshes from body scanning. Used for human-body RF imaging (e.g., through-wall sensing, vital sign monitoring) where the target has realistic human anatomy and posture.

MPEG-7 Shape Dataset: 2D shape silhouettes used as targets for 2D RF imaging experiments. Simpler than 3D models but useful for rapid prototyping and algorithm comparison.
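As a concrete sketch of the voxelisation step mentioned above, the snippet below bins surface points sampled from a mesh into a binary occupancy grid that can serve as a ground-truth reflectivity map. The function name and the toy sphere input are illustrative, not part of any dataset's tooling:

```python
import numpy as np

def voxelise_points(points, grid_n=64, bounds=(-1.0, 1.0)):
    """Voxelise a 3D point cloud (e.g. points sampled from a ShapeNet
    mesh surface) into a binary occupancy grid.

    points : (N, 3) array with coordinates inside `bounds` on each axis.
    Returns a (grid_n, grid_n, grid_n) float array of 0s and 1s.
    """
    lo, hi = bounds
    edges = np.linspace(lo, hi, grid_n + 1)
    occ, _ = np.histogramdd(points, bins=(edges, edges, edges))
    return (occ > 0).astype(np.float32)

# Toy example: 5000 points on a sphere of radius 0.5
rng = np.random.default_rng(0)
v = rng.normal(size=(5000, 3))
v = 0.5 * v / np.linalg.norm(v, axis=1, keepdims=True)
grid = voxelise_points(v, grid_n=32)
print(grid.shape)  # (32, 32, 32)
```

In practice one would sample points from the mesh surface (or rasterise the solid interior) with a mesh library; the binning step itself is as simple as shown.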

Definition: Measurement Datasets for RF Imaging

Automotive radar datasets:

  • nuScenes: 1000 driving scenes with 5 radar sensors, camera, and LiDAR. 3D bounding boxes for 23 object classes.
  • RadarScenes: 158 sequences, 4 hours of driving data with radar point clouds and annotations.

Indoor imaging datasets:

  • DeepMIMO: simulated massive MIMO channel dataset from ray tracing, covering multiple indoor and outdoor scenarios at 2.4, 28, and 60 GHz.
  • DeepSense 6G: multi-modal sensing dataset for 6G research, including radar, camera, LiDAR, and GPS data.

SAR datasets:

  • MSTAR: Moving and Stationary Target Acquisition and Recognition dataset. SAR images of military vehicles at various aspect angles.
  • SEN12MS: multi-modal satellite dataset (SAR + optical) for land use classification.

Most datasets provide processed images or point clouds rather than raw radar measurements. For algorithm development at the signal processing level, raw ADC data or channel measurements are needed --- these are less commonly available.


Definition: Synthetic Data Generation Pipeline

When real data is unavailable or insufficient, simulation provides controlled, labelled data. A synthetic data pipeline has four stages:

  1. Scene sampling: draw a random scene from a distribution (e.g., sample a ShapeNet model, place it at random position/orientation, set reflectivity parameters).

  2. Forward model: compute synthetic measurements $\mathbf{y} = \mathbf{A}\mathbf{c} + \mathbf{w}$ using the chosen physics (Born model, ray tracing, or full-wave).

  3. Noise and impairments: add receiver noise ($\mathbf{w} \sim \mathcal{CN}(0, \sigma^2\mathbf{I})$), phase noise, clutter, and hardware impairments.

  4. Ground truth pairing: store both the measurement $\mathbf{y}$ and the scene $\mathbf{c}$ as an (input, target) pair for supervised training.
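The four stages can be sketched end to end. The Fourier-style forward matrix below is a hypothetical stand-in for whichever physics model (Born, ray tracing) a given project actually uses; all names and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_scene(n_grid=64, n_targets=5):
    """Stage 1: draw a random sparse scene (complex reflectivity vector c)."""
    c = np.zeros(n_grid, dtype=complex)
    idx = rng.choice(n_grid, size=n_targets, replace=False)
    c[idx] = (rng.uniform(0.5, 1.0, n_targets)
              * np.exp(1j * rng.uniform(0, 2 * np.pi, n_targets)))
    return c

def make_dataset(A, n_pairs=100, sigma=0.05):
    """Stages 2-4: forward model y = A c + w, circular complex noise,
    and (measurement, scene) pairing for supervised training."""
    pairs = []
    for _ in range(n_pairs):
        c = sample_scene(A.shape[1])
        w = sigma / np.sqrt(2) * (rng.standard_normal(A.shape[0])
                                  + 1j * rng.standard_normal(A.shape[0]))
        pairs.append((A @ c + w, c))
    return pairs

# Hypothetical linearised forward operator: random Fourier-sampling rows
M, N = 32, 64
freqs = rng.uniform(0, 1, M)
A = np.exp(-2j * np.pi * np.outer(freqs, np.arange(N))) / np.sqrt(M)
data = make_dataset(A)
```

Stage 3 here adds only receiver noise; phase noise, clutter, and hardware impairments would be extra terms injected at the same point.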

The critical question: does the forward model used for data generation match the one used for reconstruction? If yes, you have committed the inverse crime.


Definition: The Inverse Crime

The inverse crime is committed when the same forward model generates the synthetic data and reconstructs from it:

$$\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \Big\| \mathbf{A}_{\mathrm{inv}} \mathbf{c} - \underbrace{\left(\mathbf{A}_{\mathrm{fwd}} \mathbf{c}_{\mathrm{true}} + \mathbf{w}\right)}_{\text{synthetic data}} \Big\|^2 \qquad \text{with } \mathbf{A}_{\mathrm{inv}} = \mathbf{A}_{\mathrm{fwd}}.$$

When $\mathbf{A}_{\mathrm{fwd}} = \mathbf{A}_{\mathrm{inv}}$, the reconstruction problem reduces to denoising: the model mismatch error is zero, and the only error source is noise. This makes any reconstruction method look artificially good.

The crime is aggravated when:

  • The same discretisation grid is used for generation and reconstruction;
  • Point scatterers are placed exactly on grid points;
  • The same random seed generates training and test data.

Theorem: Error Inflation from the Inverse Crime

Let $\mathbf{A}_{\mathrm{true}}$ be the true physics and $\mathbf{A}_{\mathrm{approx}}$ the approximate model used for both simulation and reconstruction. The reconstruction error under the crime is:

$$\|\hat{\mathbf{c}}_{\mathrm{crime}} - \mathbf{c}_{\mathrm{true}}\| \leq \frac{\|\mathbf{w}\|}{\sigma_{\min}(\mathbf{A}_{\mathrm{approx}})},$$

while the honest error (different models) is:

$$\|\hat{\mathbf{c}}_{\mathrm{honest}} - \mathbf{c}_{\mathrm{true}}\| \leq \frac{\|\mathbf{w}\| + \|(\mathbf{A}_{\mathrm{true}} - \mathbf{A}_{\mathrm{approx}})\mathbf{c}_{\mathrm{true}}\|}{\sigma_{\min}(\mathbf{A}_{\mathrm{approx}})}.$$

The model mismatch term $\|(\mathbf{A}_{\mathrm{true}} - \mathbf{A}_{\mathrm{approx}})\mathbf{c}_{\mathrm{true}}\|$ is absent under the crime, leading to optimistic error bounds.
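The honest bound follows from one line of algebra, assuming a least-squares reconstruction with a full-column-rank $\mathbf{A}_{\mathrm{approx}}$ (so that $\mathbf{A}_{\mathrm{approx}}^{+}\mathbf{A}_{\mathrm{approx}} = \mathbf{I}$):

```latex
\begin{aligned}
\hat{\mathbf{c}}_{\mathrm{honest}}
  &= \mathbf{A}_{\mathrm{approx}}^{+}\,\mathbf{y}
   = \mathbf{A}_{\mathrm{approx}}^{+}\!\left(\mathbf{A}_{\mathrm{approx}}\mathbf{c}_{\mathrm{true}}
     + (\mathbf{A}_{\mathrm{true}} - \mathbf{A}_{\mathrm{approx}})\mathbf{c}_{\mathrm{true}} + \mathbf{w}\right) \\
  &= \mathbf{c}_{\mathrm{true}}
     + \mathbf{A}_{\mathrm{approx}}^{+}\!\left((\mathbf{A}_{\mathrm{true}} - \mathbf{A}_{\mathrm{approx}})\mathbf{c}_{\mathrm{true}} + \mathbf{w}\right),
\end{aligned}
```

and taking norms with $\|\mathbf{A}_{\mathrm{approx}}^{+}\| = 1/\sigma_{\min}(\mathbf{A}_{\mathrm{approx}})$ gives the honest bound; setting $\mathbf{A}_{\mathrm{true}} = \mathbf{A}_{\mathrm{approx}}$ recovers the crime bound.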


Inverse Crime Demonstration

Demonstrate the inverse crime for a 1D imaging scenario. The plot shows: (left) the ground truth scene, (centre) the reconstruction via least squares, and (right) the error map.

Same grid (crime): toggle ON to use identical grids for simulation and reconstruction. The reconstruction is nearly perfect.

Different grids (honest): toggle OFF to use a finer grid (256 points) for simulation and a coarser grid (64 points) for reconstruction. Model mismatch causes visible artefacts, revealing the true performance.
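A minimal non-interactive version of this experiment can be written in a few lines, assuming a hypothetical point-scatterer Fourier forward model (256-point fine grid for simulation, 64-point coarse grid for reconstruction):

```python
import numpy as np

rng = np.random.default_rng(1)
N_fine, N_coarse = 256, 64
freqs = np.arange(N_coarse)                       # one spatial frequency per coarse cell

def steering(x):
    """Point-scatterer forward matrix for target positions x in [0, 1)."""
    return np.exp(-2j * np.pi * np.outer(freqs, x))

x_fine = (np.arange(N_fine) + 0.5) / N_fine
x_coarse = (np.arange(N_coarse) + 0.5) / N_coarse
A_fine, A_coarse = steering(x_fine), steering(x_coarse)

# Sparse truth on the FINE grid: targets generally land off the coarse grid
c_fine = np.zeros(N_fine)
c_fine[rng.choice(N_fine, 5, replace=False)] = 1.0
c_ref = c_fine.reshape(N_coarse, -1).sum(axis=1)  # truth binned to the coarse grid

w = 0.01 * (rng.standard_normal(N_coarse) + 1j * rng.standard_normal(N_coarse))
y_crime = A_coarse @ c_ref + w                    # simulated with the reconstruction model
y_honest = A_fine @ c_fine + w                    # simulated on the fine grid

rec = lambda y: np.linalg.lstsq(A_coarse, y, rcond=None)[0]
err = lambda y: np.linalg.norm(rec(y) - c_ref) / np.linalg.norm(c_ref)
print(f"crime error:  {err(y_crime):.3f}")        # near zero: pure denoising
print(f"honest error: {err(y_honest):.3f}")       # larger: model mismatch revealed
```

The crime case recovers the scene almost perfectly because the only error source is noise; the honest case exposes the discretisation mismatch that a real system would face.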


Definition: Strategies for Avoiding the Inverse Crime

To avoid the inverse crime, ensure the forward model used for data generation differs meaningfully from the reconstruction model:

  1. Different discretisations: generate data on a fine grid ($N_{\mathrm{fwd}} \gg N_{\mathrm{inv}}$) and reconstruct on a coarser grid. Rule of thumb: $N_{\mathrm{fwd}} \geq 4 N_{\mathrm{inv}}$.

  2. Different physics: generate with ray tracing or FDTD; reconstruct with the Born approximation.

  3. Off-grid scatterers: place point targets at positions that do not coincide with the reconstruction grid.

  4. Model mismatch injection: add calibration errors, timing offsets, and mutual coupling to the simulated data but not to the reconstruction model.

  5. Real data validation: always validate on measured data when available.


Example: Detecting the Inverse Crime in a Paper

A paper claims 45 dB PSNR for CS reconstruction of a radar image using LASSO with $\lambda$ optimised on a validation set. The data is generated using a point-scatterer model with 20 targets on a $64 \times 64$ grid. The same model is used for LASSO's $\mathbf{A}$ matrix. Is the inverse crime present? Yes: the data-generating model and the reconstruction dictionary coincide, and if the 20 targets sit exactly on the grid, the on-grid placement aggravates the crime further.

Common Mistake: The Inverse Crime Is Not Just About Grids

Mistake:

Believing that using a slightly different grid spacing avoids the inverse crime --- e.g., generating on a $65 \times 65$ grid and reconstructing on $64 \times 64$.

Correction:

The grid difference must introduce meaningful model mismatch. A $65 \times 65$ vs. $64 \times 64$ grid produces near-zero mismatch. Use at least $4\times$ oversampling for generation ($256 \times 256$), and ideally use a different physics model (ray tracing vs. Born) or inject hardware impairments (phase noise, mutual coupling) that the reconstruction model does not know about.

Common Mistake: On-Grid Targets Inflate Sparse Recovery Performance

Mistake:

Placing point scatterers exactly on the reconstruction grid when evaluating sparse recovery algorithms (LASSO, OMP).

Correction:

On-grid targets perfectly match the dictionary, making the problem artificially easy. In reality, targets are continuous and the grid introduces basis mismatch. Always place targets at positions that do not coincide with the reconstruction grid (off-grid), or use super-resolution methods that explicitly model the continuous parameter.
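The size of the basis mismatch is easy to quantify: an off-grid steering vector never correlates perfectly with any dictionary column. A sketch, assuming a unit-norm Fourier steering dictionary (the names and sizes are illustrative):

```python
import numpy as np

M, n_grid = 64, 64
m = np.arange(M)
# Unit-norm Fourier steering atom at continuous position x in [0, 1)
atom = lambda x: np.exp(-2j * np.pi * m * x) / np.sqrt(M)

# On-grid dictionary: one atom per reconstruction grid point
D = np.stack([atom(n / n_grid) for n in range(n_grid)], axis=1)

coh = lambda a: np.abs(D.conj().T @ a).max()      # best match in the dictionary
print(f"on-grid target:  {coh(atom(10 / n_grid)):.3f}")    # exactly 1: perfect match
print(f"off-grid target: {coh(atom(10.5 / n_grid)):.3f}")  # ~0.64: basis mismatch
```

A target halfway between grid points loses over a third of its correlation with the nearest atom, so sparse solvers must smear its energy across several columns --- precisely the effect that on-grid evaluation hides.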

Historical Note: The Name "Inverse Crime"


The term "inverse crime" was popularised by Armand Wirgin's 2004 essay of the same name, though the pitfall was well known in the inverse problems community for decades before; Colton and Kress had already warned against it in their classic textbook on inverse scattering theory. Despite widespread awareness, the inverse crime continues to appear in published RF imaging papers, often unintentionally, because the default simulation setup (generate and reconstruct with the same code) commits the crime by default.

,

Inverse Crime

The methodological error of using the same forward model for both synthetic data generation and reconstruction, producing artificially optimistic results by eliminating model mismatch.

ShapeNet

A large-scale 3D model repository containing over 51,000 CAD models in 55 categories, widely used for generating training data in computational imaging and computer vision research.

Quick Check

A paper reports 42 dB PSNR using a deep unrolling network trained and tested on data from the same Born-approximation forward model with on-grid targets. What is the most likely explanation for this high PSNR?

  • The deep unrolling architecture is extremely powerful
  • The inverse crime: same forward model and on-grid targets
  • The SNR is very high
  • The scene is very sparse

Key Takeaway

Standard datasets (nuScenes, MSTAR, DeepMIMO) provide benchmarking data, but raw signal-level data remains scarce. Synthetic generation fills the gap but demands vigilance against the inverse crime: use different grids ($4\times$ oversampling), different physics models, off-grid targets, and always validate on measured data. Papers reporting PSNR $> 40$ dB on purely simulated data should be scrutinised.