Evaluation Metrics
What Gets Measured Gets Optimised
The choice of evaluation metric shapes the conclusions of every RF imaging paper. PSNR, SSIM, and LPIPS can disagree on which reconstruction is "better." Shape metrics (Chamfer, IoU) matter for 3D reconstruction. Detection metrics ($P_d$, ROC) matter for radar applications. Computational metrics (FLOPs, inference time) determine practical deployability. This section defines each metric precisely and shows when they agree and when they disagree.
Definition: Peak Signal-to-Noise Ratio (PSNR)
Peak Signal-to-Noise Ratio (PSNR)
PSNR measures the reconstruction quality relative to the dynamic range of the image:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$

where $\mathrm{MAX}$ is the maximum scene value and MSE is the mean squared error over voxels.
Typical values: $> 30$ dB (good), $> 40$ dB (excellent), $< 20$ dB (poor).
PSNR is the most widely used metric but has limitations: it does not capture perceptual quality, structural preservation, or the distribution of errors across the image. PSNR favours smooth reconstructions that minimise total error energy.
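As a concrete reference, here is a minimal NumPy sketch of PSNR; the helper name and the toy scene are our own choices, not from any particular library:

```python
import numpy as np

def psnr(x, x_hat, max_val=None):
    """PSNR in dB. max_val defaults to the peak of the reference scene."""
    if max_val is None:
        max_val = np.max(np.abs(x))
    mse = np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)

# Toy check: noise with std 0.01 on a unit-peak scene gives roughly 40 dB.
scene = np.zeros((64, 64))
scene[20, 20] = 1.0
noisy = scene + 0.01 * np.random.default_rng(0).standard_normal(scene.shape)
print(f"{psnr(scene, noisy):.1f} dB")
```

Note that the choice of `max_val` matters: normalising by the reference peak rather than a fixed dynamic range changes the reported number, so it should always be stated.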
Definition: Structural Similarity Index (SSIM)
Structural Similarity Index (SSIM)
SSIM compares local image patches in terms of luminance, contrast, and structure:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu$, $\sigma^2$, and $\sigma_{xy}$ are the local mean, variance, and cross-covariance computed over a sliding window, and $C_1$, $C_2$ are stabilisation constants.
$\mathrm{SSIM} \in [-1, 1]$, where $\mathrm{SSIM} = 1$ indicates perfect reconstruction. The overall SSIM is the average over all windows (MSSIM).
SSIM is more perceptually meaningful than PSNR because it captures structural information. For RF imaging, SSIM correlates better with target detection performance than PSNR.
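A compact sketch of MSSIM using a uniform box window; the reference implementation uses an 11x11 Gaussian window, and we assume images scaled to $[0, 1]$ with the standard constants $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mssim(x, y, win=7, L=1.0):
    """Mean SSIM with a box window (the standard uses a Gaussian window)."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    # Local means, variances, and cross-covariance over the sliding window.
    mu_x, mu_y = uniform_filter(x, win), uniform_filter(y, win)
    var_x = uniform_filter(x * x, win) - mu_x**2
    var_y = uniform_filter(y * y, win) - mu_y**2
    cov = uniform_filter(x * y, win) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2)
    return float(np.mean(num / den))
```

For published results, prefer a reference implementation (e.g. scikit-image's `structural_similarity`) so window and constant choices are reproducible.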
Definition: Learned Perceptual Image Patch Similarity (LPIPS)
Learned Perceptual Image Patch Similarity (LPIPS)
LPIPS measures perceptual distance using features from a pretrained deep network (typically VGG or AlexNet):

$$\mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{hw} - \phi_l(y)_{hw} \right) \right\|_2^2$$

where $\phi_l$ extracts features at layer $l$ and $w_l$ are learned channel weights. Lower LPIPS means higher perceptual similarity.
LPIPS captures high-level structural features that PSNR and SSIM miss. However, it was trained on natural images, so its relevance to RF reflectivity maps (which look different from photographs) is an open question.
Definition: Normalised Mean Squared Error (NMSE)
Normalised Mean Squared Error (NMSE)
NMSE measures the relative reconstruction error:

$$\mathrm{NMSE} = \frac{\|\hat{x} - x\|_2^2}{\|x\|_2^2}$$

$\mathrm{NMSE} = 0$ is perfect reconstruction; $\mathrm{NMSE} = 1$ means the reconstruction error equals the signal energy.
In dB: $\mathrm{NMSE}_{\mathrm{dB}} = 10 \log_{10}(\mathrm{NMSE})$. Typical targets: $< -20$ dB (good), $< -30$ dB (excellent).
NMSE is preferred over PSNR when comparing across different scenes because it is normalised by the signal energy. PSNR depends on the dynamic range of each specific image.
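The normalisation is the whole point, so a sketch is short; the signal and noise level below are arbitrary:

```python
import numpy as np

def nmse_db(x, x_hat):
    """NMSE in dB: 10 log10(||x_hat - x||^2 / ||x||^2)."""
    x, x_hat = np.asarray(x), np.asarray(x_hat)
    return 10.0 * np.log10(np.sum((x_hat - x) ** 2) / np.sum(x**2))

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
x_hat = x + 0.1 * rng.standard_normal(10_000)
# Error energy is ~1% of signal energy, i.e. about -20 dB, and the value
# is unchanged if both x and x_hat are rescaled by the same factor.
print(f"{nmse_db(x, x_hat):.1f} dB")
```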
Definition: Chamfer and Hausdorff Distances
Chamfer and Hausdorff Distances
For 3D reconstructions represented as point sets $P$ (reconstruction) and $Q$ (ground truth):

Chamfer distance:

$$d_{\mathrm{CD}}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|q - p\|_2$$

Hausdorff distance:

$$d_{\mathrm{H}}(P, Q) = \max\left\{ \max_{p \in P} \min_{q \in Q} \|p - q\|_2,\; \max_{q \in Q} \min_{p \in P} \|q - p\|_2 \right\}$$
Chamfer is the average nearest-neighbour distance (sensitive to overall shape). Hausdorff is the worst-case distance (sensitive to outliers).
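The average-vs-worst-case distinction is easy to demonstrate; this brute-force sketch (fine for small point clouds, a k-d tree is preferable at scale) uses an arbitrary random cloud with one injected outlier:

```python
import numpy as np

def _pairwise(P, Q):
    # (|P|, |Q|) matrix of Euclidean distances between two 3D point sets.
    return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)

def chamfer(P, Q):
    D = _pairwise(P, Q)
    return D.min(axis=1).mean() + D.min(axis=0).mean()

def hausdorff(P, Q):
    D = _pairwise(P, Q)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# One stray point barely moves Chamfer (an average) but dominates
# Hausdorff (a worst case).
rng = np.random.default_rng(0)
P = rng.random((200, 3))                      # points in the unit cube
Q = np.vstack([P, [[10.0, 10.0, 10.0]]])      # same cloud plus an outlier
print(f"Chamfer {chamfer(P, Q):.3f}, Hausdorff {hausdorff(P, Q):.1f}")
```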
Definition: IoU and F-Score for Volumetric Evaluation
IoU and F-Score for Volumetric Evaluation
For volumetric (voxel-based) reconstructions:
Intersection over Union (IoU):

$$\mathrm{IoU} = \frac{|V_{\hat{x}} \cap V_x|}{|V_{\hat{x}} \cup V_x|}$$

where $V$ is the set of occupied voxels after thresholding. $\mathrm{IoU} \in [0, 1]$; higher is better.
F-Score at threshold $\tau$:

$$F(\tau) = \frac{2\,P(\tau)\,R(\tau)}{P(\tau) + R(\tau)}$$

where precision $P(\tau)$ and recall $R(\tau)$ are computed by counting predicted points within distance $\tau$ of a ground truth point and vice versa. The F-score at multiple thresholds characterises both coarse and fine shape recovery.
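Both metrics reduce to a few lines of NumPy; the grids and threshold below are illustrative choices:

```python
import numpy as np

def iou(pred, gt):
    """IoU between two boolean occupancy grids (after thresholding)."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    return np.logical_and(pred, gt).sum() / np.logical_or(pred, gt).sum()

def f_score(P, Q, tau):
    """F-score between predicted and ground-truth point sets at threshold tau."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    precision = (D.min(axis=1) < tau).mean()   # predicted points near a GT point
    recall = (D.min(axis=0) < tau).mean()      # GT points near a predicted point
    return 2 * precision * recall / (precision + recall + 1e-12)
```

Because IoU depends on the occupancy threshold, reporting it at several thresholds (or sweeping them) is good practice.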
Definition: Detection Metrics
Detection Metrics
For radar detection applications:
Probability of detection $P_d$: the fraction of true targets that are correctly detected (above threshold).

Probability of false alarm $P_{fa}$: the fraction of non-target locations that exceed the detection threshold.

ROC curve: plots $P_d$ vs. $P_{fa}$ as the threshold varies. The area under the ROC (AUC) summarises detection performance in a single number ($\mathrm{AUC} = 1$ is a perfect detector; $0.5$ is chance).

Operating point: typically report $P_d$ at a fixed $P_{fa}$, reflecting the practical requirement of low false-alarm rates.
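A toy Monte-Carlo sketch of $P_d$, $P_{fa}$, and AUC under assumed Gaussian statistics; the SNR, threshold, and sample counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
noise = rng.standard_normal(n)           # detector output, target absent
signal = 2.0 + rng.standard_normal(n)    # detector output, target present

# Operating point at one fixed threshold.
t = 1.0
pd = (signal > t).mean()
pfa = (noise > t).mean()

# AUC via the Mann-Whitney identity: P(target score > background score),
# equivalent to the area under the full ROC curve.
auc = (signal[:, None] > noise[None, :]).mean()
print(f"Pd={pd:.2f}, Pfa={pfa:.3f}, AUC={auc:.3f}")
```

Sweeping `t` instead of fixing it traces out the full ROC curve point by point.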
PSNR vs SSIM vs $P_d$: When Metrics Disagree
Apply different types of degradation to a 1D scene with 4 targets on a smooth background. Observe how PSNR, SSIM, and detection probability respond differently:
- Noise degrades PSNR but may preserve target peaks ($P_d$ stays high).
- Blur preserves PSNR (low total error) but destroys structure (SSIM drops, $P_d$ drops).
- Missing data affects all metrics.
- Structured artifacts can fool PSNR while SSIM and $P_d$ degrade.
Diamond markers show detected (green) vs. missed (red) targets.
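The noise-vs-blur disagreement can be reproduced offline in a few lines; the scene layout, noise level, blur width, and detection threshold below are all arbitrary choices:

```python
import numpy as np

def psnr(x, y):
    return 10 * np.log10(np.max(x) ** 2 / np.mean((x - y) ** 2))

rng = np.random.default_rng(1)
n = 256
scene = np.full(n, 0.1)             # smooth background
targets = [40, 100, 160, 220]
scene[targets] = 1.0                # four point targets

noisy = scene + 0.1 * rng.standard_normal(n)
blurred = np.convolve(scene, np.ones(3) / 3, mode="same")   # mild blur

def n_detected(x, thresh=0.5):
    return sum(x[t] > thresh for t in targets)

for name, rec in [("noisy", noisy), ("blurred", blurred)]:
    print(f"{name}: PSNR {psnr(scene, rec):.1f} dB, {n_detected(rec)}/4 detected")
```

With these settings the two degradations have similar PSNR, yet noise leaves all four peaks detectable while blur smears every peak below the threshold.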
ROC Curves for Different Reconstruction Methods
Compare ROC curves for matched filter, LASSO, and a deep network at different SNR levels. At low SNR, the gap between methods widens. The deep network achieves a higher $P_d$ at the same $P_{fa}$ because it has learned to separate targets from noise.
Definition: Computational Metrics
Computational Metrics
Beyond reconstruction quality, practical deployment requires:
- FLOPs: floating-point operations per reconstruction. Matched filter: $O(MN)$ for $M$ measurements and $N$ voxels. LASSO (ISTA, $K$ iterations): $O(KMN)$. Deep network: depends on architecture.
- Inference time: wall-clock time for a single reconstruction on specified hardware (GPU model, batch size).
- Memory: peak GPU memory during reconstruction.
- Training time: total GPU-hours for learned methods (amortised over all future reconstructions).
- Number of parameters: for neural networks, total trainable parameters.
Report computational metrics alongside quality metrics. A method that achieves 0.5 dB higher PSNR but takes orders of magnitude longer may not be preferable.
Evaluation Metrics Overview
| Metric | Type | Range | Best For | Limitation |
|---|---|---|---|---|
| PSNR (dB) | Image quality | $[0, \infty)$, higher better | Overall error | Ignores structure |
| SSIM | Image quality | $[-1, 1]$, higher better | Structural fidelity | Local windows only |
| LPIPS | Perceptual | $[0, \infty)$, lower better | Perceptual quality | Trained on natural images |
| NMSE (dB) | Image quality | $(-\infty, \infty)$, lower better | Cross-scene comparison | Same as PSNR |
| Chamfer | Shape (3D) | $[0, \infty)$, lower better | Average shape error | Ignores outliers |
| Hausdorff | Shape (3D) | $[0, \infty)$, lower better | Worst-case error | Sensitive to outliers |
| IoU | Volume | $[0, 1]$, higher better | Occupancy overlap | Threshold-dependent |
| $P_d$ at fixed $P_{fa}$ | Detection | $[0, 1]$, higher better | Target detection | Binary (no localisation) |
| AUC-ROC | Detection | $[0, 1]$, higher better | Overall detectability | Averages over all thresholds |
| FLOPs | Compute | $[0, \infty)$, lower better | Computational cost | Hardware-dependent |
Example: Metric Disagreement in Practice
Two reconstruction algorithms produce images with PSNR 28 dB and 25 dB respectively. However, the second algorithm has a higher target detection probability $P_d$ at the same false-alarm rate $P_{fa}$. Explain this discrepancy.
PSNR analysis
Algorithm 1 has lower total error (higher PSNR) but the errors may be concentrated in the background (uniform noise reduction), leaving target regions slightly degraded.
Detection analysis
Algorithm 2 has higher total error but preserves target peaks better (perhaps using a sparsity prior that enhances point targets at the cost of background noise). The improved contrast at target locations leads to better detection despite worse overall PSNR.
Recommendation
For detection applications, use task-specific metrics ($P_d$, ROC) rather than image quality metrics (PSNR, SSIM). PSNR is a reasonable proxy for general image quality but can be misleading for specific tasks.
Report Multiple Metrics: Always
No single metric captures all aspects of reconstruction quality. Every RF imaging paper should report at minimum:
- One image quality metric (PSNR or NMSE).
- One structural metric (SSIM).
- One task-specific metric ($P_d$, Chamfer, or IoU depending on the application).
- Computational cost (inference time and memory).
Papers reporting only PSNR are incomplete. Reviewers should request additional metrics when they are missing.
Common Mistake: Reporting Only PSNR
Mistake:
Comparing algorithms using only PSNR and concluding that the one with higher PSNR is "better."
Correction:
PSNR measures total squared error, which favours smooth reconstructions. A method with 3 dB lower PSNR may preserve edges and targets better (higher SSIM, higher $P_d$). Always report SSIM and a task-specific metric alongside PSNR.
Common Mistake: Blindly Using LPIPS for RF Images
Mistake:
Using LPIPS (trained on natural images) as the primary metric for RF reflectivity maps without questioning its relevance.
Correction:
LPIPS was trained on natural photographs and may not capture the perceptual features relevant to RF images (which are sparse, have different statistics, and are interpreted differently). Use LPIPS as a supplementary metric, not as the primary one. For RF imaging, SSIM and detection metrics are more meaningful.
Historical Note: SSIM: Beyond Mean Squared Error
2004: SSIM was introduced by Wang, Bovik, Sheikh, and Simoncelli, motivated by the observation that MSE (and hence PSNR) correlates poorly with human perception of image quality. The paper has been cited over 50,000 times and SSIM has become a standard metric across all imaging fields. The key insight: the human visual system is adapted to structural information, not pixel-wise error.
PSNR (Peak Signal-to-Noise Ratio)
A logarithmic measure of reconstruction quality defined as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. Higher is better. Standard image quality metric but insensitive to structural degradation.
ROC Curve
Receiver Operating Characteristic: a plot of $P_d$ vs. $P_{fa}$ as the detection threshold varies. Summarises the trade-off between detecting true targets and generating false alarms.
Quick Check
Algorithm A achieves PSNR = 30 dB, SSIM = 0.75. Algorithm B achieves PSNR = 27 dB, SSIM = 0.92. For a target detection task, which is likely better?
A, because it has higher PSNR
B, because higher SSIM indicates better structural preservation
Cannot determine without detection metrics
They are equivalent
Correct. SSIM correlates better with detection because it measures structural fidelity, which preserves target peaks.
Key Takeaway
PSNR measures total error energy; widely used but insensitive to structure. SSIM captures structural similarity and correlates better with detection. LPIPS captures perceptual quality but was trained on natural images. For 3D reconstruction, use Chamfer/Hausdorff distance and IoU. For detection, use ROC curves and $P_d$ at a fixed $P_{fa}$. Report multiple metrics: no single number tells the full story.