Evaluation Metrics

What Gets Measured Gets Optimised

The choice of evaluation metric shapes the conclusions of every RF imaging paper. PSNR, SSIM, and LPIPS can disagree on which reconstruction is "better." Shape metrics (Chamfer, IoU) matter for 3D reconstruction. Detection metrics ($P_D$, ROC) matter for radar applications. Computational metrics (FLOPs, time) determine practical deployability. This section defines each metric precisely and shows when they agree and disagree.

Definition:

Peak Signal-to-Noise Ratio (PSNR)

PSNR measures the reconstruction quality relative to the dynamic range of the image:

$$\mathrm{PSNR} = 10\log_{10}\!\left( \frac{\max_q |\mathbf{c}_{q}|^2}{\frac{1}{Q}\sum_{q=1}^{Q}|\hat{\mathbf{c}}_q - \mathbf{c}_{q}|^2} \right) = 10\log_{10}\!\left(\frac{\mathbf{c}_{\max}^{2}}{\mathrm{MSE}}\right)$$

where $\mathbf{c}_{\max}$ is the maximum scene value and $\mathrm{MSE}$ is the mean squared error over $Q$ voxels.

Typical values: $> 30$ dB (good), $> 40$ dB (excellent), $< 20$ dB (poor).

PSNR is the most widely used metric but has limitations: it does not capture perceptual quality, structural preservation, or the distribution of errors across the image. PSNR favours smooth reconstructions that minimise total error energy.
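A minimal numpy sketch of the definition above (the function name and the convention of taking the peak over the ground-truth scene are our choices; some libraries use the peak of the data type instead):

```python
import numpy as np

def psnr_db(c_hat: np.ndarray, c: np.ndarray) -> float:
    """PSNR in dB. `c` is the ground-truth scene, `c_hat` the reconstruction;
    both may be complex, matching the |.|^2 terms in the definition."""
    mse = np.mean(np.abs(c_hat - c) ** 2)   # MSE over all Q voxels
    peak = np.max(np.abs(c)) ** 2           # maximum scene power
    return 10.0 * np.log10(peak / mse)
```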

Definition:

Structural Similarity Index (SSIM)

SSIM compares local image patches in terms of luminance, contrast, and structure:

$$\mathrm{SSIM}(\mathbf{x}, \hat{\mathbf{x}}) = \frac{(2\mu_x\mu_{\hat{x}} + c_1)(2\sigma_{x\hat{x}} + c_2)}{(\mu_x^2 + \mu_{\hat{x}}^2 + c_1)(\sigma_x^2 + \sigma_{\hat{x}}^2 + c_2)}$$

where $\mu$, $\sigma^2$, and $\sigma_{x\hat{x}}$ are the local mean, variance, and cross-covariance computed over a sliding window, and $c_1, c_2$ are stabilisation constants.

SSIM $\in [0, 1]$, where $1$ indicates perfect reconstruction. The overall SSIM is the average over all windows (MSSIM).

SSIM is more perceptually meaningful than PSNR because it captures structural information. For RF imaging, SSIM correlates better with target detection performance than PSNR.
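In practice SSIM is rarely implemented by hand; a sketch using scikit-image's implementation on a magnitude image (the test images here are synthetic placeholders):

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
truth = rng.random((128, 128))                       # placeholder magnitude image
recon = truth + 0.05 * rng.standard_normal(truth.shape)

# data_range must cover the dynamic range of the images; the result is the
# mean SSIM (MSSIM) over all sliding windows.
mssim = structural_similarity(truth, recon,
                              data_range=float(truth.max() - truth.min()))
print(f"MSSIM = {mssim:.3f}")
```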

Definition:

Learned Perceptual Image Patch Similarity (LPIPS)

LPIPS measures perceptual distance using features from a pretrained deep network (typically VGG or AlexNet):

$$\mathrm{LPIPS}(\mathbf{x}, \hat{\mathbf{x}}) = \sum_l w_l \cdot \|f_l(\mathbf{x}) - f_l(\hat{\mathbf{x}})\|_2^2$$

where $f_l$ extracts features at layer $l$ and $w_l$ are learned weights. Lower LPIPS means higher perceptual similarity.

LPIPS captures high-level structural features that PSNR and SSIM miss. However, it was trained on natural images, so its relevance to RF reflectivity maps (which look different from photographs) is an open question.
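A usage sketch with the reference `lpips` package (assumption: single-channel RF magnitude maps are replicated to three channels and rescaled to $[-1, 1]$, which is a workaround rather than a validated practice):

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')   # AlexNet backbone; 'vgg' also available

# Placeholder single-channel images in [-1, 1], shape (N, 1, H, W).
x = torch.rand(1, 1, 128, 128) * 2.0 - 1.0
x_hat = (x + 0.1 * torch.randn_like(x)).clamp(-1.0, 1.0)

# LPIPS expects 3-channel inputs, so replicate the channel.
d = loss_fn(x.repeat(1, 3, 1, 1), x_hat.repeat(1, 3, 1, 1))
print(f"LPIPS = {d.item():.4f}")    # lower = more perceptually similar
```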

Definition:

Normalised Mean Squared Error (NMSE)

NMSE measures the relative reconstruction error:

$$\mathrm{NMSE} = \frac{\|\hat{\mathbf{c}} - \mathbf{c}\|_2^2}{\|\mathbf{c}\|_2^2}$$

NMSE $= 0$ is perfect reconstruction; NMSE $= 1$ means the reconstruction error equals the signal energy.

In dB: $\mathrm{NMSE}_{\mathrm{dB}} = 10\log_{10}(\mathrm{NMSE})$. Typical targets: $< -20$ dB (good), $< -30$ dB (excellent).

NMSE is preferred over PSNR when comparing across different scenes because it is normalised by the signal energy. PSNR depends on the dynamic range of each specific image.
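The corresponding one-liner, under the same naming conventions as the PSNR sketch above:

```python
import numpy as np

def nmse_db(c_hat: np.ndarray, c: np.ndarray) -> float:
    """NMSE in dB; 0 dB means the error energy equals the signal energy."""
    return 10.0 * np.log10(np.sum(np.abs(c_hat - c) ** 2) /
                           np.sum(np.abs(c) ** 2))
```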

Definition:

Chamfer and Hausdorff Distances

For 3D reconstructions represented as point sets P\mathcal{P} (reconstruction) and Q\mathcal{Q} (ground truth):

Chamfer distance: $$d_C(\mathcal{P}, \mathcal{Q}) = \frac{1}{|\mathcal{P}|}\sum_{\mathbf{p} \in \mathcal{P}} \min_{\mathbf{q} \in \mathcal{Q}} \|\mathbf{p} - \mathbf{q}\|^2 + \frac{1}{|\mathcal{Q}|}\sum_{\mathbf{q} \in \mathcal{Q}} \min_{\mathbf{p} \in \mathcal{P}} \|\mathbf{q} - \mathbf{p}\|^2$$

Hausdorff distance: $$d_H(\mathcal{P}, \mathcal{Q}) = \max\!\left( \max_{\mathbf{p} \in \mathcal{P}} \min_{\mathbf{q} \in \mathcal{Q}} \|\mathbf{p} - \mathbf{q}\|,\; \max_{\mathbf{q} \in \mathcal{Q}} \min_{\mathbf{p} \in \mathcal{P}} \|\mathbf{q} - \mathbf{p}\| \right)$$

Chamfer is the average nearest-neighbour distance (sensitive to overall shape). Hausdorff is the worst-case distance (sensitive to outliers).
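Both distances reduce to nearest-neighbour queries, so a KD-tree implementation is straightforward. A sketch with scipy (the function name is ours; `scipy.spatial.distance.directed_hausdorff` is an alternative for the Hausdorff terms):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_hausdorff(P: np.ndarray, Q: np.ndarray) -> tuple[float, float]:
    """Chamfer (symmetric mean of squared NN distances) and Hausdorff
    (worst-case NN distance) between point sets of shape (N, 3)."""
    d_pq, _ = cKDTree(Q).query(P)   # distance from each p to its nearest q
    d_qp, _ = cKDTree(P).query(Q)   # distance from each q to its nearest p
    chamfer = float(np.mean(d_pq ** 2) + np.mean(d_qp ** 2))
    hausdorff = float(max(d_pq.max(), d_qp.max()))
    return chamfer, hausdorff
```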

Definition:

IoU and F-Score for Volumetric Evaluation

For volumetric (voxel-based) reconstructions:

Intersection over Union (IoU): $$\mathrm{IoU} = \frac{|\mathcal{V}_{\mathrm{pred}} \cap \mathcal{V}_{\mathrm{true}}|}{|\mathcal{V}_{\mathrm{pred}} \cup \mathcal{V}_{\mathrm{true}}|}$$ where $\mathcal{V}$ is the set of occupied voxels after thresholding. IoU $\in [0, 1]$; higher is better.

F-score at threshold $\tau$: $$F_\tau = \frac{2 \cdot \mathrm{Precision}_\tau \cdot \mathrm{Recall}_\tau}{\mathrm{Precision}_\tau + \mathrm{Recall}_\tau}$$ where precision and recall are computed by counting predicted points within distance $\tau$ of a ground-truth point and vice versa. The F-score at multiple thresholds (e.g., $\tau \in \{\lambda/4, \lambda/2, \lambda\}$) characterises both coarse and fine shape recovery.
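A sketch of both, assuming a voxel grid for IoU and point sets for the F-score (the occupancy threshold and function names are our choices):

```python
import numpy as np
from scipy.spatial import cKDTree

def voxel_iou(pred: np.ndarray, true: np.ndarray, thresh: float = 0.5) -> float:
    """IoU between two reflectivity volumes after occupancy thresholding."""
    p, t = pred > thresh, true > thresh
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum() / union) if union else 1.0

def f_score(P: np.ndarray, Q: np.ndarray, tau: float) -> float:
    """F-score at distance tau: P = predicted points, Q = ground truth."""
    precision = float((cKDTree(Q).query(P)[0] < tau).mean())
    recall = float((cKDTree(P).query(Q)[0] < tau).mean())
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```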

Definition:

Detection Metrics

For radar detection applications:

Probability of detection $P_D$: the fraction of true targets that are correctly detected (above threshold).

Probability of false alarm $P_{\mathrm{FA}}$: the fraction of non-target locations that exceed the detection threshold.

ROC curve: plots $P_D$ vs. $P_{\mathrm{FA}}$ as the threshold varies. The area under the ROC curve (AUC) summarises detection performance in a single number ($\mathrm{AUC} \in [0.5, 1]$).

Operating point: typically report $P_D$ at a fixed $P_{\mathrm{FA}} = 10^{-4}$ or $10^{-6}$, reflecting the practical requirement of low false alarm rates.
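Estimating $P_D$ at a fixed $P_{\mathrm{FA}}$ amounts to setting the threshold at the appropriate quantile of the noise-only detector outputs. A numpy sketch (the Gaussian scores are placeholders; note that estimating $P_{\mathrm{FA}} = 10^{-4}$ reliably requires far more than $10^4$ background samples):

```python
import numpy as np

def pd_at_pfa(target_scores: np.ndarray, background_scores: np.ndarray,
              pfa: float = 1e-4) -> float:
    """P_D at fixed P_FA: threshold at the (1 - pfa) quantile of the
    background scores, then count targets exceeding it."""
    thresh = np.quantile(background_scores, 1.0 - pfa)
    return float(np.mean(target_scores > thresh))

rng = np.random.default_rng(0)
background = rng.standard_normal(1_000_000)   # noise-only detector outputs
targets = 4.0 + rng.standard_normal(1_000)    # target-present outputs
print(f"P_D at P_FA=1e-4: {pd_at_pfa(targets, background):.3f}")
```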

PSNR vs SSIM vs $P_D$: When Metrics Disagree

Apply different types of degradation to a 1D scene with 4 targets on a smooth background and observe how PSNR, SSIM, and detection probability $P_D$ respond differently (a numerical sketch follows the list):

  • Noise degrades PSNR but may preserve target peaks ($P_D$ stays high).
  • Blur preserves PSNR (low total error) but destroys structure (SSIM drops, $P_D$ drops).
  • Missing data affects all metrics.
  • Structured artifacts can fool PSNR while SSIM and $P_D$ degrade.

Diamond markers show detected (green) vs. missed (red) targets.
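A minimal numerical version of this demonstration (the scene, degradation levels, and the crude fixed-threshold detector are illustrative choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
targets = [40, 100, 160, 220]
scene = 0.1 * np.ones(256)     # smooth background
scene[targets] = 1.0           # 4 point targets

def report(name: str, recon: np.ndarray) -> None:
    mse = np.mean((recon - scene) ** 2)
    psnr = 10 * np.log10(scene.max() ** 2 / mse)
    ssim = structural_similarity(scene, recon, data_range=1.0)
    p_d = np.mean(recon[targets] > 0.5)   # crude fixed-threshold detector
    print(f"{name:8s}  PSNR={psnr:5.1f} dB  SSIM={ssim:.2f}  P_D={p_d:.2f}")

# Noise and blur can yield similar PSNR, yet blur wipes out the target peaks.
report("noisy", scene + 0.1 * rng.standard_normal(scene.size))
report("blurred", gaussian_filter1d(scene, sigma=4))
```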

ROC Curves for Different Reconstruction Methods

Compare ROC curves for matched filter, LASSO, and a deep network at different SNR levels. At low SNR, the gap between methods widens. The deep network achieves higher $P_D$ at the same $P_{\mathrm{FA}}$ because it has learned to separate targets from noise.

Definition:

Computational Metrics

Beyond reconstruction quality, practical deployment requires:

  • FLOPs: floating-point operations per reconstruction. Matched filter: $O(MQ)$. LASSO (ISTA, $T$ iterations): $O(T \cdot MQ)$. Deep network: depends on architecture.

  • Inference time: wall-clock time for a single reconstruction on specified hardware (GPU model, batch size).

  • Memory: peak GPU memory during reconstruction.

  • Training time: total GPU-hours for learned methods (amortised over all future reconstructions).

  • Number of parameters: for neural networks, total trainable parameters.

Report computational metrics alongside quality metrics. A method that achieves 0.5 dB higher PSNR but takes $100\times$ longer may not be preferable.
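A sketch covering the last three bullets using PyTorch (names are ours; FLOP counting typically needs a profiler such as torch.profiler or a third-party counter, so it is omitted here):

```python
import time
import torch

def profile(model: torch.nn.Module, x: torch.Tensor,
            n_warmup: int = 5, n_runs: int = 50) -> None:
    """Report trainable parameter count and mean inference time."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):        # warm-up runs (JIT, caches)
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()     # GPU timing needs a sync point
        t0 = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        dt = (time.perf_counter() - t0) / n_runs
    print(f"{n_params/1e6:.2f} M params, {dt*1e3:.2f} ms per reconstruction")
```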

Evaluation Metrics Overview

| Metric | Type | Range | Best For | Limitation |
| --- | --- | --- | --- | --- |
| PSNR (dB) | Image quality | $[0, \infty)$ | Overall error | Ignores structure |
| SSIM | Image quality | $[0, 1]$ | Structural fidelity | Local windows only |
| LPIPS | Perceptual | $[0, \infty)$ | Perceptual quality | Trained on natural images |
| NMSE (dB) | Image quality | $(-\infty, 0]$ | Cross-scene comparison | Same limitations as PSNR |
| Chamfer | Shape (3D) | $[0, \infty)$ | Average shape error | Ignores outliers |
| Hausdorff | Shape (3D) | $[0, \infty)$ | Worst-case error | Sensitive to outliers |
| IoU | Volume | $[0, 1]$ | Occupancy overlap | Threshold-dependent |
| $P_D$ at $P_{\mathrm{FA}}$ | Detection | $[0, 1]$ | Target detection | Binary (no localisation) |
| AUC-ROC | Detection | $[0.5, 1]$ | Overall detectability | Averages over all thresholds |
| FLOPs | Compute | $[0, \infty)$ | Computational cost | Hardware-dependent |

Example: Metric Disagreement in Practice

Two reconstruction algorithms produce images with PSNR 28 dB and 25 dB respectively. However, the second algorithm has a higher target detection probability ($P_D = 0.95$ vs. $0.87$ at $P_{\mathrm{FA}} = 10^{-4}$). Explain this discrepancy.

⚠️Engineering Note

Report Multiple Metrics β€” Always

No single metric captures all aspects of reconstruction quality. Every RF imaging paper should report at minimum:

  1. One image quality metric (PSNR or NMSE).
  2. One structural metric (SSIM).
  3. One task-specific metric ($P_D$, Chamfer, or IoU depending on the application).
  4. Computational cost (inference time and memory).

Papers reporting only PSNR are incomplete. Reviewers should request additional metrics when they are missing.

Common Mistake: Reporting Only PSNR

Mistake:

Comparing algorithms using only PSNR and concluding that the one with higher PSNR is "better."

Correction:

PSNR measures total squared error, which favours smooth reconstructions. A method with 3 dB lower PSNR may preserve edges and targets better (higher SSIM, higher $P_D$). Always report SSIM and a task-specific metric alongside PSNR.

Common Mistake: Blindly Using LPIPS for RF Images

Mistake:

Using LPIPS (trained on natural images) as the primary metric for RF reflectivity maps without questioning its relevance.

Correction:

LPIPS was trained on natural photographs and may not capture the perceptual features relevant to RF images (which are sparse, have different statistics, and are interpreted differently). Use LPIPS as a supplementary metric, not as the primary one. For RF imaging, SSIM and detection metrics are more meaningful.

Historical Note: SSIM: Beyond Mean Squared Error

2004

SSIM was introduced by Wang, Bovik, Sheikh, and Simoncelli in 2004, motivated by the observation that MSE (and hence PSNR) correlates poorly with human perception of image quality. The paper has been cited over 50,000 times and SSIM has become a standard metric across all imaging fields. The key insight: the human visual system is adapted to structural information, not pixel-wise error.

PSNR (Peak Signal-to-Noise Ratio)

A logarithmic measure of reconstruction quality defined as $10\log_{10}(\mathbf{c}_{\max}^{2} / \mathrm{MSE})$. Higher is better. Standard image quality metric but insensitive to structural degradation.

ROC Curve

Receiver Operating Characteristic: a plot of $P_D$ vs. $P_{\mathrm{FA}}$ as the detection threshold varies. Summarises the trade-off between detecting true targets and generating false alarms.

Quick Check

Algorithm A achieves PSNR = 30 dB, SSIM = 0.75. Algorithm B achieves PSNR = 27 dB, SSIM = 0.92. For a target detection task, which is likely better?

A, because it has higher PSNR

B, because higher SSIM indicates better structural preservation

Cannot determine without detection metrics

They are equivalent

Key Takeaway

PSNR measures total error energy; widely used but insensitive to structure. SSIM captures structural similarity and correlates better with detection. LPIPS captures perceptual quality but was trained on natural images. For 3D reconstruction, use Chamfer/Hausdorff distance and IoU. For detection, use ROC curves and $P_D$ at fixed $P_{\mathrm{FA}}$. Report multiple metrics: no single number tells the full story.