SURE-Based Training
Estimating MSE Without Ground Truth
Both DIP and Noise2Noise address the absence of clean targets, but from different angles: DIP avoids offline training altogether (it fits a network to a single image); Noise2Noise requires paired noisy measurements. Stein's Unbiased Risk Estimate (SURE) offers a third path: it provides an unbiased estimate of the MSE using only the noisy observation and the denoiser's divergence.
The price of not having clean targets is a single extra term --- the divergence --- which can be computed efficiently via a single vector-Jacobian product.
Definition: Stein's Unbiased Risk Estimate (SURE)
SURE provides an unbiased estimate of the MSE of a denoiser without access to the clean signal:

$$\mathrm{SURE}(f_\theta, y) = \|y - f_\theta(y)\|^2 - n\sigma^2 + 2\sigma^2\, \nabla_y \cdot f_\theta(y),$$

where $\nabla_y \cdot f_\theta(y) = \sum_{i=1}^n \partial [f_\theta(y)]_i / \partial y_i$ is the divergence of the denoiser, and $\sigma^2$ is the noise variance.

Unbiasedness property: $\mathbb{E}\big[\mathrm{SURE}(f_\theta, y)\big] = \mathbb{E}\big[\|f_\theta(y) - x\|^2\big]$ for $y = x + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.
SURE converts the unsupervised denoising problem into a supervised one: the SURE loss can be minimised via gradient descent, and in expectation its minimiser is the MMSE denoiser. The divergence term measures how much the denoiser "spreads" its output --- it is the price of not having clean targets.
Theorem: SURE Is an Unbiased Estimate of MSE
For $y = x + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ and a weakly differentiable denoiser $f : \mathbb{R}^n \to \mathbb{R}^n$:

$$\mathbb{E}\big[\|f(y) - x\|^2\big] = \mathbb{E}\big[\|y - f(y)\|^2\big] - n\sigma^2 + 2\sigma^2\,\mathbb{E}\big[\nabla_y \cdot f(y)\big].$$
The SURE identity is a consequence of Stein's lemma: $\mathbb{E}\big[\varepsilon^\top g(y)\big] = \sigma^2\,\mathbb{E}\big[\nabla_y \cdot g(y)\big]$ for weakly differentiable $g$. This connects the cross-term (which depends on the unknown $x$) to the divergence (which depends only on the observable $y$).
Expand the MSE
$$\|f(y) - x\|^2 = \|(f(y) - y) + \varepsilon\|^2 = \|y - f(y)\|^2 + 2\,(f(y) - y)^\top \varepsilon + \|\varepsilon\|^2, \qquad \varepsilon = y - x,$$

and taking expectations gives $\mathbb{E}\|\varepsilon\|^2 = n\sigma^2$ for the last term.
Apply Stein's lemma to the cross-term
The cross-term is $2\,\mathbb{E}\big[(f(y) - y)^\top \varepsilon\big] = 2\,\mathbb{E}\big[f(y)^\top \varepsilon\big] - 2\,\mathbb{E}\big[y^\top \varepsilon\big]$.
By Stein's lemma, $\mathbb{E}\big[f(y)^\top \varepsilon\big] = \sigma^2\,\mathbb{E}\big[\nabla_y \cdot f(y)\big]$ for any weakly differentiable $f$.
Since $\mathbb{E}\big[y^\top \varepsilon\big] = \mathbb{E}\big[(x + \varepsilon)^\top \varepsilon\big] = n\sigma^2$, the cross-term equals $2\sigma^2\,\mathbb{E}\big[\nabla_y \cdot f(y)\big] - 2n\sigma^2$.
Combine terms
$$\mathbb{E}\|f(y) - x\|^2 = \mathbb{E}\|y - f(y)\|^2 + 2\sigma^2\,\mathbb{E}\big[\nabla_y \cdot f(y)\big] - 2n\sigma^2 + n\sigma^2 = \mathbb{E}\|y - f(y)\|^2 - n\sigma^2 + 2\sigma^2\,\mathbb{E}\big[\nabla_y \cdot f(y)\big].$$
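The identity can be checked numerically. Below is a minimal NumPy sketch, assuming a soft-thresholding denoiser whose divergence is known in closed form (the number of surviving coefficients; see the Quick Check below) and an arbitrary sparse test signal: averaging SURE over many noise draws should match the averaged true MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, lam = 256, 0.5, 0.4                        # illustrative sizes
x = rng.standard_normal(n) * (rng.random(n) < 0.2)   # sparse "clean" signal

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

mse_vals, sure_vals = [], []
for _ in range(10_000):
    y = x + sigma * rng.standard_normal(n)
    f = soft_threshold(y, lam)
    div = np.count_nonzero(f)  # closed-form divergence of soft thresholding
    mse_vals.append(np.sum((f - x) ** 2))
    sure_vals.append(np.sum((y - f) ** 2) - n * sigma**2 + 2 * sigma**2 * div)

print(f"mean MSE  = {np.mean(mse_vals):.2f}")
print(f"mean SURE = {np.mean(sure_vals):.2f}")  # should closely match the mean MSE
```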
Definition: Monte Carlo Divergence Estimation
Computing $\nabla_y \cdot f_\theta(y) = \sum_{i=1}^n \partial [f_\theta(y)]_i / \partial y_i$ exactly requires $n$ backpropagation passes (one per pixel), which is prohibitively expensive. The Monte Carlo estimator uses a single random probe vector:

$$\nabla_y \cdot f_\theta(y) \approx b^\top J_{f_\theta}(y)\, b,$$

where $b \sim \mathcal{N}(0, I_n)$ and $b^\top J_{f_\theta}(y)$ is computed via a single vector-Jacobian product (one backward pass).

Unbiasedness: $\mathbb{E}_b\big[b^\top J_{f_\theta}(y)\, b\big] = \mathrm{tr}\big(J_{f_\theta}(y)\big) = \nabla_y \cdot f_\theta(y)$, since $\mathbb{E}\big[b b^\top\big] = I_n$.
The MC divergence adds exactly one extra backward pass per training sample. The variance can be reduced by averaging over multiple probe vectors, but in practice a single probe suffices.
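A minimal PyTorch sketch of this estimator; `f` stands for any denoiser callable, and the helper name `mc_divergence` is illustrative. The `create_graph=True` flag keeps the estimate differentiable with respect to the network parameters, so it can be used inside a training loss.

```python
import torch

def mc_divergence(f, y):
    """Single-probe Monte Carlo estimate of div_y f(y) = tr(J_f(y)).

    Costs one extra backward pass: autograd computes the vector-Jacobian
    product b^T J_f(y); dotting it with b gives b^T J_f(y) b.
    """
    y = y.detach().requires_grad_(True)
    b = torch.randn_like(y)                # Gaussian probe, E[b b^T] = I
    out = f(y)
    (vjp,) = torch.autograd.grad(out, y, grad_outputs=b, create_graph=True)
    return (vjp * b).sum()

# usage: div_estimate = mc_divergence(model, noisy_batch)
```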
SURE vs. True MSE During Training
Compare the SURE loss and the true MSE (computed with ground truth) during training. For a linear denoiser $f(y) = \alpha y$, the divergence is available in closed form, so SURE has no Monte Carlo estimation variance. For nonlinear denoisers (soft thresholding, neural network), SURE tracks the true MSE with increasing variance.
Observe that the SURE minimum coincides with the MSE minimum, confirming unbiasedness. The divergence term increases with denoiser complexity (neural net > soft threshold > linear).
Example: SURE for Linear Denoisers
Compute SURE in closed form for the linear denoiser $f_\alpha(y) = \alpha y$ and find the optimal shrinkage factor $\alpha^\star$.
Divergence of linear map
$J_{f_\alpha}(y) = \alpha I_n$, so $\nabla_y \cdot f_\alpha(y) = n\alpha$ and the divergence term is $2\sigma^2 n\alpha$.
SURE expression
$$\mathrm{SURE}(\alpha) = \|y - \alpha y\|^2 - n\sigma^2 + 2\sigma^2 n\alpha = (1 - \alpha)^2 \|y\|^2 - n\sigma^2 + 2n\sigma^2 \alpha.$$
Minimise over $\alpha$
Setting $\frac{d}{d\alpha}\mathrm{SURE}(\alpha) = -2(1 - \alpha)\|y\|^2 + 2n\sigma^2 = 0$:

$$\alpha^\star = 1 - \frac{n\sigma^2}{\|y\|^2}.$$
This is the James-Stein shrinkage estimator. For $\|y\|^2 \gg n\sigma^2$ (high SNR), $\alpha^\star \to 1$ (no shrinkage). For low SNR, $\alpha^\star \to 0$ (heavy shrinkage toward zero).
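The closed form can be sanity-checked against a brute-force grid search over the SURE expression; the sizes and seed below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 512, 1.0
x = 2.0 * rng.standard_normal(n)
y = x + sigma * rng.standard_normal(n)

def sure_linear(alpha):
    # (1 - a)^2 ||y||^2 - n s^2 + 2 n s^2 a, using div(a y) = n a
    return (1 - alpha) ** 2 * (y @ y) - n * sigma**2 + 2 * n * sigma**2 * alpha

alphas = np.linspace(0.0, 1.0, 100_001)
alpha_grid = alphas[np.argmin(sure_linear(alphas))]
alpha_closed = 1.0 - n * sigma**2 / (y @ y)
print(alpha_grid, alpha_closed)  # should agree to grid resolution
```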
Theorem: Generalised SURE for Inverse Problems (GSURE)
For the inverse problem $y = Ax + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_m)$, the Generalised SURE for a reconstruction network $f_\theta$, viewed as a function of the sufficient statistic $u = A^\top y$, is:

$$\mathrm{GSURE}(f_\theta, y) = \|P f_\theta(u)\|^2 - 2\, f_\theta(u)^\top A^\dagger y + 2\sigma^2\, \nabla_u \cdot f_\theta(u) + \mathrm{const},$$

where $P = A^\top (A A^\top)^{-1} A = A^\dagger A$ is the orthogonal projector onto the row space of $A$, $A^\dagger$ is the pseudoinverse, and the constant $\|Px\|^2$ does not depend on $\theta$.

GSURE is unbiased for the projected MSE $\mathbb{E}\big[\|P(f_\theta(u) - x)\|^2\big]$, not the full reconstruction MSE $\mathbb{E}\big[\|f_\theta(u) - x\|^2\big]$.
GSURE constrains only the component of the reconstruction in the range of --- it says nothing about the null-space component. An additional regulariser (TV, DIP, equivariance) is needed for the null space.
Apply SURE to the projected estimate
Define the least-squares estimate $\hat{x}_{\mathrm{LS}} = A^\dagger y = Px + A^\dagger \varepsilon$ and expand the projected error:

$$\|P f_\theta(u) - Px\|^2 = \|P f_\theta(u)\|^2 - 2\, f_\theta(u)^\top P x + \|Px\|^2.$$

A Stein-type identity in the sufficient statistic $u = A^\top y$ replaces the unknown cross-term: $\mathbb{E}\big[f_\theta(u)^\top P x\big] = \mathbb{E}\big[f_\theta(u)^\top A^\dagger y\big] - \sigma^2\,\mathbb{E}\big[\nabla_u \cdot f_\theta(u)\big]$.
Compute the divergence
$\nabla_u \cdot f_\theta(u) = \mathrm{tr}\big(J_{f_\theta}(u)\big)$, the trace of the Jacobian of the reconstruction with respect to $u = A^\top y$.
MC estimate: $\nabla_u \cdot f_\theta(u) \approx b^\top J_{f_\theta}(u)\, b$ with $b \sim \mathcal{N}(0, I_n)$, computed via a single vector-Jacobian product.
Null-space limitation
GSURE is blind to the null space of $A$. For underdetermined systems ($m < n$), GSURE alone cannot distinguish between reconstructions that differ only in the null space. Regularisation (e.g., TV, equivariance) is needed.
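A sketch of the GSURE objective in the form stated above, following the common practical implementation (divergence taken with respect to $u$, the $\theta$-independent constant dropped); `gsure_loss` and `model` are illustrative names, and the explicit pseudoinverse is only sensible for small operators.

```python
import torch

def gsure_loss(model, A, y, sigma):
    """Sketch of the GSURE objective for y = A x + noise.

    model : network taking the sufficient statistic u = A^T y as input
    A     : (m, n) forward operator (small enough to pseudo-invert here)
    y     : (m,) measurement vector
    """
    A_pinv = torch.linalg.pinv(A)                  # A^dagger; precompute in practice
    P = A_pinv @ A                                 # projector onto the row space of A
    u = (A.T @ y).detach().requires_grad_(True)
    x_hat = model(u)
    b = torch.randn_like(u)                        # MC probe for div_u
    (vjp,) = torch.autograd.grad(x_hat, u, grad_outputs=b, create_graph=True)
    div_u = (vjp * b).sum()
    return (P @ x_hat).pow(2).sum() - 2.0 * x_hat @ (A_pinv @ y) + 2.0 * sigma**2 * div_u
```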
SURE-Based Denoiser Training
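A minimal training-step sketch, assuming Gaussian noise with known `sigma` and the single-probe divergence estimator from above; the function name and batch layout are illustrative.

```python
import torch

def sure_training_step(model, optimizer, y, sigma):
    """One SURE training step on a batch of noisy images y, shape (B, C, H, W).

    No clean targets: the loss is the per-sample SURE, averaged over the batch.
    """
    y = y.detach().requires_grad_(True)
    out = model(y)
    b = torch.randn_like(y)
    (vjp,) = torch.autograd.grad(out, y, grad_outputs=b, create_graph=True)
    div = (vjp * b).flatten(1).sum(dim=1)                # per-sample b^T J b
    n = y[0].numel()
    residual = (y - out).flatten(1).pow(2).sum(dim=1)    # per-sample ||y - f(y)||^2
    loss = (residual - n * sigma**2 + 2.0 * sigma**2 * div).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```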
Complexity: each training step costs roughly twice a standard supervised step (one extra backward pass for the divergence). SURE training requires the noise variance $\sigma^2$ to be known. If unknown, $\sigma^2$ can be estimated from the measurements (e.g., from the median absolute deviation of wavelet coefficients).
Common Mistake: SURE Requires Gaussian Noise with Known Variance
Mistake:
Applying SURE-based training to RF imaging data with non-Gaussian noise (e.g., speckle, Poisson photon noise) or unknown noise level.
Correction:
Standard SURE assumes $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ with known $\sigma^2$. Violations cause biased risk estimates:
- Non-Gaussian noise: Use Poisson-SURE or exponential-family SURE extensions (Eldar, 2008).
- Unknown $\sigma^2$: estimate it from data using robust methods (MAD of wavelet coefficients, or from measurement residuals); see the sketch after this box.
- Correlated noise: use the generalised form with known covariance $\Sigma$, where the divergence term becomes $2\,\mathrm{tr}\big(\Sigma\, J_f(y)\big)$.
For RF imaging, thermal noise is Gaussian but the effective noise after beamforming/matched filtering may be coloured.
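A minimal sketch of the MAD-based noise estimate mentioned above, using finest-scale Haar detail coefficients of a 1-D signal; for images, the diagonal detail subband of a 2-D wavelet transform is the usual choice.

```python
import numpy as np

def estimate_sigma_mad(y):
    """Donoho-style noise level estimate from finest-scale Haar detail coefficients."""
    y = y.ravel()[: 2 * (y.size // 2)]           # even length for pairing
    detail = (y[1::2] - y[0::2]) / np.sqrt(2.0)  # finest-scale Haar wavelet details
    return np.median(np.abs(detail)) / 0.6745    # MAD -> sigma under Gaussian noise

# sanity check on pure noise
rng = np.random.default_rng(0)
print(estimate_sigma_mad(rng.normal(0.0, 0.3, size=10_000)))  # ~0.3
```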
Computational Cost of MC Divergence
The MC divergence estimator requires one vector-Jacobian product, which in PyTorch/JAX costs the same as one backward pass. This doubles the per-step training cost compared to supervised training.
Practical tips:
- Use a single probe vector per sample (variance is acceptable for SGD).
- For large images, compute the divergence on random patches rather than the full image.
- If using GSURE for inverse problems, the extra cost is still one vector-Jacobian product per sample, plus the applications of $A^\top$ and $A^\dagger$.
Quick Check
For a soft-thresholding denoiser $f(y)_i = \mathrm{sign}(y_i)\,\max(|y_i| - \lambda,\, 0)$, what is $\nabla_y \cdot f(y)$?
$n$ (the dimension)
$\|f(y)\|_0$ (the number of non-zero components)
Undefined (soft thresholding is not differentiable)
Correct. For $|y_i| > \lambda$: $\partial [f(y)]_i / \partial y_i = 1$. For $|y_i| < \lambda$: $\partial [f(y)]_i / \partial y_i = 0$. So $\nabla_y \cdot f(y) = \|f(y)\|_0$, the number of non-zero components. (Soft thresholding is weakly differentiable, which is all SURE requires.)
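The answer can be verified with autograd on a small example, where the full Jacobian is affordable; the threshold and size below are arbitrary.

```python
import torch

lam = 0.5
y = torch.randn(16)
soft = lambda v: torch.sign(v) * torch.clamp(v.abs() - lam, min=0.0)

J = torch.autograd.functional.jacobian(soft, y)  # full 16 x 16 Jacobian
print(torch.trace(J).item())                     # divergence of the denoiser
print((soft(y).abs() > 0).sum().item())          # number of non-zero outputs
```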
Stein's Unbiased Risk Estimate (SURE)
A formula that provides an unbiased estimate of the MSE of a denoiser without access to the clean signal, using the denoiser's divergence as a correction term. Requires Gaussian noise with known variance.
Related: Generalised SURE (GSURE)
An extension of SURE to inverse problems that estimates the projected MSE without clean ground truth, applicable to underdetermined systems.
Key Takeaway
SURE estimates MSE without clean targets by adding a divergence correction to the residual. The divergence is computed efficiently via Monte Carlo estimation (one extra backward pass). SURE-trained denoisers match the quality of supervised training for Gaussian noise. GSURE extends this to inverse problems but is blind to the null space --- an additional regulariser is needed. The main limitation is the requirement for Gaussian noise with known $\sigma^2$.