SURE-Based Training

Estimating MSE Without Ground Truth

Both DIP and Noise2Noise address the absence of clean targets, but from different angles: DIP avoids training altogether; Noise2Noise requires paired measurements. Stein's Unbiased Risk Estimate (SURE) offers a third path: it provides an unbiased estimate of the MSE $\mathbb{E}[\|f_\theta(\mathbf{y}) - \mathbf{x}\|^2]$ using only the noisy observation $\mathbf{y}$ and the denoiser's divergence.

The price of not having clean targets is a single extra term, the divergence $\operatorname{div}(f_\theta)$, which can be computed efficiently via a single vector-Jacobian product.

Definition: Stein's Unbiased Risk Estimate (SURE)

SURE provides an unbiased estimate of the MSE of a denoiser $f_\theta(\mathbf{y})$ without access to the clean signal:

$$\text{SURE}(f_\theta) = \frac{1}{N}\|\mathbf{y} - f_\theta(\mathbf{y})\|^2 - \sigma^2 + \frac{2\sigma^2}{N}\operatorname{div}(f_\theta)(\mathbf{y})$$

where $\operatorname{div}(f_\theta) = \sum_{i=1}^N \frac{\partial [f_\theta(\mathbf{y})]_i}{\partial y_i}$ is the divergence of the denoiser and $\sigma^2$ is the noise variance.

Unbiasedness property: $\mathbb{E}[\text{SURE}(f_\theta)] = \frac{1}{N}\mathbb{E}[\|f_\theta(\mathbf{y}) - \mathbf{x}\|^2]$ for $\mathbf{y} = \mathbf{x} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$.

SURE turns the unsupervised denoising problem into one that can be trained like a supervised one: the SURE loss can be minimised via gradient descent, and in expectation its minimiser is the MMSE denoiser. The divergence term measures how strongly the denoiser's output follows fluctuations in its input; it is the price of not having clean targets.


Theorem: SURE Is an Unbiased Estimate of MSE

For $\mathbf{y} = \mathbf{x} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ and a weakly differentiable denoiser $f_\theta$:

$$\mathbb{E}_\mathbf{w}\bigl[\text{SURE}(f_\theta)\bigr] = \frac{1}{N}\mathbb{E}_\mathbf{w}\bigl[\|f_\theta(\mathbf{y}) - \mathbf{x}\|^2\bigr].$$

The SURE identity is a consequence of Stein's lemma: $\mathbb{E}[w_i \cdot g(\mathbf{w})] = \sigma^2\,\mathbb{E}[\partial g / \partial w_i]$ for $w_i \sim \mathcal{N}(0, \sigma^2)$. This connects the cross-term $\langle f_\theta(\mathbf{y}) - \mathbf{y}, \mathbf{x} - \mathbf{y}\rangle$ (which depends on the unknown $\mathbf{x}$) to the divergence (which depends only on the observable $\mathbf{y}$).
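
To make the step explicit, here is a short derivation sketch (an expansion of the argument above, not a full proof). Writing $\mathbf{w} = \mathbf{y} - \mathbf{x}$,

$$\|f_\theta(\mathbf{y}) - \mathbf{x}\|^2 = \|f_\theta(\mathbf{y}) - \mathbf{y}\|^2 + \|\mathbf{w}\|^2 + 2\langle f_\theta(\mathbf{y}) - \mathbf{y}, \mathbf{w}\rangle.$$

Taking expectations, $\mathbb{E}\|\mathbf{w}\|^2 = N\sigma^2$ and $\mathbb{E}\langle \mathbf{y}, \mathbf{w}\rangle = \mathbb{E}\langle \mathbf{x} + \mathbf{w}, \mathbf{w}\rangle = N\sigma^2$, while Stein's lemma applied componentwise gives $\mathbb{E}\langle f_\theta(\mathbf{y}), \mathbf{w}\rangle = \sigma^2\,\mathbb{E}[\operatorname{div}(f_\theta)(\mathbf{y})]$. Combining the three terms,

$$\mathbb{E}\|f_\theta(\mathbf{y}) - \mathbf{x}\|^2 = \mathbb{E}\|\mathbf{y} - f_\theta(\mathbf{y})\|^2 - N\sigma^2 + 2\sigma^2\,\mathbb{E}[\operatorname{div}(f_\theta)(\mathbf{y})],$$

and dividing by $N$ yields the stated identity.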

Definition: Monte Carlo Divergence Estimation

Computing $\operatorname{div}(f_\theta) = \sum_i \partial [f_\theta(\mathbf{y})]_i / \partial y_i$ exactly requires $N$ backpropagation passes (one per pixel), which is prohibitively expensive. The Monte Carlo estimator uses a single random probe vector:

$$\widehat{\operatorname{div}}(f_\theta) = \mathbf{b}^\top \mathbf{J}_{f_\theta} \mathbf{b}$$

where $\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_N)$ and $\mathbf{b}^\top \mathbf{J}_{f_\theta}$ is computed via a single vector-Jacobian product (one backward pass), after which the estimate is its inner product with $\mathbf{b}$.

Unbiasedness: $\mathbb{E}_\mathbf{b}[\widehat{\operatorname{div}}] = \operatorname{div}(f_\theta)$.

The MC divergence adds exactly one extra backward pass per training sample. The variance can be reduced by averaging over multiple probe vectors, but in practice a single probe suffices.
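
In an autodiff framework the estimator is only a few lines. The following is a minimal PyTorch sketch (the helper name `mc_divergence` and its batch-first interface are my own choices); it uses `torch.autograd.grad` to form exactly the vector-Jacobian product described above.

```python
import torch

def mc_divergence(output, y):
    """Single-probe Hutchinson estimate of div f(y) = tr(J_f), one value per sample.

    `output` = f(y) must have been computed from `y` with y.requires_grad_(True);
    `y` has shape [B, ...] with the batch dimension first.
    """
    b = torch.randn_like(y)                     # Gaussian probe vector
    # b^T J_f via one vector-Jacobian product (a single backward pass);
    # create_graph=True keeps the estimate differentiable w.r.t. the network weights
    vjp = torch.autograd.grad(output, y, grad_outputs=b, create_graph=True)[0]
    return (vjp * b).flatten(1).sum(dim=1)      # b^T J_f b, per sample
```

Averaging the returned value over several probe vectors reduces the variance at the cost of extra backward passes.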

SURE vs. True MSE During Training

Compare the SURE loss and the true MSE (computed with ground truth) during training. For a linear denoiser $f(\mathbf{y}) = \alpha\mathbf{y}$, SURE is exact (no estimation variance). For nonlinear denoisers (soft thresholding, neural network), SURE tracks the true MSE with increasing variance.

Observe that the SURE minimum coincides with the MSE minimum, confirming unbiasedness. The divergence term increases with denoiser complexity (neural net > soft threshold > linear).
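
The same agreement is easy to check numerically. The sketch below is an illustrative NumPy experiment (the elementwise tanh "denoiser", signal size, and trial count are arbitrary choices of mine): it averages SURE and the true MSE over many noise realisations for a fixed clean signal and shows that the two averages coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, trials = 256, 0.5, 20_000
x = rng.standard_normal(N)                      # fixed clean signal

sure_vals, mse_vals = [], []
for _ in range(trials):
    y = x + sigma * rng.standard_normal(N)      # y = x + w,  w ~ N(0, sigma^2 I)
    f = np.tanh(y)                              # toy denoiser f(y) = tanh(y)
    div = np.sum(1.0 - np.tanh(y) ** 2)         # exact divergence of elementwise tanh
    sure_vals.append(np.mean((y - f) ** 2) - sigma**2 + 2 * sigma**2 / N * div)
    mse_vals.append(np.mean((f - x) ** 2))

print(np.mean(sure_vals), np.mean(mse_vals))    # the two averages agree closely
```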


Example: SURE for Linear Denoisers

Compute SURE in closed form for the linear denoiser $f_\alpha(\mathbf{y}) = \alpha\,\mathbf{y}$ and find the optimal shrinkage $\alpha^*$.
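
One way to carry out the calculation (a sketch of my own, using the definitions above): for $f_\alpha(\mathbf{y}) = \alpha\mathbf{y}$ the Jacobian is $\alpha\mathbf{I}$, so $\operatorname{div}(f_\alpha) = \alpha N$ and

$$\text{SURE}(\alpha) = \frac{(1-\alpha)^2}{N}\|\mathbf{y}\|^2 - \sigma^2 + 2\sigma^2\alpha.$$

Setting the derivative with respect to $\alpha$ to zero gives $-\tfrac{2(1-\alpha)}{N}\|\mathbf{y}\|^2 + 2\sigma^2 = 0$, hence

$$\alpha^* = 1 - \frac{N\sigma^2}{\|\mathbf{y}\|^2},$$

a Wiener-like shrinkage that attenuates more aggressively as the noise level grows relative to the signal energy.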

Theorem: Generalised SURE for Inverse Problems (GSURE)

For the inverse problem $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_M)$, the Generalised SURE for a reconstruction network $f_\theta \colon \mathbb{R}^M \to \mathbb{R}^N$ is:

$$\text{GSURE}(f_\theta) = \frac{1}{M}\|\mathbf{y} - \mathbf{A}f_\theta(\mathbf{y})\|^2 - \sigma^2 + \frac{2\sigma^2}{M}\operatorname{div}_\mathbf{y}(\mathbf{A}f_\theta)$$

where $\operatorname{div}_\mathbf{y}(\mathbf{A}f_\theta) = \operatorname{tr}(\mathbf{A}\,\mathbf{J}_{f_\theta})$.

GSURE is unbiased for the projected MSE $\frac{1}{M}\mathbb{E}[\|\mathbf{A}f_\theta(\mathbf{y}) - \mathbf{A}\mathbf{x}\|^2]$, not the full reconstruction MSE.

GSURE constrains only the component of the reconstruction lying in the range of $\mathbf{A}^H$; it says nothing about the null-space component. An additional regulariser (TV, DIP, equivariance) is needed for the null space.
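
The trace term admits the same single-probe Monte Carlo estimate, now probing in measurement space: $\operatorname{tr}(\mathbf{A}\mathbf{J}_{f_\theta}) = \mathbb{E}_\mathbf{b}[\mathbf{b}^\top \mathbf{A}\mathbf{J}_{f_\theta}\mathbf{b}]$ with $\mathbf{b} \in \mathbb{R}^M$. The PyTorch sketch below is an illustration under the simplifying assumption that $\mathbf{A}$ is available as a dense matrix (in practice it is usually a linear operator with a matching adjoint); the function name `gsure_loss` is my own.

```python
import torch

def gsure_loss(model, y, A, sigma):
    """Monte Carlo GSURE for y = A x + w, w ~ N(0, sigma^2 I_M).

    y: [B, M] measurements, A: [M, N] dense forward matrix, sigma: noise std.
    """
    M = y.shape[1]
    y = y.clone().requires_grad_(True)
    x_hat = model(y)                               # reconstruction, shape [B, N]
    Ax_hat = x_hat @ A.T                           # A f(y), back in measurement space

    residual = ((y - Ax_hat) ** 2).mean(dim=1)     # (1/M) ||y - A f(y)||^2 per sample

    b = torch.randn_like(y)                        # probe in measurement space
    # b^T (A J_f) via one backward pass through the composition A(f(.))
    vjp = torch.autograd.grad(Ax_hat, y, grad_outputs=b, create_graph=True)[0]
    trace_est = (vjp * b).sum(dim=1)               # ~ tr(A J_f), per sample

    return (residual - sigma**2 + (2 * sigma**2 / M) * trace_est).mean()
```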


SURE-Based Denoiser Training

Complexity: Each training step costs $\sim 2\times$ a standard supervised step (one extra backward pass for the divergence).
Input: Noisy images $\{\mathbf{y}_j\}_{j=1}^J$, noise variance $\sigma^2$, denoiser network $f_\theta$
Output: Trained denoiser $f_{\theta^*}$
1. Initialise $\theta$ randomly
2. for epoch $= 1, \ldots, E$ do
3. $\quad$ for mini-batch $\{\mathbf{y}_j\}$ do
4. $\quad\quad$ Forward pass: $\hat{\mathbf{x}}_j = f_\theta(\mathbf{y}_j)$
5. $\quad\quad$ Residual: $r_j = \frac{1}{N}\|\mathbf{y}_j - \hat{\mathbf{x}}_j\|^2$
6. $\quad\quad$ MC divergence: sample $\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, compute $\hat{d}_j = \mathbf{b}^\top \mathbf{J}_{f_\theta}\mathbf{b}$ via a vector-Jacobian product
7. $\quad\quad$ SURE loss: $\mathcal{L}_j = r_j - \sigma^2 + \frac{2\sigma^2}{N}\hat{d}_j$
8. $\quad\quad$ Update: $\theta \leftarrow \theta - \eta\,\nabla_\theta\,\frac{1}{|\text{batch}|}\sum_j \mathcal{L}_j$
9. $\quad$ end for
10. end for
11. return $f_{\theta^*}$
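
A corresponding PyTorch training step might look like the following minimal sketch (one of several possible implementations; the batch layout, optimiser handling, and function name are placeholder choices of mine). It reuses the `mc_divergence` helper sketched earlier for step 6.

```python
import torch

def sure_training_step(model, optimizer, y_batch, sigma):
    """One optimisation step on the Monte Carlo SURE loss (Gaussian noise, known sigma)."""
    n = y_batch[0].numel()                                     # N: pixels per image
    y = y_batch.clone().requires_grad_(True)

    x_hat = model(y)                                           # step 4: forward pass
    residual = ((y - x_hat) ** 2).flatten(1).mean(dim=1)       # step 5: (1/N)||y - x_hat||^2
    d_hat = mc_divergence(x_hat, y)                            # step 6: b^T J b via one VJP
    loss = (residual - sigma**2 + (2 * sigma**2 / n) * d_hat).mean()   # step 7: SURE

    optimizer.zero_grad()
    loss.backward()                                            # step 8: gradient step on theta
    optimizer.step()
    return loss.item()
```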

SURE training requires the noise variance σ2\sigma^2 to be known. If unknown, σ2\sigma^2 can be estimated from the measurements (e.g., median absolute deviation of wavelet coefficients).
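
For example, one common robust estimate (a sketch under the assumption that the clean image is smooth enough that the finest-scale wavelet detail coefficients are dominated by noise) uses the median absolute deviation of the diagonal Haar detail subband:

```python
import numpy as np

def estimate_sigma_mad(img):
    """Robust noise-std estimate: MAD of the finest-scale Haar HH coefficients / 0.6745."""
    img = np.asarray(img, dtype=np.float64)
    img = img[: img.shape[0] // 2 * 2, : img.shape[1] // 2 * 2]   # crop to even size
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    hh = (a - b - c + d) / 2.0             # orthonormal Haar diagonal (HH) coefficients
    return np.median(np.abs(hh)) / 0.6745  # MAD-to-sigma conversion for Gaussian noise
```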


Common Mistake: SURE Requires Gaussian Noise with Known Variance

Mistake:

Applying SURE-based training to RF imaging data with non-Gaussian noise (e.g., speckle, Poisson photon noise) or unknown noise level.

Correction:

Standard SURE assumes $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ with known $\sigma^2$. Violations cause biased risk estimates:

  • Non-Gaussian noise: Use Poisson-SURE or exponential-family SURE extensions (Eldar, 2008).
  • Unknown $\sigma^2$: Estimate from data using robust methods (MAD of wavelet coefficients, or from measurement residuals).
  • Correlated noise: Use the generalised form with known covariance $\boldsymbol{\Sigma}_w$.

For RF imaging, thermal noise is Gaussian but the effective noise after beamforming/matched filtering may be coloured.

🔧 Engineering Note

Computational Cost of MC Divergence

The MC divergence estimator $\widehat{\operatorname{div}} = \mathbf{b}^\top \mathbf{J}_{f_\theta}\mathbf{b}$ requires one vector-Jacobian product, which in PyTorch/JAX costs the same as one backward pass. This doubles the per-step training cost compared to supervised training.

Practical tips:

  • Use a single probe vector $\mathbf{b}$ per sample (the variance is acceptable for SGD).
  • For large images ($N > 10^6$), compute the divergence on random patches rather than the full image.
  • If using GSURE for inverse problems, the cost is $O(C_{\text{fwd}} + C_{\text{bwd}} + C_{\mathbf{A}})$ per sample.

Quick Check

For a soft-thresholding denoiser $f_\lambda(\mathbf{y})_i = \text{sign}(y_i)\max(|y_i| - \lambda, 0)$, what is $\operatorname{div}(f_\lambda)$?

$N$ (the dimension)

$|\{i : |y_i| > \lambda\}|$ (number of non-zero components)

$\lambda \cdot N$

$0$ (soft thresholding is not differentiable)

Stein's Unbiased Risk Estimate (SURE)

A formula that provides an unbiased estimate of the MSE of a denoiser without access to the clean signal, using the denoiser's divergence as a correction term. Requires Gaussian noise with known variance.

Related: Generalised SURE (GSURE)

Generalised SURE (GSURE)

An extension of SURE to inverse problems that estimates the projected MSE $\|\mathbf{A}\hat{\mathbf{x}} - \mathbf{A}\mathbf{x}\|^2$ without clean ground truth, applicable to underdetermined systems.

Related: Stein's Unbiased Risk Estimate (SURE)

Key Takeaway

SURE estimates MSE without clean targets by adding a divergence correction to the residual. The divergence is computed efficiently via Monte Carlo estimation (one extra backward pass). SURE-trained denoisers match the quality of supervised training for Gaussian noise. GSURE extends this to inverse problems but is blind to the null space, so an additional regulariser is needed there. The main limitation is the requirement of Gaussian noise with known $\sigma^2$.