Noise2Noise, Noise2Self, Noise2Void

Can We Train a Denoiser Without Clean Data?

Supervised denoisers require paired data: noisy input and clean target. In many imaging domains --- and especially RF imaging --- clean ground truth is expensive or impossible to obtain. The Noise2X family of methods asks: what is the minimum data requirement for training a denoiser?

The answer is surprising. Noise2Noise shows that noisy-noisy pairs suffice. Noise2Void goes further: a single noisy image is enough, provided the noise is pixel-independent. The key insight behind all these methods is that the minimiser of the MSE loss is the conditional expectation of the target, and that expectation is unchanged when a clean target is replaced by a noisy one with the same mean.

Definition:

Noise2Noise

Noise2Noise trains a denoiser using pairs of noisy images of the same scene, without requiring clean ground truth:

minβ‘ΞΈβ€…β€ŠE(y1,y2)[βˆ₯fΞΈ(y1)βˆ’y2βˆ₯2]\min_\theta \; \mathbb{E}_{(\mathbf{y}_1, \mathbf{y}_2)}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{y}_2\|^2\bigr]

where $\mathbf{y}_1 = \mathbf{x} + \mathbf{w}_1$ and $\mathbf{y}_2 = \mathbf{x} + \mathbf{w}_2$ are two independent noisy observations of the same scene $\mathbf{x}$.

The optimal network converges to the conditional mean $f_{\theta^*}(\mathbf{y}_1) = \mathbb{E}[\mathbf{x} \mid \mathbf{y}_1]$ --- the same MMSE estimator as supervised training with clean targets.

Noise2Noise works because $\mathbb{E}[\mathbf{y}_2 \mid \mathbf{y}_1] = \mathbb{E}[\mathbf{x} \mid \mathbf{y}_1]$ when $\mathbf{w}_1$ and $\mathbf{w}_2$ are independent with zero mean. Replacing the clean target with a noisy one changes only the variance of the gradient, not its expectation.
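This claim can be checked numerically with a toy model (a hypothetical scalar affine denoiser, not from the text): fitting $f(y) = a y + b$ by least squares against noisy targets recovers essentially the same coefficients as fitting against clean targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200_000, 0.5

x = rng.normal(0.0, 1.0, n)           # clean signal (never seen by N2N)
y1 = x + rng.normal(0.0, sigma, n)    # noisy input
y2 = x + rng.normal(0.0, sigma, n)    # independent noisy target

# Least-squares affine denoiser f(y) = a*y + b, fitted two ways
A = np.stack([y1, np.ones(n)], axis=1)
a_sup, b_sup = np.linalg.lstsq(A, x, rcond=None)[0]    # clean targets
a_n2n, b_n2n = np.linalg.lstsq(A, y2, rcond=None)[0]   # noisy targets

print(a_sup, a_n2n)  # both close to the MMSE slope 1/(1 + sigma**2) = 0.8
```

Only the target changes between the two fits; because the target noise is zero-mean and independent of the input, the fitted estimator is the same up to sampling error.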

Theorem: Noise2Noise Achieves the MMSE Estimator

Let $\mathbf{y}_1 = \mathbf{x} + \mathbf{w}_1$ and $\mathbf{y}_2 = \mathbf{x} + \mathbf{w}_2$, where $\mathbf{w}_1$ and $\mathbf{w}_2$ are independent, zero-mean noise. Then the minimiser of the Noise2Noise loss

$$\mathcal{L}_{\text{N2N}}(\theta) = \mathbb{E}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{y}_2\|^2\bigr]$$

is identical to the minimiser of the supervised loss $\mathcal{L}_{\text{sup}}(\theta) = \mathbb{E}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{x}\|^2\bigr]$:

$$f_{\theta^*}(\mathbf{y}_1) = \mathbb{E}[\mathbf{x} \mid \mathbf{y}_1].$$

Replacing the clean target $\mathbf{x}$ with the noisy version $\mathbf{y}_2 = \mathbf{x} + \mathbf{w}_2$ adds a term that depends on $\mathbf{w}_2$ but not on $\theta$. Since the noise is zero-mean and independent, the cross-term vanishes in expectation, and the minimiser remains the conditional mean.
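The argument can be written out in one line by expanding the square (same symbols as in the theorem statement):

$$\mathcal{L}_{\text{N2N}}(\theta) = \mathbb{E}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{x} - \mathbf{w}_2\|^2\bigr] = \mathcal{L}_{\text{sup}}(\theta) - 2\,\mathbb{E}\bigl[\langle f_\theta(\mathbf{y}_1) - \mathbf{x},\, \mathbf{w}_2 \rangle\bigr] + \mathbb{E}\bigl[\|\mathbf{w}_2\|^2\bigr].$$

Because $\mathbf{w}_2$ is zero-mean and independent of $\mathbf{y}_1$ and $\mathbf{x}$, the middle term is zero, so $\mathcal{L}_{\text{N2N}}(\theta) = \mathcal{L}_{\text{sup}}(\theta) + \mathbb{E}\|\mathbf{w}_2\|^2$: the two losses differ by a constant and share the same minimiser.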

Definition:

Noise2Self and Noise2Void

Noise2Self and Noise2Void train denoisers from a single noisy image by exploiting the statistical independence of noise across pixels.

Noise2Void masks a subset of pixels in the input and predicts them from the surrounding context:

minβ‘ΞΈβ€…β€Šβˆ‘i∈M∣fΞΈ(yβˆ–i)iβˆ’yi∣2\min_\theta \; \sum_{i \in \mathcal{M}} |f_\theta(\mathbf{y}_{\setminus i})_i - y_i|^2

where $\mathbf{y}_{\setminus i}$ denotes the noisy image with pixel $i$ replaced by an interpolation from its neighbours, and $\mathcal{M}$ is the set of masked pixels.

Noise2Self generalises this via the concept of $J$-invariance: a function $f$ is $J$-invariant if $f(\mathbf{y})_i$ depends only on $\{y_j : j \neq i\}$. For $J$-invariant estimators, the self-supervised loss equals the supervised loss plus the noise variance in expectation, so both have the same minimiser.

Noise2Void requires only a single noisy image for training --- no pairs, no clean data. This is the most data-efficient self-supervised method. However, it assumes pixel-independent noise, which fails for correlated noise (common in RF imaging after matched filtering).
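A minimal numpy sketch of why masking works (assumptions: i.i.d. Gaussian noise on a hypothetical smooth test image; a fixed four-neighbour mean stands in for a trained blind-spot network). The predictor never reads the pixel it predicts, so the self-supervised loss is just the supervised loss shifted by the noise variance:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3
x = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))   # smooth clean image
y = x + rng.normal(0.0, sigma, x.shape)           # pixel-independent noise

def blind_spot_mean(img):
    """J-invariant denoiser: each output pixel is the mean of its four
    neighbours and never reads the pixel itself (edges wrap around)."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 4.0

f = blind_spot_mean(y)
self_loss = np.mean((f - y) ** 2)   # computable without ground truth
sup_loss = np.mean((f - x) ** 2)    # needs the clean image

# For a J-invariant f and i.i.d. noise: self_loss ~= sup_loss + sigma**2
print(self_loss, sup_loss + sigma ** 2)
```

Minimising the self-supervised loss therefore minimises the supervised loss, even though the clean image never enters the computation.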


Noise2Noise vs. Supervised Training

Compare the denoising PSNR of Noise2Noise training (noisy-noisy pairs) vs. supervised training (noisy-clean pairs) as a function of training set size and SNR. For Gaussian noise, the two methods converge to the same performance; the gap appears only for small training sets (higher gradient variance in N2N).

For non-Gaussian noise, the MSE-based N2N loss still requires only that the target noise be zero-mean. Poisson noise satisfies this ($\mathbb{E}[\mathbf{y} \mid \mathbf{x}] = \mathbf{x}$), so N2N remains unbiased there, although the signal-dependent variance increases gradient noise; log-compressed speckle, by contrast, has a non-zero mean bias. Observe the performance gap in that regime.


Example: Noise2Noise for Radar Imaging

A MIMO radar system collects two independent measurements of the same scene in consecutive coherent processing intervals (CPIs). How can Noise2Noise be applied, and what assumptions must hold?
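One way the setup could be sketched numerically (hypothetical scene statistics; a one-parameter soft-threshold denoiser stands in for the network $f_\theta$): use CPI 1 as input and CPI 2 as target, and select the parameter that minimises the N2N loss. Because the two losses differ only by a constant in expectation, the N2N-selected parameter matches the one chosen with clean-target supervision:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50_000, 0.4

# Hypothetical sparse scene: a few strong scatterers, mostly empty cells
x = np.where(rng.random(n) < 0.05, rng.normal(0.0, 3.0, n), 0.0)
y1 = x + rng.normal(0.0, sigma, n)   # CPI 1: input
y2 = x + rng.normal(0.0, sigma, n)   # CPI 2: same scene, independent noise

def soft_threshold(y, t):
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

ts = np.linspace(0.0, 1.5, 61)
n2n = [np.mean((soft_threshold(y1, t) - y2) ** 2) for t in ts]   # no ground truth
sup = [np.mean((soft_threshold(y1, t) - x) ** 2) for t in ts]    # needs ground truth

t_n2n, t_sup = ts[np.argmin(n2n)], ts[np.argmin(sup)]
print(t_n2n, t_sup)   # the two selected thresholds agree closely
```

The assumptions from the example are encoded directly: the scene $\mathbf{x}$ is identical across the two CPIs (no target motion or scintillation between them), and the noise realisations are independent and zero-mean.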

Common Mistake: Noise2Void Fails for Correlated Noise

Mistake:

Applying Noise2Void to RF images after matched filtering, where the noise is spatially correlated due to the point spread function.

Correction:

Noise2Void assumes pixel-independent noise. After matched filtering, the noise is correlated over the mainlobe width, violating this assumption. The network can "cheat" by predicting the masked pixel from correlated noise in neighbouring pixels, learning to reproduce noise rather than remove it.

Workaround: Apply Noise2Void in the measurement domain (before matched filtering) where noise is i.i.d., or use Noise2Noise with independent measurement pairs instead.
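The failure mode can be reproduced in a few lines (a sketch, with a simple neighbour-averaging filter standing in for the matched-filter point spread function): once the noise is spatially correlated, the blind-spot identity breaks, because the masked pixel's noise leaks into the neighbours the predictor reads.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3
x = np.zeros((64, 64))                    # empty scene, for clarity
w = rng.normal(0.0, sigma, x.shape)

# Correlate the noise: average each sample with its four neighbours
# (normalised so the variance stays sigma**2; edges wrap around)
w_corr = (w + np.roll(w, 1, 0) + np.roll(w, -1, 0) +
          np.roll(w, 1, 1) + np.roll(w, -1, 1)) / np.sqrt(5)
y = x + w_corr

def blind_spot_mean(img):
    """Four-neighbour mean: never reads the centre pixel."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 4.0

f = blind_spot_mean(y)
var_noise = np.mean(w_corr ** 2)
self_loss = np.mean((f - y) ** 2)
sup_loss = np.mean((f - x) ** 2)

gap = self_loss - (sup_loss + var_noise)   # ~ 0 only for independent noise
print(gap)  # clearly negative: the blind spot "sees" the masked noise
```

The negative gap is exactly the "cheating" described above: the self-supervised loss rewards predicting the correlated noise, so its minimiser no longer coincides with the supervised one.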

Comparison of Self-Supervised Denoising Methods

| Method | Training Data | Noise Assumption | RF Imaging Source | Limitation |
| --- | --- | --- | --- | --- |
| Supervised | Noisy-clean pairs | Any | N/A (no ground truth) | Requires ground truth |
| Noise2Noise | Noisy-noisy pairs | Zero-mean, independent | Multiple CPIs | Needs repeated measurements |
| Noise2Void | Single noisy image | Pixel-independent | Single measurement | Fails for correlated noise |
| Noise2Self | Single noisy image | $J$-invariant | Single measurement | Reduced resolution |

Quick Check

Why does Noise2Noise training converge to the MMSE estimator?

The noisy targets average out to the clean image over many training samples

The cross-term $\langle f_\theta(\mathbf{y}_1) - \mathbf{x}, \mathbf{w}_2\rangle$ vanishes in expectation due to independence

The network learns to subtract the noise from the target

The MSE loss is convex, so any local minimum is global

Historical Note: From NVIDIA Research to Medical Imaging

2018

Noise2Noise was developed at NVIDIA Research by Lehtinen et al. (2018). The original paper demonstrated results on natural images, MRI, and Monte Carlo rendered images. The work was motivated by the observation that in many practical scenarios --- medical imaging, microscopy, astronomy --- collecting two noisy measurements is far easier than obtaining a clean ground truth.

The paper's most striking result was that a Noise2Noise-trained denoiser matched the quality of a supervised denoiser trained on millions of clean-noisy pairs, using only noisy-noisy pairs. This result was initially met with skepticism but is now well understood through the lens of conditional expectation theory.

Noise2Noise

A self-supervised training strategy that uses pairs of independent noisy observations of the same scene as input-target pairs, converging to the MMSE estimator without clean ground truth.

Related: Noise2Void

Noise2Void

A self-supervised denoising method that trains from a single noisy image by masking pixels and predicting them from context, requiring only pixel-independent noise.

Related: Noise2Noise

Key Takeaway

Noise2Noise achieves MMSE-optimal denoising using only noisy-noisy pairs --- the cross-term in the MSE expansion vanishes by independence. Noise2Void extends this to a single noisy image by exploiting pixel-independent noise. For RF imaging, independent noise realisations arise naturally from multiple CPIs, sub-aperture processing, or frequency diversity. The critical limitation is that these methods fail for spatially correlated noise (post-matched-filtering).