Noise2Noise, Noise2Self, Noise2Void

Can We Train a Denoiser Without Clean Data?

Supervised denoisers require paired data: noisy input and clean target. In many imaging domains --- and especially RF imaging --- clean ground truth is expensive or impossible to obtain. The Noise2X family of methods asks: what is the minimum data requirement for training a denoiser?

The answer is surprising. Noise2Noise shows that noisy-noisy pairs suffice. Noise2Void goes further: a single noisy image is enough, provided the noise is pixel-independent. The key insight behind all these methods is that the minimiser of the MSE loss is the conditional expectation of the target, and that expectation is unchanged when a clean target is replaced by a noisy one with the same mean.

Definition:

Noise2Noise

Noise2Noise trains a denoiser using pairs of noisy images of the same scene, without requiring clean ground truth:

minβ‘ΞΈβ€…β€ŠE(y1,y2)[βˆ₯fΞΈ(y1)βˆ’y2βˆ₯2]\min_\theta \; \mathbb{E}_{(\mathbf{y}_1, \mathbf{y}_2)}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{y}_2\|^2\bigr]

where $\mathbf{y}_1 = \mathbf{x} + \mathbf{w}_1$ and $\mathbf{y}_2 = \mathbf{x} + \mathbf{w}_2$ are two independent noisy observations of the same scene $\mathbf{x}$.

The optimal network converges to the conditional mean $f_{\theta^*}(\mathbf{y}_1) = \mathbb{E}[\mathbf{x} \mid \mathbf{y}_1]$ --- the same MMSE estimator as supervised training with clean targets.

Noise2Noise works because $\mathbb{E}[\mathbf{y}_2 \mid \mathbf{y}_1] = \mathbb{E}[\mathbf{x} \mid \mathbf{y}_1]$ when $\mathbf{w}_1$ and $\mathbf{w}_2$ are independent with zero mean. Replacing the clean target with a noisy one changes only the variance of the gradient, not its expectation.
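This claim can be checked numerically with a toy model (a hypothetical scalar affine denoiser, not from the text): fitting $f(y) = a y + b$ by least squares against noisy targets recovers essentially the same coefficients as fitting against clean targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200_000, 0.5

x = rng.normal(0.0, 1.0, n)           # clean signal (never seen by N2N)
y1 = x + rng.normal(0.0, sigma, n)    # noisy input
y2 = x + rng.normal(0.0, sigma, n)    # independent noisy target

# Least-squares affine denoiser f(y) = a*y + b, fitted two ways
A = np.stack([y1, np.ones(n)], axis=1)
a_sup, b_sup = np.linalg.lstsq(A, x, rcond=None)[0]    # clean targets
a_n2n, b_n2n = np.linalg.lstsq(A, y2, rcond=None)[0]   # noisy targets

print(a_sup, a_n2n)  # both close to the MMSE slope 1/(1 + sigma**2) = 0.8
```

Only the target changes between the two fits; because the target noise is zero-mean and independent of the input, the fitted estimator is the same up to sampling error.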

Theorem: Noise2Noise Achieves the MMSE Estimator

Let $\mathbf{y}_1 = \mathbf{x} + \mathbf{w}_1$ and $\mathbf{y}_2 = \mathbf{x} + \mathbf{w}_2$, where $\mathbf{w}_1$ and $\mathbf{w}_2$ are independent, zero-mean noise. Then the minimiser of the Noise2Noise loss

$$\mathcal{L}_{\text{N2N}}(\theta) = \mathbb{E}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{y}_2\|^2\bigr]$$

is identical to the minimiser of the supervised loss $\mathcal{L}_{\text{sup}}(\theta) = \mathbb{E}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{x}\|^2\bigr]$:

$$f_{\theta^*}(\mathbf{y}_1) = \mathbb{E}[\mathbf{x} \mid \mathbf{y}_1].$$

Replacing the clean target $\mathbf{x}$ with the noisy version $\mathbf{y}_2 = \mathbf{x} + \mathbf{w}_2$ adds a term that depends on $\mathbf{w}_2$ but not on $\theta$. Since the noise is zero-mean and independent, the cross-term vanishes in expectation, and the minimiser remains the conditional mean.
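The argument can be written out in one line by expanding the square (same symbols as in the theorem statement):

$$\mathcal{L}_{\text{N2N}}(\theta) = \mathbb{E}\bigl[\|f_\theta(\mathbf{y}_1) - \mathbf{x} - \mathbf{w}_2\|^2\bigr] = \mathcal{L}_{\text{sup}}(\theta) - 2\,\mathbb{E}\bigl[\langle f_\theta(\mathbf{y}_1) - \mathbf{x},\, \mathbf{w}_2 \rangle\bigr] + \mathbb{E}\bigl[\|\mathbf{w}_2\|^2\bigr].$$

Because $\mathbf{w}_2$ is zero-mean and independent of $\mathbf{y}_1$ and $\mathbf{x}$, the middle term is zero, so $\mathcal{L}_{\text{N2N}}(\theta) = \mathcal{L}_{\text{sup}}(\theta) + \mathbb{E}\|\mathbf{w}_2\|^2$: the two losses differ by a constant and share the same minimiser.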

Definition:

Noise2Self and Noise2Void

Noise2Self and Noise2Void train denoisers from a single noisy image by exploiting the statistical independence of noise across pixels.

Noise2Void masks a subset of pixels in the input and predicts them from the surrounding context:

minβ‘ΞΈβ€…β€Šβˆ‘i∈M∣fΞΈ(yβˆ–i)iβˆ’yi∣2\min_\theta \; \sum_{i \in \mathcal{M}} |f_\theta(\mathbf{y}_{\setminus i})_i - y_i|^2

where $\mathbf{y}_{\setminus i}$ denotes the noisy image with pixel $i$ replaced by an interpolation from its neighbours, and $\mathcal{M}$ is the set of masked pixels.

Noise2Self generalises this via the concept of $J$-invariance: a function $f$ is $J$-invariant if $f(\mathbf{y})_i$ depends only on $\{y_j : j \neq i\}$. For $J$-invariant estimators, the self-supervised loss equals the supervised loss plus the noise variance in expectation, so both have the same minimiser.

Noise2Void requires only a single noisy image for training --- no pairs, no clean data. This is the most data-efficient self-supervised method. However, it assumes pixel-independent noise, which fails for correlated noise (common in RF imaging after matched filtering).
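A minimal numpy sketch of why masking works (assumptions: i.i.d. Gaussian noise on a hypothetical smooth test image; a fixed four-neighbour mean stands in for a trained blind-spot network). The predictor never reads the pixel it predicts, so the self-supervised loss is just the supervised loss shifted by the noise variance:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3
x = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))   # smooth clean image
y = x + rng.normal(0.0, sigma, x.shape)           # pixel-independent noise

def blind_spot_mean(img):
    """J-invariant denoiser: each output pixel is the mean of its four
    neighbours and never reads the pixel itself (edges wrap around)."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 4.0

f = blind_spot_mean(y)
self_loss = np.mean((f - y) ** 2)   # computable without ground truth
sup_loss = np.mean((f - x) ** 2)    # needs the clean image

# For a J-invariant f and i.i.d. noise: self_loss ~= sup_loss + sigma**2
print(self_loss, sup_loss + sigma ** 2)
```

Minimising the self-supervised loss therefore minimises the supervised loss, even though the clean image never enters the computation.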


Noise2Noise vs. Supervised Training

Compare the denoising PSNR of Noise2Noise training (noisy-noisy pairs) vs. supervised training (noisy-clean pairs) as a function of training set size and SNR. For Gaussian noise, the two methods converge to the same performance; the gap appears only for small training sets (higher gradient variance in N2N).

For non-Gaussian noise, the MSE-based N2N loss still requires only that the target noise be zero-mean. Poisson noise satisfies this ($\mathbb{E}[\mathbf{y} \mid \mathbf{x}] = \mathbf{x}$), so N2N remains unbiased there, although the signal-dependent variance increases gradient noise; log-compressed speckle, by contrast, has a non-zero mean bias. Observe the performance gap in that regime.


Example: Noise2Noise for Radar Imaging

A MIMO radar system collects two independent measurements of the same scene in consecutive coherent processing intervals (CPIs). How can Noise2Noise be applied, and what assumptions must hold?
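One way the setup could be sketched numerically (hypothetical scene statistics; a one-parameter soft-threshold denoiser stands in for the network $f_\theta$): use CPI 1 as input and CPI 2 as target, and select the parameter that minimises the N2N loss. Because the two losses differ only by a constant in expectation, the N2N-selected parameter matches the one chosen with clean-target supervision:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50_000, 0.4

# Hypothetical sparse scene: a few strong scatterers, mostly empty cells
x = np.where(rng.random(n) < 0.05, rng.normal(0.0, 3.0, n), 0.0)
y1 = x + rng.normal(0.0, sigma, n)   # CPI 1: input
y2 = x + rng.normal(0.0, sigma, n)   # CPI 2: same scene, independent noise

def soft_threshold(y, t):
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

ts = np.linspace(0.0, 1.5, 61)
n2n = [np.mean((soft_threshold(y1, t) - y2) ** 2) for t in ts]   # no ground truth
sup = [np.mean((soft_threshold(y1, t) - x) ** 2) for t in ts]    # needs ground truth

t_n2n, t_sup = ts[np.argmin(n2n)], ts[np.argmin(sup)]
print(t_n2n, t_sup)   # the two selected thresholds agree closely
```

The assumptions from the example are encoded directly: the scene $\mathbf{x}$ is identical across the two CPIs (no target motion or scintillation between them), and the noise realisations are independent and zero-mean.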

Common Mistake: Noise2Void Fails for Correlated Noise

Mistake:

Applying Noise2Void to RF images after matched filtering, where the noise is spatially correlated due to the point spread function.

Correction:

Noise2Void assumes pixel-independent noise. After matched filtering, the noise is correlated over the mainlobe width, violating this assumption. The network can "cheat" by predicting the masked pixel from correlated noise in neighbouring pixels, learning to reproduce noise rather than remove it.

Workaround: Apply Noise2Void in the measurement domain (before matched filtering) where noise is i.i.d., or use Noise2Noise with independent measurement pairs instead.
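The failure mode can be reproduced in a few lines (a sketch, with a simple neighbour-averaging filter standing in for the matched-filter point spread function): once the noise is spatially correlated, the blind-spot identity breaks, because the masked pixel's noise leaks into the neighbours the predictor reads.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3
x = np.zeros((64, 64))                    # empty scene, for clarity
w = rng.normal(0.0, sigma, x.shape)

# Correlate the noise: average each sample with its four neighbours
# (normalised so the variance stays sigma**2; edges wrap around)
w_corr = (w + np.roll(w, 1, 0) + np.roll(w, -1, 0) +
          np.roll(w, 1, 1) + np.roll(w, -1, 1)) / np.sqrt(5)
y = x + w_corr

def blind_spot_mean(img):
    """Four-neighbour mean: never reads the centre pixel."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 4.0

f = blind_spot_mean(y)
var_noise = np.mean(w_corr ** 2)
self_loss = np.mean((f - y) ** 2)
sup_loss = np.mean((f - x) ** 2)

gap = self_loss - (sup_loss + var_noise)   # ~ 0 only for independent noise
print(gap)  # clearly negative: the blind spot "sees" the masked noise
```

The negative gap is exactly the "cheating" described above: the self-supervised loss rewards predicting the correlated noise, so its minimiser no longer coincides with the supervised one.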

Comparison of Self-Supervised Denoising Methods

| Method | Training Data | Noise Assumption | RF Imaging Source | Limitation |
| --- | --- | --- | --- | --- |
| Supervised | Noisy-clean pairs | Any | N/A (no ground truth) | Requires ground truth |
| Noise2Noise | Noisy-noisy pairs | Zero-mean, independent | Multiple CPIs | Needs repeated measurements |
| Noise2Void | Single noisy image | Pixel-independent | Single measurement | Fails for correlated noise |
| Noise2Self | Single noisy image | $J$-invariant | Single measurement | Reduced resolution |

Quick Check

Why does Noise2Noise training converge to the MMSE estimator?

The noisy targets average out to the clean image over many training samples

The cross-term $\langle f_\theta(\mathbf{y}_1) - \mathbf{x}, \mathbf{w}_2\rangle$ vanishes in expectation due to independence

The network learns to subtract the noise from the target

The MSE loss is convex, so any local minimum is global

Historical Note: From NVIDIA Research to Medical Imaging

2018

Noise2Noise was developed at NVIDIA Research by Lehtinen et al. (2018). The original paper demonstrated results on natural images, MRI, and Monte Carlo rendered images. The work was motivated by the observation that in many practical scenarios --- medical imaging, microscopy, astronomy --- collecting two noisy measurements is far easier than obtaining a clean ground truth.

The paper's most striking result was that a Noise2Noise-trained denoiser matched the quality of a supervised denoiser trained on millions of clean-noisy pairs, using only noisy-noisy pairs. This result was initially met with skepticism but is now well understood through the lens of conditional expectation theory.

Noise2Noise

A self-supervised training strategy that uses pairs of independent noisy observations of the same scene as input-target pairs, converging to the MMSE estimator without clean ground truth.

Related: Noise2Void

Noise2Void

A self-supervised denoising method that trains from a single noisy image by masking pixels and predicting them from context, requiring only pixel-independent noise.

Related: Noise2Noise

Key Takeaway

Noise2Noise achieves MMSE-optimal denoising using only noisy-noisy pairs --- the cross-term in the MSE expansion vanishes by independence. Noise2Void extends this to a single noisy image by exploiting pixel-independent noise. For RF imaging, independent noise realisations arise naturally from multiple CPIs, sub-aperture processing, or frequency diversity. The critical limitation is that these methods fail for spatially correlated noise (post-matched-filtering).