Deep Image Prior (DIP) and Deep Decoder

The Network Architecture IS the Prior

All methods in Chapters 20--22 require either paired training data (supervised) or a pretrained generative model (diffusion). Deep Image Prior (DIP) demonstrates a remarkable fact: the architecture of a neural network itself encodes a prior over natural images, even with random, untrained weights.

DIP fits a randomly initialised network to a single measurement by optimising the network weights. The key observation is that the network architecture imposes an implicit regularisation: natural images are learned faster than noise, so early stopping acts as the regulariser. This is particularly valuable for RF imaging, where ground-truth reflectivity maps are rarely available.

Definition:

Deep Image Prior (DIP)

The Deep Image Prior reconstructs an image by optimising the weights $\theta$ of a generator network $f_\theta$ to fit the measurements:

$$\min_\theta \; \frac{1}{2}\|\mathbf{y} - \mathbf{A}\,f_\theta(\mathbf{z})\|^2$$

where $\mathbf{z} \in \mathbb{R}^d$ is a fixed random input (not optimised) and $f_\theta \colon \mathbb{R}^d \to \mathbb{R}^N$ is a U-Net or encoder--decoder CNN.

The reconstruction is $\hat{\mathbf{x}} = f_{\theta^*}(\mathbf{z})$ where $\theta^* = \arg\min_\theta \|\mathbf{y} - \mathbf{A}\,f_\theta(\mathbf{z})\|^2$, with early stopping to prevent overfitting to noise.

DIP requires no training data --- it reconstructs from a single measurement. The "prior" is encoded in the network architecture: convolutional layers favour spatially smooth, locally correlated images. This bias toward natural-looking images provides implicit regularisation.

Historical Note: The Accidental Discovery of DIP

2018

Ulyanov, Vedaldi, and Lempitsky (2018) initially set out to study texture generation using untrained networks. They noticed that when fitting a CNN to a single noisy image, the network produced a clean version before learning the noise --- a behaviour that seemed to violate the expectation that over-parameterised networks immediately memorise their training data.

This observation led to the DIP paper, which argued that the CNN architecture itself acts as a regulariser. The result was surprising because it contradicted the prevailing view that neural network priors must be learned from data. DIP demonstrated that the inductive bias of convolutional architectures is, by itself, a powerful image prior.

Theorem: Spectral Bias of Deep Image Prior

During gradient descent optimisation of the DIP objective, the network $f_\theta$ learns low-frequency components of the target faster than high-frequency components. Specifically, for a network with $L$ convolutional layers of kernel size $k$, the Fourier coefficient of the output at frequency $\boldsymbol{\omega}$ and optimisation step $t$ satisfies:

$$|\hat{f}_\theta(\boldsymbol{\omega}, t)| \propto 1 - e^{-\mu(\boldsymbol{\omega})\, t}$$

where the convergence rate $\mu(\boldsymbol{\omega})$ decreases with $\|\boldsymbol{\omega}\|$ --- low frequencies converge first.

The network acts as a low-pass filter that progressively admits higher frequencies as training progresses. Signal (which is typically low-frequency dominant) is learned first; noise (which is broadband) is learned later. Stopping optimisation at the right time captures the signal while rejecting noise.
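The closed-form rate in the theorem can be made concrete with a toy simulation. The rate function $\mu(\boldsymbol{\omega}) = 1/(1 + \|\boldsymbol{\omega}\|^2)$ below is an illustrative choice of a decreasing convergence rate, not the exact NTK-derived value for any particular architecture:

```python
import numpy as np

def recovered_fraction(omega, t):
    """Fraction of the Fourier coefficient at frequency `omega` recovered
    after t gradient steps, per the rate |f_hat| ∝ 1 - exp(-mu(omega) t).
    mu(omega) = 1/(1 + omega^2) is an illustrative decreasing rate."""
    mu = 1.0 / (1.0 + np.asarray(omega) ** 2)
    return 1.0 - np.exp(-mu * t)

omegas = np.array([0.0, 2.0, 8.0])  # low, mid, high frequency
for t in [10, 100, 1000]:
    # Low frequencies approach full recovery first; high ones lag behind.
    print(f"t={t:5d}  recovered fraction: {np.round(recovered_fraction(omegas, t), 3)}")
```

At small $t$ the low-frequency coefficient is nearly fully recovered while the high-frequency one is not --- exactly the low-pass behaviour described above.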


DIP Spectral Bias and Early Stopping

This interactive figure visualises the DIP reconstruction process. The left subplot shows PSNR vs. iteration, exhibiting the characteristic "rise and fall": PSNR increases as the network learns the signal, peaks at the optimal stopping point, then decreases as the network overfits to noise.

The right subplot shows the power spectrum of the reconstruction at each iteration, compared to the ground truth and the noisy measurement. Observe how low frequencies are recovered first. The Deep Decoder architecture has a stronger spectral bias (slower overfitting) than the U-Net.


Definition:

Deep Decoder

The Deep Decoder is an under-parameterised variant of DIP that uses a decoder-only architecture (no skip connections, no encoder):

$$f_\theta(\mathbf{z}) = \text{Conv}_L \circ \text{Up} \circ \text{ReLU} \circ \text{Conv}_{L-1} \circ \cdots \circ \text{Up} \circ \text{ReLU} \circ \text{Conv}_1(\mathbf{z})$$

where $\text{Up}$ is bilinear upsampling and $\text{Conv}_\ell$ has $C$ channels. The number of parameters is deliberately kept much smaller than the number of pixels: $|\theta| = O(LC^2k^2) \ll N$.

This under-parameterisation prevents the network from memorising noise, eliminating the need for early stopping.

The Deep Decoder provides a more principled alternative to DIP: the regularisation is architectural (under-parameterisation) rather than procedural (early stopping). However, the reconstruction quality is slightly lower because the reduced capacity also limits the network's ability to represent fine details.
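The parameter count $|\theta| = O(LC^2k^2) \ll N$ is easy to verify with a minimal forward pass. The sketch below is a hypothetical NumPy stand-in: it uses $1\times 1$ convolutions (pure channel mixing) and nearest-neighbour upsampling for brevity, where the Deep Decoder paper specifies bilinear upsampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    # Nearest-neighbour 2x upsampling (a simplification; the Deep Decoder
    # uses bilinear upsampling).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def deep_decoder(z, weights):
    """z: (C, h, w) fixed random input; weights: list of (C_out, C_in)
    1x1-convolution matrices. Decoder-only: no encoder, no skips."""
    x = z
    for W in weights[:-1]:
        x = np.einsum("oc,chw->ohw", W, x)  # 1x1 conv = channel mixing
        x = upsample2x(x)
        x = np.maximum(x, 0.0)              # ReLU
    return np.einsum("oc,chw->ohw", weights[-1], x)  # final conv to 1 channel

L, C = 5, 20
weights = [rng.standard_normal((C, C)) * 0.1 for _ in range(L - 1)]
weights.append(rng.standard_normal((1, C)) * 0.1)

z = rng.standard_normal((C, 8, 8))          # fixed random input
out = deep_decoder(z, weights)              # 8 -> 16 -> 32 -> 64 -> 128

n_params = sum(W.size for W in weights)     # |theta| = O(L C^2): here 1,620
n_pixels = out.size                         # N = 128 * 128 = 16,384
print(n_params, n_pixels)
```

Even this tiny configuration is under-parameterised by an order of magnitude, which is precisely the architectural regularisation the definition describes.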

Example: Early Stopping in DIP: Practical Strategies

A DIP reconstruction of a $128 \times 128$ RF reflectivity map is run for 10,000 iterations. The PSNR curve peaks at iteration 3,200 ($\text{PSNR} = 28.5$ dB) and falls to 22.1 dB at iteration 10,000. How do you determine the optimal stopping point in practice (without ground truth)?

Deep Image Prior Reconstruction

Complexity: $O(T_{\max} \cdot C_{\text{fwd}})$ where $C_{\text{fwd}}$ is the cost of one forward+backward pass through the network. Typically $T_{\max} = 2{,}000$--$5{,}000$ iterations.
Input: Measurements $\mathbf{y} \in \mathbb{R}^M$, forward model $\mathbf{A}$, noise variance $\sigma^2$, network $f_\theta$
Output: Reconstruction $\hat{\mathbf{x}}$
1. Sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and fix it
2. Initialise $\theta$ randomly
3. for $t = 1, 2, \ldots, T_{\max}$ do
4. $\quad \hat{\mathbf{x}}_t \leftarrow f_\theta(\mathbf{z})$
5. $\quad \mathcal{L} \leftarrow \frac{1}{2}\|\mathbf{y} - \mathbf{A}\hat{\mathbf{x}}_t\|^2$
6. $\quad \theta \leftarrow \theta - \eta\,\nabla_\theta\,\mathcal{L}$
7. $\quad$ if $\mathcal{L} < M\sigma^2$ then break (discrepancy-principle stopping: the loss has reached the noise floor)
8. end for
9. return $\hat{\mathbf{x}} = f_\theta(\mathbf{z})$ (or running average of recent iterates)

Unlike supervised methods that amortise training cost over many test images, DIP requires per-image optimisation. This makes it slow ($\sim$1--5 minutes per image on GPU) but eliminates the need for any training data.
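The loop structure of the algorithm above can be sketched end-to-end on a toy problem. To keep the sketch short, a linear "network" $f_W(\mathbf{z}) = W\mathbf{z}$ stands in for the CNN (so it carries no convolutional prior); everything else --- the fixed random input, gradient steps on the weights only, and the $\mathcal{L} < M\sigma^2$ stopping rule --- mirrors steps 1--9:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: y = A x_true + noise, with M < N (underdetermined).
M, N, d = 60, 100, 10
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = rng.standard_normal(N)
sigma = 0.05
y = A @ x_true + sigma * rng.standard_normal(M)

z = rng.standard_normal(d)               # step 1: fixed random input
W = rng.standard_normal((N, d)) * 0.01   # step 2: random weight init
eta, T_max = 0.01, 5000

for t in range(T_max):
    x_hat = W @ z                        # step 4: "network" output
    r = y - A @ x_hat
    loss = 0.5 * r @ r                   # step 5: data-fit loss
    if loss < M * sigma**2:              # step 7: discrepancy stopping
        break
    W += eta * np.outer(A.T @ r, z)      # step 6: gradient step on W only

x_hat = W @ z
print(f"stopped at t={t}, loss={loss:.4f}, threshold={M * sigma**2:.4f}")
```

Run to convergence instead of breaking at step 7, this same loop would drive the loss toward zero and fit the noise --- which is the failure mode discussed next.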

Common Mistake: DIP Overfits to Noise Without Early Stopping

Mistake:

Running DIP optimisation until convergence and using the final iterate as the reconstruction.

Correction:

DIP will overfit to noise if run long enough. The final iterate perfectly reproduces the noisy measurements ($\mathbf{A}f_\theta(\mathbf{z}) = \mathbf{y}$) but amplifies noise in the null space of $\mathbf{A}$.

Always use early stopping, running-average smoothing, or the Deep Decoder to prevent overfitting. For RF imaging with unknown noise level, use cross-validation on held-out measurements.
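The held-out cross-validation rule mentioned above works as follows: fit the network to a random subset of the measurements and stop when the loss on the withheld measurements starts rising. The two loss traces below are synthetic stand-ins (hypothetical shapes, not real DIP runs) used only to illustrate the selection rule:

```python
import numpy as np

t = np.arange(1, 5001)
# Fit loss keeps decreasing; held-out loss is U-shaped, rising once the
# network begins fitting noise specific to the fitted measurements.
fit_loss = 50.0 * np.exp(-t / 500.0)
holdout_loss = 5.0 * np.exp(-t / 500.0) + 1e-4 * t

t_stop = t[np.argmin(holdout_loss)]  # stop at the held-out minimum
print(f"stop at iteration {t_stop}")
```

No ground truth is needed: the withheld measurements play the role of a validation set drawn from the same physical scene.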

Quick Check

In DIP, the random input $\mathbf{z}$ is:

(a) Optimised jointly with the network weights $\theta$

(b) Fixed throughout optimisation --- only $\theta$ is updated

(c) Set equal to the noisy measurement $\mathbf{y}$

(d) Learned from a separate training set

Why This Matters: DIP for RF Imaging Without Training Data

DIP is especially valuable for RF imaging because:

  1. No training data: RF reflectivity ground truth is almost never available. DIP reconstructs from a single measurement.

  2. Flexible forward model: The sensing matrix $\mathbf{A}$ can be any linear operator --- partial Fourier, diffraction tomography, MIMO radar --- without retraining.

  3. Complex-valued extension: DIP naturally handles complex RF signals by using a 2-channel (real/imaginary) output.
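The 2-channel convention in point 3 is a one-line recombination: the network emits a real array of shape $(2, H, W)$, and the complex reflectivity map is assembled from the two channels:

```python
import numpy as np

# Stand-in for a 2-channel network output f_theta(z) of shape (2, H, W).
out = np.random.default_rng(2).standard_normal((2, 64, 64))

# Channel 0 = real part, channel 1 = imaginary part.
x_complex = out[0] + 1j * out[1]
print(x_complex.shape, x_complex.dtype)
```

The data-fit loss is then evaluated on $|\mathbf{y} - \mathbf{A}\mathbf{x}|^2$ with complex arithmetic, while all network weights stay real.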

The main limitation is speed: per-image optimisation takes minutes, making DIP unsuitable for real-time RF imaging. However, it serves as an excellent baseline and can be combined with learned initialisation (meta-learning) for faster convergence.


Deep Image Prior (DIP)

A reconstruction method that uses the architecture of an untrained CNN as an implicit regulariser, optimising network weights to fit a single measurement with early stopping to prevent noise overfitting.

Related: Deep Decoder, Spectral Bias

Deep Decoder

An under-parameterised variant of DIP using a decoder-only architecture with far fewer parameters than pixels, eliminating the need for early stopping by preventing the network from memorising noise.

Related: Deep Image Prior (DIP)

Spectral Bias

The tendency of neural networks to learn low-frequency components of a target function before high-frequency components, arising from the eigenvalue structure of the Neural Tangent Kernel.

Related: Deep Image Prior (DIP)

Key Takeaway

Deep Image Prior uses the CNN architecture as an implicit prior, requiring no training data. Spectral bias causes signal to be learned before noise; early stopping acts as regularisation. The Deep Decoder eliminates early stopping via under-parameterisation but sacrifices some expressivity. DIP is especially valuable for RF imaging where ground-truth reflectivity data is scarce.