Deep Image Prior (DIP) and Deep Decoder

The Network Architecture IS the Prior

All methods in Chapters 20--22 require either paired training data (supervised) or a pretrained generative model (diffusion). Deep Image Prior (DIP) demonstrates a remarkable fact: the architecture of a neural network itself encodes a prior over natural images, even with random, untrained weights.

DIP fits a randomly initialised network to a single measurement by optimising the network weights. The key observation is that the network architecture imposes an implicit regularisation: natural images are learned faster than noise, so early stopping acts as the regulariser. This is particularly valuable for RF imaging, where ground-truth reflectivity maps are rarely available.

Definition:

Deep Image Prior (DIP)

The Deep Image Prior reconstructs an image by optimising the weights $\theta$ of a generator network $f_\theta$ to fit the measurements:

$$\min_\theta \; \frac{1}{2}\|\mathbf{y} - \mathbf{A}\,f_\theta(\mathbf{z})\|^2$$

where $\mathbf{z} \in \mathbb{R}^d$ is a fixed random input (not optimised) and $f_\theta \colon \mathbb{R}^d \to \mathbb{R}^N$ is a U-Net or encoder--decoder CNN.

The reconstruction is $\hat{\mathbf{x}} = f_{\theta^*}(\mathbf{z})$ where $\theta^* = \arg\min_\theta \|\mathbf{y} - \mathbf{A}\,f_\theta(\mathbf{z})\|^2$, with early stopping to prevent overfitting to noise.

DIP requires no training data --- it reconstructs from a single measurement. The "prior" is encoded in the network architecture: convolutional layers favour spatially smooth, locally correlated images. This bias toward natural-looking images provides implicit regularisation.

Historical Note: The Accidental Discovery of DIP

2018

Ulyanov, Vedaldi, and Lempitsky (2018) initially set out to study texture generation using untrained networks. They noticed that when fitting a CNN to a single noisy image, the network produced a clean version before learning the noise --- a behaviour that seemed to violate the expectation that over-parameterised networks immediately memorise their training data.

This observation led to the DIP paper, which argued that the CNN architecture itself acts as a regulariser. The result was surprising because it contradicted the prevailing view that neural network priors must be learned from data. DIP demonstrated that the inductive bias of convolutional architectures is, by itself, a powerful image prior.

Theorem: Spectral Bias of Deep Image Prior

During gradient descent optimisation of the DIP objective, the network $f_\theta$ learns low-frequency components of the target faster than high-frequency components. Specifically, for a network with $L$ convolutional layers of kernel size $k$, the Fourier coefficient of the output at frequency $\boldsymbol{\omega}$ and optimisation step $t$ satisfies:

$$|\hat{f}_\theta(\boldsymbol{\omega}, t)| \propto 1 - e^{-\mu(\boldsymbol{\omega})\, t}$$

where the convergence rate $\mu(\boldsymbol{\omega})$ decreases with $\|\boldsymbol{\omega}\|$ --- low frequencies converge first.

The network acts as a low-pass filter that progressively admits higher frequencies as training progresses. Signal (which is typically low-frequency dominant) is learned first; noise (which is broadband) is learned later. Stopping optimisation at the right time captures the signal while rejecting noise.
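The closed-form rate in the theorem can be made concrete with a toy simulation. The rate function $\mu(\boldsymbol{\omega}) = 1/(1 + \|\boldsymbol{\omega}\|^2)$ below is an illustrative choice of a decreasing convergence rate, not the exact NTK-derived value for any particular architecture:

```python
import numpy as np

def recovered_fraction(omega, t):
    """Fraction of the Fourier coefficient at frequency `omega` recovered
    after t gradient steps, per the rate |f_hat| ∝ 1 - exp(-mu(omega) t).
    mu(omega) = 1/(1 + omega^2) is an illustrative decreasing rate."""
    mu = 1.0 / (1.0 + np.asarray(omega) ** 2)
    return 1.0 - np.exp(-mu * t)

omegas = np.array([0.0, 2.0, 8.0])  # low, mid, high frequency
for t in [10, 100, 1000]:
    # Low frequencies approach full recovery first; high ones lag behind.
    print(f"t={t:5d}  recovered fraction: {np.round(recovered_fraction(omegas, t), 3)}")
```

At small $t$ the low-frequency coefficient is nearly fully recovered while the high-frequency one is not --- exactly the low-pass behaviour described above.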


DIP Spectral Bias and Early Stopping

This interactive figure visualises the DIP reconstruction process. The left subplot shows PSNR vs. iteration, exhibiting the characteristic "rise and fall": PSNR increases as the network learns the signal, peaks at the optimal stopping point, then decreases as the network overfits to noise.

The right subplot shows the power spectrum of the reconstruction at each iteration, compared to the ground truth and the noisy measurement. Observe how low frequencies are recovered first. The Deep Decoder architecture has a stronger spectral bias (slower overfitting) than the U-Net.


Definition:

Deep Decoder

The Deep Decoder is an under-parameterised variant of DIP that uses a decoder-only architecture (no skip connections, no encoder):

$$f_\theta(\mathbf{z}) = \text{Conv}_L \circ \text{Up} \circ \text{ReLU} \circ \text{Conv}_{L-1} \circ \cdots \circ \text{Up} \circ \text{ReLU} \circ \text{Conv}_1(\mathbf{z})$$

where $\text{Up}$ is bilinear upsampling and $\text{Conv}_\ell$ has $C$ channels. The number of parameters is deliberately kept much smaller than the number of pixels: $|\theta| = O(LC^2k^2) \ll N$.

This under-parameterisation prevents the network from memorising noise, eliminating the need for early stopping.

The Deep Decoder provides a more principled alternative to DIP: the regularisation is architectural (under-parameterisation) rather than procedural (early stopping). However, the reconstruction quality is slightly lower because the reduced capacity also limits the network's ability to represent fine details.
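The parameter count $|\theta| = O(LC^2k^2) \ll N$ is easy to verify with a minimal forward pass. The sketch below is a hypothetical NumPy stand-in: it uses $1\times 1$ convolutions (pure channel mixing) and nearest-neighbour upsampling for brevity, where the Deep Decoder paper specifies bilinear upsampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    # Nearest-neighbour 2x upsampling (a simplification; the Deep Decoder
    # uses bilinear upsampling).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def deep_decoder(z, weights):
    """z: (C, h, w) fixed random input; weights: list of (C_out, C_in)
    1x1-convolution matrices. Decoder-only: no encoder, no skips."""
    x = z
    for W in weights[:-1]:
        x = np.einsum("oc,chw->ohw", W, x)  # 1x1 conv = channel mixing
        x = upsample2x(x)
        x = np.maximum(x, 0.0)              # ReLU
    return np.einsum("oc,chw->ohw", weights[-1], x)  # final conv to 1 channel

L, C = 5, 20
weights = [rng.standard_normal((C, C)) * 0.1 for _ in range(L - 1)]
weights.append(rng.standard_normal((1, C)) * 0.1)

z = rng.standard_normal((C, 8, 8))          # fixed random input
out = deep_decoder(z, weights)              # 8 -> 16 -> 32 -> 64 -> 128

n_params = sum(W.size for W in weights)     # |theta| = O(L C^2): here 1,620
n_pixels = out.size                         # N = 128 * 128 = 16,384
print(n_params, n_pixels)
```

Even this tiny configuration is under-parameterised by an order of magnitude, which is precisely the architectural regularisation the definition describes.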

Example: Early Stopping in DIP: Practical Strategies

A DIP reconstruction of a $128 \times 128$ RF reflectivity map is run for 10,000 iterations. The PSNR curve peaks at iteration 3,200 ($\text{PSNR} = 28.5$ dB) and falls to 22.1 dB at iteration 10,000. How do you determine the optimal stopping point in practice (without ground truth)?

Deep Image Prior Reconstruction

Complexity: $O(T_{\max} \cdot C_{\text{fwd}})$ where $C_{\text{fwd}}$ is the cost of one forward+backward pass through the network. Typically $T_{\max} = 2{,}000$--$5{,}000$ iterations.
Input: Measurements $\mathbf{y} \in \mathbb{R}^M$, forward model $\mathbf{A}$, noise variance $\sigma^2$, network $f_\theta$
Output: Reconstruction $\hat{\mathbf{x}}$
1. Sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and fix it
2. Initialise $\theta$ randomly
3. for $t = 1, 2, \ldots, T_{\max}$ do
4. $\quad \hat{\mathbf{x}}_t \leftarrow f_\theta(\mathbf{z})$
5. $\quad \mathcal{L} \leftarrow \frac{1}{2}\|\mathbf{y} - \mathbf{A}\hat{\mathbf{x}}_t\|^2$
6. $\quad \theta \leftarrow \theta - \eta\,\nabla_\theta\,\mathcal{L}$
7. $\quad$ if $\mathcal{L} < M\sigma^2$ then break (discrepancy-principle stopping: the loss has reached the noise floor)
8. end for
9. return $\hat{\mathbf{x}} = f_\theta(\mathbf{z})$ (or running average of recent iterates)

Unlike supervised methods that amortise training cost over many test images, DIP requires per-image optimisation. This makes it slow ($\sim$1--5 minutes per image on GPU) but eliminates the need for any training data.
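The loop structure of the algorithm above can be sketched end-to-end on a toy problem. To keep the sketch short, a linear "network" $f_W(\mathbf{z}) = W\mathbf{z}$ stands in for the CNN (so it carries no convolutional prior); everything else --- the fixed random input, gradient steps on the weights only, and the $\mathcal{L} < M\sigma^2$ stopping rule --- mirrors steps 1--9:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: y = A x_true + noise, with M < N (underdetermined).
M, N, d = 60, 100, 10
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = rng.standard_normal(N)
sigma = 0.05
y = A @ x_true + sigma * rng.standard_normal(M)

z = rng.standard_normal(d)               # step 1: fixed random input
W = rng.standard_normal((N, d)) * 0.01   # step 2: random weight init
eta, T_max = 0.01, 5000

for t in range(T_max):
    x_hat = W @ z                        # step 4: "network" output
    r = y - A @ x_hat
    loss = 0.5 * r @ r                   # step 5: data-fit loss
    if loss < M * sigma**2:              # step 7: discrepancy stopping
        break
    W += eta * np.outer(A.T @ r, z)      # step 6: gradient step on W only

x_hat = W @ z
print(f"stopped at t={t}, loss={loss:.4f}, threshold={M * sigma**2:.4f}")
```

Run to convergence instead of breaking at step 7, this same loop would drive the loss toward zero and fit the noise --- which is the failure mode discussed next.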

Common Mistake: DIP Overfits to Noise Without Early Stopping

Mistake:

Running DIP optimisation until convergence and using the final iterate as the reconstruction.

Correction:

DIP will overfit to noise if run long enough. The final iterate perfectly reproduces the noisy measurements ($\mathbf{A}f_\theta(\mathbf{z}) = \mathbf{y}$) but amplifies noise in the null space of $\mathbf{A}$.

Always use early stopping, running-average smoothing, or the Deep Decoder to prevent overfitting. For RF imaging with unknown noise level, use cross-validation on held-out measurements.
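The held-out cross-validation rule mentioned above works as follows: fit the network to a random subset of the measurements and stop when the loss on the withheld measurements starts rising. The two loss traces below are synthetic stand-ins (hypothetical shapes, not real DIP runs) used only to illustrate the selection rule:

```python
import numpy as np

t = np.arange(1, 5001)
# Fit loss keeps decreasing; held-out loss is U-shaped, rising once the
# network begins fitting noise specific to the fitted measurements.
fit_loss = 50.0 * np.exp(-t / 500.0)
holdout_loss = 5.0 * np.exp(-t / 500.0) + 1e-4 * t

t_stop = t[np.argmin(holdout_loss)]  # stop at the held-out minimum
print(f"stop at iteration {t_stop}")
```

No ground truth is needed: the withheld measurements play the role of a validation set drawn from the same physical scene.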

Quick Check

In DIP, the random input $\mathbf{z}$ is:

(a) Optimised jointly with the network weights $\theta$

(b) Fixed throughout optimisation --- only $\theta$ is updated

(c) Set equal to the noisy measurement $\mathbf{y}$

(d) Learned from a separate training set

Why This Matters: DIP for RF Imaging Without Training Data

DIP is especially valuable for RF imaging because:

  1. No training data: RF reflectivity ground truth is almost never available. DIP reconstructs from a single measurement.

  2. Flexible forward model: The sensing matrix $\mathbf{A}$ can be any linear operator --- partial Fourier, diffraction tomography, MIMO radar --- without retraining.

  3. Complex-valued extension: DIP naturally handles complex RF signals by using a 2-channel (real/imaginary) output.
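The 2-channel convention in point 3 is a one-line recombination: the network emits a real array of shape $(2, H, W)$, and the complex reflectivity map is assembled from the two channels:

```python
import numpy as np

# Stand-in for a 2-channel network output f_theta(z) of shape (2, H, W).
out = np.random.default_rng(2).standard_normal((2, 64, 64))

# Channel 0 = real part, channel 1 = imaginary part.
x_complex = out[0] + 1j * out[1]
print(x_complex.shape, x_complex.dtype)
```

The data-fit loss is then evaluated on $|\mathbf{y} - \mathbf{A}\mathbf{x}|^2$ with complex arithmetic, while all network weights stay real.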

The main limitation is speed: per-image optimisation takes minutes, making DIP unsuitable for real-time RF imaging. However, it serves as an excellent baseline and can be combined with learned initialisation (meta-learning) for faster convergence.


Deep Image Prior (DIP)

A reconstruction method that uses the architecture of an untrained CNN as an implicit regulariser, optimising network weights to fit a single measurement with early stopping to prevent noise overfitting.

Related: Deep Decoder, Spectral Bias

Deep Decoder

An under-parameterised variant of DIP using a decoder-only architecture with far fewer parameters than pixels, eliminating the need for early stopping by preventing the network from memorising noise.

Related: Deep Image Prior (DIP)

Spectral Bias

The tendency of neural networks to learn low-frequency components of a target function before high-frequency components, arising from the eigenvalue structure of the Neural Tangent Kernel.

Related: Deep Image Prior (DIP)

Key Takeaway

Deep Image Prior uses the CNN architecture as an implicit prior, requiring no training data. Spectral bias causes signal to be learned before noise; early stopping acts as regularisation. The Deep Decoder eliminates early stopping via under-parameterisation but sacrifices some expressivity. DIP is especially valuable for RF imaging where ground-truth reflectivity data is scarce.