Score-Based Diffusion Models Recap

From Denoisers to Generative Priors: The Diffusion Revolution

Chapter 21 established the equivalence between denoising and score estimation: training a denoiser at noise level $\sigma$ is equivalent to learning the score function $\nabla_\mathbf{x}\log p_\sigma(\mathbf{x})$. Diffusion models exploit this connection to build powerful generative models by training a single noise-conditioned score network across all noise levels. In this chapter, we use these pretrained diffusion models as priors for solving inverse problems β€” the central theme of RF imaging.

The golden thread is: diffusion models provide the strongest learned priors available today, enabling state-of-the-art reconstruction quality, but at high computational cost. Understanding the tradeoffs between quality and speed is essential for determining when diffusion-based reconstruction is appropriate for RF applications.

Definition: The Score Function

The score function of a probability distribution $p(\mathbf{x})$ is the gradient of its log-density:

$$\mathbf{s}(\mathbf{x}) \triangleq \nabla_\mathbf{x} \log p(\mathbf{x}).$$

The score points in the direction of steepest ascent of the log-density. Unlike the density $p(\mathbf{x})$ itself, the score does not require computing the normalisation constant: if $p(\mathbf{x}) = \tilde{p}(\mathbf{x})/Z$, then $\nabla \log p = \nabla \log \tilde{p}$ since $\nabla \log Z = 0$.

This normalisation-free property is what makes score-based methods tractable in high dimensions, where computing $Z = \int \tilde{p}(\mathbf{x})\,d\mathbf{x}$ is intractable.
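To make the normalisation-free property concrete, here is a small numerical check in Python (NumPy); the quartic toy density below is an illustrative choice, not one used elsewhere in the chapter:

```python
import numpy as np

# Toy 1D illustration: the score of an unnormalised density equals the score of the
# normalised one, because grad log Z = 0.
def log_p_tilde(x):
    return -x**4 / 4 - x**2 / 2                     # unnormalised log-density, Z unknown

# Estimate Z numerically only so that we can also form the normalised log-density.
grid = np.linspace(-6.0, 6.0, 20001)
Z = np.exp(log_p_tilde(grid)).sum() * (grid[1] - grid[0])

def log_p(x):
    return log_p_tilde(x) - np.log(Z)               # normalised log-density

# Central finite differences of both versions agree at any point.
x, h = 0.7, 1e-5
score_tilde = (log_p_tilde(x + h) - log_p_tilde(x - h)) / (2 * h)
score_norm = (log_p(x + h) - log_p(x - h)) / (2 * h)
print(score_tilde, score_norm)                      # identical up to floating-point error
print(-x**3 - x)                                    # analytic score for comparison
```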

Definition: Denoising Score Matching (DSM)

Denoising score matching trains a network $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ to approximate the score by minimising

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{\sigma}\,\mathbb{E}_{\mathbf{x}_0 \sim p}\,\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})}\!\left[\left\|\mathbf{s}_\theta(\mathbf{x}_0 + \boldsymbol{\epsilon}, \sigma) + \frac{\boldsymbol{\epsilon}}{\sigma^2}\right\|^2\right].$$

The optimal score network satisfies $\mathbf{s}_\theta^*(\mathbf{x}, \sigma) = \nabla_\mathbf{x}\log p_\sigma(\mathbf{x})$, where $p_\sigma = p * \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ is the noise-convolved density. This confirms that score estimation is equivalent to denoising (Chapter 21).

The expectation over $\sigma$ is essential: we need the score at all noise levels, not just one. In practice, $\sigma$ is sampled from a geometric schedule or from a continuous distribution over noise levels.
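As a sanity check on the objective, the following NumPy sketch estimates the DSM loss by Monte Carlo for standard-normal data at a single fixed noise level (a simplification of the expectation over $\sigma$). In this toy case the score of $p_\sigma$ is known in closed form, and it attains a lower loss than a candidate that ignores the noise broadening:

```python
import numpy as np

rng = np.random.default_rng(0)

def dsm_loss(score_fn, x0, eps, sigma):
    """Monte-Carlo estimate of the DSM objective at one noise level."""
    return np.mean((score_fn(x0 + eps, sigma) + eps / sigma**2) ** 2)

sigma = 0.5
x0 = rng.standard_normal(500_000)             # data x0 ~ N(0, 1)
eps = sigma * rng.standard_normal(500_000)    # noise ~ N(0, sigma^2)

# For N(0,1) data, p_sigma = N(0, 1 + sigma^2), so the true score is -x / (1 + sigma^2).
true_score = lambda x, s: -x / (1 + s**2)
wrong_score = lambda x, s: -x                 # score of the clean density, not of p_sigma

print(dsm_loss(true_score, x0, eps, sigma))   # lower: DSM is minimised by the score of p_sigma
print(dsm_loss(wrong_score, x0, eps, sigma))  # higher, by roughly E||s_wrong - s_true||^2
```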


Definition: DDPM Forward Process

The forward process of a Denoising Diffusion Probabilistic Model (DDPM) gradually adds Gaussian noise over $T$ steps:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\bigl(\mathbf{x}_t;\, \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\, \beta_t\mathbf{I}\bigr),$$

where $\{\beta_t\}_{t=1}^T$ is the noise schedule with $0 < \beta_t < 1$. The marginal at time $t$ has a closed form:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\bigl(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1 - \bar{\alpha}_t)\mathbf{I}\bigr),$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Equivalently:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

At $t = T$ (with $\bar{\alpha}_T \approx 0$), $\mathbf{x}_T$ is approximately pure Gaussian noise.
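The closed-form marginal can be checked numerically: simulating the chain step by step and sampling $q(\mathbf{x}_T \mid \mathbf{x}_0)$ in one shot give matching statistics. The linear $\beta_t$ schedule below is a common choice, assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
beta = np.linspace(1e-4, 0.02, T)        # a common linear schedule (an assumption, not fixed by the text)
alpha_bar = np.cumprod(1.0 - beta)

x0 = np.ones(10_000)                     # a deterministic "signal" to diffuse

# (a) Step-by-step simulation of q(x_t | x_{t-1}) for t = 1, ..., T.
x = x0.copy()
for t in range(T):
    x = np.sqrt(1.0 - beta[t]) * x + np.sqrt(beta[t]) * rng.standard_normal(x.shape)

# (b) One-shot sample from the closed-form marginal q(x_T | x_0).
x_direct = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(x0.shape)

print(x.mean(), x.var())                 # both approximately 0 and 1
print(x_direct.mean(), x_direct.var())
print(alpha_bar[-1])                     # ~4e-5: x_T is essentially pure Gaussian noise
```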

Definition: DDPM Reverse Process

The reverse process starts from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively denoises:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\bigl(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2\mathbf{I}\bigr),$$

where the mean is parameterised via the noise-prediction network $\boldsymbol{\epsilon}_\theta$:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right).$$

The training objective is:

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2\right].$$

The connection to score matching: the noise prediction and the score are related by

$$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}.$$
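A minimal NumPy sketch of one reverse (ancestral sampling) step; `eps_theta` is a placeholder for a trained noise-prediction network, and $\sigma_t^2 = \beta_t$ is one common variance choice:

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_theta, beta, alpha_bar, rng):
    """One reverse step x_t -> x_{t-1}; eps_theta(x_t, t) stands in for a trained network."""
    alpha_t = 1.0 - beta[t]
    eps_hat = eps_theta(x_t, t)

    # Score of p_t recovered from the noise prediction: s = -eps / sqrt(1 - alpha_bar_t).
    score = -eps_hat / np.sqrt(1.0 - alpha_bar[t])

    # Posterior mean; algebraically identical to
    # mu = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t).
    mu = (x_t + beta[t] * score) / np.sqrt(alpha_t)

    sigma_t = np.sqrt(beta[t])                       # common choice sigma_t^2 = beta_t
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return mu + sigma_t * noise

# Sampling: draw x_T ~ N(0, I), then apply this step for t = T-1, T-2, ..., 0.
```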

Theorem: Tweedie's Formula

Let $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Then the posterior mean of $\mathbf{x}_0$ given $\mathbf{x}_t$ is:

$$\hat{\mathbf{x}}_0(\mathbf{x}_t) \triangleq \mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t] = \frac{1}{\sqrt{\bar{\alpha}_t}}\bigl(\mathbf{x}_t + (1 - \bar{\alpha}_t)\,\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)\bigr).$$

This connects the score function, the denoiser, and the posterior mean in a single identity.

Tweedie's formula says: to estimate $\mathbf{x}_0$ from a noisy observation $\mathbf{x}_t$, take the noisy observation, correct it using the score (which points toward high-density regions), and rescale. This is exactly what the denoiser does.
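In practice the score is accessed through the noise predictor. Substituting $\mathbf{s}_\theta = -\boldsymbol{\epsilon}_\theta/\sqrt{1-\bar{\alpha}_t}$ into Tweedie's formula gives $\hat{\mathbf{x}}_0 = \bigl(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\bigr)/\sqrt{\bar{\alpha}_t}$, which the sketch below implements (again with `eps_theta` standing in for a trained network):

```python
import numpy as np

def tweedie_x0_hat(x_t, t, eps_theta, alpha_bar):
    """Posterior-mean estimate of x_0 from x_t via Tweedie's formula.

    Substituting score = -eps_theta / sqrt(1 - alpha_bar_t) gives
    x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_theta(x_t, t)) / sqrt(alpha_bar_t).
    """
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_theta(x_t, t)) / np.sqrt(alpha_bar[t])
```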


Score Field Visualisation

Visualise the score field $\nabla\log p_\sigma(\mathbf{x})$ for a 2D Gaussian mixture at different noise levels. At low $\sigma$, the score arrows point sharply toward the nearest mode. At high $\sigma$, the modes merge and the field becomes smooth. Langevin dynamics following these arrows would generate samples from $p_\sigma$.
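For a Gaussian mixture the smoothed score is available in closed form, which is what a visualisation of this kind would evaluate. A NumPy sketch, with an illustrative two-component mixture (the means, weights, and unit component covariance are assumptions):

```python
import numpy as np

means = np.array([[-2.0, 0.0], [2.0, 0.0]])
weights = np.array([0.5, 0.5])

def score_field(x, sigma):
    """Exact score of p_sigma = mixture * N(0, sigma^2 I), at points x of shape (n, 2)."""
    var = 1.0 + sigma**2                                   # unit component covariance + added noise
    diffs = x[:, None, :] - means[None, :, :]              # (n, k, 2)
    log_comp = -0.5 * np.sum(diffs**2, axis=-1) / var + np.log(weights)
    resp = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)                # component responsibilities under p_sigma
    # grad log p_sigma(x) = sum_k resp_k(x) * (mu_k - x) / var
    return np.sum(resp[..., None] * -diffs, axis=1) / var

pts = np.array([[0.5, 0.0], [-0.5, 1.0], [3.0, -1.0]])
print(score_field(pts, sigma=0.3))   # arrows point sharply toward the nearest mode
print(score_field(pts, sigma=3.0))   # field is much weaker and smoother
```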


Example: Tweedie's Formula for a 1D Gaussian

Let $x_0 \sim \mathcal{N}(0, 1)$ and $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. Verify Tweedie's formula by computing $\mathbb{E}[x_0 \mid x_t]$ directly.
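One way to carry out the verification: since $x_0$ and $\epsilon$ are independent Gaussians, $(x_0, x_t)$ is jointly Gaussian with $\operatorname{Var}(x_t) = \bar{\alpha}_t + (1-\bar{\alpha}_t) = 1$ and $\operatorname{Cov}(x_0, x_t) = \sqrt{\bar{\alpha}_t}$, so the conditional-Gaussian formula gives

$$\mathbb{E}[x_0 \mid x_t] = \frac{\operatorname{Cov}(x_0, x_t)}{\operatorname{Var}(x_t)}\, x_t = \sqrt{\bar{\alpha}_t}\, x_t.$$

Tweedie's formula yields the same answer: $p_t = \mathcal{N}(0, 1)$, so $\nabla_{x_t}\log p_t(x_t) = -x_t$ and

$$\hat{x}_0 = \frac{x_t + (1-\bar{\alpha}_t)(-x_t)}{\sqrt{\bar{\alpha}_t}} = \frac{\bar{\alpha}_t\, x_t}{\sqrt{\bar{\alpha}_t}} = \sqrt{\bar{\alpha}_t}\, x_t.$$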

Forward Diffusion Process

Observe how a 1D signal is progressively corrupted as the diffusion time $t$ increases. The plot shows the signal at several intermediate steps, illustrating the transition from structured data to pure noise. Compare linear and cosine noise schedules: the cosine schedule preserves more signal structure at intermediate times.
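The two schedules are easiest to compare through $\bar{\alpha}_t$, which controls how much of the original signal survives at time $t$. The NumPy sketch below uses commonly cited parameterisations (a linear $\beta_t$ schedule from $10^{-4}$ to $0.02$, and the cosine schedule of Nichol and Dhariwal (2021) with offset $s = 0.008$); these settings are assumptions, not necessarily those of the plot above:

```python
import numpy as np

T = 1000
t = np.arange(T + 1)

# Linear schedule: beta_t increases linearly; alpha_bar_t is the running product.
beta_lin = np.linspace(1e-4, 0.02, T)
abar_lin = np.concatenate([[1.0], np.cumprod(1.0 - beta_lin)])

# Cosine schedule: alpha_bar_t is defined directly from a squared cosine.
s = 0.008
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
abar_cos = f / f[0]

# At mid-process the cosine schedule retains far more signal (~0.49 vs ~0.08 at t = T/2).
for frac in (0.25, 0.5, 0.75):
    k = int(frac * T)
    print(f"t = {k}: linear alpha_bar = {abar_lin[k]:.4f}, cosine alpha_bar = {abar_cos[k]:.4f}")
```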


Historical Note: From Thermodynamics to Image Generation

2015–2021

The idea of using a diffusion process for generative modelling was introduced by Sohl-Dickstein et al. (2015), inspired by non-equilibrium statistical mechanics. The approach remained largely dormant until Ho et al. (2020) demonstrated that a simple noise-prediction objective produced image quality rivalling GANs. Independently, Song and Ermon (2019) developed score-based generative models via Langevin dynamics. The unification of these perspectives through stochastic differential equations by Song et al. (2021) established the modern framework used throughout this chapter.


Score Function

The gradient of the log-density of a probability distribution: $\mathbf{s}(\mathbf{x}) = \nabla_\mathbf{x}\log p(\mathbf{x})$. The score encodes the data distribution without requiring the normalisation constant.

Related: Denoising Score Matching, Tweedie Formula

Tweedie's Formula

An identity relating the posterior mean $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$ to the score function of the noisy distribution $p_t(\mathbf{x}_t)$. In the DDPM setting: $\hat{\mathbf{x}}_0 = \bigl(\mathbf{x}_t + (1-\bar{\alpha}_t)\nabla\log p_t(\mathbf{x}_t)\bigr)/\sqrt{\bar{\alpha}_t}$.

Related: Score Function, DDPM Forward Process

Network Function Evaluation (NFE)

A single forward pass through the score network $\mathbf{s}_\theta(\mathbf{x}_t, t)$. The total number of NFEs determines the computational cost of diffusion-based reconstruction. Standard DDPM requires $T$ NFEs; DPS requires $\sim 2T$ due to the additional backpropagation step.

Related: DDPM Forward Process, Diffusion Posterior Sampling (DPS)

Common Mistake: Confusing Noise Prediction and Score Prediction

Mistake:

Treating $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ and $\mathbf{s}_\theta(\mathbf{x}_t, t)$ as the same quantity.

Correction:

They are related by a sign and a scaling factor: $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)/\sqrt{1-\bar{\alpha}_t}$. The noise predictor outputs the noise $\boldsymbol{\epsilon}$ that was added; the score points toward the data manifold. Mixing up the sign or forgetting the scaling factor produces reconstructions that diverge from the data.
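For standard-normal data both quantities are available in closed form, which gives a quick numerical check of the sign and scaling (the value of $\bar{\alpha}_t$ below is arbitrary):

```python
import numpy as np

abar_t = 0.3                       # arbitrary noise level with 0 < alpha_bar_t < 1
x_t = np.linspace(-2.0, 2.0, 5)    # a few evaluation points

# For x0 ~ N(0,1): x_t ~ N(0,1), so the true score of p_t is -x_t,
# and the ideal (Bayes-optimal) noise predictor is E[eps | x_t] = sqrt(1 - abar_t) * x_t.
true_score = -x_t
ideal_eps = np.sqrt(1.0 - abar_t) * x_t

# The conversion s = -eps / sqrt(1 - alpha_bar_t) recovers the score exactly.
print(np.allclose(-ideal_eps / np.sqrt(1.0 - abar_t), true_score))   # True
```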

Quick Check

Tweedie's formula gives $\hat{\mathbf{x}}_0 = \bigl(\mathbf{x}_t + (1-\bar{\alpha}_t)\nabla\log p_t(\mathbf{x}_t)\bigr)/\sqrt{\bar{\alpha}_t}$. At $t = 0$ (no noise, $\bar{\alpha}_0 = 1$), what does the formula reduce to?

$\hat{\mathbf{x}}_0 = \mathbf{x}_0$

$\hat{\mathbf{x}}_0 = \nabla\log p_0(\mathbf{x}_0)$

$\hat{\mathbf{x}}_0 = \mathbf{0}$

$\hat{\mathbf{x}}_0 = \mathbf{x}_0 + \nabla\log p_0(\mathbf{x}_0)$

Key Takeaway

Diffusion models combine three ingredients: (1) a forward process that gradually adds noise, (2) a noise-conditioned score network trained via denoising score matching, and (3) a reverse process that uses the learned score to denoise step by step. Tweedie's formula provides the bridge from the score at any noise level to a clean-image estimate, which is the key tool for incorporating measurements into the reverse process (Section 22.2).