Variational Autoencoder (VAE)

Definition:

Variational Autoencoder

A VAE consists of:

  • Encoder $q_\phi(\mathbf{z}|\mathbf{x})$: maps the input to a latent distribution
  • Decoder $p_\theta(\mathbf{x}|\mathbf{z})$: generates data from a latent sample
  • Objective (ELBO, maximised): $L = \mathbb{E}_{q}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$

The reparameterisation trick: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}$, with $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$.

Definition:

KL Divergence for Gaussians

For $q = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p = \mathcal{N}(0, I)$:

$$D_{\text{KL}} = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
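As an illustrative sketch (assuming PyTorch, with the encoder predicting `mu` and `logvar` = log(sigma^2)), the closed form translates directly into a few lines:

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form KL(q || p) for q = N(mu, diag(exp(logvar))) and p = N(0, I).

    Sums over latent dimensions, then averages over the batch.
    """
    # -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    kl_per_sample = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl_per_sample.mean()
```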

Definition:

Reparameterisation Trick

Instead of sampling $\mathbf{z} \sim q_\phi$ directly, sample $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$ and compute $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}$. This makes the sampling operation differentiable with respect to $\phi$.
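A minimal PyTorch sketch, assuming the encoder outputs `mu` and `logvar` = log(sigma^2):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Gradients flow into mu and logvar because eps carries all the randomness.
    """
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # eps ~ N(0, I), detached from the graph
    return mu + std * eps
```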

Definition:

Evidence Lower Bound (ELBO)

The ELBO is a lower bound on the log-likelihood:

$$\log p(\mathbf{x}) \ge \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$

The first term is reconstruction quality; the second is latent regularisation.

Definition:

Beta-VAE

Multiply the KL term by $\beta > 1$ to encourage more disentangled latent representations. Written as a loss to be minimised (the negative ELBO with the KL term reweighted):

$$L_{\beta} = \text{Recon} + \beta \cdot D_{\text{KL}}$$

Theorem: ELBO Derivation

Starting from Jensen's inequality applied to $\log p(\mathbf{x}) = \log \int p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}$:

$$\log p(\mathbf{x}) = \log \int \frac{p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\, q(\mathbf{z}|\mathbf{x})\, d\mathbf{z} \;\ge\; \int q(\mathbf{z}|\mathbf{x}) \log \frac{p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\, d\mathbf{z}$$

The gap equals $D_{\text{KL}}(q \,\|\, p(\mathbf{z}|\mathbf{x})) \ge 0$.

Maximising the ELBO simultaneously improves reconstruction and makes the approximate posterior closer to the true posterior.
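Both statements follow from writing the bound as an exact decomposition; since $\log p(\mathbf{x})$ does not depend on $\phi$, any increase in the ELBO must shrink the posterior gap:

```latex
\log p(\mathbf{x})
  = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\big[\log p_\theta(\mathbf{x}|\mathbf{z})\big]
      - D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big)}_{\text{ELBO}}
  \;+\; \underbrace{D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\big)}_{\text{gap}\;\ge\;0}
```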

Example: Implementing a VAE

Build a VAE for 28x28 grayscale images.
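A minimal fully-connected sketch in PyTorch, assuming 28x28 inputs flattened to 784 pixels; the hidden width (400) and latent dimension (20) are illustrative choices, not requirements. The decoder returns raw logits so the loss can use BCEWithLogitsLoss (see the common mistakes below).

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal fully-connected VAE for 28x28 grayscale images (784 pixels)."""

    def __init__(self, latent_dim: int = 20, hidden_dim: int = 400):
        super().__init__()
        # Encoder: image -> hidden -> (mu, logvar)
        self.enc = nn.Sequential(nn.Linear(784, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: z -> hidden -> raw pixel logits (no sigmoid; see the loss below)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 784),
        )

    def encode(self, x: torch.Tensor):
        h = self.enc(x.view(-1, 784))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu: torch.Tensor, logvar: torch.Tensor):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z: torch.Tensor):
        return self.dec(z)  # logits, shape (batch, 784)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```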

Example: VAE Loss Function

Implement the ELBO loss.
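One possible implementation of the (negative) ELBO in PyTorch, assuming the decoder-logits convention from the sketch above. With beta = 1 this is the standard VAE loss; beta > 1 gives the beta-VAE objective.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, x, mu, logvar, beta: float = 1.0):
    """Negative ELBO: reconstruction term + beta * KL term, averaged over the batch."""
    # Bernoulli reconstruction term: BCE on logits, summed over all pixels
    recon = F.binary_cross_entropy_with_logits(
        recon_logits, x.view(-1, 784), reduction="sum"
    )
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + beta * kl) / x.size(0)
```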

Example: Latent Space Interpolation

Interpolate between two images in the latent space.
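A sketch of latent-space interpolation, assuming a model with the encode/decode methods of the VAE sketch above; it interpolates between the two posterior means and decodes each intermediate code.

```python
import torch

@torch.no_grad()
def interpolate(model, x1, x2, steps: int = 10):
    """Linearly interpolate between two images in latent space and decode each point."""
    mu1, _ = model.encode(x1)   # use posterior means as latent codes
    mu2, _ = model.encode(x2)
    images = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * mu1 + t * mu2                 # convex combination of codes
        images.append(torch.sigmoid(model.decode(z)).view(-1, 28, 28))
    return torch.stack(images)                      # (steps, 1, 28, 28)
```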

VAE Latent Space Explorer

Explore the 2D latent space of a trained VAE.

KL vs Reconstruction Trade-off

See how beta affects the KL-reconstruction balance.

Generative Model Taxonomy

VAE, GAN, Diffusion, and Flow models with their key differences.

VAE Architecture

Encoder maps to latent distribution, reparameterisation trick enables gradient flow, decoder generates.

Quick Check

Why is the reparameterisation trick needed in VAEs?

  • To make sampling faster
  • To make the sampling operation differentiable with respect to encoder parameters
  • To reduce the latent dimension

Quick Check

What does the KL term in the VAE loss encourage?

  • Better reconstruction
  • The approximate posterior to be close to the prior (regularisation)
  • Faster training

Quick Check

In beta-VAE with beta > 1, what happens?

  • Better reconstruction quality
  • More disentangled latent space but blurrier reconstructions
  • Faster convergence

Common Mistake: KL Vanishing (Posterior Collapse)

Mistake:

The KL term drops to zero and the decoder ignores the latent code.

Correction:

Use KL annealing (warm up beta from 0 to 1), free bits, or cyclic annealing.
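For instance, a linear warm-up of the KL weight might look like the sketch below; the warm-up length is an arbitrary illustrative choice.

```python
def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """Linear KL annealing: ramp beta from 0 to 1 over the first warmup_steps updates."""
    return min(1.0, step / warmup_steps)

# In the training loop: loss = recon + kl_weight(step) * kl
```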

Common Mistake: BCE Loss Without Sigmoid

Mistake:

Using BCELoss on decoder output that is not constrained to [0, 1].

Correction:

Add sigmoid to the last decoder layer, or use BCEWithLogitsLoss.
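A sketch of the two options in PyTorch, with hypothetical logits and target tensors standing in for decoder outputs and pixel values:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 784)   # raw decoder outputs (hypothetical batch)
target = torch.rand(4, 784)    # pixel intensities in [0, 1]

# Option 1: squash outputs into [0, 1] with a sigmoid, then use BCE
recon_a = F.binary_cross_entropy(torch.sigmoid(logits), target, reduction="sum")

# Option 2 (more numerically stable): feed raw logits to the fused sigmoid + BCE
recon_b = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")
```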

Common Mistake: Predicting sigma Instead of log(sigma^2)

Mistake:

Predicting sigma directly, which requires a softplus (or similar) constraint to ensure positivity.

Correction:

Predict log(sigma^2) instead and recover the standard deviation as exp(0.5 * logvar); this is more numerically stable and needs no positivity constraint.
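A minimal sketch of the log-variance parameterisation (shapes are illustrative):

```python
import torch

logvar = torch.zeros(4, 20, requires_grad=True)  # predicted log(sigma^2); any real value is valid
std = torch.exp(0.5 * logvar)                    # sigma > 0 guaranteed, no softplus needed
```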

Key Takeaway

VAEs provide a principled probabilistic framework for generation. The ELBO balances reconstruction and regularisation. The reparameterisation trick makes training end-to-end differentiable.

Key Takeaway

Generative models learn to sample from the data distribution. VAEs are simple but produce blurry samples. GANs are sharp but unstable. Diffusion models offer the best quality but are slow.

Why This Matters: VAEs for Channel Model Generation

VAEs can learn to generate realistic wireless channel realisations from measured data. The latent space captures channel parameters (delay spread, angular spread) in a continuous representation, enabling interpolation between channel conditions.

Historical Note: VAE: Probabilistic Deep Learning

2013

Kingma and Welling introduced the VAE in 2013, unifying variational inference with deep learning. The reparameterisation trick was the key insight enabling backpropagation through stochastic layers.

Historical Note: GANs: Adversarial Training

2014

Goodfellow et al. introduced GANs in 2014, training a generator against a discriminator. The resulting min-max game produces sharp samples but is notoriously difficult to train.

VAE

Variational Autoencoder: generative model that learns a latent space via variational inference.

ELBO

Evidence Lower Bound: the objective maximised in VAE training. Lower bound on log-likelihood.

KL Divergence

Kullback-Leibler divergence: measures how one distribution differs from another.

Diffusion Model

Generative model that learns to reverse a gradual noising process.

GAN

Generative Adversarial Network: generator and discriminator trained in a min-max game.

Generative Model Comparison

| Model | Training | Sample Quality | Diversity | Speed |
|---|---|---|---|---|
| VAE | Stable (ELBO) | Blurry | High | Fast |
| GAN | Unstable (adversarial) | Sharp | Mode collapse risk | Fast |
| Diffusion (DDPM) | Stable (denoising) | Best | High | Slow (iterative) |
| Flow Matching | Stable (ODE) | High | High | Medium |