Transform Coding and Quantization

From Theory to JPEG

Rate-distortion theory tells us what is achievable; this section shows how it is achieved in practice. The dominant architecture in modern lossy compression is transform coding: apply a unitary transform (DCT, wavelet) to decorrelate the source, quantize each coefficient independently, then entropy-code the quantized values. This three-stage pipeline approaches the rate-distortion bound for Gaussian sources and is the foundation of JPEG, JPEG 2000, H.264/AVC, HEVC, and essentially all image and video compression standards.

Definition: Transform Coding

Transform coding consists of three stages:

  1. Transform: Apply a unitary (orthogonal) transform $\mathbf{U}$ to decorrelate the source: $\mathbf{Y} = \mathbf{U}\mathbf{X}$. The transform coefficients $Y_1, \ldots, Y_k$ are (approximately) independent with variances $\sigma_1^2, \ldots, \sigma_k^2$.
  2. Quantize: Apply scalar quantization to each coefficient $Y_i$ independently, allocating bits according to the variance (reverse waterfilling).
  3. Entropy code: Encode the quantized coefficients losslessly using arithmetic coding or Huffman coding.

The key insight: unitary transforms preserve distortion ($\|\mathbf{X} - \hat{\mathbf{X}}\|^2 = \|\mathbf{Y} - \hat{\mathbf{Y}}\|^2$), so we can work in the transform domain without penalty. If the transform achieves perfect decorrelation (the KLT for Gaussian sources), the problem reduces to independent scalar quantization, which we know how to do optimally.
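To make the pipeline concrete, here is a minimal NumPy/SciPy sketch (not from the original text) that runs all three stages on a synthetic AR(1) Gaussian source, using the orthonormal DCT as the decorrelating transform; the block size, quantizer step, and source parameters are illustrative choices. It also checks numerically that the unitary transform preserves squared error.

```python
import numpy as np
from scipy.fft import dct, idct  # orthonormal DCT-II and its inverse

rng = np.random.default_rng(0)

# Synthetic correlated Gaussian source: AR(1) with coefficient 0.9,
# reshaped into blocks of k = 8 samples.
k, n_blocks = 8, 10_000
x = np.zeros(k * n_blocks)
for i in range(1, x.size):
    x[i] = 0.9 * x[i - 1] + rng.standard_normal()
X = x.reshape(n_blocks, k)

# Stage 1 -- transform: the orthonormal DCT approximately decorrelates
# AR(1) sources (it is close to the KLT for high correlation).
Y = dct(X, norm="ortho", axis=1)
print("coefficient variances:", Y.var(axis=0).round(2))  # strongly unequal

# Stage 2 -- quantize: uniform scalar quantizer with a common step here;
# a real coder would pick per-coefficient steps via reverse waterfilling.
step = 0.5
Y_hat = step * np.round(Y / step)

# Stage 3 -- entropy coding of the integer indices round(Y / step) would go here.

# Unitary transforms preserve squared error, so distortion can be
# measured in either domain:
X_hat = idct(Y_hat, norm="ortho", axis=1)
print(np.mean((X - X_hat) ** 2), np.mean((Y - Y_hat) ** 2))  # equal up to float error
```

With `norm="ortho"` the DCT matrix is orthogonal, which is exactly the property the distortion-preservation identity above relies on.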

Definition: Lloyd-Max Quantizer

The Lloyd-Max algorithm designs an optimal scalar quantizer for a given source distribution by iterating between two steps:

  1. Nearest-neighbor condition: Given reconstruction points $\hat{y}_1, \ldots, \hat{y}_M$, set the decision boundaries to the midpoints: $b_i = (\hat{y}_i + \hat{y}_{i+1})/2$.
  2. Centroid condition: Given decision regions, set each reconstruction point to the centroid of its region: $\hat{y}_i = \mathbb{E}[Y \mid Y \in \text{region}_i]$.

Iterate until convergence. The algorithm converges to a local optimum of the distortion $\mathbb{E}[(Y - Q(Y))^2]$.
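A minimal sketch of the iteration follows. It designs the quantizer on empirical samples, which turns the centroid step into a sample mean (in one dimension this is exactly k-means); the function name and quantile initialization are illustrative choices, not part of the original text.

```python
import numpy as np

def lloyd_max(samples, M, n_iters=100):
    """Design an M-level Lloyd-Max scalar quantizer on empirical samples."""
    # Initialize reconstruction points at data quantiles (keeps regions nonempty).
    points = np.quantile(samples, (np.arange(M) + 0.5) / M)
    for _ in range(n_iters):
        # 1. Nearest-neighbor condition: decision boundaries at midpoints.
        bounds = (points[:-1] + points[1:]) / 2
        # 2. Centroid condition: each point moves to the mean of its region.
        region = np.searchsorted(bounds, samples)
        points = np.array([samples[region == i].mean() for i in range(M)])
    return points, bounds

rng = np.random.default_rng(1)
y = rng.standard_normal(200_000)
points, _ = lloyd_max(y, M=4)
print(points.round(3))  # ~[-1.51, -0.453, 0.453, 1.51]: the classic 2-bit Gaussian design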

Lloyd-Max without entropy coding achieves distortion proportional to $\sigma^2 2^{-2R}$ for Gaussian sources, matching the rate-distortion slope but with a constant-factor gap. Adding entropy coding (e.g., a uniform quantizer followed by an arithmetic coder) closes most of this gap.

Theorem: Entropy-Coded Scalar Quantization Gap (Gish-Pierce)

For a continuous source with finite differential entropy $h = h(X)$, a uniform scalar quantizer followed by an optimal entropy coder achieves, at high rate, distortion $$D_{\mathrm{ECSQ}}(R) \approx \frac{2^{2h} \cdot 2^{-2R}}{12},$$ a factor of $\frac{\pi e}{6}$ above the Shannon lower bound $\frac{2^{2h} \cdot 2^{-2R}}{2\pi e}$. For a Gaussian source, where that lower bound is tight, $$D_{\mathrm{ECSQ}} = \frac{\pi e}{6} \cdot D_{RD},$$ i.e. about 1.53 dB above the rate-distortion function.

The 1.53 dB gap (a factor of $\pi e/6 \approx 1.42$) comes from the difference between a uniform quantization cell (a cube) and the optimal quantization cell (a sphere, in high dimensions). This is the "shaping loss": the price of using scalar quantization instead of vector quantization. It can be reduced by dithering (adding controlled randomness) or by using lattice quantizers with better shaping.
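Dithering is easy to demonstrate numerically. The sketch below (parameters illustrative) applies subtractive dither: a uniform random offset, known to both encoder and decoder, is added before quantization and subtracted after. The resulting error is uniform on $(-\Delta/2, \Delta/2)$ with variance $\Delta^2/12$ and uncorrelated with the source, whatever the source distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
step = 0.5

# Subtractive dither: u is pseudo-random and shared by encoder and decoder.
u = rng.uniform(-step / 2, step / 2, size=x.size)
x_hat = step * np.round((x + u) / step) - u

err = x_hat - x
print(err.var(), step**2 / 12)             # error variance = Delta^2 / 12
print(np.corrcoef(x, err)[0, 1].round(4))  # ~0: error uncorrelated with the source
```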

Example: The JPEG Compression Pipeline

Describe how JPEG implements transform coding and identify the rate-distortion implications at each stage.
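One way to make the stages concrete is a runnable sketch, given below under stated assumptions: the quantization matrix is the example luminance table from the JPEG standard (ISO/IEC 10918-1, Annex K), the DCT normalization is simplified to the orthonormal convention, and the entropy stage (zigzag scan + run-length + Huffman) is elided.

```python
import numpy as np
from scipy.fft import dctn, idctn

# JPEG's example luminance quantization table (ISO/IEC 10918-1, Annex K).
# Fine steps for low frequencies (top-left), coarse for high frequencies:
# a fixed, perceptually tuned bit allocation.
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=float)

def jpeg_block_roundtrip(block, scale=1.0):
    """Quantize and reconstruct one 8x8 pixel block, JPEG-style.

    Stage 3 (zigzag + run-length + Huffman) is elided; `scale` plays
    the role of the quality setting.
    """
    Y = dctn(block - 128.0, norm="ortho")   # stage 1: 2-D DCT on level-shifted pixels
    q = np.round(Y / (Q * scale))           # stage 2: per-coefficient uniform quantizer
    return idctn(q * (Q * scale), norm="ortho") + 128.0

rng = np.random.default_rng(3)
block = rng.integers(0, 256, (8, 8)).astype(float)
print(np.mean((block - jpeg_block_roundtrip(block)) ** 2))  # per-block MSE
```

Rate-distortion reading: the table is a fixed bit allocation in the transform domain. Coarse steps on high-frequency coefficients spend few bits where distortion is perceptually tolerable, and scaling the whole table moves the operating point along the R-D curve.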

ECSQ vs. Rate-Distortion Bound

Compare the distortion-rate curves of: (i) the Gaussian rate-distortion bound, (ii) entropy-coded uniform quantizer (ECSQ), and (iii) fixed-rate uniform quantizer. Observe the 1.53 dB gap between ECSQ and the R-D bound.

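Since the plot itself is not reproduced here, the following sketch tabulates the three curves for a unit-variance Gaussian at rates $R = 1$ to $6$ bits: the bound $D(R) = 2^{-2R}$, the high-rate ECSQ approximation $(\pi e/6)\,2^{-2R}$, and a simulated fixed-rate uniform quantizer with its step chosen by grid search. Sample size and step grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.standard_normal(500_000)  # unit-variance Gaussian source

def uniform_mse(samples, M, step):
    """MSE of a fixed-rate, M-level uniform mid-rise quantizer."""
    idx = np.clip(np.floor(samples / step) + M // 2, 0, M - 1)
    rec = (idx - M // 2 + 0.5) * step
    return np.mean((samples - rec) ** 2)

print("gap factor pi*e/6 =", 10 * np.log10(np.pi * np.e / 6), "dB")  # ~1.53 dB
for R in range(1, 7):
    d_rd = 2.0 ** (-2 * R)                    # Gaussian R-D bound
    d_ecsq = (np.pi * np.e / 6) * d_rd        # high-rate ECSQ approximation
    d_fix = min(uniform_mse(y, 2 ** R, s)     # best fixed-rate uniform quantizer
                for s in np.linspace(0.05, 4.0, 200))
    print(f"R={R}  D_RD={d_rd:.5f}  D_ECSQ={d_ecsq:.5f}  D_fixed={d_fix:.5f}")
```

Note that the $\pi e/6$ expression is a high-rate asymptotic, so the comparison is most meaningful toward $R = 6$.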

Blahut-Arimoto Algorithm for $R(D)$

Complexity: $O(|\mathcal{X}| \cdot |\hat{\mathcal{X}}|)$ per iteration. Convergence is geometric.
Input: Source distribution $P_X$, distortion measure $d(x, \hat{x})$, Lagrange multiplier $\lambda > 0$
Output: Optimal test channel $P^*_{\hat{X}|X}$ and rate $R(D)$
Initialize: $q(\hat{x}) = 1/|\hat{\mathcal{X}}|$ (uniform marginal)
Repeat until convergence:
1. Update test channel:
$$P_{\hat{x}|x} = \frac{q(\hat{x}) \exp(-\lambda \cdot d(x, \hat{x}))}{\sum_{\hat{x}'} q(\hat{x}') \exp(-\lambda \cdot d(x, \hat{x}'))}$$
2. Update marginal:
$$q(\hat{x}) = \sum_x P_X(x) P_{\hat{x}|x}$$
3. Compute
$$D = \sum_{x, \hat{x}} P_X(x) P_{\hat{x}|x}\, d(x, \hat{x}), \qquad R = \sum_{x, \hat{x}} P_X(x) P_{\hat{x}|x} \log \frac{P_{\hat{x}|x}}{q(\hat{x})}$$
Varying $\lambda$ traces out the $R(D)$ curve.

The Blahut-Arimoto algorithm alternates between optimizing the test channel (for fixed marginal) and updating the marginal (for fixed test channel). This is the same alternating structure as the Blahut-Arimoto algorithm for channel capacity, another manifestation of the source-channel duality. The parameter $\lambda$ controls the operating point on the $R(D)$ curve: large $\lambda$ gives low distortion (high rate), small $\lambda$ gives high distortion (low rate).
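Below is a direct NumPy transcription of the algorithm box; the function name and the binary test case are mine. For a uniform binary source with Hamming distortion, the output can be checked against the known closed form $R(D) = 1 - h_2(D)$.

```python
import numpy as np

def blahut_arimoto_rd(p_x, d, lam, n_iters=500):
    """One (R, D) point for a discrete source; vary lam to trace the curve.

    p_x : source pmf, shape (n,)
    d   : distortion matrix d[x, x_hat], shape (n, m)
    lam : Lagrange multiplier; large lam -> low distortion, high rate
    """
    n, m = d.shape
    q = np.full(m, 1.0 / m)                    # uniform reconstruction marginal
    for _ in range(n_iters):
        w = q * np.exp(-lam * d)               # unnormalized test channel
        P = w / w.sum(axis=1, keepdims=True)   # P[x, :] = P(x_hat | x)
        q = p_x @ P                            # marginal induced by the channel
    joint = p_x[:, None] * P
    D = np.sum(joint * d)
    R = np.sum(joint * np.log2(P / q))         # mutual information, in bits
    return R, D

# Binary uniform source, Hamming distortion: closed form R(D) = 1 - h2(D).
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
for lam in (1.0, 2.0, 4.0, 8.0):
    R, D = blahut_arimoto_rd(p_x, d, lam)
    h2 = -D * np.log2(D) - (1 - D) * np.log2(1 - D)
    print(f"lam={lam:>4}: D={D:.4f}, R={R:.4f}, 1-h2(D)={1 - h2:.4f}")
```

Sweeping `lam` from small to large moves along the curve from high distortion at low rate toward zero distortion at the source entropy, as the text above describes.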

Historical Note: Blahut and Arimoto, Two Independent Discoveries (1972)

Richard Blahut (Bell Labs) and Suguru Arimoto (University of Tokyo) independently discovered the alternating optimization algorithm for computing channel capacity in 1972. Blahut extended it to rate-distortion in the same paper. The algorithm is a special case of coordinate descent (or block relaxation) on a convex objective, which guarantees convergence. It remains the standard numerical method for computing $R(D)$ for discrete sources, and its structure has influenced the design of iterative algorithms in communications (turbo decoding, expectation-maximization) and machine learning (variational inference).

Practical Compression Standards and Their R-D Performance

| Standard   | Transform                      | Quantization              | Entropy Coder               | Gap to $R(D)$ |
|------------|--------------------------------|---------------------------|-----------------------------|---------------|
| JPEG       | 8×8 DCT                        | Uniform (per-coefficient) | Huffman                     | ~5 dB         |
| JPEG 2000  | Wavelet (9/7)                  | Deadzone uniform          | EBCOT (embedded arithmetic) | ~3 dB         |
| H.264/AVC  | 4×4/8×8 DCT + prediction       | Uniform                   | CABAC (arithmetic)          | ~2-4 dB       |
| HEVC/H.265 | Variable-size DCT + prediction | Uniform                   | CABAC                       | ~1.5-3 dB     |
| VVC/H.266  | Multi-type transform           | Dependent scalar          | CABAC                       | ~1-2.5 dB     |

Transform coding

A lossy compression architecture: apply a decorrelating transform (DCT, wavelet, KLT), quantize the transform coefficients, and entropy-code the result. The foundation of JPEG, HEVC, and nearly all practical image/video compression.

Shaping loss

The gap between scalar quantization and the rate-distortion bound, arising from the suboptimality of cubic quantization cells vs. spherical ones. For uniform ECSQ on Gaussian sources: 1.53 dB (a factor of $\pi e/6$).

Key Takeaway

Transform coding (transform + quantize + entropy code) is the practical realization of rate-distortion theory. The KLT decorrelates optimally, reducing the problem to independent scalar quantization with reverse waterfilling. Entropy-coded scalar quantization operates within 1.53 dB of the Gaussian R-D bound. Modern codecs (HEVC, VVC) close much of this gap through prediction, variable-size transforms, and context-adaptive entropy coding. The Blahut-Arimoto algorithm computes R(D)R(D) numerically for any discrete source.