Maxwell-Boltzmann Shaping

From the 1.53 dB Ceiling to a Constructive Recipe

Chapter 4 told us, with a beautiful entropy-power argument, that on a bounded 2D lattice we leave up to a factor of $\pi e / 6$ in power (about $1.53$ dB) on the table by transmitting uniform QAM. Chapter 9 showed that 400G coherent optical has actually claimed most of that gain in the field. But how do you claim it? What is the concrete input distribution that the system designer should target?

The answer is the Maxwell-Boltzmann (MB) distribution $p_\lambda(x) \propto \exp(-\lambda |x|^2)$. It is the minimum-change, most natural recipe: keep the same QAM grid, just weight inner points exponentially more than outer points. The parameter $\lambda$ is the shaping knob: zero recovers uniform QAM, and large $\lambda$ concentrates the mass on the innermost constellation points. For every operating SNR there is a unique optimal $\lambda^\star$ that extracts the maximum shaping gain.

This is not an ad-hoc choice. MB is the unique max-entropy distribution on a finite alphabet subject to an average-power constraint. It comes from exactly the same variational principle that gives the Gaussian on $\mathbb{R}^n$ (second-moment constraint), the geometric distribution on $\mathbb{N}$ (mean constraint), and the exponential on $\mathbb{R}_+$ (mean constraint). The KKT conditions yield the exponential form directly. The Gaussian capacity-achieving theorem on AWGN thus maps, lattice by lattice, onto MB as the finite-constellation analogue.

Two payoffs from this section:

  1. Theoretical: a complete derivation of MB from Lagrangian KKT, and a numerical read of MB entropy vs $\lambda$ for 16-/64-/256-QAM.
  2. Operational: a plot of achievable rates — Shannon vs uniform QAM vs MB-shaped QAM — showing where the 1.53 dB is recovered and where it isn't (hint: at low SNR, shaping barely helps).

Definition:

Maxwell-Boltzmann Distribution on a Constellation

Let $\mathcal{X} \subset \mathbb{C}$ be a finite constellation (e.g., square $M$-QAM). For any $\lambda \ge 0$, the Maxwell-Boltzmann (MB) distribution on $\mathcal{X}$ with parameter $\lambda$ is

$$p_\lambda(x) = \frac{\exp(-\lambda |x|^2)}{Z(\lambda)}, \qquad Z(\lambda) = \sum_{x' \in \mathcal{X}} \exp(-\lambda |x'|^2), \qquad x \in \mathcal{X}.$$

Here $Z(\lambda)$ is the partition function. The expected energy under $p_\lambda$ is

$$\mathbb{E}_{p_\lambda}[|X|^2] = -\frac{Z'(\lambda)}{Z(\lambda)} = \frac{\sum_x |x|^2 \exp(-\lambda |x|^2)}{\sum_x \exp(-\lambda |x|^2)}$$

and its entropy (in bits) is

$$H(p_\lambda) = \frac{\lambda \, \mathbb{E}_{p_\lambda}[|X|^2] + \ln Z(\lambda)}{\ln 2}.$$

Limits. As $\lambda \to 0$, $p_\lambda$ tends to the uniform distribution; as $\lambda \to \infty$, $p_\lambda$ concentrates on the minimum-energy points (for square QAM, the four innermost points). The map $E \mapsto \lambda(E)$ from average energy to shaping parameter is strictly decreasing and is a bijection from $(E_{\min}, E_{\max}]$ onto $[0, \infty)$, where $E_{\max}$ is the energy of the uniform distribution and $E_{\min}$ is the smallest symbol energy.
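The definition translates almost line for line into a few lines of NumPy. The sketch below is illustrative only: the helper names `qam_points` and `mb_stats` are hypothetical, and the odd-integer grid $\{\pm 1, \pm 3, \dots\}$ per dimension is an assumed normalisation. It also checks the closed-form entropy against the direct sum $-\sum_x p_\lambda(x) \log_2 p_\lambda(x)$.

```python
import numpy as np

def qam_points(m):
    """Square M-QAM on the odd-integer grid {±1, ±3, ...} per dimension (assumed scaling)."""
    side = int(round(np.sqrt(m)))
    if side * side != m:
        raise ValueError("m must be a perfect square")
    levels = np.arange(-(side - 1), side, 2)
    return (levels[:, None] + 1j * levels[None, :]).ravel()

def mb_stats(points, lam):
    """Return (pmf, mean energy, entropy in bits) of the MB distribution."""
    energy = np.abs(points) ** 2
    # Shift by the minimum energy so the exponentials cannot underflow to all zeros.
    shifted = np.exp(-lam * (energy - energy.min()))
    pmf = shifted / shifted.sum()
    log_z = np.log(shifted.sum()) - lam * energy.min()   # ln Z(lambda)
    mean_energy = float(pmf @ energy)
    entropy_bits = float((lam * mean_energy + log_z) / np.log(2))
    return pmf, mean_energy, entropy_bits

points = qam_points(16)
pmf, mean_energy, entropy_bits = mb_stats(points, 0.1)
direct_entropy = float(-(pmf * np.log2(pmf)).sum())
print(mean_energy, entropy_bits, direct_entropy)
```

At `lam = 0` the same function returns the uniform distribution, with energy 10 and entropy 4 bits for 16-QAM on this grid.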

The MB distribution is the discrete finite-alphabet analogue of the 2D Gaussian on $\mathbb{R}^2$: both are proportional to $\exp(-\lambda r^2)$ up to normalisation, and both emerge from the same Lagrangian stationary-point equation. For $M \to \infty$ with a fixed average power, MB on the $M$-QAM lattice converges weakly to the continuous Gaussian.


Theorem: Maxwell-Boltzmann is Capacity-Achieving on a Finite Constellation with Average-Power Constraint

Let $\mathcal{X} \subset \mathbb{C}$ be a finite constellation and let $E > 0$ be an average-power budget. Consider the AWGN channel $Y = X + W$, $W \sim \mathcal{CN}(0, N_0)$, with the input restricted to $\mathcal{X}$ and subject to $\mathbb{E}[|X|^2] \le E$. At high SNR, among all probability mass functions $p$ on $\mathcal{X}$ with $\sum_x p(x) |x|^2 \le E$, the mutual information $I(X;Y)$ is maximised by the Maxwell-Boltzmann distribution $p_\lambda$ with $\lambda = \lambda(E) > 0$ chosen so that $\mathbb{E}_{p_\lambda}[|X|^2] = E$.

The maximum entropy (in bits) is $H^\star(E) = \log_2 e \cdot \bigl[\lambda E + \ln Z(\lambda)\bigr]$.

At high SNR, $I(X;Y) \to H(X)$ because $H(X \mid Y) \to 0$. So maximising mutual information reduces to maximising entropy at fixed energy, a textbook Lagrangian problem. The KKT stationarity equation gives an exponential form $p(x) \propto \exp(-\lambda |x|^2)$, the MB distribution. The same Lagrangian machinery on a continuous alphabet $\mathbb{R}$ with a second-moment constraint yields the Gaussian.
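The Lagrangian step can be written out explicitly. The following is a sketch of the standard argument, working in nats; the multiplier name $\nu$ for the normalisation constraint is a notational assumption, not something fixed earlier in the text.

```latex
\begin{aligned}
L(p,\lambda,\nu) &= -\sum_{x\in\mathcal{X}} p(x)\ln p(x)
  \;-\; \lambda\Bigl(\sum_{x} p(x)\,|x|^2 - E\Bigr)
  \;-\; \nu\Bigl(\sum_{x} p(x) - 1\Bigr), \qquad \lambda \ge 0, \\[4pt]
\frac{\partial L}{\partial p(x)} &= -\ln p(x) - 1 - \lambda |x|^2 - \nu = 0
  \;\;\Longrightarrow\;\; p(x) = e^{-1-\nu}\, e^{-\lambda |x|^2}, \\[4pt]
\sum_{x} p(x) = 1 &\;\;\Longrightarrow\;\; e^{-1-\nu} = \frac{1}{Z(\lambda)},
  \qquad p_\lambda(x) = \frac{e^{-\lambda |x|^2}}{Z(\lambda)}, \\[4pt]
H(p_\lambda) &= -\sum_{x} p_\lambda(x)\bigl(-\lambda |x|^2 - \ln Z(\lambda)\bigr)
  = \lambda E + \ln Z(\lambda) \quad \text{(nats)}.
\end{aligned}
```

Complementary slackness, $\lambda\bigl(\sum_x p(x)|x|^2 - E\bigr) = 0$, says that either the energy constraint is tight or $\lambda = 0$ (the uniform distribution, which is the solution whenever $E \ge E_{\max}$). Because entropy is strictly concave and the constraints are linear in $p$, the stationary point is the unique maximiser. Dividing the last line by $\ln 2$ gives the entropy in bits.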

Note the high-SNR qualifier: at finite SNR the optimum $p^\star$ deviates slightly from MB (the Blahut-Arimoto iteration converges to a different vector because $H(X \mid Y) \neq 0$). In practice MB is typically a very good approximation, and in the high-SNR limit it is exactly optimal.


Maxwell-Boltzmann Distribution on $M$-QAM

Visualisation of the MB distribution on a square $M$-QAM grid. Each constellation point is drawn as a disk whose area is proportional to the MB probability $p_\lambda(x) \propto \exp(-\lambda |x|^2)$. Move the $\lambda$ slider: at $\lambda = 0$ all disks are equal (uniform); as $\lambda$ grows, outer (high-energy) points shrink and inner points grow, producing a Gaussian-like envelope on the lattice. Observe that the effect is barely visible for very small $\lambda$ (the MB distribution is locally a smooth perturbation of uniform), becomes pronounced at $\lambda \approx 0.1$, and collapses onto the inner ring for large $\lambda$. Operating range in practice: $\lambda \in [0.02, 0.15]$ for 64-QAM at SNR between 12 and 25 dB.


Example: MB Entropy for 64-QAM at $\lambda = 0.1$

Compute the entropy $H(p_\lambda)$ in bits/symbol of the MB distribution on 64-QAM (square odd-integer grid with spacing 2, points at $\{\pm 1, \pm 3, \pm 5, \pm 7\}$ per dimension) with $\lambda = 0.1$. Compare to the uniform-QAM entropy of 6 bits/symbol.
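One way to work the example numerically is to enumerate the 64 points and evaluate the entropy sum directly. This is a minimal sketch under the grid stated in the example; the variable names are illustrative.

```python
import numpy as np

lam = 0.1
levels = np.array([-7, -5, -3, -1, 1, 3, 5, 7], dtype=float)

# Squared norms |x|^2 of all 64 constellation points.
xi, xq = np.meshgrid(levels, levels)
energy = (xi ** 2 + xq ** 2).ravel()

# MB probabilities: exp(-lam * |x|^2), normalised by the partition function.
weights = np.exp(-lam * energy)
pmf = weights / weights.sum()

entropy_bits = float(-(pmf * np.log2(pmf)).sum())
mean_energy = float(pmf @ energy)
uniform_energy = float(energy.mean())

print(f"H = {entropy_bits:.3f} bits, E = {mean_energy:.2f} (uniform: {uniform_energy:.0f})")
```

On this grid the entropy comes out at roughly 4.4 bits/symbol, about 1.6 bits below uniform 64-QAM, while the mean energy drops from 42 to about 10. So $\lambda = 0.1$ is already a fairly aggressive setting for 64-QAM.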

Shaping Gain vs SNR: Shannon, Uniform QAM, and MB-Shaped QAM

Achievable rate (bits/2D symbol) versus $\text{SNR}$ in dB for three input distributions: (i) Shannon bound $\log_2(1 + \text{SNR})$ (continuous Gaussian input); (ii) uniform $M$-QAM; (iii) optimally MB-shaped $M$-QAM (i.e., $\lambda^\star$ chosen per SNR to maximise mutual information). The shaping gain $\gamma_s(\text{SNR}) = \text{SNR}_{\rm unif} - \text{SNR}_{\rm MB}$ (in dB at a fixed target rate) grows from $\approx 0.1$ dB at low SNR to $\approx 1.5$ dB at the high-SNR "knee" of each QAM curve. The optimally-shaped curve asymptotically approaches Shannon within $0.1$ dB for large $M$. Toggle the 256-QAM curve to see how larger constellations close the gap faster but at higher SNR.
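A single point on such a plot can be reproduced with a Monte Carlo estimate of $I(X;Y)$. The sketch below is an illustration, not the code behind the figure: it assumes $N_0 = 1$, rescales the constellation so the average symbol energy equals the linear SNR, sweeps $\lambda$ over a coarse hypothetical grid (defined on the unscaled odd-integer lattice), and reuses the same noise samples for every $\lambda$ so the comparison is not blurred by sampling noise.

```python
import numpy as np

def qam_points(m):
    """Square M-QAM on the odd-integer grid (assumed scaling before normalisation)."""
    side = int(round(np.sqrt(m)))
    levels = np.arange(-(side - 1), side, 2)
    return (levels[:, None] + 1j * levels[None, :]).ravel()

def mb_pmf(points, lam):
    energy = np.abs(points) ** 2
    w = np.exp(-lam * (energy - energy.min()))
    return w / w.sum()

def mutual_information(points, pmf, snr_lin, noise):
    """Monte Carlo estimate of I(X;Y) in bits for Y = X + W with N0 = 1."""
    # Scale the grid so that the average symbol energy equals snr_lin.
    scale = np.sqrt(snr_lin / (pmf @ np.abs(points) ** 2))
    x = scale * points
    y = x[:, None] + noise[None, :]                       # shape (M, N)
    d2 = np.abs(y[:, :, None] - x[None, None, :]) ** 2    # shape (M, N, M)
    # log p(x') + log p(y|x'), up to the common constant -ln(pi)
    a = np.log(pmf)[None, None, :] - d2
    a_max = a.max(axis=2)
    log_den = a_max + np.log(np.exp(a - a_max[:, :, None]).sum(axis=2))
    log_num = -np.abs(noise) ** 2                         # log p(y|x), same constant
    info_bits = (log_num[None, :] - log_den) / np.log(2)
    return float(pmf @ info_bits.mean(axis=1))

rng = np.random.default_rng(0)
n = 4000
noise = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

snr_db = 12.0
snr_lin = 10 ** (snr_db / 10)
points = qam_points(16)
lam_grid = np.linspace(0.0, 0.3, 16)          # lam = 0 is uniform QAM
rates = [mutual_information(points, mb_pmf(points, lam), snr_lin, noise)
         for lam in lam_grid]

rate_uniform = rates[0]
best = int(np.argmax(rates))
rate_mb, lam_star = rates[best], float(lam_grid[best])
shannon = float(np.log2(1 + snr_lin))
print(rate_uniform, rate_mb, lam_star, shannon)
```

Repeating the sweep over a range of `snr_db` values and reading off the horizontal gap between the uniform and shaped curves at a fixed rate gives an estimate of $\gamma_s$. A finer $\lambda$ grid and more noise samples would be needed for plot-quality numbers.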


Operational Reading of the Shaping Curve

The shaping-gain plot tells a sharp story. At low SNR (below the uniform-QAM knee), the uniform distribution is already near-optimal because the outer constellation points are unusable — noise dominates and only the inner ring survives. There MB barely beats uniform.

At the high-SNR knee, where the uniform rate is close to $\log_2 M$, the shaping gain is largest: MB squeezes an extra $1.3$ to $1.5$ dB out of the constellation. This is the design sweet spot.

In deployment:

  • 400ZR optical: DP-16QAM at $\text{SNR} \approx 17$ dB, target rate $\sim 3.17$ bits/polarisation. The uniform rate limit of $4$ bits/pol puts this at the knee; shaping saves $\approx 1.3$ dB.
  • 6G eMBB (proposed): 256-QAM or 1024-QAM at $\text{SNR} \approx 22$ to $30$ dB. Same knee logic; $\approx 1.5$ dB savings.
  • Satellite DVB-S2X: 32-APSK or 64-APSK; shaping is optional (Annex extension), used for highest-rate MODCODs where the knee is near the link budget.

A rough rule: shaping pays off when the uniform BICM rate is within $1$ bit/symbol of $\log_2 M$. Below that, switch to a smaller constellation.

Common Mistake: MB Distribution is NOT a Discrete Gaussian

Mistake:

A common confusion is to treat the Maxwell-Boltzmann distribution on a constellation as interchangeable with a "discretised Gaussian" designed in the continuous domain: restrict $\mathcal{N}(0, \sigma^2)$ (per real dimension) to the constellation points, renormalise, and take $\sigma^2$ from the continuous relation $\mathbb{E}[|X|^2] = 2\sigma^2$. The two look alike, since both are proportional to $\exp(-\alpha |x|^2)$, but a parameter chosen this way is generally not the MB parameter.

Correction:

A Gaussian restricted to $\mathcal{X}$ and renormalised has the form $p(x) = \mathcal{N}(x; 0, \sigma^2) / \sum_{x'} \mathcal{N}(x'; 0, \sigma^2)$, which simplifies to $p(x) \propto \exp(-|x|^2/(2\sigma^2))$. This is MB with $\lambda = 1/(2\sigma^2)$. Under the parameter mapping $\lambda \leftrightarrow 1/(2\sigma^2)$ the two share the same functional form, so the family itself is not the problem.

The subtle but real difference is how the parameter is set. MB solves $\lambda$ from the energy constraint on the finite grid, $\mathbb{E}_{p_\lambda}[|X|^2] = E$, so that entropy is maximised on the grid; a $\sigma^2$ read off the continuous Gaussian satisfies the energy relation only on $\mathbb{R}^2$. On a truncated grid the renormalised distribution then misses the energy target: it either violates the power budget or leaves part of it unused, and in the latter case its entropy falls short of the MB entropy at the target energy. For large $M$ the two converge; for small $M$ (e.g., 16-QAM) the shortfall is small but non-trivial. Always specify MB and solve $\lambda$ from the energy constraint rather than setting $\sigma$ by eyeballing the Gaussian envelope.
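The gap between a parameter taken from the continuous Gaussian and one solved on the grid is easy to see numerically. The sketch below uses 16-QAM on the odd-integer grid with an arbitrary target energy of 6 (both are assumptions for illustration), and a plain bisection that relies on the mean energy decreasing in $\lambda$. The function names are hypothetical.

```python
import numpy as np

def qam_energies(m):
    """Squared norms of square M-QAM points on the odd-integer grid."""
    side = int(round(np.sqrt(m)))
    levels = np.arange(-(side - 1), side, 2).astype(float)
    return (levels[:, None] ** 2 + levels[None, :] ** 2).ravel()

def mb_energy(energy, lam):
    """Mean energy of the MB distribution with parameter lam."""
    w = np.exp(-lam * (energy - energy.min()))
    return float((w * energy).sum() / w.sum())

def solve_lambda(energy, target, hi=50.0, iters=200):
    """Bisection for lam >= 0 such that the MB mean energy equals target."""
    if not (energy.min() < target <= energy.mean()):
        raise ValueError("target must lie in (E_min, E_max]")
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mb_energy(energy, mid) > target:
            lo = mid      # too much energy: shape harder
        else:
            hi = mid
    return 0.5 * (lo + hi)

energy = qam_energies(16)
target = 6.0

# Continuous complex Gaussian with density ~ exp(-lam |x|^2): E|X|^2 = 2 sigma^2 = 1 / lam.
lam_naive = 1.0 / target
lam_solved = solve_lambda(energy, target)

print("naive :", lam_naive, mb_energy(energy, lam_naive))
print("solved:", lam_solved, mb_energy(energy, lam_solved))
```

In this setting the naive parameter undershoots the target energy on the grid (the truncated tails carry no mass), so the solved $\lambda$ is smaller and the resulting distribution uses the full power budget.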

Historical Note: From Shannon 1948 to Forney's Sphere Bound to MB

1948-1984

Shannon's 1948 capacity theorem establishes that the Gaussian distribution is optimal on AWGN, leaving open the question of constellation design for practical codes. Kschischang and Pasupathy (1993, 2016) gave the first systematic treatment of the shaping gain as a ratio of normalised second moments, cube ($1/12$) over sphere (tending to $1/(2\pi e)$ in high dimension), independent of any code. Forney (1984, 1992) developed the lattice-theoretic framework: shape the constellation bounding region as a sphere-like fundamental domain (Voronoi region of a dense lattice), recovering up to $\pi e / 6$ (about $1.53$ dB) asymptotically. This was the "geometric" side of shaping, before it had that name.

The "probabilistic" side emerged from information theory: the MB distribution appears in Calderbank and Ozarow's (1990) work on non-equiprobable signalling and in the shaping-gain analysis of Kschischang and Pasupathy. The key 1998 survey of Forney and Ungerboeck pulled it all together: for a finite constellation, MB is the max-entropy-at-fixed-energy distribution, and it achieves the same asymptotic $\pi e / 6$ gain as sphere shaping.

Practical adoption had to wait another 17 years: shaping remained theoretical until Böcherer, Steiner, and Schulte (2015) showed how to reconcile MB with the systematic LDPC + BICM infrastructure of 5G-era systems. We pick up that story in Section 2.

🔧Engineering Note

Why High-SNR Optical Was First to Deploy MB Shaping

Probabilistic shaping with the MB distribution only pays off at high SNR, specifically when the uniform BICM rate is within about $1$ bit per 2D symbol of the cardinality bound $\log_2 M$. The first commercial domain where this was routinely the operating point was coherent optical transmission:

  • Optical AWGN channel: dominated by amplified spontaneous emission (ASE) from erbium-doped fibre amplifiers. At the target $120$ km reach of 400ZR, per-polarisation SNR is around $17$ dB, right at the DP-16QAM knee.
  • Flexibility: optical links are upgraded in discrete $100$ Gbps steps, and the fibre reach varies by deployment. Shaping provides continuous rate adaptation (by varying $\lambda$) without changing hardware.
  • DSP budget: coherent modems already run $\ge 100$ GHz ASICs with sophisticated DSP; adding a CCDM block is a small incremental cost.

In contrast, cellular has historically operated at lower SNR per stream (most UEs are not at the knee) and uses discrete MCS indices for AMC. These conditions made PAS low-priority for LTE and 5G NR Release 15-17. The 6G study item (Release 20+) revisits PAS for the highest-MCS modes.

Practical Constraints
  • Shaping pays when uniform BICM rate is within 1 bit/symbol of $\log_2 M$

  • Requires average SNR at the constellation knee (typically $\ge 15$ dB for 16-QAM, $\ge 22$ dB for 64-QAM)

  • Adds CCDM computational complexity per block

📋 Ref: OIF 400ZR, DVB-S2X Annex M (2015)

Quick Check

In the Lagrangian derivation of the MB distribution, the exponential form $p(x) \propto \exp(-\lambda |x|^2)$ arises from which source?

The KKT stationarity condition $\partial L / \partial p(x) = 0$ applied to an entropy objective with a quadratic energy constraint

A Gaussian assumption on the transmitted waveform

The central limit theorem applied to a uniform QAM distribution

Arithmetic coding of the information bits

Maxwell-Boltzmann Distribution

A probability mass function on a finite constellation $\mathcal{X}$ of the form $p_\lambda(x) = \exp(-\lambda |x|^2) / Z(\lambda)$, where $\lambda > 0$ is a shaping parameter and $Z(\lambda) = \sum_{x} \exp(-\lambda |x|^2)$ is the partition function. Uniquely maximises entropy subject to a fixed-energy constraint; the finite-alphabet analogue of the Gaussian distribution on $\mathbb{R}^n$.

Related: Probabilistic Shaping, Constant-Composition Distribution Matcher (CCDM), Probabilistic Amplitude Shaping (PAS) Architecture

Partition Function

The normalisation constant $Z(\lambda) = \sum_{x \in \mathcal{X}} \exp(-\lambda |x|^2)$ of the MB distribution. The name is borrowed from statistical mechanics, where the analogous $Z(\beta) = \sum_s \exp(-\beta E_s)$ normalises the Boltzmann distribution over energy states. The negative first derivative of its logarithm, $-d \ln Z / d\lambda$, gives the expected energy; the second derivative, $d^2 \ln Z / d\lambda^2$, gives the energy variance.

Related: MB Distribution, Lagrangian, Convexity
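The two derivative identities for $\ln Z$ can be checked with finite differences. The sketch below does this on 16-QAM at $\lambda = 0.1$; the grid, the step size `h`, and the helper names are illustrative assumptions.

```python
import numpy as np

levels = np.array([-3.0, -1.0, 1.0, 3.0])
energy = (levels[:, None] ** 2 + levels[None, :] ** 2).ravel()

def log_z(lam):
    """ln Z(lambda) for 16-QAM on the odd-integer grid."""
    return float(np.log(np.exp(-lam * energy).sum()))

def moments(lam):
    """Mean and variance of |X|^2 under the MB distribution."""
    w = np.exp(-lam * energy)
    p = w / w.sum()
    mean = float(p @ energy)
    var = float(p @ (energy - mean) ** 2)
    return mean, var

lam, h = 0.1, 1e-4
# Central differences for the first and second derivatives of ln Z.
d1 = (log_z(lam + h) - log_z(lam - h)) / (2 * h)
d2 = (log_z(lam + h) - 2 * log_z(lam) + log_z(lam - h)) / h ** 2
mean, var = moments(lam)

print(-d1, mean)   # expected energy, two ways
print(d2, var)     # energy variance, two ways
```

Because the second derivative is a variance, it is non-negative, which is why $\ln Z$ is convex in $\lambda$ and the mean energy is monotonically decreasing.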

Key Takeaway

The Maxwell-Boltzmann distribution $p_\lambda(x) \propto \exp(-\lambda |x|^2)$ is the unique capacity-achieving input distribution on a finite constellation at high SNR, derived by a textbook Lagrangian that places an exponential weight on each constellation point according to its squared norm. The shaping parameter $\lambda$ is a one-dimensional knob that continuously interpolates between the uniform distribution ($\lambda = 0$) and a distribution concentrated on the innermost points ($\lambda \to \infty$). At the high-SNR knee of any QAM curve, $\lambda^\star$ buys $\approx 1.3$ to $1.5$ dB, close to the $\pi e / 6$ asymptote from Chapter 4.