Ferkans — Interactive Telecom Tutor

From the 1.53 dB Ceiling to a Constructive Recipe

Chapter 4 told us, with a beautiful entropy-power argument, that on a bounded 2D lattice we leave up to $\pi e / 6 \approx 1.53$ dB on the table by transmitting uniform QAM. Chapter 9 showed that 400G coherent optical has actually claimed that dB in the field. But how do you claim it? What is the concrete input distribution that the system designer should target?

The answer is the Maxwell-Boltzmann (MB) distribution $p_\lambda(x) \propto \exp(-\lambda |x|^2)$. It is the minimum-change recipe: keep the same QAM grid and weight inner points exponentially more than outer points. The parameter $\lambda$ is the shaping knob: $\lambda = 0$ recovers uniform QAM, and large $\lambda$ concentrates the probability on the innermost (minimum-energy) points of the grid. For every operating SNR there is a unique optimal $\lambda^\star$ that extracts the maximum shaping gain.

This is not an ad-hoc choice. MB is the unique max-entropy distribution on a finite alphabet subject to an average-power constraint, by the same variational principle that gives the Gaussian on $\mathbb{R}^n$ (fixed second moment), the geometric distribution on $\mathbb{N}$ (fixed mean), and the exponential on $\mathbb{R}_+$ (fixed mean). The KKT conditions yield the exponential form directly. The Gaussian capacity-achieving theorem on AWGN thus carries over to MB as the finite-constellation analogue.

Two payoffs from this section:

Theoretical: a complete derivation of MB from Lagrangian KKT, and a numerical read of MB entropy vs $\lambda$ for 16-/64-/256-QAM.
Operational: a plot of achievable rates — Shannon vs uniform QAM vs MB-shaped QAM — showing where the 1.53 dB is recovered and where it isn't (hint: at low SNR, shaping barely helps).


Definition:
Maxwell-Boltzmann Distribution on a Constellation

Let $\mathcal{X} \subset \mathbb{C}$ be a finite constellation (e.g., square $M$-QAM). For any $\lambda > 0$, the Maxwell-Boltzmann (MB) distribution on $\mathcal{X}$ with parameter $\lambda$ is $p_\lambda(x) = \frac{\exp(-\lambda |x|^2)}{Z(\lambda)}, \qquad Z(\lambda) = \sum_{x' \in \mathcal{X}} \exp(-\lambda |x'|^2), \qquad x \in \mathcal{X}.$ Here $Z(\lambda)$ is the partition function. The expected energy under $p_\lambda$ is $\mathbb{E}_{p_\lambda}[|X|^2] = -\frac{Z'(\lambda)}{Z(\lambda)} = \frac{\sum_x |x|^2 \exp(-\lambda |x|^2)}{\sum_x \exp(-\lambda |x|^2)}$ and its entropy (in bits) is $H(p_\lambda) = \frac{\lambda \mathbb{E}_{p_\lambda}[|X|^2] + \ln Z(\lambda)}{\ln 2}.$

Limits. As $\lambda \to 0$, $p_\lambda$ tends to the uniform distribution; as $\lambda \to \infty$, $p_\lambda$ concentrates on the minimum-energy points of $\mathcal{X}$ (for square QAM these are the four innermost points, since the grid has no point at the origin). The map $E \mapsto \lambda(E)$ from average energy to shaping parameter is strictly decreasing and bijective from $(E_{\min}, E_{\max}]$ onto $[0, \infty)$, where $E_{\min} = \min_x |x|^2$ and $E_{\max}$ is the average energy of the uniform distribution ($\lambda = 0$).
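The definition translates almost line by line into code. The following is a minimal sketch, assuming a square QAM grid with odd-integer coordinates (spacing 2); the helper names `square_qam`, `mb_pmf`, `mean_energy` and `entropy_bits` are illustrative and not part of any library. It evaluates $Z(\lambda)$, the average energy and the entropy directly, and compares the entropy with the closed form $(\lambda \mathbb{E}[|X|^2] + \ln Z)/\ln 2$.

```python
import math

def square_qam(m):
    """Square M-QAM points with odd-integer coordinates (grid spacing 2)."""
    side = math.isqrt(m)
    if side * side != m or side % 2:
        raise ValueError("m must be an even perfect square such as 16, 64, 256")
    levels = range(-(side - 1), side, 2)        # e.g. -3, -1, 1, 3 for 16-QAM
    return [complex(i, q) for i in levels for q in levels]

def mb_pmf(points, lam):
    """Return (pmf, Z) with pmf[k] = exp(-lam*|x_k|^2) / Z."""
    weights = [math.exp(-lam * abs(x) ** 2) for x in points]
    z = sum(weights)
    return [w / z for w in weights], z

def mean_energy(points, pmf):
    """Average energy sum_x p(x) |x|^2."""
    return sum(p * abs(x) ** 2 for x, p in zip(points, pmf))

def entropy_bits(pmf):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in pmf if p > 0.0)

points = square_qam(16)
for lam in (0.0, 0.05, 0.2, 1.0):
    pmf, z = mb_pmf(points, lam)
    e = mean_energy(points, pmf)
    h = entropy_bits(pmf)
    closed_form = (lam * e + math.log(z)) / math.log(2)   # H = (lam*E + ln Z)/ln 2
    print(f"lam={lam:4.2f}  Z={z:8.4f}  E={e:7.4f}  H={h:.4f}  closed-form H={closed_form:.4f}")
```

At `lam = 0.0` the loop reproduces the uniform case ($Z = 16$, $\bar E = 10$, $H = 4$ bits for 16-QAM); for growing `lam` both the energy and the entropy fall, and the two entropy columns agree.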

The MB distribution is the discrete finite-alphabet analogue of the 2D Gaussian on $\mathbb{R}^2$ : both are proportional to $\exp(-\lambda r^2)$ up to normalisation, and both emerge from the same Lagrangian stationary-point equation. For $M \to \infty$ with a fixed average power, MB on the $M$ -QAM lattice converges weakly to the continuous Gaussian.


Theorem: Maxwell-Boltzmann is Capacity-Achieving on a Finite Constellation with Average-Power Constraint

Let $\mathcal{X} \subset \mathbb{C}$ be a finite constellation and let $E > 0$ be an average-power budget. Consider the AWGN channel $Y = X + W$, $W \sim \mathcal{CN}(0, N_0)$, with the input restricted to $\mathcal{X}$ and subject to $\mathbb{E}[|X|^2] \le E$. At high SNR, among all probability mass functions $p$ on $\mathcal{X}$ with $\sum_x p(x) |x|^2 \le E$, mutual information $I(X;Y)$ is maximised by the Maxwell-Boltzmann distribution $p_\lambda$ with $\lambda = \lambda(E) > 0$ chosen so that $\mathbb{E}_{p_\lambda}[|X|^2] = E$.

The maximum entropy (bits) is $H^\star(E) = \log_2 e \cdot \bigl[\lambda E + \ln Z(\lambda)\bigr].$

At high SNR, $I(X;Y) \to H(X)$ because $H(X \mid Y) \to 0$ . So maximising mutual information reduces to maximising entropy at fixed energy — a textbook Lagrangian. The KKT stationarity equation gives an exponential form $p(x) \propto \exp(-\lambda |x|^2)$ , the MB distribution. The same Lagrangian machinery on a continuous alphabet $\mathbb{R}$ with a second-moment constraint yields the Gaussian.

Note the high-SNR qualifier: at finite SNR the optimum $p^\star$ deviates slightly from MB, because $H(X \mid Y) \neq 0$ and the Blahut-Arimoto iteration therefore converges to a different vector. MB nevertheless remains a close approximation to that optimum, and in the high-SNR limit it is exactly optimal.

Hint

Use $I(X;Y) = H(X) - H(X \mid Y)$ and $H(X \mid Y) \to 0$ at high SNR.

Maximise $H(p)$ subject to $\sum_x p(x) |x|^2 = E$ and $\sum_x p(x) = 1$ via Lagrangian $L(p) = -\sum p \log p - \lambda(\sum p|x|^2 - E) - \mu(\sum p - 1)$ .

Set $\partial L / \partial p(x) = 0$ and solve for $p(x)$ .

Normalise via the $\sum p = 1$ constraint.

The Lagrange multiplier $\lambda$ is determined uniquely by the energy constraint.

Proof

Step 1: High-SNR reduction to entropy maximisation

Let $p$ be any distribution on $\mathcal{X}$ with energy $E$ . Then $I(X;Y) = H(X) - H(X \mid Y)$ where $X$ is discrete and $Y$ is continuous. At high SNR the receiver identifies $X$ with high probability, so $H(X \mid Y) \to 0$ as $\text{SNR} \to \infty$ . Hence asymptotically $I(X;Y) \to H(X) = H(p)$ , and the optimisation reduces to $\max_p H(p)$ subject to the constraints.

Step 2: Lagrangian for constrained max-entropy

Form the Lagrangian $L(p; \lambda, \mu) = -\sum_x p(x) \ln p(x) - \lambda \bigl(\sum_x p(x) |x|^2 - E\bigr) - \mu \bigl(\sum_x p(x) - 1\bigr).$ The Lagrange multipliers $\lambda$ (power constraint) and $\mu$ (normalisation) are unknowns to be determined from the constraints.

Step 3: Stationarity gives MB form

Stationarity: $\partial L / \partial p(x) = -\ln p(x) - 1 - \lambda |x|^2 - \mu = 0$ , so $p^\star(x) = \exp\bigl(-1 - \mu - \lambda |x|^2\bigr) = C \, e^{-\lambda |x|^2},$ where $C = e^{-1-\mu}$ absorbs the constant. This is exactly the MB functional form.

Step 4: Normalisation gives partition function

The constraint $\sum_x p^\star(x) = 1$ fixes $C = 1/Z(\lambda)$ with $Z(\lambda) = \sum_x \exp(-\lambda |x|^2)$ . So $p^\star(x) = \frac{\exp(-\lambda |x|^2)}{Z(\lambda)} = p_\lambda(x).$

Step 5: Energy constraint fixes $\lambda$

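The remaining constraint $\sum_x p_\lambda(x) |x|^2 = E$ defines $\lambda = \lambda(E)$ uniquely because the map $\lambda \mapsto \mathbb{E}_{p_\lambda}[|X|^2] = -Z'(\lambda)/Z(\lambda)$ is strictly decreasing: its derivative is $-\frac{d^2}{d\lambda^2} \ln Z(\lambda) = -\mathrm{Var}_{p_\lambda}(|X|^2) < 0$ whenever the constellation has at least two distinct energy levels (equivalently, $\ln Z$ is strictly convex in $\lambda$). So for every achievable $E$ there exists a unique $\lambda$ and hence a unique MB distribution.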

Step 6: Maximum entropy value

Substituting $p_\lambda$ into $H(p)$ : $H(p_\lambda) = -\sum_x p_\lambda(x) [-\lambda |x|^2 - \ln Z(\lambda)] = \lambda \mathbb{E}[|X|^2] + \ln Z(\lambda) = \lambda E + \ln Z(\lambda).$ In bits, $H^\star(E) = (\lambda E + \ln Z(\lambda))/\ln 2$ . $\blacksquare$
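Step 5 is the only step without a closed form, so in practice $\lambda(E)$ is found numerically. The sketch below uses plain bisection, which is justified by the monotonicity argument in Step 5. The function names and the 64-QAM target energy of 20 are illustrative assumptions, and the exponent is shifted by the minimum energy purely for numerical stability (the shift cancels in the ratio).

```python
import math

def mb_energy(energies, lam):
    """Average energy -Z'(lam)/Z(lam) for a list of per-point energies |x|^2."""
    e_min = min(energies)
    # Shifting by e_min leaves the pmf unchanged and avoids underflow of all weights.
    weights = [math.exp(-lam * (e - e_min)) for e in energies]
    return sum(e * w for e, w in zip(energies, weights)) / sum(weights)

def solve_lambda(energies, target, tol=1e-12):
    """Bisection for lam >= 0 such that mb_energy(energies, lam) == target.
    Relies on the average energy being strictly decreasing in lam."""
    e_min = min(energies)
    e_unif = sum(energies) / len(energies)
    if not (e_min < target <= e_unif):
        raise ValueError("target must lie in (E_min, E_uniform]")
    lo, hi = 0.0, 1.0
    while mb_energy(energies, hi) > target:     # grow the bracket until it holds the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mb_energy(energies, mid) > target:
            lo = mid                            # energy still too high: increase lam
        else:
            hi = mid
    return 0.5 * (lo + hi)

levels = (-7, -5, -3, -1, 1, 3, 5, 7)           # 64-QAM, grid spacing 2
energies = [i * i + q * q for i in levels for q in levels]

target = 20.0
lam = solve_lambda(energies, target)
z = sum(math.exp(-lam * e) for e in energies)
h_star = (lam * target + math.log(z)) / math.log(2)   # Step 6: H* = (lam*E + ln Z)/ln 2
print(f"lambda(E={target}) = {lam:.6f},  H* = {h_star:.4f} bits")
```

Bisection is chosen here over Newton's method only because it needs no derivative and cannot leave the bracket; either would work for this one-dimensional problem.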


Maxwell-Boltzmann Distribution on $M$ -QAM

Visualisation of the MB distribution on a square $M$ -QAM grid. Each constellation point is drawn as a disk whose area is proportional to the MB probability $p_\lambda(x) \propto \exp(-\lambda |x|^2)$ . Move the $\lambda$ slider: at $\lambda = 0$ all disks are equal (uniform); as $\lambda$ grows, outer (high-energy) points shrink and inner points grow, producing a Gaussian-like envelope on the lattice. Observe that the effect is barely visible for very small $\lambda$ (the MB distribution is locally a smooth perturbation of uniform), becomes pronounced at $\lambda \approx 0.1$ , and collapses onto the inner ring for large $\lambda$ . Operating range in practice: $\lambda \in [0.02, 0.15]$ for 64-QAM at SNR between 12 and 25 dB.

Parameters: QAM size $M$; MB parameter $\lambda$ (slider, shown at 0.08).

Example: MB Entropy for 64-QAM at $\lambda = 0.1$

Compute the entropy $H(p_\lambda)$ in bits/symbol of the MB distribution on 64-QAM (square grid with spacing 2, coordinates in $\{\pm 1, \pm 3, \pm 5, \pm 7\}$ per dimension) with $\lambda = 0.1$. Compare to the uniform-QAM entropy of 6 bits/symbol.

Solution

Enumerate the 64 QAM points

Each point has coordinates $(a_I, a_Q)$ with $a_I, a_Q \in \{\pm 1, \pm 3, \pm 5, \pm 7\}$ . The squared norm is $|x|^2 = a_I^2 + a_Q^2$ . There are 4 points with $|x|^2 = 2$ (the inner ring $(\pm 1, \pm 1)$ ), 8 points with $|x|^2 = 10$ , 4 with $|x|^2 = 18$ , and so on up to $(\pm 7, \pm 7)$ with $|x|^2 = 98$ .

Compute the partition function

$Z(\lambda) = \sum_{|x|^2} N(|x|^2) \exp(-\lambda |x|^2)$ where $N(|x|^2)$ is the multiplicity. Because the square grid factorises, $Z(\lambda) = \bigl(2 \sum_{a \in \{1,3,5,7\}} e^{-\lambda a^2}\bigr)^2$. Numerical evaluation at $\lambda = 0.1$ gives $Z(0.1) \approx 2.802^2 \approx 7.85$ (versus $Z(0) = 64$ for uniform).

Compute the average energy

$\mathbb{E}[|X|^2] = \sum_x p_\lambda(x) |x|^2$. At $\lambda = 0.1$ this is approximately $\bar E \approx 9.97$ (versus $\bar E_{\rm unif} = 42$ for uniform 64-QAM). The shaping has cut average energy by a factor of about 4.2, but the entropy has also dropped below 6 bits/symbol, so this ratio is not yet a power saving at constant rate.

Compute the entropy

$H(p_\lambda) = (\lambda \bar E + \ln Z(\lambda))/\ln 2 = (0.1 \cdot 9.97 + \ln 7.85)/\ln 2 \approx (0.997 + 2.061)/0.693 \approx 4.41$ bits/symbol. So MB-shaped 64-QAM at $\lambda = 0.1$ carries $4.41$ bits/symbol at average energy $9.97$, versus uniform 64-QAM at $6$ bits/symbol at average energy $42$.

Operational comparison

A fair comparison holds the rate fixed. A uniform square $M$-QAM on the same grid (spacing 2) has $\bar E = 2(M-1)/3$, which gives $10$ for 16-QAM and $42$ for 64-QAM. Carrying $4.41$ bits/symbol uniformly would need $M = 2^{4.41} \approx 21.3$ points, i.e. $\bar E \approx 2(21.3 - 1)/3 \approx 13.5$ if such a constellation existed. MB-shaped 64-QAM reaches the same entropy at $\bar E \approx 9.97$, a power saving of $10 \log_{10}(13.5/9.97) \approx 1.3$ dB, in line with the shaping-gain ceiling of 1.53 dB. Against the brute-force route of a rate-$0.735$ code on uniform 64-QAM at $\bar E = 42$, the energy ratio is $10 \log_{10}(42/9.97) \approx 6.2$ dB, but that figure is an energy ratio at equal grid spacing and not a like-for-like SNR comparison, since the coded system tolerates more noise. $\blacksquare$
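The quantities in this example are easy to recompute. The sketch below follows the example's own route: group the 64 points into energy shells with multiplicity $N(|x|^2)$, then evaluate $Z$, $\bar E$ and $H$ at $\lambda = 0.1$. The variable names are illustrative; the grid is the spacing-2 grid with coordinates $\pm 1, \pm 3, \pm 5, \pm 7$.

```python
import math
from collections import Counter

lam = 0.1
levels = (-7, -5, -3, -1, 1, 3, 5, 7)

# Multiplicity N(|x|^2) of each energy shell: 4 points at 2, 8 at 10, 4 at 18, ...
shells = Counter(i * i + q * q for i in levels for q in levels)

# Partition function, average energy and entropy from the shell sums.
z = sum(n * math.exp(-lam * e) for e, n in shells.items())
mean_e = sum(n * e * math.exp(-lam * e) for e, n in shells.items()) / z
h_bits = (lam * mean_e + math.log(z)) / math.log(2)

print(f"number of shells = {len(shells)}, points = {sum(shells.values())}")
print(f"Z(0.1) = {z:.4f}")
print(f"E      = {mean_e:.4f}")
print(f"H      = {h_bits:.4f} bits/symbol")
```

Run as written, this gives roughly $Z \approx 7.85$, $\bar E \approx 9.97$ and $H \approx 4.41$ bits/symbol.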

Shaping Gain vs SNR: Shannon, Uniform QAM, and MB-Shaped QAM

Achievable rate (bits/2D symbol) versus $\text{SNR}$ in dB for three input distributions: (i) Shannon bound $\log_2(1 + \text{SNR})$ (continuous Gaussian input); (ii) uniform $M$ -QAM; (iii) optimally MB-shaped $M$ -QAM (i.e., $\lambda^\star$ chosen per SNR to maximise mutual information). The shaping gain $\gamma_s(\text{SNR}) = \text{SNR}_{\rm unif} - \text{SNR}_{\rm MB}$ (in dB at a fixed target rate) grows from $\approx 0.1$ dB at low SNR to $\approx 1.5$ dB at the high-SNR "knee" of each QAM curve. The optimally-shaped curve asymptotically approaches Shannon within $0.1$ dB for large $M$ . Toggle the 256-QAM curve to see how larger constellations close the gap faster but at higher SNR.

Parameters: toggle "Include 256-QAM".

Operational Reading of the Shaping Curve

The shaping-gain plot tells a sharp story. At low SNR (well below the uniform-QAM knee), the uniform distribution is already near-optimal: when noise dominates, the achievable rate depends on the input almost only through its average power, so uniform QAM already tracks the Shannon curve. There MB barely beats uniform.

At the high-SNR knee, where the uniform rate is close to $\log_2 M$ , the shaping gain is largest: MB squeezes an extra $1.3$ to $1.5$ dB out of the constellation. This is the design sweet spot.
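The two regimes described above can be reproduced numerically. The following is a rough sketch, not the code behind the plot. It rests on these assumptions: MB on square QAM factorises into two independent MB-weighted PAM dimensions, so it works per real dimension and doubles the result; SNR is defined as $\mathbb{E}[|X|^2]/N_0$; the mutual information is computed as $h(Y) - h(Y \mid X)$ with a simple Riemann sum; and $\lambda$ is scanned over a coarse grid that includes $\lambda = 0$. All function names are hypothetical.

```python
import math

def mb_pam(levels, lam):
    """MB pmf on one PAM dimension: p(a) proportional to exp(-lam * a^2)."""
    w = [math.exp(-lam * a * a) for a in levels]
    z = sum(w)
    return [x / z for x in w]

def mutual_information(levels, pmf, sigma):
    """I(X;Y) in bits for Y = X + N(0, sigma^2), as h(Y) - h(Y|X).
    h(Y) is a Riemann sum over a y grid with step sigma/10."""
    step = sigma / 10.0
    y = min(levels) - 8.0 * sigma
    y_end = max(levels) + 8.0 * sigma
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    h_y = 0.0
    while y <= y_end:
        py = norm * sum(p * math.exp(-((y - a) ** 2) / (2.0 * sigma ** 2))
                        for a, p in zip(levels, pmf))
        if py > 0.0:
            h_y -= py * math.log2(py) * step
        y += step
    h_noise = 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)
    return h_y - h_noise

def rates_at_snr(levels, snr_db, lam_grid):
    """Per-real-dimension rates at fixed SNR: (Shannon, uniform, best MB, best lam)."""
    snr = 10.0 ** (snr_db / 10.0)
    best_rate, best_lam, uniform = -1.0, 0.0, None
    for lam in lam_grid:
        pmf = mb_pam(levels, lam)
        energy = sum(p * a * a for a, p in zip(levels, pmf))
        sigma = math.sqrt(energy / snr)         # scale the noise so that E/sigma^2 = SNR
        rate = mutual_information(levels, pmf, sigma)
        if lam == 0.0:
            uniform = rate
        if rate > best_rate:
            best_rate, best_lam = rate, lam
    return 0.5 * math.log2(1.0 + snr), uniform, best_rate, best_lam

levels = (-7, -5, -3, -1, 1, 3, 5, 7)           # one dimension of 64-QAM
lam_grid = [0.02 * k for k in range(16)]        # 0.00 ... 0.30, includes uniform
for snr_db in (4.0, 10.0, 16.0):
    cap, uni, mb, lam = rates_at_snr(levels, snr_db, lam_grid)
    print(f"SNR={snr_db:4.1f} dB  Shannon={2 * cap:.3f}  uniform={2 * uni:.3f}  "
          f"MB={2 * mb:.3f} (lam={lam:.2f})  bits/2D symbol")
```

The printed rates are in bits per 2D symbol, matching the axis of the plot. Reading a shaping gain in dB off these numbers requires comparing the SNRs at which the uniform and MB curves reach the same rate, which would need a finer SNR sweep than the three points used here.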

In deployment:

400ZR optical: DP-16QAM at $\text{SNR} \approx 17$ dB, target rate $\sim 3.17$ bits/polarisation. Uniform rate limit $4$ bits/pol is at the knee; shaping saves $\approx 1.3$ dB.
6G eMBB (proposed): 256-QAM or 1024-QAM at $\text{SNR} \approx 22$ to $30$ dB. Same knee logic; $\approx 1.5$ dB savings.
Satellite DVB-S2X: 32-APSK or 64-APSK; shaping is optional (Annex extension), used for highest-rate MODCODs where the knee is near the link budget.

A rough rule: shaping pays off when the uniform BICM rate is within $1$ bit/symbol of $\log_2 M$ . Below that, switch to a smaller constellation.

Common Mistake: MB Distribution is NOT a Discrete Gaussian

Mistake:

A common confusion is to treat the Maxwell-Boltzmann distribution on a constellation as a "discretised Gaussian" whose variance is set the continuous way: pick $\sigma^2$ so that a continuous Gaussian $\mathcal{N}(0, \sigma^2)$ per real dimension has the target power, restrict it to the constellation lattice, and renormalise. The shapes look alike, since both are proportional to $\exp(-\alpha |x|^2)$, but the resulting distribution does not meet the energy target.

Correction:

A Gaussian restricted to $\mathcal{X}$ and renormalised has the form $p(x) = \mathcal{N}(x; 0, \sigma^2) / \sum_{x'} \mathcal{N}(x'; 0, \sigma^2)$, which simplifies to $p(x) \propto \exp(-|x|^2/(2\sigma^2))$. This is MB with $\lambda = 1/(2\sigma^2)$, so the functional form is the same and that part of the confusion is harmless.

The subtle but real difference is in how the parameter is chosen. On the finite grid, $\sigma^2$ is no longer the per-dimension variance: truncation and discretisation change the average energy, so $\mathbb{E}[|X|^2] \neq 2\sigma^2$ in general. For example, on 16-QAM with coordinates $\pm 1, \pm 3$, setting $\sigma^2 = 5$ ($\lambda = 0.1$) gives $\mathbb{E}[|X|^2] \approx 6.96$ rather than $10$. For large $M$ the two parameterisations converge; for small $M$ (e.g., 16-QAM), the eyeballed Gaussian undershoots the target energy and therefore carries fewer bits than MB with $\lambda$ solved from the energy constraint, by a small but non-trivial amount. Always specify MB and solve $\lambda$ from the energy constraint rather than setting $\sigma$ by eyeballing the Gaussian envelope.
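The gap between the two parameter choices is easy to see numerically. The sketch below is an illustration under stated assumptions: 16-QAM on the spacing-2 grid, an arbitrarily chosen target energy of 7, and the "eyeballed" route taken to mean $\sigma^2 = E/2$ per real dimension. It compares that route with $\lambda$ solved by bisection from the energy constraint.

```python
import math

levels = (-3, -1, 1, 3)
energies = [i * i + q * q for i in levels for q in levels]   # 16-QAM, grid spacing 2

def mb_stats(lam):
    """Return (average energy, entropy in bits) of the MB pmf with parameter lam."""
    w = [math.exp(-lam * e) for e in energies]
    z = sum(w)
    mean_e = sum(e * x for e, x in zip(energies, w)) / z
    return mean_e, (lam * mean_e + math.log(z)) / math.log(2)

target = 7.0          # desired average energy (uniform 16-QAM has 10)

# Eyeballed route: continuous 2D Gaussian with E|X|^2 = 2*sigma^2 = target.
sigma2 = target / 2.0
e_gauss, h_gauss = mb_stats(1.0 / (2.0 * sigma2))

# Constraint route: bisection on lam until the energy on the grid equals the target.
lo, hi = 0.0, 5.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mb_stats(mid)[0] > target:
        lo = mid
    else:
        hi = mid
lam_star = 0.5 * (lo + hi)
e_star, h_star = mb_stats(lam_star)

print(f"eyeballed: lam={1.0 / (2.0 * sigma2):.4f}  E={e_gauss:.4f}  H={h_gauss:.4f} bits")
print(f"solved:    lam={lam_star:.4f}  E={e_star:.4f}  H={h_star:.4f} bits")
```

Both rows belong to the same MB family; only the solved row actually sits at the target energy.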

Historical Note: From Shannon 1948 to Forney's Sphere Bound to MB

1948-1984

Shannon's 1948 capacity theorem establishes that the Gaussian distribution is optimal on AWGN, leaving open the question of constellation design for practical codes. Kschischang and Pasupathy (1993, 2016) gave the first systematic treatment of the shaping gain as the $1/(2\pi e)$ ratio of spherical to cubic second moments, independent of any code. Forney (1984, 1992) developed the lattice-theoretic framework: shape the constellation bounding region as a sphere-like fundamental domain (Voronoi region of a dense lattice), recovering up to $\pi e / 6 \approx 1.53$ dB asymptotically. This was the "geometric" side of shaping, before it had that name.

The "probabilistic" side emerged from information theory: the MB distribution was introduced by Kschischang and Pasupathy in the shaping-gain analysis, and later by Calderbank-Ozarow (1990) in non-equiprobable signalling. The key 1998 survey of Forney and Ungerboeck pulled it all together: for a finite constellation, MB is the max-entropy-at-fixed-energy distribution, and it achieves the same asymptotic $\pi e / 6$ gain as sphere shaping.

The practical adoption had to wait another 17 years: shaping remained theoretical until Böcherer-Steiner-Schulte (2015) showed how to reconcile MB with the systematic LDPC + BICM infrastructure of 5G-era systems. We pick up that story in Section 2.


🔧Engineering Note

Why High-SNR Optical Was First to Deploy MB Shaping

Probabilistic shaping with MB distribution only pays off at high SNR — specifically when the uniform BICM rate is within about $1$ bit per 2D symbol of the cardinality bound $\log_2 M$ . The first commercial domain where this was routinely the operating point was coherent optical transmission:

Optical AWGN channel: dominated by amplified spontaneous emission (ASE) from erbium-doped fibre amplifiers. At the target $120$ km reach of 400ZR, per-polarisation SNR is around $17$ dB — right at the DP-16QAM knee.
Flexibility: optical links are upgraded in discrete $100$ Gbps steps, and the fibre reach varies by deployment. Shaping provides continuous rate adaptation (by varying $\lambda$ ) without changing hardware.
DSP budget: coherent modems already run $\ge 100$ GHz ASICs with sophisticated DSP; adding a CCDM block is a small incremental cost.

In contrast, cellular has historically operated at lower SNR per stream (most UEs are not at the knee) and uses discrete MCS indices for AMC. These conditions made PAS low-priority for LTE and 5G NR Release 15-17. The 6G study item (Release 20+) revisits PAS for the highest-MCS modes.

Practical Constraints

• Shaping pays when uniform BICM rate is within 1 bit/symbol of $\log_2 M$
• Requires average SNR at the constellation knee (typically $\ge 15$ dB for 16-QAM, $\ge 22$ dB for 64-QAM)
• Adds CCDM computational complexity per block

📋 Ref: OIF 400ZR, DVB-S2X Annex M (2015)


Quick Check

In the Lagrangian derivation of the MB distribution, the exponential form $p(x) \propto \exp(-\lambda |x|^2)$ arises from which source?

The KKT stationarity condition $\partial L / \partial p(x) = 0$ applied to an entropy objective with a quadratic energy constraint

A Gaussian assumption on the transmitted waveform

The central limit theorem applied to a uniform QAM distribution

Arithmetic coding of the information bits

Correction:

The KKT stationarity condition $\partial L / \partial p(x) = 0$ applied to an entropy objective with a quadratic energy constraint

Correct. The stationarity equation $-\ln p(x) - 1 - \lambda |x|^2 - \mu = 0$ gives $p(x) = \exp(-1 - \mu - \lambda |x|^2)$ , which after normalisation is the MB distribution. The quadratic constraint $|x|^2$ is what produces the exponential-squared form (rather than an exponential-linear form that would arise from an $|x|$ constraint).

Maxwell-Boltzmann Distribution

A probability mass function on a finite constellation $\mathcal{X}$ of the form $p_\lambda(x) = \exp(-\lambda |x|^2) / Z(\lambda)$ , where $\lambda > 0$ is a shaping parameter and $Z(\lambda) = \sum_{x} \exp(-\lambda |x|^2)$ is the partition function. Uniquely maximises entropy subject to a fixed-energy constraint; the finite-alphabet analogue of the Gaussian distribution on $\mathbb{R}^n$ .

Partition Function

The normalisation constant $Z(\lambda) = \sum_{x \in \mathcal{X}} \exp(-\lambda |x|^2)$ of the MB distribution. The name is borrowed from statistical mechanics, where the analogous $Z(\beta) = \sum_s \exp(-\beta E_s)$ normalises the Boltzmann distribution over energy states. The first-log-derivative $-d \ln Z / d\lambda$ gives the expected energy; the second gives the energy variance.
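The two log-derivative identities can be checked with finite differences. This is a small sketch under assumed settings (16-QAM on the spacing-2 grid, $\lambda = 0.1$, step $h = 10^{-4}$); it compares central differences of $\ln Z$ with the mean and variance of $|X|^2$ computed directly from the pmf.

```python
import math

levels = (-3, -1, 1, 3)
energies = [i * i + q * q for i in levels for q in levels]   # 16-QAM, grid spacing 2

def log_z(lam):
    """ln Z(lam) for the constellation above."""
    return math.log(sum(math.exp(-lam * e) for e in energies))

def moments(lam):
    """Mean and variance of the energy |X|^2 under the MB pmf."""
    w = [math.exp(-lam * e) for e in energies]
    z = sum(w)
    mean = sum(e * x for e, x in zip(energies, w)) / z
    second = sum(e * e * x for e, x in zip(energies, w)) / z
    return mean, second - mean * mean

lam, h = 0.1, 1e-4
d1 = (log_z(lam + h) - log_z(lam - h)) / (2.0 * h)                  # central first difference
d2 = (log_z(lam + h) - 2.0 * log_z(lam) + log_z(lam - h)) / (h * h) # central second difference
mean, var = moments(lam)

print(f"-d ln Z / d lam    = {-d1:.6f}   mean energy     = {mean:.6f}")
print(f"d^2 ln Z / d lam^2 = {d2:.6f}   energy variance = {var:.6f}")
```

The positive second derivative is the convexity of $\ln Z$ that makes the average energy strictly decreasing in $\lambda$.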

Key Takeaway

The Maxwell-Boltzmann distribution $p_\lambda(x) \propto \exp(-\lambda |x|^2)$ is the unique capacity-achieving input distribution on a finite constellation at high SNR, derived by a textbook Lagrangian that places an exponential weight on each constellation point according to its squared norm. The shaping parameter $\lambda$ is a one-dimensional knob that continuously interpolates between the uniform distribution ($\lambda = 0$) and a distribution concentrated on the minimum-energy points ($\lambda \to \infty$). At the high-SNR knee of any QAM curve, $\lambda^\star$ buys $\approx 1.3$ to $1.5$ dB, close to the $\pi e / 6$ asymptote ($\approx 1.53$ dB) from Chapter 4.
