The Capacity Rule for Rate Allocation
Why a Rate Rule Is Even Possible
Before stating the capacity rule, consider what we are asking. MLC uses $\ell$ separate binary codes, one per label bit, each with its own rate $R_i$. In principle the designer has $\ell$ degrees of freedom. The question is: what rates should we pick?
The answer is striking: the chain rule of mutual information gives a complete decomposition of the capacity of the non-binary channel into a sum of binary capacities. This is not an approximation; it is exact. The cost is that the binary channels are not independent: level $i$ must condition on levels $0, \dots, i-1$. This conditioning is the reason multistage decoding exists, and it is what separates MLC/MSD from BICM.
With the chain-rule decomposition in hand, the rate rule writes itself: pick $R_i$ equal to the capacity $C_i$ of the $i$-th conditioned binary sub-channel. The sum telescopes to the full CM capacity. No single choice gives more, and any other choice gives less; this is the content of the main theorem of this section.
Definition: Binary Sub-Channel at Level $i$
Fix a partition chain, a partition-based labelling $\mu$, and the AWGN channel $Y = X + N$ with $N \sim \mathcal{N}(0, \sigma^2)$. Let $X = \mu(B_0, \dots, B_{\ell-1})$, where each $B_i$ is uniform on $\{0, 1\}$ and the bits are independent. The binary sub-channel at level $i$ is the channel with input $B_i$ and output $(Y, B_0, \dots, B_{i-1})$: the channel sees the received signal and a genie-provided history of the previously decoded bits. Its capacity is

$$C_i = I(Y; B_i \mid B_0, \dots, B_{i-1}),$$

with the understanding that $C_0 = I(Y; B_0)$.
The "genie-provided history" terminology is standard but misleading at first reading. What is really happening is that MSD decodes level $i$ using the decoded bits from levels $0, \dots, i-1$. At rates below the capacity rule the probability of a history error is driven to zero, so the decoded bits equal the transmitted bits with high probability, and the genie assumption is justified in the information-theoretic limit.
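The conditional capacities $C_i$ have no closed form for PSK, but they are straightforward to estimate by Monte Carlo. The sketch below is an illustration, not code from the text: it assumes 8-PSK with the natural set-partitioning labelling (bit 0 is the LSB of the point index), unit symbol energy, and a complex AWGN channel, and it estimates each $C_i = I(Y; B_i \mid B_0, \dots, B_{i-1})$ by averaging the log-ratio of conditional output densities, with the genie history taken to be the true transmitted bits.

```python
import numpy as np

def mlc_subchannel_capacities(snr_db, n=200_000, levels=3, seed=0):
    """Monte Carlo estimate of C_i = I(Y; B_i | B_0..B_{i-1}) for 8-PSK
    with the natural set-partitioning labelling on a complex AWGN
    channel with E_s = 1 (illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    M = 2 ** levels
    pts = np.exp(2j * np.pi * np.arange(M) / M)      # unit-energy 8-PSK
    sigma2 = 10 ** (-snr_db / 10)                    # noise variance
    k = rng.integers(M, size=n)                      # transmitted indices
    noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(n)
                                   + 1j * rng.standard_normal(n))
    y = pts[k] + noise
    # likelihood p(y | x) up to a common constant, for every candidate x
    like = np.exp(-np.abs(y[:, None] - pts[None, :]) ** 2 / sigma2)
    idx = np.arange(M)[None, :]
    caps = []
    for i in range(levels):
        # candidates agreeing with the true bits b_0..b_{i-1} (the "genie")
        hist = (idx % (1 << i)) == (k[:, None] % (1 << i))
        # candidates additionally agreeing with the true bit b_i
        full = (idx % (1 << (i + 1))) == (k[:, None] % (1 << (i + 1)))
        num = (like * full).sum(axis=1) / full.sum(axis=1)
        den = (like * hist).sum(axis=1) / hist.sum(axis=1)
        caps.append(float(np.mean(np.log2(num / den))))
    return caps
```

The three estimates come out increasing in $i$ (level 2 is the widest-spaced BPSK), and their sum approaches $\log_2 8 = 3$ bits as the SNR grows.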
Theorem: The Capacity Rule for MLC
Let $\mathcal{A}$ be a constellation with $|\mathcal{A}| = 2^{\ell}$, let $\mu$ be a partition-based labelling, and let $X = \mu(B_0, \dots, B_{\ell-1})$ with i.i.d.\ uniform label bits. Then

$$I(Y; X) = \sum_{i=0}^{\ell-1} C_i.$$
Consequently, an MLC scheme with binary codes of rates $R_i = C_i$ at each level achieves the full constellation capacity $I(Y; X)$. No other choice of rates satisfying $\sum_i R_i = I(Y; X)$ can be decoded reliably by multistage decoding unless $R_i \le C_i$ at every level.
The equality is the chain rule of mutual information applied to the bijective correspondence $X \leftrightarrow (B_0, \dots, B_{\ell-1})$. The "no-other-choice" part is the weak converse for binary channels applied level-by-level: with MSD as the receiver, each stage $i$ sees a binary channel of capacity $C_i$, and a rate above $C_i$ at any level cannot be decoded with vanishing error probability.
Start with the chain rule: $I(Y; B_0, \dots, B_{\ell-1}) = \sum_{i=0}^{\ell-1} I(Y; B_i \mid B_0, \dots, B_{i-1})$.
Argue that $I(Y; X) = I(Y; B_0, \dots, B_{\ell-1})$ because $\mu$ is a bijection between label vectors and points.
For the second claim, note that MSD at stage $i$ sees a binary channel of capacity $C_i$ and apply the binary channel coding theorem.
Chain rule gives the decomposition
By the chain rule of mutual information,

$$I(Y; B_0, \dots, B_{\ell-1}) = \sum_{i=0}^{\ell-1} I(Y; B_i \mid B_0, \dots, B_{i-1}).$$

Each term on the right is $C_i$ by definition (def-binary-subchannel).
Label vector ↔ constellation point bijection
Since $\mu$ is a bijection, the random variables $X$ and $(B_0, \dots, B_{\ell-1})$ carry the same information: $X$ is a deterministic function of the label vector, and vice versa. Mutual information is invariant under one-to-one functions, so

$$I(Y; X) = I(Y; B_0, \dots, B_{\ell-1}).$$

Combining with the previous step yields $I(Y; X) = \sum_{i=0}^{\ell-1} C_i$, as claimed.
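The bijection step and the chain rule can be checked exactly on any toy channel. The sketch below uses a made-up $4$-input, $4$-output discrete channel (an illustrative assumption, not the AWGN channel of the text) with the bijective labelling $x = b_0 + 2 b_1$, and verifies $I(Y; X) = I(Y; B_0) + I(Y; B_1 \mid B_0)$ to machine precision.

```python
import numpy as np

def mutual_info(pxy):
    """Mutual information in bits of a joint pmf given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float(np.sum(pxy[m] * np.log2(pxy[m] / (px * py)[m])))

# Arbitrary 4-input, 4-output transition matrix W[x, y] (rows sum to 1)
W = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.70, 0.10, 0.10],
              [0.05, 0.05, 0.80, 0.10],
              [0.10, 0.20, 0.20, 0.50]])
p_xy = np.full(4, 0.25)[:, None] * W        # uniform input, x = b0 + 2*b1

b0 = np.arange(4) & 1                       # low label bit of each x
b1 = np.arange(4) >> 1                      # high label bit of each x

# Joint pmf of (B0, Y): collapse x onto its low bit
p_b0y = np.zeros((2, 4))
np.add.at(p_b0y, b0, p_xy)

# I(Y; B1 | B0) = I(B1; (B0, Y)) because B0 and B1 are independent
p_b1_b0y = np.zeros((2, 8))
for x in range(4):
    p_b1_b0y[b1[x], b0[x] * 4:b0[x] * 4 + 4] += p_xy[x]

C0 = mutual_info(p_b0y)
C1 = mutual_info(p_b1_b0y)
total = mutual_info(p_xy)
assert abs(total - (C0 + C1)) < 1e-12       # chain rule holds exactly
```

Because the label bits are independent, conditioning on $B_0$ costs nothing on the input side; the whole gap between $C_0 + C_1$ and the naive unconditional sum sits in the output statistics.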
Achievability of $R_i = C_i$ via MSD
Fix $\epsilon > 0$ and choose binary codes of rates $R_i = C_i - \epsilon$ and block length $n$. By the noisy-channel coding theorem for binary channels, there exist codes whose error probability on the level-$i$ binary sub-channel goes to zero as $n \to \infty$.
Multistage decoding decodes $B_0$ first (against the unconditional channel $B_0 \to Y$), then $B_1$ using $\hat{B}_0$ as side information, and so on. Each stage operates at a rate strictly below its channel's capacity, so by a union bound over the $\ell$ stages, the aggregate error probability is at most $\ell \, e^{-n E(\epsilon)}$ for some positive error exponent $E(\epsilon)$. Letting $\epsilon \to 0$, the aggregate rate approaches $\sum_i C_i = I(Y; X)$.
Tightness (the converse direction)
Suppose an MLC/MSD scheme uses rates $R_0, \dots, R_{\ell-1}$ with $R_j > C_j$ for some level $j$. At stage $j$ the decoder sees a binary channel of capacity $C_j$ (conditional on the decoded history, which is correct with high probability under reasonable operating assumptions). By the converse to the binary channel coding theorem, the error probability at stage $j$ is bounded away from zero for all code lengths $n$. Hence no MSD-achievable rate vector has $R_i > C_i$ at any level.
The converse combined with achievability establishes that the rate vector $(C_0, \dots, C_{\ell-1})$ is the unique maximiser of $\sum_i R_i$ subject to $R_i \le C_i$ for all $i$, and its sum is $I(Y; X)$.
Key Takeaway
The capacity rule is the chain rule in disguise. For any partition-based labelling of any constellation, the full CM capacity decomposes exactly into a sum of conditional binary capacities, and MLC with $R_i = C_i$ achieves the sum. The rule is optimal, not heuristic; this is what makes MLC fundamentally different from ad hoc rate allocations.
Three Binary Sub-Channel Capacities vs. SNR for 8-PSK
For 8-PSK with the Ungerboeck partition chain, the plot shows the three binary sub-channel capacities $C_0, C_1, C_2$ as functions of SNR, together with their sum (the MLC/MSD capacity) and the unconstrained Shannon limit $\log_2(1 + \mathrm{SNR})$. Observe that $C_2$ saturates to $1$ bit almost immediately: level 2 is effectively BPSK with squared distance $d_2^2 = 4$. $C_1$ saturates next. $C_0$ is the bottleneck; the capacity of level 0 is what throttles the total at low-to-medium SNR.
Example: 8-PSK Rate Allocation at a Given SNR
Using the Ungerboeck partition chain of 8-PSK and operating at a fixed SNR $E_s/\sigma^2$ (equivalently, a fixed noise variance $\sigma^2$ at unit symbol energy), compute the binary sub-channel capacities $C_0, C_1, C_2$, the total MLC/MSD capacity, and the implied rate allocation when the designer picks $R_i = C_i$.
Intra-level squared distances and effective SNRs
At unit $E_s$ the intra-level squared distances are $d_0^2 = 4\sin^2(\pi/8) \approx 0.586$, $d_1^2 = 2$, $d_2^2 = 4$. The effective SNR per dimension for an antipodal binary channel of squared distance $d^2$ is $d^2 / (4\sigma^2)$; across the three levels it therefore spans a factor of $d_2^2 / d_0^2 \approx 6.8$, which is why the levels demand such different code rates.
Capacity of a binary antipodal sub-channel
The capacity of a binary-input AWGN channel with squared distance $d^2$ and noise variance $\sigma^2$ per dimension is well-approximated for our purpose by $C \approx 1 - H_b\big(Q(d / 2\sigma)\big)$, where $H_b$ is the binary entropy function and $Q$ is the Gaussian tail function. (This is the capacity of the binary symmetric channel obtained by hard decisions, a lower bound on the true soft-decision capacity.)
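As a quick sketch, assuming the hard-decision form $1 - H_b(Q(d / 2\sigma))$ just stated, the approximation takes only a few lines of standard-library Python:

```python
import math

def hb(p):
    """Binary entropy H_b(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def qfunc(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def binary_capacity_approx(d2, sigma2):
    """Hard-decision approximation to the binary antipodal capacity:
    C ~ 1 - H_b(Q(d / (2 * sigma))). Lower-bounds the soft capacity."""
    return 1.0 - hb(qfunc(math.sqrt(d2) / (2.0 * math.sqrt(sigma2))))
```

Feeding the three 8-PSK distances at a common $\sigma^2$ reproduces the ordering $C_0 < C_1 < C_2$ seen in the plot.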
Plug in and compute
- Level 0: $C_0 = 1 - H_b\big(Q(d_0 / 2\sigma)\big)$ with $d_0^2 \approx 0.586$, the smallest of the three.
- Level 1: $C_1 = 1 - H_b\big(Q(d_1 / 2\sigma)\big)$ with $d_1^2 = 2$.
- Level 2: $C_2 = 1 - H_b\big(Q(d_2 / 2\sigma)\big)$ with $d_2^2 = 4$, close to $1$ bit at any moderate SNR.
Total capacity and rate allocation
Summing the three capacities gives the total MLC/MSD capacity in bits/symbol. The gap to the unconstrained Shannon AWGN capacity at the same SNR is the modulation-capacity loss of 8-PSK: the penalty for restricting the input to eight equiprobable points on a circle.
The capacity-rule allocation is $(R_0, R_1, R_2) = (C_0, C_1, C_2)$. Notice how asymmetric it is: level 0 is a low-rate, heavily protected code; level 2 is nearly uncoded. A designer using the same-rate allocation $R_i = \tfrac{1}{3} \sum_j C_j$ would lose at level 0: that rate exceeds $C_0$, so reliable decoding is impossible there. This is the operational meaning of "no other allocation works."
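The asymmetry can be made concrete. The sketch below evaluates the hard-decision approximation at an assumed operating point of 10 dB (my illustrative choice, not a value from the text) and compares the capacity-rule allocation with a naive equal split:

```python
import math

def hb(p):
    """Binary entropy H_b(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def cap(d2, sigma2):
    # hard-decision approximation: 1 - H_b(Q(d / (2 * sigma)))
    q = 0.5 * math.erfc(math.sqrt(d2) / (2 * math.sqrt(sigma2)) / math.sqrt(2))
    return 1.0 - hb(q)

snr_db = 10.0                                       # assumed SNR, E_s = 1
sigma2 = 10 ** (-snr_db / 10)
d2s = [4 * math.sin(math.pi / 8) ** 2, 2.0, 4.0]    # Ungerboeck 8-PSK chain
caps = [cap(d2, sigma2) for d2 in d2s]
total = sum(caps)
equal = total / 3                                   # naive same-rate split

print("capacity-rule rates:", [round(c, 3) for c in caps])
print("total:", round(total, 3), "bits/symbol")
print("equal-rate", round(equal, 3),
      "exceeds C_0" if equal > caps[0] else "fits within C_0")
```

At this assumed SNR the equal split lands well above $C_0$, so level 0 would be undecodable no matter how good its code is.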
Rate Allocation in Practice
In a real system one does not get to pick $R_i$ continuously. The available binary codes (LDPC, polar, convolutional) come in discrete rate classes. The designer picks the available rate closest to $C_i$ from below at each level (rounding down preserves reliable decoding). This discretisation costs a small amount of capacity, usually a fraction of a dB, and is the practical reason that the theoretical MLC-capacity curve in the next section is not fully reached by a specific implementation.
Common Mistake: The unconditional is NOT the capacity rule
Mistake:
Allocating rate $R_i = I(Y; B_i)$, the unconditional mutual information between $Y$ and the $i$-th label bit, and expecting MSD to achieve the sum.
Correction:
The capacity rule uses the conditional mutual information $C_i = I(Y; B_i \mid B_0, \dots, B_{i-1})$ at every level except $i = 0$. The sum of unconditional terms $\sum_i I(Y; B_i)$ is precisely the BICM capacity $C_{\mathrm{BICM}}$, which is generally less than $I(Y; X)$, so the unconditional allocation shortchanges the scheme. Section s04 makes this gap explicit.
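The gap is easy to exhibit numerically. The sketch below is illustrative (8-PSK, natural set-partitioning labelling, complex AWGN; my assumptions, not code from the text): it estimates both sums by Monte Carlo and shows $\sum_i I(Y; B_i) \le \sum_i I(Y; B_i \mid B_0, \dots, B_{i-1})$.

```python
import numpy as np

def mlc_vs_bicm(snr_db, n=150_000, levels=3, seed=1):
    """Monte Carlo: conditional sum (MLC/MSD capacity) vs. unconditional
    sum (BICM capacity) for 8-PSK with the natural labelling."""
    rng = np.random.default_rng(seed)
    M = 2 ** levels
    pts = np.exp(2j * np.pi * np.arange(M) / M)      # unit-energy 8-PSK
    sigma2 = 10 ** (-snr_db / 10)
    k = rng.integers(M, size=n)
    y = pts[k] + np.sqrt(sigma2 / 2) * (rng.standard_normal(n)
                                        + 1j * rng.standard_normal(n))
    like = np.exp(-np.abs(y[:, None] - pts[None, :]) ** 2 / sigma2)
    idx = np.arange(M)[None, :]
    kk = k[:, None]

    def info(num_mask, den_mask):
        # E[log2 p(y | finer conditioning) / p(y | coarser conditioning)]
        num = (like * num_mask).sum(1) / num_mask.sum(1)
        den = (like * den_mask).sum(1) / den_mask.sum(1)
        return float(np.mean(np.log2(num / den)))

    every = np.ones((n, M), dtype=bool)
    mlc = sum(info((idx % (1 << (i + 1))) == (kk % (1 << (i + 1))),
                   (idx % (1 << i)) == (kk % (1 << i)))
              for i in range(levels))
    bicm = sum(info(((idx >> i) & 1) == ((kk >> i) & 1), every)
               for i in range(levels))
    return mlc, bicm
```

At moderate SNR the two sums differ visibly for this labelling; Gray labelling would shrink the gap, which is exactly why practical BICM systems use Gray labels.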
Quick Check
Which identity lies at the heart of the MLC capacity rule?
The data processing inequality
Fano's inequality
The chain rule of mutual information: $I(Y; B_0, \dots, B_{\ell-1}) = \sum_{i=0}^{\ell-1} I(Y; B_i \mid B_0, \dots, B_{i-1})$
Jensen's inequality for concave functions
The capacity rule is the chain rule of mutual information applied to the bijective label-to-constellation map. Every other information-theoretic tool in the proof (the noisy-channel coding theorem and its converse) is applied separately at each level; the decomposition itself is the chain rule.
Capacity rule
The rate-allocation rule for multilevel coding: set the rate of the level-$i$ binary code equal to the conditional mutual information $C_i = I(Y; B_i \mid B_0, \dots, B_{i-1})$. The rates sum to the full CM capacity $I(Y; X)$.
Related: Multilevel Code (MLC) Encoder, MSD Achieves the CM Capacity, Chain Rule
From the Rule to an LDPC Rate Table
The DVB-S2 standard's MODCOD table is, from a coded-modulation point of view, a pragmatic instantiation of the capacity rule with one code rate per constellation (that is, BICM-style, not MLC). The table lists the LDPC code rate and the constellation size as a pair, chosen to cover the expected operating range in small SNR steps. An MLC-native DVB-S2 would instead store three (for 8-PSK) or four (for 16-APSK) rates per modulation, one per level.
The fact that DVB-S2 shipped with the simpler BICM-style single-rate design, not an MLC-style multi-rate design, is the clearest practical evidence that the BICM–MLC gap was judged too small to justify the complexity. We will quantify this gap in s04.
- DVB-S2 LDPC code rates: 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 5/6, 8/9, 9/10
- A full MLC implementation would need a separate rate per level, multiplying the code-table size by the number of levels per constellation