The Max-Product Algorithm

From Marginals to MAP

Sum-product computes marginals $p(x_i)$: the probability distribution of a single variable, obtained by summing over all the others. But many applications ask a different question: what is the most likely joint configuration $\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}} p(\mathbf{x})$?

The point is that replacing $\sum$ with $\max$ in the sum-product rules gives a new algorithm, max-product, that solves the MAP problem. The Viterbi algorithm is a famous instance: max-product on a trellis. Same factor graph, same algorithm skeleton, different operator. That is the elegance of the framework.

Definition:

Max-Product Messages

The max-product algorithm replaces the marginalization sum with a maximization:
$$\mu_{a \to i}^{\max}(x_i) = \max_{\mathbf{x}_{\partial a \setminus i}} f_a(\mathbf{x}_{\partial a}) \prod_{j \in \partial a \setminus i} \mu_{j \to a}^{\max}(x_j),$$
while the variable-to-factor update is unchanged:
$$\mu_{i \to a}^{\max}(x_i) = \prod_{b \in \partial i \setminus a} \mu_{b \to i}^{\max}(x_i).$$
The final max-marginal is
$$b_i^{\max}(x_i) \propto \prod_{a \in \partial i} \mu_{a \to i}^{\max}(x_i).$$
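As a concrete illustration, here is a minimal sketch of one factor-to-variable max-product message for a hypothetical pairwise factor over two binary variables; the factor table and incoming message values are made up for the example:

```python
# Hypothetical pairwise factor f_a(x_i, x_j) over binary variables,
# and an incoming variable-to-factor message mu_{j->a} (values invented).
f_a = [[1.0, 0.5],
       [0.2, 2.0]]          # f_a[x_i][x_j]
mu_j_to_a = [0.6, 0.4]      # mu_{j->a}(x_j)

# Factor-to-variable message: maximize (not sum) over the other neighbor x_j.
mu_a_to_i = [
    max(f_a[x_i][x_j] * mu_j_to_a[x_j] for x_j in (0, 1))
    for x_i in (0, 1)
]
print(mu_a_to_i)  # [0.6, 0.8]
```

Each entry records the best achievable weight of the subtree behind the factor, for each value of $x_i$, rather than the summed weight as in sum-product.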

In the log domain (to avoid numerical underflow), max-product becomes max-sum: products become sums and $\max$ stays $\max$. This is the form used in Viterbi decoding.

Theorem: Max-Product Computes MAP on Trees

If the factor graph is a tree, the max-product algorithm computes, up to normalization, $b_i^{\max}(x_i) \propto \max_{\mathbf{x}_{\sim i}} p(\mathbf{x})$ for every $i$, where $\mathbf{x}_{\sim i}$ denotes all variables other than $x_i$; consequently $\max_{x_i} b_i^{\max}(x_i) \propto p(\mathbf{x}_{\text{MAP}})$. Coordinate-wise traceback (selecting $\hat{x}_i = \arg\max_{x_i} b_i^{\max}(x_i)$) yields the MAP configuration provided the maximizer is unique.

Max-product is the MAP analog of sum-product. Max distributes over products in the same way sums do (because $\max(ab, ac) = a \max(b, c)$ for $a > 0$). On a tree, the same subtree-factorization argument applies, with max replacing sum.
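The distributive step can be checked numerically; a throwaway sanity check with arbitrary positive numbers:

```python
import math

# max distributes over multiplication by a positive constant, exactly as
# sum does -- this is what lets max-product reuse the sum-product schedule.
a, b, c = 2.0, 3.0, 5.0
assert max(a * b, a * c) == a * max(b, c)   # distributivity for a > 0

# In the log domain the same identity reads max(x+y, x+z) = x + max(y, z),
# which is why max-sum works with sums in place of products.
x, y, z = math.log(a), math.log(b), math.log(c)
assert math.isclose(max(x + y, x + z), x + max(y, z))
print("distributivity holds")
```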

Example: Viterbi as Max-Product on a Trellis

A rate-1/1 convolutional code has state $s_t \in \{0, 1\}$, with transition $s_{t+1} = (s_t + u_{t+1}) \bmod 2$ and output $c_t = (s_t + s_{t-1}) \bmod 2$. We observe the outputs through AWGN. Show that the Viterbi decoder is max-product on the trellis factor graph.
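One way to work this example in code (a sketch: the BPSK mapping $c \mapsto 1 - 2c$, unit noise variance, and initial state $s_0 = 0$ are assumptions not fixed by the problem statement; for this toy rate-1 code the output equals the input bit, but the trellis mechanics are the general ones):

```python
import math

def viterbi_decode(ys, sigma=1.0):
    """Max-sum (Viterbi) decoding on the two-state trellis of the example.

    Assumes s_0 = 0, BPSK mapping c -> 1 - 2c, and AWGN with std sigma.
    Branch metric = Gaussian log-likelihood up to an additive constant.
    """
    states = (0, 1)
    metric = {0: 0.0, 1: -math.inf}       # path metrics; s_0 = 0 assumed
    backptrs = []
    for y in ys:
        new_metric, bp = {}, {}
        for s in states:                  # next state
            best, arg = -math.inf, None
            for sp in states:             # previous state
                c = (s + sp) % 2          # output on transition sp -> s
                m = metric[sp] - (y - (1 - 2 * c)) ** 2 / (2 * sigma ** 2)
                if m > best:
                    best, arg = m, sp
            new_metric[s], bp[s] = best, arg
        metric = new_metric
        backptrs.append(bp)
    # Traceback along survivor paths (the backpointers of max-product).
    s = max(states, key=lambda st: metric[st])
    path = [s]
    for bp in reversed(backptrs):
        s = bp[s]
        path.append(s)
    path.reverse()                        # [s_0, s_1, ..., s_T]
    # Recover inputs: u_t = (s_t + s_{t-1}) mod 2.
    return [(path[t] + path[t - 1]) % 2 for t in range(1, len(path))]

# Noiseless check: u = [1,0,1,1] gives states [0,1,1,0,1],
# outputs c = [1,0,1,1], BPSK symbols [-1,+1,-1,-1].
print(viterbi_decode([-1.0, 1.0, -1.0, -1.0]))  # [1, 0, 1, 1]
```

The per-step maximization over the previous state is exactly the factor-to-variable max-product message, and the survivor paths are its backpointers.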

Viterbi Is Not New

Viterbi (1967) predates the factor-graph formalism, but in retrospect his algorithm is nothing more than max-product on a trellis factor graph. Similarly, the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm is sum-product on the same trellis. Viterbi and BCJR are not just related: they are the same algorithm with $\max$ and $\sum$ swapped.

Max-Product Algorithm (Log-Domain)

Complexity: per iteration $O\left(\sum_a |\mathcal{X}|^{|\partial a|}\right)$, same as sum-product.

Input: factor graph with log-factors g_a = log f_a
Output: MAP estimate x_hat

Initialize all messages m_{i->a}(x_i) = 0 and m_{a->i}(x_i) = 0
for t = 1 to T:
    // Variable-to-factor
    for each edge (i, a):
        m_{i->a}(x_i) = sum over b in N(i)\{a} of m_{b->i}(x_i)
    // Factor-to-variable (max replaces sum)
    for each edge (a, i):
        for each x_i:
            m_{a->i}(x_i) = max over x_{N(a)\{i}} of
                g_a(x_{N(a)}) + sum over j in N(a)\{i} of m_{j->a}(x_j)
            record backpointer bp_{a,i}(x_i) = argmax
// Decode
for each variable i:
    x_hat[i] = argmax over x_i of sum over a in N(i) of m_{a->i}(x_i)
// On trees: follow backpointers for a globally consistent solution
return x_hat

On trees, one pair of passes is exact. On loopy graphs, max-product may not converge; damping and attention to ties are needed.
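To make the pseudocode concrete, here is a small self-contained translation for a binary chain $x_1 - f_{12} - x_2 - f_{23} - x_3$ (the log-factor tables are arbitrary numbers invented for the example), checked against brute-force MAP:

```python
from itertools import product

# Arbitrary log-factor tables g(x, y) for the two pairwise factors (made up).
g12 = [[0.0, 1.2], [0.7, -0.3]]
g23 = [[0.5, -1.0], [0.2, 0.9]]

# One inward + one outward pass of max-sum messages (exact on this tree).
# Leaf variables send all-zero messages, so the factor-to-variable messages are:
m12_to_2 = [max(g12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
m23_to_2 = [max(g23[x2][x3] for x3 in (0, 1)) for x2 in (0, 1)]
m12_to_1 = [max(g12[x1][x2] + m23_to_2[x2] for x2 in (0, 1)) for x1 in (0, 1)]
m23_to_3 = [max(g23[x2][x3] + m12_to_2[x2] for x2 in (0, 1)) for x3 in (0, 1)]

# Max-marginal beliefs and coordinate-wise argmax (the decode step).
b1 = m12_to_1
b2 = [a + b for a, b in zip(m12_to_2, m23_to_2)]
b3 = m23_to_3
x_hat = tuple(bi.index(max(bi)) for bi in (b1, b2, b3))

# Brute-force MAP for comparison.
logp = lambda x: g12[x[0]][x[1]] + g23[x[1]][x[2]]
x_map = max(product((0, 1), repeat=3), key=logp)

print(x_hat, x_map)  # both (0, 1, 1)
```

Each belief's maximum equals the maximal joint log-probability, matching the tree-exactness theorem above.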

Theorem: Loopy Max-Product: Local Optimality

Let $\hat{\mathbf{x}}$ be a fixed point of loopy max-product on a graph with girth $g$. Then $\hat{\mathbf{x}}$ is locally optimal: any alternative assignment differing from $\hat{\mathbf{x}}$ on a subset $S$ with $|\text{boundary}(S)| < g/2$ has lower joint probability.

Loopy max-product does not always converge, but when it does, the returned configuration beats any 'local' perturbation. This is a weaker guarantee than global optimality, but it is nontrivial and is the theoretical basis of why max-product is useful in practice.


Sum-Product vs. Max-Product

Aspect | Sum-product (BP) | Max-product (MAP-BP)
Computes | Marginals $p(x_i)$ | MAP $\arg\max_{\mathbf{x}} p(\mathbf{x})$
Message operation at factor | Sum over neighbors | Max over neighbors
Log-domain equivalent | Log-sum-exp | Max-sum (Viterbi-style)
Classical instance | BCJR, forward-backward | Viterbi
Estimation criterion | MMSE / posterior mean | MAP
Exact on trees | Yes | Yes (with traceback)
Loopy behavior | Converges empirically; Bethe approximation | May oscillate; locally optimal fixed points
Bit vs. sequence error | Minimizes bit errors (marginal MAP) | Minimizes sequence errors (joint MAP)

Common Mistake: Marginal MAP β‰  Joint MAP

Mistake:

Computing marginal beliefs $b_i(x_i)$ via sum-product and then setting $\hat{x}_i = \arg\max b_i(x_i)$, expecting this to equal the joint MAP.

Correction:

The vector of coordinate-wise marginal maximizers is the bitwise-MAP decoder (it minimizes bit error rate), not the joint MAP (which minimizes word error rate). The two differ whenever the posterior has competing modes. If you want the joint MAP, use max-product, not sum-product. LDPC decoding uses sum-product because bit errors are what count; Viterbi uses max-product because sequence errors are what count.

Example: When Marginal MAP Differs from Joint MAP

A joint distribution on $(x_1, x_2) \in \{0,1\}^2$ has $p(0,0) = 0.4$, $p(0,1) = 0.1$, $p(1,0) = 0.1$, $p(1,1) = 0.4$. Find the joint MAP and the marginal-MAP assignments. Do they agree?
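Working the numbers (a quick check, not part of the problem statement):

```python
# The joint distribution from the example.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Joint MAP: two tied modes, (0,0) and (1,1), each with probability 0.4.
top = max(p.values())
modes = [x for x, v in p.items() if v == top]
print(modes)   # [(0, 0), (1, 1)]

# Marginals: both uniform, p(x1=0) = p(x1=1) = 0.5 (same for x2), so the
# marginal-MAP choice for each bit is a coin flip -- and choosing the bits
# independently can even produce (0, 1), which has joint probability 0.1.
p1 = {v: sum(p[(v, x2)] for x2 in (0, 1)) for v in (0, 1)}
p2 = {v: sum(p[(x1, v)] for x1 in (0, 1)) for v in (0, 1)}
print(p1, p2)  # {0: 0.5, 1: 0.5} {0: 0.5, 1: 0.5}
```

The marginal beliefs carry no preference at all here, while the joint posterior has strong structure; only the joint MAP sees it.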

Why This Matters: Viterbi in 5G Uplink Control Channels

While 5G data channels use LDPC (sum-product decoding), control channels (PUCCH) still use tail-biting convolutional codes decoded with Viterbi (max-product), because they are short and Viterbi's low-latency, deterministic decoding is preferred over iterative BP. The same max-product algorithm, applied to a tiny trellis, decodes billions of control messages per day in deployed networks.

Viterbi Trellis Decoding Visualization

Visualize the Viterbi (max-product) algorithm on a convolutional code trellis. Watch the path metrics evolve and the survivor paths emerge.


Quick Check

You want the estimator that minimizes the bit error rate over a long codeword. Which algorithm should you use?

Sum-product BP (marginal beliefs)

Max-product (joint MAP)

Both are equivalent

Neither β€” use least squares

Key Takeaway

Max-product replaces $\sum$ with $\max$ in sum-product, computing MAP instead of marginals. The Viterbi algorithm is max-product on a trellis. On trees, max-product is exact with traceback; on loopy graphs, fixed points are locally optimal. Use sum-product when you want bit-level MMSE; use max-product when you want sequence-level MAP.