The Max-Product Algorithm

From Marginals to MAP

Sum-product computes marginals $p(x_i)$: the probability distribution of a single variable, obtained by summing over all the others. But many applications ask a different question: what is the most likely joint configuration $\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}} p(\mathbf{x})$?

The point is that replacing $\sum$ with $\max$ in the sum-product rules gives a new algorithm, max-product, that solves the MAP problem. The Viterbi algorithm is a famous instance: max-product on a trellis. Same factor graph, same algorithm skeleton, different operator. That is the elegance of the framework.

Definition:

Max-Product Messages

The max-product algorithm replaces the marginalization sum with a maximization:
$$\mu_{a \to i}^{\max}(x_i) = \max_{\mathbf{x}_{\partial a \setminus i}} f_a(\mathbf{x}_{\partial a}) \prod_{j \in \partial a \setminus i} \mu_{j \to a}^{\max}(x_j),$$
while the variable-to-factor update is unchanged:
$$\mu_{i \to a}^{\max}(x_i) = \prod_{b \in \partial i \setminus a} \mu_{b \to i}^{\max}(x_i).$$
The final max-marginal is
$$b_i^{\max}(x_i) \propto \prod_{a \in \partial i} \mu_{a \to i}^{\max}(x_i).$$
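As a concrete illustration, here is a minimal sketch of one factor-to-variable max-product message for a hypothetical pairwise factor over two binary variables; the factor table and incoming message values are made up for the example:

```python
# Hypothetical pairwise factor f_a(x_i, x_j) over binary variables,
# and an incoming variable-to-factor message mu_{j->a} (values invented).
f_a = [[1.0, 0.5],
       [0.2, 2.0]]          # f_a[x_i][x_j]
mu_j_to_a = [0.6, 0.4]      # mu_{j->a}(x_j)

# Factor-to-variable message: maximize (not sum) over the other neighbor x_j.
mu_a_to_i = [
    max(f_a[x_i][x_j] * mu_j_to_a[x_j] for x_j in (0, 1))
    for x_i in (0, 1)
]
print(mu_a_to_i)  # [0.6, 0.8]
```

Each entry records the best achievable weight of the subtree behind the factor, for each value of $x_i$, rather than the summed weight as in sum-product.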

In the log domain (to avoid numerical underflow), max-product becomes max-sum: products become sums and $\max$ stays $\max$. This is the form used in Viterbi decoding.

Theorem: Max-Product Computes MAP on Trees

If the factor graph is a tree, the max-product algorithm computes, up to normalization, $b_i^{\max}(x_i) \propto \max_{\mathbf{x}_{\sim i}} p(\mathbf{x})$ for every $i$, where $\mathbf{x}_{\sim i}$ denotes all variables other than $x_i$; consequently $\max_{x_i} b_i^{\max}(x_i) \propto p(\mathbf{x}_{\text{MAP}})$. Coordinate-wise traceback (selecting $\hat{x}_i = \arg\max_{x_i} b_i^{\max}(x_i)$) yields the MAP configuration provided the maximizer is unique.

Max-product is the MAP analog of sum-product. Max distributes over products in the same way sums do (because $\max(ab, ac) = a \max(b, c)$ for $a > 0$). On a tree, the same subtree-factorization argument applies, with max replacing sum.
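The distributive step can be checked numerically; a throwaway sanity check with arbitrary positive numbers:

```python
import math

# max distributes over multiplication by a positive constant, exactly as
# sum does -- this is what lets max-product reuse the sum-product schedule.
a, b, c = 2.0, 3.0, 5.0
assert max(a * b, a * c) == a * max(b, c)   # distributivity for a > 0

# In the log domain the same identity reads max(x+y, x+z) = x + max(y, z),
# which is why max-sum works with sums in place of products.
x, y, z = math.log(a), math.log(b), math.log(c)
assert math.isclose(max(x + y, x + z), x + max(y, z))
print("distributivity holds")
```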

Example: Viterbi as Max-Product on a Trellis

A rate-1/1 convolutional code has state $s_t \in \{0, 1\}$, with transition $s_{t+1} = (s_t + u_{t+1}) \bmod 2$ and output $c_t = (s_t + s_{t-1}) \bmod 2$. We observe the outputs through AWGN. Show that the Viterbi decoder is max-product on the trellis factor graph.
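One way to work this example in code (a sketch: the BPSK mapping $c \mapsto 1 - 2c$, unit noise variance, and initial state $s_0 = 0$ are assumptions not fixed by the problem statement; for this toy rate-1 code the output equals the input bit, but the trellis mechanics are the general ones):

```python
import math

def viterbi_decode(ys, sigma=1.0):
    """Max-sum (Viterbi) decoding on the two-state trellis of the example.

    Assumes s_0 = 0, BPSK mapping c -> 1 - 2c, and AWGN with std sigma.
    Branch metric = Gaussian log-likelihood up to an additive constant.
    """
    states = (0, 1)
    metric = {0: 0.0, 1: -math.inf}       # path metrics; s_0 = 0 assumed
    backptrs = []
    for y in ys:
        new_metric, bp = {}, {}
        for s in states:                  # next state
            best, arg = -math.inf, None
            for sp in states:             # previous state
                c = (s + sp) % 2          # output on transition sp -> s
                m = metric[sp] - (y - (1 - 2 * c)) ** 2 / (2 * sigma ** 2)
                if m > best:
                    best, arg = m, sp
            new_metric[s], bp[s] = best, arg
        metric = new_metric
        backptrs.append(bp)
    # Traceback along survivor paths (the backpointers of max-product).
    s = max(states, key=lambda st: metric[st])
    path = [s]
    for bp in reversed(backptrs):
        s = bp[s]
        path.append(s)
    path.reverse()                        # [s_0, s_1, ..., s_T]
    # Recover inputs: u_t = (s_t + s_{t-1}) mod 2.
    return [(path[t] + path[t - 1]) % 2 for t in range(1, len(path))]

# Noiseless check: u = [1,0,1,1] gives states [0,1,1,0,1],
# outputs c = [1,0,1,1], BPSK symbols [-1,+1,-1,-1].
print(viterbi_decode([-1.0, 1.0, -1.0, -1.0]))  # [1, 0, 1, 1]
```

The per-step maximization over the previous state is exactly the factor-to-variable max-product message, and the survivor paths are its backpointers.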

Viterbi Is Not New

Viterbi (1967) predates the factor-graph formalism, but in retrospect his algorithm is nothing more than max-product on a trellis factor graph. Similarly, the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm is sum-product on the same trellis. Viterbi and BCJR are not just related: they are the same algorithm with $\max$ and $\sum$ swapped.

Max-Product Algorithm (Log-Domain)

Complexity: per iteration $O\left(\sum_a |\mathcal{X}|^{|\partial a|}\right)$, same as sum-product.

Input: factor graph with log-factors g_a = log f_a
Output: MAP estimate x_hat

Initialize all messages m_{i->a}(x_i) = 0 and m_{a->i}(x_i) = 0
for t = 1 to T:
    // Variable-to-factor
    for each edge (i, a):
        m_{i->a}(x_i) = sum over b in N(i)\{a} of m_{b->i}(x_i)
    // Factor-to-variable (max replaces sum)
    for each edge (a, i):
        for each x_i:
            m_{a->i}(x_i) = max over x_{N(a)\{i}} of
                g_a(x_{N(a)}) + sum over j in N(a)\{i} of m_{j->a}(x_j)
            record backpointer bp_{a,i}(x_i) = argmax
// Decode
for each variable i:
    x_hat[i] = argmax over x_i of sum over a in N(i) of m_{a->i}(x_i)
// On trees: follow backpointers for a globally consistent solution
return x_hat

On trees, one pair of passes is exact. On loopy graphs, max-product may not converge; damping and attention to ties are needed.
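To make the pseudocode concrete, here is a small self-contained translation for a binary chain $x_1 - f_{12} - x_2 - f_{23} - x_3$ (the log-factor tables are arbitrary numbers invented for the example), checked against brute-force MAP:

```python
from itertools import product

# Arbitrary log-factor tables g(x, y) for the two pairwise factors (made up).
g12 = [[0.0, 1.2], [0.7, -0.3]]
g23 = [[0.5, -1.0], [0.2, 0.9]]

# One inward + one outward pass of max-sum messages (exact on this tree).
# Leaf variables send all-zero messages, so the factor-to-variable messages are:
m12_to_2 = [max(g12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
m23_to_2 = [max(g23[x2][x3] for x3 in (0, 1)) for x2 in (0, 1)]
m12_to_1 = [max(g12[x1][x2] + m23_to_2[x2] for x2 in (0, 1)) for x1 in (0, 1)]
m23_to_3 = [max(g23[x2][x3] + m12_to_2[x2] for x2 in (0, 1)) for x3 in (0, 1)]

# Max-marginal beliefs and coordinate-wise argmax (the decode step).
b1 = m12_to_1
b2 = [a + b for a, b in zip(m12_to_2, m23_to_2)]
b3 = m23_to_3
x_hat = tuple(bi.index(max(bi)) for bi in (b1, b2, b3))

# Brute-force MAP for comparison.
logp = lambda x: g12[x[0]][x[1]] + g23[x[1]][x[2]]
x_map = max(product((0, 1), repeat=3), key=logp)

print(x_hat, x_map)  # both (0, 1, 1)
```

Each belief's maximum equals the maximal joint log-probability, matching the tree-exactness theorem above.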

Theorem: Loopy Max-Product: Local Optimality

Let $\hat{\mathbf{x}}$ be a fixed point of loopy max-product on a graph with girth $g$. Then $\hat{\mathbf{x}}$ is locally optimal: any alternative assignment differing from $\hat{\mathbf{x}}$ on a subset $S$ with $|\text{boundary}(S)| < g/2$ has lower joint probability.

Loopy max-product does not always converge, but when it does, the returned configuration beats any 'local' perturbation. This is a weaker guarantee than global optimality, but it is nontrivial and is the theoretical basis of why max-product is useful in practice.


Sum-Product vs. Max-Product

Aspect | Sum-product (BP) | Max-product (MAP-BP)
Computes | Marginals $p(x_i)$ | MAP $\arg\max_{\mathbf{x}} p(\mathbf{x})$
Message operation at factor | Sum over neighbors | Max over neighbors
Log-domain equivalent | Log-sum-exp | Max-sum (Viterbi-style)
Classical instance | BCJR, forward-backward | Viterbi
Estimation criterion | MMSE / posterior mean | MAP
Exact on trees | Yes | Yes (with traceback)
Loopy behavior | Converges empirically; Bethe approximation | May oscillate; locally optimal fixed points
Bit vs. sequence error | Minimizes bit errors (marginal MAP) | Minimizes sequence errors (joint MAP)

Common Mistake: Marginal MAP β‰  Joint MAP

Mistake:

Computing marginal beliefs $b_i(x_i)$ via sum-product and then setting $\hat{x}_i = \arg\max b_i(x_i)$, expecting this to equal the joint MAP.

Correction:

The vector of coordinate-wise marginal maximizers is the bitwise-MAP decoder (it minimizes bit error rate), not the joint MAP (which minimizes word error rate). The two differ whenever the posterior has competing modes. If you want the joint MAP, use max-product, not sum-product. LDPC decoding uses sum-product because bit errors are what count; Viterbi uses max-product because sequence errors are what count.

Example: When Marginal MAP Differs from Joint MAP

A joint distribution on $(x_1, x_2) \in \{0,1\}^2$ has $p(0,0) = 0.4$, $p(0,1) = 0.1$, $p(1,0) = 0.1$, $p(1,1) = 0.4$. Find the joint MAP and the marginal-MAP assignments. Do they agree?
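Working the numbers (a quick check, not part of the problem statement):

```python
# The joint distribution from the example.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Joint MAP: two tied modes, (0,0) and (1,1), each with probability 0.4.
top = max(p.values())
modes = [x for x, v in p.items() if v == top]
print(modes)   # [(0, 0), (1, 1)]

# Marginals: both uniform, p(x1=0) = p(x1=1) = 0.5 (same for x2), so the
# marginal-MAP choice for each bit is a coin flip -- and choosing the bits
# independently can even produce (0, 1), which has joint probability 0.1.
p1 = {v: sum(p[(v, x2)] for x2 in (0, 1)) for v in (0, 1)}
p2 = {v: sum(p[(x1, v)] for x1 in (0, 1)) for v in (0, 1)}
print(p1, p2)  # {0: 0.5, 1: 0.5} {0: 0.5, 1: 0.5}
```

The marginal beliefs carry no preference at all here, while the joint posterior has strong structure; only the joint MAP sees it.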

Why This Matters: Viterbi in 5G Uplink Control Channels

While 5G data channels use LDPC (sum-product decoding), control channels (PUCCH) still use tail-biting convolutional codes decoded with Viterbi (max-product), because they are short and Viterbi's low-latency, deterministic decoding is preferred over iterative BP. The same max-product algorithm, applied to a tiny trellis, decodes billions of control messages per day in deployed networks.

Viterbi Trellis Decoding Visualization

Visualize the Viterbi (max-product) algorithm on a convolutional code trellis. Watch the path metrics evolve and the survivor paths emerge.


Quick Check

You want the estimator that minimizes the bit error rate over a long codeword. Which algorithm should you use?

Sum-product BP (marginal beliefs)

Max-product (joint MAP)

Both are equivalent

Neither β€” use least squares

Key Takeaway

Max-product replaces $\sum$ with $\max$ in sum-product, computing MAP instead of marginals. The Viterbi algorithm is max-product on a trellis. On trees, max-product is exact with traceback; on loopy graphs, fixed points are locally optimal. Use sum-product when you want bit-level MMSE; use max-product when you want sequence-level MAP.