From Distributions to Graphs

The Unifying Idea of Graphical Models

In Chapter 8 we studied EM, an iterative algorithm for inferring latent variables. In Chapter 14 we studied LASSO, an algorithm for sparse recovery. In Book ITA Chapter 12 we studied LDPC decoding, an iterative algorithm for channel coding. These problems look completely different, yet their solution methods all share a common structure: repeated local updates on a graph.

The point is that these problems are all instances of a single task: inference in a factorized probability distribution. And the algorithms that solve them (belief propagation, approximate message passing, Viterbi) are all instances of a single computational recipe: message passing on a graph.

Factor graphs are the language that makes this unity visible. Once we have the graph, the algorithm writes itself. The same code, applied to different factor graphs, decodes LDPC codes, equalizes ISI channels, performs MIMO detection, and runs the Kalman filter.

Definition:

Factor Graph

Let $p(\mathbf{x}) = \frac{1}{Z}\prod_{a \in \mathcal{F}} f_a(\mathbf{x}_{\partial a})$ be a factorized distribution on $\mathbf{x} = (x_1, \ldots, x_n)$, where each factor $f_a$ depends on a subset $\mathbf{x}_{\partial a}$ of the variables. The factor graph is a bipartite graph with:

  • Variable nodes $\{1, \ldots, n\}$, drawn as circles $\bigcirc$;
  • Factor nodes $\mathcal{F}$, drawn as squares $\square$;
  • Edges: $i \sim a$ iff $x_i$ appears in $\mathbf{x}_{\partial a}$. The neighborhood of variable node $i$ is $\partial i = \{a \in \mathcal{F} : i \sim a\}$; the neighborhood of factor node $a$ is $\partial a = \{i : i \sim a\}$.

A factor graph is more explicit than a Bayesian network or Markov random field: each factor function $f_a$ is represented by its own node, and the graph shows exactly which variables enter each factor. This explicitness is what makes message-passing algorithms easy to describe.

Example: From Distribution to Factor Graph

Draw the factor graph of the distribution $p(x_1, x_2, x_3, x_4) = \frac{1}{Z} f_A(x_1, x_2) f_B(x_2, x_3) f_C(x_3, x_4) f_D(x_1, x_4)$. Is the graph a tree?

A Basic Factor Graph

Bipartite factor graph: circles are variable nodes, squares are factor nodes. Each factor node is connected to the variables it depends on.
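A quick way to check tree-ness is to count: a connected graph is a tree exactly when its edge count is one less than its node count. The Python sketch below applies this test to the example above, using a made-up dict-of-scopes representation of the factor graph.

```python
# Factor graph of the example, as a dict mapping each factor node
# to the tuple of variable nodes in its scope (assumed representation).
factors = {"A": (1, 2), "B": (2, 3), "C": (3, 4), "D": (1, 4)}

variables = sorted({v for scope in factors.values() for v in scope})
num_nodes = len(variables) + len(factors)          # variable + factor nodes
num_edges = sum(len(scope) for scope in factors.values())

# This graph is connected, and a connected graph is a tree
# iff it has exactly |V| - 1 edges.
print(f"nodes={num_nodes}, edges={num_edges}, tree? {num_edges == num_nodes - 1}")
# nodes=8, edges=8 -> not a tree: the four factors close a cycle.
```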

Theorem: Marginalization via Factor-Graph Pushforward

Let $p(\mathbf{x}) = \frac{1}{Z}\prod_a f_a(\mathbf{x}_{\partial a})$. For any variable $x_i$, its marginal is $p(x_i) = \frac{1}{Z} \sum_{\mathbf{x}_{\sim i}} \prod_a f_a(\mathbf{x}_{\partial a})$, where $\mathbf{x}_{\sim i}$ denotes all variables except $x_i$. When the factor graph is a tree, this marginal can be computed in time $O(n \cdot \max_a |\mathcal{X}|^{|\partial a|})$ via the sum-product algorithm.

The factorization of $p$ means that marginalization distributes over the product: pushing sums inside products splits the global computation into local ones. On a tree this distribution is lossless, and we get an exact polynomial-time algorithm. On a loopy graph it introduces approximations; that is the subject of Chapter 18.
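To see the sums-inside-products idea concretely, here is a minimal Python sketch (the alphabet size and random factor tables are made up) that computes a chain marginal twice: once by brute force over the full joint, and once by sum-product messages that push each sum inside the product.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # alphabet size |X|, chosen arbitrarily for the demo

# Chain factorization p(x1, x2, x3) ∝ f1(x1) f12(x1, x2) f23(x2, x3)
f1 = rng.random(K)
f12 = rng.random((K, K))
f23 = rng.random((K, K))

# Brute force: build the full joint and sum over x2, x3 (exponential in n).
joint = f1[:, None, None] * f12[:, :, None] * f23[None, :, :]
brute = joint.sum(axis=(1, 2))
brute /= brute.sum()

# Sum-product on the tree: local sums, passed leaf-to-root as messages.
m_23_to_2 = f23.sum(axis=1)        # message from f23 into x2 (sum over x3)
m_12_to_1 = f12 @ m_23_to_2        # message from f12 into x1 (sum over x2)
marginal = f1 * m_12_to_1
marginal /= marginal.sum()

assert np.allclose(brute, marginal)  # same answer, linear-time schedule
```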


Bipartite graph

A graph with two disjoint node sets and edges only between the two sets. Factor graphs are bipartite, with variable nodes and factor nodes as the two sides.

Related: Tanner graph, LDPC code

Partition function

The normalization constant $Z = \sum_{\mathbf{x}} \prod_a f_a(\mathbf{x}_{\partial a})$ that makes $p(\mathbf{x})$ sum to one. Computing $Z$ is #P-hard in general but polynomial on trees.

Related: Evidence Lower Bound (ELBO) / Free Energy, Bethe approximation


Relation to Bayesian Networks and Markov Random Fields

Every Bayesian network (a directed acyclic graph with conditional probability tables) can be converted to a factor graph: each conditional $p(x_i \mid \text{parents}(x_i))$ becomes one factor node with the corresponding neighborhood. Every Markov random field (an undirected graph with clique potentials) can similarly be converted: each clique potential becomes a factor.

The factor graph is usually more informative because it disambiguates factorizations that share the same undirected skeleton. For instance, the factorizations $f(x_1, x_2, x_3)$ and $f_A(x_1, x_2) f_B(x_2, x_3) f_C(x_1, x_3)$ produce the same pairwise MRF graph (a triangle on three variables) but have distinct factor graphs. Message-passing algorithms operate on the factor graph, not the undirected skeleton, so the distinction matters.
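As a sketch of the conversion rule, assume a Bayesian network given as a parent map (the chain below is a made-up example). Each conditional $p(x_i \mid \text{parents}(i))$ becomes one factor whose neighborhood is the child together with its parents:

```python
# Hypothetical Bayes net x1 -> x2 -> x3, encoded as a parent map.
parents = {1: (), 2: (1,), 3: (2,)}

# Conversion rule: one factor node per conditional, scope = {i} ∪ parents(i).
factor_graph = {f"f{i}": (i, *parents[i]) for i in parents}

for factor, scope in sorted(factor_graph.items()):
    print(f"{factor}: neighborhood = {scope}")
# f1: (1,)   f2: (2, 1)   f3: (3, 2)  -- a chain-shaped factor graph
```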

Encoding Conditional Distributions

The classical modeling task, "write down the posterior $p(\mathbf{x} \mid \mathbf{y})$", maps to factor graphs via observations. An observation $\mathbf{y}$ is clamped: each factor touching it becomes $f_a(\mathbf{x}_{\partial a}, \mathbf{y}_{\partial a})$ evaluated at the observed value. The resulting graph has only $\mathbf{x}$ as free variables, and inference on it computes $p(\mathbf{x} \mid \mathbf{y})$.
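As a minimal sketch of clamping for a discrete factor (the table and observed value are made up): fixing the observed coordinate slices the factor down to a function of the free variables only.

```python
import numpy as np

rng = np.random.default_rng(1)
f_xy = rng.random((4, 4))   # hypothetical factor f(x, y) over 4-valued x, y

y_observed = 2
f_clamped = f_xy[:, y_observed]   # clamp y: evaluate f at the observed value

# f_clamped depends on x alone; in the factor graph, the y node disappears
# and the factor's neighborhood shrinks to {x}.
print(f_clamped)
```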

Example: Posterior in a Linear Gaussian Model

The model is $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w}$, with $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \sigma_x^2 \mathbf{I})$ and $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. Construct the factor graph for the posterior $p(\mathbf{x} \mid \mathbf{y})$.
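Every factor in this model is Gaussian, so the posterior is Gaussian and can be written in closed form by completing the square. The numpy sketch below (dimensions and noise levels are made up) computes the posterior mean and covariance that the clamped factor graph encodes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 5                  # made-up signal and observation dimensions
sigma_x, sigma = 1.0, 0.5    # made-up prior and noise standard deviations

H = rng.standard_normal((m, n))
x_true = sigma_x * rng.standard_normal(n)
y = H @ x_true + sigma * rng.standard_normal(m)

# Completing the square in log p(x) + log p(y|x) gives a Gaussian posterior:
#   precision = H^T H / sigma^2 + I / sigma_x^2,   mean = cov H^T y / sigma^2
precision = H.T @ H / sigma**2 + np.eye(n) / sigma_x**2
cov = np.linalg.inv(precision)
mean = cov @ H.T @ y / sigma**2

print("posterior mean:", mean)
```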

⚠️ Engineering Note

Sparsity Matters for Tractability

The computational cost of message passing scales with the factor sizes $|\partial a|$. Dense factors are the enemy of tractability. In practice, choose models whose factors depend on few variables: LDPC parity checks involve only $d_c$ (the check degree) variables; HMMs have pairwise transition factors; ISI channels have $L$-variable factors, where $L$ is the channel length. When models lack natural sparsity (e.g., dense MIMO), message passing is combined with approximations (Gaussian BP, AMP) that replace high-dimensional integrals by Gaussian moments.

Practical Constraints
  • Exact inference costs $O(|\mathcal{X}|^{|\partial a|})$ per factor, i.e., exponential in the factor size (see the back-of-envelope sketch below).
  • A factor over $d$ variables forms a size-$d$ clique in the moralized graph, so it forces tree-width $\geq d - 1$.
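To make the scaling concrete, this back-of-envelope sketch tabulates the per-factor cost $|\mathcal{X}|^{|\partial a|}$ for the model families mentioned above; the degrees and alphabet sizes are illustrative, not prescriptive.

```python
# Per-factor marginalization cost |X|^|∂a| for some illustrative models.
examples = [
    (2,  6, "LDPC parity check, d_c = 6"),
    (4,  2, "HMM pairwise transition"),
    (2,  5, "ISI channel, L = 5"),
    (16, 8, "dense MIMO, 8 streams, 16-QAM"),
]
for alphabet, degree, model in examples:
    print(f"{model:32s} |X|^|∂a| = {alphabet**degree:,}")
# The last line (~4.3 billion) is why dense models need Gaussian BP or AMP.
```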

Common Mistake: Factor Graphs Are Undirected

Mistake:

Drawing arrows on factor graph edges to represent 'causal' relationships between variables and factors.

Correction:

Factor graphs are undirected bipartite graphs. They represent symmetric dependence through factors; the direction of message flow in algorithms is a property of the algorithm, not the graph. If directed causal structure matters, use a Bayesian network and convert it to a factor graph when running inference.

Quick Check

A joint distribution $p(x_1, x_2, x_3) = f_1(x_1) f_{12}(x_1, x_2) f_{23}(x_2, x_3)$ has a factor graph structure that is:

  • A tree
  • A 3-cycle
  • Complete bipartite
  • Cannot be determined

Key Takeaway

A factor graph is a bipartite graph that makes the factorization of a distribution explicit. It is the natural data structure for message-passing inference: on trees, exact marginals can be computed in linear time; on loopy graphs, approximate inference becomes possible through the same local update rules.