Conditional Independence

When Independence Emerges from Conditioning

In complex systems, events that are correlated marginally (without any conditioning) may become independent once we condition on some intermediate variable. Conversely, events that are marginally independent may become dependent once we condition on a common effect. Conditional independence is the language for expressing these relationships precisely.

Graphical models (Bayesian networks, factor graphs, Markov random fields) are nothing but compact representations of conditional independence structures. The belief propagation algorithm passes messages along graph edges by repeatedly applying Bayes' theorem to locally conditionally independent components. Understanding conditional independence is a prerequisite for everything in probabilistic inference.

Definition:

Conditional Independence

Events $A$ and $B$ are conditionally independent given $C$, written $A \perp B \mid C$, if $\mathbb{P}(C) > 0$ and
$$\mathbb{P}(A \cap B \mid C) = \mathbb{P}(A \mid C)\,\mathbb{P}(B \mid C).$$
Equivalently (when $\mathbb{P}(B \cap C) > 0$):
$$\mathbb{P}(A \mid B \cap C) = \mathbb{P}(A \mid C),$$
i.e., given $C$, knowledge of $B$ adds no further information about $A$.
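To make the definition operational, here is a minimal Python sketch (an illustration added here, not part of the text) that checks the factorisation above for binary events represented by a joint table `p[(a, b, c)]`:

```python
from itertools import product

def cond_independent(p, tol=1e-12):
    """Check A ⊥ B | C for a joint table p[(a, b, c)] over binary A, B, C."""
    for c in (0, 1):
        p_c = sum(p[(a, b, c)] for a, b in product((0, 1), repeat=2))
        if p_c == 0:                     # only condition on events of positive probability
            continue
        for a, b in product((0, 1), repeat=2):
            p_ab = p[(a, b, c)] / p_c                           # P(A=a, B=b | C=c)
            p_a  = sum(p[(a, y, c)] for y in (0, 1)) / p_c      # P(A=a | C=c)
            p_b  = sum(p[(x, b, c)] for x in (0, 1)) / p_c      # P(B=b | C=c)
            if abs(p_ab - p_a * p_b) > tol:
                return False
    return True
```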

Conditional independence given $C$ does NOT imply marginal independence, and marginal independence does NOT imply conditional independence. The two notions are logically unrelated and can hold in either combination.

Example: Conditionally Independent but Marginally Dependent

A student passes an exam ($A$) or fails, and the exam difficulty is high ($C = 1$) or low ($C = 0$). Suppose:

  • $\mathbb{P}(C=1) = 0.5$
  • $\mathbb{P}(A=1 \mid C=1) = 0.3$ (hard exam: the student is more likely to fail)
  • $\mathbb{P}(A=1 \mid C=0) = 0.8$ (easy exam: the student is more likely to pass)
  • $B$ = "neighbour passes the same exam", with the same conditional probabilities as $A$ and conditionally independent of $A$ given $C$.

Show that $A$ and $B$ are marginally dependent (correlated) but conditionally independent given $C$.
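One way to carry out the check is numerically. The following Python sketch uses only the probabilities listed above; the printed values ($0.365$ versus $0.3025$) follow from them:

```python
# Probabilities from the example; A ⊥ B | C is built into the construction.
p_c    = {0: 0.5, 1: 0.5}          # P(C = c)
p_pass = {0: 0.8, 1: 0.3}          # P(pass | C = c), same for A and B

def joint_ab(a, b):
    """P(A = a, B = b), marginalising over C and using A ⊥ B | C."""
    return sum(p_c[c]
               * (p_pass[c] if a else 1 - p_pass[c])
               * (p_pass[c] if b else 1 - p_pass[c])
               for c in (0, 1))

p_a1 = joint_ab(1, 0) + joint_ab(1, 1)   # P(A = 1) = 0.55
p_b1 = joint_ab(0, 1) + joint_ab(1, 1)   # P(B = 1) = 0.55
print(joint_ab(1, 1), p_a1 * p_b1)       # 0.365 vs 0.3025 -> marginally dependent
```

Given $C$, the factorisation $\mathbb{P}(A, B \mid C) = \mathbb{P}(A \mid C)\,\mathbb{P}(B \mid C)$ holds by construction, so the pair is conditionally independent yet marginally dependent.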

Example: Marginally Independent but Conditionally Dependent

Toss two fair coins independently: let $A$ = "first coin heads" and $B$ = "second coin heads". They are independent. Now condition on $C$ = "exactly one head". Show that $A$ and $B$ become dependent given $C$.
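A small Python sketch, enumerating the four equally likely outcomes, makes the collider effect visible (exact arithmetic via fractions is an implementation choice, not from the text):

```python
from itertools import product
from fractions import Fraction

# Two fair coins; A = first coin heads, B = second coin heads, C = exactly one head.
outcomes = list(product([0, 1], repeat=2))          # each outcome has probability 1/4
p = Fraction(1, 4)

def prob(pred):
    return sum(p for o in outcomes if pred(o))

A = lambda o: o[0] == 1
B = lambda o: o[1] == 1
C = lambda o: o[0] + o[1] == 1

p_c = prob(C)                                                   # 1/2
p_a_given_c  = prob(lambda o: A(o) and C(o)) / p_c              # 1/2
p_b_given_c  = prob(lambda o: B(o) and C(o)) / p_c              # 1/2
p_ab_given_c = prob(lambda o: A(o) and B(o) and C(o)) / p_c     # 0
print(p_ab_given_c, p_a_given_c * p_b_given_c)                  # 0 vs 1/4: dependent given C
```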

Definition:

Markov Chain (Three Events)

Three events $A$, $B$, $C$ with positive probability form a Markov chain $A \multimap B \multimap C$ (or $A \to B \to C$) if
$$\mathbb{P}(A \cap B \cap C) = \mathbb{P}(A)\,\mathbb{P}(B \mid A)\,\mathbb{P}(C \mid B).$$
Equivalently, $A$ and $C$ are conditionally independent given $B$:
$$\mathbb{P}(C \mid A \cap B) = \mathbb{P}(C \mid B),$$
i.e., given $B$, knowledge of $A$ provides no additional information about $C$.

The Markov chain notation $A \multimap B \multimap C$ does NOT imply a temporal or causal ordering; it is a statement about conditional independence. Importantly, $A \multimap B \multimap C$ is equivalent to $C \multimap B \multimap A$: the chain is symmetric in $A$ and $C$ given $B$.

Theorem: Markov Chain Implies Conditional Independence

If $A \multimap B \multimap C$ is a Markov chain, then:

  1. $A$ and $C$ are conditionally independent given $B$: $A \perp C \mid B$.
  2. The chain is symmetric: $C \multimap B \multimap A$.
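A short proof sketch, assuming all conditioning events have positive probability, written in the notation of the definition:

```latex
\begin{align*}
\mathbb{P}(C \mid A \cap B)
  &= \frac{\mathbb{P}(A \cap B \cap C)}{\mathbb{P}(A \cap B)}
   = \frac{\mathbb{P}(A)\,\mathbb{P}(B \mid A)\,\mathbb{P}(C \mid B)}{\mathbb{P}(A)\,\mathbb{P}(B \mid A)}
   = \mathbb{P}(C \mid B), \\[4pt]
\mathbb{P}(A \cap C \mid B)
  &= \frac{\mathbb{P}(A \cap B \cap C)}{\mathbb{P}(B)}
   = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}\,\mathbb{P}(C \mid B)
   = \mathbb{P}(A \mid B)\,\mathbb{P}(C \mid B).
\end{align*}
```

The second line is exactly $A \perp C \mid B$, and since the final product is symmetric in $A$ and $C$, the reversed chain $C \multimap B \multimap A$ follows.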

Why This Matters: Markov Chains and Factor Graphs

In channel decoding, the code graph imposes a Markov structure on the encoded bits. For a convolutional code, consecutive encoder states form a Markov chain: $S_0 \multimap S_1 \multimap \cdots \multimap S_n$. The Viterbi algorithm exploits this structure: it propagates messages along the trellis using exactly the chain rule of conditional probability.
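For illustration only, here is a minimal max-product (Viterbi-style) recursion on a toy two-state chain; the transition table, emission table, and observation sequence below are invented for the sketch and are not from the text:

```python
import numpy as np

# Toy two-state Markov chain S_0 -> S_1 -> ... -> S_n with noisy observations.
trans = np.array([[0.9, 0.1],     # trans[i, j] = P(S_t = j | S_{t-1} = i)
                  [0.2, 0.8]])
emit  = np.array([[0.7, 0.3],     # emit[i, y]  = P(obs_t = y | S_t = i)
                  [0.4, 0.6]])
prior = np.array([0.5, 0.5])      # P(S_0 = i)
obs   = [0, 1, 1, 0]              # received symbols

# Max-product recursion: delta[j] = probability of the best path ending in state j.
delta = prior * emit[:, obs[0]]
back = []
for y in obs[1:]:
    scores = delta[:, None] * trans          # scores[i, j] = delta[i] * P(j | i)
    back.append(scores.argmax(axis=0))       # best predecessor for each state j
    delta = scores.max(axis=0) * emit[:, y]

# Backtrack the most likely state sequence.
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()
print(path)
```

The recursion is valid only because, given the current state, the past states carry no further information about the future: precisely the Markov property above.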

More generally, factor graphs (used in LDPC and turbo code decoding) are graphical representations of conditional independence structures. The belief propagation algorithm on a tree-structured factor graph is an efficient algorithm for computing marginal posteriors, and its correctness follows directly from the Markov properties encoded in the graph.

Quick Check

$A \multimap B \multimap C$ is a Markov chain. Which of the following is guaranteed to hold?

  • $\mathbb{P}(A \cap B \cap C) = \mathbb{P}(A)\,\mathbb{P}(B)\,\mathbb{P}(C)$
  • $\mathbb{P}(A \mid B, C) = \mathbb{P}(A \mid B)$
  • $A$ and $C$ are marginally independent
  • $\mathbb{P}(C \mid A, B) = \mathbb{P}(C \mid A)$

Historical Note: Andrei Markov and the Chain That Bears His Name


Andrei Andreyevich Markov (1856–1922) introduced what we now call Markov chains in 1906 as part of a dispute with Pavel Nekrasov. Nekrasov had claimed that the law of large numbers required independence. Markov constructed a counterexample: a dependent sequence (alternating letters from Pushkin's poem "Eugene Onegin") for which the law of large numbers still held. The chain Markov used was exactly what bears his name: a sequence where each element depends only on the immediately preceding one.

Today, Markov chains are the foundation of stochastic processes (Chapters 13–15), Monte Carlo methods, hidden Markov models in speech recognition, and the PageRank algorithm. The concept of the Markov property ("the future is independent of the past given the present") is arguably the single most important simplifying assumption in applied probability.

Key Takeaway

Conditional independence is the language of graphical models. $A \perp B \mid C$ means: once you know $C$, learning $B$ tells you nothing further about $A$. Conditioning on a common cause removes correlation between its effects; conditioning on a common effect (a collider) creates correlation between otherwise independent causes. The Markov chain $A \multimap B \multimap C$ is the simplest conditional independence structure: $A$ and $C$ are screened off by $B$.