Conditional Expectation Given a Sigma-Algebra

Beyond Conditioning on Events and Random Variables

In Chapter 12 we defined $\mathbb{E}[Y \mid X = x]$ as a function of $x$ and showed it is the MMSE estimator of $Y$ given $X$. But that definition relies on the existence of conditional densities or PMFs, and it conditions on a specific value of $X$, an event that typically has probability zero.

Measure theory provides a cleaner, more general definition: $\mathbb{E}[X \mid \mathcal{G}]$ is a $\mathcal{G}$-measurable random variable that captures "the best prediction of $X$ given the information in $\mathcal{G}$." This definition works regardless of whether $\mathcal{G}$ is generated by a discrete RV, a continuous RV, or something far more exotic. It also gives us martingales, among the most powerful tools in modern probability.


Definition: Conditional Expectation Given a $\sigma$-Algebra

Let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. The conditional expectation of $X$ given $\mathcal{G}$, written $\mathbb{E}[X \mid \mathcal{G}]$, is any random variable $Y$ satisfying:

  1. $Y$ is $\mathcal{G}$-measurable.
  2. $\int_G Y\, dP = \int_G X\, dP$ for every $G \in \mathcal{G}$.

Any two such random variables agree $P$-a.s. (i.e., the conditional expectation is unique up to a.s. equivalence).

Condition 1 says $Y$ depends only on the information in $\mathcal{G}$. Condition 2 says $Y$ preserves the "average value" of $X$ on every $\mathcal{G}$-measurable set. Together, they define $Y$ as the $\mathcal{G}$-measurable function that is closest to $X$ in the $L^2$ sense (when $X \in L^2$, conditional expectation is the orthogonal projection onto the subspace of $\mathcal{G}$-measurable square-integrable functions).
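Here is the calculation behind the projection claim, a sketch assuming $X \in L^2$:

```latex
% Sketch of the projection view (assuming X \in L^2).
% Condition 2 says \mathbb{E}[(X - Y)\mathbf{1}_G] = 0 for every G \in \mathcal{G}.
% Linearity extends this to simple \mathcal{G}-measurable Z, and an
% approximation argument extends it to all Z \in L^2(\Omega, \mathcal{G}, P):
\mathbb{E}\big[(X - Y)\,Z\big] = 0 \qquad \text{for all } Z \in L^2(\Omega, \mathcal{G}, P).
% So X - Y is orthogonal to the closed subspace L^2(\Omega, \mathcal{G}, P),
% which is exactly the statement that Y is the orthogonal projection of X.
```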


Theorem: Existence and Uniqueness of Conditional Expectation

Let $X \in L^1(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. Then $\mathbb{E}[X \mid \mathcal{G}]$ exists and is unique (a.s.).

Existence follows from the Radon-Nikodym theorem: the signed measure $\nu(G) = \int_G X\, dP$ is absolutely continuous with respect to $P|_{\mathcal{G}}$, so it has a Radon-Nikodym derivative, and that derivative is $\mathbb{E}[X \mid \mathcal{G}]$. Uniqueness follows from the fact that two $\mathcal{G}$-measurable functions whose integrals agree over every $G \in \mathcal{G}$ must agree a.s.

Example: Conditional Expectation on a Finite Partition

Let Ξ©=[0,1]\Omega = [0,1] with Lebesgue measure and G=Οƒ({[0,1/2),[1/2,1]})\mathcal{G} = \sigma(\{[0, 1/2), [1/2, 1]\}). Compute E[X∣G]\mathbb{E}[X \mid \mathcal{G}] where X(Ο‰)=Ο‰2X(\omega) = \omega^2.

Theorem: Properties of Conditional Expectation

Let $X, Y$ be integrable random variables on $(\Omega, \mathcal{F}, P)$ and $\mathcal{G}, \mathcal{H}$ be sub-$\sigma$-algebras. Then (all equalities a.s.):

  1. Linearity: $\mathbb{E}[aX + bY \mid \mathcal{G}] = a\mathbb{E}[X \mid \mathcal{G}] + b\mathbb{E}[Y \mid \mathcal{G}]$.
  2. Positivity: If $X \geq 0$ a.s., then $\mathbb{E}[X \mid \mathcal{G}] \geq 0$ a.s.
  3. Tower property: If $\mathcal{H} \subseteq \mathcal{G}$, then $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]$.
  4. Taking out what is known: If $Y$ is $\mathcal{G}$-measurable and $XY \in L^1$, then $\mathbb{E}[XY \mid \mathcal{G}] = Y \cdot \mathbb{E}[X \mid \mathcal{G}]$.
  5. Independence: If $X$ is independent of $\mathcal{G}$, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$.
  6. Jensen's inequality: If $\varphi$ is convex and $\varphi(X) \in L^1$, then $\varphi(\mathbb{E}[X \mid \mathcal{G}]) \leq \mathbb{E}[\varphi(X) \mid \mathcal{G}]$.

Properties 1 and 2 say conditional expectation is a positive linear operator. Property 3 (tower) says that coarsening the information can only lose precision. Property 4 says known quantities factor out. Property 5 says irrelevant information does not help. Property 6 is the conditional version of Jensen's inequality; it underlies the data processing inequality in information theory.
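The tower property is easy to see numerically on a finite space, where conditioning on a partition is just block averaging. A small sketch (the sample space, partitions, and values are made up for illustration):

```python
import numpy as np

# Omega = {0,...,7} with uniform probability.
X = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

def cond_exp(X, blocks):
    """E[X | sigma(partition)]: replace X by its average on each block."""
    out = np.empty_like(X)
    for b in blocks:
        out[b] = X[b].mean()
    return out

G = [[0, 1], [2, 3], [4, 5], [6, 7]]   # fine partition (more information)
H = [[0, 1, 2, 3], [4, 5, 6, 7]]       # coarse partition, H subset of G

lhs = cond_exp(cond_exp(X, G), H)      # E[ E[X|G] | H ]
rhs = cond_exp(X, H)                   # E[X|H]
print(np.allclose(lhs, rhs))           # True: the tower property
```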


Conditional Expectation $\mathbb{E}[Y \mid X = x]$ for a Bivariate Gaussian

For jointly Gaussian $(X, Y)$ with correlation $\rho$, the conditional expectation $\mathbb{E}[Y \mid X = x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ is a linear function of $x$. Vary the correlation to see how the regression line and conditional variance change.

(Interactive demo parameter: correlation coefficient between $X$ and $Y$, default 0.7.)
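A quick simulation check of the regression line, assuming for simplicity $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$, and the default $\rho = 0.7$; the sample mean of $Y$ within a narrow bin of $X$ should track $\rho x$:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.7, 500_000

# Jointly Gaussian (X, Y) with zero means, unit variances, correlation rho.
X = rng.standard_normal(n)
Y = rho * X + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Bin on X and compare the within-bin mean of Y to the line rho * x.
for x0 in (-1.0, 0.0, 1.5):
    sel = np.abs(X - x0) < 0.05
    print(x0, Y[sel].mean(), rho * x0)  # empirical mean vs. rho * x0
```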

Definition: Filtration

A filtration on $(\Omega, \mathcal{F})$ is an increasing sequence of sub-$\sigma$-algebras: $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$. Intuitively, $\mathcal{F}_n$ represents the information available at time $n$. A sequence of random variables $\{X_n\}$ is adapted to the filtration $\{\mathcal{F}_n\}$ if $X_n$ is $\mathcal{F}_n$-measurable for every $n$.


Definition: Martingale

A sequence $\{X_n, \mathcal{F}_n\}_{n \geq 0}$ is a martingale if:

  1. $X_n$ is adapted to $\{\mathcal{F}_n\}$ (each $X_n$ is $\mathcal{F}_n$-measurable).
  2. $\mathbb{E}[|X_n|] < \infty$ for all $n$.
  3. $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ a.s. for all $n$.

If condition 3 is replaced by $\leq$ (resp. $\geq$), the sequence is a supermartingale (resp. submartingale).

The name comes from a class of betting strategies in gambling. A martingale models a "fair game": your expected future fortune, given everything you know now, equals your current fortune.


Example: Standard Martingale Examples

Verify that each of the following is a martingale:

(a) Simple random walk $S_n = \sum_{k=1}^n X_k$ where the $X_k$ are i.i.d. with $P(X_k = 1) = P(X_k = -1) = 1/2$.

(b) Likelihood ratio process: Let $X_1, X_2, \ldots$ be i.i.d. with density $f$ under $P$ and density $g$ under $Q$. Define $L_n = \prod_{k=1}^n g(X_k)/f(X_k)$. Then $\{L_n\}$ is a martingale under $P$.
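For (a), $\mathbb{E}[S_{n+1} \mid \mathcal{F}_n] = S_n + \mathbb{E}[X_{n+1}] = S_n$. For (b), $\mathbb{E}[L_{n+1} \mid \mathcal{F}_n] = L_n \, \mathbb{E}[g(X_{n+1})/f(X_{n+1})] = L_n \int g(x)\, dx = L_n$. A simulation sketch of (b), assuming for concreteness $f = \mathcal{N}(0,1)$ and $g = \mathcal{N}(1,1)$; the martingale property implies $\mathbb{E}_P[L_n] = 1$ for every $n$:

```python
import numpy as np

rng = np.random.default_rng(2)
n_steps, n_paths = 20, 200_000

# i.i.d. samples under P, where f = N(0,1) and g = N(1,1).
X = rng.standard_normal((n_paths, n_steps))

# log g(x) - log f(x) = x - 1/2 for these two Gaussians.
log_ratio = X - 0.5
L = np.exp(np.cumsum(log_ratio, axis=1))  # L_n along each path

# E_P[L_n] should be 1 for every n (the martingale starts at L_0 = 1).
print(L.mean(axis=0)[[0, 4, 9, 19]])  # each ~ 1; noisier as n grows
```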


Martingale Sample Paths

Visualize sample paths of a simple random walk martingale or a Polya urn martingale. The random walk has $\mathbb{E}[S_n] = 0$ for all $n$; the Polya urn fraction converges to a Beta-distributed limit.

(Interactive demo parameters: 500 and 5.)
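A sketch of the Polya urn paths, assuming the standard urn (start with one red and one black ball; draw a ball, return it plus one more of the same color). The fraction of red balls is a bounded martingale, so the convergence theorem below guarantees an a.s. limit, here distributed Beta(1,1) = Uniform(0,1):

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, n_paths = 500, 5

red = np.ones(n_paths)         # start with 1 red ball...
total = 2 * np.ones(n_paths)   # ...and 1 black ball
for _ in range(n_steps):
    draw_red = rng.random(n_paths) < red / total  # draw prop. to counts
    red += draw_red            # replace, adding one ball of the drawn color
    total += 1
    # red/total is a martingale: E[next fraction | past] = current fraction

print(red / total)  # each path has nearly settled near its own random limit
```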

Theorem: Martingale Convergence Theorem

Let $\{X_n, \mathcal{F}_n\}$ be a submartingale satisfying $\sup_n \mathbb{E}[X_n^+] < \infty$. Then $X_n \to X_\infty$ a.s. for some integrable random variable $X_\infty$.

In particular, every non-negative supermartingale converges a.s.

A supermartingale that is bounded below cannot oscillate forever: its "upcrossings" of any interval $[a, b]$ are bounded in expectation. This forces convergence. The result is surprisingly general: we get a.s. convergence without any domination condition, just a bound on the positive part.

Theorem: Optional Stopping Theorem

Let $\{X_n, \mathcal{F}_n\}$ be a martingale and let $T$ be a stopping time (i.e., $\{T = n\} \in \mathcal{F}_n$ for all $n$). If either:

(a) $T$ is bounded (i.e., $T \leq N$ a.s. for some constant $N$), or

(b) $\mathbb{E}[T] < \infty$ and there exists $c > 0$ with $\mathbb{E}[|X_{n+1} - X_n| \mid \mathcal{F}_n] \leq c$ a.s. for all $n$,

then $\mathbb{E}[X_T] = \mathbb{E}[X_0]$.

In a fair game, no stopping strategy can create expected profit. If you flip a fair coin and stop whenever you are ahead, you cannot beat the house, provided your stopping time is bounded, or has finite mean while the increments stay bounded.

The conditions are essential: without them, you could construct a "doubling" strategy that would beat the house (but requires an unbounded bankroll and an unbounded number of rounds).


Example: Gambler's Ruin via Optional Stopping

A gambler starts with $a$ dollars and bets \$1 on each round of a fair coin flip. The game ends when the gambler reaches $N$ dollars or goes broke ($0$ dollars). Find the probability $p$ of reaching $N$.
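The fortune $S_n$ is a martingale, and $T$ (the first hit of $0$ or $N$) satisfies condition (b): increments are bounded by 1 and $\mathbb{E}[T] = a(N - a) < \infty$. So $a = \mathbb{E}[S_0] = \mathbb{E}[S_T] = pN + (1 - p) \cdot 0$, giving $p = a/N$. A simulation check with illustrative values $a = 3$, $N = 10$:

```python
import numpy as np

rng = np.random.default_rng(4)
a, N, n_games = 3, 10, 20_000

wins = 0
for _ in range(n_games):
    s = a
    while 0 < s < N:
        s += 1 if rng.random() < 0.5 else -1  # fair $1 bet each round
    wins += (s == N)

print(wins / n_games, a / N)  # empirical ~ theoretical p = a/N = 0.3
```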

Common Mistake: Optional Stopping Requires Conditions

Mistake:

Applying $\mathbb{E}[X_T] = \mathbb{E}[X_0]$ without verifying the conditions (bounded stopping time or uniform integrability).

Correction:

Consider the doubling strategy on a fair coin: bet 1, 2, 4, 8, ... until you win. The stopping time $T$ is a.s. finite (geometric, with mean 2), and $X_T = 1$ always (you always net \$1). But $\mathbb{E}[X_0] = 0$ for the associated martingale, so $\mathbb{E}[X_T] \neq \mathbb{E}[X_0]$. The conditions of the theorem fail because the bets grow exponentially: no constant $c$ bounds $\mathbb{E}[|X_{n+1} - X_n| \mid \mathcal{F}_n]$, and the expected total amount staked is infinite.
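A sketch that makes the failure concrete: every run nets exactly \$1, yet the losses accumulated before the win are unbounded, and their expectation diverges (the run count and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n_runs = 100_000

nets, worst = np.empty(n_runs), np.empty(n_runs)
for i in range(n_runs):
    bet, lost = 1, 0
    while rng.random() < 0.5:   # lose with prob 1/2, double and retry
        lost += bet
        bet *= 2
    nets[i] = bet - lost        # the winning bet recoups everything + $1
    worst[i] = lost             # cumulative losses before the win

print(nets.min(), nets.max())     # always exactly 1
print(worst.mean(), worst.max())  # sample mean keeps growing: E is infinite
```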

🔧 Engineering Note

Wald's Sequential Probability Ratio Test

The optional stopping theorem, applied to the log-likelihood ratio martingale, is the theoretical foundation of Wald's Sequential Probability Ratio Test (SPRT). In the SPRT, observations are collected one at a time and the test stops as soon as the cumulative log-likelihood ratio crosses one of two thresholds. The expected number of samples is minimized among all tests with the same error probabilities, a result known as the optimality of the SPRT.

In radar and communications, sequential detection is used when the cost of observations is significant (e.g., energy-constrained sensors) and one wants to minimize the average sample number.
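A minimal SPRT sketch under assumed simple hypotheses $H_0: \mathcal{N}(0,1)$ vs. $H_1: \mathcal{N}(1,1)$, using Wald's approximate thresholds $A \approx (1-\beta)/\alpha$ and $B \approx \beta/(1-\alpha)$ (all parameter values and names are illustrative):

```python
import numpy as np

def sprt(rng, mu_true, alpha=0.05, beta=0.05, mu0=0.0, mu1=1.0):
    """Run one SPRT for H0: N(mu0,1) vs H1: N(mu1,1); return (decision, n)."""
    log_A = np.log((1 - beta) / alpha)   # decide H1 above this threshold
    log_B = np.log(beta / (1 - alpha))   # decide H0 below this threshold
    llr, n = 0.0, 0
    while log_B < llr < log_A:
        x = rng.normal(mu_true, 1.0)
        n += 1
        # log-likelihood ratio increment for two unit-variance Gaussians
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2)
    return ("H1" if llr >= log_A else "H0"), n

rng = np.random.default_rng(6)
results = [sprt(rng, mu_true=1.0) for _ in range(10_000)]
decisions, ns = zip(*results)
print(decisions.count("H0") / len(decisions))  # ~ beta (miss rate under H1)
print(np.mean(ns))                             # average sample number
```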

Quick Check

The tower property of conditional expectation states that for $\mathcal{H} \subseteq \mathcal{G}$:

(a) $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{G}]$

(b) $\mathbb{E}[\mathbb{E}[X \mid \mathcal{H}] \mid \mathcal{G}] = \mathbb{E}[X \mid \mathcal{H}]$

(c) $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]$

(Answer: (c), matching property 3 above.)

Martingale

An adapted integrable sequence $\{X_n, \mathcal{F}_n\}$ satisfying $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ a.s. Models a "fair game" where the expected future value, given all current information, equals the present value.

Related: Filtration, Stopping Time

Filtration

An increasing sequence of sigma-algebras $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots$ representing growing information over time.

Related: Martingale, Adapted Process

Key Takeaway

Conditional expectation given a sigma-algebra is a random variable, not a number. It is the best (in the $L^2$ sense) approximation of $X$ using only the information in $\mathcal{G}$. The tower property, Jensen's inequality, and the martingale framework all flow from this single definition, and these are the tools that power the deepest results in information theory and statistical inference.