Conditional Expectation Given a Sigma-Algebra
Beyond Conditioning on Events and Random Variables
In Chapter 12 we defined $\mathbb{E}[Y \mid X = x]$ as a function of $x$ and showed it is the MMSE estimator of $Y$ given $X$. But that definition relies on the existence of conditional densities or PMFs, and it conditions on a specific value $X = x$, an event that typically has probability zero.
Measure theory provides a cleaner, more general definition: $\mathbb{E}[X \mid \mathcal{G}]$ is a $\mathcal{G}$-measurable random variable that captures "the best prediction of $X$ given the information in $\mathcal{G}$." This definition works regardless of whether $\mathcal{G}$ is generated by a discrete RV, a continuous RV, or something far more exotic. It also gives us martingales, the most powerful tool in modern probability.
Definition: Conditional Expectation Given a $\sigma$-Algebra
Conditional Expectation Given a $\sigma$-Algebra
Let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. The conditional expectation of $X$ given $\mathcal{G}$, written $\mathbb{E}[X \mid \mathcal{G}]$, is any random variable $Z$ satisfying:
- $Z$ is $\mathcal{G}$-measurable.
- $\int_G Z \, dP = \int_G X \, dP$ for every $G \in \mathcal{G}$.
Any two such random variables agree $P$-a.s. (i.e., the conditional expectation is unique up to a.s. equivalence).
Condition 1 says $Z$ depends only on the information in $\mathcal{G}$. Condition 2 says $Z$ preserves the "average value" of $X$ on every $\mathcal{G}$-measurable set. Together, they define $\mathbb{E}[X \mid \mathcal{G}]$ as the $\mathcal{G}$-measurable function that is closest to $X$ in the $L^2$ sense (when $X \in L^2$, conditional expectation is the orthogonal projection onto the subspace of $\mathcal{G}$-measurable square-integrable functions).
Theorem: Existence and Uniqueness of Conditional Expectation
Let $X \in L^1(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. Then $\mathbb{E}[X \mid \mathcal{G}]$ exists and is unique (a.s.).
Existence follows from the Radon-Nikodym theorem: the signed measure $\nu(G) = \int_G X \, dP$ is absolutely continuous with respect to $P$ restricted to $\mathcal{G}$, so it has a Radon-Nikodym derivative, and that derivative is $\mathbb{E}[X \mid \mathcal{G}]$. Uniqueness follows from the fact that two $\mathcal{G}$-measurable functions whose integrals agree on every $G \in \mathcal{G}$ must agree a.s.
Define the signed measure
For $X \geq 0$, define $\nu$ on $\mathcal{G}$ by $\nu(G) = \int_G X \, dP$. Then $\nu$ is a finite measure on $\mathcal{G}$, and $\nu \ll P|_{\mathcal{G}}$ (if $P(G) = 0$ then $\nu(G) = 0$).
Apply Radon-Nikodym
By the Radon-Nikodym theorem (Section 22.4), there exists a $\mathcal{G}$-measurable function $Z = \frac{d\nu}{d(P|_{\mathcal{G}})}$ such that $\nu(G) = \int_G Z \, dP$ for all $G \in \mathcal{G}$. This $Z$ satisfies both conditions for $\mathbb{E}[X \mid \mathcal{G}]$.
General case and uniqueness
For general $X$, apply the above to $X^+$ and $X^-$ separately. Uniqueness: if $Z_1$ and $Z_2$ both satisfy the conditions, then $\int_G (Z_1 - Z_2) \, dP = 0$ for all $G \in \mathcal{G}$. Taking $G = \{Z_1 > Z_2\}$ (which is in $\mathcal{G}$) forces $P(Z_1 > Z_2) = 0$, and similarly $P(Z_1 < Z_2) = 0$.
Example: Conditional Expectation on a Finite Partition
Let $\Omega = [0, 1]$ with Lebesgue measure and $X(\omega) = \omega^2$. Compute $\mathbb{E}[X \mid \mathcal{G}]$ where $\mathcal{G} = \sigma\big(\{[0, \tfrac{1}{2})\}\big)$.
Identify $\mathcal{G}$-measurable functions
$\mathcal{G} = \{\emptyset, [0, \tfrac{1}{2}), [\tfrac{1}{2}, 1], [0, 1]\}$. A $\mathcal{G}$-measurable function must be constant on $[0, \tfrac{1}{2})$ and constant on $[\tfrac{1}{2}, 1]$. So $Z = a\,\mathbf{1}_{[0, 1/2)} + b\,\mathbf{1}_{[1/2, 1]}$ for some constants $a, b$.
Apply the defining condition
For $G = [0, \tfrac{1}{2})$: $\int_0^{1/2} Z \, d\omega = \tfrac{a}{2}$ must equal $\int_0^{1/2} \omega^2 \, d\omega = \tfrac{1}{24}$. So $a = \tfrac{1}{12}$.
For $G = [\tfrac{1}{2}, 1]$: $\int_{1/2}^{1} Z \, d\omega = \tfrac{b}{2}$ must equal $\int_{1/2}^{1} \omega^2 \, d\omega = \tfrac{7}{24}$. So $b = \tfrac{7}{12}$.
Result
$\mathbb{E}[X \mid \mathcal{G}] = \tfrac{1}{12}\,\mathbf{1}_{[0, 1/2)} + \tfrac{7}{12}\,\mathbf{1}_{[1/2, 1]}$: the conditional expectation is the average of $\omega^2$ on each piece of the partition.
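A quick Monte Carlo check of this answer (a sketch; the sample size and seed are arbitrary choices): averaging $\omega^2$ over the draws falling in each cell recovers $\tfrac{1}{12}$ and $\tfrac{7}{12}$, and perturbing either constant only increases the mean squared error, illustrating the $L^2$-projection view.

```python
import random

# Monte Carlo sketch of the partition example: omega ~ Uniform[0,1], X = omega^2,
# G generated by the partition {[0, 1/2), [1/2, 1]}.
random.seed(0)
samples = [random.random() for _ in range(200_000)]

def cell(w):
    """Index of the partition cell containing w."""
    return 0 if w < 0.5 else 1

# within-cell averages of X: these estimate the two values of E[X | G]
sums, counts = [0.0, 0.0], [0, 0]
for w in samples:
    sums[cell(w)] += w * w
    counts[cell(w)] += 1
cond_exp = [s / c for s, c in zip(sums, counts)]
print(cond_exp)  # close to [1/12, 7/12] = [0.0833..., 0.5833...]

# E[X | G] minimizes mean squared error among G-measurable (piecewise-constant) Z
def mse(z):
    return sum((w * w - z[cell(w)]) ** 2 for w in samples) / len(samples)

print(mse(cond_exp) < mse([cond_exp[0] + 0.05, cond_exp[1]]))  # True
```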
Theorem: Properties of Conditional Expectation
Let $X, Y$ be integrable random variables on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G}, \mathcal{H} \subseteq \mathcal{F}$ be sub-$\sigma$-algebras. Then (all equalities a.s.):
- Linearity: $\mathbb{E}[aX + bY \mid \mathcal{G}] = a\,\mathbb{E}[X \mid \mathcal{G}] + b\,\mathbb{E}[Y \mid \mathcal{G}]$.
- Positivity: If $X \geq 0$ a.s., then $\mathbb{E}[X \mid \mathcal{G}] \geq 0$ a.s.
- Tower property: If $\mathcal{H} \subseteq \mathcal{G}$, then $\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}\big] = \mathbb{E}[X \mid \mathcal{H}]$.
- Taking out what is known: If $Y$ is $\mathcal{G}$-measurable and $XY$ is integrable, then $\mathbb{E}[XY \mid \mathcal{G}] = Y\,\mathbb{E}[X \mid \mathcal{G}]$.
- Independence: If $X$ is independent of $\mathcal{G}$, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$.
- Jensen's inequality: If $\varphi$ is convex and $\varphi(X)$ is integrable, then $\varphi\big(\mathbb{E}[X \mid \mathcal{G}]\big) \leq \mathbb{E}[\varphi(X) \mid \mathcal{G}]$.
Properties 1--2 say conditional expectation is a positive linear operator. Property 3 (tower) says that coarsening the information can only lose precision. Property 4 says known quantities factor out. Property 5 says irrelevant information does not help. Property 6 is the conditional version of Jensen's inequality; it underlies the data processing inequality in information theory.
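On a finite probability space, conditional expectation is just averaging over atoms, so the tower property can be verified exactly. Below is a sketch with an assumed setup (two fair dice, not from the text): $X = d_1 + d_2$, the finer $\mathcal{G} = \sigma(d_1)$, and the coarser $\mathcal{H} = \sigma(\text{parity of } d_1)$.

```python
from fractions import Fraction
from itertools import product

# Exact check of the tower property on a finite space (illustrative setup,
# not from the text): omega = (d1, d2), two fair dice, each outcome prob 1/36.
omegas = list(product(range(1, 7), repeat=2))

def cond_exp(X, info):
    """E[X | sigma(info)] as a dict omega -> value: average X over each atom."""
    atoms = {}
    for w in omegas:
        atoms.setdefault(info(w), []).append(w)
    return {w: Fraction(sum(X(a) for a in atoms[info(w)]), len(atoms[info(w)]))
            for w in omegas}

X = lambda w: w[0] + w[1]
E_X_G = cond_exp(X, lambda w: w[0])        # G = sigma(d1): equals d1 + 7/2
E_X_H = cond_exp(X, lambda w: w[0] % 2)    # H = sigma(parity of d1), H ⊆ G

# tower: conditioning E[X | G] on the coarser H recovers E[X | H] exactly
tower = cond_exp(lambda w: E_X_G[w], lambda w: w[0] % 2)
print(tower == E_X_H)   # True
```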
Tower property (proof sketch)
We must show that $\mathbb{E}[X \mid \mathcal{H}]$ satisfies the two conditions defining $\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}\big]$. It is $\mathcal{H}$-measurable by definition. For any $H \in \mathcal{H}$: $\int_H \mathbb{E}[X \mid \mathcal{G}] \, dP = \int_H X \, dP = \int_H \mathbb{E}[X \mid \mathcal{H}] \, dP$. The first equality uses the defining property of $\mathbb{E}[X \mid \mathcal{G}]$ (since $H \in \mathcal{H} \subseteq \mathcal{G}$), and the second uses the defining property of $\mathbb{E}[X \mid \mathcal{H}]$.
Jensen (proof sketch)
Since $\varphi$ is convex, for every $x_0$ there exists a supporting line: $\varphi(x) \geq \varphi(x_0) + m(x_0)(x - x_0)$ for some slope $m(x_0)$ depending on $x_0$. Set $x_0 = \mathbb{E}[X \mid \mathcal{G}]$ (which is $\mathcal{G}$-measurable) and $x = X$. Take the conditional expectation given $\mathcal{G}$ of both sides, using linearity and "taking out what is known."
Conditional Expectation for Bivariate Gaussian
For $(X, Y)$ jointly Gaussian with correlation $\rho$, the conditional expectation $\mathbb{E}[Y \mid X]$ is a linear function of $X$. Vary the correlation to see how the regression line and conditional variance change.
Parameters
Correlation coefficient between X and Y
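The regression-line claim can be checked by simulation. A sketch assuming standard-normal marginals (the $\rho = 0.6$ and slice location are arbitrary choices): for such a pair, $\mathbb{E}[Y \mid X = x] = \rho x$ and $\mathrm{Var}(Y \mid X) = 1 - \rho^2$, which a thin vertical slice of samples should reproduce.

```python
import random, math

# Jointly Gaussian pair with standard-normal marginals and correlation rho:
# generate Y = rho*X + sqrt(1 - rho^2)*Z with Z independent standard normal.
random.seed(1)
rho, x0 = 0.6, 1.0
xs, ys = [], []
for _ in range(400_000):
    x = random.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
    xs.append(x)
    ys.append(y)

# conditional mean/variance of Y in a thin slice around x0
slice_ys = [y for x, y in zip(xs, ys) if abs(x - x0) < 0.05]
cond_mean = sum(slice_ys) / len(slice_ys)
cond_var = sum((y - cond_mean) ** 2 for y in slice_ys) / len(slice_ys)
print(cond_mean, cond_var)   # near rho * x0 = 0.6 and 1 - rho^2 = 0.64
```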
Definition: Filtration
Filtration
A filtration on $(\Omega, \mathcal{F}, P)$ is an increasing sequence of sub-$\sigma$-algebras: $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$. Intuitively, $\mathcal{F}_n$ represents the information available at time $n$. A sequence of random variables $(X_n)$ is adapted to the filtration if $X_n$ is $\mathcal{F}_n$-measurable for every $n$.
Definition: Martingale
Martingale
A sequence $(X_n)_{n \geq 0}$ is a martingale with respect to a filtration $(\mathcal{F}_n)$ if:
- $(X_n)$ is adapted to $(\mathcal{F}_n)$ (each $X_n$ is $\mathcal{F}_n$-measurable).
- $\mathbb{E}[|X_n|] < \infty$ for all $n$.
- $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ a.s. for all $n$.
If condition 3 is replaced by $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] \leq X_n$ (resp. $\geq X_n$), the sequence is a supermartingale (resp. submartingale).
The name comes from a class of betting strategies in gambling. A martingale models a "fair game": your expected future fortune, given everything you know now, equals your current fortune.
Example: Standard Martingale Examples
Verify that each of the following is a martingale:
(a) Simple random walk $X_n = \xi_1 + \cdots + \xi_n$ (with $X_0 = 0$), where the $\xi_i$ are i.i.d. with $P(\xi_i = +1) = P(\xi_i = -1) = \tfrac{1}{2}$.
(b) Likelihood ratio process: Let $Y_1, Y_2, \ldots$ be i.i.d. with density $f$ under $H_0$ and density $g$ under $H_1$. Define $L_n = \prod_{i=1}^{n} \frac{g(Y_i)}{f(Y_i)}$. Then $(L_n)$ is a martingale under $H_0$.
(a) Random walk
Let $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$. Then $X_n$ is $\mathcal{F}_n$-measurable, $\mathbb{E}[|X_n|] \leq n < \infty$ (bounded moments), and $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = \mathbb{E}[X_n + \xi_{n+1} \mid \mathcal{F}_n] = X_n + \mathbb{E}[\xi_{n+1}] = X_n$, where we used linearity, the "taking out what is known" property (since $X_n$ is $\mathcal{F}_n$-measurable), and independence of $\xi_{n+1}$ from $\mathcal{F}_n$.
(b) Likelihood ratio
Under $H_0$, $\mathbb{E}_0\!\left[\frac{g(Y_{n+1})}{f(Y_{n+1})}\right] = \int \frac{g(y)}{f(y)} f(y) \, dy = \int g(y) \, dy = 1$. Then: $\mathbb{E}_0[L_{n+1} \mid \mathcal{F}_n] = L_n\,\mathbb{E}_0\!\left[\frac{g(Y_{n+1})}{f(Y_{n+1})}\right] = L_n$. This likelihood ratio martingale is fundamental to sequential hypothesis testing and connects to the Radon-Nikodym derivative in Section 22.4.
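The constant-expectation property $\mathbb{E}_0[L_n] = 1$ is easy to see in simulation. A sketch with an assumed Gaussian pair $f = N(0,1)$ and $g = N(\theta, 1)$, for which $g(y)/f(y) = e^{\theta y - \theta^2/2}$:

```python
import random, math

# Likelihood ratio paths simulated under H0 (data ~ N(0,1)) against H1: N(theta,1).
# The martingale property forces E_0[L_n] = 1 at every step n.
random.seed(2)
theta, n_steps, n_paths = 0.5, 10, 100_000
L = [1.0] * n_paths
means = []
for _ in range(n_steps):
    for i in range(n_paths):
        y = random.gauss(0, 1)                        # sample under H0
        L[i] *= math.exp(theta * y - theta * theta / 2)
    means.append(sum(L) / n_paths)                    # empirical E_0[L_n]
print(means)   # each entry close to 1
```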
Martingale Sample Paths
Visualize sample paths of a simple random walk martingale or a Polya urn martingale. The random walk has $\mathbb{E}[X_n] = 0$ for all $n$; the Polya urn fraction converges to a Beta-distributed limit.
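A sketch of the Polya urn from the visualization (an initial composition of one red and one blue ball is assumed): draw a ball uniformly and return it with one more of the same color. The red fraction $M_n$ is a martingale, so its mean stays at $\tfrac{1}{2}$, while individual paths settle toward a random limit spread over $[0, 1]$ (here Uniform $=$ Beta$(1,1)$).

```python
import random

# Polya urn sketch (assumed start: 1 red, 1 blue ball).
random.seed(3)

def red_fraction(n_draws):
    red, total = 1, 2
    for _ in range(n_draws):
        if random.random() < red / total:   # drew red: add another red
            red += 1
        total += 1                          # one ball added either way
    return red / total

finals = [red_fraction(500) for _ in range(5_000)]
mean = sum(finals) / len(finals)
print(mean)                                   # near 0.5: E[M_n] = M_0 = 1/2
print(min(finals) < 0.1 and max(finals) > 0.9)  # True: the limit is genuinely random
```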
Theorem: Martingale Convergence Theorem
Let $(X_n)$ be a submartingale satisfying $\sup_n \mathbb{E}[X_n^+] < \infty$. Then $X_n \to X_\infty$ a.s. for some integrable random variable $X_\infty$.
In particular, every non-negative supermartingale converges a.s.
A supermartingale that is bounded below cannot oscillate forever; its "upcrossings" of any interval are bounded in expectation. This forces convergence. The result is surprisingly general: we get a.s. convergence without any domination condition, just a bound on the positive part.
Doob's upcrossing inequality
Let $U_n[a, b]$ count the number of times the path $(X_k)_{k \leq n}$ crosses from below $a$ to above $b$. Doob's inequality states: $(b - a)\,\mathbb{E}[U_n[a, b]] \leq \mathbb{E}[(X_n - a)^+]$. Since $\mathbb{E}[X_n^+]$ is bounded, $\mathbb{E}[U_n[a, b]]$ is bounded uniformly in $n$, so the total number of upcrossings $U_\infty[a, b]$ is finite a.s.
Convergence
If $X_n$ fails to converge, then $\liminf_n X_n < \limsup_n X_n$, so there exist rationals $a < b$ with infinitely many upcrossings from $a$ to $b$. But we just showed this has probability zero for each fixed pair, and there are only countably many rational pairs. Hence $X_n$ converges a.s. Integrability of the limit follows from Fatou's lemma.
Theorem: Optional Stopping Theorem
Let $(X_n)$ be a martingale and let $T$ be a stopping time (i.e., $\{T \leq n\} \in \mathcal{F}_n$ for all $n$). If either:
(a) $T$ is bounded (i.e., $T \leq N$ a.s. for some constant $N$), or
(b) $\mathbb{E}[T] < \infty$ and there exists $c$ with $\mathbb{E}\big[|X_{n+1} - X_n| \mid \mathcal{F}_n\big] \leq c$ a.s. for all $n$,
then $\mathbb{E}[X_T] = \mathbb{E}[X_0]$.
In a fair game, no stopping strategy can create expected profit. If you flip a fair coin and stop whenever you are ahead, you cannot beat the house, provided you stop in bounded time or with bounded increments.
The conditions are essential: without them, you could construct a "doubling" strategy that would beat the house (but requires infinite bankroll and infinite time).
Bounded case
Define $X_n^T = X_{T \wedge n}$ (the stopped process). Then $(X_n^T)$ is also a martingale: $\mathbb{E}[X_{T \wedge (n+1)} \mid \mathcal{F}_n] = X_{T \wedge n}$ (verify by conditioning on $\{T \leq n\}$ and $\{T > n\}$ separately). At time $N$: $X_{T \wedge N} = X_T$ a.s. Taking expectations: $\mathbb{E}[X_T] = \mathbb{E}[X_{T \wedge N}] = \mathbb{E}[X_0]$.
Condition (b) sketch
Under condition (b), one shows $(X_{T \wedge n})$ is uniformly integrable, so $\mathbb{E}[X_{T \wedge n}] \to \mathbb{E}[X_T]$ as $n \to \infty$. Since $\mathbb{E}[X_{T \wedge n}] = \mathbb{E}[X_0]$ for all $n$, the result follows.
Example: Gambler's Ruin via Optional Stopping
A gambler starts with $k$ dollars and bets \$1 on fair coin flips, stopping when her fortune reaches $N$ or hits $0$. Find the probability $p$ that she reaches $N$ before $0$.
Set up the martingale
The fortune $X_n$ is a martingale (fair game). The stopping time $T = \min\{n : X_n \in \{0, N\}\}$ is finite a.s. (the simple random walk is recurrent, so it exits the interval $(0, N)$ eventually).
Apply optional stopping
By the optional stopping theorem (the stopped process is bounded by $N$, so it is uniformly integrable): $\mathbb{E}[X_T] = \mathbb{E}[X_0] = k$. But $X_T = N$ with probability $p$ and $X_T = 0$ with probability $1 - p$, so $\mathbb{E}[X_T] = pN$, giving $p = k/N$.
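A direct simulation confirms $p = k/N$ (the start $k = 3$, target $N = 10$, and trial count below are arbitrary choices):

```python
import random

# Monte Carlo check of the optional-stopping answer p = k/N for gambler's ruin.
random.seed(4)
k, N, trials = 3, 10, 100_000
wins = 0
for _ in range(trials):
    x = k
    while 0 < x < N:                   # play until ruin (0) or the target (N)
        x += random.choice((-1, 1))    # fair +/- $1 bet
    wins += (x == N)
print(wins / trials)   # close to k/N = 0.3
```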
Common Mistake: Optional Stopping Requires Conditions
Mistake:
Applying $\mathbb{E}[X_T] = \mathbb{E}[X_0]$ without verifying the conditions (bounded stopping time or uniform integrability).
Correction:
Consider the doubling strategy on a fair coin: bet 1, 2, 4, 8, ... until you win. The stopping time $T$ is a.s. finite (geometric), and the net profit $X_T = 1$ always (you always net \$1 on the winning flip), yet $\mathbb{E}[X_0] = 0$, so $\mathbb{E}[X_T] \neq \mathbb{E}[X_0]$. Optional stopping fails because the increments grow exponentially: no uniform bound $c$ on $\mathbb{E}[|X_{n+1} - X_n| \mid \mathcal{F}_n]$ exists, and $T$ is not bounded.
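The failure mode is easy to see in simulation (a sketch; the trial count is arbitrary): every run nets exactly $+1$, but the intermediate losses, i.e., the increments, are unbounded, so an infinite bankroll would be needed.

```python
import random

# Doubling ("martingale") betting strategy on a fair coin: double the bet
# after every loss, stop at the first win.
random.seed(5)
profits, worst_drawdowns = [], []
for _ in range(10_000):
    bet, fortune, worst = 1, 0, 0
    while True:
        if random.random() < 0.5:      # win: collect the bet and stop
            fortune += bet
            break
        fortune -= bet                 # loss: track drawdown, double the bet
        worst = min(worst, fortune)
        bet *= 2
    profits.append(fortune)
    worst_drawdowns.append(worst)
print(set(profits))          # {1}: you always net exactly +1
print(min(worst_drawdowns))  # a large negative number: unbounded bankroll needed
```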
Wald's Sequential Probability Ratio Test
The optional stopping theorem, applied to the log-likelihood ratio martingale, is the theoretical foundation of Wald's Sequential Probability Ratio Test (SPRT). In the SPRT, observations are collected one at a time and the test stops as soon as the cumulative log-likelihood ratio crosses one of two thresholds. The expected number of samples is minimized among all tests with the same error probabilities, a result known as the optimality of the SPRT.
In radar and communications, sequential detection is used when the cost of observations is significant (e.g., energy-constrained sensors) and one wants to minimize the average sample number.
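A minimal SPRT sketch for an assumed Gaussian pair $H_0: N(0,1)$ vs. $H_1: N(1,1)$, using Wald's approximate thresholds $A = (1-\beta)/\alpha$ and $B = \beta/(1-\alpha)$ (the error targets and hypotheses below are illustrative choices, not from the text):

```python
import random, math

# Wald's SPRT: accumulate the log-likelihood ratio until it exits (log B, log A).
random.seed(6)
alpha, beta, theta = 0.05, 0.05, 1.0       # target errors; H1 mean shift
log_A = math.log((1 - beta) / alpha)
log_B = math.log(beta / (1 - alpha))

def sprt(true_mean):
    llr, n = 0.0, 0
    while log_B < llr < log_A:
        y = random.gauss(true_mean, 1)
        llr += theta * y - theta * theta / 2   # log g(y)/f(y) for this Gaussian pair
        n += 1
    return ("accept H1" if llr >= log_A else "accept H0"), n

# empirical false-alarm rate and average sample number when H0 is true
results = [sprt(0.0) for _ in range(20_000)]
false_alarms = sum(d == "accept H1" for d, _ in results) / len(results)
avg_n = sum(n for _, n in results) / len(results)
print(false_alarms, avg_n)   # false-alarm rate near (below) alpha; small average sample number
```

Wald's thresholds guarantee the error probabilities are approximately met while keeping the expected sample count far below that of any fixed-sample test with the same errors.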
Quick Check
The tower property of conditional expectation states that for $\mathcal{H} \subseteq \mathcal{G}$: $\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}\big] = \mathbb{E}[X \mid \mathcal{H}]$.
Conditioning in the other order gives the same answer: refining the conditioning (from $\mathcal{H}$ to $\mathcal{G}$) of an $\mathcal{H}$-measurable quantity has no effect, so $\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{H}] \mid \mathcal{G}\big] = \mathbb{E}[X \mid \mathcal{H}]$. But this is also the tower property written differently.
Martingale
An adapted integrable sequence satisfying $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ a.s. Models a "fair game" where the expected future value, given all current information, equals the present value.
Related: Filtration, Stopping Time
Filtration
An increasing sequence of sigma-algebras representing growing information over time.
Related: Martingale, Adapted Process
Key Takeaway
Conditional expectation given a sigma-algebra is a random variable, not a number. It is the best (in the $L^2$ sense) approximation of $X$ using only the information in $\mathcal{G}$. The tower property, Jensen's inequality, and the martingale framework all flow from this single definition, and these are the tools that power the deepest results in information theory and statistical inference.