Conditional Expectation Given a Sigma-Algebra

Beyond Conditioning on Events and Random Variables

In Chapter 12 we defined $\mathbb{E}[Y \mid X = x]$ as a function of $x$ and showed it is the MMSE estimator of $Y$ given $X$. But that definition relies on the existence of conditional densities or PMFs, and it conditions on a specific value of $X$, an event that typically has probability zero.

Measure theory provides a cleaner, more general definition: $\mathbb{E}[X \mid \mathcal{G}]$ is a $\mathcal{G}$-measurable random variable that captures "the best prediction of $X$ given the information in $\mathcal{G}$." This definition works regardless of whether $\mathcal{G}$ is generated by a discrete RV, a continuous RV, or something far more exotic. It also gives us martingales, among the most powerful tools in modern probability.


Definition: Conditional Expectation Given a $\sigma$-Algebra

Let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. The conditional expectation of $X$ given $\mathcal{G}$, written $\mathbb{E}[X \mid \mathcal{G}]$, is any random variable $Y$ satisfying:

  1. $Y$ is $\mathcal{G}$-measurable.
  2. $\int_G Y\, dP = \int_G X\, dP$ for every $G \in \mathcal{G}$.

Any two such random variables agree $P$-a.s. (i.e., the conditional expectation is unique up to a.s. equivalence).

Condition 1 says $Y$ depends only on the information in $\mathcal{G}$. Condition 2 says $Y$ preserves the "average value" of $X$ on every $\mathcal{G}$-measurable set. Together, they define $Y$ as the $\mathcal{G}$-measurable function that is closest to $X$ in the $L^2$ sense (when $X \in L^2$, conditional expectation is the orthogonal projection onto the subspace of $\mathcal{G}$-measurable square-integrable functions).
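Here is the calculation behind the projection claim, a sketch assuming $X \in L^2$:

```latex
% Sketch of the projection view (assuming X \in L^2).
% Condition 2 says \mathbb{E}[(X - Y)\mathbf{1}_G] = 0 for every G \in \mathcal{G}.
% Linearity extends this to simple \mathcal{G}-measurable Z, and an
% approximation argument extends it to all Z \in L^2(\Omega, \mathcal{G}, P):
\mathbb{E}\big[(X - Y)\,Z\big] = 0 \qquad \text{for all } Z \in L^2(\Omega, \mathcal{G}, P).
% So X - Y is orthogonal to the closed subspace L^2(\Omega, \mathcal{G}, P),
% which is exactly the statement that Y is the orthogonal projection of X.
```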


Theorem: Existence and Uniqueness of Conditional Expectation

Let $X \in L^1(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. Then $\mathbb{E}[X \mid \mathcal{G}]$ exists and is unique (a.s.).

Existence follows from the Radon-Nikodym theorem: the signed measure $\nu(G) = \int_G X\, dP$ is absolutely continuous with respect to $P|_{\mathcal{G}}$, so it has a Radon-Nikodym derivative, and that derivative is $\mathbb{E}[X \mid \mathcal{G}]$. Uniqueness follows from the fact that two $\mathcal{G}$-measurable functions whose integrals agree over every $G \in \mathcal{G}$ must agree a.s.

Example: Conditional Expectation on a Finite Partition

Let Ξ©=[0,1]\Omega = [0,1] with Lebesgue measure and G=Οƒ({[0,1/2),[1/2,1]})\mathcal{G} = \sigma(\{[0, 1/2), [1/2, 1]\}). Compute E[X∣G]\mathbb{E}[X \mid \mathcal{G}] where X(Ο‰)=Ο‰2X(\omega) = \omega^2.

Theorem: Properties of Conditional Expectation

Let $X, Y$ be integrable random variables on $(\Omega, \mathcal{F}, P)$ and $\mathcal{G}, \mathcal{H}$ be sub-$\sigma$-algebras. Then (all equalities a.s.):

  1. Linearity: $\mathbb{E}[aX + bY \mid \mathcal{G}] = a\mathbb{E}[X \mid \mathcal{G}] + b\mathbb{E}[Y \mid \mathcal{G}]$.
  2. Positivity: If $X \geq 0$ a.s., then $\mathbb{E}[X \mid \mathcal{G}] \geq 0$ a.s.
  3. Tower property: If $\mathcal{H} \subseteq \mathcal{G}$, then $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]$.
  4. Taking out what is known: If $Y$ is $\mathcal{G}$-measurable and $XY \in L^1$, then $\mathbb{E}[XY \mid \mathcal{G}] = Y \cdot \mathbb{E}[X \mid \mathcal{G}]$.
  5. Independence: If $X$ is independent of $\mathcal{G}$, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$.
  6. Jensen's inequality: If $\varphi$ is convex and $\varphi(X) \in L^1$, then $\varphi(\mathbb{E}[X \mid \mathcal{G}]) \leq \mathbb{E}[\varphi(X) \mid \mathcal{G}]$.

Properties 1 and 2 say conditional expectation is a positive linear operator. Property 3 (tower) says that coarsening the information can only lose precision. Property 4 says known quantities factor out. Property 5 says irrelevant information does not help. Property 6 is the conditional version of Jensen's inequality; it underlies the data processing inequality in information theory.
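The tower property is easy to see numerically on a finite space, where conditioning on a partition is just block averaging. A small sketch (the sample space, partitions, and values are made up for illustration):

```python
import numpy as np

# Omega = {0,...,7} with uniform probability.
X = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

def cond_exp(X, blocks):
    """E[X | sigma(partition)]: replace X by its average on each block."""
    out = np.empty_like(X)
    for b in blocks:
        out[b] = X[b].mean()
    return out

G = [[0, 1], [2, 3], [4, 5], [6, 7]]   # fine partition (more information)
H = [[0, 1, 2, 3], [4, 5, 6, 7]]       # coarse partition, H subset of G

lhs = cond_exp(cond_exp(X, G), H)      # E[ E[X|G] | H ]
rhs = cond_exp(X, H)                   # E[X|H]
print(np.allclose(lhs, rhs))           # True: the tower property
```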


Conditional Expectation $\mathbb{E}[Y \mid X = x]$ for a Bivariate Gaussian

For jointly Gaussian $(X, Y)$ with correlation $\rho$, the conditional expectation $\mathbb{E}[Y \mid X = x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ is a linear function of $x$. Vary the correlation to see how the regression line and conditional variance change.

(Interactive demo parameter: correlation coefficient between $X$ and $Y$, default 0.7.)
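A quick simulation check of the regression line, assuming for simplicity $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$, and the default $\rho = 0.7$; the sample mean of $Y$ within a narrow bin of $X$ should track $\rho x$:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.7, 500_000

# Jointly Gaussian (X, Y) with zero means, unit variances, correlation rho.
X = rng.standard_normal(n)
Y = rho * X + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Bin on X and compare the within-bin mean of Y to the line rho * x.
for x0 in (-1.0, 0.0, 1.5):
    sel = np.abs(X - x0) < 0.05
    print(x0, Y[sel].mean(), rho * x0)  # empirical mean vs. rho * x0
```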

Definition: Filtration

A filtration on $(\Omega, \mathcal{F})$ is an increasing sequence of sub-$\sigma$-algebras: $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$. Intuitively, $\mathcal{F}_n$ represents the information available at time $n$. A sequence of random variables $\{X_n\}$ is adapted to the filtration $\{\mathcal{F}_n\}$ if $X_n$ is $\mathcal{F}_n$-measurable for every $n$.


Definition: Martingale

A sequence $\{X_n, \mathcal{F}_n\}_{n \geq 0}$ is a martingale if:

  1. $X_n$ is adapted to $\{\mathcal{F}_n\}$ (each $X_n$ is $\mathcal{F}_n$-measurable).
  2. $\mathbb{E}[|X_n|] < \infty$ for all $n$.
  3. $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ a.s. for all $n$.

If condition 3 is replaced by $\leq$ (resp. $\geq$), the sequence is a supermartingale (resp. submartingale).

The name comes from a class of betting strategies in gambling. A martingale models a "fair game": your expected future fortune, given everything you know now, equals your current fortune.


Example: Standard Martingale Examples

Verify that each of the following is a martingale:

(a) Simple random walk $S_n = \sum_{k=1}^n X_k$ where the $X_k$ are i.i.d. with $P(X_k = 1) = P(X_k = -1) = 1/2$.

(b) Likelihood ratio process: Let $X_1, X_2, \ldots$ be i.i.d. with density $f$ under $P$ and density $g$ under $Q$. Define $L_n = \prod_{k=1}^n g(X_k)/f(X_k)$. Then $\{L_n\}$ is a martingale under $P$.
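For (a), $\mathbb{E}[S_{n+1} \mid \mathcal{F}_n] = S_n + \mathbb{E}[X_{n+1}] = S_n$. For (b), $\mathbb{E}[L_{n+1} \mid \mathcal{F}_n] = L_n \, \mathbb{E}[g(X_{n+1})/f(X_{n+1})] = L_n \int g(x)\, dx = L_n$. A simulation sketch of (b), assuming for concreteness $f = \mathcal{N}(0,1)$ and $g = \mathcal{N}(1,1)$; the martingale property implies $\mathbb{E}_P[L_n] = 1$ for every $n$:

```python
import numpy as np

rng = np.random.default_rng(2)
n_steps, n_paths = 20, 200_000

# i.i.d. samples under P, where f = N(0,1) and g = N(1,1).
X = rng.standard_normal((n_paths, n_steps))

# log g(x) - log f(x) = x - 1/2 for these two Gaussians.
log_ratio = X - 0.5
L = np.exp(np.cumsum(log_ratio, axis=1))  # L_n along each path

# E_P[L_n] should be 1 for every n (the martingale starts at L_0 = 1).
print(L.mean(axis=0)[[0, 4, 9, 19]])  # each ~ 1; noisier as n grows
```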


Martingale Sample Paths

Visualize sample paths of a simple random walk martingale or a Polya urn martingale. The random walk has $\mathbb{E}[S_n] = 0$ for all $n$; the Polya urn fraction converges to a Beta-distributed limit.

(Interactive demo parameters: 500 and 5.)
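A sketch of the Polya urn paths, assuming the standard urn (start with one red and one black ball; draw a ball, return it plus one more of the same color). The fraction of red balls is a bounded martingale, so the convergence theorem below guarantees an a.s. limit, here distributed Beta(1,1) = Uniform(0,1):

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, n_paths = 500, 5

red = np.ones(n_paths)         # start with 1 red ball...
total = 2 * np.ones(n_paths)   # ...and 1 black ball
for _ in range(n_steps):
    draw_red = rng.random(n_paths) < red / total  # draw prop. to counts
    red += draw_red            # replace, adding one ball of the drawn color
    total += 1
    # red/total is a martingale: E[next fraction | past] = current fraction

print(red / total)  # each path has nearly settled near its own random limit
```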

Theorem: Martingale Convergence Theorem

Let $\{X_n, \mathcal{F}_n\}$ be a submartingale satisfying $\sup_n \mathbb{E}[X_n^+] < \infty$. Then $X_n \to X_\infty$ a.s. for some integrable random variable $X_\infty$.

In particular, every non-negative supermartingale converges a.s.

A supermartingale that is bounded below cannot oscillate forever: its "upcrossings" of any interval $[a, b]$ are bounded in expectation. This forces convergence. The result is surprisingly general: we get a.s. convergence without any domination condition, just a bound on the positive part.

Theorem: Optional Stopping Theorem

Let $\{X_n, \mathcal{F}_n\}$ be a martingale and let $T$ be a stopping time (i.e., $\{T = n\} \in \mathcal{F}_n$ for all $n$). If either:

(a) $T$ is bounded (i.e., $T \leq N$ a.s. for some constant $N$), or

(b) $\mathbb{E}[T] < \infty$ and there exists $c > 0$ with $\mathbb{E}[|X_{n+1} - X_n| \mid \mathcal{F}_n] \leq c$ a.s. for all $n$,

then $\mathbb{E}[X_T] = \mathbb{E}[X_0]$.

In a fair game, no stopping strategy can create expected profit. If you flip a fair coin and stop whenever you are ahead, you cannot beat the house, provided your stopping time is bounded, or has finite mean while the increments stay bounded.

The conditions are essential: without them, you could construct a "doubling" strategy that would beat the house (but requires an unbounded bankroll and an unbounded number of rounds).


Example: Gambler's Ruin via Optional Stopping

A gambler starts with $a$ dollars and bets \$1 on each round of a fair coin flip. The game ends when the gambler reaches $N$ dollars or goes broke ($0$ dollars). Find the probability $p$ of reaching $N$.
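The fortune $S_n$ is a martingale, and $T$ (the first hit of $0$ or $N$) satisfies condition (b): increments are bounded by 1 and $\mathbb{E}[T] = a(N - a) < \infty$. So $a = \mathbb{E}[S_0] = \mathbb{E}[S_T] = pN + (1 - p) \cdot 0$, giving $p = a/N$. A simulation check with illustrative values $a = 3$, $N = 10$:

```python
import numpy as np

rng = np.random.default_rng(4)
a, N, n_games = 3, 10, 20_000

wins = 0
for _ in range(n_games):
    s = a
    while 0 < s < N:
        s += 1 if rng.random() < 0.5 else -1  # fair $1 bet each round
    wins += (s == N)

print(wins / n_games, a / N)  # empirical ~ theoretical p = a/N = 0.3
```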

Common Mistake: Optional Stopping Requires Conditions

Mistake:

Applying $\mathbb{E}[X_T] = \mathbb{E}[X_0]$ without verifying the conditions (bounded stopping time or uniform integrability).

Correction:

Consider the doubling strategy on a fair coin: bet 1, 2, 4, 8, ... until you win. The stopping time $T$ is a.s. finite (geometric, with mean 2), and $X_T = 1$ always (you always net \$1). But $\mathbb{E}[X_0] = 0$ for the associated martingale, so $\mathbb{E}[X_T] \neq \mathbb{E}[X_0]$. The conditions of the theorem fail because the bets grow exponentially: no constant $c$ bounds $\mathbb{E}[|X_{n+1} - X_n| \mid \mathcal{F}_n]$, and the expected total amount staked is infinite.
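A sketch that makes the failure concrete: every run nets exactly \$1, yet the losses accumulated before the win are unbounded, and their expectation diverges (the run count and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n_runs = 100_000

nets, worst = np.empty(n_runs), np.empty(n_runs)
for i in range(n_runs):
    bet, lost = 1, 0
    while rng.random() < 0.5:   # lose with prob 1/2, double and retry
        lost += bet
        bet *= 2
    nets[i] = bet - lost        # the winning bet recoups everything + $1
    worst[i] = lost             # cumulative losses before the win

print(nets.min(), nets.max())     # always exactly 1
print(worst.mean(), worst.max())  # sample mean keeps growing: E is infinite
```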

🔧 Engineering Note

Wald's Sequential Probability Ratio Test

The optional stopping theorem, applied to the log-likelihood ratio martingale, is the theoretical foundation of Wald's Sequential Probability Ratio Test (SPRT). In the SPRT, observations are collected one at a time and the test stops as soon as the cumulative log-likelihood ratio crosses one of two thresholds. The expected number of samples is minimized among all tests with the same error probabilities, a result known as the optimality of the SPRT.

In radar and communications, sequential detection is used when the cost of observations is significant (e.g., energy-constrained sensors) and one wants to minimize the average sample number.
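A minimal SPRT sketch under assumed simple hypotheses $H_0: \mathcal{N}(0,1)$ vs. $H_1: \mathcal{N}(1,1)$, using Wald's approximate thresholds $A \approx (1-\beta)/\alpha$ and $B \approx \beta/(1-\alpha)$ (all parameter values and names are illustrative):

```python
import numpy as np

def sprt(rng, mu_true, alpha=0.05, beta=0.05, mu0=0.0, mu1=1.0):
    """Run one SPRT for H0: N(mu0,1) vs H1: N(mu1,1); return (decision, n)."""
    log_A = np.log((1 - beta) / alpha)   # decide H1 above this threshold
    log_B = np.log(beta / (1 - alpha))   # decide H0 below this threshold
    llr, n = 0.0, 0
    while log_B < llr < log_A:
        x = rng.normal(mu_true, 1.0)
        n += 1
        # log-likelihood ratio increment for two unit-variance Gaussians
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2)
    return ("H1" if llr >= log_A else "H0"), n

rng = np.random.default_rng(6)
results = [sprt(rng, mu_true=1.0) for _ in range(10_000)]
decisions, ns = zip(*results)
print(decisions.count("H0") / len(decisions))  # ~ beta (miss rate under H1)
print(np.mean(ns))                             # average sample number
```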

Quick Check

The tower property of conditional expectation states that for $\mathcal{H} \subseteq \mathcal{G}$:

(a) $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{G}]$

(b) $\mathbb{E}[\mathbb{E}[X \mid \mathcal{H}] \mid \mathcal{G}] = \mathbb{E}[X \mid \mathcal{H}]$

(c) $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]$

(Answer: (c), matching property 3 above.)

Martingale

An adapted integrable sequence $\{X_n, \mathcal{F}_n\}$ satisfying $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ a.s. Models a "fair game" where the expected future value, given all current information, equals the present value.

Related: Filtration, Stopping Time

Filtration

An increasing sequence of sigma-algebras $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots$ representing growing information over time.

Related: Martingale, Adapted Process

Key Takeaway

Conditional expectation given a sigma-algebra is a random variable, not a number. It is the best (in the $L^2$ sense) approximation of $X$ using only the information in $\mathcal{G}$. The tower property, Jensen's inequality, and the martingale framework all flow from this single definition, and these are the tools that power the deepest results in information theory and statistical inference.