Conditional Expectation: The Deeper View

Why Conditional Expectation Deserves Its Own Chapter

In Chapter 4, we computed conditional expectations $\mathbb{E}[X|Y=y]$ — plugging in a specific observed value $y$ and obtaining a number. That perspective is useful for computation but misses the deeper structure.

The key shift in this chapter: we treat $\mathbb{E}[X|Y]$ as a random variable — a function of the random variable $Y$, not of a particular value $y$. This shift unlocks the tower property, the orthogonality principle, and the entire theory of optimal estimation.

The payoff is immediate: $\mathbb{E}[X|Y]$ turns out to be the best predictor of $X$ given $Y$ in the mean square sense — and understanding why requires thinking of it as a random variable.

Definition: Conditional Expectation as a Random Variable

Let $X$ and $Y$ be random variables with joint density $f_{X,Y}(x,y)$. The conditional expectation of $X$ given $Y$, denoted $\mathbb{E}[X|Y]$, is the random variable defined by

$$\mathbb{E}[X|Y] = g(Y), \quad \text{where} \quad g(y) = \mathbb{E}[X|Y=y] = \int_{-\infty}^{\infty} x \, f(x|y) \, dx.$$

The function $g: \mathbb{R} \to \mathbb{R}$ maps each possible value of $Y$ to the conditional mean of $X$ given that value. Since $Y$ is random, $g(Y)$ is random.

The distinction matters: $\mathbb{E}[X|Y=y]$ is a number (for each fixed $y$), while $\mathbb{E}[X|Y]$ is a random variable (a function of the random $Y$). The former is a function of $y$; the latter is a function of $\omega$ through $Y(\omega)$.
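A minimal numerical sketch of this distinction, under an assumed toy model (not from the text) with $Y \sim \text{Uniform}(0,1)$ and $X \mid Y=y \sim \mathcal{N}(y, 1)$, so that $g(y) = y$ exactly:

```python
# Sketch: g(y) = E[X | Y=y] is a plain number for each fixed y,
# while g(Y) = E[X|Y] is a random variable (one value per realization of Y).
# Toy model (an assumption for illustration): Y ~ Uniform(0,1), X | Y=y ~ Normal(y, 1).
import numpy as np

rng = np.random.default_rng(0)

def g(y, n_samples=100_000):
    """Monte Carlo estimate of E[X | Y=y]: a number for the fixed value y."""
    x = rng.normal(loc=y, scale=1.0, size=n_samples)
    return x.mean()

print(g(0.3))   # a number, close to 0.3

# E[X|Y] = g(Y): evaluating g at random draws of Y gives a random quantity.
y_draws = rng.uniform(0.0, 1.0, size=5)
print([round(g(y), 3) for y in y_draws])   # a different value for each realization of Y
```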

Example: Conditional Expectation for Exponential-Gamma Pair

Let $Y \sim \text{Gamma}(\alpha, \beta)$ and $X \mid Y=y \sim \text{Exp}(y)$. Find $\mathbb{E}[X|Y]$ as a random variable.
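A sketch of the solution, assuming $\text{Exp}(y)$ denotes the exponential distribution with rate $y$ (and hence mean $1/y$): the conditional mean for fixed $y$ is

$$g(y) = \mathbb{E}[X \mid Y=y] = \frac{1}{y}, \qquad \text{so} \qquad \mathbb{E}[X|Y] = g(Y) = \frac{1}{Y}.$$

By the tower property below, $\mathbb{E}[X] = \mathbb{E}[1/Y]$, which equals $\beta/(\alpha-1)$ for $\alpha > 1$ under the rate parameterization of the Gamma distribution.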

Theorem: Tower Property (Law of Iterated Expectations)

For any random variables $X$ and $Y$ with $\mathbb{E}[|X|] < \infty$:

$$\mathbb{E}\bigl[\mathbb{E}[X|Y]\bigr] = \mathbb{E}[X].$$

More generally, if $Y$ is a function of $Z$ (i.e., $Y = h(Z)$), then

$$\mathbb{E}\bigl[\mathbb{E}[X|Z] \,\big|\, Y\bigr] = \mathbb{E}[X|Y].$$

Averaging over $Y$ after conditioning on $Y$ recovers the unconditional average. Refining information (conditioning on more) and then coarsening (averaging out the extra) brings you back to the coarser conditioning.
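A one-line check in the density case (assuming the joint density $f_{X,Y}$ exists, as in the definition above) shows why the first identity holds:

$$\mathbb{E}\bigl[\mathbb{E}[X|Y]\bigr] = \int g(y)\, f_Y(y)\, dy = \iint x\, f(x|y)\, f_Y(y)\, dx\, dy = \iint x\, f_{X,Y}(x,y)\, dx\, dy = \mathbb{E}[X].$$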


Theorem: Properties of Conditional Expectation

Let $X$, $Y$, $Z$ be random variables with finite expectations. Then:

  1. Linearity: $\mathbb{E}[\alpha X + \beta Z \mid Y] = \alpha\,\mathbb{E}[X|Y] + \beta\,\mathbb{E}[Z|Y]$ for constants $\alpha, \beta$.

  2. Pulling out what is known: If $h(Y)$ is a function of $Y$, then $\mathbb{E}[h(Y) \cdot X \mid Y] = h(Y) \cdot \mathbb{E}[X|Y]$.

  3. Independence: If $X$ and $Y$ are independent, then $\mathbb{E}[X|Y] = \mathbb{E}[X]$.

  4. Tower property: $\mathbb{E}[\mathbb{E}[X|Y]] = \mathbb{E}[X]$.

  5. Conditional Jensen: If $\varphi$ is convex, then $\varphi(\mathbb{E}[X|Y]) \leq \mathbb{E}[\varphi(X) \mid Y]$.

Properties 1-2 say that conditional expectation behaves like an "expectation operator" in which $Y$ plays the role of a constant. Property 3 says that if $Y$ tells you nothing about $X$, conditioning on $Y$ does not change your estimate. Property 4 is the tower property. Property 5 extends Jensen's inequality to the conditional setting.
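A minimal simulation sketch (my own illustration, assuming the toy model $Y \sim \text{Uniform}(0,2)$ and $X \mid Y=y \sim \mathcal{N}(y^2, 1)$, so that $\mathbb{E}[X|Y] = Y^2$) that checks properties 2 and 4 numerically:

```python
# Numerical check of "pulling out what is known" (property 2) and the tower
# property (property 4) in an assumed toy model where E[X|Y] = Y^2.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

y = rng.uniform(0.0, 2.0, size=n)
x = rng.normal(loc=y**2, scale=1.0, size=n)

# Property 4 (tower): E[E[X|Y]] = E[Y^2] should match E[X]; both are near 4/3 here.
print(np.mean(y**2), np.mean(x))

# Property 2 (pulling out what is known), combined with the tower property:
# E[h(Y) X] = E[h(Y) E[X|Y]] = E[h(Y) Y^2] for, say, h(y) = sin(y).
print(np.mean(np.sin(y) * x), np.mean(np.sin(y) * y**2))   # should agree closely
```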

Quick Check

If $\mathbb{E}[X|Y] = c$ (a constant) for all values of $Y$, what can we conclude?

$X$ and $Y$ are independent

$c = \mathbb{E}[X]$

$X$ is a constant

$\text{Var}(X|Y) = 0$

Example: Conditional Expectation for Jointly Gaussian $(X,Y)$

Let $(X,Y)$ be jointly Gaussian with means $\mu_X, \mu_Y$, variances $\sigma_X^2, \sigma_Y^2$, and correlation coefficient $\rho$. Find $\mathbb{E}[X|Y]$.
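A sketch of the answer, using the standard bivariate normal conditioning result: the conditional distribution of $X$ given $Y=y$ is Gaussian with a mean that is linear in $y$, so

$$\mathbb{E}[X|Y] = \mu_X + \rho\,\frac{\sigma_X}{\sigma_Y}\,(Y - \mu_Y).$$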

Key Takeaway

For jointly Gaussian random variables, $\mathbb{E}[X|Y]$ is a linear function of $Y$. Very few distribution families share this exact linearity, and it is a central reason why Gaussian models are so tractable in estimation theory.
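A minimal simulation sketch (arbitrary illustrative parameters, not from the text) comparing binned sample means of $X$ near $Y = y$ with the linear formula above:

```python
# Sketch: estimate E[X | Y ≈ y] by binning samples on Y and compare with
# mu_X + rho * (sigma_X / sigma_Y) * (y - mu_Y). Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -0.5, 2.0, 1.5, 0.7

cov = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                [rho * sigma_x * sigma_y, sigma_y**2             ]])
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000).T

for y0 in (-2.0, -0.5, 1.0):
    in_bin = np.abs(y - y0) < 0.05                       # samples with Y near y0
    empirical = x[in_bin].mean()                         # Monte Carlo E[X | Y ≈ y0]
    linear = mu_x + rho * (sigma_x / sigma_y) * (y0 - mu_y)
    print(f"y={y0:+.1f}: empirical {empirical:.3f}  vs  linear formula {linear:.3f}")
```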

Common Mistake: $\mathbb{E}[X|Y]$ Is Not a Number

Mistake:

Writing "$\mathbb{E}[X|Y] = 3$" and treating it as a fixed quantity.

Correction:

$\mathbb{E}[X|Y]$ is a random variable. It takes different values for different realizations of $Y$. The statement "$\mathbb{E}[X|Y] = 3$" means that the function $g(y) = \mathbb{E}[X|Y=y]$ happens to equal 3 for all $y$ — which implies $\mathbb{E}[X] = 3$ by the tower property. In most cases, $\mathbb{E}[X|Y]$ varies with $Y$.

Definition: Conditional Expectation for Random Vectors

For random vectors $\mathbf{X} \in \mathbb{R}^n$ and $\mathbf{Y} \in \mathbb{R}^m$, the conditional expectation $\mathbb{E}[\mathbf{X}|\mathbf{Y}]$ is the random vector whose $i$-th component is $\mathbb{E}[X_i|\mathbf{Y}]$:

$$\mathbb{E}[\mathbf{X}|\mathbf{Y}] = \begin{pmatrix} \mathbb{E}[X_1|\mathbf{Y}] \\ \vdots \\ \mathbb{E}[X_n|\mathbf{Y}] \end{pmatrix}.$$

All the properties (linearity, tower, pulling out known, independence) extend component-wise.
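As an illustration, a minimal sketch for the jointly Gaussian vector case (extending the scalar Gaussian example above; the numbers below are arbitrary assumptions), using the standard Gaussian conditioning formula $\mathbb{E}[\mathbf{X}|\mathbf{Y}] = \boldsymbol{\mu}_X + \Sigma_{XY}\Sigma_{YY}^{-1}(\mathbf{Y} - \boldsymbol{\mu}_Y)$:

```python
# Sketch: vector-valued conditional expectation for a jointly Gaussian pair
# (X in R^2, Y in R^2) via E[X|Y] = mu_X + Sigma_XY @ inv(Sigma_YY) @ (Y - mu_Y).
# All numerical values are arbitrary illustrative choices.
import numpy as np

mu_x = np.array([0.0, 1.0])
mu_y = np.array([2.0, -1.0])
sigma_xy = np.array([[0.8, 0.2],
                     [0.1, 0.5]])        # Cov(X, Y)
sigma_yy = np.array([[1.0, 0.3],
                     [0.3, 2.0]])        # Cov(Y, Y)

def conditional_mean(y):
    """E[X | Y=y] evaluated at a realization y; applied component-wise to the
    random vector Y, this function defines the random vector E[X|Y]."""
    return mu_x + sigma_xy @ np.linalg.solve(sigma_yy, y - mu_y)

print(conditional_mean(np.array([2.5, -0.5])))   # one realization of E[X|Y]
```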

Conditional Density $f(x|y)$ and $\mathbb{E}[X|Y=y]$ for Jointly Gaussian $(X,Y)$

Visualize the joint Gaussian density, a slice at $Y=y$, and the conditional mean $\mathbb{E}[X|Y=y]$ as $y$ varies. The red line traces the conditional mean across all $y$ values.

Adjustable parameters: the correlation coefficient $\rho$ (default 0.7) and the conditioning value of $Y$ (default 1).

Historical Note: Kolmogorov and the Measure-Theoretic Foundation


The rigorous definition of conditional expectation as a random variable was established by Andrey Kolmogorov in his 1933 monograph Grundbegriffe der Wahrscheinlichkeitsrechnung. Before Kolmogorov, conditional expectation was defined only for discrete random variables or via Bayes' rule when densities exist. Kolmogorov's approach — defining $\mathbb{E}[X|\mathcal{F}]$ as a Radon-Nikodym derivative — extended the concept to arbitrary $\sigma$-algebras, laying the foundation for martingale theory and modern stochastic processes.

Conditional Expectation

The random variable $\mathbb{E}[X|Y] = g(Y)$ where $g(y) = \mathbb{E}[X|Y=y]$. It is the best predictor of $X$ given $Y$ in the mean square error sense.

Related: Minimum Mean Square Error (MMSE) Estimator, Tower Property

Tower Property

The identity $\mathbb{E}[\mathbb{E}[X|Y]] = \mathbb{E}[X]$, also called the law of iterated expectations or the smoothing property. Averaging the conditional expectation over $Y$ recovers the unconditional expectation.

Related: Conditional Expectation