Conditional Distributions

From Events to Random Variables

In Chapter 2 we defined $\mathbb{P}(A \mid B)$ for events. Now we extend this idea to random variables: given that $X$ takes a particular value $x$, what is the distribution of $Y$? This is the conditional distribution, and it is the mathematical foundation for Bayesian inference, channel estimation, and signal detection.

Definition: Conditional PMF (Discrete Case)

For discrete RVs $X$ and $Y$ with joint PMF $P_{X,Y}$, the conditional PMF of $Y$ given $X = x_i$ is

$$P_{Y|X}(y_j \mid x_i) = \frac{P_{X,Y}(x_i, y_j)}{P_X(x_i)},$$

defined for all $x_i$ with $P_X(x_i) > 0$.

For each fixed $x_i$, the function $y_j \mapsto P_{Y|X}(y_j \mid x_i)$ is a valid PMF: it is non-negative and sums to 1 over $j$. The conditional PMF is simply the $i$-th row of the joint PMF table, normalized by the row sum $P_X(x_i)$.
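
A minimal sketch of this row-normalization in Python/NumPy (the joint table below is invented purely for illustration):

```python
import numpy as np

# Hypothetical joint PMF table: rows index x_i, columns index y_j
P_XY = np.array([
    [0.10, 0.20, 0.10],
    [0.05, 0.25, 0.30],
])

P_X = P_XY.sum(axis=1)             # marginal PMF of X (row sums)

# Conditional PMF of Y given X = x_i: divide each row by its row sum
P_Y_given_X = P_XY / P_X[:, None]

print(P_Y_given_X)
print(P_Y_given_X.sum(axis=1))     # each row sums to 1 -> valid PMFs
```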

Definition: Conditional PDF (Continuous Case)

For jointly continuous RVs $(X, Y)$ with joint PDF $f_{X,Y}$, the conditional PDF of $Y$ given $X = x$ is

$$f(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)},$$

defined for all $x$ with $f_X(x) > 0$.

The conditional CDF is

$$F_{Y|X}(y \mid x) = \int_{-\infty}^{y} f(v \mid x)\,dv = \int_{-\infty}^{y} \frac{f_{X,Y}(x, v)}{f_X(x)}\,dv.$$

The conditioning event $\{X = x\}$ has probability zero for continuous $X$, so the definition is a limit: we condition on the thin strip $\{x < X \le x + dx\}$ and let $dx \to 0$. The resulting object is well-defined as a Radon-Nikodym derivative.

Slicing the Joint Density

Intuitively, conditioning on $X = x$ amounts to "slicing" the joint density $f_{X,Y}(x, y)$ at a fixed $x$ value and then renormalizing so that the slice integrates to 1. The shape of $f(y \mid x)$ as a function of $y$ is proportional to $f_{X,Y}(x, y)$, but the normalization constant $f_X(x)$ ensures it is a proper density.
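
A small numerical sketch of this slicing, assuming a standard bivariate Gaussian with an illustrative correlation $\rho$ (for which the conditional is known to be $\mathcal{N}(\rho x_0,\, 1 - \rho^2)$):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho, x0 = 0.7, 1.0                      # illustrative correlation and slice location
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

y = np.linspace(-5, 5, 2001)
dy = y[1] - y[0]

# Slice the joint density along x = x0 ...
joint_slice = mvn.pdf(np.column_stack([np.full_like(y, x0), y]))

# ... and renormalize so the slice integrates to 1: this is f(y | x0)
f_cond = joint_slice / (joint_slice.sum() * dy)

# Compare with the known conditional N(rho * x0, 1 - rho^2)
exact = norm.pdf(y, loc=rho * x0, scale=np.sqrt(1 - rho**2))
print(np.max(np.abs(f_cond - exact)))   # ~0 up to grid error
```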

[Interactive figure: Conditional PDF $f(y \mid x)$ as $x$ varies. A slider moves the conditioning value $x$ to show how the conditional density of $Y$ changes shape; the joint density is a bivariate Gaussian with adjustable correlation.]

Theorem: Bayes' Rule for Continuous Random Variables

For jointly continuous $(X, Y)$ with $f_X(x) > 0$ and $f_Y(y) > 0$:

$$f_{X|Y}(x \mid y) = \frac{f(y \mid x)\,f_X(x)}{f_Y(y)},$$

where $f_Y(y) = \int_{-\infty}^{\infty} f(y \mid x)\,f_X(x)\,dx$.

This is the continuous analogue of Bayes' theorem: the prior density $f_X(x)$ is updated to the posterior density $f_{X|Y}(x \mid y)$ via the likelihood $f(y \mid x)$. The denominator $f_Y(y)$ serves as the normalizing constant.
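
A grid-based sketch of this update, with a toy model chosen purely for illustration (a $\mathcal{N}(0, 1)$ prior on $X$ and likelihood $Y \mid X = x \sim \mathcal{N}(x, 0.8^2)$):

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

prior = norm.pdf(x)                              # f_X(x): N(0, 1), assumed
y_obs = 1.5                                      # a hypothetical observation of Y
likelihood = norm.pdf(y_obs, loc=x, scale=0.8)   # f(y_obs | x)

# Normalizing constant f_Y(y_obs) by numerical integration, then Bayes' rule
f_y = np.sum(likelihood * prior) * dx
posterior = likelihood * prior / f_y

print(np.sum(posterior) * dx)                    # -> 1.0: a valid density
print(x[np.argmax(posterior)])                   # posterior mode
```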

Definition: Conditional Expectation

For jointly continuous $(X, Y)$, the conditional expectation of $X$ given $Y = y$ is

$$\mathbb{E}[X \mid Y = y] = \int_{-\infty}^{\infty} x\,f_{X|Y}(x \mid y)\,dx.$$

Viewed as a function of $y$, $g(y) = \mathbb{E}[X \mid Y = y]$ is an ordinary real-valued function. The random variable $g(Y) = \mathbb{E}[X \mid Y]$ is called the conditional expectation of $X$ given $Y$.
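
As an illustrative numerical check (standard bivariate Gaussian with a made-up $\rho$, for which $\mathbb{E}[X \mid Y = y] = \rho y$ is known), $g(y)$ can be computed by integrating $x$ against the normalized slice:

```python
import numpy as np
from scipy.stats import multivariate_normal

rho = 0.6                                  # illustrative correlation
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x = np.linspace(-6, 6, 3001)
for y0 in (-1.0, 0.0, 2.0):
    slice_xy = mvn.pdf(np.column_stack([x, np.full_like(x, y0)]))
    f_cond = slice_xy / slice_xy.sum()     # f(x | y0) on the grid (normalized)
    g = np.sum(x * f_cond)                 # E[X | Y = y0]
    print(f"y = {y0:+.1f}:  E[X|Y=y] = {g:+.4f}   (rho * y = {rho * y0:+.4f})")
```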

Theorem: Law of Iterated Expectation (Tower Property)

For any random variables $X$ and $Y$ with $\mathbb{E}[|X|] < \infty$:

$$\mathbb{E}[X] = \mathbb{E}\bigl[\mathbb{E}[X \mid Y]\bigr].$$

The tower property says: average the conditional averages, weighted by the distribution of what you conditioned on, and you recover the unconditional average. This identity is the workhorse behind performance analysis in wireless: whenever you want to average a rate or an error probability over a fading channel, you first compute the conditional quantity given the channel realization, then average over the channel distribution.
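
A quick Monte Carlo sanity check of the identity, using an arbitrary hierarchical toy model: $X \sim \text{Uniform}[0, 1]$ and $Y \mid X = x \sim \text{Binomial}(10, x)$, so $\mathbb{E}[Y \mid X] = 10X$ and $\mathbb{E}[Y] = 10\,\mathbb{E}[X] = 5$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Hierarchical model (illustrative): first draw X, then Y given X
x = rng.uniform(0.0, 1.0, n)
y = rng.binomial(10, x)

print(y.mean())           # direct Monte Carlo estimate of E[Y]    -> ~5
print((10 * x).mean())    # average of conditional means E[Y | X]  -> ~5
```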

Theorem: Law of Total Variance

For random variables $X$ and $Y$ with $\mathbb{E}[X^2] < \infty$:

$$\text{Var}(X) = \mathbb{E}\bigl[\text{Var}(X \mid Y)\bigr] + \text{Var}\bigl(\mathbb{E}[X \mid Y]\bigr).$$

The total variance of $X$ decomposes into two parts: the average of the conditional variances (the "within-group" variability) plus the variance of the conditional means (the "between-group" variability). This identity is used extensively in Bayesian analysis: the first term captures the variability that remains after conditioning, and the second captures the uncertainty that comes from not knowing $Y$.
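
The same toy model illustrates the decomposition; here $\text{Var}(Y \mid X) = 10X(1 - X)$ and $\mathbb{E}[Y \mid X] = 10X$ are available in closed form, so both terms can be estimated directly:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

x = rng.uniform(0.0, 1.0, n)        # X ~ Uniform[0, 1]
y = rng.binomial(10, x)             # Y | X = x ~ Binomial(10, x)

within = np.mean(10 * x * (1 - x))  # E[Var(Y | X)]: within-group variability
between = np.var(10 * x)            # Var(E[Y | X]): between-group variability

print(within + between)             # -> ~10
print(np.var(y))                    # -> ~10, matching Var(Y)
```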

Example: Conditional Expectation on the Triangle

Let $(X, Y)$ be uniform on the triangle $\{(x, y) : 0 \le y \le x \le 1\}$ (so $f_{X,Y}(x, y) = 2$ on this region). Compute $\mathbb{E}[Y \mid X = x]$ and verify the tower property.
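
Solution sketch. The marginal of $X$ is $f_X(x) = \int_0^x 2\,dy = 2x$ for $0 \le x \le 1$, so

$$f(y \mid x) = \frac{2}{2x} = \frac{1}{x}, \qquad 0 \le y \le x,$$

i.e., $Y \mid X = x \sim \text{Uniform}[0, x]$ and $\mathbb{E}[Y \mid X = x] = x/2$. The tower property then gives

$$\mathbb{E}[Y] = \mathbb{E}\!\left[\frac{X}{2}\right] = \frac{1}{2}\int_0^1 x \cdot 2x\,dx = \frac{1}{3},$$

which matches the direct computation $\mathbb{E}[Y] = \int_0^1\!\int_0^x 2y\,dy\,dx = \int_0^1 x^2\,dx = 1/3$.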


Why This Matters: Tower Property in Fading Channel Analysis

The tower property is the engine behind computing average performance metrics over fading channels. For instance, the average bit error rate of a modulation scheme over a Rayleigh fading channel is computed as $\mathbb{E}[P_e] = \mathbb{E}[\mathbb{E}[P_e \mid H]]$: first compute the conditional BER given the channel gain $H = h$ (which is just the AWGN BER at SNR $|h|^2 \cdot \text{SNR}$), then average over the distribution of $|H|^2$. This two-step approach is used throughout Books 1 and FSI.
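
As a hedged illustration of the two-step recipe, the sketch below assumes BPSK, where the conditional BER at instantaneous SNR $\gamma = |h|^2\,\overline{\gamma}$ is $Q(\sqrt{2\gamma})$ and the Rayleigh-averaged BER has the standard closed form $\tfrac{1}{2}\bigl(1 - \sqrt{\overline{\gamma}/(1 + \overline{\gamma})}\bigr)$:

```python
import numpy as np
from scipy.special import erfc

def q_func(x):
    # Gaussian Q-function: Q(x) = 0.5 * erfc(x / sqrt(2))
    return 0.5 * erfc(x / np.sqrt(2))

rng = np.random.default_rng(0)
snr = 10 ** (10.0 / 10)                  # average SNR (10 dB), gamma-bar

# Rayleigh fading: h ~ CN(0, 1), so the power gain |h|^2 is Exp(1)
h2 = rng.exponential(1.0, size=1_000_000)

# Tower property: average the conditional (AWGN) BER over |h|^2
ber_mc = q_func(np.sqrt(2 * snr * h2)).mean()

# Closed-form average BER for BPSK over Rayleigh fading
ber_exact = 0.5 * (1 - np.sqrt(snr / (1 + snr)))

print(f"Monte Carlo: {ber_mc:.5f}   closed form: {ber_exact:.5f}")
```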

Common Mistake: Conditioning on a Zero-Probability Event

Mistake:

Writing $\mathbb{P}(Y \le y \mid X = x) = \mathbb{P}(Y \le y, X = x) / \mathbb{P}(X = x)$ for continuous $X$, which gives $0/0$.

Correction:

For continuous $X$, $\mathbb{P}(X = x) = 0$, so the ratio is undefined. The conditional CDF is instead defined as the limit of $\mathbb{P}(Y \le y \mid x < X \le x + dx)$ as $dx \to 0$, which yields the formula involving the conditional PDF.
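
The limiting definition can also be checked numerically. In the sketch below (bivariate Gaussian with illustrative parameters), the strip-conditioned probability approaches the exact conditional CDF as $dx$ shrinks:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
rho, x0, y0 = 0.6, 0.5, 0.2             # illustrative correlation and target point
n = 5_000_000

# Standard bivariate Gaussian, sampled via its conditional structure
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Exact limit: Y | X = x0 ~ N(rho * x0, 1 - rho^2)
exact = norm.cdf((y0 - rho * x0) / np.sqrt(1 - rho**2))

# Condition on the strip {x0 < X <= x0 + dx} and shrink dx
for dx in (0.5, 0.1, 0.02):
    strip = (x > x0) & (x <= x0 + dx)
    print(f"dx = {dx:4.2f}:  {np.mean(y[strip] <= y0):.4f}   (limit: {exact:.4f})")
```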

Quick Check

Let $X \sim \text{Exp}(1)$ and $Y \mid X = x \sim \text{Uniform}[0, x]$. What is $\mathbb{E}[Y]$?

(a) $1/2$

(b) $1$

(c) $1/4$

(d) $2$
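
Answer: (a) $1/2$. By the tower property, $\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid X]] = \mathbb{E}[X/2] = \tfrac{1}{2}\,\mathbb{E}[X] = \tfrac{1}{2}$, since $\mathbb{E}[X] = 1$ for $X \sim \text{Exp}(1)$.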

🔧 Engineering Note

Conditional Expectation as Optimal Estimator

The conditional expectation $\mathbb{E}[X \mid Y = y]$ is the minimum mean square error (MMSE) estimator of $X$ given $Y = y$. That is, among all functions $g(Y)$, the choice $g(y) = \mathbb{E}[X \mid Y = y]$ minimizes $\mathbb{E}[(X - g(Y))^2]$. This result, proved in Book FSI, is the theoretical foundation for Bayesian estimation, LMMSE filtering, and Kalman filtering.
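
A numerical check of this optimality claim under a simple (assumed) Gaussian signal-plus-noise model: for $X \sim \mathcal{N}(0, \sigma_x^2)$ and $Y = X + N$ with independent $N \sim \mathcal{N}(0, \sigma_n^2)$, the conditional mean is the linear shrinkage $\mathbb{E}[X \mid Y] = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_n^2}\,Y$, and its MSE should beat any competing $g(Y)$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_x, sigma_n = 1.0, 0.5               # illustrative signal/noise std devs
n = 1_000_000

x = sigma_x * rng.standard_normal(n)      # signal X
y = x + sigma_n * rng.standard_normal(n)  # observation Y = X + N

# MMSE estimator for this jointly Gaussian model: linear shrinkage of Y
w = sigma_x**2 / (sigma_x**2 + sigma_n**2)
estimators = {
    "E[X|Y] (MMSE)": w * y,
    "g(Y) = Y":      y,
    "g(Y) = 0.5 Y":  0.5 * y,
}
for name, xhat in estimators.items():
    print(f"{name:>14}: MSE = {np.mean((x - xhat) ** 2):.4f}")
```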