Expectation

What Is the Average?

Knowing the full PMF of a random variable is ideal, but we often need a single number that summarizes its "typical" value. The expectation does exactly this: it is the probability-weighted average of all possible values. The expectation is the single most important summary of a random variable, not because it tells the whole story, but because it enjoys a remarkable property (linearity) that makes it computable even when the full distribution is out of reach.

Definition:

Expectation of a Discrete Random Variable

The expectation (or mean) of a discrete random variable $X$ with PMF $P$ and support $\mathcal{X}$ is

$$\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \, P(x),$$

provided the sum converges absolutely: $\sum_{x} |x| \, P(x) < \infty$.

If the absolute convergence condition fails, we say the expectation does not exist. The Cauchy distribution is the classical example in the continuous case; among discrete distributions, a random variable with PMF $p(k) \propto 1/k^3$ on the positive integers has finite mean, but $p(k) \propto 1/k^2$ does not, since $\sum_k k \cdot 1/k^2 = \sum_k 1/k$ diverges.
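The defining sum is easy to evaluate mechanically. Below is a minimal sketch in Python, using a fair six-sided die as an assumed example (not from the text), that computes $\mathbb{E}[X]$ directly from a PMF stored as a dictionary.

```python
# A minimal sketch (assumed example): evaluating the defining sum for a
# fair six-sided die with P(x) = 1/6 for x = 1, ..., 6.

pmf = {x: 1 / 6 for x in range(1, 7)}

# E[X] = sum over the support of x * P(x)
expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 3.5

# With a finite support the absolute-convergence condition is automatic:
# sum |x| P(x) is a finite sum.
```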


Theorem: Linearity of Expectation

For any random variables $X_1, \ldots, X_n$ (not necessarily independent) and constants $a_1, \ldots, a_n, b \in \mathbb{R}$:

$$\mathbb{E}\!\left[\sum_{i=1}^n a_i X_i + b\right] = \sum_{i=1}^n a_i \, \mathbb{E}[X_i] + b.$$

Linearity holds unconditionally; it does not require independence. This is arguably the single most useful property in all of probability. It allows us to compute the expected number of successes in $n$ trials without knowing the joint distribution, simply by summing the individual success probabilities.
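To see that independence really plays no role, here is a hedged simulation sketch; the card-drawing setup is an assumed example, not taken from the text. The indicators for "card $i$ is red" are dependent (the draws are without replacement), yet their expectations still simply add.

```python
import random

# Linearity with *dependent* indicators: draw k cards without replacement
# and count reds. E[number of reds] = sum of E[indicator_i] = k * (26 / 52).

def average_red_count(k=5, trials=200_000):
    deck = ["red"] * 26 + ["black"] * 26
    total = 0
    for _ in range(trials):
        hand = random.sample(deck, k)
        total += sum(card == "red" for card in hand)
    return total / trials

print(average_red_count())  # empirically close to 2.5
print(5 * 26 / 52)          # 2.5, from linearity alone
```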


Example: Expected Number of Fixed Points (Matching Problem)

Find $\mathbb{E}[X]$ where $X$ is the number of fixed points in a random permutation of $\{1, \ldots, n\}$.
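One standard route (sketched here as a hint, since no worked solution appears above) is to write $X = \sum_{i=1}^n \mathbf{1}\{\pi(i) = i\}$ and apply linearity: each indicator has expectation $1/n$, so $\mathbb{E}[X] = n \cdot \tfrac{1}{n} = 1$ for every $n$. The simulation below is an illustrative sanity check of that answer.

```python
import random

# Illustrative check of the matching problem: average number of fixed points
# of a uniformly random permutation of {0, ..., n-1}. Linearity predicts 1.

def average_fixed_points(n=10, trials=100_000):
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        total += sum(1 for i, p in enumerate(perm) if i == p)
    return total / trials

print(average_fixed_points())  # close to 1.0, independent of n
```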

Theorem: LOTUS (Law of the Unconscious Statistician)

If $X$ is a discrete random variable with PMF $P$ and $g : \mathbb{R} \to \mathbb{R}$ is any function, then

$$\mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x) \, P(x).$$

In particular, we do not need to first find the PMF of $Y = g(X)$.

The name is tongue-in-cheek: the formula is so natural that beginners use it "unconsciously" before proving it. The point is that to compute $\mathbb{E}[g(X)]$, it suffices to know the distribution of $X$, not of $g(X)$.
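The following sketch contrasts the two routes for an assumed example (a fair die and $g(x) = (x-3)^2$, neither from the text): summing $g(x)\,P(x)$ directly via LOTUS versus first constructing the PMF of $Y = g(X)$.

```python
from collections import defaultdict

# Two routes to E[g(X)] for an assumed example: fair die, g(x) = (x - 3)^2.

pmf_x = {x: 1 / 6 for x in range(1, 7)}

def g(x):
    return (x - 3) ** 2

# Route 1 (LOTUS): sum g(x) P(x) directly over the support of X.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Route 2: first build the PMF of Y = g(X), then apply the basic definition.
pmf_y = defaultdict(float)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p  # x-values that map to the same y pool their mass
direct = sum(y * p for y, p in pmf_y.items())

print(lotus, direct)  # equal (about 3.1667); LOTUS skips the middle step
```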


Example: Computing $\mathbb{E}[X^2]$ via LOTUS

Let $X$ be a Bernoulli($p$) random variable. Compute $\mathbb{E}[X^2]$ using LOTUS.
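For reference, the computation is a one-line application of LOTUS: since $X$ takes only the values $0$ and $1$,

$$\mathbb{E}[X^2] = 0^2 \cdot (1 - p) + 1^2 \cdot p = p,$$

whereas $(\mathbb{E}[X])^2 = p^2$. The two agree only when $p \in \{0, 1\}$, which leads directly into the warning below.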

Common Mistake: $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$ in General

Mistake:

Assuming that $\mathbb{E}[X^2] = (\mathbb{E}[X])^2$, or more generally that the expectation "passes through" nonlinear functions.

Correction:

Expectation passes through only affine (linear-plus-constant) functions of $X$; that is exactly what linearity says. For a nonlinear $g$, Jensen's inequality tells us the direction of the inequality: if $g$ is convex, $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$; if $g$ is concave, $\mathbb{E}[g(X)] \leq g(\mathbb{E}[X])$.
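A quick numeric illustration, again using the assumed fair-die PMF from the earlier sketches: for the convex function $g(x) = x^2$, $\mathbb{E}[X^2]$ indeed exceeds $(\mathbb{E}[X])^2$.

```python
# Numeric illustration of Jensen's inequality for the convex g(x) = x^2,
# reusing the assumed fair-die PMF.

pmf = {x: 1 / 6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                 # E[X] = 3.5
mean_of_square = sum(x ** 2 * p for x, p in pmf.items())  # E[X^2] = 91/6

print(mean_of_square, mean ** 2)  # about 15.17 vs 12.25: E[X^2] >= (E[X])^2
```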

Quick Check

If $X$ and $Y$ are random variables (possibly dependent) with $\mathbb{E}[X] = 3$ and $\mathbb{E}[Y] = 5$, what is $\mathbb{E}[2X - Y + 1]$?

1

2

7

Cannot determine without knowing the joint distribution

Why This Matters: Expected Bit Error Rate

In a digital communication system, the bit error rate (BER) is the expected fraction of bits received incorrectly. If we transmit $n$ bits and define $X_i = \mathbf{1}\{\text{bit } i \text{ in error}\}$, then the total number of errors is $X = \sum_{i=1}^n X_i$ and $\mathbb{E}[X] = \sum_{i=1}^n \mathbb{P}(\text{bit } i \text{ in error})$. For i.i.d. errors with probability $P_e$ per bit, $\mathbb{E}[X] = n P_e$. The BER is $\mathbb{E}[X]/n = P_e$; linearity of expectation makes this trivial.
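As a rough sanity check, the sketch below simulates the indicator bookkeeping; the values of $n$ and $P_e$ are illustrative assumptions, not taken from the text.

```python
import random

# Sketch of the BER bookkeeping with assumed parameters n = 1000, Pe = 0.01:
# simulate i.i.d. bit errors and compare the empirical error count to n * Pe.

n, Pe, trials = 1_000, 0.01, 2_000

total_errors = 0
for _ in range(trials):
    # X = sum of error indicators for one transmitted block of n bits
    total_errors += sum(random.random() < Pe for _ in range(n))

print(total_errors / trials)        # empirical E[X], close to n * Pe = 10.0
print(total_errors / trials / n)    # empirical BER, close to Pe = 0.01
```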

Expectation

The probability-weighted average of all possible values of a random variable: $\mathbb{E}[X] = \sum_x x \, P(x)$ for discrete RVs.

Related: Probability Mass Function (PMF)

LOTUS (Law of the Unconscious Statistician)

$\mathbb{E}[g(X)] = \sum_x g(x) \, P(x)$. Allows computing expectations of functions of $X$ directly from the PMF of $X$.

Related: Expectation

Historical Note: Huygens and the Origins of Expected Value

1657

The concept of expected value dates to Christiaan Huygens' 1657 treatise De Ratiociniis in Ludo Aleae (On Reasoning in Games of Chance), the first published work on probability. Huygens defined the "value of a game" as the price a rational player should pay to participate: exactly what we now call the expected payoff. Pierre-Simon Laplace later placed the concept on a firmer mathematical footing and used it extensively in his 1812 Théorie analytique des probabilités.

Key Takeaway

Linearity of expectation, $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$, holds without any assumption of independence. This is the most useful single property in discrete probability: whenever you can decompose a complicated count into a sum of indicators, linearity hands you the answer.