Expectation

What Is the Average?

Knowing the full PMF of a random variable is ideal, but we often need a single number that summarizes its "typical" value. The expectation does exactly this: it is the probability-weighted average of all possible values. The expectation is the single most important summary of a random variable, not because it tells the whole story, but because it enjoys a remarkable property (linearity) that makes it computable even when the full distribution is out of reach.

Definition:

Expectation of a Discrete Random Variable

The expectation (or mean) of a discrete random variable $X$ with PMF $P$ and support $\mathcal{X}$ is

$$\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \, P(x),$$

provided the sum converges absolutely: $\sum_{x} |x| \, P(x) < \infty$.

If the absolute convergence condition fails, we say the expectation does not exist. The Cauchy distribution is the classical example in the continuous case; among discrete distributions, a random variable with PMF $p(k) \propto 1/k^3$ on the positive integers has finite mean, but $p(k) \propto 1/k^2$ does not, since $\sum_k k \cdot 1/k^2 = \sum_k 1/k$ diverges.
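The defining sum is easy to evaluate mechanically. Below is a minimal sketch in Python, using a fair six-sided die as an assumed example (not from the text), that computes $\mathbb{E}[X]$ directly from a PMF stored as a dictionary.

```python
# A minimal sketch (assumed example): evaluating the defining sum for a
# fair six-sided die with P(x) = 1/6 for x = 1, ..., 6.

pmf = {x: 1 / 6 for x in range(1, 7)}

# E[X] = sum over the support of x * P(x)
expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 3.5

# With a finite support the absolute-convergence condition is automatic:
# sum |x| P(x) is a finite sum.
```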


Theorem: Linearity of Expectation

For any random variables $X_1, \ldots, X_n$ (not necessarily independent) and constants $a_1, \ldots, a_n, b \in \mathbb{R}$:

$$\mathbb{E}\!\left[\sum_{i=1}^n a_i X_i + b\right] = \sum_{i=1}^n a_i \, \mathbb{E}[X_i] + b.$$

Linearity holds unconditionally; it does not require independence. This is arguably the single most useful property in all of probability. It allows us to compute the expected number of successes in $n$ trials without knowing the joint distribution, simply by summing the individual success probabilities.
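To see that independence really plays no role, here is a hedged simulation sketch; the card-drawing setup is an assumed example, not taken from the text. The indicators for "card $i$ is red" are dependent (the draws are without replacement), yet their expectations still simply add.

```python
import random

# Linearity with *dependent* indicators: draw k cards without replacement
# and count reds. E[number of reds] = sum of E[indicator_i] = k * (26 / 52).

def average_red_count(k=5, trials=200_000):
    deck = ["red"] * 26 + ["black"] * 26
    total = 0
    for _ in range(trials):
        hand = random.sample(deck, k)
        total += sum(card == "red" for card in hand)
    return total / trials

print(average_red_count())  # empirically close to 2.5
print(5 * 26 / 52)          # 2.5, from linearity alone
```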


Example: Expected Number of Fixed Points (Matching Problem)

Find $\mathbb{E}[X]$ where $X$ is the number of fixed points in a random permutation of $\{1, \ldots, n\}$.
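One standard route (sketched here as a hint, since no worked solution appears above) is to write $X = \sum_{i=1}^n \mathbf{1}\{\pi(i) = i\}$ and apply linearity: each indicator has expectation $1/n$, so $\mathbb{E}[X] = n \cdot \tfrac{1}{n} = 1$ for every $n$. The simulation below is an illustrative sanity check of that answer.

```python
import random

# Illustrative check of the matching problem: average number of fixed points
# of a uniformly random permutation of {0, ..., n-1}. Linearity predicts 1.

def average_fixed_points(n=10, trials=100_000):
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        total += sum(1 for i, p in enumerate(perm) if i == p)
    return total / trials

print(average_fixed_points())  # close to 1.0, independent of n
```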

Theorem: LOTUS (Law of the Unconscious Statistician)

If $X$ is a discrete random variable with PMF $P$ and $g : \mathbb{R} \to \mathbb{R}$ is any function, then

$$\mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x) \, P(x).$$

In particular, we do not need to first find the PMF of $Y = g(X)$.

The name is tongue-in-cheek: the formula is so natural that beginners use it "unconsciously" before proving it. The point is that to compute $\mathbb{E}[g(X)]$, it suffices to know the distribution of $X$, not of $g(X)$.
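The following sketch contrasts the two routes for an assumed example (a fair die and $g(x) = (x-3)^2$, neither from the text): summing $g(x)\,P(x)$ directly via LOTUS versus first constructing the PMF of $Y = g(X)$.

```python
from collections import defaultdict

# Two routes to E[g(X)] for an assumed example: fair die, g(x) = (x - 3)^2.

pmf_x = {x: 1 / 6 for x in range(1, 7)}

def g(x):
    return (x - 3) ** 2

# Route 1 (LOTUS): sum g(x) P(x) directly over the support of X.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Route 2: first build the PMF of Y = g(X), then apply the basic definition.
pmf_y = defaultdict(float)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p  # x-values that map to the same y pool their mass
direct = sum(y * p for y, p in pmf_y.items())

print(lotus, direct)  # equal (about 3.1667); LOTUS skips the middle step
```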


Example: Computing $\mathbb{E}[X^2]$ via LOTUS

Let $X$ be a Bernoulli($p$) random variable. Compute $\mathbb{E}[X^2]$ using LOTUS.
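For reference, the computation is a one-line application of LOTUS: since $X$ takes only the values $0$ and $1$,

$$\mathbb{E}[X^2] = 0^2 \cdot (1 - p) + 1^2 \cdot p = p,$$

whereas $(\mathbb{E}[X])^2 = p^2$. The two agree only when $p \in \{0, 1\}$, which leads directly into the warning below.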

Common Mistake: $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$ in General

Mistake:

Assuming that $\mathbb{E}[X^2] = (\mathbb{E}[X])^2$, or more generally that the expectation "passes through" nonlinear functions.

Correction:

Expectation passes through only affine (linear-plus-constant) functions of $X$; that is exactly what linearity says. For a nonlinear $g$, Jensen's inequality tells us the direction of the inequality: if $g$ is convex, $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$; if $g$ is concave, $\mathbb{E}[g(X)] \leq g(\mathbb{E}[X])$.
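A quick numeric illustration, again using the assumed fair-die PMF from the earlier sketches: for the convex function $g(x) = x^2$, $\mathbb{E}[X^2]$ indeed exceeds $(\mathbb{E}[X])^2$.

```python
# Numeric illustration of Jensen's inequality for the convex g(x) = x^2,
# reusing the assumed fair-die PMF.

pmf = {x: 1 / 6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                 # E[X] = 3.5
mean_of_square = sum(x ** 2 * p for x, p in pmf.items())  # E[X^2] = 91/6

print(mean_of_square, mean ** 2)  # about 15.17 vs 12.25: E[X^2] >= (E[X])^2
```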

Quick Check

If $X$ and $Y$ are random variables (possibly dependent) with $\mathbb{E}[X] = 3$ and $\mathbb{E}[Y] = 5$, what is $\mathbb{E}[2X - Y + 1]$?

1

2

7

Cannot determine without knowing the joint distribution

Why This Matters: Expected Bit Error Rate

In a digital communication system, the bit error rate (BER) is the expected fraction of bits received incorrectly. If we transmit $n$ bits and define $X_i = \mathbf{1}\{\text{bit } i \text{ in error}\}$, then the total number of errors is $X = \sum_{i=1}^n X_i$ and $\mathbb{E}[X] = \sum_{i=1}^n \mathbb{P}(\text{bit } i \text{ in error})$. For i.i.d. errors with probability $P_e$ per bit, $\mathbb{E}[X] = n P_e$. The BER is $\mathbb{E}[X]/n = P_e$; linearity of expectation makes this trivial.
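As a rough sanity check, the sketch below simulates the indicator bookkeeping; the values of $n$ and $P_e$ are illustrative assumptions, not taken from the text.

```python
import random

# Sketch of the BER bookkeeping with assumed parameters n = 1000, Pe = 0.01:
# simulate i.i.d. bit errors and compare the empirical error count to n * Pe.

n, Pe, trials = 1_000, 0.01, 2_000

total_errors = 0
for _ in range(trials):
    # X = sum of error indicators for one transmitted block of n bits
    total_errors += sum(random.random() < Pe for _ in range(n))

print(total_errors / trials)        # empirical E[X], close to n * Pe = 10.0
print(total_errors / trials / n)    # empirical BER, close to Pe = 0.01
```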

Expectation

The probability-weighted average of all possible values of a random variable: $\mathbb{E}[X] = \sum_x x \, P(x)$ for discrete RVs.

Related: Probability Mass Function (PMF)

LOTUS (Law of the Unconscious Statistician)

$\mathbb{E}[g(X)] = \sum_x g(x) \, P(x)$. Allows computing expectations of functions of $X$ directly from the PMF of $X$.

Related: Expectation

Historical Note: Huygens and the Origins of Expected Value

1657

The concept of expected value dates to Christiaan Huygens' 1657 treatise De Ratiociniis in Ludo Aleae (On Reasoning in Games of Chance), the first published work on probability. Huygens defined the "value of a game" as the price a rational player should pay to participate: exactly what we now call the expected payoff. Pierre-Simon Laplace later placed the concept on a firmer mathematical footing and used it extensively in his 1812 Théorie analytique des probabilités.

Key Takeaway

Linearity of expectation, $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$, holds without any assumption of independence. This is the most useful single property in discrete probability: whenever you can decompose a complicated count into a sum of indicators, linearity hands you the answer.