Matching, Occupancy, and Coupon Collection

Balls, Bins, and Protocol Design

Three classic problems — balls into bins, the coupon collector, and the birthday problem — look like recreational puzzles but are in fact foundational models in algorithms, cryptography, and communications protocol design. The balls-into-bins model describes packet collisions in random access channels, hash table probing, and random placement of data in distributed storage. The coupon collector quantifies the expected time until all resources have been sampled at least once, which governs channel sounding protocols and random access convergence. The birthday problem underpins hash collision security and the capacity of slotted ALOHA channels.

All three problems reduce to the same underlying calculation: how many i.i.d.\ uniform draws from {1,,n}\{1, \ldots, n\} are needed before a prescribed coverage condition is met?

Definition:

Occupancy Problem (Balls into Bins)

In the occupancy problem, mm balls are thrown independently and uniformly at random into nn bins. The key quantities are:

  • P(no empty bin)=P(\text{no empty bin}) = probability all nn bins are occupied (at least one ball each)
  • P(no collision)=P(\text{no collision}) = probability all mm balls land in distinct bins (at most one ball per bin)
  • E[empty bins]\mathbb{E}[\text{empty bins}] and E[collisions]\mathbb{E}[\text{collisions}]

For equal priors, the probability that all mm balls land in distinct bins (the no-collision event) is P(no collision)=n!(nm)!nm=k=0m1 ⁣(1kn).P(\text{no collision}) = \frac{n!}{(n-m)!\,n^m} = \prod_{k=0}^{m-1}\!\Bigl(1 - \frac{k}{n}\Bigr).

The occupancy problem is the continuous-time limit of the coupon collector: if we allow multiple balls per bin, the time to cover all bins is the coupon collector time.

Occupancy Problem

The random experiment of throwing mm balls independently and uniformly into nn bins, and studying the resulting occupancy distribution. Key special cases: the birthday problem (mm people, nn birthdays) and coupon collection (mm draws until all nn coupons seen).

Related: Coupon Collector Problem, Birthday Problem

Definition:

Birthday Problem

The birthday problem asks: if mm people each have a birthday chosen uniformly at random from n=365n = 365 days, what is the probability that at least two people share a birthday?

P(collision)=1k=0m1 ⁣(1kn)    1em(m1)/(2n).P(\text{collision}) = 1 - \prod_{k=0}^{m-1}\!\Bigl(1 - \frac{k}{n}\Bigr) \;\approx\; 1 - e^{-m(m-1)/(2n)}.

The threshold is m2nln21.177nm^* \approx \sqrt{2n\ln 2} \approx 1.177\sqrt{n}: once m>mm > m^* the collision probability exceeds 1/21/2.

For n=365n = 365, m23m^* \approx 23. The surprising lesson — 23 people suffice for a 50% chance of shared birthday — shows that collision probabilities grow much faster than intuition suggests. This has profound implications for hash function security and random access.

Birthday Problem

The probability that among mm uniformly random draws from {1,,n}\{1,\ldots,n\}, at least two are equal. Collision probability 1em2/(2n)\approx 1 - e^{-m^2/(2n)} for mnm \ll n. The collision threshold is m1.177nm^* \approx 1.177\sqrt{n}.

Related: Occupancy Problem, Coupon Collector Problem

Theorem: Birthday Bound

Let mm items be drawn i.i.d.\ uniformly from a set of size nn. The probability that all items are distinct satisfies: em2/(2n)    P(no collision)=k=1m1 ⁣(1kn)    em(m1)/(2n).e^{-m^2/(2n)} \;\leq\; P(\text{no collision}) = \prod_{k=1}^{m-1}\!\Bigl(1 - \frac{k}{n}\Bigr) \;\leq\; e^{-m(m-1)/(2n)}. In particular, for m=2nln(1/δ)m = \lfloor\sqrt{2n\ln(1/\delta)}\rfloor the collision probability exceeds 1δ1 - \delta.

The lower bound follows from the inequality 1xexx21-x \geq e^{-x-x^2} for x[0,1/2]x \in [0,1/2]; the upper bound from 1xex1-x \leq e^{-x}. The product then telescopes into an exponential of a sum.

Key Takeaway

The birthday bound states that among mnm \approx \sqrt{n} items drawn from a universe of size nn, a collision is roughly 50% likely. This is the fundamental limit of hash-based collision resistance: any hash function with nn possible outputs will have a collision after n\approx \sqrt{n} inputs.

Birthday Problem: Collision Probability

The probability of at least one collision among mm draws from a universe of size nn. Adjust nn (hash output size or number of slots) and watch the collision probability grow. The dashed line marks the threshold mm^* where collision probability first exceeds 0.5.

Parameters
365
60

Definition:

Coupon Collector Problem

Suppose there are nn distinct coupons. At each step, one coupon is drawn uniformly at random with replacement. Let TnT_n be the number of draws until all nn coupons have been collected at least once. The coupon collector's expected waiting time is E[Tn]=nHn=nk=1n1k    nlnn(n),\mathbb{E}[T_n] = n H_n = n \sum_{k=1}^n \frac{1}{k} \;\sim\; n \ln n \quad (n \to \infty), where HnH_n is the nn-th harmonic number.

The kk-th new coupon is collected after a geometric waiting time with success probability (nk+1)/n(n-k+1)/n. The total time is the sum of nn independent geometric random variables. This decomposition is the key to the proof.

Coupon Collector

A random process where items are drawn uniformly with replacement from a set of size nn until all nn items have appeared at least once. Expected waiting time: E[Tn]=nHnnlnn\mathbb{E}[T_n] = n H_n \sim n \ln n.

Related: Occupancy Problem, Birthday Problem, Harmonic Number

Theorem: Coupon Collector's Expected Time

In the coupon collector problem with nn coupon types, the stopping time TnT_n (number of draws until all nn types seen) satisfies E[Tn]=nHn=nk=1n1k,\mathbb{E}[T_n] = n H_n = n \sum_{k=1}^n \frac{1}{k}, and the variance is Var[Tn]=n2k=1n1k2    n2π26.\text{Var}[T_n] = n^2 \sum_{k=1}^n \frac{1}{k^2} \;\leq\; n^2 \cdot \frac{\pi^2}{6}. For large nn, E[Tn]nlnn+0.5772n\mathbb{E}[T_n] \approx n\ln n + 0.5772\,n (the Euler–Mascheroni correction), and TnT_n is concentrated within O(n)O(n) of its mean with high probability.

After collecting k1k-1 distinct coupons, the probability of drawing a new coupon on the next step is (nk+1)/n(n-k+1)/n. The waiting time for the kk-th new coupon is geometric with this success probability, so its expectation is n/(nk+1)n/(n-k+1). Summing over all nn new coupons gives n(1/n+1/(n1)++1/1)=nHnn(1/n + 1/(n-1) + \cdots + 1/1) = nH_n.

Example: Coupon Collector: Numerical Illustration

A WiFi access point probes n=10n = 10 orthogonal frequency-division multiple access (OFDMA) subcarriers in random order (sampling with replacement) to estimate channel state on all subcarriers. (a) What is the expected number of probes to sample every subcarrier? (b) How many probes mm are needed to ensure that all 10 subcarriers have been probed with probability at least 11031 - 10^{-3}?

Coupon Collector: Monte Carlo Simulation

Monte Carlo estimate of the coupon collector distribution. The histogram shows the empirical distribution of TnT_n over many trials. The dashed line marks the theoretical expectation nHnnH_n. Adjust nn (number of coupons) to see how the distribution widens as nn grows.

Parameters
20
2000

Historical Note: History of the Coupon Collector Problem

1950s–present

The coupon collector problem appeared in early probabilistic combinatorics and was a favorite of William Feller, who treated it extensively in his landmark 1950 textbook An Introduction to Probability Theory and Its Applications. The asymptotic nlnnn \ln n law was known before Feller, but his textbook treatment cemented it as a standard result. The connection to harmonic numbers makes it one of the most elegant intersections of analysis, combinatorics, and probability. Modern applications range from the Lempel-Ziv universal source coding algorithm (where the time to see all nn-grams is Θ(nlogn)\Theta(n \log n)) to cache invalidation in distributed systems.

Why This Matters: Random Access, ALOHA, and the Birthday Problem

In slotted ALOHA, mm devices each transmit in the same time slot with probability 1/n1/n, where nn is the total number of slots per frame. Two devices transmitting in the same slot cause a collision. The probability of collision in a given slot given mm active devices is 1(11/n)m1(m1)/n1 - (1-1/n)^{m-1} \approx (m-1)/n for mnm \ll n — exactly the birthday collision probability for 2 draws from nn possibilities.

The expected number of collision-free (successful) transmissions per frame is m(11/n)m1m(1-1/n)^{m-1}, maximized at m=nm = n with throughput 1/e1/e — the classic ALOHA capacity. Increasing mm beyond nn triggers the birthday effect and throughput collapses. The coupon collector analysis then tells us how many collision-free rounds are needed for every device to transmit successfully at least once: approximately nlnnn \ln n.

Monte Carlo Coupon Collector Simulation

Complexity: O(Ntrialsnlogn)O(N_\text{trials} \cdot n \log n) expected
Input: nn (coupons), NtrialsN_\text{trials} (number of trials)
Output: Empirical distribution of TnT_n
1. Initialize stopping_times[]\text{stopping\_times} \leftarrow []
2. for trial =1,,Ntrials= 1, \ldots, N_\text{trials} do
3. collected\text{collected} \leftarrow \emptyset (empty set of seen coupons)
4. t0t \leftarrow 0
5. while collected<n|\text{collected}| < n do
6. cUniform{1,,n}c \leftarrow \text{Uniform}\{1, \ldots, n\}
7. collectedcollected{c}\text{collected} \leftarrow \text{collected} \cup \{c\}
8. tt+1t \leftarrow t + 1
9. end while
10. Append tt to stopping_times\text{stopping\_times}
11. end for
12. return stopping_times\text{stopping\_times}

The inner loop runs for E[Tn]=nHn\mathbb{E}[T_n] = nH_n iterations in expectation. For large nn, the simulation slows but analytical formulas suffice.

Quick Check

A channel sounder must probe n=100n = 100 frequency subcarriers, sampling randomly with replacement. The expected number of probes to cover all 100 subcarriers is closest to:

100

519

1000

100 \ln 100 \approx 460$

Common Mistake: Birthday Problem Intuition Failure

Mistake:

Students often guess that the birthday collision threshold (50% probability of two people sharing a birthday) is around n/2182n/2 \approx 182 people, since there are 365 possible birthdays and you need to "fill half of them."

Correction:

The threshold is m2365ln223m^* \approx \sqrt{2 \cdot 365 \cdot \ln 2} \approx 23, not 365/2365/2. The quadratic growth of collision probability (m2/(2n)\sim m^2/(2n)) arises because we are comparing every pair, not just counting distinct values. With mm people there are (m2)m2/2\binom{m}{2} \approx m^2/2 pairs, each with probability 1/n1/n of collision. The expected number of collisions is m2/(2n)m^2/(2n), which reaches 1 when m2nm \approx \sqrt{2n}.

Three Occupancy Problems: Balls into Bins

ProblemQuestionKey FormulaWireless Analog
Birthday problemP(no collision) among mm draws from {1,,n}\{1,\ldots,n\}em2/(2n)\approx e^{-m^2/(2n)}; threshold m1.18nm^* \approx 1.18\sqrt{n}Hash collision; slotted ALOHA collision probability
Coupon collectorE[Tn]\mathbb{E}[T_n]: draws to see all nn couponsnHnnlnnnH_n \approx n\ln nChannel sounding: probes to sample all subcarriers; MAC-layer coverage
Occupancy distributionP(P(exactly kk bins empty)) after mm throwsStirling numbers; k(nk)(1)k(1k/n)m\sum_k \binom{n}{k}(-1)^k (1-k/n)^mLoad balancing; distributed hash table utilization