Matching, Occupancy, and Coupon Collection
Balls, Bins, and Protocol Design
Three classic problems — balls into bins, the coupon collector, and the birthday problem — look like recreational puzzles but are in fact foundational models in algorithms, cryptography, and communications protocol design. The balls-into-bins model describes packet collisions in random access channels, hash table probing, and random placement of data in distributed storage. The coupon collector quantifies the expected time until all resources have been sampled at least once, which governs channel sounding protocols and random access convergence. The birthday problem underpins hash collision security and the capacity of slotted ALOHA channels.
All three problems reduce to the same underlying calculation: how many i.i.d.\ uniform draws from are needed before a prescribed coverage condition is met?
Definition: Occupancy Problem (Balls into Bins)
Occupancy Problem (Balls into Bins)
In the occupancy problem, balls are thrown independently and uniformly at random into bins. The key quantities are:
- probability all bins are occupied (at least one ball each)
- probability all balls land in distinct bins (at most one ball per bin)
- and
For equal priors, the probability that all balls land in distinct bins (the no-collision event) is
The occupancy problem is the continuous-time limit of the coupon collector: if we allow multiple balls per bin, the time to cover all bins is the coupon collector time.
Occupancy Problem
The random experiment of throwing balls independently and uniformly into bins, and studying the resulting occupancy distribution. Key special cases: the birthday problem ( people, birthdays) and coupon collection ( draws until all coupons seen).
Related: Coupon Collector Problem, Birthday Problem
Definition: Birthday Problem
Birthday Problem
The birthday problem asks: if people each have a birthday chosen uniformly at random from days, what is the probability that at least two people share a birthday?
The threshold is : once the collision probability exceeds .
For , . The surprising lesson — 23 people suffice for a 50% chance of shared birthday — shows that collision probabilities grow much faster than intuition suggests. This has profound implications for hash function security and random access.
Birthday Problem
The probability that among uniformly random draws from , at least two are equal. Collision probability for . The collision threshold is .
Related: Occupancy Problem, Coupon Collector Problem
Theorem: Birthday Bound
Let items be drawn i.i.d.\ uniformly from a set of size . The probability that all items are distinct satisfies: In particular, for the collision probability exceeds .
The lower bound follows from the inequality for ; the upper bound from . The product then telescopes into an exponential of a sum.
Use for each factor in the product.
The sum ; exponentiate.
Upper bound on $P( ext{no collision})$
Apply to each factor:
Lower bound on $P( ext{no collision})$
For , the inequality holds for . Therefore A sharper bound uses to get .
Key Takeaway
The birthday bound states that among items drawn from a universe of size , a collision is roughly 50% likely. This is the fundamental limit of hash-based collision resistance: any hash function with possible outputs will have a collision after inputs.
Birthday Problem: Collision Probability
The probability of at least one collision among draws from a universe of size . Adjust (hash output size or number of slots) and watch the collision probability grow. The dashed line marks the threshold where collision probability first exceeds 0.5.
Parameters
Definition: Coupon Collector Problem
Coupon Collector Problem
Suppose there are distinct coupons. At each step, one coupon is drawn uniformly at random with replacement. Let be the number of draws until all coupons have been collected at least once. The coupon collector's expected waiting time is where is the -th harmonic number.
The -th new coupon is collected after a geometric waiting time with success probability . The total time is the sum of independent geometric random variables. This decomposition is the key to the proof.
Coupon Collector
A random process where items are drawn uniformly with replacement from a set of size until all items have appeared at least once. Expected waiting time: .
Related: Occupancy Problem, Birthday Problem, Harmonic Number
Theorem: Coupon Collector's Expected Time
In the coupon collector problem with coupon types, the stopping time (number of draws until all types seen) satisfies and the variance is For large , (the Euler–Mascheroni correction), and is concentrated within of its mean with high probability.
After collecting distinct coupons, the probability of drawing a new coupon on the next step is . The waiting time for the -th new coupon is geometric with this success probability, so its expectation is . Summing over all new coupons gives .
Let be the number of draws to collect the -th new coupon after having distinct ones. What is ?
Write where are independent geometric RVs. Apply linearity of expectation.
Decompose into geometric waiting times
Let be the number of draws to get the -th new coupon, given that distinct coupons have already been collected. At this point, there are unseen coupon types. Each draw independently produces a new coupon with probability . Therefore , and .
Independence and linearity
The waiting times are independent because the future draws after collecting the -th new coupon do not depend on the past (each draw is i.i.d.). Therefore and by linearity of expectation:
Variance computation
For a geometric RV with success probability , the variance is . Therefore
Example: Coupon Collector: Numerical Illustration
A WiFi access point probes orthogonal frequency-division multiple access (OFDMA) subcarriers in random order (sampling with replacement) to estimate channel state on all subcarriers. (a) What is the expected number of probes to sample every subcarrier? (b) How many probes are needed to ensure that all 10 subcarriers have been probed with probability at least ?
Expected probes
$
Tail bound via union bound
The probability that coupon has NOT been collected after probes is . By the union bound over coupons: Setting and solving: , so probes.
Interpretation
To guarantee probability of any missed subcarrier, approximately 92 probes are needed — roughly the expected 29 probes. The extra probes protect against the rare event that some subcarrier goes unsampled for a long time.
Coupon Collector: Monte Carlo Simulation
Monte Carlo estimate of the coupon collector distribution. The histogram shows the empirical distribution of over many trials. The dashed line marks the theoretical expectation . Adjust (number of coupons) to see how the distribution widens as grows.
Parameters
Historical Note: History of the Coupon Collector Problem
1950s–presentThe coupon collector problem appeared in early probabilistic combinatorics and was a favorite of William Feller, who treated it extensively in his landmark 1950 textbook An Introduction to Probability Theory and Its Applications. The asymptotic law was known before Feller, but his textbook treatment cemented it as a standard result. The connection to harmonic numbers makes it one of the most elegant intersections of analysis, combinatorics, and probability. Modern applications range from the Lempel-Ziv universal source coding algorithm (where the time to see all -grams is ) to cache invalidation in distributed systems.
Why This Matters: Random Access, ALOHA, and the Birthday Problem
In slotted ALOHA, devices each transmit in the same time slot with probability , where is the total number of slots per frame. Two devices transmitting in the same slot cause a collision. The probability of collision in a given slot given active devices is for — exactly the birthday collision probability for 2 draws from possibilities.
The expected number of collision-free (successful) transmissions per frame is , maximized at with throughput — the classic ALOHA capacity. Increasing beyond triggers the birthday effect and throughput collapses. The coupon collector analysis then tells us how many collision-free rounds are needed for every device to transmit successfully at least once: approximately .
Monte Carlo Coupon Collector Simulation
Complexity: expectedThe inner loop runs for iterations in expectation. For large , the simulation slows but analytical formulas suffice.
Quick Check
A channel sounder must probe frequency subcarriers, sampling randomly with replacement. The expected number of probes to cover all 100 subcarriers is closest to:
100
519
1000
100 \ln 100 \approx 460$
.
Common Mistake: Birthday Problem Intuition Failure
Mistake:
Students often guess that the birthday collision threshold (50% probability of two people sharing a birthday) is around people, since there are 365 possible birthdays and you need to "fill half of them."
Correction:
The threshold is , not . The quadratic growth of collision probability () arises because we are comparing every pair, not just counting distinct values. With people there are pairs, each with probability of collision. The expected number of collisions is , which reaches 1 when .
Three Occupancy Problems: Balls into Bins
| Problem | Question | Key Formula | Wireless Analog |
|---|---|---|---|
| Birthday problem | P(no collision) among draws from | ; threshold | Hash collision; slotted ALOHA collision probability |
| Coupon collector | : draws to see all coupons | Channel sounding: probes to sample all subcarriers; MAC-layer coverage | |
| Occupancy distribution | exactly bins empty after throws | Stirling numbers; | Load balancing; distributed hash table utilization |