Stragglers and the Need for Redundancy
Why a Synchronous Job is Bottlenecked by its Slowest Worker
A synchronous distributed computation cannot move forward until every worker has reported its contribution. The wall-clock latency of the iteration is therefore not the average completion time but the maximum. A handful of slow workers (called stragglers) can dominate the latency even when the rest of the system is fast. In production clusters at Google, the slowest 5% of workers routinely take 5 to 10 times longer than the median, and this gap is the dominant source of tail latency in big-data jobs.
The first half of this section formalizes the latency penalty quantitatively. The second half motivates the central trick of Part II: introduce computational redundancy so that the master only needs to wait for any sufficiently large subset of workers, sidestepping the stragglers entirely.
Definition: Stragglers and Synchronous Latency
Let $T_1, T_2, \ldots, T_N$ be the (random) task-completion times of the $N$ workers in a synchronous distributed computation. The iteration latency is the maximum order statistic $T_{(N)} = \max_{1 \le i \le N} T_i$. A worker is called a straggler if its completion time $T_i$ is significantly larger than the median completion time. The straggler effect is the gap between $\mathbb{E}[T_{(N)}]$ and the median completion time, which grows with $N$ for any non-degenerate service-time distribution.
Theorem: Latency Penalty for Exponential Service Times
If $T_1, \ldots, T_N$ are i.i.d. exponential random variables with rate $\mu$ (mean $1/\mu$), then
$$\mathbb{E}[T_{(N)}] = \frac{H_N}{\mu}, \qquad H_N = \sum_{i=1}^{N} \frac{1}{i},$$
where $H_N$ denotes the $N$-th harmonic number. The latency penalty grows logarithmically with the number of workers: doubling $N$ adds roughly $\ln 2 / \mu \approx 0.69/\mu$ to the expected wall-clock time.
Memorylessness gives a clean recursion: after the first worker finishes (in expected time $\frac{1}{N\mu}$), the remaining $N-1$ workers are again i.i.d. exponential. Summing the expected gaps yields the harmonic number. The point is that even though the average per-worker time $1/\mu$ does not depend on $N$, the worst-case time grows without bound as we add workers.
Order-statistic decomposition
For i.i.d. exponentials with rate $\mu$, the order statistics satisfy $T_{(i)} - T_{(i-1)} \sim \mathrm{Exp}\big((N-i+1)\mu\big)$ for $i = 1, \ldots, N$ (with $T_{(0)} = 0$), and the gaps are mutually independent. This is a classical consequence of the memorylessness of the exponential distribution: once the first $i-1$ workers have finished, the remaining $N-i+1$ workers' future service times are again i.i.d. exponentials.
Sum of expectations
Therefore
$$\mathbb{E}[T_{(N)}] = \sum_{i=1}^{N} \frac{1}{(N-i+1)\mu} = \frac{1}{\mu} \sum_{j=1}^{N} \frac{1}{j} = \frac{H_N}{\mu}.$$
Asymptotic
Since $H_N = \ln N + \gamma + o(1)$, where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant, we have $\mathbb{E}[T_{(N)}] = \frac{\ln N + \gamma + o(1)}{\mu}$, i.e., logarithmic in $N$.
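As a sanity check (not part of the text's development), here is a minimal Monte Carlo sketch of the theorem; the parameter choices $\mu = 1$, $N = 50$, and the trial count are illustrative assumptions:

```python
# Monte Carlo check of E[T_(N)] = H_N / mu for i.i.d. exponential service times.
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0          # service rate; mean per-worker time is 1/mu
N = 50            # number of workers
trials = 100_000  # Monte Carlo iterations

# Simulate the synchronous latency: the maximum of N exponential task times.
samples = rng.exponential(scale=1 / mu, size=(trials, N)).max(axis=1)

harmonic = sum(1 / i for i in range(1, N + 1))  # H_N
print(f"empirical   E[T_(N)] = {samples.mean():.4f}")
print(f"theoretical H_N / mu = {harmonic / mu:.4f}")  # ~4.4992 for N = 50
```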
Example: How Bad is a Single Straggler?
A synchronous gradient-aggregation step uses $N = 100$ workers. Suppose each worker's task time is $T_i$, where the $T_i$ are i.i.d. exponential with mean $1/\mu = 1$ (units: arbitrary). What is $\mathbb{E}[T_{(N)}]$ relative to $\mathbb{E}[T_i]$?
Per-worker mean
$\mathbb{E}[T_i] = 1/\mu = 1$.
Order-statistic mean
$\mathbb{E}[T_{(100)}] = H_{100}/\mu = H_{100} \approx 5.19$.
Inflation factor
$\mathbb{E}[T_{(100)}] / \mathbb{E}[T_i] = H_{100} \approx 5.19$. The slowest worker takes about $5\times$ the average per-worker time. Just $100$ extra workers in addition to the $100$ already present slow the iteration by another $H_{200} - H_{100} \approx \ln 2 \approx 0.69$; this is the "diminishing returns" of adding parallelism, and the reason coded redundancy will be so attractive in Chapters 5-6.
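These numbers are easy to reproduce in a few lines (a sketch assuming the worked values $N = 100$, $\mu = 1$; the `harmonic` helper is ours):

```python
# The example's arithmetic: harmonic inflation for N = 100 and for doubling to 200.
def harmonic(n: int) -> float:
    return sum(1 / i for i in range(1, n + 1))

H100, H200 = harmonic(100), harmonic(200)
print(f"H_100 = {H100:.3f}")                  # ~5.187: slowest of 100 vs. the mean
print(f"H_200 - H_100 = {H200 - H100:.3f}")   # ~0.691 ~ ln 2: cost of doubling N
```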
Straggler Latency vs. Number of Workers
Compare the expected wait for the slowest worker (the synchronous latency $\mathbb{E}[T_{(N)}]$) against the expected wait for any $K$ out of $N$ responses (the redundant scheme of Section 1.2). The horizontal axis is the number of workers $N$, with two curves: $K = N$ (pure synchronous) and $K = 0.8N$ (tolerate the slowest 20%). Increasing the service-time spread shows how stragglers amplify when the per-worker variance grows.
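In place of the interactive figure, the following sketch computes the same two curves from this section's exact formulas; the choices $\mu = 1$, $N \le 500$, and matplotlib for display are assumptions:

```python
# Synchronous latency H_N/mu versus K-of-N latency (H_N - H_{N-K})/mu, K = 0.8N.
import math
import numpy as np
import matplotlib.pyplot as plt

mu = 1.0
N_max = 500
# H[n] = n-th harmonic number, with H[0] = 0.
H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, N_max + 1))))

Ns = np.arange(5, N_max + 1)
sync = H[Ns] / mu                               # K = N: wait for all workers
Ks = np.ceil(0.8 * Ns).astype(int)              # K = 0.8N, rounded up
kofn = (H[Ns] - H[Ns - Ks]) / mu                # wait for the fastest K of N

plt.plot(Ns, sync, label="$K=N$ (wait for all)")
plt.plot(Ns, kofn, label="$K=0.8N$ (drop slowest 20%)")
plt.axhline(math.log(1 / 0.2) / mu, ls="--", color="gray",
            label=r"asymptote $\ln(5)/\mu \approx 1.61$")
plt.xlabel("number of workers $N$")
plt.ylabel(r"expected latency (units of $1/\mu$)")
plt.legend()
plt.show()
```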
Stragglers in a Synchronous Iteration
Definition: Recovery Threshold
A redundant distributed-computation scheme has recovery threshold $K$ if the master can reconstruct the desired output from the responses of any $K$ workers (regardless of which finish first). In a synchronous setting with order statistics $T_{(1)} \le \cdots \le T_{(N)}$, the iteration latency drops from $T_{(N)}$ to $T_{(K)}$.
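To make the definition concrete, here is a toy $K$-of-$N$ scheme over the reals using a Vandermonde encoding matrix. This is only an illustration of the recovery-threshold idea, not one of the coded-computing schemes of Chapters 5-6, and the names and parameters (`G`, `coded`, $K = 4$, $N = 7$) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 4, 7                                   # hypothetical scheme parameters

data = rng.standard_normal(K)                 # desired output, split into K chunks
nodes = np.arange(1, N + 1, dtype=float)      # one distinct evaluation point per worker
G = np.vander(nodes, K, increasing=True)      # N x K Vandermonde encoding matrix
coded = G @ data                              # coded chunk held by each of the N workers

# Only an arbitrary subset of K workers responds; any K rows of G are invertible
# because the evaluation points are distinct.
fastest = rng.choice(N, size=K, replace=False)
decoded = np.linalg.solve(G[fastest], coded[fastest])

assert np.allclose(decoded, data)             # reconstructed from any K responses
print("recovered output from workers:", sorted(fastest.tolist()))
```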
Theorem: Latency Gain from Redundancy
For i.i.d. exponential service times with rate $\mu$ and recovery threshold $K$,
$$\mathbb{E}[T_{(K)}] = \frac{H_N - H_{N-K}}{\mu}.$$
In particular, $K = N$ recovers the synchronous penalty $H_N/\mu$, while $K = \alpha N$ for fixed $\alpha \in (0, 1)$ gives $\mathbb{E}[T_{(K)}] \to \frac{1}{\mu} \ln \frac{1}{1-\alpha}$ as $N \to \infty$: a constant, not a logarithm.
Tolerating even a small fraction of stragglers, say $\alpha = 0.9$ (discarding the slowest 10%), converts a logarithmically growing tail into a finite asymptote. This is the central operational reason coded computing exists. Of course nothing is free: the price is that workers must store and compute on more than just their share of the data, which is exactly the storage/computation cost we will quantify in Chapters 5 and 6.
Sum gaps from the first to the $K$-th finisher
Using the same memoryless decomposition as in the previous theorem,
$$\mathbb{E}[T_{(K)}] = \sum_{i=1}^{K} \frac{1}{(N-i+1)\mu} = \frac{1}{\mu} \sum_{j=N-K+1}^{N} \frac{1}{j} = \frac{H_N - H_{N-K}}{\mu}.$$
Asymptotic with $K = \alpha N$
For $K = \alpha N$ and large $N$,
$$H_N - H_{N-K} = \ln \frac{N}{N - \alpha N} + o(1) = \ln \frac{1}{1-\alpha} + o(1).$$
Hence $\mathbb{E}[T_{(K)}] \to \frac{1}{\mu} \ln \frac{1}{1-\alpha}$ as $N \to \infty$: a bounded asymptote that depends only on the tolerated fraction $\alpha$, not on $N$.
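A quick numeric check of the asymptote (a sketch with $\mu = 1$ and an assumed grid of $\alpha$ values):

```python
# Finite-N value of H_N - H_{N-K} with K = alpha*N versus the limit ln(1/(1-alpha)).
import math

def harmonic(n: int) -> float:
    return sum(1 / i for i in range(1, n + 1))

N = 10_000
for alpha in (0.5, 0.8, 0.9):
    K = int(alpha * N)
    finite = harmonic(N) - harmonic(N - K)    # exact latency in units of 1/mu
    limit = math.log(1 / (1 - alpha))
    print(f"alpha = {alpha}:  N = 10^4 value {finite:.4f}   limit {limit:.4f}")
```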
Key Takeaway
Tolerating a constant fraction of stragglers turns a logarithmic latency penalty into a constant. This is the central operational payoff of coded computing; the price is paid in storage/computation redundancy at each worker, which we quantify in Chapters 5-6. The information-theoretic question becomes: how little redundancy suffices to achieve a target recovery threshold?
Common Mistake: Doesn't Asynchronous SGD Solve Stragglers?
Mistake:
Use asynchronous parameter updates: the server applies any worker's gradient as soon as it arrives, so stragglers never block.
Correction:
Asynchrony does eliminate the synchronous wait, but the cost is that the parameter server applies stale gradients computed at older model parameters. In non-convex landscapes (deep networks) staleness causes convergence-rate degradation that grows with worker delay; in some pathological cases it causes outright divergence. Coded computing aims for the best of both worlds: synchronous semantics, with redundancy used to bypass stragglers without staleness. The trade-off between asynchrony, coded redundancy, and convergence rate is itself an open research area (Chapter 18).
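To see the staleness cost quantitatively, here is a toy event-driven simulation; the exponential compute-time model, the `mean_staleness` helper, and all parameters are illustrative assumptions, not from the text. With identical workers, the mean staleness comes out near $N - 1$, i.e., it grows linearly with the number of workers:

```python
# Staleness of an async update = number of other updates applied between the
# moment a worker reads the parameters and the moment it pushes its gradient.
import heapq
import numpy as np

rng = np.random.default_rng(2)

def mean_staleness(num_workers: int, updates: int = 50_000, mu: float = 1.0) -> float:
    version = 0                                   # global parameter version
    # Each worker starts at time 0, computing against version 0.
    events = [(rng.exponential(1 / mu), w, 0) for w in range(num_workers)]
    heapq.heapify(events)
    staleness = []
    for _ in range(updates):
        t, w, read_version = heapq.heappop(events)
        staleness.append(version - read_version)  # updates applied since the read
        version += 1                              # server applies this (stale) gradient
        # The worker re-reads the current parameters and starts the next gradient.
        heapq.heappush(events, (t + rng.exponential(1 / mu), w, version))
    return float(np.mean(staleness))

for n in (4, 16, 64):
    print(f"{n:>3} workers: mean staleness ~ {mean_staleness(n):.2f}")  # close to n - 1
```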
Three Ways to Cope with Stragglers
| Strategy | How it works | Latency | Hidden cost |
|---|---|---|---|
| Plain synchronous | Wait for all $N$ workers | $\mathbb{E}[T_{(N)}] = H_N/\mu$ (logarithmic in $N$) | None, but bottlenecked by the tail |
| Asynchronous SGD | Apply gradients as they arrive | Per-update: roughly $1/\mu$, no waiting | Stale gradients hurt convergence |
| Coded redundancy ($K$ of $N$) | Master decodes from any $K$ responses | $\mathbb{E}[T_{(K)}] = (H_N - H_{N-K})/\mu$ (bounded as $N \to \infty$ for $K = \alpha N$) | Per-worker storage / compute redundancy |
Tail Latency in Production: The 99.9th Percentile is What Matters
Dean and Barroso's "Tail at Scale" paper measured request latencies in Google's web-serving fleet and showed that the 99.9th percentile can exceed the median by an order of magnitude, even on identical hardware running an identical workload. The mechanisms identified (background activities, garbage collection, queueing, network congestion, hardware aging, thermal throttling) are exactly the kinds of perturbations that motivate coded computing. From a system design standpoint, the right object to optimize is rarely the average latency; it is the high percentile of the latency distribution.
- Tail-latency phenomena scale with the number of components a request must touch (the 'fan-out' problem)
- Hedging requests, partial-result aggregation, and coded redundancy all attack the same root cause from different angles
Recovery Threshold
The smallest number of worker responses sufficient to reconstruct the desired output. In coded matrix multiplication (Chapter 5), the recovery threshold of the polynomial code with $N$ workers is exactly $mn$, where $m$ and $n$ are the numbers of blocks into which the two factor matrices are partitioned, and the optimal trade-off between $K$, $N$, and the per-worker storage is the central object of study.
Quick Check
With i.i.d. exponential service times and recovery threshold $K = \alpha N$ for a fixed $\alpha \in (0, 1)$, what happens to the expected iteration latency $\mathbb{E}[T_{(K)}]$ as $N \to \infty$?
It grows like $\ln N / \mu$
It converges to $\frac{1}{\mu} \ln \frac{1}{1-\alpha}$
It diverges to infinity
It equals $H_N / \mu$
From the asymptotic in the Theorem (Latency Gain from Redundancy): $\mathbb{E}[T_{(K)}] = (H_N - H_{N-K})/\mu$ with $K = \alpha N$, giving $\mathbb{E}[T_{(K)}] \to \frac{1}{\mu} \ln \frac{1}{1-\alpha}$ as $N \to \infty$.