Stragglers and the Need for Redundancy

Why a Synchronous Job is Bottlenecked by its Slowest Worker

A synchronous distributed computation cannot move forward until every worker has reported its contribution. The wall-clock latency of the iteration is therefore not the average completion time β€” it is the maximum. A handful of slow workers (called stragglers) can dominate the latency even when the rest of the system is fast. In production clusters at Google, the slowest 5% of workers routinely take 5–10 times longer than the median, and this gap is the dominant source of tail latency in big-data jobs.

The first half of this section formalizes the latency penalty quantitatively. The second half motivates the central trick of Part II: introduce computational redundancy so that the master only needs to wait for any sufficiently large subset of workers, sidestepping the stragglers entirely.

Definition: Stragglers and Synchronous Latency

Let $T_1, T_2, \ldots, T_N$ be the (random) task-completion times of $N$ workers in a synchronous distributed computation. The iteration latency is the order statistic
$$T_{(N)} \triangleq \max_{i = 1, \ldots, N} T_i.$$
A worker $i$ is called a straggler if $T_i$ is significantly larger than the median completion time. The straggler effect is the gap between $\mathbb{E}[T_{(N)}]$ and $\mathbb{E}[T_i]$, which grows with $N$ for any non-degenerate service-time distribution.

Theorem: Latency Penalty for Exponential Service Times

If $T_1, \ldots, T_N$ are i.i.d. exponential random variables with rate $\lambda$ (mean $1/\lambda$), then
$$\mathbb{E}[T_{(N)}] = \frac{H_N}{\lambda} = \frac{1}{\lambda}\sum_{i=1}^{N} \frac{1}{i} \sim \frac{\ln N}{\lambda} \quad \text{as } N \to \infty,$$
where $H_N$ denotes the $N$-th harmonic number. The latency penalty grows logarithmically with the number of workers: doubling $N$ adds roughly $\ln 2 / \lambda$ to the expected wall-clock time.

Memorylessness gives a clean recursion: the first of $N$ i.i.d. exponentials finishes in expected time $1/(N\lambda)$, after which the remaining $N-1$ residual times are again i.i.d. exponential, so the next gap has mean $1/((N-1)\lambda)$, and so on. Summing the expected gaps gives $\frac{1}{\lambda}\big(\frac{1}{N} + \frac{1}{N-1} + \cdots + 1\big) = H_N/\lambda$. The point is that even though the average per-worker time $\mathbb{E}[T_i] = 1/\lambda$ does not depend on $N$, the worst-case time grows without bound as we add workers.
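The harmonic-number formula is easy to sanity-check by simulation. Below is a minimal sketch (ours, assuming NumPy is available; all names are illustrative) that compares the empirical mean of the maximum against $H_N/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

def harmonic(n: int) -> float:
    """N-th harmonic number H_N = sum_{i=1}^{N} 1/i."""
    return sum(1.0 / i for i in range(1, n + 1))

lam = 1.0        # service rate lambda
trials = 5_000   # Monte Carlo repetitions

for n in [10, 100, 1000]:
    # Each row is one iteration: n i.i.d. Exp(lambda) task-completion times.
    samples = rng.exponential(scale=1.0 / lam, size=(trials, n))
    empirical = samples.max(axis=1).mean()   # estimate of E[T_(N)]
    closed_form = harmonic(n) / lam          # H_N / lambda
    print(f"N={n:5d}  simulated={empirical:.3f}  H_N/lambda={closed_form:.3f}")
```

The two columns agree to within Monte Carlo noise, and both grow by roughly $\ln 10 / \lambda \approx 2.3/\lambda$ per tenfold increase in $N$.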

Example: How Bad is a Single Straggler?

A synchronous gradient-aggregation step uses $N = 100$ workers. Suppose each worker's task time is $T_i = 1 + S_i$, where the $S_i$ are i.i.d. exponential with mean $0.5$ (units arbitrary). What is $\mathbb{E}[T_{(N)}]$ relative to $\mathbb{E}[T_i]$?
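Worked answer: the constant offset passes through the maximum, so $T_{(N)} = 1 + \max_i S_i$, and the exponential part has rate $\lambda = 2$. Hence $\mathbb{E}[T_{(N)}] = 1 + H_{100}/2 \approx 3.59$, while $\mathbb{E}[T_i] = 1.5$: the iteration takes about $2.4\times$ the per-worker mean. A few lines of Python (ours) confirm the arithmetic:

```python
from math import fsum

H_100 = fsum(1.0 / i for i in range(1, 101))   # H_100 ~ 5.187
expected_max = 1.0 + H_100 / 2.0               # E[T_(N)] = 1 + H_N / lambda, lambda = 2
expected_single = 1.0 + 0.5                    # E[T_i] = 1 + E[S_i]
print(expected_max, expected_max / expected_single)  # ~3.594, ratio ~2.40
```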

[Interactive figure] Straggler Latency vs. Number of Workers $N$

The plot compares the expected wait for the slowest worker (the synchronous latency $\mathbb{E}[T_{(N)}]$) against the expected wait for any $K$ of $N$ responses (the redundant scheme of Section 1.2). The horizontal axis is the number of workers $N$, with two curves: $K = N$ (pure synchronous) and $K = \lceil 0.8N \rceil$ (tolerate the slowest 20%). Increasing the service-time spread shows how stragglers amplify as the per-worker variance grows. Adjustable parameters: maximum number of workers to plot (default 100); service rate $\lambda$ (default 1; a higher rate means a faster mean, but the tail still grows logarithmically); fraction of fastest responses waited for (default 0.8).

[Animation] Stragglers in a Synchronous Iteration

Animation of $N$ workers finishing at random times. The synchronous iteration (top bar) is gated by the slowest worker. Below it, a redundant scheme that needs only $K < N$ responses finishes much earlier.

Definition: Recovery Threshold

A redundant distributed-computation scheme has recovery threshold $K \leq N$ if the master can reconstruct the desired output from the responses of any $K$ workers (regardless of which $K$ finish first). In a synchronous setting with order statistics $T_{(1)} \leq T_{(2)} \leq \cdots \leq T_{(N)}$, the iteration latency drops from $T_{(N)}$ to $T_{(K)}$.
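To make "any $K$ of $N$" concrete, here is a toy sketch of the interpolation idea (our illustration over the reals, not the finite-field polynomial code of Chapter 5): the master encodes $K$ data blocks as evaluations of a degree-$(K-1)$ polynomial at $N$ distinct points, so any $K$ returned evaluations determine the polynomial, and hence the data.

```python
import numpy as np

K, N = 3, 6                      # recovery threshold K, number of workers N
rng = np.random.default_rng(1)

data = rng.standard_normal(K)    # K data blocks (scalars, for simplicity)
# Encode: worker i receives p(x_i), where p has the data blocks as coefficients.
xs = np.arange(1, N + 1, dtype=float)
encoded = np.polyval(data, xs)   # p(x) = data[0] x^{K-1} + ... + data[K-1]

# Suppose only the workers at indices 1, 3, 4 respond (any K of N would do).
responders = [1, 3, 4]
# Decode: fit the unique degree-(K-1) polynomial through the K responses.
recovered = np.polyfit(xs[responders], encoded[responders], K - 1)

print(np.allclose(recovered, data))   # True: data rebuilt from any K responses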

Theorem: Latency Gain from Redundancy

For i.i.d. exponential service times with rate $\lambda$ and recovery threshold $K \leq N$,
$$\mathbb{E}[T_{(K)}] = \frac{1}{\lambda}\sum_{j=N-K+1}^{N} \frac{1}{j} = \frac{H_N - H_{N-K}}{\lambda}.$$
In particular, $K = N$ recovers the synchronous penalty $H_N/\lambda$, while $K = \alpha N$ for fixed $\alpha < 1$ gives $\mathbb{E}[T_{(K)}] \to -\ln(1-\alpha)/\lambda$ as $N \to \infty$: a constant, not a logarithm.

Tolerating even a small fraction of stragglers (say $\alpha = 0.9$, discarding the slowest 10%) converts a logarithmically growing tail into a finite asymptote. This is the central operational reason coded computing exists. Of course nothing is free: the price is that workers must store and compute on more than just their share of the data, which is exactly the storage/computation cost we will quantify in Chapters 5 and 6.
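The contrast is easy to tabulate from the closed form above; the short script below (ours) also reproduces the content of the interactive figure, with $K = N$ growing like $\ln N / \lambda$ and $K = \lceil 0.8N \rceil$ flattening toward $-\ln(0.2)/\lambda \approx 1.609/\lambda$:

```python
from math import ceil, fsum, log

def H(n: int) -> float:
    """Harmonic number H_n (with H_0 = 0)."""
    return fsum(1.0 / i for i in range(1, n + 1))

lam, alpha = 1.0, 0.8
limit = -log(1.0 - alpha) / lam               # asymptote for K = alpha * N
print(f"{'N':>6} {'K=N (sync)':>12} {'K=0.8N':>10} {'limit':>8}")
for n in [10, 100, 1000, 10_000]:
    k = ceil(alpha * n)
    sync = H(n) / lam                         # grows like ln(N) / lambda
    coded = (H(n) - H(n - k)) / lam           # bounded as N -> infinity
    print(f"{n:>6} {sync:>12.3f} {coded:>10.3f} {limit:>8.3f}")
```

Already at $N = 100$ the coded column sits within about 1% of its limit, while the synchronous column keeps climbing.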

Key Takeaway

Tolerating a constant fraction of stragglers turns a logarithmic latency penalty into a constant. This is the central operational payoff of coded computing β€” the price is paid in storage/computation redundancy at each worker, which we quantify in Chapters 5–6. The information-theoretic question becomes: how little redundancy suffices to achieve a target recovery threshold?

Common Mistake: Doesn't Asynchronous SGD Solve Stragglers?

Mistake:

Use asynchronous parameter updates: the server applies any worker's gradient as soon as it arrives, so stragglers never block.

Correction:

Asynchrony does eliminate the synchronous wait, but the cost is that the parameter server applies stale gradients computed at older model parameters. In non-convex landscapes (deep networks) staleness causes convergence-rate degradation that grows with worker delay; in some pathological cases it causes outright divergence. Coded computing aims for the best of both worlds: synchronous semantics, with redundancy used to bypass stragglers without staleness. The trade-off between asynchrony, coded redundancy, and convergence rate is itself an open research area (Chapter 18).
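A toy event-driven simulation (our construction, not a real parameter-server implementation) makes the staleness effect visible: each worker reads the current model version, computes for a random time, and submits; the gradient's staleness is the number of updates applied in the meantime. With $N$ always-busy workers the average staleness settles near $N - 1$.

```python
import heapq
import random

random.seed(0)
N, updates = 100, 10_000

version = 0
events = []   # min-heap of (finish_time, model_version_read_at_start)
# Each worker starts one task immediately; compute times are Exp(1).
for _ in range(N):
    heapq.heappush(events, (random.expovariate(1.0), version))

staleness = []
for _ in range(updates):
    now, v_read = heapq.heappop(events)   # fastest outstanding gradient
    staleness.append(version - v_read)    # updates applied since it was read
    version += 1                          # server applies the stale gradient
    heapq.heappush(events, (now + random.expovariate(1.0), version))

print(sum(staleness) / len(staleness))    # average staleness, roughly N - 1
```

Every applied gradient lags the model by about $N - 1$ updates on average, which is exactly the delay term that degrades convergence guarantees as the cluster grows.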

Three Ways to Cope with Stragglers

| Strategy | How it works | Latency | Hidden cost |
| --- | --- | --- | --- |
| Plain synchronous | Wait for all $N$ workers | $H_N/\lambda$ (logarithmic in $N$) | None, but bottlenecked by the tail |
| Asynchronous SGD | Apply gradients as they arrive | Per-update: roughly $1/(N\lambda)$ | Stale gradients hurt convergence |
| Coded redundancy ($K$ of $N$) | Master decodes from any $K$ responses | $(H_N - H_{N-K})/\lambda$ (bounded as $N \to \infty$) | Per-worker storage/compute redundancy |
⚠️ Engineering Note

Tail Latency in Production: The 99.9th Percentile is What Matters

Dean and Barroso's Tail at Scale paper measured request latencies in Google's web-serving fleet and showed that the 99.9th percentile can exceed the median by an order of magnitude β€” even on identical hardware running an identical workload. The mechanisms identified (background activities, garbage collection, queueing, network congestion, hardware aging, thermal throttling) are exactly the kinds of perturbations that motivate coded computing. From a system design standpoint, the right object to optimize is rarely the average latency; it is the high percentile of the latency distribution.

Practical Constraints
• Tail-latency phenomena scale with the number of components a request must touch (the "fan-out" problem).
• Hedging requests, partial-result aggregation, and coded redundancy all attack the same root cause from different angles (a minimal hedging sketch follows the reference below).

πŸ“‹ Ref: Google SRE Book (Site Reliability Engineering); Dean & Barroso, "The Tail at Scale," CACM 2013
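As a concrete example of the hedging idea mentioned above, the sketch below (ours; `fetch` and the replica list are placeholders) issues the same request to two replicas and returns whichever answer lands first:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_request(fetch, replicas, hedge_count=2):
    """Send the same request to `hedge_count` replicas; return the first answer.

    `fetch` is a placeholder callable fetch(replica) -> result. Real systems
    usually delay the hedge until the first request exceeds ~p95 latency,
    and cancel the outstanding RPC once a winner arrives.
    """
    pool = ThreadPoolExecutor(max_workers=hedge_count)
    futures = [pool.submit(fetch, r) for r in replicas[:hedge_count]]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False, cancel_futures=True)  # don't block on the straggler (Python 3.9+)
    return result
```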

Recovery Threshold $K$

The smallest number of worker responses sufficient to reconstruct the desired output. In coded matrix multiplication (Chapter 5), the recovery threshold of the polynomial code with $N$ workers is exactly $K$, and the optimal trade-off between $N$, $K$, and the per-worker storage is the central object of study.

Quick Check

With i.i.d. exponential service times and recovery threshold $K = \lceil 0.5N \rceil$, what happens to the expected iteration latency as $N \to \infty$?

It grows like $\ln N$

It converges to $-\ln(0.5)/\lambda = (\ln 2)/\lambda$

It diverges to infinity

It equals $1/\lambda$