Coded MapReduce: Setup and Motivation
From Caching to Computing
Chapters 2–15 showed that caching + coded multicast saves communication. Chapter 15 applied the same trick to data shuffling in ML. The next conceptual leap, due to Li-Maddah-Ali-Yu-Avestimehr (LMYA, 2018), is that distributed computing itself has a memory-communication tradeoff analogous to coded caching:
Redundant computation = cache. Data shuffle = delivery.
By replicating map computations across workers and coding the shuffle phase, the total shuffle traffic drops by roughly a factor of r, where r is the replication factor. This is the coded-caching gain in a new domain.
Definition: MapReduce Framework
A MapReduce job partitions a task into:
- Map phase. Each worker applies a map function to assigned input files, producing intermediate key-value pairs.
- Shuffle phase. Intermediate values are exchanged so that each worker receives all values for its assigned keys.
- Reduce phase. Each worker applies a reduce function to its assigned keys.
The shuffle phase is the communication bottleneck at scale.
On K workers with N input files and Q reduce keys, the uncoded shuffle volume per key is a fraction 1 - 1/K of all intermediate values: each worker receives the values for its key from the other K - 1 workers.
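As a quick sanity check, this load can be computed directly. A minimal sketch; `uncoded_shuffle_load` is an illustrative helper name, not part of any MapReduce API:

```python
def uncoded_shuffle_load(K: int) -> float:
    """Normalized uncoded shuffle volume per reduce key on K workers.

    Each worker already holds the fraction 1/K of intermediate values
    it produced locally, so it must fetch the remaining 1 - 1/K from
    the other K - 1 workers.
    """
    return 1 - 1 / K

# K = 4 workers: each fetches 3/4 of every key's values over the network
assert uncoded_shuffle_load(4) == 0.75
```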
Definition: Computation Load
The computation load r is the average number of workers on which each input file's map function is computed. Baseline r = 1 (each file mapped once). Redundant computation means each file is mapped at r > 1 workers, trading compute for communication savings.
The coded-caching analog: r plays the role of t = KM/N (the local cache effect). More redundant storage or computation means more coding gain.
Why Shuffle Dominates in Modern Analytics
In petabyte-scale analytics (Facebook, Netflix), the shuffle phase accounts for 60–70% of total MapReduce runtime. Network, not compute, is the bottleneck. The LMYA framework reveals that much of this shuffle traffic is structurally redundant, and coding can substantially reduce it.
Production MapReduce systems (Hadoop, Spark) operate in the uncoded regime. The LMYA scheme represents an opportunity, not yet fully realized in production systems as of 2024, but intellectually foundational to subsequent coded-computing work.
Coded MapReduce Communication-Computation Tradeoff
Communication load L(r) = (1/r)(1 - r/K) versus computation load r. As r increases, shuffle data shrinks by roughly a factor of r: the coded-caching gain applied to the MapReduce shuffle.
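The tradeoff curve is easy to tabulate. A small sketch under the LMYA formula; the helper name `coded_shuffle_load` is an illustrative assumption:

```python
def coded_shuffle_load(K: int, r: int) -> float:
    """LMYA coded communication load L(r) = (1/r) * (1 - r/K), 1 <= r <= K."""
    return (1 / r) * (1 - r / K)

K = 4
for r in range(1, K):
    print(f"r={r}: L={coded_shuffle_load(K, r):.4f}")
# r = 1 recovers the uncoded load 1 - 1/K; each extra unit of r buys
# both the 1/r multicast gain and a smaller residual (1 - r/K)
```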
Example: Coded MapReduce: Word Count
Word-count over N = 12 input shards on K = 4 workers, with replication r = 2. Quantify uncoded vs coded shuffle volume per reduce key.
Uncoded shuffle
Each worker maps 12/4 = 3 shards (uncoded: r = 1). For the Q reduce keys, each worker sends its locally computed values for keys reduced elsewhere; per key, the receiving worker fetches a fraction 1 - 1/4 = 3/4 of all intermediate values from the other workers.
Coded placement
With r = 2: each shard is mapped at 2 of the 4 workers. Each worker now holds intermediate data for 12 × 2/4 = 6 shards.
Coded shuffle formula
L(2) = (1/2)(1 - 2/4) = 1/4 per key, versus the uncoded 3/4: a 3× reduction.
Compute overhead
Map compute is doubled (r = 2). The trade: 2× compute for 3× shuffle savings, a good deal when shuffle is the bottleneck.
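The 3× figure can be checked end to end with a toy simulation of the coded shuffle for exactly this setting (K = 4 workers, r = 2, 12 shards, one reduce key per worker). This is a sketch under stated assumptions, not a production implementation: names like `file_owner` are hypothetical, and random 32-bit integers stand in for the intermediate word counts.

```python
from itertools import combinations
import random

K, r = 4, 2
workers = range(K)
pairs = list(combinations(workers, r))   # 6 worker pairs
FILES_PER_PAIR = 2                       # 12 shards total, as in the example
N = FILES_PER_PAIR * len(pairs)

# placement: each pair of workers jointly maps FILES_PER_PAIR shards
file_owner = {}
fid = 0
for p in pairs:
    for _ in range(FILES_PER_PAIR):
        file_owner[fid] = frozenset(p)
        fid += 1

# intermediate value v[key][file]: a random 32-bit int stands in for the
# counts that shard contributes to that key (worker w reduces key w)
rng = random.Random(0)
v = [[rng.getrandbits(32) for _ in range(N)] for _ in range(K)]

knows = {w: {f for f, o in file_owner.items() if w in o} for w in workers}
assert all(len(knows[w]) == 6 for w in workers)   # 6 shards per worker

transmissions = 0
recovered = {w: {} for w in workers}

# coded shuffle: for every 3-worker group S, each member multicasts one
# XOR that simultaneously serves the other two members
for S in combinations(workers, r + 1):
    part = {}   # (receiver s, helper h) -> shard whose value s still needs
    for s in S:
        helpers = sorted(set(S) - {s})
        needed = sorted(f for f, o in file_owner.items()
                        if o == frozenset(helpers))
        for h, f in zip(helpers, needed):          # one shard per helper
            part[(s, h)] = f
    for t in S:
        terms = [(s, part[(s, t)]) for s in S if s != t]
        coded = 0
        for s, f in terms:
            coded ^= v[s][f]        # t mapped every shard appearing here
        transmissions += 1
        for s, f in terms:
            (s2, f2), = [(x, g) for x, g in terms if x != s]
            # s also mapped f2, so it can cancel v[s2][f2] and decode
            recovered[s][f] = coded ^ v[s2][f2]

# every worker recovers exactly the 6 values it was missing
for w in workers:
    missing = set(range(N)) - knows[w]
    assert recovered[w].keys() == missing
    assert all(recovered[w][f] == v[w][f] for f in missing)

# 12 coded multicasts vs 36 uncoded (r = 1) unicasts: the 3x reduction
assert transmissions == 12
```

Each multicast serves two receivers at once, which is where the factor beyond plain replication comes from: replication alone only removes traffic a worker can satisfy locally, while XOR coding makes every transmitted bit useful to multiple workers.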
At scale
In general, shuffle volume shrinks by roughly a factor of r. Compute overhead is r×, but shuffle dominates at scale, so net wall-clock savings are substantial.
Historical Note: MapReduce: From Google to the IT Community
Google's MapReduce paper (Dean-Ghemawat, 2004; CACM 2008) introduced the framework to simplify petabyte-scale data processing. The Apache Hadoop open-source re-implementation (2006) and later Spark (2010) made MapReduce a ubiquitous primitive. But the information-theoretic view that the shuffle phase is a broadcast problem with side information didn't emerge until Li-Maddah-Ali-Yu-Avestimehr (2014 ISIT, 2018 IT Transactions).
Their work bridged coded caching and distributed computing, opening a new research direction: coded computing. This has since grown into a subfield with dedicated workshops, a tutorial (Li-Avestimehr, Foundations and Trends 2020), and influence on straggler-aware ML frameworks.
Key Takeaway
MapReduce's shuffle phase mirrors MAN's delivery phase. Redundant computation (computation load r) plays the role of the cache. Coded shuffling recovers the r-fold reduction in communication. The LMYA framework establishes this formally: a direct port of coded-caching intuition to distributed analytics.