Coded MapReduce: Setup and Motivation
From Caching to Computing
Chapters 2–15 showed that caching + coded multicast saves communication. Chapter 15 applied the same trick to data shuffling in ML. The next conceptual leap, due to Li-Maddah-Ali-Yu-Avestimehr (LMYA, 2018), is that distributed computing itself has a memory-communication tradeoff analogous to coded caching:
Redundant computation = cache. Data shuffle = delivery.
By replicating map computations across workers and coding the shuffle phase, the total shuffle traffic drops by roughly a factor of r, where r is the replication factor. This is the coded-caching gain in a new domain.
Definition: MapReduce Framework
A MapReduce job partitions a task into:
- Map phase. Each worker applies a map function to assigned input files, producing intermediate key-value pairs.
- Shuffle phase. Intermediate values are exchanged so that each worker receives all values for its assigned keys.
- Reduce phase. Each worker applies a reduce function to its assigned keys.
The shuffle phase is the communication bottleneck at scale.
On K workers with N input files and Q reduce keys, the uncoded shuffle volume per key is a fraction 1 - 1/K of all intermediate values: each worker receives the values for its key from the other K - 1 workers.
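As a quick sanity check, this load can be computed directly. A minimal sketch; `uncoded_shuffle_load` is an illustrative helper name, not part of any MapReduce API:

```python
def uncoded_shuffle_load(K: int) -> float:
    """Normalized uncoded shuffle volume per reduce key on K workers.

    Each worker already holds the fraction 1/K of intermediate values
    it produced locally, so it must fetch the remaining 1 - 1/K from
    the other K - 1 workers.
    """
    return 1 - 1 / K

# K = 4 workers: each fetches 3/4 of every key's values over the network
assert uncoded_shuffle_load(4) == 0.75
```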
Definition: Computation Load
The computation load r is the average number of workers on which each input file's map function is computed. Baseline r = 1 (each file mapped once). Redundant computation means each file is mapped at r > 1 workers, trading compute for communication savings.
The coded-caching analog: r plays the role of t = KM/N (the local cache effect). More redundant storage or computation means more coding gain.
Why Shuffle Dominates in Modern Analytics
In petabyte-scale analytics (Facebook, Netflix), the shuffle phase accounts for 60–70% of total MapReduce runtime. Network, not compute, is the bottleneck. The LMYA framework reveals that much of this shuffle traffic is structurally redundant, and coding can substantially reduce it.
Production MapReduce systems (Hadoop, Spark) operate in the uncoded regime. The LMYA scheme represents an opportunity, not yet fully realized in production systems as of 2024, but intellectually foundational to subsequent coded-computing work.
Coded MapReduce Communication-Computation Tradeoff
Communication load L(r) = (1/r)(1 - r/K) versus computation load r. As r increases, shuffle data shrinks by roughly a factor of r: the coded-caching gain applied to the MapReduce shuffle.
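The tradeoff curve is easy to tabulate. A small sketch under the LMYA formula; the helper name `coded_shuffle_load` is an illustrative assumption:

```python
def coded_shuffle_load(K: int, r: int) -> float:
    """LMYA coded communication load L(r) = (1/r) * (1 - r/K), 1 <= r <= K."""
    return (1 / r) * (1 - r / K)

K = 4
for r in range(1, K):
    print(f"r={r}: L={coded_shuffle_load(K, r):.4f}")
# r = 1 recovers the uncoded load 1 - 1/K; each extra unit of r buys
# both the 1/r multicast gain and a smaller residual (1 - r/K)
```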
Example: Coded MapReduce: Word Count
Word-count over N = 12 input shards on K = 4 workers, with replication r = 2. Quantify uncoded vs coded shuffle volume per reduce key.
Uncoded shuffle
Each worker maps 12/4 = 3 shards (uncoded: r = 1). For the Q reduce keys, each worker sends its locally computed values for keys reduced elsewhere; per key, the receiving worker fetches a fraction 1 - 1/4 = 3/4 of all intermediate values from the other workers.
Coded placement
With r = 2: each shard is mapped at 2 of the 4 workers. Each worker now holds intermediate data for 12 × 2/4 = 6 shards.
Coded shuffle formula
L(2) = (1/2)(1 - 2/4) = 1/4 per key, versus the uncoded 3/4: a 3× reduction.
Compute overhead
Map compute is doubled (r = 2). The trade: 2× compute for 3× shuffle savings, a good deal when shuffle is the bottleneck.
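The 3× figure can be checked end to end with a toy simulation of the coded shuffle for exactly this setting (K = 4 workers, r = 2, 12 shards, one reduce key per worker). This is a sketch under stated assumptions, not a production implementation: names like `file_owner` are hypothetical, and random 32-bit integers stand in for the intermediate word counts.

```python
from itertools import combinations
import random

K, r = 4, 2
workers = range(K)
pairs = list(combinations(workers, r))   # 6 worker pairs
FILES_PER_PAIR = 2                       # 12 shards total, as in the example
N = FILES_PER_PAIR * len(pairs)

# placement: each pair of workers jointly maps FILES_PER_PAIR shards
file_owner = {}
fid = 0
for p in pairs:
    for _ in range(FILES_PER_PAIR):
        file_owner[fid] = frozenset(p)
        fid += 1

# intermediate value v[key][file]: a random 32-bit int stands in for the
# counts that shard contributes to that key (worker w reduces key w)
rng = random.Random(0)
v = [[rng.getrandbits(32) for _ in range(N)] for _ in range(K)]

knows = {w: {f for f, o in file_owner.items() if w in o} for w in workers}
assert all(len(knows[w]) == 6 for w in workers)   # 6 shards per worker

transmissions = 0
recovered = {w: {} for w in workers}

# coded shuffle: for every 3-worker group S, each member multicasts one
# XOR that simultaneously serves the other two members
for S in combinations(workers, r + 1):
    part = {}   # (receiver s, helper h) -> shard whose value s still needs
    for s in S:
        helpers = sorted(set(S) - {s})
        needed = sorted(f for f, o in file_owner.items()
                        if o == frozenset(helpers))
        for h, f in zip(helpers, needed):          # one shard per helper
            part[(s, h)] = f
    for t in S:
        terms = [(s, part[(s, t)]) for s in S if s != t]
        coded = 0
        for s, f in terms:
            coded ^= v[s][f]        # t mapped every shard appearing here
        transmissions += 1
        for s, f in terms:
            (s2, f2), = [(x, g) for x, g in terms if x != s]
            # s also mapped f2, so it can cancel v[s2][f2] and decode
            recovered[s][f] = coded ^ v[s2][f2]

# every worker recovers exactly the 6 values it was missing
for w in workers:
    missing = set(range(N)) - knows[w]
    assert recovered[w].keys() == missing
    assert all(recovered[w][f] == v[w][f] for f in missing)

# 12 coded multicasts vs 36 uncoded (r = 1) unicasts: the 3x reduction
assert transmissions == 12
```

Each multicast serves two receivers at once, which is where the factor beyond plain replication comes from: replication alone only removes traffic a worker can satisfy locally, while XOR coding makes every transmitted bit useful to multiple workers.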
At scale
In general, shuffle volume shrinks by roughly a factor of r. Compute overhead is r×, but shuffle dominates at scale, so net wall-clock savings are substantial.
Historical Note: MapReduce: From Google to the IT Community
Google's MapReduce paper (Dean-Ghemawat, 2004; CACM 2008) introduced the framework to simplify petabyte-scale data processing. The Apache Hadoop open-source re-implementation (2006) and later Spark (2010) made MapReduce a ubiquitous primitive. But the information-theoretic view that the shuffle phase is a broadcast problem with side information didn't emerge until Li-Maddah-Ali-Yu-Avestimehr (2014 ISIT, 2018 IT Transactions).
Their work bridged coded caching and distributed computing, opening a new research direction: coded computing. This has since grown into a subfield with dedicated workshops, a tutorial (Li-Avestimehr, Foundations and Trends 2020), and influence on straggler-aware ML frameworks.
Key Takeaway
MapReduce's shuffle phase mirrors MAN's delivery phase. Redundant computation (computation load r) plays the role of the cache. Coded shuffling recovers the r-fold reduction in communication. The LMYA framework establishes this formally: a direct port of coded-caching intuition to distributed analytics.