Coded MapReduce: Setup and Motivation

From Caching to Computing

Chapters 2-15 showed that caching + coded multicast saves communication; Chapter 15 in particular applied the same trick to data shuffling in ML. The next conceptual leap, due to Li-Maddah-Ali-Yu-Avestimehr (LMYA, 2018), is that distributed computing itself exhibits a tradeoff analogous to coded caching's memory-communication tradeoff:

Redundant computation = cache. Data shuffle = delivery.

By replicating map computations across workers and coding the shuffle phase, the total shuffle traffic drops by a factor of r, where r is the replication factor. This is the coded caching gain in a new domain.

Definition:

MapReduce Framework

A MapReduce job partitions a task into:

  1. Map phase. Each worker applies a map function f to assigned input files, producing intermediate key-value pairs.
  2. Shuffle phase. Intermediate values are exchanged so that each worker receives all values for its assigned keys.
  3. Reduce phase. Each worker applies a reduce function g to its assigned keys.

The shuffle phase is the communication bottleneck at scale.
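
To fix ideas, here is a minimal single-process sketch of the three phases for word count. It is illustrative only: map_fn, shuffle, and reduce_fn are our names, not the Hadoop or Spark API.

```python
from collections import defaultdict

# Map phase: each "worker" turns its input shard into (key, value) pairs.
def map_fn(shard):
    return [(word, 1) for word in shard.split()]

# Shuffle phase: group intermediate values by key; at scale this grouping
# is the network-bound all-to-all exchange.
def shuffle(mapped):
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

# Reduce phase: combine each key's values with the reduce function.
def reduce_fn(key, values):
    return key, sum(values)

shards = ["the cat sat", "the dog sat", "the cat ran"]
mapped = [map_fn(s) for s in shards]   # runs in parallel, one worker per shard
counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
```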

On K workers with N input files and Q reduce keys, an uncoded shuffle moves a (1 - 1/K) fraction of all QN intermediate values: each worker maps only N/K files, so it already holds 1/K of the values for each of its keys and must receive the rest from the other K - 1 workers.
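
A quick sanity check of this count, with illustrative sizes (the values of K, N, Q below are our assumptions, not from the text):

```python
K, N, Q = 4, 12, 4   # workers, input files, reduce keys (illustrative values)

# Without replication each worker maps N/K files, so for each of its keys
# it already holds N/K of the N intermediate values and fetches the rest.
local_per_key = N // K                # values already on the worker: 3
fetched_per_key = N - local_per_key   # values received over the network: 9

print(fetched_per_key / N)            # 0.75, i.e. 1 - 1/K
print(Q * fetched_per_key)            # 36 of the Q*N = 48 values are shuffled
```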

Definition:

Computation Load

The computation load r is the average number of workers at which each input file's map function is computed. The baseline is r = 1 (each file mapped once). Redundant computation, r > 1, means each file is mapped at r workers, trading compute for communication savings.

The coded-caching analog: r plays the role of the multicast gain Kt + 1 (the number of users served per coded transmission). More redundant storage / computation ⇒ more coding gain.

Why Shuffle Dominates in Modern Analytics

In petabyte-scale analytics (Facebook, Netflix), the shuffle phase accounts for 60-70% of total MapReduce runtime. Network, not compute, is the bottleneck. The LMYA framework reveals that much of this shuffle traffic is structurally redundant: coding can substantially reduce it.

Production MapReduce systems (Hadoop, Spark) operate in the uncoded regime. As of 2024 the LMYA scheme remains an opportunity rather than deployed practice, but it is intellectually foundational to subsequent coded-computing work.

Coded MapReduce Communication-Computation Tradeoff

[Figure: communication load L(r) = (1/r)(1 - r/K) plotted against computation load r. As r increases, the shuffle traffic shrinks by a factor of r, the coded-caching gain applied to the MapReduce shuffle.]
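
A minimal sketch of the tradeoff curve, directly from the formula above (the helper name comm_load and the choice K = 10 are ours):

```python
def comm_load(r, K):
    """LMYA communication load L(r) = (1/r) * (1 - r/K), for 1 <= r <= K."""
    return (1 / r) * (1 - r / K)

K = 10
for r in range(1, K + 1):
    uncoded = 1 - r / K               # shuffle fraction without coding
    print(f"r={r:2d}  uncoded={uncoded:.3f}  coded={comm_load(r, K):.3f}")
```

Note that comm_load(1, K) recovers the uncoded baseline 1 - 1/K, and the ratio of uncoded to coded load is exactly r at every point on the curve.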

Example: Coded MapReduce Word Count

Word count over N = 12 input shards on K = 4 workers, with replication r = 2. Quantify the uncoded vs. coded shuffle volume per reduce key.
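
A back-of-the-envelope sketch for these numbers, using the load formula above (per_key_shuffle is our illustrative helper, not a standard API):

```python
N, K = 12, 4   # input shards and workers, from the example

def per_key_shuffle(r, coded):
    """Intermediate values shipped per reduce key, out of N values per key."""
    load = 1 - r / K          # fraction a reducer is still missing
    if coded:
        load /= r             # multicast coding gain of r
    return N * load

print(per_key_shuffle(1, coded=False))  # 9.0  baseline: no replication
print(per_key_shuffle(2, coded=False))  # 6.0  replicated (r=2) but uncoded
print(per_key_shuffle(2, coded=True))   # 3.0  coded shuffle at r=2
```

With r = 2, each worker maps rN/K = 6 shards, so it already holds 6 of the 12 values for each of its keys; coding the remaining exchange halves the 6 uncoded values to 3 per key.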

Historical Note: MapReduce: From Google to the IT Community

Google's MapReduce paper (Dean-Ghemawat, OSDI 2004; CACM 2008) introduced the framework to simplify petabyte-scale data processing. The Apache Hadoop open-source re-implementation (2006) and later Spark (2010) made MapReduce a ubiquitous primitive. But the information-theoretic view, that the shuffle phase is a broadcast problem with side information, didn't emerge until Li-Maddah-Ali-Yu-Avestimehr (Allerton 2015; IT Transactions 2018).

Their work bridged coded caching and distributed computing, opening a new research direction: coded computing. This has since grown into a subfield with dedicated workshops, a monograph (Li-Avestimehr, Foundations and Trends 2020), and influence on straggler-aware ML frameworks.

Key Takeaway

MapReduce's shuffle phase mirrors MAN's delivery phase. Redundant computation (r > 1) plays the role of the cache. Coded shuffling recovers the r-fold reduction in communication. The LMYA framework establishes this formally: a direct port of coded-caching intuition to distributed analytics.