Multi-GPU and Distributed Computing
When One GPU Is Not Enough
As models and datasets grow, a single GPU becomes insufficient. Multi-GPU training provides two scaling strategies:
- Data parallelism: Each GPU processes a different batch; gradients are averaged across GPUs. Scales batch size linearly with GPU count.
- Model parallelism: Different parts of the model live on different GPUs. Enables training models too large for a single GPU's memory.
This section covers PyTorch's DataParallel, DistributedDataParallel
(DDP), FSDP, and the communication primitives that enable multi-GPU computing.
Definition: DataParallel (DP)
torch.nn.DataParallel is the simplest multi-GPU wrapper:
model = nn.DataParallel(model) # wraps model for multi-GPU
output = model(input) # splits input across GPUs
How it works:
- Replicates model to each GPU
- Splits the input batch across GPUs
- Forward pass runs in parallel on each GPU
- Outputs are gathered to GPU 0
- Loss and backward run on GPU 0
- Gradients are broadcast to update all replicas
Limitations: GPU 0 is a bottleneck (gathers all outputs, computes loss); uses Python threads (GIL-limited); does not scale beyond a single machine.
DataParallel is deprecated in favor of DistributedDataParallel.
Use DP only for quick prototyping on a single machine.
Definition: DistributedDataParallel (DDP)
DistributedDataParallel is the production-grade multi-GPU solution:
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group('nccl')              # one process per GPU
local_rank = int(os.environ['LOCAL_RANK'])   # set by torchrun
model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])
Key advantages over DataParallel:
- One process per GPU (no GIL contention)
- Gradient all-reduce overlaps with backward pass
- Scales to multiple machines via NCCL
- No GPU 0 bottleneck (symmetric across all GPUs)
Launch DDP with torchrun: torchrun --nproc_per_node=4 train.py.
Each process gets its own LOCAL_RANK environment variable.
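torchrun also coordinates multi-node jobs through a rendezvous endpoint. A hedged sketch for two 8-GPU nodes (the hostname node0 and port 29500 are placeholder values; run the same command on every node):
# torchrun --nnodes=2 --nproc_per_node=8 \
#     --rdzv_backend=c10d --rdzv_endpoint=node0:29500 train.py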
Definition: All-Reduce Communication Primitive
All-reduce is the fundamental operation in distributed training. It computes the element-wise sum (or average) of tensors across all GPUs and distributes the result to every GPU:
$\bar{g} = \frac{1}{N} \sum_{i=1}^{N} g_i$, where $g_i$ is the gradient on GPU $i$ and $N$ is the number of GPUs.
# Manual all-reduce
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
tensor /= dist.get_world_size()
DDP performs all-reduce automatically during backward(), overlapping
communication with gradient computation using bucketed all-reduce.
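The bucket size is tunable via the bucket_cap_mb argument of the DDP constructor (25 MB is the default):
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)  # default bucket size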
NCCL implements all-reduce using a ring algorithm with a bandwidth cost of $2\frac{N-1}{N}P$ elements sent per GPU for $P$ parameters across $N$ GPUs, nearly independent of GPU count for large $N$.
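A quick back-of-the-envelope sketch of this cost (the $2(N-1)/N$ factor is the standard per-GPU traffic of a ring all-reduce; the 1B-parameter model is an illustrative assumption):
def ring_allreduce_bytes(num_params, world_size, bytes_per_param=4):
    """Per-GPU bytes sent for one ring all-reduce (fp32 by default)."""
    n = world_size
    return 2 * (n - 1) / n * num_params * bytes_per_param

for n in (2, 8, 64):
    gb = ring_allreduce_bytes(1_000_000_000, n) / 1e9
    print(f"N={n:3d}: {gb:.2f} GB per GPU per step")
# N=2: 4.00 GB, N=8: 7.00 GB, N=64: 7.88 GB -- approaching the 8 GB asymptote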
Definition: Model Parallelism (Pipeline and Tensor)
When a model does not fit on a single GPU, split it across GPUs:
Pipeline parallelism: Split layers across GPUs.
class PipelineModel(nn.Module):
    def __init__(self):
        super().__init__()  # required before assigning submodules
        self.part1 = nn.Sequential(...).to('cuda:0')
        self.part2 = nn.Sequential(...).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        x = self.part2(x.to('cuda:1'))  # move activations to the next GPU
        return x
Tensor parallelism: Split individual layers (e.g., split a large linear layer's weight matrix column-wise across GPUs).
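A minimal sketch of the column-wise split (illustrative only: the shards here run sequentially from one Python loop, whereas real tensor-parallel implementations such as Megatron-LM launch them concurrently and overlap the gather with communication):
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Toy column-parallel linear: each GPU holds a slice of the output dim."""
    def __init__(self, in_features, out_features, devices):
        super().__init__()
        shard = out_features // len(devices)  # assumes even divisibility
        self.devices = devices
        self.shards = nn.ModuleList(
            nn.Linear(in_features, shard).to(d) for d in devices)

    def forward(self, x):
        # Each shard computes its slice of the output; gather on the first device
        outs = [m(x.to(d)) for m, d in zip(self.shards, self.devices)]
        return torch.cat([o.to(self.devices[0]) for o in outs], dim=-1)

# Usage sketch: layer = ColumnParallelLinear(1024, 4096, ['cuda:0', 'cuda:1'])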
FSDP (Fully Sharded Data Parallel): Shards parameters, gradients, and optimizer states across GPUs, gathering them on-demand.
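Wrapping a model in FSDP looks much like DDP. A minimal sketch, assuming a process group is already initialized and local_rank is set as in the DDP example (FSDP lives in torch.distributed.fsdp as of PyTorch 1.11):
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(model.to(local_rank))  # parameters are sharded across ranks
# Forward/backward gather each shard on demand and free it afterwards, so peak
# per-GPU memory is roughly model_size / world_size plus activations.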
Definition: Communication Backends: NCCL, Gloo, MPI
PyTorch supports three communication backends:
| Backend | GPU-GPU | CPU-CPU | Multi-node | Recommended for |
|---|---|---|---|---|
| NCCL | Yes | No | Yes | GPU training |
| Gloo | No | Yes | Yes | CPU training |
| MPI | Yes | Yes | Yes | HPC clusters |
# Initialize with NCCL (best for GPU)
dist.init_process_group(
backend='nccl',
init_method='env://',
)
NCCL (NVIDIA Collective Communications Library) provides optimized GPU-to-GPU communication that bypasses the CPU entirely via NVLink or PCIe peer-to-peer.
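A common pattern is to choose the backend at runtime so the same script runs on both GPU and CPU-only machines (a small sketch under that assumption):
import torch
import torch.distributed as dist

# NCCL for GPU tensors, Gloo as the CPU fallback
backend = 'nccl' if torch.cuda.is_available() else 'gloo'
dist.init_process_group(backend=backend, init_method='env://')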
Theorem: Linear Scaling Rule for Distributed Training
When scaling from 1 GPU to $N$ GPUs with data parallelism, maintain the same convergence behavior by scaling the learning rate linearly with the batch size:
$$\eta_N = N \eta_1 \quad \text{when} \quad B_N = N B_1,$$
where $\eta$ is the learning rate and $B$ is the global batch size. This keeps the ratio $\eta / B$ constant. Use a linear warmup over the first few epochs to stabilize training at the higher learning rate.
Each GPU sees $1/N$ of the data per step. Scaling the learning rate compensates for the $N$-fold increase in effective batch size. The warmup prevents divergence at the start, when the model is far from a good minimum.
Gradient averaging
With $N$ GPUs, the gradient estimate is averaged: $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} g_i$. Its variance decreases by a factor of $N$, but the step size should scale with the batch, giving $\eta \propto N$.
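A minimal sketch of the rule in code, assuming a process group is initialized and model exists; the base learning rate of 1e-3 and 500 warmup steps are illustrative values:
import torch
import torch.distributed as dist

base_lr, warmup_steps = 1e-3, 500            # tuned for 1 GPU (illustrative)
scaled_lr = base_lr * dist.get_world_size()  # linear scaling rule

optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr)
# Ramp the LR linearly from ~0 to scaled_lr over warmup_steps, then hold
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))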
Example: Complete DDP Training Script
Write a minimal but complete distributed training script using
DistributedDataParallel that works on 1-8 GPUs.
Initialize distributed
import os, torch, torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup():
    dist.init_process_group('nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank
Create model and data
def main():
    local_rank = setup()
    model = nn.Linear(1024, 10).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = ...  # your dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, sampler=sampler,
                        batch_size=64, num_workers=4)
Training loop
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(10):
        sampler.set_epoch(epoch)  # essential for shuffling
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()   # all-reduce happens here
            optimizer.step()
            optimizer.zero_grad()
    dist.destroy_process_group()

if __name__ == '__main__':
    main()

# Launch: torchrun --nproc_per_node=4 train.py
Example: Profiling Communication vs Computation
Measure the fraction of training time spent on gradient communication (all-reduce) vs computation to identify scaling bottlenecks.
Timing approach
import time
import torch

# Time one full step on the model *without* the DDP wrapper
# (computation only, no gradient synchronization)
torch.cuda.synchronize()
t0 = time.perf_counter()
output = model(x)
loss = criterion(output, y)
loss.backward()
torch.cuda.synchronize()
t_compute = time.perf_counter() - t0

# Repeat with the DDP-wrapped model; the difference estimates the
# all-reduce communication cost DDP adds to backward()
Using PyTorch profiler
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
) as prof:
    for i, (x, y) in enumerate(loader):
        if i >= 5:
            break
        loss = model(x.cuda()).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# NCCL all-reduce kernels in the table (names containing 'nccl')
# account for the communication time
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10))
Distributed Training Scaling Efficiency (interactive demo)
Watch how training throughput and communication overhead evolve as GPUs are added, and observe the transition from compute-bound to communication-bound scaling as the GPU count grows.
Multi-GPU Training: see ch13/python/multi_gpu.py.
Performance Profiling: see ch13/python/profiling.py.
Quick Check
In DDP training with 4 GPUs and a dataset of 10000 samples, what does DistributedSampler do?
- Each GPU sees all 10000 samples per epoch
- Each GPU sees 2500 non-overlapping samples per epoch
- Samples are randomly assigned to GPUs each iteration
- Only GPU 0 loads data and broadcasts to others
Answer: each GPU sees 2500 non-overlapping samples per epoch. DistributedSampler partitions the dataset into world_size shards; each GPU processes its shard, and gradients are averaged via all-reduce.
Common Mistake: Forgetting sampler.set_epoch() in DDP
Mistake:
Not calling sampler.set_epoch(epoch) before each epoch in DDP training.
Without this, the sampler reuses the same random seed every epoch, so each
GPU sees its 1/N partition in the same order every epoch, with no fresh
shuffling across epochs.
Correction:
Always call sampler.set_epoch(epoch) at the start of each epoch.
This seeds the sampler's random shuffling, ensuring different data
ordering each epoch while maintaining non-overlapping partitions.
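In code, the fix is a single line at the top of the epoch loop (repeating the pattern from the full script above):
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # re-seeds the shuffle for this epoch
    for x, y in loader:
        ...  # training step as usual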
Key Takeaway
Always use DistributedDataParallel over DataParallel. DDP uses
one process per GPU, overlaps communication with backward pass, and
scales to multiple machines. Remember: DistributedSampler +
sampler.set_epoch(epoch) are required for correct data partitioning.
Why This Matters: Distributed Computing for Large-Scale MIMO Simulations
Simulating a cell-free massive MIMO network with 128 access points, 1000 users, and full channel estimation requires processing $128 \times 1000$ channel matrices every coherence interval. Multi-GPU parallelism distributes users across GPUs (data parallelism) or distributes antenna processing across GPUs (model parallelism). The all-reduce operation for gradient averaging mirrors the fronthaul aggregation in real cell-free systems.
See full treatment in Domain-Specific Plot Types
Historical Note: The Rise of Distributed Training
The foundational paper on large-scale distributed deep learning was "Large Scale Distributed Deep Networks" by Dean et al. at Google (2012), which used model parallelism across 16,000 CPU cores. The shift to data-parallel GPU training was catalyzed by the ImageNet moment: in 2017, Facebook trained ResNet-50 on ImageNet in 1 hour using 256 GPUs with synchronized SGD and the linear scaling rule. Today, models like GPT-4 train on thousands of GPUs using a combination of data, tensor, and pipeline parallelism.
DistributedDataParallel (DDP)
PyTorch's production-grade multi-GPU training wrapper that uses one process per GPU, NCCL for communication, and overlaps gradient all-reduce with backward computation.
Related: All-Reduce
All-Reduce
A collective communication operation that computes the sum (or average) of tensors across all processes and distributes the result to every process.
Related: DistributedDataParallel (DDP)
NCCL
NVIDIA Collective Communications Library -- an optimized library for GPU-to-GPU communication via NVLink or PCIe, implementing all-reduce, broadcast, and other collective operations.
DataParallel vs DistributedDataParallel
| Feature | DataParallel | DDP |
|---|---|---|
| Process model | Single process, multi-thread | Multi-process (1 per GPU) |
| GIL limitation | Yes (threads share GIL) | No (separate processes) |
| GPU utilization | Asymmetric (GPU 0 bottleneck) | Symmetric |
| Gradient sync | Gather on GPU 0 | All-reduce (overlapped) |
| Multi-node | No | Yes (via NCCL/Gloo) |
| Recommended | No (deprecated) | Yes (production) |