Multiprocessing and Parallelism

Why Multiprocessing?

Python's Global Interpreter Lock (GIL) prevents true multi-threaded parallelism for CPU-bound work. To utilize multiple CPU cores, you need multiprocessing (separate processes, each with its own GIL) or external libraries that release the GIL (NumPy, Numba).

This section covers multiprocessing.Pool, concurrent.futures, joblib, and the GIL: the tools for scaling CPU-bound scientific computation across cores.

Definition:

The Global Interpreter Lock (GIL)

The Global Interpreter Lock (GIL) is a mutex in CPython that allows only one thread to execute Python bytecode at a time. This means threading.Thread does not provide CPU parallelism for pure Python code.

import threading

# These threads run SEQUENTIALLY due to the GIL:
t1 = threading.Thread(target=cpu_bound_func, args=(data1,))
t2 = threading.Thread(target=cpu_bound_func, args=(data2,))
t1.start(); t2.start()
t1.join(); t2.join()

The GIL is released during I/O operations and by C extensions (NumPy, SciPy), so threading works for I/O-bound and NumPy-heavy workloads.
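Because NumPy releases the GIL inside its compiled kernels, a thread pool can genuinely overlap matrix work even in ordinary CPython. A minimal sketch (the matmul helper and the array sizes are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matmul_trace(seed):
    # NumPy releases the GIL inside the matrix product,
    # so these calls can overlap across threads.
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    return float(np.trace(a @ a.T))

with ThreadPoolExecutor(max_workers=4) as exe:
    traces = list(exe.map(matmul_trace, range(4)))

print(len(traces))  # four results, computed with thread-level overlap
```

The same pattern applies to I/O-bound calls: the GIL is dropped while a thread waits on a socket or file, so threads interleave naturally.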

CPython 3.13 introduced an experimental free-threaded mode (python3.13t) that disables the GIL entirely. As of 2025, this is opt-in and not yet production-ready.

Definition:

Process Pools with multiprocessing.Pool

multiprocessing.Pool spawns worker processes, each with its own Python interpreter and GIL. Work is distributed via:

from multiprocessing import Pool

def process_chunk(data):
    return heavy_computation(data)

with Pool(processes=4) as pool:
    results = pool.map(process_chunk, data_chunks)

Data is serialized (pickled) to send to workers and deserialized on return. This serialization overhead makes Pool inefficient for small tasks or large data transfers.

Use pool.starmap() for functions with multiple arguments, pool.imap() for lazy iteration, and pool.apply_async() for individual asynchronous tasks.
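A sketch of the three variants side by side (the power helper is illustrative):

```python
from functools import partial
from multiprocessing import Pool

def power(base, exp):
    return base ** exp

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # starmap: unpacks (base, exp) tuples as positional arguments
        squares = pool.starmap(power, [(2, 2), (3, 2), (4, 2)])

        # imap: lazy, in-order iteration; needs a single-argument callable
        cubes = list(pool.imap(partial(power, exp=3), [2, 3, 4]))

        # apply_async: one task, result retrieved later via AsyncResult
        res = pool.apply_async(power, (10, 2))
        print(squares, cubes, res.get())  # [4, 9, 16] [8, 27, 64] 100
```

Note that `partial` works here because `power` is a module-level function; the partial object itself is picklable.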

Definition:

concurrent.futures: Unified Executor API

concurrent.futures provides a high-level API with two executors:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# CPU-bound: use processes
with ProcessPoolExecutor(max_workers=4) as exe:
    futures = [exe.submit(compute, chunk) for chunk in chunks]
    results = [f.result() for f in futures]

# I/O-bound: use threads
with ThreadPoolExecutor(max_workers=8) as exe:
    futures = [exe.submit(download, url) for url in urls]
    results = [f.result() for f in futures]

The Future object provides .result(), .done(), .cancel(), and supports as_completed() for processing results as they arrive.
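A minimal sketch of the Future workflow, using as_completed to consume results in completion order rather than submission order (the square helper is illustrative):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def square(x):
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as exe:
        futures = [exe.submit(square, n) for n in range(5)]
        total = 0
        for fut in as_completed(futures):  # yields each future as it finishes
            assert fut.done()              # guaranteed once yielded
            total += fut.result()
        print(total)  # 0 + 1 + 4 + 9 + 16 = 30
```

as_completed is the right tool when later tasks can finish before earlier ones and you want to start post-processing immediately.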

Definition:

joblib: Easy Parallel Loops

joblib.Parallel provides a concise syntax for parallelizing loops, with automatic backend selection and memory mapping for large arrays:

from joblib import Parallel, delayed

results = Parallel(n_jobs=4, backend='loky')(
    delayed(process)(item) for item in data
)

joblib's loky backend handles process management more robustly than multiprocessing.Pool, with better error reporting and automatic cleanup.

joblib is the parallelism backend used by scikit-learn. It automatically memory-maps large NumPy arrays to avoid copying them to each worker process.

Definition:

Amdahl's Law

Amdahl's Law gives the theoretical maximum speedup when parallelizing a program with serial fraction $f_s$:

$$S_p = \frac{1}{f_s + \frac{1-f_s}{p}}$$

where $p$ is the number of parallel workers. As $p \to \infty$:

$$S_{\infty} = \frac{1}{f_s}$$

If 10% of your code is serial ($f_s = 0.1$), the maximum possible speedup is $10\times$, regardless of how many cores you have.
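The saturation is easy to check numerically; a small sketch:

```python
def amdahl_speedup(f_s, p):
    """Amdahl's Law: speedup on p workers with serial fraction f_s."""
    return 1.0 / (f_s + (1.0 - f_s) / p)

# Serial fraction 10%: speedup saturates near the 1/f_s = 10 ceiling
print(round(amdahl_speedup(0.1, 8), 2))     # modest gain on 8 cores
print(round(amdahl_speedup(0.1, 1000), 2))  # barely under the 10x ceiling
```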

Amdahl's Law assumes a fixed problem size. Gustafson's Law (1988) considers scaling the problem size with the number of processors, giving a more optimistic bound for data-parallel work.

Theorem: Amdahl's Law Bound

For a program with serial fraction $f_s \in (0, 1]$ running on $p$ processors, the parallel speedup satisfies:

$$S_p = \frac{T_1}{T_p} = \frac{1}{f_s + (1-f_s)/p} \le \frac{1}{f_s}$$

with equality in the limit $p \to \infty$. The parallel efficiency is:

$$\eta_p = \frac{S_p}{p} = \frac{1}{p \cdot f_s + (1-f_s)}$$

which decreases toward zero as $p \to \infty$ for any $f_s > 0$.

Even with infinite parallelism, the serial portion bottlenecks the entire computation. This motivates reducing the serial fraction (algorithmic improvement) over adding more cores.

Theorem: Gustafson's Law (Scaled Speedup)

If the problem size scales with the number of processors $p$ such that each processor does the same amount of work, the scaled speedup is:

$$S_p^* = p - f_s(p - 1) = f_s + p(1 - f_s)$$

This grows linearly with $p$, unlike Amdahl's Law, which saturates.
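Comparing the two models numerically for the same serial fraction makes the divergence concrete (helper names are illustrative):

```python
def amdahl(f_s, p):
    # Fixed problem size: speedup saturates at 1/f_s
    return 1.0 / (f_s + (1.0 - f_s) / p)

def gustafson(f_s, p):
    # Problem size grows with p: scaled speedup grows linearly
    return f_s + p * (1.0 - f_s)

# Same 10% serial fraction, 10 processors: very different predictions
print(f"Amdahl: {amdahl(0.1, 10):.1f}x, Gustafson: {gustafson(0.1, 10):.1f}x")
```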

Amdahl asks "how fast can I solve a fixed problem?" Gustafson asks "how big a problem can I solve in fixed time?" In practice, scientists scale problem sizes with available compute, making Gustafson's model more realistic.

Example: Parallel Monte Carlo with multiprocessing

Parallelize a Monte Carlo simulation across $p$ cores by splitting the total number of samples into chunks.
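One way to sketch this, assuming NumPy for the sampling and SeedSequence.spawn for statistically independent per-worker streams (estimating pi as the running example):

```python
import numpy as np
from multiprocessing import Pool

def pi_chunk(seed_and_n):
    # Count hits inside the unit quarter-circle for n uniform points,
    # using an independent seeded stream per worker
    seed, n = seed_and_n
    rng = np.random.default_rng(seed)
    xy = rng.random((n, 2))
    return int(np.count_nonzero((xy ** 2).sum(axis=1) <= 1.0))

if __name__ == "__main__":
    n_total, n_workers = 400_000, 4
    # SeedSequence.spawn gives non-overlapping child seeds
    seeds = np.random.SeedSequence(42).spawn(n_workers)
    chunks = [(s, n_total // n_workers) for s in seeds]
    with Pool(processes=n_workers) as pool:
        hits = sum(pool.map(pi_chunk, chunks))
    print(round(4 * hits / n_total, 2))  # close to 3.14
```

Spawned SeedSequence children are picklable, so they travel cleanly to worker processes.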

Example: Progress Tracking with concurrent.futures

Submit jobs to a ProcessPoolExecutor and process results as they complete, showing a progress bar.
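A sketch with a plain printed counter standing in for a progress-bar library (the simulate helper is a stand-in for a long-running job):

```python
import math
from concurrent.futures import ProcessPoolExecutor, as_completed

def simulate(n):
    # hypothetical stand-in for an expensive simulation task
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    jobs = [100, 200, 300, 400]
    with ProcessPoolExecutor(max_workers=2) as exe:
        futures = [exe.submit(simulate, n) for n in jobs]
        done = 0
        for fut in as_completed(futures):   # results arrive as they finish
            done += 1
            print(f"progress: {done}/{len(futures)}  latest={fut.result():.1f}")
```

Swapping the print for a tqdm update (wrap the as_completed iterator in tqdm) gives a live progress bar with no other changes.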

Amdahl's Law Speedup Visualizer

Visualize parallel speedup as a function of the number of processors for different serial fractions, comparing Amdahl and Gustafson models.


Python GIL and Multiprocessing Architecture

Left: threading under the GIL, where only one thread executes Python bytecode at a time. Right: multiprocessing, where each process has its own interpreter and GIL, enabling true parallelism at the cost of inter-process communication overhead.

Quick Check

Which workload benefits from Python threading (not multiprocessing)?

A tight numerical loop computing matrix products in pure Python

Downloading 100 files from the internet concurrently

Training a neural network with custom Python gradient code

Sorting a large Python list

Quick Check

If 5% of a program is serial, what is the maximum speedup with unlimited processors according to Amdahl's Law?

5x

20x

95x

Unlimited

Common Mistake: Pickling Failures in multiprocessing

Mistake:

Trying to pass a lambda or local function to Pool.map():

from multiprocessing import Pool

with Pool(4) as pool:
    pool.map(lambda x: x**2, data)  # PicklingError!

Correction:

Use module-level named functions, functools.partial, or joblib.Parallel (which handles closures via cloudpickle):

def square(x): return x**2
with Pool(4) as pool:
    pool.map(square, data)

Historical Note: The GIL: A 1992 Design Decision

1992-2024

Guido van Rossum added the GIL in 1992, when thread support first came to CPython, to simplify memory management with reference counting. Multiple attempts to remove it (e.g., Greg Stein's 1999 free-threading patch) showed that removing the GIL degraded single-threaded performance by roughly 40%. The no-GIL project (PEP 703, by Sam Gross, accepted in 2023) finally achieved GIL removal with under 5% single-thread overhead, landing in CPython 3.13 as an experimental build option.

Why This Matters: Parallel BER Simulations

Monte Carlo BER simulations (Chapter 9) are embarrassingly parallel: each SNR point is independent. Using ProcessPoolExecutor with SeedSequence.spawn() for reproducible RNG, you can simulate all SNR points simultaneously. A 20-point BER curve that takes 2 hours sequentially finishes in ~15 minutes on 8 cores.
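A hedged sketch of the idea for a BPSK-over-AWGN link (the model and parameters are illustrative, not Chapter 9's exact code):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def ber_point(args):
    # One SNR point of a BPSK-over-AWGN Monte Carlo (illustrative model)
    seed, snr_db, n_bits = args
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, n_bits)
    symbols = 1 - 2 * bits                          # BPSK: 0 -> +1, 1 -> -1
    sigma = np.sqrt(1 / (2 * 10 ** (snr_db / 10)))  # noise std per dimension
    rx = symbols + sigma * rng.standard_normal(n_bits)
    return float(np.mean((rx < 0) != bits.astype(bool)))

if __name__ == "__main__":
    snrs = np.arange(0, 8, 2)
    # Independent, reproducible RNG streams: one spawned seed per SNR point
    seeds = np.random.SeedSequence(7).spawn(len(snrs))
    jobs = [(s, snr, 200_000) for s, snr in zip(seeds, snrs)]
    with ProcessPoolExecutor(max_workers=4) as exe:
        bers = list(exe.map(ber_point, jobs))
    # BER should decrease monotonically with SNR
    print(all(b1 > b2 for b1, b2 in zip(bers, bers[1:])))
```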

See full treatment in Chapter 9

Key Takeaway

Choose your parallelism tool by workload type: threading for I/O-bound tasks, multiprocessing.Pool or concurrent.futures for CPU-bound tasks, joblib for scikit-learn-style parallel loops, and Numba's prange for tight numerical loops. Always measure the serial fraction first; Amdahl's Law sets the ceiling.

GIL (Global Interpreter Lock)

A mutex in CPython that prevents multiple threads from executing Python bytecode simultaneously, simplifying memory management at the cost of CPU-bound thread parallelism.

Related: Multiprocessing

Multiprocessing

Running multiple Python interpreter processes, each with its own GIL, to achieve true CPU parallelism. Data is exchanged via serialization (pickle).

Related: GIL (Global Interpreter Lock)

Multiprocessing and Parallelism

Complete examples of multiprocessing.Pool, concurrent.futures, joblib.Parallel, and threading for I/O-bound work (ch14/python/parallelism.py).