Multiprocessing and Parallelism
Why Multiprocessing?
Python's Global Interpreter Lock (GIL) prevents true multi-threaded parallelism for CPU-bound work. To utilize multiple CPU cores, you need multiprocessing (separate processes, each with its own GIL) or external libraries that release the GIL (NumPy, Numba).
This section covers multiprocessing.Pool, concurrent.futures,
joblib, and the GIL: the tools for scaling CPU-bound scientific
computation across cores.
Definition: The Global Interpreter Lock (GIL)
The Global Interpreter Lock (GIL) is a mutex in CPython that
allows only one thread to execute Python bytecode at a time. This
means threading.Thread does not provide CPU parallelism for
pure Python code.
import threading

def cpu_bound_func(n):
    # Stand-in for pure-Python CPU-bound work
    return sum(i * i for i in range(n))

data1 = data2 = 10_000_000

# These threads run SEQUENTIALLY due to the GIL:
t1 = threading.Thread(target=cpu_bound_func, args=(data1,))
t2 = threading.Thread(target=cpu_bound_func, args=(data2,))
t1.start(); t2.start()
t1.join(); t2.join()
The GIL is released during I/O operations and by C extensions (NumPy, SciPy), so threading works for I/O-bound and NumPy-heavy workloads.
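A minimal sketch of this in action, using time.sleep as a stand-in for real I/O (the sleep releases the GIL exactly as a blocking socket read would):
import threading, time

def io_task():
    time.sleep(1.0)  # releases the GIL while blocked, like real I/O

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(f"8 one-second waits took {time.perf_counter() - start:.2f}s")  # ~1 s, not ~8 s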
CPython 3.13 introduced an experimental free-threaded mode
(python3.13t) that disables the GIL entirely. As of 2025,
this is opt-in and not yet production-ready.
Definition: Process Pools with multiprocessing.Pool
multiprocessing.Pool spawns worker processes, each with its own
Python interpreter and GIL. Work is distributed via:
from multiprocessing import Pool

def process_chunk(data):
    return heavy_computation(data)

with Pool(processes=4) as pool:
    results = pool.map(process_chunk, data_chunks)
Data is serialized (pickled) to send to workers and deserialized
on return. This serialization overhead makes Pool inefficient
for small tasks or large data transfers.
Use pool.starmap() for functions with multiple arguments,
pool.imap() for lazy iteration, and pool.apply_async() for
individual asynchronous tasks.
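A short sketch of these variants, using a hypothetical module-level add function:
from multiprocessing import Pool

def add(x, y):
    return x + y

if __name__ == "__main__":
    pairs = [(1, 2), (3, 4), (5, 6)]
    with Pool(4) as pool:
        print(pool.starmap(add, pairs))        # [3, 7, 11]: tuples unpacked into args
        for s in pool.imap(str, range(3)):     # lazy: yields results one at a time, in input order
            print(s)
        res = pool.apply_async(add, (10, 20))  # single non-blocking task
        print(res.get())                       # 30, blocks until done
Both map() and imap() also accept a chunksize argument that batches many small tasks per pickle round-trip, amortizing the serialization overhead noted above.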
Definition: concurrent.futures: Unified Executor API
concurrent.futures provides a high-level API with two executors:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# CPU-bound: use processes
with ProcessPoolExecutor(max_workers=4) as exe:
    futures = [exe.submit(compute, chunk) for chunk in chunks]
    results = [f.result() for f in futures]

# I/O-bound: use threads
with ThreadPoolExecutor(max_workers=8) as exe:
    futures = [exe.submit(download, url) for url in urls]
    results = [f.result() for f in futures]
The Future object provides .result(), .done(), .cancel(),
and supports as_completed() for processing results as they arrive.
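When you just want results back in input order, Executor.map is a simpler alternative to submit; a minimal sketch with a trivial stand-in task:
from concurrent.futures import ProcessPoolExecutor

def square(x):  # trivial stand-in for a CPU-bound task
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as exe:
        # Results arrive in input order, regardless of completion order
        print(list(exe.map(square, range(8))))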
Definition: joblib: Easy Parallel Loops
joblib.Parallel provides a concise syntax for parallelizing loops,
with automatic backend selection and memory mapping for large arrays:
from joblib import Parallel, delayed
results = Parallel(n_jobs=4, backend='loky')(
    delayed(process)(item) for item in data
)
joblib's loky backend handles process management more robustly
than multiprocessing.Pool, with better error reporting and
automatic cleanup.
joblib is the parallelism backend used by scikit-learn. It automatically memory-maps large NumPy arrays to avoid copying them to each worker process.
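A sketch of that memory-mapping behavior: joblib's max_nbytes parameter (default '1M') controls when an input array is dumped to disk once and memory-mapped into each worker instead of being pickled per task (row_norm is a hypothetical helper):
import numpy as np
from joblib import Parallel, delayed

def row_norm(big, i):
    # `big` arrives as a read-only memory map, not a per-task pickled copy
    return np.linalg.norm(big[i])

if __name__ == "__main__":
    big = np.random.default_rng(0).random((1000, 10_000))  # ~80 MB
    norms = Parallel(n_jobs=4, max_nbytes='1M')(
        delayed(row_norm)(big, i) for i in range(big.shape[0])
    )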
Definition: Amdahl's Law
Amdahl's Law
Amdahl's Law gives the theoretical maximum speedup when parallelizing a program with serial fraction $s$:

$$S(N) = \frac{1}{s + (1 - s)/N}$$

where $N$ is the number of parallel workers. As $N \to \infty$:

$$S(\infty) = \frac{1}{s}$$

If 10% of your code is serial ($s = 0.1$), the maximum possible speedup is $1/0.1 = 10\times$, regardless of how many cores you have.
Amdahl's Law assumes a fixed problem size. Gustafson's Law (1988) considers scaling the problem size with the number of processors, giving a more optimistic bound for data-parallel work.
Theorem: Amdahl's Law Bound
For a program with serial fraction $s \in (0, 1]$ running on $N$ processors, the parallel speedup satisfies:

$$S(N) = \frac{1}{s + (1 - s)/N} \leq \frac{1}{s}$$

with equality in the limit $N \to \infty$. The parallel efficiency is:

$$E(N) = \frac{S(N)}{N} = \frac{1}{sN + (1 - s)}$$

which decreases toward zero as $N \to \infty$ for any $s > 0$.
Even with infinite parallelism, the serial portion bottlenecks the entire computation. This motivates reducing the serial fraction (algorithmic improvement) over adding more cores.
Decompose execution time
Write $T_1 = T_s + T_p$, where $T_s = s\,T_1$ is serial and $T_p = (1 - s)\,T_1$ is parallelizable.
Parallel execution
With $N$ workers: $T_N = s\,T_1 + (1 - s)\,T_1/N$. Thus $S(N) = T_1/T_N = \frac{1}{s + (1 - s)/N}$.
Theorem: Gustafson's Law (Scaled Speedup)
If the problem size scales with the number of processors such that each processor does the same amount of work, the scaled speedup is:

$$S_{\text{scaled}}(N) = s + (1 - s)N = N - s(N - 1)$$

This grows linearly with $N$, unlike Amdahl's Law, which saturates at $1/s$.
Amdahl asks "how fast can I solve a fixed problem?" Gustafson asks "how big a problem can I solve in fixed time?" In practice, scientists scale problem sizes with available compute, making Gustafson's model more realistic.
Example: Parallel Monte Carlo with multiprocessing
Parallelize a Monte Carlo simulation across cores by splitting the total number of samples into chunks.
Implementation
from multiprocessing import Pool
import numpy as np

def mc_chunk(args):
    n_samples, seed = args
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples)
    y = rng.random(n_samples)
    return np.sum(x**2 + y**2 <= 1.0)

if __name__ == "__main__":   # guard required on spawn-based platforms (Windows, macOS)
    N = 10_000_000
    n_workers = 4
    chunk_size = N // n_workers
    seeds = [42 + i for i in range(n_workers)]

    with Pool(n_workers) as pool:
        counts = pool.map(mc_chunk, zip([chunk_size] * n_workers, seeds))

    pi_est = 4.0 * sum(counts) / N
    print(f"pi = {pi_est:.6f}")
Each worker gets an independent RNG seed. The total count is summed after all workers finish. Embarrassingly parallel problems like this one achieve near-linear speedup.
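Sequential integer seeds are adequate here, but NumPy's recommended approach for independent worker streams is SeedSequence.spawn(), mentioned again below; a minimal sketch of the substitution:
import numpy as np

# One root seed; spawn() derives statistically independent child sequences
root = np.random.SeedSequence(42)
child_seeds = root.spawn(4)                          # one per worker
rngs = [np.random.default_rng(s) for s in child_seeds]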
Example: Progress Tracking with concurrent.futures
Submit jobs to a ProcessPoolExecutor and process results as they
complete, showing a progress bar.
Implementation
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def simulate_snr(snr_db):
    # ... heavy BER simulation ...
    ber = ...  # placeholder for the computed bit-error rate
    return snr_db, ber

if __name__ == "__main__":
    snr_values = range(0, 21)
    with ProcessPoolExecutor(max_workers=8) as exe:
        futures = {exe.submit(simulate_snr, s): s for s in snr_values}
        results = {}
        for f in tqdm(as_completed(futures), total=len(futures)):
            snr, ber = f.result()
            results[snr] = ber
as_completed() yields futures in the order they finish, not
submission order. This enables real-time progress tracking.
Amdahl's Law Speedup Visualizer
[Interactive: visualizes parallel speedup as a function of the number of processors for different serial fractions, comparing the Amdahl and Gustafson models.]
[Figure: Python GIL and multiprocessing architecture.]
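In place of the interactive widget, a minimal matplotlib sketch that produces the same comparison:
import numpy as np
import matplotlib.pyplot as plt

N = np.arange(1, 257)
for s in (0.01, 0.05, 0.25):
    plt.plot(N, 1.0 / (s + (1.0 - s) / N), label=f'Amdahl, s={s}')
    plt.plot(N, s + (1.0 - s) * N, '--', label=f'Gustafson, s={s}')
plt.xscale('log', base=2)
plt.yscale('log')
plt.xlabel('Number of processors N')
plt.ylabel('Speedup')
plt.legend()
plt.show()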
Quick Check
Which workload benefits from Python threading (not multiprocessing)?
A tight numerical loop computing matrix products in pure Python
Downloading 100 files from the internet concurrently
Training a neural network with custom Python gradient code
Sorting a large Python list
Correct. The GIL is released during I/O operations, so threading provides real concurrency for I/O-bound workloads.
Quick Check
If 5% of a program is serial, what is the maximum speedup with unlimited processors according to Amdahl's Law?
5x
20x
95x
Unlimited
Correct. $S_{\max} = 1/s = 1/0.05 = 20\times$.
Common Mistake: Pickling Failures in multiprocessing
Mistake:
Trying to pass a lambda or local function to Pool.map():
with Pool(4) as pool:
    pool.map(lambda x: x**2, data)  # PicklingError!
Correction:
Use module-level named functions, functools.partial, or
joblib.Parallel (which handles closures via cloudpickle):
def square(x):
    return x**2

with Pool(4) as pool:
    pool.map(square, data)
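The correction above also names functools.partial; a sketch of how it helps when the worker needs fixed extra arguments (scaled_power is a hypothetical example):
from functools import partial
from multiprocessing import Pool

def scaled_power(x, exponent, scale):
    return scale * x ** exponent

if __name__ == "__main__":
    # partial objects pickle cleanly when the wrapped function is module-level
    square_and_double = partial(scaled_power, exponent=2, scale=2.0)
    with Pool(4) as pool:
        print(pool.map(square_and_double, range(5)))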
Historical Note: The GIL: A 1992 Design Decision
1992-2024: Guido van Rossum added the GIL in 1992, when thread support first entered CPython, to simplify memory management with reference counting. Multiple attempts to remove it (e.g., Greg Stein's 1999 free-threading patch) showed that removing the GIL degraded single-threaded performance by ~40%. The no-GIL project (PEP 703, by Sam Gross, accepted in 2023) finally achieved GIL removal with <5% single-thread overhead, landing in CPython 3.13 as an experimental option.
Why This Matters: Parallel BER Simulations
Monte Carlo BER simulations (Chapter 9) are embarrassingly parallel:
each SNR point is independent. Using ProcessPoolExecutor with
SeedSequence.spawn() for reproducible RNG, you can simulate
all SNR points simultaneously. A 20-point BER curve that takes
2 hours sequentially finishes in ~15 minutes on 8 cores.
See full treatment in Chapter 9
Key Takeaway
Choose your parallelism tool by workload type: threading for
I/O-bound tasks, multiprocessing.Pool or concurrent.futures for
CPU-bound tasks, joblib for scikit-learn-style parallel loops,
and Numba's prange for tight numerical loops. Always measure
the serial fraction first; Amdahl's Law sets the ceiling.
GIL (Global Interpreter Lock)
A mutex in CPython that prevents multiple threads from executing Python bytecode simultaneously, simplifying memory management at the cost of CPU-bound thread parallelism.
Related: Multiprocessing
Multiprocessing
Running multiple Python interpreter processes, each with its own GIL, to achieve true CPU parallelism. Data is exchanged via serialization (pickle).
Related: GIL (Global Interpreter Lock)