Multiprocessing and Parallelism
Why Multiprocessing?
Python's Global Interpreter Lock (GIL) prevents true multi-threaded parallelism for CPU-bound work. To utilize multiple CPU cores, you need multiprocessing (separate processes, each with its own GIL) or external libraries that release the GIL (NumPy, Numba).
This section covers multiprocessing.Pool, concurrent.futures,
joblib, and the GIL: the tools for scaling CPU-bound scientific
computation across cores.
Definition: The Global Interpreter Lock (GIL)
The Global Interpreter Lock (GIL) is a mutex in CPython that
allows only one thread to execute Python bytecode at a time. This
means threading.Thread does not provide CPU parallelism for
pure Python code.
import threading

def cpu_bound_func(n):
    # Stand-in for pure-Python CPU-bound work
    return sum(i * i for i in range(n))

data1 = data2 = 10_000_000

# These threads run SEQUENTIALLY due to the GIL:
t1 = threading.Thread(target=cpu_bound_func, args=(data1,))
t2 = threading.Thread(target=cpu_bound_func, args=(data2,))
t1.start(); t2.start()
t1.join(); t2.join()
The GIL is released during I/O operations and by C extensions (NumPy, SciPy), so threading works for I/O-bound and NumPy-heavy workloads.
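A minimal sketch of this in action, using time.sleep as a stand-in for real I/O (the sleep releases the GIL exactly as a blocking socket read would):
import threading, time

def io_task():
    time.sleep(1.0)  # releases the GIL while blocked, like real I/O

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(f"8 one-second waits took {time.perf_counter() - start:.2f}s")  # ~1 s, not ~8 s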
CPython 3.13 introduced an experimental free-threaded mode
(python3.13t) that disables the GIL entirely. As of 2025,
this is opt-in and not yet production-ready.
Definition: Process Pools with multiprocessing.Pool
multiprocessing.Pool spawns worker processes, each with its own
Python interpreter and GIL. Work is distributed via:
from multiprocessing import Pool

def process_chunk(data):
    return heavy_computation(data)

with Pool(processes=4) as pool:
    results = pool.map(process_chunk, data_chunks)
Data is serialized (pickled) to send to workers and deserialized
on return. This serialization overhead makes Pool inefficient
for small tasks or large data transfers.
Use pool.starmap() for functions with multiple arguments,
pool.imap() for lazy iteration, and pool.apply_async() for
individual asynchronous tasks.
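A short sketch of these variants, using a hypothetical module-level add function:
from multiprocessing import Pool

def add(x, y):
    return x + y

if __name__ == "__main__":
    pairs = [(1, 2), (3, 4), (5, 6)]
    with Pool(4) as pool:
        print(pool.starmap(add, pairs))        # [3, 7, 11]: tuples unpacked into args
        for s in pool.imap(str, range(3)):     # lazy: yields results one at a time, in input order
            print(s)
        res = pool.apply_async(add, (10, 20))  # single non-blocking task
        print(res.get())                       # 30, blocks until done
Both map() and imap() also accept a chunksize argument that batches many small tasks per pickle round-trip, amortizing the serialization overhead noted above.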
Definition: concurrent.futures: Unified Executor API
concurrent.futures provides a high-level API with two executors:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# CPU-bound: use processes
with ProcessPoolExecutor(max_workers=4) as exe:
    futures = [exe.submit(compute, chunk) for chunk in chunks]
    results = [f.result() for f in futures]

# I/O-bound: use threads
with ThreadPoolExecutor(max_workers=8) as exe:
    futures = [exe.submit(download, url) for url in urls]
    results = [f.result() for f in futures]
The Future object provides .result(), .done(), .cancel(),
and supports as_completed() for processing results as they arrive.
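When you just want results back in input order, Executor.map is a simpler alternative to submit; a minimal sketch with a trivial stand-in task:
from concurrent.futures import ProcessPoolExecutor

def square(x):  # trivial stand-in for a CPU-bound task
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as exe:
        # Results arrive in input order, regardless of completion order
        print(list(exe.map(square, range(8))))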
Definition: joblib: Easy Parallel Loops
joblib.Parallel provides a concise syntax for parallelizing loops,
with automatic backend selection and memory mapping for large arrays:
from joblib import Parallel, delayed
results = Parallel(n_jobs=4, backend='loky')(
    delayed(process)(item) for item in data
)
joblib's loky backend handles process management more robustly
than multiprocessing.Pool, with better error reporting and
automatic cleanup.
joblib is the parallelism backend used by scikit-learn. It automatically memory-maps large NumPy arrays to avoid copying them to each worker process.
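A sketch of that memory-mapping behavior: joblib's max_nbytes parameter (default '1M') controls when an input array is dumped to disk once and memory-mapped into each worker instead of being pickled per task (row_norm is a hypothetical helper):
import numpy as np
from joblib import Parallel, delayed

def row_norm(big, i):
    # `big` arrives as a read-only memory map, not a per-task pickled copy
    return np.linalg.norm(big[i])

if __name__ == "__main__":
    big = np.random.default_rng(0).random((1000, 10_000))  # ~80 MB
    norms = Parallel(n_jobs=4, max_nbytes='1M')(
        delayed(row_norm)(big, i) for i in range(big.shape[0])
    )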
Definition: Amdahl's Law
Amdahl's Law
Amdahl's Law gives the theoretical maximum speedup when parallelizing a program with serial fraction $s$:

$$S(N) = \frac{1}{s + (1 - s)/N}$$

where $N$ is the number of parallel workers. As $N \to \infty$:

$$S(\infty) = \frac{1}{s}$$

If 10% of your code is serial ($s = 0.1$), the maximum possible speedup is $1/0.1 = 10\times$, regardless of how many cores you have.
Amdahl's Law assumes a fixed problem size. Gustafson's Law (1988) considers scaling the problem size with the number of processors, giving a more optimistic bound for data-parallel work.
Theorem: Amdahl's Law Bound
For a program with serial fraction $s \in (0, 1]$ running on $N$ processors, the parallel speedup satisfies:

$$S(N) = \frac{1}{s + (1 - s)/N} \leq \frac{1}{s}$$

with equality in the limit $N \to \infty$. The parallel efficiency is:

$$E(N) = \frac{S(N)}{N} = \frac{1}{sN + (1 - s)}$$

which decreases toward zero as $N \to \infty$ for any $s > 0$.
Even with infinite parallelism, the serial portion bottlenecks the entire computation. This motivates reducing the serial fraction (algorithmic improvement) over adding more cores.
Decompose execution time
Write $T_1 = T_s + T_p$, where $T_s = s\,T_1$ is serial and $T_p = (1 - s)\,T_1$ is parallelizable.
Parallel execution
With $N$ workers: $T_N = s\,T_1 + (1 - s)\,T_1/N$. Thus $S(N) = T_1/T_N = \frac{1}{s + (1 - s)/N}$.
Theorem: Gustafson's Law (Scaled Speedup)
If the problem size scales with the number of processors such that each processor does the same amount of work, the scaled speedup is:

$$S_{\text{scaled}}(N) = s + (1 - s)N = N - s(N - 1)$$

This grows linearly with $N$, unlike Amdahl's Law, which saturates at $1/s$.
Amdahl asks "how fast can I solve a fixed problem?" Gustafson asks "how big a problem can I solve in fixed time?" In practice, scientists scale problem sizes with available compute, making Gustafson's model more realistic.
Example: Parallel Monte Carlo with multiprocessing
Parallelize a Monte Carlo simulation across cores by splitting the total number of samples into chunks.
Implementation
from multiprocessing import Pool
import numpy as np

def mc_chunk(args):
    n_samples, seed = args
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples)
    y = rng.random(n_samples)
    return np.sum(x**2 + y**2 <= 1.0)

if __name__ == "__main__":   # guard required on spawn-based platforms (Windows, macOS)
    N = 10_000_000
    n_workers = 4
    chunk_size = N // n_workers
    seeds = [42 + i for i in range(n_workers)]

    with Pool(n_workers) as pool:
        counts = pool.map(mc_chunk, zip([chunk_size] * n_workers, seeds))

    pi_est = 4.0 * sum(counts) / N
    print(f"pi = {pi_est:.6f}")
Each worker gets an independent RNG seed. The total count is summed after all workers finish. Embarrassingly parallel problems like this one achieve near-linear speedup.
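Sequential integer seeds are adequate here, but NumPy's recommended approach for independent worker streams is SeedSequence.spawn(), mentioned again below; a minimal sketch of the substitution:
import numpy as np

# One root seed; spawn() derives statistically independent child sequences
root = np.random.SeedSequence(42)
child_seeds = root.spawn(4)                          # one per worker
rngs = [np.random.default_rng(s) for s in child_seeds]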
Example: Progress Tracking with concurrent.futures
Submit jobs to a ProcessPoolExecutor and process results as they
complete, showing a progress bar.
Implementation
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def simulate_snr(snr_db):
    # ... heavy BER simulation ...
    ber = ...  # placeholder for the computed bit-error rate
    return snr_db, ber

if __name__ == "__main__":
    snr_values = range(0, 21)
    with ProcessPoolExecutor(max_workers=8) as exe:
        futures = {exe.submit(simulate_snr, s): s for s in snr_values}
        results = {}
        for f in tqdm(as_completed(futures), total=len(futures)):
            snr, ber = f.result()
            results[snr] = ber
as_completed() yields futures in the order they finish, not
submission order. This enables real-time progress tracking.
Amdahl's Law Speedup Visualizer
[Interactive: visualizes parallel speedup as a function of the number of processors for different serial fractions, comparing the Amdahl and Gustafson models.]
[Figure: Python GIL and multiprocessing architecture.]
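In place of the interactive widget, a minimal matplotlib sketch that produces the same comparison:
import numpy as np
import matplotlib.pyplot as plt

N = np.arange(1, 257)
for s in (0.01, 0.05, 0.25):
    plt.plot(N, 1.0 / (s + (1.0 - s) / N), label=f'Amdahl, s={s}')
    plt.plot(N, s + (1.0 - s) * N, '--', label=f'Gustafson, s={s}')
plt.xscale('log', base=2)
plt.yscale('log')
plt.xlabel('Number of processors N')
plt.ylabel('Speedup')
plt.legend()
plt.show()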
Quick Check
Which workload benefits from Python threading (not multiprocessing)?
A tight numerical loop computing matrix products in pure Python
Downloading 100 files from the internet concurrently
Training a neural network with custom Python gradient code
Sorting a large Python list
Correct. The GIL is released during I/O operations, so threading provides real concurrency for I/O-bound workloads.
Quick Check
If 5% of a program is serial, what is the maximum speedup with unlimited processors according to Amdahl's Law?
5x
20x
95x
Unlimited
Correct. $S_{\max} = 1/s = 1/0.05 = 20\times$.
Common Mistake: Pickling Failures in multiprocessing
Mistake:
Trying to pass a lambda or local function to Pool.map():
with Pool(4) as pool:
    pool.map(lambda x: x**2, data)  # PicklingError!
Correction:
Use module-level named functions, functools.partial, or
joblib.Parallel (which handles closures via cloudpickle):
def square(x):
    return x**2

with Pool(4) as pool:
    pool.map(square, data)
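The correction above also names functools.partial; a sketch of how it helps when the worker needs fixed extra arguments (scaled_power is a hypothetical example):
from functools import partial
from multiprocessing import Pool

def scaled_power(x, exponent, scale):
    return scale * x ** exponent

if __name__ == "__main__":
    # partial objects pickle cleanly when the wrapped function is module-level
    square_and_double = partial(scaled_power, exponent=2, scale=2.0)
    with Pool(4) as pool:
        print(pool.map(square_and_double, range(5)))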
Historical Note: The GIL: A 1992 Design Decision
1992-2024: Guido van Rossum added the GIL in 1992, when thread support first entered CPython, to simplify memory management with reference counting. Multiple attempts to remove it (e.g., Greg Stein's 1999 free-threading patch) showed that removing the GIL degraded single-threaded performance by ~40%. The no-GIL project (PEP 703, by Sam Gross, accepted in 2023) finally achieved GIL removal with <5% single-thread overhead, landing in CPython 3.13 as an experimental option.
Why This Matters: Parallel BER Simulations
Monte Carlo BER simulations (Chapter 9) are embarrassingly parallel:
each SNR point is independent. Using ProcessPoolExecutor with
SeedSequence.spawn() for reproducible RNG, you can simulate
all SNR points simultaneously. A 20-point BER curve that takes
2 hours sequentially finishes in ~15 minutes on 8 cores.
See full treatment in Chapter 9
Key Takeaway
Choose your parallelism tool by workload type: threading for
I/O-bound tasks, multiprocessing.Pool or concurrent.futures for
CPU-bound tasks, joblib for scikit-learn-style parallel loops,
and Numba's prange for tight numerical loops. Always measure
the serial fraction first; Amdahl's Law sets the ceiling.
GIL (Global Interpreter Lock)
A mutex in CPython that prevents multiple threads from executing Python bytecode simultaneously, simplifying memory management at the cost of CPU-bound thread parallelism.
Related: Multiprocessing
Multiprocessing
Running multiple Python interpreter processes, each with its own GIL, to achieve true CPU parallelism. Data is exchanged via serialization (pickle).
Related: GIL (Global Interpreter Lock)