Chapter Summary

Key Points

  1. Use Numba for tight numerical loops. @numba.njit compiles Python loops to native machine code via LLVM, achieving 50-200x speedups over pure Python. Always use nopython=True (or @njit), warm the function up before benchmarking (the first call triggers compilation), and use parallel=True with prange for multi-core scaling. @numba.vectorize creates custom ufuncs, and @numba.cuda.jit targets NVIDIA GPUs. A minimal sketch follows this list.

  2. Use JAX for functional numerical computing. JAX provides composable transformations: jax.jit for XLA compilation, jax.grad for automatic differentiation, and jax.vmap for automatic vectorization. Write pure functions with jax.numpy and compose transformations freely: jax.jit(jax.vmap(jax.grad(f))) gives a compiled, batched gradient. JAX arrays are immutable; use x.at[i].set(v) instead of x[i] = v. See the JAX sketch after this list.

  3. Choose the right parallelism tool. The GIL prevents CPU-bound thread parallelism in CPython. Use multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor for CPU-bound work, ThreadPoolExecutor for I/O-bound work, and joblib.Parallel for scikit-learn-style loops. Amdahl's Law sets the speedup ceiling: $S_\infty = 1/f_s$, where $f_s$ is the serial fraction. A process-pool sketch follows this list.

  4. Interface with C/C++ when needed. Use ctypes for quick calls to existing shared libraries (no compilation step), cffi for safer C API wrapping, pybind11 for new C++ extensions with automatic NumPy conversion, and f2py for Fortran codes. Always pass contiguous arrays and batch operations to amortize FFI call overhead. A ctypes sketch follows this list.
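
First, a minimal Numba sketch for point 1; the function name and array size are illustrative, and the first call compiles the loop, so it doubles as the warm-up.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sum_of_squares(x):
    # Compiled to native machine code via LLVM; prange splits
    # the loop iterations across CPU cores.
    total = 0.0
    for i in prange(x.size):
        total += x[i] * x[i]
    return total

x = np.random.rand(1_000_000)
sum_of_squares(x)             # first call compiles (warm-up)
result = sum_of_squares(x)    # subsequent calls run at native speed
```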
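
Next, a JAX sketch for point 2, assuming a toy loss function: composing jit, vmap, and grad yields a compiled per-example gradient, and the .at update stands in for in-place assignment.

```python
import jax
import jax.numpy as jnp

def loss(w):
    # A pure function: no side effects, no in-place mutation.
    return jnp.sum(jnp.tanh(w) ** 2)

# Compiled, batched gradient: grad per example, vmap over the
# leading batch axis, jit to compile the whole pipeline via XLA.
batched_grad = jax.jit(jax.vmap(jax.grad(loss)))

ws = jnp.ones((8, 3))       # a batch of 8 parameter vectors
grads = batched_grad(ws)    # gradients, shape (8, 3)

# Arrays are immutable: functional update instead of ws[0, 0] = 0.0
ws2 = ws.at[0, 0].set(0.0)
```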
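
For point 3, a process-pool sketch with a made-up CPU-bound function; processes sidestep the GIL, while the same code with ThreadPoolExecutor would suit I/O-bound work instead.

```python
import math
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # Pure computation: threads would serialize on the GIL here,
    # but separate processes scale across cores.
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":  # guard required when spawning worker processes
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_bound, [10**6] * 4))
    # Amdahl's Law: if 10% of the job stays serial, speedup caps at 1/0.1 = 10x.
    print(results)
```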
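
Finally, a ctypes sketch for point 4. The shared library libvecops.so and its scale function are hypothetical placeholders; the pattern to note is declaring argtypes, ensuring contiguity, and making one batched call.

```python
import ctypes
import numpy as np

# Hypothetical library exporting: void scale(double *x, size_t n, double factor);
lib = ctypes.CDLL("./libvecops.so")
lib.scale.argtypes = [ctypes.POINTER(ctypes.c_double),
                      ctypes.c_size_t,
                      ctypes.c_double]
lib.scale.restype = None

x = np.ascontiguousarray(np.arange(5, dtype=np.float64))  # contiguous buffer for C
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
lib.scale(ptr, x.size, 2.0)  # one batched call amortizes the FFI overhead
```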

Looking Ahead

Chapter 15 applies these acceleration techniques to large-scale data processing with Pandas and Dask. The multiprocessing patterns from Section 14.3 appear again in Dask's distributed scheduler, and the JIT compilation from Section 14.1 can accelerate custom Pandas apply functions via Numba.