Vectorization and Avoiding Loops

The 100x Speedup

A Python for loop over a NumPy array is interpreted, dynamically typed, and cache-unfriendly. The same operation expressed as a vectorized NumPy call runs in compiled C, operates on contiguous memory, and can leverage SIMD instructions.

The typical speedup is 50-200x for numerical operations. Learning to "think in arrays" rather than "think in loops" is the single most impactful skill for scientific Python performance.

Definition:

Vectorization

Vectorization is the practice of replacing explicit Python loops with equivalent NumPy array operations that execute in compiled C.

# SLOW: Python loop
result = np.empty(len(x))
for i in range(len(x)):
    result[i] = x[i] ** 2 + 2 * x[i] + 1

# FAST: Vectorized
result = x ** 2 + 2 * x + 1

Vectorized code is not only faster, it is also shorter and often more readable once you learn the idioms.

Definition:

Universal Functions (ufuncs)

A ufunc (universal function) is a NumPy function that operates element-wise on ndarrays, supports broadcasting, and is implemented in compiled C.

Examples: np.sin, np.exp, np.log, np.maximum, np.add.

a = np.array([0, np.pi/4, np.pi/2])
np.sin(a)       # ufunc: [0.0, 0.707, 1.0]

# ufuncs support broadcasting
np.maximum(a[:, None], a[None, :])  # pairwise max, shape (3, 3)

You can create custom ufuncs with np.frompyfunc or (faster) Numba.
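As a concrete sketch of the np.frompyfunc route: the scalar function below is a made-up example, and note that np.frompyfunc returns object-dtype arrays, so an explicit astype is needed to get a numeric result.

```python
import numpy as np

# Hypothetical scalar function, wrapped as a ufunc-like object.
def clip_square(x):
    return min(x * x, 10.0)

# np.frompyfunc(func, nin, nout): 1 input, 1 output
uclip = np.frompyfunc(clip_square, 1, 1)

a = np.array([1.0, 2.0, 5.0])
result = uclip(a).astype(float)   # object array -> float64
```

This gives broadcasting and element-wise application, but each call still goes through the Python interpreter, so it carries loop-level overhead.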

Definition:

np.where, np.select, np.piecewise

These functions replace if-elif-else loops over arrays:

np.where(condition, x, y): element-wise ternary β€” returns x where condition is True, y elsewhere:

a = np.array([-1, 2, -3, 4])
np.where(a > 0, a, 0)   # [0, 2, 0, 4]  β€” ReLU!

np.select(condlist, choicelist, default): multi-branch:

x = np.linspace(-3, 3, 100)
conditions = [x < -1, (x >= -1) & (x <= 1), x > 1]
choices    = [-1, x, 1]          # hard tanh
y = np.select(conditions, choices, default=0)

np.piecewise(x, condlist, funclist): apply different functions per region:

y = np.piecewise(x, [x < 0, x >= 0], [lambda t: -t, lambda t: t**2])
Note that np.piecewise calls each function only on the elements selected by its condition, whereas np.where and np.select evaluate every branch over the full array first.

Definition:

np.vectorize vs np.frompyfunc

np.vectorize wraps a scalar Python function to accept arrays. It does not provide C-level speed β€” it is syntactic sugar for a Python loop with broadcasting support.

def my_func(x, y):
    return x ** 2 + y if x > 0 else y - x

vfunc = np.vectorize(my_func)
result = vfunc(np.array([-1, 2, -3]), np.array([10, 20, 30]))

np.frompyfunc is similar but always returns object-dtype arrays, which usually need an explicit astype conversion afterwards.

For real speed, use np.where / np.select (vectorized) or Numba @jit (compiled).

Theorem: Python Loop Overhead on Arrays

For an array of n elements, a Python loop performing one arithmetic operation per element has time complexity Θ(n · c_py), where c_py includes bytecode dispatch, dynamic type checking, and reference-counting overhead (~100 ns per element). The equivalent NumPy vectorized operation has time complexity Θ(n · c_C), where c_C ≈ 1-5 ns per element.

Each Python loop iteration pays for: bytecode interpretation, type lookup on the operands, calling the C math function through Python's number protocol, and creating a new Python float object for the result. NumPy skips all of this by iterating in C over raw memory.
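The overhead described above can be measured directly. The following is a minimal timeit sketch (array size and repeat count are arbitrary choices; absolute timings are machine-dependent):

```python
import timeit
import numpy as np

x = np.random.rand(100_000)

def loop_version(x):
    # Pays interpreter overhead on every element
    out = np.empty(len(x))
    for i in range(len(x)):
        out[i] = x[i] * 2.0 + 1.0
    return out

def vector_version(x):
    # Same arithmetic, executed in compiled C
    return x * 2.0 + 1.0

t_loop = timeit.timeit(lambda: loop_version(x), number=5)
t_vec = timeit.timeit(lambda: vector_version(x), number=5)
# The vectorized version is typically tens to hundreds of
# times faster; the exact ratio depends on the machine.
```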

Example: Replacing a Loop with Vectorized Operations

Convert the following Python loop into vectorized NumPy code:

result = []
for i in range(len(signal)):
    if signal[i] > threshold:
        result.append(signal[i] * gain)
    else:
        result.append(signal[i])
result = np.array(result)
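One possible vectorized rewrite uses np.where as an element-wise ternary; the sample values for signal, threshold, and gain below are assumed for illustration.

```python
import numpy as np

# Assumed sample inputs
signal = np.array([0.5, 2.0, 1.5, 0.1])
threshold = 1.0
gain = 3.0

# Loop version, kept for comparison
result_loop = np.array([s * gain if s > threshold else s for s in signal])

# Vectorized: amplify only the elements above the threshold
result = np.where(signal > threshold, signal * gain, signal)
```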

Example: Vectorized Activation Functions

Implement the following activation functions without Python loops: ReLU, leaky ReLU, hard sigmoid, and GELU approximation.
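One way this exercise could be solved with ufuncs and np.where. Note that "hard sigmoid" has several definitions in the literature; the PyTorch-style clip(x/6 + 0.5, 0, 1) is assumed here, and GELU uses the common tanh approximation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # Element-wise ternary instead of an if/else loop
    return np.where(x > 0, x, alpha * x)

def hard_sigmoid(x):
    # PyTorch-style definition; other libraries use 0.2*x + 0.5
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
```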

Example: Benchmarking Loop vs Vectorized

Benchmark a simple element-wise computation (compute sin(x) + cos(x)) using a Python loop vs vectorized NumPy for arrays of increasing size.
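A minimal sketch of such a benchmark; the array sizes and repeat count are arbitrary, and the printed speedups will vary by machine.

```python
import math
import timeit
import numpy as np

def loop_sincos(x):
    # Scalar math calls through the interpreter, one per element
    out = np.empty(len(x))
    for i in range(len(x)):
        out[i] = math.sin(x[i]) + math.cos(x[i])
    return out

def vec_sincos(x):
    # Two ufunc calls over the whole array
    return np.sin(x) + np.cos(x)

for n in (1_000, 10_000, 100_000):
    x = np.linspace(0, 2 * np.pi, n)
    t_loop = timeit.timeit(lambda: loop_sincos(x), number=3)
    t_vec = timeit.timeit(lambda: vec_sincos(x), number=3)
    print(f"n={n:>7}: loop {t_loop:.4f}s  vec {t_vec:.4f}s  "
          f"speedup {t_loop / t_vec:.0f}x")
```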

Vectorization Benchmark

Compare execution time of Python loops vs NumPy vectorized operations for different operations and array sizes. See the 100x speedup in real time.


Quick Check

Does np.vectorize provide C-level speed?

No β€” it is syntactic sugar for a Python loop with broadcasting support

Yes β€” it compiles the function to C automatically

Yes, but only for simple arithmetic operations

It depends on the dtype

Common Mistake: The np.vectorize Performance Trap

Mistake:

Using np.vectorize and expecting a speedup:

@np.vectorize
def my_func(x):
    return x ** 2 + 1 if x > 0 else -x
# This is still a Python loop β€” no speedup!

Correction:

Use np.where for conditionals or Numba @jit for complex functions:

# Vectorized (fast)
def my_func(x):
    return np.where(x > 0, x ** 2 + 1, -x)

vectorization

Replacing Python loops with NumPy array operations that execute in compiled C for 50-200x speedup.

Related: ufunc, Broadcasting, SIMD

ufunc

Universal function: a NumPy function that operates element-wise on arrays with broadcasting support, implemented in compiled C.

Related: Vectorization, Broadcasting

Vectorization Speedup

Replace Python loops with vectorized ops. Benchmark the 100x speedup.
# Code from: ch05/python/vectorization_speedup.py

Key Takeaway

Vectorized NumPy operations are 50-200x faster than Python loops. Use np.where for conditionals, np.select for multi-branch logic, and avoid np.vectorize (it is just a loop in disguise). For truly complex scalar functions, use Numba @jit.