Vectorization and Avoiding Loops
The 100x Speedup
A Python for loop over a NumPy array is interpreted, dynamically typed,
and cache-unfriendly. The same operation expressed as a vectorized NumPy
call runs in compiled C, operates on contiguous memory, and can leverage
SIMD instructions.
The typical speedup is 50-200x for numerical operations. Learning to "think in arrays" rather than "think in loops" is the single most impactful skill for scientific Python performance.
Definition: Vectorization
Vectorization
Vectorization is the practice of replacing explicit Python loops with equivalent NumPy array operations that execute in compiled C.
# SLOW: Python loop
result = np.empty(len(x))
for i in range(len(x)):
    result[i] = x[i] ** 2 + 2 * x[i] + 1
# FAST: Vectorized
result = x ** 2 + 2 * x + 1
Vectorized code is not only faster, it is also shorter and often more readable once you learn the idioms.
Definition: Universal Functions (ufuncs)
Universal Functions (ufuncs)
A ufunc (universal function) is a NumPy function that operates element-wise on ndarrays, supports broadcasting, and is implemented in compiled C.
Examples: np.sin, np.exp, np.log, np.maximum, np.add.
a = np.array([0, np.pi/4, np.pi/2])
np.sin(a) # ufunc: [0.0, 0.707, 1.0]
# ufuncs support broadcasting
np.maximum(a[:, None], a[None, :]) # pairwise max, shape (3, 3)
You can create custom ufuncs with np.frompyfunc or (faster) Numba.
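A minimal sketch of np.frompyfunc, using a hypothetical scalar function for illustration. Note the caveat below: the wrapper still calls Python code per element and returns an object-dtype array, so this buys convenience (broadcasting), not speed:

```python
import numpy as np

def clip_square(x):
    # hypothetical scalar function for illustration
    return min(x * x, 10.0)

# np.frompyfunc(func, nin, nout) wraps a scalar function
# as a ufunc-like object with broadcasting support
uf = np.frompyfunc(clip_square, 1, 1)

a = np.array([1.0, 2.0, 5.0])
out = uf(a)                      # object-dtype array
result = out.astype(np.float64)  # convert back to a numeric dtype
```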
Definition: np.where, np.select, np.piecewise
np.where, np.select, np.piecewise
These functions replace if-elif-else loops over arrays:
np.where(condition, x, y): element-wise ternary; returns x
where condition is True, y elsewhere:
a = np.array([-1, 2, -3, 4])
np.where(a > 0, a, 0) # [0, 2, 0, 4], i.e. ReLU
np.select(condlist, choicelist, default): multi-branch:
x = np.linspace(-3, 3, 100)
conditions = [x < -1, (x >= -1) & (x <= 1), x > 1]
choices = [-1, x, 1] # hard tanh
y = np.select(conditions, choices, default=0)
np.piecewise(x, condlist, funclist): apply different functions
per region:
y = np.piecewise(x, [x < 0, x >= 0], [lambda t: -t, lambda t: t**2])
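As a sanity check, the np.select and np.piecewise formulations of the same hard-tanh clamp should agree with the simplest vectorized form, np.clip. A minimal sketch:

```python
import numpy as np

x = np.linspace(-3, 3, 100)

# Multi-branch with np.select (choices may be scalars or arrays)
conditions = [x < -1, (x >= -1) & (x <= 1), x > 1]
choices = [-1, x, 1]
y_select = np.select(conditions, choices, default=0)

# Same function with np.piecewise; funclist entries may be scalars
# or callables, and callables receive only the selected elements
y_piecewise = np.piecewise(
    x, [x < -1, (x >= -1) & (x <= 1), x > 1],
    [-1, lambda t: t, 1]
)

# Both match clipping to [-1, 1]
y_clip = np.clip(x, -1, 1)
```

When the branches reduce to a clamp like this, np.clip is the most direct (and fastest) spelling; np.select and np.piecewise earn their keep when the branches compute genuinely different expressions.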
Definition: np.vectorize vs np.frompyfunc
np.vectorize vs np.frompyfunc
np.vectorize wraps a scalar Python function to accept arrays.
It does not provide C-level speed: it is syntactic sugar for
a Python loop with broadcasting support.
def my_func(x, y):
    return x ** 2 + y if x > 0 else y - x
vfunc = np.vectorize(my_func)
result = vfunc(np.array([-1, 2, -3]), np.array([10, 20, 30]))
np.frompyfunc is similar but returns object arrays.
For real speed, use np.where / np.select (vectorized) or
Numba @jit (compiled).
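A small sketch contrasting the two wrappers on the my_func above. The values agree; the visible difference is the output dtype, and neither escapes the per-element Python call:

```python
import numpy as np

def my_func(x, y):
    return x ** 2 + y if x > 0 else y - x

a = np.array([-1, 2, -3])
b = np.array([10, 20, 30])

# np.vectorize infers a numeric output dtype from the first call
v1 = np.vectorize(my_func)(a, b)

# np.frompyfunc always returns an object-dtype array
v2 = np.frompyfunc(my_func, 2, 1)(a, b)
```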
Theorem: Python Loop Overhead on Arrays
For an array of n elements, a Python loop performing one arithmetic operation per element has time complexity O(n * c_py), where c_py includes bytecode dispatch, dynamic type checking, and reference counting overhead (~100 ns per element). The equivalent NumPy vectorized operation has time complexity O(n * c_C), where c_C is roughly 1-5 ns per element.
Each Python loop iteration pays for: bytecode interpretation, type lookup on the operands, calling the C math function through Python's number protocol, and creating a new Python float object for the result. NumPy skips all of this by iterating in C over raw memory.
Example: Replacing a Loop with Vectorized Operations
Convert the following Python loop into vectorized NumPy code:
result = []
for i in range(len(signal)):
    if signal[i] > threshold:
        result.append(signal[i] * gain)
    else:
        result.append(signal[i])
result = np.array(result)
Vectorized version
result = np.where(signal > threshold, signal * gain, signal)
One line replaces the entire loop. This is ~100x faster for
large arrays and immediately readable once you know np.where.
Example: Vectorized Activation Functions
Implement the following activation functions without Python loops: ReLU, leaky ReLU, hard sigmoid, and GELU approximation.
Vectorized implementations
x = np.linspace(-3, 3, 1000)
# ReLU
relu = np.maximum(x, 0)
# Leaky ReLU (alpha = 0.01)
leaky_relu = np.where(x > 0, x, 0.01 * x)
# Hard sigmoid
hard_sigmoid = np.clip(0.2 * x + 0.5, 0, 1)
# GELU approximation
gelu = 0.5 * x * (1 + np.tanh(
np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)
))
Example: Benchmarking Loop vs Vectorized
Benchmark a simple element-wise computation (compute sin(x) + cos(x)) using a Python loop vs vectorized NumPy for arrays of increasing size.
Benchmark code
import math
import time

import numpy as np

def loop_version(x):
    result = np.empty_like(x)
    for i in range(len(x)):
        result[i] = math.sin(x[i]) + math.cos(x[i])
    return result

def vectorized_version(x):
    return np.sin(x) + np.cos(x)

for n in [1000, 10_000, 100_000, 1_000_000]:
    x = np.random.randn(n)
    t0 = time.perf_counter()
    loop_version(x)
    t_loop = time.perf_counter() - t0
    t0 = time.perf_counter()
    vectorized_version(x)
    t_vec = time.perf_counter() - t0
    print(f"n={n:>10,}  loop={t_loop:.4f}s  vec={t_vec:.6f}s  "
          f"speedup={t_loop/t_vec:.0f}x")
Vectorization Benchmark
Compare execution time of Python loops vs NumPy vectorized operations for different operations and array sizes. See the 100x speedup in real time.
Quick Check
Does np.vectorize provide C-level speed?
No; it is syntactic sugar for a Python loop with broadcasting support
Yes β it compiles the function to C automatically
Yes, but only for simple arithmetic operations
It depends on the dtype
np.vectorize does NOT compile the function. It just loops in Python.
Common Mistake: The np.vectorize Performance Trap
Mistake:
Using np.vectorize and expecting a speedup:
@np.vectorize
def my_func(x):
    return x ** 2 + 1 if x > 0 else -x

# This is still a Python loop; no speedup!
Correction:
Use np.where for conditionals or Numba @jit for complex functions:
# Vectorized (fast)
def my_func(x):
    return np.where(x > 0, x ** 2 + 1, -x)
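A quick sketch that quantifies the trap: the two versions compute identical values, but the np.where form is typically one to two orders of magnitude faster because it never leaves compiled C (timings below are illustrative, not guaranteed):

```python
import time

import numpy as np

def my_func_scalar(x):
    return x ** 2 + 1 if x > 0 else -x

vectorized_sugar = np.vectorize(my_func_scalar)  # Python loop inside

def my_func_fast(x):
    # evaluates both branches in C, then selects element-wise
    return np.where(x > 0, x ** 2 + 1, -x)

x = np.random.randn(100_000)

t0 = time.perf_counter()
slow = vectorized_sugar(x)
t_sugar = time.perf_counter() - t0

t0 = time.perf_counter()
fast = my_func_fast(x)
t_fast = time.perf_counter() - t0
```

Note that np.where evaluates both branch expressions for every element before selecting; that wasted work is still far cheaper than a per-element Python call.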
vectorization
Replacing Python loops with NumPy array operations that execute in compiled C for 50-200x speedup.
Related: ufunc, Broadcasting, SIMD
ufunc
Universal function: a NumPy function that operates element-wise on arrays with broadcasting support, implemented in compiled C.
Related: Vectorization, Broadcasting
Vectorization Speedup
# Code from: ch05/python/vectorization_speedup.py
Key Takeaway
Vectorized NumPy operations are 50-200x faster than Python loops.
Use np.where for conditionals, np.select for multi-branch logic,
and avoid np.vectorize (it is just a loop in disguise). For truly
complex scalar functions, use Numba @jit.