Vectorization and Avoiding Loops
The 100x Speedup
A Python for loop over a NumPy array is interpreted, dynamically typed,
and cache-unfriendly. The same operation expressed as a vectorized NumPy
call runs in compiled C, operates on contiguous memory, and can leverage
SIMD instructions.
The typical speedup is 50-200x for numerical operations. Learning to "think in arrays" rather than "think in loops" is the single most impactful skill for scientific Python performance.
Definition: Vectorization
Vectorization
Vectorization is the practice of replacing explicit Python loops with equivalent NumPy array operations that execute in compiled C.
# SLOW: Python loop
result = np.empty(len(x))
for i in range(len(x)):
    result[i] = x[i] ** 2 + 2 * x[i] + 1
# FAST: Vectorized
result = x ** 2 + 2 * x + 1
Vectorized code is not only faster, it is also shorter and often more readable once you learn the idioms.
Definition: Universal Functions (ufuncs)
Universal Functions (ufuncs)
A ufunc (universal function) is a NumPy function that operates element-wise on ndarrays, supports broadcasting, and is implemented in compiled C.
Examples: np.sin, np.exp, np.log, np.maximum, np.add.
a = np.array([0, np.pi/4, np.pi/2])
np.sin(a) # ufunc: [0.0, 0.707, 1.0]
# ufuncs support broadcasting
np.maximum(a[:, None], a[None, :]) # pairwise max, shape (3, 3)
You can create custom ufuncs with np.frompyfunc or (faster) Numba.
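A minimal sketch of np.frompyfunc, using a hypothetical scalar function for illustration. Note the caveat below: the wrapper still calls Python code per element and returns an object-dtype array, so this buys convenience (broadcasting), not speed:

```python
import numpy as np

def clip_square(x):
    # hypothetical scalar function for illustration
    return min(x * x, 10.0)

# np.frompyfunc(func, nin, nout) wraps a scalar function
# as a ufunc-like object with broadcasting support
uf = np.frompyfunc(clip_square, 1, 1)

a = np.array([1.0, 2.0, 5.0])
out = uf(a)                      # object-dtype array
result = out.astype(np.float64)  # convert back to a numeric dtype
```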
Definition: np.where, np.select, np.piecewise
np.where, np.select, np.piecewise
These functions replace if-elif-else loops over arrays:
np.where(condition, x, y): element-wise ternary; returns x
where condition is True, y elsewhere:
a = np.array([-1, 2, -3, 4])
np.where(a > 0, a, 0) # [0, 2, 0, 4], i.e. ReLU
np.select(condlist, choicelist, default): multi-branch:
x = np.linspace(-3, 3, 100)
conditions = [x < -1, (x >= -1) & (x <= 1), x > 1]
choices = [-1, x, 1] # hard tanh
y = np.select(conditions, choices, default=0)
np.piecewise(x, condlist, funclist): apply different functions
per region:
y = np.piecewise(x, [x < 0, x >= 0], [lambda t: -t, lambda t: t**2])
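As a sanity check, the np.select and np.piecewise formulations of the same hard-tanh clamp should agree with the simplest vectorized form, np.clip. A minimal sketch:

```python
import numpy as np

x = np.linspace(-3, 3, 100)

# Multi-branch with np.select (choices may be scalars or arrays)
conditions = [x < -1, (x >= -1) & (x <= 1), x > 1]
choices = [-1, x, 1]
y_select = np.select(conditions, choices, default=0)

# Same function with np.piecewise; funclist entries may be scalars
# or callables, and callables receive only the selected elements
y_piecewise = np.piecewise(
    x, [x < -1, (x >= -1) & (x <= 1), x > 1],
    [-1, lambda t: t, 1]
)

# Both match clipping to [-1, 1]
y_clip = np.clip(x, -1, 1)
```

When the branches reduce to a clamp like this, np.clip is the most direct (and fastest) spelling; np.select and np.piecewise earn their keep when the branches compute genuinely different expressions.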
Definition: np.vectorize vs np.frompyfunc
np.vectorize vs np.frompyfunc
np.vectorize wraps a scalar Python function to accept arrays.
It does not provide C-level speed: it is syntactic sugar for
a Python loop with broadcasting support.
def my_func(x, y):
    return x ** 2 + y if x > 0 else y - x
vfunc = np.vectorize(my_func)
result = vfunc(np.array([-1, 2, -3]), np.array([10, 20, 30]))
np.frompyfunc is similar but returns object arrays.
For real speed, use np.where / np.select (vectorized) or
Numba @jit (compiled).
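A small sketch contrasting the two wrappers on the my_func above. The values agree; the visible difference is the output dtype, and neither escapes the per-element Python call:

```python
import numpy as np

def my_func(x, y):
    return x ** 2 + y if x > 0 else y - x

a = np.array([-1, 2, -3])
b = np.array([10, 20, 30])

# np.vectorize infers a numeric output dtype from the first call
v1 = np.vectorize(my_func)(a, b)

# np.frompyfunc always returns an object-dtype array
v2 = np.frompyfunc(my_func, 2, 1)(a, b)
```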
Theorem: Python Loop Overhead on Arrays
For an array of n elements, a Python loop performing one arithmetic operation per element has time complexity O(n * c_py), where c_py includes bytecode dispatch, dynamic type checking, and reference counting overhead (~100 ns per element). The equivalent NumPy vectorized operation has time complexity O(n * c_C), where c_C is roughly 1-5 ns per element.
Each Python loop iteration pays for: bytecode interpretation, type lookup on the operands, calling the C math function through Python's number protocol, and creating a new Python float object for the result. NumPy skips all of this by iterating in C over raw memory.
Example: Replacing a Loop with Vectorized Operations
Convert the following Python loop into vectorized NumPy code:
result = []
for i in range(len(signal)):
    if signal[i] > threshold:
        result.append(signal[i] * gain)
    else:
        result.append(signal[i])
result = np.array(result)
Vectorized version
result = np.where(signal > threshold, signal * gain, signal)
One line replaces the entire loop. This is ~100x faster for
large arrays and immediately readable once you know np.where.
Example: Vectorized Activation Functions
Implement the following activation functions without Python loops: ReLU, leaky ReLU, hard sigmoid, and GELU approximation.
Vectorized implementations
x = np.linspace(-3, 3, 1000)
# ReLU
relu = np.maximum(x, 0)
# Leaky ReLU (alpha = 0.01)
leaky_relu = np.where(x > 0, x, 0.01 * x)
# Hard sigmoid
hard_sigmoid = np.clip(0.2 * x + 0.5, 0, 1)
# GELU approximation
gelu = 0.5 * x * (1 + np.tanh(
np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)
))
Example: Benchmarking Loop vs Vectorized
Benchmark a simple element-wise computation (compute sin(x) + cos(x)) using a Python loop vs vectorized NumPy for arrays of increasing size.
Benchmark code
import math
import time

import numpy as np

def loop_version(x):
    result = np.empty_like(x)
    for i in range(len(x)):
        result[i] = math.sin(x[i]) + math.cos(x[i])
    return result

def vectorized_version(x):
    return np.sin(x) + np.cos(x)

for n in [1000, 10_000, 100_000, 1_000_000]:
    x = np.random.randn(n)
    t0 = time.perf_counter()
    loop_version(x)
    t_loop = time.perf_counter() - t0
    t0 = time.perf_counter()
    vectorized_version(x)
    t_vec = time.perf_counter() - t0
    print(f"n={n:>10,}  loop={t_loop:.4f}s  vec={t_vec:.6f}s  "
          f"speedup={t_loop/t_vec:.0f}x")
Vectorization Benchmark
Compare execution time of Python loops vs NumPy vectorized operations for different operations and array sizes. See the 100x speedup in real time.
Quick Check
Does np.vectorize provide C-level speed?
No; it is syntactic sugar for a Python loop with broadcasting support
Yes β it compiles the function to C automatically
Yes, but only for simple arithmetic operations
It depends on the dtype
np.vectorize does NOT compile the function. It just loops in Python.
Common Mistake: The np.vectorize Performance Trap
Mistake:
Using np.vectorize and expecting a speedup:
@np.vectorize
def my_func(x):
    return x ** 2 + 1 if x > 0 else -x

# This is still a Python loop; no speedup!
Correction:
Use np.where for conditionals or Numba @jit for complex functions:
# Vectorized (fast)
def my_func(x):
    return np.where(x > 0, x ** 2 + 1, -x)
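A quick sketch that quantifies the trap: the two versions compute identical values, but the np.where form is typically one to two orders of magnitude faster because it never leaves compiled C (timings below are illustrative, not guaranteed):

```python
import time

import numpy as np

def my_func_scalar(x):
    return x ** 2 + 1 if x > 0 else -x

vectorized_sugar = np.vectorize(my_func_scalar)  # Python loop inside

def my_func_fast(x):
    # evaluates both branches in C, then selects element-wise
    return np.where(x > 0, x ** 2 + 1, -x)

x = np.random.randn(100_000)

t0 = time.perf_counter()
slow = vectorized_sugar(x)
t_sugar = time.perf_counter() - t0

t0 = time.perf_counter()
fast = my_func_fast(x)
t_fast = time.perf_counter() - t0
```

Note that np.where evaluates both branch expressions for every element before selecting; that wasted work is still far cheaper than a per-element Python call.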
vectorization
Replacing Python loops with NumPy array operations that execute in compiled C for 50-200x speedup.
Related: ufunc, Broadcasting, SIMD
ufunc
Universal function: a NumPy function that operates element-wise on arrays with broadcasting support, implemented in compiled C.
Related: Vectorization, Broadcasting
Vectorization Speedup
# Code from: ch05/python/vectorization_speedup.py
Key Takeaway
Vectorized NumPy operations are 50-200x faster than Python loops.
Use np.where for conditionals, np.select for multi-branch logic,
and avoid np.vectorize (it is just a loop in disguise). For truly
complex scalar functions, use Numba @jit.