Function Design for Scientific Code
Why Function Design Matters in Scientific Computing
Scientific code is often written as a long script: load data, transform, compute, plot. This works for one-off explorations but falls apart when you need to:
- Re-run with different parameters (e.g., sweep over regularization strengths)
- Test individual components (unit tests on a 500-line script?)
- Share code with collaborators who use different data
Well-designed functions are the first step toward reusable, testable, composable scientific software. This section covers the patterns that make functions robust in production research code.
Definition: Pure Function
Pure Function
A pure function is a function that:
- Always returns the same output for the same input (deterministic)
- Has no side effects: it does not modify external state, write to files, print to stdout, or mutate its arguments
# Pure: depends only on inputs, no side effects
def normalize(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / x.std()

# Impure: modifies the input array in place
def normalize_inplace(x: np.ndarray) -> None:
    x -= x.mean()
    x /= x.std()
Pure functions are easier to test, parallelize, and reason about. In scientific code, prefer pure functions for data transformations and reserve side effects for I/O boundaries.
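This boundary pattern can be sketched as a pure core wrapped by a thin impure shell (the `main` function and the random stand-in for file loading are illustrative):

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    # Pure core: output depends only on the input array
    return (x - x.mean()) / x.std()

def main() -> None:
    # Impure shell: randomness, printing, and (in real code) file I/O live here
    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=1000)  # stand-in for loading a file
    result = normalize(data)
    print(f"mean={result.mean():.3f}, std={result.std():.3f}")

main()
```

Only `main` needs integration-style testing; `normalize` can be verified with a plain unit test.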
Definition: First-Class Functions
First-Class Functions
In Python, functions are first-class objects: they can be assigned to variables, stored in data structures, passed as arguments, and returned from other functions.
def square(x):
    return x ** 2

# Assign to a variable
f = square
print(f(5))  # 25

# Store in a list
transforms = [square, np.sqrt, np.log]

# Pass as argument
results = list(map(square, [1, 2, 3, 4]))
This property is the foundation for closures, decorators, and higher-order functions, the main topics of this chapter.
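As a preview of closures, a function can also build and return new functions; a small sketch (the `make_scaler` name is made up for illustration):

```python
import numpy as np

def make_scaler(factor: float):
    """Return a new function that multiplies its input by `factor`."""
    def scale(x):
        return factor * np.asarray(x)
    return scale  # returned like any other value

double = make_scaler(2.0)
triple = make_scaler(3.0)
print(double([1, 2, 3]))  # [2. 4. 6.]
print(triple([1, 2, 3]))  # [3. 6. 9.]
```

Each returned function remembers the `factor` it was created with.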
Definition: Function Signature and Type Hints
Function Signature and Type Hints
A function signature specifies the function's name, parameters, default values, and return type. In scientific Python, type hints serve as executable documentation:
def compute_ber(
    tx_bits: np.ndarray,
    rx_bits: np.ndarray,
    *,  # force keyword-only after this
    ignore_prefix: int = 0,
) -> float:
    """Compute bit error rate between transmitted and received bits."""
    tx = tx_bits[ignore_prefix:]
    rx = rx_bits[ignore_prefix:]
    return np.mean(tx != rx)
The * forces ignore_prefix to be keyword-only, preventing
positional errors like compute_ber(tx, rx, 100).
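A quick check of that guard (the sample bit arrays here are chosen for illustration):

```python
import numpy as np

def compute_ber(tx_bits, rx_bits, *, ignore_prefix: int = 0) -> float:
    tx = tx_bits[ignore_prefix:]
    rx = rx_bits[ignore_prefix:]
    return float(np.mean(tx != rx))

tx = np.array([0, 1, 1, 0])
rx = np.array([0, 1, 0, 0])  # one bit flipped out of four
print(compute_ber(tx, rx))  # 0.25

try:
    compute_ber(tx, rx, 100)  # third positional argument is rejected
except TypeError as err:
    print("TypeError:", err)
```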
Definition: NumPy-Style Docstrings
NumPy-Style Docstrings
The NumPy docstring convention is the standard for scientific Python projects. It provides structured sections for parameters, returns, examples, and references:
def add_awgn(
    signal: np.ndarray,
    snr_db: float,
) -> np.ndarray:
    """Add white Gaussian noise to a signal.

    Parameters
    ----------
    signal : np.ndarray
        Input signal (1-D complex or real).
    snr_db : float
        Signal-to-noise ratio in decibels.

    Returns
    -------
    np.ndarray
        Noisy signal with same shape as input.

    Examples
    --------
    >>> rng = np.random.default_rng(42)
    >>> x = np.ones(1000)
    >>> y = add_awgn(x, snr_db=10.0)
    >>> 10 * np.log10(np.mean(x**2) / np.mean((y - x)**2))
    10.0  # approximately
    """
    ...
Tools like Sphinx, pdoc, and VS Code IntelliSense parse these docstrings automatically.
Definition: Variadic Arguments: *args and **kwargs
Variadic Arguments: *args and **kwargs
Python supports variadic arguments for flexible function signatures:
- *args collects extra positional arguments into a tuple
- **kwargs collects extra keyword arguments into a dict
def run_experiment(name: str, *configs, **overrides):
    """Run an experiment with multiple configs and parameter overrides."""
    for config in configs:
        params = {**config, **overrides}
        print(f"{name}: {params}")

run_experiment(
    "BER Sweep",
    {"modulation": "QPSK", "snr": 10},
    {"modulation": "16QAM", "snr": 10},
    num_trials=1000,
)
In scientific code, **kwargs is commonly used to forward
parameters to plotting functions or solver backends.
Common Mistake: The Mutable Default Argument Trap
Mistake:
Using a mutable object (list, dict, set) as a default argument:
def collect_results(result, cache=[]):  # BUG!
    cache.append(result)
    return cache

print(collect_results(1))  # [1]
print(collect_results(2))  # [1, 2]  (cache persists!)

The default [] is created once at function definition time and shared across all calls.
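The shared default can be observed directly through the function's `__defaults__` attribute, which holds the actual default objects created at `def` time:

```python
def collect_results(result, cache=[]):  # same buggy signature as above
    cache.append(result)
    return cache

collect_results(1)
collect_results(2)
# One list was created when `def` ran, and both calls mutated it
print(collect_results.__defaults__)  # ([1, 2],)
```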
Correction:
Use None as the default and create a new object inside the function:
def collect_results(result, cache=None):
    if cache is None:
        cache = []
    cache.append(result)
    return cache
This is the standard Python idiom. NumPy, SciPy, and scikit-learn all use this pattern extensively.
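When `None` is itself a legitimate value for the parameter, a common variant of the same idiom uses a private sentinel object as the default (the `_MISSING` name is just a convention chosen here):

```python
_MISSING = object()  # unique sentinel; distinct from None and every user value

def collect_results(result, cache=_MISSING):
    if cache is _MISSING:  # true only when the caller omitted `cache`
        cache = []
    cache.append(result)
    return cache

print(collect_results(1))  # [1]
print(collect_results(2))  # [2]  (no state leaks between calls)
```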
Common Mistake: Accidental Array Mutation via Views
Mistake:
NumPy slices return views, not copies, and plain assignment merely binds another name to the same array. Modifying the data through either inside a function mutates the caller's array:
def zero_negative(x: np.ndarray) -> np.ndarray:
    result = x  # This is NOT a copy!
    result[result < 0] = 0
    return result

data = np.array([1, -2, 3, -4])
clean = zero_negative(data)
print(data)  # [1, 0, 3, 0]  (original mutated!)
Correction:
Explicitly copy when you intend to leave the original unchanged:
def zero_negative(x: np.ndarray) -> np.ndarray:
    result = x.copy()
    result[result < 0] = 0
    return result
Or use a pure functional approach: np.where(x < 0, 0, x).
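The two fixes can be checked against each other, and the caller's array stays intact (a minimal sketch):

```python
import numpy as np

def zero_negative(x: np.ndarray) -> np.ndarray:
    result = x.copy()  # defensive copy keeps the caller's array untouched
    result[result < 0] = 0
    return result

data = np.array([1, -2, 3, -4])
clean = zero_negative(data)
# The copy-based version and np.where agree element for element
assert np.array_equal(clean, np.where(data < 0, 0, data))
print(data)  # data is unchanged
```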
Example: Pure vs Impure Data Transformation Pipeline
Write a data preprocessing pipeline for simulation results. Compare a pure functional approach with a stateful approach. The pipeline should:
- Remove NaN values
- Apply log-scale transformation
- Normalize to zero mean and unit variance
Impure approach (mutates state)
class DataPipeline:
    def __init__(self, data):
        self.data = data  # shared mutable state

    def remove_nans(self):
        self.data = self.data[~np.isnan(self.data)]

    def log_transform(self):
        self.data = np.log(self.data + 1e-10)

    def normalize(self):
        self.data = (self.data - self.data.mean()) / self.data.std()

# Problem: order matters, state is implicit, hard to test individually
pipe = DataPipeline(raw_data)
pipe.remove_nans()
pipe.log_transform()
pipe.normalize()
result = pipe.data
Pure approach (composable functions)
def remove_nans(x: np.ndarray) -> np.ndarray:
    return x[~np.isnan(x)]

def log_transform(x: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    return np.log(x + eps)

def normalize(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / x.std()

# Each function is independently testable
result = normalize(log_transform(remove_nans(raw_data)))

# Or compose them:
from functools import reduce
pipeline = [remove_nans, log_transform, normalize]
result = reduce(lambda data, fn: fn(data), pipeline, raw_data)
Why pure wins
- Each function can be unit tested in isolation
- Functions can be reordered or swapped easily
- No hidden coupling through shared state
- Easy to parallelize (no race conditions on shared data)
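The `reduce` pattern can also be wrapped once in a small left-to-right `compose` helper (a sketch; `compose` is not a standard-library function):

```python
from functools import reduce
from typing import Callable
import numpy as np

def compose(*funcs: Callable) -> Callable:
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), funcs, x)

def remove_nans(x):
    return x[~np.isnan(x)]

def log_transform(x, eps=1e-10):
    return np.log(x + eps)

def normalize(x):
    return (x - x.mean()) / x.std()

pipeline = compose(remove_nans, log_transform, normalize)
raw = np.array([1.0, np.nan, 10.0, 100.0])
print(pipeline(raw))  # three log-values normalized to zero mean, unit variance
```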
Example: Designing a Flexible API with *args and **kwargs
Design a plot_results function that accepts multiple result arrays
and forwards styling options to Matplotlib, following the NumPy
docstring convention.
Implementation with forwarded kwargs
import matplotlib.pyplot as plt
import numpy as np
def plot_results(
    *datasets: np.ndarray,
    labels: list[str] | None = None,
    title: str = "Simulation Results",
    xlabel: str = "Index",
    ylabel: str = "Value",
    **plot_kwargs,
) -> plt.Figure:
    """Plot one or more datasets on the same axes.

    Parameters
    ----------
    *datasets : np.ndarray
        One or more 1-D arrays to plot.
    labels : list[str] or None
        Legend labels for each dataset.
    title, xlabel, ylabel : str
        Axis labels and title.
    **plot_kwargs
        Forwarded to ``ax.plot()`` (e.g., linewidth, marker).

    Returns
    -------
    plt.Figure
        The created figure.
    """
    fig, ax = plt.subplots()
    for i, data in enumerate(datasets):
        label = labels[i] if labels else f"Dataset {i}"
        ax.plot(data, label=label, **plot_kwargs)
    ax.set(title=title, xlabel=xlabel, ylabel=ylabel)
    ax.legend()
    return fig
Usage
snr_values = np.linspace(0, 20, 50)
ber_zf = np.exp(-snr_values / 5)
ber_mmse = np.exp(-snr_values / 3)
fig = plot_results(
    ber_zf, ber_mmse,
    labels=["ZF", "MMSE"],
    title="BER Comparison",
    xlabel="SNR (dB)",
    ylabel="BER",
    linewidth=2,
    marker='o',
    markersize=3,
)
Example: Keyword-Only Arguments for Safety
Write a resample_signal function where critical parameters like
target_rate and method must be passed as keyword arguments to
prevent positional errors.
Using the bare * separator
def resample_signal(
    signal: np.ndarray,
    original_rate: float,
    *,  # everything after is keyword-only
    target_rate: float,
    method: str = "linear",
    preserve_energy: bool = True,
) -> np.ndarray:
    """Resample a signal to a new sample rate.

    Parameters
    ----------
    signal : np.ndarray
        Input signal.
    original_rate : float
        Original sample rate in Hz.
    target_rate : float
        Desired sample rate in Hz (keyword-only).
    method : str
        Interpolation method: 'linear', 'cubic', 'nearest'.
    preserve_energy : bool
        If True, scale amplitude to preserve signal energy.

    Returns
    -------
    np.ndarray
        Resampled signal.
    """
    ratio = target_rate / original_rate
    n_new = int(len(signal) * ratio)
    x_old = np.linspace(0, 1, len(signal))
    x_new = np.linspace(0, 1, n_new)
    # Note: np.interp is linear interpolation; supporting the other
    # `method` values would require e.g. scipy.interpolate.
    resampled = np.interp(x_new, x_old, signal)
    if preserve_energy:
        resampled *= np.sqrt(1 / ratio)
    return resampled
Why keyword-only matters
# This works: clear and unambiguous
resample_signal(sig, 44100, target_rate=22050, method="cubic")

# This raises TypeError, preventing positional mistakes:
resample_signal(sig, 44100, 22050, "cubic")
# TypeError: resample_signal() takes 2 positional arguments
# but 4 were given
Pure Function Examples for Scientific Computing
# Code from: ch02/python/pure_functions.py

NumPy-Style Docstring Templates
# Code from: ch02/python/docstring_examples.py

Quick Check
Which of the following functions is pure?
def f(x):
    print(f"Processing {x}")
    return x ** 2

def f(x):
    return x ** 2

def f(x):
    x.append(0)
    return sum(x)

counter = 0
def f(x):
    global counter
    counter += 1
    return x ** 2
def f(x):
    return x ** 2
No side effects, deterministic output for any input.
Quick Check
What does the following code print on the third call?
def add_item(item, items=[]):
    items.append(item)
    return items

add_item('a')
add_item('b')
print(add_item('c'))
['c']
['a', 'b', 'c']
['b', 'c']
Raises TypeError
['a', 'b', 'c']

The mutable default [] is created once and accumulates items across all calls.
Function Dispatch Flow Visualization
Visualize how Python dispatches function calls: argument binding, default value resolution, *args/**kwargs packing, and return. Adjust parameters to see how different calling conventions work.
pure function
A function with no side effects that returns the same output for the same inputs. Pure functions are deterministic and do not modify external state.
Related: side effect
side effect
Any observable change a function makes beyond returning a value: printing, writing files, modifying global variables, mutating arguments, or making network requests.
Related: pure function
first-class object
An entity that can be assigned to variables, passed as arguments, returned from functions, and stored in data structures. In Python, functions, classes, and modules are all first-class objects.
Historical Note: From Lambda Calculus to Python Functions
1930s-1950s: The idea of functions as first-class objects traces back to Alonzo Church's lambda calculus (1930s), a formal system for expressing computation through function abstraction and application. Python's lambda keyword is a direct nod to this heritage. Lisp (1958) was the first programming language to treat functions as first-class citizens, and the idea eventually made its way into Python through the influence of functional programming languages.
Key Takeaway
Design functions for composability. Prefer pure functions with
explicit inputs and outputs. Use None for mutable defaults,
keyword-only arguments (*) for safety-critical parameters, and
NumPy-style docstrings for documentation. Reserve side effects
for I/O boundaries.