Core Data Structures Refresher
Choosing the Right Container
Scientific code spends most of its time in NumPy/PyTorch. But the glue around those hot paths (configuration parsing, result aggregation, bookkeeping) uses standard Python data structures. Choosing the wrong container turns O(1) lookups into O(n) scans and makes code fragile. This section sharpens your intuition for when to reach for each tool.
Definition: Time Complexity of Core Operations
The performance of Python's built-in containers:
| Operation | list | dict | set | deque |
|---|---|---|---|---|
| Index `[i]` | O(1) | N/A | N/A | O(n) |
| Append/push right | O(1)* | O(1)* | O(1)* | O(1) |
| Prepend/push left | O(n) | N/A | N/A | O(1) |
| Search `x in` | O(n) | O(1) | O(1) | O(n) |
| Delete by value | O(n) | O(1) | O(1) | O(n) |
| Sort | O(n log n) | N/A | N/A | N/A |
* Amortized O(1): occasional resizes cost O(n) but average out.
Theorem: Amortized Cost of List Append
Python's list.append() has amortized O(1) time complexity.
Specifically, a sequence of n append operations on an initially
empty list takes O(n) total time, even though individual appends
occasionally cost O(n) due to reallocation.
CPython doubles the underlying array capacity when it runs out of space. This means resizes happen at appends 1, 2, 4, 8, ..., and the total work across all resizes is O(n).
Consider the total cost of all resize operations over n appends.
The resize at capacity 2^k copies 2^k elements. Sum the geometric series.
Setup
Let T(n) be the total cost of n append operations. Each append costs O(1) unless a resize is triggered, in which case it costs O(m), where m is the current capacity.
Counting resize costs
With a doubling strategy, resizes occur when the list size reaches powers of 2: 1, 2, 4, 8, ..., 2^floor(log2 n). The total resize cost is 1 + 2 + 4 + ... + 2^floor(log2 n) < 2n = O(n), since a geometric series is bounded by twice its largest term.
Total cost
The total cost is T(n) = n · O(1) + O(n) = O(n). Therefore the amortized cost per append is T(n)/n = O(1).
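You can see the over-allocation directly by watching `sys.getsizeof` as a list grows. This is a minimal sketch; the exact resize points depend on the CPython version:

```python
import sys

# Record the allocated size after each append: it stays flat between
# resizes, then jumps when CPython reallocates with extra capacity.
sizes = []
lst = []
for i in range(64):
    lst.append(i)
    sizes.append(sys.getsizeof(lst))

# Indices where the allocation grew, i.e. where a resize happened.
jumps = [i for i in range(1, len(sizes)) if sizes[i] > sizes[i - 1]]
print(jumps)  # resize points; exact values depend on the CPython version
```

The jumps become sparser as the list grows, which is exactly the geometric-growth behavior the proof relies on.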
Theorem: Expected-Case Hash Table Lookup
Under the simple uniform hashing assumption (each key is equally likely to hash to any slot), lookup in a hash table with n elements and m slots has expected time O(1 + α), where α = n/m is the load factor. When the load factor is kept bounded (Python keeps α < 2/3), this simplifies to expected O(1).
A hash table maps keys to slots via a hash function. With a good hash function and a bounded load factor, the expected number of keys per slot (chain length) is constant, so lookup takes constant expected time.
Expected chain length
Under simple uniform hashing, each of the n keys is placed in one of m slots independently and uniformly. The expected number of keys in any given slot is n/m = α.
Lookup cost
An unsuccessful lookup examines the entire chain in the target slot, costing O(1 + α) in expectation. A successful lookup examines, on average, about 1 + α/2 elements.
Bounded load factor
Python's dict resizes when α exceeds 2/3, keeping α = O(1)
at all times. Therefore the expected lookup cost is
O(1 + α) = O(1).
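The constant-time claim is easy to check empirically. A rough sketch (absolute timings vary by machine, but the trend does not):

```python
from timeit import timeit

# Membership test cost as the container grows: the set lookup stays
# roughly flat, while the list scan grows linearly with n.
for n in (1_000, 10_000, 100_000):
    data = list(range(n))
    as_set = set(data)
    target = n - 1  # worst case for the list scan
    t_list = timeit(lambda: target in data, number=100)
    t_set = timeit(lambda: target in as_set, number=100)
    print(f"n={n:>7,}: list {t_list:.5f}s  set {t_set:.5f}s")
```

At n = 100,000 the gap is typically several orders of magnitude.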
Definition: collections.defaultdict
A dict subclass that calls a factory function for missing keys:
from collections import defaultdict
# Group results by SNR level
results = defaultdict(list)
for snr, ber in measurements:
results[snr].append(ber)
Without defaultdict, you would need if snr not in results: results[snr] = []
before every append, which is verbose and error-prone.
Definition: collections.Counter
A dict subclass for counting hashable objects:
from collections import Counter
# Count bit errors
errors = Counter(detected_bits ^ true_bits)
print(errors[1]) # Number of bit errors (1s in XOR result)
# Most common elements
symbol_counts = Counter(received_symbols)
print(symbol_counts.most_common(5))
Definition: collections.deque
A double-ended queue supporting append and pop from both ends:
from collections import deque
# Sliding window of recent measurements
window = deque(maxlen=100)
for sample in data_stream:
window.append(sample)
# window automatically discards oldest when full
moving_avg = sum(window) / len(window)
Use deque whenever you need a fixed-size buffer, a queue, or
efficient prepending.
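As a queue, a deque sketch looks like this (job names are illustrative):

```python
from collections import deque

# A deque works as a FIFO queue: append on the right, popleft on the
# left, both O(1). (list.pop(0) would be O(n): it shifts every element.)
queue = deque()
for job in ["load", "filter", "estimate"]:
    queue.append(job)

order = []
while queue:
    order.append(queue.popleft())
print(order)  # ['load', 'filter', 'estimate']
```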
Definition: collections.namedtuple and dataclasses
Named tuples provide lightweight, immutable records:
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y', 'z'])
p = Point(1.0, 2.0, 3.0)
print(p.x, p.y) # 1.0 2.0
Dataclasses (Python 3.7+) are mutable and more flexible:
from dataclasses import dataclass
@dataclass
class SimConfig:
snr_db: float = 10.0
n_antennas: int = 4
n_realizations: int = 1000
channel_model: str = "rayleigh"
The @dataclass decorator auto-generates __init__, __repr__,
and __eq__. Prefer dataclasses for parameter containers.
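A quick sketch of what the generated methods buy you, reusing the SimConfig above:

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    snr_db: float = 10.0
    n_antennas: int = 4
    n_realizations: int = 1000
    channel_model: str = "rayleigh"

a = SimConfig()
b = SimConfig(snr_db=10.0)
print(a)       # auto-generated __repr__ shows every field by name
print(a == b)  # True: field-by-field __eq__ is generated for free
```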
Example: Choosing the Right Data Structure
You are running a Monte Carlo simulation that generates 10,000 channel realizations. For each realization, you need to: (a) store the BER result, (b) check if a particular channel matrix has been seen before (exact match), and (c) maintain a running window of the last 50 BER values. Which data structures should you use for each task?
BER storage: use a list
A list is the natural choice for ordered, append-only results:
ber_results: list[float] = []
for trial in range(10_000):
ber = run_simulation(trial)
ber_results.append(ber) # O(1) amortized
Duplicate detection: use a set
For membership testing, use a set. Since NumPy arrays
are not hashable, convert to a hashable form:
seen_channels: set[bytes] = set()
h_bytes = H.tobytes()
if h_bytes not in seen_channels: # O(1)
seen_channels.add(h_bytes)
Sliding window: use a deque
from collections import deque
recent = deque(maxlen=50)
for ber in ber_results:
recent.append(ber) # O(1), auto-drops oldest
avg = sum(recent) / len(recent)
Data Structures Performance Benchmark
# Code from: ch01/python/data_structures_benchmark.py
Dataclass Patterns for Simulations
# Code from: ch01/python/dataclass_patterns.py
Data Structure Operation Costs
Compare the empirical time of common operations (append, lookup, delete) across Python's built-in data structures as the container size grows.
Dynamic Array Growth Animation
Watch Python's list grow as elements are appended. The animation shows capacity doublings, element copies during resizes, and the amortized cost converging to O(1) per operation.
Python Operator Dispatch Mechanism
For a + b, Python first checks a.__add__; if that returns NotImplemented,
it falls back to b.__radd__, and raises TypeError if neither works.
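The dispatch can be sketched with a toy numeric type (the Meters class is purely illustrative):

```python
class Meters:
    """Toy numeric type illustrating Python's binary-operator dispatch."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented  # tells Python to try the other operand

    def __radd__(self, other):
        # Called for `other + self` after other.__add__ returns NotImplemented
        return self.__add__(other)

print((Meters(3) + 2).value)  # 5: uses Meters.__add__
print((2 + Meters(3)).value)  # 5: int.__add__ fails, falls back to __radd__
```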
Common Mistake: Using a List for Membership Testing
Mistake:
Checking if x in my_list inside a loop over n elements, giving
O(n^2) total time. Common in code that accumulates "seen" items.
Correction:
Use a set for membership testing: if x in my_set is O(1) on average.
Convert the list to a set if you need repeated lookups:
seen = set(my_list) # O(n) once
for item in other_items:
if item in seen: # O(1) per check
...
Historical Note: Dictionaries Became Ordered in Python 3.7
Before Python 3.7, dict iteration order was implementation-dependent.
CPython 3.6 made insertion-order preservation an implementation detail;
Python 3.7 (2018) made it a language guarantee. This means you can now rely
on for k, v in params.items() iterating in insertion order, which is
useful for ordered parameter sweeps and reproducible experiment configs.
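A minimal sketch of relying on that guarantee (the parameter names are illustrative):

```python
# dict preserves insertion order (guaranteed since Python 3.7), so a
# sweep configuration iterates in exactly the order it was written.
params = {"snr_db": [0, 5, 10], "n_antennas": [2, 4], "model": ["rayleigh"]}
print(list(params))  # ['snr_db', 'n_antennas', 'model']
```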
Why This Matters: Dataclasses as Simulation Configs
In wireless simulation, experiments typically involve sweeping over
many parameters (SNR, number of antennas, channel model, coding rate).
A @dataclass provides a clean, typed, self-documenting container
for these parameters β much better than passing loose **kwargs or
raw dictionaries. We will use this pattern throughout Books 1 and 2
whenever we configure simulations.
See full treatment in Classes and Inheritance
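One sweep pattern worth knowing is dataclasses.replace, which copies a base config with a few fields changed. A sketch using the SimConfig fields from earlier:

```python
from dataclasses import dataclass, replace

@dataclass
class SimConfig:
    snr_db: float = 10.0
    n_antennas: int = 4
    n_realizations: int = 1000
    channel_model: str = "rayleigh"

base = SimConfig(channel_model="rayleigh")
# replace() returns a modified copy, so one base config spawns a sweep
sweep = [replace(base, snr_db=s) for s in (0.0, 5.0, 10.0, 15.0)]
print([c.snr_db for c in sweep])  # [0.0, 5.0, 10.0, 15.0]
print(sweep[0].n_antennas)        # 4: untouched fields keep base values
```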
Quick Check
Which data structure provides O(1) append AND O(1) prepend?
collections.deque
list
dict
tuple
Answer: collections.deque. A deque (double-ended queue) supports O(1) operations at both ends.
Key Takeaway
Match the data structure to the access pattern. Lists for ordered sequences, sets for membership testing, dicts for key-value lookup, deques for double-ended access. Dataclasses for parameter containers. The wrong choice turns O(1) into O(n) and makes code fragile.