Structured Arrays and Memory-Mapped Files
Beyond Homogeneous Arrays
Sometimes your data has mixed types: a particle simulation with (float64 position, float64 velocity, int32 species_id, bool active). Structured arrays let you store heterogeneous records in a single ndarray, and memory-mapped files let you work with datasets that exceed RAM.
These tools bridge the gap between NumPy arrays and database-like records without leaving the NumPy ecosystem.
Definition: Structured Arrays
A structured array uses a compound dtype to store heterogeneous fields in each element:
import numpy as np

dt = np.dtype([
    ('name', 'U10'),                 # Unicode string, max 10 chars
    ('position', np.float64, (3,)),  # 3-component position vector
    ('mass', np.float64),
    ('active', np.bool_),
])
particles = np.zeros(100, dtype=dt)
particles['name'][0] = 'electron'
particles['position'][0] = [1.0, 2.0, 3.0]
particles['mass'][0] = 9.109e-31
particles['active'][0] = True
# Field access returns a view into the structured array
masses = particles['mass'] # shape (100,), dtype float64
positions = particles['position'] # shape (100, 3)
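Because each field is itself an ordinary array, boolean masks built from one field can select whole records, much like a SQL WHERE clause. A minimal sketch, continuing from the particles array above (the mass threshold is arbitrary):

# Boolean masks on one field select complete records
active_particles = particles[particles['active']]   # records where active is True
heavy = particles[particles['mass'] > 1e-31]        # here: just the electron
print(active_particles['name'][0])                  # 'electron'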
Definition: Memory-Mapped Files: np.memmap
np.memmap maps a file on disk directly into virtual memory,
allowing you to access huge arrays without loading them into RAM:
# Write a large array to disk
data = np.random.randn(10_000, 1000).astype(np.float32)
data.tofile('large_data.bin')
# Memory-map it (lazy loading, OS manages paging)
mm = np.memmap('large_data.bin', dtype=np.float32,
               mode='r', shape=(10_000, 1000))
# Access slices without loading the full file
subset = mm[500:600, :] # only these rows are read from disk
print(subset.mean())
Modes: 'r' (read-only), 'r+' (read-write), 'w+' (create/overwrite),
'c' (copy-on-write).
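The write-capable modes are easy to confuse, so here is a short sketch of the two less obvious ones (the file name scratch.bin is illustrative):

# 'w+' creates (or truncates) the file, sized from shape, with read-write access
scratch = np.memmap('scratch.bin', dtype=np.float32, mode='w+', shape=(1000,))
scratch[:] = np.arange(1000, dtype=np.float32)
scratch.flush()   # push dirty pages to disk
del scratch       # release the mapping

# 'c' (copy-on-write): assignments change pages in memory only;
# the file on disk is never modified
cow = np.memmap('scratch.bin', dtype=np.float32, mode='c', shape=(1000,))
cow[0] = -1.0     # fine in memory; 'scratch.bin' still holds 0.0 at index 0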
Definition: HDF5 and zarr
For structured, self-describing datasets, use HDF5 (via h5py)
or zarr (cloud-native):
import h5py
# Write
with h5py.File('experiment.h5', 'w') as f:
    f.create_dataset('signal', data=np.random.randn(1000, 64))
    f.create_dataset('labels', data=np.arange(1000))
    f.attrs['sampling_rate'] = 44100

# Read (lazy: data loaded on access)
with h5py.File('experiment.h5', 'r') as f:
    chunk = f['signal'][100:200, :]  # loads only this slice
    sr = f.attrs['sampling_rate']
zarr adds chunked compression and cloud storage backends:
import zarr
z = zarr.open('data.zarr', mode='w',
              shape=(100_000, 256), chunks=(1000, 256),
              dtype='float32')
z[0:1000] = np.random.randn(1000, 256).astype(np.float32)
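Reading is symmetric: only the chunks overlapping the requested slice are fetched and decompressed. A short sketch, reusing the store created above:

z_read = zarr.open('data.zarr', mode='r')
block = z_read[0:1000, :]   # touches only the first chunk
print(block.shape, z_read.chunks)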
Theorem: Memory-Mapped I/O Performance
For sequential access patterns, memory-mapped files achieve high throughput, approaching the drive's sequential bandwidth (and RAM speed once pages are cached), because the OS prefetches pages ahead of the reader. For random access, performance degrades toward raw disk I/O latency due to page faults.
The OS virtual memory system loads file pages on demand (typically 4 KB at a time). Sequential reads trigger OS readahead, which prefetches upcoming pages. Random reads cause page faults, each requiring a storage access (tens of microseconds on an SSD, milliseconds on an HDD).
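A rough way to observe this yourself: read the same rows sequentially and in shuffled order, and compare wall-clock times. A micro-benchmark sketch, reusing large_data.bin from above (results depend on the drive and on the OS page cache, so the contrast is sharpest on a cold cache):

import time
import numpy as np

mm = np.memmap('large_data.bin', dtype=np.float32,
               mode='r', shape=(10_000, 1000))

def scan(row_order):
    """Touch every given row once; return elapsed seconds."""
    start = time.perf_counter()
    total = 0.0
    for i in row_order:
        total += mm[i].sum()
    return time.perf_counter() - start

rng = np.random.default_rng(0)
t_seq = scan(np.arange(10_000))          # prefetch-friendly order
# Note: the first scan warms the page cache; for a fair comparison,
# use a fresh cache or a file larger than RAM.
t_rand = scan(rng.permutation(10_000))   # page-fault-heavy order
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")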
Example: Sorting Structured Arrays
Create a structured array of particles with fields (name, energy, charge) and sort by energy in descending order.
Create and sort
dt = np.dtype([('name', 'U10'), ('energy', np.float64), ('charge', np.int32)])
particles = np.array([
    ('proton',   938.3,  1),
    ('electron',   0.511, -1),
    ('neutron',  939.6,  0),
    ('muon',     105.7, -1),
], dtype=dt)
# Sort by energy (descending)
sorted_p = np.sort(particles, order='energy')[::-1]
print(sorted_p['name']) # ['neutron', 'proton', 'muon', 'electron']
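order also accepts a list of field names for lexicographic sorting, primary key first:

# Sort by charge, breaking ties by energy (both ascending)
by_charge = np.sort(particles, order=['charge', 'energy'])
print(by_charge['name'])   # ['electron', 'muon', 'neutron', 'proton']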
Example: Out-of-Core Computation with memmap
Compute the column-wise mean of an 8 GB dataset that does not fit comfortably in memory, using memory-mapped files and chunked processing.
Chunked processing
# Assume 'huge_data.bin' is a (1_000_000, 1000) float64 array
# Total: 8 GB, which may not fit in RAM
mm = np.memmap('huge_data.bin', dtype=np.float64,
               mode='r', shape=(1_000_000, 1000))
# Compute column means in chunks
chunk_size = 10_000
col_sums = np.zeros(1000)
for i in range(0, 1_000_000, chunk_size):
    col_sums += mm[i:i+chunk_size].sum(axis=0)
col_means = col_sums / 1_000_000
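On a memmap, mm.mean(axis=0) would also stream through the pages, but the explicit loop keeps the working set obvious and generalizes to statistics that need multiple passes. A sketch of a second pass for the column-wise standard deviation, reusing mm, chunk_size, and col_means:

# Second pass: column-wise population variance around the precomputed means
sq_sums = np.zeros(1000)
for i in range(0, 1_000_000, chunk_size):
    chunk = mm[i:i+chunk_size]
    sq_sums += ((chunk - col_means) ** 2).sum(axis=0)
col_stds = np.sqrt(sq_sums / 1_000_000)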
Array Storage Formats Comparison
| Format | Metadata | Compression | Partial Read | Cloud Native |
|---|---|---|---|---|
| .npy / .npz | dtype, shape only | Optional (npz) | No (full load) | No |
| Raw binary | None (you track it) | No | Via memmap | No |
| HDF5 (h5py) | Rich (groups, attrs) | Yes (gzip, lzf) | Yes (chunked) | Limited |
| zarr | Rich (JSON attrs) | Yes (many codecs) | Yes (chunked) | Yes (S3, GCS) |
Quick Check
When you access a field of a structured array with particles['mass'],
is the result a view or a copy?
A view β it points to the same underlying memory
A copy β structured arrays always copy on field access
It depends on the dtype
Neither β it returns a scalar
A view. Field access on a structured array returns a view whose stride equals the full record size, so no data is copied.
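You can verify the view semantics directly with np.shares_memory:

import numpy as np

dt = np.dtype([('mass', np.float64), ('charge', np.int32)])
p = np.zeros(5, dtype=dt)

masses = p['mass']
print(np.shares_memory(masses, p))   # True: same underlying buffer
masses[0] = 1.0
print(p['mass'][0])                  # 1.0: the write shows up in the original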
Common Mistake: Writing to a Read-Only memmap
Mistake:
Opening a memmap in read mode and trying to modify it:
mm = np.memmap('data.bin', dtype='float64', mode='r', shape=(100,))
mm[0] = 42.0 # ValueError: assignment destination is read-only
Correction:
Use mode='r+' for read-write access:
mm = np.memmap('data.bin', dtype='float64', mode='r+', shape=(100,))
mm[0] = 42.0 # OK
mm.flush() # ensure changes are written to disk
structured array
A NumPy array with a compound dtype that can store heterogeneous fields (like a table row) in each element.
Related: Data Types (dtype), record array
memory-mapped file
A file mapped into virtual memory, allowing array-like access to disk data without loading it entirely into RAM.
Related: np.memmap, HDF5 and zarr
Why This Matters: Structured Arrays to Pandas DataFrames
Structured arrays are the ancestor of Pandas DataFrames. In
Chapter 8 (Pandas), you will see that DataFrames internally use
NumPy arrays for each column. Converting between them is trivial:
pd.DataFrame(structured_array). Understanding structured dtypes
helps when interfacing with low-level file formats that predate Pandas.
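A minimal round-trip sketch, assuming pandas is installed:

import numpy as np
import pandas as pd

dt = np.dtype([('name', 'U10'), ('energy', np.float64)])
arr = np.array([('proton', 938.3), ('electron', 0.511)], dtype=dt)

df = pd.DataFrame(arr)              # each field becomes a column
back = df.to_records(index=False)   # back to a NumPy record array
print(df.dtypes)
print(back.dtype)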
See full treatment in Chapter 8