Structured Arrays and Memory-Mapped Files
Beyond Homogeneous Arrays
Sometimes your data has mixed types: a particle simulation with (float64 position, float64 velocity, int32 species_id, bool active). Structured arrays let you store heterogeneous records in a single ndarray, and memory-mapped files let you work with datasets that exceed RAM.
These tools bridge the gap between NumPy arrays and database-like records without leaving the NumPy ecosystem.
Definition: Structured Arrays
A structured array uses a compound dtype to store heterogeneous fields in each element:
import numpy as np

dt = np.dtype([
    ('name', 'U10'),                 # Unicode string, max 10 chars
    ('position', np.float64, (3,)),  # 3-component position vector
    ('mass', np.float64),
    ('active', np.bool_),
])
particles = np.zeros(100, dtype=dt)
particles['name'][0] = 'electron'
particles['position'][0] = [1.0, 2.0, 3.0]
particles['mass'][0] = 9.109e-31
particles['active'][0] = True
# Field access returns a view into the structured array
masses = particles['mass'] # shape (100,), dtype float64
positions = particles['position'] # shape (100, 3)
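Because each field is itself an ordinary array, boolean masks built from one field can select whole records, much like a SQL WHERE clause. A minimal sketch, continuing from the particles array above (the mass threshold is arbitrary):

# Boolean masks on one field select complete records
active_particles = particles[particles['active']]   # records where active is True
heavy = particles[particles['mass'] > 1e-31]        # here: just the electron
print(active_particles['name'][0])                  # 'electron'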
Definition: Memory-Mapped Files: np.memmap
np.memmap maps a file on disk directly into virtual memory,
allowing you to access huge arrays without loading them into RAM:
# Write a large array to disk
data = np.random.randn(10_000, 1000).astype(np.float32)
data.tofile('large_data.bin')
# Memory-map it (lazy loading, OS manages paging)
mm = np.memmap('large_data.bin', dtype=np.float32,
               mode='r', shape=(10_000, 1000))
# Access slices without loading the full file
subset = mm[500:600, :] # only these rows are read from disk
print(subset.mean())
Modes: 'r' (read-only), 'r+' (read-write), 'w+' (create/overwrite),
'c' (copy-on-write).
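The write-capable modes are easy to confuse, so here is a short sketch of the two less obvious ones (the file name scratch.bin is illustrative):

# 'w+' creates (or truncates) the file, sized from shape, with read-write access
scratch = np.memmap('scratch.bin', dtype=np.float32, mode='w+', shape=(1000,))
scratch[:] = np.arange(1000, dtype=np.float32)
scratch.flush()   # push dirty pages to disk
del scratch       # release the mapping

# 'c' (copy-on-write): assignments change pages in memory only;
# the file on disk is never modified
cow = np.memmap('scratch.bin', dtype=np.float32, mode='c', shape=(1000,))
cow[0] = -1.0     # fine in memory; 'scratch.bin' still holds 0.0 at index 0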
Definition: HDF5 and zarr
For structured, self-describing datasets, use HDF5 (via h5py)
or zarr (cloud-native):
import h5py
# Write
with h5py.File('experiment.h5', 'w') as f:
    f.create_dataset('signal', data=np.random.randn(1000, 64))
    f.create_dataset('labels', data=np.arange(1000))
    f.attrs['sampling_rate'] = 44100

# Read (lazy: data loaded on access)
with h5py.File('experiment.h5', 'r') as f:
    chunk = f['signal'][100:200, :]  # loads only this slice
    sr = f.attrs['sampling_rate']
zarr adds chunked compression and cloud storage backends:
import zarr
z = zarr.open('data.zarr', mode='w',
              shape=(100_000, 256), chunks=(1000, 256),
              dtype='float32')
z[0:1000] = np.random.randn(1000, 256).astype(np.float32)
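Reading is symmetric: only the chunks overlapping the requested slice are fetched and decompressed. A short sketch, reusing the store created above:

z_read = zarr.open('data.zarr', mode='r')
block = z_read[0:1000, :]   # touches only the first chunk
print(block.shape, z_read.chunks)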
Theorem: Memory-Mapped I/O Performance
For sequential access patterns, memory-mapped files achieve high throughput, approaching the drive's sequential bandwidth (and RAM speed once pages are cached), because the OS prefetches pages ahead of the reader. For random access, performance degrades toward raw disk I/O latency due to page faults.
The OS virtual memory system loads file pages on demand (typically 4 KB at a time). Sequential reads trigger OS readahead, which prefetches upcoming pages. Random reads cause page faults, each requiring a storage access (tens of microseconds on an SSD, milliseconds on an HDD).
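A rough way to observe this yourself: read the same rows sequentially and in shuffled order, and compare wall-clock times. A micro-benchmark sketch, reusing large_data.bin from above (results depend on the drive and on the OS page cache, so the contrast is sharpest on a cold cache):

import time
import numpy as np

mm = np.memmap('large_data.bin', dtype=np.float32,
               mode='r', shape=(10_000, 1000))

def scan(row_order):
    """Touch every given row once; return elapsed seconds."""
    start = time.perf_counter()
    total = 0.0
    for i in row_order:
        total += mm[i].sum()
    return time.perf_counter() - start

rng = np.random.default_rng(0)
t_seq = scan(np.arange(10_000))          # prefetch-friendly order
# Note: the first scan warms the page cache; for a fair comparison,
# use a fresh cache or a file larger than RAM.
t_rand = scan(rng.permutation(10_000))   # page-fault-heavy order
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")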
Example: Sorting Structured Arrays
Create a structured array of particles with fields (name, energy, charge) and sort by energy in descending order.
Create and sort
dt = np.dtype([('name', 'U10'), ('energy', np.float64), ('charge', np.int32)])
particles = np.array([
    ('proton',   938.3,  1),
    ('electron',   0.511, -1),
    ('neutron',  939.6,  0),
    ('muon',     105.7, -1),
], dtype=dt)
# Sort by energy (descending)
sorted_p = np.sort(particles, order='energy')[::-1]
print(sorted_p['name']) # ['neutron', 'proton', 'muon', 'electron']
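order also accepts a list of field names for lexicographic sorting, primary key first:

# Sort by charge, breaking ties by energy (both ascending)
by_charge = np.sort(particles, order=['charge', 'energy'])
print(by_charge['name'])   # ['electron', 'muon', 'neutron', 'proton']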
Example: Out-of-Core Computation with memmap
Compute the column-wise mean of an 8 GB dataset that does not fit comfortably in memory, using memory-mapped files and chunked processing.
Chunked processing
# Assume 'huge_data.bin' is a (1_000_000, 1000) float64 array
# Total: 8 GB, which may not fit in RAM
mm = np.memmap('huge_data.bin', dtype=np.float64,
               mode='r', shape=(1_000_000, 1000))
# Compute column means in chunks
chunk_size = 10_000
col_sums = np.zeros(1000)
for i in range(0, 1_000_000, chunk_size):
    col_sums += mm[i:i+chunk_size].sum(axis=0)
col_means = col_sums / 1_000_000
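On a memmap, mm.mean(axis=0) would also stream through the pages, but the explicit loop keeps the working set obvious and generalizes to statistics that need multiple passes. A sketch of a second pass for the column-wise standard deviation, reusing mm, chunk_size, and col_means:

# Second pass: column-wise population variance around the precomputed means
sq_sums = np.zeros(1000)
for i in range(0, 1_000_000, chunk_size):
    chunk = mm[i:i+chunk_size]
    sq_sums += ((chunk - col_means) ** 2).sum(axis=0)
col_stds = np.sqrt(sq_sums / 1_000_000)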
Array Storage Formats Comparison
| Format | Metadata | Compression | Partial Read | Cloud Native |
|---|---|---|---|---|
| .npy / .npz | dtype, shape only | Optional (npz) | No (full load) | No |
| Raw binary | None (you track it) | No | Via memmap | No |
| HDF5 (h5py) | Rich (groups, attrs) | Yes (gzip, lzf) | Yes (chunked) | Limited |
| zarr | Rich (JSON attrs) | Yes (many codecs) | Yes (chunked) | Yes (S3, GCS) |
Quick Check
When you access a field of a structured array with particles['mass'],
is the result a view or a copy?
A view β it points to the same underlying memory
A copy β structured arrays always copy on field access
It depends on the dtype
Neither β it returns a scalar
A view. Field access on a structured array returns a view whose stride equals the full record size, so no data is copied.
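You can verify the view semantics directly with np.shares_memory:

import numpy as np

dt = np.dtype([('mass', np.float64), ('charge', np.int32)])
p = np.zeros(5, dtype=dt)

masses = p['mass']
print(np.shares_memory(masses, p))   # True: same underlying buffer
masses[0] = 1.0
print(p['mass'][0])                  # 1.0: the write shows up in the original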
Common Mistake: Writing to a Read-Only memmap
Mistake:
Opening a memmap in read mode and trying to modify it:
mm = np.memmap('data.bin', dtype='float64', mode='r', shape=(100,))
mm[0] = 42.0 # ValueError: assignment destination is read-only
Correction:
Use mode='r+' for read-write access:
mm = np.memmap('data.bin', dtype='float64', mode='r+', shape=(100,))
mm[0] = 42.0 # OK
mm.flush() # ensure changes are written to disk
structured array
A NumPy array with a compound dtype that can store heterogeneous fields (like a table row) in each element.
Related: Data Types (dtype), record array
memory-mapped file
A file mapped into virtual memory, allowing array-like access to disk data without loading it entirely into RAM.
Related: np.memmap, HDF5 and zarr
Why This Matters: Structured Arrays to Pandas DataFrames
Structured arrays are the ancestor of Pandas DataFrames. In
Chapter 8 (Pandas), you will see that DataFrames internally use
NumPy arrays for each column. Converting between them is trivial:
pd.DataFrame(structured_array). Understanding structured dtypes
helps when interfacing with low-level file formats that predate Pandas.
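A minimal round-trip sketch, assuming pandas is installed:

import numpy as np
import pandas as pd

dt = np.dtype([('name', 'U10'), ('energy', np.float64)])
arr = np.array([('proton', 938.3), ('electron', 0.511)], dtype=dt)

df = pd.DataFrame(arr)              # each field becomes a column
back = df.to_records(index=False)   # back to a NumPy record array
print(df.dtypes)
print(back.dtype)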
See full treatment in Chapter 8