Structured Arrays and Memory-Mapped Files

Beyond Homogeneous Arrays

Sometimes your data has mixed types: a particle simulation with (float64 position, float64 velocity, int32 species_id, bool active). Structured arrays let you store heterogeneous records in a single ndarray, and memory-mapped files let you work with datasets that exceed RAM.

These tools bridge the gap between NumPy arrays and database-like records without leaving the NumPy ecosystem.

Definition:

Structured Arrays

A structured array uses a compound dtype to store heterogeneous fields in each element:

dt = np.dtype([
    ('name', 'U10'),       # Unicode string, max 10 chars
    ('position', np.float64, (3,)),  # 3-D position vector
    ('mass', np.float64),
    ('active', np.bool_),
])

particles = np.zeros(100, dtype=dt)
particles['name'][0] = 'electron'
particles['position'][0] = [1.0, 2.0, 3.0]
particles['mass'][0] = 9.109e-31
particles['active'][0] = True

# Field access returns a view into the structured array
masses = particles['mass']          # shape (100,), dtype float64
positions = particles['position']   # shape (100, 3)

Definition:

Memory-Mapped Files: np.memmap

np.memmap maps a file on disk directly into virtual memory, allowing you to access huge arrays without loading them into RAM:

# Write a large array to disk
data = np.random.randn(10_000, 1000).astype(np.float32)
data.tofile('large_data.bin')

# Memory-map it (lazy loading, OS manages paging)
mm = np.memmap('large_data.bin', dtype=np.float32,
               mode='r', shape=(10_000, 1000))

# Access slices without loading the full file
subset = mm[500:600, :]   # only these rows are read from disk
print(subset.mean())

Modes: 'r' (read-only), 'r+' (read-write), 'w+' (create/overwrite), 'c' (copy-on-write).
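A minimal sketch of the write modes (the file name `scratch.dat` is illustrative): `mode='w+'` creates or overwrites the backing file, and `flush()` pushes dirty pages to disk:

```python
import numpy as np

# mode='w+' creates (or overwrites) the backing file
mm = np.memmap('scratch.dat', dtype=np.float32, mode='w+', shape=(100, 50))
mm[:] = 1.0        # assignments write to the mapped pages
mm.flush()         # push dirty pages to disk
del mm             # drop the mapping (closes the file)

# Reopen read-only: the data persisted
mm2 = np.memmap('scratch.dat', dtype=np.float32, mode='r', shape=(100, 50))
print(mm2.sum())   # 5000.0
```

Note that deleting the mapping (or letting it go out of scope) is what releases the file handle; `flush()` alone only guarantees the data reaches disk.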

Definition:

HDF5 and zarr

For structured, self-describing datasets, use HDF5 (via h5py) or zarr (cloud-native):

import h5py

# Write
with h5py.File('experiment.h5', 'w') as f:
    f.create_dataset('signal', data=np.random.randn(1000, 64))
    f.create_dataset('labels', data=np.arange(1000))
    f.attrs['sampling_rate'] = 44100

# Read (lazy β€” data loaded on access)
with h5py.File('experiment.h5', 'r') as f:
    chunk = f['signal'][100:200, :]   # loads only this slice
    sr = f.attrs['sampling_rate']

zarr adds chunked compression and cloud storage backends:

import zarr
z = zarr.open('data.zarr', mode='w',
              shape=(100_000, 256), chunks=(1000, 256),
              dtype='float32')
z[0:1000] = np.random.randn(1000, 256).astype(np.float32)

Theorem: Memory-Mapped I/O Performance

For sequential access patterns, memory-mapped files approach the disk's sequential bandwidth (and effectively RAM speed for pages already in the OS page cache), because the OS prefetches pages ahead of the read position. For random access, performance degrades toward raw disk latency due to page faults.

The OS virtual memory system loads file pages on demand (typically 4 KB at a time). Sequential reads trigger readahead prefetching; random reads cause page faults, each costing roughly tens of microseconds on an SSD or milliseconds on an HDD.
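The contrast can be observed directly. This sketch times a sequential slice against a fancy-indexed random gather on the same memmap; the file size and element counts are illustrative, and on a warm page cache both paths run from RAM, so the gap shrinks or vanishes:

```python
import time
import numpy as np

# Build a modest test file (~40 MB); not a rigorous benchmark
n = 10_000_000
np.arange(n, dtype=np.float32).tofile('bench.bin')
mm = np.memmap('bench.bin', dtype=np.float32, mode='r', shape=(n,))

t0 = time.perf_counter()
s_seq = mm[:1_000_000].sum(dtype=np.float64)   # sequential: readahead-friendly
t_seq = time.perf_counter() - t0

idx = np.random.default_rng(0).integers(0, n, size=1_000_000)
t0 = time.perf_counter()
s_rand = mm[idx].sum(dtype=np.float64)         # random: page-fault heavy when cold
t_rand = time.perf_counter() - t0

print(f'sequential: {t_seq:.4f}s, random: {t_rand:.4f}s')
```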

Example: Sorting Structured Arrays

Create a structured array of particles with fields (name, energy, charge) and sort by energy in descending order.
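One possible solution sketch (the particle energies are illustrative values, not reference data): `np.sort` understands structured dtypes via the `order` keyword, and reversing the sorted view gives descending order:

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('energy', np.float64), ('charge', np.int8)])
particles = np.array([
    ('electron', 0.511,   -1),
    ('proton',   938.272,  1),
    ('photon',   1.2,      0),   # energy values are illustrative
], dtype=dt)

# Sort ascending by the 'energy' field, then reverse for descending order
by_energy = np.sort(particles, order='energy')[::-1]
print(by_energy['name'])   # ['proton' 'photon' 'electron']
```

Passing a list such as `order=['energy', 'name']` sorts by energy with name as a tie-breaker.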

Example: Out-of-Core Computation with memmap

Compute the column-wise mean of a 10 GB dataset that does not fit in memory, using memory-mapped files and chunk processing.
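A sketch of the chunked approach, scaled down so it runs quickly (the file name and sizes are illustrative; the same loop works unchanged on a file larger than RAM):

```python
import numpy as np

# Illustrative scale: in practice the file would be far larger than RAM
n_rows, n_cols, chunk = 10_000, 100, 1_000
rng = np.random.default_rng(0)
rng.standard_normal((n_rows, n_cols)).astype(np.float32).tofile('big.bin')

mm = np.memmap('big.bin', dtype=np.float32, mode='r', shape=(n_rows, n_cols))

# Accumulate partial column sums chunk by chunk;
# only `chunk` rows need to be resident at a time
col_sum = np.zeros(n_cols, dtype=np.float64)
for start in range(0, n_rows, chunk):
    col_sum += mm[start:start + chunk].sum(axis=0, dtype=np.float64)
col_mean = col_sum / n_rows

# Sanity check against the in-memory result
ref = np.fromfile('big.bin', dtype=np.float32).reshape(n_rows, n_cols).mean(axis=0, dtype=np.float64)
print(np.allclose(col_mean, ref))   # True
```

Accumulating in float64 avoids the precision loss of summing millions of float32 values.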

Array Storage Formats Comparison

Format        | Metadata             | Compression       | Partial Read        | Cloud Native
------------- | -------------------- | ----------------- | ------------------- | ------------
.npy / .npz   | dtype, shape only    | Optional (npz)    | .npy via mmap_mode  | No
Raw binary    | None (you track it)  | No                | Via memmap          | No
HDF5 (h5py)   | Rich (groups, attrs) | Yes (gzip, lzf)   | Yes (chunked)       | Limited
zarr          | Rich (JSON attrs)    | Yes (many codecs) | Yes (chunked)       | Yes (S3, GCS)
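One nuance worth knowing: a plain .npy file can also be read lazily, because `np.load` accepts a `mmap_mode` argument that returns a memmap instead of loading the whole array. A short sketch (file name illustrative):

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000)
np.save('array.npy', a)

# np.load with mmap_mode returns a memmap instead of reading the whole file
m = np.load('array.npy', mmap_mode='r')
row = m[500]               # only this row's pages are touched
print(row[:3])             # [500000. 500001. 500002.]
```

This does not work for .npz archives, which must be decompressed on read.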

Quick Check

When you access a field of a structured array with particles['mass'], is the result a view or a copy?

A view β€” it points to the same underlying memory

A copy β€” structured arrays always copy on field access

It depends on the dtype

Neither β€” it returns a scalar

Common Mistake: Writing to a Read-Only memmap

Mistake:

Opening a memmap in read mode and trying to modify it:

mm = np.memmap('data.bin', dtype='float64', mode='r', shape=(100,))
mm[0] = 42.0   # ValueError: assignment destination is read-only

Correction:

Use mode='r+' for read-write access:

mm = np.memmap('data.bin', dtype='float64', mode='r+', shape=(100,))
mm[0] = 42.0   # OK
mm.flush()      # ensure changes are written to disk

structured array

A NumPy array with a compound dtype that can store heterogeneous fields (like a table row) in each element.

Related: Data Types (dtype), record array

memory-mapped file

A file mapped into virtual memory, allowing array-like access to disk data without loading it entirely into RAM.

Related: np.memmap, HDF5 and zarr

Why This Matters: Structured Arrays to Pandas DataFrames

Structured arrays are the ancestor of Pandas DataFrames. In Chapter 8 (Pandas), you will see that DataFrames internally use NumPy arrays for each column. Converting between them is trivial: pd.DataFrame(structured_array). Understanding structured dtypes helps when interfacing with low-level file formats that predate Pandas.
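As a quick sketch of that conversion (the particle values are illustrative):

```python
import numpy as np
import pandas as pd

dt = np.dtype([('name', 'U10'), ('mass', np.float64)])
particles = np.array([('electron', 9.109e-31),
                      ('proton',   1.673e-27)], dtype=dt)

# Field names become column names; each column keeps its own dtype
df = pd.DataFrame(particles)
print(df.columns.tolist())   # ['name', 'mass']
```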

See full treatment in Chapter 8

Structured Arrays & Memory-Mapped Files

Structured arrays, memory-mapped files, and HDF5 examples: ch05/python/structured_memmap.py