ndarray Internals
Why Understanding ndarray Internals Matters
NumPy arrays feel like Python lists at first glance, but their power comes from a radically different memory model. Understanding how data is laid out in memory, what strides are, and when you get a view vs a copy is the difference between code that runs in milliseconds and code that takes minutes.
This section peels back the abstraction to show you the raw machinery inside every np.ndarray.
Definition: The ndarray Object
An np.ndarray is a fixed-size, homogeneous, n-dimensional container
for numerical data. Every ndarray consists of:
- A data buffer: a contiguous block of raw bytes in memory
- A dtype: specifies how to interpret each element (e.g., float64 = 8 bytes)
- A shape: a tuple of axis lengths, e.g., (3, 4) means 3 rows, 4 columns
- Strides: a tuple of byte steps to advance along each axis
import numpy as np
a = np.array([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]], dtype=np.float64)
print(a.shape) # (2, 3)
print(a.dtype) # float64
print(a.strides) # (24, 8): 24 bytes to next row, 8 bytes to next column
print(a.nbytes) # 48: total bytes = 2 * 3 * 8
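To see that the data buffer really is just raw bytes, you can reinterpret it with a zero-copy view; a minimal sketch (the round trip through tobytes/frombuffer assumes the same machine and dtype on both ends):

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]], dtype=np.float64)

# Reinterpret the same 48-byte buffer as unsigned bytes (zero-copy view)
raw = a.view(np.uint8)
print(raw.shape)  # (2, 24): each row is 3 elements * 8 bytes each

# Round-tripping through the raw buffer recovers the original values
flat = np.frombuffer(a.tobytes(), dtype=np.float64)
print(flat)       # [1. 2. 3. 4. 5. 6.]
```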
Definition: C-Contiguous vs Fortran-Contiguous
C-contiguous (row-major): elements in each row are stored consecutively. The last index changes fastest. This is NumPy's default.
Fortran-contiguous (column-major): elements in each column are stored consecutively. The first index changes fastest. Used by MATLAB and Fortran.
c = np.array([[1, 2, 3], [4, 5, 6]], order='C')
f = np.array([[1, 2, 3], [4, 5, 6]], order='F')
print(c.strides) # (24, 8): row-major
print(f.strides) # (8, 16): column-major
print(c.flags['C_CONTIGUOUS']) # True
print(f.flags['F_CONTIGUOUS']) # True
The memory layout affects cache performance: iterate along contiguous axes for speed.
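A rough benchmark sketch of this effect (absolute timings vary by machine; the comparison, not the numbers, is the point):

```python
import time
import numpy as np

x = np.random.rand(2000, 2000)   # C-contiguous by default

t0 = time.perf_counter()
row_major = sum(row.sum() for row in x)              # rows are contiguous
t_rows = time.perf_counter() - t0

t0 = time.perf_counter()
col_major = sum(x[:, j].sum() for j in range(2000))  # columns are strided
t_cols = time.perf_counter() - t0

# Both loops compute the same total; the row-wise loop is typically faster
print(f"row-wise: {t_rows:.4f}s  column-wise: {t_cols:.4f}s")
```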
Definition: Strides
The strides attribute is a tuple of integers telling NumPy how many bytes to skip in memory to advance one position along each axis.
For a 2-D array with shape (m, n) and dtype float64 (8 bytes):
- C-contiguous: strides = (n * 8, 8): skip an entire row (n * 8 bytes) to go down, skip 8 bytes to go right
- F-contiguous: strides = (8, m * 8): skip 8 bytes to go down, skip an entire column (m * 8 bytes) to go right
Strides enable views without copying: slicing just changes the strides and the starting pointer.
a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(a.strides) # (32, 8)
print(a[::2].strides) # (64, 8): every other row, row stride doubled
print(a[:, ::2].strides) # (32, 16): every other column
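Strides are expressive enough to describe overlapping windows without any copying. A sketch using numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20+):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(8, dtype=np.float64)  # strides: (8,)

# Overlapping length-3 windows, built purely from strides (no copy)
w = sliding_window_view(a, 3)
print(w.shape)    # (6, 3)
print(w.strides)  # (8, 8): both axes step 8 bytes through the same buffer
print(np.shares_memory(a, w))  # True
```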
Definition: View vs Copy
A view shares the same data buffer as the original array. Modifying the view modifies the original. A copy allocates a new data buffer.
Rules of thumb:
| Operation | Result |
|---|---|
| Basic slicing: a[1:3], a[::2] | View |
| Transpose: a.T | View |
| Reshape (when possible): a.reshape(...) | View |
| Fancy indexing: a[[0, 2, 4]] | Copy |
| Boolean indexing: a[mask] | Copy |
| a.copy() | Copy |
Use np.shares_memory(a, b) to check at runtime:
a = np.arange(10)
b = a[::2] # view
c = a[[0, 2, 4]] # copy
print(np.shares_memory(a, b)) # True
print(np.shares_memory(a, c)) # False
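As a complement to np.shares_memory, a view's .base attribute points back at the array that owns the underlying buffer; a small sketch:

```python
import numpy as np

a = np.arange(10)
b = a[::2]        # view: keeps a reference to its base array
c = a[[0, 2, 4]]  # copy: backed by freshly allocated memory

print(b.base is a)     # True
print(c.base is a)     # False
print(a.base is None)  # True: a owns its own data
```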
Definition: Data Types (dtype)
Every ndarray element has the same dtype. Common scientific dtypes:
| dtype | Size | Range / Precision |
|---|---|---|
| float32 | 4 B | ~7 decimal digits, max ~3.4e38 |
| float64 | 8 B | ~16 decimal digits, max ~1.8e308 |
| complex64 | 8 B | Two float32 (real + imag) |
| complex128 | 16 B | Two float64 (real + imag) |
| int32 | 4 B | -2,147,483,648 to 2,147,483,647 |
| int64 | 8 B | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| bool_ | 1 B | True / False |
a = np.array([1.0, 2.0, 3.0]) # default float64
b = a.astype(np.float32) # downcast: saves memory
c = np.array([1+2j, 3+4j]) # complex128
print(c.real.dtype) # float64
Rule: use float64 unless memory or GPU constraints force float32.
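A quick illustration of why the float32 row says ~7 digits: its 24-bit significand cannot represent every integer above 2**24, so downcasting silently loses precision:

```python
import numpy as np

# float32 has a 24-bit significand: integers above 2**24 are not all exact
exact = np.float32(2**24)        # 16777216.0, exactly representable
rounded = np.float32(2**24 + 1)  # rounds back down to 16777216.0
print(rounded == exact)          # True: 16777217 is not representable

# float64's 53-bit significand handles it fine
print(np.float64(2**24 + 1))     # 16777217.0
```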
Theorem: Stride Computation Formula
For a C-contiguous array with shape (d_0, d_1, ..., d_{n-1}) and element size s bytes, the stride along axis k is:
stride_k = s * d_{k+1} * d_{k+2} * ... * d_{n-1}
In particular, the last axis always has stride s, and the first axis has the largest stride.
To move one step along axis k, you must skip over all elements in the remaining axes, that is, d_{k+1} * d_{k+2} * ... * d_{n-1} elements, each of size s bytes.
Think of a 3-D array as a book: axis 0 = page, axis 1 = row on page, axis 2 = column.
Base case
The last axis (k = n - 1) has stride s: moving one element along the last axis means moving s bytes.
Inductive step
Assume the stride for axis k + 1 is stride_{k+1} = s * d_{k+2} * ... * d_{n-1}. Advancing one step along axis k skips over d_{k+1} blocks of axis k + 1, so stride_k = d_{k+1} * stride_{k+1} = s * d_{k+1} * d_{k+2} * ... * d_{n-1}.
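The formula can be checked directly against NumPy's own strides; a small verification sketch (the helper c_strides is ours, not part of NumPy):

```python
import math
import numpy as np

def c_strides(shape, itemsize):
    # stride_k = itemsize * product of all dimensions after axis k
    return tuple(itemsize * math.prod(shape[k + 1:]) for k in range(len(shape)))

a = np.zeros((4, 3, 5), dtype=np.float64)
print(c_strides(a.shape, a.itemsize))  # (120, 40, 8)
print(a.strides)                       # (120, 40, 8): matches
```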
Theorem: View vs Copy Rule
A NumPy operation produces a view if and only if the result can be
described by the same data buffer with a different combination of
(offset, shape, strides). Otherwise, it must produce a copy.
Basic slicing with start:stop:step can always be expressed as new
strides (multiply by step) and new shape (number of selected elements).
Fancy indexing with an arbitrary list of indices cannot be expressed as
a regular stride pattern, so NumPy must allocate new memory.
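One consequence: flattening a transposed array cannot be expressed as strides over the original buffer, so even reshape is forced to copy there; a sketch:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat_c = a.reshape(-1)    # C-order flatten: expressible via strides -> view
flat_t = a.T.reshape(-1)  # transpose order is no regular stride pattern -> copy

print(np.shares_memory(a, flat_c))  # True
print(np.shares_memory(a, flat_t))  # False
```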
Theorem: Row-Major Iteration is Cache-Friendly
For a C-contiguous array, iterating over the last axis first (innermost loop) accesses memory sequentially and maximizes CPU cache utilization. Iterating over the first axis first causes cache misses.
Modern CPUs load data in cache lines (typically 64 bytes = 8 float64s). Sequential access loads a full cache line and uses every element. Strided access may load a cache line but use only one element before evicting it.
Example: Strides Under Slicing
Given a = np.arange(24, dtype=np.float64).reshape(4, 6), compute the
strides and shape of b = a[::2, 1::3] without running the code.
Original strides
a has shape (4, 6) with float64 (8 bytes).
C-contiguous strides: (6 * 8, 8) = (48, 8).
Row slicing a[::2]
Step 2 along axis 0: new stride = 2 * 48 = 96. New shape along axis 0: ceil(4 / 2) = 2.
Column slicing [:, 1::3]
Start at index 1, step 3 along axis 1: new stride = 3 * 8 = 24. New shape along axis 1: indices 1 and 4 are selected, so 2.
Result
b.shape = (2, 2), b.strides = (96, 24).
b is a view: np.shares_memory(a, b) returns True.
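Running the code confirms the hand computation:

```python
import numpy as np

a = np.arange(24, dtype=np.float64).reshape(4, 6)
b = a[::2, 1::3]

print(b.shape)    # (2, 2)
print(b.strides)  # (96, 24)
print(np.shares_memory(a, b))  # True: b is a view
```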
Example: C vs Fortran Order Performance
Create a large matrix in both C and Fortran order. Sum along rows (axis 1) and along columns (axis 0). Which is faster for each order?
Setup and benchmark
import time
n = 5000
c_arr = np.random.randn(n, n) # C-order
f_arr = np.asfortranarray(c_arr) # F-order
# Sum along axis 1 (row sums): contiguous access in C-order
t0 = time.perf_counter()
_ = c_arr.sum(axis=1)
print(f"C-order, row sum: {time.perf_counter()-t0:.4f}s")
t0 = time.perf_counter()
_ = f_arr.sum(axis=1)
print(f"F-order, row sum: {time.perf_counter()-t0:.4f}s")
# C-order is faster for row sums (contiguous access)
# F-order is faster for column sums (contiguous access)
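The complementary column-sum benchmark can be sketched the same way (timings vary by machine and by NumPy's internal reduction strategy, so the gap may be modest):

```python
import time
import numpy as np

n = 3000
c_arr = np.random.randn(n, n)          # C-order
f_arr = np.asfortranarray(c_arr)       # F-order, same values

# Sum along axis 0 (column sums): contiguous access in F-order
t0 = time.perf_counter()
cs_c = c_arr.sum(axis=0)
print(f"C-order, column sum: {time.perf_counter()-t0:.4f}s")

t0 = time.perf_counter()
cs_f = f_arr.sum(axis=0)
print(f"F-order, column sum: {time.perf_counter()-t0:.4f}s")

# Both layouts produce identical results; only the access pattern differs
print(np.allclose(cs_c, cs_f))  # True
```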
Memory Layout Explorer
Visualize how ndarray elements are stored in memory for different shapes, dtypes, and memory orders. See strides, contiguity flags, and the raw byte layout.
Historical Note: From Numeric to NumPy
2005: NumPy traces its lineage to Numeric (1995), one of the first array libraries for Python, created by Jim Hugunin. A competing library, Numarray, offered better support for large arrays. In 2005, Travis Oliphant unified both into NumPy, which became the foundation of the entire scientific Python ecosystem. The C-contiguous default was inherited from Numeric's C implementation.
Historical Note: Row-Major vs Column-Major: A Language War
1950s-1970s: C and its descendants (C++, Python/NumPy) use row-major order. Fortran, MATLAB, R, and Julia use column-major order. This split dates back to the 1950s: Fortran's array layout was optimized for the IBM 704's memory architecture. When Dennis Ritchie designed C in the 1970s, he chose the opposite convention for consistency with pointer arithmetic and multi-dimensional array decay.
Quick Check
For a C-contiguous float64 array with shape (3, 5), what are the strides in bytes?
- (40, 8)
- (8, 40)
- (24, 8)
- (8, 8)
Answer: (40, 8). Stride for axis 0: 5 * 8 = 40; stride for axis 1: 8.
Common Mistake: Accidental Mutation Through Views
Mistake:
Modifying a slice without realizing it is a view, which silently changes the original array:
a = np.arange(10)
b = a[3:7]
b[:] = 0 # a is now [0, 1, 2, 0, 0, 0, 0, 7, 8, 9]!
Correction:
Use .copy() explicitly when you need an independent array:
b = a[3:7].copy()
b[:] = 0 # a is unchanged
Common Mistake: Assuming Fancy Indexing Creates a View
Mistake:
Assigning through fancy indexing and expecting the original to update in a view-like manner:
a = np.arange(10)
b = a[[1, 3, 5]] # COPY, not view
b[0] = 99 # a is unchanged!
Correction:
To modify elements via fancy indexing, assign directly to the original:
a[[1, 3, 5]] = [99, 99, 99] # modifies a directly
ndarray
NumPy's core data structure: a fixed-size, homogeneous, n-dimensional array with a contiguous memory buffer.
Related: Data Types (dtype), Strides, Shape
stride
The number of bytes to step in memory to advance one position along a given axis of an ndarray.
Related: view, Contiguity and Stride Formula
view
An ndarray that shares its data buffer with another array. Modifications to the view are reflected in the original.
Related: Zero-Copy NumPy <-> PyTorch Conversion, Shares Memory
contiguous
An array whose elements are stored in a single unbroken block of memory, without gaps or out-of-order elements.
Related: C Contiguous, Fortran Contiguous
# Code from: ch05/python/ndarray_internals.py
Why This Matters: GPU Arrays Use the Same Memory Model
CuPy and PyTorch tensors adopt the same concepts: contiguous memory
buffers, strides, dtypes, and views. Understanding ndarray internals
transfers directly to GPU programming. In Chapter 12, you will use
tensor.is_contiguous() and tensor.stride(): the same mental
model applies.
See full treatment in Chapter 12
Key Takeaway
Basic slicing (a[::2], a[1:5]) creates views; fancy indexing
(a[[0,2,4]]) and boolean indexing (a[mask]) create copies.
Always use np.shares_memory() when in doubt.