Interoperability: NumPy, CuPy, PyTorch
Definition: Zero-Copy NumPy <-> PyTorch Conversion
CPU tensors and NumPy arrays can share the same memory:
import numpy as np
import torch
# NumPy -> PyTorch (zero-copy, shared memory)
a = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(a) # shares memory
t[0] = 99.0
print(a[0]) # 99.0, the mutation is visible!
# PyTorch -> NumPy (zero-copy for CPU tensors)
t2 = torch.randn(5)
a2 = t2.numpy() # shares memory
Critical constraint: .numpy() only works on CPU tensors.
For GPU tensors, call .cpu().numpy() or .detach().cpu().numpy().
Zero-copy conversion preserves the dtype. Since PyTorch defaults to float32 and NumPy to float64, the converted array or tensor keeps the dtype of its source, so be aware of this mismatch.
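A small sketch of that dtype carry-over (variable names here are illustrative):
import numpy as np
import torch
a = np.zeros(3)                     # NumPy default dtype: float64
t = torch.from_numpy(a)
print(t.dtype)                      # torch.float64, inherited from the array
t32 = torch.zeros(3)                # PyTorch default dtype: float32
print(t32.numpy().dtype)            # float32, inherited from the tensor
# Cast explicitly if downstream code expects float32; note that casting
# copies, so the result no longer shares memory with the NumPy array.
t_f32 = torch.from_numpy(a).float()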
Definition: DLPack: Universal Tensor Exchange
DLPack is a protocol for zero-copy tensor sharing between
frameworks, including GPU memory. Since PyTorch 1.10 and NumPy 1.23,
the __dlpack__ protocol is supported natively in both libraries:
import torch
import numpy as np
# PyTorch -> anything via DLPack
t = torch.randn(5, device="cuda")
capsule = t.__dlpack__()
# Anything -> PyTorch
a = np.array([1.0, 2.0])
t_cpu = torch.from_dlpack(a) # zero-copy from NumPy
# CuPy <-> PyTorch (GPU zero-copy)
import cupy as cp
c = cp.from_dlpack(t) # PyTorch GPU -> CuPy
t2 = torch.from_dlpack(c) # CuPy -> PyTorch GPU
DLPack is the lingua franca of tensor libraries: it works across NumPy, CuPy, PyTorch, JAX, and TensorFlow.
Definition: Writing Backend-Agnostic Code
To write code that works with NumPy, CuPy, and PyTorch tensors interchangeably, use the Array API standard (NEP 47):
import array_api_compat
import numpy as np
import torch
import cupy as cp
def normalize(x):
    xp = array_api_compat.array_namespace(x)
    return x / xp.linalg.norm(x)
# Works with any backend:
normalize(np.array([3.0, 4.0]))      # NumPy
normalize(torch.tensor([3.0, 4.0]))  # PyTorch
normalize(cp.array([3.0, 4.0]))      # CuPy
The array_namespace function inspects the input and returns the
matching array namespace (numpy, torch, or cupy, wrapped for Array API
compatibility), letting you write framework-agnostic scientific code.
The Array API standard covers a common subset of operations. For advanced features (autograd, custom CUDA kernels), you still need framework-specific code.
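As one illustration of that boundary, here is a hypothetical helper (not from the chapter's code) that stays on the Array API path where possible and drops down to PyTorch-specific autograd only when the input is a torch.Tensor:
import array_api_compat
import torch
def gradient_of_norm(x):
    # Autograd is a framework-specific feature, so branch explicitly.
    if isinstance(x, torch.Tensor):
        x = x.detach().requires_grad_(True)
        torch.linalg.norm(x).backward()
        return x.grad
    # NumPy/CuPy path: use the analytic gradient d||x||/dx = x / ||x||
    xp = array_api_compat.array_namespace(x)
    return x / xp.linalg.norm(x)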
Example: SciPy + PyTorch Mixed Computation Pipeline
Use SciPy for sparse matrix assembly (which PyTorch lacks), convert to a dense PyTorch tensor, compute eigenvalues on GPU, and convert back to NumPy for plotting.
Implementation
import numpy as np
import scipy.sparse as sp
import torch
# Step 1: Build sparse Laplacian in SciPy
n = 100
L = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
# Step 2: Convert to dense PyTorch tensor
L_dense = torch.from_numpy(L.toarray()).to(torch.float64)
# Step 3: Compute eigenvalues (on GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
L_dev = L_dense.to(device)
eigvals = torch.linalg.eigvalsh(L_dev)
# Step 4: Back to NumPy for analysis
eigvals_np = eigvals.cpu().numpy()
print(f"Smallest eigenvalue: {eigvals_np[0]:.6f}")
print(f"Largest eigenvalue: {eigvals_np[-1]:.6f}")
print(f"Condition number: {eigvals_np[-1] / eigvals_np[0]:.2f}")
When This Pattern Is Useful
SciPy has the best sparse matrix support in Python. PyTorch has the best GPU linear algebra. Combining them gives you the best of both worlds for problems like PDE discretization followed by eigenvalue analysis.
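As a quick sanity check on that combination (a standalone sketch, not part of the chapter's code): the n x n tridiagonal (-1, 2, -1) Laplacian has known eigenvalues 2 - 2*cos(k*pi/(n+1)), which the PyTorch result should reproduce.
import numpy as np
import scipy.sparse as sp
import torch
n = 100
L = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
computed = torch.linalg.eigvalsh(torch.from_numpy(L.toarray())).numpy()
# Analytic eigenvalues of the Dirichlet Laplacian, sorted ascending
k = np.arange(1, n + 1)
analytic = np.sort(2.0 - 2.0 * np.cos(k * np.pi / (n + 1)))
print(np.allclose(analytic, computed))  # expected: True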
Example: Zero-Copy GPU Transfer Between CuPy and PyTorch
Demonstrate zero-copy GPU memory sharing between CuPy and PyTorch using DLPack.
Implementation
import torch
import cupy as cp
# Create on PyTorch GPU
t = torch.randn(1000, 1000, device="cuda", dtype=torch.float64)
# Zero-copy to CuPy
c = cp.from_dlpack(t)
print(f"Same pointer: {c.data.ptr == t.data_ptr()}") # True
# Modify in CuPy; the change is visible in PyTorch
c[0, 0] = 42.0
print(f"PyTorch sees change: {t[0, 0].item() == 42.0}") # True
# Use CuPy's unique features (e.g., custom kernels)
result_cp = cp.fft.fft2(c)
# Back to PyTorch (zero-copy)
result_pt = torch.from_dlpack(result_cp)
Why This Matters
CuPy provides custom CUDA kernels and cuFFT wrappers that PyTorch may not expose directly. DLPack lets you mix frameworks without any data copies on GPU.
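A minimal sketch of that workflow, assuming a CUDA device is available (the kernel below is illustrative, not from the chapter's code):
import torch
import cupy as cp
t = torch.arange(4, device="cuda", dtype=torch.float64)
c = cp.from_dlpack(t)            # zero-copy view of the same GPU buffer
# A custom CuPy elementwise kernel, run in place on the shared memory
square = cp.ElementwiseKernel(
    'float64 x', 'float64 y', 'y = x * x', 'square')
square(c, c)
print(t)                         # PyTorch sees [0., 1., 4., 9.] without any copy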
Interoperability Transfer Benchmark
Compare the time and memory cost of different conversion methods (from_numpy, DLPack, explicit copy) across array sizes.
Tensor Conversion Methods
| Method | Copy? | GPU? | Notes |
|---|---|---|---|
| torch.from_numpy(a) | No (shared) | CPU only | Fastest for CPU NumPy arrays |
| torch.tensor(a) | Yes (always) | Any | Creates independent copy |
| torch.as_tensor(a) | No (if possible) | CPU only for NumPy | Smart: avoids copy when safe |
| t.numpy() | No (shared) | CPU only | Requires .cpu() for GPU tensors |
| torch.from_dlpack(x) | No (shared) | Yes | Universal protocol, works with CuPy/JAX |
| cp.from_dlpack(t) | No (shared) | Yes | PyTorch GPU -> CuPy GPU |
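A minimal CPU-only timing sketch of these paths (the bench helper and array size are illustrative; zero-copy conversions should stay roughly constant-time while torch.tensor scales with the data size):
import time
import numpy as np
import torch
def bench(label, fn, repeats=100):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    print(f"{label:<22s} {(time.perf_counter() - start) / repeats * 1e6:8.1f} us")
a = np.random.rand(10_000_000)   # ~80 MB of float64
bench("from_numpy (shared)", lambda: torch.from_numpy(a))
bench("as_tensor (shared)", lambda: torch.as_tensor(a))
bench("from_dlpack (shared)", lambda: torch.from_dlpack(a))
bench("tensor (copy)", lambda: torch.tensor(a))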
Quick Check
After t = torch.from_numpy(a), you modify t[0] = 99.
What happens to a[0]?
a[0] remains unchanged
a[0] becomes 99
A RuntimeError is raised
It depends on the dtype
a[0] becomes 99: zero-copy sharing means both names point to the same memory.
Common Mistake: Calling .numpy() on GPU Tensors
Mistake:
Calling .numpy() on a CUDA tensor:
t = torch.randn(5, device="cuda")
a = t.numpy() # RuntimeError!
Correction:
Move to CPU first, and detach from the graph if needed:
a = t.detach().cpu().numpy()
Historical Note: The DLPack Standard
2010s-2020s: DLPack was proposed in 2017 by the DMLC (Distributed Machine Learning Community) to solve the growing fragmentation between tensor libraries. By 2022, it was adopted by NumPy, CuPy, PyTorch, JAX, and TensorFlow. The Python Array API consortium (2020-present) built on DLPack to define a standard set of array operations across all these libraries.
Key Takeaway
NumPy, CuPy, and PyTorch can share memory zero-copy on CPU
(via from_numpy/.numpy()) and on GPU (via DLPack). Use
the Array API standard for backend-agnostic code. Never call
.numpy() on GPU tensors; call .cpu() first.
DLPack
A protocol for zero-copy tensor exchange between deep learning frameworks, supporting both CPU and GPU memory.
Related: Tensor
Array API Standard
A specification (NEP 47) defining a common subset of array operations that NumPy, PyTorch, CuPy, and JAX all implement, enabling backend-agnostic code.
Related: DLPack
NumPy-CuPy-PyTorch Interoperability
# Code from: ch12/python/interop.py
Backend-Agnostic Scientific Computing
# Code from: ch12/python/backend_agnostic.py