CuPy Basics

From CPU to GPU: Why Bother?

A modern CPU has 8-64 cores optimized for low-latency serial execution. A GPU has thousands of cores optimized for high-throughput parallel execution. For embarrassingly parallel operations (element-wise math, matrix multiplication, FFTs), a GPU can deliver 10-100x speedups.

CuPy lets you tap into this power with one-line changes: replace import numpy as np with import cupy as cp, and most code runs on the GPU unchanged. This section covers the mechanics of creating arrays, moving data, and understanding when GPU acceleration pays off.

Definition:

CuPy ndarray β€” A GPU-Resident Array

A cupy.ndarray is a multidimensional array whose data lives in GPU (device) memory. It mirrors numpy.ndarray in API:

import cupy as cp

a = cp.array([1, 2, 3])           # from Python list
b = cp.zeros((3, 4), dtype=cp.float64)  # zeros on GPU
c = cp.random.randn(1000, 1000)   # random on GPU

Arithmetic on cupy.ndarray dispatches to CUDA kernels automatically:

d = a + b      # element-wise add on GPU
e = c @ c.T    # matrix multiply via cuBLAS

CuPy arrays cannot be mixed with NumPy arrays in operations. You must explicitly transfer data between host and device.

Definition:

Host-Device Data Transfer

Data transfer between CPU (host) and GPU (device) uses:

import numpy as np
import cupy as cp

# Host -> Device (H2D)
a_np = np.random.randn(1000)
a_cp = cp.asarray(a_np)       # copies to GPU

# Device -> Host (D2H)
b_np = cp.asnumpy(a_cp)       # copies to CPU
b_np = a_cp.get()              # equivalent

Transfer bandwidth is limited by the PCIe bus (typically 12-32 GB/s), which is one to two orders of magnitude slower than GPU memory bandwidth (1-3 TB/s). The cost of a transfer is:

$$T_{\mathrm{transfer}} = T_{\mathrm{latency}} + \frac{N_{\mathrm{bytes}}}{B_{\mathrm{PCIe}}}$$

where $T_{\mathrm{latency}} \approx 10\,\mu\text{s}$ and $B_{\mathrm{PCIe}}$ is the bus bandwidth.

The golden rule: move data to the GPU once, compute many times, move results back once. Avoid round-tripping in a loop.
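To see the rule in code, here is a minimal sketch contrasting the anti-pattern with the correct pattern (the array size and loop count are arbitrary):

import numpy as np
import cupy as cp

x_np = np.random.randn(1_000_000)

# Anti-pattern: one H2D and one D2H transfer per iteration
for _ in range(100):
    x_np = cp.asnumpy(cp.sin(cp.asarray(x_np)))

# Golden rule: transfer once, compute many times, transfer back once
x_cp = cp.asarray(x_np)
for _ in range(100):
    x_cp = cp.sin(x_cp)
x_np = cp.asnumpy(x_cp)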

Definition:

Drop-in NumPy Replacement

CuPy implements the NumPy API on the GPU. The standard pattern for making code GPU-portable is:

import cupy as cp

xp = cp    # swap to np for CPU fallback

a = xp.linspace(0, 1, 1000)
b = xp.fft.fft(a)
c = xp.linalg.norm(b)

The cupy.get_array_module function (or the newer array_api_compat package) auto-detects the backend:

def process(x):
    xp = cp.get_array_module(x)
    return xp.fft.fft(x) * xp.hamming(len(x))

This function works identically whether x is a NumPy or CuPy array.
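For example, assuming NumPy is imported as np, the same call works on either backend:

a_np = np.random.randn(256)
a_cp = cp.asarray(a_np)

out_np = process(a_np)   # dispatches to NumPy, returns numpy.ndarray
out_cp = process(a_cp)   # dispatches to CuPy, returns cupy.ndarray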

Definition:

CuPy Memory Pool

CuPy uses a memory pool to avoid the overhead of cudaMalloc / cudaFree on every allocation. Freed arrays return their memory to the pool rather than the OS:

mempool = cp.get_default_memory_pool()
print(f"Used: {mempool.used_bytes() / 1e6:.1f} MB")
print(f"Total: {mempool.total_bytes() / 1e6:.1f} MB")

mempool.free_all_blocks()  # release unused memory

The pool grows on demand and reuses blocks, making repeated allocations of the same size essentially free after the first call.

If you need to interoperate with other GPU libraries (PyTorch, TensorFlow), you may need to limit CuPy's pool size or use a shared allocator to avoid GPU out-of-memory errors.
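One option is to cap the default pool with MemoryPool.set_limit; the 2 GiB figure below is an arbitrary example:

mempool = cp.get_default_memory_pool()
mempool.set_limit(size=2 * 1024**3)   # cap the pool at 2 GiB
# Equivalently, set the CUPY_GPU_MEMORY_LIMIT environment variable before import.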

Definition:

Multi-GPU Device Management

CuPy supports multi-GPU systems. Select a device with a context manager:

with cp.cuda.Device(0):
    a = cp.zeros((1000,))   # on GPU 0

with cp.cuda.Device(1):
    b = cp.zeros((1000,))   # on GPU 1

Arrays are bound to the device where they were created. Cross-device operations require explicit transfer via cp.asarray.
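For example, to move an array created on GPU 0 over to GPU 1 (a sketch assuming two devices are present):

with cp.cuda.Device(0):
    a = cp.arange(10)        # lives on GPU 0

with cp.cuda.Device(1):
    a1 = cp.asarray(a)       # explicit copy onto GPU 1
    b = a1 * 2               # computed on GPU 1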

Theorem: Amdahl's Law Applied to GPU Acceleration

Let $f$ be the fraction of total work that can be offloaded to the GPU, and let $S_{\mathrm{GPU}}$ be the GPU's speedup on that fraction. The overall speedup is bounded by:

$$S_{\mathrm{total}} \le \frac{1}{(1 - f) + f / S_{\mathrm{GPU}}}$$

Even with $S_{\mathrm{GPU}} \to \infty$, the maximum speedup is $1 / (1 - f)$.

If only 50% of your code is GPU-parallelizable ($f = 0.5$), the best possible speedup is $2\times$, no matter how fast the GPU. This is why it is critical to minimize serial bottlenecks and data transfers.
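A quick numeric check of the bound (plain Python; the values are illustrative):

def amdahl_speedup(f, s_gpu):
    return 1.0 / ((1.0 - f) + f / s_gpu)

print(amdahl_speedup(0.5, 100))    # ~1.98x: pinned near the 1/(1-f) = 2x ceiling
print(amdahl_speedup(0.95, 100))   # ~16.8x: offloading more work pays off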

Theorem: Transfer-Compute Crossover Point

GPU acceleration is beneficial only when the compute time saved exceeds the transfer overhead. For an $O(n^k)$ algorithm with $O(n^m)$ data transfer ($m < k$), the crossover size $n^*$ satisfies:

$$c_{\mathrm{CPU}} \cdot n^k = c_{\mathrm{GPU}} \cdot n^k + \frac{2\, n^m w}{B_{\mathrm{PCIe}}}$$

where $w$ is the element width in bytes. Solving:

$$n^{k-m} \ge \frac{2w}{(c_{\mathrm{CPU}} - c_{\mathrm{GPU}}) \cdot B_{\mathrm{PCIe}}}$$

For matrix multiplication ($k=3$, $m=2$), small matrices (e.g., $n < 128$) are often faster on CPU.

The GPU wins when there is enough computation to amortize the PCIe transfer cost. Cubic algorithms (matmul, SVD) cross over at smaller sizes than linear algorithms (element-wise ops).

Theorem: The Roofline Model

The attainable performance $P$ (FLOP/s) of a kernel with arithmetic intensity $I$ (FLOP/byte) is bounded by:

$$P \le \min\bigl(\, P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}} \,\bigr)$$

where $P_{\mathrm{peak}}$ is the GPU's peak compute throughput and $B_{\mathrm{mem}}$ is the memory bandwidth. The crossover point $I^* = P_{\mathrm{peak}} / B_{\mathrm{mem}}$ separates memory-bound ($I < I^*$) from compute-bound ($I > I^*$) kernels.

Element-wise operations (add, multiply) have low arithmetic intensity and are memory-bound: they run at memory bandwidth speed. Matrix multiplication has high arithmetic intensity ($O(n)$ FLOP per byte loaded) and can approach peak compute.
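To make this concrete, the following sketch evaluates the bound for a float64 element-wise add, using assumed (not measured) device numbers:

# Assumed device specs -- substitute your GPU's datasheet values
P_peak = 20e12    # peak compute, 20 TFLOP/s (assumed)
B_mem = 1.5e12    # memory bandwidth, 1.5 TB/s (assumed)

I_star = P_peak / B_mem             # crossover intensity, ~13.3 FLOP/byte
# c = a + b in float64 moves 24 bytes (two reads, one write) per 1 FLOP
I_add = 1 / 24
print(min(P_peak, I_add * B_mem))   # ~6.25e10 FLOP/s: firmly memory-bound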

Example: Your First CuPy Program

Create two random $1000 \times 1000$ matrices on the GPU and multiply them. Time the operation and compare to NumPy on the CPU.
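A minimal sketch of this exercise (the synchronization calls follow the timing pattern discussed below):

import time
import numpy as np
import cupy as cp

n = 1000
a_cp = cp.random.randn(n, n)
b_cp = cp.random.randn(n, n)

cp.cuda.Stream.null.synchronize()           # wait for pending GPU work
t0 = time.perf_counter()
c_cp = a_cp @ b_cp                          # cuBLAS matmul on the GPU
cp.cuda.Stream.null.synchronize()           # wait for the kernel to finish
print(f"GPU: {time.perf_counter() - t0:.4f} s")

a_np = np.random.randn(n, n)
b_np = np.random.randn(n, n)
t0 = time.perf_counter()
c_np = a_np @ b_np                          # BLAS matmul on the CPU
print(f"CPU: {time.perf_counter() - t0:.4f} s")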

Example: Measuring Transfer Overhead

Measure the time to transfer a $10^6$-element array from CPU to GPU and back. Compare to the time for element-wise squaring on the GPU.
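One way to run this measurement (a sketch; absolute numbers depend on your PCIe generation and GPU):

import time
import numpy as np
import cupy as cp

x_np = np.random.randn(1_000_000)

t0 = time.perf_counter()
x_cp = cp.asarray(x_np)                     # H2D transfer
cp.cuda.Stream.null.synchronize()
print(f"H2D:     {time.perf_counter() - t0:.6f} s")

t0 = time.perf_counter()
y_cp = x_cp ** 2                            # element-wise square on GPU
cp.cuda.Stream.null.synchronize()
print(f"compute: {time.perf_counter() - t0:.6f} s")

t0 = time.perf_counter()
y_np = cp.asnumpy(y_cp)                     # D2H transfer (synchronous)
print(f"D2H:     {time.perf_counter() - t0:.6f} s")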

Example: Writing GPU-Portable Code

Write a function that computes the softmax of an array and works with both NumPy and CuPy inputs without code changes.
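One possible solution, using cp.get_array_module as shown above (the max-subtraction is the standard numerical-stability trick):

import numpy as np
import cupy as cp

def softmax(x):
    xp = cp.get_array_module(x)   # np or cp, depending on the input
    z = x - x.max()               # subtract the max for numerical stability
    e = xp.exp(z)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # returns a numpy.ndarray
print(softmax(cp.array([1.0, 2.0, 3.0])))   # returns a cupy.ndarray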

CPU vs GPU: Matrix Multiply Benchmark

Compare matrix multiplication speed on CPU (NumPy/BLAS) vs GPU (CuPy/cuBLAS) across a range of matrix sizes.

Transfer Overhead Animation

Animated visualization showing how transfer overhead shrinks relative to compute time as problem size grows.

CPU vs GPU Architecture

Simplified comparison of CPU and GPU architectures. The CPU has a few powerful cores with large caches; the GPU has thousands of simple cores with high-bandwidth memory.

Quick Check

You need to add two $1000$-element vectors on the GPU. The transfer time (H2D + D2H) is $50\,\mu\text{s}$ and the GPU computation takes $2\,\mu\text{s}$. Is it worth using the GPU for this single operation?

Yes: the GPU computation is 25x faster than CPU

No: the transfer overhead dominates for such small work

It depends on the GPU model

Quick Check

Why must you call cp.cuda.Stream.null.synchronize() before measuring GPU execution time with time.perf_counter()?

To flush the GPU cache

Because GPU operations are asynchronous: the Python call returns before the GPU finishes

To prevent memory leaks

It is only needed for multi-GPU systems

Common Mistake: Mixing NumPy and CuPy Arrays

Mistake:

Trying to operate on a NumPy array and a CuPy array together:

a_np = np.ones(100)
b_cp = cp.ones(100)
c = a_np + b_cp  # TypeError!

Correction:

Always convert explicitly:

c = cp.asarray(a_np) + b_cp    # both on GPU
# or
c = a_np + cp.asnumpy(b_cp)    # both on CPU

Common Mistake: Forgetting GPU Synchronization When Timing

Mistake:

Timing GPU operations without synchronization:

t0 = time.perf_counter()
c = a @ b  # returns immediately!
elapsed = time.perf_counter() - t0  # wrong: kernel not done

Correction:

Always synchronize before and after the timed section:

cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
c = a @ b
cp.cuda.Stream.null.synchronize()
elapsed = time.perf_counter() - t0  # correct

Alternatively, use CuPy's built-in benchmarking helper, which handles warm-up and synchronization for you:

from cupyx.profiler import benchmark
print(benchmark(lambda: a @ b, n_repeat=100))

Key Takeaway

CuPy is a drop-in GPU replacement for NumPy: same API, same semantics, 10-100x faster for large arrays. The key caveat is data transfer: move data to the GPU once, do all computation there, and transfer results back once. For small arrays or single operations, the PCIe transfer overhead can negate the GPU speedup entirely.

Key Takeaway

Write GPU-portable code using cp.get_array_module(x) to auto-detect the backend. This lets the same function run on CPU or GPU depending on the input type, which is essential for library code that should work everywhere.

Historical Note: The Birth of GPGPU Computing

2007-2015

Before 2007, GPUs could only be programmed through graphics APIs (OpenGL, DirectX). NVIDIA's release of CUDA in 2007 opened GPUs to general-purpose computing. CuPy, released by Preferred Networks in 2015, brought CUDA's power to the Python ecosystem with a NumPy-compatible interface, democratizing GPU computing for scientists who had never written C code.

Historical Note: The End of Dennard Scaling and the Rise of Accelerators

2006-present

From the 1970s until about 2006, shrinking transistors let CPU clock speeds rise steadily at roughly constant power density (Dennard scaling). When power density limits halted frequency increases around 4 GHz, the industry shifted to parallelism: multi-core CPUs and massively parallel GPUs. Today, a single GPU can deliver over 300 TFLOPS in FP16, dwarfing any CPU.

Why This Matters: GPU-Accelerated MIMO Processing

In massive MIMO systems with $N_t = 64$ or more antennas, the channel matrix $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ must be processed in real time for every subcarrier and time slot. Operations like $\mathbf{H}^H \mathbf{H}$, matrix inversion, and SVD are natural candidates for GPU acceleration via CuPy. A single cp.linalg.svd call on a batch of channel matrices can process all subcarriers simultaneously.
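As a sketch of the idea (assuming a CuPy version whose cp.linalg.svd accepts stacked matrices; the dimensions are made up):

import cupy as cp

n_sub, n_r, n_t = 1200, 64, 64    # subcarriers x rx antennas x tx antennas
H = cp.random.randn(n_sub, n_r, n_t) + 1j * cp.random.randn(n_sub, n_r, n_t)

G = H.conj().transpose(0, 2, 1) @ H      # batched H^H H, one per subcarrier
s = cp.linalg.svd(H, compute_uv=False)   # batched singular values, (n_sub, 64)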

See full treatment in CuPy Linear Algebra

Host

The CPU and its main system memory (RAM). In GPU computing, data originates on the host.

Device

The GPU and its dedicated high-bandwidth memory (HBM/GDDR). CuPy arrays reside on the device.

CUDA Kernel

A function compiled to run on the GPU across many threads in parallel. CuPy auto-generates kernels from NumPy-like expressions.

CUDA Stream

A sequence of GPU operations that execute in order. Different streams can execute concurrently for pipeline parallelism.

Memory Pool

CuPy's caching allocator that reuses GPU memory blocks to avoid expensive cudaMalloc/cudaFree calls on every array creation.

CuPy Basics: Complete Examples

Complete runnable examples for array creation, data transfer, GPU-portable functions, and memory pool management.

# Code from: ch11/python/s01_cupy_basics.py