CuPy Basics
From CPU to GPU: Why Bother?
A modern CPU has 8-64 cores optimized for low-latency serial execution. A GPU has thousands of cores optimized for high-throughput parallel execution. For embarrassingly parallel operations (element-wise math, matrix multiplication, FFTs), a GPU can deliver 10-100x speedups.
CuPy lets you tap into this power with one-line changes: replace
import numpy as np with import cupy as cp, and most code runs on
the GPU unchanged. This section covers the mechanics of creating
arrays, moving data, and understanding when GPU acceleration pays off.
Definition: CuPy ndarray - A GPU-Resident Array
A cupy.ndarray is a multidimensional array whose data lives in
GPU (device) memory. It mirrors numpy.ndarray in API:
import cupy as cp
a = cp.array([1, 2, 3]) # from Python list
b = cp.zeros((3, 4), dtype=cp.float64) # zeros on GPU
c = cp.random.randn(1000, 1000) # random on GPU
Arithmetic on cupy.ndarray dispatches to CUDA kernels automatically:
d = a + b # element-wise add on GPU
e = c @ c.T # matrix multiply via cuBLAS
CuPy arrays cannot be mixed with NumPy arrays in operations. You must explicitly transfer data between host and device.
Definition: Host-Device Data Transfer
Data transfer between CPU (host) and GPU (device) uses:
import numpy as np
import cupy as cp
# Host -> Device (H2D)
a_np = np.random.randn(1000)
a_cp = cp.asarray(a_np) # copies to GPU
# Device -> Host (D2H)
b_np = cp.asnumpy(a_cp) # copies to CPU
b_np = a_cp.get() # equivalent
Transfer bandwidth is limited by the PCIe bus (typically 12-32 GB/s), which is 10-50x slower than GPU memory bandwidth (1-3 TB/s). The cost of transferring $n$ elements of $s$ bytes each is:
$$t_{\mathrm{transfer}} = \frac{n\,s}{B}$$
where $n\,s$ is the payload size in bytes and $B$ is the bus bandwidth.
The golden rule: move data to the GPU once, compute many times, move results back once. Avoid round-tripping in a loop.
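For instance, a minimal sketch of the difference (the 10-step loop and the toy update are illustrative, not a real workload):
import numpy as np
import cupy as cp
x_np = np.random.randn(1_000_000)
# Bad: round-trips through the PCIe bus on every iteration
y_np = x_np
for _ in range(10):
    y_np = cp.asnumpy(cp.sqrt(cp.asarray(y_np) ** 2 + 1.0))
# Good: one H2D transfer, all iterations on the device, one D2H transfer
y_cp = cp.asarray(x_np)
for _ in range(10):
    y_cp = cp.sqrt(y_cp ** 2 + 1.0)
y_np = cp.asnumpy(y_cp)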
Definition: Drop-in NumPy Replacement
CuPy implements the NumPy API on the GPU. The standard pattern for making code GPU-portable is:
import cupy as cp
xp = cp # swap to np for CPU fallback
a = xp.linspace(0, 1, 1000)
b = xp.fft.fft(a)
c = xp.linalg.norm(b)
The cupy.get_array_module function (or the newer
array_api_compat package) auto-detects the backend:
def process(x):
    xp = cp.get_array_module(x)
    return xp.fft.fft(x) * xp.hamming(len(x))
This function works identically whether x is a NumPy or CuPy array.
Definition: CuPy Memory Pool
CuPy uses a memory pool to avoid the overhead of cudaMalloc /
cudaFree on every allocation. Freed arrays return their memory to
the pool rather than the OS:
mempool = cp.get_default_memory_pool()
print(f"Used: {mempool.used_bytes() / 1e6:.1f} MB")
print(f"Total: {mempool.total_bytes() / 1e6:.1f} MB")
mempool.free_all_blocks() # release unused memory
The pool grows on demand and reuses blocks, making repeated allocations of the same size essentially free after the first call.
If you need to interoperate with other GPU libraries (PyTorch, TensorFlow), you may need to limit CuPy's pool size or use a shared allocator to avoid GPU out-of-memory errors.
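As a hedged sketch, the default pool can be capped with set_limit (the 2 GB figure below is an arbitrary example), or CuPy can be pointed at a different allocator:
import cupy as cp
# Cap the default memory pool (illustrative 2 GB limit)
cp.get_default_memory_pool().set_limit(size=2 * 1024**3)
# Or route allocations through a managed-memory pool instead of the default one
# cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)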
Definition: Multi-GPU Device Management
CuPy supports multi-GPU systems. Select a device with a context manager:
with cp.cuda.Device(0):
    a = cp.zeros((1000,))  # on GPU 0
with cp.cuda.Device(1):
    b = cp.zeros((1000,))  # on GPU 1
Arrays are bound to the device where they were created. Cross-device
operations require explicit transfer via cp.asarray.
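A minimal sketch of a cross-device copy (assumes a machine with at least two GPUs):
import cupy as cp
with cp.cuda.Device(0):
    a = cp.arange(1000, dtype=cp.float32)  # lives on GPU 0
with cp.cuda.Device(1):
    b = cp.asarray(a)                      # copies onto GPU 1
    c = b * 2.0                            # computed on GPU 1
print(a.device, b.device)                  # <CUDA Device 0> <CUDA Device 1>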
Theorem: Amdahl's Law Applied to GPU Acceleration
Let $p$ be the fraction of total work that can be offloaded to the GPU, and let $s$ be the GPU's speedup on that fraction. The overall speedup is bounded by:
$$S(p, s) = \frac{1}{(1 - p) + p/s}$$
Even with $s \to \infty$, the maximum speedup is $1/(1 - p)$.
If only 50% of your code is GPU-parallelizable ($p = 0.5$), the best possible speedup is $2\times$, no matter how fast the GPU. This is why it is critical to minimize serial bottlenecks and data transfers.
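A quick calculation of the bound for a few values of $p$, assuming a hypothetical 50x GPU speedup on the parallel fraction:
def amdahl_speedup(p, s):
    # Overall speedup when fraction p of the work is accelerated by factor s
    return 1.0 / ((1.0 - p) + p / s)
for p in (0.5, 0.9, 0.99):
    print(f"p={p:.2f}: {amdahl_speedup(p, 50):.1f}x (limit {1/(1-p):.0f}x)")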
Theorem: Transfer-Compute Crossover Point
GPU acceleration is beneficial only when the compute time saved exceeds the transfer overhead. For an algorithm that performs $O(n^k)$ floating-point operations on $O(n)$ transferred elements ($k > 1$), the crossover size $n^*$ satisfies:
$$c\,n^k \left(\frac{1}{F_{\mathrm{CPU}}} - \frac{1}{F_{\mathrm{GPU}}}\right) \;\ge\; \frac{2 n s}{B}$$
where $s$ is the element width in bytes, $B$ the PCIe bandwidth, $c\,n^k$ the total floating-point work, and $F_{\mathrm{CPU}}, F_{\mathrm{GPU}}$ the respective compute throughputs. Solving:
$$n^* \;\ge\; \left(\frac{2 s / B}{c\,(1/F_{\mathrm{CPU}} - 1/F_{\mathrm{GPU}})}\right)^{1/(k-1)}$$
For matrix multiplication ($2m^3$ FLOPs on $3m^2$ transferred elements, i.e. $k = 1.5$ in the data size), small matrices are often faster on CPU.
The GPU wins when there is enough computation to amortize the PCIe transfer cost. Cubic algorithms (matmul, SVD) cross over at smaller sizes than linear algorithms (element-wise ops).
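The following back-of-the-envelope sketch estimates the matmul crossover; every hardware number in it (CPU/GPU throughput, PCIe bandwidth) is an assumption, not a measurement:
def matmul_crossover(cpu_gflops=200, gpu_gflops=10_000, pcie_gb_s=16, elem_bytes=8):
    # Smallest n where GPU time (compute + transferring A, B, and C) beats the CPU
    for n in range(64, 8192, 64):
        flops = 2 * n**3
        t_cpu = flops / (cpu_gflops * 1e9)
        t_gpu = flops / (gpu_gflops * 1e9) + 3 * n * n * elem_bytes / (pcie_gb_s * 1e9)
        if t_gpu < t_cpu:
            return n
    return None
print(matmul_crossover())  # crossover size under the assumed hardware numbers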
Theorem: The Roofline Model
The attainable performance $P$ (FLOP/s) of a kernel with arithmetic intensity $I$ (FLOP/byte) is bounded by:
$$P \;\le\; \min\left(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}}\right)$$
where $P_{\mathrm{peak}}$ is the GPU's peak compute throughput and $B_{\mathrm{mem}}$ is the memory bandwidth. The crossover point $I^* = P_{\mathrm{peak}} / B_{\mathrm{mem}}$ separates memory-bound ($I < I^*$) from compute-bound ($I > I^*$) kernels.
Element-wise operations (add, multiply) have low arithmetic intensity and are memory-bound; they run at memory bandwidth speed. Matrix multiplication has high arithmetic intensity ($O(n)$ FLOP per byte loaded for $n \times n$ matrices) and can approach peak compute.
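A small sketch of the bound itself; the peak-throughput and bandwidth figures are placeholder assumptions, not specs of any particular GPU:
PEAK_FLOPS = 20e12   # assumed peak compute, FLOP/s
MEM_BW = 1.5e12      # assumed memory bandwidth, bytes/s
def roofline(intensity):
    # Attainable FLOP/s for a kernel with the given arithmetic intensity (FLOP/byte)
    return min(PEAK_FLOPS, intensity * MEM_BW)
# float64 element-wise add: 1 FLOP per 24 bytes moved (two loads, one store)
print(f"add:    {roofline(1/24)/1e12:.2f} TFLOP/s (memory-bound)")
# large matmul: intensity well past the ridge point
print(f"matmul: {roofline(100)/1e12:.2f} TFLOP/s (compute-bound)")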
Example: Your First CuPy Program
Create two random matrices on the GPU and multiply them. Time the operation and compare to NumPy on the CPU.
GPU computation
import cupy as cp
import numpy as np
import time
n = 1000
# GPU
a_gpu = cp.random.randn(n, n)
b_gpu = cp.random.randn(n, n)
cp.cuda.Stream.null.synchronize() # ensure creation is done
t0 = time.perf_counter()
c_gpu = a_gpu @ b_gpu
cp.cuda.Stream.null.synchronize() # wait for GPU
t_gpu = time.perf_counter() - t0
# CPU
a_cpu = np.random.randn(n, n)
b_cpu = np.random.randn(n, n)
t0 = time.perf_counter()
c_cpu = a_cpu @ b_cpu
t_cpu = time.perf_counter() - t0
print(f"GPU: {t_gpu*1000:.2f} ms")
print(f"CPU: {t_cpu*1000:.2f} ms")
print(f"Speedup: {t_cpu/t_gpu:.1f}x")
Key observations
- cp.cuda.Stream.null.synchronize() is essential for accurate timing because GPU operations are asynchronous.
- The first GPU call includes kernel compilation overhead (JIT warm-up). Run the multiplication once before timing, as in the sketch below.
- Typical speedup: 5-20x for $n = 1000$, increasing with size.
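Continuing the example above, a minimal warm-up pattern looks like this:
_ = a_gpu @ b_gpu                   # warm-up: triggers JIT / cuBLAS initialization
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
c_gpu = a_gpu @ b_gpu
cp.cuda.Stream.null.synchronize()
t_gpu = time.perf_counter() - t0    # now measures only the steady-state kernel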
Example: Measuring Transfer Overhead
Measure the time to transfer a $10^6$-element array from CPU to GPU and back. Compare to the time for element-wise squaring on the GPU.
Implementation
import cupy as cp
import numpy as np
import time
n = 10**6
a_np = np.random.randn(n)
# H2D transfer
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
a_cp = cp.asarray(a_np)
cp.cuda.Stream.null.synchronize()
t_h2d = time.perf_counter() - t0
# Compute on GPU
t0 = time.perf_counter()
b_cp = a_cp ** 2
cp.cuda.Stream.null.synchronize()
t_compute = time.perf_counter() - t0
# D2H transfer
t0 = time.perf_counter()
b_np = b_cp.get()
t_d2h = time.perf_counter() - t0
print(f"H2D: {t_h2d*1e6:.0f} us")
print(f"Compute: {t_compute*1e6:.0f} us")
print(f"D2H: {t_d2h*1e6:.0f} us")
Analysis
For element-wise operations on small arrays, transfer time often dominates compute time. The lesson: keep data on the GPU and batch operations rather than round-tripping after every step.
Example: Writing GPU-Portable Code
Write a function that computes the softmax of an array and works with both NumPy and CuPy inputs without code changes.
Implementation
import cupy as cp
import numpy as np
def softmax(x):
    xp = cp.get_array_module(x)
    e = xp.exp(x - xp.max(x))
    return e / xp.sum(e)
# Works with NumPy
y_cpu = softmax(np.array([1.0, 2.0, 3.0]))
# Works with CuPy - runs entirely on GPU
y_gpu = softmax(cp.array([1.0, 2.0, 3.0]))
print(np.allclose(y_cpu, y_gpu.get()))  # True
Why this matters
Using get_array_module, you write the algorithm once and it
runs on whichever backend the caller provides. This is the
standard CuPy pattern for library code.
CPU vs GPU: Matrix Multiply Benchmark
Compare matrix multiplication speed on CPU (NumPy/BLAS) vs GPU (CuPy/cuBLAS) across a range of matrix sizes.
Transfer Overhead Animation
Animated visualization showing how transfer overhead shrinks relative to compute time as problem size grows.
CPU vs GPU Architecture (diagram)
Quick Check
You need to add two vectors element-wise on the GPU. The round-trip transfer time (H2D + D2H) is far longer than the GPU computation itself, even though that computation is roughly 25x faster than on the CPU. Is it worth using the GPU for this single operation?
Yes - the GPU computation is 25x faster than CPU
No - the transfer overhead dominates for such small work
It depends on the GPU model
Correct. The total GPU path (transfer + compute) ends up slower than simply adding the vectors on the CPU. The GPU wins only when data is already on the device or the computation is heavy enough to amortize the transfers.
Quick Check
Why must you call cp.cuda.Stream.null.synchronize() before
measuring GPU execution time with time.perf_counter()?
To flush the GPU cache
Because GPU operations are asynchronous - the Python call returns before the GPU finishes
To prevent memory leaks
It is only needed for multi-GPU systems
CuPy launches GPU kernels asynchronously. Without
synchronization, time.perf_counter() measures only the
kernel launch overhead, not actual execution time.
Common Mistake: Mixing NumPy and CuPy Arrays
Mistake:
Trying to operate on a NumPy array and a CuPy array together:
a_np = np.ones(100)
b_cp = cp.ones(100)
c = a_np + b_cp # TypeError!
Correction:
Always convert explicitly:
c = cp.asarray(a_np) + b_cp # both on GPU
# or
c = a_np + cp.asnumpy(b_cp) # both on CPU
Common Mistake: Forgetting GPU Synchronization When Timing
Mistake:
Timing GPU operations without synchronization:
t0 = time.perf_counter()
c = a @ b # returns immediately!
elapsed = time.perf_counter() - t0 # wrong: kernel not done
Correction:
Always synchronize before and after the timed section:
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
c = a @ b
cp.cuda.Stream.null.synchronize()
elapsed = time.perf_counter() - t0 # correct
Alternatively, use CuPy's benchmarking helper, which handles the synchronization for you:
from cupyx.profiler import benchmark
print(benchmark(lambda: a @ b, n_repeat=10))
Key Takeaway
CuPy is a drop-in GPU replacement for NumPy: same API, same semantics, 10-100x faster for large arrays. The key caveat is data transfer: move data to the GPU once, do all computation there, and transfer results back once. For small arrays or single operations, the PCIe transfer overhead can negate the GPU speedup entirely.
Key Takeaway
Write GPU-portable code using cp.get_array_module(x) to auto-detect
the backend. This lets the same function run on CPU or GPU depending
on the input type, which is essential for library code that should work everywhere.
Historical Note: The Birth of GPGPU Computing
2007-2015: Before 2007, GPUs could only be programmed through graphics APIs (OpenGL, DirectX). NVIDIA's release of CUDA in 2007 opened GPUs to general-purpose computing. CuPy, released by Preferred Networks in 2015, brought CUDA's power to the Python ecosystem with a NumPy-compatible interface, democratizing GPU computing for scientists who had never written C code.
Historical Note: The End of Dennard Scaling and the Rise of Accelerators
2006-present: From the 1970s to roughly 2006, transistor shrinks let CPU clock speeds climb steadily without raising power density (Dennard scaling). When power density limits halted frequency increases around 4 GHz, the industry shifted to parallelism: multi-core CPUs and massively parallel GPUs. Today, a single GPU can deliver over 300 TFLOPS in FP16, dwarfing any CPU.
Why This Matters: GPU-Accelerated MIMO Processing
In massive MIMO systems with tens to hundreds of antennas, the channel matrix $\mathbf{H}$ must be processed in real time for every subcarrier and time slot. Operations like $\mathbf{H}^{\mathsf{H}}\mathbf{H}$, matrix inversion, and SVD are natural candidates for GPU acceleration via CuPy.
A single cp.linalg.svd call on a batch of channel matrices
can process all subcarriers simultaneously.
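A hedged sketch of that batched call; the 64x8 channel size and the subcarrier count below are illustrative, not taken from any particular system:
import cupy as cp
n_sc, n_rx, n_tx = 1200, 64, 8   # illustrative: subcarriers, RX antennas, TX antennas
H = cp.random.randn(n_sc, n_rx, n_tx) + 1j * cp.random.randn(n_sc, n_rx, n_tx)
# One batched SVD decomposes every subcarrier's channel matrix on the GPU
U, S, Vh = cp.linalg.svd(H, full_matrices=False)
print(S.shape)   # (1200, 8): singular values for all subcarriers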
See full treatment in CuPy Linear Algebra
Host
The CPU and its main system memory (RAM). In GPU computing, data originates on the host.
Device
The GPU and its dedicated high-bandwidth memory (HBM/GDDR). CuPy arrays reside on the device.
CUDA Kernel
A function compiled to run on the GPU across many threads in parallel. CuPy auto-generates kernels from NumPy-like expressions.
CUDA Stream
A sequence of GPU operations that execute in order. Different streams can execute concurrently for pipeline parallelism.
Memory Pool
CuPy's caching allocator that reuses GPU memory blocks to avoid expensive cudaMalloc/cudaFree calls on every array creation.
CuPy Basics - Complete Examples
# Code from: ch11/python/s01_cupy_basics.py