CuPy Basics
From CPU to GPU: Why Bother?
A modern CPU has 8-64 cores optimized for low-latency serial execution. A GPU has thousands of cores optimized for high-throughput parallel execution. For embarrassingly parallel operations (element-wise math, matrix multiplication, FFTs), a GPU can deliver 10-100x speedups.
CuPy lets you tap into this power with one-line changes: replace
import numpy as np with import cupy as cp, and most code runs on
the GPU unchanged. This section covers the mechanics of creating
arrays, moving data, and understanding when GPU acceleration pays off.
Definition: CuPy ndarray - A GPU-Resident Array
A cupy.ndarray is a multidimensional array whose data lives in
GPU (device) memory. It mirrors numpy.ndarray in API:
import cupy as cp
a = cp.array([1, 2, 3]) # from Python list
b = cp.zeros((3, 4), dtype=cp.float64) # zeros on GPU
c = cp.random.randn(1000, 1000) # random on GPU
Arithmetic on cupy.ndarray dispatches to CUDA kernels automatically:
d = a + b # element-wise add on GPU
e = c @ c.T # matrix multiply via cuBLAS
CuPy arrays cannot be mixed with NumPy arrays in operations. You must explicitly transfer data between host and device.
Definition: Host-Device Data Transfer
Data transfer between CPU (host) and GPU (device) uses:
import numpy as np
import cupy as cp
# Host -> Device (H2D)
a_np = np.random.randn(1000)
a_cp = cp.asarray(a_np) # copies to GPU
# Device -> Host (D2H)
b_np = cp.asnumpy(a_cp) # copies to CPU
b_np = a_cp.get() # equivalent
Transfer bandwidth is limited by the PCIe bus (typically 12-32 GB/s), which is 10-50x slower than GPU memory bandwidth (1-3 TB/s). The cost of transferring $n$ elements of $s$ bytes each is:
$$t_{\mathrm{transfer}} = \frac{n\,s}{B}$$
where $n\,s$ is the payload size in bytes and $B$ is the bus bandwidth.
The golden rule: move data to the GPU once, compute many times, move results back once. Avoid round-tripping in a loop.
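For instance, a minimal sketch of the difference (the 10-step loop and the toy update are illustrative, not a real workload):
import numpy as np
import cupy as cp
x_np = np.random.randn(1_000_000)
# Bad: round-trips through the PCIe bus on every iteration
y_np = x_np
for _ in range(10):
    y_np = cp.asnumpy(cp.sqrt(cp.asarray(y_np) ** 2 + 1.0))
# Good: one H2D transfer, all iterations on the device, one D2H transfer
y_cp = cp.asarray(x_np)
for _ in range(10):
    y_cp = cp.sqrt(y_cp ** 2 + 1.0)
y_np = cp.asnumpy(y_cp)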
Definition: Drop-in NumPy Replacement
CuPy implements the NumPy API on the GPU. The standard pattern for making code GPU-portable is:
import cupy as cp
xp = cp # swap to np for CPU fallback
a = xp.linspace(0, 1, 1000)
b = xp.fft.fft(a)
c = xp.linalg.norm(b)
The cupy.get_array_module function (or the newer
array_api_compat package) auto-detects the backend:
def process(x):
    xp = cp.get_array_module(x)
    return xp.fft.fft(x) * xp.hamming(len(x))
This function works identically whether x is a NumPy or CuPy array.
Definition: CuPy Memory Pool
CuPy uses a memory pool to avoid the overhead of cudaMalloc /
cudaFree on every allocation. Freed arrays return their memory to
the pool rather than the OS:
mempool = cp.get_default_memory_pool()
print(f"Used: {mempool.used_bytes() / 1e6:.1f} MB")
print(f"Total: {mempool.total_bytes() / 1e6:.1f} MB")
mempool.free_all_blocks() # release unused memory
The pool grows on demand and reuses blocks, making repeated allocations of the same size essentially free after the first call.
If you need to interoperate with other GPU libraries (PyTorch, TensorFlow), you may need to limit CuPy's pool size or use a shared allocator to avoid GPU out-of-memory errors.
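As a hedged sketch, the default pool can be capped with set_limit (the 2 GB figure below is an arbitrary example), or CuPy can be pointed at a different allocator:
import cupy as cp
# Cap the default memory pool (illustrative 2 GB limit)
cp.get_default_memory_pool().set_limit(size=2 * 1024**3)
# Or route allocations through a managed-memory pool instead of the default one
# cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)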
Definition: Multi-GPU Device Management
CuPy supports multi-GPU systems. Select a device with a context manager:
with cp.cuda.Device(0):
    a = cp.zeros((1000,))  # on GPU 0
with cp.cuda.Device(1):
    b = cp.zeros((1000,))  # on GPU 1
Arrays are bound to the device where they were created. Cross-device
operations require explicit transfer via cp.asarray.
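A minimal sketch of a cross-device copy (assumes a machine with at least two GPUs):
import cupy as cp
with cp.cuda.Device(0):
    a = cp.arange(1000, dtype=cp.float32)  # lives on GPU 0
with cp.cuda.Device(1):
    b = cp.asarray(a)                      # copies onto GPU 1
    c = b * 2.0                            # computed on GPU 1
print(a.device, b.device)                  # <CUDA Device 0> <CUDA Device 1>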
Theorem: Amdahl's Law Applied to GPU Acceleration
Let $p$ be the fraction of total work that can be offloaded to the GPU, and let $s$ be the GPU's speedup on that fraction. The overall speedup is bounded by:
$$S(p, s) = \frac{1}{(1 - p) + p/s}$$
Even with $s \to \infty$, the maximum speedup is $1/(1 - p)$.
If only 50% of your code is GPU-parallelizable ($p = 0.5$), the best possible speedup is $2\times$, no matter how fast the GPU. This is why it is critical to minimize serial bottlenecks and data transfers.
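A quick calculation of the bound for a few values of $p$, assuming a hypothetical 50x GPU speedup on the parallel fraction:
def amdahl_speedup(p, s):
    # Overall speedup when fraction p of the work is accelerated by factor s
    return 1.0 / ((1.0 - p) + p / s)
for p in (0.5, 0.9, 0.99):
    print(f"p={p:.2f}: {amdahl_speedup(p, 50):.1f}x (limit {1/(1-p):.0f}x)")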
Theorem: Transfer-Compute Crossover Point
GPU acceleration is beneficial only when the compute time saved exceeds the transfer overhead. For an algorithm that performs $O(n^k)$ floating-point operations on $O(n)$ transferred elements ($k > 1$), the crossover size $n^*$ satisfies:
$$c\,n^k \left(\frac{1}{F_{\mathrm{CPU}}} - \frac{1}{F_{\mathrm{GPU}}}\right) \;\ge\; \frac{2 n s}{B}$$
where $s$ is the element width in bytes, $B$ the PCIe bandwidth, $c\,n^k$ the total floating-point work, and $F_{\mathrm{CPU}}, F_{\mathrm{GPU}}$ the respective compute throughputs. Solving:
$$n^* \;\ge\; \left(\frac{2 s / B}{c\,(1/F_{\mathrm{CPU}} - 1/F_{\mathrm{GPU}})}\right)^{1/(k-1)}$$
For matrix multiplication ($2m^3$ FLOPs on $3m^2$ transferred elements, i.e. $k = 1.5$ in the data size), small matrices are often faster on CPU.
The GPU wins when there is enough computation to amortize the PCIe transfer cost. Cubic algorithms (matmul, SVD) cross over at smaller sizes than linear algorithms (element-wise ops).
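The following back-of-the-envelope sketch estimates the matmul crossover; every hardware number in it (CPU/GPU throughput, PCIe bandwidth) is an assumption, not a measurement:
def matmul_crossover(cpu_gflops=200, gpu_gflops=10_000, pcie_gb_s=16, elem_bytes=8):
    # Smallest n where GPU time (compute + transferring A, B, and C) beats the CPU
    for n in range(64, 8192, 64):
        flops = 2 * n**3
        t_cpu = flops / (cpu_gflops * 1e9)
        t_gpu = flops / (gpu_gflops * 1e9) + 3 * n * n * elem_bytes / (pcie_gb_s * 1e9)
        if t_gpu < t_cpu:
            return n
    return None
print(matmul_crossover())  # crossover size under the assumed hardware numbers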
Theorem: The Roofline Model
The attainable performance $P$ (FLOP/s) of a kernel with arithmetic intensity $I$ (FLOP/byte) is bounded by:
$$P \;\le\; \min\left(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}}\right)$$
where $P_{\mathrm{peak}}$ is the GPU's peak compute throughput and $B_{\mathrm{mem}}$ is the memory bandwidth. The crossover point $I^* = P_{\mathrm{peak}} / B_{\mathrm{mem}}$ separates memory-bound ($I < I^*$) from compute-bound ($I > I^*$) kernels.
Element-wise operations (add, multiply) have low arithmetic intensity and are memory-bound; they run at memory bandwidth speed. Matrix multiplication has high arithmetic intensity ($O(n)$ FLOP per byte loaded for $n \times n$ matrices) and can approach peak compute.
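A small sketch of the bound itself; the peak-throughput and bandwidth figures are placeholder assumptions, not specs of any particular GPU:
PEAK_FLOPS = 20e12   # assumed peak compute, FLOP/s
MEM_BW = 1.5e12      # assumed memory bandwidth, bytes/s
def roofline(intensity):
    # Attainable FLOP/s for a kernel with the given arithmetic intensity (FLOP/byte)
    return min(PEAK_FLOPS, intensity * MEM_BW)
# float64 element-wise add: 1 FLOP per 24 bytes moved (two loads, one store)
print(f"add:    {roofline(1/24)/1e12:.2f} TFLOP/s (memory-bound)")
# large matmul: intensity well past the ridge point
print(f"matmul: {roofline(100)/1e12:.2f} TFLOP/s (compute-bound)")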
Example: Your First CuPy Program
Create two random matrices on the GPU and multiply them. Time the operation and compare to NumPy on the CPU.
GPU computation
import cupy as cp
import numpy as np
import time
n = 1000
# GPU
a_gpu = cp.random.randn(n, n)
b_gpu = cp.random.randn(n, n)
cp.cuda.Stream.null.synchronize() # ensure creation is done
t0 = time.perf_counter()
c_gpu = a_gpu @ b_gpu
cp.cuda.Stream.null.synchronize() # wait for GPU
t_gpu = time.perf_counter() - t0
# CPU
a_cpu = np.random.randn(n, n)
b_cpu = np.random.randn(n, n)
t0 = time.perf_counter()
c_cpu = a_cpu @ b_cpu
t_cpu = time.perf_counter() - t0
print(f"GPU: {t_gpu*1000:.2f} ms")
print(f"CPU: {t_cpu*1000:.2f} ms")
print(f"Speedup: {t_cpu/t_gpu:.1f}x")
Key observations
- cp.cuda.Stream.null.synchronize() is essential for accurate timing because GPU operations are asynchronous.
- The first GPU call includes kernel compilation overhead (JIT warm-up). Run the multiplication once before timing, as in the sketch below.
- Typical speedup: 5-20x for $n = 1000$, increasing with size.
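Continuing the example above, a minimal warm-up pattern looks like this:
_ = a_gpu @ b_gpu                   # warm-up: triggers JIT / cuBLAS initialization
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
c_gpu = a_gpu @ b_gpu
cp.cuda.Stream.null.synchronize()
t_gpu = time.perf_counter() - t0    # now measures only the steady-state kernel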
Example: Measuring Transfer Overhead
Measure the time to transfer a $10^6$-element array from CPU to GPU and back. Compare to the time for element-wise squaring on the GPU.
Implementation
import cupy as cp
import numpy as np
import time
n = 10**6
a_np = np.random.randn(n)
# H2D transfer
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
a_cp = cp.asarray(a_np)
cp.cuda.Stream.null.synchronize()
t_h2d = time.perf_counter() - t0
# Compute on GPU
t0 = time.perf_counter()
b_cp = a_cp ** 2
cp.cuda.Stream.null.synchronize()
t_compute = time.perf_counter() - t0
# D2H transfer
t0 = time.perf_counter()
b_np = b_cp.get()
t_d2h = time.perf_counter() - t0
print(f"H2D: {t_h2d*1e6:.0f} us")
print(f"Compute: {t_compute*1e6:.0f} us")
print(f"D2H: {t_d2h*1e6:.0f} us")
Analysis
For element-wise operations on small arrays, transfer time often dominates compute time. The lesson: keep data on the GPU and batch operations rather than round-tripping after every step.
Example: Writing GPU-Portable Code
Write a function that computes the softmax of an array and works with both NumPy and CuPy inputs without code changes.
Implementation
import cupy as cp
import numpy as np
def softmax(x):
    xp = cp.get_array_module(x)
    e = xp.exp(x - xp.max(x))
    return e / xp.sum(e)
# Works with NumPy
y_cpu = softmax(np.array([1.0, 2.0, 3.0]))
# Works with CuPy - runs entirely on GPU
y_gpu = softmax(cp.array([1.0, 2.0, 3.0]))
print(np.allclose(y_cpu, y_gpu.get()))  # True
Why this matters
Using get_array_module, you write the algorithm once and it
runs on whichever backend the caller provides. This is the
standard CuPy pattern for library code.
CPU vs GPU: Matrix Multiply Benchmark
Compare matrix multiplication speed on CPU (NumPy/BLAS) vs GPU (CuPy/cuBLAS) across a range of matrix sizes.
Transfer Overhead Animation
Animated visualization showing how transfer overhead shrinks relative to compute time as problem size grows.
CPU vs GPU Architecture (diagram)
Quick Check
You need to add two vectors element-wise on the GPU. The round-trip transfer time (H2D + D2H) is far longer than the GPU computation itself, even though that computation is roughly 25x faster than on the CPU. Is it worth using the GPU for this single operation?
Yes - the GPU computation is 25x faster than CPU
No - the transfer overhead dominates for such small work
It depends on the GPU model
Correct. The total GPU path (transfer + compute) ends up slower than simply adding the vectors on the CPU. The GPU wins only when data is already on the device or the computation is heavy enough to amortize the transfers.
Quick Check
Why must you call cp.cuda.Stream.null.synchronize() before
measuring GPU execution time with time.perf_counter()?
To flush the GPU cache
Because GPU operations are asynchronous - the Python call returns before the GPU finishes
To prevent memory leaks
It is only needed for multi-GPU systems
CuPy launches GPU kernels asynchronously. Without
synchronization, time.perf_counter() measures only the
kernel launch overhead, not actual execution time.
Common Mistake: Mixing NumPy and CuPy Arrays
Mistake:
Trying to operate on a NumPy array and a CuPy array together:
a_np = np.ones(100)
b_cp = cp.ones(100)
c = a_np + b_cp # TypeError!
Correction:
Always convert explicitly:
c = cp.asarray(a_np) + b_cp # both on GPU
# or
c = a_np + cp.asnumpy(b_cp) # both on CPU
Common Mistake: Forgetting GPU Synchronization When Timing
Mistake:
Timing GPU operations without synchronization:
t0 = time.perf_counter()
c = a @ b # returns immediately!
elapsed = time.perf_counter() - t0 # wrong: kernel not done
Correction:
Always synchronize before and after the timed section:
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
c = a @ b
cp.cuda.Stream.null.synchronize()
elapsed = time.perf_counter() - t0 # correct
Alternatively, use CuPy's benchmarking helper, which handles the synchronization for you:
from cupyx.profiler import benchmark
print(benchmark(lambda: a @ b, n_repeat=10))
Key Takeaway
CuPy is a drop-in GPU replacement for NumPy: same API, same semantics, 10-100x faster for large arrays. The key caveat is data transfer: move data to the GPU once, do all computation there, and transfer results back once. For small arrays or single operations, the PCIe transfer overhead can negate the GPU speedup entirely.
Key Takeaway
Write GPU-portable code using cp.get_array_module(x) to auto-detect
the backend. This lets the same function run on CPU or GPU depending
on the input type, which is essential for library code that should work everywhere.
Historical Note: The Birth of GPGPU Computing
2007-2015: Before 2007, GPUs could only be programmed through graphics APIs (OpenGL, DirectX). NVIDIA's release of CUDA in 2007 opened GPUs to general-purpose computing. CuPy, released by Preferred Networks in 2015, brought CUDA's power to the Python ecosystem with a NumPy-compatible interface, democratizing GPU computing for scientists who had never written C code.
Historical Note: The End of Dennard Scaling and the Rise of Accelerators
2006-present: From the 1970s to roughly 2006, transistor shrinks let CPU clock speeds climb steadily without raising power density (Dennard scaling). When power density limits halted frequency increases around 4 GHz, the industry shifted to parallelism: multi-core CPUs and massively parallel GPUs. Today, a single GPU can deliver over 300 TFLOPS in FP16, dwarfing any CPU.
Why This Matters: GPU-Accelerated MIMO Processing
In massive MIMO systems with tens to hundreds of antennas, the channel matrix $\mathbf{H}$ must be processed in real time for every subcarrier and time slot. Operations like $\mathbf{H}^{\mathsf{H}}\mathbf{H}$, matrix inversion, and SVD are natural candidates for GPU acceleration via CuPy.
A single cp.linalg.svd call on a batch of channel matrices
can process all subcarriers simultaneously.
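A hedged sketch of that batched call; the 64x8 channel size and the subcarrier count below are illustrative, not taken from any particular system:
import cupy as cp
n_sc, n_rx, n_tx = 1200, 64, 8   # illustrative: subcarriers, RX antennas, TX antennas
H = cp.random.randn(n_sc, n_rx, n_tx) + 1j * cp.random.randn(n_sc, n_rx, n_tx)
# One batched SVD decomposes every subcarrier's channel matrix on the GPU
U, S, Vh = cp.linalg.svd(H, full_matrices=False)
print(S.shape)   # (1200, 8): singular values for all subcarriers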
See full treatment in CuPy Linear Algebra
Host
The CPU and its main system memory (RAM). In GPU computing, data originates on the host.
Device
The GPU and its dedicated high-bandwidth memory (HBM/GDDR). CuPy arrays reside on the device.
CUDA Kernel
A function compiled to run on the GPU across many threads in parallel. CuPy auto-generates kernels from NumPy-like expressions.
CUDA Stream
A sequence of GPU operations that execute in order. Different streams can execute concurrently for pipeline parallelism.
Memory Pool
CuPy's caching allocator that reuses GPU memory blocks to avoid expensive cudaMalloc/cudaFree calls on every array creation.
CuPy Basics - Complete Examples
# Code from: ch11/python/s01_cupy_basics.py