GPU Architecture for the Programmer
Why GPUs Matter for Scientific Computing
Modern GPUs contain thousands of cores designed for data-parallel workloads. While a CPU excels at complex serial tasks (branch prediction, out-of-order execution), a GPU excels when the same operation is applied to millions of data elements simultaneously.
Scientific computing is dominated by matrix operations, element-wise transforms, and Monte Carlo simulations — all embarrassingly parallel. A single NVIDIA A100 GPU delivers 19.5 TFLOPS of FP32 throughput, roughly 30x that of a high-end CPU. Understanding the hardware is the key to writing code that achieves this potential.
Definition: CUDA Core
CUDA Core
A CUDA core is the basic floating-point/integer execution unit on an NVIDIA GPU. Each core can perform one single-precision fused multiply-add (FMA) per clock cycle: $d = a \times b + c$, counted as 2 FLOPs.
Modern GPUs also include Tensor Cores for mixed-precision matrix multiply-accumulate operations (used in deep learning), and separate units for double-precision, transcendental functions (sin, cos, exp), and load/store operations.
A single GPU may contain thousands of CUDA cores:
- NVIDIA A100: 6912 FP32 CUDA cores
- NVIDIA RTX 4090: 16384 FP32 CUDA cores
- NVIDIA H100: 16896 FP32 CUDA cores
The raw core count is misleading for performance comparisons. What matters is throughput = cores x clock speed x operations/cycle, combined with memory bandwidth. Most scientific workloads are memory-bound, not compute-bound.
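The throughput formula above can be checked against the A100 numbers (a quick sketch; the factor of 2 comes from counting an FMA as two FLOPs):

```python
# Peak FP32 throughput = cores × clock × operations per cycle.
# The specs below are published A100 figures; treat as illustrative.
def peak_tflops(cores: int, clock_ghz: float, ops_per_cycle: int = 2) -> float:
    """Peak throughput in TFLOPS (an FMA counts as 2 FLOPs per cycle)."""
    return cores * clock_ghz * ops_per_cycle / 1e3

a100 = peak_tflops(cores=6912, clock_ghz=1.41)   # roughly 19.5 TFLOPS
```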
Definition: Streaming Multiprocessor (SM)
Streaming Multiprocessor (SM)
A Streaming Multiprocessor (SM) is the fundamental compute building block of an NVIDIA GPU. Each SM contains:
- A set of CUDA cores (e.g., 128 FP32 cores per SM on Ampere)
- A shared memory / L1 cache (configurable, typically 48-164 KB)
- A warp scheduler (issues instructions to warps every cycle)
- Register file (65536 32-bit registers on Ampere)
- Special function units (SFUs) for transcendentals
- Tensor Cores (on architectures from Volta onward)
The GPU partitions work across SMs. Each SM can execute multiple thread blocks concurrently, limited by register and shared-memory usage. An A100 has 108 SMs; an RTX 4090 has 128 SMs.
Think of each SM as a "mini-GPU." Programming for GPU efficiency means ensuring every SM is busy (occupancy) and that threads within each SM access memory efficiently.
Definition: Warp (SIMT Execution Unit)
Warp (SIMT Execution Unit)
A warp is a group of 32 threads that execute the same instruction simultaneously in lock-step (SIMT — Single Instruction, Multiple Threads). This is the fundamental unit of execution scheduling on NVIDIA GPUs.
At every clock cycle, the warp scheduler selects an eligible warp and issues one instruction to all 32 threads. If threads within a warp take different branches (e.g., if/else), the warp executes both paths serially, masking out inactive threads. This is called warp divergence and is a major performance pitfall.
AMD GPUs use a "wavefront" of 64 threads (32 on RDNA3), analogous to NVIDIA's warp. The concept is identical: groups of threads execute in lock-step.
Definition: GPU Memory Hierarchy
GPU Memory Hierarchy
The GPU memory system forms a hierarchy from fast-but-small to slow-but-large:
| Level | Size | Latency | Bandwidth | Scope |
|---|---|---|---|---|
| Registers | ~256 KB/SM | 0 cycles | - | Per thread |
| Shared Memory | 48-164 KB/SM | ~20 cycles | ~19 TB/s | Per block |
| L1 Cache | Shared with above | ~30 cycles | ~19 TB/s | Per SM |
| L2 Cache | 6-50 MB | ~200 cycles | ~6 TB/s | Global |
| Global (HBM) | 16-80 GB | ~400 cycles | 1-3 TB/s | Global |
Global memory (HBM) is the main GPU memory visible to all threads. Shared memory is a programmer-managed scratchpad shared among threads in the same block. Registers are the fastest storage, private to each thread.
Efficient GPU code minimizes global memory accesses by maximizing data reuse in shared memory and registers.
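The payoff of reuse can be sketched with a simple traffic model (a hypothetical helper, not a library API; the tile size stands in for a shared-memory tile in a tiled matmul):

```python
# Data-reuse model (a sketch): tiling an n×n matmul with T×T tiles held in
# shared memory cuts global-memory reads of A and B by a factor of T.
def matmul_global_reads(n: int, tile: int = 1) -> int:
    """Float reads from global memory for a tiled n×n matrix multiply.
    With tile=1 (no reuse), every operand is re-read n times."""
    passes = n // tile            # how many times each matrix is swept
    return 2 * n * n * passes     # reads of A plus reads of B

naive = matmul_global_reads(1024, tile=1)    # 2 * 1024^3 float reads
tiled = matmul_global_reads(1024, tile=32)   # 32x fewer global reads
```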
The CPU-GPU relationship adds another level: data must be transferred from CPU RAM to GPU HBM over PCIe (up to 64 GB/s) or NVLink (up to 900 GB/s). This host-device transfer is often the dominant bottleneck.
Definition: Coalesced Memory Access
Coalesced Memory Access
A global memory access is coalesced when consecutive threads in a warp access consecutive memory addresses. The memory controller can then serve the entire warp with a single transaction (128 bytes on modern GPUs).
Coalesced (optimal): thread $i$ reads data[i] — 32 consecutive 4-byte words, served by one 128-byte transaction per warp.
Strided (poor): thread $i$ reads data[2*i] — the warp touches 256 bytes to use 128.
A stride of 2 wastes 50% of bandwidth. Random access patterns can reduce effective bandwidth by 10-32x.
In NumPy terms, coalesced access corresponds to iterating over the last axis of a C-contiguous array. When converting NumPy code to GPU, ensure the parallelized axis is contiguous.
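The contiguity in question is visible directly in an array's strides (a small illustration, assuming a 1024×1024 float32 array):

```python
import numpy as np

# A C-contiguous array steps 4 bytes along its last axis, the analogue of
# coalesced access. Stepping along the first axis jumps a full row.
A = np.zeros((1024, 1024), dtype=np.float32)
assert A.flags["C_CONTIGUOUS"]
assert A.strides == (4096, 4)    # (bytes per row step, bytes per column step)

col = A[:, 0]    # column view: 4096-byte stride, "strided" access pattern
row = A[0, :]    # row view: 4-byte stride, "coalesced" access pattern

fixed = np.ascontiguousarray(col)   # copying makes the data contiguous again
```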
Definition: Occupancy
Occupancy
Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can support:
$$\text{occupancy} = \frac{\text{active warps per SM}}{\text{maximum warps per SM}}$$
On Ampere, each SM supports up to 64 active warps (2048 threads). Occupancy is limited by three resources:
- Registers per thread: more registers per thread = fewer threads
- Shared memory per block: more shared memory = fewer blocks
- Block size: must be a multiple of 32 (warp size)
Higher occupancy helps hide memory latency through warp-level parallelism, but 100% occupancy is not always necessary for peak performance.
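These three limits can be combined into a back-of-the-envelope calculator (a simplified sketch of the real CUDA occupancy rules, using A100-like Ampere limits; actual calculators also model allocation granularity):

```python
# Occupancy estimate: blocks per SM are capped by threads, registers, and
# shared memory; the tightest limit wins. Limits below are A100-like values.
def occupancy(threads_per_block: int, regs_per_thread: int, smem_per_block: int,
              max_threads: int = 2048, reg_file: int = 65536,
              smem_total: int = 164 * 1024, max_blocks: int = 32) -> float:
    by_threads = max_threads // threads_per_block
    by_regs = reg_file // (regs_per_thread * threads_per_block)
    by_smem = smem_total // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_threads, by_regs, by_smem, max_blocks)
    active_warps = blocks * threads_per_block // 32
    return active_warps / (max_threads // 32)

occupancy(256, 32, 0)   # 1.0: 8 blocks of 8 warps fill all 64 warp slots
occupancy(256, 64, 0)   # 0.5: the register file only allows 4 blocks
```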
Theorem: Amdahl's Law Applied to GPU Offloading
Let $p$ be the fraction of computation that can be parallelized on the GPU, and let $s$ be the speedup of the GPU portion relative to the CPU. The overall speedup is bounded by:
$$S = \frac{1}{(1-p) + \dfrac{p}{s} + \dfrac{T_{\text{xfer}}}{T_{\text{CPU}}}}$$
where $T_{\text{xfer}}$ is the host-device data transfer time and $T_{\text{CPU}}$ is the original CPU execution time.
Even if the GPU computes 1000x faster, the serial portion and data transfer overhead set a hard ceiling on the overall speedup. For a workload that is 90% parallelizable with zero transfer cost, the maximum speedup is 10x regardless of GPU speed.
Decompose total time
$$T_{\text{total}} = (1-p)\,T_{\text{CPU}} + \frac{p\,T_{\text{CPU}}}{s} + T_{\text{xfer}}, \qquad S = \frac{T_{\text{CPU}}}{T_{\text{total}}}$$
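A numeric sketch of this bound (the function name `gpu_speedup` is ours, not a library API; times are in units of the CPU run time):

```python
# Amdahl ceiling for GPU offload: p = parallel fraction, s = GPU speedup
# of the offloaded portion, t_transfer = host-device transfer time.
def gpu_speedup(p: float, s: float, t_transfer: float = 0.0,
                t_cpu: float = 1.0) -> float:
    t_total = (1 - p) * t_cpu + p * t_cpu / s + t_transfer
    return t_cpu / t_total

gpu_speedup(0.9, 1e9)         # ~10: the 10% serial part caps the speedup
gpu_speedup(0.9, 1e9, 0.1)    # transfer overhead lowers the ceiling to ~5
```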
Theorem: Memory Bandwidth as the Performance Ceiling
For a kernel that performs $W$ floating-point operations on $Q$ bytes of data moved to and from memory, the arithmetic intensity is:
$$I = \frac{W}{Q} \quad \text{[FLOP/byte]}$$
The kernel is memory-bound if $I < I_{\text{crit}}$, where $I_{\text{crit}}$ is the machine's compute-to-bandwidth ratio:
$$I_{\text{crit}} = \frac{P_{\text{peak}}}{B}$$
For an A100: $I_{\text{crit}} = \dfrac{19.5\ \text{TFLOP/s}}{2.0\ \text{TB/s}} \approx 9.75$ FLOP/byte.
Element-wise operations like a + b have $I = \frac{1}{12} \approx 0.083$ (1 FLOP, 12 bytes: read 2 floats, write 1). This is far below $I_{\text{crit}}$, so the GPU can never reach peak compute on such operations — it waits for data. Matrix multiplication, with $I \approx n/6$ growing with problem size, is one of the few operations that can saturate compute.
Theorem: Roofline Performance Model
The attainable performance $P$ (FLOP/s) of a kernel with arithmetic intensity $I$ is bounded by:
$$P \le \min\left(P_{\text{peak}},\ B \times I\right)$$
This defines the "roofline" — a piecewise-linear upper bound when plotting $P$ vs $I$. The knee (transition point) occurs at $I = P_{\text{peak}} / B$.
Below the knee, performance scales linearly with arithmetic intensity (memory-bound region). Above the knee, performance is flat at peak compute (compute-bound region). Most scientific kernels live in the memory-bound region, so optimizing memory access patterns matters more than increasing parallelism.
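The roofline itself is one line of code (a sketch using the A100 figures quoted above):

```python
# Roofline model: attainable TFLOP/s is the lesser of the compute roof
# and the memory-bandwidth slope B * I.
def roofline(intensity: float, peak_tflops: float = 19.5,
             bandwidth_tbs: float = 2.0) -> float:
    return min(peak_tflops, bandwidth_tbs * intensity)

roofline(0.083)   # vector add: deep in the memory-bound region
roofline(100.0)   # large GEMM: pinned at the 19.5 TFLOP/s compute roof
```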
Example: Classifying Workloads by Arithmetic Intensity
Compute the arithmetic intensity of three operations and classify each as memory-bound or compute-bound on an A100 GPU ($I_{\text{crit}} \approx 9.75$ FLOP/byte):
- Vector addition: $c = a + b$, with $a, b, c \in \mathbb{R}^n$
- Matrix-vector product: $y = Ax$, $A \in \mathbb{R}^{n \times n}$
- Matrix-matrix product: $C = AB$, all $n \times n$
Vector addition
Read $2n$ floats ($8n$ bytes), write $n$ floats ($4n$ bytes), perform $n$ adds. $I = \frac{n}{12n} = \frac{1}{12} \approx 0.083$ FLOP/byte. Memory-bound.
Matrix-vector product
Read $n^2 + n$ floats ($\approx 4n^2$ bytes), perform $2n^2$ FLOPs (multiply + add). $I \approx \frac{2n^2}{4n^2} = 0.5$ FLOP/byte. Memory-bound for any $n$.
Matrix-matrix product
Read $2n^2$ floats and write $n^2$ ($12n^2$ bytes total), perform $2n^3$ FLOPs. $I = \frac{2n^3}{12n^2} = \frac{n}{6}$. For $n \gtrsim 6 \times 9.75 \approx 59$, $I > I_{\text{crit}}$: compute-bound for large matrices. This is why GEMM is the poster child for GPU acceleration.
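The three intensities can be written down and checked numerically (a sketch; byte counts assume float32, matching the worked example):

```python
# Arithmetic intensity (FLOP/byte) for the three kernels, float32 operands.
def intensity_vec_add(n: int) -> float:
    return n / (12 * n)                          # n adds, 12n bytes

def intensity_matvec(n: int) -> float:
    return 2 * n * n / (4 * (n * n + 2 * n))     # 2n^2 FLOPs, ~4n^2 bytes

def intensity_matmul(n: int) -> float:
    return 2 * n**3 / (12 * n * n)               # n/6, grows with n

I_CRIT = 9.75   # A100 compute-to-bandwidth ratio
# matmul crosses from memory-bound to compute-bound near n = 6 * 9.75 ≈ 59
```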
Example: Quantifying Warp Divergence
A kernel applies a threshold function to an array:
```python
if x[i] > threshold:
    y[i] = expensive_function(x[i])   # 100 cycles
else:
    y[i] = 0.0                        # 1 cycle
```
If 50% of elements exceed the threshold and they are randomly distributed, estimate the execution time per warp compared to the best case (all elements on the same branch).
Analyze warp behavior
With random distribution, the probability that all 32 threads in a warp take the same branch is $2 \times 0.5^{32} \approx 4.7 \times 10^{-10}$. So virtually every warp diverges.
Compute divergent cost
A divergent warp executes both branches serially: $100 + 1 = 101$ cycles.
Best case (all on the expensive branch): 100 cycles. Best case (all on the cheap branch): 1 cycle. Average non-divergent cost at a 50% mix: $(100 + 1)/2 = 50.5$ cycles.
The divergent warp is $101 / 50.5 = 2\times$ slower than the theoretical average.
Mitigation: sort data so that elements above threshold are grouped together, ensuring most warps are non-divergent.
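These numbers are easy to reproduce with a short Monte Carlo simulation (pure Python standing in for the warp scheduler; `EXPENSIVE` and `CHEAP` are the cycle counts assumed above):

```python
import random

# Each warp has 32 lanes; a lane takes the expensive branch with p = 0.5.
# A warp pays for every branch that at least one of its lanes takes.
random.seed(0)
EXPENSIVE, CHEAP = 100, 1

def warp_cost(lanes: list) -> int:
    """Cycles for one warp: serialized cost of every branch taken."""
    return EXPENSIVE * any(lanes) + CHEAP * (not all(lanes))

warps = [[random.random() < 0.5 for _ in range(32)] for _ in range(10_000)]
divergent = sum(1 for w in warps if any(w) and not all(w))
avg_cost = sum(warp_cost(w) for w in warps) / len(warps)
# divergent fraction ≈ 1.0; avg_cost ≈ 101 cycles, twice the 50.5 ideal
```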
Example: Coalesced vs Strided Memory Access
A matrix $A$ of shape $N \times N$ is stored in row-major order. Compare the memory access pattern when parallelizing over rows vs columns with 32 threads per warp.
Column-parallel (coalesced)
Each thread processes one column in the current row: thread $i$ accesses A[row, i] at address $\text{base} + (row \cdot N + i) \times 4$. Consecutive threads access consecutive addresses: stride = sizeof(float) = 4 bytes. This is coalesced — a single 128-byte transaction serves all 32 threads.
Row-parallel (strided)
Each thread processes one row in the current column: thread $i$ accesses A[i, col] at address $\text{base} + (i \cdot N + col) \times 4$. Consecutive threads access addresses separated by $4N$ bytes. For $N = 1024$, this is a stride of 4096 bytes — each thread's request goes to a different cache line. This requires up to 32 separate memory transactions, reducing effective bandwidth by ~32x.
Practical fix
Transpose the matrix or change the loop order so that the innermost parallel dimension maps to contiguous memory. In CuPy/PyTorch, ensure the tensor is contiguous along the last dimension with .contiguous().
Interactive Roofline Model
Explore how peak compute, memory bandwidth, and arithmetic intensity interact. Adjust GPU parameters and see where common operations fall on the roofline.
GPU Memory Hierarchy Latency Explorer
Visualize the GPU memory hierarchy with latency, bandwidth, and size at each level. Adjust architecture generation to compare.
Warp Execution and Divergence Animation
Watch how a warp scheduler dispatches instructions to 32 threads, and observe the cost of branch divergence step-by-step.
NVIDIA GPU Architecture Overview
CPU vs GPU Architecture Comparison
| Feature | CPU (e.g., AMD EPYC 9654) | GPU (e.g., NVIDIA A100) |
|---|---|---|
| Cores | 96 cores | 6912 CUDA cores (108 SMs) |
| Clock Speed | 2.4-3.7 GHz | 1.41 GHz |
| Peak FP32 | ~3.6 TFLOPS | 19.5 TFLOPS |
| Peak FP64 | ~1.8 TFLOPS | 9.7 TFLOPS |
| Memory | 768 GB DDR5 | 80 GB HBM2e |
| Memory BW | 460 GB/s | 2039 GB/s |
| Cache | 384 MB L3 | 40 MB L2 |
| Strength | Complex serial logic, branching | Massively parallel data processing |
| Weakness | Low throughput for data-parallel tasks | High latency for serial tasks, small memory |
Quick Check
How many threads are in a single NVIDIA warp?
16
32
64
128
A warp consists of 32 threads that execute in lock-step (SIMT).
Quick Check
An element-wise operation c = a * b + d on float32 arrays reads 3 arrays and writes 1 (16 bytes per element) and performs 2 FLOPs per element. On an A100 ($I_{\text{crit}} \approx 9.75$ FLOP/byte), this operation is:
Compute-bound
Memory-bound
Latency-bound
Transfer-bound
With I = 0.125 << 9.75, the GPU spends most of its time waiting for data from memory.
Common Mistake: Warp Divergence from Data-Dependent Branching
Mistake:
Writing GPU kernels with if/else branches where the condition varies across threads in the same warp:

```python
if x[tid] > 0:
    y[tid] = sqrt(x[tid])
else:
    y[tid] = -x[tid]
```
Correction:
Restructure to avoid divergence. Use branchless (predicated) operations:

```python
mask = x[tid] > 0
y[tid] = mask * sqrt(abs(x[tid])) + (1 - mask) * (-x[tid])
```
Or sort input data so threads within each warp take the same branch.
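The branchless form can be checked against the branchy one (NumPy stands in for the kernel body here; in GPU array libraries a `where`-style select is predicated rather than divergent):

```python
import numpy as np

# abs() keeps sqrt defined even on lanes where the other branch "wins",
# since a predicated select evaluates both sides for every element.
x = np.array([4.0, -9.0, 16.0, -1.0], dtype=np.float32)

branchy = np.where(x > 0, np.sqrt(np.abs(x)), -x)
mask = (x > 0).astype(np.float32)
branchless = mask * np.sqrt(np.abs(x)) + (1 - mask) * (-x)

assert np.allclose(branchy, branchless)   # [2, 9, 4, 1]
```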
Common Mistake: Strided Memory Access Kills Bandwidth
Mistake:
Accessing array elements with a stride (e.g., column-major access on a row-major array), causing each thread in a warp to trigger a separate memory transaction.
Correction:
Ensure threads access contiguous memory. Transpose arrays if needed, or restructure as Structure of Arrays (SoA) instead of Array of Structures (AoS).
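The AoS/SoA difference shows up directly in the strides (an illustrative sketch using a NumPy structured array as the AoS layout):

```python
import numpy as np

# AoS: x, y, z interleaved per record, so reading one field steps 12 bytes.
aos = np.zeros(1024, dtype=[("x", "f4"), ("y", "f4"), ("z", "f4")])
aos_x = aos["x"]              # strided view: 12-byte steps

# SoA: one contiguous array per field, the coalesced-friendly layout.
soa_x = np.zeros(1024, dtype=np.float32)   # 4-byte steps
```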
Key Takeaway
A GPU is a throughput machine, not a latency machine. It hides memory latency by keeping thousands of threads in flight simultaneously. Understanding the SM/warp/memory hierarchy is essential for predicting which workloads will benefit from GPU acceleration.
Key Takeaway
Use the roofline model to classify your workload. Most scientific operations (element-wise, reductions, matvec) are memory-bound. Only operations with high arithmetic intensity like GEMM can saturate the GPU's compute capability.
Historical Note: From Graphics Pipeline to General-Purpose Computing
2004-2007Before CUDA (2007), GPU computing required casting scientific problems as "rendering" operations — pixel shaders became compute kernels, and textures served as arrays. Researchers like Ian Buck (Stanford, later NVIDIA) developed Brook for GPUs (2004), which influenced CUDA's design. CUDA was the first framework to expose the GPU as a general-purpose parallel processor without the graphics abstraction.
Historical Note: GPU Performance Scaling Outpaces CPUs
2005-2024While CPU single-thread performance plateaued around 2005 due to power and frequency walls, GPU throughput has continued to scale roughly at 2x every 2 years. NVIDIA's Tesla C1060 (2008) delivered 0.9 TFLOPS; the H100 (2022) delivers 67 TFLOPS — a 74x improvement in 14 years. This "GPU Moore's Law" is driven by increasing parallelism (more SMs), wider memory buses (HBM), and architecture-specific accelerators (Tensor Cores).
Why This Matters: GPU Acceleration for MIMO Processing
In massive MIMO systems with 64-256 antennas, the baseband processing involves large matrix operations: channel estimation, MMSE detection, and beamforming. These are compute-bound matrix operations (high arithmetic intensity) that map perfectly to GPU architectures. GPU-based massive MIMO detectors can achieve real-time processing at sub-millisecond latency.
See full treatment in Chapter 15
Warp
A group of 32 threads that execute the same instruction simultaneously in SIMT fashion.
Related: Streaming Multiprocessor (SM)
CUDA Core
A single floating-point/integer execution unit within an SM, capable of one FMA per clock cycle.
Related: Streaming Multiprocessor (SM)
Coalesced Memory Access
A memory access pattern where consecutive threads access consecutive addresses, allowing the memory controller to batch requests into minimal transactions.
Related: HBM (High Bandwidth Memory)
HBM (High Bandwidth Memory)
The main GPU memory technology providing high bandwidth (1-3 TB/s) through a wide memory bus, used as global memory accessible by all GPU threads.
Related: Coalesced Memory Access
Occupancy
The ratio of active warps to maximum supported warps per SM, indicating how well the GPU can hide memory latency through parallelism.
Related: Warp, Streaming Multiprocessor (SM)