GPU Architecture for the Programmer

Why GPUs Matter for Scientific Computing

Modern GPUs contain thousands of cores designed for data-parallel workloads. While a CPU excels at complex serial tasks (branch prediction, out-of-order execution), a GPU excels when the same operation is applied to millions of data elements simultaneously.

Scientific computing is dominated by matrix operations, element-wise transforms, and Monte Carlo simulations — all embarrassingly parallel. A single NVIDIA A100 GPU delivers 19.5 TFLOPS of FP32 throughput, roughly 5x that of a 96-core server CPU (see the comparison table later in this section). Understanding the hardware is the key to writing code that achieves this potential.

Definition:

CUDA Core

A CUDA core is the basic floating-point/integer execution unit on an NVIDIA GPU. Each core can perform one single-precision fused multiply-add (FMA) per clock cycle:

$$\text{FMA}: \quad d = a \times b + c$$

Modern GPUs also include Tensor Cores for mixed-precision matrix multiply-accumulate operations (used in deep learning), and separate units for double-precision, transcendental functions (sin, cos, exp), and load/store operations.

A single GPU may contain thousands of CUDA cores:

  • NVIDIA A100: 6912 FP32 CUDA cores
  • NVIDIA RTX 4090: 16384 FP32 CUDA cores
  • NVIDIA H100: 16896 FP32 CUDA cores

The raw core count is misleading for performance comparisons. What matters is throughput = cores x clock speed x operations/cycle, combined with memory bandwidth. Most scientific workloads are memory-bound, not compute-bound.
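
As a sanity check, the A100's quoted peak follows directly from this throughput formula; a minimal sketch using its published core count and boost clock:

# Theoretical peak FP32 throughput = cores x clock x FLOPs per core per cycle.
# An FMA counts as 2 FLOPs (one multiply plus one add).
cuda_cores = 6912          # A100 FP32 CUDA cores
clock_hz = 1.41e9          # A100 boost clock, ~1.41 GHz
flops_per_core_cycle = 2   # one FMA per cycle = 2 FLOPs

peak_fp32 = cuda_cores * clock_hz * flops_per_core_cycle
print(f"Peak FP32: {peak_fp32 / 1e12:.1f} TFLOPS")   # ~19.5 TFLOPS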

Definition:

Streaming Multiprocessor (SM)

A Streaming Multiprocessor (SM) is the fundamental compute building block of an NVIDIA GPU. Each SM contains:

  • A set of CUDA cores (e.g., 64 FP32 cores per SM on the A100; 128 on consumer Ampere GPUs)
  • A shared memory / L1 cache (configurable, typically 48-164 KB)
  • A warp scheduler (issues instructions to warps every cycle)
  • Register file (65536 32-bit registers on Ampere)
  • Special function units (SFUs) for transcendentals
  • Tensor Cores (on architectures from Volta onward)

The GPU partitions work across SMs. Each SM can execute multiple thread blocks concurrently, limited by register and shared-memory usage. An A100 has 108 SMs; an RTX 4090 has 128 SMs.

Think of each SM as a "mini-GPU." Programming for GPU efficiency means ensuring every SM is busy (occupancy) and that threads within each SM access memory efficiently.

Definition:

Warp (SIMT Execution Unit)

A warp is a group of 32 threads that execute the same instruction simultaneously in lock-step (SIMT — Single Instruction, Multiple Threads). This is the fundamental unit of execution scheduling on NVIDIA GPUs.

At every clock cycle, the warp scheduler selects an eligible warp and issues one instruction to all 32 threads. If threads within a warp take different branches (e.g., if/else), the warp executes both paths serially, masking out inactive threads. This is called warp divergence and is a major performance pitfall.

$$\text{Active threads per warp} \le 32$$

$$\text{Divergent warp cost} \approx \sum_{\text{branches}} T_{\text{branch}}$$

AMD GPUs use a "wavefront" of 64 threads on GCN and CDNA architectures (RDNA defaults to 32), analogous to NVIDIA's warp. The concept is identical: groups of threads execute in lock-step.

Definition:

GPU Memory Hierarchy

The GPU memory system forms a hierarchy from fast-but-small to slow-but-large:

| Level         | Size              | Latency     | Bandwidth | Scope      |
|---------------|-------------------|-------------|-----------|------------|
| Registers     | ~256 KB/SM        | ~0 cycles   | -         | Per thread |
| Shared memory | 48-164 KB/SM      | ~20 cycles  | ~19 TB/s  | Per block  |
| L1 cache      | Shared with above | ~30 cycles  | ~19 TB/s  | Per SM     |
| L2 cache      | 6-50 MB           | ~200 cycles | ~6 TB/s   | Global     |
| Global (HBM)  | 16-80 GB          | ~400 cycles | 1-3 TB/s  | Global     |

Global memory (HBM) is the main GPU memory visible to all threads. Shared memory is a programmer-managed scratchpad shared among threads in the same block. Registers are the fastest storage, private to each thread.

Efficient GPU code minimizes global memory accesses by maximizing data reuse in shared memory and registers.

The CPU-GPU relationship adds another level: data must be transferred from CPU RAM to GPU HBM over PCIe (up to 64 GB/s) or NVLink (up to 900 GB/s). This host-device transfer is often the dominant bottleneck.
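
To make the bottleneck concrete, the following sketch (assuming CuPy is installed and a CUDA GPU is present; the array size and timings are illustrative) compares the cost of copying an array to the device against one element-wise operation on it:

import time
import numpy as np
import cupy as cp   # assumes CuPy and a CUDA-capable GPU are available

x_host = np.random.rand(100_000_000).astype(np.float32)   # ~400 MB

t0 = time.perf_counter()
x_dev = cp.asarray(x_host)              # host -> device copy over PCIe/NVLink
cp.cuda.Device().synchronize()
t_xfer = time.perf_counter() - t0

t0 = time.perf_counter()
y_dev = x_dev * 2.0 + 1.0               # element-wise kernel, memory-bound
cp.cuda.Device().synchronize()          # note: the first call also JIT-compiles the kernel
t_compute = time.perf_counter() - t0

print(f"transfer: {t_xfer*1e3:.1f} ms, compute: {t_compute*1e3:.1f} ms")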

Definition:

Coalesced Memory Access

A global memory access is coalesced when consecutive threads in a warp access consecutive memory addresses. The memory controller can then serve the entire warp with a single transaction (128 bytes on modern GPUs).

Coalesced (optimal): thread $i$ accesses address $\texttt{base} + i \times \texttt{sizeof(element)}$

Strided (poor): thread $i$ accesses address $\texttt{base} + i \times \texttt{stride} \times \texttt{sizeof(element)}$

A stride of 2 wastes 50% of bandwidth. Random access patterns can reduce effective bandwidth by 10-32x.

In NumPy terms, coalesced access corresponds to iterating over the last axis of a C-contiguous array. When converting NumPy code to GPU, ensure the parallelized axis is contiguous.
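
A quick NumPy illustration of the contiguity point (plain NumPy, no GPU required): the last axis of a C-contiguous array has the smallest stride, while a transposed view steps through memory with a large stride.

import numpy as np

a = np.zeros((1024, 1024), dtype=np.float32)   # C-contiguous (row-major)
print(a.strides)     # (4096, 4): consecutive elements along the last axis are 4 bytes apart
print(a.T.strides)   # (4, 4096): along the last axis of the transpose, each step jumps 4096 bytes

# Make a strided view contiguous before handing it to a GPU kernel:
b = np.ascontiguousarray(a.T)
print(b.flags['C_CONTIGUOUS'])   # True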

Definition:

Occupancy

Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can support:

$$\text{Occupancy} = \frac{\text{Active warps per SM}}{\text{Max warps per SM}}$$

On the A100, each SM supports up to 64 active warps (2048 threads). Occupancy is limited by three resources:

  1. Registers per thread: more registers per thread = fewer threads
  2. Shared memory per block: more shared memory = fewer blocks
  3. Block size: should be a multiple of 32 (the warp size), since a partially filled final warp still occupies a full warp slot

Higher occupancy helps hide memory latency through warp-level parallelism, but 100% occupancy is not always necessary for peak performance.
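
These limits interact, so a small calculator helps; this is a rough sketch assuming A100-class per-SM limits (2048 threads, 65536 registers, 164 KB shared memory, 32 resident blocks), which differ on other architectures:

def occupancy_estimate(threads_per_block, regs_per_thread, smem_per_block,
                       max_threads=2048, max_regs=65536,
                       max_smem=164 * 1024, max_blocks=32):
    """Estimate occupancy from the most restrictive per-SM resource."""
    blocks_by_threads = max_threads // threads_per_block
    blocks_by_regs = max_regs // (regs_per_thread * threads_per_block)
    blocks_by_smem = max_smem // smem_per_block if smem_per_block else max_blocks
    blocks = min(blocks_by_threads, blocks_by_regs, blocks_by_smem, max_blocks)
    active_warps = blocks * threads_per_block // 32
    return active_warps / (max_threads // 32)

# 256 threads/block, 64 registers/thread, 32 KB shared memory per block:
print(occupancy_estimate(256, 64, 32 * 1024))   # 0.5 -> limited by registers (4 resident blocks)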

Theorem: Amdahl's Law Applied to GPU Offloading

Let $f$ be the fraction of computation that can be parallelized on the GPU, and let $S_{\text{GPU}}$ be the speedup of the GPU portion relative to the CPU. The overall speedup is bounded by:

$$S_{\text{overall}} \le \frac{1}{(1 - f) + \frac{f}{S_{\text{GPU}}} + \frac{T_{\text{xfer}}}{T_{\text{total}}}}$$

where $T_{\text{xfer}}$ is the host-device data transfer time and $T_{\text{total}}$ is the original CPU execution time.

Even if the GPU computes 1000x faster, the serial portion and data transfer overhead set a hard ceiling on the overall speedup. For a workload that is 90% parallelizable with zero transfer cost, the maximum speedup is 10x regardless of GPU speed.
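
A minimal sketch of the bound in code, directly transcribing the formula above:

def gpu_speedup_bound(f, s_gpu, xfer_fraction=0.0):
    """Upper bound on overall speedup from GPU offloading (Amdahl-style).

    f             -- fraction of the original runtime that is parallelizable
    s_gpu         -- speedup of the offloaded portion on the GPU
    xfer_fraction -- T_xfer / T_total, transfer time relative to original runtime
    """
    return 1.0 / ((1.0 - f) + f / s_gpu + xfer_fraction)

print(gpu_speedup_bound(0.90, 1000))        # ~9.9x: the serial 10% dominates
print(gpu_speedup_bound(0.99, 50, 0.02))    # ~20x: transfer overhead now matters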

Theorem: Memory Bandwidth as the Performance Ceiling

For a kernel that performs $W$ floating-point operations on $B$ bytes of data, the arithmetic intensity is:

$$I = \frac{W}{B} \quad [\text{FLOP/byte}]$$

The kernel is memory-bound if $I < I^*$, where $I^*$ is the machine's compute-to-bandwidth ratio:

$$I^* = \frac{\text{Peak FLOP/s}}{\text{Peak bandwidth (bytes/s)}}$$

For an A100: $I^* = 19.5 \times 10^{12} / (2.0 \times 10^{12}) \approx 9.75$ FLOP/byte.

Element-wise operations like a + b have $I = 1/12$ (1 FLOP, 12 bytes: read two float32 values, write one). This is far below $I^*$, so the GPU can never reach peak compute on such operations — it waits for data. Matrix multiplication, whose arithmetic intensity grows as $O(n)$ for $n \times n$ matrices, is one of the few common operations that can saturate compute.

Theorem: Roofline Performance Model

The attainable performance $P$ (FLOP/s) of a kernel with arithmetic intensity $I$ is bounded by:

$$P \le \min\!\big(\text{Peak FLOP/s},\; I \times \text{Peak bandwidth}\big)$$

This defines the "roofline" — a piecewise-linear upper bound when plotting $\log P$ vs $\log I$. The knee (transition point) occurs at $I = I^*$.

Below the knee, performance scales linearly with arithmetic intensity (memory-bound region). Above the knee, performance is flat at peak compute (compute-bound region). Most scientific kernels live in the memory-bound region, so optimizing memory access patterns matters more than increasing parallelism.
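
The roofline bound is a one-liner; this sketch uses the A100 figures quoted earlier, with representative intensities (the matvec and GEMM values are the usual back-of-the-envelope estimates, derived in the worked example below):

def roofline(intensity, peak_flops=19.5e12, peak_bw=2.0e12):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity."""
    return min(peak_flops, intensity * peak_bw)

for name, intensity in [("a + b", 1 / 12), ("matvec", 0.5), ("GEMM n=1024", 1024 / 6)]:
    p = roofline(intensity)
    kind = "memory-bound" if intensity < 9.75 else "compute-bound"
    print(f"{name:12s} I = {intensity:7.3f}  attainable {p / 1e12:5.2f} TFLOPS  ({kind})")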

Example: Classifying Workloads by Arithmetic Intensity

Compute the arithmetic intensity of three operations and classify each as memory-bound or compute-bound on an A100 GPU ($I^* \approx 9.75$ FLOP/byte); a worked sketch follows the list:

  1. Vector addition: $\mathbf{c} = \mathbf{a} + \mathbf{b}$
  2. Matrix-vector product: $\mathbf{y} = \mathbf{A}\mathbf{x}$, with $\mathbf{A} \in \mathbb{R}^{n \times n}$
  3. Matrix-matrix product: $\mathbf{C} = \mathbf{A}\mathbf{B}$, all $n \times n$
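
A worked sketch for float32 data, counting only compulsory global-memory traffic (each array read or written once); these are the standard back-of-the-envelope counts, not measured values:

ISTAR = 9.75   # A100 compute-to-bandwidth ratio, FLOP/byte

def classify(name, flops, bytes_moved):
    intensity = flops / bytes_moved
    kind = "memory-bound" if intensity < ISTAR else "compute-bound"
    print(f"{name:18s} I = {intensity:8.2f} FLOP/byte -> {kind}")

n = 4096
classify("vector addition",   n,         3 * 4 * n)           # 1 FLOP/elem; read a, b; write c
classify("matrix-vector",     2 * n**2,  4 * (n**2 + 2 * n))  # read A, x; write y
classify("matrix-matrix",     2 * n**3,  3 * 4 * n**2)        # read A, B; write C (ideal reuse)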

Example: Quantifying Warp Divergence

A kernel applies a threshold function to an array:

if x[i] > threshold:
    y[i] = expensive_function(x[i])  # 100 cycles
else:
    y[i] = 0.0                        # 1 cycle

If 50% of elements exceed the threshold and they are randomly distributed, estimate the execution time per warp compared to the best case (all elements on the same branch).
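
A rough Monte Carlo estimate under the stated costs (100 cycles vs 1 cycle, 50% of elements above the threshold):

import random

def warp_cost(branch_flags, t_hot=100, t_cold=1):
    """Cost of one warp: a divergent warp pays for every branch any lane takes."""
    cost = 0
    if any(branch_flags):        # at least one lane takes the expensive path
        cost += t_hot
    if not all(branch_flags):    # at least one lane takes the cheap path
        cost += t_cold
    return cost

random.seed(0)
warps = [[random.random() < 0.5 for _ in range(32)] for _ in range(10_000)]
divergent = sum(warp_cost(w) for w in warps) / len(warps)

# Best case: data sorted so every warp takes a single branch.
sorted_best = 0.5 * 100 + 0.5 * 1
print(f"random: ~{divergent:.0f} cycles/warp, sorted: ~{sorted_best:.1f} cycles/warp")
# roughly 101 vs 50.5 cycles -> the divergent version is about 2x slower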

Example: Coalesced vs Strided Memory Access

A matrix $\mathbf{A}$ of shape $(M, N)$ is stored in row-major order. Compare the memory access pattern when parallelizing over rows vs columns with 32 threads per warp.
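
A sketch of the resulting address pattern (4-byte elements assumed; thread index t runs over one warp):

M, N, elem = 1024, 1024, 4   # row-major: element (i, j) lives at byte offset (i*N + j) * elem

# Warp mapped along a row: thread t reads A[0, t] -> consecutive addresses (coalesced).
row_addrs = [(0 * N + t) * elem for t in range(32)]

# Warp mapped down a column: thread t reads A[t, 0] -> stride of N*elem bytes (uncoalesced).
col_addrs = [(t * N + 0) * elem for t in range(32)]

print(row_addrs[:4])   # [0, 4, 8, 12]        -> one 128-byte transaction per warp
print(col_addrs[:4])   # [0, 4096, 8192, ...] -> up to 32 separate transactions per warp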

Interactive Roofline Model

Explore how peak compute, memory bandwidth, and arithmetic intensity interact. Adjust GPU parameters and see where common operations fall on the roofline.


GPU Memory Hierarchy Latency Explorer

Visualize the GPU memory hierarchy with latency, bandwidth, and size at each level. Adjust architecture generation to compare.


Warp Execution and Divergence Animation

Watch how a warp scheduler dispatches instructions to 32 threads, and observe the cost of branch divergence step-by-step.


NVIDIA GPU Architecture Overview

CPU vs GPU Architecture Comparison

| Feature     | CPU (e.g., AMD EPYC 9654)               | GPU (e.g., NVIDIA A100)                      |
|-------------|-----------------------------------------|----------------------------------------------|
| Cores       | 96 cores                                | 6912 CUDA cores (108 SMs)                    |
| Clock speed | 2.4-3.7 GHz                             | 1.41 GHz                                     |
| Peak FP32   | ~3.6 TFLOPS                             | 19.5 TFLOPS                                  |
| Peak FP64   | ~1.8 TFLOPS                             | 9.7 TFLOPS                                   |
| Memory      | 768 GB DDR5                             | 80 GB HBM2e                                  |
| Memory BW   | 460 GB/s                                | 2039 GB/s                                    |
| Cache       | 384 MB L3                               | 40 MB L2                                     |
| Strength    | Complex serial logic, branching         | Massively parallel data processing           |
| Weakness    | Low throughput for data-parallel tasks  | High latency for serial tasks, small memory  |

Quick Check

How many threads are in a single NVIDIA warp?

  • 16
  • 32
  • 64
  • 128

Quick Check

An element-wise operation c = a * b + d on float32 arrays reads 3 arrays and writes 1 (16 bytes per element) and performs 2 FLOPs per element. On an A100 ($I^* \approx 9.75$ FLOP/byte), this operation is:

  • Compute-bound
  • Memory-bound
  • Latency-bound
  • Transfer-bound

Common Mistake: Warp Divergence from Data-Dependent Branching

Mistake:

Writing GPU kernels with if/else branches where the condition varies across threads in the same warp:

if x[tid] > 0:
    y[tid] = sqrt(x[tid])
else:
    y[tid] = -x[tid]

Correction:

Restructure to avoid divergence. Use branchless operations:

mask = x[tid] > 0
y[tid] = mask * sqrt(abs(x[tid])) + (1 - mask) * (-x[tid])

Or sort input data so threads within each warp take the same branch.
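
In array frameworks the branchless version is usually written with a select/where primitive, which evaluates both branches for every element and then picks per element; a NumPy sketch (CuPy provides the same function for device arrays):

import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# Both branches are computed for all elements, then merged element-wise:
y = np.where(x > 0, np.sqrt(np.abs(x)), -x)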

Common Mistake: Strided Memory Access Kills Bandwidth

Mistake:

Accessing array elements with a stride (e.g., column-major access on a row-major array), causing each thread in a warp to trigger a separate memory transaction.

Correction:

Ensure threads access contiguous memory. Transpose arrays if needed, or restructure as Structure of Arrays (SoA) instead of Array of Structures (AoS).
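
A small illustration of the AoS vs SoA difference using NumPy layouts (the same stride reasoning applies to device arrays):

import numpy as np

n = 1_000_000

# Array of Structures: the fields of one particle are adjacent, so the same field
# across particles is strided -> threads reading only 'x' touch non-contiguous bytes.
aos = np.zeros(n, dtype=[('x', 'f4'), ('y', 'f4'), ('z', 'f4')])
print(aos['x'].strides)   # (12,): 12-byte stride between consecutive x values

# Structure of Arrays: each field is its own contiguous array -> coalesced reads.
soa_x = np.zeros(n, dtype=np.float32)
print(soa_x.strides)      # (4,)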

Key Takeaway

A GPU is a throughput machine, not a latency machine. It hides memory latency by keeping thousands of threads in flight simultaneously. Understanding the SM/warp/memory hierarchy is essential for predicting which workloads will benefit from GPU acceleration.

Key Takeaway

Use the roofline model to classify your workload. Most scientific operations (element-wise, reductions, matvec) are memory-bound. Only operations with high arithmetic intensity like GEMM can saturate the GPU's compute capability.

Historical Note: From Graphics Pipeline to General-Purpose Computing

2004-2007

Before CUDA (2007), GPU computing required casting scientific problems as "rendering" operations — pixel shaders became compute kernels, and textures served as arrays. Researchers like Ian Buck (Stanford, later NVIDIA) developed Brook for GPUs (2004), which influenced CUDA's design. CUDA was the first framework to expose the GPU as a general-purpose parallel processor without the graphics abstraction.

Historical Note: GPU Performance Scaling Outpaces CPUs

2005-2024

While CPU single-thread performance plateaued around 2005 due to power and frequency walls, GPU throughput has continued to scale roughly at 2x every 2 years. NVIDIA's Tesla C1060 (2008) delivered 0.9 TFLOPS; the H100 (2022) delivers 67 TFLOPS — a 74x improvement in 14 years. This "GPU Moore's Law" is driven by increasing parallelism (more SMs), wider memory buses (HBM), and architecture-specific accelerators (Tensor Cores).

Why This Matters: GPU Acceleration for MIMO Processing

In massive MIMO systems with 64-256 antennas, the baseband processing involves large matrix operations: channel estimation ($\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$), MMSE detection ($(\mathbf{H}^H\mathbf{H} + \sigma^2\mathbf{I})^{-1}\mathbf{H}^H\mathbf{y}$), and beamforming ($\mathbf{w} = \mathbf{H}^H(\mathbf{H}\mathbf{H}^H)^{-1}\mathbf{e}_k$). These are compute-bound matrix operations (high arithmetic intensity) that map perfectly to GPU architectures. GPU-based massive MIMO detectors can achieve real-time processing at sub-millisecond latency.

See full treatment in Chapter 15

Streaming Multiprocessor (SM)

The fundamental compute unit of an NVIDIA GPU, containing CUDA cores, shared memory, registers, and warp schedulers.

Related: Warp, CUDA Core

Warp

A group of 32 threads that execute the same instruction simultaneously in SIMT fashion.

Related: Streaming Multiprocessor (SM)

CUDA Core

A single floating-point/integer execution unit within an SM, capable of one FMA per clock cycle.

Related: Streaming Multiprocessor (SM)

Coalesced Memory Access

A memory access pattern where consecutive threads access consecutive addresses, allowing the memory controller to batch requests into minimal transactions.

Related: HBM (High Bandwidth Memory)

HBM (High Bandwidth Memory)

The main GPU memory technology providing high bandwidth (1-3 TB/s) through a wide memory bus, used as global memory accessible by all GPU threads.

Related: Coalesced Memory Access

Occupancy

The ratio of active warps to maximum supported warps per SM, indicating how well the GPU can hide memory latency through parallelism.

Related: Warp, Streaming Multiprocessor (SM)