GPU Architecture for the Programmer
Why GPUs Matter for Scientific Computing
Modern GPUs contain thousands of cores designed for data-parallel workloads. While a CPU excels at complex serial tasks (branch prediction, out-of-order execution), a GPU excels when the same operation is applied to millions of data elements simultaneously.
Scientific computing is dominated by matrix operations, element-wise transforms, and Monte Carlo simulations — all embarrassingly parallel. A single NVIDIA A100 GPU delivers 19.5 TFLOPS of FP32 throughput, roughly 30x that of a high-end CPU. Understanding the hardware is the key to writing code that achieves this potential.
Definition: CUDA Core
CUDA Core
A CUDA core is the basic floating-point/integer execution unit on an NVIDIA GPU. Each core can perform one single-precision fused multiply-add (FMA) per clock cycle: $d = a \times b + c$, counted as 2 FLOPs.
Modern GPUs also include Tensor Cores for mixed-precision matrix multiply-accumulate operations (used in deep learning), and separate units for double-precision, transcendental functions (sin, cos, exp), and load/store operations.
A single GPU may contain thousands of CUDA cores:
- NVIDIA A100: 6912 FP32 CUDA cores
- NVIDIA RTX 4090: 16384 FP32 CUDA cores
- NVIDIA H100: 16896 FP32 CUDA cores
The raw core count is misleading for performance comparisons. What matters is throughput = cores x clock speed x operations/cycle, combined with memory bandwidth. Most scientific workloads are memory-bound, not compute-bound.
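The throughput formula above can be checked against the A100 numbers (a quick sketch; the factor of 2 comes from counting an FMA as two FLOPs):

```python
# Peak FP32 throughput = cores × clock × operations per cycle.
# The specs below are published A100 figures; treat as illustrative.
def peak_tflops(cores: int, clock_ghz: float, ops_per_cycle: int = 2) -> float:
    """Peak throughput in TFLOPS (an FMA counts as 2 FLOPs per cycle)."""
    return cores * clock_ghz * ops_per_cycle / 1e3

a100 = peak_tflops(cores=6912, clock_ghz=1.41)   # roughly 19.5 TFLOPS
```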
Definition: Streaming Multiprocessor (SM)
Streaming Multiprocessor (SM)
A Streaming Multiprocessor (SM) is the fundamental compute building block of an NVIDIA GPU. Each SM contains:
- A set of CUDA cores (e.g., 128 FP32 cores per SM on Ampere)
- A shared memory / L1 cache (configurable, typically 48-164 KB)
- A warp scheduler (issues instructions to warps every cycle)
- Register file (65536 32-bit registers on Ampere)
- Special function units (SFUs) for transcendentals
- Tensor Cores (on architectures from Volta onward)
The GPU partitions work across SMs. Each SM can execute multiple thread blocks concurrently, limited by register and shared-memory usage. An A100 has 108 SMs; an RTX 4090 has 128 SMs.
Think of each SM as a "mini-GPU." Programming for GPU efficiency means ensuring every SM is busy (occupancy) and that threads within each SM access memory efficiently.
Definition: Warp (SIMT Execution Unit)
Warp (SIMT Execution Unit)
A warp is a group of 32 threads that execute the same instruction simultaneously in lock-step (SIMT — Single Instruction, Multiple Threads). This is the fundamental unit of execution scheduling on NVIDIA GPUs.
At every clock cycle, the warp scheduler selects an eligible warp and issues one instruction to all 32 threads. If threads within a warp take different branches (e.g., if/else), the warp executes both paths serially, masking out inactive threads. This is called warp divergence and is a major performance pitfall.
AMD GPUs use a "wavefront" of 64 threads (32 on RDNA3), analogous to NVIDIA's warp. The concept is identical: groups of threads execute in lock-step.
Definition: GPU Memory Hierarchy
GPU Memory Hierarchy
The GPU memory system forms a hierarchy from fast-but-small to slow-but-large:
| Level | Size | Latency | Bandwidth | Scope |
|---|---|---|---|---|
| Registers | ~256 KB/SM | 0 cycles | - | Per thread |
| Shared Memory | 48-164 KB/SM | ~20 cycles | ~19 TB/s | Per block |
| L1 Cache | Shared with above | ~30 cycles | ~19 TB/s | Per SM |
| L2 Cache | 6-50 MB | ~200 cycles | ~6 TB/s | Global |
| Global (HBM) | 16-80 GB | ~400 cycles | 1-3 TB/s | Global |
Global memory (HBM) is the main GPU memory visible to all threads. Shared memory is a programmer-managed scratchpad shared among threads in the same block. Registers are the fastest storage, private to each thread.
Efficient GPU code minimizes global memory accesses by maximizing data reuse in shared memory and registers.
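The payoff of reuse can be sketched with a simple traffic model (a hypothetical helper, not a library API; the tile size stands in for a shared-memory tile in a tiled matmul):

```python
# Data-reuse model (a sketch): tiling an n×n matmul with T×T tiles held in
# shared memory cuts global-memory reads of A and B by a factor of T.
def matmul_global_reads(n: int, tile: int = 1) -> int:
    """Float reads from global memory for a tiled n×n matrix multiply.
    With tile=1 (no reuse), every operand is re-read n times."""
    passes = n // tile            # how many times each matrix is swept
    return 2 * n * n * passes     # reads of A plus reads of B

naive = matmul_global_reads(1024, tile=1)    # 2 * 1024^3 float reads
tiled = matmul_global_reads(1024, tile=32)   # 32x fewer global reads
```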
The CPU-GPU relationship adds another level: data must be transferred from CPU RAM to GPU HBM over PCIe (up to 64 GB/s) or NVLink (up to 900 GB/s). This host-device transfer is often the dominant bottleneck.
Definition: Coalesced Memory Access
Coalesced Memory Access
A global memory access is coalesced when consecutive threads in a warp access consecutive memory addresses. The memory controller can then serve the entire warp with a single transaction (128 bytes on modern GPUs).
Coalesced (optimal): thread $i$ reads data[i] — 32 consecutive 4-byte words, served by one 128-byte transaction per warp.
Strided (poor): thread $i$ reads data[2*i] — the warp touches 256 bytes to use 128.
A stride of 2 wastes 50% of bandwidth. Random access patterns can reduce effective bandwidth by 10-32x.
In NumPy terms, coalesced access corresponds to iterating over the last axis of a C-contiguous array. When converting NumPy code to GPU, ensure the parallelized axis is contiguous.
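The contiguity in question is visible directly in an array's strides (a small illustration, assuming a 1024×1024 float32 array):

```python
import numpy as np

# A C-contiguous array steps 4 bytes along its last axis, the analogue of
# coalesced access. Stepping along the first axis jumps a full row.
A = np.zeros((1024, 1024), dtype=np.float32)
assert A.flags["C_CONTIGUOUS"]
assert A.strides == (4096, 4)    # (bytes per row step, bytes per column step)

col = A[:, 0]    # column view: 4096-byte stride, "strided" access pattern
row = A[0, :]    # row view: 4-byte stride, "coalesced" access pattern

fixed = np.ascontiguousarray(col)   # copying makes the data contiguous again
```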
Definition: Occupancy
Occupancy
Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can support:
$$\text{occupancy} = \frac{\text{active warps per SM}}{\text{maximum warps per SM}}$$
On Ampere, each SM supports up to 64 active warps (2048 threads). Occupancy is limited by three resources:
- Registers per thread: more registers per thread = fewer threads
- Shared memory per block: more shared memory = fewer blocks
- Block size: must be a multiple of 32 (warp size)
Higher occupancy helps hide memory latency through warp-level parallelism, but 100% occupancy is not always necessary for peak performance.
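These three limits can be combined into a back-of-the-envelope calculator (a simplified sketch of the real CUDA occupancy rules, using A100-like Ampere limits; actual calculators also model allocation granularity):

```python
# Occupancy estimate: blocks per SM are capped by threads, registers, and
# shared memory; the tightest limit wins. Limits below are A100-like values.
def occupancy(threads_per_block: int, regs_per_thread: int, smem_per_block: int,
              max_threads: int = 2048, reg_file: int = 65536,
              smem_total: int = 164 * 1024, max_blocks: int = 32) -> float:
    by_threads = max_threads // threads_per_block
    by_regs = reg_file // (regs_per_thread * threads_per_block)
    by_smem = smem_total // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_threads, by_regs, by_smem, max_blocks)
    active_warps = blocks * threads_per_block // 32
    return active_warps / (max_threads // 32)

occupancy(256, 32, 0)   # 1.0: 8 blocks of 8 warps fill all 64 warp slots
occupancy(256, 64, 0)   # 0.5: the register file only allows 4 blocks
```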
Theorem: Amdahl's Law Applied to GPU Offloading
Let $p$ be the fraction of computation that can be parallelized on the GPU, and let $s$ be the speedup of the GPU portion relative to the CPU. The overall speedup is bounded by:
$$S = \frac{1}{(1-p) + \dfrac{p}{s} + \dfrac{T_{\text{xfer}}}{T_{\text{CPU}}}}$$
where $T_{\text{xfer}}$ is the host-device data transfer time and $T_{\text{CPU}}$ is the original CPU execution time.
Even if the GPU computes 1000x faster, the serial portion and data transfer overhead set a hard ceiling on the overall speedup. For a workload that is 90% parallelizable with zero transfer cost, the maximum speedup is 10x regardless of GPU speed.
Decompose total time
$$T_{\text{total}} = (1-p)\,T_{\text{CPU}} + \frac{p\,T_{\text{CPU}}}{s} + T_{\text{xfer}}, \qquad S = \frac{T_{\text{CPU}}}{T_{\text{total}}}$$
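A numeric sketch of this bound (the function name `gpu_speedup` is ours, not a library API; times are in units of the CPU run time):

```python
# Amdahl ceiling for GPU offload: p = parallel fraction, s = GPU speedup
# of the offloaded portion, t_transfer = host-device transfer time.
def gpu_speedup(p: float, s: float, t_transfer: float = 0.0,
                t_cpu: float = 1.0) -> float:
    t_total = (1 - p) * t_cpu + p * t_cpu / s + t_transfer
    return t_cpu / t_total

gpu_speedup(0.9, 1e9)         # ~10: the 10% serial part caps the speedup
gpu_speedup(0.9, 1e9, 0.1)    # transfer overhead lowers the ceiling to ~5
```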
Theorem: Memory Bandwidth as the Performance Ceiling
For a kernel that performs $W$ floating-point operations on $Q$ bytes of data moved to and from memory, the arithmetic intensity is:
$$I = \frac{W}{Q} \quad \text{[FLOP/byte]}$$
The kernel is memory-bound if $I < I_{\text{crit}}$, where $I_{\text{crit}}$ is the machine's compute-to-bandwidth ratio:
$$I_{\text{crit}} = \frac{P_{\text{peak}}}{B}$$
For an A100: $I_{\text{crit}} = \dfrac{19.5\ \text{TFLOP/s}}{2.0\ \text{TB/s}} \approx 9.75$ FLOP/byte.
Element-wise operations like a + b have $I = \frac{1}{12} \approx 0.083$ (1 FLOP, 12 bytes: read 2 floats, write 1). This is far below $I_{\text{crit}}$, so the GPU can never reach peak compute on such operations — it waits for data. Matrix multiplication, with $I \approx n/6$ growing with problem size, is one of the few operations that can saturate compute.
Theorem: Roofline Performance Model
The attainable performance $P$ (FLOP/s) of a kernel with arithmetic intensity $I$ is bounded by:
$$P \le \min\left(P_{\text{peak}},\ B \times I\right)$$
This defines the "roofline" — a piecewise-linear upper bound when plotting $P$ vs $I$. The knee (transition point) occurs at $I = P_{\text{peak}} / B$.
Below the knee, performance scales linearly with arithmetic intensity (memory-bound region). Above the knee, performance is flat at peak compute (compute-bound region). Most scientific kernels live in the memory-bound region, so optimizing memory access patterns matters more than increasing parallelism.
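The roofline itself is one line of code (a sketch using the A100 figures quoted above):

```python
# Roofline model: attainable TFLOP/s is the lesser of the compute roof
# and the memory-bandwidth slope B * I.
def roofline(intensity: float, peak_tflops: float = 19.5,
             bandwidth_tbs: float = 2.0) -> float:
    return min(peak_tflops, bandwidth_tbs * intensity)

roofline(0.083)   # vector add: deep in the memory-bound region
roofline(100.0)   # large GEMM: pinned at the 19.5 TFLOP/s compute roof
```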
Example: Classifying Workloads by Arithmetic Intensity
Compute the arithmetic intensity of three operations and classify each as memory-bound or compute-bound on an A100 GPU ($I_{\text{crit}} \approx 9.75$ FLOP/byte):
- Vector addition: $c = a + b$, with $a, b, c \in \mathbb{R}^n$
- Matrix-vector product: $y = Ax$, $A \in \mathbb{R}^{n \times n}$
- Matrix-matrix product: $C = AB$, all $n \times n$
Vector addition
Read $2n$ floats ($8n$ bytes), write $n$ floats ($4n$ bytes), perform $n$ adds. $I = \frac{n}{12n} = \frac{1}{12} \approx 0.083$ FLOP/byte. Memory-bound.
Matrix-vector product
Read $n^2 + n$ floats ($\approx 4n^2$ bytes), perform $2n^2$ FLOPs (multiply + add). $I \approx \frac{2n^2}{4n^2} = 0.5$ FLOP/byte. Memory-bound for any $n$.
Matrix-matrix product
Read $2n^2$ floats and write $n^2$ ($12n^2$ bytes total), perform $2n^3$ FLOPs. $I = \frac{2n^3}{12n^2} = \frac{n}{6}$. For $n \gtrsim 6 \times 9.75 \approx 59$, $I > I_{\text{crit}}$: compute-bound for large matrices. This is why GEMM is the poster child for GPU acceleration.
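The three intensities can be written down and checked numerically (a sketch; byte counts assume float32, matching the worked example):

```python
# Arithmetic intensity (FLOP/byte) for the three kernels, float32 operands.
def intensity_vec_add(n: int) -> float:
    return n / (12 * n)                          # n adds, 12n bytes

def intensity_matvec(n: int) -> float:
    return 2 * n * n / (4 * (n * n + 2 * n))     # 2n^2 FLOPs, ~4n^2 bytes

def intensity_matmul(n: int) -> float:
    return 2 * n**3 / (12 * n * n)               # n/6, grows with n

I_CRIT = 9.75   # A100 compute-to-bandwidth ratio
# matmul crosses from memory-bound to compute-bound near n = 6 * 9.75 ≈ 59
```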
Example: Quantifying Warp Divergence
A kernel applies a threshold function to an array:
```python
if x[i] > threshold:
    y[i] = expensive_function(x[i])   # 100 cycles
else:
    y[i] = 0.0                        # 1 cycle
```
If 50% of elements exceed the threshold and they are randomly distributed, estimate the execution time per warp compared to the best case (all elements on the same branch).
Analyze warp behavior
With random distribution, the probability that all 32 threads in a warp take the same branch is $2 \times 0.5^{32} \approx 4.7 \times 10^{-10}$. So virtually every warp diverges.
Compute divergent cost
A divergent warp executes both branches serially: $100 + 1 = 101$ cycles.
Best case (all on the expensive branch): 100 cycles. Best case (all on the cheap branch): 1 cycle. Average non-divergent cost at a 50% mix: $(100 + 1)/2 = 50.5$ cycles.
The divergent warp is $101 / 50.5 = 2\times$ slower than the theoretical average.
Mitigation: sort data so that elements above threshold are grouped together, ensuring most warps are non-divergent.
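These numbers are easy to reproduce with a short Monte Carlo simulation (pure Python standing in for the warp scheduler; `EXPENSIVE` and `CHEAP` are the cycle counts assumed above):

```python
import random

# Each warp has 32 lanes; a lane takes the expensive branch with p = 0.5.
# A warp pays for every branch that at least one of its lanes takes.
random.seed(0)
EXPENSIVE, CHEAP = 100, 1

def warp_cost(lanes: list) -> int:
    """Cycles for one warp: serialized cost of every branch taken."""
    return EXPENSIVE * any(lanes) + CHEAP * (not all(lanes))

warps = [[random.random() < 0.5 for _ in range(32)] for _ in range(10_000)]
divergent = sum(1 for w in warps if any(w) and not all(w))
avg_cost = sum(warp_cost(w) for w in warps) / len(warps)
# divergent fraction ≈ 1.0; avg_cost ≈ 101 cycles, twice the 50.5 ideal
```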
Example: Coalesced vs Strided Memory Access
A matrix $A$ of shape $N \times N$ is stored in row-major order. Compare the memory access pattern when parallelizing over rows vs columns with 32 threads per warp.
Column-parallel (coalesced)
Each thread processes one column in the current row: thread $i$ accesses A[row, i] at address $\text{base} + (row \cdot N + i) \times 4$. Consecutive threads access consecutive addresses: stride = sizeof(float) = 4 bytes. This is coalesced — a single 128-byte transaction serves all 32 threads.
Row-parallel (strided)
Each thread processes one row in the current column: thread $i$ accesses A[i, col] at address $\text{base} + (i \cdot N + col) \times 4$. Consecutive threads access addresses separated by $4N$ bytes. For $N = 1024$, this is a stride of 4096 bytes — each thread's request goes to a different cache line. This requires up to 32 separate memory transactions, reducing effective bandwidth by ~32x.
Practical fix
Transpose the matrix or change the loop order so that the innermost parallel dimension maps to contiguous memory. In CuPy/PyTorch, ensure the tensor is contiguous along the last dimension with .contiguous().
Interactive Roofline Model
Explore how peak compute, memory bandwidth, and arithmetic intensity interact. Adjust GPU parameters and see where common operations fall on the roofline.
GPU Memory Hierarchy Latency Explorer
Visualize the GPU memory hierarchy with latency, bandwidth, and size at each level. Adjust architecture generation to compare.
Warp Execution and Divergence Animation
Watch how a warp scheduler dispatches instructions to 32 threads, and observe the cost of branch divergence step-by-step.
NVIDIA GPU Architecture Overview
CPU vs GPU Architecture Comparison
| Feature | CPU (e.g., AMD EPYC 9654) | GPU (e.g., NVIDIA A100) |
|---|---|---|
| Cores | 96 cores | 6912 CUDA cores (108 SMs) |
| Clock Speed | 2.4-3.7 GHz | 1.41 GHz |
| Peak FP32 | ~3.6 TFLOPS | 19.5 TFLOPS |
| Peak FP64 | ~1.8 TFLOPS | 9.7 TFLOPS |
| Memory | 768 GB DDR5 | 80 GB HBM2e |
| Memory BW | 460 GB/s | 2039 GB/s |
| Cache | 384 MB L3 | 40 MB L2 |
| Strength | Complex serial logic, branching | Massively parallel data processing |
| Weakness | Low throughput for data-parallel tasks | High latency for serial tasks, small memory |
Quick Check
How many threads are in a single NVIDIA warp?
16
32
64
128
A warp consists of 32 threads that execute in lock-step (SIMT).
Quick Check
An element-wise operation c = a * b + d on float32 arrays reads 3 arrays and writes 1 (16 bytes per element) and performs 2 FLOPs per element. On an A100 ($I_{\text{crit}} \approx 9.75$ FLOP/byte), this operation is:
Compute-bound
Memory-bound
Latency-bound
Transfer-bound
With I = 0.125 << 9.75, the GPU spends most of its time waiting for data from memory.
Common Mistake: Warp Divergence from Data-Dependent Branching
Mistake:
Writing GPU kernels with if/else branches where the condition varies across threads in the same warp:

```python
if x[tid] > 0:
    y[tid] = sqrt(x[tid])
else:
    y[tid] = -x[tid]
```
Correction:
Restructure to avoid divergence. Use branchless (predicated) operations:

```python
mask = x[tid] > 0
y[tid] = mask * sqrt(abs(x[tid])) + (1 - mask) * (-x[tid])
```
Or sort input data so threads within each warp take the same branch.
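The branchless form can be checked against the branchy one (NumPy stands in for the kernel body here; in GPU array libraries a `where`-style select is predicated rather than divergent):

```python
import numpy as np

# abs() keeps sqrt defined even on lanes where the other branch "wins",
# since a predicated select evaluates both sides for every element.
x = np.array([4.0, -9.0, 16.0, -1.0], dtype=np.float32)

branchy = np.where(x > 0, np.sqrt(np.abs(x)), -x)
mask = (x > 0).astype(np.float32)
branchless = mask * np.sqrt(np.abs(x)) + (1 - mask) * (-x)

assert np.allclose(branchy, branchless)   # [2, 9, 4, 1]
```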
Common Mistake: Strided Memory Access Kills Bandwidth
Mistake:
Accessing array elements with a stride (e.g., column-major access on a row-major array), causing each thread in a warp to trigger a separate memory transaction.
Correction:
Ensure threads access contiguous memory. Transpose arrays if needed, or restructure as Structure of Arrays (SoA) instead of Array of Structures (AoS).
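The AoS/SoA difference shows up directly in the strides (an illustrative sketch using a NumPy structured array as the AoS layout):

```python
import numpy as np

# AoS: x, y, z interleaved per record, so reading one field steps 12 bytes.
aos = np.zeros(1024, dtype=[("x", "f4"), ("y", "f4"), ("z", "f4")])
aos_x = aos["x"]              # strided view: 12-byte steps

# SoA: one contiguous array per field, the coalesced-friendly layout.
soa_x = np.zeros(1024, dtype=np.float32)   # 4-byte steps
```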
Key Takeaway
A GPU is a throughput machine, not a latency machine. It hides memory latency by keeping thousands of threads in flight simultaneously. Understanding the SM/warp/memory hierarchy is essential for predicting which workloads will benefit from GPU acceleration.
Key Takeaway
Use the roofline model to classify your workload. Most scientific operations (element-wise, reductions, matvec) are memory-bound. Only operations with high arithmetic intensity like GEMM can saturate the GPU's compute capability.
Historical Note: From Graphics Pipeline to General-Purpose Computing
2004-2007Before CUDA (2007), GPU computing required casting scientific problems as "rendering" operations — pixel shaders became compute kernels, and textures served as arrays. Researchers like Ian Buck (Stanford, later NVIDIA) developed Brook for GPUs (2004), which influenced CUDA's design. CUDA was the first framework to expose the GPU as a general-purpose parallel processor without the graphics abstraction.
Historical Note: GPU Performance Scaling Outpaces CPUs
2005-2024While CPU single-thread performance plateaued around 2005 due to power and frequency walls, GPU throughput has continued to scale roughly at 2x every 2 years. NVIDIA's Tesla C1060 (2008) delivered 0.9 TFLOPS; the H100 (2022) delivers 67 TFLOPS — a 74x improvement in 14 years. This "GPU Moore's Law" is driven by increasing parallelism (more SMs), wider memory buses (HBM), and architecture-specific accelerators (Tensor Cores).
Why This Matters: GPU Acceleration for MIMO Processing
In massive MIMO systems with 64-256 antennas, the baseband processing involves large matrix operations: channel estimation, MMSE detection, and beamforming. These are compute-bound matrix operations (high arithmetic intensity) that map perfectly to GPU architectures. GPU-based massive MIMO detectors can achieve real-time processing at sub-millisecond latency.
See full treatment in Chapter 15
Warp
A group of 32 threads that execute the same instruction simultaneously in SIMT fashion.
Related: Streaming Multiprocessor (SM)
CUDA Core
A single floating-point/integer execution unit within an SM, capable of one FMA per clock cycle.
Related: Streaming Multiprocessor (SM)
Coalesced Memory Access
A memory access pattern where consecutive threads access consecutive addresses, allowing the memory controller to batch requests into minimal transactions.
Related: HBM (High Bandwidth Memory)
HBM (High Bandwidth Memory)
The main GPU memory technology providing high bandwidth (1-3 TB/s) through a wide memory bus, used as global memory accessible by all GPU threads.
Related: Coalesced Memory Access
Occupancy
The ratio of active warps to maximum supported warps per SM, indicating how well the GPU can hide memory latency through parallelism.
Related: Warp, Streaming Multiprocessor (SM)