Exercises
ex-sp-ch10-01
Easy: Compute the arithmetic intensity (FLOP/byte) for the following operations on float32 arrays of length n:
- Vector copy: b = a.copy()
- Scalar multiply: b = 3.0 * a
- Dot product: s = np.dot(a, b)
Classify each as memory-bound or compute-bound on an A100 (ridge point ~9.6 FLOP/byte).
Count bytes read AND bytes written.
A dot product reads 2n floats and performs 2n FLOPs (one multiply and one add per element).
Analysis
- Copy: 0 FLOPs, 8n bytes (4n read + 4n write) -> 0 FLOP/byte. Memory-bound.
- Scalar multiply: n FLOPs, 8n bytes -> 0.125 FLOP/byte. Memory-bound.
- Dot product: 2n FLOPs, 8n bytes (two input arrays) -> 0.25 FLOP/byte. Memory-bound.
All three are far below the A100 ridge point of ~9.6 FLOP/byte, confirming they are memory-bound.
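A minimal sketch of the intensity bookkeeping (the ~9.6 FLOP/byte ridge point is derived from the A100 figures quoted in ex-sp-ch10-06):
# Arithmetic intensity = FLOPs / bytes moved (per element, float32).
ops = {
    "copy":            (0, 8),   # 0 FLOPs; read 4 B + write 4 B
    "scalar multiply": (1, 8),   # 1 FLOP;  read 4 B + write 4 B
    "dot product":     (2, 8),   # multiply + add; read 2 x 4 B
}
RIDGE = 19.5e12 / 2039e9         # ~9.6 FLOP/byte on an A100

for name, (flops, nbytes) in ops.items():
    ai = flops / nbytes
    kind = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{name:16s}: {ai:.3f} FLOP/byte -> {kind}")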
ex-sp-ch10-02
Easy: For an array of 100,000 float32 elements with block size 256:
- How many blocks are in the grid?
- How many total threads are launched?
- How many threads are "wasted" (don't process data)?
- How many warps are there in total?
Blocks = ceil(N / block_size).
Total threads = blocks x block_size.
Computation
- Blocks: ceil(100,000 / 256) = 391 blocks
- Total threads: 391 x 256 = 100,096
- Wasted: 100,096 - 100,000 = 96 threads (0.096%)
- Total warps: 100,096 / 32 = 3,128 warps
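A quick check of these numbers in Python (assuming the 100,000-element array and block size 256 above):
import math

N, BLOCK, WARP = 100_000, 256, 32

blocks = math.ceil(N / BLOCK)        # grid size: 391
total_threads = blocks * BLOCK       # 100,096 threads launched
wasted = total_threads - N           # 96 threads with no element to process
warps = total_threads // WARP        # 3,128 warps
print(blocks, total_threads, wasted, warps)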
ex-sp-ch10-03
Easy: An NVIDIA A100 has 108 SMs, each supporting up to 2048 threads. A kernel launches 50,000 threads with a block size of 256.
- How many blocks are created?
- What is the maximum occupancy if each SM gets 8 blocks?
- What percentage of SMs are utilized?
Blocks = ceil(total_threads / block_size).
Occupancy = active_threads / max_threads per SM.
Computation
- Blocks = ceil(50,000 / 256) = 196 blocks (rounding up)
- 8 blocks x 256 threads = 2048 threads per SM = 100% occupancy
- With 196 blocks and 8 blocks per SM, we need ceil(196 / 8) = 25 SMs. Utilization = 25/108 ~ 23%. Most SMs are idle. Full utilization would need 108 x 2048 = 221,184 threads.
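The same arithmetic as a short script:
import math

threads, block = 50_000, 256
sms, max_threads_per_sm, blocks_per_sm = 108, 2048, 8

blocks = math.ceil(threads / block)                       # 196
occupancy = blocks_per_sm * block / max_threads_per_sm    # 1.0 on an active SM
active_sms = math.ceil(blocks / blocks_per_sm)            # 25
utilization = active_sms / sms                            # ~0.23
print(blocks, occupancy, active_sms, round(utilization, 2), sms * max_threads_per_sm)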
ex-sp-ch10-04
Easy: Write a Python script that checks your GPU setup:
- Print the NVIDIA driver version and CUDA version from nvidia-smi
- Check if PyTorch can see the GPU
- Print GPU name, total memory, and CUDA version
- Run a simple GPU computation and verify the result
Use torch.cuda.get_device_properties(0) for device info.
Implementation
import subprocess
import torch
# 1. nvidia-smi
result = subprocess.run(['nvidia-smi', '--query-gpu=driver_version,name,memory.total',
                         '--format=csv,noheader'],
                        capture_output=True, text=True)
print(f"nvidia-smi: {result.stdout.strip()}")
# 2-3. PyTorch GPU info
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Memory: {props.total_memory / 1e9:.1f} GB")
    print(f"SMs: {props.multi_processor_count}")
    print(f"CUDA: {torch.version.cuda}")
# 4. Test computation
x = torch.randn(1000, 1000, device='cuda')
y = x @ x
torch.cuda.synchronize()
print(f"Test matmul: shape={y.shape}, OK")
ex-sp-ch10-05
Easy: Explain why each of these GPU timing measurements is incorrect, and fix each one:
# Measurement A
t0 = time.time()
y = x_gpu @ x_gpu
print(time.time() - t0)
# Measurement B
torch.cuda.synchronize()
t0 = time.time()
y = x_gpu @ x_gpu
torch.cuda.synchronize()
print(time.time() - t0) # first run ever
Consider asynchronous execution and warmup.
Analysis
A: Missing synchronize() -> measures only the kernel launch time (~0.1 ms),
not the GPU compute time. Fix: add torch.cuda.synchronize() after the operation.
B: No warmup -> the first CUDA operation includes context initialization (100-500 ms extra). Fix: run 2-3 warmup iterations before timing.
Correct version:
# Warmup
for _ in range(3):
    _ = x_gpu @ x_gpu
torch.cuda.synchronize()
# Timed
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x_gpu @ x_gpu
torch.cuda.synchronize()
print(f"{time.perf_counter() - t0:.4f}s")
ex-sp-ch10-06
Medium: Using the roofline model, determine the maximum achievable FLOP/s for the following operations on an A100 (19.5 TFLOPS, 2039 GB/s bandwidth):
- Element-wise exp(x) on float32 (read 4 bytes, write 4 bytes, ~8 FLOPs per element)
- Matrix-matrix multiply of two large n x n float32 matrices
- Batched vector addition of 1000 vectors of length 10000
For each, state whether the operation is memory-bound or compute-bound.
Compute the arithmetic intensity I = FLOPs / bytes, then attainable FLOP/s = min(peak FLOP/s, I x bandwidth).
Analysis
- exp(x): 8 FLOPs / 8 bytes = 1 FLOP/byte. min(19.5, 1 x 2.039) ~ 2.0 TFLOPS. Memory-bound.
- GEMM: 2n^3 FLOPs over ~3 x 4n^2 bytes -> intensity ~ n/6 FLOP/byte, above the ~9.6 FLOP/byte ridge point for n larger than about 60. Attainable: 19.5 TFLOPS. Compute-bound.
- Vector add: 1 FLOP per element over 12 bytes (two reads, one write) -> ~0.083 FLOP/byte. 0.083 x 2039 GB/s ~ 0.17 TFLOPS. Memory-bound.
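A short sketch of the min(peak, I x BW) bound used above; the per-element FLOP and byte counts are those stated in the exercise, and the GEMM size n = 4096 is an illustrative assumption:
PEAK_TFLOPS = 19.5
BW_GBS = 2039.0

def attainable_tflops(flops, nbytes):
    """Roofline bound: min(peak compute, intensity x bandwidth)."""
    intensity = flops / nbytes                        # FLOP/byte
    return min(PEAK_TFLOPS, intensity * BW_GBS / 1e3), intensity

n = 4096  # assumed GEMM size, for illustration only
cases = {
    "exp(x), per element":     (8, 8),
    f"GEMM {n}x{n}":           (2 * n**3, 3 * 4 * n**2),
    "vector add, per element": (1, 12),
}
for name, (f, b) in cases.items():
    bound, ai = attainable_tflops(f, b)
    print(f"{name:24s}: I = {ai:8.2f} FLOP/byte, bound = {bound:5.2f} TFLOPS")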
ex-sp-ch10-07
Medium: A warp of 32 threads executes a loop where thread i accesses
data[i * stride] for stride values of 1, 2, 4, 8, 16, 32.
For each stride, calculate:
- The number of 128-byte cache lines touched by the warp
- The effective bandwidth utilization (useful bytes / total bytes fetched)
Assume float32 elements (4 bytes each).
With stride s, the warp's addresses span 32 x s x 4 bytes.
Each cache line is 128 bytes = 32 float32 elements.
Analysis
| Stride | Range (bytes) | Cache Lines | Useful/Fetched |
|---|---|---|---|
| 1 | 128 | 1 | 128/128 = 100% |
| 2 | 256 | 2 | 128/256 = 50% |
| 4 | 512 | 4 | 128/512 = 25% |
| 8 | 1024 | 8 | 128/1024 = 12.5% |
| 16 | 2048 | 16 | 128/2048 = 6.25% |
| 32 | 4096 | 32 | 128/4096 = 3.125% |
Stride-1 access is perfectly coalesced (1 transaction). Stride-32 wastes 97% of memory bandwidth.
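A few lines of Python reproduce the table, assuming the simple model above (one float32 per thread, 128-byte lines, warp addresses starting at a line boundary):
LINE, ELEM, WARP = 128, 4, 32   # bytes per cache line, bytes per float32, threads per warp

for stride in (1, 2, 4, 8, 16, 32):
    span = WARP * stride * ELEM          # bytes spanned by the warp's addresses
    lines = max(1, span // LINE)         # 128-byte cache lines touched
    useful = WARP * ELEM                 # bytes the warp actually uses
    print(f"stride {stride:2d}: {lines:2d} lines, {useful / (lines * LINE):7.2%} useful")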
ex-sp-ch10-08
Medium: Write a Python script using PyTorch that benchmarks matrix multiplication on CPU vs GPU for sizes n = 128, 256, 512, 1024, 2048, 4096. Plot the speedup (CPU time / GPU time) vs matrix size. Include proper warmup and synchronization.
Use torch.cuda.synchronize() for GPU timing.
Use CUDA Events for precise GPU timing.
Implementation
import torch
import time
import numpy as np
sizes = [128, 256, 512, 1024, 2048, 4096]
cpu_times, gpu_times = [], []
for n in sizes:
    # CPU
    x_cpu = torch.randn(n, n)
    times = []
    for _ in range(5):
        t0 = time.perf_counter()
        _ = x_cpu @ x_cpu
        times.append(time.perf_counter() - t0)
    cpu_times.append(np.median(times))
    # GPU
    x_gpu = torch.randn(n, n, device='cuda')
    for _ in range(3):  # warmup
        _ = x_gpu @ x_gpu
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(10):
        start.record()
        _ = x_gpu @ x_gpu
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) / 1000)
    gpu_times.append(np.median(times))
speedups = [c/g for c, g in zip(cpu_times, gpu_times)]
for n, s in zip(sizes, speedups):
    print(f"n={n:5d}: speedup = {s:.1f}x")
ex-sp-ch10-09
Medium: A training pipeline has the following profile:
- Data loading: 200 ms (CPU)
- Data augmentation: 50 ms (CPU)
- Host-to-device transfer: 10 ms
- Forward pass: 30 ms (GPU)
- Backward pass: 50 ms (GPU)
- Optimizer step: 5 ms (GPU)
- Device-to-host (loss): 0.1 ms
- What is the total iteration time?
- What is the GPU utilization?
- Apply Amdahl's law: what speedup would a 2x faster GPU give?
- Suggest two optimizations with estimated impact.
GPU utilization = GPU time / total time.
Consider pipelining data loading with GPU compute.
Analysis
- Total: 200 + 50 + 10 + 30 + 50 + 5 + 0.1 = 345.1 ms
- GPU util: (30 + 50 + 5) / 345.1 = 85 / 345.1 ~ 25%
- Amdahl's law: the GPU fraction is p ~ 0.25, so halving the GPU time gives speedup 1 / ((1 - p) + p/2) = 345.1 / 302.6 ~ 1.14. A 2x faster GPU gives only 14% overall speedup!
- Optimizations:
  - GPU data augmentation: saves 50 ms, total -> ~295 ms (~1.17x faster)
  - Prefetch/pipeline data loading: overlap the 200 ms load with GPU compute, total -> ~145 ms (~2.4x faster)
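The Amdahl arithmetic, using the stage times from the profile above:
stages_ms = {
    "load": 200, "augment": 50, "h2d": 10,
    "forward": 30, "backward": 50, "optimizer": 5, "d2h": 0.1,
}
gpu_stages = ("forward", "backward", "optimizer")

total = sum(stages_ms.values())                      # 345.1 ms
gpu_time = sum(stages_ms[s] for s in gpu_stages)     # 85 ms
util = gpu_time / total                              # ~0.25

# Amdahl's law: only the GPU fraction benefits from a 2x faster GPU.
speedup = total / ((total - gpu_time) + gpu_time / 2)
print(f"total = {total} ms, GPU util = {util:.0%}, 2x-GPU speedup = {speedup:.2f}x")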
ex-sp-ch10-10
Medium: Use torch.profiler to profile 10 iterations of a matrix
multiplication (2048 x 2048) and answer:
- What is the kernel name for matmul?
- What percentage of time is CUDA vs CPU?
- What is the CUDA memory allocated?
Print the profiler table sorted by CUDA time.
Use ProfilerActivity.CPU and ProfilerActivity.CUDA.
The kernel name is typically aten::mm or sm80_xmma_gemm.
Implementation
import torch
from torch.profiler import profile, ProfilerActivity
x = torch.randn(2048, 2048, device='cuda')
# Warmup
for _ in range(3):
    _ = x @ x
torch.cuda.synchronize()
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        y = x @ x
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10))
ex-sp-ch10-11
HardBuild a "GPU calculator" that determines whether a given computation should run on CPU or GPU based on the roofline model:
def should_use_gpu(flops, bytes_moved, cpu_gflops, gpu_gflops,
                   gpu_bandwidth_gbs, pcie_bandwidth_gbs):
    """Return True if GPU is faster, accounting for transfer."""
Test it for: vector add (n=1000 to n=10M), GEMM (n=100 to n=4096), and batch FFT (batch=1 to batch=10000, size=1024).
GPU time = max(flops/gpu_gflops, bytes_moved/bandwidth) + 2*data_size/pcie_bw.
CPU time = flops/cpu_gflops.
Implementation
def should_use_gpu(flops, bytes_moved, data_bytes,
                   cpu_gflops=100, gpu_gflops=19500,
                   gpu_bw_gbs=2039, pcie_bw_gbs=25):
    cpu_time = flops / (cpu_gflops * 1e9)
    gpu_compute = flops / (gpu_gflops * 1e9)
    gpu_memtime = bytes_moved / (gpu_bw_gbs * 1e9)
    gpu_kernel = max(gpu_compute, gpu_memtime)
    gpu_transfer = 2 * data_bytes / (pcie_bw_gbs * 1e9)
    gpu_total = gpu_kernel + gpu_transfer
    return gpu_total < cpu_time, cpu_time, gpu_total
import numpy as np
for n in [1000, 10000, 100000, 1000000, 10000000]:
    flops = n
    bytes_moved = 12 * n
    data_bytes = 12 * n
    use, tc, tg = should_use_gpu(flops, bytes_moved, data_bytes)
    print(f"VecAdd n={n:>10,}: GPU={'yes' if use else 'no':3s} "
          f"CPU={tc*1e6:.1f}us GPU={tg*1e6:.1f}us")
ex-sp-ch10-12
Hard: Write a benchmark that measures effective memory bandwidth for GPU memory operations at different data sizes (1 KB to 1 GB). Plot the achieved bandwidth vs data size and explain why small transfers achieve much lower bandwidth than large ones.
Bandwidth = bytes / time.
Use torch.cuda.Event for precise timing.
Include both read and write bandwidth.
Implementation
import torch
import numpy as np
sizes = [2**k for k in range(10, 31)] # 1 KB to 1 GB
bandwidths = []
for nbytes in sizes:
    n = nbytes // 4  # float32
    x = torch.randn(n, device='cuda')
    y = torch.empty_like(x)
    # Warmup
    for _ in range(3):
        y.copy_(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    reps = max(1, 10_000_000 // n)
    start.record()
    for _ in range(reps):
        y.copy_(x)
    end.record()
    torch.cuda.synchronize()
    time_ms = start.elapsed_time(end) / reps
    bw = 2 * nbytes / (time_ms / 1000) / 1e9  # GB/s (read + write)
    bandwidths.append(bw)
    print(f"{nbytes/1024:>10.0f} KB: {bw:.0f} GB/s")
ex-sp-ch10-13
Hard: Build a comprehensive GPU diagnostic script that reports:
- Driver version, CUDA version, GPU name
- Total/free GPU memory
- Number of SMs and max threads per SM
- Measured memory bandwidth (copy benchmark)
- Measured compute throughput (GEMM benchmark)
- PCIe transfer bandwidth (host-device copy)
Format the output as a table.
Use torch.cuda.get_device_properties for hardware info.
Use torch.cuda.mem_get_info for memory.
Implementation
import torch
import time
def gpu_diagnostic():
    props = torch.cuda.get_device_properties(0)
    free, total = torch.cuda.mem_get_info(0)
    print("=" * 50)
    print("GPU DIAGNOSTIC REPORT")
    print("=" * 50)
    print(f"GPU: {props.name}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"SMs: {props.multi_processor_count}")
    print(f"Memory: {total/1e9:.1f} GB ({free/1e9:.1f} GB free)")
    # Memory BW benchmark
    n = 100_000_000
    x = torch.randn(n, device='cuda')
    y = torch.empty_like(x)
    for _ in range(3):
        y.copy_(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        y.copy_(x)
    end.record()
    torch.cuda.synchronize()
    bw = 20 * 2 * n * 4 / (start.elapsed_time(end) / 1000) / 1e9
    print(f"Memory BW: {bw:.0f} GB/s")
    # Compute benchmark
    m = 4096
    a = torch.randn(m, m, device='cuda')
    for _ in range(3):
        _ = a @ a
    torch.cuda.synchronize()
    start.record()
    for _ in range(10):
        _ = a @ a
    end.record()
    torch.cuda.synchronize()
    tflops = 10 * 2 * m**3 / (start.elapsed_time(end) / 1000) / 1e12
    print(f"Compute: {tflops:.1f} TFLOPS (FP32)")
gpu_diagnostic()
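The listing above skips the PCIe host-to-device measurement the exercise asks for; a minimal sketch, assuming a pinned host buffer so the copy is a true DMA transfer:
def pcie_bandwidth_gbs(nbytes=256 * 1024 * 1024, reps=10):
    """Measure host-to-device copy bandwidth from a pinned CPU buffer."""
    host = torch.empty(nbytes // 4, dtype=torch.float32, pin_memory=True)
    dev = torch.empty_like(host, device='cuda')
    dev.copy_(host)                                   # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(reps):
        dev.copy_(host, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    return reps * nbytes / (start.elapsed_time(end) / 1000) / 1e9

print(f"PCIe H2D: {pcie_bandwidth_gbs():.1f} GB/s")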
ex-sp-ch10-14
Challenge: Implement a Python roofline analysis tool that:
- Auto-detects GPU specs (peak FLOPS, memory bandwidth)
- Benchmarks a user-provided function at multiple input sizes
- Measures both FLOP count and bytes moved
- Plots the function on the roofline diagram
- Automatically classifies the bottleneck and suggests optimizations
Test on: element-wise operations, GEMM, batched SVD, and FFT.
Estimate FLOPs from operation type and dimensions.
Measure achieved FLOP/s and compare to the roofline bound.
Design
class RooflineAnalyzer:
    def __init__(self):
        props = torch.cuda.get_device_properties(0)
        self.peak_tflops = self._measure_peak_compute()
        self.peak_bw = self._measure_peak_bandwidth()      # GB/s
        self.ridge_point = self.peak_tflops * 1e12 / (self.peak_bw * 1e9)  # FLOP/byte

    def analyze(self, func, sizes, flop_counter, byte_counter):
        results = []
        for size in sizes:
            args = self._make_args(size)
            flops = flop_counter(size)
            bytes_moved = byte_counter(size)
            time_s = self._benchmark(func, args)
            achieved = flops / time_s / 1e9                # GFLOP/s
            intensity = flops / bytes_moved                # FLOP/byte
            bound = min(self.peak_tflops * 1e3,
                        intensity * self.peak_bw)          # GFLOP/s roofline bound
            efficiency = achieved / bound * 100
            results.append({
                'size': size, 'intensity': intensity,
                'achieved_gflops': achieved,
                'bound_gflops': bound,
                'efficiency': efficiency,
                'bottleneck': 'compute' if intensity > self.ridge_point else 'memory'
            })
        return results
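The design leaves the helper methods (_measure_peak_compute, _measure_peak_bandwidth, _make_args, _benchmark) unspecified; a minimal sketch of a timing helper that _benchmark could delegate to, using CUDA events as in the earlier exercises:
import torch

def benchmark_seconds(func, args, reps=10, warmup=3):
    """Median wall time of func(*args) in seconds, timed with CUDA events."""
    for _ in range(warmup):
        func(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(reps):
        start.record()
        func(*args)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) / 1000)   # ms -> s
    times.sort()
    return times[len(times) // 2]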
ex-sp-ch10-15
Challenge: Create a GPU memory transfer optimizer that:
- Traces all host-device transfers in a PyTorch program
- Identifies redundant transfers (data sent to GPU then immediately back)
- Detects small transfers that should be batched
- Suggests pinned memory for frequently transferred tensors
- Reports total transfer time and potential savings
Use torch.cuda.memory._record_memory_history() or manual hooks.
Override torch.Tensor.cuda() and .cpu() with logging wrappers.
Track tensor IDs to detect ping-pong transfers.
Design
import time
import torch

class TransferTracker:
    def __init__(self):
        self.transfers = []
        self.tensor_locations = {}

    def track_to_cuda(self, tensor, original_fn):
        size = tensor.nelement() * tensor.element_size()
        t0 = time.perf_counter()
        result = original_fn()
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
        tid = id(tensor)
        # A tensor already sent to the GPU being sent again is a redundant transfer.
        kind = 'redundant' if self.tensor_locations.get(tid) == 'cuda' else 'h2d'
        self.tensor_locations[tid] = 'cuda'
        self.transfers.append({'type': kind, 'size': size, 'time': elapsed})
        return result

    def report(self):
        total_bytes = sum(t['size'] for t in self.transfers)
        total_time = sum(t['time'] for t in self.transfers)
        redundant = [t for t in self.transfers if t['type'] == 'redundant']
        small = [t for t in self.transfers if t['size'] < 4096]
        print(f"Total transfers: {len(self.transfers)}")
        print(f"Total data: {total_bytes/1e6:.1f} MB")
        print(f"Total time: {total_time*1000:.1f} ms")
        print(f"Redundant: {len(redundant)}")
        print(f"Small (<4KB): {len(small)} (batch these!)")