Batched Operations

Why Batching Matters for GPU Performance

A GPU has thousands of cores, but each CUDA kernel launch carries overhead (~5-50 microseconds). If you launch one kernel per matrix multiply in a loop, the launch overhead dominates and the GPU sits idle most of the time. Batching combines many independent operations into a single kernel launch, keeping the GPU fully utilized.

This section covers torch.bmm, batched linear algebra, and how to restructure scientific computations for maximum GPU throughput.

Definition:

Batched Matrix Multiplication (BMM)

Given two batched tensors $\mathbf{A} \in \mathbb{R}^{B \times m \times k}$ and $\mathbf{B} \in \mathbb{R}^{B \times k \times n}$, the batched matrix multiplication computes $B$ independent matrix products:

$$\mathbf{C}_i = \mathbf{A}_i \mathbf{B}_i, \quad i = 1, \ldots, B$$

yielding $\mathbf{C} \in \mathbb{R}^{B \times m \times n}$.

import torch

A = torch.randn(64, 128, 256)    # 64 matrices of shape 128x256
B = torch.randn(64, 256, 64)     # 64 matrices of shape 256x64
C = torch.bmm(A, B)              # 64 matrices of shape 128x64
# Equivalent using @ operator:
C = A @ B

Under the hood, torch.bmm calls cuBLAS's batched GEMM routine (cublasSgemmBatched or its strided variant), which processes all $B$ multiplications in a single kernel launch. The speedup over a Python loop is typically 10-100x.

Definition:

Batched Operations via Einstein Summation

torch.einsum provides a concise notation for batched operations:

# Batched matrix multiply
C = torch.einsum('bij,bjk->bik', A, B)

# Batched dot product
dots = torch.einsum('bi,bi->b', X, Y)

# Batched outer product
outer = torch.einsum('bi,bj->bij', u, v)

# Batched trace
traces = torch.einsum('bii->b', M)

Einstein summation is particularly useful when the batch dimension is not the first axis or when operations involve contractions over multiple indices.
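
As a minimal sketch (shapes chosen arbitrarily for illustration), the first pattern below multiplies batches whose batch axis is the second dimension, and the second contracts over two indices at once:

import torch

# Batch axis in the middle: A is (m, B, k), B2 is (B, k, n)
A = torch.randn(128, 64, 256)
B2 = torch.randn(64, 256, 32)
C = torch.einsum('mbk,bkn->bmn', A, B2)     # (64, 128, 32), no permute/contiguous needed

# Contraction over two indices: batched Frobenius inner product
X = torch.randn(64, 16, 16)
Y = torch.randn(64, 16, 16)
fro = torch.einsum('bij,bij->b', X, Y)      # (64,), same as (X * Y).sum(dim=(1, 2))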

Definition:

Batched Singular Value Decomposition

PyTorch's torch.linalg.svd natively supports batched input:

$$\mathbf{A}_i = \mathbf{U}_i \boldsymbol{\Sigma}_i \mathbf{V}_i^H, \quad i = 1, \ldots, B$$

A = torch.randn(100, 64, 32)     # 100 matrices of shape 64x32
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
# U: (100, 64, 32), S: (100, 32), Vh: (100, 32, 32)

All batched torch.linalg functions (svd, solve, eigh, cholesky, inv) follow the same convention: batch dimensions are all dimensions except the last two.
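
A small sketch of that convention with two leading batch dimensions; the shapes are illustrative only (think subcarriers × time slots):

import torch

# (12, 14, 8, 8): every trailing 8x8 matrix is treated as an independent problem
A = torch.randn(12, 14, 8, 8)
A = A @ A.transpose(-2, -1) + 8 * torch.eye(8)   # make each matrix symmetric positive definite
b = torch.randn(12, 14, 8, 1)

x = torch.linalg.solve(A, b)            # (12, 14, 8, 1): one solve per (subcarrier, slot)
L = torch.linalg.cholesky(A)            # (12, 14, 8, 8): lower-triangular factors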

Batched eigh is especially useful for simultaneously diagonalizing covariance matrices across frequency bins or time slots in MIMO-OFDM.
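
A hedged sketch of that use case; the subcarrier, antenna, and snapshot counts below are arbitrary assumptions rather than values from any particular system:

import torch

n_sc, n_rx, n_snap = 64, 4, 200
Y = torch.randn(n_sc, n_rx, n_snap, dtype=torch.complex64)   # per-subcarrier snapshots

# Sample covariance per subcarrier: R_f = Y_f Y_f^H / n_snap  ->  (64, 4, 4), Hermitian
R = Y @ Y.conj().transpose(-2, -1) / n_snap

# One batched call diagonalizes all 64 covariance matrices at once.
eigvals, eigvecs = torch.linalg.eigh(R)   # eigvals: (64, 4) real, eigvecs: (64, 4, 4)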

Theorem: Kernel Launch Overhead and Batching Speedup

Let $T_k$ be the kernel execution time for a single operation and $T_{\text{launch}}$ be the kernel launch overhead. For $B$ independent operations:

  • Sequential (loop): $T_{\text{loop}} = B \cdot (T_{\text{launch}} + T_k)$
  • Batched: $T_{\text{batch}} = T_{\text{launch}} + T_k(B)$

where $T_k(B)$ is the execution time with batch parallelism. For small $T_k$ (small matrices), the speedup approaches:

$$S = \frac{T_{\text{loop}}}{T_{\text{batch}}} \approx \frac{B \cdot T_{\text{launch}}}{T_{\text{launch}}} = B$$

For large $T_k$, the speedup saturates as compute dominates launch overhead.

Batching eliminates $B - 1$ kernel launches. The bigger the ratio of launch overhead to compute, the greater the benefit. Small matrix operations (e.g., $4 \times 4$ covariance matrices) benefit enormously.
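
A toy numerical reading of this model; the launch-overhead and per-matrix compute constants are assumptions chosen only for illustration, and the batched time uses the theorem's idealization $T_k(B) \approx T_k$ (all small problems run concurrently):

t_launch = 10e-6    # assumed kernel launch overhead: ~10 microseconds
t_k = 1e-6          # assumed compute time for one small matrix product

for B in (10, 100, 1000):
    t_loop = B * (t_launch + t_k)
    t_batch = t_launch + t_k            # T_k(B) ~ T_k under full batch parallelism
    print(f"B={B:5d}: loop {t_loop*1e6:8.1f} us, batched {t_batch*1e6:5.1f} us, "
          f"speedup ~{t_loop / t_batch:.0f}x")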

Theorem: Arithmetic Intensity and the Roofline Model

The arithmetic intensity of an operation is:

$$I = \frac{\text{FLOPs}}{\text{Bytes transferred}}$$

For matrix multiplication $\mathbf{C} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{m \times k}$, $\mathbf{B} \in \mathbb{R}^{k \times n}$ in single precision (4 bytes per element):

$$I = \frac{2mkn}{4(mk + kn + mn)} \approx \frac{n}{6} \quad \text{for } m = k = n$$

An operation is compute-bound if $I > I^* = \text{peak FLOPS} / \text{peak bandwidth}$, and memory-bound otherwise. On an A100: $I^* \approx 150$ FLOP/byte.

Large matrix multiplications have high arithmetic intensity and are compute-bound (GPU cores are the bottleneck). Element-wise operations have $I \lesssim 1$ and are memory-bound (bandwidth is the bottleneck).
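
A small helper that evaluates the definition above for square GEMM and classifies the regime; the peak-throughput and bandwidth figures used for the machine balance are rough FP32 assumptions for an A100-class GPU (lower than the tensor-core figure quoted above):

def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 4) -> float:
    """Arithmetic intensity of C = A @ B, counting each matrix moved exactly once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Assumed machine balance: ~19.5 TFLOP/s FP32 over ~1.6 TB/s HBM bandwidth
I_star_fp32 = 19.5e12 / 1.6e12    # ~12 FLOP/byte

for n in (16, 64, 256, 1024):
    I = gemm_intensity(n, n, n)
    regime = "compute-bound" if I > I_star_fp32 else "memory-bound"
    print(f"n={n:5d}: I = {I:6.1f} FLOP/byte -> {regime}")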

Example: Loop vs Batched Matrix Multiply Benchmark

Compare the speed of computing 1000 independent $32 \times 32$ matrix multiplications using a Python loop vs torch.bmm.
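
One possible benchmark sketch; it assumes a CUDA device when available and falls back to CPU otherwise, and the measured speedup depends on the hardware:

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
B, n = 1000, 32
A = torch.randn(B, n, n, device=device)
Bm = torch.randn(B, n, n, device=device)

def sync():
    if device == "cuda":
        torch.cuda.synchronize()

_ = torch.bmm(A, Bm)    # warm-up so allocation and first-launch costs are not timed
sync()

t0 = time.perf_counter()
C_loop = torch.stack([A[i] @ Bm[i] for i in range(B)])   # one kernel launch per product
sync()
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
C_bmm = torch.bmm(A, Bm)                                 # single batched launch
sync()
t_bmm = time.perf_counter() - t0

print(f"loop: {t_loop*1e3:.2f} ms, bmm: {t_bmm*1e3:.2f} ms, speedup ~{t_loop / t_bmm:.0f}x")
print("max abs diff:", (C_loop - C_bmm).abs().max().item())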

Example: Batched Cholesky Solve for Multiple Systems

Solve 500 independent linear systems $\mathbf{R}_i \mathbf{x}_i = \mathbf{b}_i$, where each $\mathbf{R}_i$ is a $16 \times 16$ positive definite covariance matrix (as in MIMO detection across subcarriers).
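
A minimal sketch, assuming the covariance matrices are generated synthetically as $\mathbf{G}\mathbf{G}^T$ plus a small diagonal load; the Cholesky factor is reused by torch.cholesky_solve, and torch.linalg.solve serves as a reference:

import torch

B, n = 500, 16
G = torch.randn(B, n, n)
R = G @ G.transpose(-2, -1) + 1e-3 * torch.eye(n)   # (500, 16, 16) positive definite
b = torch.randn(B, n, 1)

L = torch.linalg.cholesky(R)       # one factorization per system, single batched call
x = torch.cholesky_solve(b, L)     # (500, 16, 1): solves R_i x_i = b_i for all i

x_ref = torch.linalg.solve(R, b)   # direct batched solve as a sanity check
print("max abs diff:", (x - x_ref).abs().max().item())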

Batching Speedup Explorer

Compare execution time of loop-based vs batched operations as a function of batch size and matrix dimension. Observe how small matrices benefit most from batching due to kernel launch overhead amortization.


Batched Operations

BMM, batched SVD, batched solve, and einsum patterns for GPU computing.
Full listing: ch13/python/batched_operations.py

Quick Check

You need to compute $\mathbf{C}_i = \mathbf{A}_i^T \mathbf{B}_i$ for $i = 1, \ldots, B$. Which is the correct einsum expression?

torch.einsum('bij,bjk->bik', A, B)

torch.einsum('bji,bjk->bik', A, B)

torch.einsum('bij,bkj->bik', A, B)

torch.einsum('ibj,bjk->bik', A, B)

Common Mistake: Python Loops Over Batch Dimension on GPU

Mistake:

Writing for i in range(B): result[i] = func(A[i]) on CUDA tensors. Each iteration launches a separate kernel, and the Python loop adds interpreter overhead. With B=1000 and small matrices, this can be 50-100x slower than a batched call.

Correction:

Use batched operations: result = func(A) where A has a batch dimension. Most torch.linalg and torch.nn.functional operations support arbitrary batch dimensions.
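
For example, a before/after sketch of this correction, using torch.linalg.inv as the illustrative per-matrix function:

import torch

A = torch.randn(1000, 8, 8)
A = A @ A.transpose(-2, -1) + torch.eye(8)   # keep every matrix well-conditioned

# Mistake: one kernel launch (plus Python overhead) per iteration
out_loop = torch.empty_like(A)
for i in range(A.shape[0]):
    out_loop[i] = torch.linalg.inv(A[i])

# Correction: a single batched call over the leading batch dimension
out_batched = torch.linalg.inv(A)

print("max abs diff:", (out_loop - out_batched).abs().max().item())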

Key Takeaway

Always batch independent operations into a single tensor operation. The speedup from eliminating kernel launch overhead is 10-100x for small matrices. Use torch.bmm, torch.einsum, or batched torch.linalg functions instead of Python loops over the batch dimension.

Historical Note: From BLAS to Batched BLAS


The Basic Linear Algebra Subprograms (BLAS) were first published in 1979 by Lawson, Hanson, Kincaid, and Krogh. BLAS Level 3 (matrix-matrix operations) was added in 1990. The batched extension came much later with cuBLAS in CUDA 4.1 (2012), driven by deep learning workloads that require thousands of small matrix multiplications per layer. Today, batched GEMM is the single most important operation in GPU computing.

Batched Matrix Multiply (BMM)

Computing $B$ independent matrix products $\mathbf{C}_i = \mathbf{A}_i \mathbf{B}_i$ in a single GPU kernel launch via torch.bmm or the @ operator on 3D tensors.

Related: GEMM (General Matrix Multiply)

GEMM (General Matrix Multiply)

The fundamental $\mathbf{C} = \alpha \mathbf{A}\mathbf{B} + \beta \mathbf{C}$ operation. The batched variant (batched GEMM) is the workhorse of GPU computing.

Related: Batched Matrix Multiply (BMM)
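
PyTorch exposes the batched form of the GEMM update above directly as torch.baddbmm; a minimal sketch with illustrative shapes and scalars:

import torch

B, m, k, n = 32, 64, 128, 16
Cin = torch.randn(B, m, n)
A = torch.randn(B, m, k)
Bmat = torch.randn(B, k, n)
alpha, beta = 0.5, 2.0

# Batched GEMM: C = beta * Cin + alpha * (A @ Bmat), issued as one batched call
C = torch.baddbmm(Cin, A, Bmat, beta=beta, alpha=alpha)

# Same result assembled from separate ops (more launches, extra temporaries)
C_ref = beta * Cin + alpha * torch.bmm(A, Bmat)
print("max abs diff:", (C - C_ref).abs().max().item())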

Batched Operations in PyTorch

| Operation | Function | Input Shapes | Output Shape |
| --- | --- | --- | --- |
| Matrix multiply | torch.bmm(A, B) | (B, m, k), (B, k, n) | (B, m, n) |
| Solve Ax = b | torch.linalg.solve(A, b) | (B, n, n), (B, n, p) | (B, n, p) |
| SVD | torch.linalg.svd(A) | (B, m, n) | U (B, m, k), S (B, k), Vh (B, k, n) |
| Cholesky | torch.linalg.cholesky(A) | (B, n, n) | (B, n, n) |
| Eigenvalues | torch.linalg.eigh(A) | (B, n, n) | vals (B, n), vecs (B, n, n) |
| Inverse | torch.linalg.inv(A) | (B, n, n) | (B, n, n) |
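
A short sketch that checks the shape conventions in the table; the sizes are arbitrary, the SPD construction makes solve, cholesky, and eigh applicable, and the SVD shapes correspond to full_matrices=False:

import torch

Bsz, m, n, p = 8, 6, 4, 3
M = torch.randn(Bsz, m, n)
A = torch.randn(Bsz, n, n)
A = A @ A.transpose(-2, -1) + torch.eye(n)   # symmetric positive definite batch
rhs = torch.randn(Bsz, n, p)

assert torch.bmm(M.transpose(-2, -1), M).shape == (Bsz, n, n)
assert torch.linalg.solve(A, rhs).shape == (Bsz, n, p)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)          # k = min(m, n) = 4
assert U.shape == (Bsz, m, n) and S.shape == (Bsz, n) and Vh.shape == (Bsz, n, n)
assert torch.linalg.cholesky(A).shape == (Bsz, n, n)
vals, vecs = torch.linalg.eigh(A)
assert vals.shape == (Bsz, n) and vecs.shape == (Bsz, n, n)
assert torch.linalg.inv(A).shape == (Bsz, n, n)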