Exercises
ex-sp-ch11-01
Easy: Create a random $1000 \times 1000$ matrix on the GPU using CuPy. Compute its transpose, its Frobenius norm, and its trace. Verify the results match NumPy on the CPU.
Use cp.random.randn(1000, 1000) and cp.linalg.norm(A, "fro").
Transfer to CPU with .get() for comparison.
Implementation
import cupy as cp
import numpy as np

A_cp = cp.random.randn(1000, 1000)
A_np = A_cp.get()

# GPU results
T_gpu = A_cp.T
norm_gpu = float(cp.linalg.norm(A_cp, 'fro'))
trace_gpu = float(cp.trace(A_cp))

# CPU reference
norm_cpu = np.linalg.norm(A_np, 'fro')
trace_cpu = np.trace(A_np)

print(f"Transpose match: {np.allclose(T_gpu.get(), A_np.T)}")
print(f"Norm match: {np.isclose(norm_gpu, norm_cpu)}")
print(f"Trace match: {np.isclose(trace_gpu, trace_cpu)}")
ex-sp-ch11-02
Easy: Measure the time to transfer a $10^7$-element float64 array from CPU to GPU (cp.asarray) and back (cp.asnumpy). Calculate the effective bandwidth in GB/s for each direction.
Bandwidth = bytes / time. float64 is 8 bytes per element.
Remember to synchronize before timing.
Implementation
import cupy as cp
import numpy as np
import time

n = 10**7
a = np.random.randn(n)
nbytes = n * 8  # float64 = 8 bytes per element

# Host-to-device transfer
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
a_gpu = cp.asarray(a)
cp.cuda.Stream.null.synchronize()
t_h2d = time.perf_counter() - t0

# Device-to-host transfer (.get() blocks until the copy completes)
t0 = time.perf_counter()
a_cpu = a_gpu.get()
t_d2h = time.perf_counter() - t0

print(f"H2D: {nbytes/t_h2d/1e9:.1f} GB/s")
print(f"D2H: {nbytes/t_d2h/1e9:.1f} GB/s")
ex-sp-ch11-03
Easy: Write a GPU-portable function that computes the element-wise ReLU and works with both NumPy and CuPy arrays.
Use cp.get_array_module(x) to detect the backend.
Implementation
import cupy as cp
import numpy as np
def relu(x):
    xp = cp.get_array_module(x)  # numpy or cupy, depending on x
    return xp.maximum(x, 0)
# CPU
y_cpu = relu(np.array([-1.0, 0.5, 2.0]))
# GPU
y_gpu = relu(cp.array([-1.0, 0.5, 2.0]))
print(np.allclose(y_cpu, y_gpu.get())) # True
ex-sp-ch11-04
Easy: Compute the eigenvalues of a $500 \times 500$ Hermitian correlation matrix using cp.linalg.eigh. Verify that all eigenvalues are positive.
Build the matrix with cp.array and list comprehension.
Implementation
import cupy as cp
n = 500
R = cp.array([[0.9 ** abs(i - j) for j in range(n)]
              for i in range(n)], dtype=cp.float64)
eigvals, _ = cp.linalg.eigh(R)
print(f"Min eigenvalue: {float(eigvals[0]):.6e}")
print(f"All positive: {bool(cp.all(eigvals > 0))}")
ex-sp-ch11-05
Easy: Compute the 1-D FFT of a $2^{16}$-point complex signal on the GPU. Verify the result matches np.fft.fft on the CPU (use np.allclose after transferring).
Use cp.fft.fft and compare with np.fft.fft.
Implementation
import cupy as cp
import numpy as np
n = 2**16
x_np = np.random.randn(n) + 1j * np.random.randn(n)
x_cp = cp.asarray(x_np)
X_np = np.fft.fft(x_np)
X_cp = cp.fft.fft(x_cp)
print(f"Match: {np.allclose(X_np, X_cp.get(), rtol=1e-10)}")
ex-sp-ch11-06
Easy: Create a sparse $5000 \times 5000$ CSR matrix with density 0.001 on the GPU. Compute the sparse matvec and time it.
Build with SciPy, transfer to GPU with cupyx.scipy.sparse.csr_matrix.
Implementation
import cupy as cp
import cupyx.scipy.sparse as csp
import scipy.sparse as sp
import time

A_cpu = sp.random(5000, 5000, density=0.001, format='csr')
A_gpu = csp.csr_matrix(A_cpu)
x_gpu = cp.random.randn(5000)

_ = A_gpu @ x_gpu  # warm-up: the first call includes kernel compilation
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
y = A_gpu @ x_gpu
cp.cuda.Stream.null.synchronize()
print(f"Sparse matvec: {(time.perf_counter()-t0)*1e6:.0f} us")
ex-sp-ch11-07
Medium: Benchmark cp.linalg.solve against np.linalg.solve over a range of matrix sizes. Plot the speedup versus size. At what size does the GPU become faster?
Synchronize the GPU and exclude transfer time from GPU timing.
Warm up the GPU with a dummy solve first.
Approach
Generate random matrices on each device, solve on each, and compute speedup = t_cpu / t_gpu. The exact crossover depends on the hardware, typically somewhere between a few hundred and a few thousand.
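A minimal benchmarking sketch; the size list is illustrative, and transfers happen outside the timed region:
import cupy as cp
import numpy as np
import time

sizes = [256, 512, 1024, 2048, 4096]  # illustrative sizes
_ = cp.linalg.solve(cp.random.randn(64, 64), cp.random.randn(64))  # GPU warm-up
for n in sizes:
    A_np, b_np = np.random.randn(n, n), np.random.randn(n)
    t0 = time.perf_counter()
    np.linalg.solve(A_np, b_np)
    t_cpu = time.perf_counter() - t0

    A_cp, b_cp = cp.asarray(A_np), cp.asarray(b_np)  # transfer excluded from timing
    cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    cp.linalg.solve(A_cp, b_cp)
    cp.cuda.Stream.null.synchronize()
    t_gpu = time.perf_counter() - t0
    print(f"n={n:5d}  speedup = {t_cpu / t_gpu:.2f}x")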
ex-sp-ch11-08
Medium: Compute the batched SVD of 512 random $16 \times 8$ complex matrices on the GPU. Compare the wall-clock time against a sequential CPU loop.
Stack matrices into a 3-D array: H.shape = (512, 16, 8).
CuPy broadcasts SVD over the batch dimension.
Approach
H = cp.random.randn(512, 16, 8) + 1j*cp.random.randn(512, 16, 8)
U, s, Vh = cp.linalg.svd(H, full_matrices=False)
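A sketch of the wall-clock comparison around that batched call, assuming np.linalg.svd in the sequential CPU loop:
import cupy as cp
import numpy as np
import time

H_np = np.random.randn(512, 16, 8) + 1j * np.random.randn(512, 16, 8)
H_cp = cp.asarray(H_np)

# Sequential CPU loop
t0 = time.perf_counter()
for k in range(512):
    np.linalg.svd(H_np[k], full_matrices=False)
t_cpu = time.perf_counter() - t0

# Batched GPU SVD (broadcast over the leading dimension)
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
U, s, Vh = cp.linalg.svd(H_cp, full_matrices=False)
cp.cuda.Stream.null.synchronize()
t_gpu = time.perf_counter() - t0
print(f"CPU loop: {t_cpu*1e3:.1f} ms, GPU batched: {t_gpu*1e3:.1f} ms")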
ex-sp-ch11-09
Medium: Write an ElementwiseKernel that computes the Huber loss $L_\delta(x) = \tfrac{1}{2}x^2$ for $|x| \le \delta$ and $L_\delta(x) = \delta(|x| - \tfrac{1}{2}\delta)$ otherwise. Benchmark against the equivalent CuPy expression.
Use the ternary operator in CUDA C: y = (cond) ? expr1 : expr2.
Implementation
import cupy as cp

huber = cp.ElementwiseKernel(
    'float64 x, float64 delta',
    'float64 loss',
    '''
    double ax = fabs(x);  // fabs for doubles in CUDA C
    loss = (ax <= delta) ? 0.5 * x * x
                         : delta * (ax - 0.5 * delta);
    ''',
    'huber_loss'
)
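A quick comparison against the unfused expression could look like the following; it reuses the huber kernel defined above, and unfused is a hypothetical baseline built from cp.abs and cp.where:
import time

x = cp.random.randn(10**7)
delta = 1.0

def unfused(x, delta):
    # Plain CuPy expression: allocates several temporary arrays
    ax = cp.abs(x)
    return cp.where(ax <= delta, 0.5 * x * x, delta * (ax - 0.5 * delta))

for name, fn in [("cupy expr", unfused), ("kernel", huber)]:
    fn(x, delta); cp.cuda.Stream.null.synchronize()  # warm-up
    t0 = time.perf_counter()
    fn(x, delta); cp.cuda.Stream.null.synchronize()
    print(f"{name}: {(time.perf_counter() - t0) * 1e3:.2f} ms")

print(bool(cp.allclose(unfused(x, delta), huber(x, delta))))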
ex-sp-ch11-10
Medium: Implement the Kronecker matvec $y = (A \otimes B)x$ on the GPU using the vec trick $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\mathsf{T}})$. Verify correctness against cp.kron(A, B) @ x for small $n$, then benchmark at sizes where forming the full product is impractical.
Use x.reshape(n, n, order="F") and two matrix multiplies.
Approach
For small $n$: form the full Kronecker product and compare. For large $n$: only the vec trick fits in memory.
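A minimal sketch of the vec trick with the Fortran-order reshape from the hint ($n = 64$ is an illustrative size):
import cupy as cp

n = 64  # small enough to form the full Kronecker product
A = cp.random.randn(n, n)
B = cp.random.randn(n, n)
x = cp.random.randn(n * n)

# Vec trick: (A kron B) x = vec(B X A^T) with column-major (Fortran) vec
X = x.reshape(n, n, order='F')
y_trick = (B @ X @ A.T).ravel(order='F')

# Reference: explicit Kronecker product
y_full = cp.kron(A, B) @ x
print(bool(cp.allclose(y_trick, y_full)))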
ex-sp-ch11-11
Medium: Compute a range-Doppler map from a simulated radar data cube (128 pulses, 1024 range bins) with 3 injected targets at known range-Doppler positions. Apply Hamming windows in both dimensions.
Inject targets as sinusoids in slow-time and delayed pulses in fast-time.
Approach
Create the data cube, add targets as complex exponentials, apply 2-D windowed FFT, and display the power spectrum in dB.
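A compact sketch of the simulation; the target positions and amplitudes are illustrative choices:
import cupy as cp

n_pulses, n_range = 128, 1024
p = cp.arange(n_pulses)[:, None]   # slow time (pulse index)
r = cp.arange(n_range)[None, :]    # fast time (range sample)

# Noise floor plus three targets as 2-D complex exponentials whose
# frequencies map to (Doppler bin, range bin) after the 2-D FFT
cube = 0.1 * (cp.random.randn(n_pulses, n_range)
              + 1j * cp.random.randn(n_pulses, n_range))
for fd, fr, amp in [(20, 100, 1.0), (50, 400, 0.5), (-40, 800, 0.8)]:
    cube += amp * cp.exp(2j * cp.pi * (fd * p / n_pulses + fr * r / n_range))

# Hamming windows in both dimensions, then the 2-D FFT
win = cp.hamming(n_pulses)[:, None] * cp.hamming(n_range)[None, :]
rd = cp.fft.fftshift(cp.fft.fft2(cube * win), axes=0)  # center the Doppler axis
power_db = 20 * cp.log10(cp.abs(rd) + 1e-12)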
ex-sp-ch11-12
Medium: Compare cp.fft.fft throughput in complex64 versus complex128 over a range of transform sizes. Plot the throughput ratio and explain why it is approximately 2x.
Throughput = N / time. The FFT is memory-bandwidth bound.
Approach
Time both precisions, compute the ratio, and relate to the roofline model (memory-bound operation, half the bytes = ~2x speed).
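A timing sketch with an illustrative range of sizes; each FFT is warmed up first so plan creation is not timed:
import cupy as cp
import time

for exp in range(16, 25):
    n = 2 ** exp
    times = {}
    for dtype in (cp.complex64, cp.complex128):
        x = (cp.random.randn(n) + 1j * cp.random.randn(n)).astype(dtype)
        cp.fft.fft(x); cp.cuda.Stream.null.synchronize()  # warm-up, caches the plan
        t0 = time.perf_counter()
        cp.fft.fft(x)
        cp.cuda.Stream.null.synchronize()
        times[dtype.__name__] = time.perf_counter() - t0
    ratio = times['complex128'] / times['complex64']
    print(f"N=2^{exp}: complex128 / complex64 time = {ratio:.2f}")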
ex-sp-ch11-13
Medium: Write a ReductionKernel that computes the log-sum-exp $\mathrm{LSE}(x) = \log \sum_i e^{x_i}$. Use the numerically stable trick of subtracting the maximum first: $\mathrm{LSE}(x) = m + \log \sum_i e^{x_i - m}$ with $m = \max_i x_i$.
First compute m = max(x) with cp.max, then reduce exp(x_i - m), then return m + log(result).
Implementation
Two-pass approach: (1) find max with cp.max, (2) use
ReductionKernel to sum exp(x - m), (3) take log and add m.
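A sketch of the two-pass approach; the kernel follows CuPy's documented ReductionKernel argument order (map expression, reduce expression, post-map, identity):
import cupy as cp

# Pass 2: sum of exp(x - m), written as a custom reduction
sumexp = cp.ReductionKernel(
    'float64 x, float64 m',   # input array and the broadcast max
    'float64 y',              # output
    'exp(x - m)',             # map: applied per element
    'a + b',                  # reduce: pairwise sum
    'y = a',                  # post-reduction map
    '0',                      # identity element
    'sumexp_kernel'
)

x = 10 * cp.random.randn(10**6)
m = float(cp.max(x))                          # pass 1: the max
lse = m + float(cp.log(sumexp(x, m)))
# Cross-check against a plain CuPy evaluation of the same stable formula
print(lse, m + float(cp.log(cp.sum(cp.exp(x - m)))))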
ex-sp-ch11-14
Hard: Implement a GPU-accelerated power iteration to find the dominant eigenvalue of a large symmetric matrix. Compare convergence speed and wall-clock time against cp.linalg.eigh (which computes all eigenvalues).
Power iteration: $v_{k+1} = A v_k / \|A v_k\|_2$.
Converge when the Rayleigh quotient $\lambda_k = v_k^{\mathsf{T}} A v_k$ stabilizes.
Approach
Implement power iteration entirely on GPU. It only needs repeated matvec and norm, making it much faster than full eigendecomposition when only the top eigenvalue is needed.
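A sketch of the GPU-resident loop; the matrix size and the SPD construction are illustrative:
import cupy as cp
import time

def dominant_eig(A, tol=1e-10, max_iter=1000):
    v = cp.random.randn(A.shape[0])
    v /= cp.linalg.norm(v)
    lam_old = 0.0
    for _ in range(max_iter):
        w = A @ v                       # one matvec per iteration
        lam = float(v @ w)              # Rayleigh quotient v^T A v
        v = w / cp.linalg.norm(w)
        if abs(lam - lam_old) <= tol * abs(lam):
            break
        lam_old = lam
    return lam

n = 4000                                # illustrative size
M = cp.random.randn(n, n)
A = M @ M.T / n                         # SPD, so dominant = largest eigenvalue

t0 = time.perf_counter()
lam_pi = dominant_eig(A)
t_pi = time.perf_counter() - t0
t0 = time.perf_counter()
lam_full = float(cp.linalg.eigh(A)[0][-1])  # eigenvalues sorted ascending
t_full = time.perf_counter() - t0
print(f"power iteration: {lam_pi:.6f} in {t_pi:.3f} s, eigh: {lam_full:.6f} in {t_full:.3f} s")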
ex-sp-ch11-15
Hard: Build a three-factor Kronecker matvec $y = (A \otimes B \otimes C)x$ on the GPU using sequential mode-$k$ products. Benchmark over a range of factor sizes and compare against the theoretical complexity.
Use cp.einsum or explicit reshapes for each mode multiply.
Approach
Reshape x to an $n_1 \times n_2 \times n_3$ tensor, apply each factor along its mode using cp.einsum, measure time, and compare against the theoretical cost: for equal factor sizes $n$, the mode products need $O(n^4)$ flops versus $O(n^6)$ for the explicit Kronecker matvec.
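A sketch with equal factor sizes ($n = 8$ so the explicit Kronecker check is feasible; raise n for the benchmark):
import cupy as cp

n = 8
A, B, C = (cp.random.randn(n, n) for _ in range(3))
x = cp.random.randn(n ** 3)

# C-order reshape: x[p*n*n + q*n + r] = T[p, q, r]
T = x.reshape(n, n, n)
# y[i,j,k] = sum_{p,q,r} A[i,p] B[j,q] C[k,r] T[p,q,r], one mode at a time
T = cp.einsum('ip,pqr->iqr', A, T)        # mode-1 product
T = cp.einsum('jq,iqr->ijr', B, T)        # mode-2 product
T = cp.einsum('kr,ijr->ijk', C, T)        # mode-3 product
y = T.ravel()

y_full = cp.kron(cp.kron(A, B), C) @ x    # explicit reference
print(bool(cp.allclose(y, y_full)))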
ex-sp-ch11-16
Hard: Write a RawKernel that implements the 1-D stencil $y_i = x_{i-1} - 2x_i + x_{i+1}$ using shared memory for the halo elements. Benchmark against a cupyx.scipy.sparse CSR matvec with the tridiagonal Laplacian.
Each block loads its range plus one halo element on each side into shared memory.
Approach
Load blockDim.x + 2 elements into shared memory, apply the
stencil, and write results. This avoids redundant global memory
reads at block boundaries.
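A sketch of the shared-memory kernel, assuming zero (Dirichlet) boundaries to match the tridiagonal Laplacian; timing would follow the same synchronize-and-measure pattern as the earlier exercises:
import cupy as cp
import numpy as np
import cupyx.scipy.sparse as csp

stencil_src = r'''
extern "C" __global__
void laplace1d(const double* x, double* y, int n) {
    extern __shared__ double s[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;             // +1 leaves room for the left halo
    s[t] = (i < n) ? x[i] : 0.0;
    if (threadIdx.x == 0)                // left halo (zero at the boundary)
        s[0] = (i > 0) ? x[i - 1] : 0.0;
    if (threadIdx.x == blockDim.x - 1)   // right halo
        s[t + 1] = (i + 1 < n) ? x[i + 1] : 0.0;
    __syncthreads();
    if (i < n)
        y[i] = s[t - 1] - 2.0 * s[t] + s[t + 1];
}
'''
laplace1d = cp.RawKernel(stencil_src, 'laplace1d')

n = 1 << 20
x = cp.random.randn(n)
y = cp.empty_like(x)
threads = 256
blocks = (n + threads - 1) // threads
laplace1d((blocks,), (threads,), (x, y, np.int32(n)),
          shared_mem=(threads + 2) * 8)   # blockDim.x + 2 doubles

# Reference: tridiagonal Laplacian as a CSR matvec
L = csp.diags([cp.ones(n - 1), -2.0 * cp.ones(n), cp.ones(n - 1)],
              offsets=[-1, 0, 1], format='csr')
print(bool(cp.allclose(y, L @ x)))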
ex-sp-ch11-17
Hard: Implement GPU-accelerated conjugate gradient (CG) to solve $Ax = b$ for a sparse SPD matrix $A$. All operations (matvec, dot products, axpy) must stay on the GPU. Compare wall-clock time against cp.linalg.solve on the dense equivalent.
CG only needs matvec, dot products, and vector updates — all GPU-friendly.
Approach
Implement CG with A_csr @ p for matvec, cp.dot for
inner products, and standard CuPy vector operations. No data
leaves the GPU during iteration.
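A sketch of the GPU-resident CG loop; the shifted tridiagonal Laplacian is an illustrative SPD test matrix, and only scalar reduction results cross back to the host for control flow:
import cupy as cp
import cupyx.scipy.sparse as csp

def cg(A, b, tol=1e-8, max_iter=1000):
    x = cp.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / float(p @ Ap)
        x += alpha * p                 # axpy updates, all on the GPU
        r -= alpha * Ap
        rs_new = float(r @ r)
        if rs_new ** 0.5 < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 100_000
# SPD test matrix: tridiagonal Laplacian plus a small diagonal shift
A = csp.diags([-cp.ones(n - 1), 2.1 * cp.ones(n), -cp.ones(n - 1)],
              offsets=[-1, 0, 1], format='csr')
b = cp.random.randn(n)
x = cg(A, b)
print(float(cp.linalg.norm(A @ x - b)))   # residual check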
ex-sp-ch11-18
Hard: Profile the memory usage of a CuPy workflow that creates and destroys many temporary arrays. Use cp.get_default_memory_pool() to monitor pool growth. Then implement an in-place version that reuses buffers and compare peak memory.
Use mempool.used_bytes() and mempool.total_bytes() at each step.
Approach
Track memory at each step, identify peak, then rewrite using
pre-allocated buffers and in-place operations (out= parameter).
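A sketch of the before/after comparison using the pool counters from the hint; the sin/cos workload is an illustrative stand-in:
import cupy as cp

mempool = cp.get_default_memory_pool()

def report(label):
    print(f"{label}: used={mempool.used_bytes()/1e6:.0f} MB, "
          f"pool={mempool.total_bytes()/1e6:.0f} MB")

n = 10**7
x = cp.random.randn(n)
report("start")

# Temporary-heavy version: each sub-expression allocates a new array
y = cp.sin(x) + cp.cos(x) ** 2
report("with temporaries")

# Buffer-reusing version: pre-allocate once, write in place with out=
buf1 = cp.empty_like(x)
buf2 = cp.empty_like(x)
cp.sin(x, out=buf1)
cp.cos(x, out=buf2)
cp.multiply(buf2, buf2, out=buf2)   # cos(x)**2 in place
cp.add(buf1, buf2, out=buf1)        # final result lands in buf1
report("in-place")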
ex-sp-ch11-19
Challenge: Implement a complete GPU-accelerated OFDM demodulator:
- Remove cyclic prefix
- Apply FFT per symbol
- Channel estimation (least squares per subcarrier)
- Equalization (ZF or MMSE)
Process a batch of 14 OFDM symbols with 1024 subcarriers and 4 Tx / 4 Rx antennas, entirely on the GPU.
Use batched FFT and batched cp.linalg.solve for per-subcarrier equalization.
Approach
Shape data as (n_symbols, n_rx, fft_size), apply batched FFT,
estimate channel from pilots, then solve per-subcarrier MIMO
systems using batched linalg.
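A sketch of the core pipeline with a random stand-in channel; a real implementation would estimate H from pilots, and the CP length of 72 samples is an illustrative choice:
import cupy as cp

n_sym, n_sc, n_tx, n_rx, cp_len = 14, 1024, 4, 4, 72

# Hypothetical received batch: (symbols, rx antennas, CP + useful samples)
rx_time = (cp.random.randn(n_sym, n_rx, cp_len + n_sc)
           + 1j * cp.random.randn(n_sym, n_rx, cp_len + n_sc))

# 1) Remove the cyclic prefix, 2) batched FFT along fast time
rx_freq = cp.fft.fft(rx_time[:, :, cp_len:], axis=-1)   # (n_sym, n_rx, n_sc)

# 3) Stand-in per-subcarrier channel; LS estimation from pilots would
#    produce an array of this same shape
H = (cp.random.randn(n_sc, n_rx, n_tx)
     + 1j * cp.random.randn(n_sc, n_rx, n_tx)) / 2**0.5

# 4) Zero-forcing equalization: for each subcarrier k, solve
#    H[k] @ S[k] = Y[k] with the 14 symbols as right-hand sides
Y = rx_freq.transpose(2, 1, 0)                           # (n_sc, n_rx, n_sym)
S_hat = cp.linalg.solve(H, Y)                            # (n_sc, n_tx, n_sym)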
ex-sp-ch11-20
Challenge: Compare three approaches for applying a Kronecker-structured covariance matrix to 1000 random channel vectors:
- Form full Kronecker and batch-multiply (if possible)
- Vec trick with sequential GPU matmuls
- Vec trick with a custom ElementwiseKernel that fuses the reshape
Plot speedup and memory usage for increasing dimension $n$.
Method 1 will run out of memory for n=128. That is the point.
Approach
Implement all three, benchmark with proper GPU synchronization, and show that the vec trick is the only viable approach for large antenna arrays.
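A sketch of methods 1 and 2; method 3's fused kernel is the exercise's main task, and the SPD factors and $n = 64$ here are illustrative:
import cupy as cp

n, n_vec = 64, 1000
Rt = cp.random.randn(n, n); Rt = Rt @ Rt.T / n   # SPD Kronecker factors
Rr = cp.random.randn(n, n); Rr = Rr @ Rr.T / n
X = cp.random.randn(n * n, n_vec)                # 1000 channel vectors as columns

# Method 1: explicit Kronecker product (O(n^4) memory; fails for large n)
R = cp.kron(Rt, Rr)
Y1 = R @ X

# Method 2: vec trick batched over vectors with einsum
# (Rt kron Rr) vec(V) = vec(Rr V Rt^T) with column-major reshape
V = X.reshape(n, n, n_vec, order='F')            # (rows, cols, batch)
Y2 = cp.einsum('ip,pqk,jq->ijk', Rr, V, Rt).reshape(n * n, n_vec, order='F')
print(bool(cp.allclose(Y1, Y2)))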