Exercises

ex-sp-ch11-01

Easy

Create a $1000 \times 1000$ random matrix on the GPU using CuPy. Compute its transpose, its Frobenius norm, and its trace. Verify the results match NumPy on the CPU.

ex-sp-ch11-02

Easy

Measure the time to transfer a $10^7$-element float64 array from CPU to GPU (cp.asarray) and back (cp.asnumpy). Calculate the effective bandwidth in GB/s for each direction.
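The bandwidth arithmetic can be sketched on the CPU first. The helper names `bandwidth_gb_per_s` and `best_time` below are hypothetical, and the timed workload is a plain host-side copy used only as a stand-in for the actual transfers.

```python
import time
import numpy as np

def bandwidth_gb_per_s(n_bytes, seconds):
    """Effective bandwidth in GB/s, taking 1 GB = 1e9 bytes."""
    return n_bytes / seconds / 1e9

def best_time(fn, repeats=5):
    """Smallest wall-clock time of fn() over several repeats."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

n = 10**7
host = np.zeros(n, dtype=np.float64)
n_bytes = host.nbytes  # 10**7 elements * 8 bytes per float64

# Stand-in workload: a host-to-host copy. In the exercise, fn would wrap
# cp.asarray(host) or cp.asnumpy(dev), followed by a device synchronization
# so the clock stops only after the transfer has actually finished.
t = best_time(lambda: host.copy())
print(f"{bandwidth_gb_per_s(n_bytes, t):.1f} GB/s (host copy, for illustration)")
```

Taking the minimum over several repeats is one common way to reduce the effect of warm-up and scheduling noise; reporting the median is an equally reasonable choice.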

ex-sp-ch11-03

Easy

Write a GPU-portable function that computes the element-wise ReLU $\max(0, x)$ and works with both NumPy and CuPy arrays.
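One possible shape for such a function is sketched below. It treats CuPy as optional so the snippet also runs on a CPU-only machine; the helper name `get_xp` is hypothetical, while `cp.get_array_module` is the CuPy call that returns the module owning an array.

```python
import numpy as np

try:  # CuPy is optional so this sketch also runs without a GPU
    import cupy as cp
except ImportError:
    cp = None

def get_xp(x):
    """Return the array module (numpy or cupy) that owns x."""
    if cp is not None:
        return cp.get_array_module(x)
    return np

def relu(x):
    """Element-wise max(0, x) for either NumPy or CuPy input."""
    xp = get_xp(x)
    return xp.maximum(x, 0)

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = relu(x)  # stays a NumPy array because the input was a NumPy array
```

The result lives wherever the input lived, so no host-device transfer is triggered implicitly.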

ex-sp-ch11-04

Easy

Compute the eigenvalues of a $500 \times 500$ Hermitian correlation matrix $R_{ij} = 0.9^{|i-j|}$ using cp.linalg.eigh. Verify all eigenvalues are positive.
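The matrix itself can be built without loops by broadcasting an index vector against itself. The sketch below is a NumPy reference under that assumption; swapping `np` for `cp` is left as the GPU part of the exercise.

```python
import numpy as np

n = 500
idx = np.arange(n)

# |i - j| for every pair (i, j) via broadcasting, then R_ij = 0.9 ** |i - j|
R = 0.9 ** np.abs(idx[:, None] - idx[None, :])

# eigh assumes a Hermitian (here real symmetric) input and returns
# the eigenvalues in ascending order
w, _ = np.linalg.eigh(R)
print("smallest eigenvalue:", w[0])
```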

ex-sp-ch11-05

Easy

Compute the 1-D FFT of a $2^{16}$-point complex signal on the GPU. Verify the result matches np.fft.fft on the CPU (use cp.allclose after transferring).

ex-sp-ch11-06

Easy

Create a $5000 \times 5000$ sparse CSR matrix with density 0.001 on the GPU. Compute the sparse matvec $\mathbf{y} = \mathbf{A}\mathbf{x}$ and time it.

ex-sp-ch11-07

Medium

Benchmark cp.linalg.solve against np.linalg.solve for matrix sizes $n = 128, 256, 512, 1024, 2048$. Plot the speedup vs size. At what size does the GPU become faster?

ex-sp-ch11-08

Medium

Compute the batched SVD of 512 random $16 \times 8$ matrices on the GPU. Compare the wall-clock time against a sequential CPU loop.

ex-sp-ch11-09

Medium

Write an ElementwiseKernel that computes the Huber loss:

$$L(x) = \begin{cases} \frac{1}{2}x^2 & |x| \le \delta \\ \delta\left(|x| - \frac{1}{2}\delta\right) & |x| > \delta \end{cases}$$

Benchmark against the equivalent CuPy expression.
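For the array-expression baseline, one way to write the Huber loss is with a single `where` over the two branches. The function name `huber` and the `xp` parameter below are hypothetical; passing `xp=cp` is assumed to work because the calls used (`abs`, `where`) exist in both NumPy and CuPy.

```python
import numpy as np

def huber(x, delta=1.0, xp=np):
    """Element-wise Huber loss written as a plain array expression."""
    a = xp.abs(x)
    quadratic = 0.5 * x * x               # branch for |x| <= delta
    linear = delta * (a - 0.5 * delta)    # branch for |x| >  delta
    return xp.where(a <= delta, quadratic, linear)

x = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])
loss = huber(x, delta=1.0)
```

This version evaluates both branches for every element and allocates several temporaries, which is the overhead a fused kernel is meant to avoid.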

ex-sp-ch11-10

Medium

Implement the Kronecker matvec $(\mathbf{A} \otimes \mathbf{B})\mathbf{x}$ on GPU using the vec trick. Verify correctness against cp.kron(A, B) @ x for small $n = 20$, then benchmark for $n = 200$ where forming the full product is impractical.
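A CPU reference for the vec trick is sketched below, assuming the row-major (C-order) layout that `reshape` and `np.kron` use by default. Under that convention $(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{X}) = \mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}^T)$; with column-major vectorization the roles of the two factors swap. The function name `kron_matvec` is hypothetical.

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without forming the Kronecker product.

    Assumes row-major vectorization: x is X.reshape(-1) with
    X of shape (A.shape[1], B.shape[1]).
    """
    X = x.reshape(A.shape[1], B.shape[1])
    return (A @ X @ B.T).reshape(-1)

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n * n)

y_fast = kron_matvec(A, B, x)   # two n x n matmuls
y_ref = np.kron(A, B) @ x       # explicit n^2 x n^2 matrix, small n only
print(np.allclose(y_fast, y_ref))
```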

ex-sp-ch11-11

Medium

Compute a range-Doppler map from a simulated radar data cube (128 pulses, 1024 range bins) with 3 injected targets at known range-Doppler positions. Apply Hamming windows in both dimensions.

ex-sp-ch11-12

Medium

Compare cp.fft.fft throughput in complex64 vs complex128 for $N = 2^{14}, 2^{16}, 2^{18}, 2^{20}$. Plot the throughput ratio and explain why it is approximately 2x.

ex-sp-ch11-13

Medium

Write a ReductionKernel that computes the log-sum-exp:

$$\mathrm{LSE}(\mathbf{x}) = \log\!\left(\sum_{i} e^{x_i}\right)$$

Use the numerically stable trick of subtracting $\max(\mathbf{x})$ first.
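The stable form relies on the identity $\mathrm{LSE}(\mathbf{x}) = m + \log\sum_i e^{x_i - m}$ with $m = \max(\mathbf{x})$, so every exponent is at most zero. A minimal array-expression reference to check a kernel against might look like this (the name `logsumexp_ref` and the `xp` parameter are hypothetical):

```python
import numpy as np

def logsumexp_ref(x, xp=np):
    """Stable log-sum-exp: shift by the maximum before exponentiating."""
    m = xp.max(x)
    return m + xp.log(xp.sum(xp.exp(x - m)))

# Inputs this large would overflow exp() without the shift
x = np.array([1000.0, 1000.0])
value = logsumexp_ref(x)
print(value)
```

For this input the exact answer is $1000 + \log 2$, which makes it a convenient test case.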

ex-sp-ch11-14

Hard

Implement a GPU-accelerated power iteration to find the dominant eigenvalue of a large $10000 \times 10000$ matrix. Compare convergence speed and wall-clock time against cp.linalg.eigh (which computes all eigenvalues).
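A small CPU reference of power iteration is sketched below as a starting point. It assumes a symmetric matrix (as the comparison with `eigh` implies), uses the Rayleigh quotient as the eigenvalue estimate, and stops on a relative-change tolerance; the stopping rule, the parameter names, and the test matrix with a planted dominant eigenvalue of 10 are all illustrative choices.

```python
import numpy as np

def power_iteration(A, max_iter=500, tol=1e-10, seed=0):
    """Estimate the dominant eigenvalue of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for it in range(1, max_iter + 1):
        w = A @ v
        lam_new = float(v @ w)            # Rayleigh quotient, since ||v|| = 1
        v = w / np.linalg.norm(w)
        converged = abs(lam_new - lam) <= tol * abs(lam_new)
        lam = lam_new
        if converged:
            break
    return lam, v, it

# Test matrix with known spectrum: dominant eigenvalue 10, the rest in [0, 5]
rng = np.random.default_rng(1)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate(([10.0], np.linspace(0.0, 5.0, n - 1)))
A = (Q * eigs) @ Q.T                      # Q diag(eigs) Q^T

lam, v, iters = power_iteration(A)
print(lam, iters)
```

The convergence rate depends on the ratio of the two largest eigenvalue magnitudes, so a random matrix with a small spectral gap may need far more iterations than this example.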

ex-sp-ch11-15

Hard

Build a three-factor Kronecker matvec $(\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})\mathbf{x}$ on GPU using sequential mode-$k$ products. Benchmark for $n = 30, 50, 80$ and compare against the theoretical complexity.
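The sequential mode products can be prototyped on the CPU by reshaping the vector into a 3-way tensor and contracting one axis at a time. The sketch below assumes row-major vectorization (matching `np.kron` and the default `reshape`) and uses `einsum` for readability; the function name `kron3_matvec` is hypothetical, and deliberately unequal factor sizes are used so that index mix-ups show up in the check.

```python
import numpy as np

def kron3_matvec(A, B, C, x):
    """Compute (A kron B kron C) @ x via three mode products."""
    X = x.reshape(A.shape[1], B.shape[1], C.shape[1])
    X = np.einsum('ia,abc->ibc', A, X)   # mode-1 product with A
    X = np.einsum('jb,ibc->ijc', B, X)   # mode-2 product with B
    X = np.einsum('kc,ijc->ijk', C, X)   # mode-3 product with C
    return X.reshape(-1)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((5, 5))
x = rng.standard_normal(3 * 4 * 5)

y_fast = kron3_matvec(A, B, C, x)
y_ref = np.kron(np.kron(A, B), C) @ x    # explicit product, tiny sizes only
print(np.allclose(y_fast, y_ref))
```

For three $n \times n$ factors, each mode product touches $n^3$ entries with $n$ multiply-adds apiece, which is the count to compare the measured timings against.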

ex-sp-ch11-16

Hard

Write a RawKernel that implements a 1-D stencil operation $y_i = -x_{i-1} + 2x_i - x_{i+1}$ using shared memory for the halo elements. Benchmark against cupyx.scipy.sparse CSR matvec with the tridiagonal Laplacian.
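Before writing the kernel, it helps to have a CPU reference to validate against. The sketch below assumes zero values outside the array (out-of-range neighbours contribute nothing), which is the boundary treatment that matches a plain tridiagonal matrix with 2 on the diagonal and -1 on the off-diagonals; the exercise does not state the boundary condition, so this is an assumption.

```python
import numpy as np

def stencil_ref(x):
    """y_i = -x_{i-1} + 2 x_i - x_{i+1}, with zeros beyond both ends."""
    y = 2.0 * x           # new array, x itself is not modified
    y[1:] -= x[:-1]       # left neighbour, absent for i = 0
    y[:-1] -= x[1:]       # right neighbour, absent for the last element
    return y

rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)

# Dense tridiagonal Laplacian, used only to check the stencil
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
print(np.allclose(stencil_ref(x), T @ x))
```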

ex-sp-ch11-17

Hard

Implement GPU-accelerated conjugate gradient (CG) to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ for a sparse SPD matrix. All operations (matvec, dot products, axpy) must stay on the GPU. Compare wall-clock time against cp.linalg.solve on the dense equivalent.
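A textbook CG loop is sketched below in NumPy as a reference to port. It takes the matrix as a `matvec` callable so a sparse product can be dropped in later; the zero initial guess, the relative-residual stopping rule, and the small diagonally dominant test matrix are illustrative assumptions rather than requirements of the exercise.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for SPD A, given only the product v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - A x for the zero initial guess
    p = r.copy()                  # first search direction
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)     # step length along p
        x += alpha * p            # axpy update of the solution
        r -= alpha * Ap           # axpy update of the residual
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            break
        p = r + (rs_new / rs) * p # next A-conjugate direction
        rs = rs_new
    return x, it

# Small SPD test problem: tridiagonal with 3 on the diagonal, -1 off-diagonal
n = 100
A = 3.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.random.default_rng(0).standard_normal(n)

x, iters = conjugate_gradient(lambda v: A @ v, b)
print(iters, np.linalg.norm(A @ x - b))
```

Every operation in the loop is a matvec, a dot product, or an axpy, so a port only needs the arrays to be device arrays; pulling scalars such as `rs_new` back to the host for the convergence test is one place where a hidden synchronization can appear.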

ex-sp-ch11-18

Hard

Profile the memory usage of a CuPy workflow that creates and destroys many temporary arrays. Use cp.get_default_memory_pool() to monitor pool growth. Then implement an in-place version that reuses buffers and compare peak memory.

ex-sp-ch11-19

Challenge

Implement a complete GPU-accelerated OFDM demodulator:

  1. Remove cyclic prefix
  2. Apply FFT per symbol
  3. Channel estimation (least squares per subcarrier)
  4. Equalization (ZF or MMSE)

Process a batch of 14 OFDM symbols with 1024 subcarriers, 4 Tx and 4 Rx antennas entirely on the GPU.

ex-sp-ch11-20

Challenge

Compare three approaches for applying a Kronecker-structured covariance matrix $\mathbf{R}_t \otimes \mathbf{R}_r$ to 1000 random channel vectors:

  1. Form full Kronecker and batch-multiply (if possible)
  2. Vec trick with sequential GPU matmuls
  3. Vec trick with a custom ElementwiseKernel that fuses the reshape

Plot speedup and memory usage for $n_t = n_r = 32, 64, 128$.