Exercises
ex-sp-ch11-01
Easy: Create a random $1000 \times 1000$ matrix on the GPU using CuPy. Compute its transpose, its Frobenius norm, and its trace. Verify the results match NumPy on the CPU.
Use cp.random.randn(1000, 1000) and cp.linalg.norm(A, "fro").
Transfer to CPU with .get() for comparison.
Implementation
import cupy as cp
import numpy as np

A_cp = cp.random.randn(1000, 1000)
A_np = A_cp.get()

# GPU results
T_gpu = A_cp.T
norm_gpu = float(cp.linalg.norm(A_cp, 'fro'))
trace_gpu = float(cp.trace(A_cp))

# CPU reference
norm_cpu = np.linalg.norm(A_np, 'fro')
trace_cpu = np.trace(A_np)

print(f"Transpose match: {np.allclose(T_gpu.get(), A_np.T)}")
print(f"Norm match: {np.isclose(norm_gpu, norm_cpu)}")
print(f"Trace match: {np.isclose(trace_gpu, trace_cpu)}")
ex-sp-ch11-02
Easy: Measure the time to transfer a $10^7$-element float64 array from CPU to GPU (cp.asarray) and back (cp.asnumpy). Calculate the effective bandwidth in GB/s for each direction.
Bandwidth = bytes / time. float64 is 8 bytes per element.
Remember to synchronize before timing.
Implementation
import cupy as cp
import numpy as np
import time

n = 10**7
a = np.random.randn(n)
nbytes = n * 8  # float64 = 8 bytes per element

# Host-to-device transfer
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
a_gpu = cp.asarray(a)
cp.cuda.Stream.null.synchronize()
t_h2d = time.perf_counter() - t0

# Device-to-host transfer (.get() blocks until the copy completes)
t0 = time.perf_counter()
a_cpu = a_gpu.get()
t_d2h = time.perf_counter() - t0

print(f"H2D: {nbytes/t_h2d/1e9:.1f} GB/s")
print(f"D2H: {nbytes/t_d2h/1e9:.1f} GB/s")
ex-sp-ch11-03
Easy: Write a GPU-portable function that computes the element-wise ReLU and works with both NumPy and CuPy arrays.
Use cp.get_array_module(x) to detect the backend.
Implementation
import cupy as cp
import numpy as np
def relu(x):
    xp = cp.get_array_module(x)  # numpy or cupy, depending on x
    return xp.maximum(x, 0)
# CPU
y_cpu = relu(np.array([-1.0, 0.5, 2.0]))
# GPU
y_gpu = relu(cp.array([-1.0, 0.5, 2.0]))
print(np.allclose(y_cpu, y_gpu.get())) # True
ex-sp-ch11-04
Easy: Compute the eigenvalues of a $500 \times 500$ Hermitian correlation matrix using cp.linalg.eigh. Verify that all eigenvalues are positive.
Build the matrix with cp.array and list comprehension.
Implementation
import cupy as cp
n = 500
R = cp.array([[0.9 ** abs(i - j) for j in range(n)]
              for i in range(n)], dtype=cp.float64)
eigvals, _ = cp.linalg.eigh(R)
print(f"Min eigenvalue: {float(eigvals[0]):.6e}")
print(f"All positive: {bool(cp.all(eigvals > 0))}")
ex-sp-ch11-05
Easy: Compute the 1-D FFT of a $2^{16}$-point complex signal on the GPU. Verify the result matches np.fft.fft on the CPU (use np.allclose after transferring).
Use cp.fft.fft and compare with np.fft.fft.
Implementation
import cupy as cp
import numpy as np
n = 2**16
x_np = np.random.randn(n) + 1j * np.random.randn(n)
x_cp = cp.asarray(x_np)
X_np = np.fft.fft(x_np)
X_cp = cp.fft.fft(x_cp)
print(f"Match: {np.allclose(X_np, X_cp.get(), rtol=1e-10)}")
ex-sp-ch11-06
Easy: Create a sparse $5000 \times 5000$ CSR matrix with density 0.001 on the GPU. Compute the sparse matvec and time it.
Build with SciPy, transfer to GPU with cupyx.scipy.sparse.csr_matrix.
Implementation
import cupy as cp
import cupyx.scipy.sparse as csp
import scipy.sparse as sp
import time

A_cpu = sp.random(5000, 5000, density=0.001, format='csr')
A_gpu = csp.csr_matrix(A_cpu)
x_gpu = cp.random.randn(5000)

_ = A_gpu @ x_gpu  # warm-up: the first call includes kernel compilation
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
y = A_gpu @ x_gpu
cp.cuda.Stream.null.synchronize()
print(f"Sparse matvec: {(time.perf_counter()-t0)*1e6:.0f} us")
ex-sp-ch11-07
Medium: Benchmark cp.linalg.solve against np.linalg.solve over a range of matrix sizes. Plot the speedup versus size. At what size does the GPU become faster?
Synchronize the GPU and exclude transfer time from GPU timing.
Warm up the GPU with a dummy solve first.
Approach
Generate random matrices on each device, solve on each, and compute speedup = t_cpu / t_gpu. The exact crossover depends on the hardware, typically somewhere between a few hundred and a few thousand.
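A minimal benchmarking sketch; the size list is illustrative, and transfers happen outside the timed region:
import cupy as cp
import numpy as np
import time

sizes = [256, 512, 1024, 2048, 4096]  # illustrative sizes
_ = cp.linalg.solve(cp.random.randn(64, 64), cp.random.randn(64))  # GPU warm-up
for n in sizes:
    A_np, b_np = np.random.randn(n, n), np.random.randn(n)
    t0 = time.perf_counter()
    np.linalg.solve(A_np, b_np)
    t_cpu = time.perf_counter() - t0

    A_cp, b_cp = cp.asarray(A_np), cp.asarray(b_np)  # transfer excluded from timing
    cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    cp.linalg.solve(A_cp, b_cp)
    cp.cuda.Stream.null.synchronize()
    t_gpu = time.perf_counter() - t0
    print(f"n={n:5d}  speedup = {t_cpu / t_gpu:.2f}x")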
ex-sp-ch11-08
Medium: Compute the batched SVD of 512 random $16 \times 8$ complex matrices on the GPU. Compare the wall-clock time against a sequential CPU loop.
Stack matrices into a 3-D array: H.shape = (512, 16, 8).
CuPy broadcasts SVD over the batch dimension.
Approach
H = cp.random.randn(512, 16, 8) + 1j*cp.random.randn(512, 16, 8)
U, s, Vh = cp.linalg.svd(H, full_matrices=False)
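A sketch of the wall-clock comparison around that batched call, assuming np.linalg.svd in the sequential CPU loop:
import cupy as cp
import numpy as np
import time

H_np = np.random.randn(512, 16, 8) + 1j * np.random.randn(512, 16, 8)
H_cp = cp.asarray(H_np)

# Sequential CPU loop
t0 = time.perf_counter()
for k in range(512):
    np.linalg.svd(H_np[k], full_matrices=False)
t_cpu = time.perf_counter() - t0

# Batched GPU SVD (broadcast over the leading dimension)
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
U, s, Vh = cp.linalg.svd(H_cp, full_matrices=False)
cp.cuda.Stream.null.synchronize()
t_gpu = time.perf_counter() - t0
print(f"CPU loop: {t_cpu*1e3:.1f} ms, GPU batched: {t_gpu*1e3:.1f} ms")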
ex-sp-ch11-09
Medium: Write an ElementwiseKernel that computes the Huber loss $L_\delta(x) = \tfrac{1}{2}x^2$ for $|x| \le \delta$ and $L_\delta(x) = \delta(|x| - \tfrac{1}{2}\delta)$ otherwise. Benchmark against the equivalent CuPy expression.
Use the ternary operator in CUDA C: y = (cond) ? expr1 : expr2.
Implementation
import cupy as cp

huber = cp.ElementwiseKernel(
    'float64 x, float64 delta',
    'float64 loss',
    '''
    double ax = fabs(x);  // fabs for doubles in CUDA C
    loss = (ax <= delta) ? 0.5 * x * x
                         : delta * (ax - 0.5 * delta);
    ''',
    'huber_loss'
)
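A quick comparison against the unfused expression could look like the following; it reuses the huber kernel defined above, and unfused is a hypothetical baseline built from cp.abs and cp.where:
import time

x = cp.random.randn(10**7)
delta = 1.0

def unfused(x, delta):
    # Plain CuPy expression: allocates several temporary arrays
    ax = cp.abs(x)
    return cp.where(ax <= delta, 0.5 * x * x, delta * (ax - 0.5 * delta))

for name, fn in [("cupy expr", unfused), ("kernel", huber)]:
    fn(x, delta); cp.cuda.Stream.null.synchronize()  # warm-up
    t0 = time.perf_counter()
    fn(x, delta); cp.cuda.Stream.null.synchronize()
    print(f"{name}: {(time.perf_counter() - t0) * 1e3:.2f} ms")

print(bool(cp.allclose(unfused(x, delta), huber(x, delta))))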
ex-sp-ch11-10
Medium: Implement the Kronecker matvec $y = (A \otimes B)x$ on the GPU using the vec trick $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\mathsf{T}})$. Verify correctness against cp.kron(A, B) @ x for small $n$, then benchmark at sizes where forming the full product is impractical.
Use x.reshape(n, n, order="F") and two matrix multiplies.
Approach
For small $n$: form the full Kronecker product and compare. For large $n$: only the vec trick fits in memory.
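A minimal sketch of the vec trick with the Fortran-order reshape from the hint ($n = 64$ is an illustrative size):
import cupy as cp

n = 64  # small enough to form the full Kronecker product
A = cp.random.randn(n, n)
B = cp.random.randn(n, n)
x = cp.random.randn(n * n)

# Vec trick: (A kron B) x = vec(B X A^T) with column-major (Fortran) vec
X = x.reshape(n, n, order='F')
y_trick = (B @ X @ A.T).ravel(order='F')

# Reference: explicit Kronecker product
y_full = cp.kron(A, B) @ x
print(bool(cp.allclose(y_trick, y_full)))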
ex-sp-ch11-11
Medium: Compute a range-Doppler map from a simulated radar data cube (128 pulses, 1024 range bins) with 3 injected targets at known range-Doppler positions. Apply Hamming windows in both dimensions.
Inject targets as sinusoids in slow-time and delayed pulses in fast-time.
Approach
Create the data cube, add targets as complex exponentials, apply 2-D windowed FFT, and display the power spectrum in dB.
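A compact sketch of the simulation; the target positions and amplitudes are illustrative choices:
import cupy as cp

n_pulses, n_range = 128, 1024
p = cp.arange(n_pulses)[:, None]   # slow time (pulse index)
r = cp.arange(n_range)[None, :]    # fast time (range sample)

# Noise floor plus three targets as 2-D complex exponentials whose
# frequencies map to (Doppler bin, range bin) after the 2-D FFT
cube = 0.1 * (cp.random.randn(n_pulses, n_range)
              + 1j * cp.random.randn(n_pulses, n_range))
for fd, fr, amp in [(20, 100, 1.0), (50, 400, 0.5), (-40, 800, 0.8)]:
    cube += amp * cp.exp(2j * cp.pi * (fd * p / n_pulses + fr * r / n_range))

# Hamming windows in both dimensions, then the 2-D FFT
win = cp.hamming(n_pulses)[:, None] * cp.hamming(n_range)[None, :]
rd = cp.fft.fftshift(cp.fft.fft2(cube * win), axes=0)  # center the Doppler axis
power_db = 20 * cp.log10(cp.abs(rd) + 1e-12)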
ex-sp-ch11-12
Medium: Compare cp.fft.fft throughput in complex64 versus complex128 over a range of transform sizes. Plot the throughput ratio and explain why it is approximately 2x.
Throughput = N / time. The FFT is memory-bandwidth bound.
Approach
Time both precisions, compute the ratio, and relate to the roofline model (memory-bound operation, half the bytes = ~2x speed).
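A timing sketch with an illustrative range of sizes; each FFT is warmed up first so plan creation is not timed:
import cupy as cp
import time

for exp in range(16, 25):
    n = 2 ** exp
    times = {}
    for dtype in (cp.complex64, cp.complex128):
        x = (cp.random.randn(n) + 1j * cp.random.randn(n)).astype(dtype)
        cp.fft.fft(x); cp.cuda.Stream.null.synchronize()  # warm-up, caches the plan
        t0 = time.perf_counter()
        cp.fft.fft(x)
        cp.cuda.Stream.null.synchronize()
        times[dtype.__name__] = time.perf_counter() - t0
    ratio = times['complex128'] / times['complex64']
    print(f"N=2^{exp}: complex128 / complex64 time = {ratio:.2f}")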
ex-sp-ch11-13
Medium: Write a ReductionKernel that computes the log-sum-exp $\mathrm{LSE}(x) = \log \sum_i e^{x_i}$. Use the numerically stable trick of subtracting the maximum first: $\mathrm{LSE}(x) = m + \log \sum_i e^{x_i - m}$ with $m = \max_i x_i$.
First compute m = max(x) with cp.max, then reduce exp(x_i - m), then return m + log(result).
Implementation
Two-pass approach: (1) find max with cp.max, (2) use
ReductionKernel to sum exp(x - m), (3) take log and add m.
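A sketch of the two-pass approach; the kernel follows CuPy's documented ReductionKernel argument order (map expression, reduce expression, post-map, identity):
import cupy as cp

# Pass 2: sum of exp(x - m), written as a custom reduction
sumexp = cp.ReductionKernel(
    'float64 x, float64 m',   # input array and the broadcast max
    'float64 y',              # output
    'exp(x - m)',             # map: applied per element
    'a + b',                  # reduce: pairwise sum
    'y = a',                  # post-reduction map
    '0',                      # identity element
    'sumexp_kernel'
)

x = 10 * cp.random.randn(10**6)
m = float(cp.max(x))                          # pass 1: the max
lse = m + float(cp.log(sumexp(x, m)))
# Cross-check against a plain CuPy evaluation of the same stable formula
print(lse, m + float(cp.log(cp.sum(cp.exp(x - m)))))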
ex-sp-ch11-14
Hard: Implement a GPU-accelerated power iteration to find the dominant eigenvalue of a large symmetric matrix. Compare convergence speed and wall-clock time against cp.linalg.eigh (which computes all eigenvalues).
Power iteration: $v_{k+1} = A v_k / \|A v_k\|_2$.
Converge when the Rayleigh quotient $\lambda_k = v_k^{\mathsf{T}} A v_k$ stabilizes.
Approach
Implement power iteration entirely on GPU. It only needs repeated matvec and norm, making it much faster than full eigendecomposition when only the top eigenvalue is needed.
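A sketch of the GPU-resident loop; the matrix size and the SPD construction are illustrative:
import cupy as cp
import time

def dominant_eig(A, tol=1e-10, max_iter=1000):
    v = cp.random.randn(A.shape[0])
    v /= cp.linalg.norm(v)
    lam_old = 0.0
    for _ in range(max_iter):
        w = A @ v                       # one matvec per iteration
        lam = float(v @ w)              # Rayleigh quotient v^T A v
        v = w / cp.linalg.norm(w)
        if abs(lam - lam_old) <= tol * abs(lam):
            break
        lam_old = lam
    return lam

n = 4000                                # illustrative size
M = cp.random.randn(n, n)
A = M @ M.T / n                         # SPD, so dominant = largest eigenvalue

t0 = time.perf_counter()
lam_pi = dominant_eig(A)
t_pi = time.perf_counter() - t0
t0 = time.perf_counter()
lam_full = float(cp.linalg.eigh(A)[0][-1])  # eigenvalues sorted ascending
t_full = time.perf_counter() - t0
print(f"power iteration: {lam_pi:.6f} in {t_pi:.3f} s, eigh: {lam_full:.6f} in {t_full:.3f} s")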
ex-sp-ch11-15
Hard: Build a three-factor Kronecker matvec $y = (A \otimes B \otimes C)x$ on the GPU using sequential mode-$k$ products. Benchmark over a range of factor sizes and compare against the theoretical complexity.
Use cp.einsum or explicit reshapes for each mode multiply.
Approach
Reshape x to an $n_1 \times n_2 \times n_3$ tensor, apply each factor along its mode using cp.einsum, measure time, and compare against the theoretical cost: for equal factor sizes $n$, the mode products need $O(n^4)$ flops versus $O(n^6)$ for the explicit Kronecker matvec.
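A sketch with equal factor sizes ($n = 8$ so the explicit Kronecker check is feasible; raise n for the benchmark):
import cupy as cp

n = 8
A, B, C = (cp.random.randn(n, n) for _ in range(3))
x = cp.random.randn(n ** 3)

# C-order reshape: x[p*n*n + q*n + r] = T[p, q, r]
T = x.reshape(n, n, n)
# y[i,j,k] = sum_{p,q,r} A[i,p] B[j,q] C[k,r] T[p,q,r], one mode at a time
T = cp.einsum('ip,pqr->iqr', A, T)        # mode-1 product
T = cp.einsum('jq,iqr->ijr', B, T)        # mode-2 product
T = cp.einsum('kr,ijr->ijk', C, T)        # mode-3 product
y = T.ravel()

y_full = cp.kron(cp.kron(A, B), C) @ x    # explicit reference
print(bool(cp.allclose(y, y_full)))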
ex-sp-ch11-16
Hard: Write a RawKernel that implements the 1-D stencil $y_i = x_{i-1} - 2x_i + x_{i+1}$ using shared memory for the halo elements. Benchmark against a cupyx.scipy.sparse CSR matvec with the tridiagonal Laplacian.
Each block loads its range plus one halo element on each side into shared memory.
Approach
Load blockDim.x + 2 elements into shared memory, apply the
stencil, and write results. This avoids redundant global memory
reads at block boundaries.
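A sketch of the shared-memory kernel, assuming zero (Dirichlet) boundaries to match the tridiagonal Laplacian; timing would follow the same synchronize-and-measure pattern as the earlier exercises:
import cupy as cp
import numpy as np
import cupyx.scipy.sparse as csp

stencil_src = r'''
extern "C" __global__
void laplace1d(const double* x, double* y, int n) {
    extern __shared__ double s[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;             // +1 leaves room for the left halo
    s[t] = (i < n) ? x[i] : 0.0;
    if (threadIdx.x == 0)                // left halo (zero at the boundary)
        s[0] = (i > 0) ? x[i - 1] : 0.0;
    if (threadIdx.x == blockDim.x - 1)   // right halo
        s[t + 1] = (i + 1 < n) ? x[i + 1] : 0.0;
    __syncthreads();
    if (i < n)
        y[i] = s[t - 1] - 2.0 * s[t] + s[t + 1];
}
'''
laplace1d = cp.RawKernel(stencil_src, 'laplace1d')

n = 1 << 20
x = cp.random.randn(n)
y = cp.empty_like(x)
threads = 256
blocks = (n + threads - 1) // threads
laplace1d((blocks,), (threads,), (x, y, np.int32(n)),
          shared_mem=(threads + 2) * 8)   # blockDim.x + 2 doubles

# Reference: tridiagonal Laplacian as a CSR matvec
L = csp.diags([cp.ones(n - 1), -2.0 * cp.ones(n), cp.ones(n - 1)],
              offsets=[-1, 0, 1], format='csr')
print(bool(cp.allclose(y, L @ x)))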
ex-sp-ch11-17
Hard: Implement GPU-accelerated conjugate gradient (CG) to solve $Ax = b$ for a sparse SPD matrix $A$. All operations (matvec, dot products, axpy) must stay on the GPU. Compare wall-clock time against cp.linalg.solve on the dense equivalent.
CG only needs matvec, dot products, and vector updates — all GPU-friendly.
Approach
Implement CG with A_csr @ p for matvec, cp.dot for
inner products, and standard CuPy vector operations. No data
leaves the GPU during iteration.
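A sketch of the GPU-resident CG loop; the shifted tridiagonal Laplacian is an illustrative SPD test matrix, and only scalar reduction results cross back to the host for control flow:
import cupy as cp
import cupyx.scipy.sparse as csp

def cg(A, b, tol=1e-8, max_iter=1000):
    x = cp.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / float(p @ Ap)
        x += alpha * p                 # axpy updates, all on the GPU
        r -= alpha * Ap
        rs_new = float(r @ r)
        if rs_new ** 0.5 < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 100_000
# SPD test matrix: tridiagonal Laplacian plus a small diagonal shift
A = csp.diags([-cp.ones(n - 1), 2.1 * cp.ones(n), -cp.ones(n - 1)],
              offsets=[-1, 0, 1], format='csr')
b = cp.random.randn(n)
x = cg(A, b)
print(float(cp.linalg.norm(A @ x - b)))   # residual check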
ex-sp-ch11-18
Hard: Profile the memory usage of a CuPy workflow that creates and destroys many temporary arrays. Use cp.get_default_memory_pool() to monitor pool growth. Then implement an in-place version that reuses buffers and compare peak memory.
Use mempool.used_bytes() and mempool.total_bytes() at each step.
Approach
Track memory at each step, identify peak, then rewrite using
pre-allocated buffers and in-place operations (out= parameter).
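A sketch of the before/after comparison using the pool counters from the hint; the sin/cos workload is an illustrative stand-in:
import cupy as cp

mempool = cp.get_default_memory_pool()

def report(label):
    print(f"{label}: used={mempool.used_bytes()/1e6:.0f} MB, "
          f"pool={mempool.total_bytes()/1e6:.0f} MB")

n = 10**7
x = cp.random.randn(n)
report("start")

# Temporary-heavy version: each sub-expression allocates a new array
y = cp.sin(x) + cp.cos(x) ** 2
report("with temporaries")

# Buffer-reusing version: pre-allocate once, write in place with out=
buf1 = cp.empty_like(x)
buf2 = cp.empty_like(x)
cp.sin(x, out=buf1)
cp.cos(x, out=buf2)
cp.multiply(buf2, buf2, out=buf2)   # cos(x)**2 in place
cp.add(buf1, buf2, out=buf1)        # final result lands in buf1
report("in-place")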
ex-sp-ch11-19
Challenge: Implement a complete GPU-accelerated OFDM demodulator:
- Remove cyclic prefix
- Apply FFT per symbol
- Channel estimation (least squares per subcarrier)
- Equalization (ZF or MMSE)
Process a batch of 14 OFDM symbols with 1024 subcarriers and 4 Tx / 4 Rx antennas, entirely on the GPU.
Use batched FFT and batched cp.linalg.solve for per-subcarrier equalization.
Approach
Shape data as (n_symbols, n_rx, fft_size), apply batched FFT,
estimate channel from pilots, then solve per-subcarrier MIMO
systems using batched linalg.
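A sketch of the core pipeline with a random stand-in channel; a real implementation would estimate H from pilots, and the CP length of 72 samples is an illustrative choice:
import cupy as cp

n_sym, n_sc, n_tx, n_rx, cp_len = 14, 1024, 4, 4, 72

# Hypothetical received batch: (symbols, rx antennas, CP + useful samples)
rx_time = (cp.random.randn(n_sym, n_rx, cp_len + n_sc)
           + 1j * cp.random.randn(n_sym, n_rx, cp_len + n_sc))

# 1) Remove the cyclic prefix, 2) batched FFT along fast time
rx_freq = cp.fft.fft(rx_time[:, :, cp_len:], axis=-1)   # (n_sym, n_rx, n_sc)

# 3) Stand-in per-subcarrier channel; LS estimation from pilots would
#    produce an array of this same shape
H = (cp.random.randn(n_sc, n_rx, n_tx)
     + 1j * cp.random.randn(n_sc, n_rx, n_tx)) / 2**0.5

# 4) Zero-forcing equalization: for each subcarrier k, solve
#    H[k] @ S[k] = Y[k] with the 14 symbols as right-hand sides
Y = rx_freq.transpose(2, 1, 0)                           # (n_sc, n_rx, n_sym)
S_hat = cp.linalg.solve(H, Y)                            # (n_sc, n_tx, n_sym)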
ex-sp-ch11-20
Challenge: Compare three approaches for applying a Kronecker-structured covariance matrix to 1000 random channel vectors:
- Form full Kronecker and batch-multiply (if possible)
- Vec trick with sequential GPU matmuls
- Vec trick with a custom ElementwiseKernel that fuses the reshape
Plot speedup and memory usage for increasing dimension $n$.
Method 1 will run out of memory for n=128. That is the point.
Approach
Implement all three, benchmark with proper GPU synchronization, and show that the vec trick is the only viable approach for large antenna arrays.
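A sketch of methods 1 and 2; method 3's fused kernel is the exercise's main task, and the SPD factors and $n = 64$ here are illustrative:
import cupy as cp

n, n_vec = 64, 1000
Rt = cp.random.randn(n, n); Rt = Rt @ Rt.T / n   # SPD Kronecker factors
Rr = cp.random.randn(n, n); Rr = Rr @ Rr.T / n
X = cp.random.randn(n * n, n_vec)                # 1000 channel vectors as columns

# Method 1: explicit Kronecker product (O(n^4) memory; fails for large n)
R = cp.kron(Rt, Rr)
Y1 = R @ X

# Method 2: vec trick batched over vectors with einsum
# (Rt kron Rr) vec(V) = vec(Rr V Rt^T) with column-major reshape
V = X.reshape(n, n, n_vec, order='F')            # (rows, cols, batch)
Y2 = cp.einsum('ip,pqk,jq->ijk', Rr, V, Rt).reshape(n * n, n_vec, order='F')
print(bool(cp.allclose(Y1, Y2)))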