Exercises

ex-sp-ch32-01

Easy

Create a Gymnasium environment and run 100 episodes with random actions. Plot the cumulative reward of each episode.
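
A minimal starting sketch, assuming Gymnasium and Matplotlib are installed; CartPole-v1 is an illustrative choice of environment, not one the exercise prescribes.

```python
import gymnasium as gym
import matplotlib.pyplot as plt

env = gym.make("CartPole-v1")
episode_returns = []
for _ in range(100):
    obs, info = env.reset()
    done, total = False, 0.0
    while not done:
        action = env.action_space.sample()  # uniform random action
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    episode_returns.append(total)
env.close()

plt.plot(episode_returns)
plt.xlabel("Episode")
plt.ylabel("Cumulative reward")
plt.show()
```

Note the Gymnasium API: `step` returns separate `terminated` and `truncated` flags, and either one ends the episode.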

ex-sp-ch32-02

Easy

Implement tabular Q-learning for a 4x4 gridworld. Visualise learned Q-values.
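
A tabular Q-learning sketch, using FrozenLake-v1 as a stand-in 4x4 gridworld (an assumption; the exercise may intend a hand-rolled grid). Hyperparameters are illustrative.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        # Q-learning update: bootstrap from the greedy value of the next state,
        # zeroed at terminal states.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not terminated) - Q[s, a])
        s, done = s2, terminated or truncated

# Simple visualisation: the greedy state value max_a Q(s, a) on the 4x4 grid.
print(np.max(Q, axis=1).reshape(4, 4).round(2))
```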

ex-sp-ch32-03

Easy

Implement epsilon-greedy exploration with epsilon decaying from 1.0 to 0.01.
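
A sketch of one common scheme, linear decay over a fixed number of steps (exponential decay is an equally valid choice the exercise leaves open); the 10,000-step window is an assumption.

```python
import numpy as np

def make_epsilon_schedule(eps_start=1.0, eps_end=0.01, decay_steps=10_000):
    """Linear decay from eps_start to eps_end, then constant at eps_end."""
    def epsilon(step):
        frac = min(step / decay_steps, 1.0)
        return eps_start + frac * (eps_end - eps_start)
    return epsilon

def epsilon_greedy(q_values, eps, rng=None):
    """With probability eps take a uniform random action, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```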

ex-sp-ch32-04

Easy

Implement a replay buffer with fixed capacity and uniform sampling.
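
A minimal sketch built on `collections.deque`, whose `maxlen` provides fixed-capacity FIFO eviction for free:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within the batch.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```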

ex-sp-ch32-05

Easy

Compute discounted returns for a trajectory with rewards [1, 0, -1, 10] and gamma=0.99.
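
Working backwards with G_t = r_t + gamma * G_{t+1}: G_3 = 10, G_2 = -1 + 0.99 * 10 = 8.9, G_1 = 0.99 * 8.9 = 8.811, and G_0 = 1 + 0.99 * 8.811 = 9.72289. A sketch that reproduces this:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} backwards from the final step."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([1, 0, -1, 10], 0.99))
# ≈ [9.72289, 8.811, 8.9, 10.0]
```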

ex-sp-ch32-06

Medium

Implement DQN with experience replay and target network for CartPole-v1.
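
A sketch of the two DQN-specific pieces, the Q-network and the TD loss against a frozen target network, assuming PyTorch and the `ReplayBuffer` from ex-sp-ch32-04; the environment loop is omitted and the hidden width is illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = (
        torch.as_tensor(np.asarray(x), dtype=torch.float32) for x in batch
    )
    # Q(s, a) for the actions actually taken.
    q = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # targets come from the frozen network
        max_next_q = target_net(next_states).max(1).values
        target = rewards + gamma * max_next_q * (1 - dones)
    return nn.functional.mse_loss(q, target)

# Every few hundred steps, synchronise the target network:
# target_net.load_state_dict(q_net.state_dict())
```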

ex-sp-ch32-07

Medium

Implement Double DQN and compare its Q-value estimates to those of standard DQN.
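
The only change from standard DQN is the target computation: the online network selects the next action and the target network evaluates it, which damps the overestimation bias of the max operator. A sketch, reusing the batch layout above:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Online network picks the argmax action ...
        best_actions = q_net(next_states).argmax(1, keepdim=True)
        # ... target network supplies its value.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1 - dones)
```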

ex-sp-ch32-08

Medium

Implement REINFORCE for CartPole. Plot learning curve over 500 episodes.
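
A sketch of the REINFORCE loss for one complete episode, assuming PyTorch and a categorical policy; return normalisation is a common variance-reduction convenience rather than part of the basic algorithm.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t | s_t) tensors; rewards: list of floats."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Gradient ascent on expected return = descent on the negated weighted log-likelihood.
    return -(torch.stack(log_probs) * returns).sum()
```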

ex-sp-ch32-09

Medium

Add a learned baseline (value function) to REINFORCE and compare the variance of the policy-gradient estimates.
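
A sketch of the modified loss: the advantage G_t - V(s_t) replaces the raw return, and the value network is fit by regression on the same returns. The 0.5 weight on the value loss is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_loss(log_probs, values, returns):
    """values: V(s_t) predictions (1-D tensor); returns: matching tensor of G_t."""
    advantages = returns - values.detach()  # no policy gradient through the baseline
    policy_loss = -(torch.stack(log_probs) * advantages).sum()
    value_loss = F.mse_loss(values, returns)
    return policy_loss + 0.5 * value_loss
```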

ex-sp-ch32-10

Medium

Implement a custom Gymnasium environment for multi-user power control.
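
A skeleton showing the required Gymnasium plumbing; the channel model (i.i.d. exponential gains), sum-rate reward, and one-shot episodes are illustrative simplifications, not the exercise's required specification.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PowerControlEnv(gym.Env):
    """Each user picks a discrete power level; reward is the network sum-rate."""

    def __init__(self, n_users=4, n_levels=10, p_max=1.0, noise=1e-3):
        super().__init__()
        self.n_users, self.noise = n_users, noise
        self.levels = np.linspace(0.0, p_max, n_levels)
        self.action_space = spaces.MultiDiscrete([n_levels] * n_users)
        self.observation_space = spaces.Box(
            0.0, np.inf, shape=(n_users, n_users), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # G[i, j]: channel gain from transmitter j to receiver i.
        self.G = self.np_random.exponential(1.0, (self.n_users, self.n_users)).astype(np.float32)
        return self.G, {}

    def step(self, action):
        p = self.levels[action]
        signal = np.diag(self.G) * p
        interference = self.G @ p - signal
        sinr = signal / (interference + self.noise)
        reward = float(np.sum(np.log2(1.0 + sinr)))  # sum-rate in bits/s/Hz
        return self.G, reward, True, False, {}       # one-shot episode for simplicity
```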

ex-sp-ch32-11

Hard

Implement PPO with clipped surrogate objective and GAE advantage estimation.
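
Sketches of the two building blocks, assuming PyTorch; rollout tensors are 1-D, `values` carries one extra bootstrap entry, and the 0.2 clip range follows the PPO paper's default.

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE over one rollout; values has length T + 1 (last entry bootstraps the tail)."""
    T = len(rewards)
    advantages = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        last = delta + gamma * lam * (1 - dones[t]) * last
        advantages[t] = last
    return advantages

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    # Pessimistic bound: take the elementwise minimum, then minimise the negation.
    return -torch.min(unclipped, clipped).mean()
```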

ex-sp-ch32-12

Hard

Train DQN for power control in a 4-user interference channel. Compare to max-SINR heuristic.
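
One reading of the baseline, sketched against the `PowerControlEnv` from ex-sp-ch32-10: every user transmits at full power, greedily maximising its own SINR while ignoring the interference it causes (an assumption; the intended heuristic may differ).

```python
import numpy as np

def max_sinr_heuristic(env):
    """Full power for every user: the greedy per-user SINR-maximising action."""
    return np.full(env.n_users, len(env.levels) - 1, dtype=np.int64)

def evaluate(env, policy_fn, episodes=1000):
    """Average one-shot sum-rate of a policy over random channel draws."""
    total = 0.0
    for _ in range(episodes):
        env.reset()
        _, reward, *_ = env.step(policy_fn(env))
        total += reward
    return total / episodes
```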

ex-sp-ch32-13

Hard

Implement prioritised experience replay and compare to uniform replay.
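
A sketch of proportional prioritisation (Schaul et al., 2015) with a flat array instead of a sum-tree; O(n) sampling is acceptable at small buffer sizes, and the alpha/beta defaults follow the paper's proportional variant.

```python
import numpy as np

class PrioritisedReplayBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data = []
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def push(self, transition):
        # New transitions get the current max priority so they are seen at least once.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[: len(self.data)] ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # Importance-sampling weights correct the non-uniform sampling bias.
        weights = (len(self.data) * p[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        self.priorities[idx] = np.abs(td_errors) + eps
```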

ex-sp-ch32-14

Challenge

Train a PPO agent for multi-user scheduling that optimises proportional fairness.
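
One way to encode the objective as a reward, sketched below: proportional fairness maximises the sum of log average throughputs, sum_k log(R_bar_k), so the per-step reward can be the change in that sum. The averaging window tc and the initial eps are illustrative choices.

```python
import numpy as np

class PFRewardTracker:
    """Turns per-step user rates into a proportional-fairness reward signal."""

    def __init__(self, n_users, tc=100.0, eps=1e-9):
        self.avg = np.full(n_users, eps)  # small positive start keeps log() finite
        self.tc = tc

    def step(self, instantaneous_rates):
        """Update the exponential averages and return the change in sum-log utility."""
        old_utility = np.sum(np.log(self.avg))
        self.avg += (np.asarray(instantaneous_rates) - self.avg) / self.tc
        return float(np.sum(np.log(self.avg)) - old_utility)
```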

ex-sp-ch32-15

Challenge

Implement multi-agent RL for distributed power control, where each base station acts as an independent agent.
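
A structural sketch of the independent-learner setup: each base station keeps its own network and buffer and learns only from its local stream. The `act`, `buffer`, and `update` interfaces are hypothetical placeholders for whatever per-agent learner (e.g. the DQN pieces above) is plugged in.

```python
class IndependentAgents:
    """N independent learners with no parameter or experience sharing."""

    def __init__(self, n_agents, make_agent):
        self.agents = [make_agent(i) for i in range(n_agents)]

    def act(self, local_observations):
        # Decentralised execution: each agent sees only its own observation.
        return [agent.act(obs) for agent, obs in zip(self.agents, local_observations)]

    def learn(self, local_transitions):
        # Each agent stores and trains on its own transitions; from its
        # perspective, the other agents are part of a non-stationary environment.
        for agent, tr in zip(self.agents, local_transitions):
            agent.buffer.push(*tr)
            agent.update()
```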