Exercises

ex-sp-ch32-01

Easy

Create a Gymnasium environment and run 100 episodes with random actions. Plot the cumulative reward of each episode.
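
A minimal starting sketch, assuming Gymnasium and Matplotlib are installed; CartPole-v1 is an illustrative choice of environment, not one the exercise prescribes.

```python
import gymnasium as gym
import matplotlib.pyplot as plt

env = gym.make("CartPole-v1")
episode_returns = []
for _ in range(100):
    obs, info = env.reset()
    done, total = False, 0.0
    while not done:
        action = env.action_space.sample()  # uniform random action
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    episode_returns.append(total)
env.close()

plt.plot(episode_returns)
plt.xlabel("Episode")
plt.ylabel("Cumulative reward")
plt.show()
```

Note the Gymnasium API: `step` returns separate `terminated` and `truncated` flags, and either one ends the episode.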

ex-sp-ch32-02

Easy

Implement tabular Q-learning for a 4x4 gridworld. Visualise learned Q-values.
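
A tabular Q-learning sketch, using FrozenLake-v1 as a stand-in 4x4 gridworld (an assumption; the exercise may intend a hand-rolled grid). Hyperparameters are illustrative.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        # Q-learning update: bootstrap from the greedy value of the next state,
        # zeroed at terminal states.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not terminated) - Q[s, a])
        s, done = s2, terminated or truncated

# Simple visualisation: the greedy state value max_a Q(s, a) on the 4x4 grid.
print(np.max(Q, axis=1).reshape(4, 4).round(2))
```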

ex-sp-ch32-03

Easy

Implement epsilon-greedy exploration with epsilon decaying from 1.0 to 0.01.
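
A sketch of one common scheme, linear decay over a fixed number of steps (exponential decay is an equally valid choice the exercise leaves open); the 10,000-step window is an assumption.

```python
import numpy as np

def make_epsilon_schedule(eps_start=1.0, eps_end=0.01, decay_steps=10_000):
    """Linear decay from eps_start to eps_end, then constant at eps_end."""
    def epsilon(step):
        frac = min(step / decay_steps, 1.0)
        return eps_start + frac * (eps_end - eps_start)
    return epsilon

def epsilon_greedy(q_values, eps, rng=None):
    """With probability eps take a uniform random action, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```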

ex-sp-ch32-04

Easy

Implement a replay buffer with fixed capacity and uniform sampling.
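
A minimal sketch built on `collections.deque`, whose `maxlen` provides fixed-capacity FIFO eviction for free:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within the batch.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```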

ex-sp-ch32-05

Easy

Compute discounted returns for a trajectory with rewards [1, 0, -1, 10] and gamma=0.99.
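
Working backwards with G_t = r_t + gamma * G_{t+1}: G_3 = 10, G_2 = -1 + 0.99 * 10 = 8.9, G_1 = 0.99 * 8.9 = 8.811, and G_0 = 1 + 0.99 * 8.811 = 9.72289. A sketch that reproduces this:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} backwards from the final step."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([1, 0, -1, 10], 0.99))
# ≈ [9.72289, 8.811, 8.9, 10.0]
```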

ex-sp-ch32-06

Medium

Implement DQN with experience replay and target network for CartPole-v1.
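
A sketch of the two DQN-specific pieces, the Q-network and the TD loss against a frozen target network, assuming PyTorch and the `ReplayBuffer` from ex-sp-ch32-04; the environment loop is omitted and the hidden width is illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = (
        torch.as_tensor(np.asarray(x), dtype=torch.float32) for x in batch
    )
    # Q(s, a) for the actions actually taken.
    q = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # targets come from the frozen network
        max_next_q = target_net(next_states).max(1).values
        target = rewards + gamma * max_next_q * (1 - dones)
    return nn.functional.mse_loss(q, target)

# Every few hundred steps, synchronise the target network:
# target_net.load_state_dict(q_net.state_dict())
```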

ex-sp-ch32-07

Medium

Implement Double DQN and compare its Q-value estimates to those of standard DQN.
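
The only change from standard DQN is the target computation: the online network selects the next action and the target network evaluates it, which damps the overestimation bias of the max operator. A sketch, reusing the batch layout above:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Online network picks the argmax action ...
        best_actions = q_net(next_states).argmax(1, keepdim=True)
        # ... target network supplies its value.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1 - dones)
```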

ex-sp-ch32-08

Medium

Implement REINFORCE for CartPole. Plot learning curve over 500 episodes.
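
A sketch of the REINFORCE loss for one complete episode, assuming PyTorch and a categorical policy; return normalisation is a common variance-reduction convenience rather than part of the basic algorithm.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t | s_t) tensors; rewards: list of floats."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Gradient ascent on expected return = descent on the negated weighted log-likelihood.
    return -(torch.stack(log_probs) * returns).sum()
```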

ex-sp-ch32-09

Medium

Add a learned baseline (value function) to REINFORCE and compare the variance of the policy-gradient estimates.
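
A sketch of the modified loss: the advantage G_t - V(s_t) replaces the raw return, and the value network is fit by regression on the same returns. The 0.5 weight on the value loss is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_loss(log_probs, values, returns):
    """values: V(s_t) predictions (1-D tensor); returns: matching tensor of G_t."""
    advantages = returns - values.detach()  # no policy gradient through the baseline
    policy_loss = -(torch.stack(log_probs) * advantages).sum()
    value_loss = F.mse_loss(values, returns)
    return policy_loss + 0.5 * value_loss
```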

ex-sp-ch32-10

Medium

Implement a custom Gymnasium environment for multi-user power control.
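
A skeleton showing the required Gymnasium plumbing; the channel model (i.i.d. exponential gains), sum-rate reward, and one-shot episodes are illustrative simplifications, not the exercise's required specification.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PowerControlEnv(gym.Env):
    """Each user picks a discrete power level; reward is the network sum-rate."""

    def __init__(self, n_users=4, n_levels=10, p_max=1.0, noise=1e-3):
        super().__init__()
        self.n_users, self.noise = n_users, noise
        self.levels = np.linspace(0.0, p_max, n_levels)
        self.action_space = spaces.MultiDiscrete([n_levels] * n_users)
        self.observation_space = spaces.Box(
            0.0, np.inf, shape=(n_users, n_users), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # G[i, j]: channel gain from transmitter j to receiver i.
        self.G = self.np_random.exponential(1.0, (self.n_users, self.n_users)).astype(np.float32)
        return self.G, {}

    def step(self, action):
        p = self.levels[action]
        signal = np.diag(self.G) * p
        interference = self.G @ p - signal
        sinr = signal / (interference + self.noise)
        reward = float(np.sum(np.log2(1.0 + sinr)))  # sum-rate in bits/s/Hz
        return self.G, reward, True, False, {}       # one-shot episode for simplicity
```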

ex-sp-ch32-11

Hard

Implement PPO with clipped surrogate objective and GAE advantage estimation.
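
Sketches of the two building blocks, assuming PyTorch; rollout tensors are 1-D, `values` carries one extra bootstrap entry, and the 0.2 clip range follows the PPO paper's default.

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE over one rollout; values has length T + 1 (last entry bootstraps the tail)."""
    T = len(rewards)
    advantages = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        last = delta + gamma * lam * (1 - dones[t]) * last
        advantages[t] = last
    return advantages

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    # Pessimistic bound: take the elementwise minimum, then minimise the negation.
    return -torch.min(unclipped, clipped).mean()
```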

ex-sp-ch32-12

Hard

Train DQN for power control in a 4-user interference channel. Compare to max-SINR heuristic.
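
One reading of the baseline, sketched against the `PowerControlEnv` from ex-sp-ch32-10: every user transmits at full power, greedily maximising its own SINR while ignoring the interference it causes (an assumption; the intended heuristic may differ).

```python
import numpy as np

def max_sinr_heuristic(env):
    """Full power for every user: the greedy per-user SINR-maximising action."""
    return np.full(env.n_users, len(env.levels) - 1, dtype=np.int64)

def evaluate(env, policy_fn, episodes=1000):
    """Average one-shot sum-rate of a policy over random channel draws."""
    total = 0.0
    for _ in range(episodes):
        env.reset()
        _, reward, *_ = env.step(policy_fn(env))
        total += reward
    return total / episodes
```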

ex-sp-ch32-13

Hard

Implement prioritised experience replay and compare to uniform replay.
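
A sketch of proportional prioritisation (Schaul et al., 2015) with a flat array instead of a sum-tree; O(n) sampling is acceptable at small buffer sizes, and the alpha/beta defaults follow the paper's proportional variant.

```python
import numpy as np

class PrioritisedReplayBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data = []
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def push(self, transition):
        # New transitions get the current max priority so they are seen at least once.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[: len(self.data)] ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # Importance-sampling weights correct the non-uniform sampling bias.
        weights = (len(self.data) * p[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        self.priorities[idx] = np.abs(td_errors) + eps
```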

ex-sp-ch32-14

Challenge

Train a PPO agent for multi-user scheduling that optimises proportional fairness.
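
One way to encode the objective as a reward, sketched below: proportional fairness maximises the sum of log average throughputs, sum_k log(R_bar_k), so the per-step reward can be the change in that sum. The averaging window tc and the initial eps are illustrative choices.

```python
import numpy as np

class PFRewardTracker:
    """Turns per-step user rates into a proportional-fairness reward signal."""

    def __init__(self, n_users, tc=100.0, eps=1e-9):
        self.avg = np.full(n_users, eps)  # small positive start keeps log() finite
        self.tc = tc

    def step(self, instantaneous_rates):
        """Update the exponential averages and return the change in sum-log utility."""
        old_utility = np.sum(np.log(self.avg))
        self.avg += (np.asarray(instantaneous_rates) - self.avg) / self.tc
        return float(np.sum(np.log(self.avg)) - old_utility)
```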

ex-sp-ch32-15

Challenge

Implement multi-agent RL for distributed power control, where each base station acts as an independent agent.
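
A structural sketch of the independent-learner setup: each base station keeps its own network and buffer and learns only from its local stream. The `act`, `buffer`, and `update` interfaces are hypothetical placeholders for whatever per-agent learner (e.g. the DQN pieces above) is plugged in.

```python
class IndependentAgents:
    """N independent learners with no parameter or experience sharing."""

    def __init__(self, n_agents, make_agent):
        self.agents = [make_agent(i) for i in range(n_agents)]

    def act(self, local_observations):
        # Decentralised execution: each agent sees only its own observation.
        return [agent.act(obs) for agent, obs in zip(self.agents, local_observations)]

    def learn(self, local_transitions):
        # Each agent stores and trains on its own transitions; from its
        # perspective, the other agents are part of a non-stationary environment.
        for agent, tr in zip(self.agents, local_transitions):
            agent.buffer.push(*tr)
            agent.update()
```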