Chapter Summary

Key Points

  1. MDPs formalise sequential decision-making: states, actions, transition probabilities, rewards, and a discount factor define the problem. The Bellman equation is the foundation of all value-based methods (restated in the sketch after this list).

  2. DQN approximates Q-values with neural networks. Experience replay breaks temporal correlation, target networks stabilise training, and Double DQN reduces overestimation bias (see the replay and target sketch below).

  3. Policy gradient methods handle continuous action spaces. REINFORCE estimates gradients from sampled trajectories, PPO clips the policy ratio for stable updates, and advantage functions reduce variance (see the clipped-loss sketch below).

  4. Wireless problems map naturally to MDPs: channel conditions are states, resource allocations are actions, and throughput or QoS metrics are rewards. RL can discover strategies that surpass rule-based approaches (see the toy environment sketch below).

  5. Reward design is critical. Sparse rewards make learning difficult, while dense, well-shaped rewards accelerate convergence. Consider multi-objective rewards that balance throughput and fairness (see the reward sketch below).
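
For reference, this is the Bellman optimality equation from point 1, written in standard MDP notation (reward r, discount factor γ, next state s'):

```latex
Q^*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \,\right]
```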
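
The following is a minimal PyTorch sketch of the two stabilisation tricks from point 2: a uniform replay buffer and a Double DQN target. The capacity, batch size, and function names are illustrative assumptions, not the chapter's exact implementation.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Uniform experience replay: sampling random past transitions
    breaks the temporal correlation of consecutive environment steps."""
    def __init__(self, capacity=100_000):  # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states),
                torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

def double_dqn_target(online_net, target_net, rewards, next_states, dones,
                      gamma=0.99):
    """Double DQN: the online network selects the next action, while the
    frozen target network evaluates it, reducing overestimation bias."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```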
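
Point 3's clipping mechanism fits in a few lines. A sketch of the PPO clipped surrogate loss, assuming log-probabilities and advantages have already been computed:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """PPO clipped surrogate objective: the probability ratio is clipped
    to [1 - eps, 1 + eps] so one update cannot move the policy too far."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Minimise the negative of the pessimistic (element-wise minimum) objective.
    return -torch.min(unclipped, clipped).mean()
```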
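
To make point 4 concrete, here is a toy link-adaptation environment showing the state/action/reward mapping. The rates, success model, and class name are invented for illustration and are not drawn from the chapter:

```python
import numpy as np

class ToyLinkAdaptationEnv:
    """Illustrative wireless MDP: the state is a quantised channel quality
    level, the action selects a transmission rate, and the reward is the
    throughput actually achieved."""
    RATES = np.array([1.0, 2.0, 4.0, 8.0])  # Mbps per rate option (assumed)

    def __init__(self, n_levels=8, seed=0):
        self.n_levels = n_levels
        self.rng = np.random.default_rng(seed)
        self.state = int(self.rng.integers(n_levels))

    def step(self, action):
        # Toy success model: aggressive rates fail more often on poor channels.
        quality = self.state / (self.n_levels - 1)
        success_prob = np.clip(quality - 0.15 * action, 0.05, 1.0)
        reward = float(self.RATES[action]) if self.rng.random() < success_prob else 0.0
        # The channel evolves as a simple random walk over quality levels.
        self.state = int(np.clip(self.state + self.rng.integers(-1, 2),
                                 0, self.n_levels - 1))
        return self.state, reward
```

An agent interacting with this environment observes the channel level, picks an index into RATES, and receives the realised throughput as its reward.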
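
Finally, a sketch of the multi-objective reward from point 5, combining total throughput with Jain's fairness index; the weights are assumptions to be tuned per deployment:

```python
import numpy as np

def multi_objective_reward(user_throughputs, w_tput=1.0, w_fair=1.0):
    """Dense reward: weighted sum of total throughput and Jain's fairness
    index (w_tput and w_fair are illustrative, not the chapter's values)."""
    t = np.asarray(user_throughputs, dtype=float)
    total = t.sum()
    # Jain's index: (sum t)^2 / (n * sum t^2), in (0, 1]; 1 is perfectly fair.
    jain = total ** 2 / (len(t) * (t ** 2).sum() + 1e-12)
    return w_tput * total + w_fair * jain
```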

Looking Ahead

Chapter 33 covers transfer learning and model export, enabling deployment of trained RL policies and other models in production.