Chapter Summary

Key Points

  1. MDPs formalise sequential decision-making: states, actions, transition probabilities, rewards, and a discount factor define the problem. The Bellman equation is the foundation of all value-based methods (restated in the sketch after this list).

  2. DQN approximates Q-values with neural networks. Experience replay breaks temporal correlation, target networks stabilise training, and Double DQN reduces overestimation bias (see the replay and target sketch below).

  3. Policy gradient methods handle continuous action spaces. REINFORCE estimates gradients from sampled trajectories, PPO clips the policy ratio for stable updates, and advantage functions reduce variance (see the clipped-loss sketch below).

  4. Wireless problems map naturally to MDPs: channel conditions are states, resource allocations are actions, and throughput or QoS metrics are rewards. RL can discover strategies that surpass rule-based approaches (see the toy environment sketch below).

  5. Reward design is critical. Sparse rewards make learning difficult, while dense, well-shaped rewards accelerate convergence. Consider multi-objective rewards that balance throughput and fairness (see the reward sketch below).
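
For reference, this is the Bellman optimality equation from point 1, written in standard MDP notation (reward r, discount factor γ, next state s'):

```latex
Q^*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \,\right]
```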
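
The following is a minimal PyTorch sketch of the two stabilisation tricks from point 2: a uniform replay buffer and a Double DQN target. The capacity, batch size, and function names are illustrative assumptions, not the chapter's exact implementation.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Uniform experience replay: sampling random past transitions
    breaks the temporal correlation of consecutive environment steps."""
    def __init__(self, capacity=100_000):  # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states),
                torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

def double_dqn_target(online_net, target_net, rewards, next_states, dones,
                      gamma=0.99):
    """Double DQN: the online network selects the next action, while the
    frozen target network evaluates it, reducing overestimation bias."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```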
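
Point 3's clipping mechanism fits in a few lines. A sketch of the PPO clipped surrogate loss, assuming log-probabilities and advantages have already been computed:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """PPO clipped surrogate objective: the probability ratio is clipped
    to [1 - eps, 1 + eps] so one update cannot move the policy too far."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Minimise the negative of the pessimistic (element-wise minimum) objective.
    return -torch.min(unclipped, clipped).mean()
```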
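
To make point 4 concrete, here is a toy link-adaptation environment showing the state/action/reward mapping. The rates, success model, and class name are invented for illustration and are not drawn from the chapter:

```python
import numpy as np

class ToyLinkAdaptationEnv:
    """Illustrative wireless MDP: the state is a quantised channel quality
    level, the action selects a transmission rate, and the reward is the
    throughput actually achieved."""
    RATES = np.array([1.0, 2.0, 4.0, 8.0])  # Mbps per rate option (assumed)

    def __init__(self, n_levels=8, seed=0):
        self.n_levels = n_levels
        self.rng = np.random.default_rng(seed)
        self.state = int(self.rng.integers(n_levels))

    def step(self, action):
        # Toy success model: aggressive rates fail more often on poor channels.
        quality = self.state / (self.n_levels - 1)
        success_prob = np.clip(quality - 0.15 * action, 0.05, 1.0)
        reward = float(self.RATES[action]) if self.rng.random() < success_prob else 0.0
        # The channel evolves as a simple random walk over quality levels.
        self.state = int(np.clip(self.state + self.rng.integers(-1, 2),
                                 0, self.n_levels - 1))
        return self.state, reward
```

An agent interacting with this environment observes the channel level, picks an index into RATES, and receives the realised throughput as its reward.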
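
Finally, a sketch of the multi-objective reward from point 5, combining total throughput with Jain's fairness index; the weights are assumptions to be tuned per deployment:

```python
import numpy as np

def multi_objective_reward(user_throughputs, w_tput=1.0, w_fair=1.0):
    """Dense reward: weighted sum of total throughput and Jain's fairness
    index (w_tput and w_fair are illustrative, not the chapter's values)."""
    t = np.asarray(user_throughputs, dtype=float)
    total = t.sum()
    # Jain's index: (sum t)^2 / (n * sum t^2), in (0, 1]; 1 is perfectly fair.
    jain = total ** 2 / (len(t) * (t ** 2).sum() + 1e-12)
    return w_tput * total + w_fair * jain
```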

Looking Ahead

Chapter 33 covers transfer learning and model export, enabling deployment of trained RL policies and other models in production.