Chapter Summary
Key Points
1. MDPs formalise sequential decision-making: states, actions, transitions, rewards, and a discount factor define the problem. The Bellman equation is the foundation of all value-based methods (restated after this list).
2. DQN approximates Q-values with neural networks. Experience replay breaks temporal correlation, target networks stabilise training, and Double DQN reduces overestimation bias (see the target equation below).
3. Policy gradient methods handle continuous action spaces. REINFORCE estimates gradients from sampled trajectories, PPO clips the policy ratio for stable updates (the clipped objective is shown after this list), and advantage functions reduce variance.
4. Wireless RL maps naturally to MDPs: channel conditions are states, resource allocations are actions, and throughput/QoS metrics are rewards. RL can discover strategies that surpass rule-based approaches.
5. Reward design is critical. Sparse rewards make learning difficult; dense, well-shaped rewards accelerate convergence. Consider multi-objective rewards that balance throughput and fairness (a small sketch follows this list).
Looking Ahead
Chapter 33 covers transfer learning and model export, enabling deployment of trained RL policies and other models in production.