Policy Gradient Methods

Definition: REINFORCE Algorithm

$$\nabla_\theta J = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t^i|s_t^i) \cdot G_t^i$$

where $G_t^i$ is the discounted return from time $t$.
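The estimator above can be sketched in a few lines of numpy. This is a minimal illustration, not a reference implementation: it assumes a tabular softmax policy (logits `theta[s]` per state), for which $\nabla_\theta \log \pi_\theta(a|s)$ has the closed form `onehot(a) - pi(.|s)`.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    # G_t = sum_{k >= t} gamma^{k-t} * r_k, computed in one backward pass
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_gradient(theta, episodes, gamma=0.99):
    """Monte-Carlo estimate of grad_theta J for a tabular softmax policy.

    theta:    (n_states, n_actions) array of logits; pi(a|s) = softmax(theta[s])
    episodes: list of (states, actions, rewards) trajectories
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        G = discounted_returns(rewards, gamma)
        for t, (s, a) in enumerate(zip(states, actions)):
            logits = theta[s]
            pi = np.exp(logits - logits.max())  # stable softmax
            pi /= pi.sum()
            # grad of log pi(a|s) w.r.t. theta[s] is onehot(a) - pi
            g = -pi
            g[a] += 1.0
            grad[s] += g * G[t]
    return grad / len(episodes)  # average over the N sampled trajectories
```

Ascending this gradient (e.g. `theta += lr * reinforce_gradient(theta, episodes)`) increases the log-probability of actions in proportion to the return that followed them.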

Definition: Proximal Policy Optimization (PPO)

PPO clips the policy ratio to prevent large updates:

$$L^{\text{CLIP}} = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$.
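The clipped objective maps directly onto array operations. Below is a minimal numpy sketch of $-L^{\text{CLIP}}$ as a loss to minimize; the inputs (per-timestep log-probabilities under the new and old policies, and advantage estimates $\hat{A}_t$) are assumed to come from elsewhere in a training loop.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negated clipped surrogate objective (minimize this to maximize L^CLIP).

    logp_new, logp_old: per-timestep log pi(a_t|s_t) under current / old policy
    advantages:         advantage estimates A_hat_t
    eps:                clip range epsilon
    """
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min(...) makes the bound pessimistic: the policy gets no extra credit
    # for pushing the ratio beyond [1-eps, 1+eps]
    return -np.mean(np.minimum(unclipped, clipped))
```

Note the asymmetry: for a positive advantage the gain is capped at $(1+\varepsilon)\hat{A}_t$, while for a negative advantage with a large ratio the unclipped (worse) term is kept, so harmful updates are never masked by the clip.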