Intro
This blog post is my summary after reading "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" by Logan Engstrom et al.
reward clipping
- clip the rewards to a preset range (usually [-5, 5] or [-10, 10])
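As a minimal sketch (the range and the use of NumPy are my own choices, not from the paper), this is just an elementwise clamp applied to each reward before it is stored:

```python
import numpy as np

def clip_reward(reward: float, low: float = -10.0, high: float = 10.0) -> float:
    """Clamp a raw environment reward into the preset range [low, high]."""
    return float(np.clip(reward, low, high))

# e.g. applied right after env.step(), before the transition is stored
# r = clip_reward(raw_reward)
```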
observation clipping
- the states are first normalized to mean-zero, variance-one vectors, and then clipped to a preset range (usually [-10, 10])
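A common way to implement this is to keep running estimates of the observation mean and variance, standardize each incoming state, and then clip the result. The sketch below is my own simplification; the class name and the incremental-statistics bookkeeping are assumptions, not the paper's code:

```python
import numpy as np

class RunningObsNormalizer:
    """Normalize observations to roughly zero mean / unit variance, then clip."""

    def __init__(self, shape, clip_range=10.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip_range = clip_range
        self.eps = eps

    def update(self, obs_batch):
        # Fold the batch statistics into the running mean/variance estimates.
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = self.var * self.count + batch_var * batch_count \
             + delta ** 2 * self.count * batch_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, obs):
        std_obs = (obs - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(std_obs, -self.clip_range, self.clip_range)
```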
value function clipping
- replace the standard value loss \(L^{V} = (V_{\theta_t} - V_{\mathrm{targ}})^{2}\) with the clipped objective \(L^{V} = \min\left[(V_{\theta_t} - V_{\mathrm{targ}})^{2},\ \left(\operatorname{clip}\left(V_{\theta_t},\, V_{\theta_{t-1}} - \epsilon,\, V_{\theta_{t-1}} + \epsilon\right) - V_{\mathrm{targ}}\right)^{2}\right]\)
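A sketch of this clipped value loss in PyTorch; the function and tensor names are mine, `eps` plays the role of \(\epsilon\), and `values_old` are the value predictions from before the update:

```python
import torch

def clipped_value_loss(values_new: torch.Tensor,
                       values_old: torch.Tensor,
                       returns: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Clipped value-function objective described above.

    values_new : V_{theta_t}(s), current value predictions
    values_old : V_{theta_{t-1}}(s), predictions before the update
    returns    : V_targ, the regression targets
    """
    unclipped = (values_new - returns) ** 2
    # clip(V_new, V_old - eps, V_old + eps), written via a scalar clamp
    clipped_pred = values_old + torch.clamp(values_new - values_old, -eps, eps)
    clipped = (clipped_pred - returns) ** 2
    # Note: some public PPO implementations take the elementwise max of these
    # two terms instead of the min written in the formula above.
    return torch.min(unclipped, clipped).mean()
```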
orthogonal initialization and layer scaling
- use orthogonal initialization with scaling that varies from layer to layer
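A sketch of per-layer orthogonal initialization in PyTorch. The specific gains below (√2 for hidden layers, 0.01 for the policy head, 1.0 for the value head) are values commonly used in PPO codebases, not something stated above:

```python
import math
import torch.nn as nn

def ortho_init(layer: nn.Linear, gain: float) -> nn.Linear:
    """Orthogonally initialize a linear layer's weights and zero its bias."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

hidden = ortho_init(nn.Linear(64, 64), gain=math.sqrt(2))  # hidden layers
policy_head = ortho_init(nn.Linear(64, 6), gain=0.01)      # small scale for the policy output
value_head = ortho_init(nn.Linear(64, 1), gain=1.0)        # value output
```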
adam learning rate annealing
- anneal the learning rate of Adam
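A sketch of one annealing scheme, assuming a linear decay to zero over a fixed number of policy updates (a typical choice; the initial learning rate, update count, and placeholder parameters are my own):

```python
import torch

policy_params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
initial_lr = 3e-4
total_updates = 1000
optimizer = torch.optim.Adam(policy_params, lr=initial_lr)

def anneal_lr(update_idx: int) -> None:
    """Linearly decay Adam's learning rate from initial_lr down to 0."""
    frac = 1.0 - update_idx / total_updates
    for group in optimizer.param_groups:
        group["lr"] = frac * initial_lr

# called once per policy update, before the optimization epochs for that batch
```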
hyperbolic tan activations
- use hyperbolic tangent (tanh) activations when constructing the policy network and value network
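For example, a small actor/critic pair of MLPs with tanh nonlinearities (the 64-unit widths and the input/output dimensions are illustrative choices):

```python
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative dimensions

policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),          # e.g. mean of a Gaussian action distribution
)

value_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),                # scalar state value
)
```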
global gradient clipping
- clip the gradients such that the global ℓ2 norm (the norm of all parameter gradients concatenated together) does not exceed 0.5
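In PyTorch this is what `clip_grad_norm_` does when given all parameters at once; the model and loss below are placeholders just to produce gradients:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                          # stand-in for the policy/value networks
loss = model(torch.randn(4, 8)).pow(2).mean()    # dummy loss
loss.backward()

# rescale all gradients jointly so that their combined ("global") L2 norm is at most 0.5
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```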