Intro
This blog post is my summary after reading "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" by Logan Engstrom et al.
reward clipping
- clip the rewards to a preset range (usually [-5, 5] or [-10, 10])
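As a minimal sketch (the range and the use of NumPy are my own choices, not from the paper), this is just an elementwise clamp applied to each reward before it is stored:

```python
import numpy as np

def clip_reward(reward: float, low: float = -10.0, high: float = 10.0) -> float:
    """Clamp a raw environment reward into the preset range [low, high]."""
    return float(np.clip(reward, low, high))

# e.g. applied right after env.step(), before the transition is stored
# r = clip_reward(raw_reward)
```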
observation clipping
- the states are first normalized to mean-zero, variance-one vectors, and then clipped to a preset range (usually [-10, 10])
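A common way to implement this is to keep running estimates of the observation mean and variance, standardize each incoming state, and then clip the result. The sketch below is my own simplification; the class name and the incremental-statistics bookkeeping are assumptions, not the paper's code:

```python
import numpy as np

class RunningObsNormalizer:
    """Normalize observations to roughly zero mean / unit variance, then clip."""

    def __init__(self, shape, clip_range=10.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip_range = clip_range
        self.eps = eps

    def update(self, obs_batch):
        # Fold the batch statistics into the running mean/variance estimates.
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = self.var * self.count + batch_var * batch_count \
             + delta ** 2 * self.count * batch_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, obs):
        std_obs = (obs - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(std_obs, -self.clip_range, self.clip_range)
```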
value function clipping
- replace the standard value loss \(L^{V} = (V_{\theta_t} - V_{\mathrm{targ}})^{2}\) with the clipped objective \(L^{V} = \min\left[(V_{\theta_t} - V_{\mathrm{targ}})^{2},\ \left(\operatorname{clip}\left(V_{\theta_t},\, V_{\theta_{t-1}} - \epsilon,\, V_{\theta_{t-1}} + \epsilon\right) - V_{\mathrm{targ}}\right)^{2}\right]\)
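A sketch of this clipped value loss in PyTorch; the function and tensor names are mine, `eps` plays the role of \(\epsilon\), and `values_old` are the value predictions from before the update:

```python
import torch

def clipped_value_loss(values_new: torch.Tensor,
                       values_old: torch.Tensor,
                       returns: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Clipped value-function objective described above.

    values_new : V_{theta_t}(s), current value predictions
    values_old : V_{theta_{t-1}}(s), predictions before the update
    returns    : V_targ, the regression targets
    """
    unclipped = (values_new - returns) ** 2
    # clip(V_new, V_old - eps, V_old + eps), written via a scalar clamp
    clipped_pred = values_old + torch.clamp(values_new - values_old, -eps, eps)
    clipped = (clipped_pred - returns) ** 2
    # Note: some public PPO implementations take the elementwise max of these
    # two terms instead of the min written in the formula above.
    return torch.min(unclipped, clipped).mean()
```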
orthogonal initialization and layer scaling
- use orthogonal initialization with scaling that varies from layer to layer
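A sketch of per-layer orthogonal initialization in PyTorch. The specific gains below (√2 for hidden layers, 0.01 for the policy head, 1.0 for the value head) are values commonly used in PPO codebases, not something stated above:

```python
import math
import torch.nn as nn

def ortho_init(layer: nn.Linear, gain: float) -> nn.Linear:
    """Orthogonally initialize a linear layer's weights and zero its bias."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

hidden = ortho_init(nn.Linear(64, 64), gain=math.sqrt(2))  # hidden layers
policy_head = ortho_init(nn.Linear(64, 6), gain=0.01)      # small scale for the policy output
value_head = ortho_init(nn.Linear(64, 1), gain=1.0)        # value output
```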
adam learning rate annealing
- anneal the learning rate of Adam
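A sketch of one annealing scheme, assuming a linear decay to zero over a fixed number of policy updates (a typical choice; the initial learning rate, update count, and placeholder parameters are my own):

```python
import torch

policy_params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
initial_lr = 3e-4
total_updates = 1000
optimizer = torch.optim.Adam(policy_params, lr=initial_lr)

def anneal_lr(update_idx: int) -> None:
    """Linearly decay Adam's learning rate from initial_lr down to 0."""
    frac = 1.0 - update_idx / total_updates
    for group in optimizer.param_groups:
        group["lr"] = frac * initial_lr

# called once per policy update, before the optimization epochs for that batch
```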
hyperbolic tan activations
- use hyperbolic tangent (tanh) activations when constructing the policy network and value network
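For example, a small actor/critic pair of MLPs with tanh nonlinearities (the 64-unit widths and the input/output dimensions are illustrative choices):

```python
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative dimensions

policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),          # e.g. mean of a Gaussian action distribution
)

value_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),                # scalar state value
)
```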
global gradient clipping
- clip the gradients such that the global ℓ2 norm (the norm of all parameter gradients concatenated together) does not exceed 0.5
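In PyTorch this is what `clip_grad_norm_` does when given all parameters at once; the model and loss below are placeholders just to produce gradients:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                          # stand-in for the policy/value networks
loss = model(torch.randn(4, 8)).pow(2).mean()    # dummy loss
loss.backward()

# rescale all gradients jointly so that their combined ("global") L2 norm is at most 0.5
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```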