A Survey of Open-Source Code for (Meta-)Reinforcement Learning
Local code: https://github.com/lucifer2859/meta-RL
Introduction to meta-reinforcement learning: https://www.cnblogs.com/lucifer1997/p/13603979.html
I. Meta-RL
1. Learning to Reinforcement Learn: CogSci 2017
- https://github.com/awjuliani/Meta-RL
- Environment: TensorFlow, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit, Contextual Bandit, GridWorld;
- A3C-Meta-Bandit - the set of bandit tasks described in the paper, including Independent, Dependent, and Restless bandits.
- A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate the reward-giving arm in each episode.
- A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomized each episode and must be learned "on the fly."
- Model: one-layer LSTM A3C [Figure 1(a), without the encoder layer] (an input-conditioning sketch follows this item);
- Experiments: runs successfully without bugs; training converges; results roughly match the paper; performance does not reach the paper's reported level (with the current hyperparameters); the local code modifies it slightly, see https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL;
- https://github.com/achao2013/Learning-To-Reinforcement-Learn
- Environment: MXNet, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
- Model: multi-layer LSTM A3C [without the encoder layer];
- Experiments: not run;
- https://github.com/lucifer2859/meta-RL/tree/master/L2RL-pytorch
- Environment: PyTorch, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
- Model: one-layer LSTM A3C [Figure 1(a), with GAE, without the encoder layer];
- Experiments: runs successfully without bugs; training converges; results roughly match the paper; performance does not reach the paper's reported level (with the current hyperparameters);
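The common thread in these repos is that the recurrent policy is conditioned on the previous action and reward in addition to the current observation, so the LSTM can learn a learning rule across episodes. Below is a minimal PyTorch sketch of that input conditioning for a k-armed bandit; it is an illustration under assumed names and shapes (RecurrentBanditPolicy, hidden_size, etc.), not code from any of the repos above.

```python
import torch
import torch.nn as nn

class RecurrentBanditPolicy(nn.Module):
    """Illustrative sketch: an LSTM actor-critic whose input concatenates the
    one-hot previous action, the previous reward, and the normalized timestep,
    as in Learning to Reinforcement Learn / RL^2. Names are hypothetical."""
    def __init__(self, num_arms, hidden_size=48):
        super().__init__()
        self.num_arms = num_arms
        self.lstm = nn.LSTMCell(num_arms + 2, hidden_size)
        self.policy_head = nn.Linear(hidden_size, num_arms)  # actor logits
        self.value_head = nn.Linear(hidden_size, 1)           # critic value

    def forward(self, prev_action, prev_reward, timestep, hidden=None):
        one_hot = torch.zeros(1, self.num_arms)
        one_hot[0, prev_action] = 1.0
        x = torch.cat([one_hot,
                       torch.tensor([[float(prev_reward)]]),
                       torch.tensor([[float(timestep)]])], dim=1)
        h, c = self.lstm(x, hidden)
        return self.policy_head(h), self.value_head(h), (h, c)
```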
2. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL2): ICLR 2017
- https://github.com/mwufi/meta-rl-bandits
- Environment: PyTorch, CPU;
- Tasks: Independent Bandit;
- Model: two-layer LSTM REINFORCE;
- Experiments: runs successfully without bugs; the model does not match the paper, whose RNN is a GRU; training does not converge (with the current hyperparameters);
- https://github.com/VashishtMadhavan/rl2
- Environment: TensorFlow, CPU;
- Tasks: Dependent Bandit;
- Model: one-layer LSTM A3C [without the encoder layer];
- Experiments: fails to run with gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0 (a hedged registration sketch follows this item);
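The UnregisteredEnv error above usually means the repo's custom bandit environment is never registered with Gym before gym.make is called. The sketch below shows the generic workaround; the entry_point module path and class name are illustrative assumptions and would need to be pointed at wherever this repo actually defines its bandit environment.

```python
import gym
from gym.envs.registration import register

# Register the custom environment id before calling gym.make('MediumBandit-v0').
# The entry_point below is hypothetical; replace it with the module:Class
# that actually defines the bandit environment in the repo.
register(
    id='MediumBandit-v0',
    entry_point='envs.bandit:MediumBandit',
    max_episode_steps=100,
)

env = gym.make('MediumBandit-v0')
```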
3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML): ICML 2017
- https://github.com/tristandeleu/pytorch-maml-rl
- Environment: PyTorch, GPU;
- Tasks: Multi-armed Bandit, Tabular MDP, Continuous Control with MuJoCo, 2D Navigation Task;
- Model: MAML TRPO;
- Experiments: initially fails to run with terminate called after throwing an instance of 'c10::Error', which is resolved by following https://github.com/tristandeleu/pytorch-maml-rl/issues/40#issuecomment-632598191; a new problem then appears (AttributeError: Can't pickle local object 'make_env.<locals>._make_env'), resolved by following https://github.com/tristandeleu/pytorch-maml-rl/issues/51; train.py eventually runs successfully, but test.py fails; bandit-k5-n10 does not converge (with the current hyperparameters);
- https://github.com/cbfinn/maml_rl
- Environment: TensorFlow (the rllab version), CPU;
- Tasks: MuJoCo;
- Model: MAML TRPO (an inner-loop adaptation sketch follows this item);
- Experiments: not run;
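For orientation, both repos above implement MAML's two-level loop: an inner gradient step adapts the policy to a sampled task, and the outer (TRPO) step differentiates through that adaptation. The sketch below illustrates only the inner loop with a plain gradient step instead of TRPO; the loss functions, task batches, and learning rate are illustrative assumptions, not the repos' actual interfaces.

```python
import torch

def adapt(policy_params, inner_loss_fn, train_batch, inner_lr=0.1):
    """One MAML inner-loop step: take a gradient step on one task's loss,
    keeping the graph (create_graph=True) so the meta-update can
    backpropagate through the adaptation."""
    loss = inner_loss_fn(policy_params, train_batch)
    grads = torch.autograd.grad(loss, policy_params, create_graph=True)
    return [p - inner_lr * g for p, g in zip(policy_params, grads)]

def meta_loss(policy_params, tasks, inner_loss_fn, outer_loss_fn):
    """Meta-objective: average post-adaptation loss over a batch of tasks.
    `tasks` is assumed to yield (train_batch, valid_batch) pairs."""
    losses = []
    for train_batch, valid_batch in tasks:
        adapted_params = adapt(policy_params, inner_loss_fn, train_batch)
        losses.append(outer_loss_fn(adapted_params, valid_batch))
    return torch.stack(losses).mean()
```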
4. Evolved Policy Gradients (EPG): NeurIPS 2018
- https://github.com/openai/EPG
- Environment: Chainer, CPU;
- Tasks: MuJoCo;
- Model: EPG PPO;
- Experiments: not run;
5. A Simple Neural Attentive Meta-Learner (SNAIL): ICLR 2018
- https://github.com/chanb/metalearning_RL
- Environment: PyTorch, GPU;
- Tasks: Multi-armed Bandit, Tabular MDP;
- Model: SNAIL, RL2 (GRU) + PPO;
- Experiments: runs successfully without bugs;
6. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL): ICML 2019
- https://github.com/katerakelly/oyster
- Environment: PyTorch, GPU;
- Tasks: MuJoCo;
- Model: PEARL (SAC-based);
- Experiments: during the Docker setup, docker build . -t pearl fails; after abandoning Docker and installing the required packages locally, it runs successfully; when installing locally, run conda config --set restore_free_channel true first, otherwise most of the pinned package versions cannot be found and creating the conda environment fails; for related issues, see the Chains朱朱 homepage on 博客园 (cnblogs.com);
7. Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL): ICLR 2020
- http://louiskirsch.com/code/metagenrl
- Environment: TensorFlow, GPU;
- Tasks: MuJoCo;
- Model: MetaGenRL;
- Experiments: running python ray_experiments.py train hits bugs under both tensorflow-gpu==1.14.0 and tensorflow==1.13.2;
II. RL-Adventure
1. Deep Q-Learning:
- See the earlier blog post;
- https://github.com/Kaixhin/Rainbow
- Environment: PyTorch, GPU;
- Tasks: Atari;
- Model: Rainbow;
- Experiments: runs successfully;
- https://github.com/TianhongDai/hindsight-experience-replay
- Environment: PyTorch, GPU (not recommended; CPU works better);
- Tasks: MuJoCo;
- Model: HER (a goal-relabeling sketch follows this item);
- Experiments: not run;
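As background for the HER entry above: hindsight experience replay augments a replay buffer by relabeling each transition's goal with a goal actually achieved later in the same episode, turning failed trajectories into useful supervision. Below is a minimal sketch of the standard "future" relabeling strategy; the transition layout (dictionary keys) and reward_fn are illustrative assumptions, not the repo's actual data structures.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """Hindsight relabeling ('future' strategy): for each transition, also
    store k copies whose goal is replaced by an achieved goal from a later
    step of the same episode. `episode` is a list of dicts with keys
    'obs', 'action', 'achieved_goal', 'goal' (an assumed layout)."""
    relabeled = []
    for t, transition in enumerate(episode):
        relabeled.append(transition)  # keep the original transition
        for _ in range(k):
            future = random.randint(t, len(episode) - 1)
            new_goal = episode[future]['achieved_goal']
            relabeled.append(dict(
                transition,
                goal=new_goal,
                reward=reward_fn(transition['achieved_goal'], new_goal),
            ))
    return relabeled
```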
2. Policy Gradients:
- https://github.com/higgsfield/RL-Adventure-2
- Environment: PyTorch, GPU;
- Tasks: Gym;
- Model: A2C, GAE, PPO, ACER, DDPG, TD3, SAC, GAIL, HER;
- Experiments: runs successfully; the local code modifies it based on bugs, issues, and performance, see https://github.com/lucifer2859/Policy-Gradients; in the local code, all models (except HER) converge and achieve good performance; for the HER problem see https://github.com/higgsfield/RL-Adventure-2/issues/14; the SAC implementation seems to differ from the paper (see https://github.com/higgsfield/RL-Adventure-2/issues/11); the A2C experiments converge only on CartPole-v0;
- https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail;
- Environment: PyTorch/TensorFlow, GPU;
- Tasks: Atari, MuJoCo, PyBullet (including Racecar, Minitaur and Kuka), DeepMind Control Suite;
- Model: A2C, PPO, ACKTR, GAIL;
- Experiments: not run;
- https://github.com/ikostrikov/pytorch-a3c
- Environment: PyTorch, CPU;
- Tasks: Atari;
- Model: A3C;
- Experiments: initially fails with NotImplementedError; resolved by modifying envs.py as described in https://github.com/ikostrikov/pytorch-a3c/issues/66#issuecomment-559785590; eventually runs successfully;
- https://github.com/haarnoja/sac
- Environment: TensorFlow, GPU;
- Tasks: Continuous Control Tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
- Experiments: not run;
- https://github.com/denisyarats/pytorch_sac
- Environment: PyTorch, GPU;
- Tasks: Continuous Control Tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
- Experiments: not run;
- http://github.com/rail-berkeley/softlearning/
- Environment: TensorFlow, GPU;
- Tasks: Continuous Control Tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, second version, with the state-value function V removed);
- Experiments: not run;
- https://github.com/ku2482/sac-discrete.pytorch
- Environment: PyTorch, GPU;
- Tasks: Atari;
- Model: SAC-Discrete (a discrete-action version adapted from the newer continuous-control SAC; an actor-loss sketch follows this list);
- Experiments: runs successfully; the local code modifies it slightly, see https://github.com/lucifer2859/sac-discrete-pytorch; training converges, but performance differs from what the paper reports;
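To clarify what "adapted from the newer continuous-control SAC" means in the SAC-Discrete entry above: with a finite action set, the expectations in the actor (and temperature) losses can be computed exactly from the policy's action probabilities instead of via reparameterized sampling. Below is a minimal sketch of the actor loss under that idea; the tensor names and shapes are assumptions, not the repo's actual code.

```python
import torch
import torch.nn.functional as F

def discrete_sac_actor_loss(logits, q1_values, q2_values, alpha):
    """Actor loss for discrete SAC, computed as an exact expectation over
    actions: E_{a~pi}[ alpha * log pi(a|s) - min(Q1, Q2)(s, a) ].
    All inputs are assumed to have shape (batch, num_actions)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    q_min = torch.min(q1_values, q2_values)
    return (probs * (alpha * log_probs - q_min)).sum(dim=-1).mean()
```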
3. Both value-based and policy-gradient methods:
- https://github.com/ShangtongZhang/DeepRL
- Environment: PyTorch, GPU;
- Tasks: Atari, MuJoCo;
- Model: (Double/Dueling/Prioritized) DQN, C51, QR-DQN, (Continuous/Discrete) Synchronous Advantage A2C, N-Step DQN, DDPG, PPO, OC, TD3, COF-PAC, GradientDICE, Bi-Res-DDPG, DAC, Geoff-PAC, QUOTA, ACE;
- Experiments: not run;
- https://github.com/astooke/rlpyt
- Environment: PyTorch, GPU;
- Tasks: Atari;
- Model: modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient.
- Policy Gradient: A2C, PPO.
- Replay Buffers (supporting both DQN + QPG): non-sequence and sequence (for recurrent) replay, n-step returns, uniform or prioritized replay, full-observation or frame-based buffers (e.g. for Atari, store only unique frames to save memory and reconstruct multi-frame observations).
- Deep Q-Learning: DQN + variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
- Q-Function Policy Gradient: DDPG, TD3, SAC.
- Experiments: runs successfully without bugs;
- https://github.com/vitchyr/rlkit
- Environment: PyTorch, GPU;
- Tasks: gym[all];
- Model: Skew-Fit, RIG, TDM, HER, DQN, SAC (newer version), TD3, AWAC;
- Experiments: not run;
- https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
- Environment: PyTorch;
- Tasks: CartPole, MountainCar, Bit Flipping, Four Rooms, Long Corridor, Ant-[Maze, Push, Fall];
- Model: DQN, DQN with Fixed Q Target, DDQN, DDQN with Prioritised Experience Replay, Dueling DDQN, REINFORCE, DDPG, TD3, SAC, SAC-Discrete, A3C, A2C, PPO, DQN-HER, DDPG-HER, h-DQN, Stochastic NN-HRL, DIAYN;
- Experiments: some models run successfully on some tasks (e.g., SAC-Discrete cannot be run successfully on Atari);
- https://github.com/hill-a/stable-baselines
- Environment: TensorFlow;
- https://github.com/openai/baselines
- Environment: TensorFlow;
- https://github.com/openai/spinningup
- Environment: TensorFlow/PyTorch;
- Description: This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL). For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning. This module contains a variety of helpful resources, including:
- a short introduction to RL terminology, kinds of algorithms, and basic theory,
- an essay about how to grow into an RL research role,
- a curated list of important papers organized by topic,
- a well-documented code repo of short, standalone implementations of key algorithms,
- and a few exercises to serve as warm-ups.
III. Meta Learning (Learn to Learn)
1. Platform: