Reinforcement Learning in Practice


    Trying to piece together hands-on reinforcement learning material scattered across the web is exhausting, especially the introductions to OpenAI Gym. cnblogs, GitHub, CSDN, zhuanlan.zhihu, WeChat public accounts, Google, YouTube and so on are all very fragmented. The public courses mainly cover theory and say very little about code implementation. For a newcomer like me it feels like groping in the dark, with no idea where to start.

    Learning materials I am currently using:

    1. Official documentation

    https://gym.openai.com/docs/

    2. Zhihu columns

    https://zhuanlan.zhihu.com/reinforce (and others)

    3. GitHub

    https://github.com/openai/gym

    https://github.com/dennybritz/reinforcement-learning

    4. Personal blogs and YouTube

    莫烦Python (Morvan Zhou's tutorials)

    This post serves as a scratchpad: a place to collect the key points from these materials along with my own implementations. The rough plan is to first implement some of the basic algorithms from Silver's course or the control examples in gym, and only afterwards move on to deep learning and DRL, in that order.

    4/9/2018: Implemented iterative policy evaluation for a grid world

    Learning material:

    https://zhuanlan.zhihu.com/p/28084990

    My code (my gridworld problem differs slightly from the one in the column: there is only a single exit at the bottom-right corner, and I implemented it with a table and matrix-style updates):

    https://github.com/ysgclight/RL_Homework/blob/master/4x4gridworld.py
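    For reference, a minimal sketch of iterative policy evaluation in that table style, under the assumptions I use for my variant: a 4x4 grid, an equiprobable random policy, reward -1 per step, discount 1, and a single terminal state at the bottom-right corner (the file linked above may differ in details):

    import numpy as np

    # Iterative policy evaluation on a 4x4 gridworld (sketch).
    # Assumptions: equiprobable random policy, reward -1 per step,
    # gamma = 1, single terminal state at the bottom-right corner.
    N = 4
    gamma = 1.0
    theta = 1e-4                                   # convergence threshold
    terminal = (N - 1, N - 1)
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    V = np.zeros((N, N))
    while True:
        delta = 0.0
        V_new = V.copy()
        for i in range(N):
            for j in range(N):
                if (i, j) == terminal:
                    continue
                v = 0.0
                for di, dj in actions:
                    # moves that would leave the grid keep the agent in place
                    ni = min(max(i + di, 0), N - 1)
                    nj = min(max(j + dj, 0), N - 1)
                    v += 0.25 * (-1.0 + gamma * V[ni, nj])
                V_new[i, j] = v
                delta = max(delta, abs(v - V[i, j]))
        V = V_new
        if delta < theta:
            break

    print(np.round(V, 1))                          # converged state values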

    4/10/2018: Learning the ABCs of gym from scratch

    1. Key points:

    https://zhuanlan.zhihu.com/p/28086233 

     

     

    A small question: what exactly is the difference between an env and a space?

    2. Inside the gym source:

    mountain_car.py shows how the observation space and action space are described.

    core.py contains the following important classes:

    class Env(object):
        """The main OpenAI Gym class. It encapsulates an environment with
        arbitrary behind-the-scenes dynamics. An environment can be
        partially or fully observed.
    
        The main API methods that users of this class need to know are:
    
            step
            reset
            render
            close
            seed
    
        And set the following attributes:
    
            action_space: The Space object corresponding to valid actions
            observation_space: The Space object corresponding to valid observations
            reward_range: A tuple corresponding to the min and max possible rewards
    
        Note: a default reward range set to [-inf,+inf] already exists. Set it if you want a narrower range.
    
        The methods are accessed publicly as "step", "reset", etc.. The
        non-underscored versions are wrapper methods to which we may add
        functionality over time.
        """
        def step(self, action):
            """Run one timestep of the environment's dynamics. When end of
            episode is reached, you are responsible for calling `reset()`
            to reset this environment's state.
    
            Accepts an action and returns a tuple (observation, reward, done, info).
    
            Args:
            action (object): an action provided by the agent
    
            Returns:
                observation (object): agent's observation of the current environment
                reward (float) : amount of reward returned after previous action
                done (boolean): whether the episode has ended, in which case further step() calls will return undefined results
                info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
            """
            raise NotImplementedError
    
        def reset(self):
            """Resets the state of the environment and returns an initial observation.
    
            Returns: observation (object): the initial observation of the
                space.
            """
            raise NotImplementedError
    
        def render(self, mode='human'):
            """Renders the environment.
    
            The set of supported modes varies per environment. (And some
            environments do not support rendering at all.) By convention,
            if mode is:
    
            - human: render to the current display or terminal and
              return nothing. Usually for human consumption.
            - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
              representing RGB values for an x-by-y pixel image, suitable
              for turning into a video.
            - ansi: Return a string (str) or StringIO.StringIO containing a
              terminal-style text representation. The text can include newlines
              and ANSI escape sequences (e.g. for colors).
    
            Note:
                Make sure that your class's metadata 'render.modes' key includes
                  the list of supported modes. It's recommended to call super()
                  in implementations to use the functionality of this method.
    
            Args:
                mode (str): the mode to render with
                close (bool): close all open renderings
    
            Example:
    
            class MyEnv(Env):
                metadata = {'render.modes': ['human', 'rgb_array']}
    
                def render(self, mode='human'):
                    if mode == 'rgb_array':
                        return np.array(...) # return RGB frame suitable for video
                    elif mode == 'human':
                        ... # pop up a window and render
                    else:
                        super(MyEnv, self).render(mode=mode) # just raise an exception
            """
            raise NotImplementedError

    class Space(object):
        """Defines the observation and action spaces, so you can write generic
        code that applies to any Env. For example, you can choose a random
        action.
        """

    3. The official gym files:

    envs defines the environment (game) types.

    spaces defines the abstract data spaces for actions and observations (see the sketch below).
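    This also answers the env vs space question above. A small illustration, assuming the stock MountainCar-v0 environment is available: the Env owns the actual dynamics via step/reset/render, while Space objects only describe what valid actions and observations look like and let you sample or test membership:

    import gym

    env = gym.make('MountainCar-v0')
    print(env.action_space)                # Discrete(3): push left, no push, push right
    print(env.observation_space)           # Box(2,): car position and velocity
    action = env.action_space.sample()     # a random valid action; no dynamics involved
    obs = env.reset()                      # only the Env itself produces observations
    print(env.observation_space.contains(obs))   # True: obs lies inside the space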

    4. Customizing gym envs (building your own environment)

    https://zhuanlan.zhihu.com/p/26985029

    Step 0: add the directory where gym was downloaded to the PYTHONPATH environment variable in ~/.bashrc, otherwise the steps below will raise errors.

    Step 1: write a new Env subclass under gym/gym/envs/classic_control.

    It must include at least the following methods: __init__(self), step, reset, render (a rough skeleton is sketched below).
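    A rough skeleton of such a subclass, following the tutorial's GridEnv name and the old (2018-era) gym API; the actual grid dynamics and drawing are omitted and the spaces are only placeholders:

    # grid_mdp.py (sketch)
    import gym
    from gym import spaces
    from gym.utils import seeding

    class GridEnv(gym.Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def __init__(self):
            self.observation_space = spaces.Discrete(16)   # 4x4 grid cells (placeholder)
            self.action_space = spaces.Discrete(4)          # up/down/left/right
            self.state = None
            self.seed()                    # the self.seed() call mentioned in the notes below

        def seed(self, seed=None):
            # copied from cartpole.py, as described in the notes below
            self.np_random, seed = seeding.np_random(seed)
            return [seed]

        def step(self, action):
            # ... apply the grid dynamics here ...
            reward, done, info = -1.0, False, {}
            return self.state, reward, done, info

        def reset(self):
            self.state = 0
            return self.state

        def render(self, mode='human'):
            pass                           # draw the grid here

    # In gym/envs/__init__.py, an entry roughly like this registers the id
    # ('GridEnv-v0' matches the id I use in the notes below):
    # register(
    #     id='GridEnv-v0',
    #     entry_point='gym.envs.classic_control:GridEnv',
    # )
    # and gym/envs/classic_control/__init__.py also needs roughly:
    # from gym.envs.classic_control.grid_mdp import GridEnv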

    This is by far the most painful part; I spent a whole evening debugging and still did not get it fully working.

    One reason is that gym has changed since the tutorial was written, so its code is out of date and raises errors when run!

    Note that I made the following changes:

    * When adding the register() entry, the contents inside the parentheses needed an extra tab of indentation.

    * In GridEnv's __init__, added a self.seed() call and copied the seed() method definition from cartpole.py, importing seeding at the top just as cartpole does.

    * In registration.py, commented out def patch_deprecated_methods(env) (lines 174-191).

    * In grid_mdp.py, removed the leading underscore from all of the main methods.

    * After quitting the current Python session in the terminal and starting a new one, I can import gym, create an env for the newly added GridEnv-v0, and then reset and render run without errors; render actually works. However, after quit() it reports TypeError: 'NoneType' object is not iterable.

    * Running the same program from another file shows no render output, only the TypeError above.

    * When running the official example code in VS Code


    import gym
    env = gym.make('CartPole-v0')
    env.reset()
    for _ in range(1000):
      env.render()
      env.step(env.action_space.sample()) # take a random action
     
    the same kind of error also pops up, but the render window is shown for a while. After switching to the gridworld env, the render flashes by and is gone before you can see anything; you need to insert a time.sleep(4) between env.render and env.step to force a pause (and, of course, import time), as in the sketch below.
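    Roughly like this (assuming the custom environment is registered under the id GridEnv-v0):

    import time
    import gym

    env = gym.make('GridEnv-v0')
    env.reset()
    for _ in range(10):
        env.render()
        time.sleep(4)                          # pause so each frame stays visible
        env.step(env.action_space.sample())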
     
     
     

    https://zhuanlan.zhihu.com/p/36506567
