• Reinforcement Learning Framework RLlib Tutorial 002: Training APIs (1) - Quick Start and Configuration Options


    Contents

      Getting Started

      Evaluating Trained Policies

      Specifying Parameters

      Specifying Resources

      Scaling Guide

      Common Parameters

      Tuned Examples

      References


    Getting Started

    At a high level, RLlib provides a Trainer class that holds a policy for interacting with the environment. Through the trainer interface you can train the policy, save and restore checkpoints, and compute actions. In multi-agent training, the trainer manages both the querying (computing outputs from inputs) and the optimization (training the policy networks) of multiple policies at once.

    As the architecture diagram from the RLlib docs (not reproduced here) shows: a Trainer exposes the methods train (train the policy), save (save a checkpoint), restore (restore from a checkpoint), and compute_action (compute an action). The Trainer holds the policies and optimizers, while the Workers on the right interact with the environment to collect data; the whole process runs on the distributed execution engine Ray.
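
    The same flow through the Python API, as a minimal sketch (assuming `pip install 'ray[rllib]'` and the Gym CartPole environment; names follow the Ray 1.x Trainer API):

    import ray
    import gym
    from ray.rllib.agents.dqn import DQNTrainer

    ray.init()
    trainer = DQNTrainer(env="CartPole-v0", config={"num_workers": 1})

    result = trainer.train()       # run one training iteration
    checkpoint = trainer.save()    # save a checkpoint; returns its path
    trainer.restore(checkpoint)    # restore the trainer from that checkpoint

    # compute a single action for one observation
    obs = gym.make("CartPole-v0").reset()
    action = trainer.compute_action(obs)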

     

    We can train a DQN Trainer with one simple command:

    rllib train --run DQN --env CartPole-v0  # --eager [--trace] for eager execution

     

    By default, training logs are saved under ~/ray_results. There, params.json holds the training hyperparameters, result.json holds a summary of each training episode, and files for TensorBoard visualization are written as well.
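
    To monitor these results during training, point TensorBoard at the results directory (assuming TensorBoard is installed):

    tensorboard --logdir=~/ray_results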

     

    The `rllib train` command takes a number of options, the same ones as the repository's train.py script (from ray.rllib import train).

     

    A few of the most important options:

    --env (the environment to use: any Gym environment, or one registered by the user)

    --run (the algorithm to use: SAC, PPO, PG, A2C, A3C, IMPALA, ES, DDPG, DQN, MARWIL, APEX, and APEX_DDPG)
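
    Run `rllib train --help` for the full list of options.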

     Back to Contents

    Evaluating Trained Policies

    To save checkpoints for later evaluation, pass --checkpoint-freq when running train (a checkpoint is saved once every that many training iterations):

     

    rllib train --run DQN --env CartPole-v0 --checkpoint-freq 10

    After running this command, checkpoint directories such as checkpoint_10/ are written under the trial's log directory in ~/ray_results.
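
    The train command also accepts --checkpoint-at-end, which saves one final checkpoint when training stops.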

     

    To evaluate a saved checkpoint:

    export CUDA_VISIBLE_DEVICES=3

    rllib rollout /root/ray_results/default/DQN_CartPole-v0_0_2020-10-03_09-24-37hg7ffl2s/checkpoint_10/checkpoint-10 --run DQN --env CartPole-v0

    The rollout.py script reconstructs the DQN policy from the checkpoint and, since --env is specified, renders its behavior in that environment. The console prints output like the following:

     

    Episode #0: reward: 15.0

    Episode #1: reward: 18.0

    Episode #2: reward: 24.0

    Episode #3: reward: 25.0

    Episode #4: reward: 18.0

    Episode #5: reward: 11.0
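
    The rollout length can be bounded with --steps (total environment steps), for example:

    rllib rollout <checkpoint-path> --run DQN --env CartPole-v0 --steps 2000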

     Back to Contents

    Specifying Parameters

    Each algorithm's hyperparameters can be set with --config.

    For example, to train A2C with 8 workers, pass it through the config flag:

    rllib train --env=PongDeterministic-v4 --run=A2C --config '{"num_workers": 8}'
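
    The equivalent through the Python API, as a sketch using Ray Tune (the stopping criterion here is illustrative):

    import ray
    from ray import tune

    ray.init()
    tune.run(
        "A2C",
        stop={"episode_reward_mean": 20},
        config={
            "env": "PongDeterministic-v4",
            "num_workers": 8,
        },
    )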

     Back to Contents

    Specifying Resources

    For most algorithms, you can control the degree of parallelism with the num_workers hyperparameter. The number of GPUs available to the driver is controlled by num_gpus. Similarly, the resources available to workers are controlled with num_cpus_per_worker, num_gpus_per_worker, and custom_resources_per_worker. The number of GPUs can be fractional; for example, you can train 5 DQN instances on the same GPU by setting num_gpus: 0.2.

     

    For synchronous algorithms like PPO and A2C, the driver and workers can share the same GPU. With n GPUs:

    gpu_count = n

    num_gpus = 0.0001 # Driver GPU

    num_gpus_per_worker = (gpu_count - num_gpus) / num_workers

    num_workers determines how many worker processes are launched; each process can in turn contain multiple subprocesses.
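
    For instance, with gpu_count = 2 and num_workers = 4, the split works out to roughly half a GPU per worker (values illustrative):

    config = {
        "num_workers": 4,
        "num_gpus": 0.0001,                       # tiny GPU share for the driver
        "num_gpus_per_worker": (2 - 0.0001) / 4,  # ~0.5 GPU per worker
    }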

     Back to Contents

    Scaling Guide

    Some rules of thumb for using RLlib:

    1. If the environment is slow and cannot be replicated (e.g., it requires interaction with a physical system), use a sample-efficient off-policy algorithm such as DQN or SAC. These run in a single process by default (num_workers: 0). If you want to use a GPU, make sure num_gpus: 1. If you are considering batch RL, see the offline data API.

    2. If the environment is fast and the model is small (as most RL models are), use a time-efficient algorithm such as PPO, IMPALA, or APEX. These scale by increasing num_workers, and vectorization (num_envs_per_worker) also helps when sampling is inference-bound; see the example command after this list. If you want to use a GPU, make sure num_gpus: 1. If the learner becomes the bottleneck, multi-GPU setups can use num_gpus > 1.

    3. If the model is compute-intensive (e.g., a very deep residual network) and inference is the bottleneck, consider allocating GPUs to the workers with num_gpus_per_worker: 1. If you only have a single GPU, consider num_workers: 0 so that the learner's GPU is used for inference.

    4. If both the model and the environment are compute-intensive, set remote_worker_envs: True (and optionally tune remote_env_batch_wait_ms).
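
    As an example of case 2 above, a fast environment scaled out with several workers and vectorized envs (values illustrative):

    rllib train --run PPO --env CartPole-v0 --config '{"num_workers": 8, "num_envs_per_worker": 4, "num_gpus": 1}'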

     Back to Contents

    Common Parameters

    Below are the common algorithm hyperparameters:

    COMMON_CONFIG: TrainerConfigDict = {
        # === Settings for Rollout Worker processes ===
        # Number of rollout worker actors to create for parallel sampling. Setting
        # this to 0 will force rollouts to be done in the trainer actor.
        "num_workers": 2,
        # Number of environments to evaluate vectorwise per worker. This enables
        # model inference batching, which can improve performance for inference
        # bottlenecked workloads.
        "num_envs_per_worker": 1,
        # Divide episodes into fragments of this many steps each during rollouts.
        # Sample batches of this size are collected from rollout workers and
        # combined into a larger batch of `train_batch_size` for learning.
        #
        # For example, given rollout_fragment_length=100 and train_batch_size=1000:
        #   1. RLlib collects 10 fragments of 100 steps each from rollout workers.
        #   2. These fragments are concatenated and we perform an epoch of SGD.
        #
        # When using multiple envs per worker, the fragment size is multiplied by
        # `num_envs_per_worker`. This is since we are collecting steps from
        # multiple envs in parallel. For example, if num_envs_per_worker=5, then
        # rollout workers will return experiences in chunks of 5*100 = 500 steps.
        #
        # The dataflow here can vary per algorithm. For example, PPO further
        # divides the train batch into minibatches for multi-epoch SGD.
        "rollout_fragment_length": 200,
        # Whether to rollout "complete_episodes" or "truncate_episodes" to
        # `rollout_fragment_length` length unrolls. Episode truncation guarantees
        # evenly sized batches, but increases variance as the reward-to-go will
        # need to be estimated at truncation boundaries.
        "batch_mode": "truncate_episodes",
    
        # === Settings for the Trainer process ===
        # Number of GPUs to allocate to the trainer process. Note that not all
        # algorithms can take advantage of trainer GPUs. This can be fractional
        # (e.g., 0.3 GPUs).
        "num_gpus": 0,
        # Training batch size, if applicable. Should be >= rollout_fragment_length.
        # Samples batches will be concatenated together to a batch of this size,
        # which is then passed to SGD.
        "train_batch_size": 200,
        # Arguments to pass to the policy model. See models/catalog.py for a full
        # list of the available model options.
        "model": MODEL_DEFAULTS,
        # Arguments to pass to the policy optimizer. These vary by optimizer.
        "optimizer": {},
    
        # === Environment Settings ===
        # Discount factor of the MDP.
        "gamma": 0.99,
        # Number of steps after which the episode is forced to terminate. Defaults
        # to `env.spec.max_episode_steps` (if present) for Gym envs.
        "horizon": None,
        # Calculate rewards but don't reset the environment when the horizon is
        # hit. This allows value estimation and RNN state to span across logical
        # episodes denoted by horizon. This only has an effect if horizon != inf.
        "soft_horizon": False,
        # Don't set 'done' at the end of the episode. Note that you still need to
        # set this if soft_horizon=True, unless your env is actually running
        # forever without returning done=True.
        "no_done_at_end": False,
        # Arguments to pass to the env creator.
        "env_config": {},
        # Environment name can also be passed via config.
        "env": None,
        # Unsquash actions to the upper and lower bounds of env's action space
        "normalize_actions": False,
        # Whether to clip rewards during Policy's postprocessing.
        # None (default): Clip for Atari only (r=sign(r)).
        # True: r=sign(r): Fixed rewards -1.0, 1.0, or 0.0.
        # False: Never clip.
        # [float value]: Clip at -value and + value.
        # Tuple[value1, value2]: Clip at value1 and value2.
        "clip_rewards": None,
        # Whether to clip actions to the action space's low/high range spec.
        "clip_actions": True,
        # Whether to use "rllib" or "deepmind" preprocessors by default
        "preprocessor_pref": "deepmind",
        # The default learning rate.
        "lr": 0.0001,
    
        # === Debug Settings ===
        # Whether to write episode stats and videos to the agent log dir. This is
        # typically located in ~/ray_results.
        "monitor": False,
        # Set the ray.rllib.* log level for the agent process and its workers.
        # Should be one of DEBUG, INFO, WARN, or ERROR. The DEBUG level will also
        # periodically print out summaries of relevant internal dataflow (this is
        # also printed out once at startup at the INFO level). When using the
        # `rllib train` command, you can also use the `-v` and `-vv` flags as
        # shorthand for INFO and DEBUG.
        "log_level": "WARN",
        # Callbacks that will be run during various phases of training. See the
        # `DefaultCallbacks` class and `examples/custom_metrics_and_callbacks.py`
        # for more usage information.
        "callbacks": DefaultCallbacks,
        # Whether to attempt to continue training if a worker crashes. The number
        # of currently healthy workers is reported as the "num_healthy_workers"
        # metric.
        "ignore_worker_failures": False,
        # Log system resource metrics to results. This requires `psutil` to be
        # installed for sys stats, and `gputil` for GPU metrics.
        "log_sys_usage": True,
        # Use fake (infinite speed) sampler. For testing only.
        "fake_sampler": False,
    
        # === Deep Learning Framework Settings ===
        # tf: TensorFlow
        # tfe: TensorFlow eager
        # torch: PyTorch
        "framework": "tf",
        # Enable tracing in eager mode. This greatly improves performance, but
        # makes it slightly harder to debug since Python code won't be evaluated
        # after the initial eager pass. Only possible if framework=tfe.
        "eager_tracing": False,
        # Disable eager execution on workers (but allow it on the driver). This
        # only has an effect if eager is enabled.
        "no_eager_on_workers": False,
    
        # === Exploration Settings ===
        # Default exploration behavior, iff `explore`=None is passed into
        # compute_action(s).
        # Set to False for no exploration behavior (e.g., for evaluation).
        "explore": True,
        # Provide a dict specifying the Exploration object's config.
        "exploration_config": {
            # The Exploration class to use. In the simplest case, this is the name
            # (str) of any class present in the `rllib.utils.exploration` package.
            # You can also provide the python class directly or the full location
            # of your class (e.g. "ray.rllib.utils.exploration.epsilon_greedy.
            # EpsilonGreedy").
            "type": "StochasticSampling",
            # Add constructor kwargs here (if any).
        },
        # === Evaluation Settings ===
        # Evaluate with every `evaluation_interval` training iterations.
        # The evaluation stats will be reported under the "evaluation" metric key.
        # Note that evaluation is currently not parallelized, and that for Ape-X
        # metrics are already only reported for the lowest epsilon workers.
        "evaluation_interval": None,
        # Number of episodes to run per evaluation period. If using multiple
        # evaluation workers, we will run at least this many episodes total.
        "evaluation_num_episodes": 10,
        # Internal flag that is set to True for evaluation workers.
        "in_evaluation": False,
        # Typical usage is to pass extra args to evaluation env creator
        # and to disable exploration by computing deterministic actions.
        # IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
        # policy, even if this is a stochastic one. Setting "explore=False" here
        # will result in the evaluation workers not using this optimal policy!
        "evaluation_config": {
            # Example: overriding env_config, exploration, etc:
            # "env_config": {...},
            # "explore": False
        },
        # Number of parallel workers to use for evaluation. Note that this is set
        # to zero by default, which means evaluation will be run in the trainer
        # process. If you increase this, it will increase the Ray resource usage
        # of the trainer since evaluation workers are created separately from
        # rollout workers.
        "evaluation_num_workers": 0,
        # Customize the evaluation method. This must be a function of signature
        # (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
        # Trainer._evaluate() method to see the default implementation. The
        # trainer guarantees all eval workers have the latest policy state before
        # this function is called.
        "custom_eval_function": None,
    
        # === Advanced Rollout Settings ===
        # Use a background thread for sampling (slightly off-policy, usually not
        # advisable to turn on unless your env specifically requires it).
        "sample_async": False,
    
        # Experimental flag to speed up sampling and use "trajectory views" as
        # generic ModelV2 `input_dicts` that can be requested by the model to
        # contain different information on the ongoing episode.
        # NOTE: Only supported for PyTorch so far.
        "_use_trajectory_view_api": False,
    
        # Element-wise observation filter, either "NoFilter" or "MeanStdFilter".
        "observation_filter": "NoFilter",
        # Whether to synchronize the statistics of remote filters.
        "synchronize_filters": True,
        # Configures TF for single-process operation by default.
        "tf_session_args": {
            # note: overridden by `local_tf_session_args`
            "intra_op_parallelism_threads": 2,
            "inter_op_parallelism_threads": 2,
            "gpu_options": {
                "allow_growth": True,
            },
            "log_device_placement": False,
            "device_count": {
                "CPU": 1
            },
            "allow_soft_placement": True,  # required by PPO multi-gpu
        },
        # Override the following tf session args on the local worker
        "local_tf_session_args": {
            # Allow a higher level of parallelism by default, but not unlimited
            # since that can cause crashes with many concurrent drivers.
            "intra_op_parallelism_threads": 8,
            "inter_op_parallelism_threads": 8,
        },
        # Whether to LZ4 compress individual observations
        "compress_observations": False,
        # Wait for metric batches for at most this many seconds. Those that
        # have not returned in time will be collected in the next train iteration.
        "collect_metrics_timeout": 180,
        # Smooth metrics over this many episodes.
        "metrics_smoothing_episodes": 100,
        # If using num_envs_per_worker > 1, whether to create those new envs in
        # remote processes instead of in the same worker. This adds overheads, but
        # can make sense if your envs can take much time to step / reset
        # (e.g., for StarCraft). Use this cautiously; overheads are significant.
        "remote_worker_envs": False,
        # Timeout that remote workers are waiting when polling environments.
        # 0 (continue when at least one env is ready) is a reasonable default,
        # but optimal value could be obtained by measuring your environment
        # step / reset and model inference perf.
        "remote_env_batch_wait_ms": 0,
        # Minimum time per train iteration (frequency of metrics reporting).
        "min_iter_time_s": 0,
        # Minimum env steps to optimize for per train call. This value does
        # not affect learning, only the length of train iterations.
        "timesteps_per_iteration": 0,
        # This argument, in conjunction with worker_index, sets the random seed of
        # each worker, so that identically configured trials will have identical
        # results. This makes experiments reproducible.
        "seed": None,
        # Any extra python env vars to set in the trainer process, e.g.,
        # {"OMP_NUM_THREADS": "16"}
        "extra_python_environs_for_driver": {},
        # The extra python environments need to set for worker processes.
        "extra_python_environs_for_worker": {},
    
        # === Advanced Resource Settings ===
        # Number of CPUs to allocate per worker.
        "num_cpus_per_worker": 1,
        # Number of GPUs to allocate per worker. This can be fractional. This is
        # usually needed only if your env itself requires a GPU (i.e., it is a
        # GPU-intensive video game), or model inference is unusually expensive.
        "num_gpus_per_worker": 0,
        # Any custom Ray resources to allocate per worker.
        "custom_resources_per_worker": {},
        # Number of CPUs to allocate for the trainer. Note: this only takes effect
        # when running in Tune. Otherwise, the trainer runs in the main program.
        "num_cpus_for_driver": 1,
        # You can set these memory quotas to tell Ray to reserve memory for your
        # training run. This guarantees predictable execution, but the tradeoff is
        # if your workload exceeds the memory quota it will fail.
        # Heap memory to reserve for the trainer process (0 for unlimited). This
        # can be large if you are using large train batches, replay buffers, etc.
        "memory": 0,
        # Object store memory to reserve for the trainer process. Being large
        # enough to fit a few copies of the model weights should be sufficient.
        # This is enabled by default since models are typically quite small.
        "object_store_memory": 0,
        # Heap memory to reserve for each worker. Should generally be small unless
        # your environment is very heavyweight.
        "memory_per_worker": 0,
        # Object store memory to reserve for each worker. This only needs to be
        # large enough to fit a few sample batches at a time. This is enabled
        # by default since it almost never needs to be larger than ~200MB.
        "object_store_memory_per_worker": 0,
    
        # === Offline Datasets ===
        # Specify how to generate experiences:
        #  - "sampler": generate experiences via online simulation (default)
        #  - a local directory or file glob expression (e.g., "/tmp/*.json")
        #  - a list of individual file paths/URIs (e.g., ["/tmp/1.json",
        #    "s3://bucket/2.json"])
        #  - a dict with string keys and sampling probabilities as values (e.g.,
        #    {"sampler": 0.4, "/tmp/*.json": 0.4, "s3://bucket/expert.json": 0.2}).
        #  - a function that returns a rllib.offline.InputReader
        "input": "sampler",
        # Specify how to evaluate the current policy. This only has an effect when
        # reading offline experiences. Available options:
        #  - "wis": the weighted step-wise importance sampling estimator.
        #  - "is": the step-wise importance sampling estimator.
        #  - "simulation": run the environment in the background, but use
        #    this data for evaluation only and not for learning.
        "input_evaluation": ["is", "wis"],
        # Whether to run postprocess_trajectory() on the trajectory fragments from
        # offline inputs. Note that postprocessing will be done using the *current*
        # policy, not the *behavior* policy, which is typically undesirable for
        # on-policy algorithms.
        "postprocess_inputs": False,
        # If positive, input batches will be shuffled via a sliding window buffer
        # of this number of batches. Use this if the input data is not in random
        # enough order. Input is delayed until the shuffle buffer is filled.
        "shuffle_buffer_size": 0,
        # Specify where experiences should be saved:
        #  - None: don't save any experiences
        #  - "logdir" to save to the agent log dir
        #  - a path/URI to save to a custom output directory (e.g., "s3://bucket/")
        #  - a function that returns a rllib.offline.OutputWriter
        "output": None,
        # What sample batch columns to LZ4 compress in the output data.
        "output_compress_columns": ["obs", "new_obs"],
        # Max output file size before rolling over to a new file.
        "output_max_file_size": 64 * 1024 * 1024,
    
        # === Settings for Multi-Agent Environments ===
        "multiagent": {
            # Map of type MultiAgentPolicyConfigDict from policy ids to tuples
            # of (policy_cls, obs_space, act_space, config). This defines the
            # observation and action spaces of the policies and any extra config.
            "policies": {},
            # Function mapping agent ids to policy ids.
            "policy_mapping_fn": None,
            # Optional list of policies to train, or None for all policies.
            "policies_to_train": None,
            # Optional function that can be used to enhance the local agent
            # observations to include more state.
            # See rllib/evaluation/observation_function.py for more info.
            "observation_fn": None,
            # When replay_mode=lockstep, RLlib will replay all the agent
            # transitions at a particular timestep together in a batch. This allows
            # the policy to implement differentiable shared computations between
            # agents it controls at that timestep. When replay_mode=independent,
            # transitions are replayed independently per policy.
            "replay_mode": "independent",
        },
    
        # === Logger ===
        # Define logger-specific configuration to be used inside Logger
        # Default value None allows overwriting with nested dicts
        "logger_config": None,
    
        # === Replay Settings ===
        # The number of contiguous environment steps to replay at once. This may
        # be set to greater than 1 to support recurrent models.
        "replay_sequence_length": 1,
    }

     Back to Contents

    Tuned Examples

    Tuned hyperparameter settings and configurations can be found in the repository (some of them were tuned on GPUs):

    https://github.com/ray-project/ray/tree/master/rllib/tuned_examples

    You can run them like this:

    rllib train -f /path/to/tuned/example.yaml
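
    Each file is a Tune experiment spec: it names the environment and algorithm and sets stopping criteria and config overrides, which `rllib train -f` then runs directly.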

     Back to Contents

    References

    https://docs.ray.io/en/latest/rllib-training.html

     Back to Contents
