

    Deep Reinforcement Learning Hands-On——Tabular Learning and the Bellman Equation

    Author: 凯鲁嘎吉 - 博客园 http://www.cnblogs.com/kailugaji/

    For more, see: Reinforcement Learning - 随笔分类 - 凯鲁嘎吉 - 博客园 https://www.cnblogs.com/kailugaji/category/2038931.html

    Code for this post: https://github.com/kailugaji/Hands-on-Reinforcement-Learning/tree/main/01%20Tabular%20Learning%20and%20the%20Bellman%20Equation

        This post is based on Chapters 5 and 6 of the book Deep Reinforcement Learning Hands-On, Second Edition. It studies the two Bellman optimality equations, the optimal state-value function $V^{*}(s)=\max_{a}\mathbb{E}_{s'\sim p(s'|s,a)}[r(s,a,s')+\gamma V^{*}(s')]$ and the optimal action-value function $Q^{*}(s,a)=\mathbb{E}_{s'\sim p(s'|s,a)}[r(s,a,s')+\gamma \max_{a'}Q^{*}(s',a')]$, and implements the corresponding Value Iteration, Q Iteration, and Q-Learning algorithms in Python. The value table built by value iteration is keyed by state only, whereas the table built by Q iteration is keyed by both state and action. The environment is FrozenLake-v1, where S is the initial state (the start), F is the frozen lake, H is a hole, and G is the goal; the agent has to learn to walk from the start to the goal without falling into a hole.
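
    As a quick orientation, here is a minimal sketch (an illustrative snippet, not from the book or the original post, assuming the gym 0.21 API listed below) that creates FrozenLake-v1, prints its text map, and shows the sizes of its state and action spaces:

    import gym

    env = gym.make("FrozenLake-v1")  # 4x4 slippery grid by default
    env.reset()
    env.render()  # prints the S/F/H/G map as text
    print("number of states:", env.observation_space.n)  # 16 grid cells
    print("number of actions:", env.action_space.n)      # 4 moves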

    Because the value tables are built from randomly sampled actions, the results differ from run to run. The package versions used are:

    # packages in environment at D:\ProgramData\Anaconda3\envs\RL:
    #
    _pytorch_select           1.2.0                       gpu    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    absl-py                   1.0.0                     <pip>
    ale-py                    0.7.3                     <pip>
    astunparse                1.6.3                     <pip>
    atari-py                  1.2.2                     <pip>
    backcall                  0.2.0              pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    blas                      1.0                         mkl    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    Box2D                     2.3.10                    <pip>
    box2d-py                  2.3.8                     <pip>
    ca-certificates           2021.10.26           haa95532_2    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    cached-property           1.5.2                     <pip>
    cachetools                5.0.0                     <pip>
    certifi                   2020.6.20                py37_0    anaconda
    cffi                      1.15.0           py37h2bbff1b_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    charset-normalizer        2.0.11                    <pip>
    cloudpickle               2.0.0                     <pip>
    colorama                  0.4.4              pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    cudatoolkit               10.1.243             h74a9793_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    cycler                    0.11.0                    <pip>
    Cython                    0.29.26                   <pip>
    decorator                 5.1.0              pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    fasteners                 0.16.3                    <pip>
    ffmpeg                    1.4                       <pip>
    flatbuffers               2.0                       <pip>
    fonttools                 4.28.5                    <pip>
    freetype                  2.10.4               hd328e21_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    gast                      0.5.3                     <pip>
    ghostscript               0.7                       <pip>
    glfw                      2.5.0                     <pip>
    google-auth               2.6.0                     <pip>
    google-auth-oauthlib      0.4.6                     <pip>
    google-pasta              0.2.0                     <pip>
    grpcio                    1.43.0                    <pip>
    gym                       0.21.0                    <pip>
    h5py                      3.6.0                     <pip>
    idna                      3.3                       <pip>
    imageio                   2.13.5                    <pip>
    importlib-metadata        2.0.0                      py_1    anaconda
    importlib-metadata        4.10.0                    <pip>
    importlib-resources       5.4.0                     <pip>
    intel-openmp              2019.4                      245    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    ipython                   7.29.0           py37hd4e2768_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    jedi                      0.18.0           py37haa95532_1    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    jpeg                      9b                   hb83a4c4_2    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    keras                     2.8.0                     <pip>
    Keras-Preprocessing       1.1.2                     <pip>
    kiwisolver                1.3.2                     <pip>
    libclang                  13.0.0                    <pip>
    libmklml                  2019.0.5             haa95532_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    libpng                    1.6.37               h2a8f88b_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    libtiff                   4.2.0                hd0e1b90_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    libuv                     1.40.0               he774522_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    libwebp                   1.2.0                h2bbff1b_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    lockfile                  0.12.2                    <pip>
    lz4-c                     1.9.3                h2bbff1b_1    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    Markdown                  3.3.6                     <pip>
    matplotlib                3.5.1                     <pip>
    matplotlib-inline         0.1.2              pyhd3eb1b0_2    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    mkl                       2019.4                      245    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    mkl-service               2.3.0            py37h196d8e1_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    mkl_fft                   1.3.0            py37h46781fe_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    mkl_random                1.1.0            py37h675688f_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    mujoco-py                 1.50.1.68                 <pip>
    ninja                     1.10.2           py37h559b2a2_3    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    numpy                     1.19.2           py37hadc3359_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    numpy                     1.21.5                    <pip>
    numpy-base                1.19.2           py37ha3acd2a_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    oauthlib                  3.2.0                     <pip>
    olefile                   0.46               pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    opencv-python             4.5.5.62                  <pip>
    openssl                   1.0.2t           vc14h62dcd97_0  [vc14]  anaconda
    opt-einsum                3.3.0                     <pip>
    packaging                 21.3                      <pip>
    parso                     0.8.3              pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    pickleshare               0.7.5           pyhd3eb1b0_1003    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    Pillow                    9.0.0                     <pip>
    pillow                    8.4.0            py37hd45dc43_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    pip                       20.2.4                   py37_0    anaconda
    prompt-toolkit            3.0.20             pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    protobuf                  3.19.4                    <pip>
    pyasn1                    0.4.8                     <pip>
    pyasn1-modules            0.2.8                     <pip>
    pycparser                 2.21               pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    pyglet                    1.5.21                    <pip>
    pygments                  2.10.0             pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    pyparsing                 3.0.6                     <pip>
    python                    3.7.1                h33f27b4_4    anaconda
    python-dateutil           2.8.2                     <pip>
    pytorch                   1.7.1           py3.7_cuda101_cudnn7_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
    requests                  2.27.1                    <pip>
    requests-oauthlib         1.3.1                     <pip>
    rsa                       4.8                       <pip>
    setuptools                50.3.0           py37h9490d1a_1    anaconda
    six                       1.16.0             pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    sqlite                    3.20.1           vc14h7ce8c62_1  [vc14]  anaconda
    swig                      3.0.12               h047fa9f_3    anaconda
    tensorboard               2.8.0                     <pip>
    tensorboard-data-server   0.6.1                     <pip>
    tensorboard-plugin-wit    1.8.1                     <pip>
    tensorboardX              2.4.1                     <pip>
    tensorflow                2.8.0                     <pip>
    tensorflow-io-gcs-filesystem 0.24.0                    <pip>
    termcolor                 1.1.0                     <pip>
    tf-estimator-nightly      2.8.0.dev2021122109           <pip>
    tk                        8.6.11               h2bbff1b_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    torchaudio                0.7.2                      py37    http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
    torchvision               0.8.2                py37_cu101    http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
    traitlets                 5.1.1              pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    typing_extensions         4.0.1                     <pip>
    typing_extensions         3.10.0.2           pyh06a4308_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    urllib3                   1.26.8                    <pip>
    vc                        14.2                 h21ff451_1    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    vs2015_runtime            14.27.29016          h5e58377_2    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    wcwidth                   0.2.5              pyhd3eb1b0_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    Werkzeug                  2.0.2                     <pip>
    wheel                     0.37.0             pyhd3eb1b0_1    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    wincertstore              0.2                      py37_0    anaconda
    wrappers                  0.1.9                     <pip>
    wrapt                     1.13.3                    <pip>
    xz                        5.2.5                h62dcd97_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    zipp                      3.7.0                     <pip>
    zipp                      3.3.1                      py_0    anaconda
    zlib                      1.2.11               h8cc25b3_4    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    zstd                      1.4.9                h19a0ad4_0    http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

    1. Value Iteration

    1.1 Algorithm Flow
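
    Reconstructed from the step numbers referenced in the code comments below (a summary, not the book's exact flow chart), one training iteration is roughly: (1) play 100 random steps to fill the reward table $r(s,a,s')$ and the transition counters; (2-4) sweep over every state and set $V(s)$ to the largest action value; (5) compute $Q(s,a)$ from the estimated model; (6) act greedily with $a=\arg \max_{a}Q(s,a)$. The update in step 3 uses transition probabilities $\hat{p}(s'|s,a)$ estimated from the counters:

    $V(s)\leftarrow \max_{a}\sum\nolimits_{s'}{\hat{p}(s'|s,a)\left[ r(s,a,s')+\gamma V(s') \right]}$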

    1.2 Python Program

    #!/usr/bin/env python3
    # -*- coding=utf-8 -*-
    # Value iteration for FrozenLake
    # https://www.cnblogs.com/kailugaji/
    import gym
    import collections
    from tensorboardX import SummaryWriter
    import time

    ENV_NAME = "FrozenLake-v1" # the game environment
    '''
    S: initial state (start)
    F: frozen lake
    H: hole
    G: the goal
    The agent has to learn to walk from the start to the goal without falling into a hole.
    '''
    GAMMA = 0.9 # discount factor
    TEST_EPISODES = 20 # number of test episodes to play

    class Agent: # keeps the tables and holds the functions used in the training loop
        def __init__(self):
            self.env = gym.make(ENV_NAME) # create the environment
            self.state = self.env.reset() # reset the environment
            self.rewards = collections.defaultdict(float)
            self.transits = collections.defaultdict(collections.Counter)
            self.values = collections.defaultdict(float)

        '''
        This function gathers random experience from the environment and updates the
        reward and transition tables. Note that we do not need to wait for the end of
        an episode to start learning; we just execute N steps and remember their
        outcomes. This is one of the differences between value iteration and the
        cross-entropy method, which can learn only from full episodes.
        '''
        def play_n_random_steps(self, count): # play `count` (here 100) random steps to build the reward and transition tables
            for _ in range(count):
                action = self.env.action_space.sample()  # sample a random action
                new_state, reward, is_done, _ = self.env.step(action) # interact with the environment to get the new state and reward
                self.rewards[(self.state, action, new_state)] = reward # reward table: (source state, action, target state)
                self.transits[(self.state, action)][new_state] += 1 # transition table: counts of target states for (state, action), used to estimate probabilities
                self.state = self.env.reset() if is_done else new_state

        def calc_action_value(self, state, action): # step 5: given s and a, compute Q(s, a)
            target_counts = self.transits[(state, action)] # transition table: (state, action)
            total = sum(target_counts.values())
            action_value = 0.0
            for tgt_state, count in target_counts.items():
                reward = self.rewards[(state, action, tgt_state)] # reward table: (source state, action, target state)
                val = reward + GAMMA * self.values[tgt_state] # the value table is keyed only by the target state
                action_value += (count / total) * val # expectation: the action-value function Q(s, a)
            return action_value # Q value

        def select_action(self, state): # step 6: given a state, find the best action
            best_action, best_value = None, None
            for action in range(self.env.action_space.n): # loop over all actions
                action_value = self.calc_action_value(state, action) # step 5: Q value
                if best_value is None or best_value < action_value:
                    best_value = action_value
                    best_action = action
            return best_action # the action that maximizes the Q value, i.e. the best action a = argmax Q(s, a)

        def play_episode(self, env): # play one full episode
            total_reward = 0.0
            state = env.reset() # reset the environment
            while True:
                action = self.select_action(state) # step 6: best action
                # unlike the random action sampling in the post "Using the OpenAI Gym environment on Windows"
                new_state, reward, is_done, _ = env.step(action) # interact with the environment to get the new state and reward
                self.rewards[(state, action, new_state)] = reward # update the reward table
                self.transits[(state, action)][new_state] += 1 # update the transition table
                total_reward += reward
                if is_done:
                    break
                state = new_state
            return total_reward # total reward obtained over the episode

        def value_iteration(self): # the value iteration loop
            # update each state's value with the maximum value over the actions available in that state
            # for every s, π(s) = argmax Q(s, a)
            for state in range(self.env.observation_space.n): # steps 2-4: loop over the state space; the greedy policy maximizes the Q value
                state_values = [
                    self.calc_action_value(state, action) # compute Q(s, a)
                    for action in range(self.env.action_space.n) # loop over the action space
                ]
                self.values[state] = max(state_values) # step 3: for each state, V(s) = max_a Q(s, a)
                # update the V table: the optimal state-value function (Bellman optimality equation)

    if __name__ == "__main__":
        test_env = gym.make(ENV_NAME)
        agent = Agent()
        writer = SummaryWriter(comment="-v-iteration")

        iter_no = 0
        best_reward = 0.0
        while True: # repeat until the average reward over 20 test episodes exceeds 0.8
            iter_no += 1 # iter_no: number of training iterations
            agent.play_n_random_steps(100) # step 1: execute 100 random steps to fill the reward and transition tables
            agent.value_iteration() # steps 2-4: after those 100 steps, run one value iteration sweep over all states to update the V table, which serves as the policy
            # time.sleep(0.1) # slow the display down, otherwise it flashes by
            # test_env.render() # render the current state of the agent and the environment

            reward = 0.0
            for _ in range(TEST_EPISODES): # play 20 test episodes
                reward += agent.play_episode(test_env) # uses steps 5-6; sum of the rewards over the 20 episodes
            reward /= TEST_EPISODES # average reward over the 20 episodes
            writer.add_scalar("reward", reward, iter_no)
            if reward > best_reward:
                print("Best reward updated %.3f -> %.3f" % (
                    best_reward, reward))
                best_reward = reward # record the best reward so far
            if reward > 0.80: # stop once the average test reward exceeds 0.8
                print("Solved in %d iterations!" % iter_no)
                break
        writer.close()
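
    After training, the learned V table can be inspected directly. A hypothetical snippet (not part of the original script) that could be appended at the end of the main block, after writer.close():

        # print the learned state value V(s) for each of the 16 states
        for s in range(test_env.observation_space.n):
            print("V(%2d) = %.3f" % (s, agent.values[s]))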

    1.3 Results

    Best reward updated 0.000 -> 0.100
    Best reward updated 0.100 -> 0.350
    Best reward updated 0.350 -> 0.500
    Best reward updated 0.500 -> 0.600
    Best reward updated 0.600 -> 0.750
    Best reward updated 0.750 -> 0.850
    Solved in 14 iterations!

    2. Q Iteration

    2.1 Algorithm Flow
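
    In brief (summarizing the code below), the difference from value iteration is that the table now stores $Q(s,a)$ for every (state, action) pair rather than $V(s)$, so the best action can be read from the table directly, without calc_action_value(). Each sweep applies the Bellman optimality update with transition probabilities $\hat{p}(s'|s,a)$ estimated from the collected counters:

    $Q(s,a)\leftarrow \sum\nolimits_{s'}{\hat{p}(s'|s,a)\left[ r(s,a,s')+\gamma \max_{a'}Q(s',a') \right]}$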

    2.2 Python Program

    #!/usr/bin/env python3
    # -*- coding=utf-8 -*-
    # Q iteration for FrozenLake
    # 1. The value table has changed. The previous example kept the value of each state, so the key in the
    #    dictionary was just a state. Now we store values of the Q-function, which has two parameters,
    #    state and action, so the key in the value table is a composite (state, action) key.
    # 2. The calc_action_value() function is no longer needed, since the action values are stored directly in the value table.
    # 3. value_iteration() has changed.
    # https://www.cnblogs.com/kailugaji/
    import gym
    import collections
    from tensorboardX import SummaryWriter

    ENV_NAME = "FrozenLake-v1" # the game environment
    '''
    S: initial state (start)
    F: frozen lake
    H: hole
    G: the goal
    The agent has to learn to walk from the start to the goal without falling into a hole.
    '''
    GAMMA = 0.9 # discount factor
    TEST_EPISODES = 20 # number of test episodes to play


    class Agent:
        def __init__(self):
            self.env = gym.make(ENV_NAME) # create the environment
            self.state = self.env.reset() # reset the environment
            self.rewards = collections.defaultdict(float)
            self.transits = collections.defaultdict(collections.Counter)
            self.values = collections.defaultdict(float)

        def play_n_random_steps(self, count): # play `count` (here 100) random steps to build the reward and transition tables
            for _ in range(count):
                action = self.env.action_space.sample() # sample a random action
                new_state, reward, is_done, _ = self.env.step(action) # interact with the environment to get the new state and reward
                self.rewards[(self.state, action, new_state)] = reward # reward table: (source state, action, target state)
                self.transits[(self.state, action)][new_state] += 1 # transition table: (state, action)
                self.state = self.env.reset() if is_done else new_state

        def select_action(self, state): # given state s, a = argmax Q(s, a)
            best_action, best_value = None, None
            for action in range(self.env.action_space.n): # loop over all actions
                action_value = self.values[(state, action)] # the Q table is keyed by both state and action
                if best_value is None or best_value < action_value:
                    best_value = action_value
                    best_action = action
            return best_action # the Q table is built directly; the best action is read from it

        def play_episode(self, env): # play one full episode
            total_reward = 0.0
            state = env.reset() # reset the environment
            while True:
                action = self.select_action(state) # given state s, best action a = argmax Q(s, a)
                new_state, reward, is_done, _ = env.step(action) # interact with the environment to get the new state and reward
                self.rewards[(state, action, new_state)] = reward # update the tables
                self.transits[(state, action)][new_state] += 1
                total_reward += reward
                if is_done:
                    break
                state = new_state # step 8
            return total_reward # total reward obtained over the episode

        def value_iteration(self): # this has changed
            # select the action with the largest Q value, and use that Q value as the value of the target state
            for state in range(self.env.observation_space.n):  # steps 2-10 (step 3: loop over the state space)
                for action in range(self.env.action_space.n): # steps 4-9: loop over the action space
                    action_value = 0.0
                    target_counts = self.transits[(state, action)] # transition table: (state, action)
                    total = sum(target_counts.values())
                    for tgt_state, count in target_counts.items():
                        reward = self.rewards[(state, action, tgt_state)] # reward table: (source state, action, target state)
                        best_action = self.select_action(tgt_state) # for the target state, best action a = argmax Q(s, a)
                        val = reward + GAMMA * self.values[(tgt_state, best_action)] # value table: (target state, best action)
                        action_value += (count / total) * val # expectation: the optimal action-value function Q(s, a), using the best action in the target state
                        # Bellman optimality equation
                    self.values[(state, action)] = action_value # update the Q table: (state, action)

    if __name__ == "__main__":
        test_env = gym.make(ENV_NAME)
        agent = Agent()
        writer = SummaryWriter(comment="-q-iteration")

        iter_no = 0
        best_reward = 0.0
        while True: # repeat until the average reward over 20 test episodes exceeds 0.8
            iter_no += 1 # iter_no: number of training iterations
            agent.play_n_random_steps(100) # step 1: execute 100 random steps to fill the reward and transition tables
            agent.value_iteration() # steps 2-10: after those 100 steps, run one sweep over all states to update the Q table, which serves as the policy
            # time.sleep(0.1) # slow the display down, otherwise it flashes by
            # test_env.render() # render the current state of the agent and the environment

            reward = 0.0
            for _ in range(TEST_EPISODES): # play 20 test episodes
                reward += agent.play_episode(test_env) # sum of the rewards over the 20 episodes
            reward /= TEST_EPISODES # average reward over the 20 episodes
            writer.add_scalar("reward", reward, iter_no)
            if reward > best_reward:
                print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
                best_reward = reward # record the best reward so far
            if reward > 0.80: # stop once the average test reward exceeds 0.8
                print("Solved in %d iterations!" % iter_no)
                break
        writer.close()

    2.3 Results

    Best reward updated 0.000 -> 0.250
    Best reward updated 0.250 -> 0.300
    Best reward updated 0.300 -> 0.500
    Best reward updated 0.500 -> 0.600
    Best reward updated 0.600 -> 0.850
    Solved in 33 iterations!

    3. Tabular Q-Learning

    3.1 Algorithm Flow
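
    In brief (summarizing the code below), tabular Q-learning drops the reward and transition tables entirely: after every environment step $(s,a,r,s')$, the corresponding Q-table entry is blended toward the bootstrapped target with learning rate $\alpha$ = ALPHA = 0.2 (see value_update() below):

    $Q(s,a)\leftarrow (1-\alpha )Q(s,a)+\alpha \left[ r(s,a,s')+\gamma \max_{a'}Q(s',a') \right]$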

    3.2 Python Program

    #!/usr/bin/env python3
    # -*- coding=utf-8 -*-
    # Q-learning for FrozenLake
    # https://www.cnblogs.com/kailugaji/
    # Compared with the previous value iteration examples, this version needs many more iterations to solve the problem.
    # The reason is that the experience obtained during testing is no longer used:
    # in the previous Q iteration example, the periodic tests also updated the statistics behind the Q table.
    # Here the Q values are not touched during testing, which causes more iterations before the environment is solved.
    # Overall, the total number of samples required from the environment is almost the same.
    import gym
    import collections
    from tensorboardX import SummaryWriter

    ENV_NAME = "FrozenLake-v1"
    GAMMA = 0.9 # discount factor
    ALPHA = 0.2 # learning rate (exponential smoothing coefficient)
    TEST_EPISODES = 20 # number of test episodes to play

    class Agent:
        def __init__(self):
            self.env = gym.make(ENV_NAME)
            self.state = self.env.reset()
            self.values = collections.defaultdict(float)

        def sample_env(self): # take one random action in the environment
            action = self.env.action_space.sample()
            old_state = self.state
            new_state, reward, is_done, _ = self.env.step(action)
            self.state = self.env.reset() if is_done else new_state
            return old_state, action, reward, new_state

        def best_value_and_action(self, state): # pick the best value and action from the Q table
            best_value, best_action = None, None
            for action in range(self.env.action_space.n):
                action_value = self.values[(state, action)]
                if best_value is None or best_value < action_value:
                    best_value = action_value
                    best_action = action
            return best_value, best_action

        def value_update(self, s, a, r, next_s): # smoothed (blended) update
            best_v, _ = self.best_value_and_action(next_s)
            new_v = r + GAMMA * best_v # r(s, a, s') + γ * max_a' Q(s', a')
            old_v = self.values[(s, a)]
            self.values[(s, a)] = old_v * (1-ALPHA) + new_v * ALPHA # this is the change: the Q value is blended, so it converges smoothly
            # Q(s, a) <- (1-α) * Q(s, a) + α * (r(s, a, s') + γ * max_a' Q(s', a'))

        def play_episode(self, env): # play one full episode
            total_reward = 0.0
            state = env.reset()
            while True:
                _, action = self.best_value_and_action(state) # given the state, pick the best action from the Q table
                new_state, reward, is_done, _ = env.step(action)
                total_reward += reward
                if is_done:
                    break
                state = new_state
            return total_reward

    if __name__ == "__main__":
        test_env = gym.make(ENV_NAME)
        agent = Agent()
        writer = SummaryWriter(comment="-q-learning")

        iter_no = 0
        best_reward = 0.0
        while True:
            iter_no += 1
            s, a, r, next_s = agent.sample_env() # take one random step in the environment
            agent.value_update(s, a, r, next_s)

            reward = 0.0
            for _ in range(TEST_EPISODES):
                reward += agent.play_episode(test_env)
            reward /= TEST_EPISODES
            writer.add_scalar("reward", reward, iter_no)
            if reward > best_reward:
                print("Best reward updated %.3f -> %.3f" % (
                    best_reward, reward))
                best_reward = reward
            if reward > 0.80:
                print("Solved in %d iterations!" % iter_no)
                break
        writer.close()
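
    Once solved, the greedy policy can be read straight off the Q table. A hypothetical snippet (not part of the original script) that could be appended at the end of the main block, after writer.close():

        # print the greedy action argmax_a Q(s, a) for each of the 16 states
        for s in range(test_env.observation_space.n):
            _, a = agent.best_value_and_action(s)
            print("state %2d -> action %d" % (s, a))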

    3.3 Results

    Best reward updated 0.000 -> 0.200
    Best reward updated 0.200 -> 0.250
    Best reward updated 0.250 -> 0.350
    Best reward updated 0.350 -> 0.500
    Best reward updated 0.500 -> 0.550
    Best reward updated 0.550 -> 0.600
    Best reward updated 0.600 -> 0.650
    Best reward updated 0.650 -> 0.700
    Best reward updated 0.700 -> 0.800
    Best reward updated 0.800 -> 0.850
    Solved in 16682 iterations!

    4. References

    [1] https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition

    [2] 邱锡鹏, Neural Networks and Deep Learning (神经网络与深度学习), 机械工业出版社, https://nndl.github.io/, 2020.

    [3] Reinforcement Learning (强化学习)
