

    Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning

    Reading notes on:
    Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

    If you cannot follow the mathematical notation, have a look here first:

    Temporal-Difference Learning in Brief

    Temporal-difference learning combines ideas from dynamic programming and Monte Carlo methods, and is a core idea of reinforcement learning.

    The term "temporal difference" is not very intuitive. It may help to read it as "learning from the current difference" - that is, learning from the difference observed at the current step.

    A Monte Carlo method simulates (or experiences) an episode and, only after the episode ends, estimates each state's value from the values observed along the episode.
    Temporal-difference learning also simulates (or experiences) an episode, but after each step (or a few steps) it uses the value of the new state to update the estimate of the preceding state.
    A Monte Carlo method can therefore be viewed as temporal-difference learning with the maximum number of steps.
    This chapter covers only one-step temporal-difference learning; multi-step (n-step) temporal-difference learning is covered in the next chapter.

    Mathematical Formulation
    From what we already know: if we can compute a policy \(\pi\)'s value (the state value \(v_{\pi}(s)\) or the action value \(q_{\pi}(s, a)\)), we can improve the policy.
    In the Monte Carlo method, evaluating a policy requires completing an episode and using the episode's return \(G_t\) to estimate state values. The update is:
    Formula: Monte Carlo

    \[V(S_t) \gets V(S_t) + \alpha \delta_t \\ \delta_t = [G_t - V(S_t)] \\ \text{where} \\ \delta_t \text{ - Monte Carlo error} \\ \alpha \text{ - learning step size} \]

    The idea of temporal difference is to estimate a state's value from the value of the next state, which gives an iterative update:
    Formula: TD(0)

    \[V(S_t) \gets V(S_t) + \alpha \delta_t \\ \delta_t = [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \\ \text{where} \\ \delta_t \text{ - TD error} \\ \alpha \text{ - learning step size} \\ \gamma \text{ - reward discount rate} \]

    Note: the book points out that the TD error is not exact (it relies on the current value estimate), whereas the Monte Carlo error is exact. This is worth knowing, but is not elaborated here.
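    To make the contrast concrete, here is a minimal Python sketch of the two update rules (the tabular estimate V is assumed to be a dict or array indexed by state, and G, r, alpha, gamma follow the symbols above):

    ```python
    def mc_update(V, s, G, alpha):
        """Monte Carlo: move V(s) toward the observed episode return G_t."""
        V[s] += alpha * (G - V[s])

    def td0_update(V, s, r, s_next, alpha, gamma):
        """TD(0): move V(s) toward the bootstrapped target R + gamma * V(S')."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    ```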

    Temporal-Difference Learning Methods

    This chapter presents the one-step temporal-difference learning methods; the multi-step (n-step) methods are presented in the next chapter.

    • TD learning of the policy's state value \(v_{\pi}\) (one-step and n-step)
    • On-policy TD learning of the policy's action value \(q_{\pi}\): Sarsa (one-step and n-step)
    • Off-policy TD learning of the policy's action value \(q_{\pi}\): Q-learning (one-step)
    • Double Q-learning (one-step)
    • Off-policy TD learning of the policy's action value \(q_{\pi}\) with importance sampling: Sarsa (n-step)
    • Off-policy TD learning of the policy's action value \(q_{\pi}\) without importance sampling: Tree Backup Algorithm (n-step)
    • Off-policy TD learning of the policy's action value \(q_{\pi}\): \(Q(\sigma)\) (n-step)

    TD Learning of the Policy's State Value \(v_{\pi}\)

    One-step TD learning: TD(0)

    • Flowchart
    • Algorithm

    Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}^+\)
    Repeat (for each episode):
      Initialize \(S\)
      Repeat (for each step of episode):
        \(A \gets\) action given by \(\pi\) for \(S\)
        Take action \(A\), observe \(R, S'\)
        \(V(S) \gets V(S) + \alpha [R + \gamma V(S') - V(S)]\)
        \(S \gets S'\)
      Until \(S\) is terminal
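    Below is a minimal Python sketch of tabular TD(0) policy evaluation corresponding to the pseudocode above. The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and policy(state) are assumptions for illustration, not something defined in the book:

    ```python
    from collections import defaultdict

    def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
        """Tabular TD(0) prediction for a fixed policy."""
        V = defaultdict(float)          # V(s) defaults to 0; V(terminal) stays 0
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                # TD(0) update: bootstrap from the current estimate of V(S')
                target = reward + gamma * (0.0 if done else V[next_state])
                V[state] += alpha * (target - V[state])
                state = next_state
        return V
    ```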

    n-step TD learning

    • Flowchart
    • Algorithm

    Input: the policy \(\pi\) to be evaluated
    Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}\)
    Parameters: step size \(\alpha \in (0, 1]\), a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne \text{terminal}\)
      \(T \gets \infty\)
      For \(t = 0, 1, 2, \cdots\):
        If \(t < T\), then:
          Take an action according to \(\pi(\cdot | S_t)\)
          Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
          If \(S_{t+1}\) is terminal, then \(T \gets t+1\)
        \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
        If \(\tau \ge 0\):
          \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1} R_i\)
          If \(\tau + n \le T\), then: \(G \gets G + \gamma^{n} V(S_{\tau + n}) \qquad (G_{\tau}^{(n)})\)
          \(V(S_{\tau}) \gets V(S_{\tau}) + \alpha [G - V(S_{\tau})]\)
      Until \(\tau = T - 1\)

    Note that \(V(S_0)\) is computed from \(V(S_0), V(S_1), \dots, V(S_n)\); \(V(S_1)\) from \(V(S_1), V(S_2), \dots, V(S_{n+1})\); and so on.
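    A Python sketch of n-step TD prediction that mirrors the pseudocode (full per-episode lists are kept instead of the "index mod n" storage trick, for clarity). The env.reset()/env.step() and policy(state) interfaces are assumed, as in the TD(0) sketch above:

    ```python
    from collections import defaultdict

    def n_step_td(env, policy, n=4, num_episodes=1000, alpha=0.1, gamma=1.0):
        """Tabular n-step TD prediction for a fixed policy."""
        V = defaultdict(float)
        for _ in range(num_episodes):
            states = [env.reset()]      # states[t] holds S_t
            rewards = [0.0]             # rewards[t] holds R_t (R_0 is unused)
            T = float('inf')
            t = 0
            while True:
                if t < T:
                    action = policy(states[t])
                    next_state, reward, done = env.step(action)
                    states.append(next_state)
                    rewards.append(reward)
                    if done:
                        T = t + 1
                tau = t - n + 1         # time whose estimate is being updated
                if tau >= 0:
                    # n-step return G_tau^(n)
                    G = sum(gamma ** (i - tau - 1) * rewards[i]
                            for i in range(tau + 1, min(tau + n, T) + 1))
                    if tau + n < T:     # V(terminal) = 0, so skipping tau + n == T is equivalent
                        G += gamma ** n * V[states[tau + n]]
                    V[states[tau]] += alpha * (G - V[states[tau]])
                if tau == T - 1:
                    break
                t += 1
        return V
    ```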

    On-policy TD Learning of the Action Value \(q_{\pi}\): Sarsa

    One-step TD learning

    • Flowchart
    • Algorithm

    Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(\text{terminal}, \cdot) = 0\)
    Repeat (for each episode):
      Initialize \(S\)
      Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
      Repeat (for each step of episode):
        Take action \(A\), observe \(R, S'\)
        Choose \(A'\) from \(S'\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
        \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]\)
        \(S \gets S'\); \(A \gets A'\)
      Until \(S\) is terminal
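    A Python sketch of one-step Sarsa matching the pseudocode. The env interface and the discrete action list are assumptions for illustration; Q is a table keyed by (state, action):

    ```python
    import random
    from collections import defaultdict

    def epsilon_greedy(Q, state, actions, epsilon):
        """Greedy action w.p. 1 - epsilon, otherwise a uniformly random action."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
        """Tabular one-step (on-policy) Sarsa."""
        Q = defaultdict(float)          # Q(terminal, .) is implicitly 0
        for _ in range(num_episodes):
            state = env.reset()
            action = epsilon_greedy(Q, state, actions, epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                if done:
                    target = reward
                else:
                    next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                    target = reward + gamma * Q[(next_state, next_action)]
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                if not done:
                    state, action = next_state, next_action
        return Q
    ```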

    n-step TD learning

    • Flowchart
    • Algorithm

    Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne \text{terminal}\)
      Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
      \(T \gets \infty\)
      For \(t = 0, 1, 2, \cdots\):
        If \(t < T\), then:
          Take action \(A_t\)
          Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
          If \(S_{t+1}\) is terminal, then:
            \(T \gets t+1\)
          Else:
            Select and store an action \(A_{t+1} \sim \pi(\cdot | S_{t+1})\)
        \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
        If \(\tau \ge 0\):
          \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1} R_i\)
          If \(\tau + n \le T\), then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad (G_{\tau}^{(n)})\)
          \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
          If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
      Until \(\tau = T - 1\)

    Off-policy TD Learning of the Action Value \(q_{\pi}\): Q-learning

    The Q-learning algorithm (Watkins, 1989) was a breakthrough. It performs off-policy learning with the following update:

    \[Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \underset{a}{\max}\, Q(S_{t+1}, a) - Q(S_t, A_t)] \]

    One-step TD learning

    • Algorithm

    Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(\text{terminal}, \cdot) = 0\)
    Repeat (for each episode):
      Initialize \(S\)
      Repeat (for each step of episode):
        Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
        Take action \(A\), observe \(R, S'\)
        \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \underset{a}{\max}\, Q(S', a) - Q(S, A)]\)
        \(S \gets S'\)
      Until \(S\) is terminal
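    A Python sketch of one-step Q-learning matching the pseudocode: behavior is epsilon-greedy, while the learning target uses max_a Q(S', a), which is what makes the method off-policy. The env interface and action list are assumptions, as in the Sarsa sketch:

    ```python
    import random
    from collections import defaultdict

    def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
        """Tabular one-step Q-learning."""
        Q = defaultdict(float)          # Q(terminal, .) is implicitly 0
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                # epsilon-greedy behavior policy
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q
    ```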

    • Because Q-learning takes a max over actions, it suffers from a maximization bias problem.
      For details, see Example 6.7 in the book.
      Double Q-learning eliminates this problem.

    Double Q-learning

    One-step TD learning

    Initialize \(Q_1(s, a)\) and \(Q_2(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily
    Initialize \(Q_1(\text{terminal}, \cdot) = Q_2(\text{terminal}, \cdot) = 0\)
    Repeat (for each episode):
      Initialize \(S\)
      Repeat (for each step of episode):
        Choose \(A\) from \(S\) using policy derived from \(Q_1\) and \(Q_2\) (e.g. \(\epsilon\)-greedy)
        Take action \(A\), observe \(R, S'\)
        With 0.5 probability:
          \(Q_1(S, A) \gets Q_1(S, A) + \alpha [R + \gamma Q_2(S', \underset{a}{\operatorname{argmax}}\, Q_1(S', a)) - Q_1(S, A)]\)
        Else:
          \(Q_2(S, A) \gets Q_2(S, A) + \alpha [R + \gamma Q_1(S', \underset{a}{\operatorname{argmax}}\, Q_2(S', a)) - Q_2(S, A)]\)
        \(S \gets S'\)
      Until \(S\) is terminal
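    A minimal sketch of a single Double Q-learning update step, mirroring the pseudocode. Q1 and Q2 are assumed to be tables (e.g. defaultdicts) keyed by (state, action); action selection and the episode loop are left to the caller:

    ```python
    import random

    def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                        actions, alpha=0.1, gamma=1.0):
        """One Double Q-learning update on either Q1 or Q2, chosen with probability 0.5."""
        A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
        if done:
            target = reward
        else:
            # argmax is taken under A, but that action is evaluated with B,
            # which is what removes the maximization bias of plain Q-learning
            best = max(actions, key=lambda a: A[(next_state, a)])
            target = reward + gamma * B[(next_state, best)]
        A[(state, action)] += alpha * (target - A[(state, action)])
    ```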

    Off-policy TD Learning of the Action Value \(q_{\pi}\) (with importance sampling): Sarsa

    To account for importance sampling, the ratio \(\rho\) is introduced into the Sarsa algorithm, giving an off-policy method.
    \(\rho\) - the importance sampling ratio:

    \[\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad (\rho_{\tau+n}^{(\tau+1)}) \]
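    A small sketch of how this ratio can be computed from a stored trajectory. Here pi(a, s) and mu(a, s) are assumed to return the action probabilities under the target and behavior policies, and states[i], actions[i] hold S_i and A_i:

    ```python
    def importance_sampling_ratio(pi, mu, states, actions, tau, n, T):
        """rho used in the n-step off-policy Sarsa update."""
        rho = 1.0
        for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):
            rho *= pi(actions[i], states[i]) / mu(actions[i], states[i])
        return rho
    ```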

    n-step TD learning

    • Algorithm

    Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
    Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne \text{terminal}\)
      Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
      \(T \gets \infty\)
      For \(t = 0, 1, 2, \cdots\):
        If \(t < T\), then:
          Take action \(A_t\)
          Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
          If \(S_{t+1}\) is terminal, then:
            \(T \gets t+1\)
          Else:
            Select and store an action \(A_{t+1} \sim \mu(\cdot | S_{t+1})\)
        \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
        If \(\tau \ge 0\):
          \(\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad (\rho_{\tau+n}^{(\tau+1)})\)
          \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1} R_i\)
          If \(\tau + n \le T\), then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad (G_{\tau}^{(n)})\)
          \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
          If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
      Until \(\tau = T - 1\)

    Expected Sarsa

    • Flowchart
    • Algorithm: omitted (a minimal sketch of the update step follows below)
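    Expected Sarsa replaces the sampled next action value in the Sarsa target with its expectation under the policy. Since the pseudocode is omitted here, this is only a minimal sketch of the update step; pi_probs(s), assumed to return a dict of action probabilities {a: pi(a|s)}, is a hypothetical helper:

    ```python
    def expected_sarsa_update(Q, state, action, reward, next_state, done,
                              pi_probs, alpha=0.1, gamma=1.0):
        """One Expected Sarsa update; Q is a table keyed by (state, action)."""
        if done:
            target = reward
        else:
            # expectation of Q(S', .) under the current policy pi(. | S')
            expected_q = sum(p * Q[(next_state, a)]
                             for a, p in pi_probs(next_state).items())
            target = reward + gamma * expected_q
        Q[(state, action)] += alpha * (target - Q[(state, action)])
    ```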

    Off-policy TD Learning of the Action Value \(q_{\pi}\) (without importance sampling): Tree Backup Algorithm

    The idea of the Tree Backup algorithm is to use the expected action value at every step.
    Taking the expectation over action values means evaluating every possible action \(a\) once.

    n-step TD learning

    • Flowchart
    • Algorithm

    Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne \text{terminal}\)
      Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
      \(Q_0 \gets Q(S_0, A_0)\)
      \(T \gets \infty\)
      For \(t = 0, 1, 2, \cdots\):
        If \(t < T\), then:
          Take action \(A_t\)
          Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
          If \(S_{t+1}\) is terminal, then:
            \(T \gets t+1\)
            \(\delta_t \gets R_{t+1} - Q_t\)
          Else:
            \(\delta_t \gets R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t\)
            Select arbitrarily and store an action as \(A_{t+1}\)
            \(Q_{t+1} \gets Q(S_{t+1}, A_{t+1})\)
            \(\pi_{t+1} \gets \pi(A_{t+1}|S_{t+1})\)
        \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
        If \(\tau \ge 0\):
          \(E \gets 1\)
          \(G \gets Q_{\tau}\)
          For \(k = \tau, \dots, \min(\tau + n - 1, T - 1)\):
            \(G \gets G + E \delta_k\)
            \(E \gets \gamma E \pi_{k+1}\)
          \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
          If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
      Until \(\tau = T - 1\)
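    The inner loop of the pseudocode accumulates the n-step tree-backup return from the stored TD errors. A small sketch of just that accumulation, with deltas[k] = delta_k and pis[k] = pi(A_k|S_k) assumed to have been recorded while acting:

    ```python
    def tree_backup_return(Q_tau, deltas, pis, tau, n, T, gamma):
        """n-step tree-backup return G for the pair updated at time tau."""
        G = Q_tau
        E = 1.0
        for k in range(tau, min(tau + n - 1, T - 1) + 1):
            G += E * deltas[k]
            if k + 1 < len(pis):    # pi_T is never stored; E is not used past the last step
                E *= gamma * pis[k + 1]
        return G
    ```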

    Off-policy TD Learning of the Action Value \(q_{\pi}\): \(Q(\sigma)\)

    \(Q(\sigma)\) combines the Sarsa (with importance sampling), Expected Sarsa, and Tree Backup algorithms, and accounts for importance sampling.
    With \(\sigma = 1\), it uses the importance-sampling Sarsa update.
    With \(\sigma = 0\), it uses the Tree Backup expected-value update.
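    The key quantity is the TD error, which blends the sampled (Sarsa) target and the expected (Tree Backup) target through sigma. A minimal sketch, where q_next = Q(S_{t+1}, A_{t+1}), expected_q_next = sum_a pi(a|S_{t+1}) Q(S_{t+1}, a), and q_t = Q(S_t, A_t) are assumed to be computed by the caller:

    ```python
    def q_sigma_delta(reward, sigma, q_next, expected_q_next, q_t, gamma):
        """TD error for Q(sigma): sigma = 1 recovers Sarsa, sigma = 0 recovers Tree Backup."""
        return (reward
                + gamma * sigma * q_next
                + gamma * (1.0 - sigma) * expected_q_next
                - q_t)
    ```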

    n-step TD learning

    • Flowchart
    • Algorithm

    Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
    Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
    Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
    Parameters: step size \(\alpha \in (0, 1]\),
      small \(\epsilon > 0\)
      a positive integer \(n\)
    All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)

    Repeat (for each episode):
      Initialize and store \(S_0 \ne \text{terminal}\)
      Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
      \(Q_0 \gets Q(S_0, A_0)\)
      \(T \gets \infty\)
      For \(t = 0, 1, 2, \cdots\):
        If \(t < T\), then:
          Take action \(A_t\)
          Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
          If \(S_{t+1}\) is terminal, then:
            \(T \gets t+1\)
            \(\delta_t \gets R_{t+1} - Q_t\)
          Else:
            Select and store an action \(A_{t+1} \sim \mu(\cdot | S_{t+1})\)
            Select and store \(\sigma_{t+1}\)
            \(Q_{t+1} \gets Q(S_{t+1}, A_{t+1})\)
            \(\delta_t \gets R_{t+1} + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1}) \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t\)
            \(\pi_{t+1} \gets \pi(A_{t+1}|S_{t+1})\)
            \(\rho_{t+1} \gets \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}\)
        \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
        If \(\tau \ge 0\):
          \(\rho \gets 1\)
          \(E \gets 1\)
          \(G \gets Q_{\tau}\)
          For \(k = \tau, \dots, \min(\tau + n - 1, T - 1)\):
            \(G \gets G + E \delta_k\)
            \(E \gets \gamma E [(1 - \sigma_{k+1}) \pi_{k+1} + \sigma_{k+1}]\)
            \(\rho \gets \rho (1 - \sigma_k + \sigma_k \rho_k)\)
          \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
          If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
      Until \(\tau = T - 1\)

    Summary

    A limitation of temporal-difference learning methods: reward information must be available within the learning steps (the n-step window).
    For example, can a reward be computed for every single move in chess? With a Monte Carlo method, which simulates through to the end of the game, a reward (the game's outcome) is certainly available.

    References

  • Original article: https://www.cnblogs.com/steven-yang/p/6516818.html