• Reinforcement Learning (5): Temporal-Difference Learning


    Temporal-Difference Learning

    TD learning occupies a central position in reinforcement learning, combining ideas from both DP and MC. Like MC, TD can learn directly from raw experience without a complete model of the environment. Like DP, it does not have to wait for a final outcome before learning: it bootstraps, i.e., each estimate is updated partly on the basis of other, existing estimates.

    The simplest form of TD:

    \[V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\]

    This is called TD(0), or one-step TD.

    # Tabular TD(0) for estimating v_pi
    Input: the policy pi to be evaluated
    Algorithm parameter: step size alpha in (0,1]
    Initialize V(s), for all s in S_plus, arbitrarily except that V(terminal) = 0
    
    Loop for each episode:
        Initialize S
        Loop for each step of episode:
            A = action given by pi for S
            Take action A, observe R, S'
            V(S) = V(S) + alpha * [R + gamma * V(S') - V(S)]
            S = S'
            if S is terminal:
                break
    
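    Below is a minimal runnable sketch of tabular TD(0) in Python. The environment is an assumption made for illustration, not part of the original text: a 5-state random walk with non-terminal states 1..5, terminal states 0 and 6, an equiprobable left/right policy, and reward +1 only when terminating on the right.

    import random

    def td0_random_walk(num_episodes=1000, alpha=0.1, gamma=1.0):
        # V(terminal) is initialized to 0 and never updated below
        V = {s: 0.0 for s in range(7)}
        for _ in range(num_episodes):
            s = 3                                  # start in the center state
            while s not in (0, 6):                 # loop until a terminal state
                s_next = s + random.choice((-1, 1))
                r = 1.0 if s_next == 6 else 0.0
                # TD(0): move V(S) toward the bootstrapped target R + gamma * V(S')
                V[s] += alpha * (r + gamma * V[s_next] - V[s])
                s = s_next
        return V

    # The learned values approach the true values 1/6, 2/6, ..., 5/6 for states 1..5.
    print(td0_random_walk())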

    TD error:

    \[\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\]

    At each time step, the TD error is the discrepancy between the current estimate V(S_t) and the better-informed, bootstrapped target R_{t+1} + γ V(S_{t+1}); it is the error in the estimate made at time t, available one step later.
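
    As a worked example with purely illustrative numbers: suppose V(S_t) = 0.5, V(S_{t+1}) = 0.8, R_{t+1} = 1, and γ = 0.9. Then

    \[\delta_t = 1 + 0.9 \times 0.8 - 0.5 = 1.22\]

    and with α = 0.1 the TD(0) update moves V(S_t) from 0.5 to 0.5 + 0.1 × 1.22 = 0.622.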

    Advantages of TD Prediction Methods

    Compared with DP, TD methods need no model of the environment's dynamics; compared with MC, they can learn online and incrementally after every step, without waiting for the end of an episode.

    Sarsa: On-policy TD Control

    \[Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t,A_t)]\]

    Sarsa takes its name from the quintuple (State, Action, Reward, State, Action); the update expresses the relationship among these five elements. The corresponding TD error can be written as

    \[\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t,A_t)\]

    # Sarsa (on-policy TD control) for estimating Q ≈ q*
    Algorithm parameters: step size alpha in (0,1], small epsilon > 0
    Initialize Q(s,a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal,.) = 0
    
    Loop for each episode:
        Initialize S
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Loop for each step of episode:
            Take action A, observe R, S'
            Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
            Q(S,A) = Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]
            S = S'; A = A'
            if S is terminal:
                break
    
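    A minimal runnable Sarsa sketch in Python. The environment and hyperparameters are illustrative assumptions: a 1-D corridor MDP with states 0..5, actions left/right, reward -1 per step, and state 5 terminal.

    import random

    N_STATES, ACTIONS, TERMINAL = 6, (-1, +1), 5   # hypothetical corridor: states 0..5

    def epsilon_greedy(Q, s, epsilon):
        # Behave randomly with probability epsilon, otherwise pick the greedy action
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda act: Q[(s, act)])

    def sarsa(num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
        # Q(terminal, .) stays 0 because it is never updated below
        Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
        for _ in range(num_episodes):
            s = 0
            a = epsilon_greedy(Q, s, epsilon)
            while s != TERMINAL:
                s_next = min(max(s + a, 0), TERMINAL)        # take action A, observe S'
                r = -1.0                                     # reward of -1 per step
                a_next = epsilon_greedy(Q, s_next, epsilon)  # A' is chosen by the same policy (on-policy)
                Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
                s, a = s_next, a_next
        return Q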

    Q-learning: Off-policy TD Control

    \[Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t,A_t)]\]

    # Q-learning (off-policy TD control) for estimating pi ≈ pi*
    Algorithm parameters: step size alpha in (0,1], small epsilon > 0
    Initialize Q(s,a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal,.) = 0
    
    Loop for each episode:
        Initialize S
        Loop for each step of episode:
            Choose A from S using policy derived from Q (e.g., epsilon-greedy)
            Take action A, observe R, S'
            Q(S,A) = Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
            S = S'
            if S is terminal:
                break
    
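    A minimal runnable Q-learning sketch in Python, on the same hypothetical corridor MDP used in the Sarsa sketch above. The behavior policy is epsilon-greedy, while the update target uses the greedy max over next actions, which is what makes the method off-policy.

    import random

    N_STATES, ACTIONS, TERMINAL = 6, (-1, +1), 5   # same hypothetical corridor as above

    def q_learning(num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
        Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # Q(terminal, .) = 0
        for _ in range(num_episodes):
            s = 0
            while s != TERMINAL:
                # Behavior policy: epsilon-greedy with respect to Q
                if random.random() < epsilon:
                    a = random.choice(ACTIONS)
                else:
                    a = max(ACTIONS, key=lambda act: Q[(s, act)])
                s_next = min(max(s + a, 0), TERMINAL)
                r = -1.0
                # Target uses the greedy max over next actions (off-policy)
                best_next = max(Q[(s_next, act)] for act in ACTIONS)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q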

    Q-learning directly approximates q*, the optimal action-value function, independently of the behavior policy being followed.

    Expected Sarsa

    \[\begin{aligned} Q(S_t,A_t) &\leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t,A_t)] \\ &\leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \sum_{a}\pi(a|S_{t+1})Q(S_{t+1},a) - Q(S_t,A_t)] \end{aligned}\]
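
    Instead of sampling the next action A_{t+1}, Expected Sarsa uses the expectation of Q(S_{t+1}, ·) under the policy, removing the variance caused by the random selection of A_{t+1}. Below is a sketch of this target under an epsilon-greedy policy; the function name and the dict layout of Q (keyed by (state, action)) are illustrative assumptions.

    def expected_sarsa_target(Q, r, s_next, actions, gamma=1.0, epsilon=0.1):
        # Action values at the next state
        q_next = [Q[(s_next, a)] for a in actions]
        # epsilon-greedy probabilities: epsilon/|A| on every action,
        # plus the remaining (1 - epsilon) mass on the greedy action
        expected_q = (epsilon / len(actions)) * sum(q_next) + (1 - epsilon) * max(q_next)
        return r + gamma * expected_q

    The update is then Q[(s, a)] += alpha * (expected_sarsa_target(Q, r, s_next, actions) - Q[(s, a)]); with epsilon = 0 (a greedy target policy) this target coincides with the Q-learning target.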

    Double Q-learning

    \[Q_1(S_t,A_t) \leftarrow Q_1(S_t,A_t) + \alpha[R_{t+1} + \gamma Q_2(S_{t+1}, \arg\max_a Q_1(S_{t+1},a)) - Q_1(S_t,A_t)]\]

    # Double Q-learning, for estimating Q1 ≈ Q2 ≈ q*
    
    Algorithm parameters: step size alpha in (0,1], small epsilon > 0
    Initialize Q1(s,a) and Q2(s,a), for all s in S_plus, a in A(s), such that Q1(terminal,.) = Q2(terminal,.) = 0
    
    Loop for each episode:
        Initialize S
        Loop for each step of episode:
            Choose A from S using the policy epsilon-greedy in Q1 + Q2
            Take action A, observe R, S'
            with 0.5 probability:
                Q1(S,A) = Q1(S,A) + alpha * (R + gamma * Q2(S', arg max_a Q1(S',a)) - Q1(S,A))
            else:
                Q2(S,A) = Q2(S,A) + alpha * (R + gamma * Q1(S', arg max_a Q2(S',a)) - Q2(S,A))
            S = S'
            if S is terminal:
                break
    
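    A minimal runnable Double Q-learning sketch in Python, again on the same hypothetical corridor MDP as above. One estimate selects the maximizing action while the other evaluates it, which reduces the maximization bias of ordinary Q-learning.

    import random

    N_STATES, ACTIONS, TERMINAL = 6, (-1, +1), 5   # same hypothetical corridor as above

    def double_q_learning(num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
        Q1 = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
        Q2 = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
        for _ in range(num_episodes):
            s = 0
            while s != TERMINAL:
                # Behave epsilon-greedily with respect to Q1 + Q2
                if random.random() < epsilon:
                    a = random.choice(ACTIONS)
                else:
                    a = max(ACTIONS, key=lambda act: Q1[(s, act)] + Q2[(s, act)])
                s_next = min(max(s + a, 0), TERMINAL)
                r = -1.0
                if random.random() < 0.5:
                    # Q1 selects the maximizing action, Q2 evaluates it
                    a_star = max(ACTIONS, key=lambda act: Q1[(s_next, act)])
                    Q1[(s, a)] += alpha * (r + gamma * Q2[(s_next, a_star)] - Q1[(s, a)])
                else:
                    # Roles swapped: Q2 selects, Q1 evaluates
                    a_star = max(ACTIONS, key=lambda act: Q2[(s_next, act)])
                    Q2[(s, a)] += alpha * (r + gamma * Q1[(s_next, a_star)] - Q2[(s, a)])
                s = s_next
        return Q1, Q2
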
  • Original article: https://www.cnblogs.com/vpegasus/p/td.html