

    Reinforcement Learning Reading Notes - 13 - Policy Gradient Methods

    Study notes for:
    Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto © 2014, 2015, 2016

    Reference

    For the mathematical notation used in reinforcement learning, see here:

    Policy Gradient Methods

    The value-function-based approach

    \[
    \text{Reinforcement Learning} \doteq \pi_* \\
    \quad \updownarrow \\
    \pi_* \doteq \{ \pi(s) \}, \quad s \in \mathcal{S} \\
    \quad \updownarrow \\
    \begin{cases}
    \pi(s) = \underset{a}{\arg\max}\, v_{\pi}(s' | s, a), \ s' \in S(s), \quad \text{or} \\
    \pi(s) = \underset{a}{\arg\max}\, q_{\pi}(s, a)
    \end{cases} \\
    \quad \updownarrow \\
    \begin{cases}
    v_*(s), \quad \text{or} \\
    q_*(s, a)
    \end{cases} \\
    \quad \updownarrow \\
    \text{approximation cases:} \\
    \begin{cases}
    \hat{v}(s, \theta) \doteq \theta^T \phi(s), \quad \text{state value function} \\
    \hat{q}(s, a, \theta) \doteq \theta^T \phi(s, a), \quad \text{action value function}
    \end{cases} \\
    \text{where} \\
    \theta \text{ - the value function's weight vector}
    \]

    The new, policy-gradient-based approach (Policy Gradient Methods)

    \[
    \text{Reinforcement Learning} \doteq \pi_* \\
    \quad \updownarrow \\
    \pi_* \doteq \{ \pi(s) \}, \quad s \in \mathcal{S} \\
    \quad \updownarrow \\
    \pi(s) = \underset{a}{\arg\max}\, \pi(a|s, \theta) \\
    \text{where} \\
    \pi(a|s, \theta) \in [0, 1] \\
    s \in \mathcal{S}, \ a \in \mathcal{A} \\
    \quad \updownarrow \\
    \pi(a|s, \theta) \doteq \frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))} \\
    \quad \updownarrow \\
    h(s, a, \theta) \doteq \theta^T \phi(s, a) \\
    \text{where} \\
    \theta \text{ - the policy's weight vector}
    \]
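    As a concrete illustration of this parameterization, below is a minimal NumPy sketch (my own, not from the book) of a softmax policy with linear action preferences \(h(s, a, \theta) = \theta^T \phi(s, a)\). The feature function `phi`, the two-action setup, and the four-dimensional state features are illustrative assumptions; the later sketches in these notes reuse `phi`, `phi_v`, `action_probs`, and `grad_log_pi`.

```python
import numpy as np

N_ACTIONS = 2
N_FEATURES = 4  # length of the per-action state feature block (an assumption)

def phi(s, a):
    """Hypothetical state-action features: copy the state features into the block for action a."""
    x = np.zeros(N_FEATURES * N_ACTIONS)
    x[a * N_FEATURES:(a + 1) * N_FEATURES] = s
    return x

def phi_v(s):
    """Hypothetical state features for a linear value function v_hat(s, w) = w^T phi_v(s)."""
    return np.asarray(s, dtype=float)

def action_probs(s, theta):
    """pi(a|s, theta): softmax over the linear preferences h(s, a, theta) = theta^T phi(s, a)."""
    h = np.array([theta @ phi(s, a) for a in range(N_ACTIONS)])
    h -= h.max()                  # subtract the max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """grad_theta log pi(a|s, theta) = phi(s, a) - sum_b pi(b|s, theta) phi(s, b)."""
    p = action_probs(s, theta)
    return phi(s, a) - sum(p[b] * phi(s, b) for b in range(N_ACTIONS))

# Example: sample an action in some state under the current policy.
theta = np.zeros(N_FEATURES * N_ACTIONS)
s = np.array([0.1, -0.3, 0.5, 1.0])
a = np.random.choice(N_ACTIONS, p=action_probs(s, theta))
```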

    The policy gradient theorem

    Episodic tasks

    How to measure the value of a policy (\(\eta\))

    \[
    \eta(\theta) \doteq v_{\pi_\theta}(s_0) \\
    \text{where} \\
    \eta \text{ - the performance measure} \\
    v_{\pi_\theta} \text{ - the true value function for } \pi_\theta \text{, the policy determined by } \theta \\
    s_0 \text{ - some particular state}
    \]

    • The policy gradient theorem

    \[
    \nabla \eta(\theta) = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
    \text{where} \\
    d_{\pi}(s) \text{ - the on-policy distribution, the fraction of time spent in } s \text{ under the target policy } \pi \\
    \sum_s d_{\pi}(s) = 1
    \]

    REINFORCE: Monte Carlo Policy Gradient

    • Policy gradient formula

    \[
    \begin{align}
    \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
    & = \mathbb{E}_\pi \left[ \gamma^t \sum_a q_\pi(S_t, a) \nabla_\theta \pi(a|S_t, \theta) \right] \\
    & = \mathbb{E}_\pi \left[ \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right]
    \end{align}
    \]
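    The step from the second line to the third replaces the sum over actions by a sample \(A_t \sim \pi\) and uses \(q_\pi(S_t, A_t) = \mathbb{E}_\pi[G_t \mid S_t, A_t]\). Spelled out (my own expansion, not in the original notes):

    \[
    \sum_a q_\pi(S_t, a) \nabla_\theta \pi(a|S_t, \theta)
    = \sum_a \pi(a|S_t, \theta)\, q_\pi(S_t, a)\, \frac{\nabla_\theta \pi(a|S_t, \theta)}{\pi(a|S_t, \theta)}
    = \mathbb{E}_\pi \left[ q_\pi(S_t, A_t) \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right]
    \]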

    • Update rule

    \[
    \begin{align}
    \theta_{t+1} & \doteq \theta_t + \alpha \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \\
    & = \theta_t + \alpha \gamma^t G_t \nabla_\theta \log \pi(A_t|S_t, \theta)
    \end{align}
    \]

    • Algorithm (REINFORCE: A Monte Carlo Policy Gradient Method (episodic))
      See the original book for the full pseudocode; it is not repeated here. A minimal sketch follows below.
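    A minimal Python sketch of episodic REINFORCE, under assumptions: a Gym-like environment whose `step` returns `(next_state, reward, done)`, and the `action_probs` / `grad_log_pi` helpers from the softmax-policy sketch above. This is an illustration, not the book's pseudocode.

```python
import numpy as np

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99):
    """Run one episode with pi(.|., theta), then apply the REINFORCE update for every step."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = np.random.choice(N_ACTIONS, p=action_probs(s, theta))
        s_next, r, done = env.step(a)            # assumed environment interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # Compute the Monte Carlo returns G_t backwards through the episode.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta)
    for t, (s_t, a_t, G_t) in enumerate(zip(states, actions, returns)):
        theta = theta + alpha * (gamma ** t) * G_t * grad_log_pi(s_t, a_t, theta)
    return theta
```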

    REINFORCE with baseline (Monte Carlo Policy Gradient with a baseline)

    • Policy gradient formula

    \[
    \begin{align}
    \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
    & = \sum_s d_{\pi}(s) \sum_{a} \left( q_{\pi}(s, a) - b(s) \right) \nabla_\theta \pi(a|s, \theta)
    \end{align} \\
    \because \\
    \sum_{a} b(s) \nabla_\theta \pi(a|s, \theta) \\
    \quad = b(s) \nabla_\theta \sum_{a} \pi(a|s, \theta) \\
    \quad = b(s) \nabla_\theta 1 \\
    \quad = 0 \\
    \text{where} \\
    b(s) \text{ - an arbitrary baseline function, e.g. } b(s) = \hat{v}(s, w)
    \]

    • Update rule

    \[
    \delta = G_t - \hat{v}(S_t, w) \\
    w_{t+1} = w_{t} + \beta \, \delta \, \nabla_w \hat{v}(S_t, w) \\
    \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \, \nabla_\theta \log \pi(A_t|S_t, \theta)
    \]

    • Algorithm description
      See the original book for the full pseudocode; a minimal sketch follows below.
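    A minimal sketch of the corresponding updates with a learned linear baseline \(\hat{v}(s, w) = w^T \phi_v(s)\). The episode format and step sizes are assumptions, and the `phi_v` / `grad_log_pi` helpers from the earlier sketch are reused.

```python
import numpy as np

def reinforce_with_baseline(episode, theta, w, alpha=0.01, beta=0.1, gamma=0.99):
    """episode: list of (S_t, A_t, G_t) triples from one finished episode."""
    for t, (s, a, G_t) in enumerate(episode):
        delta = G_t - w @ phi_v(s)                 # delta = G_t - v_hat(S_t, w)
        w = w + beta * delta * phi_v(s)            # baseline (critic) update; grad_w v_hat = phi_v(s)
        theta = theta + alpha * (gamma ** t) * delta * grad_log_pi(s, a, theta)
    return theta, w
```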

    Actor-Critic Methods

    This algorithm is essentially:

    1. a TD generalization of REINFORCE with baseline,
    2. plus eligibility traces.

    Note: Monte Carlo methods must wait until the current episode finishes before the exact return \(G_t\) can be computed.
    TD removes this requirement (and thereby improves efficiency): a one-step, temporal-difference approximation of the return, \(G_t^{(1)} \approx G_t\), is used instead (at the cost of some bias).
    Eligibility traces accumulate the value-function gradient used in the weight updates: \(e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1}\).

    • Update rule

    \[
    \delta = G_t^{(1)} - \hat{v}(S_t, w) \\
    \quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\
    w_{t+1} = w_{t} + \beta \, \delta \, \nabla_w \hat{v}(S_t, w) \\
    \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \, \nabla_\theta \log \pi(A_t|S_t, \theta)
    \]

    • Update rule with eligibility traces

    \[
    \delta = R + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\
    e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\
    w_{t+1} = w_{t} + \beta \, \delta \, e^w \\
    e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\
    \theta_{t+1} = \theta_t + \alpha \, \delta \, e^{\theta} \\
    \text{where} \\
    R + \gamma \hat{v}(s', w) = G_t^{(1)} \\
    \delta \text{ - the TD error} \\
    e^w \text{ - the eligibility trace of the state value function} \\
    e^{\theta} \text{ - the eligibility trace of the policy}
    \]

    • Algorithm description
      See the original book for the full pseudocode; a minimal sketch follows below.
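    A minimal sketch of one episode of episodic actor-critic with accumulating eligibility traces, mirroring the update rule above and reusing the policy and value-feature helpers from the earlier sketches; the environment interface is again an assumption.

```python
import numpy as np

def actor_critic_traces_episode(env, theta, w, alpha=0.01, beta=0.1,
                                gamma=0.99, lam_w=0.9, lam_theta=0.9):
    """One episode of episodic actor-critic with eligibility traces e^w and e^theta."""
    e_w = np.zeros_like(w)           # trace for the critic weights
    e_theta = np.zeros_like(theta)   # trace for the policy weights
    I = 1.0                          # the gamma^t factor
    s, done = env.reset(), False
    while not done:
        a = np.random.choice(N_ACTIONS, p=action_probs(s, theta))
        s_next, r, done = env.step(a)

        v = w @ phi_v(s)
        v_next = 0.0 if done else w @ phi_v(s_next)
        delta = r + gamma * v_next - v                     # one-step TD error

        e_w = lam_w * e_w + I * phi_v(s)                   # critic trace
        w = w + beta * delta * e_w
        e_theta = lam_theta * e_theta + I * grad_log_pi(s, a, theta)
        theta = theta + alpha * delta * e_theta

        I *= gamma
        s = s_next
    return theta, w
```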

    Policy Gradient for Continuing Problems (Average Reward Rate)

    • Performance measure
      For a continuing task, the performance of a policy is defined as the average reward per time step.

    \[
    \begin{align}
    \eta(\theta) \doteq r(\theta) & \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E}[R_t \mid \theta_0 = \theta_1 = \dots = \theta_{t-1} = \theta] \\
    & = \lim_{t \to \infty} \mathbb{E}[R_t \mid \theta_0 = \theta_1 = \dots = \theta_{t-1} = \theta]
    \end{align}
    \]
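    Under the ergodicity assumption made in the book, this average reward rate can equivalently be written in terms of the steady-state (on-policy) distribution \(d_\pi\):

    \[
    r(\theta) = \sum_s d_{\pi}(s) \sum_a \pi(a|s, \theta) \sum_{s', r} p(s', r \mid s, a)\, r
    \]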

    • Update rule

    \[
    \delta = G_t^{(1)} - \hat{v}(S_t, w) \\
    \quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\
    w_{t+1} = w_{t} + \beta \, \delta \, \nabla_w \hat{v}(S_t, w) \\
    \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \, \nabla_\theta \log \pi(A_t|S_t, \theta)
    \]

    • Update rule: Actor-Critic with eligibility traces (continuing)

    \[
    \delta = R - \bar{R} + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\
    \bar{R} = \bar{R} + \eta \, \delta \\
    e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\
    w_{t+1} = w_{t} + \beta \, \delta \, e^w \\
    e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\
    \theta_{t+1} = \theta_t + \alpha \, \delta \, e^{\theta} \\
    \text{where} \\
    R + \gamma \hat{v}(s', w) = G_t^{(1)} \\
    \eta \text{ - step size for the average reward estimate } \bar{R} \\
    \delta \text{ - the TD error} \\
    e^w \text{ - the eligibility trace of the state value function} \\
    e^{\theta} \text{ - the eligibility trace of the policy}
    \]

    • Algorithm description (Actor-Critic with eligibility traces (continuing))
      See the original book for the full pseudocode; a minimal sketch follows below.
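    A minimal sketch of the continuing-task variant, reusing the earlier helpers; the TD error is centered by the running average reward estimate \(\bar{R}\). Note one deliberate deviation from the rule above: the \(\gamma^t\) factor is dropped from the traces, since \(t\) grows without bound in a continuing task. The step sizes and the environment interface are illustrative assumptions.

```python
import numpy as np

def actor_critic_continuing(env, theta, w, n_steps=10000,
                            alpha=0.01, beta=0.1, eta=0.01, gamma=1.0,
                            lam_w=0.9, lam_theta=0.9):
    """Actor-critic with eligibility traces and an average-reward estimate R_bar (continuing)."""
    e_w = np.zeros_like(w)
    e_theta = np.zeros_like(theta)
    R_bar = 0.0                              # running estimate of the average reward rate
    s = env.reset()
    for _ in range(n_steps):
        a = np.random.choice(N_ACTIONS, p=action_probs(s, theta))
        s_next, r, _ = env.step(a)           # continuing task: the done flag is ignored

        delta = r - R_bar + gamma * (w @ phi_v(s_next)) - w @ phi_v(s)   # centered TD error
        R_bar += eta * delta                                             # R_bar <- R_bar + eta * delta

        e_w = lam_w * e_w + phi_v(s)
        w = w + beta * delta * e_w
        e_theta = lam_theta * e_theta + grad_log_pi(s, a, theta)
        theta = theta + alpha * delta * e_theta

        s = s_next
    return theta, w, R_bar
```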
      At the time of writing, the book draft was not yet finished, so the notes for this chapter stop here.
  • Original post: https://www.cnblogs.com/steven-yang/p/6624253.html