Reinforcement Learning Reading Notes - 13 - Policy Gradient Methods
Reading notes on:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016
References
- Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016
- Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation
- Reinforcement Learning Reading Notes - 01 - The Reinforcement Learning Problem
- Reinforcement Learning Reading Notes - 02 - Multi-Armed Bandits
- Reinforcement Learning Reading Notes - 03 - Finite Markov Decision Processes
- Reinforcement Learning Reading Notes - 04 - Dynamic Programming
- Reinforcement Learning Reading Notes - 05 - Monte Carlo Methods
- Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning
- Reinforcement Learning Reading Notes - 08 - Planning and Learning Methods
- Reinforcement Learning Reading Notes - 09 - On-policy Prediction with Approximation
- Reinforcement Learning Reading Notes - 10 - On-policy Control with Approximation
- Reinforcement Learning Reading Notes - 11 - Off-policy Methods with Approximation
- Reinforcement Learning Reading Notes - 12 - Eligibility Traces
If you need a refresher on the mathematical notation of reinforcement learning, see the terminology post (00) listed above first.
Policy Gradient Methods
The value-function-based approach
\[
\text{Reinforcement Learning} \doteq \pi_* \\
\quad \updownarrow \\
\pi_* \doteq \{ \pi(s) \}, \ s \in \mathcal{S} \\
\quad \updownarrow \\
\begin{cases}
\pi(s) = \underset{a}{\operatorname{argmax}} \ v_{\pi}(s' | s, a), \ s' \in S(s), \quad \text{or} \\
\pi(s) = \underset{a}{\operatorname{argmax}} \ q_{\pi}(s, a)
\end{cases} \\
\quad \updownarrow \\
\begin{cases}
v_*(s), \quad \text{or} \\
q_*(s, a)
\end{cases} \\
\quad \updownarrow \\
\text{approximation cases:} \\
\begin{cases}
\hat{v}(s, \theta) \doteq \theta^T \phi(s), \quad \text{state value function} \\
\hat{q}(s, a, \theta) \doteq \theta^T \phi(s, a), \quad \text{action value function}
\end{cases} \\
\text{where} \\
\theta \text{ - the value function's weight vector}
\]
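As a quick illustration of this value-function route, here is a minimal Python sketch of a linear action-value approximation and the greedy policy induced by it. The feature function `phi`, the action set, and the numbers are placeholders of my own, not anything from the book:

```python
import numpy as np

def phi(s, a):
    """Illustrative feature vector for a (state, action) pair (placeholder)."""
    return np.array([1.0, float(s), float(a), float(s) * float(a)])

def q_hat(s, a, theta):
    """Linear action-value approximation: q-hat(s, a, theta) = theta^T phi(s, a)."""
    return theta @ phi(s, a)

def greedy_policy(s, actions, theta):
    """pi(s) = argmax_a q-hat(s, a, theta): the policy is derived from the value function."""
    return max(actions, key=lambda a: q_hat(s, a, theta))

theta = np.array([0.0, 0.1, -0.3, 0.2])
print(greedy_policy(s=2, actions=[0, 1, 2], theta=theta))
```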
The new approach of policy gradient methods
\[
\text{Reinforcement Learning} \doteq \pi_* \\
\quad \updownarrow \\
\pi_* \doteq \{ \pi(s) \}, \ s \in \mathcal{S} \\
\quad \updownarrow \\
\pi(s) = \underset{a}{\operatorname{argmax}} \ \pi(a|s, \theta) \\
\text{where} \\
\pi(a|s, \theta) \in [0, 1] \\
s \in \mathcal{S}, \ a \in \mathcal{A} \\
\quad \updownarrow \\
\pi(a|s, \theta) \doteq \frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))} \\
\quad \updownarrow \\
h(s, a, \theta) \doteq \theta^T \phi(s, a) \\
\text{where} \\
\theta \text{ - the policy weight vector}
\]
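The softmax parameterization above can be sketched directly in Python. This is only a sketch under the assumption of linear action preferences h(s, a, θ) = θᵀφ(s, a); the feature function `phi`, the action set, and the numbers are illustrative assumptions:

```python
import numpy as np

def softmax_probs(s, actions, theta, phi):
    """pi(a|s, theta) = exp(h(s,a,theta)) / sum_b exp(h(s,b,theta)), with h = theta^T phi(s,a)."""
    prefs = np.array([theta @ phi(s, a) for a in actions])
    prefs -= prefs.max()              # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

# Toy feature function and weights, purely for illustration.
phi = lambda s, a: np.array([1.0, float(s), float(a), float(s) * float(a)])
theta = np.array([0.1, -0.2, 0.5, 0.0])
print(softmax_probs(s=1.0, actions=[0, 1], theta=theta, phi=phi))  # probabilities sum to 1
```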
The Policy Gradient Theorem
Episodic tasks
How to compute the value of a policy, \(\eta\)
\[
\eta(\theta) \doteq v_{\pi_\theta}(s_0) \\
\text{where} \\
\eta \text{ - the performance measure} \\
v_{\pi_\theta} \text{ - the true value function for } \pi_\theta \text{, the policy determined by } \theta \\
s_0 \text{ - some particular state}
\]
- The policy gradient theorem
\[
\nabla \eta(\theta) = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
\text{where} \\
d_{\pi}(s) \text{ - the on-policy distribution, the fraction of time spent in } s \text{ under the target policy } \pi \\
\sum_s d_{\pi}(s) = 1
\]
Monte Carlo Policy Gradient (REINFORCE)
- Gradient of the performance measure
\[
\begin{align}
\nabla \eta(\theta)
& = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
& = \mathbb{E}_\pi \left [ \gamma^t \sum_a q_\pi(S_t, a) \nabla_\theta \pi(a|S_t, \theta) \right ] \\
& = \mathbb{E}_\pi \left [ \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right ]
\end{align}
\]
- Update rule
\[
\begin{align}
\theta_{t+1}
& \doteq \theta_t + \alpha \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \\
& = \theta_t + \alpha \gamma^t G_t \nabla_\theta \log \pi(A_t|S_t, \theta)
\end{align}
\]
- Algorithm description (REINFORCE: A Monte Carlo Policy Gradient Method (episodic))
Please refer to the original book; the details are not repeated here.
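To make the update rule concrete, here is a hedged Python sketch of one REINFORCE pass over a finished episode, reusing the `softmax_probs` helper from the snippet above for the softmax-linear policy. The episode format, step size, and helper names are illustrative assumptions, not the book's pseudocode verbatim:

```python
def grad_log_pi(s, a, actions, theta, phi):
    """grad_theta log pi(a|s,theta) = phi(s,a) - sum_b pi(b|s,theta) phi(s,b) for the softmax-linear policy."""
    probs = softmax_probs(s, actions, theta, phi)
    return phi(s, a) - sum(p * phi(s, b) for p, b in zip(probs, actions))

def reinforce_update(episode, theta, actions, phi, alpha=0.01, gamma=1.0):
    """One Monte Carlo policy-gradient pass over a finished episode.

    episode: list of (S_t, A_t, R_{t+1}) tuples, t = 0 .. T-1.
    Applies theta <- theta + alpha * gamma^t * G_t * grad_log_pi(A_t|S_t, theta) at every step t.
    """
    G, returns = 0.0, []
    for (_, _, r) in reversed(episode):        # compute the returns G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, (s, a, _) in enumerate(episode):
        theta = theta + alpha * (gamma ** t) * returns[t] * grad_log_pi(s, a, actions, theta, phi)
    return theta
```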
Monte Carlo Policy Gradient with Baseline (REINFORCE with baseline)
- Gradient of the performance measure
\[
\begin{align}
\nabla \eta(\theta)
& = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
& = \sum_s d_{\pi}(s) \sum_{a} \left ( q_{\pi}(s, a) - b(s) \right ) \nabla_\theta \pi(a|s, \theta)
\end{align} \\
\because \\
\sum_{a} b(s) \nabla_\theta \pi(a|s, \theta) \\
\quad = b(s) \nabla_\theta \sum_{a} \pi(a|s, \theta) \\
\quad = b(s) \nabla_\theta 1 \\
\quad = 0 \\
\text{where} \\
b(s) \text{ - an arbitrary baseline function, e.g. } b(s) = \hat{v}(s, w)
\]
- Update rule
\[
\delta = G_t - \hat{v}(S_t, w) \\
w_{t+1} = w_t + \beta \delta \nabla_w \hat{v}(S_t, w) \\
\theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta)
\]
- Algorithm description
Please refer to the original book; the details are not repeated here.
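A minimal sketch of the baseline variant, assuming a linear state-value estimate v̂(s, w) = wᵀx(s) as the baseline b(s) and reusing `grad_log_pi` from the REINFORCE sketch above. The value feature function `x` and the step sizes α, β are illustrative assumptions:

```python
def reinforce_with_baseline(episode, theta, w, actions, phi, x,
                            alpha=0.01, beta=0.05, gamma=1.0):
    """REINFORCE with a learned baseline b(s) = v-hat(s, w) = w^T x(s).

    For each step t of a finished episode [(S_t, A_t, R_{t+1}), ...]:
        delta = G_t - v_hat(S_t, w)
        w     <- w + beta * delta * grad_w v_hat(S_t, w)   (= beta * delta * x(S_t) for a linear v_hat)
        theta <- theta + alpha * gamma^t * delta * grad_log_pi(A_t|S_t, theta)
    """
    G, returns = 0.0, []
    for (_, _, r) in reversed(episode):          # returns G_t, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, (s, a, _) in enumerate(episode):
        delta = returns[t] - w @ x(s)
        w = w + beta * delta * x(s)
        theta = theta + alpha * (gamma ** t) * delta * grad_log_pi(s, a, actions, theta, phi)
    return theta, w
```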
Actor-Critic Methods
This algorithm is essentially:
- a TD generalization of Monte Carlo policy gradient with baseline (REINFORCE with baseline),
- combined with eligibility traces.
Note: Monte Carlo methods must wait for the current episode to finish before the exact return \(G_t\) can be computed.
TD removes this requirement (and thereby improves efficiency) by using a temporal-difference estimate as an approximate return, \(G_t^{(1)} \approx G_t\) (at the cost of some bias).
Eligibility traces accumulate the value-function gradients used for the weight updates: \(e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1}\).
- Update rule
\[
\delta = G_t^{(1)} - \hat{v}(S_t, w) \\
\quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\
w_{t+1} = w_t + \beta \delta \nabla_w \hat{v}(S_t, w) \\
\theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta)
\]
- Update rule with eligibility traces
\[
\delta = R + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\
e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\
w_{t+1} = w_t + \beta \delta e^w \\
e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\
\theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\
\text{where} \\
R + \gamma \hat{v}(s', w) = G_t^{(1)} \\
\delta \text{ - the TD error} \\
e^w \text{ - the eligibility trace of the state value function} \\
e^{\theta} \text{ - the eligibility trace of the policy}
\]
- Algorithm description
Please refer to the original book; the details are not repeated here.
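A sketch of the episodic actor-critic update with eligibility traces, following the formulas above and reusing `softmax_probs` / `grad_log_pi` from the earlier sketches. Transitions are fed in by the caller, and the step sizes and trace-decay rates are illustrative assumptions:

```python
import numpy as np

class ActorCriticWithTraces:
    """Per-transition actor-critic update with accumulating eligibility traces (episodic sketch)."""

    def __init__(self, theta, w, actions, phi, x,
                 alpha=0.01, beta=0.05, gamma=1.0,
                 lambda_theta=0.9, lambda_w=0.9):
        self.theta, self.w = theta, w
        self.actions, self.phi, self.x = actions, phi, x
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.lambda_theta, self.lambda_w = lambda_theta, lambda_w
        self.start_episode()

    def start_episode(self):
        self.e_theta = np.zeros_like(self.theta)   # actor trace e^theta
        self.e_w = np.zeros_like(self.w)           # critic trace e^w
        self.I = 1.0                               # accumulates gamma^t

    def act(self, s):
        probs = softmax_probs(s, self.actions, self.theta, self.phi)
        return self.actions[np.random.choice(len(self.actions), p=probs)]

    def update(self, s, a, r, s_next, terminal):
        v_next = 0.0 if terminal else self.w @ self.x(s_next)
        delta = r + self.gamma * v_next - self.w @ self.x(s)        # TD error delta
        # e^w <- lambda^w e^w + gamma^t grad_w v_hat(S, w);  w <- w + beta * delta * e^w
        self.e_w = self.lambda_w * self.e_w + self.I * self.x(s)
        self.w = self.w + self.beta * delta * self.e_w
        # e^theta <- lambda^theta e^theta + gamma^t grad_theta log pi(A|S, theta);  theta <- theta + alpha * delta * e^theta
        self.e_theta = (self.lambda_theta * self.e_theta
                        + self.I * grad_log_pi(s, a, self.actions, self.theta, self.phi))
        self.theta = self.theta + self.alpha * delta * self.e_theta
        self.I *= self.gamma
```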
Policy Gradient for Continuing Problems (Average Reward Rate)
- Policy performance formula
For continuing tasks, the performance of a policy is defined as the average reward per time step.
\[
\begin{align}
\eta(\theta) \doteq r(\theta)
& \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E} [R_t | \theta_0 = \theta_1 = \dots = \theta_{t-1} = \theta] \\
& = \lim_{t \to \infty} \mathbb{E} [R_t | \theta_0 = \theta_1 = \dots = \theta_{t-1} = \theta]
\end{align}
\]
- Update rule
\[
\delta = G_t^{(1)} - \hat{v}(S_t, w) \\
\quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\
w_{t+1} = w_t + \beta \delta \nabla_w \hat{v}(S_t, w) \\
\theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta)
\]
- Update rule: Actor-Critic with eligibility traces (continuing)
\[
\delta = R - \bar{R} + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\
\bar{R} = \bar{R} + \eta \delta \\
e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\
w_{t+1} = w_t + \beta \delta e^w \\
e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\
\theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\
\text{where} \\
R + \gamma \hat{v}(s', w) = G_t^{(1)} \\
\delta \text{ - the TD error} \\
\bar{R} \text{ - the estimate of the average reward} \\
e^w \text{ - the eligibility trace of the state value function} \\
e^{\theta} \text{ - the eligibility trace of the policy}
\]
- Algorithm description (Actor-Critic with eligibility traces (continuing))
Please refer to the original book; the details are not repeated here.
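A final sketch of the continuing (average reward) variant as a single per-transition update, again reusing `grad_log_pi` from the earlier sketches. The step sizes (α, β, η) and the γ^t factor `I` (kept only to mirror the formulas above, defaulting to 1) are assumptions for illustration:

```python
def actor_critic_continuing_step(s, a, r, s_next,
                                 theta, w, r_bar, e_theta, e_w,
                                 actions, phi, x,
                                 alpha=0.01, beta=0.05, eta=0.01,
                                 gamma=1.0, lambda_theta=0.9, lambda_w=0.9, I=1.0):
    """One transition of the continuing actor-critic with eligibility traces (sketch).

        delta   = R - R_bar + gamma * v_hat(S', w) - v_hat(S, w)
        R_bar  <- R_bar + eta * delta
        e^w    <- lambda^w e^w + gamma^t grad_w v_hat(S, w);             w     <- w + beta * delta * e^w
        e^theta<- lambda^theta e^theta + gamma^t grad_theta log pi(A|S); theta <- theta + alpha * delta * e^theta
    """
    delta = r - r_bar + gamma * (w @ x(s_next)) - w @ x(s)
    r_bar = r_bar + eta * delta
    e_w = lambda_w * e_w + I * x(s)
    w = w + beta * delta * e_w
    e_theta = lambda_theta * e_theta + I * grad_log_pi(s, a, actions, theta, phi)
    theta = theta + alpha * delta * e_theta
    return theta, w, r_bar, e_theta, e_w
```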
The book draft is not yet finished, so this chapter's notes stop here for now.