Reinforcement Learning Basics: Core Concepts and Dynamic Programming


Problem Description

Goal: maximize the expected cumulative reward.

• Episodic task: ends at some finite time step, producing $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \cdots, A_{T-1}, R_T, S_T$ (where $S_T$ is a terminal state)
• Continuing task: has no explicit termination signal, $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \cdots$
• The discounted return (cumulative reward) at time step $t$: $G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots$, where $0 \leq \gamma \leq 1$ (a small sketch computing $G_t$ follows this list)
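
A minimal Python sketch of the return computation; the reward sequence and $\gamma$ below are made-up illustration values:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    from the rewards observed after time step t."""
    g = 0.0
    # Work backward through the rewards: G_k = R_{k+1} + gamma * G_{k+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards [1, 0, 2] with gamma = 0.9 give
# G_t = 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```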

Markov Decision Process (MDP)

• $\mathcal{S}$: the set of all (nonterminal) states; $\mathcal{S}^{+}$: $\mathcal{S} \cup \{\text{terminal states}\}$ (episodic tasks only)
• $\mathcal{A}$: the set of possible actions; $\mathcal{A}(s)$: the set of actions available in state $s \in \mathcal{S}$
• $\mathcal{R}$: the set of rewards; $\gamma$: the discount rate, $0 \leq \gamma \leq 1$
• The one-step dynamics: $p(s^{\prime}, r \mid s, a)=\mathbb{P}(S_{t+1}=s^{\prime}, R_{t+1}=r \mid S_{t}=s, A_{t}=a)$ for each possible $s^{\prime}$, $r$, $s$, and $a$ (one possible encoding is sketched below)
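
The sketches below need a concrete encoding of these quantities. One assumed representation (a convenient choice, not prescribed by the post): a dict mapping each $(s, a)$ pair to a list of $(s', r, p)$ triples, illustrated here on a made-up two-state MDP:

```python
# One-step dynamics p(s', r | s, a) for a toy 2-state MDP (made-up numbers).
# dynamics[(s, a)] lists the possible (next_state, reward, probability) triples.
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}
states = ["s0", "s1"]                          # the state set S
actions = {s: ["stay", "go"] for s in states}  # A(s) for each state s
```

The later sketches take `dynamics`, `states`, and `actions` in this format as arguments.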

Solving the Problem

• Deterministic policy $\pi$: a mapping $\mathcal{S} \rightarrow \mathcal{A}$; stochastic policy $\pi$: a mapping $\mathcal{S} \times \mathcal{A} \rightarrow [0,1]$, i.e., $\pi(a \mid s)=\mathbb{P}(A_{t}=a \mid S_{t}=s)$
• State-value function $v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t} \mid S_{t}=s]$ and action-value function $q_{\pi}(s, a)=\mathbb{E}_{\pi}[G_{t} \mid S_{t}=s, A_{t}=a]$

• Bellman equation: $v_{\pi}(s)=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_{t}=s\right]$
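• Expanding the expectation over the policy and the one-step dynamics gives the form the dynamic-programming updates below actually compute: $v_{\pi}(s)=\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s^{\prime}, r} p(s^{\prime}, r \mid s, a)\left[r+\gamma v_{\pi}(s^{\prime})\right]$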

• Policy ordering: $\pi^{\prime} \geq \pi \Longleftrightarrow v_{\pi^{\prime}}(s) \geq v_{\pi}(s)$ for all $s \in \mathcal{S}$; an optimal policy (which may not be unique) satisfies $\pi_{*} \geq \pi$ for all possible $\pi$
• All optimal policies have the same state-value function $v_{*}$ and the same action-value function $q_{*}$
• $\pi_{*}(s)=\arg\max_{a \in \mathcal{A}(s)} q_{*}(s, a)$ (a greedy-extraction sketch follows this list)
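
Given a table of $q_*$ values, extracting a greedy action per state is a one-liner; a minimal sketch assuming `q` is a dict keyed by `(state, action)` pairs (a hypothetical representation, not from the post):

```python
def greedy_policy(q, states, actions):
    """Extract pi_*(s) = argmax over a in A(s) of q_*(s, a)."""
    return {s: max(actions[s], key=lambda a: q[(s, a)]) for s in states}
```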

Solution Method: Dynamic Programming

Assumption: the agent knows the MDP (i.e., its one-step dynamics) in advance, so it does not need to learn gradually from interaction with the environment.

Method 1: Policy Iteration

• Problem 1: estimate the state-value function $v_{\pi}$ corresponding to a policy $\pi$ (iterative policy evaluation)
• Problem 2: obtain the action-value function $q_{\pi}$ from the state-value function $v_{\pi}$ (both are sketched after this list)
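
A sketch of Problems 1 and 2, assuming the `dynamics`/`states`/`actions` representation from the MDP section and a stochastic policy stored as `pi[s][a]` (both are assumptions for illustration, not the post's own code):

```python
def policy_evaluation(dynamics, states, actions, pi, gamma=0.9, theta=1e-8):
    """Problem 1: iterative policy evaluation. Repeatedly apply the Bellman
    backup v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    until the largest change across states is below theta."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * sum(p * (r + gamma * v[s2])
                               for s2, r, p in dynamics[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def q_from_v(dynamics, actions, v, s, gamma=0.9):
    """Problem 2: q_pi(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma v_pi(s')]."""
    return {a: sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
            for a in actions[s]}
```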

• Problem 3: take an estimate of the state-value function $v_{\pi}$ corresponding to a policy $\pi$ and return a new, improved policy $\pi^{\prime} \geq \pi$ (policy improvement)
• Problem 4: solve an MDP in the dynamic programming setting by alternating Problems 1 and 3 until the policy stops changing (both are sketched below)
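
A sketch of Problems 3 and 4 under the same assumed representation, reusing `policy_evaluation` and `q_from_v` from the previous sketch:

```python
def policy_improvement(dynamics, states, actions, v, gamma=0.9):
    """Problem 3: act greedily with respect to v (via q_from_v) to obtain
    a policy pi' >= pi, written here as a one-hot stochastic policy."""
    pi = {}
    for s in states:
        q = q_from_v(dynamics, actions, v, s, gamma)
        best = max(q, key=q.get)
        pi[s] = {a: 1.0 if a == best else 0.0 for a in actions[s]}
    return pi

def policy_iteration(dynamics, states, actions, gamma=0.9):
    """Problem 4: alternate full policy evaluation and greedy improvement,
    starting from the equiprobable random policy, until the policy is stable."""
    pi = {s: {a: 1.0 / len(actions[s]) for a in actions[s]} for s in states}
    while True:
        v = policy_evaluation(dynamics, states, actions, pi, gamma)
        new_pi = policy_improvement(dynamics, states, actions, v, gamma)
        if new_pi == pi:
            return new_pi, v
        pi = new_pi
```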

Method 2: Truncated Policy Iteration

In this approach, the evaluation step is stopped after a fixed number of sweeps through the state space, rather than being run to convergence.
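
A sketch of the truncated evaluation step, again under the assumed representation; only the stopping rule differs from `policy_evaluation` above:

```python
def truncated_policy_evaluation(dynamics, states, actions, pi, v,
                                num_sweeps, gamma=0.9):
    """Run exactly num_sweeps Bellman-expectation sweeps over the state
    space instead of iterating to convergence, then return the estimate."""
    for _ in range(num_sweeps):
        for s in states:
            v[s] = sum(
                pi[s][a] * sum(p * (r + gamma * v[s2])
                               for s2, r, p in dynamics[(s, a)])
                for a in actions[s]
            )
    return v
```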

Method 3: Value Iteration

In this approach, each sweep over the state space simultaneously performs policy evaluation and policy improvement, so no separate evaluation loop is needed.
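
Concretely, each sweep applies the Bellman optimality backup $v(s) \leftarrow \max_{a \in \mathcal{A}(s)} \sum_{s^{\prime}, r} p(s^{\prime}, r \mid s, a)\left[r+\gamma v(s^{\prime})\right]$. A sketch under the same assumed representation:

```python
def value_iteration(dynamics, states, actions, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality backup
    v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    until convergence; the max folds evaluation and improvement together."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(sum(p * (r + gamma * v[s2])
                            for s2, r, p in dynamics[(s, a)])
                        for a in actions[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# An optimal policy then follows by applying greedy_policy to the
# action values q_from_v(dynamics, actions, v, s) for each state s.
```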

Source: https://www.cnblogs.com/sunwq06/p/11069134.html