• Reinforcement Learning Basics: Monte Carlo and Temporal Difference


    This article follows up on Reinforcement Learning Basics: Basic Concepts and Dynamic Programming and introduces two more methods for solving reinforcement learning problems.

    Solution Method: Monte Carlo

    • Problem 1 (left figure): estimate the state-value function $v_\pi$ corresponding to a policy $\pi$ (a code sketch follows this list)
      • First-visit MC estimates $v_\pi(s)$ as the average of the returns following only first visits to $s$ (it ignores returns associated with later visits)
      • Every-visit MC estimates $v_\pi(s)$ as the average of the returns following all visits to $s$
    • Problem 2 (right figure): estimate the action-value function $q_\pi$ corresponding to a policy $\pi$
      • First-visit MC estimates $q_\pi(s,a)$ as the average of the returns following only first visits to $s,a$
      • Every-visit MC estimates $q_\pi(s,a)$ as the average of the returns following all visits to $s,a$
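
    A minimal Python sketch of first-visit MC prediction for $v_\pi$. The `generate_episode` helper, which rolls out one episode under $\pi$ and returns a list of (state, reward) pairs, is a hypothetical interface, not part of the original post; a single commented check is all that separates first-visit from every-visit MC.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging the returns that follow first visits.

    `generate_episode` is an assumed helper that rolls out one episode
    under the policy pi and returns a list of (state, reward) pairs.
    """
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: record G only if s does not occur earlier.
            # Every-visit MC simply drops this check and records all visits.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```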

    • Problem 3 (left figure): get the optimal policy $\pi_*$
      • relationship between the mean and the individual returns: $\bar{Q}_k=\frac{\sum_{i=1}^k G_i}{k}=\bar{Q}_{k-1}+\frac{1}{k}(G_k-\bar{Q}_{k-1})$
      • $\epsilon$-greedy: Exploration vs. Exploitation (see the sketch after this list)
        • with probability $1-\epsilon$, select the greedy action $\pi(s)=\arg\max_{a \in \mathcal{A}(s)} Q(s, a)$ (Exploitation)
        • with probability $\epsilon$, select an action uniformly at random, $\pi(a|s)=\frac{1}{|\mathcal{A}(s)|}$ (Exploration)
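
    Both pieces fit in a few lines; the sketch below (variable names such as `Q_s` and `G_k` are illustrative) implements the incremental mean update and the $\epsilon$-greedy rule.

```python
import numpy as np

def epsilon_greedy_action(Q_s, epsilon):
    """epsilon-greedy selection over the action values Q_s of one state:
    exploit with probability 1 - epsilon, explore uniformly otherwise."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q_s))    # Exploration
    return int(np.argmax(Q_s))                # Exploitation

def incremental_mean_update(Q_sa, G_k, k):
    """Running mean of returns: Q_k = Q_{k-1} + (G_k - Q_{k-1}) / k,
    so previously seen returns never need to be stored."""
    return Q_sa + (G_k - Q_sa) / k
```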
    • Problem 4 (right figure): modify the algorithm to put more weight on the most recent returns
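
    One standard way to do this (often called constant-$\alpha$ MC) replaces the $\frac{1}{k}$ step size with a fixed $\alpha \in (0,1]$, which makes the estimate an exponentially recency-weighted average of the returns; a one-function sketch:

```python
def constant_alpha_update(Q_sa, G, alpha):
    """Q <- Q + alpha * (G - Q): with a fixed step size alpha, recent
    returns receive exponentially more weight than older ones."""
    return Q_sa + alpha * (G - Q_sa)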

    Solution Method: Temporal Difference

    Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value-function estimate, whereas temporal-difference (TD) methods update the value function after every time step.

    • Problem 1 (left figure): estimate the state-value function $v_\pi$ (the estimation of $q_\pi$ is similar)
    • Problem 2 (right figure): get the optimal action-value function $q_*$
      • On-policy: the agent interacts with the environment by following the same policy $\pi$ that it seeks to evaluate (or improve)
      • Sarsa(0) is an on-policy method (see the sketch after this list)
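
    A sketch of Sarsa(0), assuming a Gym-style environment interface (`env.reset()`, `env.step(action)` returning a 3-tuple); these interfaces are assumptions, not part of the original post. TD(0) prediction of $v_\pi$ uses the same one-step target with $V(s)$ in place of $Q(s,a)$ and a fixed policy $\pi$.

```python
from collections import defaultdict
import numpy as np

def epsilon_greedy_action(Q_s, epsilon):
    """Exploit with probability 1 - epsilon, otherwise explore uniformly."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q_s))
    return int(np.argmax(Q_s))

def sarsa(env, n_actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sarsa(0): on-policy TD control. The target uses the action a_next
    that the epsilon-greedy policy actually selects in the next state."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy_action(Q[s], epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # assumed Gym-style 3-tuple
            a_next = epsilon_greedy_action(Q[s_next], epsilon)
            # One-step TD update: bootstrap on Q(s', a') instead of
            # waiting for the episode's full return as MC does.
            target = r + gamma * Q[s_next][a_next] * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```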

    • Problem 3: a modified algorithm to get the optimal action-value function $q_*$
      • Off-policy: the agent interacts with the environment by following a policy $b$ that is different from the policy $\pi$ that it seeks to evaluate (or improve)
      • Sarsamax (i.e., Q-learning) is an off-policy method (see the sketch below)
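
    The only change from Sarsa is the target: Sarsamax bootstraps on $\max_a Q(S_{t+1}, a)$ rather than on the action the behavior policy will actually take next, which is what makes it off-policy. A sketch of the inner-loop update, under the same assumed conventions as the Sarsa snippet above:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha, gamma):
    """Sarsamax / Q-learning update (off-policy TD control): the target
    bootstraps on max_a' Q(s', a'), regardless of which action the
    epsilon-greedy behavior policy takes next."""
    target = r + gamma * np.max(Q[s_next]) * (not done)
    Q[s][a] += alpha * (target - Q[s][a])
```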

    • Problem 4: another modified algorithm to get the optimal action-value function $q_*$
      • Expected Sarsa is an on-policy method (see the sketch below)
      • $\pi(a|S_{t+1})$ is derived from $Q$ (e.g., $\epsilon$-greedy)
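
    Expected Sarsa replaces the sampled next action with the expectation of $Q(S_{t+1},\cdot)$ under the $\epsilon$-greedy policy derived from $Q$ itself; a sketch, under the same assumed conventions as the earlier snippets:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, done, alpha, gamma, epsilon):
    """Expected Sarsa update: the target averages Q(s', .) under the
    epsilon-greedy probabilities pi(.|s') derived from Q itself."""
    q_next = np.asarray(Q[s_next], dtype=float)
    n = len(q_next)
    probs = np.full(n, epsilon / n)                   # epsilon/n for every action
    probs[int(np.argmax(q_next))] += 1.0 - epsilon    # extra mass on the greedy action
    expected_q = float(probs @ q_next)
    target = r + gamma * expected_q * (not done)
    Q[s][a] += alpha * (target - Q[s][a])
```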



  • Original article: https://www.cnblogs.com/sunwq06/p/11084512.html