• A survey of RL


    • A finite set of states $S_t$ summarizing the information the agent senses from the environment at every time step $t \in \{1, \dots, T\}$.

    • A set of actions $A_t$ which the agent can perform at each time step $t \in \{1, \dots, T\}$ to interact with the environment.

    • A set of transition probabilities between subsequent states, which renders the environment stochastic. Note: the probabilities are usually not explicitly modeled but are rather the result of the stochastic nature of the financial asset's price process.

    • A reward (or return) function $R_t$ which provides a numerical feedback value $r_t$ to the agent in response to its action $A_{t-1} = a_{t-1}$ in state $S_{t-1} = s_{t-1}$.

    • A policy $\pi$ which maps states to concrete actions to be carried out by the agent. The policy can hence be understood as the agent's rules for how to choose actions.

    • A value function $V$ which maps states to the total (discounted) reward the agent can expect from a given state until the end of the episode (trading period) under policy $\pi$.
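
    A minimal sketch of how these components might be represented for a single trading episode is given below. The discrete state encoding, the position-times-return reward, and the momentum placeholder policy are illustrative assumptions, not something prescribed by the survey.

```python
import numpy as np

# Sketch of the MDP components for one trading episode.
# All encodings and numbers below are illustrative assumptions.

rng = np.random.default_rng(0)

T = 100                                   # length of the trading period (episode)
prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal(T + 1)))  # toy price path

actions = [-1, 0, 1]                      # A_t: short, flat, long

def state(t):
    """S_t: a coarse summary of recent market information (here: sign of the last price change)."""
    r = prices[t] - prices[t - 1] if t > 0 else 0.0
    return int(np.sign(r))                # one of {-1, 0, +1}

def reward(t, a_prev):
    """R_t: feedback for the action taken at t-1 (position times one-period price change)."""
    return a_prev * (prices[t] - prices[t - 1])

def policy(s):
    """pi: maps a state to an action; here a naive momentum rule as a placeholder."""
    return s                              # follow the sign of the last price change

# One pass through the episode: collect the reward for the previous action, then act.
total = 0.0
a = 0
for t in range(1, T + 1):
    total += reward(t, a)                 # R_t for the action chosen at t-1
    a = policy(state(t))                  # A_t = pi(S_t)
```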

    Given the above framework, the decision problem is formalized as finding the optimal policy $\pi = \pi^*$, i.e., the mapping from states to actions corresponding to the optimal value function $V^*$; see also Dempster et al. (2001); Dempster and Romahi (2002):

      $V^*(s_t) = \max_{a_t} \mathbb{E}\left[ R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s_t \right]. \quad (1)$

    Here, $\mathbb{E}$ denotes the expectation operator, $\gamma$ the discount factor, and $R_{t+1}$ the expected immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$. Further, $S_{t+1}$ denotes the next state of the agent. The value function can hence be understood as a mapping from states to discounted future rewards, which the agent seeks to maximize through its actions.
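
    To make equation (1) concrete, the following sketch performs a single Bellman backup for a toy two-state, two-action model; all transition probabilities, rewards, and the discount factor are made-up numbers used only for illustration.

```python
import numpy as np

# One Bellman backup for a toy two-state, two-action model.
# All probabilities, rewards, and gamma are illustrative assumptions.

gamma = 0.9
V = np.array([0.0, 1.0])                  # current value estimates for states 0 and 1

# P[a, s, s'] : probability of moving from s to s' under action a
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
# R[a, s] : expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

s = 0
# V*(s) = max_a E[R_{t+1} + gamma * V(S_{t+1}) | S_t = s]
backup = np.max(R[:, s] + gamma * (P[:, s, :] @ V))
print(backup)                             # 1.18: value of the best action in state 0
```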

    To solve this optimization problem, the Q-Learning algorithm (Watkins, 1989) can be applied, extending the above equation to the level of state-action tuples:

      $Q^*(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a_{t+1}} Q^*(S_{t+1}, a_{t+1}) \,\middle|\, S_t = s_t, A_t = a_t \right]. \quad (2)$

    Here, the Q-value $Q^*(s_t, a_t)$ equals the immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$ plus the discounted future reward obtained by acting optimally thereafter.

    The optimal policy $\pi^*$ (the mapping from states to actions) then simply becomes:

      $\pi^*(s_t) = \operatorname*{arg\,max}_{a_t} Q^*(s_t, a_t). \quad (3)$

    i.e., in every state $S_t = s_t$, choose the action $A_t = a_t$ that yields the highest Q-value. To approximate the Q-function during (online) learning, an iterative optimization is carried out, with $\alpha$ denoting the learning rate; see also Sutton and Barto (1998) for further details:

      $Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right). \quad (4)$
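
    A minimal tabular implementation of update (4) might look as follows; the state/action sizes, the ε-greedy exploration scheme, and all numeric constants are assumptions made for illustration, not part of the survey.

```python
import numpy as np

# Tabular Q-learning: iterative update of Q(s, a) as in equation (4).
# State/action encoding, epsilon-greedy exploration, and all constants are illustrative assumptions.

rng = np.random.default_rng(1)

n_states, n_actions = 3, 3                # e.g. states {-1, 0, +1} and actions {short, flat, long}
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def q_update(Q, s, a, r_next, s_next):
    """One Q-learning step: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    target = r_next + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

def greedy_action(Q, s):
    """Equation (3): pick the action with the highest Q-value in state s."""
    return int(np.argmax(Q[s]))

def epsilon_greedy(Q, s):
    """Exploration scheme (an assumption, not from the survey): random action with prob. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return greedy_action(Q, s)
```

    In an online trading setting, the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ would be generated by interacting with the market at each time step, and the greedy policy of equation (3) is recovered from the learned table via greedy_action.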

  • Original post: https://www.cnblogs.com/the-wolf-sky/p/10642895.html