• [Machine Learning for Trading] {ud501} Lesson 25: 03-05 Reinforcement learning | Lesson 26: 03-06 Q-Learning | Lesson 27: 03-07 Dyna


    The RL problem

    Trading as an RL problem 

     

    Mapping trading to RL 

    Markov decision problems 

     Unknown transitions and rewards

     

     What to optimize?

     

     

     






     

     Learning Procedure

     

    Update Rule

     

    The formula for computing Q for any state-action pair <s, a>, given an experience tuple <s, a, s', r>, is:
    Q'[s, a] = (1 - α) · Q[s, a] + α · (r + γ · Q[s', argmax_{a'}(Q[s', a'])])

    Here:

      • r = R[s, a] is the immediate reward for taking action a in state s,
      • γ ∈ [0, 1] (gamma) is the discount factor used to progressively reduce the value of future rewards,
      • s' is the resulting next state,
      • argmax_{a'}(Q[s', a']) is the action that maximizes the Q-value among all possible actions a' from s', and
      • α ∈ [0, 1] (alpha) is the learning rate used to vary the weight given to new experiences compared with past Q-values.
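
    A minimal sketch of this update in code (assuming a tabular Q stored as a 2-D NumPy array indexed by [state, action]; the function name and the default α and γ values are illustrative, not the course's code). Note that Q[s', argmax_{a'}(Q[s', a'])] is simply the maximum Q-value over actions in s':

        import numpy as np

        def q_update(Q, s, a, s_prime, r, alpha=0.2, gamma=0.9):
            # Value of the greedy action a' in the next state s'
            best_next = np.max(Q[s_prime])
            # Improved return estimate: immediate reward plus discounted future value
            target = r + gamma * best_next
            # Blend the old Q-value with the new estimate using the learning rate
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            return Q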

     Two Finer Points

     

    The Trading Problem: Actions 

     

     

    A reward at each step allows the learning agent to get feedback on each individual action it takes (including doing nothing).

    SMA: simple moving average => different stocks have different price bases (absolute price levels)

    => adjusted close / SMA is a good normalized factor
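
    For example, a sketch of this feature with pandas (assuming adj_close is a Series of adjusted close prices; the 20-day window and the function name are illustrative choices):

        import pandas as pd

        def price_sma_ratio(adj_close, window=20):
            # Simple moving average over the lookback window
            sma = adj_close.rolling(window).mean()
            # Values near 1.0 mean the price is close to its recent average,
            # regardless of the stock's absolute price level
            return adj_close / sma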

    Creating the State 

     

    Discretizing 

     

     Q-Learning Recap

    Summary

    Advantages

    • The main advantage of a model-free approach like Q-Learning over model-based techniques is that it can easily be applied to domains where all states and/or transitions are not fully defined.
    • As a result, we do not need additional data structures to store transitions T(s, a, s') or rewards R(s, a).
    • Also, the Q-value for any state-action pair takes into account future rewards. Thus, it encodes both the best possible value of a state (max_a Q(s, a)) as well as the best policy in terms of the action that should be taken (argmax_a Q(s, a)).
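
    For instance, both the greedy policy and the state values can be read directly from a tabular Q (a sketch assuming Q is a NumPy array of shape [num_states, num_actions]):

        import numpy as np

        def greedy_policy_and_value(Q):
            # argmax_a Q(s, a): the best action to take in each state (the policy)
            policy = np.argmax(Q, axis=1)
            # max_a Q(s, a): the best achievable value of each state
            value = np.max(Q, axis=1)
            return policy, value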

    Issues

    • The biggest challenge is that the reward (e.g. for buying a stock) often comes in the future - representing that properly requires look-ahead and careful weighting.
    • Another problem is that taking random actions (such as trades) just to learn a good strategy is not really feasible (you'll end up losing a lot of money!).
    • In the next lesson, we will discuss an algorithm that tries to address this second problem by simulating the effect of actions based on historical data.

    Resources






    Dyna-Q Big Picture (invented by Richard Sutton)

     Learning T

     

    How to Evaluate T? 


    Correction: In the denominator shown in the video, T is missing the subscript c. The expression for computing transition probabilities using counts should be:
    T[s, a, s'] = T_c[s, a, s'] / Σ_i T_c[s, a, i]
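
    A minimal sketch of this count-based estimate (the array sizes and the tiny initial count that avoids division by zero are illustrative assumptions, not values from the lesson):

        import numpy as np

        num_states, num_actions = 100, 3                            # illustrative sizes
        Tc = np.full((num_states, num_actions, num_states), 1e-5)   # transition counts, seeded with a tiny prior

        def observe(s, a, s_prime):
            # Count one real transition <s, a, s'>
            Tc[s, a, s_prime] += 1

        def transition_prob(s, a):
            # T[s, a, s'] = T_c[s, a, s'] / Σ_i T_c[s, a, i]
            return Tc[s, a] / Tc[s, a].sum()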

     Learning R

    Dyna-Q Recap

     

    Summary

    The Dyna architecture consists of a combination of:

    • direct reinforcement learning from real experience tuples gathered by acting in an environment,
    • updating an internal model of the environment, and
    • using the model to simulate experiences.
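
    Putting the pieces together, a rough sketch of one Dyna-Q step (it reuses the hypothetical q_update, observe, and transition_prob helpers from the earlier sketches; the exponentially smoothed reward model and the 200 hallucinated updates per real step are illustrative assumptions):

        import random
        import numpy as np

        # Reuses q_update, observe, transition_prob, num_states and num_actions
        # from the sketches earlier in these notes.
        R = np.zeros((num_states, num_actions))   # model of expected immediate reward R[s, a]
        alpha = 0.2                               # learning rate for the reward model (illustrative)
        hallucinations = 200                      # simulated updates per real step (illustrative)
        seen = set()                              # (s, a) pairs experienced so far

        def dyna_q_step(Q, s, a, s_prime, r):
            # 1) Direct RL: update Q from the real experience tuple
            Q = q_update(Q, s, a, s_prime, r)
            # 2) Update the model of the environment
            observe(s, a, s_prime)
            R[s, a] = (1 - alpha) * R[s, a] + alpha * r
            seen.add((s, a))
            # 3) Planning: update Q from simulated ("hallucinated") experiences
            for _ in range(hallucinations):
                hs, ha = random.choice(tuple(seen))
                hs_prime = np.random.choice(num_states, p=transition_prob(hs, ha))
                Q = q_update(Q, hs, ha, hs_prime, R[hs, ha])
            return Q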

    Dyna learning architecture

    Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]

    Resources

    • Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, 1990. [pdf]
    • Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]
    • RL course by David Silver (videos, slides)
      • Lecture 8: Integrating Learning and Planning [pdf]





    Interview with Tammer Kamel

    Tammer Kamel is the founder and CEO of Quandl - a data platform that makes financial and economic data available through easy-to-use APIs.

    Listen to this two-part interview with him.

    • Part 1: The Quandl Data Platform (08:18)
    • Part 2: Trading Strategies and Nuances (10:53)

    Note: The interview is audio-only; closed captioning is available (CC button in the player).
