Reinforcement Learning Cheatsheet

1. MDPs

What are MDPs?

Markov decision processes (MDPs) are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and, through those, future rewards.

The dynamics of an MDP are defined as:

\[ p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \]

What is the Markov property?

The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.

What is a finite MDP?

In a finite MDP, the sets of states, actions, and rewards (\(S\), \(A\), and \(R\)) all have a finite number of elements. In this case, the random variables \(R_t\) and \(S_t\) have well-defined discrete probability distributions that depend only on the preceding state and action.
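
Because everything is finite, the dynamics \(p(s', r \mid s, a)\) can be stored as a plain lookup table. The sketch below is one possible representation, not a standard API: the two-state MDP, its state and action names, and the helper functions are all invented for illustration. It checks that each conditional distribution sums to 1 and derives two quantities from the table.

```python
from collections import defaultdict

# Hypothetical two-state MDP, invented for illustration only.
# dynamics[(s, a)] maps each outcome (s_next, r) to p(s_next, r | s, a).
dynamics = {
    ("low", "wait"): {("low", 0.0): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("high", "search"): {("high", 2.0): 0.7, ("low", 2.0): 0.3},
}


def is_valid_dynamics(p, tol=1e-9):
    """Return True if p(., . | s, a) sums to 1 for every (s, a) pair."""
    return all(abs(sum(outcomes.values()) - 1.0) <= tol for outcomes in p.values())


def state_transition_probs(p, s, a):
    """Marginalize over rewards: p(s' | s, a) = sum_r p(s', r | s, a)."""
    probs = defaultdict(float)
    for (s_next, _reward), prob in p[(s, a)].items():
        probs[s_next] += prob
    return dict(probs)


def expected_reward(p, s, a):
    """Expected immediate reward: sum over (s', r) of r * p(s', r | s, a)."""
    return sum(reward * prob for (_s_next, reward), prob in p[(s, a)].items())
```

For example, `state_transition_probs(dynamics, "high", "search")` sums out the reward and leaves the next-state distribution for that state-action pair.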

What can one do if the state or action is continuous?

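Quantize it: map each continuous value to one of a finite number of bins, so that the finite-MDP formulation applies.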
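
A minimal sketch of quantization, assuming a one-dimensional quantity with a known range and equal-width bins. The helper names `make_bin_edges` and `quantize` are hypothetical, and equal-width binning is only one of several possible schemes.

```python
from bisect import bisect_right


def make_bin_edges(low, high, n_bins):
    """Interior edges that split [low, high] into n_bins equal-width bins."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]


def quantize(x, edges):
    """Map a continuous value x to a bin index in 0..len(edges).

    Values below the first edge go to bin 0 and values above the last
    edge go to the last bin, so out-of-range inputs are clipped.
    """
    return bisect_right(edges, x)
```

With `edges = make_bin_edges(-1.0, 1.0, 4)`, the call `quantize(0.7, edges)` returns bin 3, and any value outside the range falls into the first or last bin. The resulting bin index can then serve as a discrete state or action.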

What makes MDPs different from k-armed bandit problems?

Whereas in bandit problems we estimate the value \(q_*(a)\) of each action \(a\), in MDPs we estimate the value \(q_*(s, a)\) of each action \(a\) in each state \(s\), or the value \(v_*(s)\) of each state given optimal action selections.
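
For reference, the two optimal value functions named here are linked by a standard identity, written in this cheatsheet's notation with \(A\) as the action set: the optimal value of a state equals the value of the best action available in it.

```latex
\[ v_*(s) = \max_{a \in A} q_*(s, a) \]
```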

Policy function

A policy (e.g. \(\pi(a \mid s)\)) is a mapping from states to probabilities of selecting each possible action.
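
As a rough illustration, a stochastic policy over finite state and action sets can be held in a nested dictionary. The states, actions, and probabilities below are made up, and `sample_action` is a hypothetical helper built on the standard library's `random.choices`.

```python
import random

# Hypothetical stochastic policy: policy[s][a] = pi(a | s).
policy = {
    "low": {"wait": 0.5, "recharge": 0.5},
    "high": {"wait": 0.1, "search": 0.9},
}


def is_valid_policy(pi, tol=1e-9):
    """Return True if the action probabilities sum to 1 in every state."""
    return all(abs(sum(dist.values()) - 1.0) <= tol for dist in pi.values())


def sample_action(pi, state, rng=random):
    """Draw one action for `state` according to pi(. | state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible. A deterministic policy is the special case where one action in each state has probability 1.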

Value function

State-value function

The state-value function \(v_{\pi}\) gives the value of a state \(s\) under a policy \(\pi\): the expected return an agent obtains when it starts in state \(s\) and follows \(\pi\) thereafter.

\[ v_{\pi}(s) = \mathrm{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right], \quad \text{for all } s \in S \]
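
The quantity inside the expectation is the discounted return of a single trajectory. The sketch below computes it for a finite list of rewards, which amounts to truncating the infinite sum (or treating all later rewards as zero). The function name `discounted_return` is an illustrative choice.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k].

    rewards[k] plays the role of R_{t+k+1}. The loop runs backwards and
    uses the recursion G = r + gamma * G_next.
    """
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g
```

For instance, `discounted_return([1, 1, 1], 0.5)` evaluates to 1 + 0.5 + 0.25 = 1.75. Averaging this quantity over many trajectories that start in \(s\) and follow \(\pi\) gives a sample-based estimate of \(v_{\pi}(s)\).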

Action-value function

The action-value function for policy \(\pi\), denoted \(q_{\pi}(s, a)\), is defined analogously to the state-value function: it is the expected return an agent obtains when it starts from the state-action pair \((s, a)\) at time \(t\) and follows \(\pi\) thereafter.

\[ q_{\pi}(s, a) = \mathrm{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right], \quad \text{for all } s \in S \text{ and all } a \in A \]
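
The two value functions are tied together by standard identities, written here with the policy \(\pi(a \mid s)\) and the dynamics \(p(s', r \mid s, a)\) from section 1. The first averages action values over the policy. The second looks one step ahead through the dynamics.

```latex
\[ v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, q_{\pi}(s, a) \]

\[ q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_{\pi}(s') \right] \]
```

Substituting the second identity into the first yields the Bellman equation for \(v_{\pi}\).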

2. Dynamic Programming

Dynamic programming (DP) refers to a collection of algorithms that can be used to compute value functions, and thus to find optimal policies, given a perfect model of the environment as an MDP.
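
The text does not single out a particular algorithm, so the sketch below picks one well-known member of the DP family, iterative policy evaluation, to show how a value function can be computed from a known model. The tiny MDP, the policy, and the default values of `gamma` and `theta` are assumptions made up for this example.

```python
# Hypothetical model: dynamics[(s, a)] maps (s_next, r) to p(s_next, r | s, a).
dynamics = {
    ("A", "stay"): {("A", 0.0): 1.0},
    ("A", "go"): {("B", 0.0): 1.0},
    ("B", "stay"): {("B", 1.0): 1.0},
}

# Hypothetical policy: policy[s][a] = pi(a | s).
policy = {
    "A": {"stay": 0.5, "go": 0.5},
    "B": {"stay": 1.0},
}


def policy_evaluation(dynamics, policy, gamma=0.5, theta=1e-8):
    """Estimate v_pi by repeated Bellman expectation backups.

    Values are updated in place, sweep after sweep, until the largest
    change within a sweep drops below theta.
    """
    values = {s: 0.0 for s in policy}
    while True:
        delta = 0.0
        for s in policy:
            new_v = 0.0
            for a, pi_a in policy[s].items():
                for (s_next, reward), prob in dynamics[(s, a)].items():
                    new_v += pi_a * prob * (reward + gamma * values[s_next])
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < theta:
            return values
```

For this model the values can be checked by hand: with `gamma = 0.5`, state B satisfies v(B) = 1 + 0.5 v(B), so v(B) = 2, and state A satisfies v(A) = 0.25 v(A) + 0.5, so v(A) = 2/3. The function converges to these values up to the tolerance `theta`.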

Original article: https://www.cnblogs.com/DianeSoHungry/p/11444968.html
