^ is the square root of epsilon
a simplified version of the hard version
a smoother way to find the correct solution
the first term is the REINFORCE term, and the second term is the gradient of the log probability of our loss
b is a stochastic node
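a rough sketch of the REINFORCE term for a stochastic node b, under assumptions not in these notes: b is taken to be Bernoulli with logit theta, and the toy loss depends on b only, so the second term vanishes here. the names reinforce_grad, exact_grad and toy_loss are hypothetical.

```python
import numpy as np

def reinforce_grad(theta, loss_fn, n_samples=200_000, seed=0):
    """Score-function (REINFORCE) estimate of d/dtheta E[loss_fn(b)],
    assuming b ~ Bernoulli(sigmoid(theta))."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))
    # Sample the stochastic node b.
    b = (rng.random(n_samples) < p).astype(float)
    # For a Bernoulli with logit theta, d/dtheta log p(b | theta) = b - p.
    score = b - p
    return float(np.mean(loss_fn(b) * score))

def exact_grad(theta, loss_fn):
    """Closed-form gradient for comparison: E[f(b)] = p*f(1) + (1-p)*f(0)."""
    p = 1.0 / (1.0 + np.exp(-theta))
    f1 = float(loss_fn(np.array(1.0)))
    f0 = float(loss_fn(np.array(0.0)))
    # dp/dtheta = p * (1 - p)
    return p * (1.0 - p) * (f1 - f0)

def toy_loss(b):
    # Hypothetical loss that depends on b only, not on theta.
    return (b - 0.45) ** 2

est = reinforce_grad(0.3, toy_loss)
exact = exact_grad(0.3, toy_loss)
print(est, exact)
```

with enough samples the estimate should land close to the closed-form value; the estimator is unbiased but can have high variance.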
further formula derivations are omitted.