





















Instead of outputting Q-values, directly optimize the policy function.
The policy gives the probability of taking a specific action given the state s.
Rather than taking the maximum, we sample an action from this probability distribution.
Action selection is therefore stochastic: we do not always pick the action the network thinks is best.
Because we constantly sample, we get more exploration of the environment.
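
The contrast between taking the maximum and sampling can be sketched as follows. This is a minimal illustration and not a full training loop. The softmax over raw network outputs, the example scores, and the helper names (`softmax`, `greedy_action`, `sample_action`) are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_action(q_values):
    """Value-based choice: always take the highest-scoring action."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def sample_action(logits, rng):
    """Policy-based choice: draw an action from the policy's distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical network outputs for one state with three actions.
logits = [2.0, 1.0, 0.1]
rng = random.Random(0)  # fixed seed so the run is repeatable

# Sample many times: every action gets tried, the likelier ones more often.
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_action(logits, rng)] += 1

print("greedy action:", greedy_action(logits))
print("sampled action counts:", counts)
```

The greedy rule returns the same action every time for the same scores, while the sampled rule spreads its choices across all actions in proportion to their probability. That spread is where the extra exploration comes from.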







