RL Course by David Silver - Lecture 2: Markov Decision Process


Notes from David Silver's YouTube lecture.

- Markov Reward Process (MRP)

  • Markov Decision Process (MDP)

  • the state $s$ fully characterizes your future rewards
    • so we don't care about past rewards, because they have already been consumed
    • what we want is to maximize the reward from now on, i.e. the return (written out below)
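  For reference (a standard definition from Lecture 1, not on this slide): the return being maximized is the discounted sum of future rewards,

  $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$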

  • how good is it to be in state $s$ if I follow policy $\pi$?

  • in the backup diagram, the white-to-black transition (state to action) is determined by the policy
    $q_\pi$ is the q-value (action value)

  • $v_\pi(s)$ is the state-value function
    $q_\pi(s, a)$ is the action-value function (both defined below)
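  Spelled out, these are the lecture's definitions:

  $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

  $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$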

  • put the two together: $v_\pi(s)$ is expressed in terms of its successor $v_\pi(s')$, so $s$ is related to $s'$
  • a recursive relationship: the Bellman expectation equation (written out below)
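  Written out, the two-step lookahead gives the Bellman expectation equation from the slides:

  $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \right)$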

  • you can do the same with action values: $a$ is related to the successor action $a'$ (see the equation below)
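  The same recursion in terms of action values:

  $q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a')$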

  • the black dot beneath each state in the backup diagram is there to show the action step of the process

  • there is always at least one optimal policy $\pi_*$ that is better than or equal to all other policies

  • it is possible that there is more than one optimal policy, e.g. when different actions take you to the same state

  • once we have $q_*$ we can act optimally in the MDP, but how do we arrive at $q_*$?

  • Bellman optimality equation

  • look at the value of each action you can take and pick the max of them.

  • $v_*(s) = \max\limits_a q_*(s, a)$

  • look at the optimal value we end up with, and back these values all the way up to $v_*(s)$

  • the same relationship, just reordered for $q_*$ (written out below)
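  Written out, the lookahead for $q_*$ is

  $q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s')$

  and substituting each equation into the other gives the Bellman optimality equations:

  $v_*(s) = \max_a \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \right)$

  $q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')$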

  • solved inductively

  • dynamic programming methods will solve these recursive equations (a sketch follows)
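As a minimal sketch (my own toy example, not from the slides): value iteration applies the Bellman optimality backup repeatedly until the values converge. The two-state MDP and its transition table `P` below are made up purely for illustration.

```python
import numpy as np

# A tiny, made-up 2-state MDP (illustrative only, not from the lecture).
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)],                   # action 0: stay in state 0
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},   # action 1: mostly reach state 1
    1: {0: [(1.0, 1, 2.0)],                   # action 0: stay, reward 2
        1: [(1.0, 0, 0.0)]},                  # action 1: back to state 0
}
gamma = 0.9  # discount factor

v = np.zeros(len(P))
for _ in range(1000):
    # One sweep of the Bellman optimality backup:
    # v(s) <- max_a sum_{s'} P(s'|s,a) * (r + gamma * v(s'))
    v_new = np.array([
        max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s])
        for s in sorted(P)
    ])
    if np.max(np.abs(v_new - v)) < 1e-8:  # converged to v*
        v = v_new
        break
    v = v_new

print(v)  # approximate optimal state values v*(s)
```

Once $v_*$ has converged, an optimal policy can be read off greedily via $\pi_*(s) = \arg\max_a q_*(s, a)$. Silver covers these dynamic programming methods in Lecture 3.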
