David Silver's YouTube lectures
- Markov Reward Process (MRP)
- Markov Decision Process (MDP)
- state s fully characterizes your future rewards
- so we don't care about past rewards because they're already consumed
- what we want is to maximize the reward from now on (the return)
- how good is it to be in state s if I follow policy π
- the white-to-black transition probability (state to action in the diagram) is defined by the policy
qπ is the q-value (the action value)
- vπ(s′) is the state value function
qπ(s,a) is the action value function (definitions sketched below)
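As a quick reference, a sketch of the standard definitions behind these bullets, in my own notation (G_t is the return from time t, γ the discount factor, E_π the expectation under policy π):

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + …
vπ(s) = E_π[ G_t | S_t = s ]
qπ(s,a) = E_π[ G_t | S_t = s, A_t = a ]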
- put the two together
- s is defined relative to s′
- recursive relationship (the Bellman expectation equation; see the sketch below)
- you can do it for action values as well: a is relative to a′
- beneath the black dot (the action), the diagram shows the environment's transition process
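Putting the two together gives the Bellman expectation equations, roughly as I understand them (P(s′|s,a) is the transition probability and R(s,a) the expected immediate reward; these names are my own shorthand):

vπ(s) = Σ_a π(a|s) · qπ(s,a)
qπ(s,a) = R(s,a) + γ · Σ_s′ P(s′|s,a) · vπ(s′)

Substituting one into the other gives vπ(s) in terms of vπ(s′), and likewise qπ(s,a) in terms of qπ(s′,a′), which is the recursive relationship above.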
- once we have q we can solve the MDP, but how do we arrive at q, i.e. figure out q∗?
- Bellman optimality equation
- look at the value of each action you can take and pick the max of them.
- v∗(s) = max_a q∗(s,a)
- look at the optimal value we end up with and back these values all the way up to v∗(s) (see below)
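To complete the pair (my own note, same notation as above): the optimal action value backs up through the optimal state values of the successor states,

q∗(s,a) = R(s,a) + γ · Σ_s′ P(s′|s,a) · v∗(s′)

and substituting this into v∗(s) = max_a q∗(s,a) gives the Bellman optimality equation in v∗ alone.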
- dynamic programming methods will solve these recursive equations (a value iteration sketch follows)
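A minimal value-iteration sketch of how dynamic programming can solve these equations, assuming a tiny tabular MDP with known transitions; the two states, the action names, and all numbers below are made up for illustration, not taken from the lecture:

# Minimal value iteration on a tiny, made-up tabular MDP.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9   # discount factor
theta = 1e-8  # convergence threshold

V = {s: 0.0 for s in P}  # initial guess for v*(s)
while True:
    delta = 0.0
    for s in P:
        # q*(s, a) = sum over s' of p * (r + gamma * v*(s')) for each action a
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}
        new_v = max(q.values())  # Bellman optimality backup: v*(s) = max_a q*(s, a)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:  # values stopped changing
        break

# read the greedy policy off the converged values
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, policy)

Each sweep applies the backup v(s) ← max_a Σ_s′ p·(r + γ·v(s′)) to every state; once the values converge, the greedy policy is read off the resulting v∗.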