CS 285 at UC Berkeley: Deep Reinforcement Learning | Lecture 7: Value Function Methods

김까치 · November 21, 2023

A in actor-critic: picks the best action at the current state under the current policy (argmax of the advantage)


value-based methods: drop the explicit policy from actor-critic and focus only on picking the best action, whatever the current policy is

  • each iteration constructs a new policy pi' (a distribution that puts probability 1 on the argmax action of A and probability 0 on all other actions)
  • the resulting pi' performs at least as well as pi

policy iteration algorithm:
1. generate samples
2. fit A (or Q or V)
3. improve the policy implicitly: pi' = greedy w.r.t. A


how to evaluate A?
from last lecture: A^pi(s,a) = r(s,a) + γE[V^pi(s')] - V^pi(s)
=> so let's evaluate V


=> use dynamic programming
1.
bootstrapped update: V(s) <- E_{a~pi(a|s)}[r(s,a) + γE[V(s')]]
use the current estimate when plugging in V(s')
2.
use V to construct pi'
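
A minimal numpy sketch of this tabular policy iteration loop, on a made-up MDP given as a transition tensor P[s, a, s'] and reward table R[s, a] (both hypothetical, not from the lecture):

```python
import numpy as np

# toy MDP just for illustration: P[s, a, s'] transition probs, R[s, a] rewards
n_states, n_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

pi = np.zeros(n_states, dtype=int)   # deterministic policy: state -> action
V = np.zeros(n_states)

for _ in range(50):
    # 1. policy evaluation with bootstrapped updates:
    #    V(s) <- r(s, pi(s)) + gamma * E[V(s')], reusing the current estimate of V
    for _ in range(50):
        V = R[np.arange(n_states), pi] + gamma * P[np.arange(n_states), pi] @ V
    # 2. implicit policy improvement: pi'(s) = argmax_a A(s, a) = argmax_a Q(s, a)
    Q = R + gamma * P @ V
    pi = Q.argmax(axis=1)
```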


simpler dynamic programming:
A^pi(s,a) = r(s,a) + γE[V^pi(s')] - V^pi(s)
dropping the -V^pi(s) term (it doesn't depend on a) leaves just the Q function
argmax_a A^pi(s,a) = argmax_a Q^pi(s,a)
so when constructing pi', use Q(s,a) = r(s,a) + γE[V^pi(s')] instead of A (simpler)


Q-function table (s, a as rows and columns):
for each s, the highest-valued a is exactly the policy
so don't bother computing the policy explicitly, just work with values
1. generate samples
2. Q(s,a) <- r(s,a) + γE[V(s')]
3. V(s) <- max_a Q(s,a)
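
On the same kind of toy MDP, steps 2 and 3 fit in a couple of numpy lines; only a sketch, with the arrays P[s, a, s'] and R[s, a] invented for illustration:

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy transition probs
R = rng.normal(size=(n_states, n_actions))                        # toy rewards

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V       # step 2: Q(s,a) <- r(s,a) + gamma * E[V(s')]
    V = Q.max(axis=1)           # step 3: V(s)   <- max_a Q(s,a)

greedy_policy = Q.argmax(axis=1)  # the implicit policy, read off the final Q table
```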


representing the value function as a table -> curse of dimensionality
=> use a function approximator V_φ(s): a neural net with parameters φ


fit a neural net value function:
use least squares regression onto a target value
from the simpler value iteration above, the target value = max_a Q(s,a)


fitted value iteration algorithm:
1. y <- max_a Q(s,a) = max_a (r(s,a) + γE[V_φ(s')])
2. φ <- argmin_φ least squares regression(V_φ(s), y)
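
A rough PyTorch sketch of the two steps, again on a hypothetical small MDP with one-hot state features (so the dynamics P and rewards R are assumed known, which is exactly the limitation discussed next):

```python
import numpy as np
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(0)
P = torch.tensor(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), dtype=torch.float32)
R = torch.tensor(rng.normal(size=(n_states, n_actions)), dtype=torch.float32)
S = torch.eye(n_states)                      # one-hot state features

V_phi = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(V_phi.parameters(), lr=1e-2)

for _ in range(200):
    with torch.no_grad():
        V_next = V_phi(S).squeeze(-1)                      # current estimate of V_phi(s')
        y = (R + gamma * P @ V_next).max(dim=1).values     # step 1: y <- max_a (r + gamma * E[V_phi(s')])
    for _ in range(20):                                    # step 2: least squares regression onto y
        loss = ((V_phi(S).squeeze(-1) - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```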


this requires knowing the transition dynamics
what if we don't know the transition dynamics?
y <- max_a (r(s,a) + γE[V(s')])
the max over a needs the outcome of every possible action, not just the one we took


review... policy iteration <- policy evaluation
V(s) <- r(s,pi(s)) + γE[V(s')]
evaluate Q instead of V:
Q(s,a) <- r(s,a) + γE[Q(s',pi(s'))]
=> looks like a subtle difference, but now we don't need the transition dynamics!
all we need are (s,a,s') tuples taken from samples
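
A tiny sketch of this sample-based policy evaluation; the (s, a, r, s') tuples and the policy pi are placeholders, and the soft learning-rate update (alpha) is an extra assumption on top of the note's full-replacement version:

```python
import numpy as np

n_states, n_actions, gamma, alpha = 4, 2, 0.99, 0.1
Q = np.zeros((n_states, n_actions))
pi = np.zeros(n_states, dtype=int)                 # the policy being evaluated
transitions = [(0, 1, 0.5, 2), (2, 0, 1.0, 3)]     # hypothetical (s, a, r, s') samples

for s, a, r, s_next in transitions:
    # Q(s,a) <- r(s,a) + gamma * Q(s', pi(s')); no transition model needed
    target = r + gamma * Q[s_next, pi[s_next]]
    Q[s, a] += alpha * (target - Q[s, a])
```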


can we do fitted value iteration with Q values too,
without knowing the transitions?


fitted Q iteration algorithm:
1. y <- r(s,a) + γE[V_φ(s')]
   where E[V_φ(s')] ≈ max_a' Q_φ(s',a')
   uses only samples and the previous Q function
2. φ <- argmin_φ least squares regression(Q_φ(s,a), y)
* no convergence guarantee with neural networks


full fitted Q-iteration algorithm:
dataset size N
gradient steps S
iterate steps 2~3 K times
1. collect a dataset {(s,a,s',r)} using some policy
2. for each transition, compute the target value:
   set y <- r(s,a) + γ max_a' Q_φ(s',a')
3. find a new φ (train the Q network):
   φ <- argmin_φ least squares regression(Q_φ(s,a), y)
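
A condensed PyTorch sketch of this loop; the data-collection function, one-hot states, and network sizes are all placeholders, not the lecture's implementation:

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 16, 4, 0.99
K, N, S = 10, 1000, 100          # outer iterations, dataset size, gradient steps

Q_phi = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(Q_phi.parameters(), lr=1e-3)

def collect_dataset(n):
    # placeholder: in practice, roll out some behavior policy in the environment
    s = torch.eye(n_states)[torch.randint(n_states, (n,))]
    a = torch.randint(n_actions, (n,))
    r = torch.randn(n)
    s_next = torch.eye(n_states)[torch.randint(n_states, (n,))]
    return s, a, r, s_next

for _ in range(K):
    s, a, r, s_next = collect_dataset(N)                  # 1. collect {(s, a, s', r)}
    with torch.no_grad():                                 # 2. targets for every transition
        y = r + gamma * Q_phi(s_next).max(dim=1).values   #    y <- r + gamma * max_a' Q_phi(s', a')
    for _ in range(S):                                    # 3. S gradient steps on phi
        q_sa = Q_phi(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q_sa - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```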


Why is this algorithm off-policy?
max_a' Q_φ(s',a'): approximates the value of the greedy policy pi' at s'
r(s,a): given s and a, the reward and the transition are independent of pi
so we can take a big bucket of different transitions,
back up the values along each of those transitions,
and the backups improve the Q values
we don't care which policy produced those specific transitions,
as long as they broadly cover the space of possible transitions


What is fitted Q-iteration optimizing?
the max_a' Q_φ(s',a') in step 2 -> improves the implicit policy a = argmax_a Q_φ(s,a)
step 3 minimizes the Bellman error ε = E[(Q_φ(s,a) - (r(s,a) + γ max_a' Q_φ(s',a')))²]
if ε = 0, then Q_φ(s,a) = r(s,a) + γ max_a' Q_φ(s',a') => optimal Q-function, optimal policy pi'


online Q iteration algorithm:
1. take one action, observe one transition (s,a,s',r)
   * off-policy, so many choices for how to act here
2. compute one target value y <- r(s,a) + γ max_a' Q_φ(s',a')
3. take one gradient descent step on (Q_φ(s,a) - y)²
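
As a sketch, one online step could look like this (reusing a Q_phi network and optimizer like the ones above; single transition, so s and s_next are 1-D feature tensors and a is an int — all assumptions for illustration):

```python
import torch

def online_q_step(Q_phi, opt, s, a, r, s_next, gamma=0.99):
    # 2. compute one target value (the current Q estimate is used, no gradient through it)
    with torch.no_grad():
        y = r + gamma * Q_phi(s_next).max()
    # 3. take one gradient descent step on the squared error (Q_phi(s,a) - y)^2
    loss = (Q_phi(s)[a] - y) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```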


exploration with Q-learning
acting greedily with argmax_a Q_φ(s,a) is a bad choice for step 1 (it never tries other actions)
=> epsilon-greedy
keep ε large while Q is still bad early on, then shrink it once Q gets better
=> Boltzmann exploration (= softmax exploration)
take actions in proportion to the exponential of the Q values
why it helps:
(1) avoids the epsilon-greedy problem where a second-best action that is almost as good as the best one gets ignored
(2) really bad actions are almost never explored
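
Both exploration rules fit in a few lines; a sketch where q is the vector of Q-values at the current state (the function names and the temperature parameter are just for illustration):

```python
import numpy as np

def epsilon_greedy(q, eps, rng=np.random.default_rng()):
    # with probability eps pick a uniformly random action, otherwise the greedy one
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def boltzmann(q, temperature=1.0, rng=np.random.default_rng()):
    # sample actions in proportion to exp(Q / temperature) (softmax exploration)
    logits = (np.asarray(q) - np.max(q)) / temperature   # subtract max for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q), p=p))
```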


value iteration algorithm:
1. Q(s,a) <- r(s,a) + γE[V(s')]
2. V(s) <- max_a Q(s,a)
does it converge?
=> define an operator


Bellman operator B:
BV = max_a (r_a + γ T_a V)
apply the Bellman operator to the value function (a vector of numbers)
T_a: transition matrix for action a (dimensionality |S| x |S|), entry (s,s') is p(s'|s,a)
T_a V: gives the expectation E[V(s')]
r_a: vector of rewards at all states when taking action a
max: taken elementwise, for every state
=> this captures the value iteration algorithm:
the max is step 2 and the stuff inside the max is step 1
V*: fixed point of B, the value function of the optimal policy
V*(s) = max_a (r(s,a) + γE[V*(s')])
V* = BV*
V* always exists, is unique, and corresponds to the optimal policy (and repeatedly applying B converges to it, as shown next)


can prove that value iteration reaches V* because B is a contraction
contraction: for any two vectors V and V_bar, applying B brings them closer:
||BV - BV_bar||_∞ <= γ||V - V_bar||_∞
the gap always shrinks by at least a factor of γ (in the infinity norm)


choose V_bar = V* (and use BV* = V*):
||BV - V*||_∞ <= γ||V - V*||_∞
so every application of B brings V closer to V*!
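
A quick numerical check of this contraction on a random toy MDP (P and r invented for the example): the infinity-norm gap between two value functions shrinks by at least γ per application of B.

```python
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T_a stacked over a
r = rng.normal(size=(n_states, n_actions))                        # r_a stacked over a

def B(V):
    # BV = max_a (r_a + gamma * T_a V)
    return (r + gamma * P @ V).max(axis=1)

V, V_bar = rng.normal(size=n_states), rng.normal(size=n_states)
gap = np.abs(V - V_bar).max()
for _ in range(10):
    V, V_bar = B(V), B(V_bar)
    new_gap = np.abs(V - V_bar).max()
    assert new_gap <= gamma * gap + 1e-9   # ||BV - BV_bar||_inf <= gamma * ||V - V_bar||_inf
    gap = new_gap
```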


value iteration algorithm (using B):
1. V <- BV


fitted value iteration algorithm:
Ω: the set of all value-function neural nets of a particular architecture with different weight values (the hypothesis space)
supervised learning: find the element of the hypothesis space that optimizes the objective
objective: square difference(V'(s), BV(s))
BV: the updated (target) value function
=> repeatedly find a new value function V' (∈ Ω):
V' <- argmin_{V'∈Ω} square difference(V'(s), BV(s))


define a new operator Π:
ΠV = argmin_{V'∈Ω} square difference(V'(s), V(s))
Π: projection onto Ω (itself a contraction, in the l2 sense)


fitted value iteration algorithm (using B, Π):
1. V <- ΠBV
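
A toy illustration of the projection Π: here Ω is taken to be linear functions of some made-up features (a stand-in for the neural net hypothesis space), and the projection is an ordinary least-squares fit.

```python
import numpy as np

n_states = 5
rng = np.random.default_rng(0)
Phi = rng.normal(size=(n_states, 2))     # features; Omega = {Phi @ w for all w}

def project(V_target):
    # Pi V: the element of Omega closest to V_target in the l2 sense
    w, *_ = np.linalg.lstsq(Phi, V_target, rcond=None)
    return Phi @ w

BV = rng.normal(size=n_states)           # pretend this is BV from the previous step
V_new = project(BV)                      # one fitted value iteration step: V <- Pi(BV)
```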


B: contraction w.r.t. the infinity-norm
Π: contraction w.r.t. the l2-norm
but... ΠB is not a contraction (the two norms don't match)


conclusion:
value iteration converges (in the tabular case)
fitted value iteration does not converge in general


What about fitted Q-iteration?
define an operator B:
(cf. fitted Q iteration algorithm:
1. y <- r(s,a) + γE[V_φ(s')]
   E[V_φ(s')] = max_a' Q_φ(s',a')
)
(cf. BV = max_a (r_a + γ T_a V))
BQ = r + γ T max_a Q
define an operator Π:
ΠQ = argmin_{Q'∈Ω} square difference(Q'(s,a), Q(s,a))


fitted Q-iteration algorithm:
1. Q <- ΠBQ
B: contraction wrt infinity-norm
Π: contraction wrt l2-norm
ΠB: not a contraction of any kind (also applies to online Q-learning)


in the online Q iteration algorithm,
step 3 was "take one gradient descent step"
isn't gradient descent supposed to converge??
=> Q-learning is not gradient descent
(cf.
Q-learning's target value: y <- r(s,a) + γ max_a' Q_φ(s',a')
)
because the target value depends on Q itself, and no gradient flows through it
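
A sketch of the distinction in PyTorch terms (reusing the hypothetical Q_phi from above): Q-learning cuts the gradient through the target, so the update is not the gradient of any fixed objective; keeping the gradient through the target would be a different algorithm (residual-gradient style), not what Q-learning does.

```python
import torch

def q_learning_loss(Q_phi, s, a, r, s_next, gamma=0.99):
    # target depends on Q_phi, but detach() stops the gradient from flowing through it
    y = (r + gamma * Q_phi(s_next).max()).detach()
    return (Q_phi(s)[a] - y) ** 2          # "semi-gradient" update: not true gradient descent

def residual_gradient_loss(Q_phi, s, a, r, s_next, gamma=0.99):
    # differentiating through the target WOULD give genuine gradient descent on the
    # Bellman error, but that is a different (usually worse-behaved) algorithm
    y = r + gamma * Q_phi(s_next).max()
    return (Q_phi(s)[a] - y) ** 2
```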


for the same reason,
actor-critic does not converge either (fitting V uses a target that itself depends on V)
