CS 285 at UC Berkeley: Deep Reinforcement Learning | Lecture 8: Deep RL with Q-Functions

김까치 · November 21, 2023

<<<part 1>>>

recap:
value-based methods are not guaranteed to converge
=> however, in practice we can often make them work


What's wrong?
In the online Q-iteration algorithm, step 3 is "take one gradient descent step."
If we are doing gradient descent, shouldn't it converge?
=> Q-learning is not actually gradient descent
(cf. the target value in Q-learning: y <- r(s,a) + γ max_a' Q_φ(s',a'))
because the target value itself depends on Q, and we do not differentiate through it
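A minimal sketch of this point, assuming PyTorch; q_net is a placeholder network mapping states to per-action Q-values. The target is built from the same parameters φ but treated as a constant, so the update ignores how the target changes with φ:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, s, a, s_next, r, gamma=0.99):
    # Q_phi(s, a) for the actions actually taken (a: int64 tensor of action indices)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # target y = r + gamma * max_a' Q_phi(s', a'), treated as a constant
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values
    # the "gradient step" differentiates only the prediction, not the target
    # -> not true gradient descent
    return F.mse_loss(q_sa, y)
```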


another problem
1. sample one transition (s,a,s',r)
sequential transitions are highly correlated
s and s' are likely to be very similar
taking gradient steps on these violates the i.i.d. assumption of stochastic gradient descent, so it is likely to work poorly


correlated samples in online Q-learning
(same issue as above: consecutive transitions are highly correlated, and s and s' are likely to be similar)
=> we solved the same problem in actor-critic, e.g. with parallel workers


synchronized parallel Q-learning

  • multiple workers collect different transitions (s,a,s',r)
  • gather the workers' transitions into a batch
  • update φ using this batch
  • repeat
    * transitions from different workers are not correlated with each other (mitigates the problem; sketched below)
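A schematic sketch only, with made-up names: workers is assumed to be a list of callables that each return one transition, and update_phi stands in for the parameter update.

```python
def synchronized_parallel_step(workers, update_phi):
    # each worker independently collects one transition (s, a, s', r)
    batch = [collect() for collect in workers]
    # one synchronized update on the combined batch; transitions from
    # different workers are mutually decorrelated
    update_phi(batch)
```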

asynchronous parallel Q-learning
individual workers don't wait for a synchronization point
(recap: Q-learning is off-policy, so it does not need to use the latest policy)
Another solution: replay buffers


Q-learning with a replay buffer:
1. sample a batch (s_i, a_i, s_i', r_i) from B (the buffer)
=> because the batch is drawn i.i.d. from the buffer, the samples are no longer correlated -> satisfies the assumptions of SGD
2. sum gradients over all the entries in the batch (multiple samples -> lower-variance gradient)


still not a true gradient (the target value still depends on Q), but at least the samples are not correlated
Where do we get our buffer?
we need to periodically feed the replay buffer,
because the initial policy is bad and won't visit all the interesting regions
-> keep collecting better data with the latest policy (see the buffer sketch below)
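A minimal replay buffer sketch; the class name, capacity, and method names are illustrative, not from the lecture. It is a fixed-size FIFO store (oldest transitions evicted first) with uniform random sampling:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer: once full, the oldest transitions are evicted."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        # periodically fed with fresh transitions from the latest (e.g. epsilon-greedy) policy
        self.storage.append((s, a, s_next, r))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation between consecutive transitions
        return random.sample(list(self.storage), batch_size)
```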


full Q-learning with replay buffer:
1. collect a dataset {(s_i,a_i,s_i',r_i)} using some policy and add it to B
2. sample a batch from B (i.i.d.)
3. sum gradients over all the entries in the batch (multiple samples -> lower-variance gradient)
repeat steps 2~3 K times
then go back to step 1 and refresh the buffer (loop structure sketched below)
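A sketch of just the loop structure, with hypothetical callables standing in for the steps above (collect_transitions for step 1, sample_batch for step 2, gradient_step for step 3):

```python
def q_learning_with_replay(collect_transitions, sample_batch, gradient_step,
                           num_rounds=100, K=4, batch_size=32):
    for _ in range(num_rounds):
        collect_transitions()                 # step 1: add fresh data to the buffer B
        for _ in range(K):                    # repeat steps 2~3 K times per collection
            batch = sample_batch(batch_size)  # step 2: i.i.d. sample from B
            gradient_step(batch)              # step 3: summed gradient over the batch
```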


<<<part 2>>>


still a problem: the target keeps moving (it depends on the current Q) -> may not converge


Q-learning with replay buffer and target network:
1. save target network parameters: φ' <- φ
2. collect a dataset {(s_i,a_i,s_i',r_i)} and add it to B
3. sample a batch from B (i.i.d.)
4. sum gradients over all the entries in the batch (looks identical to the previous update, but the target uses φ', not φ)
repeat steps 3~4 K times
repeat steps 2~4 N times
* data collection happens inside the update of φ'
=> the target network parameters are updated less often, which makes training more stable
φ and φ' start from a random initialization (step 1 copies φ into φ')
e.g. with N=1000 and K=4, φ' stays fixed for 1000 iterations of steps 2~4


classic deep Q-learning algorithm (DQN):
1. take some action a_i, observe (s_i,a_i,s_i',r_i), and add it to B
2. sample a mini-batch {(s_j,a_j,s_j',r_j)} from B uniformly (it may not contain the sample just added)
3. compute the target values using the target network Q_φ'
4. sum gradients over all the entries in the batch and update φ
5. update φ': copy φ every N steps
* this is the Q-learning with replay buffer and target network algorithm with K=1 (one gradient step per collected transition, with φ' still copied every N steps); see the sketch below
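A condensed sketch of steps 2~5, assuming PyTorch and discrete actions; q_net, target_net, and the batch layout are placeholders:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, step, gamma=0.99, copy_every=1000):
    s, a, s_next, r = batch                                   # mini-batch sampled uniformly from B
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q_phi(s_j, a_j)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values  # step 3: target uses Q_phi'
    loss = F.mse_loss(q_sa, y)                                # step 4: regression onto the targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % copy_every == 0:                                # step 5: phi' <- phi every N steps
        target_net.load_state_dict(q_net.state_dict())
```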


at some points in time the target looks more like a moving target than at others (e.g. right after the copy φ' <- φ, the target has just jumped)
=> alternative (Polyak averaging): φ' <- τ φ' + (1-τ) φ at every step
interpolate between the old and new parameters instead of copying abruptly
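A sketch of this soft update in PyTorch (the network arguments are placeholders):

```python
import torch

@torch.no_grad()
def soft_update(target_net, q_net, tau=0.999):
    # phi' <- tau * phi' + (1 - tau) * phi, applied parameter by parameter;
    # tau close to 1 keeps the target network slowly moving
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(tau).add_(p, alpha=1.0 - tau)
```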


<<<part 3>>>


Q-learning with replay buffer and target network
vs.
fitted Q-iteration
(1) whether data collection happens inside or outside the update of φ'
(2) in fitted Q-iteration, the batch is also drawn i.i.d. from the buffer, so the samples are not correlated -> the assumptions of SGD are satisfied


general view
process 1: data collection
process 2: target update
process 3: Q-function regression
the replay buffer has a finite size -> an eviction process removes the oldest transitions
all of the algorithms above can be described as variations of these general-view processes (differing mainly in how often each process runs)


<<<part 4>>>


Are the Q-values accurate?
As the predicted Q-value increases, so does the actual return,
but the two values themselves differ.


* when computing the "true" value for comparison, use the discounted return (discount γ)
red line: the Q-function's estimate (of the discounted reward)
solid red line: the actual sum of discounted rewards
the Q-function estimates are consistently much larger
* why does the Q-function think it will get larger rewards than it actually gets? (a consistent pattern)


Overestimation in Q-learning
target value: y <- r + γ max_a' Q_φ'(s',a')
the "max" causes overestimation
Q_φ'(s',a') is not perfect (true value + noise)
since we take the max over several noisy Q_φ'(s',a') values, we tend to pick one whose noise happens to be positive, so the target systematically overestimates the true Q
* the problem is that the Q used to select the action and the Q used to evaluate its value are the same
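A small numerical illustration of this (pure NumPy; the numbers are made up for the demo): even with zero-mean noise, taking the max over noisy estimates is biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0])               # true Q(s', a') is equal for all actions
noise = rng.normal(0.0, 0.5, size=(10_000, 3))   # zero-mean estimation noise
noisy_q = true_q + noise

print(np.max(true_q))                    # 1.0  -> the correct max_a' Q(s', a')
print(np.mean(np.max(noisy_q, axis=1)))  # ~1.4 -> E[max of noisy estimates] overestimates
```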


Double Q-learning
if the function that gives us the value is decorrelated from the function that selects the action, the problem goes away
idea: don't use the same network to choose the action and to evaluate its value
=> "double" Q-learning: use two networks φ_A, φ_B
keep the action-selecting Q and the value-evaluating Q separate: (φ_A selects, φ_B evaluates) and (φ_B selects, φ_A evaluates)
if φ_A picks action a because of positive noise, φ_B is unlikely to share that noise and will assign it a lower value, so the two networks are self-correcting


where do we get two Q-functions?
=> just use the current and target networks!
double Q-learning: y = r + γ Q_φ'(s', argmax_a' Q_φ(s',a'))
use φ to select the action and φ' to evaluate its value (so we still avoid the moving-target problem)
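A sketch of this target in PyTorch (q_net is the current network φ, target_net is φ'; both are placeholders mapping states to per-action Q-values):

```python
import torch

def double_q_target(q_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # select action with current network phi
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate it with target network phi'
        return r + gamma * q_eval
```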
