CS 285 at UC Berkeley: Deep Reinforcement Learning | Lecture 6: Actor-Critic Algorithms

김까치 · November 8, 2023


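Improving on the policy gradient

Replacing the single sampled reward-to-go with its average (expected value) reduces variance
Use the value function V(s) as the baseline in place of the constant b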
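
Written out, the two points above give the advantage form of the policy gradient. This is a sketch in the usual notation; the symbols N (number of sampled trajectories), T (horizon), and the superscript π are assumptions of this note and do not appear in the original.

```latex
\begin{aligned}
Q^\pi(s_t, a_t) &= \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right] \\
V^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[ Q^\pi(s_t, a_t) \right] \\
A^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) \\
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
  \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})
\end{aligned}
```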

value function fitting

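Train a neural net that takes a state s as input and outputs V(s), the expected sum of future rewards from s
Because the network generalizes across similar states, variance goes down
What should the training data be? (See below.)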
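
Fitting V is ordinary supervised regression on (state, target) pairs. The sketch below is only an illustration under loud assumptions: a one-dimensional linear model `w * s + b` stands in for the neural net, and the function name `fit_value_function`, the learning rate, and the epoch count are all hypothetical choices, not from the lecture.

```python
def fit_value_function(states, targets, lr=0.1, epochs=2000):
    """Fit V(s) ~ w * s + b by gradient descent on the mean squared error.

    Assumption: a scalar linear model replaces the neural net for brevity.
    """
    w, b = 0.0, 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for s, y in zip(states, targets):
            err = (w * s + b) - y          # prediction error on one sample
            grad_w += 2.0 * err * s / n    # d(MSE)/dw
            grad_b += 2.0 * err / n        # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data where the true value is V(s) = 2 * s + 1
w, b = fit_value_function([0.0, 0.5, 1.0], [1.0, 2.0, 3.0])
```

With a real network the loop body would be replaced by an optimizer step on the same squared-error loss.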

policy evaluation

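Monte Carlo estimate (as in policy gradient): use (s_t, Σ r) as training data, where Σ r is the sum of rewards from step t to the end of the sampled trajectory
Bootstrapped estimate: use (s_t, r_t + V(s_{t+1})) as training data, where V(s_{t+1}) comes from the current value estimate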
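
The two kinds of regression target can be compared on a single trajectory. This is a minimal sketch; the function names and the `next_values` argument (the current estimates of V(s_{t+1}), with 0 after the final step) are assumptions made for illustration.

```python
def monte_carlo_targets(rewards):
    """y_t = sum of rewards from step t to the end of the trajectory."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        targets[t] = running
    return targets


def bootstrapped_targets(rewards, next_values):
    """y_t = r_t + V(s_{t+1}), using the current value estimate for s_{t+1}."""
    return [r + v_next for r, v_next in zip(rewards, next_values)]


rewards = [1.0, 2.0, 3.0]
mc = monte_carlo_targets(rewards)                       # [6.0, 5.0, 3.0]
bs = bootstrapped_targets(rewards, [4.0, 2.5, 0.0])     # [5.0, 4.5, 3.0]
```

The Monte Carlo target depends on every later reward in the sample, while the bootstrapped target depends on one reward plus the fitted estimate, which is where the lower variance (and possible bias) comes from.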

batch actor-critic algorithm
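1. Sample trajectories by running the current policy
2. Fit V to the sampled rewards
3. A(s, a) = r(s, a) + γV(s') - V(s)
4. Compute the policy gradient using A
5. Update the policy parameters with a gradient step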
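
Steps 3 and 4 of the batch algorithm can be sketched as below. The names `advantages` and `policy_loss` are hypothetical, and the log-probabilities are passed in as plain numbers; in practice they would come from the policy network and the loss would be differentiated by an autodiff library, with the advantages treated as constants.

```python
def advantages(rewards, values, next_values, gamma=0.99):
    """Step 3: A(s, a) = r(s, a) + gamma * V(s') - V(s) for each transition."""
    return [
        r + gamma * v_next - v
        for r, v, v_next in zip(rewards, values, next_values)
    ]


def policy_loss(log_probs, advs):
    """Step 4: surrogate loss whose gradient w.r.t. the policy parameters is
    minus the advantage-weighted policy gradient estimate.

    Assumption: advs are treated as constants when differentiating.
    """
    n = len(log_probs)
    return -sum(lp * a for lp, a in zip(log_probs, advs)) / n


advs = advantages([1.0], [2.0], [3.0], gamma=0.5)   # 1.0 + 0.5 * 3.0 - 2.0
loss = policy_loss([-1.0, -2.0], [0.5, 1.0])
```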

Introducing the discount factor γ

Rewards further in the future are weighted less, which also reduces the variance they contribute (the target becomes r + γV)
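
To see how γ down-weights later rewards, the discounted reward-to-go can be computed with a backward pass over one trajectory. This is an illustrative sketch; the function name is an assumption.

```python
def discounted_rewards_to_go(rewards, gamma):
    """Return y_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # same recursion as r + gamma * V
        out[t] = running
    return out


# With gamma = 0.5, a reward two steps ahead counts for only a quarter.
print(discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```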

online actor-critic algorithm
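1. Take an action a at some state s partway through running the policy and get a transition (s, a, s', r)
2. Fit V using the target r + γV(s')
3. A(s, a) = r(s, a) + γV(s') - V(s)
4. Compute the policy gradient using A
5. Update the policy parameters with a gradient step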
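
Steps 2 and 3 of the online version can be sketched for a single transition. Everything here is an assumption for illustration: the critic is a plain dict (a table) rather than a neural net, and the name `online_critic_step`, the `done` flag, and the step size `lr` are hypothetical.

```python
def online_critic_step(V, s, r, s_next, gamma=0.99, lr=0.1, done=False):
    """Update a tabular critic on one transition and return the advantage.

    V is a dict mapping state -> value estimate (unseen states count as 0).
    Assumption: `done` marks a terminal s_next, whose value is taken as 0.
    """
    v_next = 0.0 if done else V.get(s_next, 0.0)
    target = r + gamma * v_next                 # step 2: bootstrapped target
    v_s = V.get(s, 0.0)
    V[s] = v_s + lr * (target - v_s)            # move V(s) toward the target

    # Step 3: advantage evaluated with the updated critic
    v_next = 0.0 if done else V.get(s_next, 0.0)
    return r + gamma * v_next - V[s]


V = {"a": 0.0, "b": 1.0}
adv = online_critic_step(V, "a", 1.0, "b", gamma=0.5, lr=0.5)
```

The returned advantage would then multiply the gradient of log π(a | s) in steps 4 and 5; the actor is left out to keep the sketch short.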

Replacing the baseline in the Monte Carlo policy gradient with the value function gives:

no bias (the advantage of Monte Carlo)
lower variance than a constant baseline (the benefit borrowed from actor-critic)
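
As a sketch, the resulting estimator keeps the sampled (discounted) reward-to-go and subtracts the fitted value function as a state-dependent baseline. The notation below is an assumption of this note: \hat{V}_\phi^\pi denotes the fitted critic with parameters φ, and N and T are the number of trajectories and the horizon.

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
  \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})
  \left( \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'}) - \hat{V}_\phi^\pi(s_{i,t}) \right)
```

The estimate stays unbiased because the subtracted term depends only on the state, not on the action.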
