Policy Gradient


Softmax Policy Parameterization

Instead of building the policy from action values, we want to learn and represent the policy directly through function approximation.

  • $\pi$: parameterized policy
    • outputs the probability of taking each action in a given state
  • $\theta$: policy parameter vector
  • $w$: weight vector of the approximate value function

With a linear function it is hard to guarantee that the probabilities of all actions sum to 1,
so a softmax is used to map each action's preference to a probability between 0 and 1.
This also ensures that actions with negative preferences still get a non-zero probability (see the sketch below).

An action preference is a function of the state, the action, and $\theta$.

  • higher preference == that action is more likely to be selected

However, do not confuse action preferences with action values:

  • preference == how much the agent prefers each action
  • action value == a summary of the expected future reward
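
Below is a minimal sketch (not from the original post) of a softmax policy over linear action preferences. The feature function `features`, the parameter vector `theta`, and the toy state encoding are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_actions = 4
feature_dim = 8
theta = rng.normal(size=feature_dim)           # policy parameters θ

def features(state, action):
    """Hypothetical state-action features x(s, a) -- an arbitrary toy encoding."""
    x = np.zeros(feature_dim)
    x[action * 2 % feature_dim] = state
    x[(action * 2 + 1) % feature_dim] = 1.0
    return x

def preferences(state, theta):
    """Action preferences h(s, a, θ) = θᵀ x(s, a), one per action."""
    return np.array([theta @ features(state, a) for a in range(num_actions)])

def softmax_policy(state, theta):
    """π(a|s, θ): softmax over preferences, so even an action with a very
    negative preference keeps a non-zero probability."""
    h = preferences(state, theta)
    h = h - h.max()                            # numerical stability
    e = np.exp(h)
    return e / e.sum()

probs = softmax_policy(state=1.5, theta=theta)
print(probs, probs.sum())                      # probabilities are in (0, 1) and sum to 1
```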

Advantages of Policy Parameterization

Parameterized stochastic policies are useful, because ...

  • They can autonomously decrease exploration over time
  • They can avoid failures due to deterministic policies with limited function approximation
  • Sometimes the policy is less complicated than the value function

Policy Gradient for Continuing Tasks

Using the average reward as an objective for policy optimization

Formalizing the Goal as an Objective

  • Episodic: $G_t = \sum_{t=0}^{T} R_t$
  • Continuing
    • $G_t = \sum_{t=0}^{\infty} \gamma^t R_t$ (discounted return)
    • $G_t = \sum_{t=0}^{\infty} \big(R_t - r(\pi)\big)$ (differential return, relative to the average reward $r(\pi)$; see the sketch after this list)
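
As a quick illustration of the three return definitions, here is a small sketch on a made-up reward sequence. The rewards, `gamma`, and the assumed average reward `r_pi` are arbitrary values chosen only for the example; the infinite sums are truncated to the finite toy sequence.

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 1.0, 3.0])  # toy reward sequence
gamma = 0.9
r_pi = 1.2                                      # assumed average reward r(π)

episodic_return = rewards.sum()                                        # Σ R_t
discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards) # Σ γ^t R_t
differential_return = np.sum(rewards - r_pi)                           # Σ (R_t - r(π))

print(episodic_return, discounted_return, differential_return)
```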

The Average Reward Objective

  • $r(\pi) = \sum_s \mu(s) \sum_a \pi(a|s,\theta) \sum_{s',r} p(s',r|s,a)\, r$
    • $\sum_{s',r} p(s',r|s,a)\, r \;\rightarrow\; E[R_t \mid S_t = s, A_t = a]$
      • the expected reward when taking action $a$ in state $s$
    • $\sum_a \pi(a|s,\theta) \sum_{s',r} p(s',r|s,a)\, r \;\rightarrow\; E_{\pi}[R_t \mid S_t = s]$
      • all possible actions weighted by their probability under $\pi$
      • this gives the expected reward under the policy $\pi$ from a particular state $s$
    • $\sum_s \mu(s) \sum_a \pi(a|s,\theta) \sum_{s',r} p(s',r|s,a)\, r \;\rightarrow\; E_{\pi}[R_t]$
      • this gives the overall average reward by weighting each state $s$ by the fraction of time spent in it under policy $\pi$ (see the sketch after this list)
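
To make the three nested sums concrete, here is a minimal sketch that evaluates $r(\pi)$ on a tiny made-up MDP. The transition table `p`, the policy `pi`, and the state distribution `mu` are assumed values for illustration (in particular, `mu` is simply assumed rather than computed as the stationary distribution of $\pi$).

```python
import numpy as np

num_states, num_actions = 2, 2

# p[(s, a)] lists (s', r, probability) outcomes -- a toy dynamics model.
p = {
    (0, 0): [(0, 0.0, 0.5), (1, 1.0, 0.5)],
    (0, 1): [(1, 2.0, 1.0)],
    (1, 0): [(0, 0.0, 1.0)],
    (1, 1): [(1, 1.0, 0.7), (0, 3.0, 0.3)],
}

pi = np.array([[0.6, 0.4],    # π(a|s=0)
               [0.2, 0.8]])   # π(a|s=1)
mu = np.array([0.5, 0.5])     # assumed fraction of time spent in each state

def expected_reward(s, a):
    """Innermost sum: E[R_t | S_t=s, A_t=a] = Σ_{s',r} p(s',r|s,a) r."""
    return sum(prob * r for (_, r, prob) in p[(s, a)])

def expected_reward_under_pi(s):
    """Middle sum: E_π[R_t | S_t=s], actions weighted by π(a|s,θ)."""
    return sum(pi[s, a] * expected_reward(s, a) for a in range(num_actions))

def average_reward():
    """Outer sum: r(π) = Σ_s μ(s) E_π[R_t | S_t=s]."""
    return sum(mu[s] * expected_reward_under_pi(s) for s in range(num_states))

print(average_reward())
```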

Policy Gradient

Learning policy directly!

Before, we were minimizing the mean squared value error;
now, we are maximizing an objective.

That means we will want to move in the direction of the (positive) gradient rather than the negative gradient

Understanding $\sum_a \nabla\pi(a|s,\theta)\, q_{\pi}(s,a)$

The up & left actions have negative values;
the down & right actions have positive values.

Here is how the policy gradient theorem finds the direction that increases the overall average reward.

The weighted sum tells us the direction to move: if the goal is in the bottom-right as in the figure, the update increases the probability of the positive-valued actions (down & right) and decreases the probability of the negative-valued actions (up & left).
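
Below is a minimal sketch of this weighted sum for a softmax policy over linear preferences, followed by one gradient-ascent step. The features, `theta`, the step size `alpha`, and the four action values in `q` (up, left, down, right) are assumptions chosen to mirror the example above, not values from the post.

```python
import numpy as np

num_actions, feature_dim = 4, 6
rng = np.random.default_rng(1)
theta = rng.normal(size=feature_dim)

def features(a):
    """Hypothetical state-action features x(s, a) for a fixed state."""
    x = np.zeros(feature_dim)
    x[a % feature_dim] = 1.0
    x[(a + 1) % feature_dim] = 0.5
    return x

def softmax_pi(theta):
    """π(a|s,θ): softmax over linear preferences h(s,a,θ) = θᵀ x(s,a)."""
    h = np.array([theta @ features(a) for a in range(num_actions)])
    e = np.exp(h - h.max())
    return e / e.sum()

def grad_pi(theta):
    """∇_θ π(a|s,θ) for each a; for softmax-linear preferences this is
    π(a|s) (x(s,a) - Σ_b π(b|s) x(s,b))."""
    pi = softmax_pi(theta)
    xs = np.array([features(a) for a in range(num_actions)])
    x_bar = pi @ xs                          # expected feature vector under π
    return pi[:, None] * (xs - x_bar)        # shape: (num_actions, feature_dim)

# Assumed action values: up & left negative, down & right positive.
q = np.array([-1.0, -0.5, 1.0, 0.8])         # [up, left, down, right]

direction = grad_pi(theta).T @ q             # Σ_a ∇π(a|s,θ) q_π(s,a)

alpha = 0.1
new_theta = theta + alpha * direction        # ascend: move *with* the gradient

print("E_pi[q] before step:", softmax_pi(theta) @ q)
print("E_pi[q] after step: ", softmax_pi(new_theta) @ q)
```

For a small step size, the expected action value under the policy increases after the update, i.e. probability mass shifts toward the positive-valued actions.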
