This post summarizes Lecture 7: Policy Gradient (Youtube) from Professor David Silver's Introduction to Reinforcement Learning (Website).
Note: the policy $\pi_\theta(s, a)$ is determined by the parameter $\theta$, and $J(\theta)$ is a policy objective function to maximize.
Now we compute the policy gradient analytically.
Assumptions:
1) the policy $\pi_\theta$ is differentiable whenever it is non-zero,
2) we know the gradient $\nabla_\theta \pi_\theta(s, a)$.
Theorem (Policy Gradient Theorem)
For any differentiable policy $\pi_\theta(s, a)$,
for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,
the policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$
- Proof omitted!
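Although the full proof is omitted, the key step (shown here for the one-step case, where $Q^{\pi_\theta}(s, a)$ reduces to the immediate reward) is the likelihood-ratio trick, which turns the gradient of the policy into the policy times a score function:

```latex
\nabla_\theta \pi_\theta(s, a)
  = \pi_\theta(s, a)\,\frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}
  = \pi_\theta(s, a)\,\nabla_\theta \log \pi_\theta(s, a)
```

so the gradient of the expected reward becomes an expectation under $\pi_\theta$ of the score times the reward; the theorem generalizes this to multi-step MDPs by replacing the immediate reward with $Q^{\pi_\theta}(s, a)$.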
initialize $\theta$ arbitrarily
for each episode $\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ do
for $t = 1$ to $T-1$ do
$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$
end for
end for
return $\theta$
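As a concrete illustration, here is a minimal sketch of the Monte-Carlo policy-gradient (REINFORCE) update loop above, using a softmax policy on a hypothetical two-armed bandit. The bandit, step size, and episode count are my own choices for the example, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit: arm 1 pays reward 1, arm 0 pays reward 0.
# Each episode is a single action, so the return v_t is just the reward.
N_ACTIONS = 2

def softmax_policy(theta):
    """pi_theta(a): softmax over per-action preferences theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """grad_theta log pi_theta(a) for the softmax policy: e_a - pi."""
    g = -softmax_policy(theta)
    g[a] += 1.0
    return g

def reinforce(episodes=2000, alpha=0.1):
    theta = np.zeros(N_ACTIONS)                      # initialize theta arbitrarily
    for _ in range(episodes):                        # for each episode
        a = rng.choice(N_ACTIONS, p=softmax_policy(theta))
        v = 1.0 if a == 1 else 0.0                   # sampled return v_t
        theta += alpha * grad_log_pi(theta, a) * v   # theta <- theta + alpha * grad log pi * v_t
    return theta

theta = reinforce()
print(softmax_policy(theta))  # probability mass concentrates on the rewarding arm
```

Because the update weights the score function by the sampled return, preferences drift toward the rewarding arm without ever computing the reward function's gradient.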
Simple actor-critic algorithm (QAC) with a linear value function approximator, $Q_w(s, a) = \phi(s, a)^\top w$.
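A minimal sketch of such an actor-critic loop, assuming a one-hot feature map and a hypothetical two-state, two-action chain MDP (the environment, step sizes, and step count are my own choices, not from the lecture). The critic updates $w$ by TD(0); the actor updates $\theta$ in the direction $\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical chain MDP: 2 states alternating deterministically;
# action 1 yields reward 1, action 0 yields reward 0.
N_S, N_A, GAMMA = 2, 2, 0.9

def phi(s, a):
    """One-hot features, so Q_w(s, a) = phi(s, a) @ w is linear in w."""
    f = np.zeros(N_S * N_A)
    f[s * N_A + a] = 1.0
    return f

def pi(theta, s):
    """Softmax policy over per-(state, action) preferences theta[s]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def actor_critic(steps=5000, alpha=0.05, beta=0.1):
    theta = np.zeros((N_S, N_A))   # actor parameters
    w = np.zeros(N_S * N_A)        # critic parameters
    s = 0
    a = rng.choice(N_A, p=pi(theta, s))
    for _ in range(steps):
        r = float(a == 1)                       # sample reward
        s2 = (s + 1) % N_S                      # sample transition
        a2 = rng.choice(N_A, p=pi(theta, s2))   # sample next action a' ~ pi
        delta = r + GAMMA * (phi(s2, a2) @ w) - phi(s, a) @ w   # TD error
        grad_log = -pi(theta, s)
        grad_log[a] += 1.0                      # grad_theta log pi(s, a)
        theta[s] += alpha * grad_log * (phi(s, a) @ w)   # actor update
        w += beta * delta * phi(s, a)                    # critic TD(0) update
        s, a = s2, a2
    return theta, w

theta, w = actor_critic()
```

The critic bootstraps a low-variance estimate of the action value online, so the actor no longer has to wait for a full-episode return as in Monte-Carlo policy gradient.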
Theorem (Compatible Function Approximation Theorem)
If the following two conditions are satisfied:
- the value function approximator is compatible to the policy, $\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)$,
- the value function parameters $w$ minimize the mean squared error $\varepsilon = \mathbb{E}_{\pi_\theta}\left[(Q^{\pi_\theta}(s, a) - Q_w(s, a))^2\right]$,
then the policy gradient is exact,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)\right]$$
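The exactness follows in a few lines: at the minimum of the mean squared error, its gradient with respect to $w$ vanishes, and substituting the compatibility condition shows the approximation error is orthogonal to the score function:

```latex
\nabla_w \varepsilon = 0
\;\Rightarrow\;
\mathbb{E}_{\pi_\theta}\!\left[(Q^{\pi_\theta}(s, a) - Q_w(s, a))\,\nabla_w Q_w(s, a)\right] = 0
\;\Rightarrow\;
\mathbb{E}_{\pi_\theta}\!\left[(Q^{\pi_\theta}(s, a) - Q_w(s, a))\,\nabla_\theta \log \pi_\theta(s, a)\right] = 0
```

so $Q_w$ can be substituted for $Q^{\pi_\theta}$ inside the policy gradient expectation without changing its value.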
If you spot any typos or mistakes, please let me know in the comments!