Fundamentals of Reinforcement Learning - Week 1

HO SEUNG YOON · March 29, 2024


Introduction

Specialization Introduction

  • What a team! haha

Course Introduction

  • incremental learning

Meet your instructors!

Your Specialization Roadmap

  • course 1
    • multi-armed bandit problems
    • Markov decision processes
  • course 2
    • Monte Carlo methods
    • temporal difference learning
    • Q-learning
  • course 3
    • feature construction, neural network learning, policy gradient methods, and other particularities of the function approximation setting
  • final course
    • Capstone project

The K-Armed Bandit Problem

  • the k-armed bandit problem: repeatedly choose among k actions (arms), each with an unknown reward distribution, aiming to maximize total reward over time (a minimal testbed sketch follows)
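Below is a minimal sketch of the standard Gaussian testbed for this problem (the class name `KArmedBandit` is my own illustration, not course code; the setup matches the testbed described later in these notes): each arm has a hidden true value, and pulling it returns a noisy reward.

```python
import numpy as np

class KArmedBandit:
    """Minimal Gaussian k-armed bandit testbed (illustrative sketch).

    True values q_*(a) are drawn from N(0, 1); pulling arm a returns
    a reward sampled from N(q_*(a), 1).
    """

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # hidden true action values

    def pull(self, action):
        # Unit-variance normal reward centered on the arm's true value.
        return self.rng.normal(self.q_star[action], 1.0)
```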

Sequential Decision Making with Evaluative Feedback

  • The value of an action is its expected reward

    • $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a] = \sum_{r} p(r \mid a)\, r$ (a tiny worked example follows this list)
  • Goal: maximize the expected reward

    • $\underset{a}{\text{argmax}} \; q_*(a)$
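A tiny worked example of the expected-reward sum above; the discrete reward distribution here is invented purely for illustration.

```python
# q_*(a) = sum over r of p(r | a) * r, for one action with a discrete reward distribution.
rewards = [0.0, 1.0, 5.0]   # possible rewards r (made up)
probs = [0.5, 0.3, 0.2]     # p(r | a); must sum to 1
q_star = sum(p * r for p, r in zip(probs, rewards))
print(q_star)               # 0.5*0.0 + 0.3*1.0 + 0.2*5.0 = 1.3
```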

Estimating Action Values

Learning Action Values

Estimating Action Values Incrementally

  • in a non-stationary bandit problem, the reward distribution changes over time

  • with a constant step size, the most recent rewards have the greatest influence on the estimate (see the sketch below)
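A sketch of the incremental update rule NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate): with step size $1/n$ it computes the sample average, while a constant step size weights recent rewards exponentially more, which suits non-stationary problems. The reward stream below is made up.

```python
def update(q, reward, step_size):
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return q + step_size * (reward - q)

q_avg, q_const, n = 0.0, 0.0, 0
for reward in [1.0, 0.0, 2.0, 5.0]:          # illustrative reward stream
    n += 1
    q_avg = update(q_avg, reward, 1.0 / n)   # sample average: all rewards weighted equally
    q_const = update(q_const, reward, 0.1)   # constant step: recent rewards dominate
print(q_avg)    # 2.0, the plain mean of the four rewards
```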

Exploration vs Exploitation Tradeoff

What is the trade-off?

  • exploration: choose an action randomly to gather information
    • Epsilon-Greedy

  • rewards are noisy, so a few samples are not enough to conclude which action is best

  • epsilon = 0 → purely greedy, no exploration

  • epsilon-greedy balances exploration and exploitation: with probability $\epsilon$ take a random action, otherwise take the greedy action (a sketch follows)
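A minimal epsilon-greedy selection sketch (my own illustration, not course code): with probability $\epsilon$ pick an arm uniformly at random, otherwise pick the arm with the highest current estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_estimates, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))  # explore: uniform random action
    return int(np.argmax(q_estimates))              # exploit: highest current estimate

# epsilon = 0 reduces to pure greedy selection:
print(epsilon_greedy(np.array([0.2, 1.0, 0.5]), epsilon=0.0))  # -> 1
```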

Optimistic Initial Values

  • how optimism affects action selection

  • optimistic initial values encourage exploration early in learning (see the sketch below)
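A sketch of optimistic initialization under the Gaussian-testbed assumption (the initial value 5.0 is illustrative; it just needs to sit well above any plausible reward): even a purely greedy learner tries every arm early, because each sampled reward drags the pulled arm's estimate below the still-optimistic untried arms.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.normal(0.0, 1.0, size=10)   # hidden q_*(a), roughly in [-3, 3]

q_estimates = np.full(10, 5.0)                # optimistic initial values
counts = np.zeros(10)

for _ in range(100):
    a = int(np.argmax(q_estimates))           # purely greedy selection
    r = rng.normal(true_values[a], 1.0)       # sampled reward
    counts[a] += 1
    q_estimates[a] += (r - q_estimates[a]) / counts[a]  # sample-average update

print(counts)  # every arm was tried early despite greedy selection
```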

Upper-Confidence Bound (UCB) Action Selection

  • upper-confidence bound action selection

  • how UCB drives exploration

  • in epsilon-greedy, with probability $\epsilon$ an action is selected uniformly at random
    • introducing a notion of uncertainty into the value estimates allows a more intelligent way to select actions

  • the region between the lower bound and the upper bound is the confidence interval
    • it represents our uncertainty about the value estimate
    • if the region is small, we are very confident the estimate is accurate

  • the interval for Q(2) is the smallest, yet its upper confidence bound is the highest, so UCB selects action 2

  • $A_t \doteq \underset{a}{\text{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$
  • c is a user-specified parameter that controls the amount of exploration

  • in the testbed, the true values $q_*(a)$ are normally distributed with mean zero and standard deviation one

  • rewards are sampled from a unit-variance normal distribution with mean $q_*(a)$

  • compare UCB with $c = 2$ against epsilon-greedy with $\epsilon = 0.1$ (a UCB selection sketch follows)
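A sketch of UCB selection matching the formula above (function and variable names are mine): arms never tried have $N_t(a) = 0$, which makes the exploration bonus infinite, so they are treated as maximally uncertain and chosen first.

```python
import numpy as np

def ucb_select(q_estimates, counts, t, c=2.0):
    # A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])                # N_t(a) = 0: infinite uncertainty bonus
    bonus = c * np.sqrt(np.log(t) / counts)   # shrinks as an arm is tried more often
    return int(np.argmax(q_estimates + bonus))
```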

Jonathan Langford: Contextual Bandits for Real World Reinforcement Learning

  • in the real world, the environment typically controls you, rather than the other way around
