This post summarizes Lecture 9: Exploration and Exploitation (YouTube) of David Silver's Introduction to Reinforcement Learning (Website).
Introduction
When we covered control earlier, we learned about greedy and ϵ-greedy. ϵ-greedy is one kind of exploration method; this lecture covers other exploration methods beyond ϵ-greedy.
Exploration vs. Exploitation Dilemma
Exploitation: Make the best decision given current information
e.g.) Go to your favorite restaurant
Exploration: Gather more information
e.g.) Try a new restaurant
Ways to Explore
Random exploration (e.g., ϵ-greedy, softmax)
Optimism in the face of uncertainty
Estimate uncertainty on value
Prefer to explore states/actions with highest uncertainty
Information state space
Consider agent's information as part of its state
Lookahead to see how information helps reward
State-action exploration vs. Parameter exploration
State-action exploration
Systematically explore state/action space
e.g.) At the current state, choose an action among those not yet tried
Parameter exploration
With parameter u, policy is π(A∣S,u)
e.g.) Pick different parameters and try for a while
Advantage: consistent exploration
Disadvantage: doesn't know about state/action space
We will focus on state-action exploration!
Principles
Naive Exploration:
Add noise to greedy policy (e.g. ϵ-greedy)
Optimistic Initialization:
Assume the best until proven otherwise
Optimism in the Face of Uncertainty:
Prefer actions with uncertain values
Probability Matching:
Select actions according to probability they are best
Information State Search:
Lookahead search incorporating value of information
(Recall the lecture's regret definitions: the gap of an action is $\Delta_a = V^* - Q(a)$ with $V^* = \max_{a} Q(a)$, $N_t(a)$ is the number of times $a$ has been selected, and the total regret is $L_t = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\, \Delta_a$.)
Greedy never explores ⇒ it can lock onto a suboptimal action, so that action's $\mathbb{E}[N_t(a)]$ keeps growing with $t$ ⇒ linear total regret!
ϵ-greedy explores forever (ϵ is fixed) ⇒ $\mathbb{E}[N_t(a)]$ grows linearly for every action, including suboptimal ones ⇒ linear total regret!
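As a quick illustration (not from the lecture), here is a minimal Bernoulli-bandit simulation comparing the total regret of greedy and ϵ-greedy; the arm probabilities, horizon, and ϵ value are arbitrary choices.

```python
import numpy as np

# Minimal Bernoulli bandit sketch (arm probabilities, horizon and epsilon are
# arbitrary illustration values, not from the lecture).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # Q(a); best arm is index 2
T = 10_000

def run(epsilon):
    """Run epsilon-greedy (epsilon=0 gives plain greedy) and return total regret."""
    Q = np.zeros(len(true_means))          # action-value estimates
    N = np.zeros(len(true_means))          # visit counts N_t(a)
    regret = 0.0
    for t in range(T):
        if rng.random() < epsilon:
            a = rng.integers(len(true_means))          # explore
        else:
            a = int(np.argmax(Q))                      # exploit
        r = float(rng.random() < true_means[a])        # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental MC update
        regret += true_means.max() - true_means[a]     # gap Delta_a
    return regret

print("greedy     total regret:", run(0.0))
print("eps-greedy total regret:", run(0.1))
```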
Is it possible to achieve sublinear total regret?
Optimistic Initialization
Review: greedy algorithm
Estimate the value of each action by Monte-Carlo evaluation: $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{t=1}^{T} r_t \mathbf{1}(a_t = a)$
The greedy algorithm selects the action with the highest value: $a_t^* = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, \hat{Q}_t(a)$
Greedy can lock onto a suboptimal action forever
Optimistic Initialization
Simple and practical idea: initialize Q(a) to high value
and update action value by incremental Monte-Carlo evaluation
Exploration happens early on (because the Q values decrease from their optimistic start as they converge),
but the algorithm can still lock onto a suboptimal action. (⇒ linear total regret)
Greedy Algorithm with Optimistic Initialization
Initialize values to the maximum possible value, $Q(a) = r_{\max}, \ \forall a \in \mathcal{A}$
Starting with $N(a) > 0$, update $\hat{Q}_t(a_t) = \hat{Q}_{t-1}(a_t) + \frac{1}{N_t(a_t)}\left(r_t - \hat{Q}_{t-1}(a_t)\right)$
Act greedily: $A_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, \hat{Q}_t(a)$
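A minimal sketch of greedy with optimistic initialization on the same kind of Bernoulli bandit; $r_{\max} = 1$ and the initial pseudo-count of 1 are illustrative assumptions.

```python
import numpy as np

# Greedy with optimistic initialization (sketch; r_max = 1 for Bernoulli rewards,
# and the initial pseudo-count of 1 is an illustrative assumption).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
r_max, T = 1.0, 10_000

Q = np.full(len(true_means), r_max)   # optimistic initial values Q(a) = r_max
N = np.ones(len(true_means))          # start with N(a) > 0
for t in range(T):
    a = int(np.argmax(Q))                       # always act greedily
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                   # incremental MC update
print("visit counts:", N, "estimates:", Q.round(2))
```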
Decaying ϵt-Greedy
Review: ϵ-greedy algorithm
With probability $1-\epsilon$, select $a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, \hat{Q}(a)$
With probability ϵ, select a random action
A constant ϵ ensures a minimum per-step regret (the unavoidable regret from exploring forever): $l_t \ge \frac{\epsilon}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Delta_a$ ⇒ ϵ-greedy has linear total regret (the slope of the total regret is bounded below by a positive constant)
Decaying ϵt
Pick a decay schedule for ϵ1,ϵ2,... ⇒ sublinear total regret
Example decay schedule: for $c > 0$ and $d = \min_{a \mid \Delta_a > 0} \Delta_a$ ... the smallest gap (best action vs. second-best action), set $\epsilon_t = \min\left\{1, \frac{c|\mathcal{A}|}{d^2 t}\right\}$ ... the smaller the smallest gap, the larger $\epsilon_t$ stays (more exploration, slower decay)
This schedule achieves logarithmic asymptotic total regret.
Unfortunately, schedule requires advance knowledge of gaps
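A sketch of the decay schedule above; it needs the smallest gap $d$, which is assumed known here purely for illustration (the constant $c$ and the arm probabilities are also arbitrary).

```python
import numpy as np

# Decaying epsilon_t-greedy sketch. The schedule needs the smallest gap d,
# which in practice is unknown; here it is computed from the (assumed known)
# true means purely for illustration. c is an arbitrary constant.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
gaps = true_means.max() - true_means
d = gaps[gaps > 0].min()              # smallest positive gap
c, A, T = 1.0, len(true_means), 10_000

Q = np.zeros(A)
N = np.zeros(A)
for t in range(1, T + 1):
    eps_t = min(1.0, c * A / (d**2 * t))        # epsilon_t = min{1, c|A| / (d^2 t)}
    if rng.random() < eps_t:
        a = rng.integers(A)                     # explore
    else:
        a = int(np.argmax(Q))                   # exploit
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]
print("visit counts:", N)
```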
Optimism in the Face of Uncertainty
The more uncertain we are about an action-value, the more important it is to explore that action
Example
Action $a_1$ (the blue distribution in the lecture's figure) is the most uncertain one → pick $a_1$
After trying it, we are less uncertain about its value (blue turns out to be bad), so now we are more likely to pick $a_2$ (the red distribution)!
Theorem (Hoeffding's Inequality)
Let $X_1, \ldots, X_t$ be i.i.d. random variables in $[0,1]$, and let $\bar{X}_t = \frac{1}{t} \sum_{\tau=1}^{t} X_\tau$ be the sample mean. Then $\mathbb{P}[\mathbb{E}[X] > \bar{X}_t + u] \le e^{-2tu^2}$. Note: this holds for any dist'n.
Apply this inequality to the rewards of the bandit, conditioned on selecting action $a$: $\mathbb{P}[Q(a) > \hat{Q}_t(a) + \hat{U}_t(a)] \le e^{-2 N_t(a) \hat{U}_t(a)^2}$ ... (1)
i.e., $\mathbb{P}[Q(a) \le \hat{Q}_t(a) + \hat{U}_t(a)] \ge 1 - e^{-2 N_t(a) \hat{U}_t(a)^2}$ ... (2) Note: for each action we can find the smallest $\hat{U}_t(a)$ that makes the right-hand side of (2) equal to a chosen confidence level (e.g. 0.95).
Setting the right-hand side of (1) to a probability $p$ (e.g. 0.05) and solving: $e^{-2 N_t(a) \hat{U}_t(a)^2} = p \implies \hat{U}_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$
Since uncertainty shrinks as time goes on, we can decay $p$ as a function of $t$.
That is, the right-hand side of (2) increases: we demand an ever stricter confidence level.
In that case, selecting the optimal action is guaranteed as $t \to \infty$.
e.g.) $p = t^{-4}$ gives $\hat{U}_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}$
UCB1 algorithm: $a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \left( \hat{Q}_t(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right)$
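A minimal UCB1 sketch on an illustrative Bernoulli bandit; each arm is pulled once first so that $N_t(a) > 0$ before the bonus is computed.

```python
import numpy as np

# UCB1 sketch: a_t = argmax_a Q(a) + sqrt(2 log t / N(a)).
# Arm probabilities and horizon are arbitrary illustration values.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
A, T = len(true_means), 10_000

Q = np.zeros(A)
N = np.zeros(A)
for t in range(1, T + 1):
    if t <= A:
        a = t - 1                                    # pull each arm once first
    else:
        bonus = np.sqrt(2.0 * np.log(t) / N)         # exploration bonus U_t(a)
        a = int(np.argmax(Q + bonus))
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]
print("visit counts:", N, "estimates:", Q.round(2))
```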
Theorem
The UCB algorithm achieves logarithmic asymptotic total regret: $\lim_{t \to \infty} L_t \le 8 \log t \sum_{a \mid \Delta_a > 0} \Delta_a$
So far we have made no assumptions about the reward distribution R.
Bayesian bandits exploit prior knowledge of rewards, p[R]
to compute posterior distribution of rewards p[R∣ht]
And use the posterior to guide exploration (Bayesian UCB / Thompson sampling)
Better performance if prior knowledge is accurate!
Note: Bayes' theorem $p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$; $p(A \mid B)$: posterior probability, $p(A)$: prior probability
Bayesian UCB
Assume the reward distribution is Gaussian, $\mathcal{R}_a(r) = \mathcal{N}(r; \mu_a, \sigma_a^2)$
Compute the Gaussian posterior over $\mu_a$ and $\sigma_a^2$: $p[\mu_a, \sigma_a^2 \mid h_t] \propto p[\mu_a, \sigma_a^2] \cdot p[h_t \mid \mu_a, \sigma_a^2] = p[\mu_a, \sigma_a^2] \cdot \prod_{t \mid a_t = a} \mathcal{N}(r_t; \mu_a, \sigma_a^2)$ Question: how to determine the prior dist'n?
Pick the action whose upper confidence bound, a few posterior standard deviations above the mean of $Q(a)$, is largest (in favor of uncertainty): $a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \left( \mu_a + \frac{c\, \sigma_a}{\sqrt{N(a)}} \right)$
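A minimal Bayesian-UCB-style sketch, assuming Gaussian rewards with a known noise scale and a flat prior, so the posterior over $\mu_a$ has mean equal to the sample mean and standard deviation $\sigma_a / \sqrt{N(a)}$; the constant $c$, $\sigma$, and the bandit parameters are illustrative.

```python
import numpy as np

# Bayesian-UCB-style sketch. Assumes Gaussian rewards with a known noise scale
# sigma and a flat prior, so the posterior over mu_a is approximately
# N(sample mean, sigma^2 / N(a)). c, sigma and the arms are illustrative.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])
sigma, c, A, T = 1.0, 2.0, 3, 5_000

mean = np.zeros(A)    # posterior mean of mu_a (running sample mean)
N = np.zeros(A)
for t in range(1, T + 1):
    if t <= A:
        a = t - 1                                     # pull each arm once first
    else:
        post_std = sigma / np.sqrt(N)                 # posterior std of mu_a
        a = int(np.argmax(mean + c * post_std))       # mu_a + c * sigma_a / sqrt(N(a))
    r = true_means[a] + sigma * rng.standard_normal() # Gaussian reward
    N[a] += 1
    mean[a] += (r - mean[a]) / N[a]
print("visit counts:", N, "posterior means:", mean.round(2))
```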
Probability Matching
Select action $a$ according to the probability that $a$ is the optimal action: $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \ne a \mid h_t]$
Probability matching is optimistic in the face of uncertainty
because uncertain actions have higher probability of being max
(because their pdf is flatter / more spread out!)
It is hard to use when the posterior distribution cannot be computed analytically.
Thompson Sampling
An implementation of probability matching: $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \ne a \mid h_t] = \mathbb{E}_{\mathcal{R} \mid h_t}\left[\mathbf{1}\left(a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(a)\right)\right]$
Use Bayes' theorem to compute the posterior distribution $p[\mathcal{R}_a \mid h_t]$ Note: earlier we computed $p[\mu_a, \sigma_a^2 \mid h_t]$; since $\mathcal{R}_a$ was Gaussian there, this amounts to finding its parameters $\mu_a, \sigma_a^2$.
Sample a reward distribution R from posterior
Draw one sample $\mathcal{R}_a$ from the posterior for each action.
Compute the action-value function $Q(a) = \mathbb{E}[\mathcal{R}_a]$ (for a sampled Gaussian this is its mean $\mu_a$)
Select the action maximizing value on the sample, $a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(a)$
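A minimal Thompson sampling sketch for Bernoulli arms with $\text{Beta}(1,1)$ priors (a standard default, not specified in the notes); arm probabilities and horizon are illustrative.

```python
import numpy as np

# Thompson sampling sketch for Bernoulli arms with Beta(1,1) priors.
# Arm probabilities and horizon are arbitrary illustration values.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
A, T = len(true_means), 10_000

alpha = np.ones(A)   # 1 + number of observed successes (r = 1)
beta = np.ones(A)    # 1 + number of observed failures  (r = 0)
for t in range(T):
    theta = rng.beta(alpha, beta)        # one posterior sample of mu_a per arm
    a = int(np.argmax(theta))            # act greedily w.r.t. the sample
    r = float(rng.random() < true_means[a])
    alpha[a] += r                        # Bayesian posterior update
    beta[a] += 1 - r
print("posterior means:", (alpha / (alpha + beta)).round(2))
```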
Advantages
Thompson sampling achieves the Lai and Robbins lower bound for some problems
Thompson sampling is parameter-free
Information State Search
Exploration is valuable because it yields information.
Information gain is higher in uncertain situations (cf. information theory). If we know the value of information, we can trade off exploration and exploitation optimally.
Information State Space
View bandits as sequential decision-making problems
At each step there is an information state $\tilde{s}$
$\tilde{s}_t = f(h_t)$ ... $\tilde{s}$ is a statistic of the history,
summarizing all information accumulated so far
Each action $a$ causes a transition to a new information state $\tilde{s}'$ (by adding information), with probability $\tilde{\mathcal{P}}^a_{\tilde{s}, \tilde{s}'}$
This defines an MDP $\tilde{\mathcal{M}}$ in augmented information state space: $\tilde{\mathcal{M}} = \langle \tilde{\mathcal{S}}, \mathcal{A}, \tilde{\mathcal{P}}, \mathcal{R}, \gamma \rangle$
Example: Bayes-Adaptive Bernoulli Bandits
For action $a$, win or lose a game with probability $\mu_a$: $\mathcal{R}_a = \mathcal{B}(\mu_a) = \begin{cases} 1 & \text{w.p. } \mu_a \\ 0 & \text{w.p. } 1 - \mu_a \end{cases}$
We want to find which arm has the highest μa
The information state is $\tilde{s} = \langle \alpha, \beta \rangle$
$\alpha_a$ counts the pulls of arm $a$ where the reward was 1
$\beta_a$ counts the pulls of arm $a$ where the reward was 0
Start with a $\text{Beta}(\alpha_a, \beta_a)$ prior over the (Bernoulli) reward function $\mathcal{R}_a$: $r_a \sim \mathcal{R}_a = \mathcal{B}(\mu_a)$ with $\mu_a \sim \text{Beta}(\alpha_a, \beta_a)$. Note: the Beta distribution is a conjugate prior for the Bernoulli distribution.
Each time $a$ is selected, update the posterior for $\mathcal{R}_a$: $\mu_a \sim \text{Beta}(\alpha_a + 1, \beta_a)$ if $r = 1$, and $\mu_a \sim \text{Beta}(\alpha_a, \beta_a + 1)$ if $r = 0$. Note: each action's posterior is updated separately; since only one action is taken per step, only one posterior is updated per step.
This defines the transition function $\tilde{\mathcal{P}}$ for the Bayes-adaptive MDP
The information state $\langle \alpha, \beta \rangle$ corresponds to the reward model $\text{Beta}(\alpha, \beta)$
(The continually updated posterior parameters become the state.)
and each state transition corresponds to a Bayesian model update
By repeatedly playing the bandit we can learn $\tilde{\mathcal{P}}$ (by RL, i.e., Bayes-adaptive RL) and learn how the parameters change for each action. Then we can find the optimal trade-off between exploration and exploitation. (How this trade-off is actually computed is not covered in the lecture.)
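A small sketch of the information-state transitions described above: the state is the tuple of per-arm $\langle \alpha, \beta \rangle$ counts, and each pull moves to a neighboring state via a Bayesian model update. Planning in the resulting Bayes-adaptive MDP is not shown; the true win probabilities and the random placeholder policy are illustrative.

```python
import numpy as np

# Information-state sketch for Bayes-adaptive Bernoulli bandits.
# The state is the vector of per-arm (alpha, beta) counts; each pull of an arm
# transitions to a neighboring information state. Planning in the resulting
# Bayes-adaptive MDP is not shown. True win probabilities are illustrative.
rng = np.random.default_rng(0)
true_mu = np.array([0.4, 0.6])

# Start from the Beta(1, 1) prior for every arm: state = ((1, 1), (1, 1)).
state = tuple((1, 1) for _ in true_mu)

def step(state, a):
    """Pull arm a, observe a Bernoulli reward, and return the next information state."""
    r = int(rng.random() < true_mu[a])
    alpha, beta = state[a]
    new = (alpha + 1, beta) if r == 1 else (alpha, beta + 1)   # Bayesian model update
    return state[:a] + (new,) + state[a + 1:], r

for _ in range(10):
    a = rng.integers(len(true_mu))      # placeholder policy: random arm
    state, r = step(state, a)
print("final information state:", state)
```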
Contextual Bandits
A contextual bandit is a tuple $\langle \mathcal{A}, \mathcal{S}, \mathcal{R} \rangle$, where $\mathcal{S} = \mathbb{P}[s]$ is an unknown distribution over states (or "contexts") and $\mathcal{R}^a_s(r) = \mathbb{P}[r \mid s, a]$ is an unknown probability distribution over rewards ... (the reward distribution depends on the state)
At each step t,
Environment generates state $s_t \sim \mathcal{S}$
Agent selects action $a_t \in \mathcal{A}$
Environment generates reward $r_t \sim \mathcal{R}^{a_t}_{s_t}$
Goal is to maximize cumulative reward $\sum_{\tau=1}^{t} r_\tau$
Linear UCB can be applied here (details skipped in these notes).
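Since the notes skip the details, here is a minimal LinUCB-style sketch under the usual linear-reward assumption $\mathbb{E}[r \mid x, a] = \theta_a^\top x$, with a ridge-regression estimate per action and an uncertainty bonus $\alpha \sqrt{x^\top A_a^{-1} x}$; the feature dimension, $\alpha$, and the simulated environment are illustrative.

```python
import numpy as np

# LinUCB-style sketch for a contextual bandit. Assumes a linear reward model
# E[r | x, a] = theta_a . x per action, estimated by ridge regression, with an
# uncertainty bonus alpha * sqrt(x' A_a^{-1} x). Dimensions, alpha and the
# simulated true parameters are illustrative.
rng = np.random.default_rng(0)
d, n_actions, alpha, T = 4, 3, 1.0, 5_000
true_theta = rng.normal(size=(n_actions, d))          # hidden environment params

A = np.stack([np.eye(d) for _ in range(n_actions)])   # A_a = I + sum x x'
b = np.zeros((n_actions, d))                          # b_a = sum r x
for t in range(T):
    x = rng.normal(size=d)                            # observed context s_t
    ucb = np.empty(n_actions)
    for a in range(n_actions):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]                      # ridge-regression estimate
        ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
    a = int(np.argmax(ucb))
    r = true_theta[a] @ x + 0.1 * rng.standard_normal()   # noisy linear reward
    A[a] += np.outer(x, x)                            # sufficient-statistic updates
    b[a] += r * x
print("estimation error per arm:",
      [round(float(np.linalg.norm(np.linalg.inv(A[a]) @ b[a] - true_theta[a])), 3)
       for a in range(n_actions)])
```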
MDPs
Exploration/Exploitation Principles to MDPs
The same principles for exploration/exploitation apply to MDPs