This post summarizes Lecture 9: Exploration and Exploitation (YouTube) of David Silver's Introduction to Reinforcement Learning (Website).
Introduction
When we covered control earlier, we learned about greedy and ϵ-greedy. ϵ-greedy is one kind of exploration method; this lecture covers other exploration methods beyond ϵ-greedy.
Exploration vs. Exploitation Dilemma
Exploitation: Make the best decision given current information
e.g.) Go to your favorite restaurant
Exploration: Gather more information
e.g.) Try a new restaurant
Ways to Explore
Random exploration (e.g., ϵ-greedy, softmax)
Optimism in the face of uncertainty
Estimate uncertainty on value
Prefer to explore states/actions with highest uncertainty
Information state space
Consider agent's information as part of its state
Lookahead to see how information helps reward
State-action exploration vs. Parameter exploration
State-action exploration
Systematically explore state/action space
e.g.) At the current state, choose an action among those not yet tried
Parameter exploration
With parameter u, policy is π(A∣S,u)
e.g.) Pick different parameters and try for a while
Advantage: consistent exploration
Disadvantage: doesn't know about state/action space
We will focus on state-action exploration!
Principles
Naive Exploration:
Add noise to greedy policy (e.g. ϵ-greedy)
Optimistic Initialization:
Assume the best until proven otherwise
Optimism in the Face of Uncertainty:
Prefer actions with uncertain values
Probability Matching:
Select actions according to probability they are best
Information State Search:
Lookahead search incorporating value of information
(Recall the lecture's regret definitions: the gap of an action is $\Delta_a = V^* - Q(a)$ with $V^* = \max_{a} Q(a)$, $N_t(a)$ is the number of times $a$ has been selected, and the total regret is $L_t = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\, \Delta_a$.)
Greedy never explores ⇒ it can lock onto a suboptimal action, so that action's $\mathbb{E}[N_t(a)]$ keeps growing with $t$ ⇒ linear total regret!
ϵ-greedy explores forever (ϵ is fixed) ⇒ $\mathbb{E}[N_t(a)]$ grows linearly for every action, including suboptimal ones ⇒ linear total regret!
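As a quick illustration (not from the lecture), here is a minimal Bernoulli-bandit simulation comparing the total regret of greedy and ϵ-greedy; the arm probabilities, horizon, and ϵ value are arbitrary choices.

```python
import numpy as np

# Minimal Bernoulli bandit sketch (arm probabilities, horizon and epsilon are
# arbitrary illustration values, not from the lecture).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # Q(a); best arm is index 2
T = 10_000

def run(epsilon):
    """Run epsilon-greedy (epsilon=0 gives plain greedy) and return total regret."""
    Q = np.zeros(len(true_means))          # action-value estimates
    N = np.zeros(len(true_means))          # visit counts N_t(a)
    regret = 0.0
    for t in range(T):
        if rng.random() < epsilon:
            a = rng.integers(len(true_means))          # explore
        else:
            a = int(np.argmax(Q))                      # exploit
        r = float(rng.random() < true_means[a])        # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental MC update
        regret += true_means.max() - true_means[a]     # gap Delta_a
    return regret

print("greedy     total regret:", run(0.0))
print("eps-greedy total regret:", run(0.1))
```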
Is it possible to achieve sublinear total regret?
Optimistic Initialization
Review: greedy algorithm
Estimate the value of each action by Monte-Carlo evaluation: $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{t=1}^{T} r_t \mathbf{1}(a_t = a)$
The greedy algorithm selects the action with the highest value: $a_t^* = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, \hat{Q}_t(a)$
Greedy can lock onto a suboptimal action forever
Optimistic Initialization
Simple and practical idea: initialize Q(a) to high value
and update action value by incremental Monte-Carlo evaluation
Exploration happens early on (because the Q values decrease from their optimistic start as they converge),
but the algorithm can still lock onto a suboptimal action. (⇒ linear total regret)
Greedy Algorithm with Optimistic Initialization
Initialize values to the maximum possible value, $Q(a) = r_{\max}, \ \forall a \in \mathcal{A}$
Starting with $N(a) > 0$, update $\hat{Q}_t(a_t) = \hat{Q}_{t-1}(a_t) + \frac{1}{N_t(a_t)}\left(r_t - \hat{Q}_{t-1}(a_t)\right)$
Act greedily: $A_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, \hat{Q}_t(a)$
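A minimal sketch of greedy with optimistic initialization on the same kind of Bernoulli bandit; $r_{\max} = 1$ and the initial pseudo-count of 1 are illustrative assumptions.

```python
import numpy as np

# Greedy with optimistic initialization (sketch; r_max = 1 for Bernoulli rewards,
# and the initial pseudo-count of 1 is an illustrative assumption).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
r_max, T = 1.0, 10_000

Q = np.full(len(true_means), r_max)   # optimistic initial values Q(a) = r_max
N = np.ones(len(true_means))          # start with N(a) > 0
for t in range(T):
    a = int(np.argmax(Q))                       # always act greedily
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                   # incremental MC update
print("visit counts:", N, "estimates:", Q.round(2))
```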
Decaying ϵt-Greedy
Review: ϵ-greedy algorithm
With probability $1-\epsilon$, select $a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, \hat{Q}(a)$
With probability ϵ, select a random action
A constant ϵ ensures a minimum per-step regret (the unavoidable regret from exploring forever): $l_t \ge \frac{\epsilon}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Delta_a$ ⇒ ϵ-greedy has linear total regret (the slope of the total regret is bounded below by a positive constant)
Decaying ϵt
Pick a decay schedule for ϵ1,ϵ2,... ⇒ sublinear total regret
Example decay schedule: for $c > 0$ and $d = \min_{a \mid \Delta_a > 0} \Delta_a$ ... the smallest gap (best action vs. second-best action), set $\epsilon_t = \min\left\{1, \frac{c|\mathcal{A}|}{d^2 t}\right\}$ ... the smaller the smallest gap, the larger $\epsilon_t$ stays (more exploration, slower decay)
This schedule achieves logarithmic asymptotic total regret.
Unfortunately, schedule requires advance knowledge of gaps
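A sketch of the decay schedule above; it needs the smallest gap $d$, which is assumed known here purely for illustration (the constant $c$ and the arm probabilities are also arbitrary).

```python
import numpy as np

# Decaying epsilon_t-greedy sketch. The schedule needs the smallest gap d,
# which in practice is unknown; here it is computed from the (assumed known)
# true means purely for illustration. c is an arbitrary constant.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
gaps = true_means.max() - true_means
d = gaps[gaps > 0].min()              # smallest positive gap
c, A, T = 1.0, len(true_means), 10_000

Q = np.zeros(A)
N = np.zeros(A)
for t in range(1, T + 1):
    eps_t = min(1.0, c * A / (d**2 * t))        # epsilon_t = min{1, c|A| / (d^2 t)}
    if rng.random() < eps_t:
        a = rng.integers(A)                     # explore
    else:
        a = int(np.argmax(Q))                   # exploit
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]
print("visit counts:", N)
```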
Optimism in the Face of Uncertainty
The more uncertain we are about an action-value, the more important it is to explore that action
Example
Action $a_1$ (the blue distribution in the lecture's figure) is the most uncertain one → pick $a_1$
After trying it, we are less uncertain about its value (blue turns out to be bad), so now we are more likely to pick $a_2$ (the red distribution)!
Theorem (Hoeffding's Inequality)
Let $X_1, \ldots, X_t$ be i.i.d. random variables in $[0,1]$, and let $\bar{X}_t = \frac{1}{t} \sum_{\tau=1}^{t} X_\tau$ be the sample mean. Then $\mathbb{P}[\mathbb{E}[X] > \bar{X}_t + u] \le e^{-2tu^2}$. Note: this holds for any dist'n.
Apply this inequality to the rewards of the bandit, conditioned on selecting action $a$: $\mathbb{P}[Q(a) > \hat{Q}_t(a) + \hat{U}_t(a)] \le e^{-2 N_t(a) \hat{U}_t(a)^2}$ ... (1)
i.e., $\mathbb{P}[Q(a) \le \hat{Q}_t(a) + \hat{U}_t(a)] \ge 1 - e^{-2 N_t(a) \hat{U}_t(a)^2}$ ... (2) Note: for each action we can find the smallest $\hat{U}_t(a)$ that makes the right-hand side of (2) equal to a chosen confidence level (e.g. 0.95).
Setting the right-hand side of (1) to a probability $p$ (e.g. 0.05) and solving: $e^{-2 N_t(a) \hat{U}_t(a)^2} = p \implies \hat{U}_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$
Since uncertainty shrinks as time goes on, we can decay $p$ as a function of $t$.
That is, the right-hand side of (2) increases: we demand an ever stricter confidence level.
In that case, selecting the optimal action is guaranteed as $t \to \infty$.
e.g.) $p = t^{-4}$ gives $\hat{U}_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}$
UCB1 algorithm: $a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \left( \hat{Q}_t(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right)$
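A minimal UCB1 sketch on an illustrative Bernoulli bandit; each arm is pulled once first so that $N_t(a) > 0$ before the bonus is computed.

```python
import numpy as np

# UCB1 sketch: a_t = argmax_a Q(a) + sqrt(2 log t / N(a)).
# Arm probabilities and horizon are arbitrary illustration values.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
A, T = len(true_means), 10_000

Q = np.zeros(A)
N = np.zeros(A)
for t in range(1, T + 1):
    if t <= A:
        a = t - 1                                    # pull each arm once first
    else:
        bonus = np.sqrt(2.0 * np.log(t) / N)         # exploration bonus U_t(a)
        a = int(np.argmax(Q + bonus))
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]
print("visit counts:", N, "estimates:", Q.round(2))
```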
Theorem
The UCB algorithm achieves logarithmic asymptotic total regret: $\lim_{t \to \infty} L_t \le 8 \log t \sum_{a \mid \Delta_a > 0} \Delta_a$
So far we have made no assumptions about the reward distribution R.
Bayesian bandits exploit prior knowledge of rewards, p[R]
to compute posterior distribution of rewards p[R∣ht]
And use the posterior to guide exploration (Bayesian UCB / Thompson sampling)
Better performance if prior knowledge is accurate!
Note: Bayes' theorem $p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$; $p(A \mid B)$: posterior probability, $p(A)$: prior probability
Bayesian UCB
Assume the reward distribution is Gaussian, $\mathcal{R}_a(r) = \mathcal{N}(r; \mu_a, \sigma_a^2)$
Compute the Gaussian posterior over $\mu_a$ and $\sigma_a^2$: $p[\mu_a, \sigma_a^2 \mid h_t] \propto p[\mu_a, \sigma_a^2] \cdot p[h_t \mid \mu_a, \sigma_a^2] = p[\mu_a, \sigma_a^2] \cdot \prod_{t \mid a_t = a} \mathcal{N}(r_t; \mu_a, \sigma_a^2)$ Question: how to determine the prior dist'n?
Pick the action whose upper confidence bound, a few posterior standard deviations above the mean of $Q(a)$, is largest (in favor of uncertainty): $a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \left( \mu_a + \frac{c\, \sigma_a}{\sqrt{N(a)}} \right)$
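A minimal Bayesian-UCB-style sketch, assuming Gaussian rewards with a known noise scale and a flat prior, so the posterior over $\mu_a$ has mean equal to the sample mean and standard deviation $\sigma_a / \sqrt{N(a)}$; the constant $c$, $\sigma$, and the bandit parameters are illustrative.

```python
import numpy as np

# Bayesian-UCB-style sketch. Assumes Gaussian rewards with a known noise scale
# sigma and a flat prior, so the posterior over mu_a is approximately
# N(sample mean, sigma^2 / N(a)). c, sigma and the arms are illustrative.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])
sigma, c, A, T = 1.0, 2.0, 3, 5_000

mean = np.zeros(A)    # posterior mean of mu_a (running sample mean)
N = np.zeros(A)
for t in range(1, T + 1):
    if t <= A:
        a = t - 1                                     # pull each arm once first
    else:
        post_std = sigma / np.sqrt(N)                 # posterior std of mu_a
        a = int(np.argmax(mean + c * post_std))       # mu_a + c * sigma_a / sqrt(N(a))
    r = true_means[a] + sigma * rng.standard_normal() # Gaussian reward
    N[a] += 1
    mean[a] += (r - mean[a]) / N[a]
print("visit counts:", N, "posterior means:", mean.round(2))
```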
Probability Matching
Select action $a$ according to the probability that $a$ is the optimal action: $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \ne a \mid h_t]$
Probability matching is optimistic in the face of uncertainty
because uncertain actions have higher probability of being max
(because their pdf is flatter / more spread out!)
It is hard to use when the posterior distribution cannot be computed analytically.
Thompson Sampling
An implementation of probability matching: $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \ne a \mid h_t] = \mathbb{E}_{\mathcal{R} \mid h_t}\left[\mathbf{1}\left(a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(a)\right)\right]$
Use Bayes' theorem to compute the posterior distribution $p[\mathcal{R}_a \mid h_t]$ Note: earlier we computed $p[\mu_a, \sigma_a^2 \mid h_t]$; since $\mathcal{R}_a$ was Gaussian there, this amounts to finding its parameters $\mu_a, \sigma_a^2$.
Sample a reward distribution R from posterior
Draw one sample $\mathcal{R}_a$ from the posterior for each action.
Compute the action-value function $Q(a) = \mathbb{E}[\mathcal{R}_a]$ (for a sampled Gaussian this is its mean $\mu_a$)
Select the action maximizing value on the sample, $a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(a)$
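A minimal Thompson sampling sketch for Bernoulli arms with $\text{Beta}(1,1)$ priors (a standard default, not specified in the notes); arm probabilities and horizon are illustrative.

```python
import numpy as np

# Thompson sampling sketch for Bernoulli arms with Beta(1,1) priors.
# Arm probabilities and horizon are arbitrary illustration values.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
A, T = len(true_means), 10_000

alpha = np.ones(A)   # 1 + number of observed successes (r = 1)
beta = np.ones(A)    # 1 + number of observed failures  (r = 0)
for t in range(T):
    theta = rng.beta(alpha, beta)        # one posterior sample of mu_a per arm
    a = int(np.argmax(theta))            # act greedily w.r.t. the sample
    r = float(rng.random() < true_means[a])
    alpha[a] += r                        # Bayesian posterior update
    beta[a] += 1 - r
print("posterior means:", (alpha / (alpha + beta)).round(2))
```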
Advantages
Thompson sampling achieves the Lai and Robbins lower bound for some problems
Thompson sampling is parameter-free
Information State Search
Exploration is valuable because it yields information.
Information gain is higher in uncertain situations (cf. information theory). If we know the value of information, we can trade off exploration and exploitation optimally.
Information State Space
View bandits as sequential decision-making problems
At each step there is an information state $\tilde{s}$
$\tilde{s}_t = f(h_t)$ ... $\tilde{s}$ is a statistic of the history,
summarizing all information accumulated so far
Each action $a$ causes a transition to a new information state $\tilde{s}'$ (by adding information), with probability $\tilde{\mathcal{P}}^a_{\tilde{s}, \tilde{s}'}$
This defines an MDP $\tilde{\mathcal{M}}$ in augmented information state space: $\tilde{\mathcal{M}} = \langle \tilde{\mathcal{S}}, \mathcal{A}, \tilde{\mathcal{P}}, \mathcal{R}, \gamma \rangle$
Example: Bayes-Adaptive Bernoulli Bandits
For action $a$, win or lose a game with probability $\mu_a$: $\mathcal{R}_a = \mathcal{B}(\mu_a) = \begin{cases} 1 & \text{w.p. } \mu_a \\ 0 & \text{w.p. } 1 - \mu_a \end{cases}$
We want to find which arm has the highest μa
The information state is $\tilde{s} = \langle \alpha, \beta \rangle$
$\alpha_a$ counts the pulls of arm $a$ where the reward was 1
$\beta_a$ counts the pulls of arm $a$ where the reward was 0
Start with a $\text{Beta}(\alpha_a, \beta_a)$ prior over the (Bernoulli) reward function $\mathcal{R}_a$: $r_a \sim \mathcal{R}_a = \mathcal{B}(\mu_a)$ with $\mu_a \sim \text{Beta}(\alpha_a, \beta_a)$. Note: the Beta distribution is a conjugate prior for the Bernoulli distribution.
Each time $a$ is selected, update the posterior for $\mathcal{R}_a$: $\mu_a \sim \text{Beta}(\alpha_a + 1, \beta_a)$ if $r = 1$, and $\mu_a \sim \text{Beta}(\alpha_a, \beta_a + 1)$ if $r = 0$. Note: each action's posterior is updated separately; since only one action is taken per step, only one posterior is updated per step.
This defines the transition function $\tilde{\mathcal{P}}$ for the Bayes-adaptive MDP
The information state $\langle \alpha, \beta \rangle$ corresponds to the reward model $\text{Beta}(\alpha, \beta)$
(The continually updated posterior parameters become the state.)
and each state transition corresponds to a Bayesian model update
By repeatedly playing the bandit we can learn $\tilde{\mathcal{P}}$ (by RL, i.e., Bayes-adaptive RL) and learn how the parameters change for each action. Then we can find the optimal trade-off between exploration and exploitation. (How this trade-off is actually computed is not covered in the lecture.)
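A small sketch of the information-state transitions described above: the state is the tuple of per-arm $\langle \alpha, \beta \rangle$ counts, and each pull moves to a neighboring state via a Bayesian model update. Planning in the resulting Bayes-adaptive MDP is not shown; the true win probabilities and the random placeholder policy are illustrative.

```python
import numpy as np

# Information-state sketch for Bayes-adaptive Bernoulli bandits.
# The state is the vector of per-arm (alpha, beta) counts; each pull of an arm
# transitions to a neighboring information state. Planning in the resulting
# Bayes-adaptive MDP is not shown. True win probabilities are illustrative.
rng = np.random.default_rng(0)
true_mu = np.array([0.4, 0.6])

# Start from the Beta(1, 1) prior for every arm: state = ((1, 1), (1, 1)).
state = tuple((1, 1) for _ in true_mu)

def step(state, a):
    """Pull arm a, observe a Bernoulli reward, and return the next information state."""
    r = int(rng.random() < true_mu[a])
    alpha, beta = state[a]
    new = (alpha + 1, beta) if r == 1 else (alpha, beta + 1)   # Bayesian model update
    return state[:a] + (new,) + state[a + 1:], r

for _ in range(10):
    a = rng.integers(len(true_mu))      # placeholder policy: random arm
    state, r = step(state, a)
print("final information state:", state)
```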
Contextual Bandits
A contextual bandit is a tuple $\langle \mathcal{A}, \mathcal{S}, \mathcal{R} \rangle$, where $\mathcal{S} = \mathbb{P}[s]$ is an unknown distribution over states (or "contexts") and $\mathcal{R}^a_s(r) = \mathbb{P}[r \mid s, a]$ is an unknown probability distribution over rewards ... (the reward distribution depends on the state)
At each step t,
Environment generates state $s_t \sim \mathcal{S}$
Agent selects action $a_t \in \mathcal{A}$
Environment generates reward $r_t \sim \mathcal{R}^{a_t}_{s_t}$
Goal is to maximize cumulative reward $\sum_{\tau=1}^{t} r_\tau$
Linear UCB can be applied here (details skipped in these notes).
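Since the notes skip the details, here is a minimal LinUCB-style sketch under the usual linear-reward assumption $\mathbb{E}[r \mid x, a] = \theta_a^\top x$, with a ridge-regression estimate per action and an uncertainty bonus $\alpha \sqrt{x^\top A_a^{-1} x}$; the feature dimension, $\alpha$, and the simulated environment are illustrative.

```python
import numpy as np

# LinUCB-style sketch for a contextual bandit. Assumes a linear reward model
# E[r | x, a] = theta_a . x per action, estimated by ridge regression, with an
# uncertainty bonus alpha * sqrt(x' A_a^{-1} x). Dimensions, alpha and the
# simulated true parameters are illustrative.
rng = np.random.default_rng(0)
d, n_actions, alpha, T = 4, 3, 1.0, 5_000
true_theta = rng.normal(size=(n_actions, d))          # hidden environment params

A = np.stack([np.eye(d) for _ in range(n_actions)])   # A_a = I + sum x x'
b = np.zeros((n_actions, d))                          # b_a = sum r x
for t in range(T):
    x = rng.normal(size=d)                            # observed context s_t
    ucb = np.empty(n_actions)
    for a in range(n_actions):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]                      # ridge-regression estimate
        ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
    a = int(np.argmax(ucb))
    r = true_theta[a] @ x + 0.1 * rng.standard_normal()   # noisy linear reward
    A[a] += np.outer(x, x)                            # sufficient-statistic updates
    b[a] += r * x
print("estimation error per arm:",
      [round(float(np.linalg.norm(np.linalg.inv(A[a]) @ b[a] - true_theta[a])), 3)
       for a in range(n_actions)])
```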
MDPs
Exploration/Exploitation Principles to MDPs
The same principles for exploration/exploitation apply to MDPs