Silver RL (9) Exploration and Exploitation

Sanghyeok Choi · January 24, 2022

This post summarizes Lecture 9: Exploration and Exploitation (Youtube) of Professor David Silver's Introduction to Reinforcement Learning (Website).

Introduction

When we discussed control earlier, we covered greedy and $\epsilon$-greedy.
$\epsilon$-greedy is one kind of exploration method; this lecture covers other methods beyond $\epsilon$-greedy as well.

Exploration vs. Exploitation Dilemma

  • Exploitation: Make the best decision given current information
    e.g.) Go to your favorite restaurant
  • Exploration: Gather more information
    e.g.) Try a new restaurant
  • Ways to explore
    • Random exploration (e.g., $\epsilon$-greedy, softmax)
    • Optimism in the face of uncertainty
      • Estimate uncertainty on value
      • Prefer to explore states/actions with highest uncertainty
    • Information state space
      • Consider agent's information as part of its state
      • Lookahead to see how information helps reward
  • State-action exploration vs. Parameter exploration
    • State-action exploration
      • Systematically explore state/action space
        e.g.) At the current state, pick an action among those not yet tried
    • Parameter exploration
      • With parameter $\mathbf{u}$, the policy is $\pi(A|S,\mathbf{u})$
        e.g.) Pick different parameters and try for a while
      • Advantage: consistent exploration
        Disadvantage: doesn't know about state/action space
    • We will focus on state-action exploration!
  • Principles
    • Naive Exploration:
      Add noise to greedy policy (e.g. $\epsilon$-greedy)
    • Optimistic Initialization:
      Assume the best until proven otherwise
    • Optimism in the Face of Uncertainty:
      Prefer actions with uncertain values
    • Probability Matching:
      Select actions according to probability they are best
    • Information State Search:
      Lookahead search incorporating value of information

Multi-Armed Bandits

Multi-Armed Bandits


  • A multi-armed bandit is a tuple $\langle \mathcal{A}, \mathcal{R} \rangle$
    An action yields a reward and that's it (no state, no transitions).
  • $\mathcal{A}$ is a known set of $m$ actions (or "arms")
    $\mathcal{R}^a(r)=\mathbb{P}[r|a]$ is an unknown probability distribution over rewards
  • At each step $t$ the agent selects an action $a_t\in\mathcal{A}$,
    and the environment generates a reward $r_t\sim\mathcal{R}^{a_t}$
  • The goal is to maximize the cumulative reward $\sum_{\tau=1}^{t}r_\tau$
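
To make the setup concrete, here is a minimal Bernoulli bandit sketch in Python (the arm probabilities are made-up illustration values, not something from the lecture):

```python
import numpy as np

class BernoulliBandit:
    """Minimal multi-armed bandit: arm a pays reward 1 w.p. mu[a], else 0.
    The arm probabilities `mu` are illustrative, not from the lecture."""

    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu)                # true action values Q(a), unknown to the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        return float(self.rng.random() < self.mu[a])

bandit = BernoulliBandit([0.2, 0.5, 0.7])       # three arms
reward = bandit.pull(1)                         # sample a reward from arm 1
```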

Regret

  • Regret
    • The action-value is the mean reward for action $a$,
      $Q(a)=\mathbb{E}[r|a]$
    • The optimal value $V^*$ is
      $V^*=Q(a^*)=\max\limits_{a\in\mathcal{A}}Q(a)$
    • The regret is the opportunity loss for one step
      $l_t=\mathbb{E}[V^*-Q(a_t)]$
    • The total regret is the total opportunity loss
      $L_t=\mathbb{E}\left[ \sum\limits_{\tau=1}^{t}(V^*-Q(a_\tau)) \right]$
    • Maximizing cumulative reward $\equiv$ minimizing total regret
  • Counting Regret
    • The count $N_t(a)$ is the expected number of selections for action $a$
    • The gap $\Delta_a$ is the difference in value between action $a$ and the optimal action $a^*$,
      $\Delta_a=V^*-Q(a)$
    • The total regret is a function of the gaps and the counts
      $L_t=\mathbb{E}\left[ \sum\limits_{\tau=1}^{t}(V^*-Q(a_\tau)) \right]=\sum\limits_{a\in\mathcal{A}}\mathbb{E}[N_t(a)](V^*-Q(a))=\sum\limits_{a\in\mathcal{A}}\mathbb{E}[N_t(a)]\Delta_a$
    • To minimize $L_t$, we want small counts for actions with large gaps.
      The problem is that the gaps are not known!
  • Lower Bound on total regret
    • Hard problems have similar-looking arms with different means

      Theorem (Lai and Robbins)
      Asymptotic total regret is at least logarithmic in the number of steps
      $\lim\limits_{t\to\infty}L_t\geq\log{t}\sum\limits_{a|\Delta_a>0}\cfrac{\Delta_a}{KL(\mathcal{R}^a\|\mathcal{R}^{a^*})}$

    • If the gap $\Delta_a$ is large, the lower bound goes up.
      If the KL divergence is small, i.e., the reward distributions look similar to each other, the lower bound also goes up.
      Note: KL divergence
      $KL(P\|Q)=H(P,Q)-H(P)=\sum\limits_{x\in\mathcal{X}}p(x)\log{\cfrac{1}{q(x)}}-\sum\limits_{x\in\mathcal{X}}p(x)\log{\cfrac{1}{p(x)}}$
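
As a sanity check on the definitions above, here is a tiny sketch that tracks the total regret of a uniformly random policy on a toy Bernoulli bandit (the arm probabilities are made up for illustration):

```python
import numpy as np

# Track the total regret L_t = sum_a N_t(a) * Delta_a of a uniformly random policy
# on a toy 3-armed Bernoulli bandit (arm probabilities are illustrative).
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])     # true action values Q(a)
gaps = mu.max() - mu               # Delta_a = V* - Q(a)

counts = np.zeros(len(mu))         # N_t(a)
for t in range(1000):
    a = rng.integers(len(mu))      # uniformly random action
    counts[a] += 1                 # (the sampled reward is not needed for the regret itself)

total_regret = np.sum(counts * gaps)
print(total_regret)                # grows linearly in t for a random (or fixed) policy
```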

Greedy and $\epsilon$-greedy algorithms

  • Linear or Sublinear Total Regret
    • Greedy never explores
      $\Rightarrow$ it can lock onto a suboptimal action forever, so $N_t(a)$ keeps growing for that action: linear total regret!
    • $\epsilon$-greedy explores forever ($\epsilon$ is fixed)
      $\Rightarrow$ every action keeps being selected at a fixed expected rate, so $\mathbb{E}[N_t(a)]$ grows linearly even for suboptimal actions: linear total regret!
    • Is it possible to achieve sublinear total regret?
  • Optimistic Initialization
    • Review: greedy algorithm
      • Estimate the value of each action by Monte-Carlo evaluation
        $\hat{Q}_t(a)=\cfrac{1}{N_t(a)}\sum\limits_{t=1}^{T}r_t\mathbf{1}(a_t=a)$
      • The greedy algorithm selects the action with the highest value
        $a_t^*=\argmax\limits_{a\in\mathcal{A}}\hat{Q}_t(a)$
      • Greedy can lock onto a suboptimal action forever
    • Optimistic Initialization
      • Simple and practical idea: initialize $Q(a)$ to a high value
        and update the action values by incremental Monte-Carlo evaluation
      • Exploration happens early on (because the $Q$ values decrease overall as they converge),
        but it can still get locked onto a suboptimal action ($\Rightarrow$ linear total regret).
    • Greedy Algorithm with Optimistic Initialization
      • Initialize values to the maximum possible value, $Q(a)=r_{\max}$, $\forall a\in\mathcal{A}$
      • Starting with $N(a)>0$,
        $\hat{Q}_t(a_t)=\hat{Q}_{t-1}(a_t)+\cfrac{1}{N_t(a_t)}(r_t-\hat{Q}_{t-1}(a_t))$
      • Act greedily
        $A_t=\argmax\limits_{a\in\mathcal{A}}\hat{Q}_t(a)$
  • Decaying $\epsilon_t$-Greedy
    • Review: $\epsilon$-greedy algorithm
      • With probability $1-\epsilon$, select $a=\argmax\limits_{a\in\mathcal{A}}\hat{Q}(a)$
        With probability $\epsilon$, select a random action
      • A constant $\epsilon$ guarantees a minimum regret (the minimum regret incurred by exploring forever)
        $l_t\geq\cfrac{\epsilon}{|\mathcal{A}|}\sum\limits_{a\in\mathcal{A}}\Delta_a$
        $\Rightarrow$ $\epsilon$-greedy has linear total regret (the slope has a positive lower bound)
    • Decaying $\epsilon_t$
      • Pick a decay schedule for $\epsilon_1, \epsilon_2, \dots$
        $\Rightarrow$ sublinear total regret
      • Example decay schedule (see the sketch after this list):
        $c>0$
        $d=\min\limits_{a|\Delta_a>0}\Delta_a$ ... the smallest gap (best action vs. second-best action)
        $\epsilon_t=\min\left\{ 1, \cfrac{c|\mathcal{A}|}{d^2t} \right\}$ ... the smaller the smallest gap, the larger $\epsilon_t$ (more exploration)
        This schedule achieves logarithmic asymptotic total regret.
    • Unfortunately, the schedule requires advance knowledge of the gaps
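
A minimal sketch of this family of algorithms on a toy Bernoulli bandit, combining optimistic initialization with a decaying $\epsilon_t$ (the arm probabilities and the simple $c/t$ schedule constant are illustrative choices, since the true gaps are unknown in practice):

```python
import numpy as np

# Decaying epsilon-greedy with optimistic initialization on a toy Bernoulli bandit.
# Arm probabilities and the schedule constant are illustrative; the gap-based schedule
# from the slides needs the (unknown) gaps, so a simple c/t schedule is used instead.
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])              # true Q(a), unknown to the agent
n_arms, T = len(mu), 5000

Q = np.full(n_arms, 1.0)                    # optimistic init: r_max = 1 for Bernoulli rewards
N = np.zeros(n_arms)

for t in range(1, T + 1):
    eps = min(1.0, 5.0 / t)                 # decaying epsilon_t
    if rng.random() < eps:
        a = int(rng.integers(n_arms))       # explore: random action
    else:
        a = int(np.argmax(Q))               # exploit: greedy action
    r = float(rng.random() < mu[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]               # incremental Monte-Carlo update

print(Q.round(2), N)                        # pull counts should concentrate on the best arm
```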

Optimism in the Face of Uncertainty

The more uncertain we are about an action-value, the more important it is to explore that action

  • Example
    Action $a_1$ (blue line) is the most uncertain one $\to$ pick $a_1$

    After observing more of its rewards, we are less uncertain about $a_1$'s value (blue is certainly bad...)
    Now we are more likely to pick $a_2$ (red line)!
  • Upper Confidence Bounds
    • The upper confidence $\hat{U}_t(a)$ is an uncertainty measure
    • For each action value, estimate $\hat{U}_t(a)$ such that
      $Q(a) \leq \hat{Q}_t(a)+\hat{U}_t(a)$ with high probability (e.g. 95%)
    • $\hat{U}_t(a)$ depends on the number of times $N(a)$ has been selected
      Note: small (large) $N(a)$ $\Rightarrow$ more (less) uncertain $\Rightarrow$ large (small) $\hat{U}_t(a)$
    • Select the action maximizing the Upper Confidence Bound (UCB),
      $a_t=\argmax\limits_{a\in\mathcal{A}}\left(\hat{Q}_t(a)+\hat{U}_t(a)\right)$
  • Hoeffding's Inequality

    Theorem (Hoeffding's Inequality)
    Let $X_1, \dots, X_t$ be i.i.d. random variables in $[0,1]$, and let
    $\bar{X}_t=\frac{1}{t}\sum_{\tau=1}^t{X_\tau}$ be the sample mean. Then,
    $\mathbb{P}\left[\mathbb{E}[X]>\bar{X}_t+u\right] \leq e^{-2tu^2}$
    Note: this holds for any distribution

    • Apply this inequality to the rewards of the bandit, conditioned on selecting action $a$:
      $\mathbb{P}\left[ Q(a) > \hat{Q}_t(a) + \hat{U}_t(a) \right] \leq e^{-2N_t(a)U_t(a)^2}$ ... (1)
      i.e., $\mathbb{P}\left[ Q(a) \leq \hat{Q}_t(a) + \hat{U}_t(a) \right] \geq 1-e^{-2N_t(a)U_t(a)^2}$ ... (2)
      Note: for each action, we can find the smallest $\hat{U}_t(a)$ that makes the right-hand side of (2) equal to a chosen probability (e.g. 0.95).
    • Setting the right-hand side of (1) to a probability $p$ (e.g. 0.05) and solving,
      $e^{-2N_t(a)U_t(a)^2}=p \implies U_t(a)=\sqrt{\cfrac{-\log{p}}{2N_t(a)}}$
    • Since uncertainty decreases over time, we can decay $p$ as $t$ grows,
      i.e., the right-hand side of (2) increases: we demand a stricter confidence level.
      In this case, in particular, we are guaranteed to select the optimal action as $t\to\infty$.
      e.g.) $p=t^{-4}$ gives $U_t(a)=\sqrt{\cfrac{2\log{t}}{N_t(a)}}$
  • UCB1 algorithm
    $a_t=\argmax\limits_{a\in\mathcal{A}}\left(\hat{Q}_t(a)+\sqrt{\cfrac{2\log{t}}{N_t(a)}}\right)$

    Theorem
    The UCB algorithm achieves logarithmic asymptotic total regret
    $\lim\limits_{t\to\infty}L_t\leq 8\log{t}\sum\limits_{a|\Delta_a>0}\cfrac{1}{\Delta_a}$

    (The proof is not covered separately in the lecture; a short UCB1 sketch follows below.)
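
A minimal UCB1 sketch on the same kind of toy Bernoulli bandit (arm probabilities are illustrative values):

```python
import numpy as np

# UCB1 on a toy Bernoulli bandit: a_t = argmax_a Q_hat(a) + sqrt(2 log t / N_t(a)).
# Arm probabilities are illustrative values, not from the lecture.
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])
n_arms, T = len(mu), 5000

Q = np.zeros(n_arms)
N = np.zeros(n_arms)

for a in range(n_arms):                      # pull each arm once so that N_t(a) > 0
    Q[a] = float(rng.random() < mu[a])
    N[a] = 1

for t in range(n_arms + 1, T + 1):
    ucb = Q + np.sqrt(2.0 * np.log(t) / N)   # upper confidence bound per arm
    a = int(np.argmax(ucb))
    r = float(rng.random() < mu[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(N)   # suboptimal arms should be pulled only O(log T) times
```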

Bayesian Bandits

Note: Frequentist approach: makes minimal assumptions about the distribution
        Bayesian approach: exploits prior knowledge

  • Bayesian Bandits
    • So far we have made no assumptions about the reward distribution $\mathcal{R}$.
    • Bayesian bandits exploit prior knowledge of rewards, $p[\mathcal{R}]$,
      to compute a posterior distribution over rewards $p[\mathcal{R}|h_t]$
    • And use the posterior to guide exploration (Bayesian UCB / Thompson sampling)
      Better performance if prior knowledge is accurate!
    • Note: Bayes' Theorem
      $p(A|B)=\cfrac{p(B|A)}{p(B)}\,p(A)$
      $p(A|B)$: posterior probability, $p(A)$: prior probability
  • Bayesian UCB
    • Assume the reward distribution is Gaussian, $\mathcal{R}_a(r)=\mathcal{N}(r;\mu_a,\sigma^2_a)$
    • Compute the Gaussian posterior over $\mu_a$ and $\sigma_a^2$,
      $p[\mu_a,\sigma_a^2|h_t]\propto p[\mu_a,\sigma_a^2] \cdot p[h_t|\mu_a,\sigma_a^2] = p[\mu_a,\sigma_a^2]\cdot\prod\limits_{t|a_t=a}{\mathcal{N}(r_t;\mu_a,\sigma_a^2)}$
      Question: how do we determine the prior distribution?
    • Pick the action that maximizes the posterior mean plus $c$ standard deviations of $Q(a)$ (in favor of uncertainty),
      $a_t=\argmax\limits_{a\in\mathcal{A}}\left(\mu_a+c\cfrac{\sigma_a}{\sqrt{N(a)}}\right)$
  • Probability Matching
    • Select action $a$ according to the probability that $a$ is the optimal action
      $\pi(a|h_t)=\mathbb{P}[Q(a)>Q(a'), \forall a'\neq a \mid h_t]$
    • Probability matching is optimistic in the face of uncertainty
      because uncertain actions have a higher probability of being the max
      (their pdf is more spread out!)
    • It is hard to use when the posterior distribution cannot be obtained analytically.
  • Thompson Sampling
    • An implementation of probability matching
      $\pi(a|h_t)=\mathbb{P}[Q(a)>Q(a'), \forall a'\neq a \mid h_t]=\mathbb{E}_{\mathcal{R}|h_t}\left[ \mathbf{1}\left(a=\argmax\limits_{a'\in\mathcal{A}}Q(a')\right) \right]$
    • Use Bayes' theorem to compute the posterior distribution $p[\mathcal{R}_a|h_t]$
      Note: earlier we computed $p[\mu_a,\sigma_a^2|h_t]$; since $\mathcal{R}_a$ is a Gaussian distribution there, this is the same as finding its parameters $\mu_a,\sigma_a^2$.
    • Sample a reward distribution $\mathcal{R}$ from the posterior
      (one sampled $\mathcal{R}_a$ for each action)
    • Compute the action-value function of the sample, $Q(a)=\mathbb{E}[\mathcal{R}_a]=\mu_a$
    • Select the action maximizing the value on the sample, $a_t=\argmax\limits_{a\in\mathcal{A}}Q(a)$
    • Advantages
      • Thompson sampling achieves the Lai and Robbins lower bound for some problems
      • Thompson sampling is parameter-free
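
A minimal Thompson sampling sketch: the lecture's Bayesian example uses a Gaussian model, while this sketch uses the Beta-Bernoulli conjugate pair that also appears in the Bayes-adaptive example below (arm probabilities and the Beta(1,1) prior are illustrative choices):

```python
import numpy as np

# Thompson sampling with a Beta-Bernoulli model: sample one reward model per arm
# from the posterior, act greedily w.r.t. the sample, then update the posterior.
# Arm probabilities and the Beta(1,1) prior are illustrative choices.
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])
n_arms, T = len(mu), 5000

alpha = np.ones(n_arms)                  # posterior Beta(alpha, beta); alpha ~ observed 1s
beta = np.ones(n_arms)                   # beta ~ observed 0s

for t in range(T):
    theta = rng.beta(alpha, beta)        # one sampled mu_a for every arm
    a = int(np.argmax(theta))            # greedy w.r.t. the sampled model
    r = float(rng.random() < mu[a])
    alpha[a] += r                        # Beta(alpha+1, beta) if r = 1
    beta[a] += 1.0 - r                   # Beta(alpha, beta+1) if r = 0

print(alpha + beta - 2.0)                # pull counts per arm
```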

Exploration is valuable because it yields information.
Information gain is higher in uncertain situations (cf. information theory).
If we know the value of information, we can trade off exploration and exploitation optimally.

  • Information State Space

    • View bandits as sequential decision-making problems
    • At each step there is an information state $\tilde{s}$
      • $\tilde{s}_t=f(h_t)$ ... $\tilde{s}$ is a statistic of the history,
        summarizing all information accumulated so far
    • Each action $a$ causes a transition to a new information state
      $\tilde{s}'$ (by adding information), with probability $\tilde{\mathcal{P}}^a_{\tilde{s},\tilde{s}'}$
    • This defines an MDP $\tilde{\mathcal{M}}$ in augmented information state space
      $\tilde{\mathcal{M}}=\langle \tilde{\mathcal{S}}, \mathcal{A}, \tilde{\mathcal{P}}, \mathcal{R}, \gamma \rangle$
  • Example: Bayes-Adaptive Bernoulli Bandits

    • For action $a$, win or lose a game with probability $\mu_a$
      $\mathcal{R}^a=\mathcal{B}(\mu_a)=\begin{cases}1 & \text{w.p. } \mu_a\\0 & \text{w.p. } 1-\mu_a\end{cases}$
      We want to find which arm has the highest $\mu_a$
    • The information state is $\tilde{s}=\langle\alpha,\beta\rangle$
      • $\alpha_a$ counts the pulls of arm $a$ where the reward was 1
      • $\beta_a$ counts the pulls of arm $a$ where the reward was 0
    • Start with a $\mathrm{Beta}(\alpha_a,\beta_a)$ prior over the reward function $\mathcal{R}^a$ (Bernoulli)
      $r^a\sim\mathcal{R}^a=\mathcal{B}(\mu_a)$
      $\mu_a\sim \mathrm{Beta}(\alpha_a,\beta_a)$
      Note: the Beta distribution is a conjugate prior for the Bernoulli distribution
    • Each time $a$ is selected, update the posterior for $\mathcal{R}^a$
      $\mu_a \sim \mathrm{Beta}(\alpha_a+1, \beta_a)$ if $r=1$
      $\mu_a \sim \mathrm{Beta}(\alpha_a, \beta_a+1)$ if $r=0$
      Note: each action has its own posterior; since only one action is taken per step, only one posterior is updated each step.
    • This defines the transition function $\tilde{\mathcal{P}}$ for the Bayes-adaptive MDP
      • The information state $\langle \alpha,\beta \rangle$ corresponds to the reward model $\mathrm{Beta}(\alpha,\beta)$
        (the parameters of the continually updated posterior become the state),
        and each state transition corresponds to a Bayesian model update (see the sketch below)
      • By repeatedly interacting with the bandit, we can learn $\tilde{\mathcal{P}}$ (by RL, i.e., Bayes-adaptive RL) and see how the parameters change for each action. Then we can find the optimal trade-off between exploration and exploitation. (How the trade-off is actually computed is not covered in the lecture.)
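
A small sketch of the information-state view for the Bayes-adaptive Bernoulli bandit: the state is the per-arm Beta parameters, and pulling an arm transitions to one of two successor states with the posterior-predictive probabilities (a two-armed Beta(1,1) setup, chosen purely for illustration):

```python
from dataclasses import dataclass

# Information state of a Bayes-adaptive Bernoulli bandit: the per-arm Beta parameters.
# Pulling arm a transitions to one of two successor states (reward 1 or reward 0).
@dataclass(frozen=True)
class InfoState:
    alpha: tuple   # per-arm counts of reward 1 (plus prior)
    beta: tuple    # per-arm counts of reward 0 (plus prior)

def successors(s, a):
    """Possible next information states after pulling arm a, as (state, probability, reward),
    using the posterior-predictive win probability alpha_a / (alpha_a + beta_a)."""
    al, be = list(s.alpha), list(s.beta)
    p_win = al[a] / (al[a] + be[a])
    win = InfoState(tuple(al[:a] + [al[a] + 1] + al[a + 1:]), s.beta)
    lose = InfoState(s.alpha, tuple(be[:a] + [be[a] + 1] + be[a + 1:]))
    return [(win, p_win, 1.0), (lose, 1.0 - p_win, 0.0)]

s0 = InfoState(alpha=(1, 1), beta=(1, 1))   # Beta(1,1) prior for each of two arms
print(successors(s0, 0))
```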

Contextual Bandits

  • A contextual bandit is a tuple $\langle \mathcal{A}, \mathcal{S}, \mathcal{R} \rangle$
    $\mathcal{S}=\mathbb{P}[s]$ is an unknown distribution over states (or "contexts") and
    $\mathcal{R}^a_s(r)=\mathbb{P}[r|s,a]$ is an unknown probability distribution over rewards ... (the reward distribution depends on the state)
  • At each step $t$,
    • The environment generates a state $s_t\sim\mathcal{S}$
    • The agent selects an action $a_t\in\mathcal{A}$
    • The environment generates a reward $r_t\sim\mathcal{R}_{s_t}^{a_t}$
  • The goal is to maximize the cumulative reward $\sum_{\tau=1}^{t}r_\tau$
  • Linear UCB can be applied here (details skipped in the lecture; a rough sketch follows below).
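
The lecture only mentions linear UCB in passing; here is a rough sketch of a LinUCB-style rule, assuming the mean reward is approximately linear in a feature vector x(s, a) (the feature dimension, the exploration constant c, and the fake data are all illustrative assumptions):

```python
import numpy as np

# LinUCB-style rule for a contextual bandit, assuming the mean reward is roughly
# linear in a feature vector x(s, a). Dimension d, constant c, and the fake data
# below are illustrative assumptions, not details from the lecture.
class LinUCB:
    def __init__(self, d, c=1.0):
        self.A = np.eye(d)          # ridge-regularized Gram matrix  I + sum x x^T
        self.b = np.zeros(d)        # sum of r * x
        self.c = c

    def choose(self, xs):
        """xs: (n_actions, d) array with one feature vector x(s_t, a) per action."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b      # point estimate of the linear reward model
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', xs, A_inv, xs))   # per-action uncertainty
        return int(np.argmax(xs @ theta + self.c * bonus))

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

rng = np.random.default_rng(0)
agent = LinUCB(d=4)
xs = rng.random((3, 4))             # fake per-action features for one context
a = agent.choose(xs)
agent.update(xs[a], r=1.0)          # fake observed reward
```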

MDPs

Exploration/Exploitation Principles to MDPs

The same principles for exploration/exploitation apply to MDPs

  • Naive Exploration ($\epsilon$-greedy, softmax, Gaussian noise, ...)
  • Optimistic Initialization
  • Optimism in the Face of Uncertainty (UCB, Thompson sampling, ...)
  • Probability Matching
  • Information State Search (Gittins indices, Bayes-adaptive MDPs, ...)

Optimistic Initialization

  • Model-Free RL
    • Initialize the action-value function $Q(s,a)$ to $\cfrac{r_{\max}}{1-\gamma}$ (see the sketch after this list)
    • Run favorite model-free RL algorithm (MC control, Sarsa, Q-learning, ...)
    • This encourages systematic exploration of states and actions
  • Model-Based RL
    • Construct an optimistic model of the MDP
    • Initialize transitions to go to heaven (i.e., transition to a terminal state with $r_{\max}$ reward)
    • Solve optimistic MDP by favorite planning algorithm (policy iteration, value iteration, tree search, ...)
    • This encourages systematic exploration of states and actions
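
A sketch of optimistic initialization for model-free RL: tabular Q-learning with $Q$ initialized to $r_{\max}/(1-\gamma)$, acting purely greedily so that optimism alone drives exploration (the `reset()`/`step()` interface of `env`, as well as $r_{\max}$, $\gamma$, $\alpha$, and the episode count, are assumptions for illustration):

```python
import numpy as np

# Tabular Q-learning with optimistic initialization: Q(s, a) starts at r_max / (1 - gamma),
# and the agent acts greedily; unvisited pairs keep their optimistic value and get tried.
# `env` is assumed to expose reset() -> s and step(a) -> (s_next, r, done); r_max, gamma,
# alpha, and the episode count are illustrative choices.
def optimistic_q_learning(env, n_states, n_actions,
                          r_max=1.0, gamma=0.9, alpha=0.1, episodes=500):
    Q = np.full((n_states, n_actions), r_max / (1.0 - gamma))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[s]))                 # greedy; optimism drives exploration
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```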

Optimism in the Face of Uncertainty

  • Upper Confidence Bounds for Model-Free RL
    • Maximize the UCB on the action-value function $Q^\pi(s,a)$,
      $a_t=\argmax\limits_{a\in\mathcal{A}}\left(Q(s_t,a)+U(s_t,a)\right)$
      • This estimates the uncertainty in policy evaluation (the uncertainty of $Q$)
        but ignores the uncertainty from policy improvement
    • Maximize the UCB on the optimal action-value function $Q^*(s,a)$,
      $a_t=\argmax\limits_{a\in\mathcal{A}}\left(Q(s_t,a)+U_1(s_t,a)+U_2(s_t,a)\right)$
      • This estimates the uncertainty in policy evaluation
        plus the uncertainty from policy improvement (which is hard to measure!)
  • Bayesian Model-Based RL
    • Maintain a posterior distribution over MDP models
    • Estimate both transitions and rewards, $p[\mathcal{P},\mathcal{R}|h_t]$,
      where $h_t=s_1,a_1,r_2,\dots,s_t$ is the history.
      Note: before, we only cared about the distribution of $\mathcal{R}$; now $\mathcal{P}$ as well! ($\because$ MDP)
    • Use the posterior to guide exploration
      • Bayesian UCB
      • Probability matching (Thompson sampling)
  • Thompson Sampling for Model-Based RL
    • An implementation of probability matching
      $\pi(s,a|h_t)=\mathbb{P}\left[ Q^{*}(s,a)>Q^{*}(s,a'),\forall{a'} \neq a \mid h_t \right]=\mathbb{E}_{\mathcal{P},\mathcal{R}|h_t}\left[ \mathbf{1}\left(a=\argmax\limits_{a'\in\mathcal{A}}Q^*(s,a')\right) \right]$
    • Use Bayes' theorem to compute the posterior distribution $p[\mathcal{P},\mathcal{R}|h_t]$
    • Sample an MDP $\langle\mathcal{P}, \mathcal{R}\rangle$ from the posterior
    • Solve the sampled MDP using your favorite planning algorithm to get $Q^*(s,a)$
    • Select the optimal action for this MDP, $a_t=\argmax\limits_{a\in\mathcal{A}}Q^*(s_t,a)$ (see the sketch after this list)
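
A sketch of Thompson sampling for model-based RL on a small tabular MDP with rewards in {0, 1}: keep Dirichlet/Beta posteriors over the model, sample one MDP, solve it by value iteration, and act greedily (the priors, the sizes, and resampling once per episode are illustrative assumptions):

```python
import numpy as np

# Thompson sampling for model-based RL (posterior-sampling flavor) on a small tabular MDP
# with {0, 1} rewards. Dirichlet/Beta priors, sizes, and resampling once per episode are
# illustrative assumptions, not details from the lecture.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 2, 0.9
trans_counts = np.ones((n_s, n_a, n_s))     # Dirichlet(1,...,1) posterior over P(.|s,a)
rew_alpha = np.ones((n_s, n_a))             # Beta(1,1) posterior over the mean reward
rew_beta = np.ones((n_s, n_a))

def sample_mdp():
    P = np.stack([[rng.dirichlet(trans_counts[s, a]) for a in range(n_a)]
                  for s in range(n_s)])     # one sampled transition model, shape (S, A, S)
    R = rng.beta(rew_alpha, rew_beta)       # one sampled mean-reward model, shape (S, A)
    return P, R

def solve(P, R, iters=200):
    Q = np.zeros((n_s, n_a))
    for _ in range(iters):                  # value iteration on the sampled MDP
        Q = R + gamma * P @ np.max(Q, axis=1)
    return Q

def update_posterior(s, a, r, s_next):
    trans_counts[s, a, s_next] += 1
    rew_alpha[s, a] += r
    rew_beta[s, a] += 1 - r

# Per episode: P, R = sample_mdp(); Q = solve(P, R); act with argmax_a Q[s, a],
# calling update_posterior(s, a, r, s_next) after every transition.
P, R = sample_mdp()
Q = solve(P, R)
print(int(np.argmax(Q[0])))                 # greedy action in state 0 for the sampled MDP
```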

Information State Search

  • Information State Search in MDPs
    • MDPs can be augmented to include an information state, giving $\langle s, \tilde{s} \rangle$,
      where $s$ is the original state within the MDP and $\tilde{s}$ is a statistic of the history
    • Each action $a$ causes a transition
      • to a new state $s'$ with probability $\mathcal{P}^a_{s, s'}$
      • to a new information state $\tilde{s}'$
    • This defines an MDP $\tilde{\mathcal{M}}$ in augmented information state space,
      $\tilde{\mathcal{M}}=\langle\tilde{\mathcal{S}},\mathcal{A},\tilde{\mathcal{P}},\mathcal{R},\gamma \rangle$
  • Bayes Adaptive MDPs
    • The posterior distribution over the MDP model is an information state
      $\tilde{s}_t=\mathbb{P}[\mathcal{P},\mathcal{R}|h_t]$
    • The augmented MDP over $\langle s, \tilde{s} \rangle$ is called a Bayes-adaptive MDP
    • Solve this MDP to find the optimal exploration/exploitation trade-off (w.r.t. the prior)
    • However, the Bayes-adaptive MDP is typically enormous
      $\Rightarrow$ simulation-based search has proven effective (Guez et al.)

If you spot any typos or mistakes, please let me know in the comments!

+ I'm skipping Lecture 10, since it doesn't seem important enough to warrant a separate summary.
Going forward, I plan to cover major RL papers.
