Silver RL (1) Introduction to Reinforcement Learning

Sanghyeok Choi · December 21, 2021

These are my notes on Lecture 1: Introduction to Reinforcement Learning (Youtube) from Professor David Silver's Introduction to Reinforcement Learning course (Website).

About RL

RL is all about Decision Making

What makes RL different from other ML paradigms?

  • No supervisor, only reward signal
  • Feedback is delayed, not instantaneous
  • Sequential, non-i.i.d. data (in contrast to the i.i.d. data assumed in typical (un)supervised settings)
  • Dynamic system: agent's actions affect the subsequent data
  • Learning by "Trial & Error"

The RL Problem

Reward $R_t$

  • $R_t$ is a scalar (so that rewards can be compared by magnitude)
  • RL is built on the reward hypothesis below.

    Reward Hypothesis
    All goals can be described by the maximization of expected cumulative reward

  • Advantage: many different kinds of problems can be expressed in the single framework of reward
  • The agent's job is to select actions to maximize total future reward
    • Why future reward? -> Reward may be delayed! (e.g. the marshmallow experiment)

Environment

[Figure: the agent-environment interaction loop, from the lecture slides]

  • The Earth in the figure represents the environment
  • At each step t:
    • The agent executes action $A_t$, and the environment receives it (and is affected by it)
    • The environment emits observation $O_{t+1}$ and reward $R_{t+1}$, and the agent receives them (and selects its next action based on them)
    • $t \leftarrow t + 1$ (the loop is sketched below)
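
To make the loop concrete, here is a minimal sketch in Python. `ToyEnv` and `ToyAgent` are made-up stand-ins (they are not from the lecture); the point is only the order in which $A_t$, $O_{t+1}$, and $R_{t+1}$ flow between agent and environment.

```python
import random

class ToyEnv:
    """Made-up environment: reward is 1 when the action matches the current observation."""
    def reset(self):
        self.hidden = random.randint(0, 1)
        return self.hidden                                 # O_1

    def step(self, action):
        reward = 1.0 if action == self.hidden else 0.0     # R_{t+1}
        self.hidden = random.randint(0, 1)
        return self.hidden, reward                         # O_{t+1}, R_{t+1}

class ToyAgent:
    """Made-up agent: simply repeats the last observation as its action."""
    def act(self, observation):
        return observation                                 # A_t (depends on what it has seen)

env, agent = ToyEnv(), ToyAgent()
obs, total_reward = env.reset(), 0.0
for t in range(10):
    action = agent.act(obs)           # the agent executes A_t
    obs, reward = env.step(action)    # the environment emits O_{t+1} and R_{t+1}
    total_reward += reward            # ... and t <- t + 1 as the loop advances
print("total reward:", total_reward)
```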

State $S_t$

  • The history $H_t$ is the sequence of observations, actions, and rewards
    • $H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$
    • What happens next depends on the history
      • That is, both the agent and the environment are affected by the history.
      • $A_t$ depends on $H_t$
      • $O_{t+1}$ and $R_{t+1}$ both depend on $A_t$ and $H_t$
  • State $S_t$ is the information used to determine what happens next
    • The history is too big! The state is a summary of the history.
    • $S_t = f(H_t)$ where $f$ is any function
      • The state is a function of the history.
      • How the state is defined determines which information the agent actually uses.
  • Environment state $S_t^e$
    • The environment's internal (private) representation
    • $S_t^e$ determines $O_{t+1}$ and $R_{t+1}$
    • It is the environment's internal mechanism, so it is usually invisible and uncontrollable
    • Even when it is visible, the agent does not use $S_t^e$ itself (our agent only utilizes $O_t$ and $R_t$)
  • Agent state $S_t^a$
    • The agent's internal representation
    • The history, summarized from the agent's point of view
    • The information used to select the next action (used by RL algorithms)
    • Again, $S_t^a = f(H_t)$
    • It is up to us how to define it!
  • Information state (a.k.a. Markov state)
    • Information state is a state that satisfies the Markov property

      $S_t$ is an information (Markov) state if and only if:
      $\mathrm{P}(S_{t+1} \mid S_t) = \mathrm{P}(S_{t+1} \mid S_1, ..., S_t)$ (the Markov property)

    • If the present ($S_t$) is known, the past ($S_1, ..., S_{t-1}$, i.e. $H_{t-1}$) no longer matters.
    • Example: when flying a helicopter, defining the state as
      1) 'position' only is non-Markov, while
      2) 'position + velocity + acceleration' is Markov!
    • With an information state $S_t$, there is no need to keep the whole past around,
      i.e. the history may be thrown away because the state is a sufficient statistic of the future
    • c.f.) The environment state $S_t^e$ is Markov by definition
      • Since the environment state is defined as "all the data needed to pick the next $O$ and $R$", knowing $S_t^e$ alone is enough to determine the next state $S_{t+1}^e$.
      • e.g. decisions in chess only require the current board position; the past record of moves can be forgotten entirely.
    • c.f.) The history $H_t$ is Markov
      • This is trivial
    • The agent state, on the other hand, must be defined so that it has the Markov property.
  • Rat Example
    • AABC -> Electrocute
      CABB -> Cheese
      BABC -> ??
    • What if agent state = last 3 items in sequence? => Electrocute
      What if agent state = counts for A, B, and C? => Cheese
      What if agent state = complete sequence? => We don't know
    • The predicted outcome changes depending on how the agent state is defined.
      The agent state must be defined so that it captures the information that matters!
  • Fully Observable Environments vs Partially Observable Environments
    • Fully observable environments: $O_t = S_t^a = S_t^e$
      • The agent directly observes the environment state.
      • i.e. the environment's mechanism is known to the agent
      • Formally, this is a Markov decision process (MDP)!
    • Partially observable environments: the agent must construct its own state representation $S_t^a$
      • The agent observes the environment only indirectly.
      • It receives observations from the environment, but does not know the exact mechanism.
      • Formally, this is a partially observable Markov decision process (POMDP)!
      • e.g. a state representation built with a recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$, as sketched below
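
For illustration, a minimal NumPy sketch of this recurrent update (not from the lecture; the dimensions, weights, and observations are made up):

```python
import numpy as np

def sigma(x):
    """Element-wise sigmoid, the nonlinearity in the update rule."""
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))   # recurrent weights (arbitrary values)
W_o = rng.normal(size=(obs_dim, state_dim))     # observation weights (arbitrary values)

s = np.zeros(state_dim)                         # initial agent state S_0^a
for o in rng.normal(size=(5, obs_dim)):         # a stream of five dummy observations O_t
    s = sigma(s @ W_s + o @ W_o)                # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
print(s)                                        # the agent state after 5 observations
```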

Inside An RL Agent

  • Three major components of an RL agent (not all three are always needed)
    • Policy: agent's behaviour function
    • Value function: how good is each state and/or action
    • Model: agent's representation of the environment

Policy

  • A policy is a map from state to action.
  • Deterministic policy: $a = \pi(s)$
    • Here $s$ is the state at time $t$ and $a$ is the action at time $t$.
    • $\pi(\cdot)$ is trained to take a state and return the action that maximizes expected (future) reward!
  • Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
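
A minimal sketch of both kinds of policy on a toy problem (the states, actions, and probabilities below are made up for illustration):

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a = pi(s): exactly one action per state."""
    return "right" if state >= 0 else "left"

# pi(a|s) = P[A_t = a | S_t = s], given here as an explicit (made-up) table.
STOCHASTIC_POLICY = {
    "sunny": {"left": 0.2, "right": 0.8},
    "rainy": {"left": 0.7, "right": 0.3},
}

def sample_action(state):
    """Draw an action from pi(.|s)."""
    probs = STOCHASTIC_POLICY[state]
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

print(deterministic_policy(1))   # always "right"
print(sample_action("rainy"))    # "left" with prob. 0.7, "right" with prob. 0.3
```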

Value Function

  • The value function is a prediction of future reward (rewards already received are not counted)
  • $v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... \mid S_t = s]$
    • $\gamma$: discount factor ($\gamma < 1$ means rewards now matter more than rewards later)
    • A function that takes the current state and returns the expected future reward (until the end)
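
A small sketch of the discounted sum inside the expectation, plus a Monte-Carlo-style average as a rough estimate of $v_{\pi}(s)$ (the reward sequences are dummy data, and Monte-Carlo evaluation itself only comes up in later lectures):

```python
def discounted_return(rewards, gamma=0.9):
    """R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one episode."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# v_pi(s) can be estimated by averaging returns over episodes that start in s.
# The episodes below are made-up reward sequences.
episodes_from_s = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0]]
v_estimate = sum(discounted_return(ep) for ep in episodes_from_s) / len(episodes_from_s)
print(v_estimate)   # average discounted return observed from state s
```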

Model

  • A model predicts what the environment will do next (the environment as the agent sees it)
  • The transition model $\mathcal{P}$ predicts the next state (i.e. the dynamics)
    • $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
    • Given state $s$ and action $a$ at time $t$, it returns the probability that the next state is $s'$!
    • Note: if the agent already knows $S^e_{t+1}$ (fully observable?), no model is needed: "model-free"
  • The reward model $\mathcal{R}$ predicts the next (immediate) reward
    • $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
    • Given state $s$ and action $a$ at time $t$, it returns the expected reward at time $t+1$!
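
A minimal tabular sketch of the two components of a model (the states, actions, probabilities, and rewards are all made up):

```python
# P^a_{ss'}: distribution over the next state s' for each (s, a) -- made-up numbers.
TRANSITION_MODEL = {
    ("s0", "a0"): {"s0": 0.1, "s1": 0.9},
    ("s0", "a1"): {"s0": 0.8, "s1": 0.2},
}

# R^a_s: expected immediate reward for each (s, a) -- made-up numbers.
REWARD_MODEL = {
    ("s0", "a0"): 1.0,
    ("s0", "a1"): 0.0,
}

def next_state_distribution(s, a):
    return TRANSITION_MODEL[(s, a)]       # P[S_{t+1} = s' | S_t = s, A_t = a]

def expected_reward(s, a):
    return REWARD_MODEL[(s, a)]           # E[R_{t+1} | S_t = s, A_t = a]

print(next_state_distribution("s0", "a0"))   # {'s0': 0.1, 's1': 0.9}
print(expected_reward("s0", "a0"))           # 1.0
```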

Categorizing RL agents

[Figure: taxonomy of RL agents, from the lecture slides]

  • By policy and value function:
    • Value Based (no explicit policy; the implicit policy is to pick the action with the highest value)
    • Policy Based (no value function)
    • Actor Critic (both a policy and a value function)
  • By model:
    • Model Free (no model)
    • Model Based

Problems within RL

Planning vs Reinforcement Learning

  • Two fundamental problems in sequential decision making
  • Planning
    • A model of the environment is known.
    • With the model of the environment, rewards & observations can be computed directly (the rules of the game are known).
    • For a given state s and action a, the resulting reward and next state are known.
    • Plan ahead to find the optimal policy (e.g. tree search)
  • Reinforcement Learning
    • The environment is initially unknown
    • The agent learns what it needs by interacting with the environment (it learns the rules by playing the game).
    • Through this interaction, the agent continually improves its model.

Exploration vs Exploitation

  • Reinforcement learning (not planning) is like trial-and-error learning
  • Exploration finds more information about the environment (trying a new item on the menu)
  • Exploitation exploits known information to maximize reward (ordering your favourite dish)
  • The trade-off between the two has to be balanced carefully.
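
One simple and common way to handle this trade-off is epsilon-greedy action selection (it appears later in the course; the value estimates below are made up):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.choice(list(action_values))        # exploration
    return max(action_values, key=action_values.get)     # exploitation

# Made-up value estimates for the restaurant analogy above.
menu_values = {"favourite dish": 8.0, "new dish": 5.0}
print(epsilon_greedy(menu_values, epsilon=0.1))   # mostly "favourite dish", sometimes "new dish"
```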

Prediction vs Control

  • Prediction: evaluate the future given a policy
    • Find the value function
  • Control: optimize the future
    • Find the best policy
  • What RL does is to solve a prediction problem in order to solve a control problem
    • The agent evaluates its policies (prediction) and then finds the best one among them (=> control).
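
A tiny sketch of how prediction feeds into control, on a made-up deterministic 2-state MDP: first evaluate a fixed policy (prediction), then act greedily with respect to the resulting values (control). The MDP, rewards, and numbers are invented for illustration.

```python
# Made-up deterministic 2-state MDP.
STATES, ACTIONS, GAMMA = ["s0", "s1"], ["stay", "move"], 0.9
NEXT = {("s0", "stay"): "s0", ("s0", "move"): "s1",
        ("s1", "stay"): "s1", ("s1", "move"): "s0"}
REWARD = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
          ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

def evaluate(policy, sweeps=100):
    """Prediction: iteratively compute v_pi for a fixed deterministic policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: REWARD[(s, policy[s])] + GAMMA * v[NEXT[(s, policy[s])]]
             for s in STATES}
    return v

policy = {"s0": "stay", "s1": "stay"}              # some given policy
v = evaluate(policy)                               # prediction: value of that policy
greedy = {s: max(ACTIONS,                          # control: act greedily w.r.t. v
                 key=lambda a: REWARD[(s, a)] + GAMMA * v[NEXT[(s, a)]])
          for s in STATES}
print(v)        # roughly {'s0': 0.0, 's1': 20.0}
print(greedy)   # {'s0': 'move', 's1': 'stay'}
```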

If you spot any typos or mistakes, please let me know in the comments!
