Silver RL (1) Introduction to Reinforcement Learning

Sanghyeok Choi · December 21, 2021

These are my notes on Lecture 1: Introduction to Reinforcement Learning (Youtube) from Professor David Silver's Introduction to Reinforcement Learning course (Website).

About RL

RL is all about Decision Making

What makes RL different from other ML paradigms?

  • No supervisor, only reward signal
  • Feedback is delayed, not instantaneous
  • Sequential, non-i.i.d. data (in contrast to the i.i.d. data assumed in typical (un)supervised settings)
  • Dynamic system: agent's actions affect the subsequent data
  • Learning by "Trial & Error"

The RL Problem

Reward $R_t$

  • $R_t$ is a scalar (so that rewards can be compared by magnitude)
  • RL is built on the reward hypothesis below.

    Reward Hypothesis
    All goals can be described by the maximization of expected cumulative reward

  • Advantage: many different kinds of problems can be expressed in the single framework of reward
  • The agent's job is to select actions to maximize total future reward
    • Why future reward? -> Reward may be delayed! (e.g. the marshmallow experiment)

Environment

[Figure: the agent-environment interaction loop, from the lecture slides]

  • The Earth in the figure represents the environment
  • At each step t:
    • The agent executes action $A_t$, and the environment receives it (and is affected by it)
    • The environment emits observation $O_{t+1}$ and reward $R_{t+1}$, and the agent receives them (and selects its next action based on them)
    • $t \leftarrow t + 1$ (the loop is sketched below)
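
To make the loop concrete, here is a minimal sketch in Python. `ToyEnv` and `ToyAgent` are made-up stand-ins (they are not from the lecture); the point is only the order in which $A_t$, $O_{t+1}$, and $R_{t+1}$ flow between agent and environment.

```python
import random

class ToyEnv:
    """Made-up environment: reward is 1 when the action matches the current observation."""
    def reset(self):
        self.hidden = random.randint(0, 1)
        return self.hidden                                 # O_1

    def step(self, action):
        reward = 1.0 if action == self.hidden else 0.0     # R_{t+1}
        self.hidden = random.randint(0, 1)
        return self.hidden, reward                         # O_{t+1}, R_{t+1}

class ToyAgent:
    """Made-up agent: simply repeats the last observation as its action."""
    def act(self, observation):
        return observation                                 # A_t (depends on what it has seen)

env, agent = ToyEnv(), ToyAgent()
obs, total_reward = env.reset(), 0.0
for t in range(10):
    action = agent.act(obs)           # the agent executes A_t
    obs, reward = env.step(action)    # the environment emits O_{t+1} and R_{t+1}
    total_reward += reward            # ... and t <- t + 1 as the loop advances
print("total reward:", total_reward)
```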

State $S_t$

  • The history $H_t$ is the sequence of observations, actions, and rewards
    • $H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$
    • What happens next depends on the history
      • That is, both the agent and the environment are affected by the history.
      • $A_t$ depends on $H_t$
      • $O_{t+1}$ and $R_{t+1}$ both depend on $A_t$ and $H_t$
  • State $S_t$ is the information used to determine what happens next
    • The history is too big! The state is a summary of the history.
    • $S_t = f(H_t)$ where $f$ is any function
      • The state is a function of the history.
      • How the state is defined determines which information the agent actually uses.
  • Environment state $S_t^e$
    • The environment's internal (private) representation
    • $S_t^e$ determines $O_{t+1}$ and $R_{t+1}$
    • It is the environment's internal mechanism, so it is usually invisible and uncontrollable
    • Even when it is visible, the agent does not use $S_t^e$ itself (our agent only utilizes $O_t$ and $R_t$)
  • Agent state $S_t^a$
    • The agent's internal representation
    • The history, summarized from the agent's point of view
    • The information used to select the next action (used by RL algorithms)
    • Again, $S_t^a = f(H_t)$
    • It is up to us how to define it!
  • Information state (a.k.a. Markov state)
    • Information state is a state that satisfies the Markov property

      $S_t$ is an information (Markov) state if and only if:
      $\mathrm{P}(S_{t+1} \mid S_t) = \mathrm{P}(S_{t+1} \mid S_1, ..., S_t)$ (the Markov property)

    • If the present ($S_t$) is known, the past ($S_1, ..., S_{t-1}$, i.e. $H_{t-1}$) no longer matters.
    • Example: when flying a helicopter, defining the state as
      1) 'position' only is non-Markov, while
      2) 'position + velocity + acceleration' is Markov!
    • With an information state $S_t$, there is no need to keep the whole past around,
      i.e. the history may be thrown away because the state is a sufficient statistic of the future
    • c.f.) The environment state $S_t^e$ is Markov by definition
      • Since the environment state is defined as "all the data needed to pick the next $O$ and $R$", knowing $S_t^e$ alone is enough to determine the next state $S_{t+1}^e$.
      • e.g. decisions in chess only require the current board position; the past record of moves can be forgotten entirely.
    • c.f.) The history $H_t$ is Markov
      • This is trivial
    • The agent state, on the other hand, must be defined so that it has the Markov property.
  • Rat Example
    • AABC -> Electrocute
      CABB -> Cheese
      BABC -> ??
    • What if agent state = last 3 items in sequence? => Electrocute
      What if agent state = counts for A, B, and C? => Cheese
      What if agent state = complete sequence? => We don't know
    • The predicted outcome changes depending on how the agent state is defined.
      The agent state must be defined so that it captures the information that matters!
  • Fully Observable Environments vs Partially Observable Environments
    • Fully observable environments: $O_t = S_t^a = S_t^e$
      • The agent directly observes the environment state.
      • i.e. the environment's mechanism is known to the agent
      • Formally, this is a Markov decision process (MDP)!
    • Partially observable environments: the agent must construct its own state representation $S_t^a$
      • The agent observes the environment only indirectly.
      • It receives observations from the environment, but does not know the exact mechanism.
      • Formally, this is a partially observable Markov decision process (POMDP)!
      • e.g. a state representation built with a recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$, as sketched below
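
For illustration, a minimal NumPy sketch of this recurrent update (not from the lecture; the dimensions, weights, and observations are made up):

```python
import numpy as np

def sigma(x):
    """Element-wise sigmoid, the nonlinearity in the update rule."""
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))   # recurrent weights (arbitrary values)
W_o = rng.normal(size=(obs_dim, state_dim))     # observation weights (arbitrary values)

s = np.zeros(state_dim)                         # initial agent state S_0^a
for o in rng.normal(size=(5, obs_dim)):         # a stream of five dummy observations O_t
    s = sigma(s @ W_s + o @ W_o)                # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
print(s)                                        # the agent state after 5 observations
```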

Inside An RL Agent

  • Three major components of an RL agent (not all three are always needed)
    • Policy: agent's behaviour function
    • Value function: how good is each state and/or action
    • Model: agent's representation of the environment

Policy

  • A policy is a map from state to action.
  • Deterministic policy: $a = \pi(s)$
    • Here $s$ is the state at time $t$ and $a$ is the action at time $t$.
    • $\pi(\cdot)$ is trained to take a state and return the action that maximizes expected (future) reward!
  • Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
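
A minimal sketch of both kinds of policy on a toy problem (the states, actions, and probabilities below are made up for illustration):

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a = pi(s): exactly one action per state."""
    return "right" if state >= 0 else "left"

# pi(a|s) = P[A_t = a | S_t = s], given here as an explicit (made-up) table.
STOCHASTIC_POLICY = {
    "sunny": {"left": 0.2, "right": 0.8},
    "rainy": {"left": 0.7, "right": 0.3},
}

def sample_action(state):
    """Draw an action from pi(.|s)."""
    probs = STOCHASTIC_POLICY[state]
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

print(deterministic_policy(1))   # always "right"
print(sample_action("rainy"))    # "left" with prob. 0.7, "right" with prob. 0.3
```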

Value Function

  • The value function is a prediction of future reward (rewards already received are not counted)
  • $v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... \mid S_t = s]$
    • $\gamma$: discount factor ($\gamma < 1$ means rewards now matter more than rewards later)
    • A function that takes the current state and returns the expected future reward (until the end)
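
A small sketch of the discounted sum inside the expectation, plus a Monte-Carlo-style average as a rough estimate of $v_{\pi}(s)$ (the reward sequences are dummy data, and Monte-Carlo evaluation itself only comes up in later lectures):

```python
def discounted_return(rewards, gamma=0.9):
    """R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one episode."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# v_pi(s) can be estimated by averaging returns over episodes that start in s.
# The episodes below are made-up reward sequences.
episodes_from_s = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0]]
v_estimate = sum(discounted_return(ep) for ep in episodes_from_s) / len(episodes_from_s)
print(v_estimate)   # average discounted return observed from state s
```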

Model

  • A model predicts what the environment will do next (the environment as the agent sees it)
  • The transition model $\mathcal{P}$ predicts the next state (i.e. the dynamics)
    • $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
    • Given state $s$ and action $a$ at time $t$, it returns the probability that the next state is $s'$!
    • Note: if the agent already knows $S^e_{t+1}$ (fully observable?), no model is needed: "model-free"
  • The reward model $\mathcal{R}$ predicts the next (immediate) reward
    • $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
    • Given state $s$ and action $a$ at time $t$, it returns the expected reward at time $t+1$!
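
A minimal tabular sketch of the two components of a model (the states, actions, probabilities, and rewards are all made up):

```python
# P^a_{ss'}: distribution over the next state s' for each (s, a) -- made-up numbers.
TRANSITION_MODEL = {
    ("s0", "a0"): {"s0": 0.1, "s1": 0.9},
    ("s0", "a1"): {"s0": 0.8, "s1": 0.2},
}

# R^a_s: expected immediate reward for each (s, a) -- made-up numbers.
REWARD_MODEL = {
    ("s0", "a0"): 1.0,
    ("s0", "a1"): 0.0,
}

def next_state_distribution(s, a):
    return TRANSITION_MODEL[(s, a)]       # P[S_{t+1} = s' | S_t = s, A_t = a]

def expected_reward(s, a):
    return REWARD_MODEL[(s, a)]           # E[R_{t+1} | S_t = s, A_t = a]

print(next_state_distribution("s0", "a0"))   # {'s0': 0.1, 's1': 0.9}
print(expected_reward("s0", "a0"))           # 1.0
```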

Categorizing RL agents

[Figure: taxonomy of RL agents, from the lecture slides]

  • By policy and value function:
    • Value Based (no explicit policy; the implicit policy is to pick the action with the highest value)
    • Policy Based (no value function)
    • Actor Critic (both a policy and a value function)
  • By model:
    • Model Free (no model)
    • Model Based

Problems within RL

Planning vs Reinforcement Learning

  • Two fundamental problems in sequential decision making
  • Planning
    • A model of the environment is known.
    • With the model of the environment, rewards & observations can be computed directly (the rules of the game are known).
    • For a given state s and action a, the resulting reward and next state are known.
    • Plan ahead to find the optimal policy (e.g. tree search)
  • Reinforcement Learning
    • The environment is initially unknown
    • The agent learns what it needs by interacting with the environment (it learns the rules by playing the game).
    • Through this interaction, the agent continually improves its model.

Exploration vs Exploitation

  • Reinforcement learning (not planning) is like trial-and-error learning
  • Exploration finds more information about the environment (trying a new item on the menu)
  • Exploitation exploits known information to maximize reward (ordering your favourite dish)
  • The trade-off between the two has to be balanced carefully.
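
One simple and common way to handle this trade-off is epsilon-greedy action selection (it appears later in the course; the value estimates below are made up):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.choice(list(action_values))        # exploration
    return max(action_values, key=action_values.get)     # exploitation

# Made-up value estimates for the restaurant analogy above.
menu_values = {"favourite dish": 8.0, "new dish": 5.0}
print(epsilon_greedy(menu_values, epsilon=0.1))   # mostly "favourite dish", sometimes "new dish"
```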

Prediction vs Control

  • Prediction: evaluate the future given a policy
    • Find the value function
  • Control: optimize the future
    • Find the best policy
  • What RL does is to solve a prediction problem in order to solve a control problem
    • The agent evaluates its policies (prediction) and then finds the best one among them (=> control).
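
A tiny sketch of how prediction feeds into control, on a made-up deterministic 2-state MDP: first evaluate a fixed policy (prediction), then act greedily with respect to the resulting values (control). The MDP, rewards, and numbers are invented for illustration.

```python
# Made-up deterministic 2-state MDP.
STATES, ACTIONS, GAMMA = ["s0", "s1"], ["stay", "move"], 0.9
NEXT = {("s0", "stay"): "s0", ("s0", "move"): "s1",
        ("s1", "stay"): "s1", ("s1", "move"): "s0"}
REWARD = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
          ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

def evaluate(policy, sweeps=100):
    """Prediction: iteratively compute v_pi for a fixed deterministic policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: REWARD[(s, policy[s])] + GAMMA * v[NEXT[(s, policy[s])]]
             for s in STATES}
    return v

policy = {"s0": "stay", "s1": "stay"}              # some given policy
v = evaluate(policy)                               # prediction: value of that policy
greedy = {s: max(ACTIONS,                          # control: act greedily w.r.t. v
                 key=lambda a: REWARD[(s, a)] + GAMMA * v[NEXT[(s, a)]])
          for s in STATES}
print(v)        # roughly {'s0': 0.0, 's1': 20.0}
print(greedy)   # {'s0': 'move', 's1': 'stay'}
```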

If you spot any typos or mistakes, please let me know in the comments!
