[HF Deep RL Course] Unit 1

Minhan Cho · February 9, 2026



The Reinforcement Learning Framework

RL notation

Our Agent receives state $S_0$ from the Environment (first frame of the game)
Based on that state $S_0$ , the Agent takes action $A_0$ (agent goes right)
The environment goes to a new state $S_1$ (new frame)
The Environment gives reward $R_1$ (player not dead, reward +1)
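
The four steps above form a loop that repeats until the episode ends. A minimal sketch of that loop, assuming a hypothetical toy environment (`ToyEnv`, a 1-D corridor with the goal at position 3) rather than any real game or library:

```python
class ToyEnv:
    """Hypothetical 1-D corridor: start at position 0, the goal is at position 3."""

    def reset(self):
        self.pos = 0
        return self.pos  # initial state S_0

    def step(self, action):
        # action: 0 = move left, 1 = move right (the agent cannot go below 0)
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and collect (S_t, A_t, R_{t+1}, S_{t+1}) tuples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent picks A_t from S_t
        next_state, reward, done = env.step(action)  # env returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


def always_right(state):
    """Trivial policy used only for illustration."""
    return 1


steps = run_episode(ToyEnv(), always_right)
```

Here `steps` holds one tuple per time step, which is the state, action, reward, next state sequence described in the list.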

The reward hypothesis: every goal can be described as maximizing the expected cumulative reward.

Markov Property
"You'll see that the RL process is called a Markov Decision Process (MDP)."
Is the Markov Decision Process in RL the same thing as the Markov assumption used by n-gram models in NLP?

The Markov assumption in MDPs vs the Markov assumption in early NLP

In short: the name and the underlying idea are similar, but the meaning and the purpose differ.

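1. The Markov Decision Process (MDP) in RL

The Markov assumption in an MDP means the following.

Given the current state, the future is independent of the past.

As a formula:

$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, \dots, s_t, a_t)$

That is,

the current state $s_t$
must be a sufficient summary of all past information, and
the distribution of the next state depends only on $s_t$ and $a_t$, not on the entire history.

👉 The key point is that the state design is assumed to be complete.

Examples:

Chess: the current board position fully summarizes the moves played so far
Physics simulation: position + velocity are enough to predict the next state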
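2. The Markov assumption in early NLP (n-gram)

In early language models, the probability of the next word is in principle

$P(w_t \mid w_1, \dots, w_{t-1})$

but this is intractable to estimate, so it was approximated as

$P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$

That is,

the older context was not dropped because it is unnecessary;
it was truncated because the full computation is infeasible.

👉 This is a practical approximation rather than a theoretical assumption.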
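
A minimal sketch of the truncated estimate, assuming a count-based model with relative-frequency probabilities and no smoothing. The helper names `train_ngram` and `prob` and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n=2):
    """Count how often each word follows each context of n-1 preceding words."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])  # only the last n-1 words are kept
        counts[context][tokens[i]] += 1
    return counts


def prob(counts, context, word):
    """Relative-frequency estimate of P(word | context); 0.0 for unseen contexts."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0


corpus = "the cat sat on the mat the cat ran".split()
bigram = train_ngram(corpus, n=2)
p = prob(bigram, ("the",), "cat")  # "the" is followed by cat, mat, cat
```

With `n=2` everything before the previous word is discarded, which is exactly the truncation described above: the earlier words may still matter, they are just not used.
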

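3. Key differences

Aspect	RL (MDP)	NLP (n-gram)
Meaning of the Markov assumption	the state fully summarizes the past	only part of the context is used
Purpose	theoretical definition of the problem	computational tractability
Why the past is ignored	assumed to be unnecessary	too complex to handle
When the assumption breaks	extend to a POMDP	degraded performance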

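4. One-line summary

The Markov assumption in RL is a matter of definition.
The Markov assumption in NLP is a matter of approximation.
The idea is the same, but how and why it is applied differ.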

State vs. observation
A state is a complete description of the world, whereas an observation is only a partial description of the state.

Action Space: the set of all possible actions in an environment
Discrete space: the number of possible actions is finite
Continuous space: the number of possible actions is infinite

Cumulative Reward

cumulative reward: $R(\tau)=r_{t+1}+r_{t+2}+\dots$
which is equivalent to $R(\tau)=\sum_{k=0}^\infty r_{t+k+1}$

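However, future rewards are less certain than immediate ones, so they should be discounted (the same idea as discounting by the rate of return in a DCF valuation).
The discount rate $\gamma$ is a value between 0 and 1: the closer it is to 0, the more the agent weights immediate rewards; the closer it is to 1, the more it weights future rewards.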

discounted cumulative reward: $R(\tau)=\sum_{k=0}^\infty \gamma^k r_{t+k+1}$
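
For a finite list of rewards the sum can be computed directly. A minimal sketch, where the function name `discounted_return` and the sample rewards are illustrative assumptions:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^k * r_{t+k+1} over a finite list [r_{t+1}, r_{t+2}, ...]."""
    total = 0.0
    # Work backward so each step is: return = reward + gamma * (return of the rest)
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards, gamma=1.0)  # 1 + 1 + 1
half = discounted_return(rewards, gamma=0.5)          # 1 + 0.5 + 0.25
myopic = discounted_return(rewards, gamma=0.0)        # only the first reward counts
```

The three calls show the effect of $\gamma$: at 1 every reward counts equally, at 0.5 each later reward is worth half of the one before it, and at 0 only the immediate reward remains.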

Types of tasks

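episodic tasks: there is a starting point and a terminal state, so each episode is finite (e.g. a game)
continuing tasks: there is no terminal state, so the task goes on indefinitely (e.g. stock trading)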

The Exploration/Exploitation trade-off

Exploration: exploring the environment by trying random actions in order to find more information about it
Exploitation: exploiting known information to maximize the reward
Trade-off: balancing exploration, which yields better information about the environment, against exploitation, which maximizes the immediate reward (exploration can lead to higher rewards later, but may forgo the reward available right now)
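
One common way to handle this balance is an epsilon-greedy rule. It is not named in these notes, so treat the following as an illustrative sketch. `q_values` is a hypothetical list of the agent's current reward estimates per action, and `epsilon` is the probability of exploring.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: uniformly random action
    # exploitation: index of the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])


estimates = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(estimates, epsilon=0.0)  # never explores
```

With `epsilon=0.0` the rule always exploits, and with `epsilon=1.0` it always explores. Values in between trade immediate reward for information.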

Two main approaches for solving RL problems

How do we get an RL agent to select the actions that maximize its expected cumulative reward?

Policy $\pi$ : agent's brain

A function that decides which action to take given the current state
Defines the agent's behavior
The goal of training is to find the optimal policy $\pi^*$

Policy-based methods

Learn the policy directly
Learn a function that maps each state to an action,
or that outputs a probability distribution over the actions available in that state
Deterministic policy: $a = \pi(s)$; the same state always yields the same action
Stochastic policy: $\pi(a|s) = P(A_t=a|S_t=s)$; the action is sampled from a distribution that depends on the state
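
The two kinds of policy can be contrasted with hand-written lookup tables. This is only a sketch: the states (`"low_battery"`, `"ok"`), the actions, and the probabilities are invented for illustration, and a learned policy would replace the tables with a trained function.

```python
import random


def deterministic_policy(state):
    """a = pi(s): the same state always maps to the same action."""
    table = {"low_battery": "recharge", "ok": "explore"}  # hypothetical mapping
    return table[state]


def stochastic_policy(state, rng=random):
    """pi(a|s): sample an action from a distribution that depends on the state."""
    probs = {
        "low_battery": {"recharge": 0.9, "explore": 0.1},
        "ok": {"recharge": 0.2, "explore": 0.8},
    }[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


fixed = deterministic_policy("ok")          # always "explore"
sampled = stochastic_policy("low_battery")  # usually "recharge", sometimes "explore"
```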

Value-based methods

Instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.
The value of a state is the expected cumulative (discounted) reward the agent can get if it starts in that state and then acts according to the policy (here, moving toward the state with the highest value).
- $v_\pi(s) = E_{\pi}[\sum_{k=0}^\infty \gamma^k r_{t+k+1}|S_t=s]$
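
A minimal sketch of the formula on a hypothetical deterministic 4-state corridor (states 0 to 3, reward 1 for reaching state 3). Since nothing is random here, the expectation reduces to a single discounted sum. The policy being evaluated ("always move right"), the corridor, and all names are assumptions made for illustration. The terminal state's value is 0, so the greedy step ranks moves by reward + gamma * v(next state) rather than by v(next state) alone.

```python
GAMMA = 0.9
GOAL = 3  # terminal state of the hypothetical corridor 0-1-2-3


def step(state, move):
    """Deterministic dynamics; move is -1 (left) or +1 (right)."""
    next_state = min(max(state + move, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def evaluate_always_right(gamma=GAMMA):
    """v_pi(s) for the policy 'always move right', computed backward from the goal."""
    values = {GOAL: 0.0}  # no further reward after the terminal state
    for s in range(GOAL - 1, -1, -1):
        next_state, reward = step(s, +1)
        values[s] = reward + gamma * values[next_state]
    return values


def greedy_move(state, values, gamma=GAMMA):
    """Pick the move with the best one-step lookahead: reward + gamma * v(next state)."""
    def lookahead(move):
        next_state, reward = step(state, move)
        return reward + gamma * values[next_state]
    return max((-1, +1), key=lookahead)


values = evaluate_always_right()
moves = [greedy_move(s, values) for s in range(GOAL)]
```

States closer to the goal get higher values, and acting greedily with respect to those values moves the agent right from every non-terminal state. That is the sense in which a value function implies a policy.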

Materials
