Reinforcement Learning: An Introduction (Sutton & Barto): Paragraph-by-Paragraph Summary

Heejin Jo · May 4, 2021

Introduction

chapter 01

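We learn naturally by coming into contact with the environment that surrounds us. Our actions affect the environment, and those effects come back to us. This interaction can be seen as the foundation of nearly all theories of learning.
This book explores how that interaction can be used to solve problems in a goal-directed way across many kinds of learning.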

1.1 Reinforcement Learning

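    def reinforcement(situation, effective_action):
        """Toy sketch: walk through a situation and add up the reward."""
        total_reward = 0
        # each entry of the situation is treated as one reward signal
        for reward in situation:
            total_reward += reward + effective_action
        print('the results of interaction were as follows:', total_reward)
        return total_reward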

machine learning
Reinforcement learning
mountaineering
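Like these other "-ing" words, reinforcement learning is three things at once: a problem, a class of solution methods, and the field that studies them. Keep those three clearly apart, especially problem versus solution method, unless you want to fall into total chaos!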

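It is built on optimal control of Markov decision processes. A learning agent has to be able to sense the state of its immediate environment and take actions that affect it. Markov decision processes do include the three aspects, sensation, action, and goal, but only in their simplest possible forms. So, on to reinforcement!!! Any method that is well suited to solving this kind of problem counts as reinforcement learning. (My thought: ah, so in the end it was a concept that supplements Markov theory.)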

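Supervised learning is about responding correctly to each labeled situation according to the category it belongs to, which is different from reinforcement learning, where the agent learns on its own through interaction.

Reinforcement learning is also different from unsupervised learning, which has to find structure hidden in unlabeled data, because reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure.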

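First, the dilemma of reinforcement learning: to explore, the agent has to try things that may turn out badly, but it also wants the best possible reward, so it shouldn't waste moves, right? Yet at the beginning it has no choice but to try things. In other words, the trade-off between exploration and exploitation is the key.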
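To make the exploration versus exploitation trade-off concrete, here is a minimal sketch of an epsilon-greedy rule on a toy slot-machine problem. Everything in it is an assumption for illustration (the names `epsilon_greedy` and `run_bandit`, the Gaussian rewards, the value of `epsilon`); it is one common way to balance the two, not the only one.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: any action at random
    # exploit: the action with the highest current estimate
    return max(range(len(estimates)), key=lambda a: estimates[a])

def run_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Play a toy Gaussian bandit; return learned estimates and pull counts."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        a = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)  # noisy reward (assumed model)
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]  # running average
    return estimates, counts
```

With `epsilon=0` the agent never explores and can get stuck on whatever looked good first; with `epsilon=1` it only explores and never cashes in on what it has learned.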

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. Let's stop making up higher-level concepts and get the lower-level ones ready.

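Reinforcement learning starts from an explicit goal, so it heads in the opposite direction from earlier lines of research.
Never mind the missing details, I go my own way. Vague topic sentences, begone: I'll come back with pinpoint-clear results! Hehe!

That said, it isn't a total lone wolf either, haha (the example given is an agent that is tied to a robot's larger system and interacts with the environment only indirectly).

And our baby reinforcement learning can be used all over the place, haha. It's the best. Cool, right?
(Personally I'm very curious about Chapter 15, the one on neuroscience. My mouth is watering.)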

Finally, reinforcement learning is also part of a larger trend in artificial intelligence back toward simple general principles.
It is not clear how far back the pendulum will swing, but reinforcement learning research is certainly part of the swing back toward simpler and fewer general principles of artificial intelligence. (Such ambition... impressive...)

1.2 Examples

Five examples are given. Personally, the cereal one made the most sense to me, so that is the only one I'm quoting from the original.

• Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a
complex web of conditional behavior and interlocking goal–subgoal relationships: walking to the
cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box.
Other complex, tuned, interactive sequences of behavior are required to obtain a bowl, spoon,
and milk jug. Each step involves a series of eye movements to obtain information and to guide
reaching and locomotion. Rapid judgments are continually made about how to carry the objects
or whether it is better to ferry some of them to the dining table before obtaining others. Each
step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service
of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately
obtaining nourishment. Whether he is aware of it or not, Phil is accessing information about the
state of his body that determines his nutritional needs, level of hunger, and food preferences.

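Even when the environment is uncertain, the agent pursues its goal, and the agent and the environment interact with each other. The right choice is hard to find immediately, so planning ahead matters. At the same time, the results of the actions in these examples cannot all be predicted. So the agent has to monitor its environment frequently and react appropriately on its own. In all of these examples the agent keeps using its experience and shows more skilled performance over time.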

1.3 Elements of Reinforcement Learning

Beyond just going on about the agent and the environment, let's look at the four subelements of reinforcement learning.

policy
defines the learning agent’s way of behaving at a given time.
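It determines how the agent behaves at a given time, and it is the core of reinforcement learning. With a policy alone, the agent can already act.
$f(\text{perceived state}) = \text{action}$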
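As a rough illustration of a policy as a mapping from perceived state to action, here is a tiny lookup-table sketch. The states and actions are hypothetical, loosely borrowed from the breakfast example, and a real policy could just as well be a function or a search procedure.

```python
# Hypothetical states and actions, chosen only for illustration.
policy_table = {
    "hungry": "get cereal",
    "holding cereal": "pour milk",
    "bowl ready": "eat",
}

def policy(perceived_state):
    """Return the action the agent takes in the given perceived state."""
    return policy_table[perceived_state]
```

For instance, `policy("hungry")` looks up the entry for that state and returns its action.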

What is stimulus response theory?
Stimulus Response Theory is a concept in psychology that refers to the belief that behavior manifests as a result of the interplay between stimulus and response. ... In other words, behavior cannot exist without a stimulus of some sort, at least from this perspective.

a reward signal
defines the goal. On each time step, the environment sends to the reinforcement learning agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run.
At every time step a single reward number arrives, and maximizing the total of those numbers is the agent's one and only objective.

Rewards define the characteristics of the problem the agent faces.
The reward signal is the primary basis for altering the policy;
Say the agent takes an action the policy told it to, and the score (reward signal) comes back as -1. Not great, since it wanted +1! So the policy gets changed.
In general, reward signals may be stochastic functions of the state of the environment and the actions taken.

In other words, the reward signal can in general be a stochastic function of the state and the action.

value
the total amount of reward an agent can expect to accumulate over the future, starting from a given state

reward signal <-> value
derivative <-> integral
single <-> sum
immediate <-> long run
hard to predict <-> predictable
one stock candlestick, lol <-> a year of stock price movement

rewards -> values (sufficient condition)
But! action choices are made based on value judgments.
values -> actions (necessary condition)

the most important component
of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.
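To see the difference between an immediate reward and a long-run total, here is a small sketch that turns one observed sequence of rewards into the total reward still to come from each step. It is only an assumption-laden illustration: a true value is an expectation over many possible futures, and this undiscounted sum over a single sequence (the helper name `rewards_to_go` is made up) is just one sample of it.

```python
def rewards_to_go(rewards):
    """For each time step, sum the rewards from that step to the end."""
    totals = [0.0] * len(rewards)
    running = 0.0
    # walk backwards so each step reuses the total of the steps after it
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        totals[t] = running
    return totals

# the first step has a bad immediate reward but a good long-run total
print(rewards_to_go([-1, -1, 10]))  # → [8.0, 9.0, 10.0]
```

A state with a low reward can still have a high value if it is regularly followed by states that pay off, which is why action choices lean on values rather than on the single number in front of you.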

model
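Something that mimics the environment.
It is usually just called a "model"... put simply:
if a kid wants a phone, shapes some clay into a phone, and plays pretending it is a real one, that lump of clay is a model.

More generally, a model is something that allows inferences to be made about how the environment will behave. When someone says "if we model it with this plan, next quarter's revenue is projected to grow by 2 billion won," that is a model in the same sense, and doing something on the basis of such a model is a model-based method.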
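A minimal sketch of what "using a model" can look like: the agent asks the model what would happen for each action and picks the best prediction, without actually acting. The dictionary `model`, the state and action names, and `plan_one_step` are all hypothetical, and real planning usually looks more than one step ahead.

```python
# Hypothetical toy model: (state, action) -> (predicted next state, predicted reward)
model = {
    ("start", "left"): ("pit", -1.0),
    ("start", "right"): ("goal", 1.0),
}

def plan_one_step(state, actions, model):
    """Choose the action whose predicted reward is highest, without acting."""
    return max(actions, key=lambda a: model[(state, a)][1])

print(plan_one_step("start", ["left", "right"], model))  # → right
```

A model-free method would instead have to try both actions in the real environment and learn from what actually happened.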

Modern reinforcement learning spans the spectrum from low-level, trial-and-error learning to high-level, deliberative planning.

1.4 Limitations and Scope

Informally, we can think of the state as a signal
conveying to the agent some sense of “how the environment is” at a particular time.

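Wow, this is explained so well.
STATE is always like... kind of an over-explainer?
"Hey, right now the environment is like this"
1 second later
"Now it's like this"
N seconds later
"Now it's like this"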

Most of the reinforcement learning methods we consider in this book are structured around estimating
value functions, but it is not strictly necessary to do this to solve reinforcement learning problems.
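So in actual reinforcement learning you apparently don't have to focus only on the VALUE FUNCTION. Annealing was given as an example of a method that needs no value function, and since I didn't know it, I looked it up.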
Simulated Annealing
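For reference, here is a generic sketch of simulated annealing as I understand it: a random local search that always accepts improvements and sometimes accepts worse candidates, less and less often as a "temperature" cools. The linear cooling schedule, the parameter names, and the quadratic toy cost in the usage note are assumptions for illustration. In a reinforcement learning setting the thing being searched would be a policy, and the cost would come from how well that policy performs.

```python
import math
import random

def simulated_annealing(cost, start, neighbor, steps=5000, t0=1.0, seed=0):
    """Minimize cost by random local moves, sometimes accepting worse ones."""
    rng = random.Random(seed)
    current, best = start, start
    for k in range(steps):
        # linear cooling schedule (an assumption; other schedules are common)
        temperature = t0 * (1 - k / steps) + 1e-9
        candidate = neighbor(current, rng)
        delta = cost(candidate) - cost(current)
        # always accept improvements; accept worse moves with prob exp(-delta/T)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if cost(current) < cost(best):
            best = current  # remember the best candidate seen so far
    return best
```

A possible usage: `simulated_annealing(lambda x: (x - 3) ** 2, 0.0, lambda x, rng: x + rng.uniform(-0.5, 0.5))` searches for the minimum of a parabola. Note that no value function over states appears anywhere; whole candidates are scored directly.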

1.5 An Extended Example: Tic-Tac-Toe

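Now that we've also looked at approaches that run counter to the usual basic principles of reinforcement learning, let's move on to a more in-depth example.

Suppose we are playing tic-tac-toe against an imperfect opponent. How can we learn their weaknesses and maximize our chance of winning?

It sounds easy, but the usual methods can't solve it. Classical methods such as minimax, along with classical optimization methods for sequential decisions, only give the right moves when you have the opponent's full specification in hand!
In the real world there are tons of problems you haven't met beforehand, and you can't know all of that up front. But these are also problems that become more predictable the more experience you gather. Once you reach a certain level of confidence, from then on you can roughly get the right answer.
(Rather than a statistical-sounding term like "confidence interval," I think "I've reached the level where I can feel pretty sure of myself, haha" says it better.)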

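An evolutionary method would directly search the space of policies for one with a high chance of winning, where a policy is a rule that says what move to make in every state. It then tries another policy, and another. For each policy considered, its probability of winning can be estimated by playing some number of games, and the evaluations made after those games decide which policy to try next. Either way, a really huge number of different methods could be used here.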
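The evaluate-and-select part of that idea can be sketched in a few lines. This is only an assumption-heavy illustration: `toy_game` is a hypothetical stand-in for a real game in which a "policy" is nothing more than its own win probability, and there is no mutation or recombination step as a real evolutionary method would have.

```python
import random

def estimate_win_rate(policy, play_game, games=200, seed=0):
    """Estimate a policy's win probability by playing whole games."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(games) if play_game(policy, rng))
    return wins / games

def select_best(policies, play_game, games=200):
    """Keep the candidate policy with the highest estimated win rate."""
    return max(policies, key=lambda p: estimate_win_rate(p, play_game, games))

def toy_game(policy, rng):
    """Hypothetical stand-in for a real game: the policy is just its win chance."""
    return rng.random() < policy
```

Notice that only the final outcome of each game is used; nothing that happened during the game feeds back into the evaluation.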

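Now think about a value function. Set up a table of numbers, one for each STATE of the game, where each number is the current estimate of the probability of winning from that state. A state where we have already won is always 1, a state where we have already lost is always 0, and for every other state the probability starts at 1/2.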
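Here is a small sketch of how such a table could be initialized for tic-tac-toe, following the book's setup: 1 for states we have won, 0 for states that are lost or filled up, and 0.5 for everything else. The board encoding (a 9-character string read row by row, with "." for an empty cell) and the function names are my own assumptions.

```python
def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for a, b, c in lines:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def initial_value(board, me="X"):
    """Initial win-probability estimate for a 9-character board string."""
    w = winner(board)
    if w == me:
        return 1.0   # already won
    if w is not None or "." not in board:
        return 0.0   # lost, or board full with no winner: cannot win from here
    return 0.5       # unknown yet, so start from an even guess
```

The 0.5 entries are the ones that learning is supposed to refine as games are played.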

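You can think of exploratory moves as the getting-to-know-things phase. With every step we take, we learn something we didn't know before. Usually we just go "hmm, heading this way should get me to the answer," but every now and then it feels like picking what looks like a wrong answer on purpose?

To make the estimated chances of winning more accurate, we change the values of the states we find ourselves in as the game goes on, "backing up" the value of the later state to the earlier one.

*Wait, I don't know exactly what a greedy move is

...?

(I've run into something I can't make sense of.)
Not knowing what a greedy move is is frustrating. Wiki, help me...

Oh, so a greedy algorithm is one that is too impatient to see the forest and only looks at the trees, picking whatever looks best right now. Okay, back to the book.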

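α: step-size parameter, which influences the rate of learning
As also came up with gradient descent, if the step is too small learning barely moves, and if it is too large the value overshoots, so the key is to find an α that is just right. If α is reduced properly over time, V(s) converges!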
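The update the book uses for this example is $V(s) \leftarrow V(s) + \alpha\,[V(s') - V(s)]$, where $s$ is the state before a greedy move and $s'$ the state after it. Below is a minimal sketch of that single backup step. Storing the table as a dict, the helper name `backup`, and the default of 0.5 for states not yet seen are assumptions.

```python
def backup(values, state, next_state, alpha=0.1, default=0.5):
    """Move the earlier state's value a fraction alpha toward the later one."""
    v = values.get(state, default)
    v_next = values.get(next_state, default)
    values[state] = v + alpha * (v_next - v)
    return values[state]

# hypothetical state names: "s" is a state that led directly to a win
values = {"s": 0.5, "win": 1.0}
backup(values, "s", "win", alpha=0.1)   # 0.5 + 0.1 * (1.0 - 0.5), about 0.55
```

Each call closes a fraction α of the gap, so a larger α learns faster but also jumps around more.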

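The evolutionary method only looks at the game as a whole, so if it wins, every move made along the way gets credit. The value function, on the other hand, scores each individual step. If I wanna get information during the course of the game, just use a value function!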

correct behavior requires planning or foresight that takes into account delayed effects of one’s choices.
Correct behavior requires planning or foresight that takes into account even the delayed effects of one's own choices.

part 01

chapter2

chapter3

This part was summarized by Sol and Jiyun. I decided to sit out the summary. Velog, seriously

chapter4

chapter5

chapter6

chapter7

chapter8

part 02

chapter9

.
.
.

chapter17
