State
Actor = agent
Google Data Centre
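Reduced the cooling bill by 40%
(e.g., deciding when to switch equipment on and off)
Robotics: joint movements
Business: inventory management, resource allocation
Finance: investment decisions, portfolio design
E-commerce/media: which content to show to users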
Frozen Lake World (OpenAI GYM)
Start, Hole, Goal
Agent
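import gym
env = gym.make("FrozenLake-v0")
observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()  # pick a random action
    observation, reward, done, info = env.step(action)
    if done:
        # Start a new episode once the agent falls in a hole or reaches the goal
        observation = env.reset()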
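At first nothing is visible; the agent only sees a cell after stepping onto it.
Key in an action and move
The environment does not say whether each individual action was good or bad.
Frozen Lake: "Even if you know the way, ask." (Korean proverb: 아는 길도 물어가라)
Q, the big brother: tells you whether a path is good or bad. ("I've tried it myself, so I know...")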
Q(state, action)
Policy using Q-function
Optimal Policy, pi and Max Q
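As a minimal sketch of how a policy can be read off a Q-function: in each state, pick the action with the largest Q value. The table values and the helper names greedy_policy and max_q below are made up for illustration.

```python
# Hypothetical Q values for 3 states x 4 actions (left, down, right, up).
Q = [
    [0.0, 0.5, 0.9, 0.1],
    [0.2, 0.0, 0.0, 0.7],
    [0.0, 1.0, 0.3, 0.0],
]

def greedy_policy(Q, state):
    """pi*(s) = argmax_a Q(s, a): index of the best action in `state`."""
    row = Q[state]
    return max(range(len(row)), key=lambda a: row[a])

def max_q(Q, state):
    """max_a Q(s, a): value of the best action in `state`."""
    return max(Q[state])

for s in range(len(Q)):
    print(s, greedy_policy(Q, s), max_q(Q, s))
```

Max Q gives the value of the best action, while the policy returns which action attains it.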
How can we obtain Q?
Assumption: assume (believe) that Q in s' exists!
That is, the Q value at the next state exists.
My condition
I am in s
when I do action a, I'll go to s'
When I do action a, I'll get reward r
Q in s', Q(s',a') exists!
R_t = r_t + r_{t+1} + ... + r_n
R_t = r_t + R_{t+1}
R*_t = r_t + max R_{t+1}
Q(s,a) = r + max_a' Q(s',a')
Learning Q(s,a) table
Q(s,a) = r + max_a' Q(s',a')
Learning Q(s,a) Table (with many trials)
initial Q values are 0
Q(s14, a_right) = r = 1
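To illustrate how the reward at the goal flows backwards through the table over repeated trials, here is a small sketch. It assumes a deterministic 4x4 grid in which moving right from s14 reaches the goal s15 with reward 1. The update helper and the hand-written transitions are assumptions for illustration, not environment calls.

```python
RIGHT = 2  # action order assumed: left, down, right, up

# Q table for 16 states x 4 actions, all zeros initially.
Q = [[0.0] * 4 for _ in range(16)]

def update(Q, s, a, r, s_next):
    """Apply Q(s,a) = r + max_a' Q(s',a')."""
    Q[s][a] = r + max(Q[s_next])

# Trial 1: the agent steps right from s14 into the goal s15 and gets r = 1.
update(Q, 14, RIGHT, 1.0, 15)
# Trial 2: stepping right from s13 reaches s14 with r = 0,
# but s14 now has a non-zero max Q, so the value propagates back.
update(Q, 13, RIGHT, 0.0, 14)

print(Q[14][RIGHT], Q[13][RIGHT])
```

Entries for states that have not yet led to the goal stay at 0 until a later trial passes through them.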
Dummy Q-learning algorithm
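For each s, a initialize table entry Q(s,a) <- 0
Observe current state s
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s'
    Update the table entry: Q(s,a) <- r + max_a' Q(s',a')
    s <- s'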
# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])  # 16x4
# Set learning parameters
num_episodes = 2000
# Create a list to contain the total reward per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False
    # The Q-Table learning algorithm
    while not done:
        # If the Q values are tied (e.g., all zero), rargmax picks randomly among them
        action = rargmax(Q[state, :])
        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)
        # Update Q-Table with new knowledge
        Q[state, action] = reward + np.max(Q[new_state, :])
        rAll += reward  # accumulate the episode reward
        state = new_state
    rList.append(rAll)
# Post-processing
print("Success rate: " + str(sum(rList) / num_episodes))
print("Final Q-Table Values")
print("Left Down Right Up")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
plt.show()
# Setup for the Q-table training loop (imports, helper, environment)
import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register
import random as pr

def rargmax(vector):
    """Argmax that chooses randomly among eligible maximum indices."""
    m = np.amax(vector)
    indices = np.nonzero(vector == m)[0]
    return pr.choice(indices)

# Register a deterministic (non-slippery) 4x4 FrozenLake
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4',
            'is_slippery': False}
)
env = gym.make('FrozenLake-v3')
Why it is called "dummy"
It never tries a new path. -> Exploit vs. Exploration
Exploit (weekday) vs. Exploration (weekend)
e = 0.1
if rand < e:
    a = random
else:
    a = argmax(Q(s, a))

for i in range(1000):
    e = 0.1 / (i+1)
    if rand < e:
        a = random
    else:
        a = argmax(Q(s, a))
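The pseudocode above can be turned into a small runnable sketch. The function name e_greedy and the Q values are made up for illustration. The decay schedule e = 0.1 / (i + 1) is the one from the notes.

```python
import random

def e_greedy(q_row, e, rng=random):
    """With probability e pick a random action, otherwise the greedy one."""
    if rng.random() < e:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

q_row = [0.0, 0.2, 0.8, 0.1]  # made-up Q values for one state

for i in range(5):
    e = 0.1 / (i + 1)  # decaying exploration rate
    print(i, e, e_greedy(q_row, e))
```

As e shrinks with i, the agent explores less and exploits the learned values more.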

for i in range(1000):
    a = argmax(Q(s, a) + random_values / (i+1))
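A rough runnable version of the noise-based selection, assuming Gaussian noise (the notes only say "random values"). The helper name noisy_argmax and the Q values are hypothetical.

```python
import random

def noisy_argmax(q_row, i, rng=random):
    """argmax over Q values plus Gaussian noise scaled by 1/(i+1)."""
    noisy = [q + rng.gauss(0.0, 1.0) / (i + 1) for q in q_row]
    return max(range(len(noisy)), key=lambda a: noisy[a])

q_row = [0.0, 0.2, 0.8, 0.1]  # made-up Q values for one state

# Early on (i = 0) the noise is large, so many different actions get picked.
early = [noisy_argmax(q_row, 0) for _ in range(1000)]
# Much later the noise is negligible and the choice is effectively greedy.
late = [noisy_argmax(q_row, 10**9) for _ in range(1000)]
print(len(set(early)), set(late))
```

Unlike e-greedy, the noisy choice still favours actions with higher Q values even while exploring.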


Being paid today for today's work is best!
Rewards received later are discounted by multiplying them by gamma.
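A short sketch of how discounting by gamma works on a sequence of rewards. The helper name discounted_return and the reward list are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """R_t = r_t + gamma * R_{t+1}, computed backwards from the last reward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [0.0, 0.0, 1.0]  # made-up episode: reward only at the final step
print(discounted_return(rewards, 0.9))
print(discounted_return(rewards, 1.0))
```

With gamma below 1, a reward two steps away is worth gamma squared times its face value, so shorter paths to the goal score higher.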

Now the agent knows where to go!
Q-hat: we do not know the true Q values, so we write our approximation as Q-hat.
Q-hat converges to Q, provided certain assumptions hold.
for i in range(num_episodes):
    e = 1. / ((i / 100) + 1)
    # The Q-Table learning algorithm
    while not done:
        # Choose an action by e-greedy
        if np.random.rand(1) < e:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])
        # Alternative to e-greedy: choose an action greedily (with noise) from the Q table
        action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i + 1))

Even if the agent wants to move one cell to the right, it cannot always do so.





Will Q-hat become equal to the imagined Q..?
=> Yes, it converges.

Q[state,action] = (1-learning_rate) * Q[state,action] \
+ learning_rate*(reward + dis * np.max(Q[new_state, :]))
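The update above blends the old estimate with the new target. Here is a standalone sketch with made-up numbers. The function name q_update is hypothetical, and the parameter names mirror learning_rate and dis from the notes.

```python
def q_update(q_sa, reward, max_q_next, learning_rate, dis):
    """Blend the old estimate with the target r + dis * max_a' Q(s',a')."""
    target = reward + dis * max_q_next
    return (1 - learning_rate) * q_sa + learning_rate * target

# Made-up numbers: old estimate 0.5, reward 1, best next value 0.
print(q_update(0.5, 1.0, 0.0, learning_rate=0.1, dis=0.99))
```

With learning_rate = 1 the old estimate is discarded entirely, which recovers the earlier deterministic update. With a small learning_rate, a single unlucky transition in a stochastic world only nudges the estimate.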


Let's apply this to a bigger world..!
- 100 x 100 x 4 - Q table
- Taking screen/camera input: with 80x80 pixels and 2 colours, each pixel is either 0 or 1,
so there are 2^(80x80) possible states and the Q-table needs 2^(80x80) rows. That is far too big!
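To get a feel for the sizes involved, the two counts can be computed directly. This is just arithmetic on the numbers given above, and the variable names are made up.

```python
import math

grid_entries = 100 * 100 * 4   # 100x100 grid with 4 actions per cell
pixel_states = 2 ** (80 * 80)  # each of the 80x80 pixels is 0 or 1

# Number of decimal digits in 2^(80x80), computed via logarithms.
digits = math.floor(80 * 80 * math.log10(2)) + 1

print(grid_entries)
print(digits)
```

The grid table is small enough to store, whereas the pixel-based state count has well over a thousand decimal digits, so no table can hold it.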

The network takes the state as input and outputs the Q values for all possible actions.

The y label is the optimal Q*.
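A dependency-free sketch of the idea, assuming a single linear layer on a one-hot state. The weights are made up, and q_values and squared_loss are hypothetical helper names rather than library calls.

```python
def one_hot(x, size):
    """Row vector of length `size` with a 1 at index x."""
    return [1.0 if i == x else 0.0 for i in range(size)]

def q_values(state_vec, W):
    """Linear Q-network: one Q value per action, Q(s, .) = state_vec @ W."""
    n_actions = len(W[0])
    return [sum(state_vec[i] * W[i][a] for i in range(len(W)))
            for a in range(n_actions)]

def squared_loss(y, q_pred):
    """Sum of squared differences between the label y and the prediction."""
    return sum((t - p) ** 2 for t, p in zip(y, q_pred))

input_size, output_size = 16, 4
# Made-up weights: W[s][a] = 0.01 * (s + a)
W = [[0.01 * (s + a) for a in range(output_size)] for s in range(input_size)]

q_pred = q_values(one_hot(3, input_size), W)
print(q_pred)
print(squared_loss([0.0, 0.0, 1.0, 0.0], q_pred))
```

With a one-hot input, the linear layer simply selects row s of W, so this network is equivalent to a Q-table. The benefit of a network only appears with richer inputs.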




Why isn't the target made stochastic?
There is no need. The training process itself amounts to learning the Q of a stochastic world.

Training diverged, so no learning took place!
Next time we will look at how to fix this!

def one_hot(x):
    return np.identity(16)[x:x+1]

# input and output size based on the Env
input_size = env.observation_space.n
output_size = env.action_space.n
# These lines establish the feed-forward part of the network used to choose actions
X = tf.placeholder(shape=[1,input_size], dtype=tf.float32) # state input
W = tf.Variable(tf.random_uniform([input_size, output_size],0,0.01)) # weight
Qpred = tf.matmul(X, W) # Out Q prediction
Y = tf.placeholder(shape=[1, output_size], dtype=tf.float32) # Y label
loss = tf.reduce_sum(tf.square(Y - Qpred))
train = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)
Qs[0, a] = reward + dis * np.max(Qs1)
# Train our network using target (Y) and predicted Q (Qpred) values
sess.run(train, feed_dict={X: one_hot(s), Y: Qs})


import gym
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (placeholders, sessions)
import matplotlib.pyplot as plt
env = gym.make('FrozenLake-v0')
# Input and output size based on the Env
input_size = env.observation_space.n
output_size = env.action_space.n
learning_rate = 0.1

def one_hot(x):
    # One-hot row vector for state x
    return np.identity(16)[x:x + 1]

# These lines establish the feed-forward part of the network used to choose actions
X = tf.placeholder(shape=[1, input_size], dtype=tf.float32)  # state input
W = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))  # weight
Qpred = tf.matmul(X, W)  # Our Q prediction
Y = tf.placeholder(shape=[1, output_size], dtype=tf.float32)  # Y label
loss = tf.reduce_sum(tf.square(Y - Qpred))
train = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)
# Set Q-learning related parameters
dis = .99
num_episodes = 2000
# Create a list to contain the total reward per episode
rList = []
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        # Reset environment and get first new observation
        s = env.reset()
        e = 1. / ((i / 50) + 10)
        rAll = 0
        done = False
        local_loss = []
        # The Q-Network training
        while not done:
            # Choose an action greedily (with e chance of random action) from the Q-network
            Qs = sess.run(Qpred, feed_dict={X: one_hot(s)})
            if np.random.rand(1) < e:
                a = env.action_space.sample()
            else:
                a = np.argmax(Qs)
            # Get new state and reward from environment
            s1, reward, done, _ = env.step(a)
            if done:
                # Update Q, and no Qs+1, since it's a terminal state
                Qs[0, a] = reward
            else:
                # Obtain the Q_s1 values by feeding the new state through our network
                Qs1 = sess.run(Qpred, feed_dict={X: one_hot(s1)})
                # Update Q
                Qs[0, a] = reward + dis * np.max(Qs1)
            # Train our network using target (Y) and predicted Q (Qpred) values
            sess.run(train, feed_dict={X: one_hot(s), Y: Qs})
            rAll += reward
            s = s1
        rList.append(rAll)

Because it fails to converge, performance is comparatively poor..
import gym
env = gym.make('CartPole-v0')
env.reset()
random_episodes = 0
reward_sum = 0
while random_episodes < 10:
    env.render()
    action = env.action_space.sample()  # take a random action
    observation, reward, done, _ = env.step(action)
    print(observation, reward, done)
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was:", reward_sum)
        reward_sum = 0
        env.reset()
# Get new state and reward from environment
s1, reward, done, _ = env.step(a)
if done:
    # Penalize the action that ended the episode
    Qs[0, a] = -100
else:
    x1 = np.reshape(s1, [1, input_size])
    # Obtain the Q' values by feeding the new state through our network
    Qs1 = sess.run(Qpred, feed_dict={X: x1})
    Qs[0, a] = reward + dis * np.max(Qs1)


import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (placeholders, sessions, tf.contrib)
import gym
env = gym.make('CartPole-v0')
# Constants defining our neural network
learning_rate = 1e-1
input_size = env.observation_space.shape[0]  # 4
output_size = env.action_space.n  # 2
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
# First layer of weights
W1 = tf.get_variable("W1", shape=[input_size, output_size],
                     initializer=tf.contrib.layers.xavier_initializer())
Qpred = tf.matmul(X, W1)
# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(shape=[None, output_size], dtype=tf.float32)
# Loss function
loss = tf.reduce_sum(tf.square(Y - Qpred))
# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
# Values for Q-learning
num_episodes = 2000
dis = 0.9
rList = []
# Create the session and initialize the variables before training
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(num_episodes):
    e = 1. / ((i / 10) + 1)
    rAll = 0
    step_count = 0
    s = env.reset()
    done = False
    # The Q-Network training
    while not done:
        step_count += 1
        x = np.reshape(s, [1, input_size])
        # Choose an action greedily (with e chance of random action) from the Q-network
        Qs = sess.run(Qpred, feed_dict={X: x})
        if np.random.rand(1) < e:
            a = env.action_space.sample()
        else:
            a = np.argmax(Qs)
        # Get new state and reward from environment
        s1, reward, done, _ = env.step(a)
        if done:
            Qs[0, a] = -100  # label = target Q
        else:
            x1 = np.reshape(s1, [1, input_size])
            # Obtain the Q' values by feeding the new state through our network
            Qs1 = sess.run(Qpred, feed_dict={X: x1})
            Qs[0, a] = reward + dis * np.max(Qs1)  # label = target Q
        # Train our network using target and predicted Q values on each step
        sess.run(train, feed_dict={X: x, Y: Qs})
        s = s1
    rList.append(step_count)
    print("Episode: {} steps: {}".format(i, step_count))
    # If the average over the last 10 episodes exceeds 500 steps, it's good enough
    if len(rList) > 10 and np.mean(rList[-10:]) > 500:
        break
# See our trained network in action
observation = env.reset()
reward_sum = 0
while True:
    env.render()
    x = np.reshape(observation, [1, input_size])
    Qs = sess.run(Qpred, feed_dict={X: x})
    a = np.argmax(Qs)
    observation, reward, done, _ = env.step(a)
    reward_sum += reward
    if done:
        print("Total score: {}".format(reward_sum))
        break

Why does it not work? Too shallow
- It diverges..
- Samples are correlated with each other
- The target is not fixed. The target moves too


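What if we train on only two samples? We get an odd line.
What if we train on only 4 samples?
Target = Y
Because the same network parameters theta are used, the target value Y changes as well.
It is as if the target moves the moment you shoot the arrow... i.e., non-stationary targets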

Sample randomly from the buffer, build a minibatch from those samples, and train on it.
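A minimal sketch of this replay idea using only the standard library. The transition tuples are made up, and the buffer cap and batch size of 10 are assumptions for illustration.

```python
import random
from collections import deque

REPLAY_MEMORY = 50000  # assumed cap on the number of stored transitions
replay_buffer = deque(maxlen=REPLAY_MEMORY)

# Store made-up transitions: (state, action, reward, next_state, done).
for t in range(100):
    replay_buffer.append((t, t % 2, 1.0, t + 1, t == 99))

# Random sampling breaks the correlation between consecutive steps.
minibatch = random.sample(list(replay_buffer), 10)
print(minibatch[:3])
```

With maxlen set, the deque silently drops the oldest transition once it is full, so the buffer always holds the most recent experience.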

Let's create one more network!
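One common way to use a second network is to keep it as a frozen copy that produces the targets and to sync it with the main network only occasionally. The sketch below illustrates that pattern with plain Python lists standing in for network weights. The train_step stand-in, the sync interval, and all values are assumptions, not the lecture's implementation.

```python
import copy

main_weights = {"W1": [[0.1, 0.2], [0.3, 0.4]]}  # made-up parameters
target_weights = copy.deepcopy(main_weights)      # frozen copy used for targets

def train_step(weights, step_size=0.01):
    """Stand-in for a gradient update applied to the main network only."""
    for row in weights["W1"]:
        for j in range(len(row)):
            row[j] += step_size

SYNC_EVERY = 10  # hypothetical sync interval
for step in range(1, 26):
    train_step(main_weights)
    if step % SYNC_EVERY == 0:
        # Copy main -> target, so targets stay fixed between syncs
        target_weights = copy.deepcopy(main_weights)

print(main_weights["W1"][0][0], target_weights["W1"][0][0])
```

Between syncs the target values do not move while the main network trains, which addresses the moving-target problem.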




# store the previous observations in replay memory
replay_buffer = deque()



import numpy as np
import tensorflow as tf
import random
import dqn
from collections import deque
import gym
env = gym.make('CartPole-v0')
# Constants defining our neural network
dis = 0.9
REPLAY_MEMORY = 50000

class DQN:
    def __init__(self, ...)
    def _build_network(self, ...)
    def predict(self, state):
    def update(self, x_stack, y_stack):

def simple_replay_train(DQN, train_batch):
    x_stack =
    y_stack = ...
    # Get stored information from buffer
    for state, action, reward, next_state, done in train_batch:
        Q = DQN.predict(state)
        # terminal?
        if done:
        else:
    return DQN.update(x_stack, y_stack)

def bot_play(mainDQN):
    s = env.reset()
    reward_sum = 0
    while True:
        env.render()
        ...

def main():
    max_episodes = 5000
    replay_buffer = deque()
    with tf.Session() as sess:
        mainDQN = dqn.DQN(sess, input_size

The agent learns well enough to keep the pole upright for the capped number of steps.
However, results are unstable at some points during training.







