Lec-07 application and tips

leban · October 30, 2021


Learning rate

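: Gradient: the slope of the cost.
: The learning rate and the gradient together determine how the model moves toward its optimal values during training.
: The learning rate is a hyper-parameter (a setting chosen when building the model); it controls how well the optimization works.
: It sets how far to move along the gradient at each step on the way to the optimum.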

Tensorflow Code

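def grad(hypothesis, labels):  # hypothesis (prediction), labels (ground truth)
    with tf.GradientTape() as tape:
        loss_value = loss_fn(hypothesis, labels)  # loss_value = cost
    # note: for gradients to flow, the hypothesis itself must be computed inside the tape
    return tape.gradient(loss_value, [W, b])  # gradient of the cost (gap between hypothesis and labels) w.r.t. W and b
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
optimizer.apply_gradients(grads_and_vars=zip(grads, [W, b]))  # the learning rate and the gradients together move W and b toward their optimal values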

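gradient: the slope at each position

learning_rate: sets how far to move in one step; different settings lead to different results

If the learning rate is large, each step moves a long way, which causes problems such as overshooting.

If the learning rate is small, the cost descends slowly and steadily, at the price of more training steps. 0.01 is a common choice.
3e-4 (0.0003): a commonly cited default learning rate for the Adam optimizer.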
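
The effect of the learning rate is easy to see on a toy problem. The sketch below is not from the lecture: it assumes a one-variable cost(w) = w**2 (gradient 2w) and a hypothetical helper `descend`, and compares a small, a moderate, and an overly large learning rate.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run plain gradient descent on cost(w) = w**2 and return the final w."""
    for _ in range(steps):
        gradient = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * gradient   # move against the gradient
    return w

small = descend(0.001)    # moves slowly: still far from the minimum at 0
moderate = descend(0.1)   # shrinks steadily toward 0
large = descend(1.1)      # overshoots: |w| grows at every step

print(abs(small), abs(moderate), abs(large))
```

With these assumed values the moderate rate ends closest to the minimum, the small rate has barely moved, and the large rate has diverged.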

Tensorflow Code

optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01)

learning rate decay
: Even with a good initial learning rate, it is important to adjust the learning rate appropriately as training proceeds.
: During training the cost keeps falling, but at some point it may stop improving (plateau).
: Learning rate decay adjusts (lowers) the learning rate at that point so that the cost can fall further.

Decay methods
1. Step decay: every N epochs, or when the validation loss stops improving, change the learning rate by a fixed amount
2. Exponential decay
3. 1/t decay

Tensorflow Code

learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 1000, 0.96, staircase=True)  # (initial value, current global step, decay every this many steps, decay rate)
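
The three decay schedules can be written out as plain functions. This is a minimal sketch, not the TensorFlow implementation: the function names and the default constants (`drop`, `every`, `k`) are assumptions chosen for illustration. The exponential form follows the documented formula lr0 * decay_rate ** (step / decay_steps).

```python
def step_decay(lr0, step, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` steps."""
    return lr0 * drop ** (step // every)

def exponential_decay(lr0, step, decay_steps, decay_rate, staircase=False):
    """lr0 * decay_rate ** (step / decay_steps); integer division if staircase."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return lr0 * decay_rate ** exponent

def inverse_time_decay(lr0, step, k=0.01):
    """1/t decay: lr0 / (1 + k * step)."""
    return lr0 / (1.0 + k * step)

for step in (0, 1000, 1500, 2000):
    print(step, exponential_decay(0.1, step, 1000, 0.96, staircase=True))
```

With `staircase=True` the rate stays constant between multiples of `decay_steps`, so steps 1000 and 1500 print the same value.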

Data preprocessing

Feature Scaling
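: A preprocessing method that rescales the input variables so that they fall within a fixed range.

Standardization: subtract the mean from x and divide by the standard deviation -> expresses how far each value is from the mean.
Normalization: subtract the minimum from x and divide by (maximum - minimum) -> spreads the data evenly over the range 0~1.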

Python Code(numpy)

Standardization = (data - np.mean(data)) / np.sqrt(np.sum((data - np.mean(data)) ** 2) / np.size(data))

Normalization = (data - np.min(data, 0)) / (np.max(data, 0) - np.min(data, 0))

: Standardization and normalization bring the features onto a common scale, so the data is spread evenly.
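
A small worked example may make the two formulas concrete. The sketch below uses only the standard library; the sample values and the helper names `standardize` and `min_max_normalize` are made up for illustration. It uses the population standard deviation (divide by the count), matching the formula above.

```python
import statistics

def standardize(values):
    """(x - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

def min_max_normalize(values):
    """(x - min) / (max - min), which maps the data onto 0~1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, std 2
print(standardize(data))
print(min_max_normalize(data))
```

After standardization the values have mean 0 and standard deviation 1; after normalization the smallest value becomes 0 and the largest becomes 1.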

Preprocessing examples - Noisy Data (removing useless data properly is important)
1. Numeric
2. NLP
3. Face Image

Overfitting

: The model is fitted too closely to the training data.
: As the hypothesis is fitted, the model is evaluated along the way; accuracy rises and then settles at some level.
: If we evaluate only on the data used for training, the model looks well fitted to that data. But when it is evaluated on new data that was not used in training, accuracy on the test set tends to fall as training continues.
: The ideal model reaches high accuracy on both the training data and the test data.

High bias(underfit)
: The model is under-trained.

High variance(overfit)
: The model is fitted only to the training data.
: Its predictions vary a lot.

Set of features -> ways to address overfitting
: Get more training data
: Smaller set of features: reduce the dimensionality so that what each feature means becomes clearer (PCA)
: Add additional features: makes the model more specific (this helps with high bias rather than high variance)

sklearn Code

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

Regularization (Add term to loss: add a penalty term to the loss)
λ--: fixes high bias (underfitting)
λ++: fixes high variance (overfitting)

: Adding a term on the weights to the loss regularizes the weights as training iterates, which helps prevent overfitting.

Tensorflow Code

L2_loss = tf.nn.l2_loss(w)  # output = sum(t ** 2) / 2
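
The sum(t ** 2) / 2 formula in the comment above, and the role of λ, can be checked by hand. This is a plain-Python sketch rather than the TensorFlow call; `regularized_cost` and the name `beta` (standing in for λ) are assumptions for illustration.

```python
def l2_loss(weights):
    """Half the sum of squared weights: sum(t ** 2) / 2."""
    return sum(w ** 2 for w in weights) / 2.0

def regularized_cost(data_cost, weights, beta=0.01):
    """Data cost plus the L2 penalty scaled by beta (the lambda above)."""
    return data_cost + beta * l2_loss(weights)

weights = [3.0, 4.0]
print(l2_loss(weights))                           # (9 + 16) / 2 = 12.5
print(regularized_cost(1.0, weights, beta=0.01))  # 1.0 + 0.01 * 12.5
print(regularized_cost(1.0, weights, beta=0.1))   # larger beta, larger penalty
```

Raising beta makes large weights more expensive, which pushes the model toward smaller weights and away from overfitting.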

Overfitting Solutions
: Feature Normalization (normalize the features properly)
: Regularization (add a term to the loss to penalize weights that grow too large)
: More Data (Data Augmentation: enlarge the dataset)

Color Jittering (vary the colors)
Horizontal Flips (mirror the image)
Random Crops/Scales (crop to a suitable size or enlarge)

: Dropout (0.5 is common)
: Batch Normalization
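
Two of the augmentations listed above, horizontal flips and random crops, can be sketched on a tiny image stored as a list of rows. This is an illustration only: the helper names and the 3x3 "image" are made up, and real pipelines would use an image library.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2-D image (list of rows)."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng=random):
    """Cut a size x size window at a random position."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

flipped = horizontal_flip(image)
crop = random_crop(image, 2, random.Random(0))   # seeded for repeatability
print(flipped)
print(crop)
```

Each call produces a new training sample from the same original image, which is how augmentation enlarges the dataset without collecting more data.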

Tensorflow Code

import numpy as np
import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()

Code(Eager)

Data Preprocess

xy = np.array([[828.659973, 833.450012, 908100, 828.349976, 831.659973],
               [823.02002, 828.070007, 1828100, 821.655029, 828.070007],
               [819.929993, 824.400024, 1438100, 818.97998, 824.159973],
               [816, 820.958984, 1008100, 815.48999, 819.23999],
               [819.359985, 823, 1188100, 818.469971, 818.669983],
               [819, 823, 1198100, 816, 820.450012],
               [811.700012, 815.25, 1098100, 809.780029, 813.669983],
               [809.51001, 816.659973, 1398100, 804.539978, 809.559998]])

def normalization(data):  # min-max normalization to the range 0~1
    numerator = data - np.min(data, 0)
    denominator = np.max(data, 0) - np.min(data, 0)
    return numerator / denominator

xy = normalization(xy)  # normalize first, then split into features and labels

x_train = xy[:, 0:-1]
y_train = xy[:, [-1]]

L2 Norm

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(len(x_train))

W = tf.Variable(tf.random_normal([4, 1]), dtype=tf.float32)
b = tf.Variable(tf.random_normal([1]), dtype=tf.float32)

def linearReg_fn(features):  # hypothesis for linear regression
    hypothesis = tf.matmul(features, W) + b
    return hypothesis

def l2_loss(loss, beta=0.01):
    W_reg = tf.nn.l2_loss(W)  # output = sum(t ** 2) / 2
    loss = tf.reduce_mean(loss + W_reg * beta)
    return loss  # loss with the L2 penalty added

def loss_fn(hypothesis, labels, flag=False):  # cost function for linear regression
    cost = tf.reduce_mean(tf.square(hypothesis - labels))
    # the cost measures the gap between the hypothesis and y; training minimizes it
    # flag decides whether the L2 penalty is applied
    if flag:
        cost = l2_loss(cost)
    return cost

Learning Decay

is_decay = True
starter_learning_rate = 0.1
EPOCHS = 101  # assumed value; the original notes do not define EPOCHS
global_step = tf.Variable(0, trainable=False)  # defined outside the if so apply_gradients can always use it

if is_decay:
    learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 50, 0.96, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
else:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=starter_learning_rate)

def grad(features, labels, l2_flag):
    with tf.GradientTape() as tape:
        loss_value = loss_fn(linearReg_fn(features), labels, l2_flag)
    return tape.gradient(loss_value, [W, b]), loss_value

for step in range(EPOCHS):
    for features, labels in tfe.Iterator(dataset):
        features = tf.cast(features, tf.float32)
        labels = tf.cast(labels, tf.float32)
        grads, loss_value = grad(features, labels, False)
        optimizer.apply_gradients(grads_and_vars=zip(grads, [W, b]), global_step=global_step)
        if step % 10 == 0:
            # optimizer._learning_rate is callable only when is_decay is True (eager-mode exponential_decay returns a function)
            print("Iter: {}, Loss: {:.4f}, Learning Rate: {:.8f}".format(step, loss_value, optimizer._learning_rate()))

Summary
: Learning rate

Gradient # finding the optimal model from the cost: Gradient x Learning rate
Good and Bad learning rate # choosing a suitable learning rate matters
Annealing the learning rate(Decay) # adjusting the learning rate during training matters

: Data preprocessing

Standardization / Normalization # scaling the features
Noisy Data # keep only the data that is needed

: Overfitting

Set a Features # adjust the feature set
Regularization # penalize large weights

Data sets

Training and Validation
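: Composing the training data and the evaluation (validation) data well is the most important thing.

Good Case
: Improving the model's performance is an iterative process.
: The goal is to reach a model with 99% accuracy by repeatedly testing on the same data while varying the hyper-parameters and the network structure.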

Tensorflow Code

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()	#60,000 training / 10,000 testing images
model.fit(x_train, y_train, validation_split=0.2, epochs=5)	# 20% Val data
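
What `validation_split=0.2` does to the data can be sketched in plain Python. This is a simplified stand-in, not the Keras source: `split_validation` is a hypothetical helper that holds out the last 20% of the samples, which is how Keras picks the validation portion (from the end of the data, before shuffling).

```python
def split_validation(samples, validation_split=0.2):
    """Hold out the last `validation_split` fraction of the samples."""
    n_val = int(len(samples) * validation_split)
    cut = len(samples) - n_val
    return samples[:cut], samples[cut:]

train, val = split_validation(list(range(10)), 0.2)
print(train)   # first 8 samples are used for training
print(val)     # last 2 samples are used for validation

mnist_train, mnist_val = split_validation(list(range(60000)), 0.2)
print(len(mnist_train), len(mnist_val))
```

For the 60,000 MNIST training images this leaves 48,000 for training and 12,000 for validation; the 10,000 test images stay untouched.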

: Once the layers are well designed, the learning rate is well chosen, and the optimizer is set up, a reasonably good model usually results.

Evaluating a hypothesis
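: Once a model has been selected, it can be tested directly on entirely new data.
: ex) The model should still recognize the same person from a different angle or with a different expression.
: Validation data is already used for evaluation during training, but verifying the final model on new (test) data is what matters.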

Tensorflow Code

test_acc = accuracy_fn(softmax_fn(x_test), y_test)	# define hypothesis and test
model.evaluate(x_test, y_test)	# Keras

: The model is built using both the training data and the evaluation data.

Anomaly detection
: Build the model from healthy (normal) data.
: When new, unusual data arrives and it deviates from what the model treats as normal, an anomaly is detected.
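
One simple way to realize this idea is a z-score threshold. The lecture does not prescribe a method, so this is a swapped-in technique under stated assumptions: the readings, the helper names, and the threshold of 3 standard deviations are all made up for illustration.

```python
import statistics

def fit_healthy(values):
    """Learn what 'normal' looks like from healthy data only."""
    return statistics.mean(values), statistics.pstdev(values)

def is_anomaly(x, mean, std, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the mean."""
    return abs(x - mean) / std > threshold

healthy = [36.5, 36.7, 36.6, 36.8, 36.4, 36.6]   # hypothetical normal readings
mean, std = fit_healthy(healthy)

print(is_anomaly(36.6, mean, std))   # a typical reading
print(is_anomaly(39.5, mean, std))   # far outside the healthy range
```

The model never sees abnormal examples during fitting; anything far from the healthy distribution is reported as an anomaly.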

Learning

Online vs Batch
: Online Learning - the data keeps changing (for example, streaming in over a network) and the model keeps learning and updating from it
: Batch (Offline) Learning - the model is built from static data
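
The difference can be sketched with a one-parameter model y_hat = w * x. The example is an assumption-laden toy, not lecture code: `batch_fit` solves for w from the whole fixed dataset at once, while `sgd_update` adjusts w a little each time a new sample arrives.

```python
def sgd_update(w, x, y, learning_rate=0.05):
    """One online step for y_hat = w * x with squared error."""
    gradient = 2.0 * (w * x - y) * x
    return w - learning_rate * gradient

def batch_fit(xs, ys):
    """Closed-form least squares for y = w * x on a fixed dataset."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0] * 20
ys = [2.0 * x for x in xs]              # the true relation is y = 2x

w_batch = batch_fit(xs, ys)             # uses all data at once

w_online = 0.0
for x, y in zip(xs, ys):                # samples arrive one at a time
    w_online = sgd_update(w_online, x, y)

print(w_batch, w_online)
```

Both approaches recover w close to 2 here; the online version can keep updating if the data distribution changes later.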

Fine Tuning / Feature Extraction
1. Original Model
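: A model trained to recognize white faces recognizes white faces well.
: The same model does not recognize Asian or Black faces well.
2. Fine-tuning
: Adjust the weights of the existing face model so that it also recognizes Asian and Black faces well.
3. Feature Extraction
: Reuse the features of the existing model and train only the part added for the new task.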
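
The distinction between the two strategies comes down to which weights are allowed to change. The toy sketch below assumes a model with just two named parameters, a pretrained "base" and a new "head"; the values, gradients, and the helper `apply_gradients` are hypothetical.

```python
def apply_gradients(params, grads, trainable, learning_rate=0.1):
    """Update only the parameters whose name is in `trainable`."""
    return {name: (value - learning_rate * grads[name]) if name in trainable else value
            for name, value in params.items()}

pretrained = {"base": 1.0, "head": 0.5}   # hypothetical weights
grads = {"base": 0.2, "head": -0.4}       # hypothetical gradients from the new data

# Fine-tuning: every weight, including the pretrained base, is adjusted.
fine_tuned = apply_gradients(pretrained, grads, trainable={"base", "head"})

# Feature extraction: the base is frozen; only the new head is trained.
feature_extraction = apply_gradients(pretrained, grads, trainable={"head"})

print(fine_tuned)
print(feature_extraction)
```

In a real framework the same effect is usually obtained by marking layers as non-trainable rather than filtering updates by hand.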

Efficient Models
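: Building an efficient model matters.
: Minimizing inference time is the key goal.
: Reducing the weights that dominate inference cost (making the model lightweight) is important.
: Fully connected layers hold many parameters, so replacing them with 1x1 convolutions is a widely used technique.
: Many papers followed in this direction, e.g. SqueezeNet and MobileNet.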

Tensorflow Code

tf.nn.depthwise_conv2d(input, filter, strides, padding)
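
Why depthwise convolutions make a model lighter can be seen by counting weights. The sketch below is simple arithmetic, ignoring biases; it assumes the MobileNet-style pairing of a depthwise convolution with a 1x1 (pointwise) convolution, and the layer sizes are made up.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Every output channel has its own kernel x kernel filter over all input channels."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution mixes the channels
    return depthwise + pointwise

standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, standard / separable)
```

For a 3x3 layer going from 64 to 128 channels, the separable version needs 8,768 weights instead of 73,728, roughly 8 times fewer.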

Sample Data

Fashion MNIST-Image Classification

#Tensorflow Code
from tensorflow import keras  # assumes numpy (np) and tensorflow (tf) are already imported
fashion_mnist = keras.datasets.fashion_mnist  # keras provides the fashion_mnist dataset
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']  # names for labels 0~9

train_images = train_images / 255.0  # (60000, 28, 28), scaled to 0~1
test_images = test_images / 255.0  # (10000, 28, 28), scaled to 0~1

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),  # flatten each 28x28 image into a vector
    keras.layers.Dense(128, activation=tf.nn.relu),  # hidden layer with 128 units
    keras.layers.Dense(10, activation=tf.nn.softmax)  # output layer for the 10 classes
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=5)  # train for 5 epochs
test_loss, test_acc = model.evaluate(test_images, test_labels)
predictions = model.predict(test_images)
np.argmax(predictions[0])  # predicted label for the first test image (9 in the original notes)
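
The last step, turning a softmax output into a class name, can be shown without running the model. The probability vector below is made up for illustration; `argmax` is a plain-Python stand-in for `np.argmax`.

```python
def argmax(values):
    """Index of the largest value, like np.argmax on a 1-D array."""
    return max(range(len(values)), key=lambda i: values[i])

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# hypothetical softmax output for one image (10 probabilities summing to 1)
probs = [0.01, 0.01, 0.02, 0.01, 0.01, 0.05, 0.02, 0.07, 0.05, 0.75]

label = argmax(probs)
print(label, class_names[label])
```

The index of the highest probability is the predicted label, and `class_names` maps it back to a readable name.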

IMDB-Text Classification
: IMDB (Internet Movie Database)
: A model can be built to classify the reviews people write after watching a movie.

# Tensorflow Code
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)  # keep the 10,000 most frequent words
word_index = imdb.get_word_index()
# The first indices are reserved
# preprocessing for natural language processing
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0  # padding
word_index["<START>"] = 1  # start of a review
word_index["<UNK>"] = 2  # unknown word
word_index["<UNUSED>"] = 3  # unused
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post", maxlen=256)

vocab_size = 10000  # matches num_words above
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])  # binary loss for the single sigmoid output

# assumed split (not in the original notes): hold out the first 10,000 reviews for validation
x_val, partial_x_train = train_data[:10000], train_data[10000:]
y_val, partial_y_train = train_labels[:10000], train_labels[10000:]

model.fit(partial_x_train, partial_y_train, epochs=40, validation_data=(x_val, y_val))
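
The index shifting and padding used in the preprocessing above can be tried on a tiny vocabulary without downloading the dataset. Everything here is a hedged stand-in: the three-word vocabulary is invented, and `pad_post` only imitates `pad_sequences` with `padding="post"` (keeping the last `maxlen` items when a sequence is too long, as the Keras default truncation does).

```python
RESERVED = ("<PAD>", "<START>", "<UNK>", "<UNUSED>")

def shift_word_index(word_index):
    """Shift every index by 3 and reserve 0..3 for the special tokens."""
    shifted = {word: idx + 3 for word, idx in word_index.items()}
    for i, token in enumerate(RESERVED):
        shifted[token] = i
    return shifted

def pad_post(sequence, maxlen, value=0):
    """Keep at most the last maxlen items, then pad on the right."""
    kept = sequence[-maxlen:]
    return kept + [value] * (maxlen - len(kept))

def decode(sequence, word_index):
    """Map integer ids back to words."""
    reverse = {idx: word for word, idx in word_index.items()}
    return " ".join(reverse.get(i, "?") for i in sequence)

toy_index = shift_word_index({"the": 1, "movie": 2, "great": 3})  # hypothetical vocabulary
encoded = [1, 4, 5, 6]                    # <START> the movie great
padded = pad_post(encoded, maxlen=6)

print(padded)
print(decode(padded, toy_index))
```

Padding every review to the same length is what lets the reviews be stacked into one tensor for the Embedding layer.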

CIFAR-100

from keras.datasets import cifar100
(x_train, y_train), (x_test, y_test) = cifar100.load_data(label_mode='fine')
# lets you test directly on 100 classes

Summary
: Data sets

Training / Validation / Testing # data for training / data for evaluation
Evaluating a hypothesis # evaluating the hypothesis
Anomaly Detection # detecting abnormal data

: Learning

Online Learning vs Batch Learning # learning in real time / learning from fixed data
Fine tuning / Feature Extraction # feed in new data and adjust the weights slightly / take the features of the existing model and train a new layer on top
Efficient Models # lightweight models

: Sample Data