『모두의 딥러닝』 week_7

SungchulCHA·2023년 12월 24일

Deep Learning tensorflow

AMD DL

목록 보기

6/12

딥러닝 활용하기

Chap 16. 컨볼루션 신경망(CNN)

MNIST 데이터 셋 : 미국 국립표준기술원(NIST)이 만든 손글씨 데이터

이미지를 인식하는 원리

Python

from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

import matplotlib.pyplot as plt
import sys

(X_train, y_train), (X_test, y_test) = mnist.load_data()

print("학습셋 이미지 수 : %d 개" % (X_train.shape[0]))
print("테스트셋 이미지 수 : %d 개" % (X_test.shape[0]))

plt.imshow(X_train[0], cmap='Greys')
plt.show()

for x in X_train[0]:
    for i in x:
        sys.stdout.write("%-3s" % i)
    sys.stdout.write('\n')

X_train = X_train.reshape(X_train.shape[0], 784)
X_train = X_train.astype('float64')
X_train = X_train / 255

X_test = X_test.reshape(X_test.shape[0], 784).astype('float64') / 255

print("class : %d " % (y_train[0]))

y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

print(y_train[0])

MNIST 데이터는 텐서플로의 케라스 API를 이용하여 불러올 수 있음
mnist 클래스의 load_data() 메서드로 학습용과 테스트용으로 입력, 출력 나눌 수 있다
하나의 데이터는 $28 \times 28$ 크기의 픽셀, channel은 8UC1
reshape(총 샘플 수, 1차원 속성의 개수) : 가로, 세로 이미지 데이터를 784개의 1차원 배열로 바꿔주는 함수
정규화(normalization) : 0 ~ 255 → 0~1
astype() : 데이터의 타입을 정수에서 실수로 바꾸고
최대값 즉, 255로 나눔
y_train : 클래스 값 = 출력 값 즉, 어떤 숫자인지 = 0 ~ 9 → 원-핫 인코딩
np_utils.to_categorical(클래스, 클래스 개수) : 10개의 2진수 데이터 만들고 하나만 1로 만듦

딥러닝 기본 프레임 만들기

Python

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint,EarlyStopping
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

import matplotlib.pyplot as plt
import numpy as np
import os

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 784).astype('float32') / 255
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

model = Sequential()
model.add(Dense(512, input_dim=784, activation='relu'))
model.add(Dense(10, activation='softmax'))
# model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

MODEL_DIR = './data/model/'
if not os.path.exists(MODEL_DIR):
    os.mkdir(MODEL_DIR)

modelpath="./data/model/MNIST_MLP.hdf5"
checkpointer = ModelCheckpoint(filepath=modelpath, monitor='val_loss', verbose=1, save_best_only=True)
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(X_train, y_train, validation_split=0.25, epochs=30, batch_size=200, verbose=0, callbacks=[early_stopping_callback,checkpointer])

print("\n Test Accuracy: %.4f" % (model.evaluate(X_test, y_test)[1]))

y_vloss = history.history['val_loss']
y_loss = history.history['loss']

x_len = np.arange(len(y_loss))
plt.plot(x_len, y_vloss, marker='.', c="red", label='Testset_loss')
plt.plot(x_len, y_loss, marker='.', c="blue", label='Trainset_loss')

plt.legend(loc='upper right')
plt.grid()
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()

mnist.load_data() : mnist 데이터를 6만 개의 학습셋과 1만 개의 테스트셋으로 나누고 정규화, 바이너리화
모델의 인풋 값 = 속성 개수는 784개, 은닉층은 1개, 노드는 512개, 출력 값 = 클래스는 10개
은닉층 활성화 함수 : relu, 출력층 활성화 함수 : softmax, 오차 함수 : categorical_crossentropy, 최적화 함수 : adam

relu

softmax

$S(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

ModelCheckpoint로 가장 좋은 모델 저장, EarlyStoppping으로 10번까지 이전 실행보다 monitor값이 개선이 안되면 미리 종료
검증셋을 학습셋의 0.25만큼 설정
model.evaluate()[0] : 학습셋의 오차(loss), model.evaluate()[1] : 학습셋의 정확도(acc)
plt.legend : 범례
plt.grid() : 그리드

plt.show()

컨볼루션 신경망(CNN)

CNN : 커널(슬라이딩 윈도)을 도입하는 기법
가중치를 곱해서 더함 즉, 컨볼루션 연산
컨볼루션(합성곱) 층 : 커널을 적용하여 만들어진 결과, 새로운 층
커널이 여러 개면 컨볼루션 층도 여러 개

컨볼루션 층 적용 전 모델의 도식

model.add(Conv2D(32, kernel_size(3,3), input_shape=(28,28,1), activation='relu'))

model.add() : 층을 추가
Conv2D() : 컨볼루션 층
즉, 컨볼루션 층을 추가

32 : 커널의 개수 → 하나의 데이터가 32개의 필터를 거친 후 활성 함수를 거침
kernel_size() : (행, 열)
input_shape() : Dense와 마찬가지로 처음에는 입력 값 알려줘야함 (행, 열, 채널 수)

컨볼루션 층 2개 적용

맥스 풀링, 드롭아웃, 플래튼

풀링(pooling) 또는 서브 샘플링(sub sampling) : 컨볼루션 층을 통해 이미지 특징을 도출해도 결과가 크고 복잡하면 한 번 더 축소하는 과정
맥스 풀링(max pooling) : 정해진 구역 안에서 최댓값을 뽑아내는 풀링 기법
구역을 나누고 각 구역에서 가장 큰 값 추출
평균 풀링(average pooling) : 정해진 구역 안에서 평균값을 뽑아내는 풀링 기법
드롭아웃(drop out) : 은닉층에 배치된 노드 중 일부를 임의로 꺼 줌
epochs와 node, Dense가 많다고 좋은거 아님. 과적합 이해하기 참조
플래튼(flatten) : 2차원 배열을 1차원으로 바꾸어주는 거. 15장 주택 가격 예측 모델 참조

Convolution, pooling, drop out, flatten 적용한 모델의 도식

Python

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.callbacks import ModelCheckpoint,EarlyStopping
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

import matplotlib.pyplot as plt
import numpy as np
import os

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32') / 255
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=(28, 28, 1), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128,  activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

MODEL_DIR = './data/model/'
if not os.path.exists(MODEL_DIR):
    os.mkdir(MODEL_DIR)

modelpath="./data/model/MNIST_CNN.hdf5"
checkpointer = ModelCheckpoint(filepath=modelpath, monitor='val_loss', verbose=1, save_best_only=True)
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(X_train, y_train, validation_split=0.25, epochs=30, batch_size=200, verbose=0, callbacks=[early_stopping_callback,checkpointer])

print("\n Test Accuracy: %.4f" % (model.evaluate(X_test, y_test)[1]))

y_vloss = history.history['val_loss']
y_loss = history.history['loss']
x_len = np.arange(len(y_loss))

plt.plot(x_len, y_vloss, marker='.', c="red", label='Testset_loss')
plt.plot(x_len, y_loss, marker='.', c="blue", label='Trainset_loss')
plt.legend(loc='upper right')
plt.grid()
plt.xlabel('epoch')
plt.ylabel('loss')

plt.show()

속성 전처리
.reshape(X_train.shape[0], 28, 28, 1) : 가로 28, 세로 28, 채널 1개의 2차원 데이터로 가공
.astype('float32') / 255 : 4byte 실수형으로 형변환, 정규화
클래스 전처리
to_categorical() : 원-핫 인코딩 or 바이너리화

(28, 28, 1) 데이터가 인풋 값
가로 3, 세로 3 크기의 커널 32개, 활성함수는 relu로 은닉층 추가
커널의 가중치는 랜덤 값임
import tensorflow as tf
import matplotlib.pyplot as plt
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), input_shape=(28, 28, 1), activation='relu'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
first_conv_layer = model.layers[0]
weights = first_conv_layer.get_weights()[0]
for i in range(32):
    print(f'Kernel {i+1}\n{weights[:, :, 0, i]}')
Kernel 15
[[ 1.59433484e-02 7.49469399e-02 -1.12079814e-01]

[-1.16571039e-02 3.32824588e-02 1.32951811e-01]

[ 2.85079628e-02 5.33312559e-05 -1.22211546e-01]]

Kernel 16
[[-0.02556068 -0.02660926 0.10065147]

[-0.04595307 -0.04553822 -0.12601563]

[ 0.09611009 0.03309062 0.04829112]]
컨볼루션 층 2개 추가 후, pooling
풀링 후 노드의 1/4을 drop out
해당 데이터는 (28, 28, 1)의 2차원 배열이므로 Flatten()을 이용하여 1차원 배열로 만듦
노드가 128개인 은닉층 추가
다시 Dropout() 으로 절반 비활성화
출력 값 즉, 클래스 개수는 10개, 활성함수 : softmax
'val_loss' : 모델을 검증 셋에 적용해 얻은 오차
'val_loss'가 가장 작은 모델을 저장, 10번 더 해보고 좋아지지 않으면 종료

학습 진행에 따른 학습셋과 테스트셋의 오차 변화

Chap 17. 딥러닝을 이용한 자연어 처리

자연어 처리(Natural Language Processing, NLP) : 음성이나 텍스트를 컴퓨터가 인식하고 처리하는 것.

텍스트의 토큰화

토큰(token) : 텍스트를 단어별, 문장별, 형태소별로 작게 나누어진 하나의 단위
토큰화(tokenization) : 입력된 텍스트를 잘게 나누는 과정

from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = '해보지 않으면 해낼 수 없다.'

result = text_to_word_sequence(text)
print("\n원문:\n", text)
print("\n토큰화:\n", result)

text_to_word_sequence() : 문장을 토큰화함
반환값은 리스트

원문:
해보지 않으면 해낼 수 없다.

토큰화:
['해보지', '않으면', '해낼', '수', '없다']

단어의 가방(bag of words) : 전처리 기법 중 하나. 같은 단어끼리 따로따로 가방에 담은 후 각 가방에 몇 개의 단어가 들어있는지 세는 기법

from tensorflow.keras.preprocessing.text import Tokenizer

docs = ['먼저 텍스트의 각 단어를 나누어 토큰화합니다.',
		'텍스트의 단어로 토큰화해야 딥러닝에서 인식됩니다.',
        '토큰화한 결과는 딥러닝에서 사용할 수 있습니다.'
       ]
 
token = Tokenizer()
token.fit_on_texts(docs)

print("\n단어 카운트:\n", token.word_counts)

print("\n문장 카운트: ", token.document_count)

print("\n각 단어가 몇 개의 문장에 포함되어 있는가:\n", token.word_docs)

print("\n각 단어에 매겨진 인덱스 값:\n", token.word_index)

token.word_counts : 단어의 빈도수 계산
반환값은 딕셔너리, OrderedDict 클래스에 담겨있음

단어 카운트:
OrderedDict([('먼저', 1), ('텍스트의', 2), ('각', 1), ('단어를', 1), ('나누어', 1), ('토큰화합니다', 1), ('단어로', 1), ('토큰화해야', 1), ('딥러닝에서', 2), ('인식됩니다', 1), ('토큰화한', 1), ('결과는', 1), ('사용할', 1), ('수', 1), ('있습니다', 1)])

token.document_count() : 문장 개수

문장 카운트: 3

token.word_docs() : 각 단어들이 몇 개의 문장에 나오는지, 순서는 랜덤
반환값은 딕셔너리, defaultdict 클래스에 담겨있음

각 단어가 몇 개의 문장에 포함되어 있는가:
defaultdict(<class 'int'>, {'텍스트의': 2, '나누어': 1, '각': 1, '먼저': 1, '토큰화합니다': 1, '단어를': 1, '인식됩니다': 1, '딥러닝에서': 2, '단어로': 1, '토큰화해야': 1, '수': 1, '결과는': 1, '토큰화한': 1, '있습니다': 1, '사용할': 1})

token.word_index() : 단어에 인덱스 매기기
반환값은 딕셔너리

각 단어에 매겨진 인덱스 값:
{'텍스트의': 1, '딥러닝에서': 2, '먼저': 3, '각': 4, '단어를': 5, '나누어': 6, '토큰화합니다': 7, '단어로': 8, '토큰화해야': 9, '인식됩니다': 10, '토큰화한': 11, '결과는': 12, '사용할': 13, '수': 14, '있습니
다': 15

단어의 원-핫 인코딩

from tensorflow.keras.preprocessing.text import Tokenizer

text = "오랫동안 꿈꾸는 이는 그 꿈을 닮아간다"

token = Tokenizer()
x = token.fit_on_texts([text])
x = token.texts_to_sequences([text])

print(x)

text를 토큰의 인덱스로만 채워진 새로운 배열을 생성

[[1, 2, 3, 4, 5, 6]]

from tensorflow.keras.preprocessing.text import Tokenizer

text = "오랫동안 꿈꾸는 이는 그 꿈을 닮아간다"

token = Tokenizer()
x = token.fit_on_texts([text])
x = token.texts_to_sequences([text])

from tensorflow.keras.utils import to_categorical

word_size = len(token.word_index) + 1
x = to_categorical(x, num_classes=word_size)

print(x)

문장을 이루고 있는 단어들이 위에서부터 차례로 벡터화

[[[0. 1. 0. 0. 0. 0. 0.]

[0. 0. 1. 0. 0. 0. 0.]

[0. 0. 0. 1. 0. 0. 0.]

[0. 0. 0. 0. 1. 0. 0.]

[0. 0. 0. 0. 0. 1. 0.]

[0. 0. 0. 0. 0. 0. 1.]]]

단어 임베딩

단어 임베딩(word embedding) : 주어진 배열을 정해진 길이로 압축시킴

from tensorflow.keras.layers import Embedding

model = Sequential()
model.add(Embedding(16,4))

Embedding(16, 4, input_length=2) : (입력, 출력, 매번 입력될 단어의 개수)
16단어를 입력, 출력 벡터의 크기는 4, 매번 2개씩 넣음

텍스트를 읽고 긍정, 부정 예측하기

텍스트 리뷰 자료 지정
클래스 지정(긍정 : 1, 부정 : 0)
토큰화
토큰에 지정된 인덱스로 새로운 배열 생성
패딩
임베딩 함수에 넣을 입력, 출력, 단어 수 지정

패딩(padidng) : 길이를 똑같이 맞추어 주는 작업 짧은 부분은 0을 채우고, 긴 부분은 자름

Python

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Embedding
from tensorflow.keras.utils import to_categorical
from numpy import array

docs = ["너무 재밌네요","최고예요","참 잘 만든 영화예요","추천하고 싶은 영화입니다","한번 더 보고싶네요","글쎄요","별로예요","생각보다 지루하네요","연기가 어색해요","재미없어요"]

classes = array([1,1,1,1,1,0,0,0,0,0])

token = Tokenizer()
token.fit_on_texts(docs)
print(token.word_index)

x = token.texts_to_sequences(docs)
print("\n리뷰 텍스트, 토큰화 결과:\n",  x)

padded_x = pad_sequences(x, 4)
print("\n패딩 결과:\n", padded_x)

word_size = len(token.word_index) + 1

model = Sequential()
model.add(Embedding(word_size, 8, input_length=4))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_x, classes, epochs=20)
print("\n Accuracy: %.4f" % (model.evaluate(padded_x, classes)[1]))

classes : 긍정이면 1, 부정이면 0
token.texts_to_sequences(docs) : 리뷰들을 토큰화

리뷰 텍스트, 토큰화 결과:
[[1, 2], [3], [4, 5, 6, 7], [8, 9, 10], [11, 12, 13], [14], [15], [16, 17], [18, 19], [20]]

pad_sequences(x, 4) : 길이가 4가 되도록 패딩

패딩 결과:
[[ 0 0 1 2]

[ 0 0 0 3]

[ 4 5 6 7]

[ 0 8 9 10]

[ 0 11 12 13]

[ 0 0 0 14]

[ 0 0 0 15]

[ 0 0 16 17]

[ 0 0 18 19]

[ 0 0 0 20]]