Checking for Overfitting

TaeHyun Lee · May 8, 2023

Datasets

In deep learning, data is typically split into training, validation, and test sets.

Training dataset: the data used to train the model.
Validation dataset: the data used while the model is training, for tasks such as tuning hyperparameters.
Test dataset: the data used to evaluate the model's generalization performance.
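To make the three roles concrete, here is a minimal, dependency-free sketch of a three-way split. The function name `three_way_split`, the 60/20/20 ratios, and the fixed seed are assumptions chosen for illustration; the post itself uses `train_test_split` and Keras's `validation_split` for the real split.

```python
import random

def three_way_split(n_samples, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test groups.

    Hypothetical helper for illustration; not part of scikit-learn or Keras.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so each group is a random sample

    n_test = round(n_samples * test_ratio)
    n_val = round(n_samples * val_ratio)

    test_idx = indices[:n_test]               # held out until the very end
    val_idx = indices[n_test:n_test + n_val]  # used for tuning during training
    train_idx = indices[n_test + n_val:]      # used to fit the weights
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The point to notice is that the three index groups never overlap, so the test score is measured on samples the model has not seen during training or tuning.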

Predicting the Type of Wine

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
import pandas as pd

# Fetch the prepared data from GitHub (notebook shell command).
!git clone https://github.com/taehojo/data.git

# Load the wine data. The file has no header row.
df = pd.read_csv('./data/wine.csv', header=None)

# Store the wine attributes in X and the wine class in y.
X = df.iloc[:,0:12]
y = df.iloc[:,12]

# Split the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

# Define the model structure.
model = Sequential()
model.add(Dense(30, input_dim=12, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Compile the model.
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Train the model. 25% of the training set is held out for validation,
# which is 0.8 x 0.25 = 0.2 of the full data.
history = model.fit(X_train, y_train, epochs=50, batch_size=500,
                    validation_split=0.25)

# Print the result on the test set.
score = model.evaluate(X_test, y_test)
print('Test accuracy:', score[1])
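The `0.8 x 0.25 = 0.2` arithmetic in the code above can be checked with a few lines of plain Python. This is only a sketch: `split_sizes` is a hypothetical helper, the sample count of 1000 is made up, and the rounding rules (test count rounded up, training count rounded down) are my assumption about how the two libraries behave.

```python
import math

def split_sizes(n_samples, test_size=0.2, validation_split=0.25):
    """Return (train, validation, test) sample counts for the two-stage split.

    Hypothetical helper; the rounding rules below are assumptions.
    """
    # Stage 1: train_test_split sets aside the test set.
    n_test = math.ceil(n_samples * test_size)
    n_train_full = n_samples - n_test
    # Stage 2: fit() keeps the first part for training and the rest for validation.
    n_train = math.floor(n_train_full * (1.0 - validation_split))
    n_val = n_train_full - n_train
    return n_train, n_val, n_test

print(split_sizes(1000))
```

With 1000 samples this works out to a 60/20/20 split. One detail worth knowing: Keras takes the validation samples from the end of the arrays passed to `fit()`, before any per-epoch shuffling, so the `shuffle=True` in `train_test_split` is what keeps the validation portion random here.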

Overfitting

Overfitting is one of the most common problems in deep learning: the model has learned the training data too closely. In other words, the model performs well on the training data, but its generalization performance drops on new data.

Checking for Overfitting with a Graph

import numpy as np
import matplotlib.pyplot as plt

# Train for longer so the trend shows up in the graph (this may take a while depending on your machine).
history = model.fit(X_train, y_train, epochs=2000, batch_size=500,
                    validation_split=0.25)

# Look at the training results stored in history.
hist_df = pd.DataFrame(history.history)
hist_df

# Store the validation-set loss in y_vloss.
y_vloss = hist_df['val_loss']

# Store the training-set loss in y_loss.
y_loss = hist_df['loss']

# Set the x values, then plot the validation loss in red and the training loss in blue.
x_len = np.arange(len(y_loss))
plt.plot(x_len, y_vloss, "o", c="red", markersize=2, label='Validation loss')
plt.plot(x_len, y_loss, "o", c="blue", markersize=2, label='Training loss')

plt.legend(loc='upper right')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
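Besides reading the plot by eye, the same two curves can be summarized numerically. The sketch below is an illustration only: `summarize_history` is a hypothetical helper and the loss values are made up, standing in for `history.history['loss']` and `history.history['val_loss']`.

```python
def summarize_history(loss, val_loss):
    """Return the epoch with the lowest validation loss and the final train/validation gap.

    Hypothetical helper for illustration.
    """
    # Epoch (0-based) at which the validation loss was lowest.
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    # A gap that keeps growing after best_epoch is the usual sign of overfitting.
    final_gap = val_loss[-1] - loss[-1]
    return best_epoch, final_gap

# Made-up curves: training loss keeps falling, validation loss turns upward.
loss = [0.60, 0.40, 0.30, 0.22, 0.15, 0.10]
val_loss = [0.62, 0.45, 0.38, 0.36, 0.41, 0.47]

best_epoch, final_gap = summarize_history(loss, val_loss)
print(best_epoch, round(final_gap, 2))
```

In this toy data the validation loss bottoms out at epoch 3 and climbs afterward while the training loss keeps dropping, which is the pattern to look for in the real graph.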

Stopping Training Automatically to Prevent Overfitting

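from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Configure when training stops automatically:
# stop once val_loss has not improved for 20 consecutive epochs.
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=20)

# Set the folder and file name where the best model is saved.
# Note: recent Keras releases may reject the .hdf5 extension and expect .keras or .h5 instead.
modelpath = "./data/model/Ch14-4-bestmodel.hdf5"

# Save the model only when val_loss improves, overwriting the previous best.
checkpointer = ModelCheckpoint(filepath=modelpath, monitor='val_loss',
                               verbose=0, save_best_only=True)

# Train the model.
history = model.fit(X_train, y_train, epochs=2000, batch_size=500,
                    validation_split=0.25, verbose=1,
                    callbacks=[early_stopping_callback, checkpointer])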
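To show what `patience` means, here is a simplified sketch of the stopping rule in plain Python. It is not the Keras implementation: `early_stopping_epoch` is a hypothetical function, the loss values are made up, and options such as `min_delta` are left out.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops.

    Simplified sketch of a patience rule; hypothetical, not the Keras source.
    """
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, current in enumerate(val_losses):
        if current < best:
            best = current  # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is at epoch 5, then no improvement.
val_losses = [0.50, 0.42, 0.40, 0.41, 0.43, 0.39, 0.44, 0.45, 0.46]
stop = early_stopping_epoch(val_losses, patience=3)
print(stop)
```

Note that the stop happens several epochs after the best one, so the model in memory at that point is not the best model. That is why the code above pairs `EarlyStopping` with `ModelCheckpoint(save_best_only=True)`; passing `restore_best_weights=True` to `EarlyStopping` is another way to get the best weights back.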
