🎲[AI] Overfitting & Underfitting

mandu · May 8, 2025

[AI]


This post was written after taking the FastCampus course '[skill-up] 처음부터 시작하는 딥러닝 유치원' (Deep Learning Kindergarten from Scratch), with additional material from my own further study.


1. What is Overfitting?

  • Our goal: a model that predicts well not only on the training data but also on new (unseen) data
    → In other words, minimizing the training error is not the final goal
  • Overfitting: the phenomenon where the generalization error becomes noticeably higher than the training error
    • The model ends up learning even the bias and noise of the training data
    • Overfitting is not necessarily a bad thing in itself
      → It is one way to check whether the model has enough capacity (of course, once checked, the overfitting must still be fixed)


2. What is Underfitting?

  • ๋ชจ๋ธ์˜ capacity(depth, width)๊ฐ€ ๋ถ€์กฑํ•ด training error ๊ฐ€ ์ถฉ๋ถ„ํžˆ ๋‚ฎ์ง€ ์•Š์€ ์ƒํƒœ

3. Overfitting vs Underfitting

  • Overfitting: learns even unnecessary patterns (noise) in the training data
  • Underfitting: fails to learn even the basic patterns in the data
  • Well-generalized: performs well on both the training data and new data


4. Using a Validation Set to Prevent Overfitting

  • Randomly split the data into train/validation/test sets, e.g., 6:2:2
  • Never use the validation or test set for training (the same rule applies when fitting a scaler; see the short sketch after the table below)
  • Train on the training set; at the end of every epoch, estimate generalization performance on the validation set
  • If the training error is low but the validation error is high, that is a sign of overfitting
  • Early Stopping: stop training when the validation loss has not improved for a set number of epochs
๋ฐ์ดํ„ฐ์…‹๋ชฉ์ 
Training SetํŒŒ๋ผ๋ฏธํ„ฐ ํ•™์Šต
Validation Setgeneralization ๋ฐ hyperparameter ๊ฒ€์ฆ
Test Set์ตœ์ข… ๋ชจ๋ธ ๋ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฒ€์ฆ
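
As a quick illustration of the scaler rule above, here is a minimal sketch I added (toy data, illustrative variable names): the StandardScaler statistics are estimated on the training split only and then reused unchanged for the other splits.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))    # toy feature matrix

n_train, n_valid = 600, 200                               # 6:2:2 split
x_train, x_valid, x_test = np.split(data, [n_train, n_train + n_valid])

scaler = StandardScaler()
scaler.fit(x_train)                     # statistics come from train only

x_train = scaler.transform(x_train)
x_valid = scaler.transform(x_valid)     # reuse the train statistics
x_test = scaler.transform(x_test)       # reuse the train statistics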

5. Typical Training Procedure

  1. ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
  2. train set์œผ๋กœ feed-forward, loss ๊ณ„์‚ฐ, SGD ์ˆ˜ํ–‰
  3. training error ๊ณ„์‚ฐ
  4. validation set์œผ๋กœ feed-forward, loss ๊ณ„์‚ฐ (ํ•™์Šต์€ X)
  5. validation error ๊ณ„์‚ฐ
  6. best ๋ชจ๋ธ ์ €์žฅ (์ตœ์ € validation loss ๊ธฐ์ค€)
  7. test set์œผ๋กœ ์ตœ์ข… ํ‰๊ฐ€
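
The list above maps onto the loop below, a compact self-contained toy version I added as a preview; the full California-housing implementation follows in section 6.

import torch
import torch.nn as nn
import torch.nn.functional as F
from copy import deepcopy

# Step 1: toy regression data, split 6:2:2.
torch.manual_seed(0)
x = torch.randn(1000, 4)
y = x @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]]) + 0.1 * torch.randn(1000, 1)
x_train, x_valid, x_test = x[:600], x[600:800], x[800:]
y_train, y_valid, y_test = y[:600], y[600:800], y[800:]

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

best_loss, best_state, best_epoch, patience = float("inf"), None, 0, 20

for epoch in range(200):
    # Steps 2-3: feed-forward the train set, compute the loss, run SGD.
    train_loss = F.mse_loss(model(x_train), y_train)
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    # Steps 4-5: validation pass only, no parameter update.
    with torch.no_grad():
        valid_loss = float(F.mse_loss(model(x_valid), y_valid))

    # Step 6: keep the weights with the lowest validation loss so far.
    if valid_loss < best_loss:
        best_loss, best_epoch = valid_loss, epoch
        best_state = deepcopy(model.state_dict())
    elif epoch - best_epoch >= patience:    # early stopping
        break

# Step 7: final evaluation on the test set with the best weights restored.
model.load_state_dict(best_state)
with torch.no_grad():
    print("test MSE: %.4f" % float(F.mse_loss(model(x_test), y_test)))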


6. PyTorch Practice Code

# Split into Train / Valid / Test set

## Load Dataset from sklearn

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()

df = pd.DataFrame(california.data, columns=california.feature_names)
df["Target"] = california.target
df.tail()

## Convert to PyTorch Tensor

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

data = torch.from_numpy(df.values).float()

x = data[:, :-1]
y = data[:, -1:]

print(x.size(), y.size())

# Train / Valid / Test ratio
ratios = [.6, .2, .2]

train_cnt = int(data.size(0) * ratios[0])
valid_cnt = int(data.size(0) * ratios[1])
test_cnt = data.size(0) - train_cnt - valid_cnt
cnts = [train_cnt, valid_cnt, test_cnt]

print("Train %d / Valid %d / Test %d samples." % (train_cnt, valid_cnt, test_cnt))

# Shuffle before split.
indices = torch.randperm(data.size(0))
x = torch.index_select(x, dim=0, index=indices)
y = torch.index_select(y, dim=0, index=indices)

# Split train, valid and test set with each count.
x = list(x.split(cnts, dim=0))
y = y.split(cnts, dim=0)

for x_i, y_i in zip(x, y):
    print(x_i.size(), y_i.size())

## Preprocessing

scaler = StandardScaler()
scaler.fit(x[0].numpy()) # You must fit with train data only.

x[0] = torch.from_numpy(scaler.transform(x[0].numpy())).float()
x[1] = torch.from_numpy(scaler.transform(x[1].numpy())).float()
x[2] = torch.from_numpy(scaler.transform(x[2].numpy())).float()

df = pd.DataFrame(x[0].numpy(), columns=california.feature_names)
df.tail()

## Build Model & Optimizer

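# A small fully-connected regression network: input_dim -> 6 -> 5 -> 4 -> 3 -> output_dim,
# with LeakyReLU activations between the linear layers.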
model = nn.Sequential(
    nn.Linear(x[0].size(-1), 6),
    nn.LeakyReLU(),
    nn.Linear(6, 5),
    nn.LeakyReLU(),
    nn.Linear(5, 4),
    nn.LeakyReLU(),
    nn.Linear(4, 3),
    nn.LeakyReLU(),
    nn.Linear(3, y[0].size(-1)),
)

model

optimizer = optim.Adam(model.parameters())

## Train

n_epochs = 4000
batch_size = 256
print_interval = 100

from copy import deepcopy

lowest_loss = np.inf
best_model = None

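# Stop training if the validation loss has not improved for this many consecutive epochs.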
early_stop = 100
lowest_epoch = np.inf

train_history, valid_history = [], []

for i in range(n_epochs):
    # Shuffle before mini-batch split.
    indices = torch.randperm(x[0].size(0))
    x_ = torch.index_select(x[0], dim=0, index=indices)
    y_ = torch.index_select(y[0], dim=0, index=indices)
    # |x_| = (total_size, input_dim)
    # |y_| = (total_size, output_dim)
    
    x_ = x_.split(batch_size, dim=0)
    y_ = y_.split(batch_size, dim=0)
    # |x_[i]| = (batch_size, input_dim)
    # |y_[i]| = (batch_size, output_dim)
    
    train_loss, valid_loss = 0, 0
    y_hat = []
    
    for x_i, y_i in zip(x_, y_):
        # |x_i| = |x_[i]|
        # |y_i| = |y_[i]|
        y_hat_i = model(x_i)
        loss = F.mse_loss(y_hat_i, y_i)

        optimizer.zero_grad()
        loss.backward()

        optimizer.step()        
        train_loss += float(loss)

    train_loss = train_loss / len(x_)

    # Tell PyTorch not to build the computation graph during validation.
    with torch.no_grad():
        # You don't need to shuffle the validation set.
        # Only split is needed.
        x_ = x[1].split(batch_size, dim=0)
        y_ = y[1].split(batch_size, dim=0)
        
        valid_loss = 0
        
        for x_i, y_i in zip(x_, y_):
            y_hat_i = model(x_i)
            loss = F.mse_loss(y_hat_i, y_i)
            
            valid_loss += float(loss)
            
            y_hat += [y_hat_i]
            
    valid_loss = valid_loss / len(x_)
    
    # Log each loss to plot after training is done.
    train_history += [train_loss]
    valid_history += [valid_loss]
        
    if (i + 1) % print_interval == 0:
        print('Epoch %d: train loss=%.4e  valid_loss=%.4e  lowest_loss=%.4e' % (
            i + 1,
            train_loss,
            valid_loss,
            lowest_loss,
        ))
        
    if valid_loss <= lowest_loss:
        lowest_loss = valid_loss
        lowest_epoch = i
        
        # 'state_dict()' returns model weights as key-value.
        # Take a deep copy, if the valid loss is lowest ever.
        best_model = deepcopy(model.state_dict())
    else:
        if early_stop > 0 and lowest_epoch + early_stop < i + 1:
            print("There is no improvement during last %d epochs." % early_stop)
            break

print("The best validation loss from epoch %d: %.4e" % (lowest_epoch + 1, lowest_loss))

# Load best epoch's model.
model.load_state_dict(best_model)

## Loss History

plot_from = 10

plt.figure(figsize=(20, 10))
plt.grid(True)
plt.title("Train / Valid Loss History")
plt.plot(
    range(plot_from, len(train_history)), train_history[plot_from:],
    range(plot_from, len(valid_history)), valid_history[plot_from:],
)
plt.yscale('log')
plt.show()

## Let's see the result!

test_loss = 0
y_hat = []

with torch.no_grad():
    x_ = x[2].split(batch_size, dim=0)
    y_ = y[2].split(batch_size, dim=0)

    for x_i, y_i in zip(x_, y_):
        y_hat_i = model(x_i)
        loss = F.mse_loss(y_hat_i, y_i)

        test_loss += float(loss) # No graph is built under no_grad.

        y_hat += [y_hat_i]

test_loss = test_loss / len(x_)
y_hat = torch.cat(y_hat, dim=0)

sorted_history = sorted(zip(train_history, valid_history),
                        key=lambda x: x[1])

print("Train loss: %.4e" % sorted_history[0][0])
print("Valid loss: %.4e" % sorted_history[0][1])
print("Test loss: %.4e" % test_loss)

df = pd.DataFrame(torch.cat([y[2], y_hat], dim=1).detach().numpy(),
                  columns=["y", "y_hat"])

sns.pairplot(df, height=5)
plt.show()
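
The MSE above is in squared units of the target. As a quick interpretation aid (my own addition, reusing the y[2] and y_hat tensors from the cell above), we can report RMSE in the target's own units and R^2 against a mean-only baseline:

# RMSE in target units, and R^2 relative to always predicting the mean.
mse = float(F.mse_loss(y_hat, y[2]))
rmse = mse ** 0.5

ss_res = float(((y[2] - y_hat) ** 2).sum())
ss_tot = float(((y[2] - y[2].mean()) ** 2).sum())
r2 = 1 - ss_res / ss_tot

print("Test RMSE: %.4f" % rmse)
print("Test R^2 : %.4f" % r2)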