🎲 [AI] Multi-class Classification, Softmax, Cross Entropy

mandu · May 14, 2025


ํ•ด๋‹น ๊ธ€์€ FastCampus - '[skill-up] ์ฒ˜์Œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ์œ ์น˜์› ๊ฐ•์˜๋ฅผ ๋“ฃ๊ณ ,
์ถ”๊ฐ€ ํ•™์Šตํ•œ ๋‚ด์šฉ์„ ๋ง๋ถ™์—ฌ ์ž‘์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค.


1. Multi-class Classification

  • The problem of predicting one class out of several for an input x
  • The output is a probability vector: the Softmax function produces a probability for each class
  • The per-class probabilities sum to 1
  • The predicted class is the one with the highest probability

Binary Classification

  • Since the output of Sigmoid lies in 0 ~ 1, it can be interpreted as a probability P(y|x).
  • By defining the network output as the probability of the True class, classification can be recast as a probability problem (a short sketch follows below).

2. Softmax Function

  • A function that converts an input vector into a discrete probability distribution
  • The probabilities of all classes sum to 1
  • Formula:

$$\mathrm{softmax}_i(x) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$$

  • The classifier's output is interpreted as a probability vector over the classes (a quick numeric check follows below)

$$\mathrm{softmax}(x) = [\,\mathrm{softmax}_1(x)\ \dots\ \mathrm{softmax}_n(x)\,]$$
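A minimal sketch checking these properties with PyTorch (the logit values are made up for illustration):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])  # arbitrary scores for 3 classes
probs = F.softmax(logits, dim=-1)       # discrete probability distribution

print(probs)           # approximately [0.6590, 0.2424, 0.0986]
print(probs.sum())     # the probabilities sum to 1
print(probs.argmax())  # 0: prediction = class with the highest probability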

One-hot encoding

  • A representation where exactly one element is 1 and the rest are 0
  • ex)
    Dog [1, 0, 0]
    Cat [0, 1, 0]
    Bird [0, 0, 1]
  • As the number of categories grows, the vectors become high-dimensional and sparse → memory-inefficient
  • A common alternative is an embedding, especially in deep learning (compared in the table and sketch below)
| Item | One-hot Encoding | Embedding |
| --- | --- | --- |
| Representation | Sparse vector of 0s and 1s | Continuous real-valued vector |
| Dimension | One per class (usually very large) | Low-dimensional, set as a hyperparameter (e.g. 5) |
| Information | No inherent meaning (no order or similarity) | Similarity and meaning reflected in the vector space |
| Memory efficiency | Very inefficient (sparse) | Efficient (dense) |
| Example | [1, 0, 0], [0, 1, 0], [0, 0, 1] | [0.12, 0.3, ..., -0.4], [0.51, ..., 0.06], [0.02, ..., -0.11] |
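A minimal sketch contrasting the two representations (the vocabulary size of 3 and embedding dimension of 5 are illustrative choices):

import torch
import torch.nn as nn
import torch.nn.functional as F

labels = torch.tensor([0, 1, 2])            # Dog, Cat, Bird as class indices

one_hot = F.one_hot(labels, num_classes=3)  # sparse 0/1 vectors, one dimension per class
print(one_hot)

emb = nn.Embedding(num_embeddings=3, embedding_dim=5)  # learned dense vectors
print(emb(labels).shape)                    # torch.Size([3, 5])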

3. Cross Entropy Loss

  • The loss function used for multi-class classification

  • A generalization of Binary Cross Entropy

  • Used together with Softmax, making it well suited to classification problems
$$
CE(y_{1:N}, \hat{y}_{1:N}) = -\frac{1}{N} \sum_{i=1}^{N} y_i^{T} \log \hat{y}_i \\
= -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} y_{i,j} \log \hat{y}_{i,j} \\
= -\frac{1}{N} \sum_{i=1}^{N} \log P_{\theta}(y_i \mid x_i) \\
\text{where } y_{1:N} \in \mathbb{R}^{N \times m} \text{ and } \hat{y}_{1:N} \in \mathbb{R}^{N \times m}
$$

  • Training pushes up the probability assigned to the true class (a sketch comparing the formula with nn.CrossEntropyLoss follows below)
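A minimal sketch (made-up logits and labels) showing that nn.CrossEntropyLoss matches the formula above, i.e. the mean negative log-probability of the correct class:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3]])  # N=2 samples, m=3 classes
labels = torch.tensor([0, 1])             # correct class indices

ce = nn.CrossEntropyLoss()(logits, labels)

# Manual version of the formula: -1/N * sum_i log P(y_i | x_i)
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(2), labels].mean()

print(ce, manual)  # the two values agree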


4. Log-Softmax

  • Rather than taking the log inside the Cross Entropy Loss, apply it directly at the Softmax output to gain numerical stability and mathematical convenience

$$\text{log-softmax}_i(x) = \log \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$$

  • NLL (Negative Log Likelihood) loss can then be used

$$NLL(y_{1:N}, \hat{y}_{1:N}) = -\frac{1}{N} \sum_{i=1}^{N} \log P_{\theta}(y_i \mid x_i)$$

Here $\log P_{\theta}(y_i \mid x_i)$ is the log-softmax output for the correct class (see the sketch below).
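A minimal sketch (reusing made-up logits) showing that LogSoftmax followed by NLLLoss equals CrossEntropyLoss; this is exactly the combination used in the MNIST code later in this post:

import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3]])
labels = torch.tensor([0, 1])

log_probs = nn.LogSoftmax(dim=-1)(logits)   # log-softmax outputs
nll = nn.NLLLoss()(log_probs, labels)       # negative log likelihood
ce = nn.CrossEntropyLoss()(logits, labels)  # softmax + log + NLL in one step

print(nll, ce)  # identical values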


5. Confusion Matrix

  • Metrics such as accuracy and recall summarize a model's performance as a single number
  • But precisely because they are single numbers, they hide the model's detailed behavior
    • In particular, the result can be distorted under class imbalance
  • A confusion matrix lays out actual class vs. predicted class as a table

  • It makes it easy to see at a glance which classes the model handles poorly
  • Useful for product improvement, follow-up training, and so on (a small example follows below)
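A minimal sketch of a confusion matrix on made-up labels for three classes (0, 1, 2):

import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]  # actual classes
y_pred = [0, 1, 1, 1, 2, 0, 2]  # predicted classes

cm = confusion_matrix(y_true, y_pred)
print(pd.DataFrame(cm,
                   index=['true_%d' % i for i in range(3)],
                   columns=['pred_%d' % i for i in range(3)]))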

6. Regression vs. Classification Summary

| Task | Output | Activation | Loss Function |
| --- | --- | --- | --- |
| Regression | Real value | Linear (None) | MSE Loss |
| Binary Classification | 0 or 1 | Sigmoid | Binary Cross Entropy |
| Multi-class Classification | Class | Softmax | Cross Entropy Loss |

7. Summary

  • In multi-class classification, Softmax and Cross Entropy are used together to predict per-class probabilities and train the model
  • The confusion matrix is an important tool for analyzing per-class prediction performance
  • Don't look only at scalar metrics such as accuracy, recall, and F1 score; use the confusion matrix alongside them to evaluate the model from multiple angles

8. PyTorch Practice Code

  • MNIST Dataset
  • Each 28 x 28 pixel image is flattened into a 784-dimensional vector
    • (70000, 28, 28) → (70000, 784)
  • Each y value stores the index (0~9) of the corresponding one-hot encoded vector
    • [0,0,0,1,0,0,0,0,0,0] → 3 (saves memory)

$$x \in \mathbb{R}^{70000 \times 784}$$
$$y \in \mathbb{R}^{70000 \times 10}$$
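A minimal sketch of the reshaping and label encoding described above (random data as a stand-in for the real MNIST tensor):

import torch

images = torch.rand(70000, 28, 28)      # stand-in for the MNIST pixel tensor
flat = images.view(images.size(0), -1)  # (70000, 28, 28) -> (70000, 784)
print(flat.shape)

label_one_hot = torch.tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print(label_one_hot.argmax())  # 3: only the index is stored, saving memory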

# Classification with Deep Neural Networks

## Load MNIST Dataset

import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pandas as pd
from sklearn.metrics import confusion_matrix

from torchvision import datasets, transforms

train = datasets.MNIST(
    '../data', train=True, download=True,
    transform=transforms.Compose([
        transforms.ToTensor(),
    ]),
)
test = datasets.MNIST(
    '../data', train=False,
    transform=transforms.Compose([
        transforms.ToTensor(),
    ]),
)

def plot(x):
    img = (np.array(x.detach().cpu(), dtype='float')).reshape(28,28)

    plt.imshow(img, cmap='gray')
    plt.show()

plot(train.data[0])

x = train.data.float() / 255.
y = train.targets
print(x.shape, y.shape)
x = x.view(x.size(0), -1)
print(x.shape, y.shape)

input_size = x.size(-1)
output_size = int(max(y)) + 1

print('input_size: %d, output_size: %d' % (input_size, output_size))

# Train / Valid ratio
ratios = [.8, .2]

train_cnt = int(x.size(0) * ratios[0])
valid_cnt = int(x.size(0) * ratios[1])
test_cnt = len(test.data)
cnts = [train_cnt, valid_cnt]

print("Train %d / Valid %d / Test %d samples." % (train_cnt, valid_cnt, test_cnt))

indices = torch.randperm(x.size(0))

x = torch.index_select(x, dim=0, index=indices)
y = torch.index_select(y, dim=0, index=indices)

x = list(x.split(cnts, dim=0))
y = list(y.split(cnts, dim=0))

x += [(test.data.float() / 255.).view(test_cnt, -1)]
y += [test.targets]

for x_i, y_i in zip(x, y):
    print(x_i.size(), y_i.size())

## Build Model & Optimizer

model = nn.Sequential(
    nn.Linear(input_size, 500),
    nn.LeakyReLU(),
    nn.Linear(500, 400),
    nn.LeakyReLU(),
    nn.Linear(400, 300),
    nn.LeakyReLU(),
    nn.Linear(300, 200),
    nn.LeakyReLU(),
    nn.Linear(200, 100),
    nn.LeakyReLU(),
    nn.Linear(100, 50),
    nn.LeakyReLU(),
    nn.Linear(50, output_size),
    nn.LogSoftmax(dim=-1),
)

model

crit = nn.NLLLoss()

optimizer = optim.Adam(model.parameters())

## Move to GPU if it is available

device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')

model = model.to(device)

x = [x_i.to(device) for x_i in x]
y = [y_i.to(device) for y_i in y]

## Train

n_epochs = 1000
batch_size = 256
print_interval = 10

from copy import deepcopy

lowest_loss = np.inf
best_model = None

early_stop = 50
lowest_epoch = np.inf

train_history, valid_history = [], []

for i in range(n_epochs):
    indices = torch.randperm(x[0].size(0))
    x_ = torch.index_select(x[0], dim=0, index=indices)
    y_ = torch.index_select(y[0], dim=0, index=indices)
    
    x_ = x_.split(batch_size, dim=0)
    y_ = y_.split(batch_size, dim=0)
    
    train_loss, valid_loss = 0, 0
    y_hat = []
    
    for x_i, y_i in zip(x_, y_):
        y_hat_i = model(x_i)
        loss = crit(y_hat_i, y_i.squeeze())

        optimizer.zero_grad()
        loss.backward()

        optimizer.step()        
        train_loss += float(loss) # This is very important to prevent memory leak.

    train_loss = train_loss / len(x_)
        
    with torch.no_grad():
        x_ = x[1].split(batch_size, dim=0)
        y_ = y[1].split(batch_size, dim=0)
        
        valid_loss = 0
        
        for x_i, y_i in zip(x_, y_):
            y_hat_i = model(x_i)
            loss = crit(y_hat_i, y_i.squeeze())
            
            valid_loss += float(loss)
            
            y_hat += [y_hat_i]
            
    valid_loss = valid_loss / len(x_)
    
    train_history += [train_loss]
    valid_history += [valid_loss]
        
    if (i + 1) % print_interval == 0:
        print('Epoch %d: train loss=%.4e  valid_loss=%.4e  lowest_loss=%.4e' % (
            i + 1,
            train_loss,
            valid_loss,
            lowest_loss,
        ))
        
    if valid_loss <= lowest_loss:
        lowest_loss = valid_loss
        lowest_epoch = i
        
        best_model = deepcopy(model.state_dict())
    else:
        if early_stop > 0 and lowest_epoch + early_stop < i + 1:
            print("There is no improvement during last %d epochs." % early_stop)
            break

print("The best validation loss from epoch %d: %.4e" % (lowest_epoch + 1, lowest_loss))
model.load_state_dict(best_model)

## Loss History

plot_from = 0

plt.figure(figsize=(20, 10))
plt.grid(True)
plt.title("Train / Valid Loss History")
plt.plot(
    range(plot_from, len(train_history)), train_history[plot_from:],
    range(plot_from, len(valid_history)), valid_history[plot_from:],
)
plt.yscale('log')
plt.show()

## Let's see the result!

test_loss = 0
y_hat = []

with torch.no_grad():
    x_ = x[-1].split(batch_size, dim=0)
    y_ = y[-1].split(batch_size, dim=0)

    for x_i, y_i in zip(x_, y_):
        y_hat_i = model(x_i)
        loss = crit(y_hat_i, y_i.squeeze())

        test_loss += loss # Gradient is already detached.

        y_hat += [y_hat_i]

test_loss = test_loss / len(x_)
y_hat = torch.cat(y_hat, dim=0)

print("Validation loss: %.4e" % test_loss)

correct_cnt = (y[-1].squeeze() == torch.argmax(y_hat, dim=-1)).sum()
total_cnt = float(y[-1].size(0))

print('Accuracy: %.4f' % (correct_cnt / total_cnt))



# Move tensors to CPU so sklearn can convert them to numpy arrays.
pd.DataFrame(confusion_matrix(y[-1].cpu(), torch.argmax(y_hat, dim=-1).cpu()),
             index=['true_%d' % i for i in range(10)],
             columns=['pred_%d' % i for i in range(10)])

