Deep Learning Kindergarten 4 - Classification

조영유 · April 7, 2025


0. Introduction

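Do you remember the dataset from lesson 1 of Deep Learning Kindergarten? The task attached to that dataset was clearly image classification: deciding whether a picture shows an airplane, a car, a cat, or a dog. But there is a problem. So far we have only learned how to do linear regression. So how are we supposed to do classification?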

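Most machine learning textbooks say that machine learning splits into regression and classification, and then teach regression first. The reason is that, although regression and classification look like separate branches at first glance, classification is really a kind of expansion pack for regression. In other words, classification runs on the same machinery as regression(?) and builds on top of it. What kind of nonsense is that! You may well wonder, so read on.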

1. How classification works

1) One-hot

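Suppose I look at photos and decide whether each one shows a dog, a cat, or a human. I would attach a label to each photo: dog is 0 ~~the coding world starts from 0!~~, cat is 1, and human is 2. The answer key then looks like the following. This notation is called one-hot encoding.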

y_{dog} = (1, 0, 0), y_{cat} = (0, 1, 0), y_{human} = (0, 0, 1)

You do not actually need to implement this yourself. All you have to do is label dogs 0 and cats 1.
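
For the curious, here is a minimal sketch of what the conversion from label numbers to one-hot vectors looks like. It assumes PyTorch and uses the dog/cat/human numbering from the text; the `labels` tensor is made up for illustration.

```python
import torch
import torch.nn.functional as F

# Label numbers from the text: dog=0, cat=1, human=2 (example values)
labels = torch.tensor([0, 1, 2])

# Each row is the one-hot vector for the corresponding label
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
```

Each row has a single 1 at the position given by the label number, which matches the vectors written above.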

So what does the model have to do? Since there are 3 kinds of labels, it has to produce an output of 3 probabilities.

\vec y = f(\vec x), \vec y=(p_{dog}, p_{cat}, p_{human})

Naturally, the goal is to push the probability of the correct class to 1 and the others to 0 (for example, (1, 0, 0) for a dog). That is, the output should get as close as possible to the one-hot vector.

2) Producing the classification output: softmax

The catch is that these are probabilities, so $1=p_{dog}+p_{cat}+p_{human}$ has to hold. If we simply compute $\vec y = f(\vec x)$, the outputs can land all over the place and may not form valid probabilities. So what do we do?
This is why a typical deep learning model slaps a softmax on the output once it comes out. For example, suppose we have the following neural network,

\vec{h_{1}}=W^{(1)}\vec{x}, \vec{g_{1}}=f(\vec{h_{1}}),

\vec{h_{2}}=W^{(2)}\vec{g_{1}}, \vec{g_{2}}=f(\vec{h_{2}}),

... , \vec{h_{n}}=W^{(n)}\vec{g_{n-1}}, \vec{g_{n}}=f(\vec{h_{n}}), \vec{y}=W^{(n+1)}\vec{g_{n}}

Let $\vec{y}$ be as follows, where $C$ is the number of classes to classify.

\vec{y}=(y_{1}, y_{2}, ..., y_{C})

The softmax over the $C$ classes is then:

p_{i}=\frac{e^{y_{i}}}{\sum_{j=1}^{C}e^{y_{j}}}, \vec{p}=(p_{1}, p_{2}, ..., p_{C})

In other words, softmax is the exponential of the given y value divided by the sum of the exponentials of all y values. Every $p_{i}$ is positive, and naturally $1=p_{1}+p_{2}+... +p_{C}$ is satisfied.
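
The formula can be sketched in a few lines of PyTorch. This is only an illustration: the raw outputs in `y` are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above (it does not change the result).

```python
import torch

def softmax(y: torch.Tensor) -> torch.Tensor:
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    shifted = y - y.max()
    exp_y = torch.exp(shifted)
    return exp_y / exp_y.sum()

# Made-up raw outputs for (dog, cat, human)
y = torch.tensor([2.0, 1.0, 0.1])
p = softmax(y)
print(p, p.sum())
```

The largest raw output gets the largest probability, and the entries of `p` add up to 1 (up to floating-point error).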

3) A new loss function: cross entropy

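Honestly, fitting these probability values with MSE would be a bit lame. Think about it for a moment: each squared difference between two probabilities is at most 1, so the loss stays small no matter how wrong the model is. Isn't a loss function that tops out around 1 a bit lame (...)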

The new loss we use instead is cross entropy. Suppose the model's output is $\vec{p}=(p_{1}, p_{2}, ..., p_{C})$ and the correct answer is $\vec{\hat{p}}=(\hat{p}_{1}, \hat{p}_{2}, ..., \hat{p}_{C})$. The cross entropy is then:

L=-\sum_{i=1}^{C}\hat{p}_{i}\ln p_{i}

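Suppose my answer differs from the correct label. Let the correct index be n, so $\hat{p}_{n}=1$, while $p_{n}$ is close to 0. Then $L$ is close to $-1\cdot\ln 0$, which is a very large value. Conversely, if $\hat{p}_{n}=1$ and $p_{n}$ is close to 1, with $\hat{p}_{i}=0$ for every other i, then L is close to 0. So cross entropy grows as the prediction moves away from the correct label and shrinks toward 0 as it gets closer.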
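
To see the two cases with actual numbers, here is a small sketch that evaluates the formula directly. The probability vectors `good` and `bad` are made up for illustration, and the helper `cross_entropy` is a hypothetical name, not a library function.

```python
import torch

def cross_entropy(p_hat: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # L = -sum_i p_hat_i * ln(p_i)
    return -(p_hat * torch.log(p)).sum()

p_hat = torch.tensor([1.0, 0.0, 0.0])    # one-hot answer: dog
good = torch.tensor([0.9, 0.05, 0.05])   # confident and correct
bad = torch.tensor([0.05, 0.9, 0.05])    # confident and wrong

loss_good = cross_entropy(p_hat, good)
loss_bad = cross_entropy(p_hat, bad)
print(loss_good.item(), loss_bad.item())
```

The correct prediction gives a loss of about 0.105 (that is, $-\ln 0.9$), while the wrong one gives about 3.0 (that is, $-\ln 0.05$).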

2. Running classification

For the dataset and dataloader, we can reuse the dataset we built earlier.


from torch.utils.data import DataLoader

from dataset import CIFAR10  # the dataset class built in the earlier lesson

EPOCH = 20
LEARNING_RATE = 1e-2
BATCH_SIZE = 64

train_dataset = CIFAR10(train=True)
test_dataset = CIFAR10(train=False)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
# Shuffling is unnecessary for evaluation
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

This is where the loss changes: we use cross entropy loss.

import torch
from torch import nn

model = MLP()
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

After that, the training code is simply...

from tqdm import tqdm

for epoch in range(1, EPOCH + 1):
    # train
    model.train()
    progress = tqdm(train_dataloader, desc="epoch {}".format(epoch))
    for data in progress:
        label, x = data["label"], data["data"]
        pred = model(x)

        loss = loss_function(pred, label)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Update the bar text every batch; a desc passed to tqdm() is fixed at creation
        loss_val = loss.item()
        progress.set_description("epoch {} and now loss {:.4f}...".format(epoch, loss_val))

That is all you need to write! Nobody will complain! Note that the labels hold class numbers, not one-hot vectors.
Also, since there are 10 classes, the model's output layer has length 10... but no softmax is attached after it.

from torch import nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.LazyLinear(2048)  # input size is inferred on the first call
        self.layer2 = nn.Linear(2048, 2048)
        self.layer3 = nn.Linear(2048, 10)  # 10 classes
        self.activation = nn.ReLU()
        
    def forward(self, x):
        hidden1 = self.layer1(x)
        hidden1 = self.activation(hidden1)
        
        hidden2 = self.layer2(hidden1)
        hidden2 = self.activation(hidden2)
        
        # No ReLU and no softmax on the last layer: return the raw scores (logits)
        output = self.layer3(hidden2)
        
        return output

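At this point you are probably wondering why on earth they do it this way and whether they have lost their minds. For reference, if you ask GPT to write image classification code and then ask why it wrote it that way, it answers as follows. Apologies to Korean readers, but I can't be bothered to translate it. We are all the English-kindergarten generation ~~I never even went to one~~ so I trust this much English is manageable.
What on earth is this fellow babbling about... We will find out next time.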

GPT-chan's answer

Great observation! You're absolutely right that for image classification tasks, especially when dealing with multi-class classification, one-hot encoding and softmax are typically used.

In the code I provided, I did not explicitly include softmax or a one-hot encoded target vector because PyTorch's nn.CrossEntropyLoss() already handles both internally. Let me explain in more detail:

1. Cross-Entropy Loss:

PyTorch's nn.CrossEntropyLoss() combines two important steps:
- Softmax Activation: It applies the softmax function to the raw output (logits) from the network.
- Negative Log-Likelihood: It then calculates the negative log-likelihood loss.
  
  So, you do not need to apply softmax manually on the model's output when using CrossEntropyLoss, as this is done internally by the loss function.

2. Target Format:

nn.CrossEntropyLoss() expects the target labels to be in the integer class index format (not one-hot encoded). For example, if you have a dataset with 10 classes (like CIFAR-10), the target labels should be integers in the range [0, 9], where each integer corresponds to the class label.

So, instead of providing one-hot encoded vectors like:
```
# One-hot encoding (example for class 3)
target = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # Length 10 for CIFAR-10
```
You simply provide:
```
target = 3  # Integer class index (the 4th class, since indices start at 0)
```

Why This Approach Works:

- Softmax Layer: Since nn.CrossEntropyLoss() already applies the softmax function, you don't need to manually apply it to the output of the model. The model outputs raw logits (un-normalized scores) for each class, and CrossEntropyLoss() takes care of converting these logits into probabilities using softmax during the loss calculation.
- Target Labels: The target labels are expected as integers, so no need for one-hot encoding. For example, if the model predicts [2.3, -1.2, 0.5, ...], the loss function applies softmax to convert these into probabilities (like [0.7, 0.1, 0.2, ...]) and then compares the predicted probability of the true class with the one-hot encoded target internally.

Example Workflow with Cross-Entropy:

- Model output: Let's say the model outputs raw logits (e.g., [2.3, -1.2, 0.5, 3.1] for a 4-class problem).
- CrossEntropyLoss() applies softmax on the logits to get the predicted probabilities.
- It then computes the negative log-likelihood loss between the predicted probabilities and the true label.

If You Were to Manually Apply Softmax (Not Necessary in this Case):

In case you wanted to manually apply softmax and compute the loss yourself, you could do something like this:

import torch
import torch.nn.functional as F

# Example logits from the model (raw scores)
logits = torch.tensor([2.3, -1.2, 0.5, 3.1])

# Apply softmax to get probabilities
probs = F.softmax(logits, dim=0)

# True class label (e.g., class 3)
true_label = torch.tensor(3)

# Compute the negative log-likelihood loss manually
loss = -torch.log(probs[true_label])

But again, this approach is redundant when you're using nn.CrossEntropyLoss(), as it handles both softmax and the loss calculation internally.

Conclusion:

- No need for softmax or one-hot encoding when using nn.CrossEntropyLoss() in PyTorch.
- The loss function expects integer class labels and will automatically compute softmax internally.

I hope that clears up your question! Let me know if you'd like more details on any part of this process.
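
The claim in the answer can be checked numerically. The sketch below compares `nn.CrossEntropyLoss()` on raw logits and integer labels against a manual computation that applies softmax, builds one-hot targets, and evaluates the cross entropy formula from section 1. The logits and labels are random, made-up values, and the comparison assumes the default `reduction="mean"`.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # made-up batch: 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])   # integer class indices, not one-hot

# Built-in: takes raw logits and integer labels directly
builtin = nn.CrossEntropyLoss()(logits, labels)

# Manual: softmax, one-hot targets, then -sum(p_hat * ln p), averaged over the batch
probs = F.softmax(logits, dim=1)
one_hot = F.one_hot(labels, num_classes=10).float()
manual = -(one_hot * torch.log(probs)).sum(dim=1).mean()

print(builtin.item(), manual.item())
```

The two printed values should agree up to floating-point error, which is why the training loop can pass class numbers and un-normalized outputs straight into the loss.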
