[PyTorch] nn.BCELoss vs nn.BCEWithLogitsLoss vs nn.CrossEntropyLoss

정준환 · May 5, 2023

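These are three losses you run into often when working on classification problems. I'll assume you already know what cross entropy is, and summarize the points where it is easy to slip up when implementing it in PyTorch.

In the examples below, the variable output stands for the model's output.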

nn.BCELoss

This is the Binary Cross Entropy loss.
It is used for binary classification.
The output must lie in [0, 1], so it is usually paired with a sigmoid.
If you are pairing it with a sigmoid anyway, nn.BCEWithLogitsLoss (below) is the recommended choice.

Example

import torch
import torch.nn as nn

# batch size is 1
bce = nn.BCELoss()
output = torch.Tensor([1])
target = torch.Tensor([1])

bce(output, target)
>>> tensor(0.)

# Used together with sigmoid
sigmoid = nn.Sigmoid()
output = torch.Tensor([99999])
target = torch.Tensor([1])

# Values outside [0, 1] raise an error
bce(output, target)
>>> RuntimeError: all elements of input should be between 0 and 1

bce(sigmoid(output), target)
>>> tensor(0.)
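
To make the [0, 1] requirement concrete, here is a minimal sketch that recomputes the loss by hand and compares it with nn.BCELoss. The probabilities and targets are made-up values for illustration, and the default reduction="mean" is assumed.

```python
import torch
import torch.nn as nn

# Hypothetical probabilities (already in [0, 1]) and binary targets.
probs = torch.tensor([0.9, 0.2, 0.6])
target = torch.tensor([1.0, 0.0, 1.0])

# Element-wise binary cross entropy, averaged over the batch
# (nn.BCELoss uses reduction="mean" by default).
manual = -(target * torch.log(probs) + (1 - target) * torch.log(1 - probs)).mean()
builtin = nn.BCELoss()(probs, target)

print(manual.item(), builtin.item())
```

Because the formula takes log(p) and log(1 - p), an input outside [0, 1] has no valid logarithm, which is why the RuntimeError above is raised.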

nn.BCEWithLogitsLoss

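This combines the BCE loss above with a sigmoid in a single step. According to the official documentation, it is more numerically stable than a separate Sigmoid followed by BCELoss, because fusing the two operations lets it use the log-sum-exp trick.
It is likewise used for binary classification.
The output does not need to lie in [0, 1]; pass the raw logits.

Example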

# batch size is 1
bce_with_logits = nn.BCEWithLogitsLoss()

output = torch.Tensor([99999])
target = torch.Tensor([1])

# No need to apply sigmoid separately
bce_with_logits(output, target)
>>> tensor(0.)


output = torch.Tensor([1])
target = torch.Tensor([1])

# Even when output and target are equal, the loss is nonzero:
# the output is treated as a logit and passed through sigmoid internally
bce_with_logits(output, target)
>>> tensor(0.3133)
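
The stability difference shows up with extreme logits. The sketch below uses a hypothetical logit of -200 with target 1 and assumes float32 tensors: the two-step version loses the true value, while the fused version keeps it.

```python
import torch
import torch.nn as nn

logit = torch.tensor([-200.0])   # a confidently wrong prediction (made-up value)
target = torch.tensor([1.0])

# Two-step version: sigmoid underflows to exactly 0 in float32,
# and nn.BCELoss clamps its log output at -100, so the loss saturates.
naive = nn.BCELoss()(torch.sigmoid(logit), target)

# Fused version: works on the logit directly and keeps the true value.
stable = nn.BCEWithLogitsLoss()(logit, target)

print(naive.item(), stable.item())
```

The true loss here is about 200, which is what the fused version should report. The two-step version stops at the clamp value of 100.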

nn.CrossEntropyLoss

With three or more mutually exclusive classes, the binary losses above no longer apply. In that case nn.CrossEntropyLoss is the usual choice.

There are two main things to watch out for.

1. Do not apply Softmax

The formula for cross entropy is:

H(P,Q) = -\sum_{x\in\mathcal{X}} p(x)\, \log q(x)

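I assumed I should apply Softmax so that the output becomes a probability distribution, and that PyTorch would then evaluate this formula. That turned out to be wrong. Here is what the official documentation says: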

This criterion computes the cross entropy loss between input logits and target.

The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general)

So there is no need to apply softmax to the output. nn.CrossEntropyLoss applies log-softmax internally, so pass the raw model output without any transformation.

Example

# batch size is 1, with 3 classes
ce = nn.CrossEntropyLoss()
softmax = nn.Softmax(dim=1)

output = torch.Tensor([[99999, 1, 1]])
target = torch.Tensor([[1, 0, 0]])

# Don't apply softmax.
ce(softmax(output), target)
>>> tensor(0.5514)

ce(output, target)
>>> tensor(0.)
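
The value 0.5514 is not arbitrary: it is what you get when softmax is effectively applied twice. The following sketch reproduces it by hand. The helper manual_ce is a hypothetical name introduced here for illustration and is not part of PyTorch.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[99999.0, 1.0, 1.0]])
target = torch.tensor([[1.0, 0.0, 0.0]])

def manual_ce(x, p):
    # Cross entropy with probability targets:
    # -sum(p * log_softmax(x)) per sample, averaged over the batch.
    return -(p * F.log_softmax(x, dim=1)).sum(dim=1).mean()

probs = F.softmax(logits, dim=1)      # already [[1., 0., 0.]]
double = manual_ce(probs, target)     # softmax applied a second time inside
single = manual_ce(logits, target)    # logits passed directly

print(double.item(), single.item())
```

After the first softmax the input is [1, 0, 0]. The second log-softmax turns that into 1 - log(e + 2), so the loss becomes log(e + 2) - 1 ≈ 0.5514 even though the prediction is as confident and correct as it can be.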

What confused me is that nn.BCELoss and nn.BCEWithLogitsLoss are told apart by whether the name contains "Logits", so I assumed the same convention held here. I expected there to be nn.CrossEntropyLoss alongside something like nn.CrossEntropyWithLogitsLoss...

For reference, combining NLLLoss with LogSoftmax gives the same result.
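
A quick check of that equivalence, as a sketch on random logits. The batch size, class count, and target indices are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)            # hypothetical batch: 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 2])   # class indices (int64)

# nn.CrossEntropyLoss on raw logits
ce = nn.CrossEntropyLoss()(logits, target)

# nn.LogSoftmax followed by nn.NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target)

print(torch.allclose(ce, nll))
```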

2. Whether to use a one-hot vector

I won't go into why one-hot vectors are used in multi-class classification. In this case, though, you don't need one. The official documentation says:

The target that this criterion expects should contain either:

Class indices in the range $[0,C)$ where $C$ is the number of classes;

Probabilities for each class;

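In other words, you can pass just the class index and the loss is computed correctly. A one-hot vector works too: it is simply the per-class probabilities with all the mass on one class, so it produces the same result.

For blended labels, label smoothing, and similar cases you need to pass probabilities, but for a single hard label it is more convenient to pass the index.

Example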

# batch size is 1, with 3 classes
ce = nn.CrossEntropyLoss()

output = torch.Tensor([[99999, 1, 1]])

# The target holds class indices, so it must be a LongTensor, not a float tensor.
# Note the tensor shapes as well: (N,) for indices, (N, C) for probabilities.
target = torch.LongTensor([0])
target_onehot = torch.Tensor([[1, 0, 0]])

ce(output, target)
>>> tensor(0.)

ce(output, target_onehot)
>>> tensor(0.)
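
For the label smoothing case mentioned above, here is a minimal sketch of passing probabilities as targets. It assumes PyTorch 1.10 or later, where nn.CrossEntropyLoss accepts probability targets and a label_smoothing argument. The smoothing factor eps = 0.1 and the random logits are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, eps = 3, 0.1             # eps is an assumed smoothing factor
logits = torch.randn(4, num_classes)  # hypothetical batch of 4 samples
target = torch.tensor([0, 2, 1, 2])   # class indices (int64)

# Build smoothed probability targets by hand: mix one-hot with uniform.
one_hot = F.one_hot(target, num_classes).float()
smoothed = one_hot * (1 - eps) + eps / num_classes

loss_from_probs = nn.CrossEntropyLoss()(logits, smoothed)
loss_builtin = nn.CrossEntropyLoss(label_smoothing=eps)(logits, target)

# Hard labels: index targets and one-hot targets agree.
loss_hard = nn.CrossEntropyLoss()(logits, target)
loss_one_hot = nn.CrossEntropyLoss()(logits, one_hot)

print(loss_from_probs.item(), loss_builtin.item())
print(loss_hard.item(), loss_one_hot.item())
```

Each pair should match up to floating-point error, so for plain label smoothing the label_smoothing argument with index targets saves building the probability tensor yourself.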
