Deep Learning on Small Datasets without Pre-Training using Cosine Loss, Part 1

이준석 · October 28, 2022

Abstract

Two things seem to be indisputable in the contemporary deep learning discourse: 1. The categorical cross-entropy loss after softmax activation is the method of choice for classification. 2. Training a CNN classifier from scratch on small datasets does not work well.
indisputable: beyond dispute

In contrast to this, we show that the cosine loss function provides substantially better performance than cross-entropy on datasets with only a handful of samples per class. For example, the accuracy achieved on the CUB-200-2011 dataset without pre-training is 30% higher than with the cross-entropy loss. Further experiments on other popular datasets confirm our findings.
substantially: considerably; handful: a small number

Moreover, we demonstrate that integrating prior knowledge in the form of class hierarchies is straightforward with the cosine loss and improves classification performance further.
demonstrate: to prove, to show; integrating: combining into a whole

6. Conclusions

We have found the cosine loss to be useful for training deep neural classifiers from scratch on limited data. Experiments on five well-known small image datasets and one text classification task have shown that this loss function outperforms the traditionally used cross-entropy loss after softmax activation by a large margin.

On the other hand, both loss functions perform similarly if a sufficient amount of training data is available or the network is initialized with weights pre-trained on a large dataset.

This leads to the hypothesis that the L2 normalization involved in the cosine loss is a strong regularizer. Evidence for this hypothesis is provided by the poor performance of the MSE loss, which mainly differs from the cosine loss by not applying L2 normalization.
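The link between the MSE loss and the cosine loss can be written out explicitly. The derivation below is a sketch in assumed notation (not taken from the excerpt): f(x) is the network output and y is the one-hot target, which has unit length. It shows that the squared error computed on L2-normalized outputs is exactly twice the cosine loss, so the practical difference between the two losses is the normalization step.

```latex
\begin{align*}
\left\lVert \frac{f(x)}{\lVert f(x)\rVert_2} - y \right\rVert_2^2
  &= \underbrace{\left\lVert \frac{f(x)}{\lVert f(x)\rVert_2} \right\rVert_2^2}_{=1}
   + \underbrace{\lVert y \rVert_2^2}_{=1}
   - 2\,\frac{\langle f(x), y\rangle}{\lVert f(x)\rVert_2} \\
  &= 2\,\bigl(1 - \cos(f(x), y)\bigr)
\end{align*}
```

Without the normalization, the squared error additionally depends on the magnitude of f(x), which is the part the excerpt argues behaves mostly like noise. The MSE averages over output dimensions, so it differs from the squared distance above only by a constant factor.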

Previous works have found that direction bears substantially more information in high-dimensional feature spaces than magnitude [17, 55].
bear: to carry, to contain
The magnitude of feature vectors can hence mainly be considered as noise, which is eliminated by L2 normalization.

Moreover, the cosine loss is bounded between 0 and 2, which facilitates a dataset-independent choice of a learning rate schedule and limits the impact of misclassified samples, e.g., difficult examples or label noise.
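To make the boundedness concrete, here is a small NumPy sketch that compares the two losses on a sample that is misclassified with growing confidence. The function names and the toy logits are my own illustrative choices, not code from the paper, and the cosine loss here assumes a non-zero output vector.

```python
import numpy as np

def cosine_loss(output, label):
    """1 - cosine similarity between one output vector and a one-hot target.

    Assumes `output` is not the zero vector.
    """
    output = np.asarray(output, dtype=np.float64)
    # The one-hot target has unit length, so the cosine similarity is the
    # true-class entry of the output divided by the output's L2 norm.
    return float(1.0 - output[label] / np.linalg.norm(output))

def cross_entropy(output, label):
    """Softmax cross-entropy for one output vector (shifted for stability)."""
    output = np.asarray(output, dtype=np.float64)
    shifted = output - output.max()
    return float(np.log(np.exp(shifted).sum()) - shifted[label])

# The true class is 0, but the network puts ever more weight on class 1.
for scale in (1.0, 10.0, 100.0):
    logits = [-scale, scale, 0.0]
    print(scale, cosine_loss(logits, 0), cross_entropy(logits, 0))
```

In this toy setup the cosine loss stays at 1 + 1/√2 ≈ 1.71 for every scale, while the cross-entropy grows roughly like 2 × scale, so a single mislabeled or very hard sample can dominate a cross-entropy batch but not a cosine-loss batch.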

While some problems can in fact be solved satisfactorily by simply collecting more and more data, we hope that applications that have to deal with limited amounts of data and cannot apply pre-training can benefit from using the cosine loss. Moreover, we hope to motivate future research on different loss functions for classification, since there obviously are viable alternatives to categorical cross-entropy.

1. Introduction

Deep learning methods are well known for their demand for huge amounts of data [40]. It is even widely acknowledged that the availability of large datasets is one of the main reasons, besides more powerful hardware, for the recent renaissance of deep learning approaches [20, 40].

However, there are plenty of domains and applications where the amount of available training data is limited due to high costs induced by the collection or annotation of suitable data.
In such scenarios, pre-training on similar tasks with large amounts of data such as the ImageNet dataset [9] has become the de facto standard [51, 12], for example in the domain of fine-grained recognition [23, 56, 8, 37].

While this so-called transfer learning often comes without additional costs for research projects thanks to the availability of pre-trained models, it is rather problematic in at least two important scenarios: On the one hand, the target domain might be highly specialized, e.g., in the field of medical image analysis [24], inducing a large bias between the source and target domain in a transfer learning scenario.
problematic: posing problems

The input data might have more than three channels provided by sensors different from cameras, e.g., depth sensors, satellites, or MRI scans. In that case, pre-training on RGB images is anything but straightforward.
anything but straightforward: far from simple

But even in the convenient case that the input data consists of RGB images, legal problems arise: Most large imagery datasets consist of images collected from the web, whose licenses are either unclear or prohibit commercial use [9, 19, 49].
Therefore, copyright regulations imposed by many countries make pre-training on ImageNet illegal for commercial applications.

Nevertheless, the majority of research applying deep learning to small datasets focuses on transfer learning. Given huge amounts of data, even simple models can solve complex tasks by memorizing [43, 52].

Generalizing well from limited data is hence the hallmark of true intelligence. But still, works aiming at directly learning from small datasets without external data are surprisingly scarce.
hallmark: distinguishing feature; scarce: rare

Certainly, the notion of a "small dataset" is highly subjective and depends on the task at hand and the diversity of the data, as expressed in, e.g., the number of classes. In this work, we consider datasets with fewer than 100 training images per class as small, such as the Caltech-UCSD Birds (CUB) dataset [46], which comprises at most 30 images per class.
diversity: variety; comprise: to consist of

In contrast, the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC'12) [34] contains between 700 and 1,300 images per class.

Since transfer learning works well in cases where sufficiently large and licensable datasets are available for pre-training, research on new methodologies for learning from small data without external information has been very limited.

For example, the choice of categorical cross-entropy after a softmax activation as loss function has, to the best of our knowledge, not been questioned.

In this work, however, we propose an extremely simple but surprisingly effective loss function for learning from scratch on small datasets: the cosine loss, which maximizes the cosine similarity between the output of the neural network and one-hot vectors indicating the true class.
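As a reading aid, the definition above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions (the function name, the batch layout, and the `eps` guard are not from the paper): the loss is one minus the cosine similarity between each L2-normalized output vector and the one-hot vector of its true class, averaged over the batch.

```python
import numpy as np

def cosine_loss(outputs, labels, eps=1e-12):
    """Mean cosine loss between network outputs and one-hot targets.

    outputs: array of shape (batch, num_classes) with raw network outputs.
    labels:  integer array of shape (batch,) with true class indices.
    eps:     assumed small constant guarding against division by zero.
    """
    outputs = np.asarray(outputs, dtype=np.float64)
    labels = np.asarray(labels)
    # L2-normalize every output vector.
    norms = np.linalg.norm(outputs, axis=1, keepdims=True)
    unit = outputs / np.maximum(norms, eps)
    # A one-hot target already has unit length, so the cosine similarity is
    # just the normalized output at the true class index.
    cos_sim = unit[np.arange(len(labels)), labels]
    # Minimizing 1 - cos is the same as maximizing the cosine similarity.
    return float(np.mean(1.0 - cos_sim))

# Toy batch: the first output points exactly at its true class (loss 0),
# the second is orthogonal to its true class (loss 1).
print(cosine_loss([[3.0, 0.0, 0.0], [0.0, 0.0, 2.0]], [0, 1]))  # 0.5
```

Note that rescaling an output vector by any positive factor leaves the loss unchanged, which is the effect of the L2 normalization discussed in the text.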

Our experiments show that this is superior to cross-entropy by a large margin on small datasets. We attribute this mainly to the L2 normalization involved in the cosine loss, which seems to be a strong, hyperparameter-free regularizer.

1. We conduct a study on five small image datasets (CUB, NAB, Stanford Cars, Oxford Flowers, MIT Indoor Scenes) and one text classification dataset (AG News) to assess the benefits of the cosine loss for learning from small data.

2. We analyze the effect of the dataset size using differently sized subsets of CUB, CIFAR-100, and AG News.

3. We investigate whether the integration of prior semantic knowledge about the relationships between classes as recently suggested by Barz and Denzler [5] improves the performance further. To this end, we introduce a novel class taxonomy for the CUB dataset and also evaluate different variants to analyze the effect of the granularity of the hierarchy.
investigate: to examine; integration: the act of combining
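One way to picture how class relationships could enter the cosine loss is to swap the one-hot targets for per-class embedding vectors whose directions reflect class similarity. The sketch below is only an illustration of that idea under my own assumptions: the function name and the toy embeddings are hypothetical, and the way embeddings are actually derived from a hierarchy is described in Barz and Denzler [5], not reproduced here.

```python
import numpy as np

def embedding_cosine_loss(outputs, labels, class_embeddings, eps=1e-12):
    """Mean of 1 - cos(output, embedding of the true class)."""
    outputs = np.asarray(outputs, dtype=np.float64)
    # Look up the target embedding for each sample's true class.
    targets = np.asarray(class_embeddings, dtype=np.float64)[np.asarray(labels)]
    # L2-normalize both sides so the dot product is the cosine similarity.
    outputs = outputs / np.maximum(np.linalg.norm(outputs, axis=1, keepdims=True), eps)
    targets = targets / np.maximum(np.linalg.norm(targets, axis=1, keepdims=True), eps)
    return float(np.mean(1.0 - np.sum(outputs * targets, axis=1)))

# Hypothetical toy embeddings for three classes: classes 0 and 1 are assumed
# to be siblings in the hierarchy (similar directions), class 2 is unrelated.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])

# Predicting the sibling class costs less than predicting the unrelated one.
print(embedding_cosine_loss([toy_embeddings[1]], [0], toy_embeddings))
print(embedding_cosine_loss([toy_embeddings[2]], [0], toy_embeddings))
```

With the identity matrix as `class_embeddings`, this reduces to the plain cosine loss against one-hot vectors, so the hierarchy-aware variant is a drop-in change of targets rather than a different loss.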
