Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, Part 1

이준석 · December 27, 2022

1 Introduction

Backed by massive annotated datasets, many visual recognition tasks have seen significant progress, mainly due to the widespread use of deep learning architectures [Ren et al., 2015, Dosovitskiy et al., 2021, Khan et al., 2021].

Despite these advancements, recognising arbitrary real-world objects remains a daunting challenge, since it is unrealistic to label every object class that exists.

Zero-Shot Learning (ZSL) addresses this problem: it requires images from the seen classes during training, yet it can recognise unseen classes at inference time [Xian et al., 2019a, Xie et al., 2019, Xu et al., 2020, Federici et al., 2020].

The central insight is that all categories share a common semantic space. The task of ZSL is to learn a mapping from the image space to the semantic space with the help of side information (attributes, word embeddings) [Xian et al., 2017, Mikolov et al., 2013, Pennington et al., 2014] that is available for the seen classes during training, so that the mapping can be used to predict the classes of unseen-class images at inference time.
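To make the idea of mapping images into a shared semantic space concrete, here is a minimal sketch. It is not the method of the paper: it swaps in a plain ridge regression from visual features to attribute vectors, and it runs on synthetic data. The attribute table, the class split, and every function name below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy setup: 4 binary attributes, classes 0-3 seen, classes 4-5 unseen.
CLASS_ATTRIBUTES = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)
SEEN, UNSEEN = np.array([0, 1, 2, 3]), np.array([4, 5])


def make_toy_features(class_ids, per_class, rng, mixing, noise=0.05):
    """Synthetic 'visual features': attributes pushed through a random linear map plus noise."""
    labels = np.repeat(class_ids, per_class)
    clean = CLASS_ATTRIBUTES[labels] @ mixing
    return clean + noise * rng.normal(size=clean.shape), labels


def fit_visual_to_semantic(features, labels, reg=1.0):
    """Ridge regression from visual features to the attribute vector of each sample's class."""
    targets = CLASS_ATTRIBUTES[labels]
    gram = features.T @ features + reg * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)


def predict(features, weights, candidate_ids):
    """Project features into attribute space and pick the most similar candidate class (cosine)."""
    projected = features @ weights
    candidates = CLASS_ATTRIBUTES[candidate_ids]
    projected = projected / (np.linalg.norm(projected, axis=1, keepdims=True) + 1e-12)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidate_ids[np.argmax(projected @ candidates.T, axis=1)]


rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 16))            # attribute space -> 16-d feature space

# Training sees images of the seen classes only.
train_x, train_y = make_toy_features(SEEN, 50, rng, mixing)
weights = fit_visual_to_semantic(train_x, train_y)

# Inference is on classes that never appeared during training.
test_x, test_y = make_toy_features(UNSEEN, 20, rng, mixing)
unseen_accuracy = float(np.mean(predict(test_x, weights, UNSEEN) == test_y))
print(f"unseen-class accuracy: {unseen_accuracy:.2f}")
```

The point of the sketch is that the learned map never sees an image of classes 4 or 5. It can still separate them because their attribute vectors live in the same semantic space as those of the seen classes.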

Most existing ZSL methods [Xian et al., 2018, Schönfeld et al., 2019] depend on pretrained visual features and necessarily focus on learning a compatibility function between the visual features and the semantic attributes.

Although modern neural network models encode local visual information and object parts [Xie et al., 2019], this is not sufficient to solve the localisation issue in ZSL models. Some attempts have been made to learn visual attention that focuses on object parts [Zhu et al., 2019]. However, designing a model that can exploit a stronger attention mechanism remains relatively unexplored.
