Attribute Prototype Network for Zero-Shot Learning, Part 1

이준석 · October 23, 2022

Attribute Prototype Network for Zero-Shot Learning

Abstract

From the beginning of zero-shot learning research, visual attributes have been shown to play an important role. In order to better transfer attribute-based knowledge from known to unknown classes, we argue that an image representation with integrated attribute localization ability would be beneficial for zero-shot learning.
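integrated: combined into a single whole
Visual attributes have mattered since the earliest zero-shot learning research. The authors argue that an image representation with built-in attribute localization helps carry attribute-based knowledge from known (seen) classes over to unknown (unseen) ones.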

To this end, we propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features using only class-level attributes. While a visual-semantic embedding layer learns global features, local features are learned through an attribute prototype network that simultaneously regresses and decorrelates attributes from intermediate features.
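To that end, the paper proposes a zero-shot representation learning framework that learns discriminative global and local features together, supervised only by class-level attributes. A visual-semantic embedding layer learns the global features, while an attribute prototype network learns the local features by regressing and decorrelating attributes from intermediate-layer features at the same time.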
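
The regression half of that idea can be sketched in a few lines of plain Python. The source only says that attributes are regressed from intermediate features; scoring each attribute by a dot product between a learnable prototype and every local feature, then keeping the maximum over locations, is an assumption made here for illustration. The function names and toy values are hypothetical.

```python
def attribute_scores(feature_map, prototypes):
    """Score each attribute from local features.

    feature_map: H x W grid of C-dimensional local features (nested lists).
    prototypes:  K prototypes, one C-dimensional vector per attribute.
    Returns (scores, similarity_maps): one score and one H x W map per attribute.
    """
    scores, maps = [], []
    for proto in prototypes:
        # Similarity of this prototype to the local feature at every location.
        sim_map = [[sum(f * p for f, p in zip(cell, proto)) for cell in row]
                   for row in feature_map]
        maps.append(sim_map)
        # Assumed pooling: the strongest local response is the attribute score.
        scores.append(max(max(row) for row in sim_map))
    return scores, maps


def regression_loss(scores, class_attributes):
    """Mean squared error against the class-level attribute vector."""
    return sum((s - a) ** 2 for s, a in zip(scores, class_attributes)) / len(scores)


# Toy 2x2 feature map with 2 channels and 2 hypothetical attribute prototypes.
feature_map = [[[1.0, 0.0], [0.0, 0.0]],
               [[0.0, 0.0], [0.0, 2.0]]]
prototypes = [[1.0, 0.0], [0.0, 0.5]]

scores, maps = attribute_scores(feature_map, prototypes)
loss = regression_loss(scores, [1.0, 0.0])  # class has attribute 0 but not attribute 1
```

Because the target is a class-level attribute vector, no location label is needed: the similarity map is a by-product that shows where each attribute responded most strongly.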

We show that our locality augmented image representations achieve a new state-of-the-art on three zero-shot learning benchmarks. As an additional benefit, our model points to the visual evidence of the attributes in an image, e.g. for the CUB dataset, confirming the improved attribute localization ability of our image representation.
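These locality-augmented representations set a new state of the art on three zero-shot learning benchmarks. As a side benefit, the model points to the image regions that serve as visual evidence for each attribute, which the CUB examples use to confirm the improved attribute localization.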

5 Conclusion

In this work, we develop a zero-shot representation learning framework, i.e. attribute prototype network (APN), to jointly learn global and local features. By regressing attributes with local features and decorrelating prototypes with regularization, our model improves the locality of image representations.
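decorrelating: removing correlation
The conclusion restates the method: the attribute prototype network (APN) learns global and local features jointly. Regressing attributes from local features and decorrelating the prototypes through a regularizer improves the locality of the image representation.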

We demonstrate consistent improvement over the state-of-the-art on three ZSL benchmarks, and further show that, when used in conjunction with feature generating models, our representations improve over finetuned ResNet representations.
conjunction: combination
The gains over the state of the art are consistent across three ZSL benchmarks. When paired with feature-generating models, the APN representations also beat finetuned ResNet representations.

We qualitatively verify that our network is able to accurately localize attributes in images, and the part localization accuracy significantly outperforms a weakly supervised localization model designed for zero-shot learning.
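qualitatively: by inspecting examples rather than by measuring; verify: to confirm
Qualitative results show that the network localizes attributes accurately in images, and its part localization accuracy is significantly higher than that of a weakly supervised localization model designed for zero-shot learning.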

1 Introduction

Visual attributes describe discriminative visual properties of objects shared among different classes.
Attributes have shown to be important for zero-shot learning as they allow semantic knowledge transfer from known to unknown classes. Most zero-shot learning (ZSL) methods [30, 6, 1, 50] rely on pretrained image representations and essentially focus on learning a compatibility function between the image representations and attributes.
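compatibility: how well two things fit together
Visual attributes describe discriminative visual properties that objects of different classes share.
They matter for zero-shot learning because they let semantic knowledge transfer from known to unknown classes. Most ZSL methods [30, 6, 1, 50] take a pretrained image representation as given and concentrate on learning a compatibility function that scores how well an image representation matches a class's attributes.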
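
A compatibility function of this kind can be sketched as follows. The bilinear form (image feature times a weight matrix times the class attribute vector) is one common choice and is assumed here rather than taken from the source. The names `compatibility` and `predict`, the matrix `W`, and the class attribute values are all hypothetical.

```python
def compatibility(image_feature, W, class_attributes):
    """Bilinear score: image_feature^T * W * class_attributes."""
    # Project the image feature into attribute space with W (D x K).
    projected = [sum(x * W[i][j] for i, x in enumerate(image_feature))
                 for j in range(len(class_attributes))]
    # Compare the projection with the class attribute vector.
    return sum(p * a for p, a in zip(projected, class_attributes))


def predict(image_feature, W, attribute_table):
    """Pick the class whose attribute vector is most compatible with the image."""
    return max(attribute_table,
               key=lambda name: compatibility(image_feature, W, attribute_table[name]))


# Hypothetical setup: 3-dimensional image features, 2 attributes, 2 unseen classes.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]
unseen_classes = {"class_a": [1.0, 0.0], "class_b": [0.0, 1.0]}
image_feature = [0.2, 0.9, 0.1]

prediction = predict(image_feature, W, unseen_classes)
```

Since an unseen class enters only through its attribute vector, a class with no training images can still be scored, which is what makes the transfer from known to unknown classes possible.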

Focusing on image representations that directly allow attribute localization is relatively unexplored. In this work, we refer to the ability of an image representation to localize and associate an image region with a visual attribute as locality. Our goal is to improve the locality of image representations for zero-shot learning.
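Image representations that directly support attribute localization remain relatively unexplored. The paper defines locality as the ability of an image representation to localize an image region and associate it with a visual attribute. The goal is to improve that locality for zero-shot learning.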

While modern deep neural networks [13] encode local information and some CNN neurons are linked to object parts [53], the encoded local information is not necessarily best suited for zero-shot learning. There have been attempts to improve locality in ZSL by learning visual attention [24, 58] or attribute classifiers [35].
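Modern deep networks [13] do encode local information, and some CNN neurons respond to object parts [53], but that local information is not necessarily what zero-shot learning needs. Earlier attempts to improve locality in ZSL learn visual attention [24, 58] or attribute classifiers [35].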

Although visual attention accurately focuses on some object parts, often the discovered parts and attributes are biased towards training classes due to the learned correlations. For instance, the attributes yellow crown and yellow belly co-occur frequently (e.g. for Yellow Warbler).
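crown: the top of a bird's head; belly: the underside of its body
Visual attention does focus accurately on some object parts, but the parts and attributes it discovers are often biased toward the training classes because of learned correlations. For example, the attributes yellow crown and yellow belly frequently co-occur (e.g. for Yellow Warbler).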

Such correlations may be learned as a shortcut to maximize the likelihood of training data and therefore fail to deal with unknown configurations of attributes in novel classes such as black crown and yellow belly (e.g. for Scott Oriole) as this attribute combination has not been observed before.
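A model can learn such a correlation as a shortcut for maximizing the likelihood of the training data. It then fails on novel classes with attribute configurations it has never observed, such as black crown with yellow belly (e.g. for Scott Oriole).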

To improve locality and mitigate the above weaknesses of image representations, we develop a weakly supervised representation learning framework that localizes and decorrelates visual attributes.
mitigate: to lessen
To improve locality and mitigate these weaknesses, the paper develops a weakly supervised representation learning framework that localizes and decorrelates visual attributes.

More specifically, we learn local features by injecting losses on intermediate layers of CNNs and enforce these features to encode visual attributes defining visual characteristics of objects.
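More specifically, losses are injected at intermediate CNN layers to learn local features, and those features are forced to encode the visual attributes that define an object's visual characteristics.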

It is worth noting that we use only class-level attributes and semantic relatedness of them as the supervisory signal, in other words, no human annotated association between the local features and visual attributes is given during training.
Note that the only supervisory signals are the class-level attributes and their semantic relatedness. No human-annotated association between local features and visual attributes is provided during training.

In addition, we propose to alleviate the impact of incidentally correlated attributes by leveraging their semantic relatedness while learning these local features.
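relatedness: the degree to which things are connected
The paper also proposes to reduce the impact of incidentally correlated attributes by exploiting their semantic relatedness while the local features are learned.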
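
One way such a regularizer could look is a group-wise norm over the prototypes. This excerpt does not spell out the form of the penalty, so everything below is an assumption for illustration: attributes are grouped by the body part they describe (the "semantic relatedness"), and each feature channel is penalized once per group that uses it. The grouping and prototype values are made up.

```python
import math


def decorrelation_penalty(prototypes, groups):
    """Sum, over channels and attribute groups, of the L2 norm of the
    prototype entries that the group places on that channel.

    prototypes: K x C nested list, one row per attribute.
    groups:     list of lists of attribute indices (assumed semantic groups).
    """
    n_channels = len(prototypes[0])
    total = 0.0
    for c in range(n_channels):
        for group in groups:
            total += math.sqrt(sum(prototypes[k][c] ** 2 for k in group))
    return total


# Hypothetical grouping: attributes 0-1 describe the crown, 2-3 the belly.
groups = [[0, 1], [2, 3]]

# Each group keeps to its own channel.
separated = [[3.0, 0.0], [4.0, 0.0], [0.0, 3.0], [0.0, 4.0]]
# Same entries, but both groups spread over both channels.
entangled = [[3.0, 0.0], [0.0, 4.0], [0.0, 3.0], [4.0, 0.0]]

penalty_separated = decorrelation_penalty(separated, groups)  # 5 + 0 + 0 + 5 = 10
penalty_entangled = decorrelation_penalty(entangled, groups)  # 3 + 4 + 4 + 3 = 14
```

Both prototype sets contain the same values, yet the penalty is lower when crown and belly attributes rely on different channels. Under this assumed penalty, unrelated attributes are pushed away from sharing features, while related ones may still share.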

To summarize, our work makes the following contributions. (1) We propose an attribute prototype network (APN) to improve the locality of image representations for zero-shot learning. By regressing and decorrelating attributes from intermediate-layer features simultaneously, our APN model learns local features that encode semantic visual attributes.
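In summary, the paper makes three contributions. (1) It proposes an attribute prototype network (APN) that improves the locality of image representations for zero-shot learning. By regressing and decorrelating attributes from intermediate-layer features simultaneously, APN learns local features that encode semantic visual attributes.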

(2) We demonstrate consistent improvement over the state-of-the-art on three challenging benchmark datasets, i.e., CUB, AWA2 and SUN, in both zero-shot and generalized zero-shot learning settings.
(2) It improves consistently over the state of the art on three challenging benchmarks (CUB, AWA2 and SUN) in both the zero-shot and the generalized zero-shot learning settings.

(3) We show qualitatively that our model is able to accurately localize bird parts by only inspecting the attention maps of attribute prototypes and without using any part annotations during training. Moreover, we show significantly better part detection results than a recent weakly supervised method.
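(3) It shows qualitatively that the model localizes bird parts accurately just by inspecting the attention maps of the attribute prototypes, with no part annotations used during training. Its part detection results are also significantly better than those of a recent weakly supervised method.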
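
Reading a location off an attention map can be sketched as below. Taking the peak cell and mapping its center back to pixel coordinates is an assumed, simplified procedure, not the paper's evaluation protocol. The 3x3 map, the 224x224 image size, and the function names are hypothetical.

```python
def peak_location(attention_map):
    """Return (row, col) of the strongest response in an H x W attention map."""
    _, row, col = max((value, r, c)
                      for r, cells in enumerate(attention_map)
                      for c, value in enumerate(cells))
    return row, col


def to_image_coords(cell, map_size, image_size):
    """Map the center of a grid cell to (y, x) pixel coordinates."""
    (r, c), (h, w), (img_h, img_w) = cell, map_size, image_size
    return ((r + 0.5) * img_h / h, (c + 0.5) * img_w / w)


# Hypothetical attention map for one attribute prototype.
attention = [[0.1, 0.2, 0.1],
             [0.0, 0.9, 0.3],
             [0.0, 0.1, 0.2]]

cell = peak_location(attention)
pixel = to_image_coords(cell, (3, 3), (224, 224))
```

No part annotation is involved anywhere: the location comes only from where the prototype responds most strongly.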
