[Multimodal_01] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision(ICML 2021)

fla1512 · February 14, 2023


As of 2023/02/17, this paper has been cited as many as 753 times

The Intro explains the content of the Abstract in detail, so I only skim the Abstract here.

Abstract

Background + limitations of prior work

Pre-trained representations are becoming increasingly important for NLP and perception tasks
- In NLP, representation learning has moved to training on raw text without human annotations => unsupervised!
- Visual and vision-language representations still rely on curated training datasets, which are expensive and require expert knowledge
Representations for vision applications are mostly learned from datasets with explicit class labels, such as ImageNet and OpenImages
- For vision-language, popular datasets such as Conceptual Captions, MSCOCO, and CLIP all involve a non-trivial data collection (and cleaning) process
  - This costly curation process limits the size of the datasets and therefore hinders the scaling of trained models (the limitation being pointed out)

This paper's Approach + Method for overcoming the limitation

The paper leverages a noisy dataset of over one billion image alt-text pairs
- obtained without the expensive filtering and post-processing steps used for the Conceptual Captions dataset
A simple dual-encoder architecture learns to align the visual and language representations of the image-text pairs using a contrastive loss
We show that the scale of our corpus can make up for its noise, leading to state-of-the-art representations even with such a simple learning scheme

Experiment & Result

Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB
- The aligned visual and language representations enable zero-shot image classification, and
- set new state-of-the-art results on the Flickr30K and MSCOCO image-text retrieval benchmarks
  - (even when compared with more sophisticated cross-attention models)

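Contribution

The representations also enable cross-modality search with 1) complex text queries and 2) text + image queries

Note on contrastive loss

Contrast: a difference between two or more things that becomes obvious when they are compared

contrastive learning: learning representations so that the differences between items show up more clearly

Zero-shot: “a model performing a task it was not trained on”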

cross-modality search: the task of searching data using different data modalities

1 Introduction

Background + limitations of prior work

So far, 1) visual and 2) vision-language representation learning have mostly been studied separately, with different training data sources
- 1) For vision, pre-training on large-scale supervised data such as ImageNet, OpenImages, and JFT-300M has proven critical for improving performance on downstream tasks via transfer learning
  - However, such pre-training datasets require heavy work on data gathering, sampling, and human annotation, so they are difficult to scale
- 2) Pre-training has also become the de-facto approach in vision-language modeling
  - However, vision-language pre-training datasets such as Conceptual Captions, Visual Genome Dense Captions, and ImageBERT require even heavier work on human annotation, semantic parsing, cleaning, and balancing
As a result, these datasets are only on the scale of ∼10M examples
- This is at least an order of magnitude smaller than their counterparts in the vision domain, and much smaller than the large internet text corpora used for NLP pre-training (e.g., Devlin et al. (2019); Radford et al. (2019); Yang et al. (2019); Liu et al. (2019b); Raffel et al. (2020))

This paper's Approach for overcoming the limitation

The paper leverages a dataset of over one billion noisy image alt-text pairs to scale visual and vision-language representation learning
- To obtain a large noisy dataset, we follow the procedure described for the Conceptual Captions dataset (Sharma et al., 2018)
  - but instead of applying its complex filtering and post-processing steps to clean the data, we apply only simple frequency-based filtering
- The resulting dataset is noisy, but it is two orders of magnitude larger than the Conceptual Captions dataset
  - = (my reading) -> there is an upside (size) and a downside (noise)
Visual and vision-language representations pre-trained on our exascale dataset achieve very strong performance on a wide range of tasks

Method

To train the model, we use an objective that aligns the visual and language representations in a shared latent embedding space using a simple dual-encoder architecture
- Similar objectives have previously been applied to learning visual-semantic embeddings (VSE) (Frome et al., 2013; Faghri et al., 2018)
  cf. Note on VSE (Visual-Semantic Embedding)
  : builds a joint image-text representation that maps images and text into a shared embedding space, enabling retrieval-related tasks such as image-text retrieval, image captioning, and visual question answering.
  - Abstract : We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-ofthe-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
  - ref: VSE++: Improving Visual-Semantic Embeddings with Hard Negatives(2018)
The model is named ALIGN: A Large-scale ImaGe and Noisy-text embedding
- The image and text encoders are learned via a contrastive loss (formulated as normalized softmax)
  - -> 1) it pulls the embeddings of matched image-text pairs together, while
  - -> 2) it pushes those of non-matched image-text pairs apart
- This is one of the most effective loss functions for both self-supervised (Chen et al., 2020b) and supervised (Zhai & Wu, 2019; Musgrave et al., 2020) representation learning

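Note on contrastive loss

It directly optimizes the objective of making the embeddings of a positive pair close and those of a negative pair far apart

It is the sum of a positive-pair loss and a negative-pair loss. The label is 1 for the same image (same class) and 0 for different images (different classes); embeddings are extracted from the images with a CNN, and the loss is computed from the distance between the embeddings

1) Positive pair: training minimizes the Euclidean distance between positive pairs, so that they end up close together in the low-dimensional embedding space. For the same class, the distance between the embeddings is itself the loss, so training drives the distance toward 0

2) Negative pair: a margin (= the minimum distance wanted between negative pairs) is introduced to make the Euclidean distance between negative pairs large. If the distance is smaller than the margin, the loss is non-zero and the CNN parameters are updated until the distance grows to the margin. If the distance is larger than the margin, the max function sets the loss to 0, so the weights are not updated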
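
The positive/negative terms described in this note can be sketched in a few lines of NumPy. This is only an illustration of the classic margin-based pairwise loss, not the loss ALIGN uses (ALIGN uses a normalized softmax). The function name, the squared-distance form, and the default `margin=1.0` are my own assumptions.

```python
import numpy as np

def pairwise_contrastive_loss(emb_a, emb_b, is_same, margin=1.0):
    """Margin-based contrastive loss for a batch of embedding pairs.

    emb_a, emb_b: arrays of shape (batch, dim)
    is_same:      array of shape (batch,), 1 for positive pairs, 0 for negative pairs
    margin:       assumed minimum distance wanted between negative pairs
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    is_same = np.asarray(is_same, dtype=float)

    # Euclidean distance between the two embeddings of each pair
    dist = np.linalg.norm(emb_a - emb_b, axis=1)

    # Positive pairs: the (squared) distance itself is the loss
    pos_term = is_same * dist ** 2
    # Negative pairs: only penalized while they are closer than the margin
    neg_term = (1.0 - is_same) * np.maximum(margin - dist, 0.0) ** 2

    return float(np.mean(pos_term + neg_term))
```

A negative pair that is already farther apart than the margin contributes zero loss, which matches the "no weight update" case in the note.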

Continuing with the paired texts described above:

If we consider the paired texts as fine-grained labels of the images (▽ explained below), the image-to-text contrastive loss is analogous to a conventional label-based classification objective
- = (my reading) -> each alt-text acts like a very specific class label for its image, so pulling an image toward its own text and away from the other texts works much like classifying the image into its "text class"
The key difference is that the text encoder generates the "label" weights
The top-left of Figure 1 summarizes the method used in ALIGN

fine-grained image?

A way of analyzing visual objects from subordinate categories

e.g., species of birds, models of cars
Fine-grained image analysis (FGIA) is a longstanding and fundamental problem in computer vision and pattern recognition, and underpins a diverse set of real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class and large intra-class variation inherent to fine-grained image analysis makes it a challenging problem. Capitalizing on advances in deep learning, in recent years we have witnessed remarkable progress in deep learning powered FGIA. In this paper we present a systematic survey of these advances, where we attempt to re-define and broaden the field of FGIA by consolidating two fundamental fine-grained research areas – fine-grained image recognition and fine-grained image retrieval. In addition, we also review other key issues of FGIA, such as publicly available benchmark datasets and related domain-specific applications. We conclude by highlighting several research directions and open problems which need further exploration from the community.

(ref): Fine-Grained Image Analysis with Deep Learning: A Survey(2021, IEEE)

Experiment & Result + Contribution

The aligned image and text representations are naturally suited for cross-modality matching/retrieval tasks and achieve state-of-the-art (SOTA) results
- For example, ALIGN outperforms the previous SOTA method by over 7% in most zero-shot and fine-tuned R@1 metrics on Flickr30K and MSCOCO
- Moreover, such cross-modality matching naturally enables zero-shot image classification when the classnames are fed into the text encoder
  - 76.4% top-1 accuracy on ImageNet without using any of its training samples
The image representation itself also achieves superior performance on various downstream visual tasks
- For example, ALIGN achieves 88.64% top-1 accuracy on ImageNet
- The bottom of Figure 1 shows cross-modal retrieval examples from a real retrieval system built with ALIGN
  - Reading Fig 1: a summary of ALIGN
    - Visual and language representations are jointly learned from noisy image alt-text data
    - The representations can be used for vision-only or vision-language task transfer
    - Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search (image-to-text search, text-to-image search, and search with joint image+text queries)

1
Prior approaches to classification and retrieval, and their limitation in transferability

High-quality visual representations for classification or retrieval are usually pre-trained on large-scale labeled datasets
Recently, self-supervised and semi-supervised learning have been studied as alternative paradigms
However, models trained by these methods have so far shown limited transferability to downstream tasks
The paper does not say what exactly this limitation is!!
- Judging from the context, my guess is that it is along the same lines as 2 below, i.e. that transferability becomes harder when the dataset is large

2
Approaches to learning visual representations, and their limitations

Leveraging images and natural language captions is another direction for learning visual representations
Several works have shown that good visual representations can be learned by training a model to predict the captions from images
- These works are, however, limited to small datasets such as Flickr and COCO Captions, and the resulting models do not produce the vision-language representation needed for tasks like cross-modal retrieval

3
Approaches to vision-language representation learning, and their limitations

In the vision-language representation learning domain, visual-semantic embeddings (VSE) and improved versions have been proposed
- More recently, more advanced models with cross-modal attention layers have emerged and shown superior performance on image-text matching tasks
- However, they are orders of magnitude slower, which makes them impractical for real-world image-text retrieval systems
- In contrast, our model inherits the simplest VSE form, yet still outperforms all previous cross-attention models on image-text matching benchmarks

4
Comparison with CLIP in terms of method

The work most closely related to ours is CLIP, which proposes visual representation learning via natural language supervision in a similar contrastive learning setting
- Besides using different vision and language encoder architectures, the key difference is the training data
  - ALIGN follows the natural distribution of image-text pairs from the raw alt-text data, whereas
  - CLIP collects its dataset by first constructing an allowlist of high-frequency visual concepts from English Wikipedia
  - We demonstrate that strong visual and vision-language representations can be learned with a dataset that does not require expert knowledge to curate

Note on CLIP

Previous SOTA vision systems are trained to predict a fixed set of predetermined object categories, which limits their generality => learning directly from raw text about images is proposed as an alternative
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. (한계점 지적->)This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. (해결책 제시->)Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. (어떤 방법으로?->)We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. (실험 방법 ->) We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. (실험 결과 ->) The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline
without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
(ref: Learning Transferable Visual Models From Natural Language Supervision (ICML, 2021))

5
'Learning From Noisy Large-Scale Datasets With Minimal Supervision(IEEE,2017)'

Noisy data is mentioned repeatedly from the very start of the paper, so to understand in what sense data is "noisy" and how annotation is actually carried out,
I looked at the following paper: 'Learning From Noisy Large-Scale Datasets With Minimal Supervision(2017, IEEE)'. It is also highly cited, with 433 citations. I summarize it briefly below

One-line summary of the paper:
: to make good use of millions of images with noisy annotations, it proposes a method that leverages a small set of clean annotations, and evaluates it on the Open Images dataset

In this paper too, 1 Introduction covers the Abstract well, so any questions raised by the Abstract can be resolved there

Abstract

To learn powerful image representations,
- the paper presents an approach that effectively uses millions of images with noisy annotations in conjunction with a small subset of cleanly-annotated images
A common approach to combining clean and noisy data is
- first, 1) pre-train a network using the large noisy dataset, and
- 2) fine-tune it with the clean dataset
- The paper demonstrates that this approach does not fully leverage the information contained in the clean set
Therefore, we propose a way to use the clean annotations to reduce the noise in the large dataset before fine-tuning the network
- The approach consists of a multi-task network that
  - jointly learns to clean noisy annotations and
  - accurately classify images.
Evaluation is carried out on the Open Images dataset
- Open Images dataset: contains ∼9 million images, multiple annotations per image, and over 6000 unique classes
- For the small clean set of annotations we use a quarter of the validation set with ∼40k images
Results
- clearly outperforms direct fine-tuning across all major categories of classes in the Open Images dataset
- is particularly effective for a large number of classes with a wide range of noise in annotations (20-80% false positive annotations).

1 Introduction

Background

Deep convolutional neural networks (ConvNets) are proliferating in today's machine vision
- One of the biggest bottlenecks in scaling their training is the need for massive and clean collections of semantic annotations for images
Today, even five years after the success of ImageNet, there is still no publicly available dataset containing more clean labeled data
To tackle this bottleneck, other training paradigms have been explored that aim to bypass the need for manually collected annotations.
- Examples are 1) unsupervised learning, 2) self-supervised learning, and 3) learning from noisy annotations
  - Most of these approaches make the following assumption
    - "All annotations are noisy, and no clean data is available"
    - In reality, typical learning scenarios are closer to semi-supervised learning
      - : images have noisy or missing annotations + a small fraction of the images also have clean annotations
      - An example is when images with noisy annotations are mined from the web, and a small fraction is then sent for costly human verification
Semi-supervised learning: a methodology that applies supervised learning to a small amount of labeled data and unsupervised learning to a large amount of unlabeled data, aiming at additional performance gains

Limitations of prior work + this paper's Approach for overcoming them

The paper explores how to effectively leverage a small amount of clean annotations in conjunction with large amounts of noisy annotated data
- in particular for training convolutional neural networks
One common approach is to pre-train a network with the noisy data and then fine-tune it with the clean dataset -> to obtain better performance
- We argue that this approach does not fully leverage the information contained in the clean annotations
- As an alternative, instead of using the small clean dataset to learn visual representations directly, we use it to learn a mapping between noisy and clean annotations

Approach: to train a multi-label image classifier using a large dataset with relatively noisy labels, where additionally a small subset of the dataset has human verified labels available.

Method

This mapping 1) learns the patterns of noise and also 2) captures the structure in the label space
- The learned mapping between noisy and clean annotations makes it possible to
  - 1) clean the noisy dataset, and
  - 2) fine-tune the network using both the clean data and the full dataset with reduced noise
The proposed approach consists of a multi-task network
- that jointly learns to clean noisy annotations and to accurately classify images
- Reading Fig 2
  - Noisy input labels are cleaned and then used as targets for the final classifier
  - The label cleaning network and the multi-label classifier are jointly trained and share visual features from a deep convnet.
  - The cleaning network is supervised by the small set of clean annotations, and
  - the final classifier uses both the clean data and the much larger noisy data
Furthermore, the image classification problem is considered with the following goal
- annotating images with all concepts present in the image
- When considering label noise, two aspects are worth special attention
  - 1) Many multi-label classification approaches assume that classes are independent
    - However, the label space is typically highly structured, as illustrated by the example in Fig 1

      = independence is the usual assumption, but Fig 1 shows that the labels are in fact connected to and dependent on one another!!
    Reading Fig 1
    - task: train a robust multi-label image classifier from noisy annotations
    - image annotation: simple lists of classes
    - graph with green and red edges: strong positive and negative relations
    - The proposed approach produces both
      - a cleaned version of the dataset and
      - a robust image classifier
    - We therefore model the label-cleaning network as conditionally dependent on all noisy input labels
  - 2) Many classes can have multiple semantic modes
    - For example, the class coconut may be assigned to images containing a drink, a fruit, or a tree
    - To differentiate between these modes, the input image itself needs to be taken into account
  - Our model therefore captures the dependence of annotation noise on the input image
    - by making the learned cleaning network conditionally dependent on image features

Experiment & Result

Evaluated on the recently released large-scale Open Images Dataset.
Results
- Outperforms traditional fine-tuning methods
  - Direct fine-tuning sometimes hurts performance when only limited rated data is available
  - In contrast, our method improves performance across the full range of label noise levels, and is most effective for classes that have 20% to 80% false positive annotations in the training set
- Performs well across all eight high-level categories of Open Images
  - vehicles, products, art, person, sport, food, animal, plant

Contribution

1 Introduces a semi-supervised learning framework for multi-label image classification
- uses a small set of clean annotations together with a large set of noisy annotations
2 Provides a first benchmark on the Open Images Dataset
3 Demonstrates that the proposed approach uses small labeled data more effectively than traditional fine-tuning

Fig3: Overview of approach

An approach to train an image classifier from 1) 'a very large set of training samples with noisy labels (orange)' and 2) 'a small set of samples which additionally have human verification (green)'
The 'label cleaning network' is used to map noisy labels to clean labels. It
- is conditioned on visual features from an Inception V3 ConvNet,
- is supervised by the human verified labels, and
- follows a residual architecture => so that it only needs to learn the difference between the noisy and clean labels.
The 'image classifier'
- shares the same visual features
- learns to directly predict clean labels supervised by either (a) the output of the label cleaning network or (b) the human rated labels, if available.
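
To make the "residual architecture" idea concrete, here is a minimal NumPy sketch: the cleaning network predicts only a correction, which is added to the noisy labels. The single linear layer, the weight names `w_labels` / `w_features`, and the clipping to [0, 1] are simplifying assumptions of mine, not the paper's actual architecture.

```python
import numpy as np

def clean_labels(noisy_labels, image_features, w_labels, w_features):
    """Residual label cleaning: cleaned = noisy + predicted correction.

    noisy_labels:   (batch, num_classes) noisy multi-label annotations in [0, 1]
    image_features: (batch, feature_dim) visual features from the shared ConvNet
    w_labels:       (num_classes, num_classes) hypothetical weights over the noisy labels
    w_features:     (feature_dim, num_classes) hypothetical weights over the features
    """
    # The correction depends on all noisy labels AND on the image features
    residual = noisy_labels @ w_labels + image_features @ w_features
    # Assumption: clip so the cleaned labels stay valid scores in [0, 1]
    return np.clip(noisy_labels + residual, 0.0, 1.0)
```

With all-zero weights the network returns the noisy labels unchanged, which is why it only has to learn the difference between noisy and clean labels.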

Fig7: Examples

3. A Large-Scale Noisy Image-Text Dataset

The focus of the work is to scale up visual and vision-language representation learning
- For this purpose, we resort to a much larger dataset than existing ones
- Specifically, we follow the methodology used to construct the Conceptual Captions dataset (Sharma et al., 2018) to obtain a version of raw English alt-text data (image and alt-text pairs)
  - The Conceptual Captions dataset itself was cleaned by heavy filtering and post-processing
- Here, for the purpose of scaling, we trade quality for scale by relaxing most of the cleaning steps in the original work
  - Only minimal frequency-based filtering is applied (detailed below)
Result: a much larger (1.8B image-text pairs) but noisier dataset
Fig2: sample image-text pairs from the dataset
Huh? Only the second image really looks like a noisy text annotation!!!
- "thumbnail of the version as of 215720, June 2010" => fair enough.. the rest are not really what I would call noisy!!!

  What follows explains how the dataset was filtered! This is presumably the minimal frequency-based filtering!

Image-based filtering.

Following Sharma et al. (2018), we remove pornographic images and keep only images whose shorter dimension is larger than 200 pixels and whose aspect ratio is smaller than 3
- Images with more than 1000 associated alt-texts are discarded
To ensure that we do not train on test images, we also remove duplicates and near-duplicates of test images in all downstream evaluation datasets (e.g., ILSVRC-2012, Flickr30K, and MSCOCO)

Text-based filtering.
1. Remove alt-texts (▽ explained below) that are shared by more than 10 images

-> these alt-texts are often irrelevant to the content of the images
- e.g., “1920x1080”, “alt img”, and “cristina”

2. Remove alt-texts that contain any rare token (outside of the 100 million most frequent unigrams and bigrams from the raw dataset), or that are either too short (<3 unigrams) or too long (>20 unigrams)

-> this removes noisy texts such as “image tid 25&id mggqpuweqdpd&cache 0&lan code 0”, as well as texts that are too generic to be useful
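
The text-based rules above can be sketched as a small filter. This is a toy illustration, not the authors' pipeline: the function name, whitespace tokenization, and the unigram-only `vocab` set (the paper also uses bigrams) are my simplifying assumptions. The thresholds 10, 3, and 20 come from the rules above.

```python
from collections import defaultdict

def filter_alt_texts(pairs, vocab, max_images_per_text=10, min_len=3, max_len=20):
    """Keep (image_id, alt_text) pairs that pass the frequency-based text rules.

    pairs: list of (image_id, alt_text)
    vocab: set of frequent tokens (assumed stand-in for the frequent n-gram list)
    """
    # Count how many distinct images share each alt-text
    images_per_text = defaultdict(set)
    for image_id, text in pairs:
        images_per_text[text].add(image_id)

    kept = []
    for image_id, text in pairs:
        tokens = text.split()  # assumption: whitespace tokens act as unigrams
        if len(images_per_text[text]) > max_images_per_text:
            continue  # shared by too many images, likely unrelated to content
        if not (min_len <= len(tokens) <= max_len):
            continue  # too short or too long
        if any(token not in vocab for token in tokens):
            continue  # contains a rare token
        kept.append((image_id, text))
    return kept
```

For example, an alt-text attached to 11 different images is dropped even if every token in it is frequent.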

Note on alt-texts
: the “why” of the image as it relates to the content of a document or webpage.

4. Pre-training and Task Transfer

Before Section 5 covers the experiments and results, Section 4 describes how pre-training and transfer were carried out, in three parts
4.1. Pre-training on Noisy Image-Text Pairs
4.2. Transferring to 'Image-Text Matching & Retrieval'
4.3. Transferring to Visual Classification

4.1. Pre-training on Noisy Image-Text Pairs

1
ALIGN is pre-trained using a dual-encoder architecture

Model composition

The model consists of a pair of image and text encoders with a cosine-similarity combination function at the top
- EfficientNet with global pooling is used as the image encoder
  - (without training the 1x1 conv layer in the classification head)
- BERT with the [CLS] token embedding is used as the text embedding encoder
  - we generate 100k wordpiece vocabulary from our training dataset
- A fully-connected layer with linear activation is added on top of the BERT encoder to match the dimension from the image tower
- Both the image and text encoders are trained from scratch

  Cosine similarity: the similarity of two vectors, measured by the cosine of the angle between them

The image and text encoders are optimized via normalized softmax loss (Zhai & Wu, 2019)
In training, matched image-text pairs are treated as positive, and all other random image-text pairs that can be formed in a training batch as negative

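The sum of two losses is minimized
- 1. one for image-to-text classification

$$L_{i2t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^{\top}y_i/\sigma)}{\sum_{j=1}^{N}\exp(x_i^{\top}y_j/\sigma)}$$

- 2. one for text-to-image classification

$$L_{t2i} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(y_i^{\top}x_i/\sigma)}{\sum_{j=1}^{N}\exp(y_i^{\top}x_j/\sigma)}$$

- Here ${x_i}$ and ${y_j}$ are the normalized embeddings of the image in the $i$-th pair and of the text in the $j$-th pair, respectively
- N : batch size
- σ: temperature that scales the logits
- To make the in-batch negatives more effective, embeddings from all computing cores are concatenated -> to form a much larger batch
- The temperature variable is crucial because both the image and text embeddings are L2-normalized
- Instead of manually sweeping for the optimal temperature value, we find that it can be effectively learned together with all the other parameters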
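
The two-direction normalized softmax loss described here can be sketched in NumPy as follows. It is a minimal illustration with a fixed temperature and no label smoothing; in the paper the temperature is learned and the batch is gathered across cores. The function names are my own.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def align_style_loss(image_emb, text_emb, temperature=1.0):
    """Sum of image-to-text and text-to-image normalized softmax losses.

    image_emb, text_emb: (N, dim) arrays; row i of each forms a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities
    x = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    y = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = x_i . y_j / sigma ; the diagonal holds the matched pairs
    logits = x @ y.T / temperature

    # image-to-text: each image (row) classifies among all texts in the batch
    loss_i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    # text-to-image: each text (column) classifies among all images in the batch
    loss_t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return float(loss_i2t + loss_t2i)
```

Because the embeddings are L2-normalized, the logits are bounded in [-1/σ, 1/σ], which is why the temperature matters so much: a smaller σ lets the softmax become sharper.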

4.2. Transferring to 'Image-Text Matching & Retrieval'

ALIGN models are evaluated on 1) image-to-text and 2) text-to-image retrieval tasks, both 1) with and 2) without fine-tuning
- Two benchmark datasets are considered
  - 1) Flickr30K (Plummer et al., 2015)
  - 2) MSCOCO (Chen et al., 2015)
ALIGN is also evaluated on Crisscrossed Captions (CxC) (▽ explained below) (Parekh et al., 2021)
- an extension of MSCOCO with additional human semantic similarity judgments for
  - 1) caption-caption,
  - 2) image-image, and
  - 3) image-caption pairs.
- With the extended annotations, CxC enables four intra- and inter-modal retrieval tasks
  - 1) image-to-text
  - 2) text-to-image
  - 3) text-to-text,
  - 4) image-to-image retrieval,
  - +) three semantic similarity tasks (this looks like the main contribution of CxC; see its fig1)
    - 1) semantic textual similarity (STS),
    - 2) semantic image similarity (SIS),
    - 3) semantic image-text similarity (SITS).
  - Because the training set is identical to the original MSCOCO -> the MSCOCO fine-tuned ALIGN model can be evaluated directly on the CxC annotations

The experiments are covered in detail later, but a question may come up at this point

Q. How were the text-to-text and image-to-image retrieval experiments done?
As seen in fig1, ALIGN is a model that retrieves results across modalities, from a single modality or from a combination: text -> image, image -> text, image+text -> image, and so on
- So how were text-to-text and image-to-image retrieval possible?
- They were made possible by the CxC data, which carries the 'additional human semantic similarity judgments' (=> the Tab3 caption states "on CxC dataset" !!!) (in this data, humans directly annotated semantic similarity judgments pair by pair)
- So the text-to-text and image-to-image retrieval results were obtained and evaluated with this data

Crisscrossed Captions (CxC)

Problem: By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal associations: (the data had these problems ->) images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations and there are missing positive cross-modal associations. This undermines research into how inter-modality learning impacts intra-modality tasks.

Proposed solution: We address this gap with Crisscrossed Captions (CxC), an extension of the MSCOCO dataset with human semantic similarity judgments for 267,095 intra- and intermodality pairs. (=> it seems humans directly annotated semantic similarity judgments pair by pair, as an extension of the MSCOCO dataset!!) We report baseline results on CxC for strong existing unimodal and multimodal models.

Evaluation result: We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs that crucially demonstrates CxC’s value for measuring the influence of intra- and inter-modality learning.

ref: Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO (EACL 2021)

image-text retrieval (ITR)

(my reading) => given a query in one modality, retrieve the relevant samples from the other modality; it consists of two subtasks (image-to-text and text-to-image)

Cross-modal image-text retrieval (ITR) is to retrieve the relevant samples from one modality as per the given user expressed in another modality, usually consisting of two subtasks: image-to-text (i2t) and text-to-image (t2i) retrieval.
ref: Image-text Retrieval: A Survey on Recent Research and Development(2022, CVPR)

4.3. Transferring to Visual Classification

This part describes how the transfer named in the title of 4.3 was carried out
It is enough to know that three tasks were performed, as follows!!

1. Zero-shot transfer of ALIGN

Applied to:
1) visual classification on the ImageNet ILSVRC-2012 benchmark, plus its variants:
2) ImageNet-R(endition) (non-natural images such as art, cartoons, sketches)
3) ImageNet-A(dversarial) (more challenging images for ML models)
4) ImageNet-V2
How do the variants relate to ImageNet?
- All of these variants follow the same set (or a subset) of ImageNet classes, while ImageNet-R and ImageNet-A are sampled from drastically different distributions from ImageNet

2. Transferring the image encoder to downstream visual classification tasks

ImageNet and a handful of smaller fine-grained classification datasets are used for this
- The datasets are
  - Oxford Flowers-102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), and Food101 (Bossard et al., 2014).
- For ImageNet, results from two settings are reported
  - training the top classification layer only (with frozen ALIGN image encoder) and fully fine-tuned.
  - Only the latter setting is reported for fine-grained classification benchmarks.

3. Following Kolesnikov et al. (2020), we also evaluate the robustness of the model on the Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019)

VTAB: consists of 19 diverse visual classification tasks (covering subgroups of natural, specialized and structured image classification tasks), with 1000 training samples each

5. Experiments and Results

This part describes the setup in which the models were trained.
After that, Section 5 consists of the following parts.
5.1. Image-Text Matching & Retrieval
5.2. Zero-shot Visual Classification
5.3. Visual Classification w/ Image Encoder Only

Trained from scratch
- using the open-sourced implementation of EfficientNet as the image encoder and BERT as the text encoder
- Except in the ablation study, the reported ALIGN results use the following:
the image encoder is 1) EfficientNet-L2 and the text encoder is 2) BERT-Large
- 1) image encoder: trained at resolution of 289 × 289 pixels (no matter which EfficientNet variant is used)
  - 1. resize input images to 346 × 346 resolution
  - 2. perform random crop (with additional random horizontal flip) in training, and central crop in evaluation
- 2) text encoder: uses wordpiece sequences of maximum 64 tokens
  - reason: input texts are no longer than 20 unigrams
softmax temperature variable: initialized as 1.0
label smoothing parameter: 0.1
LAMB optimizer (You et al., 2020) with weight decay ratio 1e-5.
The learning rate is warmed up linearly to 1e-3 from zero in 10k steps, and then linearly decays to zero in 1.2M steps (∼12 epochs).
The model is trained on 1024 Cloud TPUv3 cores with 16 positive pairs on each core
total effective batch size = 16384.
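
The warmup-then-decay schedule and the batch-size arithmetic above can be written out as a short sketch. One thing the text leaves open is whether 1.2M counts the total steps or only the decay phase. I assume it is the total, and the function name is hypothetical.

```python
def align_learning_rate(step, peak=1e-3, warmup_steps=10_000, total_steps=1_200_000):
    """Linear warmup from 0 to `peak`, then linear decay to 0.

    Assumption: `total_steps` includes the warmup steps.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - warmup_steps)

# Effective batch size: 1024 cores x 16 positive pairs per core
effective_batch_size = 1024 * 16
```

The learning rate is 0 at step 0, reaches the peak 1e-3 at step 10k, and returns to 0 at step 1.2M. The product 1024 × 16 reproduces the 16384 stated above.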

5.1. Image-Text Matching & Retrieval

ALIGN is evaluated on the Flickr30K and MSCOCO cross-modal retrieval benchmarks, in both 1) zero-shot and 2) fully fine-tuned settings
Train/test splits: we follow (Karpathy & Fei-Fei, 2015) and most existing works
  1) Flickr30K: we evaluate on the standard 1K test set, and finetune on the 30k training set.
  2) MSCOCO: we evaluate on the 5K test set, and finetune on 82K training plus 30K additional validation images that are not in the 5K validation or 5K test sets.
Fully fine-tuned setting:
- the same loss function is used during fine-tuning
- Issue and remedy
  - There can be false negatives when the batch size is comparable to the total number of training samples.
    - -> So we reduce the global batch size from 16384 to 2048.
  - We also reduce the initial learning rate to 1e-5 and train for 3K and 6K steps (with linear decay) respectively on Flickr30K and MSCOCO.
  - All the other hyper-parameters are kept the same as pre-training.

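Reading Tab1

Q. What are R@1, R@5, and R@10? (note)
R stands for Recall

K: the number of recommended (retrieved) items

Recall: of all the items that are actually 1, the fraction that I predicted as 1

Recall@K: of all the items the user is interested in, the fraction covered by the K items I recommended

=> Example: in the [Figure 1] example, the user is interested in 6 items in total, and 3 of the recommended items are ones the user liked. Therefore, Recall@5 = 3/6 = 0.5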
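
The Recall@K definition above fits in a few lines of Python. The function name and the item IDs are made up for illustration; the numbers mirror the example (6 relevant items, 3 of them among the top 5).

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant_items)
    if not relevant:
        raise ValueError("relevant_items must not be empty")
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

# 6 relevant items, 3 of which show up in the top-5 recommendations
ranking = ["a", "x", "b", "y", "c", "d"]
liked = ["a", "b", "c", "d", "e", "f"]
print(recall_at_k(ranking, liked, k=5))  # 0.5
```

In text-to-image retrieval on benchmarks like Flickr30K each caption has a single ground-truth image, so R@K reduces to the fraction of queries whose correct image appears in the top K results.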

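Where and why is it needed? (note)

Metrics commonly used to measure model performance in machine learning, such as accuracy and AUC score, are hard to apply to recommender systems

This is because a recommender system needs a metric that can answer the following two questions

How well did the recommender system place the items the user prefers near the top?

Are the user's relative preferences among the recommended items well reflected?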

Image-Text Matching & Retrieval: reading the results
These are the Image-Text Matching & Retrieval results on the two datasets
ALIGN is compared with ImageBERT, UNITER, CLIP, GPO, ERNIE-ViL, VILLA, and Oscar

ALIGN achieves SOTA on all metrics of both benchmarks
- 1) zero-shot setting,
  - ALIGN gets more than 7% improvement in image retrieval task compared to the previous SOTA, CLIP (Radford et al.,2021).
- 2) fine-tuning,
  - ALIGN outperforms all existing methods by a large margin, including those that employ more complex cross-modal attention layers such as ImageBERT (Qi et al., 2020), UNITER (Chen et al., 2020c), ERNIE-ViL (Yu et al., 2020), VILLA (Gan et al., 2020) and Oscar (Li et al., 2020).

Reading Tab2

ALIGN's performance on the Crisscrossed Captions (CxC) retrieval tasks
- achieves SOTA on all metrics
  - especially by a large margin on image-to-text (+22.2% R@1) and text-to-image (+20.1% R@1) tasks.
- ALIGN also outperforms the previous SOTA on SITS task with an improvement of 5.7%.

Reading Tab3

Shows that ALIGN outperforms the previous SOTA on the SITS task by 5.7%
- Interestingly, despite doing well on inter-modal tasks, ALIGN is not as impressive on intra-modal tasks.
  - For instance, the improvements on the text-to-text and image-to-image retrieval tasks are less significant than those on the image-to-text and text-to-image tasks.
  - Its performance on the STS and SIS tasks is also worse than that of VSE++ and DEI2T
    - We suspect this is because the training objective of ALIGN focuses on cross-modal (image-text) matching rather than intra-modal matching
  - Parekh et al. (2021) suggest that multitask learning could produce more balanced representations; we leave this to future work
(= my reading) ALIGN does best on the SITS task, with 67.6, while other models do better on SIS and STS => why? Because ALIGN's training objective focuses on cross-modal (image-text) matching rather than intra-modal matching

5.2. Zero-shot Visual Classification

Method: if the texts of classnames are fed directly into the text encoder, ALIGN can classify images into candidate classes via image-text retrieval
Tab4: comparison of ALIGN and CLIP on ImageNet and its variants
- ALIGN shows robustness on classification tasks with different image distributions
  - For a fair comparison, the same prompt ensembling method as CLIP is used.
  - Each classname is expanded with a set of prompt templates
    - these are defined by CLIP, e.g. “A photo of a {classname}”
    - The class embedding is computed by averaging the embeddings of all templates, followed by an L2-normalization
    - We find that such ensembling gives a 2.9% improvement on ImageNet top-1 accuracy.
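
Zero-shot classification with prompt ensembling can be sketched as below. `encode_text` is a hypothetical stand-in for the trained text encoder (any callable mapping a string to a vector), and the function names are my own. Only the "average over templates, then L2-normalize" step follows the description above.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def build_class_embeddings(classnames, templates, encode_text):
    """One embedding per class: average over all prompt templates, then L2-normalize."""
    class_embs = []
    for name in classnames:
        prompts = [template.format(classname=name) for template in templates]
        embs = np.stack([encode_text(prompt) for prompt in prompts])
        class_embs.append(l2_normalize(embs.mean(axis=0)))
    return np.stack(class_embs)

def zero_shot_classify(image_emb, class_embs):
    """Return the index of the class whose embedding is most similar to the image."""
    scores = class_embs @ l2_normalize(image_emb)  # cosine similarities
    return int(np.argmax(scores))
```

A usage call would look like `build_class_embeddings(["cat", "dog"], ["A photo of a {classname}"], encode_text)`, followed by `zero_shot_classify(image_emb, class_embs)` for each test image. No ImageNet training sample is involved at any point.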

5.3. Visual Classification w/ Image Encoder Only

실험1 ImageNet benchmark과 비교

ImageNet benchmark
- 1. learned visual features를 freeze하고 classification head만 훈련했다
- 1. 모든 레이어를 fine-tune했다
  - random cropping (same as in Szegedy et al. (2015))이나 horizontal flip을 포함한 basic data augmentations를 사용했다
- evaluation에 있어서 single central crop을 0.875로 적용했다
- Touvron et al. (2019)을 따라서 training and evaluation 사이에 0.8 scale ratio를 두었다 -> resolution discrepancy를 mitigate하고자
- 특히 train/eval resolution is 289/360 with frozen visual features => 475/600 when fine-tuning all variables.
both stages of training에서 다음을 사용
- global batch size of 1024, SGD optimizer with momentum 0.9, and learning rate decayed every 30 epochs with ratio 0.2 (100 epochs in total).
- Weight decay is set to zero
- With frozen visual features, we use the initial learning rate of 0.1.
- When fine-tuning all layers, we use the initial learning rate of 0.01, and use a 10x smaller learning rate on the backbone network compared to the classification head.
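
As a rough sketch of the schedule described above (decay by a ratio of 0.2 every 30 epochs, with a 10x smaller rate on the backbone during fine-tuning), assuming a plain step schedule; the function names are hypothetical and not from the paper:

```python
def step_decay_lr(epoch, initial_lr, decay_ratio=0.2, decay_every=30):
    """Learning rate after multiplying by `decay_ratio` every `decay_every` epochs."""
    return initial_lr * decay_ratio ** (epoch // decay_every)

def fine_tune_lrs(epoch, head_lr=0.01, backbone_scale=0.1):
    """Per-group learning rates when fine-tuning all layers:
    the backbone uses a 10x smaller rate than the classification head."""
    head = step_decay_lr(epoch, head_lr)
    return {"head": head, "backbone": head * backbone_scale}

# Frozen-feature stage: initial learning rate 0.1 over 100 epochs.
frozen_schedule = [step_decay_lr(e, 0.1) for e in range(100)]
```

With 100 epochs in total, the rate is decayed three times (at epochs 30, 60, and 90).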

Tab5

Comparison of ALIGN with prior methods on the ImageNet benchmark
- Results:
  - With frozen features, ALIGN slightly outperforms CLIP and achieves SOTA result of 85.5% top-1 accuracy.
  - After fine-tuning, ALIGN achieves higher accuracy than BiT and ViT models, and is only worse than Meta Pseudo Labels, which requires deeper interaction between ImageNet training and large-scale unlabeled data.
  - Compared to NoisyStudent and Meta-Pseudo-Labels, which also use EfficientNet-L2, ALIGN saves 44% FLOPS by using smaller test resolution (600 instead of 800).
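
The 44% figure is consistent with a quick back-of-the-envelope check, assuming that the FLOPs of a convolutional network scale roughly with the number of input pixels (this assumption is not stated in the summary above):

$$\frac{\text{FLOPs}_{600}}{\text{FLOPs}_{800}} \approx \left(\frac{600}{800}\right)^2 = 0.5625, \qquad 1 - 0.5625 = 0.4375 \approx 44\%$$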

Experiment 2: VTAB eval

According to the paper, VTAB consists of 19 diverse visual classification tasks (covering subgroups of natural, specialized and structured image classification tasks), each with 1000 training samples

We follow a hyper-parameter sweep as shown in the Appendix I in (Zhai et al., 2019) with 50 trials for each task. Each task is trained on 800 images and the hyperparameters are selected using the validation set of 200 images.
After the sweep, the selected hyperparameters are used to train on the combined training and validation splits of 1000 images for each task.

Table 6

reports the mean accuracy (including the breakdown results on each subgroup) with standard deviation from three fine-tuning runs and shows that ALIGN outperforms BiT-L (Kolesnikov et al., 2020) with similar hyper-parameter selection method applied.

Experiment 3: comparison with BiT-L

To evaluate on smaller fine-grained classification benchmarks, we adopt a simple fine-tuning strategy for all tasks. We use the same data augmentation and optimizer as in ImageNet fine-tuning. Similarly, we first train the classification head and then fine-tune all layers, except with batch norm statistics frozen. The train/eval resolution is fixed at 289/360. We use batch size 256 and weight decay 1e-5. The initial learning rate is set to 1e-2 and 1e-3 respectively, with cosine learning rate decay in 20k steps.
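
The "cosine learning rate decay in 20k steps" mentioned above could look like the following minimal sketch. It assumes the standard half-cosine form decaying to zero with no warmup, which the summary does not specify; the function name is hypothetical.

```python
import math

def cosine_decay_lr(step, initial_lr, total_steps=20_000):
    """Half-cosine decay from `initial_lr` at step 0 to 0 at `total_steps`."""
    step = min(step, total_steps)
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Stage 1 (classification head only) starts at 1e-2, stage 2 (all layers) at 1e-3.
head_stage_lr = cosine_decay_lr(10_000, 1e-2)
full_stage_lr = cosine_decay_lr(10_000, 1e-3)
```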

Table 7

compares ALIGN with BiT-L (Kolesnikov et al., 2020) and SAM (Foret et al., 2021) which both apply same fine-tuning hyper-parameters for all tasks. For small tasks like these, details in finetuning matter. So we list the baseline results in (Foret et al., 2021) without using SAM optimization for a fairer comparison. Our result (average of three runs) is comparable to the SOTA results without tweaking on optimization algorithms.

6. Ablation Study

The earlier sections reported results on several tasks and datasets such as Flickr30K, MSCOCO, and CxC. Section 6 instead compares model variants, including on a KNN task, to find which image encoder and text encoder give ALIGN the best performance

Model performance is compared on 1) MSCOCO zero-shot retrieval and 2) ImageNet K-Nearest-neighbor (KNN) tasks
These two metrics were found to be representative and to correlate well with the other metrics
Unless otherwise noted, hyperparameters are the same as in the baseline

6.1. Model Architectures

First, the performance of ALIGN models is studied
- with different image and text backbones.
How?
- We train EfficientNet from B1 to L2 for the image encoder and BERT-Mini to BERT-Large for the text encoder.
- We add an additional fully-connected layer with linear activation on top of B1, B3, B5 and L2 globally-pooled features to match the output dimension of B7 (640).
- A similar linear layer is added to all text encoders. We reduce the training steps to 1M in ablation to save some runtime.

Fig3 interpretation
- Result: MSCOCO zero-shot retrieval and ImageNet KNN results for various combinations of image and text backbones
  - Model quality improves with larger backbones, except that the ImageNet KNN metric starts to saturate from BERT-Base to BERT-Large with EfficientNet-B7 and EfficientNet-L2.
  - As expected, 1) scaling up the image encoder capacity is more important for vision tasks
    - e.g., even with BERT-Mini text tower, L2 performs better than B7 with BERT-Large
  - 2) For image-text retrieval tasks, the image and text encoder capacities are equally important.
  - Based on the nice scaling property shown in Figure 3, we trained the model in Section 5 with EfficientNet-L2 + BERT-Large.
Key architecture hyperparameters are also studied
- for example, embedding dimensions, number of random negatives in the batch, and the softmax temperature.
Interpretation:
- Table 8
  - Row1
    - Several model variants are compared against a baseline model
    - The baseline is trained with: EfficientNet-B5 image encoder, BERT-Base text encoder, embedding dimension 640, all negatives in the batch, and a learnable softmax temperature.
  - Rows 2-4
    - show that model performance improves with higher embedding dimensions.
      - Hence, we let the dimension scale with larger EfficientNet backbone (L2 uses 1376).
  - Rows 5 and 6
    - show that using fewer in-batch negatives (50% and 25%) in the softmax loss will degrade the performance.
  - Rows 7-9
    - study the effect of the temperature parameter in the softmax loss.
    - Compared to the baseline model that learns the temperature parameter (converged to about 1/64), some hand-selected, fixed temperatures could be slightly better. However, we choose to use the learnable temperature as it performs competitively and makes learning easier. We also notice that the temperature usually decreases quickly to only around 1.2x of the converged value in the first 100k steps, and then slowly converges until the end of training.
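
To make the roles of the in-batch negatives and the softmax temperature concrete, here is a minimal NumPy sketch of a normalized-softmax contrastive loss. It assumes the symmetric form (image-to-text plus text-to-image) with a fixed temperature of 1/64, the value the learned temperature is reported to converge to; it is an illustration, not the authors' implementation.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_embs, text_embs, temperature=1.0 / 64):
    """Pair i (image i, text i) is the positive; every other item
    in the batch serves as a negative."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return i2t + t2i

# Toy check: perfectly aligned pairs vs. mismatched pairs.
embs = np.eye(4)
aligned_loss = contrastive_loss(embs, embs)
shuffled_loss = contrastive_loss(embs, np.roll(embs, 1, axis=0))
```

A smaller temperature sharpens the softmax, and using fewer in-batch negatives would correspond to dropping columns (or rows) of `logits` before the softmax.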

6.2. Pre-training Datasets

It is also important to understand how the model performs when trained on different datasets of varying size => two models are trained
1. EfficientNet-B7 + BERT-base
2. EfficientNet-B3 + BERT-mini

3 datasets:
- 1) full ALIGN training data,
- 2) 10% randomly sampled ALIGN training data, and
- 3) Conceptual Captions (CC-3M, around 3M images).
CC-3M is much smaller so we train the model with 1/10 of the default number of steps. All models are trained from scratch.
Tab 9 interpretation:
- A large-scale training set is essential for scaling up our models and achieving better performance
  - For example, models trained on ALIGN data clearly outperform those trained on CC-3M data
  - On CC-3M, B7+BERT-base starts to overfit and performs even worse than B3+BERT-mini.
  - Conversely, a larger model is required to fully utilize the larger dataset
    - the smaller B3+BERT-mini almost saturates at 10% of ALIGN data
    - with the larger B7+BERT-base, there is a clear improvement with full ALIGN data
To better understand how data size scaling wins over the increased noise, the following is done
- randomly sample 3M, 6M, and 12M ALIGN training data and compare them with the cleaned CC-3M data on B7+BERT-base model.
Table 10
- ALIGN data performs worse than CC data at the same size (3M), but
- it catches up at 6M and 12M
- Despite being noisy, ALIGN data outperforms CC with only 4x the size (23.8, 17.5, 51.4)

7. Analysis of Learned Embeddings

Analysis 1
A simple image retrieval system is built -> to study the behavior of the embeddings trained by ALIGN

For demonstration purposes, we use an index consisting of 160M CC-BY licensed images that are separate from our training set.
Figure 4
- shows the top 1 text-to-image retrieval results for a handful of text queries.
- Result: ALIGN can retrieve 'precise images given detailed descriptions of a scene' or 'fine-grained or instance-level concepts like landmarks and artworks'
  - What these examples demonstrate
    - 1) ALIGN model can align images and texts with similar semantics
    - 2) ALIGN can generalize to novel complex concepts

Analysis 2

Previously, word2vec (Mikolov et al., 2013a;b) showed that linear relationships between word vectors emerge as a result of training them to predict adjacent words in sentences and paragraphs
We show that linear relationships between image and text embeddings also emerge in ALIGN
- Image retrieval is performed using a 'combined image+text query'.
  - More specifically, given a query image and a text string, their ALIGN embeddings are added together and used to retrieve relevant images
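
The combined image+text query can be sketched as simple vector arithmetic followed by nearest-neighbor search. This is a toy illustration under stated assumptions: the 3-dimensional vectors are hand-made stand-ins for real ALIGN embeddings, and `multimodal_search` and its `sign` parameter are hypothetical names introduced here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def multimodal_search(image_emb, text_emb, index_embs, top_k=1, sign=1.0):
    """Combine an image and a text embedding, then rank index images by
    cosine similarity. sign=+1 adds the text concept, sign=-1 subtracts it."""
    query = l2_normalize(image_emb + sign * text_emb)
    scores = l2_normalize(index_embs) @ query
    return np.argsort(-scores)[:top_k]

# Toy index: axis 0 ~ "object", axis 1 ~ "attribute" (illustrative only).
index_embs = np.array([
    [1.0, 0.0, 0.0],   # 0: object alone
    [0.0, 1.0, 0.0],   # 1: attribute alone
    [1.0, 1.0, 0.0],   # 2: object with attribute
])
added = multimodal_search(np.array([1.0, 0.0, 0.0]),
                          np.array([0.0, 1.0, 0.0]), index_embs)
removed = multimodal_search(np.array([1.0, 1.0, 0.0]),
                            np.array([0.0, 1.0, 0.0]), index_embs, sign=-1.0)
```

In this toy setup, adding the text embedding retrieves the image that contains both concepts, while subtracting it retrieves the image with the attribute removed.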

Figure 5
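- Results for a variety of image+text queries.
  - 1) They demonstrate the great compositionality of ALIGN embeddings across the vision and language domains
  - 2) They show the feasibility of a new paradigm of “search with multi-modal query” that would be hard to achieve with only a text query or only an image query
  - e.g.) one can look for the "Australia" or "Madagascar" equivalent of pandas, or turn a pair of black shoes into identical-looking shoes in "beige"
  - Last three rows of Figure 5: objects/attributes can be removed from a scene by performing subtraction in the embedding space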

8. Multilingual ALIGN Model

One advantage of ALIGN is that the model is trained on noisy web image text data with very simple filters, and none of the filters is language specific

Given that, the language constraint of the conceptual caption data processing pipeline is lifted to extend the dataset to be multilingual (covering 100+ languages), and its size is matched to the English dataset (1.8B image-text pairs).
- The multilingual model ALIGN_mling is trained on this data
A new multilingual wordpiece vocabulary is created
- with size 250k to cover all languages
Model training follows the exact English configuration
The multilingual model is tested on Multi30k
- Multi30k: a multilingual image text retrieval dataset that extends Flickr30K (Plummer et al., 2015) to German (de) (Elliott et al., 2016), French (fr) (Elliott et al., 2017) and Czech (cs) (Barrault et al., 2018).
  - consists of 31,783 images with 5 captions per image in English and German and 1 caption per image in French and Czech.
- The train/dev/test splits are defined in Young et al. (2014).
- Evaluation: We evaluate the zero-shot model performance of ALIGN and compare it with M3P (Huang et al., 2020a) and UC2 (Zhou et al., 2021). The evaluation metric is mean Recall (mR), which computes the average score of Recall@1, Recall@5 and Recall@10 on image-to-text retrieval and text-to-image retrieval tasks.
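
The mean Recall (mR) metric described above can be computed roughly as below. This is a sketch assuming a precomputed similarity matrix per direction and a set of correct candidate indices per query; the function names and the toy data are hypothetical.

```python
import numpy as np

def recall_at_k(sim, ground_truth, k):
    """sim: (num_queries, num_candidates) similarity scores.
    ground_truth: one set of correct candidate indices per query.
    Returns the percentage of queries with a correct item in the top k."""
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [bool(set(row.tolist()) & gt) for row, gt in zip(topk, ground_truth)]
    return 100.0 * float(np.mean(hits))

def mean_recall(sim_i2t, gt_i2t, sim_t2i, gt_t2i, ks=(1, 5, 10)):
    """Average of Recall@k over both retrieval directions and all k."""
    scores = [recall_at_k(sim_i2t, gt_i2t, k) for k in ks]
    scores += [recall_at_k(sim_t2i, gt_t2i, k) for k in ks]
    return float(np.mean(scores))

# Tiny toy example with 2 queries and 3 candidates (so k is 1 and 2 here).
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.3, 0.8]])
gt = [{0}, {1}]
toy_mr = mean_recall(sim, gt, sim, gt, ks=(1, 2))
```

Using sets for the ground truth covers the case where an image has several valid captions, as with the 5 captions per image in English and German.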
Tab11 results:
- The zero-shot performance of ALIGN_mling outperforms M3P on all languages by a large margin, with the largest +57.8 absolute mR improvement on fr.
- The zero-shot performance of ALIGN_mling is even comparable to the fine-tuned (w/ training splits) M3P and UC2 except on cs. On en, ALIGN_mling performs slightly worse than its counterpart ALIGN_EN (trained on EN-only data).

9. Conclusion

A simple method for leveraging large-scale noisy image-text data is presented
-> to scale up visual and vision-language representation learning
- it avoids the heavy work of data curation and annotation, and
- requires only minimal frequency-based cleaning
On this dataset, we train a simple dual-encoder model using a contrastive loss
The resulting model is ALIGN, which
- is capable of cross-modal retrieval and
- outperforms SOTA VSE and cross-attention vision-language models
On visual-only downstream tasks, ALIGN is also comparable to or outperforms other SOTA models trained with large-scale labeled data

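10 Thoughts on the paper

What I liked
- The paper explains things well and gives logical reasons for its choices
  - There is a reason it has so many citations...!!!!!!
  - I think I missed one thing, though (see the shortcomings 0-0)
- The content is clean, systematic, and simple, so the paper is easy to read
  - I was impressed by the people who came up with the idea
  - There are also a lot of experiments,, (though that may be because it is a dataset paper)
    - The ablation study explains on what basis the EfficientNet and BERT variants were chosen
    - and how the parameters were set, much like tuning; that part left an impression on me
    - Using the CxC dataset to explain the intra- vs. inter-modal difference with an additional experiment is also a really nice idea, very good !!
- I really liked that every sentence that was vague or hard to follow in the Abstract is explained again, at length and precisely, in the Introduction !
Shortcomings
- It is a pity there is no code review..
- The first part of Related work points out the limited transferability of earlier approaches, but only points it out without giving any grounds for why the problem arises, which was disappointing
Impressions
- Toward the end of the presentation I started wondering what Recall@k is,, and thanks to 갓ㄷㅇ it got resolved : ) ~
- Multimodal.. hard, new, fascinating
- Dall-E is even harder.. and scary..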