[논문 번역] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

JINHO LEE·2025년 11월 23일

BLIP-2: Bootstrapping Language-Image Pre-training

with Frozen Image Encoders and Large Language Models

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

대규모 모델의 종단 간(end-to-end) 학습 때문에 비전-언어 사전학습 비용이 점점 감당하기 어려울 정도로 증가해왔다.

This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models.

본 논문은 기성(off-the-shelf) 동결 사전학습 이미지 인코더와 동결 대형 언어 모델로부터 비전-언어 사전학습을 부트스트랩하는, 일반적이고 효율적인 사전학습 전략인 BLIP-2를 제안한다.

BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages.

BLIP-2는 두 단계로 사전학습되는 경량 Querying Transformer를 통해 모달리티 간 격차를 연결한다.

The first stage bootstraps vision-language representation learning from a frozen image encoder.

첫 번째 단계는 동결 이미지 인코더로부터 비전-언어 표현 학습을 부트스트랩한다.

The second stage bootstraps vision-to-language generative learning from a frozen language model.

두 번째 단계는 동결 언어 모델로부터 비전-투-언어 생성 학습을 부트스트랩한다.

BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods.

BLIP-2는 기존 방법보다 학습 가능한 파라미터 수가 현저히 적음에도 불구하고 다양한 비전-언어 작업에서 SOTA 성능을 달성한다.

For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.

예를 들어, 우리 모델은 zero-shot VQAv2에서 Flamingo80B보다 54배 적은 학습 파라미터로 8.7% 더 높은 성능을 나타낸다.

We also demonstrate the model’s emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

또한 우리는 자연어 지시를 따를 수 있는 zero-shot 이미지-투-텍스트 생성 능력의 등장(emerging capabilities)을 입증한다.

1. Introduction

Vision-language pre-training (VLP) research has witnessed a rapid advancement in the past few years, where pre-trained models with increasingly larger scale have been developed to continuously push the state-of-the-art on various downstream tasks (Radford et al., 2021; Li et al., 2021; 2022; Wang et al., 2022a; Alayrac et al., 2022; Wang et al., 2022b).

비전-언어 사전학습(VLP) 연구는 지난 몇 년 동안 급격한 발전을 목격해왔으며, 점점 더 대규모의 사전학습 모델들이 다양한 다운스트림 작업에서 지속적으로 SOTA를 갱신하기 위해 개발되어 왔다.

However, most state-of-the-art vision-language models incur a high computation cost during pre-training, due to end-to-end training using large-scale models and datasets.

그러나 대부분의 SOTA 비전-언어 모델들은 대규모 모델과 데이터셋을 사용한 종단 간(end-to-end) 학습 때문에 사전학습 과정에서 매우 높은 연산 비용이 발생한다.

Vision-language research sits at the intersection between vision and language, therefore it is naturally expected that vision-language models can harvest from the readily-available unimodal models from the vision and natural language communities.

비전-언어 연구는 비전과 언어의 교차 지점에 위치하기 때문에, 비전-언어 모델이 비전 및 자연어 커뮤니티에서 쉽게 이용 가능한 단일 모달리티 모델들을 활용할 수 있기를 기대하는 것이 자연스럽다.

In this paper, we propose a generic and compute-efficient VLP method by bootstrapping from off-the-shelf pre-trained vision models and language models.

본 논문에서는 기성(off-the-shelf) 사전학습 비전 모델과 언어 모델로부터 부트스트랩하는, 일반적이고 연산 효율적인 VLP 방법을 제안한다.

Pre-trained vision models offer high-quality visual representation.

사전학습 비전 모델은 높은 품질의 시각적 표현을 제공한다.

Pre-trained language models, in particular large language models (LLMs), offer strong language generation and zero-shot transfer abilities.

사전학습 언어 모델, 특히 대형 언어 모델(LLMs)은 강력한 언어 생성 능력과 제로샷 전이 능력을 제공한다.

To reduce computation cost and counteract the issue of catastrophic forgetting, the unimodal pre-trained models remain frozen during the pre-training.

연산 비용을 줄이고 catastrophic forgetting 문제를 방지하기 위해, 단일 모달리티 사전학습 모델들은 사전학습 과정 동안 동결된 상태로 유지된다.

In order to leverage pre-trained unimodal models for VLP, it is key to facilitate cross-modal alignment.

VLP에서 사전학습 단일 모달리티 모델을 활용하기 위해서는, 크로스 모달 정렬(cross-modal alignment)을 촉진하는 것이 핵심이다.

However, since LLMs have not seen images during their unimodal pre-training, freezing them makes vision-language alignment in particular challenging.

그러나 LLM은 단일 모달리티 사전학습 동안 이미지를 본 적이 없기 때문에, 이를 동결하면 비전-언어 정렬이 특히 어렵게 된다.

In this regard, existing methods (e.g. Frozen (Tsimpoukelli et al., 2021), Flamingo (Alayrac et al., 2022)) resort to an image-to-text generation loss, which we show is insufficient to bridge the modality gap.

이와 관련하여, 기존 방법들(예: Frozen, Flamingo)은 이미지-투-텍스트 생성 손실에 의존하지만, 우리는 이것이 모달리티 간 격차를 메우기에 충분하지 않음을 보여준다.

To achieve effective vision-language alignment with frozen unimodal models, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy.

동결된 단일 모달 모델과 함께 효과적인 비전-언어 정렬을 달성하기 위해, 우리는 새로운 2단계 사전학습 전략으로 사전학습된 Querying Transformer(Q-Former)를 제안한다.

As shown in Figure 1, Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder.

Figure 1과 같이, Q-Former는 학습 가능한 쿼리 벡터 집합을 사용하여 동결 이미지 인코더로부터 시각적 특징을 추출하는 경량 트랜스포머이다.

It acts as an information bottleneck between the frozen image encoder and the frozen LLM, where it feeds the most useful visual feature for the LLM to output the desired text.

이는 동결 이미지 인코더와 동결 LLM 사이에서 정보 병목 역할을 수행하며, LLM이 원하는 텍스트를 출력할 수 있도록 가장 유용한 시각적 특징을 전달한다.

In the first pre-training stage, we perform vision-language representation learning which enforces the Q-Former to learn visual representation most relevant to the text.

첫 번째 사전학습 단계에서 우리는 텍스트와 가장 관련성 높은 시각 표현을 학습하도록 Q-Former를 강제하는 비전-언어 표현 학습을 수행한다.

In the second pre-training stage, we perform vision-to-language generative learning by connecting the output of the Q-Former to a frozen LLM, and trains the Q-Former such that its output visual representation can be interpreted by the LLM.

두 번째 사전학습 단계에서는 Q-Former의 출력을 동결 LLM에 연결하여 비전-투-언어 생성 학습을 수행하고, Q-Former의 출력 시각 표현이 LLM에 의해 해석될 수 있도록 Q-Former를 학습시킨다.

We name our VLP framework as BLIP-2: Bootstrapping Language-Image Pre-training with frozen unimodal models.

우리는 우리의 VLP 프레임워크를, 동결 단일 모달 모델을 사용한 Language-Image Pre-training 부트스트래핑인 BLIP-2라고 명명한다.

The key advantages of BLIP-2 include:

BLIP-2의 주요 장점은 다음과 같다:

BLIP-2 effectively leverages both frozen pre-trained image models and language models. BLIP-2는 동결 사전학습 이미지 모델과 언어 모델을 모두 효과적으로 활용한다. We bridge the modality gap using a Q-Former pre-trained in two-stages: representation learning stage and generative learning stage. `우리는 표현 학습 단계와 생성 학습 단계의 두 단계로 사전학습된 Q-Former를 사용하여 모달리티 간 격차를 연결한다. BLIP-2 achieves state-of-the-art performance on various vision-language tasks including visual question answering, image captioning, and image-text retrieval. BLIP-2는 시각 질문 응답(VQA), 이미지 캡셔닝, 이미지-텍스트 검색 등 다양한 비전-언어 작업에서 SOTA 성능을 달성한다.

Powered by LLMs (e.g. OPT (Zhang et al., 2022), FlanT5 (Chung et al., 2022)), BLIP-2 can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, which enables emerging capabilities such as visual knowledge reasoning, visual conversation, etc. (see Figure 4 for examples). LLM(예: OPT, FlanT5)의 지원을 바탕으로, BLIP-2는 자연어 지시를 따르는 zero-shot 이미지-투-텍스트 생성을 수행하도록 프롬프트 될 수 있으며, 이는 시각 지식 추론, 시각 대화 등의 새롭게 등장하는 능력을 가능하게 한다(예시는 Figure 4 참조).

Due to the use of frozen unimodal models and a lightweight Q-Former, BLIP-2 is more compute-efficient than exisiting state-of-the-arts. 동결 단일 모달 모델과 경량 Q-Former의 사용 덕분에, BLIP-2는 기존 SOTA 모델보다 더 높은 연산 효율성을 가진다. For example, BLIP-2 outperforms Flamingo (Alayrac et al., 2022) by 8.7% on zero-shot VQAv2, while using 54× fewer trainable parameters. 예를 들어, BLIP-2는 zero-shot VQAv2에서 Flamingo보다 8.7% 더 높은 성능을 보이며, 학습 가능한 파라미터는 54배 더 적게 사용한다. Furthermore, our results show that BLIP-2 is a generic method that can harvest more advanced unimodal models for better VLP performance. 또한 우리의 결과는 BLIP-2가 더 발전된 단일 모달 모델을 활용하여 더 나은 VLP 성능을 이끌어낼 수 있는 일반적인 방법임을 보여준다.

2.1. End-to-end Vision-Language Pre-training

Vision-language pre-training aims to learn multimodal foundation models with improved performance on various vision-and-language tasks.

비전-언어 사전학습은 다양한 비전-언어 작업에서 향상된 성능을 가진 멀티모달 기반 모델을 학습하는 것을 목표로 한다.

Depending on the downstream task, different model architectures have been proposed, including the dual-encoder architecture (Radford et al., 2021; Jia et al., 2021), the fusion-encoder architecture (Tan & Bansal, 2019; Li et al., 2021), the encoder-decoder architecture (Cho et al., 2021; Wang et al., 2021b; Chen et al., 2022b), and more recently, the unified transformer architecture (Li et al., 2022; Wang et al., 2022b).

다운스트림 작업에 따라, 듀얼-인코더 구조, 퓨전-인코더 구조, 인코더-디코더 구조, 그리고 최근의 통합 트랜스포머 구조 등 다양한 모델 아키텍처가 제안되어 왔다.

Various pre-training objectives have also been proposed over the years, and have progressively converged to a few time-tested ones: image-text contrastive learning (Radford et al., 2021; Yao et al., 2022; Li et al., 2021; 2022), image-text matching (Li et al., 2021; 2022; Wang et al., 2021a), and (masked) language modeling (Li et al., 2021; 2022; Yu et al., 2022; Wang et al., 2022b).

또한 수년에 걸쳐 다양한 사전학습 목표가 제안되었으며, 점차적으로 이미지-텍스트 대비 학습, 이미지-텍스트 매칭, (마스킹된) 언어 모델링 등 몇 가지 검증된 방식으로 수렴해 왔다.

Most VLP methods perform end-to-end pre-training using large-scale image-text pair datasets.

대부분의 VLP 방법들은 대규모 이미지-텍스트 페어 데이터셋을 사용한 종단 간(end-to-end) 사전학습을 수행한다.

As the model size keeps increasing, the pre-training can incur an extremely high computation cost.

모델 크기가 계속 증가함에 따라, 사전학습은 매우 높은 연산 비용을 야기할 수 있다.

Moreover, it is inflexible for end-to-end pre-trained models to leverage readily-available unimodal pre-trained models, such as LLMs (Brown et al., 2020; Zhang et al., 2022; Chung et al., 2022).

게다가, 종단 간 사전학습된 모델은 LLM과 같은 쉽게 사용 가능한 단일 모달 사전학습 모델을 활용하는 데 유연성이 부족하다.

2.2. Modular Vision-Language Pre-training

More similar to us are methods that leverage off-the-shelf pre-trained models and keep them frozen during VLP.

우리와 더 유사한 방법은 기성 사전학습 모델을 활용하고 VLP 동안 이를 동결하는 방식들이다.

Some methods freeze the image encoder, including the early work which adopts a frozen object detector to extract visual features (Chen et al., 2020; Li et al., 2020; Zhang et al., 2021), and the recent LiT (Zhai et al., 2022) which uses a frozen pre-trained image encoder for CLIP (Radford et al., 2021) pre-training.

일부 방법은 이미지 인코더를 동결하는데, 여기에는 시각 특징 추출을 위해 동결 객체 감지기를 사용한 초기 연구와, CLIP 사전학습에 동결 사전학습 이미지 인코더를 사용하는 최근의 LiT가 포함된다.

Some methods freeze the language model to use the knowledge from LLMs for vision-to-language generation tasks (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Chen et al., 2022a; Mañas et al., 2023; Tiong et al., 2022; Guo et al., 2022).

일부 방법은 비전-투-언어 생성 작업을 위해 LLM의 지식을 활용할 수 있도록 언어 모델을 동결한다.

The key challenge in using a frozen LLM is to align visual features to the text space.

동결된 LLM을 사용할 때의 핵심 과제는 시각 특징을 텍스트 공간에 정렬하는 것이다.

To achieve this, Frozen (Tsimpoukelli et al., 2021) finetunes an image encoder whose outputs are directly used as soft prompts for the LLM.

이를 위해 Frozen은 이미지 인코더를 미세조정하고 그 출력값을 LLM에 대한 소프트 프롬프트로 직접 사용한다.

Flamingo (Alayrac et al., 2022) inserts new cross-attention layers into the LLM to inject visual features, and pre-trains the new layers on billions of image-text pairs.

Flamingo는 LLM에 새로운 크로스-어텐션 레이어를 삽입하여 시각 특징을 주입하고, 수십억 개의 이미지-텍스트 페어로 해당 레이어들을 사전학습한다.

Both methods adopt the language modeling loss, where the language model generates texts conditioned on the image.

두 방법 모두 언어 모델링 손실을 사용하며, 언어 모델은 이미지에 조건화된 텍스트를 생성한다.

Different from existing methods, BLIP-2 can effectively and efficiently leverage both frozen image encoders and frozen LLMs for various vision-language tasks, achieving stronger performance at a lower computation cost.

기존 방법들과 달리, BLIP-2는 동결 이미지 인코더와 동결 LLM을 모두 다양한 비전-언어 작업에 효과적이고 효율적으로 활용하며, 더 낮은 연산 비용으로 더 강력한 성능을 달성한다.

3. Method

We propose BLIP-2, a new vision-language pre-training method that bootstraps from frozen pre-trained unimodal models.

우리는 동결 사전학습 단일 모달 모델로부터 부트스트랩하는 새로운 비전-언어 사전학습 방법인 BLIP-2를 제안한다.

In order to bridge the modality gap, we propose a Querying Transformer (Q-Former) pre-trained in two stages:

모달리티 간 격차를 연결하기 위해, 우리는 두 단계로 사전학습된 Querying Transformer(Q-Former)를 제안한다:

(1) vision-language representation learning stage with a frozen image encoder and

(1) 동결 이미지 인코더와 함께하는 비전-언어 표현 학습 단계

(2) vision-to-language generative learning stage with a frozen LLM.

(2) 동결 LLM과 함께하는 비전-투-언어 생성 학습 단계.

This section first introduces the model architecture of Q-Former, and then delineates the two-stage pre-training procedures.

이 섹션에서는 먼저 Q-Former의 모델 아키텍처를 소개하고, 이어서 두 단계 사전학습 절차를 설명한다.

3.1. Model Architecture

We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM.

우리는 Q-Former를 동결 이미지 인코더와 동결 LLM 사이의 격차를 연결하는 학습 가능한 모듈로 제안한다.

It extracts a fixed number of output features from the image encoder, independent of input image resolution.

이는 입력 이미지 해상도와 무관하게 이미지 인코더로부터 고정 개수의 출력 특징을 추출한다.

As shown in Figure 2, Q-Former consists of two transformer submodules that share the same self-attention layers:

Figure 2에 나타난 바와 같이, Q-Former는 동일한 self-attention 레이어를 공유하는 두 개의 트랜스포머 서브모듈로 구성된다:

(1) an image transformer that interacts with the frozen image encoder for visual feature extraction,

(1) 시각 특징 추출을 위해 동결 이미지 인코더와 상호작용하는 이미지 트랜스포머

(2) a text transformer that can function as both a text encoder and a text decoder.

(2) 텍스트 인코더와 텍스트 디코더 역할을 모두 수행할 수 있는 텍스트 트랜스포머.

We create a set number of learnable query embeddings as input to the image transformer.

우리는 이미지 트랜스포머의 입력으로 학습 가능한 쿼리 임베딩의 정해진 개수를 생성한다.

The queries interact with each other through self-attention layers, and interact with frozen image features through cross-attention layers (inserted every other transformer block).

쿼리들은 self-attention 레이어를 통해 서로 상호작용하고, cross-attention 레이어(각 트랜스포머 블록마다 삽입됨)를 통해 동결 이미지 특징과 상호작용한다.

The queries can additionally interact with the text through the same self-attention layers.

쿼리들은 동일한 self-attention 레이어를 통해 텍스트와도 추가적으로 상호작용할 수 있다.

Depending on the pre-training task, we apply different self-attention masks to control query-text interaction.

사전학습 작업에 따라, 우리는 쿼리-텍스트 상호작용을 제어하기 위해 서로 다른 self-attention 마스크를 적용한다.

We initialize Q-Former with the pre-trained weights of BERTbase (Devlin et al., 2019), whereas the cross-attention layers are randomly initialized.

우리는 Q-Former를 BERTbase의 사전학습 가중치로 초기화하며, cross-attention 레이어는 무작위로 초기화한다.

In total, Q-Former contains 188M parameters.

전체적으로 Q-Former는 188M 파라미터를 포함한다.

Note that the queries are considered as model parameters.

쿼리들은 모델 파라미터로 간주된다는 점에 유의하라.

In our experiments, we use 32 queries where each query has a dimension of 768 (same as the hidden dimension of the Q-Former).

우리의 실험에서, 우리는 32개의 쿼리를 사용하며 각 쿼리의 차원은 768이다(Q-Former의 hidden dimension과 동일).

We use Z to denote the output query representation.

우리는 출력 쿼리 표현을 Z로 나타낸다.

The size of Z (32 × 768) is much smaller than the size of frozen image features (e.g. 257 × 1024 for ViT-L/14).

Z의 크기(32×768)는 동결 이미지 특징의 크기(예: ViT-L/14의 경우 257×1024)보다 훨씬 작다.

This bottleneck architecture works together with our pre-training objectives into forcing the queries to extract visual information that is most relevant to the text.

이 병목 구조는 사전학습 목표와 함께 작동하여 쿼리들이 텍스트와 가장 관련 있는 시각 정보를 추출하도록 강제한다.

3.2. Bootstrap Vision-Language Representation

Learning from a Frozen Image Encoder

In the representation learning stage, we connect Q-Former to a frozen image encoder and perform pre-training using image-text pairs.

표현 학습 단계에서, 우리는 Q-Former를 동결 이미지 인코더에 연결하고 이미지-텍스트 페어를 사용하여 사전학습을 수행한다.

We aim to train the Q-Former such that the queries can learn to extract visual representation that is most informative of the text.

우리는 쿼리가 텍스트에 대해 가장 정보량이 높은 시각 표현을 학습하여 추출할 수 있도록 Q-Former를 학습시키는 것을 목표로 한다.

Inspired by BLIP (Li et al., 2022), we jointly optimize three pre-training objectives that share the same input format and model parameters.

BLIP에서 영감을 받아, 우리는 동일한 입력 형식과 모델 파라미터를 공유하는 세 가지 사전학습 목표를 공동으로 최적화한다.

Each objective employs a different attention masking strategy between queries and text to control their interaction (see Figure 2).

각 목표는 쿼리와 텍스트 간 상호작용을 제어하기 위해 서로 다른 attention masking 전략을 사용한다(Figure 2 참고).

Image-Text Contrastive Learning (ITC)

Image-Text Contrastive Learning (ITC) learns to align image representation and text representation such that their mutual information is maximized.

이미지-텍스트 대비 학습(ITC)은 이미지 표현과 텍스트 표현을 정렬하여 그들의 상호 정보를 최대화하도록 학습한다.

It achieves so by contrasting the image-text similarity of a positive pair against those of negative pairs.

이는 양성 페어의 이미지-텍스트 유사도를 음성 페어와의 유사도와 대비함으로써 달성된다.

We align the output query representation Z from the image transformer with the text representation t from the text transformer, where t is the output embedding of the [CLS] token.

우리는 이미지 트랜스포머의 출력 쿼리 표현 Z를 텍스트 트랜스포머의 텍스트 표현 t와 정렬하며, 여기서 t는 [CLS] 토큰의 출력 임베딩이다.

Since Z contains multiple output embeddings (one from each query), we first compute the pairwise similarity between each query output and t, and then select the highest one as the image-text similarity.

Z는 여러 개의 출력 임베딩(각 쿼리마다 하나)을 포함하고 있기 때문에, 우리는 먼저 각 쿼리 출력과 t 사이의 쌍별 유사도를 계산하고, 그 중 가장 높은 값을 이미지-텍스트 유사도로 선택한다.

To avoid information leak, we employ a unimodal self-attention mask, where the queries and text are not allowed to see each other.

정보 누출을 방지하기 위해, 우리는 쿼리와 텍스트가 서로를 볼 수 없도록 하는 단일 모달 self-attention 마스크를 사용한다.

Due to the use of a frozen image encoder, we can fit more samples per GPU compared to end-to-end methods.

동결 이미지 인코더를 사용하기 때문에, 우리는 end-to-end 방법에 비해 GPU당 더 많은 샘플을 수용할 수 있다.

Therefore, we use in-batch negatives instead of the momentum queue in BLIP.

따라서 우리는 BLIP의 momentum queue 대신 in-batch negatives를 사용한다.

Image-grounded Text Generation (ITG)

Image-grounded Text Generation (ITG) loss trains the Q-Former to generate texts, given input images as the condition.

이미지 기반 텍스트 생성(ITG) 손실은 입력 이미지가 조건으로 주어졌을 때 텍스트를 생성하도록 Q-Former를 학습시킨다.

Since the architecture of Q-Former does not allow direct interactions between the frozen image encoder and the text tokens, the information required for generating the text must be first extracted by the queries, and then passed to the text tokens via self-attention layers.

Q-Former의 아키텍처는 동결 이미지 인코더와 텍스트 토큰 간의 직접적인 상호작용을 허용하지 않기 때문에, 텍스트 생성을 위해 필요한 정보는 먼저 쿼리에 의해 추출되고 이어 self-attention 레이어를 통해 텍스트 토큰으로 전달되어야 한다.

Therefore, the queries are forced to extract visual features that capture all the information about the text.

따라서 쿼리들은 텍스트와 관련된 모든 정보를 포착하는 시각 특징을 추출하도록 강제된다.

We employ a multimodal causal self-attention mask to control query-text interaction, similar to the one used in UniLM (Dong et al., 2019).

우리는 UniLM에서 사용된 것과 유사한 멀티모달 causal self-attention 마스크를 사용하여 쿼리와 텍스트 간의 상호작용을 제어한다.

The queries can attend to each other but not the text tokens.

쿼리들은 서로에게는 attention할 수 있지만 텍스트 토큰에는 attention할 수 없다.

Each text token can attend to all queries and its previous text tokens.

각 텍스트 토큰은 모든 쿼리와 이전 텍스트 토큰에 attention할 수 있다.

We also replace the [CLS] token with a new [DEC] token as the first text token to signal the decoding task.

우리는 디코딩 작업을 알리기 위해 첫 번째 텍스트 토큰으로 [CLS] 토큰을 새로운 [DEC] 토큰으로 대체한다.

Image-Text Matching (ITM)

Image-Text Matching (ITM) aims to learn fine-grained alignment between image and text representation.

이미지-텍스트 매칭(ITM)은 이미지와 텍스트 표현 사이의 세밀한 정렬을 학습하는 것을 목표로 한다.

It is a binary classification task where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched).

이는 이미지-텍스트 페어가 양성(매칭됨)인지 음성(매칭되지 않음)인지 예측하도록 모델에게 요구하는 이진 분류 작업이다.

We use a bi-directional self-attention mask where all queries and texts can attend to each other.

우리는 모든 쿼리와 텍스트가 서로 attention할 수 있는 양방향 self-attention 마스크를 사용한다.

The output query embeddings Z thus capture multimodal information.

이로써 출력 쿼리 임베딩 Z는 멀티모달 정보를 포착한다.

We feed each output query embedding into a two-class linear classifier to obtain a logit, and average the logits across all queries as the output matching score.

우리는 각 출력 쿼리 임베딩을 2-class 선형 분류기에 입력하여 로짓을 얻고, 모든 쿼리에서 로짓을 평균하여 최종 매칭 점수로 사용한다.

We adopt the hard negative mining strategy from Li et al. (2021; 2022) to create informative negative pairs.

우리는 정보량 있는 음성 페어를 생성하기 위해 Li의 hard negative mining 전략을 채택한다.

Image-grounded Text Generation (ITG)

Image-grounded Text Generation (ITG) loss trains the Q-Former to generate texts, given input images as the condition.

이미지 기반 텍스트 생성(ITG) 손실은 입력 이미지가 조건으로 주어졌을 때 텍스트를 생성하도록 Q-Former를 학습시킨다.

Therefore, the queries are forced to extract visual features that capture all the information about the text.

따라서 쿼리들은 텍스트와 관련된 모든 정보를 포착하는 시각 특징을 추출하도록 강제된다.

We employ a multimodal causal self-attention mask to control query-text interaction, similar to the one used in UniLM (Dong et al., 2019).

우리는 UniLM에서 사용된 것과 유사한 멀티모달 causal self-attention 마스크를 사용하여 쿼리와 텍스트 간의 상호작용을 제어한다.

The queries can attend to each other but not the text tokens.

쿼리들은 서로에게는 attention할 수 있지만 텍스트 토큰에는 attention할 수 없다.

Each text token can attend to all queries and its previous text tokens.

각 텍스트 토큰은 모든 쿼리와 이전 텍스트 토큰에 attention할 수 있다.

We also replace the [CLS] token with a new [DEC] token as the first text token to signal the decoding task.

우리는 디코딩 작업을 알리기 위해 첫 번째 텍스트 토큰으로 [CLS] 토큰을 새로운 [DEC] 토큰으로 대체한다.

Image-Text Matching (ITM)

Image-Text Matching (ITM) aims to learn fine-grained alignment between image and text representation.

이미지-텍스트 매칭(ITM)은 이미지와 텍스트 표현 사이의 세밀한 정렬을 학습하는 것을 목표로 한다.

It is a binary classification task where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched).

이는 이미지-텍스트 페어가 양성(매칭됨)인지 음성(매칭되지 않음)인지 예측하도록 모델에게 요구하는 이진 분류 작업이다.

We use a bi-directional self-attention mask where all queries and texts can attend to each other.

우리는 모든 쿼리와 텍스트가 서로 attention할 수 있는 양방향 self-attention 마스크를 사용한다.

The output query embeddings Z thus capture multimodal information.

이로써 출력 쿼리 임베딩 Z는 멀티모달 정보를 포착한다.

We feed each output query embedding into a two-class linear classifier to obtain a logit, and average the logits across all queries as the output matching score.

우리는 각 출력 쿼리 임베딩을 2-class 선형 분류기에 입력하여 로짓을 얻고, 모든 쿼리에서 로짓을 평균하여 최종 매칭 점수로 사용한다.

We adopt the hard negative mining strategy from Li et al. (2021; 2022) to create informative negative pairs.

우리는 정보량 있는 음성 페어를 생성하기 위해 Li의 hard negative mining 전략을 채택한다.

3.3. Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

In the generative pre-training stage, we connect Q-Former (with the frozen image encoder attached) to a frozen LLM to harvest the LLM’s generative language capability.

생성 사전학습 단계에서, 우리는 Q-Former(동결 이미지 인코더가 연결된 상태)를 동결 LLM에 연결하여 LLM의 생성 언어 능력을 활용한다.

As shown in Figure 3, we use a fully-connected (FC) layer to linearly project the output query embeddings Z into the same dimension as the text embedding of the LLM.

Figure 3에서 보이는 것처럼, 우리는 완전 연결(FC) 레이어를 사용하여 출력 쿼리 임베딩 Z를 LLM의 텍스트 임베딩과 동일한 차원으로 선형 투영한다.

The projected query embeddings are then prepended to the input text embeddings.

투영된 쿼리 임베딩은 입력 텍스트 임베딩 앞에 추가된다.

They function as soft visual prompts that condition the LLM on visual representation extracted by the Q-Former.

이들은 Q-Former가 추출한 시각 표현에 조건화된 soft visual prompt 역할을 한다.

Since the Q-Former has been pre-trained to extract language-informative visual representation, it effectively functions as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual information.

Q-Former는 언어적으로 유의미한 시각 표현을 추출하도록 사전학습되었기 때문에, 관련 없는 시각 정보를 제거하면서 LLM에 가장 유용한 정보를 전달하는 정보 병목으로 효과적으로 작동한다.

This reduces the burden of the LLM to learn vision-language alignment, thus mitigating the catastrophic forgetting problem.

이는 LLM이 비전-언어 정렬을 학습하기 위한 부담을 줄이며, 결과적으로 catastrophic forgetting 문제를 완화한다.

We experiment with two types of LLMs: decoder-based LLMs and encoder-decoder-based LLMs.

우리는 디코더 기반 LLM과 인코더-디코더 기반 LLM 두 종류의 LLM을 실험한다.

For decoder-based LLMs, we pre-train with the language modeling loss, where the frozen LLM is tasked to generate the text conditioned on the visual representation from Q-Former.

디코더 기반 LLM의 경우, 우리는 언어 모델링 손실로 사전학습하며, 동결 LLM은 Q-Former의 시각 표현에 조건화된 텍스트를 생성하는 역할을 맡는다.

For encoder-decoder-based LLMs, we pre-train with the prefix language modeling loss, where we split a text into two parts.

인코더-디코더 기반 LLM의 경우, 우리는 텍스트를 두 부분으로 나누는 prefix language modeling 손실을 사용하여 사전학습한다.

The prefix text is concatenated with the visual representation as input to the LLM’s encoder.

prefix 텍스트는 시각 표현과 결합되어 LLM의 인코더 입력으로 사용된다.

The suffix text is used as the generation target for the LLM’s decoder.

suffix 텍스트는 LLM 디코더의 생성 목표로 사용된다.

3.4. Model Pre-training

Pre-training data.

사전학습 데이터.

We use the same pre-training dataset as BLIP with 129M images in total, including COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and 115M images from the LAION400M dataset (Schuhmann et al., 2021).

우리는 BLIP와 동일한 사전학습 데이터셋을 사용하며, 총 1억 2,900만 개의 이미지가 포함되며 COCO, Visual Genome, CC3M, CC12M, SBU, 그리고 LAION400M 데이터셋의 1억 1,500만 개의 이미지를 포함한다.

We adopt the CapFilt method (Li et al., 2022) to create synthetic captions for the web images.

우리는 웹 이미지에 대한 합성 캡션을 생성하기 위해 CapFilt 방법을 사용한다.

Specifically, we generate 10 captions using the BLIPlarge captioning model, and rank the synthetic captions along with the original web caption based on the image-text similarity produced by a CLIP ViT-L/14 model.

구체적으로, 우리는 BLIPlarge 캡션 모델을 사용하여 10개의 캡션을 생성하고, CLIP ViT-L/14 모델이 생성한 이미지-텍스트 유사도를 기반으로 합성 캡션과 원본 웹 캡션을 함께 랭킹한다.

We keep top-two captions per image as training data and randomly sample one at each pre-training step.

우리는 이미지당 상위 두 캡션을 학습 데이터로 유지하고, 각 사전학습 단계에서 무작위로 하나를 샘플링한다.

Pre-trained image encoder and LLM.

For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models:

동결 이미지 인코더로서, 우리는 두 가지 SOTA 사전학습 비전 트랜스포머 모델을 탐색한다:

(1) ViT-L/14 from CLIP (Radford et al., 2021) and

(1) CLIP의 ViT-L/14

(2) ViT-g/14 from EVA-CLIP (Fang et al., 2022).

(2) EVA-CLIP의 ViT-g/14.

We remove the last layer of the ViT and uses the second last layer’s output features, which leads to slightly better performance.

우리는 ViT의 마지막 레이어를 제거하고, 두 번째 마지막 레이어의 출력 특징을 사용하며, 이는 약간 더 나은 성능을 달성한다.

For the frozen language model, we explore the unsupervised-trained OPT model family (Zhang et al., 2022) for decoder-based LLMs, and the instruction-trained FlanT5 model family (Chung et al., 2022) for encoder-decoder-based LLMs.

동결 언어 모델로서, 우리는 디코더 기반 LLM을 위해 비지도 학습된 OPT 모델군을, 인코더-디코더 기반 LLM을 위해 instruction-trained FlanT5 모델군을 탐색한다.

Pre-training settings.

We pre-train for 250k steps in the first stage and 80k steps in the second stage.

우리는 첫 번째 단계에서 250k 스텝, 두 번째 단계에서 80k 스텝 동안 사전학습을 수행한다.

We use a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and a batch size of 1920/1520 for OPT/FlanT5 in the second stage.

첫 번째 단계에서는 ViT-L/ViT-g에 대해 배치 크기 2320/1680을, 두 번째 단계에서는 OPT/FlanT5에 대해 배치 크기 1920/1520을 사용한다.

During pre-training, we convert the frozen ViTs’ and LLMs’ parameters into FP16, except for FlanT5 where we use BFloat16.

사전학습 동안, 우리는 동결된 ViT와 LLM의 파라미터를 FP16으로 변환하며, FlanT5의 경우에는 BFloat16을 사용한다.

We found no performance degradation compared to using 32-bit models.

우리는 32-bit 모델을 사용하는 것과 비교하여 성능 저하가 없음을 확인했다.

Due to the use of frozen models, our pre-training is more computational friendly than existing large-scale VLP methods.

동결 모델의 사용 덕분에, 우리의 사전학습은 기존 대규모 VLP 방법들보다 계산적으로 더 친화적이다.

For example, using a single 16-A100(40G) machine, our largest model with ViT-g and FlanT5-XXL requires less than 6 days for the first stage and less than 3 days for the second stage.

예를 들어, 단일 16-A100(40G) 머신을 사용하여 ViT-g와 FlanT5-XXL 기반의 우리의 가장 큰 모델은 첫 번째 단계에서 6일 미만, 두 번째 단계에서 3일 미만이 소요된다.

The same set of pre-training hyper-parameters are used for all models.

동일한 사전학습 하이퍼파라미터 세트를 모든 모델에 사용한다.

We use the AdamW (Loshchilov & Hutter, 2017) optimizer with β₁ = 0.9, β₂ = 0.98, and a weight decay of 0.05.

우리는 β₁ = 0.9, β₂ = 0.98, weight decay = 0.05의 AdamW 옵티마이저를 사용한다.

We use a cosine learning rate decay with a peak learning rate of 1e-4 and a linear warmup of 2k steps.

우리는 최대 학습률 1e-4, 그리고 2k 스텝의 선형 워밍업을 사용하는 cosine learning rate decay를 적용한다.

The minimum learning rate at the second stage is 5e-5.

두 번째 단계의 최소 학습률은 5e-5이다.

We use images of size 224×224, augmented with random resized cropping and horizontal flipping.

우리는 224×224 크기의 이미지를 사용하며, 랜덤 리사이즈 크롭과 수평 플리핑으로 데이터 증강을 수행한다.

4. Experiment

Table 1 provides an overview of the performance of BLIP-2 on various zero-shot vision-language tasks.

Table 1은 다양한 zero-shot 비전-언어 작업에서 BLIP-2의 성능 개요를 제공한다.

Compared to previous state-of-the-art models, BLIP-2 achieves improved performance while requiring substantially fewer number of trainable parameters during vision-language pre-training.

이전 SOTA 모델과 비교하여, BLIP-2는 비전-언어 사전학습 중 학습 가능한 파라미터 수가 크게 적음에도 향상된 성능을 달성한다.

4.1. Instructed Zero-shot Image-to-Text Generation

BLIP-2 effectively enables a LLM to understand images while preserving its capability in following text prompts, which allows us to control image-to-text generation with instructions.

BLIP-2는 LLM이 텍스트 프롬프트를 따르는 능력을 유지하면서 이미지 이해를 가능하게 하여, 지시 기반의 이미지-투-텍스트 생성을 제어할 수 있도록 한다.

We simply append the text prompt after the visual prompt as input to the LLM.

우리는 시각 프롬프트 뒤에 텍스트 프롬프트를 단순히 추가하여 LLM의 입력으로 사용한다.

Figure 4 shows examples to demonstrate a wide range of zero-shot image-to-text capabilities including visual knowledge reasoning, visual commonsense reasoning, visual conversation, personalized image-to-text generation, etc.

Figure 4는 시각 지식 추론, 시각 상식 추론, 시각 대화, 개인화된 이미지-투-텍스트 생성 등 다양한 zero-shot 이미지-투-텍스트 능력을 보여주는 예시를 제공한다.

Zero-shot VQA

We perform quantitative evaluation on the zero-shot visual question answering task.

우리는 zero-shot 시각 질문 응답 작업에 대해 정량적 평가를 수행한다.

For OPT models, we use the prompt “Question: {} Answer:”.

OPT 모델의 경우, 우리는 “Question: {} Answer:” 프롬프트를 사용한다.

For FlanT5 models, we use the prompt “Question: {} Short answer:”.

FlanT5 모델의 경우, 우리는 “Question: {} Short answer:” 프롬프트를 사용한다.

During generation, we use beam search with a beam width of 5.

생성 과정에서, 우리는 빔 너비 5의 beam search를 사용한다.

We also set the length-penalty to -1 which encourages shorter answers that align better with human annotation.

우리는 length-penalty를 -1로 설정하여 인간 주석과 더 잘 맞는 짧은 답변을 유도한다.

As shown in Table 2, BLIP-2 achieves state-of-the-art result on the VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019) datasets.

Table 2에서 보이듯이, BLIP-2는 VQAv2 및 GQA 데이터셋에서 SOTA 결과를 달성한다.

It outperforms Flamingo80B by 8.7% on VQAv2, despite having 54x fewer trainable parameters.

BLIP-2는 VQAv2에서 Flamingo80B보다 8.7% 더 높은 성능을 보여주며, 학습 가능한 파라미터는 54배 더 적다.

On the OK-VQA (Marino et al., 2019) dataset, BLIP-2 comes secondary to Flamingo80B.

OK-VQA 데이터셋에서는 BLIP-2가 Flamingo80B 다음으로 성능을 보인다.

We hypothesis that this is because OK-VQA focuses more on open-world knowledge than visual understanding, and the 70B Chinchilla (Hoffmann et al., 2022) language model from Flamingo80B possesses more knowledge than the 11B FlanT5XXL.

우리는 이것이 OK-VQA가 시각적 이해보다 오픈월드 지식에 더 중점을 두고 있으며, Flamingo80B의 70B Chinchilla 언어 모델이 11B FlanT5XXL보다 더 많은 지식을 가지고 있기 때문이라고 가설한다.

Observation & Discussion

We make a promising observation from Table 2: a stronger image encoder or a stronger LLM both lead to better performance.

우리는 Table 2에서 유망한 관찰을 한다: 더 강력한 이미지 인코더 또는 더 강력한 LLM은 모두 더 나은 성능으로 이어진다.

This observation is supported by several facts:

이 관찰은 여러 사실로 뒷받침된다:

(1) ViT-g outperforms ViT-L for both OPT and FlanT5.

(1) ViT-g는 OPT와 FlanT5 모두에서 ViT-L보다 우수하다.

(2) Within the same LLM family, larger models outperform smaller ones.

(2) 동일한 LLM 계열 내에서는 더 큰 모델이 더 작은 모델보다 우수하다.

(3) FlanT5, an instruction-tuned LLM, outperforms the unsupervised-trained OPT on VQA.

(3) instruction-tuned LLM인 FlanT5는 VQA에서 비지도 학습된 OPT보다 우수하다.

This observation validates BLIP-2 as a generic vision-language pre-training method that can efficiently harvest the rapid advances in vision and natural language communities.

이 관찰은 BLIP-2가 비전 및 자연어 커뮤니티의 빠른 발전을 효율적으로 활용할 수 있는 일반적인 비전-언어 사전학습 방법임을 입증한다.

Effect of Vision-Language Representation Learning

The first-stage representation learning pre-trains the Q-Former to learn visual features relevant to the text, which reduces the burden of the LLM to learn vision-language alignment.

첫 번째 단계의 표현 학습은 Q-Former가 텍스트와 관련된 시각 특징을 학습하도록 사전학습하며, 이는 LLM이 비전-언어 정렬을 학습하는 부담을 줄여준다.

Without the representation learning stage, Q-Former relies solely on the vision-to-language generative learning to bridge the modality gap, which is similar to the Perceiver Resampler in Flamingo.

표현 학습 단계가 없으면, Q-Former는 Flamingo의 Perceiver Resampler와 유사하게 모달리티 간 격차를 연결하기 위해 비전-투-언어 생성 학습에만 의존한다.

Figure 5 shows the effect of representation learning on generative learning.

Figure 5는 생성 학습에 대한 표현 학습의 효과를 보여준다.

Without representation learning, both types of LLMs give substantially lower performance on zero-shot VQA.

표현 학습이 없을 경우, 두 유형의 LLM 모두 zero-shot VQA에서 상당히 낮은 성능을 나타낸다.

In particular, OPT suffers from catastrophic forgetting where performance drastically degrades as training proceeds.

특히 OPT는 학습이 진행될수록 성능이 급격히 저하되는 catastrophic forgetting 현상을 겪는다.

4.2. Image Captioning

We finetune BLIP-2 models for the image captioning task, which asks the model to generate a text description for the image’s visual content.

우리는 이미지 캡셔닝 작업을 위해 BLIP-2 모델을 파인튜닝하며, 이는 이미지의 시각적 내용을 설명하는 텍스트를 생성하도록 요구한다.

We use the prompt “a photo of” as an initial input to the LLM and trains the model to generate the caption with the language modeling loss.

우리는 LLM의 초기 입력으로 “a photo of” 프롬프트를 사용하고, 언어 모델링 손실로 캡션을 생성하도록 모델을 학습시킨다.

We keep the LLM frozen during finetuning, and updates the parameters of the Q-Former together with the image encoder.

파인튜닝 동안 LLM은 동결 상태로 유지되며, Q-Former와 이미지 인코더의 파라미터만 업데이트된다.

We experiment with ViT-g and various LLMs.

우리는 ViT-g와 다양한 LLM을 사용하여 실험한다.

Detailed hyperparameters can be found in the appendix.

상세 하이퍼파라미터는 부록에서 확인할 수 있다.

We perform finetuning on COCO, and evaluate on both COCO test set and zero-shot transfer to NoCaps (Agrawal et al., 2019) validation set.

우리는 COCO에서 파인튜닝을 수행하고, COCO 테스트셋과 NoCaps 검증셋에 대한 zero-shot 전이로 평가한다.

The results are shown in Table 3.

결과는 Table 3에 나타난다.

BLIP-2 achieves state-of-the-art performance with significant improvement on NoCaps over existing methods, demonstrating strong generalization ability to out-domain images.

BLIP-2는 기존 방법들 대비 NoCaps에서 큰 성능 향상을 보이며, 도메인 밖(out-domain) 이미지에 대한 강력한 일반화 능력을 보여준다.

4.3. Visual Question Answering

Given annotated VQA data, we finetune the parameters of the Q-Former and the image encoder while keeping the LLM frozen.

주석이 달린 VQA 데이터를 기반으로, 우리는 LLM은 동결 상태로 유지하면서 Q-Former와 이미지 인코더의 파라미터를 파인튜닝한다.

We finetune with the open-ended answer generation loss, where the LLM receives Q-Former’s output and the question as input, and is asked to generate the answer.

우리는 open-ended answer generation 손실로 파인튜닝하며, 이때 LLM은 Q-Former의 출력과 질문을 입력으로 받아 정답을 생성하도록 학습된다.

In order to extract image features that are more relevant to the question, we additionally condition Q-Former on the question.

질문과 더 관련 있는 이미지 특징을 추출하기 위해, 우리는 Q-Former를 질문에 추가적으로 조건화한다.

Specifically, the question tokens are given as input to the Q-Former and interact with the queries via the self-attention layers, which can guide the Q-Former’s cross-attention layers to focus on more informative image regions.

구체적으로, 질문 토큰은 Q-Former의 입력으로 제공되며 self-attention 레이어를 통해 쿼리와 상호작용하여, cross-attention 레이어가 더 유의미한 이미지 영역에 집중할 수 있도록 안내한다.

Following BLIP, our VQA data includes the training and validation splits from VQAv2, as well as training samples from Visual Genome.

BLIP을 따라, 우리의 VQA 데이터는 VQAv2의 학습 및 검증 분할과 Visual Genome의 학습 샘플을 포함한다.

Table 4 demonstrates the state-of-the-art results of BLIP-2 among open-ended generation models.

Table 4는 open-ended generation 모델들 중 BLIP-2의 SOTA 결과를 보여준다.

4.4. Image-Text Retrieval

Since image-text retrieval does not involve language generation, we directly finetune the first-stage-pretrained model w/o LLM.

이미지-텍스트 검색은 언어 생성이 관여하지 않기 때문에, 우리는 LLM 없이 1단계 사전학습 모델을 직접 파인튜닝한다.

Specifically, we finetune the image encoder together with Q-Former on COCO using the same objectives (i.e. ITC, ITM, and ITG) as pre-training.

구체적으로, 우리는 COCO에서 사전학습과 동일한 목표(ITC, ITM, ITG)를 사용하여 이미지 인코더와 Q-Former를 함께 파인튜닝한다.

We then evaluate the model for both image-to-text retrieval and text-to-image retrieval on COCO and Flickr30K (Plummer et al., 2015) datasets.

그 후 COCO 및 Flickr30K 데이터셋에서 이미지-투-텍스트 검색과 텍스트-투-이미지 검색 모두에 대해 모델을 평가한다.

During inference, we follow Li et al. (2021; 2022) which first select k = 128 candidates based on the image-text feature similarity, followed by a re-ranking based on pairwise ITM scores.

추론 시, 우리는 이미지-텍스트 특징 유사도에 기반하여 먼저 k=128 후보를 선택한 후, pairwise ITM 점수를 기반으로 재정렬하는 Li의 방식을 따른다.

We experiment with both ViT-L and ViT-g as the image encoder.

우리는 이미지 인코더로 ViT-L과 ViT-g 모두를 사용하여 실험한다.

Detailed hyperparameters can be found in the appendix.

상세 하이퍼파라미터는 부록에서 확인할 수 있다.

The results are shown in Table 5.

결과는 Table 5에 나타난다.

BLIP-2 achieves state-of-the-art performance with significant improvement over existing methods on zero-shot image-text retrieval.

BLIP-2는 zero-shot 이미지-텍스트 검색에서 기존 방법 대비 큰 향상을 보이며 SOTA 성능을 달성한다.

The ITC and ITM losses are essential for image-text retrieval as they directly learn image-text similarity.

ITC와 ITM 손실은 이미지-텍스트 유사도를 직접 학습하기 때문에 이미지-텍스트 검색에 필수적이다.

In Table 6, we show that the ITG (image-grounded text generation) loss is also beneficial for image-text retrieval.

Table 6에서, 우리는 ITG 손실 또한 이미지-텍스트 검색에 유익함을 보여준다.

This result supports our intuition in designing the representation learning objectives: the ITG loss enforces the queries to extract visual features most relevant to the text, thus improving vision-language alignment.

이 결과는 표현 학습 목표 설계 시 우리의 직관을 뒷받침하는데, 즉 ITG 손실은 쿼리가 텍스트와 가장 관련된 시각 특징을 추출하도록 강제하여 비전-언어 정렬을 개선한다.

5. Limitation

Recent LLMs can perform in-context learning given few-shot examples.

최근 LLM들은 몇 개의 예시가 주어지면 in-context learning을 수행할 수 있다.

However, our experiments with BLIP-2 do not observe an improved VQA performance when providing the LLM with in-context VQA examples.

그러나 BLIP-2 실험에서는 LLM에 in-context VQA 예시를 제공하더라도 VQA 성능 향상이 관찰되지 않았다.

We attribute the lack of in-context learning capability to our pre-training dataset, which only contains a single image-text pair per sample.

우리는 이러한 in-context 학습 부재가 사전학습 데이터셋이 샘플당 단일 이미지-텍스트 페어만 포함하기 때문이라고 판단한다.

The LLMs cannot learn from it the correlation among multiple image-text pairs in a single sequence.

LLM은 하나의 시퀀스에 여러 이미지-텍스트 페어 간 상관관계를 학습할 수 없다.

The same observation is also reported in the Flamingo paper, which uses a close-sourced interleaved image and text dataset (M3W) with multiple image-text pairs per sequence.

유사한 관찰은 Flamingo 논문에서도 보고되었으며, Flamingo는 시퀀스당 여러 이미지-텍스트 페어를 포함한 비공개 데이터셋 M3W를 사용한다.

We aim to create a similar dataset in future work.

우리는 미래 연구에서 이와 유사한 데이터셋 제작을 목표로 한다.

BLIP-2’s image-to-text generation could have unsatisfactory results due to various reasons including inaccurate knowledge from the LLM, activating the incorrect reasoning path, or not having up-to-date information about new image content (see Figure 7).

BLIP-2의 이미지-투-텍스트 생성은 LLM의 부정확한 지식, 잘못된 추론 경로의 활성화, 새로운 이미지 콘텐츠에 대한 최신 정보 부재 등의 다양한 이유로 만족스럽지 못할 수 있다(Figure 7 참조).

Furthermore, due to the use of frozen models, BLIP-2 inherits the risks of LLMs, such as outputting offensive language, propagating social bias, or leaking private information.

또한 동결 모델을 사용하기 때문에 BLIP-2는 공격적 언어 생성, 사회적 편향 확산, 개인정보 유출 등의 LLM의 위험을 그대로 이어받는다.

Remediation approaches include using instructions to guide model’s generation or training on a filtered dataset with harmful content removed.

해결 방안으로는 지시문을 활용하여 생성 과정을 안내하거나 유해 콘텐츠가 제거된 필터링 데이터셋으로 학습하는 방법 등이 있다.

6. Conclusion

We propose BLIP-2, a generic and compute-efficient method for vision-language pre-training that leverages frozen pre-trained image encoders and LLMs.

우리는 동결 사전학습 이미지 인코더와 LLM을 활용하는, 일반적이고 연산 효율적인 비전-언어 사전학습 방법인 BLIP-2를 제안한다.

BLIP-2 achieves state-of-the-art performance on various vision-language tasks while having a small amount of trainable parameters during pre-training.

BLIP-2는 사전학습 동안 학습 가능한 파라미터 수가 적음에도 불구하고 다양한 비전-언어 작업에서 SOTA 성능을 달성한다.

BLIP-2 also demonstrates emerging capabilities in zero-shot instructed image-to-text generation.

BLIP-2는 zero-shot 지시 기반 이미지-투-텍스트 생성에서 새롭게 등장하는 능력을 보여준다.

We consider BLIP-2 as an important step towards building a multimodal conversational AI agent.

우리는 BLIP-2가 멀티모달 대화형 AI 에이전트 구축을 향한 중요한 단계라고 본다.

JINHO LEE

이전 포스트

[논문 번역] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

다음 포스트

[논문 번역] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

BLIP-2: Bootstrapping Language-Image Pre-training

Abstract

1. Introduction

2.1. End-to-end Vision-Language Pre-training

2.2. Modular Vision-Language Pre-training

3. Method

3.1. Model Architecture

3.2. Bootstrap Vision-Language Representation

Image-Text Contrastive Learning (ITC)

Image-grounded Text Generation (ITG)

Image-Text Matching (ITM)

Image-grounded Text Generation (ITG)

Image-Text Matching (ITM)

3.3. Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

3.4. Model Pre-training

Pre-training data.

Pre-trained image encoder and LLM.

Pre-training settings.

4. Experiment

4.1. Instructed Zero-shot Image-to-Text Generation

Zero-shot VQA

Observation & Discussion

Effect of Vision-Language Representation Learning

4.2. Image Captioning

4.3. Visual Question Answering

4.4. Image-Text Retrieval

5. Limitation

6. Conclusion

[논문 번역] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

[논문 번역] REACT : SYNERGIZING REASONING AND ACTING INLANGUAGE MODELS

0개의 댓글

[논문 번역] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

BLIP-2: Bootstrapping Language-Image Pre-training

Abstract

1. Introduction

2. Related Work

2.1. End-to-end Vision-Language Pre-training

2.2. Modular Vision-Language Pre-training

3. Method

3.1. Model Architecture

3.2. Bootstrap Vision-Language Representation

Image-Text Contrastive Learning (ITC)

Image-grounded Text Generation (ITG)

Image-Text Matching (ITM)

Image-grounded Text Generation (ITG)

Image-Text Matching (ITM)

3.3. Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

3.4. Model Pre-training

Pre-training data.

Pre-trained image encoder and LLM.

Pre-training settings.

4. Experiment

4.1. Instructed Zero-shot Image-to-Text Generation

Zero-shot VQA

Observation & Discussion

Effect of Vision-Language Representation Learning

4.2. Image Captioning

4.3. Visual Question Answering

4.4. Image-Text Retrieval

5. Limitation

6. Conclusion

[논문 번역] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

[논문 번역] REACT : SYNERGIZING REASONING AND ACTING INLANGUAGE MODELS

0개의 댓글