Transformers in Vision: A Survey, Part 1

이준석 · December 13, 2022

Abstract

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM).
intrigued: made very interested or curious; salient: most important, most noticeable
Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets.
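modalities: forms or types of data; straightforward: simple and direct; scalability: ability to scale up
In addition, the simple design of Transformers lets them process multiple modalities (e.g., images, video, text, and speech) with similar processing blocks, and it scales very well to very large networks and huge datasets.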

These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
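discipline: field, area of study
These strengths have driven exciting progress on several vision tasks that use Transformer networks. This survey paper aims to give a comprehensive overview of Transformer models in the field of computer vision.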

We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.
We begin by introducing the basic concepts behind the success of Transformers: self-attention, large-scale pre-training, and bidirectional feature encoding.
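To make the self-attention idea mentioned above concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is not code from the survey: the function names (`self_attention`, `matmul`, `softmax`), the toy token values, and the identity projection matrices are all assumptions chosen for illustration. The point it tries to show is that every token attends to every other token in one step, which is how long-range dependencies are modeled without recurrence.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: n tokens, each a d-dimensional vector.
    w_q, w_k, w_v: hypothetical projection matrices (learned in a real model).
    Returns the attended outputs and the n x n attention weights.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Every query is compared with every key, regardless of position.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: three 2-dimensional tokens; identity projections for simplicity.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` is a probability distribution over all tokens, so row `i` says how much token `i` draws from every other token. The rows are computed independently of each other, which is what allows the whole sequence to be processed in parallel, unlike a recurrent network.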

We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation).
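Next, we cover the wide range of Transformer applications in vision: popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition and video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation).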

We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.
We compare the advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we analyze open research directions and possible future work.

We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
We hope this effort sparks further interest in the community in solving the current challenges of applying Transformer models to computer vision.

5 CONCLUSION

Attention has played a key role in delivering efficient and accurate computer vision systems, while simultaneously providing insights into the function of deep neural networks.
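simultaneously: at the same time
Attention has played a key role in delivering efficient and accurate computer vision systems while also providing insight into how deep neural networks function.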

This survey reviews the self-attention approaches and specifically focuses on the Transformer and bidirectional encoding architectures that are built on the principle of self-attention.
This survey reviews self-attention approaches, focusing in particular on the Transformer and bidirectional encoding architectures built on the principle of self-attention.

We first cover fundamental concepts pertaining to self-attention architectures and later provide an in-depth analysis of competing approaches for a broad range of computer vision applications.
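We first cover the basic concepts related to self-attention architectures and then provide an in-depth analysis of competing approaches across a wide range of computer vision applications.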

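Specifically, we include state-of-the-art self-attention models for image recognition, object detection, semantic and instance segmentation, video analysis and classification, visual question answering, visual commonsense reasoning, image captioning, vision-language navigation, clustering, few-shot learning, and 3D data analysis.
In particular, the survey covers the latest self-attention models for image recognition, object detection, semantic and instance segmentation, video analysis and classification, visual question answering, visual commonsense reasoning, image captioning, vision-language navigation, clustering, few-shot learning, and 3D data analysis.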

We systematically highlight the key strengths and limitations of the existing methods and particularly elaborate on the important future research directions. With its specific focus on computer vision tasks, this survey provides a unique view of the recent progress in self-attention and Transformer-based methods.
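systematically: in an organized, methodical way; elaborate (on): explain in more detail; specific: particular
We systematically highlight the main strengths and limitations of existing methods and explain the important future research directions in detail. Because it focuses specifically on computer vision tasks, this survey offers a distinctive view of recent progress in self-attention and Transformer-based methods.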

We hope this effort will drive further interest in the vision community to leverage the potential of Transformer models and improve on their current limitations e.g., reducing their carbon footprint.
We hope this effort encourages the vision community to make use of the potential of Transformer models and to improve on their current limitations, such as reducing their carbon footprint.
