AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

이준석 · June 22, 2022


The Vision Transformer, or ViT, is an image classification model that applies a Transformer-like architecture to patches of the image.
An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder.
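To make the patch step concrete, here is a minimal NumPy sketch of splitting an image into patches, embedding them linearly, and adding position embeddings. The 16x16 patch size comes from the paper's title; the 224x224 input size, the embedding width of 64, the random weights, and the helper names `patchify` and `embed_patches` are assumptions chosen for illustration, and the bias term of the linear projection is omitted for brevity.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, patch_size, w_embed, pos_embed):
    """Linearly embed each patch, then add one position embedding per patch."""
    patches = patchify(image, patch_size)  # (N, P*P*C)
    tokens = patches @ w_embed             # (N, D)
    return tokens + pos_embed              # (N, D)

# Illustrative sizes only (assumptions, not taken from the post).
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patch_size, dim = 16, 64
num_patches = (224 // patch_size) ** 2
w_embed = rng.standard_normal((patch_size * patch_size * 3, dim)) * 0.02
pos_embed = rng.standard_normal((num_patches, dim)) * 0.02

tokens = embed_patches(image, patch_size, w_embed, pos_embed)
print(tokens.shape)  # (196, 64)
```

In a trained model both `w_embed` and `pos_embed` would be learned parameters; the resulting `tokens` array is the sequence handed to the Transformer encoder.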
To perform classification, the standard approach of adding an extra learnable "classification token" to the sequence is used.
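The classification token can be sketched in the same spirit. The snippet below prepends one extra vector to the patch sequence and reads the prediction from that position after the encoder. The encoder itself is replaced by an identity placeholder, and the names `prepend_class_token` and `classify`, the zero initialization, and all sizes are assumptions for illustration rather than the paper's exact setup.

```python
import numpy as np

def prepend_class_token(patch_tokens, cls_token):
    """Prepend a (1, D) class token to an (N, D) patch sequence."""
    return np.concatenate([cls_token, patch_tokens], axis=0)

def classify(encoded, w_head):
    """Take the output at the class-token position (index 0) and apply a linear head."""
    return encoded[0] @ w_head

# Illustrative sizes only (assumptions, not taken from the post).
rng = np.random.default_rng(0)
num_patches, dim, num_classes = 196, 64, 10
patch_tokens = rng.standard_normal((num_patches, dim))
cls_token = np.zeros((1, dim))  # learnable in practice; zeros here for illustration
w_head = rng.standard_normal((dim, num_classes)) * 0.02

sequence = prepend_class_token(patch_tokens, cls_token)  # (N + 1, D)
encoded = sequence  # placeholder: a real model would run the Transformer encoder here
logits = classify(encoded, w_head)
print(sequence.shape, logits.shape)  # (197, 64) (10,)
```

Because self-attention lets the token at index 0 attend to every patch, its encoder output can serve as a summary of the whole image, which is why only that position feeds the classification head.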

I want to become an AI expert.