[CV] An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) Review

๊ฐ•๋™์—ฐยท2022๋…„ 3์›” 2์ผ

🎈 This review was written with reference to the ViT paper and an existing review of it.

Keywords

🎈 Using a pure Transformer for image recognition
🎈 Fewer computational resources to train

Introduction

โœ” ์œ„๋Š” ViT์€ ์ „๋ฐ˜์ ์ธ ์•„ํ‚คํ…์ฒ˜ ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๊ธฐ์กด์˜ Transformer๋‚˜ Bert์™€ ๋งค์šฐ ์œ ์‚ฌํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์˜ ์ €์ž๋Š” ๊ธฐ์กด์˜ transformer ๊ตฌ์กฐ๋ฅผ ์ตœ๋Œ€ํ•œ ๋น„์Šทํ•˜๊ฒŒ ์„ค๊ณ„ํ•ด image classification์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ๊ธฐ๋ณธ์ ์ธ Transformer๋‚˜ Bert ๊ตฌ์กฐ๋ฅผ ์ฝ์–ด์•ผ ์ดํ•ดํ•˜๊ธฐ ์ˆ˜์›”ํ•ฉ๋‹ˆ๋‹ค.

โœ” ViT๋Š” ๊ธฐ์กด์˜ CNN๋ณด๋‹ค inductive bias๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค๊ณ  ์ด์•ผ๊ธฐํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ์ ์€ ๋ฐ์ดํ„ฐ ์…‹๋ณด๋‹ค๋Š” ๋งŽ์€ ๋ฐ์ดํ„ฐ ์…‹์—์„œ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ๋Š” ๋งŽ์€ ๋ฐ์ดํ„ฐ์…‹์ด ์žˆ๋‹ค๋ฉด SOTA์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๋ฉฐ, ์ ์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋กœ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

โœ… Inductive bias

  • Inductive bias refers to the set of assumptions a model relies on in order to make reasonable inferences about data it has not seen during training.

โœ” ๋…ผ๋ฌธ์—์„œ๋Š” Transformer๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ CNN๋ณด๋‹ค Inductive bias์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ์ด์•ผ๊ธฐํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ CNN์—์„œ๋Š” translation equivariance์™€ locality๋ผ๋Š” ๊ฐ€์ •์ด ์กด์žฌํ•˜์ง€๋งŒ, Transformer์—์„œ๋Š” ์ด๋ฅผ ๊ฐ€์ • ํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ์ถฉ๋ถ„ํ•˜์ง€ ๋ชปํ•œ data์—์„œ๋Š” ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ์–ป์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค(ex. ImageNet)

Method

โœ” ViT์˜ ๋ชจ๋ธ ๋””์ž์ธ์€ ๊ธฐ์กด์˜ Transformer์™€ ๊ฐ€๋Šฅํ•œ ํ•œ ์œ ์‚ฌํ•˜๊ฒŒ ๊ตฌ์„ฑํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Vision Transformer(ViT)

โœ” ViT๋Š” ๊ธฐ๋ณธ์ ์ธ ์ด๋ฏธ์ง€๋ฅผ Patch๋กœ ๋ถ„ํ• ํ•ด ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ patch๋ฅผ (16x16), (14x14) ์‚ฌ์ด์ฆˆ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์ด๋Š” ์ด๋ฏธ์ง€์˜ resolution๊ณผ๋Š” ๊ด€๊ณ„์—†์ด ์ผ์ •ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ธฐ์กด์˜ Transformer ๊ตฌ์กฐ์™€ ๋‹ค๋ฅธ ์ ์€ Norm์„ ๋จผ์ € ์ˆ˜ํ–‰ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.

โœ” ๊ธฐ์กด์˜ Transformer์˜ ๊ฒฝ์šฐ์—” 1D sequence of token์ด ํ•„์š”ํ–ˆ๋‹ค๋ฉด, ViT๋Š” ์ด๋ฏธ์ง€(2D)๋ฅผ ๋‹ค๋ฃจ๊ธฐ ๋•Œ๋ฌธ์— ์œ„์™€ ๊ฐ™์ด Reshape๋ฅผ ํ•„์š”๋กœํ•ฉ๋‹ˆ๋‹ค. Hร—Wร—CH \times W \times C์— ๋Œ€ํ•ด Nร—(P2โ‹…C)N \times (P^2 \cdot C) ๊ตฌ์กฐ๋กœ reshape๋ฅผ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ (H,W)(H,W)๋Š” ์›๋ณธ ์ด๋ฏธ์ง€์˜ ๋†’์ด์™€ ๋„ˆ๋น„์ด๋ฉฐ, CC๋Š” ์ฑ„๋„์˜ ์ˆ˜, (P,P)(P,P)๋Š” ๊ฐ๊ฐ์˜ ์ด๋ฏธ์ง€ patch์˜ ํฌ๊ธฐ์ด๋ฉฐ, N=HW/P2N= HW/P^2(=ํŒจ์น˜์˜ ์ˆ˜) ์ž…๋‹ˆ๋‹ค.

โœ” Transformer์—์„œ D ์‚ฌ์ด์ฆˆ์˜ ์ƒ์ˆ˜ ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ, patches๋“ค์„ flattenํ•ด D dimenstion์œผ๋กœ ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค. ์œ„์˜ projection ๊ฒฐ๊ณผ๋ฅผ patch embedding์ด๋ผ๊ณ  ์ด์•ผ๊ธฐ ํ•ฉ๋‹ˆ๋‹ค.

โœ” ๋˜ํ•œ Bert์™€ ์œ ์‚ฌํ•˜๊ฒŒ ์‹œ์ž‘์ง€์ ์— ํ•™์Šต๊ฐ€๋Šฅํ•œ ์ž„๋ฒ ๋”ฉ [class] token์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค(z00=xclassz_0^0=x_{class}). ๊ฐ class ํ† ํฐ์€ impage representation์„ Transformer encoder์—์„œ output์œผ๋กœ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

✔ A classification head is attached to $z_L^0$ during both pre-training and fine-tuning. It is implemented as an MLP with one hidden layer at pre-training time and as a single linear layer at fine-tuning time.
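A sketch of the two head variants, assuming $D = 768$ (the hidden width and activation below are illustrative choices, not specified by this summary):

```python
import torch.nn as nn

D, num_classes = 768, 1000  # illustrative sizes

# Pre-training: MLP with one hidden layer on top of z_L^0.
pretrain_head = nn.Sequential(
    nn.Linear(D, D), nn.Tanh(), nn.Linear(D, num_classes)
)

# Fine-tuning: a single linear layer.
finetune_head = nn.Linear(D, num_classes)
```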

โœ” ๋˜ํ•œ ์ถ”๊ฐ€์ ์œผ๋กœ patch embedding์— position embeddingํ•œ ๊ฐ’์œผ ๋”ํ•ฉ๋‹ˆ๋‹ค. position embedding๋Š” ๊ธฐ์กด์˜ ํ•™์Šต๊ฐ€๋Šฅํ•œ 1D position embedding์„ ์ง„ํ–‰ํ–ˆ๋Š”๋ฐ, ์ด๋Š” 2D๋กœ ์ง„ํ–‰ํ–ˆ์œผ๋•Œ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹์•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

โœ” ์œ„๋Š” ViT์˜ ์ „์ฒด์ ์ธ ๊ตฌ์กฐ๋ฅผ ์œ„์˜ ์ˆ˜์‹์„ ํ†ตํ•ด ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. MSA๋Š” ๊ธฐ์กด transformer์— multiheaded self-attention์„ ์˜๋ฏธํ•˜๋ฉฐ, LN์€ Layernorm์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

Inductive bias

✔ The definition of inductive bias was briefly covered above. ViT has less inductive bias than a CNN: the self-attention layers are global, so only the MLP layers are local and translationally equivariant.

Hybrid Architecture

โœ” ViT์—์„œ๋Š” rawํ•œ image patch๊ฐ€ ์•„๋‹Œ, CNN์„ ํ†ตํ•ด ์ถ”์ถœ๋œ feature map์„ input์œผ๋กœ ์‚ฌ์šฉํ•œ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ชจ๋ธ์„ ์‹คํ—˜ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ฐ๊ฐ์˜ patch๋“ค์€ 1x1๋กœ input์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Fine-Tuning And Higher Resolution

โœ” ์ „ํ˜•์ ์œผ๋กœ, ViT๋Š” ํฐ ๋ฐ์ดํ„ฐ ์…‹์„ pre-trained ํ•œ ํ›„ fine-tune downstream tasks๋ฅผ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค. fine-tune์‹œ pre-traine prediction head๋Š” ์ œ๊ฑฐํ•œ ํ›„ Dร—KD \times K๋กœ ์ด๋ฃจ์–ด์ง„ zero-initialized ํ”ผ๋“œํฌ์›Œ๋“œ ๋ ˆ์ด์–ด๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ถ”๊ฐ€๋กœ K๋Š” downstream class์˜ ์ˆ˜๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

โœ” ์ผ๋ฐ˜์ ์œผ๋กœ ์ƒ๊ฐํ•˜๋ฉด ์ด๋ฏธ์ง€์˜ resolution์— ๋”ฐ๋ผ์„œ patch ์‚ฌ์ด์ฆˆ๋ฅผ ๋‹ค๋ฅด๊ฒŒ ํ•˜๋Š”๊ฒŒ ์•„๋‹ˆ๋ผ, patch ์‚ฌ์ด์ฆˆ๋ฅผ ๊ณ ์ •์„ ํ•ฉ๋‹ˆ๋‹ค. patch์˜ ์‚ฌ์ด์ฆˆ๋ฅผ ๊ณ ์ •ํ•˜๋ฉด sequence lengths๋ฅผ ๋‹ฌ๋ผ์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ์ž„์˜์˜ sequence lengths๋ฅผ ์ง€์ •ํ•ด์ค€๋‹ค๋ฉด ํฌ์ง€์…˜ ์ž„๋ฒ ๋”ฉ์˜ ์˜๋ฏธ๊ฐ€ ์‚ฌ๋ผ์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ 2D interpolation๋ฅผ ์‚ฌ์šฉํ•ด ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Experiments

Setup

โœ” Datasets์€ ์•„๋ž˜์˜ ๋ฐ์ดํ„ฐ๋“ค์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

  • ImageNet with 1k classes
  • ImageNet-21k with 21k classes and 14M images
  • JFT with 18k classes and 303M high-resolution images

✔ The model variants are Base, Large, and Huge, as summarized below (from Table 1 of the paper); a name like ViT-L/16 means the Large model with a 16x16 input patch size. Note that the smaller the patch size, the longer the sequence, so the computational cost increases.

  • ViT-Base: 12 layers, hidden size D = 768, MLP size 3072, 12 heads, 86M parameters
  • ViT-Large: 24 layers, D = 1024, MLP size 4096, 16 heads, 307M parameters
  • ViT-Huge: 32 layers, D = 1280, MLP size 5120, 16 heads, 632M parameters

✔ Training & Fine-tuning: all models are trained with the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$), a batch size of 4096, and a high weight decay of 0.1. For fine-tuning, SGD with momentum is used with a batch size of 512.
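As a sketch, the two optimizer setups might look like this in PyTorch (the learning rates are placeholders; the paper schedules them per model and dataset):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 1000)  # stand-in for a ViT model

# Pre-training: Adam with beta1 = 0.9, beta2 = 0.999 and weight decay 0.1.
pretrain_opt = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.1
)

# Fine-tuning: SGD with momentum (used with a batch size of 512).
finetune_opt = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
```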

Comparison to State of the Art

โœ” ๊ธฐ์กด์˜ SOTA ๋ชจ๋ธ๊ณผ์˜ ๋น„๊ต์ž…๋‹ˆ๋‹ค. ViT๊ฐ€ SOTA์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๋ฉด์„œ, ํ›จ์‹  ๋” ์ ์€ computational resources๊ฐ€ ํ•„์š”ํ•œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Pre-Training Data Requirements

โœ” ์œ„์˜ ๊ทธ๋ž˜ํ”„์—์„œ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์„๋–„๋Š” ๊ธฐ์กด์˜ SOTA ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋” ์ข‹์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค๋งŒ, ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์ง€๋ฉด ViT์˜ ์„ฑ๋Šฅ์ด ๋” ์ข‹์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” ์ด๋Š” ์•ž์„œ ๋ง์”€๋“œ๋ ธ๋˜ inductive bias์™€ ์—ฐ๊ด€์ง€์–ด ์ƒ๊ฐํ•œ๋‹ค๋ฉด, ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๊ฐ€ ๋งŽ์œผ๋ฉด inductive bias๊ฐ€ ํฌ๊ฒŒ ์ค‘์š”ํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•  ์ˆ˜๋„ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค.

Inspecting Vision Transformer

โœ” Figure 7์˜ Left๋Š” ํ•™์Šต๋œ ์ž„๋ฒ ๋”ฉ ํ•„ํ„ฐ์˜ ๊ตฌ์„ฑ์š”์†Œ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ตฌ์„ฑ ์š”์†Œ๋Š” ๊ฐ patch๋‚ด์—์„œ ๋ฏธ์„ธํ•œ ๊ตฌ์กฐ๋ฅผ ์ €์ฐจ์›์ ์œผ๋กœ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•œ ๊ทธ๋Ÿด๋“ฏํ•œ ๊ธฐ๋ณธ ๊ธฐ๋Šฅ๊ณผ ์œ ์‚ฌํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” Figure 7์˜ center๋Š” ๋ชจ๋ธ์˜ position ์ž„๋ฒ ๋”ฉ์˜ ์œ ์‚ฌ์„ฑ์œผ๋กœ ์ด๋ฏธ์ง€ ๋‚ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋” ๊ฐ€๊นŒ์šด patch๋Š” ๋” ์œ ์‚ฌํ•œ position ์ž„๋ฒ ๋”ฉ์„ ๊ฐ–๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” Figure 7์˜ right๋Š” "attention weights"๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ •๋ณด๊ฐ€ ํ†ตํ•ฉ๋œ ์ด๋ฏธ์ง€ ๊ณต๊ฐ„์˜ ํ‰๊ท  ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•œ ๊ฑฐ๋ฆฌ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ์˜๋ฏธํ•˜๋Š” "attention weights"๋Š” CNN์˜ receptive field size์™€ ์œ ์‚ฌํ•œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

✅ We have now gone over the ViT network as a whole. If you already know the Transformer and BERT, the ViT architecture is not hard to understand; the paper itself states that it follows the Transformer as closely as possible. Beyond the experimental results summarized above, the paper presents various additional experiments; please refer to the paper for details.

