
The standard Transformer is applied directly to images: each image is split into fixed-size patches, the patches are linearly embedded, and the resulting sequence of patch tokens (plus a learnable [class] token) is fed to a standard Transformer encoder.
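A minimal PyTorch sketch of this idea is below; the patch size, embedding width, depth, and head count are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal sketch: split an image into patches, embed them, prepend a
    [class] token, and run a standard Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding as a strided convolution (equivalent to a linear
        # projection of flattened patches).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, dim) sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])              # classify from the [class] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # (2, 1000)
```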
Shape and Architecture
Experiments and Discussion




Contributions
Main idea: Distillation

Teacher model
DeiT exploits a strong image classifier as a teacher model to train a transformer. It simply adds a new distillation token, which interacts with the class and patch tokens through the self-attention layers.
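Below is a minimal sketch, with module and variable names of my own choosing, of how the extra distillation token and its separate head could be wired in; the hard-distillation loss shown is one of the two variants discussed next (the soft variant uses a KL divergence against the teacher's softened predictions instead).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeiTSketch(nn.Module):
    """Sketch: a ViT-style encoder whose input sequence carries a [class]
    token AND a [distillation] token in front of the patch tokens."""
    def __init__(self, dim=192, depth=4, heads=3, num_patches=196, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))   # new distillation token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)        # supervised by the true label
        self.head_dist = nn.Linear(dim, num_classes)   # supervised by the teacher

    def forward(self, patch_tokens):                   # (B, N, dim), already embedded
        B = patch_tokens.size(0)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1),
                            self.dist_token.expand(B, -1, -1),
                            patch_tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)                  # both extra tokens attend to the patches
        return self.head(tokens[:, 0]), self.head_dist(tokens[:, 1])

def hard_distill_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Hard distillation: the distillation head is trained on the teacher's
    # predicted class (argmax), the class head on the ground-truth label.
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(cls_logits, labels) + \
           0.5 * F.cross_entropy(dist_logits, teacher_labels)
```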
Soft vs. Hard distillation
Distillation token
Experiments and Discussion


Issues with Vanilla ViT Model
Main idea





A multi-stage hierarchy design (three stages in this work), borrowed from CNNs, is employed. Each stage has two parts: a Convolutional Token Embedding layer and a Convolutional Projection.
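A rough sketch of these two parts for a single stage is below; the channel widths, kernel sizes, and strides are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Overlapping convolution that re-tokenizes the feature map at the start
    of a stage, reducing spatial resolution and increasing channel width."""
    def __init__(self, in_ch, out_ch, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):                       # x: (B, C_in, H, W)
        x = self.proj(x)                        # (B, C_out, H', W')
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H'*W', C_out)
        return self.norm(tokens), (H, W)

class ConvProjection(nn.Module):
    """Depthwise convolution over the 2D token map, used in place of the usual
    linear projection for queries/keys/values inside attention; a stride > 1
    on keys/values would subsample them to cut attention cost."""
    def __init__(self, dim, kernel=3, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, stride, padding=kernel // 2, groups=dim)
        self.pw = nn.Linear(dim, dim)

    def forward(self, tokens, hw):              # tokens: (B, N, dim), N == H*W
        B, N, C = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)
        x = self.dw(x).flatten(2).transpose(1, 2)
        return self.pw(x)                       # projected tokens for Q, K, or V
```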
Convolutional Token Embedding layer
Convolutional projection


Model 1: Spatio-temporal attention

Model 2: Factorized encoder

Model 3: Factorized self-attention

Model 4: Factorized dot-product attention

Experiments and Discussion




A similar idea to CvT, applied to videos.
Multi Head Pooling Attention (MHPA)
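As a rough single-head sketch of the pooling idea (my own simplification; the actual MHPA is multi-head and uses separate, configurable strides for queries and keys/values over the space-time token grid):

```python
import torch
import torch.nn as nn

class PoolingAttentionSketch(nn.Module):
    """Simplified single-head pooling attention: queries, keys, and values
    are pooled over the (T, H, W) token grid before attention. Pooling K/V
    cuts the attention cost; pooling Q reduces the output resolution."""
    def __init__(self, dim, q_stride=(1, 2, 2), kv_stride=(1, 2, 2)):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.pool_q = nn.MaxPool3d(q_stride, q_stride)
        self.pool_kv = nn.MaxPool3d(kv_stride, kv_stride)
        self.scale = dim ** -0.5

    def _pool(self, x, pool, thw):
        B, N, C = x.shape
        T, H, W = thw
        x = x.transpose(1, 2).reshape(B, C, T, H, W)
        x = pool(x)                                      # pooled 5D feature map
        new_thw = x.shape[2:]
        return x.flatten(2).transpose(1, 2), new_thw     # back to a token sequence

    def forward(self, tokens, thw):                      # tokens: (B, T*H*W, dim)
        q, q_thw = self._pool(self.q(tokens), self.pool_q, thw)
        k, _ = self._pool(self.k(tokens), self.pool_kv, thw)
        v, _ = self._pool(self.v(tokens), self.pool_kv, thw)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v, q_thw           # output at the pooled Q resolution
```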

Compared to CvT

Discussion

📙 Lecture