The standard Transformer model is directly applied to images
Shape and Architecture
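As a reminder of how an image becomes a token sequence for the standard Transformer, here is a minimal PyTorch-style sketch (sizes and names are illustrative, not the reference implementation): the image is split into fixed-size patches, each patch is linearly projected, and the resulting sequence (plus a class token and position embeddings, omitted here) is fed to a standard Transformer encoder.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT-style patch embedding: split the image into patches, project each to a vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to flattening
        # non-overlapping patches and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, img):                      # img: (B, 3, 224, 224)
        x = self.proj(img)                       # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)      # (B, 196, dim) token sequence for the encoder
```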
Experiments and Discussion
Contributions
Main idea: Distillation
Teacher model
DeiT exploits a strong image classifier as a teacher model to train a transformer. DeiT simply adds a new distillation token, which interacts with the class and patch tokens through the self-attention layers.
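A minimal sketch, assuming standard PyTorch modules rather than DeiT's actual code, of how the extra distillation token and its own classification head could sit next to the class token; all dimensions and layer counts are placeholders.

```python
import torch
import torch.nn as nn

class DeiTStyleTokens(nn.Module):
    """Illustrative only: prepend class + distillation tokens to the patch tokens."""
    def __init__(self, num_patches=196, dim=768, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))   # extra learnable token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(dim, num_classes)        # supervised by the true label
        self.head_dist = nn.Linear(dim, num_classes)   # supervised by the teacher

    def forward(self, patch_tokens):                   # (B, num_patches, dim)
        B = patch_tokens.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        dist = self.dist_token.expand(B, -1, -1)
        x = torch.cat([cls, dist, patch_tokens], dim=1) + self.pos_embed
        x = self.encoder(x)                            # all tokens interact via self-attention
        return self.head(x[:, 0]), self.head_dist(x[:, 1])
```

At test time DeiT fuses the predictions of the two heads; in this sketch that would amount to averaging the two returned logit tensors.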
Soft vs. Hard distillation
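The two objectives compared in DeiT, restated here with Z_s and Z_t the student and teacher logits, ψ the softmax, y the ground-truth label, y_t the teacher's hard prediction (argmax of Z_t), τ the temperature, and λ the balancing coefficient; the KL argument order below follows the usual teacher-to-student convention of knowledge distillation.

```latex
% Soft distillation: cross-entropy on the true label plus KL to the teacher's softened distribution
\mathcal{L}_{\text{soft}} = (1-\lambda)\,\mathcal{L}_{\text{CE}}\bigl(\psi(Z_s),\, y\bigr)
  + \lambda\,\tau^{2}\,\mathrm{KL}\bigl(\psi(Z_t/\tau)\,\|\,\psi(Z_s/\tau)\bigr)

% Hard-label distillation: the teacher's prediction acts as a second ground-truth label
\mathcal{L}_{\text{hard}} = \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\bigl(\psi(Z_s),\, y\bigr)
  + \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\bigl(\psi(Z_s),\, y_t\bigr)
```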
Distillation token
Experiments and Discussion
Issues with Vanilla ViT Model
Main idea
A multi-stage hierarchy design (3 stages in this work) borrowed from CNNs is employed. Each stage has two parts: a Convolutional Token Embedding and a Convolutional Projection.
Convolutional Token Embedding layer
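A rough sketch of this layer, assuming PyTorch and placeholder kernel/stride/dimension choices: an overlapping strided convolution maps the previous stage's token map to a spatially smaller, wider token map, which is flattened back into a token sequence and layer-normalized.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Illustrative convolutional token embedding for one CvT-like stage."""
    def __init__(self, in_chans=3, embed_dim=64, kernel_size=7, stride=4, padding=2):
        super().__init__()
        # Overlapping strided conv: reduces spatial resolution, increases channel width
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size, stride=stride, padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, embed_dim, H', W')
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H'*W', embed_dim) token sequence
        return self.norm(tokens), (H, W)        # keep (H', W') to reshape for the next stage
```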
Convolutional projection
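A companion sketch of the convolutional projection, again with assumed shapes and a stock nn.MultiheadAttention standing in for the paper's attention implementation: queries, keys, and values are produced by depthwise separable convolutions on the 2D token map, and the key/value branches can use a stride greater than 1 to subsample tokens and reduce attention cost.

```python
import torch
import torch.nn as nn

class ConvProjectionAttention(nn.Module):
    """Illustrative convolutional projection feeding standard multi-head attention."""
    def __init__(self, dim=64, num_heads=1, kv_stride=2):
        super().__init__()
        def dw_conv(stride):
            # Depthwise conv + BN + pointwise conv as a separable projection
            return nn.Sequential(
                nn.Conv2d(dim, dim, 3, stride=stride, padding=1, groups=dim),
                nn.BatchNorm2d(dim),
                nn.Conv2d(dim, dim, 1),
            )
        self.q_proj = dw_conv(stride=1)
        self.k_proj = dw_conv(stride=kv_stride)   # subsampled keys
        self.v_proj = dw_conv(stride=kv_stride)   # subsampled values
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens, hw):                # tokens: (B, H*W, dim), hw = (H, W)
        B, N, C = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)   # back to a 2D token map
        flat = lambda t: t.flatten(2).transpose(1, 2)    # (B, C, h, w) -> (B, h*w, C)
        q, k, v = flat(self.q_proj(x)), flat(self.k_proj(x)), flat(self.v_proj(x))
        out, _ = self.attn(q, k, v)               # (B, H*W, dim)
        return out
```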
Model 1: Spatio-temporal attention
Model 2: Factorized encoder
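A condensed sketch of the factorized-encoder idea, with arbitrary layer counts and mean pooling standing in for the per-frame class token used in the paper: a spatial encoder processes each frame's tokens independently, and a temporal encoder then attends across the per-frame summaries.

```python
import torch
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    """Illustrative ViViT Model-2-style factorized encoder."""
    def __init__(self, dim=192, spatial_layers=4, temporal_layers=4, nhead=3):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.spatial = nn.TransformerEncoder(layer(), num_layers=spatial_layers)
        self.temporal = nn.TransformerEncoder(layer(), num_layers=temporal_layers)

    def forward(self, x):                               # x: (B, T, N, dim) frame-wise patch tokens
        B, T, N, C = x.shape
        x = self.spatial(x.reshape(B * T, N, C))        # attention within each frame only
        frame_repr = x.mean(dim=1).reshape(B, T, C)     # one pooled representation per frame
        return self.temporal(frame_repr)                # attention across frames
```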
Model 3: Factorized self-attention
Model 4: Factorized dot-product attention
Experiments and Discussion
An idea similar to CvT, applied to videos: a multi-scale hierarchy that expands channel capacity while reducing spatio-temporal resolution across stages.
Multi Head Pooling Attention (MHPA)
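A simplified sketch of the pooling-attention idea, using 1D max pooling on a flattened token sequence instead of the paper's strided convolutional pooling over space-time, and ignoring the class token: queries, keys, and values are downsampled before attention, so pooling the query shortens the output sequence (resolution reduction between stages) while pooling keys and values cuts the cost of each attention layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingAttention(nn.Module):
    """Illustrative multi-head pooling attention (simplified MHPA-like block)."""
    def __init__(self, dim=96, num_heads=1, q_stride=2, kv_stride=4):
        super().__init__()
        self.q_lin = nn.Linear(dim, dim)
        self.k_lin = nn.Linear(dim, dim)
        self.v_lin = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q_stride, self.kv_stride = q_stride, kv_stride

    @staticmethod
    def pool(t, stride):                         # t: (B, N, C) -> (B, N // stride, C)
        if stride == 1:
            return t
        return F.max_pool1d(t.transpose(1, 2), kernel_size=stride, stride=stride).transpose(1, 2)

    def forward(self, x):                        # x: (B, N, dim)
        q = self.pool(self.q_lin(x), self.q_stride)    # pooled query -> shorter output sequence
        k = self.pool(self.k_lin(x), self.kv_stride)   # pooled keys
        v = self.pool(self.v_lin(x), self.kv_stride)   # pooled values
        out, _ = self.attn(q, k, v)              # (B, N // q_stride, dim)
        return out
```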
Compared to CvT
Discussion
📙 Lecture