Paper link | Papers with Code

The Vision Transformer, or ViT, is a model for image classification that employs a Transformer-like architecture over patches of t