Paper Reading

1. ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

2. ViViT: A Video Vision Transformer

3. MAE: Masked Autoencoders Are Scalable Vision Learners
