Masked Autoencoder (MAE) pre-trained ViT
Masked Autoencoder (MAE)
Vision Transformer
(batch, 1, embed_dim)
(batch, 2, embed_dim)