Paper Review

1.ViT: An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale

2.CLIP: Learning Transferable Visual Models From Natural Language Supervision

3.BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

4.BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
