Multi-Modal

1. CLIP (Contrastive Language-Image Pre-training)

2. [paper] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
