Multi-modal

1. CLIP: Learning Transferable Visual Models From Natural Language Supervision


2. Meshed-Memory Transformer for Image Captioning
