Paper accepted for presentation at the ViGIL 2021 workshop @NAACL. This version: added models to the comparison (ICMLM, TSM); added tests of adversarial robustness; mistake identified and corrected in the normalization of image features; results and conclusions updated accordingly. Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of uns...
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretra...
Images can be described in terms of the objects they contain, or in terms of the types of scene or p...
Powered by deep convolutional networks and large scale visual datasets, modern computer vision syste...
Vision models trained on multimodal datasets can benefit from the wide availab...
Hosted by ACL 2022: 60th Annual Meeting of the Association for Computational Linguistics. Internation...
In recent years, joint text-image embeddings have significantly improved thanks to the development o...
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is ...
Current language models have been criticised for learning language from text alone without connectio...
While text generated by current vision-language models may be accurate and syntactically correct, it...
Language acquisition by children and machines is remarkable. Yet while children learn from hearing a...
154 pages. Over the course of the last decades, we have witnessed the significant progress of machine ...
© 2021 IEEE. Previous models for vision-to-language generation tasks usually pretrain a visual encoder...