Although multilingual vision-language pretrained models offer several benefits, recent benchmarks across various tasks and languages show poor cross-lingual generalisation when these models are applied to non-English data, with a large gap between (supervised) English performance and (zero-shot) cross-lingual transfer. In this work, we investigate the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task, where models are fine-tuned on English visual-question data and evaluated on seven typologically diverse languages. We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective to augment the cross-entropy loss wit...
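The first strategy augments the standard cross-entropy loss with an auxiliary objective. Since the abstract is cut off before it specifies the linguistic prior, the sketch below (PyTorch) only illustrates the general pattern of adding a weighted auxiliary term to cross-entropy; the KL-to-prior term and the names `combined_vqa_loss`, `prior_log_probs`, and `alpha` are hypothetical stand-ins, not the paper's actual formulation.

```python
# A minimal, illustrative sketch only. The abstract truncates before it
# specifies the linguistic prior, so the KL-to-prior term below is a
# hypothetical stand-in for whatever auxiliary objective the paper uses.
import torch
import torch.nn.functional as F

def combined_vqa_loss(logits: torch.Tensor,
                      answer_ids: torch.Tensor,
                      prior_log_probs: torch.Tensor,
                      alpha: float = 0.1) -> torch.Tensor:
    """Cross-entropy over the answer vocabulary, plus a hypothetical
    auxiliary term pulling predictions toward a linguistic prior."""
    # Standard supervised VQA objective on English training data.
    ce = F.cross_entropy(logits, answer_ids)
    # Hypothetical auxiliary term: KL divergence between the model's
    # answer distribution and a precomputed prior distribution.
    kl = F.kl_div(F.log_softmax(logits, dim=-1), prior_log_probs,
                  reduction="batchmean", log_target=True)
    return ce + alpha * kl

# Usage (shapes assumed): logits [batch, num_answers] from the V&L model,
# answer_ids [batch], prior_log_probs [batch, num_answers].
```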
Large-scale cross-lingual language models (LM), such as mBERT, Unicoder and XLM, have achieved great...
Pre-trained multilingual language models show significant performance gains for zero-shot cross-ling...
Recent vision-language models are driven by large-scale pretrained models. How...
Prior work on multilingual question answering has mostly focused on using large multilingual pre-tra...
Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress...
Multilingual language models exhibit better performance for some languages than for others (Singh et...
Visual Question Answering (VQA) aims to answer the natural language question about a given image by ...
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong ...
Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to ...
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models t...
Vision models trained on multimodal datasets can benefit from the wide availab...
Multilingual question answering (MLQA) is a critical part of an accessible natural language interfac...
Cross-lingual Machine Reading Comprehension (xMRC) is a challenging task due to the lack of training...
The wide adoption of self-attention (i.e., the transformer model) and BERT-like training principles ...