Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g.\ vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing---something that most existing approaches ...
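As a point of reference for the "explicit products, mixtures, or other such factorisations" mentioned above, here is a minimal worked sketch (our notation, not the paper's) of the product-of-experts recognition model used in prior multimodal VAEs such as the MVAE of Wu and Goodman (2018): unimodal Gaussian posteriors are reconciled by multiplying their densities, which stays in closed form,
\[
q(z \mid x_{1:M}) \;\propto\; p(z) \prod_{m=1}^{M} q_m(z \mid x_m),
\qquad
q_m(z \mid x_m) = \mathcal{N}\!\big(z;\, \mu_m, \operatorname{diag}(\sigma_m^2)\big),
\]
which, under a standard-normal prior $p(z) = \mathcal{N}(0, I)$, gives the precision-weighted Gaussian
\[
\sigma^2 = \Big(1 + \sum_{m=1}^{M} \sigma_m^{-2}\Big)^{-1},
\qquad
\mu = \sigma^2 \sum_{m=1}^{M} \frac{\mu_m}{\sigma_m^2},
\]
computed elementwise. MEME avoids constructing any such joint posterior in the recognition model, instead letting each modality's VAE supervise the other.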
Many real-world problems are inherently multimodal, from the communicative modalities humans use to ...
Multimodal variational autoencoders (VAEs) have shown promise as efficient generative models for wea...
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and ...
Multimodal VAEs have recently gained attention as efficient models for weakly-supervised generative ...
The growth of content on the web has raised various challenges, yet also provided numerous opportuni...
Multimodal generative models learn a joint distribution over multiple modalities and thus have the p...
Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed from ...
Significant progress has been witnessed in learning-based Multi-view Stereo (MVS) under supervised a...
Generative models have recently shown the ability to realistically generate data and model the distr...
One of the key factors driving the success of machine learning for scene understanding is the develo...
A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimoda...
Representation learning is a significant and challenging task in multimodal learning. Effective moda...
Multi-modal entity alignment (MMEA) aims to discover identical entities across different knowledge g...
Devising deep latent variable models for multi-modal data has been a long-standing theme in machine ...