Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision, and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders, which reconstruct masked inputs either directly or contrastively from unmasked content. This pretraining strategy, used in BERT models in NLP, Wav2Vec models in speech, and, more recently, MAE models in vision, forces the model to lea...
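The masked-reconstruction objective described above can be illustrated with a minimal sketch. The function and predictor below are hypothetical stand-ins, not any paper's actual model: a random subset of input patches is hidden, a predictor sees only the visible patches, and the reconstruction error is scored solely on the masked positions, as in BERT/MAE-style masked prediction.

```python
import numpy as np

def mae_style_loss(patches, predict, mask_ratio=0.75, seed=0):
    """Toy masked-autoencoder objective (illustrative only).

    patches : (N, D) array of input patches/tokens.
    predict : callable mapping (visible_patches, mask) to a full (N, D)
              reconstruction; stands in for an encoder-decoder network.
    Returns mean squared error computed ONLY on the masked positions.
    """
    n = patches.shape[0]
    rng = np.random.default_rng(seed)
    n_masked = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True

    visible = patches[~mask]          # the model only sees unmasked content
    recon = predict(visible, mask)    # full-length reconstruction
    err = (recon[mask] - patches[mask]) ** 2
    return err.mean()

# Usage: a trivial "predictor" that guesses the mean of the visible patches.
patches = np.arange(12, dtype=float).reshape(6, 2)
loss = mae_style_loss(patches, lambda vis, mask: np.tile(vis.mean(0), (6, 1)))
```

Scoring the loss only where the input was masked is what forces the model to infer hidden content from the surrounding context rather than copy its input.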
Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to e...
We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. B...
Can we leverage the audiovisual information already present in video to improve self-supervised repr...
As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of ...
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Enco...
We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous f...
Word order, an essential property of natural languages, is injected in Transformer-based neural lang...
Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural l...
Recently, the development of pre-trained language models has brought natural language processing (NL...
This paper explores a better prediction target for BERT pre-training of vision transformers. We obs...
In recent years, the development of accurate deep keyword spotting (KWS) models has resulted in KWS ...
Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of...
Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive s...
Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance ...
For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input p...