Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics in the learned representations. In this work, we propose to use a semantically rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from the pixel level to the semantic level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space into compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. ...
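To make the two-stage recipe above concrete, here is a minimal PyTorch sketch of the second stage, masked visual-token prediction against a frozen tokenizer. It is an illustration under stated assumptions, not the paper's implementation: `semantic_encoder`, `codebook`, and `vit` are hypothetical stand-ins for the tokenizer's frozen feature encoder, its learned code vocabulary, and the vision Transformer being pretrained.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins (not the paper's actual modules):
#   semantic_encoder(images) -> (B, N, D) patch features from the frozen tokenizer
#   codebook                 -> (V, D) learned code embeddings
#   vit(images, mask)        -> (B, N, V) logits over the code vocabulary

def tokenize(images, semantic_encoder, codebook):
    """Assign each image patch the id of its nearest codebook entry."""
    with torch.no_grad():
        feats = semantic_encoder(images)          # (B, N, D)
    feats = F.normalize(feats, dim=-1)
    codes = F.normalize(codebook, dim=-1)         # (V, D)
    # Nearest neighbor under cosine distance = argmax of cosine similarity.
    return (feats @ codes.t()).argmax(dim=-1)     # (B, N) discrete token ids

def mim_loss(images, semantic_encoder, codebook, vit, mask_ratio=0.4):
    """Cross-entropy on masked positions against the tokenizer's codes."""
    targets = tokenize(images, semantic_encoder, codebook)        # (B, N)
    mask = torch.rand(targets.shape, device=images.device) < mask_ratio
    logits = vit(images, mask)                                    # (B, N, V)
    # Supervise only the masked patches, as in BERT-style pretraining.
    return F.cross_entropy(logits[mask], targets[mask])
```

Because the targets are discrete semantic codes rather than raw pixels, the loss pushes the Transformer toward high-level semantics instead of low-level reconstruction.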
Self-supervised Video Representation Learning (VRL) aims to learn transferable representations from...
This study introduces an effective approach, Masked Collaborative Contrast (MCC), to highlight sem...
We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervi...
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Enco...
This paper explores improvements to the masked image modeling (MIM) paradigm. The MIM paradigm enabl...
Masked image modeling (MIM) has shown great potential in self-supervised...
We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. B...
Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive s...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...
Pretraining language models with next-token prediction on massive text corpora has delivered phenome...
Transformers and masked language modeling are quickly being adopted and explored in computer vision ...
Recent masked image modeling (MIM) has received much attention in self-supervised learning (SSL), wh...
Self-attention is of vital importance in semantic segmentation as it enables modeling of long-range ...
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeli...
Masked image modeling has been demonstrated as a powerful pretext task for generating robust represe...