Masked image modeling (MIM) as pre-training has been shown to be effective for numerous downstream vision tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. With MIM, the model can also maintain a large diversity across attention heads in all layers. B...
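As a rough illustration of the locality analysis described above, the following is a minimal sketch (not the paper's released code) of a per-head averaged attention distance: the spatial distance between query and key patches, weighted by the attention probabilities, so local heads score low and global heads score high. The tensor layout and the names attn and grid_size are assumptions for illustration.

    import torch

    def average_attention_distance(attn: torch.Tensor, grid_size: int) -> torch.Tensor:
        """attn: [num_heads, N, N] softmax attention over N = grid_size**2 patches.
        Returns the mean attended distance per head, in patch units."""
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1
        ).reshape(-1, 2).float()                       # [N, 2] patch-grid coordinates
        dist = torch.cdist(coords, coords)             # [N, N] pairwise Euclidean distances
        # Expected distance under each head's attention distribution, averaged over queries.
        return (attn * dist).sum(dim=-1).mean(dim=-1)  # [num_heads]

    # Example: three heads over a 14x14 patch grid with uniform (fully global) attention.
    attn = torch.full((3, 196, 196), 1.0 / 196)
    print(average_attention_distance(attn, grid_size=14))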
Transformers and masked language modeling are quickly being adopted and explored in computer vision ...
The self-supervised Masked Image Modeling (MIM) schema, following the "mask-and-reconstruct" pipeline of...
Several recent works have directly extended the image masked autoencoder (MAE) with random masking i...
It has been witnessed that masked image modeling (MIM) has shown a huge potential in self-supervised...
An important goal of self-supervised learning is to enable model pre-training to benefit from almost...
This paper explores improvements to the masked image modeling (MIM) paradigm. The MIM paradigm enabl...
Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive s...
In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM metho...
Recent masked image modeling (MIM) has received much attention in self-supervised learning (SSL), wh...
We propose ADIOS, a masked image model (MIM) framework for self-supervised learning, which simultane...
For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input p...
Though image transformers have shown competitive results with convolutional neural networks in compu...
Self-supervised Video Representation Learning (VRL) aims to learn transferrable representations from...
Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in ...
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretr...
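Several of the entries above refer to the same "mask-and-reconstruct" pipeline. The following is a minimal sketch of one generic training step under that pipeline, assuming ViT-style non-overlapping patches; the patch size, masking ratio, and linear encoder/decoder stand-ins are illustrative assumptions, not any particular paper's design.

    import torch
    import torch.nn as nn

    patch_size, mask_ratio, dim = 16, 0.75, 192
    encoder = nn.Linear(patch_size * patch_size * 3, dim)  # stand-in for a ViT encoder
    decoder = nn.Linear(dim, patch_size * patch_size * 3)  # stand-in for a light decoder

    def mim_step(images: torch.Tensor) -> torch.Tensor:
        B, C, H, W = images.shape
        # Patchify into non-overlapping patches: [B, N, C * patch_size * patch_size]
        patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
        # Randomly mask a fraction of patches per image (True = masked).
        mask = torch.rand(B, patches.shape[1]) < mask_ratio
        # Corrupt masked patches (zeroing stands in for a learned mask token).
        corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
        recon = decoder(encoder(corrupted))
        # Reconstruction loss is computed on masked positions only.
        return ((recon - patches) ** 2)[mask].mean()

    loss = mim_step(torch.randn(2, 3, 224, 224))
    loss.backward()

Individual methods differ mainly in what is reconstructed (raw pixels, tokens, or features), how masked positions are handled by the encoder, and the masking strategy, but they share this basic objective.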