For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. For a downstream task, supervised fine-tuning of the pretrained encoder then remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To answer these questions, we theoretically show that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the cl...
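To make the pretraining procedure concrete, below is a minimal sketch of mask-reconstruction pretraining in the setting the analysis considers: a two-layer convolutional encoder, a one-layer convolutional decoder, random patch masking, and a pixel-reconstruction loss on the masked patches. The module names, layer widths, patch size, and masking ratio are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative MRP sketch (assumed hyperparameters, not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLayerConvEncoder(nn.Module):
    """Two convolutional layers with a nonlinearity in between (assumed architecture)."""
    def __init__(self, in_ch=3, hidden=64, out_ch=128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv2(F.relu(self.conv1(x)))


class OneLayerConvDecoder(nn.Module):
    """Single convolutional layer mapping encoder features back to pixel space."""
    def __init__(self, in_ch=128, out_ch=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, z):
        return self.conv(z)


def random_patch_mask(images, patch=8, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches; return masked images and the mask."""
    b, c, h, w = images.shape
    ph, pw = h // patch, w // patch
    keep = (torch.rand(b, 1, ph, pw, device=images.device) > mask_ratio).float()
    mask = F.interpolate(keep, size=(h, w), mode="nearest")  # 1 = visible, 0 = masked
    return images * mask, mask


def mrp_step(encoder, decoder, images, optimizer):
    """One pretraining step: reconstruct pixels of the masked patches only."""
    masked, mask = random_patch_mask(images)
    recon = decoder(encoder(masked))
    # Mean squared error restricted to the masked region.
    denom = (1.0 - mask).sum().clamp(min=1.0) * images.size(1)
    loss = (((recon - images) ** 2) * (1.0 - mask)).sum() / denom
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    enc, dec = TwoLayerConvEncoder(), OneLayerConvDecoder()
    opt = torch.optim.AdamW(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    dummy = torch.randn(4, 3, 32, 32)  # stand-in batch; real use would be unlabeled images
    print(mrp_step(enc, dec, dummy, opt))
```

After pretraining, the decoder is discarded and the encoder is fine-tuned with labels on the downstream task, which is the SL-versus-MRP comparison the analysis studies.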