In this paper, we introduce multi-scale structure as prior knowledge into self-attention modules. We propose a Multi-Scale Transformer, which uses multi-scale multi-head self-attention to capture features at different scales. Based on a linguistic perspective and an analysis of a Transformer (BERT) pre-trained on a huge corpus, we further design a strategy to control the scale distribution of each layer. Results on three different kinds of tasks (21 datasets) show that our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate-size datasets.
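A minimal sketch of what multi-scale multi-head self-attention could look like, assuming that "scale" means each head restricts attention to a local window of a different size; the window sizes, head count, and class name below are illustrative assumptions, not the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleSelfAttention(nn.Module):
    """Self-attention where each head attends within its own local window (sketch)."""

    def __init__(self, d_model=256, scales=(1, 3, 5, 7)):
        super().__init__()
        self.n_heads = len(scales)            # one head per scale (assumption)
        self.d_head = d_model // self.n_heads
        self.scales = scales                  # local window radius per head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # split into heads: (batch, heads, seq_len, d_head)
        def split(t):
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)

        # band mask per head: head h only attends to positions within +/- scales[h]
        pos = torch.arange(n, device=x.device)
        dist = (pos[None, :] - pos[:, None]).abs()              # (seq_len, seq_len)
        masks = torch.stack([dist <= w for w in self.scales])   # (heads, seq_len, seq_len)
        scores = scores.masked_fill(~masks[None], float("-inf"))

        attn = F.softmax(scores, dim=-1)
        out = attn @ v                        # (batch, heads, seq_len, d_head)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.out(out)


if __name__ == "__main__":
    layer = MultiScaleSelfAttention(d_model=256, scales=(1, 3, 5, 7))
    x = torch.randn(2, 32, 256)
    print(layer(x).shape)                     # torch.Size([2, 32, 256])
```

The key design choice this sketch illustrates is that heads with small windows capture local, phrase-level features while heads with larger windows capture longer-range dependencies, so a single layer mixes several scales.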