We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We focus on the Transformer for our analysis, as it has been shown to be effective on a variety of tasks, including machine translation (MT), standard left-to-right language modeling (LM), and masked language modeling (MLM). Previous work used black-box probing tasks to show that the representations learned by the Transformer differ significantly depending on the objective. In this work, we use canonical correlation analysis and mutual information estimators to study how information flows across Transformer layers, and we observe that the choice of the objective determines how this process unfolds.
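To make the analytical tool concrete, below is a minimal sketch of a CCA-based similarity score between two layers' token representations. It assumes plain NumPy and toy random activations; studies of this kind typically rely on refinements such as projection-weighted CCA, so treat this as an illustration of the core idea rather than the estimator used in the paper.

import numpy as np

def cca_similarity(X, Y, eps=1e-10):
    """Mean canonical correlation between two views of the same tokens.

    X, Y: arrays of shape (n_tokens, dim), activations of the same
    tokens taken at two different layers (or from two models).
    """
    # Center each view.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # Orthonormalize each view via SVD, dropping near-zero directions.
    def whiten(A):
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        return U[:, s > eps]

    Ux, Uy = whiten(X), whiten(Y)

    # Singular values of Ux^T Uy are the canonical correlations.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return rho.mean()

# Toy usage: compare two hypothetical "layers" over 1000 tokens.
rng = np.random.default_rng(0)
h_layer1 = rng.normal(size=(1000, 64))
h_layer2 = h_layer1 @ rng.normal(size=(64, 64))  # linear map of layer 1
print(cca_similarity(h_layer1, h_layer2))  # near 1.0: same subspace

Because canonical correlations are invariant to invertible linear maps of either view, the score stays near 1.0 for the linearly transformed layer above; applied to real activations, drops in this score across layers indicate where the feature space is reorganized.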