Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining. In this work, we introduce a monotonicity loss function that is compatible with standard attention mechanisms and test it on several sequence-to-sequence tasks: grapheme-to-phoneme conversion, morphological inflection, transliteration, and dialect normalization. Experiments show that we can achieve largely monotonic behavior. Performance is mixed, with larger gains on top of RNN baselines. General monotonicity does not benefit transformer multihead attention, however, ...
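The abstract does not spell out the form of the monotonicity loss, so the following is only a minimal sketch of the general idea under an assumed formulation: a soft penalty on the attention matrix that discourages the expected source position from moving backwards across target steps, added to the usual cross-entropy objective. The function name, the shape convention, and the weighting factor lam are illustrative choices, not the paper's actual implementation.

import torch

def monotonicity_loss(attn, eps=1e-8):
    """
    Hypothetical monotonicity penalty over soft attention weights.

    attn: tensor of shape (batch, tgt_len, src_len) whose rows sum to 1,
          e.g. the softmax attention of an RNN decoder or of a single
          transformer head.

    The sketch computes the expected source position attended to at each
    target step and penalizes any decrease between consecutive steps,
    i.e. it discourages the attention centre from jumping backwards.
    """
    batch, tgt_len, src_len = attn.shape
    positions = torch.arange(src_len, dtype=attn.dtype, device=attn.device)
    # Expected source position per target step: shape (batch, tgt_len)
    expected_pos = (attn * positions).sum(dim=-1)
    # Backward jumps between consecutive target steps (0 where monotone)
    backward = (expected_pos[:, :-1] - expected_pos[:, 1:]).clamp(min=0.0)
    return backward.mean()

if __name__ == "__main__":
    # Toy usage: mix the penalty into the usual training objective.
    attn = torch.softmax(torch.randn(2, 5, 7), dim=-1)
    ce_loss = torch.tensor(1.234)   # placeholder for the cross-entropy term
    lam = 0.1                       # weight of the monotonicity term (assumed)
    total = ce_loss + lam * monotonicity_loss(attn)
    print(float(total))

Because the penalty only reads the attention weights, a sketch like this is compatible with standard attention mechanisms; biasing only a subset of transformer heads would amount to summing the penalty over the chosen heads rather than over all of them.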