The Transformer architecture, built on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-normalized attention mechanism makes it impossible to highlight important speech information, since every frame receives nonzero weight. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments across different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme to enable each self-attentio...
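The abstract above contrasts softmax attention, which spreads nonzero weight over every frame, with a learned sparse scheme that can zero out irrelevant ones. As a minimal illustrative sketch (the abstract does not specify its exact sparsity scheme; sparsemax, from Martins & Astudillo 2016, is one standard choice with this property):

```python
import numpy as np

def softmax(z):
    """Standard softmax: all entries are strictly positive."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax: Euclidean projection of the score vector onto the
    probability simplex. Unlike softmax, it can assign exactly zero
    weight to low-scoring positions, yielding sparse attention."""
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum    # which positions stay nonzero
    k_z = k[support][-1]                   # support size
    tau = (cumsum[support][-1] - 1) / k_z  # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical attention scores
print(softmax(scores))    # every frame gets some weight
print(sparsemax(scores))  # low-scoring frames get exactly zero
```

Both outputs are valid probability distributions; only sparsemax can drive weights to exactly zero, which is the sense in which a sparse scheme "highlights" important frames.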
In recent years, the self-supervised learning paradigm has received extensive attention due to its great...
Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignmen...
Training deep neural network based Automatic Speech Recognition (ASR) models often requires thousand...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
There is growing interest in unifying the streaming and full-context automatic speech recognition (A...
Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer infer...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
As a result of advances in deep learning and neural network technology, end-to-end models have be...
Transformer-based speech recognition models have achieved great success due to the self-attention (S...
Transformers have achieved state-of-the-art results across multiple NLP tasks. However, the self-att...
Recently, self-attention-based transformers and conformers have been introduced as alternatives to R...
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention m...
For personalized speech generation, a neural text-to-speech (TTS) model must be successfully impleme...
Transformers are the state-of-the-art for machine translation and grammar error correction. One of t...
The recently proposed Conformer architecture has shown state-of-the-art perfor...