Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, in which the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames of an audio signal, while RNNs can be limited in modelling long-range dependencies among those time frames. In this paper, we propose an Audio Captioning Transformer (ACT), a full Transformer network that is based on an encoder-decoder architecture and is entirely convolution-free. The proposed method ...
Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language...
Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a ...
Video captioning via encoder–decoder structures is a successful sentence generation method. In addit...
Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample a...
Audio captioning is the task of automatically creating a textual description for the contents of a g...
Automated audio captioning aims to use natural language to describe the content of audio data. This ...
Audio captioning is a novel task in machine learning which involves the generation of textual descri...
It is widely believed that video captioning is a fundamental but challenging task in both computer vis...
Automated audio captioning, a task that mimics human perception as well as innovatively links audio ...
We propose an audio captioning system that describes non-speech audio signals in the form of natural...
Audio captioning is a multi-modal task, focusing on using natural language for describing the conten...
Dense video captioning is a task of localizing interesting events from an untrimmed video and produc...
Deep learning has become a very prevalent field in recent years, and many applications are coming out...
Audio captioning aims at generating natural language descriptions for audio clips automatically. Exi...
Automatic video description, or video captioning, is a challenging yet highly attractive task. It aims...