Neural machine translation (NMT) is often described as ‘data hungry’ because it typically requires large amounts of parallel data to build a good-quality machine translation (MT) system. However, most of the world's language pairs are low-resource or extremely low-resource. The situation is even worse when translation involves a specialised domain. In this paper, we present a novel data augmentation method which makes use of bilingual word embeddings (BWEs) learned from monolingual corpora and Bidirectional Encoder Representations from Transformers (BERT) language models (LMs). We augment a parallel training corpus by introducing new words (i.e. out-of-vocabulary (OOV) items) and increasing the presence of rare...
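To make the BERT-based substitution step concrete, the following is a minimal sketch, not the paper's exact pipeline: it assumes a HuggingFace-style `BertForMaskedLM` (the model name, the `propose_substitutes` helper, and the BWE filtering step mentioned in the comments are illustrative assumptions). It masks one word of a source sentence and collects in-context replacement candidates, which an augmentation method of this kind could then filter with BWE similarity so that each substitute has a counterpart on the target side.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Pretrained masked LM; "bert-base-cased" is an assumed stand-in for
# whichever LM the actual system uses.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def propose_substitutes(sentence_tokens, position, top_k=5):
    """Mask the token at `position` and let BERT propose in-context
    replacements. In a full augmentation pipeline, the candidates would
    be filtered with BWE similarity (not shown) so that a translation
    for the substitute exists on the target side of the corpus."""
    masked = list(sentence_tokens)
    masked[position] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    # Locate the [MASK] token in the wordpiece sequence.
    mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_idx].topk(top_k, dim=-1).indices[0]
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())

# Example: propose replacements for "proposal", yielding new synthetic
# sentence pairs that introduce rare or OOV words into the training data.
print(propose_substitutes("The committee approved the proposal".split(), 4))
```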