Can we exploit extremely large monolingual corpora to improve neural machine translation without expensive back-translation? Neural machine translation models are trained on parallel bilingual corpora, and even the large ones contain only 20 to 40 million parallel sentence pairs. Meanwhile, pre-trained language models such as BERT and GPT are typically trained on billions of monolingual sentences. Directly using BERT to initialize the Transformer encoder yields no benefit, because BERT's knowledge is catastrophically forgotten during further training on MT data. This example shows how to run the CTNMT (Yang et al., 2020) training method, which integrates BERT into a Transformer MT model and was the first successful method to d...
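One of CTNMT's core ideas is a dynamic switch: an element-wise sigmoid gate that decides, per dimension, how much of the frozen BERT representation flows into the NMT encoder state, instead of overwriting it. The sketch below illustrates that gating computation in NumPy under stated assumptions; the parameter names `W`, `U`, and `b` and the toy dimensions are illustrative, not the actual API of any CTNMT implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_switch(h_bert, h_nmt, W, U, b):
    """CTNMT-style dynamic switch (Yang et al., 2020), sketched:
        g = sigmoid(W @ h_bert + U @ h_nmt + b)
        h = g * h_bert + (1 - g) * h_nmt
    Each dimension of the fused state h is a convex combination of the
    BERT representation and the NMT encoder representation.
    """
    g = sigmoid(W @ h_bert + U @ h_nmt + b)
    return g * h_bert + (1.0 - g) * h_nmt

# Toy dimensionality for illustration; real models use the shared
# hidden size of BERT and the Transformer encoder (e.g. 768).
rng = np.random.default_rng(0)
d = 4
h_bert = rng.standard_normal(d)   # stands in for a frozen BERT output vector
h_nmt = rng.standard_normal(d)    # stands in for an NMT encoder embedding
W = rng.standard_normal((d, d))
U = rng.standard_normal((d, d))
b = np.zeros(d)

h = dynamic_switch(h_bert, h_nmt, W, U, b)
```

Because the gate is element-wise and bounded in (0, 1), every dimension of the fused vector lies between the corresponding BERT and NMT values, so a degenerate gate can fall back entirely on either representation.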
This article describes our experiments in neural machine translation using the recent Tensor2Tensor ...
With the advent of deep neural networks in recent years, Neural Machine Translation (NMT) systems ha...
Title: Exploring Benefits of Transfer Learning in Neural Machine Translation Author: Tom Kocmi Depar...
GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various n...
Neural machine translation (NMT) is often described as ‘data hungry’ as it typically requires large ...
Humans benefit from communication but suffer from language barriers. Machine translation (MT) aims t...
Pre-training and fine-tuning have achieved great success in the natural language processing field. The stan...
Pre-trained transformer is a class of neural networks behind many recent natural language processing...
A prerequisite for training corpus-based machine translation (MT) systems – either Statistical MT (S...
Pre-trained language models received extensive attention in recent years. However, it is still chall...
Neural machine translation (NMT), where neural networks are used to generate translations, has revol...
Monolingual data have been demonstrated to be helpful in improving translation quality of both stati...
Neural machine translation (NMT) has been a mainstream method for the machine translation (MT) task....
Pre-training and fine-tuning have become the de facto paradigm in many natural language processing (...