
1 Introduction

This paper presents the neural machine translation (NMT) systems built for the National Institute of Information and Communications Technology (NICT)’s participation in the CCMT-19 shared News Translation Task for the Chinese\(\leftrightarrow \)English directions. Specifically, we used the Transformer architecture to build our translation systems. We then employed techniques that have proven to be highly effective, such as back-translation, fine-tuning, and model ensembling, to generate the primary submissions for the Chinese\(\leftrightarrow \)English translation tasks. All of our systems are constrained, i.e., we used only the parallel and monolingual data provided by the organizers to train and tune our systems. This system is also part of our system for WMT19 [1].

The remainder of this paper is organized as follows. In Sect. 2, we present the data preprocessing. In Sect. 3, we introduce the details of our NMT systems. Empirical results obtained with our systems are analyzed in Sect. 4 and we conclude this paper in Sect. 5.

2 Datasets

2.1 Data

To train our systems, we used all the provided parallel data for both of our targeted translation directions. In addition, the training data for the Chinese\(\leftrightarrow \)English (ZH\(\leftrightarrow \)EN) translation tasks includes synthetic parallel data: following the findings of [6, 11], we selected the first 10 million lines of the News Crawl 2018 English monolingual corpus and generated the corresponding synthetic data through back-translation [5, 8].
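For concreteness, the following is a minimal sketch of how the first 10 million lines of a monolingual corpus can be extracted; the file names are placeholders, not the exact paths used in our experiments.

```python
# Keep only the first 10M lines of a monolingual corpus
# (file names are illustrative placeholders).
from itertools import islice

N_LINES = 10_000_000

with open("news.2018.en.shuffled", encoding="utf-8") as src, \
        open("news.2018.en.first10M", "w", encoding="utf-8") as out:
    out.writelines(islice(src, N_LINES))
```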

2.2 Pre-processing

We applied the Moses [4] tokenizer and truecaser to the English sentences. For Chinese, we used Jieba for tokenization but did not perform truecasing. For cleaning, we filtered out sentences longer than 80 tokens from the training data using the Moses script clean-corpus-n.perl and replaced characters forbidden by Moses. Tables 1 and 2 present the statistics of the parallel and monolingual data, respectively, after pre-processing.

Table 1. Statistics of our pre-processed parallel data
Table 2. Statistics of our pre-processed monolingual data
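The pipeline below is a minimal sketch of this pre-processing, assuming a standard Moses checkout and the Python Jieba package; the script paths, truecasing model, and corpus file names are illustrative assumptions rather than our exact setup.

```python
# Sketch of the pre-processing pipeline: Moses tokenization/truecasing for
# English, Jieba segmentation for Chinese, and Moses length filtering.
import subprocess
import jieba  # pip install jieba

MOSES = "mosesdecoder/scripts"  # assumed path to a Moses checkout

def preprocess_english(inp, out):
    """Tokenize and truecase an English file with the Moses scripts."""
    with open(inp, encoding="utf-8") as fin, open(out, "w", encoding="utf-8") as fout:
        tok = subprocess.Popen(
            ["perl", f"{MOSES}/tokenizer/tokenizer.perl", "-l", "en"],
            stdin=fin, stdout=subprocess.PIPE)
        subprocess.run(
            ["perl", f"{MOSES}/recaser/truecase.perl", "--model", "truecase-model.en"],
            stdin=tok.stdout, stdout=fout, check=True)
        tok.stdout.close()
        tok.wait()

def preprocess_chinese(inp, out):
    """Segment a Chinese file with Jieba; no truecasing is applied."""
    with open(inp, encoding="utf-8") as fin, open(out, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(jieba.cut(line.strip())) + "\n")

def clean_corpus(prefix, clean_prefix, max_len=80):
    """Drop sentence pairs longer than max_len tokens with the Moses script."""
    subprocess.run(
        ["perl", f"{MOSES}/training/clean-corpus-n.perl",
         prefix, "zh", "en", clean_prefix, "1", str(max_len)],
        check=True)

preprocess_english("corpus.raw.en", "corpus.tok.en")
preprocess_chinese("corpus.raw.zh", "corpus.tok.zh")
clean_corpus("corpus.tok", "corpus.clean", max_len=80)
```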

3 MT Systems

3.1 NMT

We used the Marian toolkit [2] to build competitive NMT systems based on the Transformer [10] architecture. We used the byte pair encoding (BPE) algorithm [9] to obtain a sub-word vocabulary of 50,000 tokens. The number of dimensions of all input and output layers was set to 512, and that of the inner feed-forward neural network layer was set to 2048. The number of attention heads in each encoder and decoder layer was set to eight. During training, the value of label smoothing was set to 0.1, and the attention dropout and residual dropout were set to 0.1. The Adam optimizer [3] was used to tune the parameters of the model. The learning rate was varied under a warm-up strategy with 16,000 warm-up steps. We validated the model every 5,000 batches on the development set and selected the best model according to its BLEU [7] score on the development set. All our NMT systems were consistently trained on four GPUs with the parameters for Marian listed in Table 3:

Table 3. Parameters for training Marian.
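Table 3 gives the exact parameters we passed to Marian. As an illustrative sketch only, the invocation below reconstructs a training command from the hyper-parameters stated above; the remaining flag values, vocabulary files, and corpus paths are assumptions and may differ from Table 3.

```python
# Illustrative Marian training invocation for a Transformer model; file names
# and any flags not mentioned in the text are placeholders/assumptions.
import subprocess

subprocess.run([
    "marian",
    "--type", "transformer",
    "--train-sets", "corpus.bpe.zh", "corpus.bpe.en",
    "--vocabs", "vocab.zh.yml", "vocab.en.yml",
    "--dim-emb", "512",                        # input/output layer size
    "--transformer-dim-ffn", "2048",           # inner feed-forward layer size
    "--transformer-heads", "8",                # attention heads per layer
    "--label-smoothing", "0.1",
    "--transformer-dropout", "0.1",            # residual dropout
    "--transformer-dropout-attention", "0.1",  # attention dropout
    "--optimizer", "adam",
    "--lr-warmup", "16000",                    # warm-up steps
    "--valid-freq", "5000",                    # validate every 5,000 batches
    "--valid-metrics", "bleu",
    "--valid-sets", "dev.bpe.zh", "dev.bpe.en",
    "--devices", "0", "1", "2", "3",           # four GPUs
    "--model", "model/model.npz",
], check=True)
```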

3.2 Back-Translation of Monolingual Data

The so-called “back-translation” of monolingual data has been shown to be one of the most effective ways to exploit monolingual data for NMT [8]. The idea is simply to translate target-language monolingual data into the source language with a pre-trained target-to-source NMT model, producing new synthetic parallel data that can be used to train NMT models. We concatenated the resulting synthetic parallel data with the original parallel data to train better NMT models. For EN\(\rightarrow \)ZH, we back-translated the entire XMU Chinese monolingual corpus (5.4M sentences) to produce synthetic English source data. For ZH\(\rightarrow \)EN, we empirically compared the impact of back-translating different amounts of English monolingual data, using the first 10M lines of the concatenation of the News Crawl 2016 and News Crawl 2017 English corpora to produce synthetic Chinese data.
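As a minimal sketch of this step for the ZH\(\rightarrow \)EN direction, the snippet below translates English monolingual data into Chinese with a pre-trained EN\(\rightarrow \)ZH Marian model and appends the result to the genuine parallel data; all model and file names are assumptions for illustration.

```python
# Back-translation sketch: produce synthetic Chinese sources for English
# monolingual sentences, then mix synthetic and genuine parallel data.
import shutil
import subprocess

# 1. Translate the English monolingual data with a pre-trained EN->ZH model.
with open("mono.bpe.en", encoding="utf-8") as fin, \
        open("synthetic.bpe.zh", "w", encoding="utf-8") as fout:
    subprocess.run(
        ["marian-decoder",
         "--models", "model.en-zh.npz",
         "--vocabs", "vocab.en.yml", "vocab.zh.yml",
         "--devices", "0", "1", "2", "3"],
        stdin=fin, stdout=fout, check=True)

# 2. Concatenate the synthetic pairs with the original parallel data.
for lang, synthetic in (("zh", "synthetic.bpe.zh"), ("en", "mono.bpe.en")):
    with open(f"train.mixed.bpe.{lang}", "w", encoding="utf-8") as out:
        for part in (f"corpus.bpe.{lang}", synthetic):
            with open(part, encoding="utf-8") as f:
                shutil.copyfileobj(f, out)
```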

3.3 Fine-Tuning and Ensemble of NMT Models

After back-translation, we ran training independently five times on the mixture of the original parallel data and the synthetic parallel data, thus obtaining five translation models. Each model was further fine-tuned on the ccmt2018_newstest set for 20 epochs. Finally, we decoded the ccmt2019_newstest set with an ensemble of the five fine-tuned models to generate the primary submissions for the ZH\(\leftrightarrow \)EN tasks.
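Ensembling in Marian amounts to passing all checkpoints to the decoder at once. The sketch below is only illustrative: the checkpoint and test-set file names are assumptions, and the beam size is not reported in this paper.

```python
# Ensemble decoding sketch: the five fine-tuned models are passed together to
# marian-decoder (file names and beam size are assumptions).
import subprocess

models = [f"model.ft.run{i}.npz" for i in range(1, 6)]

with open("ccmt2019_newstest.bpe.zh", encoding="utf-8") as fin, \
        open("primary.submission.en", "w", encoding="utf-8") as fout:
    subprocess.run(
        ["marian-decoder",
         "--models", *models,            # ensemble of the five checkpoints
         "--vocabs", "vocab.zh.yml", "vocab.en.yml",
         "--beam-size", "12",            # assumed value, not reported
         "--devices", "0", "1", "2", "3"],
        stdin=fin, stdout=fout, check=True)
```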

4 Results

Our systems were evaluated on the WMT2019NewsTest test set for the ZH\(\leftrightarrow \)EN tasks, and the results are shown in Table 4. For EN\(\rightarrow \)ZH, BLEU scores were computed on the basis of character-based segmentation. “w/backtr” and “w/o backtr” indicate with and without back-translation, respectively. “w/ft” indicates that the single model was fine-tuned on the ccmt2018_newstest set. “ensemble” indicates that five fine-tuned single models were ensembled at decoding time.

Table 4. Results (BLEU-cased) of our MT systems on the ccmt2018_newstest test set.

Our observations from Table 4 are as follows: back-translation, fine-tuning, and ensembling are all clearly effective for the ZH\(\leftrightarrow \)EN tasks. In particular, ensembling brought a larger improvement over the “Single model+back-translation+fine-tuning” system for ZH\(\rightarrow \)EN than for EN\(\rightarrow \)ZH.

5 Conclusion

In this paper, we presented NICT’s participation in the CCMT-2019 shared Chinese\(\leftrightarrow \)English news translation task. Our primary submissions to the tasks were produced by a simple combination of back-translation, fine-tuning, and ensembling. Our results confirmed that these three methods can incrementally improve the translation performance of Transformer-based NMT.