1 Introduction

Neural networks have shown their superiority on machine translation [1, 18] and other natural language processing tasks [5]. The self-attention-based Transformer [19] has become the dominant architecture for neural machine translation. This paper describes our submission to the CCMT-2020 Uighur \(\rightarrow \) Chinese translation task.

We build our system on the Transformer [19] due to its superior performance and parallelism. Several techniques that have proved effective are employed to boost the performance of our system.

We apply Byte Pair Encoding (BPE) [15] to reduce vocabulary sizes and achieve open-vocabulary translation. Tagged back-translation with top-k sampling [2, 7, 14] is used to exploit monolingual data and improve translation performance. We also train several variants of the Transformer, such as DynamicConv [20] and the Transformer with relative position representations. We filter the back-translated data by length and alignment features. Within a single training run, we average the parameters of the several best checkpoints [3] to obtain a stronger single model. Translation models trained on the mixed data are fine-tuned on the real data provided by the evaluation organizer. Finally, we translate the source texts by ensembling several of the best-performing models and rerank the n-best lists with the K-batched MIRA algorithm [4].

With the above techniques, our system improves by a large margin as measured by BLEU [13]. We also tried a few methods used in other neural machine translation systems without seeing significant improvements.

2 Machine Translation System

Since parallel data for Uighur \(\rightarrow \) Chinese translation is scarce, we adopt several effective techniques to alleviate the data-starvation problem. The following sections describe how we build a well-performing Uighur \(\rightarrow \) Chinese translation system in this low-resource scenario.

2.1 Pre-processing

Table 1. Statistics of pre-processed parallel data.

We escape special characters and normalize punctuation with Moses [10]. We then tokenize the Chinese sentences with pkuseg [12]. Sentences with more than 100 words are removed for both Uighur and Chinese. We also remove sentence pairs where the Chinese sentence is more than 6 times longer than the Uighur sentence or the Uighur sentence is more than 4 times longer than the Chinese sentence. We learn word alignments with fast_align [6] and filter out sentence pairs whose alignment rates are below 0.6. The statistics of the pre-processed parallel data are shown in Table 1. The remaining data is segmented with Byte Pair Encoding [15], using 32K merge operations for both Uighur and Chinese.
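The filtering rule described above can be sketched as follows (the length limit of 100 words, the length-ratio thresholds of 6 and 4, and the alignment-rate cut-off of 0.6 are taken from the text; the exact definition of the alignment rate and the fast_align link format are illustrative assumptions):

```python
# Minimal sketch of the length-ratio / alignment-rate filter described above.
# Assumes one whitespace-tokenized sentence per line and a fast_align output
# with one list of "i-j" alignment links per sentence pair (illustrative).

def alignment_rate(src_tokens, tgt_tokens, align_links):
    """Fraction of source and target tokens covered by alignment links."""
    src_aligned = {int(link.split("-")[0]) for link in align_links}
    tgt_aligned = {int(link.split("-")[1]) for link in align_links}
    covered = len(src_aligned) + len(tgt_aligned)
    total = len(src_tokens) + len(tgt_tokens)
    return covered / total if total else 0.0

def keep_pair(uig, zho, align_links,
              max_len=100, zho_ratio=6.0, uig_ratio=4.0, min_align=0.6):
    uig_tokens, zho_tokens = uig.split(), zho.split()
    if len(uig_tokens) > max_len or len(zho_tokens) > max_len:
        return False
    if len(zho_tokens) > zho_ratio * len(uig_tokens):
        return False
    if len(uig_tokens) > uig_ratio * len(zho_tokens):
        return False
    return alignment_rate(uig_tokens, zho_tokens, align_links) >= min_align
```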

Table 2. Architecture hyper-parameters of Transformer Big in our system.

2.2 Architecture

We adopt Transformer Big as our base model and tune a few architecture hyper-parameters for the current setting, as shown in Table 2. We train all models by optimizing the cross-entropy loss with label smoothing. The Adam optimizer [9] (\(\beta_1 = 0.9\), \(\beta_2 = 0.98\), \(\epsilon = 10^{-9}\)) is used for optimization. The learning rate is increased linearly during the first 4000 steps and then decayed with the inverse square root of the step number, as in [19]. We train all models on 4 NVIDIA Tesla V100 GPUs.
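This schedule corresponds to the formula of [19], \(lr = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup^{-1.5})\), and can be sketched as follows (the value \(d_{model}=1024\) is the standard Transformer Big setting and is an assumption about our exact configuration):

```python
# Inverse-square-root learning-rate schedule with linear warm-up, as in [19].
# warmup_steps = 4000 follows the text; d_model = 1024 is the standard
# Transformer Big dimension (an assumption about the exact configuration).

def transformer_lr(step, d_model=1024, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```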

To obtain more diverse models for ensembling, we train two variants of the vanilla Transformer: the Transformer with relative position representations [16] (Relative Transformer) and DynamicConv [20]. Checkpoint averaging [3] is also used to obtain a stronger single model.
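Checkpoint averaging is an element-wise mean over the parameters of the best checkpoints of a single training run; a minimal sketch (assuming each checkpoint file stores a plain parameter dictionary) is:

```python
import torch

# Sketch of checkpoint averaging [3]: element-wise mean of the parameters of
# the N best checkpoints from one training run. The checkpoint file layout
# (a plain state_dict per file) is an assumption.

def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}
```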

2.3 Back-Translation of Monolingual Data

Back-translation has proved to be an effective data augmentation method for neural machine translation [7, 14], especially in low-resource scenarios. With only 165K parallel sentence pairs provided, Transformer Big performs worse than Transformer Base (see Table 3). We train a Chinese \(\rightarrow \) Uighur translation model with the Transformer Base architecture. We then use this model to translate large-scale monolingual Chinese sentences into Uighur and construct pseudo-parallel data for Uighur \(\rightarrow \) Chinese translation.

Table 3. Back-translation with different strategies

We experiment with several methods for generating synthetic data as proposed in [7], such as beam search and top-k sampling. We find top-k sampling more effective, as shown in Table 3. A possible explanation is that top-k sampling introduces a moderate amount of noise into the synthetic data, which gives the pseudo data generated by top-k sampling a stronger training signal [8].
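Top-k sampling replaces the argmax or beam choice at each decoding step with sampling from the renormalized top-k token distribution; a minimal sketch (the logits interface is an assumption) is:

```python
import torch

# Minimal sketch of top-k sampling over a decoder's next-token distribution:
# keep the k highest-scoring tokens, renormalize, and sample one of them.
# The logits tensor shape (batch, vocab) is an assumption about the decoder.

def top_k_sample(logits, k=10):
    top_logits, top_indices = torch.topk(logits, k, dim=-1)   # (batch, k)
    probs = torch.softmax(top_logits, dim=-1)                  # renormalize
    choice = torch.multinomial(probs, num_samples=1)           # (batch, 1)
    return top_indices.gather(-1, choice).squeeze(-1)          # token ids
```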

It is useful to distinguish real data from synthetic data during training, since synthetic data is usually noisier. A simple way to do so is to add a tag in front of each synthetic source sentence, which is called Tagged Back-Translation [2]. The experimental results in Table 3 confirm its effectiveness for Uighur \(\rightarrow \) Chinese translation.
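A minimal sketch of the tagging step (the tag string <BT> is an illustrative choice, not necessarily the one used in our system):

```python
# Sketch of Tagged Back-Translation [2]: prepend a reserved tag token to the
# source side of every synthetic pair so the model can tell it apart from
# real data. The "<BT>" tag string is illustrative.

def tag_synthetic(src_sentence, tag="<BT>"):
    return f"{tag} {src_sentence}"
```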

We construct two synthetic datasets (named sample1 and sample2) with top-10 sampling in back-translation and filter the sentence pairs with length and alignment features.

2.4 Fine-Tuning

Table 4. Fine-tuning trained models on real data

There is a domain divergence between the real data and the synthetic data: the synthetic sentence pairs come from the general domain, while the real data is specific to the news domain. We therefore fine-tune the translation models trained on the mixed data on the real data to adapt them to the target domain.
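A minimal sketch of the fine-tuning step, assuming a standard PyTorch training loop; the reduced learning rate, update budget and model interface are illustrative assumptions, not our exact fine-tuning configuration:

```python
import torch

# Sketch of fine-tuning: resume from the model trained on mixed (real +
# synthetic) data and continue training on real data only. The learning rate
# and number of updates below are illustrative assumptions.

def fine_tune(model, real_data_loader, criterion, lr=1e-4, max_updates=2000):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 betas=(0.9, 0.98), eps=1e-9)
    model.train()
    step = 0
    for src, tgt_in, tgt_out in real_data_loader:
        optimizer.zero_grad()
        logits = model(src, tgt_in)               # teacher forcing (assumed API)
        loss = criterion(logits.view(-1, logits.size(-1)), tgt_out.view(-1))
        loss.backward()
        optimizer.step()
        step += 1
        if step >= max_updates:
            break
    return model
```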

As indicated in Table 4, fine-tuning the trained models on real data boosts their performance by a large margin, as measured by BLEU on the development set.

2.5 Ensemble Translation

Much previous work [1, 18] has shown the effectiveness of ensemble learning for improving translation quality. We translate the evaluation source texts by ensembling several diverse, best-performing models. Our experimental results in Table 5 show steady gains in translation quality as more of the best-performing models are ensembled.
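Ensembling operates at the probability level: at each decoding step the next-token distributions of the member models are averaged before beam search proceeds. A minimal sketch (the per-model step interface is an assumption) is:

```python
import torch

# Sketch of ensemble decoding: average the next-token probability
# distributions of the member models at each step, then continue beam or
# greedy search on the averaged distribution. The `step_logits` method is an
# assumed per-model interface returning (batch, vocab) logits.

def ensemble_next_token_log_probs(models, src, prev_tokens):
    probs = None
    for model in models:
        logits = model.step_logits(src, prev_tokens)
        p = torch.softmax(logits, dim=-1)
        probs = p if probs is None else probs + p
    return torch.log(probs / len(models))          # averaged log-probabilities
```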

Table 5. Ensemble translation: index i means the i-th model in Table 4

2.6 Reranking

We generate the n-best translation lists by ensembling the 6 best-performing models with beam size 24. We hand-craft several features for reranking the n-best lists, including the log probability of each single translation model, the target-to-source translation score, the right-to-left translation score [11], the n-gram language model perplexity and the beam index. The reranking model is tuned with the K-batched MIRA algorithm [4]. The BLEU score on the development set reaches 49.17 after reranking.
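Conceptually, reranking scores every hypothesis in the n-best list with a weighted combination of its features and picks the highest-scoring one; the weights themselves are tuned with K-batched MIRA [4]. A minimal sketch of the scoring step (the feature names are illustrative) is:

```python
# Sketch of n-best reranking: each hypothesis carries a feature vector
# (per-model log probabilities, target-to-source score, right-to-left score,
# LM perplexity, beam index), and the hypothesis with the highest weighted
# score is selected. Feature names and weights here are illustrative; the
# weights would come from K-batched MIRA tuning [4].

def rerank(nbest, weights):
    """nbest: list of (hypothesis, {feature_name: value}) pairs."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(nbest, key=lambda item: score(item[1]))[0]
```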

3 Results

Table 6 shows our systems evaluated by BLEU on the development set. For Uighur \(\rightarrow \) Chinese translation, BLEU scores [13] are computed at the character level. For the last 4 rows, each model builds on the model described in the previous row.

Table 6. Translation quality evaluated by BLEU on development set

We can see that back-translation, fine-tuning, ensemble translation and reranking consistently boost the performance of the Uighur \(\rightarrow \) Chinese translation system. Among these techniques, back-translation is the most effective in the low-resource scenario.

4 Conclusion

This paper presents our submission to the CCMT-2020 Uighur \(\rightarrow \) Chinese translation task. We obtain a strong baseline system by tuning Google's Transformer Big architecture and incrementally improve it with back-translation, fine-tuning, ensembling and reranking.