Abstract
Transfer learning is a promising approach to the problem of machine translation for low-resource languages. Like any machine learning method, it requires several choices to be made, such as selecting the training data, and in particular the language pairs together with the quantity and quality of data available for them. Other important choices arise during preprocessing, such as selecting the data used to learn the subword units and, consequently, the model's vocabulary. It is still unclear how to optimize this transfer. In this paper, we analyse the impact of such early choices on the performance of the resulting systems. We show that system performance depends on the quantity of available data and the proximity of the languages involved, as well as on the protocol used to build the subword unit model and, consequently, the vocabulary. We also propose a multilingual approach to transfer learning involving a universal encoder. This multilingual approach is comparable to a multi-source transfer learning setup where the system learns from multiple languages before the transfer. We analyse the distribution of subword units across languages and show that, once again, preprocessing choices impact overall system performance.
Keywords
- Transfer learning
- Machine translation
- Languages proximity
- Data quantity
- Subwords distribution
- Languages balance
- Data balance
- Multilingual
1 Introduction
Major technical advances have made neural systems the most effective approach to machine translation when a large amount of data is available [1, 19]. However, when only small amounts of training data are available, neural systems struggle to reach good performance [13]. Transfer learning consists in first training a neural machine translation (NMT) system on another language pair for which a larger quantity of data is available. This trained system is then adapted to the data of the low-resource language pair, with the aim of obtaining better performance than a system trained on the scarce data alone.
In machine translation, transfer learning amounts to first training a “parent” system on abundant data, then using it as the starting point for one or more “child” systems, which continue training on the new data. It is comparable to domain adaptation where the domain is another language. The transfer generally improves the results of the child system, which benefits from the knowledge acquired during the training of the parent system [20]. In this paper, we show that the data used to train the parent system significantly impact the resulting child system. Many factors must be taken into account, such as data quantity, language proximity and data preprocessing. Several studies have addressed the quantity of data used as well as the proximity of the languages involved, and their conclusions diverge [5, 12].
2 Related Work
Neural systems currently require large training corpora to achieve good performance, which is by definition problematic for low-resource language pairs. Phrase-based statistical machine translation systems then appear as a relevant alternative [13].
Several multilingual machine translation approaches have been developed to translate texts from low-resource language pairs [7, 10, 15]. The use of universal encoders and decoders allowed [10] to design a system that handles several language pairs in parallel and achieves better results, especially for low-resource languages. Specific symbols (e.g. <2s>) are used to control the output language of the universal decoder. This kind of model even makes it possible to translate language pairs never seen during training (so-called zero-shot translation). However, performance in such cases remains relatively low. In the same line, [17] explore different parameter-sharing schemes in a multilingual model.
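As a minimal illustration of this control mechanism, the sketch below prepends a target-language token to a tokenized source sentence. The token format and the example sentence are our own assumptions, not the exact symbols used in [10].

```python
# Sketch of the target-language control token used by universal
# encoder/decoder models: a token such as <2en> is prepended to the
# source sentence so that a single shared decoder knows which language
# to produce. Token format and example tokens are illustrative only.

def add_language_token(source_tokens, target_lang):
    """Prepend a control token selecting the decoder's output language."""
    return [f"<2{target_lang}>"] + source_tokens

print(add_language_token(["Tere", "hommikust"], "en"))
# ['<2en>', 'Tere', 'hommikust']
```

At training time, one such token per target language is added to the shared vocabulary, so the same decoder parameters serve all output languages.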
The choice of the level of representation of words is important. [6, 16] have shown that sharing sub-lexical symbols between the languages of the parent and child models improves transfer performance. In our work, we use this method while exploring different numbers of symbols (i.e. different vocabulary sizes).
Transfer learning tries to overcome the low-resource problem by relying on a parent system (trained on a large amount of data) that serves as the basis for learning a child system [20]. The transfer is more effective when languages are shared between the parent and the child. In this direction, [5] highlight the importance of language proximity for obtaining a higher-quality transfer. These observations are contradicted in [12], where better results are obtained with more distant but better-endowed language pairs.
The work presented in this paper extends that of [12] and [5] on several points. Like [12], we evaluate the performance of the child system according to the data used in the parent system, still considering the criteria of language proximity and data quantity. In this study, we also consider a parent system built around a universal encoder (trained over several languages). We study the different choices to be made when preprocessing the data and setting the parameters of the translation model, and we try to determine the best configuration.
The objective is to better understand how the amount of data and the proximity of languages jointly impact the performance of the child system. We will see that our experiments contradict some of the conclusions of the articles mentioned above.
3 Data
In order to perform transfer learning, we need multiple language pairs. We selected the Estonian\(\rightarrow \)English language pair as our low resource language pair. Our goal is to have the best possible results for this language pair, thus we need another language pair to train the parent system for the transfer.
3.1 Data Selection
We use data provided by the WMT2018 machine translation evaluation campaign [2]. 2.5 million parallel sentences (41M source-side words and 52M target-side words) are available for the Estonian\(\rightarrow \)English language pair. We cannot strictly consider this a low-resource language pair, but this quantity remains low for training an NMT system that achieves good results. To assess the impact of language proximity in the parent system, we use two language pairs with different proximities to Estonian. The first pair is Finnish\(\rightarrow \)English: Finnish is close to Estonian, since both are Finno-Ugric languages. 5 million parallel sentences are available for this pair, representing 78M source-side and 114M target-side words. As a distant language pair, we chose German\(\rightarrow \)English: German is a Germanic language, further away from Finnish and Estonian, for which 40 million parallel sentences are available, corresponding to 570M source-side and 607M target-side words. This will also allow us to evaluate the impact of the quantity of data. Both pairs have English as target, so the transfer involves a close (or distant) language on the source side while the target language is fixed. We want to determine whether this significant difference in data quantity can compensate for the language distance, resulting in a distant (German\(\rightarrow \)English) parent system from which the transfer is as effective as from a close (Finnish\(\rightarrow \)English) parent system.
3.2 Data Preprocessing
We use SentencePiece (SPM) subword units [14]. Systems using subword units are the current state of the art in neural machine translation. Transfer quality also correlates with the number of subword units shared between the parent and child languages [16].
Two separate models of subword units are learned, one trained on source languages and another trained on the target language (English). Both are used in parent and child systems. The corresponding source and target vocabularies are created from tokenized data. Consequently, there is a direct correlation between data used to train the SPM models and resulting vocabularies in the NMT systems.
We take this into account by learning the source-side subword model on the data used to train both the parent and the child systems. This subword model is then applied to all source-side data, for both parent and child. The goal is to keep the vocabulary unchanged during the parent/child transition, since changing it would require representations for units never seen during the training of the parent model.
Sentences shorter than 3 subword units or longer than 100 subword units are filtered out.
Finally, subword units that occur at least 5 times in our training corpus are kept in the vocabularies, while the others are mapped to an unknown unit (<unk>). This step is necessary in our case because SPM cannot guarantee exhaustive coverage of the corpus.
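The filtering and vocabulary-thresholding steps above can be sketched as follows. The length bounds (3 to 100 units) and the minimum count (5) come from the text; the function names and toy data are our own assumptions.

```python
from collections import Counter

MIN_LEN, MAX_LEN, MIN_COUNT = 3, 100, 5

def filter_by_length(segments):
    """Keep segments whose length in subword units lies in [MIN_LEN, MAX_LEN]."""
    return [s for s in segments if MIN_LEN <= len(s) <= MAX_LEN]

def build_vocab(segments, min_count=MIN_COUNT):
    """Keep subword units occurring at least min_count times in the corpus."""
    counts = Counter(unit for seg in segments for unit in seg)
    return {unit for unit, c in counts.items() if c >= min_count}

def map_unknowns(segment, vocab):
    """Map out-of-vocabulary units to the unknown token."""
    return [u if u in vocab else "<unk>" for u in segment]
```

In a real pipeline, `filter_by_length` would be applied to the SPM-tokenized training corpus before `build_vocab` is computed over it.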
3.3 SPM Model Study
In this section we describe the three different SPM models created for the multilingual parent system approach.
We tried to combine the proximity of Finnish and Estonian with the large amount of data of the German\(\rightarrow \)English pair. To that end, we built a system with a universal encoder and decoder [8], using both the Finnish\(\rightarrow \)English and German\(\rightarrow \)English corpora as training data. One advantage of the universal approach is the ability to add one or more languages to the system without changing the architecture. We can thus always train strictly comparable Estonian\(\rightarrow \)English children, while the parent is now a multilingual system. [10] showed that jointly training multiple language pairs with a universal architecture has a positive impact on translation results. We want to verify whether this is also the case for transfer learning.
We designed 3 different SPM data distributions to train separate SPM models for the source side of our universal systems.
For the first one, the model was trained on German data only, resulting in subword units that are specific to this language but are also used for Finnish and Estonian. As a result, the vocabulary contains many short subword units covering the Finnish and Estonian text: Finnish and Estonian words end up heavily split, which might complicate the subsequent modelling. This model is referred to as the 10-0-0 SPM. The second SPM model uses 5M sentences in German, 3M in Finnish and 2.5M in Estonian. This is an intermediate model, with a more balanced data distribution across languages. It is referred to as the 5-3-2.5 SPM. The last one uses 3M sentences in German, 3M in Finnish and 2.5M in Estonian. Here we force the data to be balanced despite the imbalance in available quantities. It is referred to as the 3-3-2.5 SPM.
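A hypothetical sketch of how the three SPM training mixes could be assembled (quantities in millions of sentences). Corpus handling is simulated with in-memory lists, and all names are our own; the paper does not specify its sampling procedure.

```python
import random

# The three data distributions described above, in millions of sentences.
SPM_MIXES = {
    "10-0-0":  {"de": 10.0, "fi": 0.0, "et": 0.0},
    "5-3-2.5": {"de": 5.0,  "fi": 3.0, "et": 2.5},
    "3-3-2.5": {"de": 3.0,  "fi": 3.0, "et": 2.5},
}

def sample_mix(corpora, mix, unit=1_000_000):
    """Draw the requested number of sentences per language to train one SPM model."""
    sampled = []
    for lang, millions in mix.items():
        n = min(int(millions * unit), len(corpora[lang]))  # cap at corpus size
        sampled.extend(random.sample(corpora[lang], n))    # sample w/o replacement
    return sampled
```

Each resulting sentence list would then be fed to the SPM trainer to produce one of the three subword models.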
Figure 1 describes the distributions of subword units in the vocabulary of our systems obtained with the three SPM models presented above. Figure 1 (left) shows the distribution for the 10-0-0 SPM model. As expected, this distribution is very unbalanced: the Finnish-specific and Estonian-specific units in this figure are unseen by the SPM model trained on German data. Figure 1 (center) shows that, even when greatly reducing the quantity of German data, the vocabulary remains mainly composed of German-specific units; however, we notice more shared subword units than with the 10-0-0 SPM.
It is important to keep in mind that the distribution of subword units in the vocabulary does not reflect the actual coverage in the corpus (the number of occurrences is not taken into account).
Figure 1 (right) shows a balanced distribution of subword units across the 3 languages.
In this case, every language has a set of specific units that will be learned during training of the parent and/or child NMT model. We want to verify whether this will lead to a better transfer for NMT.
Figure 2 shows the relationship between the average subword token size (in characters) and the number of tokens per segment. This graph highlights that all the ratios for German are close to each other regardless of the data distribution used to train the SPM models (they are all in the pink rectangle). It also shows that the Finnish and Estonian ratios of each SPM model are close to each other in each distribution. A particularity is observed only with the 3-3-2.5 SPM: the ratios of all three languages are close to each other (circled in green in Fig. 2), emphasizing the balance between the 3 languages.
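The two statistics underlying Fig. 2 can be computed as below from subword-split segments; the example segments are invented.

```python
def subword_stats(segments):
    """Return (average subword length in characters, average tokens per segment)."""
    tokens = [tok for seg in segments for tok in seg]
    avg_token_len = sum(len(tok) for tok in tokens) / len(tokens)
    avg_tokens_per_segment = len(tokens) / len(segments)
    return avg_token_len, avg_tokens_per_segment

# Heavily split text yields short units and many tokens per segment;
# lightly split text yields the opposite.
print(subword_stats([["ab", "cd"], ["efgh"]]))
```

A language whose words are "overly" split (as with the 10-0-0 SPM applied to Finnish or Estonian) shows up as a low average token length and a high token count per segment.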
4 Architecture
To carry out our experiments, we follow the principle of trivial transfer learning of [12]. The principle is simple: the architecture does not change between the training of the parent system and that of the child system; only the training data change.
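To make the two-stage protocol concrete, here is a deliberately tiny, hypothetical sketch: a one-parameter "model" is first fitted on abundant parent data and then fine-tuned, unchanged in structure, on scarce child data. The data and the model are toy stand-ins, not the NMT systems used in the paper.

```python
# Toy illustration of trivial transfer learning: the architecture (here a
# single scalar weight) is fixed; only the training data change between the
# parent and child stages. All data below are synthetic.

def sgd_fit(w, data, lr=0.1, epochs=50):
    """Fit y ~ w * x by stochastic gradient descent on squared error, from w."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

parent_data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]  # abundant pair, y = 2x
child_data = [(1.0, 2.2)]                           # scarce low-resource pair

w_parent = sgd_fit(0.0, parent_data)     # stage 1: train the parent
w_child = sgd_fit(w_parent, child_data)  # stage 2: continue on child data only
```

The child starts from the parent's parameters instead of a random initialization, which is the entire mechanism of the transfer.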
We use a standard end-to-end encoder/decoder architecture with an attention mechanism [1, 19]. It is composed of a bi-directional encoder and a decoder based on gated recurrent units (Bi-GRU) [4] of size 800. The embedding size is set to 400. We apply a dropout [18] of 0.3 on the embeddings, on the context before it is provided to the attention mechanism, and on the output before the softmax. Weights are initialized according to [9] and we use Adam [11] as optimizer. The initial learning rate is set to 1e−4 and the batch size to 32. This is the only configuration used for all systems presented in the next sections. They were implemented with the nmtpytorch toolkit [3].
5 Experiments
All results presented here are computed on the development dataset of the News translation task of the WMT2018 evaluation campaign.
The Estonian\(\rightarrow \)English system presented in Table 1 serves as the baseline against which our transfer learning systems are compared. The observed differences are small and negligible. Note that the numbers of source-side and target-side subword units are the same.
We can see in Table 2 that, for learning an Estonian\(\rightarrow \)English child system, the models based on subword units including German obtain worse results than those including Finnish. Since Finnish and Estonian are close languages, they likely share more subword units with each other than with German, which explains these results; the hypothesis is that they coexist better in the vocabulary. This is confirmed by the Estonian-German SPM model, whose results improve as the number of subword units increases, while for the Estonian-Finnish SPM, results decrease with 32k units compared to 16k units. It therefore seems that a larger number of subword units is more favourable for the German-Estonian system, whereas 16k units are sufficient for the Finnish-Estonian system.
The results of the standard systems in Table 3 give an idea of the performance obtained by the parent systems on their respective language pairs. We selected 5M sentences from the German\(\rightarrow \)English corpus to match the size of the Finnish\(\rightarrow \)English corpus. We can then effectively compare the two source languages of the parent systems with the same amount of training data, which allows us to evaluate the impact of language proximity. We observe that the performance of the German\(\rightarrow \)English parent system using only 5M randomly selected sentences is much lower than that of the system using all available data. One can thus expect a loss of performance when the parent system is trained on a smaller amount of data.
We had to make a compromise when defining the architecture size: we want the biggest possible architecture to effectively train the parent system, but also a reasonable size to avoid overfitting the child system afterwards. For the upcoming experiments, we chose 16k subword units, as this quantity led to the best performance for the Estonian\(\rightarrow \)English system.
5.1 Results
Table 4 presents the results of the Estonian\(\rightarrow \)English child systems trained from the different parent systems presented in Table 3.
Using all the data for our 3 systems, we obtain an improvement over our Estonian\(\rightarrow \)English baseline of 14.17 BLEU (see Table 1). Results are close, but those with the DE+ET SPM are worse, which is consistent with Table 2. The best result is obtained with the Finnish\(\rightarrow \)English parent.
We performed several experiments with different quantities of German\(\rightarrow \)English data (from 40M to 1.25M sentences) and Finnish\(\rightarrow \)English data (from 5M to 1.25M). They show that, with the same amount of data, results differ greatly between parents. This difference is explained by the proximity of the languages used to train the parent system: Finnish, which is closer to Estonian, offers a better transfer than the more distant German, confirming the results of [5]. [12] shows that the quality of the parent system is important to ensure a good transfer to a child; the low performance of the German\(\rightarrow \)English parent trained on 5M sentences explains the poor results of the child system learned from it.
We also tried our multilingual parent approach with universal encoder as described in Sect. 3.3.
We use a different SPM model from the previous ones because this time the source side contains German and Finnish from the parent system, in addition to Estonian from the child system.
The assumption is that by combining these two factors we should obtain a parent that provides a better transfer to our child systems. The results show that this is not so obvious (see Table 4): performance is worse than with German\(\rightarrow \)English or Finnish\(\rightarrow \)English as the only parent. One hypothesis is that the imbalance in the amounts of data between the two source languages of the parent is an obstacle to learning a good-quality parent.
We tried different distributions of data quantity between the two languages. In the 40M column, there are 35M German\(\rightarrow \)English sentences and 5M Finnish\(\rightarrow \)English; in the 20M column, 15M German\(\rightarrow \)English and 5M Finnish\(\rightarrow \)English; in the 10M column, 5M German\(\rightarrow \)English and 5M Finnish\(\rightarrow \)English; in the 5M column, 2.5M from each; and in the last column, 1.25M from each. This gives a better view of the impact of the quantity of German\(\rightarrow \)English data.
With our 3 different SPM models applied to our multilingual parent, we observe some interesting results.
The first of them is the 15.71 BLEU obtained when using the full data available for both parent pairs.
This result is surprising considering the other scores. This SPM model, learned only on German\(\rightarrow \)English data, reveals an interesting behaviour on non-German source data: words are “overly” split into subword units. Surprisingly, this particularity seems to provide good results in Estonian\(\rightarrow \)English. Our hypothesis is that, given the rather large architecture used to train the Estonian\(\rightarrow \)English system, the system overfits on the small subword units. The average number of subword units per line and the average subword unit size of this particular SPM model applied to our data support this. Indeed, when training this child model, the system kept improving and stopped during the 10th epoch, while most of the other child systems presented barely passed the 5th. These observations are in line with our overfitting hypothesis, but we are still investigating this result.
Overall, the results from the 5-3-2.5 SPM are the least interesting. They seem to increase slightly as we reduce the quantity of training data, which might be related to the quantity of German\(\rightarrow \)English data used for training the model.
Finally, with the more balanced SPM model (3-3-2.5), the results are quite stable, with only slight changes, and are better at every data quantity than those of the 5-3-2.5 SPM.
We see an improvement thanks to the transfer for the Estonian\(\rightarrow \)English child systems. However, we also want to apply this transfer in a setting where few resources are available. To simulate this lack of data, we kept only 200k sentences from the Estonian\(\rightarrow \)English corpus to train new children with the same parent systems.
Results of Table 5 show us that when we have few training data for the child system, the proximity of languages is the most important feature.
Finnish\(\rightarrow \)English parent system outperforms the others in this configuration.
This time, the results of our multilingual approach are as good as those of the German\(\rightarrow \)English pair, and those from the balanced SPM outperform it at all data quantities.
Compared to the previous results with all the data, the 10-0-0 SPM system no longer outperforms the others: it obtains 11.05 BLEU, which is not better than the previous results. This supports the idea that overfitting on small subword units works less well when less data is available, as in this setup. In general, the results are consistent and do not depend on the data quantity.
With the 5-3-2.5 SPM model, the results are consistent with before: they are still worse than those of the two other SPM models.
Our 3-3-2.5 SPM models now outperform the others, as well as the German\(\rightarrow \)English parent systems. Their results even come close to those of the Finnish\(\rightarrow \)English parent. We believe that the balance of subword units across the 3 source languages involved is particularly effective in this case, where few data are available for the child system.
6 Conclusion
In this paper, we showed that transfer learning for NMT depends on the quantity of available data and the proximity of the languages involved. Moreover, carefully training the subword models can lead to a better balance of languages in the vocabulary, which in turn leads to better translation results.
These parameters are therefore to be taken into account for the choice of parent systems.
Our results are in line with those obtained by [20] and [5]: the proximity of the languages used for transfer learning is more important than the data quantities. With equivalent amounts of data, parent systems using pairs of closer languages perform better, but the quality of the parent systems should not be neglected and must be taken into account when interpreting the results of the child systems. The token distribution in the vocabulary is also of great importance and has an impact on system performance.
Our universal multilingual approach ends up showing interesting results, especially in a low-resource context. We presented an analysis of the subword unit distributions and of the importance of the preprocessing steps ahead of the training process. We showed that the balance between the different languages involved in the system is highly relevant to the performance of the subsequent child systems.
In the future, we want to further investigate subword unit distributions with different examples to better explain the relation between these factors and system performance.
References
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014). http://arxiv.org/abs/1409.0473
Bojar, O., et al.: Findings of the 2018 conference on machine translation (WMT18). In: Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 272–307. Association for Computational Linguistics, Belgium, Brussels, October 2018. http://www.aclweb.org/anthology/W18-6401
Caglayan, O., García-Martínez, M., Bardet, A., Aransa, W., Bougares, F., Barrault, L.: NMTPY: a flexible toolkit for advanced neural machine translation systems. Prague Bull. Math. Linguist. 109, 15–28 (2017). https://doi.org/10.1515/pralin-2017-0035, https://ufal.mff.cuni.cz/pbml/109/art-caglayan-et-al.pdf
Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014). http://arxiv.org/abs/1406.1078
Dabre, R., Nakagawa, T., Kazawa, H.: An empirical study of language relatedness for transfer learning in neural machine translation. In: Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, pp. 282–286. The National University (Philippines) (2017). http://aclweb.org/anthology/Y17-1038
Durrani, N., Dalvi, F., Sajjad, H., Belinkov, Y., Nakov, P.: One size does not fit all: comparing NMT representations of different granularities. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1504–1516. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://www.aclweb.org/anthology/N19-1154
Gu, J., Wang, Y., Chen, Y., Li, V.O.K., Cho, K.: Meta-learning for low-resource neural machine translation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3622–3631. Association for Computational Linguistics, Brussels, Belgium, October–November 2018. https://www.aclweb.org/anthology/D18-1398
Ha, T., Niehues, J., Waibel, A.H.: Toward multilingual neural machine translation with universal encoder and decoder. CoRR abs/1611.04798 (2016). http://arxiv.org/abs/1611.04798
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. CoRR abs/1502.01852 (2015). http://arxiv.org/abs/1502.01852
Johnson, M., et al.: Google’s multilingual neural machine translation system: enabling zero-shot translation. CoRR abs/1611.04558 (2016). http://arxiv.org/abs/1611.04558
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980
Kocmi, T., Bojar, O.: Trivial transfer learning for low-resource neural machine translation. In: Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, 31 October–1 November 2018, pp. 244–252 (2018). https://aclanthology.info/papers/W18-6325/w18-6325
Koehn, P., Knowles, R.: Six challenges for neural machine translation. In: Proceedings of the First Workshop on Neural Machine Translation, pp. 28–39. Association for Computational Linguistics, Vancouver, August 2017. https://doi.org/10.18653/v1/W17-3204, https://www.aclweb.org/anthology/W17-3204
Kudo, T., Richardson, J.: Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. CoRR abs/1808.06226 (2018). http://arxiv.org/abs/1808.06226
Lakew, S.M., Erofeeva, A., Negri, M., Federico, M., Turchi, M.: Transfer learning in multilingual neural machine translation with dynamic vocabulary. In: IWSLT 2018, October 2018
Nguyen, T.Q., Chiang, D.: Transfer learning across low-resource, related languages for neural machine translation. CoRR abs/1708.09803 (2017). http://arxiv.org/abs/1708.09803
Sachan, D., Neubig, G.: Parameter sharing methods for multilingual self-attentional translation models. In: 3rd Conference on Machine Translation (WMT), Brussels, Belgium, October 2018. https://arxiv.org/abs/1809.00252
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014). http://jmlr.org/papers/v15/srivastava14a.html
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. CoRR abs/1409.3215 (2014). http://arxiv.org/abs/1409.3215
Zoph, B., Yuret, D., May, J., Knight, K.: Transfer learning for low-resource neural machine translation. CoRR abs/1604.02201 (2016). http://arxiv.org/abs/1604.02201
Acknowledgments
This work was supported by the French National Research Agency (ANR) through the CHIST-ERA M2CR project, under the contract number ANR-15-CHR2-0006-017.
© 2019 Springer Nature Switzerland AG
Bardet, A., Bougares, F., Barrault, L. (2019). A Study on Multilingual Transfer Learning in Neural Machine Translation: Finding the Balance Between Languages. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science(), vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_5