
1 Introduction

Many fields, such as education, publishing, and information security, have a strong demand for Chinese-Tibetan translation algorithms. In recent years, neural machine translation (NMT) has achieved great success thanks to the Transformer architecture [11]. However, a key drawback of this approach is that the Transformer is data-hungry in terms of both the quality and the quantity of training data. This study investigates a training method that combines a cross-lingually transferable pre-trained model with a dataset enhancement algorithm to achieve better results on the low-resource Chinese-Tibetan translation task.

Many studies focus on cross-language unified models, which may improve translation quality for low-resource languages, on the assumption that a unified cross-lingual model acquires knowledge shared between languages and therefore generalizes better to unseen data [2, 3]. Johnson et al. add an artificial token at the beginning of each sentence to indicate the required target language [4]. Zhang et al. suggest that the off-target translation issue is the main reason for unexpectedly poor zero-shot performance [12]. Xiao et al. incorporate contrastive learning in their mRASP framework, which achieves good results on other language pairs but does not cover the Tibetan language [7].

Dataset enhancement is another way to address the data-hungriness problem. One merit of such methods is that most of them are orthogonal to the model architecture, saving the time needed for model modifications. Among data augmentation techniques, back translation [1, 8] is a simple yet effective way to generate synthetic data, but it requires additional monolingual data. Nguyen et al. [5] propose an interesting alternative that generates a diverse set of synthetic data to augment the original data. This method is powerful and effective, yet it still requires multiple training rounds.

2 Prerequisite

2.1 Neural Machine Translation with mRASP

mRASP uses a standard Transformer with a 6-layer encoder and a 6-layer decoder, pre-trained jointly on 32 language pairs. In this paper, we fine-tune the pre-trained mRASP model to obtain our Chinese-Tibetan translation model. Following the notation of mRASP, we denote the Chinese-Tibetan language pair as \((L_{src},L_{tgt})\) and the corresponding parallel dataset as \(\mathcal {D}_{src,tgt}\); the fine-tuning loss is

$$\begin{aligned} \mathcal {L}^{{finetune }}= \mathbb {E}_{\left( \textbf{x}^{i}, \textbf{x}^{j}\right) \sim \mathcal {D}_{src, tgt}}\left[ -\log P_{\theta }\left( \textbf{x}^{i} \mid \textbf{x}^{j}\right) \right] . \end{aligned}$$
(1)

where \(\theta \) denotes the parameters of the pre-trained mRASP model.
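For illustration, the objective in Eq. (1) can be sketched in PyTorch as follows. This is a minimal sketch assuming a generic sequence-to-sequence model that maps source tokens and a shifted target prefix to per-token logits; it is not the actual fairseq criterion used in our experiments.

```python
import torch
import torch.nn.functional as F

def finetune_loss(model, src_tokens, tgt_tokens, pad_idx):
    """Negative log-likelihood of the target given the source, as in Eq. (1).

    `model` is assumed to map (src_tokens, tgt_input) to per-token logits of
    shape (batch, tgt_len, vocab); this interface is illustrative, not fairseq's API.
    """
    # Teacher forcing: predict token t from the tokens before t.
    tgt_input, tgt_output = tgt_tokens[:, :-1], tgt_tokens[:, 1:]
    logits = model(src_tokens, tgt_input)                  # (B, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(
        log_probs.reshape(-1, log_probs.size(-1)),
        tgt_output.reshape(-1),
        ignore_index=pad_idx,                              # skip padding positions
        reduction="mean",
    )
    return nll                                             # -log P_theta(x^i | x^j)
```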

2.2 Diversification Method

Data diversification is a simple yet effective data augmentation method. It uses the predictions of multiple forward and backward models to diversify the training data, which makes it well suited to the low-resource Chinese-Tibetan translation task. The strategy is formulated as:

$$\begin{aligned} \mathcal {D}=(S, T) \bigcup \cup _{i=1}^{k}\left( S, M_{S \rightarrow T, 1}^{i}(S)\right) \bigcup \cup _{i=1}^{k}\left( M_{T \rightarrow S, 1}^{i}(T), T\right) \end{aligned}$$
(2)

where \(M\) denotes a trained model, the superscript indexes the k forward (or backward) models, the subscript 1 indicates a single diversification round, and k is the diversification factor. Training the 2k auxiliary models from scratch is expensive; in this paper, we propose an accelerating hijack method to reduce this training burden significantly.
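As a concrete illustration, the construction of \(\mathcal {D}\) in Eq. (2) can be sketched as follows. The snippet assumes each trained model exposes a hypothetical translate() method; in practice, generation would be run through fairseq's own interface.

```python
def diversify(src_sents, tgt_sents, forward_models, backward_models):
    """Build the diversified dataset D of Eq. (2) for one round.

    forward_models:  k models translating source -> target
    backward_models: k models translating target -> source
    Each model is assumed to expose translate(list_of_sentences) -> list_of_sentences.
    """
    pairs = list(zip(src_sents, tgt_sents))            # the original (S, T)
    for fwd in forward_models:                         # (S, M_{S->T}(S))
        pairs.extend(zip(src_sents, fwd.translate(src_sents)))
    for bwd in backward_models:                        # (M_{T->S}(T), T)
        pairs.extend(zip(bwd.translate(tgt_sents), tgt_sents))
    return pairs
```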

2.3 Curvature

In this work, we use curvature as the metric of the sharpness of the validation-perplexity curve over the whole training process. Denoting the curvature by K, for a continuous curve it can be calculated as:

$$\begin{aligned} K=\frac{1}{r}=\frac{\left| f^{\prime \prime }\left( x_{0}\right) \right| }{\left( 1+\left( f^{\prime }\left( x_{0}\right) \right) ^{2}\right) ^{\frac{3}{2}}} \end{aligned}$$
(3)

However, the validation perplexity averaged within each epoch is a discrete sequence, and a direct finite-difference estimate may introduce a relatively large error. In this work, we therefore estimate the curvature of the validation-perplexity curve from the quadratic curve determined by the nearest three points [13].
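A minimal numerical sketch of this estimate is given below, assuming the validation perplexity is logged once per epoch (unit spacing on the epoch axis); the toy values are only illustrative.

```python
import numpy as np

def curvature_at(ppl, i):
    """Estimate the curvature of the validation-perplexity curve at epoch i
    by fitting a quadratic through epochs i-1, i, i+1 and applying Eq. (3)."""
    x = np.array([i - 1, i, i + 1], dtype=float)
    y = np.asarray(ppl[i - 1:i + 2], dtype=float)
    a, b, _ = np.polyfit(x, y, 2)          # exact fit: f(x) = a x^2 + b x + c
    f1 = 2.0 * a * i + b                   # f'(x_0)
    f2 = 2.0 * a                           # f''(x_0)
    return abs(f2) / (1.0 + f1 ** 2) ** 1.5

valid_ppl = [120.0, 60.0, 35.0, 25.0, 21.0, 19.5, 19.0, 18.9]   # toy values
curvatures = [curvature_at(valid_ppl, i) for i in range(1, len(valid_ppl) - 1)]
```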

3 Methodology

3.1 Overall Structure

In this work, mRASP pre-trained on 32 language pairs is utilized to provide a better starting point than a plain Transformer. The vocabulary of our 115k dataset is merged into the vocabulary provided with mRASP. The private Tibetan-Chinese parallel dataset is then used to generate an enhanced dataset. As shown in Fig. 1, the whole fine-tuning stage is divided into three parts based on the validation perplexity averaged over each epoch. We hijack the checkpoint at the key points and continue training with a reset optimizer to generate k additional checkpoints. Together with the main checkpoint, these are used to generate the enhanced dataset on which the final model is trained.

Fig. 1. The overall architecture of this work. The proposed method is divided into three stages. (1) In the pre-training stage, the multilingual pre-trained mRASP model is prepared for further fine-tuning. (2) In the dataset enhancement stage, \(m\,+\,1\) checkpoints are trained and used for inference to generate an enhanced dataset. (3) The final translation model is fine-tuned from mRASP on the enhanced dataset.
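As noted above, the vocabulary of our dataset is merged into the dictionary provided with mRASP. A minimal sketch of such a merge is shown below, assuming fairseq-style plain-text dictionaries with one token and count per line; the released mRASP tooling may handle this step differently.

```python
def merge_vocab(mrasp_dict_path, new_dict_path, out_path):
    """Append tokens from the new Chinese-Tibetan vocabulary that are missing
    from the mRASP dictionary, preserving the original token order and indices."""
    with open(mrasp_dict_path, encoding="utf-8") as f:
        base = [line.rstrip("\n") for line in f]
    seen = {line.split()[0] for line in base if line.strip()}
    with open(new_dict_path, encoding="utf-8") as f:
        extra = [line.rstrip("\n") for line in f
                 if line.strip() and line.split()[0] not in seen]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(base + extra) + "\n")
```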

3.2 Curvature Based Checkpoint Hijack

Fig. 2. The curvature change for an ideal training process. The green point denotes the first key point, which is used for re-training. The red point denotes the best perplexity, which is an ideal endpoint. (Color figure online)

Fig. 3. Actual validation perplexity (ppl) change during training. The very first epoch is omitted for better visualization without changing the shape of the curve. The curvature is visualized as black arrows.

In this paper, we argue that it is not necessary to run the entire training procedure for a large pre-trained model like mRASP. Figure 2 illustrates that the validation perplexity goes through three stages. In the fast-drop stage, the perplexity drops sharply to fit the new dataset. In the key-point stage, the perplexity gradually flattens toward its minimum. The final stage is the stable oscillation stage, where the perplexity no longer changes quickly. Instead of training the auxiliary models from scratch, the curvature is used to identify the key points. To keep the model state as far as possible from the best point, a few checkpoints before the first key checkpoint are averaged together with the key checkpoint to ensure maximum diversity (Fig. 3).

Formally, denote the number of training epochs as N. The curvatures calculated on the validation set are denoted as a sequence \(\mathscr {A}:=\left\{ k_1,k_2,\dots ,k_i,\dots ,k_N\right\} ,k_j \in \mathbb {R}\), where i denotes the index of the very first key point. By setting a curvature threshold T as a hyperparameter, the key points can be formulated as:

$$\begin{aligned} \mathscr {S} := \left\{ k_i \in \mathscr {A} \mid k_i \ge T,\; k_i \ge k_j \;\; \forall j \in \{i+1,\dots ,N\} \right\} . \end{aligned}$$
(4)

The generated parallel dataset is:

$$\begin{aligned} \mathcal {D}=(S, T) \bigcup \cup _{n=1}^{k}\left( S,\; \frac{1}{m}\sum _{j=i-m+1}^{i} M_{S \rightarrow T, j}^{n}(S)\right) \bigcup \cup _{n=1}^{k}\left( \frac{1}{m}\sum _{j=i-m+1}^{i} M_{T \rightarrow S, j}^{n}(T),\; T\right) \end{aligned}$$
(5)

where m is the number of averaged checkpoints, i is the smallest index in \(\mathscr {S}\), and the subscript j ranges over the checkpoints that are averaged.
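Putting Eqs. (4) and (5) together, the key-point selection and the checkpoint averaging used to seed the hijacked models can be sketched as follows. The snippet assumes per-epoch curvatures computed as in Sect. 2.3 and fairseq-style checkpoint files whose weights sit under a "model" key; the parameter averaging is in the spirit of fairseq's average_checkpoints script rather than a verbatim copy of it.

```python
import torch

def first_key_point(curvatures, threshold):
    """Index of the first key point: the earliest epoch whose curvature reaches
    the threshold T and is not exceeded by any later curvature (Eq. (4))."""
    n = len(curvatures)
    for i in range(n):
        if curvatures[i] >= threshold and all(
            curvatures[i] >= curvatures[j] for j in range(i + 1, n)
        ):
            return i
    return None                                   # no key point found

def average_checkpoints(paths):
    """Average the parameters of the m checkpoints around the key point
    (epochs i-m+1 .. i) to seed one hijacked model."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]   # assumes fairseq-style layout
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}
```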

4 Experiments

4.1 Dataset Description and Finetune Parameters

This paper uses the Chinese-Tibetan parallel dataset constructed by Tibet University and Qinghai Normal University. It contains high-quality parallel sentences checked and approved by professionals. Chinese sentences are segmented with Jieba, and Tibetan sentences are segmented with the perceptron- and CRF-based tool developed by Tsering et al. [10]. In the fine-tuning process, both the input and output lengths are restricted to 300. The optimizer is Adam with a learning rate of 1e−4. Label smoothing is set to 0.1 and mixed precision is used. The diversification factor k is set to 2 and the number of averaged checkpoints m is 3. We perform our experiments on RTX 3090 and A5000 GPUs with fairseq [6].
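For reference, the Chinese-side preprocessing reduces to whitespace-delimited word segmentation; a minimal sketch with Jieba is shown below. The Tibetan side uses the perceptron- and CRF-based segmenter of Tsering et al. and is not shown; file paths are placeholders.

```python
import jieba

def segment_zh(path_in, path_out):
    """Segment raw Chinese sentences into whitespace-separated words with Jieba."""
    with open(path_in, encoding="utf-8") as fin, open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            tokens = jieba.lcut(line.strip())      # default (accurate) segmentation mode
            fout.write(" ".join(tokens) + "\n")
```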

4.2 Experiment Result

Table 1. BLEU scores reported on the test set

Table 1 shows the BLEU scores on the test set. Compared with the baseline, the mRASP-based pre-trained model indeed performs better. For the training epochs, 90(47) means that we first train one entire loop and then use the epoch with the best perplexity as the stopping point, so the subsequent m training runs stop there. The hijack-enhanced dataset brings a slightly larger gain than mRASP alone. It is worth mentioning that it takes only dozens of extra epochs to fine-tune, which is much faster than the original diversification approach.

5 Conclusion

In this paper, a neural machine translation architecture is proposed for Chinese-Tibetan translation. The curvature-based checkpoint selection reduces the training time significantly. The experiments demonstrate that a multilingual pre-trained model can boost translation performance for low-resource languages. A deeper discussion of curvature in neural networks is desirable future work.