
1 Introduction

Many fields, such as education, publishing, and information security, have a strong demand for Chinese-Tibetan translation algorithms. In recent years, neural machine translation (NMT) has achieved great success thanks to the Transformer architecture [11]. However, a key drawback of this approach is that the Transformer is data-hungry in terms of both the quality and the quantity of training data. This study investigates a training method that combines a cross-lingually transferable pre-trained model with a dataset enhancement algorithm to achieve better results on the low-resource Chinese-Tibetan translation task.

Many studies focus on cross-language unified models, which may improve translation quality for low-resource languages, on the assumption that a unified cross-lingual model acquires knowledge shared between languages and therefore generalizes better to unseen data [2, 3]. Johnson et al. add an artificial token at the beginning of each sentence to indicate the required target language [4]. Zhang et al. suggest that the off-target translation issue is the main reason for unexpectedly poor zero-shot performance [12]. Xiao et al. incorporate contrastive learning in their mRASP framework, which achieves good results on other language pairs but does not cover the Tibetan language [7].

Dataset enhancement is another way to address the data-hungriness problem. One merit of such methods is that most of them are orthogonal to the model architecture, saving the time needed for model modifications. Among data augmentation techniques, back translation [1, 8] is a simple yet effective way to generate synthetic data, but it requires additional monolingual data. Nguyen et al. [5] propose an interesting alternative that generates a diverse set of synthetic data to augment the original data. This method is powerful and effective, yet it still requires multiple training rounds.

2 Prerequisite

2.1 Neural Machine Translation with mRASP

mRASP uses a standard Transformer with a 6-layer encoder and a 6-layer decoder, pre-trained jointly on 32 language pairs. In this paper, we fine-tune the pre-trained mRASP model to obtain our Chinese-Tibetan translation model. Following the notation of mRASP, we denote the Chinese-Tibetan language pair as \((L_{src},L_{tgt})\) and the corresponding parallel dataset as \(\mathcal {D}_{src,tgt}\); the fine-tuning loss is

$$\begin{aligned} \mathcal {L}^{{finetune }}= \mathbb {E}_{\left( \textbf{x}^{i}, \textbf{x}^{j}\right) \sim \mathcal {D}_{src, tgt}}\left[ -\log P_{\theta }\left( \textbf{x}^{i} \mid \textbf{x}^{j}\right) \right] . \end{aligned}$$
(1)

where \(\theta \) denotes the parameters of the pre-trained mRASP model.
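For illustration, the objective in Eq. (1) can be sketched in PyTorch as follows. This is a minimal sketch assuming a generic sequence-to-sequence model that maps source tokens and a shifted target prefix to per-token logits; it is not the actual fairseq criterion used in our experiments.

```python
import torch
import torch.nn.functional as F

def finetune_loss(model, src_tokens, tgt_tokens, pad_idx):
    """Negative log-likelihood of the target given the source, as in Eq. (1).

    `model` is assumed to map (src_tokens, tgt_input) to per-token logits of
    shape (batch, tgt_len, vocab); this interface is illustrative, not fairseq's API.
    """
    # Teacher forcing: predict token t from the tokens before t.
    tgt_input, tgt_output = tgt_tokens[:, :-1], tgt_tokens[:, 1:]
    logits = model(src_tokens, tgt_input)                  # (B, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(
        log_probs.reshape(-1, log_probs.size(-1)),
        tgt_output.reshape(-1),
        ignore_index=pad_idx,                              # skip padding positions
        reduction="mean",
    )
    return nll                                             # -log P_theta(x^i | x^j)
```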

2.2 Diversification Method

Data diversification is a simple yet effective data augmentation method. It uses the predictions of multiple forward and backward models to diversify the training data, which makes it well suited to the low-resource Chinese-Tibetan translation task. The strategy is formulated as:

$$\begin{aligned} \mathcal {D}=(S, T) \bigcup \cup _{i=1}^{k}\left( S, M_{S \rightarrow T, 1}^{i}(S)\right) \bigcup \cup _{i=1}^{k}\left( M_{T \rightarrow S, 1}^{i}(T), T\right) \end{aligned}$$
(2)

where \(M\) denotes a trained model, the superscript indexes the k forward (or backward) models, the subscript 1 indicates a single diversification round, and k is the diversification factor. Training the 2k auxiliary models from scratch is expensive; in this paper, we propose an accelerating hijack method to reduce this training burden significantly.
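As a concrete illustration, the construction of \(\mathcal {D}\) in Eq. (2) can be sketched as follows. The snippet assumes each trained model exposes a hypothetical translate() method; in practice, generation would be run through fairseq's own interface.

```python
def diversify(src_sents, tgt_sents, forward_models, backward_models):
    """Build the diversified dataset D of Eq. (2) for one round.

    forward_models:  k models translating source -> target
    backward_models: k models translating target -> source
    Each model is assumed to expose translate(list_of_sentences) -> list_of_sentences.
    """
    pairs = list(zip(src_sents, tgt_sents))            # the original (S, T)
    for fwd in forward_models:                         # (S, M_{S->T}(S))
        pairs.extend(zip(src_sents, fwd.translate(src_sents)))
    for bwd in backward_models:                        # (M_{T->S}(T), T)
        pairs.extend(zip(bwd.translate(tgt_sents), tgt_sents))
    return pairs
```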

2.3 Curvature

In this work, we use curvature as the metric of the sharpness of the validation-perplexity curve over the whole training process. Denoting the curvature by K, for a continuous curve it can be calculated as:

$$\begin{aligned} K=\frac{1}{r}=\frac{\left| f^{\prime \prime }\left( x_{0}\right) \right| }{\left( 1+\left( f^{\prime }\left( x_{0}\right) \right) ^{2}\right) ^{\frac{3}{2}}} \end{aligned}$$
(3)

However, the validation perplexity averaged within each epoch is a discrete sequence, and a direct finite-difference estimate may introduce a relatively large error. In this work, we therefore estimate the curvature of the validation-perplexity curve from the quadratic curve determined by the nearest three points [13].
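A minimal numerical sketch of this estimate is given below, assuming the validation perplexity is logged once per epoch (unit spacing on the epoch axis); the toy values are only illustrative.

```python
import numpy as np

def curvature_at(ppl, i):
    """Estimate the curvature of the validation-perplexity curve at epoch i
    by fitting a quadratic through epochs i-1, i, i+1 and applying Eq. (3)."""
    x = np.array([i - 1, i, i + 1], dtype=float)
    y = np.asarray(ppl[i - 1:i + 2], dtype=float)
    a, b, _ = np.polyfit(x, y, 2)          # exact fit: f(x) = a x^2 + b x + c
    f1 = 2.0 * a * i + b                   # f'(x_0)
    f2 = 2.0 * a                           # f''(x_0)
    return abs(f2) / (1.0 + f1 ** 2) ** 1.5

valid_ppl = [120.0, 60.0, 35.0, 25.0, 21.0, 19.5, 19.0, 18.9]   # toy values
curvatures = [curvature_at(valid_ppl, i) for i in range(1, len(valid_ppl) - 1)]
```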

3 Methodology

3.1 Overall Structure

In this work, mRASP pre-trained on 32 language pairs is utilized to provide a better starting point than a plain Transformer. The vocabulary of our 115k dataset is merged into the vocabulary provided with mRASP. The private Tibetan-Chinese parallel dataset is then used to generate an enhanced dataset. As shown in Fig. 1, the whole fine-tuning stage is divided into three parts based on the validation perplexity averaged over each epoch. We hijack the checkpoint at the key points and continue training with a reset optimizer to generate k additional checkpoints. Together with the main checkpoint, these are used to generate the enhanced dataset on which the final model is trained.

Fig. 1. The overall architecture of this work. The proposed method is divided into three stages. (1) In the pre-training stage, the multilingual pre-trained mRASP model is prepared for further fine-tuning. (2) In the dataset enhancement stage, \(m\,+\,1\) checkpoints are trained and used for inference to generate an enhanced dataset. (3) The final translation model is fine-tuned from mRASP on the enhanced dataset.
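As noted above, the vocabulary of our dataset is merged into the dictionary provided with mRASP. A minimal sketch of such a merge is shown below, assuming fairseq-style plain-text dictionaries with one token and count per line; the released mRASP tooling may handle this step differently.

```python
def merge_vocab(mrasp_dict_path, new_dict_path, out_path):
    """Append tokens from the new Chinese-Tibetan vocabulary that are missing
    from the mRASP dictionary, preserving the original token order and indices."""
    with open(mrasp_dict_path, encoding="utf-8") as f:
        base = [line.rstrip("\n") for line in f]
    seen = {line.split()[0] for line in base if line.strip()}
    with open(new_dict_path, encoding="utf-8") as f:
        extra = [line.rstrip("\n") for line in f
                 if line.strip() and line.split()[0] not in seen]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(base + extra) + "\n")
```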

3.2 Curvature Based Checkpoint Hijack

Fig. 2. The curvature change for an ideal training process. The green point denotes the first key point, which is used for re-training. The red point denotes the best perplexity, which is an ideal endpoint. (Color figure online)

Fig. 3. Actual validation perplexity (ppl) change during training. The very first epoch is omitted for better visualization without changing the shape of the curve. The curvature is visualized as black arrows.

In this paper, we argue that it is not necessary to run the entire training procedure for a large pre-trained model like mRASP. Figure 2 illustrates that the validation perplexity goes through three stages. In the fast-drop stage, the perplexity drops sharply to fit the new dataset. In the key-point stage, the perplexity gradually flattens toward its minimum. The final stage is the stable oscillation stage, where the perplexity no longer changes quickly. Instead of training the auxiliary models from scratch, the curvature is used to identify the key points. To keep the model state as far as possible from the best point, a few checkpoints before the first key checkpoint are averaged together with the key checkpoint to ensure maximum diversity (Fig. 3).

Formally, denote the number of training epochs as N. The curvatures calculated on the validation set are denoted as a sequence \(\mathscr {A}:=\left\{ k_1,k_2,\dots ,k_i,\dots ,k_N\right\} ,k_j \in \mathbb {R}\), where i denotes the index of the very first key point. By setting a curvature threshold T as a hyperparameter, the key points can be formulated as:

$$\begin{aligned} \mathscr {S} := \left\{ k_i \in \mathscr {A} \mid k_i \ge T,\; k_i \ge k_j \;\; \forall j \in \{i+1,\dots ,N\} \right\} . \end{aligned}$$
(4)

The generated parallel dataset is:

$$\begin{aligned} \mathcal {D}=(S, T) \bigcup \cup _{n=1}^{k}\left( S,\; \frac{1}{m}\sum _{j=i-m+1}^{i} M_{S \rightarrow T, j}^{n}(S)\right) \bigcup \cup _{n=1}^{k}\left( \frac{1}{m}\sum _{j=i-m+1}^{i} M_{T \rightarrow S, j}^{n}(T),\; T\right) \end{aligned}$$
(5)

where m is the number of averaged checkpoints, i is the smallest index in \(\mathscr {S}\), and the subscript j ranges over the checkpoints that are averaged.
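Putting Eqs. (4) and (5) together, the key-point selection and the checkpoint averaging used to seed the hijacked models can be sketched as follows. The snippet assumes per-epoch curvatures computed as in Sect. 2.3 and fairseq-style checkpoint files whose weights sit under a "model" key; the parameter averaging is in the spirit of fairseq's average_checkpoints script rather than a verbatim copy of it.

```python
import torch

def first_key_point(curvatures, threshold):
    """Index of the first key point: the earliest epoch whose curvature reaches
    the threshold T and is not exceeded by any later curvature (Eq. (4))."""
    n = len(curvatures)
    for i in range(n):
        if curvatures[i] >= threshold and all(
            curvatures[i] >= curvatures[j] for j in range(i + 1, n)
        ):
            return i
    return None                                   # no key point found

def average_checkpoints(paths):
    """Average the parameters of the m checkpoints around the key point
    (epochs i-m+1 .. i) to seed one hijacked model."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]   # assumes fairseq-style layout
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}
```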

4 Experiments

4.1 Dataset Description and Finetune Parameters

This paper uses the Chinese-Tibetan parallel dataset constructed by Tibet University and Qinghai Normal University. It contains high-quality parallel sentences checked and approved by professionals. Chinese sentences are segmented with Jieba, and Tibetan sentences are segmented with the perceptron- and CRF-based tool developed by Tsering et al. [10]. In the fine-tuning process, both the input and output lengths are restricted to 300. The optimizer is Adam with a learning rate of 1e−4. Label smoothing is set to 0.1 and mixed precision is used. The diversification factor k is set to 2 and the number of averaged checkpoints m is 3. We perform our experiments on RTX 3090 and A5000 GPUs with fairseq [6].
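For reference, the Chinese-side preprocessing reduces to whitespace-delimited word segmentation; a minimal sketch with Jieba is shown below. The Tibetan side uses the perceptron- and CRF-based segmenter of Tsering et al. and is not shown; file paths are placeholders.

```python
import jieba

def segment_zh(path_in, path_out):
    """Segment raw Chinese sentences into whitespace-separated words with Jieba."""
    with open(path_in, encoding="utf-8") as fin, open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            tokens = jieba.lcut(line.strip())      # default (accurate) segmentation mode
            fout.write(" ".join(tokens) + "\n")
```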

4.2 Experiment Result

Table 1. BLEU scores reported on the test set

Table 1 shows the BLEU scores on the test set. Compared with the baseline, the mRASP-based pre-trained model indeed performs better. For the training epochs, 90(47) means that we first train one entire loop and then use the epoch with the best perplexity as the stopping point, so the subsequent m training runs stop there. The hijack-enhanced dataset brings a slightly larger gain than mRASP alone. It is worth mentioning that it takes only dozens of extra epochs to fine-tune, which is much faster than the original diversification approach.

5 Conclusion

In this paper, a neural machine translation architecture is proposed for Chinese-Tibetan translation. The curvature-based checkpoint selection reduces the training time significantly. The experiments demonstrate that a multilingual pre-trained model can boost translation performance for low-resource languages. A deeper discussion of curvature in neural networks is desirable future work.