
1 Introduction

Segmentation is an essential process that has been extensively studied in the literature [3, 4, 13, 14]. It covers simple processes, such as separating punctuation from words (tokenization) or splitting words into subparts based on their frequency, as well as more sophisticated processes, such as applying morphological knowledge. In this work, we use tokenization to refer to separating punctuation and splitting tokens into words or subwords.

Tokenizing words has proven helpful for reducing the vocabulary size and increasing the number of examples of each word. It is extremely important for languages in which there is no separation between words and, therefore, a single token corresponds to more than one word. The way in which tokens are split can greatly change the meaning of the sentence. For example, the Japanese character 警 means admonish, and 察 means observe. However, together they form the word 警察 (police). Therefore, a correct tokenization can help to improve translation quality.

In this study, we aim to assess the impact of tokenization on the quality of the final translation. To do so, we experimented with five tokenizers over ten language pairs. To the best of our knowledge, this is the first work in which an exhaustive comparison between tokenizers has been carried out for NMT. We include tokenizers based on morphology, which can guide the splitting of words [17].

Previous work includes studying the effect of word-level preprocessing for Arabic on Statistical Machine Translation (SMT). A comparison of several segmenters for Chinese on SMT was carried out by Zhao et al. [24]. Huck et al. [6] compared morphological segmenters for German in NMT. Finally, Kudo [11] compared their statistical word segmenter with other well-known Japanese morphological segmenters, reaching the conclusion that statistical segmenters worked better than morphological ones.

Our main contributions are as follows:

  • The first exhaustive comparison of tokenizers for neural machine translation.

  • Experimentation with five different tokenizers over ten language pairs.

The rest of this document is structured as follows: Sect. 2 introduces the neural machine translation system used in this work. After that, in Sect. 3, we present the tokenizers applied for comparison purposes. Then, in Sect. 4, we describe the experimental framework, whose results are presented and discussed in Sect. 5. Section 6 shows some translation examples of the results. Finally, in Sect. 7, conclusions are drawn.

2 Neural Machine Translation

Given a source sentence \(x_1^J=x_1,\dots ,x_J\) of length J, NMT aims to find the best translated sentence \(\hat{y}_1^{\hat{I}}=\hat{y}_1,\dots ,\hat{y}_{\hat{I}}\) of length \(\hat{I}\):

$$\begin{aligned} \hat{y}_1^{\hat{I}} = \mathop {\mathrm {arg\,max}}\limits _{I,y_1^I} Pr(y_1^I \mid x_1^J) \end{aligned}$$
(1)

where the conditional translation probability is modelled as:

$$\begin{aligned} Pr(y_1^I \mid x_1^J) = \prod _{i=1}^{I} Pr(y_i \mid y_1^{i-1},x_1^J) \end{aligned}$$
(2)

NMT frequently relies on a Recurrent Neural Network (RNN) encoder-decoder framework. The source sentence is projected into a distributed representation at the encoding step. Then, the decoder generates, at the decoding step, its translation word by word [21].

The input of the system is a word sequence in the source language. Each word is projected linearly to a fixed-size real-valued vector through an embedding matrix. Then, these word embeddings are fed into a bidirectional [18] Long Short-Term Memory (LSTM) [5] network. As a result, a sequence of annotations is produced by concatenating the hidden states from the forward and backward layers.

An attention mechanism [1] allows the decoder to focus on different parts of the input sequence by computing a weighted mean of the annotation sequence. A soft alignment model computes these weights, scoring each annotation against the previous decoding state.
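To make the soft alignment model more concrete, the following is a minimal NumPy sketch of additive attention in the style of [1]; the function and parameter names (W_a, U_a, v_a) are illustrative and do not correspond to any particular toolkit.

```python
import numpy as np

def additive_attention(annotations, prev_state, W_a, U_a, v_a):
    """Score each encoder annotation against the previous decoder state,
    normalize the scores with a softmax, and return the weighted mean
    (context vector) together with the attention weights."""
    # annotations: (J, 2d) concatenated forward/backward encoder states
    # prev_state:  (d_dec,) previous decoder hidden state
    scores = np.tanh(prev_state @ W_a + annotations @ U_a) @ v_a   # (J,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                       # soft alignment weights
    context = weights @ annotations                                # (2d,) weighted mean
    return context, weights

# Toy usage with random parameters.
J, enc_dim, dec_dim, att_dim = 5, 8, 6, 4
rng = np.random.default_rng(0)
context, weights = additive_attention(
    rng.normal(size=(J, enc_dim)), rng.normal(size=dec_dim),
    rng.normal(size=(dec_dim, att_dim)), rng.normal(size=(enc_dim, att_dim)),
    rng.normal(size=att_dim))
```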

Another LSTM network is used for the decoder. This network is conditioned by the representation computed by the attention model and the last generated word. Finally, a distribution over the target language vocabulary is computed by the deep output layer [16].

The model is jointly trained with stochastic gradient descent to maximize the log-likelihood over a bilingual parallel corpus. At decoding time, the model approximates the most likely target sentence with beam search [21].
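As an illustration of the decoding strategy, below is a minimal, framework-free sketch of beam search over Eq. (2). Here, step_logprobs stands for a hypothetical callback returning log Pr(y_i | y_1^{i-1}, x_1^J) for every vocabulary item, which in practice would be the trained decoder.

```python
def beam_search(step_logprobs, bos, eos, beam_size=6, max_len=50):
    """Approximate the arg max of Eq. (1) by keeping the beam_size best
    partial translations and extending each with every vocabulary item."""
    beams = [([bos], 0.0)]   # (token sequence, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses that produced the end-of-sentence token are complete.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    best_seq, best_score = max(finished + beams, key=lambda c: c[1])
    return best_seq
```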

3 Tokenizers

In this section, we present the tokenizers we employed in order to assess their impact on the quality of the final translation.

  • SentencePiece: an unsupervised text tokenizer and detokenizer designed mainly for neural network-based text generation systems, where the vocabulary size is predetermined prior to training the neural model. It can be used with any language, but its models need to be trained for each of them. To do so, we used the unigram mode [12] and a vocabulary size of 32000 over each corpus's training partition. Figure 1a shows an example of tokenizing a sentence using SentencePiece; a minimal usage sketch is given after Fig. 1.

  • Mecab: an open-source morphological analysis engine for Japanese, based on conditional random fields. It extracts morphological and syntactic information from sentences and splits tokens into words. Figure 1b shows an example of tokenizing a sentence using Mecab.

  • Stanford Word Segmenter [22]: a Chinese word segmenter based on conditional random fields. Using a set of morphological and character reduplication features, it is able to split Chinese tokens into words. In this work, we use the toolkit’s CTB scheme. Figure 1c shows an example of tokenizing a sentence using Stanford Word Segmenter.

  • OpenNMT tokenizer [8]: the tokenizer included with the OpenNMT toolkit. It normalizes characters (e.g., Unicode variants of quotes) and separates punctuation from words. It can be used with any language. Figure 1d shows an example of tokenizing a sentence using OpenNMT tokenizer.

  • Moses tokenizer [10]: the tokenizer included with the Moses toolkit. It separates punctuation from words (preserving special tokens such as URLs or dates) and normalizes characters (e.g., Unicode variants of quotes). It can be used with any language. Figure 1e shows an example of tokenizing a sentence using Moses tokenizer.

Fig. 1. Examples of segmenting sentences with each word segmenter.
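For reference, the following is a minimal sketch of how two of these tokenizers can be driven from Python, assuming the sentencepiece and sacremoses packages (the experiments in this work presumably used the toolkits' own interfaces); the file paths and the example sentence are placeholders.

```python
import sentencepiece as spm
from sacremoses import MosesTokenizer

# SentencePiece: train a unigram model with a 32000-token vocabulary on the
# training partition ('train.en' is a placeholder path), then segment.
spm.SentencePieceTrainer.train(
    input='train.en', model_prefix='sp_en',
    vocab_size=32000, model_type='unigram')
sp = spm.SentencePieceProcessor(model_file='sp_en.model')
print(sp.encode('The rear arm is damaged.', out_type=str))

# Moses tokenizer (sacremoses port): rule-based separation of punctuation,
# preserving special tokens such as URLs or dates.
mt = MosesTokenizer(lang='en')
print(mt.tokenize('The rear arm is damaged.', return_str=True))
```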

4 Experimental Framework

In this section, we describe the corpora, systems and metrics used to assess our proposal.

4.1 Corpora

The corpora selected for our experimental session were extracted from translation memories from the translation industry. The files are the result of professional translation tasks demanded by real clients. The general domain is technical (see Table 1 for the specific content of each language pair), which is harder for NMT than other general domains such as news. Unlike in other domains, in technical domains certain words correspond to specific terms and have a translation different from their most frequent one: e.g., rear arm usually translates into German as hinterer Arm, but in this domain it should be translated as hinterer Querlenker. In order to increase language diversity, we selected the following language pairs: Japanese–English, Russian–English, Chinese–English, German–English, and Arabic–English. Table 2 shows the corpora statistics.

Table 1. Specific domains for each language pair. Ja stands for Japanese, En for English, Ru for Russian, Zh for Chinese, De for German and Ar for Arabic.

The training dataset is composed of around three million sentences in the German–English language pair and around half a million sentences in the rest of the language pairs. Development and test datasets are composed of two thousand sentences for all the language pairs.

Table 2. Corpora statistics. Ja stands for Japanese, En for English, Ru for Russian, Zh for Chinese, De for German and Ar for Arabic. Tokens\(_\text {BPE}\) and Vocabulary\(_\text {BPE}\) are the number of tokens and size of the vocabulary after applying BPE to the corpora. K stands for thousand and M for millions.

4.2 Systems

NMT systems were trained with OpenNMT [8]. We used LSTM units, taking into account the findings in [2]. The size of the LSTM units and word embeddings was set to 1024. We used Adam [7] with a learning rate of 0.0002 [23], a beam size of 6 and a batch size of 20. We reduced the vocabulary using Byte Pair Encoding (BPE) [19], training the models with a joint vocabulary of 32000 BPE units. Finally, the corpora were lowercased and, later, recased using OpenNMT’s tools.
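As an illustration of the BPE step, the sketch below assumes the subword-nmt package; 'train.joint' and 'bpe.codes' are placeholder paths, and the codes are learned on the concatenated source and target training text to obtain the joint vocabulary.

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 32000 joint BPE merge operations on the (lowercased, tokenized)
# concatenation of the source and target training sides.
with open('train.joint', encoding='utf-8') as infile, \
     open('bpe.codes', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# Apply the learned merges to a pre-tokenized sentence.
with open('bpe.codes', encoding='utf-8') as codes:
    bpe = BPE(codes)
print(bpe.process_line('the rear arm is damaged .'))
```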

4.3 Evaluation Metrics

We made use of the following well-known metrics to assess our proposal:

  • BiLingual Evaluation Understudy (BLEU) [15]: the geometric mean of the modified n-gram precisions, multiplied by a brevity penalty to penalize short translations.

  • Translation Error Rate (TER) [20]: the number of word edit operations (insertion, substitution, deletion, and swapping) needed to transform the translation into the reference, normalized by the number of words in the reference.

Confidence intervals (\(p=0.05\)) are computed for all metrics by means of bootstrap resampling [9].
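To make the confidence interval computation explicit, below is a minimal sketch of bootstrap resampling in the spirit of [9], assuming the sacrebleu package for the BLEU computation (this work does not state which implementation was used); the same resampled indices would be reused for TER.

```python
import random
import sacrebleu

def bootstrap_bleu_ci(hyps, refs, n_samples=1000, alpha=0.05, seed=0):
    """Resample test sentences with replacement, recompute corpus BLEU on
    each sample, and return the empirical (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        idx = [rng.randrange(len(hyps)) for _ in range(len(hyps))]
        sample_hyps = [hyps[i] for i in idx]
        sample_refs = [refs[i] for i in idx]
        scores.append(sacrebleu.corpus_bleu(sample_hyps, [sample_refs]).score)
    scores.sort()
    lower = scores[int((alpha / 2) * n_samples)]
    upper = scores[int((1 - alpha / 2) * n_samples) - 1]
    return lower, upper
```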

5 Results

In this section, we present the results of the experiments conducted in order to assess the impact of the tokenizer on the translation quality. Table 3 shows the experimental results.

Table 3. Experimental results comparing the translation quality produced by using the different tokenizers. In the columns Mecab and Stanford, Moses tokenizer was used for segmenting the English part of the corpora since both Mecab and Stanford Word Segmenter only work for Japanese and Chinese respectively. Best results are denoted in bold.

For the Ja–En experiment, the best results were yielded by Moses tokenizer and Mecab. It must be taken into account that, in both cases, the English side of the corpus was segmented with Moses tokenizer; this suggests that the segmentation of the target side has a greater impact on the translation quality than that of the source side. Overall, there is a quality improvement of around 4 points in terms of BLEU and 3 points in terms of TER with respect to the tokenizer which yielded the second-best results.

For En–Ja, the best results were yielded by Mecab, representing a significant improvement (around 12 points in terms of BLEU and 15 points in terms of TER) with respect to the tokenizer which yielded the second best results. Most likely, this is due to Mecab being developed specifically to segment Japanese.

For Ru–En and En–Ru, Moses tokenizer yielded the best results (with improvements of around 2 to 4 points in terms of BLEU and 5 points in terms of TER). It is worth noting that, in both cases, SentencePiece and OpenNMT tokenizer yielded similar results.

The Chinese experiments behaved similarly to the Japanese experiments: Moses tokenizer and Stanford Word Segmenter (the specific Chinese word tokenizer, which included using Moses tokenizer for segmenting the English part of the corpus) achieved the best results when translating to English (yielding an improvement of around 7 points in terms of BLEU and 5 points in terms of TER), and Stanford Word Segmenter achieved the best results when translating to Chinese (yielding an improvement of around 8 points in terms of BLEU and 20 points in terms of TER).

For the German experiments, the best results were yielded by both OpenNMT tokenizer and Moses tokenizer, representing an improvement of around 7 to 9 points in terms of BLEU and 14 to 17 points in terms of TER. It is worth noting that, despite German–English being the largest corpus, SentencePiece (which learns how to segment from the corpus's training data) yielded the worst results. As a future study, we should evaluate the relation between corpus size and the quality yielded by SentencePiece.

Finally, Arabic behaved similarly to Russian, with Moses tokenizer yielding the best results for both Ar–En and En–Ar (representing improvements of around 2 to 4 points in terms of BLEU and 4 to 6 points in terms of TER). However, SentencePiece performed similarly to Moses tokenizer when translating to English. When translating to Arabic, both SentencePiece and OpenNMT tokenizer yielded similar results.

Overall, Moses tokenizer yielded the best results for the German, Russian and Arabic experiments. Among the specialized, morphologically oriented tokenizers, the system using Mecab obtained the best results for the Japanese experiments, and Stanford Word Segmenter for the Chinese experiments. Additionally, OpenNMT tokenizer and SentencePiece yielded the worst translation quality in all experiments. An explanation for the poor results of OpenNMT tokenizer is that it is fairly simple: it only separates punctuation symbols from words. However, this is not the case for SentencePiece. We think that training SentencePiece on a bigger dataset, so that it can better learn the segmentation, could help to improve its results. Nonetheless, as mentioned before, we have to corroborate this in future work.

Table 4. English to German translation examples comparing SentencePiece, OpenNMT tokenizer and Moses tokenizer. The first line corresponds to the source sentence in English, the second line to the German reference, and the third, fourth and fifth lines to the translations generated using SentencePiece, OpenNMT tokenizer and Moses tokenizer, respectively, to segment the corpora. Correct translation hypotheses are denoted in bold, and incorrect translations are denoted in italic.

6 Qualitative Analysis

We obtained a better performance using Moses tokenizer than OpenNMT tokenizer and SentencePiece. In order to qualitatively analyze this performance, Table 4 shows a couple of examples of translation outputs generated using SentencePiece, OpenNMT tokenizer and Moses tokenizer for segmenting the corpora.

The first example clearly shows a better performance when using Moses tokenizer rather than SentencePiece. The translation output from the system trained using Moses tokenizer for segmenting matches the reference. However, the output translations of the systems using OpenNMT tokenizer and SentencePiece are wrong. The translation from the system segmented with OpenNMT tokenizer contains many repetitions and lacks sense. Additionally, the translation from the system segmented with SentencePiece repeats some words (e.g., motor) and misses the translation of others (e.g., pilot).

The system using Moses tokenizer behaves similarly in the second example: its translation matches the reference. By contrast, the systems using SentencePiece and OpenNMT tokenizer produced wrong translations. The system using SentencePiece translated all the words from the source, but its translation is not grammatically correct. A correct translation could be kalte Zeichnung des Drahtes. Lastly, OpenNMT tokenizer's performance is the worst in this case: its system's translation ignored the word wire.

Therefore, we observed that, despite sharing the same data and model architecture, the behavior of the systems’ translation changed as a result of using a different tokenizer.

7 Conclusions

In this study, we tested different tokenizers to evaluate their impact on the quality of the final translation. We experimented using 10 language pairs and arrived at the conclusion that tokenization has a great impact on the translation quality, with gains of up to 12 points in terms of BLEU and 15 points in terms of TER.

Additionally, we observed that there was not a single best tokenizer: each one produced the best results for certain language pairs, although, in some cases, those best results overlapped with the ones yielded by other tokenizers. Moreover, we observed different behaviors depending on the translation direction. For instance, the system using SentencePiece obtained the best results for Ar–En, but not for En–Ar translation.

As future work, we would like to evaluate the relation between the size of the corpora and the quality yielded by SentencePiece, which uses each language's training corpus to learn how to segment. It would also be interesting to compare more segmentation strategies, such as splitting into characters or fixed-length n-grams. Finally, we would like to confirm whether repeating these experiments on some of the general-domain training data available for these languages yields similar effects.