1 Introduction

Under the influence of globalization, international communication has become increasingly frequent, and people encounter non-native languages more and more often in daily life and work, which makes the need for translation more urgent. Machine translation refers to the process of converting one language into another with the assistance of computer technology [1]. How to obtain higher-quality machine translation has become a key and difficult issue in current research [2].

Xiang et al. [3] proposed a method that integrates language differentiation features to improve the quality of Chinese–Vietnamese machine translation, designed a bidirectional long short-term memory model, and found through experiments on large-scale bilingual data that the proposed method effectively enhances translation quality. Lee et al. [4] proposed an attention mechanism based on reinforcement learning to address the latency problem of machine translation models in online scenarios; experiments showed that the model achieved better translation quality with comparable latency relative to other models. For machine translation of the Dayak language, Khaikal et al. [5] extracted data from web pages to build a corpus based on statistical methods and found that the highest accuracy reached 49.15%, about 3% higher than that of the other machine translation systems. Pandey et al. [6] designed a Hindi-to-Chhattisgarhi translation system based on the open-source toolkit Moses and found, through testing on 1,000 sentences, that the system reached an accuracy of 75%.

English and Chinese are widely used in many real-life situations. With the growth of cross-cultural communication, the need for translation between English and Chinese keeps increasing, which places higher demands on current English–Chinese machine translation. The growing number of out-of-vocabulary (OOV) words degrades translation quality and hinders effective communication. Therefore, to further improve the quality of English–Chinese machine translation, this paper designs a method based on word vector similarity that replaces OOV words with similar in-vocabulary words, and experimental analysis verifies its effectiveness. This study not only provides a higher-quality method for English–Chinese machine translation but also offers new insights for further research on Transformer models in machine translation. It demonstrates the impact of OOV word processing methods on translation quality and provides a reference for OOV word handling in machine translation research on other languages, which is beneficial for further enhancing machine translation quality and promoting its practical application.

2 Translation model of English–Chinese machine translation

The Transformer model has been applied successfully to many languages [7]. It consists of six encoder layers and six decoder layers [8]. In the encoder, the input text is first converted into input vectors, to which position encodings are added before the result is passed into the encoder layers.

The relationship between each word and other words in the source sentence is learned through a multi-head self-attention layer. The formula is written as:

$$\text{MultiHead}\left(Q,K,V\right)=\text{concat}\left({\text{head}}_{1},{\text{head}}_{2},\cdots ,{\text{head}}_{h}\right){W}^{O},$$
(1)
$${\text{head}}_{i}=\text{Attention}\left(Q{W}_{i}^{Q},K{W}_{i}^{K},V{W}_{i}^{V}\right),$$
(2)

where \(Q,K,V\) denote the query, key, and value matrices, \({W}_{i}^{Q},{W}_{i}^{K},{W}_{i}^{V}\in {\mathbb{R}}^{{d}_{\text{model}}\times {d}_{k}}\) are the per-head projection matrices, and \({W}^{O}\in {\mathbb{R}}^{h{d}_{k}\times {d}_{\text{model}}}\) is the output projection matrix that maps the concatenated heads back to the model dimension.
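For concreteness, the following is a minimal PyTorch sketch of Eqs. (1) and (2). The scaling factor \(\sqrt{{d}_{k}}\) inside each head, which the standard Transformer uses but which is left implicit in the equations above, is written out here, and all weight matrices are assumed to be given as plain tensors rather than as the model's actual parameter objects:

```python
import torch
import torch.nn.functional as F

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Sketch of Eqs. (1)-(2). Q, K, V: (seq_len, d_model) tensors;
    W_q, W_k, W_v: lists of h per-head projections of shape (d_model, d_k);
    W_o: (h*d_k, d_model) output projection."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i                # per-head projections
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # scaled dot products
        heads.append(F.softmax(scores, dim=-1) @ v)           # head_i = Attention(...)
    return torch.cat(heads, dim=-1) @ W_o                     # concat(head_1..head_h) W^O
```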

Then, the output of the multi-head self-attention layer is nonlinearly mapped through a feedforward neural network (FFN), and the formula is written as:

$$\text{FFN}\left(x\right)=\text{max}\left(0,x{w}_{1}+{b}_{1}\right){w}_{2}+{b}_{2},$$
(3)

where \({w}_{1}\) and \({w}_{2}\) are the weight matrices of the first and second layers of the FFN, and \({b}_{1}\) and \({b}_{2}\) are the corresponding bias terms.
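Equation (3) translates almost directly into code; the small sketch below assumes the weights and biases are given as plain tensors:

```python
import torch

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network of Eq. (3):
    FFN(x) = max(0, x w1 + b1) w2 + b2."""
    return torch.relu(x @ w1 + b1) @ w2 + b2
```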

In the decoder, since the tokens that have not yet been generated must not be visible during decoding, a masking mechanism is used to hide them. Finally, after all decoder layers have been executed, the target translation of the source sentence is obtained according to the conditional probability:

$$P\left(Y|X\right)=\prod_{i=1}^{\left|Y\right|}P\left({y}_{i}|{Y}_{<i},X\right),$$
(4)

where \(X\) refers to the input sentence and \(Y\) is the target translation generated by the model. At each time step \(i\), beam search is used for decoding [9], keeping the \(K\) best candidates according to the accumulated log-probability:

$$\text{log}P\left(Y|X\right)=\sum_{i=1}^{\left|Y\right|}\text{log}P\left({y}_{i}|{Y}_{<i},X\right).$$
(5)
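A minimal beam-search sketch of the decoding rule in Eq. (5) follows. The callable `step_log_probs`, assumed here for illustration, returns the vector of \(\text{log}P\left({y}_{i}|{Y}_{<i},X\right)\) for a given prefix, and `bos` and `eos` are the assumed start and end token ids:

```python
import torch

def beam_search(step_log_probs, bos, eos, k, max_len):
    """Keep the k partial translations with the highest summed
    log-probability at every time step, per Eq. (5)."""
    beams = [([bos], 0.0)]                       # (token sequence, log-prob score)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                   # finished hypotheses are carried over
                candidates.append((seq, score))
                continue
            log_p = step_log_probs(seq)          # (vocab,) tensor of log P(y_i | Y_<i, X)
            top_p, top_ids = log_p.topk(k)
            for p, y in zip(top_p.tolist(), top_ids.tolist()):
                candidates.append((seq + [y], score + p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]                           # best-scoring translation
```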

3 OOV word processing based on word vector similarity

In actual English–Chinese machine translation, all OOV words are represented by the same symbol <unk> [10], which can easily lead to ambiguity. To address this problem, in the pre-processing stage, this paper first replaces OOV words in the corpus based on word vector similarity and then feeds the replaced sentences into the model for translation, so as to improve translation quality.

Distributed representation uses a fixed-dimensional vector to represent each word, which avoids the excessive dimensionality of one-hot encoding and also reflects semantic correlations between words; it performs well in natural language processing. In this paper, this method is used to find similar words for OOV words, and the Skip-gram model in word to vector (word2vec) is used as the training tool [11].

The Skip-gram model one-hot encodes the input and output words and then trains the word vectors. For the input sequence \({w}_{1},{w}_{2},\cdots ,{w}_{T}\), the training goal is to maximize:

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{j\in nb\left(t\right)}\text{log}\,p\left({w}_{j}|{w}_{t}\right),$$
(6)

where \(nb\left(t\right)\) refers to the set of context words of \({w}_{t}\) and \(p\left({w}_{j}|{w}_{t}\right)\) is the conditional probability.
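As an illustration, the objective of Eq. (6) can be trained with gensim's Word2Vec implementation (gensim 4.x API). The toy corpus and the `window` and `min_count` settings below are assumptions for the sketch; the 300-dimensional vector size matches the setup in Section 4.1:

```python
from gensim.models import Word2Vec

# corpus: an iterable of tokenised sentences (a toy example here;
# the paper trains on a 2.1-million-pair parallel corpus)
corpus = [["the", "ambassadors", "expressed", "satisfaction"],
          ["the", "envoys", "expressed", "satisfaction"]]

model = Word2Vec(
    corpus,
    vector_size=300,  # 300-dimensional vectors, as in Section 4.1
    sg=1,             # sg=1 selects the Skip-gram objective of Eq. (6)
    window=5,         # context window nb(t); the size is an assumption
    min_count=1,      # kept at 1 only so the toy corpus trains
)
```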

After obtaining the vector representations of all the words in the corpus, similarities can be calculated from the word vectors. The similarity between OOV word vector \(w\) and a candidate similar word vector \(w{\prime}\) is calculated using cosine similarity. The formulas are:

$$\text{sim}\left(w,w{\prime}\right)=\text{cos}\left(\text{vec}\left(w\right),\text{vec}\left(w{\prime}\right)\right),$$
(7)
$${w}^{*}=\underset{\mathit{w{\prime}}\in \mathit{IV}}{\text{arg max}}\text{sim}\left(w,w{\prime}\right),$$
(8)

where \(IV\) refers to the in-vocabulary (common) word list. The cosine similarity can be calculated with the distance tool in word2vec.
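Equations (7) and (8) amount to a nearest-neighbour search over the in-vocabulary words. A minimal NumPy sketch follows, with `iv_vectors` as an assumed word-to-vector mapping; in practice, gensim's `most_similar` method performs the equivalent search directly:

```python
import numpy as np

def cosine_sim(v, w):
    """Eq. (7): cosine similarity between two word vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def best_in_vocab(oov_vec, iv_vectors):
    """Eq. (8): the in-vocabulary word w* whose vector is closest
    to the OOV word's vector. iv_vectors: dict word -> np.ndarray."""
    return max(iv_vectors, key=lambda w: cosine_sim(oov_vec, iv_vectors[w]))
```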

To further preserve the meaning of the source sentence, the candidate words are screened again using an n-gram language model [12]. Each candidate sentence \(s\), obtained by substituting candidate word \(w{\prime}\) for OOV word \({w}_{i}\), is scored as:

$$P\left(s\right)={\prod }_{i=1}^{k}P\left({w}_{i}|{w}_{i-n+1},\cdots ,{w}_{i-1}\right).$$
(9)

In this paper, a bigram language model is used, i.e., \(n=2\): the appearance of a word depends only on the previous word. The score of each candidate sentence is calculated, and the similar high-frequency word with the highest score is used to replace the OOV word. The steps are as follows (a minimal end-to-end sketch is given after the list):

1. Train word vectors and find the high-frequency words that are similar to each OOV word.

2. Score the alternative candidate words using the n-gram model.

3. Replace each OOV word with the high-frequency word that achieves the highest score.

4. Use the replaced sentence as the source sentence and translate it with the trained Transformer model.
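The sketch below assembles steps (1) to (4) under stated assumptions: `w2v` is a trained gensim `KeyedVectors` object, `bigram_prob` is an assumed bigram estimator trained on the corpus, and `translate` is a stand-in for the trained Transformer model:

```python
import math

def bigram_score(tokens, bigram_prob):
    """Eq. (9) with n = 2, computed in log-space to avoid underflow.
    bigram_prob(prev, cur) is an assumed estimator of P(w_i | w_{i-1})."""
    return sum(math.log(bigram_prob(p, c)) for p, c in zip(tokens, tokens[1:]))

def replace_oov(tokens, vocab, w2v, bigram_prob, top_n=10):
    """Steps (1)-(3): replace each OOV token that has a trained vector with
    the in-vocabulary neighbour giving the best bigram score; low-frequency
    OOV tokens without a vector are retained as <unk> (see Section 4.2)."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in vocab:
            continue
        if tok not in w2v.key_to_index:          # no word vector available
            out[i] = "<unk>"
            continue
        cands = [w for w, _ in w2v.most_similar(tok, topn=top_n) if w in vocab]
        if cands:
            out[i] = max(cands, key=lambda c: bigram_score(
                out[:i] + [c] + out[i + 1:], bigram_prob))
    return out

# Step (4): the replaced sentence is then fed to the trained Transformer,
# e.g. translation = translate(replace_oov(tokens, vocab, w2v, bigram_prob))
```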

4 Results and analysis

4.1 Experimental setup

The experiment was carried out in a Linux environment (Ubuntu 18.04 LTS). The programming language was Python 3.6, and the Transformer model was implemented with the PyTorch framework. The experimental datasets came from LDC (Table 1).

Table 1 Experimental dataset

The Skip-gram model in the word2vec tool was used to train 300-dimensional word vectors, and the training set included 2.1 million English–Chinese parallel sentence pairs. The forward and backward translation dictionaries were obtained with the GIZA++ tool [13]. Chinese was segmented using jieba [14], and English was segmented using byte pair encoding (BPE) [15]. For the parameter settings of the Transformer model, the word embedding dimension was set to 768, the FFN dimension to 2,048, and the Adam optimizer was employed with \({\beta }_{1}={\beta }_{2}=0.5\), a learning rate of 0.5, and a dropout rate of 0.1. Case-insensitive 4-gram BLEU (BLEU-4) [16] was used to assess translation quality.
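For illustration, the segmentation pipeline of this section can be reproduced roughly as follows; the BPE merge count shown is an assumption, as it is not reported above:

```python
import jieba

# Chinese side: jieba word segmentation
tokens = jieba.lcut("所有三位大使都对菲律宾政府加强安全措施表示满意")
print(tokens)

# English side: BPE with the subword-nmt tool (run from the shell);
# the 32k merge operations are an assumed setting:
#   subword-nmt learn-bpe -s 32000 < train.en > bpe.codes
#   subword-nmt apply-bpe -c bpe.codes < train.en > train.bpe.en
```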

4.2 Result analysis

The Transformer model was employed to perform English–Chinese machine translation. In the training corpus, OOV words were replaced based on word vector similarity. For low-frequency OOV words without a word vector representation, two processing methods can be adopted: retention (representation by <unk>) or removal (direct deletion). To evaluate the English–Chinese machine translation method based on word vector similarity, the following comparative experiments were conducted:

1. Experiment 1: the original corpus without OOV word processing.

2. Experiment 2: the corpus processed based on word vector similarity, with low-frequency OOV words retained.

3. Experiment 3: the corpus processed based on word vector similarity, with low-frequency OOV words removed.

The BLEU-4 results obtained on the two test sets are presented in Table 2.

Table 2 Translation results of English–Chinese machine translation based on word vector similarity

As can be seen from Table 2, when using the original corpus, the Transformer model achieved a BLEU-4 of 37.29 on NIST2006 and 30.73 on NIST2008. In experiment 2, OOV words in the corpus were processed based on word vector similarity and low-frequency OOV words were retained. As a result, the BLEU-4 of the Transformer model on NIST2006 was 37.36, an increase of 0.07 over experiment 1, and the BLEU-4 on NIST2008 was 30.78, an increase of 0.05 over experiment 1. These results demonstrated the role of word vector similarity processing in improving the quality of English–Chinese machine translation.

In experiment 3, in contrast to experiment 2, low-frequency OOV words were removed from the corpus. The BLEU-4 of the Transformer model on NIST2006 was 37.12, a reduction of 0.17 compared with experiment 1 and 0.24 compared with experiment 2; the BLEU-4 on NIST2008 was 30.33, a reduction of 0.40 compared with experiment 1 and 0.45 compared with experiment 2. These results showed that for low-frequency OOV words, retention was clearly more effective than removal. Directly removing low-frequency OOV words may damage the sentence structure, resulting in a significant decline in translation quality, whereas retaining them as <unk> preserves the integrity of the sentence structure despite some semantic loss. According to the experimental results, in addition to processing based on word vector similarity, low-frequency OOV words in English–Chinese machine translation should be retained, so as to avoid a substantial decline in translation quality.

To further verify the performance of the proposed method, it was compared with several other machine translation models, including:

  • the phrase-based statistical machine translation [17],

  • the deep neural network-based statistical machine translation [18],

  • the bidirectional long short-term memory-based neural machine translation [19],

  • the attention-based neural machine translation [20].

The results are presented in Table 3.

Table 3 Results of comparisons with the other machine translation models

From Table 3, it can be observed that both statistical machine translation methods had relatively low BLEU-4 scores on NIST2006 and NIST2008. Compared to the statistical machine translation models, the neural machine translation methods performed better and achieved higher BLEU-4 scores. However, the overall comparison showed that the method proposed in this paper had further advantages: it improved the BLEU-4 by 1.72, 1.52, 0.59, and 0.24 on NIST2006 compared to the other methods, respectively, and by 2.16, 1.81, 1.23, and 0.54 on NIST2008. These results demonstrated the effectiveness of the proposed method in enhancing translation quality.

Two translation examples are analyzed below.

Example 1: Source sentence: 奥勃说, 所有三位大使都对菲律宾政府加强安全措施表示满意。

Reference (manual translation): Ople said that all the three ambassadors expressed their satisfaction with the strengthened measures taken by the Philippines government.

Transformer model: He said all three ambassadors expressed satisfaction with the Philippines government's increased security measures.

Transformer model based on word vector similarity: According to <unk>, all three ambassadors were satisfied with strengthened security measures taken by the Philippines government.

Example 2: Source sentence: 仲介商今天说, 总部设在香港的地产集团 “华人置业” (Chinese Estates) 以2亿8000万英镑, 在伦敦买下高盛 (Goldman Sachs) 投资银行欧洲总部所在的大楼。

Reference (manual translation): An intermediary said today that Hong Kong-based property group “Chinese Estates” has bought a building in London which houses the European headquarters of investment bank Goldman Sachs for 280 million British pounds.

Transformer model: Chinese Estates, the Hong Kong-based property group, has bought the European headquarters of Goldman Sachs’ investment bank in London for £280 m, agents said today.

Transformer model based on word vector similarity: The intermediary said today that Hong Kong-based property group “Chinese Estates” has bought a building in London which houses the European headquarters of investment bank Goldman Sachs for 280 million British pounds.

In example 1, "奥勃" in the source sentence is a person's name and a low-frequency OOV word for which no similar replacement word could be found. The Transformer model trained on the original corpus translated it as "He", while the Transformer model based on word vector similarity replaced it with <unk>, maintaining the integrity of the sentence structure.

In example 2, the OOV word "仲介商" (intermediary) in the source sentence was replaced by the in-vocabulary word "中介商". The resulting translation was highly similar to the reference manual translation, whereas the model trained on the original corpus translated the word as "agents", which deviates somewhat from the actual semantics.

The analysis of the translation examples shows that the Transformer model based on word vector similarity obtains better translation results in English–Chinese machine translation and can be applied in practice.

5 Conclusion

Aiming at the OOV problem in English–Chinese machine translation, this paper proposed a processing method based on word vector similarity and conducted an experimental analysis with the Transformer model. The results showed that OOV word processing based on word vector similarity improved the translation quality of the Transformer model to some extent. In addition, for low-frequency OOV words, retention was more effective than removal. The output of the proposed method was closer to manual translation; therefore, the proposed method can be further applied in practice. However, this study has some limitations. For example, only one method, word2vec, was considered for training word vectors, and no novel improvements were made to the design of the Transformer model for machine translation. Future work will therefore explore more advanced word vector techniques and consider optimizing the Transformer model to further enhance the quality of English–Chinese machine translation. Furthermore, the applicability of the proposed approach to machine translation for other languages will also be assessed.