1 Introduction

Distributional Semantic Models (DSMs) are consolidating themselves as fundamental components for supporting automatic semantic interpretation in different application scenarios in natural language processing. From question answering systems to semantic search and textual entailment, distributional semantic models support a scalable approach for representing the meaning of words, which can automatically capture comprehensive associative commonsense information by analysing word-context patterns in large-scale corpora in an unsupervised or semi-supervised fashion [8, 18, 19].

However, distributional semantic models are strongly dependent on the size and the quality of the reference corpora, which embed the commonsense knowledge necessary to build comprehensive models. While high-quality texts containing large-scale commonsense information, such as Wikipedia, are available in English, other languages may lack sufficient textual support to build distributional models.

To address this problem, this paper investigates how different distributional semantic models built from corpora in different languages and with different sizes perform in semantic similarity and relatedness tasks. Additionally, we analyse the role of machine translation approaches in supporting the construction of better distributional vectors and in computing semantic similarity and relatedness measures for other languages. In other words, when there is not enough information to create a DSM for a particular language, this work evaluates whether the benefit of the larger corpus volume available for English outweighs the error introduced by machine translation.

Given a pair of words and a human judgement score that represents the semantic relatedness of these two words, the evaluation method aims at indicating how close distributional models score to humans. Three widely used word-pairs datasets are employed in this work: Miller and Charles (MC) [14], Rubenstein and Goodenough (RG) [17] and WordSimilarity 353 (WS-353) [7].
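The evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a model scores a word pair by the cosine of the two word vectors and that pairs with a missing word are skipped, neither of which is stated in the text. The names `cosine`, `average_ranks`, `spearman` and `evaluate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, gold):
    """Correlate model scores with human scores.

    vectors: dict mapping a word to its vector.
    gold: list of ((word1, word2), human_score) entries.
    Assumption: pairs with an out-of-vocabulary word are skipped.
    """
    model_scores, human_scores = [], []
    for (w1, w2), human in gold:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(human)
    return spearman(model_scores, human_scores)
```

A value of `evaluate(...)` close to 1 means that the model orders the word pairs almost exactly as the human annotators do; only the ordering matters, not the scale of the scores.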

In the proposed model, the word-pairs datasets are translated into English as a reference language, and the distributional vectors are defined over the English model (Fig. 1). Although the method based on machine translation is simple, it is highly relevant for the distributional semantics user/practitioner because it is easy to use and yields a significant improvement in the results.

Fig. 1. Depiction of the experimental setup.
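The machine-translation-mediated setup can be sketched as below. This is an illustration under stated assumptions: `translate_pair` is a hypothetical callable standing in for a machine translation service (here a toy dictionary lookup), the English DSM is a plain word-to-vector dictionary, and cosine similarity is assumed as the relatedness score. None of these names come from the original work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mt_relatedness(pair, translate_pair, english_vectors):
    """Score a source-language word pair through an English DSM.

    pair: (word1, word2) in the source language.
    translate_pair: hypothetical callable returning the English pair.
    english_vectors: dict mapping an English word to its vector.
    Returns None when a translated word is missing from the model.
    """
    en1, en2 = translate_pair(pair)
    if en1 not in english_vectors or en2 not in english_vectors:
        return None
    return cosine(english_vectors[en1], english_vectors[en2])


# Toy stand-ins, for illustration only (not real translations or vectors).
TOY_TRANSLATIONS = {("macchina", "automobile"): ("car", "automobile")}
TOY_ENGLISH_VECTORS = {"car": [1.0, 0.0], "automobile": [0.9, 0.1]}


def toy_translate_pair(pair):
    """Dictionary lookup standing in for a machine translation call."""
    return TOY_TRANSLATIONS[pair]
```

Translating the two words together, rather than one at a time, mirrors the description of word-pair translation in this paper, where each word can act as context for the other.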

This work presents a systematic study involving 11 languages and four distributional semantic models (DSMs), providing a comparative quantitative analysis of the performance of the distributional models and the impact of machine translation approaches for different models.

In summary, this paper answers the following research questions:

  1. Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?

  2. Which DSMs and languages benefit more and less from the translation?

  3. What is the quality of state-of-the-art machine translation approaches for word pairs (for each language)?

Moreover, this paper contributes two resources which can be used by the community to evaluate multi-lingual semantic similarity and relatedness models: (i) a high-quality manual translation of the three word-pairs datasets - Miller and Charles (MC) [14], Rubenstein and Goodenough (RG) [17] and WordSimilarity 353 (WS-353) [7] - for 10 languages and (ii) the 44 pre-computed distributional models (four distributional models for each one of the 11 languages), which can be accessed as a service, together with the multi-lingual approaches mediated by machine translation.

This paper is organised as follows: Sect. 2 describes the related work; Sect. 3 describes the experimental setting; Sect. 4 analyses the results and provides the comparative analysis across different models and languages. Finally, Sect. 5 provides the conclusion.

2 Related Work

Most related work has concentrated on leveraging joint multilingual information to improve the performance of the models.

Faruqui and Dyer [6] use the distributional invariance across languages and propose a technique based on canonical correlation analysis (CCA) for merging multilingual evidence into vectors generated monolingually. They evaluate the resulting word representations on semantic similarity/relatedness evaluation tasks, showing the improvement of multi-lingual over the monolingual scenario.

Utt and Pado [20] develop methods that take advantage of the availability of annotated corpora in English, using a translation-based approach to transport the word-link-word co-occurrences to support the creation of syntax-based DSMs.

Navigli and Ponzetto [15] propose an approach to compute semantic relatedness exploiting the joint contribution of different languages mediated by lexical and semantic knowledge bases. The proposed model uses a graph-based approach of joint multi-lingual disambiguated senses which outperforms the monolingual scenario and achieves competitive results for both resource-rich and resource-poor languages.

Zou et al. [21] describe an unsupervised semantic embedding (bilingual embedding) for words across two languages that represents the semantic information of monolingual words as well as semantic relationships across different languages. Their work was motivated by the fact that it is hard to identify semantic similarities across languages, especially when co-occurring words are rare in the parallel training text. Al-Rfou et al. [1] produced multilingual word embeddings for about 100 languages using Wikipedia as the reference corpus.

In contrast, this work provides a comparative analysis of existing state-of-the-art distributional semantic models for different languages and analyses the impact of machine translation over an English DSM.

3 Experimental Setup

The experimental setup consists of the instantiation of four distributional semantic models (Explicit Semantic Analysis (ESA) [9], Latent Semantic Analysis (LSA) [12], Word2Vec (W2V) [13] and Global Vectors (GloVe) [16]) in 11 different languages - English, German, French, Italian, Spanish, Portuguese, Dutch, Russian, Swedish, Arabic and Farsi.

The DSMs were generated from Wikipedia dumps (January 2015), which were preprocessed by lowercasing, stemming and removing stopwords. For LSA and ESA, the models were generated using the SSpace Package [11], while W2V and GloVe were generated using the code shared by the respective authors. For the experiment, the vector dimensions for LSA, W2V and GloVe were set to 300, while ESA was defined with 1500 dimensions. The difference in size occurs because ESA is composed of sparse vectors. All models were generated with the default parameters defined in each implementation.
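The preprocessing steps named above (lowercasing, stemming and stopword removal) can be illustrated with the following sketch. The stopword list and the suffix-stripping stemmer are toy stand-ins chosen for this example only; the text does not state which stemmer or stopword lists were used for each language, and the order of the last two steps here is a design choice of the sketch.

```python
import re

# Toy resources, for illustration only. A real pipeline would use a
# language-specific stopword list and a proper stemmer.
TOY_STOPWORDS = {"the", "a", "an", "of", "and", "is", "were"}
TOY_SUFFIXES = ("ing", "ed", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenise, drop stopwords and stem the remaining tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(token) for token in tokens if token not in TOY_STOPWORDS]
```

Stopwords are removed before stemming here so that the stopword list can contain ordinary surface forms; a pipeline that stems first would need a stemmed stopword list instead.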

Each distributional model was evaluated for the task of computing semantic similarity and relatedness measures using three human-annotated gold standard datasets: Miller and Charles (MC) [14], Rubenstein and Goodenough (RG) [17] and WordSimilarity 353 (WS-353) [7]. As these word-pairs datasets were originally in English, the word pairs were translated and reviewed with the help of professional translators skilled in data localisation tasks, except for those languages already available in previous works [4, 5]. The datasets are available at http://rebrand.ly/multilingual-pairs.

Two automatic machine translation approaches were evaluated: the Google Translate Service and the Microsoft Bing Translation Service. As Google Translate Service performed 16 % better for overall word-pairs translations, this was set as the main machine translation model.

The DInfra platform [2] provided the DSMs used in the work. To support experimental reproducibility, both experimental data and software are available at http://rebrand.ly/dinfra.

4 Evaluation and Results

4.1 Spearman Correlation and Corpus Size

Table 1 shows the correlation between the average Spearman correlation values for each DSM and two indicators of corpus size: # of tokens and # of unique tokens.

ESA is consistently more robust (on average) than the other models in relation to the corpus size, due to the fact that ESA has a larger context window than the other distributional models. While ESA considers the whole document as its context window, the other models are restricted to five (LSA) and ten (Word2Vec and GloVe) words.

Another observation is that the evaluation of the WS-353 dataset is more dependent on the corpus size, which can be explained by the broader number of semantic relations expressed under the semantic relatedness umbrella.

Table 2 shows the size of each corpus in different languages regarding the number of unique tokens and the number of tokens.

Table 1. Correlation between corpus size and different models.
Table 2. The sizes of the corpora in terms of the number of unique tokens and tokens (scale of \(10^6\)).

4.2 Word-Pair Machine Translation Quality

The second step evaluates the accuracy of state-of-the-art machine translation approaches for word pairs (Table 3). The translation accuracy for the WS-353 word pairs is significantly higher than for the other datasets. This shows that the higher semantic distance between word pairs (semantic relatedness) has the benefit of increasing the contextual information during the machine translation process, subsequently improving the mutual disambiguation process.

Table 3. Translation accuracy.

For WS-353 the set of best-performing translations has an average accuracy of 80 % (with maximum 85 % and minimum 76 %). This value dropped significantly for Arabic and Farsi (average 50 %).

For MC and RG, the average translation accuracy for the semantic similarity pairs is 51.5 %. This difference may be a result of a deficit of contextual information during the machine translation process. For these word-pairs datasets, the difference between the best and the worst translation performers (across languages) is smaller. Additionally, the final translation accuracy for all languages and all word-pairs datasets is 59 %. French, Dutch and Spanish are the languages with the best automatic translations.
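The accuracy figures above could be computed along the following lines. This is a sketch under an assumption that is not stated in the text: a machine-translated word pair counts as correct only when both of its words match the original English pair after lowercasing and trimming whitespace. The function name `translation_accuracy` is hypothetical.

```python
def translation_accuracy(machine_pairs, reference_pairs):
    """Percentage of machine-translated pairs that match the reference.

    machine_pairs: list of (word1, word2) produced by machine translation.
    reference_pairs: list of (word1, word2) from the original English dataset,
        aligned by position with machine_pairs.
    Assumption: a pair is correct only if both words match exactly
    after lowercasing and stripping surrounding whitespace.
    """
    if len(machine_pairs) != len(reference_pairs):
        raise ValueError("both lists must have the same length")
    if not reference_pairs:
        raise ValueError("at least one word pair is required")

    def normalise(pair):
        return tuple(word.strip().lower() for word in pair)

    correct = sum(
        1
        for machine, reference in zip(machine_pairs, reference_pairs)
        if normalise(machine) == normalise(reference)
    )
    return 100.0 * correct / len(reference_pairs)
```

Exact matching is strict: a valid synonym of the original English word is counted as an error, so a more lenient criterion would yield higher values.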

Table 4. Spearman correlation for the language-specific models.

4.3 Language-Specific DSMs

In the first part of the experiment, the Spearman correlations (\(\rho \)) between the human assessments and the computation of the semantic similarity and relatedness for all DSMs instantiated for all languages were evaluated (Fig. 1 (ii)). Table 4 shows the Spearman correlation for each DSM using language-specific corpora (without machine translation), for the three word-pairs datasets.

The comparative language-specific analysis indicates that English is the best-performing language (0.70), followed by German (0.61). The lowest Spearman correlation was observed in Arabic (0.35). From the tested DSMs, W2V is consistently the best-performing DSM (0.56). The language-specific DSMs achieved higher correlations for MC and RG (0.56 and 0.53, respectively), in comparison to 0.41 for WS-353.

The results for the language-specific DSMs were contrasted with the machine translation (MT) approach, according to the diagram depicted in Fig. 1 (i). The Spearman correlations for the MT-mediated approach are shown in Table 5.

Table 5. Spearman correlation for the machine translation models over the English corpora. Diff. represents the difference of machine translation score minus the language specific.

4.4 Machine Translation Based Semantic Relatedness

Using the MT models, W2V is consistently the best-performing DSM (average 0.68), while ESA is consistently the worst-performing model (0.47). We can interpret this result by stating that using machine translation for ESA does not introduce significant performance improvements in comparison to the language-specific baselines.

Table 6. Difference between the language-specific and the machine translation approach. M. AVG represents the average of the models and DS. AVG represents the average of the datasets.

The best-performing languages are French and Farsi (\(\rho = 0.63\)). The variance of the Spearman correlation across languages in the MT models is low, as the use of the English corpus has a higher positive impact on the results than the variation in the quality of the machine translation. The results for all languages achieve very similar correlation values.

The impact of the MT model can be better interpreted by examining the difference between the machine translation and the language-specific models (depicted in Table 6). LSA accounts for the largest average percent improvement (28.4 %) using the MT model, while ESA accounts for the lowest value \((-2.9\,\%)\). As previously noted, this can be explained by the sensitivity of these models to the corpus size due to the dimensional reduction strategy (LSA) or the broader context window (ESA). The remaining models accounted for substantial improvements (W2V = 21.7 %, GloVe = 19.5 %).

Arabic and French achieved the highest percent gains (47 % and 38 %, respectively), while German accounts for the worst result \((-4\,\%)\). These numbers are consistent with the corpus size. For German, the result shows that the corpus volume of the German Wikipedia crossed a threshold size (34 % of the English corpus) above which improvements for computing semantic similarity for the target word-pairs dataset might be only marginal, while the translation error weighs negatively on the final result.

The average improvement of the MT approach over the language-specific model for each word-pairs dataset is consistently significant: MC = 20 %, RG = 30 % and WS-353 = 14 %.
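The two comparison measures used in this section can be sketched as follows. The difference follows the definition given for Table 5 (machine translation score minus language-specific score). The percent improvement is assumed here to be that difference relative to the language-specific score; the text does not spell out the exact formula or how values are averaged, so this is an interpretation. Both function names are hypothetical.

```python
def score_difference(mt_score, language_specific_score):
    """Machine translation score minus language-specific score."""
    return mt_score - language_specific_score


def percent_improvement(mt_score, language_specific_score):
    """Relative gain of the MT approach over the language-specific model.

    Assumption: the gain is expressed as a percentage of the
    language-specific Spearman correlation.
    """
    if language_specific_score == 0:
        raise ValueError("language-specific score must be non-zero")
    difference = score_difference(mt_score, language_specific_score)
    return 100.0 * difference / language_specific_score
```

As a rough check against the rounded averages quoted for W2V (0.56 language-specific, 0.68 with machine translation), `percent_improvement(0.68, 0.56)` gives roughly 21.4, close to but not identical with the 21.7 % reported. A gap of this size would be expected if the reported figure averages per-language percentages rather than taking the percentage of the averaged scores.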

4.5 Summary

Below, the interpretation of the results is summarised in terms of the core research questions which this paper aims to answer:

Question 1: Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?

Machine translation to English consistently performs better for all languages, with the exception of German, which presents equivalent results for the language-specific models. The MT approach provides an average improvement of 16.7 % over language-specific distributional semantic models.

Question 2: Which DSMs or MT-DSMs work best for the set of analysed languages?

W2V-MT consistently performs as the best model for all word-pairs datasets and languages, except German, for which the difference between W2V-MT and the language-specific W2V is not significant.

Question 3: What is the quality of state-of-the-art machine translation approaches for word-pairs?

The average translation accuracy for all languages and all word-pairs datasets is 59 %. Translation quality varies according to the nature of the word-pair (better translations are provided for word pairs which are semantically related compared to semantically similar word pairs), reaching a maximum of 85 % and a minimum of 36 % across different languages.

For the distributional semantics user/practitioner, as a general practice, we recommend using W2V built over an English corpus, supported by machine translation. Additionally, state-of-the-art machine translation approaches are more accurate when translating semantically related word pairs than when translating semantically similar word pairs.

5 Conclusion

This work provides a comparative analysis of the performance of four state-of-the-art distributional semantic models over 11 languages, contrasting the native language-specific models with the use of machine translation over English-based DSMs. The experimental results show that there is a significant improvement (average of 16.7 % for the Spearman correlation) from using off-the-shelf machine translation approaches and that the benefit of using a more informative (English) corpus outweighs the possible errors introduced by the machine translation approach. The average accuracy of the machine translation approach is 59 %. Moreover, for all languages, W2V showed consistently better results, while ESA proved more robust to smaller corpus sizes. For all languages, the combination of machine translation with the W2V English distributional model consistently provided the best results (average Spearman correlation of 0.68).

Future work will focus on the analysis and translation of two other word-pairs datasets: SimLex-999 [10] and MEN-3000 [3].