Semantic Relatedness for All (Languages): A Comparative Analysis of Multilingual Semantic Relatedness Using Machine Translation

Freitas, André; Barzegar, Siamak; Sales, Juliano Efson; Handschuh, Siegfried; Davis, Brian

doi:10.1007/978-3-319-49004-5_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10024)

Included in the following conference series:

European Knowledge Acquisition Workshop

Abstract

This paper provides a comparative analysis of the performance of four state-of-the-art distributional semantic models (DSMs) over 11 languages, contrasting the native language-specific models with the use of machine translation over English-based DSMs. The experimental results show that there is a significant improvement (average of 16.7 % for the Spearman correlation) by using state-of-the-art machine translation approaches. The results also show that the benefit of using the most informative corpus outweighs the possible errors introduced by the machine translation. For all languages, the combination of machine translation over the Word2Vec English distributional model provided the best results consistently (average Spearman correlation of 0.68).

1 Introduction

Distributional Semantic Models (DSMs) are establishing themselves as fundamental components for supporting automatic semantic interpretation in a range of natural language processing applications. From question answering systems to semantic search and textual entailment, DSMs offer a scalable approach for representing the meaning of words, which can automatically capture comprehensive associative commonsense information by analysing word-context patterns in large-scale corpora in an unsupervised or semi-supervised fashion [8, 18, 19].

However, distributional semantic models depend strongly on the size and quality of the reference corpora, which embed the commonsense knowledge necessary to build comprehensive models. While high-quality texts containing large-scale commonsense information, such as Wikipedia, are available in English, other languages may lack sufficient textual support to build distributional models.

To address this problem, this paper investigates how distributional semantic models built from corpora in different languages and of different sizes perform on semantic similarity and relatedness tasks. Additionally, we analyse the role of machine translation in supporting the construction of better distributional vectors and in computing semantic similarity and relatedness measures for other languages. In other words, when there is not enough information to create a DSM for a particular language, this work evaluates whether the benefit of the larger English corpus outweighs the error introduced by machine translation.

Given a pair of words and a human judgement score that represents the semantic relatedness of these two words, the evaluation method indicates how closely the scores of a distributional model track the human judgements. Three widely used word-pairs datasets are employed in this work: Miller and Charles (MC) [14], Rubenstein and Goodenough (RG) [17] and WordSimilarity 353 (WS-353) [7].
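The evaluation described above can be illustrated with a minimal sketch of the Spearman rank correlation between model scores and human judgements. The function names and the four example scores are hypothetical and are not taken from the datasets; the sketch only shows how the measure is obtained as the Pearson correlation of average ranks.

```python
import math


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical human judgements and model scores for four word pairs.
human_scores = [3.9, 3.0, 1.2, 0.4]
model_scores = [0.81, 0.75, 0.30, 0.35]
rho = spearman(human_scores, model_scores)
```

Because only the ordering of the scores matters, the model scores do not need to be on the same scale as the human judgements.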

In the proposed model, the word-pairs datasets are translated into English as a reference language and the distributional vectors are defined over the target end model (Fig. 1). Despite its simplicity, the proposed machine-translation-based method is highly relevant to the distributional semantics user/practitioner, because it is easy to use and yields a significant improvement in the results.
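The translation-mediated approach can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the translation table stands in for a machine translation service, the three-dimensional vectors stand in for an English DSM, and cosine similarity is assumed as the relatedness measure. All names and values are hypothetical.

```python
import math

# Hypothetical stand-in for a machine translation service (German to English).
TOY_TRANSLATIONS = {
    ("de", "auto"): "car",
    ("de", "wagen"): "automobile",
    ("de", "wald"): "forest",
}

# Hypothetical stand-in for vectors from an English distributional model.
TOY_ENGLISH_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "forest": [0.0, 0.2, 0.9],
}


def translate_to_english(word, source_lang):
    """Look up the English translation of a word (toy dictionary lookup)."""
    return TOY_TRANSLATIONS[(source_lang, word.lower())]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mt_relatedness(word1, word2, source_lang):
    """Translate both words to English, then compare their English vectors."""
    english1 = translate_to_english(word1, source_lang)
    english2 = translate_to_english(word2, source_lang)
    return cosine(TOY_ENGLISH_VECTORS[english1], TOY_ENGLISH_VECTORS[english2])


related = mt_relatedness("Auto", "Wagen", "de")
unrelated = mt_relatedness("Auto", "Wald", "de")
```

In this toy setting the pair translated to "car" and "automobile" scores higher than the pair translated to "car" and "forest"; a translation error would simply route the comparison to the wrong English vector.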

This work presents a systematic study involving 11 languages and four distributional semantic models (DSMs), providing a comparative quantitative analysis of the performance of the distributional models and the impact of machine translation approaches for different models.

In summary, this paper answers the following research questions:

1. Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?
2. Which DSMs and languages benefit more and less from the translation?
3. What is the quality of state-of-the-art machine translation approaches for word pairs (for each language)?

Moreover, this paper contributes two resources which can be used by the community to evaluate multi-lingual semantic similarity and relatedness models: (i) a high quality manual translation of the three word-pairs datasets - Miller and Charles (MC) [14], Rubenstein and Goodenough (RG) [17] and WordSimilarity 353 (WS-353) [7] - for 10 languages and (ii) the 44 pre-computed distributional models (four distributional models for each one of the 11 languages) which can be accessed as a service (see Note 1), together with the multi-lingual approaches mediated by machine translation.

This paper is organised as follows: Sect. 2 describes the related work; Sect. 3 describes the experimental setting; Sect. 4 analyses the results and provides the comparative analysis of the different models and languages. Finally, Sect. 5 provides the conclusion.

2 Related Work

Most related work has concentrated on leveraging joint multilingual information to improve the performance of the models.

Faruqui and Dyer [6] use the distributional invariance across languages and propose a technique based on canonical correlation analysis (CCA) for merging multilingual evidence into vectors generated monolingually. They evaluate the resulting word representations on semantic similarity/relatedness evaluation tasks, showing the improvement of multi-lingual over the monolingual scenario.

Utt and Padó [20] develop methods that take advantage of the availability of annotated corpora in English, using a translation-based approach to transport the word-link-word co-occurrences to support the creation of syntax-based DSMs.

Navigli and Ponzetto [15] propose an approach to compute semantic relatedness exploiting the joint contribution of different languages mediated by lexical and semantic knowledge bases. The proposed model uses a graph-based approach of joint multi-lingual disambiguated senses which outperforms the monolingual scenario and achieves competitive results for both resource-rich and resource-poor languages.

Zou et al. [21] describe an unsupervised semantic embedding (bilingual embedding) for words across two languages that represents not only semantic information of monolingual words, but also semantic relationships across different languages. Their work was motivated by the fact that it is hard to identify semantic similarities across languages, especially when co-occurring words are rare in the parallel training text. Al-Rfou et al. [1] produced multilingual word embeddings for about 100 languages using Wikipedia as the reference corpus.

In contrast, this work provides a comparative analysis of existing state-of-the-art distributional semantic models for different languages and analyses the impact of machine translation over an English DSM.

3 Experimental Setup

The experimental setup consists of the instantiation of four distributional semantic models (Explicit Semantic Analysis (ESA) [9], Latent Semantic Analysis (LSA) [12], Word2Vec (W2V) [13] and Global Vectors (GloVe) [16]) in 11 different languages - English, German, French, Italian, Spanish, Portuguese, Dutch, Russian, Swedish, Arabic and Farsi.

The DSMs were generated from Wikipedia dumps (January 2015), which were preprocessed by lowercasing, stemming and removing stopwords. For LSA and ESA, the models were generated using the SSpace Package [11], while W2V and GloVe were generated using the code shared by the respective authors. For the experiment, the vector dimensions for LSA, W2V and GloVe were set to 300, while ESA was defined with 1500 dimensions. The difference in size arises because ESA is composed of sparse vectors. All models were generated with the default parameters defined in each implementation.

Each distributional model was evaluated for the task of computing semantic similarity and relatedness measures using three human-annotated gold standard datasets: Miller and Charles (MC) [14], Rubenstein and Goodenough (RG) [17] and WordSimilarity 353 (WS-353) [7]. As these word-pairs datasets were originally in English, except for those languages available in previous works [4, 5], the word pairs were translated and reviewed with the help of professional translators skilled in data localisation tasks. The datasets are available at http://rebrand.ly/multilingual-pairs.

Two automatic machine translation approaches were evaluated: the Google Translate Service and the Microsoft Bing Translation Service. As Google Translate Service performed 16 % better for overall word-pairs translations, this was set as the main machine translation model.

The DInfra platform [2] provided the DSMs used in the work. To support experimental reproducibility, both experimental data and software are available at http://rebrand.ly/dinfra.

4 Evaluation and Results

4.1 Spearman Correlation and Corpus Size

Table 1 shows the correlation between the average Spearman correlation values for each DSM and two indicators of corpus size: # of tokens and # of unique tokens.

ESA is consistently more robust (on average) than the other models in relation to the corpus size, due to the fact that ESA has larger context windows than the other distributional models. While ESA considers the whole document as its context window, the other models are restricted to five (LSA) and ten (Word2Vec and GloVe) words.
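The difference in context scope can be illustrated with a small sketch. It only contrasts a fixed word window with a whole-document context and does not reproduce how any of the four models weight or factorise the resulting counts. The symmetric window, the function names and the example sentence are assumptions made for illustration.

```python
from collections import Counter


def window_contexts(tokens, target, window):
    """Count tokens within `window` positions on either side of each target occurrence."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def document_contexts(tokens, target):
    """Treat the whole document as context: count every other token if the target occurs."""
    if target not in tokens:
        return Counter()
    counts = Counter(tokens)
    del counts[target]
    return counts


# Hypothetical one-sentence "document".
tokens = "the cat sat on the mat near the old barn door".split()
narrow = window_contexts(tokens, "cat", 2)
broad = document_contexts(tokens, "cat")
```

With a window of two, "barn" is not a context of "cat", whereas the document-level context includes it. A broader context yields more evidence per word from the same amount of text, which is one way to read the robustness of ESA on smaller corpora.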

Another observation is that the evaluation of the WS-353 dataset is more dependent on the corpus size, which can be explained by the broader number of semantic relations expressed under the semantic relatedness umbrella.

Table 2 shows the size of each corpus in different languages regarding the number of unique tokens and the number of tokens.

Table 1. Correlation between corpus size and different models.

Table 2. The sizes of the corpora in terms of the number of unique tokens and tokens (scale of \(10^6\)).

4.2 Word-Pair Machine Translation Quality

The second step evaluates the accuracy of state-of-the-art machine translation approaches for word-pairs (Table 3). The accuracy of the translation for the WS-353 word pairs significantly outperforms the other datasets. This shows that the higher semantic distance between word pairs (semantic relatedness) has the benefit of increasing the contextual information during the machine translation process, subsequently improving the mutual disambiguation process.

Table 3. Translation accuracy.

For WS-353 the set of best-performing translations has an average accuracy of 80 % (with maximum 85 % and minimum 76 %). This value dropped significantly for Arabic and Farsi (average 50 %).

For MC and RG, the average translation accuracy for the semantic similarity pairs is 51.5 %. This difference may be a result of a deficit of contextual information during the machine translation process. For these word-pairs datasets, the difference between the best and the worst translation performers (across languages) is smaller. Additionally, the final translation accuracy for all languages and all word-pairs datasets is 59 %. French, Dutch and Spanish are the languages with the best automatic translations.

Table 4. Spearman correlation for the language-specific models.

4.3 Language-Specific DSMs

In the first part of the experiment, the Spearman correlations (\(\rho \)) between the human assessments and the computation of the semantic similarity and relatedness for all DSMs instantiated for all languages were evaluated (Fig. 1 (ii)). Table 4 shows the Spearman correlation for each DSM using language-specific corpora (without machine translation), for the three word-pairs datasets.

The comparative language-specific analysis indicates that English is the best-performing language (0.70), followed by German (0.61). The lowest Spearman correlation was observed in Arabic (0.35). From the tested DSMs, W2V is consistently the best-performing DSM (0.56). The language-specific DSMs achieved higher correlations for MC and RG (0.56 and 0.53, respectively), in comparison to 0.41 for WS-353.

The results for the language-specific DSMs were contrasted with the machine translation (MT) approach, according to the diagram depicted in Fig. 1 (i). The Spearman correlations for the MT-mediated approach are shown in Table 5.

Table 5. Spearman correlation for the machine translation models over the English corpora. Diff. represents the difference of machine translation score minus the language specific.

4.4 Machine Translation Based Semantic Relatedness

Using the MT models, W2V is consistently the best performing DSM (average 0.68), while ESA is consistently the worst performing model (0.47). We can interpret this result by stating that machine translation does not introduce significant performance improvements for ESA in comparison to the language-specific baselines.

Table 6. Difference between the language-specific and the machine translation approach. M. AVG represents the average of the models and DS. AVG represents the average of the datasets.

The best performing languages are French and Farsi (\(\rho = 0.63\)). The variance of the Spearman correlation across languages in the MT models is low, because the positive effect of using the English corpus outweighs the variation in the quality of the machine translation. As a result, all languages achieve very similar correlation values.

The impact of the MT model can be better interpreted by examining the difference between the machine translation and the language-specific models (depicted in Table 6). LSA accounts for the largest average percent improvement (28.4 %) using the MT model, while ESA accounts for the lowest value \((-2.9\,\%)\). As previously noted, this can be explained by the sensitivity of these models to the corpus size due to the dimensional reduction strategy (LSA) or the broader context window (ESA). The remaining models accounted for substantial improvements (W2V = 21.7 %, GloVe = 19.5 %).

Arabic and French achieved the highest percent gains (47 % and 38 %, respectively), while German shows the worst result \((-4\,\%)\). These numbers are consistent with the corpus sizes. For German, the result suggests that the German Wikipedia corpus has crossed a threshold size (34 % of the English corpus) above which further gains in computing semantic similarity for the target word-pairs datasets are marginal, while the translation error counts negatively in the final result.

The average improvement for the MT over the language specific model for each word-pairs dataset is consistently significant: MC = 20 %, RG = 30 % and WS353 = 14 %.
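The text does not state the formula behind the percent improvements. A minimal sketch, assuming they are relative improvements of the MT score over the language-specific score, is given below. The two input values are the W2V averages reported in Sects. 4.3 and 4.4 (0.56 and 0.68); the function name is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Percent change of `improved` relative to `baseline` (assumed formula)."""
    return (improved - baseline) / baseline * 100.0


# W2V averages quoted in the text: language-specific vs. machine translation.
w2v_language_specific = 0.56
w2v_machine_translation = 0.68

# Absolute difference, in the sense of the Diff. column of Table 5.
absolute_diff = w2v_machine_translation - w2v_language_specific

# Relative improvement in percent under the assumed formula.
percent_gain = relative_improvement(w2v_language_specific, w2v_machine_translation)
```

Under this assumption the aggregate W2V averages give a gain of roughly 21.4 %, close to but not identical to the 21.7 % reported above. A small gap of this kind would be expected if the reported figure averages per-language or per-dataset percentages rather than being computed from the overall averages, but that reading is itself an assumption.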

4.5 Summary

Below, the interpretation of the results is summarised in terms of the core research questions which this paper aims to answer:

Question 1: Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?

Machine translation to English consistently performs better for all languages, with the exception of German, which presents equivalent results for the language-specific models. The MT approach provides an average improvement of 16.7 % over language-specific distributional semantic models.

Question 2: Which DSMs or MT-DSMs work best for the set of analysed languages?

W2V-MT consistently performs as the best model for all word-pairs datasets and languages, except German, in which the difference between MT-W2V and language-specific W2V is not significant.

Question 3: What is the quality of state-of-the-art machine translation approaches for word-pairs?

The average translation accuracy for all languages and all word-pairs datasets is 59 %. Translation quality varies according to the nature of the word-pair (better translations are provided for word pairs which are semantically related compared to semantically similar word pairs), reaching a maximum of 85 % and a minimum of 36 % across different languages.

For the distributional semantics user/practitioner, as a general practice, we recommend using W2V built over an English corpus, supported by machine translation. Additionally, state-of-the-art machine translation approaches are more accurate when translating semantically related word pairs than when translating semantically similar word pairs.

5 Conclusion

This work provides a comparative analysis of the performance of four state-of-the-art distributional semantic models over 11 languages, contrasting the native language-specific models with the use of machine translation over English-based DSMs. The experimental results show that there is a significant improvement (average of 16.7 % for the Spearman correlation) by using off-the-shelf machine translation approaches and that the benefit of using a more informative (English) corpus outweighs the possible errors introduced by the machine translation approach. The average accuracy of the machine translation approach is 59 %. Moreover, for all languages, W2V showed consistently better results, while ESA proved more robust to smaller corpus sizes. For all languages, the combination of machine translation over the W2V English distributional model provided the best results consistently (average Spearman correlation of 0.68).

Future work will focus on the analysis and translation of two other word-pairs datasets: SimLex-999 [10] and MEN-3000 [3].

Notes

1. The service is available at http://rebrand.ly/dinfra.

References

1. Al-Rfou, R., Perozzi, B., Skiena, S.: Polyglot: distributed word representations for multilingual NLP. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 183–192. Association for Computational Linguistics, Sofia, August 2013. http://www.aclweb.org/anthology/W13-3520
2. Barzegar, S., Sales, J.E., Freitas, A., Handschuh, S., Davis, B.: Dinfra: a one stop shop for computing multilingual semantic relatedness. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 1027–1028. ACM, New York (2015). http://doi.acm.org/10.1145/2766462.2767870
3. Bruni, E., Tran, N.K., Baroni, M.: Multimodal distributional semantics. J. Artif. Int. Res. 49(1), 1–47 (2014). http://dl.acm.org/citation.cfm?id=2655713.2655714
4. Camacho-Collados, J., Pilehvar, M.T., Navigli, R.: A framework for the construction of monolingual and cross-lingual word similarity datasets. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pp. 1–7. Citeseer (2015)
5. Faruqui, M., Dyer, C.: Community evaluation and exchange of word vectors at wordvectors.org (2014)
6. Faruqui, M., Dyer, C.: Improving vector space word representations using multilingual correlation. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 462–471. Association for Computational Linguistics, Gothenburg, April 2014. http://www.aclweb.org/anthology/E14-1049
7. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. In: Proceedings of the 10th International Conference on World Wide Web, pp. 406–414. ACM (2001)
8. Freitas, A.: Schema-agnostic queries over large-schema databases: a distributional semantics approach. Ph.D. thesis, Digital Enterprise Research Institute (DERI), National University of Ireland, Galway (2015)
9. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI 2007, pp. 1606–1611. Morgan Kaufmann Publishers Inc., San Francisco (2007). http://dl.acm.org/citation.cfm?id=1625275.1625535
10. Hill, F., Reichart, R., Korhonen, A.: Simlex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)
11. Jurgens, D., Stevens, K.: The s-space package: an open source package for word space models. In: Proceedings of the ACL 2010 System Demonstrations, ACLDemos 2010, pp. 30–35. Association for Computational Linguistics, Stroudsburg (2010). http://dl.acm.org/citation.cfm?id=1858933.1858939
12. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)
13. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR Workshop Papers (2013)
14. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Lang. Cogn. Process. 6(1), 1–28 (1991)
15. Navigli, R., Ponzetto, S.P.: Babelrelate! A joint multilingual approach to computing semantic relatedness. In: AAAI Conference on Artificial Intelligence (2012)
16. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), vol. 12, pp. 1532–1543 (2014)
17. Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Commun. ACM 8(10), 627–633 (1965)
18. Sales, J.E., Freitas, A., Davis, B., Handschuh, S.: A compositional-distributional semantic model for searching complex entity categories. In: Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM), pp. 199–208 (2016)
19. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010). http://dl.acm.org/citation.cfm?id=1861751.1861756
20. Utt, J., Padó, S.: Crosslingual and multilingual construction of syntax-based vector space models. Trans. Assoc. Comput. Linguist. 2, 245–258 (2014)
21. Zou, W.Y., Socher, R., Cer, D.M., Manning, C.D.: Bilingual word embeddings for phrase-based machine translation. In: EMNLP, pp. 1393–1398 (2013)

Acknowledgments

This publication has emanated from research supported by the National Council for Scientific and Technological Development, Brazil (CNPq) and by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

Author information

Authors and Affiliations

Department of Computer Science and Mathematics, University of Passau, Innstrasse 43, ITZ-110, 94032, Passau, Germany
André Freitas, Juliano Efson Sales & Siegfried Handschuh
Insight Centre for Data Analytics, National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland
Siamak Barzegar & Brian Davis

Corresponding author

Correspondence to André Freitas.

Editor information

Editors and Affiliations

Linköping University, Linköping, Sweden
Eva Blomqvist
University of Bologna, Bologna, Italy
Paolo Ciancarini
University of Bologna, Bologna, Italy
Francesco Poggi
University of Bologna, Bologna, Italy
Fabio Vitali

About this paper

Cite this paper

Freitas, A., Barzegar, S., Sales, J.E., Handschuh, S., Davis, B. (2016). Semantic Relatedness for All (Languages): A Comparative Analysis of Multilingual Semantic Relatedness Using Machine Translation. In: Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F. (eds) Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science, vol 10024. Springer, Cham. https://doi.org/10.1007/978-3-319-49004-5_14

DOI: https://doi.org/10.1007/978-3-319-49004-5_14
Published: 04 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49003-8
Online ISBN: 978-3-319-49004-5
eBook Packages: Computer Science, Computer Science (R0)
