1 Introduction

Sentiment analysis based on natural language text processing has become very popular in recent years in many areas. These include topics such as the prediction of future events, including security issues around the world [25]. There is also great interest in the analysis of consumer opinions [6, 15, 16], especially among product manufacturers who want to know the general reactions of customers to their products and thus improve them. Consumer reviews allow specific customer preferences to be recognized, which facilitates good marketing decisions. With the increase in the number of reviews, especially for products sold on the global market (for which reviews are available in many languages), it is necessary to develop an effective method of multilingual sentiment analysis of reviews, one able to evaluate not only the sentiment of an entire opinion but also its components, e.g. aspects or features of the product, whose sentiment is expressed at the sentence level [24]. It is also important that the method works in as many domains as possible [1, 17, 18].

In this work we present MultiEmo, a multilingual benchmark corpus of consumer opinions, developed on the basis of PolEmo 2.0 [19]. The original collection was created to fill the gap in sentiment-annotated datasets for low-resource languages such as Polish. However, the results of this work suggest that treating Polish as a low-resource language may no longer be correct (Sect. 7). It can certainly be said that the number of sentiment-annotated corpora for Polish is very small (low-resource in this domain, Sect. 3). Low-resource languages often provide a wealth of information related to the culture of the people who speak them. This knowledge concerns intangible cultural heritage, which allows a better understanding of the processes that have shaped a given society, its value system and traditions. These factors are important when determining the sentiment of texts written by a person belonging to a particular cultural group.

MultiEmo allows building and evaluating sentiment recognition models for both high-resource and low-resource languages, at the level of the whole text as well as single sentences, and for different domains. A high level of Positive Specific Agreement (PSA) [13] was achieved for this set: 0.91 for annotations at the text level and 0.81 at the sentence level. The collection turns out to be very well suited to the evaluation of modern deep language models, especially cross-lingual ones. To the best of our knowledge, there is no other publicly available dataset of this size annotated with sentiment that allows simultaneous evaluation of models in 3 different aspects (3M): multilingual, multilevel and multidomain.

We also present classification results for selected recent deep language models: XLM-RoBERTa [4], MultiFiT [8] and a newly proposed LASER+BiLSTM combination, which uses the Language-Agnostic SEntence Representations (LASER) [2] model to evaluate the quality of cross-lingual sentiment recognition in a zero-shot transfer learning task.

Table 1. The description of the review sources, with review domain, author type, subject type and domain subcorpus size (number of documents). For two domains, potentially neutral texts were added from articles related to the domain.

2 Related Work

In recent years, the development of Transformer-based language models has led to significant improvements in cross-lingual language understanding (XLU). This would not have been possible without a growing number of benchmark sets, which make it possible to test the quality of new language models and compare them with existing ones. The pre-training and fine-tuning approach yields state-of-the-art results for a large number of NLP tasks. Among the popular pre-trained models, two groups can be distinguished. The first are monolingual models, e.g. BERT [7] or RoBERTa [21]. The second group are multilingual models, e.g. LASER [2], XLM-RoBERTa [4] or MultiFiT [8]. In this article we focus mainly on the second group and compare their effectiveness not only in cross-lingual tasks, but also in multidomain and multilevel settings. There are many benchmark datasets on which the above-mentioned models are tested, and in general they can be divided into similar groups. The monolingual group includes GLUE [27], KLEJ [23] and CoLA [28]; examples in the multilingual group are XGLUE [20] and XTREME [14].

Most of the mentioned language models support over 100 languages, e.g. LASER, mBERT, XLM, XLM-RoBERTa and fastText-RCSLS. However, some models are pre-trained on a much smaller number of languages, e.g. Unicoder (15 languages) or MultiFiT (7 languages). Multilingual benchmark datasets usually cover even fewer languages; the largest numbers are covered by XTREME (40 languages), XGLUE (19 languages) and XNLI (15 languages). However, the tasks in these datasets are mostly unique to the individual languages, i.e. they are not translations of each other. Additionally, there are no sets on which different levels of annotation (e.g. document level and sentence level) or other phenomena, e.g. cross-domain knowledge transfer, can be studied at the same time (i.e. on the same texts translated into many languages). Moreover, low-resource languages are highly underrepresented in most of the sub-tasks of these benchmarks.

An important problem from the perspective of multilingual sentiment analysis is the small number of benchmark sets. None of the previously mentioned sets contain multilingual data for this task. To the best of our knowledge, there is no set for this task that contains accurate translations of the training and test instances into many languages while additionally covering the multidomain and multilevel aspects. We found two collections close to the one we need, but neither of them met our objectives. The first is the SemEval-2016-Task-5 collection [22]. One of its subtasks (Out-of-domain Aspect-Based Sentiment Analysis) contains datasets for 8 languages. These are consumer reviews from different sources, but each language contains a different number of them and they are not translations of the same reviews into different languages. The set most conceptually similar to MultiEmo is the Multilanguage Tweets Corpus [9]. This collection contains 2794 tweets in Polish (1397 positive and 1397 negative), 4272 tweets in Slovenian (2312 positive and 1950 negative) and 3554 tweets in Croatian (2129 positive and 1425 negative). The Google Translate tool was then used to translate these tweets into English. However, the data was not translated into other languages, and the non-English collections contain different texts. Due to this lack of data, we decided to prepare our own collection.

3 MultiEmo Sentiment Corpus

The motivation to prepare the source corpus for MultiEmo came from work devoted to domain-oriented sentiment analysis, where a model is trained on annotated reviews from a source domain and tested on others [10]. A newer work on this subject describes a study on the Amazon Product Data collection [11]. However, that collection contains ratings assigned to the reviews by the authors of the texts, and the annotations are assigned at the level of the entire document. The initial idea was to build a corpus of reviews evaluated by the recipients, not the authors, of the content. Annotations should also be assigned not only at the level of the whole document, but also at the level of individual sentences, which makes it easier to assess aspects of the opinion. The last important feature was that the collection should be multidomain, so that models could be studied in the cross-domain knowledge transfer task. The four domains presented in Table 1 were chosen to build the initial corpus. The initial set of annotation tags contained 6 different ratings: 1) Strong Positive (SP), 2) Weak Positive (WP), 3) Neutral (0), 4) Weak Negative (WN), 5) Strong Negative (SN), 6) Ambivalent (AMB). The annotators were asked not to judge the strength of sentiment when distinguishing between the strong and weak categories. If a review was entirely positive or entirely negative, it received a strong category; if the positive aspects merely outweighed the negative ones (or vice versa), a weak one; if the positive and negative aspects were balanced, the text was marked as AMB. These rules were applied both at the level of the entire text and at the sentence level. The final Positive Specific Agreement on a part of the corpus containing 50 documents was 90% (meta, i.e. at the text level) and 87% (at the sentence level).

Table 2. PSA for WP/WN/AMB tags merged into one tag (AMB) at the (L)evel of (T)ext and (S)entence for the following (D)omains: (H)otels, (M)edicine, (P)roducts, (S)chool and (A)ll. Abbreviations: Strong Positive (SP), Neutral (0), Strong Negative (SN), Ambivalent (AMB).

After annotating the whole corpus, it turned out that the PSA for the weak categories (WP, WN, AMB) was low and did not exceed 40%. Distinguishing between the significance of positive and negative aspects proved a difficult task, so it was decided to merge the WP, WN and AMB categories into a single AMB category. Table 2 presents the PSA values after this merging procedure: the total PSA increased from 83% to 91% for annotations at the text level and from 85% to 88% for annotations at the sentence level.
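The PSA measure [13] and the merging step can be illustrated with a short sketch. For a given label, PSA is defined as 2a/(2a + b + c), where a counts items both annotators assigned that label and b, c count items only one of them did; the annotations below are hypothetical toy data, not corpus content:

```python
# Merging rule used for the corpus: WP, WN and AMB collapse into a single AMB tag.
MERGE = {"WP": "AMB", "WN": "AMB", "AMB": "AMB", "SP": "SP", "SN": "SN", "0": "0"}

def positive_specific_agreement(ann_a, ann_b, label):
    """PSA for one label between two annotators: 2a / (2a + b + c)."""
    a = sum(1 for x, y in zip(ann_a, ann_b) if x == label and y == label)
    b = sum(1 for x, y in zip(ann_a, ann_b) if x == label and y != label)
    c = sum(1 for x, y in zip(ann_a, ann_b) if x != label and y == label)
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0

# Toy annotations where the annotators disagree only within the weak categories:
ann_a = ["SP", "WP", "WN", "SN", "AMB", "0"]
ann_b = ["SP", "AMB", "WP", "SN", "WN", "0"]

merged_a = [MERGE[t] for t in ann_a]
merged_b = [MERGE[t] for t in ann_b]
print(positive_specific_agreement(ann_a, ann_b, "AMB"))       # 0.0 before merging
print(positive_specific_agreement(merged_a, merged_b, "AMB"))  # 1.0 after merging
```

On this toy pair, disagreements between WP, WN and AMB vanish after merging, mirroring the PSA increase reported above.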

Table 3. The number of texts/sentences for each evaluation type in the train/dev/test sets. The average line length (Avg len) is calculated from the merged set.

Table 3 shows the number of texts and sentences annotated by linguists for all evaluation types, divided into training, validation and test sets, together with the average line length of each combined set. Finally, the corpus was translated into 10 languages using the DeepL tool (Footnote 1): English, Chinese, Italian, Japanese, Russian, German, Spanish, French, Dutch and Portuguese. Its translations are of better quality than those generated by Microsoft Translator Hub [26], and DeepL achieves its best results when translating German texts into English or French. The semantic correctness of a translation does not guarantee the precise preservation of the sentiment associated with a given text. However, when resources are limited and we want to use information about the cultural background of authors writing in a low-resource language, machine translation is one of the best available solutions. The MultiEmo corpus (Footnote 2) is available under the MIT Licence.

4 Chosen Language Models

We have chosen the XLM-RoBERTa [4] and MultiFiT [8] language models to analyse the sentiment recognition task, and LASER [2] to test cross-lingual zero-shot transfer capability using MultiEmo. The first model, Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa), is a large multilingual language model trained on 2.5 TB of filtered CommonCrawl data, using self-supervised training techniques to achieve state-of-the-art performance in cross-lingual understanding. Unfortunately, using this model is a very resource-intensive process due to its complexity. The second model, Efficient Multi-lingual Language Model Fine-tuning (MultiFiT), is based on Universal Language Model Fine-Tuning (ULMFiT) [12] with a number of improvements: 1) SentencePiece subword tokenization instead of word-based tokenization, which significantly reduces the vocabulary size for morphologically rich languages, and 2) Quasi-Recurrent Neural Networks (QRNN) [3], which are up to 16 times faster at training and test time than long short-term memory (LSTM) networks due to increased parallelism. The last approach is our proposal to use LASER embeddings as input to a neural network based on the bidirectional long short-term memory (BiLSTM) architecture; during our literature review we did not find such an application described directly. LASER can calculate sentence embeddings for 93 languages, so a solution prepared for one language can be used for another without any additional training, enabling a zero-shot cross-lingual sentiment recognition transfer task. The main advantage of this multilingual approach is that preparing an individual model for each language can be avoided, and it is not necessary to translate the text into each language separately, which reduces training time and the use of computational resources.
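A minimal sketch of the LASER+BiLSTM idea in PyTorch, assuming precomputed 1024-dimensional LASER sentence embeddings as input (the hidden size, class count and layer names are illustrative, not the exact configuration used in the experiments):

```python
import torch
import torch.nn as nn

class LaserBiLSTM(nn.Module):
    """BiLSTM over a sequence of LASER sentence embeddings (one vector per
    sentence of a review), followed by a linear classification layer."""
    def __init__(self, emb_dim=1024, hidden=256, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # forward + backward states

    def forward(self, x):            # x: (batch, n_sentences, emb_dim)
        _, (h, _) = self.lstm(x)     # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.fc(h)            # logits: (batch, num_classes)

# Two reviews of 5 sentences each, with random tensors standing in for LASER output:
model = LaserBiLSTM()
logits = model(torch.randn(2, 5, 1024))
print(logits.shape)  # torch.Size([2, 4])
```

Because LASER maps all 93 supported languages into one embedding space, a classifier head like this trained on one language can be applied to embeddings of another language unchanged, which is what makes the zero-shot transfer setting possible.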

5 Multidimensional Evaluation

In order to present the multidimensional evaluation possibilities of MultiEmo, we conducted several types of evaluation. The first three focused on the multilingual aspect of the sentiment corpus. The first one checked whether models trained on LASER embeddings of texts in one language would be as effective in sentiment analysis of texts in another language as models trained on LASER embeddings of texts in the same language as the test set. We chose the 11 languages available in the MultiEmo Sentiment Corpus: Chinese, Dutch, English, French, German, Italian, Japanese, Polish, Portuguese, Russian and Spanish. The second type of evaluation checked whether models trained on LASER embeddings of texts in languages other than Polish would analyse sentiment in Polish texts as effectively as a model trained only on LASER embeddings of Polish texts. The third evaluation measured the effectiveness of classifiers in the sentiment analysis task on texts written in 10 different languages: Chinese, Dutch, English, French, German, Italian, Japanese, Portuguese, Russian and Spanish. We evaluated 3 different classifiers: a bidirectional long short-term memory network trained on language-agnostic sentence embeddings (LASER+BiLSTM), MultiFiT and XLM-RoBERTa. The fourth evaluation focused on the multilevel aspect of the MultiEmo Sentiment Corpus: we checked the effectiveness of the 3 classifiers (LASER+BiLSTM, MultiFiT and XLM-RoBERTa) in the sentiment recognition of single sentences. A single sentence provides far less information than a multi-sentence opinion, and such a small amount of information makes it difficult to determine the sentiment of a review correctly. We therefore tested the same 3 classifiers that were used on text-level annotations to see if they would be equally effective in the classification of sentence-level annotations. The fifth evaluation took advantage of the multidomain aspect of the MultiEmo Sentiment Corpus. The sentiment of a given word often depends on the domain of the whole text: depending on the subject, a word may carry positive, neutral or negative sentiment. Correctly recognizing the sentiment of a text regardless of its domain is an even more difficult task and requires good-quality texts from many domains. In this process we evaluated the 3 classifiers (LASER+BiLSTM, MultiFiT and XLM-RoBERTa) in the task of sentiment recognition in texts from a single domain, both when the classifiers were trained on a set containing only texts from the same domain (SD) and when the training set contained texts from multiple domains (MD).

During the evaluation process we trained 30 instances of each model and then evaluated them on a given test set. Afterwards we conducted statistical tests to verify the statistical significance of the differences between the evaluation results of the models. We decided to use the independent-samples t-test, as the evaluation results concerned different models. Before conducting the test we checked its assumptions, and if any of the samples did not meet them, we used the non-parametric Mann-Whitney U test. The values in bold in each results table presented in Sect. 6 mean that a given model performed significantly better than the others. It should be mentioned that the monolingual models are in fact multilingual models tuned on a single-language set. In our five experiments we counted in how many "cases" a model was better than the others, i.e. the number of occurrences of the best result across all variants of a single experiment.
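The testing procedure described above can be sketched with SciPy. This is a simplified version under stated assumptions: the paper does not specify which assumption checks were used, so here normality is checked with the Shapiro-Wilk test and equal variances with Levene's test, and the scores are synthetic stand-ins for 30 per-instance F1 values:

```python
import random
from scipy import stats

def compare_models(scores_a, scores_b, alpha=0.05):
    """Independent-samples comparison of two models' evaluation scores.
    Falls back to the Mann-Whitney U test when t-test assumptions fail."""
    normal = (stats.shapiro(scores_a).pvalue > alpha and
              stats.shapiro(scores_b).pvalue > alpha)
    equal_var = stats.levene(scores_a, scores_b).pvalue > alpha
    if normal:
        test = stats.ttest_ind(scores_a, scores_b, equal_var=equal_var)
    else:
        test = stats.mannwhitneyu(scores_a, scores_b)
    return bool(test.pvalue < alpha)  # True: the difference is significant

# Synthetic F1-scores for 30 instances of two models:
random.seed(0)
a = [0.80 + random.gauss(0, 0.01) for _ in range(30)]
b = [0.86 + random.gauss(0, 0.01) for _ in range(30)]
print(compare_models(a, b))  # True: a 6-point gap at this noise level is significant
```

In the tables that follow, a bold value corresponds to `compare_models` returning `True` in favour of that model against each competitor.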

Table 4 presents the average F1-score values for each of the labels as well as the global F1-score, micro-AUC and macro-AUC for the MultiEmo evaluation of bidirectional long short-term memory network models trained on language-agnostic sentence embeddings. Significant differences between the performance of the models trained on texts in Polish and the models trained in the same language as the test set were observed in 26 out of 70 cases \((37\%)\). The models achieved different results mainly in the case of neutral and ambivalent texts, which are much more diverse than texts characterized by strong and uniform emotions, e.g. strongly positive and strongly negative ones.
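For reference, the reported metrics (per-class F1, global F1, micro- and macro-AUC) can be computed with scikit-learn. This is a sketch with synthetic labels and random probability scores standing in for the actual model outputs; the integer encoding 0..3 of the classes SP, 0, SN, AMB is an assumption for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

# Synthetic ground truth over the four classes SP, 0, SN, AMB (encoded 0..3):
y_true = np.array([0, 1, 2, 3, 0, 2, 3, 1])
rng = np.random.default_rng(0)
y_prob = rng.dirichlet(np.ones(4), size=len(y_true))  # stand-in for softmax scores
y_pred = y_prob.argmax(axis=1)

per_class_f1 = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2, 3])
global_f1 = f1_score(y_true, y_pred, average="micro")
# Micro/macro AUC via one-vs-rest binarization of the labels:
y_bin = label_binarize(y_true, classes=[0, 1, 2, 3])
micro_auc = roc_auc_score(y_bin, y_prob, average="micro")
macro_auc = roc_auc_score(y_bin, y_prob, average="macro")
print(per_class_f1.round(2), round(global_f1, 2),
      round(micro_auc, 2), round(macro_auc, 2))
```

The one-vs-rest binarization step is what makes micro-AUC well defined for a four-class problem, since `roc_auc_score` only supports micro averaging on binary indicator targets.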

Table 4. Average F1-scores for the MultiEmo evaluation of LASER+BiLSTM models trained on texts in Polish and those trained on texts in the same language as the test set. The values in bold refer to the model that achieved significantly better results than the other one. Abbreviations: Strong Positive (SP), Neutral (0), Strong Negative (SN), Ambivalent (AMB).
Table 5. Average F1-scores for the MultiEmo evaluation of LASER+BiLSTM models on the test set containing only texts in Polish. The values in bold refer to models that achieved significantly better results than the model trained on texts in Polish. Abbreviations: Strong Positive (SP), Neutral (0), Strong Negative (SN), Ambivalent (AMB).

Table 5 shows the average F1-scores for the MultiEmo evaluation of LASER+BiLSTM models on the test set containing only texts in Polish. The results of models trained on texts in languages other than Polish were compared with the results of the model trained only on Polish texts. On the basis of the statistical tests described in Sect. 5, significant differences in model results were observed in 3 out of 70 cases \((4.3\%)\). The worst results were observed for models trained on Chinese and Japanese texts.

Table 6. Average F1-scores for the MultiEmo evaluation of three different classifiers: LASER+BiLSTM, MultiFiT and XLM-RoBERTa. For languages not supported by MultiFiT, the evaluation was carried out for the LASER+BiLSTM and XLM-RoBERTa classifiers only. The values in bold refer to the model that achieved significantly better results than the other ones. Abbreviations: Strong Positive (SP), Neutral (0), Strong Negative (SN), Ambivalent (AMB).

The MultiEmo multilingual evaluation results of the different classifiers are presented in Table 6. We chose three classifiers: LASER+BiLSTM, MultiFiT and XLM-RoBERTa. MultiFiT achieved the best results in 32 out of 49 cases \((65\%)\), and XLM-RoBERTa outperformed the other models in 38 out of 70 cases \((54\%)\). Both MultiFiT and XLM-RoBERTa obtained better results than LASER+BiLSTM in every case, and MultiFiT performed better than XLM-RoBERTa in 4 out of 7 languages \((57\%)\).

Table 7. Average F1-scores for the evaluation on the MultiEmo sentence-based multidomain dataset. Classifiers: LASER+BiLSTM, MultiFiT, XLM-RoBERTa. For languages not supported by MultiFiT, the evaluation was carried out for the LASER+BiLSTM and XLM-RoBERTa classifiers only. The values in bold refer to the model that achieved significantly better results than the other ones. Abbreviations: Strong Positive (SP), Neutral (0), Strong Negative (SN), Ambivalent (AMB).

6 Results

The results of the evaluation on the MultiEmo sentence-based multidomain dataset are presented in Table 7. MultiFiT outperformed the other models in 28 out of 28 cases \((100\%)\). XLM-RoBERTa achieved the best results in 13 out of 42 cases \((31\%)\).

Table 8 shows the evaluation results on the MultiEmo single-domain and multidomain datasets. We evaluated three classifiers: LASER+BiLSTM, MultiFiT and XLM-RoBERTa. In the case of the single-domain datasets, MultiFiT obtained the best results in 8 out of 16 cases \((50\%)\) and XLM-RoBERTa outperformed the other models in 10 out of 24 cases \((42\%)\). LASER+BiLSTM turned out to be the best in 6 out of 24 cases \((25\%)\); it outperformed the other models in the review domain, achieving the best results in 5 out of 6 cases \((83\%)\). In the case of the multidomain evaluation, XLM-RoBERTa outperformed the other models in 18 out of 24 cases \((75\%)\), while MultiFiT achieved the best results in 2 out of 16 cases \((12.5\%)\). The only case where LASER+BiLSTM achieved the best results was texts about products written in Japanese.

Table 8. Average F1-scores for the evaluation on the MultiEmo single-domain (SD) and multidomain (MD) datasets. The languages of the individual datasets: DE – German, EN – English, IT – Italian, JP – Japanese, PL – Polish, RU – Russian. Classifiers: LASER+BiLSTM, MultiFiT, XLM-RoBERTa. For languages not supported by MultiFiT, the evaluation was carried out for the LASER+BiLSTM and XLM-RoBERTa classifiers only. The values in bold refer to the model that achieved significantly better results than the other ones. Abbreviations: Strong Positive (SP), Neutral (0), Strong Negative (SN), Ambivalent (AMB).

7 Conclusions and Future Work

The MultiEmo service (Footnote 3) with all models is available through the CLARIN-PL Language Technology Centre (Footnote 4), and the source code is available on the MultiEmo GitHub page (Footnote 5). In the LASER+BiLSTM evaluation, few differences were found between the model trained on texts in Polish and the model trained on texts in the same language as the test set. Similarly, statistical tests showed few differences in the effectiveness of models trained on texts in different languages in the task of sentiment recognition on Polish texts. The low average F1-scores for texts in Chinese and Japanese may be related to a significantly worse quality of translations compared to translations into languages more similar to Polish, such as English or German. On the other hand, the similar average F1-scores of the multilingual model on Polish and on translated texts may be related to a high similarity between the model used for machine translation and the multilingual model; the authors of DeepL do not provide information on this subject.

In Table 4, which compares LASER+BiLSTM models tested on pairs of texts in Polish and in the language of the training data, the biggest differences are observed for the classes with the smallest representation in the set. Analysing the F1, micro and macro results, significant differences appear only for the Asian languages, for which the results are significantly worse than for the others. This may be due to a much smaller amount of data available to the LASER model for these languages, because in Table 6 the results obtained for them with the XLM-RoBERTa and MultiFiT models are much better. Unfortunately, we do not have access to the training resources of the source models to verify this. The results for the other languages indicate that, regardless of the configuration chosen, the results within a pair of two languages do not differ significantly from each other. It is possible that the source models (DeepL and LASER) were trained on similar data for these language pairs. On the other hand, LASER supports 93 languages and DeepL only 12. We are not able to evaluate the other languages supported by LASER, but it can be assumed that if their data representation in the source model was at a similar level to that of the examined high-scoring languages, equally high results can be expected. Another experiment compared models trained on different languages and tested only on Polish (Table 5). The aggregate results for the LASER+BiLSTM model show that models created on translations of the original set are of comparable or worse quality than the model trained on Polish. For some single classes, models built on translations even turn out to be better than the model built on the original corpus; such cases are observed for Dutch, English and German. It is possible that the data used to create the source models (LASER and DeepL) contain a significantly larger number of translation examples for these languages.
Further work should examine the quality of the translations for individual language pairs and check the correlation between translation quality and the results of models based on these translations. Table 6 shows the results of different deep multilingual models built on different MultiEmo language versions for whole texts; similar results are given in Table 7 for models built on single sentences. The aggregate results (F1, macro, micro) show a clear superiority of the XLM-RoBERTa and MultiFiT models over the zero-shot transfer learning approach. The probable cause of these differences is the use of much more text to create the DeepL, XLM-RoBERTa and MultiFiT models compared to the LASER model. On the other hand, in the absence of a good machine translation tool, the LASER+BiLSTM model still achieves results for most languages that are already acceptable in at least some business applications. The results also show that translating a text into another language with a good-quality translator makes it possible to obtain a model with results comparable to those of a model built for the source language. Moreover, it has been shown that the Polish language is receiving increasingly satisfactory support in well-known SOTA tools and models, and perhaps assigning it to the low-resource category [5] is no longer justified. Alternatively, the conclusion is that very good quality models for high-resource languages can also be obtained from scarce resources in low-resource languages.

Table 8 shows the results of models trained on a selected domain (SD) and on all domains simultaneously (MD). The results show that, in the context of domain adaptation, it is not possible to clearly indicate the best model for representing a single domain (SD variants). Differences were also found between languages within the same domain. When a single model was trained on all domains, XLM-RoBERTa had the most domain-agnostic sentiment representation.

MultiFiT achieved the best results in the greatest number of cases; its disadvantage is the small number of supported languages (only 7). XLM-RoBERTa most often achieved the second-best results, except in the multidomain evaluation, where it outperformed the other classifiers. LASER+BiLSTM, the only zero-shot classifier, obtained worse results in almost every case. In further research we would like to address a detailed analysis of the impact of translations on sentiment analysis. Apart from translation quality as such, a particularly interesting issue is a direct change of a text's sentiment during translation.