1 Introduction

Over the past few years, sentiment analysis, which extracts the opinions embedded in a given dataset, has received increasing interest, mostly due to the emergence and popularity of online social networks (OSNs). Previous studies have analysed opinions and views regarding products, services, news, and social and political events, among others [1], thus providing insights into what people are thinking and feeling. The results of these analyses can be used in multiple applications, such as analysing the propagation of hate speech [2].

Many sentiment analysis techniques exist, and one of the most commonly used relies on a sentiment lexicon. A sentiment lexicon is a list of opinion words and phrases together with their sentiment categories or orientations [3,4,5]. In the absence of an adequate training dataset, the lexicon-based approach has proven more appropriate than the machine learning approach [6]. Moreover, sentiment lexicons have been shown to perform better on short texts, such as social media posts [7]. The technique is also suitable for real-time opinion classification owing to its relatively low computational requirements [6, 8]. Furthermore, sentiment lexicons can be employed for both unsupervised [3, 6] and supervised [5, 9] classification of a given text.

Sentiment lexicons are primarily available for the English language and are limited or unavailable for other languages [10]. Interestingly, although English is recognized as the most commonly used language globally, the percentage of Internet users who communicate in English is less than 27%. There is therefore a dire, if not urgent, need to create resources and tools for subjectivity and sentiment analysis in non-English languages [11, 12]. Some researchers have attempted to build sentiment lexicons for non-English texts; however, these are not comparable to those available for English, as they are often incomplete or domain specific [11, 13]. A motivating factor for creating resources for non-English languages is that many organizations and enterprises recognize the value and necessity of understanding user feedback and associated trends, thereby gaining a competitive advantage regardless of language or demographics [14, 15]. In addition, creating sentiment lexicons manually for many languages is incredibly time-consuming and expensive [3, 16].

Much of the existing literature has built non-English sentiment lexicons by translating English lexicons into specific target languages [13, 17, 18]; however, this results in a loss of context [19]. Others have used lexical language resources containing words with synonyms and antonyms, such as translated copies of SentiWordNet (SWN) [20, 21], building lexicons by analysing the semantic relations between words. This technique, however, is also limited because most non-English languages lack such linguistic resources. An annotated corpus is another resource for constructing sentiment lexicons, which can be exploited using either statistical or semantic relation methods. Statistical techniques use large corpora with statistical equations to obtain the polarity of words, whilst semantic relation methods use the semantic relations between words in a large corpus to extract a sentiment lexicon [22]. Constructing sentiment lexicons by analysing a corpus requires a substantial corpus to achieve an acceptable level of accuracy. Moreover, in some instances, the annotated corpus requires additional data annotation and human effort before analysis can commence [23, 24].

1.1 Contribution

The current study proposes a novel, automatic, language-independent method for building non-English sentiment lexicons in order to address the gaps stated above. The proposed method uses existing English lexicons together with an unannotated target language corpus to identify the sentiment of a given document or word. The work relies on the intuitive rule that related words can determine the polarity of a sentiment word; the same rule also holds when determining the polarity of a document. The main contributions of the study are as follows:

  • A novel framework incorporating two available resources (i.e. seed lexicons and an unannotated corpus) is developed to build and adapt sentiment lexicons for non-English languages;

  • An automated method to recognize new polarity words in the unannotated corpus by aggregating the polarity values of seed words to predict the overall sentiment orientation of each candidate word (i.e. Eq. 1); and

  • The construction of a curated Arabic sentiment lexicon based on the proposed method.

The remainder of this paper is structured as follows. Section 2 describes related work on lexicon generation methods for non-English languages. Section 3 provides an overview of the proposed automated method for building and expanding sentiment lexicons, together with its implementation using real data and its evaluation. Section 4 presents the results and discussion, and Sect. 5 concludes the paper with suggestions for future work.

2 Related work

2.1 Sentiment lexicon generation methods

Methods for generating opinion lexicons for non-English languages range from fully manual and semiautomatic to restricted automatic methods [21]. Three common methods for generating these lexicons are dictionary-based, corpus-based and human-based.

2.1.1 Dictionary-based method

The dictionary-based method relies on existing sentiment lexicons to build target language lexicons through translation, transfer learning or semantic relations [25]. The rapid development of machine translation tools has enabled researchers to translate many English sentiment lexicons into other languages [17, 18]. Yao et al. [26] conducted one of the first studies on building sentiment lexicons by translation; they used a bilingual dictionary to determine the sentiment orientation of Chinese words in order to generate a Chinese sentiment lexicon. The translation technique was also used by Mihalcea et al. [27] to build a Romanian sentiment lexicon. Steinberger et al. [13] suggested a semiautomatic method based on triangulation among three languages, whereby two languages (English and Spanish) were used as sources and the translation result was produced in the third language. The authors found the triangulation technique to outperform direct translation for building sentiment lexicons.

Other studies incorporating translation include the work of [18], who translated the English NRC Word-Emotion Association Lexicon [28] into French, and [29], in which the AFINN English sentiment lexicon used in [30] was translated into Norwegian using Google Translate. Similarly, Basile and Nissim [31] used SentiWordNet and MultiWordNet to transfer the sentiment orientation of English words to an Italian sentiment lexicon, whilst Perez-Rosas et al. [12] used OpinionFinder [32] along with SentiWordNet to transfer sentiment scores from English to Spanish.

Although translation is a quick way to generate lexicons, it carries the risk of losing the context and polarity associated with words [25], resulting in inaccurate sentiment lexicons. Moreover, translation does not work on the slang and abbreviations commonly found on social networking sites [4].

2.1.2 Corpus-based method

The corpus-based method extracts polarity words from a large corpus through statistical or semantic relation methods [22]. Statistical methods use words of known polarity (seeds) and exploit co-occurrence patterns to extract new sentiment words from a large corpus [24]. For example, Remus et al. [33] collected product reviews comprising 5100 positive and 5100 negative reviews and determined the polarity of words using pointwise mutual information (PMI). Other studies using PMI include [34], who created a Hindi sentiment lexicon, and [35], who generated an Arabic lexicon.
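To make the statistical approach concrete, the sketch below computes a Turney-style PMI-based orientation score for a candidate word from document-level co-occurrence counts with positive and negative seed words; the corpus format and counting scheme are simplified assumptions rather than the exact procedure of the cited studies.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count in how many documents each word and each word pair occurs."""
    word_counts, pair_counts = Counter(), Counter()
    for doc in documents:
        tokens = set(doc.split())
        word_counts.update(tokens)
        pair_counts.update(frozenset(p) for p in combinations(sorted(tokens), 2))
    return word_counts, pair_counts

def pmi(w1, w2, word_counts, pair_counts, n_docs):
    """Pointwise mutual information between two words over documents."""
    p_joint = pair_counts[frozenset((w1, w2))] / n_docs
    if p_joint == 0:
        return 0.0
    return math.log2(p_joint / ((word_counts[w1] / n_docs) * (word_counts[w2] / n_docs)))

def semantic_orientation(word, pos_seeds, neg_seeds, word_counts, pair_counts, n_docs):
    """Orientation = summed PMI with positive seeds minus summed PMI with negative seeds."""
    pos = sum(pmi(word, s, word_counts, pair_counts, n_docs) for s in pos_seeds)
    neg = sum(pmi(word, s, word_counts, pair_counts, n_docs) for s in neg_seeds)
    return pos - neg

# Toy corpus of three 'reviews'
docs = ["great phone excellent battery", "terrible screen poor battery", "excellent value great screen"]
wc, pc = cooccurrence_counts(docs)
# Positive score: 'value' co-occurs only with positive seed words
print(semantic_orientation("value", {"excellent", "great"}, {"terrible", "poor"}, wc, pc, len(docs)))
```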

Elhawary and Elfeky [36], on the other hand, proposed a similarity graph that clusters words and phrases to develop an Arabic sentiment lexicon. Their hypothesis was that if two words share an edge, they are deemed similar, and similar words have either the same sentiment polarity or the same meaning. A random walk technique was used to build a Polish sentiment lexicon from 3222 web documents in [37], whereas emoticons were used to extract polarity words from microblogs for a Chinese lexicon in [38].

2.1.3 Human-based method

The human-based method relies on encouraging people to answer questions or solve puzzles and constructs sentiment lexicons from their answers [39]. Sentiment lexicons built by humans are usually more accurate than others [39]; however, producing them is time-consuming, resource-intensive and costly [28]. Nevertheless, many studies have engaged human experts to manually annotate data. For example, Abdul-Mageed et al. [40] proposed a supervised machine learning system called SAMAR, which analyses Arabic subjectivity and sentiment in both Modern Standard Arabic and Arabic dialects, and manually built a sentiment lexicon consisting of 3982 adjectives labelled as positive, negative or neutral.

One technique gaining popularity in the human-based approach is online games, which are used to solicit 'expert' views. For instance, Lafourcade et al. [41] produced an online game with a purpose (GWAP) that requires users to determine the polarity of presented terms and words using emoticons, whereas [39] built Tower of Babel, a language-independent crowdsourcing game that determines sentiment in Korean. The authors of [19] developed Sentiment Quiz and invited players from different countries to evaluate given words; more than 3500 users evaluated approximately 325,000 words in different languages. The authors, however, did not provide any information about their evaluations.

2.2 Limitations of current methods

All three of the aforementioned techniques, that is, dictionary-based, corpus-based and human-based, have limitations. In short, the sentiment lexicons built with the dictionary-based approach tend to be general-domain lexicons and may therefore be inaccurate when applied to specific domains. Moreover, such lexicons do not contain many of the words or abbreviations commonly used on social networking sites such as Twitter and Facebook, so the approach cannot handle different dialects and informal or slang words [4]. Additionally, when machine translation is used, errors may arise from cultural differences in the sentiment orientation of words [12, 27, 42].

The corpus-based approach, on the other hand, suffers from the lack of data pre-processing tools for many languages, which makes it difficult and complex to rely on a corpus to build lexicons. In addition, the generated lexicons tend to be domain specific and therefore cannot be relied upon to analyse other domains. The approach is also heavily dependent on an annotated corpus [23], which requires manual data annotation. Finally, sentiment lexicons built by humans are usually more accurate than those produced by dictionary- and corpus-based approaches [39]; however, building them is time-consuming and costly [28].

To overcome these challenges, the present study aims to build an automated, language-independent non-English lexicon using an unannotated corpus and English sentiment lexicons.

2.3 Arabic sentiment analysis

In this paper, the emphasis is on sentiment analysis for the Arabic language; consequently, this subsection describes some work on Arabic sentiment resources. Compared with the resources available for sentiment analysis of English and other languages, the Arabic language is severely under-resourced. Despite the great interest in Arabic sentiment analysis in recent times [43], little research has been published compared to other languages. Some of these studies are based on machine learning methods, where researchers manually label a corpus as positive or negative in order to obtain enough training data [44]; the vast majority of these corpora and resources are not available to the public [45]. Other studies rely on lexicon-based methods. Table 1 summarizes the main works on building Arabic sentiment lexicons.

Table 1 The main works in the building of Arabic sentiment lexicons

In addition to the problems and limitations mentioned in Sect. 2.2, most Arabic lexicons are very noisy, since they lack part-of-speech (POS) tags and are not available to the public [48]. The majority of lexicons were built for Modern Standard Arabic, which makes them perform poorly on Arabic dialects [45], because machine translation generates lexicons in Modern Standard Arabic.

3 Methodology

This section presents the proposed methodology for automatically constructing non-English sentiment lexicons. A corpus-based method is proposed to discover new polarity words based on two resources: a target language corpus and an English seed sentiment lexicon. The seed sentiment lexicon is used to identify new sentiment words in the target language corpus. Figure 1 illustrates the process, which consists of four phases: (1) seed lexicon preparation, (2) corpus collection and pre-processing, (3) candidate word extraction and (4) determination of the sentiment orientation of the candidate words. The steps involved in the four phases are presented in the following subsections.

Fig. 1 The proposed method phases

3.1 Preparation of resources

3.1.1 Seed lexicon preparation

One of the main resources used in this study was a set of English sentiment lexicons, which were translated into the target language using the Google machine translation tool. The first step is to clean the translated lexicons by removing any duplicate and untranslated words. At this stage, more than one lexicon is used and combined to increase the number of words in the seed lexicon. If initial or sufficient seeds are already available in the target language, the automatic translation step is skipped.
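As a concrete illustration of this preparation step, the sketch below merges several translated lexicons into a single seed lexicon while dropping untranslated and conflicting entries; the input dictionaries and the `looks_untranslated` heuristic are illustrative assumptions, not the exact cleaning rules used in the study.

```python
import re

def looks_untranslated(word):
    """Heuristic: an entry that is still plain Latin script was not actually translated."""
    return bool(re.fullmatch(r"[A-Za-z'\- ]+", word))

def merge_seed_lexicons(translated_lexicons):
    """Combine several translated lexicons into one seed lexicon.

    Each input maps a translated word to a polarity value; untranslated entries are
    dropped, duplicates are kept once, and words with conflicting polarities are discarded.
    """
    merged, conflicting = {}, set()
    for lexicon in translated_lexicons:
        for word, polarity in lexicon.items():
            if looks_untranslated(word) or word in conflicting:
                continue
            if word in merged and merged[word] != polarity:
                merged.pop(word)
                conflicting.add(word)   # conflicting polarity across lexicons: drop the word
            else:
                merged[word] = polarity
    return merged

# Hypothetical fragments of translated lexicons (polarity: +1 / -1)
afinn_ar = {"رائع": 1, "فشل": -1, "amazing": 1}   # 'amazing' was left untranslated
mpqa_ar = {"رائع": 1, "خطأ": -1}
print(merge_seed_lexicons([afinn_ar, mpqa_ar]))
# -> {'رائع': 1, 'فشل': -1, 'خطأ': -1}
```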

3.1.2 Unannotated corpus preparation

An unannotated corpus was constructed using the Facebook API, owing to the absence of annotated corpora for many languages [16]. Two corpus groups were established: one for constructing the lexicon and the other for testing the accuracy of the classification. The pre-processing step involved data cleansing, in which any comments containing links or symbols were deleted, followed by the removal of words and characters not in the target language. 'Common' words (i.e. stop words), such as 'he (هو)', 'you (أنت)', 'we (نحن)' and 'she (هي)', were also removed. Finally, lemmatization (i.e. splitting off the prefixes and affixes of words) was performed to convert the words to their root or dictionary forms [63]; a minimal code sketch of these pre-processing steps is given after the example below. For example, the sentence "the student's books are different sizes" is lemmatized as follows:

  • the => the,

  • student’s => student,

  • books => book,

  • are => be,

  • different => differ,

  • sizes => size.
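The sketch below illustrates these pre-processing steps (link filtering, symbol removal, stop-word removal and lemmatization). It uses a small English example for readability; the regular expressions, the stop-word set and the `lemmatize` stub are placeholders for the language-specific resources (e.g. FARASA for Arabic) that a real implementation would use.

```python
import re

STOP_WORDS = {"the", "are", "he", "she", "we", "you"}   # placeholder stop-word list

def clean_comment(text):
    """Drop comments that contain links; strip symbols and out-of-language characters."""
    if re.search(r"https?://\S+", text):
        return None                              # posts with links are discarded entirely
    return re.sub(r"[^A-Za-z\s]", " ", text)     # for Arabic, keep the Arabic Unicode range instead

def lemmatize(token):
    """Placeholder lemmatizer; a real pipeline would call a toolkit such as FARASA for Arabic."""
    rules = {"students": "student", "books": "book", "different": "differ", "sizes": "size"}
    return rules.get(token, token)

def preprocess(comment):
    cleaned = clean_comment(comment)
    if cleaned is None:
        return []
    tokens = cleaned.lower().split()
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(preprocess("The student's books are different sizes"))
# -> ['student', 'book', 'differ', 'size']
```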

3.1.3 Candidate words extraction

The list of candidate words was prepared in several steps. First, tokenization was carried out to divide the sentences into individual words. For example, the sentence:

  • “the students go to school”

  • will become: “the”, “students”, “go”, “to”, “school”.

Next, the words were converted to their root or dictionary forms by splitting off their prefixes and affixes.

Several filters were then applied to refine the lemmatized tokens. The first filter sorts the words alphabetically and removes words that do not belong to the target language, including any symbols or URLs. Repeated or unusual words are also removed, as these are often misspelled or meaningless. Part-of-speech (POS) tags are then added to each candidate word in the list. Notably, adjectives and adverbs are more likely to carry sentiment, as shown in previous studies [64, 65], than verbs and nouns [6]. Therefore, the present study gives priority to adjectives and adverbs by placing them before verbs and nouns when sorting the candidate list. Figure 2 shows an example of how the text pre-processing steps are applied to prepare the corpus and the candidate word list.
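A sketch of one way to implement this filtering and POS-priority ordering follows; the `pos_tag` stub, the target-language vocabulary and the frequency cutoff are illustrative assumptions, and 'repeated or unusual' words are interpreted here as duplicates to collapse and rare, likely misspelled tokens to drop.

```python
from collections import Counter

POS_PRIORITY = {"ADJ": 0, "ADV": 1, "VERB": 2, "NOUN": 3}   # adjectives and adverbs first

def pos_tag(word):
    """Placeholder POS tagger; a real system would use a tagger for the target language."""
    demo_tags = {"good": "ADJ", "quickly": "ADV", "run": "VERB", "school": "NOUN"}
    return demo_tags.get(word, "NOUN")

def build_candidate_list(lemmatized_tokens, target_vocab, min_freq=2):
    """Filter lemmatized tokens and order them so sentiment-rich POS classes come first."""
    freq = Counter(lemmatized_tokens)
    candidates = {
        w for w, c in freq.items()
        if w in target_vocab        # drop words outside the target language
        and w.isalpha()             # drop symbols, URLs and numbers
        and c >= min_freq           # drop rare tokens, which are often misspellings
    }
    # sort by POS priority, then alphabetically
    return sorted(candidates, key=lambda w: (POS_PRIORITY.get(pos_tag(w), 4), w))

tokens = ["good", "good", "school", "school", "run", "run", "xzq1", "quickly", "quickly"]
print(build_candidate_list(tokens, target_vocab={"good", "school", "run", "quickly"}))
# -> ['good', 'quickly', 'run', 'school']
```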

Fig. 2 Resource preparation example

3.2 Sentiment orientation identification

The sentiment orientation of the candidate words was identified using the seed lexicon and the pre-processed corpus, based on the relationships between the previously known polarity words (seeds) and the 'new' words (candidates).

Figure 3 outlines the five steps performed to determine the sentiment orientation of the candidate words, described as follows:

Fig. 3 The framework steps of building sentiment lexicons for non-English languages

  1. A new candidate word is selected from the candidate word list prepared in the previous phase (Sect. 3.1.3).

  2. The candidate word is used to search the corpus for any documents that contain it.

  3. The seed lexicon is used to identify the polarity words in those documents.

  4. The polarity values (SWP) of the known seed words are collected, the numbers of seed words (N) and documents (D) are counted, and the sentiment orientation of the candidate word (P) is calculated using Eq. (1).

  5. The final polarity class is determined by Eq. (2), with T as the threshold value.

The polarity equation (i.e. Eq. 1) is formulated on the principle that a negative word occurs more frequently alongside negative seed words and thus obtains a negative score, whereas a positive word occurs most often in the vicinity of positive seed words and thus obtains a positive score [66]. Equation (1) therefore relies on seed polarity values to predict the overall sentiment orientation of the candidate word, which is accomplished by aggregating the values of the seed words in the documents and dividing by the total number of seed words.

$${\text{Candidate}}\;{\text{polarity}}\,\left( w \right) = \frac{{\sum {\text{Polarity}}\;{\text{of}}\;{\text{nearby}}\;{\text{polar}}\;{\text{words}}}}{{\sum {\text{Nearby}}\;{\text{polar}}\;{\text{words}}}}*{\text{The}}\;{\text{number}}\;{\text{of}}\;{\text{documents}}$$
$$P\left( w \right) = \left( {\frac{1}{N} \mathop \sum \limits_{i = 1}^{N} \left( {SWP_{i } } \right)} \right)*\frac{D}{D'}$$
(1)
$$C\left( w \right) = \left\{ {\begin{array}{*{20}l} {{\text{Positive }}\;{\text{if }}\quad P\left( w \right) \ge \left( { + T} \right)} \hfill \\ {{\text{Negative }}\;{\text{if}} \quad P\left( w \right) \le \left( { - T} \right)} \hfill \\ {\text{Neutral else}} \hfill \\ \end{array} } \right.$$
(2)

In these equations [4], w is a candidate word and P(w) is the candidate word's polarity value, calculated by summing the polarity values of the seed words (SWP) found in the same documents and dividing by the number of seed words (N) found. The result is then multiplied by the number of selected documents (D) and divided by the total number of documents (D′), which minimizes noise due to misspellings: the more frequently a word occurs across multiple documents, the more likely it is a polarity word. In some cases, however, a candidate word is repeated numerous times in only a single document (i.e. by a single author), suggesting that the word is likely to be a misspelling.

For example, assume the candidate word is 'lawl', the seed lexicon = [(good, 1); (happy, 1); (success, 1); (mistake, − 1); (win, 1)], where positive = 1 and negative = − 1, and the three selected documents, drawn from a corpus of 1000 documents, are as follows:

  • "Lawl! That's good news!",

  • "My mom will be happy because of my success! lawl!!",

  • "Lawl… despite their mistakes, they will win".

Based on the given seed lexicon, the seed word polarities are SWP = (1, 1, 1, − 1, 1), the count of seed words is N = 5, the count of selected documents is D = 3, and the total number of documents is D′ = 1000. The final result is calculated as follows:

$$p\left( {Lawl} \right) = \left( {\frac{1}{N} \mathop \sum \limits_{i = 1}^{N} \left( {SWP_{i } } \right)} \right)*\frac{D}{D'} = \left( {\frac{{1 + 1 + 1 + \left( { - 1} \right) + 1}}{5}} \right)*\frac{3}{1000} = 0.0018$$

At first glance, the word could be classified as either positive (if P is positive) or negative (if P is negative). However, doing so merely adds unnecessary noise to the lexicon by including weak polarity words. Therefore, the threshold (T) takes both a positive and a negative value: if P(w) is greater than or equal to the positive threshold (+ T), the word is considered positive; if it is less than or equal to the negative threshold (− T), it is considered negative; otherwise, it is neutral.
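The snippet below transcribes Eqs. (1) and (2) into code and reproduces the worked 'lawl' example (P ≈ 0.0018, which falls inside the neutral band for the thresholds reported later in Sect. 3.3.2); the data representation is only one possible choice.

```python
def candidate_polarity(seed_polarities, n_selected_docs, n_total_docs):
    """Eq. (1): mean polarity of the seed words found near the candidate,
    scaled by the fraction of documents in which the candidate occurs."""
    if not seed_polarities:
        return 0.0
    return (sum(seed_polarities) / len(seed_polarities)) * (n_selected_docs / n_total_docs)

def polarity_class(p, pos_threshold, neg_threshold):
    """Eq. (2): map the polarity value P(w) to a class using the thresholds +T and -T."""
    if p >= pos_threshold:
        return "positive"
    if p <= neg_threshold:
        return "negative"
    return "neutral"

# Worked example from the text: five seed words found in 3 of 1000 documents
swp = [1, 1, 1, -1, 1]                 # polarity values of the seed words (SWP)
p = candidate_polarity(swp, n_selected_docs=3, n_total_docs=1000)
print(round(p, 4))                                                   # 0.0018
print(polarity_class(p, pos_threshold=0.003, neg_threshold=-0.004))  # neutral
```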

3.3 Experiments and evaluation

This section presents the experiments to evaluate the proposed method, with explanations provided for data collection, pre-processing and the evaluation of the lexicon.

3.3.1 Data collection and pre-processing

Two primary resources were first collected: (1) a target language corpus and (2) seed lexicons. As mentioned in Sect. 1, the study focuses on the Arabic language, and the chosen domain was news media.

Data were collected from Facebook, a social networking site with more than 1 billion active users globally. These users subscribe to various services and often voice opinions and views through online posts, using the platform as a communication tool to interact with friends, relatives and user groups [67]. Specifically, the data were fetched from six official Arabic news pages on Facebook (identities withheld for confidentiality purposes). An application programming interface (API) script was used to access and collect the data, resulting in a total of 20,816 posts (equivalent to 507,529 tokens) collected between 3 August 2017 and 23 August 2017.

The unannotated corpus was then cleaned by deleting posts that contained URL links and other symbols, as these are often associated with spam, advertisements or irrelevant comments. Words and characters other than Arabic were also removed, followed by the removal of Arabic stop words, which was done by comparing the corpus with lists available on the Internet. To lemmatize the corpus, FARASA [68], a fast and reliable Arabic text processing toolkit, was used to convert all words to their root or dictionary forms. Candidate words were then extracted, and the final corpus contained 10,219 documents; 90% (9219) of the documents were used as the training set and the remaining 10% (1000) as the testing set. The total number of tokens was 507,529. Table 2 shows the steps taken to clean the tokens and obtain an appropriate set of candidate words.

Table 2 The steps of preparing the candidate list, with the number of tokens at each step

Three English lexicons readily available on the Internet were collected and used to prepare the seed lexicon, as shown in Tables 3 and 4 and Fig. 4. The three lexicons were AFINN [29], MPQA (Multi-Perspective Question Answering) [47] and Bing Liu's Opinion Lexicon [46]. The Google Translate tool was used to translate the English lexicons into Arabic. Owing to the nature of generic translation, it is common for several synonyms to be translated to the same word; for example, the words wonderful, terrific, marvellous, gorgeous and fabulous are all translated to the most frequently used synonym in Arabic, "رائع". Therefore, in the proposed method, sentiment words are collected through analysis of the corpus, since the corpus contains the words that people actually use on the Internet.

Table 3 The English lexicons used to build the seed lexicon
Table 4 Pre-processing stages of the seed lexicons
Fig. 4 Pre-processing stages of the seed lexicons

This phase generates three outputs: (1) seed lexicon, (2) pre-processed corpus and (3) candidate words.

3.3.2 Experimental procedure

A candidate word was first selected from the candidate word list and used as a parameter to query the pre-processed corpus of documents. The seed lexicon was then used to identify the polarity words and their polarity values in the retrieved documents. The total polarity value of the candidate word was calculated using Eq. (1) in Sect. 3.2. These steps were repeated for the rest of the words in the candidate word list.
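This procedure can be summarized by the loop below, which reuses the hypothetical `candidate_polarity` and `polarity_class` functions from the sketch in Sect. 3.2; the thresholds are supplied once they have been chosen, as described in the next paragraph. This is a simplified sketch of the procedure, not the exact implementation used in the study.

```python
def build_lexicon(candidates, documents, seed_lexicon, pos_t, neg_t):
    """Score every candidate word against the pre-processed corpus.

    `documents` is a list of lemmatized token lists and `seed_lexicon` maps
    seed words to polarity values (+1 / -1)."""
    n_total = len(documents)
    lexicon = {}
    for word in candidates:
        matching = [doc for doc in documents if word in doc]           # step 2: documents containing the candidate
        seed_values = [seed_lexicon[t] for doc in matching             # steps 3-4: seed polarities in those documents
                       for t in doc if t in seed_lexicon]
        p = candidate_polarity(seed_values, len(matching), n_total)    # Eq. (1)
        lexicon[word] = (p, polarity_class(p, pos_t, neg_t))           # Eq. (2)
    return lexicon
```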

The initial results included many polarity words and neutral words of varying degrees. In the next step, the threshold value (T) was set manually to determine the boundary between polarity words and neutral words; the threshold was determined based on where words lacking sentiment began to appear. The present study set the positive threshold (+ T) = 0.003 and the negative threshold (− T) = − 0.004, with any values in between treated as neutral. Our calculation yielded 1340 positive words, 3239 negative words and 1777 neutral words. Figure 5 illustrates the distribution of polarity for these words, indicating that the number of negative words is far greater than the number of positive words.

Fig. 5 The distribution of the polarity and neutral words

3.3.3 Evaluation

A total of 1000 posts (the testing set described in Sect. 3.3.1) were randomly selected and manually labelled as positive, negative or neutral. The lexicon scoring method was adopted for classification: in any document, if the number of positive words was higher than the number of negative words, the document was classified as positive; otherwise, it was classified as negative. The resulting lexicon was then used as the basis for the classifier. A confusion matrix [69] with four measures, accuracy (A), precision (P), recall (R) and F measure (F), was used to assess the performance of the proposed method, based on the following equations:

$$A = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} ,\;\; P = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}, \;\; R = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} , \;\;F = 2 \cdot \frac{P \cdot R}{P + R}$$
(3)

where TN (true negatives) are instances that should be negative and were classified as negative; TP (true positives) are instances that should be positive and were classified as positive; FN (false negatives) are instances that should be positive but were classified as negative; and FP (false positives) are instances that should be negative but were classified as positive [70]. Table 5 presents an example of calculating the evaluation measures based on the confusion matrix. Seven cases were selected randomly from the test dataset, which had been annotated manually as positive or negative (i.e. the actual class); the predicted class indicates the classification produced by our method. For performance assessment, the results are aggregated as shown in Table 6 before applying Eq. (3).

Table 5 An example of calculating the evaluation measures based on the confusion matrix
Table 6 The confusion matrix of the example in Table 5

The final result for this example is as follows:

$${\text{Accuracy}} = \frac{2 + 3}{2 + 3 + 2 + 0} = 0.71$$
$${\text{Precision }} = \frac{2}{2 + 2} = 0.5$$
$${\text{Recall }} = \frac{2}{2 + 0} = 1$$
$$F\;{\text{measure}} = 2. \frac{0.5*1.0}{0.5 + 1.0} = 0.67$$
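For concreteness, the sketch below implements the lexicon scoring rule and the measures of Eq. (3), and reproduces the seven-case example above using one labelling consistent with its confusion matrix (2 TP, 3 TN, 2 FP, 0 FN); the exact cases of Table 5 are not shown here, so the lists are illustrative.

```python
def classify_document(tokens, lexicon):
    """Lexicon scoring: positive if positive words outnumber negative words, else negative."""
    pos = sum(1 for t in tokens if lexicon.get(t) == "positive")
    neg = sum(1 for t in tokens if lexicon.get(t) == "negative")
    return "positive" if pos > neg else "negative"

def evaluate(actual, predicted):
    """Accuracy, precision, recall and F measure from Eq. (3), with 'positive' as the positive class."""
    tp = sum(a == p == "positive" for a, p in zip(actual, predicted))
    tn = sum(a == p == "negative" for a, p in zip(actual, predicted))
    fp = sum(a == "negative" and p == "positive" for a, p in zip(actual, predicted))
    fn = sum(a == "positive" and p == "negative" for a, p in zip(actual, predicted))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# One labelling of the seven cases consistent with 2 TP, 3 TN, 2 FP, 0 FN
actual    = ["positive", "positive", "negative", "negative", "negative", "negative", "negative"]
predicted = ["positive", "positive", "negative", "negative", "negative", "positive", "positive"]
print([round(x, 2) for x in evaluate(actual, predicted)])   # [0.71, 0.5, 1.0, 0.67]
```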

Because of the scarcity of Arabic sentiment lexicons and their lack of public availability, we compared our method with two lexicons, namely AraSenTi [35] and Arabic-NRC [71]. Moreover, we translated three English lexicons to compare the translation-based methods with our method. As mentioned in Sect. 2, previous studies have built some sentiment lexicons manually, but we could not find any publicly available manually built Arabic lexicon to compare with ours. A brief description of the lexicons follows:

  • T_MPQA is the translated copy of the MPQA sentiment lexicon, translated by Google Translate.

  • T_OL is the translated copy of Bing Liu's opinion lexicon, translated by Google Translate.

  • T_AFINN is the translated copy of the AFINN sentiment lexicon, translated by Google Translate.

  • Hybrid 3 is the combination of the three translated lexicons T_MPQA, T_OL, and T_AFINN.

  • The Unannotated Corpus-Based Sentiment Lexicon (UCBSL) is the proposed sentiment lexicon developed by the method outlined in Sect. 3 of this study.

  • Hybrid 4 is the combination of (Hybrid 3) and the proposed lexicon (UCBSL).

  • AraSenTi (Arabic) is a large-scale Arabic sentiment lexicon generated from a large dataset for social network sentiment analysis [35].

  • Arabic-NRC (sentiment lexicon): the NRC emotion lexicon contains emotional English words divided by their POS (adjectives, verbs, nouns and adverbs) and their positive and negative sentiments [71]; the Arabic version was translated by its authors.

Table 7 and Fig. 6 list the number of polarity entries in those sentiment lexicons.

Table 7 The numbers of positive and negative entries in the examined lexicons
Fig. 6 The numbers of positive and negative entries in the examined lexicons (except AraSenTi)

3.3.3.1 Human evaluation

The generated Arabic lexicon was manually checked and validated by five professional Arabic linguists. Each linguist was provided with a copy of 200 randomly sampled words from the lexicon and asked to identify the positive and negative words. The results of the manual validation were compared with the polarity values generated automatically by our method. The average accuracy was 81%, meaning that about 19% of the entries produced by our method may be incorrect. This outcome indicates that the proposed automatic non-English sentiment lexicon builder is capable of producing a good quality lexicon. Table 8 lists some polarity words from our lexicon with their sentiment orientation and frequency.

Table 8 Sample sentiment words from our lexicon

4 Results and discussion

4.1 Experimental results

Table 9 and Fig. 7 show the performance of the new lexicon compared with the other sentiment lexicons discussed in Sect. 3.3.3. The results show that the new sentiment lexicon outperformed the others, achieving an F measure of 0.74. The closest lexicon in terms of F measure was Hybrid 4, which attained 0.69; the Hybrid 4 lexicon combines the new lexicon with the three seed lexicons T_MPQA, T_OL and T_AFINN. None of the translated lexicons achieved an F measure exceeding 0.67, including Arabic-NRC, the sentiment lexicon translated by its own authors.

Table 9 The performance results of our lexicon compared to a number of sentiment lexicons
Fig. 7 The performance results of our lexicon compared to a number of sentiment lexicons

The AraSenTi sentiment lexicon contains many polarity words generated through translation and the PMI statistical equation; however, its F measure did not exceed 0.57. This shows that the size of a lexicon is not always beneficial and may even be an issue, given that lexicon size is considered a major challenge in sentiment analysis [72]. We expected Hybrid 4 to achieve a higher F measure, as it contains both the new lexicon (UCBSL) and the three translated lexicons (Hybrid 3); however, this was not the case. This is probably because the translation process behind Hybrid 3 introduced incorrect or inaccurate polarity words originating from the seed lexicons. For example, the word "terrible" is translated to the Arabic word "رهيب" in Hybrid 4, whereas Arab users generally use this word informally to express a positive sentiment rather than a negative one. Therefore, the translation process in Hybrid 3 may have affected the performance of Hybrid 4. This is also observed in Fig. 7, where Hybrid 3 produced the lowest performance compared to UCBSL and Hybrid 4.

The results differ at the class level, as depicted in Table 10. Across all the lexicons, the precision and F measure for the negative class were much better than for the positive class. For instance, for the new lexicon, the F measure for the negative class was 0.87, while it was only 0.60 for the positive class. This is probably because the seed lexicon used in the study contained more negative words than positive words.

Table 10 The test results for both positive and negative classes

The low recall values of the translated (seed) lexicons for the negative class indicate that the classic lexicons clearly suffer from coverage problems and incorrect polarity values when used to classify social media data. On the other hand, our lexicon (UCBSL) suffers from a low recall of 0.51 for the positive class, probably owing to the lack of positive words in the corpus used. We therefore added the seed lexicons to our automatically generated lexicon to improve the recall, as shown by Hybrid 4 (recall = 0.78).

Table 11 presents the intersection of the words in the new lexicon with those in existing lexicons. The table indicates that the UCBSL lexicon contains new entries not available in the translated lexicons: the rate of agreement between the UCBSL lexicon and Hybrid 3 is 21%, meaning that about 79% of the UCBSL entries are new.

Table 11 The intersection of the words of the obtained lexicons with other lexicons

As shown in Table 9, the UCBSL lexicon outperformed the available Arabic lexicons built by dictionary- and corpus-based methods. Given the difficulty of building manual lexicons for each language and domain, the proposed method facilitates the construction of new lexicons, or the expansion and adaptation of existing ones, in a much more effective and easier manner. The methodology used in this study thus supports methods that use co-occurrence-based measures to find implicit relationships in unstructured data. Moreover, this study demonstrated that related words can, in fact, determine the polarity of a sentiment word in the same context. An important criterion for the adoption of a 'new' word is the coverage or spread of the word amongst users and its repetition: a word is deemed worthless if it is not commonly used amongst users or not frequently repeated across documents. Furthermore, the methodology used in the experiments is language independent and can therefore be applied to many other languages.

5 Conclusions

One of the major limitations in sentiment classification for non-English languages is that most existing annotated corpora are in English [16]. Many limitations and shortcomings reported in prior studies were addressed in the present study by using available resources and minimizing the amount of human effort spent on data labelling. This was accomplished by developing an automatic method for building non-English sentiment lexicons using two types of available and relatively cheap resources, namely an unannotated target language corpus and English seed sentiment lexicons. The proposed method was evaluated using Arabic posts gathered from Arabic news media on Facebook. The evaluation results showed that the new Arabic lexicon produced the highest accuracy, with an F measure of 0.74, compared to translated lexicons and other Arabic lexicons.

The study is not without limitations. Dealing with non-English languages presents a number of difficulties, such as the limited size of resources or their lack of public availability [44]. Moreover, the scarcity of data pre-processing tools for some languages continues to be a concern; this affected the study in that only a limited number of Arabic sentiment lexicons were available for comparison with the new lexicon. The performance of the proposed method may also have been affected by the nature of the language itself, as Arabic speakers frequently write in multiple dialects, incurring numerous spelling and typographical errors [73]. In addition, diacritics (i.e. marks used to represent vowel sounds in Arabic script) are often used in formal Arabic communication [74]; they were not analysed in the current study. Future studies could compare diacritized Arabic with informal Arabic communication, such as that found on social media.

The current study evaluated the proposed method on a single language only, so the assessment is limited. It would be interesting to examine the performance of the proposed method in other languages, such as French or Dutch. Future studies could also explore other performance measures, such as the time complexity of generating the lexicons automatically. Finally, social media features such as emoticons and hashtags were not included in the present study; this is another avenue for future work.