Background

Well-grounded knowledge of popular opinion is essential in the decision-making process, both for consumers and executives of companies and organizations pertinent to the industry in question. Currently, with the advent of Web 2.0, a vast amount of content reflecting opinion is being generated. However, the process of analyzing opinions presents a challenge due to the large quantity of documents, opinion polls, and the conflicting viewpoints on any given subject. Therefore, there is evident demand for the retrieval and analysis of Web comments or reviews. In recent decades, AI researchers have sought to endow machines with cognitive capabilities to recognize, interpret, and express emotions and sentiments [1].

Endowing machines with cognitive capabilities to recognize, interpret, and express emotions and sentiments has been among the most important topics in the field of artificial intelligence.

In recent years, natural language processing studies have become more oriented toward opinion mining. An important function of opinion mining is the classification of documents according to an overall sentiment, whether it be positive or negative. Sentiment analysis is a major topic in Affective Computing research [1]. Initial studies on opinion mining frequently attempted to classify the opinions or overall sentiments of a document as either positive or negative feedback [2].

These classifications do not address all aspects of opinions that contain subtle linguistic forms, the simultaneous expression of positive and negative nuances, and implicit judgments based on explicit ones [3]. Consequently, NLP must be supplemented by cognitive and social perspectives to resolve such issues.

Researchers then tried to determine the degree of satisfaction or dissatisfaction expressed in the document, instead of the previous two-state classification [4]. A considerable complication at this level is the erroneous assumption that the topic in question is the same throughout a text or document, whereas different parts of a document (different reviews) may deal with varying issues.

It is, therefore, essential to identify the topics within different sections individually rather than analyzing the overall sentiment of reviews collectively. Consequently, some researchers have conducted analyses on sentiment at the sentence level [5] or semantic phrase level [6]. At this level, a subjectivity analysis is carried out to distinguish between subjective and objective sentences (e.g., irrefutable facts, news reports). Neither document-level nor sentence-level analyses can reveal the target of the opinion. To obtain results at this finer granularity, we need to go to the aspect level (aspect-based sentiment analysis) [7]. New-generation opinion mining and NLP techniques [8,9,10] integrate knowledge resources and biologically inspired paradigms with statistical approaches in order to understand and extract concepts from texts.

In recent years, many studies, conducted in this same fashion, have focused on non-English languages, specifically Spanish, Chinese, German, Czech, and Arabic [11,12,13,14,15]. Relatively new approaches to multilingual opinion mining are currently being developed [16]. Most research in this area is document-level sentiment analysis or the sentence-level sentiment analysis (subjectivity analysis) [17, 18]. However, few studies have addressed aspect-based opinion mining [19, 20]. These methods mostly employ machine translators based on relatively simple ideas in order to utilize a set of sentiment lexicons from other resources and some English text processing tools for the intended language [16, 21,22,23].

As the importance of opinion mining for identifying public opinion increases, particularly in the commercial sector, accuracy in classification becomes more vital. Therefore, some recent studies have investigated the effects of several aspects of data representation and feature selection methods on sentiment classification in various languages, such as English, Arabic, and Czech [15, 24,25,26]. In these studies, the feature vectors of the reviews were preprocessed using different methods, and the resulting effects on the accuracy of different classifiers were discussed.

The results obtained from these methods are, however, not accurate enough to be applied to other languages due to the differences in their syntactic rules and grammar, sentiment idioms or terms, and other intricacies of natural language. Persian is an Indo-European language spoken by more than one hundred million people worldwide, and it is the official language of three countries, namely Iran, Afghanistan (where it is known as Dari), and Tajikistan (where it is known as Tajik). From a computational standpoint, little attention has been paid to the Persian language due to its complexities and limited resources [27, 28].

Determining feature sets is particularly challenging and yet highly significant when it comes to biologically inspired machine learning approaches. To the best of our knowledge, no study has investigated the impact of sentiment lexicons and NLP tools on sentiment polarity classification in Persian text. The main aim of this paper is to analyze the impact of several preprocessing tools and sentiment lexicons on sentiment classification from different viewpoints. To achieve this, the required text-processing tools, a comprehensive Persian WordNet (FerdowsNet), a Persian SentiWordNet, and a Persian corpus for opinion mining were first developed. Next, an in-depth analysis of different methods of feature selection and sentiment classification of Persian review texts was carried out.

In the following subsections, previous research on sentiment analysis as well as some sentiment lexicon generation methods are discussed in further detail.

Sentiment Classification Methods

In general, sentiment classification methods can be categorized into two groups: (i) methods using a sentiment lexicon or background knowledge (unsupervised or semi-supervised learning) and (ii) methods using supervised machine learning algorithms.

Recently, the first group of methods has attracted many researchers. The accuracy of this approach depends entirely on the accuracy of sentiment lexicon weights [29]. These methods are usually unsupervised and, therefore, independent of specific domains [30]. To classify the sentiments, techniques for assessing the semantic similarities between words should be applied. Here, the semantic similarities of expressions and a short list of initial sentiment words are used to classify the sentiments within reviews. In general, three methods can be used to assess semantic similarity:

  1. Ontology-based methods (e.g., employing WordNet, ConceptNet, or other dictionaries and encyclopedias) [31, 32];

  2. Extraction of syntactic dependency relations between sentiment candidate words and the words of the given lexicon [33, 34];

  3. Extraction of the co-occurrence of sentiment candidate words and the words of the initial list (unsupervised learning methods) in sentences from various corpora (user reviews, blogs, web pages, or search engines) [35, 36].
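The co-occurrence approach in the third item can be illustrated with a minimal sketch. It assumes that pointwise mutual information (PMI), estimated from sentence-level co-occurrence, is the association measure, and that a candidate word is scored by its association with positive seeds minus its association with negative seeds. The function names, the toy corpus, and the treatment of unseen pairs are illustrative assumptions, not the procedure of any cited work.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """Return a function pmi(a, b) estimated from sentence-level co-occurrence."""
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each word once per sentence
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    def pmi(a, b):
        joint = pair_count[tuple(sorted((a, b)))]
        if joint == 0:
            return 0.0  # assumption: unseen pairs are treated as independent
        return math.log2(joint * n / (word_count[a] * word_count[b]))

    return pmi


def semantic_orientation(word, pos_seeds, neg_seeds, pmi):
    """Association with positive seeds minus association with negative seeds."""
    return sum(pmi(word, s) for s in pos_seeds) - sum(pmi(word, s) for s in neg_seeds)


# Toy corpus (English stand-ins for review sentences).
corpus = [
    ["camera", "excellent", "sharp"],
    ["screen", "sharp", "excellent"],
    ["battery", "poor", "weak"],
    ["speaker", "weak", "poor"],
    ["price", "fair"],
]
pmi = pmi_table(corpus)
so_sharp = semantic_orientation("sharp", ["excellent"], ["poor"], pmi)
so_weak = semantic_orientation("weak", ["excellent"], ["poor"], pmi)
```

In this toy corpus, "sharp" receives a positive score and "weak" a negative one, because each co-occurs only with the seed of the matching polarity.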

In the second group, a set of opinion documents (reviews), labeled with negative or positive sentiments, is provided as the training data. Then, each document is represented as a vector of features. Several features have been used for this purpose. In previous publications, words along with their frequency, n-grams, POS tags of words [37, 38], sentiment words and phrases, the place of each word in the document [39], negative words, and syntactic dependency [40] are given to the classifier algorithm as the input features of each sentence. Afterwards, the classifier algorithm is trained on the training set and a model is built. Finally, this model is used to determine the sentiments of other opinions (test set).

To improve the quality and efficiency of machine learning algorithms, superior features should be selected. Many feature selection methods (described in detail in the Feature Engineering section) could be applied for this purpose.

Numerous studies use several classification algorithms such as Naive Bayes (NB), Maximum Entropy (ME), and particularly, Support Vector Machine (SVM) to classify the polarity of reviews [41, 42]. These methods are used in text classification applications as well. In this case, instead of considering keywords and concepts, the expressed sentiments are used for classification.
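As an illustration of the supervised pipeline described above (labeled reviews, feature vectors, training, prediction), the following is a minimal sketch of a multinomial Naive Bayes classifier over word counts. It uses only the Python standard library; the class name, the add-one smoothing, and the toy English training data are assumptions made for the example and do not reproduce the setup of any cited study.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed value)

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (English stand-ins for labeled reviews).
train_docs = [
    ["great", "camera", "love", "it"],
    ["excellent", "screen", "great", "battery"],
    ["terrible", "battery", "hate", "it"],
    ["poor", "screen", "terrible", "sound"],
]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict(["great", "screen"])` returns the label whose smoothed word likelihoods best explain the tokens; an SVM or Maximum Entropy classifier would consume the same kind of feature representation.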

In most research, opinions are simply divided into two categories, namely positive and negative sentiments [43]. Some researchers have considered an additional category for neutral opinions. Some have even implemented a user-provided rating criterion for opinions (e.g., 1 to 5 stars). These ratings could be employed to build the training dataset. Therefore, in most cases, sentiment classification is considered at document or sentence level.

Sentiment Lexicon Generation Methods

Automatically generating a sentiment lexicon is a major task in the field of sentiment analysis. Detecting the polarity of a sentiment and its intensity without human assistance is difficult for a machine. Therefore, in sentiment analysis methods, experts first provide a list of primary sentiment terms, along with the numeric values that determine their intensity. The presence of these sentiment terms in a sentence is an important feature for sentiment classifiers.

As mentioned above, most lexicon-based sentiment classifiers use knowledge bases, such as WordNet, to measure polarity. The semantic relations (e.g., synonym, antonym) between words are used to form a graph. At this point, by using the initial sentiment seed words, the polarity is propagated in the graph via various methods, such as shortest path, random walk, PageRank, boot-strapping, and classifiers [44,45,46,47,48].
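The propagation idea can be sketched as follows. This is a simplified breadth-first variant of shortest-path propagation, written under several assumptions: synonym edges preserve the sign of the polarity, antonym edges flip it, and the magnitude decays by a fixed factor per hop. The function name, the decay value, and the toy graph are illustrative; the cited methods (random walk, PageRank, and so on) differ in detail.

```python
from collections import deque


def propagate_polarity(edges, seeds, decay=0.5):
    """Spread seed polarities over a synonym/antonym graph.

    edges: iterable of (word_a, word_b, relation), relation in {"syn", "ant"}.
    seeds: dict mapping seed words to polarities in [-1, 1].
    """
    graph = {}
    for a, b, rel in edges:
        sign = 1.0 if rel == "syn" else -1.0
        graph.setdefault(a, []).append((b, sign))
        graph.setdefault(b, []).append((a, sign))

    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for neighbor, sign in graph.get(word, []):
            if neighbor not in polarity:  # first visit follows a shortest path
                polarity[neighbor] = polarity[word] * sign * decay
                queue.append(neighbor)
    return polarity


# Toy graph (English stand-ins for WordNet relations).
edges = [
    ("good", "fine", "syn"),
    ("fine", "decent", "syn"),
    ("good", "bad", "ant"),
    ("bad", "awful", "syn"),
]
scores = propagate_polarity(edges, {"good": 1.0})
```

Words closer to a seed receive stronger polarity, and words reached through an antonym edge receive the opposite sign.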

The SentiWordNet lexicon [49] is one of the best available resources to identify sentiment words. It is generated by determining the sentiment weight of each synset in Princeton WordNet (PWN). SentiWordNet specifies the polarity (negativity, positivity, or objectivity) of each synset. SentiWordNet v1.0 [50] was created in four steps using a semi-supervised learning algorithm:

  1. The positive and negative polarities of a limited number of initial synsets are manually indicated and are propagated for relevant synsets.

  2. Some objective synsets are also selected and the glosses of the synsets specified in the previous step are used for the learning phase of the classification method as training data.

  3. Using a classification algorithm, other synsets are labeled as “neg,” “pos,” and “obj.”

  4. In order to reduce the classification algorithm error, synsets provided in step 2 are used to train several classification algorithms (step 3). The results are then combined.

The initial version (SentiWordNet v1.0) was improved in SentiWordNet v3.0 using the iterative random walk algorithm and WordNet 3.0 graph [49]. SentiWordNet has been employed in many opinion mining applications as a sentiment lexicon, independent of the domain and subject [3, 23, 31]. Also, SentiWordNet and WordNet-Affect [51] have been employed to develop other sentiment lexicons such as SentiFul [52], SenticNet [53, 54], SentislangNet [55], and WSD-based SentiWordNet [29]. Moreover, utilizing the link between SentiWordNet and PWN combined with the link between PWN and non-English WordNets, SentiWordNet has been used in numerous opinion mining applications in other languages as well [56, 57].

Persian language opinion mining studies have predominantly used small and manually-created sentiment lexicons [58, 59]. However, some researchers simply translated English sentiment lexicons to Persian [60, 61]. Dehkharghani et al. proposed a new method to create a Persian sentiment lexicon (UTIIS) using a Persian WordNet (FarsNet v1.0) and English resources [47]. They manually created the primary sentiment words (seeds) using English resource translation (Micro-WNOp corpus [62]). They also used the random walk method to propagate the weights of the seed words to determine the weights of the remaining words in the semantic graph of FarsNet. As FarsNet v1.0 was sparse and incomplete, it did not fulfill the requirements of their study. Therefore, they extended their Persian sentiment lexicon (UTIIS) based on PWN synsets. The synsets not covered by FarsNet were included into UTIIS by translating the related synsets in PWN. UTIIS consists of 1815 positive sentiment words and 1856 negative sentiment words organized into three groups of Persian nouns, adjectives, and verbs [47]. Dashtipour et al. [63] developed a new lexicon for Persian language called Per-Sent. The lexicon contains 1500 Persian words accompanied by their respective polarity values, based on a numeric scale ranging from −1 to +1, and their part-of-speech tags. The words and phrases used in Per-Sent were taken from multiple resources, such as movie review websites, blogs, and Facebook. The majority of the values in Per-Sent were assigned manually.

The remainder of this paper is structured as follows. The Methods section introduces the general architecture of sentiment classification systems. The methods of constructing the Persian sentiment lexicon in this study are then described. In the same section, the popular features in the literature and various state-of-the-art methods of selecting superior features for the sentiment classification of reviews are expanded upon. In the Experimental Results section, the quality assessment results of several text processing tools, various Persian WordNets, and the state-of-the-art methods of sentiment classification for different features are presented and compared. The final section is the conclusion.

Methods

In order to classify reviews, they first must be pre-processed. Then, with the help of the constructed sentiment lexicon, their features are extracted. After converting the text into numeric vectors, superior features of opinions are identified and selected using feature selection methods. Finally, applying different classification methods, positive and negative opinions are separated. The architecture of the sentiment classification system is shown in Fig. 1.

Normalization, segmentation of text (into sentences, phrases, and words), tagging, and annotation have significant impacts on the processing and extraction of information, classification, and other applications of natural language processing [24]. This paper utilizes the Ferdowsi Persian text processing tools. The tools were developed for non-commercial use and are available on the website of the Web Technology Laboratory of Ferdowsi University. In the rest of this section, first, a few studies aimed at Persian WordNet construction are introduced. Then, the proposed approach for sentiment lexicon generation is discussed. Two approaches are applied in the present study to extract sentiment words; they are explained here, and the quality of their results is compared in the following sections. In the first method, the links between FerdowsNet (described in the section on constructing a comprehensive Persian WordNet) and English WordNet synsets, together with the sentiment weights of existing words in the SentiWordNet dictionary, are used for Persian sentiment lexicon generation (Fig. 1).

Fig. 1 Persian sentiment classification architecture

In the second method, experts first labeled a set of reviews with sentiment tags and other specified tags. Then, using a learning algorithm based on HMM (Hidden Markov Model), patterns for expressing sentiment phrases were found and the list of these words was extended. More details on this approach (PSWM) are included in the Persian Sentiment Word Miner (PSWM) section. Additionally, the state-of-the-art methods for extracting and selecting superior features are presented.

Sentiment Lexicon Generation

Sentiment lexicon generation is an essential part of detecting sentiment and its intensity. In previous studies, two methods have been applied to generate lexicons of sentiment words: (1) development or translation of expressions from the available lexicons [56, 64, 65]; (2) expansion of the list of seed sentiment words using a knowledge base (e.g., WordNet) or a corpus of opinions (statistical approach) and other linguistic resources [49, 57, 66, 67]. In this paper, a combination of both approaches is used by creating a complete WordNet for Persian and establishing links between its concepts and those of the English WordNet.

Persian WordNet

Princeton WordNet (PWN) is an electronic lexical database for the English language. It is comprised of a natural language vocabulary in the form of synonymous sets (synsets), which are classified into categories according to their parts of speech, such as verbs, nouns, adjectives, and adverbs. These synsets are connected to each other by semantic relations, such as synonymy, antonymy, hypernymy, hyponymy, and meronymy. The latest version of PWN contains approximately 155,327 words, which were organized into 117,597 synsets. WordNet has recently been used in some papers to extract sentiment words and features [31, 44]. Researchers have made attempts to automatically or semi-automatically construct a Persian WordNet [68,69,70,71,72,73]. However, FarsNet [71] and PersianWN [70] are the only published Persian WordNets which are available to use.

FarsNet

FarsNet (the first Persian WordNet) is a lexical database that contains information on words and language combinations (concepts), their syntactic information (POS), and the semantic relations between them. The latest version of this database (FarsNet version 2.0) is available to researchers. The EuroWordNet concepts were used to create the WordNet in this study. That is, the initial core of Persian WordNet was first produced manually and then it was completed in a top-down process using a semi-supervised method. The initial core of FarsNet was developed with the help of translating BalkaNet concepts and some common Persian concepts. It was then completed using a semi-supervised method, various Persian resources, and bilingual resources (Persian-English) [71].

Persian WordNet of Tehran University (PersianWN)

The latest version of Persian WordNet, provided by Tehran University [70], is available on the multilingual WordNet website. It was created by running an unsupervised Expectation Maximization (EM) algorithm on a text corpus and the English WordNet (PWN). FarsNet version 1.0 is used to calculate the initial probability of each word in each synset. Then, the iterative EM method is employed to maximize the probability of each word.

Constructing a Comprehensive Persian WordNet (FerdowsNet)

As shown in Table 7, the current Persian WordNets contain an insufficient number of synsets. Also, the synsets have a small overlap with PWN in English. Furthermore, the semantic relations between synsets in the Persian WordNets are fewer than those in the PWN. The inadequacy of the current Persian WordNets called for the development of a new Persian WordNet (FerdowsNet), which covers most synsets and semantic relations in the PWN.

To construct FerdowsNet, the following language resources and knowledge bases were used:

  • Princeton WordNet

  • Various bilingual dictionaries (English-Persian)

  • Google Text Translator

  • The Wikipedia encyclopedia and the Yago ontology [74], used to link Wikipedia and PWN

  • Persian corpora (several newsgroups and Persian Wikipedia)

  • Persian encyclopedias and dictionaries

  • Pre-existing Persian WordNets.

The construction of FerdowsNet consists of nine steps (Fig. 2):

  1. Step 1:

    All synset words are translated by different bilingual dictionaries.

  2. Step 2:

    A bipartite graph is formed, in which there is a node on the left side (Xi) for each English word (synset words) and a node on the right side (Yi) for each Persian word (list of translations). Then, each English word xi is connected to its Persian translation yj by an edge in the form of (xi,yj) between them. The weight of each edge depends on the number of occurrences of the words in the translation list (by various dictionaries) and their translation rating (in most available dictionaries, translated words are sorted according to their importance).

  3. Step 3:

    In this step, Persian words related to a synset are first extracted from Wikipedia knowledge bases and other Persian WordNets. Then, using the Yago ontology [74], concepts relevant to the synset words selected from Wikipedia and the equivalent of these concepts in Persian are extracted (if a Persian Wikipedia page for that concept is available). Then, using the links between FarsNet, Persian WN, and PWN, synset words equivalent to the corresponding English synsets are extracted (if any equivalent synset is available in these WordNets).

  4. Step 4:

    Using the words extracted from the previous step, translated words are extended (adding new words to the right of the bipartite graph) or the weight of the edges related to the translated words in the bipartite graph (formed in Step 2) is modified.

  5. Step 5:

    Using the Hungarian algorithm, the best match (the most appropriate Persian equivalent of each English word) is extracted from the weighted bipartite graph, and the list of candidate words S for this synset is obtained. Then, Persian words that were not selected and whose relation weights (relation with one of the English words) exceed the average weight of the selected ones are chosen as second candidate translations.

  6. Step 6:

    The gloss and the example of each synset are translated using Google Translator.

  7. Step 7:

    The required preprocessing of the translated text is performed and the keywords are extracted.

  8. Step 8:

    Words that are synonymous with or equivalent to the words selected in Step 5 (word set S) are extracted, using the PMI measure [2, 75], from Persian corpora (Hamshahri online newspaper, Alef and Tabnak [76,77,78], and the contents of Persian Wikipedia pages) and from available Persian dictionaries and encyclopedias.

  9. Step 9:

    The words extracted in the previous step, whose similarities with the words of S are more than a certain threshold, are added to the list of final words.

Fig. 2 The system architecture for construction of the synsets of FerdowsNet

After translating and extending the synsets, the relations between synsets in PWN are used for the relations between FerdowsNet synsets.
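Steps 2 and 5 can be illustrated with a small sketch. It makes several assumptions that are not stated in the text: the edge weight is taken as the sum of reciprocal ranks of a translation across dictionaries, and an exhaustive search over assignments stands in for the Hungarian algorithm (both return a maximum-weight matching, but the exhaustive version is only practical for a handful of words). The function names and the transliterated placeholder translations are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def edge_weights(dictionaries, english_words):
    """dictionaries: list of {english_word: [translations ordered by importance]}."""
    weights = defaultdict(float)
    for dictionary in dictionaries:
        for en in english_words:
            for rank, fa in enumerate(dictionary.get(en, []), start=1):
                weights[(en, fa)] += 1.0 / rank  # assumed scoring: reciprocal rank
    return dict(weights)


def best_matching(english_words, persian_words, weights):
    """Exhaustive maximum-weight matching (stand-in for the Hungarian algorithm).

    Assumes len(english_words) <= len(persian_words).
    """
    best, best_score = None, float("-inf")
    for chosen in permutations(persian_words, len(english_words)):
        pairs = list(zip(english_words, chosen))
        score = sum(weights.get(pair, 0.0) for pair in pairs)
        if score > best_score:
            best, best_score = dict(pairs), score
    return best, best_score


def second_candidates(matching, weights):
    """Unselected words whose edge weight exceeds the average selected weight."""
    selected = set(matching.values())
    average = sum(weights.get(pair, 0.0) for pair in matching.items()) / len(matching)
    return sorted({fa for (_, fa), w in weights.items()
                   if fa not in selected and w > average})


# Two hypothetical dictionaries with transliterated placeholder translations.
dict_a = {"car": ["mashin", "khodro"], "automobile": ["khodro", "otomobil"]}
dict_b = {"car": ["mashin"], "automobile": ["otomobil"]}
synset_words = ["car", "automobile"]

weights = edge_weights([dict_a, dict_b], synset_words)
persian_words = sorted({fa for _, fa in weights})
matching, score = best_matching(synset_words, persian_words, weights)
extra = second_candidates(matching, weights)
```

For larger synsets, a polynomial-time assignment solver would replace the exhaustive search; the weighting scheme would also need to follow the actual dictionary ratings.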

Construction of Persian Sentiment Lexicon

As mentioned, this paper employs two methods to construct the sentiment lexicon. In the first method, after construction of FerdowsNet and establishment of links to PWN, the polarity of each synset can be obtained using SentiWordNet. Thus, a Persian sentiment lexicon can be constructed by translating SentiWordNet. However, this method is subject to two types of errors. The first arises from ambiguity in synset translation when constructing the Persian WordNet. The second is that the polarities specified for synsets in SentiWordNet themselves contain errors, which further decreases the accuracy of the Persian sentiment lexicon.

To resolve this, in the second method (PersianSentiWordMiner or PSWM), the sentiment lexicon is extracted by a semi-supervised learning method (without the use of SentiWordNet).

Persian Sentiment WordNet (PSWN)

To develop the Persian Sentiment WordNet, concepts (synsets) in Princeton WordNet are first mapped onto their equivalent synsets in FerdowsNet. Then, the calculated polarity for each synset in English SentiWordNet is mapped to its corresponding synsets in the Persian SentiWordNet (PSWN).

Persian Sentiment WordNet can be used as a comprehensive sentiment lexicon for Persian. The obtained Persian sentiment lexicon is derived from the sentiment words whose polarity is more than 0.5. Moreover, given the degree of confidence for the words in each synset in FerdowsNet, there will be confidence of correctness for each sentiment word in addition to positive and negative polarity. The number of sentiment words with different POS tags and confidence is shown in Table 1.

Table 1 Positive (Pos#) and negative (Neg#) sentiment words in PS

The effect of confidence on the precision and recall of FerdowsNet and its impact on quality of sentiment lexicon are demonstrated in the assessment (evaluation) section.
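The mapping described in this subsection can be sketched as follows. The data structures are assumptions made for illustration: synset identifiers such as "good.a.01" are hypothetical, the scores are invented, and the rule for a word that appears in several synsets (keep the strongest polarity) is not taken from the text. Only the 0.5 polarity threshold and the idea of carrying a per-word confidence come from the description above.

```python
def build_persian_lexicon(links, swn_scores, confidence, threshold=0.5):
    """Project SentiWordNet polarities onto Persian words.

    links: {pwn_synset_id: [persian words in the linked synset]}
    swn_scores: {pwn_synset_id: (positive score, negative score)}
    confidence: {(pwn_synset_id, persian word): confidence in [0, 1]}
    """
    lexicon = {}
    for synset_id, words in links.items():
        pos, neg = swn_scores.get(synset_id, (0.0, 0.0))
        if max(pos, neg) <= threshold:
            continue  # keep only synsets whose polarity exceeds the threshold
        polarity = pos if pos >= neg else -neg
        for word in words:
            entry = (polarity, confidence.get((synset_id, word), 0.0))
            # Assumption: if a word occurs in several synsets, keep the strongest.
            if word not in lexicon or abs(entry[0]) > abs(lexicon[word][0]):
                lexicon[word] = entry
    return lexicon


# Hypothetical links, scores, and confidences.
links = {"good.a.01": ["خوب", "نیک"], "bad.a.01": ["بد"], "table.n.01": ["میز"]}
swn_scores = {"good.a.01": (0.75, 0.0), "bad.a.01": (0.0, 0.625), "table.n.01": (0.0, 0.0)}
confidence = {("good.a.01", "خوب"): 0.9, ("good.a.01", "نیک"): 0.6, ("bad.a.01", "بد"): 0.95}

lexicon = build_persian_lexicon(links, swn_scores, confidence)
```

Each lexicon entry pairs a signed polarity with the confidence of the underlying translation, so downstream components can filter on either value.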

Persian Sentiment Word Miner (PSWM)

The PSWM is an HMM-based sequence learning method employed to extract the sentiment lexicon after manually tagging some reviews. The methodology is similar to that of OpinionMiner [79]. After manually labeling opinions, the sequence learning approach based on HMM is applied to extract the sentiment words (rather than sentences as in OpinionMiner [79]).

In PSWM, words that are often used to express sentiment or to change polarity, mostly adjectives and adverbs, are labeled with special tags. To this end, a number of sentences in texts about a specific subject (digital products) are labeled manually with the tags specified in Table 2, using the tagger tools provided for this purpose. Experts specified the polarity (rating) of sentiment words and the total polarity of each sentence (a number between −5 and 5). Finally, the words with tags other than the ones shown in Table 2 are labeled as <BG> (background word).

Table 2 Sentimental tags and their descriptions

In Persian, negative forms of verbs, usually indicated by adding the prefix “ن”/N/ to the beginning of the verb, are often used to reverse the polarity of a sentence. Negative verbs are detected in the preprocessing phase by the lemmatizer and are automatically tagged as “Reverse.”

After tagging reviews, the tagged sentences are used to develop the set of sentiment words according to the PSWM algorithm. Before the learning method is applied, a list of sentiment seed words is extracted from those tagged as sentiment words (positive or negative). In order to extract new sentiment expressions in the opinion corpus, the list of existing sentiment words is then expanded through the semantic relations in FerdowsNet (Fig. 3).

Fig. 3 Sentiment lexicon generation (PSWM) algorithm

Next, the learning algorithm is executed to expand the list of sentiment words in a corpus of unlabeled reviews. The Viterbi approach [80] is used in order to implement the HMM learning method and select the best path with the maximum score. The purpose of the learning method is to find the most probable tag {<BG>, Pos, Neg, Reverse, Decrease, Intensity, Feature} for each word in a given sentence. Words with tags such as Reverse, Decrease, and Intensity are extracted from a predetermined list of words from the training corpus which was labeled manually. The list of words is assumed to be fixed in the current study, due to the limited number of these types of words in Persian. Thus, the challenge is to determine tags {Feature, Neg, Pos, <BG>} for each unlabeled word, which does not have Intensity, Reverse, or Decrease labels or any of the POS tags, such as Number, Delim, and Prep in the sentence. Details of this algorithm are presented below.

PSWM Algorithm

  1. The synset related to the list of words with sentiment polarity tags (positive and negative) is extracted from FerdowsNet. The list of sentiment words is then extended by those related to the selected synset in FerdowsNet.

  2. The HMM learning method is trained using tagged sentences by reviewers or in the previous iteration of the algorithm.

  3. Part of the unlabeled opinion texts is randomly selected from the review corpus.

  4. The words of an unlabeled review sentence are POS tagged.

  5. A set of tags {Feature, Intensity, Decrease, Reverse, Neg, Pos} from the initial SentiWordNet and the current sentiment lexicon for words, extracted from the previous iteration of the algorithm, are considered and if there is a word with a corresponding tag it is labeled accordingly.

  6. Other unlabeled words of the reviews are labeled by the HMM learning algorithm with Feature, Neg, Pos, and <BG> tags.

  7. If there are new sentiment words among the labels, the list of sentiment words will be updated and the algorithm will return to the first stage. Otherwise, the algorithm will terminate.
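The decoding used in step 6 can be illustrated with a minimal Viterbi sketch for a first-order HMM. The probability tables below are invented for the example, English words stand in for Persian tokens, and the small probability floor for unseen events is an assumption; in the algorithm above, these tables would instead be estimated from the manually tagged sentences.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for a non-empty list of words."""
    def lp(p):
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    score = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0))
              for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        score.append({})
        back.append({})
        for t in tags:
            prev, best = max(
                ((p, score[i - 1][p] + lp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda item: item[1],
            )
            score[i][t] = best + lp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    last = max(tags, key=lambda t: score[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Illustrative tables; real values would come from the tagged training corpus.
tags = ["<BG>", "Feature", "Pos", "Neg"]
start_p = {"<BG>": 0.7, "Feature": 0.2, "Pos": 0.05, "Neg": 0.05}
trans_p = {
    "<BG>": {"<BG>": 0.4, "Feature": 0.3, "Pos": 0.15, "Neg": 0.15},
    "Feature": {"<BG>": 0.7, "Feature": 0.1, "Pos": 0.1, "Neg": 0.1},
    "Pos": {"<BG>": 0.7, "Feature": 0.2, "Pos": 0.05, "Neg": 0.05},
    "Neg": {"<BG>": 0.7, "Feature": 0.2, "Pos": 0.05, "Neg": 0.05},
}
emit_p = {
    "<BG>": {"the": 0.5, "is": 0.5},
    "Feature": {"screen": 0.6, "battery": 0.4},
    "Pos": {"great": 0.7, "good": 0.3},
    "Neg": {"poor": 0.6, "bad": 0.4},
}
decoded = viterbi(["the", "screen", "is", "great"], tags, start_p, trans_p, emit_p)
```

Words decoded as Pos or Neg that are not yet in the lexicon would be the candidates added in step 7 before the next iteration.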

Feature Engineering

Generation and selection of relevant features of a dataset play vital roles in improving the quality and efficiency of machine learning methods. Feature selection is especially essential for high-dimensional data, such as text, gene expression data, images, audio, and video [81, 82]. The objective of feature selection is to extract a set of relevant features from natural language texts before they are passed to the sentiment classification methods. In general, feature selection is performed with two objectives: (1) increasing the efficiency and speed of classification methods by reducing the size of the data (number of dimensions), which is particularly important for classification methods whose training phase is costly in time or memory (such as SVM), and (2) enhancing generalization by removing redundant or irrelevant features, since irrelevant input features may lead to overfitting [83].

Features

In the current paper, the list of features applied to classify sentiments includes the following:

N-Gram Features

This feature has been the baseline in most related research [6, 15]. In this paper, various unigrams, bigrams, and trigrams are applied as feature sets. In order to reduce the feature space, an informal-to-formal word converter, a normalizer, and a stemmer were developed. The feature space is pruned by a minimum n-gram occurrence, empirically set to five.
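A minimal sketch of this feature set is given below. It assumes that "occurrence" means the total count of an n-gram in the corpus (document frequency would be an equally plausible reading) and that tokens have already been normalized and stemmed; the function names and toy documents are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, max_n=3, min_count=5):
    """Collect unigrams to max_n-grams that occur at least min_count times."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return sorted(g for g, c in counts.items() if c >= min_count)


def vectorize(tokens, vocabulary, max_n=3):
    """Map one document to a count vector over the pruned vocabulary."""
    present = Counter()
    for n in range(1, max_n + 1):
        present.update(ngrams(tokens, n))
    return [present[g] for g in vocabulary]


# Toy corpus (English stand-ins for preprocessed review tokens).
docs = [["good", "phone"]] * 5 + [["bad", "phone"]] * 2
vocabulary = build_vocabulary(docs)
vector = vectorize(["bad", "phone"], vocabulary)
```

In the toy corpus, "bad" and "bad phone" occur only twice and are pruned, so the second review is represented only through the surviving "phone" feature.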

TFIDF-Based Word Weighting

Instead of merely using word presence, a variety of Delta-TFIDF-based weighting schemes [84] have been implemented and tested, such as Augmented TF, LogAve TF, BM25 TF, BM25+ TF, Delta smoothed IDF, Delta Prob. IDF, Delta smoothed Prob. IDF, and Delta BM25 IDF. In Delta-TFIDF, a simple TFIDF weighting method is applied separately for each class (positive and negative) [85].
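The idea of computing IDF separately per class can be sketched as follows. This is one simple smoothed formulation chosen for illustration: the weight of a term is its raw frequency times the difference between its IDF in the positive set and its IDF in the negative set. The add-one smoothing and the sign convention (negative values indicate terms typical of positive reviews) are assumptions; the variants listed above differ in how TF and IDF are computed.

```python
import math
from collections import Counter


def delta_idf(pos_docs, neg_docs):
    """Per-term difference between class-specific, add-one-smoothed IDF values."""
    df_pos = Counter(t for doc in pos_docs for t in set(doc))
    df_neg = Counter(t for doc in neg_docs for t in set(doc))
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    return {
        t: math.log2((n_pos + 1) / (df_pos[t] + 1))
           - math.log2((n_neg + 1) / (df_neg[t] + 1))
        for t in set(df_pos) | set(df_neg)
    }


def delta_tfidf_vector(tokens, idf):
    """Weight each known term by its raw frequency times its delta IDF."""
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


# Toy training sets (English stand-ins).
pos_docs = [["great", "phone"], ["great", "screen"]]
neg_docs = [["poor", "phone"], ["poor", "battery"]]
idf = delta_idf(pos_docs, neg_docs)
weights = delta_tfidf_vector(["great", "great", "phone"], idf)
```

Terms that are equally common in both classes, such as "phone" here, receive a weight of zero and therefore contribute nothing to the classifier input.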

Character N-Gram Features

Similar to the word n-gram features, character n-gram features are also applied, following the approach in [86]. Character 3-grams to 6-grams found in the opinion texts are used as a feature set. The minimum occurrence of a particular character n-gram is set to five, according to the corpus size, in order to prune the feature space.
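A short sketch of character n-gram extraction with the same pruning rule is shown below. It assumes that n-grams are taken over the raw text string (so they may span word boundaries) and that the minimum occurrence is a total count; both choices, and the function names, are illustrative.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=6):
    """Return all character n-grams of text for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def char_vocabulary(texts, min_count=5):
    """Keep character n-grams that occur at least min_count times overall."""
    counts = Counter(g for text in texts for g in char_ngrams(text))
    return {g for g, c in counts.items() if c >= min_count}


sample = char_ngrams("abcd")
vocab = char_vocabulary(["good"] * 5 + ["goo"])
```

Strings shorter than a given n simply contribute no n-grams of that length.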

Part-of-Speech (POS) Tag Feature

Given the various meanings and usages of words in different parts of speech, the POS tag feature is used to classify sentiments in most related studies. In this paper, only words with noun, verb, adjective, and adverb tags are used. Additionally, the number of occurrences of each POS tag is considered (similar to [87]).

Sentiment Words Features

The extracted sentiment word lexicon is one of the key features. The polarity of the sentiment words in each sentence is calculated. However, the polarity of sentiment phrases may change after analyzing the sentiment reverse or intensifier tags. In order to calculate the overall sentiment of a review, the polarities of different sentences are averaged. Features related to words that intensify, reduce, or reverse the sentiment are also considered in this calculation. These words directly affect the polarity or sentiment intensity; therefore, it is necessary to apply them in sentiment analyses. In this study, elongated words and repeated punctuation are also treated as features, similar to relevant state-of-the-art features used to analyze sentiments on Twitter (such as the STATE features) [87, 88]. In this approach, the collection of sentiment features is obtained from the semi-supervised method (PSWM).
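The calculation described above can be sketched as follows. The combination rules are assumptions made for illustration: an Intensity or Decrease word scales only the next sentiment word, a Reverse word flips the polarity of the whole sentence, and the scaling factors 1.5 and 0.5 are arbitrary. Only the tag names and the averaging over sentences come from the text.

```python
def sentence_polarity(tagged_words, lexicon, boost=1.5, damp=0.5):
    """tagged_words: list of (word, tag) pairs using the PSWM tag names."""
    total, scale, reverse = 0.0, 1.0, False
    for word, tag in tagged_words:
        if tag == "Intensity":
            scale *= boost
        elif tag == "Decrease":
            scale *= damp
        elif tag == "Reverse":
            reverse = not reverse
        elif tag in ("Pos", "Neg"):
            total += scale * lexicon.get(word, 0.0)
            scale = 1.0  # assumed: a modifier affects the next sentiment word only
    return -total if reverse else total


def review_polarity(sentences, lexicon):
    """Average the sentence polarities of one review."""
    scores = [sentence_polarity(s, lexicon) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0


# Toy lexicon and tagged sentences (English stand-ins).
lexicon = {"good": 2.0, "bad": -2.0}
s1 = [("very", "Intensity"), ("good", "Pos")]
s2 = [("good", "Pos"), ("not", "Reverse")]
overall = review_polarity([s1, s2], lexicon)
```

Treating Reverse as a sentence-level flip fits the case of a negated verb at the end of a sentence, but a finer scope for negation may be more appropriate in practice.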

Bi-Tagged Feature

This feature was proposed by [2] to extract the relevant features for expressing sentiments in English. These features include pre-defined patterns of common collocations used to express a sentiment. They have been employed in most related works to classify sentiments [89, 90]. In the present study, a set of common bi-tags is used to express a sentiment in Persian. A list of these patterns, along with some examples, is compiled in Table 3.

Table 3 Common bi-tagged patterns to express sentiments in Persian

SWN Subjectivity Scores (SWNSS)

The SWNSS method [91] uses the weights assigned to the words in SentiWordNet (SWN) to calculate their subjectivity. Given a specified threshold, objective words (unigram features) and words that do not exist in SWN are removed from the opinion texts. In the current study, the subjectivity threshold for sentiment phrases is set to 0.22, based on the conducted experiments. Moreover, the SWNPD (SWN Proportional Difference) method, which applies the positive or negative polarity in SentiWordNet, has also been proposed [91]. Similar to the Proportional Difference method, the polar words (negative or positive) are selected and the others are removed from the features. However, it was shown that the SWNPD method is less efficient than SWNSS. Thus, only the SWNSS method is used with the created Persian Sentiment WordNet (PSWN).
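The filtering step can be sketched as follows. This is a minimal illustration, assuming that the subjectivity of a word is the sum of its positive and negative SentiWordNet-style scores and that a word is kept when this sum reaches the threshold; the function name `swnss_filter` and the toy lexicon are hypothetical.

```python
def swnss_filter(tokens, lexicon, threshold=0.22):
    """Keep tokens whose subjectivity (pos + neg score) reaches the threshold."""
    kept = []
    for tok in tokens:
        if tok not in lexicon:
            continue  # words absent from the lexicon are removed
        pos, neg = lexicon[tok]
        if pos + neg >= threshold:  # assumed definition of subjectivity
            kept.append(tok)
    return kept

# Hypothetical (positive, negative) scores.
lexicon = {"excellent": (0.75, 0.0), "table": (0.0, 0.0), "poor": (0.0, 0.625)}
kept = swnss_filter(["excellent", "table", "poor", "unknown"], lexicon)
```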

Word2vec Cluster N-Grams (W2VC)

Similar to the methodology applied in [92], the words of the review corpus are reduced to 100-dimensional vectors using the Word2vec tool [93] (footnote 11). The K-means clustering method is then used to cluster 100,000 words (within the input corpus) into 5000 clusters. These clusters are used to represent words (n-grams).

Sentiment-Specific Word Embedding (SSWE)

Tang et al. improved the Word2vec model to propose a new method for word representation [94] (footnote 12). It was shown that sentiment classification using SSWE for the conversion of sentiment features into a continuous space yields better results than other similar methods, such as Word2Vec_Skip-gram [93] and ReEmb [95]. They succeeded in increasing the accuracy of the sentiment classifier in the Coooolll system by combining the SSWE features with other common features (the STATE features used at NRC [87]) for opinion mining.

Feature Selection

Various methods have been proposed to select features in text and sentiment classification [15, 96, 97, 98]. Feature extraction methods that transfer features into a new space with fewer dimensions have also been proposed and implemented. However, [6] has shown that feature extraction methods for sentiment classification, such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), are less accurate than feature selection methods, such as Information Gain (IG). Thus, the current study uses only feature selection methods. In this section, a variety of state-of-the-art supervised and unsupervised approaches for feature engineering are introduced.

The notation used in this section is presented in Table 4. \( {c}_w \), \( {c}_{\overline{w}} \), \( {\overline{c}}_w \), and \( {\overline{c}}_{\overline{w}} \), respectively, denote the number of documents in class c that contain the word (feature) w; the number of documents in class c that do not contain word w; the number of documents in \( \overline{c} \) that contain word w; and the number of documents in class \( \overline{c} \) that do not contain word w. \( {n}_c \) and \( {n}_{\overline{c}} \) represent the number of documents in the c and \( \overline{c} \) classes, respectively. N (i.e., \( {n}_c+{n}_{\overline{c}} \)) is the total number of documents.

Table 4 Representation of notation

Mutual Information (MI)

In information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. For feature selection, the MI metric compares the probability of a feature (word) occurring in the target class with the probability of its overall occurrence, and is calculated as follows [99]:

$$ {MI}_w=\frac{c_w\times N}{\left({c}_w+{c}_{\overline{w}}\right)\left({c}_w+{\overline{c}}_w\right)} $$
(1)
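A direct transcription of Eq. (1) into code may help in checking values by hand. In this sketch, the arguments `c_w`, `c_not_w`, `nc_w`, and `nc_not_w` stand for \( {c}_w \), \( {c}_{\overline{w}} \), \( {\overline{c}}_w \), and \( {\overline{c}}_{\overline{w}} \); the names and the handling of a zero denominator are assumptions of the sketch.

```python
def mutual_information(c_w, c_not_w, nc_w, nc_not_w):
    """MI score of Eq. (1) computed from the four document counts."""
    n = c_w + c_not_w + nc_w + nc_not_w      # N, total number of documents
    denom = (c_w + c_not_w) * (c_w + nc_w)   # n_c times the documents containing w
    return c_w * n / denom if denom else 0.0

independent = mutual_information(10, 40, 10, 40)   # w is equally frequent in both classes
class_only = mutual_information(20, 30, 0, 50)     # w occurs only in class c
```

A value of 1 indicates that the word is distributed independently of the class, while larger values indicate that the word is over-represented in class c.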

Information Gain (IG)

The Information Gain metric measures the number of bits of information obtained for predicting the category (class) by knowing the presence or absence of a feature (word) in the text [100]. The Information Gain value of a feature is calculated as follows:

$$ {IG}_w=-\left(P(c){\log}_2P(c)+P\left(\overline{c}\right){\log}_2P\left(\overline{c}\right)\right)-\left[P(w)\left(-P\left({c}_w\right){\log}_2P\left({c}_w\right)-P\left({\overline{c}}_w\right){\log}_2P\left({\overline{c}}_w\right)\right)+P\left(\overline{w}\right)\left(-P\left({c}_{\overline{w}}\right){\log}_2P\left({c}_{\overline{w}}\right)-P\left({\overline{c}}_{\overline{w}}\right){\log}_2P\left({\overline{c}}_{\overline{w}}\right)\right)\right] $$
(2)

where

$$ {\displaystyle \begin{array}{l}P\left({c}_w\right)=\frac{c_w}{c_w+{\overline{c}}_w}\kern0.5em P\left({\overline{c}}_w\right)=\frac{{\overline{c}}_w}{c_w+{\overline{c}}_w}\kern0.5em P\left({c}_{\overline{w}}\right)=\frac{c_{\overline{w}}}{c_{\overline{w}}+{\overline{c}}_{\overline{w}}}\kern0.5em P\left({\overline{c}}_{\overline{w}}\right)=\frac{{\overline{c}}_{\overline{w}}}{c_{\overline{w}}+{\overline{c}}_{\overline{w}}}\\ {}P(w)=\frac{c_w+{\overline{c}}_w}{N}\kern0.5em P\left(\overline{w}\right)=1-P(w)=\frac{c_{\overline{w}}+{\overline{c}}_{\overline{w}}}{N}\kern0.5em P(c)=\frac{n_c}{N}\kern0.5em P\left(\overline{c}\right)=\frac{n_{\overline{c}}}{N}\end{array}} $$
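Using the probabilities defined above, Information Gain can be computed as the class entropy minus the expected class entropy after observing whether w is present. The sketch below uses this standard entropy-difference form; the argument names mirror \( {c}_w \), \( {c}_{\overline{w}} \), \( {\overline{c}}_w \), and \( {\overline{c}}_{\overline{w}} \) and are otherwise hypothetical.

```python
import math

def _entropy(*probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(c_w, c_not_w, nc_w, nc_not_w):
    """Class entropy minus the expected class entropy given presence/absence of w."""
    n = c_w + c_not_w + nc_w + nc_not_w
    n_w, n_not_w = c_w + nc_w, c_not_w + nc_not_w
    h_class = _entropy((c_w + c_not_w) / n, (nc_w + nc_not_w) / n)
    h_w = _entropy(c_w / n_w, nc_w / n_w) if n_w else 0.0
    h_not_w = _entropy(c_not_w / n_not_w, nc_not_w / n_not_w) if n_not_w else 0.0
    return h_class - (n_w / n * h_w + n_not_w / n * h_not_w)

perfect = information_gain(50, 0, 0, 50)        # w fully determines the class
useless = information_gain(25, 25, 25, 25)      # w carries no class information
```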

Chi-square (CHI) and Variants

Chi-square (\( {\chi}^2 \)) is one of the common statistical metrics for measuring the independence between a feature and a class and is used to select the superior features. Ng et al. [101] proposed a variant of \( {\chi}^2 \) called the NGL (Ng-Goh-Low) coefficient. They demonstrated that the feature selection results of the NGL approach for text classification are, in some cases, better than those of \( {\chi}^2 \). Moreover, Galavotti et al. presented a simplified form of \( {\chi}^2 \) called the GSS (Galavotti-Sebastiani-Simi) coefficient [102]. They reported that GSS can produce better results than NGL and \( {\chi}^2 \) [102].

$$ {GSS}_w={c}_w{\overline{c}}_{\overline{w}}-{\overline{c}}_w{c}_{\overline{w}} $$
(3)
$$ {NGL}_w=\frac{\sqrt{N}\ {GSS}_w}{\sqrt{\left({c}_w+{c}_{\overline{w}}\right)\left({\overline{c}}_w+{\overline{c}}_{\overline{w}}\right)\left({c}_w+{\overline{c}}_w\right)\left({c}_{\overline{w}}+{\overline{c}}_{\overline{w}}\right)}} $$
(4)
$$ {\chi^2}_w={\left({NGL}_w\right)}^2 $$
(5)
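Equations (3) to (5) can be transcribed into code as follows. This is a minimal sketch in which the argument names stand for \( {c}_w \), \( {c}_{\overline{w}} \), \( {\overline{c}}_w \), and \( {\overline{c}}_{\overline{w}} \); returning zero when a marginal count is zero is an assumption of the sketch.

```python
import math

def gss(c_w, c_not_w, nc_w, nc_not_w):
    """GSS coefficient of Eq. (3), in its count form."""
    return c_w * nc_not_w - nc_w * c_not_w

def ngl(c_w, c_not_w, nc_w, nc_not_w):
    """NGL coefficient of Eq. (4)."""
    n = c_w + c_not_w + nc_w + nc_not_w
    denom = ((c_w + c_not_w) * (nc_w + nc_not_w)
             * (c_w + nc_w) * (c_not_w + nc_not_w))
    if denom == 0:
        return 0.0  # degenerate table: a class or a word state never occurs
    return math.sqrt(n) * gss(c_w, c_not_w, nc_w, nc_not_w) / math.sqrt(denom)

def chi_square(c_w, c_not_w, nc_w, nc_not_w):
    """Chi-square of Eq. (5), the square of the NGL coefficient."""
    return ngl(c_w, c_not_w, nc_w, nc_not_w) ** 2

score = chi_square(30, 20, 10, 40)
```

Unlike \( {\chi}^2 \), the NGL and GSS coefficients keep their sign, so a word that is indicative of the absence of class c receives a negative score rather than a high positive one.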

Relevancy Score (RS) and Odds Ratio (OR)

These two metrics are recognized statistical methods of feature selection that have been shown, in some cases, to yield better results in classifying texts than IG and MI [98, 103].

$$ {OR}_w=\frac{c_w{\overline{c}}_{\overline{w}}}{c_{\overline{w}}{\overline{c}}_w} $$
(6)
$$ {RS}_w=\frac{c_w}{{\overline{c}}_{\overline{w}}} $$
(7)
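Both scores follow directly from the counts, as the sketch below illustrates. Returning infinity for a zero denominator is an assumption of this sketch; in practice, a smoothing constant is often added to the counts instead.

```python
def odds_ratio(c_w, c_not_w, nc_w, nc_not_w):
    """Odds ratio of Eq. (6) from the four document counts."""
    denom = c_not_w * nc_w
    return c_w * nc_not_w / denom if denom else float("inf")

def relevancy_score(c_w, nc_not_w):
    """Relevancy score as written in Eq. (7)."""
    return c_w / nc_not_w if nc_not_w else float("inf")

or_value = odds_ratio(30, 20, 10, 40)
rs_value = relevancy_score(30, 40)
```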

Document Frequency (DF)

One of the popular methods of feature selection in text classification is to filter features according to the number of documents in the corpus that contain them [6]. Features that occur in fewer than a certain number of texts are removed. The document frequency (DF) of a particular feature is calculated using the following equation:

$$ {DF}_w={c}_w+{\overline{c}}_w $$
(8)

Categorical Proportional Difference (CPD)

The CPD method was proposed by [104] to determine the impact of each feature (unigram) in representing a class. The frequency of each feature is calculated separately for each class (positive or negative). Polarized words, which occur predominantly in one class, have a higher PD value, while words distributed equally across both classes have a lower PD value. The PD value for the positive and negative classes is calculated as follows:

$$ CPD=\frac{\left|{DF}_{+}-{DF}_{-}\right|}{DF_{+}+{DF}_{-}} $$
(9)
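Equation (9) can be checked with a few values, as in the sketch below. The function name `cpd` and the convention of returning zero for a word that occurs in no document are assumptions.

```python
def cpd(df_pos, df_neg):
    """Categorical proportional difference of Eq. (9)."""
    total = df_pos + df_neg
    return abs(df_pos - df_neg) / total if total else 0.0

polarized = cpd(5, 0)    # occurs only in positive reviews
mixed = cpd(30, 10)      # occurs three times as often in positive reviews
neutral = cpd(20, 20)    # equally distributed
```

The score ranges from 0 (equally distributed) to 1 (occurring in one class only), independently of how frequent the word is.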

In this paper, all of the feature selection methods mentioned above (CPD, CHI, GSS, IG, MI, NGL, OR, RS, and DF) have been applied to select features from the Persian reviews. The 2000 features with the highest weight scores were selected for sentiment classification. Empirically, selecting more than 2000 features did not notably affect the quality of the sentiment classifier for Persian reviews.
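The selection step itself amounts to ranking the features by score and keeping the top ones. A minimal sketch follows, assuming the scores are already available as a dictionary; the function name `select_top_k` and the alphabetical tie-breaking are hypothetical choices.

```python
def select_top_k(scores, k=2000):
    """Return the k features with the highest scores (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda feature: (-scores[feature], feature))
    return ranked[:k]

top = select_top_k({"a": 0.1, "b": 0.9, "c": 0.5}, k=2)
```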

Experimental Results

In this section, we introduce our dataset (review corpus) and then compare the quantitative and qualitative assessment of sentiment lexicons, in addition to various Persian WordNets. Finally, it is demonstrated how the sentiment classification of reviews is affected by the inclusion of several text processing tools, sentiment lexicons, and the latest methods of sentiment classification for different features.

Dataset

To assess the extracted sentiment words, a corpus of opinions from the Digikala website (footnote 13) was collected by a web crawler. Digikala, which is comparable to Amazon, is the fifth most visited website and the market leader in e-commerce in the Middle East (footnote 14). Owing to its high volume of users, Digikala has relatively rich comments. The collected corpus consists of 31,730 reviews on ten different types of products. About 3080 reviews, which have been tagged by experts, are used for training and assessing supervised machine learning algorithms; the rest have no sentiment tags. Some features of this dataset are listed in Table 5.

Table 5 Features of Digikala review corpus

There are three categories of reviews: “Expert Review,” “Reviews of Active Users,” and “Short Comments.” In Table 6, the features of each category and their differences are presented.

Table 6 A variety of opinions available on the corpus

Evaluation of FerdowsNet and PSWN

First, FerdowsNet is quantitatively assessed and compared with other Persian WordNets. In Table 7, the features of various Persian WordNets are shown.

Table 7 Quantitative assessment of Persian WordNets

To assess and compare FerdowsNet with other Persian WordNets, about 1000 synsets were first randomly selected from the English WordNet (250 synsets from each of the noun, verb, adjective, and adverb categories). Then, the equivalent words of these synsets in the different Persian WordNets were extracted, and their quality in the context of natural language processing was assessed by three experts.

To make a precise evaluation, the precision (referred to in the text as accuracy) and recall metrics for each WordNet were calculated separately by the experts. First, for each English synset, a reference set of equivalent Persian words is defined as \( {S}^{\ast }=\left\{{s}_1^{\ast },{s}_2^{\ast },{s}_3^{\ast },\dots, {s}_n^{\ast}\right\} \). Then, the set of words available in the WordNet for that synset is defined as \( {S}^{wn}=\left\{{s}_1^{wn},{s}_2^{wn},{s}_3^{wn},\dots, {s}_m^{wn}\right\} \). Finally, the accuracy and recall metrics were calculated for each synset, and their average (micro) was taken as the final accuracy and recall of the WordNet, using the following formulas:

$$ \mathrm{Recall}=\frac{\left|{S}^{\ast}\cap {S}^{wn}\right|}{\left|{S}^{\ast}\right|} $$
(10)
$$ \mathrm{Precision}=\frac{\left|{S}^{\ast}\cap {S}^{wn}\right|}{\left|{S}^{wn}\right|} $$
(11)
$$ {F}_1-\mathrm{Measure}=2\cdot \frac{\mathrm{precision}\times \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} $$
(12)

where \( \left|{S}^{\ast}\cap {S}^{wn}\right| \) represents the number of correct words among the words of the intended synset, \( \left|{S}^{\ast}\right| \) is the number of Persian words equivalent to the main concept, and \( \left|{S}^{wn}\right| \) stands for the number of words in the intended Persian WordNet synset.
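Equations (10) to (12) can be sketched with Python sets as follows. The helper names `synset_scores` and `average_scores` are hypothetical, and the simple per-synset averaging is an assumption about how the final figures are aggregated.

```python
def synset_scores(reference, candidate):
    """Precision, recall, and F1 of one WordNet synset against its reference set."""
    reference, candidate = set(reference), set(candidate)
    overlap = len(reference & candidate)          # |S* ∩ S^wn|
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def average_scores(pairs):
    """Average the per-synset scores over (reference, candidate) pairs."""
    scores = [synset_scores(ref, cand) for ref, cand in pairs]
    if not scores:
        return 0.0, 0.0, 0.0
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

p, r, f1 = synset_scores({"a", "b", "c", "d"}, {"a", "b", "x"})
```

In this toy example, two of the three candidate words are correct and two of the four reference words are found.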

For a quality evaluation and comparison of FerdowsNet with other Persian WordNets, 1000 synsets of the English WordNet (Princeton) were randomly selected. Then, the corresponding synsets in the Persian WordNets were extracted. Next, the words and their glosses, along with examples for each synset in the English WordNet, were prepared together with a list of the Persian vocabulary equivalent to that synset. A single-blind trial was then conducted. To ensure fairness, the Persian vocabulary list was presented in such a way that the participating experts were not informed which WordNet each word belonged to. Finally, with regard to natural language processing concepts and WordNet, the experts identified the wrong words of each synset. After the trial, the accuracy, recall, and F1-Measure rates for each synset were calculated, based on the tags applied by the experts. The results are shown in Table 8.

Table 8 The qualitative assessment of Persian wordnets

A quality assessment has also been calculated for different confidence intervals of the words in FerdowsNet synsets. In Table 7, conf represents the FerdowsNet confidence level. It is important to note that Persian WordNets contain no Persian synsets equivalent to some of the synsets available in the English WordNet. For this reason, in Table 7, the recall is reported twice: once with the recall of these groups counted as zero, and once with these groups left out of the recall average. For example, out of the 1000 synonymous groups selected from the English WordNet for assessment, there are only 147 equivalent synsets (about 15%) in FarsNet. The accuracy of the words in the existing synsets is about 0.898 and the overall recall (over all 1000 synsets) is 0.101. However, if the recall value is calculated only for the 147 synsets whose equivalents are available in FarsNet, the recall value of this WordNet is 0.695.

For quality evaluation of Persian Sentiment WordNet (PSWN), about 150 polar words were randomly extracted. Using the same technique, the accuracy and recall of PSWN are calculated and shown in Table 9.

Table 9 Qualitative assessment of the PSWN

During the calculation of the recall criterion, it was found that, among the 150 sentiment phrases, only 113 existed in PSWN. Therefore, the recall value of the sentiment WordNet is about 0.75. However, most of the sentiment phrases missing from the WordNet are related to informal language or to spelling errors in sentiment words that were not corrected by the text pre-processing tools. Also, as the polarity of 97 out of the 113 words in PSWN matches the sentiment determined by the experts, the accuracy is approximately 0.86. For words that exist in several synsets, the average polarity is considered. Some of the errors in PSWN stem from errors in FerdowsNet. In addition, the polarity specified for the words in SentiWordNet has some errors, which are propagated into PSWN and, as a result, reduce its accuracy [105].
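The two ratios quoted above follow directly from the counts given in the text (150 sampled phrases, 113 found in PSWN, 97 with matching polarity):

$$ \mathrm{Recall}=\frac{113}{150}\approx 0.753\kern2em \mathrm{Accuracy}=\frac{97}{113}\approx 0.858 $$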

Sentiment Classification Results

To classify the reviews, their texts are first pre-processed with the normalizer, sentence splitter and tokenizer, stop word removal, lemmatizer, informal-to-formal converter, and spell-check (footnote 15) tools. To identify the superior feature set, a variety of classification approaches are initially assessed over various features. The F-Measure metric and the tenfold cross-validation method are applied.

As shown in Table 10, four groups of features are extracted from the reviews. Owing to the high number of character n-gram features, the number of features in this set is reduced to 10,000 using the CPD feature selection method. In the word n-grams approach (FS3), the POS tags of words are considered.

Table 10 Feature set description

Different TFIDF weighting schemes were studied in order to apply the various features with different classification methods. Table 11 presents the results of these widely used TFIDF-based approaches. Global weighting methods, such as IDF (Inverse Document Frequency), GFIDF, Entropy, DeltaBM25Idf, DeltaSmoothIdf, and DeltaSmoothProbIdf, were considered, as were local weighting methods for the words in each document, such as TermFrequency (TF), LogAvg, Augnorm, and BM25Plus. Among the evaluated approaches, the LogAvg-DeltaSmoothIdf technique with Linear SVM produced the best results.

Table 11 A comparison of the quality (F-Measures (%)) of TF-IDF variants for the sentiment classification of Persian text reviews

The results of different classification methods are compared in Table 12. A KNN algorithm was used with K = 3 and log-distance-weighted nearest neighbors. Among the classification approaches, the Linear SVM method (using the LibLinear library [106], footnote 16) produces the best results.

Table 12 Results of various classification methods

Then, analytical tests are performed to assess the preprocessing methods. The impact of different text pre-processing tools on the average F-Measure of sentiment classification for different feature sets is shown in Fig. 4.

Fig. 4 The effect of text preprocessing tools on Persian sentiment classification

As previously mentioned, the sentiment lexicon list was created using two different techniques. Similarly, two sets of sentiment features are extracted from the reviews: the SentiWords (PSW) feature set and the SWNSS feature set.

The Persian SentiWords (PSW) feature set contains the number of Bi-tagged pattern features, sentiment expressions obtained from the PSWM method, as well as an intensifier, reverser, and reducer of the sentiment in the text. Contrarily, the SWNSS feature set is based on the sentiment words obtained from Persian Sentiment WordNet (PSWN).

To select the superior sentiment feature set, Fig. 5 compares the sentiment classification results for every feature set listed in Table 10, each combined with one of the sentiment feature sets (PSW and SWNSS).

Fig. 5 The comparison of the PSW and SWNSS methods

For extracting sentiment words, the results demonstrate that the semi-supervised learning method PSWM used in the PSW feature set is more efficient than the one based on the Persian sentiment WordNet PSWN. The reason is that the polar words of PSW for the target domain were extracted from reviews related to commercial products, but SWNSS (Persian Sentiment WordNet) is general and domain-independent.

Different feature selection methods for extracting 2000 superior features on the selected feature set were implemented and their results are presented in Table 13. In order to select non-binary features (such as TF-IDF feature set), features were converted to binary values using the Maximum Gini-Index method, similar to the one applied in binary decision trees.
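The paper does not detail the Maximum Gini-Index method, so the following is only a sketch of one plausible reading: as in a binary decision tree, the threshold that yields the largest reduction in Gini impurity is chosen, and the feature is then binarized at that threshold. The function names and the use of midpoints between observed values as candidate thresholds are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_gini_threshold(values, labels):
    """Return the split threshold with the largest Gini gain, or None if no split helps."""
    parent = gini(labels)
    best_t, best_gain = None, 0.0
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - weighted > best_gain:
            best_t, best_gain = t, parent - weighted
    return best_t

values = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]   # e.g. TF-IDF weights of one feature
labels = [0, 0, 0, 1, 1, 1]               # 1 = positive review, 0 = negative review
threshold = best_gini_threshold(values, labels)
binary = [int(v > threshold) for v in values]
```

Once binarized in this way, a weighted feature can be scored with the count-based selection metrics, which expect presence or absence of a feature in each document.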

Table 13 Impact of the sentiment feature set (PSW) on the quality (F-Measure) of LibLinear classification

As shown in Table 13, although reducing the dimensions via feature selection methods significantly improves the performance of the classification algorithms, it often decreases the quality of sentiment classification. For classifying product review corpora in Persian, CPD, RS, and MI are the most efficient feature selection methods when sentiment features are used, and CHI and RS are the most efficient when they are not.

The impact of each step of the sentiment classification process is shown in Fig. 6. Our final analysis revealed that the sentiment lexicon feature (PSW) has the greatest impact on the sentiment classification results. The impact rate of a classifier algorithm or feature set was calculated as the difference between the best and second-best results.

Fig. 6 Impact of the different steps of sentiment polarity classification in Persian reviews

Conclusion

In this paper, the impact of Persian NLP tools for preprocessing and sentiment classification was thoroughly examined. After developing the tools, a comprehensive Persian WordNet (FerdowsNet), and a corpus of Persian reviews, a new method (PSWN) of using English SentiWordNet was proposed for generating a Persian SentiWordNet. Moreover, the SentiWords lexicon was extracted using a semi-supervised PSWM method.

An in-depth analysis of different supervised machine learning methods was conducted to analyze the sentiments in commercial product data in Persian. In addition, a detailed assessment was carried out on a set of common state-of-the-art features and on various methods of feature selection and classification, and the impact of different pre-processing tools on different feature sets was studied. The best results for sentiment classification of commercial product reviews in Persian were obtained by combining informal-to-formal conversion and stop word removal with LogAvgTF-DeltaSmoothIDF weighting, sentiment features, and the Linear SVM classification method.

In addition to the development of the WordNet, the appropriate sentiment patterns, and the opinion mining corpus for the Persian language, the results of the current study could be used as a solid basis for selecting and using features, feature selection methods, and various classification approaches. The findings and developments made in this study could prove useful in the advancement of opinion mining research in Persian and other similar languages, such as Urdu and Arabic.

We are currently extending our research to aspect-based sentiment analysis, and preliminary results are encouraging. Our ultimate objective is to apply semantics in the sentiment analysis of comments by developing the opinion ontology. Therefore, a semantic framework as an integrated method will be used in all stages of aspect-based sentiment analysis.