1 Introduction

In recent years, with the rapid growth of the Internet, people all around the world share their opinions on different topics online. This huge amount of unstructured data, available in different languages, is very useful for companies and organisations seeking to improve their products and services (Poria et al. 2014).

The corresponding field of science and technology is called sentiment analysis (SA). SA techniques involve a number of tasks, among them identification of the polarity (positive/negative) or emotion (happy, sad, angry, etc.) expressed in a text or in a sentence (Turney 2002). Sentiment polarity can be binary or can involve multiclass classification, such as strongly positive, positive, neutral, negative, and strongly negative. Most research focuses on binary polarity classification, though identifying at least the neutral opinion in a sentence is more helpful (Tang et al. 2009).

In recent years, sentiment analysis has been a very active area of research. Numerous lexical resources and datasets have been compiled for the English language. However, much less effort has been devoted to the development of lexical resources in other languages, which makes it difficult for researchers to analyze text in languages other than English (Dashtipour et al. 2016).

In particular, there is no well-known dataset or lexicon available for the Persian language (Neviarouskaya et al. 2011). In this paper, we present PerSent, a Persian polarity lexicon for sentiment analysis, which contains words and phrases along with their polarity and part-of-speech tag. We evaluate its quality and performance by applying it to a sentiment analysis task using different features, such as POS-based features, the presence and frequency of sentiment words, and the average polarity of words, and two machine-learning algorithms: SVM and Naïve Bayes.

The lexicon is freely available to the research community and can be downloaded from the URL http://www.gelbukh.com/resources/persent.

This paper is organized as follows. Section 2 discusses related work on lexicons in languages other than English; Sect. 3 presents PerSent, our Persian sentiment lexicon; Sect. 4 describes our evaluation methodology, and Sect. 5 gives the evaluation results. Section 6 concludes the paper.

2 Related Work

Data analysis is important for small and large companies. They gather opinions from texts available on the Internet. Analysis of such opinions has a great impact on customer relationships. Companies use customer comments about negative features of products to improve those products (Cambria et al. 2016). Moreover, sentiment analysis is not restricted to product reviews but is also used in other fields such as politics and sport.

In this section, we give some background on sentiment classification and discuss related work.

2.1 Types of Sentiment Analysis

Approaches to sentiment classification can be divided into three groups: statistical approaches, knowledge-based approaches, and hybrid approaches.

Statistical approaches use machine-learning algorithms such as SVM or Naïve Bayes to classify text. They can use supervised or unsupervised learning methods. Supervised methods use labelled data to classify the text, while unsupervised ones use only raw data (Maynard and Funk 2012). Statistical approaches are commonly used, for example, to detect sentiment holders and targets (Cambria et al. 2013).

Knowledge-based approaches classify the text by affect categories based on the presence of unambiguous affect words such as sad, happy, afraid, or bored (Cambria 2016). They use lexicons to calculate the statistics of positive and negative words in the given text: for example, the word good is known to be positive and the word bad negative. The lexicon can contain single words or phrases. The advantage of knowledge-based approaches is that they do not require training data; the main disadvantage is their lack of scalability.
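As a minimal sketch of the knowledge-based idea, the snippet below counts positive and negative lexicon hits in a tokenised text and compares the two counts. The tiny English lexicon and the function name `classify_by_lexicon` are assumptions made for illustration; they are not taken from any of the cited systems.

```python
# Toy lexicon (assumption for illustration only): word -> "pos" / "neg".
TOY_LEXICON = {"good": "pos", "happy": "pos", "bad": "neg", "sad": "neg"}


def classify_by_lexicon(tokens, lexicon=TOY_LEXICON):
    """Label a token list by comparing counts of positive and negative words."""
    pos = sum(1 for t in tokens if lexicon.get(t.lower()) == "pos")
    neg = sum(1 for t in tokens if lexicon.get(t.lower()) == "neg")
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # tie, or no lexicon word found


print(classify_by_lexicon("The screen is good but the battery is bad and sad".split()))
```

No training data is involved: the decision depends only on the lexicon, which is also why coverage of the lexicon limits what such a classifier can recognise.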

Hybrid approaches combine statistical and knowledge-based methods to improve performance and accuracy (Maynard and Funk 2012; Cambria 2016). Pak and Paroubek (2010) developed a dataset that contains positive and negative documents; for classification, they calculate the cosine similarity between the given document and the documents with known polarity. They evaluated their method using the Naïve Bayes algorithm.
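The cosine-similarity step mentioned above can be sketched as follows. This is only an illustration over bag-of-words term counts; the function names and the nearest-document decision rule are assumptions, not the exact procedure of Pak and Paroubek (2010).

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def nearest_label(tokens, labelled_docs):
    """Return the label of the most similar (tokens, label) pair."""
    return max(labelled_docs, key=lambda d: cosine_similarity(tokens, d[0]))[1]
```

For instance, given one labelled positive and one labelled negative document, `nearest_label` assigns a new document the polarity of whichever it overlaps with more.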

2.2 Knowledge-Based Approaches

Various lexicon-based approaches have been used for sentiment classification of documents in different languages; see Table 1. Most of the lexicon-based approaches used adjectives to identify the polarity of the text. Different methods have been suggested for developing sentiment lexicons, such as manual, corpus-based, and dictionary-based compilation. Manual construction is time-consuming; it is usually combined with other methods to improve performance.

Table 1. Existing sentiment lexicons for various languages

Corpus-based methods use lists of sentiment words along with their polarity and syntactic patterns to find more sentiment words and their polarity. For example, Hatzivassiloglou and McKeown (1997) developed a graph-based technique for learning lexicons; they identified the polarity of adjectives using conjunctions and used a clustering algorithm to divide words into positive and negative. They achieved an accuracy of 82%.

Dictionary-based approaches do not require pre-compiled lists of sentiment words. In these approaches, sentiment words and their orientation are collected manually, and their synonyms and antonyms are then looked up in a dictionary. The main disadvantage of this method is that it is unable to find sentiment words with domain-specific orientation: sentiment words can be positive in one domain and negative in another. For example, the word large is positive when it refers to a computer screen, but negative when it refers to a mobile phone (Hu and Liu 2004).
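The synonym and antonym lookup can be illustrated with a small sketch. The thesaurus below is a toy stand-in (an assumption); a real system would query an actual dictionary resource. Synonyms inherit the polarity of the word they were reached from, and antonyms receive the opposite polarity.

```python
from collections import deque

# Toy thesaurus (assumption): stands in for a real dictionary lookup.
SYNONYMS = {"good": ["fine", "nice"], "bad": ["poor"], "nice": ["pleasant"]}
ANTONYMS = {"good": ["bad"], "pleasant": ["unpleasant"]}


def expand_lexicon(seeds):
    """Propagate polarity from seed words: synonyms keep it, antonyms flip it."""
    lexicon = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        polarity = lexicon[word]
        neighbours = [(s, polarity) for s in SYNONYMS.get(word, [])]
        neighbours += [(a, -polarity) for a in ANTONYMS.get(word, [])]
        for other, other_polarity in neighbours:
            if other not in lexicon:  # first label wins; no conflict resolution
                lexicon[other] = other_polarity
                queue.append(other)
    return lexicon


print(expand_lexicon({"good": 1}))
```

Because the polarity is fixed per word, such a procedure cannot represent a word like large, whose orientation depends on the domain.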

2.3 Persian Language

Persian uses 32 letters, which include the 28 Arabic letters. Its writing system includes special signs and diacritic marks that can be used in different forms or omitted from the word. Short vowels are not indicated in writing. There are letters with more than one Unicode encoding. Some words have more than one spelling variant, and the spelling of some words changes with time. All this increases the number of both homographs and synonyms, which presents problems in the computational treatment of Persian (Karimi 1989; Seraji et al. 2012).
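The issue of letters with more than one Unicode encoding can be illustrated with a short normalisation sketch. The two mappings and the diacritic range below are common choices in Persian text processing and are given here as an assumption; they are not a complete normaliser, and the function name `normalise_persian` is hypothetical.

```python
import re

# Arabic code points that commonly stand in for Persian letters.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})
# Arabic diacritics from FATHATAN (U+064B) to SUKUN (U+0652),
# which writers may include or omit.
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalise_persian(text):
    """Map variant letter encodings to one form and strip optional diacritics."""
    return DIACRITICS.sub("", text.translate(CHAR_MAP))
```

After such a step, two spellings of the same word that differ only in these code points compare as equal, which matters for any exact-match lexicon lookup.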

Saraee and Bagheri (2013) proposed a method for feature selection in Persian sentiment analysis that calculates the co-occurrence of Persian words in different classes. They used customer reviews to evaluate the performance of the approach, with the Naïve Bayes algorithm as the classifier. The overall accuracy of their approach was 75%. The advantage of this approach is its simplicity; a disadvantage is the need for a large amount of training data.

Chen and Skiena (2014) proposed a lexicon for major languages such as English, Arabic, Japanese, and Persian. The English data was collected online. They used WordNet to gather synonyms and antonyms for English and Google Translate to translate these words and phrases into different languages. They used Wikipedia pages to evaluate the performance of their lexicon, and obtained an overall performance of 45.2%. An advantage of this approach is its ability to develop lexicons for 136 languages; a disadvantage is that the lexicons for most of these languages contained fewer than one hundred words and phrases.

3 PerSent Persian Sentiment Lexicon

Many researchers note that the main problem of multilingual sentiment analysis is the lack of resources. To overcome this issue, we developed a lexicon of 1500 Persian words along with their polarity and part-of-speech tag, which we called PerSent. Table 2 shows some examples.

Table 2. Examples from our Persian sentiment lexicon

Most of the previous research on sentiment used adjectives to identify the polarity of sentences (Hu and Liu 2004). Some researchers used adverbs and adjectives together to build a lexicon (Benamara et al. 2007); some used adjectives, adverbs, and verbs (Taboada et al. 2011). For our Persian sentiment lexicon, we used adjectives, adverbs, verbs, and nouns, because all these words and phrases are useful to determine the polarity of the sentence.

A lexicon can be developed in different ways, for example manually or using existing lexicons such as SentiWordNet (Esuli and Sebastiani 2006) or General Inquirer (Stone et al. 1966). The words and phrases used in our lexicon were taken from different resources such as movie review websites, weblogs, and Facebook. There were four different categories of sources, namely, websites related to movies, news, mobile phones, and computers.

We manually assigned a polarity between −1 and +1 to each word and phrase. The degree of intensity was indicated: e.g., the Persian words for happy, cheerful, and delighted have different positive values. To assist in assigning polarity to some words and phrases, we used the TextBlob Python package, which assigns polarity to words, phrases, and sentences in English (Yang 2015). For this, we translated Persian words into English. We also manually assigned a part-of-speech (POS) tag to each word or phrase. Table 3 shows the distribution of the POS tags in the lexicon.
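A lexicon entry of the kind described above (a word or phrase, a polarity in the range −1 to +1, and a POS tag) could be represented as in the following sketch. The class name, the field names, and in particular the polarity values of the four sample entries are invented for illustration and are not taken from PerSent.

```python
from dataclasses import dataclass

POS_TAGS = {"adjective", "adverb", "verb", "noun"}


@dataclass(frozen=True)
class LexiconEntry:
    word: str
    polarity: float  # expected to lie in [-1, +1]
    pos: str         # one of POS_TAGS

    def __post_init__(self):
        # Reject entries that violate the constraints described in the text.
        if not -1.0 <= self.polarity <= 1.0:
            raise ValueError(f"polarity out of range: {self.polarity}")
        if self.pos not in POS_TAGS:
            raise ValueError(f"unknown POS tag: {self.pos}")


# Hypothetical entries; the polarity values are invented for illustration.
ENTRIES = [
    LexiconEntry("خوب", 0.5, "adjective"),   # good
    LexiconEntry("زیبا", 0.75, "adjective"),  # beautiful
    LexiconEntry("بد", -0.5, "adjective"),   # bad
    LexiconEntry("زشت", -0.75, "adjective"),  # ugly
]
LEXICON = {entry.word: entry for entry in ENTRIES}
```

Keeping the entries in a dictionary keyed by the word allows constant-time lookup of polarity and POS tag during feature extraction.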

Table 3. Statistics by POS

4 Evaluation Methodology

In order to evaluate the performance of our lexicon, we used two classification algorithms; we used our lexicon to assign polarity to the features extracted from the dataset. Figure 1 shows the general framework we used to evaluate the performance of our lexicon. Below we describe each processing step.

Fig. 1. The Persian framework

Pre-processing. The pre-processing step consisted of four parts: tokenisation, normalisation, stop-word removal, and stemming. Normalisation was used to remove noise from the text. Stemming was used to remove the inflection from inflected forms, since the lexicon only provides the base form of the words.

Feature selection. The purpose of feature selection was to remove unnecessary features, which improved the performance and efficiency of the classification. The features we used were based on word polarity, POS tag, and the presence and frequency of sentiment words; see Table 4.
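The four pre-processing parts can be sketched as a small pipeline. Everything in this sketch is a toy assumption: the stop-word list is a handful of common Persian function words, the stemmer knows only the plural suffix ها, and the function names are hypothetical. It is not the implementation used in the experiments.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, used inside many Persian words
STOP_WORDS = {"و", "در", "از", "به", "است", "هستند"}  # tiny illustrative list
PLURAL_SUFFIX = "ها"  # the only suffix this toy stemmer knows


def tokenise(text):
    """Split on whitespace and common Latin/Persian punctuation."""
    return re.findall(r"[^\s.,!?؟،؛]+", text)


def normalise(token):
    """Unify Arabic yeh/kaf code points with their Persian forms."""
    return token.replace("\u064A", "\u06CC").replace("\u0643", "\u06A9")


def stem(token):
    """Strip a trailing plural suffix and any ZWNJ left before it."""
    if token.endswith(PLURAL_SUFFIX) and len(token) > len(PLURAL_SUFFIX) + 1:
        token = token[: -len(PLURAL_SUFFIX)]
    return token.rstrip(ZWNJ)


def preprocess(text):
    """Tokenise, normalise, remove stop words, then stem."""
    tokens = [normalise(t) for t in tokenise(text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in tokens]


# "Flowers are good." -> the plural noun is reduced to its base form.
print(preprocess("گل\u200cها خوب هستند."))
```

Stemming is placed last so that the remaining tokens match the base forms stored in the lexicon.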

Table 4. Features used

Presence and frequency of sentiment words. The sentiment words identify the overall polarity for sentiment classification. Examples of positive words in Persian are the words for beautiful and excellent, and examples of negative words are those for ugly and bad. The two features for the presence of positive and of negative words are binary and do not consider the number of occurrences of a given word, while the other two features are integers and indicate the number of occurrences of positive and of negative words, respectively.
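The two presence features and the two frequency features can be computed as in the following sketch. English glosses stand in for the Persian words, and the lexicon, the function name, and the feature names are assumptions for illustration.

```python
# Toy polarity lexicon (assumption): English glosses stand in for Persian words.
TOY_POLARITY = {"beautiful": 1, "excellent": 1, "ugly": -1, "bad": -1}


def presence_frequency_features(tokens, polarity=TOY_POLARITY):
    """Return two binary presence features and two integer frequency features."""
    pos_count = sum(1 for t in tokens if polarity.get(t, 0) > 0)
    neg_count = sum(1 for t in tokens if polarity.get(t, 0) < 0)
    return {
        "has_positive": int(pos_count > 0),  # binary: any positive word?
        "has_negative": int(neg_count > 0),  # binary: any negative word?
        "positive_count": pos_count,         # integer: occurrences
        "negative_count": neg_count,
    }


print(presence_frequency_features(["beautiful", "excellent", "bad", "film", "beautiful"]))
```

Repeated words raise the frequency features but leave the presence features unchanged, which is the distinction the two feature pairs are meant to capture.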

POS-based features. Our lexicon contains words along with their POS tag, such as adverb, verb, noun, or adjective. Most of the previous research used only adjectives and nouns to identify the polarity of sentences (Kouloumpis et al. 2011), but we consider eight different features: the frequencies of positive and of negative adjectives, adverbs, verbs, and nouns, respectively.

Word Polarity. Our lexicon gives the polarity of words. As two different features, we used the overall polarity of negative and of positive words, respectively.
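The eight POS-based frequencies can be sketched as a simple count over lexicon hits. The toy lexicon (English glosses mapped to a polarity sign and a POS tag), the function name, and the feature names are assumptions for illustration.

```python
from collections import Counter

POS_TAGS = ("adjective", "adverb", "verb", "noun")
# Toy lexicon (assumption): word -> (polarity sign, POS tag).
TOY_LEXICON = {
    "beautiful": (1, "adjective"),
    "badly": (-1, "adverb"),
    "love": (1, "verb"),
    "disaster": (-1, "noun"),
}


def pos_features(tokens, lexicon=TOY_LEXICON):
    """Count positive and negative lexicon words separately for each POS tag."""
    counts = Counter()
    for token in tokens:
        if token not in lexicon:
            continue
        polarity, pos = lexicon[token]
        if polarity == 0:
            continue  # neutral entries contribute to neither count
        sign = "positive" if polarity > 0 else "negative"
        counts[f"{sign}_{pos}"] += 1
    # Always emit all eight features, including zero counts.
    return {f"{s}_{p}": counts[f"{s}_{p}"]
            for s in ("positive", "negative") for p in POS_TAGS}


print(pos_features(["beautiful", "film", "badly", "beautiful", "disaster"]))
```

Emitting all eight keys even when a count is zero keeps the feature vector at a fixed length for the classifier.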
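The two polarity features might be computed as below. The text does not state how the overall polarity is aggregated, so summing the lexicon scores is an assumption of this sketch, as are the toy lexicon values and the names used.

```python
# Toy lexicon with graded polarity (assumption; values invented for illustration).
TOY_POLARITY = {"beautiful": 0.75, "good": 0.5, "bad": -0.5, "ugly": -0.75}


def polarity_features(tokens, polarity=TOY_POLARITY):
    """Sum the scores of positive and of negative lexicon words separately."""
    scores = [polarity[t] for t in tokens if t in polarity]
    return {
        "positive_polarity": sum(s for s in scores if s > 0),
        "negative_polarity": sum(s for s in scores if s < 0),
    }


print(polarity_features(["good", "beautiful", "bad", "film"]))
```

Unlike the frequency features, these two values reflect the intensity of the sentiment words, not only how many of them occur.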

5 Experimental Results

Using our lexicon, we applied simple baseline approaches to sentiment analysis to the Persian VOA (Voice of America) news corpus, which contains 500 positive and 500 negative news headlines. We then measured the performance in terms of accuracy:

$$ \text{Accuracy} = \frac{\text{number of items classified correctly}}{\text{total number of items}} $$
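The accuracy measure can be computed directly from the gold and predicted labels, as in this short sketch (the function name and the label strings are illustrative assumptions):

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))
```

On a balanced corpus such as the one used here, a classifier that always predicts the same class would reach an accuracy of 0.5, which is a useful reference point when reading the results.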

5.1 Results

We used a support vector machine (SVM) and a Naïve Bayes classifier for evaluation. The support vector machine gave better results than Naïve Bayes; see Table 5. In this experiment, all the features were used.

Table 5. Performance of different classifiers with all features

We also compared the effectiveness of different features in order to determine their importance; see Table 6. The accuracy varied from 46% to 63%. SVM gave uniformly better results than Naïve Bayes did. The experiment showed that the mere presence of opinion words gives better performance than their frequency.

Table 6. Performance of the frequency features

We also compared the POS features, namely the frequencies of positive and negative adjectives, adverbs, verbs, and nouns; see Table 7. SVM again almost uniformly outperformed Naïve Bayes.

Table 7. Performance of the POS features

Table 8 shows the results for overall polarity of negative and of positive words. Positive words outperformed negative words, and SVM outperformed Naïve Bayes.

Table 8. Performance of the overall polarity feature

5.2 Discussion

Based on related work on lexicon-based methods, we expected the PerSent lexicon to perform better. Classification of news into positive and negative is, however, a difficult task, because most bad news items do not contain any subjective terms that would help to classify them as negative.

The main problem of our lexicon is its relatively small size: 1500 words are not enough for Persian because it has many dialects and actively uses idiomatic expressions, and thus requires a larger lexicon, the development of which would take time and effort (He and Zhou 2011).

Another problem is that our simple application did not properly handle sarcasm. Detecting ironic and sarcastic sentences requires further study and a much more sophisticated system: sarcasm should be studied independently, and a separate tool needs to be developed to handle it in order to improve our classification performance.

Similarly, our simple testing application did not properly handle code switching between Persian and English: some sentences used a mixture of Persian and English words.

Adjectives gave better results in comparison with other parts of speech, because the examination of adjectives in a sentence is easier than that of other words. For example, in the Persian sentence meaning “It is a beautiful picture”, the adjective clearly indicates the sentiment.

Not surprisingly, all features together gave better results than individual features separately, because in this way the algorithm had access to more information.

6 Conclusions

We have developed a new lexicon for the Persian language, which can be used for Persian sentiment analysis. The lexicon contains 1500 Persian words along with their polarity on a numeric scale from −1 to +1 and their part-of-speech tag. The majority of the values were assigned manually. The new lexicon is freely available for download from the URL http://www.gelbukh.com/resources/persent.

Our experimental results show that our lexicon is a useful tool to determine the polarity of sentences in Persian. In the experiments, we used two classifiers, SVM and Naïve Bayes, of which SVM gave better results.

As future work, we plan to extend our lexicon, try computer-assisted methods for its compilation, and apply our lexicon to a wider variety of tasks and corpora. In addition, we will combine knowledge-based methods with deep textual features for sentiment classification (Poria et al. 2015a). An end-to-end Persian sentiment analysis framework based on linguistic patterns and common-sense knowledge is another important direction for future work (Poria et al. 2015b, 2012; Cambria et al. 2015). Aspect-based sentiment analysis (Poria et al. 2016) and disambiguating sentiment words (Pakray et al. 2011a, 2011b, 2010) will play a major role in such a framework.