Keywords

1 Introduction

Sentiment analysis is a field of natural language processing which focuses on extraction of objective and subjective information from a natural language sentence. With the boom of Online community people are expressing their likes and dislikes towards different subjects in blogs, microblogs and social networking sites like Twitter and Facebook. Analyzing these expressions of short colloquial text [1] can yield vast information about the behavior of the people that can be helpful in many other subjects like Political Science, opinion extraction and Human Computer Interaction (HCI).

With the ever emerging social media, more and more people are expressing their sentiments about current affairs on blogs, microblogs and social networking sites [2]. During the Indian general elections 2014, in the timeframe of 4 months, conversations regarding Indian elections were more than twice the conversations during whole of the 2013 and Indian twitter users also more than doubled.

Early research in this area focuses on adjectives and single word phrases to evaluate sentiments of the sentence like in [3]. An adjective with positive connotation can imply the overall sentiment for the subject positive and an adjective with negative connotation can imply the overall sentiment for the subject negative. However recent studies has showed that verbs and adverbs [4] and two word phrases [5] also contribute significantly to the overall opinion of the sentence. The whole process is depended upon usage of dictionary of words with their ranking (or polarity scores). This has been suggested in [6]. This method is also called lexicon based sentiment analysis. In this a pre-evaluated knowledge base consisting of words and their polarity scores are used like SentiWordNet [7] to determine relation of word phrases and there sentiment score based on classification into positivity and negativity of the subject that signifies the altitude of the author on that particular subject. Recent researches have shifted their focus on Rule Based sentiment analysis [8], Machine Learning approaches [9] and semantic meaning of the sentences [10]. Rule based techniques focuses on set of pre-defined rules which are when detected in a natural language sentence gives a definite output.

In this paper, we proposed an approach for detecting sentiment in political tweets based on our semantic rules. Proposed political sentiment analysis model is unsupervised which don’t require any prior training dataset.

This paper organized as follows. Section 2 describes the related works. Section 3 discusses the proposed approach. Section 4 presents the results of the proposed approach with the discussion. Finally, Sect. 5 presents the conclusion.

2 Related Work

Previous studies [1113] show that analyzing these sentiments and patterns can generate useful results which can be handy in determining opinions of public on elections and policies of the government. In [11], authors extract sentiments (positive, negative) as well as emotions (anger, sadness etc.) regarding the major leading party candidates and on the basis of that they calculate a distance measure. The distance measure shows the proximity of the political parties, smaller the distance higher the chances of close political connections between that parties [12] and [13] also shows how twitter data can be helpful in predicting election polls and deriving useful information about public opinions.

Existing problems in analyzing political tweets have been discussed in [14]. Sarcasm tends to reduce the accuracy of the classifier [15] shows how Sarcastic tweets in which a positive sentiment followed by an negative situation is handled. For deep analysis of the sentences, dependency parsing tool should be used which can extract relations among the words that are forming the sentence [16, 17] show the usage of Stanford Dependency Parser [18] (abbreviated from now on as SDP) in extracting these relations.

We also used categorization specified in [16] but modified them a little to suite our approach. Our categorization consists of six entities namely: Modifiers, Intensifiers, Dividers, Negations, Verbs and Objects. We believe that these entities are important as they can significantly affect sentiment of the overall sentences.

3 Proposed Approach

In this paper we proposed an approach for sentiment analysis of tweets. We believed in a common system which will be able to solve different problems like Sarcasm, Conjunction and Implicit negation combined. For this we proposed an unsupervised hybrid approach of Lexicon Based and Rule Based Sentiment Analysis which will analyze words related to other words, thus giving overall sentiment of the sentence. For lexicon, SentiWordNet is used which can give us the sentiment scores of a word. A negative score signifies negative connotation and a positive score signifies positive connotation of the word. Tweets were manually downloaded from a time period of 28 February 2014 to 28 March 2014. Our system follows in mainly 4 steps which are explained below.

3.1 Dependency Extraction

We used SDP to extract rules from the tweets. The sole reason is to remove extra words that are not related to overall sentiment or contribute very less to the overall sentiment. From these rules, those that are containing verbs, adjectives, adverbs, nouns, conjunctions and negations are extracted and rest are discarded.

When analyzing twitter sentences we found out that due to wrong grammatical formations, efficiency of SDP decreases which will affect our system. When SDP is unable to detect relation between two words, it uses rule ‘dep’ which shows unknown dependency between those words. To improve this we used Ark twitter POS tagger (abbreviated from now on as ATP) [19]. ATP enables us to determine the part of speech of the two words thus giving us the dependency.

3.2 Set Distribution

We approach the problem in set wise manner. It is easier to deal with the problem when it is divided into sets. A natural language sentence is divided into 6 sets according to their part of speech and the polarity of the whole sentence is described by describing polarity of each set in relation to the other sets. Word phrases in the sets contains reference to the words of the previous sets to which they are specifically connected. This helps us in extracting features that will be vital in classifying the sentences according to the rules in next section. The functionality of each set is explained below with the help of example sentence (1).

  1. 1.

    ‘BJP will make good government but still will not remove corruption from India.’

  • Set W0 (Keyword Set)—Includes Subject or Objects containing Keywords Like ‘BJP’. These contain Noun or Noun + Noun. From the above sentence (1) this set will include ‘BJP’.

  • Set W1 (Verb Set)—Includes verbs which describes the action performed by the contents of Set W0 with a reference to the specific noun to which it is connected. From the above example (1), this set includes ‘make’, ‘remove’ because of the extracted rules nsubj(make-3, BJP-1) and nsubj(remove-10, BJP-1) from Fig. 1 and we will extract features ‘BJP_make’, ‘BJP_remove’.

    Fig. 1
    figure 1

    Algorithm for sentiment score calculation

  • Set W2 (Object/subject set)—Includes objects on which the Set W0 are performing actions. This set also includes Noun and Noun + Noun. From (1) this set includes ‘government’, ‘India’ because of the relations dobj(make-3, government-5), dobj(remove-10, corruption-11), prep_from(remove-10, India-13). We will extract features ‘make_government’, ‘remove_corruption’, ‘remove_India’.

  • Set W3 (Modifier Set)—Includes adjectival and adverbial modifiers that are providing or modifying sentiments from the above sets (W0, W1, W2). From (1) this set includes ‘good’ due to the relation amod(government-5, good-4). We will extract features from this as ‘government_good’.

  • Set W4 (Intensifier Set)—Includes adverbial intensifiers that are strengthening or weakening the sentiment scores from the Sets above. From (1) this set includes ‘still’ from the relation advmod(remove-10, still-7). We will extract feature ‘remove_still.’

  • Set W5 (Buffer or Divider Set)—Includes conjunctions like ‘but’ and ‘and’ with references to two words which it is dividing. From (1) this set will include ‘but’ from the relation conj_but(make-3, remove-10). We will extract feature ‘remove_make_but’.

  • Set W6 (Negation Set)—includes the negation words like ‘not’, ‘never’ which flips the sentiment score from the sets above. From above example (1) this will include ‘not’ because of the relation neg(remove-10, not-9). We will extract feature ‘remove_not’.

3.3 Context Rules Formation

We developed rules to determine the sentiment of tweets into positive and negative. These rules are presented in Table 1 and each rule is explained with example further. Polarity of the words is determined with the SentiWordNet. We used following abbreviations for the rules.

Table 1 Context rules

Example: Consider the tweet ‘AAP bhakts r always right, BJP waste time for dharnas. If u don’t trust then see it’ for the above rule. Here keyword is ‘BJP’ and set W1 includes ‘trust’ which is a positive verb in SentiWordNet and W2 includes ‘time’ and ‘waste’ both of these are minor positive and neutral nouns respectively. Notice that negator ‘don’t’ (placed in W6) is attached to trust i.e. we extract feature ‘trust_don’t’ which will reverse the polarity of Set W1 containing ‘trust’, thus classifying in Rule 1.

3.4 Determining Sentiment Scores

Once the rule formation occurs, sentiment scores are calculated using SentiWordNet. We used method similar to specified in [16] in calculating and distributing scores. Let Spre be the polarity from the previous sets with which it is connected to, Sset be the polarity of particular set and Snew be the updated polarity. Figure 1 shows the algorithm used for the sentiment score calculation.

4 Results

We prepared a datasets of total 259 tweets from the date of 28 February 2014 to 28 May 2014 just before the time of Indian elections to know the public trend and general opinion about the elections. Among the total tweets 116 are positive, 92 are negative and remaining 51 are objective tweets. We used accuracy as evaluation measure and it is computed by dividing the correctly classified tweets with total number of tweets. Our approach correctly predicted 76 positive tweets and 55 negative tweets. Further, we investigated manually that tweets containing colloquial language (containing Hindi words) is 56 out of which 20 were positive and 17 were negative. We removed these tweets from total tweets. Results are presented in Table 2.

Table 2 Results

Next, we tried to model elections in National Capital Territory (NCT) Delhi region. For this, we manually downloaded 106 tweets giving sentiments for the Aam Admi Party (AAP) and its party leader Arvind Kejriwal from the same time period by the users of Delhi Region. We investigated tweets with #AamAdmiParty, #AAP and #ArvindKejriwal. Results are presented in Table 3.

Table 3 Results related to modelling elections in NCT Delhi

4.1 Discussions and Comparison with Other Approaches

Above results shows that 34.91 % of users were positive towards AAP party in NCT Delhi region. From the Indian General Election results 2014 we know that all the 7 seats of Delhi region were won by Bhartiya Janta Party (BJP). Although AAP was not able to won any seat in NCT Delhi, there voting share in the elections were 32.90 % Ref. [20]. This gives us an error percentage of 6.11 %. So we were able to predict the voting share of AAP with acceptable error percentage.

We compared our proposed approach with the state-of-art approaches. Table 4 presents some cases of where other approaches fails whereas proposed approach performs better than other methods. The example tweets are chosen from the dataset according to the classification type by algorithm.

Table 4 Handling problems in existing strategies

5 Conclusion

People are increasingly using Social media to express their opinion. And, Twitter is a great source to investigate the public opinion especially during election time. Observing the results has led us to believe that there is a great scope in analyzing Indian political twitter data and considering its sentiment alone can result in giving a general idea about the election results. In this paper, we proposed various rules based on semantic structure of the sentence. Experimental results show the effectiveness of the proposed approach over existing methods.