1 Introduction

The number of Internet users is increasing day by day, with over 3.9 billion people worldwide using the Internet for their day-to-day tasks. Many of these users join social media sites to connect with their acquaintances and to post their views on various issues and developments. These opinions are useful for analyzing which side of an issue people lean toward, based on what they argue on social networks. Political parties, for example, can use such analysis to understand the mindset of the population. This forms the basis of a Sentiment Analyzer.

Twitter is a popular microblogging site that lets people share their thoughts on a worldwide platform [1]. Its popularity is reflected in the fact that 500 million tweets are posted every day. This provides a large amount of data on almost any topic, with people expressing their views on it. Twitter limits each tweet to a short text of at most 280 characters. Also, being a blogging platform, Twitter has more users who express their views on a wide range of issues in textual form. Twitter lets us view various trends and determine whether the inclination of users toward them is positive or negative. These factors make analysis of the data faster, easier, and more accurate. As a result, we narrowed our approach from a general Sentiment Analyzer to a Tweet-Based Sentiment Analyzer [2].

Despite the enormous amount of data, the text tweeted by users often contains irregularities. Tweets are often multilingual and contain acronyms, emojis, misspelt words, incorrect grammar, etc.

Such inconsistencies act as noise in the data [3]. However, we can still classify tweets as positive or negative with the help of significant words or terms in the text. This paper thus describes the approach of the project to classifying such data while tackling these difficulties [2, 4].

2 Related Work

Because Twitter is a huge platform for people to express their opinions, its data has been analyzed by different researchers for varying purposes. Azam et al. [2] built a system that clusters tweets by similarity using the Markov Clustering technique, with each cluster depicting an event. Tweets were treated as nodes in a social graph, with weighted edges representing the similarity between them. The scenarios considered were the Israel-Gaza conflict, the Delhi assembly election, and the Union Budget 2015. Norman et al. [5] performed sentiment analysis on English tweets about demonetization and the Indian Budget 2017. They used a Naive Bayes classifier to classify tweets fetched in real time as positive, negative, or neutral. The results helped to gauge the feeling of the general population about the government’s call for demonetization and its effect on the proposed Budget in 2017. Naiknaware [6] estimated the inclination of people by examining tweets about the Union Budget of India from 2016 to 2018 and classifying them into three classes, viz., positive, negative, or neutral. Kaur [7] worked on improving the accuracy of a sentiment analyzer by proposing a system design that combines lexicon-based and machine learning approaches. This work also brought to light a hybrid model that may combine multiple machine learning methods, such as SVM and Naive Bayes, to determine the polarity of a tweet. Sarlan et al. [8] developed a model to obtain customers’ opinions of an organization or company, which can benefit the company by measuring the perceptions of its customers. The model classified tweets into two classes, positive and negative, and presented the output as a pie chart on an HTML page. Accuracy was improved by applying Natural Language Processing before classifying the data. Verma et al. [9] worked on real-time opinion mining for movies to be released in India. The tweets were streamed in real time with the help of the Twitter Streaming API. This helps in determining the mood of viewers and how a movie will perform at the box office upon its release. Rahman et al. [10] proposed an approach to sentiment analysis on various topics that categorizes tweets by sentiment using a model trained with a machine learning algorithm, namely Naive Bayes. Guha et al. [4] proposed a system for analyzing the Twitter data of SemEval 2015 by training a linear SVM.

2.1 Lacuna of Existing Systems

The existing systems classify English tweets only, i.e., they offer no multilingual support. They also do not fetch data dynamically or produce tweet sentiments in real time, and they provide no functionality to fetch and classify historical tweets for analysis.

3 Proposed System

  • The user first logs into the system.

  • Then, the user can choose between two options:

    • Get the sentiments of old tweets.

    • Get the sentiments of live tweets.

  • The old tweets are fetched using web scraping, and the live tweets are fetched using the Twitter API.

  • These tweets are processed and then fed to the trained model.

  • Training the model:

    • First, the dataset is preprocessed.

    • Then it is fed to the algorithm which outputs the model.

  • The results obtained from the model are plotted as graphs and presented to the user (Fig. 1).

    Fig. 1 Block diagram

4 Methodology

The project was developed in phases. First, a dataset was collected for training the classifier, after which an appropriate algorithm was selected to categorize tweets into different classes. Tweets were captured based on the users’ input text using two main approaches, i.e., the Twitter API and web scraping. The captured tweets were fed to the trained model to predict their sentiment. The output of the algorithm was then displayed to the end user.

4.1 Dataset Used

A predefined dataset of tweets by Indians on a Union Budget or government decision is not readily available. As a result, a movie review dataset of short texts is used for training. The dataset comprises two files, one positive and one negative, each containing over 5000 movie reviews.

These files are combined and shuffled randomly to obtain a mixed dataset with both positive and negative texts. According to the study in [11], adjectives play the most vital role in determining the polarity of a sentence. For each adjective, the ratio of its positive to negative occurrences is calculated to find which of the two classes the word is more associated with. For example, if the word “excellent” occurs 30 times in positively classified data and only 3 times in negative data, it is closer to the positive side. Thus, adjectives are extracted as features of a sentence and mapped to the respective polarity. All such words form the feature set, which is ranked by frequency of occurrence in the dataset. The 5000 most frequent features are selected, and the model is trained on these features using 3000 sentences from the shuffled dataset. The trained model is then tested on the next 1000 sentences to measure the accuracy of the classifier.
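The feature selection described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the project: the fixed ADJECTIVES set is a hypothetical stand-in for a part-of-speech tagger, and the function names and the tiny sample sentences are assumptions made for the example.

```python
from collections import Counter

# Hypothetical stand-in for a part-of-speech tagger: a fixed set of adjectives.
ADJECTIVES = {"excellent", "good", "great", "bad", "boring", "poor"}


def tokenize(sentence):
    """Lowercase a sentence and strip simple punctuation from each word."""
    return [word.strip(".,!?\"'") for word in sentence.lower().split()]


def adjective_counts(sentences):
    """Count how often each known adjective occurs in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(word for word in tokenize(sentence) if word in ADJECTIVES)
    return counts


def polarity_ratio(word, pos_counts, neg_counts):
    """Ratio of positive to negative occurrences of a word (inf if never negative)."""
    if neg_counts[word] == 0:
        return float("inf")
    return pos_counts[word] / neg_counts[word]


def top_features(pos_counts, neg_counts, n=5000):
    """Return the n most frequent adjectives over both classes."""
    return [word for word, _ in (pos_counts + neg_counts).most_common(n)]


def featurize(sentence, features):
    """Map a sentence to a dict saying which selected features it contains."""
    words = set(tokenize(sentence))
    return {feature: feature in words for feature in features}


# Tiny illustrative sample, not the movie review dataset itself.
positive = ["An excellent and good film.", "Great acting, excellent script."]
negative = ["A boring, bad film.", "Poor script and bad acting."]
pos_counts = adjective_counts(positive)
neg_counts = adjective_counts(negative)
features = top_features(pos_counts, neg_counts, n=2)
```

A ratio above 1 places a word closer to the positive side and a ratio below 1 closer to the negative side, which mirrors the “excellent” example in the text.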

4.2 Classification Algorithm

The classification algorithm used for predicting the sentiment of tweets is the Multinomial Naive Bayes classifier, as it is suitable for classification with discrete features such as text-based classification [12]. Under the Naive Bayes assumption, we have:

$$ p(f_{1}, f_{2}, \ldots, f_{n} \mid c) = \prod\limits_{i = 1}^{n} p(f_{i} \mid c) $$
$$ p(f_{i} \mid c) = \frac{p(c \mid f_{i}) \, p(f_{i})}{p(c)} $$

The term Multinomial Naive Bayes indicates that each p(f_i|c) is a multinomial distribution, rather than some other distribution. Here, p(f_i|c) denotes the probability of observing feature f_i given class c [13].
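For classification, the factorized likelihood is usually combined with the class prior, and the class with the largest score is chosen. The second line below shows one common way to estimate the per-feature probabilities from counts; the add-one (Laplace) smoothing, the vocabulary V, and the count function are assumptions of this sketch and are not stated in the original description.

$$ \hat{c} = \arg\max_{c} \; p(c) \prod\limits_{i = 1}^{n} p(f_{i} \mid c) $$
$$ p(f_{i} \mid c) \approx \frac{\operatorname{count}(f_{i}, c) + 1}{\sum_{f \in V} \operatorname{count}(f, c) + |V|} $$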

This works well for data that can easily be turned into counts, such as word counts in text. Consider the following training dataset (Table 1).

Table 1 Training dataset

Let us determine whether the statement “overall budget is good” is classified as positive or negative [14] (Fig. 2).

Fig. 2 Calculation of probability

Since the positive probability is greater than the negative probability, the text “Overall budget is good” is classified as “Positive”.
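The same kind of calculation can be sketched in a few lines of Python. The training sentences below are hypothetical and are not the contents of Table 1; the add-one (Laplace) smoothing and the use of log probabilities are also assumptions of this sketch rather than details confirmed by the text.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set; NOT the data of Table 1.
TRAIN = [
    ("budget is good", "positive"),
    ("good decision overall", "positive"),
    ("budget is bad", "negative"),
    ("poor decision", "negative"),
]


def train(examples):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def log_scores(text, model):
    """Return log p(c) + sum of log p(f_i | c) for every class c."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return scores


def classify(text, model):
    """Pick the class with the highest log score."""
    scores = log_scores(text, model)
    return max(scores, key=scores.get)


model = train(TRAIN)
label = classify("Overall budget is good", model)
```

With this toy data the positive score exceeds the negative score for “Overall budget is good”, so the sketch reaches the same label as the worked example.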

4.3 Gathering Tweets

The project includes analysis of live tweets as well as old (historical) tweets, so two different approaches are used. The former uses Tweepy, while the latter applies web scraping using Selenium. Scraping is used because, under Twitter’s policy, its API provides only truncated old tweets [15].

Live Tweets Using Tweepy. To fetch live tweets, Tweepy and the Twitter API are used [16]. The steps to obtain the Twitter API keys are:

  • Apply for a Twitter developer account.

  • Click on “Create an Application”.

  • Fill in the details of the application.

  • The access tokens will then be available.

Tweepy handles the authentication, connection, creation, and destruction of the session.

Web Scraping Using Selenium WebDriver. To fetch old tweets, Twitter pages are scraped using Selenium WebDriver. Selenium is primarily a test automation tool that can perform various browser actions programmatically [17]. Twitter is a dynamic website: it loads more content as the user scrolls, and its HTML changes accordingly, unlike static web pages, which have fixed HTML. As a result, a dynamic web scraping tool is required, and Selenium suits this requirement. It simulates a human browsing Twitter pages and loads more tweets by pressing the Page Down key. The more the WebDriver scrolls, the more tweets are acquired from the HTML for sentiment prediction.

Based on the topic to be analyzed, the URL of a search query is generated, and the corresponding page is opened in a browser. Next, the content of the element with the tag name “body” is fetched. Within the body, the “div” tag that holds the tweet text is located and the tweet is captured.
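The URL generation and tweet extraction steps can be illustrated without a browser. The sketch below does not use Selenium: it parses a static HTML string with the standard library, whereas in the described system the HTML would come from the page rendered by the WebDriver. The search URL pattern and the “tweet-text” class name are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_search_url(topic):
    """Build a search URL for a topic (assumed URL pattern, not confirmed by the text)."""
    return "https://twitter.com/search?" + urlencode({"q": topic})


class TweetTextParser(HTMLParser):
    """Collect the text inside <div class="tweet-text"> elements (hypothetical class name)."""

    def __init__(self, css_class="tweet-text"):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching div
        self.current = []   # text fragments of the tweet being read
        self.tweets = []    # completed tweet texts

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1  # nested div inside a tweet div
        elif self.css_class in classes:
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and normalize whitespace.
                self.tweets.append(" ".join("".join(self.current).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def extract_tweets(page_source):
    """Return the tweet texts found in an HTML string."""
    parser = TweetTextParser()
    parser.feed(page_source)
    return parser.tweets


sample_html = (
    '<body><div class="stream">'
    '<div class="tweet-text">The decision is <b>good</b>.</div>'
    '<div class="other">x</div>'
    '<div class="tweet-text">Bad move</div>'
    "</div></body>"
)
tweets = extract_tweets(sample_html)
```

Repeated scrolling would simply produce a longer HTML string, so the same extraction step yields more tweets.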

4.4 Processing Tweets

A captured tweet may contain a variety of languages, emoticons, and noise. All of this has to be processed into a generalized format before its polarity is predicted. The tweet is therefore passed through three phases, after which the sentiment is determined.

Convert Emojis to Literal Meanings. Tweets contain emoticons, which play a vital role in expressing the sentiment of the user [18]. The system therefore converts them into their textual meaning. This is achieved using the “demojize” function of the Python module “emoji”. For example, a smiling-face emoji is converted to text describing a smile.

Cleaning of Tweets. A tweet contains a URL if it has an image associated with it. The URL is fetched along with the text, acts as noise in the data, and should be eliminated [19]. It is removed using the Python module “preprocessor” [20], which also removes the hash symbol from any hashtag present in the text. For example, the input “The decision is good. https://xyz.com/image.png” is processed to give “The decision is good.”
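The effect of this cleaning step can be approximated with a small regular-expression sketch. It is a stand-in written for illustration and does not reproduce the behavior or the API of the “preprocessor” module; the URL pattern and the function name clean_tweet are assumptions.

```python
import re

# Simplified URL pattern (an assumption; real URL matching is more involved).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_tweet(text):
    """Remove URLs and hash symbols, then collapse repeated whitespace."""
    text = URL_PATTERN.sub("", text)
    text = text.replace("#", "")
    return " ".join(text.split())


cleaned = clean_tweet("The decision is good. https://xyz.com/image.png")
```

Only the hash symbol is dropped here, so the word of a hashtag such as “#Budget2017” is kept as ordinary text for the classifier.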

Translation. Captured tweets can be in various languages, not just English. However, the model can process only English. Hence, the “GoogleTrans” Python library is used to detect the language of tweets and translate them into English wherever required. The library can also translate into English a tweet that is written in another language using English (Latin) characters.

4.5 Output Representation

The two methods used for sentiment analysis present their results differently, as follows:

Line Graph. Live capturing of tweets generates a live graph that updates continuously over time. The X-axis represents time and the Y-axis shows the sentiment value. The graph starts with a Y value of 0. When a tweet is classified as positive, the Y value is incremented by 1; otherwise, it is decremented by 1. This shows the current mood of people on a particular topic. An upward-moving graph denotes that people are happier about a decision, i.e., the feedback is positive, and a downward-moving graph suggests negative feedback (Fig. 3).

Fig. 3 Live sentiment analysis
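The Y values of the live graph follow directly from the increment and decrement rule. The sketch below computes them for a sequence of predicted labels; the label strings and the function name are assumptions, and the plotting itself is left out.

```python
from itertools import accumulate


def running_sentiment(labels):
    """Return the Y values of the line graph, starting at 0.

    Each "positive" label adds 1 and any other label subtracts 1.
    """
    steps = (1 if label == "positive" else -1 for label in labels)
    return [0] + list(accumulate(steps))


y_values = running_sentiment(["positive", "positive", "negative", "positive"])
```

Each new tweet appends one point, so a plotting routine only needs to redraw the series as the list grows.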

Circular Statistics. The analysis of scraped tweets is presented as a pie chart, along with the number of tweets analyzed, which depends on the amount of scrolling done. The pie chart shows the percentage of tweets classified as positive and negative (Fig. 4).

Fig. 4 Scraped tweets analysis
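The percentages shown in the pie chart can be derived from the predicted labels as in the following sketch. The label strings and the function name are assumptions; drawing the chart is left to whichever plotting library is used.

```python
from collections import Counter


def sentiment_shares(labels):
    """Return the percentage of positive and negative labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No tweets scraped yet: report empty shares instead of dividing by zero.
        return {"positive": 0.0, "negative": 0.0}
    return {key: 100.0 * counts[key] / total for key in ("positive", "negative")}


shares = sentiment_shares(["positive", "positive", "positive", "negative"])
```

The total number of labels corresponds to the number of tweets analyzed, which grows with the amount of scrolling.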

5 Conclusion

The developed Tweet-Based Sentiment Analyzer can be used to analyze decisions or policies undertaken by the government and to obtain an overall view of the public reaction. The sentiment analyzer provides an accuracy of about 76% for multilingual input and about 85% for English-only input tweets. Multilingual input text is converted to English using the GoogleTrans Python library so that the classifier can recognize it. The accuracy of the classifier decreases when multiple languages are considered because the library may at times translate a tweet incorrectly, resulting in wrong predictions in a few instances. This problem does not occur with English tweets, which results in better accuracy. Classification is performed using the Multinomial Naive Bayes algorithm, and the output is presented as a line graph when graphing live tweets and as a pie chart when web scraping is used to predict the sentiment of old tweets.