1 Introduction

Opinions influence almost every decision in daily life, and millions of people express their thoughts through personal blogs, social networking sites and similar channels. Opinions are private states that are not directly observable by others, but they can be expressed through actions, including written and spoken language [1]. Sentiment analysis, also known as opinion mining, is a task in natural language processing that deals with identifying and categorizing the opinions expressed in text [2]. These opinions are predominantly classified into positive, negative and neutral classes, which makes the task useful in many fields, including marketing, sociology and psychology [3]. Sentiment analysis is well established for English but has rarely been applied to Indian languages [4].

Twitter, a microblogging platform, allows its users to post short messages on any subject and to follow others to receive their posts. Many people use Twitter as a platform to communicate with each other. The objective of this study is to examine user opinion expressed on Twitter and to contribute to a system that allows it to be monitored in real time. Tweet preparation includes, among other steps, spelling correction, synonym substitution, hyperlink removal and stop-word removal [5]. Sentiment is manually categorized into three classes (positive, neutral and negative) in order to build the training set for the classifier [6]. The classified tweets are used to build positive, neutral and negative sentiment lists. The dramatic increase in the use of the internet as a means of communication has been accompanied by a dramatic change in the way people express their opinions and views [7]. They can post reviews of products and services online, as well as their views on anything, through social media (e.g. blogs and discussion forums). Sentiwordnet is one of the most widely used lexical resources for sentiment analysis, emotion analysis and opinion mining [8]. It is an automatically created lexicon with positive and negative scores [9, 10].

The present work was carried out as part of the shared task on Sentiment Analysis in Indian Languages (SAIL) 2015, in the constrained category. The task comprises Twitter data in three languages (Hindi, Bengali and Tamil) labelled with three classes (positive, negative and neutral); the goal is to identify the sentiment of a given tweet in a given language. The main objective of the shared task is to encourage researchers to carry out sentiment analysis in their native languages. Section 2 describes the methodology used in the system, Sect. 3 gives a short analysis of the dataset provided, and Sect. 4 discusses the experiments and their results.

2 Methodology

In the proposed system, feature extraction is the most crucial step, because the accuracy of the classifier depends on the extracted features. The flow of the proposed system is depicted in Fig. 1. Preprocessing is generally required for text classification and is essential for Twitter data. The preprocessing steps are tokenization and normalization. In tokenization, the tweets are split into small units called tokens. These tokens are then normalized: superficial variations are removed from the words so that they are converted to the same form. Common types of normalization include case folding and stemming [11]. Stemming is usually avoided when classifying text in Indian languages, because reducing words to their root form discards useful information. Case folding is used mainly for English, which has upper-case and lower-case letters [12]; it is not needed for Indian languages, which have no such case distinction. The terms normalized by the system are listed in Table 1. Along with the features, the classifier learns from the training dataset, which is already labelled. A small part of the training dataset, about 10 %, is set aside for validation. This data is given as input to the Naive Bayes classifier, and the classified outputs are evaluated. The Naive Bayes classifier is chosen because the dataset is very small. For the machine learning part, the SciPy library is used for classification.
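The 10 % hold-out described above can be sketched as follows. This is a minimal illustration in plain Python, not the SciPy-based code used in the paper; the function name `holdout_split`, the fixed random seed and the toy tweets are assumptions introduced for the example.

```python
import random


def holdout_split(samples, validation_fraction=0.10, seed=0):
    """Shuffle labelled samples and hold out a fraction for validation.

    `samples` is any sequence of (tweet, label) pairs. The seed is fixed
    only to make the sketch reproducible (an assumption, not from the paper).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation sample, even for tiny datasets.
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical labelled data standing in for the SAIL training tweets.
labelled = [(f"tweet {i}", "positive" if i % 2 else "negative") for i in range(50)]
train, validation = holdout_split(labelled)
print(len(train), len(validation))
```

Shuffling before splitting avoids holding out a block of tweets that all share one label when the training file is grouped by class.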

Fig. 1. Flow diagram of the proposed system

Table 1. Normalized symbol

2.1 Feature Extraction

The words in Sentiwordnet are taken as features because the lexicon contains sentiment-classified words for the respective languages. Binary features are features that are marked 1 if a given symbol is present in the tweet. These features are extracted from the Twitter dataset because it contains various special tokens such as @, RT and #, along with a few others listed in Table 2. Stop words are removed from the tweets. All special symbols are removed except the question mark and the exclamation mark, since these punctuation marks can change the meaning of a tweet.
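A minimal sketch of such binary features is given below. The feature names, the regular expressions and the tiny lexicon are illustrative assumptions; the exact feature set is the one in the paper's Table 2, which is not reproduced here.

```python
import re

# Hypothetical feature names and patterns (assumptions for illustration).
BINARY_PATTERNS = {
    "has_mention": re.compile(r"@\S+"),
    "has_hashtag": re.compile(r"#\S+"),
    "has_retweet": re.compile(r"\bRT\b"),
    "has_link": re.compile(r"https?://\S+"),
    "has_question": re.compile(r"\?"),
    "has_exclamation": re.compile(r"!"),
}


def binary_features(tweet, lexicon=frozenset()):
    """Return a dict mapping each feature name to 1 (present) or 0 (absent)."""
    features = {
        name: int(bool(pattern.search(tweet)))
        for name, pattern in BINARY_PATTERNS.items()
    }
    # 1 if any whitespace-separated token appears in the sentiment lexicon.
    features["has_lexicon_word"] = int(any(tok in lexicon for tok in tweet.split()))
    return features


# Toy example; the lexicon stands in for Sentiwordnet entries.
example = "RT @user: good movie! #film http://t.co/x"
print(binary_features(example, lexicon={"good"}))
```

The features are computed on the raw tweet, before special symbols are stripped, since removing @, # or URLs first would erase the very markers being tested for.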

2.2 Naive Bayes Algorithm

Naive Bayes has been used in information retrieval for many years and has more recently been adopted in much machine learning research [13]. Multinomial Naive Bayes is used in this work [14]. Recent work has focused on two basic instantiations of the classifier: the Bernoulli model and the multinomial model [15]. The Bernoulli model represents a document as a vector of binary features, whereas the multinomial model represents a document as a vector of integer features [16]. The multinomial model assumes that the probability of each word event in a document is independent of the word's context and position in the document [17]. To avoid zero probability estimates for unseen terms, a small correction known as Laplace smoothing is included [18, 19]. In general, Naive Bayes is expressed mathematically as

$$\begin{aligned} P(c|d) \propto P(c)\prod \limits _{1 \le k \le {n_d}} {P({t_k}|c)} \end{aligned}$$
(1)
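To make Eq. (1) concrete, the following is a small from-scratch sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing, where each term probability is estimated as (count of the term in the class + alpha) / (total terms in the class + alpha × vocabulary size). It is an illustration only and not the implementation used in the paper; the class name `MultinomialNB`, the parameter `alpha` and the toy English tokens are assumptions. Log-probabilities are summed instead of multiplying probabilities, to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Toy multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant; 1.0 gives add-one smoothing

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.term_counts.values() for t in counts}
        self.n_documents = len(labels)
        return self

    def log_scores(self, tokens):
        """Return log P(c) + sum_k log P(t_k | c) for every class c."""
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_label in self.class_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(n_label / self.n_documents)  # log prior
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore terms never seen in training
                numerator = self.term_counts[label][token] + self.alpha
                denominator = total_terms + self.alpha * vocab_size
                score += math.log(numerator / denominator)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical tokenized training tweets (English stand-ins for readability).
docs = [["good", "film"], ["great", "film"], ["bad", "film"],
        ["poor", "story"], ["film", "today"]]
labels = ["positive", "positive", "negative", "negative", "neutral"]

model = MultinomialNB().fit(docs, labels)
print(model.predict(["great", "film"]))
print(model.predict(["bad", "story"]))
```

Without the smoothing term, a single token unseen in a class would drive that class's product in Eq. (1) to zero regardless of the other evidence.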
Table 2. Binary feature description
Fig. 2. Tweets before and after preprocessing

As shown in Fig. 2, any punctuation in a tweet other than exclamation marks and question marks is removed, and the tweet IDs are also processed. If a tweet contains any binary feature, that feature is marked 1.
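The punctuation rule described above can be sketched as follows. This is a simplified assumption-laden example: it handles only ASCII punctuation, tokenizes on whitespace, and does not model the tweet-ID handling or the symbol normalization of Table 1; the function name `clean_tweet` is hypothetical.

```python
import re
import string

# Every ASCII punctuation character except "?" and "!" is replaced by a space.
_REMOVABLE = "".join(ch for ch in string.punctuation if ch not in "?!")
_STRIP_TABLE = str.maketrans({ch: " " for ch in _REMOVABLE})


def clean_tweet(tweet):
    """Strip punctuation other than ? and !, then split into tokens."""
    text = tweet.translate(_STRIP_TABLE)
    # Pad the retained marks with spaces so each becomes its own token.
    text = re.sub(r"([?!])", r" \1 ", text)
    return text.split()


print(clean_tweet("Wow, great movie!!! Really? (yes)"))
```

Because the table is built from `string.punctuation`, script-specific marks such as the Devanagari danda are left untouched; a real system would need to decide how to treat them.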

Table 3. Twitter training dataset for SAIL 2015 shared task

3 An Analysis of SAIL Dataset

SAIL stands for Sentiment Analysis in Indian Languages. The organizers released a Twitter dataset for three languages: Tamil, Hindi and Bengali. The sizes of the training and test datasets are shown in Table 3; approximately 27 % of the Tamil and Hindi training data and 54 % of the Bengali data contain URLs. Most of the Tamil tweets are movie reviews or comments about actors and actresses, whereas the Hindi tweets are mostly about politics. The dataset has several issues: some tweets appear in both the positive and the negative training data, and some tweets are repeated. Many tweets are misspelt, which affects the accuracy of the classifier. The Tamil training data contains many colloquial words that are not present in Sentiwordnet and is therefore noisy, whereas the Hindi and Bengali tweets are more conventional. The test data also contains many ambiguous tweets, which further lowers the accuracy.

4 Experiments and Results

The experiments were conducted on a 64-bit Windows machine with a Core i7 processor and 8 GB of RAM. The tweets are taken from the dataset. The first step is preprocessing, in which normalization and tokenization are performed; the output of this step is a sequence of raw tokens. These tokens are passed to the feature extractor, which extracts features from them. The words in Sentiwordnet are taken as features, and binary features are added to improve the feature extractor. Sentiwordnet words, hashtags, retweets, links, question marks and exclamation marks are taken as binary features: if a feature is present its value is 1, otherwise 0. The binary features are described in Table 2. This paper places its emphasis on feature extraction. The features are extracted from the training dataset and stored in a text file, and the extracted features are then used in the classification step. Many algorithms are available for classification in machine learning; the one used in this paper is the Naive Bayes classifier, which is based on Bayes' theorem. The training and test datasets are given as input to the classifier, with 10 % of the training data held out as validation data. The classifier labels each tweet as positive, negative or neutral. The performance of the classifier is evaluated using the F-score, which is calculated for each of the three classes as follows:

$$\begin{aligned} F\text{-score}_{pos} = \frac{\text{number of positive documents}}{\text{total number of documents}} \end{aligned}$$
(2)
$$\begin{aligned} F\text{-score}_{neg} = \frac{\text{number of negative documents}}{\text{total number of documents}} \end{aligned}$$
(3)
$$\begin{aligned} F\text{-score}_{neu} = \frac{\text{number of neutral documents}}{\text{total number of documents}} \end{aligned}$$
(4)
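Equations (2)-(4) are stated as ratios of per-class document counts. For comparison, the conventional per-class F-score is the harmonic mean of precision and recall. The sketch below computes that conventional quantity; it is not taken from the paper and may differ from what the authors computed. The function name and the toy labels are assumptions.

```python
def per_class_f_score(gold, predicted, label):
    """Conventional F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted labels for six tweets.
gold = ["pos", "pos", "neg", "neu", "neg", "pos"]
pred = ["pos", "neg", "neg", "neu", "pos", "pos"]
for label in ("pos", "neg", "neu"):
    print(label, round(per_class_f_score(gold, pred, label), 3))
```

Reporting the score per class, as the paper does, shows whether errors are concentrated in one sentiment class, which a single overall accuracy figure would hide.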
Table 4. Accuracy and F-score of the proposed system (cross-validated)

The F-score and accuracy of the system for all three languages are given in Table 4. Table 5 shows the accuracy of the proposed system as evaluated by SAIL. The number of tweets classified into each class for the three languages is shown in Fig. 3.

Table 5. Testing result of the proposed system by SAIL
Fig. 3. Bar chart representation of the final result

5 Conclusion and Future Work

We have presented a method for classifying Twitter data by sentiment, which is highly useful in the field of information retrieval (IR). In this work, tweets in Indian languages are classified into positive, negative and neutral classes. Tweets are preprocessed before classification in order to eliminate unwanted symbols and to retain the words that are most useful for analysing sentiment. Particular care was taken over the preprocessing step, and this, combined with the Naive Bayes algorithm, gave better classification results. As future work, the method can be extended with an SVM classifier and with an unsupervised approach.