
1 Introduction

A huge amount of data is being created on a daily basis. Making sense of this data is an important task, which is where data mining comes into the picture. Data mining is an interdisciplinary subfield of computer science. Data mining helps in identifying and extracting information and patterns from large data sets.

Opinion mining, also known as sentiment analysis, opinion extraction or review mining, is an important specialisation of data mining. It makes use of text analysis, natural language processing and computational linguistics to make sense of the subjective information in a given material, which may concern anything from products to services.

Sentiment analysis is extremely useful as it allows us to gain an overview of the wider public opinion behind certain topics. A huge amount of data is available, from which critical information and patterns can be retrieved. This retrieval of information can be a challenging task, which makes it that much more fascinating. Also, the commercial use of this information makes it an important task to be undertaken.

In the Indian context, the Internet user base is very large, at over 400 million users. With the exponential rise in this user base, there has been a rise in the number of users who use specific websites and social platforms to express their opinions and to get first-hand reviews of products or topics. The popularity of gaining knowledge from expressed opinions can be gauged by the fact that one website (www.mouthshut.com) [1] claims that lakhs of users have been influenced by the opinions expressed on it.

By 2013, the electronics section of amazon.com alone held more than 1.2 million reviews [2]. Also, according to the Internet and Mobile Association of India, more than 66% of Internet users claimed that they were influenced by online movie reviews [3].

This suggests that more and more people are turning to the Internet to share their opinions and to read customer reviews in order to make informed decisions, which is why sentiment analysis is important. This paper gives a brief introduction to the different approaches to sentiment analysis and to the current research being carried out on these approaches.

2 Sentiment Analysis

Many techniques have been proposed in the field of sentiment analysis. Some of the related works are discussed in this section.

Farra et al. [4] conduct sentiment analysis for Arabic texts at both the sentence level and the document level. For sentence-level sentiment analysis, two approaches are investigated. For document-level classification, they propose to use sentences of known classes to classify whole documents. The major advantage of this paper is that it provides a novel idea for document-level classification.

Tungthamthiti et al. [5] propose a new method to identify sarcasm in tweets (Twitter messages). The paper focuses on several approaches, including concept-level sentiment analysis, the use of common-sense knowledge, coherence and classification through machine learning.

The biggest advantage of this paper is that it tackles sarcasm, one of the most difficult tasks in natural language processing, with a respectable reported accuracy of 80%.

Sharma et al. [6] propose a simple system for sentiment analysis of movie reviews. With the help of a list containing sentiment scores for opinion words, the sentiment score of each opinion is obtained. Finally, the scores are aggregated to give a sentiment score at the document level.

Khan and Baharudin [7] propose a rule-based, domain-independent sentiment analysis method. The paper proposes to use the popular sentiment dictionary SentiWordNet to calculate the polarity of all the words. The biggest contribution of this paper is the creation and use of a knowledge base for domain-independent sentiment classification.

In the Indian context, it is very common to write Hindi words in the English script; these are known as Hinglish words. General classifiers and feature extractors do not take these words into consideration while performing sentiment analysis. Seshadri et al. [8] provide a way in which a classifier, in this case the Naïve Bayes classifier, can be effectively modified through the addition of a corpus containing Hinglish words.

Most existing machine learning approaches struggle with complex linguistic structures and with understanding the context of sentences. To counter this, Yang and Yang [9] propose to build a context-aware system, so that the sentiment of individual sentences can be understood. The system makes use of context-aware constraints, which are helpful when the amount of labelled data is limited.

Over the years, many techniques have been proposed for sentiment analysis. These include supervised and unsupervised learning approaches and dictionary-based or corpus-based approaches. Fig. 1 gives the taxonomy of the basic techniques for sentiment analysis.

Fig. 1 Classification of techniques for sentiment analysis [10]

The techniques for sentiment analysis can be broadly classified into two types: machine learning-based and lexicon-based approaches.

2.1 Lexicon-Based Approach

A lexicon is the vocabulary of a language. Generally, a language is considered to have two parts: a lexicon, which is essentially a complete catalogue of the language’s words, and a grammar, which is a set of rules for using these words to form meaningful sentences.

The rapid growth of technology is helping entrepreneurs to launch new products almost on a daily basis. In this competitive field, having information about rival products is a necessity. To make the job simpler, Kuppili et al. [11] propose a system called the variance-based product recommendation (VPR) approach. The main aim of this approach is to find the top competitors of a newly launched product. This is done by checking the similarity of the product descriptions.

On the other hand, Bhoir and Kolte [12] found that while reviewing movies, especially Hindi movies, people used language which was generally not found in a usual lexicon dictionary. So, they proposed to develop a new lexicon dictionary with sentiment scores, which proved very useful for obtaining a better sentiment score.

The lexicon-based sentiment analysis can again be divided into two types: dictionary-based and corpus-based approaches.

2.1.1 Dictionary-Based Approach

In dictionary-based sentiment analysis, a dictionary with pre-defined sentiment scores is created and this dictionary is used to find the overall sentiment score. There are many dictionaries with sentiment scores for many languages; for example, SentiWordNet is a very popular sentiment dictionary for English.
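The dictionary lookup described above can be illustrated with a minimal sketch. The dictionary entries, their scores and the function names are assumptions made for this example; they are not taken from SentiWordNet or from any of the cited papers.

```python
# Hypothetical mini-dictionary: word -> sentiment score in [-1, 1].
# The values are invented for illustration; they are not SentiWordNet scores.
SENTIMENT_DICTIONARY = {
    "good": 0.5,
    "great": 0.75,
    "bad": -0.5,
    "terrible": -0.75,
}


def document_score(text, dictionary=SENTIMENT_DICTIONARY):
    """Sum the dictionary scores of every known word in the text."""
    words = text.lower().split()
    return sum(dictionary.get(word, 0.0) for word in words)


def polarity(text, dictionary=SENTIMENT_DICTIONARY):
    """Map the aggregated score to a coarse polarity label."""
    score = document_score(text, dictionary)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Words missing from the dictionary contribute nothing to the score, which is one reason a fixed dictionary can struggle with domain-specific vocabulary.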

Jose and Chooralil [13] proposed a system which uses these standard sentiment dictionaries, WordNet and SentiWordNet, for the prediction of election results by using real-time tweets.

However, Kawabe et al. [14] found that in many cases these dictionaries could not be used. In the aftermath of a tsunami, many people were using a microblogging site for rumour-mongering, and it was impossible to ascertain the credibility of the tweets using the standard sentiment dictionary. So, the paper proposes a new method for determining the sentiment of words and phrases, along with substantial changes to the sentiment dictionary.

2.1.2 Corpus-Based Approach

In corpus-based lexicon sentiment analysis, instead of a sentiment dictionary, a large collection of sentences with sentiment scores is maintained for a particular language. The sentiment score calculated for a particular word is more dynamic in the corpus-based approach. Because of this dynamism, the corpus-based approach is generally considered better than the dictionary-based approach. However, the corpus used should be sufficiently large, otherwise the sentiment scores may not be accurate.
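As a rough sketch of why corpus-derived scores are dynamic, the example below scores each word as the mean score of the sentences containing it. This averaging rule, the toy corpus and the function name are assumptions chosen for illustration, not the method of any cited paper.

```python
from collections import defaultdict

# Hypothetical corpus: (sentence, sentiment score) pairs, invented for illustration.
CORPUS = [
    ("the plot was great", 1.0),
    ("great acting but slow plot", 0.5),
    ("slow and boring", -1.0),
]


def word_scores(corpus):
    """Score each word as the mean score of the sentences that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sentence, score in corpus:
        # set() counts a word once per sentence, even if it is repeated.
        for word in set(sentence.lower().split()):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


scores = word_scores(CORPUS)
```

Adding or removing sentences changes the resulting word scores, whereas a dictionary score stays fixed; with only three sentences, however, the estimates are clearly unreliable, which mirrors the need for a sufficiently large corpus.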

Abdulla et al. [15] demonstrate the development of a corpus for the Arabic language. To demonstrate the usefulness of building a corpus for sentiment analysis, they created a corpus containing 1000 positive tweets and 1000 negative tweets.

The paper demonstrates the usefulness of developing or using a corpus for a language over directly using a sentiment dictionary for the analysis of sentiments.

Riloff and Shepherd [16] take the idea of semantics forward. Instead of using a corpus of a language and creating semantic relationships for sentiment analysis, they propose to develop a complete corpus using semantics as a base.

Motivated by this observation, Carvalho et al. [17] propose to use paradigm words to classify tweets. The paradigm words are selected using a genetic algorithm, which searches for a suitable subset of such words. As a result, the paper claims to improve the classification accuracy.

In corpus-based approaches, the availability of a corpus is critical and the amount of data required is large. The major drawback of the corpus-based approach is that no model can be trained to perform sentiment analysis on unknown data.

2.2 Machine Learning-Based Approach

The second broad class of sentiment analysis techniques is the machine learning-based approach. Machine learning deals with algorithms that allow computers to learn. Data generally contains hidden patterns and insights, which come to the fore when it is examined closely. A learning algorithm can exploit them because it forms a generalised rule from what it has seen in the data it was given [18].

The training data is the most important part of a machine learning algorithm. Without the proper selection of training and testing data, a machine learning approach is unlikely to perform well. For sentiment analysis of Twitter data, Suchdev et al. [19] propose to take a minimum of 5600 tweets as training data for a specific domain.

The machine learning-based approach can be further classified into two types: supervised learning and unsupervised learning.

2.2.1 Unsupervised Learning

The problem of unsupervised learning is that of trying to find hidden structure in unlabelled data. Since the examples given to the learner are unlabelled, there is no error or reward signal to evaluate a potential solution. Because of the absence of labelled data, the unsupervised learning approach makes use of clustering and association techniques to find the relation between the unknown data and the unlabelled training data.

Generally, in unsupervised learning, where there is no labelled data, clustering and association rules can be applied with the help of the bag-of-words technique, using weightings such as Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF).
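To make the TF and TF-IDF weightings concrete, the following sketch computes them for a toy collection. It assumes one common variant (TF as the term count divided by the document length, IDF as the natural logarithm of N/df); other variants exist, and the function name and sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    TF is the term count divided by the document length;
    IDF is log(N / df), where df is the number of documents containing the term.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical documents, invented for illustration.
docs = ["good movie", "bad movie", "good good plot"]
weights = tf_idf(docs)
```

Under this variant, a term that occurs in every document receives a weight of zero, so very common words contribute little to clustering or classification.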

Also, word2vec is a system which learns to correlate words with other words in an unlabelled, unsupervised setting. Doc2vec, on the other hand, learns to correlate labels, and Sanguansat [20] proposes an extension to this technique by adding a distributed memory to speed up the system.

Khanaferov et al. [21] are of the opinion that it is very difficult to carry out sentiment analysis on the Internet, especially on microblogging sites like Twitter, because the opinions expressed on social platforms are very difficult to classify using supervised learning. So, clustering was considered to be the best option.
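A minimal sketch of clustering on unlabelled data is given below, using plain k-means. The choice of k-means, the two-dimensional features and the seeding rule are assumptions made for illustration; they are not claimed to be the procedure used in [21].

```python
def k_means(points, k, iterations=10):
    """Cluster points (tuples of floats) into k groups with plain k-means.

    Assumption: the first k points seed the centroids, which keeps the
    sketch deterministic; real systems usually pick seeds more carefully.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, point in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical 2-D feature vectors, e.g. (positive-word count, negative-word count).
tweets = [(3.0, 0.0), (0.0, 3.0), (4.0, 1.0), (1.0, 4.0)]
labels, centroids = k_means(tweets, k=2)
```

No labels are supplied at any point; the groups emerge from the distances between the feature vectors alone, and interpreting a cluster as positive or negative remains a separate step.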

2.2.2 Supervised Learning

In supervised learning, a function is inferred from labelled training data. In supervised learning, the algorithm is given both the input and a desired output and it produces an inferred function. This function is used for mapping of new examples. This is done by the learning algorithm by generalising from the training data to unseen situations. Because of the presence of labelled data, the supervised learning makes use of classification and regression techniques.

Bouazizi and Ohtsuki [22] found that even though there had been a surge in automated systems for sentiment analysis of tweets, these state-of-the-art approaches mostly focused on binary and ternary sentiment classification. They were of the opinion that it would be very interesting to carry out a deeper classification. So, they proposed to classify a text into seven classes: “happiness”, “sadness”, “anger”, “love”, “hate”, “sarcasm” and “neutral”.

Classifiers are generally used in the supervised learning approach. These classifiers can be categorised into four types.

i. Decision Tree Classifier

A decision tree is a decision support tool which uses a tree-like graph to map decisions to their possible consequences.

Zharmagambetov and Pak [23] combine the ideas of unsupervised and supervised learning. The paper proposes to use Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) features together with a decision tree to obtain a more accurate sentiment analysis score.
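The tree-like mapping from decisions to outcomes can be sketched as follows. The tree here is written by hand purely for illustration, with word-presence tests as an assumed feature type; a learned tree, such as one trained on TF-IDF features, would be traversed in the same way.

```python
# Hypothetical hand-built tree: internal nodes test whether a word is present,
# leaves carry a sentiment label. The structure is invented for illustration.
TREE = {
    "word": "not",
    "present": {"word": "good", "present": "negative", "absent": "neutral"},
    "absent": {"word": "good", "present": "positive", "absent": "neutral"},
}


def classify(text, node=TREE):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    words = set(text.lower().split())
    while isinstance(node, dict):
        branch = "present" if node["word"] in words else "absent"
        node = node[branch]
    return node
```

Each path from the root to a leaf reads as a rule, for example "contains 'not' and contains 'good' implies negative", which is what makes decision trees comparatively easy to inspect.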

ii. Linear Classifier

The goal of any classifier is to understand the characteristics of an object and use this information to identify its class or group. In linear classification, the classifier applies a linear operation to the input features in order to perform classification. Linear classification is achieved by two techniques: neural networks and support vector machines.
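As an illustration of a linear decision rule, the sketch below trains a single perceptron, the simplest neural unit, which classifies by the sign of a weighted sum of the features. The feature definition, the toy data and the function names are assumptions for this example and do not reproduce any cited system.

```python
def predict(weights, bias, features):
    """Linear decision rule: sign of the weighted sum of the features."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation >= 0 else -1


def train_perceptron(samples, labels, epochs=10, learning_rate=1.0):
    """Adjust the weights only when a training sample is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            if predict(weights, bias, features) != label:
                weights = [
                    w + learning_rate * label * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * label
    return weights, bias


# Hypothetical features: (positive-word count, negative-word count);
# label 1 stands for positive sentiment and -1 for negative sentiment.
samples = [(2.0, 0.0), (0.0, 2.0), (3.0, 1.0), (1.0, 3.0)]
labels = [1, -1, 1, -1]
weights, bias = train_perceptron(samples, labels)
```

On this linearly separable toy data the weights stop changing after the first pass; a support vector machine also learns a linear boundary of this form, but chooses it to maximise the margin between the classes.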

Duncan and Zhang [24] demonstrate the use of artificial neural networks in the field of sentiment analysis. The proposed system uses the feedforward neural network system.

Ouyang et al. [25] proposed a convolutional neural network model for sentiment analysis. A convolutional neural network is a variant of the feedforward artificial neural network, inspired by the neurological structure of the animal visual cortex, in which individual neurons respond to overlapping regions that tile the visual field.

For the sentiment analysis, the proposed system uses word2vec by Google to compute vector representations of words. These vector representations are the input to the proposed neural network.

The second technique for linear classification is the support vector machine (SVM). An SVM analyses the data and works in a non-probabilistic manner. Generally, the SVM is used for binary classification. The SVM training algorithm builds a model that assigns new, unseen examples to one category or the other.

Devi et al. [26] demonstrate the effectiveness of SVM in sentiment analysis. The proposed system first performs sentence-level classification. This is followed by POS tagging, subjectivity/objectivity classification and extraction of aspects.

On the other hand, Zhao et al. [27] proposed a technique that combines semantic and prior polarity information with SVM. The proposed system incorporates semantic features, prior polarity score features and n-gram features as the sentiment feature set for training a support vector machine (SVM) model, which then performs sentiment classification.

iii. Rule-Based Classifier

In a rule-based classifier system, the administrator has to provide an extensive and exhaustive set of rules to classify the input data.

Im Tan et al. [28] proposed a rule-based system for sentiment analysis of financial news. The system assigns polarity at the sentence level and proposes seven sets of phrase rules.

Rules are given for noun phrase sentiment composition, verb phrase sentiment composition, verb–noun/noun–verb phrase sentiment composition, preposition phrase sentiment composition and conjunctions.
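The general idea of composing sentence polarity from hand-written rules can be sketched as below. The three rules, the word polarities and the function names are invented for illustration; they are not the rule sets defined in [28].

```python
# Hypothetical word polarities and rules, invented for illustration only;
# they are not the rule sets of the cited system.
POLARITY = {"profit": 1, "growth": 1, "loss": -1, "decline": -1}
NEGATORS = {"no", "not", "without"}


def phrase_polarity(phrase):
    """Rule 1: sum word polarities. Rule 2: a negator flips the next polar word."""
    total = 0
    negate = False
    for word in phrase.lower().split():
        if word in NEGATORS:
            negate = True
        elif word in POLARITY:
            total += -POLARITY[word] if negate else POLARITY[word]
            negate = False
    return total


def sentence_polarity(sentence):
    """Rule 3 (conjunction): after 'but', only the final clause counts."""
    clauses = sentence.lower().split(" but ")
    score = phrase_polarity(clauses[-1])
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Every behaviour of the classifier is traceable to an explicit rule, but each new linguistic pattern requires another rule to be written by hand.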

iv. Probabilistic-Based Classifier

A probabilistic-based classifier predicts the probability distribution of classes for the given sample, rather than just giving the likely class that the given sample might belong to.

There are three techniques which use probabilistic approach for classification. These techniques are Naïve Bayes classifier, Bayesian network and maximum entropy classifier.

The Naïve Bayes classifier is based on Bayes’ theorem with independence assumptions between predictors. It is one of the most commonly used algorithms because it is simple to understand and implement, even when the data set is large, as it does not require iterative parameter estimation.
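The following sketch shows why no iterative estimation is needed: training reduces to counting words per class. It assumes a multinomial model with add-one (Laplace) smoothing, which is one common formulation; the training sentences and function names are illustrative only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count words per class; no iterative optimisation is needed."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, text):
    """Pick the class maximising log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for illustration.
model = train_naive_bayes(
    ["good great film", "great acting", "bad boring film", "boring plot"],
    ["pos", "pos", "neg", "neg"],
)
```

The independence assumption appears in the prediction step, where the per-word log-probabilities are simply added, as if each word occurred independently of the others given the class.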

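On the other hand, a Bayesian network is a probabilistic graphical model that represents the conditional dependencies among a set of random variables via a directed acyclic graph (DAG). The maximum entropy classifier relies on the principle that entropy is highest when the data is most random: among the models consistent with the training data, it prefers the one with maximum entropy.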

Wikarsa and Thahir [29] effectively demonstrate the usefulness of the Naïve Bayes classifier on a Twitter data set, with a claimed accuracy of around 83%.

On the other hand, Yan and Huang [30] showcase the effective use of the maximum entropy model for Tibetan language sentiment analysis. The maximum entropy model was implemented and tested on around ten thousand Tibetan sentences.

Other papers have made use of multiple classifiers together or have tested a new system on multiple classifiers.

Moh et al. [31] are of the opinion that instead of using a single-tier prediction system, it would be much more beneficial to have a multi-tier prediction system. The predictive model is divided into three tiers. It is claimed that this three-tier system improves the accuracy of the classifier because the complexity of each classification task is reduced.

Jain and Katkar [32] carry out a comprehensive comparison of different classifiers to determine which classifier gives the best result for an experimental Twitter data set. The experiment also used combinations of different classifiers to ascertain the best possible result. It shows that the ensemble-of-classifiers approach, contrary to common perception, was outperformed by single classifiers.

Yang and Zhou [33] pit the Naïve Bayes classifier and the SVM classifier against each other. After the comparison, it is claimed that the Naïve Bayes classifier is much slower than the SVM classifier, that is, it requires more processing time for the same data set. For the data set used, the accuracy of both classifiers was nearly equal.

Yuan et al. [34] proposed a hybrid method for multi-class sentiment classification on microblogging sites. The proposed system combines the model-based approach with the lexicon-based approach.

It has been seen that extensive research has been carried out in the field of sentiment analysis. Many scholars have also acknowledged that proper preprocessing of the extracted data is a necessary step or a prerequisite to get an accurate sentiment score.

3 Analysis

In this section, the analysis of different sentiment analysis approaches and algorithms has been given.

3.1 Analysis of Sentiment Analysis

Sentiment analysis can be categorised into lexicon- and machine learning-based approaches. Each of these two approaches can be further categorised into two types. Table 1 gives the analysis of these four sentiment analysis approaches.

Table 1 Sentiment analysis classification

As unsupervised learning accepts a non-annotated training set, it uses clustering algorithms such as K-means clustering and K-nearest neighbour. Considering all the above sentiment analysis approaches, it can be seen that the supervised learning approach gives comparatively high accuracy; it also uses an annotated training set, which allows classification algorithms to be used.

3.2 Analysis of Supervised Learning Approaches

There are two broad approaches for performing sentiment analysis: lexicon-based and machine learning-based. Each of these has two sub-techniques: dictionary- and corpus-based approaches for the lexicon-based approach, and supervised and unsupervised learning for the machine learning-based approach. Table 2 gives a basic analysis of these four approaches based on the type of learning, the complexity of the approach, etc.

Table 2 Supervised learning approaches

Of the four approaches in supervised learning, linear classifiers and the probabilistic-based approach are the most effective, as they produce the highest accuracy. Support vector machines and neural networks are the two algorithms which fall under linear classifiers, and Naïve Bayes and maximum entropy are the important algorithms which use the probabilistic approach.

3.3 Analysis of Supervised Learning Algorithms

In Table 3, some basic supervised learning algorithms have been analysed with respect to the deployment complexity, the amount of data required for training purpose, etc.

Table 3 Supervised learning algorithms

Of the four supervised algorithms discussed, the support vector machine and Naïve Bayes produce the highest accuracy, and their deployment is not very complex. One important point to note is that the support vector machine finds it difficult to deal with multi-class problems.

4 Conclusion

The adaptation of sentiment analysis and opinion mining to different domains has raised many issues. Over time, many scholars have proposed different techniques in order to solve these issues. Many scholars have also adapted different learning approaches and learning algorithms and have further customised the algorithms to cater to the issues that arise in different domains.

The paper conducts a review of different sentiment analysis algorithms and of the different approaches proposed in different papers. The review highlights that, although there are different machine learning approaches which perform sentiment analysis, supervised learning approaches prove to be better. Further, in supervised learning, even though neural networks are extremely complex to implement, they have been seen to perform much better than the others once the network configuration has been mastered. Also, a simple artificial neural network can be deepened to form a deep neural network, or deep learning network, which gives better accuracy and is more efficient.

Moreover, with the expansion in the use of opinion mining in different fields, issues such as cross-domain sentiment analysis (or cross-domain adaptation) and preprocessing of the data have cropped up. These require in-depth and detailed investigation before the standard machine learning approaches and algorithms are used.