1 Introduction

Consider the case of marketing management: review comments on a product help not only the customer, in buying the best product available in the market, but also the seller, in knowing the good or bad features of the product. These reviews are mostly available as unstructured text and need to be classified properly in order to provide useful information to both customer and seller. Thus, classification plays an important role in the analysis of reviews [1,2,3]. Sentiment analysis, also known as opinion mining, analyzes people’s opinions and emotions toward entities such as products, organizations and their attributes [4].

In general, sentiment analysis has been investigated at three levels: document level, sentence level and aspect level [5]. Document-level analysis classifies whether a whole opinion document expresses a positive or negative sentiment. Sentence-level analysis determines whether a sentence expresses a negative, positive or neutral opinion. Aspect-level analysis focuses on the sentiment expressions present within a given document and the aspect to which each refers. Aspect-based sentiment analysis is mainly concerned with a particular aspect of the topic or product [4]. When a document is analyzed on an aspect basis, it may show positive polarity for one aspect and negative polarity for another; thus, the polarity of the document depends upon the aspect considered for analysis. The expressions associated with sentiment are mainly the words or features which express the sentiment of the text, i.e., adjectives and adverbs [48]. During the analysis of reviews, these sentiment words are collected, and the sentiment analysis is carried out based on their polarity values. But as the type of a word is not fixed throughout the text, it is difficult to analyze the polarity of the text. In the positive–negative setting, i.e., the two-class problem, the document is analyzed to provide the overall sentiment polarity of the text. In this study, document-level sentiment analysis has been taken into consideration.

There are mainly three types of machine learning techniques that are often used in sentiment analysis: supervised, unsupervised and semi-supervised learning. In supervised learning, the dataset is labeled and a model is trained on it to obtain an output that helps in proper decision making [6]. Unlike supervised learning, unsupervised learning does not have any labeled data, so such data cannot be processed with ease; to solve the problem of processing unlabeled data, clustering algorithms are applied [7]. The semi-supervised approach is used where there is a small volume of labeled data for training and a large volume of unlabeled data for testing; its main purpose is to label the large unlabeled dataset with the help of the small labeled one [8]. This study presents the impact of a supervised learning method on labeled data.

In order to analyze sentiment reviews, various machine learning techniques have been considered by researchers and practitioners. In this study, the support vector machine (SVM) and artificial neural network (ANN) algorithms are used in combination for classification: SVM selects the best features from the full set of features, and these features are then given as input to ANN, which classifies the testing reviews into the positive or negative group.

The contribution of the paper can be stated as follows:

  i. Two different movie review datasets, IMDb [9] and polarity [2], are considered for sentiment classification. The IMDb dataset has separate data for training and testing, but as the polarity dataset has no such separation, the tenfold cross-validation technique is used for it.

  ii. After removal of stop words and unwanted information from the training data in both datasets, the remaining words are considered as features for sentiment classification. The sentiment value of each word is calculated using SVM; then, for feature selection, only the words whose absolute sentiment value exceeds 0.009 (written mod(0.009) in this paper) are retained.

  iii. These selected features/words are then given as input to ANN, which classifies the testing data, based on the selected features, into either positive or negative polarity.

  iv. During the analysis using ANN, the number of hidden nodes is varied in order to find the best possible solution. With the help of the confusion matrix and performance evaluation parameters such as precision, recall, f-measure and accuracy, the performance of the proposed approach is assessed and compared with the results obtained by different authors as reported in the literature; it is observed that the proposed approach performs more accurately.

The organization of the paper is as follows: The next section discusses the articles available in the literature and justifies the use of the proposed approach. Section 3 describes the methodologies used in sentiment classification. Section 4 presents the setup for the proposed approach. Section 5 presents the performance evaluation of the proposed approach. Section 6 concludes the paper.

2 Literature survey

A good number of articles on sentiment classification of textual reviews are available in the literature. The datasets used in this study are movie review datasets, i.e., the Internet movie database (IMDb) [9] and the polarity dataset [2]. The following subsection reviews the literature on document-level sentiment classification, which is most relevant to this study.

2.1 Document-level sentiment classification

Pang et al. [1] have undertaken sentiment classification with three standard algorithms, i.e., Naive Bayes classification, maximum entropy classification and SVM, applied over the n-gram technique. In their paper, unigrams, bigrams and their combination are considered and the analysis is made on features such as frequency and presence. Pang and Lee [2] have labeled the sentences of a document as either subjective or objective. They have applied a machine learning classifier to the subjective group, which prevents polarity classification from considering useless and misleading data. They have explored extraction methods based on a minimum-cut formulation that provides an effective way to integrate intersentence-level information with the bag-of-words model. Matsumoto et al. [11] have considered the syntactic relationship among words as a basis for document-level sentiment analysis. Frequent word subsequences and dependency subtrees are extracted from sentences and act as features for the SVM algorithm. They have used the unigram method, the bigram method and a combination of both for classification.

Moraes et al. [12] have made a comparison between SVM and ANN for document-level sentiment analysis. They have adopted a feature selection technique and weighting in the bag-of-words (BOW) model. Their experiment indicates that ANN shows superior results in comparison with SVM. Tang [13] has suggested that learning sentiment-specific semantic representations of documents is important for document-level sentiment analysis. He decomposes the review documents into four constituents, i.e., word representation, sentence structure, sentence composition and document decomposition, and uses three different versions of sentiment-specific word embedding (SSWE), i.e., \(\mathrm{SSWE}_u\), \(\mathrm{SSWE}_r\) and \(\mathrm{SSWE}_h\), along with a neural network for classification, evaluated on Twitter sentiment classification using the benchmark dataset from SemEval 2013. Tu et al. [14] have used sequence and convolution kernels with different types of structures for document-level sentiment classification. For sequence kernels, they have used sequences of lexical words (SW), POS tags (SP) and a combination of words and POS tags (SWP); for dependency kernels, they have used word (DW), POS (DP) and combined word and POS (DWP) settings. They used a vector kernel (VK) over a bag of words as the baseline. Their VK + DW approach has shown the best result among all their proposed settings. They have used the polarity dataset [2] for analysis.

Liu and Chen [15] have compared different multi-label classification approaches for sentiment classification. According to the authors, multi-label classification performs the task mainly in two ways, i.e., problem transformation and algorithm adaptation. In problem transformation, the problem is transformed into multiple single-label problems; during training, the system learns from these transformed single-label data, and during testing, the trained classifier makes a prediction at a single label and then translates it to multiple labels. Zhang et al. [16] have proposed the classification of Chinese comments based on word2vec and \(\mathrm{SVM}^\mathrm{perf}\). Their approach consists of two parts: in the first part, the word2vec tool is used to cluster similar features in order to capture semantic features in the selected domain; in the second part, lexicon-based and POS-based feature selection approaches are adopted to generate the training data.

Luo et al. [17] have proposed an approach to convert text data into a low-dimensional emotional space (ESM). They have annotated a small set of words which have a definite and clear meaning and have used two different approaches for assigning weights to words by emotional tags. The total weight of all emotional tags is calculated, and based on these values, the messages are classified into different groups. Niu et al. [18] have proposed a multi-view sentiment analysis dataset consisting of manually annotated image-text pairs collected from Twitter. Their sentiment analysis approaches can be categorized into two types, i.e., lexicon based and statistical learning. In lexicon-based analysis, a set of opinion words or phrases with predefined sentiment scores is considered, while in statistical learning, various machine learning techniques are applied with dedicated textual features. Tripathy et al. [19] have used n-gram machine learning techniques to perform sentiment classification on the IMDb dataset. They have applied Naive Bayes, maximum entropy, SVM and the stochastic gradient descent method using n-gram techniques such as unigram, bigram, trigram, unigram + bigram, bigram + trigram and unigram + bigram + trigram. Among all these approaches, they found that SVM with unigram + bigram showed the best result.

Table 1 provides a comparative study of the different approaches adopted by authors who have contributed to document-level sentiment classification.

Table 1 Comparison of approaches used by authors for document-level sentiment classification

2.2 Sentiment classification using hybrid machine learning approach

Section 2.1 discusses the literature where classification is carried out using individual machine learning techniques. A few authors have also used hybrid methods, i.e., combinations of two or more machine learning algorithms, to perform classification. This section discusses studies that use this hybridization approach.

Govindarajan [20] has proposed a sentiment classification method using the hybridization of Naive Bayes and a genetic algorithm. He used the number of occurrences of a word, i.e., its count, for feature selection: only the words that occur more than three times in the text are considered for further analysis. He also used the best first search (BFS) approach to evaluate the classifier result. Abbasi et al. [21] have proposed the entropy-weighted genetic algorithm (EWGA), a hybridized genetic algorithm. They have combined EWGA with SVM for the classification of web forums in different languages, i.e., English and Arabic.

Balage Filho et al. [22] have adopted a hybrid classification using three classification approaches: rule based, lexicon based and machine learning. They use a pipeline approach to extract the best features from each classifier. Their approach can be divided into two subtasks, i.e., expression-level classification and message-level classification. Jagtap and Dhotre [23] have proposed a hidden Markov model (HMM) and SVM-based hybrid classification, along with a feature extraction model. They have used a voting mechanism to classify the reviews: each classifier votes for the polarity of a review, and the review is assigned the polarity, either positive or negative, that receives the highest vote. Wang et al. [24] have proposed a hybrid feature selection method based on the category-distinguishing ability of words and information gain. They have performed classification using this feature selection with SVM, and their best result is obtained with a feature dimension of 3000.

Table 2 provides a comparative study of different approaches adopted by various authors using hybrid machine learning techniques.

Table 2 Comparison of approaches used by authors for sentiment classification using hybrid machine learning technique

2.3 SVM as feature selection

It is observed from Sects. 2.1 and 2.2 that different machine learning techniques and hybrid approaches have been considered by a good number of authors for sentiment classification of reviews. As SVM is used in this paper for feature selection, the present section discusses the use of SVM for feature selection.

Neumann et al. [26] have proposed four novel continuous feature selection approaches which intend to minimize classifier load. They include linear and nonlinear SVM classifiers. Their focus is mainly on an embedded approach, which minimizes the training errors of a linear classifier; they then set an objective to minimize the selected features for a nonlinear SVM classifier. Lozano et al. [27] have used three different techniques for texture classification using feature selection. Their subgroup-based multiple kernel learning technique performs feature selection by down-weighting or removing subsets of features having similar characteristics. Two conventional feature selection techniques, recursive feature elimination with different classifiers and a genetic algorithm-based approach with SVM as the decision function, are also used. These classifiers are evaluated with the tenfold cross-validation technique. Maldonado et al. [28] have performed feature selection by penalizing each feature used in the dual formulation. Their approach, known as kernel-penalized SVM (KP-SVM), eliminates features having low relevance for the classifier. KP-SVM also employs a stopping condition which avoids the negative effect of feature elimination on the classifier’s performance.

Zheng et al. [29] have used feature selection for Chinese online reviews. They have used N-char-gram and N-POS-gram approaches to obtain potential sentiment features. An improved document frequency method is used for feature subset selection and a Boolean weighting method for calculating feature weights. Finally, a Chi-square test is carried out to test the significance of the experimental results. They have used LIBSVM, an SVM library developed at National Taiwan University, to conduct the experiments. Sharma and Dey [30] have used five different feature selection methods, i.e., information gain (IG), gain ratio (GR), Chi statistics (CHI), Relief-F and document frequency (DF), along with seven machine learning techniques, i.e., Naive Bayes, SVM, maximum entropy, decision tree, K-nearest neighbor, Winnow and Adaboost, for sentiment analysis of movie reviews. Their results show that GR provides better feature selection and SVM performs better than all the other machine learning techniques. Hardin et al. [31] have used SVM-based feature selection: a linear SVM assigns zero weight to irrelevant variables, which then play no active role in classification; feature selection is thus carried out, and finally SVM is used for classification.

Table 3 provides a comparative study of different approaches adopted by authors using SVM as a feature selection technique.

Table 3 Comparison of approaches used by authors for sentiment classification using SVM as feature selection technique

2.4 Motivation for proposed approach

The above-mentioned literature survey helps to identify some possible research areas which can be extended further. The following aspects have been considered for carrying out further research.

  i. A number of authors have used part-of-speech (POS) tags or the count of word occurrences as the criterion for feature selection. But the POS tag of a word is not fixed; it changes with the context of use. For example, the word “book” has the POS “noun” when used as reading material, whereas in “ticket booking” its POS is verb. Likewise, the frequency of a particular word depends largely upon the author’s writing style. Thus, these criteria may not be suitable for feature selection. Hence, in this paper, the sentiment value of each word is calculated and feature selection is carried out on the basis of these sentiment values.

  ii. Different authors have used a single machine learning technique for both feature selection and classification. In that case, the shortcomings of the technique bias the final classification result. To solve this issue, in this paper SVM is used for feature selection, whereas ANN is used for sentiment classification; hence, the bias of any single algorithm does not affect the result of classification.

  iii. Most machine learning algorithms work on data represented as a matrix of numbers, but sentiment data are always in text format and need to be converted to a numeric matrix. Different authors have used TF or TF-IDF to convert the text into a matrix of numbers; in this paper, the combination of TF-IDF and CountVectorizer is applied. Each row of the matrix represents a particular text file, and each column represents a word/feature present in that file, as shown in Table 4.

3 Methodology

Sentiment classification techniques may be categorized into two types, i.e., binary sentiment classification and multi-class sentiment classification [32]. In binary classification, each document \(d_{i}\) in D, where D = {\(d_{1}\),..., \(d_{n}\)}, is assigned a label from a predefined category set C = {Positive, Negative}. In multi-class sentiment analysis, each document \(d_{i}\) is assigned a label from \(C^{*}\), where \(C^{*}\) = {strong positive, positive, neutral, negative, strong negative}. The result of binary classification helps indicate the quality of one product over others. In this paper, binary sentiment classification has been studied.

3.1 Text data to numerical data conversion technique

The conversion of text data into numerical data is carried out using the following functions:

Table 4 Matrix generated under CountVectorizer scheme
  1. CountVectorizer (CV): It converts the text document collection into a matrix of token counts [10]. This function generates a sparse matrix of the counts.

    • Calculation of the CountVectorizer matrix: Suppose we have three different documents containing the following sentences.

      “Book is good.”

      “Book is average.”

      “Book is nice.”

      A matrix of size 3*5 is generated, because there are 3 documents and 5 distinct features: book, is, good, average and nice. The elements of the matrix are shown in Table 4, where each “1” in a row corresponds to the presence of a feature and each “0” to the absence of a feature from the particular document.

  2. Term frequency-inverse document frequency (TF-IDF): It reflects the importance of a word in the corpus or collection [10]. The TF-IDF value increases with the frequency of a word in a document but is offset by the frequency of the word across the corpus, in order to control for the generality of common words. Term frequency is the number of times a particular term appears in a text, and inverse document frequency measures how rare the term is across all documents.

    • Calculation of TF-IDF value: Suppose a movie review contains 100 words, wherein the word Awesome appears 10 times. The term frequency (i.e., TF) for Awesome is calculated as \((10/100)=0.1\). Again, suppose there are 1 million reviews in the corpus and the word Awesome appears 1000 times in whole corpus. Then, the inverse document frequency (i.e., IDF) is calculated as \(\mathrm{log} (1{,}000{,}000/1{,}000)=3\). Thus, the TF-IDF value is calculated as 0.1 * 3 = 0.3.

  3. word2vec: Word2vec was created by a team of researchers led by Tomas Mikolov at Google. It is a group of related models that are used to produce word embeddings [49]. Word2vec takes as input a large corpus of text and produces a high-dimensional vector space, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words sharing common contexts in the corpus are located in close proximity to one another [51].

In this paper, the CV and TF-IDF functions are used to transform the text documents into numerical vectors, which are then considered as input to the supervised machine learning techniques. A few authors have applied word2vec to transform the words in a text into vector form, but in document-level sentiment analysis the machine learning techniques need a vector representation of the whole document. Thus, CV and TF-IDF are more suitable than word2vec for transforming the text into numerical vectors.
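
As an illustration, the following minimal sketch (not the authors' exact code) shows how the three example sentences from the CountVectorizer discussion can be converted into count and TF-IDF matrices; it assumes scikit-learn is available, and the corpus and all values are purely demonstrative:

  # A minimal sketch, assuming scikit-learn; the toy corpus mirrors the three
  # example sentences used for Table 4.
  from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

  docs = ["Book is good.", "Book is average.", "Book is nice."]

  cv = CountVectorizer()
  counts = cv.fit_transform(docs)        # sparse 3 x 5 matrix of token counts
  print(cv.get_feature_names_out())      # ['average' 'book' 'good' 'is' 'nice']
  print(counts.toarray())                # the Table 4 style count matrix

  tfidf = TfidfTransformer()
  weights = tfidf.fit_transform(counts)  # TF-IDF weighted version of the counts
  print(weights.toarray().round(3))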

3.2 Used dataset

In this paper, two different datasets are considered for sentiment classification. The datasets are as follows:

  1. aclIMDb dataset: The acl Internet movie database (IMDb) consists of 12,500 positive labeled test reviews and 12,500 positive labeled train reviews, and similarly 12,500 negative labeled test reviews and 12,500 negative labeled train reviews [9]. Apart from the labeled supervised data, an unsupervised dataset with 50,000 reviews is also present.

  2. polarity dataset: The polarity dataset consists of 1000 positive and 1000 negative labeled reviews [2]. Though the dataset contains both positive and negative reviews, it is not partitioned into separate training and testing sets; in order to perform the classification, the cross-validation method is used for this dataset.

3.3 Sentiment classification technique

As mentioned in Sect. 3.2, two different datasets are used for sentiment classification. Thus, the techniques used for classification are also different in both cases.

  • The IMDb dataset has separate data for training and testing [9]. The training data are used by the machine learning classifier for training, and on the basis of the trained model, the testing data are classified. Different performance evaluation parameters, along with the confusion matrix, are used to evaluate the performance of the classifier.

  • The polarity dataset does not have a separation between testing and training data [2]. Thus, in order to perform classification, the cross-validation technique is used. Cross-validation partitions the dataset into two parts, i.e., a learning set and a validation set [33]. These two sets cross over in successive rounds so that each data point gets validated. k-Fold cross-validation is the basic cross-validation method: the dataset is partitioned into k different folds, of which (\(k-1\)) folds are used for training and one fold is used for testing. Tenfold cross-validation is the setting most commonly used for machine learning and classification problems; a sketch of this scheme follows.
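
A minimal sketch of this tenfold scheme, assuming scikit-learn and using random stand-ins for the 2000 vectorized polarity reviews, is given below:

  # Tenfold cross-validation sketch; X and y are dummy stand-ins for the
  # 2000 polarity reviews (as numeric vectors) and their labels.
  import numpy as np
  from sklearn.model_selection import KFold

  X = np.random.rand(2000, 100)            # illustrative feature matrix
  y = np.array([1] * 1000 + [0] * 1000)    # 1000 positive, 1000 negative labels

  kf = KFold(n_splits=10, shuffle=True, random_state=0)
  for train_idx, test_idx in kf.split(X):
      X_train, y_train = X[train_idx], y[train_idx]   # 1800 reviews per round
      X_test, y_test = X[test_idx], y[test_idx]       # 200 held-out reviews
      # fit the classifier on the training folds and score on the held-out fold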

3.4 Application of machine learning techniques

In this paper, the following two machine learning techniques are used.

  1. Support vector machine (SVM): This method analyzes data and defines decision boundaries using hyperplanes. In the two-category case, the hyperplane separates the document vectors of one class from those of the other class, keeping the gap of separation as large as possible.

    For a training set with labeled pairs \((x_i, y_i), i=1,\ldots ,l\), where \(x_i \in R^n\) and \(y_i \in \{1, -1\}\), the optimization problem that SVM solves [34] may be represented as:

    $$\begin{aligned}&{\mathop {\text{ min }}\limits _{w,b,\xi }} \frac{1}{2}w^Tw + C\sum _{i=1}^{l}\xi _{i}\nonumber \\&\quad \mathrm{subject~to}\quad y_i(w^T \phi (x_i) +b) \ge 1 - \xi _{i}, \nonumber \\&\quad \xi _i \ge 0. \end{aligned}$$
    (1)

    Here, the training vector \(x_i\) is mapped to a higher-dimensional space by \(\phi \). SVM requires input in the form of a vector of real numbers; thus, the text reviews must be converted to numeric vectors before SVM can be applied. After a text file is converted to a numeric vector, it goes through a scaling process so that the values lie in the range [0, 1].

    In SVM, a kernel function is used for pattern analysis. Four types of kernels are most often used in SVM, as follows:

    (a) Linear kernel: The linear kernel function can be represented as follows

      $$\begin{aligned} K(x_{i}, x_{j}) = x_{i}^\mathrm{T}x_{j} \end{aligned}$$
      (2)

      where \(x_i\) and \(x_j\) are the input space vectors and \(x_{i}^\mathrm{T}\) is the transpose of \(x_i\).

    (b) Polynomial kernel: For degree “d,” the polynomial kernel can be defined as

      $$\begin{aligned} K(x_{i}, x_{j}) = \{x_{i}^\mathrm{T}x_{j}+c\}^d \end{aligned}$$
      (3)

      where \(x_i\) and \(x_j\) are the input space vectors, i.e., the features computed from training samples, and “c” is a parameter used for the trade-off between the highest-order and lowest-order polynomial terms. A polynomial kernel with degree = 2 is mainly used in binary sentiment classification.

    (c) Gaussian radial basis function (RBF) kernel: The RBF is a real-valued function whose value depends only on the distance from the origin. The RBF kernel can be defined as follows

      $$\begin{aligned} K(x_{i}, x_{j}) = \mathrm{exp}(-\gamma ||x_{i}-x_{j}||^2)\quad \mathrm{for}\, \gamma > 0 \end{aligned}$$
      (4)

      where \(x_i\) and \(x_j\) are the input space vectors and \(\gamma \) can be taken as \(\frac{1}{2\sigma ^2}\), where \({\sigma ^2}\) is the variance of the input data.

    (d) Sigmoid kernel: The sigmoid kernel can be defined as follows

      $$\begin{aligned} K(x_{i}, x_{j}) = \mathrm{tanh}(a x_{i}^\mathrm{T}x_{j} + b) \end{aligned}$$
      (5)

      where \(x_i\) and \(x_j\) are the input space vectors, \(a>0\) is a scaling parameter for the input data, and b is a shifting parameter that controls the threshold of the mapping.

    Text classification problems are mostly linearly separable [45]. The linear kernel shows better results when a large number of features is present, as mapping the data to a higher-dimensional space does not then improve performance. In text classification, the numbers of documents and features are typically large: for the IMDb dataset, the total number of reviews including both training and testing is 50,000 and the number of features is 159,438, while for the polarity dataset the total number of reviews is 2000 and the number of features is 25,579. Thus, in this paper, the linear kernel is preferred over the other kernels when using SVM. An illustrative comparison of the four kernels is sketched below.
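
    As an illustrative comparison (an assumed sketch with synthetic stand-in data, not the authors' experiment), the four kernels above can be tried with scikit-learn's SVC:

      # Kernel comparison sketch on synthetic stand-in data.
      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      X = rng.random((200, 50))          # 200 "reviews", 50 features each
      y = rng.integers(0, 2, 200)        # binary polarity labels

      for kernel in ("linear", "poly", "rbf", "sigmoid"):
          clf = SVC(kernel=kernel, degree=2)  # degree applies only to 'poly'
          clf.fit(X, y)
          print(kernel, round(clf.score(X, y), 3))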

  2. Artificial neural network (ANN): The neural network method used for classification can be represented as a mapping function

    $$\begin{aligned} F:A ^{d} \longrightarrow A ^{m} \end{aligned}$$
    (6)

    where a “d”-dimensional input is submitted to the network and an “m”-dimensional output vector is obtained with the classification result. Figure 1 shows the structure of a neural network. The input layer consists of “d” neurons representing the “d” input signals (independent variables). The number of neurons in the hidden layer is chosen by the user. Finally, the output layer consists of “m” neurons (considered as dependent variables) [35].

Fig. 1 A typical neural network

In the input layer, the state of each neuron is determined by the input variable. For the other neurons, the state is evaluated from the values of the neurons in the previous layer as:

$$\begin{aligned} a_{j} = \sum _{i=1}^{I} X_{i} W_{ji} \end{aligned}$$
(7)

where \(a_{j}\) is the net input of neuron j, \(X _{i}\) is the output value of neuron i in the previous layer, and \(W_{ji}\) is the weight of the connection between neuron i and neuron j.

The neuron’s activity is usually determined via a sigmoid function.

$$\begin{aligned} g(a_{j}) = \frac{1}{1+e^{-a_{j}}} \end{aligned}$$
(8)

In the backpropagation technique, each iteration tries to minimize the error. The adjustment of weights proceeds from the output layer back to the input layer [36]. Error correction is carried out using the following function:

$$\begin{aligned} \Delta W_{ji} = \eta \delta _{i} F(a_i) \end{aligned}$$
(9)

where \(\Delta W_{ji}\) is the adjustment of the weight between neurons j and i, \(\eta \) is the learning rate, \(\delta _{i}\) depends on the layer, and \(F(a_i)\) is the output of neuron i. The training process is carried out until the error is minimized. After the training process is complete, the performance of the ANN is tested on the testing data and the result is evaluated using the confusion matrix. A small numeric illustration of Eqs. (7) and (8) follows.
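
As a worked illustration of Eqs. (7) and (8), the net input and sigmoid activation of a single neuron can be computed directly; the toy values below are assumptions for demonstration only:

  # Worked illustration of Eqs. (7)-(8) for one neuron, with toy values.
  import numpy as np

  x = np.array([0.5, 0.2, 0.8])      # outputs X_i of the previous layer
  w_j = np.array([0.4, -0.6, 0.1])   # connection weights W_ji into neuron j

  a_j = np.dot(w_j, x)               # Eq. (7): net input, sum of X_i * W_ji
  g = 1.0 / (1.0 + np.exp(-a_j))     # Eq. (8): sigmoid activation g(a_j)
  print(a_j, g)                      # 0.16 and roughly 0.540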

Deep learning approaches are mainly used for problems where multiple types of features are present, e.g., where video, image and sound data must all be handled together [46]. The convolutional neural network (CNN) is a kind of ANN in which weights are shared across the network structure [47]. In CNN-based text analysis, continuous bags of words (CBOW) are considered, but in the present case each word is treated as a single separate unit, so considering continuous word windows changes the accuracy. ANN, on the other hand, is mainly used when all the features are of the same type. In this paper, the features obtained while classifying the reviews are all of the same type, each word being a separate feature, and combining them would affect the accuracy of the result. Thus, ANN is preferred over CNN and most general deep learning techniques.

3.5 Performance evaluation parameters

The confusion matrix, also known as a contingency table, is typically used in supervised machine learning to visualize the performance of an algorithm. From the classification point of view, its elements, true positive (TP), false positive (FP), true negative (TN) and false negative (FN), are used to compare the predicted and actual class labels [37]. True positives are positive reviews that the classifier also classifies as positive, whereas false negatives are positive reviews that the classifier classifies as negative. Similarly, true negatives are negative reviews that the classifier also classifies as negative, whereas false positives are negative reviews that the classifier classifies as positive (Table 5).

Table 5 Confusion matrix

The elements of the confusion matrix can be used to compute evaluation parameters such as precision, recall and accuracy, which indicate the performance of the classifier.

  1. Precision: It measures the exactness of the classifier result. It is the ratio of the number of examples correctly labeled as positive to the total number of examples classified as positive.

    $$\begin{aligned} \mathrm{Precision} = \dfrac{\textit{TP}}{\textit{TP} + \textit{FP}} \end{aligned}$$
    (10)
  2. Recall: It measures the completeness of the classifier result. It is the ratio of the number of examples correctly labeled as positive to the total number of examples that are truly positive.

    $$\begin{aligned} \mathrm{Recall} = \dfrac{\textit{TP}}{\textit{TP} + \textit{FN}} \end{aligned}$$
    (11)
  3. Accuracy: It is the most common measure of classification performance, calculated as the ratio of correctly classified examples to the total number of examples.

    $$\begin{aligned} \mathrm{Accuracy} = \dfrac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{TN}+\textit{FP}+\textit{FN}} \end{aligned}$$
    (12)
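
The sketch below (with illustrative labels, assuming scikit-learn's confusion_matrix helper) shows how Eqs. (10)-(12) follow from the confusion matrix entries:

  # Computing precision, recall and accuracy from the confusion matrix.
  from sklearn.metrics import confusion_matrix

  y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual polarities (1 = positive)
  y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier output

  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  print("precision:", tp / (tp + fp))            # Eq. (10): 0.75
  print("recall:", tp / (tp + fn))               # Eq. (11): 0.75
  print("accuracy:", (tp + tn) / len(y_true))    # Eq. (12): 0.75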

4 Proposed approach

The IMDb and polarity datasets are preprocessed in order to remove stop words and unwanted information. The processed textual data are then transformed into a matrix of numerical vectors using the vectorization techniques. Further, the dataset goes through a feature selection step, which selects features based on a threshold condition, and classification is then carried out on the selected features. A stepwise elaboration of the approach is given in Algorithm 1.

Algorithm 1 (pseudocode figure)

The detailed description of Algorithm 1 is presented as follows:

  Step 1. The two datasets used for classification are as follows:

    • The IMDb dataset consists of 12,500 positive and 12,500 negative reviews for training, and the same number of reviews for testing [9].

    • The polarity dataset consists of 1000 positive and 1000 negative reviews, which are not separated into training and testing sets [2].

  Step 2. The text reviews contain absurd information, which needs to be removed from the original reviews before they are considered for classification. The absurd information includes:

    • Stop words: Stop words do not play any role in determining the sentiment and thus may be removed. The list of stop words is collected from the site “http://norm.al/2009/04/14/list-of-english-stop-words/”; any word in a review that appears in this list is treated as a stop word and removed.

    • Numeric and special characters: The text reviews contain numeric characters (1, 2,..., 5, etc.) and special characters (@, #, $, %, etc.), which have no effect on the analysis but create confusion during the conversion of the text files to numeric vectors.

    • URLs and HTML tags: This information also needs to be removed, as it plays no role in finding the sentiment.

    After the absurd information is removed, the stemming process is carried out, i.e., each word is reduced to its root; for example, the root word of reading is read. For stemming, the PorterStemmer tool is used, which removes the common morphological and inflexional endings from words in English [50]. A sketch of this preprocessing step is given below.
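
    A minimal sketch of this preprocessing step, assuming NLTK's PorterStemmer and simple regular expressions as stand-ins for the exact cleaning rules (the stop list here is abbreviated for illustration), is shown below:

      # Preprocessing sketch: strip HTML/URLs/non-letters, drop stop words, stem.
      import re
      from nltk.stem import PorterStemmer

      STOP_WORDS = {"a", "an", "the", "is", "was", "of", "and", "this"}

      def preprocess(review: str) -> str:
          review = re.sub(r"<[^>]+>", " ", review)     # drop HTML tags
          review = re.sub(r"http\S+", " ", review)     # drop URLs
          review = re.sub(r"[^A-Za-z ]", " ", review)  # drop numbers/specials
          stemmer = PorterStemmer()
          kept = [stemmer.stem(w) for w in review.lower().split()
                  if w not in STOP_WORDS]
          return " ".join(kept)

      print(preprocess("Reading this <b>book</b> was great!! http://example.com"))
      # -> "read book great"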

  Step 3. After the preprocessing of the text reviews, they need to be arranged into a matrix of numeric vectors. The functions for conversion of the text files to numeric vectors are as follows:

    • CV: It converts the text reviews into a matrix of token counts. It implements both tokenization and occurrence counting.

    • TF-IDF: It indicates the importance of a word to the document and to the whole corpus. Term frequency gives the frequency of a word within a document, and IDF reflects how rare the word is in the whole corpus.

  Step 4. After the text reviews are converted to numeric vectors, this information is considered for feature selection using the SVM algorithm. The feature selection steps for the two datasets are as follows:

    (a) The IMDb dataset has 12,500 positive and 12,500 negative reviews for training, with the same number of reviews present for testing; for the purpose of feature selection, only the training data are considered. The polarity dataset has 1000 positive and 1000 negative reviews; as the tenfold cross-validation technique is used for its classification, 900 positive and 900 negative reviews are given as input for feature selection, since feature selection is carried out only on the training data.

    (b) After the preprocessing stage, when the unwanted words and information have been removed, the remaining words are considered as features: 159,438 features are obtained from the IMDb dataset and 25,579 from the polarity dataset.

    (c) A matrix is then generated, where each row specifies a file and each column specifies a feature with its occurrence.

      $$\begin{aligned} \begin{bmatrix} x_{11}&\quad x_{12}&\quad x_{13}&\quad \ldots&\quad x_{1n} \\ x_{21}&\quad x_{22}&\quad x_{23}&\quad \ldots&\quad x_{2n} \\ \ldots&\quad \ldots&\quad \ldots&\quad \ldots&\quad \ldots \\ x_{m1}&\quad x_{m2}&\quad x_{m3}&\quad \ldots&\quad x_{mn} \end{bmatrix} \begin{bmatrix} \alpha _1\\ \alpha _2\\ \ldots \\ \alpha _n \end{bmatrix} \end{aligned}$$

      Each element \(x_{ij}\) represents the occurrence of feature j in review i, and \(\alpha _j\) is a random variable multiplied with feature j.

      $$\begin{aligned} \begin{bmatrix} \alpha _1x_{1, 1} + \alpha _2x_{1, 2} + \alpha _3x_{1, 3} +\ldots + \alpha _{25579}x_{1,25579} \\ \alpha _1x_{2, 1} + \alpha _2x_{2, 2} + \alpha _3x_{2, 3} + \ldots + \alpha _{25579}x_{2,25579}\\ \alpha _1x_{3, 1} + \alpha _2x_{3, 2} + \alpha _3x_{3, 3} + \ldots + \alpha _{25579}x_{3, 25579}\\ \ldots \\ \ldots \\ \ldots \\ \ldots \\ \ldots \\ \alpha _1x_{2000, 1} + \alpha _2x_{2000, 2} \ldots + \alpha _{25579}x_{2000, 25579}\\ \end{bmatrix} \end{aligned}$$
      Table 6 Result obtained using different number of hidden nodes on IMDb dataset

      As the reviews are labeled, it is known whether each review is positive or negative. If the sum of the products \(\alpha _j x_{ij}\) for a review is positive, the review is considered to be of positive polarity, or else of negative polarity. As the \(\alpha \) values are chosen at random, the computed polarities sometimes do not match the known labels; in that case, another set of \(\alpha \) values is considered. These \(\alpha \) values finally give the sentiment values of the respective features.

    (d) Thus, a sentiment value is obtained for each feature, but not all words affect the polarity of a review to the same degree. In this paper, the features selected are those whose absolute sentiment value is greater than 0.009, written mod(0.009); the set of features is thus reduced from 159,438 to 19,729 for the IMDb dataset and from 25,579 to 3199 for the polarity dataset. An illustrative code sketch follows.
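
      The following sketch illustrates this selection step under assumed stand-in data: a linear SVM is fitted on a training matrix and only the features whose absolute learned weight exceeds 0.009 are kept. It is an interpretation of Step 4 using scikit-learn, not the authors' exact code:

        # SVM-based feature selection sketch: keep features with |weight| > 0.009.
        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(0)
        X_train = rng.random((1800, 2000))         # stand-in training matrix
        y_train = np.array([1] * 900 + [0] * 900)  # 900 positive, 900 negative

        svm = LinearSVC(C=1.0, max_iter=5000)
        svm.fit(X_train, y_train)

        mask = np.abs(svm.coef_.ravel()) > 0.009   # the paper's mod(0.009) rule
        X_selected = X_train[:, mask]
        print("features kept:", int(mask.sum()), "of", X_train.shape[1])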

  Step 5. After feature selection is complete, classification is carried out. For the polarity dataset, tenfold cross-validation is used, i.e., 90% of the reviews are used for training (as in the feature selection step) and the remaining 10% are used for testing. The result is analyzed using performance evaluation parameters such as precision, recall and accuracy. The number of hidden neurons is left to the user in order to obtain an improved accuracy value; in this paper, the number of hidden nodes is varied over 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000 and 5000. The input matrix for the IMDb dataset is of size \(25{,}000\times 19{,}729\) and that for the polarity dataset is of size \(1800\times 3199\); in both cases, the output layer has two neurons, i.e., positive or negative. A sketch of this hidden-node sweep is given after the discussion below.

    Tables 6 and 7 show the results obtained using ANN classification on the IMDb dataset and the polarity dataset, respectively.

    From Table 6 and Fig. 2, it can be observed that the number of hidden nodes of the ANN is varied until the best possible result is obtained for the IMDb dataset. The accuracy of the proposed system increases up to 600 hidden nodes; after 600 nodes, the accuracy is stable or decreases due to overfitting of the machine learning algorithm.

    From Table 7 and Fig. 3, it can be seen that the accuracy of the proposed system increases up to a specific number of hidden nodes, i.e., up to 500 hidden nodes, and after that it is either stable or decreases; the hidden nodes are nevertheless varied up to 5000. The decrease in accuracy is due to overfitting of the machine learning algorithm.
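
    The sketch below mirrors this hidden-node sweep, assuming scikit-learn's MLPClassifier as the ANN and synthetic stand-in data in place of the selected features:

      # Hidden-node sweep sketch: vary the hidden-layer size, record accuracy.
      import numpy as np
      from sklearn.neural_network import MLPClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score

      rng = np.random.default_rng(0)
      X = rng.random((2000, 319))              # stand-in for selected features
      y = np.array([1] * 1000 + [0] * 1000)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                                random_state=0)

      for n_hidden in (100, 200, 300, 400, 500, 600):
          ann = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                              activation="logistic", max_iter=300,
                              random_state=0)
          ann.fit(X_tr, y_tr)
          print(n_hidden, round(accuracy_score(y_te, ann.predict(X_te)), 3))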

Table 7 Result obtained using different number of hidden nodes on polarity dataset
Fig. 2 Result obtained using different number of hidden nodes on IMDb dataset

Fig. 3 Result obtained using different number of hidden nodes on polarity dataset

5 Performance evaluation

Table 8 and Fig. 4 compare the accuracy values of the proposed approach with other approaches available in the literature using the IMDb dataset. It can be observed from both the figure and the table that most authors have preferred NB, SVM or a combination of them for classification. With the proposed combination of SVM and ANN, the result obtained is found to be better than the results obtained by the other authors. Those authors have considered all words for classification, but when feature selection is carried out, the obtained result is found to be more accurate.

Table 8 Comparative result obtained by different literature using IMDb dataset
Fig. 4 Comparative result obtained by different literature using IMDb dataset

Table 9 and Fig. 5 compare the accuracy values of the proposed approach with other approaches available in the literature using the polarity dataset. It is observed from both the table and the figure that most authors have preferred NB and SVM for classification; Moraes et al. [12] have used ANN along with NB and SVM. It is evident from the figure and the table that the proposed approach has shown better accuracy than the results in the other literature. As there is no separation between testing and training data in the polarity dataset, the cross-validation technique is used by the different authors for classification, i.e., tenfold cross-validation, where 90% of the dataset is used for training and the remaining 10% for testing. Again, in the proposed approach, instead of using all features as the other authors do, only the features with the best sentiment values are considered for classification.

Table 9 Comparative result obtained by different literature using polarity dataset
Fig. 5 Comparative result obtained by different literature using polarity dataset

5.1 Managerial insights based on result

The managerial insight based on the obtained result can be explained as follows:

  • It has been common practice for sellers to send questionnaires to customers asking for feedback on the products they have bought. Nowadays, buyers or users share those views through reviews or blogs instead.

  • These reviews can be collected and given as input to the proposed approach to support qualitative decisions.

  • The proposed approach classifies the reviews into positive or negative polarity and is hence able to inform managers about the shortcomings or good features of the product for future decision making.

  • Every product needs improvement over time, and updates are needed for the product to remain useful in all circumstances; for that, comments about the product must be obtained from its users.

  • Users share their comments after use. The collected comments are either unlabeled or, if labeled, mostly expressed as five-star ratings. Proper analysis of these reviews highlights the issues related to the product and helps keep it up to date.

  • Collecting reviews about the product within a specific time interval, analyzing them and then acting on the findings maintains the quality of the product and helps it succeed in a competitive market.

6 Conclusion and future work

In this study, an attempt has been made to combine two different machine learning algorithms, i.e., SVM and ANN, in order to classify the sentiment of review data associated with movies. The SVM method analyzes the features, i.e., the words present in the reviews, and assigns a sentiment value to each; using this sentiment value as the criterion, the best features are selected from the full feature list. By varying the number of neurons in the hidden layer of the ANN, the accuracy value is determined. The final output indicates whether a review is positive or negative in nature. The accuracy obtained with this method is found to be comparatively better than the results obtained by other authors.

The present study also has some limitations, as mentioned below:

  • The Twitter comments are mostly small in size. Thus, the proposed approach may have some issues while considering these reviews.

  • Different reviews or comments contain emoticon symbols (rendered as images in the original text), which help in presenting the sentiment, but these images are not taken into consideration in this study for analysis.

  • In order to put stress on a word, some people often repeat the last character of the word a number of times, e.g., “greatttt,” “fineee.” These words do not have a proper meaning, but they could be further processed to identify sentiment. However, this aspect is also not considered in this paper.

In future work, the above-mentioned limitations may be addressed in order to improve the quality of sentiment classification. An attempt may also be made to add a few linguistic features while performing the classification of reviews.