
1 Introduction

Recently, research on natural language processing and text mining has become very important because of the increase in sources that provide electronic datasets. According to [1], many text mining procedures are required to analyze data from social media and e-commerce websites in order to identify different patterns in texts. For text classification, we classify documents into predefined categories. We use a corpus as the primary structure for managing and representing the collection of documents, and all data preprocessing techniques are performed on this corpus. The performance of a document classification technique is assessed by creating a confusion matrix on the training and testing datasets.

The latest developments in the exchange of information and opinions have paved the way for computer applications designed to analyze and detect sentiments expressed on the Internet. Presented in the literature under the names opinion mining and sentiment analysis, sentiment analysis is used, among other purposes, for the detection of opinions on websites and social networks, the clarification of consumer behavior, product recommendation, and the explanation of election results. It is used to find evaluative texts on the Internet, such as reviews and recommendations, and to analyze, automatically or manually, the sentiments expressed there in order to better understand public opinion. Previous studies have already shown that sentiment analysis is particularly attractive to those who want to understand the public, whether for personal, commercial, or political reasons. Thus, many autonomous systems have already been developed for automatic sentiment analysis. In this paper, we present a comparative study of three machine learning algorithms, namely Support Vector Machine, Random Forest, and eXtreme Gradient Boosting, for review classification, and report their confusion matrix results.

2 Literature Review

The most widely used algorithm for document classification is the Support Vector Machine (SVM). Its main feature is to build a hyperplane between the classes that provides the maximum margin and to use this decision boundary for text classification. In the two-dimensional case, the hyperplane is a straight line. The main advantage of SVMs is that they can handle datasets with many attributes with less overfitting than other methods [2, 5]. Nevertheless, SVM classification has speed limitations during both the training and testing phases [3]. XGBoost was designed for speed and performance using gradient-boosted decision trees. It is an implementation of gradient boosting machines, originally developed by Tianqi Chen [4] and further extended by many developers. Random Forest (RF) is an ensemble classifier that proceeds by voting over the results of individual decision trees. Several techniques and methods have been suggested by researchers for growing a random forest classifier [7]. Among these methods, the method proposed in [8] has gained increasing popularity because it performs better than the others.

In [10], the authors considered sentiment classification as a categorization task with negative and positive sentiments. They experimented with three different machine learning algorithms, namely Naive Bayes, Maximum Entropy, and Support Vector Machine classification, applied over n-gram features.

In [6], the authors used a balanced review dataset for training and testing in order to identify the features and scoring methods that determine whether a review is positive or negative. They used classification to classify sentences obtained from a web search, using the product name as the search query.

3 Implementation Procedure

(See Fig. 1).

Fig. 1. Implementation flowchart

3.1 Dataset

(See Table 1).

Table 1. Dataframe

3.2 Feature Extraction

Feature extraction is related to dimensionality reduction. It is a technique that reduces an initial dataset to a smaller set of features for data processing. When the data that would normally be fed to an algorithm is processed in this way, it is converted into a more compact representation. This procedure is called feature extraction.

The extracted features should contain all the information needed from the input data, so that the experiments can be carried out on this reduced representation instead of the complete initial dataset.
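As a rough illustration (not the authors' exact code), the sketch below assumes scikit-learn is available and uses a few hypothetical reviews; it reduces raw text to a fixed-size bag-of-words representation with at most 200 features, one of the feature-set sizes used later in this paper.

```python
# Feature-extraction sketch (assumption: scikit-learn is installed;
# the example reviews and parameter values are hypothetical).
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "great product, works as advertised",
    "terrible quality, broke after one day",
    "average item, nothing special",
]

# Reduce each document to a fixed-size bag-of-words vector
# (limited here to at most 200 unigram/bigram features as one possible setting).
vectorizer = CountVectorizer(max_features=200, ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

print(X.shape)                                  # (3, number_of_extracted_features)
print(vectorizer.get_feature_names_out()[:10])  # a few of the selected terms
```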

Term occurrence (To): \( To_{t,d} \) of term \( t \) in document \( d \) is defined as the number of times that the term \( t \) occurs in document \( d \).

$$ To_{t,d} = n_{t,d} $$

Term Frequency (\( Tf_{t,d} \)): \( Tf_{t,d} \) is the normalized form of \( To_{t,d} \), used to prevent a bias towards very long documents and to measure the importance of the term \( t \) within the particular document \( d \).

$$ Tf_{t,d} = \frac{n_{t,d}}{\sum\nolimits_{k} n_{k,d}} $$

Inverse Document Frequency (\( idf_{t} \)): estimates the rarity of a term in the whole document collection. (If a term occurs in all documents, its idf, and hence its tf-idf, is 0.)

$$ idf_{t} = \log\frac{\left| D \right|}{\left| \left\{ d : t \in d \right\} \right|} $$
  • With \( \left| D \right| \): The total number of documents in the corpus

  • \( \left| \left\{ d : t \in d \right\} \right| \): The number of documents in which the term \( t \) appears, i.e. \( n_{t,d} \ne 0 \).

Term Frequency - Inverse Document Frequency ( \( TFidf_{t,d} \) ):

Finally, the tf-idf of the term \( t \) in document \( d \) is computed as follows:

$$ TFidf_{t,d} = Tf_{t,d} \,*\,idf_{t} $$
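The quantities above can be computed directly, as in the following sketch; the three documents are hypothetical examples rather than the paper's review data.

```python
# Sketch: computing To, Tf, idf and tf-idf exactly as defined above.
# The documents below are illustrative, not the paper's dataset.
import math
from collections import Counter

docs = [
    "good phone good battery",
    "bad battery",
    "good price",
]
tokenized = [d.split() for d in docs]
N = len(docs)                                    # |D|

def tf(term, tokens):
    counts = Counter(tokens)                     # To_{t,d} = n_{t,d}
    return counts[term] / sum(counts.values())   # Tf_{t,d} = n_{t,d} / sum_k n_{k,d}

def idf(term):
    df = sum(1 for tokens in tokenized if term in tokens)  # |{d : t in d}|
    return math.log(N / df)                      # idf_t = log(|D| / |{d : t in d}|)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)          # TFidf_{t,d} = Tf_{t,d} * idf_t

print(tfidf("good", tokenized[0]))      # term appearing in 2 of the 3 documents
print(tfidf("battery", tokenized[0]))
```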

Evaluation Method:

  • Precision: Precision is the fraction of retrieved values that are relevant. It can consider all retrieved values or only the topmost results returned by the system at a cut-off point. This kind of measure is called precision [9].

    $$ Precision = \frac{\left| relevant\,values\, \cap \,retrieved\,values \right|}{\left| retrieved\,values \right|} $$
  • Recall: Recall is the fraction of relevant values that are retrieved, i.e. the number of values that are both relevant and retrieved divided by the total number of relevant values. In binary classification, recall is called sensitivity [9]. It is trivial to achieve a recall of 100% by returning all values in response to the query. The recall equation is defined by:

    $$ Recall = \frac{\left| relevant\,values\, \cap \,retrieved\,values \right|}{\left| relevant\,values \right|} $$
  • F1-score: The F1 score can be seen as a weighted average (the harmonic mean) of precision and recall, where the F1 score reaches its best value at 1 and its worst value at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is given below (a short computation sketch follows this list):

    $$ F1\,score = 2\,*\,\frac{precision\,*\,recall}{precision\, + \,recall } $$
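The sketch below shows how these three measures can be computed from hypothetical confusion counts (the TP, FP, and FN values are illustrative numbers, not results from this experiment).

```python
# Sketch: precision, recall and F1 from hypothetical confusion counts.
TP, FP, FN = 80, 20, 10   # illustrative values only

precision = TP / (TP + FP)   # (relevant and retrieved) / retrieved
recall    = TP / (TP + FN)   # (relevant and retrieved) / relevant
f1_score  = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1_score:.3f}")
```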

3.3 Confusion Matrix

The performance of the classification methods on the review dataset can be assessed by computing the frequencies of correct and incorrect predictions, expressed as the statistical measures True Positives, True Negatives, False Positives, and False Negatives. These are the elements of the confusion matrix, a table generated for a classifier on a labelled dataset that can be used to evaluate the performance of the classifier.

Precision and recall were computed for the experiments so that correct and incorrect predictions can be distinguished. The assessment is made with confusion matrices, in which a True Positive is a positive instance classified as positive, a True Negative is a negative instance classified as negative, a False Positive is a negative or neutral instance classified as positive, and a False Negative is a positive instance classified as negative or neutral.
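A minimal sketch of how such a confusion matrix can be produced for the three sentiment classes is shown below; it assumes scikit-learn is installed, and the labels are hypothetical examples.

```python
# Sketch: confusion matrix for the three sentiment classes
# (labels below are hypothetical; scikit-learn is assumed to be installed).
from sklearn.metrics import confusion_matrix

y_true = ["positive", "negative", "neutral", "positive", "negative", "positive"]
y_pred = ["positive", "neutral",  "neutral", "positive", "negative", "negative"]

labels = ["positive", "negative", "neutral"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)   # rows = true classes, columns = predicted classes
```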

3.4 Classification Model

The way the dataset was used in our experimental model is represented by the classification methodology in Fig. 2 and the workflow chart in Fig. 3. The review dataset included test and train partitions, and both were used in our model. The two partitions were first concatenated into a single file. Python was used as the environment in which this combined data was processed. The XGBoost, RF, and SVM packages were installed through the Anaconda distribution, which provides built-in packages, and the required packages were then imported in Python. The dataset was opened in a Jupyter notebook and, after setting the relevant parameters, the three machine learning algorithms were run on the product review dataset. The results include the confusion matrices, precision, and accuracy, and matplotlib was used to plot them.
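The following sketch outlines this workflow under a few assumptions: a hypothetical CSV file named reviews.csv with text and sentiment columns, scikit-learn and xgboost installed, and illustrative parameter values rather than the authors' exact settings.

```python
# Sketch of the classification pipeline described above (assumptions:
# a CSV file "reviews.csv" with "text" and "sentiment" columns,
# scikit-learn and xgboost installed; parameter values are illustrative).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier

df = pd.read_csv("reviews.csv")                      # hypothetical file name
X = TfidfVectorizer(ngram_range=(1, 2), max_features=200).fit_transform(df["text"])
y = df["sentiment"].astype("category").cat.codes     # positive/negative/neutral -> ints

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```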

Fig. 2. XGBoost classifier confusion matrix

Fig. 3. Support vector machine confusion matrix

Fig. 4. Random forest classifier confusion matrix

4 Results

In this section we present the best results for the confusion matrices of our implementation, which were obtained for the bigram representation with stop words using 200 features.


For this research we computed the confusion matrix using unigrams, bigrams, and trigrams for all three machine learning algorithms. We used different numbers of selected features, namely 100, 200, and 300, both with and without stop words.

From these three classifiers, we chose the one with the highest accuracy and presented it in the figure. Our results show that the classifier works best for bigrams with stop words using 200 features.
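The sweep over n-gram ranges, feature counts, and stop-word handling could be organized roughly as follows; this is an illustrative sketch, not the authors' code, and it assumes that "with stop words" means the stop words are kept in the text.

```python
# Sketch of the configuration sweep described above (illustrative only):
# unigrams/bigrams/trigrams x {100, 200, 300} features x with/without stop words.
from itertools import product
from sklearn.feature_extraction.text import TfidfVectorizer

ngram_ranges = {"unigram": (1, 1), "bigram": (2, 2), "trigram": (3, 3)}
feature_counts = [100, 200, 300]
stop_word_options = {"with stop words": None, "without stop words": "english"}

configs = []
for (ng_name, ng), k, (sw_name, sw) in product(
        ngram_ranges.items(), feature_counts, stop_word_options.items()):
    vec = TfidfVectorizer(ngram_range=ng, max_features=k, stop_words=sw)
    configs.append((f"{ng_name}, {k} features, {sw_name}", vec))

print(len(configs))   # 3 * 3 * 2 = 18 configurations to evaluate per classifier
```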

5 Comparative Analysis

After plotting the confusion matrices, the values obtained for all our classes are used to calculate the accuracy of each algorithm's predictions. As a result, a bar graph is plotted for the actual and predicted results for all classes, namely positive, negative, and neutral (Fig. 5).

Fig. 5. Confusion matrices bar plot

Once the confusion matrices of all the algorithms were obtained, we calculated the accuracy score for each. The best performing algorithm is the one that predicts best, and that model will be considered for further sentiment analysis tasks. We carried out a comparative analysis of the three machine learning algorithms, and the results are shown in Fig. 4. According to the results obtained, the support vector machine has the lowest accuracy, the random forest comes in second position, and XGBoost achieves the highest accuracy. The accuracy of the SVM results was calculated using the accuracy, precision, and recall performance evaluations derived from the confusion matrix of the experiment. The evaluation done using the confusion matrix includes the following indicators: True Positive Rate (TP rate), True Negative Rate (TN rate), False Positive Rate (FP rate), and False Negative Rate (FN rate). The TP rate is the percentage of the positive class that was classified as positive, whereas the TN rate is the percentage of the negative class that was classified as negative. The FP rate corresponds to the negative class being classified as positive, and the FN rate corresponds to the positive class being classified as negative.
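The sketch below illustrates how the accuracy and the per-class TP and FP rates can be derived from a confusion matrix; the matrix values are hypothetical, not the results reported here.

```python
# Sketch: accuracy and per-class TP/FP rates derived from a confusion matrix
# (the matrix below is a hypothetical 3-class example, not the paper's results).
import numpy as np

labels = ["positive", "negative", "neutral"]
cm = np.array([[50,  5,  5],    # rows: true class, columns: predicted class
               [ 4, 40,  6],
               [ 6,  4, 30]])

accuracy = np.trace(cm) / cm.sum()
print(f"accuracy = {accuracy:.3f}")

for i, label in enumerate(labels):
    TP = cm[i, i]
    FN = cm[i, :].sum() - TP
    FP = cm[:, i].sum() - TP
    TN = cm.sum() - TP - FN - FP
    tp_rate = TP / (TP + FN)    # also called recall / sensitivity
    fp_rate = FP / (FP + TN)
    print(f"{label}: TP rate = {tp_rate:.3f}, FP rate = {fp_rate:.3f}")
```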

6 Conclusion

In sum, a comparative analysis of the performance of the support vector machine, random forest, and XGBoost review classification techniques is presented in this paper. All the results were obtained using Python packages, and the implementation of the document classification techniques was done in the Anaconda Jupyter notebook environment. The sentiment analysis of the review classification algorithms using confusion matrices indicates that XGBoost achieves better classification performance than the support vector machine and random forest. There is a large need in industry for sentiment analysis, because every company wants to know how consumers feel about its services and products. In future work, different types of approaches to machine learning and product review analysis should be combined in order to overcome their individual challenges and improve their performance by exploiting their respective merits. To gain a better understanding of natural language, complete knowledge as well as reasoning methods that originate in human thought and psychology will be needed.