1 Introduction

Sentiment analysis, also known as opinion mining, is a natural language processing technique used to determine whether a piece of text expresses a positive, negative or neutral opinion. People are eager to share their comments and reviews on social media, and such short reviews can reveal their preferences on a given topic [1,2,3]. Sentiment analysis is often performed on textual data that contains slang, misspellings, abbreviations, repeated characters, dialects and emoticons [4, 5]. The same words and phrases can also carry different meanings in different contexts, which makes the sentiment difficult to determine. Such analysis can, for example, help businesses monitor brand and product sentiment in customer feedback and understand their customers’ needs [6, 7]. This matters because it lets businesses quickly understand the overall opinions of their customers. By automatically classifying the sentiment behind reviews, social media conversations and similar sources, they can make faster and better-informed decisions.

A real-world example is studying people’s opinions on the Covid-19 [8, 9] vaccine. One way to do that is through surveys, interviews or questionnaires. However, these methods take an enormous amount of time and money. Instead of these traditional methods, we can scrape people’s opinions from social media and run sentiment analysis on the scraped text [10]. This not only saves time and unnecessary cost but can also yield up to millions of samples from all over the world. However, the accuracy of the predictions depends on the model applied. Machine learning approaches that help classify sentiments include Logistic Regression [11], Support Vector Machine [12], Random Forest [11], K-Nearest Neighbors [11] and Naïve Bayes [7, 13]. The goal of this study is therefore to find the best machine learning algorithm among these five models. We tune the hyper-parameters of each model to find a combination that minimizes a predefined loss function and thus gives better results.

2 Literature Review

One good reason why opinion mining is worth exploring is that massive amounts of data are recorded in digital form and have the potential to be examined [4, 11, 14,15,16]. The growth of social media such as Twitter, discussion forums and reviews has contributed to this huge data repository. Handling such big data requires an intelligent approach such as machine learning. This section describes the machine learning approaches used to classify sentiments.

Support Vector Machine (SVM) is a powerful machine learning algorithm that can be used for both classification and regression, although it is mostly used for classification problems. The strength of an SVM stems from its ability to learn data classification patterns with balanced accuracy and reproducibility [17].

Logistic Regression is a regression model for a binary target variable. In other words, the dependent variable is coded as either 1 (success/yes) or 0 (failure/no). The study in [18] reported that the classifier is confident when predicting positive sentiments but biased when predicting negative reviews.

The Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is widely used for text classification and spam detection. Despite its simplicity, the model works surprisingly well for document classification [19]. For text classification, Naïve Bayes requires only a small data set for training [2]. Conditional probability can be used to classify words into their respective categories.

Random forest was introduced by Breiman [20]. It is a tree-based technique that uses a large number of decision trees, each built from a randomly selected set of features. Unlike a single decision tree, it is hard to interpret, but its generally good performance makes it a popular algorithm.

The K-Nearest Neighbor algorithm, commonly known as k-NN, is a nonparametric approach in which the response of a data point is determined by its k nearest neighbors in the training set. It can be used in both classification and regression settings [21]. The higher the parameter k, the higher the bias; the lower the parameter k, the higher the variance.

3 Methodology

3.1 Data Description

The data were obtained from Kaggle. The data set contains 1,600,000 rows extracted using the Twitter API. It is perfectly balanced, with 800,000 rows of Positive sentiment and 800,000 rows of Negative sentiment. However, to keep the training time manageable, we took a sample of 40,000 rows with a balanced Positive and Negative ratio. The data contain six columns, which are described in Fig. 1.
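The balanced sampling step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study; the function name `balanced_sample`, the label strings and the toy rows are assumptions made for the example.

```python
import random
from collections import defaultdict


def balanced_sample(rows, per_class, seed=0):
    """Draw `per_class` rows at random from each label group.

    `rows` is a list of (label, text) pairs. The name and signature are
    hypothetical; they are not taken from the study.
    """
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(sample)  # mix the classes so they are not grouped together
    return sample


# Toy stand-in for the full data set (labels and texts are made up).
toy_rows = [("positive", f"good tweet {i}") for i in range(50)]
toy_rows += [("negative", f"bad tweet {i}") for i in range(50)]

subset = balanced_sample(toy_rows, per_class=10)
```

With `per_class=20000`, the same call on the full data set would give the 40,000 balanced rows used here.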

Fig. 1. Data description

Figure 2 shows a snippet of the data obtained from Kaggle. There are six columns in total, namely Target, Ids, Date, Flag, User and Text. However, we use only the Target and Text columns. As Fig. 3 shows, we selected a total of 40,000 rows, balanced between positive and negative sentiment.

Fig. 2. Data snippet

Fig. 3. Bar plot of the sentiment data

3.2 Data Pre-processing

A common data pre-processing step is to remove null values, if there are any, because null values can cause misleading results. We also remove duplicate values in the ‘text’ column. Duplicates put unnecessary weight on certain parameters in the model, which might cause the model to be biased or to overfit.
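As a rough sketch of this step, the function below drops rows with missing values and keeps only the first occurrence of each text. It uses plain Python lists rather than any particular data-frame library, and treating an empty string as a null value is an assumption of this example.

```python
def drop_nulls_and_duplicates(rows):
    """Remove rows with missing values, then duplicate texts.

    `rows` is a list of (label, text) pairs; the first row seen for each
    distinct text is kept. Hypothetical helper, for illustration only.
    """
    seen_texts = set()
    cleaned = []
    for label, text in rows:
        # Assumption: None or a blank string counts as a null value.
        if label is None or text is None or not text.strip():
            continue
        if text in seen_texts:
            continue  # duplicate of a text that was already kept
        seen_texts.add(text)
        cleaned.append((label, text))
    return cleaned


raw = [
    ("positive", "love this"),
    ("positive", "love this"),   # duplicate text
    ("negative", None),          # null text
    (None, "no label"),          # null label
    ("negative", "so tired"),
]
clean = drop_nulls_and_duplicates(raw)
```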

Then, for text analysis, the data cleaning step is very important for the data to be reliable. Text can contain many kinds of noise, such as inconsistent use of upper and lower case, unexpected tokens such as symbols or emojis, and words that carry little information. This is expected, since text is one of the most unstructured forms of data [22].

In this study, we use the stopwords list from the NLTK library to remove unnecessary words such as ‘a’, ‘and’ and ‘how’ from the text. Next, we normalise the text by using the Snowball stemmer from NLTK to reduce words to their root form, for example ‘stemmed’ to ‘stem’, and ‘kicks’, ‘kicked’ and ‘kicking’ to ‘kick’.
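The cleaning pipeline (lower-casing, removing symbols, removing stop words, stemming) can be illustrated with a self-contained sketch. Note that this example does not call NLTK: the stop-word set is a tiny hand-picked subset, and `toy_stem` is a naive suffix stripper standing in for the Snowball stemmer, so its output will differ from NLTK's on most real text.

```python
import re

# Tiny illustrative subset; NOT the NLTK stop-word list.
STOPWORDS = {"a", "and", "how", "the", "is", "to"}


def toy_stem(word):
    """Naive suffix stripper standing in for the Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        if not word.endswith(suffix) or len(word) - len(suffix) < 3:
            continue
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words such as 'kiss' intact
        stem = word[: -len(suffix)]
        # Undo consonant doubling, e.g. 'stemm' -> 'stem'.
        if stem[-1] == stem[-2] and stem[-1] not in "lsz":
            stem = stem[:-1]
        return stem
    return word


def preprocess(text):
    """Lower-case, keep alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("How she KICKED and kicks the ball!!! :)")
```

On the examples given in the text, the toy stemmer maps ‘stemmed’ to ‘stem’ and ‘kicks’, ‘kicked’ and ‘kicking’ to ‘kick’.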

3.3 Machine Learning

We tested five machine learning algorithms to find the best model for classifying the sentiment of a Twitter text, namely SVM, Logistic Regression, Naïve Bayes, Random Forest and K-Nearest Neighbors. We tuned the hyper-parameters of each model and compared the models on training time and accuracy. The following subsections describe the parameters tested for each algorithm.
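The two quantities compared in this study, accuracy and training time, could be measured with a small harness like the one below. This is only a sketch: the `evaluate` function, its callback signature and the majority-class baseline are hypothetical and stand in for whichever model is being tested.

```python
import time


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def evaluate(fit, predict, X_train, y_train, X_test, y_test):
    """Return (accuracy, training time in seconds) for one configuration.

    `fit(X, y)` returns a trained model and `predict(model, x)` returns a
    label; both are supplied by the caller.
    """
    start = time.perf_counter()
    model = fit(X_train, y_train)
    train_seconds = time.perf_counter() - start  # only training is timed

    y_pred = [predict(model, x) for x in X_test]
    return accuracy(y_test, y_pred), train_seconds


# Hypothetical baseline: always predicts the most frequent training label.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, x):
    return model


acc, seconds = evaluate(
    fit_majority, predict_majority,
    ["good", "great", "bad"], [1, 1, 0],
    ["nice", "awful"], [1, 0],
)
```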

SVM.

The goal of a support vector machine is to find the separating line (in general, a hyperplane) that maximizes the minimum distance between the training points and that line. The parameters tested are as follows:

  • Kernel: Specifies the kernel type to be used in the algorithm. [linear, poly, rbf]

  • Gamma: Kernel coefficient. [scale, auto]

  • Decision function shape: Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1)/2). [ovo, ovr]
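To make the Gamma and Decision function shape options more concrete, the sketch below computes the values they stand for. The parameter names match those of scikit-learn's SVC, which is assumed here; under that assumption, ‘auto’ means 1 / n_features and ‘scale’ means 1 / (n_features × variance of all entries of X). The helper names are hypothetical and the code does not call scikit-learn.

```python
from statistics import pvariance


def gamma_value(X, mode):
    """Kernel coefficient for 'scale' or 'auto' (assumed SVC convention).

    X is a list of equal-length feature rows whose entries are not all
    identical (otherwise the variance is zero).
    """
    n_features = len(X[0])
    if mode == "auto":
        return 1.0 / n_features
    if mode == "scale":
        all_values = [v for row in X for v in row]
        return 1.0 / (n_features * pvariance(all_values))
    raise ValueError(f"unknown gamma mode: {mode}")


def decision_columns(n_classes, shape):
    """Number of decision-function columns, from the formulas in the text."""
    if shape == "ovr":
        return n_classes
    if shape == "ovo":
        return n_classes * (n_classes - 1) // 2
    raise ValueError(f"unknown decision function shape: {shape}")


X_toy = [[0, 4], [4, 0]]          # made-up feature matrix
g_auto = gamma_value(X_toy, "auto")
g_scale = gamma_value(X_toy, "scale")
ovo_cols = decision_columns(2, "ovo")
ovr_cols = decision_columns(2, "ovr")
```

With only two sentiment classes, the one-vs-one formula reduces to a single pairwise comparison, so the choice between ‘ovo’ and ‘ovr’ changes how the scores are reported rather than which classifier is trained.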

Logistic Regression.

The parameters tested are as follows:

  • Solver: Algorithm to be used in the optimization problem. [liblinear, lbfgs, saga]

  • Penalty: Specifies the norm used in the penalization. [l2, none]

  • C: Inverse of regularization strength; must be a positive float. [1, 4, 10]
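Because C is the inverse of the regularization strength, a larger C puts more weight on fitting the training data and less on the penalty. One common way to write the l2-penalized objective is 0.5·‖w‖² + C·Σ log(1 + exp(−yᵢ w·xᵢ)) with labels in {−1, +1}. The sketch below evaluates that expression; it omits the intercept for brevity, and the function name is an assumption of this example rather than part of any library.

```python
import math


def l2_logistic_objective(w, X, y, C):
    """0.5 * ||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)).

    Labels y_i must be -1 or +1. No intercept term (a simplification).
    """
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        data_loss += math.log1p(math.exp(-margin))
    return penalty + C * data_loss


X_toy = [[1.0, 2.0], [3.0, 4.0]]   # made-up feature rows
y_toy = [1, -1]
w_toy = [0.5, -0.5]

# The penalty term is the same for every C; only the data term is rescaled.
objectives = {C: l2_logistic_objective(w_toy, X_toy, y_toy, C) for C in (1, 4, 10)}
```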

Naïve Bayes.

The parameters tested are as follows:

  • Alpha: Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). [0, 1]

  • Fit prior: Whether to learn class prior probabilities or not. If false, a uniform prior will be used. [True, False]
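The effect of both parameters can be shown with the standard formulas: additive smoothing estimates a word's probability in a class as (count + alpha) / (total + alpha × vocabulary size), and the class prior is either the observed class frequency or uniform. The following sketch uses made-up counts and hypothetical function names.

```python
from collections import Counter


def word_likelihood(count, total, vocab_size, alpha):
    """Smoothed estimate of P(word | class).

    count: occurrences of the word in the class
    total: occurrences of all words in the class
    """
    return (count + alpha) / (total + alpha * vocab_size)


def class_priors(labels, fit_prior):
    """Learned class frequencies if fit_prior is True, else a uniform prior."""
    counts = Counter(labels)
    if fit_prior:
        return {c: n / len(labels) for c, n in counts.items()}
    return {c: 1.0 / len(counts) for c in counts}


# A word never seen in the class: zero probability without smoothing.
unsmoothed = word_likelihood(count=0, total=10, vocab_size=5, alpha=0)
smoothed = word_likelihood(count=0, total=10, vocab_size=5, alpha=1)

balanced_labels = ["positive"] * 4 + ["negative"] * 4
learned = class_priors(balanced_labels, fit_prior=True)
uniform = class_priors(balanced_labels, fit_prior=False)
```

With alpha = 0, a single unseen word drives the whole class probability to zero, which smoothing avoids. On perfectly balanced labels the learned and uniform priors coincide, so the Fit prior setting would be expected to matter little for data like these.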

Random Forest.

The parameters tested are as follows:

  • Criterion: The function used to measure the quality of a split. [gini, entropy]

  • Number of estimators: The number of trees in the forest. [50, 100, 200]
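The two split criteria are both impurity measures over the class labels in a node: Gini impurity is 1 − Σ pᵢ² and entropy is −Σ pᵢ log₂ pᵢ, where pᵢ is the share of class i. A minimal sketch with made-up labels:

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


mixed = ["pos", "pos", "neg", "neg"]   # evenly mixed node
pure = ["pos", "pos", "pos", "pos"]    # pure node

mixed_scores = (gini(mixed), entropy(mixed))
pure_scores = (gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so they tend to rank candidate splits similarly.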

K-Nearest Neighbors.

The parameters tested are as follows:

  • Number of neighbors: Number of neighbors to use by default for kneighbors queries. [5, 15, 40]

  • Weights: Weight function used in prediction. [uniform, distance]

  • Power parameter (p): Power parameter for the Minkowski metric. When p = 1, this is equivalent to using the Manhattan distance (l1); when p = 2, the Euclidean distance (l2). [1, 2]
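The three parameters interact as in the following sketch of a k-NN classifier: the Minkowski distance with power p ranks the training points, the k closest ones vote, and the votes are either equal (‘uniform’) or weighted by inverse distance (‘distance’). The function names and the toy points are assumptions made for this illustration.

```python
from collections import defaultdict


def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train, query, k, weights="uniform", p=2):
    """Predict a label from the k nearest (vector, label) training pairs."""
    ranked = sorted(train, key=lambda item: minkowski(item[0], query, p))
    votes = defaultdict(float)
    for vector, label in ranked[:k]:
        if weights == "uniform":
            votes[label] += 1.0
        else:  # 'distance': closer neighbors count for more
            d = minkowski(vector, query, p)
            votes[label] += 1.0 / d if d > 0 else float("inf")
    return max(votes, key=votes.get)


train = [
    ((0, 0), "neg"), ((0, 1), "neg"),
    ((5, 5), "pos"), ((6, 5), "pos"), ((5, 6), "pos"),
]
query = (1, 0)

manhattan = minkowski((0, 0), (3, 4), p=1)
euclidean = minkowski((0, 0), (3, 4), p=2)
small_k = knn_predict(train, query, k=3)
large_k_uniform = knn_predict(train, query, k=5, weights="uniform")
large_k_distance = knn_predict(train, query, k=5, weights="distance")
```

In this toy case a large k with uniform weights lets the three distant ‘pos’ points outvote the two nearby ‘neg’ points, whereas distance weighting restores the local answer.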

4 Findings and Results

Table 1 summarizes the overall results by comparing accuracy and training time. The first row for SVM compares the different Kernel parameters. Based on the results, the ‘linear’ kernel has a higher accuracy than ‘rbf’ and ‘poly’. However, the ‘rbf’ kernel takes less time to train. The second row for SVM shows the results for the different Gamma parameters. The ‘scale’ kernel coefficient has a higher accuracy than the ‘auto’ kernel coefficient, but ‘auto’ takes less time to train. The last row for SVM shows the results for the different Decision Function Shape parameters. The results show no significant difference in the accuracy or the training time of the model.

We tested three parameters of Logistic Regression. The first is the Solver parameter, which did not cause any significant difference in the accuracy or the training time of the model; the solver types are only a few seconds apart. For the Penalty parameter, the ‘l2’ penalization has a higher accuracy than the ‘none’ penalization, and ‘none’ also takes slightly longer to train. For the C parameter, a value of ‘10’ has a greater accuracy than ‘1’ and ‘4’. However, we also noticed that the greater the value of C, the longer the model takes to train.

We tested two parameters of Naïve Bayes. For the Alpha parameter, additive smoothing (1) has a greater accuracy than no smoothing (0), while the training times are only a few milliseconds apart. For the Fit Prior parameter, setting it to True or False does not cause any significant difference in the accuracy or the training time of the model; the two settings are only a few milliseconds apart.

Table 1. Summary of results.

We tested two parameters of Random Forest. For the Criterion parameter, there is no significant difference in accuracy between ‘gini’ and ‘entropy’, although ‘gini’ takes slightly longer to train than ‘entropy’. For the Number of Estimators parameter, there is likewise no significant difference in accuracy between ‘50’, ‘100’ and ‘200’. However, there is a significant difference in training time: the greater the number of estimators, the longer the model takes to train.

We tested three parameters of K-Nearest Neighbors. For the Number of Neighbors parameter, ‘5’ neighbors has a higher accuracy than ‘15’ and ‘40’ neighbors. However, there is no clear pattern in the training time, since the results are inconsistent as the number of neighbors increases. For the Weights parameter, setting it to ‘uniform’ or ‘distance’ does not cause any significant difference in the accuracy or the training time of the model; the two weight types are only a few milliseconds apart. Likewise, setting the P parameter to ‘1’ or ‘2’ does not cause any significant difference in accuracy or training time; the two settings are only a few milliseconds apart.

5 Conclusion and Recommendation

The main purpose of this study is to find the best hyper-parameter setup for sentiment analysis for each model. SVM has the highest accuracy of all the models, but it also takes longer to train. For the ‘kernel’ parameter, the linear kernel gives the highest accuracy of the kernel types, but it also makes the model take longer to train. For the ‘gamma’ parameter, ‘scale’ gives a much higher accuracy than ‘auto’, but again at the cost of a longer training time. For the ‘decision function shape’ parameter, we did not notice any difference between the decision function shapes in either accuracy or training time.

The Logistic Regression model has one of the highest accuracies among the models and takes less time to train. For the ‘solver’ parameter, there is no significant difference in accuracy or training time; ‘lbfgs’ takes slightly longer to train, but only by a few seconds. For the ‘penalty’ parameter, the ‘l2’ penalization gives a higher accuracy than no penalization and also slightly reduces the training time. For the ‘C’ parameter, there is no significant difference in accuracy between the values tested. However, the higher the value of C, the longer the model takes to train.

Overall, the Naïve Bayes model performs decently in both accuracy and training time. For the ‘alpha’ parameter, setting it to 1 (additive smoothing) makes the model perform better than with no smoothing (0): it gives not only a better accuracy but also a slightly shorter training time. For the ‘fit prior’ parameter, there is no significant difference in accuracy or training time between True and False.

The Random Forest model also has a decent accuracy, but the opposite is true of its training time. For the ‘criterion’ parameter, there is a slight difference in training time between gini and entropy but no significant difference in accuracy. For the ‘number of estimators’ parameter, there is also no significant difference in accuracy between 50, 100 and 200 estimators. However, there is a significant difference in training time: the higher the number of estimators, the longer the model takes to train.

Finally, the accuracy of the K-Nearest Neighbors model is poor compared with the other models, even though it takes only a few milliseconds to train. For the ‘number of neighbors’ parameter, a lower value gives a slightly better prediction accuracy, but there is no real pattern in the training time. For the ‘weights’ parameter, there is no significant difference in performance between uniform and distance. Likewise, setting the power parameter (p) to 1 or 2 does not cause any significant difference in accuracy or training time; the two settings are only a few milliseconds apart.

Therefore, out of all five models, Naïve Bayes and Logistic Regression performed extremely well compared with the others: they have good accuracy, and their training times are also small. However, if training time is not the main concern, the Support Vector Machine is the best model because of its high accuracy. We would also say that K-Nearest Neighbors is not suitable for sentiment analysis because of its below-average accuracy, despite its short training time.

However, the results of this study are based on a single text processing and vectorizing technique. We therefore recommend that future researchers try different vectorizing techniques, such as different n-gram ranges. According to Subarno et al. (2018), LSTM RNNs are more effective than Deep Neural Networks and conventional RNNs for sentiment analysis. Hence, we also recommend that future researchers try LSTM RNNs and compare them with other models so that a wider range of results can be obtained.
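As a small illustration of the n-gram recommendation, the sketch below extracts word n-grams from a tokenized text. The function names are hypothetical, and the example sentence is made up; it is meant only to show why n-grams beyond single words may carry extra sentiment information.

```python
def ngrams(tokens, n):
    """All contiguous word n-grams of length n, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, n_min, n_max):
    """N-grams for every length from n_min to n_max inclusive."""
    features = []
    for n in range(n_min, n_max + 1):
        features.extend(ngrams(tokens, n))
    return features


tokens = ["not", "good", "at", "all"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
combined = ngram_features(tokens, 1, 2)
```

Here the bigram ‘not good’ keeps the negation attached to the word it modifies, which the unigram ‘good’ alone would lose.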