Keywords

1 Introduction

The popularity of social networks and society dependent on mobile media has led the young scientists to continue to work on feeling study. Organizations from those days have been keen to evaluate consumers or the public view of their social media products [1]. Online services are linked to the site, internet forum, remarks, tweets, and product data evaluation of social communication [2].

There is an increase in the popularity of the social and electronic media networks. Societies also encouraged to carry out research on feeling analysis via the extreme use of the internet by companies all over the world [3]. Web texts have shaped the market and socio-economic processes. To analyze the sentiments, different methods drive the computer control based on four text classifiers, namely, Naïve Bayes, SVM, LSTM, and Random Forest.

Sentimental Analysis

It is also known as emotion AI or opinion polling. It mostly focuses on identifying subjective data. It is useful to determine how pleased customers are with goods and services. Sentiment analysis examines the polarity of language, determining whether it is positive or negative. Change the notion, boost production, and advertise with the aid of these polarities to help eliminate some negative. The various steps to analyze sentiment data are given in Fig. 1.

Fig. 1
figure 1

Steps to analyze the sentiment data

Section 2 discusses the related work, the proposed model is being discussed in Sects. 3, and 4 highlights the analysis of results, and Sect. 5 concludes the paper.

The paper highlights the different machine learning techniques to analyze the sentiment of the Twitter tweet.

2 Related Work

Badr et al. [4] used SVM and NB algorithms to analyze social media sentiment and used ant colony and particle Swarm optimization methods to achieve 73.62%, 77.30%, and 80.54% accuracy for Naïve Bayes and 76.71%, 80.54% for SVM, respectively. Two approaches were discussed by the author Kawade, [5], sentiment score and polarity count, to analyze the social network Uri assault tweets, achieving 94.3% accuracy for negative results and 5.7% accuracy for positive results.

Nguyen et al. [6] attained an F1-score of 90.2% accuracy utilizing the Vietnamese student feedback corpus and the LSTM support vector machine technique. Taking Twitter tweets, reviews and applying different machine learning algorithm and deep learning algorithm such as NB, SVM, LSTM, and Random Forest. Amazon and IMDB movie reviews were taken by Bansal and Kaur [7] and consider NB, J48, BFtree, and oneR classifier among these the faster one in learning is NB and the more promising one is oneR, for generating the accuracy in classified instances. In [8], Shreyas R Labhsetwar et al. use a dataset of 1000 labeled sentences, 500 +ve and 500 −ve, to conduct sentence-level analysis. It appears that using conjunct analysis with sentence-level analysis will improve accuracy. They discovered that the ML method is inefficient, therefore he used WordNet to improve the accuracy by around 80%. The POSICTCLAS tool for Chinese text is being suggested by Bhargav et al. [9] where the average review size is about 600 words in education review, average reviews of length are about 460 terms in stock review, and 120 words average length in computer review.

Singh et al. are taking sentiment analysis on movie and product review dataset in Turkish and English language using SVM classifier to obtain 91.33% accuracy [1]. Khan  et al. [10] worked on Urdu dataset for sentimental analysis and polarity detection.

3 Proposed Model

The Twitter dataset using machine learning and deep learning classifiers such as Naive Bayes, LSTM, SVM, and Logistic Regression is being discussed in the proposed model. The procedure is discussed in Fig. 2.

Fig. 2
figure 2

Flowchart of the proposed model

The different steps of the process include:

  • Data collection is very essential. In order to discover or analyze sentiment, a collection of Twitter tweets from the database is used in the proposed model.

  • Data Preprocessing is important in data mining. When there is irrelevant, redundant information, noisy or inaccurate data, knowledge discovery become more difficult. It takes a long time to prepare data, and after this step is over, it is time to create the training set. The different steps of preprocessing of data consist of:

    • Remove Punctuation: It removes the special characters.

    • Tokenization: It breaks a long text into little segments or words, which are referred to as tokens. The phrase is broken up and converted into a token format.

    • Remove Stop words

      It is a Natural Language Toolkit (NLTK) that includes a collection of terms. It just functions as a filter and used to filter out natural language data after or before processing.

    • Stemming

      The practice of reducing a term to its underlying word by removing superfluous characters. It reduces the term to its simplest form.

    • Lemmatizing

      This converts a word's result into its dictionary or canonical form. The resulting lexical form is known as lemma.

    • Vectorization

      This converts the word into a number, which is executed faster. It aids word embedding or word vectorization approach by employing vectorization technology.

    • Bag of word (BOW)

      As the machine learning algorithm is unable to operate with text, it must be transformed into numbers. Because the algorithm requires a vector input, first the materials are transformed into fixed length vectors.

  • Classification of Training Data

    It is the next step, four different classifiers are used for the proposed model.

    • Naive Bayes:

      This is a strong algorithm. It is a probabilistic model that looks like Bayes theorem of the algorithm. It is used to determine the assumption of predictor independence. It is the most basic model that is straightforward to construct.

    • Logistic Regression:

      It is a supervised classification method that uses statically learning techniques. It is a regression model, after all it predicts the probability in a given dataset. It makes use of the sigmoid function. Because it detects defaulters, this approach is employed in the financial industry. It may also be used to forecast binary classes. Logical regression's output, or goal value, is binary in nature. It is used to detect spam, diagnose cancer, and forecast diabetes. Logistic regression may be classified into three types based on number categories: binary, multinomial, and ordinal. Binary and multinomial models are the two types of models.

    • Support Vector Machine (SVM):

      The enhanced feature sets were utilized for sentiment classification after the outliers were removed using clustering. SVM is mostly used to classify sentiments. It categorizes good and negative feedbacks. Precision, Recall, F-Measure, and Accuracy are all factors that influence the algorithm’s performance.

    • Bullet Long Short-Term Memory (LSTM):

      The inputs are multiplied by weight then the bias is added, and so on, until the output from the last layer is obtained in the feed-forward networks. However, because these networks do not retain memory, they cannot be utilized to process sequential data. This sort of network’s input and output are also fixed. Long Short-Term Memory (LSTM) networks can learn long-term dependencies. These networks perform admirably in a wide range of situations. Long-term dependence is not a concern with LSTMs because they are expressly intended to avoid it. It is achieved by LSTM memorizing information over lengthy periods of time, as this is their inherent behavior. The chain structure also exists on LSTM networks, but the repeating module has a different structure. It contains four interacting layers instead of a single layer of a single neural network.

4 Result Analysis

The proposed model is classified using Naïve Bayes classifiers. In this, the Twitter Tweets are taken as dataset, around 60,000 pieces of data are taken of which 30,000 pieces are used for training and 30,000 pieces are used for testing. The word vector of positive and negative reviews is kept separated, and also it keeps track of positive and bad ratings. Then, using conditional probability, best words are determined. After training the data using Navies Bayes Classifier, the result is obtained as shown in Fig. 3.

Fig. 3
figure 3

Naive Bayes classifier

There is a comparison graph between SVM and logistic regression based on accuracy and prediction, and it is shown in Fig. 4.

Fig. 4
figure 4

Comparison between SVM and logistic regression

Precision = Correct true predictions/Total number of positive prediction

$$Precision=\frac{TPP}{TPP+FPP}$$
(1)

Recall is to figure out the percentage of item actually present in the input.

Recall = Correct true predictions/total of true positive prediction and false-negative prediction

$$Recall=\frac{TPP}{TPP+FNP}$$
(2)

F-Measure is the ratio of the harmonic mean of accuracy and memory, which combines precision and recall.

$$F-Measure=2\times \frac{Precision\times Recall}{Precision+Recall}$$
(3)

Accuracy is the statistical metric for determining how effectively a classification test properly detects or eliminates a condition. The proportion of genuine findings (including true positive and true negative) among the total number of cases is investigated in the accuracy:

$$Accuracy=\frac{TPP+TNP}{TPP+TNP+FPP+FNP}$$
(4)

In Eqs. (1), (2) and (4), TP signifies the True Positive Prediction, False Positive Prediction is denoted as FPP, True Negative Prediction is denoted as TNP, and FNP is denoted as False Negative Prediction.

Figure 5 shows the count of words using LSTM. The accuracy and loss factor of LSTM using bidirectional NN are shown in Fig. 6.

Fig. 5
figure 5

Count of words using LSTM

Fig. 6
figure 6

Accuracy and loss factor of bidirectional LSTM using NN

The results for the different parameters for the different classifiers are shown the Table 1.

Table 1 Comparison based on different parameters for different techniques

5 Conclusion

In the proposed work, different techniques of deep learning and machine learning are applied to sentiment analysis for review of the tweets in Twitter. It is the most popular and famous topic through which can know the sentiment of any post or any type of text document. The intension of people about other things or people can be known through the sentiment analysis. In this model, the sentiment analysis classification is performed, and their prediction is done through Naïve Bayes, SVM, LSTM, and logistic regression based on different parameters like accuracy, precision, recall, and F1-score. It is observed that for LSTM has better accuracy, precision, recall and F1-score is more as compared to the other model.