
1 Introduction

The tremendous amount of unstructured data available online can be used for sentiment analysis. It helps businesses analyze customer feedback to improve their products and services, and supports decision making in political movements and public welfare. Compared with other languages (e.g., English, Spanish, French), very little research has been carried out on sentiment analysis of Bangla text. In this paper, we propose a Bangla sentiment classification system that works effectively.

In our paper, we used both tokenization and stemming of words in combination with n-grams (i.e., unigram, bigram, and trigram) for feature selection, after removing the noise and redundancy of the initial raw data through several steps of preprocessing. Tokenization showed better results than stemming, with a 68% F1-score. An n-gram model captures the relationship between a word and its n−1 neighboring words by assigning probabilities to word sequences; it thus captures contextual information, and a higher value of n captures more context from the sentences. N-gram techniques were used along with the TF-IDF method for vectorization. The processed data was fed into a classification model for evaluation. Our goal is to provide a classification system that can classify Bangla text into positive, negative, and neutral classes using two of the most efficient machine learning algorithms in opinion mining (i.e., SVM and RF) along with n-gram techniques, and to show a comparative analysis between the classifiers and n-grams.

The remainder of the manuscript is divided into four sections. Section 2 reviews related work. Section 3 states the overall architecture and methodology of the system. The evaluation of the system and comparisons among the features are shown in Sect. 4. Finally, Sect. 5 summarizes the findings and points out the work's future directions.

2 Related Work

Over more than 1300 years, the Bengali language has been shaped by a diversity of languages and cultures, yet there are still few NLP tools for Bengali. Many researchers have studied the Bengali language to fill this deficit. Abinash et al. [1] provided a comparison of results obtained using the NB and SVM classification algorithms, which decide whether a sentimental assessment is positive or negative. Taher et al. [2] focused on Bangla texts drawn from diverse web-based data; using machine learning and the n-gram approach to categorize Bangla documents, they achieved 89% precision using 1–2 grams with SVM. Several related works have addressed this problem [3, 4]. Lee et al. [5] used Naive Bayes, SVM, and maximum entropy classifiers. Pundlik et al. [6] suggested a strategy for categorizing Hindi speech documents into many classes using ontology, combining HSWN and an LM classifier. To identify the best combination of user input, Rahman et al. [7] examined the effects of extractive feature approaches, using unigrams, bigrams, and trigrams as representations with TF-IDF independently. Tripto et al. [8] created deep learning-based models and tested their performance on a fresh corpus of English, Romanized Bangla, and Bangla feedback from YouTube; their method achieved 65.97% and 54.24% accuracy in three- and five-label sentiments, respectively. Several other related works exist [9, 10]. Haque et al. [11] developed a system to classify feedback using machine learning algorithms in which, according to a comparison of vectorizers, SVM's accuracy was measured at 75.58%.

3 Methodology

This section provides an overview of the overall system. Different preprocessing procedures reduced noise and redundancy in the underlying raw data. The texts were tokenized, and TF-IDF was utilized for vectorization. SVM and RF classifiers were trained on the training set to classify the data into three categories (positive, neutral, and negative). The model was evaluated on the test set. Our model's system structure is depicted in Fig. 1.
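To make this flow concrete, the following is a minimal sketch of such a pipeline using scikit-learn. The DataFrame `df`, its column names, and the 80/20 split are illustrative assumptions, not details taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# End-to-end sketch: TF-IDF vectorization followed by an SVM; `df`,
# its column names, and the 80/20 split are illustrative assumptions.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),  # unigram features
    ("clf", LinearSVC()),
])

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))  # held-out accuracy
```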

3.1 Data Description

In this research, we used the dataset of Bangla news comments by Ashik et al. [12]. The dataset contains a total of 13,802 Bangla texts labeled with five categories: positive, slightly positive, neutral, negative, and slightly negative. Each data point was annotated by three different people to obtain three different viewpoints, and the final tag was selected by majority decision. The distribution of data across the labels is imbalanced. Figure 2 illustrates an example of the dataset. The frequency of each sentiment label is shown in Fig. 3a.

Fig. 1 System structure of the model

Fig. 2 Example of Bangla news comments with annotated sentiment label

Fig. 3 a Amount of data at the initial labels, b percentage of data at the newly annotated labels

As the amount of data at each label is small, we merged the positive and slightly positive classes into the positive category and the negative and slightly negative classes into the negative category, keeping the neutral class as it was. Figure 3b shows the percentage of data at each newly annotated label.
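A minimal sketch of this label merging, assuming the labels appear under these hypothetical string names in a pandas DataFrame `df`:

```python
# Collapse the five original labels into three categories; the label
# strings and the column name are assumptions for illustration.
label_map = {
    "positive": "positive",
    "slightly positive": "positive",
    "negative": "negative",
    "slightly negative": "negative",
    "neutral": "neutral",
}
df["label"] = df["label"].map(label_map)
```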

3.2 Data Preprocessing

The initial raw data comprised much redundant information and noise that contributed nothing to the sentiment analysis. To remove these, the data went through several preprocessing steps, given below; a minimal code sketch follows the list:

(a) Removal of emoticons: Emoticons are used to represent different facial expressions and moods. As this paper works only with textual information, we removed the emoticons, which do not contribute to our analysis.

(b) Removal of punctuation marks: Punctuation marks play a very minor role in determining the sentiment of a sentence, so they are removed to reduce time and space complexity.

(c) Removal of stop words: Conjunctions and prepositions are common stop words. Removing stop words lets the model focus on the important words and reduces complexity.
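The sketch below illustrates these three steps. The emoji Unicode ranges and the tiny stop-word set are illustrative assumptions; a real system would use a full Bangla stop-word resource.

```python
import re

# Hypothetical miniature stop-word set; a real system would load a
# full Bangla stop-word list.
STOP_WORDS = {"এবং", "কিন্তু", "অথবা"}

def preprocess(text: str) -> str:
    # (a) remove emoticons/emoji via common Unicode symbol ranges
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)
    # (b) remove punctuation (this also strips the Bangla danda "।")
    text = re.sub(r"[^\w\s]", " ", text)
    # (c) remove stop words after whitespace tokenization
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```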

3.3 Actual Processing

In this stage, the preprocessed texts were converted into a well-defined series of linguistically significant units. This process involves tokenization or stemming of the words, vectorization of the tokens, and n-gram techniques.

Tokenization and Stemming

In a sentence, a token is a sequence of characters that is regarded as a semantic unit. When a sentence is tokenized, it is broken up into these smaller chunks of information. For tokenization, we used whitespace as the delimiter. Stemming is the technique of stripping affixes from words to reduce them to their base forms. Because each word derives its meaning from its root, removing affixes has little effect on opinion mining and simplifies computation.
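A minimal sketch of both operations: the whitespace tokenizer matches the description above, while the suffix list of the stemmer is a hypothetical illustration, not the paper's actual rules.

```python
# Example Bangla inflectional suffixes; purely illustrative.
SUFFIXES = ["গুলো", "দের", "টা"]

def tokenize(sentence: str) -> list[str]:
    # Whitespace is the delimiter, as described above.
    return sentence.split()

def stem(word: str) -> str:
    # Strip the first matching suffix to approximate the root word.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word
```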

N-gram Techniques

We tested various n-gram models on our data in order to find the most useful feature. An n-gram model predicts the probability of a word occurring based on the n−1 preceding words; it thus captures the associations between words and their neighbors and, to some extent, the context. Depending on the value of n, an n-gram can be of several types. When n equals one, it is called a unigram, which considers only one word at a time. The model is called a bigram for n = 2 and a trigram for n = 3; these capture the context of the previous one and two words, respectively. Figure 4 shows the 12 most frequently occurring bigrams and trigrams in our dataset.

Fig. 4 Twelve most frequently occurring a bigrams and b trigrams of the dataset
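A frequency count like the one summarized in Fig. 4 can be reproduced with a short sketch such as the following, where `texts` is assumed to be the list of preprocessed comments.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Count bigrams and trigrams over the corpus, tokenizing on whitespace
# to match the paper's tokenization; `texts` is an assumed variable.
vectorizer = CountVectorizer(ngram_range=(2, 3), token_pattern=r"\S+")
counts = vectorizer.fit_transform(texts).sum(axis=0).A1
top = sorted(zip(vectorizer.get_feature_names_out(), counts),
             key=lambda pair: pair[1], reverse=True)[:12]
for ngram, freq in top:
    print(ngram, freq)
```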

3.4 Random Forest and Support Vector Machine

Random Forest Classifier

It improves predictive accuracy on a dataset by combining the results of many decision trees trained on different subsets of the data. Each tree votes for a class, and new data points are assigned to the class predicted by the majority of the trees [13].

Support Vector Machine

SVM is a supervised machine learning algorithm in which each sample is represented as a point in n-dimensional space, with each feature value corresponding to a particular coordinate. The data is then classified by choosing the hyperplane that best separates the two groups [14]. To differentiate positive and negative data, SVMs look for the optimal separating surface.
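As a sketch of how the two classifiers are trained and compared, assuming `X_train`, `X_test`, `y_train`, and `y_test` come from the TF-IDF vectorization step; the hyperparameters shown are illustrative defaults, not the paper's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Fit both classifiers on the TF-IDF features and compare held-out
# accuracy; hyperparameters are illustrative, not the paper's settings.
models = {
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```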

4 Result Analysis and Comparison

As our dataset is not properly balanced, we used four performance metrics to evaluate it accurately: precision, recall, F1-measure, and accuracy. Higher precision and recall values indicate better performance. Precision measures how precisely the system captures the correct cases, and recall measures how well false negatives are minimized. The F1-measure is the harmonic mean of precision and recall. The equations for the performance metrics are given below, where true positive and false positive are denoted by TP and FP, true negative and false negative are denoted by TN and FN, and m stands for the sample size (TP + FP + FN + TN).

$$\begin{aligned} {\text {Accuracy}}&= \frac{{\text {TP}} + {\text {TN}}}{m}\quad {\text {Precision}} = \frac{{\text {TP}}}{{\text {TP}} + {\text {FP}}}\quad {\text {Recall}} = \frac{{\text {TP}}}{{\text {TP}} + {\text {FN}}} \\ F1\,{\text {measure}}&= \frac{2 \times {\text {Precision}} \times {\text {Recall}}}{{\text {Precision}} + {\text {Recall}}} \end{aligned}$$
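These metrics can be computed per class with a sketch like the following, assuming `y_test` holds the true labels and `y_pred` a classifier's predictions, and that the class names below match the label encoding.

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1, plus overall accuracy;
# y_test and y_pred are assumed to come from the evaluation step.
print(classification_report(
    y_test, y_pred,
    target_names=["negative", "neutral", "positive"]))
```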
Table 1 Performance statistics of the proposed model
Table 2 Performance scores of the proposed model

The precision, recall, and F1-measure scores of each classifier for each feature are shown in Table 1, and the accuracy of the classifiers for each feature is shown in Table 2.

Fig. 5 a Comparison of SVM and random forest classifiers, b comparison of n-grams in SVM

From the performance metrics, we can see that the classifiers predict negative opinions with high accuracy (e.g., 96% recall using SVM), whereas the neutral and positive classes are predicted less accurately. We can say that the imbalance of the labeled data has strongly affected the performance. In both tokenization and stemming, SVM outperformed random forest, and tokenization proved more effective for opinion mining than stemming. A comparison between the accuracy of SVM and random forest for each feature is given in Fig. 5a. We can see that unigrams performed better than bigrams and trigrams, even though the latter capture more contextual information. This suggests that the performance of n-grams depends highly on the characteristics of the dataset and the internal context and relationships among its words. The comparison of n-gram techniques in the SVM classifier is shown in Fig. 5b.

5 Conclusion

SVM and RF are two of the most efficient machine learning algorithms in opinion mining, and we used them in conjunction with n-gram techniques to create a system for classifying Bangla text into positive, negative, and neutral classes. Tokenization showed better results than stemming, with an F1-score of 68%. SVM outperformed the RF classifier and achieved its best performance with unigrams.

In the future, we want to introduce more fine-grained classes into our system and expand the amount of data in our dataset. We also want to experiment with more preprocessing and feature selection techniques to improve accuracy.