1 Introduction

Social media platforms such as Twitter, Telegram, Facebook, TikTok, and others have made communication more convenient and ubiquitous (Singh et al. 2020). Social media has become a crucial channel for expressing personal feelings and opinions, communicating with other users, teaching students, and even supporting mental health specialists in diagnosing depression (Denecke and Nejdl 2009; Zhu et al. 2011; Giuntini et al. 2020). Therefore, personal comments regarding products or services strongly influence the purchasing decisions of other social media participants (Kumar et al. 2006; Zhang et al. 2007; Chang et al. 2020).

Comments on social media can be a major source of suggestions and recommendations regarding commercial products from the customer perspective (Bai 2011). However, some comments are harmful and reduce purchase intentions. According to an online research report by the Lightspeed company (2011), about 60% of customers change their purchase decisions after reading one to three negative comments. Nevertheless, these harmful reviews can be viewed as customer complaints that may provide useful information for improving an enterprise's services. In addition, online reviews are usually unstructured, subjective, and hard to comprehend in a short time. Therefore, recognizing social media users' sentiments from a huge number of online reviews has become an important issue (Chen et al. 2011).

Sentiment classification has become increasingly important as the number of digital text resources grows (Gokalp et al. 2020; Chouchani and Abed 2020). In recent years, sentiment classification, which assigns textual sentiment to a positive or negative class, has attracted much attention (Zhao et al. 2020; Kong et al. 2020). Generally speaking, sentiment classification aims to recognize reviewers' sentiments from customers' text comments on specific products or services (Chen et al. 2009; Ye et al. 2009; Mekawie and Hany 2019; Akhtar et al. 2020). Many studies have focused on textual sentiment classification. Sentiment classification can also detect social media users' emotions, helping enterprises respond to customers' comments carefully.

In the available literature, machine learning algorithms are widely used to solve this problem (Chaovalit and Zhou 2005; Tang et al. 2009; Tan and Zhang 2008; Wu et al. 2006). Machine learning constructs classification models from text reviews and then uses the built models to recognize the sentiment of a newly arriving review. According to published studies, these methods are an effective solution. However, the high dimensionality of text data decreases classification performance and results in long learning times (Wang et al. 2011). Consequently, reducing the dimensionality of text data quickly and easily while retaining classifier performance remains a problem to be solved.

Many works attempt to solve the high dimensionality problem by integrating dimension reduction techniques into machine learning methods. For example, Liu (2020) proposed a sentiment analysis model that combines bag of words and a convolutional neural network (CNN) to increase classification performance. Kim (2018) presented a semi-supervised dimension reduction framework based mainly on linear feature extraction. Liu et al. (2017) combined a feature selection algorithm with a machine learning method in a framework for multi-class sentiment classification. Khan et al. (2016) introduced a framework called SWIMS, which determines feature weights based on the sentiment lexicon SentiWordNet. However, traditional feature selection tends to select features from the majority sentiment, which usually cannot improve classifier performance, and these methods usually incur a high computational cost. Therefore, a feature selection method is needed that can quickly pick up crucial features and then build the term-document matrix (TDM) based on them.

Consequently, the major purpose of this work is to develop effective feature selection methods that improve sentiment classification performance and prevent negative sentiments from causing great damage to enterprises. This study proposes two feature selection methods, a modified categorical proportional difference (MCPD) metric and a balance category feature (BCF) strategy, which selects features equally from both positive and negative sentiments to improve sentiment classification performance. Finally, real cases of customers' text comments are provided to illustrate the effectiveness of the proposed methods.

2 Related works

2.1 Feature selection methods in sentiment classification

Sentiment classification has become very important as the amount of digital text resources increases remarkably (Gokalp et al. 2020; Chouchani and Abed 2020). The purpose of sentiment analysis is to analyze the public's sentiments, opinions, attitudes, and emotions towards different elements such as topics, products or services, individuals, or organizations (Liu et al. 2005; Khan et al. 2016; Singh et al. 2020).

According to the available works, machine learning has been reported as one of the effective solutions. For instance, Dave et al. (2003) used feature selection and scoring methods for sentiment classification of online reviews. Based on extracting and analyzing appraisal groups, Whitelaw et al. (2005) proposed a new method using support vector machines (SVM) for sentiment analysis. Abbasi et al. (2008a) proposed an entropy weighted genetic algorithm (EWGA) with SVM for recognizing the sentiments of movie reviews. Abbasi et al. (2008b) developed the SVRCE method to identify emotional states. O'Keefe and Koprinska (2009) used Naive Bayes and SVM in sentiment analysis. However, when using machine learning approaches on text data, the dimensionality problem must be considered. Consequently, feature selection methods, which aim to discover important features among a huge number of candidate attributes and achieve dimension reduction in a short time, should be taken into consideration.

Social media data suffers from the curse of dimensionality (Singh et al. 2020), because the large number of text reviews used for sentiment analysis entails huge complexity and cost (Kim 2018). Therefore, such high dimensional data requires specific pre-processing and dimension reduction, which reduces computational cost (Singh et al. 2020). Xu et al. (2020) also argued that the computational efficiency of processing a huge number of text reviews and the ability to continuously learn from increasing reviews are the major problems for sentiment classification. Among dimension reduction techniques, feature selection is one of the most popular.

Generally speaking, feature selection approaches are widely used to decrease computational cost and remove unimportant features, thereby improving classification performance (Li et al. 2007). Feature selection can obtain a high-quality minimal feature subset (Yousefpour et al. 2017). Many methods have been proposed for dimension reduction in sentiment classification. For instance, Liu et al. (2017) compared four feature selection algorithms (document frequency, CHI statistics, information gain, and gain ratio) and five machine learning algorithms (decision tree, naïve Bayes, support vector machine, radial basis function neural network, and K-nearest neighbor); their results indicated that gain ratio combined with the support vector machine has the best performance. Akhtar et al. (2017) developed a framework of feature selection and classifier ensembles using particle swarm optimization (PSO) for aspect-based sentiment analysis. Yousefpour et al. (2017) showed that part-of-speech (POS) patterns yield better classification accuracy than unigram-based features.

To sum up, feature selection algorithms can yield good performance, but they also incur high computational cost. For text data, other feature selection methods are needed to quickly select important terms and then construct the term-document matrix (TDM) from them. To avoid confusing readers, we use the term "term selection" instead of "feature selection" to denote dimension reduction tools for sentiment classification.

2.2 Term selection method

In this work, we separate feature selection into two types: term selection, which uses metrics to quickly reduce the feature space, and traditional feature selection, which incurs high computational cost but achieves good classification performance. Term selection aims to extract important and relevant attributes (keywords) that describe the collected documents from a huge number of candidate attributes, achieving dimension reduction in a short time. Unlike conventional feature selection algorithms, which can achieve good performance at high computational cost, term selection methods for text classification must quickly select important terms to construct the TDM.

Usually, a threshold on DF (document frequency) or TF-IDF (term frequency-inverse document frequency) is set to select important features: if a feature's DF or TF-IDF falls below the threshold, it is considered irrelevant. Other studies have tried to use POS tagging to pick up crucial features for sentiment classification, but so far this kind of approach has not produced significant performance improvements (Na et al. 2005; Chen and Su 2008).
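As a rough illustration, the following sketch (plain Python with a toy corpus, not code from any of the cited studies) shows how a simple DF threshold can be used to discard terms that occur in too few documents.

```python
from collections import Counter

def select_terms_by_df(docs, df_threshold=2):
    """Keep only terms whose document frequency reaches the threshold."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term at most once per document
    return {term for term, count in df.items() if count >= df_threshold}

docs = [["battery", "great", "sound"],
        ["battery", "poor", "sound"],
        ["screen", "great"]]
print(select_terms_by_df(docs))      # {'battery', 'great', 'sound'}
```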

Other approaches calculate a score for each individual feature and then select a predefined number of features based on the ranking of the scores; examples include the Chi-square statistic (CHI) and information gain (IG) (Keshtkar and Inkpen 2009; O'Keefe and Koprinska 2009; Simeon and Hilderman 2008; Tan and Zhang 2008; Ye et al. 2009). From Table 1 we can see that these methods are effective in some experiments. Zheng et al. (2004) indicated that there are two groups of feature selection methods, one-sided (e.g. correlation coefficient and odds ratio) and two-sided (e.g. IG and CHI).

Table 1 Related works of term selection and machine learning methods in sentiment classification

Among them, IG is the most widely used approach and has been shown to be effective for classifying documents. For instance, Tan and Zhang (2008) indicated that IG outperformed document frequency (DF), MI, and CHI when building SVM classifiers. Ye et al. (2009) integrated IG into SVM, Naïve Bayes, and N-gram models to identify travellers' sentiments. Wang et al. (2011) developed an improved Fisher's discriminant ratio (FLDA) for feature selection. Zheng et al. (2004) employed signed indexes to handle class imbalance problems in text categorization. Singh et al. (2020) aimed to find the optimal combination of machine learning methods (SVM, Naive Bayes, linear regression, and random forest) and feature extraction techniques (POS, BOW, and Hash tagging); they indicated that random forest and linear regression provide better results with Hash tagging.

When two-sided feature selection methods are used on binary-class data, the selected features can still be biased toward one class. Therefore, this study proposes the balance category feature (BCF) strategy, which takes the class distribution of the features into account during two-sided feature selection so as to further improve classification performance.

2.3 Categorical proportional difference (CPD)

CPD (Simeon and Hilderman 2008) is another simple term selection method for multi-class classification problems. O'Keefe and Koprinska (2009) employed CPD for binary sentiment classification. CPD is defined in Eq. (1):

$$ CPD = \frac{{\left| {PositiveDF - NegativeDF} \right|}}{PositiveDF + NegativeDF} $$
(1)

where ‘Positive DF’ represents the positive document frequency and ‘Negative DF’ means the negative document frequency.

CPD computes the Positive DF and Negative DF of a term and then calculates the proportional difference of the term between the positive and negative classes. The CPD score lies in the interval [0, 1]. If a feature appears only in positive documents or only in negative documents, its CPD score equals 1 and the feature is considered important. On the other hand, if a feature appears equally in positive and negative documents, its CPD score equals 0 and the feature is viewed as unimportant. In practice, CPD can discover useful attributes; however, even after applying CPD, the dimensionality of the text data often remains too large.
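The following minimal sketch (a hypothetical illustration, not the authors' code) computes the CPD score of Eq. (1) from a term's positive and negative document frequencies and shows the two extreme cases described above.

```python
def cpd(positive_df, negative_df):
    """Categorical proportional difference of a single term, Eq. (1)."""
    return abs(positive_df - negative_df) / (positive_df + negative_df)

print(cpd(40, 0))    # 1.0 -> term appears only in positive documents
print(cpd(25, 25))   # 0.0 -> term appears equally in both classes
```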

CPD does consider class information and can select relevant attributes effectively. However, important attributes may be deleted when the dimensionality of the training data is reduced to a low level. To demonstrate this disadvantage, consider the example in Table 2, which lists six candidate features. Feature A is clearly more relevant than the others, yet all six features have the same CPD score, so CPD alone cannot tell which one should be selected. Consequently, the important feature A might be removed when a lower-dimensional representation of the training documents is used. This is the motivation for the proposed MCPD, which enhances CPD.

Table 2 An illustrative example of drawbacks of CPD

Besides, Zheng et al. (2004) indicated that conventional term selection methods tend to select attributes from the majority class; they therefore proposed Sign-IG, which combines a sign metric with IG to classify imbalanced text data. Signed IG and signed CHI have also been employed for imbalanced text data (Ogura et al. 2011). Wang et al. (2011) proposed an improved FLDA and compared it with IG. Ye et al. (2009) and Tan and Zhang (2008) indicated that integrating IG into SVM can yield optimal performance. Consequently, this study modifies the sign index to classify candidate features into positive and negative sets, and then selects important features equally from both sets according to IG and FLDA. The results are compared with traditional IG and FLDA.

Therefore, CPD, which introduces class information, has been employed to select important terms. In practice, CPD is very easy to use and can effectively extract crucial features. However, CPD cannot dramatically reduce the size of the feature set in real-world applications.

2.4 TF and TF-IDF

After word segmentation, TF and TF-IDF are used as weights to describe text data. Every document can be viewed as an attribute vector with these weights (Zhang et al. 2007). Using TF or TF-IDF, a term-document matrix (TDM) can be built. Several term weights are widely used in text classification, including term frequency (TF), inverse document frequency (IDF), term frequency-inverse document frequency (TF-IDF), feature presence (FP), and so on. Among them, TF and TF-IDF are the most popular and widely used in text mining (Aizawa 2003; Na et al. 2005; O'Keefe and Koprinska 2009; Tan and Zhang 2008). TF-IDF is defined in Eq. (2):

$$ tf - idf = tf \times idf $$
(2)

IDF is defined as Eq. (3):

$$ idf = \log \frac{\text{the number of total documents}}{\text{the number of documents that contain term } t} $$
(3)

In Eq. (2), TF denotes the term frequency and IDF measures the general importance of a term across all documents. A higher TF or TF-IDF score indicates that the feature is more prominent in the documents. In this work, TF-IDF is used to compute the weight of a feature in a document.

Since TF and TF-IDF represent attribute weights in the TDM, these two term weighting methods are also the most widely used and simplest techniques for selecting important features from text data. TF indicates the number of occurrences of a feature; because it is easy to compute, many text mining studies use it. In this study, the method that uses TF as a threshold to remove irrelevant features is called the "FF" method (Keshtkar and Inkpen 2009; Na et al. 2005; O'Keefe and Koprinska 2009; Pang et al. 2002). TF-IDF is another popular term weighting technique, so using TF-IDF to extract relevant attributes, called the "TI" method in this study, is also very common. In both methods, attributes are extracted by removing unimportant features whose TF or TF-IDF falls below a preset threshold; features whose weights exceed the threshold are kept for further learning and the rest are removed.
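A minimal sketch of the "TI" idea, assuming scikit-learn is available (the toy corpus and the 0.5 threshold are illustrative only): build a TF-IDF weighted TDM and keep the terms whose maximum weight exceeds the threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the battery life is great",
        "poor battery and poor sound",
        "great screen and great sound"]

vectorizer = TfidfVectorizer(stop_words="english")
tdm = vectorizer.fit_transform(docs)                 # documents x terms TF-IDF matrix
terms = vectorizer.get_feature_names_out()

threshold = 0.5
max_weights = tdm.max(axis=0).toarray().ravel()      # highest weight of each term
selected = [t for t, w in zip(terms, max_weights) if w >= threshold]
print(selected)                                      # terms kept by the TI rule
```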

2.5 Support vector machines

SVM is a successful classifier developed by Vapnik (1995) and has been widely applied to sentiment classification. For example, Akhtar et al. (2017) used maximum entropy (ME), conditional random fields (CRF), and support vector machines (SVM) for aspect-based sentiment analysis. Liu et al. (2017) indicated that the support vector machine has the best performance compared with naïve Bayes, decision trees, neural networks, and K-nearest neighbor in sentiment classification. SVM has been employed to classify the sentiment of online comments regarding travel destinations, products, and movies (Tan and Zhang 2008; Na et al. 2005; O'Keefe and Koprinska 2009). Song et al. (2020) proposed an SVM-based sentiment classification model by introducing probabilistic linguistic term sets.

Alqaryouti et al. (2019) attempted to help government entities gain insights into customer expectations from reviews; they found that using lexicons and rules as input features to the SVM model achieved higher accuracy than other SVM models. To enhance the performance of sentiment analysis, Hassonah et al. (2020) presented a hybrid machine learning approach that integrates two feature selection techniques, based on ReliefF and the multi-verse optimizer (MVO) algorithm, into SVM.

These published works report that SVM has superior performance in sentiment classification. Moreover, SVM has several advantages, including the use of kernels, the absence of local minima, the sparseness of the solution, and the generalization capability obtained by optimizing the margin (Cerqueira et al. 2008). For these reasons, SVM is employed as the learner in this study.

Briefly, SVM constructs a decision boundary between two classes by mapping the training data onto a higher dimensional space via kernel functions and then finding the maximal margin hyperplane within that space. This hyperplane can thus be viewed as a classifier (Cortes and Vapnik 1995). A brief introduction to SVM is given as follows.

Given n examples \(S = \{ x_{i} ,y_{i} \}_{i = 1}^{n}\), \(y_{i} \in \{ -1, +1\}\), where \(x_{i}\) represents the condition attributes, \(y_{i}\) is the class label, and \(i\) indexes the examples, the decision hyperplane of SVM can be defined by \((w,b)\), where \(w\) is a weight vector and \(b\) a bias. Let \(w_{0}\) and \(b_{0}\) denote the optimal values of the weight vector and bias. Correspondingly, the optimal hyperplane can be written as

$$ w_{0}^{T} x + b_{0} = 0 $$
(4)

To find the optimal values of \(w\) and \(b\), the following optimization problem must be solved:

$$ \begin{aligned} \mathop {\min }\limits_{w,b,\xi } \quad & \frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{n} {\xi_{i} } \\ \text{subject to} \quad & y_{i} (w^{T} \varphi (x_{i} ) + b) \ge 1 - \xi_{i} ,\quad \xi_{i} \ge 0 \end{aligned} $$

where \(\xi\) is the slack variable, C is the user-specified penalty parameter for the error term (\(C > 0\)), and \(\varphi\) is the mapping function induced by the kernel.

SVM turns the original non-linear separation problem into a linear one by mapping the input vectors onto a higher dimensional feature space. In that feature space, the two-class separation problem reduces to finding the optimal hyperplane that linearly separates the two classes, which is solved as a quadratic optimization problem. Several kernel functions, including linear, polynomial, radial basis function (RBF), and sigmoid kernels, have been used in related works. Following the suggestions of Hsu et al. (2006), the RBF kernel is employed in this study.
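As a small sketch of the learner used here (assuming scikit-learn, whose SVC wraps LIBSVM; the random matrix stands in for a real TF-IDF term-document matrix), an RBF-kernel SVM can be trained and evaluated with five-fold cross validation as follows.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((40, 10))                        # placeholder TDM: 40 reviews x 10 terms
y = np.array([1, -1] * 20)                      # placeholder sentiment labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # C and gamma would be tuned in practice
scores = cross_val_score(clf, X, y, cv=5)       # five-fold cross validation
print(scores.mean())
```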

3 Proposed methodology

The main objective of this work is to develop two feature selection methods for increasing the performance of sentiment classification.

3.1 The proposed MCPD feature selection metric

In practice, CPD cannot greatly reduce the size of the feature space in real-world applications, even though it can identify the important attributes. To enhance CPD, we revise the original metric by introducing the variation of positive document frequency (PDF) and negative document frequency (NDF). Before defining MCPD, let \(d_{P,i} \;(i = 1,2,...,m)\) and \(d_{N,j} \;(j = 1,2,...,n)\) denote the ith positive document and the jth negative document, respectively. The indicator variables \(d_{P,i} (t_{k} )\) and \(d_{N,j} (t_{k} )\), defined in Eqs. (5) and (6), denote whether a specific feature \(t_{k}\) appears in the ith positive document and the jth negative document, respectively.

$$ d_{P,i} (t_{k} ) = \left\{ \begin{array}{ll} 1 & \text{if } t_{k} \text{ occurs in } d_{P,i} \\ 0 & \text{otherwise} \end{array} \right. $$
(5)
$$ d_{N,j} (t_{k} ) = \left\{ \begin{array}{ll} 1 & \text{if } t_{k} \text{ occurs in } d_{N,j} \\ 0 & \text{otherwise} \end{array} \right. $$
(6)

Besides, let \(m_{1}\) denote the PDF of feature \(t_{k}\), i.e. the number of positive documents in which \(t_{k}\) occurs, as defined in Eq. (7), and let \(m_{2}\) denote the NDF of \(t_{k}\), as defined in Eq. (8). CPD can then be rewritten as Eq. (9).

$$ m_{1} = \sum\limits_{i = 1}^{m} {d_{P,i} (t_{k} )} $$
(7)
$$ m_{2} = \sum\limits_{j = 1}^{n} {d_{N,j} (} t_{k} ) $$
(8)
$$ CPD = \frac{{\left| {m_{1} - m_{2} } \right|}}{{m_{1} + m_{2} }} $$
(9)

After introducing the variation of PDF and NDF into the original CPD metric, the proposed MCPD is defined as Eq. (10).

$$\mathrm{MCPD}=\sqrt{\frac{{\left({m}_{1}-\frac{{m}_{1}+{m}_{2}}{2}\right)}^{2}+{({m}_{2}-\frac{{m}_{1}+{m}_{2}}{2})}^{2}}{2}}\times \frac{\left|{m}_{1}-{m}_{2}\right|}{{m}_{1}+{m}_{2}}$$
(10)
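A direct sketch of Eqs. (9) and (10) in plain Python (the document frequencies below are illustrative): two terms with identical CPD scores are separated by MCPD, because the variation of their document frequencies enters the score.

```python
import math

def cpd(m1, m2):
    """Eq. (9): proportional difference between PDF m1 and NDF m2."""
    return abs(m1 - m2) / (m1 + m2)

def mcpd(m1, m2):
    """Eq. (10): CPD weighted by the variation of the document frequencies."""
    mean = (m1 + m2) / 2
    variation = math.sqrt(((m1 - mean) ** 2 + (m2 - mean) ** 2) / 2)
    return variation * cpd(m1, m2)

print(cpd(100, 0), mcpd(100, 0))   # 1.0 50.0 -> frequent, class-specific term
print(cpd(1, 0), mcpd(1, 0))       # 1.0 0.5  -> rare term with the same CPD score
```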

The proposed MCPD is compared with CPD, IG, and FLDA. The implementation procedure follows Fig. 1.

Fig. 1 The procedure of implementing MCPD and comparing it with traditional metrics

3.2 The proposed BCF strategy

The second objective of this work is to propose a balance category feature (BCF) strategy. Before introducing the strategy, we need to distinguish "positive features" from "negative features". A feature's sign, given in Eq. (11), determines whether the feature tends toward the positive or the negative class. In this study, the F score in Eq. (12) determines whether a feature is positive or negative: if a feature's F score is +1 (-1), the feature is considered "positive" ("negative").

$$ Sign = m_{1} (n - m_{2} ) - m_{2} (m - m_{1} ) $$
(11)
$$ F = \left\{ \begin{array}{ll} +1 & \text{if } Sign > 0 \\ 0 & \text{if } Sign = 0 \\ -1 & \text{if } Sign < 0 \end{array} \right. $$
(12)
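A minimal sketch of Eqs. (11) and (12), assuming m positive and n negative training documents (the counts below are hypothetical): the F score assigns each candidate feature to the positive set P or the negative set N.

```python
def feature_sign(m1, m2, m, n):
    """Eq. (11): sign metric for a term with PDF m1 and NDF m2."""
    return m1 * (n - m2) - m2 * (m - m1)

def f_score(m1, m2, m, n):
    """Eq. (12): +1 -> positive feature, -1 -> negative feature, 0 -> neutral."""
    sign = feature_sign(m1, m2, m, n)
    return 0 if sign == 0 else (1 if sign > 0 else -1)

print(f_score(m1=80, m2=5, m=1000, n=1000))   # +1, assigned to P
print(f_score(m1=3, m2=60, m=1000, n=1000))   # -1, assigned to N
```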

The procedure for implementing the proposed BCF strategy consists of the following five major steps; the detailed flow is shown in Fig. 2.

Fig. 2 The procedure of implementing the BCF strategy

Step 1::

Construct a candidate feature set.

We use unigrams to represent the collected documents. After removing stop words and irrelevant terms, a set of candidate features is constructed.

Step 2::

Divide candidate features into positive and negative sets.

According to Eqs. (11) and (12), we calculate the F value for every feature in the candidate set, and then assign features whose F value is + 1 (− 1) to the positive set P (negative set N).

Step 3::

Feature selection.

Step 3.1::

Calculate feature selection metric.

For the P and N sets, we calculate each term's CPD, MCPD, IG, and FLDA separately. Then, according to the CPD (or MCPD, IG, or FLDA) score, we rank the features in P and in N individually.

Next, we define IG and FLDA. For a term \(t_{k}\), its IG is defined as Eq. (13).

$$ \begin{gathered} IG(t_{k} ) = H(C) - H(C|t{}_{k}) \hfill \\ = - \sum\limits_{i = 1}^{m} {p(c_{i} )\log (p(c_{i} )) + p(t_{k} )\sum\limits_{i = 1}^{m} {p(c_{i} |t_{k} )\log (p(c_{i} |t_{k} ))} } + p(\overline{t}_{k} )\sum\limits_{i = 1}^{m} {p(c_{i} |\overline{t}_{k} )} \log (p(c_{i} |\overline{t}_{k} )) \hfill \\ = \sum\limits_{i = 1}^{m} {\left( {p(c_{i} ,t_{k} )\log \left( {\frac{{p(c_{i} ,t_{k} )}}{{p(c_{i} )p(t_{k} )}}} \right) + p(c_{i} ,\overline{t}_{k} )\log \left( {\frac{{p(c_{i} ,\overline{t}_{k} )}}{{p(c_{i} )p(\overline{t}_{k} )}}} \right)} \right)} \hfill \\ \end{gathered} $$
(13)

where \(p(c_{i} )\) is the probability that category \(c_{i}\) occurs, \(p(t_{k} )\) is the probability that term \(t_{k}\) occurs, \(p(\overline{t}_{k} )\) denotes the probability that term \(t_{k}\) does not occur, \(p(c_{i} ,t_{k} )\) means the joint probability of \(c_{i}\) and \(t_{k}\), and \(p(c_{i} ,\overline{t}_{k} )\) represents the joint probability of \(c_{i}\) and \(\overline{t}_{k}\).
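As a rough illustration of Eq. (13) for the binary case (a sketch with hypothetical counts, not the authors' implementation), IG can be computed from a 2 x 2 contingency table of the term against the two classes.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(pos_with, neg_with, pos_without, neg_without):
    """IG of a term from document counts with/without the term in each class."""
    total = pos_with + neg_with + pos_without + neg_without
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    h_class = entropy([(pos_with + pos_without) / total,
                       (neg_with + neg_without) / total])
    h_with = entropy([pos_with / n_with, neg_with / n_with]) if n_with else 0
    h_without = entropy([pos_without / n_without, neg_without / n_without]) if n_without else 0
    return h_class - (n_with / total) * h_with - (n_without / total) * h_without

# A term occurring in 90 of 100 positive and 10 of 100 negative documents.
print(information_gain(90, 10, 10, 90))   # about 0.53 bits
```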

For a certain term \(t_{k}\), its FLDA can be defined as Eq. (14).

$$ FLDA(t_{k} ) = \frac{{(E(t_{k} |P) - E(t_{k} |N))^{2} }}{{D(t_{k} |P) + D(t_{k} |N)}} $$
(14)

where \(E(t_{k} |P)\) and \(E(t_{k} |N)\) denote the conditional mean of term \(t_{k}\) with respect to the categories P and N respectively, \(D(t_{k} |P)\) and \(D(t_{k} |N)\) are the conditional variances of term \(t_{k}\) with respect to the categories P and N respectively.
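Similarly, a minimal sketch of Eq. (14), assuming the term's weights (e.g. TF-IDF values) in the positive and negative documents are collected in two lists; the numbers are illustrative.

```python
import statistics

def flda(weights_pos, weights_neg):
    """Eq. (14): squared mean difference over the sum of class variances."""
    mean_diff = statistics.mean(weights_pos) - statistics.mean(weights_neg)
    return mean_diff ** 2 / (statistics.pvariance(weights_pos)
                             + statistics.pvariance(weights_neg))

print(flda([0.8, 0.9, 0.7, 0.85], [0.10, 0.20, 0.15, 0.05]))   # large score -> discriminative term
```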

Step 3.2::

Determine the reduced feature size.

Users need to predetermine the feature size they want to retain. In this study, we reduce the dimension to 25%, 10%, and 5% of the original size, respectively.

Step 3.3::

Select features.

In this step, based on the predetermined dimension size, we implement two different selection rules, BCF1 and BCF2. BCF1 selects important attributes equally from the P and N sets based on the computed IG, FLDA, CPD, or MCPD scores. BCF2 selects candidate positive and negative features according to the original proportion of P and N.
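The two rules can be sketched as follows (a hypothetical illustration assuming the P and N sets are already ranked by a metric score, highest first).

```python
def bcf1(ranked_p, ranked_n, k):
    """BCF1: select the same number of top-ranked features from P and from N."""
    half = k // 2
    return ranked_p[:half] + ranked_n[:k - half]

def bcf2(ranked_p, ranked_n, k):
    """BCF2: select features in proportion to the original sizes of P and N."""
    total = len(ranked_p) + len(ranked_n)
    k_p = round(k * len(ranked_p) / total)
    return ranked_p[:k_p] + ranked_n[:k - k_p]

ranked_p = ["great", "excellent", "love", "perfect"]
ranked_n = ["poor", "broken", "refund", "waste", "terrible", "awful"]
print(bcf1(ranked_p, ranked_n, 4))   # ['great', 'excellent', 'poor', 'broken']
print(bcf2(ranked_p, ranked_n, 5))   # ['great', 'excellent', 'poor', 'broken', 'refund']
```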

Step 3.4::

Construct the feature set for further experiments.

This step joins the selected subsets of P and N together to form the feature set used for the training data.

Step 4::

Construct term-document matrix.

Every comment is converted into a vector of terms (keywords) with term frequency-inverse document frequency (TF-IDF) weights. Then, based on the features selected in Step 3, the collected documents are transformed into a term-document matrix (TDM).

Step 5::

Build SVM model and make conclusion.

This step builds the support vector machine (SVM) classification model. The constructed model is then validated on the test sets. Moreover, a five-fold cross validation (CV) experiment is employed for the training data. Based on the experimental results, some concluding remarks can be made.

4 Implementation

4.1 The employed data and data preparation

We employ two sentiment data sets: one real-world case of comments collected from social media and one well-known movie reviews database. Table 3 summarizes the background of the employed sentiment data. The first data set is the movie reviews database, which contains 1000 positive and 1000 negative comments. After word segmentation and stop word removal, 4428 words remain for further analysis.

Table 3 The employed textual data sets

The second data set comes from "ReviewCentre" (www.reviewcentre.com). Focusing on "MP3 product evaluations (MP3)", we collected 400 comments: 200 positive and 200 negative, with 1384 attributes. In addition, because these evaluations carry no sentiment labels, we use the 5-star rating system on the ReviewCentre website to define sentiment labels: a comment is labeled positive (negative) if its rating is above 4 stars (below 2 stars). Comments rated 3 stars are disregarded.

In addition, frequently used stop words should be removed; a useful stop word list can be found at https://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words. The software package QDA Miner was used to extract keywords and construct the TDM in this work. Each comment is converted into a vector of terms (keywords) with TF-IDF weights. In addition, LIBSVM was employed to build the SVM models (Chang and Lin 2001), with the RBF kernel function. The optimal SVM parameter settings were obtained by grid search.
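As a sketch of this tuning step (assuming scikit-learn, whose SVC wraps LIBSVM; the exponent grids below follow a common convention and are not the exact grids used in this study), the RBF parameters C and gamma can be tuned by grid search with cross validation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.random((60, 20))                 # placeholder TF-IDF term-document matrix
y = np.array([1, -1] * 30)               # placeholder sentiment labels

param_grid = {"C": [2.0 ** k for k in range(-5, 11, 2)],
              "gamma": [2.0 ** k for k in range(-15, 1, 2)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)               # best C and gamma on the grid
```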

4.2 Experimental results

4.2.1 Movie review case

Table 4 summarizes the results for the movie reviews. In this experiment, we compare the feature selection approaches at different dimensionalities. The comparison baseline is the original data from which only stop words were removed, without any feature selection; it contains 4428 attributes. After the five-fold CV experiment, the average classification accuracy is 75.75% with a standard deviation of 4.20%.

Table 4 Results of movie reviews

Next, we carried out experiment #1 by decreasing the number of features from 4428 to 50. When the dimensionality decreases from 4428 to 1000, the accuracies of both CPD and MCPD rise greatly (from 75.75 to 93.50%), because they retain important attributes. In contrast, TF and TF-IDF show a significant performance loss. In fact, Fig. 3 shows that the performances of FF and TI decline throughout the dimension reduction process. Therefore, we compare only CPD and MCPD in what follows.

Fig. 3 Results of term selection methods with SVM at six dimensions (movie review case)

There are 415 attributes with the highest CPD score (CPD = 1). If a feature set smaller than 415 is required, no selection criterion can be followed; the only option is to randomly select attributes from those with the same CPD score. At dimensions 1000, 700, and 400, both CPD and MCPD achieve good classification performance. However, as the dimension size keeps decreasing, a performance gap appears for CPD: its accuracy drops dramatically to 79.00%, 72.00%, and 68.75% when the dimension size decreases from 400 to 200, 100, and 50, respectively. This is the drawback of CPD mentioned above. In contrast, when the dimension decreases from 400 to 200, 100, and 50, the proposed method still outperforms the others, with accuracies of 88.75% (dimension = 200), 83.00% (dimension = 100), and 81.25% (dimension = 50).

To provide statistical evidence, we test the three hypotheses listed in Table 5. The results in Table 5 indicate that the p values of all hypotheses are far less than 0.05, so the null hypotheses (H0) can be rejected. Consequently, with 95% confidence, the proposed MCPD-based SVM is better than FF, TI, and CPD in the movie review case.

Table 5 Hypothesis testing for verifying experiment results (movie review case)

The performance decreases of the FF and TI methods are stable; however, they cannot capture important attributes effectively. Therefore, the results in Fig. 3 show that MCPD is superior to CPD even when the dimensionality is low. In addition, both CPD and MCPD are superior to the widely used FF and TI methods.

Table 6 summarizes the results of the second experiment. We set eight MCPD thresholds, from 1 to 8, to select crucial attributes. When features whose MCPD values are below 1 to below 3 are removed, the number of attributes decreases while the classification performance rises greatly. When features whose MCPD values are below 4 to below 8 are removed, the classification performance eventually declines with the decreasing number of attributes, but the accuracies are still acceptable. Consequently, MCPD can retain important attributes and screen out non-crucial ones. A small MCPD value indicates that the attribute occurs with similar frequency in both classes and cannot effectively identify the class labels; a large MCPD value indicates that the attribute occurs frequently in one of the classes and can effectively identify the class labels.

Table 6 Results of the second experiment (movie reviews)

4.2.2 MP3 product evaluation case

Table 7 summarizes the results for the MP3 product reviews. This data set has 1382 attributes. After the five-fold CV experiment, the average classification accuracy is 81.50% with a standard deviation of 8.59%; this is the comparison baseline.

Table 7 Results of the first experiment (mp3 product evaluation case)

When the dimension decreases from 1382 to 1000 and 700, CPD achieves 84.75% and 90.75%, respectively, while MCPD achieves 86.5% and 87.5%. Compared with the raw data, both CPD and MCPD perform better than the benchmark (81.5%). For FF and TI, we again observe declining classification performance as the dimension of the feature set decreases, as confirmed by Fig. 4.

Fig. 4 Results of term selection methods with SVM at six dimensions (mp3 product evaluation case)

In the MP3 product evaluation case, there are 616 attributes with the highest CPD score (CPD = 1). Therefore, CPD performs well when the dimension size is larger than 616 (i.e., 1000 and 700). However, when the dimension decreases from 700 to 400 (below 616), the classification accuracy drops remarkably from 90.75% to 81%, a loss of almost 10%. As the dimension keeps dropping to 200, 100, and 50, the CPD performances of 71.75% (dimension = 200), 60.75% (dimension = 100), and 61.25% (dimension = 50) are even worse than those of the FF and TI methods.

In contrast, the proposed MCPD has stable performance. At dimensions 400, 200, 100, and 50, MCPD achieves excellent performance for classifying reviewers' sentiments: 87.75% (dimension = 400), 85.25% (dimension = 200), 86.50% (dimension = 100), and 86.50% (dimension = 50). Even when the dimension is reduced from 1382 to 50, the performance of MCPD (86.50%) is better than the benchmark (81.5%).

To provide statistical evidence, we test the three hypotheses listed in Table 8. The results in Table 8 indicate that the p values of all hypotheses are far less than 0.05, so all null hypotheses (H0) can be rejected. Thus, with 95% confidence, the proposed MCPD-based SVM is better than FF, TI, and CPD in the MP3 review case.

Table 8 Hypothesis testing for verifying experiment results (MP3)

Table 9 lists the results of the second experiment on the MP3 reviews. Eight MCPD thresholds, from 1 to 8, were set for selecting crucial attributes. When features whose MCPD values are below 1 to below 3 are deleted, the number of attributes decreases but the classification performance rises remarkably. When features whose MCPD values are below 4 to below 8 are removed, the classification performance decreases only slightly. Even in the worst case, removing features whose MCPD scores are below 8 leaves only 67 features, yet the accuracy of 85.75% is still greater than the benchmark (81.50%). Consequently, the MCPD method can extract attributes useful for classifying sentiment.

Table 9 Results of the second experiment (mp3 review)

4.3 Results of BCF strategy

The BCF strategy has been developed for two-sided feature selection methods such as MCPD, IG, and FLDA. Table 10 shows the experimental results of the MP3 product reviews and the movie reviews using the BCF strategy combined with MCPD. For the MP3 product reviews, the improvement of the MCPD method combined with the BCF strategy is not significant. However, for the movie review data at low dimensions (dimension size reduced to 25%, 10%, and 5% of the original), BCF combined with MCPD significantly improves classification performance, and the BCF1-MCPD method has the best classification performance.

Table 10 Results of implementing BCF strategy to MCPD

From the experimental results, BCF1-MCPD and BCF2-MCPD are generally superior to MCPD. Among them, the classification performance of BCF1-MCPD is significantly better than that of BCF2-MCPD and the original MCPD.

When the dimension size is reduced to 25%, 10%, and 5% of the original dimensionality, the BCF strategy combined with MCPD performs better. Therefore, we also combined the traditional CPD, IG, and FLDA methods with the BCF strategy and conducted experiments at the reduced 25%, 10%, and 5% dimension sizes. Table 11 summarizes the experimental results for the MP3 product reviews. The classification performance of the CPD method combined with the BCF strategy is significantly improved only at the 10% dimension, whereas the IG and FLDA methods show significant improvements at all three feature dimensions. In addition, the results for IG and FLDA show that classification performance with the BCF1 rule is generally better than with the BCF2 rule.

Table 11 BCF strategy combined with CPD, IG, FLDA experimental results (MP3 product reviews)

Table 12 shows the experimental results of the CPD, IG, and FLDA methods combined with the BCF strategy on the movie reviews. The results indicate that CPD, IG, and FLDA combined with the BCF strategy improve classification performance at all three dimensions. When BCF2-CPD uses the 25% feature dimension, all three evaluation indicators show the best classification performance. In addition, the classification performance of the IG and FLDA methods combined with BCF1 is generally better than with BCF2, and the best classification performance is achieved at the 10% dimension.

Table 12 BCF strategy combined with CPD, IG, FLDA experimental results (movie reviews)

4.4 Concluding remarks

In addition to comparing the proposed MCPD with the traditional CPD, TI, and FF methods, this section also conducted experiments on the BCF strategy combined with MCPD and the traditional IG, CPD, and FLDA methods. Based on the results, the following concluding remarks can be made.

1. The proposed MCPD method significantly mitigates the poor classification performance that CPD exhibits when a lower-dimensional feature space is used.

2. The evaluation results show that MCPD generally achieves better classification performance when fewer features are used.

3. MCPD combined with the BCF strategy can improve classification performance, with BCF1-MCPD giving the better results; however, the improvement on the MP3 product reviews is less obvious.

4. The CPD, IG, and FLDA methods combined with the BCF strategy can improve classification performance, and BCF1 generally gives the best results.

5 Conclusions

To tackle the dimensionality problem of the huge number of text reviews in social media, we proposed the MCPD method and the BCF strategy. The results indicate that MCPD outperforms the traditional term selection methods CPD, TI, and FF. In addition, the BCF strategy combined with MCPD achieves even better performance. Consequently, using BCF and MCPD together yields the best sentiment classification performance.

From the experimental results, some concluding remarks can be drawn. First, it is confirmed that CPD has a drawback at low dimensionality and that MCPD can indeed enhance CPD. In the classification experiments, both CPD and MCPD outperform the FF and TI methods, which are widely used term selection techniques because they are very easy to calculate. Like CPD, FF, and TI, MCPD shares this characteristic of being easy to apply. This matters for sentiment classification because, as the number of online reviews increases, the feature space of the textual data grows dramatically; if a feature selection method cannot reduce the dimensionality at low computational cost, it may be impractical for real-world applications. Second, the optimal interval of MCPD thresholds lies in [2, 4]: users of MCPD can set a threshold between 2 and 4 and then select important attributes based on it, using fewer attributes to obtain better sentiment classification performance.

This study proposed an easy and simple term selection technique called MCPD for extracting crucial features for sentiment classification. The experimental results indicate that the proposed MCPD-based SVM learning scheme overcomes the drawback of CPD at lower dimensions. In addition, even when the dimension size is reduced from 4428 and 1382 to 50 features, MCPD still performs better than the original, full-dimensional raw data. Therefore, our method not only increases the performance of sentiment classification but also dramatically reduces the dimensionality.

Moreover, as mentioned above, two-sided feature selection methods may be biased toward a certain class. Therefore, this study proposed the BCF strategy. The results indicate that BCF1 + MCPD and BCF1 + FLDA achieve the best performance when the feature space is reduced to an extremely low level.

With the popularity of the Internet, the amount of text comments in social media will continue to increase remarkably. Consequently, our method is suitable not only for real-world sentiment classification data but also for general text classification problems. In addition, TF-IDF is used as the term weight in the TDM; exploring different term weighting methods is a potential direction for future work.