1 Introduction

People rely on consumer reviews and recommendations for insight when purchasing products, trying new restaurants, choosing a primary care physician, and selecting other goods and services. In the past, consumer reviews and recommendations were passed along via word of mouth or publications, but with the ubiquity of the Internet, more people now obtain them on the web. Google, Yelp, and Amazon are a few of the websites loaded with user reviews on a multitude of topics. Reviews can be created by anyone with access to the web, and this open access leads to one of the main issues with online reviews: spam.

Spam reviews can be classified into three categories: untruthful reviews, reviews on brands, and non-reviews (Dixit and Agrawal 2013). Untruthful reviews are fake reviews created with malicious intent. Reviews on brands are reviews not specific to a product, but to the entity creating the product. Non-reviews comprise all postings that do not actually review a product. This research focuses on the detection of untruthful reviews, as these have been shown to be the most difficult to detect (Jindal and Liu 2008).

Spam reviews are created by “spammers” or suspicious persons to either harm or bolster the reputation of an establishment or product. Because these reviews are public and usually posted by an unknown user rather than passed along by word of mouth, it is increasingly difficult to determine whether a review comes from a reputable source. Up to one-third of all online reviews are considered untruthful, and this prevalence of false reviews has led to a drop in the credibility of online reviews (I.C. Government of Canada 2014). To combat this issue, many methods using machine learning tools, specifically supervised learning, have been proposed.

Supervised machine learning algorithms are trained on labeled data sets, in our case spam review data sets. These algorithms learn to classify reviews as truthful or untruthful based on features describing the reviews. The labeled data sets contain instances (the reviews), features describing each review, and the corresponding class labels. The features describing a review typically include review text features, the metadata found in the review (rating, date, time, etc.), and reviewer-oriented features, which describe the user who created the review. Since we are interested in predicting spam reviews based only on the text, we explore only the effects of text-based features and leave other features for future work.

Unfortunately, due to the text-based nature of reviews, the feature space associated with a data set in the domain of spam detection is large. A large feature space, commonly referred to as high dimensionality, creates challenges such as higher computational costs and redundant or useless features (Haykin 1998). Classification algorithms are not adept at handling a high-dimensional feature space and see degraded performance due to overfitting (Crawford et al. 2016). A family of machine learning techniques, known as feature selection, can be implemented to combat the problem of high dimensionality. Feature selection has only recently been employed in spam review detection studies (Crawford et al. 2016; Mukherjee et al. 2013).

In addition to feature selection, a family of techniques called ensemble learning has been shown to increase classifier performance and reduce overfitting (Dietterich 2000). In previous work, we have found ensemble techniques, such as Bagging and Boosting, increase classification performance in other text classification domains (Prusa et al. 2015). However, ensemble techniques alone do not combat the issue of high dimensionality (Heredia et al. 2016). To combat high dimensionality while still taking advantage of the ensemble framework, feature selection can be embedded within ensemble techniques. We employ three such techniques: Select-Boost, Select-Bagging, and Random Forest. These techniques combine feature selection and ensemble learning to form classifiers that are resilient to the adverse effects brought on by high dimensionality. We build upon the results found in Crawford et al. (2016) by examining ensemble learners, which are known to improve performance in a variety of domains, and determine how the addition of feature selection to ensemble algorithms impacts classification performance. To the best of our knowledge, this is the first study to employ the combination of ensemble and feature selection techniques in the domain of spam review detection.

In this study, we present results for feature selection and aim to determine the impact of ensemble techniques and the combination of feature selection and ensemble techniques on spam review detection. We discuss results for a total of ten feature selection techniques across five classifiers to demonstrate the effects of feature selection. We explore the effectiveness of Boosting and Bagging when identifying review spam. Our Select-Boost and Select-Bagging algorithms are implemented using five base classifiers: multinomial naïve Bayes (MNB), naïve Bayes (NB), support vector machines (SVM), logistic regression (LR), and C4.5 Decision Tree (C4.5). We embed signal-to-noise (S2N), Chi-Squared (CS), and Mutual Information (MI) within the Select-Boost and Select-Bagging approaches. Random Forest is conducted with three tree sizes: 100 (RF100), 250 (RF250), and 500 (RF500). All models are evaluated using four runs of fivefold cross-validation with model performance measured using the area under the receiver operator characteristic curve (AUC) metric. Our study is unique in that we provide a comprehensive overview of feature selection, ensemble methods, and the combinations of both in spam review detection.

The highest performing non-ensemble classifier for spam review detection is multinomial naïve Bayes (Crawford et al. 2016; Heredia et al. 2016). When using MNB, no significant changes in performance occur with the addition of an ensemble technique (Bagging or Boosting). Moreover, the inclusion of feature selection when training MNB results in a lower AUC score for smaller subset sizes. However, applying a combination of ensemble and feature selection (Select-Boost) shows significant improvement in performance when compared to using MNB alone. Our results indicate the combination of Select-Boost, MNB, and Chi-Squared (or signal-to-noise) to be the best performing model, significantly outperforming all other methods, with the exception of RF500. Select-Boost does produce a higher AUC score than RF500; however, the difference between the two is not significant. The Select-Boost framework combines re-weighting and feature selection in each iteration, which is fundamental for improving model performance. Our results are tested for significance using ANalysis Of VAriance (ANOVA) and Tukey’s honestly significant difference (HSD) tests (Berenson et al. 1983).

The remainder of the paper is organized as follows. Section 2 contains works related to spam detection, ensemble learning, and feature selection. Section 3 provides insight into the feature selection techniques used. Section 4 describes ensemble techniques, such as Boosting and Bagging. In Sect. 5, we examine techniques that combine feature selection and ensemble learning: Select-Boost, Select-Bagging, and RF. Section 6 summarizes our methodology including data set information, learners, cross-validation, and the performance metric. Section 7 contains the first case study, which presents baseline results for each base learner and the effectiveness of feature selection. In Sect. 8, we present our second case study, discussing the results for ensemble techniques and the combination of ensemble and feature selection. Section 9 provides a discussion of the results. Finally, in Sect. 10, we present our final conclusions and possible avenues for future work.

2 Related work

The area of spam review detection includes various studies dedicated to detecting untruthful reviews using supervised learning. There exist multiple types of feature sets which may be used for detection of untruthful reviews (Crawford et al. 2015). However, one of the main problems with spam review detection is labeled data set availability. A study by Jindal and Liu (2008) used review-oriented features to predict whether a review was classified as spam or not spam. A data set was generated by collecting 5.8 million reviews of products offered by Amazon. A subset of 222,000 instances was sampled from the original data set and a method known as w-shingling (Broder 1997) was utilized to detect near-duplicate reviews. The near-duplicate reviews were classified as untruthful reviews. From these reviews, 36 additional features were derived and used in a logistic regression model to predict spam reviews. These additional features consisted of Part Of Speech (POS) and reviewer-centric features. Using the full feature set, their model resulted in an AUC of 0.78 when detecting duplicate reviews. The data set was tested against two other review categories: brand-only and non-reviews. For brand-only and non-reviews, the resulting AUC values were over 0.98, indicating the detection of untruthful reviews to be a more difficult task.

A study by Ott et al. (2011) proposed a data set of considerably smaller size with an equal number of positive (spam) and negative (non-spam) instances. The authors put forth a data set created using Amazon Mechanical Turk (AMT) (Buhrmester and Gosling 2011). AMT is a service provided by Amazon that allows access to a scalable work force. By leveraging AMT, the authors were able to generate fake positive reviews for hotels. The data set consists of 800 positive-sentiment hotel reviews, half of which are fake reviews generated through AMT, while the remaining half are truthful reviews obtained from TripAdvisor. The best model created had an accuracy of 89.8% using a support vector machine (SVM) classifier with a combination of Linguistic Inquiry and Word Count (LIWC) and bigram features. Although this data set has been used in many studies, it was shown to have a major flaw in that it is not representative of real-world review scenarios. The word distributions found in the AMT reviews are significantly different from distributions found in truthful reviews, allowing for easier detection of false reviews (Mukherjee et al. 2013). A more realistic data set was proposed by Li et al. (2014), which extends the AMT data set created by Ott et al. (2011) through the addition of expert reviews and by combining reviews from three different domains: hotels, restaurants, and doctors. To create a more comprehensive and accurate real-world data set, Li et al. (2014) obtained reviews from employees of the locations being reviewed in addition to reviews from AMT. The authors elected to use the hotel reviews to train the model and the doctor reviews to test the model. Using a classification algorithm modeled on the Sparse Additive Generative Model (SAGE), they obtained accuracies of 64.7 and 63.4% when using LIWC and POS features, respectively. We elected to use this data set in our study, since it most closely represents real-world review scenarios.

In Shojaee et al. (2013), stylometric attributes were used for the detection of untruthful reviews. Stylometric features consist of lexical and syntactic features, which give an indication of the vocabulary and grammar used by the writer. Lexical features focus on word or character usage, while syntactic features represent the style of the writer, such as the use of function words like “the”, “of”, and “a”. Three sets of experiments were performed using the data set from Ott et al. (2011). The first set used solely lexical features, the second used syntactic features, and the final set used a combination of both. Two classifiers were trained with each feature set, SVM and naïve Bayes, using tenfold cross-validation. For all three feature sets, SVM outperformed naïve Bayes in terms of F-measure. The best F-measure, 0.84, was obtained using the feature set containing the combination of lexical and syntactic features. However, the addition of stylometric features enlarges the feature space and increases computational cost, and no feature selection was performed to optimize the feature space.

A study by Mukherjee et al. (2013) used reviewer-centric features to detect spam reviews. The data set used was a combination of the data set found in Ott et al. (2011) and reviews collected from Yelp. In total, there were three feature sets: one consisting of LIWC features, one consisting of POS features, and a final set containing bigram features. The study applied feature selection using Information Gain (IG) to select the top 1 and 2% of features from each set. The models created using the reduced feature space did not generalize well to real-world data, and feature selection was found to offer no improvement in classification performance. However, only a single feature selection technique was tested; thus, no conclusive results can be drawn about the effect of feature selection on the classification of spam reviews. A study by Crawford et al. (2016) determined the effects of ten feature rankers on the classification of spam reviews. Ten filter-based feature selection techniques were applied across five classifiers and a range of subset sizes. Out of the ten feature selection techniques, Chi-Squared was chosen for comparison against a word frequency feature selection method. Results show multinomial naïve Bayes has the highest performance and Chi-Squared performs better than word frequency at low subset sizes across all classifiers. However, at large subset sizes, word frequency performs just as well as Chi-Squared, if not better.

Ensemble techniques have largely gone unused in spam review detection studies in favor of more traditional supervised learning approaches. Our study is the first to apply a wide range of ensemble techniques in the area of spam review detection. Ensemble techniques may be useful for detecting untruthful reviews, as they have been shown to increase performance on noisy data (Dietterich 2000). There is evidence of the effects of ensemble techniques on classifier performance in related domains such as sentiment detection (Prusa et al. 2015). A previous study on tweet sentiment classification showed the improved effects of employing ensemble classifiers with feature selection (Prusa et al. 2015). The study compares the performance of Select-Boost and Select-Bagging against various feature selection techniques using K-Nearest Neighbor (KNN), C4.5 decision tree, multilayer perceptron (MLP), and logistic regression (LR) as base learners. Two tweet sentiment data sets were used: SemEval and Sentiment140. A Threshold-Based Feature Selection (TBFS) technique using the area under the Receiver Operator Characteristic (ROC) curve as its ranking metric was employed within Select-Boost and Select-Bagging. Results showed Select-Boost performing significantly better than Select-Bagging and plain classification on both tweet sentiment data sets.

Our study is unique since it explores the effects of Boosting, Bagging, Select-Boost, Select-Bagging, and Random Forest in the spam review domain. Moreover, no other study has examined the effect of Boosting and Bagging in this domain. We also compare the classification performance of these techniques against feature selection and the base learners. To the best of our knowledge, this is the first study to present a thorough comparison of these techniques in the realm of spam review detection.

3 Feature selection

Feature selection can be performed using wrapper-based, filter-based, embedded, or hybrid techniques. Wrapper-based techniques use a classifier to rank the performance of a subset of features; they do not scale well to a large feature space due to their computationally expensive approach. Filter-based techniques can be categorized into two types: filter-based subset evaluation and filter-based ranking. Filter-based subset evaluation techniques attempt to find the subset of features which maximizes classification performance. Usually, this process uses a greedy approach: the subset is initialized with the single feature that best discriminates between the classes (or, alternatively, with all features, from which the least discriminative feature is removed). Candidate features are then added to (or removed from) the subset one at a time and tested, and the feature yielding the largest improvement in performance is kept. This process is repeated until the best subset of features is found. Similar to wrapper-based techniques, filter-based subset evaluation does not scale well with very high-dimensional data, as it is computationally expensive.

Unlike the previous two approaches, feature rankers provide the type of fast and scalable feature selection suitable for very large feature spaces. Each feature is scored relative to the class label using a chosen measure, and the features are ranked by this score. In this study, filter-based feature rankers were employed, as they scale well with high-dimensional data sets and have been previously used in detection of untruthful reviews (Crawford et al. 2016). We applied ten different feature rankers to the data set, drawn from three feature selection families: commonly used (Dittman et al. 2010), Threshold-Based Feature Selection (TBFS) (Dittman et al. 2010), and first-order statistics (FOS) (Khoshgoftaar et al. 2012). These families use distinctly different methods to rank features. The following three subsections describe the feature ranker techniques used and the families they belong to.

3.1 Commonly used techniques

This family of commonly used techniques consists of techniques that are widely used and easily accessible. These techniques are found in most open-source machine learning tool-kits, such as the Weka toolkit (Hall et al. 2009). From this family of techniques, we employ the Chi-Squared (CS) ranker (Witten and Frank 2005).

Chi-Squared utilizes the Chi-Squared statistic to measure the relationship between a feature and the class. This feature ranker measures the independence of two events, in this case the events are the occurrence of the feature and the occurrence of a specific class. If we consider the occurrence of a feature as A and the occurrence of a class as B, then these occurrences are independent if \(P(AB) = P(A) * P(B)\). The formula below shows the common form of the Chi-Squared ranker:

$$\begin{aligned} {\chi }^2=\sum _{k=1}^{n} \frac{(O_k - E_k)^2}{E_k} \end{aligned}$$

where \(O_k\) is the observed count and \(E_k\) is the expected count (under independence) for the kth cell of the feature-class contingency table. A higher Chi-Squared value indicates some form of dependence between the feature and the class, whether positive or negative.
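To make the ranking procedure concrete, the following sketch (our illustration, not the implementation used in the study) scores binary term-occurrence features against a binary class label by building the 2x2 feature-class contingency table and comparing the observed counts with the counts expected under independence:

```python
import numpy as np

def chi_squared_scores(X, y):
    """Rank binary term-occurrence features by the Chi-Squared statistic.

    X: (n_docs, n_terms) 0/1 matrix of term occurrence.
    y: (n_docs,) 0/1 class labels (1 = spam).
    Returns one score per feature; a higher score indicates a stronger
    dependence between the term and the class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j]
        # Observed counts of the 2x2 table (term present/absent x class).
        obs = np.array([
            [np.sum((f == 1) & (y == 1)), np.sum((f == 1) & (y == 0))],
            [np.sum((f == 0) & (y == 1)), np.sum((f == 0) & (y == 0))],
        ])
        # Expected counts under independence: E = row_total * col_total / n.
        exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
        with np.errstate(divide="ignore", invalid="ignore"):
            cells = np.where(exp > 0, (obs - exp) ** 2 / exp, 0.0)
        scores[j] = cells.sum()
    return scores

# Features are then sorted by descending score and the top subset retained.
```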

3.2 Threshold-based feature selection techniques

TBFS techniques are bi-variate procedures, where each feature is evaluated against the class, independent of all other features (Dittman et al. 2010). All attributes are first normalized to the range [0,1], then a simple classifier is built for every threshold \(t \in [0,1]\). Two classification rules are used for building these simple classifiers. Rule 1 classifies instances with a normalized value greater than t as positive and instances with a normalized value less than t as negative. Rule 2 is the converse of rule 1, classifying instances with a normalized value greater than t as negative and instances with a normalized value less than t as positive. A metric \(\omega\), primarily a measure of classification performance (Dittman et al. 2010), is then used to create the attribute ranking. We use Mutual Information (MI), the area under the Precision–Recall Curve (PRC), the area under the receiver operator characteristic curve (ROC), Gini-Index, Kolmogorov–Smirnov (KS), significance analysis of microarrays (SAM), and probability ratio as the \(\omega\) metrics used to rank features.
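A simplified sketch of how such a threshold-based score could be computed for a single feature is shown below; the threshold grid and the use of the best value over both rules are our assumptions, and `metric` stands in for any higher-is-better \(\omega\) metric (for example, `sklearn.metrics.f1_score`):

```python
import numpy as np

def tbfs_score(feature, y, metric, thresholds=np.linspace(0.0, 1.0, 101)):
    """Threshold-Based Feature Selection score for one feature.

    The feature is min-max normalized to [0, 1]; at each threshold t two
    simple classifiers are formed (rule 1: value > t -> positive;
    rule 2: the converse) and scored with the chosen metric.  The feature's
    score is the best metric value over all thresholds and both rules.
    """
    f = np.asarray(feature, dtype=float)
    f = (f - f.min()) / (f.max() - f.min() + 1e-12)   # normalize to [0, 1]
    best = -np.inf
    for t in thresholds:
        rule1 = (f > t).astype(int)   # rule 1: above threshold is positive
        rule2 = 1 - rule1             # rule 2: above threshold is negative
        best = max(best, metric(y, rule1), metric(y, rule2))
    return best
```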

Mutual Information determines the importance of a feature by measuring the contribution of the presence, or absence, of a feature toward a correct classification (Peng et al. 2005). The formula below describes mutual information:

$$\begin{aligned} I(W,C) = \sum _{e_W \in [0,1]} \sum _{e_C \in [0,1]} p(e_W,e_C) \log \frac{p(e_W,e_C)}{p(e_W)p(e_C)} \end{aligned}$$

where W is the word and C is the class. This formula can be simplified to \(I(X,Y) = H(Y) - H(Y|X)\) or \(I(X,Y) = H(X) - H(X|Y)\), where \(H(X)\) and \(H(Y)\) are marginal entropies and \(H(X|Y)\) and \(H(Y|X)\) are conditional entropies, i.e., the amount of uncertainty remaining in one variable once the other is known.
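For a binary word-occurrence feature and a binary class, the summation above can be computed directly as follows (a minimal sketch; the base-2 logarithm is our assumption):

```python
import numpy as np

def mutual_information(word_present, y):
    """I(W; C) for a binary word-occurrence feature and a binary class.

    Follows the summation above; cells with zero joint probability
    contribute nothing to the sum.
    """
    w = np.asarray(word_present, dtype=int)
    y = np.asarray(y, dtype=int)
    mi = 0.0
    for ew in (0, 1):
        for ec in (0, 1):
            p_joint = np.mean((w == ew) & (y == ec))
            p_w, p_c = np.mean(w == ew), np.mean(y == ec)
            if p_joint > 0:
                mi += p_joint * np.log2(p_joint / (p_w * p_c))
    return mi
```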

The PRC metric measures the area under the curve of Recall on the x-axis and Precision on the y-axis. Recall is defined as the True Positive Rate (TPR) and Precision is defined as the fraction of instances classified as positive that are actually positive.

The ROC metric measures the area under the curve of False Positive Rate (FPR) on the x-axis and TPR on the y-axis. This directly indicates model performance, where the higher the area under the ROC curve the better the model performance across all thresholds.

Probability Ratio was used for text classification in Forman (2003) and is defined as the probability of the feature given the positive class divided by the sample estimate of the probability of the feature given the negative class. The formula (TPR/FPR) describes the probability ratio between the positive and negative classes.

The Gini-Index was originally utilized for measuring income over a population. The Gini-Index has a range from 0 to 1, which indicates the following: GI \(=\) 0 implies the income is split evenly across the population, and GI \(=\) 1 means that a single person receives all the income. The feature ranker version of the Gini-Index analyzes the distribution of a feature across the classes.

The KS statistic is a nonparametric test that measures the difference in the cumulative distribution of two data sets.

SAM was originally created for detecting significance in microarrays within the bioinformatics domain. SAM compares the observed and expected values of a parameter to identify features whose values differ by a statistically significant amount between the sets (Tusher et al. 2001).

3.3 First-order statistics techniques

The FOS family consists of uni-variate feature rankers that use first-order statistics, such as the mean, mode, and standard deviation, to rank features in order of importance. In our experiments, two feature rankers from this family are used: signal-to-noise (S2N) (Chen and Wasikowski 2008) and Wilcoxon Rank Sum (WRS) (Breitling and Herzyk 2005).

The signal-to-noise ratio represents the ratio of signal information to noisy background information. This process can be used to determine how well a feature discriminates between two classes. The formula for S2N is:

$$\begin{aligned} S2N = \frac{(\mu _P - \mu _N)}{(\sigma _P+\sigma _N)} \end{aligned}$$

where \(\mu _P\) is the mean of the positive class, \(\mu _N\) is the mean of the negative class, \(\sigma _P\) is the standard deviation of the positive class, and \(\sigma _N\) is the standard deviation of the negative class.
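A minimal sketch of this computation for a single feature (the small constant in the denominator is added only to avoid division by zero):

```python
import numpy as np

def s2n_score(feature, y):
    """Signal-to-noise ratio of one feature: (mu_P - mu_N) / (sigma_P + sigma_N)."""
    f, y = np.asarray(feature, dtype=float), np.asarray(y)
    pos, neg = f[y == 1], f[y == 0]
    return (pos.mean() - neg.mean()) / (pos.std() + neg.std() + 1e-12)
```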

The Wilcoxon Rank Sum is a variation of the standard t-statistic. In a standard t-statistic, the distribution is assumed to be normal, while in the WRS no assumption on the distribution is made. There are two steps to be performed before defining the WRS: (1) all instances are ranked based on the value of the feature, and (2) all the rankings in the positive class are summed as \(W_p\). The formula for WRS is:

$$\begin{aligned} \hbox {WRS} = \frac{\left( W_p - \frac{n_p(n_p +1)}{2}\right) -\frac{n_p n_n}{2}}{\sqrt{\frac{n_p n_n(n_p +n_n + 1)}{12}}} \end{aligned}$$

where \(n_p\) and \(n_n\) denote the number of positive and negative instances, respectively.
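A sketch of this statistic for a single feature, using average ranks for tied values (an assumption of this sketch):

```python
import numpy as np
from scipy.stats import rankdata

def wrs_score(feature, y):
    """Wilcoxon Rank Sum statistic for one feature (standardized form above)."""
    f, y = np.asarray(feature, dtype=float), np.asarray(y)
    ranks = rankdata(f)                 # rank all instances by feature value
    w_p = ranks[y == 1].sum()           # W_p: rank sum of the positive class
    n_p, n_n = np.sum(y == 1), np.sum(y == 0)
    numerator = (w_p - n_p * (n_p + 1) / 2.0) - n_p * n_n / 2.0
    return numerator / np.sqrt(n_p * n_n * (n_p + n_n + 1) / 12.0)
```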

4 Ensemble techniques

Two different ensemble algorithms are used in our study: Boosting and Bagging. These techniques are similar in that they combine multiple instances of a base learner to generate a more robust and generalized classifier; however, the process by which this is achieved differs. Boosting takes an iterative approach: a model is trained and evaluated, then subsequent classifiers are trained and evaluated based on the results of the previous classification (Freund and Schapire 1996). In our study, we use the AdaBoost family of techniques, specifically AdaBoost.M1. AdaBoost works by training a base learner, re-weighting the misclassified instances, and repeating this process every iteration. Subsequent iterations therefore place more focus on the classification of previously misclassified instances, due to the weights. All initial weights are set to one. The algorithm runs for a predetermined number of iterations, then aggregates the results of all models to form a final decision. In our study, ten boosting iterations are performed and model results are aggregated using majority voting. Due to the re-weighting step, Boosting is not directly compatible with certain base learners; for these learners, instances are instead re-sampled from the original data set with probabilities proportional to their weights, which makes them compatible with Boosting.
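A minimal sketch of this Boosting procedure, using the resampling workaround and the simple majority voting described above (the stopping conditions and weight-update form follow standard AdaBoost.M1; the scikit-learn-style `fit`/`predict` interface is an assumption):

```python
import numpy as np

def boost_with_resampling(X, y, make_learner, n_rounds=10, seed=0):
    """AdaBoost.M1-style sketch driven by weight-proportional resampling."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    w = np.ones(n)                                     # initial weights of one
    models = []
    for _ in range(n_rounds):
        p = w / w.sum()                                # sampling probabilities
        idx = rng.choice(n, size=n, replace=True, p=p) # resample by weight
        clf = make_learner().fit(X[idx], y[idx])
        miss = clf.predict(X) != y
        err = p[miss].sum()                            # weighted training error
        models.append(clf)
        if err == 0 or err >= 0.5:                     # AdaBoost.M1 stopping conditions
            break
        w[miss] *= (1 - err) / err                     # up-weight misclassified instances
    return models

def majority_vote(models, X):
    """Aggregate the ensemble's predictions by simple majority voting (0/1 labels)."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```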

Bagging, also known as bootstrap aggregating, uses data sampling with replacement (re-sampling from the original data set, where the same instance can be re-selected) to create a predefined number of bootstrap data sets; in our study, ten bootstrap data sets are created. These bootstrap data sets are the same size as the original and contain instances from the original data set. Each bootstrap data set is then used to train a single base classifier. Once all the classifiers have been trained, an aggregation technique, in our case majority voting, is used to determine the final classification. Bagging has been shown to increase the performance of weak classifiers but to adversely affect the performance of stable learners (Breiman 1996).
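A corresponding Bagging sketch under the same assumptions (scikit-learn-style learners, 0/1 labels):

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, make_learner, n_bags=10, seed=0):
    """Bootstrap aggregating sketch: n_bags bootstrap samples of the original
    size (sampling with replacement), one base classifier per bootstrap,
    and a majority vote over the trained classifiers."""
    rng = np.random.default_rng(seed)
    y_train = np.asarray(y_train)
    n = len(y_train)
    votes = []
    for _ in range(n_bags):
        idx = rng.choice(n, size=n, replace=True)      # bootstrap data set
        clf = make_learner().fit(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    return (np.stack(votes).mean(axis=0) >= 0.5).astype(int)
```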

5 Ensemble techniques with feature selection

In addition to Boosting and Bagging, we want to examine the effects of ensemble techniques that take advantage of feature selection. Three of these techniques are used in our study and explained in this section: Select-Boost, Select-Bagging, and Random Forest.

Select-Boost is a modified version of the AdaBoost.M1 algorithm developed by our research team. The Select-Boost (Prusa et al. 2015) framework embeds feature selection within the Boosting ensemble framework. As with Boosting, Select-Boost is incompatible with certain base learners due to the re-weighting step. To mitigate this, we re-sample from the original data set according to the assigned weights, creating a new training data set in which the probability of selecting an instance is equal to its weight. Initially, all instances in the training data are assigned equal weights and the data is sampled from the original data set. Feature selection is applied and a classifier is trained on the reduced feature space. The weights are then updated based on the misclassified instances and the next Select-Boost iteration begins. As with Boosting, we use ten iterations of the Select-Boost framework. Figure 1 depicts the Select-Boosting framework; the dotted line represents the path the framework takes once all iterations have finished.
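A sketch of this procedure is given below (our simplified reconstruction, not the exact implementation); `rank_features(X, y)` is assumed to return one relevance score per feature, such as the Chi-Squared or signal-to-noise scorers sketched in Sect. 3:

```python
import numpy as np

def select_boost(X, y, make_learner, rank_features, k, n_rounds=10, seed=0):
    """Select-Boost sketch: each iteration resamples the training data by
    instance weight, ranks features on the resampled data, keeps the top k,
    and trains a base learner on the reduced feature space."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    w = np.ones(n)                                        # equal initial weights
    ensemble = []                                         # (model, selected feature indices)
    for _ in range(n_rounds):
        p = w / w.sum()
        idx = rng.choice(n, size=n, replace=True, p=p)    # weight-driven resampling
        Xs, ys = X[idx], y[idx]
        top = np.argsort(rank_features(Xs, ys))[::-1][:k] # per-iteration feature selection
        clf = make_learner().fit(Xs[:, top], ys)
        miss = clf.predict(X[:, top]) != y
        err = p[miss].sum()
        if 0 < err < 0.5:
            w[miss] *= (1 - err) / err                    # focus on misclassified instances
        ensemble.append((clf, top))
    return ensemble

def select_boost_predict(ensemble, X):
    """Majority vote over the per-iteration models (0/1 labels assumed)."""
    votes = np.stack([clf.predict(X[:, top]) for clf, top in ensemble])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```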

Select-Bagging (Prusa et al. 2015), like Select-Boosting, adds a feature selection step to the Bagging algorithm. The Select-Bagging framework first creates a predefined number of bootstrap data sets from the original through sampling with replacement. As with Bagging, ten bootstrap data sets are created for Select-Bagging. Feature Selection is applied to every bootstrap data set and then each data set is used to train a classifier. The results of those trained classifiers are aggregated and a final classification is made through majority voting. Figure 2 shows the Select-Bagging framework.
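Under the same assumptions, Select-Bagging can be sketched as:

```python
import numpy as np

def select_bagging(X, y, make_learner, rank_features, k, n_bags=10, seed=0):
    """Select-Bagging sketch: feature selection is applied independently to
    each bootstrap data set before training its classifier; predictions are
    combined by majority vote, as in plain Bagging."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    ensemble = []                                          # (model, selected feature indices)
    for _ in range(n_bags):
        idx = rng.choice(n, size=n, replace=True)          # bootstrap sample
        Xs, ys = X[idx], y[idx]
        top = np.argsort(rank_features(Xs, ys))[::-1][:k]  # per-bootstrap feature selection
        ensemble.append((make_learner().fit(Xs[:, top], ys), top))
    return ensemble
```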

Fig. 1 Select-Boosting framework

Fig. 2 Select-Bagging framework

6 General methodology

This section describes the general methodology for our experiments with ensemble and feature selection techniques. We describe the data set, the base learners, cross-validation, and the performance metric.

6.1 Data set

Our data set originated from the study by Li et al. (2014). The data set contains spam reviews from three distinct domains: doctors, hotels, and restaurants. The original data set was obtained through a process similar to that used in Ott et al. (2011), using AMT. AMT (Buhrmester and Gosling 2011) is a service provided by Amazon which gives customers access to an on-demand, scalable workforce. This workforce was employed to create fake reviews using real reviews as a baseline. As AMT uses regular workers, who may have no knowledge of domain keywords, the word distributions found in the reviews are significantly different from word distributions found in real reviews (Mukherjee et al. 2013). These synthetic reviews are not always representative of real-world spam data; to counteract this, domain experts were hired to create false reviews for their specific domains. For example, an employee in a doctor’s office was asked to write false reviews for their practice. Li et al. (2014) posited that employees would have a better grasp of daily processes, which would lead to better synthetic spam reviews. Two examples of reviews from the data set can be found below; from a human standpoint, determining which of the two is spam from the text alone is nearly impossible. The following review is an untruthful spam review:

The rates at The Talbott Hotel were cheaper than I had expected, and that was my reason for booking a room. I had been prepared for service and a room similar to what I had experienced in the past, and I was quite pleased when I did stay. The room was neat and clean, and the halls were quiet at night. The traffic noise was muffled to the point where it was no problem sleeping either. I did ask one question at the service desk and they answered it nicely, which is good because normally hotel workers can be a bit snippy, especially at night. Overall I had no problems with The Talbott Hotel and I would stay at this here again if I were in the area a second time.

The following review is a truthful review, created by a guest who stayed in the hotel:

My husband & I stayed at the Fitzpatrick in early June 2004 for my birthday-great hotel! Location provides easy walking to Navy Pier, Marshall Field, Michigan Avenue & John Hancock. Room seemed spacious even though its only about 300 sq ft. Lots of room in bathroom; comfortable bed; very quiet, upscale hotel. We would definitely stay here again!

The data set characteristics can be found in Table 1. The domain of review spam detection suffers from a lack of data sets representative of real-world scenarios. To the best of our knowledge, this data set is the closest publicly available text-based spam review data set representative of real-world data. The data set contains the text found in the review, the rating given, the domain the review belongs to, and the class label. For our purposes, the rating and domain information are not required, as we are interested in using only text; thus, they are removed from the data set. To create the feature set, the StringToWordVector filter in Weka is used. While StringToWordVector can output a variety of word vector representations, we elect to use a bag-of-words representation over TF-IDF, as preliminary studies found it to be more effective. The StringToWordVector filter in Weka returns the number of words specified in the WordsToKeep parameter based on word frequency. However, if there is a tie in word frequencies, it returns all the words with that specific frequency, causing more words to be returned than the specified number. To remedy this, we created the ExactStringToWordVector filter, which returns the exact number of words specified in the WordsToKeep parameter, sampling randomly from the words that have equal frequency. If a value larger than the number of unique words in a document is specified for the WordsToKeep parameter, all unique words are extracted as features.
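As an illustration only, a roughly analogous bag-of-words extraction can be performed with scikit-learn's CountVectorizer; this is not the authors' Weka filter, the texts and labels below are hypothetical, and, unlike ExactStringToWordVector, frequency ties are broken deterministically rather than by random sampling:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical review texts and spam labels (1 = spam); not from the actual data set.
texts = [
    "The rates at the hotel were cheaper than I had expected ...",
    "Location provides easy walking to Navy Pier and Michigan Avenue ...",
]
labels = [1, 0]

# Keep at most the k most frequent words as bag-of-words count features,
# roughly mirroring the WordsToKeep behaviour described above.
k = 1000
vectorizer = CountVectorizer(max_features=k)
X = vectorizer.fit_transform(texts)   # sparse (n_reviews, <= k) count matrix
```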

Table 1 Data set characteristics

6.2 Base learners

In this study, we use five base learners in combination with the Boosting and Bagging ensemble techniques: naïve Bayes (NB), multinomial naïve Bayes (MNB), support vector machine (SVM), logistic regression (LR), and C4.5 decision tree (C4.5). We provide only a brief discussion of these learners here, since they are all well-understood classifiers; an interested reader may consult the provided references for more information. All models in this paper are built using the Weka data mining open-source software package (Hall et al. 2009) with default parameter values, unless otherwise specified. Default parameter values were changed only when preliminary experimentation showed an overall improvement in classification performance.

Naïve Bayes (Rish 2001) falls under the category of Bayesian learners. Naïve Bayes uses Bayes’ theorem to approximate the posterior probability of an instance belonging to a class based on its feature values (Witten and Frank 2005). Naïve Bayes makes the “naïve” assumption that all features are independent. Although this assumption rarely holds in practice, naïve Bayes still offers good performance and we consider it a good baseline learner for comparisons. The formula for naïve Bayes is found below:

$$\begin{aligned} \hat{y} = p(C_k) \prod _{i=1}^{n} p(x_i|C_k) \end{aligned}$$

where \(\hat{y}\) is the predicted class, \(p(C_k)\) is the prior probability of class \(C_k\), and \(\prod _{i=1}^{n} p(x_i|C_k)\) is the product of the conditional probabilities of the instance’s feature values given class \(C_k\).

Multinomial naïve Bayes (McCallum and Nigam 1998) is a variant of the naïve Bayes learner. The main difference between NB and MNB is the way in which the probabilities are calculated. Naïve Bayes uses conditional probabilities of each feature to help determine classification status. In multinomial naïve Bayes for text classification, the instance (a document in this example) is assigned to the class with the highest conditional probability \(P(C|X)\). To estimate this probability, the words shared between the document and the class are counted and the count is divided by the total number of words. If \(P(C_1|X) > P(C_2|X)\), the document is classified as \(C_1\); otherwise, it is classified as \(C_2\). The Weka implementation of MNB takes no parameters; thus, no changes can be made to its defaults.
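As a concrete illustration, a standard multinomial naïve Bayes scoring function with add-one smoothing (the smoothing choice is our assumption and is not described in the text) can be sketched as:

```python
import numpy as np

def mnb_log_scores(X_counts, y, x_new, alpha=1.0):
    """Log of P(C_k) * prod_i P(word_i | C_k)^count_i for a new document.

    X_counts: (n_docs, n_words) bag-of-words count matrix.
    y: 0/1 class labels.  x_new: (n_words,) counts of the document to score.
    The document is assigned to the class with the higher score.
    """
    X_counts = np.asarray(X_counts, dtype=float)
    y = np.asarray(y)
    scores = []
    for c in (0, 1):
        Xc = X_counts[y == c]
        prior = Xc.shape[0] / X_counts.shape[0]        # P(C_k)
        word_totals = Xc.sum(axis=0) + alpha           # smoothed word counts in class k
        cond = word_totals / word_totals.sum()         # P(word_i | C_k)
        scores.append(np.log(prior) + np.dot(x_new, np.log(cond)))
    return scores                                      # compare scores[0] vs scores[1]
```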

Support vector machine (Hsu et al. 2003) constructs a hyper-plane that divides the instances into two groups. The data may be transformed via a kernel function into linearly separable spaces. The transformations allow for nonlinear boundaries to be formed around the data. The best such hyper-plane would be the one that maximizes the distance between the hyper-plane and members of each class. For our models, the complexity constant “c” was set to 5.0 and the “buildLogisticModels” parameter set to “true.”

Logistic regression (Hosmer and Lemeshow 2004) seeks to find a linear relationship between the features and the class label. This process is different from linear regression as it is used for the task of classification. Logistic regression aims to create a probability function that uses features as inputs and returns the probability of that instance belonging to a class as an output. Logistic regression was chosen due to its simplicity and effectiveness. The formula for logistic regression can be found below:

$$\begin{aligned} \log \frac{P}{1-P} = \beta _0 + \beta _1 X_1 + \beta _2 X_2 + \cdots + \beta _n X_n \end{aligned}$$

where \(\frac{P}{1-P}\) is the odds of the instance belonging to the positive class, \(X_i\) is the value of feature i, and \(\beta _i\) is the coefficient associated with feature \(X_i\).

The final learner, C4.5 decision tree (Quinlan 2014), creates a tree based on the features that discriminate the most between classes. The decision tree process uses IG to determine a decrease in entropy when examining a certain feature. The tree splits at the feature that minimizes entropy and maximizes information for a class, meaning the more discriminant features will be found toward the top of the tree. The final classification is found in the leaves of the tree. When constructing the trees, we implement “Laplace smoothing” and “no pruning” as this increased performance in previous studies (Witten and Frank 2005).

6.3 Cross-validation and performance metric

All models are trained using four runs of fivefold cross-validation. For each run of cross-validation, the data is split into five separate folds. At any time, four of the five folds are used to train the model, while the remaining fold is used to evaluate it. This process is repeated five times, varying the fold used for testing, so every fold is used for testing once. We chose cross-validation over random sampling because all of the data is used both to train and to test the model in cross-validation, while only part of the data is used with a random sampling approach. Four runs of fivefold cross-validation are performed to reduce the bias due to an unlucky split of the data when creating the folds. The results of the four runs are averaged to produce the final results.

We elect to use the Area Under the receiver operator characteristic Curve (AUC) as our performance metric (Witten and Frank 2005). The ROC curve plots the True Positive Rate against the False Positive Rate across all decision thresholds, so the larger the area under this curve, the better the performance of the model. This metric is not to be confused with the TBFS technique ROC, which uses the same area to rank individual features.
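The evaluation protocol can be sketched as follows; this is an illustrative reconstruction rather than the Weka experiment setup used in the study, stratified folds are our assumption, and multinomial naïve Bayes stands in for any of the learners:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

def mean_auc_4x5cv(X, y, make_learner=MultinomialNB, runs=4, folds=5):
    """Four runs of fivefold cross-validation, averaging the per-fold AUC."""
    y = np.asarray(y)
    aucs = []
    for run in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        for train_idx, test_idx in cv.split(X, y):
            clf = make_learner().fit(X[train_idx], y[train_idx])
            scores = clf.predict_proba(X[test_idx])[:, 1]   # P(spam)
            aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```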

7 Case Study I: plain classification and feature selection

7.1 Non-ensemble

We consider results for classification using the base learners and word frequency in the interest of having a baseline for comparison with ensemble methods and feature selection techniques. For this purpose, we examine how our five learners perform as we vary the number of words extracted by changing the WordsToKeep parameter, which selects words based on frequency. In Fig. 3, we see the resulting AUC values of the models generated using the base learners and varying numbers of features selected by frequency. We use a modified version of the StringToWordVector filter in the WEKA toolkit (Hall et al. 2009) that returns exactly the number of words specified in the frequency selection option WordsToKeep; for example, specifying 100 for WordsToKeep yields 100 distinct words based on their frequency in the data set. This allows for comparison between word frequency performance and the performance of features chosen by a feature ranker. Word frequency selection is used with 46 feature subset sizes ranging from 100 to 20,000 to determine the performance of word frequency as a feature ranker (20,000 is chosen as the upper limit and encompasses the full feature set, as the data set contains fewer than 20,000 unique words).

In Fig. 3, it can be seen that the best performing learner is multinomial naïve Bayes and that its performance improves as WordsToKeep (the number of features used to train the classifier) increases; however, the other learners do not follow this trend. C4.5 and NB level off at approximately 2000 features. LR has the highest performance when using approximately 6500 features. SVM performance levels off at approximately 10,000 features. To aid in the choice of the best learner and number of features, we conduct a two-factor ANalysis Of VAriance (ANOVA) to determine whether the choice of learner and the number of features significantly impact classifier performance. The results of the ANOVA test are presented in Table 2A and show that both factors and their interaction are significant.

Fig. 3 Classification with word frequency

We conduct further tests using Tukey’s honestly significant difference (HSD) test, which groups the levels of a factor through pairwise similarity tests and determines whether the difference between two levels is significant. First, we wish to determine which learner, without an ensemble technique, is best. If two classifiers share the same letter, the difference in their performance is not statistically significant. Table 2B shows that multinomial naïve Bayes is the best learner across all levels. From Fig. 3, we see that performance for multinomial naïve Bayes increases with the number of features; however, from the figure alone we cannot determine when this increase is no longer significant. To determine the top performing subset size, an additional HSD test, found in Table 3, is conducted with feature subset sizes ranging from 100 to 20,000. We found there is no significant difference when using more than 900 features; however, the highest AUC (0.8995) is observed using the full feature set. Thus, multinomial naïve Bayes with the full feature set will be used as the plain classifier in comparison with feature selection in the following section.
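For readers unfamiliar with this statistical procedure, the sketch below runs a two-factor ANOVA and a Tukey's HSD test with statsmodels on synthetic AUC values (not the paper's results):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-fold AUC results: 20 folds (4 runs x 5 folds) for each
# learner / feature-count combination.  Values are synthetic.
rng = np.random.default_rng(0)
rows = []
for learner, base in [("MNB", 0.88), ("SVM", 0.84), ("NB", 0.78)]:
    for n_feat in (1000, 10000, 20000):
        for _ in range(20):
            rows.append({"learner": learner, "n_features": n_feat,
                         "auc": base + 0.001 * np.log10(n_feat) + rng.normal(0, 0.01)})
results = pd.DataFrame(rows)

# Two-factor ANOVA: are learner, feature count, and their interaction significant?
model = ols("auc ~ C(learner) * C(n_features)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey's HSD pairwise comparison of learners (shared group letters indicate
# no statistically significant difference).
print(pairwise_tukeyhsd(results["auc"], results["learner"]))
```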

Table 2 ANOVA and Tukey’s HSD test for plain classification
Fig. 4 Average AUC score using feature selection. a C4.5, subset sizes from 100 to 5000; b logistic regression, subset sizes from 100 to 10,000; c naïve Bayes, subset sizes from 100 to 5000; d multinomial naïve Bayes, subset sizes from 100 to 10,000; e support vector machines, subset sizes from 100 to 10,000

7.2 Feature selection

We would like to confirm the effects of feature selection on spam review detection to determine whether to use word frequency or feature selection as the baseline against which ensemble techniques are compared in Case Study II. To test the performance of each feature selection technique, we use all base learners and subset sizes ranging from 100 to 1000, as this was shown to be a good range of feature subset sizes (Crawford et al. 2016). As it is difficult to distinguish significant differences between the rankers, a Tukey’s HSD test at a 95% confidence level is used to conduct pairwise similarity tests to determine whether the difference between two given rankers is statistically significant. Table 4 presents the results of this test, summarized by placing rankers into groups. Both Chi-Squared and signal-to-noise belong to the top group, while ROC, KS, MI, and PRC are all in the second group. The remaining rankers perform considerably worse than these six rankers.

In Fig. 4, classification results for Chi-Squared are compared against those using word frequency. Chi-Squared is chosen over signal-to-noise as it is more commonly used and the results for both signal-to-noise and Chi-squared are similar across all classifiers as shown in Table 4. We do not provide the results for every feature selection technique against every learner, as the emphasis of this study is on ensemble methods with feature selection and not feature selection alone. We see that as subset size increases, the difference between Chi-Squared and word frequency decreases. Subfigures a and c show performance for subset sizes up to 5000, while subfigures b, d, and e show performance up to 10,000. This subset size difference is due to performance; for C4.5 and NB there is no difference in performance between word frequency and Chi-Squared past 5000 features, while for LR, MNB, and SVM there is a difference up to 10,000 features. Figure 4 shows word frequency performing as well as a feature ranker when working with larger subset sizes. As more features (words) are used, the overlap in feature sets between feature ranker and word frequency increases, leading to similar performance. It is important to note that, with either Chi-Squared or word frequency, multinomial naïve Bayes has the highest AUC scores for most subset sizes.

Table 3 Tukey’s HSD test for WordsToKeep with MNB
Table 4 Tukey’s HSD test of feature selection techniques across all classifiers with feature subset sizes from 100 to 1000

Additionally, multinomial naïve Bayes’ performance increases with the number of features when using Chi-Squared, though these changes are not significant. The difference between Chi-Squared and word frequency is minimal when using larger subset sizes. In the case study on algorithms combining ensemble and feature selection techniques presented in the following section, signal-to-noise, Chi-Squared, and Mutual Information are chosen as the feature selection techniques to be embedded within ensemble learners. We elect to use S2N and CS due to their performance, as both techniques fall under group ‘a’. MI is selected because of its performance in other social media/text domains, such as sentiment analysis (Prusa et al. 2015). We consider this a similar domain, as both tweet sentiment and review spam data are social media content: online, publicly viewable, user-generated, and composed of short texts. This allows us to explore the effects of Select-Boost and Select-Bagging on top-scoring rankers as well as a group ‘b’ ranker, resulting in a more robust experiment.

8 Case Study II: ensemble techniques

This section presents results related to the ensemble techniques and the combination of ensemble techniques and feature selection. We first present results for Boosting and Bagging; then, we present results for RF, Select-Boost, and Select-Bagging.

8.1 Boosting and Bagging

Results for our experiments utilizing Boosting and Bagging can be found in Table 5. From Table 5, we observe the top performing combination is MNB with Boosting. Bagging shows higher AUCs for three of the five learners (C4.5, LR, SVM), while Boosting produces higher AUCs for the remaining two (MNB, NB).

Table 5 AUC results for Boosting and Bagging across five base learners and all feature subset sizes

To determine whether Boosting or Bagging statistically improves performance over the base learners, we perform a two-factor ANOVA and a Tukey’s HSD test. Table 6 presents both the ANOVA and Tukey’s HSD results for Boosting, Bagging, and the base learners. The two-factor ANOVA test is presented in Table 6A. The factors for this ANOVA are ensemble technique and base learner: the ensemble techniques are Boosting and Bagging, while the base learners are NB, MNB, SVM, C4.5, and LR. The ANOVA results show there is a significant difference between base learners, between ensemble methods, and for the interaction between them.

Table 6 ANOVA and Tukey’s HSD for Boosting and Bagging

From Table 6B, we see the average AUC and Tukey’s groups across all experiments for Boosting and Bagging. The learners fall into distinct groupings labeled ‘a’ through ‘f’. These results indicate that ensemble techniques significantly increase the performance of SVM, LR, NB, and C4.5: Bagging significantly increases performance over the base SVM and LR learners, while Boosting significantly increases the performance of LR, NB, and C4.5. We see that the best base learner is MNB. Moreover, the differences between MNB with and without ensemble techniques are not significant; however, not using an ensemble technique is faster and less computationally costly. Boosting with MNB does produce the highest AUC and the lowest standard deviation.

8.2 Random Forest

Random Forest provides a different ensemble approach compared to Bagging and Boosting and performs its own internal random feature selection; thus, the classifier is trained using the full feature set. This ensemble technique does not directly use feature rankers, but it does perform random feature subspace selection at every node within a tree.

Table 7 Tukey’s HSD test for Random Forest and C4.5

Table 7 presents a Tukey’s HSD test with AUC scores for each of the three RF tree sizes and for the C4.5 learner. The result for the C4.5 learner is included as a baseline for comparison, as it is a single decision tree. All RF tree sizes significantly outperform the C4.5 baseline. Moreover, the results show that RF250 and RF500 perform significantly better than RF100, but there is no significant difference between RF250 and RF500. Comparing the results in Table 7, we see RF500 generates the highest AUC score.

Fig. 5 Results for Select-Boost and Select-Bagging. a Chi-Squared with Select-Boosting; b Chi-Squared with Select-Bagging; c Mutual Information with Select-Boosting; d Mutual Information with Select-Bagging; e Signal-to-Noise with Select-Boosting; f Signal-to-Noise with Select-Bagging

8.3 Select-Boost and Select-Bagging

Figure 5 presents results for ensemble classifiers trained using Select-Boost and Select-Bagging grouped by feature ranker (Signal-to-Noise, Chi-Squared, and Mutual Information) and ensemble method (Select-Bagging and Select-Boosting). In each subfigure, AUC for each learner is plotted against the number of attributes selected.

From the subfigures, there are several important trends that warrant further discussion. First, Select-Boost yields higher performance than Select-Bagging for all rankers. We have seen this trend in previous work, where we found Select-Boost outperforms Select-Bagging in the text classification domain (Prusa et al. 2015). Second, as seen in Fig. 5, multinomial naïve Bayes produces the highest AUC of any base learner when used within Select-Boost.

We are interested in which factors (base learner, feature selection technique, ensemble approach, and subset size) are significant; thus, we conduct a four-factor ANOVA. Results are shown in Table 8 and indicate that all four factors and many of their interactions are significant. To investigate each factor, namely base learner, feature selection (FS in the table), ensemble method (ensemble in the table), and subset size, we conduct a Tukey’s HSD test per factor. Our HSD test for base learners, presented in Table 9A, shows that multinomial naïve Bayes and SVM are the top performing learners and that the difference between them and the other learners is significant; however, the difference between MNB and SVM is not significant. Table 9B shows that Chi-Squared or Signal-to-Noise should be chosen over Mutual Information. As the difference between Signal-to-Noise and Chi-Squared is not significant and Chi-Squared is more readily available, we use Chi-Squared for the remaining experiments. Table 9C shows Select-Boost performs significantly better than Select-Bagging. These tests indicate the best performing classifiers are trained with Select-Boost using Chi-Squared or Signal-to-Noise as the ranker and MNB or SVM as the base learner.

Table 8 ANOVA for ensemble classifiers

In Fig. 5, we see that the AUC values for Select-Boost are higher than those for Select-Bagging, and Table 9C shows this difference is significant. Furthermore, we observe higher AUC values when using MNB as the base learner within Select-Boost, so we examine this combination further. There is little change in AUC for multinomial naïve Bayes across feature subset sizes with Select-Boost. As this is our best performing learner, further investigation is warranted to determine the optimal subset size. A Tukey’s HSD test determines whether feature subset size significantly impacts classifier performance with MNB as the learner and Chi-Squared as the ranker. Results, presented in Table 10, show that there are no significant differences, as all subset sizes are in the same group; however, 400 features resulted in the highest observed AUC.

8.4 Comparisons

The results presented in the previous sections indicate that multinomial naïve Bayes is the best performing learner when using no ensemble technique and one of the top base learners when using the Select-Boost ensemble framework. Thus, it is of interest to compare how this learning algorithm performs with and without the Select-Boost framework, and how it compares to the best performing models generated with Random Forest. Additionally, we have yet to compare ensemble algorithms without feature selection against ensemble methods with feature selection. Thus, we compare six models: RF250, RF500, MNB as a plain classifier, Boosting with MNB, Bagging with MNB, and Select-Boost with MNB using Chi-Squared and 400 features.

Table 9 Tukey’s HSD for base learners, feature rankers, and ensemble techniques
Table 10 Tukey’s HSD test for feature subset size with multinomial naïve Bayes and Select-Boost with Chi-Squared
Fig. 6 Visualization of Tukey’s HSD for models

Table 11 ANOVA and Tukey’s results for models

Table 11A presents a one-factor ANOVA result (choice of model) for the previously stated classifiers. In this ANOVA, a model encompasses the combination of base learner, feature selection technique, and ensemble technique. From Table 11A, we note that the choice of model is significant; thus, we use a Tukey’s HSD test to determine the differences between the six models. Table 11B presents a Tukey’s HSD test showing the AUC values for each of these techniques grouped by pairwise similarity. Figure 6 depicts the results found in Table 11B, where each model is shown along with its confidence interval; if there is no overlap between two models, their averages are significantly different. We observe two main groups, ‘a’ and ‘b’. The bottom group, ‘b’, contains RF250, MNB as a plain classifier, and Boosting and Bagging using MNB. Group ‘a’ contains the Select-Boost algorithm with the previously listed components, implying Select-Boost performs significantly better than RF250, Bagging, Boosting, and MNB with no ensemble technique. RF500 is in group ‘ab’, indicating no statistically significant difference from either group.

9 Discussion

Our results show that the combination of Select-Boost, multinomial naïve Bayes, and either Chi-Squared or Signal-to-Noise significantly outperforms all other methods except RF500. However, it is interesting to note that while Select-Boost improves performance significantly, the components that make up the Select-Boost framework do not increase performance when applied separately.

Feature selection techniques alone do not improve classification over using the full feature set; in fact, feature selection shows degraded performance at smaller feature subset sizes. This could be due to properties specific to this data set. The data set contains reviews from three domains, and the features (words) that indicate spam in one domain may not be the same in another domain. For example, the word ‘small’ may be positive with respect to cell phones, but negative with respect to hotel rooms. Thus, a reduced feature set may not be able to discriminate between truthful and untruthful reviews across all three domains, leading to lower classification performance.

We also observe that, while Boosting increases the AUC of our MNB model, the difference is not significant. It is likely that Boosting and Bagging show no improvement over the base MNB model because they do not address the high dimensionality of the data. When we compare Select-Boost to ensemble techniques using feature selection (RF and Select-Bagging), we see closer performance. Select-Boost still outperforms Select-Bagging, RF100, and RF250; however, the difference between RF500 and Select-Boost is not significant. Overall, we recommend Select-Boost over RF500, since Select-Boost has less variance than RF500, most likely due to the independently grown trees in RF500, and is faster than RF500.

Select-Boost significantly outperforms Select-Bagging, implying that an aspect of Select-Boost’s iterative framework significantly affects performance. This performance difference may be due to the combination of the re-weighting step found in Boosting and the feature selection performed in Select-Boost. The Select-Boost framework re-weights instances every iteration, with greater weight placed on previously misclassified instances, by re-sampling (with replacement) from the original data set with selection probabilities determined by the instance weights. As feature selection is performed in every iteration after the re-sampling step (where misclassified instances are more likely to be present), the weights assigned to the instances directly affect the features chosen by the feature rankers. Each iteration places more focus on misclassified instances and their features, which leads to a progressively more optimized feature set as the Select-Boosting framework progresses. Thus, Select-Boost is more effective than Select-Bagging, which performs feature selection on randomly sampled bootstrap data sets, and offers a significant improvement in classification performance, whereas Boosting and feature selection, applied individually, do not.

10 Conclusion

In this study, we tested the performance of ensemble classifiers, classification using feature selection, and the combination of both on spam review detection. We evaluated the performance of Select-Boost and Select-Bagging against Random Forest, feature selection, Boosting, and Bagging. Select-Boost and Select-Bagging were employed with five learners, three feature selection techniques, and subset sizes ranging from 100 to 1000 in increments of 100. Random Forest was run with three tree sizes: 100, 250, and 500. Ten feature selection techniques were used with subset sizes ranging from 100 to 20,000. The effects of ensemble classifiers on spam review detection had not previously been explored. As spam reviews make up as much as one-third of all online reviews, the exploration of advanced machine learning techniques to further current methods for the detection of spam reviews becomes increasingly necessary.

Our results show that feature selection degrades classification performance when compared to using the full feature set. Model performance steadily increases as the feature set size increases, although this increase is no longer significant above 900 features when using word frequency. Ensemble techniques without feature selection had no significant effect on classification performance for spam detection. However, using Select-Boost, a combination of feature selection and Boosting, significantly increased classification performance. Select-Boost significantly outperformed the base MNB classifier, Boosting, Bagging, Select-Bagging, RF100, and RF250. Select-Boost provides a higher AUC value than all other learners; however, the difference between Select-Boost and RF500 was not significant.

We recommend the use of Select-Boost over Select-Bagging. The re-weighting step in Select-Boost is crucial in generating an optimized, reduced feature set, allowing for significantly better performance, since each iteration selects a new subset of features that aid in the correct classification of previously misclassified instances. Select-Boost should be chosen over RF500 as it results in a higher (although not significantly different) AUC and is faster. We found that the combination of Select-Boost, multinomial naïve Bayes, Chi-Squared, and a subset size of 400 yields the highest AUC when detecting spam reviews, although Chi-Squared and Signal-to-Noise may be interchanged. There is no significant difference in performance between feature subset sizes when using feature selection within Select-Boost; thus, while the highest AUC is observed when selecting 400 features, computational resources can be conserved by using a smaller number of features without significantly degrading classification performance.

Future work may involve testing this process on other data sets to see if results generalize. Additionally, work should be conducted to establish what types of features are the most beneficial, since we only employed text-based features.