1 Introduction

E-commerce and the Internet have expanded rapidly in recent years. This exponential growth has given rise to online reviews, and reliance on such reviews has also increased tremendously. Some of the cases where online reviews play a vital role are:

  1. When we want to buy something through an online retail website, reviews of both the product and the seller are important.

  2. When we want to buy software, online reviews are of great help.

  3. Online reviews of movies help people decide whether or not to watch a movie.

Due to the rise of e-commerce, online customer reviews have become an important part of our decisions when purchasing products online. Every customer who buys a product has his/her own perspective on it, which is expressed in the form of a review, and it is quite difficult to judge with the naked eye whether a review is fake or genuine. Several e-commerce platforms, including Amazon, Flipkart, Myntra, eBay, etc., provide the facility of writing user reviews of a product; a review may be of any type and thus influence the decision of the buyer. Hence, online reviews may enhance or degrade the reputation of a brand or product, and the prediction of spam reviews is therefore a prerequisite.

Jindal et al. [13] classified spam reviews into three types:

  1. Untruthful reviews: Reviews that purposely deceive readers or review-mining systems, either by writing undeserved positive reviews for a specific target object for false promotion (also known as hyper spam) or, on the other hand, by writing negative reviews for some other specific object to deteriorate its image (also known as defaming spam).

  2. Non-reviews: Reviews that contain irrelevant content and advertisements.

  3. Reviews on brands: Reviews that focus mainly on promoting a brand rather than on the product itself.

Among the three categories of reviews, detecting untruthful reviews is the hardest task. Some examples of untruthful reviews are as follows:

Review 1

“Fantastic Product. I haven’t seen a better HP LaserJet Pro with such cool features. It has wifi, lan connectivity. I can print from my phone laptop any device wireless to the machine. And it shows up instantly. Moreover, the whole setup process was a breeze. I would not have got a better HP Laserjet product with MFP capabilities at this price. Hats off!”

Review 2

“The copy quality of the printer is horrible, scan quality is just ok and print quality acceptable. It doesn’t have led light in the display so its gonna be more difficult and annoying for you to operate this printer”

Determining whether a review is fake or real is highly crucial for the user. Therefore, several standard methods such as n-grams, bag of words, filtering, parts of speech, etc., have been proposed. In the n-gram model [35], a contiguous sequence of n items from a text or speech sample forms a feature. The major drawback of n-grams is that the feature spaces they produce are highly sparse. In the bag-of-words based method [2], individual words are treated as features for classifying spam reviews, ignoring the meaning and order of the words, which makes such methods less effective for spam identification. Many other methods based on unigrams [37, 38], lexical features [4], and syntactical features [40] have been used by researchers for spam detection. Furthermore, supervised, unsupervised, and semi-supervised machine learning algorithms are commonly employed to detect spam reviews. Saeed et al. [1] presented a study comparing the predictive accuracy of several machine learning methods for detecting phishing emails. Tingmin et al. [51] presented a novel deep-learning-based technique to identify Twitter spam. Chang et al. [8] gave a comparative analysis of different methods for identifying fake hotel reviews. Mukherjee et al. [24] introduced a method that identifies groups of fake reviewers targeting consumer products. Li et al. [15] presented a machine-learning-based two-view co-training algorithm to identify review spam of products.
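To make the sparsity drawback of n-gram features concrete, the following minimal Python sketch (illustrative only, not part of the surveyed work) extracts unigram and bigram bag-of-words features with scikit-learn; the two sample reviews are invented.

```python
# Illustrative sketch: unigram + bigram bag-of-words features with scikit-learn.
# The sample reviews below are invented for demonstration.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "Fantastic product, prints wirelessly from my phone",
    "Horrible copy quality and the display has no LED light",
]

# ngram_range=(1, 2) produces both unigrams and bigrams; even for two short
# reviews the vocabulary grows quickly, which is the sparsity issue noted above
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

print(X.shape)                                  # (2, number of n-gram features)
print(vectorizer.get_feature_names_out()[:10])  # a few of the extracted n-grams
```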

Petrescu et al. [31] presented an analysis of the impact of incentivized review campaigns; their findings show that the positive reviews posted by users are highly biased due to these campaigns. Michael Luca [18] showed how online reviews affect the demand for restaurants, using a dataset from Yelp.com together with restaurant data from the Washington State Department of Revenue. He drew three major conclusions: first, a one-star increase in Yelp rating leads to a 5% to 9% increase in revenue; second, this effect holds only for independent restaurants, not for chain restaurants; third, chain restaurants have lost market share as Yelp penetration has increased. Mesleh et al. [21] implemented a text classifier based on Support Vector Machines (SVMs) that uses the Chi-square method to select relevant feature vectors in the pre-processing phase. Catal and Guldan [7] presented a high-performance model that uses Multiple Classifier Problems (MCPs); to detect fraudulent negative reviews, this model combines the strengths of five high-performing individual classifiers and applies majority-vote aggregation. Xie et al. [53] proposed a method for detecting spam using strongly correlated temporal patterns. Xu et al. [54] presented a technique called the Sparse Aspect Coding Model (SACM) to handle the problem of latent aspect mining; in SACM, latent variables for user intrinsic aspect-interest and item intrinsic aspect-quality are used to model the observed review text and the overall rating. Hu et al. [10] proposed a novel multi-text summarization technique for hotel reviews that helps to identify the top-k most informative sentences. Sasaki and Shinnou [36] introduced a spam detection technique that uses a spherical k-Means algorithm to automatically identify disjoint clusters; it is used for text clustering and is based on the vector space model. Yang et al. [56] used a two-layered spam detection flow, showing the trade-off between accuracy and efficiency. McCord and Chuah [20] used four traditional classifiers, Random Forest, Naïve Bayes, Support Vector Machine, and k-Nearest Neighbor (KNN), to identify spam on Twitter and showed that the Random Forest classifier performed best, with 95.7% precision and a 95.7% F-measure. Mateen et al. [19] proposed a combined method for detecting spammers on the Twitter platform that uses content-based as well as graph-based features.

Singh and Singh [43] merged the efficiency of the correlation-based feature selection technique (CFS) and Particle Swarm Optimization (PSO) to detect web spam. Singh and Batra [41] introduced an ensemble-learning-based technique for spam detection that uses a quotient filter for efficient searching and locality-sensitive hashing for similarity searching. Bindu et al. [5] used a Twitter dataset for detecting spam with unsupervised techniques implemented on graph-, URL-, and community-based features. Li et al. [16] classified web spam with the help of a de-noising auto-encoder and synthetic minority over-sampling techniques in a Deep Belief Network (DBN). Liu et al. [17] identified spam tweets by introducing a hybrid method based on fuzzy redistribution and asymmetric sampling. Miller et al. [22] used stream clustering methods such as StreamKM++ and DenStream to classify spam tweets. Narayan et al. [27] suggested a semi-supervised PU-learning-based technique to detect spam reviews. Inuwa-Dutse et al. [12] discovered accounts that post spam on Twitter by using account-information features. Singh et al. [42] gave a model for detecting and blocking counterfeit reviews and spam. Wu et al. [50] presented a hybrid method for spam filtering that uses rule-based processing and back-propagation neural networks. Asghar et al. [3] achieved an accuracy of 96% by adding a revised scheme that assigns weights to features; the experiment was based on a rule-based feature-weighting scheme that tags each review sentence as spam or ham. Wu et al. [52] proposed a hybrid PU-learning-based Spammer Detection (hPSD) model to identify hidden spam users based on product reviews; this model detects multi-type spammers from only a small portion of positive samples, which suits real-world application scenarios. A hybrid feature selection method was proposed by Rajamohana et al. [33], which uses cuckoo search along with harmony search to increase the processing rate and prediction accuracy; a Naive Bayes classifier was employed to classify the reviews into spam and non-spam.

To classify spam, some metaheuristic algorithms have also been employed recently. Salehi et al. [34] detected email spam using genetic algorithms. Idris et al. detected email spam using two algorithms, differential evolution [44] and negative selection. Idris et al. [11] also used a combined method based on particle swarm optimization (PSO) [14] and negative selection for detecting email spam. Shekhawat et al. [39] proposed a model that uses Spider Monkey Optimization (SMO) with k-Means to classify Twitter sentiments. Rajamohana et al. [32] proposed a model that uses the Adaptive Binary Flower Pollination Algorithm (BFPA), a global optimization technique, for feature extraction and a Naive Bayes (NB) classifier as the objective function to increase classification accuracy. Pereira et al. [30] proposed a hybrid technique combining the efficiency of local search and evolutionary algorithms to obtain a good search mechanism and to maintain diversity in the population. As metaheuristic strategies are stochastic in nature and often get trapped in local optima, an effective solution is required that can identify the optimal cluster heads and properly explore the search region without stagnating at a local optimum.

Therefore, in this paper, a novel clustering-based metaheuristic method is proposed to classify spam and non-spam reviews. The rest of the paper is organized as follows: the Grey Wolf Optimizer (GWO) and k-Means are described in Section 2. The proposed method, Grey Wolf Optimizer using k-Means (GWOK), is described in Section 3. Section 4 illustrates the experimental results, and Section 5 concludes the work.

2 Preliminaries

2.1 k-Means clustering

To organize the items of a set into disjoint groups or clusters, partitioning is the most fundamental method of cluster analysis. One of the most well-known partitioning methods is k-Means [9]. Given a dataset D containing n items in Euclidean space, the k-Means algorithm distributes the n items among k clusters C1,…,Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j. To estimate the quality of the partitioning, an objective function is used that aims for higher intracluster similarity and lower intercluster similarity.

To represent a cluster Ci, k-Means uses the center of that cluster, known as its centroid ci. The difference between an item q ∈ Ci and the centroid ci is given by the Euclidean distance dist(q, ci) between the two points. To measure the quality of a clustering, the sum of squared errors between all items in each cluster and their centroid is calculated as follows:

$$ SE={\sum}^{k}_{i=1}{\sum}_{q\in C_{i}}dist(q, c_{i})^{2} $$
(1)

where SE represents the squared error over all objects in the given dataset, q represents a point in the dataset, and ci represents the centroid of cluster Ci.

The pseudo-code of k-Means is given in Algorithm 1:

[Algorithm 1: pseudo-code of k-Means]
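Since Algorithm 1 appears here only as a figure, the following minimal Python sketch (our own illustration, not the authors' code) shows the k-Means loop and the squared-error objective of (1); the random initialization and convergence test are simplifying assumptions.

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Minimal k-Means sketch: assign each item to its nearest centroid and
    recompute the centroids until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]  # random initial centroids
    for _ in range(max_iter):
        # Euclidean distance of every item q to every centroid c_i
        dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                          # nearest-centroid assignment
        new_centroids = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):             # convergence test
            break
        centroids = new_centroids
    se = ((D - centroids[labels]) ** 2).sum()                 # squared error of Eq. (1)
    return labels, centroids, se

# usage with a small random dataset (placeholder data)
labels, centroids, se = k_means(np.random.default_rng(1).random((200, 5)), k=2)
```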

2.2 Grey wolf optimizer (GWO)

The GWO algorithm, developed by Mirjalili [23], is a population-based metaheuristic algorithm designed to explore the search space and construct a heuristic (partial search) to find an optimal solution to an optimization problem. Algorithms that combine randomization with local search capability are known as metaheuristic algorithms [55]. Metaheuristic algorithms can handle problems with large populations relatively well [46]. Unlike exact optimization algorithms, metaheuristics do not guarantee an optimal solution, but they can compute good-quality sub-optimal solutions in feasible execution time [28]. GWO is one such metaheuristic algorithm; it mimics the attacking behavior and leadership hierarchy of grey wolves. In GWO, to model the leadership hierarchy, the pack of wolves is divided into four main classes: alpha (α), beta (β), delta (δ), and omega (ω). The alpha wolves are the leaders and are considered the best; they are responsible for making decisions about hunting, sleeping time, waking time, and so on. The beta wolves are second in the hierarchy; they are auxiliary wolves that help the alphas in decision-making and other activities. The delta wolves rank third; they submit to the alphas and betas but dominate the omegas. The omega wolves are the lowest-ranked grey wolves; they are considered the weakest and must submit to all the dominant wolves.

The working of GWO can be described in the following four steps:

  1. Encircling the prey

  2. Hunting

  3. Attacking the prey

  4. Searching for the prey

2.2.1 Mathematical model

In this subsection, the mathematical models for encircling the prey, hunting, attacking the prey, and searching for the prey are illustrated.

2.2.2 Encircling the prey

At first, the grey wolves encircle the prey, which can be expressed mathematically as:

$$ D =\mid C.G_{p}(t)-{G}(t) \mid $$
(3)
$$ G(t+1)= G_{p}(t)- A*D $$
(4)
$$ A=2 * a * r_{1} - a $$
(5)
$$ a=2-2(\frac{t}{I}) $$
(6)
$$ C=2* r_{2} $$
(7)

where t indicates the current iteration, and A and C are coefficient vectors. Gp and G indicate the position of the prey and of the grey wolf, respectively. a is a vector that linearly decreases from 2 to 0 over the course of the iterations, as given by (6), where I indicates the maximum number of iterations. r1 and r2 are random vectors in the range [0,1]. Initially, vector A has its maximum value, and it decreases gradually as the iterations proceed.

2.2.3 Hunting

To mathematically simulate the hunting behavior of grey wolves, α, β, and δ are considered the best solutions, which possess the best information about where the prey is. Based on this information, the other grey wolves update their positions using the following equations [23]:

$$ G_{1}= G_{\alpha}(t)-A_{1}*D_{\alpha} $$
(8)
$$ G_{2} = G_{\beta}(t)-A_{2}*D_{\beta} $$
(9)
$$ G_{3}= G_{\delta}(t)-A_{3}*D_{\delta} $$
(10)

where,

$$ \begin{array}{@{}rcl@{}} D_{\alpha} &=& \mid C_{1} \cdot G_{\alpha}(t)-G \mid \\ D_{\beta} &=& \mid C_{2} \cdot G_{\beta}(t)-G \mid \\ D_{\delta} &=& \mid C_{3} \cdot G_{\delta}(t)-G \mid \end{array} $$
$$ G(t+1)=\frac{ G_{1} + G_{2} + G_{3}}{3} $$
(11)

2.2.4 Search prey (Exploration)

The exploitation and exploration behavior of the GWO algorithm depends on the parameters A and C. A lies randomly in the range [-a, a]. A wolf exhibits exploration behavior when ∣A∣ > 1 and C > 1.

2.2.5 Attack on prey (Exploitation)

The attack on the prey corresponds to exploitation, which occurs when ∣A∣ < 1 and C < 1.

The steps of the GWO algorithm are described in Algorithm 2.

[Algorithm 2: pseudo-code of GWO]
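Algorithm 2 likewise appears only as a figure, so the sketch below shows one way the update rules (3)-(11) can be implemented for a generic minimization problem; the sphere objective, the bounds, and the parameter defaults are placeholders, not values from the paper.

```python
import numpy as np

def gwo(objective, dim, n_wolves=30, max_iter=500, lb=-10.0, ub=10.0, seed=0):
    """Sketch of the Grey Wolf Optimizer following the update rules (3)-(11)."""
    rng = np.random.default_rng(seed)
    G = rng.uniform(lb, ub, (n_wolves, dim))              # wolf positions
    best_pos, best_val = None, np.inf
    for t in range(max_iter):
        fitness = np.array([objective(g) for g in G])
        order = fitness.argsort()                         # alpha, beta, delta = three best wolves
        g_alpha, g_beta, g_delta = G[order[0]], G[order[1]], G[order[2]]
        if fitness[order[0]] < best_val:                  # remember the best wolf seen so far
            best_pos, best_val = G[order[0]].copy(), fitness[order[0]]
        a = 2 - 2 * t / max_iter                          # Eq. (6): a decreases linearly from 2 to 0
        for i in range(n_wolves):
            G_new = np.zeros(dim)
            for leader in (g_alpha, g_beta, g_delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a                        # Eq. (5)
                C = 2 * r2                                # Eq. (7)
                D = np.abs(C * leader - G[i])             # distances as in Eq. (3)
                G_new += leader - A * D                   # Eqs. (8)-(10)
            G[i] = np.clip(G_new / 3, lb, ub)             # Eq. (11), kept inside the bounds
    return best_pos, best_val

# usage: minimize a placeholder sphere function in 5 dimensions
best_pos, best_val = gwo(lambda x: float(np.sum(x ** 2)), dim=5)
```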

3 Proposed clustering method for spam review detection

In this paper, we introduce a hybrid Grey Wolf Optimizer using k-Means (GWOK) clustering mechanism to identify spam reviews. GWOK identifies spam reviews through the following phases:

  1. Preprocessing of data

  2. Feature extraction phase

  3. Feature selection using Chi-square

  4. GWO clustering method

  5. Testing the proposed clustering method

Figure 1 illustrates the flowchart of the proposed method.

Fig. 1: Flowchart of the proposed clustering method

3.1 Preprocessing of data

The online reviews collected from social media contain noise in the form of unwanted and vague words, stop words, URLs, etc., which are not desired for feature extraction. The removal of such words is executed in two phases using the Python Natural Language Toolkit (NLTK) [6]:

3.1.1 Phase 1

This phase eliminates noise and uncertain words through the following steps:

  1. Convert all reviews to lowercase.

  2. Remove special symbols such as ®, @, #, etc. from the online reviews.

  3. Remove stop words that do not carry any relevant information, such as with, at, of, to, etc., with the help of the NLTK library.

  4. Replace multiple white spaces with a single white space.

  5. Remove all numbers from the reviews.

  6. Remove certain punctuation marks from the reviews, such as hyphens, braces, slashes, and quotation marks.

3.1.2 Phase 2

In this phase, the paragraphs are divided into sentences and words by applying lexical analysis, or tokenization. After tokenization, the words are reduced to their root form using lemmatization, e.g., “changing” is altered to “change”.
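A minimal sketch of the two preprocessing phases using NLTK is given below. The symbol set, the regular expressions, and the verb-oriented lemmatization are our assumptions, since the paper does not enumerate them, and stop-word removal is folded into Phase 2 here for brevity.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# one-time downloads of the NLTK resources used below
nltk.download("stopwords"); nltk.download("punkt"); nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(review: str) -> list:
    # Phase 1: lowercase, strip symbols/numbers/punctuation, collapse white space
    text = review.lower()
    text = re.sub(r"[®@#\-/(){}\[\]\"']", " ", text)   # special symbols and listed punctuation
    text = re.sub(r"\d+", " ", text)                   # numbers
    text = re.sub(r"\s+", " ", text).strip()           # multiple white spaces
    # Phase 2: tokenize and lemmatize; stop words and non-alphabetic tokens are dropped
    tokens = word_tokenize(text)
    # pos="v" is a simplification so that, e.g., "changing" becomes "change";
    # a fuller pipeline would lemmatize according to each token's POS tag
    return [lemmatizer.lemmatize(tok, pos="v")
            for tok in tokens if tok.isalpha() and tok not in stop_words]

print(preprocess("Changing the cartridges of this HP printer is a breeze!"))
```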

3.2 Feature extraction phase

Once the preprocessing phase is complete, relevant features are extracted with the help of Linguistic Inquiry and Word Count (LIWC 2015) [29]. The LIWC program has a main text-analysis module along with a group of built-in dictionaries. LIWC reads the given text and counts the percentage of words reflecting various emotions, thinking styles, social concerns, and even parts of speech. LIWC provides 93 features in total.

3.3 Feature selection using chi-square

In this phase, the relevant attributes are selected from the 93 features extracted by LIWC. Feature selection plays an important role for the following reasons:

  1. Eliminates repetitive data.

  2. Picks relevant attributes/variables.

  3. Minimizes the occurrence of overfitting.

  4. Minimizes the training time.

The 93 features extracted using the LIWC tool may include irrelevant ones and may cause overfitting [48], and as the number of features increases, the training time [37] also increases [47]. To eliminate the redundant and irrelevant features, the Chi-square feature selection method used for text classification is implemented [57]. It is one of the simplest tools for univariate feature selection, performing a univariate statistical test for classification; feature interactions are neglected in this process. It is mainly used for categorical features and is extensively applied to text data. The Chi-square score is calculated between each feature and the target, and the features with the highest Chi-square values are selected.

The Chi-square score is given by the following equation:

$$ Ch=\frac{(\text{Observed frequency} - \text{Expected frequency})^{2}}{\text{Expected frequency}} $$
(12)

where Observed frequency is the total number of observations of a class, and Expected frequency is the number of observations of that class expected if there were no relationship between the feature and the target. The relevant features are selected using the Chi-square method for all three datasets.
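As an illustration of this step, the sketch below applies Chi-square feature selection with scikit-learn; the feature matrix, the labels, and the number of retained features (k = 20) are placeholders rather than the actual values, which are reported in Table 3.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# X: LIWC feature matrix (rows = reviews, columns = the 93 LIWC features),
# y: spam / non-spam labels. Both are random placeholders here.
rng = np.random.default_rng(0)
X = rng.random((500, 93))            # LIWC outputs are non-negative percentages
y = rng.integers(0, 2, size=500)

# score every feature against the target and keep the k highest-scoring ones;
# k=20 is an arbitrary placeholder, the paper reports its choices in Table 3
selector = SelectKBest(score_func=chi2, k=20)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                      # (500, 20)
print(selector.get_support(indices=True))    # indices of the retained LIWC features
```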

3.4 k-Means GWO (the proposed clustering mechanism)

To classify the reviews into spam and non-spam, the proposed method finds optimal cluster heads with the help of the Grey Wolf Optimizer (GWO). Spam reviews are classified using the following steps:

  1. Within the search space, randomly initialize the population of the Grey Wolf Optimizer.

  2. The position of each search agent represents the cluster-head coordinates of the spam and non-spam reviews.

  3. For each search agent, calculate the fitness using accuracy as the objective function.

  4. Use the Grey Wolf Optimizer to optimize the clusters.

Algorithm 3 shows the steps of the proposed mechanism.

[Algorithm 3: the proposed GWOK clustering mechanism]
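Since Algorithm 3 is likewise shown only as a figure, the sketch below indicates one possible way to wire the listed steps together: each wolf encodes the coordinates of the two cluster heads (spam and non-spam), and classification accuracy serves as the fitness. It reuses the gwo sketch from Section 2.2 and is our interpretation, not the authors' implementation.

```python
import numpy as np

def cluster_accuracy(flat_heads, X, y, n_clusters=2):
    """Fitness of a wolf: decode its position into cluster-head coordinates,
    assign every review to its nearest head, and score the labelling."""
    heads = flat_heads.reshape(n_clusters, X.shape[1])
    labels = np.linalg.norm(X[:, None, :] - heads[None, :, :], axis=2).argmin(axis=1)
    # a cluster index carries no inherent meaning, so also score the flipped labelling
    return max((labels == y).mean(), ((1 - labels) == y).mean())

def gwok(X, y, **gwo_kwargs):
    """GWO maximizes accuracy, i.e. minimizes its negative. The dimension is
    twice the number of features (one head for spam, one for non-spam), matching
    Section 4. Assumes the gwo() sketch from Section 2.2 is in scope."""
    objective = lambda pos: -cluster_accuracy(pos, X, y)
    best_pos, best_neg_acc = gwo(objective, dim=2 * X.shape[1],
                                 lb=float(X.min()), ub=float(X.max()), **gwo_kwargs)
    return best_pos.reshape(2, X.shape[1]), -best_neg_acc

# usage sketch (X_selected and y come from the Chi-square step above):
# heads, acc = gwok(X_selected, y, n_wolves=50, max_iter=1000)
```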

4 Experimental results

The effectiveness of the proposed Grey Wolf Optimizer using k-Means (GWOK) clustering method is tested on the Synthetic Spam Review, Movie Review, and Yelp Review datasets. We used MATLAB 2016a to simulate all the experiments. The experiments were performed on a system with a 20.21 GHz Intel(R) Core(TM) i7 processor and 16 GB of RAM.

4.1 Performance analysis of proposed k-Means grey wolf optimizer (GWOK)

When training a model using k-Means, as the number of training samples increases, the computational time and space also increase, because the mean is recomputed in each iteration. In the proposed GWOK, to reduce this computational cost, we randomly shift a cluster head (CH) to obtain a new combination of cluster heads, which helps to find the optimal set of cluster heads within a given number of iterations or a given time.

The proposed GWOK is compared with existing nature-inspired algorithms, namely the Cuckoo Search (CS) algorithm, Differential Evolution (DE), Particle Swarm Optimization (PSO), and the Genetic Algorithm (GA), to validate the accuracy and computational time of the proposed method. The population size (N) is 50 and the maximum number of iterations (MaxIter) is 1000 for all the algorithms. The average accuracy and computational time over 30 runs, with dimension equal to twice the number of features (f), on all three datasets are presented in Table 1. It can be seen that the accuracy of the Grey Wolf Optimizer using k-Means is better than that of the other methods.

Table 1 Parameters for k-Means GWO

4.2 Experimental analysis of proposed k-Means GWO (GWOK)

The Synthetic Spam Reviews, Yelp Hotels & Restaurant Reviews, and Movie Reviews datasets are used to test the proposed method. A brief description of all three datasets is given in Table 2.

Table 2 Dataset Description

4.2.1 Synthetic spam reviews dataset

The synthetic spam dataset was unlabelled [49] and was retrieved from the Database and Information System Laboratory, University of Illinois (TripAdvisor dataset). The spamming method known as synthetic review spamming has been used to generate the spam reviews [45]. This method generated 479 reviews, of which 316 were spam and 163 were non-spam.

4.2.2 Movie review

The movie review dataset is generated from the popular site IMDB. It has a total of 8544 reviews, of which 3998 are spam and 4546 are non-spam.

4.2.3 Yelp hotels & restaurant reviews dataset

The Yelp hotels & restaurants review dataset is extracted from Yelp.com and consists of reviews of 85 hotels and 130 restaurants in the Chicago area [25, 26]. Reviews of both popular and unpopular hotels and restaurants are included in the dataset. There are a total of 4952 reviews, of which 3709 are spam and 1243 are non-spam.

These datasets are noisy; therefore, we preprocess them to remove the noise as described in Section 3.1. After preprocessing, 93 features are extracted from these datasets using the LIWC 2015 tool. As all the features may not be relevant, the Chi-square feature selection method described in Section 3.3 is applied to these 93 features to obtain the best ones. The optimum number of features selected from all three datasets is given in Table 3.

Table 3 Feature Selected

To measure the efficacy of the proposed clustering method, we evaluate the classification accuracy. The precision and recall values are also evaluated to check the performance of GWOK and to compare it with other existing techniques, as classification accuracy alone can be deceptive when the number of instances in each class is not equal.

4.3 Results

To compute the values of accuracy, precision, and recall, a confusion matrix of size n × n is generated, where n is the number of classes and Cji represents the number of patterns of class j predicted as class i.

The values used in Table 4 have the following significance:

  • TP represents correctly predicted spam reviews.

  • TN represents correctly predicted ham reviews.

  • FP represents ham reviews incorrectly predicted as spam.

  • FN represents spam reviews incorrectly predicted as ham.

Table 4 Confusion Matrix

Table 5 illustrates the value of precision and recall for the three datasets over original and optimal features.

Table 5 Comparative analysis of measure of Mean values of precision and recall of proposed method with other methods for original and optimum features dataset

Using the confusion matrix, the values of precision, recall, and accuracy can be calculated by (13), (14), and (15), respectively.

$$ Precision= \frac{TP}{TP+FP} $$
(13)
$$ Recall= \frac{TP}{TP+FN} $$
(14)
$$ Accuracy= \frac{TP+TN}{TP+TN+FP+FN} $$
(15)
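For completeness, a small sketch computing (13)-(15) from the entries of the confusion matrix is given below; the TP/TN/FP/FN counts in the example are invented.

```python
def classification_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)                  # Eq. (13)
    recall = tp / (tp + fn)                     # Eq. (14)
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # Eq. (15)
    return precision, recall, accuracy

# invented example: 90 spam reviews correctly flagged, 80 ham reviews correctly
# kept, 10 ham reviews wrongly flagged (FP), and 20 spam reviews missed (FN)
print(classification_metrics(tp=90, tn=80, fp=10, fn=20))
```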

The mean and standard deviation values of accuracy on all three datasets, for the original features and for the optimum features obtained after feature selection, are reported in Tables 6 and 7 and compared with other existing nature-inspired algorithms. From Table 7, we can see that the proposed Grey Wolf Optimizer clustering method outperforms the other existing methods in terms of accuracy and computational time.

Table 6 Comparative analysis to measure mean and standard deviation values of Accuracy of the proposed method with other methods over datasets with original features
Table 7 Comparative analysis to measure mean and standard deviation values of Accuracy of the proposed method with other methods over datasets with optimum features

The boxplots of accuracy and computational time (in seconds) for all three datasets over the optimal features are depicted in Figs. 2 and 3, respectively. From the box plots, we can see that the accuracy of the proposed Grey Wolf Optimizer clustering method is higher than that of the other metaheuristic algorithms. The proposed method also takes less execution time than the other methods, demonstrating the efficacy of our algorithm.

Fig. 2: Box plots of accuracy of the proposed Grey-Wolf-based clustering method and other nature-inspired algorithms on (a) the Synthetic Spam Review dataset, (b) the Movie Review dataset, and (c) the Yelp Hotels & Restaurant Review dataset

Fig. 3: Box plots of computational time of the proposed Grey-Wolf-based clustering method and other nature-inspired algorithms on (a) the Synthetic Spam Review dataset, (b) the Movie Review dataset, and (c) the Yelp Hotels & Restaurant Review dataset

Thus, the experiments demonstrate that GWOK is an efficient and robust algorithm for the spam review detection problem and can be considered a prominent solution to it.

5 Conclusion

In this paper, a novel clustering approach, Grey Wolf Optimizer with k-Means, is proposed for the classification of spam reviews. The proposed method is tested on three different datasets, namely synthetic spam reviews, movie reviews, and Yelp hotels & restaurant reviews, and is compared with k-Means, PSO, DE, GA, and CS. The Chi-square feature selection method is used for finding the optimum features. The experimental results demonstrate that the proposed clustering method for detecting spam outperforms the existing nature-inspired methods such as PSO, DE, GA, and CS, and the boxplots show the consistency of the proposed method. In the future, new optimization strategies or a modified Grey Wolf Optimizer-based hybrid approach can be employed in combination with other feature selection methods, such as wrapper-based methods or Pearson's correlation, to attain higher accuracy on the given datasets.