Spam review detection using spiral cuckoo search clustering method

Pandey, Avinash Chandra; Rajpoot, Dharmveer Singh

doi:10.1007/s12065-019-00204-x

Research Paper
Published: 05 February 2019

Volume 12, pages 147–164, (2019)

Evolutionary Intelligence


Abstract

Online reviews now play an important role in customers' decisions. From buying a shirt on an e-commerce site to dining in a restaurant, online reviews have become a basis of selection. Because people are often too busy to examine the intrinsic details of products and services, their dependency on online reviews has grown. Exploiting this reliance, some people and organizations deliberately generate spam reviews to promote or demote the reputation of a person, product, or organization. It is very difficult to tell by eye whether a review is spam or ham, and it is impractical to classify all reviews manually. Therefore, a spiral cuckoo search based clustering method has been introduced to discover spam reviews. The proposed method combines the strengths of cuckoo search and the Fermat spiral to resolve the convergence issue of the cuckoo search method. Its efficiency has been tested on four spam review datasets and one Twitter spammer dataset. To validate its efficacy, the proposed clustering method is compared with six clustering methods, namely particle swarm optimization, differential evolution, genetic algorithm, cuckoo search, K-means, and improved cuckoo search. The experimental results and statistical analysis validate that the proposed method outperforms the existing methods.


1 Introduction

These days, the Internet and e-commerce sectors are growing incessantly. Owing to the rapid growth of these sectors, the number of online reviews is increasing, and so is reliance on them. Some instances where we rely on online reviews are:

1. When buying something from an online retail website, we look at the product reviews and then at the seller reviews.
2. When buying business software, reviews on different websites are inspected.
3. Online reviews are also consulted to decide whether to watch a movie.

Online reviews have become an essential part of our lives. According to an experiment conducted by Lackermair et al. [1] on 104 German online shoppers, 74.04% of the participants rated online reviews as “important or very important” and 85.57% of the participants claimed that they read reviews “often or very often” before purchasing a product. Presently, e-commerce websites such as Amazon and Flipkart provide an option for writing a review of a particular product. Reviewers can write whatever they feel about the product, which may influence a buyer's decision. Hence, these reviews may either improve or degrade a product's reputation and sales, and spam review detection becomes a necessity.

Dixit et al. [2] categorized spam reviews into three classes: untruthful reviews, reviews on brands, and non-reviews. Untruthful reviews are completely fake, while reviews on brands target a brand or a seller but do not focus on the product. Non-reviews contain unrelated text or advertisements. Untruthful reviews are the hardest to detect because of their structure. An example of a genuine review (Review 1) and of an untruthful review (Review 2) is given below.

Review 1: Great hotel in heart of Chicago for business or pleasure. Rooms are recently upgraded and very modern and large. Flat screen TVs, marble baths, all rooms are suites, great desk, kitchenette, comfortable bed, free wireless Internet... everything you could ask for. Location is easy walk to Magnificent Mile and lots of great restaurants. Staff is friendly and helpful. Short cab ride to Loop.

Review 2: What a terrible experience my family and I had at Affinia Chicago! First of all, we reserved a room with 2 queen size beds and received only 1 King size bed with a cot. When we got to the room, we found hair balls on the floor as if a cat had previously stayed there. What an absolute terror Affinia was and I will never be going back!

It is very difficult for a user to tell that Review 1 is a real review whereas Review 2 is not. Therefore, many baseline methods, such as bag of words and n-grams, have been proposed to identify fake/spam reviews. Bag of words-based spam detection methods use individual words as features for spam review classification. Since these methods generally ignore the semantics of words, they are not very effective in review classification. Some researchers have used lexical and syntactical features for spam detection [3,4,5], while Ott et al. [6, 7] and Lin et al. [8] have used unigram-based techniques for fake review detection.

Furthermore, supervised, unsupervised, and semi-supervised machine learning techniques have also been used for spam review detection. Cheng et al. [9] presented a case study and compared various methods used for detecting fake reviews. Munzel [10] presented various contextual cues that help Internet users distinguish fake from genuine reviews. Narayan et al. [11] introduced a spam review detection method based on opinion mining and a supervised learning approach. Petrescu et al. [12] studied the evolution and outcomes of incentivized review campaigns and found that these campaigns influence users to post positive reviews of the product. Luca and Zervas [13] used two complementary approaches on Yelp datasets and found that only 16% of restaurant reviews on Yelp are filtered. Gieseke et al. [14] used an efficient recurrent local search policy for unsupervised and semi-supervised models to handle binary classification problems. Further, Behdad et al. [15] investigated the fraud detection problem and how machine learning models can be applied to it. Mani et al. [16] combined multiple classifiers to identify spam reviews. Ghai et al. [17] introduced a spam detection method based on rating variation score, caps count score, and reviewer's count score. Heydari et al. [18] examined suspicious time intervals obtained from the time series of reviews to overcome the rating variation of reviewers. Liu and Pang [19] introduced an aspect-based review deviation unsupervised framework for detecting spamicity. Most spam detection models use hand-crafted features, which cannot reveal the semantics of reviews. Therefore, a neural network based model has been proposed to learn the semantic representation of reviews [20]. Hu et al. [21] introduced a multi-text summarization approach that uses k-medoids clustering to discover the top-k most significant reviews. Hai et al. [22] used a logistic regression-based multi-task learning method (MTL-LR) followed by a semi-supervised multi-task Laplacian regularized logistic regression method to enhance the performance of the spam detection model.

Moreover, Mateen et al. [23] introduced a hybrid method that uses content-based and graph-based features to identify spam on the Twitter platform. Vishwarupe et al. [24] used novel features to enhance the classification model for spammer detection in a Twitter dataset. Sedhai and Sun [25] proposed a semi-supervised spam detection (S3D) scheme for Twitter datasets. To study the class imbalance issue in Twitter, Li and Liu [20] surveyed some popular methods and identified the most effective one. Chen et al. [26] performed a deep analysis of the statistical features of tweets to identify spam tweets. Wu et al. [27] surveyed and compared different methods used for spammer detection in tweets. Singh and Singh combined the strength of particle swarm optimization (PSO) and a correlation based feature selection technique (CFS) [28] for web spam detection. Li et al. [29] used synthetic minority over-sampling and a de-noising auto-encoder in deep belief networks for the classification of web spam. Singh and Batra [30] proposed an ensemble based spam detection method that uses a quotient filter and locality sensitive hashing for efficient searching and similarity searching, respectively. Wei and Singh [31] discussed current challenges and some future directions for effective surveillance of Twitter data. Bindu et al. [32] proposed an unsupervised method that uses community-based features together with graph and URL characteristics of user accounts for spam detection on Twitter. Liu et al. [33] introduced a hybrid method based on fuzzy redistribution and asymmetric sampling to detect spammer tweets. Inuwa-Dutse et al. [34] used account information features to discover spam-posting accounts on Twitter. Miller et al. [35] used two stream clustering methods, StreamKM++ and DenStream, to identify spammer tweets. Singh et al. [36] designed a model to detect and block fake reviews and spam. Narayan et al. [37] introduced a semi-supervised PU-learning-based method for review spam detection.

Recently, metaheuristic algorithms have also been used for spam classification. Salehi et al. [38] introduced a genetic algorithm based approach for email spam detection. Idris et al. [39] used differential evolution [40] and a negative selection algorithm to detect spam email. A combined approach based on negative selection and particle swarm optimization (PSO) [41] has been used for email spam detection [42]; it sometimes gets trapped in a local solution and takes more time to stabilize. Metaheuristic algorithms generally get trapped in local optima. Therefore, to maintain diversity in the population and guide the search process, a hybrid method based on the strengths of evolutionary algorithms and local search methods has been introduced [43].

In this paper, a novel metaheuristic clustering method (spiral cuckoo search-based clustering) is proposed for spam detection. The contribution of this paper is twofold.

First, a novel metaheuristic method based on cuckoo search and the Fermat spiral is proposed.

Second, the proposed method is used to solve the spam review detection problem.

In CS, new solutions are generated by Lévy flight; these solutions may not be diverse, and the search may get trapped in a local solution. Therefore, to balance exploration and exploitation, the spiral cuckoo search method has been proposed. The proposed method uses the Fermat spiral and Lévy flight to generate new solutions. The proposed spiral CS method has been validated on 15 standard benchmark problems, including both unimodal and multimodal problems [44]. Furthermore, a spiral cuckoo search-based clustering method has been introduced for spammer detection. To validate its effectiveness, the proposed clustering method is tested on five spammer datasets and compared with particle swarm optimization (PSO), differential evolution (DE), genetic algorithm (GA) [45], K-means [46], cuckoo search (CS) [47], and improved cuckoo search (ICS) [48].

The rest of the paper is structured as follows. The cuckoo search method and the Fermat spiral are reviewed in Sect. 2. Section 3 discusses the spiral cuckoo search method. Section 4 describes the proposed spam detection method. Section 5 discusses the experimental results, and the conclusion is presented in Sect. 6.

2 Preliminaries

2.1 Cuckoo search

Cuckoo search (CS) is a nature-inspired optimization method based on the brood-parasitic behavior of some cuckoo species. Owing to their obligate brood parasitism, cuckoos use a suitable host to hatch their eggs [47,48,49]. Ani and Guira are cuckoo species that lay their eggs in communal nests [51, 52]. The timing of egg laying in these species is also remarkable: they select a nest in which the host bird has just laid its own eggs. Usually, cuckoo eggs hatch earlier than the host's eggs [50, 53]. Therefore, cuckoo chicks are born before the host's chicks and may throw out the host's eggs, which increases the cuckoo chicks' share of food. The CS method is based on three principles: (1) each cuckoo lays one egg at a time in an arbitrarily selected nest, (2) nests with top-quality eggs carry over to the next iterations, and (3) the total number of host nests is fixed, and $P_a \in [0, 1]$ is the probability that a host discovers an egg laid by a cuckoo. If the host recognizes the cuckoo's egg, it either removes the egg from the nest or abandons the nest and builds another one. In short, using this principle, poor-quality eggs (solutions) are replaced by new eggs (solutions).

The complete steps of the CS method are given in Algorithm 1 [53]. A new solution $x_i^{(r+1)}$ for cuckoo i is generated using Eq. (1), which relies on the present state and the transition probability.

$$\begin{aligned} x_{i}^{(r+1)}= x_{i}^{(r)} + \alpha \otimes Levy(\lambda ) \end{aligned}$$

(1)

Here, $\alpha$ scales the step size produced by the Lévy flight; in most cases $\alpha$ is set to 1. The $\otimes$ in Eq. (1) represents entry-wise multiplication. In CS, Lévy flight is used to explore the complete search space, as its step size is much longer in the long run, and a biased random walk is used for exploitation. For exploitation, a fraction $P_a$ of the worst nests is abandoned and new ones are constructed.
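The update in Eq. (1) can be illustrated with a minimal Python sketch. The text does not say how the Lévy step is sampled, so Mantegna's algorithm with exponent `beta = 1.5` is used here as an assumption; the function names `levy_step` and `cuckoo_update` are hypothetical.

```python
import math
import random


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step length (assumption: Mantegna's algorithm)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def cuckoo_update(x, alpha=1.0, beta=1.5, rng=random):
    """Eq. (1): add an independent, alpha-scaled Levy step to every coordinate."""
    return [xi + alpha * levy_step(beta, rng) for xi in x]


if __name__ == "__main__":
    rng = random.Random(0)
    print(cuckoo_update([0.0, 1.0, 2.0], alpha=1.0, rng=rng))
```

Most sampled steps are small, with occasional long jumps, which is the mix of local moves and far-reaching exploration described above.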

2.2 Fermat’s spiral

The American Heritage Dictionary defines a spiral as “a curve on a plane that winds around a fixed centre point at a continuously increasing or decreasing distance from the point”. A spiral follows a winding path, generally in an upward direction, and displays a twisted form or shape. In mathematics, spirals are categorized into two groups, two-dimensional and three-dimensional spirals, based on their movement around the pivot. Two-dimensional spirals can be easily described using polar coordinates. The Archimedean spiral, Fermat's spiral, and the Cornu spiral are some important two-dimensional spirals. A three-dimensional spiral is a two-dimensional spiral with an additional height variable h.

The Fermat spiral was discovered by the mathematician Pierre de Fermat in 1636. It is based on a parabolic formula in polar coordinates, as given in Eq. (2); hence, it is also known as the parabolic spiral.

$$\begin{aligned} \displaystyle r=\theta ^{1/2}, \end{aligned}$$

(2)

where radius r is a monotonic continuous function of angle $\theta$.

The Fermat spiral is the special case $m=2$ of the general Archimedean spiral $r = a\theta ^{1/m}$. It produces two r values of opposite sign for any positive $\theta$ value, as given by Eqs. (3) and (4).

$$\begin{aligned} \displaystyle r=\, & {} a\theta ^{1/2}, \end{aligned}$$

(3)

$$\begin{aligned} \displaystyle r= & {} -a\theta ^{1/2}. \end{aligned}$$

(4)

The Fermat spiral is created by combining the plots generated by both of the above equations, as shown in Fig. 1. From Fig. 1, it can be seen that the resulting spiral is symmetrical about the origin.
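The two branches in Eqs. (3) and (4) can be generated numerically. The following Python sketch converts both branches to Cartesian coordinates; the function name `fermat_spiral` and its parameters are illustrative choices, not part of the original method.

```python
import math


def fermat_spiral(a=1.0, turns=3, points_per_turn=50):
    """Return (positive_branch, negative_branch) as lists of (x, y) points."""
    pos, neg = [], []
    for i in range(turns * points_per_turn + 1):
        theta = 2 * math.pi * i / points_per_turn
        r_pos = a * math.sqrt(theta)    # Eq. (3)
        r_neg = -a * math.sqrt(theta)   # Eq. (4)
        pos.append((r_pos * math.cos(theta), r_pos * math.sin(theta)))
        neg.append((r_neg * math.cos(theta), r_neg * math.sin(theta)))
    return pos, neg


if __name__ == "__main__":
    pos, neg = fermat_spiral()
    # Each point of one branch is the reflection of the other through the origin.
    print(pos[10], neg[10])
```

Because the two radii differ only in sign, every point on one branch is the point reflection of the corresponding point on the other, which is the symmetry visible in Fig. 1.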

3 Spiral cuckoo search method

CS employs Lévy flight and a biased random walk to find the optimal solution. Generally, CS uses Lévy flight to explore the search region, as its step size is much longer in the long run [53]. In CS, Lévy flight generates some of the new solutions close to the current best solution to expedite the search process, and the remaining solutions are generated far away from the current best solution to avoid premature convergence, as shown in Fig. 2.

From Fig. 2, it can be seen that the Lévy flight produces a random walk. The step sizes in the random walk are not equal, since they depend on the step size scaling factor $\alpha$ and the probability $P_a$. These unequal step sizes also affect the convergence speed of the method. The convergence speed of CS depends on the parameters $\alpha$ and $P_a$, which are fixed in the CS method. Experiments show that CS takes longer to converge if a large value of $\alpha$ and a small value of $P_a$ are used, while it converges quickly but with low accuracy if a small value of $\alpha$ and a large value of $P_a$ are used. Therefore, to avoid premature convergence and obtain better precision, many variants of CS have been proposed. In this paper, a novel cuckoo search method based on Fermat spiral movement is proposed. A two-dimensional Fermat's spiral can be described using Eqs. (3) and (4), as given in Sect. 2.2.

The spiral movement of Fermat's spiral is shown in Fig. 1. From the figure, it is easy to see that the movement of the Fermat spiral depends on the angle $\theta$. For any value of $\theta$, one positive and one negative value of r are produced. Thus, the resulting spiral is symmetrical about the origin, as shown in Fig. 1, which helps to explore the complete search space and avoid premature convergence.

The spiral cuckoo search method uses the properties of the Fermat spiral and Lévy flight, along with variable $\alpha$ and $P_a$, to find the optimal solution. To accelerate the local search (exploitation), the proposed method employs Lévy flights that generate some of the solution vectors adjacent to the best solution, while it uses the Fermat spiral to explore the complete search space.

4 Proposed spam review detection method

This paper introduces a spiral cuckoo search-based clustering method to detect spam reviews. The proposed clustering method detects spam reviews in four phases: (i) preprocessing the reviews, (ii) feature extraction, (iii) feature selection and normalization, and (iv) spam review detection using the spiral cuckoo search-based clustering method. The detailed flow chart of the proposed method is shown in Fig. 3.

4.1 Preprocessing reviews

Online reviews usually contain noise such as stop words and slang words, which are not desired while extracting features. Therefore, the Python Natural Language Toolkit (NLTK) [54] has been used to remove noise and unwanted words from online reviews in the following two phases:

4.1.1 Phase 1

In this phase, all unwanted words and noise are removed from online reviews using the following steps:

1. All the reviews are converted into lowercase.
2. Special symbols like ®, @, #, etc. are removed from online reviews.
3. Stop words such as we, the, a, etc., which do not carry any relevant information, are removed from reviews using the NLTK library.
4. Multiple white spaces in reviews are replaced by a single white space.
5. All numbers are removed from reviews.
6. Some punctuation, such as forward slash, parenthesis, backward slash, and dash, is removed from reviews.
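The Phase 1 steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `STOP_WORDS` is a tiny hand-written set used only for illustration (the paper uses the NLTK stop-word list instead), the symbol and punctuation sets cover only the characters named in the steps, and `clean_review` is a hypothetical helper name.

```python
import re

# Illustrative stop-word set (assumption); the NLTK English list would replace it.
STOP_WORDS = {"we", "the", "a", "an", "is", "are", "and", "of", "to", "in"}


def clean_review(text, stop_words=STOP_WORDS):
    """Apply the six Phase 1 cleaning steps to one review string."""
    text = text.lower()                      # step 1: lowercase
    text = re.sub(r"[®@#]", " ", text)       # step 2: special symbols
    text = re.sub(r"\d+", " ", text)         # step 5: numbers
    text = re.sub(r"[/\\()\-]", " ", text)   # step 6: slashes, parentheses, dash
    # Stop words are dropped after symbol removal so that tokens such as
    # "(and" are already split from their punctuation (step 3).
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)                   # step 4: single spaces only


if __name__ == "__main__":
    print(clean_review("The Room #12 was GREAT (and clean) - we loved it/them"))
```

The steps are applied in a slightly different order from the list so that stop-word matching sees clean tokens; the result is the same set of operations.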

4.1.2 Phase 2

This phase employs a tokenization step to split paragraphs into sentences. Tokenization is also known as lexical analysis or text segmentation. After tokenization, lemmatization is used to reduce words to their root forms. For example, “reading” is converted to “read”.

4.2 Feature extraction

After preprocessing, significant features are extracted using Linguistic Inquiry and Word Count (LIWC 2015) [55]. LIWC 2015 is a text-analysis tool which generally provides 93 features.

4.3 Feature selection

Feature selection, also called attribute selection or variable subset selection, is the process of selecting appropriate features with respect to the target data. Feature selection is important since it:

1. Removes redundant data.
2. Selects attributes that are significant.
3. Reduces the chance of overfitting.
4. Reduces training time.

The LIWC tool extracts 93 features from each dataset. Some of the extracted features may be irrelevant or redundant and may therefore cause overfitting. Moreover, training time increases with the number of features [56]. Thus, to eliminate irrelevant and redundant features, the whale optimization algorithm with simulated annealing (WOASA) [57] has been used, which dynamically selects the optimal set of features from the dataset. The main objective of the feature selection method is to maximize classification accuracy while minimizing the number of selected features and the error rate. After the relevant features are selected, the proposed spam detection method is applied.
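The trade-off between error rate and the number of selected features can be made concrete with a small Python sketch. The exact objective used by WOASA is not given in the text, so the weighted sum below, the `weight` value of 0.99, and the function name `selection_fitness` are all assumptions chosen for illustration.

```python
def selection_fitness(error_rate, n_selected, n_total, weight=0.99):
    """Lower is better: weighted sum of error rate and selected-feature ratio.

    Assumption: a wrapper-style objective of the form
    weight * error + (1 - weight) * |selected| / |all|.
    """
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    return weight * error_rate + (1 - weight) * n_selected / n_total


if __name__ == "__main__":
    # Same error, fewer of the 93 LIWC features: the smaller subset scores better.
    print(selection_fitness(0.10, 30, 93), selection_fitness(0.10, 60, 93))
```

With a weight close to 1, the error rate dominates and the feature count acts mainly as a tie-breaker between subsets of similar accuracy.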

4.4 Spam reviews detection using spiral cuckoo search-based clustering method

The cuckoo search method generates its initial population randomly, and because of this random initialization CS may take longer to converge. Moreover, it may get trapped in a local solution owing to the lack of diversity in the population. Therefore, this paper proposes a novel variant of CS named spiral CS. The proposed spiral CS method takes advantage of the Fermat spiral and Lévy flight to generate new solutions. Owing to this modification, the proposed method requires fewer iterations to converge and find the optimal solution.

Furthermore, the proposed spiral CS method has been used to detect spam reviews through a spiral CS-based clustering method. The proposed clustering method uses the following three steps to separate spam and non-spam reviews:

1. Generate k cluster centers ($c_1, c_2, \ldots , c_k$) randomly and use them to initialize the population of spiral cuckoo search. For the spam detection problem, cluster centers for spam ($c_1$) and non-spam ($c_2$) reviews are generated.
2. Compute the fitness of each pattern (review) using an objective function that minimizes the sum of squared errors, and assign the pattern to one of the clusters.
3. Optimize the clusters using spiral cuckoo search.

To describe this mathematically, let $X = (x_1^d , x_2^d,\ldots , x_r^d)$ be a set of r reviews that are to be divided into k classes $C_1, C_2, \ldots , C_k$. Each review is represented by a feature vector with L features, scaled to [0, S]. The probability distribution of each feature may be described as follows [58, 59]:

$$\begin{aligned} p_{j}=\frac{O_{j}}{r}. \end{aligned}$$

(5)

where j is the $j^{th}$ feature, i.e. $0 \le j \le S$, and $O_j$ is the number of reviews that contain the $j^{th}$ feature. The total mean of each feature can be expressed by Eq. (6).

$$\begin{aligned} \mu =\sum _{j=1}^{S}{jp_{j}}. \end{aligned}$$

(6)

Each review is assigned to the class $C_k$ whose center is at minimum Euclidean distance from it. Thus, the probability ($W_k$) of occurrence of class $C_k$ ($k=1,2, \ldots , n$) is given by Eq. (7).

$$\begin{aligned} W_{k}=\sum _{j\in C_k}{p_{j}}. \end{aligned}$$

(7)

The mean of class $C_k$ can be calculated by Eq. (8).

$$\begin{aligned} \mu _{k}=\sum _{j\in C_k}\frac{jp_{j}}{W_{k}}. \end{aligned}$$

(8)

If $\mu _{c}$ denotes the mean of class $C_c$ ($c=1,2,\ldots ,k$), computed as in Eq. (8), then the intra-cluster distance can be calculated using Eq. (9).

$$\begin{aligned} D_{intra}=\sum _{c=1}^{k}\sum _{x_i\in C_c}{\left\| {x_i-\mu _c}\right\| }^2 \end{aligned}$$

(9)

where $x_i$ ranges over the data points in cluster $C_c$ and $\mu _c$ is the representative point (cluster centroid) of cluster $C_c$.

To cluster the data points into their respective classes, intra-cluster distance should be minimized or inter-class variance should be maximized. The proposed clustering method minimizes the intra-cluster distance as given in Eq. (9) [60]. The pseudo-code of the spiral CS-based clustering method is given in Algorithm 2.
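The objective in Eq. (9) can be evaluated for any candidate set of cluster centers as sketched below in Python. This is only the fitness evaluation, not the spiral CS optimizer itself; the candidate centers stand in for the representative points, and the function names `squared_distance` and `assign_and_score` are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def assign_and_score(reviews, centers):
    """Assign each review to its nearest center; return (labels, intra-cluster SSE)."""
    labels, sse = [], 0.0
    for x in reviews:
        dists = [squared_distance(x, c) for c in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        labels.append(best)
        sse += dists[best]
    return labels, sse


if __name__ == "__main__":
    reviews = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    centers = [[0.0, 0.5], [10.0, 10.5]]   # e.g. one spam and one non-spam center
    print(assign_and_score(reviews, centers))
```

A metaheuristic such as spiral CS would call a function of this kind once per candidate solution and keep the centers with the smallest returned value.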

5 Experimental results

The efficiency of the proposed spam detection method is discussed in two sections. First, Sect. 5.1 analyzes the efficiency of the proposed spiral CS on benchmark functions from two categories, unimodal and multimodal [32]. Second, Sect. 5.2 discusses the effectiveness of the proposed clustering method on spam review and Twitter spammer datasets. For fair comparison, all experiments are run in Matlab 2016a on a computer with a 2.30 GHz Intel(R) Core i3 processor, 2 GB of RAM, and a 500 GB hard disk.

5.1 Performance analysis of spiral CS

Spiral CS has been tested on 15 benchmark functions, including both unimodal ($F_1 - F_8$) and multimodal ($F_9 - F_{15}$) functions [44]. The unimodal functions evaluate the rate of convergence towards the global optimum, while the multimodal functions test the chance of getting stuck in local optima. Table 1 lists the considered benchmark functions along with their optimal values. The comparative analysis has been conducted against four existing nature-inspired algorithms, namely particle swarm optimization (PSO), differential evolution (DE), genetic algorithm (GA), and cuckoo search (CS), and a variant of CS (ICS), in terms of mean fitness values and their standard deviations. In all the algorithms, the population size (N) is 50 and the maximum number of iterations (max itr) is 1000. The parameter settings of the considered algorithms are given in Table 2. The best fitness values obtained over 30 runs on each benchmark function are averaged and presented in Table 3 together with their standard deviations. From the table, it can be seen that spiral CS obtains better results than the other methods (PSO, DE, GA, CS, and ICS) on all the considered benchmark functions except $F_4$ and $F_{13}$. For $F_{13}$, ICS performs slightly better than the proposed spiral CS method, while CS returns the best standard deviation value for $F_4$. Moreover, spiral CS, PSO, and DE have equivalent mean fitness and standard deviation values for $F_9$. Thus, it can be stated that the proposed spiral cuckoo search outperforms the compared methods on most of the considered benchmark functions.

Table 1 Benchmark functions

Table 2 Parameter values for all the methods

Table 3 Comparative analysis of existing and proposed methods for mean fitness value and corresponding standard deviation values on the standard benchmark functions

5.2 Experimental analysis of proposed spam detection method

The accuracy of the proposed spiral CS-based clustering method has been tested on one Twitter spammer dataset and four spam review datasets. A brief description of these datasets is given in Table 4. From Table 4, it can be seen that the class distributions of the synthetic and Yelp datasets are imbalanced (skewed). It is widely known that poor models usually do not show satisfactory results on skewed datasets [61]. Hence, to show the efficacy of the proposed spam detection method, both skewed (imbalanced) and non-skewed (balanced) datasets have been used.

Table 4 Considered datasets

5.2.1 Spam review dataset

This dataset [6] has been taken from Myle Ott's website and contains a total of 1600 reviews of 20 Chicago hotels, divided into four labels: negative truthful, negative deceptive, positive truthful, and positive deceptive. Each label has 400 reviews. In this dataset, both positive and negative deceptive reviews are acquired from Amazon Mechanical Turk, while positive truthful reviews are extracted from TripAdvisor. Negative truthful reviews are acquired from Hotels.com, Expedia, Priceline, Orbitz, and TripAdvisor. For better comparison, negative truthful and positive truthful reviews are given the “Not spam” label, and negative deceptive and positive deceptive reviews are given the “spam” label.

5.2.2 Synthetic spam review dataset

This dataset was initially taken from the Database and Information System Laboratory, University of Illinois (TripAdvisor dataset) and was unlabeled [62]. Thus, to produce spam reviews, a synthetic review spamming method has been used [63]. The method produces a dataset of 479 reviews, with 316 spam and 163 non-spam reviews.

5.2.3 Yelp fake review dataset

This dataset has been taken from Yelp.com and contains reviews of 85 hotels and 130 restaurants in the Chicago area [64, 65]. For fair comparison, a mixture of reviews of popular and disliked restaurants and hotels is considered. The detailed statistics of the dataset are given in Table 4, from which it can be observed that the class distribution is skewed.

1. Yelp hotel review dataset

This dataset is a subset of the Yelp fake review dataset and consists of 5678 reviews [64]. It contains 802 spam (fake) and 4876 non-spam reviews written by 5124 reviewers.

2. Yelp restaurant review dataset

This dataset is also a subset of the Yelp fake review dataset and contains 58517 reviews written by 35593 reviewers [64]. It contains 8368 spam and 50149 non-spam reviews.

5.2.4 Twitter spam dataset

This dataset has been collected using the Twitter API and contains 600 million tweets. All tweets in the dataset are manually annotated into two classes, spammer and non-spammer. To detect spammer tweets, 12 features are extracted from each tweet. In this paper, a subset of the standard Twitter dataset has been used. This subset consists of 10,000 tweets (5000 spammer and 5000 non-spammer tweets) chosen randomly from a fixed continuous time frame.

Table 5 Some of the selected features of synthetic dataset

The spam review datasets are preprocessed to remove noise as discussed in Sect. 4.1. From the preprocessed datasets, 93 features are extracted using LIWC 2015; some of the features of the spam review datasets are given in Table 5. However, not all 93 features may be relevant. Therefore, the feature selection method discussed in Sect. 4.3 has been used to select the best set of features from the spam review datasets. As the Twitter spammer dataset contains only 12 features, the WOASA feature selection method has not been applied to it. The total number of selected features and the mean error from WOASA for all the spam review datasets are given in Table 6. Since the range of feature values varies widely, the feature vector matrix is normalized for uniformity and faster convergence. Afterwards, the proposed clustering method is used to identify spam and non-spam reviews. However, classification accuracy alone can be misleading if the classes have unequal numbers of instances. Therefore, to assess the performance of the proposed clustering method and make it comparable with the other considered methods, recall and precision are computed along with accuracy. To compute precision, recall, and accuracy, a confusion matrix is created. A confusion matrix C of size $n \times n$ corresponds to n classes, and its entry $C_{ji}$ is the number of patterns of class j predicted as class i.
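The normalization of the feature vector matrix mentioned above can be sketched as follows. The text does not name the normalization scheme, so column-wise min-max scaling to [0, 1] is assumed here purely for illustration, and `min_max_normalize` is a hypothetical helper name.

```python
def min_max_normalize(rows):
    """Scale every feature column of a row-major matrix to the range [0, 1].

    Assumption: min-max scaling; constant columns are mapped to 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


if __name__ == "__main__":
    # Two features with very different ranges end up on the same scale.
    print(min_max_normalize([[1, 100], [2, 300], [3, 200]]))
```

Putting all features on a common scale keeps any single wide-ranged feature from dominating the Euclidean distances used by the clustering step.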

Table 6 Error in feature selection using binary whale optimization with simulated annealing

The confusion matrix uses four values, TP (true positive), TN (true negative), FP (false positive), and FN (false negative), as shown in Table 7, where:

TP is the number of spam reviews that are correctly predicted as spam.
TN is the number of non-spam reviews that are correctly predicted as non-spam.
FP is the number of non-spam reviews that are incorrectly labeled as spam.
FN is the number of spam reviews that are wrongly predicted as non-spam.

Based on the confusion matrix, precision, recall, and accuracy are computed using Eqs. (10)–(12):

$$\text{Precision} = \frac{TP}{TP+FP}, \tag{10}$$

$$\text{Recall} = \frac{TP}{TP+FN}, \tag{11}$$

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}. \tag{12}$$
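As a minimal sketch, Eqs. (10)–(12) can be evaluated directly from a two-class confusion matrix indexed as $C_{ji}$ (row j is the actual class, column i the predicted class). The function name `binary_metrics`, the choice of index 0 for the spam class, and the counts below are illustrative assumptions, not values from the experiments.

```python
from typing import Dict, List


def binary_metrics(confusion: List[List[int]], positive: int = 0) -> Dict[str, float]:
    """Compute precision, recall and accuracy from a 2x2 confusion matrix.

    confusion[j][i] is the number of patterns of class j predicted as class i.
    `positive` is the index of the spam class (assumed to be 0 here).
    """
    negative = 1 - positive
    tp = confusion[positive][positive]
    fn = confusion[positive][negative]
    fp = confusion[negative][positive]
    tn = confusion[negative][negative]
    total = tp + tn + fp + fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Made-up counts for illustration: 100 spam and 100 non-spam reviews.
example = [
    [80, 20],  # actual spam: 80 predicted spam, 20 predicted non-spam
    [10, 90],  # actual non-spam: 10 predicted spam, 90 predicted non-spam
]
metrics = binary_metrics(example)
```

With these counts, precision is 80/90, recall is 80/100, and accuracy is 170/200, which shows how the three measures can differ on the same predictions.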

Table 7 Confusion matrix

Since metaheuristic methods are randomized in nature, each method has been executed 30 times on each dataset, and the experimental outcomes have been examined in terms of mean precision, mean recall, mean accuracy, mean fitness, and standard deviation. The performance of the proposed spiral cuckoo search clustering method has been analyzed on the original datasets as well as on the datasets with the optimal set of features. The mean precision and mean recall of each method on both are presented in Table 8. From the table, it can be seen that the proposed spam detection method attains the best recall and precision on all the datasets.
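The repeated-run protocol can be sketched as below. The helper `summarize_runs` and the stand-in `dummy_clustering_accuracy` are hypothetical names; the stand-in only returns a made-up score and does not implement any of the compared methods. The use of the sample (rather than population) standard deviation is also an assumption, since the text does not specify it.

```python
import random
import statistics
from typing import Callable, Dict, List


def summarize_runs(run_once: Callable[[int], float], runs: int = 30) -> Dict[str, float]:
    """Execute a stochastic method `runs` times with different seeds and summarize."""
    scores: List[float] = [run_once(seed) for seed in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation (assumed)
    }


def dummy_clustering_accuracy(seed: int) -> float:
    """Stand-in for one clustering run; returns a made-up accuracy in [0.80, 0.85)."""
    rng = random.Random(seed)
    return 0.80 + 0.05 * rng.random()


summary = summarize_runs(dummy_clustering_accuracy)
```

Seeding each run separately keeps the 30 executions reproducible while still varying the random initialization.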

Table 8 Comparison of proposed spam detection method with other methods in terms of mean precision, mean recall over datasets with original and optimal set of features

Table 9 Comparison of proposed spam detection method with other methods in terms of mean accuracy, mean fitness function and standard deviation values with original set of features

The mean fitness, mean accuracy, and standard deviation values for each dataset with the original set of features are given in Table 9. From Table 9, it is clearly observed that the proposed spam detection method gives better results than the other methods in terms of mean fitness and mean accuracy. K-means and DE give competitive results on the spam review and Yelp hotel review datasets, respectively, in terms of mean computational time, while PSO shows a better standard deviation on the synthetic spam review dataset.

Furthermore, the proposed clustering method has been tested on the datasets with the optimal set of features. The mean fitness, mean accuracy, and standard deviation values of each dataset with the relevant set of features are given in Table 10. From the table, it is observed that the spiral cuckoo search clustering method outperforms all the other methods. In terms of mean computational time, however, the proposed method shows better results on all the datasets except the spam review and Yelp hotel review datasets. Comparing Tables 9 and 10, it can be seen that the proposed method performs particularly well on the datasets with the optimal set of features.

Table 10 Comparison of proposed spam detection method with other methods in terms of mean accuracy, mean fitness function and standard deviation values over datasets with relevant features returned by feature selection method

To validate the performance of the proposed method, box plots [66] have also been plotted for all the spam review datasets with the relevant set of features; they are shown in Figs. 4, 5 and 6. In the box plots, the x-axis denotes the name of the method and the y-axis denotes the parameter under consideration. From the box plots, it is observed that the spiral CS-based clustering method has an edge over the other methods in terms of consistency. Moreover, a convergence graph has been plotted in Fig. 7 to show the convergence behavior of the proposed method and all the considered methods. In the convergence plot, the x-axis denotes the name of the method and the y-axis denotes the parameter under consideration.

Furthermore, to validate the significance of the results, the Wilcoxon rank-sum multiple-problem test is conducted at the 5% level of significance between the proposed method and the existing methods. Table 11 presents the corresponding p-value and z-value along with the SIG (significance) of each method. The p-value is used in the context of the null hypothesis and determines the significance of the results. The null hypothesis is rejected if the p-value is $\le 0.05$, which is symbolized by ’$+$’ or ’−’; otherwise, it is accepted, which is represented by the ’$=$’ symbol. The ’$+$’ indicates that the method is different and significantly better, while ’−’ shows that it is different and significantly worse. From the table, it can be seen that the values of SIG are ’$+$’ for all datasets, i.e., the spiral CS-based clustering method is significantly different from the considered methods.
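The decision rule described above can be sketched with a two-sided rank-sum test. This version uses the normal approximation without a tie correction, which is an assumption about how the z-value and p-value were obtained; the function names and the accuracy values are illustrative only, and higher scores are assumed to be better.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p


def significance(z: float, p: float, alpha: float = 0.05) -> str:
    """Map a result to '+', '-' or '=' (ASCII '-' stands for the minus symbol)."""
    if p > alpha:
        return "="  # null hypothesis accepted
    return "+" if z > 0 else "-"


# Made-up accuracies for illustration only.
proposed = [0.91, 0.92, 0.93, 0.94, 0.95, 0.96]
baseline = [0.81, 0.82, 0.83, 0.84, 0.85, 0.86]
z_value, p_value = rank_sum_test(proposed, baseline)
sig = significance(z_value, p_value)
```

Here every proposed score outranks every baseline score, so z is positive, the p-value falls below 0.05, and the result is marked '+'.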

Table 11 Results of the Wilcoxon test at significance level $\alpha = 0.05$

6 Conclusion

In this paper, a novel variant of cuckoo search, namely spiral CS, has been proposed. The proposed method takes advantage of the Fermat spiral and Lévy flight to find the optimal solution in fewer iterations. The proposed spiral cuckoo search method is validated on 15 benchmark functions, including both unimodal and multimodal ones. From the experimental results, it can be inferred that the proposed spiral CS method shows more promising results than PSO, DE, GA, CS, and ICS. Additionally, the efficiency of spiral CS has been validated through the proposed spiral CS-based clustering method. The performance of the proposed clustering method has been tested on four spam datasets and one Twitter spammer dataset. Further, the proposed spam detection method has been compared with K-means, PSO, DE, GA, CS, and ICS. A convergence graph is plotted to depict the exploration and exploitation capabilities of the proposed method, and box plots are drawn to show its consistency. From the experimental and statistical evidence, it is found that the proposed spiral cuckoo search clustering method is more efficient than the compared methods. Although the proposed clustering method is better than the existing methods, further effort is required to improve the accuracy. Therefore, future work involves exploring more feature selection techniques and optimization algorithms for better accuracy.

References

Lackermair G, Kailer D, Kanmaz K (2013) Importance of online product reviews from a consumer’s perspective. Adv Econ Bus 1:1–5
Dixit S, Agrawal A (2013) Survey on review spam detection. Int J Comput Commun Technol ISSN 4:0975–7449
Shojaee S, Murad MAA, Azman AB, Sharef NM, Nadali S (2013) Detecting deceptive reviews using lexical and syntactic features. In: Intelligent systems design and applications (ISDA), 2013 13th international conference on, IEEE, pp 53–58
Rosso P, Cagnina LC (2017) Deception detection and opinion spam. In: A practical guide to sentiment analysis, Springer, New York, pp 155–171
Heredia B, Khoshgoftaar TM, Prusa JD, Crawford M (2017) Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection. Soc Netw Anal Min 7(1):37
Ott M, Choi Y, Cardie C, Hancock JT (2011) Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Vol 1, association for computational linguistics, pp 309–319
Jindal N, Liu B, Lim E-P (2010) Finding unusual review patterns using unexpected rules. In: Proceedings of the 19th ACM international conference on information and knowledge management, ACM, pp 1549–1552
Li F, Huang M, Yang Y, Zhu X (2011) Learning to identify review spam. In: IJCAI proceedings of international joint conference on artificial intelligence, vol 22, p 2488
Cheng L-C, Tseng JC, Chung T-Y (2017) Case study of fake web reviews. In: Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017, ACM, pp 706–709
Munzel A (2016) Assisting consumers in detecting fake reviews: the role of identity information disclosure and consensus. J Retail Consumer Serv 32:96–108
Narayan R, Rout JK, Jena SK (2018) Review spam detection using opinion mining. In: Progress in intelligent computing techniques: theory, practice, and applications, Springer, New York, pp 273–279
Petrescu M, O’Leary K, Goldring D, Mrad SB (2018) Incentivized reviews: promising the moon for a few stars. J Retail Consumer Serv
Luca M, Zervas G (2016) Fake it till you make it: reputation, competition, and yelp review fraud. Manag Sci 62(12):3412–3427
Gieseke F, Kramer O, Airola A, Pahikkala T (2012) Efficient recurrent local search strategies for semi-and unsupervised regularized least-squares classification. Evolut Intell 5(3):189–205
Behdad M, Barone L, French T, Bennamoun M (2012) On XCSR for electronic fraud detection. Evolut Intell 5(2):139–150
Mani S, Kumari S, Jain A, Kumar P (2018) Spam review detection using ensemble machine learning. In: International conference on machine learning and data mining in pattern recognition, Springer, New York, pp 198–209
Ghai R, Kumar S, Pandey AC (2019) Spam detection using rating and review processing method, smart innovations in communication and computational sciences. Springer, Singapore, pp 189–198
Heydari A, Tavakoli M, Salim N (2016) Detection of fake opinions using time series. Expert Syst Appl 58:83–92
Liu Y, Pang B (2018) A unified framework for detecting author spamicity by modeling review deviation. Exp Syst Appl 112:148–155
Li C, Liu S (2018) A comparative study of the class imbalance problem in twitter spam detection. Concurr Comput Pract Exp 30(5):e4281
Hu Y-H, Chen Y-L, Chou H-L (2017) Opinion mining from online hotel reviews-A text summarization approach. Inf Process Manag 53(2):436–449
Hai Z, Zhao P, Cheng P, Yang P, Li X-L, Li G (2016) Deceptive review spam detection via exploiting task relatedness and unlabeled data. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp 1817–1826
Mateen M, Iqbal MA, Aleem M, Islam MA (2017) A hybrid approach for spam detection for twitter. In: Applied sciences and technology (IBCAST), 2017 14th international Bhurban conference on, IEEE, pp 466–471
Vishwarupe V, Bedekar M, Pande M, Hiwale A (2018) Intelligent twitter spam detection: a hybrid approach. In: Smart trends in systems, security and sustainability, Springer, New York, pp 189–197
Sedhai S, Sun A (2018) Semi-supervised spam detection in twitter stream. arXiv:1702.01032
Chen C, Wang Y, Zhang J, Xiang Y, Zhou W, Min G (2017) Statistical features-based real-time detection of drifted twitter spam. IEEE Trans Inf Forensics Secur 12(4):914–925
Wu T, Wen S, Xiang Y, Zhou W (2018) Twitter spam detection: survey of new approaches and comparative study. Comput Secur 76:265–284
Singh S, Singh AK (2018) Web-spam features selection using cfs-pso. Proc Comput Sci 125:568–575
Li Y, Nie X, Huang R (2018) Web spam classification method based on deep belief networks. Expert Syst Appl 96:261–270
Singh A, Batra S (2018) Ensemble based spam detection in social iot using probabilistic data structures. Fut Gen Comput Syst 81:359–371
Wei Y, Singh L (2018) Detecting users who share extremist content on twitter. In: Surveillance in Action, Springer, New York, pp 351–368
Bindu P, Mishra R, Thilagam PS (2018) Discovering spammer communities in twitter. J Intell Inf Syst, pp 1–25
Liu S, Zhang J, Xiang Y (2016) Statistical detection of online drifting twitter spam. In: Proceedings of the 11th ACM on Asia conference on computer and communications security, ACM, pp 1–10
Inuwa-Dutse I, Liptrott M, Korkontzelos I (2018) Detection of spam-posting accounts on Twitter. Neurocomputing 315:496–511
Miller Z, Dickinson B, Deitrick W, Hu W, Wang AH (2014) Twitter spammer detection using data stream clustering. Inf Sci 260:64–73
Singh M, Kumar L, Sinha S (2018) Model for detecting fake or spam reviews. In: ICT based innovations, Springer, New York, pp 213–217
Narayan R, Rout JK, Jena SK (2018) Review spam detection using semi-supervised technique. In: Progress in intelligent computing techniques: theory, practice, and applications, Springer, New York, pp 281–286
Salehi S, Selamat A, Bostanian M (2011) Enhanced genetic algorithm for spam detection in email. In: Software engineering and service science (ICSESS), 2011 IEEE 2nd international conference on, IEEE, pp 594–597
Idris I, Selamat A, Omatu S (2014) Hybrid email spam detection model with negative selection algorithm and differential evolution. Eng Appl Artif Intell 28:97–110
Storn R, Price K (1997) Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. J Glob Optim 11:341–359
Kennedy J, Eberhart R (1995) Particle swarm optimization. Neural Netw 4:1942–1948
Idris I, Selamat A, Nguyen NT, Omatu S, Krejcar O, Kuca K, Penhaker M (2015) A combined negative selection algorithm-particle swarm optimization for an email spam detection system. Eng Appl Artif Intell 39:33–44
Pereira FB, Marques JMC (2009) A study on diversity for cluster geometry optimization. Evolut Intell 2(3):121
Simon D (2008) Biogeography-based optimization. IEEE Trans Evolut Comput 12(6):702–713
Maulik U, Bandyopadhyay S (2000) Genetic algorithm-based clustering technique. Pattern Recognit 33:1455–1465
Žalik KR (2008) An efficient k’-means clustering algorithm. Pattern Recognit Lett 29:1385–1391
Yang X-S, Deb S (2009) Cuckoo search via lévy flights. In: World congress on nature and biologically inspired computing, IEEE, pp 210–214
Pandey AC, Rajpoot DS, Saraswat M (2016) Data clustering using hybrid improved cuckoo search method. In: Contemporary Computing (IC3), 2016 9th international conference on, IEEE, pp 1–6
Pandey AC, Rajpoot DS, Saraswat M (2017) Twitter sentiment analysis using hybrid cuckoo search method. Inf Process Manag 53(4):764–779
Pandey AC, Rajpoot DS, Saraswat M (2017) Hybrid step size based cuckoo search. In: Contemporary computing (IC3), 2017 10th international conference on, IEEE, pp 1–6
Pavlyukevich I (2007) Lévy flights, non-local search and simulated annealing. J Comput Phys 226(2):1830–1844
Payne RB, Sorensen MD (2005) The cuckoos, vol 15. Oxford University Press, Oxford
Kulhari A, Pandey A, Pal R, Mittal H (2016) Unsupervised data classification using modified cuckoo search method. In: Contemporary computing (IC3), 2016 9th international conference on, IEEE, pp 1–5
Bird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., Newton
Pennebaker JW, Boyd RL, Jordan K, Blackburn K (2015) The development and psychometric properties of liwc2015, Tech. rep
Tran CT, Zhang M, Andreae P, Xue B (2016) Improving performance for classification with incomplete data using wrapper-based feature selection. Evolut Intell 9(3):81–94
Mafarja MM, Mirjalili S (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260:302–312
Roessler EB, Alder HL (1977) Introduction to probability and statistics. WH Freeman
Saraswat M, Arya K, Sharma H (2013) Leukocyte segmentation in tissue images using differential evolution algorithm. Swarm Evolut Comput 11:46–54
Hatamlou A (2013) Black hole: a new heuristic optimization approach for data clustering. Inf Sci 222:175–184
Chawla NV, Japkowicz N, Kotcz A (2004) Special issue on learning from imbalanced data sets. ACM Sigkdd Explor Newsl 6(1):1–6
Wang H, Lu Y, Zhai C (2010) Latent aspect rating analysis on review text data: a rating regression approach. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 783–792
Sun H, Morales A, Yan X (2013) Synthetic review spamming and defense. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 1088–1096
Mukherjee A, Venkataraman V, Liu B, Glance NS (2013) What yelp fake review filter might be doing? In: ICWSM, pp 409–418
Mukherjee A, Venkataraman V, Liu B, Glance N (2013) Fake review detection: classification and analysis of real and pseudo reviews. Technical Report UIC-CS-2013–03, University of Illinois at Chicago, Tech. Rep
Pandey AC, Pal R, Kulhari A (2018) Unsupervised data classification using improved biogeography based optimization. Int J Syst Assur Eng Manag 9(4):821–829

Author information

Authors and Affiliations

Jaypee Institute of Information Technology, Noida, India
Avinash Chandra Pandey & Dharmveer Singh Rajpoot

Authors

Avinash Chandra Pandey
Dharmveer Singh Rajpoot

Corresponding author

Correspondence to Avinash Chandra Pandey.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pandey, A.C., Rajpoot, D.S. Spam review detection using spiral cuckoo search clustering method. Evol. Intel. 12, 147–164 (2019). https://doi.org/10.1007/s12065-019-00204-x

Received: 15 June 2018
Revised: 25 October 2018
Accepted: 22 January 2019
Published: 05 February 2019
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s12065-019-00204-x


1 Introduction