
1 Introduction

Over the last two decades, the Internet has changed, largely for the better, the way we communicate, interact with our peers, and do business [1]. This is possible because of the reach of the Internet and its ease of use. One major boom of the last decade is e-commerce, which harnesses the power of the Internet to enlarge the pool of available customers: by offering products and services online, a business is limited only by the reach of the Internet itself. Out of all this, a new currency for online businesses has emerged, the product review, or simply the review.

In today's online business, reviews play a major role in whether a product or service can make its place in the market [2]. A customer who wants to buy a product or use a service through the Internet will first go through its reviews. If the reviews do not give the customer enough confidence in the authenticity and reliability of the product, he or she is highly likely not to buy it and to look for something else or somewhere else.

Scenarios like the one above have encouraged online businesses to understand the experience of their customers and to improve their products or services in order to serve them better in the future. A technology that helps achieve this goal is opinion mining [3]. Opinion mining allows businesses to analyze their customers by identifying their attitude toward a product or service. These techniques use the reviews given by customers and infer attitudes from them, but they rely on an underlying assumption that all reviews are trustworthy and authentic. Unfortunately, that is often not the case: many product pages on the web are affected by 'spam reviews', which degrade the output of opinion mining techniques and defeat the purpose of using them in the first place.

Jindal and Liu [4] categorized spam reviews into three categories, which have varying degrees of effect on a customer's decision to buy a product. 'Positive spam reviews' and 'negative spam reviews' are the most effective at changing a buyer's decision on the basis of false information. As the names suggest, a 'positive spam review' gives an undeserved positive review to a product, and a 'negative spam review' gives an undeserved negative review to a product.

Consider a situation in which a restaurant fails to provide a good dining experience to one of its customers, and as a result the customer posts a bad review about the restaurant's hospitality on a site like Zomato or Yelp. Since this can be bad for business, the restaurant hires someone to post positive reviews about its dining experience and dilute the effect of that one honest review. Such tactics can be used not only to protect one's own business but also to damage the business of a rival. Spam reviews are designed to change someone's opinion about something on false grounds, which can be damaging for both customers and businesses.

In 2017, Rajamohana et al. [5] proposed a model using an adaptive binary flower pollination algorithm for feature selection, with the accuracy of a naive Bayes classifier as the objective function and a k-nearest neighbors classifier trained on the selected features. In 2019, Pandey and Rajpoot [6] proposed a model using spiral cuckoo search to optimize the k-means algorithm, with the sum of squared errors as the objective function. This literature motivates the present work.

The rest of the paper is structured as follows: the artificial bee colony algorithm is reviewed in Sect. 2. In Sect. 3, k-means clustering is reviewed. The proposed feature selection and cluster head optimization using the artificial bee colony algorithm are described in Sect. 4. Experimental results are discussed in Sect. 5. Finally, the conclusion is given in Sect. 6.

2 Artificial Bee Colony Algorithm

The artificial bee colony (ABC) algorithm is a swarm-based algorithm that mimics the intelligent foraging behavior of honeybees. Honeybees forage by coordinating with each other and exchanging the locations of food sources, i.e., honey. To exchange this information, they dance in a particular area of the hive called the dancing area; the dance itself is called the waggle dance [7]. Sharing the locations of food sources allows the bees to work collectively and efficiently. This gathering, exchange of information, and collective work is known as collective intelligence, and it is the reason we mimic their behavior to solve complex problems. Recently, the ABC algorithm has been modified for various applications and successfully applied to complex problems [8,9,10].

To understand the implementation of the foraging behavior of honeybees, the whole process can be divided into four phases, as discussed below.

2.1 Initialization Phase

This is the first phase of the algorithm and is executed only once. In this phase, the position of each search agent is randomly initialized within the search space according to Eq. (1).

Once the positions have been initialized, the fitness of each search agent is calculated.

$$\begin{aligned} x^{j}_{i} = x^{j}_\mathrm{{min}} + \mathrm{{rand}}(0,1)(x^{j}_\mathrm{{max}} -x^{j}_\mathrm{{min}}), \quad \forall j = 1,2,\ldots ,D \end{aligned}$$
(1)
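As an illustration, the following minimal NumPy sketch initializes a population within per-dimension bounds as in Eq. (1); the names n_agents, x_min, and x_max are chosen here for illustration and are not taken from the paper.

```python
import numpy as np

def initialize_positions(n_agents, dim, x_min, x_max, rng=np.random.default_rng()):
    """Place every search agent uniformly at random inside the bounds, as in Eq. (1)."""
    # x_min and x_max are length-`dim` arrays of per-dimension lower and upper bounds
    return x_min + rng.random((n_agents, dim)) * (x_max - x_min)

# e.g. 20 agents in a 5-dimensional search space bounded by [0, 1] in every dimension
positions = initialize_positions(20, 5, np.zeros(5), np.ones(5))
```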

2.2 Employed Bee Phase

This is the second phase of the algorithm and is executed in every iteration. In this phase, each search agent generates a new candidate position and evaluates its fitness. If the new position is better than the current one, the agent keeps the new position; otherwise it keeps the old one. The new position is generated using Eq. (2)

$$\begin{aligned} v_{ij} = x_{ij} + \phi _{ij}(x_{ij} - x_{kj}) \end{aligned}$$
(2)
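A hedged sketch of this phase is shown below; it assumes that fitness is to be maximized and that only one randomly chosen dimension is perturbed per agent, which is the common ABC convention but is not spelled out above.

```python
import numpy as np

def employed_bee_phase(positions, fitness, objective, rng=np.random.default_rng()):
    """Greedy neighbourhood search of Eq. (2): nudge one dimension toward a random partner."""
    n_agents, dim = positions.shape
    for i in range(n_agents):
        k = rng.choice([a for a in range(n_agents) if a != i])  # random partner, k != i
        j = rng.integers(dim)                                   # random dimension to perturb
        phi = rng.uniform(-1.0, 1.0)
        candidate = positions[i].copy()
        candidate[j] = positions[i, j] + phi * (positions[i, j] - positions[k, j])
        new_fitness = objective(candidate)
        if new_fitness > fitness[i]:        # keep the better of the two positions
            positions[i], fitness[i] = candidate, new_fitness
    return positions, fitness
```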

2.3 Onlooker Bee Phase

This is the third phase of the algorithm, and it is also executed in every iteration. In this phase, a search agent is selected with a probability proportional to its fitness, according to Eq. (3).

$$\begin{aligned} p_{i} = \dfrac{\mathrm{{fitness}}_{i}}{\sum _{n=1}^{N}\mathrm{{fitness}}_{n}} \end{aligned}$$
(3)

Once a search agent has been selected, an onlooker bee updates that agent's position and evaluates its fitness. If the new position is better than the current position, the new position is kept; otherwise the old one is retained.
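A minimal sketch of the selection step, assuming the roulette-wheel scheme that Eq. (3) implies (the selected agents are then refined exactly as in the employed bee phase):

```python
import numpy as np

def selection_probabilities(fitness):
    """Eq. (3): each agent's share of the swarm's total fitness."""
    fitness = np.asarray(fitness, dtype=float)
    return fitness / fitness.sum()

# roulette-wheel selection: fitter agents are more likely to be picked by onlooker bees
rng = np.random.default_rng()
probs = selection_probabilities([0.9, 0.4, 0.7])
chosen = rng.choice(len(probs), size=len(probs), p=probs)
```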

2.4 Scout Bee Phase

This is the fourth phase of the algorithm, and it is executed only when a search agent's position has not changed for a predetermined number of iterations. Any search agent that enters this phase becomes a scout bee and must find a new position. To do so, it is randomly reinitialized within the search space using Eq. (4).

$$\begin{aligned} x^{j}_{i} = x^{j}_\mathrm{{min}} + \mathrm{{rand}}(0,1)(x^{j}_\mathrm{{max}} -x^{j}_\mathrm{{min}}), \quad \forall j = 1,2,\ldots ,D \end{aligned}$$
(4)
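A small sketch of this phase, assuming a per-agent trial counter that tracks how long each position has gone without improvement (the counter itself is an implementation detail not described above):

```python
import numpy as np

def scout_bee_phase(positions, fitness, trials, limit, x_min, x_max, objective,
                    rng=np.random.default_rng()):
    """Re-initialize, as in Eq. (4), every agent whose position stalled for `limit` iterations."""
    for i in np.where(np.asarray(trials) >= limit)[0]:
        positions[i] = x_min + rng.random(positions.shape[1]) * (x_max - x_min)
        fitness[i] = objective(positions[i])
        trials[i] = 0
    return positions, fitness, trials
```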

3 K-Means Clustering

The k-means clustering algorithm groups similar samples together, and dissimilar samples apart, into k distinct clusters. Similarity and dissimilarity are evaluated on the basis of what the samples represent: for plain numbers it can be their values, and for complex objects such as words or images it can be their attributes.

From an implementation point of view, k-means clustering can be described by four simple steps applied in sequence; a minimal sketch in code follows the list.

1. Select k points, either randomly or from the samples, such that they are not too close to each other and lie within the boundaries of the sample space. These k points represent the centroids of the k clusters.

2. Assign each sample to the cluster whose centroid is closest to it.

3. For each cluster, compute a new centroid by taking the mean of all samples assigned to it.

4. Repeat steps (2) and (3) until the cluster centroids no longer change.
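A minimal sketch of these four steps, assuming numeric feature vectors and Euclidean distance:

```python
import numpy as np

def k_means(samples, k, max_iter=100, rng=np.random.default_rng()):
    """Plain k-means following the four steps above (Euclidean distance assumed)."""
    # Step 1: pick k distinct samples as the initial centroids
    centroids = samples[rng.choice(len(samples), size=k, replace=False)]
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest centroid
        distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        new_centroids = np.array([
            samples[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```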

4 Proposed Method

This paper introduces a clustering method optimized with ABC to detect spam reviews. The proposed method is divided into the following phases.

1. Preprocessing

2. Feature Extraction

3. Proposed feature selection using k-means ABC

4. Proposed Artificial Bee Colony Optimizer with k-Means

5. Testing (Fig. 1).

Fig. 1 Experiment process flowchart

4.1 Preprocessing

Online reviews often contain noise or words that do not add any meaning to our model, so instead of wasting resources evaluating such entities, we remove them entirely. The following operations are performed in preprocessing; a minimal sketch of these steps follows the list.

1. Convert all words into lowercase.

2. Remove stop words like is, or, as, etc.

3. Remove symbols like pound, asterisk, etc.

4. Remove punctuation.

5. Replace continuous white spaces with one white space.
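A minimal sketch of these cleaning steps on a single review string; the stop-word set here is illustrative only, and a real run would use a full list (e.g., NLTK's).

```python
import re

STOP_WORDS = {"is", "or", "as", "the", "a", "an", "and", "of", "to"}  # illustrative subset

def preprocess(review):
    """Apply the five cleaning steps listed above to one raw review string."""
    text = review.lower()                                      # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                      # 3, 4. drop symbols and punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]   # 2. remove stop words
    return " ".join(words)                                     # 5. collapse repeated whitespace

print(preprocess("The food was GREAT!!!  #tasty, would recommend."))
```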

4.2 Feature Extraction

Feature extraction is done using Linguistic Inquiry and Word Count (LIWC).

4.3 Feature Selection

Feature selection is used to remove redundant, noisy, and less significant features so that the learning model can achieve higher accuracy in a reasonable time. Higher-dimensional data generally increases training time without a significant gain in accuracy (and sometimes even reduces it). For these two reasons, feature selection is desirable in most machine learning models.

K-means combined with ABC is used to find the optimal feature set according to Algorithm 1. In the proposed method, the following steps are taken; a hedged sketch of the fitness evaluation follows the list.

1. Every search agent is initialized with a random feature subset, and its fitness value is calculated using Algorithm 2.

2. Algorithm 2 in turn uses Algorithm 3 for labeling the cluster heads and Algorithm 4 for calculating the accuracy.

3. In the employed bee phase, each search agent's feature subset is changed by replacing one of its current features with a new one. The new fitness is calculated, and the better of the current and new subsets is kept.

4. In the onlooker bee phase, search agents with a higher probability of holding a good solution are given the chance to update their feature sets.

5. In the scout bee phase, search agents whose feature set has not been updated for a predefined number of iterations are reinitialized randomly.

6. After the maximum number of iterations, the optimal set of features is returned.
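Since Algorithms 1-4 are given as figures, the sketch below only approximates the fitness evaluation they describe: cluster the selected feature columns with k-means, label each cluster by the majority true label of its members, and return the resulting accuracy. The function and parameter names are ours, and scikit-learn's KMeans stands in for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def subset_fitness(X, y, feature_mask, k=2, seed=0):
    """Approximate fitness of one candidate feature subset (cf. Algorithms 2-4)."""
    X_sub = X[:, feature_mask]                                   # keep only the selected features
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_sub)
    predictions = np.empty_like(y)
    for c in range(k):
        members = cluster_ids == c
        if not members.any():
            continue
        values, counts = np.unique(y[members], return_counts=True)
        predictions[members] = values[counts.argmax()]           # majority label per cluster
    return (predictions == y).mean()                             # accuracy used as fitness
```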

Algorithms 1-4 (pseudocode figures)

4.4 Divide Samples into Training and Testing

The original samples are divided into training and testing samples in a 7:3 ratio.
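A one-line sketch of this split, assuming scikit-learn is available; the placeholder arrays and the stratification are illustrative additions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 10)              # placeholder review feature matrix
y = np.random.randint(0, 2, size=100)    # placeholder spam/ham labels

# 7:3 split; stratify keeps the spam/ham proportions similar in both parts (our assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```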

4.5 Proposed Artificial Bee Colony Optimizer with K-Means

The proposed method uses ABC to find optimal cluster heads to classify reviews into spam and ham. For finding the optimal cluster heads, Algorithm 5 is used, in which each search agent represents the positions of two cluster heads, one for spam and one for ham. The following are the steps of Algorithm 5; a hedged sketch of the fitness evaluation follows the list.

1. Initialize each search agent randomly within the search space and find its fitness using Algorithm 6.

2. Algorithm 6 uses Algorithm 3 for labeling the clusters and Algorithm 4 for calculating accuracy.

3. In the employed bee phase, each search agent is given a chance to update its position using Eq. (2). The new fitness value is compared with the current fitness value, and the better of the two positions is kept.

4. In the onlooker bee phase, search agents with a higher probability of providing a good solution are given the chance to update their positions. The probability for each search agent is calculated using Eq. (3).

5. In the scout bee phase, search agents whose position has not been updated for a predefined number of iterations are reinitialized within the search space.

6. After the maximum number of iterations, the optimal cluster head positions are returned.
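As Algorithms 5 and 6 are given as figures, the following is only a hedged sketch of the fitness evaluation for one search agent: the agent encodes the two cluster heads, samples are assigned to the nearer head, clusters are labeled by majority vote, and accuracy is returned as the fitness.

```python
import numpy as np

def cluster_head_fitness(agent, X_train, y_train):
    """Approximate fitness of one search agent encoding two cluster heads (cf. Algorithm 6)."""
    heads = agent.reshape(2, -1)                                  # spam head and ham head, stacked
    distances = np.linalg.norm(X_train[:, None, :] - heads[None, :, :], axis=2)
    assigned = distances.argmin(axis=1)                           # nearest head for every sample
    predictions = np.empty_like(y_train)
    for c in range(2):
        members = assigned == c
        if not members.any():
            continue
        values, counts = np.unique(y_train[members], return_counts=True)
        predictions[members] = values[counts.argmax()]            # majority label per cluster
    return (predictions == y_train).mean()                        # accuracy used as fitness
```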

Algorithms 5 and 6 (pseudocode figures)

4.6 Testing

To test the effectiveness of the cluster heads provided by Algorithm 5, Algorithm 7 is used.

Algorithm 7 (pseudocode figure)

5 Experimental Results

The proposed method is tested on three datasets, namely Synthetic spam [11], Yelp [12, 13], and Movie [14], presented in Table 2. The synthetic spam dataset is taken from the Database and Information Systems Laboratory, University of Illinois, and labeled using the synthetic review spamming method. The Yelp dataset is a subset of restaurant and hotel data, and the movie reviews are a subset of the IMDB dataset. All experiments are run in Python 3.6 on an Intel Core i5 processor with 6 GB of RAM (Table 1).

To evaluate the effectiveness of the proposed method, the numbers of true positive, true negative, false positive, and false negative predictions are counted and used to calculate accuracy, precision, and recall.

• True positive: a spam review correctly predicted as spam.

• True negative: a ham review correctly predicted as ham.

• False positive: a ham review incorrectly predicted as spam.

• False negative: a spam review incorrectly predicted as ham.

These four counts together form the confusion matrix, and based on this confusion matrix, precision, recall, and accuracy are computed using Eqs. (5)–(7), respectively.

The proposed model is run a total of ten times on each dataset, and the average values are used to evaluate its overall performance. Tables 3, 4, and 5 show the results on the synthetic spam review dataset, the movie review dataset, and the Yelp dataset, respectively. Figure 2 shows the performance of the proposed model on all three datasets with different feature set sizes (Fig. 3; Tables 6 and 7).

$$\begin{aligned} {\text {Precision}} = \dfrac{\text {TP}}{{\text {TP}} + {\text {FP}}} \end{aligned}$$
(5)
$$\begin{aligned} \mathrm{{Recall}} = \dfrac{\mathrm{{TP}}}{\mathrm{{TP}} + \mathrm{{FN}}} \end{aligned}$$
(6)
$$\begin{aligned} \mathrm{{Accuracy}} = \dfrac{\mathrm{{TP}} + \mathrm{{TN}}}{\mathrm{{TP}} + \mathrm{{TN}} + \mathrm{{FN}}+ \mathrm{{FP}}} \end{aligned}$$
(7)
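These three measures follow directly from the confusion-matrix counts, as the small example below shows.

```python
def precision_recall_accuracy(tp, tn, fp, fn):
    """Eqs. (5)-(7) computed directly from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# e.g. 40 spam caught, 45 ham kept, 5 ham wrongly flagged, 10 spam missed
print(precision_recall_accuracy(tp=40, tn=45, fp=5, fn=10))   # (0.888..., 0.8, 0.85)
```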
Table 1 Parameters for k-Means ABC
Table 2 Datasets and their composition
Table 3 Synthetic spam review results
Table 4 Movie results
Table 5 Yelp results
Fig. 2 Number of features versus accuracy graph

Fig. 3 Average computation time for one run

Table 6 Optimum number of features in each dataset
Table 7 Average computation time for one run

6 Conclusion

In this paper, we have introduced a novel approach that combines k-means and the artificial bee colony algorithm for feature selection and cluster head optimization to detect spam reviews. The proposed method was tested on three different datasets and gave respectable results, which shows its potential. In our proposed model, we train on a snapshot of the data, which makes the model effective for the current trend only. In the future, the work can be extended so that the model continuously updates its knowledge over time. For feature selection, other optimization algorithms can be explored to obtain a more optimal set of features.