
1 Introduction

Over the last two decades, the Internet has changed, largely for the better, the way we communicate, interact with our peers, and do business [1]. This is possible because of the reach of the Internet and its ease of use. One major boom of the last decade is e-commerce, which harnesses the power of the Internet to enlarge the pool of available customers: by offering products and services online, a business is limited only by the reach of the Internet itself. Out of all this, a new currency for online businesses has emerged, the product review, or simply the review.

In today's online business, reviews play a major role in whether a product or service can make its place in the market [2]. A customer who wants to buy a product or use a service through the Internet will first go through its reviews. If the reviews do not give the customer enough confidence in the authenticity and reliability of the product, he or she is highly likely not to buy it and to look for something else or somewhere else.

Scenarios like the one above have encouraged online businesses to understand the experience of their customers and to improve their products or services in order to serve them better in the future. A technology that helps achieve this goal is opinion mining [3]. Opinion mining allows businesses to analyze their customers by identifying their attitude toward a product or service. These techniques use the reviews given by customers and infer attitudes from them, but they rely on an underlying assumption that all reviews are trustworthy and authentic. Unfortunately, that is often not the case: many product pages on the web are affected by 'spam reviews', which degrade the output of opinion mining techniques and defeat the purpose of using them in the first place.

Jindal and Liu [4] categorized spam reviews into three categories, which have varying degrees of effect on a customer's decision to buy a product. 'Positive spam reviews' and 'negative spam reviews' are the most effective at changing a buyer's decision on the basis of false information. As the names suggest, a 'positive spam review' gives an undeserved positive review to a product, and a 'negative spam review' gives an undeserved negative review to a product.

Consider a situation in which a restaurant fails to provide a good dining experience to one of its customers, and as a result the customer posts a bad review about the restaurant's hospitality on a site like Zomato or Yelp. Since this can be bad for business, the restaurant hires someone to post positive reviews about its dining experience and dilute the effect of that one honest review. Such tactics can be used not only to protect one's own business but also to damage the business of a rival. Spam reviews are designed to change someone's opinion about something on false grounds, which can be damaging for both customers and businesses.

In 2017, Rajamohana et al. [5] proposed a model using an adaptive binary flower pollination algorithm for feature selection, with the accuracy of a naive Bayes classifier as the objective function and a k-nearest neighbors classifier trained on the selected features. In 2019, Pandey and Rajpoot [6] proposed a model using spiral cuckoo search to optimize the k-means algorithm, with the sum of squared errors as the objective function. This literature motivates the present work.

The rest of the paper is structured as follows: the artificial bee colony algorithm is reviewed in Sect. 2. In Sect. 3, k-means clustering is reviewed. The proposed feature selection and cluster head optimization using the artificial bee colony algorithm are described in Sect. 4. Experimental results are discussed in Sect. 5. Finally, the conclusion is given in Sect. 6.

2 Artificial Bee Colony Algorithm

The artificial bee colony (ABC) algorithm is a swarm-based algorithm that mimics the intelligent foraging behavior of honeybees. Honeybees forage by coordinating with each other and exchanging the locations of food sources, i.e., honey. To exchange this information, they dance in a particular area of the hive called the dancing area; the dance itself is called the waggle dance [7]. Sharing the locations of food sources allows the bees to work collectively and efficiently. This gathering, exchange of information, and collective work is known as collective intelligence, and it is the reason we mimic their behavior to solve complex problems. Recently, the ABC algorithm has been modified for various applications and successfully applied to complex problems [8,9,10].

To understand the implementation of the foraging behavior of honeybees, the whole process can be divided into four phases, as discussed below.

2.1 Initialization Phase

This is the first phase of the algorithm and is executed only once. In this phase, the position of each search agent is randomly initialized within the search space according to Eq. (1).

Once the positions have been initialized, the fitness of each search agent is calculated.

$$\begin{aligned} x^{j}_{i} = x^{j}_\mathrm{{min}} + \mathrm{{rand}}(0,1)(x^{j}_\mathrm{{max}} -x^{j}_\mathrm{{min}}), \quad \forall j = 1,2,\ldots ,D \end{aligned}$$
(1)
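As an illustration, the following minimal NumPy sketch initializes a population within per-dimension bounds as in Eq. (1); the names n_agents, x_min, and x_max are chosen here for illustration and are not taken from the paper.

```python
import numpy as np

def initialize_positions(n_agents, dim, x_min, x_max, rng=np.random.default_rng()):
    """Place every search agent uniformly at random inside the bounds, as in Eq. (1)."""
    # x_min and x_max are length-`dim` arrays of per-dimension lower and upper bounds
    return x_min + rng.random((n_agents, dim)) * (x_max - x_min)

# e.g. 20 agents in a 5-dimensional search space bounded by [0, 1] in every dimension
positions = initialize_positions(20, 5, np.zeros(5), np.ones(5))
```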

2.2 Employed Bee Phase

This is the second phase of the algorithm and is executed in every iteration. In this phase, each search agent generates a new candidate position and evaluates its fitness. If the new position is better than the current one, the agent keeps the new position; otherwise it keeps the old one. The new position is generated using Eq. (2)

$$\begin{aligned} v_{ij} = x_{ij} + \phi _{ij}(x_{ij} - x_{kj}) \end{aligned}$$
(2)
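A hedged sketch of this phase is shown below; it assumes that fitness is to be maximized and that only one randomly chosen dimension is perturbed per agent, which is the common ABC convention but is not spelled out above.

```python
import numpy as np

def employed_bee_phase(positions, fitness, objective, rng=np.random.default_rng()):
    """Greedy neighbourhood search of Eq. (2): nudge one dimension toward a random partner."""
    n_agents, dim = positions.shape
    for i in range(n_agents):
        k = rng.choice([a for a in range(n_agents) if a != i])  # random partner, k != i
        j = rng.integers(dim)                                   # random dimension to perturb
        phi = rng.uniform(-1.0, 1.0)
        candidate = positions[i].copy()
        candidate[j] = positions[i, j] + phi * (positions[i, j] - positions[k, j])
        new_fitness = objective(candidate)
        if new_fitness > fitness[i]:        # keep the better of the two positions
            positions[i], fitness[i] = candidate, new_fitness
    return positions, fitness
```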

2.3 Onlooker Bee Phase

This is the third phase of the algorithm, and it is also executed in every iteration. In this phase, a search agent is selected with a probability proportional to its fitness, according to Eq. (3).

$$\begin{aligned} p_{i} = \dfrac{\mathrm{{fitness}}_{i}}{\sum _{n=1}^{N}\mathrm{{fitness}}_{n}} \end{aligned}$$
(3)

Once a search agent has been selected, an onlooker bee updates that agent's position and evaluates its fitness. If the new position is better than the current position, the new position is kept; otherwise the old one is retained.
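A minimal sketch of the selection step, assuming the roulette-wheel scheme that Eq. (3) implies (the selected agents are then refined exactly as in the employed bee phase):

```python
import numpy as np

def selection_probabilities(fitness):
    """Eq. (3): each agent's share of the swarm's total fitness."""
    fitness = np.asarray(fitness, dtype=float)
    return fitness / fitness.sum()

# roulette-wheel selection: fitter agents are more likely to be picked by onlooker bees
rng = np.random.default_rng()
probs = selection_probabilities([0.9, 0.4, 0.7])
chosen = rng.choice(len(probs), size=len(probs), p=probs)
```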

2.4 Scout Bee Phase

This is the fourth phase of the algorithm, and it is executed only when a search agent's position has not changed for a predetermined number of iterations. Any search agent that enters this phase becomes a scout bee and must find a new position. To do so, it is randomly reinitialized within the search space using Eq. (4).

$$\begin{aligned} x^{j}_{i} = x^{j}_\mathrm{{min}} + \mathrm{{rand}}(0,1)(x^{j}_\mathrm{{max}} -x^{j}_\mathrm{{min}}), \quad \forall j = 1,2,\ldots ,D \end{aligned}$$
(4)
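A small sketch of this phase, assuming a per-agent trial counter that tracks how long each position has gone without improvement (the counter itself is an implementation detail not described above):

```python
import numpy as np

def scout_bee_phase(positions, fitness, trials, limit, x_min, x_max, objective,
                    rng=np.random.default_rng()):
    """Re-initialize, as in Eq. (4), every agent whose position stalled for `limit` iterations."""
    for i in np.where(np.asarray(trials) >= limit)[0]:
        positions[i] = x_min + rng.random(positions.shape[1]) * (x_max - x_min)
        fitness[i] = objective(positions[i])
        trials[i] = 0
    return positions, fitness, trials
```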

3 K-Means Clustering

The k-means clustering algorithm groups similar samples together, and dissimilar samples apart, into k distinct clusters. Similarity and dissimilarity are evaluated on the basis of what the samples represent: for plain numbers it can be their values, and for complex objects such as words or images it can be their attributes.

From an implementation point of view, k-means clustering can be described by four simple steps applied in sequence; a minimal sketch in code follows the list.

1. Select k points, either randomly or from the samples, such that they are not too close to each other and lie within the boundaries of the sample space. These k points represent the centroids of the k clusters.

2. Assign each sample to the cluster whose centroid is closest to it.

3. For each cluster, compute a new centroid by taking the mean of all samples assigned to it.

4. Repeat steps (2) and (3) until the cluster centroids no longer change.
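A minimal sketch of these four steps, assuming numeric feature vectors and Euclidean distance:

```python
import numpy as np

def k_means(samples, k, max_iter=100, rng=np.random.default_rng()):
    """Plain k-means following the four steps above (Euclidean distance assumed)."""
    # Step 1: pick k distinct samples as the initial centroids
    centroids = samples[rng.choice(len(samples), size=k, replace=False)]
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest centroid
        distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        new_centroids = np.array([
            samples[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```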

4 Proposed Method

This paper introduces a clustering method optimized with ABC to detect spam reviews. The proposed method is divided into the following phases.

1. Preprocessing

2. Feature Extraction

3. Proposed feature selection using k-means ABC

4. Proposed Artificial Bee Colony Optimizer with k-Means

5. Testing (Fig. 1).

Fig. 1 Experiment process flowchart

4.1 Preprocessing

Online reviews often contain noise or words that do not add any meaning to our model, so instead of wasting resources evaluating such entities, we remove them entirely. The following operations are performed in preprocessing; a minimal sketch of these steps follows the list.

1. Convert all words into lowercase.

2. Remove stop words like is, or, as, etc.

3. Remove symbols like pound, asterisk, etc.

4. Remove punctuation.

5. Replace continuous white spaces with one white space.
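A minimal sketch of these cleaning steps on a single review string; the stop-word set here is illustrative only, and a real run would use a full list (e.g., NLTK's).

```python
import re

STOP_WORDS = {"is", "or", "as", "the", "a", "an", "and", "of", "to"}  # illustrative subset

def preprocess(review):
    """Apply the five cleaning steps listed above to one raw review string."""
    text = review.lower()                                      # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                      # 3, 4. drop symbols and punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]   # 2. remove stop words
    return " ".join(words)                                     # 5. collapse repeated whitespace

print(preprocess("The food was GREAT!!!  #tasty, would recommend."))
```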

4.2 Feature Extraction

Feature extraction is done using Linguistic Inquiry and Word Count (LIWC).

4.3 Feature Selection

Feature selection is used to remove redundant, noisy, and less significant features so that the learning model can achieve higher accuracy in a reasonable time. Higher-dimensional data generally increases training time without a significant gain in accuracy (and sometimes even reduces it). For these two reasons, feature selection is desirable in most machine learning models.

K-means combined with ABC is used to find the optimal feature set according to Algorithm 1. In the proposed method, the following steps are taken; a hedged sketch of the fitness evaluation follows the list.

1. Every search agent is initialized with a random feature subset, and its fitness value is calculated using Algorithm 2.

2. Algorithm 2 in turn uses Algorithm 3 for labeling the cluster heads and Algorithm 4 for calculating the accuracy.

3. In the employed bee phase, each search agent's feature subset is changed by replacing one of its current features with a new one. The new fitness is calculated, and the better of the current and new subsets is kept.

4. In the onlooker bee phase, search agents with a higher probability of holding a good solution are given the chance to update their feature sets.

5. In the scout bee phase, search agents whose feature set has not been updated for a predefined number of iterations are reinitialized randomly.

6. After the maximum number of iterations, the optimal set of features is returned.
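Since Algorithms 1-4 are given as figures, the sketch below only approximates the fitness evaluation they describe: cluster the selected feature columns with k-means, label each cluster by the majority true label of its members, and return the resulting accuracy. The function and parameter names are ours, and scikit-learn's KMeans stands in for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def subset_fitness(X, y, feature_mask, k=2, seed=0):
    """Approximate fitness of one candidate feature subset (cf. Algorithms 2-4)."""
    X_sub = X[:, feature_mask]                                   # keep only the selected features
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_sub)
    predictions = np.empty_like(y)
    for c in range(k):
        members = cluster_ids == c
        if not members.any():
            continue
        values, counts = np.unique(y[members], return_counts=True)
        predictions[members] = values[counts.argmax()]           # majority label per cluster
    return (predictions == y).mean()                             # accuracy used as fitness
```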

Algorithms 1-4 (pseudocode figures)

4.4 Divide Samples into Training and Testing

The original samples are divided into training and testing samples in a 7:3 ratio.
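A one-line sketch of this split, assuming scikit-learn is available; the placeholder arrays and the stratification are illustrative additions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 10)              # placeholder review feature matrix
y = np.random.randint(0, 2, size=100)    # placeholder spam/ham labels

# 7:3 split; stratify keeps the spam/ham proportions similar in both parts (our assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```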

4.5 Proposed Artificial Bee Colony Optimizer with K-Means

The proposed method uses ABC to find optimal cluster heads to classify reviews into spam and ham. For finding the optimal cluster heads, Algorithm 5 is used, in which each search agent represents the positions of two cluster heads, one for spam and one for ham. The following are the steps of Algorithm 5; a hedged sketch of the fitness evaluation follows the list.

1. Initialize each search agent randomly within the search space and find its fitness using Algorithm 6.

2. Algorithm 6 uses Algorithm 3 for labeling the clusters and Algorithm 4 for calculating accuracy.

3. In the employed bee phase, each search agent is given a chance to update its position using Eq. (2). The new fitness value is compared with the current fitness value, and the better of the two positions is kept.

4. In the onlooker bee phase, search agents with a higher probability of providing a good solution are given the chance to update their positions. The probability for each search agent is calculated using Eq. (3).

5. In the scout bee phase, search agents whose position has not been updated for a predefined number of iterations are reinitialized within the search space.

6. After the maximum number of iterations, the optimal cluster head positions are returned.
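As Algorithms 5 and 6 are given as figures, the following is only a hedged sketch of the fitness evaluation for one search agent: the agent encodes the two cluster heads, samples are assigned to the nearer head, clusters are labeled by majority vote, and accuracy is returned as the fitness.

```python
import numpy as np

def cluster_head_fitness(agent, X_train, y_train):
    """Approximate fitness of one search agent encoding two cluster heads (cf. Algorithm 6)."""
    heads = agent.reshape(2, -1)                                  # spam head and ham head, stacked
    distances = np.linalg.norm(X_train[:, None, :] - heads[None, :, :], axis=2)
    assigned = distances.argmin(axis=1)                           # nearest head for every sample
    predictions = np.empty_like(y_train)
    for c in range(2):
        members = assigned == c
        if not members.any():
            continue
        values, counts = np.unique(y_train[members], return_counts=True)
        predictions[members] = values[counts.argmax()]            # majority label per cluster
    return (predictions == y_train).mean()                        # accuracy used as fitness
```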

Algorithms 5 and 6 (pseudocode figures)

4.6 Testing

To test the effectiveness of the cluster heads provided by Algorithm 5, Algorithm 7 is used.

Algorithm 7 (pseudocode figure)

5 Experimental Results

The proposed method is tested on three datasets, namely Synthetic spam [11], Yelp [12, 13], and Movie [14], presented in Table 2. The synthetic spam dataset is taken from the Database and Information Systems Laboratory, University of Illinois, and labeled using the synthetic review spamming method. The Yelp dataset is a subset of restaurant and hotel data, and the movie reviews are a subset of the IMDB dataset. All experiments are run in Python 3.6 on an Intel Core i5 processor with 6 GB of RAM (Table 1).

To evaluate the effectiveness of the proposed method, the numbers of true positive, true negative, false positive, and false negative predictions are counted and used to calculate accuracy, precision, and recall.

• True positive: a spam review correctly predicted as spam.

• True negative: a ham review correctly predicted as ham.

• False positive: a ham review incorrectly predicted as spam.

• False negative: a spam review incorrectly predicted as ham.

These four counts together form the confusion matrix, and based on this confusion matrix, precision, recall, and accuracy are computed using Eqs. (5)–(7), respectively.

The proposed model is run a total of ten times on each dataset, and the average values are used to evaluate its overall performance. Tables 3, 4, and 5 show the results on the synthetic spam review dataset, the movie review dataset, and the Yelp dataset, respectively. Figure 2 shows the performance of the proposed model on all three datasets with different feature set sizes (Fig. 3; Tables 6 and 7).

$$\begin{aligned} {\text {Precision}} = \dfrac{\text {TP}}{{\text {TP}} + {\text {FP}}} \end{aligned}$$
(5)
$$\begin{aligned} \mathrm{{Recall}} = \dfrac{\mathrm{{TP}}}{\mathrm{{TP}} + \mathrm{{FN}}} \end{aligned}$$
(6)
$$\begin{aligned} \mathrm{{Accuracy}} = \dfrac{\mathrm{{TP}} + \mathrm{{TN}}}{\mathrm{{TP}} + \mathrm{{TN}} + \mathrm{{FN}}+ \mathrm{{FP}}} \end{aligned}$$
(7)
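These three measures follow directly from the confusion-matrix counts, as the small example below shows.

```python
def precision_recall_accuracy(tp, tn, fp, fn):
    """Eqs. (5)-(7) computed directly from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# e.g. 40 spam caught, 45 ham kept, 5 ham wrongly flagged, 10 spam missed
print(precision_recall_accuracy(tp=40, tn=45, fp=5, fn=10))   # (0.888..., 0.8, 0.85)
```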
Table 1 Parameters for k-Means ABC
Table 2 Datasets and their composition
Table 3 Synthetic spam review results
Table 4 Movie results
Table 5 Yelp results
Fig. 2 Number of features versus accuracy graph

Fig. 3 Average computation time for one run

Table 6 Optimum number of features in each dataset
Table 7 Average computation time for one run

6 Conclusion

In this paper, we have introduced a novel approach that combines k-means and the artificial bee colony algorithm for feature selection and cluster head optimization to detect spam reviews. The proposed method was tested on three different datasets and gave respectable results, which shows its potential. In our proposed model, we train on a snapshot of the data, which makes the model effective for the current trend only. In the future, the work can be extended so that the model continuously updates its knowledge over time. For feature selection, other optimization algorithms can be explored to obtain a more optimal set of features.