Keywords

1 Introduction

With the continuous development of technology and the ubiquitous presence of network-based services in our everyday life, it has become very common to make online purchases of products and services. With the rapid spread of e-commerce services, user reviews have become one of the most influential factors in purchase decisions of customers [14]. Consequently, e-commerce marketplaces are nowadays the most important target of spammers, which have the malicious goal of manipulating the reputation of products and brands, to either promote or criticize products and services. Positive and negative opinions can greatly influence a company’s business. For this reason, review spam detection is a crucial problem to address for guaranteeing the reliability of products and services.

Spam reviews are described as untruthful or deceptive opinions that are posted on online commerce platforms in an attempt to manipulate the public perception (in a positive or negative manner) of specific products or services presented on the affected platforms [11]. In the last years, automatic review spam detection has attracted the attention of machine learning, natural language processing and deep learning researchers due to the difficulty of recognizing fake reviews by manually reading their content [9]. On the other hand, over a few years, professional spammers have greatly increased and improved their writing techniques, to evade detection tools that base review spam detection on the analysis of text content only. Most of the existing review spam detection approaches focus on extracting robust, engineered features from both review contents and reviewer behaviours [2, 7, 17, 23], but in the past decade, several approaches have been developed, to leverage the social interaction between users and enhance the feature space of review spam detection problems with contextual information.

In this paper, we perform a step forward in this “social” research direction. In fact, we investigate the use of a heterogeneous graph representation of review data, to capture the relationships between products, users and reviews and gain accuracy in problems of review spam detection. For this purpose, we propose a Graph Convolutional Network (GCN) approach, named GLORIA (Graph convoLutiOnal Network for RevIew spAm), that learns spatial convolutions on the graph representation of review data, to take advantage of the expressive structural information enclosed in graphs. In particular, the proposed approach implements a heterogeneous bipartite graph used as input to a Crystal GCN [28]. This architecture has been proven effective in the context of chemical material property prediction [8]. Traditional GNN algorithms perform convolutions using a shared weight matrix for all neighbours of a node by neglecting the difference of interaction between neighbours. Instead the Crystal GNN first aggregates neighbour vectors and then performs convolutions on the aggregated neighbour vectors. To the best of our knowledge, this is the first study that explores the use of Crystal GCN in review spam detection problems by showing how the proposed approach can gain accuracy compared to shallow and deep neural models trained neglecting the graph structure of data. An issue of review spam detection problems is that spam data are highly skewed. The imbalance of malicious data is a common condition in several cybersecurity problems (e.g., malware [4], fraud [22] or intrusion [3] detection), as well as in remote sensing problems (e.g., [6]). In this study, we handle the imbalanced condition of review spam data by training the Crystal GCN model with the sigmoid focal loss. This choice bases on [20] that shows how the sigmoid focal loss can help a neural model in focusing on rare samples. In this study, we show that the sigmoid focal loss is better suited than the traditional cross-entropy loss, to handle the imbalance condition of review spam data. Finally, we analyse the topological structure of graphs by showing how the exploration of the centrality of products and users in the graph representation of review data may disclose useful knowledge to explain characteristics of reviews and possible spam patterns.

The paper is organized as follows. Section 2 overviews the related work. Section 3 describes the proposed approach. Section 4 describes the benchmark data collections adopted in the experimental study, describes the experimental setting and discusses the relevant results. Finally, Sect. 5 draws conclusions and outlines the future directions of this work.

2 Related Work

The research in the field of review spam detection has received great attention in the last years. Several machine learning approaches have been recently designed to disentangle spam reviews from non-spam reviews [9, 16]. In particular, the seminal study of [17] started the investigation of the task of review spam detection in the context of product reviews.

Recent research trends have started exploring deep learning approaches in problems of review spam detection [2, 5, 7, 24, 31]. In [2, 5] a multi-view, deep learning approach is described for review spam detection. The proposed approach combines embeddings of textual features, extracted with Word2Vec and BERT models, and behavioural reviewer features to improve the accuracy of a review classifier trained through a multi-input, deep neural network. [7] describes a combination of Word2Vec and Convolutional Neural Networks (CNNs), to learn a document-level representation of reviews. Finally, a Bi-directional LSTM is used for review classification. The work in [24] adopts word embeddings trained on an Amazon review dataset using the Continuous Bag-of-Words (CBOW) algorithm. Finally, it trains a review classification model that combines CNNs and Gated Recurrent Neural Networks. A CNN is also trained in [31], to extract semantic information from the text of reviews by exploiting convolution and pooling operations.

Although all the above studies describe feature-based approaches that rely on an effective way to extract and learn features (from both reviews and reviewers), they ignore relationships between users, products and reviews. On the other hand, a few recent studies have started the investigation of the effectiveness of graphs as data modelling approaches of review spam data. The study of [27] first adopts a heterogeneous graph to represent reviewers, reviews and stores, through different categories of nodes. The review graph is used to infer the truthiness of reviews, honesty of reviewers and reliability of stores. [26] explores an unsupervised review spam detection approach that resorts to clustering, to identify communities of users with similar spam behaviours. [25] describes the use of a heterogeneous graph to connect users to reviews and analyses how graph meta-paths may help in recognizing review spam.

The recent studies that have adopted a graph representation of review data have also paved the way for leveraging GCNs in review spam detection problems. Although GCNs have recently gained great attention in several domains (e.g., recommendation systems [29] and chemical properties predictions [28]), a few studies have explored GCNs in review spam detection problems [1, 19, 30]. [1] trains a GCN for node classification in spam bot detection problems. This study adopts a social graph representation of relationships between Twitter users (represented as nodes) and leverages both feature nodes and relationships between neighbour nodes for training a GCN that addresses spam bot detection as a problem of graph node classification. A social graph of Twitter user relationships is also adopted in [30] in combination with an Attention-based Graph Neural Network trained for spam bot detection. So both these studies consider a problem of graph node classification, and train a GCN to recognize spam bot communities. They label all messages produced by the member of bot communities in the spam class. Differently, our study accounts that a malicious user does not necessarily only produce review spam. Based upon this consideration, we focus on classifying reviews, instead of classifying reviewers.

Finally, [19] studies the review spam detection problem for Xianyu, that is one of the largest second-hand goods apps in China. In Xianyu, reviews are communication tools for buyers and sellers and the review action usually happens before purchases. As recognised by the authors of [19], this is different from the common use of reviews in other e-commerce systems, also considered in this study, where reviews are usually made by customers who have bought the products. Accounting for the peculiar characteristics of reviews in Xianyu, [19] adopts two graph representations of Xianyu reviews: a heterogeneous graph modelling relationships between users and review items and a homogeneous graph modelling similarities between review items. A review item denotes a review topic (e.g., “iPhone 6s”) that is associated in Xianyu with a sequence of review comments produced by (multiple) users on the specific topic. The nodes of the homogeneous graph are associated with the content features extracted from review items. [19] concatenates embeddings extracted through the GCN trained on two graphs to obtain the feature vector for the final classification of the review item. Differently, we consider the traditional e-commerce perspective with reviews written by users on products. So we use a single heterogeneous graph to represent reviews as relationships between users and products, and we associate review features to edges, while characteristics of users and products to nodes. Finally, we train a GCN model for graph edge classification, to classify each single review message.

3 The Proposed Method

In this Section, we describe the GLORIA approach. It adopts a graph representation of review data, where users and products are represented as heterogeneous graph nodes and the reviews as graph edges. Hence, GLORIA implements a GCN for edge classification in heterogeneous graphs.

Let us consider the input graph representation of review data as a heterogeneous bipartite graph defined as \(\textbf{G}=(U, P, R)\), where U, P and R correspond to the set of user nodes, product nodes and edges, respectively. Each edge \(r=(i, j)\in R\) with \({R}\subseteq (U\times P) \cup (P \times U)\) defines the undirect relationship between a user node and a product node to express that the user reviewed the product (and the product is reviewed by the reviewer). In addition, let us consider three mapping functions: \(\phi _U:U \rightarrow \mathbf {X^U}\) that associates each node \(u\in U\) to a feature vector in \(\mathbf {X^U}\), \(\phi _P:P \rightarrow \mathbf {X^P}\) that associates each node \(p\in P\) to a feature vector in \(\mathbf {X^P}\) and \(\phi _R:R \rightarrow \mathbf {X^R}\) that associates each edge \(r \in R\) to a feature vector in \(\mathbf {X^R}\). We process this graph representation of review data to train a GCN for edge classification.

The GCN takes both node feature vectors, edge feature vectors and adjacency matrix as input and passes them through a series of L layers. At each layer l, node embeddings are updated according with the Eq. 1 to create an intermediate hidden representation \(\textbf{h}^l\). In particular, at each \(\textbf{h}^l\), the GCN of GLORIA applies a crystal graph convolutional operator [28]. For each node \(i\in U \cup P\), for each layer l, this operator learns a function \(\textbf{h}^{l}_i = \textbf{h}^{l-1}_i + f(i)\) defined on the previous hidden \(l-1\) layer. f() is formulated as follows:

$$\begin{aligned} f(i)= AGG_{(j \in \mathcal {N}(i))^k} (\sigma \left( \mathbf {z^l}_{(i,j)^k} \mathbf {W^l}_\sigma + \mathbf {b^l}_\sigma \right) \odot g \left( \mathbf {z^l}_{(i,j)^k} \mathbf {W^l}_g + \mathbf {b^l}_g \right) ) \end{aligned}$$
(1)

where \(\mathcal {N}(i)\) is the set of neighbours of node i (i.e., \(\mathcal {N}(i)=\{j\vert (i,j) \in R\}\)) and \(\mathbf {z^l}_{(i,j)^k} = \textbf{h}_i^l \oplus \textbf{h}_j^l \oplus \phi _R (i,j)^k\) is the concatenation of embeddings computed at layer l for the feature vectors associated with node i, neighbour node \(j\in \mathcal {N}(i)\) and the feature vector associated with the k-th edge between i and j. If \(l=0\) then embeddings return original feature vectors computed with \(\phi _U\) and \(\phi _P\) for user nodes and product nodes, respectively. \(\odot \) denotes the element-wise multiplication, \(\textbf{W}_\sigma \) and \(\textbf{W}_g\) denote the convolution weight matrix at layer l, while \(b_\sigma \) and \(b_g\) denote the bias at layer l, for both \(\sigma \) and g functions. In fact, each layer of the Crystal GCN applies both a sigmoid function (\(\sigma \)) [13] and a softplus function (g) [10]. Finally, the operator of aggregation (AGG) denotes the aggregation scheme used for grouping node embeddings generated by different edges relating multiple neighbours j to the same node i. In this study, we use the mean as the aggregation operator.

In particular, GLORIA comprises two graph convolutional layers (i.e., \(l=1,2\)). During the message-passing phase at layer l, the information of each node of the graph is updated based on the aggregation of the messages received from their immediate neighbours achieved in two hops. As such, each message-passing layer increases the receptive field of the GCN by one hop. As we perform two hops in GLORIA, we are able to model relationships between pairs of users, as well as relationships between pairs of products, in addition to the review relationships between users and products.

Figure 1 reports an example of the message-passing realized by GLORIA by considering the user node \(u_1\) as target node. At \(l=0\), all neighbour nodes of \(u_1\) are assigned to initial node feature vectors (by \(\phi _U\) or \(\phi _P\)). At \(l = 1\), the information of both node features and edge features are concatenated and aggregated, while \(\textbf{h}^1_{p_1}\) and \(\textbf{h}^1_{p_2}\) are updated based on f() (Eq. 1). In order to get the node embeddings available for target node \(u_1\) at \(l = 2\), embeddings of neighbour nodes \(\textbf{h}^1_{p_1}\) and \(\textbf{h}^1_{p_2}\) are concatenated and aggregated to update \(\textbf{h}^2_{u_2}\). Therefore, each node in the graph learns from all the neighbour nodes transitively achieved in two hops.

Fig. 1.
figure 1

Example of message-passing with user node \(u_1\) as target in a Crystal GCN with two-hop neighbourhood

Finally, to deal with the expected imbalanced condition of review data in review spam detection problems, we use a sigmoid focal loss [20] for the final prediction:

$$\begin{aligned} \mathcal{S}\mathcal{F} = -\sum _{k=1}^K\alpha (1-\hat{y})^{\gamma } \log \hat{y} \end{aligned}$$
(2)

where K corresponds to the number of classes in the dataset (i.e., spam, non-spam) and parameter \(\gamma \) adjusts the rate and reduces the loss for well-classified samples, to focus learning on hard misclassified samples. \(\alpha \) is a weighting factor in range (0, 1) to balance spam versus non-spam samples.

4 Experimental Setup

We performed experiments on two benchmark review datasets described in Sect. 4.1. The implementation details of GLORIA architecture, adopted in the experiments, are illustrated in Sect. 4.2. The experimental results are discussed in Sect. 4.3.

4.1 Data

We considered two datasets, namely Hotel and Restaurant, described in [23]. The two datasets contain reviews across 72 hotels and 129 restaurants, respectively, in the Chicago area. Each dataset contains reviews recorded by Yelp.com – a well-known large-scale online review site. In addition, both datasets were provided with ground-truth labels (spam and non-spam) in [23]. So they can be used for the evaluation of the accuracy of review spam detection approaches. Both datasets include information about products (e.g., category, price range, rating) and reviewers (e.g., number of friends, number of reviews), as well as plain-text reviews. Each review is associated with a reviewer and a product. In this study, we adopted the feature-vector representation of plain-text reviews described in [5]. In this study, each dataset was processed separately, as each domain has specific characteristics to take into account for the review spam analysis. A summary of the characteristics of both datasets is reported in Table 1. We note that the class distribution is imbalanced in both datasets with the “spam” minority class.

To perform the experimental study, we adopted the same split used in [5] with reviews sorted by the post date and the 80% of the oldest reviews selected for the training stage and the 20% of the newest reviews selected for the testing stage.

Table 1. Summary of characteristics of Hotel and Restaurant datasets

4.2 Implementation Details

We implemented GLORIA in Python 3Footnote 1. In particular, the GCN architecture was realized using PyTorch Geometric (PyG) 2.3, a geometric deep learning extension library for PyTorch. For each dataset, we conducted an automatic hyper-parameter optimization, using the tree-structured Parzen estimator algorithm, as implemented in the Hyperopt library. In particular, we selected the configuration of the hyper-parameters that achieved the highest F1 computed on the validation set extracted using \(20\%\) of the entire training according to the Pareto Principle, by considering spam as the positive class The values of the search space of the hyper-parameters, automatically explored with the tree-structured Parzen estimator, are reported in Table 2.

Table 2. Hyper-parameter search space for the multi-input neural network

The neural architecture of GLORIA comprises two Graph Convolutional Layers, a Dense layer and a Sigmoid layer [13] used for the final edge classification. The standard Rectified Linear Unit (ReLU) [12] was selected as the activation function for each hidden layer. A dropout layer was placed before each Graph Convolutional layer, to perform data regularisation and prevent overfitting. The neural network was trained with mini-batches by back-propagation, while the gradient-based optimization was performed using the Adam update rule [18]. The maximum number of epochs was set equal to 300. The early stopping approach based on the lowest loss on the validation set was used, to obtain the best classification model.

4.3 Results and Discussion

We evaluate the performance of GLORIA to answer the following research questions:

  1. Q1

    How does the accuracy of the proposed GCN-based approach change by varying the cost function?

  2. Q2

    Does the defined GCN model gain accuracy compared to state-of-the-art review spam detection algorithms that neglect the graph structure of review data?

  3. Q3

    Can the graph representation of review data disclose useful knowledge to explain the review domain better?

The accuracy performance of the analysed methods was measured in terms of F1 score computed on both the “spam” class and “non-spam” class, respectively (i.e., F1 spam and F1 non-spam), Macro-F1 (i.e., the average of F1 spam and F1 non-spam) and AUC-ROC. All these metrics were computed on the testing reviews of each dataset.

Table 3. F1 spam, F1 non-spam, Macro-F1 and AUC-ROC of GLORIAby using both BCE loss and SF loss for learning the GCN model. The best results are in bold.

Sensitivity Analysis. We explored the sensitivity of the accuracy performance of GLORIA to the cost function adopted to learn the GCN model. To this aim, we compare the accuracy results obtained by using the binary cross-entropy (BCE) (defined as \( \mathcal {BCE} =\sum _{k=1}^K y \log \hat{y}\)) and the sigmoid focal (SF) loss (defined in Eq. 2), to perform the training stage of GLORIA.

Table 3 reports the F1 (spam), F1 (non-spam), Macro-F1 and AUC-ROC measured on the testing data by using both SF and BCE as loss function in both datasets. The results show that the use of the SF loss can help the GCN model to gain accuracy in both datasets. These results confirm the ability of the SF loss to improve the accuracy performance of a deep neural model in the presence of data showing a strong imbalanced condition. In fact, we can observe that the use of SF loss increases almost all the accuracy metrics in both the experimented datasets. The only exception is observed in the F1 (non-spam) calculated in Restaurant dataset, where the BCE loss performs better than SF loss. This is an expected outcome since the BCE loss is a cost function that considers samples of the two classes to have equal weights. Thus, the neural model can be learned with the BCE loss to recognise the majority class better (e.g., non-spam review in this study).

Table 4. Competitor analysis: F1 spam, F1 non-spam, Macro-F1 and AUC-ROC of GLORIA, SVM and EUPHORIA. The best results are in bold.

Competitor Analysis. We compare the accuracy performance of GLORIA to that of two competitors: SVM that learns a Support Vector Machine classifier and EUPHORIA that learns a multi-input deep neural model for review spam detection. We consider the SVM as a classification algorithm for this comparison since it has been already adopted in multiple related studies on review spam detection (e.g., [15, 21, 23]). On the other hand, EUPHORIA is a recent method described in [5] for review spam detection. Both competitors ignore the graph structure of review data.

Table 4 reports the F1 spam, F1 non-spam, Macro-F1 and AUC-ROC, of SVM, EUPHORIA and GLORIA, respectively. The results show that the highest accuracy is achieved by GLORIA, with EUPHORIA as runner-up of this experiment in both datasets. These results contribute to showing the effectiveness of resorting to a graph representation of review data and leveraging the graph structure of data to learn relationships between users and products, to improve the ability of the classification model to predict accurately the review spam.

Fig. 2.
figure 2

Betweenness centrality (centrality, axis Y) with respect to the number of edges, which are labeled with the class “spam” on each node of the review graph (\(\sharp \) spam axis X). Figures 2a and 2c refer to users, while Figs. 2b and 2d refer to products, in the review graphs of Hotel and Restaurant, respectively.

Qualitative Graph Analysis. Finally, we explore how the graph representation of the review data can disclose useful knowledge to explain the relationships between users, products and reviews in the spam class in the considered datasets. For this purpose, we analyse the betweenness centrality of users and products. The betweenness centrality of a node in a graph measures the amount of influence of the node on the flow of information in the graph. In particular, for a given node within a graph, the betweenness centrality of the node is computed as the number of the shortest paths in the graph, which connect any pair of nodes passing through the node under study, on the total number of the shortest paths which connect any pair of nodes in the graph. A node with a high value of betweenness centrality can be seen as a bridge that, if removed, could disrupt connections between other nodes in the graph.

Due to the main focus of this problem on the class “spam”, Fig. 2 shows the betweenness centrality (axis Y) plotted with respect to the number of edges labelled with the class “spam” on each graph node. As GLORIA adopts a heterogeneous graph, we show the betweenness centrality for both users (Figs. 2a and 2c) and products (Figs. 2b and 2d). These results show that the betweenness centrality of a node tends to increase as the number of spam reviews involving the node increases. This trend is more evident in the products than in the users, since there is a large number of users who produced zero spam reviews, while a small number of products received zero spam reviews.

Fig. 3.
figure 3

Sub-graphs rooted in the users: “3873” of Hotel and “33826” of Restaurant. Both sub-graphs are produced with two hop levels. Figures 3b and 3d report the projection of sub-graphs shown in Figs. 3a and 3c on the edges labeled with the class “spam”.

Figure 3 shows the sub-graphs rooted in the users: “3873” of Hotel and “33826” of Restaurant. These users are identified according to the plots reported in Figs. 2b for Hotel and 2d for Restaurant as the users who produced the highest number of reviews labeled in the class “spam” in the two datasets. The two sub-graphs are produced with two hop levels. In particular, Figs. 3a and 3c show the entire sub-graphs rooted in the selected users “3873” and “33826”, respectively. Figures 3b and 3d show the projection of these sub-graphs on the edges labeled in the class “spam”. The sub-graphs rooted in the users “3873” and “33826” show the products for which these two users are spammers into Hotel and Restaurant, respectively. In both cases, the sub-graphs highlight that “multiple” reviewers produced spam reviews on the same target products. This suggests that several spammer profiles co-operated to produce malicious spam on the same target products. Hence, the activity of a spammer on a product may attract the attention of further spammers on the same target product.

Fig. 4.
figure 4

Subgraphs of products “5082”, “5083” and “5084” of Hotel dataset. These products have achieved the higher number of spam as reported in Fig. 2b.

Figure 4 shows the sub-graphs rooted in products “5082”, “5083” and “5084” of Hotel, which are the three products that received the highest number of spam reviews in Fig. 2b. We note that the product “5082” has the highest betweenness centrality in Fig. 4. Instead, the product “5084” has a low betweenness centrality, while the product “5083” has a medium betweenness centrality in Fig. 4. Consistently with this analysis of betweenness centrality, the node density is higher in the sub-graph rooted in product “5082” than in the sub-graphs rooted in product “5083” and product “5084”, respectively. In addition, the sub-graph rooted in product “5084” shows that all reviews produced on this product belong to the class “spam”. These malicious reviews were produced by users who created these single reviews (with the exception of two users who created two reviews and both these reviews were spam). This suggests that a possible malicious behaviour is observable in products with low betweenness centrality.

5 Conclusion

In this paper, we illustrate a GCN approach for review spam detection, which takes advantage of relationships between users, products and reviews by resorting to a graph-based representation of review data and training GCN model for edge classification of reviews. The experiments performed on two benchmark datasets prove the accuracy of the proposed approach compared with two baselines that are SVM and a multi-view deep learning-based approach, respectively. In addition, we show that the analysis of the betweeness centrality of products and users allows us to extract useful knowledge to explain review data by disclosing possible review spam patterns. As future work, we plan to continue the investigation of how knowledge explaining the review graph topology can be used during the GCN training stage, to help the learned classification model gain accuracy in detecting review spam. In addition, we plan to investigate the use of the graph-based representation of review data in an online setting, to explore how changes occurring over time in the graph topology may help to keep high accuracy detecting review spam in real-time. Finally, we plan to extend our approach to perform a link prediction task to predict future behaviours of spammers.