Keywords

1 Introduction

User-generated feedback on e-commerce platforms like Amazon and Yelp can provide valuable information for potential customers to help them make decisions about purchase and avoid risks. As a result, malicious users will aim at those rating platforms for different purposes e.g., overrating products for promoting sales or underrating products for defaming competitors [11, 15]. Therefore, identifying such abnormal users on rating networks has been a meaningful and popular research topic in recent as the rapid spread of online trading.

Traditional methods [11, 17, 24] try to extract handcraft features such as length of reviews, rating distributions of users etc., and feed them into classic machine learning models to learn suitable weights. However, conventional models are bottlenecked in performance and not scalable to larger size of data in real world. Therefore, [9, 18] start to investigate the rating patterns of users, examining how fraudsters behave differently from normal users within the rating network. More recently, graph-based approaches [1, 3, 14, 22, 29, 31] have emerged in the area of network fraud detection and shown promising performance. Generally, they attempt to regard rating networks as graphs and analyze the interactions between nodes to find out anomalies within a graph.

Although the field of network fraud detection is well-investigated by existing graph-based methods, they are still not able to perform consistently well across real-world platforms for many reasons. Firstly, on large trading websites like Amazon, millions of users could potentially conduct shopping activities and post their opinions or ratings on a daily basis. It is impossible to annotate such a huge amount of data manually due to the massive labour and time cost. However, training process without sufficient high-quality labeled data will limit the performance of fraud detectors [19]. Secondly, it is challenging for graph-based detection models to analyze nodes with only a few of inter-connections with others, while this situation is quite common in reality [26, 32]. For example, a user entity with only one or two ratings could be easily identified as a spammer by many existing models when its ratings deviate from mainstream opinions. Nevertheless, it still could be a benign user just with a different personal preference on specific products. Besides, when a product only receives unreliable ratings, it is unlikely to infer its inherent quality with interpretation. Lastly, many graph-based models decide the trustworthiness of rating depending on local deviation and ignore overall rating patterns of users, which could lead to overfitting [6, 7]. The reason is that both normal and unfair users might occasionally give unusual ratings, which violate their overall rating patterns.

Table 1. Comparison in terms of algorithmic properties with other baselines.

In this paper, we aim to address the challenges mentioned above by proposing an Unsupervised Fraud Detection ranking algorithm on Sparse Rating Networks (FD-SpaN) to effectively identify abnormal users. Our FD-SpaN is designed to eliminate the negative influence of network sparsity on entity level by flexibly controlling the weights of global defaults on computation for the quality of items and the trustworthiness of ratings. Moreover, FD-SpaN attempts to further boost performance by including overall rating patterns of active users to avoid overfitting. As shown in Table 1, we compare FD-SpaN with other baselines in terms of four major algorithmic properties and FD-SpaN is the only one to satisfy all of them. The contributions of our work are summarized as follows:

  • Analysis. We comprehensively analyze the problem of network fraud detection and existing challenges, then model rating networks as graphs to include structural relations.

  • Algorithm. We propose an efficient graph-based fraud detection algorithm FD-SpaN, in setting of unsupervised learning, to rank and identify fraudulent users on sparse rating networks according to computed fraud scores.

  • Effectiveness. We conduct extensive experiments on two real-world sparse datasets. Results show the effectiveness of our algorithm FD-SpaN, which outperforms all other baselines in terms of all evaluation metrics.

2 Related Work

In this section, we classify existing works into two categories: general fraud detection and graph-based fraud detection. Comprehensive surveys could be found in [2, 19].

General Fraud Detection. [11] is the first work to investigate fraud detection on e-commerce platforms by extracting handcraft features corresponding to reviews, reviewers and products and learning suitable weights. Inspired by it, Behavior [18] expands feature set to detect target-based spamming, regarding multiple reviews from the same user on a single item or brand within a short time period as potential frauds. Co-training [17] builds up two independent models with different feature sets related to reviews and reviewers respectively, then uses a small group of labeled data to train both classifiers and annotate unseen data mutually to create new training samples for boosting each other iteratively. Deceptive [24] and Singleton [26] focus on textual content analysis and measure semantic anomaly scores of reviews by computational linguistic strategies such as Linguistic Inquiry and Word Count (LIWC) and Latent Dirichlet Allocation (LDA). [32] studies the problem of singleton review spam detection by converting it into a temporal pattern discovery problem. Similarly, Birdnest [9] analyzes the distribution and frequency of ratings given by a user in a Bayesian probabilistic model, which measures fraud score depending on how estimated posterior distribution is different from global expectation. SpEagle [25] incorporates behavioral and textual features into FraudEagle [1] to improve the performance.

Graph-Based Fraud Detection. Since PageRank [4] and HITS [12] were introduced to explore relations among website pages based on citing, many studies have been extensively investigating fraud detection using topological information rather than content comparison. BAP [22] values ratings even from biased users if ratings are opposite to the usual patterns of their reviewers. Troll-Trust [31] ranks nodes within signed networks based on calculated trustworthiness using sigmoid-like functions, which provides clear semantic interpretation but is limited in dealing with discrete values i.e., ratings in most of real-world platforms. Trustiness [28, 29] formally model network fraud detection with three intrinsic metrics to examine information of nodes and edges in mathematical manner within graphs. Rev2 [14] incorporates behavioral features and sentiment analysis as auxiliary information to improve the performance of Trustiness. FraudEagle [1] regards network fraud detection as a node classification problem and uses loopy belief propagation to update the entire graph. More recently, FdGars [30] and CARE [6] learn the vector representations of user nodes by leveraging the power of Graph Convolutional Network (GCN). For tackling the problem of camouflage, Fraudar [10] proposes a novel metric denoting the suspiciousness of users, which would not decrease as any amount of camouflages added.

3 Preliminaries

In this section, we first model general rating networks as bipartite graphs with mathematical symbols. Then, we formally propose the problem that we will focus in this paper. Lastly, we introduce four basic metrics to mathematically examine information of nodes and edges in the graph for addressing the proposed problem.

3.1 Problem Definition

Firstly, we define a rating network as a bipartite graph \(G = (U, P, R)\), where U denotes all user nodes; P denotes all product (interchangeable with item) nodes; and R represents all ratings given by user \(u \in U\) on product \(p \in P\) within the graph G as edges. In addition, let \(R_{u, *}\) be all ratings given by user u and \(R_{*, p}\) be all ratings received by product p. We denote the value of rating \(r_{u, p}\) as \(w(r_{u, p})\) and scale it to the range between \(-1\) and \(+1\) for generalization, i.e., \(w(r_{u,p}) \in [-1, 1] ~ \forall r_{u,p} \in R\). Based on above preliminaries, we raise the fraud detection problem as following:

Definition 1 (Problem)

Given a bipartite rating graph \(G = (U, P, R)\), what is the fraud score of each user node \(u \in U\), also known as the probability of being a fraudster, whose rating behavior is maliciously deviated from normal user nodes?

3.2 Metrics

To solve this specific problem, we introduce four intrinsic metrics corresponding to bipartite rating graphs: 1) quality of item; 2) deviation of rating; 3) trustworthiness of rating and 4) honesty of user. The definitions of four metrics will be explained as follows:

  1. 1.

    Quality measures the inherent goodness of each item, which is also considered as the deserved rating from fair users. Therefore, we use a real number between \(-1\) and \(+1\) to indicate the quality of a product p, denoted as \(Q(p) \in [-1, 1], ~ \forall p \in P\). Intuitively, high quality Q(p) (close to \(+1\)) of product p represents high expected rating from normal users, and vice versa.

  2. 2.

    Deviation measures the difference between the value \(w(r_{u,p})\) of a rating \(r_{u,p}\) and the quality Q(p) of its targeted item p, denoted as \(dev(r_{u,p})\). Negative deviation suggests lower rating received than what the targeted item deserves, while positive deviation represents higher rating than expectation.

  3. 3.

    Trustworthiness measures how reliable is a rating \(r_{u, p}\) from user u on item p, denoted as \(T(r_{u,p}) \in [0, 1] ~ \forall r_{u,p} \in R\). The trustworthiness \(T(r_{u,p})\) is determined by multiple factors. Intuitively, the trustworthiness scales from 0 to 1, standing for completely untrustworthy and totally reliable, respectively.

  4. 4.

    Honesty measures the fairness of a user node u in the graph, denoted as \(H(u) \in [0, 1], ~ \forall u \in U\), where 1 indicates absolute fair users and 0 suggests definite spammers. In particular, the honesty of a user entity is decided by all its ratings given to items within the graph.

Fig. 1.
figure 1

Rating deviation distributions of fraudulent users and normal users on two real-world datasets.

4 Methodology

4.1 Unsupervised Learning

After we formulate the bipartite rating networks comprehensively with four basic metrics, we now can explore relations among metrics in mathematical manner. Due to the lack of annotation, we aim to infer true values of metrics in unsupervised setting. Therefore, we design mathematical formulations to calculate values for all metrics interdependently. Firstly, for each item, we use all ratings received by the item to compute the intrinsic quality. Intuitively, each rating weighs variously based on the trustworthiness, as reliable ratings will reflect the real quality of item, while untrustworthy ratings aim to mislead other users. Hence, we develop the formulation for the quality of item as below:

$$\begin{aligned} Q(p) = \frac{\sum _{r \in R_{*, p}} T(r) \cdot w(r)}{| R_{*, p} |} \end{aligned}$$
(1)

where \(|\cdot |\) is the element count operator. Secondly, for inferring the trustworthiness of a rating, the key factor is the rating deviation. As shown in Fig. 1, on both real-world rating networks, deviations of ratings from normal users are quite neutral, while fraudsters are more likely to give ratings on items with lower values than what items deserve, which generally implies that extreme deviations lead to unreliable ratings. Hence, we develop the formulation for the trustworthiness as:

$$\begin{aligned} T(r_{u, p}) = 1 - \frac{\Vert dev(r_{u, p}) \Vert }{2} \end{aligned}$$
(2)
$$\begin{aligned} dev(r_{u, p}) = w(r_{u,p}) - Q(p) \end{aligned}$$
(3)

where \(\Vert \cdot \Vert \) is an operator for computing absolute value. Thirdly, the honesty of each user node in the graph is determined by all ratings it gives. In general, benign users tend to give trustworthy ratings to items, while fraudsters rarely rate items fairly. Thus, we infer the honesty of each user as:

$$\begin{aligned} H(u) = \frac{\sum _{r \in R_{u, *}} T(r)}{| R_{u, *} |} \end{aligned}$$
(4)

Lastly, the fraud score of each user entity in the graph is reversely correlated to its honesty score, denoted as F(u) and expressed as:

$$\begin{aligned} F(u) = 1 - H(u) \end{aligned}$$
(5)

4.2 Alleviating Network Sparsity

Normally, introduced formulations will work properly when each node in the graph is densely connected with others. Nevertheless, on real-world rating platforms, the performance of fraud detection systems will suffer from insufficient inter-connections, which is a problem called network sparsity [23]. For example, newly registered users on e-commerce websites will naturally have few ratings on items, which could be easily classified as fraudulent users for first one or two ratings significantly deviating from mainstream [3]. Similarly for item node, it could be hard to infer the real quality of item when most of its received ratings are untrustworthy. Therefore, effective fraud detectors are supposed to address network sparsity, because online platforms will notify suspicious users to verify identification only when feel confident rather than bother normal users frequently in practice, which is harmful to the revenue [16]. Thus, we add a variable Laplace smoothing term in Formula (1) to alleviate network sparsity on entity level as:

$$\begin{aligned} Q(p) = \frac{\sum _{r \in R_{*, p}} T(r) \cdot w(r) + \sum _{r \in R_{*, p}} (1 - T(r)) * q}{| R_{*, p} | + \sum _{r \in R_{*, p}} (1 - T(r))} \end{aligned}$$
(6)

where q refers to the default value of quality in the graph, which could be decided as the median value of all items’ quality scores. And variable term \(\sum _{r \in R_{*, p}} (1 - T(r))\) flexibly controls the proportion of default on calculation for the quality. Intuitively, the default would account less for the calculation if the item receives more trustworthy ratings. Likewise, we also add a variable smoothing term in Formula (2) as:

$$\begin{aligned} T(r_{u, p}) = \frac{(1 - \frac{\Vert dev(r_{u, p}) \Vert }{2}) + C_1 * t}{1 + C_1} \end{aligned}$$
(7)
$$\begin{aligned} C_1 = \max (0, ~ 1 - \frac{| R_{u, *} |}{K}) \end{aligned}$$
(8)

where t refers to the default value of trustworthiness in the graph and K is a positive constant number indicating the threshold number of interactions as active users. Intuitively, the calculation will be less dependent on default value as the activity level of user increases. Together, variable smoothing terms in our formulations can automatically adapt to different nodes in the graph, which helps to alleviate network sparsity on entity level.

4.3 Avoiding Overfitting

Even in real-world sparse rating networks, where many user nodes in the graph only give few ratings to item nodes, there still could be some active users who rate items frequently, no matter they are fraudsters or normal users. For an active benign user node, one of its ratings might not be as usual as others due to the specific unsatisfactory shopping experience or unique personal preference. Therefore, it is inaccurate to always decide the trustworthiness of a rating from active users only by local rating deviation [20]. To solve this problem, we introduce a new metric to reflect the rating patterns of users for resisting overfitting when user nodes are active, denoted as \(bias(u), ~ u \in U\). Then, we integrate this metric into Formula (7) for computing trustworthiness of rating as:

$$\begin{aligned} T(r_{u, p}) = \frac{(1 - \frac{\Vert dev(r_{u, p}) \Vert }{2}) + C_1 * t + C_2 * (1 - \frac{\Vert bias(u) \Vert }{2})}{1 + C_1 + C_2} \end{aligned}$$
(9)
$$\begin{aligned} bias(u) = \frac{\sum _{r_{u, p} \in R_{u, *}} (w(r_{u, p}) - Q(p)) }{| R_{u, *} |} \end{aligned}$$
(10)

where \(C_2 = 1 - C_1\), controlling the importance of term bias on smoothing extreme rating deviation with respect to the activity level of users.

4.4 Proposed Algorithm: FD-SpaN

figure a

In this section, we propose algorithm FD-SpaN to effectively rank and identify network frauds as shown in Algorithm 1. Specifically, we iteratively calculate the true values for metrics and adjust default values simultaneously according to the global trend in the graph. In the end, our algorithm will output the likelihood of each user node being abnormal in the graph. In addition, FD-SpaN can also be applied to other fields that information can be organized as bipartite graphs, such as social media [5, 13], finance [27] and recommender system [33].

4.5 Time Complexity Analysis

In each iteration of training, our algorithm FD-SpaN will go through all user nodes once, all item nodes once and all ratings three times to update values of metrics. Hence the time complexity for each iteration is \(\mathcal {O}(x + y + 3z)\), where x is the total number user nodes, y is the total number of item nodes and z is the total number of ratings in the graph. Let, n denote the number of training iterations needed for our model FD-SpaN to converge. As a result, the overall time complexity for FD-SpaN would be \(\mathcal {O}(n * (x + y + 3z))\), where n is a constant number. Thus, theoretically, our proposed model FD-SpaN has a linear time complexity with respect to the size of graph, which is suitable to be applied in larger scale of data of real world.

5 Experiments

In this section, we conduct extensive experiments on two real-world datasets to evaluate FD-SpaN compared with the state-of-the-art baselines to show the effectiveness of our proposed model. Specifically, We aim to answer following research questions:

  • RQ1. How does FD-SpaN perform on real-world sparse rating networks for identifying frauds compared with other baselines?

  • RQ2. How does the performance of FD-SpaN and other baselines change under different training percentages?

  • RQ3. Is FD-SpaN scalable to real-world platforms with larger size of data?

  • RQ4. How does each different component of FD-SpaN contribute to the overall performance?

Table 2. Two real-world datasets used for evaluation.

5.1 Experiment Setup

Datasets. Following the tradition of previous representative work [6], we also select two real-world datasets for extensive experiments. Table 2 shows the statistics of two real-world datasets, where attribute density refers to the average number of ratings given by each user in the network. Both datasets are relatively sparse and extremely imbalanced to different extents, as we aim to evaluate the performance of FD-SpaN on identifying frauds under actual circumstances.

  • FineFoods [21] is a bipartite rating network in category of fine foods, with around \(65\%\) of user nodes and \(41\%\) of item nodes giving and receiving only one rating, respectively. The ground truth is given by the method mentioned in [14] based on provided helpfulness votes.

  • Instruments [8] refers to the musical instruments trades on Amazon websites. This rating network is even sparser, as roughly \(80\%\) of users and \(44\%\) of items have only one rating. The ground truth is also given based on the similar method as used in FineFoods dataset.

Baslines. We compare our model with five state-of-the-art baselines on fraud detection. We convert these baselines to calculate the probability of each user being anomaly, as exactly what our model does, for fair comparison.

  • Birdnest [9] only considers the distribution and timestamps of ratings, and uses Bayesian inference to calculate suspiciousness scores for users.

  • BAP [22] measures the expected rating for each item and the trend of each user giving high or low ratings. The key of BAP is to value ratings from users even with high bias if ratings are opposite to the usual pattern of reviewers.

  • Eagle [1] is based on loopy belief propagation and regards the network fraud detection as a node classification problem in graphs.

  • Behavior [18] designs and extracts multiple behavioral features of users to calculate overall fraud scores for users.

  • Rev2 [14] is based on HITS algorithm [12] and iteratively computes three intrinsic metrics in the graph: the fairness of users, the reliability of ratings and the goodness of products.

Evaluation Metrics. Our task is to infer the probability of each user node being spammer in the graph. Then, we will rank user entities based on the assigned probabilities. However, real-world rating networks are extremely imbalanced and sparse as shown in Table 2. Additionally, effective fraud detectors are supposed to assign as high anomaly scores as possible to real spammers and as low as possible to normal users. Therefore, following by the previous work [14], we extract the most and least suspicious 100 users from the output of model to form the test set and evaluate models by Average Precision (AP), which measures the capacity of distinguishing fraudsters and normal users. Also, we evaluate models by Area Under ROC Curve (AUC) on the top 200 suspicious user entities, reflecting the ability of each algorithm to identify positive cases on imbalanced dataset.

Implementation Details. For BAP and Rev2, we strictly follow the instructions from their paperwork to implement their algorithms. For BirdnestFootnote 1, We use the original code from authors to run experiments. For EagleFootnote 2 and BehaviorFootnote 3, we use available open-source implementations online. As for our proposed model FD-SpaN, we set parameter \(K = 4\) according to the result on validation set, number of training iterations \(epochs = 20\) and training error threshold \(\epsilon = 10^{-4}\) as many other studies did.

Table 3. Performance comparison (%) on two real-world datasets. The best result of each metric is in bold. ‘-’ represents not applicable.

5.2 Effectiveness (RQ1)

In this experiment, we compare the results of proposed model FD-SpaN on both real-world datasets with other state-of-the-art baselines as shown in Table 3. Overall, we can observe that our FD-SpaN consistently outperforms all baselines in terms of all evaluation metrics. More specifically, Birdnest performs poorly across both datasets and has remarkable gaps with other models. This is due to Birdnest only considers the temporal pattern and distribution of ratings provided by users to decide their suspiciousness scores, while it suffers from massive inactive users with only few ratings, which are very common in real-world rating networks but cannot be interpreted by Birdnest.

BAP and Eagle are graph-based models, trying to exploit topological information of graphs to identify anomalies, as what Rev2 and FD-SpaN do. However, they cannot achieve consistently competitive performance because they fail to alleviate the impacts of network sparsity. Even though Behavior obtains promising results by including multiple designed features regarding to rating behaviors, our FD-SpaN still outperforms Behavior by \(7.48\%\) in AP and \(5.20\%\) in AUC on FineFoods, which implies the superiority of graph-based methods.

Moreover, Rev2 is a graph-based model and attempts to address network sparsity on graph level. However, even within the same graph, each node is in a different situation. Therefore, applying the same parameters to different nodes in the graph will cause damage to the overall performance. Meanwhile, by alleviating network sparsity on entity level and resisting overfitting through new designed metric, our proposed FD-SpaN achieves a \(5.98\%\) improvement in AP value on FineFoods. Also, 91 out of top 100 suspicious user nodes predicted by our model FD-SpaN on dataset Instruments are real spammers, which is the best among all algorithms. In summary, under realistic circumstance of class imbalance and network sparsity, our proposed model FD-SpaN can effectively identify fraudsters within rating networks.

5.3 Training Percentage (RQ2)

In this experiment, we run our model FD-SpaN with different percentages of training data and record the results in terms of AP and AUC values. As we can see from Fig. 2, on FineFoods, FD-SpaN obtains stable performance in both evaluation metrics under different training percentages. On Instruments, even though FD-SpaN scores relatively low in AP with \(10\%\) of training data, it rapidly recovers to normal as the training percentage increases. Together, experiment results show the robustness of our FD-SpaN with respect to training percentages.

Fig. 2.
figure 2

Performance (%) of model FD-SpaN on two real-world datasets under different training percentages.

Fig. 3.
figure 3

Average Precision values (%) of different algorithms on two real-world datasets under different training percentages.

Moreover, we record the performance variations of our FD-SpaN and other baselines in value of AP under different training percentages, as illustrated in Fig. 3. On FineFoods, FD-SpaN scores the highest when only \(10\%\) of training data provided and maintains one of the tops in most choices of training percentages. On Instruments, we can observe FD-SpaN has continuous growth in AP value and consistently outperforms other baselines as training percentage increases from \(10\%\) to \(90\%\). Together, experiment results suggest our FD-SpaN can effectively analyze topological structure of graphs to identify anomalies.

Fig. 4.
figure 4

Training time cost (seconds) of model FD-SpaN, using different number of training samples.

5.4 Linear Scalability (RQ3)

Alongside with previous theoretical analysis of linear scalability of our proposed FD-SpaN, we also run our algorithm using different numbers of training samples to verify linear scalability. We scale training samples in different magnitudes of amount from \(10^4\) to \(10^7\) and record the training time costs for FD-SpaN to converge on the same machine, as illustrated in Fig. 4. Basically, the running time of FD-SpaN is increasing linearly as fed with more training samples. Therefore, experiment results verify our model has linear scalability and is suitable to handle with larger size of data in practice.

Table 4. Performance variations (%) when different component of algorithm FD-SpaN is excluded.

5.5 Ablation Study (RQ4)

In this experiment, we conduct an ablation study on both datasets to validate the effectiveness of designed components of FD-SpaN and demonstrate how each different component contributes to the overall performance of FD-SpaN.

  • FD-SpaN/AO: it excludes the component for avoiding overfitting.

  • FD-SpaN/AS: it removes the component for alleviating network sparsity.

As shown in Table 4, on FineFoods, by removing the component of either anti-overfitting or anti-sparsity, the performance of FD-SpaN will drop remarkably by \(8.21\%\) and \(7.59\%\) in AP and AUC values, respectively. However, we observe a marginal decrease in performance on Instruments when the component of avoiding overfitting is excluded. The reason could be that most of users in Instruments are very inactive so that incorporating their overall rating patterns has very limited effects to resist overfitting. FD-SpaN can achieve the best performance on sparse rating networks only when all designed modules are integrated into it.

6 Conclusion

In this paper, we propose an unsupervised graph-based fraud detection ranking algorithm called FD-SpaN, which targets at effectively identifying anomalies on sparse rating networks in reality. We alleviate the negative impacts of network sparsity on entity level, which leads to finer granularity and better performance than existing methods. We also include the rating patterns of users in our FD-SpaN to avoid overfitting. Our algorithm has a linear time complexity with respect to the size of graph, which makes FD-SpaN is suitable for real-world platforms to deal with large-scale data. More importantly, extensive experiments on two real-world datasets have shown the effectiveness and superiority of FD-SpaN, scoring the highest in all metrics against other state-of-the-art baselines. In the future, we plan to investigate fraud detection on dense rating networks.