
1 Introduction

The thriving growth of Internet services that facilitate our daily life has also brought various kinds of fraudulent behavior. Fraudsters or attackers disguise themselves as benign users to carry out malicious activities and conceal themselves within the mass of data, causing great damage to finance security [10, 16, 20], cyber security [4], and comment management [3, 14]. Fortunately, online behavior is traceable whether benign or fraudulent, and these interactive behaviors can be transformed into graph data in which users and their interactions are treated as nodes and edges, respectively. Recently, graph neural networks (GNNs) have shown great representation power on graph data, and GNN-based fraud detection methods have accordingly attracted extensive attention.

Fig. 1. Illustration of graph disassortativity and imbalance

Although these GNN-based fraud detection methods have made much progress, two main challenges remain.

Graph Disassortativity. Generally, GNNs update node representations by aggregating signals from neighbors, which acts as a low-pass filter that retains the commonality between connected nodes [9, 18]. Benefiting from this smoothing, GNNs work well on assortative graphs, where similar nodes tend to connect [1]; GNN-based fraud or anomaly detection is effective under the assumption that fraudulent or abnormal nodes sharing malicious goals tend to connect with each other. However, several studies [5, 7, 14, 19] have shown that fraud graphs are not always assortative but more likely disassortative: fraudsters often camouflage themselves by making many connections with benign users to look normal and arouse less suspicion, flooding the fraud graph with disassortative connections, i.e., edges between entities from different classes. Consequently, the low-pass smoothing aggregation mechanism of GNNs is insufficient for such disassortative graphs, as it lets fraudsters achieve their intention: the fraudulent features are concealed within myriads of benign ones. As shown in Fig. 1, given a query user who has far more connections with normal users than with fraudsters, assortativity-based GNNs tend to classify this user as normal, because the user's behavioral features have been concealed by the numerous benign neighbors during low-pass smoothing aggregation. Under this circumstance, high-frequency signals (i.e., the differences between entity nodes) are better suited to these disassortative graphs.

Graph Imbalance. First, the number of fraudsters is generally far smaller than that of benign users, which causes node imbalance on the graph. In the Amazon and YelpChi datasets, only 9.50\(\%\) and 14.53\(\%\) of labeled entities are fraudulent, respectively, as shown in Table 1. Node imbalance may bias training towards the majority class (i.e., benign users) and reduce the model's generalization ability. Beyond node imbalance, edge imbalance is even more pressing, as it directly guides the node aggregation process. A fraud graph has three types of edges: edges between normal entity nodes (denoted N-N), edges between fraudulent ones (denoted F-F), and edges between fraudulent and normal ones (denoted F-N). As shown in Table 2, N-N edges far outnumber the others, which helps normal-user detection by retaining the commonalities between normal users. F-F and F-N edges, by contrast, are rare but valuable: we should pay more attention to them to exploit the commonalities among fraudsters and the differences between fraudsters and normal users. However, most current GNN-based fraud detection studies pay little attention to node and edge imbalance and have not made full use of labeled F-F and F-N edges to detect new fraudsters. Worse, the graph imbalance issues further exacerbate the graph's disassortativity, making discrimination more difficult.

To address the above challenges, we propose an Imbalanced Disassortative Graph Learning framework (IDGL) to adaptively aggregate low- and high-frequency signals from assortative and disassortative connections on the imbalanced fraud graph. Specifically, IDGL is composed of four module layers: 1) a re-embedding layer. Recent studies [2] have emphasized that the entanglement of graph filters and parameter matrices may hurt model performance and robustness, and fraudsters usually camouflage themselves with raw features similar to those of normal users; therefore, a non-linear re-embedding layer is applied to relearn the node representations. 2) An adaptive dual-channel convolution layer, which adaptively combines the advantages of dual-channel (i.e., low- and high-frequency) signals from a node's neighbors to assimilate nodes with assortative edges and discriminate nodes with disassortative edges. 3) A representation fusion layer, which combines the intermediate embeddings into the final node representation. 4) An imbalance-oriented classification layer. To alleviate the effects of graph imbalance, a label-aware node and edge sampler is designed that considers node popularity (i.e., degree) and label class frequency. The sampled nodes are used for classification training to eliminate the bias towards the majority classes, and the sampled edges serve as supervision to train the adaptive filters, making the model pay more attention to the valuable edges (i.e., F-F and F-N edges). The contributions of the paper are as follows:

  • We formulate graph-based fraud detection as an imbalanced disassortative node classification task and propose an imbalanced disassortative graph learning framework to deal with the disassortativity and imbalance issues on the graph.

  • An adaptive dual-channel convolution filter is further developed for fraud detection to assimilate the nodes with assortative edges and discriminate the nodes with disassortative edges. A label-aware node and edge sampler is proposed to relieve graph imbalance issues with more attention to the valuable edge information.

  • Experiments on two public real-world datasets demonstrate the effectiveness of our proposed IDGL for fraud detection.

2 Preliminaries

2.1 Definition

Definition 1

(Assortativity and disassortativity). Given a graph, if two nodes (e.g., \(v_i\) and \(v_j\)) connected by an edge belong to different classes, we call the connection a disassortative edge, denoted \(\varepsilon _{ij}^ -\); if they belong to the same class, the connection is an assortative edge, denoted \(\varepsilon _{ij}^ +\). The larger the disassortativity of a graph, the more nodes from different classes tend to connect with each other, and vice versa. For our task, the fraud graph exhibits both assortativity and disassortativity at the same time.

Definition 2

(Graph). Consider a graph \({G} = \left\{ {{V},{ X}, \left\{ {{\varepsilon ^ + },{\varepsilon ^ - }} \right\} ,{A},{Y}} \right\} \), where \({V} = \left\{ {{v_1},{v_2},...,{v_N}} \right\} \) is the set of nodes and N is the number of nodes; \({X} \in \mathbb {R}^{N \times d}\) is the matrix of original d-dimensional feature vectors of the N nodes; \({{\varepsilon ^ + }}\) and \({{\varepsilon ^ - }}\) are the assortative and disassortative edge sets, respectively, with \({\varepsilon ^ + } \cup {\varepsilon ^ - } = \varepsilon \) and \({\varepsilon ^ + } \cap {\varepsilon ^ - } = \emptyset \); A is the corresponding adjacency matrix of the graph, where \({{A}_{ij}} = 1\) if \({e_{ij}} \in \varepsilon \); Y is the set of node labels, where each node has a label \({y_i} \in \left\{ {0,1} \right\} \), with 1 representing a fraudster and 0 a benign user.

Definition 3

(Multi-relation Graph). There can be different relations among the nodes, and \({ G} = \left\{ {{V},{X},\left\{ {\varepsilon _r^ + ,\;\varepsilon _r^ - } \right\} \left| {_{r = 1}^R} \right. ,\left\{ {{{A}_1},{{A}_2},...,{{A}_R}} \right\} ,{Y}} \right\} \) is defined as a multi-relation graph, where \(e_{ij}^r \in \left\{ {\varepsilon _r^ + ,\varepsilon _r^ - } \right\} \) represents the edge between nodes \({v_i}\) and \({v_j}\) under relation \(r \in \left\{ {1,2,...,R} \right\} \) and \({{A}_r}\) is the corresponding adjacency matrix.
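As an illustration only, the multi-relation graph of Definition 3 can be stored in a minimal container like the following Python sketch (all names here are ours, not from any released code):

```python
class MultiRelationGraph:
    """Minimal container for Definition 3 (illustrative sketch)."""
    def __init__(self, X, adjs, y):
        self.X = X        # (N, d) matrix of raw node features
        self.adjs = adjs  # [A_1, ..., A_R]: one adjacency matrix per relation
        self.y = y        # (N,) node labels: 1 = fraudster, 0 = benign

    @property
    def R(self):
        return len(self.adjs)  # number of relations
```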

2.2 Problem Formulation

Definition 4

(Graph-based Fraud Detection). Consider a multi-relation graph G as defined in Definition 3. The task is to distinguish the fraudulent nodes from the benign ones in the given graph. Specifically, given the structural information \({\left\{ {{{A}_1},{{A}_2},...,{{A}_R}} \right\} }\) and the original feature information X, we need to learn a function f that maps each node into a d-dimensional feature vector \({z_i} \in {\mathbb {R}^d}\) across the multi-relation graph, where \(d \ll N\). With the learned embeddings and the labeled nodes, a classifier is trained to detect whether a given unlabeled node is a fraudster.

3 Overview of the Proposed Framework

In this section, we present the proposed IDGL framework, shown in Fig. 2. IDGL consists of four module layers: the re-embedding layer, the adaptive dual-channel convolution layer, the representation fusion layer, and the imbalance-oriented classification layer. The first module layer adds some uncertainty via dense and dropout layers to relieve the effect of feature camouflage. The second module layer makes full use of low- and high-pass signals to deal with the graph disassortativity issue. The third module layer fuses the intermediate representations into the final node features. The final module layer deals with the graph imbalance and detects the fraudsters.

Fig. 2. An illustration of the proposed IDGL framework

3.1 Re-embedding Layer

Fraudsters usually imitate normal users to camouflage themselves, so the original attribute features of fraudsters and normal users are highly similar. It is therefore desirable to add some uncertainty to the original features and relearn the feature similarity between nodes to serve the downstream fraud detection task. In this paper, a dense-based dropout layer is introduced to encode node embeddings without depending on the network topology, denoted as follows:

$$\begin{aligned} h_i^{\left( 0 \right) } = \sigma \left( {dropout\left( {{x_i},\eta } \right) \cdot {\mathbf{{W}}_0}} \right) , \end{aligned}$$
(1)

where \({x_i}\) is the original feature vector of node \(v_i\), \(dropout\left( { \cdot , \eta } \right) \) is the function that drops neurons from the network with probability \(\eta \) during training, \({\textbf{W}_0}\) is a learnable weight matrix, \(h_i^{\left( 0 \right) }\) is the resulting embedding of node \(v_i\), and \(\sigma \left( \cdot \right) \) is the non-linear \(\textbf{ReLU}\) activation function. Building on these components, the difference between fraudsters and normal users will be further amplified over the network topology by the improved convolution and aggregation strategies below.
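For concreteness, the following is a minimal PyTorch-style sketch of Eq. (1) (module and parameter names such as ReEmbedding, d_in, d_hidden are ours, not from the paper's released implementation):

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of Eq. (1): dropout on the raw features, a dense projection W_0,
# then a ReLU non-linearity.
class ReEmbedding(nn.Module):
    def __init__(self, d_in, d_hidden, eta=0.5):
        super().__init__()
        self.dropout = nn.Dropout(p=eta)                     # dropout(x_i, eta)
        self.linear = nn.Linear(d_in, d_hidden, bias=False)  # W_0

    def forward(self, x):
        # h^(0) = sigma(dropout(x, eta) . W_0)
        return F.relu(self.linear(self.dropout(x)))
```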

3.2 Adaptive Dual-channel Convolution Layer

Dual-channel Graph Convolution. Following GCN [8], from the perspective of graph signal processing, the graph convolution \(*_G\) between a given signal x and a filter f is denoted as:

$$\begin{aligned} f{ * _G}x \approx \theta \left( {{I_N} + {D^{ - \frac{1}{2}}}A{D^{ - \frac{1}{2}}}} \right) x, \end{aligned}$$
(2)

where \({I_N} \in \mathbb {R} {^{N \times N}}\) denotes the identity matrix and \({D} = diag\{d_1, d_2,...,d_N\}\) is the diagonal degree matrix (\({d_i} = \sum \nolimits _j {{{A}_{ij}}}\)). From Eq. 2, we can observe that \({{I_N} + {D^{ - \frac{1}{2}}}A{D^{ - \frac{1}{2}}}}\) smooths the central node with its adjacent nodes' information via signal summation, and can be treated as a low-pass filter that captures the commonalities between the central node and its neighbors [1]. Naturally, a high-pass graph filter should instead capture the differences between them. Heuristically, a low-pass filter \(f_L\) and a high-pass filter \(f_H\) are designed as follows:

$$\begin{aligned} \begin{array}{l} f_L{ * _G}x \approx \theta \left( {\varepsilon {I_N} + {{\hat{D}}^{ - \frac{1}{2}}}A{{\hat{D}}^{ - \frac{1}{2}}}} \right) x\\ f_H{ * _G}x \approx \theta \left( {\varepsilon {I_N} - {{\hat{D}}^{ - \frac{1}{2}}}A{{\hat{D}}^{ - \frac{1}{2}}}} \right) x, \end{array} \end{aligned}$$
(3)

where \(\hat{D} = I + D\), and \(\varepsilon \in \left[ {0,1} \right] \) is a scaling factor. We can generalize Eq. 3 to the signal matrix \(X \in \mathbb {R}^{N \times d}\) (i.e., a d-dimensional representation for each node) as follows:

$$\begin{aligned} \begin{array}{l} {Z_L} = (\varepsilon {I_N} + {{\hat{D}}^{ - \frac{1}{2}}}A{{\hat{D}}^{ - \frac{1}{2}}})X\varTheta \\ {Z_H} = (\varepsilon {I_N} - {{\hat{D}}^{ - \frac{1}{2}}}A{{\hat{D}}^{ - \frac{1}{2}}})X\varTheta , \end{array} \end{aligned}$$
(4)

where \(\varTheta \in \mathbb {R}^{d \times M}\) is the learnable filter parameter matrix, and \({Z_L} \in \mathbb {R}^{N \times M}\) and \({Z_H} \in \mathbb {R}^{N \times M}\) are the signal matrices convolved by the low-pass and high-pass filters, respectively. The low-pass and high-pass convolutions of node i can then be denoted as:

$$\begin{aligned} \begin{array}{l} {\left( {{F_L}X} \right) _i} = \left( {\varepsilon {X_i} + \sum \limits _{j \in {N{\left( i \right) }}} {\frac{1}{{\sqrt{{{\hat{D}}_{ii}}{{\hat{D}}_{jj}}} }}} {X_j}} \right) \varTheta \\ {\left( {{F_H}X} \right) _i} = \left( {\varepsilon {X_i} - \sum \limits _{j \in {N{\left( i \right) }}} {\frac{1}{{\sqrt{{{\hat{D}}_{ii}}{{\hat{D}}_{jj}}} }}} {X_j}} \right) \varTheta , \end{array} \end{aligned}$$
(5)

where \({N{\left( i \right) }}\) is the set of one-hop neighbors of node i.
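As a sketch of Eqs. (4)-(5), assuming a dense adjacency tensor (an efficient implementation would likely use sparse operations), the two channels differ only in the sign of the normalized neighbor term; the projection \(\varTheta \) is left to the caller:

```python
import torch

def dual_channel_filters(A, X, eps=0.5):
    """Return (Z_low, Z_high) of Eq. (4), without the trailing projection Theta."""
    N = A.size(0)
    I = torch.eye(N, device=A.device)
    d_hat = 1.0 + A.sum(dim=1)                 # diagonal of D_hat = I + D
    d_inv_sqrt = d_hat.pow(-0.5)
    # symmetrically normalized adjacency: D_hat^{-1/2} A D_hat^{-1/2}
    A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
    Z_low = (eps * I + A_norm) @ X             # low-pass: adds neighbor signals
    Z_high = (eps * I - A_norm) @ X            # high-pass: subtracts neighbor signals
    return Z_low, Z_high
```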

Learnable Channel Fusion Aggregation. Given the above low- and high-pass graph filters, the next step is to aggregate both low- and high-frequency information from a node's neighbors. A basic idea is to add a weight parameter \(\alpha _{ij,r}\) to balance the importance of the two filters under relation r:

$$\begin{aligned} {{\tilde{h}}_{i,r}} = {\alpha _{ij,r}}{\left( {{F_L}H} \right) _{i,r}} + \left( {1 - {\alpha _{ij,r}}} \right) {\left( {{F_H}H} \right) _{i,r}}, \end{aligned}$$
(6)

where \(\alpha _{ij,r} \in \left[ {0,1} \right] \), and Eq. 6 can be further expanded as follows:

$$\begin{aligned} {{\tilde{h}}_{i,r}} = \left( {\varepsilon {h_{i,r}} + \sum \limits _{j \in {N_r}\left( i \right) } {\frac{{2{\alpha _{ij,r}} - 1}}{{\sqrt{{{\hat{D}}_{ii,r}}{{\hat{D}}_{jj,r}}} }}} {h_{j,r}}} \right) {\varTheta _r}, \end{aligned}$$
(7)

where \({{\tilde{h}}_{i,r}}\) is the aggregated embedding of node \(v_i\) under relation r. Here, we set a learnable coefficient \(\omega _{ij,r}= 2\alpha _{ij,r} - 1\), where \(\omega _{ij,r} \in \left[ {-1,1} \right] \). \(\omega _{ij,r}\) decides whether a low- or high-frequency signal should be extracted between nodes \(v_i\) and \(v_j\), so the features of both node \(v_i\) itself and its neighbor \(v_j\) should be considered together. Naturally, a shared self-gating mechanism is used to learn \(\omega _{ij,r}\) as follows:

$$\begin{aligned} \omega _{ij,r}^{\left( l \right) } = \tanh \left( {{{\left( {\mathbf{{W}}_r^{\left( l \right) }} \right) }^T}\left[ {h_i^{\left( {l - 1} \right) }||h_j^{\left( {l - 1} \right) }} \right] } \right) , \end{aligned}$$
(8)

where \({h_i^{\left( {l - 1} \right) }} \in \mathbb {R}^{{d_{{v_i}}} \times 1}\) and \({h_j^{\left( {l - 1} \right) }} \in \mathbb {R}^{{d_{{v_j}}} \times 1}\) are the representations of nodes \(v_i\) and \(v_j\) at the \((l-1)\)-th layer, \({\mathbf{{W}}_r^{\left( l \right) }} \in \mathbb {R}^{\left( {{d_{{v_i}}} + {d_{{v_j}}}} \right) \times 1}\) is a trainable matrix, || is the concatenation operation, and \(\tanh \left( \cdot \right) \) is the hyperbolic tangent function, which keeps \(\omega _{ij,r}^{\left( l \right) }\) in the range \(\left( { - 1,1} \right) \). Finally, the aggregation of node i can be denoted as follows:

$$\begin{aligned} h_{i,r}^{\left( l \right) } = \sigma \left( {\left( {\varepsilon h_i^{\left( {l - 1} \right) } + \sum \limits _{j \in {N_r}\left( i \right) } {\frac{{\omega _{ij,r}^{\left( l \right) }}}{{\sqrt{{{\hat{D}}_{ii,r}}{{\hat{D}}_{jj,r}}} }}} h_j^{\left( {l - 1} \right) }} \right) \varTheta _r^{\left( {l - 1} \right) }} \right) . \end{aligned}$$
(9)
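The following PyTorch-style sketch implements Eqs. (8)-(9) for a single relation in edge-list form (names such as AdaptiveRelationConv and edge_index are our assumptions; the paper does not prescribe this interface):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRelationConv(nn.Module):
    """Sketch of Eqs. (8)-(9) for one relation r."""
    def __init__(self, d_in, d_out, eps=0.5):
        super().__init__()
        self.gate = nn.Linear(2 * d_in, 1, bias=False)   # W_r, shared over edges
        self.theta = nn.Linear(d_in, d_out, bias=False)  # Theta_r
        self.eps = eps

    def forward(self, h, edge_index, d_hat):
        # edge_index: (2, E) pairs (j, i) = (neighbor, center); d_hat: (N,) = 1 + degree
        src, dst = edge_index
        # omega_ij = tanh(W_r^T [h_i || h_j]) in (-1, 1), Eq. (8)
        omega = torch.tanh(self.gate(torch.cat([h[dst], h[src]], dim=1))).squeeze(-1)
        coef = omega / torch.sqrt(d_hat[dst] * d_hat[src])
        # eps * h_i + sum_j coef_ij * h_j, then Theta_r and ReLU, Eq. (9)
        agg = torch.zeros_like(h).index_add_(0, dst, coef.unsqueeze(-1) * h[src])
        return F.relu(self.theta(self.eps * h + agg))
```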

Layer Architecture and Cross-relation Combination. In the previous subsections, we introduced the message passing paradigm of our method. Here, we formally define the convolution layer of our method under the r-th relation as follows:

$$\begin{aligned} \begin{array}{l} h_i^{\left( 0 \right) } = \sigma \left( {dropout\left( {{x_i},\eta } \right) \cdot {\mathbf{{W}}_0}} \right) \\ ...\\ \omega _{ij,r}^{\left( l \right) } = \tanh \left( {{{\left( {\mathbf{{W}}_r^{\left( l \right) }} \right) }^T}\left[ {h_i^{\left( {l - 1} \right) }||h_j^{\left( {l - 1} \right) }} \right] } \right) \\ h_{i,r}^{\left( l \right) } = \sigma \left( {\left( {\varepsilon h_i^{\left( {l - 1} \right) } + \sum \limits _{j \in {N_r}\left( i \right) } {\frac{{\omega _{ij,r}^{\left( l \right) }}}{{\sqrt{{{\hat{D}}_{ii,r}}{{\hat{D}}_{jj,r}}} }}} h_j^{\left( {l - 1} \right) }} \right) \varTheta _r^{\left( {l - 1} \right) }} \right) \\ h_i^{\left( l \right) } \leftarrow [h_{i,1}^{\left( l \right) },h_{i,2}^{\left( l \right) },...,h_{i,R}^{\left( l \right) }]\\ ... \end{array} \end{aligned}$$
(10)

From Eq. 10, we can observe that the signals \(h_{i,r}^{\left( l \right) }, r \in \left\{ {1,2,...,R} \right\} \) (i.e., the representations of node i learned under the different relations at the l-th layer) are concatenated into a new signal \(h_i^{\left( l \right) }\), which serves as the node feature input for the next layer.
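A full layer of Eq. (10) then simply runs one such relation-specific convolution per relation and concatenates the outputs; a sketch reusing the AdaptiveRelationConv above:

```python
import torch

def multi_relation_layer(convs, h, edge_indices, d_hats):
    # convs: one AdaptiveRelationConv per relation r = 1..R (see sketch above)
    outs = [conv(h, ei, dh) for conv, ei, dh in zip(convs, edge_indices, d_hats)]
    # h_i^(l) = [h_{i,1}^(l) || h_{i,2}^(l) || ... || h_{i,R}^(l)]
    return torch.cat(outs, dim=1)
```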

3.3 Representation Fusion Layer

In our model, the designed low- and high-pass filters give the node embeddings outputted by different convolution layers different degrees of smoothness and sharpness, which can facilitate the downstream classification task. Thus, we combine the intermediate embeddings outputted by the different layers as the final representation of the node:

$$\begin{aligned} {z_i} = [{x_i},h_i^{\left( 0 \right) },h_i^{\left( 1 \right) },...,h_i^{\left( L \right) }], \end{aligned}$$
(11)

where L is the number of convolution layers.
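In code, the fusion of Eq. (11) is a plain concatenation (sketch):

```python
import torch

def fuse_representations(x, hs):
    # hs = [h^(0), h^(1), ..., h^(L)]; z_i = [x_i, h_i^(0), ..., h_i^(L)], Eq. (11)
    return torch.cat([x] + hs, dim=1)
```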

3.4 Imbalance-Oriented Classification Layer

As introduced in Sect. 3.2, for each graph neural layer l under relation r, \(\alpha _{ij,r}^{\left( l \right) }\) can be directly computed from the learnable weight \(\omega _{ij,r}^{\left( l \right) }\) as \(\alpha _{ij,r}^{\left( l \right) } = 0.5\left( {\omega _{ij,r}^{\left( l \right) } + 1} \right) \in \left[ {0,1} \right] \); it measures the assortativity or disassortativity of edge \(e_{ij}\) and can thus be supervised when balancing the weights between the low- and high-pass filters. In fact, the supervision signal from nodes with known labels can be treated as ground truth for an auxiliary loss:

$$\begin{aligned} \mathcal{L}_r^{\left( l \right) } = - \sum \limits _{{e_{ij,r}} \in {\varepsilon _{t,r}}} {\left[ {{y_{ij,r}}\log \left( {\alpha _{ij,r}^{\left( l \right) }} \right) + \left( {1 - {y_{ij,r}}} \right) \log \left( {1 - \alpha _{ij,r}^{\left( l \right) }} \right) } \right] } , \end{aligned}$$
(12)

where \(y_{ij,r} \in \left\{ {0,1} \right\} \) labels an assortative edge (\(y_{ij,r} =1\)) or a disassortative edge (\(y_{ij,r}=0\)) under relation r, and \({\varepsilon _{t,r}}\) is the set of edges whose source and target nodes are both labeled under relation r. Across all layers and relations, the final loss for assortative and disassortative edges is formulated as follows:

$$\begin{aligned} {\mathcal{L}_\varepsilon } = \frac{1}{{L \times R}}\sum \limits _{l = 1}^L {\sum \limits _{r = 1}^R {\mathcal{L}_r^{\left( l \right) }} }, \end{aligned}$$
(13)

where L and R are the number of layers and relations, respectively.
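A sketch of Eqs. (12)-(13): each gate output is mapped back to \(\alpha = 0.5(\omega + 1) \in [0,1]\) and penalized with binary cross-entropy against the edge labels, averaged over all layer-relation pairs (the list layout and clamping below are our assumptions):

```python
import torch
import torch.nn.functional as F

def edge_aux_loss(omegas, edge_labels):
    # omegas[k]: gate outputs for one (layer, relation) pair, restricted to
    # edges with both endpoints labeled; edge_labels[k]: 1 = assortative,
    # 0 = disassortative (Eq. (12))
    losses = []
    for omega, y in zip(omegas, edge_labels):
        alpha = (0.5 * (omega + 1.0)).clamp(1e-7, 1 - 1e-7)
        losses.append(F.binary_cross_entropy(alpha, y.float()))
    # average over the L x R terms (Eq. (13))
    return torch.stack(losses).mean()
```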

Additionally, given the final node embeddings \(z_i\), the fraud detection problem can be treated as a binary node classification problem, which we model with a cross-entropy loss:

$$\begin{aligned} \begin{array}{*{20}{l}} {{\mathcal{L}_c} = - \sum \limits _{i \in V} {\left[ {{y_i}\log \left( {{p_i}} \right) + \left( {1 - {y_i}} \right) \log \left( {1 - {p_i}} \right) } \right] } }\\ {{p_i} = softmax\left( {MLP\left( {{z_i}} \right) } \right) .} \end{array} \end{aligned}$$
(14)
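A minimal sketch of the classification head of Eq. (14) follows (the MLP depth and width are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class FraudClassifier(nn.Module):
    def __init__(self, d_z, d_hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_z, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, 2))

    def forward(self, z, y=None):
        logits = self.mlp(z)
        p = F.softmax(logits, dim=1)                 # p_i
        # cross_entropy applies log-softmax internally, matching Eq. (14)
        loss = F.cross_entropy(logits, y) if y is not None else None
        return p, loss
```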

To sum up, we define the overall loss of our method as follows:

$$\begin{aligned} {\mathcal{L}_{total}} = {\gamma _c}{\mathcal{L}_c} + {\gamma _\varepsilon }{\mathcal{L}_\varepsilon } + \gamma {\left\| \varTheta \right\| ^2}, \end{aligned}$$
(15)

where \({\gamma _c}\) and \({\gamma _\varepsilon }\) (\({\gamma _c} + {\gamma _\varepsilon } = 1\)) are weights balancing the importance of the two losses, \({\left\| \varTheta \right\| ^2}\) is a regularization term to avoid over-fitting, and \(\gamma \) is its control coefficient. Note that, to alleviate the sample imbalance problem (i.e., the number of normal users is significantly larger than that of fraudsters), a label-aware sampler is proposed that takes the nodes' label frequency and degree information into consideration, giving the minority class a relatively high sampling probability. First, for the node sampling used in classification, the sampling probability is defined as follows:

$$\begin{aligned} P\left( {{v_i}} \right) \propto \frac{{\sqrt{{d_i}} }}{{Z\left( {\mathcal{C}\left( {{v_i}} \right) } \right) }}, \end{aligned}$$
(16)

where \({d_i} = \sum \limits _{r = 1}^R {\sum \limits _j {{A_{ij,r}}} }\) is the degree of node \(v_i\) over all relations, and \({Z\left( {\mathcal{C}\left( {{v_i}} \right) } \right) }\) is the label frequency of class \({\mathcal{C}\left( {{v_i}} \right) }\). The \({\sqrt{{d_i}} }\) term means that more "popular" nodes are more likely to be selected, and the \({Z\left( {\mathcal{C}\left( {{v_i}} \right) } \right) }\) term means that "rarer" nodes are more likely to be selected.

Table 1. Dataset statistics

The set of sampled nodes is denoted \(V_s\). Next, for the edge sampling under relation r, the sampling probability is defined as follows:

$$\begin{aligned} P\left( {{\varepsilon _{ij,r}}} \right) \propto \frac{{\sqrt{{d_{i,r}}{d_{j,r}}} }}{{Z\left( {\mathcal{C}\left( {{\varepsilon _{ij,r}}} \right) } \right) }}, \end{aligned}$$
(17)

where \({d_{i,r}} = \sum \nolimits _k {{A_{ik,r}}}\) is the degree of node \(v_i\) under relation r, and \({Z\left( {\mathcal{C}\left( {{\varepsilon _{ij,r}}} \right) } \right) }\) is the frequency of the edge label class \(\mathcal{C} {\left( {{\varepsilon _{ij,r}}} \right) }\) (i.e., assortative or disassortative). The sets of sampled edges under all relations are denoted \(\left\{ {{\varepsilon _{s,r}}} \right\} \left| {_{r = 1}^R} \right. \). Similarly, \({\sqrt{{d_{i,r}}{d_{j,r}}} }\) and \({Z\left( {\mathcal{C}\left( {{\varepsilon _{ij,r}}} \right) } \right) }\) capture the popularity and rareness of edge \({e_{ij}}\) under relation r. The edges between fraudsters (F-F) and between fraudsters and normal users (F-N) are rare but valuable; thus \(P\left( {{\varepsilon _{ij,r}}} \right) \) selects F-N and F-F edges with higher probability.
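Both samplers of Eqs. (16)-(17) reduce, in code, to a probability proportional to the square root of the degree term over the label-class frequency. The following numpy sketch (array layouts are our assumption) shows them together with a usage example:

```python
import numpy as np

def node_sampling_probs(degrees, labels):
    # P(v_i) ~ sqrt(d_i) / Z(C(v_i)), Eq. (16)
    class_freq = np.bincount(labels)[labels]      # Z(C(v_i)) for every node
    p = np.sqrt(degrees) / class_freq
    return p / p.sum()

def edge_sampling_probs(deg_src, deg_dst, edge_labels):
    # P(e_ij,r) ~ sqrt(d_i,r * d_j,r) / Z(C(e_ij,r)), Eq. (17)
    class_freq = np.bincount(edge_labels)[edge_labels]
    p = np.sqrt(deg_src * deg_dst) / class_freq
    return p / p.sum()

# Usage sketch: draw a training batch biased towards rare (fraud) nodes.
# batch = np.random.choice(len(labels), size=256, replace=False,
#                          p=node_sampling_probs(degrees, labels))
```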

4 Experiments

4.1 Experiment Setup

Datasets. Two public real-world fraud detection datasets (i.e., the YelpChi review dataset and the Amazon dataset [13]) are used to validate the performance of IDGL. The YelpChi dataset collects reviews of hotels and restaurants on the Yelp platform; the reviews are treated as nodes with three relations: 1) R-U-R links reviews posted by the same user; 2) R-T-R links reviews of the same product posted within the same month; 3) R-S-R links reviews of the same product with the same star rating. The nodes are labeled by Yelp's filter (spam) and recommendation (legitimate). The Amazon dataset is composed of users and their comments on musical instruments. Users are treated as nodes with three types of relations: 1) U-P-U links users who have commented on at least one same product; 2) U-S-U links users who have given at least one same star rating within the same week; 3) U-V-U links users whose mutual review TF-IDF similarities are in the top 5\(\%\). Note that a user is labeled a normal user or a fraudster according to whether more than 80\(\%\) or fewer than 20\(\%\) of their votes are helpful. The statistics of the two datasets are shown in Table 1.

Fig. 3. Disassortativity evidence. The x-axis is the proportion of disassortative edges in the neighborhood of a fraudulent node; the y-axis is the proportion of such fraudulent nodes among all fraudsters. NN denotes fraudulent nodes without any neighbors.

Baselines. In this section, several state-of-the-art GNN-based methods and their variants are compared to verify the effectiveness of our proposed method. The source code of our model is available.

GCN [8] is a general GNN model that aggregates the embedding information of a node's first-order neighbors. GAT [15] is an attention-based graph neural network that incorporates the attention mechanism into the aggregation process. GraphSAGE [6] is an inductive graph neural network that applies a multi-order node sampling strategy to node aggregation. GEM [12] is an improved graph neural network for malicious account detection that constructs heterogeneous account-device graphs by summarizing the weaknesses of the attackers. FdGars [17] is a GCN-based fraud detection method that reconstructs a relational graph between fraudsters and normal users based on multi-context information. GraphConsis [11] is a heterogeneous GNN-based method that aims to address the inconsistency problems of context, feature, and relation. CARE-GNN [3] is a GNN-based method that investigates the camouflage behavior of fraudsters and its negative influence on GNN-based fraud detectors, proposing a label-aware similarity measure and a similarity-aware selector. FRAUDRE [19] is an improved GNN method that focuses on the graph inconsistency and imbalance issues caused by camouflaged fraudsters.

Note that we perform GCN, GAT, GraphSAGE, and FdGars on the graphs characterized by Definition 2 (i.e., ALL in Table 1), and perform the other methods on multi-relation graphs characterized by Definition 3.

Evaluation Metrics. Since the YelpChi and Amazon datasets have imbalanced classes, following previous work, we use AUC, Macro-Recall, and Macro-F1 for performance evaluation. As a widely used binary classification metric, AUC is computed from the relative ranking of the prediction probabilities of all samples, which eliminates the influence of class imbalance.
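For reference, the three metrics can be computed with scikit-learn as below (a sketch on toy arrays; the paper does not specify its evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1])              # toy ground-truth labels
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.3, 0.9])  # predicted fraud probability
y_pred = (y_prob >= 0.5).astype(int)               # thresholded prediction

print("AUC         :", roc_auc_score(y_true, y_prob))  # ranking-based metric
print("Macro-Recall:", recall_score(y_true, y_pred, average="macro"))
print("Macro-F1    :", f1_score(y_true, y_pred, average="macro"))
```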

4.2 Evidence of Graph Disassortativity and Imbalance

First, we investigate the existence of graph disassortativity. Specifically, for each fraudulent node we calculate the proportion of disassortative edges among all its one-hop edges under each relation subgraph, and then count how the proportion of such fraudulent nodes among all fraudsters changes as the disassortativity proportion grows, as shown in Fig. 3. We can observe that numerous fraudulent nodes have high disassortativity in these subgraphs, and more than half of the fraudsters have a disassortativity proportion larger than 80\(\%\). Consequently, the fraud graph is naturally disassortative, and we need to take the separation of assortativity and disassortativity into consideration.
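The per-node statistic behind Fig. 3 can be computed as in this sketch (assuming a dense 0/1 numpy adjacency matrix for one relation subgraph):

```python
import numpy as np

def disassortativity_ratio(adj, labels, node):
    """Fraction of `node`'s one-hop neighbors whose class differs from its own."""
    nbrs = np.nonzero(adj[node])[0]
    if len(nbrs) == 0:
        return None              # the "NN" case: a fraud node with no neighbors
    return float(np.mean(labels[nbrs] != labels[node]))
```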

Table 2. Edge type statistics

Next, we further study the imbalance of the fraud graph and find that node and edge imbalance is widespread. Specifically, for node imbalance, as counted in Table 1, only 14.53\(\%\) and 9.50\(\%\) of labeled nodes are fraudsters, which makes the model dominated by the majority class (i.e., the benign users) and reduces its generalization ability. Furthermore, we investigate the edge imbalance of fraud graphs. We first classify the edges into three types, namely N-N (edges between normal users), F-F (edges between fraudsters), and F-N (edges between normal users and fraudsters), and then count their numbers under the different relational subgraphs, as shown in Table 2. The number of N-N edges is by far the largest, which may make node aggregation dominated by N-N edges and weaken the ability to model the differences (between fraudsters and normal users) and the commonalities (between fraudsters), even though these are the keys to fraud detection. Consequently, node and edge imbalance is widespread in fraud detection, and we need an imbalance-oriented node and edge sampling strategy to eliminate the bias toward the majority classes.

Table 3. Performance under various ratios of the Amazon and YelpChi training sets. Recall and F1 abbreviate Macro-Recall and Macro-F1.

4.3 Performance Comparison

In this section, we compare our proposed method IDGL with the state-of-the-art methods on both the Amazon and YelpChi datasets. We report the metrics (i.e., AUC, Macro-Recall, and Macro-F1) under different training ratios (from \(10\%\) to \(40\%\)) for both datasets, as shown in Table 3, and make the following observations.

First, comparing the methods that operate on the single graph (i.e., GCN, GAT, GraphSAGE, and FdGars) with those that operate on the multi-relation graph (i.e., GEM, GraphConsis, CARE-GNN, FRAUDRE, and IDGL), the latter mostly outperform the former, except for GEM, which offers two insights. On the one hand, the multi-relation graph contains richer information than the single graph, which provides the opportunity for performance improvements. On the other hand, richer data implies more complex relationships, so directly applying plain GNNs to fraud detection on a multi-relation graph is unworkable, and the relationship between a node and its neighbors must be handled more carefully. GraphConsis, CARE-GNN, and FRAUDRE achieve promising performance by introducing similarity measures and fraud-aware modules into the node aggregation process, and IDGL outperforms all the other SOTA methods via its learnable high- and low-pass filters, which adaptively learn the differences and commonalities between a node and its neighbors to facilitate the target task.

Second, as shown in Table 1, node imbalance is widespread in fraud detection. Among GNN-based fraud detection methods, CARE-GNN and FRAUDRE take node imbalance into consideration to eliminate the training bias towards the majority class (i.e., the normal users), and achieve better performance than the other baselines. However, they do not consider edge imbalance. As discussed in Sect. 4.2, we categorize the edges into three types: N-N, F-F, and F-N. F-N edges can be treated as a guide to learn the differences between fraudsters and normal users, and F-F edges are rare but valuable for learning the commonalities between fraudsters; both help fraud detection. Thus, our edge sampling method makes the model pay more attention to F-N and F-F edges. Consequently, IDGL achieves better performance than CARE-GNN and FRAUDRE by taking both node and edge sampling into consideration.

Fig. 4. Ablation analysis of the learnable adaptive filter and node/edge sampling on the Amazon dataset, measured by AUC, Recall, and F1.

4.4 Ablation Analysis

High- and Low-Pass Filters. To demonstrate the effectiveness of the adaptive filter, we conduct an ablation study on the Amazon dataset, varying the training percentage from \(10\%\) to \(40\%\), as shown in Fig. 4(a); similar results are observed on the YelpChi dataset. Specifically, we construct two ablation models by replacing the learnable adaptive high- and low-pass filters with only the low-pass filter (i.e., \({\alpha _{ij,r}} = 1\), denoted IDGL+) and only the high-pass filter (i.e., \({\alpha _{ij,r}} = 0\), denoted IDGL−), respectively. Both IDGL+ and IDGL− achieve competitive fraud detection performance. Moreover, IDGL− outperforms IDGL+ because, as shown above, the fraud graph is highly disassortative, and IDGL+ only aggregates low-frequency signals from a node's neighbors without considering the differences between classes, so fraudulent features are submerged and lack discrimination. IDGL outperforms both by adaptively combining the advantages of the low- and high-pass filters.

Node and Edge Sampling. To demonstrate the effectiveness of node and edge sampling, we construct two ablation models by removing either edge sampling alone or both samplers on the Amazon dataset, varying the training percentage from \(10\%\) to \(40\%\), as shown in Fig. 4(b); similar results are observed on the YelpChi dataset. Compared with the ablation model without any imbalance-oriented sampling, the one with node sampling performs better, as it mitigates the imbalance issue to some extent. Further introducing edge sampling, which makes the model pay more attention to the valuable edge types (i.e., F-F and F-N), improves performance again, showing the effectiveness of node and edge sampling.

Fig. 5. Performance of IDGL with varying embedding size, and its running efficiency, on the Amazon dataset.

4.5 Parameter Sensitivity and Running Efficiency

In this section, we investigate the parameter sensitivity and running efficiency of IDGL.

First, with \(40\%\) of the Amazon dataset as the training set, we vary the embedding dimensionality in the range [8, 64]; the result is depicted in Fig. 5(a). Performance first improves slightly as the embedding size increases and becomes stable after 32. Considering that a larger embedding dimensionality requires higher computational complexity, we set d to 32 to balance performance and complexity.

Second, to investigate the running efficiency of IDGL, we compare its average training time per epoch with that of FRAUDRE, the most competitive baseline, under different training percentages on the Amazon dataset. For fairness, we set the number of convolution layers to 2, the embedding size to 32, and the batch size to 256 for both; the result is depicted in Fig. 5(b). IDGL trains faster than FRAUDRE, showing better time efficiency.

5 Conclusion

In this paper, we propose an imbalanced disassortative graph learning framework called IDGL to solve the graph disassortativity and imbalance issues. To tackle graph disassortativity, an adaptive dual-channel convolution filter is developed to adaptively combine the advantages of dual-channel (i.e., low- and high-frequency) signals from a node's neighbors, which helps assimilate nodes with assortative edges and discriminate nodes with disassortative edges. For the graph imbalance issues, label-aware node and edge samplers are designed with consideration of node popularity and label class frequency, which helps the model both eliminate the bias towards the majority classes and pay more attention to the valuable edges (i.e., F-F and F-N). Extensive experiments on two public fraud datasets demonstrate the effectiveness of our method.