1 Introduction

With the booming of the Internet and telecommunication industries, various fraud activities have emerged in the fields of finance (Wang et al., 2019; Zhong et al., 2020), social security (Van Vlasselaer et al., 2017), and healthcare (Zhang et al., 2022), leading to user privacy breaches, personal property losses, and so on. Since frauds often occur in graph-like data such as the Internet, graph-based methods are widely used to detect fraudsters. Compared with traditional graph-based methods, GNN-based methods leverage the rich feature information and structural information of fraudsters, and thus gain more and more attention.

Although these GNN-based fraud detection methods have made much progress, there still exist the following two main challenges.

Graph Heterophily. Generally, GNNs are essentially established on the homophily assumption that linked nodes tend to share similar features or have the same labels(Hamilton et al., 2017; Abu-El-Haija et al., 2019), which is exactly the opposite of fraud detection (Liu et al., 2021). Specifically, as confirmed by some researchers(Kaghazgaran et al., 2019; Ge et al., 2018), fraudsters deliberately make extensive connections with normal users and try to prevent contact with fraudsters to camouflage themselves, leading to an increase in graph heterophily in the local social network. To illustrate this phenomenon with statistics, we calculate the heterophily ratio of heterophilic edges to all adjacent edges of each fraudster in different relational subgraphs (i.e., meta-paths) from two opinion fraud detection datasets Amazon and Yelpchi (see Section 5.1 for details), and calculate the proportion of the number of fraudsters with the corresponding heterophily ratio to all the fraudsters in the whole graph from low to high. As Fig. 1 shows, these relational subgraphs, except the relation R-U-R of YelpChi, include a large number of fraudsters whose heterophily ratio is very high. Specifically, more than 80% of the fraudsters on Amazon have more than 50% heterophily; more than 70% of the fraudsters on the relation R-T-R and the relation R-S-R of Yelchi have more than 50% heterophily ratio. It is worth noting that more than 35% of fraudsters on Amazon have 100% heterophilic edges, which means that when simply aggregating their neighbor attributes based on GNNs, the fraudster’s features will be completely swamped inside the normal users and will be difficult to identify correctly. To address this problem, several improved GNN-based algorithms have been proposed and they mainly fall into three directions, namely neighbor filtering-based methods (Dou et al., 2020; Liu et al., 2021), neighbor extension-based methods (Pei et al., 2019; Chien et al., 2020) and graph partitioning-based methods (Manaskasemsak et al., 2021; Corizzo & Slenn, 2022; Xu et al., 2021). However, it is a significant challenge to set personalized neighborhood sizes or fraud subgraphs for the different fraudsters. Meanwhile, fraudsters excluded from the neighborhood or fraud subgraph are ignored for aggregation, resulting in valuable information being lost.

Class Imbalance. Another challenge is the class imbalance problem. In general, fraudsters usually make up a fewer proportion than normal users, showing a highly skewed distribution. As shown in Table 2, in the Amazon dataset, 90.5% of the nodes are normal users while only 9.5% are fraudsters and in the Yelpchi dataset, 85.5% of the nodes are normal users while only 14.5% are fraudsters. The class imbalance problem poses a challenge to existing GNN-based methods because the majority of class may dominate the loss function of the GNN, making the trained GNN overfit the majority class (i.e., the normal users) and fail to predict accurately for the minority class (i.e., the fraudsters). Unfortunately, considering that a fraudster may still have the majority of his/her connections to the normal users, the class imbalance exacerbates the graph heterophily with more difficulties for identification. Though some recent works (Liu et al., 2021; Shi et al., 2020; Zhang et al., 2021) have noticed similar challenges, their solutions either use a single-relational balanced sampler or an imbalanced distribution-oriented loss function, both of which fail to consider differences in the prevalence of nodes under different relationships, making them difficult to apply to the problem of fraud detection under multiple relationships.

Fig. 1
figure 1

Evidence of graph heterophily. The x-coordinate represents the graph heterophily ratios and y-coordinate represents the proportion of fraud nodes with corresponding graph heterophily ratios." NN" denotes a fraud node with no neighbors

To address the above challenges, we propose an Imbalanced Graph Structure Learning framework (IGSL) for fraud detection. Specifically, IGSL is composed of four module layers: (1) a multi-relational class-balanced sampler layer. To alleviate the effects of class imbalance, a multi-relational class-balanced nodes sampler is designed with the consideration of nodes’ prevalence (i.e., degree) under different relationships and corresponding label class frequency; (2) a graph-independent embedding layer, which is used to roughly pre-process the raw graph to alleviate the graph heterophily by a structure-independent embedding; (3) an iterative graph structure learning layer. We introduce an iteration metric within the graph convolution framework to iteratively construct the global homophilic adjacency matrix, making homophilic nodes connected and heterophilic nodes disconnected from each other; (4) an embedding aggregation layer, which combines the intermediate embeddings to be the final representation of nodes. We summarize the main contributions of our paper as:

  • We formulate the graph-based fraud detection problem as an imbalanced heterophily graph node classification task and propose an imbalanced graph structure learning framework to deal with the heterophily and class imbalance problem on graphs.

  • An iterative deep graph structure learning method that iteratively constructs the global homophilic adjacency matrix is further developed to make the reconstructed graph structure and node embeddings optimal for fraud detection. A multi-relational class-balanced sampler is proposed to reinforce the learning of the minority class.

  • Experiments on two public datasets demonstrate that IGSL outperforms the state-of-the-art baselines.

The rest of our paper is organized as follows. Section 2 reviews relevant studies in the literature. Section 3 details the definition and problem statement. Section 4 describes the model design of our method. Section 5 presents the experimental setup and the corresponding results. Finally, in Sect. 6, we summarize the paper and discuss future work.

2 Related work

Graph-based fraud detection aims at identifying fraudsters from normal users in graph-structured data. Due to the excellent representation capabilities of GNNs, GNNs are widely used in fraud detection. However, classic GNN-based fraud detectors are vulnerable to topology inconsistency and class imbalance problems, and current scholars are dedicated to designing robust GNNs to defend against the problems in graph data. GraphConsis (Liu et al., 2020) is a pioneer work concerning topology inconsistency in fraud detection by implementing dissimilar neighbors filters for nodes based on a pre-defined threshold. CARE-GNN (Dou et al., 2020) adopts a reinforcement learning-based module to extend the neighbor filtering operation to make the thresholds adaptive. PC-GNN (Liu et al., 2021) adopts a label-balanced sampler and neighbors over-sampling/under-sampling strategy to solve the class imbalance problem and topology inconsistency problem. FRAUDRE (Zhang et al., 2021) adopts an imbalance-oriented classification module to solve the class imbalance problem in fraud detection. AO-GNN (Huang et al., 2022) adopts AUC-maximization to resolve the label-imbalance problem for GNNs. Most of the mentioned approaches adopt the simple neighbor filtering-based method or neighbor extension-based method to address the hererophily problem on graphs. Different from them, we formulate the fraud detection problem as an imbalanced graph structure learning task and learn a task-relevant graph structure and node representation on the imbalanced graph to yield optimal results compared to the above methods.

3 Definition and problem statement

In this section, we first give the conception of homophily and heterophily, imbalanced graph, and multi-relational imbalanced graph. Then, we formulate the graph-based fraud detection problem. Furthermore, we summarize important symbols, as shown in Table 1.

Table 1 Glossary of notations

3.1 Definition

Definition 1

(Homophily and Heterophily): For a graph, an edge (connection) is homophilic if the two nodes connected by an edge belong to the same class. Otherwise, this edge is heterophilic. A graph consisting of homophilic edges is called a homophilic graph, and a graph consisting of heterophilic edges is called a heterophilic graph. In particular, a fraud graph has both homophilic and heterophilic edges.

Definition 2

(Imbalanced Graph): Given a series of labels \(\varvec{Y}=\{y_1,y_2,...,y_i\}\) in graph \(\varvec{\mathcal {G}}\), where i denotes the total number of label classes. We use class imbalance ratio \(\rho\) to measure the extent of class imbalance.

$$\begin{aligned} \rho = \frac{{{{\max }_i}(|{{y_i}} |)}}{{{{\min }_i}(|{{y_i}} |)}}, \end{aligned}$$
(1)

where \({{{\max }_i}(|{{y_i}} |)}\) and \({{{\min }_i}(|{{y_i}} |)}\) return the maximum and minimum class size over all i classes, respectively. If \(\rho >1\), the class with the maximum size is called the majority class, and the class with the minimum size is called the minority class. If \(\rho\) is far larger than 1, we call \(\varvec{\mathcal {G}}\) an imbalanced graph, otherwise, the graph is balanced.

Definition 3

(Multi-relational Imbalanced Graph): Let \(\varvec{\mathcal {G}}=(\varvec{\mathcal {V}},\varvec{\mathcal {E}},\varvec{\mathcal {A}},\varvec{\mathcal {X}},\varvec{Y})\) be an undirected multi-relational graph. Specifically, \(\varvec{\mathcal {V}}=\{v_1,...,v_N\}\) denotes a set of nodes; \(\varvec{\mathcal {E}}=\{ \mathcal {E}_1, \mathcal {E}_2,...,\mathcal {E}_R\}\) is the edges set of R relationships, where \(e^r_{i,j} \in \mathcal {E}_r\) indicates that there is an edge between node i and j under the r-th relationship; \(\varvec{\mathcal {A}}=\{A_1, A_2,...,A_R\}\) denotes the responding adjacency matrix of relationships where \(A^ r_{i,j}=1\) if \(e^ r_{i,j} \in \mathcal {E}_r\); \(\varvec{\mathcal {X}}=\{\varvec{x}_1,\varvec{x}_2,...,\varvec{x}_N\}\) is the attribute vector of node and \(\varvec{x}_i \in {\mathbb {R}^d}\); \(\varvec{Y}=\{y_1,y_2,,...,y_N\}\) is the label set in which \(y_v \in \{0,1\}\), where 1 represents fraudster and 0 represents normal user, if the class imbalance ratio \(\rho\) of \(\varvec{\mathcal {G}}\) is far larger than 1, we call \(\varvec{\mathcal {G}}\) a multi-relational imbalanced graph.

3.2 Problem statement

Definition 4

(Graph-based Fraud Detection): Given a multi-relational imbalanced graph \(\varvec{\mathcal {G}}=(\varvec{\mathcal {V}},\varvec{\mathcal {E}},\varvec{\mathcal {A}},\varvec{\mathcal {X}},\varvec{Y})\) defined in definition 3 where each node has a ground truth label of fraudster or normal user. In this study, the graph-based fraud detection is considered as a transductive semi-supervised binary classification problem on graph \(\varvec{\mathcal {G}}\) where both the training and testing samples are accessed during the training phase. The graph-based fraud detector is trained based on the labeled node information and the graph consisting of homophilic edges and heterophilic edges under multiple relationships. The trained model is then used to predict the probability of suspiciousness of unlabeled nodes.

4 Methodology

We first give a brief overview of our approach and then specify the proposed novel method, including a detailed description of each component.

4.1 Overview

In this section, we give details of the proposed framework IGSL. An illustration of the proposed framework is shown in Fig. 2. IGSL is composed of four components: a multi-relational class-balanced sampler module, a graph-independent embedding module, an iterative graph structure learning module, and an aggregation module. Next, we give details of each component.

4.2 Multi-relational class-balanced sampler

We devise a multi-relational class-balanced sampler to pick nodes for training. The key idea lies in incorporating node prevalence in different relationships and class frequency into the sampling process. For node samplers, those of the minority class with more prevalence have a higher sampling probability than the majority class with less prevalence.

Formally, \(\varvec{\mathcal {G}}=(\varvec{\mathcal {V}},\varvec{\mathcal {E}},\varvec{\mathcal {A}},\varvec{\mathcal {X}},\varvec{Y})\) is a multi-relational imbalanced graph, where \(\varvec{\mathcal {A}}=\{A_1, A_2,...,A_R\}\) denotes adjacency matrix of relationships. For node \(v \in \varvec{\mathcal {V}}\), its sampling probability \(p_{r,v}\) under relation r is defined as Eq.(2)

$$\begin{aligned} p_{r,v} \propto \frac{{D_{r,v}}}{{\mathrm{{LF}}(y_v)}}, \end{aligned}$$
(2)

where \(D_{r,v} = \sum \nolimits _{i \in \mathcal {N}_{r,v}} {A_{r,v,i}}\) is the degree of node v under relation r which denotes the prevalence of node, \(\mathcal {N}_{r,v}\) denotes the neighbors of v under relation r, and \(\mathrm{{LF}}(y_v)\) denotes the frequency of labels for the class \(y_v\). Note that, \(D_{r,v}\) means that more “popular” nodes are more likely to selected, and \({\mathrm{{LF}}(y_v)}\) means the more “rare” nodes are more likely to be selected. Therefore, the set of picked nodes \(\varvec{\mathcal {V}}_p\) is marked as:

$$\begin{aligned} \varvec{\mathcal {V}}_p = \sum \nolimits _{v \in \varvec{\mathcal {V}}_\mathrm{{train}}} {\sum \nolimits _{r = 1}^R {\mathrm{{Pick}}(p_{r,v})} }, \end{aligned}$$
(3)

where \(\varvec{\mathcal {V}}_\mathrm{{train}}\) denotes the training set, and \(\text{Pick}\) denotes sampling function.

4.3 Graph-independent embedding

According to Wu et al. (2019), most of the graph heterophily in fraud detection is caused by the extensive illegitimate connections that fraudsters intentionally establish with normal users. A straightforward approach is to remove heterophilic edges and add homophilic edges using a structure-independent approach to the original structure, thus avoiding pitfalls. We apply a graph-agnostic multi-layer perceptron (MLP) to extract class-aware information from the original node attributes:

$$\begin{aligned} \varvec{h}_v = \sigma (\varvec{x}_v \varvec{\mathcal {W}}_1), \end{aligned}$$
(4)

where \({\varvec{x}_v} \in \varvec{\mathcal {X}}\) is the attribute vector of node v, \(\varvec{\mathcal {W}}_1 \in {\mathbb {R}^{d \times C}}\) is the learnable weight matrix for MLP, \(\sigma\) is the activation function, and \(\varvec{h}_v\) denotes the output of the final layer of MLP. We can obtain the soft assignment matrix \(\varvec{f}_v \in {\mathbb {R}^{1 \times C}}\) as follows:

$$\begin{aligned} \varvec{f}_v = \mathrm{{softmax}}(\varvec{h}_v), \end{aligned}$$
(5)
Fig. 2
figure 2

The figure demonstrates the framework of IGSL. The example graph has seven nodes, the cyan lines indicate heterophilic connections and the dark lines indicate homophilic connections. The nodes in red are fraudsters and blue are normal users. Firstly, a graph-independent embedding module using a multi-layer perceptron to roughly adjust the original structure. Secondly, an iterative graph structure learning module obtains the global homophilic adjacency matrix for fraud detection by performing iterative similarity learning on the node embeddings. Finally, nodes under different relationships are connected to obtain the final representation

where \(\varvec{f}_{v,c} \in \varvec{f}_{v}\) denotes the probability that node v belongs to class C. Let all parameters of the MLP be \(\theta _m\), then the optimal \(\theta _m^ *\) is obtained by minimizing the loss of the following predicted labels by the MLP:

$$\begin{aligned} \theta _m^{*} = \mathop {\text {arg}\,\text {min}}\limits _{\theta _m} {\mathcal {L}_{\text{pre}}} = \mathop {\text {arg}\,\text {min}}\limits _{\theta _m} -{\sum \limits _{v \in \varvec{\mathcal {V}}_p} {[{y_v}{\text{log}\varvec{f}}_v + (1 - {y_v})\log (1 - \varvec{f}_v)} ]}. \end{aligned}$$
(6)

Since \(\varvec{f}_{v}\) is under the guidance of partially known labels, it can capture class-aware information in attributes. Then, based on the matrix \(\varvec{f}_{v}\), we can calculate the probability that two nodes \(v_i\) and \(v_j\) belong to the same class:

$$\begin{aligned} {\varvec{S}_{ij}} = {\varvec{f}_i} \cdot \varvec{f}_j^T. \end{aligned}$$
(7)

Then we prune edges whose scores are below a threshold \({\tau ^{-}}\) under all relationships:

$$\begin{aligned} {\bar{e}}^r_{i,j} = \left\{ \begin{array}{ll} e^r_{i,j} & \varvec{S}_{ij} > {\tau ^{-}}\\ 0 & \varvec{S}_{ij} < {\tau ^{-}}, \end{array} \right. \end{aligned}$$
(8)

where \(e^r_{i,j} \in \mathcal {E}_r\) is an edge between node i and j under the r-th relation. Inspired by PC-GNN(Liu et al., 2021), we pay more attention to the fraud nodes. On the one hand, we remove the heterophilic edges between in the training set, and on the other hand, we oversample the neighbor of fraud nodes. We set a threshold \(\tau ^ +\) to control the oversampling, and fraud nodes with high similarity create links:

$$\begin{aligned} \widetilde{e}^r_{i,j} = \left\{ \begin{array}{l} 0\;\;\; y(i) \ne y(j)\\ 1\quad y({i}) = y(j)=1\;and\; \varvec{S}_{ij} > {\tau ^ + } \\ {\bar{e}}^r_{i,j}\;\;\;otherwise, \end{array} \right. \end{aligned}$$
(9)

where y(i) denotes the label class of node i. It is worth noting that the adjacency matrix \({\widetilde{A}}_r\) after graph-independent embedding responding to \({\widetilde{e}}^r_{i,j}\) is estimated based on node features, which are not constrained by the heterophily of networks.

4.4 Iterative graph structure learning

Although the heterophily of the graphs is somewhat alleviated by the graph-independent embedding layer, it is still not possible to eliminate the heterophily as the structural features of the user are ignored. Motivated by IDGL (Chen et al., 2020), we define the problem as an iterative graph structure learning process, learning a global homophilic adjacency matrix that is optimal for fraud detection.

Specifically, we use a GNN to learn t-th iteration node representations \(\varvec{H}_{r,t}^{(l)}\) under relation r in t-th iteration by utilizing the message passing scheme to aggregate information from nodes’ neighbors:

$$\begin{aligned} \varvec{H}^{(l)}_{r,t} = \beta \mathrm{{Aggr}}(\widetilde{A}_{r,ini},\varvec{H}_{r,t-1}^{(l)})+ (1-\beta ) \mathrm{{Aggr}}(\widetilde{A}_{r,{t-1}}^{(l)},\varvec{H}_{r,t-1}^{(l)}), \end{aligned}$$
(10)

where \({\widetilde{A}}_{r,ini}={\widetilde{A}}_r\) is the adjacency matrix of the initial graph obtained in graph-independent embedding layer, \({\widetilde{A}}^{(l)}_{r,t-1}\) is the t-1-th homophilic adjacency matrix learned under relation r in layer l, \(\varvec{H}_{r,t-1}^{(l)}\) is the t-1-th iteration node embedding under relation r in layer l, \(\varvec{H}_{r,0}^{(0)}=\mathcal {X}\) is the raw attribute vector of node, and \(\beta\) is a weight to balance the trade-off of initial graph structure and the homophilic adjacency matrix learned. Specifically, we choose the graph convolutional network (GCN) as the basic GNN. For a relation r, we perform metric learning on the node embedding and obtain homophilic adjacency matrix \(\widetilde{A}_{r,t}^{(l)}\) by the learned embedding similarity, where the homophily probability between nodes i and j is obtained by:

$$\begin{aligned} {\widetilde{A}}_{r,t}^{(l)}[i,j] = \left\{ \begin{array}{l} \Gamma _{r}^{(l)}(\varvec{h}_{r,t,i}^{(l)},\varvec{h}_{r,t,j}^{(l)})\;\;\;\Gamma _r^{(l)}(\varvec{h}_{r,t,i}^{(l)},\varvec{h}_{r,t,j}^{(l)}) \ge \varepsilon _r^{(l)}\\ 0 otherwise, \end{array} \right. \end{aligned}$$
(11)

where \(\varepsilon _r^{(l)} \in [-1,1]\) is the threshold that controls the sparsity, \(\varvec{h}_{r,t,i} \in \varvec{H}_{r,t}\) is the embedding of node i under relation r in t-th interation, and \(\Gamma _r^{(l)}\) is a K-head weighted cosine similarity function defined as:

$$\begin{aligned} \Gamma _{r,ij}^{(l)} = \frac{1}{K}\sum \limits _{p = 1}^K {\cos (\varvec{\mathcal {W}}_{r,p}^{(l)} \odot {\varvec{h}_{r,t,i}},\varvec{\mathcal {W}}_{r,p}^{(l)} \odot {\varvec{h}_{r,t,j}}) } , \end{aligned}$$
(12)

where \(\odot\) denotes the Hadamard product, \(\varvec{\mathcal {W}}_{r}^{(l)}\) is the learnable weight matrix that weights the importance of different dimensions of the feature vectors under relation r in l-th layer.

Node-anchor homophilic matrixHowever, Eq. (12) calculates the similarity score for all pairs of graph nodes under all relationships in all iterations with a computational complexity of \(\text{O}({N^2}*r*t*K)\), which is difficult to apply to fraud detection in large graphs. Inspired by anchor-based methods, we design an anchor-based similarity learning which learns a node-anchor matrix \(\mathcal {K} \in \mathbb {R}{^{(N\times s)}}\) with a computational complexity of \(\text{O}({ns})\) where s is the number of anchors. Specifically, we randomly select a set of \(s \in \varvec{\mathcal {V}}\) anchors nodes, where the number of s is usually much smaller than N. Thus, Eq. (11) can be rewritten as follows:

$$\begin{aligned} \varvec{\mathcal {K}}_{r,t}^{(l)}[i,k] = \left\{ \begin{array}{l} \Gamma _{r}^{(l)}(\varvec{h}_{r,t,i}^{(l)},\varvec{h}_{r,t,k}^{(l)})\;\;\;\Gamma _r^{(l)}(\varvec{h}_{r,t,i}^{(l)},\varvec{h}_{r,t,k}^{(l)}) \ge \varepsilon _r^{(l)}\\ 0 otherwise, \end{array} \right. \end{aligned}$$
(13)

where k is the anchor node. Similarly, we adopt a threshold \(\varepsilon _r^{(l)} \in [-1,1]\) to control the sparsity of the node-anchor graph \(\varvec{\mathcal {K}}_{r,t}^{(l)}\).

Node-anchor message passing According to the stationary Markov random walk theory, the homophilic adjacency matrix \(A^{(l)}_{r,t-1}\)can be fully recovered by the node-anchor homophilic matrix \(\varvec{\mathcal {K}}^{(l)}_{r,t-1}\).

Further, we decompose graph embedding based on the learned adjacency matrix into two steps to reduce the computational complexity. 1): compute the message propagation using the anchor-node matrix \({\varvec{\mathcal {K}}_{t-1}^{(l)T}}\) and 2): compute the message propagation using the node-to-anchor matrix \(\varvec{\mathcal {K}}^{(l)}_{t-1}\). Thus, the \(\text{GCN}(A_{r,{t-1}}^{(l)},\varvec{H}_{r,t-1}^{(l)})\) could be explained as follows,

$$\begin{aligned} \begin{array}{l} {\varvec{H}_{r,t-1}^{(l)}} = {\Lambda ^{ - 1}}\varvec{\mathcal {K}}_{r,t-1}^{(l)T}{\varvec{H}_{r,t-1}^{(l)}}\\ \text{GCN}(A_{t - 1}^{(l)},\varvec{H}_{r,t-1}^{(l)}) = {\Delta ^{ - 1}}\varvec{\mathcal {K}}_{r,t-1}^{(l)}\varvec{H}_{r,t-1}^{(l)}, \end{array} \end{aligned}$$
(14)

where \({\Delta _{kk}} = \sum \nolimits _{i = 1}^n {{\varvec{\mathcal {K}}_{ik}^{(l)}}}\) and \({\Lambda _{ii}} = \sum \nolimits _{k = 1}^s {{\varvec{\mathcal {K}}_{ik}^{(l)}}}\).

Dynamic stopping strategy we define the dynamic stopping strategy for graph structure learning as:

$$\begin{aligned} \frac{{|{\varvec{\mathcal {K}}^{(l)}_t} - {\varvec{\mathcal {K}}^{(l)}_{(t-1)}}|_F^2}}{{|{\varvec{\mathcal {K}}^{(l)}_t}|_F^2}} > \delta , \end{aligned}$$
(15)

where \(|\cdot |_F\) denotes the Frobenius norm of a matrix, \(\delta\) is the threshold of dynamic stopping. This means that graph structure learning converges in the two most recent iterations and indicates that an optimal homophilic adjacency matrix for fraud detection is found. After dynamic stopping, the current node embeddings \(\varvec{H}^{(l)}_{r,t}\) are used for the fraud detection.

4.5 Embedding aggregation

After the dynamic stopping, we have obtained the optimal node embedding \(\varvec{H}^{(l)}_r \in \mathbb {R}^{(N \times d_l)}\) at layer l under relation r, where \(r=1,...R\), \(l=1,...L\) and L is the number of layers. We adopt concatenation to aggregate the embeddings of nodes under different layers and different relationships, which is illustrated as Eq. (16) and \({\varvec{W}^l} \in \mathbb {R}^{(d_l \times (d_{l-1}+ R \cdot d_l))}\) is the learnable weight matrix.

$$\begin{aligned} {\varvec{H}^{(l)}} = \mathrm{{ReLU}}({\varvec{W}^{(l)}}({\varvec{H}^{(l-1)}_{}} \oplus \varvec{H}^{(l)} \oplus \varvec{H}_2^{(l)} \cdots \varvec{H}_R^{(l)})). \end{aligned}$$
(16)

The loss function of fraud detection is defined as follows:

$$\begin{aligned} {\mathcal {L}_\mathrm{{GNN}}}&= - {\sum \limits _{v \in \varvec{\mathcal {V}}_p} {[{y_v}\log z_v + (1 - {y_v})\log (1 - z_v)} ]} \\ z_v&=\mathrm{{MLP}}(\varvec{H}^{(l)}_v), \end{aligned}$$
(17)

where \(\mathcal {L}_\mathrm{{GNN}}\) refers to the cross-entropy loss between the prediction result and the ground truth, \(\text{MLP}\) refers to the multi-layer perceptron, \(y_v\) is either equal to 1 or 0 for the ground truth, and \(z_v\) refers to the identification probability for the \(\varvec{H}^{(l)}_v\) outputted by \(\text{MLP}\).

The overall loss function of model is formulated as Eq. (18), where \(\alpha\) is weights to balance the importance of different losses, \(\Vert \theta \Vert ^2\) is the regularization term to avoid over-fitting, and \(\lambda\) is the regularization coefficient.

$$\begin{aligned} {\mathcal {L}_{\mathrm{{Model}}}}\mathrm{{ = }}{\mathcal {L}_\mathrm{{GNN}}}{+}\alpha {\mathcal {L}_\mathrm{{pre}}}{+ }\frac{\lambda }{2}\Vert \theta \Vert {^2}. \end{aligned}$$
(18)

The overall training algorithm is summarized in Algorithm 1.

Algorithm 1
figure a

Learning strategy of the IGSL

5 Experimental evaluation

In this section, we investigate the effectiveness and robustness of the proposed IGSL model for graph-based opinion fraud detection tasks. Our goal is to answer the following questions.

  • RQ1: Does IGSL outperform the state-of-the-art methods for graph-based fraud detection?

  • RQ2: Does IGSL benefit from all four modules?

  • RQ3: How robust is IGSL on different heterophily and class-imbalance rates?

  • RQ4: How efficient is IGSL to run?

  • RQ5: How does the hyperparameters affect the performance of IGSL ?

5.1 Experimental setup

Dataset. We adopt two multi-relational opinion fraud detection datasets Amazon (McAuley & Leskovec, 2013) and YelpChi (Rayana & Akoglu, 2015) to validate the performance of IGSL. The statistical information of these two datasets is shown in Table 2. The Amazon dataset includes both legitimate and fraud reviews under the musical instrument category. In the Amazon dataset, nodes are users with 25-dimensional features and three manually defined relational graphs (i.e., meta-paths) are contained: 1) Relation U-P-U connects users who have reviewed more than one same product; 2) Relation U-S-U connects users who have more than one same star rating in a week; 3) Relation U-V-U connects users with top 5% of mutual review TF-IDF similarities among all users. Similarly, the YelpChi dataset includes both legitimate and fraudulent reviews of restaurants and hotels. In the YelpChi dataset, nodes are reviews with 32-dimensional features. Analogous to the Amazon dataset, this dataset includes three manually defined relational graphs: 1) Relation R-U-R connects reviews posted by the same user; 2) Relation R-S-R connects reviews with the same star rating (1–5 stars) under the same product; 3) Relation R-T-R connects reviews posted under the same product in the same month. All graphs under different relationships are merged together to form the single-relation ALL.

Table 2 Statistics of multi-relational opinion fraud detection datasets

5.2 Baselines

We compare several GNN-based fraud detection algorithms to demonstrate the effectiveness and superiority of our method in identifying fraudsters.

Classical GNN-based model: GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), DR-GCN (Shi et al., 2020), GraphSAGE (Hamilton et al., 2017), GraphSAINT (Zeng et al., 2019).

Classical graph structure learning model: IDGL (Chen et al., 2020) and Pro-GNN (Jin et al., 2020).

State-of-the-art GNN-based fraud method: GraphConsis (Liu et al., 2020), CARE-GNN (Dou et al., 2020), PC-GNN (Liu et al., 2021), FRAUDRE (Zhang et al., 2021), AO-GNN (Huang et al., 2022).

Among those baselines, GCN, GAT, DR-GCN, GraphSAGE, GraphSAINT, IDGL, and Pro-GNN are run on the single-relational graph (i.e., relation ALL in Table 2, where all relationships are merged while GraphConsis, CARE-GNN, PC-GNN, FRAUDRE, AO-GNN, and IGSL are run on the multi-relational graph.

Table 3 Performance (%) under different percentages of training data. F1 (abbr. F1-macro)

5.3 Experimental setting and implementation

We use the Adam optimizer to optimize the parameters of the IGSL with the learning rate set to 0.01, and the weight decay is 0.001. In the IGSL, \(N_\mathrm{{epoch}}=150\), \(L=1\), batch size \(N_\mathrm{{batch}}\) set to 256 (Amazon) and 1024 (YelpChi), \(\beta =0.5\), \(\varepsilon _r^{(l)}\) (\(\varepsilon\) for all relations and layers) set to 0.5 (Amazon) and 0 (YelpChi), \({\tau ^{-}} = 0.2\), and \({\tau ^{+}}\) is decided by the top-k distance of the minority class, where k equals half the average neighbor size of the minority class. \(\alpha\) and \(\lambda\) in the overall loss function are set to 2 and 0.002, respectively. For the training set and test set split, we follow FRAUDRE (Zhang et al., 2021) with the same random seed. For baselines of state-of-the-art GNN fraud method, we use the default hyperparameter settings according to published papers. For the remaining baselines (i.e., classical GNN models), we refer to the hyperparameters provided by previous research works, such as PC-GNN. Additionally, IGSL is implemented in torch 1.10.1 with Python 3.8 and all experiments are run on Ubuntu 20.04.1 LTS server with Cuda 11.4. Our source code is available at https://github.com/Ling-Fei-Ren/IGSL.git

For class imbalance classification, the evaluation metrics should be unbiased for any class (Luque et al., 2019). Like previous work (Liu et al., 2021), we utilize ROC-AUC (AUC), F1-macro, and GMean to evaluate the overall performance of all models.

5.4 Performance comparison (RQ1)

To answer RQ1, we compare the proposed framework IGSL with the baselines in the opinion fraud detection task. The corresponding AUC, F1-macro, and GMean on Amazon and YelpChi are reported in Table 3. We can make following observations:

Firstly, IGSL outperforms the baselines in terms of AUC, F1-macro, and GMean metrics at different training percentages. For example, when the training percentage is set to 40%, IGSL achieves performance improvements of 1.42%, 2.51%, and 0.73% on the Amazon dataset in terms of AUC, F1-macro, and GMean compared with the state-of-the-art baseline model. On the Yelpchi dataset, the performance gap is 2.86%, 2.64%, and 1.54%, respectively. This is because IGSL employs an iterative graph structure learning process to learn a global homophilic graph that is optimal for downstream fraud detection tasks, which could be a better graph input for GNNs to learn better node embeddings. In contrast, neighbor filtering-based methods in state-of-the-art baselines tend to fall into local optima. In addition, a multi-relational balanced sampler can further alleviate the problem of graph heterophily increasing due to class imbalance by considering the prevalence of nodes under different relationships. Therefore, the proposed model exhibits superior performance compared to all baselines. However, it is worth noting that in some cases the IGSL model does not achieve the best performance. For example, when the training ratio is set to 30%, the AO-GNN model achieved a higher GMean score on the Amazon dataset compared to IGSL. This is due to the fact that AO-GNN employs both classifier parameter search and edge pruning policy search methods, respectively. The former mitigates the problem of class imbalance, and the latter policy searching is designed for graph noise removal.

Secondly, GraphConsis, CARE-GNN, PC-GNN, FRAUDRE, and AO-GNN are five state-of-the-art multi-relational graph-based fraud detection methods, in which GraphConsis and CARE-GNN only focus on the heterophily problems while PC-GNN and FRAUDRE focus on both graph heterophily and the class imbalance at the same time. PC-GNNN, FRAUDRE and AO-GNN perform better than GraphConsis and CARE-GNN in all these metrics, which indicates that solving the class imbalance problem is helpful for fraud detection. Meanwhile, PC-GNN and AO-GNN performs better than FRAUDRE, indicating the contribution of modifying the local structure to fraud detection.

Thirdly, IDGL and Pro-GNN are two representative graph structure learning on the balanced graph. However, they perform worse than IGSL, which indicates that classical graph structure learning models are not suitable for fraud detection where classes are imbalanced.

Fourthly, GCN, GAT, DR-GNN, GraphSAGE, and GraphSAINT are five classical GNN-based methods implemented in the single-relational graph ALL. The performance of these methods is generally worse than that of GraphConsis, CARE-GNN, PC-GNN, and FRAUDRE which run in a multi-relational graph. The reason for this is, on the one hand, none of these methods take any measures to alleviate the class imbalance. On the other hand, a single-relational graph loses multi-relational profiles, leading to the aggravation of graph heterophily.

Fig. 3
figure 3

Ablation study

5.5 Ablation study (RQ2)

Fig. 4
figure 4

Robustness for graph heterophily

Fig. 5
figure 5

Robustness for class imbalance

To answer RQ2, we conduct ablation experiments to verify the contribution of each module. The three variants are (1) IGSL \(\backslash\)p: which does not include the multi-relational class-balanced sampler module, so the nodes are sampled with the same probability. (2) IGSL \(\backslash\)e: which does not include the graph-independent embedding module, so the iterative graph structure learning module is performed on the raw graph directly. (3) IGSL \(\backslash\)c: which switches the graph structure learning to GCN so the neighbor aggregated as the raw graph. As shown in Fig. 3, IGSL is superior to the above three variants w.r.t. all evaluation metrics as the percentage of training samples varies from 40% to 10%. Specially, we can observe:

The performance of IGSL \(\backslash\)p is comparably worse than IGSL, which shows the necessity of a multi-relational class-balanced sampler, which is better at finely modeling the importance of different classes under different relationships than a simple single-relational imbalanced sampler.

The performance of IGSL \(\backslash\)c is comparably worse than IGSL, which demonstrates that graph structure learning has a contribution to solving the graph heterophily problem.

5.6 Robustness comparison (RQ3)

Fig. 6
figure 6

Time efficiency analysis on Amazon and YelpChi

To answer RQ3, we evaluate the robustness of IGSL in defending against graph heterophily and class imbalance. We only report results on Amazon, since similar patterns are observed in YelpChi.

Robustness for graph heterophily. To evaluate the robustness of IGSL to graph heterophily, we vary the heterophilic edges proportion in the multi-relational graph. We randomly select part of fraud nodes (100 nodes for example) and add heterophilic edges with 10, 20, 30, 40, 50, 60, and 70 normal users with highly similar features (feature similarity > 0.8) to each fraud node in the raw graphs. Every experiment is conducted 10 times and the average results are presented in Fig. 4, IGSL achieves better performance compared to PC-GNN, CARE-GNN, and FRAUDRE. We speculate that the local neighbors filtering strategy cannot completely filter the heterophilic edges, leading to sub-optimal results, while the graph structure learning method obtains the global homophily adjacency matrix for fraud detection in an iterative manner, which is robust to heterophily.

Robustness for class imbalance. To evaluate the robustness of IGSL to class imbalance, we vary the class imbalance rate by randomly deleting 10%, 20%, 30%, 40%, 50%, 60%, and 70% of the fraud nodes from the training set. Every experiment is conducted 10 times and the average results are presented. As shown in Fig. 5, IGSL achieves better results for different class imbalance ratios compared to PC-GNN, CARE-GNN, and FRAUDRE.

5.7 Time efficiency (RQ4)

To answer RQ4, we evaluate the run-time performance of IGSL and the baselines and record the average training time per epoch, with the proportion of training data varying from 10% to 40%. As shown in Fig. 6, IGSL has obvious advantages in running speed compared to the multi-relational graph-based fraud detection algorithms (e.g., CARE-GNN, PC-GNN, and FRAUDRE).

Fig. 7
figure 7

Hyper-parameter sensitivity

5.8 Hyper-parameter sensitivity (RQ5)

IGSL involves several hyperparameters. In this section, we study the hyperparameters that we consider critical for IGSL (i.e., batch size \(N_\mathrm{{batch}}\), balance weight \(\beta\), and sparsity threshold \(\varepsilon\)) and investigate the robustness of IGSL to various parameter settings. For a fair comparison, we only vary the value of the investigated parameter with all other parameters fixed.

5.8.1 Effect of batch size \(N_\mathrm{{batch}}\)

To investigate the effect of batch size \(N_\mathrm{{batch}}\) on the final detection, we varied the value of batch size \(N_\mathrm{{batch}}\) in the range of [64, 2048]. The results are shown in Fig. 7a, and it can be observed that the performance improves as the batch size increases. In fact, too small a batch size leads to instability in network convergence. Too large a batch size means less training time but requires large running memory requirements and can easily fall into local optimum. To make a tradeoff between performance and training time, we finally set the batch sizes \(N_\mathrm{{batch}}\) to 512 and 1024 for Amazon and Yelpchi, respectively.

5.8.2 Effect of balance weight \(\beta\)

To investigate the effect of the balance weight \(\beta\) on the final detection, we varied the value of the balance weight \(\beta\) in the range of [0.1, 0.9]. Specifically, \(\beta\) can influence the importance of the initialized adjacency matrix in iterative graph structure learning. A smaller \(\beta\) indicates a smaller weight of the initialized adjacency matrix and vice versa. The results are shown in Fig. 7b, and it can be observed that the detection performance increases with \(\beta\) increasing, which means that the increased \(\beta\) can capture the useful information in the original graph structure. When \(\beta >0.5\), the detection performance fluctuates, which means a too-large \(\beta\) further captures the noise in the original graph structure. Generally, the performance F1-score is best when the balance weight \(\beta\) is 0.5 for both datasets.

5.8.3 Effect of sparsity threshold \(\varepsilon\)

To investigate the effect of the sparse threshold \(\varepsilon _r^{(l)}\) (\(\varepsilon\) for all relationships and layers in the experiment) on the final detection, we vary the value of the sparse threshold \(\varepsilon\) in the range of [− 1, 1]. Specifically, \(\varepsilon\) can affect the sparsity of the learned adjacency matrix. A smaller \(\varepsilon\) indicates a smaller sparsity of the learned adjacency matrix and more computational resources required, and vice versa. The results are shown in Fig. 7c, where the performance of Amazon is best when \(\varepsilon = 0.5\) and the performance of Yelpchi is best when \(\varepsilon =0\).

6 Conclusion and future work

In this paper, we propose an imbalanced graph structure learning framework called IGSL to solve the graph heterophily and class imbalance problems in fraud detection. To solve the graph heterophily problem, we further develop an iterative graph structure learning module to iteratively aggregate global homophilic neighbors to learn a homophilic graph structure, which helps to make nodes with the same class have similar embedding representations while nodes with different classes have different embedding representations. For the class imbalance problem, a multi-relational class-balanced sampler is designed that considers the nodes’ prevalence under different relationships and corresponding class frequencies, which helps the model to eliminate the bias towards the major class (i.e., normal users) and focus more on the valuable minority class (i.e., fraudsters). Extensive experiments on two public fraud datasets demonstrate the effectiveness of our approach.

For future work, there are some research directions worth studying: (1) Fraud detection on dynamic graphs. In fact, user behaviors and social structures keep changing over time, and some studies (Rao et al., 2022; Jiang et al., 2022) have demonstrated that the dynamic evolution patterns of fraudsters differ significantly from those of normal users, and that fraud can be better modeled using dynamic graph-based methods. (2) Fraud detection based on non-manually defined meta-paths. Manually defined meta-paths require same priori knowledge from domain experts, which makes wide application limited. However, some current works (Hussein et al., 2018; Petković et al., 2022) have made it possible to extract meta-paths automatically, or without the need to define them in advance. They inspire us to study fraud detection based on non-manually defined meta-paths in future work.