1 Introduction

Graph Neural Networks (GNNs) have achieved superior performance by embedding features through the aggregation of representations of nodes and their neighbors [25]. GNNs benefit many applications across different tasks, such as computer vision [31], traffic prediction [14], recommendation systems [33], and risk control [19].

1.1 Existing problem

The factor that drives the success of GNNs is the rapid growth of high-dimensional data and the adjacency information that accompanies it. However, existing GNN methods face two main challenges. First, with the increasing awareness of security and privacy, the data-isolation problem has become serious: it limits the data size of a single party (client) and further damages the performance of GNNs. Furthermore, the isolated datasets in different clients are usually Non-Independent and Identically Distributed (Non-IID), because clients belong to diverse geographic locations or have different time windows of data collection. Therefore, it becomes increasingly difficult to train a global GNN model on Non-IID data in the data-isolation scenario.

Figure 1 shows a typical example of Non-IID graph data, where we assume there are I separated clients. These clients collect graph data from different sources with the same format. In other words, clients share the same feature domain, e.g., {f1, f2, f3}, but differ in sample space, which is represented by colorful nodes. Meanwhile, clients may have diverse graph structures over their nodes, i.e., heterogeneous graphs. Furthermore, the data distributions are likely to be Non-IID, as shown in Fig. 1.

Fig. 1 The data isolation problem with Non-IID graph data, assuming I clients with four nodes, three features, and different label distributions

Moreover, hyper-parameters are important for GNN learning algorithms. For example, the activation function determines the output of each layer, the regularization parameter impacts the calculation of the loss function, and the learning rate influences the update of model weights during back-propagation [20]. These hyper-parameters directly affect the training process of GNN models.

Intuitively, in order to achieve the best model performance, clients with Non-IID graph data are likely to need individual hyper-parameter sets rather than a single global one [13]. Due to the huge search space and the limited network conditions among clients, tuning hyper-parameters is quite time-consuming. Therefore, it is important to design a proper distributed GNN model for Non-IID datasets that is equipped with hyper-parameter optimization.

Unfortunately, there is little literature on solving the above problem. Although directly applying federated learning to GNNs seems a good choice, it has two main shortcomings [13]. First, federated learning faces a statistical challenge: its original goal, i.e., training a single global model on the union of clients' datasets, is no longer suitable for Non-IID graph data [35]. Second, the communication in federated GNN learning is time-consuming. This is because, in order to achieve the best performance, models and hyper-parameters are likely to differ across clients. Compared with traditional neural networks, a GNN has extra individual hyper-parameters to represent graph information, which further inflates the already long training time.

1.2 Our solution

In order to bridge these gaps, we propose an Auto Separated-Federated GNN (ASFGNN) learning paradigm. As graph data is often owned by companies and governments, we focus on the cross-silo federated learning in which the clients are a limited number of organizations with powerful computing ability and reliable communications [13]. Our proposed ASFGNN consists of two steps, i.e., GNN training and hyper-parameters optimization.

In the first step, the Separated-Federated GNN learning framework decouples a GNN model into two parts: a message passing sub-model that is conducted by clients separately, and a loss computing sub-model that is performed by clients federally. Specifically, clients first perform message passing, i.e., neighbor information aggregation, individually, and obtain node embeddings. Next, clients feed the embeddings into the discrimination model to compute the loss, and update both the message passing sub-model and the loss computing sub-model through backward propagation. Afterwards, the server securely aggregates the local discrimination models using the federated learning method and obtains the global discrimination model. Finally, the global discrimination model is broadcast to clients, which update their local discrimination models with the help of the Jensen–Shannon divergence.

In the second step, we propose a Bayesian optimization algorithm to automatically optimize the hyper-parameters of the Separated-Federated GNN model. That is, the Bayesian optimization algorithm takes hyper-parameters as input and regards the average of clients' evaluation metrics (e.g., precision) as output [7, 34], where these metrics are uploaded by clients and averaged by the server in a secure manner. In the end, we obtain the hyper-parameters that achieve the best metric.

To verify the performance of our proposed ASFGNN, we empirically compare the accuracy of SFGNN with that of the traditional federated GNN model, and analyze the efficiency of the Bayesian optimization method against the traditional grid search method.

We summarize our main contributions as follows:

  • We propose a novel Separated-Federated Graph Neural Network (SFGNN) learning framework, which can be used to learn any existing GNN models under privacy consideration.

  • We propose to adopt Bayesian optimization to tune hyper-parameters automatically, which significantly improves the efficiency of the SFGNN model.

  • We conduct experiments on three benchmark datasets, and the results demonstrate that our proposed SFGNN outperforms federated GNN in terms of accuracy, while ASFGNN significantly reduces the hyper-parameter tuning time of SFGNN compared with grid search.

2 Related work

In this section, we briefly review the literature on federated learning and hyper-parameters optimization.

2.1 Federated learning

Federated learning is a prevailing privacy-preserving approach that aggregates models or gradients rather than raw data [21]. However, the accuracy of federated learning drops significantly on Non-IID datasets [13]. Existing works propose different strategies to resolve this statistical challenge. One natural approach is to create a small shared dataset that makes the data across clients more similar [30]. For some applications, the contributions of clients to the global model are bounded according to the dataset characteristics [28]. Furthermore, model-agnostic meta-learning has been developed to meta-learn a global model, which can be used as a starting point for learning a good model on the Non-IID data of each client [8]. These methods adapt the federated learning model to Non-IID datasets, but they cannot be applied to GNN models directly: as shown in the preliminaries, a GNN model consists of two parts, among which the message passing part carries personal information and should therefore be learned individually.

Besides federated learning, Split Learning (SL) is another decentralized method, which trains the local models separately and sends hidden layers to the server [11]. The separated local models capture the personality of clients with Non-IID datasets [9]. However, it is obvious that the hidden layers leak private information, and deep local layers seriously decrease accuracy [3]. In this paper, we combine the advantages of federated learning and split learning, and propose a novel Separated-Federated Graph Neural Network learning framework.

2.2 Hyper-parameters optimization

Recently, there has been an increasing amount of literature on hyper-parameter optimization [34]. Grid search is the most traditional way of hyper-parameter tuning, which enumerates every possible configuration in the search space. Random search, which samples configurations randomly, is better than naive grid search. Moreover, Evolutionary Algorithm (EA) and Reinforcement Learning (RL) methods are used to generate new populations (bunches of configurations). Another conventional solution formalizes the machine learning process as a black-box optimization task; reference [27] finds the optimum of black-box objectives with the method of Bayesian Optimization (BO). BO is more efficient than EA and RL and shows promising results in hyper-parameter optimization [34]. In this paper, we apply BO as a prevailing approach to find proper hyper-parameters for our proposed model.

3 Preliminaries

In this section, we present some preliminary techniques and methods of our proposal, including Graph Neural Network (GNN), federated learning, secret sharing, Jensen-Shannon divergence, and Bayesian optimization.

3.1 Graph neural network

GNN learns node embeddings by aggregating the features of each node and its neighbors. The node embeddings are regarded as the new node representations, which are fed to downstream machine learning tasks. The process of GNN training includes two steps: message passing and loss computing. The first step is what distinguishes GNNs from other neural network models; it uses a generation function to infer node embeddings. A number of message passing functions have been proposed, e.g., random walk statistics based, attention based, similarity based, and convolution based [14, 23, 24, 29]. In this work, we select GraphSAGE as the node embedding generation function, which aggregates the embeddings from a node's local neighborhood in an inductive way [12]. The message passing process is described in Eq. 1, where k ∈{1,2,...,K} denotes the depth of neighborhood aggregation, hv,k denotes the embedding of node v at the k-th aggregation step, xv denotes the features of node v, and \(\mathcal {N}(v)\) denotes the neighbors of node v in the graph [2].

$$ \begin{array}{ll} & \textbf{h}_{\mathcal{N}(v),k} \leftarrow \text{AGG}_{k}(\{\textbf{h}_{u,k-1}, \forall u \in \mathcal{N}(v)\}),\\ & \textbf{h}_{v,k} \leftarrow \sigma\left( \textbf{W}_{k} \cdot \text{CONCAT} \left( \textbf{h}_{v,k-1},\textbf{h}_{\mathcal{N}(v),k} \right) \right), \end{array} $$
(1)

where AGGk is the aggregation function at the k-th step, such as the Mean, LSTM, and Pooling methods [12], and σ denotes a nonlinear activation function.
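To make Eq. 1 concrete, the following is a minimal NumPy sketch of K rounds of GraphSAGE-style message passing with a Mean aggregator; the graph layout, the ReLU activation, and the weight shapes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def graphsage_embed(x, neighbors, weights):
    """x: (n_nodes, d) feature matrix; neighbors: dict node -> list of
    neighbor ids (assumed non-empty); weights: list of K matrices,
    where weights[k] has shape (2 * d_k, d_{k+1})."""
    h = x  # h_{v,0} = x_v
    for w_k in weights:
        h_next = []
        for v in range(x.shape[0]):
            # AGG_k: mean of neighbor embeddings from step k-1
            h_nbr = np.mean(h[neighbors[v]], axis=0)
            # CONCAT own embedding with aggregated neighbors, then project
            z = np.concatenate([h[v], h_nbr]) @ w_k
            h_next.append(np.maximum(z, 0.0))  # ReLU as an example sigma
        h = np.stack(h_next)
    return h  # final node embeddings h_{v,K}
```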

3.2 Federated learning

Federated Learning (FL) was first proposed by Google [22]; it builds distributed machine learning models while keeping personal data on clients. In other words, federated learning models are trained via model aggregation rather than data aggregation. We suppose that I clients have their own datasets {D1, D2,...,DI}, collected from different sources with the same feature domain. The private raw dataset Di is kept locally; client i uses forward and backward propagation to update its own model Mi individually, which has the same neural network architecture as the other clients' models. Clients then upload the encrypted weights to the server with the help of secret sharing or homomorphic encryption [4, 6, 26, 36]. The server averages the uploaded model parameters to update the global federated model Ms, which is sent back to client i to replace the local model Mi.
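As a rough illustration of this training loop, the sketch below shows the plaintext logic of one FedAvg-style round; the encryption of the uploaded weights (Section 3.4) is deliberately omitted, and the function names are our own.

```python
import numpy as np

def local_update(weights, grads, lr):
    # Each client i updates its local model M_i on its own data.
    return [w - lr * g for w, g in zip(weights, grads)]

def server_aggregate(client_weights):
    # The server averages the uploaded parameters layer by layer
    # to form the global model M_s, which is then broadcast back.
    return [np.mean(np.stack(layer), axis=0)
            for layer in zip(*client_weights)]
```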

3.3 Jensen-Shannon divergence

The Jensen–Shannon divergence (JS) is widely used to evaluate the dissimilarity between two probability distributions [16]. JS has a finite value, ranging from 0 to 1 for two probability distributions. Motivated by [5], JS can be used to indicate the dissimilarity between two Non-IID datasets. Considering two probability distributions P and Q, the JS between P and Q is defined in Eq. 2.

$$ \begin{array}{@{}rcl@{}} & & \textbf{JS}\left( P \| Q \right) \leftarrow \frac{1}{2} \textbf{KL}\left( P \,\Big\|\, \frac{P + Q}{2} \right) + \frac{1}{2} \textbf{KL}\left( Q \,\Big\|\, \frac{P + Q}{2} \right),\\ & & \textbf{KL}\left( P_{1} \| P_{2} \right) \leftarrow \sum\limits_{x \in X} P_{1}(x)\log{\frac{P_{1}(x)}{P_{2}(x)}}. \end{array} $$
(2)

Since a machine learning model is built to represent the dataset it is trained on, the difference between the aggregated model on the server and the local model on a client can be approximated by the dissimilarity between the distribution of the overall participating data and that of the client's data.
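A direct implementation of Eq. 2 for two discrete label distributions might look as follows; we use a base-2 logarithm so that JS stays in [0, 1], as stated above, and the smoothing constant is our own addition to avoid division by zero.

```python
import numpy as np

def kl(p1, p2, eps=1e-12):
    # KL(P1 || P2) with base-2 log and small smoothing for zero entries
    return np.sum(p1 * np.log2((p1 + eps) / (p2 + eps)))

def js(p, q):
    m = (p + q) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give 0; disjoint supports give 1.
print(js(np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # ~0.0
print(js(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # ~1.0
```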

3.4 Secret sharing

Our proposal depends on Shamir's t-out-of-n threshold secret sharing algorithm [26]. Specifically, we use n-out-of-n additive secret sharing to protect privacy in this paper. For example, suppose there is an ℓ-bit value a owned by client \(i, i \in \mathcal {P}\) with \( \mathcal {P}= \{1,...,I\} \), which is to be shared among all participating clients. First, in order to encrypt (Shr(⋅)) the value a, client i generates random numbers \(\{a_{j} \in \mathcal {Z}_{2^{\ell }}, j \in \mathcal {P}, j \neq i\}\) and distributes aj to client \(j, \{j \in \mathcal {P}, j \neq i\}\). Then client i calculates \(a_{i} = a - {\sum }_{j} a_{j}\) mod \(2^{\ell }\), which is kept locally. For simplicity, we use \(\langle a \rangle_{k}\) to denote the share of a held by client k, \( \forall k \in \mathcal {P}\). To decrypt (Rec(⋅)) the shared value a, every client k \( (\forall k \in \mathcal {P} )\) sends its share \(\langle a \rangle_{k}\) to the server. The server computes \( {\sum }_{k} \langle a \rangle _{k}\) mod \(2^{\ell }, k \in \mathcal {P}\), and recovers the value a of client i.
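The following sketch illustrates Shr(⋅) and Rec(⋅) for n-out-of-n additive sharing over \(\mathcal{Z}_{2^{\ell}}\) with an assumed ℓ = 32; it shows the arithmetic only, not a networked deployment.

```python
import random

L_BITS = 32
MOD = 2 ** L_BITS  # arithmetic is over Z_{2^l}

def shr(a, n):
    # Generate n-1 random shares; the owner keeps a - sum(shares) mod 2^l.
    shares = [random.randrange(MOD) for _ in range(n - 1)]
    shares.append((a - sum(shares)) % MOD)
    return shares

def rec(shares):
    # Reconstruction: sum all shares modulo 2^l.
    return sum(shares) % MOD

shares = shr(42, 3)
assert rec(shares) == 42  # any n-1 shares alone reveal nothing about a
```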

3.5 Bayesian optimization

Bayesian Optimization (BO) is an effective method for black-box parameter optimization problems [34]. In this paper, we care about hyper-parameter optimization in the training of the GNN model, where we try to find the optimal hyper-parameter setting that maximizes the utility function:

$$ \theta^{*} = \underset{\theta\in{\Theta}}{\arg\max} f(\theta), $$
(3)

where 𝜃 denotes the hyper-parameters, such as the learning rate and the dimension of hidden units, Θ denotes the search space, and f is the utility function, which is measured by model metrics such as accuracy and the Area Under Curve (AUC) score. Typically, the evaluation of f is expensive and we cannot obtain a closed-form solution. Therefore, we treat Eq. 3 as a black-box optimization problem and adopt BO to solve it. The key ingredients of BO are a surrogate model that "imitates" f and an acquisition function that decides the next trial based on historical trials (i.e., hyper-parameter sets). In our paper, we use a Gaussian process (GP) as the surrogate model and the Expected Improvement (EI) function as the acquisition function [32].
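As an illustration of one BO step with a GP surrogate and the EI acquisition, consider the following sketch built on scikit-learn and SciPy; the candidate set and observed utilities are stand-ins for actual hyper-parameter trials.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, candidates, best_f, xi=0.01):
    # EI for maximization: reward candidates whose posterior mean
    # exceeds the best observed utility, weighted by uncertainty.
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_f - xi) / sigma
    return (mu - best_f - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_step(observed_theta, observed_f, candidates):
    # Fit the GP surrogate on past trials, then pick the EI maximizer
    # as the next hyper-parameter set to evaluate.
    gp = GaussianProcessRegressor().fit(observed_theta, observed_f)
    ei = expected_improvement(gp, candidates, best_f=np.max(observed_f))
    return candidates[np.argmax(ei)]
```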

4 The proposed method

In this section, we first give an overview of the proposed Automated Separated-Federated Graph Neural Network (ASFGNN) learning paradigm. We then present its three main components, i.e., separated learning for message passing on clients, federated learning for loss computing with Jensen–Shannon divergence, and hyper-parameters optimization with Bayesian optimization. Finally, we summarize the whole algorithm.

4.1 Overview

We first give an overview of the proposed ASFGNN learning framework. We focus on horizontally split datasets in this paper.

Our design of ASFGNN consists of two steps. First, we need to design a privacy preserving GNN learning model, which can solve the Non-IID problem and reduce the communication cost as much as possible. Second, since GNN has many hyper-parameters, we need to design a strategy to automatically optimize hyper-parameters to reduce the training time.

The first step is to design a practical GNN learning paradigm that does not leak the private plaintext data of clients. Inspired by existing works [10], we propose a Separated-Federated GNN (SFGNN) learning framework. The main idea is to decouple the computation module of GNN into two sub-modules, i.e., the Separated GNN learning (SGNN) model and the Federated GNN learning (FGNN) model, as shown in Fig. 2a. The former performs message passing and obtains the node embeddings, which serve as inputs to the latter. As clients have Non-IID datasets, node embeddings are generated separately with individual network architectures and hyper-parameters. After the generation of node embeddings with SGNN, FGNN trains the discrimination neural network using the federated learning algorithm.

Fig. 2 Our proposed automated separated-federated graph neural network model

Second, the hyper-parameters of SFGNN, such as the learning rate, regularization factor, and network structure, explode in number as the number of clients increases. We adopt the Bayesian optimization method to solve this black-box optimization problem, in which we regard the hyper-parameters of the model as inputs and the average of clients' metrics as outputs, as shown in Fig. 2b. The metrics of the SFGNN model on the clients are securely aggregated on the server. The server then optimizes the hyper-parameters and sends them back to the clients for another training round of SFGNN. In the end, the whole parameter-tuning time is greatly decreased, as the number of hyper-parameter search rounds is highly reduced.

In summary, we leverage the Bayesian optimization technique to automatically tune the hyper-parameters of the SFGNN model, which combines SGNN with FGNN.

Notations

Before presenting our model in detail, we first describe the notations. Since there are many notations, for clarity we summarize the ones used in this paper in Table 1.

Table 1 Notations and descriptions

4.2 Separated GNN learning (SGNN)

We summarize how to generate initial node embeddings for client \(i (i \in \mathcal {P})\) using the GraphSAGE method [12] in the SGNN Algorithm 1, where the entire graph Gi = (Vi, Ei) and the features of all nodes \(\textbf {x}_{v}^{i}\) \(\left ({\forall v \in V^{i}} \right )\) are provided as inputs. The weight matrices \({W_{k}^{i}}, \forall k \in \left \{ 1, ...,K \right \}\) are used to propagate information through the message passing layers. The first step is generating initial node embeddings from the nodes' private features, e.g., user features in social networks (line 2). Next, clients generate local node embeddings for the FGNN computation by aggregating multi-hop neighbors' information with the GraphSAGE method [12], as shown in lines 4-15 of Algorithm 1.

Algorithm 1 Separated GNN learning (SGNN)

4.3 Federated GNN learning (FGNN)

First of all, client \(i \left (\forall i \in \mathcal {P} \right )\) randomly initializes the weights of the Federated GNN learning model \(\overline W_{l,0}^{i}, l \in \{1,...,L\}\), with L denoting the maximum layer. Client i obtains the label distribution \({Q^{i}_{t}}\) \(\left ({Q^{i}_{t}}= \left \{ q_{t,1}^{i},q_{t,2}^{i},...,q_{t,J}^{i} \right \}\right )\) of the current batch during training epoch t with \( {n^{i}_{t}}\) samples, where J denotes the number of label classes, as shown in the FGNN Algorithm 2. Then client i counts the sample numbers of the different categories \({N^{i}_{t}} = \{ n_{t,1}^{i},n^{i}_{t,2},...,n^{i}_{t,J} \}\), where \({\sum }_{j=1}^{J} n_{t,j}^{i} = {n^{i}_{t}}\) (line 4). Meanwhile, client i updates the local FGNN model's weights \(W_{k,t}^{i}\) and \(\overline{W}_{l,t}^{i}\) using forward and backward propagation with its own embeddings \(H_{v,t}^{i}\) generated by Algorithm 1 (lines 5-11). The loss function \(L(\hat {y}_{t}^{i},{y_{t}^{i}})\) is defined by the task at hand, e.g., cross entropy loss for classification and mean absolute loss for regression. In this paper, we take the classification task as an example, whose loss is defined in Eq. 4.

$$ L(\hat{y}_{t}^{i},{y_{t}^{i}}) = - \frac{1}{{n_{t}^{i}}}\sum\limits_{j=1}^{J} y_{j,t}^{i}\log{\hat{y}_{j,t}^{i}} + {l_{2}^{i}} \cdot \left( \sum\limits_{k=0}^{K}||W^{i}_{k,t}||_{2} + \sum\limits_{l=1}^{L}||\overline W^{i}_{l,t}||_{2}\right). $$
(4)
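A NumPy rendering of Eq. 4 might look as follows; the array shapes, the smoothing constant, and the use of the matrix 2-norm via np.linalg.norm are our assumptions.

```python
import numpy as np

def fgnn_loss(y_true, y_pred, l2_coef, W_list, W_bar_list, eps=1e-12):
    # y_true, y_pred: (n, J) one-hot labels and predicted probabilities.
    ce = -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))
    # L2 regularization over both the message passing weights W_k
    # and the discrimination weights W_bar_l.
    reg = sum(np.linalg.norm(W, 2) for W in W_list + W_bar_list)
    return ce + l2_coef * reg
```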

Afterwards, \(\overline W_{l,t}^{i}\), \({N^{i}_{t}}\), and \({M^{i}_{t}}\) \(\left (i \in \mathcal {P}, l \in \{1,...,L\}\right )\) of all clients are uploaded to the server with the help of secret sharing (Shr(⋅)), assuming all clients participate in the federated learning, as shown in line 13. The server aggregates the global FGNN model \(\overline W_{l,t}^{s}\) by averaging the \(\overline W_{l,t}^{i}\), and obtains the global label distribution \({Q^{s}_{t}}\), the sample numbers \({N^{s}_{t}}\) of a training batch, and the average of the metrics Mt, all of which are regarded as outputs of the FGNN model, as shown in lines 16-19 of Algorithm 2. Then \(\overline W_{l,t}^{s}\) and \({Q^{s}_{t}}\) are sent back to the clients. Finally, client i calculates \(j{s_{t}^{i}}\) from \({Q^{i}_{t}}\) and \({Q^{s}_{t}}\), and the local FGNN model is updated by combining \(\overline W_{l,t}^{s}\) and \(\overline W_{l,t}^{i}\) (line 24). \(j{s_{t}^{i}}\) controls the proportion of the client's local model in the update: the more Non-IID the clients' datasets are, the larger the weight of the local model. In a word, the addition of JS contributes to the accuracy of the client models in Non-IID federated learning.
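Our reading of the update in line 24 can be sketched as a convex combination weighted by \(j{s_{t}^{i}}\); the exact formula used in Algorithm 2 may differ, so the helper below is illustrative only.

```python
def update_local(w_local, w_global, js_i):
    # js_i in [0, 1]: the larger the divergence between the client's
    # label distribution and the global one, the more weight the
    # local discrimination model keeps.
    return [js_i * wl + (1.0 - js_i) * wg
            for wl, wg in zip(w_local, w_global)]
```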

4.4 Hyper-parameters optimization

We employ Bayesian optimization to tune the hyper-parameters, treating the hyper-parameter search process as a black-box optimization, as shown in Eq. 3. Specifically, the hyper-parameter set 𝜃n includes the dropout rate, L2 regularization, propagation depth, learning rate, and dimension of hidden units. The utility function f is set to be the average of the clients' accuracy. The high-level optimization process is shown in Algorithm 3. First, we update the posterior probability distribution on f using all the evaluated hyper-parameter sets (line 5). Then we take the maximizer of the EI acquisition function as the next hyper-parameter group and observe the value of the utility function (lines 6-7). The hyper-parameter tuning time is measured by T = nt, where t denotes the running time of one set of hyper-parameters, n denotes the number of hyper-parameter combinations, and T is the total tuning time. Bayesian optimization shortens the tuning time by greatly narrowing down the number of combinations n.
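Putting this loop in pseudo-Python, and reusing the bo_step helper sketched in Section 3.5, the tuning process could look as follows; run_sfgnn is an assumed callback that trains SFGNN once for a given 𝜃 and returns the averaged client metric.

```python
import numpy as np

def tune(candidates, run_sfgnn, n_trials):
    # Seed the surrogate with one evaluated trial, then iterate:
    # fit GP on history, pick the EI maximizer, evaluate it.
    thetas, scores = [candidates[0]], [run_sfgnn(candidates[0])]
    for _ in range(n_trials - 1):
        theta = bo_step(np.array(thetas), np.array(scores), candidates)
        thetas.append(theta)
        scores.append(run_sfgnn(theta))  # average of clients' metrics
    best = int(np.argmax(scores))
    return thetas[best], scores[best]
```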

4.5 Putting all together

To sum up, we present the complete ASFGNN framework in Algorithm 4. Before the training process, we initialize the hyper-parameter set of the clients and the server as 𝜃0. First, we obtain the node embeddings \(H_{v,t}^{i}\) for each client i using Algorithm 1 (SGNN) with the current hyper-parameter set 𝜃n (line 5). Second, we train the FGNN model using the node embeddings as inputs and obtain the average accuracy (Mt) in each training round (line 7). The maximum of Mt is recorded as M(𝜃n) (line 9), which is regarded as the output of the black-box. Then, the next input 𝜃n+1 is produced by Bayesian optimization. Finally, we obtain the best hyper-parameter set 𝜃N and the corresponding M(𝜃N).

Algorithm 4 The ASFGNN learning paradigm

5 Experiment

In this section, we empirically compare the performance of our proposed ASFGNN model with the GraphSAGE-based Centralized Model (CM), which is trained on all the data, the traditional Federated Learning model (FL), and the Separated model (SP), in which clients can only use their own data without any communication. We aim to answer the following questions.

  • Q1: does our model (SFGNN) outperform the CM, FL, and SP models when trained on isolated Non-IID data, including Non-IID labels and Non-IID graphs?

  • Q2: how does the distribution parameter α influence the performance of our model?

  • Q3: how does the number of clients influence the performance of our model?

  • Q4: how does the JS divergence influence the performance of our model?

  • Q5: how does Bayesian optimization affect the efficiency of parameter tuning compared with grid search?

Algorithm 2 Federated GNN learning (FGNN)
Algorithm 3 Hyper-parameters optimization

5.1 Experimental settings

5.1.1 Framework

We build our experiments on the popular TensorFlow framework [1]. All experiments were performed on a MacBook Pro laptop with a 2.3GHz 4-core Intel Core i5 processor. For simplicity, we ignore the communication cost and focus on performance and computation efficiency.

5.1.2 Datasets

To test the effectiveness of our proposed model, we choose three benchmark citation datasets, i.e., Cora, Pubmed, and Citeseer. For simplicity, we assume there are only two clients (\(\mathcal {A}\) and \({\mathscr{B}}\)), and we split the datasets according to label classes and the number of neighbors in the graph. We use N1 and N2 to denote the number of samples in each part. We divide the Cora dataset into Co1 and Co2. The first part, Co1, has four label categories (theory, reinforcement learning, genetic algorithms, and probabilistic methods) with 1,412 nodes. The second part, Co2, contains the remaining three label categories (neural networks, case based, and rule learning) with 1,296 nodes. We also divide the Citeseer and Pubmed datasets into two parts (Ci1 and Ci2, Pu1 and Pu2) in a similar way. We report the data split in Table 2. In order to study the influence of Non-IID data on our method, we use α to denote the label distribution ratio. The data of client \(\mathcal {A}\) is made up of αN1 samples from the first part and (1 − α) ⋅ N1 samples from the second part. Similarly, the data of client \({\mathscr{B}}\) is made up of (1 − α) ⋅ N2 samples from the first part and αN2 samples from the second part. In other words, the parameter α indicates the Non-IID level. We let α range from 0.5 to 1.0 due to symmetry. We use exactly the same training, validation, and test splits as the prior work [15]. Naturally, our proposal can also be applied in scenarios with multiple clients.
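The α-based mixture for one client can be sketched as follows; part1 and part2 are index arrays for the two label groups, and the exact sampling procedure is our assumption.

```python
import numpy as np

def split_client(part1, part2, n, alpha, seed=0):
    # Draw alpha * n samples from the first label group and the
    # remaining (1 - alpha) * n samples from the second group.
    rng = np.random.default_rng(seed)
    k = int(alpha * n)
    a = rng.choice(part1, size=k, replace=False)
    b = rng.choice(part2, size=n - k, replace=False)
    return np.concatenate([a, b])
```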

Table 2 Statistic analysis of subsets

5.1.3 Metrics

Following the existing work [15], we use accuracy as the evaluation metric. To compare the performance of different strategies in the decentralized scenario, we choose the average of the metrics over all clients as the optimization target.

5.1.4 Hyper-parameters

Following recent research [18], we use the hyperbolic tangent (TanH) as the activation function of the hidden layers and set the number of layers of the fully-connected deep neural network in the discrimination model to L = 2. We tune the other hyper-parameters using Bayesian optimization. The hyper-parameters include the dropout rate d ∈{0.0,0.5}, L2 regularization l2 ∈{0.0,5e−4,1e−3,5e−3,1e−2}, propagation depth K ∈{1,2,3,4,5}, learning rate lr ∈{5e−4,1e−3,5e−3,1e−2}, and dimension of hidden units l ∈{64,128,256,512}. As clients train the discrimination model federally, the dimensions of the embeddings must be aligned, which means that all clients share the same hidden unit dimension; a sketch of the resulting search space is given below. The experiments are conducted on a stand-alone PC that simulates the communication in federated learning. We tune parameters on the validation dataset and evaluate model performance on the test dataset.
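To see why this search space matters, the snippet below counts the grid configurations implied by the ranges above; the squaring for two jointly tuned clients is an illustration only, since in practice the hidden dimension is shared across clients.

```python
import itertools

space = {"dropout": [0.0, 0.5],
         "l2": [0.0, 5e-4, 1e-3, 5e-3, 1e-2],
         "K": [1, 2, 3, 4, 5],
         "lr": [5e-4, 1e-3, 5e-3, 1e-2],
         "hidden": [64, 128, 256, 512]}
per_client = len(list(itertools.product(*space.values())))
print(per_client)       # 800 configurations for one client
print(per_client ** 2)  # 640,000 if two clients were tuned jointly
```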

5.2 Accuracy comparison

5.2.1 Accuracy comparison of different models with Non-IID label

To answer question Q1, we first set the label distribution ratio α = 1.0, which implies that the labels of client \(\mathcal {A}\) and client \({\mathscr{B}}\) are totally different. We use the grid search method to find the highest accuracy with the proper hyper-parameters. We summarize the comparison results in Table 3 and report the corresponding best hyper-parameter sets in Table 4.

Table 3 Performance comparison on three datasets in terms of accuracy
Table 4 Hyper-parameters of the SFGNN model and FL model with the best accuracy

From Table 3, we conclude that SFGNN outperforms the other three models on all datasets. Besides, compared with the traditional FL model, the improvement in accuracy is about 5.70% on average, which means the SFGNN model is more effective in Non-IID scenarios. Because our proposed SFGNN generates embeddings separately with individual hyper-parameters and aggregates only the discrimination layers of clients, it can balance the influence and contributions of samples with different labels. From Table 3 we can also find an interesting result: the Centralized GNN Model (CM) achieves the worst performance. This is because clients have completely different label classes when α = 1.0, and models trained on relatively pure label classes naturally achieve better performance. When the different label classes are combined in CM, they distract the model's learning target, which makes CM perform the worst.

From Table 4, we also observe that clients generally need different hyper-parameters to achieve the best accuracy, and these parameters also differ from the hyper-parameters of the FL model. The individual hyper-parameters reflect the diversity of Non-IID datasets.

5.2.2 Accuracy comparison of different models with both Non-IID label and Non-IID graph

The GNN model benefits greatly from adjacency information; unlike the traditional neural network model, the distribution of graph data also has an important influence on model accuracy. As the average number of edges per node in the Cora dataset is 3.8, we first split the Cora dataset into two sub-datasets, i.e., Co3 and Co4, according to this average. Co3 contains the samples with 3 or fewer edges, while Co4 contains the remaining samples. Furthermore, we combine the Non-IID graph data with the Non-IID label data, so that the datasets have both different graph and different label distributions. Similar to the setting in Table 2, we build a subset of Co3, named Co5, which only has the label classes 'theory', 'reinforcement learning', 'genetic algorithms', and 'probabilistic methods', and a subset of Co4, named Co6, which only has the labels 'neural networks', 'case based', and 'rule learning'. Similarly, we obtain the subsets of the Citeseer dataset (Ci3, Ci4, Ci5, Ci6). After preprocessing, client \(\mathcal {A}\) owns the Co5 subset and client \({\mathscr{B}}\) owns the Co6 subset of the Cora dataset, while client \(\mathcal {A}\) owns the Ci5 subset and client \({\mathscr{B}}\) owns the Ci6 subset of the Citeseer dataset. We train the SFGNN model and the FL model respectively and compare their accuracy in Table 5. We conclude that the SFGNN model performs better than the FL model when both labels and graphs are Non-IID, and the average improvement increases from 5.70% to 7.04%. The experimental results indicate that the SFGNN model is even more appropriate for scenarios where both graphs and labels have different distributions.

Table 5 Performance comparison on both Non-IID label and Non-IID graph

5.2.3 Accuracy comparison with different label distribution

To answer question Q2, we vary α from 0.6 to 1.0 on the Cora dataset and report the accuracy of the different models in Fig. 3. We compare our model with the SP model, in which client \(\mathcal {A}\) and client \({\mathscr{B}}\) can only use their own data without any communication. A larger ratio means less similar distributions of the clients' datasets. From the results, we conclude that (1) SFGNN performs better than both the SP and FL models when the data distribution is more asymmetrical (α > 0.75), which is quite common in real-world applications, (2) the FL model is more suitable for training a single global model when all clients tend to have IID data (0.5 ≤ α < 0.75), and (3) the SP model even works better than FL when clients have severely heterogeneous label distributions (0.88 < α ≤ 1.0).

Fig. 3 Average accuracy comparison of three models with different α

5.2.4 Accuracy comparison of different clients’ number with Non-IID label

The CM model is trained on the whole dataset and can be regarded as the ASFGNN model with only one client. To answer question Q3, we vary the number of clients from 2 to 5 on the Cora dataset. The labels of the clients' data differ from each other. We report the average accuracy of ASFGNN in Fig. 4. We find that the average accuracy of ASFGNN first increases with the number of clients and then tends to be stable. This is because, as the number of clients first increases, each client has fewer kinds of labels, which makes the Non-IID problem more serious; in exactly these conditions our proposed ASFGNN achieves relatively better performance as the number of clients grows.

Fig. 4 Average accuracy comparison of different numbers of clients with Non-IID labels

5.2.5 Accuracy comparison of JS divergence

To answer question Q4, we execute the FGNN model in Algorithm 2 in an alternative way: clients update the local discrimination models using the global discrimination model directly, i.e., without the JS weighting. We then compare the accuracy for different α on the Cora dataset; the results are shown in Table 6. We find that SFGNN with JS consistently outperforms SFGNN without JS, which shows the effectiveness of the proposed JS method. Besides, the benefit of JS for SFGNN increases as α rises. The contribution of the JS method becomes negligible when clients tend to have IID data, e.g., α = 0.6.

Table 6 Performance comparison of JS in SFGNN model

5.3 Efficiency comparison

To answer question Q5, we compare the parameter tuning time of grid search and Bayesian optimization on the three datasets. We perform Bayesian optimization of the hyper-parameters with the help of the open source framework SMAC3 [17]. The domains of lr, l2, and d in SMAC3 are continuous, and the domains of K and l are discrete with an interval of 1. Both the grid search method and the Bayesian optimization method are implemented under the same computation and communication environment. We report the parameter tuning time in Table 7. Note that both methods achieve comparable accuracy on these three datasets. From Table 7, we observe that (1) the Bayesian optimization method greatly reduces the hyper-parameter tuning time on all three datasets compared with the traditional grid search method, and (2) the speedup of Bayesian optimization over grid search becomes higher as the dataset gets larger. For example, the speedup on the Pubmed dataset is 102.56, while it is 10.10 on the Cora dataset. This is because Bayesian optimization shrinks the parameter search space that grid search would otherwise enumerate exhaustively. In real-world applications, the network bandwidth between clients and the server is always limited, and model training under the data-isolated setting usually takes much longer than traditional centralized training. Therefore, decreasing the parameter search space is the key to reducing the tuning time, and the results demonstrate that our proposal does this well.

Table 7 Training time comparison between BO and grid search on three datasets

6 Conclusion and future work

In this paper, we proposed an Automated Separated-Federated GNN learning paradigm for the Non-IID isolated scenario. We first proposed a separated-federated GNN learning model, which decoupled the training of GNN into two parts: the message passing part was done by clients separately, and the loss computing part was learnt by clients federally. To handle the time-consuming tuning problem, we leveraged the Bayesian optimization technique to automatically tune the hyper-parameters of all the clients. Experiments on real-world datasets demonstrated that our model significantly outperformed federated GNN learning on isolated Non-IID data.

In the future, we would like to verify our proposal with more existing GNN models. We are also interested in deploying our proposal into real-world applications.