1 Introduction

Learning node embeddings in graphs is an important yet challenging task due to its wide application in various fields such as social media and medicine [1, 2]. Existing graph representation learning approaches mainly focus on static graphs and can be divided into two categories: solutions based on a pre-defined adjacency matrix [3,4,5] and solutions based on an adaptive adjacency matrix [6,7,8]. The former employ spatial-based or spectral-based graph convolutional neural networks [1, 2] to capture local spatial correlation. However, these approaches do not extract global information and therefore struggle to reach a globally optimal representation. The latter employ a well-designed learnable parameter matrix and a one-hop graph convolution to adaptively capture hidden inter-dependencies between distant nodes. However, these approaches do not consider local and global consistency during feature aggregation.

Many methods [1, 2] based on graph convolutional networks (GCNs) use parameter sharing as a common training strategy to capture the prominent patterns shared across all nodes. However, owing to complex spatio-temporal factors and external factors (such as weather), adjacent nodes may exhibit diverse patterns [6]. Capturing only the prominent patterns is therefore insufficient; the specific patterns of each node must also be considered.

Generally, many real-world graphs are dynamic, meaning that their structure evolves over time. Because new nodes have neither labels nor historical information, classifying them in a dynamic graph is challenging. Recurrent approaches [9, 10] employ a Recurrent Neural Network (RNN) to learn historical information from graph snapshots and use a classical GCN to extract spatial correlation. These approaches perform well on node representation learning, but they cannot perform reliably for unsupervised classification [9]. Self-attentional approaches [11, 12] can dynamically capture long-term temporal correlation and achieve good link prediction performance. However, their performance remains weak when classifying new nodes that have no historical information.

Transfer learning is a method of knowledge transfer between domains: it transfers information from source domains to improve performance on target domains. With the success of graph convolution, many approaches [13, 14] transfer knowledge from a graph in an existing domain, where nodes are labeled, to classify nodes in a target domain, thereby addressing the problem of unlabeled target nodes. Inspired by these methods, we apply transfer learning across time steps and transfer the features of known nodes so that new nodes can learn rich historical information, thereby enhancing the classification performance on unknown nodes.

To address these problems, we propose a novel transfer-learning-based deep learning model named DLA-GCN, which employs multiple components along the spatial and temporal dimensions. Specifically, we design the DLGCN component to perform local and global feature aggregation and propose the NMPL algorithm to capture the distinct patterns of individual nodes. We then propose the DATL method to learn and transfer dynamic temporal correlation. On two real-world dynamic graph datasets, we evaluate DLA-GCN on the unsupervised dynamic graph node classification task. The experimental results show that DLA-GCN performs consistently over time and outperforms several state-of-the-art baselines. The main contributions of this paper are summarized as follows:

  • To adaptively capture global and local spatial correlation, we propose the DLGCN component. Specifically, we employ two different adjacency matrices and an inter-graph attention mechanism to automatically integrate local and global consistency and learn effective node feature embeddings.

  • By modeling node-specific parameters, we design the NMPL algorithm. Specifically, a matrix decomposition method learns two smaller parameter matrices instead of learning a unique parameter space for each node. This approach enhances the perception of spatial-temporal features and improves classification performance.

  • We propose the DATL method to exploit source and target information with different loss functions, so that domain-invariant feature representations can be effectively learned to reduce the domain discrepancy for new node classification. DATL significantly accelerates training and boosts classification accuracy.

The remainder of this paper is organized as follows. A brief review of related work is provided in Sect. 2. In Sect. 3, we detail the DLGCN, NMPL, and DATL components used to capture the spatial-temporal evolutionary features of dynamic graphs. The experimental settings and results are presented in Sect. 4. Section 5 gives concluding remarks.

2 Related work

Our work involves static graph representation learning, the dynamic graph classification task, and transfer learning; this section reviews the most recent developments in each.

Static graph embeddings. Recently, many approaches based on graph neural network architectures have achieved great success. Amit Roy et al. [5] propose the unified spatio-temporal graph convolution network (USTGCN) framework. Built on STSGCN, the framework employs a temporal mask to improve the ability to learn spatio-temporal correlations and models different historical data components (such as hourly, daily, and weekly components) to improve the fine-grained representation. Lei Bai et al. [6] design an adaptive adjacency matrix to learn spatial correlation automatically and to easily capture hidden correlations between distant nodes. However, these approaches focus on only one aspect of spatial correlation and do not consider local and global consistency jointly [15]. Meanwhile, these methods only employ shared parameters to capture the prominent patterns. Therefore, we propose the double-layer graph convolution network and NMPL to learn the spatial structural properties and diversified patterns of nodes.

Dynamic graphs: Dynamic graphs come in two common forms: the snapshot sequence [11], which consists of a series of graph snapshots at different time steps, and the timestamped graph [16], which evolves over continuous time. Although snapshot-based approaches can be applied to timestamped graphs by reducing the number of time steps, they may not perform reliably due to the lack of a fine-grained snapshot sequence [17].

Dynamic graph classification task: Dynamic graph classification approaches mainly fall into two categories. One class employs recurrent convolutional networks, such as the evolving graph convolutional network (EvolveGCN) proposed by Aldo Pareja et al. [9]. The model captures spatio-temporal dynamics by employing long short-term memory (LSTM) or gated recurrent units (GRUs) to evolve the GCN parameters; notably, the GCN itself is not trained, and its parameters are only updated by the RNN. Although the model performs well on short-term dynamic graph node classification, its performance degrades as the number of time steps increases. The other class uses a self-attention mechanism to capture historical information. Aravind Sankar et al. [11] proposed the dynamic self-attention network (DySAT), which employs the transformer encoder [18] to learn the dynamic temporal evolution. However, it cannot learn the historical information of new nodes or model them [19].

These methods focus on evaluating the classification performance of labeled nodes and do not extract historical information for new, unseen nodes. We therefore employ transfer learning to model new nodes.

Transfer learning: Existing approaches fall into three categories. (1) Instance-based transfer learning. For example, Dai Wenyuan et al. [20] propose the TrAdaBoost algorithm, which calculates task similarity to bring the feature distribution of the target domain close to that of the source domain. This approach is easy to implement, but it performs poorly when the source and target data come from very different domains [21]. (2) Feature-based transfer learning. For example, Yaroslav Ganin et al. [22] propose the DANN framework, which learns features that are similar between the source and target domains and transfers them using the DAN method. (3) Parameter-based transfer learning, such as pre-trained models. For example, Google [23] proposed bidirectional encoder representations from transformers (BERT), which tackles downstream tasks by pre-training and then fine-tuning the model parameters.

Instance-based transfer learning requires the source and target data to come from the same domain. Parameter-based transfer learning needs a large amount of data, and its performance is not outstanding for a single task such as dynamic graph classification [24,25,26]. Therefore, we propose the DATL method, which learns and transfers similar features at adjacent time steps using three loss functions.

3 Methodology

In this section, we will present the problem definition and the neural network structure design. The main parameters used in DLA-GCN are summarized in Table 1.

Table 1 Key notation and definition

3.1 System model

In this section, we define the dynamic graph as a series of observed snapshots \(G=\left\{ \zeta ^1,\zeta ^2,\cdots ,\zeta ^t,\cdots ,\zeta ^T\right\}\), where T is the number of time steps. At each time step t, the graph snapshot \(\zeta ^t=(V,\varepsilon ^t)\) is a weighted undirected graph with a node set V, a link set \(\varepsilon ^t\), an adjacency matrix \(A^t\), and a node feature matrix \(X^t\). Dynamic graph representation learning aims to learn an embedding \(e_v^t \in R^d\) for each node \(v \in V\) at every time step \(t\in \left\{ 1,2,\cdots ,T\right\}\). The embedding \(e_v^t \in R^d\) aggregates the spatial features and the temporal evolution, such as link connections and node additions, at time step t. New node classification in dynamic graphs aims to identify the labels of new nodes by exploiting the original label set \(l_o\) in the source domain and the embeddings \(e_v^t \in R^d\).
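
To make the notation concrete, the sketch below shows one possible in-memory layout for the snapshot sequence \(G=\{\zeta^1,\cdots,\zeta^T\}\). The field names and the convention of marking unlabeled new nodes with -1 are illustrative assumptions, not part of DLA-GCN itself.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Snapshot:
    """One graph snapshot zeta^t = (V, eps^t) at time step t."""
    adj: np.ndarray        # A^t, shape (N_t, N_t), weighted undirected adjacency
    features: np.ndarray   # X^t, shape (N_t, F), node feature matrix
    labels: np.ndarray     # length N_t; -1 marks new (unlabeled) nodes

# A dynamic graph is simply an ordered list of T snapshots.
dynamic_graph = [
    Snapshot(adj=np.eye(3), features=np.random.rand(3, 4),
             labels=np.array([0, 1, -1])),
]
```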

Fig. 1 Network architecture of DLA-GCN

We propose a new neural network structure named DLA-GCN. As shown in Fig. 1, the model consists of the DLGCN, DATL, and NMPL components. These approaches learn the spatial structural properties by combining the DLGCN and NMPL and employs the DATL algorithm to learn and transfer similar features of dynamic graphs at adjacent time steps. Note that nodes in this section dynamically evolve. In other words, the number of nodes in each time step is different. Therefore, after feature extraction at t, we add 0 to match the dimension in adjacent time steps t and \(t+1\). This approach avoids dimension mismatches and prevents the occurrence of the multi-zero matrix.
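
As a concrete illustration of this padding step, the following sketch (our own, not taken from the paper's released code) pads the smaller embedding matrix with zero rows so that the representations at time steps t and \(t+1\) share the same node dimension.

```python
import numpy as np

def pad_to_match(z_t: np.ndarray, z_next: np.ndarray):
    """Zero-pad node embeddings from adjacent time steps to a common row count.

    z_t, z_next: (N_t, d) and (N_{t+1}, d) embedding matrices.
    Returns two (max(N_t, N_{t+1}), d) matrices.
    """
    n = max(z_t.shape[0], z_next.shape[0])
    pad = lambda z: np.vstack([z, np.zeros((n - z.shape[0], z.shape[1]))])
    return pad(z_t), pad(z_next)
```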

3.2 Spatial module

Fig. 2 Framework of the DLGCN method

We propose DLGCN to capture the local and global features of spatial structural properties and to learn more hidden information. Figure 2 illustrates the overall architecture of DLGCN. The network consists of a local spatial module, a global spatial module, and an inter-graph attention mechanism. Essentially, the spatial feature representation is the aggregation of spatial information at each time step; in other words, all operations of DLGCN are completed within a single time step, during which the total number of nodes remains unchanged.

3.2.1 Local spatial module

The one-hop neighbors of each node v are aggregated by an attention mechanism to capture the local features.

First, the feature embedding of each node is defined as:

$$\begin{aligned} S_v=W^sx_v+b_1 \end{aligned}$$
(1)

where \(W^s\) denotes a learnable parameter shared by all nodes, \(x_v\) is the feature matrix of each node v as input of the layer, and \(b_1\) is a learnable bias.

The pre-defined adjacency matrix is obtained from the graph snapshot \(\zeta\) at time step t, and the one-hop neighbors of each node v are selected as node topology information through this pre-defined adjacency matrix. The attention weight between node v and its neighbors is then calculated by:

$$\begin{aligned} S_u&=W^sx_u+b_2 \nonumber \\ e_{uv}&=\sigma (A_{uv}[S_u+S_v])\forall (u,v)\in \varepsilon \end{aligned}$$
(2)

where \(A_{uv}\) denotes the entry of the pre-defined adjacency matrix for two adjacent nodes u and v, and \(\sigma (\cdot )\) is an activation function such as ReLU or Sigmoid. Note that the attention weight is obtained by adding the embedding matrices; other techniques, such as concatenation, can also be used.

Then the calculated attention is normalized by the following equation:

$$\begin{aligned} \alpha _{uv}=\frac{exp(e_{uv})}{\sum \limits _{w\in N_v}exp(e_{wv})} \end{aligned}$$
(3)

Finally, the local feature representation of each node v is calculated by aggregating its one-hop neighbors with the attention weights computed above:

$$\begin{aligned} Z_{local}=z_v=\sigma \left( \sum \limits _{u\in N_v}\alpha _{uv}(W^sx_u+b_2)\right) \quad \forall v\in V \end{aligned}$$
(4)
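
A minimal NumPy sketch of Eqs. (1)–(4) follows. It is an illustrative reading rather than the authors' implementation: we assume \(\sigma(\cdot)\) is a sigmoid, and we reduce \((S_u+S_v)\) to a scalar edge score by summing over the feature dimension, since Eq. (2) does not state the reduction explicitly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_spatial(X, A, W_s, b1, b2):
    """One-hop attention aggregation following Eqs. (1)-(4).

    X: (N, F) node features, A: (N, N) pre-defined adjacency,
    W_s: (F, d) shared weight, b1/b2: (d,) biases.
    """
    S_v = X @ W_s + b1                                  # Eq. (1): centre-node embeddings
    S_u = X @ W_s + b2                                  # Eq. (2): neighbour embeddings
    pair = S_u[None, :, :] + S_v[:, None, :]            # (N, N, d): S_u + S_v per pair
    e = sigmoid(A * pair.sum(axis=-1))                  # Eq. (2): scalar edge scores
    e = np.where(A > 0, e, -1e9)                        # keep one-hop neighbours only
    # Eq. (3): softmax normalisation over each node's neighbourhood.
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # Eq. (4): weighted aggregation of neighbour messages.
    return sigmoid(alpha @ (X @ W_s + b2))              # Z_local, shape (N, d)
```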

3.2.2 Global spatial module

In addition to local feature learning, which is based on distance or similarity, we employ an adaptive adjacency matrix to model the global information.

Specifically, a learnable node embedding \(E_A\in R^{N\times d_c}\) is randomly initialized, where N denotes the total number of nodes and \(d_c\) is the node embedding dimension (generally \(d_c\ll N\)). The global relationships between nodes are captured by multiplying \(E_A\) and \(E_A^T\):

$$\begin{aligned} A_{adp} = I_N+D^{-\frac{1}{2}}AD^{-\frac{1}{2}}=Softmax(ReLU(E_A\cdot E_A^T )) \end{aligned}$$
(5)

The learned adjacency matrix is normalized by a softmax. The adaptive adjacency matrix captures global spatial information through \(E_A\), which is simpler than the approach in [35]. Finally, the global feature embedding is represented as:

$$\begin{aligned} Z_{global}=(Softmax(Relu(E_A\cdot E_A^T )))XW^s+b_3 \end{aligned}$$
(6)

where X denotes the node feature matrix, \(W^s\) is the learnable parameter shared by all nodes, and \(b_3\) is the learnable node bias.
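
The adaptive adjacency of Eqs. (5)–(6) reduces to a few matrix products. The sketch below assumes a row-wise softmax; variable names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_spatial(X, E_A, W_s, b3):
    """Adaptive-adjacency global aggregation (Eqs. 5-6).

    X:   (N, F) node features
    E_A: (N, d_c) learnable node embedding, d_c << N
    W_s: (F, d) shared weight, b3: (d,) bias
    """
    A_adp = softmax(np.maximum(E_A @ E_A.T, 0.0), axis=1)   # Eq. (5): ReLU then row softmax
    return A_adp @ X @ W_s + b3                              # Eq. (6): Z_global, shape (N, d)
```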

3.2.3 Inter-graph attention

After the local and global spatial modules, we obtain two feature embeddings, \(Z_{local}\) and \(Z_{global}\), which need to be aggregated into a unified representation. As the embeddings from the local and global spatial modules contribute differently to the learned representation, we propose an inter-graph attention scheme to capture the significance of each feature embedding.

The attention between the local and global feature representations is calculated as:

$$\begin{aligned} \partial _l&=\frac{exp(Z_{local})}{exp(Z_{local})+exp(Z_{global})} \nonumber \\ \partial _g&=\frac{exp(Z_{global})}{exp(Z_{local})+exp(Z_{global})} \end{aligned}$$
(7)

where \(\partial _l\) and \(\partial _g\) denote the importance of local feature and global feature, respectively.

Furthermore, the spatial feature representation of nodes is denoted as:

$$\begin{aligned} Z=\partial _lZ_{local}+\partial _gZ_{global} \end{aligned}$$
(8)
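
Eqs. (7)–(8) amount to a two-way softmax over the two embeddings. The sketch below uses an element-wise reading of the softmax, which is our assumption since the equations do not state whether the scores are pooled to scalars first.

```python
import numpy as np

def inter_graph_attention(Z_local, Z_global):
    """Fuse local and global embeddings with the two-way softmax of Eqs. (7)-(8)."""
    m = np.maximum(Z_local, Z_global)            # subtract max for numerical stability
    w_l = np.exp(Z_local - m)
    w_g = np.exp(Z_global - m)
    alpha_l = w_l / (w_l + w_g)                  # Eq. (7): importance of the local feature
    alpha_g = w_g / (w_l + w_g)                  # Eq. (7): importance of the global feature
    return alpha_l * Z_local + alpha_g * Z_global   # Eq. (8): fused representation Z
```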

3.3 Node multi-parameter learning

In deep learning models, parameter sharing is a widely used technique that significantly reduces the number of model parameters, thus accelerating training and improving generalization. However, in dynamic graphs, each node may have unique features and attributes, so shared parameters may not accurately capture the characteristics of each node. Moreover, a single shared parameter space may be unable to handle the complexity and dynamic variability of node parameters. Ideally, each node would maintain its own parameter space to learn node-specific patterns. Therefore, to fully exploit node parameter information while preserving node uniqueness, we propose the NMPL algorithm, which learns individual parameters for each node and improves the model’s performance.

Using a matrix decomposition, NMPL learns two smaller parameter matrices instead of learning a unique parameter space for each node. The parameter \(W^s\) is represented as \(W^s=E_\psi \cdot W_\psi\), where (1) \(E_\psi \in R^{N\times d_c}\) is a shared node embedding with embedding dimension \(d_c\ll N\), and (2) \(W_\psi \in R^{d_c\times d}\) is a specific parameter, so that the product \(E_\psi \cdot W_\psi\) is well defined. The method can be interpreted as selecting node-specific patterns from a set of candidate patterns discovered across all nodes in the graph. As a result, NMPL not only reduces the number of parameters but also learns more fine-grained node patterns.

By employing \(E_A\) as a shared parameter between local and global features, Eqs. 4 and 6 can be re-expressed as:

$$\begin{aligned} Z_{local}&=z_v=\sigma \left( \sum \limits _{u\in N_v}\alpha _{uv}(E_AW_\psi x_u+E_Ab_{\psi _1})\right) \quad \forall v\in V \nonumber \\ Z_{global}&=(Softmax(ReLU(E_A\cdot E_A^T )))XE_AW_\psi +E_Ab_{\psi _2} \end{aligned}$$
(9)
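
The benefit of the factorization is easiest to see from the parameter count. The sketch below is illustrative: it treats \(W_\psi\) as a bank of \(d_c\) candidate weight matrices over an F-dimensional input, which is our reading of the factorization rather than a dimensionality stated in the paper, and the sizes are only Cora-scale examples.

```python
import numpy as np

# Illustrative sizes; on a Cora-scale graph N = 2708 and F = 1433.
N, F, d, d_c = 2708, 1433, 128, 10

# Naive node-specific parameters: one (F, d) weight matrix per node.
naive_params = N * F * d                              # ~5.0e8 parameters

# NMPL factorisation: a shared node embedding E_psi plus a small bank of
# d_c candidate patterns W_psi; each node's weight is a mixture of them.
E_psi = np.random.randn(N, d_c)
W_psi = np.random.randn(d_c, F, d)
factored_params = E_psi.size + W_psi.size             # ~1.9e6 parameters

# Recover the effective weight matrix of a single node v on demand.
v = 0
W_v = np.tensordot(E_psi[v], W_psi, axes=1)           # shape (F, d)
print(naive_params, factored_params, W_v.shape)
```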

3.4 Domain-adversarial transfer learning

To learn knowledge transfer across different time steps and assist the dynamic node classification task, our model consists of an adversarial module, a source classifier, and a target classifier, which work together to learn node representations in both domains and thereby classify nodes at the next time step. The overall objective is as follows:

$$\begin{aligned} L(Z^t,Y^t,Z^{t+1})&=L_S(Z^t,Y^t)-\lambda *L_{DA}(Z^t,Z^{t+1})-\gamma *L_{T}(Z^{t+1}) \end{aligned}$$
(10)

where \(\lambda\) and \(\gamma\) are hyper-parameters that tune the trade-off between the domain loss, the target loss, and the total loss during learning. \(L_S\), \(L_{DA}\), and \(L_T\) are the source classifier loss, the domain classifier loss, and the target classifier loss, respectively.

Source classifier loss: For the source classifier, the negative log-probability method is employed as the loss function and is written as:

$$\begin{aligned} {L_S(Z^t,Y^t)=-\frac{1}{N_t^K} \sum _{i=1}^{N_t^K}log({\hat{y}}_i)} \end{aligned}$$
(11)

where \(y_i\) denotes the label of the i-th labeled node in the source domain and \({\hat{y}}_i\) is the classification prediction for that node. The source classifier loss is minimized to improve prediction accuracy.
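
A compact sketch of the source loss in Eq. (11), under the assumption that \({\hat{y}}_i\) is the predicted probability of node i's true class:

```python
import numpy as np

def source_loss(probs, labels):
    """Negative log-probability over the N_t^K labelled source nodes (Eq. 11).

    probs:  (N, C) softmax outputs of the source classifier
    labels: (N,) integer class labels y_i
    """
    p_true = probs[np.arange(len(labels)), labels]   # probability of the correct class
    return -np.mean(np.log(p_true + 1e-12))
```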

Domain classifier loss: In contrast to minimizing the label classifier loss, the domain classifier makes the features of the source and target domains indistinguishable by maximizing the domain discriminator loss. To achieve this, we train the domain classifier adversarially. On the one hand, the source classifier classifies each node into the correct class by minimizing the source classifier loss. On the other hand, node representations from different domains should be similar enough that the domain classifier cannot determine whether a node comes from the source or the target domain. In this paper, a Gradient Reversal Layer (GRL) [27] is used for adversarial training: the reversed gradient drives the feature extractor to maximize the domain loss, while the cross-entropy domain classifier loss itself is minimized:

$$\begin{aligned} L_{DA}(Z^t,Z^{t+1})&=-\frac{1}{N_{sum}} \sum _{i=1}^{N_{sum}} dlog({\hat{m}}_i^{t+1})+(1-d)log(1-{\hat{m}}_i^t) \end{aligned}$$
(12)

where \(N_{sum}\) denotes the total number of nodes in the source and target domains and d is a binary variable indicating whether a feature comes from the source or the target domain. \({\hat{m}}_i^{t+1}\) and \({\hat{m}}_i^t\) are the domain predictions for the i-th node in the target domain and the source domain, respectively.
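
The adversarial step is typically realized with a gradient reversal layer. The PyTorch-style sketch below is our own illustration of Eq. (12) (the paper's implementation uses TensorFlow); `domain_clf` is a hypothetical binary classifier module, and the 0/1 coding of d is an assumption.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient Reversal Layer: identity forward, negated (scaled) gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def domain_loss(domain_clf, z_src, z_tgt, lam):
    """Binary cross-entropy of Eq. (12); the GRL makes the feature extractor maximise it."""
    feats = torch.cat([GradReverse.apply(z_src, lam),
                       GradReverse.apply(z_tgt, lam)], dim=0)
    logits = domain_clf(feats).squeeze(-1)                              # domain predictions m_hat
    d = torch.cat([torch.zeros(len(z_src)), torch.ones(len(z_tgt))])    # 0 = source, 1 = target
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, d)
```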

Target classifier loss: An entropy loss is placed on the target classifier. Unlike the source classifier, we cannot use a cross-entropy label loss, because new nodes in the target domain have no label information. To utilize the data in the target domain, we instead employ an entropy loss for the target classifier:

$$\begin{aligned} L_{T}(Z^{t+1})=-\frac{1}{N_{t+1}^U}\sum _{i=1}^{N_{t+1}^U}{\hat{y}}_ilog({\hat{y}}_i) \end{aligned}$$
(13)

where \({\hat{y}}_i\) denotes the classification prediction for the i-th node in the target domain.

\(L_S(Z^t,Y^t)\), \(L_{DA}(Z^t,Z^{t+1})\), and \(L_T(Z^{t+1})\) are jointly optimized via the objective function in Eq. 10, and all parameters are optimized using standard backpropagation.
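
Under the same assumptions, the target entropy term of Eq. (13) and the combined objective of Eq. (10) can be sketched as follows; the default values of \(\lambda\) and \(\gamma\) are taken from Sect. 4.2 (Cora), under our reading that the balance parameter there corresponds to \(\gamma\).

```python
import torch

def target_entropy_loss(probs_tgt):
    """Entropy of the target classifier's predictions on unlabeled nodes (Eq. 13)."""
    return -(probs_tgt * torch.log(probs_tgt + 1e-12)).sum(dim=1).mean()

def total_loss(l_src, l_dom, l_tgt, lam=0.31, gamma=0.8):
    """Overall objective of Eq. (10); lam and gamma are the trade-off hyper-parameters."""
    return l_src - lam * l_dom - gamma * l_tgt
```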

Algorithm 1 summarizes the unsupervised dynamic graph classification learning model.


4 Experiments and analysis

4.1 Dataset

This paper uses two real-world dynamic graph datasets: the Cora dataset [28] and the BlogCatalog dataset [29].

The Cora dataset consists of 2708 nodes, each classified into one of seven classes: Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, and Theory. The citation network contains 5429 edges. Each node represents a scientific paper, and the edges denote citations between papers. The Cora dataset is widely used for the dynamic graph classification task [30, 31]. Our purpose is to study the impact of research papers more accurately: as time progresses, the influence of a paper evolves, and learning over the dynamic graph allows us to investigate this impact and its variations. We use temporal directed edges representing citations from one paper to another, timestamped with the citing paper's publication date, as in [32]. The Cora static graph is transformed into a dynamic graph as follows. First, the papers published between 1900 and 1988 are selected as known nodes, and their connections, feature attributes, and classification labels are used to construct the first static graph at the initial time step. Second, at each additional time step, the feature attributes and connections of papers published up to 1999 are added as new, unlabeled nodes, completing the graph topologies at the different time steps. Finally, a dynamic graph dataset with 12 time steps is formed.
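
A sketch of this snapshot construction is given below. The field names for the paper metadata and the exact choice of year cut-offs are hypothetical; only the overall procedure follows the description above.

```python
import numpy as np

def build_snapshots(papers, citations, split_years):
    """Build a snapshot sequence from timestamped citations.

    papers:      dict paper_id -> (year, feature_vector, label)
    citations:   list of (citing_id, cited_id, year_of_citing_paper)
    split_years: increasing year cut-offs, one per time step (first = labelled era)
    """
    snapshots = []
    for cutoff in split_years:
        ids = [p for p, (year, _, _) in papers.items() if year <= cutoff]
        idx = {p: i for i, p in enumerate(ids)}
        A = np.zeros((len(ids), len(ids)))
        for u, v, year in citations:
            if year <= cutoff and u in idx and v in idx:
                A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0   # undirected snapshot
        X = np.stack([papers[p][1] for p in ids])
        # Only papers from the first (labelled) era keep their labels;
        # nodes added at later time steps are new, unlabelled nodes (-1).
        y = np.array([papers[p][2] if papers[p][0] <= split_years[0] else -1
                      for p in ids])
        snapshots.append((A, X, y))
    return snapshots
```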

The given adjacency matrix of Cora is constructed purely from the citation relationships between papers. Two papers i and j may address completely different topics, for example computer science and chemistry, while the citation between them exists only because the algorithm or model developed in paper i is applied in paper j. This means that, apart from the given adjacency matrix, there are other possible viewpoints that better represent the relationship between papers i and j [33, 34]. To capture this, we utilize the similarity of node features and an attention mechanism when computing the representation of each node in the graph. Specifically, during the representation process, we assign different weights to neighboring node features according to their similarity, which allows us to capture more diverse and informative signals. By incorporating node feature similarity and the attention mechanism, we enhance the learning of richer and more relevant information for each node in the computational graph [35, 36].

The BlogCatalog dataset is a network of social relationships from the BlogCatalog website, consisting of 5196 nodes and 171,743 edges. Node attributes are constructed from 8189 keywords generated by users as short descriptions of their blogs. The social network has 6 labels representing the topic categories provided by the authors. BlogCatalog is widely used in social network analysis and in the dynamic graph classification task, and it serves as a benchmark in many important works [37, 38].

4.2 Experimental setting

We implement DLA-GCN in Python 3.6 with TensorFlow [39] and train it with mini-batch gradient descent using the Adam optimizer. On the Cora dataset, the structural multi-head self-attention of the graph attention network (GAT) uses 1 layer and 16 attention heads, with each head computing 8 features (128 dimensions in total). For DATL, the adaptive parameter \(\lambda\) is chosen among 9 values between \(10^{-3}\) and 1 on a logarithmic scale; \(\lambda \approx 0.31\) gives the best performance, and the learning rate is set to \(u_{r}=10^{-3}\). Two MLP layers followed by a softmax perform the multi-class training of the target and source domains, and the balance parameter is set to 0.8. On the BlogCatalog dataset, the self-attention GAT uses 2 layers and 16 attention heads, with each head computing 16 features. For DATL, the adaptive parameter \(\lambda\) is set to 0.64, the learning rate \(u_r\) is fixed at \(10^{-3}\), and the balance parameter is set to 0.8. Finally, we use 60% of each dataset for training, 15% for validation, and 25% for testing.

4.3 Baseline

To demonstrate the effectiveness of our proposed model, we compare it against both state-of-the-art GCN-based dynamic node classification models and models based on graph transfer learning. Note that applying GCN to dynamic graphs requires two inputs at each time step: a feature matrix composed of all nodes present at that time step, and an adjacency matrix representing the connections between them. In this paper, this adjacency matrix is obtained by calculating the similarity between node features. In summary, we apply GCN to node classification on dynamic graphs using the per-time-step feature matrix and this pre-defined adjacency matrix; a sketch of the adjacency construction is given below.
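
One possible realization of the similarity-based adjacency used for the GCN baseline is shown here; the cosine similarity measure and the threshold value are our assumptions, since the paper does not specify them.

```python
import numpy as np

def similarity_adjacency(X, threshold=0.5):
    """Build a snapshot adjacency from cosine similarity of node features.

    X: (N, F) feature matrix for one time step. The threshold is an
    illustrative choice; the paper does not state its value.
    """
    norm = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = norm @ norm.T                       # pairwise cosine similarity
    A = (sim >= threshold).astype(float)
    np.fill_diagonal(A, 0.0)                  # no self-loops; GCN adds them separately
    return A
```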

State-of-the-art GCN-based dynamic node classification models:

  • GCN [28]: It is a classic network which can efficiently capture the spatial information.

  • GAT-AE [40]: The model is a graph convolutional network with an attention mechanism, which uses an autoencoder to model the spatial correlation.

  • DySAT [11]: DySAT is a dynamic self-attention approach that uses GAT to capture spatial correlation and the self-attentional architecture of the transformer encoder to capture dynamic temporal correlation.

  • DS-TAGCN [12]: DS-TAGCN is a dual-stream topology attentive graph convolutional network for dynamic graph node classification, which learns the evolution of node attributes and graph topology simultaneously.

  • FADGC [41]: The model is a stable and scalable dynamic GCN method using a fine-grained attention mechanism.

Dynamic node classification models based on graph transfer learning:

  • TL-DCRNN [42]: TL-DCRNN is a diffusion convolutional recurrent neural network based on a new transfer learning approach. The network performs dynamic node classification through downstream fine-tuning, such as the MLP and Conv operations.

  • NodeTrans [14]: NodeTrans is a graph transfer learning approach which combines the spatial-temporal graph network and transfer learning. The model performs dynamic node classification through downstream fine-tuning, such as the MLP and Conv operations.

4.4 Experimental results

Table 2 summarizes the performance of the different models on the Cora and BlogCatalog datasets. We adopt accuracy, AUC, and F1 as our evaluation metrics.

Table 2 Comparison between different models

Short-term classification performance analysis: As shown in Table 2, DLA-GCN is superior to the baselines on all three metrics with significant improvements. In particular, DLA-GCN achieves 18% and 10% improvements in accuracy on the Cora and BlogCatalog datasets, respectively, compared with DySAT, the state-of-the-art method. Dynamic-graph-based methods (DySAT, DS-TAGCN, and FADGC) outperform the traditional static graph embedding methods (GCN and GAT-AE). This shows that spatio-temporal dynamic GCNs with an attention mechanism, which encode both the local graph structure and node evolution patterns, hold a competitive advantage over traditional GCN-based models in dynamic node classification. Compared with the dynamic GCN-based node classification methods (DS-TAGCN and FADGC), TL-DCRNN and NodeTrans perform better, confirming the superiority of transfer learning in dynamic node classification.

Long-term classification performance analysis: Figure 3 further shows the accuracy of the different models over the 12 time steps on the Cora and BlogCatalog datasets. From the results we observe the following. (1) The dynamic GCN-based node classification methods (DS-TAGCN, FADGC) achieve high accuracy in the short term (the first step), but their accuracy declines rapidly in long-term classification as the missing historical information of new nodes accumulates. (2) DLA-GCN balances short-term and long-term classification well and achieves the best performance at almost all horizons, showing that its robustness is superior to that of the other methods. Our model mainly utilizes the double-layer graph and transfer learning to improve classification performance. Over the 12 time steps, its accuracy decreases gradually due to the continuous arrival of new nodes. In contrast, most other models drop rapidly in the first few time steps, because new nodes can only be classified through spatial information extraction, without any historical information or corresponding labels, so no useful features can be learned from the known nodes. The accuracy of GCN on the static graph is 0.81, whereas the accuracy of DLA-GCN on the dynamic graph is only 0.76. The main reason is that static and dynamic graphs contain different amounts of node information: dynamic graphs carry less information per node, such as missing labels and historical information. After the static graph is converted to a dynamic graph, the accuracy of existing methods decreases because of this missing node information [31, 32, 43, 44]; our method effectively improves the prediction accuracy on dynamic graphs. The dynamic graph consists of a series of graph snapshots at 12 different time steps, and new nodes have no labels or historical information. When converting the Cora dataset into a dynamic graph with 12 time steps, only the first time step contains nodes (papers between 1900 and 1988) with corresponding labels; the additional nodes added up to 1999 at each subsequent time step have no labels. Therefore, new nodes can only be classified through unsupervised learning, and as the number of time steps increases the performance decreases, reaching an overall accuracy of 0.76. In contrast, the Cora static graph dataset consists of 2708 labeled nodes that can be classified through supervised learning; it has complete knowledge for the node classification task, which does not change as the number of time steps increases.

Fig. 3 Performance changes over 12 time steps of different models on the Cora and BlogCatalog datasets

4.5 Model ablation analysis

To sufficiently investigate the effect of the different components and source domain inputs in DLA-GCN, we evaluate the prediction performance of eight variants of DLA-GCN.

Table 3 Comparisons of classification performances of the different DLA-GCN variants
  • (1) Effectiveness evaluation of the DATL component: Table 3 summarizes the classification performance of the different DLA-GCN variants on the Cora and BlogCatalog datasets. Compared with the “without (w/o) DATL” and “r/ DATL w/ SAT” variants, DLA-GCN achieves significant improvements in dynamic node classification, which proves the effectiveness of the DATL component. According to the left panel of Fig. 4, our model consistently outperforms the “r/ DATL w/ SAT” variant, demonstrating that the DATL component performs better than the self-attention mechanism at almost all time steps. When the “without (w/o) DATL” and “r/ DATL w/ SAT” variants are compared, the slightly better performance of the latter indicates that the self-attention mechanism improves learning efficiency [11] by capturing some historical information from other nodes.

  • (2) Effectiveness evaluation of different source domain inputs: Experimental results for models with different source domain inputs on the Cora and BlogCatalog datasets are shown in the right panel of Fig. 4. When DLA-GCN is compared with the “r/ SF w/ OR” variant, its slightly lower performance indicates that the similar features may carry slightly less information than the original features at short time horizons. However, as the number of time steps increases, DLA-GCN achieves significant improvements over the “r/ SF w/ OR” variant, which proves the effectiveness of the similar features in long-term node classification. In other words, the correlation between new nodes and the original nodes weakens over time, lowering the benefit of transferring the original features.

  • (3) Effectiveness evaluation of the double-layer graph convolutional network component: From Table 3, we observe that (1) DLA-GCN outperforms the “without (w/o) ADP” variant, demonstrating the importance of capturing strong local dependencies, and (2) the “without (w/o) SAT” variant is superior to the “without (w/o) ADP” variant, showing that global feature extraction is more effective than local feature extraction; the global spatial module based on the adaptive matrix can also capture the local correlation of some nodes.

  • (4) Effectiveness evaluation of the inter-graph attention component: As shown in Fig. 5, the accuracy curves of DLA-GCN lie above those of the “r/ IG w/ CN” variant. This result demonstrates that the local and global spatial correlations are distinct and dynamic, and it shows that the inter-graph attention component enables the dynamic selection of information flows from the two spatial feature extraction modules.

  • (5) Effectiveness evaluation of the NMPL component: The experimental results for models with different parameter schemes are shown in the right panel of Fig. 5. The performance of DLA-GCN is generally better than that of the “r/ NMPL w/ SP” variant across the different time slots, which proves the effectiveness of the NMPL component. We conclude from the comparison that shared parameters only learn the prominent patterns of all nodes and do not capture possible node-specific patterns. In brief, the NMPL algorithm not only employs shared parameters to learn prominent patterns but also uses independent parameters to capture more fine-grained patterns.

  • (6) Reasonableness evaluation of the number of time steps: To demonstrate the relationship between the number of time steps and the time complexity, we conducted the comparative experiments shown in Table 4. From Table 4, it can be observed that the model complexity exhibits an exponential relationship with the number of nodes. As the number of nodes in the dynamic graph increases with each time step, the relationship between the number of time steps and the time complexity must be considered. Comparing the 16th, 14th, and 12th time steps, we found that the time complexity increases significantly after the 12th time step, which indicates that the trade-off between long-term analysis and complexity should be balanced. Meanwhile, inspired by the time step settings in other authoritative papers on dynamic graphs [30, 37], we chose 12 time steps for the long-term performance analysis.

Fig. 4 Influence of transfer learning on the Cora and BlogCatalog datasets

Fig. 5 Ablation experiment of the different spatial modules, attention weight, and NMPL on the Cora and BlogCatalog datasets

Table 4 Time complexity comparison between different time steps on the BlogCatalog dataset

4.6 Model analysis

Embedding dimension: The node embedding dimension is a key hyper-parameter in DLA-GCN: it influences how well DLGCN captures spatial correlation and determines the diversity of the shared parameters. The performance for different embedding dimensions on the Cora dataset is given in Fig. 6. DLA-GCN performs well across all embedding dimensions and performs best when the embedding dimension is 10; an excessively small or large embedding dimension weakens its performance. Therefore, DLA-GCN should employ an appropriate embedding dimension to balance performance and complexity.

Fig. 6 Influence of embedding dimension on the Cora dataset

Time complexity of the NMPL algorithm: To demonstrate that the parameter increase in the NMPL algorithm is reasonable, Table 5 reports the time complexity and model performance of DLA-GCN and the “r/ NMPL w/ SP” variant. From Table 5, DLA-GCN shows a significant performance improvement over the “r/ NMPL w/ SP” variant on both datasets, while the increase in time complexity is small. Therefore, given the significant performance gain, the computational cost of the NMPL algorithm is moderate.

Table 5 Time complexity comparison and model performance on the Cora and BlogCatalog datasets. Note that “SP” denotes the shared-parameter method

Computation cost: To further evaluate the computation cost of DLA-GCN relative to the baseline models, we compare its training time with that of the GCN, GAT-AE, DySAT, DS-TAGCN, FADGC, TL-DCRNN, and NodeTrans models (as shown in Table 6). FADGC has a long training time, and DS-TAGCN has the longest owing to its complex spatio-temporal self-attention mechanism. DLA-GCN employs the DATL method to transfer the similar features, resulting in a shorter training time.

Table 6 Computation cost comparison on the Cora and BlogCatalog datasets

5 Conclusion

In this work, we propose DLGCN to capture richer spatial correlation. Specifically, the network employs two different adjacency matrices to exploit both the local and the global information of the graphs, and an inter-graph attention mechanism to adaptively aggregate them into a unified embedding for the node classification task. To capture both prominent patterns and node-specific patterns, we design the NMPL algorithm, which employs a matrix decomposition to learn two smaller parameter matrices instead of a unique parameter space for each node. For dynamic temporal correlation, we propose the DATL method, which learns and transfers similar features across time steps as the historical information of new nodes. The experiments verify the effectiveness of the double-layer graph and NMPL for spatial feature aggregation, and the effectiveness of DATL for modeling temporal dynamics.