1 Introduction

Heterogeneous Graphs (HGs), also known as Heterogeneous Information Networks (HINs), are common network structures in the real world composed of multiple types of nodes and edges [1]. For example, an academic network can be represented as a HG consisting of three types of nodes (author, paper, subject) and two types of edges (authors write papers, papers belong to subjects), as shown in Fig. 1(a). Similarly, a network of legal documents, such as civil judgments, can also be represented as a HG consisting of five types of nodes (plaintiff, defendant, judge, instrument, cause) and three types of edges (judges write instruments, instruments contain causes, plaintiffs and defendants are parties involved in the instruments), as shown in Fig. 1(b). In recent years, Heterogeneous Graph Neural Networks (HGNNs) have achieved significant success in handling HG data [2]. This is primarily because they integrate message-passing mechanisms with the inherent heterogeneity of the data, enabling a more comprehensive capture of the intricate structures and rich semantic information in heterogeneous graphs [3]. With the prevalence of large-scale complex networks, HGNNs have become a powerful tool in fields such as social networks [4], e-commerce [5], smart justice [6], and bioinformatics [7].

Fig. 1 Examples of heterogeneous graphs

Semi-supervised learning (SSL) [8] is a machine learning paradigm that aims to enhance model performance by leveraging both labeled and unlabeled data. In traditional supervised learning, models are trained and make predictions using only labeled data. However, in many practical scenarios, obtaining a large amount of labeled data is expensive or challenging. The goal of semi-supervised learning is to improve the generalization capability and performance of models by utilizing a limited amount of labeled data alongside a large amount of unlabeled data.

In recent years, graph-based SSL methods have made significant progress [9]. However, few studies have provided an overarching view of the core issue of SSL, namely that insufficient labeled data can lead to overfitting and distribution shift [10]. In addition, existing SSL methods such as GCN (Semi-supervised Classification with Graph Convolutional Networks) [11], GraphSAGE (Inductive Representation Learning on Large Graphs) [12], and GAT (Graph Attention Networks) [13] typically focus on learning the mapping function between node representations and labels, where labels are used only to compute the classification loss of the output. This means that the process of learning node representations does not fully utilize label information, limiting the comprehensive consideration of label information in SSL [14].

Graph contrastive learning [15], a recent approach to self-supervised graph representation learning, optimizes the model by minimizing the distance between the target node and positive samples while maximizing the distance to negative samples [16]. Although contrastive learning can use the data itself to provide supervisory information for representation learning, it is not directly applicable to SSL [17]. Contrastive learning extracts features by learning the similarity and dissimilarity between data samples and is typically used in unsupervised or self-supervised tasks. In semi-supervised learning, however, we usually have a small amount of labeled data and a large amount of unlabeled data. In this scenario, contrastive learning faces two main challenges. First, because unlabeled data vastly outnumber labeled data, contrastive learning can suffer from difficulties in measuring similarity among the many unlabeled samples, leading to instability and inaccuracy in feature learning. Second, semi-supervised learning emphasizes how to use information from labeled data to guide the learning process, whereas the core of contrastive learning lies in unsupervised learning, which may not effectively exploit label information. Therefore, contrastive learning cannot be directly applied to semi-supervised learning. Furthermore, few studies have fully utilized valuable label information to supervise the construction of effective positive and negative samples in the contrastive loss.

Indeed, labels can carry valuable information that is beneficial for node classification. Firstly, each label can be seen as a virtual center for nodes belonging to that label, reflecting the proximity of intra-class nodes. For example, in an academic network, papers within the same field are more relevant than those from different fields. In a business network, products within the same category often share similar characteristics. Secondly, labels are associated with rich semantics, and certain labels can be semantically close to each other. For instance, the fields of artificial intelligence and machine learning are more interrelated than artificial intelligence and chemistry. The relationship between computers and mice is closer than that between computers and digital cameras. Therefore, when classifying paper domains or product categories, it is essential to explore the rich information provided by labels. This motivates us to design a new framework that thoroughly considers the performance of GNNs in semi-supervised node classification by leveraging label information.

In this work, we focus on exploring, building upon, and proposing a label information based method for semi-supervised HGNN. To achieve this, we are faced with two key challenges: (1) How to explicitly incorporate label signals into the graph structure? (2) How to construct more reliable positive and negative samples by the label and semantic information in HGs?

To address these issues, we propose a new framework, Semi-Supervised Heterogeneous Graph Contrastive Learning with Label-Guided (SSGCL-LG), designed to maximize the use of label information and thereby enhance the performance of HGNNs in semi-supervised tasks. In this paper, we integrate rich label information comprehensively into the GNN to facilitate semi-supervised node classification. We construct a label graph in which a new node with semantic features is created for each label and connected to the intra-class nodes, so that each label acts as the center of its corresponding nodes. By utilizing a message-passing mechanism to jointly learn node and label representations, we can effectively smooth intra-class node representations and explicitly encode label semantics. Additionally, we apply label information to the selection of positive samples, fully leveraging labels to tightly cluster nodes of the same category in the embedding space.

Specifically, to capture both homogeneous and heterogeneous neighborhood information effectively, we decompose the heterogeneous graph into multiple homogeneous and heterogeneous subgraphs based on metapaths. We first introduce a strategy where heterogeneous subgraphs guide the fusion of homogeneous subgraphs. Then, we treat labels as special nodes and design a label graph to explicitly encode label information into the learning process of Graph Neural Networks (GNNs). Furthermore, we introduce a contrastive loss for semi-supervised learning, aiming to fully leverage the supervisory signals inherent in the data itself. The semi-supervised contrastive loss is built upon the foundation of self-supervised contrastive loss functions, utilizing the supervision signals from both labeled and unlabeled data. This tightens the embedding of nodes within the same class, leading to improved classification accuracy. This method enables better utilization of label information in SSL, overcoming the challenge of sparse labeled data, thereby enhancing the performance of HGNNs in semi-supervised tasks.

The remainder of this paper is organized as follows: Section 2 surveys the related work. Section 3 presents some theory about heterogeneous graphs and provides formal definitions. Section 4 presents the Semi-Supervised Heterogeneous Graph Contrastive Learning with Label-Guided model. Section 5 describes the experiments performed in this study together with an analysis of the results. Finally, Section 6 concludes this paper and discusses future work.

2 Related work

In this section, we introduce related work on graph neural networks and give a brief description of graph representation learning as well as graph contrastive learning.

2.1 Graph neural networks

GNNs propagate and aggregate node features through multiple neural layers to predict labels from feature propagation [18]. For example, GCN [11] obtains node representations that aggregate neighborhood information through an approximation of spectral graph convolutions. GAT [13] assigns attention coefficients based on the feature similarity between nodes to aggregate neighborhood information. GraphSAGE [12] samples a fixed number of neighbor nodes and aggregates the representations of neighbors at each layer. Additionally, AM-GCN (Adaptive Multi-channel Graph Convolutional Networks) [19] learns specific and common embeddings for nodes in both the topological and feature spaces and constrains the diversity and consistency of node embeddings by measuring the similarity between the specific and common embeddings. However, it is important to note that the aforementioned models cannot be directly applied to heterogeneous graphs.

Fig. 2 An example of HIN

HGNNs learn node representation by capturing information from different types of nodes and edges through metapaths or relation types. For instance, HAN (Heterogeneous graph attention network) [20] learns the importance between nodes and their neighbors under meta-paths through node-level attention and the importance of different meta-paths through semantic-level attention. Building upon HAN, MAGNN (Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding) [21] further enhances node representation by aggregating information from heterogeneous nodes within meta-paths. Additionally, HetGNN (Heterogeneous Graph Neural Network) [22] uses random walks with a restart to sample a fixed-size neighborhood and integrates features of the same or different types of nodes through a bidirectional LSTM (Long Short Term Memory). HGT (Heterogeneous Graph Transformer) [23] captures the importance of different types of edges by computing attention coefficients between nodes and aggregates edge attention with node information for message passing. Finally, HGSL (Heterogeneous Graph Structure Learning for Graph Neural Networks) [24] achieves heterogeneous graph structure learning by fusing multiple subgraphs (feature graph, semantic graph, and the original graph). ie-HGCN (Interpretable and Efficient Heterogeneous Graph Convolutional Network) [25] is a relation extraction model based on graph neural networks that uses a combination of various relation representation methods, effectively capturing dependencies and contextual information between entities. RoHe (Robust Heterogeneous Graph Neural Networks against Adversarial Attacks) [26] employs an attention purifier that can prune malicious neighbors based on topology and features, thus eliminating the negative influence of malicious neighbors in the soft attention mechanism. HPN (Heterogeneous Graph Propagation Network) [27] is a graph neural network model for graph classification that enhances model performance through hierarchical graph pooling and structure learning, effectively handling graph structures at different levels.

2.2 Graph contrastive learning

Contrastive learning on graphs follows the principle of Mutual Information (MI) maximization [28], which aims to pull closer the representation of samples with similar information while pushing away the representation of unrelated samples [29]. In heterogeneous graph contrastive learning, it is common to perform MI maximization on samples at different scales (i.e., node-level and graph-level representation). HDGI (Heterogeneous Deep Graph Infomax) [30] fuses node representation under different meta-paths through semantic-level attention to form positive sample node representation and optimizes node representation by maximizing the mutual information between positive samples and graph-level representation. DMGI (Unsupervised Attributed Multiplex Network Embedding) [31] optimizes node representation by maximizing mutual information between subgraph-level representation learned under each relation subgraph and node-level representation. Additionally, a recent self-supervised heterogeneous graph neural network HeCo (Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning) [32] that maximizes node-level mutual information has attracted widespread attention. It employs collaborative contrastive learning from the perspectives of network schema and meta-paths to uncover more information in heterogeneous graphs. However, existing heterogeneous graph contrastive learning methods are only used in self-supervised models and cannot directly utilize label information.

Although these methods provide insightful solutions for learning on heterogeneous graphs, they still fail to capture the rich information contained in labels.

This paper proposes a label-guided semi-supervised contrastive learning framework that integrates the rich label information into GNN learning by jointly learning the representation of nodes and labels.

3 Preliminary

Definition 1. Heterogeneous Graph. A heterogeneous graph is defined as \(G = (V, E, \mathcal {A}, \mathcal {R}, \phi , \varphi )\), where \(V\) and \(E\) represent the sets of nodes and edges, respectively. \(\mathcal {A}\) and \(\mathcal {R}\) represent the sets of node types and edge types, with \(|\mathcal {A}| + |\mathcal {R}| > 2\). Two mapping functions are defined: \(\phi : V \rightarrow \mathcal {A}\) for node types and \(\varphi : E \rightarrow \mathcal {R}\) for edge types. \(V\) is divided into two categories: labeled nodes \(V_L\) and unlabeled nodes \(V_U\), where \(V_L \cup V_U = V\).

For example, Fig. 2(a) illustrates a heterogeneous graph composed of multiple types of nodes (paper, author, subject) and relationships (the writing relationship between authors and papers, and the belonging relationship between papers and subjects).

Definition 2. Metapath. A metapath is defined as a path in a heterogeneous graph: \(A_1 \xrightarrow {R_1} A_2 \xrightarrow {R_2} \cdots \xrightarrow {R_l} A_{l+1}\), representing a composite connection relationship \(R = R_1 \circ R_2 \circ \cdots \circ R_l\) between \(A_1\) and \(A_{l+1}\), where \(\circ \) denotes the composition operator on relationships, \( A_{l} \) \(\in \) \( \mathcal {A} \), \( R_l \) \(\in \) \( \mathcal {R} \).

A “path” typically refers to a sequence of connections between nodes in a graph, while a “metapath” refers to a specific type of path pattern. For example, Fig. 2(b) illustrates two metapaths, PAP and PSP, where PAP connects two papers that share a common author, and PSP connects two papers that belong to the same subject.
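For illustration, the sketch below (a toy example with made-up data, not the paper's implementation) shows how a metapath-based homogeneous adjacency such as PAP can be derived from a paper-author incidence matrix by a simple matrix product.

```python
import numpy as np

# Toy paper-author incidence matrix (rows: papers, cols: authors); the values
# are hypothetical and only illustrate the construction, not the paper's data.
pa = np.array([[1, 0, 0],
               [1, 1, 0],
               [0, 1, 0],
               [0, 0, 1]])

# PAP adjacency: two papers are connected if they share at least one author.
pap = (pa @ pa.T) > 0
np.fill_diagonal(pap, False)          # drop self-loops
print(pap.astype(int))
```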

Definition 3. Metapath-based Homogeneous Subgraph. For a given heterogeneous graph \( G \), a metapath \( P \), and the node set \( V \), the homogeneous subgraph \( G^{ho} = (V, E, \mathcal {A}, \mathcal {R}) \subseteq G \) is defined as the graph constructed from all neighbor pairs connected by metapath \( P \). Note that \( P \) starts and ends with the same node type, so \( |\mathcal {A}| = 1 \) and \( |\mathcal {R}| = 1 \).

For example, in Fig. 2(c), it shows the homogeneous subgraph generated by the two metapaths PAP and PSP, where the homogeneous subgraph only contains nodes of type paper.

Definition 4. Metapath-based Heterogeneous Subgraph. Given a metapath \( P \) and the node set \( V \) in a heterogeneous graph \( G \), the metapath-based heterogeneous subgraph \( G^{he} = (V, E, \mathcal {A}, \mathcal {R}) \subseteq G \) is defined as the graph constructed from pairs of neighboring nodes of different types connected through the metapath, where \( |\mathcal {A}| + |\mathcal {R}| > 2 \).

For example, in Fig. 2(d), it shows the heterogeneous subgraph based on two metapaths PAP and PSP. The heterogeneous subgraph under the metapath PAP contains only paper nodes and author nodes, while the heterogeneous subgraph under the metapath PSP contains only paper nodes and subject nodes.

Definition 5. Label Graph. The node label graph \( G^{Y} \in \mathbb {R}^{M \times C} \) is composed of one-hot vectors for labeled nodes and zero vectors for unlabeled nodes, where \( M \) is the number of nodes in \( V \) and \( C \) is the number of label classes. Specifically, each labeled node \( V_i \in V_L \) has a one-hot vector \( Y_i \in \{0, 1\}^C \), where the entry 1 indicates the label category of \( V_i \). For each unlabeled node \( V_i \in V_U \), \( Y_i \in \{0\}^C \) is an all-zero vector.

For example, in Fig. 2(e), the labeled node \( V_0 \) has \( Y_0 = \{1, 0\} \), where 1 represents that the node \( V_0 \) belongs to label 0. In the label graph \( G^{Y} \) shown in Fig. 2(e), it is represented as

$$ G^Y = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix} $$
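The following sketch shows one way to build such a label graph \(G^Y\) in code; the encoding of unlabeled nodes as -1 is an assumption for illustration, and the example reproduces the matrix above.

```python
import numpy as np

def build_label_graph(labels, num_classes):
    """Build G^Y: one-hot rows for labeled nodes, all-zero rows otherwise.

    `labels` holds the class index for labeled nodes and -1 for unlabeled ones;
    this encoding is an assumption made for illustration.
    """
    g_y = np.zeros((len(labels), num_classes))
    for i, y in enumerate(labels):
        if y >= 0:
            g_y[i, y] = 1.0
    return g_y

# Reproduces the matrix of Fig. 2(e): nodes V1 and V4 are unlabeled (-1).
print(build_label_graph([0, -1, 1, 0, -1, 1], num_classes=2))
```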

Definition 6. Node Embedding [21]. Node Embedding is a technique that maps nodes in a graph to a low-dimensional vector space, commonly used for representation learning of graph data.

For a node \( v_i \in V \), its Node Embedding is denoted as \({v}_i = f(v_i, G) \), where \( f \) is a mapping function that maps node \( v_i \) and the entire graph \( G \) to a vector representation \( {v}_i \) in \( \mathbb {R}^d \) space. \( {v}_i \in \mathbb {R}^d \), where \( d \) is the dimensionality of the chosen embedding space.

Problem. Heterogeneous Graph Embedding. Given a heterogeneous graph \(G = (V, E, \mathcal {A}, \mathcal {R}, \phi , \varphi )\) with node attribute matrices \(X_{A_i}\), heterogeneous graph embedding is the task of learning d-dimensional node representations \(Z \in \mathbb {R}^d\) with \(d \ll |V|\) that capture the rich structural and semantic information involved in \(G\).

4 The proposed method

In this section, we present the proposed Semi-Supervised Heterogeneous Graph Contrastive Learning with Label-Guided (SSGCL-LG) model, as illustrated in Fig. 3. The model comprises three parts: (a) metapath-based heterogeneous subgraphs, (b) metapath-based homogeneous subgraphs, and (c) contrastive learning. Specifically, to better capture information from homogeneous and heterogeneous neighbors in the heterogeneous graph, SSGCL-LG decomposes the graph into multiple metapath-based homogeneous and heterogeneous subgraphs. In part (a), information from the different metapath-based heterogeneous subgraphs is aggregated using attention mechanisms. In part (b), a label graph is constructed and concatenated with the homogeneous subgraphs to learn node representations with a GNN encoder. Finally, in part (c), positive and negative samples are selected and the model is optimized through a combination of contrastive loss and cross-entropy loss.

Fig. 3 SSGCL-LG model

4.1 Metapath-based heterogeneous subgraph embedding

Most current research on heterogeneous graphs relies on metapaths to capture specific semantic information in graphs. However, these heterogeneous graph models are primarily constrained by two limitations: first, many of them only aggregate information from homogeneous neighbors connected by metapaths, thereby discarding the rich structural and attribute information of heterogeneous neighbors; second, some studies aggregate information from both homogeneous and heterogeneous neighbors but treat these neighbors indiscriminately in the same way. As a result, these methods may lose important information and lead to suboptimal performance; we illustrate this point with the example below. Therefore, SSGCL-LG first partitions the heterogeneous graph into homogeneous and heterogeneous subgraphs based on metapaths, enabling comprehensive learning of the complex information in the heterogeneous graph.

As shown in Fig. 2(c) of the heterogeneous graph constructed with the metapath PAP, a homogeneous subgraph is formed with paper nodes (\(V_0\), \(V_1\), \(V_2\), \(V_5\)). For a specific node (e.g., \(V_0\)), if we only aggregate information from homogeneous neighbors (\(V_1\), \(V_2\), \(V_5)\), the structural and attribute information contributed by heterogeneous neighbors (\(V_8\), \(V_9\), \(V_{10}\)) connected to it will be ignored. Therefore, considering only homogeneous subgraphs can lead to the loss of a significant amount of useful interaction information from the original graph. Additionally, there are different interaction patterns between nodes and neighbors of different types, which often carry different semantics and should be considered separately to avoid information loss. It’s worth noting that nodes of different types typically have different attributes. For example, in a recommendation system, user node attributes may include age, gender, interests, while item attributes may include price, text descriptions, images, etc. Original attributes cannot be directly transferred between nodes of different types and require pre-transformation.

Since nodes of different types in a HIN usually have feature vectors of different dimensions, in SSGCL-LG they need to be projected into a common space through type-specific transformations. Additionally, we treat labels as a special type of node and initialize the features of label nodes with unit (one-hot) vectors. Specifically, for nodes of type \(\Phi \), a type-specific mapping matrix \(w_{\Phi }\) is designed to transform their features \(X\) into the common space, as shown below:

$$\begin{aligned} H = \sigma (w_{\Phi }X + b_{\Phi }), \end{aligned}$$
(1)

where \(\sigma \) represents the activation function and \(b_{\Phi }\) denotes the bias vector.
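The sketch below illustrates the type-specific projection of Eq. (1) in PyTorch; the node types, input dimensions, and the choice of ELU activation are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TypeSpecificProjection(nn.Module):
    """Project features of each node type into a common d-dimensional space (Eq. 1).

    A minimal sketch: one linear layer (W_Phi, b_Phi) per node type followed by a
    non-linearity. The per-type input dimensions below are assumptions.
    """
    def __init__(self, in_dims, common_dim):
        super().__init__()
        self.proj = nn.ModuleDict(
            {t: nn.Linear(d, common_dim) for t, d in in_dims.items()})
        self.act = nn.ELU()

    def forward(self, feats):
        # feats: dict mapping node type -> feature matrix X of that type
        return {t: self.act(self.proj[t](x)) for t, x in feats.items()}

proj = TypeSpecificProjection({"paper": 128, "author": 64, "label": 3}, common_dim=32)
out = proj({"paper": torch.randn(5, 128),
            "author": torch.randn(4, 64),
            "label": torch.eye(3)})          # label nodes initialized with unit vectors
print({t: h.shape for t, h in out.items()})
```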

Different metapaths represent different semantic information. For \(M\) metapaths, SSGCL-LG constructs the corresponding heterogeneous subgraphs \(\{G_1^{he}, \ldots , G_M^{he}\}\). For each subgraph \(G_n^{he}\), node embeddings \(H_n^{he}\) are learned using a GCN [11]. Specifically:

$$\begin{aligned} \left( H_n^{he}\right) ^{(l)} = \sigma \left( \hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}\left( H_n^{he}\right) ^{(l-1)} W^{(l)}\right) , \end{aligned}$$
(2)

where \(\hat{A} = A + I\) represents the adjacency matrix of the heterogeneous subgraph \(G_n^{he}\) with added self-loop connections, \(\hat{D}\) is the degree matrix of \(\hat{A}\), \(W^{(l)}\) denotes the weight matrix of the \(l\)-th graph convolutional layer, and \(\left( H_n^{he}\right) ^{(l)}\) represents the node representations at the \(l\)-th layer.
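As a hedged illustration of Eq. (2), the sketch below implements a single dense GCN propagation step; the toy adjacency, tensor shapes, and the use of dense matrices are assumptions made for brevity.

```python
import torch

def gcn_layer(adj, h, weight, act=torch.relu):
    """One GCN propagation step (Eq. 2): sigma(D^-1/2 (A+I) D^-1/2 H W).

    `adj` is the dense adjacency of one metapath-based subgraph; this is a
    sketch only, and the actual model may rely on sparse operations.
    """
    a_hat = adj + torch.eye(adj.size(0))           # add self-loops
    d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)        # D^-1/2 of A_hat
    a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    return act(a_norm @ h @ weight)

adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
h = torch.randn(3, 8)
w = torch.randn(8, 16)
print(gcn_layer(adj, h, w).shape)                  # torch.Size([3, 16])
```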

After obtaining the embeddings of each heterogeneous subgraph, \(\{H_1^{he}, \ldots , H_M^{he}\}\), SSGCL-LG uses semantic-level attention to fuse them, resulting in the node representation \(Z^{he}\) under the heterogeneous subgraphs:

$$\begin{aligned} Z^{he} = \sum _{n=1}^{M} \alpha _n \cdot H_n^{he}, \end{aligned}$$
(3)

where \(\alpha _n\) represents the weight of the heterogeneous subgraph \(G_n^{he}\), calculated as follows:

$$\begin{aligned} e_{G_n^{he}} = \frac{1}{|V |} \sum _{i \in V} {a}^T \tanh \left( W \cdot \left[ H_n^{he}\right] _i + b\right) , \end{aligned}$$
(4)
$$\begin{aligned} \alpha _n = \text {softmax}\left( e_{G_n^{he}}\right) = \frac{\exp \left( e_{G_n^{he}}\right) }{\sum _{m=1}^{M}{\exp \left( e_{G_m^{he}}\right) }}, \end{aligned}$$
(5)

where \(\left[ H_n^{he}\right] _i\) denotes the representation of node \(i\) in \(H_n^{he}\).

Note that, because the importance of heterogeneous subgraphs varies across different metapaths, we can compute weights for aggregating different heterogeneous subgraphs, denoted as \(\{\alpha _1, \ldots , \alpha _M\}\).
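A minimal sketch of this semantic-level attention (Eqs. (3)-(5)) is given below; the hidden size, the module name SemanticAttention, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Fuse per-metapath node embeddings with semantic-level attention (Eqs. 3-5).

    Sketch: score each subgraph by averaging a^T tanh(W h_i + b) over nodes,
    softmax the scores into alpha_n, and return the weighted sum.
    """
    def __init__(self, dim, hidden=16):
        super().__init__()
        self.project = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1, bias=False))

    def forward(self, h_list):
        h = torch.stack(h_list, dim=0)             # (M, N, dim)
        e = self.project(h).mean(dim=1)            # (M, 1): average over nodes
        alpha = torch.softmax(e, dim=0)            # weights over the M subgraphs
        return (alpha.unsqueeze(-1) * h).sum(dim=0), alpha.squeeze(-1)

att = SemanticAttention(dim=32)
z_he, alpha = att([torch.randn(10, 32) for _ in range(3)])   # M = 3 metapaths
print(z_he.shape, alpha)
```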

4.2 Metapath-based homogeneous subgraph embedding

For the different metapaths representing distinct semantic information, SSGCL-LG constructs homogeneous subgraphs \(\{G_1^{ho}, \ldots , G_M^{ho}\}\), yielding one adjacency (weight) matrix per metapath, \(\{w_1^{ho}, \ldots , w_M^{ho}\}\), for constructing the fused homogeneous subgraph.

Traditionally, when integrating different views, the channel attention method described in HGSL [24] is employed:

$$\begin{aligned} w^{ho} = \Psi [w_1^{ho}, \ldots , w_M^{ho}], \end{aligned}$$
(6)

where \(\Psi \) represents a channel attention layer with parameters \(W^{\Psi } \in \mathbb {R}^{1 \times 1 \times M}\), which performs a \(1 \times 1\) convolution on the input using \(\text {softmax}(W^{\Psi })\).

However, this method only applies a softmax within the \(1 \times 1\) convolution and neglects the influence of node features on graph structure fusion. To account for the impact of node features, we reuse the semantic-level attention learned on the heterogeneous subgraphs: the resulting weight coefficients \(\{\alpha _1, \ldots , \alpha _M\}\) guide the construction of the fused weight matrix \(w^{ho}\) for the homogeneous subgraphs,

$$\begin{aligned} w^{ho} = \sum _{n=1}^{M} \alpha _n w_n^{ho}. \end{aligned}$$
(7)

SSGCL-LG encodes label information by creating a node type for labels and establishing connections with the nodes of the same class, which yields the connection matrix \(G^{Y}\) between labels and their intra-class nodes. \(G^{Y}\) is concatenated with \(w^{ho}\) to create the label-augmented heterogeneous adjacency matrix \(w^{la}\). A GCN [11] is then used to learn the node representations \(Z^{ho}\) on the label subgraph,

$$\begin{aligned} w^{la} = \left[ \begin{matrix}w^{ho} & G^{Y}\\ {G^{Y}}^T & 0\end{matrix}\right] , \end{aligned}$$
(8)
$$\begin{aligned} Z^{ho} = \sigma \left( \hat{D}^{-\frac{1}{2}} w^{la} \hat{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\right) . \end{aligned}$$
(9)

In this work, we use one-hot encoding to represent label features; richer representations can be used when label attributes or prior knowledge about the labels are explicitly available. When message passing is further performed on \(w^{la}\), labels contribute in two ways. First, each label serves as a virtual center for its intra-class nodes, making them 2-hop neighbors in \(w^{la}\) even if they are far apart in the original graph, which enhances the smoothness of intra-class node representations. Second, modeling label semantics through the encoding helps discover semantic correlations among labels: although there are no direct connections between labels, they can still receive messages from each other through higher-order interactions, which aids in uncovering their implicit relationships.
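The block construction of \(w^{la}\) in Eq. (8) can be sketched as follows; the toy homogeneous adjacency and the label graph of Fig. 2(e) are used only for illustration.

```python
import torch

def label_augmented_adjacency(w_ho, g_y):
    """Build w^la (Eq. 8): a block matrix stacking the fused homogeneous adjacency
    w_ho with the node-label connections G^Y, so each label acts as a virtual
    center connected to its intra-class nodes. A sketch with toy shapes.
    """
    n, c = g_y.shape
    top = torch.cat([w_ho, g_y], dim=1)                      # (N, N+C)
    bottom = torch.cat([g_y.t(), torch.zeros(c, c)], dim=1)  # (C, N+C)
    return torch.cat([top, bottom], dim=0)                   # (N+C, N+C)

w_ho = torch.rand(6, 6)
g_y = torch.tensor([[1., 0.], [0., 0.], [0., 1.],
                    [1., 0.], [0., 0.], [0., 1.]])           # label graph of Fig. 2(e)
w_la = label_augmented_adjacency(w_ho, g_y)
print(w_la.shape)                                            # torch.Size([8, 8])
# w_la can then be fed to the same GCN propagation used in Eq. (9).
```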

Fig. 4 Positive sample selection strategy

4.3 Positive sample selection strategy

As illustrated in Fig. 4, when selecting positive samples, we consider that only a small number of nodes carry label information: nodes with the same label are treated as positive samples, and nodes with different labels as negative samples. In addition, since nodes are typically connected by multiple paths and are highly correlated, we propose a positive sample selection strategy: if multiple metapaths connect two nodes, they are treated as positive samples of each other. This is depicted in Fig. 4, where links between papers indicate that they are positive samples for each other. One advantage of this strategy is that the selected positive samples better reflect the local structure of the target node.

In Fig. 4, for example, node \( V_0 \) and node \( V_3 \) belong to label 0, while node \( V_2 \) and node \( V_5 \) belong to label 1. Thus, node \( V_0 \) and node \( V_3 \) are positive samples for each other, and they are negative samples for node \( V_2 \) and node \( V_5 \). Among other nodes with unknown labels, node \( V_0 \) has two paths connecting to node \( V_1 \) and one path connecting to node \( V_4 \). Assuming the threshold \( T \) is set to 2, then node \( V_0 \) and node \( V_1 \) are positive samples for each other, and node \( V_0 \) and node \( V_4 \) are negative samples for each other. Thus, the positive sample set for node \( V_0 \) is [\( V_1, V_3 \)], and the negative sample set is [\( V_2, V_4, V_5 \)].

If there is a metapath connecting two nodes, then these two nodes are related. The more metapaths between two nodes, the stronger their correlation. For nodes \(i\) and \(j\), we define a function \(C_i(\cdot )\) to count the number of metapaths connecting these two nodes:

$$\begin{aligned} C_i(j) = \sum _{n=1}^{M} \theta \left( j \in N_i^{n}\right) , \end{aligned}$$
(10)

where \(\theta (\cdot )\) denotes the indicator function and \(N_i^{n}\) is the set of neighbors of node \(i\) in the \(n\)-th metapath-based subgraph. We construct the set \(\mathbb {C}_i = \left\{ j \mid j \in V \text { and } C_i(j) \ne 0 \right\} \), sort it in descending order of \(C_i(j)\), and take the nodes whose count reaches the threshold \(T\) as candidate positive samples \(\mathbb {P}_i^T\).

To make full use of the limited label information available for some nodes, we consider nodes with the same label as positive samples and nodes with different labels as negative samples. Specifically, we construct a set \(\mathbb {Q}_i = \left\{ j \mid j \in V \text { and } Q_i(j) = 1 \right\} \), where \(Q_i(j) \in \{0,1\}\) is the label discrimination function (\(Q_i(j) = 1\) when nodes \(i\) and \(j\) have the same label, and \(Q_i(j) = 0\) otherwise). The final positive samples are obtained from both the metapath and the label information, \(\mathbb {P}_i = \mathbb {P}_i^T \cup \mathbb {Q}_i\), and the remaining nodes serve as negative samples \(\mathbb {N}_i\).
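The following sketch combines the metapath-count rule and the label rule described above; the adjacency matrices, the label encoding (-1 for unlabeled), and the threshold value are toy assumptions, and the example reproduces the sets obtained for \(V_0\) in Fig. 4.

```python
import numpy as np

def select_samples(metapath_adjs, labels, i, threshold=2):
    """Sketch of the positive sample selection of Section 4.3.

    C_i(j) counts over how many metapath-based subgraphs i and j are connected;
    nodes with C_i(j) >= threshold, plus nodes sharing i's label, are positives,
    and the remaining nodes are negatives.
    """
    n = len(labels)
    counts = sum((adj[i] > 0).astype(int) for adj in metapath_adjs)   # C_i(.)
    positives = {j for j in range(n) if j != i and counts[j] >= threshold}
    if labels[i] >= 0:                       # -1 marks unlabeled nodes
        positives |= {j for j in range(n) if j != i and labels[j] == labels[i]}
    negatives = set(range(n)) - positives - {i}
    return sorted(positives), sorted(negatives)

# Toy graph mirroring Fig. 4: V0 reaches V1 via two metapaths, V4 via one.
pap = np.array([[0,1,0,0,1,0],[1,0,0,0,0,0],[0,0,0,0,0,0],
                [0,0,0,0,0,0],[1,0,0,0,0,0],[0,0,0,0,0,0]])
psp = np.array([[0,1,0,0,0,0],[1,0,0,0,0,0],[0,0,0,0,0,0],
                [0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]])
labels = [0, -1, 1, 0, -1, 1]
print(select_samples([pap, psp], labels, i=0))   # ([1, 3], [2, 4, 5])
```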

4.4 Training

The semi-supervised contrastive loss is an extension of the self-supervised contrastive loss. As evident from the selection of positive samples, the incorporation of label information expands the number of positive node pairs in semi-supervised contrastive learning.

After obtaining the embedding \(z_i^{ho}\) from the homogeneous subgraphs and \(z_i^{he}\) from the heterogeneous subgraphs, we feed them into an MLP with one hidden layer to map them into the space where the contrastive loss is calculated:

$$\begin{aligned} z_i^{ho,proj} = w^{(2)}\sigma \left( w^{(1)}z_i^{ho} + b^{(1)}\right) + b^{(2)},\end{aligned}$$
(11)
$$\begin{aligned} z_i^{he,proj} = w^{(2)}\sigma \left( w^{(1)}z_i^{he} + b^{(1)}\right) + b^{(2)}, \end{aligned}$$
(12)

where \(\sigma \) is the activation function, \( w^{(1)} \) and \( b^{(1)} \) are the weight matrix and bias of the first layer, which map the input embedding \( z_i \) to the hidden layer, and \( w^{(2)} \) and \( b^{(2)} \) are the weight matrix and bias of the second layer, which map the hidden representation to the output layer.

After obtaining the positive sample set \(\mathbb {P}_i\) and negative sample set \(\mathbb {N}_i\), the loss for the homogeneous subgraphs is computed as:

$$\begin{aligned} \mathcal {L}_i^{ho} = -\log {\frac{\sum _{j \in \mathbb {P}_i} \exp \left( \text {sim}\left( z_i^{ho,proj}, z_j^{he,proj}\right) /\tau \right) }{\sum _{k \in \{\mathbb {P}_i \cup \mathbb {N}_i\}} \exp \left( \text {sim}\left( z_i^{ho,proj}, z_k^{he,proj}\right) /\tau \right) }}, \end{aligned}$$
(13)

where \(\text {sim}(u, v)\) is the cosine similarity function and \(\tau \) is a temperature parameter.

In the homogeneous subgraphs perspective, the target embedding \(z_i^{ho,proj}\) comes from the homogeneous subgraphs, while the positive and negative sample embeddings \(z_j^{he,proj}\) and \(z_k^{he,proj}\) come from the heterogeneous subgraphs.

Similarly, the loss in the heterogeneous subgraphs perspective is:

$$\begin{aligned} \mathcal {L}_i^{he} = -\log {\frac{\sum _{j \in \mathbb {P}_i} \exp \left( \text {sim}\left( z_i^{he,proj}, z_j^{ho,proj}\right) /\tau \right) }{\sum _{k \in \{\mathbb {P}_i \cup \mathbb {N}_i\}} \exp \left( \text {sim}\left( z_i^{he,proj}, z_k^{ho,proj}\right) /\tau \right) }}. \end{aligned}$$
(14)

The difference is that the target embedding \(z_i^{he,proj}\) comes from the heterogeneous subgraphs, while the positive and negative sample embeddings \(z_j^{ho,proj}\) and \(z_k^{ho,proj}\) come from the homogeneous subgraphs.

Therefore, the contrastive loss is as follows:

$$\begin{aligned} L_{con} = \frac{1}{|V |} \sum _{i \in V} \left[ \lambda \cdot \mathcal {L}_i^{ho} + (1-\lambda ) \cdot \mathcal {L}_i^{he}\right] , \end{aligned}$$
(15)

where \(\lambda \) is used to balance the losses from the two subgraphs.
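A compact sketch of the cross-view contrastive objective (Eqs. (13)-(15)) is shown below; the positive-sample mask, embedding sizes, and \(\lambda \) value are placeholders for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, pos_mask, tau=0.5):
    """One direction of the cross-view loss (Eq. 13 / Eq. 14), as a sketch.

    z_a: projected anchor embeddings from one view (e.g. homogeneous subgraphs),
    z_b: projected embeddings from the other view, pos_mask[i, j] = 1 iff j is a
    positive sample of i.
    """
    sim = torch.exp(F.cosine_similarity(z_a.unsqueeze(1), z_b.unsqueeze(0), dim=-1) / tau)
    pos = (sim * pos_mask).sum(dim=1)
    return -torch.log(pos / sim.sum(dim=1)).mean()

z_ho = torch.randn(6, 32)                    # projections of the two views
z_he = torch.randn(6, 32)
pos_mask = torch.eye(6)                      # toy mask: each node positive with itself only
lam = 0.5
loss = lam * contrastive_loss(z_ho, z_he, pos_mask) \
       + (1 - lam) * contrastive_loss(z_he, z_ho, pos_mask)   # Eq. (15)
print(loss.item())
```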

The cross-entropy loss function can be described as:

$$\begin{aligned} L_{he} = -\sum _{v_l \in y_L} {Y_{v_l} \cdot \ln \left( C \cdot Z_{v_l}^{he}\right) }, \end{aligned}$$
(16)
$$\begin{aligned} L_{ho} = -\sum _{v_l \in y_L} {Y_{v_l} \cdot \ln \left( C \cdot Z_{v_l}^{ho}\right) }, \end{aligned}$$
(17)

where \(y_L\) represents the set of labeled nodes, \(Y_{v_l}\) is the true label of node \(v_l\), \(Z_{v_l}^{he}\) and \(Z_{v_l}^{ho}\) are the corresponding node representations, and \(C\) represents the parameters of the classifier.

The cross-entropy loss is defined as follows:

$$\begin{aligned} L_{cro}= \mu L_{he} + (1-\mu )L_{ho}, \end{aligned}$$
(18)

where \(\mu \) is used to balance the losses from the two subgraphs.

Finally, by combining the contrastive loss \(L_{con}\) and the cross-entropy loss \(L_{cro}\), our overall loss function for SSGCL-LG can be represented as follows:

$$\begin{aligned} L = \beta L_{con} + (1-\beta )L_{cro}, \end{aligned}$$
(19)

where \(\beta \) is used to balance the contrastive loss and the cross-entropy loss.

Algorithm 1 describes the main flow of learning the enhanced representations of the target nodes.

Algorithm 1 The algorithm of attribute enhancement.

5 Experiments

In this section, we conduct extensive experiments to demonstrate the performance of SSGCL-LG. Specifically, we show the excellent performance of our method through node classification, node clustering, and visualization. Additionally, label importance analysis experiments, ablation experiments, and parameter analysis experiments further prove the effectiveness of SSGCL-LG.

5.1 Datasets

To evaluate the effectiveness of the proposed framework, we utilize three common Heterogeneous Information Network (HIN) datasets. Table 1 summarizes the statistics of these three datasets.

Table 1 Statistics of datasets
  • ACM: This is an academic network that includes three different types of nodes: 4,019 papers, 7,167 authors, and 60 subjects. The target nodes are papers, which are categorized into three different classes.

  • IMDB: This is a movie network that comprises three different types of nodes: 4,278 movies, 2,081 directors, and 5,257 actors. The target nodes are movies, which are categorized into three different classes.

  • DBLP: This is also an academic network, containing four different types of nodes: 4,057 authors, 14,328 papers, 20 conferences, and 7,723 terms. The target nodes are authors, which are categorized into four different classes.

5.2 Baselines

We compare the proposed SSGCL-LG with three categories of baselines: methods based on homogeneous graphs (GCN [11], GAT [13]), methods based on metapaths (HAN [20], RoHe [26], MAGNN [21], HPN [27]), and relation-aware methods (HGSL [24], HGT [23], ie-HGCN [25]).

  • GCN(2017) [11]: A semi-supervised graph convolutional network primarily designed for homogeneous graphs. In this paper, GCN is applied to each metapath-based homogeneous graph of the heterogeneous graph, and the best performance is reported.

  • GAT(2018) [13]: It employs a multi-head attention mechanism to assign weights to each neighboring node, mainly targeting homogeneous graphs. In this paper, GAT is applied to each metapath-based homogeneous graph of the heterogeneous graph, and the best performance is reported.

  • HAN(2019) [20]: This model generates node embedding by performing hierarchical aggregation of neighborhood features based on meta-paths, learning the importance from both the node level and the semantic level.

  • MAGNN(2020) [21]: This model generates node embedding by applying node content transformation, intra-meta-path aggregation, and inter-meta-path aggregation.

  • HGT(2020) [23]: It introduces an attention mechanism related to vertex and edge types.

  • ie-HGCN(2023) [25]: ie-HGCN is a relation extraction model based on graph neural networks that uses a combination of various relation representation methods, effectively capturing dependencies and contextual information between entities.

  • HGSL(2021) [24]: It generates a heterogeneous graph structure suitable for downstream tasks by mining feature similarity, the interaction between features and structure, and the high-order semantic structure in heterogeneous graphs, and jointly learns GNN parameters.

  • RoHe(2022) [26]: RoHe employs an attention purifier that can prune malicious neighbors based on topology and features, thus eliminating the negative influence of malicious neighbors in the soft attention mechanism.

  • HPN(2022) [27]: HPN is a graph neural network model for graph classification that enhances model performance through hierarchical graph pooling and structure learning, effectively handling graph structures at different levels.

  • SSGCL-LG(ours): It integrates label information into the learning process of graph neural networks by constructing a labeled graph and building positive samples related to labels.

5.3 Metrics

In this study, we employed multiple evaluation metrics to assess the performance of the models. These metrics cover different aspects of model performance, including classification accuracy, clustering consistency, and class distribution.

  • Micro-F1: Micro-F1 is one of the commonly used evaluation metrics in multi-class classification tasks. It combines precision and recall and is suitable for datasets with imbalanced class distributions. The formula for Micro-F1 is as follows:

    $$ \text {Micro-F1} = \frac{2 \times \text {Micro-Precision} \times \text {Micro-Recall}}{\text {Micro-Precision} + \text {Micro-Recall}} $$

    Where, Micro-Precision represents micro-precision, defined as the ratio of correct predictions for all classes to all predicted instances. Micro-Recall represents micro-recall, defined as the ratio of correct predictions for all classes to all true labels.

  • Macro-F1: Macro-F1 is another commonly used evaluation metric in multi-class classification tasks, which computes the average F1 score for each class. The formula for Macro-F1 is as follows:

    $$ \text {Macro-F1} = \frac{1}{N} \sum _{i=1}^{N} \frac{2 \times Precision_i \times Recall_i}{Precision_i + Recall_i} $$

    Where, \(N\) denotes the number of classes, \(Precision_i\) and \(Recall_i\) represent precision and recall, respectively, for class \(i\).

  • NMI (Normalized Mutual Information): NMI is a commonly used evaluation metric in clustering tasks, measuring the consistency between clustering results and true labels. The formula for NMI is as follows:

    $$ NMI = \frac{I(X; Y)}{\sqrt{H(X) \times H(Y)}} $$

    Where, \(I(X; Y)\) denotes mutual information, measuring the correlation between two random variables \(X\) and \(Y\); \(H(X)\) and \(H(Y)\) denote the entropy of random variables \(X\) and \(Y\), respectively.

  • ARI (Adjusted Rand Index): ARI is another commonly used evaluation metric in clustering tasks, evaluating clustering effectiveness by comparing the consistency between clustering results and true labels with the consistency between random clustering results and true labels. The formula for ARI is as follows:

    $$ ARI = \frac{\sum _{ij} \left( {\begin{array}{c}n_{ij}\\ 2\end{array}}\right) - [\sum _i \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) \sum _j \left( {\begin{array}{c}b_j\\ 2\end{array}}\right) ] / \left( {\begin{array}{c}n\\ 2\end{array}}\right) }{\frac{1}{2} [\sum _i \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) + \sum _j \left( {\begin{array}{c}b_j\\ 2\end{array}}\right) ] - [\sum _i \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) \sum _j \left( {\begin{array}{c}b_j\\ 2\end{array}}\right) ] / \left( {\begin{array}{c}n\\ 2\end{array}}\right) } $$

    Where, \(n_{ij}\) represents the number of samples assigned to cluster \(i\) that belong to true class \(j\); \(a_i\) represents the number of samples in cluster \(i\) of the clustering results; \(b_j\) represents the number of samples with true label \(j\); and \(n\) represents the total number of samples.

These evaluation metrics comprehensively consider the model’s performance in classification and clustering tasks, providing important references for the objective assessment of research results.
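For reference, the metrics above can be computed with scikit-learn as in the sketch below; the prediction and clustering arrays are placeholders, not results from the paper.

```python
# Minimal sketch of computing the reported metrics with scikit-learn; y_true,
# y_pred, and cluster_ids are placeholder arrays used only for illustration.
from sklearn.metrics import f1_score, normalized_mutual_info_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]          # hypothetical classifier output
cluster_ids = [1, 1, 0, 0, 2, 2]     # hypothetical K-means assignment

print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("NMI:", normalized_mutual_info_score(y_true, cluster_ids))
print("ARI:", adjusted_rand_score(y_true, cluster_ids))
```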

5.4 Experimental setting

To ensure fairness, we use the same training, validation, and test sets for all methods in this study. Moreover, we use the same embedding dimension for all compared methods: the hidden layer dimension is set to 64. The attention mechanism is extended to multi-head attention, and the number of attention heads K is set to 8, as this was found experimentally to produce more stable results.

Node classification experiments, node clustering experiments, ablation experiments, attention analysis experiments, and parameter analysis experiments all utilized the ACM, IMDB, and DBLP datasets. Label importance analysis used the IMDB dataset.

5.5 Node classification

In this section, we first evaluate the node classification results of SSGCL-LG in a semi-supervised setting. Specifically, we feed the node representations into a Support Vector Machine (SVM) for classification, dividing the data into training ratios from 20% to 80%, and using Micro-F1 and Macro-F1 as evaluation metrics. We conduct five repeated experiments and report the average results. The best results are highlighted in bold. The results are shown in Table 2.
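The evaluation protocol described above can be sketched as follows; the embeddings, labels, and the use of a linear SVM are illustrative assumptions rather than the exact experimental setup.

```python
# Sketch of the semi-supervised classification protocol: split learned node
# embeddings into train/test sets at a given ratio and evaluate a linear SVM.
# The embeddings and labels here are random placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 64))            # node representations from the model
y = rng.integers(0, 3, size=400)          # 3 classes, as in ACM/IMDB

for ratio in (0.2, 0.4, 0.6, 0.8):        # training ratios from 20% to 80%
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, train_size=ratio, random_state=0)
    pred = LinearSVC().fit(Z_tr, y_tr).predict(Z_te)
    print(ratio,
          f1_score(y_te, pred, average="micro"),
          f1_score(y_te, pred, average="macro"))
```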

Table 2 Experiment results (%) for the node classification task
Table 3 Experiment results (%) for the node clustering task

Models based on heterogeneous graphs typically outperform models based on homogeneous graphs (GCN, GAT). It is evident that directly applying homogeneous graph models to heterogeneous graphs is suboptimal, as heterogeneous graphs contain a greater variety of node and edge types and more complex information, necessitating research into more suitable heterogeneous graph models.

Compared to metapath-based heterogeneous graph models (HAN, RoHe, MAGNN, HPN), on the ACM dataset, Macro-F1 and Micro-F1 have increased by 1.5% and 1.4% respectively compared to HAN, by 1.4% and 1% respectively compared to RoHe, by 0.8% and 1% respectively compared to MAGNN, and by 1% and 1.7% respectively compared to HPN. On the IMDB dataset, Macro-F1 and Micro-F1 have increased by 3.4% and 3.0% respectively compared to HAN, by 3.0% and 2.2% respectively compared to RoHe, by 2.2% and 2.2% respectively compared to MAGNN, and by 1.7% and 1.7% respectively compared to HPN. On the DBLP dataset, Macro-F1 and Micro-F1 have increased by 2.2% and 2.8% respectively compared to HAN, by 2.8% and 2.2% respectively compared to RoHe, by 1.4% and 1.4% respectively compared to MAGNN, and by 2.2% and 2.2% respectively compared to HPN. The reason for these improvements is that HAN, RoHe, MAGNN and HPN are models built on homogeneous graphs derived from meta-paths, considering only the information of the target nodes and ignoring the information of other types of nodes. SSGCL-LG, on the other hand, decomposes the heterogeneous graph into multiple meta-path-based subgraphs of both homogeneous and heterogeneous types, which allows it to better capture the information of both homogeneous and heterogeneous neighbors in the heterogeneous graph.

Compared to relation-aware heterogeneous graph models (HGSL, HGT, ie-HGCN), on the ACM dataset, Macro-F1 and Micro-F1 have increased by 0.7% and 1.7% respectively compared to HGSL, by 1.7% and 1.7% respectively compared to HGT, and by 0.7% and 0.7% respectively compared to ie-HGCN. On the IMDB dataset, Macro-F1 and Micro-F1 have increased by 3.1% and 4.2% respectively compared to HGSL, by 4.2% and 4.2% respectively compared to HGT, and by 0.7% and 0.7% respectively compared to ie-HGCN. On the DBLP dataset, Macro-F1 and Micro-F1 have increased by 1.3% and 6% respectively compared to HGSL, by 6% and 6% respectively compared to HGT, and by 1.2% and 1.2% respectively compared to ie-HGCN. The reason for these improvements is that although HGSL, HGT, and ie-HGCN consider the information of heterogeneous nodes, they only use labels for calculating the loss, so the learning process cannot access label information. SSGCL-LG, on the other hand, encodes labels into the learning process of the graph neural network, fully considering the information carried by the labels.

5.6 Node clustering

In this section, the K-means method is employed to cluster the embedding vectors obtained from the model. The parameter K of K-means is set to the number of label categories in the dataset, which corresponds to the actual number of node categories. The clustering results are evaluated using NMI (Normalized Mutual Information) and ARI (Adjusted Rand Index). NMI measures the agreement between the clustering results and the true labels, while ARI reflects the degree of overlap between the two partitions. The closer the NMI or ARI value is to 1, the better the clustering result. The experimental results are shown in Table 3, with the optimal results highlighted in bold.

From Table 3, it is evident that the SSGCL-LG model generally outperforms other models. Analysis indicates that our model fully takes into account node label information and, through contrastive learning, regards nodes with the same label as positive samples. This approach allows nodes of the same category to cluster more effectively, hence the better clustering performance.

Fig. 5 Visualization

Table 4 Description of the ablation experiment symbols

To perform the visualization task and provide a more intuitive comparison, we learn the node embeddings of the aforementioned methods (i.e., MAGNN, ie-HGCN, HGSL, HPN, RoHe, SSGCL-LG) on the DBLP dataset and use T-SNE to project the embeddings into two-dimensional space, coloring the nodes according to their classes.

As shown in Fig. 5, SSGCL-LG demonstrates clearer boundaries and denser clustering structures, which helps to distinguish different categories in the visualization. This indicates that labels carry rich information. By integrating label information into the learning of node representations through the label graph, we can effectively distinguish nodes of different categories, significantly improving the model's performance and effectively differentiating nodes belonging to different research fields.

5.7 Ablation experiments

To verify the effectiveness of different components of SSGCL-LG, we designed three variants of SSGCL-LG and compared their classification performance with SSGCL-LG. The notation is shown in Table 4, and the comparison results are shown in Fig. 6.

Fig. 6 Ablation experiments

Fig. 7 Label importance analysis

From Fig. 6, it can be seen that the performance of the complete SSGCL-LG model is superior to that of its variants. The SSGCL-LG model integrates label information into the learning process of the neural network by constructing a label graph. Indeed, labels contain valuable information that is beneficial for node classification. Additionally, during the contrastive learning process, SSGCL-LG treats nodes with the same label as positive samples for each other, aiming to utilize the supervisory information present in the existing data for network training. By leveraging the supervisory signals contained in both labeled and unlabeled data, the SSGCL-LG model can learn node representation more effectively. This learning approach ensures that nodes from the same class are more closely clustered together in the representation space, making them more distinguishable from nodes of different classes.

5.8 Label importance analysis

To verify the importance of labels in the model, we re-divided the training set and selected subsets of different proportions for experiments on the IMDB dataset. The experimental results, as shown in Fig. 7(a), indicate that as the number of training samples increases, the performance of the model also gradually improves. This demonstrates that the quantity of labels has a significant impact on the performance of the model. Notably, among all the models compared, our model exhibits the most outstanding performance.

To further verify the effectiveness of labels, we conduct an ablation experiment by removing the labels from the model, including the label nodes in the label-augmented graph and the label-based positive samples. The results, as shown in Fig. 7(b), indicate that when the number of training samples is very small, the performance of the ablated model is superior to that of the complete model. Our analysis finds that when the number of training samples is very small, the label graph is too sparse, so the model fails to learn effective information from the labels, thereby reducing the model's performance; as the number of training samples increases, the model gradually learns more label information, and hence its performance also gradually improves.

Compared to the ablated model, the performance improvement of our model is more pronounced. This is because, as the number of training samples increases, the number of available labels also increases, allowing the model to learn useful information from the labels more effectively.

5.9 Attention analysis

To verify the effectiveness of the strategy in which the metapath-based heterogeneous subgraphs guide the fusion of the homogeneous subgraphs, as opposed to aggregating the metapath-based homogeneous subgraphs alone, we compare the guided fusion strategy (ours) with a channel attention strategy (ours-channel attention). The comparison results are shown in Fig. 8.

Fig. 8 Attention analysis

Fig. 9 Parameter sensitivity

From Fig. 8, it can be seen that the strategy of using heterogeneous graphs to guide the fusion of homogeneous subgraphs is more effective than using channel attention.

In heterogeneous graphs, heterogeneous subgraphs are composed of nodes with specific meta-paths. Heterogeneous subgraphs can provide richer local structural information because they include interactions between all types of nodes within the meta-path. During the learning process of heterogeneous subgraphs, semantic-level attention is used to fuse representation under different heterogeneous subgraphs, and this semantic-level attention utilizes node features. Therefore, when aggregating heterogeneous subgraphs using attention mechanisms, it is possible to better distinguish the importance of different meta-paths.

In contrast, homogeneous subgraphs only contain nodes of the target type, so when aggregating with attention mechanisms, only the importance of individual nodes is considered, which does not adequately summarize the importance of different meta-paths. Therefore, compared to channel attention mechanisms, the strategy of guiding the fusion of homogeneous subgraphs with heterogeneous subgraphs under meta-paths is more effective.

5.10 Parameter analysis

In this section, we investigate the sensitivity of important parameters. We conduct a parameter analysis on the number of layers for the homogeneous subgraphs and the number of layers for the heterogeneous subgraphs.

As shown in Fig. 9, the performance of node classification generally shows a trend of first increasing and then decreasing with the increase in the number of neural network layers. This is because when nodes aggregate information from their neighbors, the state updates of the nodes typically only consider information from one-hop neighbors. Therefore, the number of network layers reflects how many hops of neighbor information a node can integrate. During the training process, when the network layers are shallow, nodes may not be able to gather sufficient effective information, which can negatively impact classification performance. As the number of network layers increases, nodes can integrate more effective information, thereby improving classification results. However, when the number of layers reaches a certain threshold, the nodes in the entire network may exhibit overly similar features, a phenomenon known as over-smoothing, which can lead to a decline in performance.

6 Conclusion

This paper proposes a semi-supervised heterogeneous graph contrastive learning model guided by label information, aiming to fully utilize label information and enrich the supervisory signal through contrastive learning. To address the first challenge, we construct a label graph, explicitly encoding label information into the learning process of the graph neural network, achieving joint representation learning of labels and nodes. To tackle the second challenge, when constructing positive and negative samples for graph contrastive learning, we introduce a method that jointly selects positive samples using both labels and meta-paths and utilizes contrastive loss to maximize the consistency between homogeneous and heterogeneous views. Extensive experiments conducted on various datasets fully demonstrate the superiority of the algorithm compared to others. Given the broad application prospects of heterogeneous graph neural network models, in the future, we will explore the construction of heterogeneous graph structures for legal judgment documents, as well as legal judgment prediction and legal text recommendation based on heterogeneous graph neural network models.