1 Introduction

Large-scale knowledge graphs (KGs), such as Freebase [3], YAGO [25], and DBpedia [1], provide effective support for many important artificial intelligence tasks such as semantic search [12], recommendation [24, 37], and question answering [16]. Even with the rapid development of large language models, the knowledge graph remains a powerful tool for knowledge storage and processing and retains great research significance. A KG is a multi-graph with entities as nodes and relations as edges. Each edge between two nodes in a KG is represented as a triple (h, r, t) consisting of a head entity, a relation, and a tail entity, which indicates the relation between the two entities, e.g., (Stanford, location, California). Although KGs are effective, most existing KGs are incomplete. This problem has inspired the knowledge graph completion (KGC) task, which aims to evaluate the plausibility of potential triples and thereby enrich the KG.

Many research efforts have been devoted to KGC. A common approach, KG embedding, maps each entity and relation to a low-dimensional vector and evaluates triples with those vectors [31]. Typical models include TransE [4], TransH [33], RotatE [26], and TuckER [2]. Text-based methods [29, 30, 36] use available textual information for KGC. Intuitively, text-based methods should outperform embedding-based methods because they introduce additional information. However, experiments on some datasets show that text-based methods lag behind embedding-based methods.

We hypothesize that the key reason for this performance gap is negative sampling. While negative sampling is important for KGC models, existing text-based methods allow only a limited number of negative samples because of the training cost of language models. For example, KEPLER [32], a text-based method, trains for only 30 epochs with a negative sample size of 1 because of the significant computational burden of RoBERTa. Conversely, embedding-based methods can use many more negative samples: RotatE, an embedding-based method, can be trained for 1000 epochs on the Wikidata5M dataset with a negative sample size of 64. Both embedding-based and text-based methods generate false-negative samples with some probability. However, the large number of negative samples in embedding-based methods reduces the impact of false negatives on model training.

In this paper, inspired by recent advances in positive and unlabeled (PU) learning and prompt learning, we propose a new knowledge graph completion model. PU learning is used to select the highest-confidence negative samples from a candidate set, which ensures the quality of negative samples when only limited negative sampling is possible. Meanwhile, prompt learning is introduced to improve the reasoning ability of the model and to help identify negative examples. As shown in Fig. 1, we use random replacement to generate candidate negative samples from the positive triples. The candidates are converted into prompts and fed into the pre-trained language model (PLM), which infers the negative examples. We then mix these negative examples with the positive examples and use prompt learning to train the knowledge graph completion model. The entire process follows the two-step method in PU learning. To improve training, the focal loss is introduced to adjust for the imbalance between positive and negative samples and for the differing difficulty of distinguishing samples. The trained model is then used to predict the plausibility of triples. This method performs effectively in multiple completion tasks. The contributions of this paper are summarized as follows.

  1. We propose a KGC method based on PLMs. This method uses PU learning for the first time to solve the negative sampling problem of KGs.

  2. The results of multiple benchmark datasets show that the proposed method can produce competitive results in link prediction tasks.

  3. Compared to models with the same accuracy, the proposed model significantly reduces the inference time.

Fig. 1 Schematic diagram of negative example construction and model training

2 Related works

KGC completes missing data in an existing KG by modeling multi-relational data. Conventional methods such as TransE [4] and TransH [33] use the structural information of the KG for inference. These methods regard the triple (h, r, t) as a relation-specific translation from the head entity h to the tail entity t. ComplEx [28] introduces complex-valued embeddings to increase the expressive power of the model. RotatE [26] models each relation as a rotation from the head entity to the tail entity in complex space. By using two vectors to represent each relation and adaptively adjusting the margin parameter in the loss function, scholars have encoded complex relational patterns [6]. Recently, scholars have tried to use additional textual information to improve KGC. DKRL [34] uses convolutional neural networks (CNNs) to encode the text, while KG-BERT [36], StAR [29], BLP [9], and SimKGC [30] all use PLMs to compute entity embeddings. These methods have improved performance but remain inferior to embedding-based methods on some datasets.
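The translation view of a triple can be illustrated with the TransE scoring function, which rates a triple by how close h + r lies to t. The sketch below uses the L2 distance and toy two-dimensional vectors; the embedding values are illustrative assumptions, not learned parameters.

```python
import math
from typing import Sequence


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """TransE plausibility: negative L2 distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Toy 2-dimensional embeddings (illustrative values, not learned).
stanford = [0.1, 0.2]
location = [0.3, 0.1]
california = [0.4, 0.3]
texas = [0.9, -0.5]

true_score = transe_score(stanford, location, california)   # close to 0
corrupt_score = transe_score(stanford, location, texas)     # clearly negative
```

A higher (less negative) score marks a more plausible triple, so the corrupted tail receives the lower score here.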

Approaches to pre-trained language representations can be divided into two categories: feature-based methods and fine-tuning methods. Conventional word embedding methods such as Word2Vec [21] and GloVe [22] take a feature-based approach and learn context-independent word vectors. In contrast, fine-tuning methods such as GPT [23] and BERT [11] first train a PLM on a large corpus of unlabeled text with language modeling objectives and then fine-tune the model for downstream tasks, which has triggered a paradigm shift in natural language processing.

Prompt learning differs from conventional supervised learning, in which a model is trained to take an input x and predict an output y; instead, it relies on language models that model the probability of text directly. A template is used to modify the original input x into a textual prompt with some unfilled slots so that these models can perform prediction tasks. The slots are then filled to produce the final string Pt, which is fed into the PLM to produce the output y. This framework is powerful and attractive for several reasons. It allows the language model to be pre-trained on massive amounts of raw text. By defining a new prompt function, the model can perform few-shot or even zero-shot learning, adapting to new scenarios with little or no labeled data. Prompt learning has achieved excellent results in natural language processing tasks such as text classification [13, 14], relation extraction [7, 15], named entity recognition [8], and question answering [18].

PU learning addresses scenarios in which only positive examples and unlabeled data are available. It has gained increasing attention because it arises naturally in applications such as medical diagnosis and KGC. As in general binary classification, the goal of PU learning is to train a classifier that separates examples according to a target attribute. Most methods fall into three categories: two-step methods [17], biased learning methods [20], and class prior incorporation methods. Two-step methods, as the name suggests, involve two distinct steps: in the first, reliable negative examples are identified, and in the second, the classifier is trained on the labeled positives and reliable negatives. Biased learning methods treat unlabeled data as negative samples with class label noise. Class prior incorporation methods include postprocessing, preprocessing, and method modification; their main idea is to introduce the class prior to modify traditional learning methods.

3 Methods

3.1 Symbols

This study focuses on a knowledge graph (KG), a directed graph whose vertices are the set of entities E and whose edges are each represented by a triple (h, r, t), where h, r, and t denote the head entity, relation, and tail entity, respectively. The task of link prediction in KGs involves predicting the missing triples when the KG is incomplete. The widely adopted entity ranking evaluation protocol requires ranking all entities given h and r for tail entity prediction (h, r, ?), and ranking all entities given r and t for head entity prediction (?, r, t). In this work, an inverse triple (t, \({r}^{-1}\), h) is added for each triple (h, r, t), with \({r}^{-1}\) being the inverse relation of r. As a result, only tail entity prediction needs to be handled in this paper.
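The inverse-triple augmentation can be sketched in a few lines. The function name and the `^-1` suffix used to mark inverse relations are illustrative choices, not names taken from the paper.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def add_inverse_triples(triples: List[Triple], suffix: str = "^-1") -> List[Triple]:
    """Return the original triples plus one inverse triple (t, r^-1, h) for each."""
    augmented = list(triples)
    for h, r, t in triples:
        augmented.append((t, r + suffix, h))
    return augmented


triples = [("Stanford", "location", "California")]
augmented = add_inverse_triples(triples)
# The head query (?, location, California) can now be posed as the
# tail query (California, location^-1, ?).
```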

3.2 Model architecture

We propose a new PLM-based KGC model (PUPKGC), which can use both the implicit knowledge in the PLM and additional text descriptions, and can infer new knowledge from structural information in KGs. As shown in Fig. 2, a triple and its entity descriptions are converted into a head entity prompt \({{\text{PT}}}_{{\text{head}}}\), a tail entity prompt \({{\text{PT}}}_{{\text{tail}}}\), a judgment sentence PJ, and an auxiliary prompt PA, which are input into the PLM. Formally, the final input text T of the PLM can be defined as T=\({{\text{PT}}}_{{\text{head}}}\) \({{\text{PT}}}_{{\text{tail}}}\) PJ PA [CLS]. The [CLS] output of the language model is used to predict the label of a given triple. In addition, PU learning is used to obtain negative triples. The triple classification model receives these triples along with positive samples, and the focal loss is used for training.

Fig. 2 Illustration of our PUPKGC model for triple classification. Entity description text and human prior knowledge are introduced to help triple classification

The following sections go into great detail about how to obtain negative samples (Sect. 3.3) and design strategies for prompts (Sect. 3.4). In addition, Sect. 3.5 explains how the proposed model is trained.

3.3 Negative samples

For KGC, training data consist of positive triples only. Given a positive triple (h, r, t), negative sampling requires sampling one or more negative triples to train the discriminative model. To obtain negative samples, existing methods typically either replace part of a correct triple at random or label negative samples manually. The former mixes true triples into the negative samples (false negatives) and thus harms inference, while the latter is very time-consuming.

We use structural information from the knowledge graph to generate candidate negative examples. The relations of entities and neighboring entities are used for clustering. Two entities are assigned to the same class if they have the same relation to the same or neighboring entities. The method for determining head entities is similar. For triples in the training set, we randomly replace their head or tail entities with entities that do not belong to the same class, generating multiple candidate negative triples.
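A minimal sketch of candidate generation follows. It makes a simplifying assumption that is not the paper's exact clustering: two entities are placed in the same class when they occur as tails of the same relation. Only tail replacement is shown; with inverse triples added, head replacement reduces to the same operation. All function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def tail_classes(triples: List[Triple]) -> Dict[str, Set[str]]:
    """Group tail entities by relation (simplified stand-in for the clustering)."""
    classes: Dict[str, Set[str]] = defaultdict(set)
    for _, r, t in triples:
        classes[r].add(t)
    return classes


def candidate_negatives(triple: Triple, triples: List[Triple], entities: List[str],
                        n: int, rng: random.Random) -> List[Triple]:
    """Replace the tail with up to n entities from outside the tail's class."""
    h, r, _ = triple
    same_class = tail_classes(triples)[r]
    known = set(triples)
    pool = sorted(e for e in entities
                  if e not in same_class and (h, r, e) not in known)
    return [(h, r, e) for e in rng.sample(pool, min(n, len(pool)))]


kg = [
    ("Stanford", "location", "California"),
    ("MIT", "location", "Massachusetts"),
    ("Stanford", "type", "University"),
]
entities = sorted({x for h, _, t in kg for x in (h, t)})
candidates = candidate_negatives(kg[0], kg, entities, n=2, rng=random.Random(0))
```

These candidates are only a pool; which of them are kept as negatives is decided by the classifier in the PU learning step.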

We use the two-step approach in PU learning to generate negative examples and train the model. Specifically, we design a classifier to identify high-confidence negative samples, and then mix the selected negative samples with the original positive samples to train the model. Because the candidate negative triples carry relevant text information, we use an existing language model as the classifier. To prevent negative samples recognized by the classifier from being judged as positive by the model being trained, we use the same language model for both the classifier and the model to be trained. Because prompt learning is also used later to assist training, the classifier likewise uses prompt learning to assist classification; see Sect. 3.4 for details. From a set of candidate negative triples, the K triples that the classifier scores as least plausible are chosen as the negative examples.
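The first step of the two-step procedure, choosing reliable negatives, can be sketched as a ranking over classifier scores. Here `score_fn` is a hypothetical stand-in for the prompt-based PLM classifier, and the toy scores are invented for illustration; the reading of "lowest accuracy" as "lowest plausibility score" is also an assumption.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def select_reliable_negatives(candidates: List[Triple],
                              score_fn: Callable[[Triple], float],
                              k: int) -> List[Triple]:
    """Keep the k candidates that the classifier rates as least plausible."""
    return sorted(candidates, key=score_fn)[:k]


# Toy plausibility scores standing in for the PLM classifier output.
toy_scores = {
    ("Stanford", "location", "MIT"): 0.05,
    ("Stanford", "location", "Oregon"): 0.60,   # risky: might be a false negative
    ("Stanford", "location", "University"): 0.10,
}
negatives = select_reliable_negatives(list(toy_scores), toy_scores.get, k=2)
```

The relatively plausible candidate is discarded, which is how the selection step limits the number of false negatives that reach training.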

3.4 Prompts

To take advantage of the implicit knowledge within the PLM, each triple is transformed into prompt sentences. For each relation, a hard template is manually designed to represent the semantics of the associated triples. For the triple “[X], position, [Y]”, the [X] and [Y] are first replaced with the exact name of the head and tail entities to get a judgment prompt \({{\text{PJ}}}_{0}\). In this case, \({{\text{PJ}}}_{0}\) is “Stanford position California”. A soft prompt is added to the relation to finally form the more expressive judgment sentence PJ.

To make inference more accurate, text descriptions of the head and tail entities are included in the judgment sentence. Entity definitions or attribute sentences associated with relations are typically used for the text description. To ensure inference accuracy and prevent interference from redundant information, the text description is limited to a single sentence that is not overly long. To preserve the accuracy of the input text descriptions, we use hard prompts instead of soft prompts to form \({{\text{PT}}}_{{\text{head}}}\) and \({{\text{PT}}}_{{\text{tail}}}\). We also add the prompt PA about the task to create a more expressive judgment sentence, which results in the final prompt sentence.
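The assembly of the input text T = PT_head PT_tail PJ PA [CLS] can be sketched as plain string construction. Everything specific here is an assumption for illustration: the `name: description` form of the entity prompts, the `[SOFTi]` placeholder tokens standing in for learnable soft-prompt embeddings, their position before the relation, and the wording of the task prompt.

```python
from typing import Tuple


def build_plm_input(head: str, relation: str, tail: str,
                    head_desc: str, tail_desc: str,
                    task_prompt: str, n_soft: int = 2) -> Tuple[str, str]:
    """Return (PJ_0, T), where T = PT_head PT_tail PJ PA [CLS]."""
    pt_head = f"{head}: {head_desc}"          # hard prompt for the head entity
    pt_tail = f"{tail}: {tail_desc}"          # hard prompt for the tail entity
    pj0 = f"{head} {relation} {tail}"         # hard template with names filled in
    soft = " ".join(f"[SOFT{i}]" for i in range(n_soft))
    pj = f"{head} {soft} {relation} {tail}"   # soft prompt attached to the relation
    text = " ".join([pt_head, pt_tail, pj, task_prompt, "[CLS]"])
    return pj0, text


pj0, text = build_plm_input(
    "Stanford", "position", "California",
    "Stanford is a private research university.",
    "California is a state in the western United States.",
    "Is this statement true?",
)
```

In a real model the soft-prompt positions would map to trainable vectors rather than literal tokens; the placeholders only mark where they sit in the sequence.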

3.5 Training

In this paper, the model is trained on a set of triples as a triple classification task. The negative sample generation method is presented in Sect. 3.3. After comparison, a positive-to-negative sample ratio of 1:8 was found to ensure both low training time and good training results. Given a triple τ = (h, r, t), the classification score of the triple is defined as:

$$ s_{{\uptau }} = {\text{Softmax}}\left( {Wc} \right) $$
(1)

where \(c\in {\mathbb{R}}^{d}\) is the output vector of the input token [CLS], and \(W\in {\mathbb{R}}^{2\times d}\) is the weight matrix of a linear layer. Since the numbers of positive and negative samples are unbalanced, the focal loss is used as the optimization objective.

$$ FL\left( {s_{{\uptau }} } \right) = - \alpha_{{\uptau }} \left( {1 - s_{{\uptau }} } \right)^{\gamma } {\text{log}}\left( {s_{{\uptau }} } \right) $$
(2)

Parameter \({{\upalpha }}_{\tau }\) compensates for the imbalance between the numbers of positive and negative samples. We set it consistently with the distribution of positive and negative samples in the training dataset. Parameter γ controls how strongly easily classified samples are down-weighted; it is set to 2 in this paper to reduce the influence of easily distinguishable samples.
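Equations (1) and (2) can be checked numerically with a small sketch. The logits stand in for the product Wc, and the per-class weights `alpha` are illustrative values, not the setting used in the experiments.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax, as in Eq. (1)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: Sequence[float], label: int,
               alpha: Sequence[float], gamma: float = 2.0) -> float:
    """Eq. (2): -alpha * (1 - s)^gamma * log(s), with s the true-class probability."""
    s = softmax(logits)[label]
    return -alpha[label] * (1.0 - s) ** gamma * math.log(s)


alpha = [1.0, 1.0]                                # illustrative class weights
easy = focal_loss([4.0, 0.0], label=0, alpha=alpha)   # confident and correct
hard = focal_loss([0.0, 4.0], label=0, alpha=alpha)   # confident and wrong
plain_ce = focal_loss([0.0, 0.0], label=0, alpha=alpha, gamma=0.0)
```

With γ = 0 and unit weights the expression reduces to ordinary cross-entropy, while γ = 2 shrinks the loss of the easy sample far more than that of the hard one.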

4 Experiments

4.1 Experimental settings

4.1.1 Datasets

The evaluation in this study employs the WN18RR and FB15k-237 datasets, which are presented in Table 1. The WN18 and FB15k datasets were initially proposed by Bordes et al. [4], but later works ([10]; Toutanova et al., 2015) revealed that these datasets suffer from test set leakage. To address this issue, the WN18RR and FB15k-237 datasets were created by removing reverse relations. The WN18RR dataset comprises 41 k entities and 11 relations from WordNet, while the FB15k-237 dataset includes 15 k entities and 237 relations from Freebase.

Table 1 Statistics of the datasets used in this paper

For text description, the WN18RR and FB15k-237 datasets provided by the KG-BERT [36] are used in this paper. For the WN18RR dataset, the first sentence is chosen as the text description. For FB15K-237, the first sentence or the sentence related to triples is chosen as the text description.

4.2 Evaluation index

The proposed PUPKGC model was evaluated on a link prediction task, following previous studies. Specifically, for each triple (h, r, t) in the test set, the model performed tail entity prediction, which involved scoring every candidate entity as t given h and r and then ranking the candidates. Head entity prediction was performed in a similar manner. To assess the model's performance, we used four automatic evaluation metrics: mean reciprocal rank (MRR) and Hits@k (H@k) for k ∈ {1,3,10}. MRR is the mean of the reciprocal ranks of the correct entities over all test triples, while H@k is the proportion of correct entities appearing in the top k positions of the ranked list. MRR and H@k were reported under the filtered setting, which ignores the scores of all other known true triples in the training, validation, and test sets.
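The filtered metrics can be made concrete with a short sketch. The scores and entity names are invented toy values; `known_tails` stands for the other entities already known to be correct for the same (h, r) query.

```python
from typing import Dict, List, Set


def filtered_rank(scores: Dict[str, float], target: str, known_tails: Set[str]) -> int:
    """Rank of the target after ignoring other known true tails."""
    target_score = scores[target]
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in known_tails:
            continue                      # filtered setting: skip known true triples
        if score > target_score:
            rank += 1
    return rank


def mrr(ranks: List[int]) -> float:
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    return sum(r <= k for r in ranks) / len(ranks)


scores = {"California": 0.7, "Palo Alto": 0.9, "Oregon": 0.8, "Texas": 0.1}
raw = filtered_rank(scores, "California", set())               # nothing filtered
filtered = filtered_rank(scores, "California", {"Palo Alto"})  # known tail skipped

ranks = [filtered, 1, 4]                  # ranks of three hypothetical test triples
summary = (mrr(ranks), hits_at_k(ranks, 1), hits_at_k(ranks, 3), hits_at_k(ranks, 10))
```

Filtering moves the target from rank 3 to rank 2 in this toy query, because the higher-scoring entity is itself a correct answer and should not count against the model.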

4.3 Hyper-parameter

The encoder is initialized with T5-base. Using a more suitable PLM may further improve performance. Most hyper-parameters are shared across all datasets to avoid dataset-specific tuning. The AdamW optimizer is used with linear learning rate decay. After comparison, α = 0.05 is set empirically. In addition, early stopping is used to balance training quality and training time.

4.4 Main results

For TransE [4] and DKRL [32], the figures reported in prior work are used in this paper. The results of RotatE [26] are from the official GraphVite benchmark. For SimKGC [30], the model with in-batch, pre-batch, and self-negatives is used.

Table 2 shows the performance of our model and the baseline models on the two datasets. TransE [4], DistMult [35], RotatE [26], and TuckER [2] are typical embedding-based methods. Except for the early TransE and DistMult, these methods achieve good performance on the FB15K-237 dataset but only mediocre performance on the WN18RR dataset. Among text-based methods, KG-BERT [36], MTL-KGC [19], and StAR [29] perform poorly and fall short of traditional embedding-based methods. It should be noted that the SimKGC model introduces contrastive learning and designs multiple categories of negative samples, achieving significant improvements. The performance of SimKGC therefore supports our view that negative sampling restricts the performance of text-based methods. However, SimKGC still performs negative sampling through substitution, so the problem of false-negative examples remains. Our model is highly competitive among text-based methods, achieving the best results on six metrics. Compared to embedding-based methods, our model is better across the board on the WN18RR dataset and slightly worse on the FB15K-237 dataset.

Table 2 Main results for WN18RR and FB15k-237 datasets

FB15k-237 differs from other datasets in that it has a higher graph density (the average degree per entity is about 37) and fewer entities (about 15 k). To perform well on it, models must learn generalizable inference rules, not just model text correlation. In this case, an embedding-based method may be advantageous, so the proposed method could be combined with an embedding-based method. Since this is not the primary focus of this paper, it is left as future work. In addition, scholars point out that many links in the FB15k-237 dataset are unpredictable based on available information [5]. Besides, the entity text descriptions of FB15k-237 are longer, and it is difficult to find the text description that suits the current triple through conventional matching. These three reasons help explain the lower performance of our model on this dataset.

The forward pass of the model is the most expensive part of inference. To judge a triple, it is usually necessary to replace its head and tail entities with other entities for comparison. In this study, we find that replacing them with all entities is time-consuming. Therefore, we suggest using the structural information in the KG to limit the scope of inference and comparison. Entities are clustered based on their associated relations. For a given triple, we then only need to replace its head and tail entities with entities in the same class for comparison. This method can significantly speed up inference while also improving its accuracy. It should be noted that, because of the sparsity of the WN18RR dataset, there are 294 triples whose entities to be predicted do not belong to the candidate classes, and as a result, their link prediction fails.
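Class-restricted ranking, including the failure case mentioned above, can be sketched as follows. `score_fn` is a hypothetical stand-in for one forward pass of the model per candidate entity, and the toy class and scores are invented.

```python
from typing import Callable, Iterable, Optional


def rank_in_class(score_fn: Callable[[str], float], target: str,
                  class_members: Iterable[str]) -> Optional[int]:
    """Rank the target among its class only; None if the target is outside the class."""
    members = set(class_members)
    if target not in members:
        return None                       # link prediction fails for this triple
    target_score = score_fn(target)
    return 1 + sum(1 for e in members
                   if e != target and score_fn(e) > target_score)


toy_scores = {"California": 0.7, "Oregon": 0.8, "Texas": 0.1, "University": 0.95}
location_class = ["California", "Oregon", "Texas"]   # toy class of candidate tails

rank = rank_in_class(toy_scores.__getitem__, "California", location_class)
missing = rank_in_class(toy_scores.__getitem__, "University", location_class)
```

Only the class members are scored, which is the source of the speed-up; the price is that a correct entity outside the class can never be recovered.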

5 Analyses

In this paper, a series of analyses are performed to further understand the proposed model and the KGC task.

5.1 Rapid training method

Training the model on the FB15K-237 dataset with four 3070 GPUs takes about 38 h. There are three main reasons for this fast training. First, the ratio of positive to negative samples is 1:8, which is relatively low. Second, for the description text, we try to select the entity description associated with the triple, which effectively reduces the cost of model training and prevents interference from irrelevant text. Third, early stopping is used to avoid unnecessary training while retaining the best training results. Experiments show that, for most relations, training completes within 8 to 13 epochs. For relations that occur frequently in the KGs, training usually takes 8 epochs.

5.2 Function of negative sampling

To verify the role of PU learning in the model, a comparative experiment is conducted on the FB15K-237 dataset. The results for the complete model are consistent with those in Sect. 4.2. For comparison, the negative sampling part of the complete model is removed, while the other modules remain unchanged. In this ablated model, the head or tail entities in each triple are randomly replaced with other entities in the training set, while ensuring that the newly generated triples do not exist in the training or test sets. These newly generated triples are used as negative samples. The ratio of positive to negative samples is also set to 1:8. The experimental results are shown in Table 3.

Table 3 Main results for FB15k-237 dataset

The results demonstrate that the introduction of PU learning significantly improves the overall performance of the model. For further analysis, some of the relations are extracted for illustration, as shown in Table 4.

Table 4 Results on the FB15k-237 dataset for a small number of relations. The relations corresponding to the ordinal numbers are listed

PU learning improves the inference results on the relations in the KGs, and the improvement is most evident for relations with average or higher frequency. This is considered an outcome of negative sampling. For triples whose relations appear frequently in the KGs, random substitution is more likely to produce false-negative samples mixed in with the true negatives, and their proportion is higher, which significantly degrades the quality of the training dataset. Therefore, introducing PU learning to assist negative sampling significantly improves the effectiveness of training with a limited number of negative samples. This supports the motivation of this paper.

5.3 Effectiveness of auxiliary inference

Prior embedding-based methods have also achieved some success in inference. This fully shows the role of graph structural information in KGC. Therefore, we use the structural information of the KG for clustering. For KGs with rich relational information, such as the FB15K-237 dataset, clustering information is used for inference, as detailed in Sect. 4.2.

In order to explore the role of auxiliary inference modules, we designed a comparative experiment. Specifically, we removed the auxiliary reasoning part from the original model and compared it with the original model. For previous methods, the number of times required for comparing and judging a single triplet is determined by the following formula.

$$ N = \frac{{N_{{{\text{entity}}}} *N_{{{\text{test}}}} - N_{{{\text{all}}}} }}{{N_{{{\text{test}}}} }} $$
(3)

\({N}_{entity}\) is the number of entities in the knowledge graph dataset. \({N}_{test}\) is the number of triples in the test set. \({N}_{all}\) is the number of all triples in the knowledge graph dataset. \(N\) is the average number of comparisons required to judge a single triple. Substituting the statistics from Table 1, we find that for the FB15K-237 dataset, a single triple needs about 14,526 comparisons on average. After introducing the auxiliary reasoning module, testing showed that this number is reduced to 524.

Meanwhile, it should be noted that another 15 test triples cannot be linked to the correct entity. This number is negligible compared to the 20,466 triples in the test set. Therefore, our auxiliary inference module saves about 96% of the inference time on the FB15K-237 dataset, which demonstrates the effectiveness of the module.
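Equation (3) and the reported saving can be reproduced with a short calculation. Since Table 1 is not reproduced here, the entity and triple counts below are the commonly reported FB15k-237 statistics and should be read as assumptions; the saving also assumes that inference time is proportional to the number of comparisons.

```python
# Commonly reported FB15k-237 statistics (assumed; Table 1 is not shown here).
n_entity = 14_541
n_train, n_valid, n_test = 272_115, 17_535, 20_466
n_all = n_train + n_valid + n_test

# Eq. (3): average number of comparisons per test triple without the module.
n_before = (n_entity * n_test - n_all) / n_test

# Reported average number of comparisons with the auxiliary reasoning module.
n_after = 524

saving = 1.0 - n_after / n_before         # fraction of comparisons avoided
failure_rate = 15 / n_test                # share of test triples that cannot be linked
```

Under these assumptions the baseline count rounds to 14,526 and the saving comes out slightly above 96%, in line with the figures quoted in the text, while the 15 unlinkable triples are well under 0.1% of the test set.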

We extracted some relations for further comparative analysis, and the results are shown in Fig. 3. The x-coordinate is the inference time in seconds, and the y-coordinate is the MRR value. Points with the same color show the experimental results for the same relation in the two models. The results on the left are from the model that introduces structural information for auxiliary reasoning, while those on the right are from the original model. The figure shows that the introduction of structural information greatly improves inference speed. The inference accuracy of the model also improves, and the effect is more pronounced for relations with lower frequencies in the KG. This is mainly the result of eliminating the influence of false-positive entities during model training. These results demonstrate the rich informational content of the KG structure and the efficiency of the module.

Fig. 3 The influence of auxiliary reasoning on partial relations

6 Conclusion and future works

With the rapid development of PLMs, some PLM-based KGC models have been proposed. However, there is still a performance gap between these models and the SOTA KGE models. In this work, we identify negative sample collection as the key factor behind this gap. To address this issue, we introduce PU learning to help generate negative samples, propose a new PLM-based KGC model, and verify the influence of negative sample collection on inference. Experimental results show that the proposed model produces better results than previous text-based methods. In future work, we plan to improve the negative sample collection module and explore further uses of structural information.