1 Introduction

Many large-scale Knowledge Graphs (KGs) have been built for real-world applications such as recommendation systems [1] and question answering [2]. High-quality KGs can provide powerful support for many AI tasks [3]. For example, ERNIE 3.0 [4], which has recently excelled on NLP tasks, incorporates a well-designed knowledge graph with over 50 million entities. However, KGs usually suffer from incompleteness and miss important facts, jeopardizing their usefulness in downstream tasks. Therefore, it is critical to develop automatic methods for knowledge graph completion.

A KG is usually composed of a large number of fact triples (subject, predicate, object), denoted (s, p, o). The subject and object are entities, and the predicate is the relationship that connects them. There is a special class of triples whose predicate is rdf:type; these use the object to indicate the entity type of the subject. We refer to these triples as type knowledge and to the remaining triples as triple knowledge. Figure 1 shows an example of a knowledge graph centered on the Indian film actor Raza Murad. Knowledge Graph Entity Typing (KGET) aims at inferring missing entity types by utilizing the existing type knowledge and triple knowledge of the knowledge graph.

Fig. 1 An example of a knowledge graph

Previous KGE-based models have achieved considerable success on KGET. They embed entity types and complete missing types through link prediction. High-quality type inference results can be obtained by relying on the entity embeddings trained by the KGE model. However, real-world knowledge graphs are constantly updated. Take the 7-lore Knowledge Graph as an example: it adds millions of entities a day. KGE models have no information about these newly added entities during training, which causes many difficulties at the inference stage. [5] introduces the inductive setting to characterize this challenge. Although some recent studies have begun to focus on out-of-knowledge-base embedding [6,7,8], there are few KGET algorithms for new entities. On the other hand, current KGE-based models cannot take advantage of incoming new data to improve performance, i.e., these algorithms are not incremental. Statistics-based algorithms have natural advantages in incrementality and efficiency, but the performance of existing statistics-based KGET algorithms is far inferior to KGE-based models.

In this paper, we aim to design a statistics-based KGET algorithm that takes into account both performance and incrementality. Specifically, for a target entity, we leverage its neighborhood information and its existing type information for inference. Inspired by [9], we assume that the relations connecting entities have semantically invariant properties from entity to type. As shown in Figure 1, for the fact triple (Jodhaa Akbar, starring, Raza Murad), even if the subject and object are replaced with their corresponding types, the derived type triple (film, starring, film actor) still holds. For neighborhood information, observe that Raza Murad is connected to the neighbor entity Jodhaa Akbar by the relation starring. If we know that the entity Jodhaa Akbar has type film, we can infer the type triple (film, starring, film actor) and further infer that the entity Raza Murad has type film actor. For type information, multiple different types often co-occur on an entity, such as film actor and person. If we know that Raza Murad already has the type film actor, we can infer that Raza Murad has the type person by using co-occurrence information. We name the algorithm PIANO: a Performant and Incremental algorithm with Aggregating Neighborhood and co-Occurrence information for KGET. PIANO consists of two stages: a data statistics stage and an information aggregation stage.

Data statistics stage

All type triples derived from fact triples are counted. Intuitively, type triples with more occurrences are more likely to be facts. However, this intuition may suffer from noise and imbalance in knowledge graphs. Therefore, we treat type-predicate pairs as queries and type triples as answers, and use the conditional probability of a type triple under its query pair to measure the confidence of the type triple. It is worth mentioning that we use a neat trick, adding a self-loop with the relation co-occurrence to each entity, to collect type co-occurrence statistics. The entire data statistics process only needs to traverse the knowledge graph and its derived type triples once.

Information aggregation stage

At this stage, we consider two issues: i) how to aggregate the information of the multiple type triples derived from a fact triple; ii) how to aggregate the information of the multiple fact triples associated with a target entity. For these two issues, we adopt a mean strategy and a sum strategy, respectively. Finally, we obtain a score for each entity-type pair that reflects the confidence that the entity has the type; the higher the score, the more likely the entity is to have the type.

Our contributions are as follows:

  • We design a statistics-based algorithm for KGET named PIANO, which maintains both high performance and incremental property.

  • We conduct KGET experiments on multiple datasets, and the experimental results show that PIANO far outperforms previous statistics-based algorithms and even outperforms most KGE-based models.

  • We design a specific incremental experiment to infer types of new entities and to verify the incremental properties of algorithms.

2 Related work

Statistics-based Algorithm

In the semantic web community, research related to KGET has a long history [10] under the name schema discovery. Schema discovery can be roughly classified into three categories: implicit schema discovery, explicit schema enrichment, and structural pattern discovery; KGET falls under explicit schema enrichment. Among these works, SDType (Statistical Distribution of Types) [11] is the research most relevant to ours. SDType treats the predicates associated with the target entity as features and tries to avoid the propagation of errors from irrelevant instances through a weighted voting method. Its essential idea is type constraint [12, 13], meaning that the range of possible entity types is constrained by the predicates associated with the entity. For example, the types of subject and object entities connected by the MarriedTo predicate can usually only be person or subtypes of person. The approach of Fang et al. [14] is similar to SDType in that it is also based on the statistical distribution of types; the difference is that it infers types from category information. However, not all knowledge graphs have category information available.

KGE-based Algorithm

In the field of knowledge graph embedding, many works [15,16,17,18,19] have been proposed to embed entities and predicates into low-dimensional semantic spaces for downstream tasks. However, most of these models target the knowledge graph link prediction task and ignore the KGET task. KGE models were first applied to the KGET task by Moon et al. [20]. They tried two paradigms: one treats rdf:type triples as fact triples and directly uses the KGE models for type inference; the other also embeds entity types into a low-dimensional semantic space and performs type inference through the distance between entity embeddings and type embeddings. Thereafter, Zhao et al. [9] proposed the connecting embeddings model (ConnectE) and introduced the concepts of global type knowledge and local triple knowledge. Global type knowledge considers entities with the same type to be close in the embedding space. Local triple knowledge assumes that predicates are semantically invariant: when the subject and object entities in a fact triple are replaced with their corresponding types, the derived type triple still holds. In addition, Pan et al. proposed the CET model [21], which utilizes neighbor information through an independent-based mechanism and an aggregated-based mechanism for type inference. Recently, several new KGE-based KGET models have been proposed, such as ConnectE-MRGAT [22], AttEt [23] and RACE2T [24].

In this paper, we propose a new statistics-based KGET algorithm, PIANO. Different from SDType, PIANO utilizes not only the predicates associated with the target entity but also the type information of neighbor entities for type inference. By incorporating the idea of aggregating neighborhood information from representation learning, PIANO achieves satisfactory performance while maintaining the incremental property.

3 Methodology

To facilitate reading, we first define some notions. A KG G usually contains fact triples (s, p, o) ∈ G, where entities s, o ∈ E and predicates p ∈ P. In KGET tasks, an entity e usually has some existing type declarations te ∈ T. We use T(e) to represent the set of known type instances of entity e. Given a fact triple (s, p, o), based on the existing type declarations of its subject and object entities, we can derive a large number of type triples (ts, p, to). All type triples derived from fact triples in G form a Type Graph (TG), denoted Gt. Note that a type triple may appear in Gt repeatedly. For a given predicate p, we define the set of all subject types associated with p in Gt as its domain and the set of all object types as its range, denoted D(p) and R(p) respectively.
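To make the notation concrete, the following minimal Python sketch (with toy data modeled on Figure 1; all names are illustrative) derives the type graph Gt from fact triples and known type declarations:

```python
from collections import Counter

# Toy data modeled on Figure 1; names are illustrative only.
facts = [("Jodhaa Akbar", "starring", "Raza Murad")]
types = {"Jodhaa Akbar": {"film"}, "Raza Murad": {"film actor", "person"}}

def derive_type_graph(facts, types):
    """Derive the multiset of type triples (ts, p, to) forming Gt.

    A fact triple whose subject has |T(s)| types and whose object has
    |T(o)| types derives |T(s)| * |T(o)| type triples, so the same type
    triple may appear in Gt repeatedly."""
    Gt = Counter()
    for s, p, o in facts:
        for ts in types.get(s, ()):
            for to in types.get(o, ()):
                Gt[(ts, p, to)] += 1
    return Gt

print(derive_type_graph(facts, types))
# Counter({('film', 'starring', 'film actor'): 1, ('film', 'starring', 'person'): 1})
```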

3.1 SDType++

Our goal is to compute an entity-type matrix \(M \in \mathbb{R}^{\lvert E \rvert \times \lvert T \rvert}\), where element Mij represents the confidence score that the ith entity has the jth type. For a predicate p, if its domain frequently contains a type t, we believe that a subject s associated with the predicate p has a high probability of having that type. The same intuition holds for its range.

To reflect this intuition, we first calculate the predicate-type incidence matrices \(M_{p2t}^{s}, M_{p2t}^{o} \in \mathbb{R}^{\lvert P \rvert \times \lvert T \rvert}\). The elements of these matrices record how often a predicate and a subject (or object) type appear together in Gt. For example, \(M_{p2t}^{s}[p][t_{s}]\) indicates the number of (ts, p, −) in Gt. As illustrated in Figure 1, an entity in the KG may be associated with more than one fact triple. To this end, we calculate the entity-predicate incidence matrices \(M_{e2p}^{s}, M_{e2p}^{o} \in \mathbb{R}^{\lvert E \rvert \times \lvert P \rvert}\), where \(M_{e2p}^{s}[e][p]\) indicates the number of (e, p, −) in G. We refer to this algorithm as SDType++. Different from SDType [11], SDType++ not only considers which predicates are associated with the target entity, but also accounts for the target entity being associated with the same predicate multiple times.

The computation of the above matrices can be regarded as the data statistics stage of SDType++. Through matrix operations we can naturally aggregate all statistics and obtain the entity-type matrix M:

$$ M = M_{e2p}^{s}M_{p2t}^{s}+M_{e2p}^{o}M_{p2t}^{o} $$
(1)
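As a sketch of how SDType++ can be computed, the following Python fragment builds the four incidence matrices and applies (1); treating \(M_{p2t}^{s}[p][t_{s}]\) as the multiset count of (ts, p, −) in Gt is our reading of the definitions above:

```python
import numpy as np

def sdtype_pp(facts, types, entities, predicates, type_list):
    """A minimal sketch of SDType++ (Eq. 1); indexing helpers are our own."""
    e_idx = {e: i for i, e in enumerate(entities)}
    p_idx = {p: i for i, p in enumerate(predicates)}
    t_idx = {t: i for i, t in enumerate(type_list)}

    M_e2p_s = np.zeros((len(entities), len(predicates)))   # counts of (e, p, -)
    M_e2p_o = np.zeros((len(entities), len(predicates)))   # counts of (-, p, e)
    M_p2t_s = np.zeros((len(predicates), len(type_list)))  # counts of (ts, p, -) in Gt
    M_p2t_o = np.zeros((len(predicates), len(type_list)))  # counts of (-, p, to) in Gt

    for s, p, o in facts:
        M_e2p_s[e_idx[s], p_idx[p]] += 1
        M_e2p_o[e_idx[o], p_idx[p]] += 1
        for ts in types.get(s, ()):
            M_p2t_s[p_idx[p], t_idx[ts]] += len(types.get(o, ()))
        for to in types.get(o, ()):
            M_p2t_o[p_idx[p], t_idx[to]] += len(types.get(s, ()))

    # Equation (1): aggregate subject-side and object-side evidence.
    return M_e2p_s @ M_p2t_s + M_e2p_o @ M_p2t_o
```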

3.2 PIANO

However, the intuition behind SDType++ has a flaw. As shown in Table 1, no matter what the target entity and its neighbor entities are, as long as the entity occupies the object position of the predicate starring, it will be preferentially predicted to have type tv actor. Recalling Figure 1, the entity Raza Murad should be more likely to have type film actor than tv actor. To compensate for this flaw, we fuse in the information of the target entity's neighbor entities.

Table 1 Top 6 types in the range of the predicate starring

Data statistics stage

First, we count the number of occurrences of each type triple in Gt. Considering data sparsity, we use a dictionary A to store this information; that is, A[(ts, p, to)] is the number of occurrences of (ts, p, to) in Gt. While computing A, we can also compute the matrices \(M_{p2t}^{s}\) and \(M_{p2t}^{o}\) mentioned above. To facilitate calculation and save computing resources, we use dictionaries Qo and Qs to store \(M_{p2t}^{s}\) and \(M_{p2t}^{o}\), respectively; that is, \(Q_{o}[(t_{s},p)] = M_{p2t}^{s}[p][t_{s}]\) and \(Q_{s}[(p,t_{o})] = M_{p2t}^{o}[p][t_{o}]\). Second, we convert these counts into probabilities. For each type triple (ts, p, to), we calculate its conditional probability under the query (ts, p) and under the query (p, to):

$$ p((t_{s}, p, t_{o})\vert(t_{s}, p)) = A[(t_{s}, p, t_{o})]/Q_{o}[(t_{s}, p)] $$
(2)
$$ p((t_{s}, p, t_{o})\vert(p, t_{o})) = A[(t_{s}, p, t_{o})]/Q_{s}[(p, t_{o})] $$
(3)

For a triple (s, p, o), the value p((ts, p, to)|(ts, p)) indicates the probability that o has the type to under the query (ts, p). The motivation for computing conditional probabilities is that the types of neighbor entities can constrain the types of target entities. For example, from common sense, p((film, starring, film actor) | (film, starring)) should be greater than p((film, starring, tv actor) | (film, starring)); that is, the probability of a film starring a film actor should be greater than the probability of a film starring a tv actor.
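A minimal Python sketch of the data statistics stage, following the dictionary representation and (2)-(3) above (function names are ours):

```python
from collections import Counter

def data_statistics(facts, types):
    """Data statistics stage of PIANO: a sketch of A, Qo, Qs and Eqs. (2)-(3)."""
    A, Qo, Qs = Counter(), Counter(), Counter()
    for s, p, o in facts:                       # one pass over G (and hence Gt)
        for ts in types.get(s, ()):
            for to in types.get(o, ()):
                A[(ts, p, to)] += 1             # count of (ts, p, to) in Gt
                Qo[(ts, p)] += 1                # count of (ts, p, -): Eq. (2) denominator
                Qs[(p, to)] += 1                # count of (-, p, to): Eq. (3) denominator
    return A, Qo, Qs

def p_given_sp(A, Qo, ts, p, to):
    return A[(ts, p, to)] / Qo[(ts, p)]         # Eq. (2)

def p_given_po(A, Qs, ts, p, to):
    return A[(ts, p, to)] / Qs[(p, to)]         # Eq. (3)
```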

Information aggregation stage

We first consider how to aggregate the information of the multiple type triples derived from a fact triple. For the target entity, the information returned by each type triple is probabilistic, and the number of type triples derived from each fact triple differs. For example, in Figure 1, Indian has 66 known type declarations, while Jodhaa Akbar has only 7. We use an averaging operation to keep the probability information between 0 and 1. Based on common sense, we believe that Jodhaa Akbar is more helpful for predicting that Raza Murad has type film actor, and keeping the numerical information in probabilistic form better reflects this intuition. In the second step, we consider how to aggregate all neighborhood information of the target entity e. For a target entity e, we define its subject neighborhood as Ns = {(s,p)|(s,p,e) ∈ G} and its object neighborhood as No = {(p,o)|(e,p,o) ∈ G}. Similar to SDType++, the information aggregated from each fact triple can be regarded as a vote, so we adopt a summation strategy in the second step. Our goal is still to calculate the entity-type matrix M. The element of M for a target entity e and type t is calculated as follows:

$$ \begin{array}{@{}rcl@{}} M[e][t]&=& \sum\limits_{(s,p) \in N_{s}} \frac{1}{\vert T(s)\vert} \sum\limits_{t_{s} \in T(s)} p((t_{s}, p, t)\vert(t_{s},p)) \\ &&+\sum\limits_{(p, o) \in N_{o}} \frac{1}{\vert T(o)\vert} \sum\limits_{t_{o} \in T(o)} p((t, p, t_{o})\vert(p, t_{o})) \end{array} $$
(4)

where |T(e)| represents the number of known type declarations of the entity e, and 1/|T(e)| is the averaging factor.
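A sketch of this aggregation in Python, reusing the dictionaries A, Qo and Qs from the data statistics sketch (queries with zero denominators are simply skipped, an assumption on our part):

```python
def piano_score(e, t, facts, types, A, Qo, Qs):
    """Eq. (4): confidence that entity e has type t (no co-occurrence weight yet).

    Sums over the subject neighborhood Ns and the object neighborhood No,
    averaging over each neighbor's known types."""
    score = 0.0
    for s, p, o in facts:
        if o == e and types.get(s):             # (s, p) in Ns
            Ts = types[s]
            score += sum(A[(ts, p, t)] / Qo[(ts, p)]
                         for ts in Ts if Qo[(ts, p)]) / len(Ts)
        if s == e and types.get(o):             # (p, o) in No
            To = types[o]
            score += sum(A[(t, p, to)] / Qs[(p, to)]
                         for to in To if Qs[(p, to)]) / len(To)
    return score
```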

Type co-occurrence information

Furthermore, we also consider global type co-occurrence. As mentioned in [20], 10% of entities in the FB15k dataset have the type /music/artist but do not have the type /people/person in Freebase. From another perspective, most entities having type /music/artist in the dataset should also have type /people/person. Formally, if a large number of entities have type X and type Y simultaneously, then a target entity that has type X is likely to also have type Y. We add a predicate co-occurrence as a self-loop for each entity in the KG (as shown in Figure 1) to mine type co-occurrence information. The triple (e, co-occurrence, e) derives many type triples (te1, co-occurrence, te2), where te1, te2 ∈ T(e). To explore the relative importance of neighborhood information and type co-occurrence information, we introduce a weight parameter ω. Equation (4) is updated as follows:

$$ \begin{array}{@{}rcl@{}} M[e][t]&=& \sum\limits_{(s,p) \in N_{s}} \frac{W_{p}}{\vert T(s)\vert} \sum\limits_{t_{s} \in T(s)} p((t_{s}, p, t)\vert(t_{s},p)) \\ &&+\sum\limits_{(p, o) \in N_{o}} \frac{W_{p}}{\vert T(o)\vert} \sum\limits_{t_{o} \in T(o)} p((t, p, t_{o})\vert(p, t_{o})) \end{array} $$
(5)

where Wp = ω if p is co-occurrence, otherwise Wp = 1.
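A sketch of the co-occurrence extension, assuming the artificial predicate is represented by the string "co-occurrence":

```python
def add_co_occurrence_loops(facts, types):
    """Add the self-loop (e, co-occurrence, e) for every typed entity, so the
    statistics stage also counts type pairs (te1, co-occurrence, te2)."""
    return facts + [(e, "co-occurrence", e) for e in types]

def W(p, omega):
    """Weight W_p in Eq. (5): omega for the artificial predicate, 1 otherwise."""
    return omega if p == "co-occurrence" else 1.0
```

Because the self-loops are ordinary triples, the data statistics stage needs no changes; only the aggregation stage multiplies each vote by W(p, omega).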

SDType++ extends SDType from a qualitative study of the entity types associated with predicates to a quantitative one. Further, PIANO relies not only on predicate information but also adds the types of neighboring entities as constraints on the predicted types of the target entity. Overall, PIANO is an extension and generalization of SDType and SDType++ that aims to improve prediction performance while retaining the efficiency of the statistics-based approach.

3.3 Complexity analysis

The algorithms SDType++ and PIANO both consist of a data statistics stage and an information aggregation stage; we analyze the time complexity of each stage separately.

Data statistics stage

Obviously, it only needs to traverse Gt once to obtain the required statistics, so the time complexity of the data statistics stage is O(|Gt|). In the extreme case, |Gt| may reach |T|² × |G|, so it is worth discussing whether the number of type triples in Gt can explode quadratically. To dispel this doubt, Figure 2 shows the distribution of the number of existing types per entity in FB15kET and YAGO43kET. It can be clearly seen that the maximum number of types in the two datasets is about 100, and most entities have fewer than 20 types. This matches the real world: a large number of entities can be classified with a few type catalogs. We denote the average number of types per entity as α, with α ≪ |T|. Hence, the time complexity of the data statistics stage is O(α² × |G|).

Fig. 2 Distribution of types in FB15kET and YAGO43kET

Information aggregation stage

For each entity, aggregating its neighborhood information requires traversing all of its associated fact triples, and each traversed fact triple requires querying all possible types of the target entity. Combined with the phenomenon in Figure 2, the time complexity of the information aggregation stage is O(|G| × |T| × α). Note that |G| ≫ |T| for real-world knowledge graphs. For SDType++, the aggregation of neighborhood information is simplified into matrix multiplication; from the dimensions of the matrices, the time complexity of SDType++ at this stage is O(|E| × |P| × |T|). Although the complexity appears to increase (due to sparseness, usually |E| × |P| > |G|), SDType++ is faster in practice thanks to mature matrix computation tools.

Since α is usually very small, the information aggregation stage dominates the total time cost of both algorithms. In summary, the total time complexity of SDType++ is O(|E| × |P| × |T|) and that of PIANO is O(|G| × |T| × α).

4 Experiments

4.1 Datasets

For evaluation, we use two real-world datasets widely used in the KGE literature, each containing thousands of entity types. Each dataset contains a large number of fact triples (s, p, o) and type pairs (e, te), and is divided into a training set, a validation set and a test set. Considering the difference between statistics-based algorithms and KGE-based models, for fairness we only use the training sets as prior knowledge, and use the validation and test sets to evaluate the models or algorithms. We did not use datasets common in the semantic web community, such as DBpedia and Histmunic [25], because they contain relatively few types, which makes them less suitable for evaluating the KGET task. Table 2 shows the statistics of the used datasets. The datasets are introduced as follows:

Table 2 Statistics of used datasets

FB15kET [20] is a subset of Freebase [26], a large fraction of whose content describes facts about movies, actors, awards, sports, and sports teams.

YAGO43kET [20] is a subset of YAGO [27] whose triples deal with descriptive attributes of people, such as citizenship, gender, and profession.

4.2 Knowledge graph entity typing task

This task aims to complete the entity-type pairs (e,te) when the types of the entity are missing.

Protocol

We use the same experimental protocol as described in [9] and [20]. For each entity-type pair (e, te) in the test set, we replace te with every type instance in T to obtain the candidate pair set \(C = \{(e,t^{\prime }_{e})\vert t^{\prime }_{e} \in T\}\). The scores of all pairs are then calculated by the energy function, and C is ranked by score to obtain the rank of (e, te), analogously to the link prediction task. Finally, according to the rankings of the correct pairs in the test set, we calculate the mean reciprocal rank (MRR) and the proportion of correct pairs ranked in the top n (H@n). Unlike the protocol in previous literature, PIANO directly calculates the entity-type matrix M, whose elements are the scores of entity-type pairs, so we only need to sort the type list of each entity rather than replacing type instances. Considering that some pairs in C may be other correct pairs from the training or validation set, the raw ranking may be unreasonable; this setting is called ‘Raw’ in [15]. The setting that filters out all other correct pairs before ranking is called ‘Filter’. We only report experimental results under the ‘Filter’ setting in this paper.

$$ \begin{array}{@{}rcl@{}} MRR &=& \frac{1}{\vert Test\vert}\sum\limits_{i=1}^{\vert Test\vert}\frac{1}{rank_{i}} \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} H@n &=& \frac{1}{\vert Test\vert}\sum\limits_{i=1}^{\vert Test\vert}x_{i} \end{array} $$
(7)

where |Test| represents the number of entity-type pairs in the test set, ranki is the ranking of the ith true entity-type pair, and xi = 1 if ranki ≤ n, otherwise xi = 0.
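A sketch of the filtered evaluation in Python; pairs with strictly higher scores are counted as outranking the true pair, so ties favor it (this tie-breaking choice is ours):

```python
def evaluate(M, test_pairs, known_pairs, n=10):
    """Filtered MRR and H@n (Eqs. 6-7), a sketch.

    M            -- dict: M[e][t] is the score of the entity-type pair (e, t)
    test_pairs   -- true (e, t) pairs in the test set
    known_pairs  -- correct pairs from train/valid sets, removed before
                    ranking ('Filter' setting)."""
    rr = hits = 0
    for e, t in test_pairs:
        s_true = M[e][t]
        rank = 1 + sum(1 for t2, s2 in M[e].items()
                       if t2 != t and (e, t2) not in known_pairs and s2 > s_true)
        rr += 1.0 / rank
        hits += rank <= n
    return rr / len(test_pairs), hits / len(test_pairs)
```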

Parameter setting

PIANO has only one parameter, ω, which adjusts the weight between neighborhood information and global type co-occurrence information. We search for ω in the range {0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000} and select ω based on the H@10 of the validation set. The optimal ω is 30 for FB15kET and 300 for YAGO43kET. Figure 3 shows H@10 for different values of ω on FB15kET and YAGO43kET.

Fig. 3 Line chart of H@10 for different values of ω in different datasets

Experimental results

Table 3 shows the evaluation of different models on FB15kET and YAGO43kET. The KGE-based models and the statistics-based algorithms are separated by a horizontal line: the upper part lists the results of the KGE-based models, and the lower part the results of the statistics-based algorithms. Among the statistics-based algorithms, the improved SDType++ significantly outperforms SDType, which suggests that it is necessary to account for the target entity being associated with the same predicate multiple times. Remarkably, PIANO achieves the best results among the statistics-based algorithms, indicating that the type information of neighbor entities effectively helps PIANO perform type inference on the target entity.

Table 3 Entity type prediction results

Compared to KGE-based models, the PIANO algorithm outperforms most of them. However, a significant gap remains with a few KGE-based models, especially CET, which achieves the best results on all evaluation metrics on both datasets. We see two likely reasons: i) CET also aggregates neighborhood information, confirming that neighborhood information helps infer the type of the target entity; ii) KGE-based models map entities and types into a semantic space and can better capture the semantic associations between them. Unlike KGE-based approaches, which seek only higher prediction accuracy, PIANO is motivated by the practical view that, in addition to prediction accuracy, maintaining high efficiency and incrementality is more meaningful for a continuously updated knowledge graph. The experimental results on FB15kET are better than those on YAGO43kET, which stems from the differences between the datasets. Compared with FB15kET, YAGO43kET contains more entity types and entities but fewer predicates, i.e., YAGO43kET is sparser than FB15kET, which makes it more challenging for the KGET task.

Ablation experiment and parameter sensitivity

PIANO has only one parameter, ω. To explore the influence of neighborhood information and global type co-occurrence information in PIANO, Table 4 compares the results of using only neighborhood information, using only global type co-occurrence information, and using the optimal ω. PIANO with the optimal ω achieves the best results on both datasets.

Table 4 Results of ablation experiment

We further analyze the sensitivity of the parameter ω in PIANO based on the H@10 of the validation set on the different datasets in Figure 3. In essence, the larger the value of ω, the more important the role of type co-occurrence information in PIANO. For FB15kET, the performance of PIANO first increases with ω, and then decreases after reaching the optimum. For YAGO43kET, the performance improves as ω increases but eventually plateaus.

Combined with the analysis in Table 4 and Figure 3, we argue that the PIANO algorithm relies more on type co-occurrence information on the dataset YAGO43kET where the neighborhood information is more sparse. In contrast, on the dataset FB15kET with richer neighborhood information, excessive reliance on type co-occurrence information will degrade the performance of the algorithm. Neighborhood information and type co-occurrence information can complement each other to further improve the performance of the algorithm.

Case analysis

Since PIANO uses statistics to infer types, the algorithm is highly interpretable. Table 5 details how the types of Raza Murad in Figure 1 are inferred from the fact triple (Jodhaa Akbar, starring, Raza Murad) and global statistics. Each element in the table is a calculated conditional probability (the number in parentheses indicates the ‘Raw’ ranking predicted by the corresponding type triple). The last line averages the conditional probabilities of all type triples derived from the fact triple. As expected, PIANO lowered the predicted rank of tv actor from 1 to 4 and raised the predicted rank of film actor from 3 to 2. Compared with Table 1, the ranking results are more reasonable and more interpretable. We also show the prediction details of the type influence node, which is in the test set, and the type entity in film, which is correct but does not exist in the dataset.

Table 5 The details of inferring the types of Raza Murad by the triple (Jodhaa Akbar, starring, Raza Murad) in Figure 1

Further, Table 6 shows the details of aggregating the neighborhood information in Figure 1. Almost all neighborhood information provides person information. As expected, the film Jodhaa Akbar provides the most information for film actor. Unexpectedly, relying on nationality relations can also achieve close prediction results. We argue there may be bias in the dataset, such that even a coarse-grained predictor can achieve decent results.

Table 6 The details of inferring the types of Raza Murad by the information of the neighborhood in Figure 1

4.3 Incremental inference experiment

In the real world, KGs are not only large-scale but also constantly and dynamically updated. Large numbers of entities that have never existed in the knowledge graph are likely to be linked into it. How to predict the types of these new entities remains an open question.

When a new entity is added, embedding-based models face a great obstacle in predicting its type instances, since the embedding of the new entity was never trained. In addition, the training process of embedding-based models is often very time-consuming, so it is unrealistic to retrain the model whenever a new entity is added. In the same scenario, PIANO can infer the types of the new entity based on the existing statistical information and the entity-relation pairs connected to the new entity. If the degree of the new entity is d, the complexity of this process is O(d). Next, we simulate real scenes to predict the types of entities newly linked into the knowledge graph.
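A sketch of this inference process for a single new entity, reusing the statistics A, Qo and Qs from Section 3.2 (the edge encoding is our own):

```python
def infer_new_entity(edges, types, A, Qo, Qs, candidate_types):
    """Infer types of a brand-new entity from its d connecting edges (sketch).

    edges -- the entity-relation pairs linking the new entity to known
    neighbors, tagged by the position the new entity occupies; only the
    existing statistics A, Qo, Qs are consulted, so no retraining is needed."""
    scores = dict.fromkeys(candidate_types, 0.0)
    for position, p, neighbor in edges:          # position: "subject" or "object"
        Tn = types.get(neighbor)
        if not Tn:
            continue
        for t in candidate_types:
            if position == "object":             # new entity is the object
                scores[t] += sum(A[(tn, p, t)] / Qo[(tn, p)]
                                 for tn in Tn if Qo[(tn, p)]) / len(Tn)
            else:                                 # new entity is the subject
                scores[t] += sum(A[(t, p, tn)] / Qs[(p, tn)]
                                 for tn in Tn if Qs[(p, tn)]) / len(Tn)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```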

Protocol

Suppose there is an existing KG and a new entity needs to be linked into it through some relations. Because it is a new entity, we know nothing about it; that is, we have neither its embedding nor any of its types. Therefore, predicting its type instances is challenging. We implement the incremental inference experiment on FB15kET and YAGO43kET. To simulate such real scenes, we make a few changes to the datasets. We use FB15kET and FB15k as an example to illustrate the transformation process; the same applies to YAGO43k and YAGO43kET.

First, for all entities having types in the test set of FB15kET, we move all of their other entity-type pairs from the training set and the validation set to the test set. Analogously, for all entities having types in the modified validation set, we move their entity-type pairs from the training set to the validation set. We call the modified dataset FB15kET-I. In particular, we refer to the modified training set, validation set, and test set as the prior set, expansion set, and evaluation set, respectively.

Second, for each fact triple in the training set of FB15k, we consider three cases according to its head and tail entities: (1) both entities are in the prior set of FB15kET-I; (2) one entity is in the prior set of FB15kET-I and the other is in the expansion set; (3) one entity is in the prior set of FB15kET-I and the other is in the evaluation set. The prior set, expansion set, and evaluation set of FB15k-I are composed of the fact triples falling under cases (1), (2), and (3), respectively.

Finally, we remove all entities that are not in FB15k-I, along with their related entity-type pairs, from FB15kET-I. After processing, we use the prior sets of FB15k-I and FB15kET-I to calculate prior statistical information, the expansion set to simulate ever-increasing data, and the evaluation set to evaluate the experimental results. This ensures that the entities in the expansion and evaluation sets of FB15kET-I are completely unknown before inference.
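A sketch of this transformation in Python (set names are ours; the final removal of entities absent from FB15k-I is omitted for brevity):

```python
def build_incremental_split(train, valid, test, fb15k_train):
    """Sketch of the FB15kET-I / FB15k-I construction.

    train/valid/test -- sets of (entity, type) pairs from FB15kET
    fb15k_train      -- set of fact triples (s, p, o) from FB15k"""
    # Step 1: concentrate all pairs of test entities in the evaluation set,
    # then all pairs of the remaining valid entities in the expansion set.
    test_ents = {e for e, _ in test}
    evaluation = test | {(e, t) for e, t in train | valid if e in test_ents}
    valid = {(e, t) for e, t in valid if e not in test_ents}
    valid_ents = {e for e, _ in valid}
    expansion = valid | {(e, t) for e, t in train if e in valid_ents}
    prior = {(e, t) for e, t in train if e not in test_ents | valid_ents}

    # Step 2: split FB15k fact triples by where their endpoints live.
    prior_ents = {e for e, _ in prior}
    expan_ents = {e for e, _ in expansion}
    eval_ents = {e for e, _ in evaluation}
    fb_prior, fb_expan, fb_eval = set(), set(), set()
    for s, p, o in fb15k_train:
        in_prior = (s in prior_ents) + (o in prior_ents)
        if in_prior == 2:
            fb_prior.add((s, p, o))
        elif in_prior == 1:
            other = o if s in prior_ents else s
            if other in expan_ents:
                fb_expan.add((s, p, o))
            elif other in eval_ents:
                fb_eval.add((s, p, o))
    return (prior, expansion, evaluation), (fb_prior, fb_expan, fb_eval)
```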

Table 7 shows the statistics of datasets after transformation.

Table 7 Statistics of datasets after transformation

Inferring types for new entities

As a comparison, we design procedures for ConnectE [9] and CET [21] to infer the types of new entities. ConnectE uses TransE to train on fact triples to obtain entity and predicate embeddings. To better fit ConnectE, we use the idea of TransE [15], i.e., h + r ≈ t, to obtain the embeddings of new entities. Specifically, we use all fact triples connected to the target entity together with TransE to obtain a series of generated embeddings, and apply average pooling over them to obtain the embedding of the new entity. With these embeddings, ConnectE can predict the types of new entities. Since CET includes a computation over neighborhood information, we use the N2T and Agg2T mechanisms [21] in CET to directly perform type inference on the target entity. We also report the results of SDType and SDType++ on predicting new entity types. For PIANO, we aggregate the neighborhood information of the new entities to carry out inference.

Experimental results

The experimental results under the new protocol are shown in Table 8. In this realistic scene, the performance of all methods drops greatly, showing that inferring types for new entities is a challenging task. By comparison, the performance of PIANO is more stable, and it obtains the best results under the new protocol. Notably, the performance of CET degrades sharply under this protocol and is even inferior to that of ConnectE, in stark contrast to CET's SOTA results on the ordinary KGET task in Table 3. We see two possible reasons: i) KGE-based models rely on trained entity embeddings for type inference; when inferring types of new entities, the model cannot exploit information about the new entities, degrading performance. ii) Considering Table 4 and Figure 3 together, the model may over-rely on the co-occurrence information of entity types. We believe CET may overfit during training, over-relying on type co-occurrence information while ignoring neighborhood information, which makes it difficult for CET to effectively apply the neighborhood information of the target entity for type inference.

Table 8 Results of inferring types of new entities

In addition, we consider utilizing continuously updated data to improve the performance of PIANO. This is unrealistic for KGE-based models, which must be retrained once new data is added; the retraining process is extremely cumbersome and requires expensive computing resources, while PIANO only needs to update its statistics.
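A sketch of this statistics update, again in terms of the dictionaries from Section 3.2; the update touches only the newly linked triples:

```python
def update_statistics(new_facts, types, A, Qo, Qs):
    """Fold newly linked triples into the existing statistics in place.

    Counts only grow, so the whole KG never needs to be re-traversed; this is
    the incremental update PIANO performs instead of retraining."""
    for s, p, o in new_facts:
        for ts in types.get(s, ()):
            for to in types.get(o, ()):
                A[(ts, p, to)] += 1
                Qo[(ts, p)] += 1
                Qs[(p, to)] += 1
```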

We add 100 entities from the expansion sets at a time, connect the corresponding triples, and update the statistics. After each addition, we re-evaluate on the evaluation set. We compare the performance changes of SDType, SDType++ and PIANO during this incremental process. Figures 4 and 5 show the performance changes of the three algorithms on FB15kET-I and YAGO43kET-I, respectively. In Figure 4, the performance of SDType decreases as data is added. We believe the added data associates more predicates with entities in the prior set, and these predicates may confuse SDType. Although SDType++ goes through the same process, it can benefit from the newly added data by distinguishing how many times a predicate is associated with the target entity. Whereas SDType++ shows an obvious incremental gain only on the H@10 metric, PIANO benefits significantly from the new data on all metrics. This phenomenon is more obvious in Figure 5, where the type data of YAGO43kET-I is numerous and sparse: PIANO continues to improve as new data is added, while SDType and SDType++ are almost insensitive. This shows that adding predicate data brings limited gains, whereas adding the type information of neighborhoods continuously improves the algorithm's ability to infer the types of target entities.

Fig. 4 Comparison of H@1/3/10 and MRR improvements on the test set for SDType, SDType++ and PIANO with adding new data on FB15kET-I

Fig. 5 Comparison of H@1/3/10 and MRR improvements on the test set for SDType, SDType++ and PIANO with adding new data on YAGO43kET-I

In conclusion, although SDType is incremental in theory, it encounters obstacles when data types are highly varied. SDType++ alleviates the challenges SDType faces in the incremental experiment. PIANO has superior incremental properties and can continuously benefit from new data to improve its performance.

Efficiency analysis of PIANO

To measure the efficiency of PIANO, we compare the training time of the KGE-based methods CET and ConnectE with the computation time of the statistics-based PIANO in the incremental experiments. Since the KGE-based methods need to retrain after adding new entities, we consider five subsets of FB15kET-I with sizes of 500, 1000, 1500, 2000 and 2500. We use these five subsets as training sets and report the results in Figure 6. CET and ConnectE run on one Intel(R) Core(TM) i7-8700 CPU and one Nvidia A100 GPU, while PIANO runs on one Intel(R) Core(TM) i7-8700 CPU only. From Figure 6, the computation time of PIANO is much smaller than the training time of CET and ConnectE, especially ConnectE, which further demonstrates the efficiency of our method.

Fig. 6 Comparison of training time for ConnectE, CET and PIANO with adding new data on FB15kET-I

5 Conclusion

In this paper, we propose a statistics-based knowledge graph entity typing algorithm PIANO. The algorithm performs type inference by aggregating the neighborhood information and type co-occurrence information of the target entity. The experimental results show that the PIANO algorithm achieves performance that can compete with the KGE-based models. We find that neighborhood information plays a better role in dense data, while type co-occurrence information plays a more important role in sparse data.

We also design incremental experiments to simulate the knowledge graph update process in real-world scenarios. The experimental results show that all algorithms suffer a dramatic decline in performance when predicting the types of new entities. Nevertheless, the PIANO algorithm retains relatively stable performance. Experiments also verify that PIANO can improve its performance with continually added data; that is, the algorithm has an outstanding incremental property.