1 Introduction

A Knowledge Graph (KG), such as WordNet (Miller 1995) or Freebase (Bollacker et al. 2008), organizes human knowledge into a structured knowledge system and serves as a powerful database for knowledge inference (Minervini et al. 2016; Zhang et al. 2017; Han et al. 2018), information retrieval (Metzger et al. 2017), question answering (Ferrández et al. 2016), and many other fields, promoting the development of artificial intelligence. A knowledge fact in a KG is denoted as a discrete triple 〈h, r, t〉, where h, r, and t indicate a head entity, a relation, and a tail entity, respectively. For example, in Freebase, the triple 〈 Steve Jobs, PlaceOfBirth, San Francisco 〉 states the fact that the person Steve Jobs was born in the place San Francisco, where Steve Jobs is the head h, San Francisco is the tail t, and PlaceOfBirth is the relation r.

As the size of KGs grows and computational complexity arises from the heterogeneity and sparsity of knowledge graphs, Knowledge Representation Learning (KRL) has attracted massive research attention. KRL projects semantically similar points from the data manifold of a KG onto metrically close points in a low-dimensional embedding space; conversely, semantically dissimilar points in the KG should be projected onto metrically distant points in the embedding space. In particular, the three components of a triple 〈h, r, t〉 are encoded as three embeddings h, r and t, respectively. The plausibility of the embedded h, r, t is evaluated by a semantic score function f that measures point distances in the embedding space. For example, in TransE (Bordes et al. 2013) the score function is chosen as \(f(h,r,t)=\lVert \boldsymbol {h}+\boldsymbol {r}-\boldsymbol {t} \rVert _{L_{n}}\), which indicates that t should be close to h + r under the Ln distance. In other words, the smaller the distance between the relation r and the difference of the two entities \(\boldsymbol{t}-\boldsymbol{h}\), the higher the confidence of the triple and the better the KG is preserved.

To help the triples of the KG obtain optimal scores during this preservation, synthetic fake triples obtained by “negative sampling (see (2))” from the KG are involved in addition to the explicit triples. The triplet loss then demands that the difference of the distance scores between the reals and the fakes be larger than some pre-assigned margin constant, so that, in contrast to the real triples, the score of the fake triples is enlarged to distinguish them from the real ones. This training strategy based on both real and fake triples forms the training objective of the KRL model and is commonly called the margin-based pairwise learning algorithm (Jenatton et al. 2012; Bordes et al. 2014; Zhou et al. 2016), where the constant margin is selected as a hyper-parameter of the model to separate the real scores from the fake scores.

However, this simple constant-margin strategy implies a fixed boundary between the reals and the fakes, which is obviously inconsistent with the complex properties of KGs — the imbalance and heterogeneity shown in Figs. 1 and 2. The imbalance property refers to the fact that each relation occurs in the KG many times and the occurrence frequency differs from relation to relation; the same holds for entities. The heterogeneity is considered as 6 kinds of difference: a) the difference of out-degree nrh among all entities; b) the difference of in-degree nrt among all entities; c) the difference of out-degree nhr among all relations; d) the difference of in-degree ntr among all relations; e) the difference between out-degree nhr and in-degree ntr for a relation; f) the difference of hptr and tphr among all relations,Footnote 1 where these arguments are denoted in Table 1. The above properties of imbalance and heterogeneity imply different knowledge categories — the triples 〈h, r, t〉 in a KG can be categorized into different types in terms of the imbalance and heterogeneity of either the entities h/t or the relation r. Through the visualization of embeddings elaborated in Section 3, we discover that the diversity among knowledge categories brings about diversity among the distribution densities of embedding points during KG preservation. The fixed separating margin is then no longer of the same order of magnitude as each category-specific density, and the previous homogeneous learning strategy is no longer appropriate for representing an imbalanced and heterogeneous KG. Therefore, the separating margin in the original learning algorithm should be adjusted adaptively according to the category-specific density to facilitate the preservation of the KG.

Fig. 1

Imbalance in the FB15k KG. In the Left, Middle and Right, the length of each column represents the occurrence frequency in the KG of each relation, head entity or tail entity, respectively

Fig. 2

Heterogeneity in FB15k KG. Left: The length of each column represents how many entities have the corresponding out-degree nrh or in-degree nrt. Right: Each circle represents a relation. Its coordinate depends on the in-degree nhr and out-degree ntr of the relation, and its color depends on the type of relation: 1-to-1, 1-to-MANY, MANY-to-1 and MANY-to-MANY

Table 1 Denotations

Furthermore, the optimization of the real-triple score and the fake-triple score is given equal importance in the previous margin-based pairwise learning algorithm. However, through visualization, we find that for different knowledge categories, either the real-triple score function f or the fake-triple one is under-restricted to a different degree. Thus, the trade-off between the contributions coming from the real and fake triples should be controlled to a degree that depends on the category of knowledge.

Though many improvements involve redesigning or modifying the basic framework with regard to the semantic measurement of the score function f, such as KG2E (He et al. 2015) and ProjE (Shi and Weninger 2017), the underlying training objective is rarely considered in the literature. Therefore, in this work, we emphasize the high-level objective independent of the concrete form of f. By introducing the concepts of a density-adaptive margin and a density-adaptive weight into the previous margin-based pairwise framework, we propose an Adaptive Weighted Margin Learning (AWML) algorithm which can potentially be incorporated into many existing KRL approaches regardless of their complexity. Besides, we also disambiguate the relations to make the model perform more precisely. In our visualization analysis and experiments, two typical real-world KGs, Freebase and WordNet, are selected to build datasets and carry out evaluation on two tasks, link prediction and triplet classification. The experimental and visualized results demonstrate that our general AWML algorithm can significantly improve the performance of KRL models and yield a more expressive representation.

Contributions

The main contributions of this work are summarized as follows:

  • Through visualization analysis, we explore the category-specific distribution density and discover the inconsistency between the original training objective and the complex properties of KGs.

  • We retrofit the original margin-based pairwise algorithm and propose a novel one by adding an adaptive weight and an adaptive margin into the training objective. The experiments and visualization of AWML both demonstrate its capability of equilibrating all the knowledge categories and controlling the trade-off between the real and fake triples.

  • We evaluate our retrofitted algorithm, AWML, on the tasks of link prediction and triplet classification. The results show empirically that our adaptive methods are powerful on such applications.

Outline

In this work, we propose an adaptive framework appropriate for KRL models. Two adaptive methods are utilized to address the limitations of KRL models. After exploring the representation distribution and the spatial density, we propose a density-adaptive margin and a density-adaptive weight in the training objective of KRL models. The evaluation results on the Freebase and WordNet KGs indicate that our proposed framework helps the KRL model achieve better embeddings in the representation space.

The rest of the paper is organized as follows. In Section 2, we introduce the original margin-based pairwise criterion and the existing KRL models. In Section 3 we visualize the spatial distribution characteristics of the embedding representations and discover two limitations of previous KRL models: a) inflexibility over the importance trade-off, and b) inflexibility over the separating margin. Then in Section 4, we introduce the density-adaptive importance weight and the density-adaptive margin and propose a novel framework, AWML, that can be incorporated into previous KRL models, while in Section 5 we empirically evaluate the proposed learning framework. In Section 6, we discuss our proposed work and analysis method. Finally, we summarize our work and outline future research directions.

2 Related work

In this section, we review the origin of the margin-based training objective and how existing KRL models utilize it. Then, we summarize the triplet score function f of different classical KRL models. Note that our AWML framework is independent of the concrete form of the score function f and can therefore potentially be incorporated into all KRL models.

2.1 Margin-based pairwise learning criterion

The notion of margin was popularized by the widely used SVM classifier (Weston and Watkins 1999; Boser et al. 1992), which maximizes the margin between the training patterns and the decision boundary so that two classes can be separated in the feature space as precisely as possible. To handle multi-class classification, some researchers extended SVM and introduced the margin-based pairwise learning criterion to take all classes into account simultaneously. This form of margin-based pairwise objective has also been applied in knowledge representation to separate the reals from the fakes in the embedding space.

The training objective of a distance-based KRL model is typically to minimize the following margin-based pairwise function:

$$ L(S)=\sum\limits_{\langle h,r,t \rangle \in S} \sum\limits_{\langle h^{\prime},r^{\prime},t^{\prime} \rangle \in S^{\prime}_{\langle h,r,t \rangle} } [\gamma + f(h,r,t) - f(h^{\prime},r^{\prime},t^{\prime})]_{+}, $$
(1)

where the real-triple score f(h, r, t) and the fake-triple score f(h′, r′, t′) are measured simultaneously. The score function f represents the semantic similarity of a triple, i.e. the probability of the triple being true. For distance-based KRL models, the score function f(h, r, t) is designed as some distance restriction among the three components h, r, t of the triple.

Minimizing the training objective not only drives the real-triple score f(h, r, t) lower than all the corresponding fake-triple scores f(h′, r′, t′), but also makes the difference between the two kinds of scores at least as large as a positive constant, the margin γ. A fake triple is sampled by randomly replacing the head, the tail, or the relation of a real triple. The replacement rule is as follows:

$$ S^{\prime}_{\langle h,r,t \rangle}=\{\langle h^{\prime},r,t \rangle|h^{\prime} \in E\} \cup \{\langle h,r^{\prime},t \rangle|r^{\prime} \in R\} \cup \{\langle h,r,t^{\prime} \rangle|t^{\prime} \in E\}, $$
(2)

where E and R refer to the entity set and the relation set in KG respectively.
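As a concrete illustration (our own sketch, not code released with any of the cited models), the pairwise objective (1) with the TransE score and the replacement rule (2) could be written in PyTorch roughly as follows; the tensor names, shapes, and uniform sampling are assumptions.

```python
import torch
import torch.nn.functional as F

def transe_score(h, r, t, p=1):
    # f(h, r, t) = ||h + r - t||_{L_p}; lower means more plausible
    return torch.norm(h + r - t, p=p, dim=-1)

def corrupt(pos, num_entities, num_relations):
    # Replace the head, the relation, or the tail uniformly at random, following (2)
    neg = pos.clone()
    which = torch.randint(0, 3, (pos.size(0),))
    rand_e = torch.randint(0, num_entities, (pos.size(0),))
    rand_r = torch.randint(0, num_relations, (pos.size(0),))
    neg[which == 0, 0] = rand_e[which == 0]
    neg[which == 1, 1] = rand_r[which == 1]
    neg[which == 2, 2] = rand_e[which == 2]
    return neg

def pairwise_margin_loss(pos, neg, ent_emb, rel_emb, gamma=1.0):
    # pos, neg: LongTensors of shape (batch, 3) holding (head, relation, tail) ids
    h, r, t = ent_emb[pos[:, 0]], rel_emb[pos[:, 1]], ent_emb[pos[:, 2]]
    h_, r_, t_ = ent_emb[neg[:, 0]], rel_emb[neg[:, 1]], ent_emb[neg[:, 2]]
    # [gamma + f(real) - f(fake)]_+ as in (1), averaged over the batch
    return F.relu(gamma + transe_score(h, r, t) - transe_score(h_, r_, t_)).mean()
```

Implementations often also re-sample corrupted triples that happen to exist in the KG; we omit that check here for brevity.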

2.2 Existing KRL models

Different KRL models formulate their score function f(h, r, t) based on different designs of semantic similarity measurement, which further lead to various training objectives. In this subsection, we summarize some KRL models and their distinctive similarity measurement of a triple.

Translation-based embedding methods

Inspired by the translation-invariance phenomenon of word embeddings in word2vec (Mikolov et al. 2013), the TransE (Bordes et al. 2013) model regards a relation as an embedding vector r indicating the semantic translation from the head entity h to the tail entity t for each real triple 〈h, r, t〉. To satisfy the approximation h + r ≈ t when the triple 〈h, r, t〉 holds, the score function of a triple is designed as \(\lVert \boldsymbol {h}+\boldsymbol {r}-\boldsymbol {t} \rVert _{L_{n}}\), measuring the Ln-distance between the translated head entity h + r and a tail entity t.

Compared to traditional methods, the TransE model balances effectiveness and computational cost well, but its over-simplified translation assumption encounters challenges when dealing with complicated relations, including 1-to-MANY, MANY-to-1, and MANY-to-MANY relations (Bordes et al. 2013). To address this problem, TransH (Wang et al. 2014), TransR (Lin et al. 2015b) and TransD (Ji et al. 2015) translate embeddings based on relation-specific hyperplanes, relation-specific entity projections and relation-specific dynamic mappings, respectively. However, in TransR, simple relations may be overfitted and complex relations may be underfitted because every relation (no matter complex or simple) has the same number of parameters to learn. KG2E (He et al. 2015) and TransG (Xiao et al. 2016) retrofit the model with Gaussian probability distributions; KG2E performs relatively well on 1-to-N and N-to-1 relations. Furthermore, some KRL models enhance translation-based models with information beyond the triple-based semantics inherent in the graph structure: for instance, PTransE (Lin et al. 2015a) utilizes path information between two entities and DKRL (Xie et al. 2016) utilizes entity descriptions.

Other embedding methods

In addition to translation-based models, there are also many other embedding methods following the margin-based pairwise learning criterion. We list seven typical models here; most of their score functions are listed in Table 2, and their relation-dependent parameters are displayed in the last column of Table 2. Note that in Table 2, Mr denotes a transformation matrix specific to the relation r, and h, r and t indicate the embedding vectors of the head h, the relation r and the tail t.

Table 2 Scoring functions on triplet 〈h, r, t〉 of different KRL models, and their relation-dependent parameters

The SE model (Bordes et al. 2009) designs two independent relation-specific projections for head and tail entities and then computes their distance. The SME model (Bordes et al. 2014; 2012) encodes not only each entity but also each relation into a vector and utilizes linear algebra operations in a neural network to capture correlations between entities and relations. NTN (Socher et al. 2013) incorporates second-order correlations into nonlinear neural networks. The ProjE model (Shi and Weninger 2017) utilizes combination operations and non-linear transformations based on neural networks, while Zhao et al. (2017) use a convolutional neural network (CNN) to learn sequential entity and relation representations. The RESCAL model (Nickel and Ring 2012; Nickel et al. 2011) utilizes matrix factorization on a three-dimensional tensor, whose values of 1 for real triples and 0 for fake triples are all approximately factorized into the form hMrt. The HolE model (Nickel et al. 2015) introduces a circular correlation operation ∗ between head and tail to represent the entity pair so that every dimension of the entity embedding is correlated with other dimensions: \([\boldsymbol {h}*\boldsymbol {t}]_{k}=\sum \limits _{i = 0}^{d-1}[\boldsymbol {h}_{i} \boldsymbol {t}_{(i+k)\bmod d}]\).
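To make the circular correlation above concrete, a small NumPy sketch (our own illustration, not the HolE authors' code) computes it both directly from the definition and via the standard FFT identity for correlation:

```python
import numpy as np

def circular_correlation(h, t):
    # [h * t]_k = sum_i h_i * t_{(i+k) mod d}, computed naively from the definition
    d = h.shape[0]
    return np.array([np.sum(h * np.roll(t, -k)) for k in range(d)])

def circular_correlation_fft(h, t):
    # Equivalent O(d log d) computation: corr(h, t) = ifft(conj(fft(h)) * fft(t))
    return np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)))
```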

All these KRL models modify or redesign the semantic measurement f based on the margin-based pairwise training objective (1). However, this form of training objective neglects the complex properties of the KG and treats all knowledge categories equally without discrimination, which limits the performance of the knowledge representation.

3 Objective analysis with visualization

To look deeper into the limitations of the embedding properties of previous works, in this section we analyze the representations by visualizing the embedding space. We take TransE as an analysis example for simplicity; the results can be smoothly extended to other models sharing similar underlying principles with TransE. In particular, for a triple 〈h, r, t〉, the embedding vectors of the three elements are transformed by t-SNE (Maaten and Hinton 2008) into the coordinates of three points in a 2D plane. By plotting sets of triples coming from different relation categories, we observe the distribution patterns and densities of the embedding points, based on which the shortcomings of distance-based KRL models are revealed and the insight for our algorithm improvement is gained.

In the following, we start by introducing our analysis approach for exploring the distribution pattern of the representations. Please note that the representation we observe is the implicit embedding vector of each triple. Taking the TransE model as an example, for the triple 〈h, r, t〉 we take \(\boldsymbol{t}-\boldsymbol{h}\) as the implicit embedding vector of the triple and visualize it in our observation. We then present the phenomenon of relational semantic diversity, give our solution to it to make the model perform more precisely, and take this solution as the prerequisite of our algorithm. Afterwards, through the exploration of distribution density, we explain our idea of adaptivity on the basis of two kinds of inflexibility in previous distance-based KRL models: inflexibility over the importance trade-off and inflexibility over the separating margin.

3.1 Representation distribution and semantic diversity

Representation distribution observation

Structured in the form of a graph, entities and relations are projected into a continuous embedding space by some specific measurement of semantic similarity, which fits the embedding space to the semantic space. To explore such structure of the embedding space and further analyze the performance of KRL models, we visualize the knowledge embeddings with the help of a dimensionality reduction technique, t-SNE (Maaten and Hinton 2008). This technique largely preserves and displays the graph positions and the structure of local graph neighborhoods in the distributed embedding space.

In this paper, we take the TransE model as a proof of principle and take a typical real-world dataset, FB15k (Bollacker et al. 2008), as the visualization dataset, whose statistics and peculiar characteristics are listed in Section 5. As for the triple-wise measurement of the score function f, TransE interprets the distance between the vectors \(\boldsymbol{t}-\boldsymbol{h}\) and r as the semantic similarity of a triple 〈h, r, t〉. Once the training objective of TransE has been optimized over the whole KG for long enough, all the implicit vectors \(\boldsymbol{t}-\boldsymbol{h}\) of the real triples with the same relation will eventually form a single cluster near the relation r in the embedding space; they are not required to collapse to a single point, merely to be closer to each other than to any offset with a different relation.

Thus, we take the training triples 〈h, r, t〉 with the same relation r as a category of knowledge, i.e., an observed collection. The training set S is then divided into multiple triple categories Sr: S = {Sr | r ∈ R}. In Figs. 3 and 4, we visualize the specific relation r and all the entity-pair offsets \(\boldsymbol{t}-\boldsymbol{h}\) for each category Sr. To explore whether the common margin-based learning criterion is capable of capturing the complex interaction patterns between entities and relations, we observe whether the spatial distribution matches the triplet semantics. Furthermore, to analyze the pairwise objective, we visualize not only the golden entity-pair offsets but also the synthetic ones in Fig. 4.
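A minimal sketch of this analysis step is given below (our own illustration; the variable names and the use of scikit-learn's t-SNE are assumptions). It collects the offsets t − h of one category Sr and projects them, together with the relation embedding r, to the 2D plane:

```python
import numpy as np
from sklearn.manifold import TSNE

def project_category_offsets(ent_emb, rel_emb, triples, rel_id):
    # triples: array of (head, relation, tail) ids; keep only the category S_r
    cat = triples[triples[:, 1] == rel_id]
    offsets = ent_emb[cat[:, 2]] - ent_emb[cat[:, 0]]        # t - h for each golden triple
    points = np.vstack([offsets, rel_emb[rel_id][None, :]])  # append the relation embedding r
    coords = TSNE(n_components=2, init="random").fit_transform(points)
    return coords[:-1], coords[-1]                           # offset coordinates, relation coordinate
```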

Fig. 3

Visualization results of TransE embedding vectors with t-SNE dimension reduction. Four relations \((a \sim d)\) are chosen from FB15k. A black star denotes each relation embedding r, and a colorful dot denotes the entity-pair offset \(\boldsymbol{t}-\boldsymbol{h}\) of each golden triple. Different colors or symbols represent different latent semantics of a specific relation

Fig. 4

Visualization results of CTransE embedding vectors with t-SNE dimension reduction. Four relations \((a \sim d)\) are chosen from the clustered relation set Rc that contains 2291 relations, and each clustered relation is denoted in the form RelationN. Each graph is visualized in the same size in the 2D plane. A black star denotes each relation embedding r, and a colorful dot denotes the entity-pair offset \(\boldsymbol{t}-\boldsymbol{h}\) of each triple as shown in the legend: red dots represent golden triples and dark dots represent synthetic triples. Some concrete synthetic triples are marked in each graph, whose semantics are shown in Table 3

Please note that here, the golden entity pair refers to the pair of head and tail entities in a real triple, i.e. 〈h, t〉 in the real-life triple 〈h, r, t〉. The synthetic entity pair refers to the entity pair in a fake triple, i.e. 〈h′, t〉 or 〈h, t′〉, where h or t of the golden triple 〈h, r, t〉 is randomly corrupted by another entity.

Relational semantic diversity

First, we are interested in the limitation of TransE caused by relational semantic diversity. To do so, we visualize the embedding results of the triples for all relations from FB15k, and randomly pick 4 of them to display in Fig. 3. As shown in Fig. 3a, the embedded relation r = Award-Nominee is plotted in the center as a black star, and the Triples 〈h, r, t〉 containing r are also plotted. To clearly demonstrate the embedding accuracy, rather than the individual embedding vectors of h and t, we only plot the difference \(\boldsymbol {\hat {r}} = \boldsymbol {t}-\boldsymbol {h}\), i.e. the implicit embedding vector of the Triple, as a point in the 2D plane for each Triple. As commonly regarded, the closer \(\boldsymbol {\hat {r}}\) is to r, the more appropriate the embedding of the Triple is. However, as we can see, the embeddings of \(\boldsymbol {\hat {r}}\) do not closely center around that of r. In fact, they clearly present a clustering characteristic, and we plot each cluster with a different color in Fig. 3a for emphasis.

To understand the underlying cause of the multi-cluster phenomenon of the \(\boldsymbol {\hat {r}}\)'s, we use the Google Knowledge Graph Search APIFootnote 2 to collect the semantics of the entity pair in each Triple and thereby obtain the relation semantics. We then discover that different clusters represent different latent semantics, as shown in the legend of each visualization result in Fig. 3.

As shown in Fig. 3a, the relation Award-Nominee has five latent semantics: MusicComposing-related, MusicSinging-related, FilmActing-related, FilmDirecting-related and Literature-related, and some example Triples are given in Table 3. For instance, the FilmDirecting-related latent semantic of the Triple 〈 Academy Award for Best Film Editing, Award-Nominee, Robert Wise (a film director) 〉 depends on its entity pair 〈 Academy Award for Best Film Editing, Robert Wise (a film director) 〉, while the Literature-related latent semantic of the Triple 〈 Nobel Prize in Literature, Award-Nominee, Thomas Mann (a novelist) 〉 depends on its entity pair 〈 Nobel Prize in Literature, Thomas Mann (a novelist) 〉. This property of relational semantic diversity in KG leads to the distributional divergence of \(\boldsymbol {\hat {r}}\) under the embeddings of TransE-like models.

Table 3 Multiple latent semantics of the relation Award-Nominee

Therefore, it is unsuitable for TransE-like models to learn a unique embedding r for a multi-semantic relation, which may be under-representative for fitting all entity pairs under this relation. To better model these relations, we segment each category of triples Sr into several groups by clustering, following the idea of CTransR (Lin et al. 2015b). Afterwards, a separate embedding vector is learned by the KRL model for each latent semantic. Specifically, each relation r is multi-projected into the embedding space as {r1, r2, ⋯, rn}, each of which characterizes one latent semantic of the relation r, and the number n is decided by the clustering result. In the following, we denote each multi-projected relation in the form RelationN. For instance, in Fig. 3c, we distinguish the three clusters Contains1, Contains2 and Contains3, whose relational semantics are automatically clustered to represent the meaning of the associated entity pairs.

In the rest of this paper, we refer to a cluster-based TransE-like model as CTransXFootnote 3 and take CTransX as a proof of principle to conduct the following visualization and experiments. The total numbers of relations after clustering for some KRL models are listed in Section 5. Taking CTransE as an example, we finally obtain 2291 relational embeddings over the 1345 relations in FB15k. In other words, there are 2291 knowledge categories after clustering: \(\{S_{r_{1}},S_{r_{2}}, \cdots ,S_{r_{2291}}\}\).
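The multi-projection step could be sketched as follows (our own illustration; the choice of k-means and of the cluster number are assumptions, since the text only states that n is decided by the clustering result):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_relation(ent_emb, triples, rel_id, n_clusters):
    # Cluster the golden entity-pair offsets t - h of one relation into sub-relations
    cat = triples[triples[:, 1] == rel_id]
    offsets = ent_emb[cat[:, 2]] - ent_emb[cat[:, 0]]
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(offsets)
    # Each cluster center initializes one sub-relation embedding r_1 ... r_n;
    # each triple is reassigned to the sub-relation given by its cluster label.
    return km.cluster_centers_, km.labels_
```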

3.2 Inflexibility over importance trade-off

In addition to the semantic diversity, we also explore whether the golden or the synthetic triples are insufficiently restricted in the training objective of KRL models. In other words, the question we consider is whether the spatial distribution of the learned embeddings matches the triple restriction of TransE, \(\boldsymbol{t}-\boldsymbol{h} \approx \boldsymbol{r}\), and whether the learning restrictions on the goldens and the synthetics are out of balance.

To this end, when visualizing the embeddings, we consider not only the golden Triples 〈h, r, t〉 but also the synthetic Triples 〈h′, r, t〉 or 〈h, r, t′〉, each of which is plotted as a point in the 2D plane. To display the distributional correlation of the goldens and the synthetics, we pick 4 typical relations to show in Fig. 4. The position of each Triple depends on the difference of its tail and head: \(\boldsymbol {\hat {r}}=\boldsymbol {t}-\boldsymbol {h}\) for the goldens (red dots), and \(\boldsymbol {\hat {r}^{\prime }}=\boldsymbol {t}-\boldsymbol {h^{\prime }}\) (purple dots) or \(\boldsymbol{t^{\prime}}-\boldsymbol{h}\) (blue dots) for the synthetics. As commonly regarded, the closer \(\boldsymbol {\hat {r}}\) is to r and the further \(\boldsymbol {\hat {r}^{\prime }}\) is from r,Footnote 4 the more appropriate the embedding of the Triple is.

Nevertheless, as can be seen from Fig. 4, for some relations such as Job-Film2 and Actor-Film3, there is a large deviation between the relation embedding r and the golden entity-pair cluster, which is contrary to the golden-triple restriction of TransE, \(\boldsymbol{t}-\boldsymbol{h} \approx \boldsymbol{r}\). Consequently, we attempt to move the relation embedding r to the center of the golden cluster by minimizing the golden-triple score function f(h, r, t) regardless of the synthetic one f(h′, r′, t′). Surprisingly, we find that, over the total 1345 categories of knowledge, there are 396 categories whose evaluation results are improved (under the MeanRank metric, which will be elaborated in Section 5); 230 categories even improve by 10% and 35 categories by 50%. This phenomenon indicates that for some categories, the golden triples lack restriction in the previous work and should receive more attention in the training objective.

On the other hand, as shown in each graph, for some synthetic entity pairs 〈h′, t〉 or 〈h, t′〉 that are semantically irrelevant to the relation r, their offsets \(\boldsymbol {\hat {r}^{\prime }}\) are interwoven with the golden entity-pair offsets \(\boldsymbol {\hat {r}}\) or lie in the neighborhood of the relation r. For instance, in Fig. 4b, the synthetic entity pair of Trip.3, 〈 FilmFlex (a company), Kung Fu Panda 2 (a film) 〉, is actually connected by the relation Distributor-Film3 in the KG, as shown by Trip.4 in Table 4, but its offset \(\boldsymbol {\hat {r}^{\prime }}\) lies in the neighborhood of the embedding of the relation Actor-Film2, which is semantically different from the relation Distributor-Film3. This phenomenon also exists for other synthetic triples in Table 4. The above problem reveals the under-restriction of the synthetic triples for some knowledge categories.

Table 4 Triple examples in Fig. 4

Consequently, we can say that for any category, either the golden triples or the synthetic triples are under-restricted in the previous KRL models. Hence, for some categories of triples, their importance in the training objective should be fine-tuned; the previous KRL models are inflexible in the trade-off between the goldens' importance and the synthetics' importance. This is exactly why we name this section "Inflexibility over importance trade-off". Based on this inflexibility, the contributions of the two restrictions f(h, r, t) and f(h′, r′, t′) should be controlled flexibly.

In the work of Miyamoto and Cho (2016), a gate is utilized to combine word-level and character-level representations. Moreover, Yang et al. (2016) improve the gating mechanism by using an adaptive gate that adaptively finds the optimal mixture of the two inputs. Inspired by these two works, we adopt an adaptive weight to control the contributions coming from the goldens and the synthetics in our proposed framework. The details of our framework are elaborated in Section 4.

3.3 Inflexibility over separating margin

In the above subsection, we discovered the inflexibility over the importance trade-off between the golden restriction and the synthetic restriction in the KRL training objective. With the same visualization method, in this subsection we explore the spatial density of the embedding distribution. Note that, in our work, the spatial density indicates whether the embedding dots are distributed densely or sparsely in the representation space.

From Fig. 4, we can discover that the spatial density of the golden entity-pair cluster (red dots) varies from relation to relation. For instance, the golden cluster of the relation Job-Film2 has a higher density than that of the relation Actor-Film3, even though they have a similar number of golden triples: 992 and 1016, respectively. This phenomenon derives from the properties of heterogeneity and imbalance existing in the KG.

Take TransE as an example: though every triple is restricted by the approximation \(\boldsymbol{t}-\boldsymbol{h} \approx \boldsymbol{r}\), if there are too many entity pairs 〈h, t〉 connected by the identical relation r in the KG, the corresponding offsets \(\boldsymbol {\hat {r}}=\boldsymbol {t}-\boldsymbol {h}\) may be projected to relatively scattered positions in the embedding space, because there is insufficient space in the neighborhood of the relation embedding r to accommodate so many embedded entity-pair offsets \(\boldsymbol {\hat {r}}\). Take a 1-to-MANY relation as another example: for triples 〈h, r, t〉 with the same relation r and the same head entity h, if the semantics of their tail entities t are totally distinct, these tail entities will be projected to distinct positions. Therefore, the spatial density of the clusters of entity-pair offsets \(\boldsymbol {\hat {r}}\) varies from relation to relation following the occurrence frequency of the corresponding relations r, as shown in Fig. 4; the same holds for the density of the head h or tail t clusters, which are not shown in this paper.

How does occurrence frequency affect spatial density?

To explore the correlation between occurrence frequency and spatial density, for each knowledge category Sr we calculate the distance dr as the inverse indicator of spatial density, and hptr and tphr (see Table 1) as the occurrence frequencies of head h and tail t, where dr is the average mutual distance within the corresponding cluster of golden offsets. We then scatter each knowledge category in Fig. 5 and discover that if chr and ctr for a specific relation r are almost the same (the marked area), dr is relatively small, while if chr and ctr differ greatly, dr is relatively large. In other words, the former entity-pair offsets \(\boldsymbol {\hat {r}}\) cluster compactly and have high spatial density, while the latter entity-pair offsets \(\boldsymbol {\hat {r}}\) cluster discretely and have low spatial density. More generally, because of the diversity of occurrence frequency, i.e. the imbalance and heterogeneity existing in the KG, the clusters of \(\boldsymbol {\hat {r}}\) for different categories are distributed with different densities after the projection of TransE-like models.

Fig. 5

The correlation between the frequency of knowledge and spatial density. Each circle indicates a knowledge category. Its size depends on the average mutual distance dr within the corresponding cluster of golden offsets \(\boldsymbol{\hat{r}}\), and its coordinate depends on the cardinalities of the head and tail arguments chr and ctr. Note that only 200 of the total 2291 categories are scattered in the figure, but these categories contain 425,464 of the total 483,142 triples

How does spatial density affect embedding performance?

To further explore the connection between the category-specific density and the embedding properties of distance-based models, we display the correlation between the spatial density and the evaluation result in Fig. 6. The evaluation results come from the task of link prediction (Bordes et al. 2013) with two metrics, MeanRank and Hits@10, which will be elaborated in Section 5. The lower the MeanRank or the higher the Hits@10, the better the KRL model performs.

Fig. 6

The correlation between spatial density and evaluation result. Each dot indicates a category of knowledge, and 200 categories are scattered in the figure

Surprisingly, we discover that most categories with large dr perform poorly in the evaluation, while those with small dr perform well. The poor results are possibly caused by the inflexible separating margin and the inflexible importance trade-off between the golden Triple 〈h, r, t〉 and the synthetic Triple 〈h′, r, t〉 or 〈h, r, t′〉, which are unsuitable for the category-specific density. For knowledge categories with large dr, the cluster of golden offsets \(\boldsymbol {\hat {r}}\) is distributed so discretely that the synthetic offsets \(\boldsymbol {\hat {r}^{\prime }}\) need to be pushed further away from the golden cluster. Thus, the original fixed separating margin is too small to separate the synthetics \(\boldsymbol {\hat {r}^{\prime }}\) from the golden cluster, and the synthetic-triple score function f(h′, r′, t′) needs to be restricted more sufficiently.

Since the spatial density differs greatly across knowledge categories, the margin separating the goldens f(h, r, t) from the synthetics f(h′, r′, t′) should also vary accordingly. Motivated by the work of Wang et al. (2017), which uses an adaptive margin-based hinge loss function, we also adopt margin adaptation and make the margin in our loss function adaptive to the spatial density of the representation. In this way, we can adaptively control the degree of separation between the goldens and the synthetics. This part is elaborated in Section 4.

4 Our adaptive learning methods

In Section 3, we displayed the category-specific density and explained theoretically why we should adaptively choose the optimal margin and the optimal weight for each knowledge category to obtain a better embedding performance. In this section, we dive into the mathematical and algorithmic details of our adaptive learning methods and give a general framework that can be incorporated into any distance-based KRL model. Note that we take the clustering as the prerequisite of our proposed AWML framework.

4.1 Density-adaptive margin

When projecting the entities and relations into the embedding space, the distribution density of embedding points differs from category to category, which derives from the imbalance and heterogeneity of the KG. Consequently, an identical margin cannot be of the same order of magnitude as all the category-specific densities, which leads to poor embedding performance. The margin separating the goldens f(h, r, t) from the synthetics f(h′, r′, t′) should therefore also vary across categories.

To overcome this shortcoming of previous KRL models, we adopt a method of margin adaptation motivated by the work of Wang et al. (2017), where an adaptive margin-based hinge loss function is used to improve the stability and performance of GANs. Similarly, an adaptive margin in the KRL training objective can separate the goldens and the synthetics more appropriately and improve the embedding performance.

Furthermore, since the spatial density is closely associated with the embedding performance (see Fig. 6) and lies in the same spatial sense as the separating margin, we can make the separating margin adaptive to the spatial density of the representation. In this way, we can adaptively control the degree of separation between the goldens and the synthetics.

Therefore, in this work we propose an Adaptive Margin Learning (AML) method whose training objective is as follows, where all loss terms are divided by the number of summands in a batch:

$$ L(S)=\sum\limits_{\langle h,r,t \rangle \in S} \sum\limits_{\langle h^{\prime},r^{\prime},t^{\prime} \rangle \in S^{\prime}_{\langle h,r,t \rangle} } [\gamma_{r} + f(h,r,t) - f(h^{\prime},r^{\prime},t^{\prime})]_{+}, $$
(3)

The overall form of the training objective is the same as (1), except that the separating margin γr is adaptive to the category-specific density:

$$ \gamma_{r}=\gamma_{m} \cdot \sigma (w_{m} \times dens_{r}^{-1} + b_{m}), $$
(4)

where the hyperparameter γm controls the whole range of the adaptive margin and constrains γr to the range from 0 to γm, σ(⋅) is the sigmoid function, and \(w_{m}, b_{m} \in \mathbb {R}\) are the weight and bias parameters learned in the training process.

The distribution density is inversely proportional to the average mutual distance of all the golden entity-pair offsets:

$$ dens_{r}^{-1}=\frac{1}{|S_{r}|^{2}}\sum\limits_{\langle h_{1},r,t_{1} \rangle \in S_{r}} \sum\limits_{\langle h_{2},r,t_{2} \rangle \in S_{r}} \lVert \boldsymbol{\widetilde{r}}_{\langle h_{1},t_{1} \rangle} - \boldsymbol{\widetilde{r}}_{\langle h_{2},t_{2} \rangle} \rVert_{L_{n}}, $$
(5)

where Sr is the set of golden triples with the specific relation r, and \(\widetilde {\boldsymbol {r}}_{\langle h,t \rangle }\) is the approximation of r with regard to h and t, where the embedding vectors h and t are obtained from the pre-trained KRL model. Every KRL model has its own approximation of r according to its distance-based score function: for instance, \(\boldsymbol {\widetilde {r}}_{\langle h,t \rangle } = \boldsymbol {t}-\boldsymbol {h}\) in TransE, and \(\boldsymbol {\widetilde {r}}_{\langle h,t \rangle } = \boldsymbol {t}\boldsymbol {M}_{r} - \boldsymbol {h}\boldsymbol {M}_{r}\) in TransR. Remark that the above density calculation only suits translation-based KRL models; the calculation method for other distance-based KRL models is discussed in Section 6.

In the training process of the KRL model, when the average mutual distance is relatively large for some category, the separating margin accordingly becomes larger to push the synthetic entity-pair offsets \(\boldsymbol {\hat {r}^{\prime }}\) further away from the relation embedding r; otherwise, the margin becomes smaller.
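A minimal sketch of the precomputed inverse density (5) and the adaptive margin (4), under assumed tensor names (the paper does not specify implementation details such as how the pairwise distances are batched):

```python
import torch

def inverse_density(offsets, p=1):
    # dens_r^{-1}: average mutual L_p distance among the golden offsets of one category S_r,
    # following (5); offsets has shape (|S_r|, dim)
    diffs = offsets.unsqueeze(0) - offsets.unsqueeze(1)      # (|S_r|, |S_r|, dim)
    return torch.norm(diffs, p=p, dim=-1).mean()

def adaptive_margin(inv_dens, w_m, b_m, gamma_m=2.0):
    # gamma_r = gamma_m * sigmoid(w_m * dens_r^{-1} + b_m), so 0 < gamma_r < gamma_m, as in (4)
    return gamma_m * torch.sigmoid(w_m * inv_dens + b_m)
```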

4.2 Density-adaptive importance weight

In addition to adaptively controlling the separation between the goldens and the synthetics, we also consider the trade-off between the different contributions coming from the golden and synthetic triple among all the knowledge categories.

Since either the golden triples or the synthetic triples are under-restricted in previous KRL models, the trade-off between the goldens' importance and the synthetics' importance is inflexible, and their respective importance in the training objective should be fine-tuned. Based on this inflexibility, we should flexibly control the contributions of the two restrictions f(h, r, t) and f(h′, r′, t′).

Inspired by the work of Miyamoto and Cho (2016) and the work of Yang et al. (2016), we adopt an adaptive weight to control the contributions coming from the goldens and the synthetics in our proposed framework. In this way, the KRL training objective has ability to learn the goldens and the synthetics with adaptive importance.

Hence, in this work, importance weights for the golden-triple and synthetic-triple score functions are introduced into the margin-based pairwise training objective. To adaptively select the optimal trade-off for every category and make it suitable for the category-specific density, we propose another adaptive learning method, Adaptive Weighted Learning (AWL), as a framework to be incorporated into KRL models. The training objective takes the form:

$$ L(S)=\sum\limits_{\langle h,r,t \rangle \in S} \sum\limits_{\langle h^{\prime},r^{\prime},t^{\prime} \rangle \in S^{\prime}_{\langle h,r,t \rangle} } [\gamma_{u} + (1-\mu_{r}) f(h,r,t) - \mu_{r} f(h^{\prime},r^{\prime},t^{\prime})]_{+}, $$
(6)

where γu is a fixed margin hyper-parameter. The two score functions f(h, r, t) and f(h′, r′, t′) are mixed by a category-specific weight μr that depends on the density:

$$ \mu_{r}=\frac{\beta+\sigma (w_{u} \times dens_{r}^{-1} + b_{u})}{2\beta+ 1}, $$
(7)

where \(w_{u}, b_{u} \in \mathbb {R}\) are the weight and bias parameters. The form \(\frac {\beta +\sigma (\cdot )}{2\beta + 1}\) controls the range of μr and keeps it around 0.5. The density is calculated in the same way as in (5).

For a knowledge category with low density (i.e. a large average mutual distance), the importance weight of the synthetic-triple score μr becomes larger and the distance measurement of the synthetics is more restricted, so that the synthetic offset points are pushed further away from the golden cluster. If the weight μr becomes larger than 0.5, the score function of the synthetic triples contributes more to the training objective than that of the golden triples in the subsequent training process.
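Analogously, the category-specific weight (7) could be computed as in the following sketch (our own illustration with assumed parameter names):

```python
import torch

def adaptive_weight(inv_dens, w_u, b_u, beta=24.5):
    # mu_r = (beta + sigmoid(w_u * dens_r^{-1} + b_u)) / (2*beta + 1), kept close to 0.5, as in (7)
    return (beta + torch.sigmoid(w_u * inv_dens + b_u)) / (2 * beta + 1)
```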

4.3 Training objective of AWML framework

In our AWML framework, the training objective is as follows:

$$ L(S)=\sum\limits_{(\langle h,r,t \rangle,\langle h^{\prime},r^{\prime},t^{\prime} \rangle) \in T_{batch} } [\gamma_{r} + (1-\mu_{r}) f(h,r,t) - \mu_{r} f(h^{\prime},r^{\prime},t^{\prime})]_{+} $$
(8)

In (8), γr denotes a density-adaptive margin if we choose margin adaptation (the AML framework); otherwise it denotes a fixed constant (a hyper-parameter). One category of knowledge has a specific margin, so γr is relation-specific, which means that the separation of the goldens and the synthetics is adaptive to the knowledge category. Similarly, μr denotes a density-adaptive weight if we choose the AWL framework; otherwise it is the constant 0.5.

As for the synthetic triples, they are constructed following (2), which differs from some other KRL models: in our synthetic-triple construction rule, the relation is additionally considered when corrupting a triple. This makes the KRL model appropriate not only for link prediction but also for the triplet classification task.Footnote 5

The objective favors lower scores for golden triples compared with synthetic triples, and it restricts the golden-triplet score function with the importance weight 1 − μr and the synthetic-triplet one with the weight μr. If the category-specific weight μr becomes larger than 0.5, the model will try hard to maximize the synthetic-triple score function, and the minimization of the golden-triple score function will be largely ignored.

Algorithm implementation

Algorithm 1 summarizes the whole AWML training process; the margin or the weight can be chosen to be adaptive respectively.Footnote 6 The AWML framework initializes the entity and relation embeddings randomly (Bordes et al. 2013) or from pre-trained embeddings. We use the symbol ΨG to denote any explicit KRL model, such as TransE. The learning method is decided by ΨG: for instance, in the work of TransE, the widely used stochastic gradient descent (SGD) method is used to learn the embeddings, while in the work of Minervini et al. (2016), an adaptive learning approach, AdaGrad (Duchi et al. 2011), is utilized.


Before training the model, two steps should be conducted. First, we perform the multi-projection on each relation: we cluster all the entity-pair offsets \(\boldsymbol{t}-\boldsymbol{h}\) for each knowledge category to construct the clustered relation set Rc. In this way, each original relation has one or more sub-relations; therefore, when incorporated into our AWML framework, the number of relation embeddings increases compared with the original KG (see Table 6). Second, we calculate each category-specific density according to (5).

Afterwards, we loop over the training process following our training objective (see (8)). In the objective, the triplet score function f is decided by the specific KRL model, e.g. \(f(h,r,t)=\lVert \boldsymbol {h}+\boldsymbol {r}-\boldsymbol {t} \rVert _{L_{n}}\) in TransE. In each training epoch, we first normalize the entity embeddings.Footnote 7 Then we construct the set of triple pairs — the goldens and the synthetics — following the synthetic-triple construction rule (see (2)). Finally, we calculate the loss based on the training objective (8) and update all the relation embeddings r and entity embeddings e. Additionally, for the AML framework, wm and bm in (4) are also updated so that the adaptive margin γr can change adaptively according to the precalculated density; for the AWL framework, wu and bu in (7) are also updated.
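Putting the pieces together, one AWML training step could look roughly like the sketch below (our own illustration under the assumptions above, with TransE as the base model ΨG; it reuses the hypothetical helpers transe_score, corrupt, adaptive_margin and adaptive_weight defined in the earlier sketches, and omits the per-epoch entity normalization):

```python
import torch

def awml_step(batch, ent_emb, rel_emb, inv_dens, w_m, b_m, w_u, b_u,
              optimizer, num_entities, num_relations, gamma_m=2.0, beta=24.5):
    # batch: (B, 3) golden triples; inv_dens: precomputed dens_r^{-1} per sub-relation
    neg = corrupt(batch, num_entities, num_relations)
    h, r, t = ent_emb[batch[:, 0]], rel_emb[batch[:, 1]], ent_emb[batch[:, 2]]
    h_, r_, t_ = ent_emb[neg[:, 0]], rel_emb[neg[:, 1]], ent_emb[neg[:, 2]]

    d = inv_dens[batch[:, 1]]                                    # category-specific dens_r^{-1}
    gamma_r = adaptive_margin(d, w_m, b_m, gamma_m)              # eq. (4), AML
    mu_r = adaptive_weight(d, w_u, b_u, beta)                    # eq. (7), AWL

    loss = torch.relu(gamma_r
                      + (1 - mu_r) * transe_score(h, r, t)
                      - mu_r * transe_score(h_, r_, t_)).mean()  # eq. (8)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the embeddings and the scalars w_m, b_m, w_u, b_u are assumed to be learnable parameters registered with the optimizer; which of the two adaptations is active simply corresponds to whether gamma_r or mu_r is replaced by a constant.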

Comparisons of complexity

Table 5 lists the complexities of the original KRL model ΨG and the ΨG+AWL/AML model. Compared with the baseline model CTransE, when incorporated into our adaptive framework, the number of parameters increases by \(\mathcal {O}(2N_{r})\) because of the weight or the margin adaptive to the relation r; since both a weight and a bias parameter are learned, Nr is multiplied by 2. The time complexity of ΨG+AWL/AML is similar to that of ΨG except for 1) the time consumed on weight or margin learning (hence a factor α > 1 before No) and 2) the time consumed on the density calculation \(\mathcal {O}({\sum }_{i = 1}^{N_{r}} \binom {n_{i}}{2})\) before training, where ni denotes the number of triplets containing the same sub-relation. Hence, incorporated into our framework, the KRL model remains efficient in time complexity.

Table 5 Complexity of AWML framework

5 Experiments

To evaluate our proposed framework AWML, we respectively incorporate AML and AWL into 3 cluster-based KRL models: CTransE, CTransE(AG) and CTransR, each of which is taken as a baseline in this work. The original approachesFootnote 8 of these three cluster-based KRL models before clustering are TransE (Bordes et al. 2013), TransE(AdaGrad) (Minervini et al. 2016), which uses an adaptive learning rate during representation learning, and TransR (Lin et al. 2015b), which adopts relational projection on entities. In the comparison experiments, we test the performance of CTrans{E, E(AG), R} + AML and CTrans{E, E(AG), R} + AWL on link prediction and triplet classification, and conduct visualization analysis of the embeddings.

Please note that we compare the CTransX+AWL/AML models with the CTransX models, not the TransX models. Therefore, in the following tables, we put the evaluation results of TransX in brackets.

5.1 Experimental settings

Dataset

The datasets we adopt are publicly available from two widely used knowledge graphs, WordNet (Miller 1995) and Freebase (Bollacker et al. 2008). WordNet is a lexical ontology for the English language. In WordNet, each entity represents a synset consisting of several words, and a word can belong to different synsets. Relationships between synsets include hypernym, hyponym, meronym, holonym, troponym and other lexical relations. As for Freebase, this large collaborative KG consists of a huge number of real-life facts and contains various entities such as people, places and events. WordNet and Freebase are so typical and popular that hundreds of Knowledge Representation Learning (KRL) models adopt them to evaluate their performance. Among all the subsets of WordNet and Freebase, we employ WN18 and FB15k used in Bordes et al. (2013), respectively; their statistics are listed in Table 6.

Table 6 Statistics of dataset and the results of clustering

In Section 3, we propose the relation multi-projection to disambiguate relations by clustering. After multi-projection, we obtain more relational embeddings than the original model ΨG. We list the number of relations #Rel after clustering in Table 6.

Implementation details

We train each evaluation model until convergence using mini-batch SGD for CTransE/CTransR and AdaGrad (Duchi et al. 2011) for CTransE(AG) with learning rate λ = 0.01. As for parameter regularization, we apply the L2 regularizer to all parameters for CTransE and CTransR on the FB15k dataset, and the L1 regularizer for the other evaluations. Training time was limited to at most 2000 epochs over the training set. For the three baseline models CTrans{E, E(AG), R}, we attempt several settings (Bordes et al. 2013; Minervini et al. 2016; Lin et al. 2015b) on the validation dataset to get the best configurations, which are: embedding dimension k = 20, distance measurement d = L1, fixed margin γ = 2 for CTransE and CTransE(AG) on WN18; k = 50, d = L1, γ = 2 for CTransR on WN18; k = 50, d = L2, γ = 0.5 for CTransE and CTransR on FB15k; and k = 50, d = L1, γ = 1.0 for CTransE(AG) on FB15k.

For the models incorporated into the AWML framework, CTransX + AML and CTransX + AWL, we fix k and d to the same settings as CTransE. The other hyper-parameters in our framework are: γm in AML, γu and β in AWL, and the learning rate λ. We select λ from {0.01, 0.02}, γm from {1, 2, 4}, γu from {0.05, 0.25, 0.5}, and β from {12, 24.5, 49.5} so that μr and 1 − μr are not in great disparity. We use the MeanRank metric described in the following evaluation protocol to select parameters on the validation set for both frameworks, AML and AWL, and for both initialization methods, random and pre-trained. For CTransX + AWL, the selected parameters are λ = 0.01, γu = 0.25 and β = 24.5, whether randomly or pre-trainedly initialized. Note that β = 24.5 makes the adaptive weight μr range from 0.49 to 0.51. For CTransX + AML, the selected parameters are λ = 0.02 when randomly initialized and λ = 0.01 when pre-trainedly initialized, with γm = 2 on the WN18 dataset and γm = 1 on the FB15k dataset.

5.2 Link prediction

Link prediction is a classical evaluation task that concentrates on the quality of knowledge representation (Bordes et al. 2013). This task aims to complete a triple when the head or tail is missing, which can be viewed as a simple question answering task. Similar to the settings in Minervini et al. (2016) and Bordes et al. (2013), the task returns a list of candidate entities from the KG instead of one best answer.

Evaluation protocol

We use two evaluation metrics following Bordes et al. (2009): MeanRank and Hits@n. For each test triplet, we corrupt the head or tail with every other entity in the entity set E in turn and calculate the f scores for the test triplet and all the corrupted triplets. We then rank these triplets by their scores in descending order and obtain the rank of the correct entity. If the rank of the correct entity is smaller than or equal to n, Hits@n for the test triplet is 1, otherwise it is 0. For all the triplets in the test data, we repeat the same procedure and report the MeanRank and the mean Hits@n for each n ∈ {1, 3, 10}.Footnote 9 We report the average scores of head prediction and tail prediction as the final evaluation results. Clearly, a good predictor has a lower MeanRank and a higher Hits@n.
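Given the ranks of the correct entities over the test set, the two metrics reduce to the following small sketch (our own illustration):

```python
def mean_rank_and_hits(ranks, n=10):
    # MeanRank: average rank of the correct entity; Hits@n: fraction of test triples with rank <= n
    ranks = list(ranks)
    mean_rank = sum(ranks) / len(ranks)
    hits_at_n = sum(r <= n for r in ranks) / len(ranks)
    return mean_rank, hits_at_n
```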

When constructing corrupted triplets, some of them may already exist in the training or validation set, which means they are also real triples; we therefore filter out these triples before ranking the candidate entities. This evaluation setting is denoted Filt., and Raw otherwise.

In the evaluation, we adopt a novel ranking approach appropriate for the cluster-based models in this work, including CTransX, CTransX + AWL and CTransX + AML. Because of the multi-projection in cluster-based models, we do not know which sub-relation embedding should be used when calculating the triplet score f. To solve this problem, we first assign every corrupted entity pair 〈h, t〉 to a sub-relation cluster by finding which sub-relation embedding is the nearest neighbor of the entity-pair offset. Then we use the corresponding sub-relation embedding to accomplish the link prediction task. Finally, the candidate entities are ranked as in the former method. In the following, we call the two cluster-based metrics MeanRank_c and Hits@n_c.
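A sketch of this cluster-based ranking for tail prediction (our own illustration; the filtered setting and the head-prediction direction are omitted, and the data structures are assumptions):

```python
import torch

def rank_tail(ent_emb, sub_rel_emb, sub_rel_of, h, r, t, p=1):
    # sub_rel_emb: embeddings of all sub-relations; sub_rel_of[r] lists the sub-relation ids of r
    candidates = ent_emb                                         # every entity as a candidate tail
    offsets = candidates - ent_emb[h]                            # t_cand - h for all candidates
    subs = sub_rel_emb[sub_rel_of[r]]                            # (n_r, dim)
    # assign each corrupted pair to its nearest sub-relation embedding
    dists = torch.cdist(offsets, subs, p=p)                      # (num_entities, n_r)
    nearest = subs[dists.argmin(dim=1)]
    scores = torch.norm(offsets - nearest, p=p, dim=-1)          # f under the chosen sub-relation
    # 1-based rank of the correct tail t; Hits@n is 1 if this rank <= n
    return int((scores < scores[t]).sum().item()) + 1
```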

For CTransR, since our framework with pre-training almost always performs better than random initialization, we only run the experiments with pre-training. We believe this is sufficient to demonstrate the generalization of our framework to the CTransR model.

Experimental results

The overall resultsFootnote 10 of our frameworks as well as the baselines are shown in Table 7. On the FB15k dataset, all settings of adaptive training bring a pronounced improvement to the original KRL model, no matter which triplet score function f is adopted (TransE or TransR) and no matter which learning method is used (SGD for CTransE or AdaGrad for CTransE(AG)). For the CTransE model, among all the incorporated approaches, AML with pre-trained initialization achieves the best MeanRank_c both in RawFootnote 11 and Filt., and also achieves the best Hits@n_c in the 1, 3 and 10 settings. With AML, the MeanRank_c (Filt.) of CTransE decreases by 10.2 and Hits@10_c increases by 6.0%. For the AdaGrad learning method in CTransE(AG), it is our AML framework that helps CTransE(AG) achieve the best results. Furthermore, the Hits@n_c results are robust to the value of n, which indicates that the performance of our proposed framework is insensitive to the evaluation metric.

Table 7 Link prediction results: test performance of CTransX (the baseline) and CTransX+AWL/AML on the WordNet (WN18) and Freebase (FB15k) KGs

As for the WN18 KG, our framework also maintains competitive performance. Although the performance of the models with random initialization is somewhat poor, among all the compared models the best MeanRank_c is consistently achieved by a model incorporated with our framework AWL or AML. For CTransE+AWL with the pre-trained setting, the MeanRank_c of the Raw and Filt. settings both decrease by nearly 30%; similarly, CTransE(AG)+AWL also decreases considerably. This also indicates that our density-adaptive methods are consistently effective.

Additionally, as defined in Bordes et al. (2013), relations in KBs can be divided into four types according to hptr and tphr (see Table 1): 1-to-1, 1-to-MANY, MANY-to-1 and MANY-to-MANY. Here we demonstrate the performance of AML and AWL incorporated into the baseline model on the different types of relations in Table 8. We can observe that on all 4 types of relations, both AML and AWL consistently achieve significant improvements over the baseline, CTransE.

Table 8 Link prediction on FB15k with respect to different types of relations (%)

5.3 Triplet classification

We also test our model on triplet classification, which is used to evaluate the knowledge representation. This task aims to predict the missing relation given two entities and amounts to a classification task that classifies the testing triple into one of the knowledge categories. Similar to the evaluation protocol of link prediction, this task also returns a list of candidate relations from the KG.

Evaluation protocol

In this task, we corrupt the relation of each testing triple with every other relation in the set R in turn and calculate the f scores. We then rank these triples in descending order. Similar to the link prediction task, we use MeanRank_c and Hits@n_c for n ∈ {1, 3, 10} to evaluate the triplet classification results. In particular, the Hits@1 metric indicates the classification accuracy and only checks whether the first relation in the sorted list is the correct one.

In the triplet classification task, we do not face the problem brought about by multi-projection: we simply use all the sub-relation embeddings in place of the relation for each triple. In addition, we use the Filt. setting in this task to compare all our frameworks with the baseline model.
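A minimal sketch of this protocol follows (hypothetical data structures; we assume the best score over a relation's sub-relation embeddings is taken for each candidate and that lower distance scores rank higher):

    import numpy as np

    def relation_ranks(test_triples, entity_emb, sub_rel_emb):
        """For each test triple, score every candidate relation (best score over
        its sub-relation embeddings), rank the candidates, and return the rank
        of the correct relation.

        entity_emb  : dict  entity   -> (d,) embedding
        sub_rel_emb : dict  relation -> (k, d) matrix of sub-relation embeddings
        """
        ranks = []
        for h, r, t in test_triples:
            offset = entity_emb[t] - entity_emb[h]
            scores = {cand: np.linalg.norm(subs - offset, axis=1).min()
                      for cand, subs in sub_rel_emb.items()}
            ordered = sorted(scores, key=scores.get)   # smaller distance first
            ranks.append(ordered.index(r) + 1)
        return ranks

    def mean_rank(ranks):
        return sum(ranks) / len(ranks)

    def hits_at(ranks, n):
        return sum(rank <= n for rank in ranks) / len(ranks)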

Experimental results

Evaluation results are shown in Table 9. On the FB15k dataset, both the AWL and AML models outperform CTransE on the Hist@1_c metric. On the MeanRank_c metric, our AML model is slightly worse than CTransE, but our AWL model achieves the best MeanRank_c of 2.97 and the best Hist@1_c of 76.3%. The improved Hist@1_c and the comparable MeanRank_c indicate that more testing triples are classified into the correct knowledge category, even though the relations of a few triples are predicted very poorly. Moreover, for each KRL model considered, our framework is able to help it achieve both the best MeanRank_c and the best Hist@n_c.

Table 9 Triplet Classification Results: Test performance of CTransX (the baseline) and CTransX+AWL/AML on the WordNet (WN18) and Freebase (FB15k) KGs

From the experimental results on both tasks,Footnote 12 we conclude that our AWML framework is capable of helping the KRL model learn better entity and relation embeddings for link prediction and triplet classification.

5.4 Time efficiency analysis

In addition to the performance on the two tasks, we also analyze our framework in terms of time efficiency. We list the training time of CTransE(AG) and CTransE(AG)+AWL/AML on both the WN18 and FB15k datasets. As the time complexity analysis in Section 4 suggests, we can see from Table 10 that the KRL model incorporating our framework is nearly as efficient as the original model. The former is slightly more time consuming than the latter because of the factor α > 1 shown in Table 5.

Table 10 Time consumption in the KRL training on WN18 and FB15k datasets

Note that what we compare in Table 10 is the training time of the first 100 epochs, not the training time until convergence. Besides, the CTransE(AG)+AWL and CTransE(AG)+AML models listed in Table 10 are all initialized randomly.

5.5 Visualization analysis

Our AWML framework makes the KRL model adaptive to the knowledge category when learning the embeddings. After training, we compare the representation distributions of CTransE and CTransE+AML through visualization. As before, we use the t-SNE method (Maaten and Hinton 2008) to reduce the representations to a 2-dimensional space. We then visualize all the triples 〈h, r, t〉 of each knowledge category Sr and randomly pick 3 categories to display in Fig. 7. We consider not only the golden triples but also the synthetic ones, and their positions are determined by their entity-pair offsets, \(\boldsymbol {\hat {r}}\) and \(\boldsymbol {\hat {r}^{\prime }}\) respectively.

Fig. 7 Visualization results of CTransE embedding vectors with and without the AML framework. Three relations \((a \sim c)\) are randomly chosen from the clustered relation set Rc that contains 2291 relations, and each clustered relation is denoted in the form RelationN. For each relation, two graphs are displayed to compare the CTransE+AML model and its baseline CTransE, and their link prediction results (MeanRank and HITS@10) are listed in each graph. Each graph is drawn at the same size in the 2-D plane. A black star denotes the relation embedding r and a colored dot denotes the entity-pair offset t − h of each triple, as shown in the legend: a red dot represents a golden triple and a dark dot represents a synthetic triple

As shown in Fig. 7, for each category, the golden entity-pair offsets \(\boldsymbol {\hat {r}}\) are distributed similarly for CTransE+AML and CTransE.Footnote 13 For CTransE+AML, however, the relation embedding r is much closer to the golden cluster, and at the same time the evaluation result is better than that of CTransE. This indicates that the embedding better matches the translation restriction of TransE: t − h ≈ r. In other words, in the distribution space, our adaptive framework is capable of building better representations and thereby improving the performance of the original KRL models.
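A rough sketch of this visualization step is given below (assuming precomputed offset arrays with enough points for the default t-SNE perplexity; this is not the exact plotting code behind Fig. 7):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def visualize_category(golden_offsets, synthetic_offsets, relation_emb, title):
        """Project golden/synthetic entity-pair offsets and the relation embedding
        of one knowledge category into 2-D with t-SNE and plot them together.

        golden_offsets, synthetic_offsets : (n, d) arrays of t - h offsets
        relation_emb                      : (d,) relation embedding r
        """
        points = np.vstack([golden_offsets, synthetic_offsets, relation_emb[None, :]])
        emb2d = TSNE(n_components=2, init='pca', random_state=0).fit_transform(points)

        n_gold = len(golden_offsets)
        plt.scatter(emb2d[:n_gold, 0], emb2d[:n_gold, 1], c='red', s=8, label='golden')
        plt.scatter(emb2d[n_gold:-1, 0], emb2d[n_gold:-1, 1], c='black', s=8, label='synthetic')
        plt.scatter(emb2d[-1, 0], emb2d[-1, 1], marker='*', c='black', s=150, label='relation r')
        plt.title(title)
        plt.legend()
        plt.show()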

6 Discussion

In this section, we first analyze the necessity of multi-projection for KRL models. Then we describe visualization analysis methods and discuss the approximation of r in (5) for distance-based KRL models other than TransE-like models.

Multi-projection of relation and entity

In Section 3, we propose relation multi-projection to disambiguate relations with the help of clustering. As a matter of fact, some other KRL models also embody the idea of multi-projection for entities, such as SE (Bordes et al. 2009), SLM (Socher et al. 2013), SME (Bordes et al. 2014), LFM (Jenatton et al. 2012; Sutskever 2009), NTN (Socher et al. 2013), RESCAL (Nickel et al. 2011; Nickel and Ring 2012), TransH (Wang et al. 2014), TransR/CTransR (Lin et al. 2015b), and TransD (Ji et al. 2015). They all project head and tail entities into a relation-specific space when calculating the triple score function.

Why is the above entity multi-projection effective in building embeddings and improving the performance of KRL models? The reason is that entities in a KG also have semantic ambiguity, and multi-projection disambiguates them. Take SME as an example: for every triple 〈h, r, t〉, the model utilizes a relation-specific matrix Wr to transform the head embedding h into a relation-specific head embedding hr; thus, when measuring the different semantics associated with different relations, an entity is projected into distinct embedding spaces that represent its different contextual situations.
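To make the mechanism concrete, here is a minimal TransR-style sketch of such a relation-specific projection (the actual SME scoring function differs; this only illustrates the projection idea, and all names are ours):

    import numpy as np

    def projected_score(h, t, r, M_r):
        """Relation-specific entity projection (TransR-style illustration):
        project h and t with the relation matrix M_r, then apply the
        translation-based score in the relation-specific space.

        h, t : (d,) entity embeddings
        r    : (k,) relation embedding
        M_r  : (k, d) relation-specific projection matrix
        """
        h_r = M_r @ h        # relation-specific head embedding
        t_r = M_r @ t        # relation-specific tail embedding
        return np.linalg.norm(h_r + r - t_r)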

Therefore, no matter which concrete method we use, it is of great significance for KRL models to conduct the multi-projection.

Visualization analysis methods

When we explore the training objective to analyze the performance of TransE in Section 3, we visualize the representation distributions using t-SNE. For other KRL models, we can utilize similar visualization analysis methods to explore the distribution characteristics of their embeddings. Here we provide concrete analysis methods for TransE-like models and for other distance-based KRL models.

For TransE-like models that do not use a relation-specific matrix to multi-project the entities, such as TransA (Xiao et al. 2015), TransG (Xiao et al. 2016) and KG2E (He et al. 2015), we can visualize the embeddings with the same method as for TransE: conduct dimensionality reduction over all the relation and entity embeddings, take the same knowledge category as an observed collection, and analyze whether the entity-pair offset t − h is in the neighborhood of the relation r.

For TransE-like models with entity multi-projection, such as TransH (Wang et al. 2014), TransR/CTransR (Lin et al. 2015b) and TransD (Ji et al. 2015), the entity embeddings we visualize should first be projected into the relation-specific space, otherwise the head, tail, and relation will not satisfy the translation property. Thus, we should conduct the multi-projection of entity embeddings before the dimensionality reduction, so that we can then observe whether they satisfy the translation approximation \(\boldsymbol{M}_r\boldsymbol{t} - \boldsymbol{M}_r\boldsymbol{h} \approx \boldsymbol{r}\) in the projected space.

For distance-based KRL models that project a nonlinear transformation of the head and tail onto the relation embedding, such as NTN (Socher et al. 2013) and HolE (Nickel et al. 2015), with score functions of the form \(f(h, r, t) = g(\boldsymbol{r}^{\top} nl(h, r, t))\), we can observe the distribution relationship between the nonlinear transformation vector nl(h, r, t) and the relation embedding vector r for each triple; that is, we can check whether the projection of nl(h, r, t) onto r is small enough for the golden triples and large enough for the synthetic ones. For other models that utilize a bilinear transformation, such as LFM (Jenatton et al. 2012; Sutskever 2009) and RESCAL (Nickel et al. 2011; Nickel and Ring 2012), we can visualize the entity embeddings belonging to the same category of entity pairs and analyze the distribution characteristics of the head and the tail in the bilinear transformation \(\boldsymbol{h}^{\top}\boldsymbol{M}_r\boldsymbol{t}\).
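As an illustration of the quantities one would inspect in each case (using HolE's circular-correlation composition as the nonlinear transformation and a plain bilinear form; variable names are ours, not the models' reference implementations):

    import numpy as np

    def circular_correlation(h, t):
        """h * t computed via FFT; the composition operator used by HolE."""
        return np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)))

    def hole_projection(h, r, t):
        """Projection of the head-tail composition onto r, i.e. the quantity
        whose distribution we would inspect for golden vs. synthetic triples."""
        return float(r @ circular_correlation(h, t))

    def bilinear_score(h, M_r, t):
        """Bilinear form h^T M_r t used by LFM/RESCAL-style models."""
        return float(h @ M_r @ t)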

Approximation of relation \(\boldsymbol {\widetilde {r}}_{\langle h,t \rangle }\)

In Section 4.1, we present the approximation of r used to calculate the distribution density densr for TransE-like models. For other distance-based KRL models, we can approximate the relation embedding r in a way inspired by the score function f. For example, in the NTN model (Socher et al. 2013), the nonlinear transformation tanh(h, t) can be regarded as \(\boldsymbol {\widetilde {r}}_{\langle h,t \rangle }\), because NTN takes the projection of tanh(h, t) onto r as the triple score function.
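A small sketch of the TransE-like approximation, together with a purely illustrative stand-in for the density estimate (the actual densr is the one defined in Section 4.1, not the centroid-spread heuristic assumed below):

    import numpy as np

    def approx_relation_transE(h, t):
        """TransE-like approximation of the relation: the entity-pair offset t - h."""
        return t - h

    def category_density(offsets):
        """Illustrative stand-in for a category density: the inverse of the mean
        distance of the approximated relation vectors from their centroid.
        (Assumption for exposition only; see Section 4.1 for the actual densr.)"""
        offsets = np.asarray(offsets)
        centroid = offsets.mean(axis=0)
        spread = np.linalg.norm(offsets - centroid, axis=1).mean()
        return 1.0 / (spread + 1e-12)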

7 Conclusion and future work

In this paper, we tackle the knowledge embedding problem and propose an adaptive weighted margin learning framework, called AWML, to help KRL models adaptively learn the representations of entities and relations in a KG. We first analyze the visualization results of a previous KRL model and discover the inconsistency between the original training objective and the complex properties of KG. Then we explore the relation-specific density and explain the necessity of choosing an appropriate margin and importance weight for every knowledge category. Finally, we define the density-adaptive margin and the density-adaptive weight, and integrate each of them into the previous training objective for knowledge embedding. Experimental and visualization results validate the effectiveness of our proposed framework.

From the performance on the evaluation tasks, we conclude that our proposed framework is indeed capable of helping KRL models obtain better representations in the embedding space. The good performance derives from the ability to adapt: with our framework, the KRL model can adaptively control the contributions of golden and synthetic triples during training, and can also adaptively control the degree to which the two kinds of restrictions are separated. However, challenges remain for our proposed KRL framework. For the 1-to-N and N-to-1 relations, it is still very difficult to learn a satisfactory representation and the improvement is small. In addition, with our framework, some categories of triples are distributed worse in the embedding space; as the visualization in Fig. 7 shows, some synthetic implicit vectors are still distributed near the relation embedding. These two challenges remain to be explored in future work. Additionally, we will incorporate our framework, AWML, into more KRL models and apply it to more tasks to evaluate its generalization. It is also possible to focus on the entity rather than the relation to analyze the distribution characteristics of the embeddings and explore the capability of knowledge embedding models.