1 Introduction

A knowledge graph (KG) simulates human understanding of entities and their relations in the real world to construct a structured and semantic knowledge representation. Because of the large amount of data in real-world KGs, an efficient and scalable solution is crucial. Knowledge graph embedding (KGE) is a feature extraction process that maps a complex network of nodes, content, and relations into low-dimensional vector spaces.

Many KGE methods have been proposed [1, 11, 13] to learn low-dimensional vectors of entities and relations. In fact, an entity may have multiple aspects, which may be related to different relations [8]. Therefore, many independent models [6, 8] have been proposed recently, and they usually outperform dependent models on public datasets. However, current methods generally assume independence between relations and try to learn a unique discriminative parameter set for each relation, which leads to a sharp increase in parameters and high time complexity.

Meanwhile, the sequence information in a triple should also be taken into account. Although translation-based models consider the order to some extent through the formula \(h+r \approx t\), they still underexploit the inherent sequence information of triples.

To optimize embedding performance by considering relevance and inner sequence, we explore knowledge graph embedding from two perspectives. On one hand, there are potential connections between relations: for a given entity, relations are neither completely independent nor completely consistent. On the other hand, a triple should be considered as a sequence, which carries order information. Based on these ideas, we develop a novel partial layer concatenate mechanism and propose an efficient knowledge graph embedding method based on relevance and inner sequence of relations (KGERSR). It uses a shared concatenate sigmoid layer: one part consists of two shared filters that discriminate the relation-specific information of all relations; the other part is a uniform sequence holder that preserves the inner sequence information of the triple.

We evaluate our method on knowledge graph completion, and the experiments show that our model is comparable to or even outperforms state-of-the-art baselines. The main contributions of this paper are summarized as follows:

  • We found that relations in a heterogeneous KG are not completely independent, and each relation contributes differently to the embedding of one aspect of an entity. Therefore, we propose a scalable and efficient model, KGERSR, with two gates to discriminate the inherent relevance of relations.

  • Besides, the inner sequence of relations needs to be considered in the embedding of the entity. We develop a layer concatenate mechanism to capture the inner sequence information of the triples.

  • In order to find a balance between complexity and accuracy, we propose sharing parts of the parameter matrices, which preserves correlation and inner sequence information using only three parameter matrices. The complexity is of the same order as TransE.

  • Experiments show that KGERSR delivers improvements over state-of-the-art baselines while reducing the number of parameters. These results indicate that our method is a good way to further optimize embeddings in a real KG.

2 Related Work

Translational distance models are one representative family of KG embedding models. TransE [1] is the earliest translational distance model; it represents both entities and relations as vectors in the same space. Despite its simplicity and efficiency, TransE has flaws in dealing with 1-to-N, N-to-1, and N-to-N relations [8, 14], so it does not handle such complex properties well. To overcome these disadvantages, methods such as TransH [14] and TransR [8] have been proposed, which introduce relation-specific entity embedding strategies. These methods require a large number of parameters and have high time complexity, which prevents them from being applied to large-scale KGs.

Some works take the relevance of relations into account by assuming the relations follow some random distribution and modeling them as random variables. KG2E [4] represents entities and relations as random vectors drawn from multivariate Gaussian distributions. TransG [15] also models entities with Gaussian distributions, and it posits that a relation can have multiple semantics, hence it should be represented as a mixture of Gaussian distributions. However, once the entities and relations of an actual KG do not conform to the assumed distribution, the effectiveness of these models is weakened.

Many methods based on semantic matching models also consider the correlation between relations to reduce the number of learned parameters. DistMult [16] introduces a vector r and requires \(\mathbf{M}_{r} = diag(\mathbf{r})\). The scoring function is hence defined as \(f_{r}(h,t) = \mathbf{h}^\mathsf{T} diag(\mathbf{r}) \mathbf{t}\). This score captures pairwise interactions only between the components of h and t along the same dimension, and reduces the number of parameters to \(\mathcal{O}(d)\) per relation. However, this over-simplified model can only deal with symmetric relations, which is clearly not powerful enough for general KGs. ComplEx [13] extends DistMult by introducing complex-valued embeddings so as to better model asymmetric relations.
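To make the parameter saving and its limitation concrete, here is a minimal NumPy sketch of the DistMult score; the dimension and embeddings are hypothetical placeholders. Note that the score is symmetric in h and t, which is exactly why DistMult only models symmetric relations:

```python
import numpy as np

d = 4  # embedding dimension (hypothetical)
rng = np.random.default_rng(0)
h, r, t = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

# f_r(h, t) = h^T diag(r) t: only d parameters per relation.
score_ht = np.sum(h * r * t)
score_th = np.sum(t * r * h)            # swapping h and t gives the same value,
assert np.isclose(score_ht, score_th)   # so only symmetric relations fit
```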

In a KG, sequences of relations can also reflect the semantic relation between entities. Lin et al. proposed a representation learning method, Path-based TransE (PTransE) [9]: given a path p linking two entities h and t, p can be computed using the addition, multiplication, or an RNN over all \(r_{i}\) on the path. Guu et al. [3] proposed a similar framework, the idea of which is to build triples using entity pairs connected not only by relations but also by relation paths. Such models considering relational paths can greatly improve the discrimination of knowledge representation learning and improve performance on tasks such as knowledge graph completion. However, they both have to make approximations by sampling or pruning to deal with the huge number of paths.

3 Our Model

3.1 Motivation

Relations in a KG are relevant to each other for a given entity. For example, <Arnold Schwarzenegger, isGovernor, California> and <Arnold Schwarzenegger, isMemberOf, Republican> jointly imply <Arnold Schwarzenegger, is, Politician>. Therefore, we should neither completely separate the relations nor embed them together indiscriminately. By building a learning network, the model can automatically learn the intrinsic relevance between relations, letting related relations work together to express the characteristics of one aspect of an entity.

Additionally, a KG is a directed graph, in which the head entity and the tail entity have an inner order connected by the relation. Entities connected by ordered relations affect each other. Consequently, as mentioned before, the path within a triple can also reflect the semantic relation between entities, so a model capable of capturing sequence information is needed. Although translation-based models handle path information and thereby preserve some of the sequence, they still underexploit the inherent sequence information of triples.

We should retain the relevance of the relations while also retaining the sequence information. To do so, we combine ideas from LSTM [5] and RNN [12]. For the relevance of relations, we want related relations to take effect while irrelevant ones are ignored, so we design two shared gates for embedding head and tail entities with relations, drawing on the core idea of LSTM. For the sequence of relations, we treat a triple as a sequence combining [hr] and [rt], and develop a recurrent discriminate mechanism to retain the sequence information, drawing on the core idea of RNN.

3.2 KGE with Relevance and Sequence of Relations

The framework of KGERSR is shown in Fig. 1 and the detailed descriptions are as follows:

Fig. 1. The KGERSR architecture.

  • We map every entity and relation into a continuous vector of the same dimension \(R^m\), obtaining the original h, r, t embedding vectors.

  • We feed both the entity embeddings and the relation embedding into the concatenate sigmoid layer, which consists of parameter matrices determined by the entity and relation together.

  • Then we set two shared gates, \(\sigma _{h}\) and \(\sigma _{t}\), for head and tail entities respectively. The two gates partially share one recurrent parameter matrix W between [hr] and [rt].

  • We realize non-linear, adaptive relation-specific discrimination by multiplying the distinct parts of the concatenate layer output with the entity embedding element-wise.

  • We capture the inherent sequence property by multiplying the shared part of the concatenate layer output with the entity embedding element-wise.

  • Last, we build a scoring function using the discriminated information of heads and tails. Based on the score, we can determine whether the triple is valid.

Each triple is composed of two entities and their relation. We define \(h, t, r \in R^{m}\) as their embeddings, respectively. The parameters of the concatenate layer are denoted as \(W_{h}\), \(W_{t}\), \(W \in R^{m\times m}\) and \(b_{h}\), \(b_{t}\in R^{m}\). \(W_{h}\) and \(W_{t}\) record the relevance between relations for heads and tails, respectively. W is first affected by [hr] and then by [rt], so that it can record the inner sequence of the whole triple. The discriminated vectors of entities are defined as

$$\begin{aligned} h_r = [h]\odot \sigma ([W_h,W]\cdot \begin{bmatrix} h \\ r \end{bmatrix} + [b_h]) \end{aligned}$$
(1)
$$\begin{aligned} t_r = [t]\odot \sigma ([W,W_t]\cdot \begin{bmatrix} r \\ t \end{bmatrix}+ [b_t]) \end{aligned}$$
(2)

The sigmoid function \(\sigma (\bullet )\) is applied in an element-wise manner. [,] denotes the concatenate operation, and \(\odot \) denotes the element-wise product.

In practice, we enforce constraints on the norms of the embeddings. That is to say, \(\forall h, t, r\), we have \(||h||_2 = 1\), \(||r||_2 = 1\), \(||t||_2 = 1\), \(||h_r||_2 = 1\) and \(||t_r||_2 = 1\). The output of concatenate sigmoid layers describes how much relation-specific and sequence information should be maintained.
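For concreteness, here is a minimal NumPy sketch of Eqs. (1) and (2) with randomly initialized (hypothetical) parameters. The block product \([W_h,W]\cdot [h; r]\) expands to \(W_h h + W r\), and likewise for the tail side; the final renormalization is one way to satisfy the norm constraints above:

```python
import numpy as np

def normalize(v):
    """Enforce the unit-norm constraint ||v||_2 = 1."""
    return v / np.linalg.norm(v)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m = 4                                    # embedding dimension (hypothetical)
rng = np.random.default_rng(0)

# Unit-norm entity and relation embeddings.
h, r, t = (normalize(rng.standard_normal(m)) for _ in range(3))

# Concatenate-layer parameters; W is shared between the two gates.
W_h, W_t, W = (rng.standard_normal((m, m)) for _ in range(3))
b_h, b_t = rng.standard_normal(m), rng.standard_normal(m)

# Eq. (1): h_r = h ⊙ σ(W_h h + W r + b_h), then renormalized.
h_r = normalize(h * sigmoid(W_h @ h + W @ r + b_h))
# Eq. (2): t_r = t ⊙ σ(W r + W_t t + b_t), then renormalized.
t_r = normalize(t * sigmoid(W @ r + W_t @ t + b_t))
```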

Then, we define the scoring function f as Eq. 3. The score is expected to be higher for a valid triple and lower for an invalid triple.

$$\begin{aligned} f(<h,r,t>) = \sum _{k=1}^{m} h_{rk}r_{k}t_{rk} \end{aligned}$$
(3)

The score f is the log-odds of the probability that the triple holds in G; that is, the probability that the triple is true is:

$$\begin{aligned} P(<h,r,t>\in G) = \sigma (f(<h,r,t>)) \end{aligned}$$
(4)
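Continuing the sketch above, Eqs. (3) and (4) amount to:

```python
# Eq. (3): trilinear score over the discriminated embeddings.
score = float(np.sum(h_r * r * t_r))

# Eq. (4): probability that <h, r, t> is a valid triple in G.
prob = sigmoid(score)
```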

3.3 Training

We use the Adam optimizer [7] to minimize the loss function [13], with \(L_2\) regularization on the weight matrices \(W_h\), \(W_t\), and W of the concatenate layer.

$$\begin{aligned} \begin{aligned} L =&\sum _{<h,r,t>\in \{G\cup G'\}} log(1+exp(-Y_{hrt}f(<h,r,t>)))\\&+ \frac{\lambda }{2}(||W_h||_{2}^{2} + ||W_t||_{2}^{2} + ||W||_{2}^{2}) \end{aligned} \end{aligned}$$
(5)

where \(Y_{hrt} = 1\) if \(<h, r, t> \in G\), and \(Y_{hrt} = -1\) otherwise. \(G'\) is a collection of invalid triples generated by randomly replacing entities or relations in training triples. We use \(\eta =\frac{|G'|}{|G|}\), an important hyperparameter, to denote the number of negative samples per training sample. \(G'\) is defined as

$$\begin{aligned} \begin{aligned} G' =&\lbrace<h', r,t>| h'\in E\rbrace \cup \lbrace<h, r,t'>| t'\in E\rbrace \cup \lbrace <h, r',t>| r'\in R\rbrace \end{aligned} \end{aligned}$$
(6)

In practice, we initialize the embeddings, weight matrices, and weight vectors of the gates by sampling from a truncated standard normal distribution. We use the Adam optimizer with a constant range of learning rates for each epoch to carry out training, which is stopped early based on the model's performance.
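As an illustration only (not the full training loop), the loss of Eq. (5) could be computed as below, assuming a `score(h, r, t)` function implementing Eq. (3) and pre-built lists of positive and corrupted triples; in practice the gradients would come from an automatic differentiation framework and Adam would update the parameters:

```python
import numpy as np

lam = 0.01   # L2 regularization weight λ (hypothetical value)

def total_loss(positives, negatives, score, W_h, W_t, W):
    """Logistic loss of Eq. (5) over valid (y = 1) and invalid (y = -1) triples."""
    loss = 0.0
    for triples, y in ((positives, 1.0), (negatives, -1.0)):
        for h, r, t in triples:
            loss += np.log1p(np.exp(-y * score(h, r, t)))
    # L2 penalty on the concatenate-layer matrices.
    loss += 0.5 * lam * (np.sum(W_h**2) + np.sum(W_t**2) + np.sum(W**2))
    return loss
```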

3.4 Complexity Analysis

The parameter number of our method is \(\mathcal {O}(N_{e}m + N_{r}n + 3m^2+2m)(m=n)\), and the time complexity is \(\mathcal {O}(m^2)\). \(N_e\) and \(N_r\) denote the numbers of entities and relations, respectively; m and n are the dimensions of the entity and relation embedding spaces, respectively. The parameter complexity of KGERSR is almost the same as that of TransE in an epoch, because \(m \ll N_e\) in existing KGs: the discriminate parameters brought by the filters are negligible compared to the embedding parameters. Besides, KGERSR does not need extra hyperparameters or pre-training to prevent overfitting. This makes KGERSR easier to train, so it can be used to process real-world KGs.
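As a rough back-of-the-envelope check, using the public FB15K-237 statistics (\(N_e = 14{,}541\), \(N_r = 237\)) and our \(m = n = 200\):

```python
N_e, N_r, m = 14_541, 237, 200

embedding_params = N_e * m + N_r * m   # 2,955,600 -- same as TransE
gate_params = 3 * m**2 + 2 * m         # 120,400 for W_h, W_t, W, b_h, b_t

# The gates add only about 4% on top of the TransE-style embedding table.
print(gate_params / embedding_params)  # ≈ 0.041
```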

4 Experiments

4.1 Knowledge Graph Completion Results

Knowledge graph completion aims to predict the missing h or t for a relation fact triple \(<h,r,t>\). In this task, we need to distinguish the correct triples from corrupted ones.

In this task, we use two datasets, WN18RR and FB15K-237, shown in Table 1. Since the datasets are the same as in prior work, we directly reuse the published results of several baselines. For these two datasets, we traverse all the training triples for at most 1000 rounds. We report MRR (the mean reciprocal rank of all correct entities) and Hits@K (the proportion of correct entities ranked in the top K) as our evaluation metrics.

Table 1. Statistics of datasets
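For reference, the two metrics are computed from the (filtered) ranks of the correct entities, roughly as follows; the ranks here are made up:

```python
def mrr(ranks):
    """Mean reciprocal rank of the correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(k, ranks):
    """Proportion of correct entities ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 2, 10, 50]              # hypothetical filtered ranks
print(mrr(ranks), hits_at(10, ranks))  # ≈ 0.39, 0.8
```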

We select the hyperparameters of KGERSR via grid search according to the MRR on the validation set. On WN18RR, the best configuration is \(\lambda = 0.01\), \(\alpha = 0.01\), \(m = 200\), \(B = 120\), and \(\eta = 25\). On FB15K-237, the best configuration is \(\lambda = 0.1\), \(\alpha = 0.1\), \(m = 200\), \(B = 500\), and \(\eta = 25\). Table 2 shows the evaluation results on knowledge graph completion.

From Table 2, we observe that KGERSR outperforms all baselines on some metrics and achieves comparable results on the others. On the more relational but sparse dataset FB15K-237, our method scores 2.3% higher on MRR and 1.9% higher on Hits@1 than the previous best result. Besides, on the less relational but dense dataset WN18RR, our method achieves results comparable to the previous best in MRR and Hits@1.

Table 2. Experimental results of knowledge graph completion

The results indicate that our model achieves larger improvements on the more relational but sparse graph FB15K-237, while the improvement on the less relational but dense WN18RR is relatively limited. That is to say, our model performs better on more relational graphs, which shows that the relational relevance part of our model plays the more critical role in the KG completion task. The improvement on Hits@1 is pronounced, which shows that our algorithm has a strong ability for precise link prediction. In general, the results demonstrate that our method generalizes reasonably well, and that the proposed concatenate gates, which take into account the relevance and sequence of relations, can more fully capture the essential characteristics of entities and relations in a KG.

4.2 Parameter Efficiency

We further investigate the influence of parameters on the performance and the parameter sensitivity of our model. The experiments below focus on FB15K-237, with the best configuration obtained from the previous experiment. Training was stopped early based on filtered MRR on the validation set. For the negative sample, embedding size, and batch size parameters, we sweep the values over the corresponding intervals one by one, and observe the changes in MRR and the corresponding training time. The results are shown in Fig. 2.

  • We let the number of negative samples (\(\eta \)) vary in {1, 5, 10, 25, 50, 100}. It can be observed from Fig. 2(a) that generating more negative samples clearly improves the results. When \(\eta \) exceeds 25, the growth of indicators such as MRR, Hits@1, and Hits@10 slows down and the results are basically flat, while the corresponding training time still increases linearly. So, considering the training time, we choose \(\eta = 25\) as the optimal parameter.

  • We let the embedding size (m) vary in {50, 100, 200, 300}. It can be observed from Fig. 2(b) that increasing the embedding size improves the model performance over the interval from 50 to 200, but the Hits@10 performance decreases from \(55.4\%\) to \(55.3\%\) when \(m = 300\), while the training time of the model increases accordingly. In other words, blindly increasing the size of the matrix does not guarantee the effectiveness of the model. The reason may be that a larger embedding size leaves some triples insufficiently trained, making it impossible to learn accurate embeddings. So, considering the model performance, we choose \(m = 200\) as the optimal parameter.

Fig. 2. Performance with different parameter sizes.

5 Conclusion and Future Work

In this paper, in order to find a balance between complexity and accuracy, we propose sharing parts of the parameter matrices, which preserves correlation and sequence information using only three parameter matrices. Experiments show that KGERSR outperforms state-of-the-art baselines on some metrics and achieves comparable results on the others. These results indicate that our method is a good way to further optimize embeddings in real KGs.

In the future, we will conduct further study along the following directions: (1) add information such as text and attributes to further increase the accuracy of knowledge embedding; (2) use more sophisticated methods such as RNNs to further optimize KGERSR; (3) exploit the connections between triples, which are closer to the graph structure.