
1 Introduction

Knowledge graphs (KGs) provide a large amount of structured information about entities and relations and have been used successfully in fields such as knowledge inference [21] and question answering [23]. Typical KGs such as Freebase [1] and YAGO3 [8] model multi-relational information as structure triples of the form (head entity, relation, tail entity), abbreviated as \((h, r, t)\).

Currently, most representation learning (RL) methods focus on structure information and ignore attribute information in KGs. For example, Fig. 1 shows the multi-attribute information of the two entities in a structure triple sampled from DWY100K [13]. Although some works, such as DT-GCN [11], have recognized the importance of multi-attribute information, they have not fully used its rich semantics. First, they do not embed attribute types and attribute values jointly and apply them directly to improve the semantic accuracy of entity representations. Second, they do not use structure information and multi-attribute information together to improve the overall RL effect. In addition, entity multi-attribute information is generally stored in KGs as triples, and attribute triples cannot handle continuous attribute values and may suffer from one-to-many or many-to-one relations. Furthermore, entity multi-attribute information is usually diverse and complex: different entities may have different attribute types, and different attribute values may have different data structures and granularities. For example, in Fig. 1, \(Babyfather\_(song)\) and \(Sade\_(singer)\) have different attribute types, and the values of the three attribute types of \(Sade\_(singer)\) differ in form and structure. Such information is too complex to be used directly for learning embeddings. Intuitively, different attribute types and values also differ in how important they are to an entity’s representation. If entity multi-attribute information is not used properly, the entities’ representations lose a large amount of semantic information, which reduces their semantic accuracy.

Fig. 1. Example of entity multi-attribute information in DWY100K.

To address those problems, we first design a novel encoder EAE, which can encode the complex and diverse entity multi-attribute information to generate the entities’ attribute-based representations. Moreover, we propose a novel RL model DERL for KGs, combining structure and multi-attribute information to improve KG embedding. In the DERL model, an entity’s representation is responsible for jointly modeling the corresponding structure information and multi-attribute information.

To learn structure information, we follow a typical RL method, TransE [2], and regard the relation in each structure triple as a translation from the head entity to the tail entity. To learn multi-attribute information, we use the EAE to learn the entities’ attribute-based representations. The EAE contains two embedding components: one embeds the entity’s attribute types, and the other uses a bi-directional Long Short-Term Memory (Bi-LSTM) network to encode the attribute values. An attention mechanism learns how important each attribute type and value is to the entity’s representation. Finally, we combine the embeddings from the two components to generate the entities’ attribute-based representations.

We evaluate our model on the knowledge graph completion task and the zero-shot task. Experimental results demonstrate that our model achieves state-of-the-art performance on both tasks. They also indicate that the model can use entity multi-attribute information to improve the overall KG embedding effect, and they verify the importance of attribute information for entity representation. The main contributions of this work are as follows:

  • We design a novel encoder Entity Attribute Encoder (EAE), which uses both the entity’s attribute types and values to generate the entity’s attribute-based representation. We adopt the attention mechanism for attribute types and values to distinguish the importance of different attribute information to the entity’s representation.

  • We propose a novel RL model Duet Entity Representation Learning (DERL), which utilizes both entity structure information and multi-attribute information for enhancing RL’s effect.

  • We evaluate the DERL model’s effectiveness on the knowledge graph completion and zero-shot tasks. Experimental results on real-world datasets show that the DERL model consistently outperforms the baselines on these two tasks. To the best of our knowledge, this is the first attempt to use entity multi-attribute information to solve the zero-shot problem.

2 Problem Formulation

We first introduce the symbols used in this paper. In a structural triple \( (h, r, t)\in T \), \( h, t \in E \) are the head entity and the tail entity, and \( r \in R \) is the relation. \( a \in A \) stands for an attribute type, \( v \in V \) stands for an attribute value, and \( c \in v \) stands for a character of an attribute value. T is the whole training set of structural triples, E is the set of entities, R is the set of relations, A is the set of attribute types, and V is the set of attribute values. To utilize both structure information and multi-attribute information in DERL, we propose two kinds of representations for each entity.

Definition 1. Structure-Based Representations: \( {\textbf {e}} _{\textbf {s}} \) represents the entity’s structure-based representation. \( {\textbf {e}} _{\textbf {sh}} \) and \( {\textbf {e}} _{\textbf {st}} \) are the structure-based representations based on the head entity and the tail entity. \( {\textbf {r}} \) represents the relation’s representation. These representations could be learned through existing translation-based models.

Definition 2. Attribute-Based Representations: \( {\textbf {e}} _{\textbf {a}} \) represents the entity’s attribute-based representation. \( {\textbf {e}} _{\textbf {ah}} \) and \( {\textbf {e}} _{\textbf {at}} \) are the attribute-based representations based on the head entity and the tail entity. We will propose an encoder to construct this kind of representation in the following section.

Fig. 2. The Overall Architecture of the DERL Model

3 Methodology

3.1 Overall Architecture

We attempt to utilize entity structure information as well as multi-attribute information in the DERL model. Following the framework of translation-based methods, we define the overall energy function as follows:

$$\begin{aligned} S(h,r,t) = S_{s} + S_{a}, \end{aligned}$$
(1)

where \( S_{s} = || {\textbf {e}} _{\textbf {sh}} + {{\textbf {r}}} - {\textbf {e}} _{\textbf {st}} || \) is an energy function based on structure-based representations, identical to that of translation-based methods, and \( S_{a} \) is an energy function based on both attribute-based and structure-based representations. To make the learning process of \( S_{a} \) compatible with \( S_{s} \), we define \( S_{a} \) as:

$$\begin{aligned} S_{a} = S_{as} + S_{sa} + S_{aa}, \end{aligned}$$
(2)

where \( S_{as} = || {\textbf {e}} _{\textbf {ah}} + {{\textbf {r}}} - {\textbf {e}} _{\textbf {st}} || \) and \(S_{sa} = || {\textbf {e}} _{\textbf {sh}} + {{\textbf {r}}} - {\textbf {e}} _{\textbf {at}} || \) each pair a structure-based representation for one entity with an attribute-based representation for the other, and \( S_{aa} = || {\textbf {e}} _{\textbf {ah}} + {{\textbf {r}}} - {\textbf {e}} _{\textbf {at}} || \) uses attribute-based representations for both the head and the tail entity. The overall architecture of DERL that follows from this energy function is shown in Fig. 2. We learn the entities’ structure-based representations and the relations’ representations with TransE, and the entities’ attribute-based representations with the EAE. Under the overall energy function, the attribute-based and structure-based representations are learned simultaneously. Because the relation representations are shared by all four energy functions, the two types of entity representations are projected into the same vector space and can promote each other.
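As a minimal sketch of Eqs. (1) and (2), the overall energy of a triple can be computed from the four L1 distances. The function names and the toy two-dimensional vectors below are illustrative assumptions, not part of the original implementation.

```python
def l1_distance(head, relation, tail):
    """L1 norm of head + relation - tail (translation-based energy)."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def overall_energy(e_sh, e_st, e_ah, e_at, r):
    """S = S_s + S_as + S_sa + S_aa, following Eqs. (1) and (2)."""
    s_s = l1_distance(e_sh, r, e_st)   # structure head, structure tail
    s_as = l1_distance(e_ah, r, e_st)  # attribute head, structure tail
    s_sa = l1_distance(e_sh, r, e_at)  # structure head, attribute tail
    s_aa = l1_distance(e_ah, r, e_at)  # attribute head, attribute tail
    return s_s + s_as + s_sa + s_aa


# Toy 2-dimensional vectors, chosen only for illustration.
e_sh, e_st = [1.0, 0.0], [1.0, 1.0]
e_ah, e_at = [0.5, 0.0], [1.0, 0.5]
r = [0.0, 1.0]
energy = overall_energy(e_sh, e_st, e_ah, e_at, r)
print(energy)  # 0.0 + 0.5 + 0.5 + 1.0 = 2.0
```

The same relation vector appears in all four terms, which is what ties the two kinds of entity representations to one vector space.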

3.2 Entity Attribute Encoder

Entity multi-attribute information is difficult to use because it is complex and heterogeneous and because its parts differ in importance. These properties make it hard to incorporate multi-attribute information into entities’ representations. Therefore, the EAE encodes attribute types and attribute values separately. The framework of the EAE is shown in Fig. 3. The Attribute Type Embedding (ATE) learns the embeddings of the entity’s attribute types, and the Attribute Value Character Embedding (AVCE) learns the embeddings of the characters in the entity’s attribute values. In the Attribute Value Embedding (AVE), a Bi-LSTM reads the character embeddings of an attribute value and generates the attribute value embedding. We then apply an attention mechanism to weight the attribute types and values. Finally, we combine the attribute type embeddings and the attribute value embeddings to generate the entities’ attribute-based representations.

Attribute Type Embedding (ATE). We first count the attribute types and randomly initialize an embedding for each attribute type. Because entities differ in their number of attribute types and values, we adopt a zero-filling (padding) strategy to unify these numbers. To prevent zero-filling from affecting training, for example through vanishing gradients, we assign one shared embedding to all padded positions. Given an entity’s attribute types \( A = (a_{0},a_{1},...,a_{M}) \), we obtain the following attribute type embeddings:

$$\begin{aligned} {\textbf {A}} = (\textbf{a}_{0},\textbf{a}_{1},...,\textbf{a}_{M}). \end{aligned}$$
(3)

Attribute Value Character Embedding (AVCE). We first count the characters that appear in the attribute values and randomly initialize an embedding for each character. Because attribute values differ in length, we again use the zero-filling strategy and assign one shared embedding to all padded positions. Given the characters of an attribute value \( v_{i} = (c_{0},c_{1},...,c_{N}) \), we get the following attribute value character embeddings:

$$\begin{aligned} \textbf{v}_{i} = (\textbf{c}_{0},\textbf{c}_{1},...,\textbf{c}_{N}). \end{aligned}$$
(4)
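The counting and zero-filling steps used by ATE and AVCE can be sketched as follows for attribute value characters. This is only an illustration: the padding symbol `<pad>`, its index 0, and the helper names are assumptions. Each resulting index would then select a row of a randomly initialized embedding table, with index 0 playing the role of the shared padding embedding.

```python
PAD = "<pad>"  # hypothetical padding symbol, mapped to index 0


def build_char_vocab(values):
    """Assign an integer index to every character seen in the attribute values."""
    vocab = {PAD: 0}
    for value in values:
        for ch in value:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode_and_pad(value, vocab, length):
    """Map characters to indices, then truncate or pad with the PAD index."""
    ids = [vocab[ch] for ch in value][:length]
    return ids + [vocab[PAD]] * (length - len(ids))


values = ["2012-12-12", "180 cm"]
vocab = build_char_vocab(values)
max_len = max(len(v) for v in values)
padded = [encode_and_pad(v, vocab, max_len) for v in values]
print(padded[1])  # [3, 5, 2, 6, 7, 8, 0, 0, 0, 0]
```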

Attribute Value Embedding (AVE). Attribute values in KGs take many different forms. For example, “2012-12-12” and “180 cm” represent a person’s birthday and height, respectively. In mono-lingual KGs, an attribute value can be treated as a sequence of characters drawn from a shared vocabulary. Prior work [15] shows that the LSTM can effectively capture the sequence information between characters. We therefore use a Bi-LSTM to learn the sequence information between characters in both directions. The following equations define the LSTM cell used in each direction:

$$\begin{aligned} f_{t} = \sigma (\mathrm{{\textbf {W}}}_{f} [h_{t-1}, \textbf{c}_{t}] + \mathrm{{\textbf {b}}}_{f}), \end{aligned}$$
(5)
$$\begin{aligned} i_{t} = \sigma (\mathrm{{\textbf {W}}}_{i} [h_{t-1}, \textbf{c}_{t}] + \mathrm{{\textbf {b}}}_{i}), \end{aligned}$$
(6)
$$\begin{aligned} \tilde{H}_t=\textrm{tanh}(\mathrm{{\textbf {W}}}_{H} [h_{t-1}, \textbf{c}_{t}] + \mathrm{{\textbf {b}}}_{H}), \end{aligned}$$
(7)
$$\begin{aligned} {H}_t = f_{t} \odot H_{t-1} + i_{t} \odot \tilde{H}_t, \end{aligned}$$
(8)
$$\begin{aligned} o_{t} = \sigma (\mathrm{{\textbf {W}}}_{o} [h_{t-1}, \textbf{c}_{t}] + \mathrm{{\textbf {b}}}_{o}), \end{aligned}$$
(9)
$$\begin{aligned} h_{t} = o_{t} \odot \tanh (H_{t}), \end{aligned}$$
(10)

where \( \odot \) denotes element-wise multiplication, and \( f_{t} \), \( i_{t} \), \( o_{t} \) are the forget gate, input gate, and output gate of the LSTM cell. \( H_{t} \) is the cell state and \( h_{t} \) is the hidden state. \( \mathrm{{\textbf {W}}}_{f} \), \( \mathrm{{\textbf {W}}}_{i} \), \( \mathrm{{\textbf {W}}}_{H} \), \( \mathrm{{\textbf {W}}}_{o} \) are weight matrices, \( \sigma \) is the sigmoid function, and \( \mathrm{{\textbf {b}}}_{f} \), \( \mathrm{{\textbf {b}}}_{i} \), \( \mathrm{{\textbf {b}}}_{H} \), \( \mathrm{{\textbf {b}}}_{o} \) are biases. The Bi-LSTM consists of a forward LSTM (F-LSTM) and a backward LSTM (B-LSTM). The F-LSTM reads the attribute value character embeddings \(\textbf{v}_{i} =(\textbf{c}_{0}, \textbf{c}_{1},...,\textbf{c}_{N}) \) from left to right, and the B-LSTM reads them in reverse order. The final outputs of the F-LSTM and B-LSTM are:

$$\begin{aligned} \textbf{h}_f=\mathrm{{{\textbf {F-}}{} {\textbf {LSTM}}}}(\textbf{c}_{N},\textbf{h}_{f-1}), \end{aligned}$$
(11)
$$\begin{aligned} \textbf{h}_b=\mathrm{{{\textbf {B-}}{} {\textbf {LSTM}}}}(\textbf{c}_{0},\textbf{h}_{b+1}). \end{aligned}$$
(12)

The initial states of the Bi-LSTM are set to zero vectors. After reading the embedding of all characters contained in an attribute value, we concatenate the final hidden states of the two-direction LSTM outputs to generate the attribute value embedding:

$$\begin{aligned} \textbf{v}_{i} = [\textbf{h}_f; \textbf{h}_b]. \end{aligned}$$
(13)
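The following pure-Python sketch walks through Eqs. (5) to (13) for a single attribute value. It is an illustration under stated assumptions rather than the authors' implementation: the parameter layout, the uniform initialization range, and the toy sizes are hypothetical, and a practical system would use a deep learning framework instead of Python lists.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Compute W @ vector + b for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def init_params(input_size, hidden_size, rng):
    """Random weights and zero biases for the f, i, H (candidate), and o transforms."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size + input_size)]
                for _ in range(hidden_size)]
    return {gate: (matrix(), [0.0] * hidden_size) for gate in "fiHo"}


def lstm_step(params, h_prev, cell_prev, c_t):
    """One LSTM step following Eqs. (5)-(10); c_t is a character embedding."""
    x = h_prev + c_t  # list concatenation, i.e. [h_{t-1}, c_t]
    f = [sigmoid(z) for z in affine(*params["f"], x)]
    i = [sigmoid(z) for z in affine(*params["i"], x)]
    cand = [math.tanh(z) for z in affine(*params["H"], x)]
    o = [sigmoid(z) for z in affine(*params["o"], x)]
    cell = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, cell_prev, i, cand)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, cell)]
    return h, cell


def run_lstm(params, char_embeddings, hidden_size):
    """Read a sequence from zero initial states and return the final hidden state."""
    h, cell = [0.0] * hidden_size, [0.0] * hidden_size
    for c_t in char_embeddings:
        h, cell = lstm_step(params, h, cell, c_t)
    return h


def encode_value(char_embeddings, fwd, bwd, hidden_size):
    """Eq. (13): concatenate the final forward and backward hidden states."""
    h_f = run_lstm(fwd, char_embeddings, hidden_size)
    h_b = run_lstm(bwd, char_embeddings[::-1], hidden_size)
    return h_f + h_b


rng = random.Random(0)
input_size, hidden_size = 4, 3  # toy sizes for illustration
chars = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(5)]
fwd = init_params(input_size, hidden_size, rng)
bwd = init_params(input_size, hidden_size, rng)
v_i = encode_value(chars, fwd, bwd, hidden_size)
print(len(v_i))  # 2 * hidden_size
```

Because the two final hidden states are concatenated, the attribute value embedding has twice the hidden size of a single direction.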

Given an entity’s attribute values \( V = (v_{0},v_{1},...,v_{M}) \), we get the following attribute value embeddings:

$$\begin{aligned} {\textbf {V}} = ({\textbf{v}_{0},\textbf{v}_{1},...,\textbf{v}_{M}}). \end{aligned}$$
(14)

Fig. 3. The Framework of Entity Attribute Encoder

Attention for Attribute Types and Attribute Values. An entity’s attribute-based representation assembles all of the entity’s attribute information, but not all attribute information is equally important to the representation. To learn the importance of different attribute types and attribute values, we adopt an attention mechanism [22]. Given the entity attribute type embeddings \( {\textbf {A}} = (\textbf{a}_{0},\textbf{a}_{1},...,\textbf{a}_{M}) \), we calculate their attention weights:

$$\begin{aligned} \beta _{i} = \textrm{softmax}({\textbf {A}}^\mathrm{{T}}\mathrm{{\textbf {W}}}_{t}\textbf{a}_{i}), \end{aligned}$$
(15)

where \( \mathrm{{\textbf {W}}}_{t} \) is the weight matrix applied to \( \textbf{a}_{i} \). The attention weights are computed from the attribute type embeddings only. Since the weight of an attribute value should be consistent with that of its attribute type, the same weights \( \beta _{i} \) are used for both:

$$\begin{aligned} \textbf{e}_{type} = \sum ^{M}_{i=0} \beta _{i} \textbf{a}_{i}, \end{aligned}$$
(16)
$$\begin{aligned} \textbf{e}_{value} = \sum ^{M}_{i=0} \beta _{i} \textbf{v}_{i}. \end{aligned}$$
(17)

We concatenate \( \textbf{e}_{type} \) and \( \textbf{e}_{value} \) to get the entity’s attribute-based representation:

$$\begin{aligned} {\textbf {e}} _{\textbf {a}} = [\textbf{e}_{type}; \textbf{e}_{value}]. \end{aligned}$$
(18)
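The attention pooling of Eqs. (15) to (18) can be sketched as below. One point is an explicit assumption: Eq. (15) does not fully specify how \( {\textbf {A}}^\mathrm{T}\mathrm{{\textbf {W}}}_{t}\textbf{a}_{i} \) is reduced to a scalar score, so this sketch scores each type against the mean of all type embeddings, which is only one possible reading. All function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def attribute_representation(type_embs, value_embs, w_t):
    """Return (e_a, beta) for one entity.

    Assumption: the score of type i is mean(A)^T W_t a_i, one possible
    reading of Eq. (15); the exact scoring in the paper may differ.
    """
    n = len(type_embs)
    query = [sum(col) / n for col in zip(*type_embs)]
    scores = [dot(query, matvec(w_t, a_i)) for a_i in type_embs]
    beta = softmax(scores)
    e_type = weighted_sum(beta, type_embs)    # Eq. (16)
    e_value = weighted_sum(beta, value_embs)  # Eq. (17), same weights
    return e_type + e_value, beta             # Eq. (18): concatenation


# Toy inputs: two attribute types (dim 2) and their values (dim 3).
type_embs = [[1.0, 0.0], [0.0, 1.0]]
value_embs = [[0.2, 0.4, 0.6], [0.8, 0.6, 0.4]]
w_t = [[1.0, 0.0], [0.0, 1.0]]
e_a, beta = attribute_representation(type_embs, value_embs, w_t)
```

With these symmetric toy inputs both types receive the same weight, so the pooled vectors are plain averages.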

3.3 Objective Formalization

We use a margin-based ranking loss as our training objective, which is defined as follows:

$$\begin{aligned} \begin{aligned} L = \sum _{(h,r,t) \in T} \sum _{(h^{'},r^{'},t^{'})\in T^{'} } \textrm{max}(\gamma + S(h,r,t) - S(h^{'},r^{'},t^{'}),0), \end{aligned} \end{aligned}$$
(19)

where the margin \( \gamma \) is a manually chosen minimum distance between positive and negative examples. \( S(h,r,t) \) is the overall energy function, in which both the head and tail entities have two kinds of representations: structure-based and attribute-based. All of the above energy functions use the L1-norm; in our experiments, DERL with the L1-norm performs better than DERL with the L2-norm. \( T^{'} \) is the negative sample set of T, which we define as follows:

$$\begin{aligned} \begin{aligned} T^{'} = \{(h^{'},r,t)\mid h^{'}\in E\}\cup \{(h,r^{'},t)\mid r^{'}\in R\}\cup \{(h,r,t^{'})\mid t^{'}\in E\}, \end{aligned} \end{aligned}$$
(20)

which means that one entity or the relation in a triple is randomly replaced by another one. Because each entity has two representations, a corrupted triple that already exists in T is not treated as a negative sample, regardless of whether its entities use structure-based or attribute-based representations.
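A minimal sketch of the negative sampling in Eq. (20) and of one term of the loss in Eq. (19) is shown below. The function names, the uniform choice of which slot to corrupt, and the toy entities and relations are illustrative assumptions.

```python
import random


def corrupt(triple, entities, relations, known, rng):
    """Replace the head, relation, or tail at random; skip triples already in T."""
    h, r, t = triple
    while True:
        slot = rng.choice(("head", "relation", "tail"))
        if slot == "head":
            candidate = (rng.choice(entities), r, t)
        elif slot == "relation":
            candidate = (h, rng.choice(relations), t)
        else:
            candidate = (h, r, rng.choice(entities))
        if candidate not in known:  # existing triples are not negatives
            return candidate


def margin_loss(pos_energy, neg_energy, gamma=1.0):
    """One term of Eq. (19): max(gamma + S(pos) - S(neg), 0)."""
    return max(gamma + pos_energy - neg_energy, 0.0)


# Hypothetical toy graph, used only to exercise the functions.
entities = ["a", "b", "c"]
relations = ["r1", "r2"]
known = {("a", "r1", "b")}
rng = random.Random(0)
negative = corrupt(("a", "r1", "b"), entities, relations, known, rng)
```

The loss term is zero once the negative triple's energy exceeds the positive triple's energy by at least the margin.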

3.4 Optimization and Implementation Details

The DERL model can be defined as a parameter set \(\theta \) = (\( \mathrm{{\textbf {E}}}\), \( \mathrm{{\textbf {R}}}\), \( \mathrm{{\textbf {A}}}\), \( \mathrm{{\textbf {C}}}\), \( \mathrm{{\textbf {W}}}\), \( \mathrm{{\textbf {B}}}\)). \( \mathrm{{\textbf {E}}}\) stands for the embedding set of entities and \( \mathrm{{\textbf {R}}}\) stands for the embedding set of relations. They can be randomly initialized or pre-trained by previous translation-based methods such as TransH [18] and TransR [7]. \( \mathrm{{\textbf {A}}}\) stands for the embedding set of attribute types and \( \mathrm{{\textbf {C}}}\) stands for the embedding set of attribute value characters; both are initialized randomly. \( \mathrm{{\textbf {W}}}\) and \( \mathrm{{\textbf {B}}}\) represent the weight set and bias set of the Bi-LSTM and the attention mechanism in the EAE, which are also initialized randomly. We use mini-batch stochastic gradient descent (SGD) to optimize the model, with gradients computed by the chain rule, and we use a GPU to accelerate training.

4 Experiments

4.1 Datasets and Experiment Settings

Datasets. In our experiments, we use DWY100K [13] to evaluate our models on knowledge graph completion. For the zero-shot task, we build a new dataset, FB24K-New, based on FB24K [6] to simulate a zero-shot scenario. We select 12,789 entities in FB24K as In-KG entities and 5,179 entities in FB24K that are related to In-KG entities as Out-of-KG entities. We extract the structure triples that contain In-KG entities and Out-of-KG entities and add them to the test set. The test set is split into 4 types: ( I - I ), ( O - I ), ( I - O ), and ( O - O ), where I represents an In-KG entity and O represents an Out-of-KG entity. Details of DWY100K, FB24K, and FB24K-New are listed in Table 1 and Table 2.
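As a small illustration of the four test-set types, a triple can be labeled by checking whether its head and tail belong to the In-KG entity set. The function name and the toy entity identifiers are hypothetical.

```python
def triple_type(head, tail, in_kg_entities):
    """Label a test triple as 'I - I', 'O - I', 'I - O', or 'O - O'."""
    h = "I" if head in in_kg_entities else "O"
    t = "I" if tail in in_kg_entities else "O"
    return f"{h} - {t}"


in_kg_entities = {"e1", "e2"}  # toy In-KG entity set
labels = [
    triple_type("e1", "e2", in_kg_entities),
    triple_type("e9", "e1", in_kg_entities),
    triple_type("e1", "e9", in_kg_entities),
    triple_type("e8", "e9", in_kg_entities),
]
print(labels)  # ['I - I', 'O - I', 'I - O', 'O - O']
```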

Table 1. Statistics of DWY100K

Experiment Settings. In the DERL model, the margin \( \gamma \) is selected from {1.0, 2.0, 3.0} and the learning rate \( \lambda \) is selected from {0.0005, 0.0003, 0.001}. We set different learning rates for different representation type combinations. The optimal configuration of DERL is \( \lambda \) = 0.001 and \( \gamma \) = 1.0. We set the size of the character embedding and the attribute type embedding to 32, the attention weight size to 64, and the hidden layer size of the Bi-LSTMs to 16. The dimensions of the attribute-based representation, the structure-based representation, and the relation’s representation are all set to 64. We use two evaluation settings named “Raw” and “Filter”: “Filter” drops the repeated triples in the training stage (when we replace entities or relations, the reconstructed triple may be an existing triple), while “Raw” does not.

Table 2. Statistics of FB24K and FB24K-New

4.2 Knowledge Graph Completion Task

Due to the incompleteness and complexity of KGs, many KGs are missing triples, and a large number of potential relations between entities in the KGs are not discovered. Knowledge graph completion aims to learn appropriate entities’ and relations’ representations to discover the latent, correct triples. In addition, the knowledge graph completion task has been widely used to evaluate the quality of knowledge representations [24].

Evaluation Protocol. We report four prediction results based on our models. DERL(Structure) uses only structure-based representations for all entities when predicting the missing ones, while DERL(Attribute) uses only attribute-based representations. DERL(Union) is a simple joint method that uses the weighted concatenation of both entity representations. DERL(Ablation) uses only attribute information for training. We use three evaluation measures: Mean Rank, Hits@10, and Hits@1 [19, 20]. We select TransE [2], ComplEx [16], SimplE [5], RotatE [14], QuatRE [9], ParamE [3], TransRHS [24], DT-GCN [11], and HittER [4] as baselines, which are discussed in the Related Work.
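For reference, the three measures can be computed from the rank of the correct answer in each test case, where candidates are ranked by ascending energy. The sketch below uses hypothetical function names and toy numbers and does not reproduce the Raw and Filter settings.

```python
def rank_of(energies, target):
    """Rank of the target candidate; a lower energy means a better rank."""
    target_energy = energies[target]
    return 1 + sum(1 for e in energies.values() if e < target_energy)


def mean_rank(ranks):
    return sum(ranks) / len(ranks)


def hits_at(ranks, k):
    """Fraction of test cases whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy energies for three candidate entities of one test triple.
energies = {"a": 0.9, "b": 0.2, "c": 0.5}
example_rank = rank_of(energies, "c")

# Toy ranks for four test triples.
ranks = [1, 3, 12, 2]
print(mean_rank(ranks), hits_at(ranks, 10), hits_at(ranks, 1))  # 4.5 0.75 0.25
```

A lower Mean Rank and higher Hits@10 and Hits@1 indicate better predictions.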

Experimental Results. Table 3 and Table 4 present the entity and relation prediction results, respectively. Our analysis draws the following conclusions: (1) Most DERL models outperform all baselines on Mean Rank, Hits@10, and Hits@1. This indicates that entity representations enriched with multi-attribute information perform better in knowledge graph completion, which shows both that the EAE can effectively encode attribute information and that the DERL model can learn accurate entity representations. (2) DERL(Structure) performs well, although it is inferior to some of the other results. After the mutual promotion of the two kinds of information, its performance improves over some models (such as TransE, ComplEx, and DT-GCN). The results indicate that the two kinds of entity representations can be learned in, and share, the same vector space, and that the two kinds of information can be jointly trained to improve the overall RL effect. (3) The DERL models outperform the baselines on Mean Rank. Mean Rank reflects the overall quality of the knowledge representation. In this paper, we use entity multi-attribute information as semantic information to improve the semantic precision of entity representations, so the DERL models’ results are much better than the baselines’ results on Mean Rank. The case studies indicate that structure information alone may not reveal an entity’s details, whereas the rich latent information in entity multi-attribute information helps to describe the entity better.

Table 3. Entity Prediction Results in Knowledge Graph Completion Task
Table 4. Relation Prediction Results in Knowledge Graph Completion Task

4.3 Knowledge Graph Completion in Zero-Shot Task

The main purpose of the zero-shot task is to embed new entities into KGs and to use them. However, it is difficult to embed Out-of-KG entities directly and to efficiently find the latent relations between Out-of-KG and In-KG entities. In this paper, we use multi-attribute information to learn the representations of Out-of-KG entities, which addresses both the problem that Out-of-KG entities cannot be embedded directly and knowledge graph completion in the zero-shot setting.

Evaluation Protocol. We select DKRL [19], ConMask [12], and OWE [10] as our baselines, which are discussed in the Related Work. We use Hits@10 and Hits@1 [19] for entity and relation prediction and present results only for the “Filter” setting. We report four results: (O - I), (I - O), and (O - O), which are explained above, and Total, which is the combined result of these three test sets.

Experimental Results. Fig. 4 shows the experimental results on (O - I), (I - O), (O - O), and Total. We observe that: (1) In most cases, DERL significantly outperforms the other models on all four types of test sets. DERL achieves about 16.2% improvement in entity prediction and 5.7% improvement in relation prediction. This demonstrates that DERL can effectively incorporate the multi-attribute information of Out-of-KG entities into their representations to handle the zero-shot problem. (2) Entity descriptions and multi-attribute information are both textual information about the entity, yet the DERL model performs better in entity prediction, relation prediction, and Mean Rank. This shows the effectiveness of the DERL model in embedding text information and capturing entity semantics, and it illustrates the advantage of using entity attribute information to solve the zero-shot problem. (3) Fig. 4 also shows that some of DERL’s results are not ideal, possibly because the two entities belong to two different entity spaces. The connections between In-KG and Out-of-KG entities therefore still need to be strengthened.

Fig. 4. Entity and Relation Prediction Results in Zero-Shot Task

5 Related Work

5.1 Knowledge Graph Embedding

In recent years, knowledge graph embedding methods have achieved great success. TransE [2] follows the rule (\( {\textbf {h}} + {\textbf {r}} \approx {\textbf {t}} \)) to embed entities and relations. SimplE [5] builds on Canonical Polyadic (CP) decomposition and additionally uses the inverse of each relation. ParamE [3] extends current embedding methods by combining the nonlinear fitting ability of neural networks with translational properties. ComplEx [16] introduces complex-valued embeddings to capture symmetric and antisymmetric relations. RotatE [14] treats the relation as a rotation from the head entity to the tail entity. QuatRE [9] uses quaternion space with the Hamilton product to enhance correlations between head and tail entities. TransRHS [24] uses the relative positions between vectors and spheres to enhance the generalization between relations. HittER [4] proposes a Transformer-based RL model to improve the representations of entities and relations. DT-GCN [11] exploits the multiple data types of entities’ attribute values to improve the expressiveness of the entity’s representation.

5.2 Zero-Shot Problem

The zero-shot problem is a key issue in knowledge graph completion because of data sparsity (for both entities and relations). Currently, only a few models solve the zero-shot problem by using ancillary information. DKRL [19] uses entity description information to generate entities’ representations. ConMask [12] uses entities’ names and textual information to deal with zero-shot situations. OWE [10] combines entities’ names and description information in a transformation space to improve open-world link prediction. To address the zero-shot problem in KGs, we use ancillary information directly to learn attribute-based and structure-based representations jointly, thus enriching the sparse information in knowledge graphs.

6 Conclusion

In this paper, we propose a novel RL model (DERL) that utilizes both structure and multi-attribute information to improve the RL’s effect in KGs. To effectively encode entity multi-attribute information, we also design an attribute information encoder EAE. Experimental results on real-world datasets demonstrate that the DERL model consistently outperforms other baselines on the knowledge graph completion task and zero-shot task. [17]