1 Introduction

In this paper, we introduce the knowledge graph completion problem, which aims to alleviate the incompleteness and sparseness issues caused by collaborative or semi-automatic knowledge graph (KG) construction.

The major challenge of the knowledge graph completion task is how to predict missing or likely connections between entities based on existing facts in a given KG. KGs are a type of semantic network that depicts how concepts relate to one another and how they are interconnected [23]. Missing relation prediction in KGs has been the subject of significant research effort over the last decade [1, 4, 33]. Despite the effectiveness of previous studies, relation prediction still suffers from the following issues:

  1. (1)

    There is a problem of performing learning and inference on a large-scale KG with incomplete knowledge coverage and structural sparseness. The node degree, which is related to the betweenness centrality in the network, differs from entity to entity in a KG [5]. Compared with entities of large degree, entities of small degree are trained with less information, resulting in inaccurate entity representations [12].

  2. (2)

    Limited performance caused by noise paths. For example, to identify the relationship between a disease and a symptom, the path “disease\(\rightarrow\)alias\(\rightarrow\)disease\(\rightarrow\)symptom” is more effective than the path “disease\(\rightarrow\)suitable for eating\(\rightarrow\)food\(\leftarrow\)not suitable for eating\(\leftarrow\)symptom.” Most existing work assumes that KG paths have equal weights, but in reality, the weight distribution differs between relationships and paths [20]. It is also time-consuming to adopt traditional depth-first search algorithms to explore a large number of paths in the KG, especially considering that some of these paths are noise paths that cannot provide useful information [22].

In this paper, we present ARPP, an Attribute-embodied neural Relation Path Prediction model for relation prediction. Specifically, we first propose an entity/relation representation model that leverages structural and textual information to alleviate the limitation of structure sparseness. The learned structural and textual representations make the KG essentially computable and have proved helpful for relation prediction. Afterward, we perform information propagation along the path by mining the linear combination of neighboring nodes to effectively alleviate the impact of noisy and redundant paths on the relation prediction task. We validate the effectiveness of ARPP on a public dataset.

The main contributions of this paper are summarized as follows:

  1. (1)

    We expand the semantic structure of the KG by proposing a text-enhanced knowledge representation method considering both structural information and textual contexts in deep neural networks, thereby overcoming the barriers of KG sparseness.

  2. (2)

    To alleviate the impact of the noise paths and unbalanced relations on the relation prediction task, we consider the information propagation between path nodes to distinguish the subtle differences between paths and then emphasize those paths with rich information.

  3. (3)

    Experiments are conducted on a large benchmark dataset and indicate that our model is superior to the state-of-the-art methods on path prediction tasks.

We organize the rest of this article as follows. Section 2 reviews related work on knowledge graph completion methods. Section 3 describes our Attribute-embodied neural Relation Path Prediction model (ARPP). Section 4 provides experimental results and several model analyses. Finally, in Section 5, we conclude our work and point out several promising directions for future research.

2 Related work

KGs typically suffer from missing relations [34], which gives rise to the task of automatic knowledge base completion, i.e., predicting whether a given triple is valid or not.

Initial KG completion methods mainly drew upon rule-based reasoning approaches [2] and statistical methods [6]. These methods use a first-order relational learning algorithm [17, 18] to learn probabilistic rules and then instantiate the rules with specific entities to infer new relation instances from those already learned.

Afterward, distributed representation-driven KG completion methods [23] attempt to learn vector representations of the fact triples in a KG and conduct inference or prediction in a low-dimensional vector space. Current knowledge representation methods can be carried out by semi-structured data exploration and textual information extraction.

Semi-structured data exploration methods have been introduced to automatically create or augment KGs with facts extracted from Wikipedia, which has led to high accuracy and trustworthiness of facts in the automatically created KGs, such as YAGO and DBpedia. Although semi-structured texts are informative, they cover only a fraction of the information actually expressed in the articles and thus cannot meet the demand for completeness in real-world applications.

Textual information extraction methods, which attempt to extract facts from natural language text, can be grouped into four approaches, i.e., (i) compositional, (ii) translational, (iii) neural network-based, and (iv) graph-based models.

Several studies have attempted to use graph structure learning methods, e.g., the Path Ranking Algorithm (PRA) [10], and predictive models with multi-task learning [30], to reason over discrete entities and relationships in KGs. However, current structure models do not take full advantage of the contextual information contained in KGs to find more relevant and predictive paths.

Inspired by the success of word embeddings, translation-based embedding models attempt to interpret the relations among entities as translations from the head entity to the tail entity. Translation-based methods have become increasingly popular for their simplicity and effectiveness. Nevertheless, the heterogeneity and imbalance issues in KGs are ignored by previous translation models such as TransE [1], TransH [8], TransR [9], and RotatE [25].

Many studies have attempted to adopt neural network models to facilitate reasoning over relationships between two entities. Neelakantan et al. [15] construct compositional vector representations for the paths and adopt Recursive Neural Networks to perform inference in the vector space. Shen et al. [19] perform multi-step path reasoning in the learned embedding space through shared memory and a controller. Socher et al. [24] introduce a semantic relationship prediction model based on a matrix-vector recursive neural network (RNN) to learn the compositional vector embeddings of phrases and sentences. Nathani et al. [14] propose a novel feature embedding based on the graph attention network (GAT) to capture both entity and relation features in a multi-hop neighborhood of a given entity. Jagvaral et al. [7] combine bidirectional long short-term memory (BiLSTM) and convolutional neural network (CNN) with an attention mechanism to perform path reasoning for link prediction tasks.

Graph-based models learn more expressive embeddings due to their parameter efficiency and consideration of complex relations. According to the transfer hypothesis, a score function is designed to measure the probability of a relation path connected by multiple triples [34]. Despite obtaining high precision and recall, these methods struggle to explore knowledge from the entities and relationships with small degrees in a KG [31].

In this study, we explore the textual knowledge contained in the KG and show how to alleviate the impact of noise paths and unbalanced relations in such automatically extracted facts by considering information propagation to better capture the differences in information between paths.

3 Methodology

In this paper, we propose an Attribute-embodied neural Relation Path Prediction model (ARPP), a novel framework that leverages textual and structural information to tackle relation prediction (see Fig. 1). First, we learn the textual embeddings of the entity attributes via a Bidirectional Long Short-Term Memory network (Bi-LSTM), and we learn the structural embeddings of the entities and relations in the KG via TransE. Meanwhile, we project both entity embeddings and relation embeddings into the same representation space to generate multi-hop paths. Afterward, for path denoising, we perform information propagation through a Graph Attention Network (GAT), which captures the semantic correlation between neighboring path nodes and computes their linear combination using the normalized attention coefficients. Finally, we feed the path representation into a fully connected network to obtain the final relation prediction results. Next, we elaborate on the framework in detail.

Table 1 Notation list
Fig. 1

Overview of ARPP model. Dark green, light green, and blue matrices denote entity representation, relation representation, and path representation, respectively

3.1 Entity and relation representation

3.1.1 Structural embeddings of node and edge

Given a training set T of triples (h, r, t) composed of two entity nodes \(h, t \in E\) and a relationship edge \(r \in R\), we transform each KG node and edge into a real-valued representation using the TransE model trained on the Freebase 15K database. The structural embeddings of a node and an edge are obtained from the pre-trained embedding matrices and denoted as \(v_{es} \in R^{d_l}\) and \(v_{rs} \in R^{d_l}\), respectively, where \(d_l\) is the vector dimension set by TransE.
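For concreteness, the following is a minimal sketch (not the authors' implementation) of how TransE-style structural embeddings can be obtained; the entity/relation counts, embedding dimension, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransEEmbeddings(nn.Module):
    def __init__(self, num_entities, num_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)   # structural node embeddings v_es
        self.rel = nn.Embedding(num_relations, dim)  # structural edge embeddings v_rs
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def score(self, h, r, t):
        # TransE assumption: h + r ≈ t, scored by the L2 distance (lower = more plausible)
        return torch.norm(self.ent(h) + self.rel(r) - self.ent(t), p=2, dim=-1)

# Illustrative sizes only (FB15K-scale entity/relation counts assumed)
model = TransEEmbeddings(num_entities=14951, num_relations=1345, dim=50)
h, r, t = torch.tensor([0]), torch.tensor([3]), torch.tensor([7])
plausibility = model.score(h, r, t)
```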

3.1.2 Textual embeddings of node description

Each node has various attributes. Since the content of the node “description” attribute in KGs is much more complete than that of other attributes, we encode the node “description” attribute as the textual representation.

The textual embedding consists of two parts, i.e., a Bi-LSTM network and a self-attention mechanism. The self-attention mechanism provides a set of summation weight vectors that are dotted with the hidden LSTM states. The resulting weighted sum of hidden LSTM states is regarded as the embedding of the sequence.

“Description” Sequence Representation. We adopt GloVe word embeddings to generate the initial representation of the “description” sequence. The input is the word sequence \(W_{l} = [w_{1},w_{2}, \dots ,w_{m}]\), and the output is the word vector sequence \(V_{l} = [v_{1},v_{2}, \dots ,v_{m}]\), where m is the fixed length of the word sequence, \(v_{i} \in R^{d_w}\) is the word vector of \(d_{w}\) dimensions, and \(i \in [1,m]\) denotes the index in the word sequence.

We adopt a Bi-LSTM to capture the “description” information. The output \(hw_{t}\) at time step t is computed by combining the output of two sub-networks for the past contexts (\(V_{l} = [v_{1},v_{2}, \dots ,v_{m}]\)) and future contexts (\(V_{l_{reverse}} = [v_{m},v_{m-1}, \dots ,v_{1}]\)), which is given by Eq. (1):

$$\begin{aligned} hw_{t} = hf_{t} + hb_{t}. \end{aligned}$$
(1)

The vector sequence \(Hw_{l} = [hw_{1}, hw_{2}, \dots , hw_{m}]\) refers to the output of the Bi-LSTM model at all time steps.
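As an illustration of Eq. (1), the sketch below encodes a “description” word-vector sequence with a Bi-LSTM and sums the forward and backward hidden states. The dimensions (\(d_w = 50\), 230 hidden units) are taken from the experimental setup; the sequence length and variable names are our own assumptions.

```python
import torch
import torch.nn as nn

d_w, hidden, m = 50, 230, 20          # word dim and LSTM size from Sect. 4.1; m is an example length
bilstm = nn.LSTM(input_size=d_w, hidden_size=hidden,
                 bidirectional=True, batch_first=True)

V_l = torch.randn(1, m, d_w)          # word vectors v_1..v_m for one "description"
out, _ = bilstm(V_l)                  # shape (1, m, 2 * hidden)
hf, hb = out[..., :hidden], out[..., hidden:]   # forward / backward sub-network outputs
Hw = hf + hb                          # Eq. (1): hw_t = hf_t + hb_t, shape (1, m, hidden)
```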

Sequence Reweighted by Attention. We apply a self-attention mechanism [13] to compute attention values and weight each word in the sequence according to its interaction with the others. The input of the attention mechanism is the set of hidden LSTM states \(H_w\), and the output is a vector of weights \(\alpha _{w} \in R^{m}\), which is given by Eqs. (2) and (3):

$$\begin{aligned} M_{w}= & {} tanh(W_{sw}H_{w}), \end{aligned}$$
(2)
$$\begin{aligned} \alpha _{w}= & {} softmax(w_{w}^{T}M_{w}), \end{aligned}$$
(3)

where \(M_{w} \in R^{d_l \times m}\) is the nonlinear mapping of the hidden states, and \(W_{sw} \in R^{d_l \times d_l}\) and \(w_{w} \in R^{d_l}\) are projection parameters. Accordingly, the textual embedding of the entity description \(v_{et}\) can be calculated by Eq. (4):

$$\begin{aligned} v_{et} = H_{w}\alpha _{w}^{T}. \end{aligned}$$
(4)
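The hedged sketch below mirrors Eqs. (2)–(4): a learned projection and softmax produce per-word weights \(\alpha_w\), which pool the hidden states into the textual embedding \(v_{et}\). Tensor orientation (rows as time steps rather than the paper's column-vector notation) and parameter names are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_l, m = 230, 20                       # hidden size and sequence length (assumed)
W_sw = nn.Linear(d_l, d_l, bias=False) # W_sw in Eq. (2)
w_w = nn.Linear(d_l, 1, bias=False)    # w_w^T in Eq. (3)

Hw = torch.randn(m, d_l)               # hidden states hw_1..hw_m from the Bi-LSTM
M_w = torch.tanh(W_sw(Hw))             # Eq. (2)
alpha_w = F.softmax(w_w(M_w).squeeze(-1), dim=0)   # Eq. (3): one weight per word
v_et = alpha_w @ Hw                    # Eq. (4): weighted sum of hidden states
```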

3.1.3 Unifying encoding

Given the structural embeddings of the entity and relation as well as the textual embeddings of the entity description, we attempt to learn the node and edge representations and map them into the same representation space.

KG Node Representation Using a Self-Attention Mechanism. The projection matrices \(M_{s}\) and \(M_{t}\) are adopted to project the node structural embeddings \(v_{es}\) and the “description” textual embeddings \(v_{et}\), respectively, into the same representation space, as given by Eq. (5):

$$\begin{aligned} v_{etm} = M_{t}v_{et}, \qquad v_{esm} = M_{s}v_{es}. \end{aligned}$$
(5)

We adjust the attention weights of the node structural embeddings and textual embeddings; then, the node representation \(v_{e} \in R^{d_l}\) is computed by merging both embeddings, as given by Eqs. (6)–(8):

$$\begin{aligned} M_{e}= & {} \tanh (W_{e}[v_{etm}, v_{esm}]),\end{aligned}$$
(6)
$$\begin{aligned} \alpha _{e}= & {} \text{softmax}(w_{e}^{T}M_{e}), \end{aligned}$$
(7)
$$\begin{aligned} v_{e}= & {} v_{etm}\alpha _{e}^{1} + v_{esm}\alpha _{e}^{2}, \end{aligned}$$
(8)

where \(W_{e} \in R^{d_l \times d_l}\) and \(w_{e} \in R^{d_l}\) are projection parameters to be learned, \(M_{e} \in R^{d_l \times 2}\) is the nonlinear mapping of the two projected embeddings, and \(\alpha _{e}^{1}\) and \(\alpha _{e}^{2}\) are the attention weights for the textual embeddings and structural embeddings, respectively.
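A minimal sketch of the unifying encoder in Eqs. (5)–(8) follows: both embeddings are projected into a common space and merged with attention weights. The input dimensions (a 230-dimensional textual embedding and a 50-dimensional structural embedding) and variable names are assumptions based on the experimental setup, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_text, d_l = 230, 50                     # assumed input dims; d_l is the shared space
M_t = nn.Linear(d_text, d_l, bias=False)  # projects the textual embedding v_et
M_s = nn.Linear(d_l, d_l, bias=False)     # projects the structural embedding v_es
W_e = nn.Linear(d_l, d_l, bias=False)
w_e = nn.Linear(d_l, 1, bias=False)

v_et, v_es = torch.randn(d_text), torch.randn(d_l)
v_etm, v_esm = M_t(v_et), M_s(v_es)              # Eq. (5)
pair = torch.stack([v_etm, v_esm])               # the two views, shape (2, d_l)
M_e = torch.tanh(W_e(pair))                      # Eq. (6)
alpha_e = F.softmax(w_e(M_e).squeeze(-1), dim=0) # Eq. (7): [alpha_e^1, alpha_e^2]
v_e = alpha_e[0] * v_etm + alpha_e[1] * v_esm    # Eq. (8): fused node representation
```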

KG Edge Representation. The edge representation is given by \(v_{r} = M_{s} v_{rs}\). The edge representation lies in the same vector space as the node representation because the structural embeddings of both are produced by TransE.

3.2 Path representation

The path representation \(g_i=\{u_{1},u_{2}, \dots , u_{l}\}\) is a multi-hop sequence in which node and edge embeddings alternate, where \(u_{2i}\) and \(u_{2i+1}\) are the edge embedding \(v_{r}\) and the node embedding \(v_{e}\), respectively.
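As a small illustration (under our own assumptions about variable names and path orientation), the path sequence can be assembled by interleaving node and edge embeddings along a multi-hop path:

```python
import torch

def build_path_sequence(node_embs, edge_embs):
    """Interleave node embeddings v_e and edge embeddings v_r along a multi-hop path."""
    seq = []
    for v_e, v_r in zip(node_embs, edge_embs):   # one fewer edge than nodes
        seq.extend([v_e, v_r])
    seq.append(node_embs[-1])                    # trailing tail node
    return torch.stack(seq)                      # shape (l, d_l), the sequence u_1..u_l

# e.g., a path with three nodes connected by two relations (all embeddings in the shared d_l space)
nodes = [torch.randn(50) for _ in range(3)]
edges = [torch.randn(50) for _ in range(2)]
g_i = build_path_sequence(nodes, edges)          # shape (5, 50)
```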

Information Propagation. We adopt a variant of Graph Attention Network (GAT) [29] to propagate information among path nodes. The hidden states of input nodes are denoted as \(g_i \in {\mathcal {R}}^{2d \times 1}\), \(i \in \{1,\ldots , (m+n)\}\). A Multi-Layer Perceptron (MLP) is used to evaluate attention coefficients between a node i and its neighboring nodes j (\(j \in {\mathcal {N}}_i\)) at layer t as given by Eq. (9):

$$\begin{aligned} p_{ij}^{(t)}=W_a^{(t)}({\mathrm{ReLU}}(W_b^{(t)}[g_i^{(t-1)} \oplus g_j^{(t-1)} \oplus e_{ij}])), \end{aligned}$$
(9)

where \(W_a^t\) and \(W_b^t\) are trainable parameters of the t-th layer, \(\oplus\) indicates the concatenation operation, and \({\mathcal {N}}_i\) indicates the set of neighbors of node i. Afterwards, we apply the softmax function to normalize the coefficients as shown in Eq. (10):

$$\begin{aligned} \alpha _{ij}^{(t)}={\mathrm{softmax}}_j(p_{ij}^{(t)})=\frac{\mathrm{exp}(p_{ij}^{(t)})}{\sum _{k \in {\mathcal {N}}_i}{\mathrm{exp}(p_{ik}^{(t)})}}. \end{aligned}$$
(10)

Finally, a linear combination of the neighboring nodes is evaluated based on the normalized attention coefficients. The updated vector for node i at the t-th layer is formulated as in Eq. (11):

$$\begin{aligned} g_i^{(t)}=\sum _{j \in {\mathcal {N}}_i}{\alpha _{ij}^{(t)} g_j^{(t-1)}}. \end{aligned}$$
(11)

Inspired by Transformer [28], we further apply a position-wise feed-forward (FFN) layer after each GAT layer. The path representation after propagation is represented as \(g_i = g_i^{(2)}\).
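The sketch below is one plausible rendering of Eqs. (9)–(11) followed by the position-wise FFN; the MLP structure, ReLU placement, and tensor layout are assumptions rather than the authors' exact implementation. Stacking two such layers yields \(g_i^{(2)}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathGATLayer(nn.Module):
    """One propagation layer over path nodes, sketching Eqs. (9)-(11) plus the FFN."""
    def __init__(self, dim, edge_dim):
        super().__init__()
        self.W_b = nn.Linear(2 * dim + edge_dim, dim)  # W_b in Eq. (9)
        self.W_a = nn.Linear(dim, 1)                   # W_a in Eq. (9)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, g, e, neighbors):
        # g: (n, dim) node states; e: (n, n, edge_dim) edge features e_ij;
        # neighbors[i]: list of indices j in N_i
        new_g = torch.zeros_like(g)
        for i, nbrs in enumerate(neighbors):
            feats = torch.stack([torch.cat([g[i], g[j], e[i, j]]) for j in nbrs])
            p = self.W_a(F.relu(self.W_b(feats))).squeeze(-1)      # Eq. (9)
            alpha = F.softmax(p, dim=0)                            # Eq. (10)
            new_g[i] = (alpha.unsqueeze(-1) * g[nbrs]).sum(dim=0)  # Eq. (11)
        return self.ffn(new_g)   # position-wise feed-forward layer after the GAT layer

# Two stacked layers give g_i^{(2)} for a small three-node path:
layer1, layer2 = PathGATLayer(50, 50), PathGATLayer(50, 50)
g, e = torch.randn(3, 50), torch.randn(3, 3, 50)
neighbors = [[1], [0, 2], [1]]
g = layer2(layer1(g, e, neighbors), e, neighbors)
```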

3.3 Relation prediction

The path representation \(g_i\) obtained after information propagation is fed into a fully connected network to obtain the final relation prediction results, as given by Eq. (12):

$$\begin{aligned} \tilde{y_i}={\mathrm{softmax}}(W_c g_i+b_c), \end{aligned}$$
(12)

where \(W_c\) and \(b_c\) are trainable parameters of the predictor. The whole model is trained end-to-end by minimizing the cross-entropy loss, which is given by Eq. (13):

$$\begin{aligned} L={\mathrm{CrossEntropy}}(y, {\tilde{y}}). \end{aligned}$$
(13)
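Eqs. (12) and (13) correspond to a linear layer followed by softmax cross-entropy; a hedged sketch is given below, where the relation count and batch size are assumptions (PyTorch's loss folds the softmax of Eq. (12) into the cross-entropy of Eq. (13)).

```python
import torch
import torch.nn as nn

num_relations, dim = 237, 50                 # assumed relation count and path dimension
classifier = nn.Linear(dim, num_relations)   # W_c, b_c in Eq. (12)
criterion = nn.CrossEntropyLoss()            # softmax + cross-entropy, Eqs. (12)-(13)

g = torch.randn(8, dim)                      # a batch of path representations g_i
y = torch.randint(0, num_relations, (8,))    # gold relation labels
loss = criterion(classifier(g), y)
loss.backward()                              # trained end-to-end with the rest of the model
```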

4 Results and discussion

4.1 Experimental setup

We test our algorithm on a subset of Freebase 15K-237, whose paths were labeled in 2016 by Neelakantan’s group [15]. The dataset consists of triples \((e_1, r, e_2)\) and the paths connecting the entity pairs \((e_1, e_2)\) in the knowledge graph. Table 2 provides statistics of the dataset used.

Table 2 Statistics of our dataset

We use the mean average precision (MAP) and mean rank (MR) as evaluation metrics to analyze the incorporation of structural and textual information and the effectiveness of path weight re-distribution, respectively.
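For clarity, the standard definitions of these two metrics can be sketched as follows; the helper names are our own.

```python
import numpy as np

def average_precision(ranked_labels):
    """AP for one query: ranked_labels[k] is 1 if the k-th ranked prediction is correct."""
    hits, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(queries):
    """MAP: mean of per-query average precisions (higher is better)."""
    return float(np.mean([average_precision(q) for q in queries]))

def mean_rank(ranks):
    """MR: average rank position of the correct answer (lower is better)."""
    return float(np.mean(ranks))
```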

For the word embeddings, we set the dimension to 50 and pre-train them with GloVe. We feed the initial word representations into a Bi-LSTM with 230 hidden units. The batch size is set to 32. The dropout rate and learning rate are set to 0.5 and 0.001, respectively. All other model parameters are randomly initialized from [−0.1, 0.1]. The parameters are regularized with an \(L_2\) regularization strength of 0.0001. We use the depth-first search algorithm for path searching, setting the number of paths to 100 and the minimum and maximum path lengths to 2 and 6, respectively. The threshold for path searching is set to 0.5; the score indicates the informativeness of a path pattern, and the higher the score, the more important the path. For the GRU model used in the ablation study, we compute the cross-entropy as the loss function and use the ReLU activation function and the AdaGrad optimizer. Dropout is adopted in the output layer to prevent overfitting.
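For reference, the hyperparameters stated above can be collected in a single configuration object; the field names below are our own, and the values simply restate the setup described in this subsection.

```python
from dataclasses import dataclass

@dataclass
class ARPPConfig:
    word_dim: int = 50          # GloVe word embedding size
    lstm_hidden: int = 230      # Bi-LSTM hidden units
    batch_size: int = 32
    dropout: float = 0.5
    learning_rate: float = 0.001
    init_range: float = 0.1     # uniform initialization in [-0.1, 0.1]
    l2_strength: float = 1e-4
    num_paths: int = 100        # paths explored per entity pair via DFS
    min_path_len: int = 2
    max_path_len: int = 6
    path_threshold: float = 0.5
```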

For the baseline models, we use the implementations provided in the original papers.

4.2 Experimental results

We compare the performance of ARPP with state-of-the-art methods, including supervised relation extraction methods, translation-based methods, path-based methods, structure-based methods, and hybrid methods that utilize both structural and textual information. Table 3 summarizes these state-of-the-art methods in terms of their aims and approaches.

Table 3 Overall comparison between the state-of-the-art methods
Table 4 Result on FB15K Dataset (with ablation study)

To verify the effectiveness of the various components of our model, we conduct an ablation test by removing the textual embeddings of “description” attributes (w/o textual description) and the attention mechanisms. The ablation of the attention mechanisms includes discarding the attention over the description sequence representation (w/o sequence attention) and over the entity representation (w/o entity attention). In addition, we process the path representation through a GRU rather than through path information propagation, denoted as (w/o path information propagation). The results of link prediction and entity prediction on the FB15K dataset are reported in Table 4. We observe the following:

  1. (1)

    ARPP is superior to the state-of-the-art results on FB15K. The reason may be that, compared with a model that only uses structural information or textual information in the path, the path information obtained through information propagation carries richer semantic knowledge. This demonstrates that the simultaneous consideration of structural information and textual descriptions can effectively alleviate the training ineffectiveness on entities of low degree. In addition, unlike knowledge representation models that use a single triple, the path representation takes advantage of all the triple information in the path, which greatly improves the relation prediction task.

  2. (2)

    Our proposed model outperforms the TAPR model [21], which also incorporates structural information and textual descriptions for knowledge graph completion but does not conduct path denoising with information propagation, indicating the advantage of combining the structural, textual, and path information included in the knowledge graph.

  3. (3)

    Generally, all factors contribute, resulting in a larger performance boost for relation prediction. The basic ARPP model (w/o textual description) cannot perform as well as the full ARPP model, which shows that, for low-frequency entities, the description provides supplementary information for the embedding; thus, the issue of sparsity in a knowledge base can be addressed properly. As expected, the adopted attention mechanisms significantly reduce noise and enhance the representation learning of description sequences and entities.

4.3 Model analysis

4.3.1 Utilization of attributes

To further analyze the performance of our model with respect to the entity attribute, we report the MAP results of ARPP and its basic model (w/o textual description) in Fig. 2. We can observe that, in the MAP accuracy interval [0.6, 0.9], ARPP achieves a significant improvement, indicating that textual information can largely alleviate the issue of sparsity when the structural information provides certain but insufficient support.

Fig. 2

Effect of textual description. The blue and red lines represent the MAP results for ARPP and its basic model (w/o textual description), respectively. The higher the MAP value is, the more accurate the method

4.3.2 Utilization of sequence attention

Due to space limitations, we randomly choose one relation, “_soccer_football_player_position_s,” from Freebase 15K to analyze the attention mechanism over the sequence representation.

Substantial amounts of positional information can aid in identifying the football player’s position, including forward, midfielder, defender, and goalkeeper. We classify the positive samples into two groups, i.e., the “position group,” involving the entities with positional information, and the “w/o position group,” containing the remaining entities.

Table 5 shows the accuracy and mean rank of different models for relation prediction. The accuracy increases when the mean rank decreases. Accordingly, we can observe that ARPP demonstrates robust superiority over its basic models, which are developed without sequence attention or positional information.

Table 5 Comparison of models w/ and w/o sequence attention
Fig. 3

An example of the sequence attention visualization when predicting the relation between “Marko Arnautović” and “Forward.” The color depth indicates the importance degree of the words, darker representing greater importance

To predict the relation between “Marko Arnautović” and “Forward,” we visualize the attention scores assigned by ARPP in Fig. 3. The color depth indicates the importance degree of the words, darker representing greater importance. As one may expect, ARPP pays significantly more attention to those words that are contextually related to the pairwise entity, such as “as a forward for,” “Austrian footballer,” and “in race,” which verifies the effectiveness of path weight re-distribution and denoising.

4.3.3 Utilization of entity attention

The adopted attention mechanism can improve the entity representation by merging the structural information and attribute information according to their distributions. The abscissa in Fig. 4 is the proportion of entities with a degree less than or equal to 10 in the dataset, and the ordinate is the MAP.

Fig. 4

Effect of entity attention. The abscissa is the proportion of entities with a degree less than or equal to 10 in the dataset, and the ordinate is the MAP. The entity attention can increase the entity degree and facilitate better entity training when there are a large number of untrained entities in the dataset

We can see that when the entity training is sufficient, the entity attention cannot significantly improve the MAP. The increase in the MAP is more obvious when there are a large number of untrained entities in the dataset. The experimental results indicate that the entity attention can increase the entity degree and facilitate better entity training.

4.4 Knowledge graph completion

The current KGs contain a large number of potential relations that have not been explored. In this section, we elaborate on several actual prediction results and show examples that highlight the strengths of the ARPP model.

Due to space limitations, we randomly choose one relation, i.e., the “disease–symptom” relationship, to present the performance of ARPP on the KG completion task (see Table 6). Considering the “head” entities (diseases) as starting points, we collect the “head–tail” entity pairs (disease–symptom) that satisfy the path pattern. For different paths within the “disease–symptom” entity pairs, ARPP utilizes the entity attributes to further verify the authenticity of the specific relationship in an instance, and it selects paths with scores above the threshold (0.70) as candidate paths.

Table 6 shows that ARPP can discover certain diseases that are not directly connected to a certain symptom through an “alias” or “complication” relationship. For example, a birth defect, also known as a congenital disorder, can result in “disabilities that may be physical, intellectual, or developmental.” Accordingly, we can associate the disease “birth defect” with the symptom “physical disability.”

Table 6 Sampled “disease–symptom” path patterns discovered by ARPP

Another interesting result is the prediction of relations between previously unrelated diseases and symptoms through an intermediate entity. For example, one of the lesion sites of the disease “stroke” is the “head,” whose related symptom may be “facial paralysis.” On the other hand, two different diseases, “lobular pneumonia” and “acute lung abscess,” that occur in the same body part, the “lung,” may have the common symptom “empyema.” We also report noise instances in Table 6. The score of the “stroke\(\rightarrow\)head\(\leftarrow\)eczema” instance is low because the textual information of the “stroke” and “eczema” entities is quite different, which indicates the benefits of incorporating textual information into path prediction and instance ranking.

Note that ARPP can identify noise instances whose detection requires medical knowledge. Take the “bronchial asthma\(\rightarrow\)lung\(\leftarrow\)lobular pneumonia\(\rightarrow\)shortness of breath” and “bronchial asthma\(\rightarrow\)lung\(\leftarrow\)lobular pneumonia\(\rightarrow\)chills” instances as examples. “Shortness of breath” is a symptom of both diseases, while “chills” is a symptom specific to lobular pneumonia. This shows that ARPP is able to identify the latter case as a noisy case.

5 Conclusions and future work

In this paper, we study the task of knowledge graph completion, which has recently attracted increased attention due to its broad applications in natural language processing and natural language generation. We present a novel and general relation prediction framework (ARPP) to predict unseen relationships between entities through reasoning inside a given KG, which greatly expands the KG even without external textual resources. The experimental results on a public dataset show that our ARPP method significantly outperforms the state-of-the-art methods on path prediction tasks, confirming that ARPP is an interpretable model that exploits the differences between paths and demonstrates the impact of each path on relation prediction tasks.

In the future, we will further alleviate the data sparsity problem by exploring hidden relations beyond the context and other auxiliary data via BERT and Transformer models. In addition, a multi-view attention mechanism will be designed to interactively learn attention from different graph components to improve overall representation learning.